Denis Bakhvalov

Understanding performance events skid.

29 Aug 2018

In my previous article I promised to write about retired instructions, reference cycles and other stuff. But one of my readers asked me why he sometimes sees an instruction tagged by an event which was not caused by this instruction? So, today I want to discuss this very important concept, which I think is crucial to understand.

Performance analysis is a hard thing, no question about it. But it becomes even harder when the profile data that you are looking at misleads you. Imagine you have application with big amount of L1D-cache misses and the hot assembly code that look like this:

; load1 
; load2
; load3 <-- here profile shows you lots of L1D-cache misses

This is great, but in reality load1 is the instruction that causes L1D-cache misses. Ugghf!

The skid is defined as the distance between the IP(s) that caused the issue to the IP(s) where the event is tagged

Example of skid

To demonstrate the thing I wrote a small test in assembly that is available on my github. Inside it I have a simple loop:

; there will be 100'000'000 iterations
.loop:

; cache_line of 8-byte NOPs
; cache_line of 8-byte NOPs
; cache_line of 8-byte NOPs
; cache_line of 8-byte NOPs

dec rdi
jnz .loop

For the purpose of the experiment we don’t need to have real assembly instructions. We will emulate the workload with NOPs. In my experiment we will sample on the event branches (BR_INST_RETIRED.ALL_BRANCHES) and expect all such events to correspond to jnz .loop instruction. I made all experiments on Haswell CPU:

$ perf stat -e cpu/event=0xc4,umask=0x4,name=BR_INST_RETIRED.ALL_BRANCHES/ ./a.out
 Performance counter stats for './a.out':
         100338645      BR_INST_RETIRED.ALL_BRANCHES                                   
       0,301300877 seconds time elapsed

Total number of branches retired is close to 100'000'000 (1 branch per iteration). Now let’s sample on it:

$ perf record -e cpu/event=0xc4,umask=0x4,name=BR_INST_RETIRED.ALL_BRANCHES/ ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.062 MB perf.data (1176 samples) ]
$ perf annotate --stdio -M intel main.loop
 Percent |      Source code & Disassembly of a.out for BR_INST_RETIRED.ALL_BRANCHES (1170 samples)
--------------------------------------------------------------------------------------------------
         :
         :
         :
         :      Disassembly of section .text:
         :
         :      000000000040057e <main.loop>:
         :      main.loop():
    0.00 :        40057e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400586:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40058e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400596:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40059e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005a6:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005ae:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005b6:       nop    DWORD PTR [rax+rax*1+0x0]
  100.00 :        4005be:       nop    DWORD PTR [rax+rax*1+0x0]	<-- OOOPS, we have skid of ~10 instructions!
    0.00 :        4005c6:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005ce:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005d6:       nop    DWORD PTR [rax+rax*1+0x0]	
    0.00 :        4005de:       nop    DWORD PTR [rax+rax*1+0x0]	<-- This insruction is tagged on Ivy Bridge CPU.
    0.00 :        4005e6:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005ee:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005f6:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005fe:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400606:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40060e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400616:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40061e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400626:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40062e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400636:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40063e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400646:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40064e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400656:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40065e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400666:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40066e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400676:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40067e:       dec    rdi
    0.00 :        400681:       jne    40057e <main.loop>		<-- This instruction should be tagged.
    0.00 :        400687:       add    rsp,0x4
    0.00 :        40068b:       pop    rdi
    0.00 :        40068c:       pop    rax
    0.00 :        40068d:       ret    
    0.00 :        40068e:       ud2 

Notice, that all collected samples correspond to the wrong instruction! To understand why that happens you might want to check one of my previous articles: Advanced profiling topics PEBS and LBR. But if you want to know the short answer, there is a delay between performance monitoring interrupt issued and capture of instruction pointer (IP).

What we can do about it?

Skid makes it difficult to discover the instruction which is actually causing the performance issue. But fortunately, there is a special mechanism called PEBS (Precise Event-Based Sampling) which is dedicated to solve the problem. More on this topic I wrote in already mentioned blog post. Here is how the things changed when using it (notice pp suffix in the event declaration):

$ perf record -e cpu/event=0xc4,umask=0x4,name=BR_INST_RETIRED.ALL_BRANCHES/pp ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.064 MB perf.data (1245 samples) ]
$ perf annotate --stdio -M intel main.loop
 Percent |      Source code & Disassembly of a.out for BR_INST_RETIRED.ALL_BRANCHES (1237 samples)
--------------------------------------------------------------------------------------------------
         :
         :
         :
         :      Disassembly of section .text:
         :
         :      000000000040057e <main.loop>:
         :      main.loop():
    0.00 :        40057e:       nop    DWORD PTR [rax+rax*1+0x0] <-- This instruction is tagged for SNB families.
    0.00 :        400586:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40058e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400596:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40059e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005a6:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005ae:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005b6:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005be:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005c6:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005ce:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005d6:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005de:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005e6:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005ee:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005f6:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        4005fe:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400606:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40060e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400616:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40061e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400626:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40062e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400636:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40063e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400646:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40064e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400656:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40065e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400666:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40066e:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        400676:       nop    DWORD PTR [rax+rax*1+0x0]
    0.00 :        40067e:       dec    rdi
  100.00 :        400681:       jne    40057e <main.loop>	 <-- Indeed, this is the instruction, we were looking for.
    0.00 :        400687:       add    rsp,0x4
    0.00 :        40068b:       pop    rdi
    0.00 :        40068c:       pop    rax
    0.00 :        40068d:       ret    
    0.00 :        40068e:       ud2

Now we have clear picture of the event that we sample on and the instruction that caused it. However, keep in mind, that only since Haswell precise events tag the same instruction, but on SandyBridge processor family they skid to the next IP (see the listing above).

Precise events is a very important tool for top-down analysis (look at Appendix B.1). It’s idea that you first identify the source of the performance problems. You basically do this by counting a lot of events at once (not necessary precise). You understand what is causing problems and than you locate the exact place in the code using precise event. For more details and examples take a look at TMA metrics.


comments powered by Disqus