Denis Bakhvalov

Performance analysis vocabulary.

04 Sep 2018

For a beginner, it can be very hard to look at a profile generated by a tool like perf or Intel VTune Amplifier. It immediately throws lots of strange terms at you, which you might not know. Whenever I’m presenting to an audience or speaking with somebody who is not deeply involved in performance analysis, they ask the same basic questions, like: “What is instruction retired?” or “What is reference cycles?”. So, I decided to write an article describing some of the non-obvious terms connected with performance analysis.

What is a retired instruction?

Modern processors execute much more instructions that the program flow needs. This is called a speculative execution. Instructions that were “proven” as indeed needed by the program execution flow are “retired”.

Source: https://software.intel.com/en-us/vtune-amplifier-help-instructions-retired-event.

So, an instruction processed by the CPU can be executed but not necessarily retired. And a retired instruction has usually been executed, except when it does not require an execution unit. One example is mov elimination (see my post What optimizations you can expect from CPU?). Taking this into account, we can usually expect the number of executed instructions to be higher than the number of retired instructions.

There is a fixed performance counter (PMC) that collects this metric. See one of my previous articles for more information on this topic: PMU counters and profiling basics.

To collect this basic metric, you can use perf:

$ perf stat -e instructions ./a.out
or just simply
$ perf stat ./a.out

What’s a UOP (micro-op)?

From Agner Fog’s microarchitecture manual, chapter 2.1 “Instructions are split into µops”:

The microprocessors with out-of-order execution are translating all instructions into microoperations - abbreviated µops or uops. A simple instruction such as ADD EAX,EBX generates only one µop, while an instruction like ADD EAX,[MEM1] may generate two: one for reading from memory into a temporary (unnamed) register, and one for adding the contents of the temporary register to EAX. The instruction ADD [MEM1],EAX may generate three µops: one for reading from memory, one for adding, and one for writing the result back to memory. The advantage of this is that the µops can be executed out of order.

In the chapter about micro-ops Agner has more examples, so you may want to read them as well.

Modern Intel architectures are capable of collecting the number of issued, executed and retired uops. The difference between an executed and a retired uop is mostly the same as for instructions.

$ perf stat -e cpu/event=0xe,umask=0x1,name=UOPS_ISSUED.ANY/,cpu/event=0xb1,umask=0x1,name=UOPS_EXECUTED.THREAD/,cpu/event=0xc2,umask=0x1,name=UOPS_RETIRED.ALL/ ls
 Performance counter stats for 'ls':
           2856278      UOPS_ISSUED.ANY                                             
           2720241      UOPS_EXECUTED.THREAD                                        
           2557884      UOPS_RETIRED.ALL
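
To put the three counts above side by side, here is a small sketch that computes the gaps between them. The numbers are copied from the perf output; the variable names are only illustrative. Treat the result as a rough indication: these events are not guaranteed to count in the same domain (for example, fused versus unfused uops), so the differences are not an exact measure of discarded speculative work.

```python
# Counts copied from the perf stat output above.
uops_issued = 2_856_278
uops_executed = 2_720_241
uops_retired = 2_557_884

# Uops that were issued but never showed up as retired.
not_retired = uops_issued - uops_retired
not_retired_share = not_retired / uops_issued

# Uops that were executed but never showed up as retired.
executed_not_retired = uops_executed - uops_retired

print(f"issued but not retired:   {not_retired} ({not_retired_share:.1%} of issued)")
print(f"executed but not retired: {executed_not_retired}")
```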

Uops can also be MacroFused and MicroFused.

What is a reference cycle?

The majority of modern CPUs, including Intel’s and AMD’s, don’t operate at a fixed frequency. Instead, they use dynamic frequency scaling. In Intel’s CPUs this technology is called Turbo Boost; in AMD’s processors it’s called Turbo Core. There is a nice explanation of the term “reference cycles” in this Stack Overflow thread:

Having a snippet A to run in 100 core clocks and a snippet B in 200 core clocks means that B is slower in general (it takes double the work), but not necessarily that B took more time than A since the units are different. That’s where the reference clock comes into play - it is uniform. If snippet A runs in 100 ref clocks and snippet B runs in 200 ref clocks then B really took more time than A.

$ perf stat -e cycles,ref-cycles ./a.out
 Performance counter stats for './a.out':
       43340884632      cycles		# 3.97 GHz
       37028245322      ref-cycles	# 3.39 GHz
      10,899462364 seconds time elapsed

I did this experiment on a Skylake i7-6000 processor, whose base frequency is 3.4 GHz. So, the ref-cycles event counts cycles as if there were no frequency scaling. This also matches the clock multiplier for that processor, which you can find in the specs (it’s equal to 34). Usually the system clock has a frequency of 100 MHz, and if we multiply it by the clock multiplier we get the base frequency of the processor. You also might be interested to read about Overclocking.

One interesting experiment, which I suggest you try on your own, looks like this: open 3 terminals and run the corresponding commands:
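
The arithmetic behind these numbers can be checked with a short sketch. The counts and the elapsed time are copied from the perf output above, and the 100 MHz system clock and the multiplier of 34 come from the text; the helper name `ghz` is just an illustration. Note that this divides by wall-clock time, while perf may derive its own GHz annotation from a different time base, so the last digit can differ slightly.

```python
# Numbers copied from the perf stat output above.
cycles = 43_340_884_632
ref_cycles = 37_028_245_322
elapsed_s = 10.899462364

# Values from the text: a 100 MHz system clock and a clock multiplier of 34.
system_clock_hz = 100e6
multiplier = 34


def ghz(count, seconds):
    """Average frequency in GHz implied by an event count over a time span."""
    return count / seconds / 1e9


core_ghz = ghz(cycles, elapsed_s)          # frequency the core actually ran at
ref_ghz = ghz(ref_cycles, elapsed_s)       # frequency of the reference clock
base_ghz = system_clock_hz * multiplier / 1e9

print(f"average core frequency: {core_ghz:.2f} GHz")
print(f"reference frequency:    {ref_ghz:.2f} GHz")
print(f"base frequency:         {base_ghz:.2f} GHz")
print(f"cycles / ref-cycles:    {cycles / ref_cycles:.2f}")
```

The reference frequency lands very close to the base frequency, while the ratio of cycles to ref-cycles shows how far above the base frequency the core ran on average.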

1. perf stat -e cycles -a -I 1000
2. perf stat -e ref-cycles -a -I 1000
3. perf stat -e bus-cycles -a -I 1000

Place them in such a way that all 3 are visible. Then open another terminal and start executing some workload in it. You will notice how the collected values increase and decrease over time. This experiment will also give you an idea of how the state of the CPU is changing.

For advanced information about reference cycles, please check this thread on the Intel forum.

What is a mispredicted branch?

Modern CPUs try to predict the outcome of a branch instruction (taken or not taken). For example, when the processor sees code like this:

dec eax
jz .zero
# eax is not 0
...
.zero:
# eax is 0

The jz instruction is a branch instruction, and in order to increase performance modern CPU architectures try to predict the result of such a branch. This is also called speculative execution. The processor will speculate that, for example, the branch will not be taken and will execute the code that corresponds to the situation when eax is not 0. However, if the guess was wrong, this is called a “branch misprediction”, and the CPU is required to undo all the speculative work that it has done lately. This may cost somewhere between 10 and 20 clock cycles.

You can check how many branch mispredictions there were in the workload by using perf:

$ perf stat -e branches,branch-misses ls
 Performance counter stats for 'ls':
            358209      branches                                                    
             14026      branch-misses             #    3,92% of all branches        
       0,009579852 seconds time elapsed
or simply
$ perf stat ls
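
As a quick sanity check of the percentage perf prints, and to get a feel for what the mispredictions might cost, here is a minimal sketch. The counts are copied from the output above and the 10 to 20 cycle penalty is the range quoted earlier; the resulting cycle estimate is only a back-of-the-envelope figure, not something perf reports.

```python
# Counts copied from the perf stat output above.
branches = 358_209
branch_misses = 14_026

# Penalty range quoted in the text, in clock cycles per misprediction.
penalty_low, penalty_high = 10, 20


def miss_rate(misses, total):
    """Fraction of branches that were mispredicted."""
    return misses / total


rate = miss_rate(branch_misses, branches)

# Rough bounds on the cycles spent recovering from mispredictions.
lost_low = branch_misses * penalty_low
lost_high = branch_misses * penalty_high

print(f"misprediction rate: {rate:.2%}")
print(f"estimated cycles lost: {lost_low} to {lost_high}")
```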

More information, such as history and possible and real-world implementations, can be found on Wikipedia and in Agner Fog’s microarchitecture manual, chapter 3 “Branch prediction”.

What are CPI and IPC?

These two are derived metrics that stand for:

  • CPI - Cycles Per Instruction (how many cycles did it take to execute one instruction on average?)
  • IPC - Instructions Per Cycle (how many instructions were retired per cycle on average?)

There are lots of other analyses that can be done based on those metrics. But in a nutshell, you want to have a low CPI and a high IPC.

Formulas:

IPC = INST_RETIRED.ANY / CPU_CLK_UNHALTED.THREAD
CPI = 1 / IPC

Let’s look at the example:

$ perf stat -e cycles,instructions ls
 Performance counter stats for 'ls':
           2369632      cycles                                                      
           1725916      instructions              #    0,73  insn per cycle         
       0,001014339 seconds time elapsed

Notice that the perf tool automatically calculates the IPC metric for us.
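
The same calculation is easy to reproduce by hand. Below is a small sketch that applies the formulas above to the counts from this perf run; the function names `ipc` and `cpi` are only illustrative.

```python
# Counts copied from the perf stat output above.
cycles = 2_369_632
instructions = 1_725_916


def ipc(instructions, cycles):
    """Instructions Per Cycle: retired instructions divided by core cycles."""
    return instructions / cycles


def cpi(instructions, cycles):
    """Cycles Per Instruction: the reciprocal of IPC."""
    return cycles / instructions


ipc_value = ipc(instructions, cycles)
cpi_value = cpi(instructions, cycles)

print(f"IPC = {ipc_value:.2f}")
print(f"CPI = {cpi_value:.2f}")
```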

