Denis Bakhvalov

Performance analysis and tuning challenge #2.

Categories: challenge

10 Apr 2019


Welcome to the second edition of my performance analysis and tuning contest. If you see this post and haven’t read my initial post about the contest, I encourage you to read it first.




The benchmark for the 2nd edition is 7zip-benchmark from the LLVM test-suite. Yes, this time we have a serious challenge; I expect it will be lots of fun and, at the same time, lots of new things to learn!

7-Zip is a well-known piece of software: a file archiver with a high compression ratio. 7-Zip supports different compression algorithms and formats.

There are two tests combined in the benchmark:

  1. Compression with the LZMA method
  2. Decompression with the LZMA method

Documentation for this benchmark can be found here (see chapter “Benchmark”, page 25).

Quickstart

To download and build the benchmark do the following:

$ git clone https://github.com/llvm-mirror/test-suite
$ mkdir build
$ cd build
$ cmake -DTEST_SUITE_COLLECT_CODE_SIZE=OFF -DTEST_SUITE_BENCHMARKING_ONLY=ON -DCMAKE_C_COMPILER=/usr/bin/clang -DCMAKE_CXX_COMPILER=/usr/bin/clang++ -DCMAKE_C_FLAGS="-O3 -march=core-avx2" -DCMAKE_CXX_FLAGS="-O3 -march=core-avx2" ../test-suite/
$ make 7zip-benchmark -j6

To run the benchmark do:

$ cd MultiSource/Benchmarks/7zip
$ ./7zip-benchmark b

There are quite a few macros in the benchmark, so you might want to look at the preprocessed source. The special targets that cmake generates for this might be helpful, for example:

$ make C/LzmaDec.i

For building with gcc you need to revert this commit: 7ee8c3b0943f6e6c7d2180c1bd7648e27c3a19cc.

Clang and gcc give roughly the same score [1], so you can choose whichever is more convenient for you.

The target machine for this edition of the contest is again a Haswell CPU running 64-bit Linux. That said, you can do your experiments on Windows, since cmake is used for building the benchmark. If you choose Windows as your platform, here is an article that might be helpful: How to collect CPU performance counters on Windows?

Couple of hints

Here is the workflow that I recommend:

  1. Collect the baseline (use time or analogs).
  2. Find the hotspot (use perf record).
  3. Find the performance headroom:
    • Take a look at the assembly and try to guess how you can do better.
    • Collect general statistics like branch mispredictions and cache misses (use perf stat).
    • Run through the TMAM (Top-down Microarchitecture Analysis Method) process.
  4. Fix the issue, rebuild the benchmark, run it, and compare against the baseline.
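
For example, a minimal set of commands covering steps 1-3 could look like this (just a sketch; adjust the event list to whatever you want to measure):

$ time ./7zip-benchmark b                                    # 1. baseline
$ perf record ./7zip-benchmark b && perf report              # 2. hotspots
$ perf stat -e branches,branch-misses,cache-references,cache-misses ./7zip-benchmark b   # 3. general statistics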

I also have a few pieces of general advice:

  • Do not try to understand the whole benchmark. For some people (including me) it’s crucial to understand how every piece of code works. For the purpose of optimizing, that would be wasted effort. There are CPU benchmarks with thousands of LOC (like SPEC2017); it’s absolutely impossible to understand them in a reasonable time. What you need to familiarize yourself with are the hotspots. That’s it. You most likely need to understand one function/loop, which is not more than 100 LOC.
  • You have a specific workload for which you optimize the benchmark. You don’t need to optimize it for any other input/workload. The main principle behind Data-oriented design is that you know the data of your application.

If you feel you’re stuck, don’t hesitate to ask questions or look for support elsewhere. I don’t have much time to answer every question promptly, but I will do my best.

See the Q&A post about which optimizations are allowed and which are not.

Validation

If the benchmark runs correctly, it will print something like:

7-Zip (A) [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,1 CPU)

RAM size:     128 MB,  # CPU hardware threads:   1
RAM usage:    107 MB,  # Benchmark threads:      1

Dict        Compressing          |        Decompressing
      Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
       KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS

22:    5623    99   5543   5470  |    46195   101   4146   4171
23:    4982   100   5098   5076  |    45875   100   4199   4200
----------------------------------------------------------------
Avr:           99   5320   5273               100   4173   4185
Tot:          100   4746   4729

The benchmark reports a rating in MIPS (million instructions per second). But for our purposes this is not a good metric, since we strive for the fastest execution time: we can execute a higher number of instructions yet have a lower running time. For this reason I recommend simply measuring execution time.
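
For example, one way to collect timings for several consecutive runs (a sketch, assuming GNU time is installed at /usr/bin/time):

$ for i in $(seq 10); do /usr/bin/time -f "%e" ./7zip-benchmark b > /dev/null; done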

If for some reason something went wrong, it will print:

Decoding Error

Submissions

Disclaimer: I will not use submissions for any commercial purposes.

The baseline that I will be measuring against is ‘clang -O3 -march=core-avx2’. I noticed that enabling LTO (-flto) doesn’t bring any additional improvement, but you may add it if you want. Note, however, that it increases the build time of the benchmark, which is not great for doing quick experiments.
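
If you do want to try LTO, it is just a matter of appending the flag to the cmake invocation from the Quickstart section, for example:

$ cmake -DTEST_SUITE_COLLECT_CODE_SIZE=OFF -DTEST_SUITE_BENCHMARKING_ONLY=ON -DCMAKE_C_COMPILER=/usr/bin/clang -DCMAKE_CXX_COMPILER=/usr/bin/clang++ -DCMAKE_C_FLAGS="-O3 -march=core-avx2 -flto" -DCMAKE_CXX_FLAGS="-O3 -march=core-avx2 -flto" ../test-suite/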

If you’re willing to submit your work, subscribe to my mailing list and then send everything you have via email.

See the rules and guidelines for submissions here.

If you are in a position to write an article describing your findings, I highly encourage you to do so. It is much better to have the author describe the findings than to have me interpret your submission.

I’m collecting all your submissions until 30th April 2019.

Spread the word

If you know someone who might be interested in participating in this contest, please spread the word about it!

Good luck and have fun!

P.S. I have no automation for my contest yet, so if anyone knows a good service or a way to automate it using a web interface, please let me know.

P.P.S. I’m also open to your comments and suggestions. If you have a proposal for a benchmark for the next edition of the contest, please tell me about it.


Updated 4th May 2019

I didn’t receive enough submissions for the contest, so I decided not to publish the score table this time. Instead, I will share what the participants were able to find.

Fighting branch mispredictions (+9%)

Almost every participant identified the main bottleneck in the benchmark, which is the large number of branch mispredictions:

$ ~/pmu-tools/toplev.py --core S0-C0 -l2 -v --no-desc taskset -c 0 ./7zip-benchmark b
S0-C0    FE             Frontend_Bound:                             13.74 +-     0.00 % Slots below
S0-C0    BAD            Bad_Speculation:                            39.32 +-     0.00 % Slots      
S0-C0    BE             Backend_Bound:                              15.61 +-     0.00 % Slots      
S0-C0    RET            Retiring:                                   31.28 +-     0.00 % Slots below
S0-C0    FE             Frontend_Bound.Frontend_Latency:             8.48 +-     0.00 % Slots below
S0-C0    FE             Frontend_Bound.Frontend_Bandwidth:           5.28 +-     0.00 % Slots below
S0-C0    BAD            Bad_Speculation.Branch_Mispredicts:         39.29 +-     0.00 % Slots       <==
S0-C0    BAD            Bad_Speculation.Machine_Clears:              0.03 +-     0.00 % Slots below
S0-C0    BE/Mem         Backend_Bound.Memory_Bound:                  7.14 +-     0.00 % Slots below
S0-C0    BE/Core        Backend_Bound.Core_Bound:                    8.47 +-     0.00 % Slots below
S0-C0    RET            Retiring.Base:                              31.15 +-     0.00 % Slots below
S0-C0    RET            Retiring.Microcode_Sequencer:                0.12 +-     0.00 % Slots below

If the HW branch predictor can’t deal with some branch, there is usually not much you can do about it: if there were a repeatable pattern, the branch predictor unit would catch it. However, it still may be a good idea to confirm that the pattern is in fact unpredictable. I wrote a completely new article on how you can collect statistics for a particular branch: Estimating branch probability using Intel LBR feature.

I put all the details there, so I really encourage you to take a look at that post. There you will find how to locate the mispredicted branch and how to find the corresponding line in the source code.
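
To give a rough idea, the commands involved look something like this (just a sketch; the exact workflow and how to interpret the output are described in the article above):

$ perf record -b -e cycles ./7zip-benchmark b
$ perf report -n --sort symbol_from,symbol_to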

So, here is the piece of code we are interested in (preprocessed):

do {
  ttt = *(prob + symbol);
  if (range < ((UInt32)1 << 24)) {
    range <<= 8;
    code = (code << 8) | (*buf++);
  };
  bound = (range >> 11) * ttt;
  if (code < bound) {             // <-- unpredictable branch
    range = bound;
    *(prob + symbol) = (UInt16)(ttt + (((1 << 11) - ttt) >> 5));
    symbol = (symbol + symbol);
  } else {
    range -= bound;
    code -= bound;
    *(prob + symbol) = (UInt16)(ttt - (ttt >> 5));
    symbol = (symbol + symbol) + 1; 
  }
} while (symbol < 0x100);

-How can we get rid of the branch mispredictions?

-By getting rid of the branch itself.

-How do we do that?

-One way is to force the compiler to generate CMOVs (conditional moves) instead of branches.

Here is the modified code:

do {
  ttt = *(prob + symbol);
  if (range < ((UInt32)1 << 24)) {
    range <<= 8;
    code = (code << 8) | (*buf++);
  };
  bound = (range >> 11) * ttt;
  
  Bool cond = code < bound;
  
  range = cond ? bound : (range - bound);
  UInt16 *addr = prob + symbol;
  UInt16 LHS = (UInt16)(ttt + (((1 << 11) - ttt) >> 5));
  UInt16 RHS = (UInt16)(ttt - (ttt >> 5));
  *addr = cond ? LHS : RHS;
  
  symbol = (symbol + symbol);
  symbol = cond ? symbol : symbol + 1;
  code = cond ? code : code - bound;
} while (symbol < 0x100);

The idea is that we compute both branches and then just select the right answer using a CMOV instruction.

Note, however, that there is a threshold for generating CMOVs. When there is too much computation in both branches, it is better to take the penalty of a branch misprediction (~15-20 cycles, depending on how deep the pipeline is) than to execute both branches. I think this threshold is usually low, meaning that CMOV is only profitable when there isn’t much computation in either branch and the branch is unpredictable. For further reading, check this stackoverflow question.
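
Since the compiler is free to keep the branch, it is worth checking the generated code after such a rewrite. One quick (if crude) way is to look for CMOV instructions in the disassembly of the binary:

$ objdump -d ./7zip-benchmark | grep cmov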

There are other similar places where branch misprediction happens inside LzmaDec_DecodeReal(). The best place to fix them all is to modify the GET_BIT2 macro. You can find the patch which does that on my github and try it yourself.

Improving machine block placement (+1.5%)

Once the major issue (branch mispredictions) is fixed, the workload becomes Frontend Bound:

S0-C0    FE             Frontend_Bound:       22.11 +-     0.00 % Slots       <==
S0-C0    BAD            Bad_Speculation:      17.81 +-     0.00 % Slots      
S0-C0    BE             Backend_Bound:        20.80 +-     0.00 % Slots      
S0-C0    RET            Retiring:             40.02 +-     0.00 % Slots below

I ran the benchmark with profile-guided optimizations [2] (PGO) enabled, and it showed a slight degradation, so that wasn’t helpful.
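
For reference, a typical instrumentation-based PGO flow with clang looks roughly like the following (only a sketch, not necessarily the exact commands used for this experiment; the same flags would also go into CMAKE_CXX_FLAGS, and see footnote [2] for proper guides):

$ cmake ... -DCMAKE_C_FLAGS="-O3 -march=core-avx2 -fprofile-instr-generate" ../test-suite/   # instrumented build
$ make 7zip-benchmark -j6
$ cd MultiSource/Benchmarks/7zip && ./7zip-benchmark b         # writes default.profraw
$ llvm-profdata merge -output=code.profdata default.profraw
$ cmake ... -DCMAKE_C_FLAGS="-O3 -march=core-avx2 -fprofile-instr-use=code.profdata" ../test-suite/   # rebuild using the profile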

I noticed that the body of the NORMALIZE macro (inside the GET_BIT2 macro) is almost never executed:

  if (range < ((UInt32)1 << 24)) { // almost always false
    range <<= 8;
    code = (code << 8) | (*buf++);
  };

Yet in the assembly we have this code physically residing inside the hot loop:

    0.00 : ┌─┬───> 4edab0:       mov    ecx,ecx		<== begin loop
    0.00 : │ │     4edab2:       movzx  eax,WORD PTR [r9+rcx*2]
    4.36 : │ │     4edab7:       cmp    ebp,0xffffff
    0.66 : │ │ ┌── 4edabd:       ja     4edad0 
    0.00 : │ │ │   4edabf:       shl    ebp,0x8		<== NORMALIZE body
    0.00 : │ │ │   4edac2:       shl    r10d,0x8
    0.00 : │ │ │   4edac6:       movzx  edx,BYTE PTR [r8]
    0.00 : │ │ │   4edaca:       inc    r8
    0.00 : │ │ │   4edacd:       or     r10d,edx
    0.00 : │ │ └─> 4edad0:       mov    edx,ebp
    0.00 : │ │     4edad2:       shr    ebp,0xb
    0.00 : │ │     4edad5:       imul   ebp,eax
           ...

You see, we have a taken branch in the hot region. If we invert the control flow, the hot code will fall through. See details in the post Machine code layout optimizations. We can insert likely/unlikely compiler hints to make things better:

  if (__builtin_expect(range < ((UInt32)1 << 24), 0)) { // unlikely to be true
    range <<= 8;
    code = (code << 8) | (*buf++);
  };

With this information the compiler is able to move the body of the NORMALIZE macro out of the hot region. The patch that does this is on my github.
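
For readability, such hints are usually wrapped in small macros. Here is a minimal sketch (the UNLIKELY name is mine, not taken from the 7-zip sources):

// Hypothetical helper macro; falls back to a no-op on compilers
// without __builtin_expect.
#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

  if (UNLIKELY(range < ((UInt32)1 << 24))) { // almost always false
    range <<= 8;
    code = (code << 8) | (*buf++);
  }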

Reducing the size of the hot code (+0.5%)

I found this somewhat hot loop in LzmaDec_DecodeReal():

// C/LzmaDec.c
393           dicPos += curLen;
394           do
395             *(dest) = (Byte)*(dest + src);
396           while (++dest != lim);

The scalar (not unrolled) version of it was firing, although the compiler prepared vectorized and unrolled + slp-vectorized versions of it. See examples and more information in my article: Vectorization part6. Multiversioning by trip counts.

I did several experiments with unrolling and vectorizing the loop [3] with different factors, but none of them gave me any performance boost. I soon figured out that this loop has a very small number of iterations (fewer than 4), and the overhead of dispatching to a better assembly version of the loop is very significant. If we are still going to fire the scalar (not unrolled) version of the loop, then there is no need to prepare the other versions. So I decided to disable vectorization and unrolling for LzmaDec.c:

$ clang -c -O3 C/LzmaDec.c ... -fno-vectorize -fno-slp-vectorize -fno-unroll-loops

That removes the overhead of the runtime checks the compiler inserts for dispatching between the assembly versions of the loop. It also reduces the code size of the outer loop (lines 153-411) in LzmaDec_DecodeReal().
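
An alternative that keeps the rest of LzmaDec.c untouched is to disable these transformations only for this particular loop with clang’s loop pragmas. Here is a sketch applied to the loop shown above (note that there is no loop pragma for SLP vectorization, so the file-level flag may still be needed):

  dicPos += curLen;
  /* Tell clang not to vectorize or unroll this short-trip-count loop. */
  #pragma clang loop vectorize(disable) unroll(disable)
  do
    *(dest) = (Byte)*(dest + src);
  while (++dest != lim);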

Optimizations summary

Here is the summary of optimizations that could be made for this benchmark:

 time(s)   submission         timings for 10 consecutive runs (s)                  speedup
  5.39     cmov_norm_scalar   5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.40 5.41 5.42    +11.7%
  5.42     cmov_norm          5.42 5.43 5.43 5.43 5.43 5.43 5.43 5.43 5.44 5.45    +11.0%
  5.49     cmov               5.49 5.49 5.49 5.50 5.51 5.51 5.51 5.52 5.54 5.54     +9.7%
  6.02     baseline           6.02 6.02 6.03 6.03 6.04 6.05 6.06 6.07 6.07 6.08

I think it was a good and interesting challenge. Improving such a mature application as 7-zip by 12% is very cool. However, I would not run to the author of 7-zip with a pull request from these patches, because we only improved a single 7-zip workload. It might be the case that for some other workloads they would only make things worse.

Further optimization directions

  1. I would continue reducing the code size of the outermost loop (lines 153-411) in LzmaDec_DecodeReal(). It should be possible to do this by outlining cold parts of the loop, for example, lines 209-260 (see the small sketch after this list).
  2. I briefly looked at the other hotspots. The function GetMatchesSpec1 seems to be a good candidate for inlining. However, GET_MATCHES_FOOTER generates a lot of code and looks like it’s cold. Maybe it’s a good idea to outline it as well.
  3. Top-Down analysis shows there are other small issues, like DTLB misses (attributed to 1% of stalled cycles) and store-to-load forwarding failures (attributed to 0.5% of stalled cycles). The analysis also shows Ports_Utilization problems, which could be worth analyzing.
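
To illustrate the first direction, here is a small, self-contained sketch of outlining a cold path into a separate function so that it does not occupy space inside the hot loop (the names and the loop itself are made up for the example, not taken from 7-zip):

// Cold path moved into its own function; the attributes keep it out of
// the hot code and prevent the compiler from inlining it back.
__attribute__((cold, noinline))
static int handle_rare_case(int x)
{
  /* rarely executed work */
  return x * 2 + 1;
}

static int hot_loop(const int *data, int n)
{
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    if (__builtin_expect(data[i] < 0, 0)) /* rare */
      sum += handle_rare_case(data[i]);
    else
      sum += data[i];
  }
  return sum;
}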

  [1] At least for newer versions: clang 9.0 and gcc 8.2.

  [2] I wrote a little bit about PGO here. A guide for using PGO in clang is here.

  [3] Using the clang loop pragmas described here.


