Denis Bakhvalov

Vectorization part1. Intro.

24 Oct 2017

Recently I was working closely with analyzing different vectorization cases. So I decided to write a series of articles dedicated to this topic.

This is the first post in this series, so let me start with some introduction info. Vectorization is a form of SIMD which allows to crunch multiple values in one CPU instruction.

I know it is bad introduction when just links to wiki thrown everywhere around, so let me show you simple example (godbolt):

#include <vector>
void foo( std::vector<unsigned>& lhs, std::vector<unsigned>& rhs )
    for( unsigned i = 0; i < lhs.size(); i++ )
            lhs[i] = ( rhs[i] + 1 ) >> 1;

Lets compile it with clang with options -O2 -march=core-avx2 -std=c++14 -fno-unroll-loops. I turned off loop unrolling to simplify the assembly and -march=core-avx2 tells compiler that generated code will be executed on a machine with avx2 support. Assembly generated contains:

Scalar version

mov edx, dword ptr [r9 + 4*rsi]         # loading rhs[i]
add edx, 1                              # rhs[i] + 1
shr edx                                 # (rhs[i] + 1) >> 1
mov dword ptr [rax + 4*rsi], edx        # store result to lhs
mov esi, edi
add edi, 1                              # incrementing i by 1
cmp rcx, rsi
ja <next iteration>

Vector version

vmovdqu ymm1, ymmword ptr [r9 + 4*rdi]  # loading 256 bits from rhs
vpsubd ymm1, ymm1, ymm0                 # ymm0 has all bits set, like +1
vpsrld ymm1, ymm1, 1                    # vector shift right.
vmovdqu ymmword ptr [rax + 4*rdi], ymm1 # storing result 
add rdi, 8                              # incrementing i by 8
cmp rsi, rdi
jne <next iteration>

So here you can see that vector version crunches 8 integers at a time (256 bits = 8 bytes). If you will analyze assembly carefull enough you will spot the runtime check which dispatch to those two versions. If there are not enough elements in the vector for choosing vector version, scalar version will be taken. Amount of instructions is smaller in vector version, although all vector instructions have bigger latency than scalar counterparts.

Vector operations can give significant performance gains but they have quite many restrictions which we will cover later. Historically, Intel has 3 instruction sets for vectorization: MMX, SSE and AVX.

Vector registers for those instruction sets are described here.

In general, not only loops can be vectorized. There is also linear vectorizer (in llvm it is called SLP vectorizer) which is searching for similar independent scalar instructions and tries to combine them.

To check vector capabilities of your CPU you can type lscpu. For my Intel Core i5-7300U filtered output is:

Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch rdseed clflushopt

For us the most interesting is that this CPU supports sse4_2 and avx instruction sets.

That’s all for now. In later articles I’m planing to cover following topics:

  1. Vectorization intro (this article).
  2. Vectorization warmup.
  3. Checking compiler vectorization report.
  4. Vectorization width.
  5. Multiversioning by data dependency.
  6. Multiversioning by trip counts.
  7. Tips for writing vectorizable code.

comments powered by Disqus