In this post I will discuss another interesting feature of Intel processors called store forwarding.
To describe it, I will quote Agner Fog's microarchitecture.pdf:
The processor can forward a memory write to a subsequent read from the same address under certain conditions. Store forwarding works if a write to memory is followed by a read from the same address when the read has the same operand size.
Here is an example of a successful store-to-load forwarding:

    mov DWORD [esi], edi
    mov eax, DWORD [esi]

In this example the 4-byte store is held in the store buffer, and the load takes those 4 bytes directly from there, without waiting for the store to be written to the L1 cache.
But there are some situations where store-to-load forwarding fails. For example:

    mov WORD [esi], di      ; small write
    mov eax, DWORD [esi]    ; big read (stall)

I put both examples in a tight loop and microbenchmarked them using the uarch-bench tool.
successful store forwarding:
    .loop:
        mov DWORD [esi + edi * 4], edi
        mov eax, DWORD [esi + edi * 4]
        dec edi
        jnz .loop
big read after small write:
    .loop:
        mov WORD [esi + edi * 4], di      ; small write
        mov eax, DWORD [esi + edi * 4]    ; big read (stall)
        dec edi
        jnz .loop
        ret
I ran my experiments on an Ivy Bridge CPU. The counters that I will show are LD_BLOCKS.STORE_FORWARD and UOPS_RETIRED.STALL_CYCLES.
The benchmark runs 1000 iterations of this loop and the counters presented below are per iteration:
| Benchmark | Cycles | LD_BLOCKS.STORE_FORWARD | UOPS_RETIRED.STALL_CYCLES |
|---|---|---|---|
| successful store forwarding | 1.02 | 0.00 | 0.02 |
| big read after small write | 15.00 | 1.00 | 14.00 |
So, here you can see that the loop runs very fast when nothing prevents store-to-load forwarding: everything is nicely pipelined at about one cycle per iteration. But when store forwarding fails, the loop runs about 15 times slower, which is really nasty. The LD_BLOCKS.STORE_FORWARD counter shows that we hit 1 such issue per iteration, which results in an additional 14 penalty cycles per iteration.
However, according to Agner Fog's microarchitecture.pdf, on the Sandy Bridge family the penalty for a failed store forwarding is approximately 12 clock cycles in most cases, while we see a penalty of 14 cycles.
Because load-store reordering is not allowed on x86 (even though store-load reordering is), only one blocked load can execute at a time, and perhaps the subsequent stores (from the next iterations) are also blocked from committing in order to preserve memory ordering. I think that explains why we might see the additional 2 cycles of penalty, although I'm not 100% sure about that.
I also think this doesn't mean that the whole pipeline is stalled. If you add lots of math instructions to the loop, they will not be blocked:
    mov WORD [esi + edi * 4], di      ; small write
    mov eax, DWORD [esi + edi * 4]    ; big read (stall)
    add ebx, 1                        ; not stalled
You get the same effect when the load's start address differs from the store's start address. Example:

    mov DWORD [esi], edi
    mov eax, DWORD [esi + 1]    ; not the same start address (stall)

When I benchmarked this assembly sequence I got basically the same numbers as for the “big read after small write” case.
The best way to find the complete list of things that can prevent store forwarding on a particular architecture is to look them up in microarchitecture.pdf.
I did one more interesting experiment in which I tried to hide one store forwarding failure under another. I ran 2 variants, each accessing a single cache line, writing 2 bytes and reading 4 bytes at a time (store forwarding stall):

cache line: full unroll:

    mov WORD [esi], di
    mov eax, DWORD [esi]
    mov WORD [esi + 4], di
    mov eax, DWORD [esi + 4]
    mov WORD [esi + 8], di
    mov eax, DWORD [esi + 8]
    ; ... more stores and loads
    mov WORD [esi + 60], di
    mov eax, DWORD [esi + 60]

cache line: full interleave:

    mov WORD [esi], di
    mov WORD [esi + 4], di
    mov WORD [esi + 8], di
    ; ... more stores
    mov WORD [esi + 60], di
    mov eax, DWORD [esi]
    mov eax, DWORD [esi + 4]
    mov eax, DWORD [esi + 8]
    ; ... more loads
    mov eax, DWORD [esi + 60]
The difference is that in “full unroll” loads and stores are intermixed, but in “full interleave” I first write to the entire cache line and after that start reading from it.
| Benchmark | Cycles | LD_BLOCKS.STORE_FORWARD | UOPS_RETIRED.STALL_CYCLES |
|---|---|---|---|
| cache line: full unroll | 235.00 | 16.00 | 220.00 |
| cache line: full interleave | 29.00 | 16.00 | 4.00 |
From this experiment you can see that in the “full unroll” case, every time we hit a store forwarding stall we stop right there. In the second case we were able to hide almost all of the store forwarding penalty. Notice, though, that the LD_BLOCKS.STORE_FORWARD count is the same in both cases.