We have seen periodically performance problems with cmov where one
operand comes from memory. On modern x86 processors with strong branch
predictors and speculative execution, this tends to be much better done
with a branch than cmov. We routinely see cmov stalling while the load
is completed rather than continuing, and if there are subsequent
branches, they cannot be speculated in turn.
Also, in many (even simple) cases, macro fusion causes the control flow
version to be fewer uops.
Consider the IACA output for the initial sequence of code in a very hot
function in one of our internal benchmarks that motivates this, and notice the
micro-op reduction provided.
Before, SNB:
Throughput Analysis Report -------------------------- Block Throughput: 2.20 Cycles Throughput Bottleneck: Port1 | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | | --------------------------------------------------------------------- | 1 | | 1.0 | | | | | CP | mov rcx, rdi | 0* | | | | | | | | xor edi, edi | 2^ | 0.1 | 0.6 | 0.5 0.5 | 0.5 0.5 | | 0.4 | CP | cmp byte ptr [rsi+0xf], 0xf | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | mov rax, qword ptr [rsi] | 3 | 1.8 | 0.6 | | | | 0.6 | CP | cmovbe rax, rdi | 2^ | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | cmp byte ptr [rcx+0xf], 0x10 | 0F | | | | | | | | jb 0xf Total Num Of Uops: 9
After, SNB:
Throughput Analysis Report -------------------------- Block Throughput: 2.00 Cycles Throughput Bottleneck: Port5 | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | | --------------------------------------------------------------------- | 1 | 0.5 | 0.5 | | | | | | mov rax, rdi | 0* | | | | | | | | xor edi, edi | 2^ | 0.5 | 0.5 | 1.0 1.0 | | | | | cmp byte ptr [rsi+0xf], 0xf | 1 | 0.5 | 0.5 | | | | | | mov ecx, 0x0 | 1 | | | | | | 1.0 | CP | jnbe 0x39 | 2^ | | | | 1.0 1.0 | | 1.0 | CP | cmp byte ptr [rax+0xf], 0x10 | 0F | | | | | | | | jnb 0x3c Total Num Of Uops: 7
The difference even manifests in a throughput cycle rate difference on Haswell.
Before, HSW:
Throughput Analysis Report -------------------------- Block Throughput: 2.00 Cycles Throughput Bottleneck: FrontEnd | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | | --------------------------------------------------------------------------------- | 0* | | | | | | | | | | mov rcx, rdi | 0* | | | | | | | | | | xor edi, edi | 2^ | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | | | cmp byte ptr [rsi+0xf], 0xf | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | mov rax, qword ptr [rsi] | 3 | 1.0 | 1.0 | | | | | 1.0 | | | cmovbe rax, rdi | 2^ | 0.5 | | 0.5 0.5 | 0.5 0.5 | | | 0.5 | | | cmp byte ptr [rcx+0xf], 0x10 | 0F | | | | | | | | | | jb 0xf Total Num Of Uops: 8
After, HSW:
Throughput Analysis Report -------------------------- Block Throughput: 1.50 Cycles Throughput Bottleneck: FrontEnd | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | | --------------------------------------------------------------------------------- | 0* | | | | | | | | | | mov rax, rdi | 0* | | | | | | | | | | xor edi, edi | 2^ | | | 1.0 1.0 | | | 1.0 | | | | cmp byte ptr [rsi+0xf], 0xf | 1 | | 1.0 | | | | | | | | mov ecx, 0x0 | 1 | | | | | | | 1.0 | | | jnbe 0x39 | 2^ | 1.0 | | | 1.0 1.0 | | | | | | cmp byte ptr [rax+0xf], 0x10 | 0F | | | | | | | | | | jnb 0x3c Total Num Of Uops: 6
Note that this cannot be usefully restricted to inner loops. Much of the
hot code we see hitting this is not in an inner loop or not in a loop at
all. The optimization still remains effective and indeed critical for
some of our code.
I have run a suite of internal benchmarks with this change and saw no
significant regressions and few very significant improvements. I'm still
working on collecting data for SPEC and the LLVM test suite. I will
update when I have it.
I also am still working on dedicated testing of this functionality, but
I've built a very large amount of code with the patch and had no issues.
Depends on D36783.