This is an archive of the discontinued LLVM Phabricator instance.

[x86] Teach the cmov converter to aggressively convert cmovs with memory operands into control flow.
ClosedPublic

Authored by chandlerc on Aug 17 2017, 7:02 PM.

Details

Summary

We have seen periodically performance problems with cmov where one
operand comes from memory. On modern x86 processors with strong branch
predictors and speculative execution, this tends to be much better done
with a branch than cmov. We routinely see cmov stalling while the load
is completed rather than continuing, and if there are subsequent
branches, they cannot be speculated in turn.

Also, in many (even simple) cases, macro fusion causes the control flow
version to be fewer uops.

Consider the IACA output for the initial sequence of code in a very hot
function in one of our internal benchmarks that motivates this, and notice the
micro-op reduction provided.
Before, SNB:

Throughput Analysis Report
--------------------------
Block Throughput: 2.20 Cycles       Throughput Bottleneck: Port1

| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    |           | 1.0 |           |           |     |     | CP | mov rcx, rdi
|   0*   |           |     |           |           |     |     |    | xor edi, edi
|   2^   | 0.1       | 0.6 | 0.5   0.5 | 0.5   0.5 |     | 0.4 | CP | cmp byte ptr [rsi+0xf], 0xf
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |    | mov rax, qword ptr [rsi]
|   3    | 1.8       | 0.6 |           |           |     | 0.6 | CP | cmovbe rax, rdi
|   2^   |           |     | 0.5   0.5 | 0.5   0.5 |     | 1.0 |    | cmp byte ptr [rcx+0xf], 0x10
|   0F   |           |     |           |           |     |     |    | jb 0xf
Total Num Of Uops: 9

After, SNB:

Throughput Analysis Report
--------------------------
Block Throughput: 2.00 Cycles       Throughput Bottleneck: Port5

| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    | 0.5       | 0.5 |           |           |     |     |    | mov rax, rdi
|   0*   |           |     |           |           |     |     |    | xor edi, edi
|   2^   | 0.5       | 0.5 | 1.0   1.0 |           |     |     |    | cmp byte ptr [rsi+0xf], 0xf
|   1    | 0.5       | 0.5 |           |           |     |     |    | mov ecx, 0x0
|   1    |           |     |           |           |     | 1.0 | CP | jnbe 0x39
|   2^   |           |     |           | 1.0   1.0 |     | 1.0 | CP | cmp byte ptr [rax+0xf], 0x10
|   0F   |           |     |           |           |     |     |    | jnb 0x3c
Total Num Of Uops: 7

The difference even manifests in a throughput cycle rate difference on Haswell.
Before, HSW:

Throughput Analysis Report
--------------------------
Block Throughput: 2.00 Cycles       Throughput Bottleneck: FrontEnd

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   0*   |           |     |           |           |     |     |     |     |    | mov rcx, rdi
|   0*   |           |     |           |           |     |     |     |     |    | xor edi, edi
|   2^   |           |     | 0.5   0.5 | 0.5   0.5 |     | 1.0 |     |     |    | cmp byte ptr [rsi+0xf], 0xf
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov rax, qword ptr [rsi]
|   3    | 1.0       | 1.0 |           |           |     |     | 1.0 |     |    | cmovbe rax, rdi
|   2^   | 0.5       |     | 0.5   0.5 | 0.5   0.5 |     |     | 0.5 |     |    | cmp byte ptr [rcx+0xf], 0x10
|   0F   |           |     |           |           |     |     |     |     |    | jb 0xf
Total Num Of Uops: 8

After, HSW:

Throughput Analysis Report
--------------------------
Block Throughput: 1.50 Cycles       Throughput Bottleneck: FrontEnd

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   0*   |           |     |           |           |     |     |     |     |    | mov rax, rdi
|   0*   |           |     |           |           |     |     |     |     |    | xor edi, edi
|   2^   |           |     | 1.0   1.0 |           |     | 1.0 |     |     |    | cmp byte ptr [rsi+0xf], 0xf
|   1    |           | 1.0 |           |           |     |     |     |     |    | mov ecx, 0x0
|   1    |           |     |           |           |     |     | 1.0 |     |    | jnbe 0x39
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     |    | cmp byte ptr [rax+0xf], 0x10
|   0F   |           |     |           |           |     |     |     |     |    | jnb 0x3c
Total Num Of Uops: 6

Note that this cannot be usefully restricted to inner loops. Much of the
hot code we see hitting this is not in an inner loop or not in a loop at
all. The optimization still remains effective and indeed critical for
some of our code.

I have run a suite of internal benchmarks with this change and saw no
significant regressions and few very significant improvements. I'm still
working on collecting data for SPEC and the LLVM test suite. I will
update when I have it.

I also am still working on dedicated testing of this functionality, but
I've built a very large amount of code with the patch and had no issues.

Depends on D36783.

Diff Detail

Repository
rL LLVM

Event Timeline

chandlerc created this revision.Aug 17 2017, 7:02 PM

I've run this across SPEC CPU 2006 and the LLVM test suite on my Haswell system.

Only 11 programs had a different hash, only three from SPEC (433.milc, 445.gobmk, 483.xalancbmk).

Of those, only two had any significant changes in performance that a quick test showed. It is a bit hard to be 100% certain, the test suite is very noisy. Those two (PENNANT and consumer-typeset) had 4.7% and 5.4% difference theoretically.

However, when re-running each under perf stat I saw no significant cycle count or instruction count difference and the timings swung in both directions so this appears to be just noise.

So I think this is totally fine for the LLVM test suite and SPEC.

And our internal benchmarks (which have a bit more to carefully measure and minimize noise) show no significant regressions and at least a couple of significant improvements. There is some various shifting in each direction of course, but it seems net positive or neutral across Sandybridge, Haswell, and Skylake.

I think this covers a reasonable set of performance data for checking in, so code review would be very much appreciated. If there are other benchmarks or systems others want to run, by all means. Note that there is a flag to disable this behavior as well.

aaboud edited edge metadata.Aug 18 2017, 6:01 AM

Thanks Chandler for preparing the patch, the implementation looks elegant, however, it overlooked a case where the memory registers are a result of a previous CMOV instructions.
This is a small reproducer that result in bad MIR:

target datalayout = "e-m:w-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-unknown"

define i32 @bar(i32* %a, i32* %b, i32 %n1, i32 %n2, i32 %d) #0 {
entry:
  %cmp = icmp sgt i32 %n1, %n2
  %s = select i1 %cmp, i32* %a, i32* %b
  %p = getelementptr inbounds i32, i32* %s, i64 1
  %load = load i32, i32* %p, align 4
  %res = select i1 %cmp, i32 %d, i32 %load
  
  ret i32 %res
}

attributes #0 = {"target-cpu"="skylake"}

Thanks Chandler for preparing the patch, the implementation looks elegant, however, it overlooked a case where the memory registers are a result of a previous CMOV instructions.

Doh, of course. I'll add a remapping step to generating unfolded load. Should be easy because we know that the load is always on one side so can just remap to one of the inputs.

This is a small reproducer that result in bad MIR:

Sweet!

target datalayout = "e-m:w-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-unknown"

define i32 @bar(i32* %a, i32* %b, i32 %n1, i32 %n2, i32 %d) #0 {
entry:
  %cmp = icmp sgt i32 %n1, %n2
  %s = select i1 %cmp, i32* %a, i32* %b
  %p = getelementptr inbounds i32, i32* %s, i64 1
  %load = load i32, i32* %p, align 4
  %res = select i1 %cmp, i32 %d, i32 %load
  
  ret i32 %res
}

attributes #0 = {"target-cpu"="skylake"}

While I'm writing the fix, and since it is already late in your TZ -- any other concerns before I land this?

While I'm writing the fix, and since it is already late in your TZ -- any other concerns before I land this?

No concerns, the direction looks just fine.
We only need to make sure the functionality is not broken.

Also, I just got the results from running internal benchmarks, and it seems that this optimization did not affect most of the workloads, and those that had code change, the performance was not affected as well!
So, I do not see a reason why not to continue with this improvement.

chandlerc updated this revision to Diff 111779.Aug 18 2017, 6:02 PM

Update with more comprehensive testing and a bug fix mentioned in code review
(as well as the suggested test and an even more exciting test case).

Ok, review comments addressed, tests added (after winning several battles with the register allocator to make interesting cmov groups).

craig.topper accepted this revision.Aug 18 2017, 6:31 PM

LGTM with the one nit.

test/CodeGen/X86/x86-cmov-converter.ll
414 ↗(On Diff #111779)

nit, space before "loads"

This revision is now accepted and ready to land.Aug 18 2017, 6:31 PM
This revision was automatically updated to reflect the committed changes.

There is still an issue with this implementation, here another reproducer:

target datalayout = "e-m:w-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-unknown"

define i32 @bar(i32* %a, i32* %b, i32* %c, i32 %n1, i32 %n2, i32 %d) #0 {
entry:
  %cmp = icmp sgt i32 %n1, %n2
  %s1 = select i1 %cmp, i32* %a, i32* %b
  %s = select i1 %cmp, i32* %c, i32* %s1
  %p = getelementptr inbounds i32, i32* %s, i64 1
  %load = load i32, i32* %p, align 4
  %res = select i1 %cmp, i32 %d, i32 %load
  
  ret i32 %res
}

attributes #0 = {"target-cpu"="skylake"}

There is still an issue with this implementation, here another reproducer:

target datalayout = "e-m:w-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-unknown"

define i32 @bar(i32* %a, i32* %b, i32* %c, i32 %n1, i32 %n2, i32 %d) #0 {
entry:
  %cmp = icmp sgt i32 %n1, %n2
  %s1 = select i1 %cmp, i32* %a, i32* %b
  %s = select i1 %cmp, i32* %c, i32* %s1
  %p = getelementptr inbounds i32, i32* %s, i64 1
  %load = load i32, i32* %p, align 4
  %res = select i1 %cmp, i32 %d, i32 %load
  
  ret i32 %res
}

attributes #0 = {"target-cpu"="skylake"}

Good catch, fixed in r311267.