This is an archive of the discontinued LLVM Phabricator instance.

[x86] allow movmsk with 2-element reductions
ClosedPublic

Authored by spatel on Mar 29 2019, 8:21 AM.

Details

Summary

One motivation for this change is that failing to use movmsk is likely a main source of the performance difference between clang and gcc on the C-Ray benchmark, as shown here:
https://www.phoronix.com/scan.php?page=article&item=gcc-clang-2019&num=5
...but this change alone isn't enough to solve that problem.

The 'all-of' examples show what is likely the worst case trade-off: we end up with an extra instruction (or 2 if we count the 'xor' register clearing). The 'any-of' examples look clearly better using movmsk because we've traded 2 vector instructions for 2 scalar instructions, and movmsk may have better timing than the generic 'movq'.
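To make the 'all-of' and 'any-of' terminology concrete, here is a minimal Python model of the 2-element case. It is only a sketch of the instruction semantics, not the code in this patch: the helper names (`cmpltpd`, `movmskpd`, `all_of`, `any_of`) are hypothetical, and lanes are modeled as plain 64-bit integers.

```python
ALL_ONES = 0xFFFFFFFFFFFFFFFF  # a "true" lane as produced by a vector compare


def cmpltpd(a, b):
    """Model of a packed less-than compare: all-ones lane where a[i] < b[i], else zero."""
    return [ALL_ONES if x < y else 0 for x, y in zip(a, b)]


def movmskpd(lanes):
    """Model of movmskpd: pack the sign bit of each 64-bit lane into a small integer."""
    mask = 0
    for i, lane in enumerate(lanes):
        mask |= ((lane >> 63) & 1) << i
    return mask


def all_of(a, b):
    # Two lanes, so "every lane true" means both mask bits are set (mask == 3).
    return movmskpd(cmpltpd(a, b)) == 0b11


def any_of(a, b):
    # "At least one lane true" means the mask is non-zero.
    return movmskpd(cmpltpd(a, b)) != 0
```

In this model the 'all-of' form needs a compare against the constant 3 plus a flag-to-register step, which is where the extra scalar instructions come from, while 'any-of' only needs a test for non-zero.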

If we examine the llvm-mca output for these cases, it appears that even though the 'all-of' movmsk variant looks worse on paper, it would perform better on both Haswell and Jaguar.

$ llvm-mca -mcpu=haswell no_movmsk.s -timeline
Iterations:        100
Instructions:      400
Total Cycles:      504
Total uOps:        400

Dispatch Width:    4
uOps Per Cycle:    0.79
IPC:               0.79
Block RThroughput: 1.0

$ llvm-mca -mcpu=haswell movmsk.s -timeline
Iterations:        100
Instructions:      600
Total Cycles:      358
Total uOps:        600

Dispatch Width:    4
uOps Per Cycle:    1.68
IPC:               1.68
Block RThroughput: 1.5

$ llvm-mca -mcpu=btver2 no_movmsk.s -timeline
Iterations:        100
Instructions:      400
Total Cycles:      407
Total uOps:        400

Dispatch Width:    2
uOps Per Cycle:    0.98
IPC:               0.98
Block RThroughput: 2.0
$ llvm-mca -mcpu=btver2 movmsk.s -timeline
Iterations:        100
Instructions:      600
Total Cycles:      311
Total uOps:        600

Dispatch Width:    2
uOps Per Cycle:    1.93
IPC:               1.93
Block RThroughput: 3.0
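
The "worse on paper, better in simulation" point can be checked with a little arithmetic on the summaries above. The sketch below only re-derives IPC and the cycle ratio from the reported instruction and cycle totals; the dictionary layout and function names are illustrative choices, not part of llvm-mca.

```python
# (instructions, total cycles) copied from the llvm-mca summaries above,
# each covering 100 iterations of the code sequence.
runs = {
    ("haswell", "no_movmsk"): (400, 504),
    ("haswell", "movmsk"): (600, 358),
    ("btver2", "no_movmsk"): (400, 407),
    ("btver2", "movmsk"): (600, 311),
}


def ipc(cpu, variant):
    """Instructions per cycle for one simulated run."""
    instructions, cycles = runs[(cpu, variant)]
    return instructions / cycles


def cycle_ratio(cpu):
    """no_movmsk cycles divided by movmsk cycles; above 1.0 means movmsk finishes sooner."""
    return runs[(cpu, "no_movmsk")][1] / runs[(cpu, "movmsk")][1]


for cpu in ("haswell", "btver2"):
    print(cpu, round(ipc(cpu, "no_movmsk"), 2), round(ipc(cpu, "movmsk"), 2),
          round(cycle_ratio(cpu), 2))
```

The movmsk variant executes 50% more instructions per iteration, but the simulated cycle count drops on both CPUs (a ratio of roughly 1.41 on Haswell and 1.31 on btver2), so the higher IPC more than pays for the extra instructions.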

Finally, there may be CPUs where movmsk is horribly slow (old AMD small cores?), but if that's true, then we're also almost certainly making the wrong transform already for reductions with >2 elements, so that should be fixed independently.

Event Timeline

spatel created this revision. Mar 29 2019, 8:21 AM
Herald added a project: Restricted Project. Mar 29 2019, 8:21 AM
spatel edited the summary of this revision. Mar 29 2019, 8:25 AM

llvm-mca numbers are quite accurate for btver2 (see below for the perf results):

vcmpltpd %xmm0, %xmm1, %xmm2
vmovmskpd %xmm0, %ecx
xorl %eax, %eax
cmpl $3, %ecx
sete %al
negq %rax

-->

cycles:           79314982                                        ( +- 0.36% )
instructions:     154000245        #   1.94 insn per cycle        ( +- 0.00% )
micro-opcodes:    154030776        #   1.94 uOps per cycle        ( +- 0.00% )

While the shuffle-based sequence:

vcmpltpd %xmm0, %xmm1, %xmm2
vpermilpd $1, %xmm2, %xmm1
vandpd %xmm1, %xmm2, %xmm2
vmovq %xmm2, %rax

Gives us this:

cycles:           114486380                                       ( +- 1.56% )
instructions:     102800331        #   0.90 insn per cycle        ( +- 0.00% )
micro-opcodes:    102844837        #   0.90 uOps per cycle        ( +- 0.00% )
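
As a rough cross-check of the two measurements above, the sketch below recomputes IPC and the cycle ratio from the reported counters. The variable names are illustrative, and the only inputs are the cycle and instruction counts quoted in this comment.

```python
# Counters copied from the two perf results above (btver2 hardware).
movmsk = {"cycles": 79314982, "instructions": 154000245}
shuffle = {"cycles": 114486380, "instructions": 102800331}


def ipc(run):
    """Measured instructions per cycle."""
    return run["instructions"] / run["cycles"]


# How many times more instructions the movmsk sequence retires.
instruction_ratio = movmsk["instructions"] / shuffle["instructions"]

# How many times more cycles the shuffle-based sequence takes.
cycle_ratio = shuffle["cycles"] / movmsk["cycles"]

print(round(ipc(movmsk), 2), round(ipc(shuffle), 2))
print(round(instruction_ratio, 2), round(cycle_ratio, 2))
```

The instruction ratio comes out close to 1.5, matching 6 versus 4 instructions per iteration, while the shuffle-based sequence takes roughly 1.44 times as many cycles. The measured IPC of about 1.94 and 0.90 lines up with the 1.93 and 0.98 that llvm-mca reported for btver2.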

> llvm-mca numbers are quite accurate for btver2 (see below for the perf results):

Great - thanks for checking that!
@craig.topper - are you aware of any Intel uarch outliers for movmsk?

I'm not aware of any outliers on Intel CPUs.

RKSimon accepted this revision. Mar 30 2019, 4:51 AM

LGTM - thanks!

This revision is now accepted and ready to land. Mar 30 2019, 4:51 AM
This revision was automatically updated to reflect the committed changes.