[X86][BtVer2] Fix latency and throughput of conditional SIMD store instructions.
ClosedPublic

Authored by andreadb on Aug 27 2019, 6:47 AM.

Details

Summary

On BtVer2, conditional SIMD stores are heavily microcoded.
The latency is directly proportional to the number of packed elements extracted from the input vector. Also, according to micro-benchmarks, most of the computation seems to be done in the integer unit.

Only a minority of the uOPs are executed by the FPU. The observed behaviour on the FPU looks similar to this:

  • The input MASK value is moved to the Integer Unit [a VMOVMSK-like uOP, executed on JFPU0].
  • In parallel, each element of the input XMM/YMM is extracted and then sent to the Integer Unit through JFPU1.

As expected, a (conditional) store is executed for every extracted element. Interestingly, a (speculative) load is executed for every extracted element too. It is as if a "LOAD - BIT_EXTRACT - CMOV" sequence of uOPs is repeated by the integer unit for every conditionally stored element.
VMASKMOVDQU is a special case: the number of speculative loads is always 2 (presumably, one load per quadword). That means extra shifts and masking are performed on (one of) the loaded quadwords before each conditional store (which also explains the large number of non-FP uOPs retired).
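
To make the per-element sequence easier to picture, here is a rough software model in Python. It is only a sketch of the "LOAD - BIT_EXTRACT - CMOV" pattern described above, not a description of the real microcode: the function name `masked_store_ps`, the byte-array memory, and the way uOPs are counted are all assumptions made for illustration, and fault suppression and memory ordering are ignored.

```python
import struct


def masked_store_ps(memory, base, src, mask):
    """Illustrative model of a VMASKMOVPS-style store of 32-bit lanes.

    memory is a bytearray; src and mask are lists of 32-bit unsigned ints.
    Returns (loads, store_uops), each counted once per extracted element.
    """
    loads = 0
    store_uops = 0
    for lane, (value, lane_mask) in enumerate(zip(src, mask)):
        addr = base + 4 * lane
        # LOAD: read the current memory contents for this lane.
        (old,) = struct.unpack_from("<I", memory, addr)
        loads += 1
        # BIT_EXTRACT: only the most significant bit of the mask lane matters.
        selected = (lane_mask >> 31) & 1
        # CMOV: pick the new value if the bit is set, otherwise the old one.
        result = value if selected else old
        # Conditional store: one store uOP per element (assumed), but memory
        # is only modified when the mask bit is set.
        if selected:
            struct.pack_into("<I", memory, addr, result)
        store_uops += 1
    return loads, store_uops
```

With four lanes, the model performs four loads and four conditional stores regardless of the mask, which matches the observation that the cost scales with the number of elements rather than with the number of lanes actually written.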

Diff Detail

Repository
rL LLVM

Event Timeline

andreadb created this revision. Aug 27 2019, 6:47 AM
RKSimon added inline comments. Aug 30 2019, 6:12 AM
lib/Target/X86/X86ScheduleBtVer2.td
821 ↗(On Diff #217381)

Store

858 ↗(On Diff #217381)

Would we be better off just splitting WriteFMaskedStore into WriteFMaskedStore32 + WriteFMaskedStore64?

andreadb marked 2 inline comments as done. Aug 30 2019, 7:15 AM
andreadb added inline comments.
lib/Target/X86/X86ScheduleBtVer2.td
821 ↗(On Diff #217381)

Thanks. I will fix it.

858 ↗(On Diff #217381)

I thought about it before sending this patch. Adding new classes for conditional writes seemed like a reasonable option at first.
However, btver2 is currently the only model that needs to special-case the PS/PD variants. So, in the end, I opted for this solution because it seemed like a good compromise. Maybe we could revisit this decision later on if we see that other models also need to special-case these writes. What do you think?

andreadb updated this revision to Diff 218122. Aug 30 2019, 10:00 AM

Patch updated.

Address review comments.

This patch replaces the existing writes for conditional SIMD stores (i.e. WriteFMaskedStore, and WriteFMaskedStoreY) with the following new writes:

  • WriteFMaskedStore32 [ XMM Packed Single ]
  • WriteFMaskedStore32Y [ YMM Packed Single ]
  • WriteFMaskedStore64 [ XMM Packed Double ]
  • WriteFMaskedStore64Y [ YMM Packed Double ]

Added a wrapper class named X86SchedWriteMaskMove in X86Schedule.td to describe both RM and MR variants for conditional SIMD moves in a single tablegen definition.
Instances of that class are then passed as input to multiclass avx_movmask_rm when constructing MASKMOVPS/PD definitions.

Since this patch introduces new writes, I had to update all the X86 scheduling models.

This patch is NFC for all x86 models except BtVer2.

FWIW this does not appear to be the case on BdVer2:

$ ./bin/llvm-exegesis --mode=uops --opcode-name=VMASKMOVPSmr
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-072799.o
---
mode:            uops
key:
  instructions:
    - 'VMASKMOVPSmr RDI i_0x1 %noreg i_0x0 %noreg XMM6 XMM11'
    - 'VMASKMOVPSmr RDI i_0x1 %noreg i_0x40 %noreg XMM4 XMM9'
    - 'VMASKMOVPSmr RDI i_0x1 %noreg i_0x80 %noreg XMM12 XMM12'
    - 'VMASKMOVPSmr RDI i_0x1 %noreg i_0xc0 %noreg XMM6 XMM2'
    - 'VMASKMOVPSmr RDI i_0x1 %noreg i_0x100 %noreg XMM1 XMM7'
    - 'VMASKMOVPSmr RDI i_0x1 %noreg i_0x140 %noreg XMM10 XMM15'
  config:          ''
  register_initial_values:
    - 'XMM6=0x0'
    - 'XMM11=0x0'
    - 'XMM4=0x0'
    - 'XMM9=0x0'
    - 'XMM12=0x0'
    - 'XMM2=0x0'
    - 'XMM1=0x0'
    - 'XMM7=0x0'
    - 'XMM10=0x0'
    - 'XMM15=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: PdFPU0, value: 8.0055, per_snippet_value: 48.033 }
  - { key: PdFPU1, value: 4.0124, per_snippet_value: 24.0744 }
  - { key: PdFPU2, value: 2.0042, per_snippet_value: 12.0252 }
  - { key: PdFPU3, value: 4.0078, per_snippet_value: 24.0468 }
  - { key: NumMicroOps, value: 18.0142, per_snippet_value: 108.085 }
error:           ''
info:            instruction is parallel, repeating a random one.
assembled_snippet
...
$ ./bin/llvm-exegesis --mode=uops --opcode-name=VMASKMOVPDmr
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-28613e.o
---
mode:            uops
key:
  instructions:
    - 'VMASKMOVPDmr RDI i_0x1 %noreg i_0x0 %noreg XMM8 XMM7'
    - 'VMASKMOVPDmr RDI i_0x1 %noreg i_0x40 %noreg XMM14 XMM0'
    - 'VMASKMOVPDmr RDI i_0x1 %noreg i_0x80 %noreg XMM11 XMM5'
    - 'VMASKMOVPDmr RDI i_0x1 %noreg i_0xc0 %noreg XMM4 XMM11'
    - 'VMASKMOVPDmr RDI i_0x1 %noreg i_0x100 %noreg XMM12 XMM11'
    - 'VMASKMOVPDmr RDI i_0x1 %noreg i_0x140 %noreg XMM4 XMM0'
  config:          ''
  register_initial_values:
    - 'XMM8=0x0'
    - 'XMM7=0x0'
    - 'XMM14=0x0'
    - 'XMM0=0x0'
    - 'XMM11=0x0'
    - 'XMM5=0x0'
    - 'XMM4=0x0'
    - 'XMM12=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: PdFPU0, value: 7.9896, per_snippet_value: 47.9376 }
  - { key: PdFPU1, value: 4.0235, per_snippet_value: 24.141 }
  - { key: PdFPU2, value: 2.0042, per_snippet_value: 12.0252 }
  - { key: PdFPU3, value: 4.0077, per_snippet_value: 24.0462 }
  - { key: NumMicroOps, value: 18.0128, per_snippet_value: 108.077 }
error:           ''
info:            instruction is parallel, repeating a random one.
assembled_snippet
...
$ ./bin/llvm-exegesis --mode=uops --opcode-name=VMASKMOVPSYmr
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-f26721.o
---
mode:            uops
key:
  instructions:
    - 'VMASKMOVPSYmr RDI i_0x1 %noreg i_0x0 %noreg YMM5 YMM4'
    - 'VMASKMOVPSYmr RDI i_0x1 %noreg i_0x40 %noreg YMM2 YMM0'
    - 'VMASKMOVPSYmr RDI i_0x1 %noreg i_0x80 %noreg YMM15 YMM14'
    - 'VMASKMOVPSYmr RDI i_0x1 %noreg i_0xc0 %noreg YMM10 YMM13'
    - 'VMASKMOVPSYmr RDI i_0x1 %noreg i_0x100 %noreg YMM7 YMM15'
    - 'VMASKMOVPSYmr RDI i_0x1 %noreg i_0x140 %noreg YMM15 YMM5'
  config:          ''
  register_initial_values:
    - 'YMM5=0x0'
    - 'YMM4=0x0'
    - 'YMM2=0x0'
    - 'YMM0=0x0'
    - 'YMM15=0x0'
    - 'YMM14=0x0'
    - 'YMM10=0x0'
    - 'YMM13=0x0'
    - 'YMM7=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: PdFPU0, value: 15.9929, per_snippet_value: 95.9574 }
  - { key: PdFPU1, value: 8.089, per_snippet_value: 48.534 }
  - { key: PdFPU2, value: 2.0012, per_snippet_value: 12.0072 }
  - { key: PdFPU3, value: 8.0068, per_snippet_value: 48.0408 }
  - { key: NumMicroOps, value: 34.018, per_snippet_value: 204.108 }
error:           ''
info:            instruction is parallel, repeating a random one.
assembled_snippet: 4883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C5FE6F2C244883C4204883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C5FE6F24244883C4204883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C5FE6F14244883C4204883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C5FE6F04244883C4204883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C57E6F3C244883C4204883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C57E6F34244883C4204883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C57E6F14244883C4204883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C57E6F2C244883C4204883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C5FE6F3C244883C420C4E2552E27C4E26D2E4740C462052EB780000000C4622D2EAFC0000000C462452EBF00010000C4E2052EAF40010000C4E2552E27C4E26D2E4740C462052EB780000000C4622D2EAFC0000000C462452EBF00010000C4E2052EAF40010000C4E2552E27C4E26D2E4740C462052EB780000000C4622D2EAFC0000000C3
...
$ ./bin/llvm-exegesis --mode=uops --opcode-name=VMASKMOVPDYmr
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-e45324.o
---
mode:            uops
key:
  instructions:
    - 'VMASKMOVPDYmr RDI i_0x1 %noreg i_0x0 %noreg YMM15 YMM5'
    - 'VMASKMOVPDYmr RDI i_0x1 %noreg i_0x40 %noreg YMM9 YMM10'
    - 'VMASKMOVPDYmr RDI i_0x1 %noreg i_0x80 %noreg YMM10 YMM7'
    - 'VMASKMOVPDYmr RDI i_0x1 %noreg i_0xc0 %noreg YMM1 YMM8'
    - 'VMASKMOVPDYmr RDI i_0x1 %noreg i_0x100 %noreg YMM10 YMM10'
    - 'VMASKMOVPDYmr RDI i_0x1 %noreg i_0x140 %noreg YMM13 YMM9'
  config:          ''
  register_initial_values:
    - 'YMM15=0x0'
    - 'YMM5=0x0'
    - 'YMM9=0x0'
    - 'YMM10=0x0'
    - 'YMM7=0x0'
    - 'YMM1=0x0'
    - 'YMM8=0x0'
    - 'YMM13=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: PdFPU0, value: 16.0013, per_snippet_value: 96.0078 }
  - { key: PdFPU1, value: 8.0093, per_snippet_value: 48.0558 }
  - { key: PdFPU2, value: 2.0018, per_snippet_value: 12.0108 }
  - { key: PdFPU3, value: 8.0068, per_snippet_value: 48.0408 }
  - { key: NumMicroOps, value: 34.0168, per_snippet_value: 204.101 }
error:           ''
info:            instruction is parallel, repeating a random one.
assembled_snippet: 4883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C57E6F3C244883C4204883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C5FE6F2C244883C4204883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C57E6F0C244883C4204883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C57E6F14244883C4204883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C5FE6F3C244883C4204883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C5FE6F0C244883C4204883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C57E6F04244883C4204883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C57E6F2C244883C420C4E2052F2FC462352F5740C4E22D2FBF80000000C462752F87C0000000C4622D2F9700010000C462152F8F40010000C4E2052F2FC462352F5740C4E22D2FBF80000000C462752F87C0000000C4622D2F9700010000C462152F8F40010000C4E2052F2FC462352F5740C4E22D2FBF80000000C462752F87C0000000C3
...
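
A quick sanity check on the numbers above, written as a small Python sketch. It only re-uses the NumMicroOps values printed in the four runs; the names `NUM_MICRO_OPS`, `per_snippet_is_consistent`, and `rounded_uops` are made up for this example. It checks that each per_snippet_value is the per-instruction value times the six instructions in the snippet, and rounds the uOP counts so the PS and PD variants can be compared directly.

```python
# NumMicroOps (value, per_snippet_value) copied from the four runs above.
NUM_MICRO_OPS = {
    "VMASKMOVPSmr": (18.0142, 108.085),
    "VMASKMOVPDmr": (18.0128, 108.077),
    "VMASKMOVPSYmr": (34.018, 204.108),
    "VMASKMOVPDYmr": (34.0168, 204.101),
}

# Each benchmark key above lists six instructions per snippet.
INSTRUCTIONS_PER_SNIPPET = 6


def per_snippet_is_consistent(opcode, tolerance=1e-3):
    """True if per_snippet_value matches value * instructions per snippet."""
    value, per_snippet = NUM_MICRO_OPS[opcode]
    return abs(value * INSTRUCTIONS_PER_SNIPPET - per_snippet) < tolerance


def rounded_uops(opcode):
    """Per-instruction uOP count rounded to the nearest integer."""
    return round(NUM_MICRO_OPS[opcode][0])


if __name__ == "__main__":
    for opcode in NUM_MICRO_OPS:
        print(opcode, rounded_uops(opcode), per_snippet_is_consistent(opcode))
```

In these runs the rounded uOP count is the same for the PS and PD variants at a given vector width, even though PS has twice as many elements. Note that all mask registers were initialised to zero here, so this says nothing about non-zero masks.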
RKSimon accepted this revision. Sep 1 2019, 4:39 AM

LGTM - thanks @andreadb, I think this is the way to go. As ever, it's up to the people responsible for the other models to tweak as necessary; as you said, this is NFC for everything but btver2.

I don't see accurate numbers for these ops on Agner/instlatx64 for any target. I'm curious how they've checked the perf range for different mask register values (although Agner does mention that btver2 is often bad with VMASKMOVPS loads when mask == 0).

@lebedev.ri By the looks of it, llvm-exegesis always uses zero registers for those tests - does the result change if you hack in other values?

This revision is now accepted and ready to land. Sep 1 2019, 4:40 AM

I was about to post a similar comment.

It may be worth rerunning those experiments with a different mask value forced. Otherwise, we don't know for sure whether the zero mask is treated specially on bdver2.

More generally, it would be better if exegesis used an all-ones default for initial register values. That is what I tend to do when doing throughput analysis (actually, I tend to test both cases, i.e. the all-zero case and the all-ones case). On Jaguar, I know that no optimization is performed if registers are not set via a zero idiom. On entry to the benchmark loop, I set those registers to all-ones. For XMM/YMM registers, as you know, it is really straightforward (just use an all-ones (v)pcmpeq instead of a zero-idiom (v)xorps).
When reading counters, make sure that the initialization code is not counted too (to minimize the noise: all-ones idioms are executed, while all-zeroes idioms are eliminated). I don't know exegesis well enough, but I would advise changing the default values if possible. At the very least, give an option for testing all-ones...

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · Sep 2 2019, 5:31 AM