This is an archive of the discontinued LLVM Phabricator instance.

AMD BdVer2 (Piledriver) Initial Scheduler model
Closed, Public

Authored by lebedev.ri on Oct 2 2018, 6:37 AM.

Details

Summary

Overview

This is somewhat partial.

  • Latencies are good
    • All of these remaining inconsistencies appear to be noise/noisy/flaky.
  • NumMicroOps are somewhat good
    • Most of the remaining inconsistencies are from Ld / Ld_ReadAfterLd classes
  • Actual unit occupation (pipes, ResourceCycles) is undiscovered land; i did not really look there. They are basically a verbatim copy from btver2
  • Many InstRW. And there are still inconsistencies left...

To be noted:
I think this is the first new schedule profile produced with the new next-gen tools like llvm-exegesis!

Benchmark

I realize that isn't what was suggested, but i'll start with some "internal" public real-world benchmark i understand - RawSpeed raw image decoding library.
Diff (the exact clang from trunk without/with this patch):

Comparing /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench to /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
Benchmark                                                                                        Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_pvalue                             0.0000          0.0000      U Test, Repetitions: 25 vs 25
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_mean                              -0.0607         -0.0604           234           219           233           219
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_median                            -0.0630         -0.0626           233           219           233           219
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_stddev                            +0.2581         +0.2587             1             2             1             2
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_pvalue                             0.0000          0.0000      U Test, Repetitions: 25 vs 25
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_mean                              -0.0770         -0.0767           144           133           144           133
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_median                            -0.0767         -0.0763           144           133           144           133
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_stddev                            -0.4170         -0.4156             1             0             1             0
Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_pvalue                                          0.0000          0.0000      U Test, Repetitions: 25 vs 25
Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_mean                                           -0.0271         -0.0270           463           450           463           450
Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_median                                         -0.0093         -0.0093           453           449           453           449
Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_stddev                                         -0.7280         -0.7280            13             4            13             4
Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_pvalue                                          0.0004          0.0004      U Test, Repetitions: 25 vs 25
Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_mean                                           -0.0065         -0.0065           569           565           569           565
Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_median                                         -0.0077         -0.0077           569           564           569           564
Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_stddev                                         +1.0077         +1.0068             2             5             2             5
Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_pvalue                                          0.0220          0.0199      U Test, Repetitions: 25 vs 25
Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_mean                                           +0.0006         +0.0007           312           312           312           312
Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_median                                         +0.0031         +0.0032           311           312           311           312
Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_stddev                                         -0.7069         -0.7072             4             1             4             1
Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_pvalue                                          0.0004          0.0004      U Test, Repetitions: 25 vs 25
Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_mean                                           -0.0015         -0.0015           141           141           141           141
Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_median                                         -0.0010         -0.0011           141           141           141           141
Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_stddev                                         -0.1486         -0.1456             0             0             0             0
Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_pvalue                                          0.6139          0.8766      U Test, Repetitions: 25 vs 25
Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_mean                                           -0.0008         -0.0005            60            60            60            60
Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_median                                         -0.0006         -0.0002            60            60            60            60
Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_stddev                                         -0.1467         -0.1390             0             0             0             0
Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_pvalue                                          0.0137          0.0137      U Test, Repetitions: 25 vs 25
Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_mean                                           +0.0002         +0.0002           275           275           275           275
Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_median                                         -0.0015         -0.0014           275           275           275           275
Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_stddev                                         +3.3687         +3.3587             0             2             0             2
Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_pvalue                                     0.4041          0.3933      U Test, Repetitions: 25 vs 25
Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_mean                                      +0.0004         +0.0004            67            67            67            67
Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_median                                    -0.0000         -0.0000            67            67            67            67
Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_stddev                                    +0.1947         +0.1995             0             0             0             0
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_pvalue                              0.0074          0.0001      U Test, Repetitions: 25 vs 25
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_mean                               -0.0092         +0.0074           547           542            25            25
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_median                             -0.0054         +0.0115           544           541            25            25
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_stddev                             -0.4086         -0.3486             8             5             0             0
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_pvalue                                        0.3320          0.0000      U Test, Repetitions: 25 vs 25
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_mean                                         +0.0015         +0.0204           218           218            12            12
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_median                                       +0.0001         +0.0203           218           218            12            12
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_stddev                                       +0.2259         +0.2023             1             1             0             0
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_pvalue                                      0.0000          0.0001      U Test, Repetitions: 25 vs 25
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_mean                                       -0.0209         -0.0179            96            94            90            88
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_median                                     -0.0182         -0.0155            95            93            90            88
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_stddev                                     -0.6164         -0.2703             2             1             2             1
Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_pvalue                                     0.0000          0.0000      U Test, Repetitions: 25 vs 25
Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_mean                                      -0.0098         -0.0098           176           175           176           175
Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_median                                    -0.0126         -0.0126           176           174           176           174
Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_stddev                                    +6.9789         +6.9157             0             2             0             2
Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_pvalue                 0.0000          0.0000      U Test, Repetitions: 25 vs 25
Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_mean                  -0.0237         -0.0238           474           463           474           463
Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_median                -0.0267         -0.0267           473           461           473           461
Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_stddev                +0.7179         +0.7178             3             5             3             5
Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_pvalue                   0.6837          0.6554      U Test, Repetitions: 25 vs 25
Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_mean                    -0.0014         -0.0013          1375          1373          1375          1373
Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_median                  +0.0018         +0.0019          1371          1374          1371          1374
Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_stddev                  -0.7457         -0.7382            11             3            10             3
Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_pvalue                                        0.0000          0.0000      U Test, Repetitions: 25 vs 25
Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_mean                                         -0.0080         -0.0289            22            22            10            10
Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_median                                       -0.0070         -0.0287            22            22            10            10
Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_stddev                                       +1.0977         +0.6614             0             0             0             0
Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_pvalue                                       0.0000          0.0000      U Test, Repetitions: 25 vs 25
Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_mean                                        +0.0132         +0.0967            35            36            10            11
Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_median                                      +0.0132         +0.0956            35            36            10            11
Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_stddev                                      -0.0407         -0.1695             0             0             0             0
Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_pvalue                                      0.0000          0.0000      U Test, Repetitions: 25 vs 25
Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_mean                                       +0.0331         +0.1307            13            13             6             6
Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_median                                     +0.0430         +0.1373            12            13             6             6
Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_stddev                                     -0.9006         -0.8847             1             0             0             0
Pentax/645Z/IMGP2837.PEF/threads:8/real_time_pvalue                                            0.0016          0.0010      U Test, Repetitions: 25 vs 25
Pentax/645Z/IMGP2837.PEF/threads:8/real_time_mean                                             -0.0023         -0.0024           395           394           395           394
Pentax/645Z/IMGP2837.PEF/threads:8/real_time_median                                           -0.0029         -0.0030           395           394           395           393
Pentax/645Z/IMGP2837.PEF/threads:8/real_time_stddev                                           -0.0275         -0.0375             1             1             1             1
Phase One/P65/CF027310.IIQ/threads:8/real_time_pvalue                                          0.0232          0.0000      U Test, Repetitions: 25 vs 25
Phase One/P65/CF027310.IIQ/threads:8/real_time_mean                                           -0.0047         +0.0039           114           113            28            28
Phase One/P65/CF027310.IIQ/threads:8/real_time_median                                         -0.0050         +0.0037           114           113            28            28
Phase One/P65/CF027310.IIQ/threads:8/real_time_stddev                                         -0.0599         -0.2683             1             1             0             0
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_pvalue                          0.0000          0.0000      U Test, Repetitions: 25 vs 25
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_mean                           +0.0206         +0.0207           405           414           405           414
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_median                         +0.0204         +0.0205           405           414           405           414
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_stddev                         +0.2155         +0.2212             1             1             1             1
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_pvalue                         0.0000          0.0000      U Test, Repetitions: 25 vs 25
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_mean                          -0.0109         -0.0108           147           145           147           145
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_median                        -0.0104         -0.0103           147           145           147           145
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_stddev                        -0.4919         -0.4800             0             0             0             0
Samsung/NX3000/_3184416.SRW/threads:8/real_time_pvalue                                         0.0000          0.0000      U Test, Repetitions: 25 vs 25
Samsung/NX3000/_3184416.SRW/threads:8/real_time_mean                                          -0.0149         -0.0147           220           217           220           217
Samsung/NX3000/_3184416.SRW/threads:8/real_time_median                                        -0.0173         -0.0169           221           217           220           217
Samsung/NX3000/_3184416.SRW/threads:8/real_time_stddev                                        +1.0337         +1.0341             1             3             1             3
Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_pvalue                                         0.0001          0.0001      U Test, Repetitions: 25 vs 25
Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_mean                                          -0.0019         -0.0019           194           193           194           193
Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_median                                        -0.0021         -0.0021           194           193           194           193
Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_stddev                                        -0.4441         -0.4282             0             0             0             0
Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_pvalue                                0.0000          0.4263      U Test, Repetitions: 25 vs 25
Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_mean                                 +0.0258         -0.0006            81            83            19            19
Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_median                               +0.0235         -0.0011            81            82            19            19
Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_stddev                               +0.1634         +0.1070             1             1             0             0


If we look at the _mean rows, the Time column, the biggest win is -7.7% (Canon/EOS 5D Mark II/10.canon.sraw2.cr2),
and the biggest loss is +3.3% (Panasonic/DC-GH5S/P1022085.RW2);
Overall: mean -0.7436%, median -0.23%, cbrt(sum(time^3)) = -8.73%
Looks good so far i'd say.
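
The summary figures can be recomputed from the Time column of the _mean rows in the table. A minimal Python sketch, assuming that cbrt(sum(time^3)) means the signed cube root of the sum of the cubed per-file percentage changes (that reading reproduces the quoted -8.73%); the variable names are illustrative only:

```python
import math
import statistics

# Relative change in the Time column for each *_mean row, in table order.
mean_time_deltas = [
    -0.0607, -0.0770, -0.0271, -0.0065, +0.0006,
    -0.0015, -0.0008, +0.0002, +0.0004, -0.0092,
    +0.0015, -0.0209, -0.0098, -0.0237, -0.0014,
    -0.0080, +0.0132, +0.0331, -0.0023, -0.0047,
    +0.0206, -0.0109, -0.0149, -0.0019, +0.0258,
]

# Work in percent, as the summary does.
percent = [100.0 * d for d in mean_time_deltas]

best_pct = min(percent)    # biggest win (most negative change)
worst_pct = max(percent)   # biggest loss (most positive change)
mean_pct = statistics.mean(percent)
median_pct = statistics.median(percent)

# Signed cube root of the sum of cubes: large changes dominate, sign is kept.
cube_sum = sum(p ** 3 for p in percent)
cbrt_pct = math.copysign(abs(cube_sum) ** (1.0 / 3.0), cube_sum)

print(f"best {best_pct:.1f}%, worst {worst_pct:.1f}%")
print(f"mean {mean_pct:.4f}%, median {median_pct:.2f}%, cbrt {cbrt_pct:.2f}%")
```

The cubed aggregate weights the few large changes far more heavily than the many near-zero ones, which is why it sits close to the single biggest win.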

llvm-exegesis details:


Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision. Oct 2 2018, 6:37 AM
lebedev.ri updated this revision to Diff 167937. Oct 2 2018, 6:51 AM
lebedev.ri edited the summary of this revision.

I fat-fingered it, re-uploading so the actual description is in place.

This is somewhat partial.
Overview:

  • Latencies are reasonably-good
    • There are still some inconsistencies.
    • *Many* of the inconsistencies are noise
    • fp measurements are flaky
    • Non-fp measurements are somewhat flaky too
  • NumMicroOps are somewhat good
    • Most of the remaining inconsistencies are from Ld / Ld_ReadAfterLd classes
  • Actual unit occupation (pipes, ResourceCycles) is undiscovered land; i did not really look there. They are basically a verbatim copy from btver2
  • Many InstRW. And there are still inconsistencies left...

What's not here:

  • llvm-mca test coverage. I understand how to not just add the coverage here, but actually show a diff, but i did not look there yet.
  • benchmarks. I'd say it's too soon for any measurements.

To be noted:
I think this is the first new schedule profile produced with the new next-gen tools like llvm-exegesis!

llvm-exegesis details:

This is awesome !

  • *Many* of the inconsistencies are noise
  • fp measurements are flaky
  • Non-fp measurements are somewhat flaky too

Part of the flakiness can be explained by zero idioms: since llvm-exegesis explores register allocation randomly, it will hit some zero idioms by chance (and you've hit it e.g. for SUB32rr). Analysis still does not handle variant classes (PR38884), and we need better highlighting of instances where mcinst predicates were true. I'm currently working on this.

I'll have a look at the inconsistencies to see if I can see anything else that might be an analysis issue.
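
To make the zero-idiom effect concrete: a snippet whose two source registers happen to coincide may take a special fast path, so its measurement should not be lumped together with the others. A rough Python sketch that flags such snippets, assuming the instruction key lists the opcode followed by destination, tied source and second source. The operand layout, the opcode set and the helper name are assumptions made for illustration; this is not llvm-exegesis code:

```python
# Opcodes treated as potential zero idioms; an illustrative, incomplete set.
ZERO_IDIOM_OPCODES = {"SUB32rr", "SUB64rr", "XOR32rr", "XOR64rr"}

def is_zero_idiom_candidate(instruction_key: str) -> bool:
    """Return True if a known opcode reads the same register for both sources."""
    parts = instruction_key.split()
    if len(parts) != 4 or parts[0] not in ZERO_IDIOM_OPCODES:
        return False
    # Assumed layout: opcode, destination, tied source, second source.
    _opcode, _dst, src1, src2 = parts
    return src1 == src2

print(is_zero_idiom_candidate("SUB32rr EAX EAX EAX"))  # same source twice
print(is_zero_idiom_candidate("SUB32rr EAX EAX EBX"))  # distinct sources
```

Measurements flagged this way could be reported separately instead of being averaged into the same scheduling class.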

This is awesome !

Thank you!

  • *Many* of the inconsistencies are noise
  • fp measurements are flaky
  • Non-fp measurements are somewhat flaky too

Part of the flakiness can be explained by zero idioms: since llvm-exegesis explores register allocation randomly, it will hit some zero idioms by chance (and you've hit it e.g. for SUB32rr). Analysis still does not handle variant classes (PR38884), and we need better highlighting of instances where mcinst predicates were true. I'm currently working on this.

I'll have a look at the inconsistencies to see if I can see anything else that might be an analysis issue.

Part of, sure.
I do some measurement 3x 10'000 repetitions, and get 3cycles latency all three times, then repeat that and get 3cycles one time and 4cycles two times.
Doing some measurement 10x 10'000 times vs 1x 1'000'000 times sometimes produces different results (sometimes, very different), too.
For fp, as discussed previously elsewhere, it is caused by nan/inf/subnormals/etc.
But i get the same flakiness for non-floats, too, so they may be somewhat affected by the same problem.

craig.topper added inline comments. Oct 2 2018, 5:33 PM
lib/Target/X86/X86.td
1039

Should we apply this to bdver3/4 as well? Are they similar?

lebedev.ri added inline comments. Oct 2 2018, 11:43 PM
lib/Target/X86/X86.td
1039

Hmm. It would be ok to apply this to bdver1.

But i'm not sure about steamroller/excavator.
Those are more different, have loop buffer, 3 FPU pipes instead of 4, etc.

This is awesome !

Thank you!

  • *Many* of the inconsistencies are noise
  • fp measurements are flaky
  • Non-fp measurements are somewhat flaky too

Part of the flakiness can be explained by zero idioms: since llvm-exegesis explores register allocation randomly, it will hit some zero idioms by chance (and you've hit it e.g. for SUB32rr). Analysis still does not handle variant classes (PR38884), and we need better highlighting of instances where mcinst predicates were true. I'm currently working on this.

I'll have a look at the inconsistencies to see if I can see anything else that might be an analysis issue.

Part of, sure.
I do some measurement 3x 10'000 repetitions, and get 3cycles latency all three times, then repeat that and get 3cycles one time and 4cycles two times.

You'll be getting different register allocations each time. This will impact two things: some instructions might have special paths for some combinations of registers (we all know about xor eax, eax, but there are more surprising ones; see the original llvm-exegesis RFC for an example).
gchatelet@ is working on autodetecting these by exploring the allocation space (both registers to operands and values to registers and immediates).

Doing some measurement 10x 10'000 times vs 1x 1'000'000 times sometimes produces different results (sometimes, very different), too.

Do you have the mnemonics for these ? I would expect this to happen for instructions whose latency depends on the values in the registers, where repeating execution leads to changing the value, and therefore the latency. This happens for e.g. SQRT or FMUL.

For fp, as discussed previously elsewhere, it is caused by nan/inf/subnormals/etc.
But i get the same flakiness for non-floats, too, so they may be somewhat affected by the same problem.

lebedev.ri added a comment. Edited Oct 3 2018, 12:14 AM
$ ./bin/llvm-exegesis -num-repetitions=10000 -mode=latency -opcode-name=BEXTRI64ri
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-96c4a7.o
---
mode:            latency
key:             
  instructions:    
    - 'BEXTRI64ri R10 R10 i_0x1x'
  config:          ''
  register_initial_values: 
    - 'R10=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:    
  - { key: latency, value: 2.0366, per_snippet_value: 2.0366 }
error:           ''
info:            explicit self cycles, selecting one aliasing Conf.
assembled_snippet: 49BA00000000000000008F4AF810D2010000008F4AF810D2010000008F4AF810D2010000008F4AF810D2010000008F4AF810D2010000008F4AF810D2010000008F4AF810D2010000008F4AF810D2010000008F4AF810D2010000008F4AF810D2010000008F4AF810D2010000008F4AF810D2010000008F4AF810D2010000008F4AF810D2010000008F4AF810D2010000008F4AF810D201000000C3
...

$ ./bin/llvm-exegesis -num-repetitions=1000000 -mode=latency -opcode-name=BEXTRI64ri
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-77bfda.o
---
mode:            latency
key:             
  instructions:    
    - 'BEXTRI64ri R13 R13 i_0x1x'
  config:          ''
  register_initial_values: 
    - 'R13=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:    
  - { key: latency, value: 6.87283, per_snippet_value: 6.87283 }
error:           ''
info:            explicit self cycles, selecting one aliasing Conf.
assembled_snippet: 415549BD00000000000000008F4AF810ED010000008F4AF810ED010000008F4AF810ED010000008F4AF810ED010000008F4AF810ED010000008F4AF810ED010000008F4AF810ED010000008F4AF810ED010000008F4AF810ED010000008F4AF810ED010000008F4AF810ED010000008F4AF810ED010000008F4AF810ED010000008F4AF810ED010000008F4AF810ED010000008F4AF810ED01000000415DC3
...

Interesting. The code is so large that maybe you're seeing i-cache issues ? In that case you should see a constant value, then a sudden drop after a certain value of -num-repetitions.
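
One way to probe that hypothesis is to relate each measured latency to the size of the generated code. A minimal Python sketch, assuming the snippet is unrolled -num-repetitions times (as "the code is so large" implies) and that one BEXTRI64ri occupies the 9 bytes visible in the assembled_snippet hex above; the helper names are hypothetical:

```python
import re

# Encoding of one BEXTRI64ri, copied from the assembled_snippet hex (9 bytes).
BEXTRI_ENCODING_HEX = "8F4AF810D201000000"

def parse_latency(report: str) -> float:
    """Pull the latency value out of an llvm-exegesis report."""
    match = re.search(r"key:\s*latency,\s*value:\s*([0-9]+(?:\.[0-9]+)?)", report)
    if match is None:
        raise ValueError("no latency measurement in report")
    return float(match.group(1))

def unrolled_code_bytes(num_repetitions: int,
                        encoding_hex: str = BEXTRI_ENCODING_HEX) -> int:
    """Approximate code size if the instruction is repeated num_repetitions times."""
    return num_repetitions * (len(encoding_hex) // 2)

line = "  - { key: latency, value: 2.0366, per_snippet_value: 2.0366 }"
print(parse_latency(line))
for reps in (10_000, 1_000_000):
    print(reps, unrolled_code_bytes(reps))
```

Running the command shown above for a range of -num-repetitions values and plotting parse_latency against unrolled_code_bytes would show whether the latency changes abruptly once the code outgrows a cache level.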

RKSimon added inline comments. Oct 3 2018, 1:20 AM
lib/Target/X86/X86.td
1039

I'd expect it to help all bdver targets - but as @lebedev.ri said the arch is slightly different on bdver3/bdver4 - for instance the exegesis pfms for the pipe resources won't work correctly (and may crash).

We have the same problem with the generic x86_64 cpu which expects SNB pfms - @courbet how tricky would it be to attach the pfm counters to a set of CPUs instead of the scheduler model? This might even help creation of new models if we can get pfm (cycles/uops) data from cpus without a model.

courbet added inline comments. Oct 3 2018, 2:09 AM
lib/Target/X86/X86.td
1039

how tricky would it be to attach the pfm counters to a set of CPUs instead of the scheduler model

The main reason why I added pfm counters to the sched model is that we want analysis to be able to match counters to resource names.
e.g. PfmIssueCounter<SBPort0, ["uops_dispatched_port:port_0"]>; binds the counter to SBPort0, and the TD resolver automatically checks that SBPort0 exists in the correct sched model.

What we could do is:

  • move the PfmCountersInfo out of MCSubtargetInfo->MCSchedModel->MCExtraProcessorInfo and put it in MCSubtargetInfo directly.
  • make SchedModel optional and take the resource as a string in PfmIssueCounter ( PfmIssueCounter<"SBPort0", ["uops_dispatched_port:port_0"]>;), and do the resolving/checking in the SubtargetEmitter only if the subtarget has a sched model.

Sounds reasonable ?
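
The second option can be pictured as a per-CPU table of counters keyed by resource name, with the name check performed only when a sched model exists. A small Python sketch of that idea; the table layout, the CPU keys and the function name are hypothetical and only illustrate the proposal, they are not how TableGen or the SubtargetEmitter represent it:

```python
from typing import Dict, List, Set

# Hypothetical table: CPU name -> {resource name: [pfm counter names]}.
PFM_ISSUE_COUNTERS: Dict[str, Dict[str, List[str]]] = {
    "sandybridge": {"SBPort0": ["uops_dispatched_port:port_0"]},
}

# Hypothetical table: CPU name -> resource names declared by its sched model.
# A CPU without a sched model simply has no entry here.
SCHED_MODEL_RESOURCES: Dict[str, Set[str]] = {
    "sandybridge": {"SBPort0", "SBPort1", "SBPort5"},
}

def unresolved_resources(cpu: str) -> List[str]:
    """Return counter resource names that the CPU's sched model does not declare."""
    counters = PFM_ISSUE_COUNTERS.get(cpu, {})
    resources = SCHED_MODEL_RESOURCES.get(cpu)
    if resources is None:
        return []  # no sched model: nothing to resolve against
    return sorted(name for name in counters if name not in resources)

print(unresolved_resources("sandybridge"))
```

With this shape a CPU can carry counters without a model, and several CPUs can share one model while keeping their own counter tables.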

Interesting. The code is so large that maybe you're seeing i-cache issues ? In that case you should see a constant value, then a sudden drop after a certain value of -num-repetitions.

Got some numbers for you to play with.
Rough overview: https://docs.google.com/spreadsheets/d/1hmznTDLXGFeETc3yGtU9UGGyAHgBodLORr9sU4uaRis/edit?usp=sharing
More precise measurements for 1000...100000 range:


Fit:

RKSimon added inline comments. Oct 3 2018, 9:55 AM
lib/Target/X86/X86.td
1039

Yes, creating a mapping index from cpu to a pfm table list should work - it would allow cpus to share models and have their own counters to map to the models resources (some may be partial) - PR37068 could probably be dealt with at the same time.

lebedev.ri updated this revision to Diff 168270. Oct 4 2018, 5:07 AM
lebedev.ri edited the summary of this revision.
  • Also use this model for bdver1 (Bulldozer).
  • One more latency pass
    • I think all the remaining latency inconsistencies are noise.

Hi Roman,

Thanks for contributing this model! I really enjoyed reading your patch.

I only have a few minor nits (mostly style comments). But, overall, from my point of view, it looks very good!

Thanks
-Andrea

lib/Target/X86/X86ScheduleBdVer2.td
299–301

Latency is set to 1 by default.
Is it because you wanted to emphasize that aspect?
Otherwise, I think you can just write:

def : WriteRes<WriteNop, [PdEX01]>;

322–324

Same.

337

Why the question marks?

383

def : WriteRes<WriteSETCC, [PdEX01]>;

552

I am a bit confused. Why do you need this comment? WriteFLoadZ is not even a thing...
I noticed that you do the same in various other places (for other non-existent writes).
I suggest to remove these comments.

1068

Do you plan to add these too?
I noticed that you have marked those as dep-breaking. However, if I read correctly, those still map to WriteFLogicY, which declares 2 resource cycles. Presumably these zero-idioms should only consume 1 resource cycle (to execute the zero-move to the upper half of YMM).

lebedev.ri updated this revision to Diff 168361. Oct 4 2018, 1:31 PM
lebedev.ri marked 4 inline comments as done.

Slight cleanup as per @andreadb review notes.

lib/Target/X86/X86ScheduleBdVer2.td
299–301

No, not really intentionally.

337

Because i wasn't able to measure this, and up until now, i didn't trace it back to the definition.
Now i see it's BMI2 instruction, so i'll mark this as unsupported.

Also, it would be really good to have some automatic check (a warning) to be notified of cases like this.
I do realize it won't really work for default generic sched models used for many different CPUs.

552

These are remnants from when i was looking at which sched classes were somehow not listed here; dropped.

1068

Resource cycles is something i haven't touched at all.. Like completely at all.
I do not even really understand how they are calculated.

I'm not fully sure what/how this should be, so i'd leave this as-is for now..

andreadb added inline comments. Oct 5 2018, 4:07 AM
lib/Target/X86/X86ScheduleBdVer2.td
56–57

This may not work as you expect.
The document here: https://www.realworldtech.com/bulldozer/8/
suggests that the L1D cannot sustain more than one store per cycle.

It is true that the two AGEN units are identical.
However, (correct me if I am wrong) I don't think that you can issue two stores per cycle.

Also, it would be interesting to see if you can actually issue two independent loads per cycle. As the 'realworldtech' document suggests, the extra load port in the L1D is probably used to avoid that the load bandwidth is halved when executing AVX 256-bit loads (which are 2 COPs).

Ideally, you should check if two independent loads can be issued in the same cycle (i.e. not just the LO/HI parts of the same AVX 256b load).
Also, it doesn't look like the L1D has two ports for store operations.

The easiest way to work around this issue is to define separate units for the load/store AGU, and let "writes" in tablegen select which AGEN they effectively consume.

131–144

To answer your FIXME: I don't think it hurts to have those definitions.
You are essentially limiting the number of store operations to 24 (which is probably what you wanted to achieve here?).

A load would consume PdLoad buffer entries, and it would also consume entries in the PdEX unified scheduler.
Also, a load would be issued to PdAGLU01 (i.e. one of the two AGEN units).

A store behaves pretty much the same. The only difference is the size of the PdStore buffer (which matches the store queue).
It would consume one of the two AGU pipelines; that means you allow the execution of two stores per cycle.

1068

That comment is unexpected from a person that just wrote an entire scheduling model... How were you able to write all of this without knowing what "resource cycles" actually means? :-)

Anyway.... File TargetSchedule.td has a nice description of resource cycles. It is used to model the consumption of resources.

For a zero idiom XOR to consume the same resource cycles as a normal (i.e. non zero-idiom) XOR is really strange.
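To make the concern concrete, here is a minimal Python sketch (not how llvm-mca is implemented) of how resource cycles bound throughput: an instruction that holds a resource group for C cycles, on a group with N identical units, cannot be issued faster than once every C/N cycles. The unit count of 2 below is a made-up value for illustration and is not taken from the bdver2 model; the 2-versus-1 resource cycles come from the WriteFLogicY discussion above. The function name is hypothetical.

```python
from fractions import Fraction


def reciprocal_throughput(resource_cycles: int, num_units: int) -> Fraction:
    """Lower bound on cycles per instruction imposed by one resource group."""
    if resource_cycles < 0 or num_units <= 0:
        raise ValueError("need resource_cycles >= 0 and num_units > 0")
    return Fraction(resource_cycles, num_units)


# Hypothetical group with 2 identical pipes (assumption, for illustration only).
HYPOTHETICAL_UNITS = 2

# Zero idiom charged like a normal 256-bit logic op: 2 resource cycles.
as_modelled = reciprocal_throughput(2, HYPOTHETICAL_UNITS)
# Zero idiom charged only for the zero-move to the upper half: 1 resource cycle.
as_suggested = reciprocal_throughput(1, HYPOTHETICAL_UNITS)

print(as_modelled, as_suggested)
```

Under these assumptions, charging the zero idiom 1 resource cycle instead of 2 halves the throughput cost the model attributes to it.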

lebedev.ri added inline comments.Oct 6 2018, 3:30 AM
lib/Target/X86/X86ScheduleBdVer2.td
56–57

Hmm, good point.
I agree regarding one store per cycle.
Some quotes:

  • https://www.agner.org/optimize/microarchitecture.pdf
    • Memory read instructions use AGLU0 and AGLU1.
    • Most execution units are doubled, as table 19.2 and 19.3 show, so that the throughput is two 128-bit operations or one 256-bit operation per clock cycle. <...> The store unit is not doubled, and 256-bit stores always take more than one clock cycle.
    • The data cache has two 128-bit ports which can be used for either read or write. This means that it can do two reads or one read and one write in the same clock cycle.
    • The measured throughput is two reads or one read and one write per clock cycle when only one thread is active.
  • https://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf
    • The AMD Family 15h processor contains a 16-Kbyte, 4-way predicted L1 data cache with two 128-bit ports. This is a write-through cache that supports up to two 128 Byte loads per cycle.
      • Only one load can be performed from a given bank of the L1 cache in a single cycle.
    • There is an FPU load-store unit which supports up to two 128-bit loads and one 128-bit store per cycle.
    • The LS unit supports two 128-bit loads/cycle and one 128-bit store/cycle.

So either one store, or one store and one load, or two loads, at least that is how i read it.
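Read that way, the constraint can be written down as a small counting model. The sketch below only illustrates the quotes above and is not taken from the patch: `min_issue_cycles` and its `store_ports` parameter are hypothetical names, and it ignores bank conflicts, 256-bit ops split into two halves, and everything else in the pipeline.

```python
from math import ceil

NUM_AGUS = 2  # AGLU0 and AGLU1, usable by either loads or stores


def min_issue_cycles(loads: int, stores: int, *, store_ports: int) -> int:
    """Fewest cycles in which the given memory ops could all be issued.

    store_ports=2 mimics a model where any AGU can take a store;
    store_ports=1 mimics "at most one store per cycle".
    """
    if loads < 0 or stores < 0 or store_ports <= 0:
        raise ValueError("counts must be non-negative, store_ports positive")
    agu_bound = ceil((loads + stores) / NUM_AGUS)  # at most 2 memory ops/cycle
    store_bound = ceil(stores / store_ports)       # stores are capped separately
    return max(agu_bound, store_bound)


for loads, stores in [(8, 0), (4, 4), (0, 8)]:
    shared = min_issue_cycles(loads, stores, store_ports=2)
    split = min_issue_cycles(loads, stores, store_ports=1)
    print(f"loads={loads} stores={stores}: shared AGU={shared}, one store port={split}")
```

The two variants agree for pure loads and for an even load/store mix; they differ only for store-heavy sequences, which is where a model with a single shared AGU pool would be too optimistic.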

131–144

You are essentially limiting the number of store operations to 24
(which is probably what you wanted to achieve here?).

Yes.

it means, you allow the execution of two stores per cycle.

Yeah, that is a bug.

lebedev.ri edited the summary of this revision.

A bit more cleanup for latencies, and for nummicroops (particularly for conversion ops)

lebedev.ri updated this revision to Diff 168591.Oct 7 2018, 7:56 AM
lebedev.ri edited the summary of this revision.

Some more cleaning, of no particular note.

Adding Ganesh at AMD who might have some insights.

I realize that isn't what was suggested, but i'll start with some "internal" public real-world benchmark i understand - RawSpeed raw image decoding library.
Diff (the exact clang from trunk without/with this patch):

Comparing /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench to /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
Benchmark                                                                                        Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_pvalue                             0.0000          0.0000      U Test, Repetitions: 25 vs 25
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_mean                              -0.0607         -0.0604           234           219           233           219
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_median                            -0.0630         -0.0626           233           219           233           219
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/real_time_stddev                            +0.2581         +0.2587             1             2             1             2
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_pvalue                             0.0000          0.0000      U Test, Repetitions: 25 vs 25
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_mean                              -0.0770         -0.0767           144           133           144           133
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_median                            -0.0767         -0.0763           144           133           144           133
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/real_time_stddev                            -0.4170         -0.4156             1             0             1             0
Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_pvalue                                          0.0000          0.0000      U Test, Repetitions: 25 vs 25
Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_mean                                           -0.0271         -0.0270           463           450           463           450
Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_median                                         -0.0093         -0.0093           453           449           453           449
Canon/EOS 5DS/2K4A9927.CR2/threads:8/real_time_stddev                                         -0.7280         -0.7280            13             4            13             4
Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_pvalue                                          0.0004          0.0004      U Test, Repetitions: 25 vs 25
Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_mean                                           -0.0065         -0.0065           569           565           569           565
Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_median                                         -0.0077         -0.0077           569           564           569           564
Canon/EOS 5DS/2K4A9928.CR2/threads:8/real_time_stddev                                         +1.0077         +1.0068             2             5             2             5
Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_pvalue                                          0.0220          0.0199      U Test, Repetitions: 25 vs 25
Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_mean                                           +0.0006         +0.0007           312           312           312           312
Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_median                                         +0.0031         +0.0032           311           312           311           312
Canon/EOS 5DS/2K4A9929.CR2/threads:8/real_time_stddev                                         -0.7069         -0.7072             4             1             4             1
Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_pvalue                                          0.0004          0.0004      U Test, Repetitions: 25 vs 25
Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_mean                                           -0.0015         -0.0015           141           141           141           141
Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_median                                         -0.0010         -0.0011           141           141           141           141
Canon/EOS 10D/CRW_7673.CRW/threads:8/real_time_stddev                                         -0.1486         -0.1456             0             0             0             0
Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_pvalue                                          0.6139          0.8766      U Test, Repetitions: 25 vs 25
Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_mean                                           -0.0008         -0.0005            60            60            60            60
Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_median                                         -0.0006         -0.0002            60            60            60            60
Canon/EOS 40D/_MG_0154.CR2/threads:8/real_time_stddev                                         -0.1467         -0.1390             0             0             0             0
Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_pvalue                                          0.0137          0.0137      U Test, Repetitions: 25 vs 25
Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_mean                                           +0.0002         +0.0002           275           275           275           275
Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_median                                         -0.0015         -0.0014           275           275           275           275
Canon/EOS 77D/IMG_4049.CR2/threads:8/real_time_stddev                                         +3.3687         +3.3587             0             2             0             2
Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_pvalue                                     0.4041          0.3933      U Test, Repetitions: 25 vs 25
Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_mean                                      +0.0004         +0.0004            67            67            67            67
Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_median                                    -0.0000         -0.0000            67            67            67            67
Canon/PowerShot G1/crw_1693.crw/threads:8/real_time_stddev                                    +0.1947         +0.1995             0             0             0             0
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_pvalue                              0.0074          0.0001      U Test, Repetitions: 25 vs 25
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_mean                               -0.0092         +0.0074           547           542            25            25
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_median                             -0.0054         +0.0115           544           541            25            25
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/real_time_stddev                             -0.4086         -0.3486             8             5             0             0
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_pvalue                                        0.3320          0.0000      U Test, Repetitions: 25 vs 25
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_mean                                         +0.0015         +0.0204           218           218            12            12
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_median                                       +0.0001         +0.0203           218           218            12            12
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/real_time_stddev                                       +0.2259         +0.2023             1             1             0             0
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_pvalue                                      0.0000          0.0001      U Test, Repetitions: 25 vs 25
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_mean                                       -0.0209         -0.0179            96            94            90            88
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_median                                     -0.0182         -0.0155            95            93            90            88
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/real_time_stddev                                     -0.6164         -0.2703             2             1             2             1
Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_pvalue                                     0.0000          0.0000      U Test, Repetitions: 25 vs 25
Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_mean                                      -0.0098         -0.0098           176           175           176           175
Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_median                                    -0.0126         -0.0126           176           174           176           174
Kodak/DCS Pro 14nx/D7465857.DCR/threads:8/real_time_stddev                                    +6.9789         +6.9157             0             2             0             2
Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_pvalue                 0.0000          0.0000      U Test, Repetitions: 25 vs 25
Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_mean                  -0.0237         -0.0238           474           463           474           463
Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_median                -0.0267         -0.0267           473           461           473           461
Nikon/D850/Nikon-D850-14bit-lossless-compressed.NEF/threads:8/real_time_stddev                +0.7179         +0.7178             3             5             3             5
Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_pvalue                   0.6837          0.6554      U Test, Repetitions: 25 vs 25
Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_mean                    -0.0014         -0.0013          1375          1373          1375          1373
Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_median                  +0.0018         +0.0019          1371          1374          1371          1374
Olympus/E-M1MarkII/Olympus_EM1mk2__HIRES_50MP.ORF/threads:8/real_time_stddev                  -0.7457         -0.7382            11             3            10             3
Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_pvalue                                        0.0000          0.0000      U Test, Repetitions: 25 vs 25
Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_mean                                         -0.0080         -0.0289            22            22            10            10
Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_median                                       -0.0070         -0.0287            22            22            10            10
Panasonic/DC-G9/P1000476.RW2/threads:8/real_time_stddev                                       +1.0977         +0.6614             0             0             0             0
Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_pvalue                                       0.0000          0.0000      U Test, Repetitions: 25 vs 25
Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_mean                                        +0.0132         +0.0967            35            36            10            11
Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_median                                      +0.0132         +0.0956            35            36            10            11
Panasonic/DC-GH5/_T012014.RW2/threads:8/real_time_stddev                                      -0.0407         -0.1695             0             0             0             0
Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_pvalue                                      0.0000          0.0000      U Test, Repetitions: 25 vs 25
Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_mean                                       +0.0331         +0.1307            13            13             6             6
Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_median                                     +0.0430         +0.1373            12            13             6             6
Panasonic/DC-GH5S/P1022085.RW2/threads:8/real_time_stddev                                     -0.9006         -0.8847             1             0             0             0
Pentax/645Z/IMGP2837.PEF/threads:8/real_time_pvalue                                            0.0016          0.0010      U Test, Repetitions: 25 vs 25
Pentax/645Z/IMGP2837.PEF/threads:8/real_time_mean                                             -0.0023         -0.0024           395           394           395           394
Pentax/645Z/IMGP2837.PEF/threads:8/real_time_median                                           -0.0029         -0.0030           395           394           395           393
Pentax/645Z/IMGP2837.PEF/threads:8/real_time_stddev                                           -0.0275         -0.0375             1             1             1             1
Phase One/P65/CF027310.IIQ/threads:8/real_time_pvalue                                          0.0232          0.0000      U Test, Repetitions: 25 vs 25
Phase One/P65/CF027310.IIQ/threads:8/real_time_mean                                           -0.0047         +0.0039           114           113            28            28
Phase One/P65/CF027310.IIQ/threads:8/real_time_median                                         -0.0050         +0.0037           114           113            28            28
Phase One/P65/CF027310.IIQ/threads:8/real_time_stddev                                         -0.0599         -0.2683             1             1             0             0
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_pvalue                          0.0000          0.0000      U Test, Repetitions: 25 vs 25
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_mean                           +0.0206         +0.0207           405           414           405           414
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_median                         +0.0204         +0.0205           405           414           405           414
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/real_time_stddev                         +0.2155         +0.2212             1             1             1             1
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_pvalue                         0.0000          0.0000      U Test, Repetitions: 25 vs 25
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_mean                          -0.0109         -0.0108           147           145           147           145
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_median                        -0.0104         -0.0103           147           145           147           145
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/real_time_stddev                        -0.4919         -0.4800             0             0             0             0
Samsung/NX3000/_3184416.SRW/threads:8/real_time_pvalue                                         0.0000          0.0000      U Test, Repetitions: 25 vs 25
Samsung/NX3000/_3184416.SRW/threads:8/real_time_mean                                          -0.0149         -0.0147           220           217           220           217
Samsung/NX3000/_3184416.SRW/threads:8/real_time_median                                        -0.0173         -0.0169           221           217           220           217
Samsung/NX3000/_3184416.SRW/threads:8/real_time_stddev                                        +1.0337         +1.0341             1             3             1             3
Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_pvalue                                         0.0001          0.0001      U Test, Repetitions: 25 vs 25
Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_mean                                          -0.0019         -0.0019           194           193           194           193
Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_median                                        -0.0021         -0.0021           194           193           194           193
Sony/DSLR-A350/DSC05472.ARW/threads:8/real_time_stddev                                        -0.4441         -0.4282             0             0             0             0
Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_pvalue                                0.0000          0.4263      U Test, Repetitions: 25 vs 25
Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_mean                                 +0.0258         -0.0006            81            83            19            19
Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_median                               +0.0235         -0.0011            81            82            19            19
Sony/ILCE-7RM2/14-bit-compressed.ARW/threads:8/real_time_stddev                               +0.1634         +0.1070             1             1             0             0


If we look at the _means, the time column, the biggest win is -7.7% (Canon/EOS 5D Mark II/10.canon.sraw2.cr2),
and the biggest loss is +3.3% (Panasonic/DC-GH5S/P1022085.RW2);
Overall: mean -0.7436%, median -0.23%, cbrt(sum(time^3)) = -8.73%
Looks good so far i'd say.
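The summary numbers above can be reproduced from the Time column of the _mean rows. A minimal sketch, assuming the 25 values are copied by hand from the table (the helper names `signed_cbrt` and `summarize` are made up here):

```python
import math
import statistics

# Relative change in real time (Time column, *_mean rows), in table order.
TIME_MEAN_DELTAS = [
    -0.0607, -0.0770, -0.0271, -0.0065, +0.0006,
    -0.0015, -0.0008, +0.0002, +0.0004, -0.0092,
    +0.0015, -0.0209, -0.0098, -0.0237, -0.0014,
    -0.0080, +0.0132, +0.0331, -0.0023, -0.0047,
    +0.0206, -0.0109, -0.0149, -0.0019, +0.0258,
]


def signed_cbrt(x: float) -> float:
    """Real cube root that also works for negative inputs."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def summarize(deltas):
    """Mean, median and cube root of the sum of cubes of the deltas."""
    return {
        "mean": statistics.fmean(deltas),
        "median": statistics.median(deltas),
        "cbrt_sum_cubes": signed_cbrt(sum(d ** 3 for d in deltas)),
    }


for name, value in summarize(TIME_MEAN_DELTAS).items():
    print(f"{name}: {value * 100:+.4f}%")
```

The cube-root-of-sum-of-cubes figure weights large changes much more heavily than small ones, so it is dominated by the two Canon sraw results.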

andreadb accepted this revision.Oct 24 2018, 3:42 AM

I am okay with accepting this patch, provided that you fix the store throughput. If not in this patch, then it should definitely be addressed by a follow-up patch.
At the moment, your model assumes a maximum throughput of two store operations per cycle, which is incorrect.

You may want to do something similar to what we did for Jaguar, where the AGU scheduler is defined as a resource-group; we provide distinct definitions for the load/store AGEN pipes.
You can probably do the same; if I understand correctly, that should be enough to fix your issue with the store throughput.

If for some reason it takes time to fix, then you can commit this patch in the meantime.
Basically, you can commit this change for now, and raise a bug for the store throughput issue (so that we don't forget about fixing it).

Thanks,
Andrea

This revision is now accepted and ready to land.Oct 24 2018, 3:42 AM

I forgot to write:
Please make sure to add all the bdver2 tests to llvm-mca - just copy the relevant ISA test files and update the RUN tags.
Also, a test that shows the problematic store throughput would be nice.

Thanks!

lebedev.ri edited the summary of this revision.

Rebased.
All further fixes/tuning after landing.

I am okay with accepting this patch

Thanks everyone!

provided that you fix the store throughput.
If not in this patch, then it should definitely be addressed by a follow-up patch.
At the moment, your model assumes a maximum throughput of two store operations per cycle, which is incorrect.
You may want to do something similar to what we did for Jaguar, where the AGU scheduler is defined as a resource-group; we provide distinct definitions for the load/store AGEN pipes.
You can probably do the same; if I understand correctly, that should be enough to fix your issue with the store throughput.

i *think* it may work, but indeed, that *should* be covered with tests, and there aren't any now, so i'm going to go with "fix afterwards."

If for some reason it takes time to fix, then you can commit this patch in the meantime.
Basically, you can commit this change for now, and raise a bug for the store throughput issue (so that we don't forget about fixing it).

Thanks,
Andrea

I forgot to write:
Please make sure to add all the bdver2 tests to llvm-mca - just copy the relevant ISA test files and update the RUN tags.

Yep, i understand :)

Also, a test that shows the problematic store throughput would be nice.
Thanks!

Precommitted tests in rL345462, rebased on top of committed tests.
Proceeding to commit.

This revision was automatically updated to reflect the committed changes.