This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Set vectorizer-maximize-bandwidth as default true
AbandonedPublic

Authored by zatrazz on Apr 30 2018, 1:11 PM.

Details

Summary

Although this shows virtually no gain in SPEC CPU2006 on a Cortex-A72:

Benchmark Diff
400.perlbench +1.55
401.bzip2 -1.22
403.gcc +0.73
429.mcf +3.00
445.gobmk -0.39
456.hmmer -0.90
458.sjeng -0.41
462.libquantum -1.91
464.h264ref 0.00
471.omnetpp -0.64
473.astar -0.38
483.xalancbmk 0.90
geomean: 0.04

It shows some good improvements in generic loop code where each
element is truncated to a narrower type. For instance, the vector body
for the following code:

void store_i32_to_i8 (const int *src, int width, unsigned char *dst)
{
  for (int i = 0; i < width; i++) {
   *dst++ = *src++;
  }
}

It currently compiles to:


.LBB0_4:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        ldp     w14, w15, [x11, #-4]
        add     x11, x11, #8            // =8
        subs    x13, x13, #2            // =2
        sturb   w14, [x12, #-1]
        strb    w15, [x12], #2
        b.ne    .LBB0_4

With this patch it now compiles to:


.LBB0_4:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        ldp     q0, q1, [x11, #-64]
        ldp     q2, q3, [x11, #-32]
        ldp     q4, q5, [x11]
        ldp     q6, q7, [x11, #32]
        xtn     v0.4h, v0.4s
        xtn     v2.4h, v2.4s
        xtn2    v2.8h, v3.4s
        xtn2    v0.8h, v1.4s
        xtn     v6.4h, v6.4s
        xtn     v4.4h, v4.4s
        xtn     v0.8b, v0.8h
        xtn2    v0.16b, v2.8h
        xtn2    v6.8h, v7.4s
        xtn2    v4.8h, v5.4s
        xtn     v1.8b, v4.8h
        xtn2    v1.16b, v6.8h
        add     x11, x11, #128          // =128
        subs    x13, x13, #32           // =32
        stp     q0, q1, [x12, #-16]
        add     x12, x12, #32           // =32
        b.ne    .LBB0_4

This is an increase of about 12% in throughput in a micro-benchmark with an
array of 16777216 elements.

Diff Detail

Event Timeline

zatrazz created this revision. Apr 30 2018, 1:11 PM
rengolin added a subscriber: mcrosier.

This seems like an improvement, but we have to be careful with wide variations and little gain. The geomean is almost null but the standard deviation is higher than 75% of the results.

This is not a good sign and is a reflection of what I have complained about exchanging numbers and not tracking which benchmarks get worse/better, then risking hurting more one side than another in the long term.

The patch sets the default to *all* AArch64 cores, which is a wide range and you have only tested on one.

To enable such flag by default, we'd need to run on more cores (preferably including A53) as well as other vendors (like Apple, Samsung, Qualcomm), which this patch is likely to impact.

@t.p.northover @evandro @mcrosier please have a look.

> This seems like an improvement, but we have to be careful with wide variations and little gain. The geomean is almost null but the standard deviation is higher than 75% of the results.
>
> This is not a good sign and is a reflection of what I have complained about exchanging numbers and not tracking which benchmarks get worse/better, then risking hurting more one side than another in the long term.
>
> The patch sets the default to *all* AArch64 cores, which is a wide range and you have only tested on one.
>
> To enable such flag by default, we'd need to run on more cores (preferably including A53) as well as other vendors (like Apple, Samsung, Qualcomm), which this patch is likely to impact.
>
> @t.p.northover @evandro @mcrosier please have a look.

My understanding is that 'vectorizer-maximize-bandwidth' was previously enabled and reverted as the default for all targets, and that history shows it caused functionality issues more than performance ones [1]. I couldn't find any follow-up on that thread, so I took a more conservative approach and enabled it only on AArch64. I will check more recent workloads (SPEC CPU2017) for gains or regressions and also rerun on master. But I do agree that this kind of change should be taken with care.

[1] [llvm] r305960 - Enable vectorizer-maximize-bandwidth by default. (Dehao)
[llvm] r305990 - Revert "Enable vectorizer-maximize-bandwidth by default." (Diana Picus)
[llvm] r306336 - Enable vectorizer-maximize-bandwidth by default. (Dehao)
[llvm] r306344 - revert r306336 for breaking ppc test. (Dehao)
[llvm] r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default. (Dehao)
[llvm] r306792 - Revert "r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default." (Daniel)
[llvm] r306933 - Enable vectorizer-maximize-bandwidth by default.
[llvm] r306934 - revert r306336 for breaking ppc test.
[llvm] r306935 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default.
[llvm] r306936 - Revert "r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default."

rengolin requested changes to this revision. Apr 30 2018, 2:17 PM

> [1] [llvm] r305960 - Enable vectorizer-maximize-bandwidth by default. (Dehao)
> [llvm] r305990 - Revert "Enable vectorizer-maximize-bandwidth by default." (Diana Picus)
> [llvm] r306336 - Enable vectorizer-maximize-bandwidth by default. (Dehao)
> [llvm] r306344 - revert r306336 for breaking ppc test. (Dehao)
> [llvm] r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default. (Dehao)
> [llvm] r306792 - Revert "r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default." (Daniel)
> [llvm] r306933 - Enable vectorizer-maximize-bandwidth by default.
> [llvm] r306934 - revert r306336 for breaking ppc test.
> [llvm] r306935 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default.
> [llvm] r306936 - Revert "r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default."

This is a horrible history, even worse if it broke things, not just performance...

Worse still is the number of targets involved (Arm, PPC), and that it was re-committed a number of times without modification.

This trend is worrying, and we should look *really* hard at the overall implementation before assuming that, because it didn't break SPEC2k6 on A72, it's a good patch.

We need a full analysis of what happened: why it's breaking so much, why the noise is so great, and how we can reduce it while fixing the actual bugs.

Re-committing the patch as is simply won't work.

cheers,
--renato

This revision now requires changes to proceed. Apr 30 2018, 2:17 PM
zatrazz updated this revision to Diff 147815. May 21 2018, 10:41 AM
zatrazz added a comment. Edited May 21 2018, 2:33 PM

For some reason I did not attach the intended comments in this update. This is an update of the previous patch with an extended analysis. I checked a bootstrap build with TargetTransformInfo::shouldMaximizeVectorBandwidth enabled for both armhf (r332595) and powerpc64le (r332840). On armhf I did not see any regression; however, on powerpc64le I found an issue related to how the current code handles the MaximizeBandwidth option. The testcase 'Transforms/LoopVectorize/PowerPC/pr30990.ll' explicitly sets vectorizer-maximize-bandwidth to 0, yet the code checks for:

In lib/Transforms/Vectorize/LoopVectorize.cpp:

  unsigned MaxVF = MaxVectorSize;
  if (TTI.shouldMaximizeVectorBandwidth(OptForSize) ||
      (MaximizeBandwidth && !OptForSize)) {

This is what enables/disables the optimization. I think a possible fix would be to check whether MaximizeBandwidth was explicitly disabled (instead of checking its default value):

diff --git a/lib/Transforms/Vectorize/LoopVectorize.cpp b/lib/Transforms/Vectorize/LoopVectorize.cpp
index a65dc09..40c6583 100644
--- a/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4968,7 +4968,8 @@ LoopVectorizationCostModel::computeFeasibleMaxVF(bool OptForSize,
   }
 
   unsigned MaxVF = MaxVectorSize;
-  if (TTI.shouldMaximizeVectorBandwidth(OptForSize) ||
+  if (TTI.shouldMaximizeVectorBandwidth(OptForSize &&
+                                        !MaximizeBandwidth.getNumOccurrences()) ||
       (MaximizeBandwidth && !OptForSize)) {
     // Collect all viable vectorization factors larger than the default MaxVF
     // (i.e. MaxVectorSize).

But I think it should not act as a blocker for this patch (I can send it as a separate patch if required).

Now, regarding performance differences: for SPEC CPU2006 I see mixed results. The machine I am testing on shows some variance, and even trying to minimize OS jitter (by pinning CPU allocation to a specific node and disabling OS services), in two runs I see:

  • RUN 1 (1 iteration)

Benchmark Difference (%) *
400.perlbench 0.73
401.bzip2 0.01
403.gcc 0.53
429.mcf 0.05
445.gobmk 0.99
456.hmmer -0.25
458.sjeng 1.02
462.libquantum 0.04
464.h264ref 0.28
471.omnetpp 0.30
473.astar -0.11
483.xalancbmk 1.92

433.milc 0.03
444.namd -0.38
447.dealII 0.95
450.soplex 0.99
453.povray 1.24
470.lbm -0.88
482.sphinx3 1.43

  • RUN 2 (3 iterations, best result taken)

Benchmark Difference (%) *
400.perlbench 0.66
401.bzip2 -2.84
403.gcc 0.09
429.mcf 0.46
445.gobmk 0.03
456.hmmer -1.34
458.sjeng 0.12
462.libquantum 0.06
464.h264ref 0.45
471.omnetpp -0.74
473.astar 1.05
483.xalancbmk -0.57

433.milc -0.04
444.namd 0.14
447.dealII -0.37
450.soplex 0.97
453.povray -0.90
470.lbm -0.88
482.sphinx3 0.28

On SPEC CPU2017 the results are slightly more stable:

Benchmark Difference (%) *
600.perlbench_s 0.41
602.gcc_s 0.96
605.mcf_s -0.87
620.omnetpp_s 1.74
623.xalancbmk_s 1.80
625.x264_s 0.16
631.deepsjeng_s 0.33
641.leela_s 0.38
657.xz_s -0.14

619.lbm_s -0.45
638.imagick_s 0.09
644.nab_s -0.10

[EDIT: removed geekbench5 data points since it is not released yet]

  • Difference between r332336 with and without the patch; positive values represent a higher score, indicating an improvement with the patch.

SPEC06 results look too noisy to conclude anything, especially bzip, xalan and povray. Can you find a more stable machine?

On the Geekbench side, I guess we need to understand what's wrong with PDFRendering before proceeding.

mcrosier resigned from this revision. May 24 2018, 12:54 PM

Indeed the machine I was using for SPEC CPU2006 was not the most suitable. I used a different one now (TX1, Cortex-A57) with extra care to lower variance (CPU binding, services disabled, etc.), and it indeed showed better results:

Benchmark        Difference (%)
400.perlbench    +1.04
401.bzip2        -2.04
403.gcc          +0.63
429.mcf          +0.05
445.gobmk        +0.18
456.hmmer        -0.90
458.sjeng        +0.29
462.libquantum   +0.10
464.h264ref      +1.73
471.omnetpp      +0.18
473.astar        -0.21
483.xalancbmk    +0.12

433.milc         -0.36
444.namd         +0.86
447.dealII       -0.66
450.soplex       -0.13
453.povray       +0.37
470.lbm          +0.08
482.sphinx3      +0.54

I will check whether the slight 401.bzip2 drop is just noise or something related to this patch, but regardless I do think this change should yield better performance in most scenarios. Taking the example where I am trying to get a better auto-vectorization pattern:

void store_i32_to_i8 (const int *src, int width, unsigned char *dst)
{
  for (int i = 0; i < width; i++) {
   *dst++ = *src++;
  }
}

I can only get the maximum throughput when auto-vectorization tries large vectorization factors. I did try to optimize the trunc of 4 x i32 with a custom LowerTruncStore, but as far as I understand, without either an extra transformation or pass the AArch64 backend can't really fuse the high-half vector instruction (xtn2 in this case) for maximum throughput. Something I am investigating is whether selecting the largest VF for 'MaximizeVectorBandwidth' is the best strategy or whether we should add an architecture hook to enable/disable it.

On the Geekbench side I will investigate PDFRendering, but I really think it is just missing vectorization tuning and I am not sure we should consider it a blocker.

> 401.bzip2    -2.04
>
> I will check whether the slight 401.bzip2 drop is just noise or something related to this patch, but regardless I do think this change should yield better performance in most scenarios.

We have to be careful, though. In the past we have accepted this kind of scenario as "obviously good" because geomean is better. But that's a fallacy that has brought some pain over the past few years.

While a positive geomean is good, a -2% on bzip will mean someone will be trying to fix that later on, and might just as well undo the "good" changes this patch brings.

This is a never-ending scenario where the overall win is zero (plus or minus something).

I'm not saying we should stop any improvement because of one bad result, but here the worst regression is larger than any single improvement. This is still worrying, and may mean that the changes bring instability to later passes (including back-end ones).

> I can only get the maximum throughput when auto-vectorization tries large vectorization factors. I did try to optimize the trunc of 4 x i32 with a custom LowerTruncStore, but as far as I understand, without either an extra transformation or pass the AArch64 backend can't really fuse the high-half vector instruction (xtn2 in this case) for maximum throughput. Something I am investigating is whether selecting the largest VF for 'MaximizeVectorBandwidth' is the best strategy or whether we should add an architecture hook to enable/disable it.

I believe the fact that LLVM gets this right with the extra flag is a side-effect of how the vectoriser used to work: keep trying until whatever. Very soon this will no longer be the case and the approach for this would probably end up in a VPlan or something.

While I understand the need to get it "faster" on benchmarks, I think we need to look at the bigger picture here.

> On the Geekbench side I will investigate PDFRendering, but I really think it is just missing vectorization tuning and I am not sure we should consider it a blocker.

Until we know more about the reason bzip and PDF are so much worse, there is no way to know if this is a blocker or not.

fhahn requested changes to this revision. Jan 11 2019, 9:20 AM

Requesting changes as additional info on the bzip and PDF regressions is needed.

This revision now requires changes to proceed. Jan 11 2019, 9:20 AM
zatrazz abandoned this revision. Jan 22 2019, 3:08 AM