This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize][AArch64] Enable ordered reductions by default for AArch64
ClosedPublic

Authored by david-arm on Jul 23 2021, 5:16 AM.

Details

Summary

I have added a new TTI interface called enableOrderedReductions() that
controls whether or not ordered reductions should be enabled for a
given target. By default this returns false, whereas for AArch64 it
returns true and we rely upon the cost model to make sensible
vectorisation choices. It is still possible to override the new TTI
interface by setting the command line flag:

-force-ordered-reductions=true|false

I have added a new RUN line to show that we use ordered reductions by
default for SVE and Neon:

Transforms/LoopVectorize/AArch64/strict-fadd.ll
Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll

Diff Detail

Event Timeline

david-arm created this revision.Jul 23 2021, 5:16 AM
david-arm requested review of this revision.Jul 23 2021, 5:16 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 23 2021, 5:16 AM

I wonder if this might be something of interest for other targets, not just AArch64?

Do you have any performance results?

david-arm retitled this revision from [LoopVectorize][AArch64] Enable ordered reductions by default for AArch64 to [LoopVectorize][AArch64] Enable ordered reductions by default for SVE.
david-arm edited the summary of this revision. (Show Details)
  • Changed the patch to only enable strict (ordered) reductions for AArch64 when SVE is enabled.

Hi @dmgreen, we've decided for now to only enable this by default for AArch64 when SVE is enabled as this is lower risk. We are currently still collecting performance data for the fixed-width vectorisation case when SVE is enabled.

Hi @dmgreen, we've decided for now to only enable this by default for AArch64 when SVE is enabled as this is lower risk. We are currently still collecting performance data for the fixed-width vectorisation case when SVE is enabled.

Sorry, that should really be "collecting performance data for the fixed-width vectorisation case when strict reductions are forced".

What do you mean by "lower risk"? Do you have performance numbers for SVE? Or is the cost so high that in practice they are never generated?

What do you mean by "lower risk"? Do you have performance numbers for SVE? Or is the cost so high that in practice they are never generated?

The end-goal is to enable strict reductions by default for all targets so that buildbots guard the functionality and hopefully give some performance benefit as well. The cost-model must be conservative enough to avoid any regressions, but still let through loops where there is an obvious benefit. At that point, we can start tuning the cost-model to let through more cases when that helps performance.

Our other motivation is around making LLVM 13 an experimental compiler for VLA auto-vectorization. We specifically want to enable strict reductions by default in LLVM 13 for vector-length-agnostic SVE, because this is a new vectorization capability which SVE can handle. The cost-model doesn't really matter too much at this point, because VLA auto-vec is experimental and little effort has yet been made to improve code-quality, so it's unlikely that strict-reductions will make a dent.

To achieve our end-goal of enabling strict reductions by default for all targets, we can do that in stages:

  1. Enable it by default for VLA SVE (this patch)

We can enable it because performance doesn't really matter. It would mean this patch needs to be updated to (temporarily) give a high/invalid cost for ordered reductions when the type is a FixedVectorType, so that we don't accidentally introduce any regressions for e.g. -mcpu=a64fx when -scalable-vectorization=on|preferred is not specified.

  1. Enable it by default for AArch64.

We have performed SPEC2K6 measurements where we can see that the cost-model holds up and where performance across the board is similar or better, with only very minor regressions (<1%). We want to do a bit more benchmarking (such as measuring on different AArch64 machines) to present numbers we're confident about.

  1. Enable it by default for other targets.

This will require measurements on targets other than AArch64.

Does that sound like a sensible approach?

Matt added a subscriber: Matt.Jul 28 2021, 2:43 PM

Turning this on sounds good, so long as we do our due-diligence and check it's OK performance wise. I just did some experiments and it seemed fine-ish, but they were only very small exampled run on a A510 and A710. So long as there was more than a single real vector operation, the benefits started overcoming the overheads.

A couple of high level comments though:

  • I'm not a fan of -march=arm8-a+sve working differently to -march=armv8-a for non-sve related code. i.e "hasSVE" shouldn't be the trigger for allowing NEON inorder reductions, if they do not have anything to do with SVE. We should consider just jumping to step 2 and enabling it for all aarch64, so long as the performance results look OK and we are careful about regressions of course.
  • It seems that the cost is so high under SVE that they will never be generated in practice. This suggests that enabling them will have little effect for either testing or performance.
david-arm updated this revision to Diff 363445.Aug 2 2021, 4:59 AM
david-arm retitled this revision from [LoopVectorize][AArch64] Enable ordered reductions by default for SVE to [LoopVectorize][AArch64] Enable ordered reductions by default for AArch64.
david-arm edited the summary of this revision. (Show Details)
  • Enabled ordered reductions by default for AArch64.
  • Rebase.
spatel added inline comments.Aug 6 2021, 5:37 AM
llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
5 ↗(On Diff #363445)

Drive-by comment (checking my understanding) - we could add this RUN line with --check-prefix=CHECK-NOT-VECTORIZED as a preliminary commit and then update only the prefix to show the change in behavior with this patch?

david-arm added inline comments.Aug 10 2021, 1:46 AM
llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
5 ↗(On Diff #363445)

That sounds like a good idea! I'll commit a one line patch to do this and then rebase.

david-arm added a comment.EditedAug 18 2021, 5:17 AM

Hi @dmgreen, so I have run SPEC2006 on a neoverse-n1 9 times without my patch and 9 times with when building with -O3, then compared the averages. Here is a summary of the results showing a few outliers (slowest at the top, fastest at the bottom):

Benchmark | Percentage Runtime Change (<0 = faster with ordered reductions)

453.povray: 0.3
464.h264ref: 0.1
462.libquantum: 0
...
450.soplex: -0.95
429.mcf: -1.22
482.sphinx3: -1.27
471.omnetpp: -1.51

Geometric mean: -0.45

Overall it looks like this slightly faster with ordered reductions enabled by default for AArch64.

david-arm updated this revision to Diff 367181.Aug 18 2021, 5:27 AM
  • Just spotted a bug in enableOrderedReductions and fixed it. This was fixed in the patch I was using for performance testing. :)
dmgreen accepted this revision.Aug 18 2021, 9:27 AM

The numbers look good to me. With D108292 this LGTM. Thanks.

This revision is now accepted and ready to land.Aug 18 2021, 9:27 AM
This revision was landed with ongoing or failed builds.Aug 19 2021, 1:29 AM
This revision was automatically updated to reflect the committed changes.
pcc added a subscriber: pcc.Aug 20 2021, 10:07 AM

Hi David, it looks like this change caused an assertion failure. Can you please take a look?

> cat test.i
int a;
double b, c;
void d() {
  for (; a; a++) {
    b += c;
    c = a;
  }
}
> clang -O2 test.i --target=aarch64-linux
clang: ../llvm/lib/Transforms/Vectorize/VPlanValue.h:186: llvm::Value *llvm::VPValue::getLiveInIRValue(): Assertion `!getDef() && "VPValue is not a live-in; it is defined by a VPDef inside a VPlan"' failed.
fhahn added a subscriber: fhahn.Aug 20 2021, 1:25 PM

Hi David, it looks like this change caused an assertion failure. Can you please take a look?

> cat test.i
int a;
double b, c;
void d() {
  for (; a; a++) {
    b += c;
    c = a;
  }
}
> clang -O2 test.i --target=aarch64-linux
clang: ../llvm/lib/Transforms/Vectorize/VPlanValue.h:186: llvm::Value *llvm::VPValue::getLiveInIRValue(): Assertion `!getDef() && "VPValue is not a live-in; it is defined by a VPDef inside a VPlan"' failed.

Thanks for the report! I have a suspicion of what may be going wrong here. I'll revert the patch while I take a look.

fhahn added a comment.Aug 23 2021, 3:33 AM

I recommitted the patch in d024a01511c1 after fixing the issue causing the revert in 9baed023b4b5

I recommitted the patch in d024a01511c1 after fixing the issue causing the revert in 9baed023b4b5

Hi @fhahn, thanks a lot for fixing this! I was on holiday Monday and Tuesday, only just saw the fallout now. :)