
[SimplifyCFG][LoopRotate] SimplifyCFG: disable common instruction hoisting by default, enable late in pipeline
ClosedPublic

Authored by lebedev.ri on Jul 18 2020, 1:30 PM.

Details

Summary

I've been looking at missed vectorizations in one codebase. One particular thing
that stands out is that some of the loops reach the vectorizer in a rather mangled
form, with weird PHIs, and some of the loops aren't even in rotated form.

After taking a more detailed look, it turned out that this happened because the
loops' headers were too big by then. It is evident that SimplifyCFG's common code
hoisting transform is at fault, because the pattern it handles is precisely the
unrotated loop basic-block structure.

Surprisingly, SimplifyCFGOpt::HoistThenElseCodeToIf() is enabled by default
and is always run, unlike its friend, the common code sinking transform,
SinkCommonCodeFromPredecessors(), which is not enabled by default and is only
run once, very late in the pipeline.

I'm proposing to harmonize this and disable common code hoisting until late
in the pipeline. The definition of "late" may vary; currently I've picked the
same point as for code sinking, but I suppose we could enable it as soon as
right after loop rotation happens.

Experimentation shows that this does indeed, unsurprisingly, help: more loops
get rotated, although other issues remain elsewhere.

Now, this undoubtedly seriously shakes up phase ordering. It will be a mixed
bag in terms of compile time, run time, and code size.
Since we no longer aggressively hoist+deduplicate common code early, we don't
pay the price of that hoisting (which wasn't big). Not hoisting may allow more
loops to be rotated, so we pay that price instead. That, in turn, may enable
all the transforms that require canonical (rotated) loop form, including but
not limited to vectorization, so we pay that too. And in general, no
deduplication means more [duplicate] instructions going through the
optimizations. But hoisting still runs late in the pipeline, so some of them
will be caught then.

As per the benchmarks I've run, this is mostly within the noise;
there are some small improvements and some small regressions. One big regression
I saw, I fixed in rG8d487668d09fb0e4e54f36207f07c1480ffabbfd, but I'm sure
this will expose many more pre-existing missed optimizations, as usual :S

llvm-compile-time-tracker.com thoughts on this:
http://llvm-compile-time-tracker.com/compare.php?from=e40315d2b4ed1e38962a8f33ff151693ed4ada63&to=c8289c0ecbf235da9fb0e3bc052e3c0d6bff5cf9&stat=instructions

  • this does regress compile-time by +0.5% geomean (unsurprisingly)
  • size impact varies; for ThinLTO it's actually an improvement

If we look at stats before/after diffs, some excerpts:

  • RawSpeed (the target)
    • -14 (-73.68%) loops not rotated due to the header size (yay)
    • -272 (-0.67%) "Number of live out of a loop variables" - good for vectorizer
    • -3937 (-64.19%) common instructions hoisted
    • +561 (+0.06%) x86 asm instructions
    • -2 basic blocks
    • +2418 (+0.11%) IR instructions
  • vanilla test-suite + RawSpeed + darktable
    • -36396 (-65.29%) common instructions hoisted
    • +1676 (+0.02%) x86 asm instructions
    • +662 (+0.06%) basic blocks
    • +4395 (+0.04%) IR instructions

Thoughts?

Diff Detail

Event Timeline

lebedev.ri marked an inline comment as done. Jul 18 2020, 1:31 PM
lebedev.ri added inline comments.
llvm/test/Transforms/PhaseOrdering/loop-rotation-vs-common-code-hoisting.ll
1

This is the test that actually shows the phase ordering problem.

nikic added a comment. Jul 18 2020, 1:43 PM

this does regress compile-time by +0.5% geomean (unsurprisingly)

FWIW, a large part of the regressions seems to come down to a single file that regresses big time:

CMakeFiles/lencod.dir/context_ini.c.o 3917M 7067M (+80.41%)

Probably worth taking a look whether that can be mitigated.


Thanks, I will take a look; I just really wanted to finally post this after two dry runs due to rG0fdcca07ad2c0bdc2cdd40ba638109926f4f513b/rG8d487668d09fb0e4e54f36207f07c1480ffabbfd :)


I presently don't know what is actually going on, but [old] GVN is to blame there; it goes from

Total Execution Time: 0.9114 seconds (0.9113 wall clock)

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
 0.1325 ( 14.7%)   0.0000 (  0.0%)   0.1325 ( 14.5%)   0.1325 ( 14.5%)  Global Value Numbering

to

===-------------------------------------------------------------------------===
  Total Execution Time: 1.7953 seconds (1.7952 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.9874 ( 55.9%)   0.0095 ( 32.7%)   0.9970 ( 55.5%)   0.9970 ( 55.5%)  Global Value Numbering

Stats delta:

statistic name                               baseline  proposed        Δ          %    abs(%)
memdep.NumCacheNonLocalPtr                       2804    114188  +111384  +3972.33%  3972.33%
memdep.NumUncacheNonLocalPtr                     2638     91937   +89299  +3385.10%  3385.10%
memory-builtins.ObjectVisitorArgument             380      4606    +4226  +1112.11%  1112.11%
simplifycfg.NumHoistCommonCode                     49       223     +174   +355.10%   355.10%
basicaa.SearchTimes                             99200    445902  +346702   +349.50%   349.50%
local.NumRemoved                                   35       137     +102   +291.43%   291.43%
simplifycfg.NumHoistCommonInstrs                  378       803     +425   +112.43%   112.43%
mem2reg.NumLocalPromoted                            3         1       -2    -66.67%    66.67%
indvars.NumElimExt                                106       148      +42    +39.62%    39.62%
licm.NumSunk                                     3063      3855     +792    +25.86%    25.86%
instcombine.NumDeadInst                           428       525      +97    +22.66%    22.66%
jump-threading.NumThreads                          18        22       +4    +22.22%    22.22%
licm.NumHoisted                                   453       552      +99    +21.85%    21.85%
gvn.NumGVNLoad                                    111       135      +24    +21.62%    21.62%
simplifycfg.NumSinkCommonInstrs                   396       312      -84    -21.21%    21.21%
instcombine.NumSunkInst                            10        12       +2    +20.00%    20.00%
simplifycfg.NumSimpl                             1195      1424     +229    +19.16%    19.16%
early-cse.NumCSELoad                               97       114      +17    +17.53%    17.53%
sroa.MaxUsesPerAllocaPartition                    120       138      +18    +15.00%    15.00%
sroa.NumAllocaPartitionUses                       770       884     +114    +14.81%    14.81%
instcombine.NumCombined                          1328      1522     +194    +14.61%    14.61%
sroa.NumDeleted                                   828       942     +114    +13.77%    13.77%
gvn.NumGVNInstr                                   303       343      +40    +13.20%    13.20%
simplifycfg.NumSinkCommonCode                     188       167      -21    -11.17%    11.17%
bdce.NumRemoved                                    18        16       -2    -11.11%    11.11%
licm.NumMovedLoads                                 77        69       -8    -10.39%    10.39%
gvn.NumGVNSimpl                                    31        28       -3     -9.68%     9.68%
lcssa.NumLCSSA                                    293       265      -28     -9.56%     9.56%
instsimplify.NumSimplified                         54        49       -5     -9.26%     9.26%
correlated-value-propagation.NumPhis               25        27       +2     +8.00%     8.00%
memdep.NumCacheDirtyNonLocalPtr                    26        28       +2     +7.69%     7.69%
gvn.NumGVNBlocks                                   58        55       -3     -5.17%     5.17%
loop-vectorize.LoopsAnalyzed                       58        55       -3     -5.17%     5.17%
scalar-evolution.NumTripCountsComputed            264       251      -13     -4.92%     4.92%
memdep.NumCacheCompleteNonLocalPtr                 41        42       +1     +2.44%     2.44%
assume-queries.NumAssumeQueries                   616       604      -12     -1.95%     1.95%
instcombine.NegatorNumValuesVisited              2333      2289      -44     -1.89%     1.89%
instcombine.NegatorTotalNegationsAttempted       2333      2289      -44     -1.89%     1.89%
scalar-evolution.NumTripCountsNotComputed           0         1       +1      0.00%     0.00%


It appears that IsValueFullyAvailableInBlock() is at fault here; it is only called from PerformLoadPRE(), which is only called from processNonLocalLoad().
Not yet sure what specifically is going wrong.

I'd suggest splitting this in two parts:

  • NFC patch with new options & test updates, but preserving current behavior;
  • One-liner patch to switch default values.

This is needed to make revert easier in case if it exposes some problems.


Sure, done.


Rewrote it (D84181); sadly, that wasn't it.
Here's the IR before GVN, without/with this patch:


The slowdown of -gvn on that input is observable:

$ /builddirs/llvm-project/build-Clang10-Release/bin/opt -gvn -time-passes /tmp/lencod-context_ini-pre-gvn-old.bc -o /dev/null 2>&1 | grep "Global Value Numbering"
   0.0911 ( 87.0%)   0.0008 ( 77.1%)   0.0920 ( 86.9%)   0.0920 ( 86.9%)  Global Value Numbering
$ /builddirs/llvm-project/build-Clang10-Release/bin/opt -gvn -time-passes /tmp/lencod-context_ini-pre-gvn-new.bc -o /dev/null 2>&1 | grep "Global Value Numbering"
   0.7440 ( 98.1%)   0.0001 ( 90.3%)   0.7441 ( 98.1%)   0.7441 ( 98.1%)  Global Value Numbering

As can be observed via various means, we were originally doing 54 calls to llvm::GVN::PerformLoadPRE(),
and now do 286, i.e. 5.2x, which is not that far off from the 8x time regression.
We seem to be spending all the time in llvm::MemoryDependenceResults::getNonLocalPointerDepFromBB().
Nothing really immediately stands out to me there, I'm afraid.
Maybe someone more familiar with that analysis can spot the (preexisting) issue, if there is one.

nikic added a comment. Jul 25 2020, 8:27 AM

@lebedev.ri This issue definitely looks familiar, I've seen a few other cases where a large fraction of compile-time is spent in non-local memdep analysis in GVN.

Some thoughts:

  • Ultimately, memdep analysis spends most of its time in alias analysis, and we can make that part faster by using BatchAA. This has a non-trivial positive impact in general: http://llvm-compile-time-tracker.com/compare.php?from=bc79ed7e16003c8550da8710b321d6d5d4243faf&to=1ab8f1b475d2b9d464add10d1a7f87f18c073fb0&stat=instructions For this particular test case it drops instructions from ~4.2M to ~3.8M.
  • I think fundamentally, what memdep does and the way it does it, cannot be easily improved. This is probably something that MemorySSA-based NewGVN improves on (at least in theory -- in practice MemorySSA has a not so stellar compile-time record).
  • What can be improved, though, are the cutoffs. Non-local memdep analysis has two primary cutoffs: BlockScanLimit=100, which limits the number of instructions inspected in a single block, and BlockNumberLimit=1000, which limits the number of inspected blocks. Taken together, this means that a single memdep query can inspect up to 100,000 instructions, which is frankly insane. A quick test with -memdep-block-number-limit=100 drops the instructions for the test case to ~1.8M. It may be reasonable to either reduce the default block limit here, or to introduce an additional "total instruction limit" for non-local queries.

I think we're focusing on the wrong thing here.
Are the preexisting memdep perf problems the only issue everyone has with this?
Does nobody have any other, perhaps fundamental, concerns here?
If that's the only issue, are the memdep perf problems a blocker here?

Some thoughts:

Looking forward to such a patch.

  • I think fundamentally, what memdep does and the way it does it, cannot be easily improved. This is probably something that MemorySSA-based NewGVN improves on (at least in theory -- in practice MemorySSA has a not so stellar compile-time record).

Unless there's some global cache missing, that is; but I'd think it would already be there, so yeah.

  • What can be improved though are the cutoffs. Non-local memdep analysis has two primary cutoffs: The BlockScanLimit=100, which limits the amount of instructions inspected in a single block, and the BlockNumberLimit=1000, which limits the amount of inspected blocks. Taken together, this means that a single memdep query can inspect up to 100000 instructions, which is frankly insane. A quick test with -memdep-block-number-limit=100 drops the instructions for the test case to ~1.8M.

Right.

It may be reasonable to either reduce the default block limit here, or to introduce an additional "total instruction limit" for non-local queries.

On first glance, it doesn't look right to me to lower the existing limits; they don't seem *that* high.
What I think we are indeed missing is some more global budget. I'll take a look.


After doing some stats, D84609

mkazantsev accepted this revision. Jul 28 2020, 5:12 AM

Fine by me, but not sure if @nikic has any questions.

This revision is now accepted and ready to land. Jul 28 2020, 5:12 AM

Just a drive-by general comment...
Before we added SimplifyCFGOptions, there was a named "LateSimplifyCFG" pass that implicitly enabled more transforms. There was push back on llvm-dev about passes that use options to change behavior because that made it harder to test the pass managers.

Given the known problems with SimplifyCFG (it does poison-unsafe transforms, it does not reliably reach fix-point, it makes questionable use of the cost model, etc.), the best long-term solution would be to break it up into multiple passes - canonicalization-only early and potentially various optimization passes for later in the pipeline. It's a tough job because (as shown in the data here), that's probably going to uncover lots of other problems. If we now have a fairly reliable way to test that with llvm-compile-time-tracker and there's work underway to find/fuzz our way to better pass pipeline configs, then we should be able to finally untangle this knot.

Fine by me, but not sure if @nikic has any questions.

@nikic ?

nikic added a comment. Jul 29 2020, 9:33 AM


No further concerns from my side, go for it :)

Well okay then.
@mkazantsev @nikic @spatel thank you for the reviews!

This will totally light up in every tracking graph..

This revision was landed with ongoing or failed builds. Jul 29 2020, 10:07 AM
This revision was automatically updated to reflect the committed changes.

So, we're seeing several significant (20-30%) regressions due to this in various different library benchmarks. Usually around things that are compression/decompression loops, but also other places.

I'm uncertain what we can do, but I think this might need a wider range of discussion rather than the phabricator review.

Would you mind terribly reverting and starting a thread on llvm-dev about this? I think that'll give us an opportunity to weigh some of the tradeoffs in a wider space.

-eric

One more follow up here:

One set of benchmarks we're seeing this in are also in the llvm test harness:

llvm_multisource_tsvc.TSVC-ControlFlow-flt and llvm_multisource_misc.Ptrdist-yacr2

in an FDO mode (so we have some decent branch hints for the code).

You should be able to duplicate at least some of the slowdown there.


That sadly doesn't tell me much.
Can you please provide the reproduction steps, and ideally the IR with/without this change?


We saw performance improvements from this patch. They were a little bit up-and-down, but definitely an improvement overall.


It should be as straightforward as -O3 -fexperimental-new-pass-manager; if that doesn't do it, we'll look a bit more. Those tests are in the llvm test-suite, so they should be easy to get as part of your checkout.

Ideally for a change as large as this you'd have run these benchmarks yourself - they're fairly useful here and I think I'd have expected them if I were reviewing.

So, we're seeing several significant (20-30%) regressions due to this in various different library benchmarks. Usually around things that are compression/decompression loops, but also other places.

I'm uncertain what we can do, but I think this might need a wider range of discussion rather than the phabricator review.

Would you mind terribly reverting and starting a thread on llvm-dev about this? I think that'll give us an opportunity to weigh some of the tradeoffs in a wider space.

I think I'm still here. I'll comment a bit more in reply to David below.

We saw performance improvements from this patch. They were a little bit up-and-down, but definitely an improvement overall.

Right. I think you're still in scientific computing yes? The internal and external benchmarks are fairly straightforward scalar code. I.e. if they were vectorizable we'd probably have done so ages ago for the internal and the external are things like bytecode interpreters etc. So the patch is penalizing things that just aren't vectorizable for things that might be - and I'm not sure that's the right tradeoff to make here. Let's start an llvm-dev discussion and bring in a few more people as well. I'd be very interested in Kit's perspective here too.

lebedev.ri added a comment. Edited Aug 20 2020, 8:40 AM


It should be as straightforward as -O3 -fexperimental-new-pass-manager, if that isn't then we'll look a bit more. Those tests are in the llvm test-suite so they should be easy to get as part of your checkout.

No, I mean FDO. Can you either point me at the docs where said incantation for the test-suite is spelled out, or provide the commands necessary to reproduce?


"Hey, let's penalize some code by +30% runtime because that improves some other code by -10%" obviously won't fly,
so unless you want to start the discussion, I want to first understand what is actually going on.
I'm going to guess that such huge regressions likely mean a lack of inlining.

I think you're still in scientific computing yes?

Embedded computing, actually. Low-power Arm devices, but I'm always interested in all uses of Arm devices. I have been working in the vectorizer a lot lately, and a number of the benchmarks we usually run are DSP-style codes, so quite similar to HPC. But the improvements I saw included CPUs without any vectorization. Code size was also flat in all, which I took as a good sign that the patch wasn't bad for general codegen.

The developer policy is quite clear when it comes to addressing regressions, but my vote would be for keeping the change :)


Was that with old-pm, or new-pm?

I would have been using old, the default. I think we still see some performance/codesize problems with switch to new.

lebedev.ri added a comment. Edited Aug 21 2020, 5:42 AM


That might be the pattern then: just another old-pm vs new-pm difference to track for the switch.
But I'm not seeing an umbrella bug for performance regressions under https://bugs.llvm.org/show_bug.cgi?id=46649?



Uhm, for N=9, I'm afraid I'm not seeing anything horrible like that:

$ /repositories/llvm-test-suite/utils/compare.py --lhs-name old-newpm --rhs-name new-newpm llvm-test-suite-old-NewPM/res-*.json vs llvm-test-suite-new-NewPM/res-*.json --merge-average --filter-hash
Tests: 41
Metric: exec_time

Program                                        old-newpm new-newpm diff 
 test-suite...marks/Ptrdist/yacr2/yacr2.test     0.77      0.74    -4.0%
 test-suite...s/Ptrdist/anagram/anagram.test     0.76      0.78     3.0%
 test-suite...flt/InductionVariable-flt.test     2.90      2.95     1.9%
 test-suite...lFlow-flt/ControlFlow-flt.test     3.48      3.44    -1.4%
 test-suite...ing-dbl/NodeSplitting-dbl.test     3.09      3.13     1.1%
 test-suite...dbl/InductionVariable-dbl.test     4.02      4.05     1.0%
 test-suite...pansion-dbl/Expansion-dbl.test     2.49      2.51     0.7%
 test-suite...pansion-flt/Expansion-flt.test     1.56      1.57     0.5%
 test-suite...lt/CrossingThresholds-flt.test     2.86      2.87     0.5%
 test-suite...lFlow-dbl/ControlFlow-dbl.test     4.05      4.03    -0.5%
 test-suite...ing-flt/Equivalencing-flt.test     1.13      1.14     0.3%
 test-suite...ow-dbl/GlobalDataFlow-dbl.test     3.20      3.21     0.3%
 test-suite.../Benchmarks/Ptrdist/bc/bc.test     0.40      0.40     0.3%
 test-suite...C/Packing-flt/Packing-flt.test     2.90      2.91     0.3%
 test-suite.../Benchmarks/Ptrdist/ft/ft.test     1.05      1.05     0.3%
 Geomean difference                                                 0.1%
       old-newpm  new-newpm       diff
count  41.000000  41.000000  41.000000
mean   2.761491   2.765264   0.001355 
std    1.382334   1.383351   0.009172 
min    0.400356   0.401544  -0.040124 
25%    2.158678   2.161544  -0.000219 
50%    2.836133   2.839078   0.000974 
75%    3.204467   3.214600   0.002932 
max    6.865056   6.881911   0.029819 

@echristo, does that match what you were seeing for the test-suite? Is the Ptrdist/anagram regression the one?


We didn't see any codesize regression, but any scalar code with a hot loop started regressing fairly strongly. Particularly things like compression that were already hand vectorized a bit.

The developer policy is quite clear when it comes to addressing regressions, but my vote would be for keeping the change :)

Understood.

I would have been using old, the default. I think we still see some performance/codesize problems with switch to new.

Codesize I wouldn't be surprised. Inlining is much different with the new pass manager.


The two tests in the llvm test-suite we're seeing the largest regressions on are:

llvm_multisource_tsvc.TSVC-ControlFlow-flt -0.34% -> -19.33%
llvm_multisource_misc.Ptrdist-yacr2 (I can't recall the difference here)

Those are running on a similar system to yours (but fully quieted/no other code running) with -O3 -fexperimental-new-pass-manager -march=haswell -fno-strict-aliasing -fno-omit-frame-pointer and -mprefer-vector-width=128 as the option set.

In addition we're seeing 10% regressions in benchmarks with https://github.com/google/snappy code (benchmarks aren't public unfortunately) and other scalar code with loops whether they have some hand vectorization or not.

Looking at some other benchmarks on skylake (so no -march=haswell from above) (https://github.com/google/gipfeli) under perf, we're seeing that the number of executed instructions is almost the same and the number of branches is up by 0.8%, but the number of branch mispredictions is 3.2 times larger. I don't think this is related to the pass manager at all; rather, the new code is causing significant regressions on the processor due to code layout and the branch predictor.

My proposal for now is that we revert this and continue working on performance analysis and try to figure out what we can do to make this more of a win outside of the vectorizer. :)

My proposal for now is that we revert this and continue working on performance analysis and try to figure out what we can do to make this more of a win outside of the vectorizer. :)

Looking forward to seeing a path forward here.

loops not rotated due to the header size

Increase that limit? Any disadvantages?

bjope added a subscriber: bjope. Aug 25 2020, 2:04 AM

loops not rotated due to the header size

Increase that limit? Any disadvantages?

Loops not being rotated is actually a symptom.
While it may be worthwhile to increase that threshold,
that isn't really the fix, because then we still have hoisted the code.
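
To illustrate the pattern in question, here is a hypothetical C sketch (not taken from the affected codebase): both successors of the header's conditional begin with the same computation, so SimplifyCFG's HoistThenElseCodeToIf() hoists it into the header, and a header grown this way can exceed LoopRotate's duplication threshold.

```c
/* Hypothetical sketch: in the unrotated form, the loop header ends in a
 * conditional whose two successors start with the same instruction
 * (t = a[i] * 2).  Hoisting that multiply into the header enlarges the
 * header, which can push it past LoopRotate's size threshold so the loop
 * stays unrotated. */
static int sum_even_odd(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] % 2 == 0) {
            int t = a[i] * 2;  /* common leading instruction ...        */
            sum += t;
        } else {
            int t = a[i] * 2;  /* ... duplicated in the other successor */
            sum -= t;
        }
    }
    return sum;
}
```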

But according to https://reviews.llvm.org/D65060#1596212, the hoisting should be done...

LoopVectorize kind of expects invariant code to be hoisted out of the loop body before vectorization.

@fhahn

But according to https://reviews.llvm.org/D65060#1596212, the hoisting should be done...

LoopVectorize kind of expects invariant code to be hoisted out of the loop body before vectorization.

@fhahn

I'm sorry but i completely fail to see the connection here.

fhahn added a comment. Nov 30 2020, 5:09 AM

Just a heads up: we are seeing a notable 3.5% regression caused by this change for ARM64 with LTO for CINT2006/473.astar.

I had a first look and it seems like there's some bad interactions with the LTO pipeline. The problem in this case is the following: we now rotate a loop just before the vectorizer which requires duplicating a function call in the preheader when compiling the individual files. But this then stops inlining during LTO. I'll look into whether we should avoid rotating such loops in the 'prepare-for-lto' stage.
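
Roughly, the situation looks like this (a hand-written sketch with hypothetical names, not the actual astar code): rotation duplicates the header's exit check, including its call, into the preheader guard.

```c
static unsigned call_count = 0;

/* Stand-in for an out-of-line function from another TU; in the real LTO
 * scenario the duplicated call enlarges the caller and can stop it from
 * being inlined during the LTO link. */
static unsigned size(void) { call_count++; return 3; }

static double data[3] = {1.0, 2.0, 3.0};

/* The rotated form, written out by hand: the guard in the "preheader"
 * duplicates the call that also sits in the latch. */
static void process_rotated(double *x) {
    unsigned n = size();          /* duplicated guard call (preheader) */
    if (n == 0)
        return;
    unsigned i = 0;
    do {
        x[i] *= 2.0;
        i++;
    } while (i < size());         /* original call (latch) */
}
```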

Also, what is the status of the regressions reported by @echristo ? Was it possible to address them?

Just a heads up: we are seeing a notable 3.5% regression caused by this change for ARM64 with LTO for CINT2006/473.astar.

I had a first look and it seems like there's some bad interactions with the LTO pipeline. The problem in this case is the following: we now rotate a loop just before the vectorizer which requires duplicating a function call in the preheader when compiling the individual files. But this then stops inlining during LTO. I'll look into whether we should avoid rotating such loops in the 'prepare-for-lto' stage.

Interesting.

Also, what is the status of the regressions reported by @echristo ? Was it possible to address them?

See rGbb7d3af1139c36270bc9948605e06f40e4c51541.

Just wanted to add to @fhahn that we're seeing this too, as an even more notable 8% regression (same benchmark, different hardware). Florian's explanation matches what I'm seeing, but I hadn't had a chance to confirm what was happening.

I think Florian's diagnosis is worth looking into for sure - and probably
is the right solution. For the regressions I had they could largely be
explained by the branch predictor on a micro level, I didn't see some of
the large scale LTO level regressions in mine.

I had a first look and it seems like there's some bad interactions with the LTO pipeline. The problem in this case is the following: we now rotate a loop just before the vectorizer which requires duplicating a function call in the preheader when compiling the individual files. But this then stops inlining during LTO. I'll look into whether we should avoid rotating such loops in the 'prepare-for-lto' stage.

Hi Florian, did you get anywhere with this? If I understand correctly, generally, the vectorizer wants to see the loops in rotated form, so trying to avoid rotations in certain cases might not be the right approach? At the same time, I'm not sure what else we could do! Any thoughts appreciated, and, if this has slipped down your priorities I'm happy to pick this up. It's been bumped up on my list!

fhahn added a comment. Jan 7 2021, 6:16 AM

I had a first look and it seems like there's some bad interactions with the LTO pipeline. The problem in this case is the following: we now rotate a loop just before the vectorizer which requires duplicating a function call in the preheader when compiling the individual files. But this then stops inlining during LTO. I'll look into whether we should avoid rotating such loops in the 'prepare-for-lto' stage.

Hi Florian, did you get anywhere with this? If I understand correctly, generally, the vectorizer wants to see the loops in rotated form, so trying to avoid rotations in certain cases might not be the right approach? At the same time, I'm not sure what else we could do! Any thoughts appreciated, and, if this has slipped down your priorities I'm happy to pick this up. It's been bumped up on my list!

I put up D94232 with the approach I outlined earlier. I tried to summarize why I think this should make sense in the description. It would be great if you could give it a try on a set of benchmarks.

FYI, I tracked down another ~50% regression in one of our benchmarks to this change. It boils down to failing to vectorize the loop below (or here https://clang.godbolt.org/z/sbjd8Wshx). I put up D100329, which always allows hoisting when only a terminator needs to be hoisted (it effectively replaces the original terminator).

double clamp(double v) {
  if (v < 0.0)
    return 0.0;
  if (v > 6.0)
    return 6.0;
  return v;
}

void loop(double* X, double *Y) {
  for (unsigned i = 0; i < 20000; i++) {
    X[i] = clamp(Y[i]);
  }
}