
[SimplifyCFG][LoopRotate] SimplifyCFG: disable common instruction hoisting by default, enable late in pipeline
ClosedPublic

Authored by lebedev.ri on Jul 18 2020, 1:30 PM.

Details

Summary

I've been looking at missed vectorizations in one codebase. One particular thing
that stands out is that some of the loops reach the vectorizer in a rather mangled
form, with weird PHIs, and some of the loops aren't even in rotated form.

After taking a more detailed look, it turned out that this happened because the
loops' headers were too big by then. It is evident that SimplifyCFG's common code
hoisting transform is at fault, because the pattern it handles is precisely the
unrotated loop basic-block structure.

Surprisingly, SimplifyCFGOpt::HoistThenElseCodeToIf() is enabled by default
and is always run, unlike its friend, the common code sinking transform,
SinkCommonCodeFromPredecessors(), which is not enabled by default and is only
run once, very late in the pipeline.

I'm proposing to harmonize this and disable common code hoisting until late
in the pipeline. The definition of "late" may vary; currently I've picked the
same point as for code sinking, but I suppose we could enable it as soon as
right after loop rotation happens.

Experimentation shows that this does indeed, unsurprisingly, help: more loops
get rotated, although other issues remain elsewhere.

Now, this undoubtedly seriously shakes up phase ordering. It will be a mixed
bag in terms of compile time, run time, and code size.
Since we no longer aggressively hoist+deduplicate common code early, we don't
pay the price of that hoisting (which wasn't big). Not hoisting may allow more
loops to be rotated, so we pay that price instead. That, in turn, may enable
all the transforms that require canonical (rotated) loop form, including but
not limited to vectorization, so we pay that too. And in general, no
deduplication means more [duplicate] instructions going through the
optimizations. But hoisting still runs late in the pipeline, so some of them
will be caught then.

As per the benchmarks I've run, this is mostly within the noise;
there are some small improvements and some small regressions. One big regression
I saw, I fixed in rG8d487668d09fb0e4e54f36207f07c1480ffabbfd, but I'm sure
this will expose many more pre-existing missed optimizations, as usual :S

llvm-compile-time-tracker.com thoughts on this:
http://llvm-compile-time-tracker.com/compare.php?from=e40315d2b4ed1e38962a8f33ff151693ed4ada63&to=c8289c0ecbf235da9fb0e3bc052e3c0d6bff5cf9&stat=instructions

  • this does regress compile-time by +0.5% geomean (unsurprisingly)
  • size impact varies; for ThinLTO it's actually an improvement

If we look at stats before/after diffs, some excerpts:

  • RawSpeed (the target)
    • -14 (-73.68%) loops not rotated due to the header size (yay)
    • -272 (-0.67%) "Number of live out of a loop variables" - good for vectorizer
    • -3937 (-64.19%) common instructions hoisted
    • +561 (+0.06%) x86 asm instructions
    • -2 basic blocks
    • +2418 (+0.11%) IR instructions
  • vanilla test-suite + RawSpeed + darktable
    • -36396 (-65.29%) common instructions hoisted
    • +1676 (+0.02%) x86 asm instructions
    • +662 (+0.06%) basic blocks
    • +4395 (+0.04%) IR instructions

Thoughts?

Diff Detail

Event Timeline

lebedev.ri marked an inline comment as done. Jul 18 2020, 1:31 PM
lebedev.ri added inline comments.
llvm/test/Transforms/PhaseOrdering/loop-rotation-vs-common-code-hoisting.ll
1

This is the test that actually shows the phase ordering problem.

nikic added a comment. Jul 18 2020, 1:43 PM

this does regress compile-time by +0.5% geomean (unsurprisingly)

FWIW, a large part of the regressions seems to come down to a single file that regresses big time:

CMakeFiles/lencod.dir/context_ini.c.o 3917M 7067M (+80.41%)

Probably worth taking a look whether that can be mitigated.


Thanks, I will take a look; I just really wanted to finally post this after two dry runs due to rG0fdcca07ad2c0bdc2cdd40ba638109926f4f513b/rG8d487668d09fb0e4e54f36207f07c1480ffabbfd :)


I presently don't know what is actually going on, but [old] GVN is to blame there; it goes from

Total Execution Time: 0.9114 seconds (0.9113 wall clock)

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
 0.1325 ( 14.7%)   0.0000 (  0.0%)   0.1325 ( 14.5%)   0.1325 ( 14.5%)  Global Value Numbering

to

===-------------------------------------------------------------------------===
  Total Execution Time: 1.7953 seconds (1.7952 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.9874 ( 55.9%)   0.0095 ( 32.7%)   0.9970 ( 55.5%)   0.9970 ( 55.5%)  Global Value Numbering

Stats delta:

statistic name                               baseline  proposed        Δ          %    abs(%)
memdep.NumCacheNonLocalPtr                       2804    114188  +111384  +3972.33%  3972.33%
memdep.NumUncacheNonLocalPtr                     2638     91937   +89299  +3385.10%  3385.10%
memory-builtins.ObjectVisitorArgument             380      4606    +4226  +1112.11%  1112.11%
simplifycfg.NumHoistCommonCode                     49       223     +174   +355.10%   355.10%
basicaa.SearchTimes                             99200    445902  +346702   +349.50%   349.50%
local.NumRemoved                                   35       137     +102   +291.43%   291.43%
simplifycfg.NumHoistCommonInstrs                  378       803     +425   +112.43%   112.43%
mem2reg.NumLocalPromoted                            3         1       -2    -66.67%    66.67%
indvars.NumElimExt                                106       148      +42    +39.62%    39.62%
licm.NumSunk                                     3063      3855     +792    +25.86%    25.86%
instcombine.NumDeadInst                           428       525      +97    +22.66%    22.66%
jump-threading.NumThreads                          18        22       +4    +22.22%    22.22%
licm.NumHoisted                                   453       552      +99    +21.85%    21.85%
gvn.NumGVNLoad                                    111       135      +24    +21.62%    21.62%
simplifycfg.NumSinkCommonInstrs                   396       312      -84    -21.21%    21.21%
instcombine.NumSunkInst                            10        12       +2    +20.00%    20.00%
simplifycfg.NumSimpl                             1195      1424     +229    +19.16%    19.16%
early-cse.NumCSELoad                               97       114      +17    +17.53%    17.53%
sroa.MaxUsesPerAllocaPartition                    120       138      +18    +15.00%    15.00%
sroa.NumAllocaPartitionUses                       770       884     +114    +14.81%    14.81%
instcombine.NumCombined                          1328      1522     +194    +14.61%    14.61%
sroa.NumDeleted                                   828       942     +114    +13.77%    13.77%
gvn.NumGVNInstr                                   303       343      +40    +13.20%    13.20%
simplifycfg.NumSinkCommonCode                     188       167      -21    -11.17%    11.17%
bdce.NumRemoved                                    18        16       -2    -11.11%    11.11%
licm.NumMovedLoads                                 77        69       -8    -10.39%    10.39%
gvn.NumGVNSimpl                                    31        28       -3     -9.68%     9.68%
lcssa.NumLCSSA                                    293       265      -28     -9.56%     9.56%
instsimplify.NumSimplified                         54        49       -5     -9.26%     9.26%
correlated-value-propagation.NumPhis               25        27       +2     +8.00%     8.00%
memdep.NumCacheDirtyNonLocalPtr                    26        28       +2     +7.69%     7.69%
gvn.NumGVNBlocks                                   58        55       -3     -5.17%     5.17%
loop-vectorize.LoopsAnalyzed                       58        55       -3     -5.17%     5.17%
scalar-evolution.NumTripCountsComputed            264       251      -13     -4.92%     4.92%
memdep.NumCacheCompleteNonLocalPtr                 41        42       +1     +2.44%     2.44%
assume-queries.NumAssumeQueries                   616       604      -12     -1.95%     1.95%
instcombine.NegatorNumValuesVisited              2333      2289      -44     -1.89%     1.89%
instcombine.NegatorTotalNegationsAttempted       2333      2289      -44     -1.89%     1.89%
scalar-evolution.NumTripCountsNotComputed           0         1       +1      0.00%     0.00%


It appears that IsValueFullyAvailableInBlock() is at fault here; it is only called from PerformLoadPRE(), which is only called from processNonLocalLoad().
Not yet sure what specifically is going wrong.

I'd suggest splitting this in two parts:

  • NFC patch with new options & test updates, but preserving current behavior;
  • One-liner patch to switch default values.

This is needed to make revert easier in case if it exposes some problems.


Sure, done.


Rewrote it (D84181); sadly, that wasn't it.
Here's the IR before GVN, without/with this patch:


The slowdown of -gvn on that input is observable:

$ /builddirs/llvm-project/build-Clang10-Release/bin/opt -gvn -time-passes /tmp/lencod-context_ini-pre-gvn-old.bc -o /dev/null 2>&1 | grep "Global Value Numbering"
   0.0911 ( 87.0%)   0.0008 ( 77.1%)   0.0920 ( 86.9%)   0.0920 ( 86.9%)  Global Value Numbering
$ /builddirs/llvm-project/build-Clang10-Release/bin/opt -gvn -time-passes /tmp/lencod-context_ini-pre-gvn-new.bc -o /dev/null 2>&1 | grep "Global Value Numbering"
   0.7440 ( 98.1%)   0.0001 ( 90.3%)   0.7441 ( 98.1%)   0.7441 ( 98.1%)  Global Value Numbering

As can be observed via various means, we were originally doing 54 calls to llvm::GVN::PerformLoadPRE(),
and now do 286, i.e. 5.2x, which is not that far off from the 8x time regression.
We seem to be spending all the time in llvm::MemoryDependenceResults::getNonLocalPointerDepFromBB().
Nothing really immediately stands out to me there, I'm afraid.
Maybe someone more familiar with that analysis can spot the (preexisting) issue, if there is one.

nikic added a comment. Jul 25 2020, 8:27 AM

@lebedev.ri This issue definitely looks familiar, I've seen a few other cases where a large fraction of compile-time is spent in non-local memdep analysis in GVN.

Some thoughts:

  • Ultimately, memdep analysis spends most of its time in alias analysis, and we can make that part faster by using BatchAA. This has a non-trivial positive impact in general: http://llvm-compile-time-tracker.com/compare.php?from=bc79ed7e16003c8550da8710b321d6d5d4243faf&to=1ab8f1b475d2b9d464add10d1a7f87f18c073fb0&stat=instructions For this particular test case it drops instructions from ~4.2M to ~3.8M.
  • I think fundamentally, what memdep does and the way it does it, cannot be easily improved. This is probably something that MemorySSA-based NewGVN improves on (at least in theory -- in practice MemorySSA has a not so stellar compile-time record).
  • What can be improved, though, are the cutoffs. Non-local memdep analysis has two primary cutoffs: BlockScanLimit=100, which limits the number of instructions inspected in a single block, and BlockNumberLimit=1000, which limits the number of inspected blocks. Taken together, this means that a single memdep query can inspect up to 100,000 instructions, which is frankly insane. A quick test with -memdep-block-number-limit=100 drops the instructions for the test case to ~1.8M. It may be reasonable to either reduce the default block limit here, or to introduce an additional "total instruction limit" for non-local queries.

I think we're focusing on the wrong thing here.
Are the preexisting memdep perf problems the only issue everyone has with this?
Does nobody have any other, perhaps fundamental, concerns here?
If that's the only issue, are the memdep perf problems a blocker here?

Some thoughts:

Looking forward to such a patch.

  • I think fundamentally, what memdep does and the way it does it, cannot be easily improved. This is probably something that MemorySSA-based NewGVN improves on (at least in theory -- in practice MemorySSA has a not so stellar compile-time record).

Unless there's some global cache missing, that is; but I'd think it would already be there, so yeah.

  • What can be improved though are the cutoffs. Non-local memdep analysis has two primary cutoffs: The BlockScanLimit=100, which limits the amount of instructions inspected in a single block, and the BlockNumberLimit=1000, which limits the amount of inspected blocks. Taken together, this means that a single memdep query can inspect up to 100000 instructions, which is frankly insane. A quick test with -memdep-block-number-limit=100 drops the instructions for the test case to ~1.8M.

Right.

It may be reasonable to either reduce the default block limit here, or to introduce an additional "total instruction limit" for non-local queries.

On first glance, it doesn't look right to me to lower the existing limits; they don't seem *that* high.
What I think we are indeed missing is some more global budget. I'll take a look.


After doing some stats, D84609

mkazantsev accepted this revision. Jul 28 2020, 5:12 AM

Fine by me, but not sure if @nikic has any questions.

This revision is now accepted and ready to land. Jul 28 2020, 5:12 AM

Just a drive-by general comment...
Before we added SimplifyCFGOptions, there was a named "LateSimplifyCFG" pass that implicitly enabled more transforms. There was push back on llvm-dev about passes that use options to change behavior because that made it harder to test the pass managers.

Given the known problems with SimplifyCFG (it does poison-unsafe transforms, it does not reliably reach fix-point, it makes questionable use of the cost model, etc.), the best long-term solution would be to break it up into multiple passes - canonicalization-only early and potentially various optimization passes for later in the pipeline. It's a tough job because (as shown in the data here), that's probably going to uncover lots of other problems. If we now have a fairly reliable way to test that with llvm-compile-time-tracker and there's work underway to find/fuzz our way to better pass pipeline configs, then we should be able to finally untangle this knot.

Fine by me, but not sure if @nikic has any questions.

@nikic ?

nikic added a comment. Jul 29 2020, 9:33 AM


No further concerns from my side, go for it :)

Well okay then.
@mkazantsev @nikic @spatel thank you for the reviews!

This will totally light up in every tracking graph..

This revision was landed with ongoing or failed builds. Jul 29 2020, 10:07 AM
This revision was automatically updated to reflect the committed changes.

So, we're seeing several significant (20-30%) regressions due to this in various different library benchmarks. Usually around things that are compression/decompression loops, but also other places.

I'm uncertain what we can do, but I think this might need a wider range of discussion rather than the phabricator review.

Would you mind terribly reverting and starting a thread on llvm-dev about this? I think that'll give us an opportunity to weigh some of the tradeoffs in a wider space.

-eric

One more follow up here:

One set of benchmarks we're seeing this in are also in the llvm test harness:

llvm_multisource_tsvc.TSVC-ControlFlow-flt and llvm_multisource_misc.Ptrdist-yacr2

in an FDO mode (so we have some decent branch hints for the code).

You should be able to duplicate at least some of the slowdown there.


That sadly doesn't tell me much.
Can you please provide the reproduction steps, and ideally the IR with/without this change?


We saw performance improvements from this patch. They were a little bit up-and-down, but definitely an improvement overall.


It should be as straightforward as -O3 -fexperimental-new-pass-manager; if that doesn't do it, we'll look a bit more. Those tests are in the llvm test-suite, so they should be easy to get as part of your checkout.

Ideally for a change as large as this you'd have run these benchmarks yourself - they're fairly useful here and I think I'd have expected them if I were reviewing.

So, we're seeing several significant (20-30%) regressions due to this in various different library benchmarks. Usually around things that are compression/decompression loops, but also other places.

I'm uncertain what we can do, but I think this might need a wider range of discussion rather than the phabricator review.

Would you mind terribly reverting and starting a thread on llvm-dev about this? I think that'll give us an opportunity to weigh some of the tradeoffs in a wider space.

I think I'm still here. I'll comment a bit more in reply to David below.

We saw performance improvements from this patch. They were a little bit up-and-down, but definitely an improvement overall.

Right. I think you're still in scientific computing yes? The internal and external benchmarks are fairly straightforward scalar code. I.e. if they were vectorizable we'd probably have done so ages ago for the internal and the external are things like bytecode interpreters etc. So the patch is penalizing things that just aren't vectorizable for things that might be - and I'm not sure that's the right tradeoff to make here. Let's start an llvm-dev discussion and bring in a few more people as well. I'd be very interested in Kit's perspective here too.

lebedev.ri added a comment. Edited Aug 20 2020, 8:40 AM


It should be as straightforward as -O3 -fexperimental-new-pass-manager, if that isn't then we'll look a bit more. Those tests are in the llvm test-suite so they should be easy to get as part of your checkout.

No, I mean FDO. Can you either point me at the docs where said incantation for the test-suite is spelled out, or provide the commands necessary to reproduce?


"Hey, let's penalize some code by +30% runtime because that improves some other code by -10%" obviously won't fly,
so unless you want to start the discussion, I want to first understand what is actually going on.
I'm going to guess that such huge regressions likely mean a lack of inlining.

I think you're still in scientific computing yes?

Embedded computing, actually. Low-power Arm devices, but I'm always interested in all uses of Arm devices. I have been working in the vectorizer a lot lately, and a number of the benchmarks we usually run are DSP-style codes, so quite similar to HPC. But the improvements I saw included CPUs without any vectorization. Code size was also flat in all, which I took as a good sign that the patch wasn't bad for general codegen.

The developer policy is quite clear when it comes to addressing regressions, but my vote would be for keeping the change :)


Was that with old-pm, or new-pm?

I would have been using old, the default. I think we still see some performance/codesize problems with switch to new.

lebedev.ri added a comment. Edited Aug 21 2020, 5:42 AM


That might be the pattern then: just another old-pm vs new-pm difference to track for the switch.
But I'm not seeing an umbrella bug for performance regressions under https://bugs.llvm.org/show_bug.cgi?id=46649?



Uhm, for N=9, I'm afraid I'm not seeing anything horrible like that:

$ /repositories/llvm-test-suite/utils/compare.py --lhs-name old-newpm --rhs-name new-newpm llvm-test-suite-old-NewPM/res-*.json vs llvm-test-suite-new-NewPM/res-*.json --merge-average --filter-hash
Tests: 41
Metric: exec_time

Program                                        old-newpm new-newpm diff 
 test-suite...marks/Ptrdist/yacr2/yacr2.test     0.77      0.74    -4.0%
 test-suite...s/Ptrdist/anagram/anagram.test     0.76      0.78     3.0%
 test-suite...flt/InductionVariable-flt.test     2.90      2.95     1.9%
 test-suite...lFlow-flt/ControlFlow-flt.test     3.48      3.44    -1.4%
 test-suite...ing-dbl/NodeSplitting-dbl.test     3.09      3.13     1.1%
 test-suite...dbl/InductionVariable-dbl.test     4.02      4.05     1.0%
 test-suite...pansion-dbl/Expansion-dbl.test     2.49      2.51     0.7%
 test-suite...pansion-flt/Expansion-flt.test     1.56      1.57     0.5%
 test-suite...lt/CrossingThresholds-flt.test     2.86      2.87     0.5%
 test-suite...lFlow-dbl/ControlFlow-dbl.test     4.05      4.03    -0.5%
 test-suite...ing-flt/Equivalencing-flt.test     1.13      1.14     0.3%
 test-suite...ow-dbl/GlobalDataFlow-dbl.test     3.20      3.21     0.3%
 test-suite.../Benchmarks/Ptrdist/bc/bc.test     0.40      0.40     0.3%
 test-suite...C/Packing-flt/Packing-flt.test     2.90      2.91     0.3%
 test-suite.../Benchmarks/Ptrdist/ft/ft.test     1.05      1.05     0.3%
 Geomean difference                                                 0.1%
       old-newpm  new-newpm       diff
count  41.000000  41.000000  41.000000
mean   2.761491   2.765264   0.001355 
std    1.382334   1.383351   0.009172 
min    0.400356   0.401544  -0.040124 
25%    2.158678   2.161544  -0.000219 
50%    2.836133   2.839078   0.000974 
75%    3.204467   3.214600   0.002932 
max    6.865056   6.881911   0.029819 

@echristo, does that match what you were seeing for the test-suite? Is the Ptrdist/anagram regression the one?


We didn't see any codesize regression, but any scalar code with a hot loop started regressing fairly strongly. Particularly things like compression that were already hand vectorized a bit.

The developer policy is quite clear when it comes to addressing regressions, but my vote would be for keeping the change :)

Understood.

I would have been using old, the default. I think we still see some performance/codesize problems with switch to new.

Codesize I wouldn't be surprised. Inlining is much different with the new pass manager.


The two tests in the llvm test-suite we're seeing the largest regressions on are:

llvm_multisource_tsvc.TSVC-ControlFlow-flt -0.34% -> -19.33%
llvm_multisource_misc.Ptrdist-yacr2 (I can't recall the difference here)

Those are running on a similar system to yours (but fully quieted/no other code running) with -O3 -fexperimental-new-pass-manager -march=haswell -fno-strict-aliasing -fno-omit-frame-pointer and -mprefer-vector-width=128 as the option set.

In addition we're seeing 10% regressions in benchmarks with https://github.com/google/snappy code (benchmarks aren't public unfortunately) and other scalar code with loops whether they have some hand vectorization or not.

Looking at some other benchmarks on skylake (so no -march=haswell from above) (https://github.com/google/gipfeli) under perf, we're seeing that the number of executed instructions is almost the same and the number of branches is up by 0.8%, but the number of branch mispredictions is 3.2 times larger. I don't think this is related to the pass manager at all; rather, the new code is causing significant regressions on the processor due to code layout and the branch predictor.

My proposal for now is that we revert this and continue working on performance analysis and try to figure out what we can do to make this more of a win outside of the vectorizer. :)

My proposal for now is that we revert this and continue working on performance analysis and try to figure out what we can do to make this more of a win outside of the vectorizer. :)

Looking forward to seeing a path forward here.

loops not rotated due to the header size

Increase that limit? Any disadvantages?

bjope added a subscriber: bjope. Aug 25 2020, 2:04 AM

loops not rotated due to the header size

Increase that limit? Any disadvantages?

Loops not being rotated is actually a symptom.
While it may be worthwhile to increase that threshold,
that isn't really the fix, because then we still have hoisted the code.
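
To illustrate the pattern in question, here is a hypothetical C sketch (not taken from the affected codebase): both successors of the header's conditional begin with the same computation, so SimplifyCFG's HoistThenElseCodeToIf() hoists it into the header, and a header grown this way can exceed LoopRotate's duplication threshold.

```c
/* Hypothetical sketch: in the unrotated form, the loop header ends in a
 * conditional whose two successors start with the same instruction
 * (t = a[i] * 2).  Hoisting that multiply into the header enlarges the
 * header, which can push it past LoopRotate's size threshold so the loop
 * stays unrotated. */
static int sum_even_odd(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] % 2 == 0) {
            int t = a[i] * 2;  /* common leading instruction ...        */
            sum += t;
        } else {
            int t = a[i] * 2;  /* ... duplicated in the other successor */
            sum -= t;
        }
    }
    return sum;
}
```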

But according to https://reviews.llvm.org/D65060#1596212, the hoisting should be done...

LoopVectorize kind of expects invariant code to be hoisted out of the loop body before vectorization.

@fhahn

But according to https://reviews.llvm.org/D65060#1596212, the hoisting should be done...

LoopVectorize kind of expects invariant code to be hoisted out of the loop body before vectorization.

@fhahn

I'm sorry but i completely fail to see the connection here.

fhahn added a comment. Nov 30 2020, 5:09 AM

Just a heads up: we are seeing a notable 3.5% regression caused by this change for ARM64 with LTO for CINT2006/473.astar.

I had a first look and it seems like there's some bad interactions with the LTO pipeline. The problem in this case is the following: we now rotate a loop just before the vectorizer which requires duplicating a function call in the preheader when compiling the individual files. But this then stops inlining during LTO. I'll look into whether we should avoid rotating such loops in the 'prepare-for-lto' stage.
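
Roughly, the situation looks like this (a hand-written sketch with hypothetical names, not the actual astar code): rotation duplicates the header's exit check, including its call, into the preheader guard.

```c
static unsigned call_count = 0;

/* Stand-in for an out-of-line function from another TU; in the real LTO
 * scenario the duplicated call enlarges the caller and can stop it from
 * being inlined during the LTO link. */
static unsigned size(void) { call_count++; return 3; }

static double data[3] = {1.0, 2.0, 3.0};

/* The rotated form, written out by hand: the guard in the "preheader"
 * duplicates the call that also sits in the latch. */
static void process_rotated(double *x) {
    unsigned n = size();          /* duplicated guard call (preheader) */
    if (n == 0)
        return;
    unsigned i = 0;
    do {
        x[i] *= 2.0;
        i++;
    } while (i < size());         /* original call (latch) */
}
```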

Also, what is the status of the regressions reported by @echristo ? Was it possible to address them?

Just a heads up: we are seeing a notable 3.5% regression caused by this change for ARM64 with LTO for CINT2006/473.astar.

I had a first look and it seems like there's some bad interactions with the LTO pipeline. The problem in this case is the following: we now rotate a loop just before the vectorizer which requires duplicating a function call in the preheader when compiling the individual files. But this then stops inlining during LTO. I'll look into whether we should avoid rotating such loops in the 'prepare-for-lto' stage.

Interesting.

Also, what is the status of the regressions reported by @echristo ? Was it possible to address them?

See rGbb7d3af1139c36270bc9948605e06f40e4c51541.

Just wanted to add to @fhahn that we're seeing this too, as an even more notable 8% regression (same benchmark, different hardware). Florian's explanation matches what I'm seeing, but I hadn't had a chance to confirm what was happening.

I think Florian's diagnosis is worth looking into for sure - and probably
is the right solution. For the regressions I had they could largely be
explained by the branch predictor on a micro level, I didn't see some of
the large scale LTO level regressions in mine.

I had a first look and it seems like there's some bad interactions with the LTO pipeline. The problem in this case is the following: we now rotate a loop just before the vectorizer which requires duplicating a function call in the preheader when compiling the individual files. But this then stops inlining during LTO. I'll look into whether we should avoid rotating such loops in the 'prepare-for-lto' stage.

Hi Florian, did you get anywhere with this? If I understand correctly, generally, the vectorizer wants to see the loops in rotated form, so trying to avoid rotations in certain cases might not be the right approach? At the same time, I'm not sure what else we could do! Any thoughts appreciated, and, if this has slipped down your priorities I'm happy to pick this up. It's been bumped up on my list!

fhahn added a comment. Jan 7 2021, 6:16 AM

I had a first look and it seems like there's some bad interactions with the LTO pipeline. The problem in this case is the following: we now rotate a loop just before the vectorizer which requires duplicating a function call in the preheader when compiling the individual files. But this then stops inlining during LTO. I'll look into whether we should avoid rotating such loops in the 'prepare-for-lto' stage.

Hi Florian, did you get anywhere with this? If I understand correctly, generally, the vectorizer wants to see the loops in rotated form, so trying to avoid rotations in certain cases might not be the right approach? At the same time, I'm not sure what else we could do! Any thoughts appreciated, and, if this has slipped down your priorities I'm happy to pick this up. It's been bumped up on my list!

I put up D94232 with the approach I outlined earlier. I tried to summarize why I think this should make sense in the description. It would be great if you could give it a try on a set of benchmarks.

FYI, I tracked down another ~50% regression in one of our benchmarks to this change. It boils down to failing to vectorize the loop below (or here https://clang.godbolt.org/z/sbjd8Wshx). I put up D100329, which always allows hoisting when only a terminator needs to be hoisted (it effectively replaces the original terminator).

double clamp(double v) {
  if (v < 0.0)
    return 0.0;
  if (v > 6.0)
    return 6.0;
  return v;
}

void loop(double* X, double *Y) {
  for (unsigned i = 0; i < 20000; i++) {
    X[i] = clamp(Y[i]);
  }
}