This is an archive of the discontinued LLVM Phabricator instance.

[Pipelines] Perform hoisting prior to GVN
ClosedPublic

Authored by nikic on Jul 28 2023, 6:21 AM.

Details

Summary

We currently only enable hoisting in the last SimplifyCFG run of the function simplification pipeline. In particular this happens after GVN, which means that instructions that were identical (and thus hoistable) prior to GVN might no longer be so after it ran, due to equality replacements (see the phase ordering test).

The history here is that D84108 restricted hoisting to the very late (module optimization) pipeline only. Then D101468 went back on that, and also performed it at the end of function simplification. This patch goes one step further and allows it prior to GVN. Importantly, we still don't perform hoisting before LoopRotate, which was the original motivation for delaying it.
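To illustrate the mechanism described above, here is a hypothetical C sketch (invented names, not the actual phase-ordering test): both branches contain the same expression, so a SimplifyCFG run before GVN could hoist it, but once GVN's equality replacement rewrites one branch, the branches are no longer identical and hoisting is blocked.

```c
/* Hypothetical sketch: both branches load p[i], so SimplifyCFG could
 * hoist the load if it runs first. GVN knows i == 0 holds in the true
 * branch and rewrites p[i] to p[0] there; after that the two loads are
 * no longer identical instructions, and hoisting no longer applies. */
int pick(const int *p, int i) {
  if (i == 0)
    return p[i]; /* GVN's equality replacement turns this into p[0] */
  else
    return p[i];
}
```

Either way the function returns `p[i]`; the point is purely about which IR shape SimplifyCFG sees when it runs.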

Diff Detail

Event Timeline

nikic created this revision. Jul 28 2023, 6:21 AM
Herald added a project: Restricted Project. Jul 28 2023, 6:21 AM
nikic requested review of this revision. Jul 28 2023, 6:21 AM

Thanks for this much-needed fix! It looks like the motivating example will be vectorized. I will apply the patch downstream and report back soon.

artagnon accepted this revision. Jul 28 2023, 7:52 AM

LGTM based on benchmark runs.

Small improvement on upstream with no regressions:

Program                                       upstream   upstream+nikic diff
test-suite...xternal/CoreMark/coremark.test      1207474    1207474      0.0%
test-suite...ternal/Embench/aha-mont64.test      2688858    2688858      0.0%
test-suite...External/Embench/wikisort.test       890574     890574      0.0%
test-suite :: External/Embench/ud.test           1997172    1997172      0.0%
test-suite...xternal/Embench/statemate.test       115932     115932      0.0%
test-suite :: External/Embench/st.test           3313229    3313229      0.0%
test-suite... :: External/Embench/slre.test      2051440    2051440      0.0%
test-suite...al/Embench/sglib-combined.test      2478118    2478118      0.0%
test-suite... External/Embench/qrduino.test      2359823    2359823      0.0%
test-suite...ternal/Embench/primecount.test      3926955    3926955      0.0%
test-suite...External/Embench/picojpeg.test      2218597    2218597      0.0%
test-suite...External/Embench/nsichneu.test      2309564    2309564      0.0%
test-suite...nal/Embench/nettle-sha256.test      1757617    1757617      0.0%
test-suite...ternal/Embench/nettle-aes.test      2241404    2241404      0.0%
test-suite...:: External/Embench/nbody.test      3050149    3050149      0.0%
test-suite...: External/Embench/minver.test       505186     505186      0.0%
test-suite...: External/Embench/md5sum.test      1155937    1155937      0.0%
test-suite...ernal/Embench/matmult-int.test      1270436    1270436      0.0%
test-suite...xternal/Embench/huffbench.test      1768954    1768954      0.0%
test-suite :: External/Embench/edn.test          2229782    2229782      0.0%
test-suite...:: External/Embench/cubic.test      5219564    5219564      0.0%
test-suite...:: External/Embench/crc32.test      2962144    2962144      0.0%
test-suite...marks/Dhrystone/dhrystone.test       226782     226782      0.0%
test-suite... External/Embench/tarfind.test   2139922189 2139605677     -0.0%

No changes downstream:

Program                                       master  master+nikic diff
test-suite...xternal/CoreMark/coremark.test   1028821 1028821       0.0%
test-suite...ternal/Embench/aha-mont64.test   2478307 2478307       0.0%
test-suite...External/Embench/wikisort.test    878540  878540       0.0%
test-suite :: External/Embench/ud.test        1163530 1163530       0.0%
test-suite... External/Embench/tarfind.test    984403  984403       0.0%
test-suite...xternal/Embench/statemate.test     78614   78614       0.0%
test-suite :: External/Embench/st.test        3329531 3329531       0.0%
test-suite... :: External/Embench/slre.test   1723908 1723908       0.0%
test-suite...al/Embench/sglib-combined.test   2359553 2359553       0.0%
test-suite... External/Embench/qrduino.test   2132286 2132286       0.0%
test-suite...ternal/Embench/primecount.test   3059021 3059021       0.0%
test-suite...External/Embench/picojpeg.test   1820298 1820298       0.0%
test-suite...External/Embench/nsichneu.test   2303401 2303401       0.0%
test-suite...nal/Embench/nettle-sha256.test   1303960 1303960       0.0%
test-suite...ternal/Embench/nettle-aes.test   1729373 1729373       0.0%
test-suite...:: External/Embench/nbody.test   3043072 3043072       0.0%
test-suite...: External/Embench/minver.test    398686  398686       0.0%
test-suite...: External/Embench/md5sum.test    720967  720967       0.0%
test-suite...ernal/Embench/matmult-int.test    990512  990512       0.0%
test-suite...xternal/Embench/huffbench.test   1566315 1566315       0.0%
test-suite :: External/Embench/edn.test       1161388 1161388       0.0%
test-suite...:: External/Embench/cubic.test        44      44       0.0%
test-suite...:: External/Embench/crc32.test   2788051 2788051       0.0%
test-suite...marks/Dhrystone/dhrystone.test     98109   98109       0.0%
This revision is now accepted and ready to land. Jul 28 2023, 7:52 AM
artagnon added inline comments. Jul 28 2023, 8:56 AM
llvm/test/Transforms/PhaseOrdering/gvn-replacement-vs-hoist.ll
Line 2

I just noticed that you're using the old PM. If you change this to opt -S -passes='default<O3>', this example will be vectorized as well.

aeubanks accepted this revision. Jul 28 2023, 10:16 AM

lg

llvm/test/Transforms/PhaseOrdering/gvn-replacement-vs-hoist.ll
Line 2

they should be equivalent now

nikic added inline comments. Jul 28 2023, 10:33 AM
llvm/test/Transforms/PhaseOrdering/gvn-replacement-vs-hoist.ll
Line 2

This test doesn't use a target triple to keep it target-independent. If you specify one, then yes, it does get vectorized (well, depending on target of course).

This revision was automatically updated to reflect the committed changes.
dmgreen added a subscriber: dmgreen. Aug 8 2023, 7:32 AM

Sorry, I had to revert this as it caused some large regressions. I'm not sure if I have the greatest reason though. The most important is probably x264 from SPEC, where I'm seeing a 9-10% regression depending on the specific configuration. There are also some downstream embedded benchmarks that run under LTO and can be quite fragile to phase-ordering changes. Some of those get better, some get worse.

I think the main change in x264 is from this function: https://github.com/MasterNobody/x264/blob/eaa68fad9e5d201d42fde51665f2d137ae96baf0/common/quant.c#L64, which requires some vectorization to perform well. https://godbolt.org/z/oKG5onsna

nikic added a comment. Aug 8 2023, 8:18 AM

The issue for x264 is that the earlier hoisting reduces the size of the loop body enough for it to be unrolled, and SLP doesn't vectorize this case. This is the IR before slp-vectorize: https://gist.github.com/nikic/6ccde4b5320f7f4a6c5e7bd3ff8db1f6

Quite annoying, because the earlier hoisting would otherwise be a clear improvement for that test case.

nikic added a comment. Aug 8 2023, 8:38 AM

I've added a phase ordering test for quant_4x4 in https://github.com/llvm/llvm-project/commit/b92711931daf45426a23c082c732ddfbf6d02814. Moving the SimplifyCFG run until after LPM2 should avoid this issue, not sure whether that would cause other problems.

> The issue for x264 is that the earlier hoisting reduces the size of the loop body enough for it to be unrolled, and SLP doesn't vectorize this case. This is the IR before slp-vectorize: https://gist.github.com/nikic/6ccde4b5320f7f4a6c5e7bd3ff8db1f6
>
> Quite annoying, because the earlier hoisting would otherwise be a clear improvement for that test case.

Yeah, I agree. It doesn't feel like the greatest reason, but phase ordering can be difficult, and this is a fairly large regression in what people keep telling me is a fairly important benchmark.

> I've added a phase ordering test for quant_4x4 in https://github.com/llvm/llvm-project/commit/b92711931daf45426a23c082c732ddfbf6d02814. Moving the SimplifyCFG run until after LPM2 should avoid this issue, not sure whether that would cause other problems.

Thanks. I guess there are maybe two things going on. The unrolling of loops with conditions creates something that is not easy to deal with in the SLP vectorizer (multiple blocks), and the SLP vectorizer doesn't know how to add runtime alias checks.

I can try moving the simplifycfg run later and see what effect it has. There might also be something we can do about not unrolling loops with multiple blocks that are not going to be simplified.

nikic added a comment. Aug 8 2023, 12:14 PM

> I've added a phase ordering test for quant_4x4 in https://github.com/llvm/llvm-project/commit/b92711931daf45426a23c082c732ddfbf6d02814. Moving the SimplifyCFG run until after LPM2 should avoid this issue, not sure whether that would cause other problems.
>
> Thanks. I guess there are maybe two things going on. The unrolling of loops with conditions creates something that is not easy to deal with in the SLP vectorizer (multiple blocks), and the SLP vectorizer doesn't know how to add runtime alias checks.
>
> I can try moving the simplifycfg run later and see what effect it has. There might also be something we can do about not unrolling loops with multiple blocks that are not going to be simplified.

Yeah, fully unrolling a loop with interior (that is, non-exit) control flow is probably significantly less profitable than unrolling without interior control flow, because the latter gives you straight-line code (or at least an extended basic block, if there are multiple exits). That seems like something we should take into account during cost modelling, but currently don't.
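To make the interior-control-flow point concrete, here is a hypothetical kernel loosely in the shape of x264's quant_4x4 (names and constants invented, not the real x264 code): the if/else inside the body means that fully unrolling by 4 still leaves four data-dependent branches rather than straight-line code.

```c
/* Hypothetical quant-like kernel. The branch inside the loop is
 * interior (non-exit) control flow: full unrolling replicates the
 * branch once per iteration instead of yielding a single extended
 * basic block, which is the shape SLP has trouble vectorizing. */
void quantish(int dct[4], const int bias[4], const int mf[4]) {
  for (int i = 0; i < 4; i++) {
    if (dct[i] > 0)
      dct[i] = (bias[i] + dct[i]) * mf[i];
    else
      dct[i] = -((bias[i] - dct[i]) * mf[i]);
  }
}
```

A cost model that accounts for this would discount full unrolling when each unrolled copy retains a branch, since the straight-line-code benefit never materializes.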

Hello. I ran some tests for moving the SimplifyCFG run to after LPM2. The version that added an extra run of SimplifyCFG after LPM2 probably performed the best and showed the fewest decreases. There might be other cases that just don't come up in these benchmarks, of course. The version that moved the existing LoopSimplify did a little worse in places and showed more ups and downs.
But I am worried that it feels a little fragile for the quant case to rely on not optimizing prior to unrolling. In the long run we can probably simplify it, unroll, and rely on the SLP vectorizer if it can add runtime checks, but I was also looking at whether it makes sense to disable unrolling in some cases.

> Yeah, fully unrolling a loop with interior (that is, non-exit) control flow is probably significantly less profitable than unrolling without interior control flow, because the latter gives you straight-line code (or at least an extended basic block, if there are multiple exits). That seems like something we should take into account during cost modelling, but currently don't.

I had a look at disabling unrolling in the FullUnrollPass for loops with multiple exits. The existing heuristics are often quite rough. It was probably mostly OK, but I wasn't super enthused by the results, and apparently in cases like this it is important to be able to unroll the outer loop so the inner loop can vectorize: https://godbolt.org/z/n6W8Tbdrf.