This is an archive of the discontinued LLVM Phabricator instance.

[Pipelines] Perform hoisting prior to GVN
ClosedPublic

Authored by nikic on Jul 28 2023, 6:21 AM.

Details

Summary

We currently only enable hoisting in the last SimplifyCFG run of the function simplification pipeline. In particular this happens after GVN, which means that instructions that were identical (and thus hoistable) prior to GVN might no longer be so after it ran, due to equality replacements (see the phase ordering test).

The history here is that D84108 restricted hoisting to the very late (module optimization) pipeline only. Then D101468 went back on that, and also performed it at the end of function simplification. This patch goes one step further and allows it prior to GVN. Importantly, we still don't perform hoisting before LoopRotate, which was the original motivation for delaying it.
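To illustrate the mechanism described above, here is a hypothetical C sketch (invented names, not the actual phase-ordering test): both branches contain the same expression, so a SimplifyCFG run before GVN could hoist it, but once GVN's equality replacement rewrites one branch, the branches are no longer identical and hoisting is blocked.

```c
/* Hypothetical sketch: both branches load p[i], so SimplifyCFG could
 * hoist the load if it runs first. GVN knows i == 0 holds in the true
 * branch and rewrites p[i] to p[0] there; after that the two loads are
 * no longer identical instructions, and hoisting no longer applies. */
int pick(const int *p, int i) {
  if (i == 0)
    return p[i]; /* GVN's equality replacement turns this into p[0] */
  else
    return p[i];
}
```

Either way the function returns `p[i]`; the point is purely about which IR shape SimplifyCFG sees when it runs.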

Diff Detail

Event Timeline

nikic created this revision. Jul 28 2023, 6:21 AM
Herald added a project: Restricted Project. Jul 28 2023, 6:21 AM
nikic requested review of this revision. Jul 28 2023, 6:21 AM

Thanks for this much-needed fix! It looks like the motivating example will be vectorized. I will apply the patch downstream and report back soon.

artagnon accepted this revision. Jul 28 2023, 7:52 AM

LGTM based on benchmark runs.

Small improvement on upstream with no regressions:

Program                                       upstream   upstream+nikic diff
test-suite...xternal/CoreMark/coremark.test      1207474    1207474      0.0%
test-suite...ternal/Embench/aha-mont64.test      2688858    2688858      0.0%
test-suite...External/Embench/wikisort.test       890574     890574      0.0%
test-suite :: External/Embench/ud.test           1997172    1997172      0.0%
test-suite...xternal/Embench/statemate.test       115932     115932      0.0%
test-suite :: External/Embench/st.test           3313229    3313229      0.0%
test-suite... :: External/Embench/slre.test      2051440    2051440      0.0%
test-suite...al/Embench/sglib-combined.test      2478118    2478118      0.0%
test-suite... External/Embench/qrduino.test      2359823    2359823      0.0%
test-suite...ternal/Embench/primecount.test      3926955    3926955      0.0%
test-suite...External/Embench/picojpeg.test      2218597    2218597      0.0%
test-suite...External/Embench/nsichneu.test      2309564    2309564      0.0%
test-suite...nal/Embench/nettle-sha256.test      1757617    1757617      0.0%
test-suite...ternal/Embench/nettle-aes.test      2241404    2241404      0.0%
test-suite...:: External/Embench/nbody.test      3050149    3050149      0.0%
test-suite...: External/Embench/minver.test       505186     505186      0.0%
test-suite...: External/Embench/md5sum.test      1155937    1155937      0.0%
test-suite...ernal/Embench/matmult-int.test      1270436    1270436      0.0%
test-suite...xternal/Embench/huffbench.test      1768954    1768954      0.0%
test-suite :: External/Embench/edn.test          2229782    2229782      0.0%
test-suite...:: External/Embench/cubic.test      5219564    5219564      0.0%
test-suite...:: External/Embench/crc32.test      2962144    2962144      0.0%
test-suite...marks/Dhrystone/dhrystone.test       226782     226782      0.0%
test-suite... External/Embench/tarfind.test   2139922189 2139605677     -0.0%

No changes downstream:

Program                                       master  master+nikic diff
test-suite...xternal/CoreMark/coremark.test   1028821 1028821       0.0%
test-suite...ternal/Embench/aha-mont64.test   2478307 2478307       0.0%
test-suite...External/Embench/wikisort.test    878540  878540       0.0%
test-suite :: External/Embench/ud.test        1163530 1163530       0.0%
test-suite... External/Embench/tarfind.test    984403  984403       0.0%
test-suite...xternal/Embench/statemate.test     78614   78614       0.0%
test-suite :: External/Embench/st.test        3329531 3329531       0.0%
test-suite... :: External/Embench/slre.test   1723908 1723908       0.0%
test-suite...al/Embench/sglib-combined.test   2359553 2359553       0.0%
test-suite... External/Embench/qrduino.test   2132286 2132286       0.0%
test-suite...ternal/Embench/primecount.test   3059021 3059021       0.0%
test-suite...External/Embench/picojpeg.test   1820298 1820298       0.0%
test-suite...External/Embench/nsichneu.test   2303401 2303401       0.0%
test-suite...nal/Embench/nettle-sha256.test   1303960 1303960       0.0%
test-suite...ternal/Embench/nettle-aes.test   1729373 1729373       0.0%
test-suite...:: External/Embench/nbody.test   3043072 3043072       0.0%
test-suite...: External/Embench/minver.test    398686  398686       0.0%
test-suite...: External/Embench/md5sum.test    720967  720967       0.0%
test-suite...ernal/Embench/matmult-int.test    990512  990512       0.0%
test-suite...xternal/Embench/huffbench.test   1566315 1566315       0.0%
test-suite :: External/Embench/edn.test       1161388 1161388       0.0%
test-suite...:: External/Embench/cubic.test        44      44       0.0%
test-suite...:: External/Embench/crc32.test   2788051 2788051       0.0%
test-suite...marks/Dhrystone/dhrystone.test     98109   98109       0.0%
This revision is now accepted and ready to land. Jul 28 2023, 7:52 AM
artagnon added inline comments. Jul 28 2023, 8:56 AM
llvm/test/Transforms/PhaseOrdering/gvn-replacement-vs-hoist.ll
Line 2

I just noticed that you're using the old PM. If you change this to opt -S -passes='default<O3>', this example will be vectorized as well.

aeubanks accepted this revision. Jul 28 2023, 10:16 AM

lg

llvm/test/Transforms/PhaseOrdering/gvn-replacement-vs-hoist.ll
Line 2

they should be equivalent now

nikic added inline comments. Jul 28 2023, 10:33 AM
llvm/test/Transforms/PhaseOrdering/gvn-replacement-vs-hoist.ll
Line 2

This test doesn't use a target triple to keep it target-independent. If you specify one, then yes, it does get vectorized (well, depending on target of course).

This revision was automatically updated to reflect the committed changes.
dmgreen added a subscriber: dmgreen. Aug 8 2023, 7:32 AM

Sorry, I had to revert this as it caused some large regressions. I'm not sure if I have the greatest reason though. The most important is probably x264 from SPEC, where I'm seeing a 9-10% regression depending on the specific configuration. There are also some downstream embedded benchmarks that run under LTO and can be quite fragile to phase-ordering changes. Some of those get better, some get worse.

I think the main change in x264 is from this function: https://github.com/MasterNobody/x264/blob/eaa68fad9e5d201d42fde51665f2d137ae96baf0/common/quant.c#L64, which requires some vectorization to perform well. https://godbolt.org/z/oKG5onsna

nikic added a comment. Aug 8 2023, 8:18 AM

The issue for x264 is that the earlier hoisting reduces the size of the loop body enough for it to be unrolled, and SLP doesn't vectorize this case. This is the IR before slp-vectorize: https://gist.github.com/nikic/6ccde4b5320f7f4a6c5e7bd3ff8db1f6

Quite annoying, because the earlier hoisting would otherwise be a clear improvement for that test case.

nikic added a comment. Aug 8 2023, 8:38 AM

I've added a phase ordering test for quant_4x4 in https://github.com/llvm/llvm-project/commit/b92711931daf45426a23c082c732ddfbf6d02814. Moving the SimplifyCFG run until after LPM2 should avoid this issue, not sure whether that would cause other problems.

> The issue for x264 is that the earlier hoisting reduces the size of the loop body enough for it to be unrolled, and SLP doesn't vectorize this case. This is the IR before slp-vectorize: https://gist.github.com/nikic/6ccde4b5320f7f4a6c5e7bd3ff8db1f6
>
> Quite annoying, because the earlier hoisting would otherwise be a clear improvement for that test case.

Yeah, I agree. It doesn't feel like the greatest reason, but phase ordering can be difficult, and this is a fairly large regression in what people keep telling me is a fairly important benchmark.

> I've added a phase ordering test for quant_4x4 in https://github.com/llvm/llvm-project/commit/b92711931daf45426a23c082c732ddfbf6d02814. Moving the SimplifyCFG run until after LPM2 should avoid this issue, not sure whether that would cause other problems.

Thanks. I guess there are maybe two things going on. The unrolling of loops with conditions creates something that is not easy to deal with in the SLP vectorizer (multiple blocks), and the SLP vectorizer doesn't know how to add runtime alias checks.

I can try moving the simplifycfg run later and see what effect it has. There might also be something we can do about not unrolling loops with multiple blocks that are not going to be simplified.

nikic added a comment. Aug 8 2023, 12:14 PM

> I've added a phase ordering test for quant_4x4 in https://github.com/llvm/llvm-project/commit/b92711931daf45426a23c082c732ddfbf6d02814. Moving the SimplifyCFG run until after LPM2 should avoid this issue, not sure whether that would cause other problems.
>
> Thanks. I guess there are maybe two things going on. The unrolling of loops with conditions creates something that is not easy to deal with in the SLP vectorizer (multiple blocks), and the SLP vectorizer doesn't know how to add runtime alias checks.
>
> I can try moving the simplifycfg run later and see what effect it has. There might also be something we can do about not unrolling loops with multiple blocks that are not going to be simplified.

Yeah, fully unrolling a loop with interior (that is, non-exit) control flow is probably significantly less profitable than unrolling without interior control flow, because the latter gives you straight-line code (or at least an extended basic block, if there are multiple exits). That seems like something we should take into account during cost modelling, but currently don't.
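To make the interior-control-flow point concrete, here is a hypothetical kernel loosely in the shape of x264's quant_4x4 (names and constants invented, not the real x264 code): the if/else inside the body means that fully unrolling by 4 still leaves four data-dependent branches rather than straight-line code.

```c
/* Hypothetical quant-like kernel. The branch inside the loop is
 * interior (non-exit) control flow: full unrolling replicates the
 * branch once per iteration instead of yielding a single extended
 * basic block, which is the shape SLP has trouble vectorizing. */
void quantish(int dct[4], const int bias[4], const int mf[4]) {
  for (int i = 0; i < 4; i++) {
    if (dct[i] > 0)
      dct[i] = (bias[i] + dct[i]) * mf[i];
    else
      dct[i] = -((bias[i] - dct[i]) * mf[i]);
  }
}
```

A cost model that accounts for this would discount full unrolling when each unrolled copy retains a branch, since the straight-line-code benefit never materializes.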

Hello. I ran some tests for moving the SimplifyCFG run to after LPM2. The version that added an extra run of SimplifyCFG after LPM2 probably performed the best and showed the fewest decreases. There might be other cases that just don't come up in these benchmarks, of course. The version that moved the existing LoopSimplify did a little worse in places and showed more ups and downs.
But I am worried that it feels a little fragile for the quant case to rely on not optimizing prior to unrolling. In the long run we can probably simplify it, unroll, and rely on the SLP vectorizer if it can add runtime checks, but I was also looking at whether it makes sense to disable unrolling in some cases.

> Yeah, fully unrolling a loop with interior (that is, non-exit) control flow is probably significantly less profitable than unrolling without interior control flow, because the latter gives you straight-line code (or at least an extended basic block, if there are multiple exits). That seems like something we should take into account during cost modelling, but currently don't.

I had a look at disabling unrolling in the FullUnrollPass for loops with multiple exits. The existing heuristics are often quite rough. It was probably mostly OK, but I wasn't super enthused by the results, and apparently in cases like this it is important to be able to unroll the outer loop so the inner loop can vectorize: https://godbolt.org/z/n6W8Tbdrf.