This is an archive of the discontinued LLVM Phabricator instance.

[LoopRotate] Add PrepareForLTO stage, avoid rotating with inline cands (WIP).
ClosedPublic

Authored by fhahn on Jan 7 2021, 6:14 AM.

Details

Summary

WIP because it is still a bit rough around the edges.

D84108 exposed a bad interaction between inlining and loop-rotation
during regular LTO, which is causing notable regressions in at least
CINT2006/473.astar.

The problem boils down to: we now rotate a loop just before the vectorizer
which requires duplicating a function call in the preheader when compiling
the individual files ('prepare for LTO'). But this then prevents further
inlining of the function during LTO.

This patch tries to resolve this issue by making LoopRotate more
conservative with respect to rotating loops that have inline-able calls
during the 'prepare for LTO' stage.

I think this change intuitively improves the current situation in
general. Loop-rotate tries hard to avoid creating headers that are 'too
big'. At the moment, it assumes all inlining already happened and the
cost of duplicating a call is equal to just doing the call. But with LTO,
inlining also happens during full LTO and it is possible that a previously
duplicated call is actually a huge function which gets inlined
during LTO.

From the perspective of LV, not much should change overall. Most loops
calling user-provided functions won't get vectorized to start with
(unless we can infer that the function does not touch memory, has no
other side effects). If we do not inline the 'inline-able' call during
the LTO stage, we merely delayed loop-rotation & vectorization. If we
inline during LTO, chances should be very high that the inlined code is
itself vectorizable or the user call was not vectorizable to start with.

There could of course be scenarios where we inline a sufficiently large
function with code not profitable to vectorize, which would have be
vectorized earlier (by scalarzing the call). But even in that case,
there probably is no big performance impact, because it should be mostly
down to the cost-model to reject vectorization in that case. And then
the version with scalarized calls should also not be beneficial. In a way,
LV should have strictly more information after inlining and make more
accurate decisions (barring cost-model issues).

There is of course plenty of room for things to go wrong unexpectedly,
so we need to keep a close look at actual performance and address any
follow-up issues.

I took a look at the impact on statistics for
MultiSource/SPEC2000/SPEC2006. There are a few benchmarks with fewer
loops rotated, but no change to the number of loops vectorized.

I still need to think on how to best test the change. Perhaps add a
-prepare-for-lto flag to LoopRotate?

Diff Detail

Event Timeline

fhahn created this revision.Jan 7 2021, 6:14 AM
fhahn requested review of this revision.Jan 7 2021, 6:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 7 2021, 6:14 AM

Thanks, Florian. I'll give the patch a run through our benchmarking.

it assumes all inlining already happened and the cost of duplicating a call is equal to just doing the call

This was the bit of information I was missing, thanks! With that, I think your proposed change makes a lot of sense.

@fhahn I'm afraid I'm seeing runtime and verification errors on perlbench, gcc, gobmk, and astar in SPEC INT 2006. I'm guessing something later in the pipe really doesn't like it when certain loops aren't rotated?

fhahn updated this revision to Diff 315338.Jan 8 2021, 3:46 AM

@fhahn I'm afraid I'm seeing runtime and verification errors on perlbench, gcc, gobmk, and astar in SPEC INT 2006. I'm guessing something later in the pipe really doesn't like it when certain loops aren't rotated?

Hmmm, I think it would be very surprising, if not rotating some loops would expose any serious mis-compiles in SPEC. When building with -Oz for example, hardly any loops are rotated.

What configuration are you using? I couldn't reproduce any SPEC2006 INT failures on either x86/ARM64 with -O3 -flto.

Our flags are -mcpu=native -O3 -fomit-frame-pointer -flto on a Neoverse-N1 based system, so -mcpu=neoverse-n1 should do the same thing.

It looks like this might be an issue on our end, possibly unrelated to the patch.

Yep, definitely a problem on our end; I've got similar symptoms in other runs. Sorry about the noise, and I will try to get some perf data once we sort this out.

Turns out the failing benchmarks are due to a miscompilation with -g3 (which we add to profiled runs). The patch does seem to make that miscompilation more likely. I'll try to reduce that separately, but at least I'll have some performance numbers shortly.

So on SPEC 2006 this fixes astar (+9.6%) as well as shakes things up enough to "fix" h264ref (+9.0%, see D93946). Other notable changes are libquantum (+3.2%) and omnetpp (-2.8%). Geomean is +1.5%.

On SPEC 2017 there is still a problem with omnetpp_r (runtime error), which I'll have a look at next, but is otherwise neutral on geomean with only minor wobbles (+1%, -0.5% at most) on individual benchmarks.

With this having a substantial impact on performance, I'm quite keen to help get this in soon, ideally before the LLVM 12 branch, so please let me know if there is anything else I can do to help.

So on SPEC 2006 this fixes astar (+9.6%) as well as shakes things up enough to "fix" h264ref (+9.0%, see D93946). Other notable changes are libquantum (+3.2%) and omnetpp (-2.8%). Geomean is +1.5%.

Thank you very much for running the numbers!

Just to double check, positive here means good (I assume increase in score)?

On SPEC 2017 there is still a problem with omnetpp_r (runtime error), which I'll have a look at next, but is otherwise neutral on geomean with only minor wobbles (+1%, -0.5% at most) on individual benchmarks.

With this having a substantial impact on performance, I'm quite keen to help get this in soon, ideally before the LLVM 12 branch, so please let me know if there is anything else I can do to help.

Sounds good! I think the theory behind the change should mean it should be quite safe and the data so far seems encouraging; I can confirm similar improvements for astar on our tracking.

It would probably good to get this in well ahead of the 12.0 branch, so it has a chance to make it through various downstream perf-tracking systems.

So on SPEC 2006 this fixes astar (+9.6%) as well as shakes things up enough to "fix" h264ref (+9.0%, see D93946). Other notable changes are libquantum (+3.2%) and omnetpp (-2.8%). Geomean is +1.5%.

Thank you very much for running the numbers!

Just to double check, positive here means good (I assume increase in score)?

Yes, positive is good :)

On SPEC 2017 there is still a problem with omnetpp_r (runtime error), which I'll have a look at next, but is otherwise neutral on geomean with only minor wobbles (+1%, -0.5% at most) on individual benchmarks.

With this having a substantial impact on performance, I'm quite keen to help get this in soon, ideally before the LLVM 12 branch, so please let me know if there is anything else I can do to help.

Sounds good! I think the theory behind the change should mean it should be quite safe and the data so far seems encouraging; I can confirm similar improvements for astar on our tracking.

It would probably good to get this in well ahead of the 12.0 branch, so it has a chance to make it through various downstream perf-tracking systems.

That's great! Agreed that this could do with a little soak on main before 12 branches, which IIRC is scheduled for 26 January?

fhahn added a comment.Jan 17 2021, 1:33 PM

On SPEC 2017 there is still a problem with omnetpp_r (runtime error), which I'll have a look at next, but is otherwise neutral on geomean with only minor wobbles (+1%, -0.5% at most) on individual benchmarks.

With this having a substantial impact on performance, I'm quite keen to help get this in soon, ideally before the LLVM 12 branch, so please let me know if there is anything else I can do to help.

Sounds good! I think the theory behind the change should mean it should be quite safe and the data so far seems encouraging; I can confirm similar improvements for astar on our tracking.

It would probably good to get this in well ahead of the 12.0 branch, so it has a chance to make it through various downstream perf-tracking systems.

That's great! Agreed that this could do with a little soak on main before 12 branches, which IIRC is scheduled for 26 January?

Yes, I think the only thing outstanding with respect to testing is the omnetpp_r failure. Can you confirm if that is caused by the patch or not?

And any review of the patch would be appreciated of course!

Turns out the failing benchmarks are due to a miscompilation with -g3 (which we add to profiled runs). The patch does seem to make that miscompilation more likely. I'll try to reduce that separately, but at least I'll have some performance numbers shortly.

This is worrying, -g3 should not cause miscompiles w/o this patch.

[...]

Yes, I think the only thing outstanding with respect to testing is the omnetpp_r failure. Can you confirm if that is caused by the patch or not?

And any review of the patch would be appreciated of course!

(I thought I posted this on Friday, but looks like I forgot). The omnetpp_r crash on SPEC 2017 has not been seen again in subsequent runs so this might have been a fluke. Performance on the new runs looks fine, no change on omnetpp_r.

In term of review, could you add a test case? A -prepare-for-lto flag for loop rotate for testing purposes seems reasonable to me.

Turns out the failing benchmarks are due to a miscompilation with -g3 (which we add to profiled runs). The patch does seem to make that miscompilation more likely. I'll try to reduce that separately, but at least I'll have some performance numbers shortly.

This is worrying, -g3 should not cause miscompiles w/o this patch.

I agree, but I don't think it should block the patch going in. I'm trying to reduce it on one of the benchmark. It only shows up with LTO enabled, so it is quite slow going. (I'm having a look at the LLVM test suite as well to see if it shows up at all there, hopefully in a slightly smaller form.)

fhahn updated this revision to Diff 317374.Jan 18 2021, 9:01 AM

[...]

Yes, I think the only thing outstanding with respect to testing is the omnetpp_r failure. Can you confirm if that is caused by the patch or not?

And any review of the patch would be appreciated of course!

(I thought I posted this on Friday, but looks like I forgot). The omnetpp_r crash on SPEC 2017 has not been seen again in subsequent runs so this might have been a fluke. Performance on the new runs looks fine, no change on omnetpp_r.

In term of review, could you add a test case? A -prepare-for-lto flag for loop rotate for testing purposes seems reasonable to me.

Sounds good, I added a new -rotation-prepare-for-lto option and a test using it.

Turns out the failing benchmarks are due to a miscompilation with -g3 (which we add to profiled runs). The patch does seem to make that miscompilation more likely. I'll try to reduce that separately, but at least I'll have some performance numbers shortly.

This is worrying, -g3 should not cause miscompiles w/o this patch.

I agree, but I don't think it should block the patch going in. I'm trying to reduce it on one of the benchmark. It only shows up with LTO enabled, so it is quite slow going. (I'm having a look at the LLVM test suite as well to see if it shows up at all there, hopefully in a slightly smaller form.)

I cannot reproduce this with regular LTO, as I mentioned earlier I do not think this is caused by the patch.

sanwou01 accepted this revision.Jan 18 2021, 9:49 AM

Looks good to me!

This revision is now accepted and ready to land.Jan 18 2021, 9:49 AM
nikic added a subscriber: nikic.Jan 18 2021, 11:18 AM

Does the same issue apply to ThinLTO?

fhahn added a comment.Jan 19 2021, 2:14 AM

Does the same issue apply to ThinLTO?

Possibly, but unfortunately I do not have performance data at the moment to back that up and not enough insight into the specifics of the ThinLTO pipeline. But it should be easy to switch to using it for ThinLTO as well, once we are confident it works as expected for LTO on a large range of workloads.

xbolva00 added a comment.EditedJan 19 2021, 2:30 AM

Does the same issue apply to ThinLTO?

D84108 mentions “size impact varies; for ThinLTO it's actually an improvement” so it sounds like the original patch missed just regular lto pipeline.

cc @lebedev.ri

fhahn added a comment.Jan 19 2021, 6:41 AM

Wow.
What about runtime performance, eg. mafft? No changes?

Thanks for the heads-up.

Unfortunately I think this is an unintended bug. It looks like the original code did not skip calls that are not actual calls. With the patch, LoopRotation would also be skipped if the header contains intrinsic calls, like @llvm.dbg.* calls. Those should not block rotation though (there's nothing to inline). I pushed a fix, so we do not consider calls that are not lowered to calls as inline candidates: 3747b69b5312

I expect this should revert back the improvements with -g.

nikic added a comment.Jan 19 2021, 8:29 AM

@fhahn Thanks! I did suspect that this is a bit too good to be true...

The code size changes after your fix look much more sensible: http://llvm-compile-time-tracker.com/compare.php?from=4d3081331ad854e0bff5032c818ec6414fb974c0&to=3747b69b531299f7a2a0289b8a59ac7234e47d4f&stat=size-text

ormris added a subscriber: ormris.Jan 19 2021, 10:10 AM