This is an archive of the discontinued LLVM Phabricator instance.

[Passes] Run sinking/hoisting in SimplifyCFG earlier.
ClosedPublic

Authored by fhahn on Apr 28 2021, 9:15 AM.

Details

Summary

Hoisting and sinking instructions out of conditional blocks enables
additional vectorization by:

  1. Executing memory accesses unconditionally.
  2. Reducing the number of instructions that need predication.
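
For instance, consider a loop body in which both sides of a conditional store to the same address (a hypothetical C++ sketch, not one of the patch's test cases). Sinking merges the two stores into a single unconditional store fed by a select, removing both the conditional memory access and the need for predication:

```cpp
// Hypothetical example: both branches store to the same address, so
// SimplifyCFG can sink the store below the conditional and select
// between the two values. The loop body then becomes branch-free and
// straightforward for LoopVectorize (vector compare + select + store).
void saturate(int *dst, const int *src, int n, int limit) {
  for (int i = 0; i < n; ++i) {
    if (src[i] > limit)
      dst[i] = limit;    // store in the "then" block
    else
      dst[i] = src[i];   // matching store in the "else" block
  }
  // After sinking, the body is effectively:
  //   dst[i] = src[i] > limit ? limit : src[i];
}
```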

After disabling early hoisting / sinking, we miss out on a few
vectorization opportunities. One of those is causing a ~10% performance
regression in one of the Geekbench benchmarks on AArch64.

This patch tries to recover the regression by running hoisting/sinking
as part of a SimplifyCFG run after LoopRotate and before LoopVectorize.
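
As a rough sketch of what such a run can look like with the new pass manager (illustrative only; the helper name below is made up and the committed diff may place things differently), SimplifyCFGOptions exposes chainable setters for the hoisting and sinking behaviour:

```cpp
#include "llvm/IR/PassManager.h"
#include "llvm/Transforms/Scalar/SimplifyCFG.h"

// Hypothetical helper: schedule a SimplifyCFG run with common-instruction
// hoisting and sinking enabled, to be placed after LoopRotate in the
// function simplification pipeline (LoopVectorize runs later).
void addLateHoistSinkPass(llvm::FunctionPassManager &FPM) {
  FPM.addPass(llvm::SimplifyCFGPass(llvm::SimplifyCFGOptions()
                                        .hoistCommonInsts(true)
                                        .sinkCommonInsts(true)));
}
```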

Note that the legacy pass manager runs LoopRotate again just before
vectorization, with no SimplifyCFG run in between, so the sinking/hoisting
may affect that later LoopRotate run. But the impact should be limited,
and the benefit of hoisting/sinking at this stage should outweigh the
risk of not rotating.

The compile-time impact looks slightly favorable in most cases.
http://llvm-compile-time-tracker.com/compare.php?from=2ea7fb7b1c045a7d60fcccf3df3ebb26aa3699e5&to=e58b4a763c691da651f25996aad619cb3d946faf&stat=instructions

NewPM-O3: geomean -0.19%
NewPM-ReleaseThinLTO: geomean -0.54%
NewPM-ReleaseLTO-g: geomean -0.03%

A few benchmarks see a notable increase, but there are also some
improvements.

Alternative to D101290.

Diff Detail

Event Timeline

fhahn created this revision.Apr 28 2021, 9:15 AM
fhahn requested review of this revision.Apr 28 2021, 9:15 AM
Herald added a project: Restricted Project.Apr 28 2021, 9:15 AM
nikic added a comment.Apr 28 2021, 9:42 AM

Thanks! I like this a lot more than the LoopVectorize variant. I also think it makes sense that this happens at the end of the "function simplification" pipeline, which means that the inliner will see the hoisted/sunk IR, which should result in a more accurate cost (as sinkable/hoistable instructions will not be counted multiple times).

It might make sense to do this even earlier, as I suspect that passes like GVN would also benefit from hoisted/sunk IR. But this looks like a reasonable starting point, at least to me.
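
(As a hypothetical illustration of the double-counting point, not code from the patch: before sinking, a naive instruction count sees the common tail of both branches twice; after SimplifyCFG sinks it, the callee is smaller and the inline cost estimate drops accordingly.)

```cpp
// Hypothetical callee: the "+ 7" and the return are common to both
// branches. Pre-sinking, a per-branch count charges the add twice;
// post-sinking it is a single add after a select, so the function
// looks cheaper (and more accurately so) to the inliner.
int scale(int x, bool hi) {
  if (hi)
    return x * 31 + 7;
  return x * 17 + 7;
  // After sinking: return (hi ? x * 31 : x * 17) + 7;
}
```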

> Thanks! I like this a lot more than the LoopVectorize variant. I also think it makes sense that this happens at the end of the "function simplification" pipeline, which means that the inliner will see the hoisted/sunk IR, which should result in a more accurate cost [...]

Nonono, this is actually the opposite of what should happen.
We will then over-inline, and grow the size even more.

Do we have any SimplifyCFG run before LV but after the inliner?

This is even visible in the http://llvm-compile-time-tracker.com/compare.php?from=2ea7fb7b1c045a7d60fcccf3df3ebb26aa3699e5&to=e58b4a763c691da651f25996aad619cb3d946faf&stat=size-total

> Nonono, this is actually the opposite of what should happen.
> We will then over-inline, and grow the size even more.

I don't follow. It makes the inlining cost more accurate, so if that results in over-inlining, then the issue lies with the inlining cost model, not with this change. My understanding is that the inliner should see functions in their maximally simplified form (and prior to size-increasing optimizations like runtime unrolling and vectorization), so that it can make the most accurate decisions. Intentionally crippling the function simplification pipeline to get less inlining seems rather backward. Taken ad absurdum, it would mean we shouldn't simplify functions prior to inlining at all.

> This is even visible in the http://llvm-compile-time-tracker.com/compare.php?from=2ea7fb7b1c045a7d60fcccf3df3ebb26aa3699e5&to=e58b4a763c691da651f25996aad619cb3d946faf&stat=size-total

This looks pretty normal for a phase-ordering change. SPASS at O3 is up 0.46%, lencod with ThinLTO is down 0.66%, and the NewPM geomeans are within 0.1% either way.

I see. So I take it D101231, if restricted to function-terminating blocks, makes more sense to you now, in general?

fhahn added a comment.Apr 29 2021, 1:51 AM

> I see. So I take it D101231, if restricted to function-terminating blocks, makes more sense to you now, in general?

I think I might be missing how D101231 relates to the current patch.

Is the concern that we may hoist/sink instructions from cold blocks into hot ones? For hoisting this should definitely not happen, because we only hoist instructions common to both paths. The same is mostly true for sinking, although sinking can also apply to just a subset of predecessors.

In any case, IIUC the size reduction should be accurate, and I fail to see how this would lead to over-inlining. If I am missing anything, it would be great if you could elaborate in a bit more detail on the cases you are concerned about.
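
(A hypothetical illustration of why hoisting cannot move work from cold blocks into hot ones: the hoisted computation must already be present on both sides of the branch, so it executes on every path through the conditional either way.)

```cpp
// Hypothetical example: x * 5 is computed on both sides of the branch,
// so SimplifyCFG may hoist it above the conditional. Any path that
// reaches the branch already evaluated it, so hoisting adds no work
// to either successor.
int f(int x, bool c) {
  int t;
  if (c)
    t = x * 5 + 1;   // common subterm x * 5 ...
  else
    t = x * 5 - 1;   // ... appears on the other side too
  return t;
  // After hoisting: int m = x * 5; t = c ? m + 1 : m - 1;
}
```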

lebedev.ri accepted this revision.Apr 29 2021, 4:36 AM

I'm actually in favor of doing this pre-inliner, because that will simplify an upcoming patch, should I post it :)
If everyone is okay with the extra inlining, I think this is fine.
I checked, and since this runs post-LoopRotate, we could do this.
So LG.

llvm/lib/Passes/PassBuilder.cpp
1325–1326

Drop comment

This revision is now accepted and ready to land.Apr 29 2021, 4:36 AM
This revision was landed with ongoing or failed builds.Apr 30 2021, 4:35 AM
This revision was automatically updated to reflect the committed changes.
fhahn added inline comments.Apr 30 2021, 4:36 AM
llvm/lib/Passes/PassBuilder.cpp
1325–1326

done, thanks!