Page MenuHomePhabricator

[SLP][LTO][WIP]Allow full SLP in LTO only at link time.
Needs ReviewPublic

Authored by ABataev on Aug 27 2021, 9:04 AM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None

Diff Detail

Event Timeline

ABataev created this revision.Aug 27 2021, 9:04 AM
ABataev requested review of this revision.Aug 27 2021, 9:04 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptAug 27 2021, 9:04 AM

I think there is something really wrong with vectorzer passes in LTO pipelines.
Can you say whether the problem you are observing is in ThinLTO, Full LTO, or both?

I think there is something really wrong with vectorzer passes in LTO pipelines.
Can you say whether the problem you are observing is in ThinLTO, Full LTO, or both?

I saw it in Full LTO but suppose we have a similar problem in ThinLTO. SLP vectorizer at compile-time tries to vectorize using small vectors at it may affect other optimizations at link time (e.g. after inlining we may try to vectorize using large vector sizes etc.). This is just a preliminary attempt to see how can we fix this early optimization in SLP.

I think there is something really wrong with vectorzer passes in LTO pipelines.
Can you say whether the problem you are observing is in ThinLTO, Full LTO, or both?

I saw it in Full LTO but suppose we have a similar problem in ThinLTO. SLP vectorizer at compile-time tries to vectorize using small vectors at it may affect other optimizations at link time (e.g. after inlining we may try to vectorize using large vector sizes etc.). This is just a preliminary attempt to see how can we fix this early optimization in SLP.

Aha, so full lto. That is consistent with the phase ordering dilemma @spatel discovered: D102002
IMO workarounding it in the pass isn't the right course of action. Such workarounds tend to stick around.

I think there is something really wrong with vectorzer passes in LTO pipelines.
Can you say whether the problem you are observing is in ThinLTO, Full LTO, or both?

I saw it in Full LTO but suppose we have a similar problem in ThinLTO. SLP vectorizer at compile-time tries to vectorize using small vectors at it may affect other optimizations at link time (e.g. after inlining we may try to vectorize using large vector sizes etc.). This is just a preliminary attempt to see how can we fix this early optimization in SLP.

Aha, so full lto. That is consistent with the phase ordering dilemma @spatel discovered: D102002

Aha, do I understand correctly that he tries to add a flag(s) that we have a compile without LTO, compile at LTO and link at LTO? Or something else? Or he just tries to reorder passes depending whether we're in LTO or not in LTO?

IMO workarounding it in the pass isn't the right course of action. Such workarounds tend to stick around.

I agree, I thought about a pipeline fix, this is just a temp solution to check how it affects the performance. It gets important especially for upcoming non-power-2 vectorization, which may cause regressions with LTO.

Aha, so full lto. That is consistent with the phase ordering dilemma @spatel discovered: D102002

Aha, do I understand correctly that he tries to add a flag(s) that we have a compile without LTO, compile at LTO and link at LTO? Or something else? Or he just tries to reorder passes depending whether we're in LTO or not in LTO?

We found that there were differences between regular and LTO for the passes invoked, their orderings, and parameters used to enable extra optimizations. (There was also inconsistency between new and old pass manager, but we can probably just focus on NPM now.)
I suspect that almost none of those differences were intentional - people just made changes for whatever pipeline they were interested in at the time and didn't realize there was divergence.
So we now have things refactored with this note:

/// TODO: Should LTO cause any differences to this set of passes?
void PassBuilder::addVectorPasses(OptimizationLevel Level,
                                  FunctionPassManager &FPM, bool IsFullLTO) {

So if there really is a reason for something to be different with LTO, it's set up to make that easily visible at least. :)
I made a couple of small fixes in there already, but basically any place where we do something differently for FullLTO should be investigated.

Aha, so full lto. That is consistent with the phase ordering dilemma @spatel discovered: D102002

Aha, do I understand correctly that he tries to add a flag(s) that we have a compile without LTO, compile at LTO and link at LTO? Or something else? Or he just tries to reorder passes depending whether we're in LTO or not in LTO?

We found that there were differences between regular and LTO for the passes invoked, their orderings, and parameters used to enable extra optimizations. (There was also inconsistency between new and old pass manager, but we can probably just focus on NPM now.)
I suspect that almost none of those differences were intentional - people just made changes for whatever pipeline they were interested in at the time and didn't realize there was divergence.
So we now have things refactored with this note:

/// TODO: Should LTO cause any differences to this set of passes?
void PassBuilder::addVectorPasses(OptimizationLevel Level,
                                  FunctionPassManager &FPM, bool IsFullLTO) {

So if there really is a reason for something to be different with LTO, it's set up to make that easily visible at least. :)
I made a couple of small fixes in there already, but basically any place where we do something differently for FullLTO should be investigated.

Do I understand correctly that your patch just reorders passes? Because I need a bit different. We need to run SLP only for the widest possible VF at compile time and run SLP at full (for all possible VFs) only at link time.
Early optimization using small VF may affect other passes(alias analysis, loads/stores/allocas elimination, Loop Vectorization, SLP itself, etc.) and we need to run SLP at full only at link time.

nikic added a subscriber: nikic.Aug 27 2021, 12:13 PM

@ABataev The pipeline already distinguishes between pre-link and post-link optimization pipelines, see e.g. the flag that gets passed to LoopRotate to control rotation of loops with calls (https://github.com/llvm/llvm-project/blob/2f69c82cec1ae05b4fdcef4ac48f48e9e2bad32b/llvm/lib/Passes/PassBuilder.cpp#L760). You'd probably want to do something similar here.

Though TBH I'm surprised that we perform vectorization in the pre-link pipelines at all, I'd have assumed that this only gets done in the LTO step.

@ABataev The pipeline already distinguishes between pre-link and post-link optimization pipelines, see e.g. the flag that gets passed to LoopRotate to control rotation of loops with calls (https://github.com/llvm/llvm-project/blob/2f69c82cec1ae05b4fdcef4ac48f48e9e2bad32b/llvm/lib/Passes/PassBuilder.cpp#L760). You'd probably want to do something similar here.

Probably, will check, thanks for the link.

Though TBH I'm surprised that we perform vectorization in the pre-link pipelines at all, I'd have assumed that this only gets done in the LTO step.

No, also at the compile time. Most probably, just nobody looked at it.

Herald added a project: Restricted Project. · View Herald TranscriptJun 16 2022, 1:12 AM