Page MenuHomePhabricator

[LV] Disable runtime unrolling for vectorized loops.
AcceptedPublic

Authored by fhahn on Dec 7 2021, 9:21 AM.

Details

Summary

This patch adds metadata to disable runtime unrolling to the vectorized
loop. If runtime unrolling/interleaving is considered profitable, LV
will interleave the loop directly. There should be no need to perform
runtime unrolling at a later stage.

Note that we already add metadata to disable runtime unrolling to the
scalar loop after vectorization.

The additional unrolling unnecessarily increases code size and compile
time. In addition to that we have several bug reports of unncessary
runtime unrolling for vectorized loops, e.g. PR40961

Compile-time improvements:

NewPM-O3: -2.08%
NewPM-ReleaseThinLTO: -1.09%
NewPM-ReleaseLTO-g: -1.59%

http://llvm-compile-time-tracker.com/compare.php?from=398dffd4ffc7cbc320e15207c8c04ca682d821c4&to=86c8f5499f5c9ef34d018491f8eb4364579dca16&stat=instructions

Diff Detail

Event Timeline

fhahn created this revision.Dec 7 2021, 9:21 AM
fhahn requested review of this revision.Dec 7 2021, 9:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 7 2021, 9:21 AM

Compile time is misleading.
What about run time impact?

Even if both of the unrollers are right as per their model
(LU duplicates whole loop body; while LU duplicates each instruction,
increasing live ranges, i believe), i'm mainly just worried
that two unroll strategies disagree in the end.

Which one is actually right? LV?
Is there some analysis that can be extracted from LV that LU could use
to deduce better unroll factor? (which would be 1x (no further unroll) after LV)

All that being said, i don't have any concrete examples that regress with this.

nikic added a subscriber: nikic.Dec 9 2021, 7:41 AM

Even if both of the unrollers are right as per their model
(LU duplicates whole loop body; while LU duplicates each instruction,
increasing live ranges, i believe), i'm mainly just worried
that two unroll strategies disagree in the end.

Which one is actually right? LV?
Is there some analysis that can be extracted from LV that LU could use
to deduce better unroll factor? (which would be 1x (no further unroll) after LV)

All that being said, i don't have any concrete examples that regress with this.

Runtime loop unrolling doesn't really have anything that deserves the name of "cost model", at least if there is no profile data. It basically just unrolls the loop as many times as fits into a threshold. I don't know what kind of cost modelling LV does in this area, but I can only assume it's better than that ;)

I believe many targets already disable runtime unrolling for loops that contain vector instructions. For example AArch64 does that, though X86 currently does not. This is the principal alternative I would see, to move that logic up into the generic unroll preferences. It would be the difference between not unrolling loops that LLVM vectorized and not unrolling vector loops in general -- I assume the preference would be the former, as this patch does?

I would still want to see some numbers from a run on an affected arch/cpu, where there previously would be unrolling and now there won't be.
Lack of change will be great, presence will mainly be a canary test only, not a blocker.

fhahn added a comment.Dec 10 2021, 8:28 AM

Even if both of the unrollers are right as per their model
(LU duplicates whole loop body; while LU duplicates each instruction,
increasing live ranges, i believe), i'm mainly just worried
that two unroll strategies disagree in the end.

Which one is actually right? LV?
Is there some analysis that can be extracted from LV that LU could use
to deduce better unroll factor? (which would be 1x (no further unroll) after LV)

All that being said, i don't have any concrete examples that regress with this.

Runtime loop unrolling doesn't really have anything that deserves the name of "cost model", at least if there is no profile data. It basically just unrolls the loop as many times as fits into a threshold. I don't know what kind of cost modelling LV does in this area, but I can only assume it's better than that ;)

LV at least tries to limit interleaving based on the number of execution units, so in that respect it should be a more realistic heuristic than the purely size-based on in the unroller. I guess one reason why the size based thresholds for unrolling are still in place is that one of the main benefits from aggressive unrolling in LLVM is increasing the context for later local optimizations. This point shouldn't really apply for vectorized loops in most cases.

One interesting point that @lebedev.ri is that in some cases interleaving by LV won't happen due to it causing spills of vector registers, whereas this isn't a problem with runtime-unrolling. But I'd assume in practice such loops should already be 'large enough'.

I believe many targets already disable runtime unrolling for loops that contain vector instructions. For example AArch64 does that, though X86 currently does not. This is the principal alternative I would see, to move that logic up into the generic unroll preferences. It would be the difference between not unrolling loops that LLVM vectorized and not unrolling vector loops in general -- I assume the preference would be the former, as this patch does?

IIRC AArch64 only enables runtime unrolling for in-order targets at the moment. Disabling unrolling for loops with vector instructions in general seems like a workaround.

I would still want to see some numbers from a run on an affected arch/cpu, where there previously would be unrolling and now there won't be.
Lack of change will be great, presence will mainly be a canary test only, not a blocker.

Agreed & fair point! I run SPEC2006 on X86 and the only notable change is sphinx3, but interestingly enough in the function with the biggest runtime difference there are no codegen changes. Still taking a closer look, but it may be down to code alignment changes.

In theory, I'm not a fan of this patch for many of the reasons @lebedev.ri spelled out. Our cost model really should be good enough to avoid unrolling something which is not profitable without needing to have one pass overrule the heuristic in another. In practice, I think this is a reasonable workaround for a very deep and hard to fix problem (poor unroll heuristics). Assuming perf results are neutral, I'd be fine with this going in.

@fhahn Can you point to a couple other cases where this is needed for profitability? I took a look at PR40961, and that really looks more like a problem us figuring out small trip counts and restricting transforms appropriately. (i.e. the unroller should never be unrolling longer than the total loop trip count) That's "easily" fixable, and we should do so.

fhahn added a comment.Dec 10 2021, 9:19 AM

@fhahn Can you point to a couple other cases where this is needed for profitability? I took a look at PR40961, and that really looks more like a problem us figuring out small trip counts and restricting transforms appropriately. (i.e. the unroller should never be unrolling longer than the total loop trip count) That's "easily" fixable, and we should do so.

Yeah some of the issues are related to SCEV missing trip counts and a couple of those have been fixed. That was one of the original motivations for the applyLoopGuards logic in SCEV. Unfortunately I am not able to find the other issue in the bug tracker I was thinking about just now.

But one different example is https://godbolt.org/z/7cE9soTa9 with a dependency on the accumulator registers

@fhahn Can you point to a couple other cases where this is needed for profitability? I took a look at PR40961, and that really looks more like a problem us figuring out small trip counts and restricting transforms appropriately. (i.e. the unroller should never be unrolling longer than the total loop trip count) That's "easily" fixable, and we should do so.

Yeah some of the issues are related to SCEV missing trip counts and a couple of those have been fixed. That was one of the original motivations for the applyLoopGuards logic in SCEV. Unfortunately I am not able to find the other issue in the bug tracker I was thinking about just now.

But one different example is https://godbolt.org/z/7cE9soTa9 with a dependency on the accumulator registers

So, looking at the codegen for that one, the unrolled form doesn't really look terrible. For a unknown trip count loop without profiling available, unrolling here seems fairly neutral perf wise (or at least I'd guess so). Codesize is definitely a regression. Can you explain why it is that unrolling is strictly unprofitable here? I have a bad feeling that this is only indirectly an interaction with the vectorizer, and there's some alternate framing here. I want to see your explanation to see what might fall out.

I believe many targets already disable runtime unrolling for loops that contain vector instructions. For example AArch64 does that, though X86 currently does not. This is the principal alternative I would see, to move that logic up into the generic unroll preferences. It would be the difference between not unrolling loops that LLVM vectorized and not unrolling vector loops in general -- I assume the preference would be the former, as this patch does?

In ARM-MVE we do already, both for vectorized code from the vectorizer and vector code from the frontend (intrinsics for example). That is largely to do with the way MVE handles tail-predication and hardware loops, which makes runtime unrolling generally detrimental.

For AArch64 we inherit some of the unrolling preferences from the base class, so I would expect this to alter in-order and out of order unrolling decisions to some degree. The main reasons we disabled the extra unrolling of loops with vectors was to not over-unroll loops past the point that it is useful - as it if you have a loop acting on i8's that is vectorized x16, then "interleaved" x2 or x4, then further runtime unrolled, you end up with a fast loop body that handles 64 or 128 or even 256 elements at a time. For too many tripcounts you just don't enter that loop or do most of the processing in the remainder loop. (Also we didn't want to break a lot of the hand-crafted intrinsic code that is out there that already unrolls the loop to carefully fit into the number of registers available).

Note that we already add metadata to disable runtime unrolling to the scalar loop after vectorization.

We only currently add no-unroll metadata to the remainder when there are no runtime-checks added. I wouldn't be surprised if a lot of the codesize gain from this patch is due to this patch adding no-unroll to the remainder, not to the vector body. It is quite easy to construct cases where the remainder really should be allowed to unroll.

LLVM has never had very good end-to-end testing for this kind of thing and has relied upon benchmarks like the llvm-test-suite to fill that gap. The noise makes it difficult, but I would expect a descent amount of benchmarking to prove a change like this is better overall and to catch a lot of the cases where it is not.

fhahn updated this revision to Diff 453036.Aug 16 2022, 9:10 AM

I believe many targets already disable runtime unrolling for loops that contain vector instructions. For example AArch64 does that, though X86 currently does not. This is the principal alternative I would see, to move that logic up into the generic unroll preferences. It would be the difference between not unrolling loops that LLVM vectorized and not unrolling vector loops in general -- I assume the preference would be the former, as this patch does?

In ARM-MVE we do already, both for vectorized code from the vectorizer and vector code from the frontend (intrinsics for example). That is largely to do with the way MVE handles tail-predication and hardware loops, which makes runtime unrolling generally detrimental.

For AArch64 we inherit some of the unrolling preferences from the base class, so I would expect this to alter in-order and out of order unrolling decisions to some degree. The main reasons we disabled the extra unrolling of loops with vectors was to not over-unroll loops past the point that it is useful - as it if you have a loop acting on i8's that is vectorized x16, then "interleaved" x2 or x4, then further runtime unrolled, you end up with a fast loop body that handles 64 or 128 or even 256 elements at a time. For too many tripcounts you just don't enter that loop or do most of the processing in the remainder loop. (Also we didn't want to break a lot of the hand-crafted intrinsic code that is out there that already unrolls the loop to carefully fit into the number of registers available).

That's a good point. AFAICT AArch64 disables unrolling of loops with vector instructions in general (not just runtime unrolling, which is still not enabled by default in general IIRC). I guess an alternative would be to do this generally (or at least for x86 as well),, as I sketched in D131972. It would probably also make sense to only do it for out-of-order CPUs for now.

But the general assumption remains: If LV didn't decide to interleave to increase parallelism, it is extremely unlikely LoopUnroll will make a better informed discussion later on.

Note that we already add metadata to disable runtime unrolling to the scalar loop after vectorization.

We only currently add no-unroll metadata to the remainder when there are no runtime-checks added. I wouldn't be surprised if a lot of the codesize gain from this patch is due to this patch adding no-unroll to the remainder, not to the vector body. It is quite easy to construct cases where the remainder really should be allowed to unroll.

LLVM has never had very good end-to-end testing for this kind of thing and has relied upon benchmarks like the llvm-test-suite to fill that gap. The noise makes it difficult, but I would expect a descent amount of benchmarking to prove a change like this is better overall and to catch a lot of the cases where it is not.

I updated the patch to leave the metadata of the scalar loop untouched and adds metadata to disable any unrolling of the vector loop. This should effectively be the same as effect as D131972, although compile-time is slightly better as LoopUnroll exits earlier (geoman reduction with this patch for -O3 is -1.38% vs -1.18%).

If we decide to predicate this on out-of-order vs in-order, I am currently leaning on making this decision in getUnrollingPreferences in TTI.

There is further feedback that aggressive unrolling of vector code on X86 isn't beneficial: https://github.com/llvm/llvm-project/issues/42332

https://llvm-compile-time-tracker.com/compare.php?from=a4a2ac5d1878177b57b76b109fda3820c6939a28&to=dca1dc4332b14064cf7f7618de58f2407b52c805&stat=instructions
https://llvm-compile-time-tracker.com/compare.php?from=94d21a94d90db8bc0e983bde672790843f81ddca&to=a908a561c4639f45d29b43dd921fee0b24b42dfb&stat=instructions

Herald added a project: Restricted Project. · View Herald TranscriptAug 16 2022, 9:10 AM
Herald added a subscriber: zzheng. · View Herald Transcript
nikic accepted this revision.Aug 29 2022, 5:03 AM

LGTM. I think we should start with this patch, as the more conservative variant to D131972. We can then explore applying that one on top.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7539

Should probably drop Runtime from the name?

10450

Broken indent?

This revision is now accepted and ready to land.Aug 29 2022, 5:03 AM
xbolva00 added a subscriber: xbolva00.EditedAug 30 2022, 8:43 AM

Yeah, this looks as a great start. Better performance & compile times.

Maybe also add a release note that LLVM no longer unrolls vectorized loops?

Hello. Sorry I have been travelling, I had not seen this had returned.

This looks like it disabled full unrolling after vectorization too? I think it is fairly important for performance to be able to simplify loops away where the runtime trip count is small enough by unrolling them. Either where the loop count becomes constant propagated through LTO or the trip count is just low enough. As far as I understand straight line code will almost always be better that looping, and preventing full unrolling would lead to performance regressions.

There is an example in https://godbolt.org/z/8YnsWfGGo where.. something is going wrong.

fhahn added a comment.Aug 31 2022, 6:13 AM

There is an example in https://godbolt.org/z/8YnsWfGGo where.. something is going wrong.

Yeah this is a case where the current logic to eliminate single-iteration vector loops doesn't trigger. Should be fixed by D133017

This looks like it disabled full unrolling after vectorization too? I think it is fairly important for performance to be able to simplify loops away where the runtime trip count is small enough by unrolling them. Either where the loop count becomes constant propagated through LTO or the trip count is just low enough. As far as I understand straight line code will almost always be better that looping, and preventing full unrolling would lead to performance regressions.

The above patch doesn't solve the issue where the loop will need unrolling twice or more times to completely remove the loop overhead. This shouldn't be an issue for X86/most AArch64 cores and full unrolling leads to excessive unrolling as in https://github.com/llvm/llvm-project/issues/42332. But it may still be an issue for some other architectures. So maybe targets should be able to opt-in/out?

There is an example in https://godbolt.org/z/8YnsWfGGo where.. something is going wrong.

Yeah this is a case where the current logic to eliminate single-iteration vector loops doesn't trigger. Should be fixed by D133017

Thanks

This looks like it disabled full unrolling after vectorization too? I think it is fairly important for performance to be able to simplify loops away where the runtime trip count is small enough by unrolling them. Either where the loop count becomes constant propagated through LTO or the trip count is just low enough. As far as I understand straight line code will almost always be better that looping, and preventing full unrolling would lead to performance regressions.

The above patch doesn't solve the issue where the loop will need unrolling twice or more times to completely remove the loop overhead. This shouldn't be an issue for X86/most AArch64 cores and full unrolling leads to excessive unrolling as in https://github.com/llvm/llvm-project/issues/42332. But it may still be an issue for some other architectures. So maybe targets should be able to opt-in/out?

Thanks for the extra context. The reasoning given (micro-op buffers) does sound very (micro-)architectural. I'm surprised if the loop body is the bottleneck that decode couldn't keep up with a fully unrolled version. A (relatively small) number of vector operations in a single basic block will often be faster than a loop. Just having the loop is going to cause a certain amount of overhead.

Disabling extra runtime unrolling after vectorization makes sense - otherwise the loop body can get too big and end up never executed. We do the same thing already on certain targets. There will be places where it is worse for performance, but the benefits are likely to be more common.

But full unrolling sounds like it should be mostly beneficial, due to the extra simplification it can provide. We are going from too much unrolling to too little. I don't think I have anything that shows it in the benchmarks I usually run, but the cases I've heard from customers in the past were DPS routines like those from https://github.com/ARM-software/CMSIS-DSP/tree/main/Source. They can get compiled with LTO, so during the normal compile step they are vectorized with unknown trip counts. During LTO they get inlined or const-propagated and a lot of the loop bounds become constant. It is expected that the compiler can simplify the result nicely, including the predicated vector loops we might have produced (which is why patches like https://reviews.llvm.org/D94103 were useful).

So I have no evidence of full unrolling being a problem, but my gut says that if it is useful to unroll scalar loops then vector loops shouldn't really be treated any differently.

Are you gonna land this patch? Or any blockers?

Matt added a subscriber: Matt.Tue, Sep 27, 11:41 AM