This is an archive of the discontinued LLVM Phabricator instance.

[WIP] Loop peeling opportunity for identity operators
AbandonedPublic

Authored by jamieschmeiser on Jan 20 2023, 1:58 PM.

Details

Reviewers
nikic
reames
mkazantsev
Group Reviewers
Restricted Project
Summary

This is a work in progress and is to aid discussion in the loop opt working group.

An opportunity for peeling exists for loops like the following:

unsigned Sum = 0;
for (unsigned I = 0; I < 999; ++I)
  Sum += g();

Here the initial value of 0 makes the addition on the first iteration an identity operation (0 + x == x).
The loop can be peeled to avoid that operation:

unsigned Sum = g();
for (unsigned I = 0; I < 998; ++I)
  Sum += g();

This idiom frequently occurs in the form of summing up the values in an array. It can also occur with *=, &=, and |=.
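For illustration, a minimal sketch of the array-sum form of this idiom and its peeled equivalent (the function and variable names here are made up for the example, not taken from the patch):

// Common form: a reduction over an array whose accumulator starts at the
// identity value of the operation (0 for +=).
unsigned sumArray(const unsigned *A, unsigned N) {
  unsigned Sum = 0;
  for (unsigned I = 0; I < N; ++I)
    Sum += A[I];
  return Sum;
}

// Peeled form, valid when N >= 1: the first element initializes Sum
// directly, so the redundant 0 + A[0] is never executed.
unsigned sumArrayPeeled(const unsigned *A, unsigned N) {
  unsigned Sum = A[0];
  for (unsigned I = 1; I < N; ++I)
    Sum += A[I];
  return Sum;
}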

The number of tests requiring changes here illustrates how common this idiom is. There may be more tests that need updating,
but since this is currently for discussion purposes, that work has not been completed. The affected tests were mostly altered either by disallowing peeling or by changing the initial value from 0 to 1 so that the operation is no longer an identity.

There are several questions about the role of peeling in the LLVM opt pipeline, but the main one raised by this sample is: this is likely to be a relatively small improvement and it may interfere with other optimizations, so is it worth pursuing?

For example, the code as it exists attempts not to interfere with vectorization. This heuristic needs improvement: the sample above, for instance, will not be peeled if the upper bound is 10,000 because it may fail the heuristic on some hardware.
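As a rough illustration of the kind of guard being described, a sketch follows; the function name, parameters, threshold, and structure are hypothetical and are not the code in this patch:

// Hypothetical sketch: skip the identity peel when the loop looks like a
// profitable vectorization candidate, so the peel does not perturb the
// vectorizer's trip-count and remainder handling. The cutoff is an
// arbitrary example value, not a tuned heuristic.
static bool shouldSkipIdentityPeel(unsigned TripCount, unsigned MaxVectorElements) {
  return TripCount >= 16 * MaxVectorElements;
}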

Diff Detail

Event Timeline

jamieschmeiser created this revision. · Jan 20 2023, 1:58 PM
Herald added a project: Restricted Project. · View Herald Transcript · Jan 20 2023, 1:58 PM
jamieschmeiser requested review of this revision. · Jan 20 2023, 1:58 PM
Herald added a project: Restricted Project. · View Herald Transcript · Jan 20 2023, 1:58 PM
mkazantsev added inline comments.
llvm/test/Transforms/LoopUnroll/AArch64/runtime-loop.ll
27

Looks like a bug: the number of iterations didn't change, but the sum did?

llvm/test/Transforms/LoopUnroll/AArch64/runtime-loop.ll
27

This value comes in from the entry block and represents the initial value. I changed it from 0 to 1 so that the identity operation would not apply and this test would not otherwise change. I did this for the tests here because they are testing other behaviours rather than the peeling of an identity operation.

The approach looks good to me. Got some smaller remarks, nothing critical since it doesn't affect correctness.

llvm/lib/Transforms/Utils/LoopPeel.cpp
471–487

Consider using doxygen comments (starting with ///). Doxygen code samples are indented by 4 spaces.

Could you explicitly list all the cases that should be handled? E.g.:

  • Integer, initial value 0, add
  • Integer, initial value 1, mul
  • Float, initial value 0, fadd
  • Float, initial value 0, fmadd
  • Float, initial value 1, fmul
  • ...
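As a standalone illustration of the kind of case list being asked for, here is a minimal sketch of an (operation, initial value) identity check; the enum and function are made up for this example and are not the patch's code:

enum class ReductionOp { Add, Mul, And, Or, Xor };

// True when Init is the identity element of Op, i.e. the first loop
// iteration computes (Init Op X) == X. In real IR the all-ones check for
// 'and' must respect the value's bit width (cf. isAllOnes()).
static bool isIntegerIdentity(ReductionOp Op, long long Init) {
  switch (Op) {
  case ReductionOp::Add:
  case ReductionOp::Or:
  case ReductionOp::Xor:
    return Init == 0;   // X + 0, X | 0, X ^ 0
  case ReductionOp::Mul:
    return Init == 1;   // X * 1
  case ReductionOp::And:
    return Init == -1;  // X & all-ones (two's-complement -1)
  }
  return false;
}

// Floating-point analogues: fadd with +0.0 and fmul with 1.0 are only safe
// identities under the appropriate fast-math flags (see the signed-zero
// discussion below).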
506

[style] Functions start with a verb, e.g. getConstIn

516

ConstantFP::isZeroValue() returns true for negative and positive zero; should we do this only with -ffast-math (or the subflag that controls this)?
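For reference, a small standalone example of the signed-zero issue being raised: peeling away a +0.0 fadd changes the result when the first summand is -0.0, which is why this case is normally gated on nsz/fast-math.

#include <cstdio>

int main() {
  double First = -0.0;
  double WithInit = 0.0 + First; // +0.0: the "identity" add flips the sign of zero
  double Peeled = First;         // -0.0: the first value is used directly
  std::printf("%g %g\n", WithInit, Peeled);            // prints: 0 -0
  std::printf("%d\n", 1.0 / WithInit == 1.0 / Peeled); // prints: 0 (inf vs -inf)
  return 0;
}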

529

Would both (or and add; respectively mul and and with isAllOnes()) be valid independent of bit width? E.g. the PHI value might be a flags enum, so we'd do a bitwise operation:

flags_t Flags = 0;
for (I = 0; I < 100000; ++I) {
   Flags |= enable_flag(I);
}
544

[style] Functions start with a verb

558

128 is really big; x86 uses 2, 4, 8, or 16 elements. Could TTI be used to get the native vector size?

llvm/test/Transforms/LoopUnroll/peel-loop-identity-op.ll
4

Consider adding a description to the test

nikic requested changes to this revision. · Jan 26 2023, 1:00 AM
nikic added reviewers: reames, mkazantsev, nikic.
nikic added a subscriber: nikic.

Strong -1 on this change as implemented. There are a lot of red flags here, for example that you are trying to artificially limit this transform so as not to completely break LoopVectorize (indicating that you're breaking a canonical form), and the number of times you have to suppress the transform in tests, where it looks like the transform would be clearly non-profitable if it were actually applied.

There might be something to the general idea here, but this would need entirely different cost modelling from the current peeling transform. Loop peeling is a tradeoff between simplifying the loop body vs increasing code size. In the cases where current peeling transforms are performed, we expect some significant simplification of the loop, like the elimination of branches that are constant after some iterations. For the transform proposed here the benefit is very marginal, you're basically saving a single operation. This might make sense if that operation is pretty much the only thing the loop does, but not if you need to duplicate a large loop body that also does many other things.

@nikic, thank you for your comments. I too have concerns about this proposal. I agree that the benefits are marginal but I would like to address some of your comments.

Strong -1 on this change as implemented. There are a lot of red flags here, for example that you are trying to artificially limit this transform so as not to completely break LoopVectorize (indicating that you're breaking a canonical form), and the number of times you have to suppress the transform in tests, where it looks like the transform would be clearly non-profitable if it were actually applied.

How this affects other transforms, LoopVectorize in particular, is one of my biggest concerns, but for different reasons. I don't think it is breaking a canonical form; rather, it may cause different remainders after vectorizing. This could be beneficial in some cases and harmful in others, depending on how things happen to line up. This was one of the things we discussed in the loop-opt group meeting, and there was some question as to why peeling was being done in the full loop unroll pass before vectorizing as well as in the loop unroll pass afterwards. I investigated this and found that peeling was specifically turned on in full loop unroll to aid vectorizing (commit 35b3989a30eefa66cd6edca4c6e1ec061c05ad96) because, as you indicated, the existing peeling strategies tend to eliminate phis, which can help the vectorizer. A proposed solution from this discussion was to not do this new peeling strategy in full loop unrolling but only after vectorization. This also removes the need for the bit of code that limits the peel.
I disagree that the number of times I suppressed it shows it is not profitable. In fact, the frequency with which it occurs shows its usefulness; I suppressed it because it isn't germane to the tests in question and obfuscates their meaning and purpose.

There might be something to the general idea here, but this would need entirely different cost modelling from the current peeling transform. Loop peeling is a tradeoff between simplifying the loop body vs increasing code size. In the cases where current peeling transforms are performed, we expect some significant simplification of the loop, like the elimination of branches that are constant after some iterations. For the transform proposed here the benefit is very marginal, you're basically saving a single operation. This might make sense if that operation is pretty much the only thing the loop does, but not if you need to duplicate a large loop body that also does many other things.

The peeling code already considers the size of code growth and establishes limits to the amount of peeling that can occur. The proposed code does not change that and only suggests the peel when it is within the established limits.

I am preparing a revision of this code that adds a new flag to the peeling control struct called allow-aggressive-peeling. This would control peels that do not necessarily simplify the phi structure or control flow but exist for other opportunities such as this one. It would be set to false for full loop unroll (before vectorizing) but true for the loop unroll pass. Like other flags in this struct, it will have an option for controlling it; the option will be used to suppress it in the loop unrolling tests for the reasons given above but left unspecified in other ones. Does this approach address your concerns? I recognize that the change would need to be examined before any decision is made...
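A rough sketch of the shape such a control might take follows; the struct and field names are illustrative only and are not the actual TargetTransformInfo::PeelingPreferences definition:

// Hypothetical sketch of a peeling-preferences style struct with the
// proposed flag added; real code would extend the existing struct and
// expose an option so tests can force the flag on or off.
struct PeelingPrefsSketch {
  unsigned PeelCount = 0;              // number of iterations to peel
  bool AllowPeeling = true;            // existing gate on peeling at all
  bool AllowAggressivePeeling = false; // new: peels that do not simplify
                                       // phis or control flow, e.g. the
                                       // identity peel proposed here
};

// Intended wiring per the description above: full unroll (before the
// vectorizer) leaves AllowAggressivePeeling false; the later loop unroll
// pass sets it to true.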

I disagree that the number of times I suppressed it shows it is not profitable. In fact, the frequency with which it occurs shows its usefulness; I suppressed it because it isn't germane to the tests in question and obfuscates their meaning and purpose.

I think my phrasing here was ambiguous: it's not the number of times it is suppressed, but that if you take a look at the specific cases where it is suppressed, many of them look non-profitable to me. E.g. the second test file, llvm/test/Transforms/LoopUnroll/ARM/multi-blocks.ll, already has a non-trivial loop body with internal control flow, where I would expect the marginal cost of the extra code size to outweigh the marginal benefit of this optimization.

The peeling code already considers the size of code growth and establishes limits to the amount of peeling that can occur. The proposed code does not change that and only suggests the peel when it is within the established limits.

My point here was that the existing cost model is inappropriate for this transform. This transform basically saves you from executing a single instruction over the entire loop (not per iteration!). It's a very small improvement that will be outweighed by code size increases for anything but the smallest loops. The existing cost model will perform this optimization in too many cases.
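To make the magnitude concrete, a back-of-the-envelope comparison for the 999-iteration example from the summary (the numbers are illustrative only):

// Without the peel: 999 adds into Sum.
// With the peel:    1 initializing assignment + 998 adds.
// Net saving:       exactly one add for the whole loop execution,
//                   roughly 0.1% of the reduction work, paid for by
//                   duplicating one copy of the loop body before the loop.
constexpr unsigned TripCount = 999;
constexpr unsigned AddsWithoutPeel = TripCount;
constexpr unsigned AddsWithPeel = TripCount - 1;
static_assert(AddsWithoutPeel - AddsWithPeel == 1, "the peel saves one add in total");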

I am preparing a revision of this code that adds a new flag to the peeling control struct called allow-aggressive-peeling. This would control peels that do not necessarily simplify the phi structure or control flow but exist for other opportunities such as this one. It would be set to false for full loop unroll (before vectorizing) but true for the loop unroll pass. Like other flags in this struct, it will have an option for controlling it; the option will be used to suppress it in the loop unrolling tests for the reasons given above but left unspecified in other ones. Does this approach address your concerns? I recognize that the change would need to be examined before any decision is made...

Delaying this to runtime unrolling is certainly a good start, but I don't think it will be sufficient, and I'm not really convinced that this optimization is worth investing in in the first place. Do you have any performance data suggesting that this is really a worthwhile thing to do?

jamieschmeiser abandoned this revision. · Jan 31 2023, 9:04 AM

I agree that the potential gain is limited and probably not worth the potential increase in code size. This patch captures the changes and discussion in case someone else is interested in the future. I am abandoning this revision.