
[LoopVectorize] Allow inner loop runtime checks to be hoisted above an outer loop

Authored by david-arm on Jun 7 2023, 5:11 AM.



Suppose we have a nested loop like this:

void foo(int32_t *dst, int32_t *src, int m, int n) {
  for (int i = 0; i < m; i++) {
    for (int j = 0; j < n; j++) {
      dst[(i * n) + j] += src[(i * n) + j];
    }
  }
}

We currently generate runtime memory checks as a precondition for
entering the vectorised version of the inner loop. However, if the
runtime-determined trip count for the inner loop is quite small
then the cost of these checks becomes quite expensive. This patch
attempts to mitigate these costs by adding a new option to
expand the memory ranges being checked to include the outer loop
as well. This leads to runtime checks that can then be hoisted
above the outer loop. For example, rather than looking for a
conflict between the memory ranges:

  1. &dst[(i * n)] -> &dst[(i * n) + n]
  2. &src[(i * n)] -> &src[(i * n) + n]

we can instead look at the expanded ranges:

  1. &dst[0] -> &dst[((m - 1) * n) + n]
  2. &src[0] -> &src[((m - 1) * n) + n]

which are outer-loop-invariant. As with many optimisations there
is a trade-off: with the expanded ranges there is a danger that we
never enter the vectorised inner loop at all, whereas with the
smaller per-iteration ranges we might enter it at least once.
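The hoisted check amounts to a single no-overlap test on the expanded ranges, performed once before entering the outer loop. A minimal C sketch of that test (the helper name and byte-address comparison are illustrative, not the actual generated IR; assumes m >= 1):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One overlap test on the expanded ranges replaces m per-iteration
 * checks: both ranges cover &p[0] .. &p[((m - 1) * n) + n], so the
 * result is invariant in the outer i loop. */
static bool expanded_ranges_conflict(const int32_t *dst, const int32_t *src,
                                     int m, int n) {
  uintptr_t dst_lo = (uintptr_t)dst;
  uintptr_t dst_hi = (uintptr_t)(dst + (size_t)(m - 1) * n + n);
  uintptr_t src_lo = (uintptr_t)src;
  uintptr_t src_hi = (uintptr_t)(src + (size_t)(m - 1) * n + n);
  /* Half-open intervals [lo, hi) intersect iff each low end is below
   * the other interval's high end. */
  return dst_lo < src_hi && src_lo < dst_hi;
}
```

If this test fails (no conflict), the vectorised inner loop is safe for every outer iteration; if it succeeds, execution falls back to the scalar loop, which is the trade-off described above.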

I have added a HoistRuntimeChecks option that is turned off by
default, but can be enabled for workloads where we know this is
guaranteed to be of real benefit. In future, we can also use
PGO to determine if this is worthwhile by using the inner loop
trip count information.

When enabling this option for SPEC2017 on neoverse-v1 with the
flags "-Ofast -mcpu=native -flto" I see an overall geomean
improvement of ~0.5%:

SPEC2017 results (+ is an improvement, - is a regression):
520.omnetpp: +2%
525.x264: +2%
557.xz: +1.2%
GEOMEAN: +0.5%

I suspect the omnetpp and xz differences are noise, but I know the
x264 improvement is real because it has some hot nested loops
with low trip counts where I can see this hoisting is beneficial.

Tests have been added here:


Diff Detail

Event Timeline

david-arm created this revision.Jun 7 2023, 5:11 AM
Herald added a project: Restricted Project.Jun 7 2023, 5:11 AM
david-arm requested review of this revision.Jun 7 2023, 5:11 AM
david-arm added inline comments.Jun 7 2023, 5:17 AM

Hmm, I just noticed these CHECK lines look wrong, since we still create the diff checks here. I'll fix this in a new patch.

david-arm updated this revision to Diff 529272.Jun 7 2023, 5:52 AM
  • Removed the target-specific attributes from the test and fixed the DEBUG check lines to be more accurate.

Gentle ping. :) Hi @fhahn, I don't suppose you have any thoughts about the approach I've taken in this patch, such as whether or not I've missed something critical?

Does this require the extra-vectorizer-passes to hoist the checks out of the loop in the full pipeline? Do you know if we already have sufficient runtime tests in llvm-test-suite for this kind of code? If not, it would be good to add some, perhaps to

Also, could you clean up the block & value names in the tests a bit so they are easier to read and land them separately?

Ayal added a subscriber: Ayal.Jun 27 2023, 11:27 AM
Ayal added inline comments.

This raises a thought, which may deserve a separate patch:
In this example the runtime checks should be optimized into a form independent of the enclosing i loop and hoisted out of it w/o expansion - and w/o asking a flag, as an obvious loop invariant unswitching opportunity.

I.e., instead of checking if intervals dst[i*n : i*n+n) and src[i*n : i*n + n) intersect, suffice to check if dst[0 : n) and src[0 : n) intersect. When checking if the end SCEV point of one precedes the start SCEV of the other, try to cancel common addends, or check if the SCEV of their difference is positive - subject to wrapping?
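Concretely, the cancellation works because both intervals contain the same i*n addend, which shifts both ends equally and so drops out of the comparison. A sketch of the resulting i-independent test (illustrative helper, assuming no wrapping):

```c
#include <stdbool.h>
#include <stdint.h>

/* dst[i*n : i*n + n) intersects src[i*n : i*n + n) exactly when
 * dst[0 : n) intersects src[0 : n): the common addend i*n shifts
 * both intervals by the same amount, so it cancels. */
static bool base_ranges_conflict(const int32_t *dst, const int32_t *src,
                                 int n) {
  uintptr_t dlo = (uintptr_t)dst, dhi = (uintptr_t)(dst + n);
  uintptr_t slo = (uintptr_t)src, shi = (uintptr_t)(src + n);
  return dlo < shi && slo < dhi;
}
```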

This also holds for the motivating memcopy example w/o the additional load, which is also worth adding a test for? Perhaps the first diff_checks() test above can serve as a better example where expansion is indeed necessary in order to loop unswitch a (non invariant) runtime check.

Wonder if the SPEC2017 significant cases are invariant(?)

Does this require the extra-vectorizer-passes to hoist the checks out of the loop in the full pipeline? Do you know if we already have sufficient runtime tests in llvm-test-suite for this kind of code? If not, it would be good to add some, perhaps to

Also, could you clean up the block & value names in the tests a bit so they are easier to read and land them separately?

Hi @fhahn, so I can see there are tests in the LLVM test suite that have examples of nested loops where the inner loop runtime checks could be hoisted out with this patch. For example, see MicroBenchmarks/ImageProcessing/Dither/floydDitherKernel.c and MicroBenchmarks/ImageProcessing/Dither/orderedDitherKernel.c. However, I am looking at trying to add some specific tests to SingleSource/UnitTests/Vectorizer/runtime-checks.cpp as well. Thanks for the suggestion!


Hi @Ayal, that's a good spot and thanks for pointing that out! The tests I wrote don't really bear any resemblance to the specific loops in x264 that benefit from this optimisation, where the inner loop runtime checks are not loop invariant. For example, one loop looks like a bit like this in x264:

for (int i = 0; i < m; i++) {
    for (int j = 0; j < n; j++)
        out[j] = ... do something with in1[j] + in2[j] ...;
    out += out_stride;
    in1 += in1_stride;
    in2 += in2_stride;
}

However, what you said makes me wonder if the tests I added in this patch are unreliable, because like you said they may get optimised away in future because they are genuinely invariant. I'll look into adding another test.
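For reference, in this strided form each outer iteration shifts the three ranges by different strides, so no common addend cancels and the per-iteration checks are genuinely not invariant. The expanded range the hoisted check would use can be sketched as below (illustrative helper, assuming a non-negative stride and m >= 1):

```c
#include <stddef.h>
#include <stdint.h>

/* Union of the per-iteration ranges of a strided access, assuming a
 * non-negative stride: iteration i touches
 * [base + i*stride, base + i*stride + n), so across m outer
 * iterations the union is [base, base + (m - 1)*stride + n). */
static void expanded_range(const int32_t *base, ptrdiff_t stride, int m,
                           int n, uintptr_t *lo, uintptr_t *hi) {
  *lo = (uintptr_t)base;
  *hi = (uintptr_t)(base + (ptrdiff_t)(m - 1) * stride + n);
}
```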

david-arm updated this revision to Diff 535748.Jun 29 2023, 5:38 AM
  • Add extra full_checks_diff_strides test to ensure the original inner loop runtime checks were not already loop invariant.

Hi @fhahn, when vectorising as part of the normal pipeline we always run LICM after the vectoriser, which ensures the runtime checks are hoisted out. This is why we see the improvement for x264. However, if we vectorise a loop during the LTO pipeline then you're right that LICM is not currently run. This will probably change with D143631 though, since there are other problems with not running LICM after InstCombine in the LTO pipeline.

I've tried cleaning up the tests, and added a new one too. I'll continue working separately on adding a specific test to the llvm test suite, although obviously the test suite won't be run with this new flag enabled.

Hi @fhahn, I've now created some nested loop runtime check tests in the LLVM test suite - see D154719.

Does this require the extra-vectorizer-passes to hoist the checks out of the loop in the full pipeline?

Hi @fhahn, I just realised I forgot to reply to this comment. It does require follow-on passes like LICM to hoist the checks out of the loop, but this does seem to be happening for x264. For both normal and LTO pipelines we do run LICM after the vectoriser, so I think it's fine without the extra-vectorizer-passes flag.

paulwalker-arm added inline comments.Jul 18 2023, 3:23 AM

This doesn't read very well.


Typo, should be "inner".


"This is why the"


Given you evaluate High using the outer loop's exit count shouldn't you also check the following?

cast<SCEVAddRecExpr>(High)->getLoop() == TheLoop->getParentLoop()
david-arm updated this revision to Diff 542014.Jul 19 2023, 7:21 AM
  • Addressed review comments.
david-arm marked 4 inline comments as done.Jul 19 2023, 7:21 AM
paulwalker-arm accepted this revision.Jul 19 2023, 9:36 AM
paulwalker-arm added inline comments.

This space is redundant because you've already got one at the end of the previous string.


Up to you but long NOT lines like this are fragile. I think you'd be better off adding another DEBUG line to tryToCreateDiffCheck for the case when a diff check is created and then explicitly checking for the specific string you expect to see.

This revision is now accepted and ready to land.Jul 19 2023, 9:36 AM
fhahn added inline comments.Jul 19 2023, 10:28 AM

Could you add a test case for this case?


Do we need to check if the AddRecs here don't wrap? If it would wrap, could NewHigh < NewLow?

david-arm added inline comments.Jul 21 2023, 5:35 AM

I can have a look, but the scenario you're thinking about would require 3 levels of nesting. I didn't originally add a test for that because I was worried about introducing so many increasingly complicated negative test cases.


That's an interesting point!

So I've looked into this, and in order to even get to this point we must already have called evaluateAtIteration once to get the original High value. This happens in RuntimePointerChecking::insert, which does something similar:

const SCEV *Ex = PSE.getBackedgeTakenCount();

ScStart = AR->getStart();
ScEnd = AR->evaluateAtIteration(Ex, *SE);

I can't see any existing code that worries about ScEnd being < ScStart here and all I'm doing is performing the same evaluation for the outer loop backedge count. It's quite difficult to follow the whole paper trail of complex code in LoopAccessAnalysis, but it seems that AccessAnalysis::createCheckForAccess only worries about overflow if we failed a dependency check and retry the runtime memory checks.

Since this is under a flag and not enabled by default, would you be happy for now with me adding a TODO here to investigate whether wrapping really is a problem or not?

david-arm updated this revision to Diff 542900.Jul 21 2023, 7:07 AM
  • Rebased off the new triple-nested loop test and added a TODO
david-arm marked an inline comment as done.Jul 21 2023, 7:09 AM
fhahn added inline comments.Jul 25 2023, 12:40 AM

Yeah it looks like this is handled a bit inconsistently at the moment. One other potential case that might be missed here is when the outer induction is decreasing. In that case I think the new high and lows would be swapped?

Since this is under a flag and not enabled by default, would you be happy for now with me adding a TODO here to investigate whether wrapping really is a problem or not?

What is your plan in general to enable the flag by default? I think we want to avoid adding this behind a flag without working towards enabling it by default in the near term.

david-arm added inline comments.Jul 25 2023, 1:23 AM

So I would like to enable this by default in the near term after running a sufficient number of benchmarks on at least AArch64 targets. Having the flag in early though allows other people to test it or make use of it until it's enabled. Since it's under a flag, if we do enable it by default it's easy to turn off again should any problems arise.

With regards to decreasing outer induction variables I think I tried this and the hoisting doesn't happen, but I can double check.

david-arm updated this revision to Diff 545141.Jul 28 2023, 6:54 AM
  • Added support for cases where the stride of the outer loop memory accesses is negative. In such cases we may have to add an extra runtime check that the stride is positive because otherwise the range expansion will be incorrect.
david-arm marked an inline comment as done.Jul 28 2023, 6:58 AM
david-arm added inline comments.

I believe I've now addressed this problem by checking whether the stride is known to be non-negative. If we can't prove that at compile time, we emit an additional runtime check that the stride is positive. The extra check doesn't seem to hurt the x264 benchmark and I still see a nice improvement.
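As a sketch, the extra guard boils down to a runtime sign test on the stride (illustrative helper; in the patch this is an IR comparison emitted as a precondition):

```c
#include <stdbool.h>
#include <stddef.h>

/* The expanded range [base, base + (m - 1)*stride + n) only covers
 * the union of the per-iteration ranges when the stride is
 * non-negative; with a negative stride the lowest address is reached
 * at i = m - 1 and the bounds would be inverted. If the sign cannot
 * be proven at compile time, this test becomes an extra runtime
 * precondition for entering the vector loop. */
static bool stride_check(ptrdiff_t stride) {
  return stride >= 0;
}
```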

paulwalker-arm added inline comments.Aug 3 2023, 10:29 AM

Should this be IsNegativeStride given you're planting A.StrideToCheck < 0?

Perhaps it's worth giving the node a name like "stride.check"?

david-arm updated this revision to Diff 547207.Aug 4 2023, 7:39 AM
  • Renamed IsPositiveStride -> IsNegativeStride and given IR variables more meaningful names.
david-arm marked an inline comment as done.Aug 4 2023, 7:39 AM
paulwalker-arm accepted this revision.Aug 4 2023, 8:28 AM
fhahn added inline comments.Aug 6 2023, 1:08 PM

IIRC for regular runtime checks we create SCEV expressions for the minimum and maximum directly (something like Start = SE->getUMinExpr(Start, End);). Could this be used here as well?

david-arm added inline comments.Aug 21 2023, 7:55 AM

That's a good suggestion. Do you mean something like:

TmpLow = umin(Low, High);
TmpHigh = umax(Low, High);
Low = TmpLow;
High = TmpHigh;

? If so I gave that a try and it comes at a price, since for my simple test case:

void foo(int32_t *dst, int32_t *src, int stride1, int stride2, int m, int n) {
  for (int i = 0; i < m; i++) {
    for (int j = 0; j <= n; j++) {
      dst[(i * stride1) + j] += src[(i * stride2) + j];
    }
  }
}

we end up with 2 extra instructions for the runtime checks in the preheader when building for SVE. For loops with short trip counts like x264 this does matter.
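The min/max normalisation under discussion can be sketched as follows; the two selects are where the extra instructions come from (illustrative helper, not the SCEV-level code):

```c
#include <stdint.h>

/* Normalise the range bounds so *low <= *high regardless of stride
 * direction, at the cost of two select-style operations
 * (umin/umax) per pointer range. */
static void normalize_bounds(uintptr_t *low, uintptr_t *high) {
  uintptr_t lo = (*low < *high) ? *low : *high; /* umin(Low, High) */
  uintptr_t hi = (*low < *high) ? *high : *low; /* umax(Low, High) */
  *low = lo;
  *high = hi;
}
```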

Given this is just an initial patch and likely to be a work in progress are you happy with me landing this version as is for now? We can always revisit this in future if we can figure out a better way of doing this that's just as efficient.

Matt added a subscriber: Matt.Aug 21 2023, 1:02 PM
This revision was landed with ongoing or failed builds.Aug 24 2023, 5:14 AM
This revision was automatically updated to reflect the committed changes.