This is an archive of the discontinued LLVM Phabricator instance.

[LAA,LV] Add initial support for pointer-diff memory checks.
ClosedPublic

Authored by fhahn on Feb 6 2022, 6:08 AM.

Details

Summary

This patch adds initial support for a pointer diff based runtime check
scheme for vectorization. This scheme requires fewer computations and
checks than the existing full overlap checking, if it is applicable.

The main idea is to only check if source and sink of a dependency are
far enough apart so the accesses won't overlap in the vector loop. To do
so, it is sufficient to compute the difference and compare it to the
VF * UF * AccessSize. It is sufficient to check
(Sink - Src) <u VF * UF * AccessSize to rule out a backwards
dependence in the vector loop with the given VF and UF. If Src >=u Sink,
there is not dependence preventing vectorization, hence the overflow
should not matter and using the ULT should be sufficient.

Note that the initial version is restricted in multiple ways:

    1. Pointers must only either be read or written, by a single instruction (this allows re-constructing source/sink for dependences with the available information)
  1. Source and sink pointers must be add-recs, with matching steps
  2. The step must be a constant.
  3. abs(step) == AccessSize.

Most of those restrictions can be relaxed in the future.

See https://github.com/llvm/llvm-project/issues/53590.

Diff Detail

Event Timeline

fhahn created this revision.Feb 6 2022, 6:08 AM
fhahn requested review of this revision.Feb 6 2022, 6:08 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 6 2022, 6:08 AM
lebedev.ri edited the summary of this revision. (Show Details)Feb 6 2022, 6:12 AM
fhahn updated this revision to Diff 406517.Feb 7 2022, 10:08 AM

Cleanup, fix failing tests, add comments.

fhahn retitled this revision from [LAA,LV] Add initial support for pointer-diff memory checks (WIP). to [LAA,LV] Add initial support for pointer-diff memory checks..Feb 7 2022, 11:13 AM
fhahn edited the summary of this revision. (Show Details)

This should now be ready for a first round of reviews.

If I'm following this correctly, this is basically doing two things. One, it reduces the minimum distance by basing the required distance on the vector factor, instead of the start/end of the whole loop. Two, it takes advantage of the ordering of different operations: a false dependency is okay as long as you perform the operations in the correct order.

It looks like we end up rejecting at runtime if the two pointers are exactly equal; if I'm following correctly, that's an unnecessary restriction?

Does the code generator actually guarantee that we perform operations in the correct order? Does interleaving matter? Other transforms?

reames added inline comments.Feb 7 2022, 5:05 PM
llvm/lib/Analysis/LoopAccessAnalysis.cpp
310

This bit makes me uncomfortable. The recent direction of changes to SCEV seem to be leading us towards disallowing pointer subtractions involving distinct memory objects. This change very much relies on the historical behavior of doing an implicit inttoptr. See the comment of doing getMinusSCEV.

In fact, I think you are accidentally only applying this new logic in the case where we can prove the two accesses share a common base object.

If you're okay documenting that restriction, we can punt the pointer_subtract semantics question one step further down the road.

llvm/lib/Transforms/Utils/LoopUtils.cpp
1628

Somewhat an aside, but this is one more place where having a general SCEV note for a predicate would seem useful. In this case, we'd cache the overlap predicate directly, and the emission code wouldn't need to be so closely coupled.

fhahn updated this revision to Diff 406940.Feb 8 2022, 12:43 PM

If I'm following this correctly, this is basically doing two things. One, it reduces the minimum distance by basing the required distance on the vector factor, instead of the start/end of the whole loop. Two, it takes advantage of the ordering of different operations: a false dependency is okay as long as you perform the operations in the correct order.

That's a great summary, thanks! Those 2 things do not need to be done in a single step/patch. It is also possible (and simpler) to start with 2) - taking advantage of the ordering. I updated the patch to *only* take advantage of the ordering to emit one of the existing checks we generate (instead of 2). (This was also suggested by @Ayal offline)

The latest version of the patch emits the following check for each (src, sink) pair (if they be classified): there is a conflict if lower bound of source < upper bound of sink, i.e. if src starts before sink ends.

Changing the checks to use the pointer-difference approach can be done as follow up and has the benefit Eli mentioned, as well as removing the requirement to have computable bounds.

It looks like we end up rejecting at runtime if the two pointers are exactly equal; if I'm following correctly, that's an unnecessary restriction?

Yes that's an unnecessary restriction of the original patch.

Does the code generator actually guarantee that we perform operations in the correct order? Does interleaving matter? Other transforms?

The distance needs to also include the interleave count, if the vectorizer interleaves the vector loop. I think we also cannot add !noalias metadata with the lightweight checks, but that should guarantee that other transformations work with the correct aliasing assumptions.

AFAICT we do not rely on the noalias property in any part of the vectorizer codegenerator at the moment, but there is a patch in flight that does (D110235). We need to be careful there.

fhahn marked an inline comment as done.Feb 8 2022, 12:49 PM
fhahn added inline comments.
llvm/lib/Analysis/LoopAccessAnalysis.cpp
310

This bit makes me uncomfortable. The recent direction of changes to SCEV seem to be leading us towards disallowing pointer subtractions involving distinct memory objects. This change very much relies on the historical behavior of doing an implicit inttoptr. See the comment of doing getMinusSCEV.

Thanks for the heads up, I missed that! The latest version of the patch has the pointer-difference part stripped, but I'll keep that in mind for the follow-up. The only reason to use SCEV here was to take advantage of SCEVExpander for convenience, but the code can also be generated without SCEV.

In fact, I think you are accidentally only applying this new logic in the case where we can prove the two accesses share a common base object.

We should only compute the difference if the bases are may-alias (or the step is not constant). I *think* needsChecking should skip any pointer-group pairs with a common base.

llvm/lib/Transforms/Utils/LoopUtils.cpp
1628

The notion of a predicate in SCEV would indeed be convenient here (and at the other places where we generate runtime checks). It would/should also allow for more convenient checking if the check can be simplified by SCEV.

fhahn planned changes to this revision.Feb 11 2022, 12:40 AM
fhahn marked an inline comment as done.

I realized that the current version the checks are overly restrictive. I'll adjust the change back to the difference check, handle the case when both pointers are equal and address the original code comments

bsmith added a subscriber: bsmith.Feb 28 2022, 2:59 AM
fhahn updated this revision to Diff 413074.Mar 4 2022, 11:01 AM

Rebase & ping.

I restored the original diff-check code, with a modification to not create pointer sub expressions using SCEV.

I also put up an initial patch (D121008) that adds a set of microbenchmarks to track the improvements by the patch and guard against future regressions.

Herald added a project: Restricted Project. · View Herald TranscriptMar 4 2022, 11:01 AM

I have a set of MVE routines (that we were using as benchmarks) where this certainly helps a really decent amount.

llvm/include/llvm/Analysis/LoopAccessAnalysis.h
448

-> used to prove

llvm/lib/Analysis/LoopAccessAnalysis.cpp
248

I find its best not to overuse auto when the type isn't obvious. This is a PointerInfo?

284

Does this assume that both the AllocSize's are the same?

If it picks the larger size - I think it would be OK so long as the steps below matched. And the smaller size might need some very strange code to cause problems. It might be worth checking for though.

285

Is this one of the methods removed with opaque pointers?

fhahn updated this revision to Diff 420222.Apr 4 2022, 10:00 AM
fhahn marked 3 inline comments as done.

Address latest comments, thanks!

llvm/include/llvm/Analysis/LoopAccessAnalysis.h
448

Updated, thanks!

llvm/lib/Analysis/LoopAccessAnalysis.cpp
284

Does this assume that both the AllocSize's are the same?

Originally yes, but I updated the code to use the max. Added tests to cover those cases in 368d35a89440.

285

Good point, I updated the code to get the store/loaded type from the actual instructions.

fhahn updated this revision to Diff 420224.Apr 4 2022, 10:03 AM
fhahn marked an inline comment as done.

Clean up a bit of code.

llvm/lib/Analysis/LoopAccessAnalysis.cpp
248

Agreed, updated most uses of auto that do not use dyn_cast.

dmgreen added inline comments.Apr 12 2022, 1:17 AM
llvm/lib/Analysis/LoopAccessAnalysis.cpp
282

Is this the same as getLoadStoreType?

fhahn updated this revision to Diff 422149.Apr 12 2022, 2:24 AM

Update to use getLoadStoreType.

fhahn marked an inline comment as done.Apr 12 2022, 2:25 AM
fhahn added inline comments.
llvm/lib/Analysis/LoopAccessAnalysis.cpp
282

Yes, updated, thanks!

dmgreen accepted this revision.Apr 12 2022, 7:58 AM

Thanks. If there are no other comments from anyone, this LGTM

This revision is now accepted and ready to land.Apr 12 2022, 7:58 AM
fhahn marked an inline comment as done.Apr 19 2022, 11:49 PM

Thanks, I plan to land this in the next few days, unless there are any further comments :)

fhahn updated this revision to Diff 429679.May 16 2022, 4:06 AM

Rebased before commit.

This revision was landed with ongoing or failed builds.May 16 2022, 7:27 AM
This revision was automatically updated to reflect the committed changes.
bjope added a subscriber: bjope.May 26 2022, 4:09 AM
bjope added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3090

Hi @fhahn and @dmgreen (and well, anyone else who might happen to read this).

I've been trying to understand some regressions that we've seen downstream after this patch.
It seems to be related to this part, where the new memchecks somehow are weaker(?) in the sense that we can't deduce NoAlias across all iterations (well, that is how I've interpreted the diff here).

What we've seen happening for a couple of benchmarks is that in our OOT backend some vectorized loops aren't software pipelined any longer. And the SWP scheduler is bailing out since noalias isn't guaranteed. No SWP => quite huge regressions. No idea if this could be a problem for other targets as well.

So far I haven't figured out what to do downstream in this situation.

Maybe we should look into the SWP scheduler to see if it can deduce "no alias" in some other way (I'm not sure, but I figure SWP isn't requiring no overlap across all iterations, but depending on how much pipelining it might require no overlap across iteration N and N+1 etc.).

Maybe we should add some heuristic already in the LoopVectorizer to not use the new kind of memory checks when we think that it would block SWP (an initial heuristic would probably be to use the old kind of checks for out target).
Here I'm not quite sure about the plans in-tree for this. Are the new memory checks supposed to replace the old checks in the future?

If anyone has some insight/ideas here, then I'd be happy to read your comments on this.

fhahn marked an inline comment as done.May 26 2022, 6:41 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3090

It seems to be related to this part, where the new memchecks somehow are weaker(?) in the sense that we can't deduce NoAlias across all iterations (well, that is how I've interpreted the diff here).

Yes exactly, the cheaper check only rules out dependences for the selected VF * IC, so no alias is not guaranteed for all accesses in the loop.

What we've seen happening for a couple of benchmarks is that in our OOT backend some vectorized loops aren't software pipelined any longer. And the SWP scheduler is bailing out since noalias isn't guaranteed. No SWP => quite huge regressions. No idea if this could be a problem for other targets as well.

That seems like an unfortunate side effect from this patch, but in a way SWP got 'lucky' earlier because the dependence checks by LV were checking more than is required for vectorization.

If pipelining is profitable but requires runtime checks, then ideally the software pipeliner would emit them (and replace the LV checks with the stricter checks). If you are talking about MachinePipeliner, which runs on machine-functions, this is likely going to be very difficult unfortunately.

A more crude solution would be to introduce a TTI hook to opt-out of the more lightweight checks. The drawback here is that the backend misses out on cases where the lightweight checks are sufficient because no pipelining is happening.

If it is enough to rule out no-aliasing for the pipelined iterations another option might be for LV to emit the difference checks with a slightly larger distance.

bjope added inline comments.May 31 2022, 4:31 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3090

That seems like an unfortunate side effect from this patch, but in a way SWP got 'lucky' earlier because the dependence checks by LV were checking more than is required for vectorization.

Yes, agree. We've been lucky in the past.

If pipelining is profitable but requires runtime checks, then ideally the software pipeliner would emit them (and replace the LV checks with the stricter checks). If you are talking about MachinePipeliner, which runs on machine-functions, this is likely going to be very difficult unfortunately.

We are not using the in-tree MachinePipeliner directly, but we got a software pipeline pass that runs late in the backend (both pre-RA and also later after register allocation). And yes, it is a bit difficult to add stricter checks at that stage (I figure specially without adding a lot more loop versioning for both the vector and scalar loop).

A more crude solution would be to introduce a TTI hook to opt-out of the more lightweight checks. The drawback here is that the backend misses out on cases where the lightweight checks are sufficient because no pipelining is happening.

This is probably what we will aim for (downstream), at least as a starting point to give us some time to analyse the regressions a bit more to see if we can find some heuristics or alternative solutions in the future.

If it is enough to rule out no-aliasing for the pipelined iterations another option might be for LV to emit the difference checks with a slightly larger distance.

This is an interesting solution. One problem could be how to setup that distance (I think SWP is trying to pipeline using different distances). A bigger problem is how to analyze the runtime checks that late in the backend to understand for which distances SWP is "safe". Maybe some kind of metadata could be used, but then that metadata must be propagated properly all the way from the vectorizer to the backend.