Page MenuHomePhabricator

[LAA] Analyze pointers forked by a select
AcceptedPublic

Authored by huntergr on Aug 25 2021, 6:04 AM.

Details

Summary

Given a function like the following:

void forked_ptrs_different_base_same_offset(float *Base1, float *Base2, float *Dest, int *Preds) {
  for (int i=0; i<100; i++) {
    if (Pred[i] != 0) {
      Dest[i] = Base1[i];
    } else {
      Dest[i] = Base2[i];
    }
  }
}

LLVM will optimize the IR to a single load using a pointer determined by a select instruction:

%spec.select = select i1 %cmp1.not, float* %Base2, float* %Base1
%.sink.in = getelementptr inbounds float, float* %spec.select, i64 %indvars.iv 
%.sink = load float, float* %.sink.in, align 4

LAA is currently unable to analyze such IR, since ScalarEvolution will return a SCEVUnknown for the pointer operand for the load.

This patch adds initial optional support for analyzing both possibilities for the pointer and allowing LAA to generate runtime checks for the bounds if required.

Diff Detail

Event Timeline

huntergr created this revision.Aug 25 2021, 6:04 AM
huntergr requested review of this revision.Aug 25 2021, 6:04 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 25 2021, 6:04 AM
huntergr updated this revision to Diff 370905.Sep 6 2021, 6:18 AM

Rebased, ran clang-format over the patch.

Hi @huntergr, apologies for delay reviewing this patch! I've not finished reviewing it, but I've left some comments that I have so far. Thanks!

llvm/include/llvm/Analysis/LoopAccessAnalysis.h
365

If we're only ever going to support two forks couldn't we make this simpler by just having:

unsigned Members[2];

instead or is this a real pain to implement? Or is the idea that we may want to extend the fork to allow more than 2 in future?

386

Again, does this need to be a SmallVector and can it just be const SCEV *Starts[2];?

llvm/include/llvm/Analysis/ScalarEvolution.h
460

Hi Graham, it feels slightly awkward having to essentially redefine a ForkedPointer as a pair in the same way as ForkedPtrs in llvm/include/llvm/Analysis/LoopAccessAnalysis.h. Maybe we could define something like:

using ForkedPointer = std::pair<const SCEV *, const SCEV *>;

in llvm/include/llvm/Analysis/ScalarEvolution.h that can be used in LoopAccessAnalysis too?

llvm/lib/Analysis/LoopAccessAnalysis.cpp
186–187

nit: Perhaps you can just write assert(Fork <= 1 && ...) since it's unsigned anyway?

310

nit: Perhaps you can just write assert(Fork <= 1 && ...) since it's unsigned anyway?

461

Have you rewritten the logic here as a performance improvement, i.e. to avoid calling Group.addPointer() after it's already been merged?

697

nit: This is just a suggestion, but you could restructure this a little to remove the extra indentation I think here:

if (!AR) {
  if (!RtCheck.AllowForkedPtrs)
    return false;
  ...

Not sure if this is better?

713

Is it better to just make a recursive call to hasComputableBounds instead of calling isAffine here? We're potentially missing out on future improvements to hasComputableBounds here, and also we're ignoring the possibility of a forked pointer being loop invariant I think.

llvm/test/Transforms/LoopVectorize/forked-pointers.ll
2

Can we also have a test where at least one of the forked pointers is loop-invariant?

huntergr added inline comments.Sep 7 2021, 7:31 AM
llvm/include/llvm/Analysis/LoopAccessAnalysis.h
365

So this is for a RuntimeCheckingPtrGroup, which may have more than two members already; I've only changed it from a SmallVector to a SmallSetVector here in order to avoid duplicate members, since both sides of a forked pointer can be added to the same group.

386

I'll look into it, but it'll make other parts messier since I can't just iterate over the members of the SmallVectors and would need to perform null checks for all those places for the second element.

One solution I thought of (but didn't implement, since I wanted some community feedback on the overall work first) was to create a new SCEVForkedExpr so that we wouldn't need to store two SCEVs here and could just note which fork(s) was represented. It'll take a while to implement that, but might be cleaner. Thoughts?

llvm/lib/Analysis/LoopAccessAnalysis.cpp
461

Not intentionally, no -- I just replicated the behaviour of the original (break out of the loop if the pointer merged into a group) but considered both potential forks.

llvm/test/Transforms/LoopVectorize/forked-pointers.ll
2

We already do; see forked_ptrs_uniform_and_contiguous_forks

We could properly analyze and vectorize that case as well, but I haven't implemented that yet so I'm just testing that it gets rejected for now.

david-arm added inline comments.Sep 7 2021, 7:39 AM
llvm/test/Transforms/LoopVectorize/forked-pointers.ll
2

Ah ok, sorry I missed that. I guess what I meant was that this should be trivial to implement, particularly if we can find a way of making calls to hasComputableBounds recursive and re-use the existing code that checks for loop-invariants and affine pointers.

huntergr updated this revision to Diff 378920.Oct 12 2021, 1:53 AM

Rebased, updated based on review comments.

huntergr marked 4 inline comments as done.Oct 12 2021, 2:09 AM
huntergr added inline comments.
llvm/lib/Analysis/LoopAccessAnalysis.cpp
697

We still needed to bail out if AR was false and forked pointers weren't allowed... so I rewrote it to check for the positive case for AR first and then proceed with the forked pointers check and default false afterwards.

713

We can't do that right now -- hasComputableBounds takes a Value* rather than a SCEV* so it can (potentially) be added to the stride map in replaceSymbolicStrideSCEV. We'd just end up looking at the same Value and splitting again.

This is something the SCEVForkedExpr would make cleaner, since we would only need to evaluate a single SCEV. But it'll take a bit of refactoring to do that, which is why I wanted some feedback on the whole idea first.

We could also separate out parts of these functions to make it recursive, but I'll need to be careful since replaceSymbolicStrideSCEV has other users.

llvm/test/Transforms/LoopVectorize/forked-pointers.ll
2

Sadly, it'll require a bit more work to support invariant addresses. My original downstream code allowed them, but we ran into a bug with it and disabled them.

My plan is to get the base functionality committed, then go back and add an interface so that LoopVectorize (or other LAA consumers) can query the type of the forks in order to generate correct (and hopefully more optimal) IR for various cases -- the current contiguous-only SCEVs, strides of >1, loop-invariant but unknown strides, uniform/invariant addresses, indexed gather/scatter, etc.

If both forks have a stride of 1, or are invariant, then we could potentially plant two masked load instructions (or load + broadcast) instead of a gather, for instance. But that's future work until this part is completed.

Thanks for making the changes @huntergr! I've got a few mostly minor comments so far, but still have to review findForkedSCEVs. :)

llvm/include/llvm/Analysis/LoopAccessAnalysis.h
409

Hi @huntergr, thanks for making the changes to Exprs. I just wonder if we should have an assert here that ScExprs.size() <= 2?

llvm/lib/Analysis/LoopAccessAnalysis.cpp
395

nit: Is it worth calling CheckingGroups.emplace_back(I, *this, /*Fork=0*/0) for clarity here?

709

nit: This is just a minor comment, but you could remove indentation further here by bailing out early, i.e.

if (!FPtr)
  return false;

const SCEV *A =
713

nit: Instead of writing:

LLVM_DEBUG(dbgs() << "LAA: SCEV1: " << *(FPtr->first) << "\n");

I think you can just write

LLVM_DEBUG(dbgs() << "LAA: SCEV1: " << *A << "\n");

and same for the second one.

715

Is there any danger in setting this before we return true?

762

You're introducing a new implicit TypeSize -> uint64_t cast here. Could you rewrite this as:

int64_t Size = DL.getTypeAllocSize(PtrTy->getElementType()).getFixedSize();
huntergr updated this revision to Diff 381211.Oct 21 2021, 4:43 AM

Rebased, minor fixes from review comments.

huntergr marked 5 inline comments as done.Oct 21 2021, 4:45 AM
david-arm added a subscriber: lebedev.ri.

Thanks for dealing with all the review comments @huntergr! I think the patch looks sensible to me, although I'm not as familiar with the SCEV code as others might be. I'm adding @lebedev.ri as a potential reviewer if that's ok?

llvm/test/Transforms/LoopVectorize/forked-pointers.ll
33

It would be good to show some distinction here between Check 1 and Check 2. I assume it's actually checking each of the forked pointers, but the output doesn't make that clear.

huntergr updated this revision to Diff 387879.Wed, Nov 17, 2:23 AM
huntergr marked an inline comment as done.

Rebased, and changed the membership of a checking group to be a pair of PointerInfo index and fork so that printing can show which forks are present in a group.

huntergr updated this revision to Diff 388889.Mon, Nov 22, 5:23 AM

Rebased, disabled by default, added a couple of different instructions into the tests to ensure those paths are at least covered even if they aren't in a positive case right now; I'm still planning to leave those cases (known strides, g/s, uniform) for a followup patch; I suppose I could start a patch series if people want to see those earlier.

I've run the LNT nightly suite with this enabled, along with a few HPC benchmarks I have access to and they all pass.

LGTM! Thanks for making all the changes @huntergr and adding the tests. This patch is low risk at the moment with the flag being disabled currently. At some point once there have been performance investigations to prove there are no regressions we can then enable this by default.

david-arm accepted this revision.Tue, Nov 23, 5:10 AM
This revision is now accepted and ready to land.Tue, Nov 23, 5:10 AM
fhahn added a comment.Tue, Nov 23, 5:22 AM

LGTM! Thanks for making all the changes @huntergr and adding the tests. This patch is low risk at the moment with the flag being disabled currently. At some point once there have been performance investigations to prove there are no regressions we can then enable this by default.

llvm/include/llvm/Analysis/LoopAccessAnalysis.h
20

Is this needed? Other SCEV types are forward-declared below, so ScalarEvolution.h doesn't need to be included.

334

Could we use a bool for Fork if it only has 2 possible values?

llvm/include/llvm/Analysis/ScalarEvolution.h
750

This is only used by LAA. Is there a reason this needs to be part of ScalarEvolution?

llvm/test/Transforms/LoopVectorize/forked-pointers.ll
2

the tests running -loop-accesses should be in llvm/test/Analysis/LoopAccessAnalysis.

Also, could you pre-commit the tests and update the diff here to show only the difference? That way it is a bit easier to see the impact in the diff.

I just realized how this may be a bit similar to how we handle pointers that are phi nodes. Currently those are handled by adding accesses for both incoming values (see D109381). Unfortunately the same approach cannot be directly used for selects, because we need to create 2 pointers that do not exist in the IR.

But if MemAccessInfo/ would also carry the pointer SCEV directly, I think it would be possible to avoid adding another dimension to RuntimePointerChecking::PointerInfo. Instead we would add 2 PointerInfo entries with separate translated pointer SCEVs. I put up a rough sketch of what this may look like in D114480, D114479 to see how this may look like

fhahn added a comment.Tue, Nov 23, 4:19 PM

I just realized how this may be a bit similar to how we handle pointers that are phi nodes. Currently those are handled by adding accesses for both incoming values (see D109381). Unfortunately the same approach cannot be directly used for selects, because we need to create 2 pointers that do not exist in the IR.

But if MemAccessInfo/ would also carry the pointer SCEV directly, I think it would be possible to avoid adding another dimension to RuntimePointerChecking::PointerInfo. Instead we would add 2 PointerInfo entries with separate translated pointer SCEVs. I put up a rough sketch of what this may look like in D114480, D114479 to see how this may look like

I also put up a sketch of a stripped down variant that only supports runtime check generation by adding multiple PointerInfos with different associated pointer SCEVs: D114487.

The variant that also extends MemAccessInfo should also be able to determine that certain loops are safe without runtime checks, e.g. like the one below I think:

%s1 = type { [32000 x float], [32000 x float], [32000 x float] }
define dso_local void @foo(%s1 * nocapture readonly %Base, i32* nocapture readonly %Preds) {
entry:
  br label %for.body

for.cond.cleanup:
  ret void

for.body:
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds i32, i32* %Preds, i64 %indvars.iv
  %0 = load i32, i32* %arrayidx, align 4
  %cmp1.not = icmp eq i32 %0, 0
  %gep.1 = getelementptr inbounds %s1, %s1* %Base, i64 0, i32 1, i32 0
  %gep.2 = getelementptr inbounds %s1, %s1* %Base, i64 0, i32 2, i32 0
  %spec.select = select i1 %cmp1.not, float* %gep.1, float* %gep.2
  %.sink.in = getelementptr inbounds float, float* %spec.select, i64 %indvars.iv
  %.sink = load float, float* %.sink.in, align 4
  %1= getelementptr inbounds %s1, %s1 * %Base, i64 0, i32 0, i64 %indvars.iv
  store float %.sink, float* %1, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, 100
  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
}
huntergr updated this revision to Diff 389956.Fri, Nov 26, 2:22 AM

Revised based on @fhahn 's initial suggestions, rebased.

I'll look into the MemAccessInfo approach.

huntergr marked 3 inline comments as done.Fri, Nov 26, 2:27 AM
huntergr added inline comments.
llvm/include/llvm/Analysis/LoopAccessAnalysis.h
20

It was there to include the 'using ForkedPointer =' definition, but moving it out of ScalarEvolution means we can just define it here.

I had hoped that we could maybe integrate it further into ScalarEvolution as a new type of SCEV, but maybe this isn't the right time -- you're correct, this will be the only user for now, so we can just keep it here until we find a case where it would be useful to have it more widely available.