This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Remove scalable get_active_lane_mask calls which are always false
AbandonedPublic

Authored by MattDevereau on Apr 12 2023, 5:52 AM.

Details

Summary

The LoopVectorizer pass can emit calls to get_active_lane_mask who's operands
are induction variables which in reality only increment in a single loop
iteration before the get_active_lane_mask call, meaning the intrinsic always
returns an all-false predicate. It is possible to detect this in InstCombine
and eliminate the always false intrinsic which allows optimizations further
down the pipeline to trigger for scalable vectors.

Diff Detail

Event Timeline

MattDevereau created this revision.Apr 12 2023, 5:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 12 2023, 5:52 AM
MattDevereau requested review of this revision.Apr 12 2023, 5:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 12 2023, 5:52 AM

Thanks for this @MattDevereau! Looks like a nice improvement. I just have a few comments ...

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
3219

I think the following optimisation should also work for fixed-width vectors, right? It just means that in your MinVScaleElts calculations below you wouldn't need to query the vscale minimum.

3231

I wonder if this is the canonical form or not? For example, I don't know if the operands would ever be swapped so the PHI node is on the right? Perhaps in order to make this more robust you could do something like:

Value *AddOp1, *AddOp2;
if (!match(Op0, m_Add(m_Value(AddOp1), m_Value(AddOp2))))
  break;

Value *NonPhi = AddOp2;
PHINode *Phi = dyn_cast<PHINode>(AddOp1);
if (!Phi && (Phi = dyn_cast<PHINode>(AddOp2)))
  NonPhi = AddOp1;

if (!Phi)
  break;

Not saying this is the best code obviously, but hopefully it helps to explain what I mean.

3235

I think this condition needs to be a bit stronger, i.e. check the PHI has only two operands and that one incoming value comes from this block, then look for the other incoming value. The problem is that this might not be a loop, but instead just be a block with two predecessors.

nikic added a comment.Apr 12 2023, 1:11 PM

Do I understand correctly that this is basically optimizing get.active.lane.mask for the case where we can compute the range of op0 and show that it is always >= op1? And it implements that range calculation for this special case of a post-inc IV?

I don't think this is quite right in that it does not account for addition overflow. That's not actually possible in your specific test cases, but I don't think your implementation has sufficient preconditions to prove this.

Why does the loop vectorizer generate this code in the first place? Given that it involves reasoning about IVs, it might be more straightforward to handle this in SCEV/LV.

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
3254

takeName on a constant doesn't make sense.

Do I understand correctly that this is basically optimizing get.active.lane.mask for the case where we can compute the range of op0 and show that it is always >= op1? And it implements that range calculation for this special case of a post-inc IV?

Yes that's correct. If its impossible for the vectorization factor op0 to be lower than op1 in get.active.lane.mask it will always return an all false mask. This optimization is to fix an edge case kicked out by loop vectorizer which is unable to perform this optimization.

I don't think this is quite right in that it does not account for addition overflow. That's not actually possible in your specific test cases, but I don't think your implementation has sufficient preconditions to prove this.

Would this be as simple as requiring nsw/nuw flags on the add?

Why does the loop vectorizer generate this code in the first place? Given that it involves reasoning about IVs, it might be more straightforward to handle this in SCEV/LV.

It generates this with low trip counts when LTO is enabled. It tries to vectorize modules with tail folding but the trip count is unknown at this time. After linking it can do a second round of optimizations where the trip count is known and we can replace the get.active.lane.mask call with a constant

nikic added a comment.Apr 14 2023, 4:53 AM

Do I understand correctly that this is basically optimizing get.active.lane.mask for the case where we can compute the range of op0 and show that it is always >= op1? And it implements that range calculation for this special case of a post-inc IV?

Yes that's correct. If its impossible for the vectorization factor op0 to be lower than op1 in get.active.lane.mask it will always return an all false mask. This optimization is to fix an edge case kicked out by loop vectorizer which is unable to perform this optimization.

I don't think this is quite right in that it does not account for addition overflow. That's not actually possible in your specific test cases, but I don't think your implementation has sufficient preconditions to prove this.

Would this be as simple as requiring nsw/nuw flags on the add?

Yes, that should be enough.

FYI there is a matchSimpleRecurrence() helper to check for the general "phi + binop recurrence" pattern.

Why does the loop vectorizer generate this code in the first place? Given that it involves reasoning about IVs, it might be more straightforward to handle this in SCEV/LV.

It generates this with low trip counts when LTO is enabled. It tries to vectorize modules with tail folding but the trip count is unknown at this time. After linking it can do a second round of optimizations where the trip count is known and we can replace the get.active.lane.mask call with a constant

Does D148010 fix your issue? This is exactly why vectorizations shouldn't be run pre-link.

@nikic I've updated this patch just for the sake of it, I expect D148010 should fix the problem and make this patch unnecessary. Unfortunately I only have a small snippet of IR generated by clang but do not have my hands on the source code that generated this. Until I have the source I can't yet conclude if D148010 has fixed the original problem.

MattDevereau marked 3 inline comments as done.Apr 18 2023, 4:50 AM
MattDevereau added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
3219

I was under the impression this issue only appeared because of the vscale/vectorisation factor values stopping llvm from recognising everything as constants. There is get_active_lane_mask logic in ConstantFolding.cpp, however until I have the source code of this problem which I can test with fixed-width vectors, or run the source of this problem with D148010 applied it's hard to say if fixed-width vectors are really an issue.

3231

I've replaced this with some logic based on a call to matchSimpleRecurrence which should prove with a bit more tightness that it's actually a loop

3235

I've replaced this with some logic based on a call to matchSimpleRecurrence which should prove with a bit more tightness that it's actually a loop

MattDevereau marked 2 inline comments as done.Jun 5 2023, 6:32 AM

@nikic I'm still interested in landing this patch as this is a problem for tail vectorization on AArch64. Though D148010 would likely fix this, that patch is a large change that seems to be more of a longer term goal. Is it possible to re-evaluate landing this patch?

david-arm added inline comments.Jun 16 2023, 9:12 AM
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
3241

I think in reality the add of an induction variable generated by the loop vectoriser won't have the nsw and nuw flags on it so this optimisation won't trigger in practice. I don't think we need to require these flags though for your optimisation to work. You should just be able to check the start value for the PHI is 0, because you know it cannot wrap on the first iteration. This is typically the canonical form for vectoriser output as well I think.

3246

We should also bail out if the it's not zero here I think.

3258

By bailing out earlier for non-zero values of PhiOp0 you can simplify this to:

if (MinVScaleElts < Op1->getZExtValue())
  break;
llvm/test/Transforms/InstCombine/get-active-lane-mask.ll
43

It would be good to test this without the nuw and nsw flags, since that's what the vectoriser generates.

MattDevereau marked 3 inline comments as done.Jul 11 2023, 8:06 AM
david-arm added inline comments.Jul 12 2023, 9:11 AM
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
3237

I think 'L' and 'R' are a bit misleading here because they are actually 'Start' and 'Step', whereas on reading the current code it sounds like 'Left' and 'Right'. Can you rename them please?

3242

Again, perhaps rename to StartC?

3246

In theory, Vf and 'Step' (or 'R' in the current code) should be identical. If not, then we can't apply the optimisation. I'm thinking of a case like this:

loop:
  %idx = phi i64 [ 0, %preheader ], [ %idx.inc, %loop ]
  ...
  %idx.new = add i64 %idx, %vf
  %idx.inc = add i64 %idx, %other_thing
  %pred = call <vscale x 16 x i1> @llvm.active.get.lane.mask.nxv16i1.i64(%idx.new, %limit)

So I think you need to check that Vf == Step somehow.

llvm/test/Transforms/InstCombine/get-active-lane-mask.ll
5

I think you either have to make this a generic test without the "target triple" or move this test into a AArch64 directory.

Matt added a subscriber: Matt.Aug 1 2023, 2:29 PM
MattDevereau abandoned this revision.Sep 21 2023, 10:41 AM