The LoopVectorizer pass can emit calls to get_active_lane_mask whose operands
are induction variables that in reality only increment for a single loop
iteration before the get_active_lane_mask call, meaning the intrinsic always
returns an all-false predicate. It is possible to detect this in InstCombine
and eliminate the always-false intrinsic, which allows optimizations further
down the pipeline to trigger for scalable vectors.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Thanks for this @MattDevereau! Looks like a nice improvement. I just have a few comments ...
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp | ||
---|---|---|
3219 | I think the following optimisation should also work for fixed-width vectors, right? It just means that in your MinVScaleElts calculations below you wouldn't need to query the vscale minimum. | |
3231 | I wonder if this is the canonical form or not? For example, I don't know if the operands would ever be swapped so the PHI node is on the right? Perhaps in order to make this more robust you could do something like: `Value *AddOp1, *AddOp2; if (!match(Op0, m_Add(m_Value(AddOp1), m_Value(AddOp2)))) break; Value *NonPhi = AddOp2; PHINode *Phi = dyn_cast<PHINode>(AddOp1); if (!Phi && (Phi = dyn_cast<PHINode>(AddOp2))) NonPhi = AddOp1; if (!Phi) break;` Not saying this is the best code obviously, but hopefully it helps to explain what I mean. | |
3235 | I think this condition needs to be a bit stronger, i.e. check the PHI has only two operands and that one incoming value comes from this block, then look for the other incoming value. The problem is that this might not be a loop, but instead just be a block with two predecessors. |
Do I understand correctly that this is basically optimizing get.active.lane.mask for the case where we can compute the range of op0 and show that it is always >= op1? And it implements that range calculation for this special case of a post-inc IV?
I don't think this is quite right in that it does not account for addition overflow. That's not actually possible in your specific test cases, but I don't think your implementation has sufficient preconditions to prove this.
Why does the loop vectorizer generate this code in the first place? Given that it involves reasoning about IVs, it might be more straightforward to handle this in SCEV/LV.
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp | ||
---|---|---|
3254 | takeName on a constant doesn't make sense. |
Yes, that's correct. If it's impossible for op0 of get.active.lane.mask to be lower than op1, the intrinsic will always return an all-false mask. This optimization fixes an edge case emitted by the loop vectorizer, which is unable to perform this optimization itself.
I don't think this is quite right in that it does not account for addition overflow. That's not actually possible in your specific test cases, but I don't think your implementation has sufficient preconditions to prove this.
Would this be as simple as requiring nsw/nuw flags on the add?
Why does the loop vectorizer generate this code in the first place? Given that it involves reasoning about IVs, it might be more straightforward to handle this in SCEV/LV.
It generates this with low trip counts when LTO is enabled. It tries to vectorize modules with tail folding, but the trip count is unknown at that point. After linking it can do a second round of optimizations where the trip count is known and we can replace the get.active.lane.mask call with a constant.
Yes, that should be enough.
FYI there is a matchSimpleRecurrence() helper to check for the general "phi + binop recurrence" pattern.
Why does the loop vectorizer generate this code in the first place? Given that it involves reasoning about IVs, it might be more straightforward to handle this in SCEV/LV.
It generates this with low trip counts when LTO is enabled. It tries to vectorize modules with tail folding, but the trip count is unknown at that point. After linking it can do a second round of optimizations where the trip count is known and we can replace the get.active.lane.mask call with a constant.
Does D148010 fix your issue? This is exactly why vectorizations shouldn't be run pre-link.
@nikic I've updated this patch just for the sake of it, but I expect D148010 should fix the problem and make this patch unnecessary. Unfortunately I only have a small snippet of IR generated by clang and do not have my hands on the source code that generated it. Until I have the source, I can't conclude whether D148010 has fixed the original problem.
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp | ||
---|---|---|
3219 | I was under the impression this issue only appeared because the vscale/vectorisation-factor values stop LLVM from recognising everything as constants. There is get_active_lane_mask logic in ConstantFolding.cpp; however, until I have the source code of this problem to test with fixed-width vectors, or can run it with D148010 applied, it's hard to say whether fixed-width vectors are really an issue. | |
3231 | I've replaced this with some logic based on a call to matchSimpleRecurrence, which should prove more tightly that this is actually a loop. | |
3235 | I've replaced this with some logic based on a call to matchSimpleRecurrence, which should prove more tightly that this is actually a loop.
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp | ||
---|---|---|
3241 | I think in reality the add of an induction variable generated by the loop vectoriser won't have the nsw and nuw flags on it so this optimisation won't trigger in practice. I don't think we need to require these flags though for your optimisation to work. You should just be able to check the start value for the PHI is 0, because you know it cannot wrap on the first iteration. This is typically the canonical form for vectoriser output as well I think. | |
3246 | We should also bail out if it's not zero here, I think. | |
3258 | By bailing out earlier for non-zero values of PhiOp0 you can simplify this to: `if (MinVScaleElts < Op1->getZExtValue()) break;` | |
llvm/test/Transforms/InstCombine/get-active-lane-mask.ll | ||
43 | It would be good to test this without the nuw and nsw flags, since that's what the vectoriser generates. |
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp | ||
---|---|---|
3237 | I think 'L' and 'R' are a bit misleading here because they are actually 'Start' and 'Step', whereas on reading the current code it sounds like 'Left' and 'Right'. Can you rename them please? | |
3242 | Again, perhaps rename to StartC? | |
3246 | In theory, Vf and 'Step' (or 'R' in the current code) should be identical. If not, then we can't apply the optimisation. I'm thinking of a case like this: `loop: %idx = phi i64 [ 0, %preheader ], [ %idx.inc, %loop ] ... %idx.new = add i64 %idx, %vf %idx.inc = add i64 %idx, %other_thing %pred = call <vscale x 16 x i1> @llvm.active.get.lane.mask.nxv16i1.i64(%idx.new, %limit)` So I think you need to check that Vf == Step somehow. | |
llvm/test/Transforms/InstCombine/get-active-lane-mask.ll | ||
5 | I think you either have to make this a generic test without the "target triple" or move this test into a AArch64 directory. |