This patch enables the vectorizer to generate both scalar and vector versions of an integer induction variable for a given loop. Previously, we only generated a scalar induction variable if we knew all its users were going to be scalar. Otherwise, we generated a vector induction variable. In the case of a loop with both scalar and vector users of the induction variable, we would generate the vector induction variable and extract scalar values from it for the scalar users. With this patch, we now generate both versions of the induction variable when there are both scalar and vector users and select which version to use based on whether the user is scalar or vector.
Details
Diff Detail
Event Timeline
Out of curiosity - what does the final generate code end up looking with and without this patch, for cases where we have both a scalar and a vector use?
It seems like it should be better, I'm wondering how much.
lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
1976 | Why "Int"? I mean, it may be called only from widenIntInduction now, but I don't really see a reason to bake it into the name, the logic is independent of the IV type. Regardless, I find the use of "scalarize" sort of confusing. It evokes the notion of constructing a vector, and then extracting the scalars, which is the opposite of what we're doing. Maybe something like needsScalarInduction, and then a matching name for the variable below? | |
1980 | Why do we need !isLoopInvariant(U)? If the use is loop invariant, shouldn't it necessarily be a scalar? (Assuming we do need it, why two separate ifs, and not a && ?) |
If you don't mind looking at the full code, I've pasted below what we generate for the new test case I've added to induction.ll (VF=2, UF=2, after instcombine).
Without this patch:
vector.body: ; preds = %vector.body, %vector.ph %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ] %vec.ind = phi <2 x i64> [ <i64 0, i64 1>, %vector.ph ], [ %vec.ind.next, %vector.body ] %vec.ind2 = phi <2 x i32> [ <i32 0, i32 1>, %vector.ph ], [ %vec.ind.next5, %vector.body ] %step.add = add <2 x i64> %vec.ind, <i64 2, i64 2> %step.add3 = add <2 x i32> %vec.ind2, <i32 2, i32 2> %3 = add <2 x i32> %broadcast.splat, %vec.ind2 %4 = add <2 x i32> %broadcast.splat, %step.add3 %5 = trunc <2 x i32> %3 to <2 x i16> %6 = trunc <2 x i32> %4 to <2 x i16> %7 = extractelement <2 x i64> %vec.ind, i32 0 %8 = getelementptr inbounds %pair.i16, %pair.i16* %p, i64 %7, i32 1 %9 = extractelement <2 x i64> %vec.ind, i32 1 %10 = getelementptr inbounds %pair.i16, %pair.i16* %p, i64 %9, i32 1 %11 = extractelement <2 x i64> %step.add, i32 0 %12 = getelementptr inbounds %pair.i16, %pair.i16* %p, i64 %11, i32 1 %13 = extractelement <2 x i64> %step.add, i32 1 %14 = getelementptr inbounds %pair.i16, %pair.i16* %p, i64 %13, i32 1 %15 = extractelement <2 x i16> %5, i32 0 store i16 %15, i16* %8, align 2 %16 = extractelement <2 x i16> %5, i32 1 store i16 %16, i16* %10, align 2 %17 = extractelement <2 x i16> %6, i32 0 store i16 %17, i16* %12, align 2 %18 = extractelement <2 x i16> %6, i32 1 store i16 %18, i16* %14, align 2 %index.next = add i64 %index, 4 %vec.ind.next = add <2 x i64> %vec.ind, <i64 4, i64 4> %vec.ind.next5 = add <2 x i32> %vec.ind2, <i32 4, i32 4> %19 = icmp eq i64 %index.next, %n.vec br i1 %19, label %middle.block, label %vector.body, !llvm.loop !0
With this patch:
vector.body: ; preds = %vector.body, %vector.ph %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ] %vec.ind2 = phi <2 x i32> [ <i32 0, i32 1>, %vector.ph ], [ %vec.ind.next5, %vector.body ] %3 = or i64 %index, 1 %4 = or i64 %index, 2 %5 = or i64 %index, 3 %step.add3 = add <2 x i32> %vec.ind2, <i32 2, i32 2> %6 = add <2 x i32> %broadcast.splat, %vec.ind2 %7 = add <2 x i32> %broadcast.splat, %step.add3 %8 = trunc <2 x i32> %6 to <2 x i16> %9 = trunc <2 x i32> %7 to <2 x i16> %10 = getelementptr inbounds %pair.i16, %pair.i16* %p, i64 %index, i32 1 %11 = getelementptr inbounds %pair.i16, %pair.i16* %p, i64 %3, i32 1 %12 = getelementptr inbounds %pair.i16, %pair.i16* %p, i64 %4, i32 1 %13 = getelementptr inbounds %pair.i16, %pair.i16* %p, i64 %5, i32 1 %14 = extractelement <2 x i16> %8, i32 0 store i16 %14, i16* %10, align 2 %15 = extractelement <2 x i16> %8, i32 1 store i16 %15, i16* %11, align 2 %16 = extractelement <2 x i16> %9, i32 0 store i16 %16, i16* %12, align 2 %17 = extractelement <2 x i16> %9, i32 1 store i16 %17, i16* %13, align 2 %index.next = add i64 %index, 4 %vec.ind.next5 = add <2 x i32> %vec.ind2, <i32 4, i32 4> %18 = icmp eq i64 %index.next, %n.vec br i1 %18, label %middle.block, label %vector.body, !llvm.loop !0
lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
1976 | I don't feel that strongly about the name, so I'm fine with this. | |
1980 | isScalarAfterVectorization returns true for values or instructions not in the loop. Using !isLoopInVariant ensures we're looking at an instruction in the loop that will be scalar. With the &&, we would get a clang-format line break, so I kept it as two if's because I thought it might be easier to read. Let me know if you prefer one over the other. |
Addressed Michael's comments.
- Rebased.
- Renamed "shouldScalarizeIntInduciton" to "needsScalarInduction"
Yes, that does look much better, thanks!
(I actually meant post-CG, but the IR looks so much better that doesn't really matter).
lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
1975–1986 | Oh, so if your only scalar users are outside the loop, then you don't want to generate a scalar IV. Makes sense. Anyway, if all what you want to check is whether the user is in the loop, why are you checking isLoopInvariant and not Contains (like isScalarAfterVectorization does)? |
lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
1975–1986 |
I don't think so. If there is a user of the induction variable outside the loop (but no scalar users in the loop), yes, the induction variable would remain vector. The external induction variable users are handled in fixupIVUsers (this was recently added I think). And I think the fixup's are based on the canonical induction variable.
isLoopInvariant does the additional check that the value is an instruction. But I'm thinking we should just cast the user to an instruction here and use Contains. That should make this more clear. I'll update the patch. |
LGTM after the Invariant -> Contains change.
lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
1975–1986 |
Yes, I've added that, I just wanted to make sure this doesn't interfere, and was too lazy to look up what I actually did. :-) |
Use contains instead of isLoopInvariant.
Thanks for the review, Michael! I'll wait to commit until after D22867 is approved.
Similarly to other patch, it would be good to describe the changed functionality in this patch -- you only describe the end state. (Sorry to keep banging on this but given the complexity I think it's crucial to keep a good a record. In generally, it's also a good idea to have the description match the proposed commit log. If you use arcanist, passing --verbatim to arc diff supports this flow. )
Essentially the part that is missing is the before picture which I think is this. We used to use a scalar IV *if and only if* there was no need for a vector IV. Otherwise we would still use the vector IV and extract the scalar value from it. With this new patch we can now generate both and use either variant of the IV depending on the user. Correct?
lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
1979–1982 | Can we do this with std::any_of? | |
2007 | Why are you calling this 'EntryVal'? |
Adam,
That's exactly right. I'll be sure to use that language in the commit log. Echoing what you said for the record: Previously, we only used the scalar IV if we knew all users were going to be scalar. In the case of a loop with both scalar and vector users, we would generate the vector IV and extract scalar values from it for the scalar users. The functional change here is that now when there are both scalar and vector users, we generate both versions of the IV and select which version to use based on whether the user is scalar or vector.
lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
1975–1986 | Yes, I think so! I'll update the patch. | |
2007 | It's the entry value in WidenMap. For the truncation case, the truncate instruction maps to the new narrow IV. I'm happy to rename this though. I probably went with EntryVal because I couldn't think of anything better at the time. |
Addressed Adam's comments.
- Rebased.
- Used any_of in needsScalarInduction. I also simplified this a bit by removing the explicit check for the IV itself. At the very least, if the IV is scalar, it's update instruction will also be scalar.
- Made needsScalarInduction more readable.
- Updated summary to reflect functional changes.
test/Transforms/LoopVectorize/induction.ll | ||
---|---|---|
604–678 | I could be wrong, but now this test does not seem to test what it was meant for. I thought the point was to ensure that most of the work to get the vector IV set up is pushed into the preheader. But now it seems that we no longer generate that? |
test/Transforms/LoopVectorize/induction.ll | ||
---|---|---|
604–678 | Yeah, I think you're right. With the current patch, the vector IV is complete removed after instcombine. We generate both a scalar one and a vector one (because of the store) during vectorization. But because the store is scalarized, instcombine is able to remove the vector IV. If we add a pre-instcombine check for this test, we could check the original functionality as well. What do you think? |
test/Transforms/LoopVectorize/induction.ll | ||
---|---|---|
604–678 | Ah, I didn't see that this was a non-consecutive store. What if you make it consecutive to avoid the store to get scalarized (the non-zero based IV would still require us to create a new IV hopefully)? |
test/Transforms/LoopVectorize/induction.ll | ||
---|---|---|
604–678 | That will work -- we'll be left with both a scalar and a vector IV after instcombine. I'll update the patch. Thanks! |
Addressed Adam's comments.
- Made store consecutive in test case and added a pre-instcombine check. We now generate both scalar and vector IVs for this test.
Why "Int"? I mean, it may be called only from widenIntInduction now, but I don't really see a reason to bake it into the name, the logic is independent of the IV type.
Regardless, I find the use of "scalarize" sort of confusing. It evokes the notion of constructing a vector, and then extracting the scalars, which is the opposite of what we're doing. Maybe something like needsScalarInduction, and then a matching name for the variable below?
(If you find the current terminology clearer, I won't insist on this, this could be just my own bias.)