This patch transforms the following IR:

  %1 = extractelement <8 x i8> %v1, i32 0
  %conv1 = zext i8 %1 to i16
  %2 = extractelement <8 x i8> %v1, i32 1
  %conv2 = zext i8 %2 to i16
  ...
  store i16 %conv1, i16* %arrayidx1
  store i16 %conv2, i16* %arrayidx2

into:

  %1 = zext <8 x i8> %v1 to <8 x i16>
  %2 = extractelement <8 x i16> %1, i32 0
  %3 = extractelement <8 x i16> %1, i32 1
  ...
  store i16 %2, i16* %arrayidx1
  store i16 %3, i16* %arrayidx2

It also transforms the following IR:

  %1 = load i8, i8* %arrayidx1
  %conv1 = zext i8 %1 to i16
  %2 = load i8, i8* %arrayidx2
  %conv2 = zext i8 %2 to i16
  ...
  %x0 = insertelement <8 x i16> undef, i16 %conv1, i32 0
  %x1 = insertelement <8 x i16> %x0, i16 %conv2, i32 1

into:

  %1 = load i8, i8* %arrayidx1
  %2 = load i8, i8* %arrayidx2
  %9 = insertelement <8 x i8> undef, i8 %1, i32 0
  %10 = insertelement <8 x i8> %9, i8 %2, i32 1
  ...
  %17 = zext <8 x i8> %16 to <8 x i16>

In summary, for an <N x M> vector this saves N-1 ext instructions.
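As an illustration (not part of the patch), here is a hypothetical C routine whose lowering could produce the load/zext/insertelement pattern above; the function name and shapes are illustrative only:

```c
#include <stdint.h>

/* Hypothetical source sketch: each of 8 bytes is zero-extended to 16 bits
 * and stored. Before the patch, the vectorizer emits one scalar zext per
 * element; with it, a single vector zext of the whole <8 x i8> value. */
void widen_store8(const uint8_t src[8], uint16_t dst[8]) {
    for (int i = 0; i < 8; i++)
        dst[i] = (uint16_t)src[i]; /* zext i8 -> i16, then store i16 */
}
```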
Details
- Reviewers: nadav, t.p.northover
Diff Detail
- Repository: rL LLVM
Event Timeline
I've added a few nits.
Reviewers might be interested to know why you're running the ADCE pass after the SLP pass.
I'll defer to Tim, James, and others to comment on the overall approach.
lib/Transforms/Vectorize/SLPVectorizer.cpp
  64: Please remove the cl::ZeroOrMore option. These should not be used with cl::opt.
  68: Please remove the cl::ZeroOrMore option. These should not be used with cl::opt.
  405: Please do not add whitespace.
  3122: Why is this necessary?
  3130: Same. Is this necessary?
  3212: 80-column violation?
  4176: Please add comments for the various cases you're trying to detect and avoid.
  4185: Running clang-format might resolve some of the formatting issues.
  4201: Don't evaluate .size() every iteration.
  4344: Maximize 80-column.
test/Transforms/SLPVectorizer/AArch64/combine-extractelement.ll
  5: I assume you want to remove this comment along with the others?
  129: Shouldn't we be checking something here?
test/Transforms/SLPVectorizer/AArch64/combine-insertelement.ll
  167: Shouldn't we be checking something here?
  179: Shouldn't we be checking something here?
I need ADCE to clean up some code left behind by the SLPVectorizer, and ADCE is only run once over the whole compilation, so adding one more run of the pass is not a bad thing.
lib/Transforms/Vectorize/SLPVectorizer.cpp
  3122: This is to work around a compile warning, similar to existing code.
  3130: This is to work around a compile warning, similar to existing code.
  3212: Will fix that, thanks.
  4176: The comments are before the loop.
  4185: Will do that, thanks.
  4201: Will fix that.
  4344: Will fix that.
test/Transforms/SLPVectorizer/AArch64/combine-extractelement.ll
  129: This case is for a future extension. I can remove it, but it doesn't hurt to keep it here.
test/Transforms/SLPVectorizer/AArch64/combine-insertelement.ll
  167: This case is for a future extension. I can remove it, but it doesn't hurt to keep it here.
  179: This case is for a future extension. I can remove it, but it doesn't hurt to keep it here.
Hi Lawrence,
The SLP vectorizer already supports collecting trees that start at insertElement (see “findBuildVector”), and definitely supports trees that start at stores. It looks like you are adding special handling for these instructions just to work around the cost model, which is the wrong way of implementing vectorization of insert/extract instructions. Did you look into the code that calculates the cost of vector zext/sext?
-Nadav
Hi, Nadav:
Thanks for your comments.
This is a joint patch between Ana and me. Yes, I noticed there is some existing code for insertElement; however, since it doesn't catch our case, I didn't check whether it could be extended.
Sorry for the late reply; I was asked to focus on a release feature. I will get back to this and take a detailed look after that feature is done, hopefully this week or next.
Thanks
Lawrence Hu
Hi, Nadav:
Very sorry to get back to you so late.
I did more investigation of the existing code. For the following example:
  %1 = load i32, i32* %arrayidx1
  %conv1 = zext i32 %1 to i64
  %2 = load i32, i32* %arrayidx2
  %conv2 = zext i32 %2 to i64
  %x0 = insertelement <2 x i64> undef, i64 %conv1, i32 0
  %x1 = insertelement <2 x i64> %x0, i64 %conv2, i32 1
  ret <2 x i64> %x1
the existing logic generates the following IR (I had to bypass the cost function to get this), which is not efficient; that is probably why the cost function doesn't allow it:
  %1 = load i32, i32* %arrayidx1
  %2 = load i32, i32* %arrayidx2
  %3 = insertelement <2 x i32> undef, i32 %1, i32 0
  %4 = insertelement <2 x i32> %3, i32 %2, i32 1
  %5 = zext <2 x i32> %4 to <2 x i64>
  %6 = extractelement <2 x i64> %5, i32 0
  %x0 = insertelement <2 x i64> undef, i64 %6, i32 0
  %7 = extractelement <2 x i64> %5, i32 1
  %x1 = insertelement <2 x i64> %x0, i64 %7, i32 1
  ret <2 x i64> %x1
However, the following IR is much more efficient:
  %1 = load i32, i32* %arrayidx1
  %2 = load i32, i32* %arrayidx2
  %3 = insertelement <2 x i32> undef, i32 %1, i32 0
  %4 = insertelement <2 x i32> %3, i32 %2, i32 1
  %5 = zext <2 x i32> %4 to <2 x i64>
That's what our patches do.
Because our code targets this particular pattern and generates much more efficient code, I think keeping it is a reasonable choice.
What do you think?
Regards
Lawrence Hu
I forgot to mention: I investigated why the cost function doesn't allow further processing. It is because the loads in my example are not from consecutive memory locations, so a gather operation is needed; when NeedGather is true, the cost function won't allow further vectorization.
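To illustrate the gather situation, here is a hypothetical C-level sketch (names and indices are illustrative, not taken from the patch) where the two widened loads come from non-adjacent elements:

```c
#include <stdint.h>

/* Hypothetical sketch of the non-consecutive-loads case: a[0] and a[5]
 * are not adjacent in memory, so building a <2 x i32> from them needs a
 * "gather" (two scalar loads plus inserts) rather than one wide load,
 * which is what trips the cost model's NeedGather check. */
void gather_widen(const uint32_t *a, uint64_t out[2]) {
    out[0] = (uint64_t)a[0]; /* load i32, zext to i64 */
    out[1] = (uint64_t)a[5]; /* load i32 from a non-consecutive slot */
}
```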
Hi Lawrence,
I haven't looked into this patch in details, but I have a couple of suggestions that would help further review:
- upload the patch with full context
- separate independent parts into different patches (e.g. adding the ADCE pass after SLP is totally independent of the new functionality you implemented in SLP)
- Please describe the case you're working on in IR terms, not asm. SLP operates at the IR level, so it's easier to grasp what transformation you're seeking if we use IR.
Thanks,
Michael
lib/Transforms/Vectorize/SLPVectorizer.cpp
  4346: SmallPtrSet could be used here instead.
test/Transforms/SLPVectorizer/AArch64/combine-insertelement.ll
  2: The new functionality in SLP should be tested independently of other passes. If you're also interested in the outcome of the subsequent ADCE, then you might want to add *another* test for ADCE (the output of SLP would be the input for ADCE).
  46–48: I believe that wrapped line would be a syntax error.
  87–88: Some line is missing here.
  166: No reason to add this now - when you submit another patch with the extension in the future, you'll be asked to add a testcase.
At a high level, this transformation seems overly restrictive, and will need cost-modeling work. A couple of thoughts:
- I don't see why you're restricting this to extracts used by stores (or inserts fed by loads); if the goal is to save on [zs]ext instructions, then this seems profitable regardless of how these are used. Moreover, I don't understand why there's a hasOneUse() check on the [zs]ext instructions.
- The [zs]ext instructions that you're trying to eliminate might be free, at least in combination with the extract or insert, rendering this a bad idea. Consider the (unfortunately common) case where the target does not actually support a vector extract at all, and so it is lowered by storing the vector on the stack and then doing a scalar load of the requested element. In this case, if the target supports the corresponding scalar extending load, the extension is free. Likewise, for those [zs]ext fed by loads, these might be free if the target supports the corresponding extending load. Worse, the vector [zs]ext you're forming might not be legal at all (this is the most-serious potential problem).
lib/Transforms/Vectorize/SLPVectorizer.cpp
  4215: embeeded -> embedded
  4223: Why? This does not seem necessary. It seems as though this could be profitable for any Size >= 2*(number of underlying vector ext instructions).
We are already doing this kind of optimization in SelectionDAG. The SLPVectorizer is not the right place for this kind of transformation.
lib/Transforms/IPO/PassManagerBuilder.cpp
  290: The SLP vectorizer should clean up after itself. Is it not?
  360: The SLP vectorizer should clean up after itself. Is it not?
  504: Why do we need ADCE here? The SLP vectorizer should clean up after itself. We already have DCE and CSE built into the SLP vectorizer.
lib/Transforms/Vectorize/SLPVectorizer.cpp
  68: Why do we need two flags for insert and extract? Do you feel like this feature is experimental? Did you run some performance measurements on the LLVM test suite? Are you seeing any wins?
  3116: This part looks fine.
  3203: What does the function return?
  3209: Please document the functions below.
  4092: What's going on here? Why do you need to zext/sext?
  4121: Same comment as above. Why do you need to zext/sext?
  4151: Is there a restriction on the placement of the insertelement instructions? Do they need to come from the same basic block?
  4206: Please add more comments. I don't understand what's going on here.
Just saw comments from Hal and Nadav.
For Hal's comments:
- If the original ext is used more than once, then it can't be deleted after my transformation, so the transformation may not gain anything. That's why I check hasOneUse() on it.
- I agree, this transformation is designed for AArch64, so I could make it AArch64-specific.
For Nadav's comment "We are already doing this kind of optimization in SelectionDAG; the SLPVectorizer is not the right place for this kind of transformation": do you mean I shouldn't do my transformation in the SLPVectorizer? At least for our case, SelectionDAG is unable to catch it, and that caused a performance loss.
For the rest of the coding comments, I will address them in another patch update.
Thanks
lib/Transforms/Vectorize/SLPVectorizer.cpp
  68: I can remove those two flags. I measured our internal benchmarks and saw wins; I will run performance measurements on the LLVM test suite.
No, you'd replace them all with the corresponding extract. What am I missing?

> - I agree, this transformation is designed for AArch64, so I could make it AArch64-specific.
> For Nadav's comment "We are already doing this kind of optimization in SelectionDAG; the SLPVectorizer is not the right place for this kind of transformation": do you mean I shouldn't do my transformation in the SLPVectorizer? At least for our case, SelectionDAG is unable to catch it, and that caused a performance loss.

Why is it not able to catch it? We need to understand that before we move forward with adding handling in the SLP vectorizer for this.
Hi folks,
Just want to clarify where this issue comes from:
- SROA will replace large allocas with vector SSA values.
E.g., an alloca of "short a[32]" is rewritten as 4 vectors of type <8 x i16> to avoid loads/stores to the stack-allocated variable.
This results in insert/extract instructions being generated in the IR.
- The AArch64 backend is not able to combine scattered loads and stores with the insert/extract instructions to generate scalar/lane-based loads/stores in the presence of extension instructions.
Example 1: When there is no extension/truncation of the loaded values, we are fine; the backend generates optimized code.

  x = ld
  y = insert x v1, 1

Generates:

  ld1 { v0.b }[1], [x0]
Example 2: But when extension instructions are present:

  x = ld
  y = ext x
  z = insert y v1, 1

Generates:

  ldrb w8, [x0]
  ins v0.h[1], w8

However, this is better code:

  ld1 { v0.b }[1], [x0]
  ushll v0.8h, v0.8b, #0
You notice it is better code when you have more than one insert instruction:

  ldrb w8, [x0]
  ldrb w9, [x1]
  ins v0.h[1], w8
  ins v0.h[5], w9

Better code would be:

  ld1 { v0.b }[1], [x0]
  ld1 { v0.b }[5], [x1]
  ushll v0.8h, v0.8b, #0
The same is true for extract instructions:
  umov w8, v0.b[1]
  umov w9, v0.b[5]
  strh w8, [x0]
  strh w9, [x1]
Better code would be:
  ushll v0.8h, v0.8b, #0
  st1 { v0.h }[1], [x0]
  st1 { v0.h }[5], [x1]
Therefore after SROA we need to detect these patterns in the IR and fix the IR code so the backend can generate the optimized instructions.
This should be done in a target-independent way. Maybe it can be done in InstCombine or the SLP vectorizer (as in this patch).
Even though it is SROA that generates the insert/extract instructions, I do not think we should fix this there.
This is the problem Lawrence is trying to solve. Any other suggestions?
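For a concrete picture of the source-level pattern being discussed, here is a hypothetical C sketch (the function and array sizes are illustrative, not from the actual workload): a fixed-size stack array that SROA can promote to vector SSA values, with per-element widening that then shows up as insert/extract plus ext instructions in the IR:

```c
#include <stdint.h>

/* Hypothetical example: SROA may rewrite the stack array `a` as vector
 * SSA values (e.g. 4 x <8 x i16>), so the per-element widening below
 * becomes insertelement/zext IR of the kind discussed in this thread. */
uint32_t sum_widened(const uint8_t src[32]) {
    uint16_t a[32]; /* candidate for SROA promotion */
    for (int i = 0; i < 32; i++)
        a[i] = (uint16_t)src[i]; /* zext i8 -> i16 per element */
    uint32_t s = 0;
    for (int i = 0; i < 32; i++)
        s += a[i];
    return s;
}
```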
If any of the original ext instructions is used more than once, it can't be deleted even though I insert a vector ext instruction later, which may make this transformation unprofitable. Of course a cost model would be better, but I didn't add one because the performance gain here is not big, and a complicated cost model may not justify it.
> - I agree, this transformation is designed for AArch64, so I could make it AArch64-specific.
> For Nadav's comment "We are already doing this kind of optimization in SelectionDAG; the SLPVectorizer is not the right place for this kind of transformation": do you mean I shouldn't do my transformation in the SLPVectorizer? At least for our case, SelectionDAG is unable to catch it, and that caused a performance loss.
> Why is it not able to catch it? We need to understand that before we move forward with adding handling in the SLP vectorizer for this.
I will have to investigate that; I didn't know SelectionDAG could handle this until now.