This is an archive of the discontinued LLVM Phabricator instance.

Differential D6677

[SLPVectorizer] Reorder operands of shufflevector if it can result in a vectorized code.
ClosedPublic

Authored by karthikthecool on Dec 16 2014, 1:49 AM.

Details

Reviewers

nadav
aschwaighofer
karthikthecool

Summary

Hi All,
The code below was not being vectorized by the shufflevector (alternate opcode) path in SLPVectorizer.

float fa[4],fb[4],fc[4];
void fn() {
  fc[0] = fb[0]+fa[0];
  fc[1] = fa[1]-fb[1];
  fc[2] = fa[2]+fb[2];
  fc[3] = fa[3]-fb[3];
}

This was because we took the operands in the given order, and fb[0] and fa[1] are not consecutive accesses. However, '+' is commutative for both float and int, the types for which we handle ShuffleVector. Hence we can reorder the addition fb[0] + fa[0] -> fa[0] + fb[0], in which case buildTree_rec is able to recognize the loads as consecutive and vectorize them.

In this patch we check whether reordering the operands of commutative operations in an AltShuffle can result in vectorization; if so, we reorder the operands of the commutative operation.

Please let me know if this is good to commit.

Thanks and Regards
Karthik Bhat
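
The reordering described in the summary can be illustrated with a toy model. This is only a sketch under stated assumptions, not the code in the patch: `Load`, `Lane`, `isConsecutive` and `reorderAltLanes` are hypothetical names, loads are modelled as (array, index) pairs, and only '+' is treated as commutative.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model of a scalar load: which array is read and at what index.
struct Load {
  char array;
  int index;
};

// One scalar statement of the alternating pattern: left <op> right.
struct Lane {
  char op; // '+' (commutative) or '-' (not commutative)
  Load left;
  Load right;
};

// True if B reads the element immediately after A in the same array.
static bool isConsecutive(const Load &A, const Load &B) {
  return A.array == B.array && B.index == A.index + 1;
}

// Greedy pass over neighbouring lanes. When an operand of lane j is followed
// in memory by the operand on the opposite side of lane j + 1, swap the
// operands of whichever of the two lanes is commutative.
static void reorderAltLanes(std::vector<Lane> &Lanes) {
  for (std::size_t j = 0; j + 1 < Lanes.size(); ++j) {
    Lane &Cur = Lanes[j];
    Lane &Next = Lanes[j + 1];
    assert((Cur.op == '+' || Cur.op == '-') && "only add/sub are modelled");
    assert((Next.op == '+' || Next.op == '-') && "only add/sub are modelled");
    bool Crossed = isConsecutive(Cur.left, Next.right) ||
                   isConsecutive(Cur.right, Next.left);
    if (!Crossed)
      continue;
    if (Cur.op == '+')
      std::swap(Cur.left, Cur.right);
    else if (Next.op == '+')
      std::swap(Next.left, Next.right);
    // If neither lane is commutative, the order is left untouched.
  }
}
```

For the four statements of fn() above, only the first lane (fb[0]+fa[0]) is swapped in this model, after which every left operand reads fa and every right operand reads fb.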

Diff Detail

Repository: rL LLVM

Event Timeline

karthikthecool updated this revision to Diff 17325. Dec 16 2014, 1:49 AM

karthikthecool retitled this revision from to [SLPVectorizer] Reorder operands of shufflevector if it can result in a vectorized code..

karthikthecool updated this object.

karthikthecool edited the test plan for this revision.

karthikthecool added reviewers: aschwaighofer, nadav.

karthikthecool set the repository for this revision to rL LLVM.

karthikthecool added a subscriber: Unknown Object (MLST).

Hi Karthik,

This patch is very similar to http://reviews.llvm.org/D6675. Do you think it's possible to reuse a reordering code from one patch in another?

Hi Michael,
I went through the link http://reviews.llvm.org/D6675. The swap operation is different for these 2 patches. The patch in http://reviews.llvm.org/D6675 tries to swap across lanes and is valid only for integer operands.

In this patch, though, we swap within the same lane when the operation is commutative, so it can be applied to floating point operands as well, because float addition etc. are commutative but NOT necessarily associative.
I feel having a separate function to handle this case would be better. But please let me know if you feel otherwise; I will see if I can get some common parts out for the 2 patches.

Thanks and Regards
Karthik Bhat
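
The distinction drawn in this reply, that swapping the two operands of one floating-point addition is safe while regrouping operands is not, can be checked with a small sketch. The helper names `sameAfterSwap` and `sameAfterRegroup` are made up for illustration, and the sketch assumes IEEE 754 single precision, non-NaN inputs and no fast-math flags.

```cpp
#include <cassert>

// Adds the same two operands in either order and compares the results.
static bool sameAfterSwap(float A, float B) {
  float Original = A + B;
  float Swapped = B + A;
  return Original == Swapped;
}

// Regroups three operands, which is what moving a term into a different
// addition would amount to, and compares the results.
static bool sameAfterRegroup(float A, float B, float C) {
  float LeftFirst = (A + B) + C;
  float RightFirst = A + (B + C);
  return LeftFirst == RightFirst;
}
```

With A = 1e20f, B = -1e20f and C = 1.0f the two groupings give 1.0f and 0.0f respectively, because 1.0f is absorbed when it is added to -1e20f first.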

aschwaighofer added inline comments. Jan 7 2015, 9:51 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
357	Why does that only apply to altShuffle operands? Shouldn't this work on any commutative binary vector operation?
1802–1991	Can you add some documentation to this function? Something like // Reorder operands of commutative operations if the resulting vectors are consecutive loads.

Hi Arnold,
Please find my comments inline. I will add more documentation to the function if this looks good with you.
Thanks and Regards
Karthik Bhat

lib/Transforms/Vectorize/SLPVectorizer.cpp
357	Hi Arnold, The reason why we applied reorderAltShuffleOperands to altShuffle operands is that code such as

  fc[0] = fa[0]+fb[0];
  fc[1] = fb[1]+fa[1]; // operands have been swapped in code.
  fc[2] = fa[2]+fb[2];
  fc[3] = fa[3]+fb[3];

already gets vectorized, and the reordering happens in the reorderInputsAccordingToOpcode function. But reorderInputsAccordingToOpcode cannot be called in the case of AltShuffleOperands, as not all operations used in AltShuffleOperands are commutative (i.e. add is commutative but sub is not). Hence we added reorderAltShuffleOperands to handle ordering in the case of a shuffle vector with an alt opcode.

Hi Karthik,

Thanks for the answer, I agree with you.

Please also see a comment from me inline.

Thanks for working on this!

lib/Transforms/Vectorize/SLPVectorizer.cpp

358–360

I don't think reorderInputsAccordingToOpcode currently handles it. I.e. it can accidentally handle it in some cases, but it doesn't always do that. For example, the following code doesn't get vectorized:

define void @foo() #0 {
  %1 = load i32* getelementptr inbounds ([1000 x i32]* @a, i32 0, i64 0), align 4
  %2 = load i32* getelementptr inbounds ([1000 x i32]* @b, i32 0, i64 0), align 4
  %3 = add nsw i32 %1, %2
  store i32 %3, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 0), align 4
  %4 = load i32* getelementptr inbounds ([1000 x i32]* @a, i32 0, i64 1), align 4
  %5 = load i32* getelementptr inbounds ([1000 x i32]* @b, i32 0, i64 1), align 4  

  ; Please note that %4 and %5 are swapped in the following line:
  %6 = add nsw i32 %5, %4

  store i32 %6, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 1), align 4
  %7 = load i32* getelementptr inbounds ([1000 x i32]* @a, i32 0, i64 2), align 4
  %8 = load i32* getelementptr inbounds ([1000 x i32]* @b, i32 0, i64 2), align 4
  %9 = add nsw i32 %7, %8
  store i32 %9, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 2), align 4
  %10 = load i32* getelementptr inbounds ([1000 x i32]* @a, i32 0, i64 3), align 4
  %11 = load i32* getelementptr inbounds ([1000 x i32]* @b, i32 0, i64 3), align 4
  %12 = add nsw i32 %10, %11
  store i32 %12, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 3), align 4
  ret void
}

It might make sense to handle such cases explicitly, like you do for altShuffles.

Hi Michael,
Thanks for the inputs. Please find my comments inline.
Thanks
Karthik Bhat

lib/Transforms/Vectorize/SLPVectorizer.cpp
358–360	Hi Michael, Thanks for the inputs. I feel the reason the above code doesn't get vectorized is that on a 64-bit machine the GVN pass combines the two 32-bit loads into a 64-bit load, and as a result the pattern match in SLP fails. You can refer to D6654 for the same. If we run the GVN pass after SLPVectorizer, the above code gets vectorized. SLP vectorizer not being able to vectorize widened loads is a separate issue and I plan to work on it shortly. I will also try to find an example where we need to handle this case in reorderInputsAccordingToOpcode. Actually, I tried to add this pattern matching at the end of reorderInputsAccordingToOpcode, but it results in a few regressions in "operandorder.ll" by reordering "good" source order. I'm trying to debug this separately as well.

Hi Karthik,

While the loads widening is a real problem, the example I wrote in previous comment doesn't need any GVN invocation at all - you can run slp (+basicaa) on it and see that SLP fails to vectorize it. To make it even clearer, we can use double instead of i32:

Vectorized:

Original code:

double a[1000], b[1000], c[1000];
void foo()
{
  c[0] = a[0] + b[0];
  c[1] = a[1] + b[1];
}

IR:

define void @foo() #0 {
  %1 = load double* getelementptr inbounds ([1000 x double]* @a, i32 0, i64 0), align 4
  %2 = load double* getelementptr inbounds ([1000 x double]* @b, i32 0, i64 0), align 4
  %3 = fadd double %1, %2
  store double %3, double* getelementptr inbounds ([1000 x double]* @c, i32 0, i64 0), align 4
  %4 = load double* getelementptr inbounds ([1000 x double]* @a, i32 0, i64 1), align 4
  %5 = load double* getelementptr inbounds ([1000 x double]* @b, i32 0, i64 1), align 4
  %6 = fadd double %4, %5
  store double %6, double* getelementptr inbounds ([1000 x double]* @c, i32 0, i64 1), align 4
  ret void
}

IR after SLP:

define void @foo() #0 {
  %1 = load <2 x double>* bitcast ([1000 x double]* @a to <2 x double>*), align 16, !tbaa !2
  %2 = load <2 x double>* bitcast ([1000 x double]* @b to <2 x double>*), align 16, !tbaa !2
  %3 = fadd <2 x double> %1, %2
  store <2 x double> %3, <2 x double>* bitcast ([1000 x double]* @c to <2 x double>*), align 16, !tbaa !2
  ret void
}

Not vectorized:

Original code:

double a[1000], b[1000], c[1000];
void foo()
{
  c[0] = a[0] + b[0];
  c[1] = b[1] + a[1]; // a[1] and b[1] are swapped
}

IR:

define void @foo() #0 {
  %1 = load double* getelementptr inbounds ([1000 x double]* @a, i32 0, i64 0), align 4
  %2 = load double* getelementptr inbounds ([1000 x double]* @b, i32 0, i64 0), align 4
  %3 = fadd double %1, %2
  store double %3, double* getelementptr inbounds ([1000 x double]* @c, i32 0, i64 0), align 4
  %4 = load double* getelementptr inbounds ([1000 x double]* @a, i32 0, i64 1), align 4
  %5 = load double* getelementptr inbounds ([1000 x double]* @b, i32 0, i64 1), align 4
  %6 = fadd double %5, %4    ; %4 and %5 are swapped
  store double %6, double* getelementptr inbounds ([1000 x double]* @c, i32 0, i64 1), align 4
  ret void
}

IR after SLP:

define void @foo() #0 {
  %1 = load double* getelementptr inbounds ([1000 x double]* @a, i64 0, i64 0), align 16, !tbaa !2
  %2 = load double* getelementptr inbounds ([1000 x double]* @b, i64 0, i64 0), align 16, !tbaa !2
  %3 = load double* getelementptr inbounds ([1000 x double]* @b, i64 0, i64 1), align 8, !tbaa !2
  %4 = load double* getelementptr inbounds ([1000 x double]* @a, i64 0, i64 1), align 8, !tbaa !2
  %5 = insertelement <2 x double> undef, double %1, i32 0
  %6 = insertelement <2 x double> %5, double %3, i32 1
  %7 = insertelement <2 x double> undef, double %2, i32 0
  %8 = insertelement <2 x double> %7, double %4, i32 1
  %9 = fadd <2 x double> %6, %8
  store <2 x double> %9, <2 x double>* bitcast ([1000 x double]* @c to <2 x double>*), align 16, !tbaa !2
  ret void
}

One correction to my last reply: I wrote that SLP doesn't vectorize the example, which is not technically correct. It vectorizes the code, but the code is far from optimal.

Hi Michael,
Thanks for the clarifications. Updated the code to handle reordering for any commutative binary vector operation where it is profitable.
Post this we are able to optimally vectorize your example.
Please let me know if this looks good to you.

Thanks and Regards
Karthik Bhat

Hi Michael,
One small update in the patch. I was not handling the case where the right operand is a load and the left is some other binary operation in reorderAltShuffleOperands; as a result we were not vectorizing code such as -

void foo(double* c,double* restrict a,double* restrict b,double* restrict d) {
  c[0] = (a[0] + b[0])-d[0];
  c[1] = d[1]+(a[1]+b[1]);
}

This is fixed with this updated patch. This is not required in reorderInputsAccordingToOpcode as it is already handled while we sort as per the opcode.
Please let me know your inputs on the same.

Thanks
Karthik Bhat

Hi Karthik,

Thanks for the updated patch, it looks good to me except some minor points (you could find my remarks inline).

Thanks,
Michael

lib/Transforms/Vectorize/SLPVectorizer.cpp
358–363	Do we really need these functions to be public?
1939–1940	Something wrong with formatting here.
1952	A missing sentence end after 'tree'?
1963–1971	To minimize change in the current behavior, we could place this check before the last loop. In this case we won't try to reorder the operands if one part is splat or 'allSameOpcode'. Also, we'll get rid of needsReordering flag.
test/Transforms/SLPVectorizer/X86/operandorder.ll
254–256	A C equivalent in comments would be useful here and before other test cases.

Hi Michael,
Thanks for your time and review comments.
Updated the patch addressing the review comments.

Thanks
Karthik Bhat

Hi Michael,
While still at this patch, I fixed one more issue (a small change) originally present in reorderInputsAccordingToOpcode.

In reorderInputsAccordingToOpcode, AllSameOpcodeLeft and AllSameOpcodeRight should be true if all the opcodes on the left and right respectively are the same. But in the code we can see that in the loop we are resetting the values to -

AllSameOpcodeLeft = I0;
AllSameOpcodeRight = I1;

for each iteration. As a result we end up checking if the last 2 instructions have the same opcode in left and right.
As a result a code such as -

void foo(float* restrict a,float* restrict b,float* restrict c, float* restrict d)
{
  a[0] = (b[0]+c[0])+d[0];
  a[1] = d[1]+(b[1]+c[1]);
  a[2] = (b[2]+c[2])+d[2];
  a[3] = (b[3]+c[3])+d[3];
}

results in suboptimal vectorization, as it considers all opcodes on the left and right to be the same (since the last 2 have the same opcode on left and right) and doesn't reorder the 2nd instruction, which could have resulted in better vectorized code.
This updated patch fixes this issue and added a test case for the same.

Please let me know your inputs on the same.

Thanks and Regards
Karthik Bhat
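
The effect of resetting the flag on every iteration can be reproduced in a few lines. The sketch below is an illustration only: `Opcode`, `allSameOpcodeBuggy` and `allSameOpcodeFixed` are hypothetical names, opcodes are reduced to an enum, and every operand is assumed to be an instruction (so the reset always yields true).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the opcodes of the operands on one side.
enum Opcode { Load, FAdd };

// Models the behaviour described above: the flag is reset at the top of every
// iteration, so only the last adjacent pair decides the final value.
static bool allSameOpcodeBuggy(const std::vector<Opcode> &Ops) {
  bool AllSame = true;
  for (std::size_t i = 0; i < Ops.size(); ++i) {
    AllSame = true; // stand-in for `AllSameOpcodeLeft = I0;`
    if (i && Ops[i - 1] != Ops[i])
      AllSame = false;
  }
  return AllSame;
}

// Accumulates the result instead: once a mismatch is seen it is never undone.
static bool allSameOpcodeFixed(const std::vector<Opcode> &Ops) {
  bool AllSame = true;
  for (std::size_t i = 1; i < Ops.size(); ++i)
    if (Ops[i - 1] != Ops[i])
      AllSame = false;
  return AllSame;
}
```

For the left operands of foo() above, modelled as {FAdd, Load, FAdd, FAdd}, the first function reports "all same" because the last two entries match, while the second does not.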

Hi Karthik,

Thanks for the updated patch, please see my comments inline.

As for the AllSameOpcode changes - your fix looks right anyway, and I think it might go independently of this patch.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1939–1940	You might want to return from here. Otherwise, the broadcast/allSameOpcode could be spoiled by the code that follows.
1960	This check looks strange. Why do we bail out if e.g. left-operands are the same? If all left-elements are the same, we should have `AllSameOpcode`, otherwise I see no special value in keeping them together. I guess that here you wanted to discard some cases, when we don't want to change anything not to spoil already good disposition. If that's the case, I think we need a better check to detect such cases (e.g. `Left[i]` and `Left[i+1]` are consecutive or something like this).
1963	It is possible that `Left[i]` and `Right[i+1]` are not consecutive, yet the swap is beneficial. See the following example:

  a[0] + b[0]
  b[1] + c[1]
  c[2] + d[2]
  d[3] + e[3]

In this scenario, the current code won't swap anything, but it could be transformed to

  a[0] + b[0]
  c[1] + b[1]
  c[2] + d[2]
  e[3] + d[3]

if we check `Right[i]` vs `Left[i+1]` too. It's an artificial example though, and maybe it's not worth the code to handle it (but I'm not sure if the original code we are trying to handle isn't artificial as well). It would be great btw if you shared the motivation for these changes - have you seen something like this in real-world tests?
test/Transforms/SLPVectorizer/X86/addsub.ll
254	Missed `;` at the end of the line :)
280–283	What about C-code for this test?
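
The a/b/c/d/e example in the comment on line 1963 can be traced with a small model. This is a sketch under assumptions, not the reviewed code: `Load`, `isConsecutive` and `swapForConsecutive` are hypothetical names, every lane is assumed to be a commutative addition of two loads, and the swap is always applied to lane j + 1.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model of a scalar load: which array is read and at what index.
struct Load {
  char array;
  int index;
};

// True if B reads the element immediately after A in the same array.
static bool isConsecutive(const Load &A, const Load &B) {
  return A.array == B.array && B.index == A.index + 1;
}

// Walks neighbouring lanes and swaps the operands of lane j + 1 when that
// lines up a consecutive pair. With CheckBothDirections == false only
// Left[j] vs Right[j + 1] is tested; with true, Right[j] vs Left[j + 1] is
// tested as well. Returns the number of swaps performed.
static unsigned swapForConsecutive(std::vector<Load> &Left,
                                   std::vector<Load> &Right,
                                   bool CheckBothDirections) {
  assert(Left.size() == Right.size() && "one operand pair per lane");
  unsigned Swaps = 0;
  for (std::size_t j = 0; j + 1 < Left.size(); ++j) {
    bool LeftToRight = isConsecutive(Left[j], Right[j + 1]);
    bool RightToLeft =
        CheckBothDirections && isConsecutive(Right[j], Left[j + 1]);
    if (LeftToRight || RightToLeft) {
      std::swap(Left[j + 1], Right[j + 1]);
      ++Swaps;
    }
  }
  return Swaps;
}
```

On that example the one-direction check performs no swap, while the two-direction check swaps the second and fourth lanes, giving the transformed order quoted in the comment.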

Hi Michael,
Thanks for the review. Please find the updated patch and my comments inline.
Please let me know if this is good to commit.
Thanks and Regards
Karthik Bhat

Please find my comments inline. Thanks.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1939–1940	We cannot return from here, as in the case of allSameOpcode we still might want to reorder. For instance, in the example which you mentioned in previous comments,

  c[0] = a[0] + b[0];
  c[1] = b[1] + a[1]; // a[1] and b[1] are swapped

all the left and right operands have the same opcode, i.e. load. So AllSameOpcodeRight and AllSameOpcodeLeft would be true, but we still want to reorder those loads if they are commutative across instructions, as in the example. Yes, if this was a broadcast we might not want to reorder and would keep the original order, for which we have the check in the code below:

  !(Left[j] == Left[j + 1] || Right[j] == Right[j + 1])

This implies that if the instruction which we are trying to reorder is part of a broadcast, we do not reorder.
1960	Hi Michael, As mentioned in previous comments, AllSameOpcode just checks whether all the opcodes of the instructions are the same, not necessarily the instructions themselves. To check if this is a broadcast we need to compare the actual instructions. Hence this check was added to make sure that we do not reorder anything that is part of a broadcast. Now that I think about it, I think we can check LeftBroadcast and RightBroadcast before entering the loop and avoid this check inside the loop.
1963	I initially wanted to handle this just for shuffle vectors, as I had found some hand-written code in our codebase which had that pattern, but extended it generically to any commutative operands as per review comments from Arnold. As you mentioned, I'm not sure if it is worth handling all combinations, but I will add this check anyway as it doesn't seem to be a lot of overhead.

Hi Karthik,

Thanks for the comments, please find my answers inline (there is a couple of them in the tests too).

lib/Transforms/Vectorize/SLPVectorizer.cpp
1960	Yep, I'd prefer checking for broadcast before the loop. Actually, checking for AllSameOpcode before the reordering loop now seems useless - anyway it's invalidated when we swap operands. So, what do you think we should do here? Check for broadcasts, reorder commutative operands, check for AllSameOperand, then revert to original version if all of these attempts failed?
2390–2393	My understanding is that we only vectorize ShuffleVectors if it's AltInst, which means VL0 should always be a BinaryOperator. Am I wrong here? If I'm not, please add an assert and remove if-else here.

Hi Michael,
Thanks for the review. Addressed all review comments.
I'm ok with the current way we are handling reorderInputsAccordingToOpcode. If we were to move the load matching before checking AllSameOpcodeRight we would need to maintain a flag so that we do not reorder operands that were swapped during consecutive load matching.(This would end up similar to the initial version of the patch that was uploaded with needsReordering flag). I do not see much difference in the 2 approaches. I hope we can keep this as it is?

Please let me know if i can go ahead and commit this into mainline or if you have any other comments.
Thanks a lot.

Regards
Karthik Bhat

Hi Karthik,

Thanks for the fixes, the patch mostly looks ok, but I still have a bad feeling regarding reorderInputsAccordingToOpcode.
Now we have two checks for if(!(LeftBroadcast || RightBroadcast)) in a row, which doesn't look good. And the overall logic of the function becomes really fuzzy.

You're right that it resembles the original approach. However, I don't think needToReorder flag solves the problem in a good way either. To solve it properly, we need to set our priorities first by answering simple questions:

1. If the original operands formed a broadcast, do we want to touch them?
2. What do we prefer: get AllSameOpcode operands, or operands, some of which are consecutive?
3. What if after swapping to create consecutive operands we lose AllSameOpcode property?
4. What if operands are AllSameOpcode, but not consecutive?
5. What if left operands are broadcast, but if we swap one of them with a right one, we'll get consecutive access in the right operands?

...

I guess that we actually only care about cases in which we can make operands both consecutive, and 'AllSameOpcode' - if either of these two conditions is false, we most probably end up with no-vectorization. Thus, we want to maintain both properties at the same time, bailing out if that's not possible.

As for the broadcasting, I think it's always bad to break it, so once we discover it, we want to keep it.

Do you agree with these general strategies, or do you have another opinion?

If we agree on them, it'll be easy to efficiently align the code around them, like the following:

1. Sort the operands as we do right now.
2. Check for broadcasts, if isSplat(Left) or isSplat(Right) - return.
3. Try to reorder things to create as many consecutive loads as possible.
4. Check if we have AllSameOpcode in either left, or right operands. If yes, return Left and Right, otherwise - LeftOrig and RightOrig.

Hi Michael,
Thanks for the review. Addressed review comments after a slight modification in the code flow. This version makes minimal changes in reorderInputsAccordingToOpcode, and to make the logic clear I have added appropriate comments in the code.
The only change is the check we have added at the end of reorderInputsAccordingToOpcode to reorder operands to create a longer vectorizable chain without affecting the AllSameOpcode property.

I hope this makes things clearer in reorderInputsAccordingToOpcode?

Please find the updated patch and comments inline-

If the original operands formed a broadcast, do we want to touch them? > No. As you mentioned if we detect a broadcast we do not reorder them. We can return as soon as we detect a broadcast.

What do we prefer: get AllSameOpcode operands, or operands, some of which are consecutive? > If we have AllSameOpcode operands reodering it so that consecutive loads are grouped together will not change AllSameOpcode property. > This case (i.e AllSameOpcode and code hits our swap logic) is only possible when we have all loads in left and right side of the binary > operation.)

What if after swapping to create consecutive operands we lose AllSameOpcode property? > This is not possible as far as i can tell. > If we have AllSameOpcode and if we enter in to the condition to swap to create consecutive operands. It means that we have all loads in > left and right lane. Hence swaping will retain AllSameOpcode property as we will still have loads on either side after swap.

What if operands are AllSameOpcode, but not consecutive? > Our logic should not alter anything in this scenario.

What if left operands are broadcast, but if we swap one of them with a right one, we'll get consecutive access in the right operands? > As you mentioned we don't want to disturb boardcast. Hence we do not do anything here.

Comments on code flow suggested -

Sort the operands as we do right now. > OK
Check for broadcasts, if isSplat(Left) or isSplat(Right) - return. > OK
Try to reorder things to create as many consecutive loads as possible. > OK
Check if we have AllSameOpcode in either left, or right operands. If yes, return Left and Right, otherwise - LeftOrig and RightOrig. > We will have a problem here. As mentioned above in case we have AllSameOpcode there are 2 cases which we have to handle- > 1) If operands are AllSameOpcode but reordering will result in consecutive loads and retain *AllSameOpcode* property. > 2) If operands are AllSameOpcode but not consecutive loads. > So in the 1st case above we would like to reorder operands but in the second case we return the LeftOrig and RightOrig. > In both these case AllSameOpcode is the same but in one case we want to reorder and other case we want to return LeftOrig and RightOrig. > So if we follow this code flow any operands reordered in step 3 will be discarded and we will return LeftOrig and RightOrig preventing > vectorization.

So the final code flow will be something like -

Sort the operands as we do right now. (Same as in original code)
If broadcast then return. (Same as in original code but we return now as you suggested)
Check if we have AllSameOpcode if yes retain the original left/right order (same as original code)
Check if we can reorder the opcodes without disturbing good operand order if yes reorder the same. (Our logic)

Thanks
Karthik Bhat

Hi Michael, sorry, I think the comments above are not readable. Some problem with Phabricator formatting.
Updated comments to make them more readable.

Please find the updated patch and comments inline-

1. If the original operands formed a broadcast, do we want to touch them?

No. As you mentioned, if we detect a broadcast we do not reorder them. We can return as soon as we detect a broadcast.

2. What do we prefer: get AllSameOpcode operands, or operands, some of which are consecutive?

If we have AllSameOpcode operands, reordering them so that consecutive loads are grouped together will not change the AllSameOpcode property.
(This case, i.e. AllSameOpcode and the code hits our swap logic, is only possible when we have all loads on the left and right side of the binary
operation.)

3. What if after swapping to create consecutive operands we lose AllSameOpcode property?

This is not possible as far as I can tell.
If we have AllSameOpcode and we enter the condition to swap to create consecutive operands, it means that we have all loads in the
left and right lanes. Hence swapping will retain the AllSameOpcode property, as we will still have loads on either side after the swap.

4. What if operands are AllSameOpcode, but not consecutive?

Our logic should not alter anything in this scenario.

5. What if left operands are broadcast, but if we swap one of them with a right one, we'll get consecutive access in the right operands?

As you mentioned, we don't want to disturb a broadcast. Hence we do not do anything here.

Comments on code flow suggested -

1. Sort the operands as we do right now.

OK

2. Check for broadcasts, if isSplat(Left) or isSplat(Right) - return.

OK

3. Try to reorder things to create as many consecutive loads as possible.

OK

4. Check if we have AllSameOpcode in either left, or right operands. If yes, return Left and Right, otherwise - LeftOrig and RightOrig.

We will have a problem here. As mentioned above, in case we have AllSameOpcode there are 2 cases which we have to handle -
1) If operands are AllSameOpcode but reordering will result in consecutive loads and retain the *AllSameOpcode* property.
2) If operands are AllSameOpcode but not consecutive loads.
So in the 1st case above we would like to reorder operands, but in the second case we return LeftOrig and RightOrig.
In both these cases AllSameOpcode is the same, but in one case we want to reorder and in the other case we want to return LeftOrig and RightOrig.

So the final code flow will be something like -

1. Sort the operands as we do right now. (Same as in original code)
2. If broadcast then return. (Same as in original code but we return now as you suggested)
3. Check if we have AllSameOpcode; if yes, retain the original left/right order. (Same as original code)
4. Check if we can reorder the opcodes without disturbing good operand order; if yes, reorder them. (Our logic)

Thanks
Karthik Bhat
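
The final code flow listed above can be approximated by the following sketch. It is a model under assumptions rather than the committed implementation: `Operand`, `Lanes`, `isSplatModel`, `isConsecutive` and `chooseOrder` are hypothetical names, the opcode sort of step 1 is assumed to have been done already by the caller, and every lane is assumed to be commutative.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical operand model: a load of array[index], or, when array == 0,
// some other value identified only by index.
struct Operand {
  char array;
  int index;
};

struct Lanes {
  std::vector<Operand> left, right;
};

// True if every lane uses the same value (a broadcast candidate).
static bool isSplatModel(const std::vector<Operand> &V) {
  for (std::size_t i = 1; i < V.size(); ++i)
    if (V[i].array != V[0].array || V[i].index != V[0].index)
      return false;
  return !V.empty();
}

// True if A and B are loads and B reads the element right after A.
static bool isConsecutive(const Operand &A, const Operand &B) {
  return A.array != 0 && A.array == B.array && B.index == A.index + 1;
}

// Steps 2 to 4 of the flow; Sorted is the result of step 1.
static Lanes chooseOrder(const Lanes &Sorted, const Lanes &Original,
                         bool AllSameOpcodeLeft, bool AllSameOpcodeRight) {
  assert(Sorted.left.size() == Sorted.right.size());
  assert(Original.left.size() == Original.right.size());
  // Step 2: never disturb a broadcast.
  if (isSplatModel(Sorted.left) || isSplatModel(Sorted.right))
    return Sorted;
  // Step 3: if one side already has a uniform opcode, keep the source order.
  Lanes Out = (AllSameOpcodeLeft || AllSameOpcodeRight) ? Original : Sorted;
  // Step 4: swap lane j + 1 when that lines up a consecutive pair of loads.
  for (std::size_t j = 0; j + 1 < Out.left.size(); ++j)
    if (isConsecutive(Out.left[j], Out.right[j + 1]) ||
        isConsecutive(Out.right[j], Out.left[j + 1]))
      std::swap(Out.left[j + 1], Out.right[j + 1]);
  return Out;
}
```

For c[0] = a[0] + b[0]; c[1] = b[1] + a[1]; both sides consist only of loads, so step 3 keeps the source order and step 4 swaps the second lane, leaving a[0], a[1] on the left and b[0], b[1] on the right.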

Hi Karthik,

Thanks for updating the patch according to my numerous remarks :) To be honest, I think that the current patch looks good, but I want to make sure we've considered all the possible cases before it gets into the trunk. That'll guarantee that we do what we want to do in all the cases and the produced code is optimal.

Now back to the comments.

> If the original operands formed a broadcast, do we want to touch them?
>
> No. As you mentioned, if we detect a broadcast we do not reorder them.
> We can return as soon as we detect a broadcast.

> What if left operands are broadcast, but if we swap one of them with a right one, we'll get consecutive access in the right operands?
>
> As you mentioned, we don't want to disturb a broadcast.
> Hence we do not do anything here.

Great that we've agreed on this, it'll help to hoist the broadcast check and forget about it:)

> What do we prefer: get AllSameOpcode operands, or operands, some of which are consecutive?
>
> If we have AllSameOpcode operands, reordering them so that consecutive loads
> are grouped together will not change the AllSameOpcode property.
> (This case, i.e. AllSameOpcode and the code hits our swap logic, is only possible
> when we have all loads on the left and right side of the binary operation.)

> What if after swapping to create consecutive operands we lose AllSameOpcode property?
>
> This is not possible as far as I can tell.
> If we have AllSameOpcode and we enter the condition
> to swap to create consecutive operands, it means that we have all loads in the
> left and right lanes. Hence swapping will retain the AllSameOpcode property,
> as we will still have loads on either side after the swap.

> What if operands are AllSameOpcode, but not consecutive?
>
> Our logic should not alter anything in this scenario.

That's not exactly true. Consider the following example:

b[0]     b[0]
b[1]     a[0]+c[0]
b[2]     a[1]*c[1]
b[3]     d[5]/e[6]

Here we have AllSameOpcodeLeft=false and AllSameOpcodeRight=true. However, if we swap b[1] and a[0]+c[0] (since Right[0] and Left[1] are consecutive), we'll lose a good candidate for vectorization.

This example not only shows, how we can lose AllSameOpcode property, but also how we can disturb initially good order by swapping operands.

And to reiterate: of course this is a constructed artificial case, and we most likely won't see anything like this in real tests. But such cases still should be considered since they can reveal flaws in our logic. I.e. it's fine if we can modify our logic to do the best in every case. But it's also fine if we decide to ignore such (hopefully rare) cases to keep the logic simple, but in this case I would prefer a comment explaining that it was an intentional, not just a random decision.

Thanks,
Michael

lib/Transforms/Vectorize/SLPVectorizer.cpp
1872–1873	Maybe it's better to rename I0 to ILeft, and I1 to IRight, or something like this? (the same for V0, P0 and others). What do you think?

Hi Michael,
Thanks for the input. I really appreciate your help in the review process. I will rename the variables and update the patch as per review comments shortly.
Just one clarification here.
In your example -

b[0]     b[0]
b[1]     a[0]+c[0]
b[2]     a[1]*c[1]
b[3]     d[5]/e[6]

Post reordering i think we will still have good load order but in the right lane instead of left lane. If I'm not wrong we would get something like-

    b[0]     b[0]
a[0]+c[0]    b[1]
a[1]*c[1]    b[2]    
d[5]/e[6]    b[3]

retaining AllSameOpcode on right instead of left now.

Thanks and Regards
Karthik Bhat

Hi Karthik,

Yes, you're right about my example. But we can modify it in the following way:

a[0]+c[0]    b[0]
a[3]/d[3]    b[1]
b[2]         b[2]
a[1]*c[1]    b[3]

If I'm not mistaken, we'll transform it to

a[0]+c[0]    b[0]
a[3]/d[3]    b[1]
b[2]         b[2]
b[3]         a[1]*c[1]

Thanks,
Michael

Hi Michael,
I agree with you on the above example. Added a FIXME comment as per your suggestion.
Updated variable names as per review comments.
Thanks and Regards
Karthik Bhat

Thanks, Karthik,

The patch looks good to me.

Michael

Thanks Michael. Committed as r226547.

This revision is now accepted and ready to land. Jan 19 2015, 10:16 PM

karthikthecool closed this revision. Jan 20 2015, 11:59 PM

Revision Contents

Path	Size
lib/Transforms/Vectorize/SLPVectorizer.cpp	313 lines
test/Transforms/SLPVectorizer/X86/addsub.ll	133 lines
test/Transforms/SLPVectorizer/X86/operandorder.ll	110 lines

Diff 18412

lib/Transforms/Vectorize/SLPVectorizer.cpp

[262 lines not shown]
  for (unsigned i = 1, e = VL.size(); i < e; ++i) {

    if (!CI || CI->getZExtValue() != i || E->getOperand(0) != Vec)
      return false;
  }

  return true;
}

static void reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,
SmallVectorImpl<Value *> &Left,
SmallVectorImpl<Value *> &Right) {

SmallVector<Value *, 16> OrigLeft, OrigRight;

bool AllSameOpcodeLeft = true;
bool AllSameOpcodeRight = true;
for (unsigned i = 0, e = VL.size(); i != e; ++i) {
Instruction *I = cast<Instruction>(VL[i]);
Value *V0 = I->getOperand(0);
Value *V1 = I->getOperand(1);

OrigLeft.push_back(V0);
OrigRight.push_back(V1);

Instruction *I0 = dyn_cast<Instruction>(V0);
Instruction *I1 = dyn_cast<Instruction>(V1);

// Check whether all operands on one side have the same opcode. In this case
// we want to preserve the original order and not make things worse by
// reordering.
AllSameOpcodeLeft = I0;
AllSameOpcodeRight = I1;

if (i && AllSameOpcodeLeft) {
if(Instruction *P0 = dyn_cast<Instruction>(OrigLeft[i-1])) {
if(P0->getOpcode() != I0->getOpcode())
AllSameOpcodeLeft = false;
} else
AllSameOpcodeLeft = false;
}
if (i && AllSameOpcodeRight) {
if(Instruction *P1 = dyn_cast<Instruction>(OrigRight[i-1])) {
if(P1->getOpcode() != I1->getOpcode())
AllSameOpcodeRight = false;
} else
AllSameOpcodeRight = false;
}

// Sort two opcodes. In the code below we try to preserve the ability to use
// broadcast of values instead of individual inserts.
// vl1 = load
// vl2 = phi
// vr1 = load
// vr2 = vr2
// = vl1 x vr1
// = vl2 x vr2
// If we just sorted according to opcode we would leave the first line in
// tact but we would swap vl2 with vr2 because opcode(phi) > opcode(load).
// = vl1 x vr1
// = vr2 x vl2
// Because vr2 and vr1 are from the same load we loose the opportunity of a
// broadcast for the packed right side in the backend: we have [vr1, vl2]
// instead of [vr1, vr2=vr1].
if (I0 && I1) {
if(!i && I0->getOpcode() > I1->getOpcode()) {
Left.push_back(I1);
Right.push_back(I0);
} else if (i && I0->getOpcode() > I1->getOpcode() && Right[i-1] != I1) {
// Try not to destroy a broad cast for no apparent benefit.
Left.push_back(I1);
Right.push_back(I0);
} else if (i && I0->getOpcode() == I1->getOpcode() && Right[i-1] == I0) {
// Try preserve broadcasts.
Left.push_back(I1);
Right.push_back(I0);
} else if (i && I0->getOpcode() == I1->getOpcode() && Left[i-1] == I1) {
// Try preserve broadcasts.
Left.push_back(I1);
Right.push_back(I0);
} else {
Left.push_back(I0);
Right.push_back(I1);
}
continue;
}
// One opcode, put the instruction on the right.
if (I0) {
Left.push_back(V1);
Right.push_back(I0);
continue;
}
Left.push_back(V0);
Right.push_back(V1);
}

bool LeftBroadcast = isSplat(Left);
bool RightBroadcast = isSplat(Right);

// Don't reorder if the operands where good to begin with.
if (!(LeftBroadcast || RightBroadcast) &&
(AllSameOpcodeRight || AllSameOpcodeLeft)) {
Left = OrigLeft;
Right = OrigRight;
}
}
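
The broadcast-preserving rule described in the comment inside this function can be tried out on a small model. The following is only a sketch under stated assumptions, not the patch's code: `Inst`, `Lane`, `sortByOpcode`, and the opcode numbering are hypothetical stand-ins, the local `isSplat` is a toy version of the helper used above, and only the case where both operands are instructions is modelled.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for llvm::Instruction: only the opcode number matters here.
struct Inst {
  unsigned Opcode;
};

// Hypothetical numbering; the only property used is PhiOp > LoadOp.
enum : unsigned { LoadOp = 1, PhiOp = 2 };

// One scalar operation of the bundle: Op0 x Op1.
struct Lane {
  const Inst *Op0;
  const Inst *Op1;
};

// Mirrors the "both operands are instructions" branch of the function above.
static void sortByOpcode(const std::vector<Lane> &Lanes,
                         std::vector<const Inst *> &Left,
                         std::vector<const Inst *> &Right) {
  for (std::size_t I = 0; I != Lanes.size(); ++I) {
    const Inst *I0 = Lanes[I].Op0;
    const Inst *I1 = Lanes[I].Op1;
    bool Swap = false;
    if (I0->Opcode > I1->Opcode)
      // Sort by opcode unless that would break a broadcast on the right.
      Swap = (I == 0) || Right[I - 1] != I1;
    else if (I0->Opcode == I1->Opcode)
      // Same opcode: swap only if that extends an existing broadcast.
      Swap = I != 0 && (Right[I - 1] == I0 || Left[I - 1] == I1);
    Left.push_back(Swap ? I1 : I0);
    Right.push_back(Swap ? I0 : I1);
  }
}

// Toy splat check: every element is the same value.
static bool isSplat(const std::vector<const Inst *> &V) {
  for (std::size_t I = 1; I < V.size(); ++I)
    if (V[I] != V[0])
      return false;
  return true;
}
```

With the lanes `vl1 x vr1` and `vl2 x vr1` from the comment (`vl2` being the phi), the second lane is not swapped because `Right[0]` is already `vr1`, so the right-hand side stays a splat.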

/// \returns True if in-tree use also needs extract. This refers to		/// \returns True if in-tree use also needs extract. This refers to
/// possible scalar operand in vectorized instruction.		/// possible scalar operand in vectorized instruction.
static bool InTreeUserNeedToExtract(Value Scalar, Instruction UserInst,		static bool InTreeUserNeedToExtract(Value Scalar, Instruction UserInst,
TargetLibraryInfo *TLI) {		TargetLibraryInfo *TLI) {

unsigned Opcode = UserInst->getOpcode();		unsigned Opcode = UserInst->getOpcode();
switch (Opcode) {		switch (Opcode) {
case Instruction::Load: {		case Instruction::Load: {
[70 lines not shown]	void deleteTree() {
for (auto &Iter : BlocksSchedules) {		for (auto &Iter : BlocksSchedules) {
BlockScheduling *BS = Iter.second.get();		BlockScheduling *BS = Iter.second.get();
BS->clear();		BS->clear();
}		}
}		}

/// \returns true if the memory operations A and B are consecutive.		/// \returns true if the memory operations A and B are consecutive.
bool isConsecutiveAccess(Value A, Value B);		bool isConsecutiveAccess(Value A, Value B);

aschwaighofer: Why does that only apply to altShuffle operands? Shouldn't this work on any commutative binary vector operation?

karthikthecool (Author): Hi Arnold,
The reason we applied reorderAltShuffleOperands only to altShuffle operands is that code such as

  fc[0] = fa[0] + fb[0];
  fc[1] = fb[1] + fa[1]; // operands have been swapped in code.
  fc[2] = fa[2] + fb[2];
  fc[3] = fa[3] + fb[3];

already gets vectorized; the reordering happens in the reorderInputsAccordingToOpcode function. But reorderInputsAccordingToOpcode cannot be called for AltShuffleOperands, because not all operations used in AltShuffleOperands are commutative (i.e. add is commutative but sub is not). Hence we added reorderAltShuffleOperands to handle ordering for a shuffle vector with an alternate opcode.
/// \brief Perform LICM and CSE on the newly generated gather sequences.		/// \brief Perform LICM and CSE on the newly generated gather sequences.
void optimizeGatherSequence();		void optimizeGatherSequence();

mzolotukhin: I don't think `reorderInputsAccordingToOpcode` currently handles it. It can accidentally handle it in some cases, but it doesn't always do so. For example, the following code doesn't get vectorized:

  define void @foo() #0 {
    %1 = load i32* getelementptr inbounds ([1000 x i32]* @a, i32 0, i64 0), align 4
    %2 = load i32* getelementptr inbounds ([1000 x i32]* @b, i32 0, i64 0), align 4
    %3 = add nsw i32 %1, %2
    store i32 %3, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 0), align 4
    %4 = load i32* getelementptr inbounds ([1000 x i32]* @a, i32 0, i64 1), align 4
    %5 = load i32* getelementptr inbounds ([1000 x i32]* @b, i32 0, i64 1), align 4
    ; Please note that %4 and %5 are swapped in the following line:
    %6 = add nsw i32 %5, %4
    store i32 %6, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 1), align 4
    %7 = load i32* getelementptr inbounds ([1000 x i32]* @a, i32 0, i64 2), align 4
    %8 = load i32* getelementptr inbounds ([1000 x i32]* @b, i32 0, i64 2), align 4
    %9 = add nsw i32 %7, %8
    store i32 %9, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 2), align 4
    %10 = load i32* getelementptr inbounds ([1000 x i32]* @a, i32 0, i64 3), align 4
    %11 = load i32* getelementptr inbounds ([1000 x i32]* @b, i32 0, i64 3), align 4
    %12 = add nsw i32 %10, %11
    store i32 %12, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 3), align 4
    ret void
  }

It might make sense to handle such cases explicitly, like you do for altShuffles.

karthikthecool (Author): Hi Michael,
Thanks for the inputs. I think the reason the above code doesn't get vectorized is that on a 64-bit machine the GVN pass combines the two 32-bit loads into a 64-bit load, and as a result the pattern match in SLP fails. You can refer to D6654 for the same. If we run the GVN pass after SLPVectorizer, the above code gets vectorized. SLP vectorizer not being able to vectorize widened loads is a separate issue and I plan to work on it shortly. I will also try to find an example where we need to handle this case in reorderInputsAccordingToOpcode. I actually tried to add this pattern matching at the end of reorderInputsAccordingToOpcode, but it results in a few regressions in "operandorder.ll" by reordering "good" source order. I'm trying to debug this separately as well.
/// \returns true if it is benefitial to reverse the vector order.		/// \returns true if it is benefitial to reverse the vector order.
bool shouldReorder() const {		bool shouldReorder() const {
return NumLoadsWantToChangeOrder > NumLoadsWantToKeepOrder;		return NumLoadsWantToChangeOrder > NumLoadsWantToKeepOrder;
mzolotukhin: Do we really need these functions to be public?
}		}

private:		private:
struct TreeEntry;		struct TreeEntry;

/// \returns the cost of the vectorizable entry.		/// \returns the cost of the vectorizable entry.
int getEntryCost(TreeEntry *E);		int getEntryCost(TreeEntry *E);

[33 lines not shown]	private:

/// \returns a vector from a collection of scalars in \p VL.		/// \returns a vector from a collection of scalars in \p VL.
Value Gather(ArrayRef<Value > VL, VectorType *Ty);		Value Gather(ArrayRef<Value > VL, VectorType *Ty);

/// \returns whether the VectorizableTree is fully vectoriable and will		/// \returns whether the VectorizableTree is fully vectoriable and will
/// be beneficial even the tree height is tiny.		/// be beneficial even the tree height is tiny.
bool isFullyVectorizableTinyTree();		bool isFullyVectorizableTinyTree();

		/// \reorder commutative operands in alt shuffle if they result in
		/// vectorized code.
		void reorderAltShuffleOperands(ArrayRef<Value *> VL,
		SmallVectorImpl<Value *> &Left,
		SmallVectorImpl<Value *> &Right);
		/// \reorder commutative operands to get better probability of
		/// generating vectorized code.
		void reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,
		SmallVectorImpl<Value *> &Left,
		SmallVectorImpl<Value *> &Right);
struct TreeEntry {		struct TreeEntry {
TreeEntry() : Scalars(), VectorizedValue(nullptr),		TreeEntry() : Scalars(), VectorizedValue(nullptr),
NeedToGather(0) {}		NeedToGather(0) {}

/// \returns true if the scalars in VL are equal to this entry.		/// \returns true if the scalars in VL are equal to this entry.
bool isSame(ArrayRef<Value *> VL) const {		bool isSame(ArrayRef<Value *> VL) const {
assert(VL.size() == Scalars.size() && "Invalid size");		assert(VL.size() == Scalars.size() && "Invalid size");
return std::equal(VL.begin(), VL.end(), Scalars.begin());		return std::equal(VL.begin(), VL.end(), Scalars.begin());
[917 lines not shown]	case Instruction::ShuffleVector: {
if (!isAltShuffle) {		if (!isAltShuffle) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false);
DEBUG(dbgs() << "SLP: ShuffleVector are not vectorized.\n");		DEBUG(dbgs() << "SLP: ShuffleVector are not vectorized.\n");
return;		return;
}		}
newTreeEntry(VL, true);		newTreeEntry(VL, true);
DEBUG(dbgs() << "SLP: added a ShuffleVector op.\n");		DEBUG(dbgs() << "SLP: added a ShuffleVector op.\n");

		// Reorder operands if reordering would enable vectorization.
		if (isa<BinaryOperator>(VL0)) {
		ValueList Left, Right;
		reorderAltShuffleOperands(VL, Left, Right);
		buildTree_rec(Left, Depth + 1);
		buildTree_rec(Right, Depth + 1);
		return;
		}

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (unsigned j = 0; j < VL.size(); ++j)		for (unsigned j = 0; j < VL.size(); ++j)
Operands.push_back(cast<Instruction>(VL[j])->getOperand(i));		Operands.push_back(cast<Instruction>(VL[j])->getOperand(i));

buildTree_rec(Operands, Depth + 1);		buildTree_rec(Operands, Depth + 1);
}		}
[420 lines not shown]	bool BoUpSLP::isConsecutiveAccess(Value A, Value B) {

// Otherwise compute the distance with SCEV between the base pointers.		// Otherwise compute the distance with SCEV between the base pointers.
const SCEV *PtrSCEVA = SE->getSCEV(PtrA);		const SCEV *PtrSCEVA = SE->getSCEV(PtrA);
const SCEV *PtrSCEVB = SE->getSCEV(PtrB);		const SCEV *PtrSCEVB = SE->getSCEV(PtrB);
const SCEV *C = SE->getConstant(BaseDelta);		const SCEV *C = SE->getConstant(BaseDelta);
const SCEV *X = SE->getAddExpr(PtrSCEVA, C);		const SCEV *X = SE->getAddExpr(PtrSCEVA, C);
return X == PtrSCEVB;		return X == PtrSCEVB;
}		}

		// Reorder commutative operations in alternate shuffle if the resulting vectors
		// are consecutive loads. This would allow us to vectorize the tree.
		// If we have something like-
		// load a[0] - load b[0]
		// load b[1] + load a[1]
		// load a[2] - load b[2]
		// load a[3] + load b[3]
		// Reordering the second load b[1] load a[1] would allow us to vectorize this
		// code.
		void BoUpSLP::reorderAltShuffleOperands(ArrayRef<Value *> VL,
		SmallVectorImpl<Value *> &Left,
		SmallVectorImpl<Value *> &Right) {

		// Push left and right operands of binary operation into Left and Right
		for (unsigned i = 0, e = VL.size(); i < e; ++i) {
		Left.push_back(cast<Instruction>(VL[i])->getOperand(0));
		Right.push_back(cast<Instruction>(VL[i])->getOperand(1));
		}

		// Reorder if we have a commutative operation and consecutive access
		// are on either side of the alternate instructions.
		for (unsigned j = 0; j < VL.size() - 1; ++j) {
		if (LoadInst *L = dyn_cast<LoadInst>(Left[j])) {
		if (LoadInst *L1 = dyn_cast<LoadInst>(Right[j + 1])) {
		Instruction *VL1 = cast<Instruction>(VL[j]);
		Instruction *VL2 = cast<Instruction>(VL[j + 1]);
		if (isConsecutiveAccess(L, L1) && VL1->isCommutative()) {
		std::swap(Left[j], Right[j]);
		continue;
		} else if (isConsecutiveAccess(L, L1) && VL2->isCommutative()) {
		std::swap(Left[j + 1], Right[j + 1]);
		continue;
		}
		// else unchanged
		}
		}
		if (LoadInst *L = dyn_cast<LoadInst>(Right[j])) {
		if (LoadInst *L1 = dyn_cast<LoadInst>(Left[j + 1])) {
		Instruction *VL1 = cast<Instruction>(VL[j]);
		Instruction *VL2 = cast<Instruction>(VL[j + 1]);
		if (isConsecutiveAccess(L, L1) && VL1->isCommutative()) {
		std::swap(Left[j], Right[j]);
		continue;
		} else if (isConsecutiveAccess(L, L1) && VL2->isCommutative()) {
		std::swap(Left[j + 1], Right[j + 1]);
		continue;
		}
		// else unchanged
		}
		}
		}
		}
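
To see what this loop does to the example from the summary, here is a minimal, self-contained sketch. It is a toy model rather than the patch's code: `Load`, `Lane`, `isConsecutive`, `isCommutative`, `reorderAlt`, and `columnsAreConsecutive` are hypothetical stand-ins (for `LoadInst`, the scalar binary operators, `isConsecutiveAccess`, and `Instruction::isCommutative`), and a load is reduced to an array name plus an element index.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Toy stand-in for a scalar load: which array it reads and at which index.
struct Load {
  char Array;
  int Index;
};

// Toy stand-in for one scalar operation of the bundle: Left Op Right.
struct Lane {
  char Op; // '+' is treated as commutative, '-' as not commutative
  Load Left, Right;
};

// Hypothetical analogue of isConsecutiveAccess: B directly follows A.
static bool isConsecutive(const Load &A, const Load &B) {
  return A.Array == B.Array && B.Index == A.Index + 1;
}

static bool isCommutative(const Lane &L) { return L.Op == '+'; }

// Same decisions as the loop in reorderAltShuffleOperands, on the toy model.
static void reorderAlt(std::vector<Lane> &Lanes) {
  for (std::size_t J = 0; J + 1 < Lanes.size(); ++J) {
    Lane &Cur = Lanes[J];
    Lane &Next = Lanes[J + 1];
    // A consecutive pair sits on opposite sides of two adjacent lanes.
    bool Crossed = isConsecutive(Cur.Left, Next.Right) ||
                   isConsecutive(Cur.Right, Next.Left);
    if (!Crossed)
      continue;
    if (isCommutative(Cur))
      std::swap(Cur.Left, Cur.Right);
    else if (isCommutative(Next))
      std::swap(Next.Left, Next.Right);
    // Otherwise neither lane may be swapped; the order stays unchanged.
  }
}

// True if both operand columns read consecutive elements of one array.
static bool columnsAreConsecutive(const std::vector<Lane> &Lanes) {
  for (std::size_t J = 0; J + 1 < Lanes.size(); ++J)
    if (!isConsecutive(Lanes[J].Left, Lanes[J + 1].Left) ||
        !isConsecutive(Lanes[J].Right, Lanes[J + 1].Right))
      return false;
  return true;
}
```

For the lanes `fb[0]+fa[0]`, `fa[1]-fb[1]`, `fa[2]+fb[2]`, `fa[3]-fb[3]`, only lane 0 is swapped, after which both operand columns are consecutive. If the last lane is `fb[3]-fa[3]` instead, the non-commutative subtraction cannot be swapped and the columns stay non-consecutive, which is the situation the `no_vec_shuff_reorder` test is meant to cover.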

		void BoUpSLP::reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,
		SmallVectorImpl<Value *> &Left,
		SmallVectorImpl<Value *> &Right) {

		SmallVector<Value *, 16> OrigLeft, OrigRight;

		bool AllSameOpcodeLeft = true;
		bool AllSameOpcodeRight = true;
		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
		Instruction *I = cast<Instruction>(VL[i]);
		Value *VLeft = I->getOperand(0);
		Value *VRight = I->getOperand(1);

		OrigLeft.push_back(VLeft);
		OrigRight.push_back(VRight);

		Instruction *ILeft = dyn_cast<Instruction>(VLeft);
		Instruction *IRight = dyn_cast<Instruction>(VRight);
mzolotukhin: Maybe it's better to rename I0 to ILeft, and I1 to IRight, or something like this? (The same for V0, P0 and others.) What do you think?

		// Check whether all operands on one side have the same opcode. In this case
		// we want to preserve the original order and not make things worse by
		// reordering.
		if (i && AllSameOpcodeLeft && ILeft) {
		if (Instruction *PLeft = dyn_cast<Instruction>(OrigLeft[i - 1])) {
		if (PLeft->getOpcode() != ILeft->getOpcode())
		AllSameOpcodeLeft = false;
		} else
		AllSameOpcodeLeft = false;
		}
		if (i && AllSameOpcodeRight && IRight) {
		if (Instruction *PRight = dyn_cast<Instruction>(OrigRight[i - 1])) {
		if (PRight->getOpcode() != IRight->getOpcode())
		AllSameOpcodeRight = false;
		} else
		AllSameOpcodeRight = false;
		}

		// Sort two opcodes. In the code below we try to preserve the ability to use
		// broadcast of values instead of individual inserts.
		// vl1 = load
		// vl2 = phi
		// vr1 = load
		// vr2 = vr2
		// = vl1 x vr1
		// = vl2 x vr2
		// If we just sorted according to opcode we would leave the first line in
		// tact but we would swap vl2 with vr2 because opcode(phi) > opcode(load).
		// = vl1 x vr1
		// = vr2 x vl2
		// Because vr2 and vr1 are from the same load we loose the opportunity of a
		// broadcast for the packed right side in the backend: we have [vr1, vl2]
		// instead of [vr1, vr2=vr1].
		if (ILeft && IRight) {
		if (!i && ILeft->getOpcode() > IRight->getOpcode()) {
		Left.push_back(IRight);
		Right.push_back(ILeft);
		} else if (i && ILeft->getOpcode() > IRight->getOpcode() &&
		Right[i - 1] != IRight) {
		// Try not to destroy a broad cast for no apparent benefit.
		Left.push_back(IRight);
		Right.push_back(ILeft);
		} else if (i && ILeft->getOpcode() == IRight->getOpcode() &&
		Right[i - 1] == ILeft) {
		// Try preserve broadcasts.
		Left.push_back(IRight);
		Right.push_back(ILeft);
		} else if (i && ILeft->getOpcode() == IRight->getOpcode() &&
		Left[i - 1] == IRight) {
		// Try preserve broadcasts.
		Left.push_back(IRight);
		Right.push_back(ILeft);
		} else {
		Left.push_back(ILeft);
		Right.push_back(IRight);
		}
		continue;
		}
		// One opcode, put the instruction on the right.
		if (ILeft) {
		Left.push_back(VRight);
		Right.push_back(ILeft);
		continue;
		}
		Left.push_back(VLeft);
		Right.push_back(VRight);
mzolotukhin: Something wrong with formatting here.

mzolotukhin: You might want to return from here. Otherwise, the broadcast/allSameOpcode could be spoiled by the code that follows.

karthikthecool (Author): We cannot return from here, because even in the allSameOpcode case we might still want to reorder. For instance, in the example you mentioned in previous comments,

  c[0] = a[0] + b[0];
  c[1] = b[1] + a[1]; // a[1] and b[1] are swapped

all the left and right operands have the same opcode, i.e. load. So AllSameOpcodeRight and AllSameOpcodeLeft would be true, but we still want to reorder those loads if they were commutative across instructions, as in the example. Yes, if this was a broadcast we might not want to reorder and should keep the original order, for which we have the check in the code below:

  !(Left[j] == Left[j + 1] || Right[j] == Right[j + 1])

This implies that if the instruction we are trying to reorder is part of a broadcast, we do not reorder.
		}

		bool LeftBroadcast = isSplat(Left);
		bool RightBroadcast = isSplat(Right);

		// If operands end up being broadcast return this operand order.
		if (LeftBroadcast || RightBroadcast)
		return;

		// Don't reorder if the operands where good to begin.
		if (AllSameOpcodeRight || AllSameOpcodeLeft) {
		Left = OrigLeft;
mzolotukhin: A missing sentence end after 'tree'?
		Right = OrigRight;
		}

		// Finally check if we can get longer vectorizable chain by reordering
		// without breaking the good operand order detected above.
		// E.g. If we have something like-
		// load a[0] load b[0]
		// load b[1] load a[1]
mzolotukhin: This check looks strange. Why do we bail out if e.g. left-operands are the same? If all left-elements are the same, we should have `AllSameOpcode`; otherwise I see no special value in keeping them together. I guess that here you wanted to discard some cases where we don't want to change anything, so as not to spoil an already good disposition. If that's the case, I think we need a better check to detect such cases (e.g. `Left[i]` and `Left[i+1]` are consecutive, or something like this).

karthikthecool (Author): Hi Michael,
As mentioned in previous comments, AllSameOpcode just checks whether all the opcodes of the instructions are the same, not necessarily the instructions themselves. To check whether this is a broadcast we need to compare the actual instructions. Hence this check was added to make sure that we do not reorder anything that is part of a broadcast. Now that I think about it, we can check LeftBroadcast and RightBroadcast before entering the loop and avoid this check inside the loop.

mzolotukhin: Yep, I'd prefer checking for broadcast before the loop. Actually, checking for AllSameOpcode before the reordering loop now seems useless, since it's invalidated anyway when we swap operands. So, what do you think we should do here? Check for broadcasts, reorder commutative operands, check for AllSameOpcode, then revert to the original version if all of these attempts failed?
		// load a[2] load b[2]
		// load a[3] load b[3]
		// Reordering the second load b[1] load a[1] would allow us to vectorize
mzolotukhin: It is possible that `Left[i]` and `Right[i+1]` are not consecutive, yet the swap is beneficial. See the following example:

  a[0] + b[0]
  b[1] + c[1]
  c[2] + d[2]
  d[3] + e[3]

In this scenario, the current code won't swap anything, but it could be transformed to

  a[0] + b[0]
  c[1] + b[1]
  c[2] + d[2]
  e[3] + d[3]

if we check `Right[i]` vs `Left[i+1]` too. It's an artificial example though, and maybe it's not worth the code to handle it (but I'm not sure the original code we are trying to handle isn't artificial as well). It would be great, by the way, if you shared the motivation for these changes: have you seen something like this in real-world tests?

karthikthecool (Author): I initially wanted to handle this only for shuffle vectors, as I had found some hand-written code in our codebase with that pattern, but extended it generically to any commutative operands as per review comments from Arnold. As you mentioned, I'm not sure it is worth handling all combinations, but I will add this check anyway as it doesn't seem to add much overhead.
		// this code and we still retain AllSameOpcode property.
		// FIXME: This load reordering might break AllSameOpcode in some rare cases
		// such as-
		// add a[0],c[0] load b[0]
		// add a[1],c[2] load b[1]
		// b[2] load b[2]
		// add a[3],c[3] load b[3]
		for (unsigned j = 0; j < VL.size() - 1; ++j) {
mzolotukhin: To minimize change in the current behavior, we could place this check before the last loop. In this case we won't try to reorder the operands if one part is splat or 'allSameOpcode'. Also, we'll get rid of the needsReordering flag.
		if (LoadInst *L = dyn_cast<LoadInst>(Left[j])) {
		if (LoadInst *L1 = dyn_cast<LoadInst>(Right[j + 1])) {
		if (isConsecutiveAccess(L, L1)) {
		std::swap(Left[j + 1], Right[j + 1]);
		continue;
		}
		}
		}
		if (LoadInst *L = dyn_cast<LoadInst>(Right[j])) {
		if (LoadInst *L1 = dyn_cast<LoadInst>(Left[j + 1])) {
		if (isConsecutiveAccess(L, L1)) {
		std::swap(Left[j + 1], Right[j + 1]);
		continue;
		}
		}
		}
		// else unchanged
		}
		}
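
The final load-reordering loop of this function can be checked against the `a`/`b`/`c`/`d`/`e` example raised in the review. What follows is a hedged sketch on a toy model, not the patch's code: `Load`, `isConsecutive`, and `reorderLoads` are hypothetical names, every lane is assumed to be a commutative operation on two loads, and `Left` and `Right` are assumed to have the same length.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Toy stand-in for a scalar load: which array it reads and at which index.
struct Load {
  char Array;
  int Index;
};

// Hypothetical analogue of isConsecutiveAccess: B directly follows A.
static bool isConsecutive(const Load &A, const Load &B) {
  return A.Array == B.Array && B.Index == A.Index + 1;
}

// Same decisions as the final loop above: when a consecutive pair sits on
// opposite sides of lanes J and J + 1, swap the operands of lane J + 1.
static void reorderLoads(std::vector<Load> &Left, std::vector<Load> &Right) {
  for (std::size_t J = 0; J + 1 < Left.size(); ++J) {
    if (isConsecutive(Left[J], Right[J + 1]) ||
        isConsecutive(Right[J], Left[J + 1]))
      std::swap(Left[J + 1], Right[J + 1]);
  }
}
```

Starting from `a[0]+b[0]`, `b[1]+c[1]`, `c[2]+d[2]`, `d[3]+e[3]`, the sketch swaps lanes 1 and 3 and yields `a[0]+b[0]`, `c[1]+b[1]`, `c[2]+d[2]`, `e[3]+d[3]`. The second condition, `Right[J]` against `Left[J + 1]`, is the one that fires in both cases.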

aschwaighofer: Can you add some documentation to this function? Something like:
  // Reorder operands of commutative operations if the resulting vectors are consecutive loads.
void BoUpSLP::setInsertPointAfterBundle(ArrayRef<Value *> VL) {		void BoUpSLP::setInsertPointAfterBundle(ArrayRef<Value *> VL) {
Instruction *VL0 = cast<Instruction>(VL[0]);		Instruction *VL0 = cast<Instruction>(VL[0]);
BasicBlock::iterator NextInst = VL0;		BasicBlock::iterator NextInst = VL0;
++NextInst;		++NextInst;
Builder.SetInsertPoint(VL0->getParent(), NextInst);		Builder.SetInsertPoint(VL0->getParent(), NextInst);
Builder.SetCurrentDebugLocation(VL0->getDebugLoc());		Builder.SetCurrentDebugLocation(VL0->getDebugLoc());
}		}

[380 lines not shown]	case Instruction::Call: {
ExternalUses.push_back(ExternalUser(ScalarArg, cast<User>(V), 0));		ExternalUses.push_back(ExternalUser(ScalarArg, cast<User>(V), 0));

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
}		}
case Instruction::ShuffleVector: {		case Instruction::ShuffleVector: {
ValueList LHSVL, RHSVL;		ValueList LHSVL, RHSVL;
for (int i = 0, e = E->Scalars.size(); i < e; ++i) {		assert(isa<BinaryOperator>(VL0) && "Invalid Shuffle Vector Operand");
LHSVL.push_back(cast<Instruction>(E->Scalars[i])->getOperand(0));		reorderAltShuffleOperands(E->Scalars, LHSVL, RHSVL);
RHSVL.push_back(cast<Instruction>(E->Scalars[i])->getOperand(1));
}
setInsertPointAfterBundle(E->Scalars);		setInsertPointAfterBundle(E->Scalars);

Value *LHS = vectorizeTree(LHSVL);		Value *LHS = vectorizeTree(LHSVL);
Value *RHS = vectorizeTree(RHSVL);		Value *RHS = vectorizeTree(RHSVL);
mzolotukhin: My understanding is that we only vectorize ShuffleVectors if it's AltInst, which means VL0 should always be a BinaryOperator. Am I wrong here? If I'm not, please add an assert and remove the if-else here.

if (Value *V = alreadyVectorized(E->Scalars))		if (Value *V = alreadyVectorized(E->Scalars))
return V;		return V;

// Create a vector of LHS op1 RHS		// Create a vector of LHS op1 RHS
BinaryOperator *BinOp0 = cast<BinaryOperator>(VL0);		BinaryOperator *BinOp0 = cast<BinaryOperator>(VL0);
Value *V0 = Builder.CreateBinOp(BinOp0->getOpcode(), LHS, RHS);		Value *V0 = Builder.CreateBinOp(BinOp0->getOpcode(), LHS, RHS);

[1,570 lines not shown]

test/Transforms/SLPVectorizer/X86/addsub.ll

; RUN: opt < %s -basicaa -slp-vectorizer -S | FileCheck %s		; RUN: opt < %s -basicaa -slp-vectorizer -S | FileCheck %s
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"		target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"		target triple = "x86_64-unknown-linux-gnu"

@b = common global [4 x i32] zeroinitializer, align 16		@b = common global [4 x i32] zeroinitializer, align 16
@c = common global [4 x i32] zeroinitializer, align 16		@c = common global [4 x i32] zeroinitializer, align 16
@d = common global [4 x i32] zeroinitializer, align 16		@d = common global [4 x i32] zeroinitializer, align 16
@e = common global [4 x i32] zeroinitializer, align 16		@e = common global [4 x i32] zeroinitializer, align 16
@a = common global [4 x i32] zeroinitializer, align 16		@a = common global [4 x i32] zeroinitializer, align 16
@fb = common global [4 x float] zeroinitializer, align 16		@fb = common global [4 x float] zeroinitializer, align 16
@fc = common global [4 x float] zeroinitializer, align 16		@fc = common global [4 x float] zeroinitializer, align 16
@fa = common global [4 x float] zeroinitializer, align 16		@fa = common global [4 x float] zeroinitializer, align 16
		@fd = common global [4 x float] zeroinitializer, align 16

; CHECK-LABEL: @addsub		; CHECK-LABEL: @addsub
; CHECK: %5 = add nsw <4 x i32> %3, %4		; CHECK: %5 = add nsw <4 x i32> %3, %4
; CHECK: %6 = add nsw <4 x i32> %2, %5		; CHECK: %6 = add nsw <4 x i32> %2, %5
; CHECK: %7 = sub nsw <4 x i32> %2, %5		; CHECK: %7 = sub nsw <4 x i32> %2, %5
; CHECK: %8 = shufflevector <4 x i32> %6, <4 x i32> %7, <4 x i32> <i32 0, i32 5, i32 2, i32 7>		; CHECK: %8 = shufflevector <4 x i32> %6, <4 x i32> %7, <4 x i32> <i32 0, i32 5, i32 2, i32 7>

; Function Attrs: nounwind uwtable		; Function Attrs: nounwind uwtable
[151 lines not shown]	entry:
store float %add2, float* getelementptr inbounds ([4 x float]* @fa, i32 0, i64 2), align 4		store float %add2, float* getelementptr inbounds ([4 x float]* @fa, i32 0, i64 2), align 4
%6 = load float* getelementptr inbounds ([4 x float]* @fb, i32 0, i64 3), align 4		%6 = load float* getelementptr inbounds ([4 x float]* @fb, i32 0, i64 3), align 4
%7 = load float* getelementptr inbounds ([4 x float]* @fc, i32 0, i64 3), align 4		%7 = load float* getelementptr inbounds ([4 x float]* @fc, i32 0, i64 3), align 4
%sub = fsub float %6, %7		%sub = fsub float %6, %7
store float %sub, float* getelementptr inbounds ([4 x float]* @fa, i32 0, i64 3), align 4		store float %sub, float* getelementptr inbounds ([4 x float]* @fa, i32 0, i64 3), align 4
ret void		ret void
}		}

		; Check vectorization of following code for float data type-
		; fc[0] = fb[0]+fa[0]; //swapped fb and fa
		; fc[1] = fa[1]-fb[1];
		; fc[2] = fa[2]+fb[2];
		; fc[3] = fa[3]-fb[3];

		; CHECK-LABEL: @reorder_alt
		; CHECK: %3 = fadd <4 x float> %1, %2
		; CHECK: %4 = fsub <4 x float> %1, %2
		; CHECK: %5 = shufflevector <4 x float> %3, <4 x float> %4, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
		define void @reorder_alt() #0 {
		%1 = load float* getelementptr inbounds ([4 x float]* @fb, i32 0, i64 0), align 4
		%2 = load float* getelementptr inbounds ([4 x float]* @fa, i32 0, i64 0), align 4
		%3 = fadd float %1, %2
		store float %3, float* getelementptr inbounds ([4 x float]* @fc, i32 0, i64 0), align 4
		%4 = load float* getelementptr inbounds ([4 x float]* @fa, i32 0, i64 1), align 4
		%5 = load float* getelementptr inbounds ([4 x float]* @fb, i32 0, i64 1), align 4
		%6 = fsub float %4, %5
		store float %6, float* getelementptr inbounds ([4 x float]* @fc, i32 0, i64 1), align 4
		%7 = load float* getelementptr inbounds ([4 x float]* @fa, i32 0, i64 2), align 4
		%8 = load float* getelementptr inbounds ([4 x float]* @fb, i32 0, i64 2), align 4
		%9 = fadd float %7, %8
		store float %9, float* getelementptr inbounds ([4 x float]* @fc, i32 0, i64 2), align 4
		%10 = load float* getelementptr inbounds ([4 x float]* @fa, i32 0, i64 3), align 4
		%11 = load float* getelementptr inbounds ([4 x float]* @fb, i32 0, i64 3), align 4
		%12 = fsub float %10, %11
		store float %12, float* getelementptr inbounds ([4 x float]* @fc, i32 0, i64 3), align 4
		ret void
		}

		; Check vectorization of following code for float data type-
		; fc[0] = fa[0]+(fb[0]-fd[0]);
		; fc[1] = fa[1]-(fb[1]+fd[1]);
		; fc[2] = fa[2]+(fb[2]-fd[2]);
		; fc[3] = fa[3]-(fd[3]+fb[3]); //swapped fd and fb

		; CHECK-LABEL: @reorder_alt_subTree
		; CHECK: %4 = fsub <4 x float> %3, %2
		; CHECK: %5 = fadd <4 x float> %3, %2
		; CHECK: %6 = shufflevector <4 x float> %4, <4 x float> %5, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
		; CHECK: %7 = fadd <4 x float> %1, %6
		; CHECK: %8 = fsub <4 x float> %1, %6
		; CHECK: %9 = shufflevector <4 x float> %7, <4 x float> %8, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
		define void @reorder_alt_subTree() #0 {
		%1 = load float* getelementptr inbounds ([4 x float]* @fa, i32 0, i64 0), align 4
		%2 = load float* getelementptr inbounds ([4 x float]* @fb, i32 0, i64 0), align 4
		%3 = load float* getelementptr inbounds ([4 x float]* @fd, i32 0, i64 0), align 4
		%4 = fsub float %2, %3
		%5 = fadd float %1, %4
		store float %5, float* getelementptr inbounds ([4 x float]* @fc, i32 0, i64 0), align 4
		%6 = load float* getelementptr inbounds ([4 x float]* @fa, i32 0, i64 1), align 4
		%7 = load float* getelementptr inbounds ([4 x float]* @fb, i32 0, i64 1), align 4
		%8 = load float* getelementptr inbounds ([4 x float]* @fd, i32 0, i64 1), align 4
		%9 = fadd float %7, %8
		%10 = fsub float %6, %9
		store float %10, float* getelementptr inbounds ([4 x float]* @fc, i32 0, i64 1), align 4
		%11 = load float* getelementptr inbounds ([4 x float]* @fa, i32 0, i64 2), align 4
		%12 = load float* getelementptr inbounds ([4 x float]* @fb, i32 0, i64 2), align 4
		%13 = load float* getelementptr inbounds ([4 x float]* @fd, i32 0, i64 2), align 4
		%14 = fsub float %12, %13
		%15 = fadd float %11, %14
		store float %15, float* getelementptr inbounds ([4 x float]* @fc, i32 0, i64 2), align 4
		%16 = load float* getelementptr inbounds ([4 x float]* @fa, i32 0, i64 3), align 4
		%17 = load float* getelementptr inbounds ([4 x float]* @fd, i32 0, i64 3), align 4
		%18 = load float* getelementptr inbounds ([4 x float]* @fb, i32 0, i64 3), align 4
		%19 = fadd float %17, %18
		%20 = fsub float %16, %19
		store float %20, float* getelementptr inbounds ([4 x float]* @fc, i32 0, i64 3), align 4
		ret void
		}

		; Check vectorization of following code for double data type-
		; c[0] = (a[0]+b[0])-d[0];
		; c[1] = d[1]+(a[1]+b[1]); //swapped d[1] and (a[1]+b[1])
mzolotukhin: Missed `;` at the end of the line :)

		; CHECK-LABEL: @reorder_alt_rightsubTree
		; CHECK: fadd <2 x double>
		; CHECK: fsub <2 x double>
		; CHECK: shufflevector <2 x double>
		define void @reorder_alt_rightsubTree(double* nocapture %c, double* noalias nocapture readonly %a, double* noalias nocapture readonly %b, double* noalias nocapture readonly %d) {
		%1 = load double* %a
		%2 = load double* %b
		%3 = fadd double %1, %2
		%4 = load double* %d
		%5 = fsub double %3, %4
		store double %5, double* %c
		%6 = getelementptr inbounds double* %d, i64 1
		%7 = load double* %6
		%8 = getelementptr inbounds double* %a, i64 1
		%9 = load double* %8
		%10 = getelementptr inbounds double* %b, i64 1
		%11 = load double* %10
		%12 = fadd double %9, %11
		%13 = fadd double %7, %12
		%14 = getelementptr inbounds double* %c, i64 1
		store double %13, double* %14
		ret void
		}
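For reference, a C equivalent of this test might look like the sketch below. The function name `reorder_alt_right_subtree_c` is hypothetical and not part of the patch; the parameter names follow the IR arguments. Lane 1 puts `d[1]` on the left of the addition, so the two lanes only line up once the operands of the commutative add are swapped.

```c
/* Hypothetical C equivalent of @reorder_alt_rightsubTree.
 * The name is illustrative only; it does not appear in the patch. */
void reorder_alt_right_subtree_c(double *c, const double *a,
                                 const double *b, const double *d) {
  c[0] = (a[0] + b[0]) - d[0];
  c[1] = d[1] + (a[1] + b[1]); /* d[1] and (a[1]+b[1]) are swapped */
}
```

Swapping the operands of the lane-1 add does not change its result, which is why the reordering is expected to be safe here.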

		; Don't vectorize the following code for float data type, as sub is not commutative-
		; fc[0] = fb[0]+fa[0];
		; fc[1] = fa[1]-fb[1];
		; fc[2] = fa[2]+fb[2];
		mzolotukhin: What about C-code for this test?
		; fc[3] = fb[3]-fa[3];
		; In the above code we can swap the operands of the 1st and 3rd operations as fadd is commutative
		; but not those of the 2nd or 4th as fsub is not commutative.

		; CHECK-LABEL: @no_vec_shuff_reorder
		; CHECK-NOT: fadd <4 x float>
		; CHECK-NOT: fsub <4 x float>
		; CHECK-NOT: shufflevector
		define void @no_vec_shuff_reorder() #0 {
		%1 = load float* getelementptr inbounds ([4 x float]* @fb, i32 0, i64 0), align 4
		%2 = load float* getelementptr inbounds ([4 x float]* @fa, i32 0, i64 0), align 4
		%3 = fadd float %1, %2
		store float %3, float* getelementptr inbounds ([4 x float]* @fc, i32 0, i64 0), align 4
		%4 = load float* getelementptr inbounds ([4 x float]* @fa, i32 0, i64 1), align 4
		%5 = load float* getelementptr inbounds ([4 x float]* @fb, i32 0, i64 1), align 4
		%6 = fsub float %4, %5
		store float %6, float* getelementptr inbounds ([4 x float]* @fc, i32 0, i64 1), align 4
		%7 = load float* getelementptr inbounds ([4 x float]* @fa, i32 0, i64 2), align 4
		%8 = load float* getelementptr inbounds ([4 x float]* @fb, i32 0, i64 2), align 4
		%9 = fadd float %7, %8
		store float %9, float* getelementptr inbounds ([4 x float]* @fc, i32 0, i64 2), align 4
		%10 = load float* getelementptr inbounds ([4 x float]* @fb, i32 0, i64 3), align 4
		%11 = load float* getelementptr inbounds ([4 x float]* @fa, i32 0, i64 3), align 4
		%12 = fsub float %10, %11
		store float %12, float* getelementptr inbounds ([4 x float]* @fc, i32 0, i64 3), align 4
		ret void
		}
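A C version of this negative test, in the spirit of the reviewer's request, could look like the following sketch. The function name `no_vec_shuff_reorder_c` is hypothetical; the globals mirror `@fa`, `@fb` and `@fc` from the IR. The last statement subtracts in the opposite order from the second one, and since subtraction is not commutative its operands cannot be swapped to make the lanes match.

```c
/* Globals mirroring @fa, @fb and @fc in the IR test. */
float fa[4], fb[4], fc[4];

/* Hypothetical C equivalent of @no_vec_shuff_reorder (illustrative name). */
void no_vec_shuff_reorder_c(void) {
  fc[0] = fb[0] + fa[0]; /* add: operands may be swapped */
  fc[1] = fa[1] - fb[1]; /* sub: operand order is fixed  */
  fc[2] = fa[2] + fb[2];
  fc[3] = fb[3] - fa[3]; /* sub in the opposite order, cannot be swapped */
}
```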


attributes #0 = { nounwind }		attributes #0 = { nounwind }

test/Transforms/SLPVectorizer/X86/operandorder.ll

for.body3:
store float %mul45, float* %arrayidx31, align 4		store float %mul45, float* %arrayidx31, align 4
%11 = trunc i64 %indvars.iv.next to i32		%11 = trunc i64 %indvars.iv.next to i32
%cmp2 = icmp slt i32 %11, 31995		%cmp2 = icmp slt i32 %11, 31995
br i1 %cmp2, label %for.body3, label %for.end		br i1 %cmp2, label %for.body3, label %for.end

for.end:		for.end:
ret void		ret void
}		}

		; Check vectorization of following code for double data type-
		; c[0] = a[0]+b[0];
		; c[1] = b[1]+a[1]; // swapped b[1] and a[1]

		; CHECK-LABEL: load_reorder_double
		; CHECK: load <2 x double>*
		; CHECK: fadd <2 x double>
		define void @load_reorder_double(double* nocapture %c, double* noalias nocapture readonly %a, double* noalias nocapture readonly %b){
		%1 = load double* %a
		%2 = load double* %b
		%3 = fadd double %1, %2
		store double %3, double* %c
		%4 = getelementptr inbounds double* %b, i64 1
		%5 = load double* %4
		%6 = getelementptr inbounds double* %a, i64 1
		%7 = load double* %6
		%8 = fadd double %5, %7
		%9 = getelementptr inbounds double* %c, i64 1
		store double %8, double* %9
		ret void
		}
		mzolotukhin: C-equivalent in comments would be useful here and before other test cases.
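One possible C equivalent for @load_reorder_double is sketched below. Both function names are hypothetical. The second function writes the operands in the same order in both lanes, which is the form the vectorizer is expected to reach after reordering; the two should compute the same values because addition is commutative.

```c
/* Hypothetical C equivalent of @load_reorder_double (illustrative name). */
void load_reorder_double_c(double *c, const double *a, const double *b) {
  c[0] = a[0] + b[0];
  c[1] = b[1] + a[1]; /* swapped b[1] and a[1] */
}

/* Same computation with both lanes in a[i] + b[i] order (illustrative). */
void load_reorder_double_canonical(double *c, const double *a,
                                   const double *b) {
  c[0] = a[0] + b[0];
  c[1] = a[1] + b[1];
}
```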

		; Check vectorization of following code for float data type-
		; c[0] = a[0]+b[0];
		; c[1] = b[1]+a[1]; // swapped b[1] and a[1]
		; c[2] = a[2]+b[2];
		; c[3] = a[3]+b[3];

		; CHECK-LABEL: load_reorder_float
		; CHECK: load <4 x float>*
		; CHECK: fadd <4 x float>
		define void @load_reorder_float(float* nocapture %c, float* noalias nocapture readonly %a, float* noalias nocapture readonly %b){
		%1 = load float* %a
		%2 = load float* %b
		%3 = fadd float %1, %2
		store float %3, float* %c
		%4 = getelementptr inbounds float* %b, i64 1
		%5 = load float* %4
		%6 = getelementptr inbounds float* %a, i64 1
		%7 = load float* %6
		%8 = fadd float %5, %7
		%9 = getelementptr inbounds float* %c, i64 1
		store float %8, float* %9
		%10 = getelementptr inbounds float* %a, i64 2
		%11 = load float* %10
		%12 = getelementptr inbounds float* %b, i64 2
		%13 = load float* %12
		%14 = fadd float %11, %13
		%15 = getelementptr inbounds float* %c, i64 2
		store float %14, float* %15
		%16 = getelementptr inbounds float* %a, i64 3
		%17 = load float* %16
		%18 = getelementptr inbounds float* %b, i64 3
		%19 = load float* %18
		%20 = fadd float %17, %19
		%21 = getelementptr inbounds float* %c, i64 3
		store float %20, float* %21
		ret void
		}

		; Check we properly reorder the below code so that it gets vectorized optimally-
		; a[0] = (b[0]+c[0])+d[0];
		; a[1] = d[1]+(b[1]+c[1]);
		; a[2] = (b[2]+c[2])+d[2];
		; a[3] = (b[3]+c[3])+d[3];

		; CHECK-LABEL: opcode_reorder
		; CHECK: load <4 x float>*
		; CHECK: fadd <4 x float>
		define void @opcode_reorder(float* noalias nocapture %a, float* noalias nocapture readonly %b,
		float* noalias nocapture readonly %c,float* noalias nocapture readonly %d){
		%1 = load float* %b
		%2 = load float* %c
		%3 = fadd float %1, %2
		%4 = load float* %d
		%5 = fadd float %3, %4
		store float %5, float* %a
		%6 = getelementptr inbounds float* %d, i64 1
		%7 = load float* %6
		%8 = getelementptr inbounds float* %b, i64 1
		%9 = load float* %8
		%10 = getelementptr inbounds float* %c, i64 1
		%11 = load float* %10
		%12 = fadd float %9, %11
		%13 = fadd float %7, %12
		%14 = getelementptr inbounds float* %a, i64 1
		store float %13, float* %14
		%15 = getelementptr inbounds float* %b, i64 2
		%16 = load float* %15
		%17 = getelementptr inbounds float* %c, i64 2
		%18 = load float* %17
		%19 = fadd float %16, %18
		%20 = getelementptr inbounds float* %d, i64 2
		%21 = load float* %20
		%22 = fadd float %19, %21
		%23 = getelementptr inbounds float* %a, i64 2
		store float %22, float* %23
		%24 = getelementptr inbounds float* %b, i64 3
		%25 = load float* %24
		%26 = getelementptr inbounds float* %c, i64 3
		%27 = load float* %26
		%28 = fadd float %25, %27
		%29 = getelementptr inbounds float* %d, i64 3
		%30 = load float* %29
		%31 = fadd float %28, %30
		%32 = getelementptr inbounds float* %a, i64 3
		store float %31, float* %32
		ret void
		}
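A C sketch of @opcode_reorder follows; the function name `opcode_reorder_c` is hypothetical. Only lane 1 has the inner sum on the right-hand side of the outer add. Swapping the two operands of that outer add (not reassociating it) gives the same shape as the other three lanes.

```c
/* Hypothetical C equivalent of @opcode_reorder (illustrative name). */
void opcode_reorder_c(float *a, const float *b, const float *c,
                      const float *d) {
  a[0] = (b[0] + c[0]) + d[0];
  a[1] = d[1] + (b[1] + c[1]); /* outer add has its operands swapped */
  a[2] = (b[2] + c[2]) + d[2];
  a[3] = (b[3] + c[3]) + d[3];
}
```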

Revision Contents

Diff 18412
