This is an archive of the discontinued LLVM Phabricator instance.

Fix PR19657 : SLP vectorization doesn't combine scalar load to vector loads
ClosedPublic

Authored by karthikthecool on May 16 2014, 7:40 AM.

Details

Summary

Hi Nadav, Hal, Arnold,
This patch fixes PR19657. As Arnold pointed out, the reason we fail to vectorize these loads is that we process the smaller subtree before the larger one. This patch calculates the depth of the subtrees before calling buildTree_rec, and calls buildTree_rec for the larger subtree before calling it for the smaller one.

This seems to fix the problem, and there are no regressions.
I'm not sure, though, whether there is an easier and more efficient way to check this than actually traversing the subtrees and calculating their depths before we call buildTree_rec. It would be great if you could suggest improvements to this patch.

Thanks
Karthik Bhat

Diff Detail

Event Timeline

karthikthecool retitled this revision from to Fix PR19657 : SLP vectorization doesn't combine scalar load to vector loads.
karthikthecool updated this object.
karthikthecool edited the test plan for this revision. (Show Details)
karthikthecool added a subscriber: Unknown Object (MLST).
nadav edited edge metadata.May 16 2014, 8:01 AM

I am sorry, but this is the wrong approach. You copy a ton of code that we will need to maintain if we want to add support for new instructions.

You also increase the compile time by scanning the tree twice. In the SLP vectorizer there are a bunch of features we did not implement because we did not want to increase compile time. Remember that we run this pass on all optimization levels.

Thanks, Nadav, for the review. Yes, I agree this increases compile time, since we will be scanning the tree multiple times. That is why I wasn't sure about it and wanted to know whether there is a better way to implement this, or whether we should leave it as is.
Thanks
Karthik

Any reason we don't want to check the opt level to decide how aggressive we should be and how much compile time we should spend? Is there some other approach you'd prefer for getting optimizations like this?

-eric

karthikthecool edited edge metadata.

Hi Nadav,
I went through your suggestion, but unfortunately modifying the consecutive-memory-address checks does not result in vectorization of these load instructions. The reason is that when buildTree_rec sees the user of the load, it has not yet processed the larger subtree, so it doesn't know that this user is part of the larger subtree that will get vectorized. It therefore assumes we need to extract, and concludes that it can't schedule the extractelement.

Checking for a larger subtree and processing it first avoids the problem mentioned above, since we can then conclude that the user of the load is part of a vectorizable subtree.
I added a flag, as Eric suggested, so that this vectorization runs only when slp-vectorize-load-aggressive is specified.

In order to avoid duplication, we can roughly calculate the depth of the tree without worrying about whether the tree actually gets vectorized; that is checked by buildTree_rec anyway. TreeDepth now just traverses the tree until a leaf node or the depth limit is reached and computes the maximum depth of the tree. Please let me know your thoughts on this.
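For illustration, the pre-pass described above might look roughly like the sketch below. The Node struct is a hypothetical stand-in for walking LLVM use-def chains, and only the names TreeDepth and RecursionMaxDepth echo the discussion here; everything else is an assumption for the sake of a self-contained example:

```cpp
#include <algorithm>

// Hypothetical stand-in for an expression node; the real code walks
// LLVM use-def chains instead of an explicit binary tree.
struct Node {
  Node *Left = nullptr;
  Node *Right = nullptr;
};

// Cap the recursion, mirroring the depth limit buildTree_rec uses.
static const unsigned RecursionMaxDepth = 12;

// Walk to the leaves (or to the depth limit) and return the maximum
// depth, without checking whether the subtree is vectorizable -- that
// is still decided later by buildTree_rec.
static unsigned TreeDepth(const Node *N, unsigned Depth = 0) {
  if (!N || Depth >= RecursionMaxDepth)
    return Depth;
  return std::max(TreeDepth(N->Left, Depth + 1),
                  TreeDepth(N->Right, Depth + 1));
}
```

With this, a caller would compare TreeDepth(Left) against TreeDepth(Right) and recurse into the deeper side first, at the cost of the extra traversal Nadav objects to above.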

Thanks
Karthik Bhat

Is it really the depth of the subtrees that is the fundamental issue? Or is it really the relative depths of the common uses? If the latter, this patch wouldn't be addressing the general case.

Without looking too deeply: would it be feasible to defer the insertion of the extractions until after the whole tree has been vectorized? If that sounds potentially fruitful, I can dig a bit more in that direction.

aschwaighofer edited edge metadata.May 21 2014, 4:51 PM

It is the relative depth of common uses.

Without looking too deeply: would it be feasible to defer the insertion of the extractions until after the whole tree has been vectorized? If that sounds potentially fruitful, I can dig a bit more in that direction.

No, because we would give up vectorizing the tree at the scheduling conflict (which really isn't one, but the algorithm we use can't tell, because it relies on program order).

But I think, for this scheduling problem where the order we pick matters to the algorithm we use, we can get away with choosing the subtree traversal order based on basic-block numbering (because the earlier user has to come between the common use and the later user, so look at the later one first, which will have the greater common depth if there is one). At least according to my back-of-the-napkin reasoning ...

if (getLatestBBNumber(Left) > getLatestBBNumber(Right)) {
  buildTree_rec(Left, Depth + 1);
  buildTree_rec(Right, Depth + 1);
} else {
  buildTree_rec(Right, Depth + 1);
  buildTree_rec(Left, Depth + 1);
}
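To make that ordering rule concrete, here is a toy model of it; getLatestBBNumber and the Bundle type are hypothetical stand-ins (the real getLastIndex and scheduling machinery live in SLPVectorizer), with each bundle of scalars reduced to the basic-block-order indices of its instructions:

```cpp
#include <algorithm>
#include <vector>

// A bundle of scalars, modeled only by the positions of its
// instructions in basic-block order (hypothetical, not the real
// data structure).
using Bundle = std::vector<int>;

// The latest position in the block at which the bundle appears.
static int getLatestBBNumber(const Bundle &B) {
  return *std::max_element(B.begin(), B.end());
}

// Visit the bundle whose latest instruction comes later in program
// order first; records the visit order instead of building a tree.
static std::vector<int> visitInOrder(const Bundle &Left,
                                     const Bundle &Right) {
  std::vector<int> Order;
  if (getLatestBBNumber(Left) > getLatestBBNumber(Right)) {
    Order.push_back(getLatestBBNumber(Left));  // buildTree_rec(Left, ...)
    Order.push_back(getLatestBBNumber(Right)); // buildTree_rec(Right, ...)
  } else {
    Order.push_back(getLatestBBNumber(Right)); // buildTree_rec(Right, ...)
    Order.push_back(getLatestBBNumber(Left));  // buildTree_rec(Left, ...)
  }
  return Order;
}
```

For example, with Left ending at instruction 5 and Right at instruction 9, Right is recursed into first, so the later (typically larger) subtree is already known when the shallower user is reached.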

karthikthecool edited edge metadata.

Hi Arnold, Nadav, Raul,
Thanks for the review. Using basic-block numbering to decide which subtree to process seems to work in these cases. A larger subtree will have a larger getLastIndex, so we can use that to decide which subtree to process first. Updated the patch accordingly.
Does this approach look good?
I am still relatively new to this module so thanks for the patience and clarifications.
Regards
Karthik Bhat

nadav added a comment.May 23 2014, 9:32 AM

Hi Karthik,

Can you please measure the effects of this patch on the LLVM test suite? It would be interesting to see if other workloads are affected by this change and if they improve or regress.

Thanks,
Nadav

I agree this is an improvement, but this approach is still failing to catch cases where the common uses are on separate paths of the tree. In those cases, no matter which order we take, there will be a common use that can't be scheduled.

Here is a sample case where we still fail to schedule a common load:

void foo(double *x, double C) {
  x[0] = x[0]*C + x[0] * x[0];
  x[1] = x[1]*C + x[1] * x[1];
}

Any ideas on how to get those cases? It would seem to me we'd need either a prepass or deferral of the decisions until the whole tree has been inspected. Thoughts?

Raúl E Silvera | SWE | rsilvera@google.com | *408-789-2846*


Hi Nadav,
Please find the performance results with and without the patch. I don't see a large regression in compilation time, and the execution time of one test case improved greatly. The baseline is without the patch and current is with the patch.

Hi Raul, Arnold,
I agree that the current patch will not handle the case mentioned. I was thinking of handling unschedulable loads again after buildTree_rec completes, but as Arnold mentioned, I'm not sure this would be worth the overhead.

For now, do you think we can move ahead with this approach, since we are able to vectorize loads similar to the ones mentioned in the PR without much overhead?

Thanks for all the comments and review.

Regards
Karthik Bhat

Karthik, I do not have any objections to this patch. I think longer term we
should try to pursue the other cases but this is a good short-term step.

I'll let others provide their approval.


Karthik,

Please document the code that you are adding to the vectorization of binary operators.

+ buildTree_rec(Left, Depth + 1);
+ }
+ else {

Please format your patch properly.

Also, would it be better to swap Right and Left instead of duplicating the calls to buildTree_rec?

The performance numbers look okay. Do you know why we see a nice gain in the one workload that wins?

Thanks,
Nadav

Hi Nadav, I have a quick question. In the case of a commutative operation like the one we are handling here, is it guaranteed that the right subtree height is always >= the left subtree height?
It seems so. I tried a few examples -

void foo(double *x) {
  x[0] = x[0] * x[0] + x[0];
  x[1] = x[1] * x[1] + x[1];
  x[2] = x[2] * x[2] + x[2];
  x[3] = x[3] * x[3] + x[3];
}

and

void foo(double *x) {
  x[0] = x[0] + x[0] * x[0];
  x[1] = x[1] + x[1] * x[1];
  x[2] = x[2] + x[2] * x[2];
  x[3] = x[3] + x[3] * x[3];
}

The IR in both these cases is the same, and it seems we reorder the IR so that the instruction for which the larger subtree will be constructed is always on the right side in the case of commutative operations.
If this is guaranteed, we might be able to skip the getLastIndex check as well and directly call -

buildTree_rec(Right, Depth + 1);
buildTree_rec(Left, Depth + 1);

Thanks for the clarification.
Regards
Karthik Bhat

Hi Nadav, Arnold,
Clang-formatted the code as per review comments. I think we cannot simply swap Right and Left in buildTree_rec, because we need to process the larger subtree before the smaller one; in the case where the Left subtree is larger, we need to process it first, so we still need the check.

Does this look good to commit? I'm also looking into a generic solution for this problem, as Raul suggested, but it may need more time and benchmarking to make sure the benefit outweighs the overhead.

Thanks for your time.
Regards
Karthik Bhat

Hi Nadav,
Thanks for your time and inputs.
Added documentation for the patch as per the suggestion.

Yes, we ran the LLVM lit test cases with and without the patch.
The results were submitted previously on Phabricator; you can have a look at them at
http://reviews.llvm.org/file/data/s6jdoxcsfh2d5pog2a6x/PHID-FILE-lfpzyswqkebd2fj36nu6/performance.png
There were no considerable regressions, and one test case's performance improved.

Thanks once again for spending time on this patch.
Regards
Karthik Bhat

nadav added a comment.Jun 5 2014, 9:20 AM

LGTM. Please commit.

karthikthecool accepted this revision.Jun 5 2014, 11:30 PM
karthikthecool added a reviewer: karthikthecool.

Committed as r210310.
Thanks Nadav, Arnold, Raul and Eric for the review.

This revision is now accepted and ready to land.Jun 5 2014, 11:30 PM