This is an archive of the discontinued LLVM Phabricator instance.

[Power9] Exploit vinserth instruction
ClosedPublic

Authored by gyiu on Jun 13 2017, 12:51 PM.

Details

Summary

This patch adds code to PPC ISEL lowering to recognize half-word inserts from vector_shuffles, and use P9 shift and vector insert instructions instead of vperm.

Diff Detail

Repository
rL LLVM

Event Timeline

gyiu created this revision.Jun 13 2017, 12:51 PM
inouehrs edited edge metadata.Jun 13 2017, 11:48 PM

This patch (potentially) increase the number of vector instructions (permutation -> shift + insert). Is my understanding correct?

lei added inline comments.Jun 14 2017, 12:03 PM
lib/Target/PowerPC/PPCISelLowering.cpp
1663

nit: don't forget the "." :)

test/CodeGen/PowerPC/p9-vinsert-vextract.ll
6

Can you add a short description of what each of these functions are testing?

gyiu added a comment.Jun 15 2017, 8:41 AM

This patch (potentially) increase the number of vector instructions (permutation -> shift + insert). Is my understanding correct?

Yep. Though I think with a vperm you still need to load the mask into a vector register first, whereas with vshift + vinsert we're saving on the load.

In D34160#781261, @gyiu wrote:

This patch (potentially) increase the number of vector instructions (permutation -> shift + insert). Is my understanding correct?

Yep. Though I think with a vperm you still need to load the mask into a vector register first, whereas with vshift + vinsert we're saving on the load.

I feel we should not increase the number of vector instructions within a loop (i.e. a common case for vector code) if we can load the mask into a vector register before the loop.
In case without an additional shift, it is nice to do opt in a loop for freeing up one vector register.

jtony added inline comments.Jun 15 2017, 1:55 PM
lib/Target/PowerPC/PPCISelLowering.h
527

The community doesn't like the bool parameters here. But I am not sure whether we must remove them or it is just nice to do.

test/CodeGen/PowerPC/p9-vinsert-vextract.ll
9

I would prefer to use non-mangled function names to make it more readable. I think you can just regenerate the IR from c source file instead of cpp file.

gyiu updated this revision to Diff 102768.Jun 15 2017, 6:39 PM
gyiu marked an inline comment as done.
  • Added comments and demangled function names for LIT tests
  • Added period to comment
  • Fixed issue when one operand of the shufflevector is 'undef', in which case the PPCISDs we generate will use only the defined one.
  • Initialize 'Swap' boolean

Go ahead and submit the rename separately (if you don't have commit access send me a patch and I'll do it). I would prefer to try to minimize boolean arguments - in this case you're adding a few more including a true/false and a return value one. Feel free to refactor the code around it to require fewer booleans or have multiple functions with helper functions that end up being called.

Thanks!

lib/Target/PowerPC/PPCISelLowering.h
527

Please do.

jtony added inline comments.Jun 16 2017, 7:55 AM
lib/Target/PowerPC/PPCISelLowering.cpp
1640

Is there any reason why we use uint32_t? If not, I would use unsigned instead to make it consistent. We are using both unsigned and uint32_t in this file.

gyiu marked 2 inline comments as done.Jun 16 2017, 10:13 AM
gyiu added inline comments.Jun 19 2017, 8:05 AM
lib/Target/PowerPC/PPCISelLowering.cpp
1640

I think using uint32_t is appropriate here because I want to illustrate that I'm using 32-bits exactly. Using 'unsigned int' would be fine as well, but I don't think it's as clear that I'm looking for 32-bits exactly.

gyiu updated this revision to Diff 103079.Jun 19 2017, 11:12 AM
gyiu marked 2 inline comments as done.
  • Added -O0 to LIT tests to test corner case of undef 2nd operand of vector shuffle.
  • Refactored VINSERTH code to avoid boolean parameters and return value.
  • Merged loops for 2nd operand undefined case and both operands defined.
echristo added inline comments.Jun 19 2017, 11:19 AM
lib/Target/PowerPC/PPCISelLowering.cpp
7954

Nit: (Here and other places) Comments are complete sentences including punctuation.

nemanjai edited edge metadata.Jun 21 2017, 6:18 AM
In D34160#781261, @gyiu wrote:

This patch (potentially) increase the number of vector instructions (permutation -> shift + insert). Is my understanding correct?

Yep. Though I think with a vperm you still need to load the mask into a vector register first, whereas with vshift + vinsert we're saving on the load.

I feel we should not increase the number of vector instructions within a loop (i.e. a common case for vector code) if we can load the mask into a vector register before the loop.
In case without an additional shift, it is nice to do opt in a loop for freeing up one vector register.

Although I definitely agree that we should take steps to ensure we don't introduce further instructions in loops, I'm not sure that avoiding a 2-instruction sequence for a shuffle is necessarily the right thing to do. This statement is predicated on the fact that we can hoist the constant pool load out of a loop. If register pressure prevents this, we will have a load in the loop. Furthermore, if the loop is large enough and has other memory operations, it is conceivable that the constant pool load could be a cache miss on every iteration. And it is conceivable that such large loops will be the ones for which register pressure prevents the hoisting of the load. Furthermore, if the GPR register pressure is also high, we might not even be able to hoist the address calculations outside the loop, which would make the vperm sequence 3-4 instructions.
I think that at ISEL time, we should favour shorter instruction sequences that don't involve loads. And perhaps if we can show that multi-instruction permute sequences in loops appear enough in real code, we might want to have a loop pass that simplifies them into a load outside the loop with a vperm in the loop in general.

lib/Target/PowerPC/PPCISelLowering.h
1079

This is probably a remnant of a previous implementation. Please rewrite the comment.

nemanjai added inline comments.Jun 21 2017, 8:35 PM
lib/Target/PowerPC/PPCISelLowering.cpp
7948

So we don't want to shift if we're within the same register? Is there a specific reason for this?

7959

Isn't this already guaranteed to only have the low order 3 bits set?

7964

Why would we continue to search if we've already confirmed that:

  1. We have an element from vector A
  2. All other elements are from vector B in the correct order
gyiu updated this revision to Diff 103522.Jun 21 2017, 9:43 PM
  • Addressed comments about my comments (grammar, periods, etc).
  • Removed irrelevant comments in PPCISelLowering.h
gyiu marked 2 inline comments as done.Jun 21 2017, 9:58 PM
gyiu added inline comments.
lib/Target/PowerPC/PPCISelLowering.cpp
7948

I believe we have to add a xxlor to another VR if we want to shift the vector since we can't shift if both operands of the vector shuffle are the same vector. Adding another two cycles to VECSHL+VECINSERT seems diminish its value versus load+vperm.

7959

MaskOneElt could actually be >= 8, since the mask is in range [0, 15].

7964

Yep, you're correct. Need a break here since we can't find more than one candidate.

nemanjai added inline comments.Jun 23 2017, 6:17 AM
lib/Target/PowerPC/PPCISelLowering.cpp
7948

I don't really see why. Assume that you have something like this:

vector unsigned short test(vector unsigned short a) {
  a[5] = a[2];
}

I don't see why we can't codegen something like this for it:

vsldoi 3, 2, 2, 4
vinserth 2, 3, 4

Forgive me if I didn't work out the immediates exactly correctly, but the point is the [lack of] need for the XXLOR. Of course, this does use an extra register, but so does the alternative (vperm).

7959

Ah, right. I didn't think of that. Sorry about that.

gyiu added inline comments.Jun 23 2017, 11:49 AM
lib/Target/PowerPC/PPCISelLowering.cpp
7948

Hmmm... Yep, you're right. I guess I can simplify my code even further now. I think this also means I have to fix up the code for the original xxinsertw lowering in a separate patch.

nemanjai added inline comments.Jun 23 2017, 1:08 PM
lib/Target/PowerPC/PPCISelLowering.cpp
7948

Yes, as @echristo mentioned, you should do all the renaming of things in a separate patch that doesn't really require a review. You're just renaming stuff.

gyiu added a comment.Jun 26 2017, 8:07 AM
  • I'll open a separate item to address Nemenja's comments as I will not get a chance to do another enchancement.

I don't really see why. Assume that you have something like this:

vector unsigned short test(vector unsigned short a) {

a[5] = a[2];

}

I don't see why we can't codegen something like this for it:

vsldoi 3, 2, 2, 4
vinserth 2, 3, 4

Forgive me if I didn't work out the immediates exactly correctly, but the point is the [lack of] need for the XXLOR. Of course, this does use an extra register, but so does the alternative (vperm).

lib/Target/PowerPC/PPCISelLowering.cpp
7948

Actually, I'm not quite sure what you mean here. The original code for xxinsertw has the limitation of only being able to insert element 3 if both input vectors to the vector_shuffle are the same. I'll need to change that in a separate patch. I'm not sure where the 'renaming of things' comes into play?

nemanjai added inline comments.Jun 26 2017, 9:14 AM
lib/Target/PowerPC/PPCISelLowering.cpp
1142

By "renaming stuff", I mean things like this. Kind of orthogonal to the patch and should go in as a separate NFC change.

gyiu updated this revision to Diff 103974.Jun 26 2017, 9:15 AM
  • Added breaks to stop searching for the pattern once I've found a candidate.
gyiu marked an inline comment as done.Jun 26 2017, 9:15 AM
nemanjai added inline comments.Jun 29 2017, 1:45 AM
lib/Target/PowerPC/PPCISelLowering.cpp
7941

You should be able to get rid of this condition here.

  • Move the assignment if (V.isUndef()) V2 = V1; above here
  • Use the OriginalOrderLow if the two vectors are the same

The rest should fall out naturally and we'll do the shift for the single-input case as well. And the code will also be simpler.

gyiu added inline comments.Aug 23 2017, 10:35 AM
lib/Target/PowerPC/PPCISelLowering.cpp
7941

@nemanjai I created Issue #410 on github to address the issue when using vector shifts in the case when both inputs are the same vector. There's further investigation that's required as it's not clear which input/output registers the (vector shift + vector extract) sequence uses in this case. I would rather do this change as part of that work item instead.

gyiu updated this revision to Diff 113193.Aug 29 2017, 8:32 PM
  • Refactored NFCs to another patch to be committed.
  • Made changes to remove restriction on only recognizing shuffles of halfword element 3 (4 in LE mode) when both input vectors are the same vector. That is, we can now recognize all single element shuffles in this situation.
gyiu added a comment.Aug 29 2017, 8:38 PM

Note that I was able to re-implement Nemanja's suggestion of generalizing the case when both inputs are the same vector because the registers used in code-gen are now consistent. Not sure if it was a real problem that I saw previously, or a transient issue that was fixed with newer levels of LLVM.

gyiu updated this revision to Diff 113259.Aug 30 2017, 7:01 AM

Changed my mind, removed changes related to this comment:

"Made changes to remove restriction on only recognizing shuffles of halfword element 3 (4 in LE mode) when both input vectors are the same vector. That is, we can now recognize all single element shuffles in this situation."

Will use a different patch to remove the restriction instead, as the contents of this patch is still functionally correct.

kbarton added inline comments.Oct 23 2017, 8:03 PM
lib/Target/PowerPC/PPCISelLowering.cpp
117

Is this still necessary? I don't see any calls to it - only to the 3-parameter version of it.

gyiu marked an inline comment as done.Oct 24 2017, 8:29 AM
gyiu added inline comments.
lib/Target/PowerPC/PPCISelLowering.cpp
117

Yeah, this patch is old so the declaration is based on the version that had two parameters. I'll update it when I merge with the latest code.

kbarton accepted this revision.Oct 24 2017, 3:27 PM

LGTM

This revision is now accepted and ready to land.Oct 24 2017, 3:27 PM
This revision was automatically updated to reflect the committed changes.
gyiu marked an inline comment as done.