This patch adds code to PPC ISEL lowering to recognize half-word inserts from vector_shuffles, and use P9 shift and vector insert instructions instead of vperm.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
This patch (potentially) increase the number of vector instructions (permutation -> shift + insert). Is my understanding correct?
Yep. Though I think with a vperm you still need to load the mask into a vector register first, whereas with vshift + vinsert we're saving on the load.
I feel we should not increase the number of vector instructions within a loop (i.e. a common case for vector code) if we can load the mask into a vector register before the loop.
In case without an additional shift, it is nice to do opt in a loop for freeing up one vector register.
lib/Target/PowerPC/PPCISelLowering.h | ||
---|---|---|
515 ↗ | (On Diff #102384) | The community doesn't like the bool parameters here. But I am not sure whether we must remove them or it is just nice to do. |
test/CodeGen/PowerPC/p9-vinsert-vextract.ll | ||
8 ↗ | (On Diff #102384) | I would prefer to use non-mangled function names to make it more readable. I think you can just regenerate the IR from c source file instead of cpp file. |
- Added comments and demangled function names for LIT tests
- Added period to comment
- Fixed issue when one operand of the shufflevector is 'undef', in which case the PPCISDs we generate will use only the defined one.
- Initialize 'Swap' boolean
Go ahead and submit the rename separately (if you don't have commit access send me a patch and I'll do it). I would prefer to try to minimize boolean arguments - in this case you're adding a few more including a true/false and a return value one. Feel free to refactor the code around it to require fewer booleans or have multiple functions with helper functions that end up being called.
Thanks!
lib/Target/PowerPC/PPCISelLowering.h | ||
---|---|---|
515 ↗ | (On Diff #102384) | Please do. |
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
1629 ↗ | (On Diff #102384) | Is there any reason why we use uint32_t? If not, I would use unsigned instead to make it consistent. We are using both unsigned and uint32_t in this file. |
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
1629 ↗ | (On Diff #102384) | I think using uint32_t is appropriate here because I want to illustrate that I'm using 32-bits exactly. Using 'unsigned int' would be fine as well, but I don't think it's as clear that I'm looking for 32-bits exactly. |
- Added -O0 to LIT tests to test corner case of undef 2nd operand of vector shuffle.
- Refactored VINSERTH code to avoid boolean parameters and return value.
- Merged loops for 2nd operand undefined case and both operands defined.
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
7916 ↗ | (On Diff #103079) | Nit: (Here and other places) Comments are complete sentences including punctuation. |
Although I definitely agree that we should take steps to ensure we don't introduce further instructions in loops, I'm not sure that avoiding a 2-instruction sequence for a shuffle is necessarily the right thing to do. This statement is predicated on the fact that we can hoist the constant pool load out of a loop. If register pressure prevents this, we will have a load in the loop. Furthermore, if the loop is large enough and has other memory operations, it is conceivable that the constant pool load could be a cache miss on every iteration. And it is conceivable that such large loops will be the ones for which register pressure prevents the hoisting of the load. Furthermore, if the GPR register pressure is also high, we might not even be able to hoist the address calculations outside the loop, which would make the vperm sequence 3-4 instructions.
I think that at ISEL time, we should favour shorter instruction sequences that don't involve loads. And perhaps if we can show that multi-instruction permute sequences in loops appear enough in real code, we might want to have a loop pass that simplifies them into a load outside the loop with a vperm in the loop in general.
lib/Target/PowerPC/PPCISelLowering.h | ||
---|---|---|
1066 ↗ | (On Diff #103079) | This is probably a remnant of a previous implementation. Please rewrite the comment. |
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
7910 ↗ | (On Diff #103079) | So we don't want to shift if we're within the same register? Is there a specific reason for this? |
7921 ↗ | (On Diff #103079) | Isn't this already guaranteed to only have the low order 3 bits set? |
7926 ↗ | (On Diff #103079) | Why would we continue to search if we've already confirmed that:
|
- Addressed comments about my comments (grammar, periods, etc).
- Removed irrelevant comments in PPCISelLowering.h
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
7910 ↗ | (On Diff #103079) | I believe we have to add a xxlor to another VR if we want to shift the vector since we can't shift if both operands of the vector shuffle are the same vector. Adding another two cycles to VECSHL+VECINSERT seems diminish its value versus load+vperm. |
7921 ↗ | (On Diff #103079) | MaskOneElt could actually be >= 8, since the mask is in range [0, 15]. |
7926 ↗ | (On Diff #103079) | Yep, you're correct. Need a break here since we can't find more than one candidate. |
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
7910 ↗ | (On Diff #103079) | I don't really see why. Assume that you have something like this: vector unsigned short test(vector unsigned short a) { a[5] = a[2]; } I don't see why we can't codegen something like this for it: vsldoi 3, 2, 2, 4 vinserth 2, 3, 4 Forgive me if I didn't work out the immediates exactly correctly, but the point is the [lack of] need for the XXLOR. Of course, this does use an extra register, but so does the alternative (vperm). |
7921 ↗ | (On Diff #103079) | Ah, right. I didn't think of that. Sorry about that. |
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
7910 ↗ | (On Diff #103079) | Hmmm... Yep, you're right. I guess I can simplify my code even further now. I think this also means I have to fix up the code for the original xxinsertw lowering in a separate patch. |
- I'll open a separate item to address Nemenja's comments as I will not get a chance to do another enchancement.
I don't really see why. Assume that you have something like this:
vector unsigned short test(vector unsigned short a) {
a[5] = a[2];}
I don't see why we can't codegen something like this for it:
vsldoi 3, 2, 2, 4
vinserth 2, 3, 4Forgive me if I didn't work out the immediates exactly correctly, but the point is the [lack of] need for the XXLOR. Of course, this does use an extra register, but so does the alternative (vperm).
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
7910 ↗ | (On Diff #103079) | Actually, I'm not quite sure what you mean here. The original code for xxinsertw has the limitation of only being able to insert element 3 if both input vectors to the vector_shuffle are the same. I'll need to change that in a separate patch. I'm not sure where the 'renaming of things' comes into play? |
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
1132 ↗ | (On Diff #103522) | By "renaming stuff", I mean things like this. Kind of orthogonal to the patch and should go in as a separate NFC change. |
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
7903 ↗ | (On Diff #103974) | You should be able to get rid of this condition here.
The rest should fall out naturally and we'll do the shift for the single-input case as well. And the code will also be simpler. |
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
7903 ↗ | (On Diff #103974) | @nemanjai I created Issue #410 on github to address the issue when using vector shifts in the case when both inputs are the same vector. There's further investigation that's required as it's not clear which input/output registers the (vector shift + vector extract) sequence uses in this case. I would rather do this change as part of that work item instead. |
- Refactored NFCs to another patch to be committed.
- Made changes to remove restriction on only recognizing shuffles of halfword element 3 (4 in LE mode) when both input vectors are the same vector. That is, we can now recognize all single element shuffles in this situation.
Note that I was able to re-implement Nemanja's suggestion of generalizing the case when both inputs are the same vector because the registers used in code-gen are now consistent. Not sure if it was a real problem that I saw previously, or a transient issue that was fixed with newer levels of LLVM.
Changed my mind, removed changes related to this comment:
"Made changes to remove restriction on only recognizing shuffles of halfword element 3 (4 in LE mode) when both input vectors are the same vector. That is, we can now recognize all single element shuffles in this situation."
Will use a different patch to remove the restriction instead, as the contents of this patch is still functionally correct.
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
117 ↗ | (On Diff #113259) | Is this still necessary? I don't see any calls to it - only to the 3-parameter version of it. |
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
117 ↗ | (On Diff #113259) | Yeah, this patch is old so the declaration is based on the version that had two parameters. I'll update it when I merge with the latest code. |