The front end portion of D18592.
Rather than emitting target-specific intrinsics, this patch uses the target-independent __builtin_shufflevector to generate vector_shuffle nodes.
The reason for this change is that there are some conditions under which these vec_splat functions get inlined into other code and the information about the calls to the vperm intrinsics actually being splats is somehow lost. The calls to __builtin_shufflevector will ensure that the information about the shuffle is retained. The benchmark that identified this issue sees significant improvement with this patch.
The arguments to __builtin_shufflevector must be compile-time constants. These vec_splat functions will typically be called with compile-time constants anyway so at anything other than -O0, the unnecessary branches will be eliminated. And if they're called with non-constant arguments for the element index, the sequence is probably still better than building a mask vector and using vperm.
Basically, at -O2 this form when the argument is variable will have a shift, compare, branch and xxspltw. The previous form has 16 stores (of each byte), a vector load (which includes a swap on LE) and a vperm.
Yeah, when entering code in phabricator comments, you can do one of two things:
__builtin_shufflevector(__a, __a, 3, 3, 3, 3);
FWIW LGTM ;-)
One question: xxspltd is an extended mnemonic for xxpermdi. In your current patch for BE I do not see adding a pattern for xxpermdi. Did you add some code to that patch that is not yet posted here? Or something else is going on here?
I have added a minor nit as well.
b is unused
On Friday I looked at this a little bit in rush. I looked into this more carefully again (including checking what IR we generate after the change for the testcase added). I think the implementation could be improved by using a built-in that accepts variable parameters (either an existing builtin or a new one). We do need to generate shufflevector instruction which requires constant parameters, but that can be done in CGBuiltin.cpp by calling Builder.CreateShuffleVector API. (There are multiple uses of this in CGBuiltin.cpp)
It is true that current implementation is a significant improvement over existing code, but that should not prevent us from generating better code when that is possible. (Unless the change is too complex or requires significant amount of extra work. But that is not the case here).
Also I grepped for existing uses of __builtin_shufflevector in the header files. I didn't check every single use, but in the fairly large number that I checked I did not find any use within a conditional statement such as switch case.
It appears @amehsan and I have come to an agreement here. Does anyone else have any objections to this approach @hfinkel, @echristo, @kbarton.
Sorry for the spam from the pings, I just want to make sure that we all agree on this approach for vector shuffles emitted from altivec.h as we might want do do something like this for other functions. This particular fix provided a significant improvement in Eigen and it is likely that there are other places where the compiler is tricked into not seeing through vperm's to see that they're vector shuffles.
@nemanjai I was just thinking of something else :)
If we have a runtime value for the second parameter, do we know that this code is an improvement over the previous implementation? (Specially if values of the runtime parameter has more or less equal frequency which code result in a lot of mispredicted branches.)
Something else to consider: (I have not yet checked it). If we specialize vec_perm implementation for vec_splat with runtime values, can we improve it? If yes, how would that compare to the current implementation.
I don't see any cases in which the old implementation leads to better code than what this patch suggests. I'm focusing on LE here.
Old implementation (noopt - both const and non-const index): 16 stb's, a mess of both vector and scalar ops, vector loads and stores, etc.
Old implementation (-O1 - const index): vspltw (plus we lose the information about the shuffle in some cases - which prompted this patch)
Old implementation (-O1 - non-const index): vspltisb, some rlwimi's, 16 stb's, lxvd2x along with the requisite swap, xxlor and vperm (the bulk of the work is for building the mask)
This patch (noopt - both const and non-const index): cmplwi, bgt, mtctr, bctr, xxspltw (along with the necessary swaps of the vectors and storing the arguments)
This patch (-O1 - const index): xxspltw (and we always retain the information about this being a shufflevector since we are using that builtin)
This patch (-O1 - non-const index): rlwinm, worst case - 3 cmplwi and 3 beq/bne
Although the noopt code emitted even with this patch has some SPR operations that are likely expensive, I don't think it is ever worse than 16 stores and a load (along with other instructions necessary to create a mask vector). And I am not sure how much merit there is in discussing performance characteristics of code generated at noopt.
Could we add a test case (or extend an existing test case) to ensure that these extra branches are cleaned up at -O2 and above?
Thanks for the comparison. This is what I have been thinking about. But if others prefer switch-case implementation, that is fine with me as well. I am not sure this approach is better.
If we have a built-in that accepts variable operands, in CGBuitlin.cpp we can check the operand. If we see a constant value we can emit shufflevector. Otherwise : there are only 4 different masks needed for vec_splat (of vector int or vector unsigned int). We can add a global array that includes all these 4 masks. Then we can load proper mask (using the parameter passed to vec_splat to index the array). Now once we have the maskwe probably need to generate an intrinsic (Is there an IR instruction that we can use?). So this approach has its own disadvantages too....
I suppose we can index into PerfectShuffleTable or something similar (presumably it has entries for splats). But then we're stuck with VMX instructions rather than VSX. So I'm not sure that this is a win even in that case.
I agree; a macro can be used here.
Another option is to use __builtin_constant_p to choose between the switch statement (in the case where we know it gets optimized away), and a target-specific intrinsic for the cases we can't generically represent.