This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Front end improvements for vec_splat
AbandonedPublic

Authored by nemanjai on Mar 30 2016, 5:24 AM.

Details

Summary

The front end portion of D18592.

Diff Detail

Repository
rL LLVM

Event Timeline

nemanjai updated this revision to Diff 52039.Mar 30 2016, 5:24 AM
nemanjai retitled this revision from to [PowerPC] Front end improvements for vec_splat.
nemanjai updated this object.
nemanjai set the repository for this revision to rL LLVM.
nemanjai added a subscriber: llvm-commits.
nemanjai updated this revision to Diff 52375.Apr 1 2016, 7:44 AM
nemanjai updated this object.

Rather than emitting target-specific intrinsics, this patch uses the target-independent __builtin_shufflevector to generate vector_shuffle nodes.

The reason for this change is that there are some conditions under which these vec_splat functions get inlined into other code and the information about the calls to the vperm intrinsics actually being splats is somehow lost. The calls to __builtin_shufflevector will ensure that the information about the shuffle is retained. The benchmark that identified this issue sees significant improvement with this patch.

amehsan edited edge metadata.Apr 1 2016, 8:52 AM

Thanks for using target-independent builtin and IR.

lib/Headers/altivec.h
7726–7736

Why do we need a switch case here (and in the ones below) instead of just returning builtin_shufflevector(a, a, elem, elem, elem, __elem)?

amehsan added inline comments.Apr 1 2016, 8:53 AM
lib/Headers/altivec.h
7726–7736

Apparently phabricator has a problem with underlines....

nemanjai added inline comments.Apr 1 2016, 9:57 AM
lib/Headers/altivec.h
7726–7736

The arguments to __builtin_shufflevector must be compile-time constants. These vec_splat functions will typically be called with compile-time constants anyway so at anything other than -O0, the unnecessary branches will be eliminated. And if they're called with non-constant arguments for the element index, the sequence is probably still better than building a mask vector and using vperm.

Basically, at -O2 this form when the argument is variable will have a shift, compare, branch and xxspltw. The previous form has 16 stores (of each byte), a vector load (which includes a swap on LE) and a vperm.

7726–7736

Yeah, when entering code in phabricator comments, you can do one of two things:

  1. Put them in a code block:
__builtin_shufflevector(__a, __a, 3, 3, 3, 3);
  1. Use the monospaced tag around the text: __builtin_shufflevector(__a, __a, 3, 3, 3, 3);

FWIW LGTM ;-)

One question: xxspltd is an extended mnemonic for xxpermdi. In your current patch for BE I do not see adding a pattern for xxpermdi. Did you add some code to that patch that is not yet posted here? Or something else is going on here?

I have added a minor nit as well.

test/CodeGen/ppc-vsx-splat.c
66

b is unused

74

same here

amehsan accepted this revision.Apr 1 2016, 10:48 AM
amehsan edited edge metadata.
This revision is now accepted and ready to land.Apr 1 2016, 10:48 AM

never mind the question

On Friday I looked at this a little bit in rush. I looked into this more carefully again (including checking what IR we generate after the change for the testcase added). I think the implementation could be improved by using a built-in that accepts variable parameters (either an existing builtin or a new one). We do need to generate shufflevector instruction which requires constant parameters, but that can be done in CGBuiltin.cpp by calling Builder.CreateShuffleVector API. (There are multiple uses of this in CGBuiltin.cpp)

It is true that current implementation is a significant improvement over existing code, but that should not prevent us from generating better code when that is possible. (Unless the change is too complex or requires significant amount of extra work. But that is not the case here).

Also I grepped for existing uses of __builtin_shufflevector in the header files. I didn't check every single use, but in the fairly large number that I checked I did not find any use within a conditional statement such as switch case.

nemanjai marked 2 inline comments as done.Apr 3 2016, 12:36 PM
nemanjai added inline comments.
test/CodeGen/ppc-vsx-splat.c
66

Oops, thank you. I'll remove both.

Ignore my last comment. That approach is not going to work.

It appears @amehsan and I have come to an agreement here. Does anyone else have any objections to this approach @hfinkel, @echristo, @kbarton.
Sorry for the spam from the pings, I just want to make sure that we all agree on this approach for vector shuffles emitted from altivec.h as we might want do do something like this for other functions. This particular fix provided a significant improvement in Eigen and it is likely that there are other places where the compiler is tricked into not seeing through vperm's to see that they're vector shuffles.

@nemanjai I was just thinking of something else :)

If we have a runtime value for the second parameter, do we know that this code is an improvement over the previous implementation? (Specially if values of the runtime parameter has more or less equal frequency which code result in a lot of mispredicted branches.)

Something else to consider: (I have not yet checked it). If we specialize vec_perm implementation for vec_splat with runtime values, can we improve it? If yes, how would that compare to the current implementation.

@nemanjai I was just thinking of something else :)

If we have a runtime value for the second parameter, do we know that this code is an improvement over the previous implementation? (Specially if values of the runtime parameter has more or less equal frequency which code result in a lot of mispredicted branches.)

Something else to consider: (I have not yet checked it). If we specialize vec_perm implementation for vec_splat with runtime values, can we improve it? If yes, how would that compare to the current implementation.

I don't see any cases in which the old implementation leads to better code than what this patch suggests. I'm focusing on LE here.
Old implementation (noopt - both const and non-const index): 16 stb's, a mess of both vector and scalar ops, vector loads and stores, etc.
Old implementation (-O1 - const index): vspltw (plus we lose the information about the shuffle in some cases - which prompted this patch)
Old implementation (-O1 - non-const index): vspltisb, some rlwimi's, 16 stb's, lxvd2x along with the requisite swap, xxlor and vperm (the bulk of the work is for building the mask)

This patch (noopt - both const and non-const index): cmplwi, bgt, mtctr, bctr, xxspltw (along with the necessary swaps of the vectors and storing the arguments)
This patch (-O1 - const index): xxspltw (and we always retain the information about this being a shufflevector since we are using that builtin)
This patch (-O1 - non-const index): rlwinm, worst case - 3 cmplwi and 3 beq/bne

Although the noopt code emitted even with this patch has some SPR operations that are likely expensive, I don't think it is ever worse than 16 stores and a load (along with other instructions necessary to create a mask vector). And I am not sure how much merit there is in discussing performance characteristics of code generated at noopt.

kbarton requested changes to this revision.Apr 4 2016, 10:51 AM
kbarton edited edge metadata.
kbarton added inline comments.
lib/Headers/altivec.h
7726–7736

Could we add a test case (or extend an existing test case) to ensure that these extra branches are cleaned up at -O2 and above?
The branches will hinder instruction scheduling (among other things) so we should ensure they are cleaned up.
I agree the extra branches at -O0 are not a concern.

This revision now requires changes to proceed.Apr 4 2016, 10:51 AM

I'll get to the rest of it, but for this Kit's question about O2: It's
preferred to test such things in the back end.

Nemanja

Thanks for the comparison. This is what I have been thinking about. But if others prefer switch-case implementation, that is fine with me as well. I am not sure this approach is better.

If we have a built-in that accepts variable operands, in CGBuitlin.cpp we can check the operand. If we see a constant value we can emit shufflevector. Otherwise : there are only 4 different masks needed for vec_splat (of vector int or vector unsigned int). We can add a global array that includes all these 4 masks. Then we can load proper mask (using the parameter passed to vec_splat to index the array). Now once we have the maskwe probably need to generate an intrinsic (Is there an IR instruction that we can use?). So this approach has its own disadvantages too....

nemanjai added inline comments.Apr 4 2016, 11:21 AM
test/CodeGen/ppc-vsx-splat.c
38

@kbarton @echristo These tests check that we definitely have the right instruction, but they do not test that we do not have the branches. I can add some CHECK-NOT's or a check for the blr at the end of the function. Or I can add a test case to the back end portion (i.e. D18592).

Nemanja

Thanks for the comparison. This is what I have been thinking about. But if others prefer switch-case implementation, that is fine with me as well. I am not sure this approach is better.

If we have a built-in that accepts variable operands, in CGBuitlin.cpp we can check the operand. If we see a constant value we can emit shufflevector. Otherwise : there are only 4 different masks needed for vec_splat (of vector int or vector unsigned int). We can add a global array that includes all these 4 masks. Then we can load proper mask (using the parameter passed to vec_splat to index the array). Now once we have the maskwe probably need to generate an intrinsic (Is there an IR instruction that we can use?). So this approach has its own disadvantages too....

I suppose we can index into PerfectShuffleTable or something similar (presumably it has entries for splats). But then we're stuck with VMX instructions rather than VSX. So I'm not sure that this is a win even in that case.

echristo added inline comments.Apr 4 2016, 2:48 PM
lib/Headers/altivec.h
7726–7736

Another common workaround for these is to make the definitions macros and then you can use the mask as part of the macro arguments.

hfinkel added inline comments.Apr 4 2016, 4:19 PM
lib/Headers/altivec.h
7726–7736

I agree; a macro can be used here.

Another option is to use __builtin_constant_p to choose between the switch statement (in the case where we know it gets optimized away), and a target-specific intrinsic for the cases we can't generically represent.

amehsan added inline comments.Apr 4 2016, 9:25 PM
lib/Headers/altivec.h
7726–7736

__builtin_constant_p is really interesting. Much simpler implementation, no need to touch CGBuiltin.cpp. Thanks Hal for mentioning it.

nemanjai abandoned this revision.May 20 2016, 2:41 AM

The underlying cause for the poor code we were getting is better resolved with
http://reviews.llvm.org/D20443