This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Detect zeroable shuffle elements from different value types
ClosedPublic

Authored by RKSimon on Nov 2 2015, 3:02 PM.

Details

Summary

Improve computeZeroableShuffleElements to be able to peek through bitcasts to extract zero/undef values from BUILD_VECTOR nodes of different element sizes to the shuffle mask.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 38992.Nov 2 2015, 3:02 PM
RKSimon retitled this revision from to [X86][SSE] Recursive search for zeroable shuffle elements.
RKSimon updated this object.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: llvm-commits.
andreadb edited edge metadata.Nov 6 2015, 4:47 AM

Hi Simon,

lib/Target/X86/X86ISelLowering.cpp
6732–6734 ↗(On Diff #38992)

Do you think it would make sense to limit the recursion level?

6746–6749 ↗(On Diff #38992)

I don't think you need to compute V1IsZero and V2IsZero anymore. The two calls to 'isBuildVectorAllZeros' are now made redundant by the calls to 'GetSubZeroable'.

You can also simplify the if statement at line 6754. In particular, you would only need to check for the presence of an undef index. All the remaining cases should be taken care by the checks after line 6760.

6777–6787 ↗(On Diff #38992)

At line 6777, you don't need to check if 'Size < SubSize'. If control reaches line 6777, then Size can never be bigger than or equal to SubSize.

The motivation is that the check for (Size < SubSize) is dominated by the checks for (Size == SubSize) and the check for (Size > SubSize). You can also remove the llvm_unreachable at line 6787 as it is not needed.

RKSimon updated this revision to Diff 40179.Nov 13 2015, 2:36 PM
RKSimon edited edge metadata.

Added recursion limit and other recommendations from Andrea.

RKSimon marked 3 inline comments as done.Nov 13 2015, 2:37 PM
chandlerc edited edge metadata.Nov 15 2015, 7:21 AM

I'm really not convinced this is the correct approach.

Instead, I think we should combine SHUFFLE_VECTOR nodes into a single node so that we don't need this kind of recursive logic. If that doesn't work for some reason, I think that needs to be pretty clearly explained.

I'm really not convinced this is the correct approach.

Is it mainly the recursion that is concerning you? There is more that we could do at the DAGCombiner or X86 level to merge shuffles, but it would still be recursion. Other targets have shuffle instructions that set elements to constants (zero and allbits being the most common) as well as as input permutations, but I don't see much that makes use of this at all.

I've been trying to think of ways to canonicalize shuffles with zeros/constants inputs to reduce this but can't think of anything that would really help.

Instead, I think we should combine SHUFFLE_VECTOR nodes into a single node so that we don't need this kind of recursive logic. If that doesn't work for some reason, I think that needs to be pretty clearly explained.

The X86 shuffle combining could do some of this, but it would still be recursive and it would be repeating a lot of what is done at lowering already.

insertps is a curious instruction in that it really takes 3 inputs (well, 2 inputs + zero), which makes combining rather tricky with our existing methods - as another approach I could disable recursion in computeZeroableShuffleElements by default and just have the insertps lowering use it?

I'm really not convinced this is the correct approach.

Is it mainly the recursion that is concerning you? There is more that we could do at the DAGCombiner or X86 level to merge shuffles, but it would still be recursion.

It's not the same at all though. When we combine N shuffles into 1 shuffle, we can do that in O(N) because we can combine as we go. When we instead recursively match during lowering here we can do O(N^2) work.

Other targets have shuffle instructions that set elements to constants (zero and allbits being the most common) as well as as input permutations, but I don't see much that makes use of this at all.

I'm not really concerned about helping other targets here, I'm concerned about making this use the correct model.

I've been trying to think of ways to canonicalize shuffles with zeros/constants inputs to reduce this but can't think of anything that would really help.

Can you describe why not? I'm not seeing it.

Specifically, I can't come up with any way that we would need to look through more than N-1 VECTOR_SHUFFLE nodes to find N inputs.

Instead, I think we should combine SHUFFLE_VECTOR nodes into a single node so that we don't need this kind of recursive logic. If that doesn't work for some reason, I think that needs to be pretty clearly explained.

The X86 shuffle combining could do some of this, but it would still be recursive and it would be repeating a lot of what is done at lowering already.

As above, the point is to do *less* of this at lowering time. Even if we still need some of this logic during lowering in order to handle iterative decomposition, we should minimize it and always avoid unbounded recursion.

insertps is a curious instruction in that it really takes 3 inputs (well, 2 inputs + zero), which makes combining rather tricky with our existing methods - as another approach I could disable recursion in computeZeroableShuffleElements by default and just have the insertps lowering use it?

Ah, I think this is really the critical thing to realize.

Because of the simplification provided by it, we should essentially re-associate shuffles to cause all constant inputs to come from a single build_vector, and that a build_vector of constants should be the last input shuffled:

For example:

S1 = (shuffle A, (build constant, constant, ...))
S2 = (shuffle B, S1)

should combine to:

S1 = (shuffle A, B)
S2 = (shuffle S1, (build constant, constant, ...))

So that we can always identify constant inputs at the bottom of the shuffle tree. We can still indicate *unused* inputs in S1 with undef shuffle indices.

Does this make sense to you? Is there some *other* thing broken by this style of reassociation?

insertps is a curious instruction in that it really takes 3 inputs (well, 2 inputs + zero), which makes combining rather tricky with our existing methods - as another approach I could disable recursion in computeZeroableShuffleElements by default and just have the insertps lowering use it?

Ah, I think this is really the critical thing to realize.

Because of the simplification provided by it, we should essentially re-associate shuffles to cause all constant inputs to come from a single build_vector, and that a build_vector of constants should be the last input shuffled:

For example:

S1 = (shuffle A, (build constant, constant, ...))
S2 = (shuffle B, S1)

should combine to:

S1 = (shuffle A, B)
S2 = (shuffle S1, (build constant, constant, ...))

So that we can always identify constant inputs at the bottom of the shuffle tree. We can still indicate *unused* inputs in S1 with undef shuffle indices.

Does this make sense to you? Is there some *other* thing broken by this style of reassociation?

The approach you propose is the canonicalization step I mentioned. My main concern with it is making sure our existing shuffle lowering can be adapted to 'peek' through a bottom shuffle with constants (in this case zeroable vectors) - this will affect insertps and the perm2f128 instructions and possibly others - but its certainly doable, but may involve reordering some of the lowerings so that they are attempted earlier. This will probably affect build_vector lowering as well - but I won't know the scope until I investigate this more.

Do you think this should be limited to the X86? Large parts of the code, such as combining multiple constant build_vector shuffle inputs into one, would almost certainly be useful in the DAGCombiner and I can't think of reasons why pushing down the shuffles with constants would cause any notable regressions in other targets' shuffle codegen.

That would leave this patch without the recursion element - I think reducing it to just allowing it to check for zeroable build_vectors of different element counts would still be useful, I'll see if I can update this patch.

Cheers, Simon.

qcolombet edited edge metadata.Feb 10 2016, 5:01 PM

Hi Simon,

Is this patch still alive?
(I.e., does it need review.)

Thanks,
-Quentin

It needs redoing (and reducing in scope) now that more of the target shuffle combine work is completed. We should be able to reduce it to a test for zero on BUILD_VECTORs with different number of elements to the mask (same as to what has been done in setTargetShuffleZeroElements).

It just hasn't been high enough on my todo list......

RKSimon updated this revision to Diff 50792.Mar 15 2016, 7:31 PM
RKSimon retitled this revision from [X86][SSE] Recursive search for zeroable shuffle elements to [X86][SSE] Detect zeroable shuffle elements from different value types.
RKSimon updated this object.
RKSimon edited edge metadata.

As recommended by Chandler, I've adjusted this patch to avoid recursion (we now do this in setTargetShuffleZeroElements for target shuffle combining) and instead focus on improving computeZeroableShuffleElements to be able to peek through bitcasts to extract zero/undef values from BUILD_VECTOR nodes of different element sizes to the shuffle mask.

chandlerc accepted this revision.Mar 16 2016, 7:53 AM
chandlerc edited edge metadata.

FWIW, this looks really nice now. Nits below, LGTM with suggested changes.

lib/Target/X86/X86ISelLowering.cpp
7249–7250 ↗(On Diff #50792)

Please just use 'int' unless you need modular arithmetic...

7272 ↗(On Diff #50792)

int here as well.

7293–7295 ↗(On Diff #50792)

int here, and j < Scale.

This revision is now accepted and ready to land.Mar 16 2016, 7:53 AM
This revision was automatically updated to reflect the committed changes.