This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] fold shuffles of insert_subvectors
ClosedPublic

Authored by spatel on May 16 2019, 12:04 PM.

Details

Summary

This should be a valid exception to the general rule of not creating new shuffle masks in IR...because we already do it. :)
Also AFAICT, DAG combining/legalization can undo this by widening the shuffle back out if needed.

Explanation for how we already do this: SLP or vector source can create chains of insert/extract as shown in 1 of the examples from PR16739:
https://godbolt.org/z/NlK7rA
https://bugs.llvm.org/show_bug.cgi?id=16739

And we expect instcombine or DAGCombine to clean that up by creating relatively simple shuffles. But instcombine has a strange way of dealing with this and which can fail with multiple uses - grep for:

// TODO: Looking at the user(s) to determine if this insert is a
// fold-to-shuffle opportunity does not match the usual instcombine
// constraints.

So this patch is proposing to reduce that restriction by matching a sub-pattern of shuffles. We would have folded that anyway if they had appeared in an alternate form (as a sequence of insert/extract).

Diff Detail

Event Timeline

spatel created this revision.May 16 2019, 12:04 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 16 2019, 12:04 PM

Can we make a list somewhere of the shuffles which all targets with vector types are expected to support?

Can we make a list somewhere of the shuffles which all targets with vector types are expected to support?

Sure - LangRef or less formal? eg, doxygen code comments for InstCombiner::visitShuffleVectorInst()?

LangRef isn't really appropriate; it's not an issue of correctness. (All targets support all shuffles; it's just that some of them generate terrible code.)

Maybe doxygen on InstCombiner::visitShuffleVectorInst is okay. Not sure where else we would put it; maybe CodeGenerator.rst, but that doesn't seem great.

spatel updated this revision to Diff 199915.May 16 2019, 3:24 PM

Patch updated:
Added descriptive text to CodeGenerator.rst to document our (previously unstated?) rules about creating shuffles in IR.
The docs folder seems easier to point to than a doxygen comment, but I'm open to suggestions. We could duplicate text or link to the docs file within the source code to make it more likely to be found. Also, the text is likely not completely accurate, so any refinement ideas are welcome.

efriedma added inline comments.May 16 2019, 4:20 PM
llvm/docs/CodeGenerator.rst
932

This description is useful from the perspective from the optimizer, but it would probably also be a good idea to mention what this means in practice from the perspective of the backend. For shuffles where operand and result types are all the same, and legal, that's pretty straightforward, but some of these are more complicated. Can a splat change the vector length, and if so, what DAG nodes does that produce? What DAG nodes are we actually talking about for an "insert subvector"?

Also, which of those categories does the proposed transform fall under?

spatel updated this revision to Diff 200041.May 17 2019, 7:29 AM

Revised documentation text: tried to make this backend-centric by specifying the mapping from IR to SDAG opcodes.

Also, which of those categories does the proposed transform fall under?

This transform relies on expanding the shuffle of narrowed operands back to:
shuffle_vector (insert_subvector undef, X, 0), (insert_subvector undef, Y, 0), Mask'

Here, I'm assuming insertion into 'undef' rather than an arbitrary base vector, so it's an easier operation to support than what I wrote in this rev of the doc text. I'm not sure yet if inserting at index 0 into an arbitrary vector is actually handled well enough by all targets, but that seems like a common enough vector op? It's a fuzzy spec because it can change as we add DAGCombiner rules.

This transform relies on expanding the shuffle of narrowed operands back to

I think that qualifies as its own category; you're basically saying that certain shuffles should be equivalent to certain other shuffles, which is sort of orthogonal to whether any particular shuffle is legal.

This transform relies on expanding the shuffle of narrowed operands back to

I think that qualifies as its own category; you're basically saying that certain shuffles should be equivalent to certain other shuffles, which is sort of orthogonal to whether any particular shuffle is legal.

I'm struggling to find words to describe this without opening the door completely to arbitrary shuffle combining. Any suggestions are welcome. :)

The motivating transform that led me here is the last test diff (ends up as just a single insert_subvector), so I could restrict the transform further, but it seemed like the more general fold would be beneficial and reversible if necessary.

In terms of describing it, what is the backend actually going to see?

If both the source and destination types are legal, the rule is a sort of a "truncated shuffle rule"; essentially, given vectors with the same element type, the shuffle mask <x,y,z,w> should have roughly the same cost as the shuffle mask <x,y,z,w,undef,undef,undef,undef>. (This isn't a great description because it sort of implies that the mask constants would be the same, but you get the idea.) This is probably reasonable to expect on most targets that support multiple vector widths, but it does require some implementation work. IIRC I did some optimization work related to this for AArch64.

If the destination type is legal, and the source type would be "widened" (in vector legalization terms) to the destination type, then effectively the two shuffles are the same, and the backend doesn't really need to be aware, I think.

If the destination type is legal, and the source type would be "promoted" (in vector legalization terms) to the destination type, then I'm not sure what rule would allow the transform: the shuffles aren't really related in any obvious way.

I won't try to enumerate the various cases where the destination type is illegal; I'm guessing it generally breaks down to one of the above cases somehow.

RKSimon added inline comments.May 18 2019, 5:52 AM
llvm/docs/CodeGenerator.rst
922

"Each element of the vector is chosen from either of the corresponding elements of the 2 input vectors."

In terms of describing it, what is the backend actually going to see?

Good question - I stepped through some examples, and I might've been overthinking this. We assume that the original shuffle was safe/legal because nothing in the IR optimizer should create anything difficult. So the initial DAG for these patterns appears to be the same independently of this IR transform:

Before:
define <4 x float> @widen_shuffles(<2 x float> %x, <2 x float> %y) {
  %s1 = shufflevector <2 x float> %x, <2 x float> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> ; concat with undef elements
  %s2 = shufflevector <2 x float> %y, <2 x float> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> ; concat with undef elements
  %s3 = shufflevector <4 x float> %s2, <4 x float> %s1, <4 x i32> <i32 1, i32 5, i32 4, i32 0> ; arbitrary shuffle
  ret <4 x float> %s3
}

$ llc -o - -debug
...
Initial selection DAG: %bb.0 'widen_shuffles:'
        t9: v4f32 = concat_vectors t6, undef:v2f32
        t8: v4f32 = concat_vectors t3, undef:v2f32
      t10: v4f32 = vector_shuffle<1,5,4,0> t9, t8
After:
define <4 x float> @widen_shuffles(<2 x float> %x, <2 x float> %y) {
  %s3 = shufflevector <2 x float> %y, <2 x float> %x, <4 x i32> <i32 1, i32 3, i32 2, i32 0>
  ret <4 x float> %s3
}

$ llc -o - -debug
...
Initial selection DAG: %bb.0 'widen_shuffles:'
        t8: v4f32 = concat_vectors t6, undef:v2f32
        t9: v4f32 = concat_vectors t3, undef:v2f32
      t10: v4f32 = vector_shuffle<1,5,4,0> t8, t9

So IIUC, this is always the 1st option that you listed: legal destination type with vector widening, so the backend shouldn't ever see a difference.

spatel updated this revision to Diff 200304.May 20 2019, 8:34 AM

Patch updated:

  1. Improved doc text for 'shuffle select' as suggested.
  2. Adjusted code comment to say the transform is expected to be expanded back to original form in the backend.

So IIUC, this is always the 1st option that you listed: legal destination type with vector widening, so the backend shouldn't ever see a difference.

Did you try any testcases with integer types? floats can't be "promoted", but integers can.

So IIUC, this is always the 1st option that you listed: legal destination type with vector widening, so the backend shouldn't ever see a difference.

Did you try any testcases with integer types? floats can't be "promoted", but integers can.

Yes - looks like the same concat_vectors pattern to me:

  t9: v4i32 = concat_vectors t6, undef:v2i32
  t8: v4i32 = concat_vectors t5, undef:v2i32
t10: v4i32 = vector_shuffle<1,5,4,0> t9, t8

  t9: v8i16 = concat_vectors t7, undef:v4i16
  t10: v8i16 = concat_vectors t6, undef:v4i16
t11: v8i16 = vector_shuffle<1,9,10,0,1,1,11,3> t9, t10

I also tried with an odd input type (<3 x i32>) to a legal type (<4x i32>), and that has extra ops on initial DAG, but it's still identical between the 2 proposed IR variants:

      t7: v3i32 = extract_subvector t4, Constant:i64<0>
    t9: v6i32 = concat_vectors t7, undef:v3i32
      t6: v3i32 = extract_subvector t2, Constant:i64<0>
    t10: v6i32 = concat_vectors t6, undef:v3i32
  t11: v6i32 = vector_shuffle<1,7,6,0,u,u> t9, t10
t12: v4i32 = extract_subvector t11, Constant:i64<0>

I need to update my local repo to confirm, but this case exposes an existing crash for AArch64 codegen.

I tried llc with the before/after of the <7 x i8> testcase from the patch on x86-64, and I'm seeing different shuffles. I'm not sure why, though; the type v9i8 is involved for some non-obvious reason. Full testcase:

define <7 x i8> @insert_subvector_shuffles(<3 x i8> %x, <3 x i8> %y) {
  %s1 = shufflevector <3 x i8> %x, <3 x i8> undef, <7 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %s2 = shufflevector <3 x i8> %y, <3 x i8> undef, <7 x i32> <i32 undef, i32 1, i32 2, i32 undef, i32 undef, i32 undef, i32 undef>
  %s3 = shufflevector <7 x i8> %s1, <7 x i8> %s2, <7 x i32> <i32 0, i32 8, i32 1, i32 undef, i32 8, i32 1, i32 9>
  ret <7 x i8> %s3
}
define <7 x i8> @insert_subvector_shuffles2(<3 x i8> %x, <3 x i8> %y) {
  %s3 = shufflevector <3 x i8> %x, <3 x i8> %y, <7 x i32> <i32 0, i32 4, i32 1, i32 undef, i32 4, i32 1, i32 5>
  ret <7 x i8> %s3
}

I tried llc with the before/after of the <7 x i8> testcase from the patch on x86-64, and I'm seeing different shuffles. I'm not sure why, though; the type v9i8 is involved for some non-obvious reason. Full testcase:

define <7 x i8> @insert_subvector_shuffles(<3 x i8> %x, <3 x i8> %y) {
  %s1 = shufflevector <3 x i8> %x, <3 x i8> undef, <7 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %s2 = shufflevector <3 x i8> %y, <3 x i8> undef, <7 x i32> <i32 undef, i32 1, i32 2, i32 undef, i32 undef, i32 undef, i32 undef>
  %s3 = shufflevector <7 x i8> %s1, <7 x i8> %s2, <7 x i32> <i32 0, i32 8, i32 1, i32 undef, i32 8, i32 1, i32 9>
  ret <7 x i8> %s3
}
define <7 x i8> @insert_subvector_shuffles2(<3 x i8> %x, <3 x i8> %y) {
  %s3 = shufflevector <3 x i8> %x, <3 x i8> %y, <7 x i32> <i32 0, i32 4, i32 1, i32 undef, i32 4, i32 1, i32 5>
  ret <7 x i8> %s3
}

That's bizarre. We use "alignTo(7, 3)":

/// Returns the next integer (mod 2**64) that is greater than or equal to
/// \p Value and is a multiple of \p Align. \p Align must be non-zero.

But for the purposes of this transform, we don't have to worry about weird types, right? We get to assume that the destination type (the original wide shuffle type) was intentionally created in source or was validated by TTI in the vectorizers.

We use "alignTo(7, 3)":

Hmm. It looks like the idea is that we should perform the shuffle in some type that's a multiple of the source vectors, so we can construct the source operands using CONCAT_VECTORS. Not sure why we aren't just generating an INSERT_SUBVECTOR into undef, though.

But for the purposes of this transform, we don't have to worry about weird types, right?

Depends on what you mean by "weird"; AMDGPU has legal non-power-of-two vector types. But yes, for the purpose of preserving shuffles, we're generally trying to preserve shuffles that were legal in the first place, which means either the shuffle is "trivial", like the ones you've listed, or the destination type is legal (or maybe some multiple of a legal type, I guess).


I had one more thought, which I'm not sure we really covered before: what happens if the destination type of the final shuffle is narrower than the intermediate types? I guess the reasoning ends up mostly the same, though.

We use "alignTo(7, 3)":

Hmm. It looks like the idea is that we should perform the shuffle in some type that's a multiple of the source vectors, so we can construct the source operands using CONCAT_VECTORS. Not sure why we aren't just generating an INSERT_SUBVECTOR into undef, though.

INSERT_SUBVECTOR makes sense to me, but I just tried that experiment and...fallout: x86 has regressions/crashing, AMDGPU is infinite looping.

Okay, so my current understanding is the following: the invariant we want for this patch is that given an IR shufflevector with a given destination type and shuffle mask, we produce an ISD::VECTOR_SHUFFLE with the same destination type and shuffle mask. If that's true, we can change the source type without causing issues. If the source and destination vectors both have both power-of-two length, SelectionDAG does in fact preserve this invariant. But in general, it doesn't.

I don't want to create extra work in the way of an incremental improvement, but I don't think we can merge this patch given the current state of SelectionDAG. Maybe if would be okay if you restrict it to power-of-two vectors for now?

Okay, so my current understanding is the following: the invariant we want for this patch is that given an IR shufflevector with a given destination type and shuffle mask, we produce an ISD::VECTOR_SHUFFLE with the same destination type and shuffle mask. If that's true, we can change the source type without causing issues. If the source and destination vectors both have both power-of-two length, SelectionDAG does in fact preserve this invariant. But in general, it doesn't.

I don't want to create extra work in the way of an incremental improvement, but I don't think we can merge this patch given the current state of SelectionDAG. Maybe if would be okay if you restrict it to power-of-two vectors for now?

Sure - that would still catch the motivating cases that I saw. I only included the prime number vector sizes to verify that the shuffle mask adjustment was working as intended. I'll add more tests and limit this.

spatel updated this revision to Diff 200596.May 21 2019, 3:04 PM

Patch updated:
Added power-of-2 element checks for all types, so we can be more confident that there are no codegen regressions.

This revision is now accepted and ready to land.May 21 2019, 4:01 PM
This revision was automatically updated to reflect the committed changes.