This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Combine shuffle from concat+bitcast scalar to avoid the smaller vector type.
AbandonedPublic

Authored by ab on Apr 7 2015, 5:07 PM.

Details

Summary

Code that deals with small vectors but worries about performance (is hardware-aware) often passes the vectors around (on some targets as scalars, thanks to type coercion), but immediately shuffles them into a native-width vector, to do proper operations that we can't reasonably do a horrible job at. Promoting small vector types leads to some really nasty stuff, and people actively try to avoid that now ;)

Anyway, this means a pretty common pattern is this:

(vector_shuffle (v8i8 concat_vectors (v2i8 bitcast (i16)), undef..), M)

which we'll usually end up scalarizing. Instead, we can turn it into:

(vector_shuffle (v8i8 bitcast (v4i16 scalar_to_vector (i16))), M)

which lets us deal with native-width types all the time (this should be the foremost canonicalization goal whenever dealing with vectors IMHO, but I digress).

Diff Detail

Event Timeline

ab updated this revision to Diff 23382.Apr 7 2015, 5:07 PM
ab retitled this revision from to [CodeGen] Combine shuffle from concat+bitcast scalar to avoid the smaller vector type..
ab updated this object.
ab edited the test plan for this revision. (Show Details)
ab added reviewers: spatel, qcolombet.
ab added a subscriber: Unknown Object (MLST).

I have a gut feeling telling me this isn't the right solution; let's add more of the shuffle brigade! =)

Forgive the AArch64 testcase; on X86 there's enough heroics that we're able to recover from scalarizing the concat_vectors. I'm a bit fuzzy on the details, so I'll definitely keep looking before committing. In the meantime, guidance welcome!

-Ahmed

RKSimon edited edge metadata.Apr 8 2015, 5:53 AM

Much of this optimization looks like to will be accomplished by D8883 no?

ab added a comment.Apr 8 2015, 10:05 AM

Yes, but there's other combines that can get in the way. For instance, my go-to example is the _dup testcase, which we'll try to turn into just (concat_vectors (bitcast (i16)), (bitcast (i16)), ...), which, when promoted, gets scalarized, and we never recover from that. On X86 we do, that's the part I still need to look into.

From what I see, there's two other alternatives:

  • not creating CONCAT_VECTORS of illegal types
  • fixing whatever de-scalarizes the legalized code later on for this testcase on X86 but not on AArch64 (intuition says it's the fact that i16 isn't legal on AArch64, whereas it is on X86, so when we promote the scalar input as well, all hell breaks loose).

The latter seems brittle, the former is basically saying "CONCAT_VECTORS is only the canonical way to concat vectors when they're legal", which I found wrong, but isn't that shocking, when you spell it out.

Thoughts?

-Ahmed

In D8884#153312, @ab wrote:

Yes, but there's other combines that can get in the way. For instance, my go-to example is the _dup testcase, which we'll try to turn into just (concat_vectors (bitcast (i16)), (bitcast (i16)), ...), which, when promoted, gets scalarized, and we never recover from that. On X86 we do, that's the part I still need to look into.

Have you looked at generalizing the visitCONCAT_VECTOR code to create SCALAR_TO_VECTOR or BUILD_VECTOR depending on the contents of the operands? At the moment the SCALAR_TO_VECTOR and BUILD_VECTOR cases are treated separately but it should be pretty easy to merge - see D7816.

ab added a comment.Apr 8 2015, 6:43 PM
In D8884#153312, @ab wrote:

Yes, but there's other combines that can get in the way. For instance, my go-to example is the _dup testcase, which we'll try to turn into just (concat_vectors (bitcast (i16)), (bitcast (i16)), ...), which, when promoted, gets scalarized, and we never recover from that. On X86 we do, that's the part I still need to look into.

Have you looked at generalizing the visitCONCAT_VECTOR code to create SCALAR_TO_VECTOR or BUILD_VECTOR depending on the contents of the operands? At the moment the SCALAR_TO_VECTOR and BUILD_VECTOR cases are treated separately but it should be pretty easy to merge - see D7816.

Now that I look at this again, it seems to me that for the concat problem, a cleaner solution would be to combine (only when the intermediate type is illegal, perhaps?):

(v8i8 concat_vectors (v2i8 bitcast (i16)) x4)

into:

(v8i8 (bitcast (v4i16 BUILD_VECTOR (i16) x4)))

Is that what you're proposing?

-Ahmed

chandlerc edited edge metadata.Apr 8 2015, 7:42 PM
In D8884#153590, @ab wrote:
In D8884#153312, @ab wrote:

Yes, but there's other combines that can get in the way. For instance, my go-to example is the _dup testcase, which we'll try to turn into just (concat_vectors (bitcast (i16)), (bitcast (i16)), ...), which, when promoted, gets scalarized, and we never recover from that. On X86 we do, that's the part I still need to look into.

Have you looked at generalizing the visitCONCAT_VECTOR code to create SCALAR_TO_VECTOR or BUILD_VECTOR depending on the contents of the operands? At the moment the SCALAR_TO_VECTOR and BUILD_VECTOR cases are treated separately but it should be pretty easy to merge - see D7816.

Now that I look at this again, it seems to me that for the concat problem, a cleaner solution would be to combine (only when the intermediate type is illegal, perhaps?):

(v8i8 concat_vectors (v2i8 bitcast (i16)) x4)

into:

(v8i8 (bitcast (v4i16 BUILD_VECTOR (i16) x4)))

Is that what you're proposing?

-Ahmed

FWIW, this is what I think makes far and away the most sense. We should definitely prefer a build_vector of scalars over concat_vector of bitcasts of scalars.

Now that I look at this again, it seems to me that for the concat problem, a cleaner solution would be to combine (only when the intermediate type is illegal, perhaps?):

(v8i8 concat_vectors (v2i8 bitcast (i16)) x4)

into:

(v8i8 (bitcast (v4i16 BUILD_VECTOR (i16) x4)))

Is that what you're proposing?

Yes that seems the cleanest approach and gives us the best chance of further optimization later on.

ab abandoned this revision.Apr 9 2015, 6:29 PM

Alrighty then, superseded by D8948.

Thanks for the help!

-Ahmed