This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner, x86] convert insertelement of bitcasted vector into shuffle
ClosedPublic

Authored by spatel on Sep 28 2017, 3:54 PM.

Details

Summary

This is a generalization of the IR fold in D38316 to handle insertion into a non-undef vector. If this looks ok, then we should probably just abandon that one.

I had to add a target hook to avoid AVX512 horror with vXi1 shuffles. I think ARM/AArch64 will want to enable this too based on the earlier discussion, but I'm not sure if that would be limited to certain types or just set it to true for everything.

There may be room for improvement in the shuffle lowering here, but I think that would be follow-up work.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Sep 28 2017, 3:54 PM
fhahn added a subscriber: fhahn.Sep 29 2017, 1:23 AM
RKSimon added inline comments.Sep 29 2017, 2:52 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13787 ↗(On Diff #117069)

We need to test for isShuffleMaskLegal here and I think it will remove the need for shouldConvertInsSubVecToShuffle

spatel added inline comments.Sep 29 2017, 7:35 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13787 ↗(On Diff #117069)

Aha...I thought we must have something here already. I was trying something similar to that in an earlier draft, but it's tricky:

  1. If we use the actual shuffle type (VT), the subvector that we're trying to avoid messing with will already be cast from vXi1 to a potentially legal mask type. AVX512 disaster will occur for existing test cases.
  1. If we use the subvector type (SubVecVT), we will fail to get any of the cases shown here because the subvector type isn't legal (eg, v2i32). This is similar to why I couldn't use insert_subvector nodes - they require legal types.
  1. If we hack it to check for a hybrid fake type (the number of elements in the result + the element type of the subvector), we won't get the 512-bit cases for the AVX2 targets. This probably doesn't matter much in practice (shouldn't have larger than legal vectors in the first place?).

So it's not quite what we're looking for, but I can post a draft of option #3 if that's a reasonable alternative to a new hook. Or let me know if you see another way out.

spatel added inline comments.Sep 29 2017, 7:59 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13787 ↗(On Diff #117069)

Note: option #3 isn't as crazy as I made it sound, that 'fake type' is ConcatVT in the current patch.

RKSimon edited edge metadata.Sep 29 2017, 8:21 AM

I like the idea of #3 (making sure we comment it thoroughly!)

I like the idea of #3 (making sure we comment it thoroughly!)

We can make it less hacky by performing the shuffle in the pre-bitcast type. Then, we're not lying about the shuffle. We'll have to bitcast the big vector and then bitcast the result though...and the mask will be less beautiful (it'll be more like what I was trying to create in D38316). But that should still catch the cases we want without adding another hook. Let me draft that and see what breaks... :)

spatel updated this revision to Diff 117180.Sep 29 2017, 10:59 AM

Patch updated:

  1. Use the existing isShuffleMaskLegal() TLI hook to determine when this transform is possible.
  2. To make that query legitimate, perform the shuffle in the type of the padded subvector.
  3. To make the proper shuffle op, bitcast the big vector and bitcast the shuffle back to the big vector type.

This gives us the same wins for x86 as the earlier rev only if the vector type is legal. Ie, we no longer optimize the 512-bit vectors for an AVX2 target. That's ok because we don't care about optimizing codegen for oversized illegal vector types?

I think there's also an improvement for AArch64 now that we're using a hook that is enabled for other targets. We had:

ins	v1.d[1], v0.d[0]
mov		v0.16b, v1.16b

and now:

zip1	v0.2d, v1.2d, v0.2d
RKSimon added inline comments.Oct 1 2017, 5:43 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13774 ↗(On Diff #117180)

How bad does it get if we allow any shuffle prior to legalization?

13787 ↗(On Diff #117069)

Don't the interim generated nodes need adding to the worklist?

spatel marked an inline comment as done.Oct 1 2017, 9:29 AM
spatel added inline comments.
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13774 ↗(On Diff #117180)

I'm not sure about other targets because no existing tests change, but given that this isn't a clear win, I think we need some kind of hook to opt out.

For AVX512, it gets Bad - as in hundreds of extra instructions.

It looks like we'd need to add a concat_vectors fold to prevent this pattern:

t76: v128i1 = concat_vectors t13, undef:v16i1, undef:v16i1, undef:v16i1, undef:v16i1, undef:v16i1, undef:v16i1, undef:v16i1
t77: v8i16 = bitcast t76
t78: v8i16 = vector_shuffle<0,8,2,3,4,5,6,7> t80, t77

From becoming:

  ...t218: i8 = extract_vector_elt t13, Constant:i64<12>
  t217: i8 = extract_vector_elt t13, Constant:i64<13>
  t216: i8 = extract_vector_elt t13, Constant:i64<14>
  t215: i8 = extract_vector_elt t13, Constant:i64<15>
t210: v32i8 = BUILD_VECTOR t230, t229, t228, t227, t226, t225, t224, t223, t222, t221, t220, t219, t218, t217, t216, t215, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8
spatel updated this revision to Diff 117287.Oct 1 2017, 9:32 AM

Patch updated:

  1. Add intermediate nodes to worklist.
  2. This shifted the balance to have the helper function be a class member rather than a static, so some cosmetic diffs associated with that.

Has the legalization requirements affected whether we can abandon D38316?

Has the legalization requirements affected whether we can abandon D38316?

It's a gray area (to me at least). This patch is enough to solve the motivating bug:
https://bugs.llvm.org/show_bug.cgi?id=34716
...and it captures the more general case of inserting into a non-undef vector too, but you're right - it no longer handles the 512-bit vector example with a potentially pre-AVX512 target that Eli may have been suggesting in the other review.

So the considerations are (and I don't have a strong opinion either way):

  1. The IR canonicalization of D38316 can enable an IR improvement (less instructions in the motivating bug), so it still has some value. Is that value enough to justify the cost?
  2. Do we want to add a different target-independent IR-level fold to kill the extra insertelement instruction for a non-undef insertion? That one seems more complicated, so we'd at least want to know how that pattern was created.
RKSimon accepted this revision.Oct 10 2017, 5:59 AM

LGTM

This revision is now accepted and ready to land.Oct 10 2017, 5:59 AM
This revision was automatically updated to reflect the committed changes.