This is an archive of the discontinued LLVM Phabricator instance.

DAGCombine: Extend createBuildVecShuffle for case len(in_vec) = 4*len(result_vec)
Abandoned · Public

Authored by zvi on May 9 2017, 12:57 AM.

Details

Summary

Add support for the case where there is a single input vector from which
elements are gathered into a build_vector, and that input vector is 4x longer
than the result vector.
VECTOR_SHUFFLE requires that the input vectors and the result vector be of the
same type, and createBuildVecShuffle() already has recipes for adapting numerous
cases to meet this requirement. The case this patch addresses is handled by
splitting the input vector into two half-sized vectors and widening the result
vector to twice its length.
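
A minimal sketch of that recipe, not the patch's actual code: it assumes an
input InVec with 4*N elements, an N-element result type VT, and Mask holding
the N gathered source indices; the helper name shuffleFromQuadWideInput is
hypothetical.

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Illustrative helper (hypothetical name): split a 4N-wide input into two
// 2N-wide halves, shuffle those into a 2N-wide vector, then extract the low
// N lanes as the result.
static SDValue shuffleFromQuadWideInput(SelectionDAG &DAG, const SDLoc &DL,
                                        EVT VT, SDValue InVec,
                                        ArrayRef<int> Mask) {
  unsigned N = VT.getVectorNumElements();
  EVT HalfVT =
      EVT::getVectorVT(*DAG.getContext(), VT.getVectorElementType(), 2 * N);
  // Split the input: Lo holds elements [0, 2N), Hi holds [2N, 4N).
  SDValue Lo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfVT, InVec,
                           DAG.getIntPtrConstant(0, DL));
  SDValue Hi = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfVT, InVec,
                           DAG.getIntPtrConstant(2 * N, DL));
  // In a two-operand shuffle of 2N-wide vectors, mask values in [0, 2N)
  // select from Lo and values in [2N, 4N) select from Hi, so the original
  // indices into the 4N-wide input carry over unchanged; pad the remaining
  // lanes with undef (-1).
  SmallVector<int, 16> WideMask(Mask.begin(), Mask.end());
  WideMask.resize(2 * N, -1);
  SDValue Shuf = DAG.getVectorShuffle(HalfVT, DL, Lo, Hi, WideMask);
  // The gathered elements now occupy the low N lanes; extract them.
  return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, Shuf,
                     DAG.getIntPtrConstant(0, DL));
}

The final EXTRACT_SUBVECTOR is what the discussion below questions: whether
extracting the low lanes of the wide shuffle is actually cheap on the target.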

Event Timeline

zvi created this revision. May 9 2017, 12:57 AM
zvi added inline comments. May 9 2017, 1:05 AM
test/CodeGen/ARM/vpadd.ll:376

This appears to be a regression for ARM codegen. Assuming it is, what are the options for fixing it? IMHO these are the options, ordered by preference:

  1. Can we improve the ARM backend to handle this case?
  2. Add a TLI hook for deciding when insert-extract sequences are better than a composed shuffle?
  3. Do this only in the X86 lowering.
efriedma edited reviewers, added: efriedma; removed: eli.friedman. May 9 2017, 8:32 AM
efriedma added a subscriber: efriedma.
efriedma added inline comments.
test/CodeGen/ARM/vpadd.ll:376

We have a combine in the ARM backend which specifically combines vuzp+vadd to vpadd. It looks like the reason it isn't triggering here is that we're doing the vuzp in the wrong width; probably easy to fix.

zvi added inline comments. May 10 2017, 11:53 AM
test/CodeGen/ARM/vpadd.ll:376

Thanks for highlighting the problem, Eli.

The following case exhibits the same missed combine opportunity without this patch: it is lowered to the same asm as the right-hand side of the diff.

define void @test(<16 x i8> *%cbcr, <4 x i16> *%X) nounwind ssp {
  %tmp = load <16 x i8>, <16 x i8>* %cbcr
  %tmp1 = zext <16 x i8> %tmp to <16 x i16>
  ; gather the even-indexed lanes of the low half into a <4 x i16>
  %tmp2 = shufflevector <16 x i16> %tmp1, <16 x i16> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 undef, i32 undef, i32 undef, i32 undef>
  %tmp2a = shufflevector <8 x i16> %tmp2, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ; gather the odd-indexed lanes of the low half into a <4 x i16>
  %tmp3 = shufflevector <16 x i16> %tmp1, <16 x i16> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
  %tmp3a = shufflevector <8 x i16> %tmp3, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ; add adjacent pairs of lanes, i.e. a vpadd candidate
  %add = add <4 x i16> %tmp2a, %tmp3a
  store <4 x i16> %add, <4 x i16>* %X, align 8
  ret void
}
zvi added inline comments. May 11 2017, 12:14 AM
test/CodeGen/ARM/vpadd.ll:376

Created Bug 32999 to track this.

zvi added inline comments. May 18 2017, 8:17 AM
test/CodeGen/ARM/vpadd.ll:376

Just want to understand what is needed for this review to proceed: does the ARM regression need to be fixed first, or are we OK with letting this in on the assumption that the regression is easy to fix and will be fixed shortly after?

efriedma added inline comments. May 18 2017, 11:40 AM
test/CodeGen/ARM/vpadd.ll:376

Taking another look, I'm not convinced we should just put this issue aside. It's not really an ARM-specific issue: we're deciding to create an EXTRACT_SUBVECTOR from a 128-bit shuffle rather than just creating a 64-bit shuffle, and the narrower shuffle might be more efficient depending on the hardware. Equivalently, for x86 hardware, this is like creating a ymm shuffle when an xmm shuffle would be sufficient.

I mean, this issue isn't really important enough to block this change, but I'd like to make sure we at least understand what we're doing.

zvi added inline comments. May 18 2017, 2:28 PM
test/CodeGen/ARM/vpadd.ll:376

I see your point. Going back to a list of options:

  1. Conservatively bail out if (min_mask_index * 2 >= NumElems || max_mask_index * 2 < NumElems), i.e. when all accessed elements come from one half of the input vector (see the sketch after this list)
  2. Add a TLI hook to let the target decide if the large shuffle is OK
  3. Always allow creation of large shuffles (what the current patch does)
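
A minimal sketch of the check in option 1, with the boundary case handled via
>= so that an index of exactly NumElems/2 counts as the upper half; the helper
name gathersFromSingleHalf is hypothetical, and negative mask values are
treated as undef lanes.

#include <algorithm>
#include <climits>
#include "llvm/ADT/ArrayRef.h"
using namespace llvm;

// Illustrative helper (hypothetical name): return true when every defined
// mask index falls within one half of a NumElems-wide input vector, so a
// half-width shuffle would suffice and the wide one can be avoided.
static bool gathersFromSingleHalf(ArrayRef<int> Mask, unsigned NumElems) {
  int MinIdx = INT_MAX, MaxIdx = INT_MIN;
  for (int Idx : Mask) {
    if (Idx < 0)
      continue; // undef lane, ignore
    MinIdx = std::min(MinIdx, Idx);
    MaxIdx = std::max(MaxIdx, Idx);
  }
  if (MinIdx == INT_MAX)
    return true; // all-undef mask; any narrow shuffle works
  // All indices in the upper half, or all in the lower half.
  return (unsigned)MinIdx * 2 >= NumElems || (unsigned)MaxIdx * 2 < NumElems;
}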
efriedma edited edge metadata. May 18 2017, 2:52 PM

Conservatively bail out if (min_mask_index * 2 >= NumElems || max_mask_index * 2 < NumElems), which means that we are accessing elements from one half of the input vector

You could extend this a little: try to cut the input size to one quarter and generate the shuffle that way, when possible.

I don't think we need a new target hook here; isExtractSubvectorCheap should be enough to drive the behavior.
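
A minimal sketch of gating on that existing hook, assuming the in-tree
signature TargetLoweringBase::isExtractSubvectorCheap(EVT ResVT, EVT SrcVT,
unsigned Index); DAG, VT, and HalfVT are as in the sketch under the Summary.

// Illustrative gating (not the patch's code): only build the wide 2N-lane
// shuffle when the target says extracting its low N lanes afterwards is cheap.
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
if (!TLI.isExtractSubvectorCheap(VT, HalfVT, /*Index=*/0))
  return SDValue(); // give up here; narrower combines can still fire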

Always allow creation of large shuffles (what the current patch does)

We could try to clean this up later in DAGCombine, yes... but it seems better to try to generate a reasonable shuffle from the start.

zvi added a subscriber: igorb. Jun 13 2017, 11:53 PM

@zvi Abandon this? AFAICT the shuffle combining improvements have already covered all the x86 cases.

zvi abandoned this revision. Dec 24 2019, 5:37 AM