Add support for the case where there is a single input vector from which
elements are gathered into a build_vector, and that input vector is 4x longer
than the result vector.
VECTOR_SHUFFLE requires that its input vectors and its result vector all have the same type,
and createBuildVecShuffle() already has recipes for meeting this requirement in
a number of cases. For the case this patch addresses, the requirement is met by
splitting the input vector into two half-sized vectors and widening the result
vector to twice its length; the gathered elements are then selected by a shuffle
of the two halves, and the original-width result is extracted from the wide shuffle.
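For illustration, a minimal IR sketch of the pattern and of the resulting shuffle sequence; the function names, element types, and indices are hypothetical, chosen only to make the example concrete:

```llvm
; The gather pattern this patch targets: elements are collected from a
; <16 x i8> input (4x the result width) into a <4 x i8> build_vector.
define <4 x i8> @gather_from_quad_width(<16 x i8> %in) {
  %e0 = extractelement <16 x i8> %in, i32 0
  %e1 = extractelement <16 x i8> %in, i32 2
  %e2 = extractelement <16 x i8> %in, i32 4
  %e3 = extractelement <16 x i8> %in, i32 6
  %v0 = insertelement <4 x i8> undef, i8 %e0, i32 0
  %v1 = insertelement <4 x i8> %v0, i8 %e1, i32 1
  %v2 = insertelement <4 x i8> %v1, i8 %e2, i32 2
  %v3 = insertelement <4 x i8> %v2, i8 %e3, i32 3
  ret <4 x i8> %v3
}

; Roughly the equivalent shuffle sequence described above: split %in into
; two <8 x i8> halves, shuffle them into a double-width (<8 x i8>) result,
; then take the low <4 x i8>.
define <4 x i8> @split_and_shuffle(<16 x i8> %in) {
  %lo = shufflevector <16 x i8> %in, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %hi = shufflevector <16 x i8> %in, <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %wide = shufflevector <8 x i8> %lo, <8 x i8> %hi, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 undef, i32 undef, i32 undef, i32 undef>
  %res = shufflevector <8 x i8> %wide, <8 x i8> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x i8> %res
}
```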
Details
- Reviewers
spatel RKSimon craig.topper hfinkel efriedma
Event Timeline
test/CodeGen/ARM/vpadd.ll:376
This appears to be a regression for ARM codegen. Assuming it is, what are the options for fixing it? IMHO these are the options, ordered by preference:
test/CodeGen/ARM/vpadd.ll:376
We have a combine in the ARM backend which specifically combines vuzp+vadd to vpadd. It looks like the reason it isn't triggering here is that we're doing the vuzp in the wrong width; probably easy to fix.
test/CodeGen/ARM/vpadd.ll:376
Thanks for highlighting the problem, Eli. The following case shows the same missed combine opportunity without this patch, by being lowered to the same asm code as the right-hand side of the diff.

```llvm
define void @test(<16 x i8> *%cbcr, <4 x i16> *%X) nounwind ssp {
  %tmp = load <16 x i8>, <16 x i8>* %cbcr
  %tmp1 = zext <16 x i8> %tmp to <16 x i16>
  %tmp2 = shufflevector <16 x i16> %tmp1, <16 x i16> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 undef, i32 undef, i32 undef, i32 undef>
  %tmp2a = shufflevector <8 x i16> %tmp2, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %tmp3 = shufflevector <16 x i16> %tmp1, <16 x i16> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
  %tmp3a = shufflevector <8 x i16> %tmp3, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %add = add <4 x i16> %tmp2a, %tmp3a
  store <4 x i16> %add, <4 x i16>* %X, align 8
  ret void
}
```
test/CodeGen/ARM/vpadd.ll:376
Just want to understand what is needed for this review to proceed. Does the ARM regression need to be fixed first, or are we ok with letting it in assuming it is easy to fix and will be fixed shortly after?
test/CodeGen/ARM/vpadd.ll:376
Taking another look, I'm not convinced we should just be putting this issue aside. It's not really an ARM-specific issue: we're deciding to create an EXTRACT_SUBVECTOR from a 128-bit shuffle rather than just creating a 64-bit shuffle, which might be more efficient depending on the hardware. Equivalently, for x86 hardware, this is like creating a ymm shuffle when an xmm shuffle would be sufficient. I mean, this issue isn't really important enough to block this change, but I'd like to make sure we at least understand what we're doing.
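To make the trade-off concrete, here is a hypothetical IR sketch of the two lowerings being compared; the widths and indices are illustrative, not taken from the test:

```llvm
; Wide form: shuffle at full width, then extract the narrow result
; (a full-width shuffle followed by an EXTRACT_SUBVECTOR).
define <4 x i16> @wide_then_extract(<8 x i16> %v) {
  %s = shufflevector <8 x i16> %v, <8 x i16> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 undef, i32 undef, i32 undef, i32 undef>
  %r = shufflevector <8 x i16> %s, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x i16> %r
}

; Narrow form: split the input first and shuffle at half width, which may
; be cheaper on some hardware (e.g. a 64-bit vuzp instead of a 128-bit one).
define <4 x i16> @narrow_shuffle(<8 x i16> %v) {
  %lo = shufflevector <8 x i16> %v, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %hi = shufflevector <8 x i16> %v, <8 x i16> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %r = shufflevector <4 x i16> %lo, <4 x i16> %hi, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  ret <4 x i16> %r
}
```

Both functions compute the same result (elements 0, 2, 4, 6 of %v); the question is which shuffle width the DAG should commit to.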
test/CodeGen/ARM/vpadd.ll:376
I see your point. Going back to the list of options:

> Conservatively bail out if (min_mask_index * 2 > NumElems || max_mask_index * 2 < NumElems), which means that we are accessing elements from only one half of the input vector.

You could extend this a little: try to cut the input size to one quarter, and generate the shuffle that way, if we can.

I don't think we need a new target hook here; isExtractSubvectorCheap should be enough to drive the behavior here.

> Always allow creation of large shuffles (what the current patch does)

We could try to clean this up later in DAGCombine, yes... but it seems better to try to generate a reasonable shuffle from the start.
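As a worked instance of the conservative bail-out condition quoted above (the function name and values are hypothetical): with NumElems = 8 and gathered indices {1, 3}, max_mask_index * 2 = 6 < 8, so every accessed element lies in the low half and a half-width shuffle suffices:

```llvm
; All gathered indices fall in the low half of %v (max index 3, 3 * 2 < 8),
; so the gather can be done entirely on the cheap low-half subvector.
define <2 x i16> @low_half_only(<8 x i16> %v) {
  %lo = shufflevector <8 x i16> %v, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %r = shufflevector <4 x i16> %lo, <4 x i16> undef, <2 x i32> <i32 1, i32 3>
  ret <2 x i16> %r
}
```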
@zvi Abandon this? AFAICT we seem to have improved all the x86 cases with shuffle combining improvements already.