This is part of the missing IR-level folding noted in D52912.
This should be ok as a canonicalization because the new shuffle mask can't be any more complicated than the existing shuffle mask. If there's some target where the shorter vector shuffle is not legal, it should just end up expanding to something like the pair of shuffles that we're starting with here.
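To illustrate, here's a minimal sketch of the fold (hypothetical IR, not taken from the patch or its tests; function names are invented): a wide shuffle followed by an identity-with-extract shuffle becomes a single narrower shuffle whose mask is a truncation of the original.

```
; Before: shuffle two <4 x i8> vectors, then extract the low 2 elements.
define <2 x i8> @narrow_shuffle(<4 x i8> %x, <4 x i8> %y) {
  %wide = shufflevector <4 x i8> %x, <4 x i8> %y, <4 x i32> <i32 7, i32 0, i32 6, i32 1>
  %narrow = shufflevector <4 x i8> %wide, <4 x i8> undef, <2 x i32> <i32 0, i32 1>
  ret <2 x i8> %narrow
}

; After: one shuffle with the first two mask elements of the wide shuffle;
; the mask is no more complicated than what we started with.
define <2 x i8> @narrow_shuffle_folded(<4 x i8> %x, <4 x i8> %y) {
  %narrow = shufflevector <4 x i8> %x, <4 x i8> %y, <2 x i32> <i32 7, i32 0>
  ret <2 x i8> %narrow
}
```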
lib/Transforms/InstCombine/InstCombineVectorOps.cpp | ||
---|---|---|
1496 | (On Diff #168860) | I'm being paranoid here. What is the expected outcome of this transform? |
lib/Transforms/InstCombine/InstCombineVectorOps.cpp | ||
---|---|---|
1496 | (On Diff #168860) | The new mask is the same size as the extracting shuffle (Shuf). The extracting shuffle's size must be smaller than the first shuffle's size; this is guaranteed by 'isIdentityWithExtract', but I'll add an assert to make that clearer. |
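To make that size relationship concrete, here is a hypothetical IR fragment (names invented for illustration): the extracting shuffle produces 4 elements from an 8-element shuffle, so the replacement mask has 4 elements, strictly fewer than the first shuffle's 8.

```
define <4 x i8> @extract_low_half(<8 x i8> %x, <8 x i8> %y) {
  ; The first shuffle produces 8 elements.
  %wide = shufflevector <8 x i8> %x, <8 x i8> %y, <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11>
  ; The extracting shuffle is an identity with extract: it keeps the first
  ; 4 elements, so the folded shuffle's mask would be <0, 8, 1, 9>.
  %half = shufflevector <8 x i8> %wide, <8 x i8> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x i8> %half
}
```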
Patch updated:
Added an assert to (hopefully) make this transform clearer.
Side note: while trying to edit this patch, I realized that I had accidentally already committed it to trunk with rL344082. I reverted that in rL344178.
Sorry about that... on the plus side, there were no bot complaints or bug reports while it was in trunk. :)
lib/Transforms/InstCombine/InstCombineVectorOps.cpp | ||
---|---|---|
1493 | (On Diff #169088) | I think this comment is too concise. |
lib/Transforms/InstCombine/InstCombineVectorOps.cpp | ||
---|---|---|
1493 | (On Diff #169088) | Good observations! I'll clarify these points in a revised code comment. |
test/Transforms/InstCombine/vec_shuffle.ll | ||
---|---|---|
173 | (On Diff #169088) | Yes - forgot to update the comment here and below. |
I found a case where this combine causes a codegen regression on AArch64. In the example below, %s0 puts data into a 128-bit vector, and %s1 and %s2 extract the lower and upper halves. Without folding %s0 and %s1, we can generate a single AArch64 tbl instruction for %s0 and a mov instruction for %s2. With the fold in this patch, we generate 3 additional instructions: an additional tbl for %s2 and 2 instructions for loading its mask.
So on AArch64, the combine produces worse code in the case where we can generate a single tbl instruction for the top-level shuffle and extracting the lower and upper halves is cheap. Do you have an idea how to best address the issue?
```
define <8 x i16> @test(<16 x i8> %s) {
entry:
  %0 = sub <16 x i8> <i8 undef, i8 undef, i8 undef, i8 -1, i8 undef, i8 undef, i8 undef, i8 -1, i8 undef, i8 undef, i8 undef, i8 -1, i8 undef, i8 undef, i8 undef, i8 -1>, %s
  %s0 = shufflevector <16 x i8> %0, <16 x i8> undef, <16 x i32> <i32 3, i32 3, i32 3, i32 3, i32 7, i32 7, i32 7, i32 7, i32 11, i32 11, i32 11, i32 11, i32 15, i32 15, i32 15, i32 15>
  %s1 = shufflevector <16 x i8> %s0, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %s2 = shufflevector <16 x i8> %s0, <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %a = call <8 x i16> @fn(<8 x i8> %s1, <8 x i8> %s2)
  ret <8 x i16> %a
}

declare <8 x i16> @fn(<8 x i8>, <8 x i8>)
```
That case is specifically interesting because when it's split, each half is lowered to a different shuffle, and both shuffles are expensive.
Obviously you could try to find that specific pattern in SelectionDAG on AArch64, but I'm not sure it makes sense to expect the backend to try; instcombine probably shouldn't be creating difficult-to-solve optimization puzzles.
But can you be sure there won't ever be such a shuffle via some other means?
I don't know that much about AArch64, but this really sounds like something to be dealt with in the backend.
The goal of this patch was to make things easier for the backend, but of course I didn't anticipate this case. How about limiting the transform like this:
```
Index: InstCombine/InstCombineVectorOps.cpp
===================================================================
--- InstCombine/InstCombineVectorOps.cpp	(revision 353169)
+++ InstCombine/InstCombineVectorOps.cpp	(working copy)
@@ -1498,6 +1498,11 @@
   if (!match(Op0, m_ShuffleVector(m_Value(X), m_Value(Y), m_Constant(Mask))))
     return nullptr;
 
+  // Be extra conservative with shuffle transforms. If we can't kill the 1st
+  // shuffle, then combining may result in worse codegen.
+  if (!Op0->hasOneUse())
+    return nullptr;
+
   // We are extracting a subvector from a shuffle. Remove excess elements from
   // the 1st shuffle mask to eliminate the extract.
   //
```
That would solve the issue, thanks. Without knowing more about the various backends, I think it makes sense to keep this transform quite conservative. If it is helpful for some backends, it might make sense to do it at a later stage in the pipeline. Merging shuffles back together in the backend might be possible in some cases, but it would be better if the backend does not have to fight instcombine here.
Yes, that's our default argument for IR canonicalization: if the source code was already in this form to begin with, then there must be some missing optimization already out there.
But we have a special restriction for shufflevector - we assume that masks are only ever created if they are easy for the target to digest. And the corollary is that we don't create all-new shuffle masks in target-independent passes. That means we're going to miss some optimizations in IR that would've fallen out from intermediate transforms, but I think we have to live with that. As discussed in the earlier comments, this patch is right at the edge of the mask restriction.