I noticed some spurious vector shuffle chains in my code that in theory could be folded
away completely, but InstCombine refused to do so, because it didn't like forming the intermediate
masks. I see that the last time a relaxation was attempted here (rL180802), it was reverted because
there was no consensus as to whether this was the right place to do that. CCing folks that were
involved in that discussion here. Note this isn't quite the same change, but it's in a similar vein.
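To make the shape of the problem concrete, here is a toy sketch (not my actual IR, which has less friendly masks): two shuffles whose masks compose to the identity, so the whole chain could in principle disappear. %v, %t, %r are placeholder names.

%t = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>   ; reverse
%r = shufflevector <4 x i32> %t, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>   ; reverse again, so %r == %v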
test/Transforms/InstCombine/vec_shuffle.ll, line 471: Regenerate with utils/update_test_checks.py
The revert message doesn't give us any details about what went wrong:
https://reviews.llvm.org/rL180834
cc'ing @efriedma @mssimpso @jnspaulsson
Would this help with PR30630?
https://bugs.llvm.org/show_bug.cgi?id=30630
lib/Transforms/InstCombine/InstCombineVectorOps.cpp, lines 1438–1439: Please update the comment for the new possibility.
test/Transforms/InstCombine/vec_shuffle.ll, lines 467–471: The test should not require a load; pass the input vector in as a parameter? The script was updated after the checks in this file were generated, so the path of least resistance will be to use the '--function=shuffle_after_wide_load' parameter to not induce a bunch of unrelated changes.
The corresponding discussion for the original patch was on llvm-commits: http://marc.info/?t=136735490000006&r=1&w=2
Ah, thanks - somehow didn't see that in the search results. So one of the primary objections was x86 transpose:
https://software.intel.com/en-us/articles/3d-vector-normalization-using-256-bit-intel-advanced-vector-extensions-intel-avx
4 years later, we're still living in fear of arbitrary shuffle masks, but x86 has added tons of logic to produce good shuffles for every possible SSE/AVX variant. It would be interesting to see if this patch has any effect on those cases now.
That discussion also suggested canonicalizing shuffles to selects, but we did the opposite:
https://reviews.llvm.org/D24279
...because we're confident enough now that the backend can deal with that specialized (non-lane-crossing blend) form of shuffle.
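(For reference, the equivalence in question, sketched with made-up operands %a and %b: a select on a constant mask

%r.sel = select <4 x i1> <i1 true, i1 false, i1 true, i1 false>, <4 x i32> %a, <4 x i32> %b

is the same blend as the non-lane-crossing shuffle

%r.shuf = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 5, i32 2, i32 7>

and D24279 picked the shuffle form as the canonical one.)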
x86 might have improved, but we still don't have a good story in general for handling arbitrary shuffles for large vector types with limited shuffle support.
I would be okay with poking a hole for shuffles which are equivalent to EXTRACT_SUBVECTOR, if that helps here.
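As I understand it, that means shuffles whose mask is a contiguous, in-order run of elements from a single source, e.g. this hypothetical extraction of the high half of an <8 x i32>:

%hi = shufflevector <8 x i32> %v, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>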
The shuffles in my test case are equivalent to vector concatenation, but I have a local test case which is more complicated. Do you have a specific example where the backend would generate better code if the middle end emits IR of the form:
%v1 = shufflevector %v, undef, mask1
%v2 = shufflevector %v, undef, mask2
%v3 = shufflevector %v1, %v2, mask3
than
%v3 = shufflevector %v, undef, mask4
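For concreteness, a minimal instance of the question with made-up masks (not my actual test case): swapping the halves of a <4 x i32>, either as extract-extract-concat:

%lo = shufflevector <4 x i32> %v, <4 x i32> undef, <2 x i32> <i32 0, i32 1>
%hi = shufflevector <4 x i32> %v, <4 x i32> undef, <2 x i32> <i32 2, i32 3>
%swapped = shufflevector <2 x i32> %hi, <2 x i32> %lo, <4 x i32> <i32 0, i32 1, i32 2, i32 3>

or as the single merged shuffle:

%swapped = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>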
I would have hoped that lowering a shuffle of a single vector would have improved sufficiently in the backend by now. You also mentioned this was a limitation for large vector types. Does it make sense to have a TTI hook to determine whether the target can efficiently lower such shuffles?
Try the following on x86 (with SSE2 only):
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i and the epi8 unpacks live here */

__m128i f(__m128i z) {
  return _mm_unpacklo_epi8(_mm_unpacklo_epi8(z, z), _mm_unpackhi_epi8(z, z));
}
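Roughly, in IR terms (a sketch with the unpacks written against an undef second operand for readability, not clang's verbatim output), the three-shuffle form is:

%lo = shufflevector <16 x i8> %z, <16 x i8> undef, <16 x i32> <i32 0, i32 0, i32 1, i32 1, i32 2, i32 2, i32 3, i32 3, i32 4, i32 4, i32 5, i32 5, i32 6, i32 6, i32 7, i32 7>
%hi = shufflevector <16 x i8> %z, <16 x i8> undef, <16 x i32> <i32 8, i32 8, i32 9, i32 9, i32 10, i32 10, i32 11, i32 11, i32 12, i32 12, i32 13, i32 13, i32 14, i32 14, i32 15, i32 15>
%r = shufflevector <16 x i8> %lo, <16 x i8> %hi, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23>

and merging everything yields a single shuffle with the composed mask:

%r = shufflevector <16 x i8> %z, <16 x i8> undef, <16 x i32> <i32 0, i32 8, i32 0, i32 8, i32 1, i32 9, i32 1, i32 9, i32 2, i32 10, i32 2, i32 10, i32 3, i32 11, i32 3, i32 11>

Each of the first three maps directly to a single SSE2 unpack; the composed mask does not.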
Granted, the x86 backend currently manages to screw this up even without help from instcombine, but the point is that shuffle lowering is extremely fragile if we don't have perfect shuffle tables or a general shuffle instruction.
I'm not exactly enthusiastic about a TTI hook here; I mean, it solves the problem in some sense, but I don't want to create a situation where our optimization pipeline becomes overly dependent on the presence of a general shuffle instruction.
Hmm, unless I did something wrong, the backend generates equivalent code whether or not we merge the shuffles. Or did you mean that we could do a better job with the three IR instruction form even though we currently don't?
If a TTI hook is not the right way to go here, maybe we can find a way to express what we want directly, e.g. via metadata or some other annotation on shufflevector that tells the optimizer not to touch the shuffle mask unless it's confident it can improve upon it. As it is, I see my CPU, which has a perfectly good shuffle instruction, shuffle the same vector three times because LLVM refuses to combine these shuffles - pretty embarrassing ;).
Or did you mean that we could do a better job with the three IR instruction form even though we currently don't?
I mean the obvious lowering for the three-shuffle form is better than what the backend can figure out on its own for the one-shuffle form. Compare to gcc, which doesn't try to do anything clever. Or compile the IR for another target, like ARM.
Trying to distinguish between shuffles which should be preserved and those which shouldn't using metadata is probably a dead end; we don't really trust the user to use the best possible shuffles.
Maybe we could record the original shuffle when we combine a shuffle, as a hint to the backend if it can't figure out a good way to lower the combined shuffle? That solves the problem, I guess, but it's a lot of work just to improve code generation for obscure shuffles.
Or maybe a target hook is the least-bad option.
I wasn't suggesting we blindly trust the user on shuffles, but that we somehow distinguish between masks generated by an earlier LLVM pass and masks written by the user. For the former, I feel it would make a lot of sense to be more aggressive about folding than we are now; for the latter, the user may have used target-specific knowledge, so we may want to be less aggressive. I'm also fine with a TTI hook, and happy to write the code. Somebody has to make a call here, and I don't think I'm the best person to do so - I just want all those redundant shuffles out of my code ;).
There are very few passes which create vector shuffles in LLVM, besides the vectorizers, and the vectorizers should be aware of the cost of the shuffles they generate; I don't think there's a useful distinction to be found here.
Making InstCombine use TargetTransformInfo::getShuffleCost() or something like that is probably the right choice. But please send an email to llvmdev so the maintainers of other targets see this.
I had problems with vector shuffles on SystemZ, which have partly come back recently. I first reported this at https://bugs.llvm.org//show_bug.cgi?id=30630. The advice then (last year) was that this should be handled by InstCombiner. This makes sense to me, as it is a way of simplifying the vectorizer algorithms. I wonder if this is still the general advice, or if the backend is supposed to handle this during DAGCombining, per the discussion above?
I put my patch for InstCombiner at https://reviews.llvm.org/D31927. Note that it is not in perfect shape (that's why I never shared it), but it does some good things currently for SystemZ at least. It would be nice if anyone would try it and see if it helps... If so, we could perhaps make something out of it.
Compared to last year, I see that the problem with shuffles is smaller today, although it also seems it may have gotten worse again recently. Does anybody know what the causes might be here?
Can you post examples/bug reports of the IR that you are seeing? I have a draft of a patch that appears to solve this case and the case in PR30630. It's an instsimplify patch, so it will avoid controversy about target costs and difficult intermediate shuffle masks...although it may be masking the root cause in the vectorizers if that's where these chains of useless shuffles are originating.
I checked again, and it seems that there isn't really that much problem right now after all. Perhaps it was something temporary, I am not sure.
With the two new patches applied (D31509, D31960), I did find one thing unhandled, although it was indeed rare. It looks like this:
%28 = shufflevector <4 x i32> %27, <4 x i32> undef, <12 x i32> <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
%interleaved.vec = shufflevector <12 x i32> %28, <12 x i32> undef, <12 x i32> <i32 0, i32 4, i32 8, i32 1, i32 5, i32 9, i32 2, i32 6, i32 10, i32 3, i32 7, i32 11>
A shuffle with undef is then shuffled again with undef...
Is this a case that could be handled? I found this in a vectorized loop with -O3, with extra machine instructions as a result...
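For reference, composing the two masks by hand, that pair should reduce to a single shuffle of %27 (assuming I've read the masks correctly):

%interleaved.vec = shufflevector <4 x i32> %27, <4 x i32> undef, <12 x i32> <i32 0, i32 0, i32 0, i32 1, i32 1, i32 1, i32 2, i32 2, i32 2, i32 3, i32 3, i32 3>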
I posted the test case I found for this on https://bugs.llvm.org//show_bug.cgi?id=30630
This specific example was discussed starting here:
https://bugs.llvm.org//show_bug.cgi?id=30630#c9
...and that bug is resolved.
Also, since this patch was posted, we've carved out a few more seemingly-safe target-independent shuffle transforms for concat/extract patterns, but nothing as general as proposed here.