When lowering large v16f32->v16i8 fp_to_si_sat, the fp_to_si_sat node it split several times, creating an illegal v4i8 concat that gets expanded into a BUILD_VECTOR. After some combining and other legalisation, it ends up the a buildvector that extracts from 4 vectors, looking like BUILDVECTOR(a0,a1,a2,a3,b0,b1,b2,b3,c0,c1,c2,c3,d0,d1,d2,d3). That is really an v16i32->v16i8 truncate in disguise. This adds a ReconstructTruncateFromBuildVector method to detect the pattern, converting it back into the legal concat(trunc(concat(trunc(a), trunc(b))), trunc(concat(trunc(c), trunc(d)))) tree. The extracted nodes could also be v4i16, in which case the truncates are not needed. All those truncates and concats then become uzip1's, which it much better than expanding by moving vector lanes around.
Details
Diff Detail
Event Timeline
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
9257 | Hi @dmgreen, this feels a bit *too* specific to a particular optimisation. For example, there is a test in CodeGen/AArch64/neon-extracttruncate.ll called @extract_2_v4i32 that could also benefit from something like this. Also, some of the code in here looks very similar to what ReconstructShuffle attempts to do, which also looks at BUILD_VECTORS that are constructed from extractelement operations. Would it be better to simply extend ReconstructShuffle to cover this case? |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
9257 | ReconstructShuffle is for reconstructing buildvectors into Legal Shuffles, and if you look at the code only handles 2 sources. This method is for reconstructing a truncates, not shuffles, and needs to handle 4 sources. As this isn't a legal shuffle, it doesn't really fit there. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
9257 | OK, but is it worth making this more generic then than it currently is? i.e. dealing with more cases than just this one very specific issue? @extract_2_v4i32 looks like it could benefit too. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
9257 | Hmm. Like I showed in the godbolt link above extract_2_v4i32 will be converted to shuffles prior to ISel. It will go though normal shuffle codepaths, not being converted to a BUILD_VECTOR anywhere during ISel. This method cant really help with that, and won't expect to catch any cases which have been optimized from llvm into a shuffle already. Do you have cases where you think a more general version of this function would come up? |
creating an illegal v4i8 concat that gets expanded into a BUILD_VECTOR
Maybe it would make sense to stop this from happening? concat_vectors where the operands will be widened seems like it would be a common pattern.
Yeah - That is something I looked into, but didn't get very far as it seems simple enough to fix after. That seems to be the same way we handle all other concats - we expand them with extract then reconstruct the shuffle. I think it would be a too large job to try and change how concat is lowered.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
9292 | I think you're missing some test cases here because your code permits the possibility of mixed types, i.e. %a0 = extractelement <4 x i16> %a, i32 0 ... %b0 = extractelement <4 x i32> %b, i32 0 ... %c0 = extractelement <4 x i16> %c, i32 0 ... %d0 = extractelement <4 x i32> %d, i32 0 ... %t0 = trunc i16 %a0 to i8 ... %t4 = trunc i32 %b0 to i8 ... %t8 = trunc i16 %c0 to i8 ... %t12 = trunc i32 %d0 to i8 ... %i0 = insertelement <16 x i8> undef, i8 %t0, i32 0 ... %i4 = insertelement <16 x i8> %i3, i8 %t4, i32 4 ... %i8 = insertelement <16 x i8> %i7, i8 %t8, i32 8 ... %i12 = insertelement <16 x i8> %i11, i8 %t12, i32 12 ... Can you add at least one test for mixed types? |
LGTM! Thanks for adding the tests @dmgreen . This patch seems like a nice improvement. :)
Hi @dmgreen, this feels a bit *too* specific to a particular optimisation. For example, there is a test in CodeGen/AArch64/neon-extracttruncate.ll called @extract_2_v4i32 that could also benefit from something like this. Also, some of the code in here looks very similar to what ReconstructShuffle attempts to do, which also looks at BUILD_VECTORS that are constructed from extractelement operations. Would it be better to simply extend ReconstructShuffle to cover this case?