This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Turn truncating buildvectors into truncates
ClosedPublic

Authored by dmgreen on Feb 10 2022, 12:10 PM.

Details

Summary

When lowering large v16f32->v16i8 fp_to_si_sat, the fp_to_si_sat node it split several times, creating an illegal v4i8 concat that gets expanded into a BUILD_VECTOR. After some combining and other legalisation, it ends up the a buildvector that extracts from 4 vectors, looking like BUILDVECTOR(a0,a1,a2,a3,b0,b1,b2,b3,c0,c1,c2,c3,d0,d1,d2,d3). That is really an v16i32->v16i8 truncate in disguise. This adds a ReconstructTruncateFromBuildVector method to detect the pattern, converting it back into the legal concat(trunc(concat(trunc(a), trunc(b))), trunc(concat(trunc(c), trunc(d)))) tree. The extracted nodes could also be v4i16, in which case the truncates are not needed. All those truncates and concats then become uzip1's, which it much better than expanding by moving vector lanes around.

Found when looking at D96522 / D118979.

Diff Detail

Event Timeline

dmgreen created this revision.Feb 10 2022, 12:10 PM
dmgreen requested review of this revision.Feb 10 2022, 12:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 10 2022, 12:10 PM
david-arm added inline comments.Feb 14 2022, 1:24 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9269

Hi @dmgreen, this feels a bit *too* specific to a particular optimisation. For example, there is a test in CodeGen/AArch64/neon-extracttruncate.ll called @extract_2_v4i32 that could also benefit from something like this. Also, some of the code in here looks very similar to what ReconstructShuffle attempts to do, which also looks at BUILD_VECTORS that are constructed from extractelement operations. Would it be better to simply extend ReconstructShuffle to cover this case?

dmgreen added inline comments.Feb 15 2022, 2:02 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9269

ReconstructShuffle is for reconstructing buildvectors into Legal Shuffles, and if you look at the code only handles 2 sources. This method is for reconstructing a truncates, not shuffles, and needs to handle 4 sources. As this isn't a legal shuffle, it doesn't really fit there.
The tests in neon-extracttruncate.ll I just added as a quick test. They are not expected to come up in practice. https://godbolt.org/z/eEser71Gq. It is the lowering of large fptosi.sat that we are targeting.

david-arm added inline comments.Feb 15 2022, 2:51 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9269

OK, but is it worth making this more generic then than it currently is? i.e. dealing with more cases than just this one very specific issue? @extract_2_v4i32 looks like it could benefit too.

dmgreen added inline comments.Feb 15 2022, 9:55 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9269

Hmm. Like I showed in the godbolt link above extract_2_v4i32 will be converted to shuffles prior to ISel. It will go though normal shuffle codepaths, not being converted to a BUILD_VECTOR anywhere during ISel. This method cant really help with that, and won't expect to catch any cases which have been optimized from llvm into a shuffle already.

Do you have cases where you think a more general version of this function would come up?

creating an illegal v4i8 concat that gets expanded into a BUILD_VECTOR

Maybe it would make sense to stop this from happening? concat_vectors where the operands will be widened seems like it would be a common pattern.

creating an illegal v4i8 concat that gets expanded into a BUILD_VECTOR

Maybe it would make sense to stop this from happening? concat_vectors where the operands will be widened seems like it would be a common pattern.

Yeah - That is something I looked into, but didn't get very far as it seems simple enough to fix after. That seems to be the same way we handle all other concats - we expand them with extract then reconstruct the shuffle. I think it would be a too large job to try and change how concat is lowered.

dmgreen edited the summary of this revision. (Show Details)Feb 17 2022, 1:09 AM
david-arm added inline comments.Mar 3 2022, 6:38 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9304

I think you're missing some test cases here because your code permits the possibility of mixed types, i.e.

%a0 = extractelement <4 x i16> %a, i32 0
...
%b0 = extractelement <4 x i32> %b, i32 0
...
%c0 = extractelement <4 x i16> %c, i32 0
...
%d0 = extractelement <4 x i32> %d, i32 0
...
%t0 = trunc i16 %a0 to i8
...
%t4 = trunc i32 %b0 to i8
...
%t8 = trunc i16 %c0 to i8
...
%t12 = trunc i32 %d0 to i8
...
%i0 = insertelement <16 x i8> undef, i8 %t0, i32 0
...
%i4 = insertelement <16 x i8> %i3, i8 %t4, i32 4
...
%i8 = insertelement <16 x i8> %i7, i8 %t8, i32 8
...
%i12 = insertelement <16 x i8> %i11, i8 %t12, i32 12
...

Can you add at least one test for mixed types?

Herald added a project: Restricted Project. · View Herald TranscriptMar 3 2022, 6:38 AM
dmgreen updated this revision to Diff 412956.Mar 4 2022, 1:24 AM

Sounds good. This now has that test and a couple more.

david-arm accepted this revision.Mar 4 2022, 2:51 AM

LGTM! Thanks for adding the tests @dmgreen . This patch seems like a nice improvement. :)

This revision is now accepted and ready to land.Mar 4 2022, 2:51 AM
This revision was landed with ongoing or failed builds.Mar 7 2022, 1:43 AM
This revision was automatically updated to reflect the committed changes.