Details
Diff Detail
Event Timeline
The automated build tests failed for the previous patch because it was based on a previous commit for a unit test that isn't submitted yet. This patch fixes it by squashing the previous commit, removing the dependency & showing the final update.
Ran clang-format since it was failing in the build report at https://buildkite.com/llvm-project/diff-checks/builds/133184
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
13405–13406 | I think these are guaranteed to succeed by checks in the caller (and essential here), so cast<...> is probably better. Applies to some of the later dyn_casts too. | |
13453–13462 | There's a lot of duplication in this switch, but it is pretty easy to eyeball for correctness because of that once you get what it's trying to do. So I'm torn, a loop like this would probably be shorter overall: int ShuffleCount = 128/SrcElemSize; SmallVector<int> ShuffleLanes; for (int i = 0; i < ShuffleCount; ++i) ShuffleLanes.push_back(i); SmallVector<Value *> Results; while (ShuffleLanes.back() < NumElements) { Parts.push_back(Builder.CreateBitCast(Builder.CreateShuffleVector(TI->getOperand(0), ShuffleLanes), VecTy)); for (int i = 0; i < ShuffleCount; ++i) ShuffleLanes[i] += ShuffleCount; if (Parts.size() == 4) { // Call tbl4, push result into Results, clear Parts. } } // Choose correct tbl (3 now a valid option) and call for rest of Parts, push to Results // Shuffle-merge all of Results. and allow the code to apply to a wider range of truncates. What are your views on the implementation? |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
13453–13462 | I refactored the code as you suggested, which can now apply to a few extra cases like 12xi32 or 4xi32. However, I haven't modified the old set of allowable cases since I don't know how relevant these few are. In my understanding, we get better performance when tbl2-tbl4 get triggered, as the number of generated instructions decrease. So, I need your opinion on whether we should allow 8xi16 conversions, since they generate a single tbl1 instruction? |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
13405 | Is this guaranteed to be a fixed vector type? Could you add a variant of a test with truncates of scalable vectors (<vscale x 16 x i8> or something like that? | |
13451–13452 | It would be great if you could add a brief comment here explaining what kind of masks/shuffles are prepared here. | |
13475 | store here seems ambiguous here, as we won't emit a store instruction, right? | |
13494 | SmallVector? | |
13499 | SmallVector? | |
llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll | ||
676 ↗ | (On Diff #477022) | Similar to D136722, it is likely not profitable to do this when converting to/from the next power-of-2. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
13405 | I think it should be fine, I added a test in 4783345426da |
Updated comments as mentioned in the reviews. Rebased on tests for this change prior to applying this patch.
Removed case for 'trunc <(8|16)xi16> %x to <(8|16)xi8>' since it was adding more instructions to loop header, while not improving loop instruction count
Removed (8|16)xi16 to (8|16)xi8 conversion because it wasn't showing benefits in instruction count, & additionally adding more instructions to the header. Updated comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
13405 | Since the source & destination types were checked for FixedVector once in the calling function optimizeExtendOrTruncateConversion(), I didn't check it here again. | |
13405 | Since both the zext & trunc test for scalable vector goes through the optimizeExtendOrTruncateConversion() function, the zext test in 4783345426da should suffice for the trunc too. Let me know if you think it needs to be replicated for trunc vector too. | |
13475 | I replaced the "store" with "save" to indicate that it is being stored in the compiler's internal vector data structure. Added a comment at the place of combining these results. | |
llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll | ||
676 ↗ | (On Diff #477022) | Since we only support a handful of vector type truncates in this implementation, only Yxi16->Yxi8 was a supported next power-of-2 conversion. Removed it. |
LGTM with the inline suggestions. Please wait a day or so with committing in case there are additional comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
13414 | could you add an assert to make sure the division happens without remainder? | |
13436 | Could use Builder.getInt8(....)? | |
13443 | IIUC the only case that can happen here is that Parts == 4, right? Might be good to update the check. |
Is this guaranteed to be a fixed vector type? Could you add a variant of a test with truncates of scalable vectors (<vscale x 16 x i8> or something like that?