shuffle (tbl2, tbl2) can be folded into a single tbl4 if the shuffle selects
the first 8 elements of each tbl2 and the tbl2 masks can be merged.
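To make the merge condition concrete: a tbl2 indexes into 32 table bytes (two 16-byte source registers), while a tbl4 indexes into 64 bytes (four registers), so indices taken from the second tbl2's mask need a +32 offset to address what become the third and fourth table registers. Below is a minimal standalone sketch of that index arithmetic, assuming the shuffle takes the first 8 lanes of each tbl2 result; the array names and mask values are invented for illustration and are not part of the patch.

```cpp
// Standalone illustration of merging two tbl2 index vectors into one tbl4
// index vector, assuming the shuffle selects the first 8 lanes of each tbl2.
// Names and values are hypothetical, not taken from the patch.
#include <array>
#include <cstdint>
#include <cstdio>

int main() {
  // Two hypothetical tbl2 masks (16 lanes, each index in the range 0-31).
  std::array<uint8_t, 16> Tbl2MaskA = {0, 2, 4, 6, 8, 10, 12, 14,
                                       16, 18, 20, 22, 24, 26, 28, 30};
  std::array<uint8_t, 16> Tbl2MaskB = {1, 3, 5, 7, 9, 11, 13, 15,
                                       17, 19, 21, 23, 25, 27, 29, 31};

  // The shuffle picks lanes 0-7 of each tbl2 result, so the merged tbl4 mask
  // is: the first 8 indices of mask A unchanged, then the first 8 indices of
  // mask B shifted by 32 (they now address table registers 3 and 4).
  std::array<uint8_t, 16> Tbl4Mask;
  for (int I = 0; I < 8; ++I) {
    Tbl4Mask[I] = Tbl2MaskA[I];
    Tbl4Mask[I + 8] = static_cast<uint8_t>(Tbl2MaskB[I] + 32);
  }

  for (uint8_t Idx : Tbl4Mask)
    printf("%u ", static_cast<unsigned>(Idx));
  printf("\n");
  return 0;
}
```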
Event Timeline
At first glance this seems like a hyper-specific optimization; I take it there's some reasonably common idiom that motivates us to even bother?
| llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | |
|---|---|
| 10702–10703 | Why do we care about this? It looks like we've already checked that the lanes being filled here are discarded by the shuffle. |
| 10707 | Won't this overflow if it's a tbl2 producing an <8 x i8>? |
| 10716 | Maybe default fill with SDValue()? We just overwrite all of them immediately afterwards anyway, so that'd signal early that the reader doesn't have to care about this line. |
| 10719 | Have we checked anywhere that the lower 8 operands are actually constant? |
Address latest comments, thanks!
It is quite specific but this kind of pattern can be produced by loop-vectorization, in combination with the recent changes using tbl instructions for extends/truncates.
| llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | |
|---|---|
| 10702–10703 | Good point, I removed the value checks and re-purposed the helper to check whether the first 8 elements are constants (see the sketch after this table). |
| 10707 | Yes, added extra tests in 9f2c39418b85 and limited this to v16i8 for now, to be extended further in a follow-up. |
| 10716 | Done, thanks! |
| 10719 | That's checked in the latest version and I added extra test cases in 9f2c39418b85. |
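For reference, the reworked check described above might look roughly like the following; the function name is hypothetical and the actual code in the patch may be organized differently:

```cpp
// Hypothetical sketch: return true if the first 8 elements of a tbl2 index
// vector (a BUILD_VECTOR node) are compile-time constants.
static bool areFirstEightMaskElementsConstant(SDValue Mask) {
  for (unsigned I = 0; I < 8; ++I)
    if (!isa<ConstantSDNode>(Mask.getOperand(I)))
      return false;
  return true;
}
```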
> It is quite specific but this kind of pattern can be produced by loop-vectorization, in combination with the recent changes using tbl instructions for extends/truncates.
I think it's more restrictive than it needs to be, and we should drop the isExtractLowerHalfMask check entirely. Any constant shuffle of two constant tbl2 operations ought to be representable with a constant tbl4, and that check's only there because our index-mapping is too naive.
Instead I think we want something like this in the loop where we generate the new tbl indices (now running through all 16 lanes):

```cpp
if (ShuffleMask[I] < 16)
  TblMaskParts[I] = Mask1->getOperand(ShuffleMask[I]);
else {
  // Indices coming from the second tbl2 refer to its two source registers,
  // which become the third and fourth table registers of the tbl4, so shift
  // them up by 32 bytes.
  auto *C = cast<ConstantSDNode>(Mask2->getOperand(ShuffleMask[I] - 16));
  TblMaskParts[I] = DAG.getConstant(C->getSExtValue() + 32, dl, MVT::i32);
}
```
| llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | |
|---|---|
| 10706 | This doesn't appear to capture anything, so maybe consider making it a static function instead? The difference from isExtractLowerHalfMask in the same patch seems a bit unmotivated. Not insisting on one style or the other (or even consistency), just making sure it's an active decision. |
| llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | |
|---|---|
| 10709 | Would also need to extend this check. Or perhaps better, only check the indices that are actually used by the shuffle, i.e. move this into the main loop later (rough sketch below). |
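One way to do that, sketched with the names from the snippet earlier in this review (not code from the patch), would be to test each mask element for constancy right where the shuffle consumes it:

```cpp
for (unsigned I = 0; I < 16; ++I) {
  // Pick the tbl2 mask element that the shuffle actually reads for lane I.
  SDValue MaskElt = ShuffleMask[I] < 16
                        ? Mask1->getOperand(ShuffleMask[I])
                        : Mask2->getOperand(ShuffleMask[I] - 16);
  // Give up on the combine if the used index is not a constant.
  if (!isa<ConstantSDNode>(MaskElt))
    return SDValue();
  // ... build the tbl4 index for lane I as in the earlier snippet ...
}
```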