This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Try to fold shuffle (tbl2, tbl2) to tbl4.
ClosedPublic

Authored by fhahn on Sep 8 2022, 7:04 AM.

Details

Summary

shuffle (tbl2, tbl2) can be folded into a single tbl4 if the shuffle selects
the first 8 elements of each tbl2 and the tbl2 masks can be merged.

Diff Detail

Event Timeline

fhahn created this revision.Sep 8 2022, 7:04 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 8 2022, 7:04 AM
fhahn requested review of this revision.Sep 8 2022, 7:04 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 8 2022, 7:04 AM

At first glance this seems like a hyper-specific optimization, I take it there's some reasonably common idiom that motivates us even bothering?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10834–10835

Why do we care about this? It looks like we've already checked that lanes being filled by this check are discarded by the shuffle.

10839

Won't this overflow if it's a tbl2 produding an <8 x i8>?

10848

Maybe default fill with SDValue()? We just overwrite all of them immediately afterwards anyway so that'd signal early that the reader doesn't have to care about this line.

10851

Have we checked anywhere that the lower 8 operands are actually constant?

fhahn updated this revision to Diff 460683.Sep 16 2022, 2:31 AM
fhahn marked 4 inline comments as done.

Address latest comments, thanks!

At first glance this seems like a hyper-specific optimization, I take it there's some reasonably common idiom that motivates us even bothering?

It is quite specific but this kind of pattern can be produced by loop-vectorization, in combination with the recent changes using tbl instructions for extends/truncates.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10834–10835

Good point, I removed the value checks and re-purposed the helper to check if the first 8 elements are constants.

10839

Yes, added extra tests in 9f2c39418b85 and limit this to v16i8 for now, to be extended further in a follow-up.

10848

Done, thanks!

10851

That's checked in the latest version and I added extra test cases in 9f2c39418b85.

It is quite specific but this kind of pattern can be produced by loop-vectorization, in combination with the recent changes using tbl instructions for extends/truncates.

I think it's more restrictive than it needs to be, and we should drop the isExtractLowerHalfMask check entirely. Any constant shuffle of two constant tbl2 operations ought to be representable with a constant tbl4, and that check's only there because our index-mapping is too naive.

Instead I think we want something like this in the loop where we generate the new tbl indices (now running through all 16 lanes):

if (ShuffleMask[I] < 16)
  TblMaskParts[I] = Mask1->getOperand(ShuffleMask[I]);
else {
  auto *C = cast<ConstantSDNode>(Mask2->getOperand(ShuffleMask[I] - 16));
  TblMaskParts[I] = DAG.getConstant(C->getSExtValue() + 32, dl, MVT::i32);
}
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10838

This doesn't appear to capture anything, so maybe consider making it a static function instead? The difference from isExtractLowerHalfMask in the same patch seems a bit unmotivated.

Not insisting on one style or the other (or even consistency), just making sure it's an active decision.

t.p.northover added inline comments.Sep 20 2022, 1:46 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10841

Would also need to extend this check. Or perhaps better only check the indices that are actually used by the shuffle, so move this into the main loop later.

fhahn updated this revision to Diff 461877.Sep 21 2022, 6:28 AM

It is quite specific but this kind of pattern can be produced by loop-vectorization, in combination with the recent changes using tbl instructions for extends/truncates.

I think it's more restrictive than it needs to be, and we should drop the isExtractLowerHalfMask check entirely. Any constant shuffle of two constant tbl2 operations ought to be representable with a constant tbl4, and that check's only there because our index-mapping is too naive.

Instead I think we want something like this in the loop where we generate the new tbl indices (now running through all 16 lanes):

Thanks Tim! Generalized as suggested.

fhahn marked 2 inline comments as done.Sep 21 2022, 6:29 AM
fhahn added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10838

This is not needed any longer, because it is checked in the loop below now.

10841

Done in the loop below, thanks!

t.p.northover accepted this revision.Sep 21 2022, 9:16 AM

Thanks Florian. LGTM!

This revision is now accepted and ready to land.Sep 21 2022, 9:16 AM
This revision was landed with ongoing or failed builds.Sep 21 2022, 11:16 AM
This revision was automatically updated to reflect the committed changes.
fhahn marked 2 inline comments as done.