This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Extending lowering of 'trunc <(8|16) x i64> %x to <(8|16) x i8>' to use tbl instructions
ClosedPublic

Authored by nilanjana_basu on Oct 4 2022, 5:12 PM.

Details

Summary

[AArch64] Patch for lowering trunc instructions to 'tbl' for (8|16)xi32 -> (8|16)xi8 conversions in D133495 is extended to support trunc to tbl lowering for (8|16) x i64 to (8|16) x i8.

A microbenchmark for runtime has been added for all these cases in D136274

Diff Detail

Event Timeline

nilanjana_basu created this revision.Oct 4 2022, 5:12 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 4 2022, 5:12 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
nilanjana_basu requested review of this revision.Oct 4 2022, 5:12 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 4 2022, 5:12 PM
nilanjana_basu set the repository for this revision to rG LLVM Github Monorepo.Oct 4 2022, 5:47 PM
nilanjana_basu retitled this revision from Extending trunc lowering of 'trunc <8 x i64> %x to <8 x i8>' to use tbl.4 instruction to [AArch64] Extending trunc lowering of 'trunc <8 x i64> %x to <8 x i8>' to use tbl.4 instruction.Oct 4 2022, 6:01 PM

Ran git-clang-format & made minor change to reduce LoC.

Removed an unused variable warning

nilanjana_basu retitled this revision from [AArch64] Extending trunc lowering of 'trunc <8 x i64> %x to <8 x i8>' to use tbl.4 instruction to [AArch64] Extending lowering of 'trunc <(8|16) x (i16|i64)> %x to <(8|16) x i8>' to use tbl instructions.
nilanjana_basu edited the summary of this revision. (Show Details)

Extended the trunc lowering for other types like 16xi64, 16xi16, 8xi16

The automated build tests failed for the previous patch because it was based on a previous commit for a unit test that isn't submitted yet. This patch fixes it by squashing the previous commit, removing the dependency & showing the final update.

Ran clang-format since it was failing in the build report at https://buildkite.com/llvm-project/diff-checks/builds/133184

nilanjana_basu edited the summary of this revision. (Show Details)Oct 28 2022, 4:57 PM
t.p.northover added inline comments.Nov 2 2022, 8:08 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13405–13406

I think these are guaranteed to succeed by checks in the caller (and essential here), so cast<...> is probably better. Applies to some of the later dyn_casts too.

13453–13463

There's a lot of duplication in this switch, but it is pretty easy to eyeball for correctness because of that once you get what it's trying to do. So I'm torn, a loop like this would probably be shorter overall:

int ShuffleCount = 128/SrcElemSize;
SmallVector<int> ShuffleLanes;
for (int i = 0; i < ShuffleCount; ++i)
  ShuffleLanes.push_back(i);

SmallVector<Value *> Results;
while (ShuffleLanes.back() < NumElements) {
  Parts.push_back(Builder.CreateBitCast(Builder.CreateShuffleVector(TI->getOperand(0), ShuffleLanes), VecTy));
  for (int i = 0; i < ShuffleCount; ++i)
    ShuffleLanes[i] += ShuffleCount;
  if (Parts.size() == 4) {
    // Call tbl4, push result into Results, clear Parts.
  }
}

// Choose correct tbl (3 now a valid option) and call for rest of Parts, push to Results

// Shuffle-merge all of Results.

and allow the code to apply to a wider range of truncates. What are your views on the implementation?

nilanjana_basu marked an inline comment as done.

Addressed comments by t.p.northover - refactored code to remove redundancy

nilanjana_basu marked an inline comment as done.Nov 3 2022, 7:39 PM
nilanjana_basu added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13453–13463

I refactored the code as you suggested, which can now apply to a few extra cases like 12xi32 or 4xi32. However, I haven't modified the old set of allowable cases since I don't know how relevant these few are.

In my understanding, we get better performance when tbl2-tbl4 get triggered, as the number of generated instructions decrease. So, I need your opinion on whether we should allow 8xi16 conversions, since they generate a single tbl1 instruction?

nilanjana_basu marked an inline comment as done.

Ran clang-format

Added comments

fhahn added inline comments.Nov 22 2022, 3:57 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13405

Is this guaranteed to be a fixed vector type? Could you add a variant of a test with truncates of scalable vectors (<vscale x 16 x i8> or something like that?

13451–13452

It would be great if you could add a brief comment here explaining what kind of masks/shuffles are prepared here.

13475

store here seems ambiguous here, as we won't emit a store instruction, right?

13495

SmallVector?

13500

SmallVector?

llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
676 ↗(On Diff #477022)

Similar to D136722, it is likely not profitable to do this when converting to/from the next power-of-2.

fhahn added inline comments.Nov 22 2022, 10:44 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13405

I think it should be fine, I added a test in 4783345426da

nilanjana_basu marked 3 inline comments as done.

Updated comments as mentioned in the reviews. Rebased on tests for this change prior to applying this patch.

Rebasing on commit of test cases prior to application of this patch

Removed case for 'trunc <(8|16)xi16> %x to <(8|16)xi8>' since it was adding more instructions to loop header, while not improving loop instruction count

nilanjana_basu marked an inline comment as done.

Updated a comment

nilanjana_basu retitled this revision from [AArch64] Extending lowering of 'trunc <(8|16) x (i16|i64)> %x to <(8|16) x i8>' to use tbl instructions to [AArch64] Extending lowering of 'trunc <(8|16) x i64> %x to <(8|16) x i8>' to use tbl instructions.Nov 23 2022, 2:38 AM
nilanjana_basu edited the summary of this revision. (Show Details)
nilanjana_basu marked 3 inline comments as done.Nov 23 2022, 2:45 AM

Removed (8|16)xi16 to (8|16)xi8 conversion because it wasn't showing benefits in instruction count, & additionally adding more instructions to the header. Updated comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13405

Since the source & destination types were checked for FixedVector once in the calling function optimizeExtendOrTruncateConversion(), I didn't check it here again.

13405

Since both the zext & trunc test for scalable vector goes through the optimizeExtendOrTruncateConversion() function, the zext test in 4783345426da should suffice for the trunc too. Let me know if you think it needs to be replicated for trunc vector too.

13475

I replaced the "store" with "save" to indicate that it is being stored in the compiler's internal vector data structure. Added a comment at the place of combining these results.

llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
676 ↗(On Diff #477022)

Since we only support a handful of vector type truncates in this implementation, only Yxi16->Yxi8 was a supported next power-of-2 conversion. Removed it.

nilanjana_basu marked an inline comment as done.

Rebasing on parent patch for tests

nilanjana_basu edited the summary of this revision. (Show Details)Nov 25 2022, 3:54 PM

Trying to fix rebasing error

fhahn accepted this revision.Dec 14 2022, 1:47 PM

LGTM with the inline suggestions. Please wait a day or so with committing in case there are additional comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13414

could you add an assert to make sure the division happens without remainder?

13436

Could use Builder.getInt8(....)?

13443

IIUC the only case that can happen here is that Parts == 4, right? Might be good to update the check.

This revision is now accepted and ready to land.Dec 14 2022, 1:47 PM
nilanjana_basu marked 3 inline comments as done.Dec 15 2022, 12:13 PM

Addressed the final comments in a separate commit.