This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Extending lowering of 'zext <Y x i8> %x to <Y x i8X>' to use tbl instructions
ClosedPublic

Authored by nilanjana_basu on Oct 25 2022, 5:01 PM.

Details

Summary

Adding support for ZExt lowering for destination types beyond the existing support for (8|16) x i32

[AArch64] Patch for lowering zext instructions to 'tbl' for (8|16)xi8 -> (8|16)xi32 conversions in D120571 is extended to support zext to 'tbl' lowering for Y x i8 to Y x i8X. Any arbitrary number of vector elements & any destination element type whose size is a multiple of 8, greater than 16 and less than 64, is allowed for this transformation.

Related microbenchmarks are in D136274 & D138059

Depends on D120571

Diff Detail

Event Timeline

nilanjana_basu created this revision.Oct 25 2022, 5:01 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 25 2022, 5:01 PM
nilanjana_basu edited the summary of this revision. (Show Details)Oct 28 2022, 5:06 PM
nilanjana_basu added a reviewer: paquette.

Added more test cases for zext lowering of different vector types

Allowed the zext to tbl lowering for all lengths of vectors

nilanjana_basu retitled this revision from [AArch64] Extending lowering of 'zext <(8|16) x i8> %x to <(8|16) x (i16|i64)>' to use tbl instructions to [AArch64] Extending lowering of 'zext <Y x i8> %x to <Y x (i16|i64)>' to use tbl instructions.Nov 7 2022, 1:47 AM
nilanjana_basu edited the summary of this revision. (Show Details)

Added Big-Endian checks for the test cases that I missed earlier

Ran clang-format

Allowed all element sizes in the destination element that is a multiple of 8

nilanjana_basu retitled this revision from [AArch64] Extending lowering of 'zext <Y x i8> %x to <Y x (i16|i64)>' to use tbl instructions to [AArch64] Extending lowering of 'zext <Y x i8> %x to <Y x i8X>' to use tbl instructions.Nov 7 2022, 10:46 PM
nilanjana_basu edited the summary of this revision. (Show Details)
nilanjana_basu published this revision for review.Nov 8 2022, 11:01 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 8 2022, 11:01 AM

I know it's not your problem, but the code in optimizeExtendOrTruncateConversion doesn't feel like it is in the best place, to be honest. CGP has always described itself as a hack, but we shouldn't be hacking things that much. There will be some obvious cases where the extend/trunc can be optimized but the tbl blocks it.
As far as I understand, the code is only in CGP because it is trying limit the transforms to loops. I'm wondering if it would be better to add some sort of flag into ISel so that combines could tell that the current block is a loop, and behave differently because of it.

llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
444 ↗(On Diff #473877)

I think this is worse, I'm afraid. We only want to use tbl if it would replace two instructions (it performs two truncate/zext steps). Otherwise we are just adding instructions to the loop header (and using more registers) for no gain.

Minor update to comment

nilanjana_basu edited the summary of this revision. (Show Details)Nov 21 2022, 1:40 PM
nilanjana_basu edited the summary of this revision. (Show Details)
nilanjana_basu added inline comments.Nov 21 2022, 1:53 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13951

This conversion shows a regression in performance for some cases where there are multiple similar zext instructions present back to back. The generated code with the previous implementation could be folded into a more optimized set of instructions, which is not possible with 'tbl' instructions. One example is 16xi8->16xi64, where I find an increase in the number of instructions after being lowered to tbl on using a loop interleave count of 4, i.e. with 4 back to back zext instructions.

Is it better to rule out this case in this 'if' block or should we not allow tbl-lowering when there are multiple zext instructions of the same type present back to back?

fhahn added a comment.Nov 22 2022, 3:11 AM

I know it's not your problem, but the code in optimizeExtendOrTruncateConversion doesn't feel like it is in the best place, to be honest. CGP has always described itself as a hack, but we shouldn't be hacking things that much. There will be some obvious cases where the extend/trunc can be optimized but the tbl blocks it.
As far as I understand, the code is only in CGP because it is trying limit the transforms to loops. I'm wondering if it would be better to add some sort of flag into ISel so that combines could tell that the current block is a loop, and behave differently because of it.

Yep the only reason for doing it in CGP is to work around SelDAG's limitation.I am not sure about extending SelDAG for this, as we are planning to transition to GIsel at least on Darwin platforms very soon, which won't require doing this in CGP. I think @nilanjana_basu will also look into implementing this in GIsel

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13951

This conversion shows a regression in performance for some cases where there are multiple similar zext instructions present back to back. The generated code with the previous implementation could be folded into a more optimized set of instructions, which is not possible with 'tbl' instructions. One example is 16xi8->16xi64, where I find an increase in the number of instructions after being lowered to tbl on using a loop interleave count of 4, i.e. with 4 back to back zext instructions.

is this covered by one of the unit tests? zext_v16i8_to_v16i64_in_loop looks fine to me, at least for little endian

llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
444 ↗(On Diff #473877)

yep it looks like we should have a check for that. @nilanjana_basu could you update the patch and make sure this is also tested in zext-to-tbl.ll

llvm/test/CodeGen/AArch64/zext-to-tbl.ll
1069

could you ad those new tests separately?

Removed cases where TBL lowering will not be beneficial

nilanjana_basu marked 2 inline comments as done.

Rebasing & merging on a recent commit

Rebased on latest updated zext unit tests

Trying to fix patching error because of rebasing

Trying to fix patching error again

Fixed rebasing error of duplicated tests

nilanjana_basu marked 2 inline comments as done.Nov 29 2022, 11:13 AM
nilanjana_basu added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13951

I was mistaken in checking the instruction count earlier, but I have still added a unit test in zext_v16i8_to_v16i64_in_sequence_in_loop, since I see a performance regression in my local setup.

llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
444 ↗(On Diff #473877)

In the latest patch, I've blocked zext lowering to tbl for destination vectors with i16 element type, since those were the ones that don't benefit from it. For other vector types, a single zext/truncate too improves on the instruction count.

llvm/test/CodeGen/AArch64/zext-to-tbl.ll
1069

I added two new tests for 2 back-to-back zext instructions - zext_v8i8_to_v8i64_with_add_in_sequence_in_loop & zext_v16i8_to_v16i64_in_sequence_in_loop. The pre-patch codegen has been updated in the parent revision D137993.

There seems to be a slight increase in instruction count for zext_v8i8_to_v8i64_with_add_in_sequence_in_loop.

nilanjana_basu marked 2 inline comments as done.Dec 1 2022, 3:53 AM

Blocked tbl-conversion for destination element size above 64 since only 2 or less destination vector elements can be chosen with each tbl instruction in these cases, making it less beneficial

Ran clang-format

nilanjana_basu edited the summary of this revision. (Show Details)Dec 2 2022, 11:25 AM

Removed tbl-conversion cases to destination vector element width above 64, due to observed performance regressions. Will move this to a later patch, once we find a fix.

fhahn added a comment.Dec 8 2022, 3:01 AM

Thanks for the latest update! This looks good in general to me, with just one more inline comment about an edge case

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13851

I *think* we could have something like zext i8 to i33 and then the division will drop the remainder. Could you add a test to see if that's the case and make sure we don't perform an incorrect transformation? It should be fine to just ignore cases where there would be remainder.

Re-based on newly added tests

Trying to fix patching error

nilanjana_basu marked an inline comment as done.Dec 8 2022, 10:40 AM
nilanjana_basu added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13851

I have also added testing for the Global-ISel path.

nilanjana_basu marked an inline comment as done.Dec 8 2022, 10:42 AM

Added an assert for an extra check

fhahn accepted this revision.Dec 8 2022, 12:32 PM

LGTM, thanks!

This revision is now accepted and ready to land.Dec 8 2022, 12:32 PM