This is an archive of the discontinued LLVM Phabricator instance.

[CGP,AArch64] Replace zexts with shuffle that can be lowered using tbl.
ClosedPublic

Authored by fhahn on Feb 25 2022, 9:30 AM.

Details

Summary

This patch extends CodeGenPrepare to lower zext v16i8 -> v16i32 in loops
using a wide shuffle creating a v64i8 vector, selecting groups of 3
zero elements and an element from the input.

This is profitable on AArch64 where such shuffles can be lowered to tbl
instructions, but only in loops, because it requires materializing 4
masks, which can be done in the loop preheader.

This is the only reason the transform is part of CGP. If there's a
better alternative I missed, please let me know. The same goes for the
shouldReplaceZExtWithShuffle hook which guards this. I am not sure if
this transform will be beneficial on other targets, but it seems like
there is no way other convenient way.

This improves the generated code for loops like the one below in
combination with D96522.

int foo(uint8_t *p, int N) {
  unsigned long long sum = 0;
  for (int i = 0; i < N ; i++, p++) {

unsigned int v = *p;
sum += (v < 127) ? v : 256 - v;

  }
  return sum;
}

https://clang.godbolt.org/z/Wco866MjY

Diff Detail

Event Timeline

fhahn created this revision.Feb 25 2022, 9:30 AM
fhahn requested review of this revision.Feb 25 2022, 9:30 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 25 2022, 9:30 AM
fhahn updated this revision to Diff 411477.Feb 25 2022, 11:29 AM

rebase on top of additional tests

fhahn updated this revision to Diff 411604.Feb 26 2022, 6:09 AM

Fixed extraction mask so the extracted elements from the vi16i8 are at the front of each 4 x i8 block, so the vector before bitcasting looks like orig[0], 0, 0, 0, orig[1], 0, 0, 0,....

Modeled in Alive2: https://alive2.llvm.org/ce/z/hnrmsP (although a reduced version)

SjoerdMeijer added inline comments.
llvm/test/CodeGen/AArch64/vselect-ext.ll
193

I haven't looked too carefully at this, so this is more of a drive by comment, but I think I see more code/instructions so was wondering if we shouldn't be doing this for code-size, and if this is always a win (e.g. smaller/in-order aarch64 implementations).

Herald added a project: Restricted Project. · View Herald TranscriptMar 3 2022, 1:18 AM
fhahn updated this revision to Diff 458736.Sep 8 2022, 7:34 AM

Rebase & ping.

This is a major update over the original version. The interface has been changed to a optimizeExtendOrTruncateConversion helper, which is able to apply the optimization directly. This was needed because I will also soon share follow-up patches that lower truncates using tbl4. For that we need to able to generate code using NEON intrinsics, which is why it is convenient to also implement the actual transform in the backend specific parts.

It also address @SjoerdMeijer's comment to skip this transform when optimizing for size.

t.p.northover accepted this revision.Sep 15 2022, 1:21 AM

Looks reasonable, apart from a glitch in a comment.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12387

Destination type seems wrong.

This revision is now accepted and ready to land.Sep 15 2022, 1:21 AM
fhahn updated this revision to Diff 460387.Sep 15 2022, 6:10 AM

Thanks Tim! Rebased and also adjusted the index as discussed offline for big-endian targets.

fhahn marked an inline comment as done.Sep 15 2022, 6:40 AM
fhahn added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12387

Thanks, should be fixed; updated to i32.

This revision was landed with ongoing or failed builds.Sep 15 2022, 11:18 AM
This revision was automatically updated to reflect the committed changes.
fhahn marked an inline comment as done.
aaronpuchert added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12352–12354

The static analyzer finds this suspicious: SrcTy is the result of a dyn_cast, but is unconditionally dereferenced. Perhaps it should be a cast?

Regression reported at https://github.com/llvm/llvm-project/issues/62620 . I think the issue is that this transform isn't aware of widening instructions; if the <16 x i32> output is used in a way that can be optimized to use 16-bit inputs, the six ushll/ushll2 instructions are actually lowered to just two ushll/ushll2 instructions, so transforming that into for tbl isn't profitable.

There are quite a few places where this transform is not profitable due to it blocking fold in selection-dag. They would be more obvious, but we don't have many tests with loops in the backend. I have been looking lately about whether it makes sense to replace it with something that happens either during or after ISel. I think after ISel should work as a (larger) peephole optimization, which should then work with both SDAG and GISel and prevent us needing to try and handle it in both places. We just need to recognize the patterns of USHLL's and that way we only optimize if it turns out to really be useful.

fhahn added a comment.May 12 2023, 2:10 PM

Regression reported at https://github.com/llvm/llvm-project/issues/62620 . I think the issue is that this transform isn't aware of widening instructions; if the <16 x i32> output is used in a way that can be optimized to use 16-bit inputs, the six ushll/ushll2 instructions are actually lowered to just two ushll/ushll2 instructions, so transforming that into for tbl isn't profitable.

There are quite a few places where this transform is not profitable due to it blocking fold in selection-dag. They would be more obvious, but we don't have many tests with loops in the backend. I have been looking lately about whether it makes sense to replace it with something that happens either during or after ISel. I think after ISel should work as a (larger) peephole optimization, which should then work with both SDAG and GISel and prevent us needing to try and handle it in both places. We just need to recognize the patterns of USHLL's and that way we only optimize if it turns out to really be useful.

Yeah I saw the report, thanks! The current logic only tries to introduce tbl if there are at least 2 casting steps required, but doesn't know about widening instructions, so misses cases where only one step will be needed. I think we should be able to catch (hopefully) most cases using the existing logic in TTI: D150482