This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SelectionDAG] Eliminates redundant zero-extension for 32-bit popcount
ClosedPublic

Authored by Allen on Dec 24 2022, 12:32 AM.

Diff Detail

Event Timeline

Allen created this revision.Dec 24 2022, 12:32 AM
Allen requested review of this revision.Dec 24 2022, 12:32 AM
Herald added a project: Restricted Project.Dec 24 2022, 12:32 AM

I don't think this is right. The high bits do in fact have to be zero; otherwise they affect the result of the uaddlv.

If you add a pattern to optimize "insertelement <2 x i32> zeroinitializer, i32 %x, i32 0", you should be able to leverage that for ctpop lowering.
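To see why the high bits matter, here is a minimal Python sketch (not LLVM code, written for this discussion) modeling the lowering in question: CNTv8i8 counts set bits in each byte of a 64-bit register, and UADDLVv8i8 sums those eight byte counts. If the upper 32 bits hold garbage instead of zeros, their byte counts pollute the sum:

```python
def cnt_uaddlv(x64):
    """Model CNTv8i8 followed by UADDLVv8i8 on a 64-bit input:
    per-byte popcounts, summed across all eight bytes."""
    return sum(bin((x64 >> (8 * i)) & 0xFF).count("1") for i in range(8))

# Properly zero-extended 32-bit value: result matches the 32-bit popcount.
x = 0x00000000F0F0F0F0
assert cnt_uaddlv(x) == bin(x & 0xFFFFFFFF).count("1")  # both are 16

# Same low 32 bits, garbage high bits: uaddlv also counts the garbage.
garbage = 0xDEADBEEFF0F0F0F0
assert cnt_uaddlv(garbage) != bin(garbage & 0xFFFFFFFF).count("1")
```

So any lowering that feeds only a 32-bit value into this sequence must guarantee the upper lanes are zero.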

Allen updated this revision to Diff 485377.Dec 27 2022, 7:47 AM
Allen edited the summary of this revision. (Show Details)

Update per the review comment

efriedma added inline comments.Jan 4 2023, 7:52 PM
llvm/test/CodeGen/AArch64/arm64-popcnt.ll
43

This doesn't appear to be equivalent.

Allen added inline comments.Jan 5 2023, 7:12 PM
llvm/test/CodeGen/AArch64/arm64-popcnt.ll
43

Thanks. I find that the %4:fpr32 = COPY %1.ssub:fpr128 is eliminated in the SIMPLE REGISTER COALESCING pass with this change, but I'm not sure whether that elimination is correct?

# *** IR Dump After Live Interval Analysis (liveintervals) ***:
# Machine code for function cnt32_advsimd_1: NoPHIs, TracksLiveness
Function Live Ins: $d0 in %0

0B	bb.0 (%ir-block.0):
	  liveins: $d0
16B	  %0:fpr64 = COPY $d0
32B	  undef %1.dsub:fpr128 = COPY %0:fpr64
48B	  %4:fpr32 = COPY %1.ssub:fpr128
64B	  %5:fpr64 = SUBREG_TO_REG 0, %4:fpr32, %subreg.ssub
80B	  %6:fpr64 = CNTv8i8 %5:fpr64
96B	  %7:fpr16 = UADDLVv8i8v %6:fpr64
112B	  undef %8.hsub:fpr128 = COPY %7:fpr16
128B	  %10:gpr32all = COPY %8.ssub:fpr128
144B	  $w0 = COPY %10:gpr32all
160B	  RET_ReallyLR implicit killed $w0
# *** IR Dump After Simple Register Coalescing ***:
Function Live Ins: $d0 in %0

0B	bb.0 (%ir-block.0):
	  liveins: $d0
16B	  undef %1.dsub:fpr128 = COPY $d0
80B	  %6:fpr64 = CNTv8i8 %1.dsub:fpr128
96B	  undef %8.hsub:fpr128 = UADDLVv8i8v %6:fpr64
128B	  %10:gpr32all = COPY %8.ssub:fpr128
144B	  $w0 = COPY %10:gpr32all
160B	  RET_ReallyLR implicit killed $w0
efriedma added inline comments.Jan 5 2023, 8:01 PM
llvm/test/CodeGen/AArch64/arm64-popcnt.ll
43

See D127154 for a similar situation.

Could it just use a FMOVWSr?

def : Pat<(v8i8 (bitconvert (i64 (zext GPR32:$Rn)))),
          (SUBREG_TO_REG (i32 0), (f32 (FMOVWSr GPR32:$Rn)), ssub)>;

That way we know the top bits will be zero from the FMOVWSr, and so the SUBREG_TO_REG will correctly assert the top bits are zero.
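The argument can be sketched in Python (a model for this discussion, not LLVM code): `fmov sN, wM` architecturally zeroes bits [127:32] of the destination vector register, so the CNT/UADDLV sequence downstream sees a genuinely zero-extended value:

```python
def fmov_w_to_s(w):
    """Model FMOVWSr (fmov sN, wM): the 32-bit value lands in the low
    lane and the remaining bits of the vector register are zeroed."""
    return w & 0xFFFFFFFF  # high bits architecturally zero

def cnt_uaddlv(x64):
    """Model CNTv8i8 + UADDLVv8i8: per-byte popcounts summed over 8 bytes."""
    return sum(bin((x64 >> (8 * i)) & 0xFF).count("1") for i in range(8))

w = 0xDEADBEEF  # arbitrary 32-bit input
# With the fmov in place, the lowering computes the correct 32-bit popcount,
# so SUBREG_TO_REG's zero-top-bits assertion holds by construction.
assert cnt_uaddlv(fmov_w_to_s(w)) == bin(w).count("1")
```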

Allen updated this revision to Diff 487020.Jan 6 2023, 5:35 PM

Replace COPY_TO_REGCLASS with FMOVWSr to avoid the elimination in the SIMPLE REGISTER COALESCING pass

Allen added a comment.Jan 6 2023, 5:36 PM

Could it just use a FMOVWSr?

def : Pat<(v8i8 (bitconvert (i64 (zext GPR32:$Rn)))),
          (SUBREG_TO_REG (i32 0), (f32 (FMOVWSr GPR32:$Rn)), ssub)>;

That way we know the top bits will be zero from the FMOVWSr, and so the SUBREG_TO_REG will correctly assert the top bits are zero.

Thanks, applied your comment.

dmgreen accepted this revision.Jan 8 2023, 11:48 PM

LGTM. Thanks

This revision is now accepted and ready to land.Jan 8 2023, 11:48 PM
This revision was landed with ongoing or failed builds.Jan 9 2023, 12:08 AM
This revision was automatically updated to reflect the committed changes.