I don't think this is right. The high bits do in fact have to be zero; otherwise they affect the result of the uaddlv.
If you add a pattern to optimize "insertelement <2 x i32> zeroinitializer, i32 %x, i32 0", you should be able to leverage that for ctpop lowering.
llvm/test/CodeGen/AArch64/arm64-popcnt.ll:43

> This doesn't appear to be equivalent.
llvm/test/CodeGen/AArch64/arm64-popcnt.ll:43

Thanks. I find that the `%4:fpr32 = COPY %1.ssub:fpr128` is eliminated by the SIMPLE REGISTER COALESCING pass with this change, but I'm not sure whether that elimination is correct?

Before coalescing:

```
# *** IR Dump After Live Interval Analysis (liveintervals) ***:
# Machine code for function cnt32_advsimd_1: NoPHIs, TracksLiveness
Function Live Ins: $d0 in %0

0B    bb.0 (%ir-block.0):
        liveins: $d0
16B     %0:fpr64 = COPY $d0
32B     undef %1.dsub:fpr128 = COPY %0:fpr64
48B     %4:fpr32 = COPY %1.ssub:fpr128
64B     %5:fpr64 = SUBREG_TO_REG 0, %4:fpr32, %subreg.ssub
80B     %6:fpr64 = CNTv8i8 %5:fpr64
96B     %7:fpr16 = UADDLVv8i8v %6:fpr64
112B    undef %8.hsub:fpr128 = COPY %7:fpr16
128B    %10:gpr32all = COPY %8.ssub:fpr128
144B    $w0 = COPY %10:gpr32all
160B    RET_ReallyLR implicit killed $w0
```

After coalescing:

```
Function Live Ins: $d0 in %0

0B    bb.0 (%ir-block.0):
        liveins: $d0
16B     undef %1.dsub:fpr128 = COPY $d0
80B     %6:fpr64 = CNTv8i8 %1.dsub:fpr128
96B     undef %8.hsub:fpr128 = UADDLVv8i8v %6:fpr64
128B    %10:gpr32all = COPY %8.ssub:fpr128
144B    $w0 = COPY %10:gpr32all
160B    RET_ReallyLR implicit killed $w0
```
Could it just use a FMOVWSr?
```
def : Pat<(v8i8 (bitconvert (i64 (zext GPR32:$Rn)))),
          (SUBREG_TO_REG (i32 0), (f32 (FMOVWSr GPR32:$Rn)), ssub)>;
```
That way we know the top bits will be zero from the FMOVWSr, and so the SUBREG_TO_REG will correctly assert the top bits are zero.
Updated the COPY_TO_REGCLASS to an FMOVWSr to avoid the elimination in the SIMPLE REGISTER COALESCING pass.