This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Remove redundant `mov 0` instruction for high 64-bits
ClosedPublic

Authored by jaykang10 on Mar 30 2023, 7:25 AM.

Details

Summary

gcc generates less instructions than llvm from below intrinsic example.

#include <arm_neon.h>

float16x8_t test1(const float32x4_t a) {
    float16x4_t b = vcvt_f16_f32(a);
    return vcombine_f16(b, vdup_n_f16(0.0));
}

uint8x8_t test2(uint16_t *in, uint8x8_t *dst, uint8x8_t idx) {
    return vtbl1_u8(vshrn_n_u16(vld1q_u16(in), 4), idx); 
}

gcc output
test1:
        fcvtn   v0.4h, v0.4s 
        fmov    d0, d0
        ret

test2:
        ldr     q1, [x0]
        shrn    v1.8b, v1.8h, 4
        tbl     v0.8b, {v1.16b}, v0.8b 
        ret

llvm output
test1:                                  // @test1
        movi    d1, #0000000000000000
        fcvtn   v0.4h, v0.4s
        mov     v0.d[1], v1.d[0]
        ret

test2:                                  // @test2
        ldr     q1, [x0]
        movi    v2.2d, #0000000000000000
        shrn    v1.8b, v1.8h, #4
        mov     v1.d[1], v2.d[0]
        tbl     v0.8b, { v1.16b }, v0.8b
        ret

The fcvtn and shrn instructions set zero for high 64-bits implicitly so we do not need mov 0 instruction for high 64-bits. It looks gcc has patterns for the cases. For example,

the gcc rtl pattern for test2 function's shrn
(define_insn "aarch64_shrn<mode>_insn_le"
  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
        (vec_concat:<VNARROWQ2>
          (truncate:<VNARROWQ>
            (lshiftrt:VQN (match_operand:VQN 1 "register_operand" "w")
              (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_<vn_mode>")))
          (match_operand:<VNARROWQ> 3 "aarch64_simd_or_scalar_imm_zero")))]
  "TARGET_SIMD && !BYTES_BIG_ENDIAN"
  "shrn\\t%0.<Vntype>, %1.<Vtype>, %2"
  [(set_attr "type" "neon_shift_imm_narrow_q")]
)

llvm could also add tablegen patterns for them like gcc but it could be better to handle the patterns on MIR Peephole optimization pass because they have common sub patterns and the pass can consider multiple basic blocks.

With this patch, llvm generates below output.

llvm output
test1:                                  // @test1
        fcvtn   v0.4h, v0.4s
        ret

test2:                                  // @test2
        ldr     q1, [x0]
        shrn    v1.8b, v1.8h, #4
        tbl     v0.8b, { v1.16b }, v0.8b
        ret

Diff Detail

Event Timeline

jaykang10 created this revision.Mar 30 2023, 7:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 30 2023, 7:25 AM
jaykang10 requested review of this revision.Mar 30 2023, 7:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 30 2023, 7:25 AM
jaykang10 edited the summary of this revision. (Show Details)Mar 30 2023, 7:29 AM
dmgreen added inline comments.Mar 30 2023, 8:44 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
618–619

Can we extend this to all the instructions that are like FCVTNv4i16/SHRNv8i8_shift? For example maybe these, which I think produce 64bit results and are similar to the instructions you already have:

RSHRNv2i32_shift
RSHRNv4i16_shift
RSHRNv8i8_shift 
SHRNv2i32_shift
SHRNv4i16_shift
SHRNv8i8_shift 
FCVTNv2i32
FCVTNv4i16

We might be able to get away with "Any instruction that defs a FPR64", but that might need more careful checking and there are quite a few of them. We should probably try and get these classes of instruction though, not just the exact sizes.

620

If this return's true directly, then isSetZeroHigh64bits won't be needed and more.

llvm/test/CodeGen/AArch64/implicitly-set-zero-high-64-bits.ll
8

We can probably remove all the nofpclass(nan inf) stuff

20

dst doesn't seem to be used.

jaykang10 added inline comments.Mar 31 2023, 2:15 AM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
618–619

Can we extend this to all the instructions that are like FCVTNv4i16/SHRNv8i8_shift? For example maybe these, which I think produce 64bit results and are similar to the instructions you already have:

Yep, they write lower 64-bits and clear high 64-bits so I think we can add them. Let me add them.

We might be able to get away with "Any instruction that defs a FPR64", but that might need more careful checking and there are quite a few of them. We should probably try and get these classes of instruction though, not just the exact sizes.

Yep, I agree with you. It is worth to try. After committing this patch, let's check it.

620

Yep, let me remove it.

llvm/test/CodeGen/AArch64/implicitly-set-zero-high-64-bits.ll
8

Yep, let me remove it.

20

You are right!
Let me remove it.

jaykang10 updated this revision to Diff 509953.Mar 31 2023, 3:08 AM
dmgreen accepted this revision.Apr 3 2023, 1:13 AM

Thanks. LGTM

This revision is now accepted and ready to land.Apr 3 2023, 1:13 AM
This revision was landed with ongoing or failed builds.Apr 3 2023, 2:59 AM
This revision was automatically updated to reflect the committed changes.