Download Raw Diff

Details

Reviewers

dmgreen
efriedma
t.p.northover

Commits

rG5474d49f1f5c: [AArch64] Remove copy instruction between uaddlv and urshr

Summary

gcc generates less number of instructions from below example than llvm.

#include <arm_neon.h>

uint8x8_t foo(uint8x8_t a) {
    return vdup_n_u8(vrshrd_n_u64(vaddlv_u8(a), 3));
}

gcc output
foo:
        uaddlv  h0, v0.8b
        urshr   d0, d0, 3
        dup     v0.8b, v0.b[0]
        ret

llvm output
foo:
        uaddlv  h0, v0.8b
        fmov    w8, s0
        fmov    d0, x8
        urshr   d0, d0, #3
        dup     v0.8b, v0.b[0]
        ret

There are copy instructions between gpr and fpr. We could remove them as below pattern.

def : Pat<(v1i64 (scalar_to_vector (i64 (zext (i32 (int_aarch64_neon_uaddlv (v8i8 V64:$Rn))))))),
          (INSERT_SUBREG (v1i64 (IMPLICIT_DEF)), (UADDLVv8i8v V64:$Rn), hsub)>;

With above pattern, llvm generates below output.

foo:                                    // @foo
        uaddlv  h0, v0.8b
        urshr   d0, d0, #3
        dup     v0.8b, v0.b[0]
        ret

The pattern could be too specific for this example. If you have other idea to generalize this case, please let me know.

Diff Detail

Event Timeline

jaykang10 created this revision.Aug 31 2023, 3:58 AM

Herald added a project: Restricted Project. · View Herald TranscriptAug 31 2023, 3:58 AM

Herald added subscribers: hiraditya, kristof.beyls. · View Herald Transcript

jaykang10 requested review of this revision.Aug 31 2023, 3:58 AM

Herald added a project: Restricted Project. · View Herald TranscriptAug 31 2023, 3:58 AM

Herald added a subscriber: llvm-commits. · View Herald Transcript

Harbormaster completed remote builds in B255970: Diff 554964.Aug 31 2023, 4:43 AM

Eli reminded us recently that there is a v16i8 variant of uaddlv that should be handled as well. Maybe the i16->i32 variants of uaddlv too, for matching the i32->i64 zext.

You may find that if this is able to be done in DAG (as in D159267), that the pattern is unneeded though. The transform might become simplified in the DAG before it gets to selection.

Updated pattern using UADDLV SDNode

Do we need to be concerned at all about big-endian here? (Actually, also for D159267.) This is basically bitcasting from <2 x i32> to <1 x i64>.

We might want to consider teaching DAGCombine to optimize this sequence to an ISD::BITCAST instead of pattern-matching it late. Might unblock other optimizations? Maybe there aren't really any other optimizations we can do on a uaddlv, though.

Thanks for comment.

Do we need to be concerned at all about big-endian here? (Actually, also for D159267.) This is basically bitcasting from <2 x i32> to <1 x i64>.

I am not sure... It would be fine because compiler adds the rev instructions where they are needed for big-endian... but it could be wrong...
If you are concerned about something for big-endian, please let me know.

We might want to consider teaching DAGCombine to optimize this sequence to an ISD::BITCAST instead of pattern-matching it late. Might unblock other optimizations? Maybe there aren't really any other optimizations we can do on a uaddlv, though.

Let me try to detect the sequence and generate BITCAST with DAGCombine.

Following @efriedma's comment, folded sdnode sequence into bitcast in DAGCombine.

Harbormaster completed remote builds in B256793: Diff 556143.Sep 7 2023, 8:13 AM

LGTM

In D159265#4640515, @jaykang10 wrote:

Thanks for comment.

Do we need to be concerned at all about big-endian here? (Actually, also for D159267.) This is basically bitcasting from <2 x i32> to <1 x i64>.

I am not sure... It would be fine because compiler adds the rev instructions where they are needed for big-endian... but it could be wrong...
If you are concerned about something for big-endian, please let me know.

BITCAST itself should be fine; I meant, if we use a substitute sequence, we still need a REV, but there isn't any code to generate it.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
23159 ↗	(On Diff #556143)	I think we should be able to generalize this to other operations that only produce a result in the low element, but I guess we can leave that for a followup.

This revision is now accepted and ready to land.Sep 7 2023, 10:29 AM

dmgreen requested changes to this revision.Sep 7 2023, 10:55 PM

dmgreen added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
23169 ↗	(On Diff #556143)	I'm pretty sure this needs to be an AArch64ISD::NVCAST, not a BITCAST. The BITCAST will swap the 0th and 1st lanes into the i64, we need to keep them inorder.

This revision now requires changes to proceed.Sep 7 2023, 10:55 PM

jaykang10 added inline comments.Sep 8 2023, 1:33 AM