This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Remove copy instruction between uaddlv and urshr
ClosedPublic

Authored by jaykang10 on Aug 31 2023, 3:58 AM.

Details

Summary

gcc generates less number of instructions from below example than llvm.

#include <arm_neon.h>

uint8x8_t foo(uint8x8_t a) {
    return vdup_n_u8(vrshrd_n_u64(vaddlv_u8(a), 3));
}

gcc output
foo:
        uaddlv  h0, v0.8b
        urshr   d0, d0, 3
        dup     v0.8b, v0.b[0]
        ret

llvm output
foo:
        uaddlv  h0, v0.8b
        fmov    w8, s0
        fmov    d0, x8
        urshr   d0, d0, #3
        dup     v0.8b, v0.b[0]
        ret

There are copy instructions between gpr and fpr. We could remove them as below pattern.

def : Pat<(v1i64 (scalar_to_vector (i64 (zext (i32 (int_aarch64_neon_uaddlv (v8i8 V64:$Rn))))))),
          (INSERT_SUBREG (v1i64 (IMPLICIT_DEF)), (UADDLVv8i8v V64:$Rn), hsub)>;

With above pattern, llvm generates below output.

foo:                                    // @foo
        uaddlv  h0, v0.8b
        urshr   d0, d0, #3
        dup     v0.8b, v0.b[0]
        ret

The pattern could be too specific for this example. If you have other idea to generalize this case, please let me know.

Diff Detail

Event Timeline

jaykang10 created this revision.Aug 31 2023, 3:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 31 2023, 3:58 AM
jaykang10 requested review of this revision.Aug 31 2023, 3:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 31 2023, 3:58 AM

Eli reminded us recently that there is a v16i8 variant of uaddlv that should be handled as well. Maybe the i16->i32 variants of uaddlv too, for matching the i32->i64 zext.

You may find that if this is able to be done in DAG (as in D159267), that the pattern is unneeded though. The transform might become simplified in the DAG before it gets to selection.

jaykang10 updated this revision to Diff 556017.Sep 6 2023, 6:02 AM

Updated pattern using UADDLV SDNode

Do we need to be concerned at all about big-endian here? (Actually, also for D159267.) This is basically bitcasting from <2 x i32> to <1 x i64>.

We might want to consider teaching DAGCombine to optimize this sequence to an ISD::BITCAST instead of pattern-matching it late. Might unblock other optimizations? Maybe there aren't really any other optimizations we can do on a uaddlv, though.

Thanks for comment.

Do we need to be concerned at all about big-endian here? (Actually, also for D159267.) This is basically bitcasting from <2 x i32> to <1 x i64>.

I am not sure... It would be fine because compiler adds the rev instructions where they are needed for big-endian... but it could be wrong...
If you are concerned about something for big-endian, please let me know.

We might want to consider teaching DAGCombine to optimize this sequence to an ISD::BITCAST instead of pattern-matching it late. Might unblock other optimizations? Maybe there aren't really any other optimizations we can do on a uaddlv, though.

Let me try to detect the sequence and generate BITCAST with DAGCombine.

jaykang10 updated this revision to Diff 556143.Sep 7 2023, 6:53 AM

Following @efriedma's comment, folded sdnode sequence into bitcast in DAGCombine.

efriedma accepted this revision.Sep 7 2023, 10:29 AM

LGTM

Thanks for comment.

Do we need to be concerned at all about big-endian here? (Actually, also for D159267.) This is basically bitcasting from <2 x i32> to <1 x i64>.

I am not sure... It would be fine because compiler adds the rev instructions where they are needed for big-endian... but it could be wrong...
If you are concerned about something for big-endian, please let me know.

BITCAST itself should be fine; I meant, if we use a substitute sequence, we still need a REV, but there isn't any code to generate it.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
23159 ↗(On Diff #556143)

I think we should be able to generalize this to other operations that only produce a result in the low element, but I guess we can leave that for a followup.

This revision is now accepted and ready to land.Sep 7 2023, 10:29 AM
dmgreen requested changes to this revision.Sep 7 2023, 10:55 PM
dmgreen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
23169 ↗(On Diff #556143)

I'm pretty sure this needs to be an AArch64ISD::NVCAST, not a BITCAST. The BITCAST will swap the 0th and 1st lanes into the i64, we need to keep them inorder.

This revision now requires changes to proceed.Sep 7 2023, 10:55 PM
jaykang10 added inline comments.Sep 8 2023, 1:33 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
23169 ↗(On Diff #556143)

Ah, I did not know we do not need rev instruction here.
Let me change BITCAST to NVCAST.
Thanks for letting me know.

jaykang10 updated this revision to Diff 556236.Sep 8 2023, 1:43 AM

Following @dmgreen's comment, changed BITCAST to NVCAST.

jaykang10 added inline comments.Sep 8 2023, 1:45 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
23159 ↗(On Diff #556143)

Yep, let's check it with other patch.

dmgreen accepted this revision.Sep 10 2023, 3:48 AM

Thanks. It would be good to add other ops too if we can. Otherwise LGTM.

This revision is now accepted and ready to land.Sep 10 2023, 3:48 AM
This revision was landed with ongoing or failed builds.Sep 11 2023, 1:07 AM
This revision was automatically updated to reflect the committed changes.