This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add a TableGen pattern to generate uaddlv from uaddlp and addv
ClosedPublic

Authored by jaykang10 on Jun 14 2021, 8:59 AM.

Details

Summary

This patch defines a TableGen pattern for the transformation below.

uaddlp + addv --> uaddlv
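For illustration, a minimal IR example of the kind of sequence this pattern targets (it uses the same intrinsic declarations as the tests discussed below):

declare <4 x i16> @llvm.aarch64.neon.uaddlp.v4i16.v8i8(<8 x i8>)
declare i16 @llvm.vector.reduce.add.v4i16(<4 x i16>)

define i16 @uaddlp_then_addv(<8 x i8> %a) {
  ; uaddlp: pairwise widening add, v8i8 -> v4i16
  %widened = call <4 x i16> @llvm.aarch64.neon.uaddlp.v4i16.v8i8(<8 x i8> %a)
  ; addv (vecreduce.add): horizontal add of the widened vector
  %sum = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %widened)
  ret i16 %sum
}

With the new pattern, the uaddlp/addv pair should collapse into a single uaddlv across-lanes instruction.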

Diff Detail

Event Timeline

jaykang10 created this revision. Jun 14 2021, 8:59 AM
jaykang10 requested review of this revision. Jun 14 2021, 8:59 AM
Herald added a project: Restricted Project. Jun 14 2021, 8:59 AM

Can we add the other types too? It's good to add all the varieties if we can.

Maybe they can be tested with intrinsics? vecreduce of neon.uaddlp?

Can we add the other types too? It's good to add all the varieties if we can.

Maybe they can be tested with intrinsics? vecreduce of neon.uaddlp?

Yep, let me try to add them.

um... I have tried the patterns below.

def : Pat<(v4i16 (AArch64uaddv (v4i16 (AArch64uaddlp (v8i8 V64:$op))))),
          (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)), (UADDLVv4i16v V64:$op), hsub)>;
def : Pat<(v8i16 (AArch64uaddv (v8i16 (AArch64uaddlp (v16i8 V128:$op))))),
          (INSERT_SUBREG (v8i16 (IMPLICIT_DEF)), (UADDLVv16i8v V128:$op), hsub)>;

The tests are as follows.

declare <4 x i16>  @llvm.aarch64.neon.uaddlp.v4i16.v8i8(<8 x i8>) nounwind readnone
declare <8 x i16>  @llvm.aarch64.neon.uaddlp.v8i16.v16i8(<16 x i8>) nounwind readnone

declare i16 @llvm.vector.reduce.add.v4i16(<4 x i16>) nounwind readnone
declare i16 @llvm.vector.reduce.add.v8i16(<8 x i16>) nounwind readnone

define i16 @addv4h(<8 x i8>* %A) nounwind {
        %tmp1 = load <8 x i8>, <8 x i8>* %A
        %tmp3 = call <4 x i16> @llvm.aarch64.neon.uaddlp.v4i16.v8i8(<8 x i8> %tmp1)
        %tmp5 = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %tmp3)
        ret i16 %tmp5
}

define i16 @addv8h(<16 x i8>* %A) nounwind {
        %tmp1 = load <16 x i8>, <16 x i8>* %A
        %tmp3 = call <8 x i16> @llvm.aarch64.neon.uaddlp.v8i16.v16i8(<16 x i8> %tmp1)
        %tmp5 = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %tmp3)
        ret i16 %tmp5
}
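For reference, the codegen I would hope for from addv4h with this kind of pattern is roughly as below (a sketch, not actual compiler output):

addv4h:
        ldr     d0, [x0]
        uaddlv  h0, v0.8b       // single across-lanes widening add
        fmov    w0, s0
        ret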

Before instruction selection, the selection DAG of addv4h is as below.

Optimized legalized selection DAG: %bb.0 'addv4h:'
SelectionDAG has 14 nodes:
  t0: ch = EntryToken
              t2: i64,ch = CopyFromReg t0, Register:i64 %0
            t5: v8i8,ch = load<(load 8 from %ir.A)> t0, t2, undef:i64
          t20: v4i16 = AArch64ISD::UADDLP t5
        t14: v4i16 = AArch64ISD::UADDV t20
      t18: v8i16 = insert_subvector undef:v8i16, t14, Constant:i64<0>
    t19: i32 = extract_vector_elt t18, Constant:i64<0>
  t11: ch,glue = CopyToReg t0, Register:i32 $w0, t19
  t12: ch = AArch64ISD::RET_FLAG t11, Register:i32 $w0, t11:1

The insert_subvector causes selection to match the existing pattern below rather than the pattern I defined above.

def : Pat<(i32 (vector_extract (insert_subvector undef,
            (v4i16 (opNode V64:$Rn)), (i64 0)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v4i16v")) V64:$Rn),
            hsub), ssub)>;

In the end, addv4h is selected as below.

Selected selection DAG: %bb.0 'addv4h:'
SelectionDAG has 15 nodes:
  t0: ch = EntryToken
              t2: i64,ch = CopyFromReg t0, Register:i64 %0
            t5: v8i8,ch = LDRDui<Mem:(load 8 from %ir.A)> t2, TargetConstant:i64<0>, t0
          t20: v4i16 = UADDLPv8i8_v4i16 t5
        t22: bf16 = ADDVv4i16v t20
      t24: v4i16 = INSERT_SUBREG IMPLICIT_DEF:v4i16, t22, TargetConstant:i32<7>
    t19: i32 = EXTRACT_SUBREG t24, TargetConstant:i32<14>
  t11: ch,glue = CopyToReg t0, Register:i32 $w0, t19
  t12: ch = RET_ReallyLR Register:i32 $w0, t11, t11:1

The other addv types are in a similar situation, except v4i32. They are selected by one of the patterns below.

multiclass SIMDAcrossLanesIntrinsic<string baseOpc,
                                    SDPatternOperator opNode> {
...
// If none did, fallback to the explicit patterns, consuming the vector_extract.
def : Pat<(i32 (vector_extract (insert_subvector undef, (v8i8 (opNode V64:$Rn)),
            (i64 0)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v8i8 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v8i8v")) V64:$Rn),
            bsub), ssub)>;
def : Pat<(i32 (vector_extract (v16i8 (opNode V128:$Rn)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v16i8 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v16i8v")) V128:$Rn),
            bsub), ssub)>;
def : Pat<(i32 (vector_extract (insert_subvector undef,
            (v4i16 (opNode V64:$Rn)), (i64 0)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v4i16v")) V64:$Rn),
            hsub), ssub)>;
def : Pat<(i32 (vector_extract (v8i16 (opNode V128:$Rn)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v8i16 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v8i16v")) V128:$Rn),
            hsub), ssub)>;
def : Pat<(i32 (vector_extract (v4i32 (opNode V128:$Rn)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v4i32 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v4i32v")) V128:$Rn),
            ssub), ssub)>;

}

The v4i32 case is selected by the pattern I defined for addv because there is already a pattern for extractelt with i32, shown below.

// Extracting lane zero is a special case where we can just use a plain
// EXTRACT_SUBREG instruction, which will become FMOV. This is easier for the
// rest of the compiler, especially the register allocator and copy propagation,
// to reason about, so is preferred when it's possible to use it.
let AddedComplexity = 10 in {
  def : Pat<(i64 (extractelt (v2i64 V128:$V), (i64 0))), (EXTRACT_SUBREG V128:$V, dsub)>;
  def : Pat<(i32 (extractelt (v4i32 V128:$V), (i64 0))), (EXTRACT_SUBREG V128:$V, ssub)>;
  def : Pat<(i32 (extractelt (v2i32 V64:$V), (i64 0))), (EXTRACT_SUBREG V64:$V, ssub)>;
}

In order to support the other addv types, we might need to touch the existing patterns... What do you think about it, @dmgreen? If possible, I would prefer not to touch the existing patterns... If I missed something, please let me know.

It sounds like it needs something like SIMDAcrossLanesIntrinsic but for the combined uaddlp + addv patterns. That way it should be able to handle both the (AArch64uaddv (AArch64uaddlp (..))) form and the extract(insert(AArch64uaddv(AArch64uaddlp(..)))) form, hopefully.

It sounds like it needs something like SIMDAcrossLanesIntrinsic but for the combined uaddlp + addv patterns. That way it should be able to handle both the (AArch64uaddv (AArch64uaddlp (..))) form and the extract(insert(AArch64uaddv(AArch64uaddlp(..)))) form, hopefully.

Ok, let me try to add it. :)
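A rough sketch of what such a multiclass might look like (the multiclass name is a placeholder and the exact shape of the updated diff may differ; the patterns mirror the ones I tried above):

multiclass SIMDLongPairAcrossLanes<string baseOpc, SDPatternOperator addlp> {
  // uaddv(uaddlp(v8i8)) ==> uaddlv.8b; the result lives in an FPR16 (hsub).
  def : Pat<(v4i16 (AArch64uaddv (v4i16 (addlp (v8i8 V64:$op))))),
            (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)),
              (!cast<Instruction>(!strconcat(baseOpc, "v8i8v")) V64:$op), hsub)>;
  // uaddv(uaddlp(v16i8)) ==> uaddlv.16b
  def : Pat<(v8i16 (AArch64uaddv (v8i16 (addlp (v16i8 V128:$op))))),
            (INSERT_SUBREG (v8i16 (IMPLICIT_DEF)),
              (!cast<Instruction>(!strconcat(baseOpc, "v16i8v")) V128:$op), hsub)>;
  // ... plus the extract(insert(...)) forms and the wider types,
  // mirroring SIMDAcrossLanesIntrinsic above.
}

defm : SIMDLongPairAcrossLanes<"UADDLV", AArch64uaddlp>;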

jaykang10 updated this revision to Diff 352415. Jun 16 2021, 6:27 AM

Following comments from @dmgreen, updated patch.

Thanks. I would expect 2 more types I think. v2i32 and v2i64? It may be better to create a new multiclass like SIMDAcrossLanesIntrinsic.

It's a shame the types needed are a little different, else the same multiclass could be used for both the new and the old opcodes. That would add both the base pattern and the insert/extract pattern, although I'm not entirely sure that both are needed.

Thanks. I would expect 2 more types I think. v2i32 and v2i64? It may be better to create a new multiclass like SIMDAcrossLanesIntrinsic.

The TableGen definition of UADDLV is as below.

defm UADDLV  : SIMDAcrossLanesHSD<1, 0b00011, "uaddlv">;
...
multiclass SIMDAcrossLanesHSD<bit U, bits<5> opcode, string asm> {
  def v8i8v  : BaseSIMDAcrossLanes<0, U, 0b00, opcode, FPR16, V64,
                                   asm, ".8b", []>;
  def v16i8v : BaseSIMDAcrossLanes<1, U, 0b00, opcode, FPR16, V128, 
                                   asm, ".16b", []>;
  def v4i16v : BaseSIMDAcrossLanes<0, U, 0b01, opcode, FPR32, V64,
                                   asm, ".4h", []>;
  def v8i16v : BaseSIMDAcrossLanes<1, U, 0b01, opcode, FPR32, V128, 
                                   asm, ".8h", []>;
  def v4i32v : BaseSIMDAcrossLanes<1, U, 0b10, opcode, FPR64, V128, 
                                   asm, ".4s", []>;
}

Therefore, it supports below pairs of input and output.

v8i8   ==>  i16
v16i8  ==>  i16
v4i16  ==>  i32
v8i16  ==>  i32
v4i32  ==>  i64

From the TableGen definition of UADDLP, I think the AArch64uaddlp node supports the input and output types below.

v8i8 ==> v4i16
v16i8 ==> v8i16
v4i16 ==> v2i32
v8i16 ==> v4i32
v2i32 ==> v1i64
v4i32 ==> v2i64

The TableGen definition of UADDLV does not support v2i32 so I did not add the pattern for it.

Next, I think AArch64uaddv supports the input types below.

v8i8
v16i8
v4i16
v8i16
v4i32

We can see that v4i16, v8i16, and v4i32 among the output types of AArch64uaddlp are supported as input types of AArch64uaddv, so I have added patterns for these types.
I could have made a mistake. If anything above looks wrong, please let me know.

It's a shame the types needed are a little different, else the same multiclass could be used for both the new and the old opcodes. That would add both the base pattern and the insert/extract pattern, although I'm not entirely sure that both are needed.

um... I thought the multiclass was not suitable for our case because it needs different input/output types and a combination of multiple OpNodes.

AArch64uaddv is a slightly overloaded term. It comes from a vecreduce and can be lowered to a number of things. I think for anything that is v2iX, it will produce an addp, not an addv. So we want to turn addp(addlp(x)) -> addv(x), but that is still the same pattern extended to v4i32->v2i64->i64 and v4i16->v2i32->i32 variants.

AArch64uaddv is a slightly overloaded term. It comes from a vecreduce and can be lowered to a number of things. I think for anything that is v2iX, it will produce an addp, not an addv. So we want to turn addp(addlp(x)) -> addv(x), but that is still the same pattern extended to v4i32->v2i64->i64 and v4i16->v2i32->i32 variants.

Yep, at first I was a bit confused about the AArch64uaddv SDNode because it is not the same as the ISA's addv.

For addp(addlp(x)) -> addv(x), addv generates an i32 output from a v4i32 input and an i16 output from a v4i16 input. However, addp(addlp(x)) generates an i64 output from a v4i32 input and an i32 output from a v4i16 input. I think the outputs might not be the same for large inputs between addp(addlp(x)) and addv(x). If I missed something, please let me know.

For addp(addlp(x)) -> addv(x), addv generates an i32 output from a v4i32 input and an i16 output from a v4i16 input. However, addp(addlp(x)) generates an i64 output from a v4i32 input and an i32 output from a v4i16 input. I think the outputs might not be the same for large inputs between addp(addlp(x)) and addv(x). If I missed something, please let me know.

With types and full opcode names it would be (i64 addp(v2i64 uaddlp(v4i32 x))) -> (i64 uaddlv(v4i32 x)) or (i32 addp(v2i32 uaddlp(v4i16 x))) -> (i32 uaddlv(v4i16 x))

Do you have an example of where you don't think that would work the same?

For addp(addlp(x)) -> addv(x), addv generates an i32 output from a v4i32 input and an i16 output from a v4i16 input. However, addp(addlp(x)) generates an i64 output from a v4i32 input and an i32 output from a v4i16 input. I think the outputs might not be the same for large inputs between addp(addlp(x)) and addv(x). If I missed something, please let me know.

With types and full opcode names it would be (i64 addp(v2i64 uaddlp(v4i32 x))) -> (i64 uaddlv(v4i32 x)) or (i32 addp(v2i32 uaddlp(v4i16 x))) -> (i32 uaddlv(v4i16 x))

Do you have an example of where you don't think that would work the same?

Ah, sorry. If it is uaddlv rather than addv, it is ok. Let me try to add it.
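For reference, IR along these lines should exercise those two variants (a sketch; the declarations follow the same naming scheme as the tests above):

declare <2 x i32> @llvm.aarch64.neon.uaddlp.v2i32.v4i16(<4 x i16>)
declare <2 x i64> @llvm.aarch64.neon.uaddlp.v2i64.v4i32(<4 x i32>)
declare i32 @llvm.vector.reduce.add.v2i32(<2 x i32>)
declare i64 @llvm.vector.reduce.add.v2i64(<2 x i64>)

define i32 @uaddlv_from_v4i16(<4 x i16> %a) {
  ; v4i16 -> v2i32 pairwise widening add, then reduce to i32
  %p = call <2 x i32> @llvm.aarch64.neon.uaddlp.v2i32.v4i16(<4 x i16> %a)
  %r = call i32 @llvm.vector.reduce.add.v2i32(<2 x i32> %p)
  ret i32 %r
}

define i64 @uaddlv_from_v4i32(<4 x i32> %a) {
  ; v4i32 -> v2i64 pairwise widening add, then reduce to i64
  %p = call <2 x i64> @llvm.aarch64.neon.uaddlp.v2i64.v4i32(<4 x i32> %a)
  %r = call i64 @llvm.vector.reduce.add.v2i64(<2 x i64> %p)
  ret i64 %r
}

Each of these should become a single uaddlv (on v0.4h and v0.4s respectively) once the patterns are in place.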

jaykang10 updated this revision to Diff 352972. Jun 18 2021, 4:38 AM

Following comments from @dmgreen, updated patch.

dmgreen accepted this revision. Jun 18 2021, 8:45 AM

Thanks. LGTM if you fix the v2i32 pattern to be like the others.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
5769

v4i16 -> v2i32 here?

But I think the pattern may work without the vector_extract / insert_subvector, just like the v4i32 and v2i64 patterns do. Then it is overall a bit simpler.
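For example, something along these lines might work for the v2i32 case (just a sketch; the exact pattern in the diff may differ):

def : Pat<(v2i32 (AArch64uaddv (v2i32 (AArch64uaddlp (v4i16 V64:$op))))),
          (INSERT_SUBREG (v2i32 (IMPLICIT_DEF)), (UADDLVv4i16v V64:$op), ssub)>;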

llvm/test/CodeGen/AArch64/neon-uaddlv.ll
23

There are a lot of spaces here. Can you cut them down to 2, as would be more common :)

This revision is now accepted and ready to land. Jun 18 2021, 8:45 AM
jaykang10 added inline comments. Jun 18 2021, 9:17 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.td
5769

v4i16 -> v2i32 here?

Oops, sorry. Let me update it.

But I think the pattern may work without the vector_extract / insert_subvector, just like the v4i32 and v2i64 patterns do. Then it is overall a bit simpler.

You are right! Let me make it simpler.

llvm/test/CodeGen/AArch64/neon-uaddlv.ll
23

Yep, let me update it.

jaykang10 updated this revision to Diff 353025. Jun 18 2021, 9:18 AM

Following comments from @dmgreen, updated patch.

This revision was landed with ongoing or failed builds. Jun 18 2021, 9:43 AM
This revision was automatically updated to reflect the committed changes.

I'm not sure this transform is sound; if you consider a simple case such as uaddlv4h_from_v8i8 and perform the expansion by hand, you'll see wildly different results.
Given an input vector of u8{1,1,1,1,1,1,1,1}, one would expect uaddlp to turn it into u16{2,2,2,2} and uaddlv to fold that into u16{0x8}. On the other hand, replacing everything with a single uaddlv operating on v0.16b yields u16{0x404}, as it skips the pairwise sum step.

We've observed some serious miscompilations after upgrading to LLVM 13, as the offending pattern is produced when expanding ctpop, as shown in the following example:

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-unknown-linux-musl"

define i16 @count([4 x i16]* %0) {
Entry:
  %1 = bitcast [4 x i16]* %0 to <4 x i16>*
  %2 = load <4 x i16>, <4 x i16>* %1, align 2
  %3 = call <4 x i16> @llvm.ctpop.v4i16(<4 x i16> %2)
  %4 = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %3)
  ret i16 %4
}

declare <4 x i16> @llvm.ctpop.v4i16(<4 x i16>)
declare i16 @llvm.vector.reduce.add.v4i16(<4 x i16>)

cc some more ARM people: @eli.friedman @aemerson

It looks like there's a typo in the patch. The pattern for v8i8 produces UADDLVv4i16v; I think it's supposed to be UADDLVv8i8v. If that's fixed, for the testcase uaddlv4h_from_v8i8 in this patch, we should produce uaddlv h0, v0.8b. The current uaddlv s0, v0.4h is pretty clearly wrong.
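For the all-ones <8 x i8> example above, that explains the observed numbers:

uaddlv s0, v0.4h   // wrong: the 8 bytes are read as 4 halfwords of 0x0101; 4 * 0x0101 = 0x0404
uaddlv h0, v0.8b   // right: 8 bytes of 0x01 sum to 0x0008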

@LemonBoy I am sorry for the bug... It was my mistake... As @efriedma mentioned, it should be v8i8 rather than v4i16. Let me update it.

jaykang10 updated this revision to Diff 385019 (edited). Nov 5 2021, 4:20 AM

[AArch64] Fix a bug from a pattern for uaddv(uaddlp(x)) ==> uaddlv

A pattern selected the wrong uaddlv MI. It should be as below.
uaddv(uaddlp(v8i8)) ==> uaddlv(v8i8)
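For reference, the corrected pattern would look roughly like below (a sketch; see the updated diff for the exact form):

def : Pat<(v4i16 (AArch64uaddv (v4i16 (AArch64uaddlp (v8i8 V64:$op))))),
          (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)), (UADDLVv8i8v V64:$op), hsub)>;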

@LemonBoy If possible, can you check the last update of this patch solves the error from your side please?

[AArch64] Fix a bug from a pattern for uaddv(uaddlp(x)) ==> uaddlv

A pattern selected the wrong uaddlv MI. It should be as below.
uaddv(uaddlp(v8i8)) ==> uaddlv(v8i8)

It would be better as a new Phabricator review, but the update looks good to me.

[AArch64] Fix a bug from a pattern for uaddv(uaddlp(x)) ==> uaddlv

A pattern selected the wrong uaddlv MI. It should be as below.
uaddv(uaddlp(v8i8)) ==> uaddlv(v8i8)

It would be better as a new Phabricator review, but the update looks good to me.

Ah, yep, let me create a new review for the patch.