This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add a TableGen pattern to generate uaddlv from uaddlp and addv
ClosedPublic

Authored by jaykang10 on Jun 14 2021, 8:59 AM.

Details

Summary

This patch defines a TableGen pattern for the transformation below.

uaddlp + addv --> uaddlv
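For illustration, a minimal IR example of the kind of sequence this pattern targets (it uses the same intrinsic declarations as the tests discussed below):

declare <4 x i16> @llvm.aarch64.neon.uaddlp.v4i16.v8i8(<8 x i8>)
declare i16 @llvm.vector.reduce.add.v4i16(<4 x i16>)

define i16 @uaddlp_then_addv(<8 x i8> %a) {
  ; uaddlp: pairwise widening add, v8i8 -> v4i16
  %widened = call <4 x i16> @llvm.aarch64.neon.uaddlp.v4i16.v8i8(<8 x i8> %a)
  ; addv (vecreduce.add): horizontal add of the widened vector
  %sum = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %widened)
  ret i16 %sum
}

With the new pattern, the uaddlp/addv pair should collapse into a single uaddlv across-lanes instruction.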

Diff Detail

Event Timeline

jaykang10 created this revision. Jun 14 2021, 8:59 AM
jaykang10 requested review of this revision. Jun 14 2021, 8:59 AM
Herald added a project: Restricted Project. Jun 14 2021, 8:59 AM

Can we add the other types too? It's good to add all the varieties if we can.

Maybe they can be tested with intrinsics? vecreduce of neon.uaddlp?

Can we add the other types too? It's good to add all the varieties if we can.

Maybe they can be tested with intrinsics? vecreduce of neon.uaddlp?

Yep, let me try to add them.

um... I have tried the patterns below.

def : Pat<(v4i16 (AArch64uaddv (v4i16 (AArch64uaddlp (v8i8 V64:$op))))),
          (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)), (UADDLVv4i16v V64:$op), hsub)>;
def : Pat<(v8i16 (AArch64uaddv (v8i16 (AArch64uaddlp (v16i8 V128:$op))))),
          (INSERT_SUBREG (v8i16 (IMPLICIT_DEF)), (UADDLVv16i8v V128:$op), hsub)>;

The tests are as follows.

declare <4 x i16>  @llvm.aarch64.neon.uaddlp.v4i16.v8i8(<8 x i8>) nounwind readnone
declare <8 x i16>  @llvm.aarch64.neon.uaddlp.v8i16.v16i8(<16 x i8>) nounwind readnone

declare i16 @llvm.vector.reduce.add.v4i16(<4 x i16>) nounwind readnone
declare i16 @llvm.vector.reduce.add.v8i16(<8 x i16>) nounwind readnone

define i16 @addv4h(<8 x i8>* %A) nounwind {
        %tmp1 = load <8 x i8>, <8 x i8>* %A
        %tmp3 = call <4 x i16> @llvm.aarch64.neon.uaddlp.v4i16.v8i8(<8 x i8> %tmp1)
        %tmp5 = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %tmp3)
        ret i16 %tmp5
}

define i16 @addv8h(<16 x i8>* %A) nounwind {
        %tmp1 = load <16 x i8>, <16 x i8>* %A
        %tmp3 = call <8 x i16> @llvm.aarch64.neon.uaddlp.v8i16.v16i8(<16 x i8> %tmp1)
        %tmp5 = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %tmp3)
        ret i16 %tmp5
}
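For reference, the codegen I would hope for from addv4h with this kind of pattern is roughly as below (a sketch, not actual compiler output):

addv4h:
        ldr     d0, [x0]
        uaddlv  h0, v0.8b       // single across-lanes widening add
        fmov    w0, s0
        ret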

Before instruction selection, the selection DAG of addv4h is as below.

Optimized legalized selection DAG: %bb.0 'addv4h:'
SelectionDAG has 14 nodes:
  t0: ch = EntryToken
              t2: i64,ch = CopyFromReg t0, Register:i64 %0
            t5: v8i8,ch = load<(load 8 from %ir.A)> t0, t2, undef:i64
          t20: v4i16 = AArch64ISD::UADDLP t5
        t14: v4i16 = AArch64ISD::UADDV t20
      t18: v8i16 = insert_subvector undef:v8i16, t14, Constant:i64<0>
    t19: i32 = extract_vector_elt t18, Constant:i64<0>
  t11: ch,glue = CopyToReg t0, Register:i32 $w0, t19
  t12: ch = AArch64ISD::RET_FLAG t11, Register:i32 $w0, t11:1

The insert_subvector causes selection to match the existing pattern below rather than the pattern I defined above.

def : Pat<(i32 (vector_extract (insert_subvector undef,
            (v4i16 (opNode V64:$Rn)), (i64 0)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v4i16v")) V64:$Rn),
            hsub), ssub)>;

In the end, addv4h is selected as below.

Selected selection DAG: %bb.0 'addv4h:'
SelectionDAG has 15 nodes:
  t0: ch = EntryToken
              t2: i64,ch = CopyFromReg t0, Register:i64 %0
            t5: v8i8,ch = LDRDui<Mem:(load 8 from %ir.A)> t2, TargetConstant:i64<0>, t0
          t20: v4i16 = UADDLPv8i8_v4i16 t5
        t22: bf16 = ADDVv4i16v t20
      t24: v4i16 = INSERT_SUBREG IMPLICIT_DEF:v4i16, t22, TargetConstant:i32<7>
    t19: i32 = EXTRACT_SUBREG t24, TargetConstant:i32<14>
  t11: ch,glue = CopyToReg t0, Register:i32 $w0, t19
  t12: ch = RET_ReallyLR Register:i32 $w0, t11, t11:1

The other addv types are in a similar situation, except v4i32. They are selected by one of the patterns below.

multiclass SIMDAcrossLanesIntrinsic<string baseOpc,
                                    SDPatternOperator opNode> {
...
// If none did, fallback to the explicit patterns, consuming the vector_extract.
def : Pat<(i32 (vector_extract (insert_subvector undef, (v8i8 (opNode V64:$Rn)),
            (i64 0)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v8i8 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v8i8v")) V64:$Rn),
            bsub), ssub)>;
def : Pat<(i32 (vector_extract (v16i8 (opNode V128:$Rn)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v16i8 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v16i8v")) V128:$Rn),
            bsub), ssub)>;
def : Pat<(i32 (vector_extract (insert_subvector undef,
            (v4i16 (opNode V64:$Rn)), (i64 0)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v4i16v")) V64:$Rn),
            hsub), ssub)>;
def : Pat<(i32 (vector_extract (v8i16 (opNode V128:$Rn)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v8i16 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v8i16v")) V128:$Rn),
            hsub), ssub)>;
def : Pat<(i32 (vector_extract (v4i32 (opNode V128:$Rn)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v4i32 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v4i32v")) V128:$Rn),
            ssub), ssub)>;

}

The v4i32 case is selected by the pattern I defined for addv because there is already a pattern for extractelt with i32, shown below.

// Extracting lane zero is a special case where we can just use a plain
// EXTRACT_SUBREG instruction, which will become FMOV. This is easier for the
// rest of the compiler, especially the register allocator and copy propagation,
// to reason about, so is preferred when it's possible to use it.
let AddedComplexity = 10 in {
  def : Pat<(i64 (extractelt (v2i64 V128:$V), (i64 0))), (EXTRACT_SUBREG V128:$V, dsub)>;
  def : Pat<(i32 (extractelt (v4i32 V128:$V), (i64 0))), (EXTRACT_SUBREG V128:$V, ssub)>;
  def : Pat<(i32 (extractelt (v2i32 V64:$V), (i64 0))), (EXTRACT_SUBREG V64:$V, ssub)>;
}

In order to support the other addv types, we might need to touch the existing patterns... What do you think about it, @dmgreen? If possible, I would prefer not to touch the existing patterns... If I missed something, please let me know.

It sounds like it needs something like SIMDAcrossLanesIntrinsic but for the combined uaddlp + addv patterns. That way it should be able to handle both the (AArch64uaddv (AArch64uaddlp (..))) form and the extract(insert(AArch64uaddv(AArch64uaddlp(..)))) form, hopefully.

It sounds like it needs something like SIMDAcrossLanesIntrinsic but for the combined uaddlp + addv patterns. That way it should be able to handle both the (AArch64uaddv (AArch64uaddlp (..))) form and the extract(insert(AArch64uaddv(AArch64uaddlp(..)))) form, hopefully.

Ok, let me try to add it. :)
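A rough sketch of what such a multiclass might look like (the multiclass name is a placeholder and the exact shape of the updated diff may differ; the patterns mirror the ones I tried above):

multiclass SIMDLongPairAcrossLanes<string baseOpc, SDPatternOperator addlp> {
  // uaddv(uaddlp(v8i8)) ==> uaddlv.8b; the result lives in an FPR16 (hsub).
  def : Pat<(v4i16 (AArch64uaddv (v4i16 (addlp (v8i8 V64:$op))))),
            (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)),
              (!cast<Instruction>(!strconcat(baseOpc, "v8i8v")) V64:$op), hsub)>;
  // uaddv(uaddlp(v16i8)) ==> uaddlv.16b
  def : Pat<(v8i16 (AArch64uaddv (v8i16 (addlp (v16i8 V128:$op))))),
            (INSERT_SUBREG (v8i16 (IMPLICIT_DEF)),
              (!cast<Instruction>(!strconcat(baseOpc, "v16i8v")) V128:$op), hsub)>;
  // ... plus the extract(insert(...)) forms and the wider types,
  // mirroring SIMDAcrossLanesIntrinsic above.
}

defm : SIMDLongPairAcrossLanes<"UADDLV", AArch64uaddlp>;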

jaykang10 updated this revision to Diff 352415. Jun 16 2021, 6:27 AM

Following comments from @dmgreen, updated patch.

Thanks. I would expect 2 more types I think. v2i32 and v2i64? It may be better to create a new multiclass like SIMDAcrossLanesIntrinsic.

It's a shame the types needed are a little different, else the same multiclass could be used for both the new and the old opcodes. That would add both the base pattern and the insert/extract pattern, although I'm not entirely sure that both are needed.

Thanks. I would expect 2 more types I think. v2i32 and v2i64? It may be better to create a new multiclass like SIMDAcrossLanesIntrinsic.

The TableGen definition of UADDLV is as below.

defm UADDLV  : SIMDAcrossLanesHSD<1, 0b00011, "uaddlv">;
...
multiclass SIMDAcrossLanesHSD<bit U, bits<5> opcode, string asm> {
  def v8i8v  : BaseSIMDAcrossLanes<0, U, 0b00, opcode, FPR16, V64,
                                   asm, ".8b", []>;
  def v16i8v : BaseSIMDAcrossLanes<1, U, 0b00, opcode, FPR16, V128, 
                                   asm, ".16b", []>;
  def v4i16v : BaseSIMDAcrossLanes<0, U, 0b01, opcode, FPR32, V64,
                                   asm, ".4h", []>;
  def v8i16v : BaseSIMDAcrossLanes<1, U, 0b01, opcode, FPR32, V128, 
                                   asm, ".8h", []>;
  def v4i32v : BaseSIMDAcrossLanes<1, U, 0b10, opcode, FPR64, V128, 
                                   asm, ".4s", []>;
}

Therefore, it supports below pairs of input and output.

v8i8   ==>  i16
v16i8  ==>  i16
v4i16  ==>  i32
v8i16  ==>  i32
v4i32  ==>  i64

From the TableGen definition of UADDLP, I think the AArch64uaddlp node supports the input and output types below.

v8i8 ==> v4i16
v16i8 ==> v8i16
v4i16 ==> v2i32
v8i16 ==> v4i32
v2i32 ==> v1i64
v4i32 ==> v2i64

The TableGen definition of UADDLV does not support v2i32 so I did not add the pattern for it.

Next, I think AArch64uaddv supports the input types below.

v8i8
v16i8
v4i16
v8i16
v4i32

We can see that v4i16, v8i16, and v4i32 among the output types of AArch64uaddlp are supported as input types of AArch64uaddv, so I have added patterns for these types.
I could have made a mistake. If anything above looks wrong, please let me know.

It's a shame the types needed are a little different, else the same multiclass could be used for both the new and the old opcodes. That would add both the base pattern and the insert/extract pattern, although I'm not entirely sure that both are needed.

um... I thought the multiclass was not suitable for our case because it needs different input/output types and a combination of multiple OpNodes.

AArch64uaddv is a slightly overloaded term. It comes from a vecreduce and can be lowered to a number of things. I think for anything that is v2iX, it will produce an addp, not an addv. So we want to turn addp(addlp(x)) -> addv(x), but that is still the same pattern extended to v4i32->v2i64->i64 and v4i16->v2i32->i32 variants.

AArch64uaddv is a slightly overloaded term. It comes from a vecreduce and can be lowered to a number of things. I think for anything that is v2iX, it will produce an addp, not an addv. So we want to turn addp(addlp(x)) -> addv(x), but that is still the same pattern extended to v4i32->v2i64->i64 and v4i16->v2i32->i32 variants.

Yep, at first I was a bit confused about the AArch64uaddv SDNode because it is not the same as the ISA's addv.

For addp(addlp(x)) -> addv(x), addv generates an i32 output from a v4i32 input and an i16 output from a v4i16 input. However, addp(addlp(x)) generates an i64 output from a v4i32 input and an i32 output from a v4i16 input. I think the outputs might not be the same for large inputs between addp(addlp(x)) and addv(x). If I missed something, please let me know.

For addp(addlp(x)) -> addv(x), addv generates an i32 output from a v4i32 input and an i16 output from a v4i16 input. However, addp(addlp(x)) generates an i64 output from a v4i32 input and an i32 output from a v4i16 input. I think the outputs might not be the same for large inputs between addp(addlp(x)) and addv(x). If I missed something, please let me know.

With types and full opcode names it would be (i64 addp(v2i64 uaddlp(v4i32 x))) -> (i64 uaddlv(v4i32 x)) or (i32 addp(v2i32 uaddlp(v4i16 x))) -> (i32 uaddlv(v4i16 x))

Do you have an example of where you don't think that would work the same?

For addp(addlp(x)) -> addv(x), addv generates an i32 output from a v4i32 input and an i16 output from a v4i16 input. However, addp(addlp(x)) generates an i64 output from a v4i32 input and an i32 output from a v4i16 input. I think the outputs might not be the same for large inputs between addp(addlp(x)) and addv(x). If I missed something, please let me know.

With types and full opcode names it would be (i64 addp(v2i64 uaddlp(v4i32 x))) -> (i64 uaddlv(v4i32 x)) or (i32 addp(v2i32 uaddlp(v4i16 x))) -> (i32 uaddlv(v4i16 x))

Do you have an example of where you don't think that would work the same?

Ah, sorry. If it is uaddlv rather than addv, it is ok. Let me try to add it.
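For reference, IR along these lines should exercise those two variants (a sketch; the declarations follow the same naming scheme as the tests above):

declare <2 x i32> @llvm.aarch64.neon.uaddlp.v2i32.v4i16(<4 x i16>)
declare <2 x i64> @llvm.aarch64.neon.uaddlp.v2i64.v4i32(<4 x i32>)
declare i32 @llvm.vector.reduce.add.v2i32(<2 x i32>)
declare i64 @llvm.vector.reduce.add.v2i64(<2 x i64>)

define i32 @uaddlv_from_v4i16(<4 x i16> %a) {
  ; v4i16 -> v2i32 pairwise widening add, then reduce to i32
  %p = call <2 x i32> @llvm.aarch64.neon.uaddlp.v2i32.v4i16(<4 x i16> %a)
  %r = call i32 @llvm.vector.reduce.add.v2i32(<2 x i32> %p)
  ret i32 %r
}

define i64 @uaddlv_from_v4i32(<4 x i32> %a) {
  ; v4i32 -> v2i64 pairwise widening add, then reduce to i64
  %p = call <2 x i64> @llvm.aarch64.neon.uaddlp.v2i64.v4i32(<4 x i32> %a)
  %r = call i64 @llvm.vector.reduce.add.v2i64(<2 x i64> %p)
  ret i64 %r
}

Each of these should become a single uaddlv (on v0.4h and v0.4s respectively) once the patterns are in place.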

jaykang10 updated this revision to Diff 352972. Jun 18 2021, 4:38 AM

Following comments from @dmgreen, updated patch.

dmgreen accepted this revision. Jun 18 2021, 8:45 AM

Thanks. LGTM if you fix the v2i32 pattern to be like the others.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
5769

v4i16 -> v2i32 here?

But I think the pattern may work without the vector_extract / insert_subvector, just like the v4i32 and v2i64 patterns do. Then it is overall a bit simpler.
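For example, something along these lines might work for the v2i32 case (just a sketch; the exact pattern in the diff may differ):

def : Pat<(v2i32 (AArch64uaddv (v2i32 (AArch64uaddlp (v4i16 V64:$op))))),
          (INSERT_SUBREG (v2i32 (IMPLICIT_DEF)), (UADDLVv4i16v V64:$op), ssub)>;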

llvm/test/CodeGen/AArch64/neon-uaddlv.ll
23

There are a lot of spaces here. Can you cut them down to 2, as would be more common :)

This revision is now accepted and ready to land. Jun 18 2021, 8:45 AM
jaykang10 added inline comments. Jun 18 2021, 9:17 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.td
5769

v4i16 -> v2i32 here?

Oops, sorry. Let me update it.

But I think the pattern may work without the vector_extract / insert_subvector, just like the v4i32 and v2i64 patterns do. Then it is overall a bit simpler.

You are right! Let me make it simpler.

llvm/test/CodeGen/AArch64/neon-uaddlv.ll
23

Yep, let me update it.

jaykang10 updated this revision to Diff 353025. Jun 18 2021, 9:18 AM

Following comments from @dmgreen, updated patch.

This revision was landed with ongoing or failed builds. Jun 18 2021, 9:43 AM
This revision was automatically updated to reflect the committed changes.

I'm not sure this transform is sound; if you consider a simple case such as uaddlv4h_from_v8i8 and perform the expansion by hand, you'll see wildly different results.
Given an input vector of u8{1,1,1,1,1,1,1,1}, one would expect uaddlp to turn it into u16{2,2,2,2} and uaddlv to fold that into u16{0x8}. On the other hand, replacing everything with a single uaddlv operating on v0.16b yields u16{0x404}, as it skips the pairwise sum step.

We've observed some serious miscompilations after upgrading to LLVM 13, as the offending pattern is produced when expanding ctpop, as shown in the following example:

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-unknown-linux-musl"

define i16 @count([4 x i16]* %0) {
Entry:
  %1 = bitcast [4 x i16]* %0 to <4 x i16>*
  %2 = load <4 x i16>, <4 x i16>* %1, align 2
  %3 = call <4 x i16> @llvm.ctpop.v4i16(<4 x i16> %2)
  %4 = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %3)
  ret i16 %4
}

declare <4 x i16> @llvm.ctpop.v4i16(<4 x i16>)
declare i16 @llvm.vector.reduce.add.v4i16(<4 x i16>)

cc some more ARM people: @eli.friedman @aemerson

It looks like there's a typo in the patch. The pattern for v8i8 produces UADDLVv4i16v; I think it's supposed to be UADDLVv8i8v. If that's fixed, for the testcase uaddlv4h_from_v8i8 in this patch, we should produce uaddlv h0, v0.8b. The current uaddlv s0, v0.4h is pretty clearly wrong.
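For the all-ones <8 x i8> example above, that explains the observed numbers:

uaddlv s0, v0.4h   // wrong: the 8 bytes are read as 4 halfwords of 0x0101; 4 * 0x0101 = 0x0404
uaddlv h0, v0.8b   // right: 8 bytes of 0x01 sum to 0x0008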

@LemonBoy I am sorry for the bug... It was my mistake... As @efriedma mentioned, it should be v8i8 rather than v4i16. Let me update it.

jaykang10 updated this revision to Diff 385019 (edited). Nov 5 2021, 4:20 AM

[AArch64] Fix a bug from a pattern for uaddv(uaddlp(x)) ==> uaddlv

A pattern selected the wrong uaddlv MI. It should be as below.
uaddv(uaddlp(v8i8)) ==> uaddlv(v8i8)
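For reference, the corrected pattern would look roughly like below (a sketch; see the updated diff for the exact form):

def : Pat<(v4i16 (AArch64uaddv (v4i16 (AArch64uaddlp (v8i8 V64:$op))))),
          (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)), (UADDLVv8i8v V64:$op), hsub)>;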

@LemonBoy If possible, can you check the last update of this patch solves the error from your side please?

[AArch64] Fix a bug from a pattern for uaddv(uaddlp(x)) ==> uaddlv

A pattern selected the wrong uaddlv MI. It should be as below.
uaddv(uaddlp(v8i8)) ==> uaddlv(v8i8)

It would be better as a new Phabricator review, but the update looks good to me.

[AArch64] Fix a bug from a pattern for uaddv(uaddlp(x)) ==> uaddlv

A pattern selected the wrong uaddlv MI. It should be as below.
uaddv(uaddlp(v8i8)) ==> uaddlv(v8i8)

It would be better as a new Phabricator review, but the update looks good to me.

Ah, yep, let me create a new review for the patch.