Differential D104236
[AArch64] Add a TableGen pattern to generate uaddlv from uaddlp and addv
Authored by jaykang10 on Jun 14 2021, 8:59 AM.

Details
This patch defines a TableGen pattern for the following combine:
uaddlp + addv --> uaddlv
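For example, for a v8i8 source this folds the two-instruction sequence uaddlp v0.4h, v0.8b; addv h0, v0.4h into a single uaddlv h0, v0.8b (register numbers are illustrative).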
Event Timeline

Comment Actions
Can we add the other types too? It's good to add all the varieties if we can. Maybe they can be tested with intrinsics? A vecreduce of neon.uaddlp?

Comment Actions
um... I have tried the patterns below.

def : Pat<(v4i16 (AArch64uaddv (v4i16 (AArch64uaddlp (v8i8 V64:$op))))),
          (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)), (UADDLVv4i16v V64:$op), hsub)>;
def : Pat<(v8i16 (AArch64uaddv (v8i16 (AArch64uaddlp (v16i8 V128:$op))))),
          (INSERT_SUBREG (v8i16 (IMPLICIT_DEF)), (UADDLVv16i8v V128:$op), hsub)>;

The tests are as follows.

declare <4 x i16> @llvm.aarch64.neon.uaddlp.v4i16.v8i8(<8 x i8>) nounwind readnone
declare <8 x i16> @llvm.aarch64.neon.uaddlp.v8i16.v16i8(<16 x i8>) nounwind readnone
declare i16 @llvm.vector.reduce.add.v4i16(<4 x i16>) nounwind readnone
declare i16 @llvm.vector.reduce.add.v8i16(<8 x i16>) nounwind readnone

define i16 @addv4h(<8 x i8>* %A) nounwind {
  %tmp1 = load <8 x i8>, <8 x i8>* %A
  %tmp3 = call <4 x i16> @llvm.aarch64.neon.uaddlp.v4i16.v8i8(<8 x i8> %tmp1)
  %tmp5 = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %tmp3)
  ret i16 %tmp5
}

define i16 @addv8h(<16 x i8>* %A) nounwind {
  %tmp1 = load <16 x i8>, <16 x i8>* %A
  %tmp3 = call <8 x i16> @llvm.aarch64.neon.uaddlp.v8i16.v16i8(<16 x i8> %tmp1)
  %tmp5 = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %tmp3)
  ret i16 %tmp5
}

Before instruction selection, the SelectionDAG for addv4h is as below.

Optimized legalized selection DAG: %bb.0 'addv4h:'
SelectionDAG has 14 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %0
  t5: v8i8,ch = load<(load 8 from %ir.A)> t0, t2, undef:i64
  t20: v4i16 = AArch64ISD::UADDLP t5
  t14: v4i16 = AArch64ISD::UADDV t20
  t18: v8i16 = insert_subvector undef:v8i16, t14, Constant:i64<0>
  t19: i32 = extract_vector_elt t18, Constant:i64<0>
  t11: ch,glue = CopyToReg t0, Register:i32 $w0, t19
  t12: ch = AArch64ISD::RET_FLAG t11, Register:i32 $w0, t11:1

The insert_subvector causes the existing pattern below to be selected rather than the pattern I defined above.

def : Pat<(i32 (vector_extract (insert_subvector undef, (v4i16 (opNode V64:$Rn)), (i64 0)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v4i16v")) V64:$Rn), hsub), ssub)>;

In the end, addv4h selects instructions as below.

Selected selection DAG: %bb.0 'addv4h:'
SelectionDAG has 15 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %0
  t5: v8i8,ch = LDRDui<Mem:(load 8 from %ir.A)> t2, TargetConstant:i64<0>, t0
  t20: v4i16 = UADDLPv8i8_v4i16 t5
  t22: bf16 = ADDVv4i16v t20
  t24: v4i16 = INSERT_SUBREG IMPLICIT_DEF:v4i16, t22, TargetConstant:i32<7>
  t19: i32 = EXTRACT_SUBREG t24, TargetConstant:i32<14>
  t11: ch,glue = CopyToReg t0, Register:i32 $w0, t19
  t12: ch = RET_ReallyLR Register:i32 $w0, t11, t11:1

The other addv types are in a similar situation, except for v4i32. They are selected by one of the patterns below.

multiclass SIMDAcrossLanesIntrinsic<string baseOpc, SDPatternOperator opNode> {
...
// If none did, fallback to the explicit patterns, consuming the vector_extract.
def : Pat<(i32 (vector_extract (insert_subvector undef, (v8i8 (opNode V64:$Rn)), (i64 0)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v8i8 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v8i8v")) V64:$Rn), bsub), ssub)>;
def : Pat<(i32 (vector_extract (v16i8 (opNode V128:$Rn)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v16i8 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v16i8v")) V128:$Rn), bsub), ssub)>;
def : Pat<(i32 (vector_extract (insert_subvector undef, (v4i16 (opNode V64:$Rn)), (i64 0)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v4i16v")) V64:$Rn), hsub), ssub)>;
def : Pat<(i32 (vector_extract (v8i16 (opNode V128:$Rn)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v8i16 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v8i16v")) V128:$Rn), hsub), ssub)>;
def : Pat<(i32 (vector_extract (v4i32 (opNode V128:$Rn)), (i64 0))),
          (EXTRACT_SUBREG (INSERT_SUBREG (v4i32 (IMPLICIT_DEF)),
            (!cast<Instruction>(!strconcat(baseOpc, "v4i32v")) V128:$Rn), ssub), ssub)>;
}

The v4i32 case is selected by the pattern I defined for addv, because there is already a pattern for extractelt with i32.

// Extracting lane zero is a special case where we can just use a plain
// EXTRACT_SUBREG instruction, which will become FMOV. This is easier for the
// rest of the compiler, especially the register allocator and copy propagation,
// to reason about, so is preferred when it's possible to use it.
let AddedComplexity = 10 in {
  def : Pat<(i64 (extractelt (v2i64 V128:$V), (i64 0))), (EXTRACT_SUBREG V128:$V, dsub)>;
  def : Pat<(i32 (extractelt (v4i32 V128:$V), (i64 0))), (EXTRACT_SUBREG V128:$V, ssub)>;
  def : Pat<(i32 (extractelt (v2i32 V64:$V), (i64 0))), (EXTRACT_SUBREG V64:$V, ssub)>;
}

In order to support the other addv types, we might need to touch the existing patterns... What do you think about it, @dmgreen? If possible, I would prefer not to touch the existing patterns... If I missed something, please let me know.

Comment Actions
It sounds like it needs something like SIMDAcrossLanesIntrinsic but for the combined uaddlp + addv patterns. That way it should be able to handle both the (AArch64uaddv (AArch64uaddlp (..))) form and the extract(insert(AArch64uaddv (AArch64uaddlp (..)))) form, hopefully.

Comment Actions
Thanks. I would expect 2 more types, I think: v2i32 and v2i64? It may be better to create a new multiclass like SIMDAcrossLanesIntrinsic. It's a shame the types needed are a little different, else the same multiclass could be used for both the new and the old opcodes. That would add both the base pattern and the insert/extract pattern, although I'm not entirely sure that both are needed.

Comment Actions
The TableGen definition of UADDLV is as below.

defm UADDLV : SIMDAcrossLanesHSD<1, 0b00011, "uaddlv">;
...
multiclass SIMDAcrossLanesHSD<bit U, bits<5> opcode, string asm> {
  def v8i8v  : BaseSIMDAcrossLanes<0, U, 0b00, opcode, FPR16, V64,  asm, ".8b",  []>;
  def v16i8v : BaseSIMDAcrossLanes<1, U, 0b00, opcode, FPR16, V128, asm, ".16b", []>;
  def v4i16v : BaseSIMDAcrossLanes<0, U, 0b01, opcode, FPR32, V64,  asm, ".4h",  []>;
  def v8i16v : BaseSIMDAcrossLanes<1, U, 0b01, opcode, FPR32, V128, asm, ".8h",  []>;
  def v4i32v : BaseSIMDAcrossLanes<1, U, 0b10, opcode, FPR64, V128, asm, ".4s",  []>;
}

Therefore, it supports the following pairs of input and output types.

v8i8  ==> i16
v16i8 ==> i16
v4i16 ==> i32
v8i16 ==> i32
v4i32 ==> i64

From the TableGen definition of UADDLP, I think the AArch64uaddlp node supports the following input and output types.
v8i8  ==> v4i16
v16i8 ==> v8i16
v4i16 ==> v2i32
v8i16 ==> v4i32
v2i32 ==> v1i64
v4i32 ==> v2i64

The TableGen definition of UADDLV does not support v2i32, so I did not add a pattern for it. Next, I think AArch64uaddv supports the following input types.

v8i8
v16i8
v4i16
v8i16
v4i32

We can see that v4i16, v8i16 and v4i32 among the output types of AArch64uaddlp are supported as input types of AArch64uaddv, so I have added the patterns for these types.
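Putting the two lists together, the combined folds being targeted here would be:

v8i8  --uaddlp--> v4i16 --uaddv--> i16, foldable to a single uaddlv on the v8i8 source
v16i8 --uaddlp--> v8i16 --uaddv--> i16, foldable to a single uaddlv on the v16i8 source
v8i16 --uaddlp--> v4i32 --uaddv--> i32, foldable to a single uaddlv on the v8i16 source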
um... I thought the multiclass was not suitable for our case because our case has different input/output types and needs a combination of multiple OpNodes.

Comment Actions
AArch64uaddv is a slightly overloaded term. It comes from a vecreduce and can be lowered to a number of things. I think for anything that is v2iX, it will produce an addp, not an addv. So we want to turn addp(addlp(x)) -> addv(x), but that is still the same pattern extended to the v4i32 -> v2i64 -> i64 and v4i16 -> v2i32 -> i32 variants.

Comment Actions
Yep, at first I was a bit confused about the AArch64uaddv SDNode because it is not the same as the ISA's addv. For addp(addlp(x)) -> addv(x), addv generates an i32 output from a v4i32 input and an i16 output from a v4i16 input. However, addp(addlp(x)) generates an i64 output from a v4i32 input and an i32 output from a v4i16 input. I think the outputs could differ for large input values between addp(addlp(x)) and addv(x). If I missed something, please let me know.
Comment Actions
With types and full opcode names it would be

(i64 addp(v2i64 uaddlp(v4i32 x))) -> (i64 uaddlv(v4i32 x))
or
(i32 addp(v2i32 uaddlp(v4i16 x))) -> (i32 uaddlv(v4i16 x))

Do you have an example of where you don't think that would work the same?

Comment Actions
Thanks. LGTM if you fix the v2i32 pattern to be like the others.
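As a quick cross-check that those two forms agree even for maximal lane values (assuming the usual unsigned widening in both uaddlp and uaddlv):

x              = <0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF> : v4i16
uaddlp x       = <0x1FFFE, 0x1FFFE>               : v2i32
addp(uaddlp x) = 0x3FFFC                          : i32
uaddlv x       = 0x3FFFC                          : i32

Both sides widen before accumulating, so no bits are lost. It is only the non-widening ISA addv (which would wrap to 0xFFFC in a 16-bit destination) that behaves differently, which is where the earlier confusion came from.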
Comment Actions
I'm not sure this transform is sound. If you consider a simple case such as uaddlv4h_from_v8i8 and perform the expansion by hand, you'll see wildly different results. We've observed some serious miscompilations after upgrading to LLVM 13, as the offending pattern is produced when expanding ctpop, as shown in the following example:

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-unknown-linux-musl"

define i16 @count([4 x i16]* %0) {
Entry:
  %1 = bitcast [4 x i16]* %0 to <4 x i16>*
  %2 = load <4 x i16>, <4 x i16>* %1, align 2
  %3 = call <4 x i16> @llvm.ctpop.v4i16(<4 x i16> %2)
  %4 = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %3)
  ret i16 %4
}

declare <4 x i16> @llvm.ctpop.v4i16(<4 x i16>)
declare i16 @llvm.vector.reduce.add.v4i16(<4 x i16>)

cc some more ARM people: @eli.friedman @aemerson

Comment Actions
It looks like there's a typo in the patch. The pattern for v8i8 produces UADDLVv4i16v; I think it's supposed to be UADDLVv8i8v. If that's fixed, for the testcase uaddlv4h_from_v8i8 in this patch, we should produce uaddlv h0, v0.8b. The current uaddlv s0, v0.4h is pretty clearly wrong.

Comment Actions
[AArch64] Fix a bug from a pattern for uaddv(uaddlp(x)) ==> uaddlv
A pattern has selected the wrong uaddlv MI. It should be as below.

Comment Actions
@LemonBoy If possible, can you check whether the last update of this patch solves the error on your side, please?
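With that correction, the v8i8 pattern would presumably end up along these lines, a sketch based on the pattern quoted earlier in the thread with UADDLVv8i8v substituted for UADDLVv4i16v, not necessarily the verbatim committed diff:

def : Pat<(v4i16 (AArch64uaddv (v4i16 (AArch64uaddlp (v8i8 V64:$op))))),
          // Select the .8b form of uaddlv (16-bit result in an h register) for a v8i8 source.
          (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)), (UADDLVv8i8v V64:$op), hsub)>;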
v4i16 -> v2i32 here?
But I think the pattern may work without the vector_extract / insert_subvector, just like the v4i32 and v2i64 patterns do. Then it is overall a bit simpler.
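A sketch of the kind of direct pattern being suggested here, matching AArch64uaddv(AArch64uaddlp(...)) without the vector_extract / insert_subvector wrapping, modelled on the patterns quoted above (the v2i32 result type and the ssub subregister are assumptions for illustration, not the actual patch contents):

def : Pat<(v2i32 (AArch64uaddv (v2i32 (AArch64uaddlp (v4i16 V64:$op))))),
          // UADDLVv4i16v produces an FPR32 (i32) result, placed into lane 0 via ssub.
          (INSERT_SUBREG (v2i32 (IMPLICIT_DEF)), (UADDLVv4i16v V64:$op), ssub)>;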