Add Advanced SIMD instructions for Rounding Double Multiply Add/Subtract v8.1 extension
Details
Diff Detail
Event Timeline
Hi Vladimir,
I think this mostly looks fine (very nice first patch!). I've got a couple of questions though (don't take them as non-negotiable, if you think you have good reasons for those decisions).
Cheers.
Tim.
include/llvm/IR/IntrinsicsAArch64.td | ||
---|---|---|
660 | I don't think these new intrinsics are needed. The instructions are effectively "(int_aarch64_neon_sqadd $acc, (int_aarch64_neon_sqrdmulh $LHS, $RHS))". The only reason we've fused operations like that in the past is when loop optimisations can interfere and make block-at-a-time selection produce worse code. Basically, this is when sext/zext happen *before* the key NEON op, and those nodes get hoisted out and become invisible to local selection (e.g. the int_aarch64_neon_smull intrinsics). That shouldn't be an issue here. | |
lib/Target/AArch64/AArch64.td | ||
29–30 | Just how granular is v8.1? Are these instructions optional, or are we expecting CPUs that support them but not v8.1 in general? If not, it'd be much better to keep things as hierarchical as possible and just add a HasV8_1 predicate, or perhaps FeatureFPARMv8_1, but again unless you expect a v8.1 core without a v8.1 FPU that could just be "Requires<HasV8_1, FeatureFPARMv8>". |
Hi Tim,
thank you for your warm feedback.
- I don't think these new intrinsics are needed. The instructions are effectively "(int_aarch64_neon_sqadd $acc, (int_aarch64_neon_sqrdmulh $LHS, $RHS))".
Ok, but now I have a severe trouble that could be well-known for v1iNN types. If so, would you please give some hint?
For scalar operations, I have temporarily added
lib/Target/AArch64/AArch64InstrFormats.td: let mayStore = 0, mayLoad = 0, hasSideEffects = 0 in class BaseSIMDThreeScalar_cstr<bit U, bits<2> size, bits<5> opcode, dag oops, dag iops, string asm, string cstr, list<dag> pattern> : I<oops, iops, asm, "\t$Rd, $Rn, $Rm", cstr, pattern>, Sched<[WriteV]> { bits<5> Rd; bits<5> Rn; bits<5> Rm; let Inst{31-30} = 0b01; let Inst{29} = U; let Inst{28-24} = 0b11110; let Inst{23-22} = size; let Inst{21} = 1; let Inst{20-16} = Rm; let Inst{15-11} = opcode; let Inst{10} = 1; let Inst{9-5} = Rn; let Inst{4-0} = Rd; } class BaseSIMDThreeScalarExtRDMA<bit U, bits<2> size, bits<5> opcode, dag oops, dag iops, string asm, list<dag> pattern> : BaseSIMDThreeScalar_cstr<U, size, opcode, oops, iops, asm, "$Rd = $dst", pattern> { let Inst{21} =0; } multiclass SIMDThreeScalarHSExtRDMA<bit U, bits<5> opc, string asm, SDPatternOperator OpNode = null_frag> { def i32 : BaseSIMDThreeScalarExtRDMA<U, 0b10, opc, (outs FPR32:$dst), (ins FPR32:$Rd, FPR32:$Rn, FPR32:$Rm), asm, []>; def i16 : BaseSIMDThreeScalarExtRDMA<U, 0b01, opc, (outs FPR16:$dst), (ins FPR16:$Rd, FPR16:$Rn, FPR16:$Rm), asm, []>; // def v1i16 : BaseSIMDThreeScalarExtRDMA<U, 0b01, opc, (outs FPR16:$dst), (ins FPR16:$Rd, FPR16:$Rn, FPR16:$Rm), asm, []>; } lib/Target/AArch64/AArch64InstrInfo.td: defm SQRDMLAH : SIMDThreeScalarHSExtRDMA<1, 0b10000, "sqrdmlah">; def : Pat<(i16 (int_aarch64_neon_sqadd (i16 FPR16:$Rd), (i16 (int_aarch64_neon_sqrdmulh (i16 FPR16:$Rn), (i16 FPR16:$Rm))))), (SQRDMLAHi16 FPR16:$Rd, FPR16:$Rn, FPR16:$Rm)>; //def : Pat<(v1i16 (int_aarch64_neon_sqadd (v1i16 FPR16:$Rd), // (v1i16 (int_aarch64_neon_sqrdmulh (v1i16 FPR16:$Rn), // (v1i16 FPR16:$Rm))))), // (SQRDMLAHv1i16 FPR16:$Rd, FPR16:$Rn, FPR16:$Rm)>; def : Pat<(i32 (int_aarch64_neon_sqadd (i32 FPR32:$Rd), (i32 (int_aarch64_neon_sqrdmulh (i32 FPR32:$Rn), (i32 FPR32:$Rm))))), (SQRDMLAHi32 FPR32:$Rd, FPR32:$Rn, FPR32:$Rm)>;
SQRDMLAHi32 works fine, but for i16 version, I got Tablegen error
anonymous_1513: (intrinsic_wo_chain:<empty> 117:<empty>, FPR16:i16:$Rd, (intrinsic_wo_chain:i16 124:<empty>, FPR16:i16:$Rn, FPR16:i16:$Rm)) Included from /work/llvm/lib/Target/AArch64/AArch64.td:58: /work/llvm/lib/Target/AArch64/AArch64InstrInfo.td:2992:1: error: In anonymous_1513: Type inference contradiction found, merging '{i32:i64:v8i8:v16i8:v4i16:v8i16:v2i32:v4i32:v1i64:v2i64}' into 'i16' def : Pat<(i16 (int_aarch64_neon_sqadd (i16 FPR16:$Rd), ^ anonymous_1513: (SQRDMLAHi16:f16 FPR16:f16:$Rd, FPR16:<empty>:$Rn, FPR16:f16:$Rm) Included from /work/llvm/lib/Target/AArch64/AArch64.td:58: /work/llvm/lib/Target/AArch64/AArch64InstrInfo.td:2992:1: error: In anonymous_1513: Type inference contradiction found, merging 'i16' into 'f16' def : Pat<(i16 (int_aarch64_neon_sqadd (i16 FPR16:$Rd), ^ anonymous_1513: (intrinsic_wo_chain:<empty> 117:<empty>, FPR16:i16:$Rd, (intrinsic_wo_chain:i16 124:<empty>, FPR16:i16:$Rn, FPR16:i16:$Rm))
The similar is for v1i16
anonymous_1513: (intrinsic_wo_chain:<empty> 117:<empty>, FPR16:v1i16:$Rd, (intrinsic_wo_chain:v1i16 124:<empty>, FPR16:v1i16:$Rn, FPR16:v1i16:$Rm)) Included from /work/llvm/lib/Target/AArch64/AArch64.td:58: /work/llvm/lib/Target/AArch64/AArch64InstrInfo.td:2996:1: error: In anonymous_1513: Type inference contradiction found, merging '{i32:i64:v8i8:v16i8:v4i16:v8i16:v2i32:v4i32:v1i64:v2i64}' into 'v1i16' def : Pat<(v1i16 (int_aarch64_neon_sqadd (v1i16 FPR16:$Rd), ^ anonymous_1513: (SQRDMLAHv1i16:f16 FPR16:f16:$Rd, FPR16:<empty>:$Rn, FPR16:f16:$Rm) Included from /work/llvm/lib/Target/AArch64/AArch64.td:58: /work/llvm/lib/Target/AArch64/AArch64InstrInfo.td:2996:1: error: In anonymous_1513: Type inference contradiction found, merging 'v1i16' into 'f16' def : Pat<(v1i16 (int_aarch64_neon_sqadd (v1i16 FPR16:$Rd), ^ anonymous_1513: (intrinsic_wo_chain:<empty> 117:<empty>, FPR16:v1i16:$Rd, (intrinsic_wo_chain:v1i16 124:<empty>, FPR16:v1i16:$Rn, FPR16:v1i16:$Rm))
It's not the first time I see LLVM has problems with i16, thats why I in my first patch I skipped both implementation and tests for SQRDMLAHi16.
Previously. I was able to implement at least SQRDMLAHv1i16, but now I cannot do it even for v1i16.
What would you suggest here?
Note: First I tried another way of implementation, like the following snipped for vector variant. It has failed due to incomplete type inference between two intrinsics.
lib/Target/AArch64/AArch64InstrFormats.td: multiclass SIMDThreeSameVectorExtRDMA<bit U, bits<5> opc, string asm, SDPatternOperator OpNode> { def v4i16 : BaseSIMDThreeSameVectorExtRDMA<0, U, 0b01, opc, V64, asm, ".4h", [(set (v4i16 V64:$dst), (OpNode (v4i16 V64:$Rd), (v4i16 V64:$Rn), (v4i16 V64:$Rm)))]>; } lib/Target/AArch64/AArch64InstrInfo.td: defm SQRDMLAH : SIMDThreeSameVectorExtRDMA<1,0b10000,"sqrdmlah", TriOpFrag<(int_aarch64_neon_sqadd node:$LHS, (int_aarch64_neon_sqrdmulh node:$MHS, node:$RHS))> >; SQRDMLAHv4i16: (set V64:v4i16:$dst, (intrinsic_wo_chain:v4i16 117:iPTR, V64:v4i16:$Rd, (intrinsic_wo_chain:{i32:i64:v8i8:v16i8:v4i16:v8i16:v2i32:v4i32:v1i64:v2i64} 124:iPTR, V64:v4i16:$Rn, V64:v4i16:$Rm))) Included from /work/llvm/lib/Target/AArch64/AArch64.td:58: /work/llvm/lib/Target/AArch64/AArch64InstrInfo.td:2730:1: error: In SQRDMLAHv4i16: Could not infer all types in pattern!
though, when I changed "sqadd" for "add", it worked. So, I have abandoned this "TriOpFrag" approach,
- it'd be much better to keep things as hierarchical as possible and just add a HasV8_1 predicate, or perhaps FeatureFPARMv8_1ble and just add a HasV8_1 predicate, or perhaps FeatureFPARMv8_1,
At the time when these predicates has been introduced downstream, we were not sure, whether it might be partial v8.05 implementations. Nowadays we are almost sure it should not be, but not 100%. However, I'd agree with you, and in case there unexpectedly some partial implementation will take place in the future, we'd better split this HasV8_1 predicate.
I'll change HasRDMA for HasV8_1 in next revision.
Yet, fortunately, no FeatureFPARMv8_1 will be required : FP has not been improved.
Hi Tim,
after deep research I've come up with usage of v1i8, v1i16, v1i32 as valid contents of FPR8, FPR12, FPR32.
Only this way I could perform fusion of intrinsics sqadd and sqrdmulh, with v1i16 as operands, intermediate and result type.
In fact, even for current neon scalar instructions sqadd and sqrdmulh, v1i16 types should be used.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/SQRDMULH_advsimd_vec_scalar.html
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/SQADD_advsimd_scalar.html
Instead, f16 type, enclosed by FPR16 is used. That results in tricky/hacky dag matching, in this way
f16 SQRDMULH(f16,f16,f16); v1i16 op1; v1i16 op2; v1i16 result = cast(v1i16, SQRDMULH(cast(f16, op1), cast(f16, op2))
For single instruction it works, but I couldn't found a way to fuse two instrs, not using explicit v1i16 type. Won't do that using explicit f16 type.
Now locally I have the following
diff --git a/lib/Target/AArch64/AArch64RegisterInfo.td b/lib/Target/AArch64/AArch64RegisterInfo.td index d5ff3f1..628e9c7 100644 --- a/lib/Target/AArch64/AArch64RegisterInfo.td +++ b/lib/Target/AArch64/AArch64RegisterInfo.td @@ -382,13 +382,13 @@ def Q30 : AArch64Reg<30, "q30", [D30], ["v30", ""]>, DwarfRegAlias<B30>; def Q31 : AArch64Reg<31, "q31", [D31], ["v31", ""]>, DwarfRegAlias<B31>; } -def FPR8 : RegisterClass<"AArch64", [untyped], 8, (sequence "B%u", 0, 31)> { +def FPR8 : RegisterClass<"AArch64", [untyped, v1i8], 8, (sequence "B%u", 0, 31)> { let Size = 8; } -def FPR16 : RegisterClass<"AArch64", [f16], 16, (sequence "H%u", 0, 31)> { +def FPR16 : RegisterClass<"AArch64", [f16, v1i16], 16, (sequence "H%u", 0, 31)> { let Size = 16; } -def FPR32 : RegisterClass<"AArch64", [f32, i32], 32,(sequence "S%u", 0, 31)>; +def FPR32 : RegisterClass<"AArch64", [f32, i32, v1i32], 32,(sequence "S%u", 0, 31)>; def FPR64 : RegisterClass<"AArch64", [f64, i64, v2f32, v1f64, v8i8, v4i16, v2i32, v1i64, v4f16], 64, (sequence "D%u", 0, 31)>;
plus corresponding changes, polishing and explicit type qualifications added to AArch64InstrFormats.td and AArch64InstrInfo.td. Will submit this refactoring shortly.
I'm aware of recent doubts, whether to accept these types as fully valid ( http://permalink.gmane.org/gmane.comp.compilers.llvm.cvs/175395 ). I guess the time has come for the correct implementation.
I would appreciate any objections/suggestions/other approaches to fuse.
Thanks, Vladimir
after deep research I've come up with usage of v1i8, v1i16, v1i32 as valid contents of FPR8, FPR12, FPR32.
The original AArch64 backend went down that path, and I think it was a
mistake. It ended up adding lots of complexity for not much gain.
I think the proposed plan for actually using these scalar instructions
is by enhancing the AArch64AdvSIMDScalarPass to optimise these cases
when they occur.
Cheers.
Tim.
Hmm... this way, I'd like to postpone implementation of this RDMA extension, since we have some troubles with it.
Meanwhile I have implemented other 4 extensions, and they are ready to upstream.
Let's start work on it, and after it I'll return to rewrite RDMA?
I've just found, for an unknown reason, one feedback in this thread has been sent by mail, but is not present in this webpage.
Quoting it here:
> Hi Tim, > thank you for your warm feedback. > > 1. **I don't think these new intrinsics are needed. The instructions > are effectively "(int_aarch64_neon_sqadd $acc, > (int_aarch64_neon_sqrdmulh $LHS, $RHS))".** > > Ok, but now I have a severe trouble that could be well-known for v1iNN types. If so, would you please give some hint? > [...] > def : Pat<(i16 (int_aarch64_neon_sqadd (i16 FPR16:$Rd), > (i16 (int_aarch64_neon_sqrdmulh (i16 FPR16:$Rn), > (i16 > FPR16:$Rm))))), Most i16 intrinsics don't have patterns yet, because i16 isn't a legal AArch64 type. Clang currently generates code like this (using vector ops) to get those semantics: define signext i16 @foo(i16 signext %l, i16 signext %r) #0 { %1 = insertelement <4 x i16> undef, i16 %l, i64 0 %2 = insertelement <4 x i16> undef, i16 %r, i64 0 %3 = tail call <4 x i16> @llvm.aarch64.neon.sqadd.v4i16(<4 x i16> %1, <4 x i16> %2) #2 %4 = extractelement <4 x i16> %3, i64 0 ret i16 %4 } > Note: First I tried another way of implementation, like the following snipped for vector variant. It has failed due to incomplete type inference between two intrinsics. That looks like an annoying shortcoming in TableGen's type inference, you often need to be more explicit than you'd hope when specifying the types of trees involving intrinsics. Annoyingly, I think you sometimes need to create a new multiclass hierarchy to insert the needed types. We should probably fix that some time. Cheers. Tim.
I don't think these new intrinsics are needed. The instructions are effectively "(int_aarch64_neon_sqadd $acc, (int_aarch64_neon_sqrdmulh $LHS, $RHS))".
The only reason we've fused operations like that in the past is when loop optimisations can interfere and make block-at-a-time selection produce worse code.
Basically, this is when sext/zext happen *before* the key NEON op, and those nodes get hoisted out and become invisible to local selection (e.g. the int_aarch64_neon_smull intrinsics). That shouldn't be an issue here.