Related to http://reviews.llvm.org/D8503 [ARM] Add v8.1a "Rounding Double Multiply Add/Subtract" extension
Details
Diff Detail
Event Timeline
Hi Vladimir,
Thanks for working on this. It looks really good!
Comments inline.
Cheers,
James
lib/Target/AArch64/AArch64InstrFormats.td | ||
---|---|---|
8555 | I think this should say "Doubling", right? | |
8607 | We don't allow commented out code, sorry. You should also be able to enable this pattern, by using an SMOV instruction (you'd need a UMOV for the equivalent unsigned operation): def : Pat<(... stays the same ...), (SMOVvi16to32 (!cast<Instruction>(NAME # v4i16_indexed) ..., VectorIndexH:0)>; But the commented out pattern seems weird; it uses SUBREG_TO_REG, which I don't understand why, and it matches sqdmull, not sqrdmulh like the surrounding code. So something's fishy here. | |
8657 | This is really weird and sounds like a bug, although if the pattern matches I can't really argue with it as it means the bug is somewhere else... ... I assume this pattern is explicitly tested? | |
8714 | You should be able to implement this with SMOV/UMOV, as I mentioned above. | |
lib/Target/AArch64/AArch64InstrInfo.td | ||
3086 | Again, UMOV/SMOV to implement this. |
lib/Target/AArch64/AArch64InstrFormats.td | ||
---|---|---|
8555 | Sorry, will be changed in next revision. | |
8607 | Oops, good catch. // FIXME: this cannot be processed by TableGen // error: In SQRDMLAHanonymous_913: Type inference contradiction found, // merging '{i32:i64:v8i8:v16i8:v4i16:v8i16:v2i32:v4i32:v1i64:v2i64}' into 'i16' // error: In SQRDMLAHanonymous_913: Type inference contradiction found, merging 'i16' into 'f16' //def : Pat<(i16 (Accum (i16 FPR16Op:$Rd), // (i16 (vector_extract // (v8i16 (insert_subvector // (undef), // (v4i16 (int_aarch64_neon_sqrdmulh // (v4i16 V64:$Rn), // (v4i16 (AArch64duplane16 // (v8i16 V128_lo:$Rm), // VectorIndexH:$idx)))), // (i32 0))), // (i64 0))))), // (EXTRACT_SUBREG // (v4i16 (!cast<Instruction>(NAME # v4i16_indexed) // (v4i16 (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)), // FPR16Op:$Rd, // ssub)), // V64:$Rn, // V128_lo:$Rm, // VectorIndexH:$idx)), // ssub)>; Test for it: test_sqrdmlsh_extracted_lane_s16 (see below) | |
8636 | Also, that's a non-compilable pattern, supposed to be here and tested by test_sqrdmlahq_extracted_lane_s16 // FIXME: this cannot be processed by TableGen // error: In SQRDMLAHanonymous_913: Type inference contradiction found, // merging '{i32:i64:v8i8:v16i8:v4i16:v8i16:v2i32:v4i32:v1i64:v2i64}' into 'i16' // error: In SQRDMLAHanonymous_913: Type inference contradiction found, merging 'i16' into 'f16' //def : Pat<(i16 (Accum (i16 FPR16Op:$Rd), // (i16 (vector_extract // (v8i16 (int_aarch64_neon_sqrdmulh // (v8i16 V128:$Rn), // (v8i16 (AArch64duplane16 // (v8i16 V128_lo:$Rm), // VectorIndexH:$idx)))), // (i64 0))))), // (EXTRACT_SUBREG // (v8i16 (!cast<Instruction>(NAME # v8i16_indexed) // (v8i16 (INSERT_SUBREG (v8i16 (IMPLICIT_DEF)), // FPR16Op:$Rd, // ssub)), // V128:$Rn, // V128_lo:$Rm, // VectorIndexH:$idx)), // ssub)>; | |
8657 | weird extra node (v4i32 (insert_subvector (undef),(2i32... is inserted to this DAG, because extact_subvector is illegal from 2i32. It is legal only from 4i32. Meanwhile this pattern successully matches DAG, that we have for explicit test "test_sqrdmlah_extracted_lane_s32" (see comment below) | |
8714 | As I commented above, it's a problem of another kind: namely, Tablegen cannot generate matcher for snippet Accum (i16 FPR16Op:$Rd), (i16 (int_aarch64_neon_sqrdmulh.... because of error SQRDMLAHi16_indexed: (set FPR16Op:i16:$dst, (intrinsic_wo_chain:{i32:i64:v8i8:v16i8:v4i16:v8i16:v2i32:v4i32:v1i64:v2i64} 117:iPTR, FPR16Op:<empty>:$Rd, (intrinsic_wo_chain:i16 122:<empty>, FPR16Op:i16:$Rn, (vector_extract:i16 V128_lo:v8i16:$Rm, (imm:i64)<<P:Predicate_VectorIndexH>>:$idx)))) Included from /work/llvm-rw/lib/Target/AArch64/AArch64.td:58: /work/llvm-rw/lib/Target/AArch64/AArch64InstrInfo.td:4357:1: error: In SQRDMLAHi16_indexed: Type inference contradiction found, merging 'f16' into 'i16' defm SQRDMLAH : SIMDIndexedSQRDMLxHSDTied<1, 0b1101, "sqrdmlah", ^ Included from /work/llvm-rw/lib/Target/AArch64/AArch64.td:58: Included from /work/llvm-rw/lib/Target/AArch64/AArch64InstrInfo.td:283: /work/llvm-rw/lib/Target/AArch64/AArch64InstrFormats.td:8737:3: note: instantiated from multiclass def i16_indexed : BaseSIMDIndexedTied<1, U, 1, 0b01, opc, ^ | |
lib/Target/AArch64/AArch64InstrInfo.td | ||
3086 | (the same as discussion above) | |
test/CodeGen/AArch64/arm64-neon-v8.1a.ll | ||
230 | test for second non-compilable pattern, marked with // FIXME: this cannot be processed by TableGen | |
241 | That's a test for really weird pattern above to match weird extra insert_subvector in DAG | |
269 | test for first non-compilable pattern, marked with // FIXME: this cannot be processed by TableGen |
Hi Vladimir,
I've checked out your patch and fiddled around with it. It is possible, but ugly, to match your unmatchable pattern.
First, we need to properly legalize the intrinsic. It has type i16 (and takes i16 arguments). I16 is illegal so needs to be promoted, but the generic code can't promote it for you so we need to do it ourselves. There are two ways to do this: either create a new AArch64ISD:: node for this operation or operate on ISD::INTRINSIC_WO_CHAIN nodes themselves. For simplicity I've done the latter.
First, we need to tell the target-agnostic gubbins that we want to custom lower intrinsic nodes:
// Somewhere near AArch64ISelLowering.cpp:120 setOperationAction(ISD::INTRINSIC_WO_CHAIN, MVT::i16, Custom);
Now we need to implement the custom lowering.
// In AArch64ISelLowering.cpp , function ReplaceNodeResults() case ISD::INTRINSIC_WO_CHAIN: { auto ID = getIntrinsicID(N); if ((ID == Intrinsic::aarch64_neon_sqrdmulh || ID == Intrinsic::aarch64_neon_sqadd) && N->getValueType(0) == MVT::i16) { // Promote to i32. SDLoc DL(N); auto Op0 = DAG.getNode(ISD::ANY_EXTEND, DL, MVT::i32, N->getOperand(1)); auto Op1 = DAG.getNode(ISD::ANY_EXTEND, DL, MVT::i32, N->getOperand(2)); auto NN = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, MVT::i32, DAG.getConstant(ID, MVT::i32), Op0, Op1); NN = DAG.getNode(ISD::TRUNCATE, DL, MVT::i16, NN); Results.push_back(NN); } return; }
With this change, we can get code that at least doesn't crash:
umov w8, v0.h[3] fmov s0, w0 fmov s1, w1 fmov s2, w8 sqrdmlah s0, s1, s2 fmov w0, s0 ret
That uses the i32 variant of the sqrdmlah instruction. We need to do at least this much, I think, because we can't have intrinsics that just crash the compiler.
Now, matching the pattern. The pattern we need to match is basically the same as the i32_indexed version of the pattern, but with a v8i16 instead of v4i32 type:
def : Pat<(i32 (Accum (i32 FPR32Op:$Rd), (i32 (int_aarch64_neon_sqrdmulh (i32 FPR32Op:$Rn), (i32 (vector_extract (v8i16 V128:$Rm), VectorIndexH:$idx)))))),
But the pattern to generate is even uglier still. This is the best i've got:
(COPY_TO_REGCLASS (f32 (INSERT_SUBREG (IMPLICIT_DEF), (!cast<Instruction>(NAME#"i16_indexed") (EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS FPR32Op:$Rd, FPR32)), hsub), (EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS FPR32Op:$Rn, FPR32)), hsub), V128:$Rm, VectorIndexH:$idx), hsub)), FPR32)>;
All the operands are going to be i32 types, so we need to make sure they're in the FPR32 register bank before we try and take the "hsub" subregister from them. That's the COPY_TO_REGCLASS nodes. We will then end up with an f32 type, which in order to merge it into the i32 the pattern must return, I've added another COPY_TO_REGCLASS so the return value of the entire pattern is merely FPR32 (which both i32 and f32 can be allocated to).
This produces:
fmov s1, w1 fmov s2, w0 sqrdmlah h2, h1, v0.h[3] fmov w0, s2 ret
Which is what we want. It can also produce chained sqrdmlah's, such as:
fmov s1, w1 fmov s2, w0 sqrdmlah h2, h1, v0.h[3] sqrdmlah h2, h1, v0.h[2] fmov w0, s2 ret
So I think this is certainly a valid way of implementing those intrinsics.
I'm not sure the best way forward here - @Tim, would you mind please checking my tomfoolery above and see if you agree or not? If so, implementing these is quite involved so possibly would be better done in a separate patch.
Cheers,
James
- In AArch64ISelLowering.cpp , function ReplaceNodeResults() , we need also "sqsub"
- I don't think i32 sqrdmlah() will work right, even if we'd replace ISD::TRUNCATE with SQXTN. Is the following correct?
i16 sqrdmlah (0, 100, 1000) -> 0 sqadd (100 sqrdmulh 1000) -> 0 sqadd (high i16 half of 100000) -> 0 sqadd 1 -> 1 - we need to obtain that with workaround...
i32 sqrdmlah (0, 100, 1000) -> 0 sqadd (high i32 half of 100000) -> 0
nope, workaround does not seem to be right
- commented out patterns are removed
- commented out tests are rewritten from illegal IR to clang-style
Hi Vladimir,
Thanks for doing that. It looks a lot better without the nasty selection logic, and we can just fix up vector instructions to scalar ones in the AdvSIMD pass if needed, as Tim suggested.
LGTM.
Cheers,
James
I think this should say "Doubling", right?