This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add v8.1a "Rounding Doubling Multiply Add/Subtract" extension
ClosedPublic

Authored by vsukharev on Mar 20 2015, 3:01 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

vsukharev updated this revision to Diff 22387.Mar 20 2015, 3:01 PM
vsukharev retitled this revision from to [AArch64] Add v8.1a "Rounding Double Multiply Add/Subtract" extension.
vsukharev updated this object.
vsukharev edited the test plan for this revision. (Show Details)
vsukharev set the repository for this revision to rL LLVM.
vsukharev added a subscriber: Unknown Object (MLST).
jmolloy requested changes to this revision.Mar 24 2015, 4:10 AM
jmolloy added a reviewer: jmolloy.
jmolloy added a subscriber: jmolloy.

Hi Vladimir,

Thanks for working on this. It looks really good!

Comments inline.

Cheers,

James

lib/Target/AArch64/AArch64InstrFormats.td
8555 ↗(On Diff #22387)

I think this should say "Doubling", right?

8607 ↗(On Diff #22387)

We don't allow commented out code, sorry. You should also be able to enable this pattern, by using an SMOV instruction (you'd need a UMOV for the equivalent unsigned operation):

def : Pat<(... stays the same ...),
              (SMOVvi16to32 (!cast<Instruction>(NAME # v4i16_indexed)
                ..., VectorIndexH:0)>;

But the commented out pattern seems weird; it uses SUBREG_TO_REG, which I don't understand why, and it matches sqdmull, not sqrdmulh like the surrounding code. So something's fishy here.

8657 ↗(On Diff #22387)

This is really weird and sounds like a bug, although if the pattern matches I can't really argue with it as it means the bug is somewhere else...

... I assume this pattern is explicitly tested?

8714 ↗(On Diff #22387)

You should be able to implement this with SMOV/UMOV, as I mentioned above.

lib/Target/AArch64/AArch64InstrInfo.td
3086 ↗(On Diff #22387)

Again, UMOV/SMOV to implement this.

This revision now requires changes to proceed.Mar 24 2015, 4:10 AM
vsukharev retitled this revision from [AArch64] Add v8.1a "Rounding Double Multiply Add/Subtract" extension to [AArch64] Add v8.1a "Rounding Doubling Multiply Add/Subtract" extension.Mar 26 2015, 11:43 AM
vsukharev updated this object.
vsukharev edited edge metadata.
vsukharev added inline comments.Mar 26 2015, 12:45 PM
lib/Target/AArch64/AArch64InstrFormats.td
8555 ↗(On Diff #22387)

Sorry, will be changed in next revision.

8607 ↗(On Diff #22387)

Oops, good catch.
That's the correct pattern that is supposed to be here, but cannot be compiled due to problem with first part(matching), not with second part(generating)

// FIXME: this cannot be processed by TableGen
// error: In SQRDMLAHanonymous_913: Type inference contradiction found, 
// merging '{i32:i64:v8i8:v16i8:v4i16:v8i16:v2i32:v4i32:v1i64:v2i64}' into 'i16'
// error: In SQRDMLAHanonymous_913: Type inference contradiction found, merging 'i16' into 'f16'
//def : Pat<(i16 (Accum (i16 FPR16Op:$Rd),
//                      (i16 (vector_extract
//                             (v8i16 (insert_subvector
//                                      (undef),
//                                      (v4i16 (int_aarch64_neon_sqrdmulh 
//                                               (v4i16 V64:$Rn),
//                                               (v4i16 (AArch64duplane16 
//                                                        (v8i16 V128_lo:$Rm),
//                                                        VectorIndexH:$idx)))),
//                                    (i32 0))),
//                             (i64 0))))),
//          (EXTRACT_SUBREG
//              (v4i16 (!cast<Instruction>(NAME # v4i16_indexed)
//                        (v4i16 (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)),
//                                              FPR16Op:$Rd,
//                                              ssub)),
//                        V64:$Rn,
//                        V128_lo:$Rm, 
//                        VectorIndexH:$idx)),
//              ssub)>;

Test for it: test_sqrdmlsh_extracted_lane_s16 (see below)

8636 ↗(On Diff #22387)

Also, that's a non-compilable pattern, supposed to be here and tested by test_sqrdmlahq_extracted_lane_s16

// FIXME: this cannot be processed by TableGen
// error: In SQRDMLAHanonymous_913: Type inference contradiction found, 
// merging '{i32:i64:v8i8:v16i8:v4i16:v8i16:v2i32:v4i32:v1i64:v2i64}' into 'i16'
// error: In SQRDMLAHanonymous_913: Type inference contradiction found, merging 'i16' into 'f16'
//def : Pat<(i16 (Accum (i16 FPR16Op:$Rd),
//                      (i16 (vector_extract 
//                             (v8i16 (int_aarch64_neon_sqrdmulh 
//                                      (v8i16 V128:$Rn),
//                                      (v8i16 (AArch64duplane16 
//                                               (v8i16 V128_lo:$Rm),
//                                               VectorIndexH:$idx)))),
//                             (i64 0))))),
//          (EXTRACT_SUBREG
//              (v8i16 (!cast<Instruction>(NAME # v8i16_indexed)
//                        (v8i16 (INSERT_SUBREG (v8i16 (IMPLICIT_DEF)),
//                                              FPR16Op:$Rd, 
//                                              ssub)), 
//                        V128:$Rn,
//                        V128_lo:$Rm, 
//                        VectorIndexH:$idx)),
//              ssub)>;
8657 ↗(On Diff #22387)

weird extra node (v4i32 (insert_subvector (undef),(2i32... is inserted to this DAG, because extact_subvector is illegal from 2i32. It is legal only from 4i32.
That could be a bug of higher design level, do you have any thoughts?

Meanwhile this pattern successully matches DAG, that we have for explicit test "test_sqrdmlah_extracted_lane_s32" (see comment below)

8714 ↗(On Diff #22387)

As I commented above, it's a problem of another kind: namely, Tablegen cannot generate matcher for snippet

Accum (i16 FPR16Op:$Rd), (i16 (int_aarch64_neon_sqrdmulh....

because of error

SQRDMLAHi16_indexed: 	(set FPR16Op:i16:$dst, (intrinsic_wo_chain:{i32:i64:v8i8:v16i8:v4i16:v8i16:v2i32:v4i32:v1i64:v2i64} 117:iPTR, FPR16Op:<empty>:$Rd, (intrinsic_wo_chain:i16 122:<empty>, FPR16Op:i16:$Rn, (vector_extract:i16 V128_lo:v8i16:$Rm, (imm:i64)<<P:Predicate_VectorIndexH>>:$idx))))
Included from /work/llvm-rw/lib/Target/AArch64/AArch64.td:58:
/work/llvm-rw/lib/Target/AArch64/AArch64InstrInfo.td:4357:1: error: In SQRDMLAHi16_indexed: Type inference contradiction found, merging 'f16' into 'i16'
defm SQRDMLAH : SIMDIndexedSQRDMLxHSDTied<1, 0b1101, "sqrdmlah",
^
Included from /work/llvm-rw/lib/Target/AArch64/AArch64.td:58:
Included from /work/llvm-rw/lib/Target/AArch64/AArch64InstrInfo.td:283:
/work/llvm-rw/lib/Target/AArch64/AArch64InstrFormats.td:8737:3: note: instantiated from multiclass
  def i16_indexed : BaseSIMDIndexedTied<1, U, 1, 0b01, opc,
  ^
lib/Target/AArch64/AArch64InstrInfo.td
3086 ↗(On Diff #22387)

(the same as discussion above)

test/CodeGen/AArch64/arm64-neon-v8.1a.ll
230 ↗(On Diff #22387)

test for second non-compilable pattern, marked with // FIXME: this cannot be processed by TableGen

241 ↗(On Diff #22387)

That's a test for really weird pattern above to match weird extra insert_subvector in DAG

269 ↗(On Diff #22387)

test for first non-compilable pattern, marked with // FIXME: this cannot be processed by TableGen

Hi Vladimir,

I've checked out your patch and fiddled around with it. It is possible, but ugly, to match your unmatchable pattern.

First, we need to properly legalize the intrinsic. It has type i16 (and takes i16 arguments). I16 is illegal so needs to be promoted, but the generic code can't promote it for you so we need to do it ourselves. There are two ways to do this: either create a new AArch64ISD:: node for this operation or operate on ISD::INTRINSIC_WO_CHAIN nodes themselves. For simplicity I've done the latter.

First, we need to tell the target-agnostic gubbins that we want to custom lower intrinsic nodes:

// Somewhere near AArch64ISelLowering.cpp:120
setOperationAction(ISD::INTRINSIC_WO_CHAIN, MVT::i16, Custom);

Now we need to implement the custom lowering.

// In AArch64ISelLowering.cpp , function ReplaceNodeResults() 
  case ISD::INTRINSIC_WO_CHAIN: {                                                                                                                                                         
    auto ID = getIntrinsicID(N);
    if ((ID == Intrinsic::aarch64_neon_sqrdmulh ||
         ID == Intrinsic::aarch64_neon_sqadd) &&
        N->getValueType(0) == MVT::i16) {
      // Promote to i32.
      SDLoc DL(N);                                                                                                                                                  
                                                                                                                                                                    
      auto Op0 = DAG.getNode(ISD::ANY_EXTEND, DL, MVT::i32, N->getOperand(1));                                                                                      
      auto Op1 = DAG.getNode(ISD::ANY_EXTEND, DL, MVT::i32, N->getOperand(2));                                                                                      
                                                                                                                                                                    
      auto NN = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, MVT::i32,                                                                                                  
                            DAG.getConstant(ID, MVT::i32),                                                                                                          
                            Op0, Op1);                                                                                                                              
      NN = DAG.getNode(ISD::TRUNCATE, DL, MVT::i16, NN);                                                                                                            
      Results.push_back(NN);                                                                                                                                        
    }                                                                                                                                                               
    return;                                                                                                                                                         
  }

With this change, we can get code that at least doesn't crash:

umov    w8, v0.h[3]
fmov    s0, w0
fmov    s1, w1
fmov    s2, w8
sqrdmlah        s0, s1, s2
fmov    w0, s0
ret

That uses the i32 variant of the sqrdmlah instruction. We need to do at least this much, I think, because we can't have intrinsics that just crash the compiler.

Now, matching the pattern. The pattern we need to match is basically the same as the i32_indexed version of the pattern, but with a v8i16 instead of v4i32 type:

def : Pat<(i32 (Accum (i32 FPR32Op:$Rd),                                                                                                                              
                 (i32 (int_aarch64_neon_sqrdmulh                                                                                                                    
                   (i32 FPR32Op:$Rn),                                                                                                                               
                   (i32 (vector_extract (v8i16 V128:$Rm),                                                                                                           
                                        VectorIndexH:$idx)))))),

But the pattern to generate is even uglier still. This is the best i've got:

(COPY_TO_REGCLASS (f32 (INSERT_SUBREG (IMPLICIT_DEF),                                                                                                       
               (!cast<Instruction>(NAME#"i16_indexed")                                                                                                      
                 (EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS FPR32Op:$Rd, FPR32)), hsub),                                                                        
                 (EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS FPR32Op:$Rn, FPR32)), hsub),                                                                        
                 V128:$Rm, VectorIndexH:$idx),                                                                                                              
               hsub)), FPR32)>;

All the operands are going to be i32 types, so we need to make sure they're in the FPR32 register bank before we try and take the "hsub" subregister from them. That's the COPY_TO_REGCLASS nodes. We will then end up with an f32 type, which in order to merge it into the i32 the pattern must return, I've added another COPY_TO_REGCLASS so the return value of the entire pattern is merely FPR32 (which both i32 and f32 can be allocated to).

This produces:

 fmov    s1, w1
fmov    s2, w0
sqrdmlah        h2, h1, v0.h[3]
fmov    w0, s2
ret

Which is what we want. It can also produce chained sqrdmlah's, such as:

fmov    s1, w1
fmov    s2, w0
sqrdmlah        h2, h1, v0.h[3]
sqrdmlah        h2, h1, v0.h[2]
fmov    w0, s2
ret

So I think this is certainly a valid way of implementing those intrinsics.

I'm not sure the best way forward here - @Tim, would you mind please checking my tomfoolery above and see if you agree or not? If so, implementing these is quite involved so possibly would be better done in a separate patch.

Cheers,

James

vsukharev added a comment.EditedMar 27 2015, 9:12 AM
  1. In AArch64ISelLowering.cpp , function ReplaceNodeResults() , we need also "sqsub"
  2. I don't think i32 sqrdmlah() will work right, even if we'd replace ISD::TRUNCATE with SQXTN. Is the following correct?

i16 sqrdmlah (0, 100, 1000) -> 0 sqadd (100 sqrdmulh 1000) -> 0 sqadd (high i16 half of 100000) -> 0 sqadd 1 -> 1 - we need to obtain that with workaround...
i32 sqrdmlah (0, 100, 1000) -> 0 sqadd (high i32 half of 100000) -> 0
nope, workaround does not seem to be right

vsukharev updated this revision to Diff 22940.Mar 31 2015, 5:41 AM
vsukharev updated this object.
vsukharev edited edge metadata.
  • commented out patterns are removed
  • commented out tests are rewritten from illegal IR to clang-style
jmolloy accepted this revision.Mar 31 2015, 5:43 AM
jmolloy edited edge metadata.

Hi Vladimir,

Thanks for doing that. It looks a lot better without the nasty selection logic, and we can just fix up vector instructions to scalar ones in the AdvSIMD pass if needed, as Tim suggested.

LGTM.

Cheers,

James

This revision is now accepted and ready to land.Mar 31 2015, 5:43 AM
This revision was automatically updated to reflect the committed changes.