
Add arm_neon.intr.sdot operation

Authored by asaadaldien on Mar 8 2021, 10:18 AM.


  • Create and move ops with ISA compatibility to arm_neon.intr.*
  • Add arm_neon.intr.sdot

Diff Detail

Event Timeline

asaadaldien created this revision.Mar 8 2021, 10:18 AM
asaadaldien requested review of this revision.Mar 8 2021, 10:18 AM
aartbik added inline comments.Mar 8 2021, 5:20 PM

nit: OneResult is a bit more consistent with naming above


very minor nit: the summary above lists just the "opcode" name.

I am fine either way, but we probably want to make this consistent (with either more detail there or less here)


ftynse added inline comments.Mar 9 2021, 12:51 AM

You need to encode this in the verifier (potentially using the TypesMatchWith trait). Otherwise, the op accepts any combination of lengths, so (vector<4xi32>, vector<8xi8>, vector<8xi8>) -> vector<16xi32> is currently accepted.
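Concretely (a hypothetical snippet using the generic op form, since the custom assembly format is not shown here), nothing in the current definition rejects IR like:

```mlir
// 8 i8 lanes per operand can only yield 2 i32 dot products, but neither
// the 4xi32 accumulator nor the 16xi32 result is currently rejected:
// no trait ties the operand and result lengths together.
%res = "arm_neon.intr.sdot"(%a, %b, %c) :
    (vector<4xi32>, vector<8xi8>, vector<8xi8>) -> vector<16xi32>
```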


Nit: since b, c and a, res have pairwise equal types, you don't need to list all of them


Re op semantics, we have a choice between using a 1-1 mapping to ARM or making it more MLIR codegen friendly.
The problem I see with just adopting the ARM semantics is that the details required to map to this instruction will leak to higher levels of codegen.
I am afraid this will hamper retargetability of the vector dialect to other targets such as GPU and xPUs.

Given the way we represent MLIR vectors, I'd much rather make the vector<4xi8> part explicit and use the ARM op as the means to hide the abstraction gap between 1-D flattened (HW-detail) and 2-D vectors (MLIR-representation):

(vector<2xi32>, vector<2x4xi8>, vector<2x4xi8>) -> vector<2xi32>

Now I see the issue here: vector<2x4xi8> is not a native LLVM type and it "won't just work" magically.
@ftynse do you see opportunities to use some of your recent data layout work to have something nice here?

Additionally, it is possible (likely) that we will also need better vector-level abstractions and canonicalizations for flattening / unflattening between n-D and 1-D.
It will be a bit more work but I think it is worth it.

Also, note the vector.contract semantics is %lhs, %rhs, %acc.
Can we be consistent with it and consider that the ARM dialect bridges the abstraction gap between MLIR and Neon intrinsics but that it is still an MLIR abstraction?

The intrinsics page linked in the op doc does not mandate a form (I understand the ISA does but, like intrinsics, MLIR is closer to user / programming level than HW in this case):

uint8x8_t vadd_u8 (uint8x8_t a, uint8x8_t b)
uint8x16_t vaddq_u8 (uint8x16_t a, uint8x16_t b)

I would just represent it in the retargetable codegen-friendly way suggested above.


Isn't the above enough?

AllTypesMatch<["b", "c"]>,
AllTypesMatch<["a", "res"]>

I'd just make it consistent with vector.contract.

ftynse added inline comments.Mar 9 2021, 2:18 AM

It's MLIR, we don't have to choose :)

We do want something that maps 1-1 to LLVM IR intrinsics for translation simplicity purposes. This doesn't prevent us from having a slightly higher-level op that is easier to target. So we can have arm_neon.sdot that works on 2D types _and_ an arm_neon.intr.sdot that works on 1D types and maps directly to the intrinsic plus a simple conversion between the two that flattens the vector. This isn't a no-op conversion that we used to have between the llvm and non-llvm version of the dialect as it actually inserts the flattening op. If we start having several such ops, we can automate the definition and generate conversions at the ODS level. I am leaning in a similar direction for other "intrinsic" dialects.
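As a sketch of that two-level design (the 2-D op form is hypothetical and both assembly formats are assumed, not verbatim; vector.shape_cast is the existing vector dialect op for reshaping), the conversion between the two ops would materialize the flattening explicitly:

```mlir
// Codegen-friendly op on 2-D vectors (hypothetical form):
%d = arm_neon.sdot %acc, %lhs, %rhs :
       vector<2x4xi8>, vector<2x4xi8> to vector<2xi32>

// ...converts to the 1-D intrinsic op plus explicit flattening casts:
%l  = vector.shape_cast %lhs : vector<2x4xi8> to vector<8xi8>
%r  = vector.shape_cast %rhs : vector<2x4xi8> to vector<8xi8>
%d2 = arm_neon.intr.sdot %acc, %l, %r :
        vector<8xi8>, vector<8xi8> to vector<2xi32>
```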


Actually, the comment looks wrong (vector<4xi32>, vector<16xi8>, vector<16xi8>) -> vector<16xi32> wouldn't pass the AllTypesMatch<["a", "res"]> verifier.

I'll revise my example to (vector<2xi32>, vector<16xi8>, vector<16xi8>) -> vector<2xi32>, which does pass the current verifier but should not. There is nothing that guarantees only co-indexed length values from VectorOfLengthAndType are chosen.



Then I'd just suggest turning this op into arm_neon.intr.sdot to signify it is an implementation detail, and making another arm_neon.sdot that implements the codegen-friendly suggestion above (can be in a separate CL).


Ah indeed, good catch.

@asaadaldien here is an example from AVX512:

def MaskRndScaleOp : AVX512_Op<"mask.rndscale", [NoSideEffect,
  AllTypesMatch<["src", "a", "dst"]>,
  TypesMatchWith<"imm has the same number of bits as elements in dst",
                 "dst", "imm",
                 "IntegerType::get($_self.getContext(), "
                 "($_self.cast<VectorType>().getShape()[0]))">]> {
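Following that pattern, a sketch of the analogous constraint for sdot (the base class name and operand names here are assumptions, not the final form) could tie the i8 operand length to the i32 result length, since each i32 lane accumulates a dot product of four i8 lanes:

```tablegen
// Hypothetical sketch: derive b's type from res (4x the lanes, i8 elements).
def SdotOp : ArmNeon_IntrOp<"sdot", [NoSideEffect,
  AllTypesMatch<["a", "res"]>,
  AllTypesMatch<["b", "c"]>,
  TypesMatchWith<"b has four times as many elements as res",
                 "res", "b",
                 "VectorType::get("
                 "{4 * $_self.cast<VectorType>().getShape()[0]},"
                 "IntegerType::get($_self.getContext(), 8))">]> {
```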
asaadaldien added inline comments.Mar 9 2021, 10:15 AM

@nicolasvasilache, if we split the dialect into arm_neon.* and arm_neon.intr.*, will we need to write arm_neon.* -> arm_neon.* transformations? I think most of these transformations, if needed, aren't layout specific and would be better placed above, at the vector dialect level. The flat 1-D vectors here aren't crossing the vector.contract -> neon.sdot pattern boundary.


Good catch @ftynse, thanks @nicolasvasilache for the example.

ftynse added inline comments.Mar 9 2021, 10:30 AM

I would suggest having the */intr dichotomy only when it is strictly necessary. We will have to write in-dialect transformations. They will flatten the vectors, and I would expect the flattening to be common across different ArmNeon ops, but potentially different for other dialects that use vector types (e.g., the GPU mma fragment is opaque and may be targeted from vectors). In this light, it makes sense to me to put the flattening in the dialect. This doesn't prevent us from having some VectorUtils that support it in a generic way and are called by concrete dialects.

asaadaldien marked an inline comment as not done.

Constrain elements and change assembly format

asaadaldien marked 2 inline comments as done and an inline comment as not done.Mar 9 2021, 11:16 AM
asaadaldien added inline comments.

If we have the arm_neon dialect operating on flattened vectors, the only time we need to do this flattening is when the dialect conversion vector.op_x -> arm_neon.op_y runs. I am trying to understand why we need the dialect ops to exist in two groups, arm_neon.op_x_with_nd_vectors and arm_neon.intr.op_with_isa_like_1d_vec?

ftynse added inline comments.Mar 10 2021, 1:14 AM

We get two trivial conversions: vector.op_x_with_nd_vectors to arm_neon.op_x_with_nd_vectors followed by arm_neon.op_x_with_nd_vectors to arm_neon.op_x_with_1d_vectors - instead of one larger, non-trivial conversion. Having an nD abstraction also helps if we want to program at a level slightly above the intrinsics but not at vector dialect level, e.g., manual performance benchmarking.

Move ISA compatible ops into neon.intr.*

asaadaldien added inline comments.Mar 10 2021, 11:16 AM

Added arm_neon.intr.* to the ops we have so far.

Thanks @ftynse, @nicolasvasilache; rethinking this, I can see how useful the breakdown is:

e.g. the vector.* -> arm_neon.op* dialect conversion can be done independently of vector length, while the arm_neon.* -> arm_neon.intr.op* pattern rewrite is the part that is ISA-aware and can host other ISA-specific transformations (e.g. vector padding).

asaadaldien retitled this revision from Add arm_neon.sdot operation to Add arm_neon.intr.sdot operation.Mar 10 2021, 11:23 AM
asaadaldien edited the summary of this revision. (Show Details)
ftynse accepted this revision.Mar 17 2021, 5:47 AM
This revision is now accepted and ready to land.Mar 17 2021, 5:47 AM
nicolasvasilache accepted this revision.Mar 17 2021, 6:06 AM
This revision was landed with ongoing or failed builds.Mar 17 2021, 8:26 AM
This revision was automatically updated to reflect the committed changes.