Details

Reviewers

ftynse
nicolasvasilache
aartbik

Commits

rGf5963944d97d: Add arm_neon.sdot operation

Summary

Create and move ops with ISA compatibility to arm_neon.intr.*
Add arm_neon.intr.sdot

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

asaadaldien created this revision. Mar 8 2021, 10:18 AM

Herald added subscribers: dcaballe, cota, teijeong and 17 others. Mar 8 2021, 10:18 AM

asaadaldien requested review of this revision. Mar 8 2021, 10:18 AM

Herald added a project: Restricted Project. Mar 8 2021, 10:18 AM

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

asaadaldien added reviewers: ftynse, nicolasvasilache, aartbik. Mar 8 2021, 10:25 AM

Harbormaster completed remote builds in B92700: Diff 329063. Mar 8 2021, 3:20 PM

aartbik added inline comments. Mar 8 2021, 5:20 PM

mlir/include/mlir/Dialect/ArmNeon/ArmNeon.td
58	nit: OneResult is a bit more consistent with naming above
97	very minor nit: the summary above just lists "opcode" name. I am fine either way, but we probably want to make this consistent (with either more there or less here)


ftynse added inline comments. Mar 9 2021, 12:51 AM

mlir/include/mlir/Dialect/ArmNeon/ArmNeon.td
106–108	You need to encode this in verifier (potentially using the `TypesMatchWith` trait). Otherwise, the op accepts any combination, so `(vector<4xi32>, vector<8xi8>, vector<8xi8>) -> vector<16xi32>` is currently accepted.
114	Nit: since `b`, `c` and `a`, `res` have pairwise equal types, you don't need to list all of them

nicolasvasilache added inline comments. Mar 9 2021, 1:37 AM

mlir/include/mlir/Dialect/ArmNeon/ArmNeon.td
101	Re op semantics, we have a choice of using 1-1 mapping to ARM or making it more MLIR codegen friendly. The problem I see with just adopting the ARM semantics is that the details required to map to this instruction will leak to higher levels of codegen. I am afraid this will hamper retargetability of the vector dialect to other targets such as GPU and xPUs.

Given the way we represent MLIR vectors, I'd much rather make the `vector<4xi8>` part explicit and use the ARM op as the means to hide the abstraction gap between 1-D flattened (HW-detail) and 2-D vectors (MLIR-representation):

    (vector<2xi32>, vector<2x4xi8>, vector<2x4xi8>) -> vector<2xi32>

Now I see the issue here: vector<2x4xi8> is not a native LLVM type and it "won't just work" magically. @ftynse do you see opportunities to use some of your recent data layout work to have something nice here? Additionally, it is possible (likely) that we will also need better vector-level abstractions and canonicalizations for flattening / unflattening between n-D and 1-D. It will be a bit more work but I think it is worth it.

Also, note the vector.contract semantics is `%lhs, %rhs, %acc`. Can we be consistent with it and consider that the ARM dialect bridges the abstraction gap between MLIR and Neon intrinsics but that it is still an MLIR abstraction? The intrinsics page linked in the op doc does not mandate a form (I understand the ISA does but, like intrinsics, MLIR is closer to user / programming level than HW in this case):

    uint8x8_t vadd_u8 (uint8x8_t a, uint8x8_t b)
    uint8x16_t vaddq_u8 (uint8x16_t a, uint8x16_t b)

I would just represent it in the retargetable codegen-friendly way suggested above.
106–108	Isn't the above enough? AllTypesMatch<["b", "c"]>, AllTypesMatch<["a", "res"]>
114	I'd just make it consistent with vector.contract.

ftynse added inline comments. Mar 9 2021, 2:18 AM

mlir/include/mlir/Dialect/ArmNeon/ArmNeon.td
101	It's MLIR, we don't have to choose :) We do want something that maps 1-1 to LLVM IR intrinsics for translation simplicity purposes. This doesn't prevent us from having a slightly higher-level op that is easier to target. So we can have `arm_neon.sdot` that works on 2D types _and_ an `arm_neon.intr.sdot` that works on 1D types and maps directly to the intrinsic plus a simple conversion between the two that flattens the vector. This isn't a no-op conversion that we used to have between the llvm and non-llvm version of the dialect as it actually inserts the flattening op. If we start having several such ops, we can automate the definition and generate conversions at the ODS level. I am leaning in a similar direction for other "intrinsic" dialects.
106–108	Actually, the comment looks wrong `(vector<4xi32>, vector<16xi8>, vector<16xi8>) -> vector<16xi32>` wouldn't pass the `AllTypesMatch<["a", "res"]>` verifier. I'll revise my example to `(vector<2xi32>, vector<16xi8>, vector<16xi8>) -> vector<2xi32>`, which does pass the current verifier but should not. There is nothing that guarantees only co-indexed length values from `VectorOfLengthAndType` are chosen.

nicolasvasilache added inline comments. Mar 9 2021, 2:32 AM

mlir/include/mlir/Dialect/ArmNeon/ArmNeon.td
101	WFM! Then I'd just suggest turning this op into arm_neon.intrin.sdot to signify it is an implementation detail, and making another arm_neon.sdot that implements the codegen-friendly suggestion above (can be in a separate CL).
106–108	Ah indeed, good catch. @asaadaldien here is an example from AVX512:

    def MaskRndScaleOp : AVX512_Op<"mask.rndscale", [NoSideEffect,
      AllTypesMatch<["src", "a", "dst"]>,
      TypesMatchWith<"imm has the same number of bits as elements in dst",
                     "dst", "imm",
                     "IntegerType::get($_self.getContext(), "
                     "($_self.cast<VectorType>().getShape()[0]))">]> {
    ...
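To make the verifier gap discussed in this thread concrete, here is a minimal Python sketch. It is not MLIR code: vector types are modelled as `(length, element_type)` tuples, and the helper names `operand_constraints_ok`, `all_types_match_only` and `with_types_match_with` are hypothetical. The last helper assumes the expected `res` type is derived from `b` by dividing its length by 4, which is one way a `TypesMatchWith` constraint could close the gap.

```python
# Model a 1-D vector type as (number_of_elements, element_type_name).
A_RES_LENGTHS = (4, 2)   # lengths allowed for `a` and `res` (i32 elements)
B_C_LENGTHS = (16, 8)    # lengths allowed for `b` and `c` (i8 elements)


def operand_constraints_ok(a, b, c, res):
    """Per-operand checks in the spirit of VectorOfLengthAndType."""
    return (
        a[0] in A_RES_LENGTHS and a[1] == "i32"
        and res[0] in A_RES_LENGTHS and res[1] == "i32"
        and b[0] in B_C_LENGTHS and b[1] == "i8"
        and c[0] in B_C_LENGTHS and c[1] == "i8"
    )


def all_types_match_only(a, b, c, res):
    """Only the two pairwise checks: b == c and a == res."""
    return operand_constraints_ok(a, b, c, res) and b == c and a == res


def with_types_match_with(a, b, c, res):
    """Additionally require res == vector<(len(b) / 4) x i32> (assumed rule)."""
    expected_res = (b[0] // 4, "i32")
    return all_types_match_only(a, b, c, res) and res == expected_res


# (vector<2xi32>, vector<16xi8>, vector<16xi8>) -> vector<2xi32>
bad = ((2, "i32"), (16, "i8"), (16, "i8"), (2, "i32"))
# (vector<4xi32>, vector<16xi8>, vector<16xi8>) -> vector<4xi32>
good = ((4, "i32"), (16, "i8"), (16, "i8"), (4, "i32"))

print(all_types_match_only(*bad))    # pairwise checks alone accept it
print(with_types_match_with(*bad))   # the length relation rejects it
print(with_types_match_with(*good))  # the consistent combination passes
```

The point of the sketch is only that pairwise equality says nothing about how the length of `b` relates to the length of `res`; some constraint has to tie the two groups together.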

Harbormaster completed remote builds in B92796: Diff 329211. Mar 9 2021, 4:34 AM

asaadaldien added inline comments. Mar 9 2021, 10:15 AM

mlir/include/mlir/Dialect/ArmNeon/ArmNeon.td
101	@nicolasvasilache, if we split the dialect into `arm_neon.*` and `arm_neon.intr.*`, will we need to write `arm_neon.* -> arm_neon.*` transformations? I think most of these transformations, if needed, aren't layout specific and would be better placed above, at the vector dialect level. The flat 1-D vectors here aren't crossing the `vector.contract -> neon.sdot` pattern boundary.
106–108	Good catch @ftynse , Thanks @nicolasvasilache for the example.

ftynse added inline comments. Mar 9 2021, 10:30 AM

mlir/include/mlir/Dialect/ArmNeon/ArmNeon.td
101	I would suggest having the */intr dichotomy only when it is strictly necessary. We will have to write in-dialect transformations. They will flatten the vectors, and I would expect the flattening to be common across ArmNeon, but potentially different for other dialects that use vector types (e.g., GPU mmafragment is opaque and may be targeted from vectors). In this light, it makes sense to me to put the flattening in the dialect. This doesn't prevent us from having some VectorUtils that support it in a generic way and are called by concrete dialects.

Constrain elements and change assembly format

asaadaldien marked 2 inline comments as done and an inline comment as not done. Mar 9 2021, 11:16 AM

asaadaldien added inline comments.

mlir/include/mlir/Dialect/ArmNeon/ArmNeon.td
101	If we have the `arm_neon` dialect operating on flattened vectors, the only time we need to do this flattening is during the dialect conversion `vector.op_x -> arm_neon.op_y`. I am trying to understand why we need the dialect to exist in two groups, `arm_neon.op_x_with_nd_vectors` and `arm_neon.intr.op_with_isa_like_1d_vec`?

Harbormaster completed remote builds in B92923: Diff 329408. Mar 9 2021, 10:32 PM

ftynse added inline comments. Mar 10 2021, 1:14 AM

mlir/include/mlir/Dialect/ArmNeon/ArmNeon.td
101	We get two trivial conversions: `vector.op_x_with_nd_vectors` to `arm_neon.op_x_with_nd_vectors` followed by `arm_neon.op_x_with_nd_vectors` to `arm_neon.op_x_with_1d_vectors` - instead of one larger, non-trivial conversion. Having an nD abstraction also helps if we want to program at a level slightly above the intrinsics but not at vector dialect level, e.g., manual performance benchmarking.
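As an illustration of the flattening step discussed here, the following Python sketch models a 2-D vector as a list of rows and shows the row-major correspondence between an n-D form such as `vector<2x4xi8>` and its 1-D form `vector<8xi8>`. The helper names `flatten` and `unflatten` are hypothetical and are not MLIR APIs; the row-major layout is an assumption of this sketch.

```python
def flatten(rows):
    """Flatten a 2-D vector (list of equal-length rows) in row-major order."""
    return [x for row in rows for x in row]


def unflatten(flat, inner):
    """Split a 1-D vector back into rows of `inner` elements each."""
    if inner <= 0 or len(flat) % inner != 0:
        raise ValueError("length must be a positive multiple of the row size")
    return [flat[i:i + inner] for i in range(0, len(flat), inner)]


# Plays the role of a vector<2x4xi8> value.
rows = [[1, 2, 3, 4], [5, 6, 7, 8]]

flat = flatten(rows)            # plays the role of vector<8xi8>
restored = unflatten(flat, 4)   # back to the 2x4 form

# Row i of the 2-D form is the i-th group of four in the 1-D form.
print(flat)
print(restored == rows)
print(rows[1] == flat[4:8])
```

Under this layout, the groups of four consecutive elements that the 1-D op consumes are exactly the rows of the 2-D operand, which is why the conversion between the two forms can stay trivial.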

Move ISA compatible ops into neon.intr.*

asaadaldien added inline comments. Mar 10 2021, 11:16 AM

mlir/include/mlir/Dialect/ArmNeon/ArmNeon.td
101	Added `arm_neon.intr.*` to the ops we have so far. Thanks @ftynse, @nicolasvasilache. Thinking about it again, I can see how useful the breakdown is: e.g. the `vector.* -> arm_neon.op*` dialect conversion can be done independently of vector length, while the `arm.* -> arm_neon.intr.op*` pattern rewrite is the part that is ISA aware and can host other ISA-specific transformations (e.g. vector padding).

asaadaldien retitled this revision from Add arm_neon.sdot operation to Add arm_neon.intr.sdot operation. Mar 10 2021, 11:23 AM

asaadaldien edited the summary of this revision.

Harbormaster completed remote builds in B93131: Diff 329713. Mar 10 2021, 10:13 PM

nicolasvasilache mentioned this in D98470: [mlir][amx] Add Intel AMX dialect (architectural-specific vector dialect). Mar 13 2021, 5:04 AM

ftynse accepted this revision. Mar 17 2021, 5:47 AM

This revision is now accepted and ready to land. Mar 17 2021, 5:47 AM

nicolasvasilache accepted this revision. Mar 17 2021, 6:06 AM

This revision was landed with ongoing or failed builds. Mar 17 2021, 8:26 AM

Closed by commit rGf5963944d97d: Add arm_neon.sdot operation (authored by asaadaldien).

This revision was automatically updated to reflect the committed changes.

asaadaldien added a commit: rGf5963944d97d: Add arm_neon.sdot operation.

aartbik mentioned this in D100593: [mlir][vector][avx] add AVX dot product to X86Vector dialect with lowering. Apr 15 2021, 12:38 PM

Diff 331274

mlir/include/mlir/Dialect/ArmNeon/ArmNeon.td

Show All 33 Lines
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

// ArmNeon dialect op that corresponds (and is convertible to) an LLVM IR		// ArmNeon dialect op that corresponds (and is convertible to) an LLVM IR
// intrinsic.		// intrinsic.
class ArmNeon_IntrOp<string mnemonic, list<int> overloadedResults,		class ArmNeon_IntrOp<string mnemonic, list<int> overloadedResults,
list<int> overloadedOperands, int numResults,		list<int> overloadedOperands, int numResults,
list<OpTrait> traits = [], bit requiresAccessGroup = 0>		list<OpTrait> traits = [], bit requiresAccessGroup = 0>
: LLVM_IntrOpBase</*dialect=*/ArmNeon_Dialect,		: LLVM_IntrOpBase</*dialect=*/ArmNeon_Dialect,
/*opName=*/mnemonic,		/*opName=*/"intr." # mnemonic,
/*enumName=*/"aarch64_neon_" # !subst(".", "_", mnemonic),		/*enumName=*/"aarch64_neon_" # !subst(".", "_", mnemonic),
/*overloadedResults=*/overloadedResults,		/*overloadedResults=*/overloadedResults,
/*overloadedOperands=*/overloadedOperands,		/*overloadedOperands=*/overloadedOperands,
/*traits=*/traits,		/*traits=*/traits,
/*numResults=*/numResults,		/*numResults=*/numResults,
/*requiresAccessGroup=*/requiresAccessGroup>;		/*requiresAccessGroup=*/requiresAccessGroup>;

// ArmNeon dialect op that corresponds to an LLVM IR intrinsic with one		// ArmNeon dialect op that corresponds to an LLVM IR intrinsic with one
// overloaded result.		// overloaded result.
class ArmNeon_OverloadedOneResultIntrOp<string mnemonic,		class ArmNeon_OverloadedOneResultIntrOp<string mnemonic,
list<OpTrait> traits = []>		list<OpTrait> traits = []>
: ArmNeon_IntrOp<mnemonic, [0], [], 1, traits>;		: ArmNeon_IntrOp<mnemonic, [0], [], 1, traits>;

		// ArmNeon dialect op that corresponds to an LLVM IR intrinsic with one
		// overloaded result and overloaded operands list.
		class ArmNeon_OverloadedOperandsWithOneResultIntrOp<string mnemonic,
		list<int> overloadedOperands,
		list<OpTrait> traits = []>
		: ArmNeon_IntrOp<mnemonic, [0], overloadedOperands, 1, traits>;

def SMullOp : ArmNeon_OverloadedOneResultIntrOp<"smull", [		def SMullOp : ArmNeon_OverloadedOneResultIntrOp<"smull", [
NoSideEffect,		NoSideEffect,
AllTypesMatch<["a", "b"]>,		AllTypesMatch<["a", "b"]>,
TypesMatchWith<		TypesMatchWith<
"res has same vector shape and element bitwidth scaled by 2 as a",		"res has same vector shape and element bitwidth scaled by 2 as a",
"a", "res", "$_self.cast<VectorType>().scaleElementBitwidth(2)">		"a", "res", "$_self.cast<VectorType>().scaleElementBitwidth(2)">
]> {		]> {
let summary = "smull roundscale op";		let summary = "smull roundscale op";
Show All 13 Lines	def SMullOp : ArmNeon_OverloadedOneResultIntrOp<"smull", [
// (vector<2xi32>, vector<2xi32>) -> (vector<2xi64>)		// (vector<2xi32>, vector<2xi32>) -> (vector<2xi64>)
let arguments = (ins VectorOfLengthAndType<[8, 4, 2], [I8, I16, I32]>:$a,		let arguments = (ins VectorOfLengthAndType<[8, 4, 2], [I8, I16, I32]>:$a,
VectorOfLengthAndType<[8, 4, 2], [I8, I16, I32]>:$b);		VectorOfLengthAndType<[8, 4, 2], [I8, I16, I32]>:$b);
let results = (outs VectorOfLengthAndType<[8, 4, 2], [I16, I32, I64]>:$res);		let results = (outs VectorOfLengthAndType<[8, 4, 2], [I16, I32, I64]>:$res);
let assemblyFormat =		let assemblyFormat =
"$a `,` $b attr-dict `:` type($a) `to` type($res)";		"$a `,` $b attr-dict `:` type($a) `to` type($res)";
}		}

		def SdotOp : ArmNeon_OverloadedOperandsWithOneResultIntrOp<"sdot",[1], [
		NoSideEffect,
		AllTypesMatch<["b", "c"]>,
		AllTypesMatch<["a", "res"]>,
		TypesMatchWith<"res has the same number of elements as operand b",
		"b", "res",
		"VectorType::get({$_self.cast<VectorType>().getShape()[0] / 4},"
		"IntegerType::get($_self.getContext(), 32))">]> {
		let summary = "sdot op";
		let description = [{
		Signed integer addition of dot product (vector). This instruction performs
		the following operation on signed integer vectors: res = dot(b, c) + a,
		where vector operands are partitioned into groups of four elements.

		Source:
		https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics
		}];
		// Supports either:
		// (vector<2xi32>, vector<8xi8>, vector<8xi8>) -> vector<2xi32>
		// (vector<4xi32>, vector<16xi8>, vector<16xi8>) -> vector<16xi32>
		let arguments = (ins VectorOfLengthAndType<[4, 2], [I32]>:$a,
		VectorOfLengthAndType<[16, 8], [I8]>:$b,
		VectorOfLengthAndType<[16, 8], [I8]>:$c);
		let results = (outs VectorOfLengthAndType<[4, 2], [I32]>:$res);
		let assemblyFormat =
		"$a `,` $b `,` $c attr-dict `:` type($b) `,` type($c) `to` type($res)";
		}

#endif // ARMNEON_OPS		#endif // ARMNEON_OPS
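
The op description above states `res = dot(b, c) + a` over groups of four elements. As a reading aid, here is a small Python reference model of that statement. It is a sketch under stated assumptions, not the dialect's implementation: the function names `to_signed` and `sdot` are hypothetical, inputs are plain Python lists, and the accumulation is assumed to wrap to 32 bits rather than saturate.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= 1 << (bits - 1) else value


def sdot(a, b, c):
    """res[i] = a[i] + sum over j < 4 of b[4*i + j] * c[4*i + j].

    `a` holds the i32 accumulator lanes; `b` and `c` hold i8 elements.
    The sum is wrapped to 32 bits (an assumption of this sketch).
    """
    if len(b) != len(c) or len(b) != 4 * len(a):
        raise ValueError("expected len(b) == len(c) == 4 * len(a)")
    res = []
    for i, acc in enumerate(a):
        total = acc
        for j in range(4):
            total += to_signed(b[4 * i + j], 8) * to_signed(c[4 * i + j], 8)
        res.append(to_signed(total, 32))
    return res


# Shapes follow (vector<2xi32>, vector<8xi8>, vector<8xi8>) -> vector<2xi32>.
a = [0, 100]
b = [1, 2, 3, 4, -128, 127, 0, 1]
c = [1, 1, 1, 1, 2, 2, 5, -1]
print(sdot(a, b, c))
```

The length check mirrors the shape relation enforced by the op's type constraints: each 32-bit lane consumes exactly four 8-bit elements from each of `b` and `c`.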

mlir/test/Dialect/ArmNeon/roundtrip.mlir

	// RUN: mlir-opt -verify-diagnostics %s | mlir-opt | FileCheck %s			// RUN: mlir-opt -verify-diagnostics %s | mlir-opt | FileCheck %s

	// CHECK-LABEL: arm_neon_smull			// CHECK-LABEL: arm_neon_smull
	func @arm_neon_smull(%a: vector<8xi8>, %b: vector<8xi8>)			func @arm_neon_smull(%a: vector<8xi8>, %b: vector<8xi8>)
	-> (vector<8xi16>, vector<4xi32>, vector<2xi64>) {			-> (vector<8xi16>, vector<4xi32>, vector<2xi64>) {
	// CHECK: arm_neon.smull {{.*}}: vector<8xi8> to vector<8xi16>			// CHECK: arm_neon.intr.smull {{.*}}: vector<8xi8> to vector<8xi16>
	%0 = arm_neon.smull %a, %b : vector<8xi8> to vector<8xi16>			%0 = arm_neon.intr.smull %a, %b : vector<8xi8> to vector<8xi16>
	%00 = vector.extract_strided_slice %0 {offsets = [3], sizes = [4], strides = [1]}:			%00 = vector.extract_strided_slice %0 {offsets = [3], sizes = [4], strides = [1]}:
	vector<8xi16> to vector<4xi16>			vector<8xi16> to vector<4xi16>

	// CHECK: arm_neon.smull {{.*}}: vector<4xi16> to vector<4xi32>			// CHECK: arm_neon.intr.smull {{.*}}: vector<4xi16> to vector<4xi32>
	%1 = arm_neon.smull %00, %00 : vector<4xi16> to vector<4xi32>			%1 = arm_neon.intr.smull %00, %00 : vector<4xi16> to vector<4xi32>
	%11 = vector.extract_strided_slice %1 {offsets = [1], sizes = [2], strides = [1]}:			%11 = vector.extract_strided_slice %1 {offsets = [1], sizes = [2], strides = [1]}:
	vector<4xi32> to vector<2xi32>			vector<4xi32> to vector<2xi32>

	// CHECK: arm_neon.smull {{.*}}: vector<2xi32> to vector<2xi64>			// CHECK: arm_neon.intr.smull {{.*}}: vector<2xi32> to vector<2xi64>
	%2 = arm_neon.smull %11, %11 : vector<2xi32> to vector<2xi64>			%2 = arm_neon.intr.smull %11, %11 : vector<2xi32> to vector<2xi64>

	return %0, %1, %2 : vector<8xi16>, vector<4xi32>, vector<2xi64>			return %0, %1, %2 : vector<8xi16>, vector<4xi32>, vector<2xi64>
	}			}

				// CHECK-LABEL: arm_neon_sdot
				func @arm_neon_sdot(%a: vector<2xi32>, %b: vector<8xi8>, %c: vector<8xi8>) -> vector<2xi32> {
				// CHECK: arm_neon.intr.sdot {{.*}}: vector<8xi8>, vector<8xi8> to vector<2xi32>
				%0 = arm_neon.intr.sdot %a, %b, %c : vector<8xi8>, vector<8xi8> to vector<2xi32>
				return %0 : vector<2xi32>
				}

mlir/test/Target/LLVMIR/arm-neon.mlir

	// RUN: mlir-translate -mlir-to-llvmir %s | FileCheck %s			// RUN: mlir-translate -mlir-to-llvmir %s | FileCheck %s

	// CHECK-LABEL: arm_neon_smull			// CHECK-LABEL: arm_neon_smull
	llvm.func @arm_neon_smull(%arg0: vector<8xi8>, %arg1: vector<8xi8>) -> !llvm.struct<(vector<8xi16>, vector<4xi32>, vector<2xi64>)> {			llvm.func @arm_neon_smull(%arg0: vector<8xi8>, %arg1: vector<8xi8>) -> !llvm.struct<(vector<8xi16>, vector<4xi32>, vector<2xi64>)> {
	// CHECK: %[[V0:.*]] = call <8 x i16> @llvm.aarch64.neon.smull.v8i16(<8 x i8> %{{.*}}, <8 x i8> %{{.*}})			// CHECK: %[[V0:.*]] = call <8 x i16> @llvm.aarch64.neon.smull.v8i16(<8 x i8> %{{.*}}, <8 x i8> %{{.*}})
	// CHECK-NEXT: %[[V00:.*]] = shufflevector <8 x i16> %3, <8 x i16> %[[V0]], <4 x i32> <i32 3, i32 4, i32 5, i32 6>			// CHECK-NEXT: %[[V00:.*]] = shufflevector <8 x i16> %3, <8 x i16> %[[V0]], <4 x i32> <i32 3, i32 4, i32 5, i32 6>
	%0 = arm_neon.smull %arg0, %arg1 : vector<8xi8> to vector<8xi16>			%0 = arm_neon.intr.smull %arg0, %arg1 : vector<8xi8> to vector<8xi16>
	%1 = llvm.shufflevector %0, %0 [3, 4, 5, 6] : vector<8xi16>, vector<8xi16>			%1 = llvm.shufflevector %0, %0 [3, 4, 5, 6] : vector<8xi16>, vector<8xi16>

	// CHECK-NEXT: %[[V1:.*]] = call <4 x i32> @llvm.aarch64.neon.smull.v4i32(<4 x i16> %[[V00]], <4 x i16> %[[V00]])			// CHECK-NEXT: %[[V1:.*]] = call <4 x i32> @llvm.aarch64.neon.smull.v4i32(<4 x i16> %[[V00]], <4 x i16> %[[V00]])
	// CHECK-NEXT: %[[V11:.*]] = shufflevector <4 x i32> %[[V1]], <4 x i32> %[[V1]], <2 x i32> <i32 1, i32 2>			// CHECK-NEXT: %[[V11:.*]] = shufflevector <4 x i32> %[[V1]], <4 x i32> %[[V1]], <2 x i32> <i32 1, i32 2>
	%2 = arm_neon.smull %1, %1 : vector<4xi16> to vector<4xi32>			%2 = arm_neon.intr.smull %1, %1 : vector<4xi16> to vector<4xi32>
	%3 = llvm.shufflevector %2, %2 [1, 2] : vector<4xi32>, vector<4xi32>			%3 = llvm.shufflevector %2, %2 [1, 2] : vector<4xi32>, vector<4xi32>

	// CHECK-NEXT: %[[V1:.*]] = call <2 x i64> @llvm.aarch64.neon.smull.v2i64(<2 x i32> %[[V11]], <2 x i32> %[[V11]])			// CHECK-NEXT: %[[V1:.*]] = call <2 x i64> @llvm.aarch64.neon.smull.v2i64(<2 x i32> %[[V11]], <2 x i32> %[[V11]])
	%4 = arm_neon.smull %3, %3 : vector<2xi32> to vector<2xi64>			%4 = arm_neon.intr.smull %3, %3 : vector<2xi32> to vector<2xi64>

	%5 = llvm.mlir.undef : !llvm.struct<(vector<8xi16>, vector<4xi32>, vector<2xi64>)>			%5 = llvm.mlir.undef : !llvm.struct<(vector<8xi16>, vector<4xi32>, vector<2xi64>)>
	%6 = llvm.insertvalue %0, %5[0] : !llvm.struct<(vector<8xi16>, vector<4xi32>, vector<2xi64>)>			%6 = llvm.insertvalue %0, %5[0] : !llvm.struct<(vector<8xi16>, vector<4xi32>, vector<2xi64>)>
	%7 = llvm.insertvalue %2, %6[1] : !llvm.struct<(vector<8xi16>, vector<4xi32>, vector<2xi64>)>			%7 = llvm.insertvalue %2, %6[1] : !llvm.struct<(vector<8xi16>, vector<4xi32>, vector<2xi64>)>
	%8 = llvm.insertvalue %4, %7[2] : !llvm.struct<(vector<8xi16>, vector<4xi32>, vector<2xi64>)>			%8 = llvm.insertvalue %4, %7[2] : !llvm.struct<(vector<8xi16>, vector<4xi32>, vector<2xi64>)>

	// CHECK: ret { <8 x i16>, <4 x i32>, <2 x i64> }			// CHECK: ret { <8 x i16>, <4 x i32>, <2 x i64> }
	llvm.return %8 : !llvm.struct<(vector<8xi16>, vector<4xi32>, vector<2xi64>)>			llvm.return %8 : !llvm.struct<(vector<8xi16>, vector<4xi32>, vector<2xi64>)>
	}			}

				// CHECK-LABEL: arm_neon_sdot_i8i8
				llvm.func @arm_neon_sdot_i8i8(%a: vector<2xi32>, %b: vector<8xi8>, %c: vector<8xi8>) -> vector<2xi32> {
				// CHECK: %[[V0:.*]] = call <2 x i32> @llvm.aarch64.neon.sdot.v2i32.v8i8(<2 x i32> %{{.*}}, <8 x i8> %{{.*}}, <8 x i8> %{{.*}})
				// CHECK-NEXT: ret <2 x i32>
				%0 = arm_neon.intr.sdot %a, %b, %c : vector<8xi8>, vector<8xi8> to vector<2xi32>
				llvm.return %0 : vector<2xi32>
				}

				// CHECK-LABEL: arm_neon_sdot_i16i16
				llvm.func @arm_neon_sdot_i16i16(%a: vector<4xi32>, %b: vector<16xi8>, %c: vector<16xi8>) -> vector<4xi32> {
				// CHECK: %[[V0:.*]] = call <4 x i32> @llvm.aarch64.neon.sdot.v4i32.v16i8(<4 x i32> %{{.*}}, <16 x i8> %{{.*}}, <16 x i8> %{{.*}})
				// CHECK-NEXT: ret <4 x i32>
				%0 = arm_neon.intr.sdot %a, %b, %c : vector<16xi8>, vector<16xi8> to vector<4xi32>
				llvm.return %0 : vector<4xi32>
				}
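
The CHECK lines above expect intrinsic names with type suffixes, such as `llvm.aarch64.neon.sdot.v2i32.v8i8`. The Python sketch below shows one way to read those suffixes, assuming they are formed from the overloaded result at index 0 followed by the overloaded operand at index 1, matching the `[0]` and `[1]` lists passed to the op class in ArmNeon.td. The helpers `mangle_vector`, `intrinsic_name` and `sdot_intrinsic` are hypothetical names for illustration and do not correspond to LLVM or MLIR APIs.

```python
def mangle_vector(length, elem_bits):
    """Suffix for a fixed vector of integers, e.g. (8, 8) -> 'v8i8'."""
    return f"v{length}i{elem_bits}"


def intrinsic_name(base, overloaded_types):
    """Append one suffix per overloaded type to the base intrinsic name."""
    suffixes = [mangle_vector(n, bits) for n, bits in overloaded_types]
    return ".".join([base] + suffixes)


def sdot_intrinsic(a, b, c, res):
    """Types are (length, element_bits) pairs.

    Assumes overloaded result index 0 and overloaded operand index 1.
    """
    results = [res]
    operands = [a, b, c]
    overloaded = [results[i] for i in [0]] + [operands[i] for i in [1]]
    return intrinsic_name("llvm.aarch64.neon.sdot", overloaded)


# (vector<2xi32>, vector<8xi8>, vector<8xi8>) -> vector<2xi32>
print(sdot_intrinsic((2, 32), (8, 8), (8, 8), (2, 32)))
# (vector<4xi32>, vector<16xi8>, vector<16xi8>) -> vector<4xi32>
print(sdot_intrinsic((4, 32), (16, 8), (16, 8), (4, 32)))
```

Read this way, the two sdot tests differ only in the vector lengths chosen for the accumulator and the i8 operands, and the suffixes record exactly those two choices.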

This is an archive of the discontinued LLVM Phabricator instance.

Add arm_neon.intr.sdot operation
ClosedPublic
