This is an archive of the discontinued LLVM Phabricator instance.


Differential D155306

[mlir][ArmSME] Add tile load op and extend tile store tile size support
ClosedPublic

Authored by c-rhodes on Jul 14 2023, 9:00 AM.


Details

Reviewers

awarzynski
WanderAway
dcaballe
aartbik
ftynse
nicolasvasilache

Commits

rGca9a3354d04b: [mlir][ArmSME] Add tile load op and extend tile store tile size support

Summary

This extends the existing 'arm_sme.tile_store' op to support all tile
sizes and adds a new op 'arm_sme.tile_load', as well as lowerings from
vector -> custom ops and custom ops -> intrinsics. Currently there's no
lowering for i128.

Depends on D154867

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

c-rhodes created this revision. Jul 14 2023, 9:00 AM

Herald added a reviewer: aartbik. Jul 14 2023, 9:00 AM

Herald added a reviewer: ftynse.

Herald added a project: Restricted Project.

Herald added subscribers: gysit, Dinistro, bviyer and 26 others.

c-rhodes requested review of this revision. Jul 14 2023, 9:00 AM

Herald added a reviewer: nicolasvasilache. Jul 14 2023, 9:00 AM

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

Harbormaster completed remote builds in B245410: Diff 540443. Jul 14 2023, 9:01 AM

c-rhodes edited the summary of this revision. Jul 14 2023, 9:01 AM

c-rhodes added a parent revision: D154867: [mlir][ArmSME] Introduce custom ops for SME.

Matt added a subscriber: Matt. Jul 14 2023, 2:45 PM

Overall looks good, thanks! We definitely need to take care of the loop materialisation after this change (i.e. move it higher up the compilation stack).

I've left a fair few comments, but nothing major and this is a rather large patch. I still need to go over the tests.

Thanks for working on this!

awarzynski added inline comments. Jul 19 2023, 12:47 AM

mlir/include/mlir/Dialect/ArmSME/IR/ArmSME.td
230
238
mlir/include/mlir/Dialect/ArmSME/Utils/Utils.h
25
28
mlir/lib/Conversion/VectorToArmSME/CMakeLists.txt
15	sort
mlir/lib/Conversion/VectorToArmSME/VectorToArmSME.cpp
102–120	I'd move this to a utility function - I expect that we'll be needing this for other Ops as well.
117–118	IMHO, this would be less noisy (i.e. no `VectorLoadStoreToArmSMELowering` template): patterns.add<TransferWriteToArmSMELowering, VectorLoadToArmSME, VectorStoreToArmSME>(&ctx); This way every element in `patterns.add` would have a more distinct name. But ultimately, it's a matter of preference so go with whatever you prefer.
mlir/lib/Dialect/ArmSME/Transforms/LegalizeForLLVMExport.cpp
98–117	Please preserve this comment. Fine details, imho, can be extracted from the code. But documenting the overall structure is helpful.
280–282	No casting happens in this routine :)
285	don't use else after a return ;-) Similar comment for `getOffset`.
291–292	What "offset" is it? Why do we need to adjust it in the 1-D case and just return it as is in 2-D? In the 1-D case, is it `offset * vscale * minElems`? And is `minElems` the minimum number of elements in a scalable vector? So basically the "base size of an SVE vector"?
321–322	The "horizontal" load instructions take "slice number": https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/LD1H--scalar-plus-scalar--tile-slice---Contiguous-load-of-halfwords-to-16-bit-element-ZA-tile-slice-?lang=en. I would rename `vnumI32` as `sliceNumI32` so that this is easier to match with the spec.
331–344	Could this be a switch statement instead?
mlir/lib/Dialect/ArmSME/Utils/CMakeLists.txt
2	Do we need a dedicated library for one CPP file? Perhaps it's sufficient to add this to `MLIRArmSMETransforms`?
mlir/lib/Dialect/ArmSME/Utils/Utils.cpp
21	This repeats the comment from the header file - it will become out of sync if somebody (e.g. me) forgets that and only updates one copy. IMHO, it's fine to limit the comments to where the interface is defined (i.e. the header file).
24	Avoid `else` after `return`.
32	Given that this code will be used only by SME/SSVE, why not name this `getSVEVectorBaseSize`? This way it will be very clear that it's some special SME/SSVE hook. Also: in SME array vector + "and an SVE vector"?
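
Two of the inline comments above ask for a `switch` in place of an if/else chain and for no `else` after `return`. The following is only a sketch of that style, not code from the patch: `getHorizLoadMnemonic` is a hypothetical helper that maps an element width to one of the `ld1*.horiz` intrinsic mnemonics defined in ArmSME.td. The 128-bit case is listed because the `ld1q.horiz` intrinsic exists, even though the summary notes that there is no lowering for i128 yet.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the patch): map an element width in bits to
// the mnemonic of the matching horizontal SME load intrinsic.
// Returns an empty string for unsupported widths.
inline std::string getHorizLoadMnemonic(unsigned elementBits) {
  // Every case returns directly, so no `else` branches are needed.
  switch (elementBits) {
  case 8:
    return "ld1b.horiz";
  case 16:
    return "ld1h.horiz";
  case 32:
    return "ld1w.horiz";
  case 64:
    return "ld1d.horiz";
  case 128:
    return "ld1q.horiz";
  default:
    return std::string();
  }
}
```

A caller could check the mapping with, for example, `assert(getHorizLoadMnemonic(32) == "ld1w.horiz");`.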

Address comments.

Overall looks good, thanks! We definitely need to take care of the loop materialisation after this change (i.e. move it higher up the compilation stack).

Thanks for the comments! I agree w.r.t. loop materialization, currently looking into that separately from this.

mlir/include/mlir/Dialect/ArmSME/Utils/Utils.h
25	I discovered `getIntOrFloatBitWidth` so I've removed this and replaced it with that.
mlir/lib/Conversion/VectorToArmSME/VectorToArmSME.cpp
102–120	> I'd move this to a utility function - I expect that we'll be needing this for other Ops as well.
	Moved to `arm_sme::isSMETileLikeVectorType`.
117–118	> IMHO, this would be less noisy (i.e. no `VectorLoadStoreToArmSMELowering` template):
	> patterns.add<TransferWriteToArmSMELowering, VectorLoadToArmSME, VectorStoreToArmSME>(&ctx);
	> This way every element in `patterns.add` would have a more distinct name. But ultimately, it's a matter of preference so go with whatever you prefer.
	That's a good suggestion! Done
mlir/lib/Dialect/ArmSME/Transforms/LegalizeForLLVMExport.cpp
98–117	Please preserve this comment. Fine details, imho, can be extracted from the code. But documenting the overall structure is helpful.
280–282	No casting happens in this routine :)
291–292	> What "offset" is it? Why do we need to adjust it in the 1-D case and just return it as is in 2-D?
	The offset to the load or store pointer. In the 2D case `getStridedElementPtr` does the arithmetic for us, but in the 1D case we have to do it ourselves.
	> In the 1-D case, is it `offset * vscale * minElems`?
	Yeah, and `offset` is `vnum` so it's `vnum * vscale * minElems`. So the offset is the number of elements for a given type in a vector of SVL bits (SVLt), and this is scaled by `vnum`. So the base would get incremented a tile vector at a time.
	> And is `minElems` the minimum number of elements in a scalable vector? So basically the "base size of an SVE vector"?
	Yeah, so `128 / esize`. Perhaps `getMinNumElts` would be better implemented like that rather than with a switch actually.
	As for where support for both 1D and 2D memrefs came from, I initially started with these integration tests that dump ZA but could get it to work with 2D memrefs: https://gist.github.com/c-rhodes/1e9f2d8fd0ca3c6539f167e08079f6ab I found those tests useful for verification but since the output varies depending on the runtime VL we can't add these tests.
321–322	> The "horizontal" load instructions take "slice number": https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/LD1H--scalar-plus-scalar--tile-slice---Contiguous-load-of-halfwords-to-16-bit-element-ZA-tile-slice-?lang=en. I would rename `vnumI32` as `sliceNumI32` so that this is easier to match with the spec.
	Good point, naming is difficult, updated to "tile slice", this seems consistent with LLVM as well.
mlir/lib/Dialect/ArmSME/Utils/CMakeLists.txt
2	> Do we need a dedicated library for one CPP file? Perhaps it's sufficient to add this to `MLIRArmSMETransforms`?
	I copied this from another dialect and it seems all dialects with utils do this.
mlir/lib/Dialect/ArmSME/Utils/Utils.cpp
32	> Given that this code will be used only by SME/SSVE, why not name this `getSVEVectorBaseSize`? This way it will be very clear that it's some special SME/SSVE hook. Also: in SME array vector + "and an SVE vector"?
	Naming this is tricky, I mean it to be the minSVLT which is the minimum number of elements in a vector of SVL bits, which when scaled by vscale gives both the number of tile slices (vector of SVL bits) in ZA and also the number of elements in a tile slice. I've updated it to `getSMETileSliceMinNumElts` but will have a think, also not sure we want to mention SVE here?
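
The offset arithmetic discussed for lines 291–292 above (`vnum * vscale * minElems`, with `minElems = 128 / esize`) can be sketched in plain C++ as follows. This is an illustration under stated assumptions, not the code in LegalizeForLLVMExport.cpp: `minNumElts` and `tileSlicePtrIndex` are hypothetical names, `vscale` is passed in as an ordinary integer rather than queried at runtime, and the rank 2 case simply returns the tile slice number because, per the reply above, `getStridedElementPtr` does the arithmetic there.

```cpp
#include <cassert>
#include <cstdint>

// Minimum SVL in bits assumed by the `128 / esize` formula in the discussion.
constexpr unsigned kMinSVLBits = 128;

// Minimum number of elements of width `elementBits` in a vector of SVL bits.
inline unsigned minNumElts(unsigned elementBits) {
  assert(elementBits != 0 && kMinSVLBits % elementBits == 0 &&
         "element width must divide 128");
  return kMinSVLBits / elementBits;
}

// Hypothetical stand-in for the rank-dependent index computation:
//  * rank 2: return the tile slice number unchanged,
//  * rank 1: scale it by the effective vector length in elements.
inline uint64_t tileSlicePtrIndex(unsigned memrefRank, uint64_t tileSlice,
                                  uint64_t vscale, unsigned elementBits) {
  assert((memrefRank == 1 || memrefRank == 2) && "expected rank 1 or 2");
  if (memrefRank == 2)
    return tileSlice;
  return tileSlice * vscale * minNumElts(elementBits);
}
```

For 32-bit elements with `vscale = 2`, each tile slice holds `2 * 4 = 8` elements, so slice 3 of a rank 1 memref starts at element index 24.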

Harbormaster completed remote builds in B246583: Diff 542073. Jul 19 2023, 10:31 AM

Thanks for the updates, I've left a few more nits/suggestions, but nothing major.

mlir/include/mlir/Dialect/ArmSME/Utils/Utils.h
10	[nit]
mlir/lib/Dialect/ArmSME/Transforms/LegalizeForLLVMExport.cpp
114–118	Right, IIUC, this is something like:
	Scale the memory offset, i.e. `vnum`, if needed:
	* for rank 2 memrefs, `getStridedElementPtr` does the calculation for us, so just return `vnum`.
	* for rank 1 memrefs, assume row-major storage and scale by the effective vector length.
	Btw, this makes lots of sense, I just would like for us to be very clear about the meaning of `offset` and `vnum` in this context. The latter name, imho, includes a bit of helpful context, hence the suggestion to rename. In the context of `getStridedElementPtr`, `offset` probably makes more sense. `getOffset` also feels a bit too generic 🤔.
291–292	> I found those tests useful for verification but since the output varies depending on the runtime VL we can't add these tests.
	Yeah, it would be nice to include them. Wouldn't it be possible to add `CHECK` lines that would assume minimum possible VL for each type? Not in this patch though - it's quite large as is.
292	[nit] Suggestion for a more descriptive name (it's a rather key bit in the SME logic)
mlir/lib/Dialect/ArmSME/Utils/Utils.cpp
28	[nit] Naming is hard
35	[nit] Naming is hard
mlir/test/Dialect/ArmSME/roundtrip.mlir
195	For consistency with the `tile_store` at the bottom.
mlir/test/Dialect/ArmSME/vector-ops-to-llvm.mlir
28–29	[nit] Not needed
70	[nit] Just to make it clearer what's distinct about this test
106–130	This and the following tests differ only in a few details that are tricky to spot. I am thinking that perhaps we should trim these to highlight the differences? That would be more in line with https://mlir.llvm.org/getting_started/TestingGuide/: Tests should be minimal, and only check what is absolutely necessary. This means that anything in the output that is not core to the functionality that you are testing should not be present in a CHECK line. The 2 tests above are sufficient to test the other nuances.
mlir/test/Integration/Dialect/Vector/CPU/ArmSME/vector-load-store.mlir
78	Having a "negative" would be good too (i.e. verify that `mem1 != mem2`).
97	How about printing `mem1` and `mem2` and checking something like this:
	// CHECK: 0.1, 0.1 {{.}}
	// CHECK: 1.1, 1.1 {{.}}
	// CHECK: <some clever string printed after all of mem1 has been printed>

awarzynski mentioned this in D155800: [mlir][ArmSME] Add missing roundtrip tests for `arm_sme.store_tile`. Jul 20 2023, 1:35 AM

Btw, after reviewing this I feel bad for not adding more roundtrip tests for arm_sme.store_tile, so I've created https://reviews.llvm.org/D155800 :) Let's land this one first so that it's your patch that fixes the definition in ArmSME.td 👍🏻.

In D155306#4517819, @awarzynski wrote:

> Btw, after reviewing this I feel bad for not adding more roundtrip tests for arm_sme.store_tile, so I've created https://reviews.llvm.org/D155800 :) Let's land this one first so that it's your patch that fixes the definition in ArmSME.td 👍🏻.

You couldn't have added them as only i8 was supported then 😄 I probably should have done that as part of this patch, but thanks for adding them.

Address comments

In D155306#4517783, @awarzynski wrote:

Thanks for the updates, I've left a few more nits/suggestions, but nothing major.

Thanks again, addressed almost all of them (just got a question on comment you left in integration test), also renamed a few variables in the rewrites to make things clearer. Cheers

mlir/lib/Dialect/ArmSME/Transforms/LegalizeForLLVMExport.cpp
114–118	> Right, IIUC, this is something like: Scale the memory offset, i.e. `vnum`, if needed: * for rank 2 memrefs, `getStridedElementPtr` does the calculation for us, so just return `vnum`. * for rank 1 memrefs, assume row-major storage and scale by the effective vector length. Btw, this makes lots of sense, I just would like for us to be very clear about the meaning of `offset` and `vnum` in this context. The latter name, imho, includes a bit of helpful context, hence the suggestion to rename. In the context of `getStridedElementPtr`, `offset` probably makes more sense. `getOffset` also feels a bit too generic 🤔.
	I've renamed it to `getTileSlicePtrIndex`, hopefully this clarifies things, also improved comment at top based on your suggestion.
291–292	>> I found those tests useful for verification but since the output varies depending on the runtime VL we can't add these tests.
	> Yeah, it would be nice to include them. Wouldn't it be possible to add `CHECK` lines that would assume minimum possible VL for each type? Not in this patch though - it's quite large as is.
	Yeah it would actually, I've done that in the integration test already part of this patch, based on your suggestion, but we could add the one I linked in future as well.
mlir/test/Dialect/ArmSME/vector-ops-to-llvm.mlir
106–130	> This and the following tests differ only in a few details that are tricky to spot. I am thinking that perhaps we should trim these to highlight the differences? That would be more in line with https://mlir.llvm.org/getting_started/TestingGuide/: Tests should be minimal, and only check what is absolutely necessary. This means that anything in the output that is not core to the functionality that you are testing should not be present in a CHECK line. The 2 tests above are sufficient to test the other nuances.
	Good suggestion, they were a bit verbose. I've simplified them to highlight the important bits.
mlir/test/Integration/Dialect/Vector/CPU/ArmSME/vector-load-store.mlir
78	> Having a "negative" would be good too (i.e. verify that `mem1 != mem2`).
	After zeroing mem2?

LGTM % minor comments, thanks!

mlir/include/mlir/Dialect/ArmSME/IR/ArmSME.td
230	I would add some minor comment about memory constraints (the slice of memory read should be contiguous, etc.). You can get some inspiration from the `vector.transfer_read` op.
360	It may be worth asking in Discourse about this. There might be a better side effect or workaround for this.
mlir/lib/Conversion/VectorToArmSME/VectorToArmSME.cpp
82–92	We usually do this in place. No strong opinion but the number of these utilities may explode and it also create some kind of specific convention within this file.
116	why do we need these specializations?
mlir/lib/Dialect/ArmSME/Transforms/LegalizeForLLVMExport.cpp
165	we may want to materialize the loops in the SME dialect before the number of cases grow more
335	spell out auto?
mlir/lib/Dialect/ArmSME/Utils/CMakeLists.txt
2	Yeah, I've seen some complaints about the utils library in the past. I was even recommended to remove the utils files altogether. I'm ok with this, though, if other dialects are doing the same...
mlir/test/Dialect/ArmSME/vector-ops-to-llvm.mlir
3–4	remove init separator. Otherwise, this will compile an empty input for the upper part.
354	Nice testing!

This revision is now accepted and ready to land. Jul 20 2023, 11:02 AM

Harbormaster completed remote builds in B246956: Diff 542574. Jul 20 2023, 1:23 PM

LGTM, thanks for the updates!

mlir/lib/Conversion/VectorToArmSME/VectorToArmSME.cpp
116	I do see _why_, but I agree with Diego that it's not immediately clear.
mlir/test/Integration/Dialect/Vector/CPU/ArmSME/vector-load-store.mlir
78	That comment was for: // Verify "mem1" == "mem2"
	I am suggesting a negative test, because current testing won't capture scenarios where something goes badly wrong. For example, when `mem1` and `mem2` end up pointing to the same memory location. Or, put differently, we can make sure that two mem buffers are identical. Can we make sure that they are different? This is something to consider adding, not a blocker. The testing in this patch is already pretty thorough.
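
The idea behind the suggested negative check can be illustrated outside MLIR with a small C++ sketch. It is only an analogy to the integration test, under the assumption that a plain copy stands in for the tile load followed by the tile store; `checkLoadStoreRoundTrip` and the buffer size are hypothetical. The point is that an equality check alone would also pass if both names aliased the same storage, so the sketch additionally checks that the buffers are distinct and that they differ whenever one of them is zeroed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true when the positive and the negative checks all hold.
inline bool checkLoadStoreRoundTrip() {
  std::vector<float> mem1(16);
  for (std::size_t i = 0; i < mem1.size(); ++i)
    mem1[i] = static_cast<float>(i) + 0.1f; // non-zero test pattern
  std::vector<float> mem2(mem1.size(), 0.0f);

  // Negative checks before the copy: distinct storage, different contents.
  const bool distinctStorage = mem1.data() != mem2.data();
  const bool differBefore = mem1 != mem2;

  // Stand-in for "load tile from mem1, store tile to mem2".
  std::copy(mem1.begin(), mem1.end(), mem2.begin());
  const bool equalAfterCopy = mem1 == mem2;

  // Negative check after zeroing mem2 again.
  std::fill(mem2.begin(), mem2.end(), 0.0f);
  const bool differAfterZero = mem1 != mem2;

  return distinctStorage && differBefore && equalAfterCopy && differAfterZero;
}
```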

Address comments.

@awarzynski @dcaballe thanks for reviewing, I believe I've addressed all comments now, will probably land this tomorrow unless there's further comments by then, cheers!

mlir/include/mlir/Dialect/ArmSME/IR/ArmSME.td
230	> I would add some minor comment about memory constraints (the slice of memory read should be contiguous, etc.). You can get some inspiration from the `vector.transfer_read` op.
	Done cheers, the indices aren't used in the lowering at the moment, it just assumes 0 to begin with then adjusts it for each tile slice. I've added a TODO to fix that.
360	> It may be worth asking in Discourse about this. There might be a better side effect or workaround for this.
	Good suggestion will do
mlir/lib/Conversion/VectorToArmSME/VectorToArmSME.cpp
82–92	> We usually do this in place. No strong opinion but the number of these utilities may explode and it also create some kind of specific convention within this file.
	I copied this from `mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVM.cpp`, but it's actually fewer lines and simpler in place, thanks for the suggestion!
116	why do we need these specializations?
mlir/lib/Dialect/ArmSME/Transforms/LegalizeForLLVMExport.cpp
165	> we may want to materialize the loops in the SME dialect before the number of cases grow more
	That's the plan, initially I was going to focus on vector.broadcast -> ArmSME after this to enable linalg.fill, but I've been looking at loop materialization first so the number of cases doesn't grow as you point out.
mlir/test/Dialect/ArmSME/vector-ops-to-llvm.mlir
3–4	> remove init separator. Otherwise, this will compile an empty input for the upper part.
	Did not know that! Fixed thanks
mlir/test/Integration/Dialect/Vector/CPU/ArmSME/vector-load-store.mlir
78	> That comment was for: // Verify "mem1" == "mem2"
	> I am suggesting a negative test, because current testing won't capture scenarios where something goes badly wrong. For example, when `mem1` and `mem2` end up pointing to the same memory location. Or, put differently, we can make sure that two mem buffers are identical. Can we make sure that they are different? This is something to consider adding, not a blocker. The testing in this patch is already pretty thorough.
	Done

Harbormaster completed remote builds in B247477: Diff 543268. Jul 23 2023, 4:32 AM

Remove side-effects from load intrinsics.

c-rhodes added inline comments. Jul 24 2023, 2:31 AM

mlir/include/mlir/Dialect/ArmSME/IR/ArmSME.td
360	>> It may be worth asking in Discourse about this. There might be a better side effect or workaround for this.
	> Good suggestion will do
	In the process of asking on Discourse I did some digging and found the SME load intrinsics don't get DCE'd in the backend because they have no side-effects specified and so the default worst case is assumed:
	> Intr*Mem - Memory properties. If no property is set, the worst case is assumed (it may read and write any memory it can get access to and it may have other side effects).
	I've removed the side-effects from the load intrinsics to match the semantics of the backend and they no longer get DCE'd. Thanks for the suggestion to look into this!

Harbormaster completed remote builds in B247597: Diff 543431. Jul 24 2023, 2:45 AM

Closed by commit rGca9a3354d04b: [mlir][ArmSME] Add tile load op and extend tile store tile size support (authored by c-rhodes). Jul 25 2023, 1:39 AM

This revision was automatically updated to reflect the committed changes.

c-rhodes added a commit: rGca9a3354d04b: [mlir][ArmSME] Add tile load op and extend tile store tile size support.

c-rhodes mentioned this in rGe7dc73bbade5: [mlir][ArmSME] Add missing roundtrip tests for `arm_sme.tile_store`. Jul 26 2023, 6:00 AM

benmxwl-arm mentioned this in D158418: [mlir][ArmSME] Lower loads/stores of (.Q) 128-bit tiles to intrinsics. Aug 21 2023, 5:22 AM

benmxwl-arm mentioned this in rG97da41418226: [mlir][ArmSME] Lower loads/stores of (.Q) 128-bit tiles to intrinsics. Aug 23 2023, 2:17 AM

Revision Contents

Path

Size

mlir/

include/

mlir/

Dialect/

ArmSME/

IR/

ArmSME.td

69 lines

Utils/

Utils.h

38 lines

lib/

Conversion/

VectorToArmSME/

CMakeLists.txt

1 line

VectorToArmSME.cpp

36 lines

Dialect/

ArmSME/

CMakeLists.txt

1 line

Transforms/

CMakeLists.txt

1 line

LegalizeForLLVMExport.cpp

299 lines

Utils/

CMakeLists.txt

11 lines

Utils.cpp

48 lines

test/

Dialect/

ArmSME/

roundtrip.mlir

83 lines

vector-ops-to-llvm.mlir

356 lines

Integration/

Dialect/

Vector/

CPU/

ArmSME/

vector-load-store.mlir

192 lines

Diff 543866

mlir/include/mlir/Dialect/ArmSME/IR/ArmSME.td

def ZeroOp : ArmSME_Op<"zero", [Pure]> {

[...]

let extraClassDeclaration = [{

VectorType getVectorType() {

return ::llvm::cast<VectorType>(getRes().getType());

}

}];

let assemblyFormat = "attr-dict `:` type($res)";

}

def TileLoadOp : ArmSME_Op<"tile_load"> {

let summary = "Tile load operation";

let description = [{

Loads a 2D SME "virtual tile" from memory defined by a base and indices,

awarzynski (Done), suggested change:

- Load a 2D SME "virtual tile" to memory.

+ Load a 2D SME "virtual tile" from memory.

dcaballe (Done): I would add some minor comment about memory constraints (the slice of memory read should be contiguous, etc.). You can get some inspiration from the vector.transfer_read op.

c-rhodes (Done): > I would add some minor comment about memory constraints (the slice of memory read should be contiguous, etc.). You can get some inspiration from the vector.transfer_read op.

Done cheers, the indices aren't used in the lowering at the moment, it just assumes 0 to begin with then adjusts it for each tile slice. I've added a TODO to fix that.

with the shape defined by the 2D scalable vector type of the result tile.

The slice of memory must be contiguous. The memref must be either rank 1 or

rank 2 with dynamic dimensions, since the operation is scalable, and the

element type must be a scalar that matches the element type of the result.

Example 1: Load an 8-bit element ZA tile from memory (ZA0.B).

```mlir

%tile = arm_sme.tile_load %base[%c0, %c0] : memref<?x?xi8>, vector<[16]x[16]xi8>

```

awarzynski (Done), suggested change:

- let arguments = (ins Arg<AnyMemRef, "load base", [MemRead]>:$base,

+ let arguments = (ins Arg<AnyMemRef, "the reference to load from", [MemRead]>:$base,

Variadic<Index>:$indices);

Example 2: Load a FP 32-bit element ZA tile from memory.

```mlir

%tile = arm_sme.tile_load %base[%c0, %c0] : memref<?x?xf32>, vector<[4]x[4]xf32>

```

Example 3: Load a 128-bit element ZA tile from memory.

```mlir

%tile = arm_sme.tile_load %base[%c0, %c0] : memref<?x?xi128>, vector<[1]x[1]xi128>

```

}];

let arguments = (ins

Arg<AnyMemRef, "the reference to load from", [MemRead]>:$base,

Variadic<Index>:$indices);

let results = (outs SMETile:$result);

let extraClassDeclaration = [{

MemRefType getMemRefType() {

return ::llvm::cast<MemRefType>(getBase().getType());

}

VectorType getVectorType() {

return ::llvm::cast<VectorType>(getResult().getType());

}

}];

let assemblyFormat =

"$base `[` $indices `]` attr-dict `:` type($base) `,` type($result)";

}

def TileStoreOp : ArmSME_Op<"tile_store"> {

let summary = "Tile store operation";

let description = [{

- Store a 2D SME "virtual tile" to memory.

- NOTE: At the moment it is assumed that the element type is `i8` and that

- there's only one "virtual tile".

- Example:

+ Stores a 2D SME "virtual tile" to memory defined by a base and indices,

+ with the shape defined by the 2D scalable vector type of the tile being

+ stored. The slice of memory must be contiguous. The memref must be either

+ rank 1 or rank 2 with dynamic dimensions, since the operation is scalable,

+ and the element type must be a scalar that matches the element type of the

+ result.

+ Example 1: Store an 8-bit element ZA tile to memory (ZA0.B).

+ ```mlir

+ arm_sme.tile_store %tile, %base[%c0, %c0] : vector<[16]x[16]xi8>, memref<?x?xi8>

+ ```

+ Example 2: Store a FP 32-bit element ZA tile to memory.

+ ```mlir

+ arm_sme.tile_store %tile, %base[%c0, %c0] : vector<[4]x[4]xf32>, memref<?x?xf32>

+ ```

+ Example 3: Store a 128-bit element ZA tile to memory.

```mlir

- arm_sme.tile_store %0, %arg0[%c0, %c0] : vector<[16]x[16]xi8>, memref<?x?xi8>

+ arm_sme.tile_store %tile, %base[%c0, %c0] : vector<[1]x[1]xi128>, memref<?x?xi128>

```

}];

- let arguments = (ins nxnxv16i8:$valueToStore,

+ let arguments = (ins SMETile:$valueToStore,

Arg<AnyMemRef, "the reference to store to", [MemWrite]>:$base,

Variadic<Index>:$indices);

let extraClassDeclaration = [{

MemRefType getMemRefType() {

return ::llvm::cast<MemRefType>(getBase().getType());

}

VectorType getVectorType() {

return ::llvm::cast<VectorType>(getValueToStore().getType());

[...]

def LLVM_aarch64_sme_sumops_wide : ArmSME_IntrMopOverloadedOp<"sumops.wide">;

def LLVM_aarch64_sme_usmopa_wide : ArmSME_IntrMopOverloadedOp<"usmopa.wide">;

def LLVM_aarch64_sme_usmops_wide : ArmSME_IntrMopOverloadedOp<"usmops.wide">;

// Loads

class ArmSME_IntrLoadOp<string mnemonic>

: ArmSME_IntrOp<mnemonic>,

Arguments<(ins Arg<LDSTPredicate, "Vector predicate">,

- Arg<LLVM_AnyPointer, "Load address", [MemRead]>,

+ Arg<LLVM_AnyPointer, "Load address">,

dcaballe (Not Done): It may be worth asking in Discourse about this. There might be a better side effect or workaround for this.

c-rhodes (Done): > It may be worth asking in Discourse about this. There might be a better side effect or workaround for this.

Good suggestion will do

c-rhodes (Done): In the process of asking on Discourse I did some digging and found the SME load intrinsics don't get DCE'd in the backend because they have no side-effects specified and so the default worst case is assumed:

> Intr*Mem - Memory properties. If no property is set, the worst case
> is assumed (it may read and write any memory it can get access to and it may
> have other side effects).

I've removed the side-effects from the load intrinsics to match the semantics of the backend and they no longer get DCE'd. Thanks for the suggestion to look into this!

Arg<I32, "Virtual tile ID">,

Arg<I32, "Tile slice">)>;

def LLVM_aarch64_sme_ld1b_horiz : ArmSME_IntrLoadOp<"ld1b.horiz">;

def LLVM_aarch64_sme_ld1h_horiz : ArmSME_IntrLoadOp<"ld1h.horiz">;

def LLVM_aarch64_sme_ld1w_horiz : ArmSME_IntrLoadOp<"ld1w.horiz">;

def LLVM_aarch64_sme_ld1d_horiz : ArmSME_IntrLoadOp<"ld1d.horiz">;

def LLVM_aarch64_sme_ld1q_horiz : ArmSME_IntrLoadOp<"ld1q.horiz">;

[...]

mlir/include/mlir/Dialect/ArmSME/Utils/Utils.h

This file was added.

//===- Utils.h - General ArmSME transformation utilities --------*- C++ -*-===//

// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

//===----------------------------------------------------------------------===//

// This header file defines prototypes for various utilities for the ArmSME

// dialect. These are not passes by themselves but are used either by passes,

awarzynski (Done): [nit] suggested change:

- // This header file defines prototypes for various transformation utilities for

+ // This header file defines prototypes for various utilities for

// the ArmSME dialect. These are not passes by themselves but are used

// optimization sequences, or in turn by other transformation utilities.

//===----------------------------------------------------------------------===//

#ifndef MLIR_DIALECT_ARMSME_UTILS_UTILS_H_

#define MLIR_DIALECT_ARMSME_UTILS_UTILS_H_

#include "mlir/Dialect/ArmSME/IR/ArmSME.h"

namespace mlir {

namespace arm_sme {

/// Return minimum number of elements for the given element `type` in

/// a vector of SVL bits.

unsigned getSMETileSliceMinNumElts(Type type);

awarzynski (Not Done), suggested change:

- /// Utility to return bitwidth of type which should be an integer or float.

+ /// Return bitwidth of `type` which should be an integer or float.

unsigned getWidth(Type type);

c-rhodes (Done): I discovered `getIntOrFloatBitWidth` so I've removed this and replaced it with that.

/// Returns true if `type` is a valid element type for an SME tile or false

/// otherwise.

awarzynski (Done), suggested change:

- /// Utility to return minimum number of elements for the given element type in

+ /// Return minimum number of elements for the given element `type` in

/// an SME array vector.

bool isValidSMETileElementType(Type type);

/// Returns true if `vType` is a valid vector type for an SME tile or false

/// otherwise.

bool isValidSMETileVectorType(VectorType vType);

} // namespace arm_sme

} // namespace mlir

#endif // MLIR_DIALECT_ARMSME_UTILS_UTILS_H_
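
The header above only declares the validity helpers. As a rough illustration of the kind of check they describe, here is a sketch in plain C++ with no MLIR dependency. It is not the implementation in Utils.cpp: the function names are hypothetical, the set of element widths is taken from the `ld1b` to `ld1q` intrinsics in ArmSME.td, and the shape rule (both scalable dimensions equal to `128 / esize`) is inferred from the examples in the op descriptions, such as `vector<[16]x[16]xi8>`, `vector<[4]x[4]xf32>` and `vector<[1]x[1]xi128>`.

```cpp
#include <cassert>

// Minimum SVL in bits assumed by this sketch.
constexpr unsigned kMinSVLBits = 128;

// Hypothetical check: element widths that have a matching ld1*/st1* variant.
inline bool isSupportedTileElementWidth(unsigned elementBits) {
  switch (elementBits) {
  case 8:
  case 16:
  case 32:
  case 64:
  case 128:
    return true;
  default:
    return false;
  }
}

// Hypothetical check: a tile-like shape has two scalable dimensions whose
// minimum sizes both equal 128 / esize.
inline bool isTileLikeShape(unsigned elementBits, unsigned minRows,
                            unsigned minCols) {
  if (!isSupportedTileElementWidth(elementBits))
    return false;
  const unsigned minNumElts = kMinSVLBits / elementBits;
  return minRows == minNumElts && minCols == minNumElts;
}
```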

mlir/lib/Conversion/VectorToArmSME/CMakeLists.txt

	add_mlir_conversion_library(MLIRVectorToArmSME			add_mlir_conversion_library(MLIRVectorToArmSME
	VectorToArmSME.cpp			VectorToArmSME.cpp
	VectorToArmSMEPass.cpp			VectorToArmSMEPass.cpp

	ADDITIONAL_HEADER_DIRS			ADDITIONAL_HEADER_DIRS
	${MLIR_MAIN_INCLUDE_DIR}/mlir/Conversion/VectorToArmSME			${MLIR_MAIN_INCLUDE_DIR}/mlir/Conversion/VectorToArmSME

	DEPENDS			DEPENDS
	MLIRConversionPassIncGen			MLIRConversionPassIncGen

	LINK_LIBS PUBLIC			LINK_LIBS PUBLIC
	MLIRArmSMEDialect			MLIRArmSMEDialect
				MLIRArmSMEUtils
	MLIRLLVMCommonConversion			MLIRLLVMCommonConversion
	)			)
				awarzynski (Done): sort

mlir/lib/Conversion/VectorToArmSME/VectorToArmSME.cpp

//===- VectorToArmSME.cpp - Conversion from Vector to the ArmSME dialect --===//		//===- VectorToArmSME.cpp - Conversion from Vector to the ArmSME dialect --===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "mlir/Conversion/VectorToArmSME/VectorToArmSME.h"		#include "mlir/Conversion/VectorToArmSME/VectorToArmSME.h"

#include "mlir/Dialect/ArmSME/IR/ArmSME.h"		#include "mlir/Dialect/ArmSME/IR/ArmSME.h"
		#include "mlir/Dialect/ArmSME/Utils/Utils.h"
#include "mlir/IR/BuiltinTypes.h"		#include "mlir/IR/BuiltinTypes.h"
#include "llvm/Support/Casting.h"		#include "llvm/Support/Casting.h"

using namespace mlir;		using namespace mlir;

static constexpr unsigned kMinNumElts = 16;		static constexpr unsigned kMinNumElts = 16;

/// Returns true if 'val' is a splat of zero, false otherwise.		/// Returns true if 'val' is a splat of zero, false otherwise.
▲ Show 20 Lines • Show All 51 Lines • ▼ Show 20 Lines	LogicalResult matchAndRewrite(vector::TransferWriteOp writeOp,
auto zero = rewriter.create<arm_sme::ZeroOp>(loc, vType);		auto zero = rewriter.create<arm_sme::ZeroOp>(loc, vType);

rewriter.replaceOpWithNewOp<arm_sme::TileStoreOp>(		rewriter.replaceOpWithNewOp<arm_sme::TileStoreOp>(
writeOp, zero, writeOp.getSource(), writeOp.getIndices());		writeOp, zero, writeOp.getSource(), writeOp.getIndices());
return success();		return success();
}		}
};		};

		/// Conversion pattern for vector.load.
		struct VectorLoadToArmSMELowering : public OpRewritePattern<vector::LoadOp> {
		using OpRewritePattern<vector::LoadOp>::OpRewritePattern;

		LogicalResult matchAndRewrite(vector::LoadOp load,
		PatternRewriter &rewriter) const override {
		if (!arm_sme::isValidSMETileVectorType(load.getVectorType()))
		return failure();

		rewriter.replaceOpWithNewOp<arm_sme::TileLoadOp>(
		load, load.getVectorType(), load.getBase(), load.getIndices());

		return success();
		dcaballe (Not Done): We usually do this in place. No strong opinion, but the number of these utilities may explode and it also creates some kind of specific convention within this file.
		c-rhodes (Done):
		> We usually do this in place. No strong opinion, but the number of these utilities may explode and it also creates some kind of specific convention within this file.
		I copied this from `mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVM.cpp`, but it's actually fewer lines and simpler in place, thanks for the suggestion!
		}
		};

		/// Conversion pattern for vector.store.
		struct VectorStoreToArmSMELowering : public OpRewritePattern<vector::StoreOp> {
		using OpRewritePattern<vector::StoreOp>::OpRewritePattern;

		LogicalResult matchAndRewrite(vector::StoreOp store,
		PatternRewriter &rewriter) const override {
		if (!arm_sme::isValidSMETileVectorType(store.getVectorType()))
		return failure();

		rewriter.replaceOpWithNewOp<arm_sme::TileStoreOp>(
		store, store.getValueToStore(), store.getBase(), store.getIndices());

		return success();
		}
		};

} // namespace		} // namespace

void mlir::populateVectorToArmSMEPatterns(RewritePatternSet &patterns,		void mlir::populateVectorToArmSMEPatterns(RewritePatternSet &patterns,
MLIRContext &ctx) {		MLIRContext &ctx) {
patterns.add<TransferWriteToArmSMELowering>(&ctx);		patterns.add<TransferWriteToArmSMELowering, VectorLoadToArmSMELowering,
		dcaballe (Done): why do we need these specializations?
		awarzynski (Done): I do see _why_, but I agree with Diego that it's not immediately clear.
		c-rhodes (Done):
		> why do we need these specializations?
		VectorStoreToArmSMELowering>(&ctx);
}		}
		awarzynski (Not Done): IMHO, this would be less noisy (i.e. no `VectorLoadStoreToArmSMELowering` template): `patterns.add<TransferWriteToArmSMELowering, VectorLoadToArmSME, VectorStoreToArmSME>(&ctx);` This way every element in `patterns.add` would have a more distinct name. But ultimately, it's a matter of preference, so go with whatever you prefer.
		c-rhodes (Done):
		> IMHO, this would be less noisy (i.e. no `VectorLoadStoreToArmSMELowering` template): `patterns.add<TransferWriteToArmSMELowering, VectorLoadToArmSME, VectorStoreToArmSME>(&ctx);` This way every element in `patterns.add` would have a more distinct name. But ultimately, it's a matter of preference, so go with whatever you prefer.
		That's a good suggestion! Done

mlir/lib/Dialect/ArmSME/CMakeLists.txt

	add_subdirectory(IR)			add_subdirectory(IR)
	add_subdirectory(Transforms)			add_subdirectory(Transforms)
				add_subdirectory(Utils)

mlir/lib/Dialect/ArmSME/Transforms/CMakeLists.txt

	add_mlir_dialect_library(MLIRArmSMETransforms			add_mlir_dialect_library(MLIRArmSMETransforms
	ArmSMETypeConverter.cpp			ArmSMETypeConverter.cpp
	EnableArmStreaming.cpp			EnableArmStreaming.cpp
	LegalizeForLLVMExport.cpp			LegalizeForLLVMExport.cpp
	TileAllocation.cpp			TileAllocation.cpp

	ADDITIONAL_HEADER_DIRS			ADDITIONAL_HEADER_DIRS
	${MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/ArmSME/Transforms			${MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/ArmSME/Transforms

	DEPENDS			DEPENDS
	MLIRArmSMETransformsIncGen			MLIRArmSMETransformsIncGen

	LINK_LIBS PUBLIC			LINK_LIBS PUBLIC
	MLIRArmSMEDialect			MLIRArmSMEDialect
				MLIRArmSMEUtils
	MLIRFuncDialect			MLIRFuncDialect
	MLIRLLVMCommonConversion			MLIRLLVMCommonConversion
	MLIRVectorDialect			MLIRVectorDialect
	MLIRSCFDialect			MLIRSCFDialect
	MLIRPass			MLIRPass
	)			)

mlir/lib/Dialect/ArmSME/Transforms/LegalizeForLLVMExport.cpp

//===- LegalizeForLLVMExport.cpp - Prepare ArmSME for LLVM translation ----===// //===- LegalizeForLLVMExport.cpp - Prepare ArmSME for LLVM translation ----===//

// //

// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information. // See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

// //

//===----------------------------------------------------------------------===// //===----------------------------------------------------------------------===//

#include "mlir/Conversion/LLVMCommon/ConversionTarget.h" #include "mlir/Conversion/LLVMCommon/ConversionTarget.h"

#include "mlir/Conversion/LLVMCommon/Pattern.h" #include "mlir/Conversion/LLVMCommon/Pattern.h"

#include "mlir/Dialect/Arith/IR/Arith.h" #include "mlir/Dialect/Arith/IR/Arith.h"

#include "mlir/Dialect/ArmSME/IR/ArmSME.h" #include "mlir/Dialect/ArmSME/IR/ArmSME.h"

#include "mlir/Dialect/ArmSME/Transforms/Transforms.h" #include "mlir/Dialect/ArmSME/Transforms/Transforms.h"

#include "mlir/Dialect/ArmSME/Utils/Utils.h"

#include "mlir/Dialect/Func/IR/FuncOps.h" #include "mlir/Dialect/Func/IR/FuncOps.h"

#include "mlir/Dialect/LLVMIR/LLVMDialect.h" #include "mlir/Dialect/LLVMIR/LLVMDialect.h"

#include "mlir/Dialect/SCF/IR/SCF.h" #include "mlir/Dialect/SCF/IR/SCF.h"

#include "mlir/Dialect/Vector/IR/VectorOps.h" #include "mlir/Dialect/Vector/IR/VectorOps.h"

using namespace mlir; using namespace mlir;

using namespace mlir::arm_sme; using namespace mlir::arm_sme;

static constexpr unsigned kMinNumElts = 16;

static constexpr unsigned kZeroZAMask = 255; static constexpr unsigned kZeroZAMask = 255;

namespace { namespace {

/// Insert 'llvm.aarch64.sme.za.enable' intrinsic at the start of 'func.func' /// Insert 'llvm.aarch64.sme.za.enable' intrinsic at the start of 'func.func'

/// ops to enable the ZA storage array. /// ops to enable the ZA storage array.

struct EnableZAPattern : public OpRewritePattern<func::FuncOp> { struct EnableZAPattern : public OpRewritePattern<func::FuncOp> {

using OpRewritePattern::OpRewritePattern; using OpRewritePattern::OpRewritePattern;

LogicalResult matchAndRewrite(func::FuncOp op, LogicalResult matchAndRewrite(func::FuncOp op,

Show All 14 Lines LogicalResult matchAndRewrite(func::ReturnOp op,

PatternRewriter &rewriter) const final { PatternRewriter &rewriter) const final {

OpBuilder::InsertionGuard g(rewriter); OpBuilder::InsertionGuard g(rewriter);

rewriter.setInsertionPoint(op); rewriter.setInsertionPoint(op);

rewriter.create<arm_sme::aarch64_sme_za_disable>(op->getLoc()); rewriter.create<arm_sme::aarch64_sme_za_disable>(op->getLoc());

rewriter.updateRootInPlace(op, [] {}); rewriter.updateRootInPlace(op, [] {});

return success(); return success();

} }

}; };

} // namespace

/// Lower 'arm_sme.zero'. Use 'arm_sme.cast_tile_to_vector' to model the return /// Lower 'arm_sme.zero'. Use 'arm_sme.cast_tile_to_vector' to model the return

/// value. The latter is a nop, which should be folded away (e.g. during /// value. The latter is a nop, which should be folded away (e.g. during

/// canonicalisation). /// canonicalisation).

/// ///

/// BEFORE: /// BEFORE:

/// ```mlir /// ```mlir

/// %0 = arm_sme.zero : vector<[16]x[16]xi8> /// %0 = arm_sme.zero : vector<[16]x[16]xi8>

Show All 28 Lines matchAndRewrite(ZeroOp zero, OpAdaptor adaptor,

// Create `CastTileToVectorOp` to use it as the output // Create `CastTileToVectorOp` to use it as the output

rewriter.replaceOpWithNewOp<arm_sme::CastTileToVector>(zero, zero.getType(), rewriter.replaceOpWithNewOp<arm_sme::CastTileToVector>(zero, zero.getType(),

tileId); tileId);

return success(); return success();

} }

}; };

/// Lower 'arm_sme.store_tile' to a loop over the rows of ZA and store each row /// Extends or truncates `tile`, which should be an `arm_sme::GetTileID` or

/// using 'arm_sme.intr.str'. /// `arm_sme::CastVectorToTile` op returning an 8/16/32/64/128-bit scalar

/// integer, to an i32 that can be passed as the `tile` parameter to the SME

/// intrinsics. Or returns `tile` if already i32.

Value castTileIDToI32(Value tile, Location loc,

ConversionPatternRewriter &rewriter) {

assert((isa<arm_sme::GetTileID, arm_sme::CastVectorToTile>(

tile.getDefiningOp())) &&

"expected ArmSME GetTileID or CastVectorToTile op!");

unsigned tileElementWidth = tile.getType().getIntOrFloatBitWidth();

if (tileElementWidth < 32)

return rewriter.create<arith::ExtUIOp>(loc, rewriter.getI32Type(), tile);

if (tileElementWidth > 32)

return rewriter.create<arith::TruncIOp>(loc, rewriter.getI32Type(), tile);

return tile;

}

/// Returns the following

/// * for rank 2 memrefs `tileSliceIndex`, since `getStridedElementPtr` does

/// the arithmetic.

/// * for rank 1 memrefs `tileSliceIndex * tileSliceNumElts`, adjusting the

/// index by the number of elements in a vector of SVL bits.

awarzynski (Not Done):

return tile;

}

- /// Returns `offset` if memref is rank 2, otherwise adjusts `offset` by the

+ /// Returns `vnum` if memref is rank 2, otherwise adjusts `offset` by the

/// number of elements in a vector of SVL bits.

- Value getOffset(MemRefType memRefType, Value offset, Value vscale,

+ Value getEffectiveMemOffset(MemRefType memRefType, Value vnum, Value vscale,

Value minElems, Location loc,

ConversionPatternRewriter &rewriter) {

unsigned rank = memRefType.getRank();

Right, IIUC, this is something like:

Scale the memory offset, i.e. `vnum`, if needed:
* for rank 2 memrefs, `getStridedElementPtr` does the calculation for us, so just return `vnum`.
* for rank 1 memrefs, assume row-major storage and scale by the effective vector length.

Btw, this makes lots of sense, I would just like us to be very clear about the meaning of offset and vnum in this context. The latter name, imho, includes a bit of helpful context, hence the suggestion to rename. In the context of `getStridedElementPtr`, offset probably makes more sense. `getOffset` also feels a bit too generic 🤔.

c-rhodes (Done):

> Right, IIUC, this is something like:
>
> Scale the memory offset, i.e. `vnum`, if needed:
> * for rank 2 memrefs, `getStridedElementPtr` does the calculation for us, so just return `vnum`.
> * for rank 1 memrefs, assume row-major storage and scale by the effective vector length.
>
> Btw, this makes lots of sense, I would just like us to be very clear about the meaning of offset and vnum in this context. The latter name, imho, includes a bit of helpful context, hence the suggestion to rename. In the context of `getStridedElementPtr`, offset probably makes more sense. `getOffset` also feels a bit too generic 🤔.

I've renamed it to `getTileSlicePtrIndex`, hopefully this clarifies things. I also improved the comment at the top based on your suggestion.

/// * otherwise throws an unreachable error.

Value getTileSlicePtrIndex(unsigned rank, Value tileSliceIndex,

Value tileSliceNumElts, Location loc,

ConversionPatternRewriter &rewriter) {

assert((rank == 1 || rank == 2) && "memref has unexpected rank!");

auto tileSliceIndexI64 = rewriter.create<arith::IndexCastUIOp>(

loc, rewriter.getI64Type(), tileSliceIndex);

if (rank == 1) {

auto tileSliceNumEltsI64 = rewriter.create<arith::IndexCastUIOp>(

loc, rewriter.getI64Type(), tileSliceNumElts);

return rewriter.create<arith::MulIOp>(loc, tileSliceIndexI64,

tileSliceNumEltsI64);

}

if (rank == 2)

return tileSliceIndexI64;

llvm_unreachable("memref has unexpected rank!");

}
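
For readers following the thread, the rank handling in `getTileSlicePtrIndex` can be modelled with plain integers. This is only a sketch, assuming the MLIR ops are replaced by ordinary arithmetic; `tileSlicePtrIndexModel` is a hypothetical name and is not part of the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of getTileSlicePtrIndex (not the patch's code):
// - rank 2: the strided-pointer helper does the arithmetic, so the tile slice
//   index is returned unchanged.
// - rank 1: the index is scaled by the number of elements in one tile slice.
inline std::uint64_t tileSlicePtrIndexModel(unsigned rank,
                                            std::uint64_t tileSliceIndex,
                                            std::uint64_t tileSliceNumElts) {
  assert((rank == 1 || rank == 2) && "memref has unexpected rank!");
  if (rank == 1)
    return tileSliceIndex * tileSliceNumElts;
  return tileSliceIndex;
}
```

With `tileSliceNumElts = 8`, tile slice 3 maps to element index 24 of a rank 1 memref and to row index 3 of a rank 2 memref.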

/// Conversion pattern for `arm_sme.tile_load` to SME intrinsics.

///

/// Lower `arm_sme.tile_load` to a loop over the rows of ZA and load each row

/// using `arm_sme.intr.ld1*.horiz`.

///

/// BEFORE:

/// ```mlir

/// %tile = arm_sme.tile_load %base[%c0, %c0] :

/// memref<?x?xi32>, vector<[4]x[4]xi32>

/// ```

///

/// AFTER:

/// ```mlir

/// %tile_id = arm_sme.get_tile_id : i32

/// %vscale = vector.vscale

/// %c0 = arith.constant 0 : index

/// %c1 = arith.constant 1 : index

/// %min_svl_s = arith.constant 4 : index

/// %svl_s = arith.muli %min_svl_s, %vscale : index

/// scf.for %tile_slice = %c0 to %svl_s step %c1 {

/// // (...)

/// "arm_sme.intr.ld1w.horiz"(%ptrue_s, %ptr, %tile_id, %tile_slice) :

/// (vector<[4]xi1>, !llvm.ptr, i32, i32) -> ()

/// }

/// %tile = arm_sme.cast_tile_to_vector %tile_id : i32 to vector<[4]x[4]xi32>

dcaballe (Not Done): we may want to materialize the loops in the SME dialect before the number of cases grows further

c-rhodes (Done):

> we may want to materialize the loops in the SME dialect before the number of cases grows further

That's the plan. Initially I was going to focus on vector.broadcast -> ArmSME after this to enable linalg.fill, but I've been looking at loop materialization first so the number of cases doesn't grow, as you point out.

/// ```

struct TileLoadToArmSMELowering

: public ConvertOpToLLVMPattern<arm_sme::TileLoadOp> {

using ConvertOpToLLVMPattern<arm_sme::TileLoadOp>::ConvertOpToLLVMPattern;

LogicalResult

matchAndRewrite(arm_sme::TileLoadOp tileLoadOp,

arm_sme::TileLoadOp::Adaptor adaptor,

ConversionPatternRewriter &rewriter) const override {

auto loc = tileLoadOp.getLoc();

auto tileType = tileLoadOp.getVectorType();

auto tileElementType = tileType.getElementType();

unsigned tileElementWidth = tileElementType.getIntOrFloatBitWidth();

// Create 'arm_sme.get_tile_id' op.

auto tile = rewriter.create<arm_sme::GetTileID>(

loc, rewriter.getIntegerType(tileElementWidth));

// Create a loop that loads each ZA tile slice from memory.

auto step = rewriter.create<arith::ConstantIndexOp>(loc, 1);

auto minTileSlices = rewriter.create<arith::ConstantIndexOp>(

loc, arm_sme::getSMETileSliceMinNumElts(tileElementType));

auto vscale =

rewriter.create<vector::VectorScaleOp>(loc, rewriter.getIndexType());

auto lowerBound = rewriter.create<arith::ConstantIndexOp>(loc, 0);

// This describes both the number of ZA tile slices and the number of

// elements in a vector of SVL bits for a given element type (SVL_B, SVL_H,

// ..., SVL_Q).

auto numTileSlices =

rewriter.create<arith::MulIOp>(loc, minTileSlices, vscale);

auto forOp =

rewriter.create<scf::ForOp>(loc, lowerBound, numTileSlices, step);

rewriter.setInsertionPointToStart(forOp.getBody());

// Create 'arm_sme.intr.ld1*.horiz' intrinsic to load ZA tile slice.

auto memRefType = tileLoadOp.getMemRefType();

auto tileSlice = forOp.getInductionVar();

// TODO: The 'indices' argument for the 'base' memref is currently ignored,

// 'tileSliceIndex' should be added to 'indices[0]'.

Value tileSliceIndex = getTileSlicePtrIndex(memRefType.getRank(), tileSlice,

numTileSlices, loc, rewriter);

Value ptr = this->getStridedElementPtr(loc, memRefType, adaptor.getBase(),

{tileSliceIndex}, rewriter);

// Cast tile slice to i32 for intrinsic.

auto tileSliceI32 = rewriter.create<arith::IndexCastUIOp>(

loc, rewriter.getI32Type(), tileSlice);

// Create all active predicate mask.

auto one = rewriter.create<arith::ConstantOp>(

loc, rewriter.getI1Type(),

rewriter.getIntegerAttr(rewriter.getI1Type(), 1));

auto predTy = VectorType::get(tileType.getShape()[0], rewriter.getI1Type(),

/*scalableDims=*/{true});

auto allActiveMask = rewriter.create<vector::SplatOp>(loc, predTy, one);

auto tileI32 = castTileIDToI32(tile, loc, rewriter);

switch (tileElementWidth) {

default:

llvm_unreachable("unexpected element type!");

case 8:

rewriter.create<arm_sme::aarch64_sme_ld1b_horiz>(loc, allActiveMask, ptr,

tileI32, tileSliceI32);

break;

case 16:

rewriter.create<arm_sme::aarch64_sme_ld1h_horiz>(loc, allActiveMask, ptr,

tileI32, tileSliceI32);

break;

case 32:

rewriter.create<arm_sme::aarch64_sme_ld1w_horiz>(loc, allActiveMask, ptr,

tileI32, tileSliceI32);

break;

case 64:

rewriter.create<arm_sme::aarch64_sme_ld1d_horiz>(loc, allActiveMask, ptr,

tileI32, tileSliceI32);

break;

}

rewriter.setInsertionPointAfter(forOp);

// The load intrinsics have no result, replace 'arm_sme.tile_load' with

// 'arm_sme.cast_tile_to_vector' to preserve dataflow.

rewriter.replaceOpWithNewOp<arm_sme::CastTileToVector>(tileLoadOp, tileType,

tile);

return success();

}

};
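
The `switch` above picks one intrinsic per element width. A minimal standalone sketch of that mapping follows, assuming the byte, halfword and doubleword ops are spelled like the `arm_sme.intr.ld1w.horiz` name shown in the doc comment; `ld1HorizNameModel` is a hypothetical helper, and the empty string for unsupported widths (e.g. 128, which has no lowering yet) is a choice made for this example rather than the patch's behaviour.

```cpp
#include <string>

// Hypothetical model of the width -> load intrinsic selection.
// Only 8/16/32/64-bit elements are handled; anything else yields "".
inline std::string ld1HorizNameModel(unsigned tileElementWidth) {
  switch (tileElementWidth) {
  case 8:
    return "arm_sme.intr.ld1b.horiz";
  case 16:
    return "arm_sme.intr.ld1h.horiz";
  case 32:
    return "arm_sme.intr.ld1w.horiz";
  case 64:
    return "arm_sme.intr.ld1d.horiz";
  default:
    return "";
  }
}
```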

/// Conversion pattern for `arm_sme.tile_store` to SME intrinsics.

///

/// Lower `arm_sme.tile_store` to a loop over the rows of ZA and store each row

/// using `arm_sme.intr.st1*.horiz`.

/// ///

/// BEFORE: /// BEFORE:

/// ```mlir /// ```mlir

/// arm_sme.tile_store %arg0[%c0, %c0], %0 : memref<?x?xi8>, /// arm_sme.tile_store %value, %base[%c0, %c0] : memref<?x?xi32>,

/// vector<[16]x[16]xi8 /// vector<[4]x[4]xi32

/// ``` /// ```

/// ///

/// AFTER: /// AFTER:

/// ```mlir /// ```mlir

/// %vscale = "llvm.intr.vscale"() : () -> index /// %tile_id = arm_sme.cast_vector_to_tile %tile : vector<[4]x[4]xi32> to i32

/// %vscale = vector.vscale

/// %c0 = arith.constant 0 : index /// %c0 = arith.constant 0 : index

/// %c1 = arith.constant 1 : index /// %c1 = arith.constant 1 : index

/// %c16 = arith.constant 16 : index /// %min_svl_s = arith.constant 4 : index

/// %vec_size = arith.muli %c16, %vscale : index /// %svl_s = arith.muli %min_svl_s, %vscale : index

/// scf.for %row_idx = %c0 to %vec_size step %c1 { /// scf.for %tile_slice = %c0 to %svl_s step %c1 {

/// // (...) /// // (...)

/// "arm_sme.intr.str"(%row_idx, %addr) : (i32, !llvm.ptr) -> () /// "arm_sme.intr.st1w.horiz"(%ptrue_s, %ptr, %tile_id, %tile_slice) :

/// (vector<[4]xi1>, !llvm.ptr, i32, i32) -> ()

/// }

/// ``` /// ```

awarzynski (Done): Please preserve this comment. Fine details, imho, can be extracted from the code. But documenting the overall structure is helpful.

c-rhodes (Done):

> Please preserve this comment. Fine details, imho, can be extracted from the code. But documenting the overall structure is helpful.

struct TileStoreOpConversion : public ConvertOpToLLVMPattern<TileStoreOp> { struct TileStoreToArmSMELowering

using ConvertOpToLLVMPattern<TileStoreOp>::ConvertOpToLLVMPattern; : public ConvertOpToLLVMPattern<arm_sme::TileStoreOp> {

using ConvertOpToLLVMPattern<arm_sme::TileStoreOp>::ConvertOpToLLVMPattern;

awarzynski (Done): No casting happens in this routine :)

c-rhodes (Done):

> No casting happens in this routine :)

LogicalResult LogicalResult

matchAndRewrite(TileStoreOp store, OpAdaptor adaptor, matchAndRewrite(arm_sme::TileStoreOp tileStoreOp,

awarzynski (Done): don't use else after a return ;-) (https://llvm.org/docs/CodingStandards.html#don-t-use-else-after-a-return)

Similar comment for getOffset.

arm_sme::TileStoreOp::Adaptor adaptor,

ConversionPatternRewriter &rewriter) const override { ConversionPatternRewriter &rewriter) const override {

auto loc = store.getLoc(); auto loc = tileStoreOp.getLoc();

auto tileType = tileStoreOp.getVectorType();

auto tileElementType = tileType.getElementType();

unsigned tileElementWidth = tileElementType.getIntOrFloatBitWidth();

awarzynski (Done): What "offset" is it? Why do we need to adjust it in the 1-D case and just return it as is in 2-D?

In the 1-D case, is it:

offset * vscale * minElems

? And is minElems the minimum number of elements in a scalable vector? So basically the "base size of an SVE vector"?

c-rhodes (Done):

> What "offset" is it? Why do we need to adjust it in the 1-D case and just return it as is in 2-D?

The offset to the load or store pointer. In the 2D case `getStridedElementPtr` does the arithmetic for us, but in the 1D case we have to do it ourselves.

> In the 1-D case, is it:
> offset * vscale * minElems

Yeah, and offset is vnum so it's vnum * vscale * minElems. So the offset is the number of elements for a given type in a vector of SVL bits (SVLt), and this is scaled by vnum. So the base would get incremented a tile vector at a time.

> ? And is minElems the minimum number of elements in a scalable vector? So basically the "base size of an SVE vector"?

Yeah, so 128 / esize. Perhaps getMinNumElts would be better implemented like that rather than with a switch actually.

As for where support for both 1D and 2D memrefs came from, I initially started with these integration tests that dump ZA but could get it to work with 2D memrefs: https://gist.github.com/c-rhodes/1e9f2d8fd0ca3c6539f167e08079f6ab

I found those tests useful for verification, but since the output varies depending on the runtime VL we can't add these tests.
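
The arithmetic discussed in this thread can be written out with plain integers. The following is only a sketch under the assumptions stated in the replies above (minimum elements = 128 / esize, rank 1 offset = vnum * vscale * minElems); `minNumEltsModel` and `rank1OffsetModel` are hypothetical names, not functions from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Minimum vector length in bits, per the "128 / esize" remark in the thread.
constexpr unsigned kMinSVLBits = 128;

// Hypothetical helper: minimum number of elements of width `esizeBits`
// in a vector of SVL bits, computed by division instead of a switch.
inline unsigned minNumEltsModel(unsigned esizeBits) {
  assert(esizeBits != 0 && kMinSVLBits % esizeBits == 0 &&
         "element width must divide 128");
  return kMinSVLBits / esizeBits;
}

// Hypothetical helper: element offset of tile slice `vnum` in a rank 1
// memref, i.e. vnum * vscale * minElems.
inline std::uint64_t rank1OffsetModel(std::uint64_t vnum, std::uint64_t vscale,
                                      unsigned esizeBits) {
  return vnum * vscale * minNumEltsModel(esizeBits);
}
```

For i32 elements with `vscale = 2`, each tile slice holds 2 * 4 = 8 elements, so slice 3 starts at element 24. For i8 the division gives 16, which matches the `kMinNumElts = 16` constant used before this patch.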

awarzynski (Not Done):

> I found those tests useful for verification, but since the output varies depending on the runtime VL we can't add these tests.

Yeah, it would be nice to include them. Wouldn't it be possible to add CHECK lines that would assume the minimum possible VL for each type? Not in this patch though - it's quite large as is.

c-rhodes (Done):

> > I found those tests useful for verification, but since the output varies depending on the runtime VL we can't add these tests.
>
> Yeah, it would be nice to include them. Wouldn't it be possible to add CHECK lines that would assume the minimum possible VL for each type? Not in this patch though - it's quite large as is.

Yeah it would actually, I've already done that in the integration test that's part of this patch, based on your suggestion, but we could add the one I linked in future as well.

awarzynski (Done):

// Create 'arm_sme.get_tile_id' op.

- unsigned width = vType.getElementType().getIntOrFloatBitWidth();

+ unsigned tileElemWidth = vType.getElementType().getIntOrFloatBitWidth();

auto tile = rewriter.create<arm_sme::GetTileID>(

[nit] Suggestion for a more descriptive name (it's a rather key bit in the SME logic)


// Create 'arm_sme.cast_vector_to_tile' to get a tile ID for the vector

// being stored.

auto tile = rewriter.create<arm_sme::CastVectorToTile>(

loc, rewriter.getIntegerType(tileElementWidth),

tileStoreOp.getValueToStore());

// Create loop that iterates from 0 to SVLB-1 inclusive (the number of // Create a loop that stores each ZA tile slice to memory.

// vectors in ZA) and stores each ZA vector to memory.

auto step = rewriter.create<arith::ConstantIndexOp>(loc, 1); auto step = rewriter.create<arith::ConstantIndexOp>(loc, 1);

auto minElems = rewriter.create<arith::ConstantIndexOp>(loc, kMinNumElts); auto minTileSlices = rewriter.create<arith::ConstantIndexOp>(

loc, arm_sme::getSMETileSliceMinNumElts(tileElementType));

auto vscale = auto vscale =

rewriter.create<vector::VectorScaleOp>(loc, rewriter.getIndexType()); rewriter.create<vector::VectorScaleOp>(loc, rewriter.getIndexType());

auto lowerBound = rewriter.create<arith::ConstantIndexOp>(loc, 0); auto lowerBound = rewriter.create<arith::ConstantIndexOp>(loc, 0);

auto upperBound = rewriter.create<arith::MulIOp>(loc, minElems, vscale); // This describes both the number of ZA tile slices and the number of

auto forOp = rewriter.create<scf::ForOp>(loc, lowerBound, upperBound, step); // elements in a vector of SVL bits for a given element type (SVL_B, SVL_H,

// ..., SVL_Q).

auto numTileSlices =

rewriter.create<arith::MulIOp>(loc, minTileSlices, vscale);

auto forOp =

rewriter.create<scf::ForOp>(loc, lowerBound, numTileSlices, step);

rewriter.setInsertionPointToStart(forOp.getBody()); rewriter.setInsertionPointToStart(forOp.getBody());

// Create 'arm_sme.intr.str' intrinsic to store ZA vector. // Create 'arm_sme.intr.st1*.horiz' intrinsic to store ZA tile slice.

auto vnumI64 = rewriter.create<arith::IndexCastUIOp>( auto memRefType = tileStoreOp.getMemRefType();

loc, rewriter.getI64Type(), forOp.getInductionVar()); auto tileSlice = forOp.getInductionVar();

auto offset = // TODO: The 'indices' argument for the 'base' memref is currently ignored,

rewriter.create<LLVM::ConstantOp>(loc, rewriter.getI64Type(), 0); // 'tileSliceIndex' should be added to 'indices[0]'.

Value ptr = Value tileSliceIndex = getTileSlicePtrIndex(memRefType.getRank(), tileSlice,

getStridedElementPtr(loc, store.getMemRefType(), adaptor.getBase(), numTileSlices, loc, rewriter);

ValueRange{vnumI64, offset}, rewriter); Value ptr = this->getStridedElementPtr(loc, memRefType, adaptor.getBase(),

awarzynski (Not Done): The "horizontal" load instructions take a "slice number": https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/LD1H--scalar-plus-scalar--tile-slice---Contiguous-load-of-halfwords-to-16-bit-element-ZA-tile-slice-?lang=en.

I would rename vnumI32 as sliceNumI32 so that this is easier to match with the spec.

c-rhodes (Done):

> The "horizontal" load instructions take a "slice number": https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/LD1H--scalar-plus-scalar--tile-slice---Contiguous-load-of-halfwords-to-16-bit-element-ZA-tile-slice-?lang=en.
>
> I would rename vnumI32 as sliceNumI32 so that this is easier to match with the spec.

Good point, naming is difficult. Updated to "tile slice", which seems consistent with LLVM as well.

auto vnumI32 = rewriter.create<arith::IndexCastUIOp>( {tileSliceIndex}, rewriter);

loc, rewriter.getI32Type(), forOp.getInductionVar());

rewriter.create<arm_sme::aarch64_sme_str>(loc, vnumI32, ptr); // Cast tile slice to i32 for intrinsic.

auto tileSliceI32 = rewriter.create<arith::IndexCastUIOp>(

loc, rewriter.getI32Type(), tileSlice);

// Create all active predicate mask.

auto one = rewriter.create<arith::ConstantOp>(

loc, rewriter.getI1Type(),

rewriter.getIntegerAttr(rewriter.getI1Type(), 1));

auto predTy = VectorType::get(tileType.getShape()[0], rewriter.getI1Type(),

/*scalableDims=*/{true});

auto allActiveMask = rewriter.create<vector::SplatOp>(loc, predTy, one);

dcaballe (Done): spell out auto?

Value tileI32 = castTileIDToI32(tile, loc, rewriter);

switch (tileElementWidth) {

default:

llvm_unreachable("unexpected element type!");

case 8:

rewriter.replaceOpWithNewOp<arm_sme::aarch64_sme_st1b_horiz>(

tileStoreOp, allActiveMask, ptr, tileI32, tileSliceI32);

break;

awarzynski (Done): Could this be a switch statement instead?

case 16:

rewriter.replaceOpWithNewOp<arm_sme::aarch64_sme_st1h_horiz>(

tileStoreOp, allActiveMask, ptr, tileI32, tileSliceI32);

break;

case 32:

rewriter.replaceOpWithNewOp<arm_sme::aarch64_sme_st1w_horiz>(

tileStoreOp, allActiveMask, ptr, tileI32, tileSliceI32);

break;

case 64:

rewriter.replaceOpWithNewOp<arm_sme::aarch64_sme_st1d_horiz>(

tileStoreOp, allActiveMask, ptr, tileI32, tileSliceI32);

break;

}

rewriter.eraseOp(store);

return success(); return success();

} }

}; };

} // namespace

void mlir::configureArmSMELegalizeForExportTarget(

LLVMConversionTarget &target) {

- target.addLegalOp<scf::ForOp, scf::YieldOp, arm_sme::CastTileToVector,

+ target.addLegalOp<

+ scf::ForOp, scf::YieldOp, arm_sme::CastTileToVector,

arm_sme::CastVectorToTile, arm_sme::aarch64_sme_zero,

- arm_sme::aarch64_sme_str, arm_sme::aarch64_sme_za_enable,

+ arm_sme::aarch64_sme_str, arm_sme::aarch64_sme_ld1b_horiz,

+ arm_sme::aarch64_sme_ld1h_horiz, arm_sme::aarch64_sme_ld1w_horiz,

+ arm_sme::aarch64_sme_ld1d_horiz, arm_sme::aarch64_sme_st1b_horiz,

+ arm_sme::aarch64_sme_st1h_horiz, arm_sme::aarch64_sme_st1w_horiz,

+ arm_sme::aarch64_sme_st1d_horiz, arm_sme::aarch64_sme_za_enable,

arm_sme::aarch64_sme_za_disable>();

target.addLegalOp<GetTileID>();

// Mark 'func.func' ops as legal if either:

// 1. no 'arm_za' function attribute is present.

// 2. the 'arm_za' function attribute is present and the first op in the

// function is an 'arm_sme::aarch64_sme_za_enable' intrinsic.

target.addDynamicallyLegalOp<func::FuncOp>([&](func::FuncOp funcOp) {

if (funcOp.isDeclaration())

[14 unchanged lines not shown]

funcOp->walk<WalkOrder::PreOrder>(

[&](arm_sme::aarch64_sme_za_disable op) { hasDisableZA = true; });

return !funcOp->hasAttr("arm_za") || hasDisableZA;

});

}

void mlir::populateArmSMELegalizeForLLVMExportPatterns(

LLVMTypeConverter &converter, RewritePatternSet &patterns) {

patterns.add<EnableZAPattern, DisableZAPattern>(patterns.getContext());

- patterns.add<TileStoreOpConversion, ZeroOpConversion>(converter);

+ patterns.add<ZeroOpConversion, TileLoadToArmSMELowering,

+ TileStoreToArmSMELowering>(converter);

}

mlir/lib/Dialect/ArmSME/Utils/CMakeLists.txt

This file was added.

				add_mlir_dialect_library(MLIRArmSMEUtils
				Utils.cpp
				awarzynski (Not Done): Do we need a dedicated library for one CPP file? Perhaps it's sufficient to add this to `MLIRArmSMETransforms`?
				c-rhodes (Author, Done): I copied this from another dialect and it seems all dialects with utils do this.
				dcaballe (Not Done): Yeah, I've seen some complaints about the utils library in the past. I was even recommended to remove the utils files altogether. I'm ok with this, though, if other dialects are doing the same...

				ADDITIONAL_HEADER_DIRS
				${MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/ArmSME/Utils

				LINK_LIBS PUBLIC
				MLIRArmSMEDialect
				MLIRDialect
				MLIRIR
				)

mlir/lib/Dialect/ArmSME/Utils/Utils.cpp

This file was added.

//===- Utils.cpp - Utilities to support the ArmSME dialect ----------------===//

// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

//===----------------------------------------------------------------------===//

// This file implements utilities for the ArmSME dialect.

//===----------------------------------------------------------------------===//

#include "mlir/Dialect/ArmSME/Utils/Utils.h"

#include "mlir/Dialect/ArmSME/IR/ArmSME.h"

using namespace mlir;

using namespace mlir::arm_sme;

static constexpr unsigned MinStreamingVectorLengthInBits = 128;

awarzynski (Done): This repeats the comment from the header file - it will become out of sync if somebody (e.g. me) forgets that and only updates one copy. IMHO, it's fine to limit the comments to where the interface is defined (i.e. the header file).

unsigned mlir::arm_sme::getSMETileSliceMinNumElts(Type type) {

assert(isValidSMETileElementType(type) && "invalid tile type!");

return MinStreamingVectorLengthInBits / type.getIntOrFloatBitWidth();

awarzynski (Done): Avoid `else` after `return`.

}

bool mlir::arm_sme::isValidSMETileElementType(Type type) {

// TODO: add support for i128.

awarzynski (Done), on the rename:

- bool mlir::arm_sme::isValidTileElementType(Type type) {

+ bool mlir::arm_sme::isValidSMETileElementType(Type type) {

[nit] Naming is hard

return type.isInteger(8) || type.isInteger(16) || type.isInteger(32) ||

type.isInteger(64) || type.isF16() || type.isBF16() || type.isF32() ||

type.isF64();

}

awarzynski (Not Done): Given that this code will only be used by SME/SSVE, why not name this `getSVEVectorBaseSize`? This way it will be very clear that it's some special SME/SSVE hook. Also: "in SME array vector" + "and an SVE vector"?

c-rhodes (Author, Not Done): Naming this is tricky. I mean it to be the minSVLT, which is the minimum number of elements in a vector of SVL bits; when scaled by vscale, it gives both the number of tile slices (vectors of SVL bits) in ZA and the number of elements in a tile slice. I've updated it to `getSMETileSliceMinNumElts` but will have a think. Also, I'm not sure we want to mention SVE here?
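
The quantity discussed in this thread can be worked through with plain integers. The sketch below assumes the 128-bit minimum streaming vector length from `Utils.cpp`; the function names `tileSliceMinNumElts` and `tileSliceNumElts` are hypothetical stand-ins, not the patch's API.

```cpp
#include <cassert>
#include <cstdint>

// Minimum streaming vector length in bits, as in Utils.cpp.
constexpr unsigned kMinStreamingVectorLengthInBits = 128;

// Hypothetical stand-in for getSMETileSliceMinNumElts: the number of
// elements in a tile slice when vscale == 1.
constexpr unsigned tileSliceMinNumElts(unsigned elementWidthInBits) {
  return kMinStreamingVectorLengthInBits / elementWidthInBits;
}

// Scaling by vscale gives the runtime number of elements per tile slice,
// which is also the number of tile slices in the (square) tile.
constexpr std::uint64_t tileSliceNumElts(unsigned elementWidthInBits,
                                         std::uint64_t vscale) {
  assert(elementWidthInBits != 0 && "element width must be non-zero");
  return std::uint64_t{tileSliceMinNumElts(elementWidthInBits)} * vscale;
}
```

For 32-bit elements the minimum is 4, so with vscale equal to 4 a tile has 16 slices of 16 elements each.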

bool mlir::arm_sme::isValidSMETileVectorType(VectorType vType) {

if ((vType.getRank() != 2) && vType.allDimsScalable())

awarzynski (Done), on the rename:

- bool mlir::arm_sme::isSMETileLikeVectorType(VectorType vType) {

+ bool mlir::arm_sme::isValidSMETileVectorType(VectorType vType) {

[nit] Naming is hard

return false;

// TODO: add support for i128.

auto elemType = vType.getElementType();

if (!isValidSMETileElementType(elemType))

return false;

unsigned minNumElts = arm_sme::getSMETileSliceMinNumElts(elemType);

if (vType.getShape() != ArrayRef<int64_t>({minNumElts, minNumElts}))

return false;

return true;

}
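
A standalone sketch of a tile-shape check in the spirit of `isValidSMETileVectorType` follows. It rests on an assumption about the intent: a tile vector must be rank 2, scalable in both dimensions, and square with the minimum slice length for its element width. Under that assumption the sketch rejects a type when either the rank or the scalability requirement fails, so it is not a line-for-line transcription of the condition in the diff. `TileVectorDesc` and `isValidTileShape` are hypothetical names.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for the parts of mlir::VectorType the check reads.
struct TileVectorDesc {
  unsigned rank;
  std::array<std::int64_t, 2> shape;
  std::array<bool, 2> scalable;
  unsigned elementWidthInBits;
};

inline bool isValidTileShape(const TileVectorDesc &v) {
  // Assumed intent: both conditions are required, so fail if either is unmet.
  if (v.rank != 2 || !v.scalable[0] || !v.scalable[1])
    return false;
  // Only 8/16/32/64-bit elements are accepted; 128-bit is left as a TODO
  // in the patch.
  const unsigned w = v.elementWidthInBits;
  if (w != 8 && w != 16 && w != 32 && w != 64)
    return false;
  // The tile must be square with 128 / width elements per dimension.
  const std::int64_t minNumElts = 128 / w;
  return v.shape[0] == minNumElts && v.shape[1] == minNumElts;
}
```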

mlir/test/Dialect/ArmSME/roundtrip.mlir

// RUN: mlir-opt -split-input-file -verify-diagnostics %s | mlir-opt | FileCheck %s

// -----

func.func @arm_sme_cast_tile_to_vector_i8(%tile_id : i8) -> vector<[16]x[16]xi8> {

// CHECK: arm_sme.cast_tile_to_vector {{.*}} : i8 to vector<[16]x[16]xi8>

%0 = arm_sme.cast_tile_to_vector %tile_id : i8 to vector<[16]x[16]xi8>

return %0 : vector<[16]x[16]xi8>

}

// -----

[176 unchanged lines not shown]

func.func @arm_sme_zero() -> () {

// CHECK: arm_sme.zero : vector<[16]x[16]xi8>

%0 = arm_sme.zero : vector<[16]x[16]xi8>

return

}

// -----

func.func @arm_sme_tile_load_i8(%src : memref<?x?xi8>) -> () {

awarzynski (Done), on renaming the argument:

- func.func @arm_sme_tile_load_i8(%memref : memref<?x?xi8>) -> () {

+ func.func @arm_sme_tile_load_i8(%src : memref<?x?xi8>) -> () {

For consistency with the `tile_store` at the bottom.

// CHECK: arm_sme.tile_load {{.*}} : memref<?x?xi8>, vector<[16]x[16]xi8>

%c0 = arith.constant 0 : index

%tile = arm_sme.tile_load %src[%c0, %c0] : memref<?x?xi8>, vector<[16]x[16]xi8>

return

}

// -----

func.func @arm_sme_tile_load_i16(%src : memref<?x?xi16>) -> () {

// CHECK: arm_sme.tile_load {{.*}} : memref<?x?xi16>, vector<[8]x[8]xi16>

%c0 = arith.constant 0 : index

%tile = arm_sme.tile_load %src[%c0, %c0] : memref<?x?xi16>, vector<[8]x[8]xi16>

return

}

// -----

func.func @arm_sme_tile_load_i32(%src : memref<?x?xi32>) -> () {

// CHECK: arm_sme.tile_load {{.*}} : memref<?x?xi32>, vector<[4]x[4]xi32>

%c0 = arith.constant 0 : index

%tile = arm_sme.tile_load %src[%c0, %c0] : memref<?x?xi32>, vector<[4]x[4]xi32>

return

}

// -----

func.func @arm_sme_tile_load_i64(%src : memref<?x?xi64>) -> () {

// CHECK: arm_sme.tile_load {{.*}} : memref<?x?xi64>, vector<[2]x[2]xi64>

%c0 = arith.constant 0 : index

%tile = arm_sme.tile_load %src[%c0, %c0] : memref<?x?xi64>, vector<[2]x[2]xi64>

return

}

// -----

func.func @arm_sme_tile_load_i128(%src : memref<?x?xi128>) -> () {

// CHECK: arm_sme.tile_load {{.*}} : memref<?x?xi128>, vector<[1]x[1]xi128>

%c0 = arith.constant 0 : index

%tile = arm_sme.tile_load %src[%c0, %c0] : memref<?x?xi128>, vector<[1]x[1]xi128>

return

}

// -----

func.func @arm_sme_tile_load_f16(%src : memref<?x?xf16>) -> () {

// CHECK: arm_sme.tile_load {{.*}} : memref<?x?xf16>, vector<[8]x[8]xf16>

%c0 = arith.constant 0 : index

%tile = arm_sme.tile_load %src[%c0, %c0] : memref<?x?xf16>, vector<[8]x[8]xf16>

return

}

// -----

func.func @arm_sme_tile_load_bf16(%src : memref<?x?xbf16>) -> () {

// CHECK: arm_sme.tile_load {{.*}} : memref<?x?xbf16>, vector<[8]x[8]xbf16>

%c0 = arith.constant 0 : index

%tile = arm_sme.tile_load %src[%c0, %c0] : memref<?x?xbf16>, vector<[8]x[8]xbf16>

return

}

// -----

func.func @arm_sme_tile_load_f32(%src : memref<?x?xf32>) -> () {

// CHECK: arm_sme.tile_load {{.*}} : memref<?x?xf32>, vector<[4]x[4]xf32>

%c0 = arith.constant 0 : index

%tile = arm_sme.tile_load %src[%c0, %c0] : memref<?x?xf32>, vector<[4]x[4]xf32>

return

}

// -----

func.func @arm_sme_tile_load_f64(%src : memref<?x?xf64>) -> () {

// CHECK: arm_sme.tile_load {{.*}} : memref<?x?xf64>, vector<[2]x[2]xf64>

%c0 = arith.constant 0 : index

%tile = arm_sme.tile_load %src[%c0, %c0] : memref<?x?xf64>, vector<[2]x[2]xf64>

return

}

// -----

func.func @arm_sme_store_tile(%tile : vector<[16]x[16]xi8>, %dest : memref<?x?xi8>) -> () {

// CHECK: arm_sme.tile_store {{.*}} : memref<?x?xi8>, vector<[16]x[16]xi8>

%c0 = arith.constant 0 : index

arm_sme.tile_store %tile, %dest[%c0, %c0] : memref<?x?xi8>, vector<[16]x[16]xi8>

return

}

mlir/test/Dialect/ArmSME/vector-ops-to-llvm.mlir

// RUN: mlir-opt %s -convert-vector-to-arm-sme -convert-vector-to-llvm="enable-arm-sme" -split-input-file | mlir-opt | FileCheck %s

// CHECK-LABEL: @transfer_write_2d_zero_i8

// CHECK-LABEL: @transfer_write_2d_zero_i8(

// CHECK-SAME: %[[ARG0:.*]]: memref<?x?xi8>)

dcaballe (Done): Remove the initial separator. Otherwise, this will compile an empty input for the upper part.

c-rhodes (Author, Done): Did not know that! Fixed, thanks.

// CHECK-DAG: %[[MEM_DESC:.*]] = builtin.unrealized_conversion_cast %[[ARG0]] : memref<?x?xi8> to !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>

// CHECK-DAG: %[[C255:.*]] = arith.constant 255 : i32

// CHECK-DAG: "arm_sme.intr.zero"(%[[C255]]) : (i32) -> ()

// CHECK-DAG: %[[TILE_ID:.*]] = arm_sme.get_tile_id : i8

// CHECK-DAG: %[[CAST_TO_VECTOR:.*]] = arm_sme.cast_tile_to_vector %[[TILE_ID]] : i8 to vector<[16]x[16]xi8>

// CHECK-DAG: %[[CAST_TILE_TO_VECTOR:.*]] = arm_sme.cast_tile_to_vector %[[TILE_ID]] : i8 to vector<[16]x[16]xi8>

// CHECK-DAG: %[[CAST_VECTOR_TO_TILE:.*]] = arm_sme.cast_vector_to_tile %[[CAST_TILE_TO_VECTOR]] : vector<[16]x[16]xi8> to i8

// CHECK-DAG: %[[C1:.*]] = arith.constant 1 : index

// CHECK-DAG: %[[MIN_ZA_VECTORS:.*]] = arith.constant 16 : index

// CHECK-DAG: %[[MIN_SVL_B:.*]] = arith.constant 16 : index

// CHECK-NEXT: %[[VSCALE:.*]] = "llvm.intr.vscale"() : () -> i64

// CHECK-NEXT: %[[VSCALE_IDX:.*]] = builtin.unrealized_conversion_cast %[[VSCALE]] : i64 to index

// CHECK-NEXT: %[[C0_0:.*]] = arith.constant 0 : index

// CHECK-NEXT: %[[C0:.*]] = arith.constant 0 : index

// CHECK-NEXT: %[[NUM_ZA_VECTORS:.*]] = arith.muli %[[MIN_ZA_VECTORS]], %[[VSCALE_IDX]] : index

// CHECK-NEXT: %[[SVL_B:.*]] = arith.muli %[[MIN_SVL_B]], %[[VSCALE_IDX]] : index

// CHECK-NEXT: scf.for %[[VNUM:.*]] = %[[C0_0]] to %[[NUM_ZA_VECTORS]] step %[[C1]] {

// CHECK-NEXT: scf.for %[[TILE_SLICE:.*]] = %[[C0]] to %[[SVL_B]] step %[[C1]] {

// CHECK-NEXT: %[[VNUM_I64:.*]] = arith.index_castui %[[VNUM]] : index to i64

// CHECK-NEXT: %[[TILE_SLICE_I64:.*]] = arith.index_castui %[[TILE_SLICE]] : index to i64

// CHECK-NEXT: %[[C0_1:.*]] = llvm.mlir.constant(0 : i64) : i64

// CHECK-NEXT: %[[ALIGNED_BASE:.*]] = llvm.extractvalue %[[MEM_DESC]][1] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>

// CHECK-NEXT: %[[STRIDE0:.*]] = llvm.extractvalue %[[MEM_DESC]][4, 0] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>

// CHECK-NEXT: %[[OFF0:.*]] = llvm.mul %[[VNUM_I64]], %[[STRIDE0]] : i64

// CHECK-NEXT: %[[OFF0:.*]] = llvm.mul %[[TILE_SLICE_I64]], %[[STRIDE0]] : i64

// CHECK-NEXT: %[[OFF1:.*]] = llvm.add %[[OFF0]], %[[C0_1]] : i64

// CHECK-NEXT: %[[GEP:.*]] = llvm.getelementptr %[[ALIGNED_BASE]]{{\[}}%[[OFF0]]] : (!llvm.ptr, i64) -> !llvm.ptr, i8

// CHECK-NEXT: %[[GEP:.*]] = llvm.getelementptr %[[ALIGNED_BASE]]{{\[}}%[[OFF1]]] : (!llvm.ptr, i64) -> !llvm.ptr, i8

// CHECK-NEXT: %[[TILE_SLICE_I32:.*]] = arith.index_castui %[[TILE_SLICE]] : index to i32

// CHECK-NEXT: %[[VNUM_I32:.*]] = arith.index_castui %[[VNUM]] : index to i32

// CHECK-NEXT: %[[TRUE:.*]] = arith.constant true

// CHECK-NEXT: "arm_sme.intr.str"(%[[VNUM_I32]], %[[GEP]]) : (i32, !llvm.ptr) -> ()

// CHECK-NEXT: %[[PTRUE_ALL:.*]] = arith.constant dense<true> : vector<[16]xi1>

// CHECK-NEXT: %[[TILE_ID_I32:.*]] = arith.extui %[[CAST_VECTOR_TO_TILE]] : i8 to i32

// CHECK-NEXT: "arm_sme.intr.st1b.horiz"(%[[PTRUE_ALL]], %[[GEP]], %[[TILE_ID_I32]], %[[TILE_SLICE_I32]]) : (vector<[16]xi1>, !llvm.ptr, i32, i32) -> ()

func.func @transfer_write_2d_zero_i8(%arg0 : memref<?x?xi8>) {

%c0 = arith.constant 0 : index

awarzynski (Done): [nit] Not needed

%cst = arith.constant dense<0> : vector<[16]x[16]xi8>

vector.transfer_write %cst, %arg0[%c0, %c0] {in_bounds = [true, true]} : vector<[16]x[16]xi8>, memref<?x?xi8>

return

}

// -----

// CHECK-LABEL: @vector_load_i8(

// CHECK-SAME: %[[ARG0:.*]]: memref<?x?xi8>)

// CHECK-NEXT: %[[MEM_DESC:.*]] = builtin.unrealized_conversion_cast %[[ARG0]] : memref<?x?xi8> to !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>

// CHECK-NEXT: %[[C0_0:.*]] = arith.constant 0 : index

// CHECK-NEXT: %[[TILE_ID:.*]] = arm_sme.get_tile_id : i8

// CHECK-NEXT: %[[C1:.*]] = arith.constant 1 : index

// CHECK-NEXT: %[[MIN_SVL_B:.*]] = arith.constant 16 : index

// CHECK-NEXT: %[[VSCALE:.*]] = "llvm.intr.vscale"() : () -> i64

// CHECK-NEXT: %[[VSCALE_IDX:.*]] = builtin.unrealized_conversion_cast %[[VSCALE]] : i64 to index

// CHECK-NEXT: %[[C0_1:.*]] = arith.constant 0 : index

// CHECK-NEXT: %[[SVL_B:.*]] = arith.muli %[[MIN_SVL_B]], %[[VSCALE_IDX]] : index

// CHECK-NEXT: scf.for %[[TILE_SLICE:.*]] = %[[C0_1]] to %[[SVL_B]] step %[[C1]] {

// CHECK-NEXT: %[[TILE_SLICE_I64:.*]] = arith.index_castui %[[TILE_SLICE]] : index to i64

// CHECK-NEXT: %[[ALIGNED_BASE:.*]] = llvm.extractvalue %[[MEM_DESC]][1] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>

// CHECK-NEXT: %[[STRIDE0:.*]] = llvm.extractvalue %[[MEM_DESC]][4, 0] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>

// CHECK-NEXT: %[[OFF0:.*]] = llvm.mul %[[TILE_SLICE_I64]], %[[STRIDE0]] : i64

// CHECK-NEXT: %[[GEP:.*]] = llvm.getelementptr %[[ALIGNED_BASE]]{{\[}}%[[OFF0]]] : (!llvm.ptr, i64) -> !llvm.ptr, i8

// CHECK-NEXT: %[[TILE_SLICE_I32:.*]] = arith.index_castui %[[TILE_SLICE]] : index to i32

// CHECK-NEXT: %[[TRUE:.*]] = arith.constant true

// CHECK-NEXT: %[[PTRUE_ALL:.*]] = arith.constant dense<true> : vector<[16]xi1>

// CHECK-NEXT: %[[TILE_ID_I32:.*]] = arith.extui %[[TILE_ID]] : i8 to i32

// CHECK-NEXT: "arm_sme.intr.ld1b.horiz"(%[[PTRUE_ALL]], %[[GEP]], %[[TILE_ID_I32]], %[[TILE_SLICE_I32]]) : (vector<[16]xi1>, !llvm.ptr, i32, i32) -> ()

// CHECK-NEXT: }

// CHECK-NEXT: %[[CAST_TILE_TO_VECTOR:.*]] = arm_sme.cast_tile_to_vector %[[TILE_ID]] : i8 to vector<[16]x[16]xi8>

// CHECK-NEXT: return %[[CAST_TILE_TO_VECTOR]] : vector<[16]x[16]xi8>

func.func @vector_load_i8(%arg0 : memref<?x?xi8>) -> vector<[16]x[16]xi8> {

%c0 = arith.constant 0 : index

%tile = vector.load %arg0[%c0, %c0] : memref<?x?xi8>, vector<[16]x[16]xi8>

return %tile : vector<[16]x[16]xi8>

}

// -----

// CHECK-LABEL: @vector_load_i8_from_rank_1_memref(

awarzynski (Done), on the test rename:

- // CHECK-LABEL: @vector_load_i8_rank_1_memref(

+ // CHECK-LABEL: @vector_load_i8_from_rank_1_memref(

[nit] Just to make it clearer what's distinct about this test

// CHECK-SAME: %[[ARG0:.*]]: memref<?xi8>)

// CHECK-NEXT: %[[MEM_DESC:.*]] = builtin.unrealized_conversion_cast %[[ARG0]] : memref<?xi8> to !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>

// CHECK-NEXT: %[[C0_0:.*]] = arith.constant 0 : index

// CHECK-NEXT: %[[TILE_ID:.*]] = arm_sme.get_tile_id : i8

// CHECK-NEXT: %[[C1:.*]] = arith.constant 1 : index

// CHECK-NEXT: %[[MIN_SVL_B:.*]] = arith.constant 16 : index

// CHECK-NEXT: %[[VSCALE:.*]] = "llvm.intr.vscale"() : () -> i64

// CHECK-NEXT: %[[VSCALE_IDX:.*]] = builtin.unrealized_conversion_cast %[[VSCALE]] : i64 to index

// CHECK-NEXT: %[[C0_1:.*]] = arith.constant 0 : index

// CHECK-NEXT: %[[SVL_B:.*]] = arith.muli %[[MIN_SVL_B]], %[[VSCALE_IDX]] : index

// CHECK-NEXT: scf.for %[[TILE_SLICE:.*]] = %[[C0_1]] to %[[SVL_B]] step %[[C1]] {

// CHECK-NEXT: %[[TILE_SLICE_I64:.*]] = arith.index_castui %[[TILE_SLICE]] : index to i64

// CHECK-NEXT: %[[SVL_B_I64:.*]] = arith.index_castui %[[SVL_B]] : index to i64

// CHECK-NEXT: %[[TILE_SLICE_IDX:.*]] = arith.muli %[[TILE_SLICE_I64]], %[[SVL_B_I64]] : i64

// CHECK-NEXT: %[[ALIGNED_BASE:.*]] = llvm.extractvalue %[[MEM_DESC]][1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>

// CHECK-NEXT: %[[GEP:.*]] = llvm.getelementptr %[[ALIGNED_BASE]]{{\[}}%[[TILE_SLICE_IDX]]] : (!llvm.ptr, i64) -> !llvm.ptr, i8

// CHECK-NEXT: %[[TILE_SLICE_I32:.*]] = arith.index_castui %[[TILE_SLICE]] : index to i32

// CHECK-NEXT: %[[TRUE:.*]] = arith.constant true

// CHECK-NEXT: %[[PTRUE_ALL:.*]] = arith.constant dense<true> : vector<[16]xi1>

// CHECK-NEXT: %[[TILE_ID_I32:.*]] = arith.extui %[[TILE_ID]] : i8 to i32

// CHECK-NEXT: "arm_sme.intr.ld1b.horiz"(%[[PTRUE_ALL]], %[[GEP]], %[[TILE_ID_I32]], %[[TILE_SLICE_I32]]) : (vector<[16]xi1>, !llvm.ptr, i32, i32) -> ()

// CHECK-NEXT: }

// CHECK-NEXT: %[[CAST_TILE_TO_VECTOR:.*]] = arm_sme.cast_tile_to_vector %[[TILE_ID]] : i8 to vector<[16]x[16]xi8>

// CHECK-NEXT: return %[[CAST_TILE_TO_VECTOR]] : vector<[16]x[16]xi8>

func.func @vector_load_i8_from_rank_1_memref(%arg0 : memref<?xi8>) -> vector<[16]x[16]xi8> {

%c0 = arith.constant 0 : index

%tile = vector.load %arg0[%c0] : memref<?xi8>, vector<[16]x[16]xi8>

return %tile : vector<[16]x[16]xi8>

}
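
The two `vector_load_i8` tests above differ in how the per-slice address is formed: the rank-2 case multiplies the tile slice index by the memref's outer stride, while the rank-1 case multiplies it by the runtime slice length. A rough sketch of both computations follows, assuming zero base indices as in the tests and element offsets rather than byte addresses. Both function names are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Rank-2 memref: offset of slice i is i * stride0 (OFF0 in the CHECK lines).
inline std::vector<std::int64_t> sliceOffsetsRank2(std::int64_t numSlices,
                                                   std::int64_t stride0) {
  std::vector<std::int64_t> offsets;
  offsets.reserve(static_cast<std::size_t>(numSlices));
  for (std::int64_t slice = 0; slice < numSlices; ++slice)
    offsets.push_back(slice * stride0);
  return offsets;
}

// Rank-1 memref: offset of slice i is i * numSlices, where numSlices is the
// runtime slice length (TILE_SLICE_IDX = TILE_SLICE * SVL_B in the test).
inline std::vector<std::int64_t> sliceOffsetsRank1(std::int64_t numSlices) {
  std::vector<std::int64_t> offsets;
  offsets.reserve(static_cast<std::size_t>(numSlices));
  for (std::int64_t slice = 0; slice < numSlices; ++slice)
    offsets.push_back(slice * numSlices);
  return offsets;
}
```

In the lowered IR each offset feeds an `llvm.getelementptr` on the aligned base pointer, once per iteration of the `scf.for` loop over tile slices.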

// -----

// CHECK-LABEL: @vector_load_i16(

// CHECK-SAME: %[[ARG0:.*]]: memref<?x?xi16>)

// CHECK: %[[TILE_ID:.*]] = arm_sme.get_tile_id : i16

// CHECK: %[[MIN_SVL_H:.*]] = arith.constant 8 : index

// CHECK: %[[SVL_H:.*]] = arith.muli %[[MIN_SVL_H]], %{{.*}} : index

// CHECK: %[[TILE_ID_I32:.*]] = arith.extui %[[TILE_ID]] : i16 to i32

// CHECK: arm_sme.intr.ld1h.horiz

// CHECK: %[[CAST_TILE_TO_VECTOR:.*]] = arm_sme.cast_tile_to_vector %[[TILE_ID]] : i16 to vector<[8]x[8]xi16>

func.func @vector_load_i16(%arg0 : memref<?x?xi16>) -> vector<[8]x[8]xi16> {

%c0 = arith.constant 0 : index

%tile = vector.load %arg0[%c0, %c0] : memref<?x?xi16>, vector<[8]x[8]xi16>

return %tile : vector<[8]x[8]xi16>

}

// -----

// CHECK-LABEL: @vector_load_i32(

// CHECK-SAME: %[[ARG0:.*]]: memref<?x?xi32>)

// CHECK: %[[TILE_ID:.*]] = arm_sme.get_tile_id : i32

// CHECK: %[[MIN_SVL_S:.*]] = arith.constant 4 : index

// CHECK: %[[SVL_S:.*]] = arith.muli %[[MIN_SVL_S]], %{{.*}} : index

// CHECK-NOT: arith.extui %[[TILE_ID]]

// CHECK-NOT: arith.trunci %[[TILE_ID]]

// CHECK: arm_sme.intr.ld1w.horiz

// CHECK: %[[CAST_TILE_TO_VECTOR:.*]] = arm_sme.cast_tile_to_vector %[[TILE_ID]] : i32 to vector<[4]x[4]xi32>

func.func @vector_load_i32(%arg0 : memref<?x?xi32>) -> vector<[4]x[4]xi32> {

%c0 = arith.constant 0 : index

awarzynski (Not Done): This and the following tests differ only in a few details that are tricky to spot. I am thinking that perhaps we should trim these to highlight the differences? That would be more in line with https://mlir.llvm.org/getting_started/TestingGuide/:

> Tests should be minimal, and only check what is absolutely necessary.
> This means that anything in the output that is not core to the functionality that you are testing should not be present in a CHECK line.

The 2 tests above are sufficient to test the other nuances.

c-rhodes (Author, Done): Good suggestion, they were a bit verbose. I've simplified them to highlight the important bits.

%tile = vector.load %arg0[%c0, %c0] : memref<?x?xi32>, vector<[4]x[4]xi32>

return %tile : vector<[4]x[4]xi32>

}

// -----

// CHECK-LABEL: @vector_load_i64(

// CHECK-SAME: %[[ARG0:.*]]: memref<?x?xi64>)

// CHECK: %[[TILE_ID:.*]] = arm_sme.get_tile_id : i64

// CHECK: %[[MIN_SVL_D:.*]] = arith.constant 2 : index

// CHECK: %[[SVL_D:.*]] = arith.muli %[[MIN_SVL_D]], %{{.*}} : index

// CHECK: %[[TILE_ID_I32:.*]] = arith.trunci %[[TILE_ID]] : i64 to i32

// CHECK: arm_sme.intr.ld1d.horiz

// CHECK: %[[CAST_TILE_TO_VECTOR:.*]] = arm_sme.cast_tile_to_vector %[[TILE_ID]] : i64 to vector<[2]x[2]xi64>

func.func @vector_load_i64(%arg0 : memref<?x?xi64>) -> vector<[2]x[2]xi64> {

%c0 = arith.constant 0 : index

%tile = vector.load %arg0[%c0, %c0] : memref<?x?xi64>, vector<[2]x[2]xi64>

return %tile : vector<[2]x[2]xi64>

}
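
The trimmed tests check how the tile ID reaches the intrinsic as an `i32`: `arith.extui` for `i8` and `i16` tile IDs, no cast for `i32`, and `arith.trunci` for `i64`. A small sketch of that decision follows, assuming the only input is the tile ID's bit width. `CastKind` and `tileIdCastToI32` are hypothetical names that model the behaviour the CHECK lines expect from `castTileIDToI32`, not its actual implementation.

```cpp
// Which cast, if any, brings a tile ID of the given width to 32 bits.
enum class CastKind { ZeroExtend, None, Truncate };

// Hypothetical model of the cast selection the tests check for.
constexpr CastKind tileIdCastToI32(unsigned tileIdWidthInBits) {
  if (tileIdWidthInBits < 32)
    return CastKind::ZeroExtend; // arith.extui in the CHECK lines
  if (tileIdWidthInBits > 32)
    return CastKind::Truncate; // arith.trunci in the CHECK lines
  return CastKind::None; // already i32, no cast expected
}
```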

// -----

// CHECK-LABEL: @vector_load_f16(

// CHECK-SAME: %[[ARG0:.*]]: memref<?x?xf16>)

// CHECK: %[[TILE_ID:.*]] = arm_sme.get_tile_id : i16

// CHECK: %[[MIN_SVL_H:.*]] = arith.constant 8 : index

// CHECK: %[[SVL_H:.*]] = arith.muli %[[MIN_SVL_H]], %{{.*}} : index

// CHECK: %[[TILE_ID_I32:.*]] = arith.extui %[[TILE_ID]] : i16 to i32

// CHECK: arm_sme.intr.ld1h.horiz

// CHECK: %[[CAST_TILE_TO_VECTOR:.*]] = arm_sme.cast_tile_to_vector %[[TILE_ID]] : i16 to vector<[8]x[8]xf16>

func.func @vector_load_f16(%arg0 : memref<?x?xf16>) -> vector<[8]x[8]xf16> {

%c0 = arith.constant 0 : index

%tile = vector.load %arg0[%c0, %c0] : memref<?x?xf16>, vector<[8]x[8]xf16>

return %tile : vector<[8]x[8]xf16>

}

// -----

// CHECK-LABEL: @vector_load_bf16(

// CHECK-SAME: %[[ARG0:.*]]: memref<?x?xbf16>)

// CHECK: %[[TILE_ID:.*]] = arm_sme.get_tile_id : i16

// CHECK: %[[MIN_SVL_H:.*]] = arith.constant 8 : index

// CHECK: %[[SVL_H:.*]] = arith.muli %[[MIN_SVL_H]], %{{.*}} : index

// CHECK: %[[TILE_ID_I32:.*]] = arith.extui %[[TILE_ID]] : i16 to i32

// CHECK: arm_sme.intr.ld1h.horiz

// CHECK: %[[CAST_TILE_TO_VECTOR:.*]] = arm_sme.cast_tile_to_vector %[[TILE_ID]] : i16 to vector<[8]x[8]xbf16>

func.func @vector_load_bf16(%arg0 : memref<?x?xbf16>) -> vector<[8]x[8]xbf16> {

%c0 = arith.constant 0 : index

%tile = vector.load %arg0[%c0, %c0] : memref<?x?xbf16>, vector<[8]x[8]xbf16>

return %tile : vector<[8]x[8]xbf16>

}

// -----

// CHECK-LABEL: @vector_load_f32(

// CHECK-SAME: %[[ARG0:.*]]: memref<?x?xf32>)

// CHECK: %[[TILE_ID:.*]] = arm_sme.get_tile_id : i32

// CHECK: %[[MIN_SVL_S:.*]] = arith.constant 4 : index

// CHECK: %[[SVL_S:.*]] = arith.muli %[[MIN_SVL_S]], %{{.*}} : index

// CHECK-NOT: arith.extui %[[TILE_ID]]

// CHECK-NOT: arith.trunci %[[TILE_ID]]

// CHECK: arm_sme.intr.ld1w.horiz

// CHECK: %[[CAST_TILE_TO_VECTOR:.*]] = arm_sme.cast_tile_to_vector %[[TILE_ID]] : i32 to vector<[4]x[4]xf32>

func.func @vector_load_f32(%arg0 : memref<?x?xf32>) -> vector<[4]x[4]xf32> {

%c0 = arith.constant 0 : index

%tile = vector.load %arg0[%c0, %c0] : memref<?x?xf32>, vector<[4]x[4]xf32>

return %tile : vector<[4]x[4]xf32>

}

// -----

// CHECK-LABEL: @vector_load_f64(

// CHECK-SAME: %[[ARG0:.*]]: memref<?x?xf64>)

// CHECK: %[[TILE_ID:.*]] = arm_sme.get_tile_id : i64

// CHECK: %[[MIN_SVL_D:.*]] = arith.constant 2 : index

// CHECK: %[[SVL_D:.*]] = arith.muli %[[MIN_SVL_D]], %{{.*}} : index

// CHECK: %[[TILE_ID_I32:.*]] = arith.trunci %[[TILE_ID]] : i64 to i32

// CHECK: arm_sme.intr.ld1d.horiz

// CHECK: %[[CAST_TILE_TO_VECTOR:.*]] = arm_sme.cast_tile_to_vector %[[TILE_ID]] : i64 to vector<[2]x[2]xf64>

func.func @vector_load_f64(%arg0 : memref<?x?xf64>) -> vector<[2]x[2]xf64> {

%c0 = arith.constant 0 : index

%tile = vector.load %arg0[%c0, %c0] : memref<?x?xf64>, vector<[2]x[2]xf64>

return %tile : vector<[2]x[2]xf64>

}

// -----

// CHECK-LABEL: @vector_store_i8(

// CHECK-SAME: %[[TILE:.*]]: vector<[16]x[16]xi8>,

// CHECK-SAME: %[[ARG0:.*]]: memref<?x?xi8>)

// CHECK-NEXT: %[[MEM_DESC:.*]] = builtin.unrealized_conversion_cast %[[ARG0]] : memref<?x?xi8> to !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>

// CHECK-NEXT: %[[C0_0:.*]] = arith.constant 0 : index

// CHECK-NEXT: %[[CAST_VECTOR_TO_TILE:.*]] = arm_sme.cast_vector_to_tile %[[TILE]] : vector<[16]x[16]xi8> to i8