This is an archive of the discontinued LLVM Phabricator instance.

[mlir][ArmSME] Add tile load op and extend tile store tile size support
ClosedPublic

Authored by c-rhodes on Jul 14 2023, 9:00 AM.

Details

Summary

This extends the existing 'arm_sme.tile_store' op to support all tile
sizes and adds a new op 'arm_sme.tile_load', as well as lowerings from
vector -> custom ops and custom ops -> intrinsics. Currently there's no
lowering for i128.

Depends on D154867

Diff Detail

Event Timeline

c-rhodes created this revision.Jul 14 2023, 9:00 AM
Herald added a reviewer: ftynse. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
c-rhodes requested review of this revision.Jul 14 2023, 9:00 AM
Matt added a subscriber: Matt.Jul 14 2023, 2:45 PM

Overall looks good, thanks! We definitely need to take care of the loop materialisation after this change (i.e. move it higher up the compilation stack).

I've left a fair few comments, but nothing major and this is a rather large patch. I still need to go over the tests.

Thanks for working on this!

awarzynski added inline comments.Jul 19 2023, 12:47 AM
mlir/include/mlir/Dialect/ArmSME/IR/ArmSME.td
230
238
mlir/include/mlir/Dialect/ArmSME/Utils/Utils.h
25
28
mlir/lib/Conversion/VectorToArmSME/CMakeLists.txt
15

sort

mlir/lib/Conversion/VectorToArmSME/VectorToArmSME.cpp
102–120

I'd move this to an utility function - I expect that we'll be needing this for other Ops as well.

117–118

IMHO, this would be less noisy (i.e. no VectorLoadStoreToArmSMELowering template):

patterns.add<TransferWriteToArmSMELowering, VectorLoadToArmSME, VectorStoreToArmSME>(&ctx);

This way every element in patterns.add would have a more distinct name. But ultimately, it's a matter of preference so go with whatever you prefer.

mlir/lib/Dialect/ArmSME/Transforms/LegalizeForLLVMExport.cpp
98–117

Please preserve this comment. Fine details, imho, can be extracted from the code. But documenting the overall structure is helpful.

280–282

No casting happens in this routine :)

285

don't use else after a return ;-)

Similar comment for getOffset.

291–292

What "offset" is it? Why do we need to adjust it in the 1-D case and just return as is in 2-D?

In the 1-D case, is it:

offet * vscale * minElems

? And is minElems the minimum number of elements in a scalable vector? So basically the "base size of an SVE vector"?

321–322

The "horizontal" load instructions take "slice number": https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/LD1H--scalar-plus-scalar--tile-slice---Contiguous-load-of-halfwords-to-16-bit-element-ZA-tile-slice-?lang=en.

I would rename vnumI32 as sliceNumI32 so that this is easier to match with the spec.

331–344

Could this be a switch statement instead?

mlir/lib/Dialect/ArmSME/Utils/CMakeLists.txt
2

Do we need a dedicated library for one CPP file? Perhaps it's sufficient to add this to MLIRArmSMETransforms?

mlir/lib/Dialect/ArmSME/Utils/Utils.cpp
21

This repeats the comment from the header file - it will become out of sync if somebody (e.g. me) forgets that and only updates one copy. IMHO, it's fine to limit the comments to where the interface is defined (i.e. the header file).

24

Avoid else after return.

32

Given that this code will only be used only by SME/SSVE, why not name this as:

getSVEVectorBaseSize

This way it will be very clear that it's some special SME/SSVE hook. Also:

in SME array vector

+ "and an SVE vector"?

c-rhodes updated this revision to Diff 542073.Jul 19 2023, 9:28 AM

Address comments.

c-rhodes marked 12 inline comments as done.Jul 19 2023, 9:42 AM

Overall looks good, thanks! We definitely need to take care of the loop materialisation after this change (i.e. move it higher up the compilation stack).

Thanks for the comments! I agree w.r.t. loop materialization, currently looking into that separately from this.

mlir/include/mlir/Dialect/ArmSME/Utils/Utils.h
25

I discovered getIntOrFloatBitWidth so I've removed this and replaced it with that.

mlir/lib/Conversion/VectorToArmSME/VectorToArmSME.cpp
102–120

I'd move this to an utility function - I expect that we'll be needing this for other Ops as well.

Moved to arm_sme::isSMETileLikeVectorType.

117–118

IMHO, this would be less noisy (i.e. no VectorLoadStoreToArmSMELowering template):

patterns.add<TransferWriteToArmSMELowering, VectorLoadToArmSME, VectorStoreToArmSME>(&ctx);

This way every element in patterns.add would have a more distinct name. But ultimately, it's a matter of preference so go with whatever you prefer.

That's a good suggestion! Done

mlir/lib/Dialect/ArmSME/Transforms/LegalizeForLLVMExport.cpp
98–117

Please preserve this comment. Fine details, imho, can be extracted from the code. But documenting the overall structure is helpful.

280–282

No casting happens in this routine :)

291–292

What "offset" is it? Why do we need to adjust it in the 1-D case and just return as is in 2-D?

The offset to the load or store pointer. In the 2D case getStridedElementPtr does the arithmetic for us, but in the 1D case we have to do it ourselves.

In the 1-D case, is it:

offet * vscale * minElems

Yeah, and offset is vnum so it's vnum * vscale * minElems. So the offset is the number of elements for a given type in a vector of SVL bits (SVLt), and this is scaled by vnum. So the base would get incremented a tile vector at a time.

? And is minElems the minimum number of elements in a scalable vector? So basically the "base size of an SVE vector"?

Yeah, so 128 / esize. Perhaps getMinNumElts would be better implemented like that rather than with a switch actually.

As for where supporting for both 1D and 2D memrefs came from, I initially started with these integration tests that dump ZA but could get it to work with 2D memrefs: https://gist.github.com/c-rhodes/1e9f2d8fd0ca3c6539f167e08079f6ab

I found those tests useful for verification but since the output varies depending on the runtime VL we cant add these tests.

321–322

The "horizontal" load instructions take "slice number": https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/LD1H--scalar-plus-scalar--tile-slice---Contiguous-load-of-halfwords-to-16-bit-element-ZA-tile-slice-?lang=en.

I would rename vnumI32 as sliceNumI32 so that this is easier to match with the spec.

Good point, naming is difficult, updated to "tile slice", this seems consistent with LLVM as well.

mlir/lib/Dialect/ArmSME/Utils/CMakeLists.txt
2

Do we need a dedicated library for one CPP file? Perhaps it's sufficient to add this to MLIRArmSMETransforms?

I copied this from another dialect and it seems all dialects with utils do this.

mlir/lib/Dialect/ArmSME/Utils/Utils.cpp
32

Given that this code will only be used only by SME/SSVE, why not name this as:

getSVEVectorBaseSize

This way it will be very clear that it's some special SME/SSVE hook. Also:

in SME array vector

+ "and an SVE vector"?

Naming this is tricky, I mean it to be the minSVLT which is the minimum number of elements in a vector of SVL bits, which when scaled by vscale gives both the number of tile slices (vector of SVL bits) in ZA and also the number of elements in a tile slice. I've updated it to getSMETileSliceMinNumElts but will have a think, also not sure we want to mention SVE here?

Thanks for the updates, I've left a few more nits/suggestions, but nothing major.

mlir/include/mlir/Dialect/ArmSME/Utils/Utils.h
10

[nit]

mlir/lib/Dialect/ArmSME/Transforms/LegalizeForLLVMExport.cpp
114–118

Right, IIUC, this is something like:

Scale the memory offset, i.e. `vnum`, if needed:
* for rank 2 memrefs, `getStridedElementPtr`does the calculation for us, so just return `vnum`. 
* for rank 1 memrefs, assume row-major storage and scale by the effective vector length.

Btw, this makes lots of sense, I just would like for us to be very clear about the meaning of offset and vnum in this context. The latter name. imho, includes a bit of helpful context, hence suggestion to rename. In the context getStridedElementPtr, offset probably makes more sense. getOffset also feels a bit too generic 🤔 .

291–292

I found those tests useful for verification but since the output varies depending on the runtime VL we cant add these tests.

Yeah, it would be nice to include them. Wouldn't it be possible to add CHECK lines that would assume minimum possible VL for each type? Not in this patch though - it's quite large as is.

292

[nit] Suggestion for a more descriptive name (it's a rather key bit in the SME logic)

mlir/lib/Dialect/ArmSME/Utils/Utils.cpp
28

[nit] Naming is hard

35

[nit] Naming is hard

mlir/test/Dialect/ArmSME/roundtrip.mlir
195

For consistency with the tile_store at the bottom.

mlir/test/Dialect/ArmSME/vector-ops-to-llvm.mlir
28–29

[nit] Not needed

70

[nit] Just to make it clearer what's distinct about this test

106–130

This and the following tests differ only in a few details that are tricky to spot. I am thinking that perhaps we should trim these to highlight the differences? That would be more in line with https://mlir.llvm.org/getting_started/TestingGuide/:

Tests should be minimal, and only check what is absolutely necessary.

This means that anything in the output that is not core to the functionality that you are testing should not be present in a CHECK line.

The 2 tests above are sufficient to test the other nuances.

mlir/test/Integration/Dialect/Vector/CPU/ArmSME/vector-load-store.mlir
78

Having a "negative" would be good too (i.e. verify that mem1 != mem2).

97

How about printing mem1 and mem2 and checking something this:

//  CHECK:  0.1, 0.1 {{.*}}
//  CHECK: 1.1, 1.1 {{.*}}
//  CHECK: <some clever string printed after all of mem1 has been printed>

Btw, after reviewing this I feel bad for not adding more roundtrip tests for arm_sme.store_tile, so I've created https://reviews.llvm.org/D155800 :) Lets land this one first so that it's your patch that fixes the definition in ArmSME.td 👍🏻 .

c-rhodes marked 4 inline comments as done.Jul 20 2023, 2:50 AM

Btw, after reviewing this I feel bad for not adding more roundtrip tests for arm_sme.store_tile, so I've created https://reviews.llvm.org/D155800 :) Lets land this one first so that it's your patch that fixes the definition in ArmSME.td 👍🏻 .

You couldn't have added them as only i8 was supported then 😄 I probably should have done that as part of this patch, but thanks for adding them.

c-rhodes updated this revision to Diff 542574.Jul 20 2023, 10:09 AM

Address comments

c-rhodes marked 8 inline comments as done.Jul 20 2023, 10:15 AM

Thanks for the updates, I've left a few more nits/suggestions, but nothing major.

Thanks again, addressed almost all of them (just got a question on comment you left in integration test), also renamed a few variables in the rewrites to make things clearer. Cheers

mlir/lib/Dialect/ArmSME/Transforms/LegalizeForLLVMExport.cpp
114–118

Right, IIUC, this is something like:

Scale the memory offset, i.e. `vnum`, if needed:
* for rank 2 memrefs, `getStridedElementPtr`does the calculation for us, so just return `vnum`. 
* for rank 1 memrefs, assume row-major storage and scale by the effective vector length.

Btw, this makes lots of sense, I just would like for us to be very clear about the meaning of offset and vnum in this context. The latter name. imho, includes a bit of helpful context, hence suggestion to rename. In the context getStridedElementPtr, offset probably makes more sense. getOffset also feels a bit too generic 🤔 .

I've renamed it to getTileSlicePtrIndex, hopefully this clarifies things, also improved comment at top based on your suggestion.

291–292

I found those tests useful for verification but since the output varies depending on the runtime VL we cant add these tests.

Yeah, it would be nice to include them. Wouldn't it be possible to add CHECK lines that would assume minimum possible VL for each type? Not in this patch though - it's quite large as is.

Yeah it would actually, I've done that in the integration test already part of this based on your suggestion, but would could add the one I linked in future as well.

mlir/test/Dialect/ArmSME/vector-ops-to-llvm.mlir
106–130

This and the following tests differ only in a few details that are tricky to spot. I am thinking that perhaps we should trim these to highlight the differences? That would be more in line with https://mlir.llvm.org/getting_started/TestingGuide/:

Tests should be minimal, and only check what is absolutely necessary.

This means that anything in the output that is not core to the functionality that you are testing should not be present in a CHECK line.

The 2 tests above are sufficient to test the other nuances.

Good suggestion they were a bit verbose, I've simplified them to highlight the important bits.

mlir/test/Integration/Dialect/Vector/CPU/ArmSME/vector-load-store.mlir
78

Having a "negative" would be good too (i.e. verify that mem1 != mem2).

After zeroing mem2?

dcaballe accepted this revision.Jul 20 2023, 11:02 AM

LGTM % minor comments, thanks!

mlir/include/mlir/Dialect/ArmSME/IR/ArmSME.td
230

I would add some minor comment about memory constraints (the slice of memory read should be contiguous, etc.). You can get some inspiration from the vector.transfer_read op.

360

It may be worth asking in Discourse about this. There might be a better side effect or workaround for this.

mlir/lib/Conversion/VectorToArmSME/VectorToArmSME.cpp
82–92

We usually do this in place. No strong opinion but the number of these utilities may explode and it also create some kind of specific convention within this file.

116

why do we need these specializations?

mlir/lib/Dialect/ArmSME/Transforms/LegalizeForLLVMExport.cpp
165

we may want to materialize the loops in the SME dialect before the number of cases grow more

335

spell out auto?

mlir/lib/Dialect/ArmSME/Utils/CMakeLists.txt
2

Yeah, I've seen some complaints about the utils library in the past. I was even recommended to remove the utils files altogether. I'm ok with this, though, if other dialects are doing the same...

mlir/test/Dialect/ArmSME/vector-ops-to-llvm.mlir
3–4

remove init separator. Otherwise, this will compile an empty input for the upper part.

354

Nice testing!

This revision is now accepted and ready to land.Jul 20 2023, 11:02 AM
awarzynski accepted this revision.Jul 21 2023, 1:11 AM

LGTM, thanks for the updates!

mlir/lib/Conversion/VectorToArmSME/VectorToArmSME.cpp
116

I do see _why_, but I agree with Diego that it's not immediately clear.

mlir/test/Integration/Dialect/Vector/CPU/ArmSME/vector-load-store.mlir
78

That comment was for:

// Verify "mem1" == "mem2"

I am suggesting a negative test, because current testing won't capture scenarios where something goes badly wrong. For example, when mem1 and mem2 end up pointing to the same memory location. Or, put differently, we can make sure that two mem buffers are identical. Can we make sure that they are different?

This something to consider adding, not a blocker. The testing in this patch is already pretty through.

c-rhodes updated this revision to Diff 543268.Jul 23 2023, 4:15 AM
c-rhodes marked 2 inline comments as done.

Address comments.

c-rhodes marked 4 inline comments as done.Jul 23 2023, 4:19 AM

@awarzynski @dcaballe thanks for reviewing, I believe I've addressed all comments now, will probably land this tomorrow unless there's further comments by then, cheers!

mlir/include/mlir/Dialect/ArmSME/IR/ArmSME.td
230

I would add some minor comment about memory constraints (the slice of memory read should be contiguous, etc.). You can get some inspiration from the vector.transfer_read op.

Done cheers, the indices aren't used in the lowering at the moment it just assumes 0 to begin with then adjusts it for each tile slice. I've added a TODO to fix that.

360

It may be worth asking in Discourse about this. There might be a better side effect or workaround for this.

Good suggestion will do

mlir/lib/Conversion/VectorToArmSME/VectorToArmSME.cpp
82–92

We usually do this in place. No strong opinion but the number of these utilities may explode and it also create some kind of specific convention within this file.

I copied this from mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVM.cpp, but it's actually less lines and simpler in place, thanks for suggestion!

116

why do we need these specializations?

mlir/lib/Dialect/ArmSME/Transforms/LegalizeForLLVMExport.cpp
165

we may want to materialize the loops in the SME dialect before the number of cases grow more

That's the plan, initially I was going to focus on vector.broadcast -> ArmSME after this to enable linalg.fill, but I've been looking at loop materialization first so the number of cases doesn't grow as you point out.

mlir/test/Dialect/ArmSME/vector-ops-to-llvm.mlir
3–4

remove init separator. Otherwise, this will compile an empty input for the upper part.

Did not know that! Fixed thanks

mlir/test/Integration/Dialect/Vector/CPU/ArmSME/vector-load-store.mlir
78

That comment was for:

// Verify "mem1" == "mem2"

I am suggesting a negative test, because current testing won't capture scenarios where something goes badly wrong. For example, when mem1 and mem2 end up pointing to the same memory location. Or, put differently, we can make sure that two mem buffers are identical. Can we make sure that they are different?

This something to consider adding, not a blocker. The testing in this patch is already pretty through.

Done

c-rhodes updated this revision to Diff 543431.Jul 24 2023, 2:24 AM
c-rhodes marked 3 inline comments as done.

Remove side-effects from load intrinsics.

c-rhodes added inline comments.Jul 24 2023, 2:31 AM
mlir/include/mlir/Dialect/ArmSME/IR/ArmSME.td
360

It may be worth asking in Discourse about this. There might be a better side effect or workaround for this.

Good suggestion will do

In the process of asking on Discourse I did some digging and found the SME load intrinsics don't get DCE'd in the backend because they have no side-effects and so the default worse-cast is assumed

Intr*Mem - Memory properties. If no property is set, the worst case
is assumed (it may read and write any memory it can get access to and it may
have other side effects).

I've removed the side-effects from the load intrinsics to match the semantics of the backend and they no longer get DCE'd. Thanks for the suggestion to look into this!