
[Matrix] Update load/store intrinsics.
ClosedPublic

Authored by fhahn on Jun 9 2020, 8:20 AM.

Details

Summary

This patch adjusts the load/store matrix intrinsics, formerly known as
llvm.matrix.columnwise.load/store, to improve the naming and to allow
passing extra information (volatile).

The patch performs the following changes:

  • Renames columnwise.load/store to column.major.load/store. This is more expressive and more in line with the naming in Clang.
  • Changes the shape and stride arguments from i32 to i64. All 3 arguments could in theory be larger than an i32 and there is no real reason to restrict them. For the immargs, there should be no change in practice. This makes things more uniform with the way they are handled in Clang.
  • Adds a new boolean argument to indicate whether the load/store is volatile. The lowering respects it when emitting vector load/store instructions.
  • Updates MatrixBuilder to require both Alignment and IsVolatile arguments, which are passed through to the generated intrinsic. The alignment is set using the align attribute.

The changes are grouped together in a single patch, to have a single
commit that breaks compatibility. We should probably be fine with
updating the intrinsics, as we did not yet officially support them in
the last stable release. If there are any concerns, we can add
auto-upgrade rules for the columnwise intrinsics though.
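As an illustration, the access pattern the renamed llvm.matrix.column.major.load intrinsic describes can be modeled in a few lines of Python. This is an illustrative sketch only; the function name and the list-based "memory" are stand-ins, not LLVM API.

```python
# Model of the column-major strided load: read a rows x cols matrix whose
# columns start `stride` elements apart, and return the flattened
# column-major vector the intrinsic would produce.
def column_major_load(memory, base, stride, rows, cols):
    result = []
    for col in range(cols):
        start = base + col * stride
        result.extend(memory[start:start + rows])
    return result

# A 2x2 matrix [[1, 3], [2, 4]] stored with stride 3 (one padding element
# between columns): the columns are [1, 2] and [3, 4].
memory = [1, 2, 0, 3, 4, 0]
print(column_major_load(memory, 0, 3, 2, 2))  # [1, 2, 3, 4]
```

The stride argument is what lets the intrinsic load a submatrix out of a larger matrix in memory.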

Diff Detail

Event Timeline

fhahn created this revision.Jun 9 2020, 8:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 9 2020, 8:21 AM

I admittedly don't know Clang's naming scheme, so feel free to ignore this, but I dislike the change to *.column.major.* because it feels like both are changeable parameters (i.e. column/row and major/minor are two different axes). Anyone who knows what they're looking for would understand that this doesn't make sense, but the potential for confusion remains. I'm not sure that dropping the major portion makes any sense either, because it might seem like you're promising loads and stores no matter the storage order (e.g. llvm.matrix.column.load might seem to promise a column load even if the matrix is row-major, which isn't the intent). Removing the . between them is an option, but then it's not that much different from the original columnwise, unless "in line with Clang's naming" means using "column major" instead of "column-wise". Maybe I'm just putting too much meaning into the . too, but I'd rather mention it and be told it's fine than not say anything :)

I like the name change, although I wonder if you could just have a single intrinsic that takes both a row stride and a column stride and recognizes the common patterns. Presumably even with column-major ordering you already want to optimize the case where the stride is a constant equal to the row count, so this would just be a generalization of that.

Are you planning to actually support alignment in a follow-up patch? I don't see anything here that propagates alignment to the lowering routines.

llvm/docs/LangRef.rst
15489

Like llvm.memcpy and so on, you should document that the align attribute can be added to the pointer parameter to specify the required alignment.

llvm/include/llvm/IR/MatrixBuilder.h
60–62

Alignment should be an llvm::Align.

fhahn updated this revision to Diff 269855.Jun 10 2020, 8:00 AM

I like the name change, although I wonder if you could just have a single intrinsic that takes both a row stride and a column stride and recognizes the common patterns. Presumably even with column-major ordering you already want to optimize the case where the stride is a constant equal to the row count, so this would just be a generalization of that.

I am not sure about having a single intrinsic. The column.major part in the name signifies that the resulting matrix is in column-major layout (which is then used internally during the lowering). I suppose it would be possible to have both row & column strides and use a special value to indicate what the leading dimension is, but it seems to me that having dedicated intrinsics would be more explicit.

Are you planning to actually support alignment in a follow-up patch? I don't see anything here that propagates alignment to the lowering routines.

Yes, this patch just adjusts the intrinsics definitions/langref. Respecting both IsVolatile & the alignment attribute will be done in follow-up patches. There's already one for IsVolatile D81498. I think setting the alignment correctly for the split stores is not completely trivial, because the original alignment will hold for the first split access, but may not hold for the subsequent accesses and some extra work is needed to figure out which alignments to use for subsequent stores, depending on the stride.

I admittedly don't know Clang's naming scheme, so feel free to ignore this, but I dislike the change to *.column.major.* because it feels like both are changeable parameters (i.e. column/row and major/minor are two different axes). Anyone who knows what they're looking for would understand that this doesn't make sense, but the potential for confusion remains. I'm not sure that dropping the major portion makes any sense either, because it might seem like you're promising loads and stores no matter the storage order

I am not sure I quite follow, I am afraid. The column major in the name is meant to refer to the data layout assumed for the operation. The intrinsics treat both the accessed memory and the loaded/stored value as column-major. It could also be a parameter, but the layout is intentionally encoded in the name. We have to use .column.major. instead of something like .column_major., because of the way intrinsic names are handled: _ is automatically replaced by '.'. In the future, we are planning on adding llvm.matrix.row.major.load/store.

So far, I am not planning on adding variants that allow the memory-layout to not match the operand/result layout (e.g. treat memory as row-major but the loaded value is in column-major).

(e.g. llvm.matrix.column.load might seem to promise a column load even if the matrix is row-major, which isn't the intent). Removing the . between them is an option, but then it's not that much different from the original columnwise, unless "in line with Clang's naming" means using "column major" instead of "column-wise". Maybe I'm just putting too much meaning into the . too, but I'd rather mention it and be told it's fine than not say anything :)

Thanks for sharing your thoughts! As mentioned above, we have to use either . as a separator or no separator at all. Please let me know if that makes sense.

fhahn marked 2 inline comments as done.Jun 10 2020, 8:00 AM
fhahn added inline comments.
llvm/docs/LangRef.rst
15489

Updated, thanks!

I like the name change, although I wonder if you could just have a single intrinsic that takes both a row stride and a column stride and recognizes the common patterns. Presumably even with column-major ordering you already want to optimize the case where the stride is a constant equal to the row count, so this would just be a generalization of that.

I am not sure about having a single intrinsic. The column.major part in the name signifies that the resulting matrix is in column-major layout (which is then used internally during the lowering). I suppose it would be possible to have both row & column strides and use a special value to indicate what the leading dimension is, but it seems to me that having dedicated intrinsics would be more explicit.

This is just Fortran array slices. You don't need a special value, the two strides are sufficient. M[i][j] is at p[i * rowStride + j * columnStride]. To not have overlapping storage, you need either rowStride * rowCount <= columnStride or vice-versa. Row-major means rowStride >= columnCount && columnStride == 1; column-major means rowStride == 1 && columnStride >= rowCount. You get better locality from doing the smaller stride in the inner loop (which may not actually be a loop, of course), but it's not wrong to do either way.

Anyway, it's up to you, but I think the two-stride representation is more flexible and avoids ultimately needing three separate intrinsics with optimizations that turn the general one into the more specific ones. And it may have benefits for frontends like Flang that have to support these strided multi-dimensional slices.
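The two-stride addressing scheme described above can be sketched in a few lines (plain Python, illustrative names only):

```python
# M[i][j] lives at p[i * row_stride + j * col_stride].
def element_index(i, j, row_stride, col_stride):
    return i * row_stride + j * col_stride

rows, cols = 2, 3
# Row-major: col_stride == 1, row_stride >= number of columns.
row_major = [element_index(i, j, cols, 1) for i in range(rows) for j in range(cols)]
# Column-major: row_stride == 1, col_stride >= number of rows.
col_major = [element_index(i, j, 1, rows) for i in range(rows) for j in range(cols)]
print(row_major)  # [0, 1, 2, 3, 4, 5]
print(col_major)  # [0, 2, 4, 1, 3, 5]
```

Setting one stride to 1 recovers the two dedicated layouts as special cases of the general form.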

Are you planning to actually support alignment in a follow-up patch? I don't see anything here that propagates alignment to the lowering routines.

Yes, this patch just adjusts the intrinsics definitions/langref. Respecting both IsVolatile & the alignment attribute will be done in follow-up patches.

Okay.

There's already one for IsVolatile D81498. I think setting the alignment correctly for the split stores is not completely trivial, because the original alignment will hold for the first split access, but may not hold for the subsequent accesses and some extra work is needed to figure out which alignments to use for subsequent stores, depending on the stride.

The conservative alignment for addresses of the form &p[i] is the min of the alignment of p with the size of the element type. If the index is constant you can do better, of course.
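That alignment rule can be sketched as follows (the helper names are hypothetical; element sizes are assumed to be powers of two, which the min rule relies on):

```python
import math

# For &p[i] with i unknown, the provable alignment is
# min(alignment of p, element size).
def conservative_align(base_align, elt_size):
    return min(base_align, elt_size)

# For a constant index, the gcd of the base alignment and the byte
# offset gives a tighter bound.
def align_at_index(base_align, elt_size, i):
    return base_align if i == 0 else math.gcd(base_align, i * elt_size)

# Base pointer aligned to 16, 4-byte elements:
print(conservative_align(16, 4))  # 4: holds for every index
print(align_at_index(16, 4, 2))   # 8: &p[2] is 8 bytes past an aligned base
print(align_at_index(16, 4, 4))   # 16
```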

jdoerfert added inline comments.Jun 10 2020, 11:47 AM
llvm/include/llvm/IR/Intrinsics.td
1465

[Drive by][unrelated] I think we should add nocapture to the ptr argument and nosync to all of them (until we have the white/blacklist for intrinsics with sensible defaults).

fhahn added a comment.Jun 10 2020, 1:21 PM

I like the name change, although I wonder if you could just have a single intrinsic that takes both a row stride and a column stride and recognizes the common patterns. Presumably even with column-major ordering you already want to optimize the case where the stride is a constant equal to the row count, so this would just be a generalization of that.

I am not sure about having a single intrinsic. The column.major part in the name signifies that the resulting matrix is in column-major layout (which is then used internally during the lowering). I suppose it would be possible to have both row & column strides and use a special value to indicate what the leading dimension is, but it seems to me that having dedicated intrinsics would be more explicit.

This is just Fortran array slices. You don't need a special value, the two strides are sufficient. M[i][j] is at p[i * rowStride + j * columnStride]. To not have overlapping storage, you need either rowStride * rowCount <= columnStride or vice-versa. Row-major means rowStride >= columnCount && columnStride == 1; column-major means rowStride == 1 && columnStride >= rowCount. You get better locality from doing the smaller stride in the inner loop (which may not actually be a loop, of course), but it's not wrong to do either way.

Anyway, it's up to you, but I think the two-stride representation is more flexible and avoids ultimately needing three separate intrinsics with optimizations that turn the general one into the more specific ones. And it may have benefits for frontends like Flang that have to support these strided multi-dimensional slices.

Oh right, the 'special value' to indicate row/column major would be setting either stride to 1. As long as exactly one of those is 1, the layout of the result/operand should be clear. Personally I find including the layout in the name a bit easier to follow, as it is more explicit. But it might be preferable to have a single variant that handles row/column major depending on the strides (as long as we enforce that exactly one stride has to be 1), once we add those variants.

This patch already clumps together a bunch of changes and I think it would be good to have the discussion once row major support for those is added. It should be easy to auto-upgrade to the more general intrinsics, if desired.

Are you planning to actually support alignment in a follow-up patch? I don't see anything here that propagates alignment to the lowering routines.

Yes, this patch just adjusts the intrinsics definitions/langref. Respecting both IsVolatile & the alignment attribute will be done in follow-up patches.

Okay.

I've put up D81498. It just adds volatile to the generated loads/stores, if IsVolatile is true. Do you think volatile should/has to prevent splitting the memory operation into multiple loads/stores?

There's already one for IsVolatile D81498. I think setting the alignment correctly for the split stores is not completely trivial, because the original alignment will hold for the first split access, but may not hold for the subsequent accesses and some extra work is needed to figure out which alignments to use for subsequent stores, depending on the stride.

The conservative alignment for addresses of the form &p[i] is the min of the alignment of p with the size of the element type. If the index is constant you can do better, of course.

Yes, I'll put something conservative together soon!

I like the name change, although I wonder if you could just have a single intrinsic that takes both a row stride and a column stride and recognizes the common patterns. Presumably even with column-major ordering you already want to optimize the case where the stride is a constant equal to the row count, so this would just be a generalization of that.

I am not sure about having a single intrinsic. The column.major part in the name signifies that the resulting matrix is in column-major layout (which is then used internally during the lowering). I suppose it would be possible to have both row & column strides and use a special value to indicate what the leading dimension is, but it seems to me that having dedicated intrinsics would be more explicit.

This is just Fortran array slices. You don't need a special value, the two strides are sufficient. M[i][j] is at p[i * rowStride + j * columnStride]. To not have overlapping storage, you need either rowStride * rowCount <= columnStride or vice-versa. Row-major means rowStride >= columnCount && columnStride == 1; column-major means rowStride == 1 && columnStride >= rowCount. You get better locality from doing the smaller stride in the inner loop (which may not actually be a loop, of course), but it's not wrong to do either way.

Anyway, it's up to you, but I think the two-stride representation is more flexible and avoids ultimately needing three separate intrinsics with optimizations that turn the general one into the more specific ones. And it may have benefits for frontends like Flang that have to support these strided multi-dimensional slices.

Oh right, the 'special value' to indicate row/column major would be setting either stride to 1. As long as exactly one of those is 1, the layout of the result/operand should be clear. Personally I find including the layout in the name a bit easier to follow, as it is more explicit. But it might be preferable to have a single variant that handles row/column major depending on the strides (as long as we enforce that exactly one stride has to be 1), once we add those variants.

Why have the restriction that exactly one stride has to be 1? If you can optimize that as a constant, great, do it, but otherwise just do the separate loads/stores, and impose an UB restriction that the strides have to make them non-overlapping.

This patch already clumps together a bunch of changes and I think it would be good to have the discussion once row major support for those is added. It should be easy to auto-upgrade to the more general intrinsics, if desired.

Doing fewer signature changes is probably best, but I won't insist.

Are you planning to actually support alignment in a follow-up patch? I don't see anything here that propagates alignment to the lowering routines.

Yes, this patch just adjusts the intrinsics definitions/langref. Respecting both IsVolatile & the alignment attribute will be done in follow-up patches.

Okay.

I've put up D81498. It just adds volatile to the generated loads/stores, if IsVolatile is true. Do you think volatile should/has to prevent splitting the memory operation into multiple loads/stores?

No, I think the semantics here are more like a volatile memcpy: we're demanding that the access be done, but not guaranteeing anything about access widths.

fhahn added a comment.Jun 10 2020, 2:04 PM

[snip]

Oh right, the 'special value' to indicate row/column major would be setting either stride to 1. As long as exactly one of those is 1, the layout of the result/operand should be clear. Personally I find including the layout in the name a bit easier to follow, as it is more explicit. But it might be preferable to have a single variant that handles row/column major depending on the strides (as long as we enforce that exactly one stride has to be 1), once we add those variants.

Why have the restriction that exactly one stride has to be 1? If you can optimize that as a constant, great, do it, but otherwise just do the separate loads/stores, and impose an UB restriction that the strides have to make them non-overlapping.

Besides making assumptions about the layout of the accessed memory, the intrinsic also specifies the layout of the loaded/stored values (= layout in the flattened vector). If either stride is constant 1, we could use that to determine the layout of the loaded/stored value. I may be missing something, but if both are != 1 or arbitrary values, it is not clear what we should pick for the in-vector layout.

I've put up D81498. It just adds volatile to the generated loads/stores, if IsVolatile is true. Do you think volatile should/has to prevent splitting the memory operation into multiple loads/stores?

No, I think the semantics here are more like a volatile memcpy: we're demanding that the access be done, but not guaranteeing anything about access widths.

Sounds good, that's what the patch assumes :)

fhahn marked an inline comment as done.Jun 10 2020, 2:11 PM
fhahn added inline comments.
llvm/include/llvm/IR/Intrinsics.td
1465

Thanks for pointing that out. There are just too many attributes to keep track of. I wish we had some kind of attribute 'group' to say: just reads/writes through the pointer, no capture, and other stuff.

jdoerfert added inline comments.
llvm/include/llvm/IR/Intrinsics.td
1465

We (@sstefan1) proposed a whitelist and blacklist approach for intrinsics before. Hasn't gone anywhere yet. For the OpenMP runtime functions we actually have such attribute groups. Either way is better than what we do so far.

[snip]

Oh right, the 'special value' to indicate row/column major would be setting either stride to 1. As long as exactly one of those is 1, the layout of the result/operand should be clear. Personally I find including the layout in the name a bit easier to follow, as it is more explicit. But it might be preferable to have a single variant that handles row/column major depending on the strides (as long as we enforce that exactly one stride has to be 1), once we add those variants.

Why have the restriction that exactly one stride has to be 1? If you can optimize that as a constant, great, do it, but otherwise just do the separate loads/stores, and impose an UB restriction that the strides have to make them non-overlapping.

Besides making assumptions about the layout of the accessed memory, the intrinsic also specifies the layout of the loaded/stored values (= layout in the flattened vector). If either stride is constant 1, we could use that to determine the layout of the loaded/stored value. I may be missing something, but if both are != 1 or arbitrary values, it is not clear what we should pick for the in-vector layout.

Wait, what? I assumed you used a canonical layout (presumably column-major) in the flattened vector. Are you planning to make all your intrinsics support either column-major or row-major layout? That seems like a lot of complexity in the backend that you're mostly not actually going to be using because the frontend will use a canonical layout. Are you anticipating that it's going to be important to e.g. peephole a row-major load that's fed into a row-major store so that you don't unnecessarily shuffle the vector?

Also, if you *are* trying to proactively support multiple flattened-vector layouts, I feel like stopping at row-major vs. column-major is probably unnecessarily limiting and you should really allow a whole enumeration's worth of possibilities in case you want to e.g. incorporate internal padding into the vector.

fhahn added a comment.Jun 10 2020, 5:08 PM

[snip]

Oh right, the 'special value' to indicate row/column major would be setting either stride to 1. As long as exactly one of those is 1, the layout of the result/operand should be clear. Personally I find including the layout in the name a bit easier to follow, as it is more explicit. But it might be preferable to have a single variant that handles row/column major depending on the strides (as long as we enforce that exactly one stride has to be 1), once we add those variants.

Why have the restriction that exactly one stride has to be 1? If you can optimize that as a constant, great, do it, but otherwise just do the separate loads/stores, and impose an UB restriction that the strides have to make them non-overlapping.

Besides making assumptions about the layout of the accessed memory, the intrinsic also specifies the layout of the loaded/stored values (= layout in the flattened vector). If either stride is constant 1, we could use that to determine the layout of the loaded/stored value. I may be missing something, but if both are != 1 or arbitrary values, it is not clear what we should pick for the in-vector layout.

Wait, what? I assumed you used a canonical layout (presumably column-major) in the flattened vector.

Yes, currently we default to column major for the lowering.

Are you planning to make all your intrinsics support either column-major or row-major layout? That seems like a lot of complexity in the backend that you're mostly not actually going to be using because the frontend will use a canonical layout.

My main focus currently is to evolve the intrinsics & lowering so they work well with the matrix extension in Clang. We don't plan to propose/work towards mixing layouts or supporting switching the layout in the C/C++ extension.

On the other hand, I am also trying to ensure the intrinsics are useful for use-cases beyond Clang. For example, the intrinsics are also used by MLIR to lower matrix operations, and for that use case, supporting row-major with the intrinsics and also mixing row/column major layouts makes things much easier. At the moment, it is already possible to switch the canonical layout to row-major for the lowering on the LLVM side. Supporting both layouts for most parts was relatively straightforward (excluding the load/store intrinsics) and fits quite naturally. I am also looking into providing a more powerful way to describe additional properties of the inputs using operand bundles.

Are you anticipating that it's going to be important to e.g. peephole a row-major load that's fed into a row-major store so that you don't unnecessarily shuffle the vector?

The main benefits I expect from making the layouts more flexible is 1) making IRGen easier for frontends with different underlying layouts and 2) potentially being able to optimise conversions/transposes for larger expressions. For small expressions I do not expect too much benefit in terms of optimisations, as LLVM is relatively good at eliminating the kinds of shuffles we emit, at least for small matrix sizes.

Also, if you *are* trying to proactively support multiple flattened-vector layouts, I feel like stopping at row-major vs. column-major is probably unnecessarily limiting and you should really allow a whole enumeration's worth of possibilities in case you want to e.g. incorporate internal padding into the vector.

Unfortunately I am a bit limited in terms of bandwidth when it comes to evolving the intrinsics outside of the clang extension case. I try to focus on evolving them to make sure they work well for the existing users, but also try to make sure the whole system remains flexible enough to support additional cases as you mentioned. I am also happy to quite aggressively adjust the intrinsics when we encounter issues/missing pieces, as in the patch for now. But I suppose we have to be a bit more careful about backwards-compatibility once released frontends out there support the intrinsics.

I hope that makes sense and please let me know if you have any concerns.

My immediate concern is just that I think the memory layout of the matrix should be orthogonal to the component layout of the vector. If you want the matrix intrinsics to support a variety of vector layouts, you should pass down the expected layout as a constant argument to the intrinsic rather than picking it up purely from whether the matrix is being loaded from a row-major or column-major layout in memory. I would guess that making that constant an i32 is probably sufficiently future-proof; if you ever need more structure than that, you'd probably be better off biting the bullet and adding an llvm::MatrixType.

My thinking here is that, for example, you're adding a builtin to Clang to do a column-major load. You want that to produce a column-major vector layout because that's your canonical layout within Clang. But you should also eventually add a builtin to do a row-major load, because there are quite a few reasons people might have a matrix in memory in row-major order: for example, if they've declared a nested C array, they'll naturally get row-major order. That builtin *also* needs to produce a column-major vector layout. So tying the two things together is bad intrinsic design.

sstefan1 added inline comments.Jun 11 2020, 12:24 AM
llvm/include/llvm/IR/Intrinsics.td
1465

I will try to get back to that soon.

fhahn added a comment.Jun 11 2020, 7:16 AM

My immediate concern is just that I think the memory layout of the matrix should be orthogonal to the component layout of the vector. If you want the matrix intrinsics to support a variety of vector layouts, you should pass down the expected layout as a constant argument to the intrinsic rather than picking it up purely from whether the matrix is being loaded from a row-major or column-major layout in memory. I would guess that making that constant an i32 is probably sufficiently future-proof; if you ever need more structure than that, you'd probably be better off biting the bullet and adding an llvm::MatrixType.

Hm I understand the appeal of having a single very powerful intrinsic. Selecting the different variants by a single parameter is convenient in terms of maintaining backwards compatibility, but personally I find it more readable to include some of the variant information in the name. Of course there's a limit to the number of variants for which that approach is feasible. I think it is important to have this discussion, but I am not sure if it is in scope for this patch (which only adds a few smallish improvements to the naming/arguments of the intrinsics) and it might be better to discuss that once work on row-major versions of the intrinsics starts?

My thinking here is that, for example, you're adding a builtin to Clang to do a column-major load. You want that to produce a column-major vector layout because that's your canonical layout within Clang. But you should also eventually add a builtin to do a row-major load, because there are quite a few reasons people might have a matrix in memory in row-major order: for example, if they've declared a nested C array, they'll naturally get row-major order. That builtin *also* needs to produce a column-major vector layout. So tying the two things together is bad intrinsic design.

Having intrinsics that can apply such conversions directly certainly is a convenient option here. But alternatively it should also be possible to have a small set of load/store intrinsics (e.g. load row-major from row-major, load column-major from column-major) and get the other variants by composing conversion functions (e.g. transpose(load_row_major(...)) to load a row-major matrix from memory and convert it to column-major). Granted, I think a few other people also mentioned in earlier discussions that they would prefer a few more powerful intrinsics, rather than having to compose intrinsics.

In the end I am happy to go either way, as both approaches should be equivalent in terms of optimization power and it mostly boils down to slightly different matching in the lowering. But as I mentioned earlier, I think it would be good to make those changes once someone has time to add row-major support for the load/store intrinsics.
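The composition mentioned above (a row-major load followed by a transpose yielding a column-major vector) can be sketched as follows. The helper names are hypothetical stand-ins for the intrinsics, not real LLVM APIs:

```python
# Stand-in for a row-major load: flattened row-major vector,
# element (i, j) at position i * cols + j.
def load_row_major(memory, rows, cols):
    return list(memory[:rows * cols])

# Reinterpret a row-major vector as the column-major vector of the
# same matrix (the shuffle the lowering would emit).
def transpose(vec, rows, cols):
    return [vec[i * cols + j] for j in range(cols) for i in range(rows)]

memory = [1, 2, 3, 4, 5, 6]  # [[1, 2, 3], [4, 5, 6]] stored row-major
print(transpose(load_row_major(memory, 2, 3), 2, 3))  # [1, 4, 2, 5, 3, 6]
```

The combined operation is equivalent to a dedicated "load row-major as column-major" intrinsic; the question in the thread is whether the backend should match the composed form or provide the fused form directly.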

My immediate concern is just that I think the memory layout of the matrix should be orthogonal to the component layout of the vector. If you want the matrix intrinsics to support a variety of vector layouts, you should pass down the expected layout as a constant argument to the intrinsic rather than picking it up purely from whether the matrix is being loaded from a row-major or column-major layout in memory. I would guess that making that constant an i32 is probably sufficiently future-proof; if you ever need more structure than that, you'd probably be better off biting the bullet and adding an llvm::MatrixType.

Hm I understand the appeal of having a single very powerful intrinsic. Selecting the different variants by a single parameter is convenient in terms of maintaining backwards compatibility, but personally I find it more readable to include some of the variant information in the name. Of course there's a limit to the number of variants for which that approach is feasible. I think it is important to have this discussion, but I am not sure if it is in scope for this patch (which only adds a few smallish improvements to the naming/arguments of the intrinsics) and it might be better to discuss that once work on row-major versions of the intrinsics starts?

If you're willing to rework the intrinsics later, then I have no objections to committing this now, yeah.

My thinking here is that, for example, you're adding a builtin to Clang to do a column-major load. You want that to produce a column-major vector layout because that's your canonical layout within Clang. But you should also eventually add a builtin to do a row-major load, because there are quite a few reasons people might have a matrix in memory in row-major order: for example, if they've declared a nested C array, they'll naturally get row-major order. That builtin *also* needs to produce a column-major vector layout. So tying the two things together is bad intrinsic design.

Having intrinsics that can apply such conversions directly certainly is a convenient option here. But alternatively it should also be possible to have a small set of load/store intrinsics (e.g. load row-major from row-major, load column-major from column-major) and get the other variants by composing conversion functions (e.g. transpose(load_row_major(...)) to load a row-major matrix from memory and convert it to column-major).

This is definitely a feasible alternative intrinsic design, where you have a small number of basic operations and the backend combines them to emit the operation more efficiently. My intuition is that how well it works in practice depends on the performance gap between emitting the combined operation and emitting them separately, because the backend can be quite bad at actually emitting the combined operation reliably. I don't have a sense of how that applies here; the naive approach for loading row-major as column-major is essentially to load and then shuffle, i.e. to essentially emit them separately anyway, but maybe there's a reasonable alternative that I don't know because I'm less familiar with vector instruction sets.

fhahn added a comment.Jun 11 2020, 4:59 PM

My immediate concern is just that I think the memory layout of the matrix should be orthogonal to the component layout of the vector. If you want the matrix intrinsics to support a variety of vector layouts, you should pass down the expected layout as a constant argument to the intrinsic rather than picking it up purely from whether the matrix is being loaded from a row-major or column-major layout in memory. I would guess that making that constant an i32 is probably sufficiently future-proof; if you ever need more structure than that, you'd probably be better off biting the bullet and adding an llvm::MatrixType.

Hm I understand the appeal of having a single very powerful intrinsic. Selecting the different variants by a single parameter is convenient in terms of maintaining backwards compatibility, but personally I find it more readable to include some of the variant information in the name. Of course there's a limit to the number of variants for which that approach is feasible. I think it is important to have this discussion, but I am not sure if it is in scope for this patch (which only adds a few smallish improvements to the naming/arguments of the intrinsics) and it might be better to discuss that once work on row-major versions of the intrinsics starts?

If you're willing to rework the intrinsics later, then I have no objections to committing this now, yeah.

Yeah that would be my preference. I think we made quite good progress with that approach so far.

My thinking here is that, for example, you're adding a builtin to Clang to do a column-major load. You want that to produce a column-major vector layout because that's your canonical layout within Clang. But you should also eventually add a builtin to do a row-major load, because there are quite a few reasons people might have a matrix in memory in row-major order: for example, if they've declared a nested C array, they'll naturally get row-major order. That builtin *also* needs to produce a column-major vector layout. So tying the two things together is bad intrinsic design.

Having intrinsics that can apply such conversions directly certainly is a convenient option here. But alternatively it should also be possible to have a small set of load/store intrinsics (e.g. load row-major from row-major, load column-major from column-major) and get the other variants by composing conversion functions (e.g. transpose(load.row.major(...)) to load a row-major matrix from memory and convert it to column-major).

This is definitely a feasible alternative intrinsic design, where you have a small number of basic operations and the backend combines them to emit the operation more efficiently. My intuition is that how well it works in practice depends on the performance gap between emitting the combined operation and emitting them separately, because the backend can be quite bad at actually emitting the combined operation reliably. I don't have a sense of how that applies here; the naive approach for loading row-major as column-major is essentially to load and then shuffle, i.e. to essentially emit them separately anyway, but maybe there's a reasonable alternative that I don't know because I'm less familiar with vector instruction sets.

I suppose it would depend on the details, but this approach might be a bit more fragile, in that other passes could interfere more easily. Thank you very much for elaborating on the issue, the discussion will be very helpful to reference once it is time to propose support for additional layout options for load/store.
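The composition discussed above (load in the memory's own layout, then transpose) can be sketched with a small Python model. This is illustrative only, not how the lowering is implemented; the function names and the flat-list representation of vectors are invented for the sketch. It shows that a row-major strided load followed by a transpose yields the same flat column-major vector as a direct column-major strided load of the same matrix:

```python
def column_major_load(buf, rows, cols, stride):
    # Model of a column-major strided load: column j starts at offset j * stride.
    return [buf[j * stride + i] for j in range(cols) for i in range(rows)]

def row_major_load(buf, rows, cols, stride):
    # Row-major counterpart: row i starts at offset i * stride.
    return [buf[i * stride + j] for i in range(rows) for j in range(cols)]

def transpose(flat, rows, cols):
    # Convert the row-major flat vector of a rows x cols matrix to column-major.
    return [flat[i * cols + j] for j in range(cols) for i in range(rows)]

# A 2 x 3 matrix stored row-major with stride 4 (one element of row padding):
#   [1 2 3]
#   [5 6 7]
row_major_buf = [1, 2, 3, 0, 5, 6, 7, 0]

# Composing "load row-major" with "transpose" ...
composed = transpose(row_major_load(row_major_buf, 2, 3, stride=4), 2, 3)

# ... matches a direct column-major load of the same matrix, stored
# column-major with stride 2.
column_major_buf = [1, 5, 2, 6, 3, 7]
direct = column_major_load(column_major_buf, 2, 3, stride=2)

assert composed == direct == [1, 5, 2, 6, 3, 7]
```

The stride parameter is what lets both loads skip padding between columns (or rows), which is also why the real intrinsics take it as an explicit argument rather than assuming a densely packed matrix.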

fhahn added a comment.Jun 15 2020, 9:07 AM

@nicolasvasilache I also need to update mlir/include/mlir/Dialect/LLVMIR/LLVMOps.td. Initially I'd just like to pass through no alignment (= use the one from the datalayout) and false, but I am not sure how to construct a false constant. It looks like there is getLLVMConstant, but I am not sure how to get an Int1 LLVM type. Any ideas?

@fhahn see https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Dialect/LLVMIR/LLVMOps.td#L203

you can extract the LLVM dialect from an LLVM type.

Then you should be able to extract the datalayout with:

const llvm::DataLayout &dataLayout = dialect.getLLVMModule().getDataLayout();
unsigned align = dataLayout.getPrefTypeAlignment(
    elementTy.cast<LLVM::LLVMType>().getUnderlyingType());

see e.g. https://github.com/llvm/llvm-project/blob/master/mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVM.cpp#L135

you would then need to update this location: https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Dialect/LLVMIR/LLVMOps.td#L828

would that work for you ?

Is there a test for passing alignment?

llvm/include/llvm/IR/Intrinsics.td
1439–1467

If this is only reformatting, I would leave that to a separate patch.

1465

Has this been addressed?

llvm/include/llvm/IR/MatrixBuilder.h
62

unsigned -> uint64_t?

jdoerfert added inline comments.Jun 15 2020, 2:08 PM
llvm/include/llvm/IR/Intrinsics.td
1465

This is unrelated and should not block the patch. Eventually we want easier attribute handling for intrinsics but we are still working on it.

fhahn updated this revision to Diff 270865.Jun 15 2020, 2:18 PM

Address comments: Change row/column types to uint64_t in MatrixBuilder for load/store, submit formatting/nosync/nocapture changes outside of patch, rebased.

fhahn marked 4 inline comments as done.Jun 15 2020, 2:20 PM

Is there a test for passing alignment?

not yet, will add as a follow up along with the implementation to respect them (should be ready soon).

@fhahn see https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Dialect/LLVMIR/LLVMOps.td#L203

you can extract the LLVM dialect from an LLVM type.

Then you should be able to extract the datalayout with:

const llvm::DataLayout &dataLayout = dialect.getLLVMModule().getDataLayout();
unsigned align = dataLayout.getPrefTypeAlignment(
    elementTy.cast<LLVM::LLVMType>().getUnderlyingType());

see e.g. https://github.com/llvm/llvm-project/blob/master/mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVM.cpp#L135

you would then need to update this location: https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Dialect/LLVMIR/LLVMOps.td#L828

would that work for you ?

@nicolasvasilache Yeah, I initially was not sure how to create a false constant there, but I'll just go ahead and update the MLIR definition to conform to the update in one go, unless you think that will cause problems.

llvm/include/llvm/IR/Intrinsics.td
1439–1467
1465

I added the nosync/nocapture attributes in 1d33c09f220ea9fe2846813bafc46dc5d9394577

fhahn updated this revision to Diff 271004.Jun 16 2020, 2:47 AM
fhahn marked an inline comment as done.

Update MLIR llvm intrinsic wrappers.

I think I found the right place to update; there was no need to get a dialect. I went with using getABITypeAlign for the pointer element type, which ensures we generate the same alignment as before. I am not sure if/how MLIR supports align parameter attributes, but ideally the MLIR LLVM dialect would support them in the future and we could just pass the alignment through.

Herald added a project: Restricted Project.
anemet accepted this revision.Jun 16 2020, 10:55 AM

LGTM! @nicolasvasilache can you please OK the MLIR parts if you're happy with them?

This revision is now accepted and ready to land.Jun 16 2020, 10:55 AM
fhahn added a comment.Jun 17 2020, 4:52 AM

LGTM! @nicolasvasilache can you please OK the MLIR parts if you're happy with them?

Thanks everyone! Given that this has been extensively discussed I plan to submit this in a day or so to give people time to raise additional concerns. If there are additional suggestions for the MLIR side afterwards, I think we can address them post-commit.

nicolasvasilache accepted this revision.Jun 17 2020, 8:39 AM

SG, thank you @fhahn !

fhahn updated this revision to Diff 271603.Jun 18 2020, 1:27 AM

Remove unnecessary changes of row/column args from i32 -> i64. Keep them as i32, for consistency with the other intrinsics. They are required to be constant (and are used as constants directly in the lowering), so this has no impact on the lowering code. If we ever need to allow rows/columns not fitting in i32, we should revisit then.
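For reference, with all of the above applied (rows/columns kept as i32, the stride widened to i64, and the new volatile flag added), a call to the load intrinsic should look roughly like the sketch below, based on the summary of this patch. The overload-mangled suffix of the intrinsic name is omitted here and may differ from the real mangling:

```llvm
; Load a 3 x 2 matrix of doubles stored column-major with a stride of
; 5 elements between columns; non-volatile, from an 8-byte aligned pointer.
%m = call <6 x double> @llvm.matrix.column.major.load(
         double* align 8 %ptr, i64 5, i1 false, i32 3, i32 2)
```

The i1 argument is the new volatile flag, and the alignment is carried by the align parameter attribute rather than by an extra intrinsic argument.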

This revision was automatically updated to reflect the committed changes.