
[Matrix] Update load/store intrinsics.
ClosedPublic

Authored by fhahn on Jun 9 2020, 8:20 AM.

Details

Summary

This patch adjusts the load/store matrix intrinsics, formerly known as
llvm.matrix.columnwise.load/store, to improve the naming and to allow
passing extra information (volatile).

The patch performs the following changes:

  • Renames columnwise.load/store to column.major.load/store. This is more expressive and more in line with the naming in Clang.
  • Changes the shape and stride arguments from i32 to i64. All 3 arguments could in theory be larger than an i32 and there is no real reason to restrict them. For the immargs, there should be no change in practice. This makes things more uniform with the way they are handled in Clang.
  • Adds a new boolean argument to indicate whether the load/store is volatile. The lowering respects it when emitting vector load/store instructions.
  • Updates MatrixBuilder to require both Alignment and IsVolatile arguments, which are passed through to the generated intrinsic. The alignment is set using the align attribute.

The changes are grouped together in a single patch, to have a single
commit that breaks compatibility. We should probably be fine with
updating the intrinsics, as we did not yet officially support them in
the last stable release. If there are any concerns, we can add
auto-upgrade rules for the columnwise intrinsics though.
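As an illustration, the access pattern the renamed llvm.matrix.column.major.load intrinsic describes can be modeled in a few lines of Python. This is an illustrative sketch only; the function name and the list-based "memory" are stand-ins, not LLVM API.

```python
# Model of the column-major strided load: read a rows x cols matrix whose
# columns start `stride` elements apart, and return the flattened
# column-major vector the intrinsic would produce.
def column_major_load(memory, base, stride, rows, cols):
    result = []
    for col in range(cols):
        start = base + col * stride
        result.extend(memory[start:start + rows])
    return result

# A 2x2 matrix [[1, 3], [2, 4]] stored with stride 3 (one padding element
# between columns): the columns are [1, 2] and [3, 4].
memory = [1, 2, 0, 3, 4, 0]
print(column_major_load(memory, 0, 3, 2, 2))  # [1, 2, 3, 4]
```

The stride argument is what lets the intrinsic load a submatrix out of a larger matrix in memory.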

Diff Detail

Event Timeline

fhahn created this revision.Jun 9 2020, 8:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 9 2020, 8:21 AM

I admittedly don't know Clang's naming scheme, so feel free to ignore this, but I dislike the change to *.column.major.* because it feels like both are changeable parameters (i.e. column/row and major/minor are two different axes). Anyone who knows what they're looking for would understand that this doesn't make sense, but the potential for confusion remains. I'm not sure that dropping the major portion makes any sense either, because it might seem like you're promising loads and stores no matter the storage order (e.g. llvm.matrix.column.load might seem to promise a column load even if the matrix is row-major, which isn't the intent). Removing the . between them is an option, but then it's not that much different from the original columnwise, unless "in line with Clang's naming" means using "column major" instead of "column-wise". Maybe I'm just putting too much meaning into the . too, but I'd rather mention it and be told it's fine than not say anything :)

I like the name change, although I wonder if you could just have a single intrinsic that takes both a row stride and a column stride and recognizes the common patterns. Presumably even with column-major ordering you already want to optimize the case where the stride is a constant equal to the row count, so this would just be a generalization of that.

Are you planning to actually support alignment in a follow-up patch? I don't see anything here that propagates alignment to the lowering routines.

llvm/docs/LangRef.rst
15489

Like llvm.memcpy and so on, you should document that the align attribute can be added to the pointer parameter to specify the required alignment.

llvm/include/llvm/IR/MatrixBuilder.h
60–62

Alignment should be an llvm::Align.

fhahn updated this revision to Diff 269855.Jun 10 2020, 8:00 AM

I like the name change, although I wonder if you could just have a single intrinsic that takes both a row stride and a column stride and recognizes the common patterns. Presumably even with column-major ordering you already want to optimize the case where the stride is a constant equal to the row count, so this would just be a generalization of that.

I am not sure about having a single intrinsic. The column.major part in the name signifies that the resulting matrix is in column-major layout (which is then used internally during the lowering). I suppose it would be possible to have both row & column strides and use a special value to indicate what the leading dimension is, but it seems to me that having dedicated intrinsics would be more explicit.

Are you planning to actually support alignment in a follow-up patch? I don't see anything here that propagates alignment to the lowering routines.

Yes, this patch just adjusts the intrinsics definitions/langref. Respecting both IsVolatile & the alignment attribute will be done in follow-up patches. There's already one for IsVolatile D81498. I think setting the alignment correctly for the split stores is not completely trivial, because the original alignment will hold for the first split access, but may not hold for the subsequent accesses and some extra work is needed to figure out which alignments to use for subsequent stores, depending on the stride.

I admittedly don't know Clang's naming scheme, so feel free to ignore this, but I dislike the change to *.column.major.* because it feels like both are changeable parameters (i.e. column/row and major/minor are two different axes). Anyone who knows what they're looking for would understand that this doesn't make sense, but the potential for confusion remains. I'm not sure that dropping the major portion makes any sense either, because it might seem like you're promising loads and stores no matter the storage order

I am not sure I quite follow, I am afraid. The column major in the name is meant to refer to the data layout assumed for the operation. The intrinsics treat both the accessed memory and the loaded/stored value as column-major. It could also be a parameter, but the layout is intentionally encoded in the name. We have to use .column.major. instead of something like .column_major., because of the way intrinsic names are handled: _ is automatically replaced by '.'. In the future, we are planning on adding llvm.matrix.row.major.load/store.

So far, I am not planning on adding variants that allow the memory-layout to not match the operand/result layout (e.g. treat memory as row-major but the loaded value is in column-major).

(e.g. llvm.matrix.column.load might seem to promise a column load even if the matrix is row-major, which isn't the intent). Removing the . between them is an option, but then it's not that much different from the original columnwise, unless "in line with Clang's naming" means using "column major" instead of "column-wise". Maybe I'm just putting too much meaning into the . too, but I'd rather mention it and be told it's fine than not say anything :)

Thanks for sharing your thoughts! As mentioned above, we have to use either . as a separator or no separator at all. Please let me know if that makes sense.

fhahn marked 2 inline comments as done.Jun 10 2020, 8:00 AM
fhahn added inline comments.
llvm/docs/LangRef.rst
15489

Updated, thanks!

I like the name change, although I wonder if you could just have a single intrinsic that takes both a row stride and a column stride and recognizes the common patterns. Presumably even with column-major ordering you already want to optimize the case where the stride is a constant equal to the row count, so this would just be a generalization of that.

I am not sure about having a single intrinsic. The column.major part in the name signifies that the resulting matrix is in column-major layout (which is then used internally during the lowering). I suppose it would be possible to have both row & column strides and use a special value to indicate what the leading dimension is, but it seems to me that having dedicated intrinsics would be more explicit.

This is just Fortran array slices. You don't need a special value, the two strides are sufficient. M[i][j] is at p[i * rowStride + j * columnStride]. To not have overlapping storage, you need either rowStride * rowCount <= columnStride or vice-versa. Row-major means rowStride >= columnCount && columnStride == 1; column-major means rowStride == 1 && columnStride >= rowCount. You get better locality from doing the smaller stride in the inner loop (which may not actually be a loop, of course), but it's not wrong to do either way.

Anyway, it's up to you, but I think the two-stride representation is more flexible and avoids ultimately needing three separate intrinsics with optimizations that turn the general one into the more specific ones. And it may have benefits for frontends like Flang that have to support these strided multi-dimensional slices.
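The two-stride addressing scheme described above can be sketched in a few lines (plain Python, illustrative names only):

```python
# M[i][j] lives at p[i * row_stride + j * col_stride].
def element_index(i, j, row_stride, col_stride):
    return i * row_stride + j * col_stride

rows, cols = 2, 3
# Row-major: col_stride == 1, row_stride >= number of columns.
row_major = [element_index(i, j, cols, 1) for i in range(rows) for j in range(cols)]
# Column-major: row_stride == 1, col_stride >= number of rows.
col_major = [element_index(i, j, 1, rows) for i in range(rows) for j in range(cols)]
print(row_major)  # [0, 1, 2, 3, 4, 5]
print(col_major)  # [0, 2, 4, 1, 3, 5]
```

Setting one stride to 1 recovers the two dedicated layouts as special cases of the general form.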

Are you planning to actually support alignment in a follow-up patch? I don't see anything here that propagates alignment to the lowering routines.

Yes, this patch just adjusts the intrinsics definitions/langref. Respecting both IsVolatile & the alignment attribute will be done in follow-up patches.

Okay.

There's already one for IsVolatile D81498. I think setting the alignment correctly for the split stores is not completely trivial, because the original alignment will hold for the first split access, but may not hold for the subsequent accesses and some extra work is needed to figure out which alignments to use for subsequent stores, depending on the stride.

The conservative alignment for addresses of the form &p[i] is the min of the alignment of p with the size of the element type. If the index is constant you can do better, of course.
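That alignment rule can be sketched as follows (the helper names are hypothetical; element sizes are assumed to be powers of two, which the min rule relies on):

```python
import math

# For &p[i] with i unknown, the provable alignment is
# min(alignment of p, element size).
def conservative_align(base_align, elt_size):
    return min(base_align, elt_size)

# For a constant index, the gcd of the base alignment and the byte
# offset gives a tighter bound.
def align_at_index(base_align, elt_size, i):
    return base_align if i == 0 else math.gcd(base_align, i * elt_size)

# Base pointer aligned to 16, 4-byte elements:
print(conservative_align(16, 4))  # 4: holds for every index
print(align_at_index(16, 4, 2))   # 8: &p[2] is 8 bytes past an aligned base
print(align_at_index(16, 4, 4))   # 16
```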

jdoerfert added inline comments.Jun 10 2020, 11:47 AM
llvm/include/llvm/IR/Intrinsics.td
1465

[Drive by][unrelated] I think we should add nocapture to the ptr argument and nosync to all of them (until we have the white/blacklist for intrinsics with sensible defaults).

fhahn added a comment.Jun 10 2020, 1:21 PM

I like the name change, although I wonder if you could just have a single intrinsic that takes both a row stride and a column stride and recognizes the common patterns. Presumably even with column-major ordering you already want to optimize the case where the stride is a constant equal to the row count, so this would just be a generalization of that.

I am not sure about having a single intrinsic. The column.major part in the name signifies that the resulting matrix is in column-major layout (which is then used internally during the lowering). I suppose it would be possible to have both row & column strides and use a special value to indicate what the leading dimension is, but it seems to me that having dedicated intrinsics would be more explicit.

This is just Fortran array slices. You don't need a special value, the two strides are sufficient. M[i][j] is at p[i * rowStride + j * columnStride]. To not have overlapping storage, you need either rowStride * rowCount <= columnStride or vice-versa. Row-major means rowStride >= columnCount && columnStride == 1; column-major means rowStride == 1 && columnStride >= rowCount. You get better locality from doing the smaller stride in the inner loop (which may not actually be a loop, of course), but it's not wrong to do either way.

Anyway, it's up to you, but I think the two-stride representation is more flexible and avoids ultimately needing three separate intrinsics with optimizations that turn the general one into the more specific ones. And it may have benefits for frontends like Flang that have to support these strided multi-dimensional slices.

Oh right, the 'special value' to indicate row/column major would be setting either stride to 1. As long as exactly one of those is 1, the layout of the result/operand should be clear. Personally I find including the layout in the name a bit easier to follow, as it is more explicit. But it might be preferable to have a single variant that handles row/column major depending on the strides (as long as we enforce that exactly one stride has to be 1), once we add those variants.

This patch already clumps together a bunch of changes and I think it would be good to have the discussion once row major support for those is added. It should be easy to auto-upgrade to the more general intrinsics, if desired.

Are you planning to actually support alignment in a follow-up patch? I don't see anything here that propagates alignment to the lowering routines.

Yes, this patch just adjusts the intrinsics definitions/langref. Respecting both IsVolatile & the alignment attribute will be done in follow-up patches.

Okay.

I've put up D81498. It just adds volatile to the generated loads/stores, if IsVolatile is true. Do you think volatile should/has to prevent splitting the memory operation into multiple loads/stores?

There's already one for IsVolatile D81498. I think setting the alignment correctly for the split stores is not completely trivial, because the original alignment will hold for the first split access, but may not hold for the subsequent accesses and some extra work is needed to figure out which alignments to use for subsequent stores, depending on the stride.

The conservative alignment for addresses of the form &p[i] is the min of the alignment of p with the size of the element type. If the index is constant you can do better, of course.

Yes, I'll put something conservative together soon!

I like the name change, although I wonder if you could just have a single intrinsic that takes both a row stride and a column stride and recognizes the common patterns. Presumably even with column-major ordering you already want to optimize the case where the stride is a constant equal to the row count, so this would just be a generalization of that.

I am not sure about having a single intrinsic. The column.major part in the name signifies that the resulting matrix is in column-major layout (which is then used internally during the lowering). I suppose it would be possible to have both row & column strides and use a special value to indicate what the leading dimension is, but it seems to me that having dedicated intrinsics would be more explicit.

This is just Fortran array slices. You don't need a special value, the two strides are sufficient. M[i][j] is at p[i * rowStride + j * columnStride]. To not have overlapping storage, you need either rowStride * rowCount <= columnStride or vice-versa. Row-major means rowStride >= columnCount && columnStride == 1; column-major means rowStride == 1 && columnStride >= rowCount. You get better locality from doing the smaller stride in the inner loop (which may not actually be a loop, of course), but it's not wrong to do either way.

Anyway, it's up to you, but I think the two-stride representation is more flexible and avoids ultimately needing three separate intrinsics with optimizations that turn the general one into the more specific ones. And it may have benefits for frontends like Flang that have to support these strided multi-dimensional slices.

Oh right, the 'special value' to indicate row/column major would be setting either stride to 1. As long as exactly one of those is 1, the layout of the result/operand should be clear. Personally I find including the layout in the name a bit easier to follow, as it is more explicit. But it might be preferable to have a single variant that handles row/column major depending on the strides (as long as we enforce that exactly one stride has to be 1), once we add those variants.

Why have the restriction that exactly one stride has to be 1? If you can optimize that as a constant, great, do it, but otherwise just do the separate loads/stores, and impose an UB restriction that the strides have to make them non-overlapping.

This patch already clumps together a bunch of changes and I think it would be good to have the discussion once row major support for those is added. It should be easy to auto-upgrade to the more general intrinsics, if desired.

Doing fewer signature changes is probably best, but I won't insist.

Are you planning to actually support alignment in a follow-up patch? I don't see anything here that propagates alignment to the lowering routines.

Yes, this patch just adjusts the intrinsics definitions/langref. Respecting both IsVolatile & the alignment attribute will be done in follow-up patches.

Okay.

I've put up D81498. It just adds volatile to the generated loads/stores, if IsVolatile is true. Do you think volatile should/has to prevent splitting the memory operation into multiple loads/stores?

No, I think the semantics here are more like a volatile memcpy: we're demanding that the access be done, but not guaranteeing anything about access widths.

fhahn added a comment.Jun 10 2020, 2:04 PM

[snip]

Oh right, the 'special value' to indicate row/column major would be setting either stride to 1. As long as exactly one of those is 1, the layout of the result/operand should be clear. Personally I find including the layout in the name a bit easier to follow, as it is more explicit. But it might be preferable to have a single variant that handles row/column major depending on the strides (as long as we enforce that exactly one stride has to be 1), once we add those variants.

Why have the restriction that exactly one stride has to be 1? If you can optimize that as a constant, great, do it, but otherwise just do the separate loads/stores, and impose an UB restriction that the strides have to make them non-overlapping.

Besides making assumptions about the layout of the accessed memory, the intrinsic also specifies the layout of the loaded/stored values (= layout in the flattened vector). If either stride is constant 1, we could use that to determine the layout of the loaded/stored value. I may be missing something, but if both are != 1 or arbitrary values, it is not clear what we should pick for the in-vector layout.

I've put up D81498. It just adds volatile to the generated loads/stores, if IsVolatile is true. Do you think volatile should/has to prevent splitting the memory operation into multiple loads/stores?

No, I think the semantics here are more like a volatile memcpy: we're demanding that the access be done, but not guaranteeing anything about access widths.

Sounds good, that's what the patch assumes :)

fhahn marked an inline comment as done.Jun 10 2020, 2:11 PM
fhahn added inline comments.
llvm/include/llvm/IR/Intrinsics.td
1465

Thanks for pointing that out. There are just too many attributes to keep track of. I wish we had some kind of attribute 'group' to say: just reads/writes through the pointer, no capture, and other stuff.

jdoerfert added inline comments.
llvm/include/llvm/IR/Intrinsics.td
1465

We (@sstefan1) proposed a whitelist and blacklist approach for intrinsics before. Hasn't gone anywhere yet. For the OpenMP runtime functions we actually have such attribute groups. Either way is better than what we do so far.

[snip]

Oh right, the 'special value' to indicate row/column major would be setting either stride to 1. As long as exactly one of those is 1, the layout of the result/operand should be clear. Personally I find including the layout in the name a bit easier to follow, as it is more explicit. But it might be preferable to have a single variant that handles row/column major depending on the strides (as long as we enforce that exactly one stride has to be 1), once we add those variants.

Why have the restriction that exactly one stride has to be 1? If you can optimize that as a constant, great, do it, but otherwise just do the separate loads/stores, and impose an UB restriction that the strides have to make them non-overlapping.

Besides making assumptions about the layout of the accessed memory, the intrinsic also specifies the layout of the loaded/stored values (= layout in the flattened vector). If either stride is constant 1, we could use that to determine the layout of the loaded/stored value. I may be missing something, but if both are != 1 or arbitrary values, it is not clear what we should pick for the in-vector layout.

Wait, what? I assumed you used a canonical layout (presumably column-major) in the flattened vector. Are you planning to make all your intrinsics support either column-major or row-major layout? That seems like a lot of complexity in the backend that you're mostly not actually going to be using because the frontend will use a canonical layout. Are you anticipating that it's going to be important to e.g. peephole a row-major load that's fed into a row-major store so that you don't unnecessarily shuffle the vector?

Also, if you *are* trying to proactively support multiple flattened-vector layouts, I feel like stopping at row-major vs. column-major is probably unnecessarily limiting and you should really allow a whole enumeration's worth of possibilities in case you want to e.g. incorporate internal padding into the vector.

fhahn added a comment.Jun 10 2020, 5:08 PM

[snip]

Oh right, the 'special value' to indicate row/column major would be setting either stride to 1. As long as exactly one of those is 1, the layout of the result/operand should be clear. Personally I find including the layout in the name a bit easier to follow, as it is more explicit. But it might be preferable to have a single variant that handles row/column major depending on the strides (as long as we enforce that exactly one stride has to be 1), once we add those variants.

Why have the restriction that exactly one stride has to be 1? If you can optimize that as a constant, great, do it, but otherwise just do the separate loads/stores, and impose an UB restriction that the strides have to make them non-overlapping.

Besides making assumptions about the layout of the accessed memory, the intrinsic also specifies the layout of the loaded/stored values (= layout in the flattened vector). If either stride is constant 1, we could use that to determine the layout of the loaded/stored value. I may be missing something, but if both are != 1 or arbitrary values, it is not clear what we should pick for the in-vector layout.

Wait, what? I assumed you used a canonical layout (presumably column-major) in the flattened vector.

Yes, currently we default to column major for the lowering.

Are you planning to make all your intrinsics support either column-major or row-major layout? That seems like a lot of complexity in the backend that you're mostly not actually going to be using because the frontend will use a canonical layout.

My main focus currently is to evolve the intrinsics & lowering so they work well with the matrix extension in Clang. We don't plan to propose/work towards mixing layouts or supporting switching the layout in the C/C++ extension.

On the other hand, I am also trying to ensure the intrinsics are useful for use-cases beyond Clang. For example, the intrinsics are also used by MLIR to lower matrix operations, and for that use case, supporting row-major with the intrinsics and also mixing row/column major layouts makes things much easier. At the moment, it is already possible to switch the canonical layout to row-major for the lowering on the LLVM side. Supporting both layouts for most parts was relatively straightforward (excluding the load/store intrinsics) and fits quite naturally. I am also looking into providing a more powerful way to describe additional properties of the inputs using operand bundles.

Are you anticipating that it's going to be important to e.g. peephole a row-major load that's fed into a row-major store so that you don't unnecessarily shuffle the vector?

The main benefits I expect from making the layouts more flexible is 1) making IRGen easier for frontends with different underlying layouts and 2) potentially being able to optimise conversions/transposes for larger expressions. For small expressions I do not expect too much benefit in terms of optimisations, as LLVM is relatively good at eliminating the kinds of shuffles we emit, at least for small matrix sizes.

Also, if you *are* trying to proactively support multiple flattened-vector layouts, I feel like stopping at row-major vs. column-major is probably unnecessarily limiting and you should really allow a whole enumeration's worth of possibilities in case you want to e.g. incorporate internal padding into the vector.

Unfortunately I am a bit limited in terms of bandwidth when it comes to evolving the intrinsics outside of the clang extension case. I try to focus on evolving them to make sure they work well for the existing users, but also try to make sure the whole system remains flexible enough to support additional cases as you mentioned. I am also happy to quite aggressively adjust the intrinsics when we encounter issues/missing pieces, as in the patch for now. But I suppose we have to be a bit more careful about backwards-compatibility once released frontends out there support the intrinsics.

I hope that makes sense and please let me know if you have any concerns.

My immediate concern is just that I think the memory layout of the matrix should be orthogonal to the component layout of the vector. If you want the matrix intrinsics to support a variety of vector layouts, you should pass down the expected layout as a constant argument to the intrinsic rather than picking it up purely from whether the matrix is being loaded from a row-major or column-major layout in memory. I would guess that making that constant an i32 is probably sufficiently future-proof; if you ever need more structure than that, you'd probably be better off biting the bullet and adding an llvm::MatrixType.

My thinking here is that, for example, you're adding a builtin to Clang to do a column-major load. You want that to produce a column-major vector layout because that's your canonical layout within Clang. But you should also eventually add a builtin to do a row-major load, because there are quite a few reasons people might have a matrix in memory in row-major order: for example, if they've declared a nested C array, they'll naturally get row-major order. That builtin *also* needs to produce a column-major vector layout. So tying the two things together is bad intrinsic design.

sstefan1 added inline comments.Jun 11 2020, 12:24 AM
llvm/include/llvm/IR/Intrinsics.td
1465

I will try to get back to that soon.

fhahn added a comment.Jun 11 2020, 7:16 AM

My immediate concern is just that I think the memory layout of the matrix should be orthogonal to the component layout of the vector. If you want the matrix intrinsics to support a variety of vector layouts, you should pass down the expected layout as a constant argument to the intrinsic rather than picking it up purely from whether the matrix is being loaded from a row-major or column-major layout in memory. I would guess that making that constant an i32 is probably sufficiently future-proof; if you ever need more structure than that, you'd probably be better off biting the bullet and adding an llvm::MatrixType.

Hm I understand the appeal of having a single very powerful intrinsic. Selecting the different variants by a single parameter is convenient in terms of maintaining backwards compatibility, but personally I find it more readable to include some of the variant information in the name. Of course there's a limit to the number of variants for which that approach is feasible. I think it is important to have this discussion, but I am not sure if it is in scope for this patch (which only adds a few smallish improvements to the naming/arguments of the intrinsics) and it might be better to discuss that once work on row-major versions of the intrinsics starts?

My thinking here is that, for example, you're adding a builtin to Clang to do a column-major load. You want that to produce a column-major vector layout because that's your canonical layout within Clang. But you should also eventually add a builtin to do a row-major load, because there are quite a few reasons people might have a matrix in memory in row-major order: for example, if they've declared a nested C array, they'll naturally get row-major order. That builtin *also* needs to produce a column-major vector layout. So tying the two things together is bad intrinsic design.

Having intrinsics that can apply such conversions directly certainly is a convenient option here. But alternatively it should also be possible to have a small set of load/store intrinsics (e.g. load row-major from row-major, load column-major from column-major) and get the other variants by composing conversion functions (e.g. transpose(load_row_major(...)) to load a row-major matrix from memory and convert it to column-major). Granted, I think a few other people also mentioned in earlier discussions that they would prefer a few more powerful intrinsics, rather than having to compose intrinsics.

In the end I am happy to go either way, as both approaches should be equivalent in terms of optimization power and it mostly boils down to slightly different matching in the lowering. But as I mentioned earlier, I think it would be good to make those changes once someone has time to add row-major support for the load/store intrinsics.
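The composition mentioned above (a row-major load followed by a transpose yielding a column-major vector) can be sketched as follows. The helper names are hypothetical stand-ins for the intrinsics, not real LLVM APIs:

```python
# Stand-in for a row-major load: flattened row-major vector,
# element (i, j) at position i * cols + j.
def load_row_major(memory, rows, cols):
    return list(memory[:rows * cols])

# Reinterpret a row-major vector as the column-major vector of the
# same matrix (the shuffle the lowering would emit).
def transpose(vec, rows, cols):
    return [vec[i * cols + j] for j in range(cols) for i in range(rows)]

memory = [1, 2, 3, 4, 5, 6]  # [[1, 2, 3], [4, 5, 6]] stored row-major
print(transpose(load_row_major(memory, 2, 3), 2, 3))  # [1, 4, 2, 5, 3, 6]
```

The combined operation is equivalent to a dedicated "load row-major as column-major" intrinsic; the question in the thread is whether the backend should match the composed form or provide the fused form directly.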

My immediate concern is just that I think the memory layout of the matrix should be orthogonal to the component layout of the vector. If you want the matrix intrinsics to support a variety of vector layouts, you should pass down the expected layout as a constant argument to the intrinsic rather than picking it up purely from whether the matrix is being loaded from a row-major or column-major layout in memory. I would guess that making that constant an i32 is probably sufficiently future-proof; if you ever need more structure than that, you'd probably be better off biting the bullet and adding an llvm::MatrixType.

Hm I understand the appeal of having a single very powerful intrinsic. Selecting the different variants by a single parameter is convenient in terms of maintaining backwards compatibility, but personally I find it more readable to include some of the variant information in the name. Of course there's a limit to the number of variants for which that approach is feasible. I think it is important to have this discussion, but I am not sure if it is in scope for this patch (which only adds a few smallish improvements to the naming/arguments of the intrinsics) and it might be better to discuss that once work on row-major versions of the intrinsics starts?

If you're willing to rework the intrinsics later, then I have no objections to committing this now, yeah.

My thinking here is that, for example, you're adding a builtin to Clang to do a column-major load. You want that to produce a column-major vector layout because that's your canonical layout within Clang. But you should also eventually add a builtin to do a row-major load, because there are quite a few reasons people might have a matrix in memory in row-major order: for example, if they've declared a nested C array, they'll naturally get row-major order. That builtin *also* needs to produce a column-major vector layout. So tying the two things together is bad intrinsic design.

Having intrinsics that can apply such conversions directly certainly is a convenient option here. But alternatively it should also be possible to have a small set of load/store intrinsics (e.g. load row-major from row-major, load column-major from column-major) and get the other variants by composing conversion functions (e.g. transpose(load_row_major(...)) to load a row-major matrix from memory and convert it to column-major).

This is definitely a feasible alternative intrinsic design, where you have a small number of basic operations and the backend combines them to emit the operation more efficiently. My intuition is that how well it works in practice depends on the performance gap between emitting the combined operation and emitting them separately, because the backend can be quite bad at actually emitting the combined operation reliably. I don't have a sense of how that applies here; the naive approach for loading row-major as column-major is essentially to load and then shuffle, i.e. to essentially emit them separately anyway, but maybe there's a reasonable alternative that I don't know because I'm less familiar with vector instruction sets.

fhahn added a comment.Jun 11 2020, 4:59 PM

My immediate concern is just that I think the memory layout of the matrix should be orthogonal to the component layout of the vector. If you want the matrix intrinsics to support a variety of vector layouts, you should pass down the expected layout as a constant argument to the intrinsic rather than picking it up purely from whether the matrix is being loaded from a row-major or column-major layout in memory. I would guess that making that constant an i32 is probably sufficiently future-proof; if you ever need more structure than that, you'd probably be better off biting the bullet and adding an llvm::MatrixType.

Hm I understand the appeal of having a single very powerful intrinsic. Selecting the different variants by a single parameter is convenient in terms of maintaining backwards compatibility, but personally I find it more readable to include some of the variant information in the name. Of course there's a limit to the number of variants for which that approach is feasible. I think it is important to have this discussion, but I am not sure if it is in scope for this patch (which only adds a few smallish improvements to the naming/arguments of the intrinsics) and it might be better to discuss that once work on row-major versions of the intrinsics starts?

If you're willing to rework the intrinsics later, then I have no objections to committing this now, yeah.

Yeah that would be my preference. I think we made quite good progress with that approach so far.

My thinking here is that, for example, you're adding a builtin to Clang to do a column-major load. You want that to produce a column-major vector layout because that's your canonical layout within Clang. But you should also eventually add a builtin to do a row-major load, because there are quite a few reasons people might have a matrix in memory in row-major order: for example, if they've declared a nested C array, they'll naturally get row-major order. That builtin *also* needs to produce a column-major vector layout. So tying the two things together is bad intrinsic design.

Having intrinsics that can apply such conversions directly certainly is a convenient option here. But alternatively it should also be possible to have a small set of load/store intrinsics (e.g. load row-major from row-major, load column-major from column-major) and get the other variants by composing conversion functions (e.g. transpose(load.row.major(...)) to load a row-major matrix from memory and convert it to column-major).

This is definitely a feasible alternative intrinsic design, where you have a small number of basic operations and the backend combines them to emit the operation more efficiently. My intuition is that how well it works in practice depends on the performance gap between emitting the combined operation and emitting them separately, because the backend can be quite bad at actually emitting the combined operation reliably. I don't have a sense of how that applies here; the naive approach for loading row-major as column-major is essentially to load and then shuffle, i.e. to essentially emit them separately anyway, but maybe there's a reasonable alternative that I don't know because I'm less familiar with vector instruction sets.

I suppose it would depend on the details, but this approach might be a bit more fragile, in that other passes could interfere more easily. Thank you very much for elaborating on the issue, the discussion will be very helpful to reference once it is time to propose support for additional layout options for load/store.
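The composition discussed above (load in the memory's own layout, then transpose) can be sketched with a small Python model. This is illustrative only, not how the lowering is implemented; the function names and the flat-list representation of vectors are invented for the sketch. It shows that a row-major strided load followed by a transpose yields the same flat column-major vector as a direct column-major strided load of the same matrix:

```python
def column_major_load(buf, rows, cols, stride):
    # Model of a column-major strided load: column j starts at offset j * stride.
    return [buf[j * stride + i] for j in range(cols) for i in range(rows)]

def row_major_load(buf, rows, cols, stride):
    # Row-major counterpart: row i starts at offset i * stride.
    return [buf[i * stride + j] for i in range(rows) for j in range(cols)]

def transpose(flat, rows, cols):
    # Convert the row-major flat vector of a rows x cols matrix to column-major.
    return [flat[i * cols + j] for j in range(cols) for i in range(rows)]

# A 2 x 3 matrix stored row-major with stride 4 (one element of row padding):
#   [1 2 3]
#   [5 6 7]
row_major_buf = [1, 2, 3, 0, 5, 6, 7, 0]

# Composing "load row-major" with "transpose" ...
composed = transpose(row_major_load(row_major_buf, 2, 3, stride=4), 2, 3)

# ... matches a direct column-major load of the same matrix, stored
# column-major with stride 2.
column_major_buf = [1, 5, 2, 6, 3, 7]
direct = column_major_load(column_major_buf, 2, 3, stride=2)

assert composed == direct == [1, 5, 2, 6, 3, 7]
```

The stride parameter is what lets both loads skip padding between columns (or rows), which is also why the real intrinsics take it as an explicit argument rather than assuming a densely packed matrix.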

fhahn added a comment.Jun 15 2020, 9:07 AM

@nicolasvasilache I also need to update mlir/include/mlir/Dialect/LLVMIR/LLVMOps.td. Initially I'd just like to pass through no alignment (= use the one from the datalayout) and false, but I am not sure how to construct a false constant. It looks like there is getLLVMConstant, but I am not sure how to get an Int1 LLVM type. Any ideas?

@fhahn see https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Dialect/LLVMIR/LLVMOps.td#L203

you can extract the LLVM dialect from an LLVM type.

Then you should be able to extract the datalayout with:

const llvm::DataLayout &dataLayout = dialect.getLLVMModule().getDataLayout();
unsigned align = dataLayout.getPrefTypeAlignment(
    elementTy.cast<LLVM::LLVMType>().getUnderlyingType());

see e.g. https://github.com/llvm/llvm-project/blob/master/mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVM.cpp#L135

you would then need to update this location: https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Dialect/LLVMIR/LLVMOps.td#L828

would that work for you ?

Is there a test for passing alignment?

llvm/include/llvm/IR/Intrinsics.td
1439–1467

If this is only reformatting, I would leave that to a separate patch.

1465

Has this been addressed?

llvm/include/llvm/IR/MatrixBuilder.h
62

unsigned -> uint64_t?

jdoerfert added inline comments.Jun 15 2020, 2:08 PM
llvm/include/llvm/IR/Intrinsics.td
1465

This is unrelated and should not block the patch. Eventually we want easier attribute handling for intrinsics but we are still working on it.

fhahn updated this revision to Diff 270865.Jun 15 2020, 2:18 PM

Address comments: Change row/column types to uint64_t in MatrixBuilder for load/store, submit formatting/nosync/nocapture changes outside of patch, rebased.

fhahn marked 4 inline comments as done.Jun 15 2020, 2:20 PM

Is there a test for passing alignment?

not yet, will add as a follow up along with the implementation to respect them (should be ready soon).

@fhahn see https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Dialect/LLVMIR/LLVMOps.td#L203

you can extract the LLVM dialect from an LLVM type.

Then you should be able to extract the datalayout with:

const llvm::DataLayout &dataLayout = dialect.getLLVMModule().getDataLayout();
unsigned align = dataLayout.getPrefTypeAlignment(
    elementTy.cast<LLVM::LLVMType>().getUnderlyingType());

see e.g. https://github.com/llvm/llvm-project/blob/master/mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVM.cpp#L135

you would then need to update this location: https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Dialect/LLVMIR/LLVMOps.td#L828

would that work for you ?

@nicolasvasilache Yeah, I initially was not sure how to create a false constant there, but I'll just go ahead and update the MLIR definition to conform to the update in one go, unless you think that will cause problems.

llvm/include/llvm/IR/Intrinsics.td
1439–1467
1465

I added the nosync/nocapture attributes in 1d33c09f220ea9fe2846813bafc46dc5d9394577

fhahn updated this revision to Diff 271004.Jun 16 2020, 2:47 AM
fhahn marked an inline comment as done.

Update MLIR llvm intrinsic wrappers.

I think I found the right place to update; there was no need to get a dialect. I went with using getABITypeAlign for the pointer element type, which ensures we generate the same alignment as before. I am not sure if/how MLIR supports align parameter attributes, but ideally the MLIR LLVM dialect would support them in the future and we could just pass the alignment through.

Herald added a project: Restricted Project.
anemet accepted this revision.Jun 16 2020, 10:55 AM

LGTM! @nicolasvasilache can you please OK the MLIR parts if you're happy with them?

This revision is now accepted and ready to land.Jun 16 2020, 10:55 AM
fhahn added a comment.Jun 17 2020, 4:52 AM

LGTM! @nicolasvasilache can you please OK the MLIR parts if you're happy with them?

Thanks everyone! Given that this has been extensively discussed I plan to submit this in a day or so to give people time to raise additional concerns. If there are additional suggestions for the MLIR side afterwards, I think we can address them post-commit.

nicolasvasilache accepted this revision.Jun 17 2020, 8:39 AM

SG, thank you @fhahn !

fhahn updated this revision to Diff 271603.Jun 18 2020, 1:27 AM

Remove unnecessary changes of row/column args from i32 -> i64. Keep them as i32, for consistency with the other intrinsics. They are required to be constant (and are used as constants directly in the lowering), so this has no impact on the lowering code. If we ever need to allow rows/columns not fitting in i32, we should revisit then.
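For reference, with all of the above applied (rows/columns kept as i32, the stride widened to i64, and the new volatile flag added), a call to the load intrinsic should look roughly like the sketch below, based on the summary of this patch. The overload-mangled suffix of the intrinsic name is omitted here and may differ from the real mangling:

```llvm
; Load a 3 x 2 matrix of doubles stored column-major with a stride of
; 5 elements between columns; non-volatile, from an 8-byte aligned pointer.
%m = call <6 x double> @llvm.matrix.column.major.load(
         double* align 8 %ptr, i64 5, i1 false, i32 3, i32 2)
```

The i1 argument is the new volatile flag, and the alignment is carried by the align parameter attribute rather than by an extra intrinsic argument.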

This revision was automatically updated to reflect the committed changes.