This is an archive of the discontinued LLVM Phabricator instance.

[Matrix] Add first set of matrix intrinsics and initial lowering pass.
ClosedPublic

Authored by fhahn on Nov 19 2019, 12:19 PM.

Details

Summary

This is the first patch adding an initial set of matrix intrinsics and a
corresponding lowering pass. This has been discussed on llvm-dev:
http://lists.llvm.org/pipermail/llvm-dev/2019-October/136240.html

The first patch introduces four new intrinsics (transpose, multiply,
columnwise load and store) and a LowerMatrixIntrinsics pass that
lowers those intrinsics to vector operations.

Matrixes are embedded in a 'flat' vector (e.g. a 4 x 4 float matrix
embedded in a <16 x float> vector) and the intrinsics take the dimension
information as parameters. Those parameters need to be ConstantInt.
For the memory layout, we initially assume column-major, but in the RFC
we also described how to extend the intrinsics to support row-major.
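As a rough sketch (the overloaded-name mangling and exact argument order here are illustrative rather than quoted from the patch), a transpose of a 4 x 4 float matrix embedded in a <16 x float> vector could look like:

declare <16 x float> @llvm.matrix.transpose.v16f32(<16 x float>, i32, i32)

; %m holds the 4 x 4 matrix as a flat, column-major <16 x float> value.
%t = call <16 x float> @llvm.matrix.transpose.v16f32(<16 x float> %m, i32 4, i32 4)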

For the initial lowering, we split the inputs of the intrinsics into a
set of column vectors, transform those column vectors, and concatenate
the resulting columns into a flat result vector.
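To make the split/transform/concatenate step concrete, here is a minimal sketch (value names are illustrative) for a 2 x 2 float matrix held in a flat, column-major <4 x float> value:

%col.0 = shufflevector <4 x float> %m, <4 x float> undef, <2 x i32> <i32 0, i32 1>
%col.1 = shufflevector <4 x float> %m, <4 x float> undef, <2 x i32> <i32 2, i32 3>
; ... operate on the <2 x float> columns, producing %col.0.res and %col.1.res ...
%flat = shufflevector <2 x float> %col.0.res, <2 x float> %col.1.res, <4 x i32> <i32 0, i32 1, i32 2, i32 3>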

This allows us to lower the intrinsics without any shape propagation, as
mentioned in the RFC. In follow-up patches, we plan to submit the
following improvements:

  • Shape propagation to eliminate the embedding/splitting for each intrinsic.
  • Fused & tiled lowering of multiply and other operations.
  • Optimization remarks highlighting matrix expressions and costs.
  • Loop generation for operations on large matrixes.
  • More general block processing for operations on large vectors, exploiting shape information.

We would like to add dedicated transpose, columnwise load and store
intrinsics, even though they are not strictly necessary. For example, we
could instead emit a large shufflevector instruction for the transpose
(see the sketch after the list below). But we expect that to

(1) become unwieldy for larger matrixes (even for 16x16 matrixes,
    the resulting shufflevector masks would be huge),
(2) risk instcombine making small changes, causing us to fail to
    detect the transpose, preventing better lowerings.
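For illustration, even the 4x4 case already needs a 16-entry mask when the transpose is expressed as a single shufflevector over the flat, column-major vector (a hypothetical sketch; for 16x16 the mask would have 256 entries):

%t = shufflevector <16 x float> %m, <16 x float> undef, <16 x i32> <i32 0, i32 4, i32 8, i32 12, i32 1, i32 5, i32 9, i32 13, i32 2, i32 6, i32 10, i32 14, i32 3, i32 7, i32 11, i32 15>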

For the load/store, we are additionally planning on exploiting the
intrinsics for better alias analysis.

Event Timeline

fhahn created this revision.Nov 19 2019, 12:19 PM
lebedev.ri added inline comments.
llvm/docs/LangRef.rst
14417

Column-major-ness seems unusual to me.
Perhaps motivation can be stated either in the langref, or at least in the review?

llvm/include/llvm/IR/Intrinsics.td
1235

You may want to use ImmArg to actually enforce the constant-ness of dimensions.

fhahn marked 2 inline comments as done.Nov 19 2019, 1:42 PM
fhahn added inline comments.
llvm/docs/LangRef.rst
14417

The main motivation was that column-major layout seems to be the default for at least a few popular matrix-related libraries we are working on supporting, both proprietary and open source (like Eigen, for example: https://eigen.tuxfamily.org/dox/group__TopicStorageOrders.html).

llvm/include/llvm/IR/Intrinsics.td
1235

Nice, I was not aware of that, thanks!

fhahn updated this revision to Diff 230141.Nov 19 2019, 1:43 PM

Use ImmArg for intrinsics.

anemet added inline comments.Nov 20 2019, 7:52 AM
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
158–167

The splitVector and splitToColumnVectors names are not very descriptive, especially WRT their difference.

244–272

Commenting the differences between the overloads would be good; also, can these be stand-alone static functions?

fhahn updated this revision to Diff 230337.Nov 20 2019, 2:26 PM
fhahn marked 2 inline comments as done.

Squashed splitToColumnVector & splitVector into getMatrix, removed the unnecessary overload, moved the address computation code out of the class, added some additional comments, and renamed MatrixTy -> ColumnMatrixTy.

fhahn updated this revision to Diff 230338.Nov 20 2019, 2:30 PM
fhahn marked 2 inline comments as done.

Update comment for getMatrix.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
158–167

Now there's just getMatrix.

244–272

Moved the code out and the overload is gone now.

dmgreen added inline comments.
llvm/docs/LangRef.rst
14429

Does the size of the vector need to be rows * cols? Can it be larger? I presume it would be a problem if it was smaller.

14457

Minor wording:
"The '`llvm.matrix.multiply.*`' intrinsic treats %A as a matrix with <M> rows and <K> columns, %B as a matrix with <K> rows and <N> columns and multiplies them."

14478

Can Stride be less than Cols?

14479

The returned matrix is expected to be packed; the memory layout is padded with Stride, correct?

llvm/include/llvm/Transforms/Scalar.h
362

I guess this should have an extra comment, to be consistent.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
255

What does "offset from the initial element" refer to?

259

How big should we expect these to be? Will stamping out all the loads be OK for the larger cases, or would it be better to start creating loops? (Probably not a question that needs fixing in this commit. Just a question in general).

llvm/test/Transforms/LowerMatrixIntrinsics/transpose.ll
37

llvm.matrix.transpose.v8f64

fhahn updated this revision to Diff 230540.Nov 21 2019, 2:41 PM
fhahn marked 5 inline comments as done.

Clarify strides and vector sizes in LangRef, update code comments

fhahn marked an inline comment as done.Nov 21 2019, 2:43 PM
fhahn added inline comments.
llvm/docs/LangRef.rst
14429

Good catch! I think being bigger would mean we only process the first rows * columns elements, but I think we should restrict the vector size to match rows * columns. Not sure if we can do that on the intrinsic level, but I'll add asserts.

14479

No, the packed matrix in the flat vector will only contain the elements in the columns/rows, as described by the dimensions. The 'stride' elements are not loaded. I've tried to clarify the meaning of the stride argument. Please let me know if that makes more sense now.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
255

That was a left-over from an earlier version, where the intrinsic took an extra offset parameter. I've updated the comment (same for LowerMatrixStore).

259

There is no upper bound to the dimensions, other than limits imposed by the frontend (and the number of vector elements for LLVM's vector type), so for larger matrixes this won't generate great code. For our initial use cases, we focused on smaller matrixes (<= 16x16) and we tried to keep things simple for the initial patch.

As follow-ups, we are planning on adding patches that generate loops for operations on larger matrixes, e.g. generating fused load/multiply/add/store loops.

fhahn updated this revision to Diff 230542.Nov 21 2019, 2:46 PM

Fix dimensions in test case, surfaced by new assertion.

Harbormaster completed remote builds in B41342: Diff 230542.

Some food for thought & perhaps you could add a test (or more) for FP types other than double?

-Gerolf

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
15

Reduce or eliminate?

45

Would this be clearer: "\p Offset is usually the number of elements in a column (equivalent to the number of rows of the matrix). When MatrixPtr represents a submatrix, Offset is the number of rows of the parent matrix, which is the matrix that contains the submatrix."? At least "other than accessing a sub matrix." needs consideration.

63

Always zero extend even when XBitWidth matches WidestBitWidth?

72

.. column = Col * Offset

73

Then you could remove this line?!

74

ColumnOffset -> Distance? Offset seems a bit loaded.

237

You could early exit at this point (if (!Changed)....)

270

Misplaced comment?

304

\p Cols -> \p LM

323

assert(NumElts >= BlockNumElts)?

fhahn updated this revision to Diff 230621.Nov 22 2019, 3:10 AM
fhahn marked 8 inline comments as done.

Address Gerolf's comments, thanks!

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
45

I've updated the wording, hopefully it is clearer now.

63

The intrinsics now take i32 only, so we do not need this any longer.

73

I've compacted the comments a bit.

237

We do not really need to keep a list of instructions to remove currently. We can just remove them directly. I've dropped the code for now.

Interesting. Do you have any patch for the C/C++ frontend? What does the C/C++ code look like?

fhahn added a comment.Nov 27 2019, 1:52 PM

Interesting. Do you have any patch for the C/C++ frontend? What does the C/C++ code look like?

I'm currently preparing the patches on the clang side in addition to an update to cfe-dev. Please stay tuned and we would really appreciate any feedback there.

In the original RFC, we sketched the C/C++ support we envisioned using builtins. A simple example that loads two 4x4 matrixes, multiplies them, adds a third matrix to the result and stores the result can be found in the code below. Our initial proposal is quite stripped down and intended to be exposed to end users via something like a C++ matrix wrapper class.

typedef float m4x4_t __attribute__((matrix_type(4, 4)));
 
 
void f(m4x4_t *a, m4x4_t *b, m4x4_t *c, m4x4_t *r) {
  *r = __builtin_matrix_add(__builtin_matrix_multiply(*a, *b), *c);
}


Thanks for the example. I think most of the time the dimensions of the matrix are unknown at compile time. How do we write the below code with a static-dimension matrix type?

// Let's assume  0 < m, k, n <= 4. 
void matrix_multiply(float *a, float *b, int m, int k, int n) {
// ???
}

Do you plan to support dynamic matrix types, like arrays?

typedef float mmxk_t __attribute__((matrix_type(m, k)));
typedef float mkxn_t __attribute__((matrix_type(k, n)));
fhahn added a comment.Nov 28 2019, 8:39 AM

Thanks for the example. I think most of the time the dimensions of the matrix are unknown at compile time. How do we write the below code with a static-dimension matrix type?

Assuming the dimensions are known at the call sites, you could implement it as a templated function, with the dimensions being template parameters.

// Let's assume  0 < m, k, n <= 4. 
void matrix_multiply(float *a, float *b, int m, int k, int n) {
// ???
}

Do you plan to support dynamic matrix types, like arrays?

In our initial proposal, we focus exclusively on matrixes with dimensions known at compile time (for example, use cases similar to using Eigen with known dimensions). This is where most of the optimization potential comes from and, to start with, we are focusing on generating the best code for that use case. We also think this is where LLVM can provide the most value initially.

However, we plan to implement fusion of matrix operations, which I think is the key building block to also generate efficient code with dynamic dimensions. We would love to collaborate with people interested in supporting dynamic dimensions!

I have one suggestion and one more question.

-Gerolf

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
43

Could you add a picture/graphic showing a matrix, submatrix, and then their (lowered) memory layout and stride? This would make it easier to understand the computeAddress functions.

50

Do you need this function? Why are Row,Col Value here vs unsigned in the caller (computeColumnAddr())?

fhahn updated this revision to Diff 232111.Dec 4 2019, 6:20 AM

Remove computeEltAddr, drop unnecessary Row argument and clarify computeColumnAddr with a bit of ASCII art.

fhahn marked an inline comment as done.Dec 4 2019, 6:23 AM
fhahn added inline comments.
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
50

Not really. I folded the relevant bits into computeColumnAddr and removed the obsolete Row argument. Also turned the Col argument into a Value *.

fhahn updated this revision to Diff 232157.Dec 4 2019, 8:56 AM

Update address computation to take an element pointer directly, instead of creating them. This reduces the number of generated bitcasts.


LGTM.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
46

Nit: \p Stride elements. You can delete the rest since you repeat the Stride definition below.

anemet accepted this revision.Dec 5 2019, 12:46 PM

LGTM too. You may want to wait a few days to give other people a chance to comment further.

This revision is now accepted and ready to land.Dec 5 2019, 12:46 PM
fhahn added a comment.Dec 5 2019, 1:12 PM

LGTM too. You may want to wait a few days to give other people a chance to comment further.

Sure! Thanks for all the comments so far. I plan to land this early next week, unless there are any major comments/concerns!

LuoYuanke added inline comments.Dec 7 2019, 6:53 PM
llvm/docs/LangRef.rst
14486

Is it more straightforward if the start address of column B is computed as A + %stride? Given a 3D array "tensor[9][8][7]", to load some rows of data from the array, the %stride can be 8*7*2 instead of 8*7*2-7.

LuoYuanke added inline comments.Dec 7 2019, 7:04 PM
llvm/docs/LangRef.rst
14486

What I mean is that the start address of column B is computed as A + <Rows> * %Stride.

LuoYuanke added inline comments.Dec 7 2019, 8:49 PM
llvm/docs/LangRef.rst
14486

I mean that the start address of column B is computed as A + %Stride.

fhahn marked an inline comment as done.Dec 8 2019, 3:57 AM
fhahn added inline comments.
llvm/docs/LangRef.rst
14486

I mean that the start address of column B is computed as A + %Stride.

That would simplify things a bit and be more in line with what people expect a stride to be. The only slight advantage of having the stride separately might be that it is slightly easier to ensure Stride >= Rows (assuming only positive strides). But we can check for that separately and require Stride >= Rows. What do you think?

fhahn updated this revision to Diff 232918.Dec 9 2019, 1:02 PM

Update columnwise load/store to accept stride between the start addresses of 2 consecutive columns.
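For illustration of the updated semantics (hypothetical name mangling and values; %Stride is measured in elements), a columnwise load of a 3 x 2 double matrix with %Stride = 5 reads elements 0..2 of memory for column 0 and elements 5..7 for column 1:

; column j starts at %ptr + j * %Stride elements
%a = call <6 x double> @llvm.matrix.columnwise.load.v6f64(<6 x double>* %ptr, i32 5, i32 3, i32 2)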

I plan to land this change tomorrow, unless there are any remaining concerns.


anemet added inline comments.Dec 9 2019, 1:27 PM
llvm/docs/LangRef.rst
14492–14493

We should also be explicit that the unit for %Stride is the element type. "Distance" above is a bit ambiguous.

The rest of the new changes look good.

fhahn marked an inline comment as done.Dec 10 2019, 12:46 PM
fhahn added inline comments.
llvm/docs/LangRef.rst
14486

I went ahead and put up a patch that uses the fact that Stride >= NumRows to emit !alias.scope and !noalias metadata to make sure we can deduce no-alias for columnwise accesses sharing the same stride and base pointer (except accesses to the same column, of course): D71295

This revision was automatically updated to reflect the committed changes.
djtodoro added inline comments.
llvm/docs/LangRef.rst
14508

It looks like this breaks the llvm-sphinx-docs build.

fhahn marked an inline comment as done.Dec 23 2019, 12:55 PM
fhahn added inline comments.
llvm/docs/LangRef.rst
14508

Thanks, I fixed the build errors for LLVM's docs in 5762648

djtodoro added inline comments.Dec 24 2019, 1:20 AM
llvm/docs/LangRef.rst
14508

Thanks!

LuoYuanke added inline comments.Mar 31 2020, 7:04 AM
llvm/docs/LangRef.rst
14452

I have a question about the shape propagation. What if there is some conflict between shapes? Take the code below for example. The shape of matrix A is defined both by the load and by the multiply. What is the shape of matrix C and D? 4x4 or 5x5?

A = llvm.matrix.columnwise.load(ptr, stride, 5, 5);
B = llvm.matrix.columnwise.load(ptr, stride, 5, 5);
C = A + B
llvm.matrix.multiply(A, B, 4, 4, 4);
llvm.matrix.multiply(A, B, 5, 5 ,5);
D = A - B
fhahn marked an inline comment as done.Mar 31 2020, 8:17 AM
fhahn added inline comments.
llvm/docs/LangRef.rst
14452

I think the example above is not entirely valid as the loaded values are of type <25 x double> and the first multiply would take arguments with <16 x double>. I've put up D77129 to update the verifier to reject such IR.

However a conflict between shapes can occur, if the number of elements matches, as in the example below. In cases where the shape information inferred for an operand does not match the shape required at a use, we embed the operand into a flat vector and extract vectors of the required shape for that use (it is done in getMatrix https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp#L331)

A = llvm.matrix.columnwise.load(ptr, stride, 2, 3); // Shape of A = 2 x 3
llvm.matrix.multiply(A, A, 2, 3, 2); // first operand requires 2 x 3 - use A directly
                                     // second operand requires 3 x 2 - cannot use A directly; embed A in a flat vector and extract columns assuming 3 x 2.

The specification of the intrinsics does not mention shape propagation and the shape imposed by the intrinsics only applies to the uses at the intrinsic calls. Shape propagation is only used for optimization purposes to better split up instructions that do not require shape info.

Does that make sense? It might be desirable to add an option to check for conflicts.

LuoYuanke added inline comments.Apr 1 2020, 1:24 AM
llvm/docs/LangRef.rst
14452

I prefer to be more strict about the syntax. The example that you provide is also illegal to me. The shape of a matrix should be fixed when the matrix is defined. Maybe we can have some reshape operator to explicitly reshape the matrix.

C = llvm.matrix.reshape(A, row, col);
A = llvm.matrix.columnwise.load(ptr, stride, 2, 3); // Shape of A = 2 x 3
llvm.matrix.multiply(A, A, 2, 3, 2); // first operand requires 2 x 3 - use A directly
                                     // second operand requires 3 x 2 - illegal.
fhahn marked an inline comment as done.Apr 2 2020, 1:07 PM
fhahn added inline comments.
llvm/docs/LangRef.rst
14452

I prefer to be more strict about the syntax. The example that you provide is also illegal to me. The shape of a matrix should be fixed when the matrix is defined. Maybe we can have some reshape operator to explicitly reshape the matrix.

Unfortunately I think without a matrix type on the IR level, such a stricter syntax won't really be feasible. The intrinsics operate on arbitrary IR vectors and the operands/results can also be used by other instructions. At the moment, the matrix intrinsics and regular IR instructions interact nicely, I think exactly because they operate exclusively on flat vectors, with the intrinsics having some additional information. But the intrinsics themselves produce a flat vector again. This keeps the spec simple and also has the advantage of giving the optimizer freedom to do shape related optimizations.

A restriction as you mentioned would have to be enforced on the frontend side I think. For example, there should be no shape mismatches generated by Clang with our proposal/patches.

It might be good to add an option to the lowering pass to turn shape mismatches into errors, to verify a frontend does indeed not introduce conflicts.

What do you think?

LuoYuanke added inline comments.Apr 2 2020, 6:20 PM
llvm/docs/LangRef.rst
14452

It is nice that the frontend can prevent such shape mismatches from happening. I'd like to see what the C/C++ code looks like. Do you have any example code? It would be good if the example code could cover all the matrix intrinsics that you proposed.
About the option, I wonder if the middle-end can report errors elegantly like the front-end, without any abort or core dump.
BTW, I'd like to explore dynamic shapes for matrixes. Do you have any ideas how to support dynamic matrix shapes?

Hi Florian, cool work, thanks!

I'm wondering how the vectoriser could profit from this.

Currently, your patch runs the intrinsic lowering pass before the vectoriser, so we'd see the long sequence of insert/extract element that we'd normally see anyway. I.e., there should be no functional difference if you lowered like that from the front-end.

However, it would perhaps be much easier to do loop tiling directly on the intrinsic, if we knew how to handle them. We could also directly take the vector dimension from the intrinsics to define how many native vector operations are needed (with padding, etc).

Do you plan to add support in the vectoriser, and if so, would that reduce / invalidate the need for the current lowering pass?

cheers,
--renato

fhahn marked an inline comment as done.Apr 3 2020, 5:40 AM

Hi Renato,

Hi Florian, cool work, thanks!

I'm wondering how the vectoriser could profit from this.

Currently, your patch runs the intrinsic lowering pass before the vectoriser, so we'd see the long sequence of insert/extract element that we'd normally see anyway. I.e., there should be no functional difference if you lowered like that from the front-end.

Yes, the lowering as done in this patch could also have been done exclusively by the frontend without functional difference.

However, we have since improved the lowering to propagate shape information from matrix intrinsics to connected operations. This shape information is then used to split up connected instructions (like regular IR loads/stores/binary operators) to operate on column vectors, eliminating most insertelement/extractelement/shuffle instructions previously used to pack/unpack columns from the flat vector. For example, this can be seen in https://github.com/llvm/llvm-project/blob/master/llvm/test/Transforms/LowerMatrixIntrinsics/bigger-expressions-double.ll.
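As a hedged sketch of that idea (illustrative names, not the pass's actual output): if %a and %b are <8 x double> values known to hold 4 x 2 matrixes, a connected fadd can be split to operate on column vectors:

%a.col0 = shufflevector <8 x double> %a, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%a.col1 = shufflevector <8 x double> %a, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%b.col0 = shufflevector <8 x double> %b, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%b.col1 = shufflevector <8 x double> %b, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%s.col0 = fadd <4 x double> %a.col0, %b.col0
%s.col1 = fadd <4 x double> %a.col1, %b.col1

Once shapes are propagated, the columns produced by one operation can feed the next one directly, so the pack/unpack shuffles at each boundary disappear.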

However, it would perhaps be much easier to do loop tiling directly on the intrinsic, if we knew how to handle them. We could also directly take the vector dimension from the intrinsics to define how many native vector operations are needed (with padding, etc).

The aim/goal is to do exactly that and improve the lowering step-by-step. The lowering of matrix.multiply currently already tries to break down the operations according to the vector width of the target (https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp#L881), but there is lot of potential for further improvements.

Do you plan to add support in the vectoriser, and if so, would that reduce / invalidate the need for the current lowering pass?

Currently I am working on adding initial tiling support for multiplies directly to the lowering pass: D75566.

One advantage of doing it in the lowering pass is that we have all information necessary available there and it is very specific to the intrinsic. Without lowering the intrinsic, there is no loop at all. (Even with the proposed tiling, there won't be any loops as the lowering effectively unrolls the tiled loops, but that will be improved in the future, as this approach is not practical for larger matrixes).

I think currently the general direction of the work is towards making the lowering pass better, rather than teaching other passes about the matrix intrinsics. I've also been thinking about using the infrastructure in the lowering pass to optimize large vector operations, even if no matrix intrinsics are involved. At the moment I am not sure how supporting matrix intrinsics would fit into passes like the loop vectorizer, but the lowering pass might be a good candidate to use VPlan for code generation/cost-modeling, once the infrastructure is there. Another direction to explore would be to detect loops that perform a matrix multiply and replace them with a call to the intrinsic, which then gets further optimized.

Sorry for the somewhat lengthy response, but does the overall direction make sense to you?

Cheers,
Florian

llvm/docs/LangRef.rst
14452

It is nice that the frontend can prevent such shape mismatches from happening. I'd like to see what the C/C++ code looks like. Do you have any example code? It would be good if the example code could cover all the matrix intrinsics that you proposed.

I have put up the latest version of the proposed spec for the C/C++ extensions on Phabricator: D76612. The actual clang patches to implement the proposal are also available, starting at D72281 (and the linked patches).
It introduces a matrix type on the C/C++ level and the matrix types must agree for all operands of operations. There is no syntax/builtin provided for converting between different shapes, other than going through memory or moving elements to a matrix value with the desired shape.

About the option, I wonder if the middle-end can report errors elegantly like the front-end, without any abort or core dump.

No, I don't think so. It would solely be helpful to verify during testing that the frontend does not introduce conflicts.

BTW, I'd like to explore dynamic shapes for matrixes. Do you have any ideas how to support dynamic matrix shapes?

Interesting. I think that should ideally be driven by concrete use cases and I would like to hear more! Email or some other medium might be slightly better suited to discuss the topic though.

Yes, the lowering as done in this patch could also have been done exclusively by the frontend without functional difference.

Right, that makes sense. Do you expect front-ends to detect code patterns (like nested loops over i, j) or just to lower from existing "matmul" operations?

LLVM already does that for libc calls (ex. llvm.memcpy) and if languages have matmul intrinsics, then this would be a trivial lowering. But detecting patterns, especially in C/C++ code, can end up horribly wrong or slow. :)

Currently I am working on adding initial tiling support for multiplies directly to the lowering pass: D75566.

Sure, and I expect that this loop would already be "vectorised", with safety guaranteed by construction and widths extracted from TTI, so "pragma clang vectorise" would be disabled and the vectoriser won't even look at it.

I'm not sure how VPlan handles partially vectorised nested loops, but it would be interesting if we could re-vectorise after loop fusion or outer-loop vectorisation.

One advantage of doing it in the lowering pass is that we have all information necessary available there and it is very specific to the intrinsic. Without lowering the intrinsic, there is no loop at all. (Even with the proposed tiling, there won't be any loops as the lowering effectively unrolls the tiled loops, but that will be improved in the future, as this approach is not practical for larger matrixes).

That was my point in letting the LV "know" about the intrinsic. To recognise it as a loop and work on it.

I think currently the general direction of the work is towards making the lowering pass better, rather than teaching other passes about the matrix intrinsics.

That sounds very sensible. :)

I've also been thinking about using the infrastructure in the lowering pass to optimize large vector operations, even if no matrix intrinsics are involved. At the moment I am not sure how supporting matrix intrinsics would fit into passes like the loop vectorizer, but the lowering pass might be a good candidate to use VPlan for code generation/cost-modeling, once the infrastructure is there.

Indeed, what I thought would be a way into the LV. I don't mind if we teach the LV about matmul or if we export the VPlan machinery and let other passes use it, as long as we don't duplicate the work.

Another direction to explore would be to detect loops that perform a matrix multiply and replace them with a call to the intrinsic, which then gets further optimized.

That's curious. Do you mean tracing a path from (weird loop) to (llvm.matmul) to (matmul loop), in a way to canonicalise loops?

Sorry for the somewhat lengthy response, but does the overall direction make sense to you?

No problems at all. Also, bear in mind I don't want to delay the approval/merge of this patch. Glad to continue discussing it after it's committed.

cheers,
--renato

Oh, already merged, ignore me. :)

I have a similar question on how to lower the matrix intrinsics to HW-specific intrinsics/instructions. For example, X86 has the AVX512_VNNI feature (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=39,5370,5361,364,142,139,2210&text=vnni). It can perform dot-product computation. But after a matrix intrinsic is lowered to vector operations, it seems difficult to transform those vector operations into AVX512_VNNI intrinsics/instructions.

fhahn added a comment.Apr 3 2020, 10:20 AM

I have a similar question on how to lower the matrix intrinsics to HW-specific intrinsics/instructions. For example, X86 has the AVX512_VNNI feature (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=39,5370,5361,364,142,139,2210&text=vnni). It can perform dot-product computation. But after a matrix intrinsic is lowered to vector operations, it seems difficult to transform those vector operations into AVX512_VNNI intrinsics/instructions.

For example, assume we have an imaginary float @llvm.dot(<2 x float>, <2 x float>) that computes the dot product of the 2 arguments and we would like to lower @llvm.matrix.multiply(<4 x float> %a, <4 x float> %b, 2, 2, 2) using @llvm.dot. Currently, the LowerMatrixIntrinsics pass is where this needs to happen, similar to the tiling patch (D75566). You could add a separate emitMultiplyUsingLLVMDot() which would generate something like

%a.row.0 = shufflevector <4 x float> %a, <4 x float> undef, <2 x i32> <i32 0, i32 2>
%a.row.1 = shufflevector <4 x float> %a, <4 x float> undef, <2 x i32> <i32 1, i32 3>
%b.col.0 = shufflevector <4 x float> %b, <4 x float> undef, <2 x i32> <i32 0, i32 1>
%b.col.1 = shufflevector <4 x float> %b, <4 x float> undef, <2 x i32> <i32 2, i32 3>

%r.0.0 = call float @llvm.dot(<2 x float> %a.row.0, <2 x float> %b.col.0)
%res.1 = insertelement <4 x float> undef, float %r.0.0, i32 0
%r.1.0 = call float @llvm.dot(<2 x float> %a.row.1, <2 x float> %b.col.0)
%res.2 = insertelement <4 x float> %res.1, float %r.1.0, i32 1
%r.0.1 = call float @llvm.dot(<2 x float> %a.row.0, <2 x float> %b.col.1)
%res.3 = insertelement <4 x float> %res.2, float %r.0.1, i32 2
%r.1.1 = call float @llvm.dot(<2 x float> %a.row.1, <2 x float> %b.col.1)
%res.4 = insertelement <4 x float> %res.3, float %r.1.1, i32 3

We used something similar internally successfully. If you are interested, I could share infrastructure to create code that applies smaller building blocks (like fast 2x2 multiplication) to lower multiplies on larger matrixes.
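For reference, here is a hedged sketch of such a 2x2 building block lowered with plain vector operations (illustrative names; column-major layout as elsewhere in this patch). Column j of the result is computed as a.col0 * B[0][j] + a.col1 * B[1][j]:

%a.col0 = shufflevector <4 x float> %a, <4 x float> undef, <2 x i32> <i32 0, i32 1>
%a.col1 = shufflevector <4 x float> %a, <4 x float> undef, <2 x i32> <i32 2, i32 3>
; result column 0 = a.col0 * b[0][0] + a.col1 * b[1][0]
%b00 = extractelement <4 x float> %b, i32 0
%b00.v = insertelement <2 x float> undef, float %b00, i32 0
%b00.s = shufflevector <2 x float> %b00.v, <2 x float> undef, <2 x i32> zeroinitializer
%m0 = fmul <2 x float> %a.col0, %b00.s
%b10 = extractelement <4 x float> %b, i32 1
%b10.v = insertelement <2 x float> undef, float %b10, i32 0
%b10.s = shufflevector <2 x float> %b10.v, <2 x float> undef, <2 x i32> zeroinitializer
%m1 = fmul <2 x float> %a.col1, %b10.s
%c.col0 = fadd <2 x float> %m0, %m1
; result column 1 analogously, using b[0][1] (element 2) and b[1][1] (element 3)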

We used something similar internally successfully. If you are interested, I could share infrastructure to create code that applies smaller building blocks (like fast 2x2 multiplication) to lower multiplies on larger matrixes.

Yes. I'm interested in how to lower multiplies on large matrixes. The matrix type in the front-end can support any large matrix shape, right? Take the code below as an example; I'd like to lower it to some small VNNI operations. I'd like to read your infrastructure code or example code to achieve it. Thanks.

%a = call <1048576 x i32> @llvm.matrix.columnwise.load(<1048576 x i32>* %ina, i32 1024, i32 1024, i32 1024)
%b = call <1048576 x i32> @llvm.matrix.columnwise.load(<1048576 x i32>* %inb, i32 1024, i32 1024, i32 1024)
%c = call <1048576 x i32> @llvm.matrix.multiply(<1048576 x i32> %a, <1048576 x i32> %b, i32 1024, i32 1024, i32 1024)
fhahn added a comment.Apr 6 2020, 7:35 AM

We used something similar internally successfully. If you are interested, I could share infrastructure to create code that applies smaller building blocks (like fast 2x2 multiplication) to lower multiplies on larger matrixes.

Yes. I'm interested in how to lower multiplies on large matrixes. The matrix type in the front-end can support any large matrix shape, right? Take the code below as an example; I'd like to lower it to some small VNNI operations. I'd like to read your infrastructure code or example code to achieve it. Thanks.

I've created D77549 which uses AArch64's udot instruction to compute the result of multiplies on 4x4 tiles. To do so, first a tiled loop nest is created that iterates over the columns, rows and the inner dimension. In the inner loop, 4x4 tiles are loaded, multiplied (using the dot product) and accumulated. After the inner loop, the final result of the 4x4 tile is stored. The main reason I went for AArch64's udot is that I can easily run it, but IIUC the VNNI instructions are very similar, they just allow processing of larger tiles.

Please note that the patch is a bit rough around the edges and it is currently not clear how to specify the 'multiply 8 bit operands, accumulate in 32 bit result' nature of those instructions; I think we will have to extend the llvm.matrix.multiply definition for that. But it should be enough for you to be able to get started with getting something working for VNNI. Please let me know if you have any questions or encounter any problems, either in the discussion for D77549 or via email.

Cheers,
Florian


Thanks Florian.