Page MenuHomePhabricator

[Matrix] Add first set of matrix intrinsics and initial lowering pass.
ClosedPublic

Authored by fhahn on Nov 19 2019, 12:19 PM.

Details

Summary

This is the first patch adding an initial set of matrix intrinsics and a
corresponding lowering pass. This has been discussed on llvm-dev:
http://lists.llvm.org/pipermail/llvm-dev/2019-October/136240.html

The first patch introduces four new intrinsics (transpose, multiply,
columnwise load and store) and a LowerMatrixIntrinsics pass, that
lowers those intrinsics to vector operations.

Matrixes are embedded in a 'flat' vector (e.g. a 4 x 4 float matrix
embedded in a <16 x float> vector) and the intrinsics take the dimension
information as parameters. Those parameters need to be ConstantInt.
For the memory layout, we initially assume column-major, but in the RFC
we also described how to extend the intrinsics to support row-major as
well.

For the initial lowering, we split the input of the intrinsics into a
set of column vectors, transform those column vectors and concatenate
the result columns to a flat result vector.

This allows us to lower the intrinsics without any shape propagation, as
mentioned in the RFC. In follow-up patches, we plan to submit the
following improvements:

  • Shape propagation to eliminate the embedding/splitting for each intrinsic.
  • Fused & tiled lowering of multiply and other operations.
  • Optimization remarks highlighting matrix expressions and costs.
  • Generate loops for operations on large matrixes.
  • More general block processing for operation on large vectors, exploiting shape information.

We would like to add dedicated transpose, columnwise load and store
intrinsics, even though they are not strictly necessary. For example, we
could instead emit a large shufflevector instruction instead of the
transpose. But we expect that to

(1) become unwieldy for larger matrixes (even for 16x16 matrixes,
    the resulting shufflevector masks would be huge),
(2) risk instcombine making small changes, causing us to fail to
    detect the transpose, preventing better lowerings

For the load/store, we are additionally planning on exploiting the
intrinsics for better alias analysis.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
fhahn updated this revision to Diff 230540.Nov 21 2019, 2:41 PM
fhahn marked 5 inline comments as done.

Clarify strides and vector sizes in LangRef, update code comments

fhahn marked an inline comment as done.Nov 21 2019, 2:43 PM
fhahn added inline comments.
llvm/docs/LangRef.rst
14428

Good catch! I think being bigger would mean we only process the first rows * columns, but I think we should restrict the vector size to match rows * columns. Not sure if we can do that on the intrinsic level, but I'll add asserts.

14478

No, the packed matrix in the flat vector will only contain the elements in the columns/rows, as described by the dimensions. The 'stride' elements are not loaded. I've tried to clarify the meaning of the stride argument. Please let me know if that makes more sense now.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
255

That was a left-over from an earlier version, where the intrinsic took an extra offset parameter. I've updated the comment (same for LowerMatrixStore).

259

There is no upper bound to the dimensions, other than limits imposed by the frontend (and the number of vector elements for LLVM's vector type), so for larger matrixes this won't generate great code. For our initial use cases, we focused on smaller matrixes (<= 16x16) and we tried to keep things simple for the initial patch.

As follow ups, we are planning on adding patches that generate loops for operations on larger matrixes, e.g. generating a fused load/multiply/add/store loops.

fhahn updated this revision to Diff 230542.Nov 21 2019, 2:46 PM

Fix dimensions in test case, surfaced by new assertion.

Harbormaster completed remote builds in B41342: Diff 230542.

Some food for thought & perhaps you could add a test (or more) for FP types other than double?

-Gerolf

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
15

Reduce or eliminate?

45

Would this be clearer: \p Offset is usually the number of elements in a column (equivalent to the number of rows of the matrix). When MatrixPtr represents a submatrix, Offset is the number of rows of the parent matrix, which is the matrix that contains the submatrix? At least "other than accessing a sub matrix." needs consideration.

63

Always zero extend even when XBitWidth matches WidestBitWidth?

72

.. column = Col * Offset

73

Then you could remove this line?!

74

ColumnOffset -> Distance? Offset seems a bit loaded.

237

You could early exit at this point (if (!Changed)....)

270

Misplaced comment?

304

\p Cols -> \p LM

323

assert(NumElts >= BlockNumElts)?

fhahn updated this revision to Diff 230621.Nov 22 2019, 3:10 AM
fhahn marked 8 inline comments as done.

Address Gerolf's comments, thanks!

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
45

I've updated the wording, hopefully it is clearer now.

63

The intrinsics now take i32 only, so we do not need this any longer.

73

I've compacted the comments a bit.

237

We do not really need to keep a list of instructions to remove currently. We can just remove them directly. I've dropped the code for now.

Interesting. Do you have any patch for the C/C++ frontend? What does the C/C++ code look like?

fhahn added a comment.Nov 27 2019, 1:52 PM

Interesting. Do you have any patch for the C/C++ frontend? What does the C/C++ code look like?

I'm currently preparing the patches on the clang side in addition to an update to cfe-dev. Please stay tuned and we would really appreciate any feedback there.

In the original RFC, we sketched the C/C++ support we envisioned using builtins. A simple example that loads two 4x4 matrixes, multiplies them, adds a third matrix to the result and stores the it can be found in the code below. Our initial proposal is quite stripped down and intended to be exposed to end users via something like a C++ matrix wrapper class.

typedef float m4x4_t __attribute__((matrix_type(4, 4)));
 
 
void f(m4x4_t *a, m4x4_t *b, m4x4_t *c, m4x4_t *r) {
  *r = __builtin_matrix_add(__builtin_matrix_multiply(*a, *b), *c);
}

I'm currently preparing the patches on the clang side in addition to an update to cfe-dev. Please stay tuned and we would really appreciate any feedback there.

In the original RFC, we sketched the C/C++ support we envisioned using builtins. A simple example that loads two 4x4 matrixes, multiplies them, adds a third matrix to the result and stores the it can be found in the code below. Our initial proposal is quite stripped down and intended to be exposed to end users via something like a C++ matrix wrapper class.

typedef float m4x4_t __attribute__((matrix_type(4, 4)));
 
 
void f(m4x4_t *a, m4x4_t *b, m4x4_t *c, m4x4_t *r) {
  *r = __builtin_matrix_add(__builtin_matrix_multiply(*a, *b), *c);
}

Thanks for the example. I think most of time the dimension of the matrix is unknown in compile time. How do we write the below code with static dimension matrix type?

// Let's assume  0 < m, k, n <= 4. 
void matrix_multipy(float *a, float *b, int m, int k, int n) {
// ???
}

Do you plan to support dynamic matrix type like array?

typedef float mmxk_t __attribute__((matrix_type(m, k)));
typedef float mkxn_t __attribute__((matrix_type(k, n)));
fhahn added a comment.Nov 28 2019, 8:39 AM

Thanks for the example. I think most of time the dimension of the matrix is unknown in compile time. How do we write the below code with static dimension matrix type?

Assuming the dimensions are known at the call sites, you could implement it as a templated function, with the dimension being template parameters.

// Let's assume  0 < m, k, n <= 4. 
void matrix_multipy(float *a, float *b, int m, int k, int n) {
// ???
}

Do you plan to support dynamic matrix type like array?

In our initial proposal, we focus exclusively on matrixes with dimensions known at compile-time (for example, uses cases similar to using Eigen with known dimensions). This is where most of the optimization potential comes from and to start with, we are focusing on generating the best code for that use case. We also think this is were LLVM can provide the most value initially.

However, we plan to implement fusion of matrix operations, which I think is the key building block to also generate efficient code with dynamic dimensions. We would love to collaborate with people interested in supporting dynamic dimensions!

I have one suggestion and one more question.

-Gerolf

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
43

Could you add a picture/graphic showing a matrix, submatrix, and then their (lowered) memory layout and stride? This would make it easier to understand the computeAddress functions.

50

Do you need this function? Why are Row,Col Value here vs unsigned in the caller (computeColumnAddr())?

fhahn updated this revision to Diff 232111.Dec 4 2019, 6:20 AM

Remove computeEltAddr, drop unnecessary Row argument and clarify computeColumnAddr with a bit of ASCII art.

fhahn marked an inline comment as done.Dec 4 2019, 6:23 AM
fhahn added inline comments.
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
50

Not really. I folded the relevant bits into computeColumnAddr and removed the obsolete Row argument. Also turned the Col argument into a Value *.

fhahn updated this revision to Diff 232157.Dec 4 2019, 8:56 AM

Update address computation to take an element pointer directly, instead of creating them. This reduces the number of generated bitcasts.

Build result: FAILURE -
Log files: console-log.txt, CMakeCache.txt

LGTM.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
46

Nit: \p Stride elements. You can delete the rest since you repeat the Stride definition below.

anemet accepted this revision.Dec 5 2019, 12:46 PM

LGTM too. You may want to a wait a few days to give other people a chance to comment further.

This revision is now accepted and ready to land.Dec 5 2019, 12:46 PM
fhahn added a comment.Dec 5 2019, 1:12 PM

LGTM too. You may want to a wait a few days to give other people a chance to comment further.

Sure! thanks for all the comments so far. I plan to land this early next week, unless there are any major comments/concerns!

LuoYuanke added inline comments.Dec 7 2019, 6:53 PM
llvm/docs/LangRef.rst
14485

Is it more straight forward that the start address of column B is computed as A + %stride? Given a 3D array "tensor[9][8][7]", to load some rows of data from the array the %stride can be 8*7*2 instead of 8*7*2-7.

LuoYuanke added inline comments.Dec 7 2019, 7:04 PM
llvm/docs/LangRef.rst
14485

What I mean is the the start address of column B is computed as A + <Rows> * %Stride.

LuoYuanke added inline comments.Dec 7 2019, 8:49 PM
llvm/docs/LangRef.rst
14485

I mean the the start address of column B is computed as A + %Stride.

fhahn marked an inline comment as done.Dec 8 2019, 3:57 AM
fhahn added inline comments.
llvm/docs/LangRef.rst
14485

I mean the the start address of column B is computed as A + %Stride.

That would simplify things a bit and be more in-line with what people expect a stride to be? The only slight advantage of having stride separately might be that it is slightly easier to ensure stride >= Rows (assuming only positive strides). But we can check for that separately and require Stride >= Rows. What do you think?

fhahn updated this revision to Diff 232918.Dec 9 2019, 1:02 PM

Update columnwise load/store to accept stride between the start addresses of 2 consecutive columns.

I plan to land this change tomorrow, unless there are any remaining concerns.

Build result: FAILURE -
Log files: console-log.txt, CMakeCache.txt

anemet added inline comments.Dec 9 2019, 1:27 PM
llvm/docs/LangRef.rst
14491–14492

We should also be explicit that the unit for %Stride is also the element type. "Distance" above is a bit ambiguous.

The rest of the new changes look good.

fhahn marked an inline comment as done.Dec 10 2019, 12:46 PM
fhahn added inline comments.
llvm/docs/LangRef.rst
14485

I went ahead and put up a patch that uses the fact that Stride >= NumRows to emit !alias.scope and !noalias metadata to make sure we can deduce no-alias for columwise accesses sharing the same stride and base pointer (except accesses to the same column of course): D71295

This revision was automatically updated to reflect the committed changes.
djtodoro added inline comments.
llvm/docs/LangRef.rst
14507

It looks like this breaks the llvm-sphinx-docs build.

fhahn marked an inline comment as done.Dec 23 2019, 12:55 PM
fhahn added inline comments.
llvm/docs/LangRef.rst
14507

thanks, I fixed the build errors for LLVM's docs in 5762648

djtodoro added inline comments.Dec 24 2019, 1:20 AM
llvm/docs/LangRef.rst
14507

Thanks!

LuoYuanke added inline comments.Mar 31 2020, 7:04 AM
llvm/docs/LangRef.rst
14451

I have a question for the shape propagation. What if there is some conflict of the shape. Take below code for example. The shape of matrix A is defined both by load and multiply. What is shape of matrix C and D? 4x4 or 5x5?

A = llvm.matrix.columnwise.load(ptr, stride, 5, 5);
B = llvm.matrix.columnwise.load(ptr, stride, 5, 5);
C = A + B
llvm.matrix.multiply(A, B, 4, 4, 4);
llvm.matrix.multiply(A, B, 5, 5 ,5);
D = A - B
fhahn marked an inline comment as done.Mar 31 2020, 8:17 AM
fhahn added inline comments.
llvm/docs/LangRef.rst
14451

I think the example above is not entirely valid as the loaded values are of type <25 x double> and the first multiply would take arguments with <16 x double>. I've put up D77129 to update the verifier to reject such IR.

However a conflict between shapes can occur, if the number of elements matches, as in the example below. In cases where the shape information inferred for an operand does not match the shape required at a use, we embed the operand into a flat vector and extract vectors of the required shape for that use (it is done in getMatrix https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp#L331)

A = llvm.matrix.columnwise.load(ptr, stride, 2, 3); // Shape of A = 2 x 3
llvm.matrix.multiply(A, A, 2, 3, 2); // first operand requires  2 x 3 - use A directly
                                     // second operand requires  3 x 2 - cannot use A directly; embedded A in flat vector and extract columns assuming 3 x 2.

The specification of the intrinsics does not mention shape propagation and the shape imposed by the intrinsics only applies to the uses at the intrinsic calls. Shape propagation is only used for optimization purposes to better split up instructions that do not require shape info.

Does that make sense? It might be desirable to add an option to check for conflicts.

LuoYuanke added inline comments.Apr 1 2020, 1:24 AM
llvm/docs/LangRef.rst
14451

I prefer to be more strict for the syntax. The example that you provide is also illegal to me. The matrix should be fixed when matrix is defined. Maybe we can have some reshape operator to explicitly reshape the matrix.

C = llvm.matrix.reshape(A, row, col);
A = llvm.matrix.columnwise.load(ptr, stride, 2, 3); // Shape of A = 2 x 3
llvm.matrix.multiply(A, A, 2, 3, 2); // first operand requires  2 x 3 - use A directly
                                                     // second operand requires  3 x 2 - illegal.
fhahn marked an inline comment as done.Apr 2 2020, 1:07 PM
fhahn added inline comments.
llvm/docs/LangRef.rst
14451

I prefer to be more strict for the syntax. The example that you provide is also illegal to me. The matrix should be fixed when matrix is defined. Maybe we can have some reshape operator to explicitly reshape the matrix.

Unfortunately I think without a matrix type on the IR level, such a stricter syntax won't really be feasible. The intrinsics operate on arbitrary IR vectors and the operands/results can also be used by other instructions. At the moment, the matrix intrinsics and regular IR instructions interact nicely, I think exactly because they operate exclusively on flat vectors, with the intrinsics having some additional information. But the intrinsics themselves produce a flat vector again. This keeps the spec simple and also has the advantage of giving the optimizer freedom to do shape related optimizations.

A restriction as you mentioned would have to be enforced on the frontend side I think. For example, there should be no shape mismatches generated by Clang with our proposal/patches.

It might be good to add an option to the lowering pass to turn shape mismatches into error to verify a frontend does indeed not introduce conflicts.

What you you think?

LuoYuanke added inline comments.Apr 2 2020, 6:20 PM
llvm/docs/LangRef.rst
14451

It is nice that frontend can prevent such shape mismatch from happening. I'd like to see what the c/c++ looks like. Do you have any example code? It is good if the example code can cover all the matrix intrinsics that you proposed.
About the option, I wonder if middle-end can report error elegantly like front-end without any abort or core dump.
BTW, I'd like to explore the dynamic shape of Matrix. Do you have any ideas how to support dynamic shape of Matrix?

Hi Florian, cool work, thanks!

I'm wondering how the vectoriser could profit from this.

Currently, your patch is passing the intrinsic lowering pass before the vectoriser, so we'd see the long sequence of insert/extract element that we'd normally see anyway. Ie, it should have no functional difference if you lowered like that from the front-end.

However, it would perhaps be much easier to do loop tiling directly on the intrinsic, if we knew how to handle them. We could also directly take the vector dimension from the intrinsics to define how many native vector operations are needed (with padding, etc).

Do you plan to add support in the vectoriser, and if so, would that reduce / invalidate the need for the current lowering pass?

cheers,
--renato

fhahn marked an inline comment as done.Apr 3 2020, 5:40 AM

Hi Renato,

Hi Florian, cool work, thanks!

I'm wondering how the vectoriser could profit from this.

Currently, your patch is passing the intrinsic lowering pass before the vectoriser, so we'd see the long sequence of insert/extract element that we'd normally see anyway. Ie, it should have no functional difference if you lowered like that from the front-end.

Yes, the lowering as done in this patch could also have been done exclusively by the frontend without functional difference.

However, we since improved the lowering to propagate shape information from matrix intrinsics to connected operations. This shape information is then used to split up connected instructions (like regular IR loads/store/binary operators) to operate on column vectors, eliminating most insertelement/extractelement/shuffle instructions previously used to pack/unpack columns from the flat vector. For example, this can be seen in https://github.com/llvm/llvm-project/blob/master/llvm/test/Transforms/LowerMatrixIntrinsics/bigger-expressions-double.ll.

However, it would perhaps be much easier to do loop tiling directly on the intrinsic, if we knew how to handle them. We could also directly take the vector dimension from the intrinsics to define how many native vector operations are needed (with padding, etc).

The aim/goal is to do exactly that and improve the lowering step-by-step. The lowering of matrix.multiply currently already tries to break down the operations according to the vector width of the target (https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp#L881), but there is lot of potential for further improvements.

Do you plan to add support in the vectoriser, and if so, would that reduce / invalidate the need for the current lowering pass?

Currently I am working on adding initial tiling support for multiplies directly to the lowering pass: D75566.

One advantage of doing it in the lowering pass is that we have all information necessary available there and it is very specific to the intrinsic. Without lowering the intrinsic, there is no loop at all. (Even with the proposed tiling, there won't be any loops as the lowering effectively unrolls the tiled loops, but that will be improved in the future, as this approach is not practical for larger matrixes).

I think currently the general direction of the work is towards making the lowering pass better, rather than teaching other passes about the matrix intrinsics. I've also been thinking about using the infrastructure in the lowering pass to optimize large vector operations, even if no matrix intrinsics are involved. At the moment I am not sure how supporting matrix intrinsics would fit into passes like the loop vectorizer, but the lowering pass might be a good candidate to use VPlan for code generation/cost-modeling, once the infrastructure is there. Another direction to explore would be to detect loops that perform a matrix multiply and replacing them with a call to the intrinsic, which then gets further optimized.

Sorry for the somewhat lengthy response, but does the overall direction make sense to you?

Cheers,
Florian

llvm/docs/LangRef.rst
14451

It is nice that frontend can prevent such shape mismatch from happening. I'd like to see what the c/c++ looks like. Do you have any example code? It is good if the example code can cover all the matrix intrinsics that you proposed.

I have put up the latest version of the proposed spec for the C/C++ extensions on Phabricator: D76612. The actual clang patches to implement the proposal are also available, starting at D72281 (and the linked patches).
It introduces a matrix type on the C/C++ level and the matrix types must agree for all operands of operations. There is no syntax/builtin provided for converting between different shapes, other than going through memory or moving elements to a matrix value with the desired shape.

About the option, I wonder if middle-end can report error elegantly like front-end without any abort or core dump.

No I don't think so. It would be solely helpful to verify the frontend does not introduce conflicts during testing.

BTW, I'd like to explore the dynamic shape of Matrix. Do you have any ideas how to support dynamic shape of Matrix?

Interesting. I think that should ideally be driven be concrete use cases and I would like to hear more! Email or some other medium might be slightly better suited to discuss the topic though.

Yes, the lowering as done in this patch could also have been done exclusively by the frontend without functional difference.

Right, that makes sense. Do you expect front-ends to detect code patterns (like nested loops over i, j) or just to lower from existing "matmul" operations?

LLVM already does that for libc calls (ex. llvm.memcpy) and if languages have matmul intrinsics, then this would be a trivial lowering. But detecting patterns, especially in C/C++ code, can end up horribly wrong or slow. :)

Currently I am working on adding initial tiling support for multiplies directly to the lowering pass: D75566.

Sure, and I expect that this loop would already be "vectorised", sith safety guaranteed by construction and widths extracted from TTI, so "pragma clang vectorise" disabled and the vectoriser won't even look at it.

I'm not sure how VPlan handles partially vectorised nested loops, but it would be interesting if we could re-vectorise after loop fusion or outer-loop vectorisation.

One advantage of doing it in the lowering pass is that we have all information necessary available there and it is very specific to the intrinsic. Without lowering the intrinsic, there is no loop at all. (Even with the proposed tiling, there won't be any loops as the lowering effectively unrolls the tiled loops, but that will be improved in the future, as this approach is not practical for larger matrixes).

That was my point in letting the LV "know" about the intrinsic. To recognise it as a loop and work on it.

I think currently the general direction of the work is towards making the lowering pass better, rather than teaching other passes about the matrix intrinsics.

That sounds very sensible. :)

I've also been thinking about using the infrastructure in the lowering pass to optimize large vector operations, even if no matrix intrinsics are involved. At the moment I am not sure how supporting matrix intrinsics would fit into passes like the loop vectorizer, but the lowering pass might be a good candidate to use VPlan for code generation/cost-modeling, once the infrastructure is there.

Indeed, what I thought would be a way into the LV. I don't mind if we teach the LV about matmul or if we export the VPlan machinery and let other passes use it, as long as we don't duplicate the work.

Another direction to explore would be to detect loops that perform a matrix multiply and replacing them with a call to the intrinsic, which then gets further optimized.

That's curious. Do you mean tracing a path from (weird loop) to (llvm.matmul) to (matmul loop), in a way to canonicalise loops?

Sorry for the somewhat lengthy response, but does the overall direction make sense to you?

No problems at all. Also, bear in mind I don't want to delay the approval/merge of this patch. Glad to continue discussing it after it's committed.

cheers,
--renato

Oh, already merged, ignore me. :)

I have the similar question on how to lower matrix intrinsics to some HW specific intrinsics/instruction. For example, X86 have the AVX512_VNNI feature (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=39,5370,5361,364,142,139,2210&text=vnni). It can perform dot product computation. But after matrix intrinsic is lowered to vector, it seems difficult to transform the vector operation to AVX512_VNNI intrinsic/instruction.

fhahn added a comment.Apr 3 2020, 10:20 AM

I have the similar question on how to lower matrix intrinsics to some HW specific intrinsics/instruction. For example, X86 have the AVX512_VNNI feature (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=39,5370,5361,364,142,139,2210&text=vnni). It can perform dot product computation. But after matrix intrinsic is lowered to vector, it seems difficult to transform the vector operation to AVX512_VNNI intrinsic/instruction.

For example, assume we have an imaginary float @llvm.dot(<2 x float>, <2 x float> ) that computes the dot product of the 2 arguments and we would like to lower @llvm.matrix.multiply(<4 x float> %a, <4 x float> %b, 2, 2, 2) using @llvm.dot. Currently, the LowerMatrixIntrinsics pass is where this needs to happen, similar to the tiling patch (D75566). You could add a separate emitMultiplyUsingLLVMDot() which would generate something like

%a.row.0 = shufflevector <4 x float> undef, <4 x float> %a, <2 x i32> <i32 0, i32 2>
%a.row.1 = shufflevector <4 x float> undef, <4 x float> %a, <2 x i32> <i32 1, i32 3>
%b.col.0 =  shufflevector <4 x float> undef, <4 x float> %b, <2 x i32> <i32 0, i32 1>
%b.col.1 =  shufflevector <4 x float> undef, <4 x float> %b, <2 x i32> <i32 2, i32 3>

%r.0.0 = call float @llvm.dot(<2 x float> %a.row.0, <2 x float> %b.col.0)
%res.1 = insertelement <4 x float> undef, float %r.0.0, i32 0
%r.1.0 = call float @llvm.dot(<2 x float> %a.row.1, <2 x float> %b.col.0)
%res.2 = insertelement <4 x float> %res.1, float %r.1.0, i32 1
%r.0.1 = call float @llvm.dot(<2 x float> %a.row.0, <2 x float> %b.col.1)
%res.3 = insertelement <4 x float> %res.2, float %r.0.1, i32 2
%r.1.1 = call float @llvm.dot(<2 x float> %a.row.1, <2 x float> %b.col.1)
%res.4 = insertelement <4 x float> %res.3, float %r.1.1, i32 3

We used something similar internally successfully. If you are interested, I could share infrastructure to create code that applies smaller building blocks (like fast 2x2 multiplication) to lower multiplies on larger matrixes.

We used something similar internally successfully. If you are interested, I could share infrastructure to create code that applies smaller building blocks (like fast 2x2 multiplication) to lower multiplies on larger matrixes.

Yes. I'm interested in how to lower multiplies on large matrixes. The matrix type in front-end can support any large shape of matrix. right? Take below code as example, I'd like to lower it to some small VNNI operation. I'd like read your infrastructure code or example code to achieve it. Thanks.

%a = call <1048576 x i32> @llvm.matrix.columnwise.load(<1048576 x i32>* %ina, i32 1024, i32 1024, i32 1024)
%b = call <1048576 x i32> @llvm.matrix.columnwise.load(<1048576 x i32>* %inb, i32 1024, i32 1024, i32 1024)
%c = call <1048576 x i32> @llvm.matrix.multiply(<1048576 x i32> %a, <1048576 x i32> %b, i32 1024, i32 1024, i32 1024)
fhahn added a comment.Apr 6 2020, 7:35 AM

We used something similar internally successfully. If you are interested, I could share infrastructure to create code that applies smaller building blocks (like fast 2x2 multiplication) to lower multiplies on larger matrixes.

Yes. I'm interested in how to lower multiplies on large matrixes. The matrix type in front-end can support any large shape of matrix. right? Take below code as example, I'd like to lower it to some small VNNI operation. I'd like read your infrastructure code or example code to achieve it. Thanks.

I've created D77549 which uses AArch64's udot instruction to compute the result of multiplies on 4x4 tiles. To do so, first a tiled loop nest is created that iterates over the columns, rows and the inner dimension. In the inner loop, 4x4 tiles are loaded, multiplied (using the dot product) and accumulated. After the inner loop, the final result of the 4x4 tile is stored. The main reason I went for AArch64's udot is that I can easily run it, but IIUC the VNNI instructions are very similar, they just allow processing of larger tiles.

Please note that the patch is a bit rough around the edges and we currently it not clear how to specify 'multiply 8 bit operands, accumulate in 32 bit result' nature of those instructions; we will have to extend the llvm.matrix.multiply definition for that I think. But it should be enough for you to be able to get started with getting something working for VNNI. Please let me know if you have any questions or encounter any problems, either in the discussion for D77549 or email.

Cheers,
Florian

I've created D77549 which uses AArch64's udot instruction to compute the result of multiplies on 4x4 tiles. To do so, first a tiled loop nest is created that iterates over the columns, rows and the inner dimension. In the inner loop, 4x4 tiles are loaded, multiplied (using the dot product) and accumulated. After the inner loop, the final result of the 4x4 tile is stored. The main reason I went for AArch64's udot is that I can easily run it, but IIUC the VNNI instructions are very similar, they just allow processing of larger tiles.

Please note that the patch is a bit rough around the edges and we currently it not clear how to specify 'multiply 8 bit operands, accumulate in 32 bit result' nature of those instructions; we will have to extend the llvm.matrix.multiply definition for that I think. But it should be enough for you to be able to get started with getting something working for VNNI. Please let me know if you have any questions or encounter any problems, either in the discussion for D77549 or email.

Cheers,
Florian

Thanks Florian.