
[Matrix] Add first set of matrix intrinsics and initial lowering pass.
Accepted · Public

Authored by fhahn on Tue, Nov 19, 12:19 PM.

Details

Summary

This is the first patch adding an initial set of matrix intrinsics and a
corresponding lowering pass. This has been discussed on llvm-dev:
http://lists.llvm.org/pipermail/llvm-dev/2019-October/136240.html

The first patch introduces four new intrinsics (transpose, multiply,
columnwise load and store) and a LowerMatrixIntrinsics pass that
lowers those intrinsics to vector operations.

Matrixes are embedded in a 'flat' vector (e.g. a 4 x 4 float matrix
embedded in a <16 x float> vector) and the intrinsics take the dimension
information as parameters. Those parameters need to be ConstantInt.
For the memory layout, we initially assume column-major, but in the RFC
we also described how to extend the intrinsics to support row-major as
well.
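
As an illustration only (not code from the patch), the column-major embedding means element (Row, Col) of an R x C matrix lives at flat index Col * R + Row:

#include <cstddef>

// Hypothetical helper, for illustration: flat index of element (Row, Col)
// of an R x C matrix embedded column-major in a vector of R * C elements.
template <std::size_t R, std::size_t C>
constexpr std::size_t flatIndex(std::size_t Row, std::size_t Col) {
  return Col * R + Row; // each column is contiguous in the flat vector
}

// Example: for a 4 x 4 float matrix in a <16 x float> vector, element
// (Row 2, Col 3) sits at flat index 3 * 4 + 2 = 14.
static_assert(flatIndex<4, 4>(2, 3) == 14, "column-major layout");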

For the initial lowering, we split the input of the intrinsics into a
set of column vectors, transform those column vectors and concatenate
the result columns to a flat result vector.
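
As a scalar reference for what this lowering computes (a sketch for illustration, not code from the patch), a transpose over the flat, column-major representation is just the index remapping below; the pass performs the same remapping with vector extracts and shuffles on the column vectors:

#include <vector>

// Scalar reference of a column-major transpose, for illustration only.
// In holds a Rows x Cols matrix as a flat vector of Rows * Cols elements;
// the result holds the transposed Cols x Rows matrix, again column-major.
std::vector<float> transposeRef(const std::vector<float> &In,
                                unsigned Rows, unsigned Cols) {
  std::vector<float> Out(In.size());
  for (unsigned C = 0; C < Cols; ++C)
    for (unsigned R = 0; R < Rows; ++R)
      // Element (R, C) of the input becomes element (C, R) of the result.
      Out[R * Cols + C] = In[C * Rows + R];
  return Out;
}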

This allows us to lower the intrinsics without any shape propagation, as
mentioned in the RFC. In follow-up patches, we plan to submit the
following improvements:

  • Shape propagation to eliminate the embedding/splitting for each intrinsic.
  • Fused & tiled lowering of multiply and other operations.
  • Optimization remarks highlighting matrix expressions and costs.
  • Generate loops for operations on large matrixes.
  • More general block processing for operations on large vectors, exploiting shape information.

We would like to add dedicated transpose, columnwise load and store
intrinsics, even though they are not strictly necessary. For example, we
could emit a large shufflevector instruction instead of the
transpose. But we expect that to

(1) become unwieldy for larger matrixes (even for 16x16 matrixes,
    the resulting shufflevector masks would be huge),
(2) risk instcombine making small changes that cause us to fail to
    detect the transpose, preventing better lowerings.

For the load/store, we are additionally planning on exploiting the
intrinsics for better alias analysis.

Diff Detail

Event Timeline

lebedev.ri added inline comments.
llvm/docs/LangRef.rst
14410

Column-major-ness seems unusual to me.
Perhaps motivation can be stated either in the langref, or at least in the review?

llvm/include/llvm/IR/Intrinsics.td
1235

You may want to use ImmArg to actually enforce the constant-ness of dimensions.

fhahn marked 2 inline comments as done.Tue, Nov 19, 1:42 PM
fhahn added inline comments.
llvm/docs/LangRef.rst
14410

The main motivation was that column-major layout seems to be the default for at least a few popular matrix-related libraries we are working on supporting (proprietary and open source, for example Eigen: https://eigen.tuxfamily.org/dox/group__TopicStorageOrders.html).

llvm/include/llvm/IR/Intrinsics.td
1235

Nice, I was not aware of that, thanks!

fhahn updated this revision to Diff 230141.Tue, Nov 19, 1:43 PM

Use ImmArg for intrinsics.

anemet added inline comments.Wed, Nov 20, 7:52 AM
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
158–167

splitVector and splitToColumnVectors names are not very descriptive especially WRT their difference.

244–272

Commenting the differences between the overloads would be good, also can these be stand-alone static functions?

fhahn updated this revision to Diff 230337.Wed, Nov 20, 2:26 PM
fhahn marked 2 inline comments as done.

Squashed splitToColumnVector & splitVector into getMatrix, removed an unnecessary overload, moved the address computation code out of the class, added some additional comments, and renamed MatrixTy -> ColumnMatrixTy.

fhahn updated this revision to Diff 230338.Wed, Nov 20, 2:30 PM
fhahn marked 2 inline comments as done.

Update comment for getMatrix.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
158–167

Now there's just getMatrix.

244–272

Moved the code out and the overload is gone now.

dmgreen added inline comments.
llvm/docs/LangRef.rst
14422

Does the size of the vector need to be rows * cols? Can it be larger? I presume it would be a problem if it was smaller.

14450

Minor wording:
"The '`llvm.matrix.multiply.*`' intrinsic treats %A as a matrix with <M> rows and <K> columns, %B as a matrix with <K> rows and <N> columns and multiplies them."

14471

Can Stride be less than Cols?

14472

The returned matrix is expected to be packed, while the memory layout is padded with Stride, correct?

llvm/include/llvm/Transforms/Scalar.h
362

I guess this should have an extra comment, to be consistent.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
255

What does "offset from the initial element" refer to?

259

How big should we expect these to be? Will stamping out all the loads be OK for the larger cases, or would it be better to start creating loops? (Probably not a question that needs fixing in this commit. Just a question in general.)

llvm/test/Transforms/LowerMatrixIntrinsics/transpose.ll
36 (On Diff #230338)

llvm.matrix.transpose.v8f64

fhahn updated this revision to Diff 230540.Thu, Nov 21, 2:41 PM
fhahn marked 5 inline comments as done.

Clarify strides and vector sizes in LangRef, update code comments

fhahn marked an inline comment as done.Thu, Nov 21, 2:43 PM
fhahn added inline comments.
llvm/docs/LangRef.rst
14422

Good catch! I think being bigger would mean we only process the first rows * columns, but I think we should restrict the vector size to match rows * columns. Not sure if we can do that on the intrinsic level, but I'll add asserts.

14472

No, the packed matrix in the flat vector will only contain the elements in the columns/rows, as described by the dimensions. The 'stride' elements are not loaded. I've tried to clarify the meaning of the stride argument. Please let me know if that makes more sense now.
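
A scalar sketch of that access pattern (illustration only; ColumnDist is a hypothetical name for the distance in elements between the starts of consecutive columns, and how it relates to the intrinsic's %Stride argument is discussed further below in this review):

#include <vector>

// Scalar reference of the columnwise load, illustration only. Only Rows
// elements per column end up in the packed result; elements between the
// columns are skipped and never loaded.
std::vector<float> columnwiseLoadRef(const float *Ptr, unsigned Rows,
                                     unsigned Cols, unsigned ColumnDist) {
  std::vector<float> Result;
  Result.reserve(Rows * Cols);
  for (unsigned C = 0; C < Cols; ++C)
    for (unsigned R = 0; R < Rows; ++R)
      Result.push_back(Ptr[C * ColumnDist + R]);
  return Result;
}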

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
255

That was a left-over from an earlier version, where the intrinsic took an extra offset parameter. I've updated the comment (same for LowerMatrixStore).

259

There is no upper bound to the dimensions, other than limits imposed by the frontend (and the number of vector elements for LLVM's vector type), so for larger matrixes this won't generate great code. For our initial use cases, we focused on smaller matrixes (<= 16x16) and we tried to keep things simple for the initial patch.

As follow-ups, we are planning on adding patches that generate loops for operations on larger matrixes, e.g. generating fused load/multiply/add/store loops.

fhahn updated this revision to Diff 230542.Thu, Nov 21, 2:46 PM

Fix dimensions in test case, surfaced by new assertion.

Harbormaster completed remote builds in B41342: Diff 230542.

Some food for thought & perhaps you could add a test (or more) for FP types other than double?

-Gerolf

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
15

Reduce or eliminate?

45

Would this be clearer: \p Offset is usually the number of elements in a column (equivalent to the number of rows of the matrix). When MatrixPtr represents a submatrix, Offset is the number of rows of the parent matrix, which is the matrix that contains the submatrix? At least "other than accessing a sub matrix." needs consideration.

63

Always zero extend even when XBitWidth matches WidestBitWidth?

72

.. column = Col * Offset

73

Then you could remove this line?!

74

ColumnOffset -> Distance? Offset seems a bit loaded.

237

You could early exit at this point (if (!Changed)....)

270

Misplaced comment?

304

\p Cols -> \p LM

323

assert(NumElts >= BlockNumElts)?

fhahn updated this revision to Diff 230621.Fri, Nov 22, 3:10 AM
fhahn marked 8 inline comments as done.

Address Gerolf's comments, thanks!

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
45

I've updated the wording, hopefully it is clearer now.

63

The intrinsics now take i32 only, so we do not need this any longer.

73

I've compacted the comments a bit.

237

We do not really need to keep a list of instructions to remove currently. We can just remove them directly. I've dropped the code for now.

Interesting. Do you have any patch for the C/C++ frontend? What does the C/C++ code look like?

fhahn added a comment.Wed, Nov 27, 1:52 PM

Interesting. Do you have any patch for the C/C++ frontend? What does the C/C++ code look like?

I'm currently preparing the patches on the clang side in addition to an update to cfe-dev. Please stay tuned and we would really appreciate any feedback there.

In the original RFC, we sketched the C/C++ support we envisioned using builtins. A simple example that loads two 4x4 matrixes, multiplies them, adds a third matrix to the result and stores the result can be found in the code below. Our initial proposal is quite stripped down and intended to be exposed to end users via something like a C++ matrix wrapper class.

typedef float m4x4_t __attribute__((matrix_type(4, 4)));
 
 
void f(m4x4_t *a, m4x4_t *b, m4x4_t *c, m4x4_t *r) {
  *r = __builtin_matrix_add(__builtin_matrix_multiply(*a, *b), *c);
}


Thanks for the example. I think most of the time the dimensions of the matrix are unknown at compile time. How do we write the code below with a static-dimension matrix type?

// Let's assume 0 < m, k, n <= 4.
void matrix_multiply(float *a, float *b, int m, int k, int n) {
// ???
}

Do you plan to support dynamic matrix types, like arrays?

typedef float mmxk_t __attribute__((matrix_type(m, k)));
typedef float mkxn_t __attribute__((matrix_type(k, n)));
fhahn added a comment.Thu, Nov 28, 8:39 AM

Thanks for the example. I think most of the time the dimensions of the matrix are unknown at compile time. How do we write the code below with a static-dimension matrix type?

Assuming the dimensions are known at the call sites, you could implement it as a templated function, with the dimensions being template parameters.

// Let's assume 0 < m, k, n <= 4.
void matrix_multiply(float *a, float *b, int m, int k, int n) {
// ???
}

Do you plan to support dynamic matrix types, like arrays?

In our initial proposal, we focus exclusively on matrixes with dimensions known at compile time (for example, use cases similar to using Eigen with known dimensions). This is where most of the optimization potential comes from, and to start with we are focusing on generating the best code for that use case. We also think this is where LLVM can provide the most value initially.

However, we plan to implement fusion of matrix operations, which I think is the key building block to also generate efficient code with dynamic dimensions. We would love to collaborate with people interested in supporting dynamic dimensions!
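
As a rough sketch of the templated-function idea mentioned above (hypothetical; it assumes the matrix_type attribute accepts template-dependent dimensions and reuses the builtins from the earlier example, neither of which is confirmed by the posted clang patches):

// Hypothetical sketch only, not part of the proposal.
template <int M, int K, int N>
void matrix_multiply(const void *a, const void *b, void *r) {
  typedef float a_t __attribute__((matrix_type(M, K)));
  typedef float b_t __attribute__((matrix_type(K, N)));
  typedef float r_t __attribute__((matrix_type(M, N)));

  *(r_t *)r = __builtin_matrix_multiply(*(const a_t *)a, *(const b_t *)b);
}

// At a call site where the dimensions are known, e.g. m = 2, k = 3, n = 4:
//   matrix_multiply<2, 3, 4>(a, b, r);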

I have one suggestion and one more question.

-Gerolf

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
43

Could you add a picture/graphic showing a matrix, submatrix, and then their (lowered) memory layout and stride? This would make it easier to understand the computeAddress functions.

50

Do you need this function? Why are Row,Col Value here vs unsigned in the caller (computeColumnAddr())?

fhahn updated this revision to Diff 232111.Wed, Dec 4, 6:20 AM

Remove computeEltAddr, drop unnecessary Row argument and clarify computeColumnAddr with a bit of ASCII art.

fhahn marked an inline comment as done.Wed, Dec 4, 6:23 AM
fhahn added inline comments.
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
50

Not really. I folded the relevant bits into computeColumnAddr and removed the obsolete Row argument. Also turned the Col argument into a Value *.

fhahn updated this revision to Diff 232157.Wed, Dec 4, 8:56 AM

Update the address computation to take an element pointer directly, instead of creating one itself. This reduces the number of generated bitcasts.


LGTM.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
45

Nit: \p Stride elements. You can delete the rest since you repeat the Stride definition below.

anemet accepted this revision.Thu, Dec 5, 12:46 PM

LGTM too. You may want to wait a few days to give other people a chance to comment further.

This revision is now accepted and ready to land.Thu, Dec 5, 12:46 PM
fhahn added a comment.Thu, Dec 5, 1:12 PM

LGTM too. You may want to wait a few days to give other people a chance to comment further.

Sure! Thanks for all the comments so far. I plan to land this early next week, unless there are any major comments/concerns!

LuoYuanke added inline comments.Sat, Dec 7, 6:53 PM
llvm/docs/LangRef.rst
14479

Would it be more straightforward if the start address of column B were computed as A + %stride? Given a 3D array "tensor[9][8][7]", to load some rows of data from the array, the %stride could then be 8*7*2 instead of 8*7*2-7.

LuoYuanke added inline comments.Sat, Dec 7, 7:04 PM
llvm/docs/LangRef.rst
14479

What I mean is that the start address of column B is computed as A + <Rows> * %Stride.

LuoYuanke added inline comments.Sat, Dec 7, 8:49 PM
llvm/docs/LangRef.rst
14479

I mean that the start address of column B is computed as A + %Stride.

fhahn marked an inline comment as done.Sun, Dec 8, 3:57 AM
fhahn added inline comments.
llvm/docs/LangRef.rst
14479

I mean that the start address of column B is computed as A + %Stride.

That would simplify things a bit and be more in line with what people expect a stride to be. The only slight advantage of having the stride separately might be that it is slightly easier to ensure Stride >= Rows (assuming only positive strides). But we can check for that separately and require Stride >= Rows. What do you think?
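
To make the two readings concrete (a hypothetical worked example, not taken from the review), consider a 3-row submatrix whose columns start 5 elements apart in the backing buffer:

// Hypothetical illustration of the two stride readings discussed above.
unsigned columnStart(unsigned J, bool StrideIsFullDistance) {
  const unsigned Rows = 3;
  if (StrideIsFullDistance) {
    // Reading suggested above: %Stride is the distance between the starts
    // of consecutive columns, with the requirement Stride >= Rows.
    const unsigned Stride = 5;
    return J * Stride;
  }
  // Alternative reading: %Stride is the number of elements between the end
  // of one column and the start of the next.
  const unsigned Stride = 2;
  return J * (Rows + Stride);
}
// Both yield the same element offsets (column 1 starts at 5, column 2 at
// 10, and so on); only the meaning of the %Stride argument differs.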