This patch adds a new builtin to support matrix multiply-add. Currently, writing C = A*B + C incurs the overhead of additional fadds after the multiply. With this builtin, the accumulators are loaded with the C matrix during the multiplication, considerably reducing the number of operations.
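For illustration, a use site could look roughly like the sketch below; the builtin spelling is a placeholder, not necessarily the exact name introduced by the patch:

```c++
// Hypothetical usage sketch; '__builtin_matrix_multiply_add' is a placeholder name.
typedef float m4x4_t __attribute__((matrix_type(4, 4)));

void gemm_tile(m4x4_t &C, m4x4_t A, m4x4_t B) {
  // Today this lowers to a matrix multiply followed by separate vector fadds:
  C = A * B + C;
  // With the builtin, C is loaded into the accumulators as the multiply runs:
  C = __builtin_matrix_multiply_add(A, B, C);
}
```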
Diff Detail
Repository: rG LLVM Github Monorepo
Event Timeline
@jdoerfert Which tests do you have in mind? I added one for Sema and one for CodeGen.
Thanks for putting up the patch!
Do you think it would be possible to get the desired behavior without a new builtin? We should be able to combine the add with the initial multiply for each vector, as long as we have the right fast-math flags; IIUC, reassociate should be enough. So perhaps it would be possible to perform this optimization in LowerMatrixIntrinsics directly. The user should then be able to enable the right fast-math flags locally using pragma clang fp, like below. Clang first needs to be updated to handle those pragmas properly for the matrix types.
```c++
#pragma clang fp reassociate(on)
C = A*B + C;
```
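In a fuller, compilable form (the type and function names here are just illustrative), that could look like:

```c++
// Sketch of the pragma-based approach; it relies on Clang handling
// '#pragma clang fp' for matrix types, which is the Clang-side update
// mentioned above.
typedef float m2x2_t __attribute__((matrix_type(2, 2)));

void madd(m2x2_t &C, m2x2_t A, m2x2_t B) {
#pragma clang fp reassociate(on)
  // With reassociation allowed, the lowering may fold the add into the
  // multiply's accumulation instead of emitting separate fadds.
  C = A * B + C;
}
```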
@fhahn That was my first idea, however it's not as simple as it looks. I tried moving the adds, but the splats make it considerably harder to find a pattern that catches this and fuses the multiplies, especially with bigger matrices. My real wish was to add a new IR instruction to handle matrices, because MADD is just a simple example of other, more interesting optimizations that can be done, like using matrix associativity to reduce the number of calculations. I found that path too complicated, however, so I opted for a compromise for the moment. I want to start writing some GEMM micro-kernels with this extension, and this builtin was the shortest path.
Tests for everything you placed in llvm/. Your tests are all in clang/.
llvm/include/llvm/IR/MatrixBuilder.h:152
This code is not tested, as far as I can tell. Or is it?
Could you elaborate on the splats that make this tricky? Before the matrix lowering, there should be no splats: https://godbolt.org/z/r941xsc6b. I was thinking of detecting the multiply/add before we do the actual lowering, e.g. like it is already done for {load, load} -> multiply -> store chains in LowerMatrixMultiplyFused: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp#L1346
> Before the matrix lowering, there should be no splats: https://godbolt.org/z/r941xsc6b
It might still be convenient to have a separate multiply-add intrinsic for matrixes, because then we could just replace fadd( @matrix.multiply() , X) before lowering. But I am not sure how scalable this will be (I don't think we want too many intrinsics), so perhaps we could keep track of bundles of instructions to lower together in general. But I don't think we need this for the initial optimization to start with.
FYI I filed https://bugs.llvm.org/show_bug.cgi?id=49738 and https://bugs.llvm.org/show_bug.cgi?id=49739 for improving fast-math handling in the lowering pass and support for pragma fp, in case you are interested.
@everton.constantino out of curiosity, what architecture are you focused on and what matrix sizes? I have a few performance improvements lined up for code-gen. It might be good to sync up to make sure there's no duplicated work.
@fhahn When I mentioned the splats I was talking about the IR, not the final code. In the Godbolt link you sent, it's the same thing I see. However, take a look at the IR your example generates:
```llvm
%vec.cast = bitcast [4 x float]* %A to <2 x float>*
%col.load = load <2 x float>, <2 x float>* %vec.cast, align 4
%vec.gep = getelementptr [4 x float], [4 x float]* %A, i64 0, i64 2
%vec.cast2 = bitcast float* %vec.gep to <2 x float>*
%col.load3 = load <2 x float>, <2 x float>* %vec.cast2, align 4
%vec.cast4 = bitcast [4 x float]* %B to <2 x float>*
%col.load5 = load <2 x float>, <2 x float>* %vec.cast4, align 4
%vec.gep6 = getelementptr [4 x float], [4 x float]* %B, i64 0, i64 2
%vec.cast7 = bitcast float* %vec.gep6 to <2 x float>*
%col.load8 = load <2 x float>, <2 x float>* %vec.cast7, align 4
%splat.splat = shufflevector <2 x float> %col.load5, <2 x float> poison, <2 x i32> zeroinitializer
%0 = fmul <2 x float> %col.load, %splat.splat
%splat.splat11 = shufflevector <2 x float> %col.load5, <2 x float> undef, <2 x i32> <i32 1, i32 1>
%1 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat11, <2 x float> %0)
%splat.splat14 = shufflevector <2 x float> %col.load8, <2 x float> poison, <2 x i32> zeroinitializer
%2 = fmul <2 x float> %col.load, %splat.splat14
%splat.splat17 = shufflevector <2 x float> %col.load8, <2 x float> undef, <2 x i32> <i32 1, i32 1>
%3 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat17, <2 x float> %2)
%vec.cast18 = bitcast [4 x float]* %C to <2 x float>*
%col.load19 = load <2 x float>, <2 x float>* %vec.cast18, align 4
%vec.gep20 = getelementptr [4 x float], [4 x float]* %C, i64 0, i64 2
%vec.cast21 = bitcast float* %vec.gep20 to <2 x float>*
%col.load22 = load <2 x float>, <2 x float>* %vec.cast21, align 4
%4 = fadd <2 x float> %1, %col.load19
%5 = fadd <2 x float> %3, %col.load22
store <2 x float> %4, <2 x float>* %vec.cast18, align 4
store <2 x float> %5, <2 x float>* %vec.cast21, align 4
```
I don't see a simple, reliable pattern to match the operands of %4 with %0, for example; this is what I meant by the splat in the middle. The pragma approach also assumes we're always targeting architectures where fusing the fmuls and fadds is the better approach. The real decision here is whether or not to pre-load the accumulator. On IBM Power10's MMA, for example, this would be pretty far from optimal, because you have specific instructions to load the accumulators.
Sorry for not being clearer. I meant the IR *before* LowerMatrixIntrinsics is run (which should be on the right-hand side of the Godbolt view). I'm also posting it below. Unless I am missing something, we should be able to easily match fadd(llvm.matrix.multiply(A, B), C) before the actual lowering of llvm.matrix.multiply. I think we do something similar already for combining load -> multiply -> store chains: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp#L703. Basically, try to fuse all multiplies before the 'normal' lowering. Would it be possible to deal with fadd(llvm.matrix.multiply(A, B), C) similarly?
```llvm
*** IR Dump Before Lower the matrix intrinsics (lower-matrix-intrinsics) ***
; Function Attrs: nofree nounwind uwtable willreturn mustprogress
define dso_local void @_Z3fooRu11matrix_typeILm2ELm2EfES0_S0_([4 x float]* nocapture nonnull readonly align 4 dereferenceable(16) %0, [4 x float]* nocapture nonnull align 4 dereferenceable(16) %1, [4 x float]* nocapture nonnull readonly align 4 dereferenceable(16) %2) local_unnamed_addr #0 {
  %4 = bitcast [4 x float]* %0 to <4 x float>*
  %5 = load <4 x float>, <4 x float>* %4, align 4, !tbaa !6
  %6 = bitcast [4 x float]* %2 to <4 x float>*
  %7 = load <4 x float>, <4 x float>* %6, align 4, !tbaa !6
  %8 = tail call <4 x float> @llvm.matrix.multiply.v4f32.v4f32.v4f32(<4 x float> %5, <4 x float> %7, i32 2, i32 2, i32 2)
  %9 = bitcast [4 x float]* %1 to <4 x float>*
  %10 = load <4 x float>, <4 x float>* %9, align 4, !tbaa !6
  %11 = fadd <4 x float> %8, %10
  store <4 x float> %11, <4 x float>* %9, align 4, !tbaa !6
  ret void
}
```
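Just to make the idea concrete, the pre-lowering match could look something like the sketch below (purely illustrative, not code from the pass; the function name is made up):

```c++
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/PatternMatch.h"
using namespace llvm;
using namespace llvm::PatternMatch;

// Recognize %R = fadd(llvm.matrix.multiply(A, B), C) before the multiply is
// lowered, so the lowering can seed its accumulators with C. The commuted
// fadd(C, multiply) form would need a symmetric check.
static bool isFusableMultiplyAdd(Instruction &I) {
  Value *MulOp = nullptr, *Acc = nullptr;
  if (!match(&I, m_FAdd(m_Value(MulOp), m_Value(Acc))))
    return false;
  auto *MatMul = dyn_cast<IntrinsicInst>(MulOp);
  if (!MatMul || MatMul->getIntrinsicID() != Intrinsic::matrix_multiply ||
      !MatMul->hasOneUse())
    return false;
  // Folding the add into the accumulation reorders the fp additions, so the
  // right fast-math flags are needed on the fadd.
  return I.hasAllowReassoc();
}
```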
@fhahn Ok, I see what you mean now. This sounds like a doable path and might cover architectures with specialized matrix multiplication instructions as well.
Just to check that I understand correctly: I can add a matrix_add intrinsic, do a traversal looking for matrix_multiply, and fuse both, changing LowerMatrixMultiplyFused to support pre-loading the accumulator. Is that correct?
Yes, that sounds like a good path forward! I think at the moment adding a matrix_mul_add intrinsic may be a bit premature, as we can just match & lower directly in place, as we already do in LowerMatrixMultiplyFused. Once we add more and more such transforms, it may really help to have additional intrinsics (or we could just create our own dummy declarations which are only used during the matrix lowering, to avoid adding too many intrinsics). But for now I think we can move along faster without adding a new intrinsic.
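Concretely, "pre-loading the accumulator" on the lowering side might look roughly like the sketch below (purely illustrative; the column handling is simplified and none of these names exist in LowerMatrixIntrinsics today):

```c++
#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
#include <vector>
using namespace llvm;

// Emit result columns for Result = A * B + Addend, seeding each accumulator
// with the corresponding column of the addend instead of zero. All column
// vectors are assumed to already be loaded as <N x float> values.
static std::vector<Value *>
emitMultiplyWithPreloadedAcc(IRBuilder<> &Builder, ArrayRef<Value *> ACols,
                             ArrayRef<Value *> BCols,
                             ArrayRef<Value *> AddendCols, unsigned InnerDim) {
  std::vector<Value *> ResultCols;
  for (unsigned J = 0; J < BCols.size(); ++J) {
    // Pre-load the accumulator from the addend rather than starting at zero.
    Value *Accum = AddendCols[J];
    for (unsigned K = 0; K < InnerDim; ++K) {
      // Splat element (K, J) of B and multiply-accumulate with column K of A.
      Value *Elt = Builder.CreateExtractElement(BCols[J], K);
      unsigned NumRows =
          cast<FixedVectorType>(ACols[K]->getType())->getNumElements();
      Value *Splat = Builder.CreateVectorSplat(NumRows, Elt);
      Accum = Builder.CreateFAdd(Accum, Builder.CreateFMul(ACols[K], Splat));
    }
    ResultCols.push_back(Accum);
  }
  return ResultCols;
}
```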
Sounds good to me, thanks! Marking as changes requested until then, to remove it from the review queue.
Just FYI, #pragma clang fp support for matrix operations has been added in be2277fbf233 by @effective-light in the meantime.
clang-tidy: warning: 'auto *MTypeA' can be declared as 'const auto *MTypeA' [llvm-qualified-auto]