# User Details

- User Since: Feb 22 2021, 8:18 AM (11 w, 4 d)

# Mar 31 2021

@fhahn OK, I see what you mean now. This sounds like a doable path and might also cover architectures with specialized matrix multiplication instructions.

@fhahn When I mentioned the splats I was talking about the IR, not the final code. On the Godbolt links you sent, I see the same thing. However, take a look at the IR your example generates:

    %vec.cast = bitcast [4 x float]* %A to <2 x float>*
    %col.load = load <2 x float>, <2 x float>* %vec.cast, align 4
    %vec.gep = getelementptr [4 x float], [4 x float]* %A, i64 0, i64 2
    %vec.cast2 = bitcast float* %vec.gep to <2 x float>*
    %col.load3 = load <2 x float>, <2 x float>* %vec.cast2, align 4
    %vec.cast4 = bitcast [4 x float]* %B to <2 x float>*
    %col.load5 = load <2 x float>, <2 x float>* %vec.cast4, align 4
    %vec.gep6 = getelementptr [4 x float], [4 x float]* %B, i64 0, i64 2
    %vec.cast7 = bitcast float* %vec.gep6 to <2 x float>*
    %col.load8 = load <2 x float>, <2 x float>* %vec.cast7, align 4
    %splat.splat = shufflevector <2 x float> %col.load5, <2 x float> poison, <2 x i32> zeroinitializer
    %0 = fmul <2 x float> %col.load, %splat.splat
    %splat.splat11 = shufflevector <2 x float> %col.load5, <2 x float> undef, <2 x i32> <i32 1, i32 1>
    %1 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat11, <2 x float> %0)
    %splat.splat14 = shufflevector <2 x float> %col.load8, <2 x float> poison, <2 x i32> zeroinitializer
    %2 = fmul <2 x float> %col.load, %splat.splat14
    %splat.splat17 = shufflevector <2 x float> %col.load8, <2 x float> undef, <2 x i32> <i32 1, i32 1>
    %3 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat17, <2 x float> %2)
    %vec.cast18 = bitcast [4 x float]* %C to <2 x float>*
    %col.load19 = load <2 x float>, <2 x float>* %vec.cast18, align 4
    %vec.gep20 = getelementptr [4 x float], [4 x float]* %C, i64 0, i64 2
    %vec.cast21 = bitcast float* %vec.gep20 to <2 x float>*
    %col.load22 = load <2 x float>, <2 x float>* %vec.cast21, align 4
    %4 = fadd <2 x float> %1, %col.load19
    %5 = fadd <2 x float> %3, %col.load22
    store <2 x float> %4, <2 x float>* %vec.cast18, align 4
    store <2 x float> %5, <2 x float>* %vec.cast21, align 4

I don't see a simple, reliable pattern to match the operands of %4 with %0, for example, and this is what I meant by the splat in the middle. The pragma approach assumes that we're always working with architectures where fusing the fmuls and fadds is the better approach. What you actually have to decide is whether or not to preload the accumulator. On IBM Power10's MMA this would be pretty far from optimal, for example, because it has specific instructions to load accumulators.
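To make the two choices concrete, here is a rough scalar sketch of the 2x2 column-major case from the IR above. It is only an illustration: `Mat2`, `multiplyThenAdd` and `preloadAccumulator` are hypothetical names, and `std::fma` stands in for `llvm.fmuladd`, which unlike `std::fma` is allowed but not required to fuse.

```cpp
#include <array>
#include <cmath>

// 2x2 column-major matrix: element (row r, column c) lives at index c * 2 + r.
// Hypothetical helper type, used only for this illustration.
using Mat2 = std::array<float, 4>;

// Strategy 1 (mirrors the IR above): build A*B starting from a plain
// multiply, then add the old contents of C at the very end.
Mat2 multiplyThenAdd(const Mat2 &A, const Mat2 &B, const Mat2 &C) {
  Mat2 Out{};
  for (int c = 0; c < 2; ++c) {
    for (int r = 0; r < 2; ++r) {
      float Prod = A[0 * 2 + r] * B[c * 2 + 0];          // like %0 / %2 (fmul)
      Prod = std::fma(A[1 * 2 + r], B[c * 2 + 1], Prod); // like %1 / %3 (fmuladd)
      Out[c * 2 + r] = Prod + C[c * 2 + r];              // like %4 / %5 (fadd)
    }
  }
  return Out;
}

// Strategy 2: load C into the accumulator first, then issue only
// multiply-adds into it. No trailing add is needed.
Mat2 preloadAccumulator(const Mat2 &A, const Mat2 &B, const Mat2 &C) {
  Mat2 Acc = C; // accumulator preloaded with C
  for (int c = 0; c < 2; ++c)
    for (int r = 0; r < 2; ++r)
      for (int k = 0; k < 2; ++k)
        Acc[c * 2 + r] = std::fma(A[k * 2 + r], B[c * 2 + k], Acc[c * 2 + r]);
  return Acc;
}
```

Both functions compute C + A*B. They differ in where the load of C sits relative to the multiplies, which is the part a target with dedicated accumulator loads cares about. With general floating-point inputs the two orderings can round differently.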

# Mar 26 2021

@fhahn That was my first idea; however, it's not as simple as it looks. I tried moving the adds, but the splats make it considerably harder to find a pattern that catches this and fuses the multiplies, especially with bigger matrices. My real wish was to add a new IR instruction to handle matrices, because the MADD is just a simple example of other, more interesting optimizations that can be done, like using matrix associativity to reduce the number of calculations. I found that path too complicated, however, and opted for a compromise for now. I want to start writing some GEMM micro-kernels with this extension, and this builtin was the shortest path.
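As a small illustration of the associativity point, the sketch below counts scalar multiplications for the two ways of parenthesizing a three-matrix product, assuming the schoolbook algorithm. The function names are hypothetical and this is not taken from the patch.

```cpp
#include <cstdint>

// Scalar multiplications for a (Rows x Inner) * (Inner x Cols) product,
// assuming the schoolbook algorithm.
constexpr std::uint64_t mulCost(std::uint64_t Rows, std::uint64_t Inner,
                                std::uint64_t Cols) {
  return Rows * Inner * Cols;
}

// Cost of (A*B)*C, where A is MxK, B is KxN and C is NxP.
constexpr std::uint64_t leftFirst(std::uint64_t M, std::uint64_t K,
                                  std::uint64_t N, std::uint64_t P) {
  return mulCost(M, K, N) + mulCost(M, N, P);
}

// Cost of A*(B*C) for the same shapes.
constexpr std::uint64_t rightFirst(std::uint64_t M, std::uint64_t K,
                                   std::uint64_t N, std::uint64_t P) {
  return mulCost(K, N, P) + mulCost(M, K, P);
}

// Two 4x4 matrices applied to a 4x1 vector: grouping the vector with the
// nearest matrix first needs fewer multiplies.
static_assert(leftFirst(4, 4, 4, 1) == 80, "(A*B)*v");
static_assert(rightFirst(4, 4, 4, 1) == 32, "A*(B*v)");
```

Both groupings are equal in exact arithmetic, but floating-point results can differ, so a compiler would presumably need fast-math-style permission to reassociate.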

@jdoerfert Which tests do you have in mind? I added one for SEMA and one for CodeGen.