
everton.constantino (Everton Constantino)
User

Projects

User does not belong to any projects.

User Details

User Since
Feb 22 2021, 8:18 AM (11 w, 4 d)

Recent Activity

Mar 31 2021

everton.constantino added a comment to D99433: [Matrix] Including __builtin_matrix_multiply_add for the matrix type extension..

@fhahn Ok, I see what you mean now. This sounds like a doable path and might be able to cover architectures with specialized matrix multiplication instructions as well.

Just to see if I understand correctly: I can add a matrix_add intrinsic, do a traversal looking for matrix_multiply, and fuse both, changing LowerMatrixMultiplyFused to support pre-loading the accumulator. Is that correct?

Yes, that sounds like a good path forward! I think at the moment adding a matrix_mul_add intrinsic may be a bit premature, as we can just match & lower directly in place, as we already do in LowerMatrixMultiplyFused. Once we add more and more such transforms, it may really help to have additional intrinsics (or we could just create our own dummy declarations which are only used during the matrix lowering, to avoid adding too many intrinsics). But for now I think we can move along faster without adding a new intrinsic.

Mar 31 2021, 1:09 PM · Restricted Project, Restricted Project
everton.constantino added a comment to D99433: [Matrix] Including __builtin_matrix_multiply_add for the matrix type extension..

@fhahn Ok, I see what you mean now. This sounds like a doable path and might be able to cover architectures with specialized matrix multiplication instructions as well.

Mar 31 2021, 10:25 AM · Restricted Project, Restricted Project
everton.constantino added a comment to D99433: [Matrix] Including __builtin_matrix_multiply_add for the matrix type extension..

@fhahn When I mentioned the splats I was talking about the IR, not the final code. On the Godbolt links you sent, it's the same as what I see. However, take a look at the IR your example generates:

%vec.cast = bitcast [4 x float]* %A to <2 x float>*
%col.load = load <2 x float>, <2 x float>* %vec.cast, align 4
%vec.gep = getelementptr [4 x float], [4 x float]* %A, i64 0, i64 2
%vec.cast2 = bitcast float* %vec.gep to <2 x float>*
%col.load3 = load <2 x float>, <2 x float>* %vec.cast2, align 4
%vec.cast4 = bitcast [4 x float]* %B to <2 x float>*
%col.load5 = load <2 x float>, <2 x float>* %vec.cast4, align 4
%vec.gep6 = getelementptr [4 x float], [4 x float]* %B, i64 0, i64 2
%vec.cast7 = bitcast float* %vec.gep6 to <2 x float>*
%col.load8 = load <2 x float>, <2 x float>* %vec.cast7, align 4
%splat.splat = shufflevector <2 x float> %col.load5, <2 x float> poison, <2 x i32> zeroinitializer
%0 = fmul <2 x float> %col.load, %splat.splat
%splat.splat11 = shufflevector <2 x float> %col.load5, <2 x float> undef, <2 x i32> <i32 1, i32 1>
%1 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat11, <2 x float> %0)
%splat.splat14 = shufflevector <2 x float> %col.load8, <2 x float> poison, <2 x i32> zeroinitializer
%2 = fmul <2 x float> %col.load, %splat.splat14
%splat.splat17 = shufflevector <2 x float> %col.load8, <2 x float> undef, <2 x i32> <i32 1, i32 1>
%3 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat17, <2 x float> %2)
%vec.cast18 = bitcast [4 x float]* %C to <2 x float>*
%col.load19 = load <2 x float>, <2 x float>* %vec.cast18, align 4
%vec.gep20 = getelementptr [4 x float], [4 x float]* %C, i64 0, i64 2
%vec.cast21 = bitcast float* %vec.gep20 to <2 x float>*
%col.load22 = load <2 x float>, <2 x float>* %vec.cast21, align 4
%4 = fadd <2 x float> %1, %col.load19
%5 = fadd <2 x float> %3, %col.load22
store <2 x float> %4, <2 x float>* %vec.cast18, align 4
store <2 x float> %5, <2 x float>* %vec.cast21, align 4

I don't see a simple, reliable pattern to match the operands of %4 with %0, for example; this is what I meant by the splat in the middle. The pragma approach assumes that we're always working with architectures where the better approach is to fuse the fmuls and fadds. The problem is that here you have to decide between preloading the accumulator or not. On IBM Power10's MMA, for example, this would be pretty far from optimal, because you have specific instructions to load accumulators.
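For reference, the IR above is a 2x2 single-precision multiply-accumulate, C += A * B, on column-major data. A plain-C sketch of the same computation (names and layout are my own, not from the patch):

```c
#include <stddef.h>

/* 2x2 column-major multiply-accumulate, C += A * B.
   The loop structure mirrors the IR above: for each column j of B,
   element B[k][j] is "splatted" and multiplied against column k of A,
   matching the shufflevector + fmuladd pairs. */
static void madd2x2(const float A[4], const float B[4], float C[4]) {
  for (size_t j = 0; j < 2; ++j)     /* column j of B and C */
    for (size_t k = 0; k < 2; ++k)   /* splatted element B[k][j] */
      for (size_t i = 0; i < 2; ++i) /* element i of column k of A */
        C[j * 2 + i] += A[k * 2 + i] * B[j * 2 + k];
}
```

The final fadds into %col.load19/%col.load22 correspond to the `+=` accumulating into C, which is loaded after the multiply is already done, rather than being preloaded as an accumulator.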

Mar 31 2021, 6:58 AM · Restricted Project, Restricted Project

Mar 26 2021

everton.constantino added a comment to D99433: [Matrix] Including __builtin_matrix_multiply_add for the matrix type extension..

@fhahn That was my first idea; however, it's not as simple as it looks. I tried moving the adds, but the splats make it considerably harder to find a pattern that catches this and fuses the multiplies, especially with bigger matrices. My real wish was to add a new IR instruction to handle matrices, because the MADD is just a simple example of other, more interesting optimizations that can be done, like using matrix associative properties to reduce the number of calculations. However, I found that path too complicated and opted for a compromise for the moment. I want to start writing some GEMM micro-kernels with this extension, and this builtin was the shortest path.
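To illustrate the kind of micro-kernel meant here (a plain-C stand-in of my own, not code from the patch): the tile of C is loaded into registers once, i.e. the accumulator is preloaded, updated across the whole K loop, and stored once at the end, which is exactly the decision point discussed above.

```c
#include <stddef.h>

/* Toy GEMM micro-kernel updating a 2x2 tile of C (column-major).
   A is 2xK column-major, B is Kx2 column-major.
   The accumulator tile is preloaded once and stored once, instead of
   being re-loaded and re-added after every multiply. */
static void gemm_tile_2x2(size_t K, const float *A, const float *B, float C[4]) {
  float c0 = C[0], c1 = C[1], c2 = C[2], c3 = C[3]; /* preload accumulator */
  for (size_t k = 0; k < K; ++k) {
    const float a0 = A[k * 2 + 0], a1 = A[k * 2 + 1]; /* column k of A */
    const float b0 = B[0 * K + k], b1 = B[1 * K + k]; /* row k of B */
    c0 += a0 * b0; c1 += a1 * b0; /* first column of C */
    c2 += a0 * b1; c3 += a1 * b1; /* second column of C */
  }
  C[0] = c0; C[1] = c1; C[2] = c2; C[3] = c3; /* store accumulator once */
}
```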

Mar 26 2021, 12:46 PM · Restricted Project, Restricted Project
everton.constantino added a comment to D99433: [Matrix] Including __builtin_matrix_multiply_add for the matrix type extension..

@jdoerfert Which tests do you have in mind? I added one for Sema and one for CodeGen.

Mar 26 2021, 12:08 PM · Restricted Project, Restricted Project
everton.constantino requested review of D99433: [Matrix] Including __builtin_matrix_multiply_add for the matrix type extension..
Mar 26 2021, 12:00 PM · Restricted Project, Restricted Project

Mar 10 2021

everton.constantino added inline comments to D97857: [Matrix] Add support for matrix-by-scalar division..
Mar 10 2021, 11:28 AM · Restricted Project, Restricted Project

Mar 3 2021

everton.constantino added inline comments to D97857: [Matrix] Add support for matrix-by-scalar division..
Mar 3 2021, 8:58 AM · Restricted Project, Restricted Project