Add a fused multiply-add ISel pattern with one of the addends as a constant splat.
Specifically, the patch aims at a transformation similar to the one below:
add(mul, splat_vector(C)) -> MOV C; MAD
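As a hedged illustration (the function name and constant below are hypothetical, not taken from the patch's test files), the IR shape this pattern targets looks roughly like:

```llvm
; Hypothetical reduced example: a multiply whose result is added to a
; splatted constant, i.e. add(mul(a, b), splat_vector(200)).
define <vscale x 8 x i16> @mul_add_splat(<vscale x 8 x i16> %a,
                                         <vscale x 8 x i16> %b) {
  %ins   = insertelement <vscale x 8 x i16> poison, i16 200, i32 0
  %splat = shufflevector <vscale x 8 x i16> %ins, <vscale x 8 x i16> poison,
                         <vscale x 8 x i32> zeroinitializer
  %mul   = mul <vscale x 8 x i16> %a, %b
  %res   = add <vscale x 8 x i16> %mul, %splat
  ret <vscale x 8 x i16> %res
}
```

With the pattern, the splatted constant is materialised with a MOV and the multiply and add fold into a single MAD, instead of a MUL followed by an ADD.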
Differential D142656
[SVE][codegen] Add pattern for SVE multiply-add accumulate
Authored by sushgokh on Jan 26 2023, 11:28 AM
Event Timeline

Comment Actions
I will look at this soon. In the meantime, can you upload your patch with context? See this documentation for how to do that (and what it means):

Comment Actions
Adding to Eli's comment, see this doc for more details, in the section "LGTM - How a Patch Is Accepted":

Comment Actions
First, I think the description needs some clarification:

add(mul, splat_vector(C)) -> MOV C; MAD

where C is a constant/immediate, and the result is some MOV instruction to materialise the constant C, followed by the MAD. The FIXME indicates to me that the current approach is not entirely the right one, but it is a bit difficult to follow all the details there.

Comment Actions
@SjoerdMeijer Yes, I will upload the patch with full context.

Regarding the motivation of the patch: in cases where one of the addends is a splat, the semantics of the 'mla' instruction mandate extra register moves. This drawback is overcome by giving the 'mad' instruction higher priority than 'mla'. However, this does not guarantee complete freedom from extra moves, as can be seen in one of the test cases (i.e. with mad, it generated one extra mov). But overall, mad generated a lower instruction count.

I added TOFIX to the test cases just to keep track of where things can be improved, either over the current patch or over current LLVM trunk results. I will remove them, as they are misleading.

Comment Actions
@SjoerdMeijer Regarding doing the same thing at the MachineCombiner stage: even with current trunk, MLA/MLS for SVE is generated at ISel level. The line you pointed to was for scalar MADD. Also, it is relatively easier to implement the pattern matching at ISel than in the machine combiner, for the reason that
Comment Actions
OK, I misunderstood a few things, but see my comment inline.
Comment Actions
And one more request, which I forgot to add earlier: we should be doing one thing at a time. This patch is about 'mad' instruction selection, so please remove anything related to 3. in this patch, if possible, and prepare a follow-up patch for that.
Comment Actions
Agreed. I will go with constant splats as the addend first, and will update the description of the patch here.

Comment Actions
Hi @SjoerdMeijer, in @sushgokh's defence there is precedent for some of the changes in this patch: by changing from SVE_4_Op_Pat to SVE_4_Mad_Op_Pat we are able to set the AddedComplexity field on the pattern, which is not dissimilar to SVE_3_Op_Pat_SelZero or SVE_3_Op_Pat_Shift_Imm_SelZero, i.e.

let AddedComplexity = 1 in {
  class SVE_3_Op_Pat_SelZero<ValueType vtd, SDPatternOperator op, ValueType vt1,
                             ValueType vt2, ValueType vt3, Instruction inst>
  : Pat<(vtd (vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), vt3:$Op3))),
        (inst $Op1, $Op2, $Op3)>;

  class SVE_3_Op_Pat_Shift_Imm_SelZero<ValueType vtd, SDPatternOperator op,
                                       ValueType vt1, ValueType vt2, Operand vt3,
                                       Instruction inst>
  : Pat<(vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), (i32 (vt3:$Op3)))),
        (inst $Op1, $Op2, vt3:$Op3)>;
}

What I don't fully understand is why the complexity has to be so high, since it suggests there are multiple competing patterns, and it might be useful to understand what they are. I admit that AArch64mul_p_firstOpndWithSingleUse looks a bit unusual, and I'm not sure that we should be checking for explicit opcodes such as TokenFactor, etc.

Comment Actions
I am not against it, just trying to understand it. :)

Comment Actions
@david-arm We don't want mad to be generated in all cases. As far as I remember, without the 'AArch64mul_p_firstOpndWithSingleUse' predicate, muladd_i16_test3 generates mad with extra register moves, and that becomes inefficient.
Comment Actions
And for completeness, summarising previous comments, these are my other two concerns.

We need an alternative for:

for (SDNode *use : Op1->uses())

This is not doing what we want. So my question here is why we need it; why won't a hasOneUse check suffice? What is the motivating test case? If we know that, perhaps we can have a think about this.

Second, and this could be related to the above, I remember from some tests that we are missing a few opportunities. Why is that? What would be needed to recognise these cases? The reason I am asking is to see whether this is the right place to do this.

Comment Actions
To understand your first concern, can you check the outcome with and without the 'AArch64mul_p_firstOpndWithSingleUse' predicate for, as far as I remember, muladd_i16_test3? It helps a bit by not generating mad there.
Comment Actions
And in the meantime, can you upload a new revision with full context please, and rebase it on top of your change that pre-commits the tests? Then we can see and discuss the changes this is making, and check if we see any improvements.

Comment Actions
I experimented with replacing AArch64mul_p_firstOpndWithSingleUse -> AArch64mul_p_oneuse:

diff --git a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
index 96126b35c6a1..1cda0d41ac78 100644
--- a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
@@ -408,7 +408,7 @@
 def AArch64mla_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2, node:$op3),
                  (add node:$op1, (vselect node:$pred, (AArch64mul_p_oneuse (SVEAllActive), node:$op2, node:$op3), (SVEDup0)))]>;
 def AArch64mad_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2, node:$op3),
                  [(int_aarch64_sve_mad node:$pred, node:$op1, node:$op2, node:$op3),
-                 (add node:$op3, (AArch64mul_p_firstOpndWithSingleUse node:$pred, node:$op1, node:$op2))]>;
+                 (add node:$op3, (AArch64mul_p_oneuse node:$pred, node:$op1, node:$op2))]>;
 def AArch64mls_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2, node:$op3),
                  [(int_aarch64_sve_mls node:$pred, node:$op1, node:$op2, node:$op3),
                  (sub node:$op1, (AArch64mul_p_oneuse node:$pred, node:$op2, node:$op3)),

diff --git a/llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll b/llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll
index b7ee8bfb25c5..51b8f1f129a4 100644
--- a/llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll
+++ b/llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll
@@ -45,10 +45,11 @@ define <vscale x 8 x i16> @muladd_i16_test1(<vscale x 8 x i16> %a, <vscale x 8 x
 define <vscale x 8 x i16> @muladd_i16_test2(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b)
 ; CHECK-LABEL: muladd_i16_test2:
 ; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #200
+; CHECK-NEXT:    mov z2.d, z0.d
 ; CHECK-NEXT:    ptrue p0.h
-; CHECK-NEXT:    movprfx z2, z0
-; CHECK-NEXT:    mul z2.h, p0/m, z2.h, z1.h
-; CHECK-NEXT:    add z2.h, z2.h, #200 // =0xc8
+; CHECK-NEXT:    mov z3.h, w8
+; CHECK-NEXT:    mad z2.h, p0/m, z1.h, z3.h
 ; CHECK-NEXT:    mul z0.h, p0/m, z0.h, z2.h
 ; CHECK-NEXT:    sub z0.h, z0.h, z1.h
 ; CHECK-NEXT:    ret
@@ -64,10 +65,12 @@ define <vscale x 8 x i16> @muladd_i16_test2(<vscale x 8 x i16> %a, <vscale x 8 x
 define <vscale x 8 x i16> @muladd_i16_test3(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b)
 ; CHECK-LABEL: muladd_i16_test3:
 ; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #200
+; CHECK-NEXT:    mov z2.d, z0.d
 ; CHECK-NEXT:    ptrue p0.h
-; CHECK-NEXT:    mul z1.h, p0/m, z1.h, z0.h
-; CHECK-NEXT:    add z1.h, z1.h, #200 // =0xc8
-; CHECK-NEXT:    mul z0.h, p0/m, z0.h, z1.h
+; CHECK-NEXT:    mov z3.h, w8
+; CHECK-NEXT:    mad z2.h, p0/m, z1.h, z3.h
+; CHECK-NEXT:    mul z0.h, p0/m, z0.h, z2.h
 ; CHECK-NEXT:    ret
 {
 %1 = mul <vscale x 8 x i16> %a, %b

So it looks like we are generating more mads, which is interesting, but I haven't checked whether that is correct. From a quick look, this looks sensible?
Comment Actions
There is no precedent for:
And the complexity check and interaction isn't a good sign. I don't think we are yet converging on a better implementation, and thus I believe this is not the right place for this. I still think this is a job for the machine combiner; that is where all the precedent for doing these kinds of things is. Previously I linked to some scalar combines, but here are all the vector opcodes:

For test case muladd_i8_negativeAddend, this is the relevant MIR snippet:

%3:zpr = MUL_ZPZZ_UNDEF_B killed %2:ppr_3b, %0:zpr, %1:zpr
%4:zpr = ADD_ZI_B %3:zpr(tied-def 0), 241, 0

This looks like a fairly straightforward combine to me, and again, all the precedent for doing it is there in the machine combiner. However, the benefit isn't clear to me for this example, because the assembly for this example before this patch is:

mul z0.b, p0/m, z0.b, z1.b
add z0.b, z0.b, #241 // =0xf1

and when we want to generate the mad, we need to splat the immediate, and we end up with something like this:

mov
mad

So there's no real gain here, if I am not mistaken? That makes me wonder for which cases we do expect and actually see a gain. Do we have performance data for some benchmarks?

@@ -45,10 +45,11 @@ define <vscale x 8 x i16> @muladd_i16_test1(<vscale x 8 x i16> %a, <vscale x 8 x
 define <vscale x 8 x i16> @muladd_i16_test2(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b)
 ; CHECK-LABEL: muladd_i16_test2:
 ; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #200
+; CHECK-NEXT:    mov z2.d, z0.d
 ; CHECK-NEXT:    ptrue p0.h
-; CHECK-NEXT:    movprfx z2, z0
-; CHECK-NEXT:    mul z2.h, p0/m, z2.h, z1.h
-; CHECK-NEXT:    add z2.h, z2.h, #200 // =0xc8
+; CHECK-NEXT:    mov z3.h, w8
+; CHECK-NEXT:    mad z2.h, p0/m, z1.h, z3.h
 ; CHECK-NEXT:    mul z0.h, p0/m, z0.h, z2.h
 ; CHECK-NEXT:    sub z0.h, z0.h, z1.h
 ; CHECK-NEXT:    ret

But this is a bit of an aside; at this point it would be great if I could get a second opinion on the direction, maybe @dmgreen, @david-arm?
Comment Actions
Hi all, I could have saved you some time here but am only just catching up on code reviews.

Firstly, whilst the MachineCombiner argument is not totally irrelevant, I'm going to dismiss it for now because, as @sushgokh pointed out, this is not how we currently do such things for SVE. For the most part, doing this during ISel gets us most of what we need.

At least part of the motivation for this patch looks to be to allow a better choice as to when to use MLA or MAD. We handle this by not making the choice during ISel, but instead selecting a pseudo instruction that gets expanded after register allocation, when we have access to more information. You'll see an example of this on the floating-point side of things, whereby FMLA_ZPZZ_UNDEF_D is emitted and later expanded into one of FMLA_ZPmZZ_D, FMAD_ZPmZZ_D, or a MOVPRFX'd FMLA_ZPmZZ_D, depending on which registers, if any, the register allocator has chosen to reuse for the result.

@sushgokh: I don't know how familiar you are with this logic, so if you would prefer me to upload a base implementation for you to build on then please just shout.

Comment Actions
Sounds like a good reason why the machine combiner might be an option. ;-)

Comment Actions
@paulwalker-arm Thanks for the explanation. Let me check the example you have given and come up with an implementation. If I cannot, I will ask for your help.

Comment Actions
Changed the logic to generate a pseudo instruction at ISel (which is changed to the actual instruction much later, during the ExpandPseudo pass), following @paulwalker-arm's suggestion.

Comment Actions
@paulwalker-arm Thanks for the reference. I will check D143764 and try to make changes accordingly. Let me know if any more changes are required once you have gone through this.
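The post-register-allocation expansion strategy discussed above can be modelled, very roughly, as a pure decision function. This is a hedged editor's sketch, not LLVM's actual code (the function name and register encoding are invented for illustration): reuse of the accumulator register favours the MLA form, reuse of a multiplicand favours MAD, and no reuse requires a MOVPRFX copy first.

```python
# Much-simplified model (not the actual AArch64ExpandPseudoInsts logic) of
# the choice made when a destructive multiply-accumulate pseudo such as
# FMLA_ZPZZ_UNDEF_D is expanded after register allocation. Registers are
# modelled as plain integers.
def expand_mul_acc(dst, acc, op1, op2):
    if dst == acc:
        return "mla"          # destination already holds the addend
    if dst in (op1, op2):
        return "mad"          # destination already holds a multiplicand
    return "movprfx+mla"      # no reuse: prefix a copy, then accumulate
```

The point of deferring the choice is visible here: only after register allocation do we know which, if any, input register aliases the destination, so selecting MLA vs MAD at ISel time would be guessing.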
Comment Actions
Currently, depending upon whether the add/sub instruction can synthesize the immediate directly, it is decided whether to generate add/sub+immediate or mov+mla/mad/msb/mls ops. If the add/sub can synthesize the immediate directly, then the fused ops won't get generated. This patch tries to address this by giving a makeshift higher priority to the fused ops.
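To make the trade-off concrete, here is a rough sketch based on the muladd_i8_negativeAddend example quoted earlier in the thread (the mov immediate form is illustrative; 241 as a signed byte is -15). When the immediate is encodable in the add, both sequences are two instructions, so the fused form only wins when it also removes a register copy (e.g. a movprfx) that the unfused form would need:

```
mul z0.b, p0/m, z0.b, z1.b    // unfused: multiply ...
add z0.b, z0.b, #241          // ... then add encodable immediate

mov z2.b, #-15                // fused: materialise the splatted constant ...
mad z0.b, p0/m, z1.b, z2.b    // ... then single multiply-add
```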
I think this is probably because ISel chooses the pattern with the highest complexity that it can find, believing that this is a bigger win, although I could be wrong. So if there are other patterns that also match the sequence of DAG nodes, they will be chosen first. This suggests there are multiple matching patterns for the cases you are testing.