This is an archive of the discontinued LLVM Phabricator instance.

[SVE][codegen] Add pattern for SVE multiply-add accumulate
ClosedPublic

Authored by sushgokh on Jan 26 2023, 11:28 AM.

Details

Summary

Add a fused multiply-add ISel pattern with one of the addends as constant splat.

Specifically, patch aims at transformation similar to below:

add ( mul, splat_vector(C)) 
->  
      MOV C 
      MAD

Diff Detail

Event Timeline

sushgokh created this revision.Jan 26 2023, 11:28 AM
Herald added a project: Restricted Project. · View Herald Transcript
sushgokh requested review of this revision.Jan 26 2023, 11:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 26 2023, 11:28 AM

Hi All,

Any further comments/suggestions on the patch ?

Please give reviewers at least a week before you ask for an update.

I will look at this soon.

In the meantime, can you upload your patch with context? See this documentation how to do that (or what it means):
https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface

Adding to Eli's comment, see this doc for more details in section "LGTM - How a Patch Is Accepted":
https://llvm.org/docs/CodeReview.html

First, I think the description needs some clarifications.
As I understand it, the problem description is that we would like to match/rewrite this pattern:

add ( mul, splat_vector(C)) ->  MOV C; MAD

where C is a constant/immediate, and the result is some MOV instruction to materialise constant C followed by the MAD.

The FIXME indicates to me that current approach is not entirely the right one, but it's a bit difficult to follow all details there.
But for NEON, the MLA and MLS instructions are generated in MachineCombine, so my first question is why we can't do something similar, e.g.:

https://github.com/llvm/llvm-project/blob/5bc4e1cd3be8844c1e7947647020710016925e9e/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp#L5925

@SjoerdMeijer Yes, I will add the patch with full context.

Considering the motive of the patch. In cases where one of the addend is splat, semantics of 'mla' instruction will mandate extra register moves. This drawback is overcome by giving higher priority to 'mad' instruction than 'mla'. However, this doesnt guarantee complete freedom from extra moves as can be noted in one of the test cases(i.e. with mad, it generated one extra mov). But, overall, mad generated less count of instructions.

I added TOFIX in test cases just to keep track that things can be improved there, either over current patch or current LLVM trunk results. I will remove them as they are misleading

sushgokh added a comment.EditedJan 31 2023, 7:03 AM

@SjoerdMeijer Regarding doing the same thing at MachineCombiner stage. Even with current trunk, MLA/MLS for SVE is generated at ISel level. The line which you pointed was for scalar MADD. Also, its relatively easy to implement pattern matching at ISel rather than machine combiner for the reason that

  1. I am reusing pattern for 'mla'
  2. I dont need to get deep into intricate details needed at machine combiner level

Ok, I misunderstood a few things, but see comment inlined.

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
350 ↗(On Diff #492524)

This doesn't do what you think it does, I think.
This returns the uses of this node in the graph, it doesn't return the def-uses.

So the question is for which test-case you need this and why?

david-arm added inline comments.Jan 31 2023, 8:32 AM
llvm/lib/Target/AArch64/SVEInstrFormats.td
483

I think this is probably because isel chooses the patterns with the highest complexity that it can find, believing that this is a bigger win, although I could be wrong? So if there are other patterns that also match the sequence of DAG nodes they will be chosen first. This suggests there are multiple matching patterns for the cases you are testing.

llvm/test/CodeGen/AArch64/sve-gep.ll
230 ↗(On Diff #492524)

This particular case looks like a regression, which is a shame.

llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll
2 ↗(On Diff #492524)

It would be really nice if you could pre-commit these tests without any code changes so that we can see the diff in this patch?

80 ↗(On Diff #492524)

Without seeing the code for these tests before your patch it's hard to know what's changed and whether this is an improvement or not. I copied these tests and ran them manually without your patch and to be honest the code for this function isn't obviously better with the mad:

muladd_i16_test4:                       // @muladd_i16_test4
      ptrue   p0.h
      mul     z0.h, p0/m, z0.h, z1.h
      add     z0.h, z0.h, #200                // =0xc8
      mul     z0.h, p0/m, z0.h, z1.h
      ret

I think it's because the add is able to combine the constant into the instruction, whereas mad cannot. Having said that, I can believe that for non-constant cases this might be a win. It might be a good idea to include some tests for non-constants too?

and one more request, which I forgot to add earlier:

The patch aims at:

  1. Try to generate multiply-accumulate instructions post-isel at more places
  2. Improving the current functionality of 'mad' instruction(which only kicks in if intrinsic is used)
  3. with 'mla' instruction, it has been observed at multiple places that it generates extra register moves especially if the added is constant. This patch aims to improve this.

We should be doing one thing at a time. This patch is about 'mad' instruction selection, so please remove anything related to 3. in this patch, if possible, and prepare a follow up patch for that.

Matt added a subscriber: Matt.Jan 31 2023, 9:27 AM
sushgokh added inline comments.Jan 31 2023, 12:16 PM
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
350 ↗(On Diff #492524)

->uses() indeed returns users of the node

llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll
2 ↗(On Diff #492524)

Added a patch D142998.

Once thats commited, will update this patch

80 ↗(On Diff #492524)

Have added a patch for tests. Refer D142998

and one more request, which I forgot to add earlier:

The patch aims at:

  1. Try to generate multiply-accumulate instructions post-isel at more places
  2. Improving the current functionality of 'mad' instruction(which only kicks in if intrinsic is used)
  3. with 'mla' instruction, it has been observed at multiple places that it generates extra register moves especially if the added is constant. This patch aims to improve this.

We should be doing one thing at a time. This patch is about 'mad' instruction selection, so please remove anything related to 3. in this patch, if possible, and prepare a follow up patch for that.

Agreed. Will go with constant splats as addend first. Will update the description of the patch here

sushgokh edited the summary of this revision. (Show Details)Jan 31 2023, 12:20 PM
SjoerdMeijer added inline comments.Feb 1 2023, 2:40 AM
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
491 ↗(On Diff #492524)

Is this not the only change we need?

llvm/lib/Target/AArch64/SVEInstrFormats.td
3115

So why do we need these changes?

Hi @SjoerdMeijer, in @sushgokh's defence there is precedent for some of the changes in this patch - by changing from SVE_4_Op_Pat to SVE_4_Mad_Op_Pat we are able to set the AddedComplexity field to the pattern, which is not dissimilar to SVE_3_Op_Pat_SelZero or SVE_3_Op_Pat_Shift_Imm_SelZero, i.e.

let AddedComplexity = 1 in {
class SVE_3_Op_Pat_SelZero<ValueType vtd, SDPatternOperator op, ValueType vt1,
                   ValueType vt2, ValueType vt3, Instruction inst>
: Pat<(vtd (vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), vt3:$Op3))),
      (inst $Op1, $Op2, $Op3)>;

class SVE_3_Op_Pat_Shift_Imm_SelZero<ValueType vtd, SDPatternOperator op,
                                     ValueType vt1, ValueType vt2,
                                     Operand vt3, Instruction inst>
: Pat<(vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), (i32 (vt3:$Op3)))),
      (inst $Op1, $Op2, vt3:$Op3)>;
}

What I don't fully understand is why the complexity has to be so high, since it suggests there are multiple competing patterns and it might be useful to understand what they are. I admit that AArch64mul_p_firstOpndWithSingleUse looks a bit unusual and I'm not sure that we should be checking for explicit opcodes such as TokenFactor, etc.

Hi @SjoerdMeijer, in @sushgokh's defence there is precedent for some of the changes in this patch - by changing from SVE_4_Op_Pat to SVE_4_Mad_Op_Pat we are able to set the AddedComplexity field to the pattern, which is not dissimilar to SVE_3_Op_Pat_SelZero or SVE_3_Op_Pat_Shift_Imm_SelZero, i.e.

let AddedComplexity = 1 in {
class SVE_3_Op_Pat_SelZero<ValueType vtd, SDPatternOperator op, ValueType vt1,
                   ValueType vt2, ValueType vt3, Instruction inst>
: Pat<(vtd (vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), vt3:$Op3))),
      (inst $Op1, $Op2, $Op3)>;

class SVE_3_Op_Pat_Shift_Imm_SelZero<ValueType vtd, SDPatternOperator op,
                                     ValueType vt1, ValueType vt2,
                                     Operand vt3, Instruction inst>
: Pat<(vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), (i32 (vt3:$Op3)))),
      (inst $Op1, $Op2, vt3:$Op3)>;
}

What I don't fully understand is why the complexity has to be so high, since it suggests there are multiple competing patterns and it might be useful to understand what they are. I admit that AArch64mul_p_firstOpndWithSingleUse looks a bit unusual and I'm not sure that we should be checking for explicit opcodes such as TokenFactor, etc.

I am not agains it, just trying to understand it. :)
If this is only about the complexity, can we do this differently?

Hi @SjoerdMeijer, in @sushgokh's defence there is precedent for some of the changes in this patch - by changing from SVE_4_Op_Pat to SVE_4_Mad_Op_Pat we are able to set the AddedComplexity field to the pattern, which is not dissimilar to SVE_3_Op_Pat_SelZero or SVE_3_Op_Pat_Shift_Imm_SelZero, i.e.

let AddedComplexity = 1 in {
class SVE_3_Op_Pat_SelZero<ValueType vtd, SDPatternOperator op, ValueType vt1,
                   ValueType vt2, ValueType vt3, Instruction inst>
: Pat<(vtd (vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), vt3:$Op3))),
      (inst $Op1, $Op2, $Op3)>;

class SVE_3_Op_Pat_Shift_Imm_SelZero<ValueType vtd, SDPatternOperator op,
                                     ValueType vt1, ValueType vt2,
                                     Operand vt3, Instruction inst>
: Pat<(vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), (i32 (vt3:$Op3)))),
      (inst $Op1, $Op2, vt3:$Op3)>;
}

What I don't fully understand is why the complexity has to be so high, since it suggests there are multiple competing patterns and it might be useful to understand what they are. I admit that AArch64mul_p_firstOpndWithSingleUse looks a bit unusual and I'm not sure that we should be checking for explicit opcodes such as TokenFactor, etc.

@david-arm We dont want mad to be generated in all cases. As far as I remember, without using predicate 'AArch64mul_p_firstOpndWithSingleUse', muladd_i16_test3 generates mad with extra register moves and that becomes inefficient.

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
491 ↗(On Diff #492524)

yes

llvm/lib/Target/AArch64/SVEInstrFormats.td
3115

SVE_4_Mad_Op_Pat is exactly identical to SVE_4_Op_Pat and in fact derived from SVE_4_Op_Pat.

The reason for creating seperate pattern just for MAD is assigning priority only to MAD and not to all instructions using SVE_4_Op_Pat

And for completeness, summarising previous comments, these are my other 2 concerns.

We need an alternative for:

for(SDNode* use: Op1->uses())

This is not doing what we want. So my question here is why we need it, why a hasOneUse check won't suffice? What is the motivation test case? If we know that, perhaps we can have a think about this.

Second, this could be related to the above, but I remember from some tests that we are missing a few opportunities. Why is that? What would be needed to recognises these cases? Reason I am asking is to see if this is the right place to do this thing.

And for completeness, summarising previous comments, these are my other 2 concerns.

We need an alternative for:

for(SDNode* use: Op1->uses())

This is not doing what we want. So my question here is why we need it, why a hasOneUse check won't suffice? What is the motivation test case? If we know that, perhaps we can have a think about this.

Second, this could be related to the above, but I remember from some tests that we are missing a few opportunities. Why is that? What would be needed to recognises these cases? Reason I am asking is to see if this is the right place to do this thing.

In order to understand your first concern, can you try to check the outcome with/without 'AArch64mul_p_firstOpndWithSingleUse' predicate for, as far as I remember, muladd_i16_test3 ? It helps a bit not generating mad here

SjoerdMeijer added inline comments.Feb 3 2023, 3:04 AM
llvm/lib/Target/AArch64/SVEInstrFormats.td
3115

Ok, thanks. I have applied the patch locally to see what's going on as I can't see the context here. I now see that it makes sense, so I am on board with this change here.
I am now looking into your suggestion for uses.

And in the meantime, can you upload a new revision with full context please, and rebase it on top of your change that precommit tests? Then we can see and discuss the changes this is making, and check if we see any improvements.

I experimented with replacing AArch64mul_p_firstOpndWithSingleUse -> AArch64mul_p_oneuse :

diff --git a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
index 96126b35c6a1..1cda0d41ac78 100644
--- a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
@@ -408,7 +408,7 @@ def AArch64mla_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2, node:$op3),
                               (add node:$op1, (vselect node:$pred, (AArch64mul_p_oneuse (SVEAllActive), node:$op2, node:$op3), (SVEDup0)))]>;
 def AArch64mad_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2, node:$op3),
                              [(int_aarch64_sve_mad node:$pred, node:$op1, node:$op2, node:$op3),
-                              (add node:$op3, (AArch64mul_p_firstOpndWithSingleUse node:$pred, node:$op1, node:$op2))]>;
+                              (add node:$op3, (AArch64mul_p_oneuse node:$pred, node:$op1, node:$op2))]>;
 def AArch64mls_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2, node:$op3),
                              [(int_aarch64_sve_mls node:$pred, node:$op1, node:$op2, node:$op3),
                               (sub node:$op1, (AArch64mul_p_oneuse node:$pred, node:$op2, node:$op3)),
diff --git a/llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll b/llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll
index b7ee8bfb25c5..51b8f1f129a4 100644
--- a/llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll
+++ b/llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll
@@ -45,10 +45,11 @@ define <vscale x 8 x i16> @muladd_i16_test1(<vscale x 8 x i16> %a, <vscale x 8 x
 define <vscale x 8 x i16> @muladd_i16_test2(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b)
 ; CHECK-LABEL: muladd_i16_test2:
 ; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #200
+; CHECK-NEXT:    mov z2.d, z0.d
 ; CHECK-NEXT:    ptrue p0.h
-; CHECK-NEXT:    movprfx z2, z0
-; CHECK-NEXT:    mul z2.h, p0/m, z2.h, z1.h
-; CHECK-NEXT:    add z2.h, z2.h, #200 // =0xc8
+; CHECK-NEXT:    mov z3.h, w8
+; CHECK-NEXT:    mad z2.h, p0/m, z1.h, z3.h
 ; CHECK-NEXT:    mul z0.h, p0/m, z0.h, z2.h
 ; CHECK-NEXT:    sub z0.h, z0.h, z1.h
 ; CHECK-NEXT:    ret
@@ -64,10 +65,12 @@ define <vscale x 8 x i16> @muladd_i16_test2(<vscale x 8 x i16> %a, <vscale x 8 x
 define <vscale x 8 x i16> @muladd_i16_test3(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b)
 ; CHECK-LABEL: muladd_i16_test3:
 ; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #200
+; CHECK-NEXT:    mov z2.d, z0.d
 ; CHECK-NEXT:    ptrue p0.h
-; CHECK-NEXT:    mul z1.h, p0/m, z1.h, z0.h
-; CHECK-NEXT:    add z1.h, z1.h, #200 // =0xc8
-; CHECK-NEXT:    mul z0.h, p0/m, z0.h, z1.h
+; CHECK-NEXT:    mov z3.h, w8
+; CHECK-NEXT:    mad z2.h, p0/m, z1.h, z3.h
+; CHECK-NEXT:    mul z0.h, p0/m, z0.h, z2.h
 ; CHECK-NEXT:    ret
 {
   %1 = mul <vscale x 8 x i16> %a, %b

So it looks like we are generating more mads, which is interesting but I haven't looked if that is correct. From a quick look, this looks sensible?
The next question is whether this is better....

sushgokh updated this revision to Diff 494624.Feb 3 2023, 7:11 AM
  1. Uploading the full context
  2. Added a check that one of the addends is constant splat
  3. Made some cleanup
  4. Updated test for precommit test in D142998

There is no precedent for:

  • matching opcodes like that, the latest revision adds the splat_vector opcode check,
  • checking the uses of sdnodes like that.

And the complexity check and interaction isn't a good sign. I don't think we are yet converging to a better implementation, and thus I believe this is not the right place for this. I still think this is a job for the machine combiner, this is where all the precedent is for doing these kind of things. Previously I linked to some scalar combines, but here all are the vector opcodes:

https://github.com/llvm/llvm-project/blob/5bc4e1cd3be8844c1e7947647020710016925e9e/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp#L5032

For test-case muladd_i8_negativeAddend, this is the relevant MIR snippet:

%3:zpr = MUL_ZPZZ_UNDEF_B killed %2:ppr_3b, %0:zpr, %1:zpr
%4:zpr = ADD_ZI_B %3:zpr(tied-def 0), 241, 0

And this looks like a fairly straight forward combine to me, and again, all the precedent is there in the machine combiner to do this. However, the benefit isn't clear to me for this example, because the assembly for this example before this patch:

mul     z0.b, p0/m, z0.b, z1.b
add     z0.b, z0.b, #241                // =0xf1

And when we want to generate the mad, we need to splat the immediate and we end up with something like this:

mov 
mad

So there's no real gain here if I am not mistaken? That makes me wonder for which cases we do expect and actually see a gain? Do we have performance data for some benchmarks for this?
And for most other cases sve-multiply-add-accumulate.ll, mla was generated which now turns into mad, but similarly the gain is unclear. By the way, would be great if you locally rebase this patch on the "precommit" test cases in D142998,. Then this would only see the changes in the tests, similarly like I showed in my previous comment, e.g.:

  @@ -45,10 +45,11 @@ define <vscale x 8 x i16> @muladd_i16_test1(<vscale x 8 x i16> %a, <vscale x 8 x
 define <vscale x 8 x i16> @muladd_i16_test2(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b)
 ; CHECK-LABEL: muladd_i16_test2:
 ; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #200
+; CHECK-NEXT:    mov z2.d, z0.d
 ; CHECK-NEXT:    ptrue p0.h
-; CHECK-NEXT:    movprfx z2, z0
-; CHECK-NEXT:    mul z2.h, p0/m, z2.h, z1.h
-; CHECK-NEXT:    add z2.h, z2.h, #200 // =0xc8
+; CHECK-NEXT:    mov z3.h, w8
+; CHECK-NEXT:    mad z2.h, p0/m, z1.h, z3.h
 ; CHECK-NEXT:    mul z0.h, p0/m, z0.h, z2.h
 ; CHECK-NEXT:    sub z0.h, z0.h, z1.h
 ; CHECK-NEXT:    ret

But this is a bit of an aside, at this point it would be great if I could get a second opinion on the direction, maybe @dmgreen , @david-arm ?

Hi All, I could have saved you some time here but am only just catching up on code reviews. Firstly, whilst the MachineCombine argument is not totally irrelevant I'm going to dismiss it for now because as @sushgokh pointed out, this is not how we currently do such things for SVE. For the most part doing this during ISEL gets us mostly what we need.

At least part of the motivation for this patch looks to be to allow a better choice as to when to use MLA or MAD. We handle this by not making the choice during ISEL but instead selecting a pseudo instruction that gets expanded after register allocation when we have access to more information. You'll see an example of this for the floating point side of things whereby FMLA_ZPZZ_UNDEF_D is emitted that is later expanded into one of FMLA_ZPmZZ_D, FMAD_ZPmZZ_D or a MOVPRFX'd FMLA_ZPmZZ_D depending on which registers, if any, the register allocation has chosen to reuse for the result.

@sushgokh: I don't know how familiar you are with this logic so if you prefer me to upload a base implementation for you to build on then please just shout.

We handle this by not making the choice during ISEL but instead selecting a pseudo instruction that gets expanded after register allocation when we have access to more information.

Sounds like a good reason why the machine combiner might be an option. ;-)
But didn't know it worked like this for SVE, thanks for explaining.

@paulwalker-arm Thanks for the explanation. Let me check the example you have given and come up with implementation. If I cannot, will ask for your help

sushgokh updated this revision to Diff 497383.Feb 14 2023, 10:34 AM

Changed the logic to generate Psuedo instruction at ISel(which would be changed to actual instruction much later during ExpandPseudo pass) upon @paulwalker-arm suggestion

paulwalker-arm added a comment.EditedFeb 14 2023, 10:42 AM

The patch looks much better now, thanks @sushgokh. I'll take a proper look tomorrow but can I draw your attention to D143764 which I plan to land tomorrow. It shows a style that equally applies to your patch as well to reduce the number of patterns within AArch64SVEInstrInfo.td.

The patch looks much better now, thanks @sushgokh. I'll take a proper look tomorrow but can I draw your attention to D143764 which I plan to land tomorrow. It shows a style that equally applies to your patch as well to reduce the number of patterns within AArch64SVEInstrInfo.td.

@paulwalker-arm Thanks for reference. WIll check D143764 and try to make changes accordingly. Let me know if any more changes are required once you have gone through this

sushgokh updated this revision to Diff 498736.Feb 19 2023, 11:48 PM
sushgokh updated this revision to Diff 498781.Feb 20 2023, 3:07 AM

@paulwalker-arm can you check now if this looks good ?

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
405 ↗(On Diff #498781)

We have a naming convention here that follows the instruction's name with a letter suffix. In this instance this should be AArch64mla_p.

493 ↗(On Diff #498781)

Given you've added all the plumbing for MLS/MSB pseudo nodes please can you add support for AArch64mls_p to this patch as well.

llvm/lib/Target/AArch64/SVEInstrFormats.td
503

We don't normally use AddedComplexity like this unless the containing class has something about it that makes the complexity relevant. This is how it's used above bit for your usage it's better to just reuse SVE_4_Op_Pat and set AddedComplexity specifically for the important pattern.

With all that said for this first patch I'd rather not set AddedComplexity at all. This goes hand in hand with my comment relating to the tests. I'd prefer to get the base MLA/MLS pseudo support added thats solves the main MLA/MLS vs MAD/MSB problem and then have follow on work to solve the more nuanced issue of whether to emit "mul followed by add_imm" or "mov_imm followed by mla".

llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll
4–6 ↗(On Diff #498781)

For this patch I'd rather see a set of much simpler patterns that just exercise the new isel, whereby the mla/mad choice is controlled by the function's operand order (i.e. %a is the multiplicand -> mad, %a is the addend -> mla). You'll see example of this for the equivalent fmla/fmad testing. This should mean all test ( 16 = 4 instructions * 4 element types) where each only emits a single instruction (ignoring the ret).

The more nuanced cases should form part of a different patch along with the code changes to improve them.

sushgokh added inline comments.Feb 20 2023, 9:43 PM
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
405 ↗(On Diff #498781)

_p is slightly confusing and can confused with using 'predicated version'. To be specific, I used 'pseudo'. But to be consistent at this point, I will change it as you say

493 ↗(On Diff #498781)

I was reluctant to do since that's unrelated for this patch. I will take this in next patch if you agree

llvm/lib/Target/AArch64/SVEInstrFormats.td
503

Two points:

  1. As said above, will make changes for MLS/MSB in the next patch if you agree.
  2. The main motive of this patch was getting 'mad/mla' generated for
void
imul(uint8_t * restrict dst, 
     uint32_t * restrict src1, 
     uint32_t * restrict src2, 
     int n)
{
  for (int i = 0; i < n; i++) {
    dst[i] = ( src1[i] * src2[i] + 1 ) ;
  }
}

now, coming to complexity issue, I have created custom pattern with increased complexity so that

a) mad/mla is generated for above test case
a) it has no side-effects for instructions using 'SVE_4_Op_Pat'

Having said all this, if you say, I am ready to go by following sequence:

  1. Convert MLA/MLS/MAD/MSB to pseudo
  2. Looking at which sequence to generate: mad/mla or mul+add/sub

Let me know if you still suggest the above sequence.

llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll
4–6 ↗(On Diff #498781)

will wait for your replies on previous comments

paulwalker-arm added inline comments.Feb 21 2023, 7:57 AM
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
405 ↗(On Diff #498781)

I guess the comment at the top of the file can be improved but my request to use AArch64mla_p is because this is the 'predicated version'. By which I mean this PatFrags represents all the patterns that can be used to represent a predicated mla where the result of the inactive lanes does not matter. The only different between this and AArch64fmla_p is the latter has more representations including a dedicated ISD node, but that'll change, for example I expect int_aarch64_sve_mla_u will exist soon.

493 ↗(On Diff #498781)

I don't really see it as unrelated, especially given this patch is updating the pseudo mappings for MLS/MSB, but if that's your preference then fair enough.

llvm/lib/Target/AArch64/SVEInstrFormats.td
503

Yes I prefer to split the task in two given part (1) is always wanted, but (2) feels more specialised and warrants a different testing strategy.

For example, I'm assuming the advantage of using the mla is the mov_imm will be hoisted out of the loop and thus I think that patch will want a test showing this. Whereas this patch only requires simple one line tests to protect the new functionality.

Re SVE_4_Op_Psuedo_Pat: I just don't want to see multiple redefinitions of SVE_4_Op_Pat each with a different complexity. The need is specific to MLA and thus I prefer see it closer to its definition. That said, I suppose you can extend SVE_4_Op_Pat to take an extra parameter that defaults to unset or 0 and then set AddedComplexity within SVE_4_Op_Pat? This all belongs in the follow on patch though given it's not necessary for the base MLA support.

sushgokh added inline comments.Feb 21 2023, 10:54 AM
llvm/lib/Target/AArch64/SVEInstrFormats.td
503

ok thanks. As we agree, will upload a

  1. first patch - convert MLA/MAD/MLS/MSB to pseudo
  2. second patch - get into more complex mul+add patterns that can be converted to mla/mad/etc.
Allen added a subscriber: Allen.Mar 8 2023, 5:47 PM
sushgokh updated this revision to Diff 508900.Mar 27 2023, 11:26 PM

Currently, depending upon whether the add/sub instruction can synthesize immediate directly, its decided whether to generate add/sub+immediate or mov+mla/mad/msb/mls ops.

If the add/sub can synthesize immediate directly, then fused ops wont get generated. This patch tries to address this by having makeshift higher priority for the fused ops.

paulwalker-arm added inline comments.Apr 3 2023, 10:16 AM
llvm/lib/Target/AArch64/SVEInstrFormats.td
3201–3207

Rather than create a new class please just wrap these line in let AddedComplexity = 9 in {} instead.

sushgokh updated this revision to Diff 510703.Apr 3 2023, 11:29 PM
sushgokh marked an inline comment as done.
paulwalker-arm accepted this revision.Apr 4 2023, 3:17 AM
This revision is now accepted and ready to land.Apr 4 2023, 3:17 AM
This revision was landed with ongoing or failed builds.Apr 4 2023, 10:47 PM
This revision was automatically updated to reflect the committed changes.