This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/Target/AArch64/
-
Target/
-
AArch64/
4/10
AArch64SVEInstrInfo.td
4/9
SVEInstrFormats.td
-
test/CodeGen/AArch64/
-
CodeGen/
-
AArch64/
-
sve-gather-scatter-addr-opts.ll
-
sve-int-arith.ll
3/6
sve-multiply-add-accumulate.ll

Differential D142656

[SVE][codegen] Add pattern for SVE multiply-add accumulate
ClosedPublic

Authored by sushgokh on Jan 26 2023, 11:28 AM.

Download Raw Diff

Details

Reviewers

SjoerdMeijer
dmgreen
david-arm
madhur13490
efriedma
paulwalker-arm
c-rhodes
peterwaller-arm

Commits

rG99fc6ec34cc1: [AArch64][SVE][CodeGen] Generate fused mul+add/sub ops with one of add/sub…

Summary

Add a fused multiply-add ISel pattern with one of the addends as constant splat.

Specifically, patch aims at transformation similar to below:

add ( mul, splat_vector(C)) 
->  
      MOV C 
      MAD

Diff Detail

Unit TestsFailed

	Time	Test
	3,390 ms	x64 debian > libFuzzer.libFuzzer::fuzzer-leak.test

Event Timeline

sushgokh created this revision.Jan 26 2023, 11:28 AM

Herald added a reviewer: efriedma. · View Herald TranscriptJan 26 2023, 11:28 AM

Herald added a project: Restricted Project. · View Herald Transcript

Herald added subscribers: ctetreau, psnobl, hiraditya, tschuett. · View Herald Transcript

sushgokh requested review of this revision.Jan 26 2023, 11:28 AM

Herald added a project: Restricted Project. · View Herald TranscriptJan 26 2023, 11:28 AM

Herald added a subscriber: llvm-commits. · View Herald Transcript

efriedma added reviewers: paulwalker-arm, c-rhodes, peterwaller-arm.Jan 26 2023, 12:28 PM

Harbormaster completed remote builds in B210193: Diff 492524.Jan 26 2023, 1:31 PM

Hi All,

Any further comments/suggestions on the patch ?

Please give reviewers at least a week before you ask for an update.

I will look at this soon.

In the meantime, can you upload your patch with context? See this documentation how to do that (or what it means):
https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface

Adding to Eli's comment, see this doc for more details in section "LGTM - How a Patch Is Accepted":
https://llvm.org/docs/CodeReview.html

First, I think the description needs some clarifications.
As I understand it, the problem description is that we would like to match/rewrite this pattern:

add ( mul, splat_vector(C)) ->  MOV C; MAD

where C is a constant/immediate, and the result is some MOV instruction to materialise constant C followed by the MAD.

The FIXME indicates to me that current approach is not entirely the right one, but it's a bit difficult to follow all details there.
But for NEON, the MLA and MLS instructions are generated in MachineCombine, so my first question is why we can't do something similar, e.g.:

https://github.com/llvm/llvm-project/blob/5bc4e1cd3be8844c1e7947647020710016925e9e/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp#L5925

@SjoerdMeijer Yes, I will add the patch with full context.

Considering the motive of the patch. In cases where one of the addend is splat, semantics of 'mla' instruction will mandate extra register moves. This drawback is overcome by giving higher priority to 'mad' instruction than 'mla'. However, this doesnt guarantee complete freedom from extra moves as can be noted in one of the test cases(i.e. with mad, it generated one extra mov). But, overall, mad generated less count of instructions.

I added TOFIX in test cases just to keep track that things can be improved there, either over current patch or current LLVM trunk results. I will remove them as they are misleading

@SjoerdMeijer Regarding doing the same thing at MachineCombiner stage. Even with current trunk, MLA/MLS for SVE is generated at ISel level. The line which you pointed was for scalar MADD. Also, its relatively easy to implement pattern matching at ISel rather than machine combiner for the reason that

I am reusing pattern for 'mla'
I dont need to get deep into intricate details needed at machine combiner level

Ok, I misunderstood a few things, but see comment inlined.

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
350	This doesn't do what you think it does, I think. This returns the uses of this node in the graph, it doesn't return the def-uses. So the question is for which test-case you need this and why?

david-arm added inline comments.Jan 31 2023, 8:32 AM

llvm/lib/Target/AArch64/SVEInstrFormats.td
483	I think this is probably because isel chooses the patterns with the highest complexity that it can find, believing that this is a bigger win, although I could be wrong? So if there are other patterns that also match the sequence of DAG nodes they will be chosen first. This suggests there are multiple matching patterns for the cases you are testing.
llvm/test/CodeGen/AArch64/sve-gep.ll
230 ↗	(On Diff #492524)	This particular case looks like a regression, which is a shame.
llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll
3	It would be really nice if you could pre-commit these tests without any code changes so that we can see the diff in this patch?
81	Without seeing the code for these tests before your patch it's hard to know what's changed and whether this is an improvement or not. I copied these tests and ran them manually without your patch and to be honest the code for this function isn't obviously better with the mad: muladd_i16_test4: // @muladd_i16_test4 ptrue p0.h mul z0.h, p0/m, z0.h, z1.h add z0.h, z0.h, #200 // =0xc8 mul z0.h, p0/m, z0.h, z1.h ret I think it's because the add is able to combine the constant into the instruction, whereas mad cannot. Having said that, I can believe that for non-constant cases this might be a win. It might be a good idea to include some tests for non-constants too?

and one more request, which I forgot to add earlier:

The patch aims at:

Try to generate multiply-accumulate instructions post-isel at more places

Improving the current functionality of 'mad' instruction(which only kicks in if intrinsic is used)

with 'mla' instruction, it has been observed at multiple places that it generates extra register moves especially if the added is constant. This patch aims to improve this.

We should be doing one thing at a time. This patch is about 'mad' instruction selection, so please remove anything related to 3. in this patch, if possible, and prepare a follow up patch for that.

Matt added a subscriber: Matt.Jan 31 2023, 9:27 AM

sushgokh added inline comments.Jan 31 2023, 12:16 PM

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
350	->uses() indeed returns users of the node
llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll
3	Added a patch D142998. Once thats commited, will update this patch
81	Have added a patch for tests. Refer D142998

In D142656#4093946, @SjoerdMeijer wrote:

and one more request, which I forgot to add earlier:

The patch aims at:

Try to generate multiply-accumulate instructions post-isel at more places

Improving the current functionality of 'mad' instruction(which only kicks in if intrinsic is used)

with 'mla' instruction, it has been observed at multiple places that it generates extra register moves especially if the added is constant. This patch aims to improve this.

We should be doing one thing at a time. This patch is about 'mad' instruction selection, so please remove anything related to 3. in this patch, if possible, and prepare a follow up patch for that.

Agreed. Will go with constant splats as addend first. Will update the description of the patch here

sushgokh edited the summary of this revision. (Show Details)Jan 31 2023, 12:20 PM

sushgokh mentioned this in D142998: [SVE][codegen] Add few more tests for MUL followed by ADD/SUB (NFC).Jan 31 2023, 12:28 PM

SjoerdMeijer added inline comments.Feb 1 2023, 2:40 AM

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
491	Is this not the only change we need?
llvm/lib/Target/AArch64/SVEInstrFormats.td
3115	So why do we need these changes?

Hi @SjoerdMeijer, in @sushgokh's defence there is precedent for some of the changes in this patch - by changing from SVE_4_Op_Pat to SVE_4_Mad_Op_Pat we are able to set the AddedComplexity field to the pattern, which is not dissimilar to SVE_3_Op_Pat_SelZero or SVE_3_Op_Pat_Shift_Imm_SelZero, i.e.

let AddedComplexity = 1 in {
class SVE_3_Op_Pat_SelZero<ValueType vtd, SDPatternOperator op, ValueType vt1,
                   ValueType vt2, ValueType vt3, Instruction inst>
: Pat<(vtd (vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), vt3:$Op3))),
      (inst $Op1, $Op2, $Op3)>;

class SVE_3_Op_Pat_Shift_Imm_SelZero<ValueType vtd, SDPatternOperator op,
                                     ValueType vt1, ValueType vt2,
                                     Operand vt3, Instruction inst>
: Pat<(vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), (i32 (vt3:$Op3)))),
      (inst $Op1, $Op2, vt3:$Op3)>;
}

What I don't fully understand is why the complexity has to be so high, since it suggests there are multiple competing patterns and it might be useful to understand what they are. I admit that AArch64mul_p_firstOpndWithSingleUse looks a bit unusual and I'm not sure that we should be checking for explicit opcodes such as TokenFactor, etc.

In D142656#4096082, @david-arm wrote:
Hi @SjoerdMeijer, in @sushgokh's defence there is precedent for some of the changes in this patch - by changing from SVE_4_Op_Pat to SVE_4_Mad_Op_Pat we are able to set the AddedComplexity field to the pattern, which is not dissimilar to SVE_3_Op_Pat_SelZero or SVE_3_Op_Pat_Shift_Imm_SelZero, i.e.
let AddedComplexity = 1 in {
class SVE_3_Op_Pat_SelZero<ValueType vtd, SDPatternOperator op, ValueType vt1,
                   ValueType vt2, ValueType vt3, Instruction inst>
: Pat<(vtd (vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), vt3:$Op3))),
      (inst $Op1, $Op2, $Op3)>;

class SVE_3_Op_Pat_Shift_Imm_SelZero<ValueType vtd, SDPatternOperator op,
                                     ValueType vt1, ValueType vt2,
                                     Operand vt3, Instruction inst>
: Pat<(vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), (i32 (vt3:$Op3)))),
      (inst $Op1, $Op2, vt3:$Op3)>;
}
What I don't fully understand is why the complexity has to be so high, since it suggests there are multiple competing patterns and it might be useful to understand what they are. I admit that AArch64mul_p_firstOpndWithSingleUse looks a bit unusual and I'm not sure that we should be checking for explicit opcodes such as TokenFactor, etc.

I am not agains it, just trying to understand it. :)
If this is only about the complexity, can we do this differently?

In D142656#4096082, @david-arm wrote:
Hi @SjoerdMeijer, in @sushgokh's defence there is precedent for some of the changes in this patch - by changing from SVE_4_Op_Pat to SVE_4_Mad_Op_Pat we are able to set the AddedComplexity field to the pattern, which is not dissimilar to SVE_3_Op_Pat_SelZero or SVE_3_Op_Pat_Shift_Imm_SelZero, i.e.
let AddedComplexity = 1 in {
class SVE_3_Op_Pat_SelZero<ValueType vtd, SDPatternOperator op, ValueType vt1,
                   ValueType vt2, ValueType vt3, Instruction inst>
: Pat<(vtd (vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), vt3:$Op3))),
      (inst $Op1, $Op2, $Op3)>;

class SVE_3_Op_Pat_Shift_Imm_SelZero<ValueType vtd, SDPatternOperator op,
                                     ValueType vt1, ValueType vt2,
                                     Operand vt3, Instruction inst>
: Pat<(vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), (i32 (vt3:$Op3)))),
      (inst $Op1, $Op2, vt3:$Op3)>;
}
What I don't fully understand is why the complexity has to be so high, since it suggests there are multiple competing patterns and it might be useful to understand what they are. I admit that AArch64mul_p_firstOpndWithSingleUse looks a bit unusual and I'm not sure that we should be checking for explicit opcodes such as TokenFactor, etc.

@david-arm We dont want mad to be generated in all cases. As far as I remember, without using predicate 'AArch64mul_p_firstOpndWithSingleUse', muladd_i16_test3 generates mad with extra register moves and that becomes inefficient.

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
491	yes
llvm/lib/Target/AArch64/SVEInstrFormats.td
3115	SVE_4_Mad_Op_Pat is exactly identical to SVE_4_Op_Pat and in fact derived from SVE_4_Op_Pat. The reason for creating seperate pattern just for MAD is assigning priority only to MAD and not to all instructions using SVE_4_Op_Pat

And for completeness, summarising previous comments, these are my other 2 concerns.

We need an alternative for:

for(SDNode* use: Op1->uses())

This is not doing what we want. So my question here is why we need it, why a hasOneUse check won't suffice? What is the motivation test case? If we know that, perhaps we can have a think about this.

Second, this could be related to the above, but I remember from some tests that we are missing a few opportunities. Why is that? What would be needed to recognises these cases? Reason I am asking is to see if this is the right place to do this thing.

In D142656#4096107, @SjoerdMeijer wrote:
And for completeness, summarising previous comments, these are my other 2 concerns.

We need an alternative for:
for(SDNode* use: Op1->uses())
This is not doing what we want. So my question here is why we need it, why a hasOneUse check won't suffice? What is the motivation test case? If we know that, perhaps we can have a think about this.

Second, this could be related to the above, but I remember from some tests that we are missing a few opportunities. Why is that? What would be needed to recognises these cases? Reason I am asking is to see if this is the right place to do this thing.

In order to understand your first concern, can you try to check the outcome with/without 'AArch64mul_p_firstOpndWithSingleUse' predicate for, as far as I remember, muladd_i16_test3 ? It helps a bit not generating mad here

SjoerdMeijer added inline comments.Feb 3 2023, 3:04 AM

llvm/lib/Target/AArch64/SVEInstrFormats.td
3115	Ok, thanks. I have applied the patch locally to see what's going on as I can't see the context here. I now see that it makes sense, so I am on board with this change here. I am now looking into your suggestion for `uses`.

And in the meantime, can you upload a new revision with full context please, and rebase it on top of your change that precommit tests? Then we can see and discuss the changes this is making, and check if we see any improvements.

I experimented with replacing AArch64mul_p_firstOpndWithSingleUse -> AArch64mul_p_oneuse :

diff --git a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
index 96126b35c6a1..1cda0d41ac78 100644
--- a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
@@ -408,7 +408,7 @@ def AArch64mla_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2, node:$op3),
                               (add node:$op1, (vselect node:$pred, (AArch64mul_p_oneuse (SVEAllActive), node:$op2, node:$op3), (SVEDup0)))]>;
 def AArch64mad_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2, node:$op3),
                              [(int_aarch64_sve_mad node:$pred, node:$op1, node:$op2, node:$op3),
-                              (add node:$op3, (AArch64mul_p_firstOpndWithSingleUse node:$pred, node:$op1, node:$op2))]>;
+                              (add node:$op3, (AArch64mul_p_oneuse node:$pred, node:$op1, node:$op2))]>;
 def AArch64mls_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2, node:$op3),
                              [(int_aarch64_sve_mls node:$pred, node:$op1, node:$op2, node:$op3),
                               (sub node:$op1, (AArch64mul_p_oneuse node:$pred, node:$op2, node:$op3)),
diff --git a/llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll b/llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll
index b7ee8bfb25c5..51b8f1f129a4 100644
--- a/llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll
+++ b/llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll
@@ -45,10 +45,11 @@ define <vscale x 8 x i16> @muladd_i16_test1(<vscale x 8 x i16> %a, <vscale x 8 x
 define <vscale x 8 x i16> @muladd_i16_test2(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b)
 ; CHECK-LABEL: muladd_i16_test2:
 ; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #200
+; CHECK-NEXT:    mov z2.d, z0.d
 ; CHECK-NEXT:    ptrue p0.h
-; CHECK-NEXT:    movprfx z2, z0
-; CHECK-NEXT:    mul z2.h, p0/m, z2.h, z1.h
-; CHECK-NEXT:    add z2.h, z2.h, #200 // =0xc8
+; CHECK-NEXT:    mov z3.h, w8
+; CHECK-NEXT:    mad z2.h, p0/m, z1.h, z3.h
 ; CHECK-NEXT:    mul z0.h, p0/m, z0.h, z2.h
 ; CHECK-NEXT:    sub z0.h, z0.h, z1.h
 ; CHECK-NEXT:    ret
@@ -64,10 +65,12 @@ define <vscale x 8 x i16> @muladd_i16_test2(<vscale x 8 x i16> %a, <vscale x 8 x
 define <vscale x 8 x i16> @muladd_i16_test3(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b)
 ; CHECK-LABEL: muladd_i16_test3:
 ; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #200
+; CHECK-NEXT:    mov z2.d, z0.d
 ; CHECK-NEXT:    ptrue p0.h
-; CHECK-NEXT:    mul z1.h, p0/m, z1.h, z0.h
-; CHECK-NEXT:    add z1.h, z1.h, #200 // =0xc8
-; CHECK-NEXT:    mul z0.h, p0/m, z0.h, z1.h
+; CHECK-NEXT:    mov z3.h, w8
+; CHECK-NEXT:    mad z2.h, p0/m, z1.h, z3.h
+; CHECK-NEXT:    mul z0.h, p0/m, z0.h, z2.h
 ; CHECK-NEXT:    ret
 {
   %1 = mul <vscale x 8 x i16> %a, %b

So it looks like we are generating more mads, which is interesting but I haven't looked if that is correct. From a quick look, this looks sensible?
The next question is whether this is better....

Uploading the full context
Added a check that one of the addends is constant splat
Made some cleanup
Updated test for precommit test in D142998

Harbormaster completed remote builds in B211715: Diff 494624.Feb 3 2023, 8:02 AM

There is no precedent for:

matching opcodes like that, the latest revision adds the splat_vector opcode check,
checking the uses of sdnodes like that.

And the complexity check and interaction isn't a good sign. I don't think we are yet converging to a better implementation, and thus I believe this is not the right place for this. I still think this is a job for the machine combiner, this is where all the precedent is for doing these kind of things. Previously I linked to some scalar combines, but here all are the vector opcodes:

https://github.com/llvm/llvm-project/blob/5bc4e1cd3be8844c1e7947647020710016925e9e/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp#L5032

For test-case muladd_i8_negativeAddend, this is the relevant MIR snippet:

%3:zpr = MUL_ZPZZ_UNDEF_B killed %2:ppr_3b, %0:zpr, %1:zpr
%4:zpr = ADD_ZI_B %3:zpr(tied-def 0), 241, 0

And this looks like a fairly straight forward combine to me, and again, all the precedent is there in the machine combiner to do this. However, the benefit isn't clear to me for this example, because the assembly for this example before this patch:

mul     z0.b, p0/m, z0.b, z1.b
add     z0.b, z0.b, #241                // =0xf1

And when we want to generate the mad, we need to splat the immediate and we end up with something like this:

mov 
mad

So there's no real gain here if I am not mistaken? That makes me wonder for which cases we do expect and actually see a gain? Do we have performance data for some benchmarks for this?
And for most other cases sve-multiply-add-accumulate.ll, mla was generated which now turns into mad, but similarly the gain is unclear. By the way, would be great if you locally rebase this patch on the "precommit" test cases in D142998,. Then this would only see the changes in the tests, similarly like I showed in my previous comment, e.g.:

  @@ -45,10 +45,11 @@ define <vscale x 8 x i16> @muladd_i16_test1(<vscale x 8 x i16> %a, <vscale x 8 x
 define <vscale x 8 x i16> @muladd_i16_test2(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b)
 ; CHECK-LABEL: muladd_i16_test2:
 ; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #200
+; CHECK-NEXT:    mov z2.d, z0.d
 ; CHECK-NEXT:    ptrue p0.h
-; CHECK-NEXT:    movprfx z2, z0
-; CHECK-NEXT:    mul z2.h, p0/m, z2.h, z1.h
-; CHECK-NEXT:    add z2.h, z2.h, #200 // =0xc8
+; CHECK-NEXT:    mov z3.h, w8
+; CHECK-NEXT:    mad z2.h, p0/m, z1.h, z3.h
 ; CHECK-NEXT:    mul z0.h, p0/m, z0.h, z2.h
 ; CHECK-NEXT:    sub z0.h, z0.h, z1.h
 ; CHECK-NEXT:    ret

But this is a bit of an aside, at this point it would be great if I could get a second opinion on the direction, maybe @dmgreen , @david-arm ?

Hi All, I could have saved you some time here but am only just catching up on code reviews. Firstly, whilst the MachineCombine argument is not totally irrelevant I'm going to dismiss it for now because as @sushgokh pointed out, this is not how we currently do such things for SVE. For the most part doing this during ISEL gets us mostly what we need.

At least part of the motivation for this patch looks to be to allow a better choice as to when to use MLA or MAD. We handle this by not making the choice during ISEL but instead selecting a pseudo instruction that gets expanded after register allocation when we have access to more information. You'll see an example of this for the floating point side of things whereby FMLA_ZPZZ_UNDEF_D is emitted that is later expanded into one of FMLA_ZPmZZ_D, FMAD_ZPmZZ_D or a MOVPRFX'd FMLA_ZPmZZ_D depending on which registers, if any, the register allocation has chosen to reuse for the result.

@sushgokh: I don't know how familiar you are with this logic so if you prefer me to upload a base implementation for you to build on then please just shout.

We handle this by not making the choice during ISEL but instead selecting a pseudo instruction that gets expanded after register allocation when we have access to more information.

Sounds like a good reason why the machine combiner might be an option. ;-)
But didn't know it worked like this for SVE, thanks for explaining.

@paulwalker-arm Thanks for the explanation. Let me check the example you have given and come up with implementation. If I cannot, will ask for your help

paulwalker-arm mentioned this in D142132: [AArch64] Map DestructiveTernaryCommWithRev intrinsics to pesudo instructions.Feb 10 2023, 10:38 AM

Changed the logic to generate Psuedo instruction at ISel(which would be changed to actual instruction much later during ExpandPseudo pass) upon @paulwalker-arm suggestion

The patch looks much better now, thanks @sushgokh. I'll take a proper look tomorrow but can I draw your attention to D143764 which I plan to land tomorrow. It shows a style that equally applies to your patch as well to reduce the number of patterns within AArch64SVEInstrInfo.td.

Harbormaster completed remote builds in B213702: Diff 497383.Feb 14 2023, 12:30 PM

In D142656#4126789, @paulwalker-arm wrote:

The patch looks much better now, thanks @sushgokh. I'll take a proper look tomorrow but can I draw your attention to D143764 which I plan to land tomorrow. It shows a style that equally applies to your patch as well to reduce the number of patterns within AArch64SVEInstrInfo.td.

@paulwalker-arm Thanks for reference. WIll check D143764 and try to make changes accordingly. Let me know if any more changes are required once you have gone through this

sushgokh updated this revision to Diff 498736.Feb 19 2023, 11:48 PM

Harbormaster completed remote builds in B214686: Diff 498736.Feb 20 2023, 1:08 AM

sushgokh updated this revision to Diff 498781.Feb 20 2023, 3:07 AM

Harbormaster completed remote builds in B214716: Diff 498781.Feb 20 2023, 4:25 AM

@paulwalker-arm can you check now if this looks good ?

paulwalker-arm mentioned this in D144413: [InstCombine] Extend SVEVectorFuseMulAddSub to support newly added "undef" intrinsics..Feb 20 2023, 10:23 AM

paulwalker-arm added inline comments.Feb 20 2023, 11:05 AM

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
388	We have a naming convention here that follows the instruction's name with a letter suffix. In this instance this should be `AArch64mla_p`.
472	Given you've added all the plumbing for MLS/MSB pseudo nodes please can you add support for `AArch64mls_p` to this patch as well.
llvm/lib/Target/AArch64/SVEInstrFormats.td
499	We don't normally use `AddedComplexity` like this unless the containing class has something about it that makes the complexity relevant. This is how it's used above bit for your usage it's better to just reuse `SVE_4_Op_Pat` and set `AddedComplexity` specifically for the important pattern. With all that said for this first patch I'd rather not set `AddedComplexity` at all. This goes hand in hand with my comment relating to the tests. I'd prefer to get the base MLA/MLS pseudo support added thats solves the main MLA/MLS vs MAD/MSB problem and then have follow on work to solve the more nuanced issue of whether to emit "mul followed by add_imm" or "mov_imm followed by mla".
llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll
5–7	For this patch I'd rather see a set of much simpler patterns that just exercise the new isel, whereby the mla/mad choice is controlled by the function's operand order (i.e. %a is the multiplicand -> mad, %a is the addend -> mla). You'll see example of this for the equivalent fmla/fmad testing. This should mean all test ( 16 = 4 instructions * 4 element types) where each only emits a single instruction (ignoring the ret). The more nuanced cases should form part of a different patch along with the code changes to improve them.

sushgokh added inline comments.Feb 20 2023, 9:43 PM

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
388	_p is slightly confusing and can confused with using 'predicated version'. To be specific, I used 'pseudo'. But to be consistent at this point, I will change it as you say
472	I was reluctant to do since that's unrelated for this patch. I will take this in next patch if you agree
llvm/lib/Target/AArch64/SVEInstrFormats.td
499	Two points: As said above, will make changes for MLS/MSB in the next patch if you agree. The main motive of this patch was getting 'mad/mla' generated for void imul(uint8_t * restrict dst, uint32_t * restrict src1, uint32_t * restrict src2, int n) { for (int i = 0; i < n; i++) { dst[i] = ( src1[i] * src2[i] + 1 ) ; } } now, coming to complexity issue, I have created custom pattern with increased complexity so that a) mad/mla is generated for above test case a) it has no side-effects for instructions using 'SVE_4_Op_Pat' Having said all this, if you say, I am ready to go by following sequence: Convert MLA/MLS/MAD/MSB to pseudo Looking at which sequence to generate: mad/mla or mul+add/sub Let me know if you still suggest the above sequence.
llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll
5–7	will wait for your replies on previous comments

paulwalker-arm added inline comments.Feb 21 2023, 7:57 AM

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
388	I guess the comment at the top of the file can be improved but my request to use `AArch64mla_p` is because this is the 'predicated version'. By which I mean this PatFrags represents all the patterns that can be used to represent a predicated mla where the result of the inactive lanes does not matter. The only different between this and `AArch64fmla_p` is the latter has more representations including a dedicated ISD node, but that'll change, for example I expect int_aarch64_sve_mla_u will exist soon.
472	I don't really see it as unrelated, especially given this patch is updating the pseudo mappings for MLS/MSB, but if that's your preference then fair enough.
llvm/lib/Target/AArch64/SVEInstrFormats.td
499	Yes I prefer to split the task in two given part (1) is always wanted, but (2) feels more specialised and warrants a different testing strategy. For example, I'm assuming the advantage of using the mla is the mov_imm will be hoisted out of the loop and thus I think that patch will want a test showing this. Whereas this patch only requires simple one line tests to protect the new functionality. Re SVE_4_Op_Psuedo_Pat: I just don't want to see multiple redefinitions of `SVE_4_Op_Pat` each with a different complexity. The need is specific to MLA and thus I prefer see it closer to its definition. That said, I suppose you can extend `SVE_4_Op_Pat` to take an extra parameter that defaults to unset or 0 and then set AddedComplexity within SVE_4_Op_Pat? This all belongs in the follow on patch though given it's not necessary for the base MLA support.

sushgokh added inline comments.Feb 21 2023, 10:54 AM

llvm/lib/Target/AArch64/SVEInstrFormats.td
499	ok thanks. As we agree, will upload a first patch - convert MLA/MAD/MLS/MSB to pseudo second patch - get into more complex mul+add patterns that can be converted to mla/mad/etc.

Allen added a subscriber: Allen.Mar 8 2023, 5:47 PM

sushgokh mentioned this in D146282: [CodeGen][AArch64] Add few more tests for fused operations (NFC).Mar 17 2023, 3:56 AM

Currently, depending upon whether the add/sub instruction can synthesize immediate directly, its decided whether to generate add/sub+immediate or mov+mla/mad/msb/mls ops.

If the add/sub can synthesize immediate directly, then fused ops wont get generated. This patch tries to address this by having makeshift higher priority for the fused ops.

Harbormaster completed remote builds in B222173: Diff 508900.Mar 28 2023, 12:14 AM

paulwalker-arm added inline comments.Apr 3 2023, 10:16 AM

llvm/lib/Target/AArch64/SVEInstrFormats.td
3191–3199	Rather than create a new class please just wrap these line in `let AddedComplexity = 9 in {}` instead.

sushgokh updated this revision to Diff 510703.Apr 3 2023, 11:29 PM

sushgokh marked an inline comment as done.

Harbormaster completed remote builds in B223509: Diff 510703.Apr 4 2023, 12:36 AM

paulwalker-arm accepted this revision.Apr 4 2023, 3:17 AM

This revision is now accepted and ready to land.Apr 4 2023, 3:17 AM

This revision was landed with ongoing or failed builds.Apr 4 2023, 10:47 PM

Closed by commit rG99fc6ec34cc1: [AArch64][SVE][CodeGen] Generate fused mul+add/sub ops with one of add/sub… (authored by sushgokh). · Explain Why

This revision was automatically updated to reflect the committed changes.

sushgokh added a commit: rG99fc6ec34cc1: [AArch64][SVE][CodeGen] Generate fused mul+add/sub ops with one of add/sub….

Revision Contents

Path

Size

llvm/

lib/

Target/

AArch64/

AArch64SVEInstrInfo.td

33 lines

SVEInstrFormats.td

39 lines

test/

CodeGen/

AArch64/

sve-gather-scatter-addr-opts.ll

15 lines

sve-int-arith.ll

7 lines

sve-multiply-add-accumulate.ll

234 lines

Diff 497383

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td

Show First 20 Lines • Show All 341 Lines • ▼ Show 20 Lines

def AArch64fmul_p_oneuse : PatFrag<(ops node:$pred, node:$src1, node:$src2),		def AArch64fmul_p_oneuse : PatFrag<(ops node:$pred, node:$src1, node:$src2),
(AArch64fmul_p node:$pred, node:$src1, node:$src2), [{		(AArch64fmul_p node:$pred, node:$src1, node:$src2), [{
return N->hasOneUse();		return N->hasOneUse();
}]>;		}]>;


def AArch64fabd_p : PatFrag<(ops node:$pg, node:$op1, node:$op2),		def AArch64fabd_p : PatFrag<(ops node:$pg, node:$op1, node:$op2),
(AArch64fabs_mt node:$pg, (AArch64fsub_p node:$pg, node:$op1, node:$op2), undef)>;		(AArch64fabs_mt node:$pg, (AArch64fsub_p node:$pg, node:$op1, node:$op2), undef)>;
		SjoerdMeijerUnsubmitted Not Done Reply Inline Actions This doesn't do what you think it does, I think. This returns the uses of this node in the graph, it doesn't return the def-uses. So the question is for which test-case you need this and why? SjoerdMeijer: This doesn't do what you think it does, I think. This returns the uses of this node in the…
		sushgokhAuthorUnsubmitted Done Reply Inline Actions ->uses() indeed returns users of the node sushgokh: ->uses() indeed returns users of the node

// FMAs with a negated multiplication operand can be commuted.		// FMAs with a negated multiplication operand can be commuted.
def AArch64fmls_p : PatFrags<(ops node:$pred, node:$op1, node:$op2, node:$op3),		def AArch64fmls_p : PatFrags<(ops node:$pred, node:$op1, node:$op2, node:$op3),
[(AArch64fma_p node:$pred, (AArch64fneg_mt node:$pred, node:$op1, (undef)), node:$op2, node:$op3),		[(AArch64fma_p node:$pred, (AArch64fneg_mt node:$pred, node:$op1, (undef)), node:$op2, node:$op3),
(AArch64fma_p node:$pred, node:$op2, (AArch64fneg_mt node:$pred, node:$op1, (undef)), node:$op3)]>;		(AArch64fma_p node:$pred, node:$op2, (AArch64fneg_mt node:$pred, node:$op1, (undef)), node:$op3)]>;

def AArch64fsubr_p : PatFrag<(ops node:$pg, node:$op1, node:$op2),		def AArch64fsubr_p : PatFrag<(ops node:$pg, node:$op1, node:$op2),
(AArch64fsub_p node:$pg, node:$op2, node:$op1)>;		(AArch64fsub_p node:$pg, node:$op2, node:$op1)>;
Show All 20 Lines	def AArch64subr : PatFrag<(ops node:$op1, node:$op2),
(sub node:$op2, node:$op1)>;		(sub node:$op2, node:$op1)>;
def AArch64add_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2),		def AArch64add_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2),
[(int_aarch64_sve_add node:$pred, node:$op1, node:$op2),		[(int_aarch64_sve_add node:$pred, node:$op1, node:$op2),
(add node:$op1, (vselect node:$pred, node:$op2, (SVEDup0)))]>;		(add node:$op1, (vselect node:$pred, node:$op2, (SVEDup0)))]>;
def AArch64sub_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2),		def AArch64sub_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2),
[(int_aarch64_sve_sub node:$pred, node:$op1, node:$op2),		[(int_aarch64_sve_sub node:$pred, node:$op1, node:$op2),
(sub node:$op1, (vselect node:$pred, node:$op2, (SVEDup0)))]>;		(sub node:$op1, (vselect node:$pred, node:$op2, (SVEDup0)))]>;
def AArch64mla_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2, node:$op3),		def AArch64mla_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2, node:$op3),
[(int_aarch64_sve_mla node:$pred, node:$op1, node:$op2, node:$op3),		[(int_aarch64_sve_mla node:$pred, node:$op1, node:$op2, node:$op3)]>;
(add node:$op1, (AArch64mul_p_oneuse node:$pred, node:$op2, node:$op3)),
// add(a, select(mask, mul(b, c), splat(0))) -> mla(a, mask, b, c)
(add node:$op1, (vselect node:$pred, (AArch64mul_p_oneuse (SVEAllActive), node:$op2, node:$op3), (SVEDup0)))]>;
def AArch64mls_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2, node:$op3),		def AArch64mls_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2, node:$op3),
		paulwalker-armUnsubmitted Not Done Reply Inline Actions We have a naming convention here that follows the instruction's name with a letter suffix. In this instance this should be `AArch64mla_p`. paulwalker-arm: We have a naming convention here that follows the instruction's name with a letter suffix. In…
		sushgokhAuthorUnsubmitted Done Reply Inline Actions _p is slightly confusing and can confused with using 'predicated version'. To be specific, I used 'pseudo'. But to be consistent at this point, I will change it as you say sushgokh: _p is slightly confusing and can confused with using 'predicated version'. To be specific, I…
		paulwalker-armUnsubmitted Not Done Reply Inline Actions I guess the comment at the top of the file can be improved but my request to use `AArch64mla_p` is because this is the 'predicated version'. By which I mean this PatFrags represents all the patterns that can be used to represent a predicated mla where the result of the inactive lanes does not matter. The only different between this and `AArch64fmla_p` is the latter has more representations including a dedicated ISD node, but that'll change, for example I expect int_aarch64_sve_mla_u will exist soon. paulwalker-arm: I guess the comment at the top of the file can be improved but my request to use `AArch64mla_p`…
[(int_aarch64_sve_mls node:$pred, node:$op1, node:$op2, node:$op3),		[(int_aarch64_sve_mls node:$pred, node:$op1, node:$op2, node:$op3),
(sub node:$op1, (AArch64mul_p_oneuse node:$pred, node:$op2, node:$op3)),		(sub node:$op1, (AArch64mul_p_oneuse node:$pred, node:$op2, node:$op3)),
// sub(a, select(mask, mul(b, c), splat(0))) -> mls(a, mask, b, c)		// sub(a, select(mask, mul(b, c), splat(0))) -> mls(a, mask, b, c)
(sub node:$op1, (vselect node:$pred, (AArch64mul_p_oneuse (SVEAllActive), node:$op2, node:$op3), (SVEDup0)))]>;		(sub node:$op1, (vselect node:$pred, (AArch64mul_p_oneuse (SVEAllActive), node:$op2, node:$op3), (SVEDup0)))]>;
def AArch64eor3 : PatFrags<(ops node:$op1, node:$op2, node:$op3),		def AArch64eor3 : PatFrags<(ops node:$op1, node:$op2, node:$op3),
[(int_aarch64_sve_eor3 node:$op1, node:$op2, node:$op3),		[(int_aarch64_sve_eor3 node:$op1, node:$op2, node:$op3),
(xor node:$op1, (xor node:$op2, node:$op3))]>;		(xor node:$op1, (xor node:$op2, node:$op3))]>;

▲ Show 20 Lines • Show All 62 Lines • ▼ Show 20 Lines	let Predicates = [HasSVEorSME] in {
defm ADD_ZI : sve_int_arith_imm0<0b000, "add", add>;		defm ADD_ZI : sve_int_arith_imm0<0b000, "add", add>;
defm SUB_ZI : sve_int_arith_imm0<0b001, "sub", sub>;		defm SUB_ZI : sve_int_arith_imm0<0b001, "sub", sub>;
defm SUBR_ZI : sve_int_arith_imm0<0b011, "subr", AArch64subr>;		defm SUBR_ZI : sve_int_arith_imm0<0b011, "subr", AArch64subr>;
defm SQADD_ZI : sve_int_arith_imm0<0b100, "sqadd", saddsat>;		defm SQADD_ZI : sve_int_arith_imm0<0b100, "sqadd", saddsat>;
defm UQADD_ZI : sve_int_arith_imm0<0b101, "uqadd", uaddsat>;		defm UQADD_ZI : sve_int_arith_imm0<0b101, "uqadd", uaddsat>;
defm SQSUB_ZI : sve_int_arith_imm0<0b110, "sqsub", ssubsat>;		defm SQSUB_ZI : sve_int_arith_imm0<0b110, "sqsub", ssubsat>;
defm UQSUB_ZI : sve_int_arith_imm0<0b111, "uqsub", usubsat>;		defm UQSUB_ZI : sve_int_arith_imm0<0b111, "uqsub", usubsat>;

defm MAD_ZPmZZ : sve_int_mladdsub_vvv_pred<0b0, "mad", int_aarch64_sve_mad>;		defm MAD_ZPmZZ : sve_int_mladdsub_vvv_pred<0b0, "mad", int_aarch64_sve_mad, "MLA_ZPmZZ", /isReverseInstr/ 1>;
defm MSB_ZPmZZ : sve_int_mladdsub_vvv_pred<0b1, "msb", int_aarch64_sve_msb>;		defm MSB_ZPmZZ : sve_int_mladdsub_vvv_pred<0b1, "msb", int_aarch64_sve_msb, "MSB_ZPmZZ", /isReverseInstr/ 1>;
defm MLA_ZPmZZ : sve_int_mlas_vvv_pred<0b0, "mla", AArch64mla_m1>;		defm MLA_ZPmZZ : sve_int_mlas_vvv_pred<0b0, "mla", AArch64mla_m1, "MLA_ZPZZZ", "MAD_ZPmZZ">;
defm MLS_ZPmZZ : sve_int_mlas_vvv_pred<0b1, "mls", AArch64mls_m1>;		defm MLS_ZPmZZ : sve_int_mlas_vvv_pred<0b1, "mls", AArch64mls_m1, "MSB_ZPZZZ", "MSB_ZPmZZ">;

		defm MLA_ZPZZZ : sve_int_mladdsub_pseudo ;
		paulwalker-armUnsubmitted Not Done Reply Inline Actions Given you've added all the plumbing for MLS/MSB pseudo nodes please can you add support for `AArch64mls_p` to this patch as well. paulwalker-arm: Given you've added all the plumbing for MLS/MSB pseudo nodes please can you add support for…
		sushgokhAuthorUnsubmitted Done Reply Inline Actions I was reluctant to do since that's unrelated for this patch. I will take this in next patch if you agree sushgokh: I was reluctant to do since that's unrelated for this patch. I will take this in next patch if…
		paulwalker-armUnsubmitted Not Done Reply Inline Actions I don't really see it as unrelated, especially given this patch is updating the pseudo mappings for MLS/MSB, but if that's your preference then fair enough. paulwalker-arm: I don't really see it as unrelated, especially given this patch is updating the pseudo mappings…

		multiclass sve_int_mladdsub<ValueType Ty, ValueType PredTy, string Suffix> {

		let AddedComplexity = 9 in {
		// Zd = Za + Zn * Zm
		def : Pat<(add (AArch64mul_p_oneuse PredTy:$P, Ty:$Zn, Ty:$Zm), Ty:$Za),
		(!cast<Instruction>("MLA_ZPZZZ_UNDEF_"#Suffix) $P, ZPR:$Za, ZPR:$Zn, ZPR:$Zm)>;

		// add(a, select(mask, mul(b, c), splat(0))) -> mla(a, mask, b, c)
		def : Pat<(add (vselect PredTy:$P, (AArch64mul_p_oneuse (SVEAllActive), Ty:$Zn, Ty:$Zm), (SVEDup0)), Ty:$Za),
		(!cast<Instruction>("MLA_ZPZZZ_UNDEF_"#Suffix) $P, ZPR:$Za, ZPR:$Zn, ZPR:$Zm)>;
		}

		}
		defm: sve_int_mladdsub<nxv16i8, nxv16i1, "B">;
		defm: sve_int_mladdsub<nxv8i16, nxv8i1, "H">;
		defm: sve_int_mladdsub<nxv4i32, nxv4i1, "S">;
		defm: sve_int_mladdsub<nxv2i64, nxv2i1, "D">;

		SjoerdMeijerUnsubmitted Not Done Reply Inline Actions Is this not the only change we need? SjoerdMeijer: Is this not the only change we need?
		sushgokhAuthorUnsubmitted Done Reply Inline Actions yes sushgokh: yes
// SVE predicated integer reductions.		// SVE predicated integer reductions.
defm SADDV_VPZ : sve_int_reduce_0_saddv<0b000, "saddv", AArch64saddv_p>;		defm SADDV_VPZ : sve_int_reduce_0_saddv<0b000, "saddv", AArch64saddv_p>;
defm UADDV_VPZ : sve_int_reduce_0_uaddv<0b001, "uaddv", AArch64uaddv_p>;		defm UADDV_VPZ : sve_int_reduce_0_uaddv<0b001, "uaddv", AArch64uaddv_p>;
defm SMAXV_VPZ : sve_int_reduce_1<0b000, "smaxv", AArch64smaxv_p>;		defm SMAXV_VPZ : sve_int_reduce_1<0b000, "smaxv", AArch64smaxv_p>;
defm UMAXV_VPZ : sve_int_reduce_1<0b001, "umaxv", AArch64umaxv_p>;		defm UMAXV_VPZ : sve_int_reduce_1<0b001, "umaxv", AArch64umaxv_p>;
defm SMINV_VPZ : sve_int_reduce_1<0b010, "sminv", AArch64sminv_p>;		defm SMINV_VPZ : sve_int_reduce_1<0b010, "sminv", AArch64sminv_p>;
defm UMINV_VPZ : sve_int_reduce_1<0b011, "uminv", AArch64uminv_p>;		defm UMINV_VPZ : sve_int_reduce_1<0b011, "uminv", AArch64uminv_p>;
defm ORV_VPZ : sve_int_reduce_2<0b000, "orv", AArch64orv_p>;		defm ORV_VPZ : sve_int_reduce_2<0b000, "orv", AArch64orv_p>;
▲ Show 20 Lines • Show All 3,391 Lines • Show Last 20 Lines

llvm/lib/Target/AArch64/SVEInstrFormats.td

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 474 Lines • ▼ Show 20 Lines	class SVE_2_Op_Pred_Any_Predicate<ValueType vtd, SDPatternOperator op,
ValueType pt, ValueType vt1, ValueType vt2,		ValueType pt, ValueType vt1, ValueType vt2,
Instruction inst>		Instruction inst>
: Pat<(vtd (op (pt (SVEAnyPredicate)), vt1:$Op1, vt2:$Op2)),		: Pat<(vtd (op (pt (SVEAnyPredicate)), vt1:$Op1, vt2:$Op2)),
(inst $Op1, $Op2)>;		(inst $Op1, $Op2)>;

class SVE_3_Op_Pat<ValueType vtd, SDPatternOperator op, ValueType vt1,		class SVE_3_Op_Pat<ValueType vtd, SDPatternOperator op, ValueType vt1,
ValueType vt2, ValueType vt3, Instruction inst>		ValueType vt2, ValueType vt3, Instruction inst>
: Pat<(vtd (op vt1:$Op1, vt2:$Op2, vt3:$Op3)),		: Pat<(vtd (op vt1:$Op1, vt2:$Op2, vt3:$Op3)),
(inst $Op1, $Op2, $Op3)>;		(inst $Op1, $Op2, $Op3)>;
		david-armUnsubmitted Not Done Reply Inline Actions I think this is probably because isel chooses the patterns with the highest complexity that it can find, believing that this is a bigger win, although I could be wrong? So if there are other patterns that also match the sequence of DAG nodes they will be chosen first. This suggests there are multiple matching patterns for the cases you are testing. david-arm: I think this is probably because isel chooses the patterns with the highest complexity that it…

multiclass SVE_3_Op_Undef_Pat<ValueType vtd, SDPatternOperator op, ValueType vt1,		multiclass SVE_3_Op_Undef_Pat<ValueType vtd, SDPatternOperator op, ValueType vt1,
ValueType vt2, ValueType vt3, Instruction inst> {		ValueType vt2, ValueType vt3, Instruction inst> {
def : Pat<(vtd (op (vt1 undef), vt2:$Op1, vt3:$Op2)),		def : Pat<(vtd (op (vt1 undef), vt2:$Op1, vt3:$Op2)),
(inst (IMPLICIT_DEF), $Op1, $Op2)>;		(inst (IMPLICIT_DEF), $Op1, $Op2)>;
def : Pat<(vtd (op vt1:$Op1, (vt2 (SVEAllActive:$Op2)), vt3:$Op3)),		def : Pat<(vtd (op vt1:$Op1, (vt2 (SVEAllActive:$Op2)), vt3:$Op3)),
(inst $Op1, $Op2, $Op3)>;		(inst $Op1, $Op2, $Op3)>;
}		}

class SVE_4_Op_Pat<ValueType vtd, SDPatternOperator op, ValueType vt1,		class SVE_4_Op_Pat<ValueType vtd, SDPatternOperator op, ValueType vt1,
ValueType vt2, ValueType vt3, ValueType vt4,		ValueType vt2, ValueType vt3, ValueType vt4,
Instruction inst>		Instruction inst>
: Pat<(vtd (op vt1:$Op1, vt2:$Op2, vt3:$Op3, vt4:$Op4)),		: Pat<(vtd (op vt1:$Op1, vt2:$Op2, vt3:$Op3, vt4:$Op4)),
(inst $Op1, $Op2, $Op3, $Op4)>;		(inst $Op1, $Op2, $Op3, $Op4)>;

class SVE_2_Op_Imm_Pat<ValueType vtd, SDPatternOperator op, ValueType vt1,		class SVE_2_Op_Imm_Pat<ValueType vtd, SDPatternOperator op, ValueType vt1,
		paulwalker-armUnsubmitted Not Done Reply Inline Actions We don't normally use `AddedComplexity` like this unless the containing class has something about it that makes the complexity relevant. This is how it's used above bit for your usage it's better to just reuse `SVE_4_Op_Pat` and set `AddedComplexity` specifically for the important pattern. With all that said for this first patch I'd rather not set `AddedComplexity` at all. This goes hand in hand with my comment relating to the tests. I'd prefer to get the base MLA/MLS pseudo support added thats solves the main MLA/MLS vs MAD/MSB problem and then have follow on work to solve the more nuanced issue of whether to emit "mul followed by add_imm" or "mov_imm followed by mla". paulwalker-arm: We don't normally use `AddedComplexity` like this unless the containing class has something…
		sushgokhAuthorUnsubmitted Done Reply Inline Actions Two points: As said above, will make changes for MLS/MSB in the next patch if you agree. The main motive of this patch was getting 'mad/mla' generated for void imul(uint8_t * restrict dst, uint32_t * restrict src1, uint32_t * restrict src2, int n) { for (int i = 0; i < n; i++) { dst[i] = ( src1[i] * src2[i] + 1 ) ; } } now, coming to complexity issue, I have created custom pattern with increased complexity so that a) mad/mla is generated for above test case a) it has no side-effects for instructions using 'SVE_4_Op_Pat' Having said all this, if you say, I am ready to go by following sequence: Convert MLA/MLS/MAD/MSB to pseudo Looking at which sequence to generate: mad/mla or mul+add/sub Let me know if you still suggest the above sequence. sushgokh: Two points: 1. As said above, will make changes for MLS/MSB in the next patch if you agree. 2.
		paulwalker-armUnsubmitted Not Done Reply Inline Actions Yes I prefer to split the task in two given part (1) is always wanted, but (2) feels more specialised and warrants a different testing strategy. For example, I'm assuming the advantage of using the mla is the mov_imm will be hoisted out of the loop and thus I think that patch will want a test showing this. Whereas this patch only requires simple one line tests to protect the new functionality. Re SVE_4_Op_Psuedo_Pat: I just don't want to see multiple redefinitions of `SVE_4_Op_Pat` each with a different complexity. The need is specific to MLA and thus I prefer see it closer to its definition. That said, I suppose you can extend `SVE_4_Op_Pat` to take an extra parameter that defaults to unset or 0 and then set AddedComplexity within SVE_4_Op_Pat? This all belongs in the follow on patch though given it's not necessary for the base MLA support. paulwalker-arm: Yes I prefer to split the task in two given part (1) is always wanted, but (2) feels more…
		sushgokhAuthorUnsubmitted Done Reply Inline Actions ok thanks. As we agree, will upload a first patch - convert MLA/MAD/MLS/MSB to pseudo second patch - get into more complex mul+add patterns that can be converted to mla/mad/etc. sushgokh: ok thanks. As we agree, will upload a 1. first patch - convert MLA/MAD/MLS/MSB to pseudo 2.
ValueType vt2, Operand ImmTy, Instruction inst>		ValueType vt2, Operand ImmTy, Instruction inst>
: Pat<(vtd (op vt1:$Op1, (vt2 ImmTy:$Op2))),		: Pat<(vtd (op vt1:$Op1, (vt2 ImmTy:$Op2))),
(inst $Op1, ImmTy:$Op2)>;		(inst $Op1, ImmTy:$Op2)>;

class SVE_3_Op_Imm_Pat<ValueType vtd, SDPatternOperator op, ValueType vt1,		class SVE_3_Op_Imm_Pat<ValueType vtd, SDPatternOperator op, ValueType vt1,
ValueType vt2, ValueType vt3, Operand ImmTy,		ValueType vt2, ValueType vt3, Operand ImmTy,
Instruction inst>		Instruction inst>
: Pat<(vtd (op vt1:$Op1, vt2:$Op2, (vt3 ImmTy:$Op3))),		: Pat<(vtd (op vt1:$Op1, vt2:$Op2, (vt3 ImmTy:$Op3))),
▲ Show 20 Lines • Show All 2,599 Lines • ▼ Show 20 Lines	class sve_int_mladdsub_vvv_pred<bits<2> sz8_64, bits<1> opc, string asm,
ZPRRegOp zprty>		ZPRRegOp zprty>
: I<(outs zprty:$Zdn), (ins PPR3bAny:$Pg, zprty:$_Zdn, zprty:$Zm, zprty:$Za),		: I<(outs zprty:$Zdn), (ins PPR3bAny:$Pg, zprty:$_Zdn, zprty:$Zm, zprty:$Za),
asm, "\t$Zdn, $Pg/m, $Zm, $Za",		asm, "\t$Zdn, $Pg/m, $Zm, $Za",
"",		"",
[]>, Sched<[]> {		[]>, Sched<[]> {
bits<3> Pg;		bits<3> Pg;
bits<5> Zdn;		bits<5> Zdn;
bits<5> Za;		bits<5> Za;
bits<5> Zm;		bits<5> Zm;
		SjoerdMeijerUnsubmitted Not Done Reply Inline Actions So why do we need these changes? SjoerdMeijer: So why do we need these changes?
		sushgokhAuthorUnsubmitted Done Reply Inline Actions SVE_4_Mad_Op_Pat is exactly identical to SVE_4_Op_Pat and in fact derived from SVE_4_Op_Pat. The reason for creating seperate pattern just for MAD is assigning priority only to MAD and not to all instructions using SVE_4_Op_Pat sushgokh: SVE_4_Mad_Op_Pat is exactly identical to SVE_4_Op_Pat and in fact derived from SVE_4_Op_Pat.
		SjoerdMeijerUnsubmitted Not Done Reply Inline Actions Ok, thanks. I have applied the patch locally to see what's going on as I can't see the context here. I now see that it makes sense, so I am on board with this change here. I am now looking into your suggestion for `uses`. SjoerdMeijer: Ok, thanks. I have applied the patch locally to see what's going on as I can't see the context…
let Inst{31-24} = 0b00000100;		let Inst{31-24} = 0b00000100;
let Inst{23-22} = sz8_64;		let Inst{23-22} = sz8_64;
let Inst{21} = 0b0;		let Inst{21} = 0b0;
let Inst{20-16} = Zm;		let Inst{20-16} = Zm;
let Inst{15-14} = 0b11;		let Inst{15-14} = 0b11;
let Inst{13} = opc;		let Inst{13} = opc;
let Inst{12-10} = Pg;		let Inst{12-10} = Pg;
let Inst{9-5} = Za;		let Inst{9-5} = Za;
let Inst{4-0} = Zdn;		let Inst{4-0} = Zdn;

let Constraints = "$Zdn = $_Zdn";		let Constraints = "$Zdn = $_Zdn";
let DestructiveInstType = DestructiveOther;		let DestructiveInstType = DestructiveOther;
let ElementSize = zprty.ElementSize;		let ElementSize = zprty.ElementSize;
let hasSideEffects = 0;		let hasSideEffects = 0;
}		}

multiclass sve_int_mladdsub_vvv_pred<bits<1> opc, string asm, SDPatternOperator op> {		multiclass sve_int_mladdsub_vvv_pred<bits<1> opc, string asm, SDPatternOperator op,
def _B : sve_int_mladdsub_vvv_pred<0b00, opc, asm, ZPR8>;		string revname, bit isReverseInstr=0> {
def _H : sve_int_mladdsub_vvv_pred<0b01, opc, asm, ZPR16>;		def _B : sve_int_mladdsub_vvv_pred<0b00, opc, asm, ZPR8>,
def _S : sve_int_mladdsub_vvv_pred<0b10, opc, asm, ZPR32>;		SVEInstr2Rev<NAME # _B, revname # _B, isReverseInstr>;
def _D : sve_int_mladdsub_vvv_pred<0b11, opc, asm, ZPR64>;		def _H : sve_int_mladdsub_vvv_pred<0b01, opc, asm, ZPR16>,
		SVEInstr2Rev<NAME # _H, revname # _H, isReverseInstr>;
		def _S : sve_int_mladdsub_vvv_pred<0b10, opc, asm, ZPR32>,
		SVEInstr2Rev<NAME # _S, revname # _S, isReverseInstr>;
		def _D : sve_int_mladdsub_vvv_pred<0b11, opc, asm, ZPR64>,
		SVEInstr2Rev<NAME # _D, revname # _D, isReverseInstr>;

def : SVE_4_Op_Pat<nxv16i8, op, nxv16i1, nxv16i8, nxv16i8, nxv16i8, !cast<Instruction>(NAME # _B)>;		def : SVE_4_Op_Pat<nxv16i8, op, nxv16i1, nxv16i8, nxv16i8, nxv16i8, !cast<Instruction>(NAME # _B)>;
def : SVE_4_Op_Pat<nxv8i16, op, nxv8i1, nxv8i16, nxv8i16, nxv8i16, !cast<Instruction>(NAME # _H)>;		def : SVE_4_Op_Pat<nxv8i16, op, nxv8i1, nxv8i16, nxv8i16, nxv8i16, !cast<Instruction>(NAME # _H)>;
def : SVE_4_Op_Pat<nxv4i32, op, nxv4i1, nxv4i32, nxv4i32, nxv4i32, !cast<Instruction>(NAME # _S)>;		def : SVE_4_Op_Pat<nxv4i32, op, nxv4i1, nxv4i32, nxv4i32, nxv4i32, !cast<Instruction>(NAME # _S)>;
def : SVE_4_Op_Pat<nxv2i64, op, nxv2i1, nxv2i64, nxv2i64, nxv2i64, !cast<Instruction>(NAME # _D)>;		def : SVE_4_Op_Pat<nxv2i64, op, nxv2i1, nxv2i64, nxv2i64, nxv2i64, !cast<Instruction>(NAME # _D)>;
}		}

class sve_int_mlas_vvv_pred<bits<2> sz8_64, bits<1> opc, string asm,		class sve_int_mlas_vvv_pred<bits<2> sz8_64, bits<1> opc, string asm,
Show All 12 Lines	: I<(outs zprty:$Zda), (ins PPR3bAny:$Pg, zprty:$_Zda, zprty:$Zn, zprty:$Zm),
let Inst{20-16} = Zm;		let Inst{20-16} = Zm;
let Inst{15-14} = 0b01;		let Inst{15-14} = 0b01;
let Inst{13} = opc;		let Inst{13} = opc;
let Inst{12-10} = Pg;		let Inst{12-10} = Pg;
let Inst{9-5} = Zn;		let Inst{9-5} = Zn;
let Inst{4-0} = Zda;		let Inst{4-0} = Zda;

let Constraints = "$Zda = $_Zda";		let Constraints = "$Zda = $_Zda";
let DestructiveInstType = DestructiveOther;		let DestructiveInstType = DestructiveTernaryCommWithRev;
let ElementSize = zprty.ElementSize;		let ElementSize = zprty.ElementSize;
let hasSideEffects = 0;		let hasSideEffects = 0;
}		}

multiclass sve_int_mlas_vvv_pred<bits<1> opc, string asm, SDPatternOperator op> {		multiclass sve_int_mlas_vvv_pred<bits<1> opc, string asm, SDPatternOperator op,
def _B : sve_int_mlas_vvv_pred<0b00, opc, asm, ZPR8>;		string Ps, string revname, bit isReverseInstr=0> {
def _H : sve_int_mlas_vvv_pred<0b01, opc, asm, ZPR16>;		def _B : sve_int_mlas_vvv_pred<0b00, opc, asm, ZPR8>,
def _S : sve_int_mlas_vvv_pred<0b10, opc, asm, ZPR32>;		SVEPseudo2Instr<Ps # _B, 1>, SVEInstr2Rev<NAME # _B, revname # _B, isReverseInstr>;
def _D : sve_int_mlas_vvv_pred<0b11, opc, asm, ZPR64>;		def _H : sve_int_mlas_vvv_pred<0b01, opc, asm, ZPR16>,
		SVEPseudo2Instr<Ps # _H, 1>, SVEInstr2Rev<NAME # _H, revname # _H, isReverseInstr>;
		def _S : sve_int_mlas_vvv_pred<0b10, opc, asm, ZPR32>,
		SVEPseudo2Instr<Ps # _S, 1>, SVEInstr2Rev<NAME # _S, revname # _S, isReverseInstr>;
		def _D : sve_int_mlas_vvv_pred<0b11, opc, asm, ZPR64>,
		SVEPseudo2Instr<Ps # _D, 1>, SVEInstr2Rev<NAME # _D, revname # _D, isReverseInstr>;

def : SVE_4_Op_Pat<nxv16i8, op, nxv16i1, nxv16i8, nxv16i8, nxv16i8, !cast<Instruction>(NAME # _B)>;		def : SVE_4_Op_Pat<nxv16i8, op, nxv16i1, nxv16i8, nxv16i8, nxv16i8, !cast<Instruction>(NAME # _B)>;
def : SVE_4_Op_Pat<nxv8i16, op, nxv8i1, nxv8i16, nxv8i16, nxv8i16, !cast<Instruction>(NAME # _H)>;		def : SVE_4_Op_Pat<nxv8i16, op, nxv8i1, nxv8i16, nxv8i16, nxv8i16, !cast<Instruction>(NAME # _H)>;
def : SVE_4_Op_Pat<nxv4i32, op, nxv4i1, nxv4i32, nxv4i32, nxv4i32, !cast<Instruction>(NAME # _S)>;		def : SVE_4_Op_Pat<nxv4i32, op, nxv4i1, nxv4i32, nxv4i32, nxv4i32, !cast<Instruction>(NAME # _S)>;
def : SVE_4_Op_Pat<nxv2i64, op, nxv2i1, nxv2i64, nxv2i64, nxv2i64, !cast<Instruction>(NAME # _D)>;		def : SVE_4_Op_Pat<nxv2i64, op, nxv2i1, nxv2i64, nxv2i64, nxv2i64, !cast<Instruction>(NAME # _D)>;
}		}

		multiclass sve_int_mladdsub_pseudo {
		def _UNDEF_B : PredThreeOpPseudo<NAME # _B, ZPR8, FalseLanesUndef>;
		def _UNDEF_H : PredThreeOpPseudo<NAME # _H, ZPR16, FalseLanesUndef>;
		def _UNDEF_S : PredThreeOpPseudo<NAME # _S, ZPR32, FalseLanesUndef>;
		def _UNDEF_D : PredThreeOpPseudo<NAME # _D, ZPR64, FalseLanesUndef>;
		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
		paulwalker-armUnsubmitted Done Reply Inline Actions Rather than create a new class please just wrap these line in `let AddedComplexity = 9 in {}` instead. paulwalker-arm: Rather than create a new class please just wrap these line in `let AddedComplexity = 9 in {}`…
// SVE2 Integer Multiply-Add - Unpredicated Group		// SVE2 Integer Multiply-Add - Unpredicated Group
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

class sve2_int_mla<bits<2> sz, bits<5> opc, string asm,		class sve2_int_mla<bits<2> sz, bits<5> opc, string asm,
ZPRRegOp zprty1, ZPRRegOp zprty2>		ZPRRegOp zprty1, ZPRRegOp zprty2>
: I<(outs zprty1:$Zda), (ins zprty1:$_Zda, zprty2:$Zn, zprty2:$Zm),		: I<(outs zprty1:$Zda), (ins zprty1:$_Zda, zprty2:$Zn, zprty2:$Zm),
asm, "\t$Zda, $Zn, $Zm", "", []>, Sched<[]> {		asm, "\t$Zda, $Zn, $Zm", "", []>, Sched<[]> {
bits<5> Zda;		bits<5> Zda;
▲ Show 20 Lines • Show All 6,723 Lines • Show Last 20 Lines

llvm/test/CodeGen/AArch64/sve-gather-scatter-addr-opts.ll

	Show First 20 Lines • Show All 66 Lines • ▼ Show 20 Lines
	;; Negative tests			;; Negative tests

	; Ensure we don't use a "vscale x 4" scatter. Cannot prove that variable stride			; Ensure we don't use a "vscale x 4" scatter. Cannot prove that variable stride
	; will not wrap when shrunk to be i32 based.			; will not wrap when shrunk to be i32 based.
	define void @scatter_f16_index_offset_var(ptr %base, i64 %offset, i64 %scale, <vscale x 4 x i1> %pg, <vscale x 4 x half> %data) #0 {			define void @scatter_f16_index_offset_var(ptr %base, i64 %offset, i64 %scale, <vscale x 4 x i1> %pg, <vscale x 4 x half> %data) #0 {
	; CHECK-LABEL: scatter_f16_index_offset_var:			; CHECK-LABEL: scatter_f16_index_offset_var:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: index z1.d, #0, #1			; CHECK-NEXT: index z1.d, #0, #1
	; CHECK-NEXT: mov z3.d, x1
	; CHECK-NEXT: mov z2.d, z1.d
	; CHECK-NEXT: mov z4.d, z3.d
	; CHECK-NEXT: ptrue p1.d			; CHECK-NEXT: ptrue p1.d
				; CHECK-NEXT: mov z2.d, z1.d
				; CHECK-NEXT: mov z3.d, x1
	; CHECK-NEXT: incd z2.d			; CHECK-NEXT: incd z2.d
	; CHECK-NEXT: mla z3.d, p1/m, z1.d, z3.d			; CHECK-NEXT: mad z1.d, p1/m, z3.d, z3.d
	; CHECK-NEXT: mla z4.d, p1/m, z2.d, z4.d			; CHECK-NEXT: mad z2.d, p1/m, z3.d, z3.d
	; CHECK-NEXT: punpklo p1.h, p0.b			; CHECK-NEXT: punpklo p1.h, p0.b
	; CHECK-NEXT: uunpklo z1.d, z0.s			; CHECK-NEXT: uunpklo z3.d, z0.s
	; CHECK-NEXT: punpkhi p0.h, p0.b			; CHECK-NEXT: punpkhi p0.h, p0.b
	; CHECK-NEXT: uunpkhi z0.d, z0.s			; CHECK-NEXT: uunpkhi z0.d, z0.s
	; CHECK-NEXT: st1h { z1.d }, p1, [x0, z3.d, lsl #1]			; CHECK-NEXT: st1h { z3.d }, p1, [x0, z1.d, lsl #1]
	; CHECK-NEXT: st1h { z0.d }, p0, [x0, z4.d, lsl #1]			; CHECK-NEXT: st1h { z0.d }, p0, [x0, z2.d, lsl #1]
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%t0 = insertelement <vscale x 4 x i64> undef, i64 %offset, i32 0			%t0 = insertelement <vscale x 4 x i64> undef, i64 %offset, i32 0
	%t1 = shufflevector <vscale x 4 x i64> %t0, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer			%t1 = shufflevector <vscale x 4 x i64> %t0, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
	%t2 = insertelement <vscale x 4 x i64> undef, i64 %scale, i32 0			%t2 = insertelement <vscale x 4 x i64> undef, i64 %scale, i32 0
	%t3 = shufflevector <vscale x 4 x i64> %t0, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer			%t3 = shufflevector <vscale x 4 x i64> %t0, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
	%step = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()			%step = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
	%t4 = mul <vscale x 4 x i64> %t3, %step			%t4 = mul <vscale x 4 x i64> %t3, %step
	%t5 = add <vscale x 4 x i64> %t1, %t4			%t5 = add <vscale x 4 x i64> %t1, %t4
	▲ Show 20 Lines • Show All 397 Lines • Show Last 20 Lines

llvm/test/CodeGen/AArch64/sve-int-arith.ll

	Show First 20 Lines • Show All 331 Lines • ▼ Show 20 Lines
	; CHECK-LABEL: uqsub_i8:			; CHECK-LABEL: uqsub_i8:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: uqsub z0.b, z0.b, z1.b			; CHECK-NEXT: uqsub z0.b, z0.b, z1.b
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%res = call <vscale x 16 x i8> @llvm.usub.sat.nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b)			%res = call <vscale x 16 x i8> @llvm.usub.sat.nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b)
	ret <vscale x 16 x i8> %res			ret <vscale x 16 x i8> %res
	}			}

	define <vscale x 16 x i8> @mla_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b, <vscale x 16 x i8> %c) {			define <vscale x 16 x i8> @mad_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b, <vscale x 16 x i8> %c) {
	; CHECK-LABEL: mla_i8:			; CHECK-LABEL: mad_i8:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: ptrue p0.b			; CHECK-NEXT: ptrue p0.b
	; CHECK-NEXT: mla z2.b, p0/m, z0.b, z1.b			; CHECK-NEXT: mad z0.b, p0/m, z1.b, z2.b
	; CHECK-NEXT: mov z0.d, z2.d
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%prod = mul <vscale x 16 x i8> %a, %b			%prod = mul <vscale x 16 x i8> %a, %b
	%res = add <vscale x 16 x i8> %c, %prod			%res = add <vscale x 16 x i8> %c, %prod
	ret <vscale x 16 x i8> %res			ret <vscale x 16 x i8> %res
	}			}

	define <vscale x 16 x i8> @mla_i8_multiuse(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b, <vscale x 16 x i8> %c, <vscale x 16 x i8>* %p) {			define <vscale x 16 x i8> @mla_i8_multiuse(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b, <vscale x 16 x i8> %c, <vscale x 16 x i8>* %p) {
	; CHECK-LABEL: mla_i8_multiuse:			; CHECK-LABEL: mla_i8_multiuse:
	▲ Show 20 Lines • Show All 51 Lines • Show Last 20 Lines

llvm/test/CodeGen/AArch64/sve-multiply-add-accumulate.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=aarch64-unknown-linux-gnu -mattr=+sve < %s \| FileCheck %s

				david-armUnsubmitted Not Done Reply Inline Actions It would be really nice if you could pre-commit these tests without any code changes so that we can see the diff in this patch? david-arm: It would be really nice if you could pre-commit these tests without any code changes so that we…
				sushgokhAuthorUnsubmitted Done Reply Inline Actions Added a patch D142998. Once thats commited, will update this patch sushgokh: Added a patch [[ https://reviews.llvm.org/D142998 \| D142998 ]]. Once thats commited, will…
				define <vscale x 2 x i64> @muladd_i64_positiveAddend(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b)
				; CHECK-LABEL: muladd_i64_positiveAddend:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d
				paulwalker-armUnsubmitted Not Done Reply Inline Actions For this patch I'd rather see a set of much simpler patterns that just exercise the new isel, whereby the mla/mad choice is controlled by the function's operand order (i.e. %a is the multiplicand -> mad, %a is the addend -> mla). You'll see example of this for the equivalent fmla/fmad testing. This should mean all test ( 16 = 4 instructions * 4 element types) where each only emits a single instruction (ignoring the ret). The more nuanced cases should form part of a different patch along with the code changes to improve them. paulwalker-arm: For this patch I'd rather see a set of much simpler patterns that just exercise the new isel…
				sushgokhAuthorUnsubmitted Done Reply Inline Actions will wait for your replies on previous comments sushgokh: will wait for your replies on previous comments
				; CHECK-NEXT: mov z2.d, #0xffffffff
				; CHECK-NEXT: mad z0.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: ret
				{
				%1 = mul <vscale x 2 x i64> %a, %b
				%2 = add <vscale x 2 x i64> %1, shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 4294967295, i64 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer)
				ret <vscale x 2 x i64> %2
				}

				define <vscale x 2 x i64> @muladd_i64_negativeAddend(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b)
				; CHECK-LABEL: muladd_i64_negativeAddend:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov z2.d, #0xffffffff00000001
				; CHECK-NEXT: mad z0.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: ret
				{
				%1 = mul <vscale x 2 x i64> %a, %b
				%2 = add <vscale x 2 x i64> %1, shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 -4294967295, i64 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer)
				ret <vscale x 2 x i64> %2
				}

				define <vscale x 4 x i32> @muladd_i32_positiveAddend(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b)
				; CHECK-LABEL: muladd_i32_positiveAddend:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov z2.s, #0x10000
				; CHECK-NEXT: mad z0.s, p0/m, z1.s, z2.s
				; CHECK-NEXT: ret
				{
				%1 = mul <vscale x 4 x i32> %a, %b
				%2 = add <vscale x 4 x i32> %1, shufflevector (<vscale x 4 x i32> insertelement (<vscale x 4 x i32> poison, i32 65536, i32 0), <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer)
				ret <vscale x 4 x i32> %2
				}

				define <vscale x 4 x i32> @muladd_i32_negativeAddend(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b)
				; CHECK-LABEL: muladd_i32_negativeAddend:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov z2.s, #0xffff0000
				; CHECK-NEXT: mad z0.s, p0/m, z1.s, z2.s
				; CHECK-NEXT: ret
				{
				%1 = mul <vscale x 4 x i32> %a, %b
				%2 = add <vscale x 4 x i32> %1, shufflevector (<vscale x 4 x i32> insertelement (<vscale x 4 x i32> poison, i32 -65536, i32 0), <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer)
				ret <vscale x 4 x i32> %2
				}

				define <vscale x 8 x i16> @muladd_i16_positiveAddend(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b)
				; CHECK-LABEL: muladd_i16_positiveAddend:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov z2.h, #255 // =0xff
				; CHECK-NEXT: mad z0.h, p0/m, z1.h, z2.h
				; CHECK-NEXT: ret
				{
				%1 = mul <vscale x 8 x i16> %a, %b
				%2 = add <vscale x 8 x i16> %1, shufflevector (<vscale x 8 x i16> insertelement (<vscale x 8 x i16> poison, i16 255, i16 0), <vscale x 8 x i16> poison, <vscale x 8 x i32> zeroinitializer)
				ret <vscale x 8 x i16> %2
				}

				define <vscale x 8 x i16> @muladd_i16_negativeAddend(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b)
				; CHECK-LABEL: muladd_i16_negativeAddend:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov z2.h, #-255 // =0xffffffffffffff01
				; CHECK-NEXT: mad z0.h, p0/m, z1.h, z2.h
				; CHECK-NEXT: ret
				{
				%1 = mul <vscale x 8 x i16> %a, %b
				%2 = add <vscale x 8 x i16> %1, shufflevector (<vscale x 8 x i16> insertelement (<vscale x 8 x i16> poison, i16 -255, i16 0), <vscale x 8 x i16> poison, <vscale x 8 x i32> zeroinitializer)
				ret <vscale x 8 x i16> %2
				}

				david-armUnsubmitted Not Done Reply Inline Actions Without seeing the code for these tests before your patch it's hard to know what's changed and whether this is an improvement or not. I copied these tests and ran them manually without your patch and to be honest the code for this function isn't obviously better with the mad: muladd_i16_test4: // @muladd_i16_test4 ptrue p0.h mul z0.h, p0/m, z0.h, z1.h add z0.h, z0.h, #200 // =0xc8 mul z0.h, p0/m, z0.h, z1.h ret I think it's because the add is able to combine the constant into the instruction, whereas mad cannot. Having said that, I can believe that for non-constant cases this might be a win. It might be a good idea to include some tests for non-constants too? david-arm: Without seeing the code for these tests before your patch it's hard to know what's changed and…
				sushgokhAuthorUnsubmitted Done Reply Inline Actions Have added a patch for tests. Refer D142998 sushgokh: Have added a patch for tests. Refer D142998
				define <vscale x 16 x i8> @muladd_i8_positiveAddend(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b)
				; CHECK-LABEL: muladd_i8_positiveAddend:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: mov z2.b, #15 // =0xf
				; CHECK-NEXT: mad z0.b, p0/m, z1.b, z2.b
				; CHECK-NEXT: ret
				{
				%1 = mul <vscale x 16 x i8> %a, %b
				%2 = add <vscale x 16 x i8> %1, shufflevector (<vscale x 16 x i8> insertelement (<vscale x 16 x i8> poison, i8 15, i8 0), <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer)
				ret <vscale x 16 x i8> %2
				}

				define <vscale x 16 x i8> @muladd_i8_negativeAddend(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b)
				; CHECK-LABEL: muladd_i8_negativeAddend:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: mov z2.b, #-15 // =0xfffffffffffffff1
				; CHECK-NEXT: mad z0.b, p0/m, z1.b, z2.b
				; CHECK-NEXT: ret
				{
				%1 = mul <vscale x 16 x i8> %a, %b
				%2 = add <vscale x 16 x i8> %1, shufflevector (<vscale x 16 x i8> insertelement (<vscale x 16 x i8> poison, i8 -15, i8 0), <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer)
				ret <vscale x 16 x i8> %2
				}

				; both mul operands have a use
				define <vscale x 8 x i16> @muladd_generic_test1(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b)
				; CHECK-LABEL: muladd_generic_test1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #200
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov z2.h, w8
				; CHECK-NEXT: mla z2.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: mul z0.h, p0/m, z0.h, z2.h
				; CHECK-NEXT: sub z0.h, z0.h, z1.h
				; CHECK-NEXT: ret
				{
				%1 = mul <vscale x 8 x i16> %a, %b
				%2 = add <vscale x 8 x i16> %1, shufflevector (<vscale x 8 x i16> insertelement (<vscale x 8 x i16> poison, i16 200, i16 0), <vscale x 8 x i16> poison, <vscale x 8 x i32> zeroinitializer)
				%3 = mul <vscale x 8 x i16> %2, %a
				%4 = sub <vscale x 8 x i16> %3, %b
				ret <vscale x 8 x i16> %4
				}

				; only the first mul operand has a use
				define <vscale x 8 x i16> @muladd_generic_test2(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b)
				; CHECK-LABEL: muladd_generic_test2:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #200
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov z2.h, w8
				; CHECK-NEXT: mad z1.h, p0/m, z0.h, z2.h
				; CHECK-NEXT: mul z0.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: ret
				{
				%1 = mul <vscale x 8 x i16> %a, %b
				%2 = add <vscale x 8 x i16> %1, shufflevector (<vscale x 8 x i16> insertelement (<vscale x 8 x i16> poison, i16 200, i16 0), <vscale x 8 x i16> poison, <vscale x 8 x i32> zeroinitializer)
				%3 = mul <vscale x 8 x i16> %2, %a
				ret <vscale x 8 x i16> %3
				}

				; only the second mul operand has a use
				define <vscale x 8 x i16> @muladd_generic_test3(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b)
				; CHECK-LABEL: muladd_generic_test3:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #200
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov z2.h, w8
				; CHECK-NEXT: mad z0.h, p0/m, z1.h, z2.h
				; CHECK-NEXT: mul z0.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: ret
				{
				%1 = mul <vscale x 8 x i16> %a, %b
				%2 = add <vscale x 8 x i16> %1, shufflevector (<vscale x 8 x i16> insertelement (<vscale x 8 x i16> poison, i16 200, i16 0), <vscale x 8 x i16> poison, <vscale x 8 x i32> zeroinitializer)
				%3 = mul <vscale x 8 x i16> %2, %b
				ret <vscale x 8 x i16> %3
				}

				; negative integer splat as one of the addend
				define <vscale x 8 x i16> @muladd_generic_test4(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b)
				; CHECK-LABEL: muladd_generic_test4:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #-200
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov z2.h, w8
				; CHECK-NEXT: mad z0.h, p0/m, z1.h, z2.h
				; CHECK-NEXT: ret
				{
				%1 = mul <vscale x 8 x i16> %a, %b
				%2 = add <vscale x 8 x i16> %1, shufflevector (<vscale x 8 x i16> insertelement (<vscale x 8 x i16> poison, i16 -200, i16 0), <vscale x 8 x i16> poison, <vscale x 8 x i32> zeroinitializer)
				ret <vscale x 8 x i16> %2
				}

				define void @fused_mul_add_in_loop(ptr noalias %a, ptr noalias %b, ptr noalias %c, i32 %n)
				; CHECK-LABEL: fused_mul_add_in_loop:
				; CHECK: // %bb.0: // %entry
				; CHECK-NEXT: cmp w3, #1
				; CHECK-NEXT: b.lt .LBB12_3
				; CHECK-NEXT: // %bb.1: // %for.body.preheader
				; CHECK-NEXT: mov w9, w3
				; CHECK-NEXT: mov x8, xzr
				; CHECK-NEXT: cntw x10
				; CHECK-NEXT: mov z0.s, #1 // =0x1
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: whilelo p1.s, xzr, x9
				; CHECK-NEXT: .LBB12_2: // %vector.body
				; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: ld1w { z1.s }, p1/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z2.s }, p1/z, [x1, x8, lsl #2]
				; CHECK-NEXT: mad z1.s, p0/m, z2.s, z0.s
				; CHECK-NEXT: st1w { z1.s }, p1, [x2, x8, lsl #2]
				; CHECK-NEXT: add x8, x8, x10
				; CHECK-NEXT: whilelo p1.s, x8, x9
				; CHECK-NEXT: b.mi .LBB12_2
				; CHECK-NEXT: .LBB12_3: // %for.cond.cleanup
				; CHECK-NEXT: ret
				{
				entry:
				%cmp9 = icmp sgt i32 %n, 0
				br i1 %cmp9, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %n to i64
				%active.lane.mask.entry = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 %wide.trip.count)
				%0 = tail call i64 @llvm.vscale.i64()
				%1 = shl nuw nsw i64 %0, 2
				br label %vector.body

				vector.body: ; preds = %vector.body, %for.body.preheader
				%index = phi i64 [ 0, %for.body.preheader ], [ %index.next, %vector.body ]
				%active.lane.mask = phi <vscale x 4 x i1> [ %active.lane.mask.entry, %for.body.preheader ], [ %active.lane.mask.next, %vector.body ]
				%2 = getelementptr inbounds i32, ptr %a, i64 %index
				%wide.masked.load = tail call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0(ptr %2, i32 4, <vscale x 4 x i1> %active.lane.mask, <vscale x 4 x i32> poison)
				%3 = getelementptr inbounds i32, ptr %b, i64 %index
				%wide.masked.load12 = tail call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0(ptr %3, i32 4, <vscale x 4 x i1> %active.lane.mask, <vscale x 4 x i32> poison)
				%4 = mul nsw <vscale x 4 x i32> %wide.masked.load12, %wide.masked.load
				%5 = add nsw <vscale x 4 x i32> %4, shufflevector (<vscale x 4 x i32> insertelement (<vscale x 4 x i32> poison, i32 1, i64 0), <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer)
				%6 = getelementptr inbounds i32, ptr %c, i64 %index
				tail call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> %5, ptr %6, i32 4, <vscale x 4 x i1> %active.lane.mask)
				%index.next = add i64 %index, %1
				%active.lane.mask.next = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 %index.next, i64 %wide.trip.count)
				%7 = extractelement <vscale x 4 x i1> %active.lane.mask.next, i64 0
				br i1 %7, label %vector.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %vector.body, %entry
				ret void
				}

				declare i64 @llvm.vscale.i64()
				declare <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64, i64)
				declare <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0(ptr nocapture, i32 immarg, <vscale x 4 x i1>, <vscale x 4 x i32>)
				declare void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32>, ptr nocapture, i32 immarg, <vscale x 4 x i1>)