This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Combine predicated FMUL/FADD into FMA
ClosedPublic

Authored by MattDevereau on Oct 12 2021, 6:25 AM.

Details

Summary

[AArch64][SVE] Combine predicated FMUL/FADD into FMA

Combine FADD and FMUL intrinsics into FMA when the result of the FMUL is an FADD operand with only one use and both use the same predicate.

Diff Detail

Event Timeline

MattDevereau created this revision.Oct 12 2021, 6:25 AM
MattDevereau requested review of this revision.Oct 12 2021, 6:25 AM
Herald added a project: Restricted Project.Oct 12 2021, 6:25 AM

ran clang-format

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
718

I'd expect this all to look simpler; please have a go at simplifying. I think you can drop the matching logic above and instead add a condition that checks that AddOp1's predicate matches the Add's predicate.

One issue I have is that both the swap and m_SVEFAdd, as currently written, serve the purpose of testing both operand orders. I think it's important for clarity to do this only once.

724

It seems to me that a check is needed for whether the fast-math flags contain 'contract', and if not, bail out.

757

In addition to the diff suggestion above, I think you could just call this FMLA. (I'd prefer to see FMLA capitalized in this context since it is an abbreviation, so that it matches the style of, e.g., "SVE".)

759

I think I would prefer to see this in the big-switch-of-intrinsics or alternatively in a function called instCombineSVEVectorFAdd. It feels wrong for it to appear inside VectorBinOp since it doesn't apply to any BinOp other than FAdd, so the "hierarchy" is wrong in my view.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
700

Please add a comment around your combine of the form // fold (fadd a (fmul b c)) -> (fma a b c)

718

If you want to use the match syntax above I think you can also bind the fmul with m_And(m_Value(FMul), m_SVEFMul(m_Deferred(p), m_Value(a), m_Value(b))) and then you could grab FMul and c simultaneously with the existing logic.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
724

Please also check that the flags are equal instead of taking their intersection. The fadd might plausibly have a flag which allows more aggressive optimization and contracting it in this way may prevent those optimizations from taking place.

Matt added a subscriber: Matt.Oct 13 2021, 2:49 PM
MattDevereau marked 5 inline comments as done.Oct 20 2021, 5:29 AM
MattDevereau added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
718

I've opted to check that AddOp1's predicate matches the Add's predicate.

724

I'm not sure what you mean here. Since flags is its own fresh variable, isn't the intersection the same as comparing whether the flags are equal? The original flags on the fadd intrinsic will be preserved.

MattDevereau added inline comments.Oct 20 2021, 5:51 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
718

Was this what you meant?

if (!match(&II, m_SVEFAdd(m_Value(p),
                          m_And(m_Value(FMul), m_SVEFMul(m_Deferred(p), m_Value(a), m_Value(b))),
                          m_Value(c)))){
  return None;
}

FMul isn't matching in this expression

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
718

Sorry, I think I meant match(X, m_CombineAnd(A,B)) which means to match both X on A and X on B. I erroneously wrote m_And(A,B) which means to match X on the expression A && B.

See these two:

https://github.com/llvm/llvm-project/blob/3efd2a0bec0298d804f274fcc10ea14431b61de1/llvm/include/llvm/IR/PatternMatch.h#L1122-L1126

https://github.com/llvm/llvm-project/blob/3efd2a0bec0298d804f274fcc10ea14431b61de1/llvm/include/llvm/IR/PatternMatch.h#L194-L218

724

When I wrote 'contracting' I think I was thinking 'intersecting'.

My suggestion is to do if (flags1 == flags2) return None;, which cannot be the same as doing an intersection -- in my proposed case the optimization would not take place. Intersection implies constructing a new set of flags which is different from the flag sets on the inputs. Also, I'm not sure what you mean by 'the original flags on the fadd intrinsic will be preserved' -- we're expecting that this optimization will replace the fadd with the newly constructed fmla, so the fadd is going to be erased.

Restructured logic path by adding instCombineSVEVectorFAdd
Added check for no contract flag when comparing fast flags
Extended matching logic in instCombineSVEVectorFmla to capture Value* FMul

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
724

Sorry, I meant if (flags1 != flags2) return None;

MattDevereau added inline comments.Oct 20 2021, 8:59 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
724

There's no != operator on llvm::FastMathFlags, which is why I used the intersection before. From reading https://llvm.org/docs/LangRef.html#fast-math-flags it seems the contract flag needs some special attention?

Added != operator to FastMathFlags
Compared FastMathFlags for equality instead of taking their intersection for FMLA combines

MattDevereau marked 5 inline comments as done.Oct 21 2021, 3:39 AM

Wrong argument order needs fixing, and some nits. Please also run clang-format.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
700

The fold as written here looks correct to me, but the match in m_SVEFAdd isn't a 1:1 match with this (it is (fadd (fmul a b) c), which is different). The effect is that the resulting fma from the combine has the wrong argument order.

714

Please put a line break after this.

715

Is the dyn_cast needed? You should only need it if you need some methods which are on IntrinsicInst but not Value.

Also, nit, dyn_cast has logic which checks the type of the thing being cast -- this is unnecessary because the type is already constrained by the match() logic above, so if you did need a cast you can write cast<Ty>(Val) instead of dyn_cast.

716

It seems to me that p != II.getOperand(0) should already be checked by m_Deferred?

720

Nit. I think this condition would read a bit more clearly if it were split into two, negating the allow-contract checks:

if (!FAddFlags.allowContract() || !FMulFlags.allowContract())
  return false;

The second condition is a bit hidden on the right compared to having it as a separate if (FAddFlags != FMulFlags) return None;.

766

Nit. The function name mixes FMLA and Fmla; I think it should consistently use FMLA.

Updated some formatting and used more appropriate casting

MattDevereau marked 6 inline comments as done.Oct 21 2021, 5:33 AM
MattDevereau added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
715

replaced the dyn_cast with cast

716

removed it

MattDevereau marked 2 inline comments as done.Oct 21 2021, 8:15 AM
bsmith added inline comments.Oct 22 2021, 6:38 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
719–723

None of this seems to take into account the global fast-math options, i.e. the "unsafe-fp-math"="true" attribute, hence I don't think this optimization can ever be triggered from C, only from directly written IR with the fast flags.

720

Do we not care about Reassociation here also?

722–723

I feel like it might be sufficient to check both operands for reassociation and contraction, rather than doing a full flag check.

MattDevereau added inline comments.Oct 22 2021, 7:05 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
719–723

Compiling foo.c

#include <arm_sve.h>

svfloat16_t fmla_example(svbool_t p, svfloat16_t a, svfloat16_t b, svfloat16_t c) {
  return svadd_f16_m(p, a, svmul_f16_m(p, b, c));
}

with

clang foo.c -S -march=armv8-a+sve -emit-llvm -o - -Ofast

emits

; Function Attrs: mustprogress nofree nosync nounwind readnone uwtable willreturn vscale_range(0,16)
define dso_local <vscale x 8 x half> @fmla_example(<vscale x 16 x i1> %p, <vscale x 8 x half> %a, <vscale x 8 x half> %b, <vscale x 8 x half> %c) local_unnamed_addr #0 {
entry:
  %0 = tail call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %p)
  %1 = tail call fast <vscale x 8 x half> @llvm.aarch64.sve.fmla.nxv8f16(<vscale x 8 x i1> %0, <vscale x 8 x half> %a, <vscale x 8 x half> %b, <vscale x 8 x half> %c)
  ret <vscale x 8 x half> %1
}

for me after implementing this patch

bsmith added inline comments.Oct 22 2021, 7:11 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
719–723

You're right, I screwed up my testcase; I wasn't expecting the intrinsics to gain the fast flag, but they do. It still won't trigger when using the global flag rather than the instruction-level flag, however, but perhaps we don't care much about that given that the instruction-level flags do get used.

Thanks for your review @bsmith, a couple of thoughts on your queries.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
720

My read of the LangRef suggests that contract allows an FMA contraction (but not two of them). So to my eyes this looks sufficient. Please can you clarify your concern?

722–723

I feel like it might be sufficient to check both operands for reassociation and contraction, rather than doing a full flag check.

The full check was chosen because:

  • We're taking two instructions and turning them into one.
  • Therefore we have two sets of flags.
  • We need to put some flags on the resulting op.
  • We could intersect them, but that would result in losing flags.
  • It's conceivable that the lost flags could allow for the whole op to be removed by dead code elimination.
    • (I don't have a concrete example of this in practice, but maybe via nnan/ninf for example?)
  • Therefore, the conservative thing to do is only to apply the optimization if the flags are equal (and to preserve the flags).
  • This isn't a data-driven analysis, but it seems a conservative position to work with for the moment. If there is evidence to support an alternative approach, we'd consider it.

The DAG-level FMA contractions don't bother with this, but they run late in the pipeline.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
720

Oh, on another reading I get it; 'contract' is like a weaker form of 'reassociation'. I also see what you're talking about with respect to the global flags.

Added Global Fusion check before instCombineSVEVectorFMLA
Added check for Reassociation flag on FADD and FMUL for instCombineSVEVectorFMLA

MattDevereau marked 4 inline comments as done.Oct 25 2021, 3:51 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
769

This logic needs tweaking; AllowFusionGlobally is independent of the fast-math checks, so if either is present the optimization should take place.

Also would be good to see some tests covering these cases.

771

Looks as though clang-format is needed.

Moved Global fast-math checks down a level

Removed global flag condition checking for this pass

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
719

It looks like you've dropped the FAddFlags != FMulFlags check; we need a test for that case.

765

stray 's'

Re-added not-equal flags condition
Reformatted tests to catch the not-equal flags condition

MattDevereau marked 2 inline comments as done.Oct 26 2021, 3:22 AM
peterwaller-arm accepted this revision.Oct 26 2021, 6:17 AM
peterwaller-arm added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
700

nit. I just realised we are missing 'p' here, which would serve as a small hint that this exists (and is different from other combines) because of the predicate.

This revision is now accepted and ready to land.Oct 26 2021, 6:17 AM
bsmith accepted this revision.Oct 26 2021, 8:04 AM
This revision was automatically updated to reflect the committed changes.

Sorry @MattDevereau for the late review, but this is the first time I've had a chance to look at the patch, and I think there's an issue that needs fixing.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
702–713

Is this correct? It looks like you're allowing the fmul to be either operand of the fadd, effectively saying fadd(p, a, b) == fadd(p, b, a). Although true for the active lanes, the inactive lanes take the value of the first data operand and so these two forms are not identical.

724

Given the above, I don't think we need AllowReassoc, since we shouldn't be changing the order of operations.

MattDevereau added inline comments.Nov 2 2021, 6:39 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
702–713

That sounds right, but it seems like something that would probably be caught by a test somewhere. I've had a look at FMLA here and it says "Inactive elements in the destination vector register remain unmodified." FADD, on the other hand, places the value of the first data operand in the destination vector, so FMLA and FADD seem to differ here. I'm probably wrong, but from what I can see the addition part of the FMLA can assume fadd(p, a, b) == fadd(p, b, a) while a normal FADD instruction cannot? Either way, it seems like a more extensive set of rules should be considered for this.

724

This seems linked to the other comment, which I think needs a bit more consideration.

peterwaller-arm added inline comments.Nov 2 2021, 6:54 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
702–713

That sounds right, but it seems like something that would probably be caught by a test somewhere.

Not sure what you mean there, can you be more specific? The 'it' is ambiguous: what would be caught, and by which tests?

I've had a look at FMLA [...]

FADD on the otherhand places the value of the first data operand in the destination vector, so FMLA and FADD seem to differ here. I'm probably wrong but from what I can see, the addition part of the FMLA can assume fadd(p, a, b) == fadd(p, b, a) but a normal FADD instruction cannot? Either way it seems like a more extensive set of rules should be considered for this

I agree with what Paul's saying, consider: X = A + (B * C) and Y = (B * C) + A.

In the case of inactive lanes before the combine, X ends up with values from A and Y ends up with values from (B * C).

But the combine as is rewrites both cases to FMLA(A, B, C). FMLA(A, B, C) takes lanes from A in case of an inactive predicate.

Therefore the output does not have lanes from (B * C) for the expression Y, as it should.

So this needs updating, and we can ignore the Reassoc flag.