Page MenuHomePhabricator

[AArch64] Add feature has-fast-fma
AbandonedPublic

Authored by evandro on Aug 16 2016, 3:09 PM.

Details

Reviewers
t.p.northover
Summary

Use FMA combining aggressively if, for a given target, FMA is much faster than FMUL and FADD combined.

Diff Detail

Repository
rL LLVM

Event Timeline

evandro updated this revision to Diff 68265.Aug 16 2016, 3:09 PM
evandro retitled this revision from to [AArch64] Add feature has-fast-fma.
evandro updated this object.
evandro added a reviewer: t.p.northover.
evandro set the repository for this revision to rL LLVM.
evandro added a subscriber: llvm-commits.
MatzeB added a subscriber: MatzeB.Aug 16 2016, 3:15 PM

Can you explain in more detail what this is about? I would expect FMA to be as fast or faster than FMUL+FADD pretty much everywhere (oterhwise why would we bother with an extra instruction).

arsenm added a subscriber: arsenm.Aug 16 2016, 3:48 PM

Can you explain in more detail what this is about? I would expect FMA to be as fast or faster than FMUL+FADD pretty much everywhere (oterhwise why would we bother with an extra instruction).

We have this problem in AMDGPU, the higher precision is slower for f32, I don't know if any AArch64 variants have this issue also

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7378

Why is this check not part of isFMAFasterThanFMulAndFAdd?

evandro added inline comments.Aug 16 2016, 4:35 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7378

I think that they are separate things that each target should decide how to handle them. The combiner does call isFMAFasterThanFMulAndFAdd() to find out if combining multiplies and sums is allowed, but it only folds some extra cases if enableAggressiveFMAFusion() agrees.

So judging from the comments in r218120 that introduced the enableAggressiveFMAFusion() callback, this is for cases where FMAD has similar performance/resource usage as a single FADD (without even considering the FMUL). Maybe we should put that into the description of the target feature?

t.p.northover edited edge metadata.Aug 16 2016, 4:46 PM

The documentation/name of the callback too. It's not ideal to have to look back through history to work out what something really means.

IMHO, the comments for both functions in TargetLowering.h is sufficiently explanatory.

The latency for FMA in most targets seems to fall between the combination of the latencies for FMUL and FADD. In spite of what the original author intended, I observed that the extra folds are worth it if FMA is as quick FMUL instead.

Hi,

I also have concerns here. The TargetLowering hook states:

/// Return true if target always beneficiates from combining into FMA for a
/// given value type. This must typically return false on targets where FMA
/// takes more cycles to execute than *FADD*.

Whereas you say:

In spite of what the original author intended, I observed that the extra folds are worth it if FMA is as quick *FMUL* instead.

Which is correct? Or are you using this hook in a way the hook users don't intend? The wording used is vague and I really think we need to have more detail about what property of Exynos-M1 makes this good for Exynos but not for any other microarchitecture.

Cheers,

James

Hi,

I also have concerns here. The TargetLowering hook states:

/// Return true if target always beneficiates from combining into FMA for a
/// given value type. This must typically return false on targets where FMA
/// takes more cycles to execute than *FADD*.

Whereas you say:

In spite of what the original author intended, I observed that the extra folds are worth it if FMA is as quick *FMUL* instead.

Which is correct? Or are you using this hook in a way the hook users don't intend? The wording used is vague and I really think we need to have more detail about what property of Exynos-M1 makes this good for Exynos but not for any other microarchitecture.

First off, all I can say in this patch is that this feature is beneficial to Exynos M1. I do not know whether it's beneficial or not in other targets.

However, if you look at the foldings in DAGCompiner.cpp performed only if enableAggressiveFMAFusion() is true you will find that FMUL is the pivot of the folding, not FADD, in spite of what the comments states. For instance, whether it should fold into FMA twice, back to back.

As a matter of fact, it seems to me that most "big" targets would benefit from folding into FMA more aggressively.

The latency for FMA in most targets seems to fall between the combination of the latencies for FMUL and FADD. In spite of what the original author intended, I observed that the extra folds are worth it if FMA is as quick FMUL instead.

This looks like a decision to scheduling / cost model. :)

However, if you look at the foldings in DAGCompiner.cpp performed only if enableAggressiveFMAFusion() is true you will find that FMUL is the pivot of the folding, not FADD, in spite of what the comments states.

This makes sense. FMA being faster-than/as-fast-as FADD doesn't make sense.

For instance, whether it should fold into FMA twice, back to back.

This may be remnants of an ARMv7 issue around OOO cores having a hazard for FMA+FMA...

As a matter of fact, it seems to me that most "big" targets would benefit from folding into FMA more aggressively.

I tend to agree, but I don't have evidence. The best evidence would be some benchmark results or scheduling descriptions.

I'm not against this change in itself, but I really think this is a matter for the scheduler, not a target feature.

For instance, whether it should fold into FMA twice, back to back.

This may be remnants of an ARMv7 issue around OOO cores having a hazard for FMA+FMA...

I doubt it, since the folding is done in the middle end.

As a matter of fact, it seems to me that most "big" targets would benefit from folding into FMA more aggressively.

I tend to agree, but I don't have evidence. The best evidence would be some benchmark results or scheduling descriptions.

I'm not against this change in itself, but I really think this is a matter for the scheduler, not a target feature.

I disagree. isFMAFasterThanFMulAndFAdd() is static and there's no good reason why enableAggressiveFMAFusion() cannot be static either. In this sense, this feature is no different than the many of the existing features in AArch64.

Furthermore, I'm afraid that generalizing it to query the scheduling model would expand the scope of this feature beyond the hardware that I have available to test it on.

I doubt it, since the folding is done in the middle end.

Right, ignore me.

I disagree. isFMAFasterThanFMulAndFAdd() is static and there's no good reason why enableAggressiveFMAFusion() cannot be static either. In this sense, this feature is no different than the many of the existing features in AArch64.

I assumed the FMA fusion optimisation could just query the cost model in some way. Overriding the cost model for specific instructions in specific sub-architectures is the most logical solution to me.

Furthermore, I'm afraid that generalizing it to query the scheduling model would expand the scope of this feature beyond the hardware that I have available to test it on.

I understand the constraints, but I think we should design the back-end for all cores and not keep adding special cases to every different decision on every different core.

After all, moving from isCoreX() to hasCoreXFeatureY() seems pointless.

MatzeB added a comment.EditedAug 17 2016, 3:10 PM

I doubt it, since the folding is done in the middle end.

Right, ignore me.

I disagree. isFMAFasterThanFMulAndFAdd() is static and there's no good reason why enableAggressiveFMAFusion() cannot be static either. In this sense, this feature is no different than the many of the existing features in AArch64.

I assumed the FMA fusion optimisation could just query the cost model in some way. Overriding the cost model for specific instructions in specific sub-architectures is the most logical solution to me.

From a design point of view I agree. TargetInstrInfo callbacks appear to mix ISA specifics with cost model/tuning options and it would be cleaner to have the tuning/cost model stuff part of the scheduling Model (or create a new interface called TargetTuning or TargetCostModel).
But we have to discuss this outside of this patch I believe.

Furthermore, I'm afraid that generalizing it to query the scheduling model would expand the scope of this feature beyond the hardware that I have available to test it on.

I understand the constraints, but I think we should design the back-end for all cores and not keep adding special cases to every different decision on every different core.

After all, moving from isCoreX() to hasCoreXFeatureY() seems pointless.

From the looks of it this specific tuning should be applicable to many more cores. Having it available as a feature at least makes it easy for others to test and enable/disable on the cores they feel responsible for. However judging whether the feature would be suitable for a particular core is hard when the description reads "FastFMA/Use FMA aggressively".

From a design point of view I agree. TargetInstrInfo callbacks appear to mix ISA specifics with cost model/tuning options and it would be cleaner to have the tuning/cost model stuff part of the scheduling Model (or create a new interface called TargetTuning or TargetCostModel).
But we have to discuss this outside of this patch I believe.

Should be ok to have the target feature with a FIXME to move it to cost model, and drive the discussion on the list in the meantime.

cheers,
--renato

Hi,

I also have concerns here. The TargetLowering hook states:

/// Return true if target always beneficiates from combining into FMA for a
/// given value type. This must typically return false on targets where FMA
/// takes more cycles to execute than *FADD*.

Whereas you say:

In spite of what the original author intended, I observed that the extra folds are worth it if FMA is as quick *FMUL* instead.

Which is correct? Or are you using this hook in a way the hook users don't intend? The wording used is vague and I really think we need to have more detail about what property of Exynos-M1 makes this good for Exynos but not for any other microarchitecture.

I'm not an expert on the codebase. As far as I can see, returning true here would enable two classes of additional transformations in DAGCombiner.cpp : DAGCombiner::visitFADDForFMACombine

 (fadd (fmul x, y), z) -> (fma x, y, z) 
 (fadd x, (fmul y, z)) -> (fma y, z, x)  
 (fsub (fmul x, y), z) -> (fma x, y, (fneg z)) 
 (fsub x, (fmul y, z)) -> (fma (fneg y), z, x) 

Where the FMUL has multiple uses (single-use is already permitted).

Presumably this is where the "faster than an FADD" comes from. This transform is FMUL + FADD + [use of FMUL] -> FMA + FMUL + [use of FMUL].

Is this really a good optimisation for Exynos-M1? I think this is of an exceptionally unclear benefit. I'd personally be surprised if it was uniformly good as it would seem to increase competition for resources in the processor. I suppose if you were a CPU on which forwarding a result from FMUL to FADD incurred a penalty that FMA didn't face, then this might give you a faster result from the FMA, and if that was on your critical path then you might see a win. But surely other scenarios like the FMUL being on the critical path (and now competing with FMA for multiply resources), or the second operand to the FADD coming from a long latency instruction (so executing the FMUL while you waited would have hidden the latency) are both possible and likely.

I wonder whether really the gains for Exynos-M1 come from the second class of optimisation:

(fadd (fma x, y, (fmul u, v)), z) -> (fma x, y (fma u, v, z))
(fadd x, (fma y, z, (fmul u, v)) -> (fma y, z (fma u, v, x))

These looks generally applicable, as long as forwarding to the accumulator operand has the same cost whether you are coming from an FMA or an FMUL. This one should be good across multiple AArch64 cores.

There is a third class of optimisation, but these relate to folding through extend operations. For AArch64 LookThroughFPExt will return false, so these won't help you.

(fadd (fma x, y, (fpext (fmul u, v))), z) -> (fma x, y, (fma (fpext u), (fpext v), z))

I'd guess that most of your benefit would come from the second class of folds, and that these are likely to be good across microarchitectures. The first class I think are a strange set of optimisations, and it isn't clear to me why that should uniformly be a good fold, even on microarchitectures where you can contrive a scenario where there is a benefit.

From a design point of view I agree. TargetInstrInfo callbacks appear to mix ISA specifics with cost model/tuning options and it would be cleaner to have the tuning/cost model stuff part of the scheduling Model (or create a new interface called TargetTuning or TargetCostModel).
But we have to discuss this outside of this patch I believe.

Should be ok to have the target feature with a FIXME to move it to cost model, and drive the discussion on the list in the meantime.

I agree. At the moment, I am not sure all the traits of all the targets which should be considered in devising the heuristics for such generalization. Yet, it should be a very interesting discussion which I'd love to take part in.

Presumably this is where the "faster than an FADD" comes from. This transform is FMUL + FADD + [use of FMUL] -> FMA + FMUL + [use of FMUL].

There are other cases, such as FADD + FMUL + FMA -> FMA + FMA. Probably a better way to describe the use of enableAggressiveFMAFusion is the relative cost of FMA to FADD and FMUL.

Is this really a good optimisation for Exynos-M1? I think this is of an exceptionally unclear benefit. I'd personally be surprised if it was uniformly good as it would seem to increase competition for resources in the processor. I suppose if you were a CPU on which forwarding a result from FMUL to FADD incurred a penalty that FMA didn't face, then this might give you a faster result from the FMA, and if that was on your critical path then you might see a win. But surely other scenarios like the FMUL being on the critical path (and now competing with FMA for multiply resources), or the second operand to the FADD coming from a long latency instruction (so executing the FMUL while you waited would have hidden the latency) are both possible and likely.

In Exynos M1, FMA uses the same resources as FMUL and saves using the other resources required by FADD, so it tends to be beneficial on it. It's not always a win though, such as when having finer grained FADD and FMUL allows more parallelism, but, in my experiments, they were few such workloads.

I wonder whether really the gains for Exynos-M1 come from the second class of optimisation:

(fadd (fma x, y, (fmul u, v)), z) -> (fma x, y (fma u, v, z))
(fadd x, (fma y, z, (fmul u, v)) -> (fma y, z (fma u, v, x))

These looks generally applicable, as long as forwarding to the accumulator operand has the same cost whether you are coming from an FMA or an FMUL. This one should be good across multiple AArch64 cores.

Indeed.

There is a third class of optimisation, but these relate to folding through extend operations. For AArch64 LookThroughFPExt will return false, so these won't help you.

(fadd (fma x, y, (fpext (fmul u, v))), z) -> (fma x, y, (fma (fpext u), (fpext v), z))

True that.

I'd guess that most of your benefit would come from the second class of folds, and that these are likely to be good across microarchitectures. The first class I think are a strange set of optimisations, and it isn't clear to me why that should uniformly be a good fold, even on microarchitectures where you can contrive a scenario where there is a benefit.

As I said above, FMA may free up the resources used by either FADD or FMUL to other computations. But that will depend on the target.

Presumably this is where the "faster than an FADD" comes from. This transform is FMUL + FADD + [use of FMUL] -> FMA + FMUL + [use of FMUL].

There are other cases, such as FADD + FMUL + FMA -> FMA + FMA. Probably a better way to describe the use of enableAggressiveFMAFusion is the relative cost of FMA to FADD and FMUL.

Yes, as you'll have seen from the rest of my Review. But this transform in particular looks like trouble. Again for clarity:

(fadd (fmul x, y), z) -> (fma x, y, z)

Where the multiply is multiple use (if it is single use, you catch it regardless of enableAgressiveFMAFusion).

So the transformation here always results in both an FMUL and an FMA in the instruction stream imagine:

fmul s0, s0, s0
fadd s1, s1, s0
str s0, [x0]
str s1, [x1]

The transformation above would generate:

fmul s2, s0, s0
fmadd s1, s0, s0, s1
str s2, [x0]
str s1, [x1]

This is why, for these transforms, the hook talks about the relative cost of an FADD to an FMA.

I would imagine for many cores, including Exynos-M1, this would not be a good transform to enable.

Is this really a good optimisation for Exynos-M1?

In Exynos M1, FMA uses the same resources as FMUL and saves using the other resources required by FADD, so it tends to be beneficial on it. It's not always a win though, such as when having finer grained FADD and FMUL allows more parallelism, but, in my experiments, they were few such workloads.

Right. So given that the transformation will increase the traffic to your multiply unit, are you sure it is something you want to enable? As you describe, the FMA is not going to be faster here, and the code generated is objectively worse.

I wonder whether really the gains for Exynos-M1 come from the second class of optimisation:

(fadd (fma x, y, (fmul u, v)), z) -> (fma x, y (fma u, v, z))
(fadd x, (fma y, z, (fmul u, v)) -> (fma y, z (fma u, v, x))

These looks generally applicable, as long as forwarding to the accumulator operand has the same cost whether you are coming from an FMA or an FMUL. This one should be good across multiple AArch64 cores.

Indeed.

We can agree that these are good optimisations. No arguments from me there. In fact, these are good enough that they should be enabled for many AArch64 cores. If these were the only transforms that were enabled by enableAgressiveFMAFusion I'd suggest that you'd actually want your target feature to have the opposite sense to what it has right now - most cores should enjoy these transforms and those that don't could call it out in a cost-model/feature flag.

It seems to me that decoupling the bad transforms from the good, enabling the good transforms for most AArch64 CPUs, and leaving the bad transform off would be the more beneficial approach.

I disagree. isFMAFasterThanFMulAndFAdd() is static and there's no good reason why enableAggressiveFMAFusion() cannot be static either. In this sense, this feature is no different than the many of the existing features in AArch64.

I assumed the FMA fusion optimisation could just query the cost model in some way. Overriding the cost model for specific instructions in specific sub-architectures is the most logical solution to me.

I was going to suggest adding to MachineCombiner, but it's already done for AArch64 with:
https://reviews.llvm.org/rL267328

There was a great list of examples detailing why FMA codegen decisions are difficult in the initial review for that commit:
https://reviews.llvm.org/D18751#402906

So the transformation here always results in both an FMUL and an FMA in the instruction stream imagine:

fmul s0, s0, s0
fadd s1, s1, s0
str s0, [x0]
str s1, [x1]

The transformation above would generate:

fmul s2, s0, s0
fmadd s1, s0, s0, s1
str s2, [x0]
str s1, [x1]

This is why, for these transforms, the hook talks about the relative cost of an FADD to an FMA.

I would imagine for many cores, including Exynos-M1, this would not be a good transform to enable.

Yes it would for in the former case there's a dependency between FMUL and FADD and, though both instructions use different units, there's a stall. In the latter case, though both insns use the same unit, they pipeline back to back, since there's no dependency stall.

Therefore, yes, I am confident enabling it for Exynos M1.

It seems to me that decoupling the bad transforms from the good, enabling the good transforms for most AArch64 CPUs, and leaving the bad transform off would be the more beneficial approach.

Perhaps the transforms could be refined with different cost functions, but it'd be better suited to discuss them when discussing a generalized cost function.

Belated ping^1.

evandro abandoned this revision.Oct 4 2016, 2:00 PM

After more extensive testing, this change proved to be a net tie.