Page MenuHomePhabricator

[MachineCombiner] Support for floating-point FMA on ARM64

Authored by Gerolf on Apr 3 2016, 11:46 PM.



Patch to evaluate fmul+fadd -> fmadd combines and similar code sequences in the
machine combiner. It adds support for float and double similar to the existing
integer implmentation. The key features of this patch are:

  • DAGCombiner checks whether it should combine greedily or let the machine combiner do the evaluation. This is (initially) only supported for ARM64.
  • It gives preference to throughput over latency: the heuristic used is to combine always in loops, but this can be chosen by the target. For in-order cores latency over throughput might be the better choice
  • Support for fmadd, f(n)msub, fmla, fmls
  • On by default at O3 fast-math
  • Performance: (mostly) single digits gains on kernels and SPEC2006 fp

Diff Detail

Event Timeline

Gerolf updated this revision to Diff 52522.Apr 3 2016, 11:46 PM
Gerolf retitled this revision from to [MachineCombiner] Support for floating-point FMA on ARM64.
Gerolf updated this object.
Gerolf added reviewers: qcolombet, t.p.northover, spatel.
Gerolf added a subscriber: llvm-commits.
jmolloy added a subscriber: jmolloy.Apr 4 2016, 4:58 AM

Hi Gerolf,

At a high level, could you please explain in what situations you expect *not* combining FMUL+FADD->FMA is a benefit? They use the same resource types on every chip I know of, and FMA is shorter in latency in every chip I know of than FMUL+FADD.




For the future: the pattern list is starting to grow quite large. I wonder if in the future we should consider moving the MachineCombinerPatterns to be table-generated?

Gerolf added a subscriber: Gerolf.Apr 4 2016, 1:26 PM

Hi James,

sure, sorry I missed that. I looked at this too long, I guess :-). It is principally the same ‘better ILP' story as for integers. The prototypical idea is this: imagine two fmul operands feeding the fadd. When the two fmul can execute in parallel it can be faster to issue fmul, fmul, fadd rather than fmul, fmadd.


sure, sorry I missed that. I looked at this too long, I guess :-). It is principally the same ‘better ILP' story as for integers. The prototypical idea is this: imagine two fmul operands feeding the fadd. When the two fmul can execute in parallel it can be faster to issue fmul, fmul, fadd rather than fmul, fmadd.

I think this opt's effect is depend on uarchitecture implementation. If some OoO uarchitectures can divide fmadd to small uops like fmul and fadd, this optimization is not worth for that kind of uarchitecture. (It's also not good for code-size. This means there is more overhead with instruction fetch.)

How about making flag for controling this optimization which is controled by uarch or core?

flyingforyou added inline comments.Apr 5 2016, 6:34 PM

Please, remove trailing whitespace.


I did just a very quick scan through the change-set,
and added few minor comments related to styling.

Not ready to comment anything about the general idea of the changes.

Thank you,


There are several misprints here. 'effiicent'->'efficient', 'generated'->'generate',
'instead on'->instead of'.

Regarding the name of the function. Shouldn't it start with a lower case letter?
Also, the name of the function would be a bit more informative if it was

The word 'Evaluate' initially made me think that the code did the replacement of FMAs with MUL/ADD. (Yes, in many cases such un-fusing may be very efficient).


I think it is a good practice to add a descriptive comment telling what functions do,
including the description of the function parameters and returns. Well, at least for the newly added functions.


One statement following an if-statement does not need braces { }.


In my opinion, the function name is a bit confusing.
The comment says about MADD operations (i.e. MUL and ADD), but the function name has only the word 'Multiply'.
The current name makes me think that function is doing some sort of fusion of few MULs into a bigger operation, e.g. a*b*c*d--> some_operation(a,b,c,d).


Yes, I agree. And more generally we need to decide which pattern should be handled by DAGCombiner/Global Isel and what should be moved into the MachineCombiner.


Agree & done, except for the lower case/upper case issue. I picked the spelling consistent with other names in the corresponding header files. I will change all names following the current naming convention in a follow up commit.






Changed to getFMAPatterns()



Gerolf updated this revision to Diff 53839.Apr 14 2016, 9:27 PM

Addresses minor issues like spelling, function names, comments. Is there any
major issue that needs attention?

v_klochkov added a comment.EditedApr 15 2016, 1:50 PM

`I worked on X86-FMA optimization in some other compiler and was switched to LLVM project just recently.

It is unlikely that anything from written below can be implemented in this change-set.
It is good that the new interface asks the target for permissions to generate or not generate
FMAs in DAGCombiner, but it does not help to solve the problems described below.

This should not stop you from adding these changes to LLVM.
Also, do not consider me as a reviewer, The primary reviewers should give their opinion.

Here I just wanted to add some notes regarding Latency-vs-Throughput problem in X86
to let other developers have them in view/attention when they add latency-vs-throughput fixes.

My biggest concern regarding making Latency-vs-Throughput decisions is that
such decisions are often made using just one pattern or DAG, it is not based on the whole loop
analysis (perhaps I am missing something in LLVM).

I provided 4 examples having quite similar code in them.

Example1 - shows that FMAs can be very harmful for performance on Haswell.
Example2 - is similar to Example1, shows that FMAs can be harmful on Haswell and newer CPUs like Skylake.
           It also shows that it is often enough to replace only 1 FMA to fix the problem and leave other FMAs.
Example3 - shows that solutions for Example1 and Example2 can easily be wrong.
Example4 - shows that there is no ONE solution like "tune for throughput" or "tune for latency"
           exists, and tuning may be different for different DAGs in one loop.

Ok, let's start...

Fusing MUL+ADD into FMA may easily be inefficient at Out-Of-Order CPUs.
The following trivial loop works about 60-70% slower on Haswell(-march=core-avx2) if FMA is generated.

!NOTE: Please assume that the C code below only represents the structure of the final ASM code
(i.e. the loop is not unrolled, etc.)

  // LOOP1
  for (unsigned i = 0; i < N; i++) {
    accu = a[i] * b + accu;// ACCU = FMA(a[i],b,ACCU)
  with FMAs: The latency of the whole loop on Haswell is N*Latency(FMA) = N*5  
  without FMAs: The latency of the whole loop on Haswell is N*Latency(ADD) = N*3
              MUL operation adds nothing as it is computed out-of-order,
			  i.e. the result of MUL is always available when it is ready to be consumed by ADD.

Having FMAs for such loop may result in (N*5)/(N*3) = (5/3) = 1.67x slowdown
comparing to the code without FMAs.

On SkyLake(CPUs with AVX512) both version of LOOP1 (with and without FMA) would
work the same time because the latency of ADD is equal to latency of FMA there.

The same problem still can be easily reproduced on SkyLake even though the
latencies of MUL/ADD/FMA are all equal there:

// LOOP2
for (unsigned i = 0; i < N; i++) {
  accu = a[i] * b + c[i] * d + accu;

There may be at least 3 different sequences for the LOOP2:
S1: 2xFMAs: ACCU = FMA(a[i],b,FMA(c[i],d,ACCU); LATENCY = 2xLAT(FMA) = 2*4
S2: 0xFMAs: ACCU = ADD(ADD(MUL(a[i],b),MUL(c[i],d)),ACCU)
LATENCY = 2xLAT(ADD) = 2*4
S3: 1xFMA: ACCU = ADD(ACCU, FMA(a[i],b,MUL(c[i]*d))) // LATENCY = 1xLAT(ADD) = 4

In (S3) the MUL and FMA operations do not add anything to the latency of the whole expression
because Out-Of-Order CPU has enough execution units to prepare the results of MUL and FMA
before they are ready to be consumed by ADD.
So (S3) would be about 2 times faster on SkyLake and up to 3.3 times faster on Haswell.

It shows that the heuristics that could be implemented for Example1 and Example2
may be wrong if applied without the whole loop analysis.

// LOOP3
for (unsigned i = 0; i < N; i++) {
  accu1 = a1[i] * b + c1[i] * d + accu1;
  accu2 = a2[i] * b + c2[i] * d + accu2;
  accu3 = a3[i] * b + c3[i] * d + accu3;
  accu4 = a4[i] * b + c4[i] * d + accu4;
  accu5 = a5[i] * b + c5[i] * d + accu5;
  accu6 = a6[i] * b + c6[i] * d + accu6;
  accu7 = a7[i] * b + c7[i] * d + accu7;
  accu8 = a8[i] * b + c8[i] * d + accu8;

This loop must be tuned for throughput because there are many independent DAGs
putting high pressure on the CPU execution units.
The sequence (S1) from example2 is the best solution for all accumulators in LOOP3:
"ACCUi = FMA(ai[i] * b, FMA(ci[i] * d, ACCUi)".
It works faster because the loop is bounded by throughput.

On SkyLake:

T = approximate throughput of the loop counted in clock-ticks = 
  N * 16 operations / 2 execution units = N*8
L = latency of the loop = 
  N * 2*Lat(FMA) = N*2*4 = N*8

The time spent in such loop is MAX(L,T) = MAX(N*8, N*8).

The attempts to replace FMAs with MUL and ADD may reduce (L), but will increase (T),
the time spent in the loop is MAX(L,T) will only be bigger.

There may be mixed tuning, i.e. for both throughput and latency in one loop:

// LOOP4
for (unsigned i = 0; i < N; i++) {
  accu1 = a1[i] * b + c1[i] * d + accu1; // tune for latency
  accu2 = a2[i] * b + accu2; // tune for throughput
  accu3 = a3[i] * b + accu3; // tune for throughput
  accu4 = a4[i] * b + accu4; // tune for throughput
  accu5 = a5[i] * b + accu5; // tune for throughput
  accu6 = a6[i] * b + accu6; // tune for throughput

On Haswell:
If generate 2 FMAs for ACCU1 and 1 FMA for each of ACCU2,..6, then

Latency of the loop is L = N*2*Latency(FMA) = N*2*5   
Throughput T = N * 7 / 2
MAX (L,T) = N*10

Using 1xMUL+1xFMA+1xADD for ACCU1 will reduce the latency L from N*2*5 to

L = N*Latency(FMA) = N*5,
and will only slightly increase T from N*3.5 to 
T = N * 8 operations / 2 execution units = N*4

As a result using sequence (S3) will reduce MAX(L,T) from N*10 to MAX(N*5,N*4) = N*5.

Splitting FMAs in ACCU2,..6 will only increase MAX(L,T).

L = N*Latency(ADD) = N*3
T = N * 13 operations / 2 = N*6.5
MAX(L,T) = MAX(N*3, N*6.5) = N*6.5.

So, the best solution in example4 is to split 1 FMA in ACCU1, but keep all other FMAs.

qcolombet edited edge metadata.Apr 15 2016, 3:26 PM

Hi Gerolf,

I haven’t looked into neither the machine combiner or the aarch64 specific changes, but here are some thoughts on DAG related parts.

Let me know if you want me to look into the other parts as well.



Shouldn’t the comment say something like: return true if we want to delegate the generation of fma to the machine combiner?
With the current comment I fell like this is a shift between what the comment said and what the hook is used for:

  • From the name of the method, I would expect the decision to generate fma is delegated to the machine combiner.
  • From the comment, it seems I need to return true if my target supports FMA and the fact that this combining is done by the machine combiner is almost anecdotic here.

Take the optimization level as input of the hook to give more control to the target.


A remark for the future.

Would it make sense to have this hook take an enum or something of the combines we want to delegate to the machine combiner?
We don’t have to decide now, but I believe the approach could be generalized to other hooks and we do not want one hook per instruction/combine.


Could we have a property on the pattern instead?
What are the other kinds of patterns?

The machine combiner optimize either for throughput or latency, something else?


Capital letter for the first letter per the coding standard.


No curly braces per the coding standard.


Since this is already accessible through the DAG and we use that only twice, I am not sure it is worth adding a field for it.
I don’t feel strongly about it though.


Query the opt level from the STI hook.
That way the target has a better control on when it wants to move the combine to the machine combiners.

Gerolf updated this revision to Diff 54452.Apr 20 2016, 7:53 PM
Gerolf edited edge metadata.

Comments and local handling of optlevel as Quentin suggested.

qcolombet accepted this revision.Apr 21 2016, 11:05 AM
qcolombet edited edge metadata.

Hi Gerolf,

LGTM for the non-target, non-machine combiner changes.

Make sure to follow LLVM coding style (the few things I’ve highlighted previously) and please run opt -instnamer on the test cases.



For the sake of history summarizing the discussion Gerolf and I had offline.
Long term, we should be smart enough to compute this information by ourselves using the scheduling model. To move forward in the meantime, we have this hook to have the targets emulate that with their own heuristics.

This revision is now accepted and ready to land.Apr 21 2016, 11:05 AM
Gerolf marked 3 inline comments as done.Apr 21 2016, 7:20 PM

Thanks, Quentin! I should have addressed all your concerns.




Yes, thanks for that note! I filed rdar://25867823 for the throughput API.


No preference either. I just made it look like TLI and expect more accesses to come.

Gerolf marked an inline comment as done.Apr 23 2016, 10:26 PM

Thanks, Quentin! I should have addressed all your concerns.


That should have been Agree. TBD.


For the record I removed the field and access it through the DAG. It turned out that the field for some target could result in a null reference.

spatel closed this revision.Apr 26 2016, 11:45 AM
spatel edited edge metadata.

Closing. The patch was committed after a bug fix at:

evandro added a subscriber: evandro.Oct 3 2016, 8:15 AM