This is an archive of the discontinued LLVM Phabricator instance.

[MachineCombiner][RISCV] Add fmadd/fmsub/fnmsub instructions patterns
ClosedPublic

Authored by asi-sc on Oct 26 2022, 6:14 AM.

Details

Summary

This patch adds tranformation of fmul+fadd/fsub chains to fused multiply
instructions:

  • fmul+fadd->fmadd
  • fmul+fsub->fmsub/fnmsub

We also will try to combine these instructions if the fmul has more than one use
and cannot be deleted. However, removing the dependence between fmul and fadd can
still be profitable, and we rely on machine combiner approximations of scheduling.

Diff Detail

Event Timeline

asi-sc created this revision.Oct 26 2022, 6:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 26 2022, 6:14 AM
asi-sc requested review of this revision.Oct 26 2022, 6:14 AM

Performance impact on Whetstone (double-precision) for sifive-u74 -march=rv64imafdc -O3 -funroll-loops -finline-functions -ffast-math -DDP -mtune=sifive-u74:
N1 +67%
N2 +45%
MWIPS +18%

Baseline

Loop content                  Result              MFLOPS      MOPS   Seconds

N1 floating point     -1.12398255667391900       285.015              0.700
N2 floating point     -1.12187079889295083       224.672              6.220
N3 if then else        1.00000000000000000                5725.464    0.188
N4 fixed point        12.00000000000000000               327505510.400    0.000
N5 sin,cos etc.        0.49902937281518078                  20.516   42.163
N6 floating point      0.99999987890802811       169.612             33.064
N7 assignments         3.00000000000000000                7123.768    0.270
N8 exp,sqrt etc.       0.75100163018453681                  21.097   18.333

MWIPS                                           1030.036            100.938

This patch

Loop content                  Result              MFLOPS      MOPS   Seconds

N1 floating point     -1.12398255667393077       476.923              0.498
N2 floating point     -1.12187079889296992       325.987              5.098
N3 if then else        1.00000000000000000                7110.547    0.180
N4 fixed point        12.00000000000000000               299613459.692    0.000
N5 sin,cos etc.        0.49902937281518367                  19.847   51.836
N6 floating point      0.99999987890802855       307.010             21.725
N7 assignments         3.00000000000000000               28396.673    0.080
N8 exp,sqrt etc.       0.75100163018453681                  21.007   21.897

MWIPS                                           1220.474            101.313
asi-sc updated this revision to Diff 471500.Oct 28 2022, 5:12 AM

Merge debug locations of the original instructions when creating new fused instruction.

A gentle ping

Performance impact on Whetstone (double-precision) for sifive-u74 -march=rv64imafdc -O3 -funroll-loops -finline-functions -ffast-math -DDP -mtune=sifive-u74:
N1 +67%
N2 +45%
MWIPS +18%

Baseline

Loop content                  Result              MFLOPS      MOPS   Seconds

N1 floating point     -1.12398255667391900       285.015              0.700
N2 floating point     -1.12187079889295083       224.672              6.220
N3 if then else        1.00000000000000000                5725.464    0.188
N4 fixed point        12.00000000000000000               327505510.400    0.000
N5 sin,cos etc.        0.49902937281518078                  20.516   42.163
N6 floating point      0.99999987890802811       169.612             33.064
N7 assignments         3.00000000000000000                7123.768    0.270
N8 exp,sqrt etc.       0.75100163018453681                  21.097   18.333

MWIPS                                           1030.036            100.938

This patch

Loop content                  Result              MFLOPS      MOPS   Seconds

N1 floating point     -1.12398255667393077       476.923              0.498
N2 floating point     -1.12187079889296992       325.987              5.098
N3 if then else        1.00000000000000000                7110.547    0.180
N4 fixed point        12.00000000000000000               299613459.692    0.000
N5 sin,cos etc.        0.49902937281518367                  19.847   51.836
N6 floating point      0.99999987890802855       307.010             21.725
N7 assignments         3.00000000000000000               28396.673    0.080
N8 exp,sqrt etc.       0.75100163018453681                  21.007   21.897

MWIPS                                           1220.474            101.313

Do you know why sin, cos and exp, sqrt seems to take longer?

llvm/lib/Target/RISCV/RISCVInstrInfo.cpp
1279

Cache the the result is FADD when you called it the first time?

Do you know why sin, cos and exp, sqrt seems to take longer?

It's a good question, thanks. I should say that the math functions we call are unchanged, so it's definitely not the reason. The caller site for exp/sqrt (N8) also matches between time measurements, so it regressed not because of any change in instructions. One idea I have is that the unrolled loop that executes exp/sqrt is better aligned in the baseline version (0x1d64 vs 0x1d86). What do you think, can this be the reason?

As for sin/cos (N5), machine combiner generates fmadd (leaving fmul as it has one more use) that results in one additional fmv for each loop iteration. The reason for this is the extended liverange of the atan return value.
Baseline

call    atan@plt
fmul.d  fs1, fa0, fs6
...
fmv.d   fa0, fs0
call    sincos@plt
...
fadd.d  fa0, fs1, fs0
...

This patch

call    atan@plt
fmv.d   fs1, fa0           <--- we combined fmadd that uses fa0 but it is placed after sincos call. Move is required to extend fa0 liverange
fmul.d  fs2, fa0, fs8
...
fmv.d   fa0, fs0
call    sincos@plt
...
fmadd.d fa0, fs1, fs8, fs0
...

I do have a draft work that will improve machine combiner logic to deal with this problem. In my opinion, combining instructions that are separated by a call is a doubtful from performance point of view thing and we must do additional check in these situations. However, I don't see how we can fix it exactly in this patch. Do we agree that sometimes additional moves are inserted, their origin is clear, and the problem is likely to be addressed in further patches? I don't expect major performance issues caused by exactly this behavior and didn't observe any. Or maybe there are any suggestions on how to address it in this patch?

asi-sc updated this revision to Diff 475777.Nov 16 2022, 4:40 AM

Address review comments

I do have a draft work that will improve machine combiner logic to deal with this problem. In my opinion, combining instructions that are separated by a call is a doubtful from performance point of view thing and we must do additional check in these situations. However, I don't see how we can fix it exactly in this patch. Do we agree that sometimes additional moves are inserted, their origin is clear, and the problem is likely to be addressed in further patches? I don't expect major performance issues caused by exactly this behavior and didn't observe any. Or maybe there are any suggestions on how to address it in this patch?

I updated my draft patch to handle the situation described. Now locally there is no difference in asm instructions for both Whets-N5 (sin/cos) and Whets-N8(exp/sqrt) comparing to the baseline version. However, there is still the same execution time difference. So, it seems to me as architecture-related thing and not instructions combining issue.

Also, according to llvm-mca report (llvm-mca -march=riscv64 -mcpu=sifive-u74) for the unrolled loop body of N5 (sin/cos), there should not be so dramatic difference (diff in cycles is less than 1%):
baseline

Iterations:        100      
Instructions:      15300    
Total Cycles:      380004   
Total uOps:        18500    
                            
Dispatch Width:    2        
uOps Per Cycle:    0.05     
IPC:               0.04     
Block RThroughput: 560.0

this patch

Iterations:        100
Instructions:      16000  
Total Cycles:      383504 
Total uOps:        19200  

Dispatch Width:    2
uOps Per Cycle:    0.05
IPC:               0.04
Block RThroughput: 567.0
This revision is now accepted and ready to land.Nov 16 2022, 8:51 PM
asi-sc updated this revision to Diff 476043.Nov 17 2022, 1:15 AM

Rebase to fetch precommitted tests

This revision was landed with ongoing or failed builds.Nov 17 2022, 2:28 AM
This revision was automatically updated to reflect the committed changes.