This is an archive of the discontinued LLVM Phabricator instance.

[MachineCombiner] Improve MachineCombiner's cost model
Needs ReviewPublic

Authored by Carrot on May 13 2022, 3:40 PM.

Details

Summary

This patch contains the following improvements to MachineCombiner's cost model.

1. Ignore coalescable COPY instructions when computing latency, because they will be deleted after RA.
2. When computing CycleCount, use (FirstDepth + RootLatency) instead of (RootDepth + RootLatency), to avoid double counting instructions when there are multiple instructions in either the old or new instruction sequence.

Diff Detail

Event Timeline

Carrot created this revision. May 13 2022, 3:40 PM
Herald added a project: Restricted Project. · View Herald Transcript May 13 2022, 3:40 PM
Carrot requested review of this revision. May 13 2022, 3:40 PM

Thanks for splitting this out. I've been trying to run some tests.

Do you have any benchmark results? I worry that even though the old model was less correct, it still gave better performance results.

I tested SPEC2006 int on my skylake desktop.

base
400.perlbench    9770        241       40.6 *                                  
401.bzip2        9650        380       25.4 *                                  
403.gcc          8050        198       40.7 *                                  
429.mcf          9120        204       44.8 *                                  
445.gobmk       10490        343       30.6 *                                  
456.hmmer        9330        253       36.9 *                                  
458.sjeng       12100        355       34.1 *                                  
462.libquantum  20720        308       67.3 *                                  
464.h264ref     22130        325       68.1 *                                  
471.omnetpp      6250        239       26.1 *                                  
473.astar        7020        295       23.8 *                                  
483.xalancbmk    6900        144       47.8 *                                  
 Est. SPECint(R)_base2006              38.3

exp
400.perlbench    9770        241       40.6 *                                  
401.bzip2        9650        381       25.3 *                                  
403.gcc          8050        198       40.7 *                                  
429.mcf          9120        202       45.3 *                                  
445.gobmk       10490        342       30.7 *                                  
456.hmmer        9330        253       36.9 *                                  
458.sjeng       12100        354       34.1 *                                  
462.libquantum  20720        318       65.2 *                                  
464.h264ref     22130        324       68.3 *                                  
471.omnetpp      6250        237       26.3 *                                  
473.astar        7020        295       23.8 *                                  
483.xalancbmk    6900        145       47.5 *                                  
 Est. SPECint(R)_base2006              38.2

Only mcf and libquantum show a slightly larger delta; it is caused by run-to-run variation.

base
429.mcf          9120        207       44.1 S                                  
429.mcf          9120        204       44.8 *                                  
429.mcf          9120        198       46.0 S         
462.libquantum  20720        302       68.7 S                                  
462.libquantum  20720        334       62.1 S                                  
462.libquantum  20720        308       67.3 *                                  

exp
429.mcf          9120        202       45.3 *                                  
429.mcf          9120        207       44.0 S                                  
429.mcf          9120        197       46.2 S                                  
462.libquantum  20720        325       63.8 S                                  
462.libquantum  20720        318       65.2 *                                  
462.libquantum  20720        296       70.1 S

So basically no difference.

Thanks - do you have AArch64 numbers? The MachineCombiner does seem to be used on X86, but to a much smaller degree than on AArch64, where there are many more patterns. The reassociation might be useful, but it is the conversion of madd to mul+add and the fmla changes that are more worrying.

I tried SPEC2006 on an AArch64 machine, but due to some unknown problem I couldn't run 464.h264ref correctly; even gcc-compiled code generates the same output.

base
400.perlbench    9770        306       31.9 *
401.bzip2        9650        418       23.1 *
403.gcc          8050        206       39.1 *
429.mcf          9120        332       27.5 *
445.gobmk       10490        337       31.1 *
456.hmmer        9330        275       34.0 *
458.sjeng       12100        391       30.9 *
462.libquantum  20720        214       96.8 *
464.h264ref                                 NR
471.omnetpp      6250        229       27.3 *
473.astar        7020        313       22.4 *
483.xalancbmk    6900        174       39.6 *
 Est. SPECint(R)_base2006              33.6


exp
400.perlbench    9770        306       31.9 *
401.bzip2        9650        418       23.1 *
403.gcc          8050        207       38.9 *
429.mcf          9120        330       27.6 *
445.gobmk       10490        337       31.1 *
456.hmmer        9330        280       33.4 *
458.sjeng       12100        391       31.0 *
462.libquantum  20720        213       97.5 *
464.h264ref                                 NR
471.omnetpp      6250        230       27.2 *
473.astar        7020        313       22.4 *
483.xalancbmk    6900        177       39.1 *
 Est. SPECint(R)_base2006              33.5

The biggest difference is from 456.hmmer. I reran it and got the same result:
456.hmmer        9330        273       34.1 S
456.hmmer        9330        274       34.0 *
456.hmmer        9330        274       34.0 S

So there is no real performance impact on SPEC2006 on AArch64.

Thanks. Is that with -mcpu=native or -mcpu=generic, and on which type of machine was it run? It sounds like the noise level is pretty high, which is a perpetual problem.

There are more accurate results in the .csv file that it generates; the results there are not rounded to whole numbers, so they can be a little more accurate.

The test was run on a Neoverse N1 machine. I didn't specify any -mcpu, so I guess it is -mcpu=generic.

The .csv results are:

base
400.perlbench,9770,305.860896,31.942625,1,S,,,,,NR,"SelectedIteration (base #3)"
401.bzip2,9650,417.556045,23.11067,1,S,,,,,NR,"SelectedIteration (base #1)"
403.gcc,8050,205.940987,39.088868,1,S,,,,,NR,"SelectedIteration (base #2)"
429.mcf,9120,331.917529,27.476705,1,S,,,,,NR,"SelectedIteration (base #3)"
445.gobmk,10490,337.410193,31.089754,1,S,,,,,NR,"SelectedIteration (base #1)"
456.hmmer,9330,274.508242,33.988051,1,S,,,,,NR,"SelectedIteration (base #1)"
458.sjeng,12100,391.184073,30.93173,1,S,,,,,NR,"SelectedIteration (base #1)"
462.libquantum,20720,213.955433,96.842598,1,S,,,,,NR,"SelectedIteration (base #2)"
464.h264ref,,,,,NR,,,,,NR
471.omnetpp,6250,228.927823,27.301181,1,S,,,,,NR,"SelectedIteration (base #2)"
473.astar,7020,313.225501,22.411968,1,S,,,,,NR,"SelectedIteration (base #3)"
483.xalancbmk,6900,174.455772,39.551572,1,S,,,,,NR,"SelectedIteration (base #2)"
SPECint_base2006,33.555777,,33.555777


exp
400.perlbench,9770,306.425665,31.883752,1,S,,,,,NR,"SelectedIteration (base #2)"
401.bzip2,9650,417.524178,23.112434,1,S,,,,,NR,"SelectedIteration (base #1)"
403.gcc,8050,206.76557,38.932981,1,S,,,,,NR,"SelectedIteration (base #3)"
429.mcf,9120,330.285014,27.612515,1,S,,,,,NR,"SelectedIteration (base #3)"
445.gobmk,10490,336.858972,31.140628,1,S,,,,,NR,"SelectedIteration (base #1)"
456.hmmer,9330,279.698504,33.357347,1,S,,,,,NR,"SelectedIteration (base #2)"
458.sjeng,12100,390.854833,30.957785,1,S,,,,,NR,"SelectedIteration (base #3)"
462.libquantum,20720,212.563873,97.476583,1,S,,,,,NR,"SelectedIteration (base #2)"
464.h264ref,,,,,NR,,,,,NR
471.omnetpp,6250,230.126178,27.159014,1,S,,,,,NR,"SelectedIteration (base #3)"
473.astar,7020,313.361278,22.402257,1,S,,,,,NR,"SelectedIteration (base #3)"
483.xalancbmk,6900,176.504437,39.092502,1,S,,,,,NR,"SelectedIteration (base #3)"
SPECint_base2006,33.4708,,33.4708
Matt added a subscriber: Matt. Jun 13 2022, 4:04 PM

Hello - I have been trying to take a look at the problems here, I feel without a lot of success.

The codegen changes that turn madd into mul+add look worse on paper, and the performance results you've quoted seem to show a noisy decrease in performance. That matches the results I have, which don't seem fantastic.

I think in general there shouldn't be any reason for the machine combiner to choose mul+add over madd. (There might be certain times on in-order cores where the mul+add is faster due to the exact scheduling, but the machine combiner isn't considering those characteristics, and we need to consider general codegen even if -mcpu=generic is using an in-order schedule.) A MUL is really a MADD with a WZR addend register, and with late forwarding the MADD should be preferred. We may be fighting the schedule a bit - it doesn't always report the latencies correctly. I think I can look into trying to improve that by simplifying it a little, but that might take some time to get right. And I worry it might not exactly fix the issues if this isn't considering that instructions can have different latencies for each operand.

It sounds to me like the MADD is decoded into two uops, mul and add. Then the mul can execute immediately once the two multiply operands are available. Is this correct?

I agree that in this case MADD is always preferred. But the problem is that the current MachineCombiner latency computation doesn't model this behavior. It's an independent issue; I think it should be solved in another patch.

It sounds to me like the MADD is decoded into two uops, mul and add. Then the mul can execute immediately once the two multiply operands are available. Is this correct?

Not necessarily two micro-ops exactly, but late-forwarding into the add operand of the MADD. The two mul operands are needed on cycle 0, but the add operand is needed on cycle 2. The schedule has this information, but it can easily get tripped up by the way it tries to specify it. Whatever happens here, I think it might be worth trying to simplify that so that it is more correct more of the time, but that involves updating all codegen ever, so it needs to be carefully handled.

I agree that in this case MADD is always preferred. But the problem is that the current MachineCombiner latency computation doesn't model this behavior. It's an independent issue; I think it should be solved in another patch.

I think the old cost model was not great, but the new code is still incorrect for some cases. For the MADD case we don't handle the differing latencies of each input operand. (The latency for the mul operand might be 3, but for the add operand it is 1. The cost just always uses "depth" + 3.) There may also be differences when inserting multiple instructions. The old model was lastinstrdepth + sum(latencies). The new one is firstinstrdepth + sum(latencies), which I feel may not be more accurate if the instructions in InsInstrs have dependencies that are not in InsInstrs.

As a concrete way forward - can you take the isCoalescableCopy change and split it into a separate patch? It can also change the if (!MI->isCopy()) return false; to if (!MI->isCopy()) return MI->isTransient(); and maybe rename the function to isTransient. That should be fine to go separately, I believe, and the isCoalescableCopy makes a nice change. For the rest we may need something that more accurately calculates the final depth for instructions with forwarding, like the MADD case.

Carrot added a comment. Jul 1 2022, 3:05 PM

I think the old cost model was not great, but the new code is still incorrect for some cases. For the MADD case we don't handle the differing latencies of each input operand. (The latency for the mul operand might be 3, but for the add operand it is 1. The cost just always uses "depth" + 3.) There may also be differences when inserting multiple instructions. The old model was lastinstrdepth + sum(latencies). The new one is firstinstrdepth + sum(latencies), which I feel may not be more accurate if the instructions in InsInstrs have dependencies that are not in InsInstrs.

You are right about the comparison. It is also partly mentioned in the comment on the function getLatenciesForInstrSequences:

/// Estimate the latency of the new and original instruction sequence by summing
/// up the latencies of the inserted and deleted instructions. This assumes
/// that the inserted and deleted instructions are dependent instruction chains,
/// which might not hold in all cases.

So the new model, firstinstrdepth + sum(latencies), is not correct if InsInstrs is not a single dependence chain, but it is still correct in many cases.
The old model, lastinstrdepth + sum(latencies), is always wrong for a multiple-instruction sequence InsInstrs, because the latencies of the instructions in the range [0..n-2] are counted in both lastinstrdepth and sum(latencies).

As a concrete way forward - can you take the isCoalescableCopy change and split it into a separate patch? It can also change the if (!MI->isCopy()) return false; to if (!MI->isCopy()) return MI->isTransient(); and maybe rename the function to isTransient. That should be fine to go separately, I believe, and the isCoalescableCopy makes a nice change.

Will do.

For the rest we may need something that more accurately calculates the final depth for instructions with forwarding like the MADD case.

Agree.

Thanks.