This is an archive of the discontinued LLVM Phabricator instance.

[MachineCombiner] Support local strategy for traces
ClosedPublic

Authored by asi-sc on Dec 22 2022, 2:57 AM.

Details

Summary

For in-order cores MachineCombiner makes better decisions when the critical path
is calculated only for the current basic block and does not take into account
other blocks from the trace.

This patch adds a virtual method to TargetInstrInfo to allow each target to decide
which strategy to use.

Depends on D140541

Diff Detail

Event Timeline

asi-sc created this revision.Dec 22 2022, 2:57 AM
asi-sc requested review of this revision.Dec 22 2022, 2:57 AM
asi-sc updated this revision to Diff 484790.Dec 22 2022, 3:33 AM

Sync with changes in parent patches

asi-sc updated this revision to Diff 484799.Dec 22 2022, 4:49 AM

Update RISCV/machine-combiner.ll

fhahn added a subscriber: dmgreen.Jan 10 2023, 12:19 AM

@dmgreen do you know if this would be beneficial for in-order AArch64 cores as well? Or is there a chance that the RISCV models are missing some information?

For in-order cores MachineCombiner makes better decisions when the critical path

@asi-sc did you do any measurements to collect empirical data?

For in-order cores MachineCombiner makes better decisions when the critical path

@asi-sc did you do any measurements to collect empirical data?

I only have performance numbers for microbenchmarks, as execution-time fluctuations on SPEC are higher than the measured change. However, here are some statistics just in case:

Program                         machine-combiner.NumInstCombined
                                results    min-instr    diff
483.xalancbmk/483.xalancbmk      243.00      328.00    35.0%
464.h264ref/464.h264ref          878.00      909.00     3.5%
400.perlbench/400.perlbench      272.00      279.00     2.6%
403.gcc/403.gcc                  946.00      946.00     0.0%
429.mcf/429.mcf                    2.00        2.00     0.0%
473.astar/473.astar                5.00        5.00     0.0%
445.gobmk/445.gobmk             1025.00     1020.00    -0.5%
462.libquantum/462.libquantum     27.00       26.00    -3.7%
458.sjeng/458.sjeng               80.00       76.00    -5.0%
401.bzip2/401.bzip2              135.00      127.00    -5.9%
471.omnetpp/471.omnetpp           12.00       11.00    -8.3%
456.hmmer/456.hmmer              429.00      360.00   -16.1%

I cannot share details, but my testing shows that FLOPS module 7 is 1.5% faster on an in-order RISCV core when the local strategy is used. The test attached to this patch is a minimization of a performance problem in a real application that shows a ~3% performance change between strategies (~1.5% for sifive-u74). MultiSource/Benchmarks/Ptrdist/bc/bc from llvm-test-suite also speeds up by 3%. Other tests from llvm-test-suite show no measurable performance difference.

Apart from execution time, the local strategy reduces compilation time, as traces become smaller. I ran time-report on a random sample and saw no regressions and often improvements of up to 20% of the pass time (the total impact is hardly noticeable).

Gentle ping

I don't see anything wrong with this patch, but I haven't looked at MachineCombiner in a long time, so I'm likely not the best reviewer.

I do want to note a potential alternative - have you looked at using enableAggressiveFMAFusion() in DAGCombiner?
If you flip that switch for RISCV, the sub3 calculation in the last block would become:

	fmadd.d	ft1, ft0, fa1, fa0
	fnmsub.d	fa0, ft0, fa1, ft1

If the fma instruction timing is the same as plain fmul on your target, then is that even better than the 1 fmadd produced on the test here?

I don't see anything wrong with this patch, but I haven't looked at MachineCombiner in a long time, so I'm likely not the best reviewer.

I do want to note a potential alternative - have you looked at using enableAggressiveFMAFusion() in DAGCombiner?
If you flip that switch for RISCV, the sub3 calculation in the last block would become:

	fmadd.d	ft1, ft0, fa1, fa0
	fnmsub.d	fa0, ft0, fa1, ft1

If the fma instruction timing is the same as plain fmul on your target, then is that even better than the 1 fmadd produced on the test here?

@spatel , yeah, there are almost no recent contributions to machine combiner, so I had to find people who contributed to it years ago.

Thanks for the suggestion. I agree it can fix the provided test, but it won't solve the problem in general: when the machine combiner estimates profitability, it bases the decision on the trace gathered according to some strategy. The current implementation supports only one strategy, which chooses multiple blocks from a function. So, when we ask the machine combiner to check whether the suggested pattern is good or bad, an instruction that is really far away (say, 4 BBs and 400 instructions above) may dramatically affect its decision (when it is on the critical path). This works fine for CPUs with a big OOO buffer, but not for a tiny CPU that executes instructions in order. My experiments show that for such CPUs we should calculate the critical path separately for each BB, as instructions in other BBs usually have almost no effect. Does that sound reasonable?

spatel added a comment.EditedJan 30 2023, 6:23 AM

Thanks for the suggestion. I agree it can fix the provided test, but it won't solve the problem in general: when the machine combiner estimates profitability, it bases the decision on the trace gathered according to some strategy. The current implementation supports only one strategy, which chooses multiple blocks from a function. So, when we ask the machine combiner to check whether the suggested pattern is good or bad, an instruction that is really far away (say, 4 BBs and 400 instructions above) may dramatically affect its decision (when it is on the critical path). This works fine for CPUs with a big OOO buffer, but not for a tiny CPU that executes instructions in order. My experiments show that for such CPUs we should calculate the critical path separately for each BB, as instructions in other BBs usually have almost no effect. Does that sound reasonable?

Yes, that seems like a valid approach. However, I have no experience with any recent in-order cores, so it would be interesting to see if the experimental results that you got with sifive-u74 can be replicated for other in-order cores.

A quick scan of in-tree targets with MicroOpBufferSize = 0 says that we could try benchmarking on SiFive7 and a variety of Arm CPUs (M4, M55, Cortex-M7, Cortex-A55, etc).

IIUC, only RISCV is overriding the default with this patch, so no other arch will be affected, and so I don't think we need to hold this patch up waiting for more experimental data.
But does that mean the test difference that you are showing will also occur for SiFive7, Syntacore SCR1, and/or Rocket with this patch? If so, can you add a RUN line to the test file like that?

asi-sc updated this revision to Diff 493914.Feb 1 2023, 5:24 AM

Improve test coverage

asi-sc added a comment.Feb 1 2023, 5:40 AM

But does that mean the test difference that you are showing will also occur for SiFive7, Syntacore SCR1, and/or Rocket with this patch? If so, can you add a RUN line to the test file like that?

I added one more test for Syntacore-SCR1 and SiFive-u74. It shows the desired reassociation when the local strategy is used; instructions reassociated this way demonstrate better ILP.
One thing to note is that the RISC-V reassociation patterns for the machine combiner are not really interesting for Syntacore SCR1, as it is a single-issue CPU and doesn't support the FP extensions. However, when the local strategy is used, the resulting asm is no worse for Syntacore-SCR1 and slightly better for SiFive-u74 (which is dual-issue).

spatel added a comment.Feb 1 2023, 8:05 AM

But does that mean the test difference that you are showing will also occur for SiFive7, Syntacore SCR1, and/or Rocket with this patch? If so, can you add a RUN line to the test file like that?

I added one more test for Syntacore-SCR1 and SiFive-u74. It shows the desired reassociation when the local strategy is used; instructions reassociated this way demonstrate better ILP.
One thing to note is that the RISC-V reassociation patterns for the machine combiner are not really interesting for Syntacore SCR1, as it is a single-issue CPU and doesn't support the FP extensions. However, when the local strategy is used, the resulting asm is no worse for Syntacore-SCR1 and slightly better for SiFive-u74 (which is dual-issue).

Thanks - I don't know anything about RISC-V chips, so I'm deferring to others on that.

The asm diffs in existing test files show that we are affecting the default behavior of RISC-V compiles, right? If those are considered neutral or improvements, then I think this is good to go.

Gerolf added a comment.Feb 1 2023, 5:44 PM

Given that this is RISC-V and under a flag, this LGTM. I would like to see stats on the FMAs plus the changes to the cycle counts on the critical path, and see how the data correlate with your measured run-time performance numbers, and ditto for the current heuristic. This might also help explain the wide variety of results in your SPEC data; your numbers look all over the place. Also, you can probably push your idea further by allowing a parameterized schedule window (e.g. 10 or 15 instructions) rather than a basic block. This would let you catch cases across blocks and should work better for large blocks. Finally, I would not be surprised (just learning from your insights here and guessing) if various in-order processors showed their best performance with different window sizes. All this is just food for thought for additional/future work, though. Cheers!

asi-sc updated this revision to Diff 495085.Feb 6 2023, 4:41 AM

Use MinInstrCount strategy when scheduling model is not specified.

asi-sc added a comment.Feb 6 2023, 4:57 AM

The asm diffs in existing test files show that we are affecting the default behavior of RISC-V compiles, right? If those are considered neutral or improvements, then I think this is good to go.

Thanks for the question. Although the changes were neutral, I decided we shouldn't change the default behavior, at least not in this change, so I updated the patch slightly.

Given that this is RISC-V and under a flag, this LGTM. I would like to see stats on the FMAs plus the changes to the cycle counts on the critical path, and see how the data correlate with your measured run-time performance numbers, and ditto for the current heuristic. This might also help explain the wide variety of results in your SPEC data; your numbers look all over the place. Also, you can probably push your idea further by allowing a parameterized schedule window (e.g. 10 or 15 instructions) rather than a basic block. This would let you catch cases across blocks and should work better for large blocks. Finally, I would not be surprised (just learning from your insights here and guessing) if various in-order processors showed their best performance with different window sizes. All this is just food for thought for additional/future work, though. Cheers!

Using a schedule window is an interesting idea, thank you. It will definitely work better at basic-block boundaries; however, I can imagine basic blocks for which a reasonably small schedule window won't work well because of instruction placement before scheduling. That might be a rare corner case, though. I'll experiment with this idea in my free time.

spatel accepted this revision.Feb 6 2023, 5:15 AM

LGTM for the limited scope of the patch.
I don't know anything about those particular targets, so be prepared to deal with regressions/requests from developers on those platforms. :)

This revision is now accepted and ready to land.Feb 6 2023, 5:15 AM
This revision was landed with ongoing or failed builds.Feb 17 2023, 2:18 AM
This revision was automatically updated to reflect the committed changes.