This is an archive of the discontinued LLVM Phabricator instance.

[MachineCSE] Add an option to enable global CSE
AbandonedPublic

Authored by pcwang-thead on Nov 22 2021, 4:15 AM.

Details

Summary

For some scenarios like optimizing for size, we may want to do
aggressive MachineCSE on the whole function, so we add an option
--enable-aggressive-machine-cse and a target hook to enable this.

Diff Detail

Unit TestsFailed

Event Timeline

pcwang-thead requested review of this revision.Nov 22 2021, 4:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 22 2021, 4:15 AM
lebedev.ri added inline comments.
llvm/lib/CodeGen/MachineCSE.cpp
914–916

You can not modify global variables like that

Amend commit message.

pcwang-thead edited the summary of this revision. (Show Details)Nov 22 2021, 4:21 AM

Do not modify global variable EnableGlobalCSE.

lkail added a comment.EditedNov 22 2021, 5:03 AM

Global is not a good name here. MachineCSE is working on the whole function, which is global in compiler's terminology, by walking through the DominatorTree.

llvm/lib/CodeGen/MachineCSE.cpp
463

Why only apply to this heuristics? Since your intention is reducing size, why not always consider profitable if hasOptSize?

jrtc27 added inline comments.Nov 22 2021, 5:08 AM
llvm/lib/CodeGen/MachineCSE.cpp
55

Putting this behaviour behind a cl::opt is a great way to ensure it's never used...

llvm/test/CodeGen/RISCV/enable-global-cse.ll
1 ↗(On Diff #388880)

This should be a MIR test that runs just MachineCSE

pcwang-thead marked an inline comment as done.Nov 22 2021, 5:51 AM

Global is not a good name here. MachineCSE is working on the whole function, which is global in compiler's terminology, by walking through the DominatorTree.

Can we name it greedy?

llvm/lib/CodeGen/MachineCSE.cpp
55

LOL.

So is there a better way? Maybe we can add a target hook to enable this?

463

You are right, my thought were limited.

llvm/test/CodeGen/RISCV/enable-global-cse.ll
1 ↗(On Diff #388880)

Thanks, I will update it later.

  • Change global to aggressive.
  • Return true immediately if aggressive MachineCSE is enabled.
  • Change test case to MIR test.
pcwang-thead edited the summary of this revision. (Show Details)Nov 23 2021, 12:16 AM

Added dmgreen since embeded-arm should be sensitive to size optimizations.

Thanks. What targets have you tested with this? And what kind of codesize differences have you observed?

pcwang-thead marked 2 inline comments as done.EditedNov 23 2021, 2:35 AM

Thanks. What targets have you tested with this? And what kind of codesize differences have you observed?

RISCV.
(And it seems that I need to modify other targets' tests)

The differences(in scope of functions):

  • Some loads of immediates are redundant.
  • Some loads of global symbols are redundant.
  • etc.

These redundancies are in nonadjacent(non-local?) blocks , so they can't be eliminated according to Heuristics #1 in MachineCSE::isProfitableToCSE.

shchenz added inline comments.
llvm/lib/CodeGen/MachineCSE.cpp
441

If the register pressure is increased, doing more CSEs may introduce register spill/reload and thus it will generate worse code even for optimization for size?

465

Can we estimate the register pressure here to do a more aggressive CSE? If so, we should not limit this only for "optimization for size".

RISCV.
(And it seems that I need to modify other targets' tests)

The differences(in scope of functions):

  • Some loads of immediates are redundant.
  • Some loads of global symbols are redundant.
  • etc.

These redundancies are in nonadjacent(non-local?) blocks , so they can't be eliminated according to Heuristics #1 in MachineCSE::isProfitableToCSE.

OK, that's a good start. I was expected something among the lines of "I have tested RISCV on the llvm test suite or some other large codebase under Oz and it reduced the total codesize by 0.16%".

My experiments on ARM and AArch64 are not as great. This seems to increase codesize more than it reduces it, especially on ARM. The AArch64 numbers were dominated by one large increase, with some of the smaller cases being smaller. I would be interested in what the tests in-tree showed too.

You might want to check X86 as it's easy to run. If I was making target independent changed like this I would expect to test at least a couple of architecture combos (say, X86 with Arm and AArch64 for 32bit and 64bit variants), and potentially add target overrides where needed. In this case the default should maybe be kept as before, unless we have some evidence this is beneficial across most architectures.

craig.topper added inline comments.
llvm/test/CodeGen/RISCV/enable-agressive-machine-cse.mir
1 ↗(On Diff #389104)

Please use update_mir_test_checks.py

  • Address comments.
  • Only apply aggressive CSE to Heuristics 1.
  • Make enableAggressiveMachineCSE return false by default.
  • Remove RISCV MIR test.
lebedev.ri resigned from this revision.Nov 26 2021, 2:01 AM
pcwang-thead added a comment.EditedNov 26 2021, 3:44 AM

OK, that's a good start. I was expected something among the lines of "I have tested RISCV on the llvm test suite or some other large codebase under Oz and it reduced the total codesize by 0.16%".

My experiments on ARM and AArch64 are not as great. This seems to increase codesize more than it reduces it, especially on ARM. The AArch64 numbers were dominated by one large increase, with some of the smaller cases being smaller. I would be interested in what the tests in-tree showed too.

You might want to check X86 as it's easy to run. If I was making target independent changed like this I would expect to test at least a couple of architecture combos (say, X86 with Arm and AArch64 for 32bit and 64bit variants), and potentially add target overrides where needed. In this case the default should maybe be kept as before, unless we have some evidence this is beneficial across most architectures.

Thank you for your nice advice.

I have tested RISCV on SPECINT 2006 under Oz, here is the result:

                code size
400.perlbench    +0.438%
401.bzip2        0%
403.gcc          -1.128%
429.mcf          0%
445.gobmk        -0.221%
456.hmmer        -1.682%
458.sjeng        0%
462.libquantum   0%
464.h264ref      -0.858%
471.omnetpp      -0.616%
473.astar        0%

perlbench got increased code size.

The result may not be convincing with outdated benchmarks, so I tested it on OpenCV codebase.

Most of executable files and libraries had no code size change, while some large files got smaller, like:

opencv_perf_imgproc  -0.069%
opencv_perf_video    -0.288%
opencv_test_calib3d  -0.407%
opencv_test_core     -0.249%
opencv_test_dnn      -0.182%
opencv_test_imgproc  -0.246%
libopencv_imgproc.so -0.247%
……

Besides, third-party libraries used by OpenCV(like libquirc, libwebp, libjpeg-turbo, libtiff, etc.) got smaller code size.
Some small examples of OpenCV increased a few bytes, as a result of increment of register pressure.

I have made aggressive MachineCSE disabled by default, targets may override it if it's profitable.

In fact, I think this work-around can be more elegant via live intervals analysis as @shchenz said. At least, we should do CSE on Extended Basic Blocks instead of local or adjacent blocks.

llvm/lib/CodeGen/MachineCSE.cpp
441

Yes, you are right.

AggressiveMachineCSE should be placed after MayIncreasePressure.

465

Absolutely!

IMO, the key point is that we should do some live range analysis here?

pcwang-thead marked 2 inline comments as done.Dec 15 2021, 6:33 PM

ping

craig.topper added inline comments.Dec 15 2021, 9:35 PM
llvm/lib/CodeGen/MachineCSE.cpp
463

Can this info be cached from runOnMachineFunction? No need to make a virtual call for something that wont' change per instruction.

  • Rebase.
  • Address comment.
pcwang-thead marked an inline comment as done.Dec 15 2021, 10:31 PM
pcwang-thead abandoned this revision.Jun 12 2023, 1:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 12 2023, 1:13 AM
Herald added a subscriber: StephenFan. · View Herald Transcript