Page MenuHomePhabricator

[MachineCSE] Add an option to enable global CSE
Needs ReviewPublic

Authored by pcwang-thead on Mon, Nov 22, 4:15 AM.

Details

Summary

For some scenarios like optimizing for size, we may want to do
aggressive MachineCSE on the whole function, so we add an option
--enable-aggressive-machine-cse and a target hook to enable this.

Diff Detail

Event Timeline

pcwang-thead created this revision.Mon, Nov 22, 4:15 AM
pcwang-thead requested review of this revision.Mon, Nov 22, 4:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptMon, Nov 22, 4:15 AM
lebedev.ri added inline comments.
llvm/lib/CodeGen/MachineCSE.cpp
914–916

You can not modify global variables like that

Amend commit message.

pcwang-thead edited the summary of this revision. (Show Details)Mon, Nov 22, 4:21 AM

Do not modify global variable EnableGlobalCSE.

lkail added a comment.EditedMon, Nov 22, 5:03 AM

Global is not a good name here. MachineCSE is working on the whole function, which is global in compiler's terminology, by walking through the DominatorTree.

llvm/lib/CodeGen/MachineCSE.cpp
462

Why only apply to this heuristics? Since your intention is reducing size, why not always consider profitable if hasOptSize?

jrtc27 added inline comments.Mon, Nov 22, 5:08 AM
llvm/lib/CodeGen/MachineCSE.cpp
55

Putting this behaviour behind a cl::opt is a great way to ensure it's never used...

llvm/test/CodeGen/RISCV/enable-global-cse.ll
1 ↗(On Diff #388880)

This should be a MIR test that runs just MachineCSE

pcwang-thead marked an inline comment as done.Mon, Nov 22, 5:51 AM

Global is not a good name here. MachineCSE is working on the whole function, which is global in compiler's terminology, by walking through the DominatorTree.

Can we name it greedy?

llvm/lib/CodeGen/MachineCSE.cpp
55

LOL.

So is there a better way? Maybe we can add a target hook to enable this?

462

You are right, my thought were limited.

llvm/test/CodeGen/RISCV/enable-global-cse.ll
1 ↗(On Diff #388880)

Thanks, I will update it later.

  • Change global to aggressive.
  • Return true immediately if aggressive MachineCSE is enabled.
  • Change test case to MIR test.
pcwang-thead edited the summary of this revision. (Show Details)Tue, Nov 23, 12:16 AM

Added dmgreen since embeded-arm should be sensitive to size optimizations.

Thanks. What targets have you tested with this? And what kind of codesize differences have you observed?

pcwang-thead marked 2 inline comments as done.EditedTue, Nov 23, 2:35 AM

Thanks. What targets have you tested with this? And what kind of codesize differences have you observed?

RISCV.
(And it seems that I need to modify other targets' tests)

The differences(in scope of functions):

  • Some loads of immediates are redundant.
  • Some loads of global symbols are redundant.
  • etc.

These redundancies are in nonadjacent(non-local?) blocks , so they can't be eliminated according to Heuristics #1 in MachineCSE::isProfitableToCSE.

shchenz added inline comments.
llvm/lib/CodeGen/MachineCSE.cpp
440

If the register pressure is increased, doing more CSEs may introduce register spill/reload and thus it will generate worse code even for optimization for size?

465

Can we estimate the register pressure here to do a more aggressive CSE? If so, we should not limit this only for "optimization for size".

RISCV.
(And it seems that I need to modify other targets' tests)

The differences(in scope of functions):

  • Some loads of immediates are redundant.
  • Some loads of global symbols are redundant.
  • etc.

These redundancies are in nonadjacent(non-local?) blocks , so they can't be eliminated according to Heuristics #1 in MachineCSE::isProfitableToCSE.

OK, that's a good start. I was expected something among the lines of "I have tested RISCV on the llvm test suite or some other large codebase under Oz and it reduced the total codesize by 0.16%".

My experiments on ARM and AArch64 are not as great. This seems to increase codesize more than it reduces it, especially on ARM. The AArch64 numbers were dominated by one large increase, with some of the smaller cases being smaller. I would be interested in what the tests in-tree showed too.

You might want to check X86 as it's easy to run. If I was making target independent changed like this I would expect to test at least a couple of architecture combos (say, X86 with Arm and AArch64 for 32bit and 64bit variants), and potentially add target overrides where needed. In this case the default should maybe be kept as before, unless we have some evidence this is beneficial across most architectures.

craig.topper added inline comments.
llvm/test/CodeGen/RISCV/enable-agressive-machine-cse.mir
1 ↗(On Diff #389104)

Please use update_mir_test_checks.py

  • Address comments.
  • Only apply aggressive CSE to Heuristics 1.
  • Make enableAggressiveMachineCSE return false by default.
  • Remove RISCV MIR test.
lebedev.ri resigned from this revision.Fri, Nov 26, 2:01 AM
pcwang-thead added a comment.EditedFri, Nov 26, 3:44 AM

OK, that's a good start. I was expected something among the lines of "I have tested RISCV on the llvm test suite or some other large codebase under Oz and it reduced the total codesize by 0.16%".

My experiments on ARM and AArch64 are not as great. This seems to increase codesize more than it reduces it, especially on ARM. The AArch64 numbers were dominated by one large increase, with some of the smaller cases being smaller. I would be interested in what the tests in-tree showed too.

You might want to check X86 as it's easy to run. If I was making target independent changed like this I would expect to test at least a couple of architecture combos (say, X86 with Arm and AArch64 for 32bit and 64bit variants), and potentially add target overrides where needed. In this case the default should maybe be kept as before, unless we have some evidence this is beneficial across most architectures.

Thank you for your nice advice.

I have tested RISCV on SPECINT 2006 under Oz, here is the result:

                code size
400.perlbench    +0.438%
401.bzip2        0%
403.gcc          -1.128%
429.mcf          0%
445.gobmk        -0.221%
456.hmmer        -1.682%
458.sjeng        0%
462.libquantum   0%
464.h264ref      -0.858%
471.omnetpp      -0.616%
473.astar        0%

perlbench got increased code size.

The result may not be convincing with outdated benchmarks, so I tested it on OpenCV codebase.

Most of executable files and libraries had no code size change, while some large files got smaller, like:

opencv_perf_imgproc  -0.069%
opencv_perf_video    -0.288%
opencv_test_calib3d  -0.407%
opencv_test_core     -0.249%
opencv_test_dnn      -0.182%
opencv_test_imgproc  -0.246%
libopencv_imgproc.so -0.247%
……

Besides, third-party libraries used by OpenCV(like libquirc, libwebp, libjpeg-turbo, libtiff, etc.) got smaller code size.
Some small examples of OpenCV increased a few bytes, as a result of increment of register pressure.

I have made aggressive MachineCSE disabled by default, targets may override it if it's profitable.

In fact, I think this work-around can be more elegant via live intervals analysis as @shchenz said. At least, we should do CSE on Extended Basic Blocks instead of local or adjacent blocks.

llvm/lib/CodeGen/MachineCSE.cpp
440

Yes, you are right.

AggressiveMachineCSE should be placed after MayIncreasePressure.

465

Absolutely!

IMO, the key point is that we should do some live range analysis here?