For some scenarios like optimizing for size, we may want to do
aggressive MachineCSE on the whole function, so we add an option
--enable-aggressive-machine-cse and a target hook to enable this.
Details
- Reviewers
lkail anton-afanasyev dmgreen lebedev.ri
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
914–916 | You can not modify global variables like that |
Global is not a good name here. MachineCSE is working on the whole function, which is global in compiler's terminology, by walking through the DominatorTree.
llvm/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
463 | Why apply this only to this one heuristic? Since your intention is to reduce size, why not always consider CSE profitable if hasOptSize? |
- Change global to aggressive.
- Return true immediately if aggressive MachineCSE is enabled.
- Change test case to MIR test.
Thanks. What targets have you tested with this? And what kind of codesize differences have you observed?
RISCV.
(And it seems that I need to modify other targets' tests)
The differences (within individual functions):
- Some loads of immediates are redundant.
- Some loads of global symbols are redundant.
- etc.
These redundancies are in nonadjacent (non-local?) blocks, so they can't be eliminated under Heuristic #1 in MachineCSE::isProfitableToCSE.
llvm/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
441 | If the register pressure is increased, doing more CSE may introduce register spills/reloads, and thus generate worse code even when optimizing for size? | |
465 | Can we estimate the register pressure here to do a more aggressive CSE? If so, we should not limit this only for "optimization for size". |
OK, that's a good start. I was expecting something along the lines of "I have tested RISCV on the llvm test suite or some other large codebase under Oz and it reduced the total codesize by 0.16%".
My experiments on ARM and AArch64 are not as great. This seems to increase codesize more than it reduces it, especially on ARM. The AArch64 numbers were dominated by one large increase, with some of the smaller cases being smaller. I would be interested in what the tests in-tree showed too.
You might want to check X86 as it's easy to run. If I were making target-independent changes like this, I would expect to test at least a couple of architecture combos (say, X86 with Arm and AArch64 for 32bit and 64bit variants), and potentially add target overrides where needed. In this case the default should maybe be kept as before, unless we have some evidence this is beneficial across most architectures.
llvm/test/CodeGen/RISCV/enable-agressive-machine-cse.mir | ||
---|---|---|
1 ↗ | (On Diff #389104) | Please use update_mir_test_checks.py |
- Address comments.
- Only apply aggressive CSE to Heuristics 1.
- Make enableAggressiveMachineCSE return false by default.
- Remove RISCV MIR test.
Thank you for your nice advice.
I have tested RISCV on SPECINT 2006 under Oz, here is the result:
Benchmark | Code size change |
---|---|
400.perlbench | +0.438% |
401.bzip2 | 0% |
403.gcc | -1.128% |
429.mcf | 0% |
445.gobmk | -0.221% |
456.hmmer | -1.682% |
458.sjeng | 0% |
462.libquantum | 0% |
464.h264ref | -0.858% |
471.omnetpp | -0.616% |
473.astar | 0% |
Only perlbench increased in code size.
Since results from outdated benchmarks may not be convincing, I also tested it on the OpenCV codebase.
Most executables and libraries had no code size change, while some large files got smaller, for example:
File | Code size change |
---|---|
opencv_perf_imgproc | -0.069% |
opencv_perf_video | -0.288% |
opencv_test_calib3d | -0.407% |
opencv_test_core | -0.249% |
opencv_test_dnn | -0.182% |
opencv_test_imgproc | -0.246% |
libopencv_imgproc.so | -0.247% |
……
In addition, third-party libraries used by OpenCV (like libquirc, libwebp, libjpeg-turbo, libtiff, etc.) got smaller.
Some small OpenCV examples grew by a few bytes as a result of increased register pressure.
I have disabled aggressive MachineCSE by default; targets may override this if it's profitable.
In fact, I think this work-around could be made more elegant with live interval analysis, as @shchenz said. At the very least, we should do CSE on Extended Basic Blocks instead of only local or adjacent blocks.
llvm/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
441 | Yes, you are right. AggressiveMachineCSE should be placed after MayIncreasePressure. | |
465 | Absolutely! IMO, the key point is that we should do some live range analysis here? |
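The ordering agreed on for line 441 can be sketched with a small standalone model. This is not the code from MachineCSE.cpp: `CSECandidate` and its fields are hypothetical stand-ins for the facts the real function would query, and only Heuristic #1 is modelled. The point is that the register pressure check comes first and the aggressive setting bypasses only Heuristic #1.

```cpp
// Hypothetical summary of one CSE candidate (not an LLVM type).
struct CSECandidate {
  bool MayIncreasePressure; // reusing the earlier def may extend a live range
  bool IsCheap;             // e.g. a load of an immediate or a global symbol
  bool DefIsLocalOrInPred;  // def is in the same block or an immediate predecessor
};

// Simplified model of the profitability decision discussed in this review.
bool isProfitableToCSE(const CSECandidate &C, bool AggressiveMachineCSE) {
  // Pressure check first: if CSE cannot increase pressure, accept.
  if (!C.MayIncreasePressure)
    return true;

  // Heuristic #1 (modelled): reject cheap computations whose def is neither
  // local nor in an immediate predecessor. Aggressive mode skips only this.
  if (!AggressiveMachineCSE && C.IsCheap && !C.DefIsLocalOrInPred)
    return false;

  // The remaining heuristics are omitted from this sketch.
  return true;
}
```

With this shape, a redundant immediate load in a nonadjacent block is rejected by default and accepted only when the aggressive setting is on.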
llvm/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
463 | Can this info be cached from runOnMachineFunction? No need to make a virtual call for something that won't change per instruction. |
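A minimal sketch of the caching suggested above, as a standalone model rather than LLVM code. `TargetHooks`, `AggressiveTarget` and `CSEPassModel` are hypothetical names, and combining the command-line flag with the hook using a logical OR is an assumption; the review does not say how the two interact.

```cpp
// Hypothetical stand-in for the target hook interface (not an LLVM class).
struct TargetHooks {
  virtual ~TargetHooks() = default;
  // Off by default, as in the updated patch; targets may override.
  virtual bool enableAggressiveMachineCSE() const { return false; }
};

// Hypothetical target that opts in.
struct AggressiveTarget : TargetHooks {
  bool enableAggressiveMachineCSE() const override { return true; }
};

class CSEPassModel {
  bool AggressiveMachineCSE = false; // cached once per function

public:
  // Query the hook once on entry instead of once per instruction.
  // Assumption: the flag and the hook are combined with OR.
  void runOnFunction(const TargetHooks &TH, bool FlagFromCommandLine) {
    AggressiveMachineCSE =
        FlagFromCommandLine || TH.enableAggressiveMachineCSE();
  }

  // Per-instruction code reads the cached value; no virtual call here.
  bool aggressive() const { return AggressiveMachineCSE; }
};
```

The per-instruction profitability check would then read the cached member, so the virtual call happens once per function.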
Putting this behaviour behind a cl::opt is a great way to ensure it's never used...