Page MenuHomePhabricator

[clang][PassManager] Add -falways-mem2reg CC1 flag to run mem2reg at -O0
AcceptedPublic

Authored by jrtc27 on Jul 6 2021, 3:34 PM.

Details

Summary

Standard -O0 IR and assembly can be hard to follow as, without mem2reg,
there are loads and stores to the stack everywhere that clutter things
up and make it hard to see where the actual interesting instructions are
(such as when trying to debug a crash in unoptimised code from the
disassembly). It is therefore useful to be able to force mem2reg to be
run even at -O0 to clean up a lot of those stack loads and stores. There
are also Clang CodeGen tests in the tree that explicitly run mem2reg on
the output in order to make the CHECK lines more readable, which
requires manually passing -disable-O0-optnone and piping to opt; having
a flag that supports this also makes those less clunky.

Whilst optimisation for speed's sake is not the primary purpose of this
patch, it does provide an easy significant improvement in code size as
you might expect, giving a ~12% decrease in code size on macOS/arm64
when compiling Clang itself with the option enabled, likely also having
a significant improvement on the running time of the test suite over a
plain Debug build. On GNU/Linux/amd64 the decrease is less pronounced,
at about 4%, likely due to the fact that many instructions can take one
memory operand and so do not have to pay the additional cost of a load
or store like on load-store architectures.

Diff Detail

Unit TestsFailed

TimeTest
220 msx64 windows > Flang.Driver::debug-parsing-log.f90
Script: -- : 'RUN: at line 1'; c:\ws\w32-1\llvm-project\premerge-checks\build\bin\flang-new.exe -fc1 -fdebug-dump-parsing-log C:\ws\w32-1\llvm-project\premerge-checks\flang\test\Driver\debug-parsing-log.f90 2>&1 | c:\ws\w32-1\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w32-1\llvm-project\premerge-checks\flang\test\Driver\debug-parsing-log.f90
150 msx64 windows > Flang.Driver::debug-provenance.f90
Script: -- : 'RUN: at line 6'; c:\ws\w32-1\llvm-project\premerge-checks\build\bin\flang-new.exe -fc1 -fdebug-dump-provenance C:\ws\w32-1\llvm-project\premerge-checks\flang\test\Driver\debug-provenance.f90 2>&1 | c:\ws\w32-1\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w32-1\llvm-project\premerge-checks\flang\test\Driver\debug-provenance.f90
20 msx64 windows > Flang.Evaluate::folding01.f90
Script: -- : 'RUN: at line 1'; C:\ws\w32-1\llvm-project\premerge-checks\flang\test\Evaluate/test_folding.sh C:\ws\w32-1\llvm-project\premerge-checks\flang\test\Evaluate\folding01.f90 C:\ws\w32-1\llvm-project\premerge-checks\build\tools\flang\test\Evaluate\Output\folding01.f90.tmp c:\ws\w32-1\llvm-project\premerge-checks\build\bin\flang-new.exe -fc1
0 msx64 windows > Flang.Evaluate::folding02.f90
Script: -- : 'RUN: at line 1'; C:\ws\w32-1\llvm-project\premerge-checks\flang\test\Evaluate/test_folding.sh C:\ws\w32-1\llvm-project\premerge-checks\flang\test\Evaluate\folding02.f90 C:\ws\w32-1\llvm-project\premerge-checks\build\tools\flang\test\Evaluate\Output\folding02.f90.tmp c:\ws\w32-1\llvm-project\premerge-checks\build\bin\flang-new.exe -fc1
0 msx64 windows > Flang.Evaluate::folding03.f90
Script: -- : 'RUN: at line 1'; C:\ws\w32-1\llvm-project\premerge-checks\flang\test\Evaluate/test_folding.sh C:\ws\w32-1\llvm-project\premerge-checks\flang\test\Evaluate\folding03.f90 C:\ws\w32-1\llvm-project\premerge-checks\build\tools\flang\test\Evaluate\Output\folding03.f90.tmp c:\ws\w32-1\llvm-project\premerge-checks\build\bin\flang-new.exe -fc1
View Full Test Results (481 Failed)

Event Timeline

jrtc27 created this revision.Jul 6 2021, 3:34 PM
jrtc27 requested review of this revision.Jul 6 2021, 3:34 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJul 6 2021, 3:34 PM

I think it would be better to focus on making -O1 more usable for this sort of purpose, rather than introduce -O0.5. I mean, there's a lot of wiggle-room between -O0 and -O2, but I don't think it makes sense to add a driver option that promises to run exactly one optimization.

jrtc27 added a comment.EditedJul 19 2021, 9:20 PM

This is not meant to be an -O0.5, this is meant to be an -Oepsilon. I don't want optimised code, I just want code that I can actually disassemble and understand without having to trawl through a mess of stack spills and loads. This is for debugging really basic bugs (either compiler or bad C/C++ input) that turn up even at -O0 and that you don't want optimisations for because it's just going to complicate things, or even make the bugs disappear.

This is also so that the myriad of %clang_cc1 -disable-O0-optnone | opt -S -mem2reg seen in clang/tests can become %clang_cc1 -falways-mem2reg as the current way to write those tests is really clunky.

The part I'm most uncomfortable with is sticking "mem2reg" in a public, documented driver option. I don't want to promise that the mem2reg pass will exist forever. We should be focused on making sure the options we add are stable, and compose effectively, not just being convenient for some specific use.

I'd be less concerned if it were just a -cc1 option; if it's for our internal use, and we can throw it away if we come up with a better solution, this seems okay.

The part I'm most uncomfortable with is sticking "mem2reg" in a public, documented driver option. I don't want to promise that the mem2reg pass will exist forever. We should be focused on making sure the options we add are stable, and compose effectively, not just being convenient for some specific use.

I'd be less concerned if it were just a -cc1 option; if it's for our internal use, and we can throw it away if we come up with a better solution, this seems okay.

I'd be ok with having it just be a -cc1 option (I didn't even actually add a driver test for the non-cc1 form...). I also thought about doing something like -falways-regalloc to not tie it to the pass name, but names like that are misleading since machine register allocation does still happen, just not on things that it doesn't know could be promoted from memory to registers.

Matt added a subscriber: Matt.Jul 20 2021, 6:43 AM

I agree with Eli: we should decide what the goals are here and then use those goals to decide if we can identify a desirable permanent feature and, if so, what the appropriate name for that feature is.

It sounds like your goal is to get readable assembly that still corresponds fairly literally to your original code, in the sense that the readability of -O0 assembly is often undermined by the sheer amount of code and all the extra, unnecessary work it seems to do. However, I've found that a lot of the extra -O0 code is not actually from loads and stores to local variables, it's from the -O0 instruction selection and register allocation, which often combine to do very silly things. Have you looked into whether you get more readable code by just running normal -O0 IR through the non-O0 codegen pipeline? Because the problem with doing just mem2reg is that that's already a fairly major non-literal change to the code, and at some point it's tricky to say what exactly should be part of this new pipeline; whereas still emitting exactly what the abstract machine says to do, just with less nonsense from fast-isel, is a lot easier to define.

lkail added a subscriber: lkail.Jul 23 2021, 3:37 AM
jrtc27 updated this revision to Diff 362214.Jul 27 2021, 4:06 PM

Now only a CC1 option

efriedma accepted this revision.Jul 27 2021, 4:09 PM

Seems fine.

This revision is now accepted and ready to land.Jul 27 2021, 4:09 PM

I like this since the flag significantly improves readability for update_cc_test_checks.py-generated Clang test without having to use the -disable-O0-optnone | opt trick. Not sure what the best flag name is, but as long as it's a CC1 flag it shouldn't really matter.

jrtc27 added a comment.EditedJul 27 2021, 4:10 PM

I agree with Eli: we should decide what the goals are here and then use those goals to decide if we can identify a desirable permanent feature and, if so, what the appropriate name for that feature is.

It sounds like your goal is to get readable assembly that still corresponds fairly literally to your original code, in the sense that the readability of -O0 assembly is often undermined by the sheer amount of code and all the extra, unnecessary work it seems to do. However, I've found that a lot of the extra -O0 code is not actually from loads and stores to local variables, it's from the -O0 instruction selection and register allocation, which often combine to do very silly things. Have you looked into whether you get more readable code by just running normal -O0 IR through the non-O0 codegen pipeline? Because the problem with doing just mem2reg is that that's already a fairly major non-literal change to the code, and at some point it's tricky to say what exactly should be part of this new pipeline; whereas still emitting exactly what the abstract machine says to do, just with less nonsense from fast-isel, is a lot easier to define.

Well, I'm dealing with RISC-V which doesn't have FastISel. And no, running -O0 IR through a different pipeline is never going to work well without mem2reg (and you can't mem2reg -O0 IR without -disable-O0-optnone), so you need at least this to work well. Whether or not there's a useful load of additional stuff you could do, maybe, but I do think this patch in and of itself makes a huge difference, and the limited scope of it can be beneficial.

jrtc27 retitled this revision from [clang][PassManager] Add -falways-mem2reg to run mem2reg at -O0 to [clang][PassManager] Add -falways-mem2reg CC1 flag to run mem2reg at -O0.Jul 27 2021, 4:11 PM
jrtc27 edited the summary of this revision. (Show Details)