Page MenuHomePhabricator

Please use GitHub pull requests for new patches. Phabricator shutdown timeline

[CodeGen] Late cleanup of redundant address/immediate definitions.
ClosedPublic

Authored by jonpa on Apr 8 2022, 8:21 AM.

Details

Summary

On SystemZ, eliminateFrameIndex() may use a scratch register to load part of an offset that is out of range for the using instruction (this is likely also done on other targets - I believe at least AArch64 also does this). Since this is done per instruction (FI operand), several identical addressing "anchor points" may result in the MBB where actually only the first one is needed (provided the GPR is the same). I was expecting this to be cleaned up somehow afterwards, but to my surprise it is not.

I have therefore now made an initial experiment consisting of making PrologEpilogInserter try to clean up redundant definitions after all frame indices have been eliminated, and it would be great to get feedback from other targets to see if this is generally useful. It seems beneficial on SystemZ at least (see below). The patch isn't quite ready yet but I hope it can be the start of a discussion at least. It actually now seems that even though there are "load address" instructions removed, there are in addition many more redundant immediate loads removed (see below). This makes me think that maybe this would better belong in a separate pass run after PEI (or perhaps even later)?

I should also say that I looked into MachineCSE and DeadMachineInstructionElim, but they are at least not readily available: MachineCSE requires SSA, and DeadMachineInstructionElim works bottom-up which does not quite seem to work (the redundant definitions are not "dead").

The patch currently looks for simple "candidate" instructions which may be removed if an identical preceding one exists without any intervening clobbering of the value. The CFG is traversed breadth-first so that live-in definitions can be reused.

Given that the the global analysis adds complexity, I tried also doing this just locally (by passing -pei-localonly). I did find however that the reuse from predecessors is what gives the bulk of improvement, so I would say it is worth that extra effort. On SystemZ/SPEC17:

main <> "local only":
lghi           :               445967               443571    -2396  // Immediate load
lay            :                55090                54783     -307  // Load address
vgbm           :                10974                10848     -126  // Immediate load
...
OPCDIFFS: -2969
main <> patch:
lghi           :               445967               432678   -13289  // Immediate load
lhi            :               219663               216594    -3069  // Immediate load
la             :               533531               531952    -1579  // Load address
lay            :                55090                54774     -316  // Load address
...
OPCDIFFS: -19862

I would say that this is a number of instructions removed (~20k) that makes this look interesting to me... (*) Some examples:

LAYs removed in ./f507.cactuBSSN_r/build/ML_BSSN_Advect.s:

        lay     %r1, 4096(%r15)
        std     %f4, 1024(%r1)
-       lay     %r1, 4096(%r15)
        std     %f1, 1000(%r1)
-       lay     %r1, 4096(%r15)
        std     %f5, 1016(%r1)
-       lay     %r1, 4096(%r15)

LGHI removed in ./f507.cactuBSSN_r/build/RadiationBoundary.s:

        lghi    %r3, 0
        clgijl  %r5, 12, .LBB2_23     // Conditional branch
        risbgn  %r4, %r4, 1, 189, 0
        lghi    %r5, 0
-       lghi    %r3, 0

VGBMs (vector immediate loads) removed in ./f510.parest_r/build/derivative_approximation.s

...
        vst     %v0, 344(%r15), 3
-       vgbm    %v0, 0
        vst     %v0, 360(%r15), 3
-       vgbm    %v0, 0
        vst     %v0, 312(%r15), 3
-       vgbm    %v0, 0
        vst     %v0, 328(%r15), 3
-       vgbm    %v0, 0
        vst     %v0, 280(%r15), 3
-       vgbm    %v0, 0
        vst     %v0, 296(%r15), 3

(*) I also see a few hundred more lmg/br insructions in total, which I am guessing come from some missed CFG optimization, but not sure if this is important. For example:

-       j       .LBB50_33
+       lochie  %r13, 1
+       llgfr   %r2, %r13
+       lmg     %r10, %r15, 240(%r15)
+       br      %r14
 .LBB50_32:
        lghi    %r0, -1
        sllg    %r0, %r0, 0(%r12)
        xihf    %r0, 4294967295
        xilf    %r0, 4294967295
        cg      %r0, 24(%r11)
-.LBB50_33:
-       lhi     %r13, 0
        lochie  %r13, 1
        llgfr   %r2, %r13
        lmg     %r10, %r15, 240(%r15)
        br      %r14

Current status is that benchmarks build with -verify-machineinstrs, but there are many regression tests failing (hopefully due to removed redundant defs :-). I am hoping that people working with those targets can help me understand those...

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

LGTM to enable this on Arm/AArch64.

jonpa requested review of this revision.Nov 17 2022, 8:48 AM

Ping!

Now that many targets are willing to try this, it would be nice to have it properly reviewed as well before I commit it.

From what I understand nearly all targets with updated tests except BPF, Mips and XCore have said they would like to have it enabled. Not sure if I should disable those targets...

@sdardis Any thoughts on the MIPS changes?

craig.topper added inline comments.Nov 17 2022, 10:06 AM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
112

Can we use RPOT(reverse post order traversal)? That will guarantee this property except for loops.

165

I think you can use getMF() instead of getParent()->getParent()

rnk resigned from this revision.Nov 17 2022, 11:35 AM
jonpa updated this revision to Diff 476249.Nov 17 2022, 3:35 PM
jonpa marked an inline comment as done.

Updated per review. Using RPOT was a nice improvement - see inline comment. A few more improvements in tests also resulted from this.

llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
112

Yes - awesome!

55 lines of iteration logic removed, about the same amount of success (even very slightly better on SPEC), and improved compile time (ave 0.57% -> 0.50%).

jonpa added a comment.Nov 17 2022, 3:41 PM

Updated tests were:

llvm/test/CodeGen/AMDGPU/exec-mask-opt-cannot-create-empty-or-backward-segment.ll |  2 --
llvm/test/CodeGen/RISCV/rvv/bitreverse-sdnode.ll                                  |  2 +-
llvm/test/CodeGen/RISCV/rvv/bswap-sdnode.ll                                       |  1 -
llvm/test/CodeGen/RISCV/rvv/bswap-vp.ll                                           |  6 ------
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-bswap-vp.ll                             |  4 ----
llvm/test/CodeGen/SystemZ/frame-28.mir                                            |  5 +++--
jonpa added a comment.Nov 18 2022, 2:20 PM

That's done after regalloc in eliminateFrameIndex - interesting. My thoughts on this have been that it is more efficient to scan the entire function once rather than for each single frame index elimination. That should be possible if the used register is the same for consecutive FI operands, which I think it should typically be. It would be interesting to see what would happen if you removed that code and instead enabled this pass.

The other thing is that the bulk of the instructions being cleaned up seems to be loads of immediates which result from rematerialization rather than FIs.

jonpa updated this revision to Diff 476992.Nov 21 2022, 1:49 PM

Rebase.

This time the test CodeGen/AMDGPU/flat-scratch.ll failed with the verifier as a kill flag of a super register was not removed. To remedy this, I changed clearKillsForDef() to check for any overlapping reg instead of just the identical one.

Updated tests:

AMDGPU/GlobalISel/call-outgoing-stack-args.ll
AMDGPU/GlobalISel/flat-scratch.ll
AMDGPU/chain-hi-to-lo.ll
AMDGPU/flat-scratch.ll
AMDGPU/si-annotate-cf.ll
AMDGPU/spill-offset-calculation.ll
AMDGPU/spill-scavenge-offset.ll

foad added inline comments.Nov 21 2022, 11:16 PM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
183

Could use structured bindings: for (auto [Reg, DefMI] : ...)

186

Could use all_of(drop_begin(MBB->predecessors()), ...)?

194–195

Use << printMBBReference(*MBB) << instead of rolling your own MBB#number syntax, here and elsewhere.

204–205

This looks like for (auto &MI : make_early_inc_range(MBB)))

207

Comment should explain why rather than what.

jonpa updated this revision to Diff 477189.Nov 22 2022, 7:40 AM
jonpa marked 5 inline comments as done.

Thanks (again) for nice review suggestions. Patch updated (NFC, no new test updates).

jonpa updated this revision to Diff 478579.Nov 29 2022, 7:01 AM

Ping!

Patch rebased, with one test updated: AMDGPU/GlobalISel/call-outgoing-stack-args.ll.

Still waiting for a final approval...

LGTM - @foad @craig.topper any additional feedback?

foad added a comment.Nov 29 2022, 8:37 AM

LGTM - @foad @craig.topper any additional feedback?

Still seems fine to me - but I still think we should also try to clean up some of these redundant instructions earlier, or avoid generating them in the first place, if possible.

RKSimon accepted this revision.Nov 30 2022, 4:34 AM

LGTM - cheers!

This revision is now accepted and ready to land.Nov 30 2022, 4:34 AM
This revision was landed with ongoing or failed builds.Dec 1 2022, 10:22 AM
This revision was automatically updated to reflect the committed changes.
reames added a subscriber: reames.Dec 1 2022, 10:36 AM

This looks very interesting. I've been thinking about ways to improve stack addressing sequences on RISCV, and this nicely tackles many of the cases I was just thinking about.

Unfortunately, I think the patch as landed is wrong. See the inline comment on an example where we can't forward due to a clobber which doesn't appear to be respected.

llvm/test/CodeGen/RISCV/rvv/rv64-spill-zvlsseg.ll
44 ↗(On Diff #479348)

This line looks to be wrong. a0 is clobbered on line 40 which is after the previous definition being forwarded from line 37.

jonpa updated this revision to Diff 479392.EditedDec 1 2022, 1:04 PM

Patch reverted because of build failure, relating to capturing structured bindings, which builds fine with gcc but not yet with clang.

I added init captures in processBlock(), which should remedy the build problems and also allow easy removal once it is fully accepted by clang.

for (auto [Reg, DefMI] : RegDefs[FirstPred->getNumber()])
  if (llvm::all_of(
       drop_begin(MBB->predecessors()),
       **[&, &Reg = Reg, &DefMI **= DefMI](const MachineBasicBlock *Pred) {
       ...

@reames: Thanks for catching that problem that was previously missed. I looked into it and found:

$x10 = ADDI $x2, 16
$x11 = PseudoReadVLENB
PseudoVSPILL2_M1 killed renamable $v8_v9, killed $x10, killed $x11 :: (store unknown-size into %stack.0, align 8)
INLINEASM &"" [sideeffect] [attdialect], $0:[clobber] ...
$x10 = ADDI $x2, 16
$x11 = PseudoReadVLENB
renamable $v7_v8 = PseudoVRELOAD2_M1 killed $x10, killed $x11 :: (load unknown-size from %stack.0, align 8)

After Machine Late Instructions Cleanup Pass (machine-latecleanup)

$x10 = ADDI $x2, 16
 $x11 = PseudoReadVLENB
 PseudoVSPILL2_M1 killed renamable $v8_v9, $x10, $x11 :: (store unknown-size into %stack.0, align 8)
 INLINEASM &"" [sideeffect] [attdialect], $0:[clobber], ...
 renamable $v7_v8 = PseudoVRELOAD2_M1 killed $x10, killed $x11 :: (load unknown-size from %stack.0, align 8)

After RISCV pseudo instruction expansion pass (riscv-expand-pseudo)

$x10 = ADDI $x2, 16
$x11 = PseudoReadVLENB
VS1R_V $v8, $x10, implicit $v8_v9 :: (store unknown-size into %stack.0, align 8)
$x10 = ADD $x10, $x11
VS1R_V $v9, $x10, implicit $v8_v9 :: (store unknown-size into %stack.0, align 8)
INLINEASM &"" [sideeffect] [attdialect], $0:[clobber], ...
$v7 = VL1RE8_V $x10 :: (load unknown-size from %stack.0, align 8)
$x10 = ADD $x10, $x11
$v8 = VL1RE8_V $x10 :: (load unknown-size from %stack.0, align 8)

The problem here is that $x10 is only marked as a use operand in the pseudo instruction (PseudoVSPILL2_M1) when the cleanup pass sees it, but it is in fact later getting expanded to a sequence that actually clobbers the register. I think the fix should be to make sure any pseudo instruction is properly modeled to reflect the final expansion, in this case with a def operand of the base register.

Could I disable the pass for now for RISCV and recommit?

jonpa reopened this revision.Dec 1 2022, 1:06 PM
This revision is now accepted and ready to land.Dec 1 2022, 1:06 PM
jonpa requested review of this revision.Dec 1 2022, 1:07 PM
jonpa added a reviewer: reames.
reames added a comment.Dec 1 2022, 2:10 PM

@reames: Thanks for catching that problem that was previously missed. I looked into it and found:
...
Could I disable the pass for now for RISCV and recommit?

If you're sure this is a RISCV specific problem, sure. Your description does sound like one to me, so it basically comes down to how sure of your analysis you are.

@reames: Thanks for catching that problem that was previously missed. I looked into it and found:
...
Could I disable the pass for now for RISCV and recommit?

If you're sure this is a RISCV specific problem, sure. Your description does sound like one to me, so it basically comes down to how sure of your analysis you are.

Looks like I tried to fix the RISCV issue once. https://reviews.llvm.org/D109405 but it caused other problems and got reverted. Then I guess I lost track of it.

jonpa added a comment.Dec 1 2022, 2:49 PM

OK - I will recommit this then as it does seem like a problem in the RISCV backend. I'll wait a day or so to let other targets check that they don't have this potential problem...

jonpa updated this revision to Diff 479862.Dec 3 2022, 12:47 PM

Recommitted as 17db0de, with RISCV disabling it for now.

Posting the RISCV part of the patch here for future use when you want to enable it.

This revision is now accepted and ready to land.Dec 3 2022, 12:47 PM
jonpa added a comment.Dec 4 2022, 3:59 PM

Reverted again with 122efef.

MaskRay added a comment.EditedDec 4 2022, 4:40 PM

Hi, it's useful to include Differential Revision: https://reviews.llvm.org/D123394 for all reland commits so that others can easily tell the state by visiting this page.
For relands, please include the original description (don't let readers trace through the original commit to get the description).

MaskRay added inline comments.Dec 4 2022, 4:47 PM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
87

This can be moved closer to Changed |= processBlock(MBB);

113
177

Don't add unneeded blank lines between variable definitions or a variable definition. If the next statement has a comment, it is fine to keep a blank line before the comment.

Applies to elsewhere in the patch.

203

This condition is somewhat strange.

mgorny added a comment.Dec 5 2022, 7:49 AM

@MaskRay, I've reverted your M68k test fix since this change was reverted too, to get M68k tests to pass again.

@jonpa, can you please include the change from f55880e830e150d98e5340cdc3c4c41867a5514d in your patch?

This revision was landed with ongoing or failed builds.Dec 5 2022, 10:55 AM
This revision was automatically updated to reflect the committed changes.
jonpa marked 4 inline comments as done.
jonpa updated this revision to Diff 480166.Dec 5 2022, 11:03 AM

Latest patch as committed (did not update correctly automatically).

uabelho added a subscriber: uabelho.Dec 6 2022, 1:59 AM
uabelho added inline comments.Dec 13 2022, 2:02 AM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
106–110

Hi @jonpa

It looks like clearKillsForDef doesn't really handle bundled input. E.g. for our OOT target I've seen cases where "kill" flags were cleared in the BUNDLE instruction, but not in the individual bundled instructions which then lead to verifier complaints.

Do you know if there are other parts of the pass that have problems with bundled input?

uabelho added inline comments.Dec 14 2022, 12:13 AM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
114–123

I wonder if this won't abort too early? Or are we really guaranteed to find an implicit kill of the super-reg within this instruction if we find a kill of a sub-reg?

At least for my target I've found cases like

DEF superreg
USE killed subreg1
USE killed subreg2
DEF superreg
USE killed subreg1

and with the current implementation we get

DEF super reg
USE killed subreg1
USE subreg2
USE killed subreg1

and then the verifier complains on the second use of subreg1.

jonpa added a comment.Dec 14 2022, 2:23 PM

Hi @uabelho , I'm glad you may find some use for this pass as well :)

It looks like clearKillsForDef doesn't really handle bundled input. E.g. for our OOT target I've seen cases where "kill" flags were cleared in the BUNDLE instruction, but not in the individual bundled instructions which then lead to verifier complaints.

Do you know if there are other parts of the pass that have problems with bundled input?

TBH, it was a while since I thought about BUNDLEs, but IIRC it should be possible to use an instr_iterator with NFC in the *absence* of BUNDLEs, so perhaps that would be a somewhat trivial change..?
Maybe @kparzysz would be interested in this as well, as I believe Hexagon is also emitting BUNDLEs..?

Not quite sure how BUNDLEs need to be handled, for instance when a bundled instruction is deleted, I guess the BUNDLE operands must also be updated.

I wonder if this won't abort too early? Or are we really guaranteed to find an implicit kill of the super-reg within this instruction if we find a kill of a sub-reg?

Are you returning true from enableSubRegLiveness()? I suspect that could cause a difference that I haven't yet run into myself.

As I have mentioned before, I still think the most valuable thing here would be to consider dropping kill flags entirely since this is run so late. I don't think e.g. the buildSchedGraph() uses them. Perhaps they could be regenerated quickly when they are actually useful. The benefits here would be to not have to worry about this tedious updating, which also costs potentially in compile time. Alternatively, it seems that this has to be fixed here for the case of enabled subreg liveness. I have at least so far found it sufficient with the current search for kill flags (without subreg liveness enabled).

I wonder if this won't abort too early? Or are we really guaranteed to find an implicit kill of the super-reg within this instruction if we find a kill of a sub-reg?

Are you returning true from enableSubRegLiveness()? I suspect that could cause a difference that I haven't yet run into myself.

As I have mentioned before, I still think the most valuable thing here would be to consider dropping kill flags entirely since this is run so late. I don't think e.g. the buildSchedGraph() uses them. Perhaps they could be regenerated quickly when they are actually useful. The benefits here would be to not have to worry about this tedious updating, which also costs potentially in compile time. Alternatively, it seems that this has to be fixed here for the case of enabled subreg liveness. I have at least so far found it sufficient with the current search for kill flags (without subreg liveness enabled).

We haven't enabled subreg liveness. What we do have though is a downstream hack in VirtRegMap.cpp that throws away implicit uses of superregisters since we haven't understood what they are good for and they just limit scheduling later.
But ok, that explains why those implicit killed uses are missing for us :) I'm still not sure if they are always guaranteed to be there though. Do you know if this is documented somewhere?

Right now, downstream I just made clearKillsForDef continue searhing until it finds a def, and not do the early abort you can do due to the implicit superreg kill. (no idea what compiletime impact this has.)

We haven't enabled subreg liveness. What we do have though is a downstream hack in VirtRegMap.cpp that throws away implicit uses of superregisters since we haven't understood what they are good for and they just limit scheduling later.

Ah - yeah have some vague notion of that :-)

But ok, that explains why those implicit killed uses are missing for us :) I'm still not sure if they are always guaranteed to be there though. Do you know if this is documented somewhere?

Not really - the idea was basically to stop the search backwards also when a use of the reg is found as there is no point in continuing. I then encountered a problem in a case of the extra super-reg operand, and added a handling for that. I guess I was then still assuming that there can't be an earlier kill of the register. Are you seeing a problem here also without the removal of the superregisters-hack you are using? Something like a def of a superregister, and then a kill of a subreg, in which case the search should continue for the other subreg kill?

Right now, downstream I just made clearKillsForDef continue searhing until it finds a def, and not do the early abort you can do due to the implicit superreg kill. (no idea what compiletime impact this has.)

Makes sense to me - I tried also earlier to store the kills in a map but I found (at least on SystemZ) that this backwards search was actually quicker than that, so I don't think it should be an issue in the normal case...

Makes sense to me - I tried also earlier to store the kills in a map but I found (at least on SystemZ) that this backwards search was actually quicker than that, so I don't think it should be an issue in the normal case...

Hello Jonas, just FYI I got case:

Total Execution Time: 670.4078 seconds (670.7149 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  289.2998 ( 43.7%)   0.1439 (  1.8%)  289.4437 ( 43.2%)  289.5775 ( 43.2%)  Structurize control flow
  124.9852 ( 18.9%)   0.0000 (  0.0%)  124.9852 ( 18.6%)  125.0289 ( 18.6%)  Machine Late Instructions Cleanup Pass
  45.2408 (  6.8%)   0.0000 (  0.0%)  45.2408 (  6.7%)  45.2740 (  6.8%)  SI Form memory clauses
  38.0358 (  5.7%)   0.0240 (  0.3%)  38.0598 (  5.7%)  38.0767 (  5.7%)  Simple Register Coalescing

Most of the time is spent in clearKillsForDef.

jonpa added a comment.Mar 11 2023, 3:13 AM

Makes sense to me - I tried also earlier to store the kills in a map but I found (at least on SystemZ) that this backwards search was actually quicker than that, so I don't think it should be an issue in the normal case...

Hello Jonas, just FYI I got case:

Total Execution Time: 670.4078 seconds (670.7149 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  289.2998 ( 43.7%)   0.1439 (  1.8%)  289.4437 ( 43.2%)  289.5775 ( 43.2%)  Structurize control flow
  124.9852 ( 18.9%)   0.0000 (  0.0%)  124.9852 ( 18.6%)  125.0289 ( 18.6%)  Machine Late Instructions Cleanup Pass
  45.2408 (  6.8%)   0.0000 (  0.0%)  45.2408 (  6.7%)  45.2740 (  6.8%)  SI Form memory clauses
  38.0358 (  5.7%)   0.0240 (  0.3%)  38.0598 (  5.7%)  38.0767 (  5.7%)  Simple Register Coalescing

Most of the time is spent in clearKillsForDef.

Hi Valery, thanks for the report! It would be nice if you posted your test case on github issues and assigned me (and/or yourself :)... I would think this could be fixed.

Hi Valery, thanks for the report! It would be nice if you posted your test case on github issues and assigned me (and/or yourself :)... I would think this could be fixed.

Sorry for delay, done https://github.com/llvm/llvm-project/issues/61397. Maybe if you have the version with kill flag maps we can check if it works better on the testcase.

Hi Valery, thanks for the report! It would be nice if you posted your test case on github issues and assigned me (and/or yourself :)... I would think this could be fixed.

Sorry for delay, done https://github.com/llvm/llvm-project/issues/61397. Maybe if you have the version with kill flag maps we can check if it works better on the testcase.

Indeed, the kill flags map does seem to remedy this huge test case. My results and patch posted on github issues.

Anyone interested in compile time with this pass with huge functions, please feel free to contribute your opinion here.