This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Late cleanup of redundant address/immediate definitions.
ClosedPublic

Authored by jonpa on Apr 8 2022, 8:21 AM.

Details

Summary

On SystemZ, eliminateFrameIndex() may use a scratch register to load part of an offset that is out of range for the using instruction (this is likely also done on other targets - I believe at least AArch64 also does this). Since this is done per instruction (FI operand), several identical addressing "anchor points" may end up in the MBB when actually only the first one is needed (provided the GPR is the same). I was expecting this to be cleaned up somehow afterwards, but to my surprise it is not.

I have therefore now made an initial experiment consisting of making PrologEpilogInserter try to clean up redundant definitions after all frame indices have been eliminated, and it would be great to get feedback from other targets to see if this is generally useful. It seems beneficial on SystemZ at least (see below). The patch isn't quite ready yet, but I hope it can be the start of a discussion at least. It now actually seems that even though "load address" instructions are removed, many more redundant immediate loads are removed in addition (see below). This makes me think that maybe this would better belong in a separate pass run after PEI (or perhaps even later)?

I should also say that I looked into MachineCSE and DeadMachineInstructionElim, but neither is readily applicable: MachineCSE requires SSA, and DeadMachineInstructionElim works bottom-up, which does not quite seem to work here (the redundant definitions are not "dead").

The patch currently looks for simple "candidate" instructions which may be removed if an identical preceding one exists without any intervening clobbering of the value. The CFG is traversed breadth-first so that live-in definitions can be reused.
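The traversal and reuse logic can be sketched as a toy model (all names, the instruction encoding, and the predecessor-merge rule below are illustrative assumptions for the sketch, not code from the patch):

```python
from collections import deque

# Toy model: an instruction is ("def", reg, value) for an immediate/address
# load, or ("clobber", reg) for anything that overwrites reg.
def cleanup(blocks, preds, entry):
    # Derive successors from the predecessor lists.
    succs = {b: [] for b in blocks}
    for b, ps in preds.items():
        for p in ps:
            succs[p].append(b)
    out_defs = {}          # block -> {reg: value} live after the block
    removed = []           # (block, index) of removable instructions
    worklist, visited = deque([entry]), set()
    while worklist:
        b = worklist.popleft()          # breadth-first over the CFG
        if b in visited:
            continue
        visited.add(b)
        # Live-in defs: the intersection over all predecessors' live-out
        # maps; only usable if every predecessor was already visited.
        defs = {}
        if preds[b] and all(p in out_defs for p in preds[b]):
            first, *rest = (out_defs[p] for p in preds[b])
            defs = {r: v for r, v in first.items()
                    if all(m.get(r) == v for m in rest)}
        for i, ins in enumerate(blocks[b]):
            if ins[0] == "def":
                _, reg, val = ins
                if defs.get(reg) == val:   # identical preceding def exists
                    removed.append((b, i))
                else:
                    defs[reg] = val
            elif ins[0] == "clobber":
                defs.pop(ins[1], None)     # value no longer trusted
        out_defs[b] = defs
        worklist.extend(succs[b])
    return removed

blocks = {"A": [("def", "r1", 4096)],
          "B": [("def", "r1", 4096), ("clobber", "r1"),
                ("def", "r1", 4096)]}
preds = {"A": [], "B": ["A"]}
print(cleanup(blocks, preds, "A"))   # -> [('B', 0)]
```

In the example, the first def in B is redundant (A's identical def is live-in), while the def after the clobber must stay.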

Given that the global analysis adds complexity, I tried also doing this just locally (by passing -pei-localonly). I did find however that the reuse from predecessors is what gives the bulk of the improvement, so I would say it is worth that extra effort. On SystemZ/SPEC17:

main <> "local only":
lghi           :               445967               443571    -2396  // Immediate load
lay            :                55090                54783     -307  // Load address
vgbm           :                10974                10848     -126  // Immediate load
...
OPCDIFFS: -2969
main <> patch:
lghi           :               445967               432678   -13289  // Immediate load
lhi            :               219663               216594    -3069  // Immediate load
la             :               533531               531952    -1579  // Load address
lay            :                55090                54774     -316  // Load address
...
OPCDIFFS: -19862

I would say that this number of removed instructions (~20k) makes this look interesting to me... (*) Some examples:

LAYs removed in ./f507.cactuBSSN_r/build/ML_BSSN_Advect.s:

        lay     %r1, 4096(%r15)
        std     %f4, 1024(%r1)
-       lay     %r1, 4096(%r15)
        std     %f1, 1000(%r1)
-       lay     %r1, 4096(%r15)
        std     %f5, 1016(%r1)
-       lay     %r1, 4096(%r15)

LGHI removed in ./f507.cactuBSSN_r/build/RadiationBoundary.s:

        lghi    %r3, 0
        clgijl  %r5, 12, .LBB2_23     // Conditional branch
        risbgn  %r4, %r4, 1, 189, 0
        lghi    %r5, 0
-       lghi    %r3, 0

VGBMs (vector immediate loads) removed in ./f510.parest_r/build/derivative_approximation.s

...
        vst     %v0, 344(%r15), 3
-       vgbm    %v0, 0
        vst     %v0, 360(%r15), 3
-       vgbm    %v0, 0
        vst     %v0, 312(%r15), 3
-       vgbm    %v0, 0
        vst     %v0, 328(%r15), 3
-       vgbm    %v0, 0
        vst     %v0, 280(%r15), 3
-       vgbm    %v0, 0
        vst     %v0, 296(%r15), 3

(*) I also see a few hundred more lmg/br instructions in total, which I am guessing come from some missed CFG optimization, but I am not sure if this is important. For example:

-       j       .LBB50_33
+       lochie  %r13, 1
+       llgfr   %r2, %r13
+       lmg     %r10, %r15, 240(%r15)
+       br      %r14
 .LBB50_32:
        lghi    %r0, -1
        sllg    %r0, %r0, 0(%r12)
        xihf    %r0, 4294967295
        xilf    %r0, 4294967295
        cg      %r0, 24(%r11)
-.LBB50_33:
-       lhi     %r13, 0
        lochie  %r13, 1
        llgfr   %r2, %r13
        lmg     %r10, %r15, 240(%r15)
        br      %r14

Current status is that benchmarks build with -verify-machineinstrs, but there are many regression tests failing (hopefully due to removed redundant defs :-). I am hoping that people working with those targets can help me understand those...

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden.

If we're not generating enough code for the giant integers, increasing the bitwidth of the giant integers is the first thing to try, I think.

Thanks! I increased it to i8500, and it now passes and seems to have the same block structure.

arsenm added inline comments.Apr 26 2022, 5:14 PM
llvm/lib/CodeGen/PrologEpilogInserter.cpp
135–136 ↗(On Diff #425193)

DenseMap?

138–139 ↗(On Diff #425193)

SmallSetVector?

1492–1495 ↗(On Diff #425256)

Can't you use one of the existing block visiting iterators?

1514 ↗(On Diff #425256)

If PEI's loop was reversed and scavenge register backwards was used, we wouldn't need kill flags at all

1629 ↗(On Diff #425256)

*MI works, don't need dump

1643 ↗(On Diff #425256)

Ditto

jonpa marked an inline comment as done.Apr 27 2022, 2:58 AM
jonpa added inline comments.
llvm/lib/CodeGen/PrologEpilogInserter.cpp
1514 ↗(On Diff #425256)

Thanks for review. That is a good point, but I have discovered that not only address anchors from frame-index lowering can be found redundant and cleaned up, but also many (more) immediate loads. So changing PEI would not handle all cases, unfortunately.

I have asked before here if this should be part of PEI or become a separate pass, but so far we have not agreed on anything. Given the above, it seems to me now more reasonable to have this as a separate pass run before MachineCopyPropagation. Does this make sense to you?

arsenm added inline comments.Apr 27 2022, 8:45 AM
llvm/lib/CodeGen/PrologEpilogInserter.cpp
1514 ↗(On Diff #425256)

I think PEI should be split into a number of separate passes as it is. It might make sense to keep it in PEI if you were to do something ahead of or as part of the frame finalization (which might be better than looking for references to getFrameRegister)

jonpa updated this revision to Diff 426232.Apr 30 2022, 8:19 AM
jonpa marked 5 inline comments as done.

Thank you for review.

I have made some progress but there is yet more to do - any suggestions are welcome.

I think PEI should be split into a number of separate passes as it is. It might make sense to keep it in PEI if you were to do something ahead of or as part of the frame finalization (which might be better than looking for references to getFrameRegister)

I have experimented with putting this into a new pass at different positions in TargetPassConfig::addMachineLateOptimization(). It is possible to maintain NFC by putting this first in that method (which I ended up doing), but I thought it would be interesting to see if it would make any difference putting it after Branchfolder / TailDup. I saw some minor changes only, and these are the "-stats" summarized over SPEC17/SystemZ:

Before Branchfolder <> After Branchfolder:

159682           160143    branch-folder - Number of block tails merged         461  0.3%
561638           560901    branch-folder - Number of branches optimized        -737  0.1%
139108           139050  tailduplication - Number of tail duplicated blocks     -58
107964           108330  tailduplication - Number of tails duplicated           366

After Branchfolder <> Before MachineCopyPropagation (After TailDuplication)

160143           160126    branch-folder - Number of block tails merged         -17
560901           561491    branch-folder - Number of branches optimized         590
139050           138961  tailduplication - Number of tail duplicated blocks     -89
108330           107959  tailduplication - Number of tails duplicated          -371

Before Branchfolder <> Before MachineCopyPropagation (After TailDuplication)

159682           160126    branch-folder - Number of block tails merged         444
561638           561491    branch-folder - Number of branches optimized        -147
139108           138961  tailduplication - Number of tail duplicated blocks    -147
107964           107959  tailduplication - Number of tails duplicated            -5

This doesn't seem to make much difference, so I picked the original place for now (with a bias towards more tail dups / less tail merging).

jonpa added inline comments.Apr 30 2022, 8:21 AM
llvm/lib/CodeGen/PrologEpilogInserter.cpp
1492–1495 ↗(On Diff #425256)

Using breadth_first() seems to work well (very nearly NFC on SPEC17).

jonpa updated this revision to Diff 426771.May 3 2022, 10:49 AM

The handling of the kill flags has been improved, but they are still updated one at a time. Not sure if there is a better way, or if that would be needed.

Simplified the handling of values from predecessors, but then got into trouble with DenseMap. It seems I must somehow have been updating it while iterating, which is not allowed, so I was scared back to std::map for now (this was not due to the operator[], at least).

I experimented with something like a "register substitution", where a previous def was reused even if the current register is a different one, after changing the register in the users. After sorting out all the illegal cases this seemed to give only a very marginal improvement, so it's not included in the patch. I also tried not merging predecessor entries and just handling the case of a single predecessor, but that did have an impact, so I kept the merging as it was. So at the moment this looks to be close to a final version at least, as I have no more ideas to try.

All tests are updated. It would be nice to get review (it's less than 300 lines) and an agreement on what could be the next step - perhaps at some point enable it for one or more specific targets?

jonpa updated this revision to Diff 427871.May 7 2022, 9:28 AM

Also handle cases of invariant loads (GOT, CP) and more constant operand types, such as Global, CPI, and some minor improvements.

Number of removed redundant instructions on SystemZ / SPEC17 is now 21200 (was 19500).

It would be nice to get some numbers for other targets as well!

This is fine from a PPC perspective. What would make more of an impact would be the ability to common partial immediate materializations such as happen in prologue/epilogue when a stack frame is very large. Perhaps this can be extended to do that in the future. It would eliminate things like:

lis 11, 16
ori 11, 11, 256 
stxvx 53, 31, 11                        # 16-byte Folded Spill
lis 11, 16
ori 11, 11, 272 
stxvx 54, 31, 11                        # 16-byte Folded Spill
lis 11, 16
ori 11, 11, 288 
stxvx 55, 31, 11                        # 16-byte Folded Spill
lis 11, 16
ori 11, 11, 304 
stxvx 56, 31, 11                        # 16-byte Folded Spill

Of course, that would involve changing the subsequent ori instructions to addi and would likely be very PPC-specific. I just bring it up here in case SystemZ has the same issue.
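The arithmetic behind that suggested rewrite can be checked numerically. A hedged sketch: the lis/ori semantics below follow the standard PPC definitions, but the ori-to-addi rewrite itself is the hypothetical extension, not something the patch does:

```python
# PPC semantics (32-bit model): lis rD, IMM loads IMM << 16;
# ori rD, rA, UI ORs in an unsigned 16-bit immediate.
def lis(imm):
    return (imm << 16) & 0xFFFFFFFF

def ori(val, ui):
    return val | (ui & 0xFFFF)

kept = ori(lis(16), 256)       # first lis/ori pair, which would be kept
redundant = ori(lis(16), 272)  # second pair with the same high part
# Hypothetical rewrite: drop the second lis and turn its ori into an
# addi off the kept value -- valid while the delta fits addi's signed
# 16-bit immediate and no carry crosses into the high half.
assert redundant == (kept + 16) & 0xFFFFFFFF
```

The same check holds for the 288 and 304 offsets above (deltas of 32 and 48 off the kept value).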

jonpa added a comment.May 8 2022, 9:44 AM

This is fine from a PPC perspective. What would make more of an impact would be the ability to common partial immediate materializations such as happen in prologue/epilogue when a stack frame is very large. Perhaps this can be extended to do that in the future. It would eliminate things like:

lis 11, 16
ori 11, 11, 256 
stxvx 53, 31, 11                        # 16-byte Folded Spill
lis 11, 16
ori 11, 11, 272 
stxvx 54, 31, 11                        # 16-byte Folded Spill
lis 11, 16
ori 11, 11, 288 
stxvx 55, 31, 11                        # 16-byte Folded Spill
lis 11, 16
ori 11, 11, 304 
stxvx 56, 31, 11                        # 16-byte Folded Spill

Of course, that would involve changing the subsequent ori instructions to addi and would likely be very PPC-specific. I just bring it up here in case SystemZ has the same issue.

That's a good point - my idea so far about that is to try to use a postRA pseudo for e.g. an immediate load requiring two target instructions so that they can be handled just the same way during this pass. Alternatively, one could as you say probably extend this current algorithm to have some kind of "look-ahead" or something so that it could recognize e.g. pairs of instructions as well. That is a bit more complex, though... I will try pseudo on SystemZ and see how it goes with 32-bit immediate loads.

So you did not get any significant results on SPEC with this patch yet?

I checked the compile time and realized that I have to work on that a bit, as I see some slowdown on some files.

asb added a comment.May 12 2022, 1:25 AM

The RISC-V changes for this look good to me.

jonpa updated this revision to Diff 429800.May 16 2022, 12:17 PM

NFC updates.

  • Compile time improved: The regression I saw was remedied with a better iteration strategy over the MBBs. I tried to find further improvements, like tracking kill flags in a map instead of searching backwards over instructions, but that was slower. I also tried to use DenseMap for the mapping of Register -> MachineInstr*. Even the best init() alloc with a value of 8 was not better than std::map. IndexedMap did not seem like a good choice given that the typical number of mapped registers is not that many. The wall time for this pass on SPEC/SystemZ now looks to be on average 0.6%, much like Machine Copy Propagation Pass #2. There are no big compile time blowups that I am aware of.

I can't find much more to improve, at least not at the moment. This pass is "better than nothing", but it is not impossible that there could be more elegant/powerful solutions, like emitting frame address anchors in an intelligent way, cleaning up rematerialized immediate loads somehow, and maybe other things. Personally, I think this looks pretty good to have as long as there is no better way...

However, if we were to actually go ahead and use this, I think there should be some kind of broader agreement on it. There is no strong SystemZ benefit from this - it was just something that seemed missing (looking at those depressing multiple identical load address instructions). The most benefit is probably on in-order targets. I think there should at least be two or three targets that "vote for" this. It's good to know that the test changes on e.g. RISC-V look correct, and now the next step would be for those targets to evaluate whether it is worth adding to the pass pipeline.

@nemanjai My previous idea to try post-RA pseudos for immediates on SystemZ is actually not very useful, as only 64-bit immediates require two instructions (very rare). But it still looks like this could be worth trying on e.g. PowerPC per your example above. Maybe you could give that a try (like in FrameLowering emitting a pseudo instruction instead of two actual instructions, which is expanded in ExpandPostRAPseudos)? If you think that looks good, it would be great to know...

jonpa added a comment.Jun 8 2022, 7:57 AM

ping!

I haven't heard about any other target testing this on benchmarks yet. On SystemZ we see one or two slight improvements on SPEC, but not enough to motivate having this pass in the SystemZ backend. We would however run it if there were other targets as well wanting to do it. So please, take the chance and see if this is anything beneficial for your target!

bjope added a comment.Jun 8 2022, 9:51 AM

Hi @jonpa ,

FWIW, I've made a quick test downstream for our OOT target and can see that this triggers occasionally.

Found one benchmark so far that was improved by almost 2%.
Except for that, I've only seen some benchmarks where materialization of zero constants is eliminated. And for our VLIW target such loads of zero often only have a cost in code size, as we can bundle them with other instructions, so I did not really see any improvements in cycles when disregarding potential I-cache effects.

Anyway, it doesn't seem to be totally useless for us. I'll see if I can run some more tests later.

asb added a comment.Jun 20 2022, 2:50 AM

@jonpa: From your perspective, what's the threshold on whether this should be added to the pass pipeline or not? A pass like this that always results in better code (as opposed to sometimes producing better, sometimes worse code) is attractive even if it triggers rarely. But obviously being too eager to add such code will hit compile time.

jonpa added a comment.Jun 20 2022, 4:18 AM

@jonpa: From your perspective, what's the threshold on whether this should be added to the pass pipeline or not? A pass like this that always results in better code (as opposed to sometimes producing better, sometimes worse code) is attractive even if it triggers rarely. But obviously being too eager to add such code will hit compile time.

I think it should be enabled on targets where there is some kind of real performance improvement and compile time is not top-priority. If there are only static code-generation improvements that do not really make a difference in execution time, it seems better to leave it out. On SystemZ, there are just one or two slight performance improvements (~1%), enough to want to use it, I think.

Generally, loads of immediates and address anchors are easily done ahead of time on an OOO machine, so just because it is a nice cleanup doesn't mean it would be worth adding. I am hoping that the other targets would care to try it on benchmarks, and if there are then some more interest we could enable it for those targets.

jonpa updated this revision to Diff 459019.Sep 9 2022, 5:26 AM

Ping!

Patch rebased and waiting for evaluation on different targets.

As I have spent a certain amount of time developing this new target independent pass, I would be very happy to get some feedback.

Short summary of what this patch does:

  • Cleaning up of rematerialized immediates: After ISel, an immediate is typically copied to all its users from the original def, and regalloc then rematerializes it, which causes a lot of identical immediate loads that are currently never cleaned away even when they are close together in the same MBB.
  • Cleaning up of big address offsets forced into registers: SystemZRegisterInfo::eliminateFrameIndex() loads the bigger part of an out-of-range offset into a register for each FI access, with the idea that these anchors should be reused between accesses. There is, however, currently no late "machine cleanup" of identical instructions, so there is no reuse at all. This surprised me and was the motivation for this patch, and I think this should be relevant to other targets as well.
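For illustration, the anchor split can be modeled like this (a sketch assuming a 12-bit unsigned displacement as for SystemZ std; the actual eliminateFrameIndex() logic differs in detail, and split() is a hypothetical helper):

```python
# Split a frame offset into an anchor (loaded into a scratch register)
# plus a displacement that fits the instruction's 12-bit field (0..4095).
DISP_MAX = 4095

def split(offset):
    anchor = offset & ~DISP_MAX if offset > DISP_MAX else 0
    return anchor, offset - anchor

# Frame offsets matching the lay/std example earlier in the thread:
# 'lay %r1, 4096(%r15)' + 'std ..., 1024(%r1)' accesses offset 5120, etc.
spills = [5120, 5096, 5112]
print([split(o) for o in spills])
# -> [(4096, 1024), (4096, 1000), (4096, 1016)]
```

All three accesses share the 4096 anchor, so a single lay suffices once the redundant re-definitions are cleaned up.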

It's ~300 LOC, with a small compile time approximately similar to MachineCopyPropagation.

On SystemZ I see ~22k instructions cleaned on SPEC and some slight benchmark improvements which is nice, but perhaps not enough to add a new pass to the backend. I expected this to be done in common code in the first place, and I think this is where it belongs as it is not really target specific.

The question for me now is whether this cleanup should be purposefully omitted (e.g. "doesn't matter enough on OOO CPUs"), or if there is a performance improvement available here. If there is, this patch could be used, or perhaps some other way of achieving this (i.e. earlier in the compilation - any ideas, anyone?). If there *isn't*, then that would also be good to know (and perhaps worth a comment somewhere in CodeGen)!

I hope you agree that it would be worth at least finding out whether this cleanup matters or not. These targets have this number of test files improved: RISCV: 21, AMDGPU: 15, X86: 13, AArch64/ARM/Thumb: 12, PowerPC: 4, Mips: 2, BPF: 1. It would be very nice to learn how much of a cleanup targets achieve in terms of number of instructions eliminated and benchmark improvements. If there is no static improvement (no eliminated instructions) on benchmarks for a target, it would also be interesting to see why the cleanup was not even needed.

I am not really strongly arguing for this pass to be added, but I would very much like to come to some consensus on this one way or the other, thanks.

foad added a subscriber: foad.Sep 12 2022, 6:31 AM
foad added inline comments.Sep 12 2022, 9:11 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll
4710 ↗(On Diff #459019)

The redundant s_mov_b32 instructions in saddsat.ll and ssubsat.ll should be fixed by D133702.

asb added a comment.Sep 13 2022, 7:58 AM

My natural bias would be towards accepting a pass that wasn't excessively complicated and demonstrably improved static code quality (especially like this one does, in a way where it's never a regression, even if the benefit may be small or not measurable).

There's a cost in terms of complicating the compiler and adding a new pass, but there's also a cost in having generated code that's known to be substandard. Every time someone invests time in analysing compiler output, the redundant code will show up, and people are likely to waste time figuring out whether it's measurable, or perhaps whether it's a symptom of some wider codegen problem.

What do others think?

Do you have a sense of what (in practice) is causing the dead operations to get emitted? Can they practically be handled upstream? Such a thing would be a win for all the upstream cost models and would reduce compile time.

I wonder if a version of your pass working as an "assertion" could shake out some obvious missed optimizations or other bugs that cause extraneous instructions to get emitted.

If there are well known things that are impractical to handle (for one reason or another) then something like this would make sense to me.

lkail added a subscriber: lkail.Sep 14 2022, 6:36 PM
foad added a comment.Sep 15 2022, 7:11 AM

I can understand the temptation to commit this patch. I also agree with Chris that it's useful as a diagnostic tool to find deficiencies that should probably be fixed elsewhere. That's what I did with D133702.

Do you have a sense of what (in practice) is causing the dead operations to get emitted?

Speaking only for AMDGPU, I think there are quite a few different reasons. So far I've seen:

  • Missed optimizations in globalisel (D133702).
  • PrologEpilogInserter calls TargetRegisterInfo::eliminateFrameIndex which materializes a frame offset into a register, too late for the resulting register-immediate moves to be commoned up.
  • RegisterCoalescer(?) not doing a good job with register tuples? The problem is that a 64-bit constant like 0x1 is materialized in a register pair like s[0:1], and then 32-bit 0x0 is materialized in s1 -- but s1 was already 0 because it was the high part of the 64-bit constant.
  • Some problems related to CFG structurization: I suspect that the highly-AMDGPU-specific parts of this like the SIOptimizeExecMasking pass are not doing a good enough job of cleaning up redundant code.
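The register-tuple case in the third bullet is easy to see numerically (illustrative sketch; split64 is a hypothetical helper, not AMDGPU code):

```python
# A 64-bit constant split into 32-bit halves, low half first, as for an
# AMDGPU sgpr pair s[0:1] (s0 holds the low half, s1 the high half).
def split64(c):
    return c & 0xFFFFFFFF, (c >> 32) & 0xFFFFFFFF

lo, hi = split64(0x1)   # materializing 64-bit 0x1 into s[0:1]
assert lo == 1
assert hi == 0          # s1 already holds 0, so a later 32-bit
                        # materialization of 0 into s1 is redundant
```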
jonpa added a comment.Sep 15 2022, 7:54 AM

There's a cost in terms of complicating the compiler and adding a new pass, but there's also cost in having generated code that's known to be substandard. Every single time someone invests time in analysing compiler output, it will show up when looking for redundant code and people are likely to waste time figuring out if it's measureable, or perhaps if it's a symptom of some wider codegen problem.

Thanks for mentioning this important aspect which I agree very much with, only to fail to put it down into words here.

jonpa added a comment.Sep 15 2022, 8:13 AM

Do you have a sense of what (in practice) is causing the dead operations to get emitted? Can they practically be handled upstream? Such a thing would be a win for all the upstream cost models and would reduce compile time.

I wonder if a version of your pass working as an "assertion" could shake out some obvious missed optimizations or other bugs that cause extraneous instructions to get emitted.

If there are well known things that are impractical to handle (for one reason or another) then something like this would make sense to me.

I did another round today of trying to understand the causes here while building SPEC on SystemZ. The patch applied as-is removes ~22k instructions. I could see that rematerialization was the cause in some cases, so I rebuilt the comparison without it. In the next comparison I additionally disabled the register coalescer. In each step, these features were disabled both on main and with the patch:

  • main <> patch: ~22k instructions less
  • main <> patch, while disabling all rematerialization in the backend (returning false in isTriviallyReMaterializable()): ~15k instructions less
  • main <> patch, while disabling all rematerialization in the backend, and also disabling register coalescing (-join-intervals=false): ~8k instructions less

The register coalescer seems to play a role here as it can take two unrelated but identical immediate loads and make them define the same virtual register. This causes them to end up in the same physreg, and in some cases then the second one can be removed. This looks kind of similar to the results of rematerialization (having two smaller live-ranges instead of one).

For the remaining 8k instructions, it seems that many are instructions already present after isel. Prior to isel the value is an operand of an instruction, e.g. a call operand, and then (in some contexts) the target decides to load it into a register. Then there are also, on SystemZ, the already mentioned duplicated address-offset loads into registers emitted post-RA, which is a different case.

It seems to me that these redundant instructions are the effect of the overall heuristic in the compiler of reducing register pressure at the cost of more immediate loads, in all of the above. It's the way it should be, but it wouldn't hurt to clean things up somewhere in the end as much as possible. I don't think that can be done until the final allocation of registers is done (how would one know if there is a clobbering of the reg in between two identical immediate loads?). That's what it looks like to me on SystemZ, at least.

jonpa updated this revision to Diff 460424.Sep 15 2022, 8:28 AM

rebase (tests)

The X86 tests LGTM (including masked_load.ll and oddshuffles.ll)

jonpa updated this revision to Diff 466712.Oct 11 2022, 12:22 AM

PING!

Patch updated after rebase (just two AMDGPU tests).

Speaking only for AMDGPU, I think there are quite a few different reasons. So far I've seen:
...

  • PrologEpilogInserter calls TargetRegisterInfo::eliminateFrameIndex which materializes a frame offset into a register, too late for the resulting register-immediate moves to be commoned up.

...

Yes, this is exactly the same issue as on SystemZ. So far I cannot see any other solution for this - can you?

jonpa updated this revision to Diff 471134.Oct 27 2022, 5:17 AM

PING! Further comments would be very nice.

Another dozen or so tests updated (RISCV, X86) after rebase (improved).

I know there have been at least some interest in this patch, but so far I am still waiting for a consensus from everybody. Am I the only person who would actually like to see this go in?

X86 LGTM (and I'd like to see this committed as well!)

llvm/include/llvm/InitializePasses.h
265

This looks like it should be sorted?

llvm/lib/CodeGen/CMakeLists.txt
196

sorting

jonpa added inline comments.Oct 27 2022, 6:55 AM
llvm/include/llvm/InitializePasses.h
265

Yes... Maybe the pass would better be named something with Machine... to make it clear that it is a late CodeGen pass? Maybe MachineImmLoadsCleanupPass or something?

RKSimon added inline comments.Oct 27 2022, 7:03 AM
llvm/include/llvm/InitializePasses.h
265

SGTM - no strong preference tbh

jonpa updated this revision to Diff 471265.Oct 27 2022, 12:51 PM
jonpa marked 3 inline comments as done.

Patch updated per review (sorted order in two files), and also a name change into something hopefully better (it seems good to me that it begins with Machine...).

Thanks for taking a look - I guess then we could enable this for at least SystemZ and X86 to begin with if there are no objections. Other targets that seem to benefit like AMDGPU and RISCV could enable it at some later point if they would rather wait for some reason.

foad added a comment.Oct 27 2022, 1:30 PM

I'm happy to have this enabled for AMDGPU.

+1 to fast track AMDGPU/SystemZ/X86 - I don't know how you want to handle helping other targets in the longer term (raise bugs? TODO comments?)

jonpa added a comment.Nov 1 2022, 7:26 AM

+1 to fast track AMDGPU/SystemZ/X86 - I don't know how you want to handle helping other targets in the longer term (raise bugs? TODO comments?)

So far I have only thought that targets should decide for themselves if they want to run it or not. I hope they will give it a try eventually and either enable it or make a comment for the reasons not to. Perhaps a TODO comment next to the disablePass in the target file saying that this needs to be evaluated before enabling would be nice?

craig.topper added inline comments.Nov 1 2022, 9:45 AM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
48

Can these be DenseMaps?

53

Every MachineBasicBlock has a number assigned to it. This could possibly be a BitVector using the basic block numbering. See MachineBasicBlock::getNumber() and MachineFunction::getNumBlockIDs

You might also be able to use the numbering instead of a map for MBB2RegDefsMap. But that probably depends on how sparse that map is.

139

No underscores in variable names. Use VisitedPreds

156

!pred_empty

207

!MBB->pred_empty()?

223

Could this use llvm::all_of with a lambda?

282

Would std::queue be better here?

284

!Worklist.empty()?

asb accepted this revision.Nov 1 2022, 9:56 AM

LGTM to enable this on RISC-V too, thank you!

This revision is now accepted and ready to land.Nov 1 2022, 9:56 AM
craig.topper added inline comments.Nov 1 2022, 3:38 PM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
197

Use DefedReg.isValid(). Someday we should remove the implicit casts to unsigned integers from the Register class.

jonpa updated this revision to Diff 472617.Nov 2 2022, 8:06 AM
jonpa marked 8 inline comments as done.

Thanks for review! With these improvements, the pass's compile time percentage has decreased on average by 0.1%. I did all the changes except DenseMap (see inline comments below).

Tests updated after rebase:

AMDGPU/si-annotate-cf.ll
RISCV/rvv/fixed-vectors-rint-vp.ll
RISCV/rvv/rint-vp.ll

jonpa added inline comments.Nov 2 2022, 8:09 AM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
48

I did make some experiments with DenseMap and other alternatives earlier (see May 16 above) and never got any improvements in compile time by doing so.

In addition, it currently seems unwise to do that, since processBlock ("Clear any entries in map that MI clobbers") is looping over the entries while erasing some of them. I could not make this work even after rewriting the loop like:

  // Data structures to map regs to their definitions per MBB.
-  using Reg2DefMap = std::map<Register, MachineInstr*>;
-  using MBB2RegDefsMap = std::map<MachineBasicBlock *, Reg2DefMap>;
+  using Reg2DefMap = DenseMap<Register, MachineInstr*>;
+  using MBB2RegDefsMap = DenseMap<MachineBasicBlock *, Reg2DefMap>;
   MBB2RegDefsMap RegDefs;
 
   // Set of visited MBBs.
@@ -257,11 +258,10 @@ bool MachineLateInstrsCleanup::processBlock(MachineBasicBlock *MBB) {
 
     // Clear any entries in map that MI clobbers.
     for (auto DefI = MBBDefs.begin(); DefI != MBBDefs.end();) {
-      Register Reg = DefI->first;
+      Reg2DefMap::iterator CurrI = DefI++;
+      Register Reg = CurrI->first;
       if (MI->modifiesRegister(Reg, TRI))
-        DefI = MBBDefs.erase(DefI);
-      else
-        ++DefI;
+        MBBDefs.erase(CurrI);
     }

I then got "Assertion `isHandleInSync() && "invalid iterator access!"' failed.", so it seems to me that this isn't supported.

53

Using BitVector for Visited and Visited_preds seems to lower the average compile time (on the files where it shows up) by ~0.04%, so it looks like a very slight but noticeable improvement.

For the MBB2RegDefsMap I saw a similar slight further improvement on average, so with both of these changes an average improvement of 0.1% seems to result. It may be that with this change (MBB2RegDefsMap) some of the slower cases get slower, but the average is improved.

I have only tested this on SystemZ, but I would expect a BitVector to be faster than a map in general, so using it makes sense to me.

223

yeah

282

It's 0.01% faster on average, so why not :-)

danielkiss accepted this revision.Nov 7 2022, 2:30 AM

LGTM to enable this on Arm/AArch64.

jonpa requested review of this revision.Nov 17 2022, 8:48 AM

Ping!

Now that many targets are willing to try this, it would be nice to have it properly reviewed as well before I commit it.

From what I understand nearly all targets with updated tests except BPF, Mips and XCore have said they would like to have it enabled. Not sure if I should disable those targets...

@sdardis Any thoughts on the MIPS changes?

craig.topper added inline comments.Nov 17 2022, 10:06 AM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
112

Can we use RPOT (reverse post-order traversal)? That will guarantee this property except for loops.

165

I think you can use getMF() instead of getParent()->getParent()

rnk resigned from this revision.Nov 17 2022, 11:35 AM
jonpa updated this revision to Diff 476249.Nov 17 2022, 3:35 PM
jonpa marked an inline comment as done.

Updated per review. Using RPOT was a nice improvement - see inline comment. A few more improvements in tests also resulted from this.

llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
112

Yes - awesome!

55 lines of iteration logic removed, about the same amount of success (even very slightly better on SPEC), and improved compile time (ave 0.57% -> 0.50%).

jonpa added a comment.Nov 17 2022, 3:41 PM

Updated tests were:

llvm/test/CodeGen/AMDGPU/exec-mask-opt-cannot-create-empty-or-backward-segment.ll |  2 --
llvm/test/CodeGen/RISCV/rvv/bitreverse-sdnode.ll                                  |  2 +-
llvm/test/CodeGen/RISCV/rvv/bswap-sdnode.ll                                       |  1 -
llvm/test/CodeGen/RISCV/rvv/bswap-vp.ll                                           |  6 ------
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-bswap-vp.ll                             |  4 ----
llvm/test/CodeGen/SystemZ/frame-28.mir                                            |  5 +++--
jonpa added a comment.Nov 18 2022, 2:20 PM

That's done after regalloc in eliminateFrameIndex - interesting. My thoughts on this have been that it is more efficient to scan the entire function once rather than for each single frame index elimination. That should be possible if the used register is the same for consecutive FI operands, which I think it should typically be. It would be interesting to see what would happen if you removed that code and instead enabled this pass.

The other thing is that the bulk of the instructions being cleaned up seems to be loads of immediates which result from rematerialization rather than FIs.

jonpa updated this revision to Diff 476992.Nov 21 2022, 1:49 PM

Rebase.

This time the test CodeGen/AMDGPU/flat-scratch.ll failed the verifier because a kill flag on a super-register was not removed. To remedy this, I changed clearKillsForDef() to check for any overlapping register instead of just the identical one.

Updated tests:

AMDGPU/GlobalISel/call-outgoing-stack-args.ll
AMDGPU/GlobalISel/flat-scratch.ll
AMDGPU/chain-hi-to-lo.ll
AMDGPU/flat-scratch.ll
AMDGPU/si-annotate-cf.ll
AMDGPU/spill-offset-calculation.ll
AMDGPU/spill-scavenge-offset.ll

foad added inline comments.Nov 21 2022, 11:16 PM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
183

Could use structured bindings: for (auto [Reg, DefMI] : ...)

186

Could use all_of(drop_begin(MBB->predecessors()), ...)?

194–195

Use << printMBBReference(*MBB) << instead of rolling your own MBB#number syntax, here and elsewhere.

204–205

This looks like for (auto &MI : make_early_inc_range(MBB)))

207

Comment should explain why rather than what.

jonpa updated this revision to Diff 477189.Nov 22 2022, 7:40 AM
jonpa marked 5 inline comments as done.

Thanks (again) for nice review suggestions. Patch updated (NFC, no new test updates).

jonpa updated this revision to Diff 478579.Nov 29 2022, 7:01 AM

Ping!

Patch rebased, with one test updated: AMDGPU/GlobalISel/call-outgoing-stack-args.ll.

Still waiting for a final approval...

LGTM - @foad @craig.topper any additional feedback?

foad added a comment.Nov 29 2022, 8:37 AM

LGTM - @foad @craig.topper any additional feedback?

Still seems fine to me - but I still think we should also try to clean up some of these redundant instructions earlier, or avoid generating them in the first place, if possible.

RKSimon accepted this revision.Nov 30 2022, 4:34 AM

LGTM - cheers!

This revision is now accepted and ready to land.Nov 30 2022, 4:34 AM
This revision was landed with ongoing or failed builds.Dec 1 2022, 10:22 AM
This revision was automatically updated to reflect the committed changes.
reames added a subscriber: reames.Dec 1 2022, 10:36 AM

This looks very interesting. I've been thinking about ways to improve stack addressing sequences on RISCV, and this nicely tackles many of the cases I was just thinking about.

Unfortunately, I think the patch as landed is wrong. See the inline comment on an example where we can't forward due to a clobber which doesn't appear to be respected.

llvm/test/CodeGen/RISCV/rvv/rv64-spill-zvlsseg.ll
44 ↗(On Diff #479348)

This line looks to be wrong. a0 is clobbered on line 40 which is after the previous definition being forwarded from line 37.

jonpa updated this revision to Diff 479392.EditedDec 1 2022, 1:04 PM

The patch was reverted because of a build failure relating to lambda capture of structured bindings, which builds fine with gcc but not yet with clang.

I added init captures in processBlock(), which should remedy the build problems and also allow easy removal once clang fully accepts the direct capture.

for (auto [Reg, DefMI] : RegDefs[FirstPred->getNumber()])
  if (llvm::all_of(
       drop_begin(MBB->predecessors()),
       [&, &Reg = Reg, &DefMI = DefMI](const MachineBasicBlock *Pred) {
       ...

@reames: Thanks for catching that problem that was previously missed. I looked into it and found:

$x10 = ADDI $x2, 16
$x11 = PseudoReadVLENB
PseudoVSPILL2_M1 killed renamable $v8_v9, killed $x10, killed $x11 :: (store unknown-size into %stack.0, align 8)
INLINEASM &"" [sideeffect] [attdialect], $0:[clobber] ...
$x10 = ADDI $x2, 16
$x11 = PseudoReadVLENB
renamable $v7_v8 = PseudoVRELOAD2_M1 killed $x10, killed $x11 :: (load unknown-size from %stack.0, align 8)

After Machine Late Instructions Cleanup Pass (machine-latecleanup)

$x10 = ADDI $x2, 16
$x11 = PseudoReadVLENB
PseudoVSPILL2_M1 killed renamable $v8_v9, $x10, $x11 :: (store unknown-size into %stack.0, align 8)
INLINEASM &"" [sideeffect] [attdialect], $0:[clobber], ...
renamable $v7_v8 = PseudoVRELOAD2_M1 killed $x10, killed $x11 :: (load unknown-size from %stack.0, align 8)

After RISCV pseudo instruction expansion pass (riscv-expand-pseudo)

$x10 = ADDI $x2, 16
$x11 = PseudoReadVLENB
VS1R_V $v8, $x10, implicit $v8_v9 :: (store unknown-size into %stack.0, align 8)
$x10 = ADD $x10, $x11
VS1R_V $v9, $x10, implicit $v8_v9 :: (store unknown-size into %stack.0, align 8)
INLINEASM &"" [sideeffect] [attdialect], $0:[clobber], ...
$v7 = VL1RE8_V $x10 :: (load unknown-size from %stack.0, align 8)
$x10 = ADD $x10, $x11
$v8 = VL1RE8_V $x10 :: (load unknown-size from %stack.0, align 8)

The problem here is that $x10 is only marked as a use operand in the pseudo instruction (PseudoVSPILL2_M1) when the cleanup pass sees it, but it is in fact later getting expanded to a sequence that actually clobbers the register. I think the fix should be to make sure any pseudo instruction is properly modeled to reflect the final expansion, in this case with a def operand of the base register.

Could I disable the pass for now for RISCV and recommit?

jonpa reopened this revision.Dec 1 2022, 1:06 PM
This revision is now accepted and ready to land.Dec 1 2022, 1:06 PM
jonpa requested review of this revision.Dec 1 2022, 1:07 PM
jonpa added a reviewer: reames.
reames added a comment.Dec 1 2022, 2:10 PM

@reames: Thanks for catching that problem that was previously missed. I looked into it and found:
...
Could I disable the pass for now for RISCV and recommit?

If you're sure this is a RISCV specific problem, sure. Your description does sound like one to me, so it basically comes down to how sure of your analysis you are.

@reames: Thanks for catching that problem that was previously missed. I looked into it and found:
...
Could I disable the pass for now for RISCV and recommit?

If you're sure this is a RISCV specific problem, sure. Your description does sound like one to me, so it basically comes down to how sure of your analysis you are.

Looks like I tried to fix the RISCV issue once. https://reviews.llvm.org/D109405 but it caused other problems and got reverted. Then I guess I lost track of it.

jonpa added a comment.Dec 1 2022, 2:49 PM

OK - I will recommit this then as it does seem like a problem in the RISCV backend. I'll wait a day or so to let other targets check that they don't have this potential problem...

jonpa updated this revision to Diff 479862.Dec 3 2022, 12:47 PM

Recommitted as 17db0de, with the pass disabled for RISCV for now.

Posting the RISCV part of the patch here for future use when you want to enable it.

This revision is now accepted and ready to land.Dec 3 2022, 12:47 PM
jonpa added a comment.Dec 4 2022, 3:59 PM

Reverted again with 122efef.

MaskRay added a comment.EditedDec 4 2022, 4:40 PM

Hi, it's useful to include Differential Revision: https://reviews.llvm.org/D123394 for all reland commits so that others can easily tell the state by visiting this page.
For relands, please include the original description (don't let readers trace through the original commit to get the description).

MaskRay added inline comments.Dec 4 2022, 4:47 PM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
87

This can be moved closer to Changed |= processBlock(MBB);

113
177

Don't add unneeded blank lines between variable definitions or a variable definition. If the next statement has a comment, it is fine to keep a blank line before the comment.

Applies to elsewhere in the patch.

203

This condition is somewhat strange.

mgorny added a comment.Dec 5 2022, 7:49 AM

@MaskRay, I've reverted your M68k test fix since this change was reverted too, to get M68k tests to pass again.

@jonpa, can you please include the change from f55880e830e150d98e5340cdc3c4c41867a5514d in your patch?

This revision was landed with ongoing or failed builds.Dec 5 2022, 10:55 AM
This revision was automatically updated to reflect the committed changes.
jonpa marked 4 inline comments as done.
jonpa updated this revision to Diff 480166.Dec 5 2022, 11:03 AM

Latest patch as committed (did not update correctly automatically).

uabelho added a subscriber: uabelho.Dec 6 2022, 1:59 AM
uabelho added inline comments.Dec 13 2022, 2:02 AM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
106–110

Hi @jonpa

It looks like clearKillsForDef doesn't really handle bundled input. E.g. for our OOT target I've seen cases where "kill" flags were cleared in the BUNDLE instruction, but not in the individual bundled instructions, which then led to verifier complaints.

Do you know if there are other parts of the pass that have problems with bundled input?

uabelho added inline comments.Dec 14 2022, 12:13 AM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
114–123

I wonder if this won't abort too early? Or are we really guaranteed to find an implicit kill of the super-reg within this instruction if we find a kill of a sub-reg?

At least for my target I've found cases like

DEF superreg
USE killed subreg1
USE killed subreg2
DEF superreg
USE killed subreg1

and with the current implementation we get

DEF super reg
USE killed subreg1
USE subreg2
USE killed subreg1

and then the verifier complains on the second use of subreg1.

jonpa added a comment.Dec 14 2022, 2:23 PM

Hi @uabelho , I'm glad you may find some use for this pass as well :)

It looks like clearKillsForDef doesn't really handle bundled input. E.g. for our OOT target I've seen cases where "kill" flags were cleared in the BUNDLE instruction, but not in the individual bundled instructions which then lead to verifier complaints.

Do you know if there are other parts of the pass that have problems with bundled input?

TBH, it has been a while since I thought about BUNDLEs, but IIRC it should be possible to switch to an instr_iterator with NFC in the *absence* of BUNDLEs, so perhaps that would be a somewhat trivial change..?
Maybe @kparzysz would be interested in this as well, as I believe Hexagon is also emitting BUNDLEs..?

Not quite sure how BUNDLEs need to be handled, for instance when a bundled instruction is deleted, I guess the BUNDLE operands must also be updated.

I wonder if this won't abort too early? Or are we really guaranteed to find an implicit kill of the super-reg within this instruction if we find a kill of a sub-reg?

Are you returning true from enableSubRegLiveness()? I suspect that could cause a difference that I haven't yet run into myself.

As I have mentioned before, I still think the most valuable thing here would be to consider dropping kill flags entirely since this is run so late. I don't think e.g. the buildSchedGraph() uses them. Perhaps they could be regenerated quickly when they are actually useful. The benefits here would be to not have to worry about this tedious updating, which also costs potentially in compile time. Alternatively, it seems that this has to be fixed here for the case of enabled subreg liveness. I have at least so far found it sufficient with the current search for kill flags (without subreg liveness enabled).

I wonder if this won't abort too early? Or are we really guaranteed to find an implicit kill of the super-reg within this instruction if we find a kill of a sub-reg?

Are you returning true from enableSubRegLiveness()? I suspect that could cause a difference that I haven't yet run into myself.

As I have mentioned before, I still think the most valuable thing here would be to consider dropping kill flags entirely since this is run so late. I don't think e.g. the buildSchedGraph() uses them. Perhaps they could be regenerated quickly when they are actually useful. The benefits here would be to not have to worry about this tedious updating, which also costs potentially in compile time. Alternatively, it seems that this has to be fixed here for the case of enabled subreg liveness. I have at least so far found it sufficient with the current search for kill flags (without subreg liveness enabled).

We haven't enabled subreg liveness. What we do have though is a downstream hack in VirtRegMap.cpp that throws away implicit uses of superregisters since we haven't understood what they are good for and they just limit scheduling later.
But ok, that explains why those implicit killed uses are missing for us :) I'm still not sure if they are always guaranteed to be there though. Do you know if this is documented somewhere?

Right now, downstream, I just made clearKillsForDef continue searching until it finds a def, without the early abort made possible by the implicit super-reg kill. (No idea what compile-time impact this has.)

We haven't enabled subreg liveness. What we do have though is a downstream hack in VirtRegMap.cpp that throws away implicit uses of superregisters since we haven't understood what they are good for and they just limit scheduling later.

Ah - yeah have some vague notion of that :-)

But ok, that explains why those implicit killed uses are missing for us :) I'm still not sure if they are always guaranteed to be there though. Do you know if this is documented somewhere?

Not really - the idea was basically to stop the search backwards also when a use of the reg is found as there is no point in continuing. I then encountered a problem in a case of the extra super-reg operand, and added a handling for that. I guess I was then still assuming that there can't be an earlier kill of the register. Are you seeing a problem here also without the removal of the superregisters-hack you are using? Something like a def of a superregister, and then a kill of a subreg, in which case the search should continue for the other subreg kill?

Right now, downstream, I just made clearKillsForDef continue searching until it finds a def, without the early abort made possible by the implicit super-reg kill. (No idea what compile-time impact this has.)

Makes sense to me - I tried also earlier to store the kills in a map but I found (at least on SystemZ) that this backwards search was actually quicker than that, so I don't think it should be an issue in the normal case...

Makes sense to me - I tried also earlier to store the kills in a map but I found (at least on SystemZ) that this backwards search was actually quicker than that, so I don't think it should be an issue in the normal case...

Hello Jonas, just FYI, I got this case:

Total Execution Time: 670.4078 seconds (670.7149 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  289.2998 ( 43.7%)   0.1439 (  1.8%)  289.4437 ( 43.2%)  289.5775 ( 43.2%)  Structurize control flow
  124.9852 ( 18.9%)   0.0000 (  0.0%)  124.9852 ( 18.6%)  125.0289 ( 18.6%)  Machine Late Instructions Cleanup Pass
  45.2408 (  6.8%)   0.0000 (  0.0%)  45.2408 (  6.7%)  45.2740 (  6.8%)  SI Form memory clauses
  38.0358 (  5.7%)   0.0240 (  0.3%)  38.0598 (  5.7%)  38.0767 (  5.7%)  Simple Register Coalescing

Most of the time is spent in clearKillsForDef.

jonpa added a comment.Mar 11 2023, 3:13 AM

Makes sense to me - I tried also earlier to store the kills in a map but I found (at least on SystemZ) that this backwards search was actually quicker than that, so I don't think it should be an issue in the normal case...

Hello Jonas, just FYI, I got this case:

Total Execution Time: 670.4078 seconds (670.7149 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  289.2998 ( 43.7%)   0.1439 (  1.8%)  289.4437 ( 43.2%)  289.5775 ( 43.2%)  Structurize control flow
  124.9852 ( 18.9%)   0.0000 (  0.0%)  124.9852 ( 18.6%)  125.0289 ( 18.6%)  Machine Late Instructions Cleanup Pass
  45.2408 (  6.8%)   0.0000 (  0.0%)  45.2408 (  6.7%)  45.2740 (  6.8%)  SI Form memory clauses
  38.0358 (  5.7%)   0.0240 (  0.3%)  38.0598 (  5.7%)  38.0767 (  5.7%)  Simple Register Coalescing

Most of the time is spent in clearKillsForDef.

Hi Valery, thanks for the report! It would be nice if you posted your test case on github issues and assigned me (and/or yourself :)... I would think this could be fixed.

Hi Valery, thanks for the report! It would be nice if you posted your test case on github issues and assigned me (and/or yourself :)... I would think this could be fixed.

Sorry for delay, done https://github.com/llvm/llvm-project/issues/61397. Maybe if you have the version with kill flag maps we can check if it works better on the testcase.

Hi Valery, thanks for the report! It would be nice if you posted your test case on github issues and assigned me (and/or yourself :)... I would think this could be fixed.

Sorry for delay, done https://github.com/llvm/llvm-project/issues/61397. Maybe if you have the version with kill flag maps we can check if it works better on the testcase.

Indeed, the kill-flags map does seem to remedy this huge test case. My results and a patch are posted on the GitHub issue.

Anyone interested in compile time with this pass with huge functions, please feel free to contribute your opinion here.