This is an archive of the discontinued LLVM Phabricator instance.

[MachineCopyPropagation] Eliminate spillage copies that might be caused by eviction chain
ClosedPublic

Authored by lkail on Mar 21 2022, 2:21 AM.

Details

Summary

Remove spill-reload like copy chains. For example

r0 = COPY r1
r1 = COPY r2
r2 = COPY r3
r3 = COPY r4
<def-use r4>
r4 = COPY r3
r3 = COPY r2
r2 = COPY r1
r1 = COPY r0

will be folded into

r0 = COPY r1
r1 = COPY r4
<def-use r4>
r4 = COPY r1
r1 = COPY r0

Diff Detail

Event Timeline

lkail created this revision.Mar 21 2022, 2:21 AM
lkail requested review of this revision.Mar 21 2022, 2:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 21 2022, 2:21 AM
lkail retitled this revision from [MachineCopyPropagaion][WIP] Eliminate spillage copies that might caused by eviction chain to [MachineCopyPropagation][WIP] Eliminate spillage copies that might caused by eviction chain.Mar 21 2022, 2:21 AM
lkail updated this revision to Diff 416891.Mar 21 2022, 4:23 AM
lkail updated this revision to Diff 416893.Mar 21 2022, 4:32 AM
lkail updated this revision to Diff 417201.Mar 22 2022, 12:35 AM
qcolombet added a subscriber: aditya_nandakumar.

At first glance it looks similar to what @aditya_nandakumar implemented internally to get rid of copies produced by eviction chains.

@aditya_nandakumar could you take a look?

Talked to Aditya offline (a while back actually) and he told me he doesn't expect to have time to look at this any time soon.
Adding myself as a reviewers.

Hi @lkail ,

Thanks for your patience.
This goes in the right direction.

I think we miss a few comments and some cleanups (clang-format, remove stall comments, use proper LLVM_DEBUG macros, etc.) and we're good to go!

Cheers,
-Quentin

llvm/lib/CodeGen/MachineCopyPropagation.cpp
1057

Add comments.
What this does: explain the algorithm at a high level.

1065

Not for this patch, but you may want to use isCopyInstr instead of isCopy to catch more cases.
That said, it probably won't make much of a difference, since in particular most copies you're trying to remove here comes from splitting in regalloc (i.e., we'll have plain COPY).

1065

To be on the safe side, you may want to check that the operation has no implicit operand.

1072

Instead of copying the chain, can we hold on to the reference until we're done with the processing?

1079

Maybe worth splitting the reload from the spill chain as this assert is strange at first glance.
Perhaps it wouldn't be as problematic when the method is properly documented.

Put differently, let's leave it like this for now, add a bunch of comments and we'll see if it still feels weird after that.

1082

We'll need to expand on that commet because naively, a 2 pairs chain would be beneficial to remove, but this is not something this code can do since we don't recolor outside of the spill chain.

I'd suggest putting something like: We need at least 3 pairs of copies for the transformation to apply, because the first outermost pair cannot be removed since we don't recolor outside of the chain and that we need at least one temporary spill slot to shorten the chain.
If we only have a chain of two pairs, we already have the shortest sequence this code can handle: the outermost pair for the temporary spill slot, and the pair that use that temporary spill slot for the other end of the chain.

1094

Here and other places where you have "debug" statements:
Put this in LLVM_DEBUG macros.
Or remove completely.

1099

Move that into its own helper function and call it from an assert.

static bool LLVM_ATTRIBUTE_UNUSED isValidChain(const SmallVectorImpl<MachineInstr *> &Chain) {
  // your checks here.
}
...
assert(isValidChain(Chain) && "Invalid chain to process");
1104

I feel that this is a bit late to check that.
We should not put a copy in the "candidate" chains if the copy is not foldable.

I would suggest to handle that in the main loop.

1111

By construction Chain[Len - 4]->getOperand(0) == Chain[Len - 3]->getOperand(1), so I would instead put that in a variable and use it in both places.

E.g., something like:

// Pull the last spill slot used only within the chain as the final spill slot.
MCRegister LastReusableRegSpillSlot = Chain[Len - 4]->getOperand(0).getReg()
// Update the chain to skip all the intermediate register spill slots:
// Spilling:
Chain[0]->getOperand(0).setReg(LastReusableRegSpillSlot);
// Reload:
Chain[1]->getOperand(1).setReg(LastReusableRegSpillSlot);
1113

Maybe be worth adding a comment here that although the variable is called MaybeDeadCopies, we really are going to remove the related instructions.

The fact that we use MaybeDeadCopies to do our code cleanup is slightly confusing because if we don't actually delete the intermediate copies (a.k.a. what remains of the chain at this point) the resulting code would be incorrect.

1178

At first it is strange to see that we look for a copy when Reg is a def, but I guess it makes sense because:

  1. We are not going to recolor Reg
  2. We need to consider this chain before it gets clobbered later in that same loop

Assuming I understood that correctly, it deserves its comment here.

1186

Use LeadRegs.find and avoid the double lookups (one in count and one in operator[]).

1196

I think this statement deserves its own comment.
IIUC here we unconditionally clobber all the registers (as opposed to only clobbering the definitions) because we only rewrite the chain itself (i.e., we don't attempt to rewrite uses after the chain).

BTW, you need to take into account regmasks too.

1196

Shouldn't we clear the SpillChains here for defs and not-preversed-by-regmasks regs at this point?

llvm/test/CodeGen/PowerPC/mcp-elim-eviction-chain.mir
135

Add a test with regmasks.

lkail updated this revision to Diff 471178.Oct 27 2022, 9:02 AM
lkail retitled this revision from [MachineCopyPropagation][WIP] Eliminate spillage copies that might caused by eviction chain to [MachineCopyPropagation] Eliminate spillage copies that might caused by eviction chain.
lkail edited the summary of this revision. (Show Details)
lkail added a reviewer: Restricted Project.

Hi @qcolombet , thanks for your detail comments. I have uploaded another patch which is very different from previous one. Not sure I have addressed all your comments.

lkail marked 7 inline comments as done.Oct 27 2022, 9:04 AM
lkail retitled this revision from [MachineCopyPropagation] Eliminate spillage copies that might caused by eviction chain to [MachineCopyPropagation] Eliminate spillage copies that might be caused by eviction chain.Oct 27 2022, 9:13 AM
lkail added a comment.Nov 15 2022, 9:44 PM

Gentle ping.

Hi @lkail,

I am halfway through.

I'm sharing my comments so far if you want to get started with some of the nitpicks.

Cheers,
-Quentin

llvm/lib/CodeGen/MachineCopyPropagation.cpp
108

Maybe rename in LastSeenUseInCopy.

Essentially, I would avoid LastUse alone as it carries a lot of expected semantic that I don't think apply here.

199

Use MI directly here instead of adding it line 200.

293

Could Current be const here?

312

If Def is clobbered between DefCopy and Current I would have expected that DefCopy would have been removed from Copies.

Put differently, I feel that if we have to check that here, the Copies map is holding hard-to-reason-about information.

Are we missing some call to clobberRegister?

1064

I think you can simplify this with a range based loop with rbegin/rend.

1065

That should be a simple range based loop:

for (const MachineInstr *MI: RC)
  MI->dump();
1084

Technically we could collapse this sequence to:

// r0 = COPY r4
// <def-use r4>
// r4 = COPY r0

I.e., I think it is worth explaining that the propagation doesn't check whether or not r0 can be altered outside of the chain and that's why we conservatively keep its value as it was before the rewrite. (Would be a nice follow-up fix BTW :)).

1087

Typo: until

1087

typo: chain uses

1100

typo: encountered

1105

That sounds weird.
Would you mind sharing the assertion?

1107

Could you add a comment on what the mappings hold (all three of them)?

I haven't read the code past this point yet, but for instance I would have expected that the key in these maps would be a Register not MachineInstr.

1108

typo: until

1110

Instead of tracking that, should we just invalidate the chain / stop it before that point?

1123

Maybe add a todo that if the outermost pair of copies modifies a register that is dead outside of that pair, we could eliminate one more pair.

lkail added inline comments.Nov 18 2022, 12:07 AM
llvm/lib/CodeGen/MachineCopyPropagation.cpp
199

Correct me if I'm woring, Copies.insert insert successfully only when the key doesn't exist before. Suppose we have

L0: R0 = COPY R1
L1: R2 = COPY R1

LastUseSeenInCopy should track MI in L1 rather than L0.

312

I can imagine there might be some RegMasks doesn't implicit def any registers, just clobber them. When we are traversing the MBB and are encountered with a RegMask without any other implicit-def, we don't know which register to clobber directly. I'm not sure we have a way to enumerate registers a RegMask clobbers. If there is such a way, I think we should clobber registers when we are traversing the MBB, not checking RegMask clobbers here.

1105

It hits DenseMapIterator's

  pointer operator->() const {
    assert(isHandleInSync() && "invalid iterator access!");
...
  }

I dived into it a bit, looks it's checking the validity of the iterator, i.e., if the container is updated, the iterator constructed before the update is invalid.
Code like

        auto Leader = ChainLeader.find(MaybePrevReload);
...
        ChainLeader.insert({MaybeSpill, Leader->second});
        ChainLeader.insert({&MI, Leader->second});

Should be avoided.

1107

I haven't read the code past this point yet, but for instance I would have expected that the key in these maps would be a Register not MachineInstr.

I separate the algorithm implementation in to two stages. stage1: Collect spill-reload chains. stage2: Fold the chains. If using Register, we are unable to track different spill-reload chains that share same registers.

1110

The implementation doesn't invalidate any chain in stage1. Compared to previous implementation, I think current one is easier to reason and easier to maintain. When we are iterating MI inside the MBB, we don't know which COPY might be one of the innermost spill-reload pair and we don't want to lose track of the innermost spill-reload pair. The Source of the innermost spill is allowed to be re-use and re-def between the innermost spill-reload pair.

lkail updated this revision to Diff 476364.Nov 18 2022, 12:08 AM

Address comments.

lkail marked 10 inline comments as done.Nov 18 2022, 12:09 AM
lkail updated this revision to Diff 476369.Nov 18 2022, 12:17 AM
lkail marked an inline comment as done.
lkail updated this revision to Diff 476371.Nov 18 2022, 12:30 AM
lkail added inline comments.Nov 18 2022, 12:57 AM
llvm/lib/CodeGen/MachineCopyPropagation.cpp
1084

Maybe we can check if r0 is killed to remove one more COPY.

qcolombet added inline comments.Nov 22 2022, 7:20 PM
llvm/lib/CodeGen/MachineCopyPropagation.cpp
199

Ah good point!

312

I think I see our misunderstanding.
Given the name of the function I would expect that this function only does some queries on the tracker, but you're actually using this function to do some bookkeeping as well.

So the conclusion is either rename this function to more accurately represents what it does (I don't have a good name for now) or move the bookkeeping in the main loop (i.e., I thought we were calling clobberRegister from the main loop already.)

Regarding your comment on RegMasks, I am not sure I follow:

  • RegMasks always list all the registers they preserve/clobber
  • Liveness sets at basic block boundaries are not represented with RegMasks, but anyway we don't care because the tracking is always purely local to a basic block in that pass. (Unless you've changed that and I missed it :)).
1064

You should be able to use an even more compact form:

for (auto I : make_range(SC.rbegin(), SC.rend())
1084

Yep, but for that to be accurate we would need to flip the direction of the analysis (from top-down, to bottom-up) to get proper liveness construction.
(Or use the kill flag, but generally speaking we try to avoid relying on this.)

1105

Yep, every time you insert something in the dense map, you may invalidate the iterators.

1107

I see, make sense.

lkail added inline comments.Nov 23 2022, 1:36 AM
llvm/lib/CodeGen/MachineCopyPropagation.cpp
312

RegMasks always list all the registers they preserve/clobber

Ah, I see. Is it a good idea to enumerate them via TRI->getNumRegs() and check RegMask to see if they are preserve/clobber?

For the origin question

If Def is clobbered between DefCopy and Current I would have expected that DefCopy would have been removed from Copies.

Does it imply we should not check RegMask and don't update bookkeeping by calling Tracker::clobberRegister here, checking if RegMask clobbers registers should already have been conducted in the main loop?

Correct me if I still fail to get your point.

qcolombet added inline comments.Nov 24 2022, 12:15 PM
llvm/lib/CodeGen/MachineCopyPropagation.cpp
312

Ah, I see. Is it a good idea to enumerate them via TRI->getNumRegs() and check RegMask to see if they are preserve/clobber?

That's the idea. Though we wouldn't need to enumerate all the registers, only the ones that you care about.
What you're doing here is already fine.

The thing that bothers me in the current code is the call to clobberRegister. This modifies the state of the tracker. Usually the find methods only query the RegMasks directly (with MachineOperand::clobbersPhysReg).

Correct me if I still fail to get your point.

You got the point.
Let's keep the code as is for now and let me do a full pass on the code so that I have a better model of how the whole things works.

I'll probably won't have time to do it before next week though.

Hi @lkail,

Thanks for your patience.

Looks mostly good to me.

The only thing that makes me uneasy is the potential impact on compile time of CheckCopyConstraint.
Do you know on average how many chains we see per function?

I know we have a bot tracking compile time. If that doesn't show anything significant, I guess we could enable it by default. I just don't know how extensive the tests are.

Alternatively, to avoid surprising everybody, we could add a way to enable/disable the new folding.
Either like what was done with UseCopyInstr or with a target hook (you can add a command line option too that would override what the target asked for for testing purposes).

Then only enabled it for your target and send an RFC asking people to try the new folding for their targets.

What do you think?

Cheers,
-Quentin

llvm/lib/CodeGen/MachineCopyPropagation.cpp
1118

Nit: const on MachineInstr*

1138

You should be able to use a range loop:

for (const MachineInstr *Spill : SC) {
    if (CopySourceInvalid.count(Spill))
            return;
}
1143

Nit: range loop

1149

That's going to be pretty expensive to walk all the register classes.
I'm guessing you're trying to check if the resulting copy is legal and unfortunately there's no good way to do that.

Did you see that showing up in compile time profile?

1179

Nit: range loop

1188

Nit: Here and other places where you use isCopyInstr: use the explicit type instead of auto. (The return type is hard to infer.)

lkail updated this revision to Diff 485466.Dec 28 2022, 12:18 AM
lkail updated this revision to Diff 485469.Dec 28 2022, 12:23 AM

Add target hook and command line option.

lkail added a comment.EditedDec 28 2022, 2:15 AM

Do you know on average how many chains we see per function?

I have run llvm-test-suite, here's the stats

functionsspill chain lengthavg spill chain length per functionmax spill chain length in one CUnumber of spill chainsavg number of spill chains per functionmax number of spill chains in one CU
powerpc64-ibm-aix604263080.00509714361367652270.0004468275245763
x86_64-linux-gnu1859386460.003474276371694130800.00043025094386314

Compile time on powerpc64-ibm-aix

Tests: 1039
Metric: compile_time

Program                                       compile_time           
                                              baseline     experiment
SingleSour...sts/2002-10-09-ArrayResolution     0.07         0.50    
MultiSourc...marks/Trimaran/enc-pc1/enc-pc1     0.14         0.86    
SingleSour...e/UnitTests/2002-10-13-BadLoad     0.06         0.38    
SingleSour...UnitTests/2002-08-02-CastTest2     0.06         0.37    
SingleSour...nitTests/2002-04-17-PrintfChar     0.06         0.36    
SingleSour.../UnitTests/conditional-gnu-ext     0.07         0.23    
MultiSource/Benchmarks/llubenchmark/llu         0.10         0.35    
SingleSour...tTests/2003-08-05-CastFPToUint     0.06         0.22    
SingleSour...nitTests/2003-05-31-LongShifts     0.07         0.22    
SingleSour...e/UnitTests/Vector/Altivec/lde     0.21         0.68    
SingleSource/UnitTests/blockstret               0.06         0.18    
SingleSour...e/UnitTests/2002-05-03-NotTest     0.07         0.21    
SingleSour...UnitTests/2002-05-02-CastTest2     0.07         0.20    
SingleSour...UnitTests/2003-08-11-VaListArg     0.15         0.43    
SingleSource/UnitTests/StructModifyTest         0.07         0.19    
      compile_time             
run       baseline   experiment
count  1039.000000  1039.000000
mean   1.108578     1.102223   
std    4.496574     4.384102   
min    0.000000     0.000000   
25%    0.000000     0.000000   
50%    0.000000     0.000000   
75%    0.276350     0.278450   
max    80.927800    77.030800

Compile time on x86_64-linux-gnu

Tests: 2991
Metric: compile_time

Program                                       compile_time           
                                              baseline     experiment
UnitTests/...9-04-16-BitfieldInitialization     0.00         0.02    
UnitTests/2002-04-17-PrintfChar                 0.00         0.01    
UnitTests/2002-10-13-BadLoad                    0.00         0.01    
UnitTests/block-copied-in-cxxobj-1              0.00         0.01    
UnitTests/testcase-ExprConstant-1               0.01         0.02    
UnitTests/2010-05-24-BitfieldTest               0.00         0.01    
UnitTests/block-byref-test                      0.00         0.01    
UnitTests/testcase-Expr-1                       0.01         0.02    
UnitTests/byval-alignment                       0.01         0.02    
Benchmarks/Misc/pi                              0.01         0.02    
UnitTests/block-byref-cxxobj-test               0.01         0.01    
UnitTests/2020-01-06-coverage-008               0.01         0.02    
UnitTests/2002-08-02-CastTest2                  0.01         0.01    
UnitTests/2002-05-03-NotTest                    0.02         0.02    
UnitTests/2005-05-13-SDivTwo                    0.01         0.02    
      compile_time             
l/r       baseline   experiment
count  2991.000000  2991.000000
mean   0.291675     0.291211   
std    2.642554     2.642948   
min    0.000000     0.000000   
25%    0.000000     0.000000   
50%    0.000000     0.000000   
75%    0.000000     0.000000   
max    99.370200    99.510200

I don't see significant compile time regressions.

I just don't know how extensive the tests are.

I have tried bootstrapping stage3 and running llvm-test-suite on powerpc64-ibm-aix and x86_64-linux-gnu, no regression found. I'll follow your advice to send an RFC on discourse, currently I only enable it on PowerPC by default. https://reviews.llvm.org/D122118?id=485466 shows changes in other targets.

lkail updated this revision to Diff 485479.Dec 28 2022, 2:41 AM
lkail updated this revision to Diff 485578.Dec 28 2022, 10:00 PM
bjope added a subscriber: bjope.Dec 30 2022, 4:39 AM
lkail updated this revision to Diff 485899.Jan 2 2023, 6:22 PM

Use range loop.

qcolombet accepted this revision.Jan 24 2023, 6:00 AM
This revision is now accepted and ready to land.Jan 24 2023, 6:00 AM
Matt added a subscriber: Matt.Jan 25 2023, 9:10 AM
This revision was landed with ongoing or failed builds.Feb 7 2023, 7:34 PM
This revision was automatically updated to reflect the committed changes.