This is an archive of the discontinued LLVM Phabricator instance.

[MachineCopyPropagation] Extend pass to do COPY source forwarding
ClosedPublic

Authored by gberry on Mar 8 2017, 11:15 AM.

Details

Summary

This change extends MachineCopyPropagation to do COPY source forwarding.

This change also extends the MachineCopyPropagation pass to be able to
be run during register allocation, after physical registers have been
assigned, but before the virtual registers have been re-written, which
allows it to remove virtual register COPY LiveIntervals that become dead
through the forwarding of all of their uses.

Event Timeline

gberry created this revision.Mar 8 2017, 11:15 AM

High-level question: Why does the register allocator not do this on its own?

test/CodeGen/PowerPC/fma-mutate.ll
17–18

Is this an improvement?

High-level question: Why does the register allocator not do this on its own?

I'm not sure I can fully answer that question, but doing this during register allocation could have the down-side of extending live ranges resulting in a worse allocation. Doing it in a pass just after register allocation allows it to be more conservative/opportunistic and not impact e.g. the amount of spilling.

If instead you're asking, why not just append this code at the end of register allocation, the only reason is based on previous feedback from Quentin that this should be a separate pass.

test/CodeGen/PowerPC/fma-mutate.ll
17–18

The fmr is not new, I just added it to get the second register number. Here are the full diffs before/after this change for this test:

 	fmr 3, 1
 	addi 3, 3, .LCPI0_0@toc@l
 	lfs 2, 0(3)
-	xsnmsubadp 3, 2, 3
+	xsnmsubadp 3, 2, 1
 	xsmuldp 4, 0, 0
 	xsmaddmdp 4, 3, 2
 	xsmuldp 0, 0, 4
arsenm added a subscriber: arsenm.Mar 8 2017, 12:13 PM
arsenm added inline comments.
lib/CodeGen/MachineCopyForwarding.cpp
110 ↗(On Diff #91056)

Use DEBUG_TYPE instead of repeating

arsenm added inline comments.Mar 8 2017, 12:13 PM
lib/CodeGen/MachineCopyForwarding.cpp
57 ↗(On Diff #91056)

I think having this more exactly match the pass name is better. machine-copy-forwarding

High-level question: Why does the register allocator not do this on its own?

I'm not sure I can fully answer that question, but doing this during register allocation could have the down-side of extending live ranges resulting in a worse allocation. Doing it in a pass just after register allocation allows it to be more conservative/opportunistic and not impact e.g. the amount of spilling.

Okay. If it only makes sense as a fallback strategy, not something that would otherwise affect how the allocation is performed, this seems like the logical way to do it.

If instead you're asking, why not just append this code at the end of register allocation, the only reason is based on previous feedback from Quentin that this should be a separate pass.

Makes sense.

test/CodeGen/PowerPC/fma-mutate.ll
17–18

Okay, thanks!

qcolombet edited edge metadata.Mar 9 2017, 11:44 AM

Hi Geoff,

Two questions:

  1. Why is this useful?
  2. If useful, why do we end up with this pattern?

For #1, I can think of giving more freedom to the post RA scheduler or eliminating the copy. If it is solely for post RA scheduling, I believe the scheduler should be able to do that. If it is to remove the copy, why do we have it in the first place?

Cheers,
-Quentin

javed.absar edited edge metadata.Mar 11 2017, 2:31 AM

Hi Geoff:

Thanks for this pass. If I understand correctly, its primary objective is to reduce reduce pressure?
I see only a few cases for ARM targets. Any particular reason why more cases don't benefit from your optimisation?

test/CodeGen/AArch64/arm64-zero-cycle-regmov.ll
7–10

Would it be better to rewrite these as MIR tests?

test/CodeGen/AArch64/neg-imm.ll
9–10

Would it be better adding new/separate test file instead of changing the purpose of this one ?

gberry marked 2 inline comments as done.Mar 13 2017, 11:30 AM

@arsenm I've addressed your comments in my working copy.
@qcolombet @javed.absar I'll address your comments/questions soon

@javed.absar The purpose of this pass is not to reduce register pressure (since it is run just after register allocation), but to allow more scheduling flexibility and to a lesser degree to remove some redundant COPYs. I'll elaborate on this in my response to Quentin.
As for your question about why more ARM tests aren't effected, I don't have a good answer, but my guess would be that there are just more X86 lit test cases both in general and in the number that are sensitive to changes in register allocation.

test/CodeGen/AArch64/arm64-zero-cycle-regmov.ll
7–10

I'm not sure how that would help. In this test, similar to the one Hal asked about before, the newly checked 'mov's aren't new, I just needed to add them to get the new register numbers. Here are the full diffs of the generated code for this test case:

 _t:                                     ; @t
 ; BB#0:                                 ; %entry
 	stp	x20, x19, [sp, #-32]!   ; 8-byte Folded Spill
 	stp	x29, x30, [sp, #16]     ; 8-byte Folded Spill
 	mov	 x19, x3
 	mov	 x20, x2
-	mov	 x0, x20
-	mov	 x1, x19
+	mov	 x0, x2
+	mov	 x1, x3
 	bl	_foo
 	mov	 x0, x20
 	mov	 x1, x19
 	bl	_foo
test/CodeGen/AArch64/neg-imm.ll
9–10

Again, I'm not trying to change the purpose of this test. My change just caused things to be scheduled slightly differently. The test is still checking that the condition is computed by a 'subs' feeding a 'csel'. Here are the full diffs:

test:                                   // @test
 	str	x20, [sp, #-32]!        // 8-byte Folded Spill
 	stp	x19, x30, [sp, #16]     // 8-byte Folded Spill
+	subs	w8, w0, #1              // =1
 	mov	 w19, w0
-	subs	w8, w19, #1             // =1
 	csel	w20, wzr, w8, lt
 .LBB0_1:                                // %for.body
                                         // =>This Inner Loop Header: Depth=1
 	cmp		w19, w20
 	b.eq	.LBB0_3
 // BB#2:                                // %if.then3
                                         //   in Loop: Header=BB0_1 Depth=1
 	mov	 w0, w20
 	bl	foo
 .LBB0_3:                                // %for.inc
                                         //   in Loop: Header=BB0_1 Depth=1
 	cmp		w20, w19
 	add	w20, w20, #1            // =1
 	b.le	.LBB0_1
 // BB#4:                                // %for.cond.cleanup
 	ldp	x19, x30, [sp, #16]     // 8-byte Folded Reload
 	ldr	x20, [sp], #32          // 8-byte Folded Reload
 	ret

@qcolombet This change is useful primarily to increase scheduling flexibility and reducing the critical path, though it does also make some COPYs unnecessary leading to their removal. Handling just the former in the scheduler is a possibility, but it would have the drawback of not provide a benefit to OoO cores that don't do post-RA scheduling.

As to your question of why we end up with this pattern, I looked at some cases where we end up removing COPYs and saw two main causes for this:

  1. virtual registers are not getting coalesced before/during RA because there is a mismatch in register-class (e.g. the source reg class is a subset of the dest reg class).
  2. only a partial segment of a complex live range has a COPY that becomes dead.

In terms of overall compiler complexity, I suspect that this pass could be extended a bit (to handle subreg COPYs for example) and make the current MachineCopyPropagation unnecessary. There may be problems doing this related to phase ordering or different optimization level pass pipelines though, I haven't investigated it thoroughly.

I will also note FWIW that changes doing something similar to this have come up at least twice before: D21455 and D20531

This change is useful primarily to increase scheduling flexibility and reducing the critical path, though it does also make some COPYs unnecessary leading to their removal. Handling just the former in the scheduler is a possibility, but it would have the drawback of not provide a benefit to OoO cores that don't do post-RA scheduling.

I may be wrong but I wouldn't expect OoO cores to be that affected by such change.

As to your question of why we end up with this pattern, I looked at some cases where we end up removing COPYs and saw two main causes for this:

  1. virtual registers are not getting coalesced before/during RA because there is a mismatch in register-class (e.g. the source reg class is a subset of the dest reg class).

That's strange as long as the register-class are a subset of one another, the coalescing should still be possible. If there is no intersection then the coalescing is not possible at all.
Could you share some test cases?

  1. only a partial segment of a complex live range has a COPY that becomes dead.

In terms of overall compiler complexity, I suspect that this pass could be extended a bit (to handle subreg COPYs for example) and make the current MachineCopyPropagation unnecessary.

If we could merge the MachineCopyPropagation logic in here, then that would be a non-brainer for the goodness of this approach.

Cheers,
-Quentin

gberry updated this revision to Diff 94733.EditedApr 10 2017, 1:38 PM

I've taken a new approach with this change: extending the existing
MachineCopyPropagation pass instead of making a new pass. This makes
the patch quite a bit simpler at the expense of making
MachineCopyPropagation a little more complicated (by having two
modes).

There are two AMDGPU lit test cases that I'm not sure about (marked
with XXXGCB) that I would appreciate someone more familiar with that
target to make sure they are reasonable.

To answer Quentin's original questions/comments:

  • I have at least one example of an OoO core that does benefit from this change (and specifically benefits even if no COPYs are removed, only forwarded).
  • I did some more investigating into why there are COPYs that can be forwarded/removed just after register allocation at all and the case that came up every time I looked deeper was COPYs that were inserted during RegAlloc Greedy (presumably as part of live range splitting?) that looked something like this (from aarch64 MultiSource/Benchmarks/MiBench/consumer-jpeg/jdphuff.c:decode_mcu_AC_refine)
  # After Greedy Register Allocator:
  9008B	BB#62: derived from LLVM BB %if.end169
  	    Predecessors according to CFG: BB#45 BB#93
  9056B		%vreg236:sub_32<def,read-undef> = SUBWrr %vreg236:sub_32, %vreg46; GPR64common:%vreg236 GPR32common:%vreg46
  9104B		%vreg215<def> = ASRVXr %vreg43, %vreg236; GPR64:%vreg215,%vreg43 GPR64common:%vreg236
  9128B		%vreg426<def> = COPY %vreg425; GPR32common:%vreg426,%vreg425
  9136B		%vreg217<def> = SUBWri %vreg426, 1, 0; GPR32common:%vreg217,%vreg426
  9152B		%vreg218<def> = ANDWrr %vreg217, %vreg215:sub_32; GPR32:%vreg218 GPR32common:%vreg217 GPR64:%vreg215
  9168B		%vreg426<def> = ADDWrr %vreg218, %vreg426; GPR32common:%vreg426 GPR32:%vreg218
  9200B		CBZW %vreg426, <BB#63>; GPR32common:%vreg426
	    Successors according to CFG: BB#63(0x30000000 / 0x80000000 = 37.50%) BB#94(0x50000000 / 0x80000000 = 62.50%)
Where the COPY added had a small live range and did not end up
getting allocated in such a way that the COPY was a NOP
(i.e. %vreg426 was assigned a different register than %vreg425).
gberry retitled this revision from [MachineCopyForwarding] Add new pass to do register COPY forwarding at end of register allocation. to [MachineCopyPropagation] Extend pass to do COPY source forwarding.Apr 10 2017, 1:40 PM
gberry edited the summary of this revision. (Show Details)
MatzeB edited edge metadata.Apr 27 2017, 4:34 PM

For the record: This looks very similar to https://reviews.llvm.org/D20531 and https://reviews.llvm.org/D21455. Though the earlier two attempts were somewhat sketchy in terms of correctness as they were renaming physregs and we have no good way in LLVM to detect which ones are legal to rename.

For the record: This looks very similar to https://reviews.llvm.org/D20531 and https://reviews.llvm.org/D21455. Though the earlier two attempts were somewhat sketchy in terms of correctness as they were renaming physregs and we have no good way in LLVM to detect which ones are legal to rename.

Yep, that's why I added you and @jonpa as reviewers and mentioned the previous changes it in my comment on 3/14 :)
I think this version avoids the issues of the previous two attempts by running when we still have virtual reg information.

Ping? @qcolombet have you had a chance to look at this latest version?

qcolombet added inline comments.May 26 2017, 5:45 PM
lib/CodeGen/MachineCopyPropagation.cpp
135

No else.

193

I am wondering if we want to obfuscate that this is really just the same pass with different parameters. I get that the dependencies are also different for the initialization process, but usually we just go with the most constraining one.

Is it worth doing differently here?

273

What about FullReg (or getSubReg if you want to invert) instead of ForClobber?

301

physregs don't have subreg.
Thus, you could make the code more readable with:
if (!TargetRegisterInfo::isVirtualRegister(Reg))

return Reg;

assert(PreRegRewrite);

// Then test over sub reg

350

I found the name of the function confusing.
It takes MIs but mention only reg classes.

447

Put /*ForClobber*/ (or the updated name) in from of false.

465

That's invalid per the MachineVerifier

I have implemented mostly same optimization in PowerPC backend (independently from this patch) and submitted a patch https://reviews.llvm.org/D34193 .
After I realized this patch based on a comment from @MatzeB, I abandoned my patch because this patch is platform neutral and so has wider coverage.
Since I see performance improvements in SPECCPU with my patch, I hope this patch gives similar performance gains.

lib/CodeGen/MachineCopyPropagation.cpp
619

Do we need to execute forwardUses after virtual register rewriting again?

692

Ditto. Do we need this after virtual register rewriting again?

FWIW, I'm hoping to get back to this change and process the feedback I've received in the next week or so.

gberry marked 3 inline comments as done.Jun 27 2017, 10:43 AM

I've fixed some of the comments and have questions about the others.
I also re-based and fixed an issue with instructions that have wider register implicit uses that are implicitly tied to other operands. The added code is the check in forwardUses() that references AMDGPU in the preceding comment.

lib/CodeGen/MachineCopyPropagation.cpp
193

I took a look at this, and the main problem that comes up when doing it as one pass is the fact that you wouldn't have separate pass IDs that can be used to disable the later run of this pass but not the earlier one (e.g. as is done by NVPTX and WebAssembly targets). I also think it might make it hard to run the pass in isolation via llc. If you think these issues are surmountable with the single pass approach, let me know and I'll take another look.

350

How about isForwardableRegClassCopy? I also renamed the second parameter to 'UseI'.

447

I refactored this to put the 'FullReg' parameter in the function name instead to be more clear.

465

Can you elaborate on this? I'm not sure what you are referring to as being invalid.

619

There are additional forwarding opportunities exposed later in the pipeline by e.g. tail-merging and tail-duplication that are caught by this second run.

gberry updated this revision to Diff 104212.Jun 27 2017, 10:44 AM
gberry edited the summary of this revision. (Show Details)

Rebased and addressed some review comments

qcolombet added inline comments.Jul 21 2017, 2:48 PM
lib/CodeGen/MachineCopyPropagation.cpp
193

Thanks for double checking. I'm fine with the added complexity given it has an explanation.
Please add comment explaining that in the code.

349

Could you add an example illustrating that?

465

Sorry. Physical registers don't have subreg indices. The machine verifier checks that.

gberry marked 2 inline comments as done.Jul 31 2017, 1:43 PM
gberry added inline comments.
lib/CodeGen/MachineCopyPropagation.cpp
465

In this case, the MOUse may not be a physical register. We are replacing a possibly virtual reg (MOUse) with a physical reg (NewUseReg), so we need to make sure to translate the subreg on MOUse. I've added a comment to hopefully clarify this a bit.

gberry updated this revision to Diff 108995.Jul 31 2017, 1:46 PM

Address Quentin's comments

qcolombet requested changes to this revision.Aug 3 2017, 2:47 PM
qcolombet added inline comments.
lib/CodeGen/MachineCopyPropagation.cpp
350

Works with better comment on top. See next comment.

362

No bracket

387

I think this example would do great on the comment of the function itself with a note that Copy is the first copy and UseI the second one. Of course, we need to document the non-copy UseI case as well :)

391

I'd add an assert that UseI.getOperand(1) == DstReg

395

Add a comment on what are the relation between CopyDstReg, CopySrcReg and UseMI and assert on that.
E.g., UseMI is a copy that uses CopyDstReg. CopyDstReg is going to be replaced by CopySrcReg.

396

Move that closer to its first use.

397

Unless I am missing something, I don't see the extend to use call for CopySrcReg.

424

Add a comment on what this is doing.

435

I would add:
if (!UseReg) continue;

444

I would avoid forwarding on physical register period.

465

Ah right, I missed the *New* in the name of the check.

516

We don't need this check for virtual reg, right?

563

It would help the readability if there were helper function for the different regclass/register fixing for the respective physreg and virtureg cases.

E.g., a helper with if NewUseReg is a physreg then block line 485 else block line 541.
Then the code would look like:
Check if replacing is possible between MOUse and NewUseReg
[...]
isForwardXXX
[...]
Adapt NewUseReg to whatever constraints are carried by MOUse
fix(NewUseReg, NewUseSubReg); // <--- This call your helper function

test/CodeGen/AArch64/flags-multiuse.ll
1

Why do we need to change the run line here?

This revision now requires changes to proceed.Aug 3 2017, 2:47 PM
gberry marked 12 inline comments as done.Aug 11 2017, 2:33 PM
gberry added inline comments.
lib/CodeGen/MachineCopyPropagation.cpp
397

That was being done before the call to this function. I've moved it into this function in my latest revision.

444

I have seen some benefit to forwarding physical registers, mostly in cases where block boundaries are changed between RA time and the second run of this pass. This mostly seems to happen when tail merge/tail duplication changes which uses COPYs are exposed to in the same block. This pass won't remove any physical register writes, so this seems relatively safe, other than the caveat in the comment above.

516

No, I don't think so. I added an early exit for this case to hasImplicitOverlap.

563

I've factored out quite a bit of code from this function, let me know what you think.

test/CodeGen/AArch64/flags-multiuse.ll
1

Turning off the post-RA scheduler kept the checked instructions in the same order. I've just re-arranged the checks now.

gberry updated this revision to Diff 110815.Aug 11 2017, 2:35 PM
gberry edited edge metadata.
gberry marked 4 inline comments as done.

Update to address Quentin's comments

qcolombet accepted this revision.Aug 14 2017, 9:28 AM

Thanks for your patience Geoff.

LGTM

This revision is now accepted and ready to land.Aug 14 2017, 9:28 AM
This revision was automatically updated to reflect the committed changes.