This is an archive of the discontinued LLVM Phabricator instance.

CodeGen peephole: fold redundant phys reg copies
ClosedPublic

Authored by jfb on Dec 2 2015, 9:20 AM.

Details

Summary

Code generation often exposes redundant physical register copies through
virtual registers such as:

%vreg = COPY %PHYSREG
...
%PHYSREG = COPY %vreg

There are cases where no intervening clobber of %PHYSREG occurs, and the
later copy could therefore be removed. In some cases this further allows
us to remove the initial copy.
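
For illustration, here is a minimal sketch of the fold. This is not the
patch's actual code: the function name and the NAPhysToVirtMIs map are
illustrative, current LLVM APIs are assumed for brevity, and subregister
indices are ignored.

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineInstr.h"

using namespace llvm;

// Forward scan over one basic block. NAPhysToVirtMIs maps a physical
// register to the COPY that last moved it into a virtual register; if we
// later see the inverse COPY and nothing clobbered the physical register
// in between, the second COPY is redundant.
static void foldRedundantPhysCopies(MachineBasicBlock &MBB) {
  DenseMap<Register, MachineInstr *> NAPhysToVirtMIs;
  for (MachineInstr &MI : make_early_inc_range(MBB)) {
    if (MI.isCopy()) {
      Register Dst = MI.getOperand(0).getReg();
      Register Src = MI.getOperand(1).getReg();
      if (Src.isPhysical() && Dst.isVirtual()) {
        NAPhysToVirtMIs[Src] = &MI; // remember %vreg = COPY %PHYSREG
        continue;
      }
      if (Dst.isPhysical() && Src.isVirtual()) {
        auto It = NAPhysToVirtMIs.find(Dst);
        if (It != NAPhysToVirtMIs.end() &&
            It->second->getOperand(0).getReg() == Src) {
          // %PHYSREG = COPY %vreg with no intervening clobber: erase it.
          MI.eraseFromParent();
          continue;
        }
      }
    }
    // Any other def of a tracked physical register invalidates its entry
    // (calls' regmask operands need the same treatment; see below).
    for (const MachineOperand &MO : MI.operands())
      if (MO.isReg() && MO.isDef() && MO.getReg().isPhysical())
        NAPhysToVirtMIs.erase(MO.getReg());
  }
}

The real pass additionally has to give up on inline asm, instructions with
unmodeled side effects, and regmask operands; the invalidation loop above is
only the conservative core.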

This patch contains a motivating example which comes from the x86 build
of Chrome, specifically cc::ResourceProvider::UnlockForRead uses
libstdc++'s implementation of hash_map. That example has two flag tests live
at the same time, and after machine sinking LLVM has confused itself enough
that it thinks spilling EFLAGS is a great idea, even though the flags are
never restored and both comparison results are live.

Before this patch we have:

DEC32m %RIP, 1, %noreg, <ga:@L>, %noreg, %EFLAGS<imp-def>
%vreg1<def> = COPY %EFLAGS; GR64:%vreg1
%EFLAGS<def> = COPY %vreg1; GR64:%vreg1
JNE_1 <BB#1>, %EFLAGS<imp-use>

Both copies are useless. This patch tries to eliminate the later copy in
a generic manner.

dec is especially confusing to LLVM when compared with sub.

I wrote this patch to treat all physical registers generically, but it only
removes redundant copies of non-allocatable physical registers, because the
allocatable ones caused issues (e.g. when calling conventions weren't
properly modeled) and should be handled later by the register allocator
anyway.
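
For concreteness, the restriction can be written as a small predicate; a
sketch under the same assumptions as above (the helper name is illustrative,
not necessarily the patch's own):

#include "llvm/CodeGen/MachineRegisterInfo.h"

using namespace llvm;

// Only copies of non-allocatable physical registers (EFLAGS being the
// motivating case) are fold candidates; allocatable registers are left to
// the register allocator, which is better placed to coalesce them.
static bool isNAPhysCopy(Register Reg, const MachineRegisterInfo &MRI) {
  return Reg.isPhysical() && !MRI.isAllocatable(Reg.asMCReg());
}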

The patch also behaves correctly when calls intervene between flag generation
and flag use (such calls clobber the flags):

test/CodeGen/X86/cmpxchg-clobber-flags.ll

The following tests used to fail when the patch also replaced allocatable
registers:

CodeGen/X86/StackColoring.ll
CodeGen/X86/avx512-calling-conv.ll
CodeGen/X86/copy-propagation.ll
CodeGen/X86/inline-asm-fpstack.ll
CodeGen/X86/musttail-varargs.ll
CodeGen/X86/pop-stack-cleanup.ll
CodeGen/X86/preserve_mostcc64.ll
CodeGen/X86/tailcallstack64.ll
CodeGen/X86/this-return-64.ll

Note that all other backends' tests pass.

Diff Detail

Repository
rL LLVM

Event Timeline

jfb updated this revision to Diff 41642.Dec 2 2015, 9:20 AM
jfb retitled this revision from to CodeGen peephole: fold redundant phys reg copies.
jfb updated this object.
jfb added a subscriber: llvm-commits.
jfb added a comment.Dec 2 2015, 9:27 AM

There's further complication around calling conventions, where it seems that the x86 backend doesn't fully model xmm registers as being caller-saved. This example:

target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@I32 = external global i32

declare i32* @foo(i32*, i64)

declare i32* @bar(i32, i32)

declare double @D()

declare i32* @P32()

declare i32* @baz(double)

define hidden i32* @ExpressionFromLiteral(i32 %token) {
entry:
  switch i32 %token, label %return [
    i32 83, label %bb0
    i32 82, label %bb1
  ]

bb0:
  %I32.loaded = load i32, i32* @I32
  %call.foo = tail call i32* @foo(i32* nonnull @I32, i64 48)
  %call.bar = tail call i32* @bar(i32 %I32.loaded, i32 %I32.loaded)
  %call.foo.gep.40 = getelementptr inbounds i32, i32* %call.foo, i64 40
  %ptr.foo = bitcast i32* %call.foo.gep.40 to i32**
  store i32* %call.bar, i32** %ptr.foo
  br label %return

bb1:
  %call.D = tail call double @D()
  %call.P32 = tail call i32* @P32()
  %call.baz = tail call i32* @baz(double %call.D)
  %call.P32.gep.40 = getelementptr inbounds i32, i32* %call.P32, i64 40
  %ptr.P32 = bitcast i32* %call.P32.gep.40 to i32**
  store i32* %call.baz, i32** %ptr.P32
  br label %return

return:
  %retval.0 = phi i32* [ %call.P32, %bb1 ], [ %call.foo, %bb0 ], [ null, %entry ]
  ret i32* %retval.0
}

It confuses LLVM around the @D call: the double returned in xmm0 is live across the @P32 call and is then passed to @baz. The code starts off as:

BB#2: derived from LLVM BB %bb1
    Predecessors according to CFG: BB#0
	ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
	CALL64pcrel32 <ga:@D>, <regmask %BH %BL %BP %BPL %BX %EBP %EBX %RBP %RBX %R12 %R13 %R14 %R15 %R12B %R13B %R14B %R15B %R12D %R13D %R14D %R15D %R12W %R13W %R14W %R15W>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def>
	ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
	%vreg8<def> = COPY %XMM0; FR64:%vreg8
	ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
	CALL64pcrel32 <ga:@P32>, <regmask %BH %BL %BP %BPL %BX %EBP %EBX %RBP %RBX %R12 %R13 %R14 %R15 %R12B %R13B %R14B %R15B %R12D %R13D %R14D %R15D %R12W %R13W %R14W %R15W>, %RSP<imp-use>, %RSP<imp-def>, %RAX<imp-def>
	ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
	%vreg9<def> = COPY %RAX; GR64:%vreg9
	%vreg1<def> = COPY %vreg9; GR64:%vreg1,%vreg9
	ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
	%XMM0<def> = COPY %vreg8; FR64:%vreg8
	CALL64pcrel32 <ga:@baz>, <regmask %BH %BL %BP %BPL %BX %EBP %EBX %RBP %RBX %R12 %R13 %R14 %R15 %R12B %R13B %R14B %R15B %R12D %R13D %R14D %R15D %R12W %R13W %R14W %R15W>, %RSP<imp-use>, %XMM0<imp-use>, %RSP<imp-def>, %RAX<imp-def>
	ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
	%vreg10<def> = COPY %RAX; GR64:%vreg10
	MOV64mr %vreg9, 1, %noreg, 160, %noreg, %vreg10; mem:ST8[%ptr.P32] GR64:%vreg9,%vreg10
    Successors according to CFG: BB#3(?%)

Note how xmm0 is implicitly defined, then (given the information available) needlessly copied to vreg8 and back, and then implicitly used. My patch deletes %XMM0<def> = COPY %vreg8, which in turn allows %vreg8<def> = COPY %XMM0 to be deleted.
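
The cascade can be made concrete: once the second copy is erased, the first
one is trivially dead as soon as its virtual register has no remaining uses.
A hedged fragment, with FirstCopy and Vreg as hypothetical names for the
first COPY and its defined virtual register (in practice later dead-code
cleanup can also catch this):

// If erasing the second copy left %vreg8 without uses, the first copy
// (%vreg8<def> = COPY %XMM0) is dead as well and can be erased too.
if (MRI.use_nodbg_empty(Vreg))
  FirstCopy->eraseFromParent();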

This could be fixed by either:

  • Being smarter about calling conventions in my code.
  • Marking all calls as having unmodeled side effects (which would lose some potential optimizations).
  • Marking xmm registers as clobbered by calls such as @P32 above.

That problem doesn't really need to be tackled until I figure out how to fix the other one, though. It's just extra :-)

qcolombet requested changes to this revision.Dec 2 2015, 9:40 AM
qcolombet added a reviewer: qcolombet.
qcolombet added a subscriber: qcolombet.

Hi,

We purposely don't remove such copies because it over-constrains the coloring later on.
The idea is that if the coalescing is possible, then the register allocator should prefer it. However, I reckon that in your motivating example that's not going to happen, since this is a cross-register-bank copy.

Anyway, the peephole optimizer happens too early IMO for this to be a reasonable solution. I would instead encourage you to look at fixing this in the CopyPropagation pass.

Cheers,
-Quentin

This revision now requires changes to proceed.Dec 2 2015, 9:40 AM

Note: The transformation would make sense in the peephole optimizer if we do it only for non-allocatable registers.

jfb added a comment.Dec 2 2015, 1:11 PM

Note: The transformation would make sense in the peephole optimizer if we do it only for non-allocatable registers.

Interesting proposal, that does seem to make sense. Let me look into it, thanks!

jfb updated this revision to Diff 41664.Dec 2 2015, 1:28 PM
jfb edited edge metadata.

NFC: rename PhysCopy to ConstPhysCopy

First step in applying qcolombet's suggestion of only applying this optimization
to constant physical registers (he suggested restricting it to non-allocatable
registers only, but that would include pinned ones too, whereas constant
physical registers seem more conservative).

jfb updated this revision to Diff 41672.Dec 2 2015, 1:50 PM
jfb edited edge metadata.

Restrict to non-allocatable physical registers.

I tried using MachineRegisterInfo::isConstantPhysReg instead but that's too
conservative and doesn't include the main motivator for this patch: EFLAGS.

This implements the change qcolombet suggested and passes all x86 tests,
including the new test I added, incdec-and-branch.ll.

jfb added a comment.Dec 2 2015, 1:52 PM

This now passes all tests with ninja check on a build with all backends, and seems to do what I want! I'll clean up the code a bit and test it on Chrome some more (including the issue I posted above).

jfb updated this revision to Diff 41680.Dec 2 2015, 2:32 PM

Clean up code.

jfb added a comment.Dec 2 2015, 2:34 PM

I cleaned up the code; it passes all the tests and achieves what I wanted for the one Chrome use case. @qcolombet, would you mind taking another look?

jfb updated this object.Dec 2 2015, 2:37 PM

Hi,

Looks almost good.
Two main remarks:

  1. Check for regmask in the operands as well
  2. Try to add more test cases for coverage

Thanks,
-Quentin

lib/CodeGen/PeepholeOptimizer.cpp
1427 ↗(On Diff #41680)

Put a message in the assert, even an obvious one :).

1537 ↗(On Diff #41680)

I'd make a function out of it, and you have to check the regmask operand as well.
For instance, if we have a pure (side-effect-free) function call, it may clobber the register and we will miss it.
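
Concretely, the fix is to also treat a call's regmask operand as a clobber of
the tracked registers. A sketch of a fragment that would slot into the
invalidation step of the earlier sketch, with MI and NAPhysToVirtMIs as
before (MachineOperand::clobbersPhysReg is the relevant query):

// A call's <regmask> operand clobbers every register not preserved by the
// callee, even when the call has no other modeled side effects. Drop every
// tracked copy whose physical register the mask clobbers.
SmallVector<Register, 4> Clobbered;
for (const MachineOperand &MO : MI.operands())
  if (MO.isRegMask())
    for (const auto &Entry : NAPhysToVirtMIs)
      if (MO.clobbersPhysReg(Entry.first.asMCReg()))
        Clobbered.push_back(Entry.first);
for (Register Reg : Clobbered)
  NAPhysToVirtMIs.erase(Reg);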

test/CodeGen/X86/incdec-and-branch.ll
16 ↗(On Diff #41680)

Could we have some “positive” testing as well, i.e. CHECK lines not just CHECK-NOT?

Also, I would rename the file to peephole-na-copy-folding or something and add more test cases where you check the inline asm case, the definition in the middle, etc.
Anyhow, you get it, something that improves the coverage.
I understand it can be difficult to come up with additional test cases; just give it a try, and if you can't find any, then you can't :).

jfb updated this revision to Diff 41812.Dec 3 2015, 3:27 PM
jfb marked an inline comment as done.

Address qcolombet's comments:

  • Rename test file.
  • Add more tests.
  • Make the tests positive.
  • Document asserts.
  • Also test for reg mask clobbering.
jfb marked an inline comment as done.Dec 3 2015, 3:28 PM

Comments addressed.

lib/CodeGen/PeepholeOptimizer.cpp
1537 ↗(On Diff #41680)

I added regmask.

I'm not sure I understand what you mean by "pure". Do you mean readonly? That still has to observe the calling convention and declare its clobbers, no?

qcolombet accepted this revision.Dec 3 2015, 3:34 PM
qcolombet edited edge metadata.

LGTM.

Thanks,
Q.

lib/CodeGen/PeepholeOptimizer.cpp
1537 ↗(On Diff #41812)

Yes, I mean readonly :).
Yes, it has to declare its clobbers.

What I meant was that if we do not check for the regmask operand, the check "MI->hasUnmodeledSideEffects()" was not enough to ensure we won't clobber the registers.

test/CodeGen/X86/peephole-na-phys-copy-folding.ll
6 ↗(On Diff #41812)

s/whne/when

This revision is now accepted and ready to land.Dec 3 2015, 3:34 PM
jfb updated this revision to Diff 41815.Dec 3 2015, 3:42 PM
jfb edited edge metadata.
  • Typo.
This revision was automatically updated to reflect the committed changes.