This is an archive of the discontinued LLVM Phabricator instance.

Remove redundant copy in recurrences
ClosedPublic

Authored by twoh on Apr 7 2017, 11:19 AM.

Details

Summary

If there is a chain of instructions formulating a recurrence, commuting operands can help removing a redundant copy. In the following example code,

BB#1: ; Loop Header
  %vreg0<def> = COPY %vreg13<kill>; GR32:%vreg0,%vreg13
  ...

BB#6: ; Loop Latch
  %vreg2<def> = COPY %vreg15<kill>; GR32:%vreg2,%vreg15
  %vreg10<def,tied1> = ADD32rr %vreg1<kill,tied0>, %vreg0<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg1,%vreg0
  %vreg3<def,tied1> = ADD32rr %vreg2<kill,tied0>, %vreg10<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg3,%vreg2,%vreg10
  CMP32ri8 %vreg3, 10, %EFLAGS<imp-def>; GR32:%vreg3
  %vreg13<def> = COPY %vreg3<kill>; GR32:%vreg13,%vreg3
  JL_1 <BB#1>, %EFLAGS<imp-use,kill>

Existing two-address generation pass generates following code:

BB#1:
  %vreg0<def> = COPY %vreg13<kill>; GR32:%vreg0,%vreg13
  ...

BB#6:
    Predecessors according to CFG: BB#5 BB#4
  %vreg2<def> = COPY %vreg15<kill>; GR32:%vreg2,%vreg15
  %vreg10<def> = COPY %vreg1<kill>; GR32:%vreg10,%vreg1
  %vreg10<def,tied1> = ADD32rr %vreg10<tied0>, %vreg0<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg0
  %vreg3<def> = COPY %vreg10<kill>; GR32:%vreg3,%vreg10
  %vreg3<def,tied1> = ADD32rr %vreg3<tied0>, %vreg2<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg3,%vreg2
  CMP32ri8 %vreg3, 10, %EFLAGS<imp-def>; GR32:%vreg3
  %vreg13<def> = COPY %vreg3<kill>; GR32:%vreg13,%vreg3
  JL_1 <BB#1>, %EFLAGS<imp-use,kill>
  JMP_1 <BB#7>

This is suboptimal because the assembly code generated has a redundant copy at the end of #BB6 to feed %vreg13 to BB#1:

.LBB0_6:
  addl  %esi, %edi
  addl  %ebx, %edi
  cmpl  $10, %edi
  movl  %edi, %esi
  jl  .LBB0_1

This redundant copy can be elimiated by making instructions in the recurrence chain to compute the value "into" the register that actually holds the feedback value. In this example, this can be achieved by commuting %vreg0 and %vreg1 to compute %vreg10. With that change, code after two-address generation becomes

BB#1:
  %vreg0<def> = COPY %vreg13<kill>; GR32:%vreg0,%vreg13
  ...

BB#6: derived from LLVM BB %bb7
    Predecessors according to CFG: BB#5 BB#4
  %vreg2<def> = COPY %vreg15<kill>; GR32:%vreg2,%vreg15
  %vreg10<def> = COPY %vreg0<kill>; GR32:%vreg10,%vreg0
  %vreg10<def,tied1> = ADD32rr %vreg10<tied0>, %vreg1<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg1
  %vreg3<def> = COPY %vreg10<kill>; GR32:%vreg3,%vreg10
  %vreg3<def,tied1> = ADD32rr %vreg3<tied0>, %vreg2<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg3,%vreg2
  CMP32ri8 %vreg3, 10, %EFLAGS<imp-def>; GR32:%vreg3
  %vreg13<def> = COPY %vreg3<kill>; GR32:%vreg13,%vreg3
  JL_1 <BB#1>, %EFLAGS<imp-use,kill>
  JMP_1 <BB#7>

and the final assembly does not have redundant copy:

.LBB0_6:
  addl  %edi, %eax
  addl  %ebx, %eax
  cmpl  $10, %eax
  jl  .LBB0_1

Event Timeline

twoh created this revision.Apr 7 2017, 11:19 AM
twoh edited the summary of this revision. (Show Details)Apr 7 2017, 11:19 AM
twoh updated this revision to Diff 94549.EditedApr 7 2017, 11:22 AM

small changes to the test.

echristo resigned from this revision.Apr 11 2017, 1:48 PM

Probably best to have Quentin or Matthias do it.

Friendly ping. Thanks!

twoh added a comment.Apr 21 2017, 11:16 AM

Ping. Thanks!

MatzeB edited edge metadata.Apr 21 2017, 11:30 AM

This looks pretty complicated for the task at hand. Wouldn't this be a simpler transformation to perform at a place where we are still in Machine SSA and have use-def information for free and also have phis around to indicate loops so we can do without a MachineLoopInfo instance.

twoh added a comment.Apr 21 2017, 11:39 AM

@MatzeB I was considering other passes as well for the sake of simplicity, but I concluded two-address instruction is the right place to do. The problem doesn't appear while we're in SSA, but becomes an issue when we make an decision about which operand register should be the destination register. I think this is why isProfitableToCommute function is in two-address instruction pass.

twoh added a comment.Apr 28 2017, 9:40 AM

Ping. If you're too busy to review this patch, could you please recommend someone else? Thanks!

twoh added a comment.May 12 2017, 2:36 PM

Ping. Can someone please let me know what would be the best way to have this patch reviewed? Thanks!

MatzeB added a reviewer: wmi.May 15 2017, 10:59 AM

I'd like to hear @qcolombet's opinion about this, also +wmi who wrote the similar isRevCopyChain().

This patch

  • Nitpicks below
  • Do you have an idea of the compiletime impact of this patch?
  • I am slightly worried about the use of MachineLoopInfo: Is it really necessary here? Maybe the same can be done with some simpler dominance check? Can you tell when the MachineLoopInfo is/isn't available (as that will impact whether this rule is applied or not).

General Observations

I am not happy with the general approach of throwing more and more patterns at TwoAddressInstructions:

  • Yes, LLVMs handling of TwoAddressInstruction as a pre-RA pass is a bad idea IMO (and I don't know why it was done this way): We are unnecessarily constraining the allocation problem and don't necessarily do a good job upfront without knowing how the allocation will work out.
  • This pass adds another pattern at TwoAddressInstruction: It is similar to isRevCopyChain() but slightly more complicated so we end up with 160 extra lines of analysis for yet another pattern. If this trend continues we will have an even harder time maintaining this pass.
  • On the other hand this improves code quality today without rearchitecting the code.
lib/CodeGen/TwoAddressInstructionPass.cpp
187

This makes no sense! You are starting to use the machine loop info, all you do here is mark it preserved a 2nd time; There was already addPreservedID(MachineLoopInfoID) which had the same effect.

577

Use references for things that cannot be nullptr. Similar for a few other places here.

579–580

Don't use auto when the type isn't immediately obvious just by looking at the line. (auto is unfriendly towards the readers of your code). Similar in a few more places.

585

we don't tend to have a space before the ';' here. Did you try clang-format on your patch?

621

This only gives you the declared/minimum number of operands, there may be more.

714–715

Isn't this check superfluous? I would expect this to be true anyway if the instruction is inside MBB which is checked later.

test/CodeGen/X86/twoaddr-recurrence.ll
1–2

Did you try to create a .mir test so the pass can be tested in isolation? If yes and it didn't work, could you tell us why so we can improve the .mir testing.

twoh updated this revision to Diff 99207.May 16 2017, 2:42 PM

Addressing comments from @MatzeB.

twoh added a comment.May 16 2017, 2:42 PM

@MatzeB Thank you for your comments. I'm collection compile time numbers now with spec2006. For our internal benchmark, which takes about ~40min to compile, the compile time difference was negligible.

I don't think this patch can be implemented without loop info, because this is formulated around the loop recurrence pattern. It might be possible but will eventually require more code to produce the information provided by MachineLoopInfo. I'm curious what is your biggest concern about using MachineLoopInfo in this pass, as it is used in other passes as well.

I agree on you that maintainability vs supporting more patterns is a hard trade-off. IMHO, if we want to generate better performing code, adding more code to support more patterns would be unavoidable, unless we re-architecturing register allocation related passes. I think this patch adds more lines of code because this handles the pattern based on the loop structure for the first time.

I tried .mir test, but the output is still not two-address format. I tried

./bin/llc -mtriple=x86_64-- -run-pass=twoaddressinstruction -o - ./input.mir

(input.mir is from -stop-before=twoaddressinstruction), and the output still has instructions such as

%vreg10<def,tied1> = ADD32rr %vreg1<kill,tied0>, %vreg0<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg1,%vreg0
.

lib/CodeGen/TwoAddressInstructionPass.cpp
714–715

This check is here to filter the definitions that are outside of Loop. The check below is to see if the definition inside the Loop belongs to MBB.

qcolombet edited edge metadata.May 17 2017, 8:42 AM

The problem doesn't appear while we're in SSA [...]

Why is that?
The marking of tie operand is already there, following phis shouldn't be an issue and would eliminate the whole data flow processing.

What am I missing?

Side question; if several recurrence chains share some operand, how do you pick the most profitable one?

twoh added a comment.May 17 2017, 1:25 PM

@qcolombet Actually you're right. Having tied operands, this can be done within SSA. Do you have a suggestion for a better place to implement this? I thought what isRevCopyChain does is most close to what this patch does.

I'm afraid that I couldn't quite understand your side question. This patch commutes operands of an instruction if the instruction is inside the recurrence that meets certain conditions. In line 846-850, it checks if the recurrence pattern is formulated around regC, and returns true to commute operands. If not, it checks same for regB, and returns false to not to commute if it observes the pattern. If the pattern is observed for both regB and regC, it is possible that not to commute the operands generates better result, but as the logic checks regC first, it'll commute the operands.

@MatzeB Below is a compile time difference in percentage for spec2006

400.perlbench 1.80
401.bzip2 0.66
403.gcc 0.77
429.mcf 2.61
433.milc -1.70
444.namd 0.31
445.gobmk -0.05
447.dealII -0.22
450.soplex 0.48
453.povray 1.34
456.hmmer 0.49
458.sjeng 1.30
462.libquantum 1.90
464.h264ref 0.52
470.lbm -3.01
471.omnetpp 2.50
473.astar 0.27
482.sphinx3 1.01
483.xalncbmk -0.47

wmi edited edge metadata.May 19 2017, 12:04 PM

Hi Taewook,

Thank you for working on it. The testcase you gave is interesting. My previous work in isRevCopyChain is quite limited and only handle recurrence within a BB, so I like to see a more general fix about the redundent copy problem caused by recurrence.

A problem of the patch I can see is that it uses dataflow to track the recurrence, which I think may not be enough. Think about the case below:

r3 = r4;
...
r1 = r2 + r3;
...
r5 = load(r1)
r4 = r5;

r1-->r5-->r4-->r3-->r1, it is a recurrence loop according to the algorithm, however, we don't have problem to coalesce r3 with r4 and coalesce r4 and r5, because r5 and r1 don't have to be allocated to the same register.

So the def-src constructing an recurrence chain interesting for us should only contains def and src which are in a tied operand group, or which are in the same copy. In other word, the recurrence chain can only cause redundent copy problem when all the operands on the chain have to be allocated to the same register. That kind of recurrence loop is what we are interested in.

In addition, I don't think the recurrence chain has to be a strict cycle. Just like your motivational testcase:
vreg0 = vreg13
...
vreg2 = vreg15
vreg10 = add vreg1, vreg0
vreg3 = add vreg2, vreg10
vreg13 = vreg3

Here the recurrence chain starting from vreg0 is: vreg0-->vreg13-->vreg3-->vreg2. This is not a strict cycle, however, vreg2 and vreg0 already have interference with each other. It means it is impossible to allocate all the vregs in the recurrence chain to the same physical register, and some copy has to be kept.

I think those are the two issues that have to be addressed for a more general solution to the recurrence problem.

Thanks,
Wei.

twoh added a comment.May 19 2017, 1:14 PM

@wmi Thank you for your reply. I agree on you that we should consider tied operand group, and as @qcolombet mentioned in the previous comment, if tie operand information is already available before this pass, I'd like to discuss where would be the best place to implement this.

I'm afraid I couldn't understand your second point. In the example there is a recurrence cycle of vreg0-->vreg13-->vreg3-->vreg10-->vreg0 as well, and what this patch does is commuting (vreg0, vreg1) and (vreg2, vreg10) to remove the copy. Did I miss something?

Hi Taewook,

@qcolombet Actually you're right. Having tied operands, this can be done within SSA. Do you have a suggestion for a better place to implement this? I thought what isRevCopyChain does is most close to what this patch does.

I would investigate this as part of the PeepholeOptimizer. It already massages copies.

Regarding my second question, answered it, with that part:

If the pattern is observed for both regB and regC, it is possible that not to commute the operands generates better result, but as the logic checks regC first, it'll commute the operands.

Cheers,
-Quentin

twoh updated this revision to Diff 100886.May 31 2017, 10:27 AM

Reimplement the feature in peephole optimizer. This patch makes the values in the recurrence cycle to be tied to each other if possible.

twoh updated this revision to Diff 100890.May 31 2017, 11:09 AM

Add missing reference.

twoh added a comment.May 31 2017, 11:11 AM

FYI, below is the compile difference in percentage for spec2006:

400.perlbench 0.23
401.bzip2 -0.01
403.gcc -0.01
429.mcf 3.35
433.milc 1.11
444.namd 0.27
445.gobmk 0.20
447.dealII 0.49
450.soplex 1.02
453.povray 0.55
456.hmmer 0.19
458.sjeng 1.37
462.libquantum 1.76
464.h264ref 1.01
470.lbm -0.98
471.omnetpp 1.61
473.astar 0.95
482.sphinx3 0.50
483.xalncbmk -0.89

twoh added a comment.Jun 7 2017, 2:29 PM

ping. thanks!

wmi added a comment.Jun 10 2017, 6:31 PM

Sorry for the delay. The rewrite based on SSA looks much cleaner now. About the algorithm, IIUC it tries to find loop based on define-use of tied operand or operand commutable with tied operand. However, I still have concern that the method can increase redundent copy sometimes.

Here is a testcase:

A = phi(B, O)
M = load addr1.
C = M + A C and M are tied operands.
store A to addr2.
N = load from addr3
B = N + C
B and N are tied operands.

Without the patch, we can allocate the above testcase without copy -- allocate A, B and N to physreg1 and allocate M, C to physreg2.

With the patch, after it changes C = M + A to C = A + M, a copy needs to be generated because C and A are tied operands but C's live range and A's live range are overlapped. It is impossible to allocate C and A to the same physreg without extra copy.

Actually, I think we don't have to explore candidate cycle based on operand commutable with tied operand. An algorithm in my mind is to find cycle which are only composed of tied operands, copies, and phi in loop header, then try to look at if there is any live range overlap between any two operands inside of the cycle. Only when there is live range overlap, we will consider to commute some operands.

twoh added a comment.Jun 10 2017, 9:46 PM

@wmi Not at all! Thanks for your comments.

You're right. I should've considered the case where the operands involved in the recurrence cycle live beyond the use in the recurrence cycle. You're suggestion sounds great, but my concern is that computing live range might be expensive because live interval analysis is performed after peephole optimization, What do you think of limit it to the case of operands involved in the recurrence cycle have no use outside of the cycle?

twoh updated this revision to Diff 103261.Jun 20 2017, 1:44 PM

Addressing @wmi's concern by limiting the targets to the recurrence cycles that only the last instruction of the recurrence (that feeds the PHI instruction) can have uses outside of the recurrence. This is not an ideal solution yet, and more fundamental solution (such as having recurrence optimization as a separate pass and/or using live range analysis for it) should follow. But still I think it is worth to have it here.

Test cases are updated as well to use MIR.

twoh added a comment.Jun 28 2017, 11:15 AM

Ping. Thanks!

wmi added a comment.Jun 28 2017, 3:50 PM

For the example below, findTargetRecurrence starts from r2 and r3 to search a def reg equals to r1. There are a lot of possibilities to explore. That is where the complexity of findTargetRecurrence comes from.

r1 = phi(r2, r3)
r4 = r5 + r1;
r6 = r7 + r4;
r3 = r6 + r8;

After adding the constraint to the recurrence cycle, since all the instructions other than the last one in the recurrence loop should have only one use, it will be easy to start from r1 and search forward. I guess findTargetRecurrence can be simplified a lot if the backward searching is replaced by forward searching, right?

twoh updated this revision to Diff 104696.Jun 29 2017, 11:03 AM

Addressing comments from @wmi. Thank you for the suggestion!

twoh updated this revision to Diff 104697.EditedJun 29 2017, 11:04 AM

minor fix.

wmi added inline comments.Jun 29 2017, 11:42 AM
lib/CodeGen/PeepholeOptimizer.cpp
1549–1552 ↗(On Diff #104697)

For findTargetRecurrence, RCs will at most contain one RC, right? It may be better to remove RCs, change the return value of findTargetRecurrence to bool type, and use it to indicate whether a RC is found.

twoh updated this revision to Diff 104729.Jun 29 2017, 1:31 PM

@wmi Good call! I fixed the code per your suggestion. Thanks!

wmi accepted this revision.Jun 29 2017, 2:06 PM

LGTM.

This revision is now accepted and ready to land.Jun 29 2017, 2:06 PM
This revision was automatically updated to reflect the committed changes.