This is an archive of the discontinued LLVM Phabricator instance.

MachineCopyPropagation: remove some more copys when they are not needed
Needs ReviewPublic

Authored by jonpa on May 23 2016, 11:11 AM.

Details

Reviewers
MatzeB
Summary

I was experimenting with the Loop Unroller, and found a benchmark that regressed for no obvious reason as relating to the unrolling of loops. However, I found in a loop with four outer loops, an extra unneeded copy of a register.

It was clear that the COPY was not needed, however not so clear how it should be handled. Comparing the two compilations, the regcoalescer results were identical. It was the greedy regalloc that behaved a bit differently (there were more live intervals in the unrolled loop), for one reason or other.

I looked at MCP, and found that it was missing this very simple case. I am not sure if MCP should really be a simple pass and instead the regalloc should be fixed, but I guessed that MCP has some purpose of cleaning up after regalloc, and experimented with extending it to handle my regression. Here is my patch:

  1. If a register use is defined by a COPY, and the COPY source reg could be used directly, use it instead if it seems that the COPY may be removable then.
  2. Whenever an instruction defines (clobbers) any register, remove any COPY that is tracked in MaybeDeadCopies, just like it was done for RegMask operands.

This was needed to handle:
%R4D<def> = LGR %R12D
%R7D<def> = LGR %R9D<kill>
%R1D<def> = SLLG %R14D, %noreg, 2
%R0H<def> = LFH %R4D<kill>, 0, %R1D<kill>; mem:LD4[%arrayidx35.i.i](tbaa=!6)
%R1D<def> = LGR %R5D
%R4D<def> = LGR %R3D
The COPY of R12D to R4D is unnecessary. R4D could be replaced with R12D in the LFH instruction (first point above). The COPY itself can be removed when the last instruction redefines R4D (point two above).

Why is this pass so simple? It is not trusting live-in lists, and not recomputing liveness -- is this something "todo", or? My first thought was that liveness could be checked with a reaching phys defs algorithm (see http://reviews.llvm.org/D17257), and basically the pass could be doing a bit more work then. After working with this benchmark, it became clear that even a single COPY could make a noticable difference (SPEC/bzip2).

Greatful for any feedback on this!

Diff Detail

Event Timeline

jonpa updated this revision to Diff 58116.May 23 2016, 11:11 AM
jonpa retitled this revision from to MachineCopyPropagation: remove some more copys when they are not needed.
jonpa updated this object.
jonpa added a reviewer: MatzeB.
jonpa added a subscriber: llvm-commits.
junbuml added inline comments.
lib/CodeGen/MachineCopyPropagation.cpp
275

insted -> instead

277

You may intended this continue in outside of this for loop ?

282

I doubt if you can simply replace MO with PrevCopySrc whenever the register class is matched. What if either MO or PrevCopySrc is reserved ?

331–334

Cann't you use MI->definesRegister() here instead of tracking defs through Clobbers.

MatzeB edited edge metadata.May 23 2016, 5:15 PM

I haven't looked at the change yet, but this is how I see MCPs role:

Why is this pass so simple? It is not trusting live-in lists, and not recomputing liveness -- is this something "todo", or? My first thought was that liveness could be checked with a reaching phys defs algorithm (see http://reviews.llvm.org/D17257), and basically the pass could be doing a bit more work then. After working with this benchmark, it became clear that even a single COPY could make a noticable difference (SPEC/bzip2).

The MachineCopyPropagation started as an ad-hoc fix for some cases that the register allocator did not handle. I assume we simply never tried to use something more sophisticated. While we probably would have no problem with a more sophisticated MCP pass the question we really should answer first is why all the other mechanisms failed to eliminate the copy:
There is the aggressive and extensive RegisterCoalescer pass, the PeepholeOptimizer removes cross register bank copies and the allocator has hinting mechanism for the few cases we cannot remove in the RegisterCoalescer. So before extending the MCP pass we should try to understand why you ended up with that copy. My first guess would be that the RegisterCoalescer should remove the COPY from your example (though it's hard to judge from the limited context).

jonpa added a comment.May 24 2016, 7:52 AM

Thanks for feedback and review Matthias and Jun!

I have done a little further investigation on this, and it seems that

  • If MCP would trust the live-in lists, about 40% more COPYs would be removed, or ~2 COPYs per file.
  • This patch is nowhere near that. It fixed my particular problem, but little more (only one change per every 15 files or so).

Matthias: at least in this case, the code seemed identical after regcoalescer, but it was some merging of live intervals by the GreedyRegAlloc that somehow kept the copy propagation from happening.

I see an effective Copy propagation pass as at least a first step towards a final solution, since

  • Regalloc passes are complex and difficult to improve. RAGreedy gets rid of nearly all the COPYs, but not all. This last part might be tricky to get in place.
  • By turning off such a CopyProp pass, it would be easy to get many test cases for improving regalloc copy propagation. Eventually, perhaps regalloc could do it all.
  • Copy propagation is simple and standard on its own. Even one single copy in a tight loop could make a noticable difference. This makes it important for stability during development of other optimisations, so as to not get a regression because of this.

So instead of continuing with this current patch, I wonder if it wouldn't be even better to try to use the ReachingPhysDefs utility and try a more aggressive algorithm with global liveness analysis.

What are your thoughts on this?

jonpa updated this revision to Diff 58596.May 26 2016, 6:16 AM
jonpa edited edge metadata.

Patch updated according to review, and also added handling of a COPY's second implit-def with ClobberRegister().

Hi Jonas,

Couple of comment:

If a register use is defined by a COPY, and the COPY source reg could be used directly, use it instead if it seems that the COPY may be removable then.

That is not generally safe. In particular, if the definition is required by the ABI, you cannot eliminate that copy.

What are your thoughts on this?

Improving MachineCopyPropagation to handle more cases seem reasonable to me.
For more complex cases, even if I believe the ReachingPhysDefs would help (BTW, I think it needs you to make some changes for faster convergence), it sounds more like we should improve our regalloc heuristic to not have to deal with those cases in the first place.

Cheers,
-Quentin

jonpa added a comment.Jun 7 2016, 8:11 AM

If a register use is defined by a COPY, and the COPY source reg could be used directly, use it instead if it seems that the COPY may be removable then.

That is not generally safe. In particular, if the definition is required by the ABI, you cannot eliminate that copy.

What are the exact checks that need to be done? The only one I know of is the involvement of a reserved register.

Hi Jonas,

What are the exact checks that need to be done?

That’s the thing, you can’t really know at this stage.
An approximation would be to do this transformation for explicit uses only. But I would rather stay out of that game to be safe.

Right now, this pass is about removing no-op copies. It does not try to do recoloring, which is what your patch aim for.

If we want to go for recoloring, we need to handle that before the virtual registers are rewritten, because at this point, we can assume that all present physical registers are ABI constraints.

Cheers,
-Quentin

Hi Jonas,

I just realized that I was working on pretty much the same thing (d'oh). I just didn't realize at first as I had a different motivation: When I change:

%x = COPY %y
  use %x

to

%x = COPY %y
  use %y

then both instructions can be scheduled independently by the post-ra scheduler.

As Quentin already mentioned it is indeed sketchy to rename physregs as it is not clear whether there are additional architectural constraints forcing the use of a specific register. While ABI constraints should not be touched (because you MCInstrDesc doesn't specify a register class for those implicit def and use operands), this still doesn't universally work.

In my experiments the renaming was fine on aarch64, but AMDGPU for example breaks after renaming because apparently some physregs cannot be used multiple times on the same instruction.
A fact that is not visible to the MachineCopyPropagation pass. I "solved" this by requiring the target to explicitely opt-in to the behaviour for now.

I'll upload my version shortly for reference after incorporating some of the things you did here.

Comparing with http://reviews.llvm.org/D21455 I realized two details:

lib/CodeGen/MachineCopyPropagation.cpp
234–236

This looks strange. In theory we can see any number of additional implicit operands on a COPY. I assume we were just to not hit any issues because in todays practice we only add implicit defs and uses of the superregister when a subregister was used (see code in VirtRegRewriter).

Do you actually need this bugfix as part of this patch? As far as I can see COPY instructions are not renamed here so we should not run into any new problems here.

(It would still be good to upstream this fix in a separate commit through and use a loop so we also catch the case with 4 operands when we had a subregister def and use at the COPY).

284–291

I found that there are examples where you can only rename some of the operands (the reg may be used multiple times, but on of the inputs is tied to an output and cannot be renamed). It is semantically questionable to rename only parts of the operands (and you will not get any scheduling benefits or dead copy removal opportunities anyway in this case).

jonpa added inline comments.Jun 20 2016, 1:12 AM
lib/CodeGen/MachineCopyPropagation.cpp
234–236

I thought that it was kind of a bug to not ignore an implicit-def operand regardless of anything else.

As I recall, it was also necessary to handle during my experiments, either because the MachineVerifier complained that a register was used without a def (if that COPY was removed), or because I wanted to avoid bothering with subregisters all together (changing just one subregister is not a good idea). Not sure exactly which of these I ran into.