This is an archive of the discontinued LLVM Phabricator instance.

[RS4GC] Effective rematerialization at non-entry polls
AbandonedPublic

Authored by reames on Jan 21 2016, 5:10 PM.

Details

Summary

This is an attempt at addressing 26223. Specifically, try to avoid unfortunate register spilling by trying to place rematerializations introduced by rewrite-statepoints-for-gc in order to maximize folding and simplification opportunities rather than to minimize execution frequency.

If we have a bit of code like this:
%addr = gep %o, 8
loop {

if (poll) {
   safepoint();
}
load %addr

}

We currently end up rewriting this as:
%addr = gep %o, 8
loop {

%addr1 = phi (%addr, %addr2)
if (poll) {
   safepoint();
   %remat = gep %o.relocated, 8
}
%addr2 = phi (%addr1, %remat)
load %addr2

}
This ends up forcing us to rematerialize the address explicitly and likely will cause us to spill/fill the address if register constrained. This creates a bunch of dependent loads (fill from stack, load from result) which show up as hot in a couple of benchmarks.

A much better result would be:
%addr = gep %o, 8
loop {

if (poll) {
   safepoint();   
}
%remat = gep %o.relocated, 8
load %remat

}

This version allows the GEP to be folded directly into x86's native addressing modes.

(Note: For conciseness, I'm not writing the phis for relocating %o, assume they're all there.)

The particular heuristic chosen here is to push each given remat as late as possible. This has the effect of moving remats closer to uses and preventing the creation of unnecessary and confusing PHI nodes. Empirically, this does appear to help in some of the benchmarks when I encountered this, but I'm getting increasing uncomfortable with the coupling between RS4GC and CGP. In particular, a better version of this heuristic is already present in CGP.

I think we should probably take this incremental step, but before going much further, factoring the code to share parts of the implementation of CGP might be a good idea. The generally problem is that many CGP transforms are hard to perform after RS4GC has run. It may make sense to selectively run them before hand.

Diff Detail

Event Timeline

reames updated this revision to Diff 45619.Jan 21 2016, 5:10 PM
reames retitled this revision from to [RS4GC] Effective rematerialization at non-entry polls.
reames updated this object.
reames added a subscriber: llvm-commits.

For those curious, the analogous CGP logic is in: CodeGenPrepare::optimizeMemoryInst

mjacob edited edge metadata.Jan 22 2016, 5:05 PM
mjacob added a subscriber: mjacob.
This comment was removed by mjacob.

I think I've found a better fix for this issue. I'm working on a patch for CGP which lets it cleanup the phi cycles introduced by the simple rematerialization. Assuming that works out, I'll abandon this patch.

sanjoy resigned from this revision.Jan 28 2016, 5:19 PM
sanjoy removed a reviewer: sanjoy.

Resigning as reviewer to move this off my worklist.

@reames if you want to resume working on this patch and would like me to help review, please add me back.

reames abandoned this revision.Feb 12 2016, 12:05 PM

I found a better way to approach this. Once I get back to this, I'm going to just fix CGP rather than working around the problem here. Given that, abandoning revision.