This is an attempt at addressing 26223. Specifically, it tries to avoid unfortunate register spilling by placing the rematerializations introduced by rewrite-statepoints-for-gc so as to maximize folding and simplification opportunities, rather than to minimize execution frequency.
If we have a bit of code like this:
%addr = gep %o, 8
loop {
  if (poll) { safepoint(); }
  load %addr
}
We currently end up rewriting this as:
%addr = gep %o, 8
loop {
  %addr1 = phi (%addr, %addr2)
  if (poll) { safepoint(); %remat = gep %o.relocated, 8 }
  %addr2 = phi (%addr1, %remat)
  load %addr2
}
This forces us to rematerialize the address explicitly and will likely cause us to spill/fill the address if we are register constrained. The result is a chain of dependent loads (fill from stack, then load from the filled address), which shows up as hot in a couple of benchmarks.
A much better result would be:
%addr = gep %o, 8
loop {
  if (poll) { safepoint(); }
  %remat = gep %o.relocated, 8
  load %remat
}
This version allows the GEP to be folded directly into x86's native addressing modes.
(Note: For conciseness, I'm not writing the phis for relocating %o, assume they're all there.)
The particular heuristic chosen here is to push each given remat as late as possible. This has the effect of moving remats closer to their uses and preventing the creation of unnecessary and confusing PHI nodes. Empirically, this does appear to help in some of the benchmarks where I encountered this, but I'm getting increasingly uncomfortable with the coupling between RS4GC and CGP. In particular, a better version of this heuristic is already present in CGP.
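To make the "as late as possible" placement concrete, here is a minimal sketch on a toy IR. It is not the RS4GC implementation: the `Block` type, the `sink_remat_to_uses` helper, and the `base_at` map are all hypothetical names made up for illustration, and the sketch assumes the relocated base is already available in every block that uses the derived pointer (i.e. the phis for relocating %o exist).

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    # Hypothetical toy basic block: each instruction is (dest, op, args).
    name: str
    insts: list = field(default_factory=list)


def sink_remat_to_uses(blocks, derived, base_at, offset):
    """Insert a fresh 'gep base, offset' immediately before every use of `derived`.

    blocks  : list of Block (toy IR, not LLVM IR)
    derived : name of the derived pointer to rematerialize, e.g. "%addr"
    base_at : assumed map from block name to the relocated base live there
    offset  : constant GEP offset
    """
    counter = 0
    for block in blocks:
        new_insts = []
        for dest, op, args in block.insts:
            if derived in args:
                counter += 1
                remat = f"%remat{counter}"
                # The remat lands right next to its user, so no phi is
                # needed to merge the original and rematerialized values.
                new_insts.append((remat, "gep", (base_at[block.name], offset)))
                args = tuple(remat if a == derived else a for a in args)
            new_insts.append((dest, op, args))
        block.insts = new_insts
    return blocks


# Toy version of the loop above: a poll block followed by the load.
loop = [
    Block("poll", [(None, "safepoint", ())]),
    Block("body", [("%v", "load", ("%addr",))]),
]
sink_remat_to_uses(
    loop, "%addr", {"poll": "%o.relocated", "body": "%o.relocated"}, 8
)
for b in loop:
    print(b.name, b.insts)
```

Because the gep ends up adjacent to the load, a later pass has the chance to fold it into the load's addressing mode, which is the effect the patch is after.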
I think we should probably take this incremental step, but before going much further, factoring the code to share parts of the CGP implementation might be a good idea. The general problem is that many CGP transforms are hard to perform after RS4GC has run. It may make sense to selectively run them beforehand.