This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Exclude LR from register classes in low overhead loops
Needs ReviewPublic

Authored by dmgreen on May 12 2020, 12:11 AM.

Details

Summary

In a low overhead loop, LR should ideally be used exclusively for the loop count, and not spilled and reloaded in the loop. This attempts to enforce that more directly by adjusting the register classes of registers used or def'd in the loop so that they do not include LR. This especially helps to prevent the live range of LR from being split between t2LoopDec and t2LoopEnd, meaning we revert the loop less often (and don't end up with movs at the same time!)

It does mean that we have one register fewer, which can mean we end up spilling other registers more. On average this should be an improvement though.
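
A minimal sketch of that idea (not the actual diff; getEquivalentClassWithoutLR is a hypothetical stand-in for the explicit per-class handling in the patch):

  // Walk the machine instructions inside the low overhead loop and
  // constrain every virtual register they use or define to an equivalent
  // class without LR, leaving LR free for the loop count.
  #include "llvm/CodeGen/MachineInstr.h"
  #include "llvm/CodeGen/MachineLoopInfo.h"
  #include "llvm/CodeGen/MachineRegisterInfo.h"
  #include "llvm/CodeGen/TargetRegisterInfo.h"
  using namespace llvm;

  // Hypothetical mapping, e.g. GPR -> an LR-free equivalent. Returns null
  // if the class is already LR-free or is not handled; the real diff
  // spells out the concrete ARM register classes instead.
  const TargetRegisterClass *
  getEquivalentClassWithoutLR(const TargetRegisterClass *RC);

  static void excludeLRFromLoop(MachineLoop &ML, MachineRegisterInfo &MRI) {
    for (MachineBasicBlock *MBB : ML.blocks())
      for (MachineInstr &MI : *MBB)
        for (MachineOperand &MO : MI.operands()) {
          if (!MO.isReg() || !MO.getReg().isVirtual())
            continue;
          const TargetRegisterClass *RC = MRI.getRegClass(MO.getReg());
          const TargetRegisterClass *NewRC = getEquivalentClassWithoutLR(RC);
          if (NewRC)
            MRI.constrainRegClass(MO.getReg(), NewRC);
        }
  }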

Diff Detail

Event Timeline

dmgreen created this revision. May 12 2020, 12:11 AM

The added register pressure is a worry, but I don't see that there's much we can do about it, and I think it's a risk worth taking. Is it possible to add some MIR tests for the individual register classes?

llvm/lib/Target/ARM/ARMRegisterTypePass.cpp
101

I think adding GPRwithZR and GPRwithZRnosp makes sense as well, especially since your test uses csinc.

This seems like an extremely big hammer to use just to ensure that the interval isn't split between the decrement and the end of the loop. In particular, trying to list out all the possible register classes seems tricky.

I'd recommend investigating other approaches. Maybe you could use an instruction bundle to force the instructions to stay adjacent?
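
For reference, a rough sketch of what that bundling alternative could look like (untested; it assumes Dec and End are iterators to the t2LoopDec and t2LoopEnd in the same block):

  #include <iterator>
  #include "llvm/CodeGen/MachineBasicBlock.h"
  #include "llvm/CodeGen/MachineInstrBundle.h"
  using namespace llvm;

  // Bundle the two instructions so the register allocator cannot insert
  // spill/reload code between them. finalizeBundle takes a half-open
  // range, so pass the iterator one past t2LoopEnd.
  static void bundleLoopDecEnd(MachineBasicBlock &MBB,
                               MachineBasicBlock::instr_iterator Dec,
                               MachineBasicBlock::instr_iterator End) {
    finalizeBundle(MBB, Dec, std::next(End));
  }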

I thought this one might get some pushback. I contemplated starting the summary with "Bear with me..."

But it's not really just about stopping LR from spilling between the t2LoopDec and the t2LoopEnd, although that can cause even larger problems than the ones we are seeing.
If a low overhead loop sets LR to anything else inside the loop (calls, spills, etc.), then the branch info is cleared and the next time you hit the LE it's like the first time again. Architecturally the loop info is created when you execute an LE and cleared when you otherwise write to LR. So by using LR for anything else you end up losing all the benefits of a low overhead loop, and there are times where apparently this can be even worse than not using a low overhead loop at all.
So it might deserve a big hammer if it needs one.

But having said that, this doesn't entirely solve my problem, and there are still times when we can hit poor register allocation. I think I would also like to change t2DoLoopStart to def a GPRlr and use an rGPR reg, more like a real t2DLS does (roughly sketched below). That might need quite a few changes in the backend pass, from what I've seen, but it seems more "glued together".
Even with that I think there might be times when we end up spilling LR at the wrong place, but it becomes a lot rarer.
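
To make that t2DoLoopStart direction a little more concrete, something along these lines (not part of this patch; it assumes a GPRlr register class containing only LR, as mentioned above, and a trip count already held in an rGPR vreg):

  #include "ARMBaseInstrInfo.h"
  #include "llvm/CodeGen/MachineInstrBuilder.h"
  #include "llvm/CodeGen/MachineRegisterInfo.h"
  using namespace llvm;

  // Def the loop count in a GPRlr-constrained vreg so the allocator has to
  // place it in LR from the start, mirroring what a real t2DLS requires,
  // while the incoming trip count stays in an ordinary rGPR vreg.
  static Register emitDoLoopStart(MachineBasicBlock &Preheader,
                                  MachineBasicBlock::iterator InsertPt,
                                  const DebugLoc &DL, MachineRegisterInfo &MRI,
                                  const TargetInstrInfo *TII,
                                  Register TripCount) {
    Register CountReg = MRI.createVirtualRegister(&ARM::GPRlrRegClass);
    BuildMI(Preheader, InsertPt, DL, TII->get(ARM::t2DoLoopStart), CountReg)
        .addReg(TripCount);
    return CountReg;
  }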

In particular, trying to list out all the possible register classes seems tricky.

Are you worried about more than the 4 + 2 extra that are here? This, with an assert in the NDEBUG block, managed to pass all the benchmarks I threw it at plus all the llvm tests. Do you worry it will need a lot of others?
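
For what that debug-only check looks like in spirit (illustrative only; it reuses the hypothetical getEquivalentClassWithoutLR helper from the sketch in the summary and would sit right after its lookup):

  #include "llvm/Support/ErrorHandling.h"

  // Safety net for classes the mapping misses: if a class still containing
  // LR is not handled, fail loudly in debug builds instead of silently
  // leaving LR allocatable inside the loop.
  #ifndef NDEBUG
  if (!NewRC && RC->contains(ARM::LR))
    llvm_unreachable("unhandled register class containing LR in low overhead loop");
  #endif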

dmgreen updated this revision to Diff 263524. May 12 2020, 2:56 PM

Added GPRwithZR and GPRwithZRnosp

Architecturally the loop info is created when you execute an LE and cleared when you otherwise write to LR.

Oh, hmm. Maybe we do need something drastic, then. I'm a little worried about cases where it isn't profitable to emit a low-overhead loop due to the increased register pressure, but maybe you can control that with other heuristics.

Are you worried about more than the 4 + 2 extra that are here? This, with an assert in the NDEBUG block, managed to pass all the benchmarks I threw it at plus all the llvm tests. Do you worry it will need a lot of others?

It's not really the raw number I'm concerned about; more that it's hard to enumerate them, and anyone adding a new register class likely won't be aware of the need to update this code. Also, some register classes only show up in obscure circumstances. For example, hGPR only shows up with inline asm, and GPRwithAPSR_NZCVnosp only shows up with CDE.

On a side-note, you probably want to be careful about restricting register classes if inline asm is involved; you might force the allocator to run out of registers.

Architecturally the loop info is created when you execute an LE and cleared when you otherwise write to LR.

Erm, reference for that? Because I certainly don't remember it.

Erm, reference for that? Because I certainly don't remember it.

Oh. Maybe I was misremembering that. I might have mixed up where the branch info was cleared.

I thought some of the improvements I was seeing from this were related to it; they were larger than I would expect otherwise. Perhaps something else is going on there, though.