This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Macro fuse t2LoopDec and t2LoopEnd
AbandonedPublic

Authored by dmgreen on May 11 2020, 11:53 PM.

Details

Summary

Fusing these two pseudo instructions together in the scheduler forces them to try and be closer together in the final assembly. This can help, but doesn't on it's own solve the problems of register allocation going awry and spilling lr in the loop.

Diff Detail

Event Timeline

dmgreen created this revision.May 11 2020, 11:53 PM

Seems like a good idea to me, I can't see this causing any harm. but do you have an example of where it enables LOBs?

Seems like a good idea to me, I can't see this causing any harm. but do you have an example of where it enables LOBs?

Not where it's helping LOB's, exactly. It can help keep the live range of cpsr shorter, allowing us to use more t1 instructions, which can help a little. I've not seen it improve the issue with spilling LR between the t2LoopDec and t2LoopEnd. I think the problem there might be that there is only really a single place for the value to spill at, between those two instructions. Depending on how we try to fix that, this might be useful in it's own right or not.

Seems like a good idea to me too. A question on the "fusion" inlined.

llvm/lib/Target/ARM/ARMSubtarget.h
708

Do we need to add LOB here, i.e. are we using that?
If so, is "fusing" the right thing? Is the use of LOB here the same as e.g. "fusing AES" instructions?

dmgreen marked an inline comment as done.Jul 21 2020, 3:08 AM
dmgreen added inline comments.
llvm/lib/Target/ARM/ARMSubtarget.h
708

I think the idea is that we only call createARMMacroFusionDAGMutation if we have instructions to fuse, and we would not have ARM::t2LoopEnd if we did not have low overhead branches.

If you mean "should we add a hasFuseLOB subtarget feature for this" - I'm not sure. It will depend on how we end up changing the instructions. If we do end up using this method to force t2LoopDec and t2LoopEnd close together in order to reduce the live range of a CPSR, then we should probably do that for all subtargets.

But a t2LoopDec and a t2LoopEnd are really modelling a single instruction, and maybe it makes sense to just make them into a single thing? That would obviously make this patch unnecessary.

SjoerdMeijer added inline comments.Jul 21 2020, 3:14 AM
llvm/lib/Target/ARM/ARMSubtarget.h
708

I was really starting to think the same:

But a t2LoopDec and a t2LoopEnd are really modelling a single instruction, and maybe it makes sense to just make them into a single thing?

What we do here in this patch starts to feel like a convoluted approach, and indeed, since we are modelling a single instruction, shouldn't we go for that then? Or would that have disadvantages?

dmgreen marked an inline comment as done.Jul 21 2020, 3:33 AM
dmgreen added inline comments.
llvm/lib/Target/ARM/ARMSubtarget.h
708

My understanding is that it is a terminator that produces a value, and that does not work very well. If you need to spill the value, you would have not place to spill it.

I may be wrong though, I'm not sure entirely why it was done in the way it was and am really just guessing. We had an older downstream implementation that worked that way I think, but always had bugs. Whether they were fixable I'm not sure. It would take some investigation.

samparker accepted this revision.Aug 3 2020, 3:48 AM

My understanding is that it is a terminator that produces a value, and that does not work very well. If you need to spill the value, you would have not place to spill it.

This.
It also could also have the benefit of other instructions using the value in LR too. Maybe it makes reverting easier too, but I can't even remember now... Either way, I think trying to fuse the operations is quite a bit less invasive than trying to re-implement how we do these loops :) and if it can give us some little improvements, let's do it.

This revision is now accepted and ready to land.Aug 3 2020, 3:48 AM

The last time me and Sjoerd talked about it, we figured this wouldn't fix the issue exactly (it only fuses them during scheduling, you can still spill between the t2LoopDec and the End), and as we need a proper solution anyway this might not end up being useful on it's own.

But I will run some tests again and see if this is useful anyway. If it is it may be worth getting in.

dmgreen abandoned this revision.Dec 11 2020, 11:29 AM

We went a different way with this, creating a new combined instruction instead.