Fusing these two pseudo instructions together in the scheduler forces them to try and be closer together in the final assembly. This can help, but doesn't on it's own solve the problems of register allocation going awry and spilling lr in the loop.
Details
Diff Detail
Event Timeline
Seems like a good idea to me, I can't see this causing any harm. but do you have an example of where it enables LOBs?
Not where it's helping LOB's, exactly. It can help keep the live range of cpsr shorter, allowing us to use more t1 instructions, which can help a little. I've not seen it improve the issue with spilling LR between the t2LoopDec and t2LoopEnd. I think the problem there might be that there is only really a single place for the value to spill at, between those two instructions. Depending on how we try to fix that, this might be useful in it's own right or not.
Seems like a good idea to me too. A question on the "fusion" inlined.
llvm/lib/Target/ARM/ARMSubtarget.h | ||
---|---|---|
708 | Do we need to add LOB here, i.e. are we using that? |
llvm/lib/Target/ARM/ARMSubtarget.h | ||
---|---|---|
708 | I think the idea is that we only call createARMMacroFusionDAGMutation if we have instructions to fuse, and we would not have ARM::t2LoopEnd if we did not have low overhead branches. If you mean "should we add a hasFuseLOB subtarget feature for this" - I'm not sure. It will depend on how we end up changing the instructions. If we do end up using this method to force t2LoopDec and t2LoopEnd close together in order to reduce the live range of a CPSR, then we should probably do that for all subtargets. But a t2LoopDec and a t2LoopEnd are really modelling a single instruction, and maybe it makes sense to just make them into a single thing? That would obviously make this patch unnecessary. |
llvm/lib/Target/ARM/ARMSubtarget.h | ||
---|---|---|
708 | I was really starting to think the same:
What we do here in this patch starts to feel like a convoluted approach, and indeed, since we are modelling a single instruction, shouldn't we go for that then? Or would that have disadvantages? |
llvm/lib/Target/ARM/ARMSubtarget.h | ||
---|---|---|
708 | My understanding is that it is a terminator that produces a value, and that does not work very well. If you need to spill the value, you would have not place to spill it. I may be wrong though, I'm not sure entirely why it was done in the way it was and am really just guessing. We had an older downstream implementation that worked that way I think, but always had bugs. Whether they were fixable I'm not sure. It would take some investigation. |
My understanding is that it is a terminator that produces a value, and that does not work very well. If you need to spill the value, you would have not place to spill it.
This.
It also could also have the benefit of other instructions using the value in LR too. Maybe it makes reverting easier too, but I can't even remember now... Either way, I think trying to fuse the operations is quite a bit less invasive than trying to re-implement how we do these loops :) and if it can give us some little improvements, let's do it.
The last time me and Sjoerd talked about it, we figured this wouldn't fix the issue exactly (it only fuses them during scheduling, you can still spill between the t2LoopDec and the End), and as we need a proper solution anyway this might not end up being useful on it's own.
But I will run some tests again and see if this is useful anyway. If it is it may be worth getting in.
Do we need to add LOB here, i.e. are we using that?
If so, is "fusing" the right thing? Is the use of LOB here the same as e.g. "fusing AES" instructions?