Although this was something that I was hoping we would not have to do, this patch makes t2DoLoopStartTP a terminator in order to keep it at the end of it's block, so not allowing extra MVE instruction between it and the end. With t2DoLoopStartTP's also starting tail predication regions, it also marks them as having side effects. The t2DoLoopStart is still not a terminator, giving it the extra scheduling freedom that can be helpful, but now that we have a TP version they can be treated differently.
Details
Diff Detail
Event Timeline
Although this was something that I was hoping we would not have to do,
Can you quickly remind me about the different options/trade-offs here? Just as a refresher for me how everything fits together.
Can you quickly remind me about the different options/trade-offs here? Just as a refresher for me how everything fits together.
It's just a little sub-optimal. It's better to let the scheduler place the instructions where they would be best, not artificially force them to the end of the block. The latency of setting LR into the next instruction that uses it can be felt at times, and on some CPU's you can get cases where the loop will be aligned with a NOP, meaning it's better to put a T1 instruction at the end of the block to allow more dual-issue.
They are mostly, hopefully fairly minor though. And the benefits of forcing the dlstp instruction to the end of the block to prevent reverted tail predicated loops is at least better in the short term, even if we change it back later.
Ah yes, thanks, got it. Just one more high-level follow up question then. Is this something we want to perhaps control with an option?
Ah yes, thanks, got it. Just one more high-level follow up question then. Is this something we want to perhaps control with an option?
We have an option for merging dec and end, and will currently only do this transform if that succeeded. I could add an option for controlling creating t2DoLoopEndDec too if it sounds useful, but we do already at least have an off switch.