The loops are run exactly once per lane, so VGPRs do not need to be
saved. Mark waterfall loops with SI_WATERFALL_LOOP and use the
SIOptimizeVGPRLiveRange pass to add phi nodes that take undef when
coming from the loop.
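
The "exactly once per lane" property can be illustrated with a small simulation. This is only a simplified model of how a waterfall loop behaves, not the code the backend emits. The function name waterfall_loop and its parameters are hypothetical and exist only for this sketch.

```python
def waterfall_loop(exec_mask, vgpr_values, body):
    """Simulate a waterfall loop over the lanes of one wave.

    exec_mask:   list of bools, the lanes active on loop entry.
    vgpr_values: per-lane value of the divergent operand.
    body:        called once per iteration with (uniform_value, lanes).
    Returns how many times each lane executed the body.
    """
    remaining = list(exec_mask)
    runs = [0] * len(exec_mask)
    while any(remaining):
        # Model of v_readfirstlane: read the value of the first remaining lane.
        first = remaining.index(True)
        uniform = vgpr_values[first]
        # Select every remaining lane that holds the same value.
        matching = [i for i, active in enumerate(remaining)
                    if active and vgpr_values[i] == uniform]
        body(uniform, matching)
        for i in matching:
            runs[i] += 1
            remaining[i] = False  # handled lanes leave the loop
    return runs
```

In this model each lane that was active on entry runs the body in exactly one iteration and is then masked off. A value a lane holds in a VGPR therefore cannot be clobbered by a later iteration before that lane has used it, which is the reasoning behind not saving VGPRs.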
There are still a few shortcomings:
- Return values from a function call in the loop are copied because their live ranges conflict with the live ranges of the arguments, even if the arguments are only IMPLICIT_DEF after the phi insertion.
- If a VGPR argument is used after the loop, the register is still copied unnecessarily inside the loop (see @test_indirect_call_vgpr_ptr_arg_and_reuse in indirect-call.ll).
Why can't the source operand of v_readfirstlane be optimized?