This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Optimize VGPR LiveRange in waterfall loops
ClosedPublic

Authored by sebastian-ne on Jun 30 2021, 7:05 AM.

Details

Summary

The loops are run exactly once per lane, so VGPRs do not need to be
saved. Mark waterfall loops with SI_WATERFALL_LOOP and use the
SIOptimizeVGPRLiveRange pass to add phi nodes that take undef when
coming from the loop.

There are still a few shortcomings, that is

  1. Return values from a function call in the loop are copied because their live range conflicts with the live range of arguments, even if arguments are only IMPLICIT_DEF after the phi insertion.
  2. If a VGPR argument is used after the loop, the register is still copied unnecessarily inside the loop (see @test_indirect_call_vgpr_ptr_arg_and_reuse in indirect-call.ll).

Diff Detail

Event Timeline

sebastian-ne created this revision.Jun 30 2021, 7:05 AM
sebastian-ne requested review of this revision.Jun 30 2021, 7:06 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 30 2021, 7:06 AM
foad added inline comments.Jun 30 2021, 7:20 AM
llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp
527

Can you use something like for (auto &I : make_early_inc_range(MRI->use_operands(Reg)))?

sebastian-ne marked an inline comment as done.

Can you use something like for (auto &I : make_early_inc_range(MRI->use_operands(Reg)))?

Thanks, done.

foad added inline comments.Jun 30 2021, 7:36 AM
llvm/test/CodeGen/AMDGPU/indirect-call.ll
596

Maybe precommit this new test (no review required)?

sebastian-ne added inline comments.Jun 30 2021, 7:44 AM
llvm/test/CodeGen/AMDGPU/indirect-call.ll
596

I just tried it, there is no difference for this test with or without this patch.
So, not sure if it’s worth adding, but it shows a (still) suboptimal case.

ruiling added inline comments.Jul 1 2021, 12:17 AM
llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp
331

why the source operand of v_readfirstlane cannot be optimized?

537

As you just replace the uses of OldReg within the waterfall loop, the OldReg may still be possibly alive in the waterfall loop. Consider the cases:

  1. OldReg defined outside of a outer_loop that encloses waterfall loop.
def(OldReg)

outer_loop:
  use(OldReg)   // a use here require us to keep the OldReg alive through the whole range of outer_loop.
  br waterfall_loop

waterfall_loop:
  ...
  brcond waterfall_loop, outer_loop_end_bb

outer_loop_end_bb:
  ...
  brcond outer_loop
  1. the OldReg is still used after the waterfall loop:
  def(OldReg)
  ...

waterfall_loop:
  ...
  brcond waterfall_loop

after_waterfall_loop_bb:
  use(OldReg)

In the above two listed cases, the OldReg should still be alive through the waterfall loop.
I think we cannot reuse the physical register of OldReg in the waterfall loop. Or did I miss something?

Do not exclude readfirstlane registers from being optimized.

llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp
331

You’re right, it can be optimized. I confused myself when looking at the test cases and at the readfirstlane, but all lanes that could have changed that physical register are not active anymore in the next iteration, so the readfirstlane will still return the original value.

537

Case 2 is what @test_indirect_call_vgpr_ptr_arg_and_reuse tests. The generated assembly doesn’t change with or without this patch, because the RegisterCoalescer merges the OldReg and NewReg.
(And I think it can only get better if they are not merged; then the COPY inside the loop should be omitted like for the cases where OldReg is unused after the waterfall loop.)

ruiling added inline comments.Jul 5 2021, 6:23 PM
llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
422 ↗(On Diff #355804)

You can move the SI_WATERFALL_LOOP changes into a separate patch.

llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp
517

the RegState::Kill is unnecessary. we don't need to track Undef register.

532–534

should be:

LiveVariables::VarInfo &NewVarInfo = LV->getVarInfo(NewReg);
LiveVariables::VarInfo &OldVarInfo = LV->getVarInfo(Reg);

the getVarInfo(NewReg) may trigger memory reallocation of VirtRegInfo which invalidate OldVarInfo. You can extract the test_indirect_call_vgpr_ptr_arg_and_reuse into a separate file and run the llc command to reproduce the issue.
The UndefVarInfo is not needed.

537

Case 2 is what @test_indirect_call_vgpr_ptr_arg_and_reuse tests. The generated assembly doesn’t change with or without this patch, because the RegisterCoalescer merges the OldReg and NewReg.
(And I think it can only get better if they are not merged; then the COPY inside the loop should be omitted like for the cases where OldReg is unused after the waterfall loop.)

I think the remaining COPY inside the loop cannot be solved in this pass, we need to solve it using some other way at later steps. You can add a TODO in the test saying that the copy inside the waterfall loop could be removed.

538–552

I think NewVarInfo.AliveBlocks.set(Loop->getNumber()); is incorrect. the NewReg is not alive through the whole range of the waterfall loop, it is defined at the PHI instruction and Killed at the last use in the waterfall loop.
OldVarInfo.AliveBlocks.reset(Loop->getNumber()); only works for cases excluding Case 1 and Case 2. For these two cases I listed before, there should not be "liveness hole" after the transformation.
I would suggest you not to do this optimization if you find the register def-use match these two cases as currently it does not bring any benefit, you can check if the register isLivein the successor of waterfall-loop-block.
I am sorry I disabled the verifier after this pass for some other issue. you can enable it to help you catch wrong LiveVariable info.

542

Did you see such case happen? generally if a value is defined before entering a loop, the use inside the loop will not be treated as kill because later loop iterations still need to access the value. I guess the code is just useless?

sebastian-ne marked 4 inline comments as done.

Fix review comments.

llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp
537

I think the remaining COPY inside the loop cannot be solved in this pass, we need to solve it using some other way at later steps. You can add a TODO in the test saying that the copy inside the waterfall loop could be removed.

I agree.
I misunderstood my own test, @test_indirect_call_vgpr_ptr_arg_and_reuse shouldn’t have an unnecessary COPY, I added a new one (@test_indirect_call_vgpr_ptr_arg_and_return) that has an unnecessary COPY.

ruiling added inline comments.Jul 8 2021, 7:22 AM
llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp
532

I still think we need to update the LiveVariables::VarInfo for the NewReg to make sure LiveVariables analysis being updated. We just need to update the Kills to the last use in the waterfall loop. as the waterfall loop is pretty small, I think we can just reverse iterate to find the last use. I am not sure if there is any better way?

536–542

I would like such check happens in collectWaterfallCandidateRegisters() which means if it is used after waterfall loop, don't push into candidate list. Because for such kind of situations, the optimization transformation we do here does not help as it will be coalesced into one virtual register again in register coalescer. So IMO, this is just waste of compile time for no benefit.

ruiling added inline comments.Jul 8 2021, 7:46 AM
llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp
528

nit: Can we do register replacement before inserting the phi instruction? then we don't need the check UseMI != PHI.getInstr().

532

You also need to call addRegisterKilled to mark the last use/uses of NewReg as killed.

sebastian-ne marked 4 inline comments as done.

Thanks for the review, it should be better now.
I also added a comment regarding waterfall loops at the beginning of the file.

ruiling accepted this revision.Jul 11 2021, 7:33 PM

Thanks for the work, the code logic looks good to me. Please wait one or two days in case others have more comments.

llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp
555

Several NDEBUGs in short piece of code is a little bit annoying, may be you can check the !NewVarInfo.Kills.empty().

llvm/test/CodeGen/AMDGPU/indirect-call.ll
649

thanks for the test, I see the issue, will take further look later.

This revision is now accepted and ready to land.Jul 11 2021, 7:33 PM
sebastian-ne marked an inline comment as done.

may be you can check the !NewVarInfo.Kills.empty()

Good idea!

llvm/test/CodeGen/AMDGPU/indirect-call.ll
649

I think the issue is that IMPLICIT_DEF does create a live range, but it shouldn’t. Even the language reference mentions that undefs don’t have a live range.
There are already a few workarounds in LLVM to get around this. E.g. the RegisterCoalescer keeps track of and ignores IMPLICIT_DEFs, also the undef MachineOperand flag was probably introduced because of these live ranges (commit 0dc101b897090e37e6e6eb239d96aec6d368b43f).

However, I’m not sure if removing live ranges of IMPLICIT_DEFs is the right thing to do. I guess it would mean that registers are live-in in one block, but not live-out in one of the predecessors—because they are IMPLICIT_DEF in the predecessor. And I don’t know about what implications that has.

This revision was landed with ongoing or failed builds.Jul 13 2021, 3:15 AM
This revision was automatically updated to reflect the committed changes.