This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Enable whole wave register copy
ClosedPublic

Authored by cdevadas on Feb 10 2023, 9:51 AM.

Details

Summary

So far, we haven't exposed the allocation of whole-wave
registers to regalloc. We hand-picked them for various
whole wave mode operations. With a future patch, we
want the allocator to efficiently allocate them rather
than using the custom pre-allocation pass.

Any liverange split of virtual registers involved in
whole-wave operations require the resulting COPY
introduced with the split to be performed for all
lanes. It isn't implemented in the compiler yet.

This patch would identify all such copies and
manipulate the exec mask around them to enable all
lanes without affecting the value of exec mask
elsewhere.

Diff Detail

Event Timeline

cdevadas created this revision.Feb 10 2023, 9:51 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 10 2023, 9:51 AM
cdevadas requested review of this revision.Feb 10 2023, 9:51 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 10 2023, 9:51 AM
yassingh updated this revision to Diff 523320.EditedMay 18 2023, 3:48 AM
yassingh added a subscriber: yassingh.
  • Rebase
  • Incorporated the downstream code
yassingh updated this revision to Diff 523325.May 18 2023, 4:04 AM

fix file comment

yassingh updated this revision to Diff 528812.Jun 6 2023, 5:36 AM

Rebase and merge D150390 ([AMDGPU] Introduce and use the new PRED_COPY opcode) into this.

yassingh added inline comments.Jun 6 2023, 5:53 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
1939–1941

I wanted to keep this in SILowerPredicatedCopy pass but it wasn't working there, it needed virtregRewriter to be run before it.

yassingh updated this revision to Diff 532856.Jun 20 2023, 3:32 AM
  • Revert back to PRED_COPY to COPY lowering mechanism
  • Rebase
yassingh updated this revision to Diff 533295.Jun 21 2023, 9:24 AM
  • Review comments (use getWaveMaskRegClass(), getExec() functions)
  • Review comments (use getWaveMaskRegClass(), getExec() functions)

Accidentally updated this one instead of D143759. Not reverting as the extra diff will go away once it is rebased.

arsenm added inline comments.Jun 21 2023, 9:34 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
8002–8005

Don't need to query TRI, it's already the RI member. You can also just do RI.isSGPRReg.

Also, you're not checking the flags? The point was to check the flags and switch to the WWM copy. You don't care about SGPR or VGPR, you care about WWM VGPR or not

yassingh updated this revision to Diff 533516.Jun 22 2023, 2:23 AM
  • Check vreg flag in getLiveRangeSplitOpcode
arsenm added inline comments.Jun 22 2023, 10:44 AM
llvm/lib/CodeGen/SplitKit.cpp
567

You're missing the new opcode here. buildSingleSubRegCopy should gain a new MCInstrDesc & parameter for the opcode to use

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2419

Regular copy shouldn't reach here?

8002

SrcReg.isVirtual should be implied

llvm/lib/Target/AMDGPU/SILowerPredicatedCopies.cpp
140 ↗(On Diff #533516)

Can you early return false if the WWM reg set is empty?

143 ↗(On Diff #533516)

continue and reduce indentation

144 ↗(On Diff #533516)

Should add a TODO to try to reduce the saveexec/restore exec pairs for adjacent WWM ops

159 ↗(On Diff #533516)

Do these need to gain an implicit exec use?

arsenm requested changes to this revision.Jun 22 2023, 10:44 AM
This revision now requires changes to proceed.Jun 22 2023, 10:44 AM
yassingh updated this revision to Diff 533912.Jun 23 2023, 3:19 AM
  • review comments
yassingh added inline comments.Jun 23 2023, 3:22 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2419

Yes it won't, will remove.

8002

Added an assert.

llvm/lib/Target/AMDGPU/SILowerPredicatedCopies.cpp
140 ↗(On Diff #533516)

Done, had to add a new helper function.

159 ↗(On Diff #533516)

Do you mean we should add the implicit exec? In that case, SIFixVGPRCopies will take care?

arsenm added inline comments.Jun 23 2023, 6:12 AM
llvm/lib/Target/AMDGPU/SILowerPredicatedCopies.cpp
159 ↗(On Diff #533516)

This isn't something that can be taken care of later. SIFixVGPRCopies is a horribly broken hack, the less we depend on it the better

159 ↗(On Diff #533516)

Actually these split copies might have been the original problem which caused it to be added. Maybe we have a way to drop it now?

arsenm added inline comments.Jun 23 2023, 6:14 AM
llvm/lib/Target/AMDGPU/SILowerPredicatedCopies.cpp
159 ↗(On Diff #533516)

That was the reasoning given in D28874 when it was added. How about as a next step we make sure all the VGPR splits end up with exec reads

yassingh added inline comments.Jun 23 2023, 7:52 AM
llvm/lib/Target/AMDGPU/SILowerPredicatedCopies.cpp
159 ↗(On Diff #533516)

Will look into it, don't have sufficient expertise to answer this right now. Adding @AMDGPU.

159 ↗(On Diff #533516)

I can try and see what comes. I can see 2 ways to go about it, add the exec operand here itself or, in tablegen definition of PRED_COPY. What do you suggest?

yassingh added a subscriber: Restricted Project.Jun 23 2023, 7:53 AM
arsenm added inline comments.Jun 23 2023, 7:58 AM
llvm/lib/Target/AMDGPU/SILowerPredicatedCopies.cpp
159 ↗(On Diff #533516)

Add to the instruction definition. Though we're still stuck with COPY=PRED_COPY 'sometimes'

arsenm added inline comments.Jun 26 2023, 11:05 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
8004

I guess the name is wrong now. It's technically an unpredicated copy. Rename to WWM_COPY or COPY_WWM?

llvm/lib/Target/AMDGPU/SILowerPredicatedCopies.cpp
82 ↗(On Diff #534588)

SIInstrInfo

92 ↗(On Diff #534588)

Either this check is unnecessary or should have happened earlier to avoid crashing on the checkFlag

cdevadas added inline comments.Jun 26 2023, 11:16 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
8004

Make sense. I prefer WWM_COPY.

llvm/lib/Target/AMDGPU/SILowerPredicatedCopies.cpp
82 ↗(On Diff #534588)

This function is no longer needed. Now the PRED_COPY is inserted only for wwm-regs after we fine-tuned AMDGPU's getLiveRangeSplitOpcode implementation.

151 ↗(On Diff #534588)

I guess this whole check now can be entirely avoided. PRED_COPY is inserted only for wwm-regs. If we see any such copy, insert the EXEC mask manipulation unconditionally.

yassingh updated this revision to Diff 534849.Jun 26 2023, 11:24 PM
  • Rename PRED_COPY to WWM_COPY
  • Review comments delete redundant isWWMCopy() check

Remove the test llvm/test/CodeGen/AMDGPU/skip-subreg-copy-from-iswwmcopy-check.mir. It is no longer needed as you have dropped isWWMCopy function.

llvm/include/llvm/CodeGen/TargetInstrInfo.h
1974

Did you rebase this patch after the prototype change in D150388?
This diff shouldn't be here.

llvm/lib/CodeGen/SplitKit.cpp
539

Ditto

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2419

Still see the regular COPY here.

llvm/lib/Target/AMDGPU/SILowerWWMCopies.cpp
94

Remove the VRM check here. We no longer include this pass in the O0 pipeline.

126

I think this assertion is redundant. We are inserting WWM_COPY only during the live range split.

yassingh added inline comments.Jun 27 2023, 6:29 AM
llvm/lib/Target/AMDGPU/SILowerWWMCopies.cpp
94

This pass is being invoked in O0 too currently(added while exploring an independent lowering mechanism for PRED_COPY). Will remove that too.

arsenm added inline comments.Jun 27 2023, 10:57 AM
llvm/lib/Target/AMDGPU/SILowerWWMCopies.cpp
130

avoid DebugLoc copy, use const ref

arsenm added inline comments.Jun 27 2023, 10:59 AM
llvm/lib/Target/AMDGPU/SILowerWWMCopies.cpp
145

In a follow up could release the reserved register

yassingh updated this revision to Diff 535253.Jun 28 2023, 12:02 AM

Review commnets and remove pass from O0.

Harbormaster completed remote builds in B241698: Diff 535255.
yassingh added inline comments.Jun 28 2023, 12:06 AM
llvm/lib/Target/AMDGPU/SILowerWWMCopies.cpp
145

I don't understand, since sgpr allocation is already done do we have to free the reserved sgprs? Also, can you refer me to some other reserved registers to see how do I release it.

yassingh updated this revision to Diff 535430.Jun 28 2023, 8:42 AM

re-upload

arsenm accepted this revision.Jun 28 2023, 9:03 AM

LGTM with nit

llvm/lib/Target/AMDGPU/SIInstructions.td
3303

This should not be AMDGPUGenericInstruction, it's not a generic instruction. Should subclass from AMDGPUInst, and move to the other non-generic pseudos. AMDGPUGenericInstruction is for the G_AMDGPU_ globalisel pre-isel pseudos

This revision is now accepted and ready to land.Jun 28 2023, 9:03 AM
arsenm added inline comments.Jun 28 2023, 9:04 AM
llvm/lib/Target/AMDGPU/SILowerWWMCopies.cpp
121

Typos Club adjancent

136

Missing newline

llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
368

I think I lost track of what was going on with -O0, did this lose the -O0 run?

Also not sure how this has no test changes

arsenm added inline comments.Jun 29 2023, 4:17 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
3310

Should be marked convergent

yassingh updated this revision to Diff 537048.Jul 4 2023, 4:11 AM

Review comments

Also not sure how this has no test changes

This patch was originally submitted with 0 tests. My understating is we won't see any effects of these changes until WWM copies are inserted in code by the next patch(Spill SGPRs to virtual VGPRs).
cc @cdevadas

llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
368

It was introduced when we moved from PRED_COPY "simplification" to proper lowering mechanism. Was removed again after the approach changed back to lowering WWM_COPY to COPY first. I have added the O0 invocation back but it's not doing anything.

Also not sure how this has no test changes

This patch was originally submitted with 0 tests. My understating is we won't see any effects of these changes until WWM copies are inserted in code by the next patch(Spill SGPRs to virtual VGPRs).
cc @cdevadas

Yes. This is only a pre-patch for D124196 where the actual test was included, llvm/test/CodeGen/AMDGPU/whole-wave-register-copy.ll

arsenm added inline comments.Jul 4 2023, 7:04 AM
llvm/lib/Target/AMDGPU/SILowerWWMCopies.cpp
133

Back to back debug printing prints same instruction, just remove the second one?

yassingh updated this revision to Diff 537120.Jul 4 2023, 9:39 AM

Removed extra debug statement

cdevadas added inline comments.Jul 4 2023, 9:58 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
1351

I'm still not convinced why this is needed in the -O0 flow?
By now, the VGPR allocation is done in the -O0 flow, and we no longer have any virtual registers. This pass act on virtual registers to see if wwm copies needed exec manipulation.

llvm/lib/Target/AMDGPU/SILowerWWMCopies.cpp
91

If this pass is enabled for for -O0 as well, put back the !VRM check to ensure an early return from here for -O0 flow.
And may be a comment too?

yassingh updated this revision to Diff 537660.Jul 6 2023, 4:19 AM

Update instruction class of WWM_COPY

arsenm added inline comments.Jul 6 2023, 8:30 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
1351

It's conceptually needed and it's an implementation detail of current regalloc fast that these aren't introduced. Plus I think in general we should have other WWM copies for general WWM support in the future

cdevadas added inline comments.Jul 6 2023, 8:46 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
1351

ok

llvm/lib/Target/AMDGPU/SILowerWWMCopies.cpp
112

Also, do an early return if MRI.getNumVirtRegs() is zero. This would avoid iterating the whole function when there are no virtual registers at all (would take care of the -O0 path when physRegs are already assigned).

yassingh updated this revision to Diff 538190.Jul 7 2023, 10:27 AM

Rebase before merge

This revision was landed with ongoing or failed builds.Jul 7 2023, 10:29 AM
This revision was automatically updated to reflect the committed changes.
vitalybuka reopened this revision.Jul 26 2023, 10:15 PM
This revision is now accepted and ready to land.Jul 26 2023, 10:15 PM