This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Callee must always spill writelane VGPRs
ClosedPublic

Authored by cdevadas on Apr 21 2022, 12:15 PM.

Details

Summary

Since the writelane instruction used for SGPR spills can
modify inactive lanes, the callee must preserve the VGPR
this instruction modifies even if it was marked Caller-saved.

Diff Detail

Event Timeline

cdevadas created this revision.Apr 21 2022, 12:15 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2022, 12:15 PM
cdevadas requested review of this revision.Apr 21 2022, 12:15 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2022, 12:15 PM
hsmhsm added inline comments.Apr 21 2022, 9:50 PM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
491

Why do we need MapVector here instead of DenseMap? Do we need to access the values in deterministic order?

501

Are we still using the data structure SGPRSpillVGPR? If not, should not we delete it?

Nice change(s).
I’m not sure why the writelane registers are added as live-in to every block. Is the same happening for WWM registers and VGPRs used for SGPR spills?

sebastian-ne added inline comments.Apr 22 2022, 3:28 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1270–1272

Why do we need these registers as a live-in to every block? (do we also need them as live-out if they should be alive everywhere?)

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
491

I guess if it’s not deterministic, the spills and restores will be in an undefined order and we get different results when compiling multiple times

cdevadas added inline comments.Apr 25 2022, 4:14 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1270–1272

These registers shouldn't be reused at the intermediate points and hence they must live-in all paths from the def point (prolog spills) to the use (the restores in the epilog). There will be liveness verifier errors otherwise.
I don't think the live-out is necessary.

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
501

Yes, will clean it up.

arsenm added inline comments.Apr 25 2022, 6:42 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1270–1272

This is true for the globally reserved registers, like used for WWM. The VGPRs which are now allocated normally for SGPR spills do not need to be live in everywhere

cdevadas updated this revision to Diff 425168.Apr 26 2022, 3:42 AM

Removed unwanted struct declarations.

arsenm added inline comments.Apr 29 2022, 10:47 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
782–783

No auto

1266

I don't think this is the right place to determine these registers. It's adding an extra loop over the function, and adding statefulness to determineCalleeSaves. Theoretically we should be able to call it multiple times

cdevadas added inline comments.Apr 29 2022, 11:03 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1266

Any recommendation?
This should be done before identifying the CSR registers and they should be skipped from the default CSR spill insertion routines.

arsenm added inline comments.Apr 29 2022, 3:08 PM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1266

Logically I think this belongs in PEI::calculateCallFrameInfo, but this isn't really a reasonable thing to add to the generic code.

cdevadas added inline comments.May 1 2022, 11:38 PM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1266

Yes, we can't accommodate the code into PEI::calculateCallFrameInfo. The processFunctionBeforeFrameFinalized is also not a viable option. At this moment, determineCalleeSaves seemed a better place. The need to iterate over the MBB, in this case, can't be avoided.

arsenm added inline comments.May 17 2022, 2:40 PM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1266

I don't like it, but I guess we can go with this for now. Is there a way to assert that this was only computed once? I know at one point I had a patch that tried to speculatively call this to see what spills will be later inserted

arsenm requested changes to this revision.May 17 2022, 3:06 PM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1057

No auto

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
492

Needs a comment for what lane VGPRs are

This revision now requires changes to proceed.May 17 2022, 3:06 PM
nhaehnle requested changes to this revision.May 24 2022, 8:36 AM

I don't think the premise of this patch is correct. A source-level writelane intrinsic is not allowed to overwrite inactive lanes. If it does, then that's UB. If you want wave-wide writelane in source, the way to achieve that would be to use WWM.

Also, the name "lane VGPRs" is misleading. Isn't more like "wave-wide VGPRs", actually?

I don't think the premise of this patch is correct. A source-level writelane intrinsic is not allowed to overwrite inactive lanes. If it does, then that's UB. If you want wave-wide writelane in source, the way to achieve that would be to use WWM.

Are you saying the usage of writelane intrinsic as specified in test kernel_writelane_intrinsic added with llvm/test/CodeGen/AMDGPU/spill-writelane-vgprs.ll doesn't really cause any inactive lane overwrite?
I remember having such usages being mentioned for a potential inactive lane overwrite issue.

Sorry about the late response (was behind a priority work).

I don't think the premise of this patch is correct. A source-level writelane intrinsic is not allowed to overwrite inactive lanes. If it does, then that's UB. If you want wave-wide writelane in source, the way to achieve that would be to use WWM.

Are you saying the usage of writelane intrinsic as specified in test kernel_writelane_intrinsic added with llvm/test/CodeGen/AMDGPU/spill-writelane-vgprs.ll doesn't really cause any inactive lane overwrite?
I remember having such usages being mentioned for a potential inactive lane overwrite issue.

Sorry about the late response (was behind a priority work).

Ping @nhaehnle

cdevadas added inline comments.Jun 21 2022, 8:20 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1265

@nhaehnle the general use of writelane (writelane intrinsic) currently being considered for spilling is due to the fact that we identify all writelane instructions at this point.
This is a pre-patch for spilling SGPRs to virtual VGPRs during SILowerSGPRSpills pass (D124196). Once we spill SGPRs to virtVGPRs, we don't know the actual VGPRs they get until RA. To identify the actual VGPRs used for SGPR spills, we have no choice other than walking through all BBs at a later stage. At this point, we don't have a better way to distinguish the writelane instructions for spills from the general use.

cdevadas updated this revision to Diff 440266.Jun 27 2022, 9:33 AM
cdevadas edited the summary of this revision. (Show Details)

Removed auto, changed LaneVGPRs into WWMVGPRs, and added relevant comments/descriptions.

arsenm added inline comments.Jun 27 2022, 5:23 PM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
102

Why is anything being added to WWM registers here?

1276

Were WWM registers not reserved? I thought they were

cdevadas added inline comments.Jun 27 2022, 8:20 PM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
102

This function picks a new VGPR for FP/BP spilling and the actual spilling happens at emitPrologue/emitEpilogue.
We should add them to WWMVGPRs (renamed from 'LaneVGPRs' after Nicolai's suggestion) that track all whole wave VGPRs and their FIs.

1276

Yes, they are. See SIRegisterInfo::getReservedRegs, I added some comments below.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
705

Reserves WWM Reserved registers accumulated during SIPreAllocateWWMRegs.

715

Reserves the VGPRs used for SGPR spills, accumulated during SILowerSGPRSpills.
We won't have this reserving after D124196 where RA allocates/chooses these VGPRs.

cdevadas updated this revision to Diff 440470.Jun 27 2022, 10:59 PM

Optimized addToWWMVGPRs to early return if it is an entry function.
Also, added more details in the description for WWMVGPRs.

arsenm added inline comments.Jun 28 2022, 6:17 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
102

But this case uses an ordinary mov to copy to VGPR and spills directly to memory, not with writelane. We currently don't try to spill into a lane (although we could/should, at least merging the FP and BP into one slot)

cdevadas added inline comments.Jun 28 2022, 6:29 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
102

Nope. This case exactly a spill to VGPR lane using WRITELANE instruction. What you are talking about is case 4 which is handled in the else part below.

arsenm added inline comments.Jun 28 2022, 6:40 AM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
272

I don't understand why you would encounter the same register twice. This should assert the insert is a new insert?

cdevadas added inline comments.Jun 28 2022, 6:49 AM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
272

The main place where we call this function is from determineCalleeSaves where we walk through each BB and identify every WRITELANE instruction. They will appear multiple times for different SGPR spills to separate lanes of the same register.

arsenm added inline comments.Jun 28 2022, 10:57 AM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
491

Why is the frame index optional? The only case where you wouldn't have this is the entry function case, where you skip the insertion. It might be clearer to make this non-optional and rename to WWMSpills

cdevadas updated this revision to Diff 440722.Jun 28 2022, 12:35 PM

Review comments addressed.

arsenm accepted this revision.Jun 28 2022, 1:06 PM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1276

I think we ought to either use the live in strategy or mark as reserved, but not both. Should clean this up in a later patch

@nhaehnle should also have to accept this patch. I won't be able to push it otherwise.

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1276

Sure.

nhaehnle added inline comments.Jun 29 2022, 2:43 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1266

I don't understand why this is necessary in the first place. Can't you call allocateWWMSpill when the V_WRITELANE_B32 for spilling is inserted? Actually, isn't that already happening?

cdevadas added inline comments.Jun 29 2022, 2:59 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1266

Yes, that's the behavior with the upstream compiler today. This patch is a prerequisite for D124196 which changes the SGPR spilling strategy by spilling them into virtual VGPRs instead of using the physical registers. These virtual registers get physVGPRs during the second iteration of allocation passes that allocates only VGPR classes.
At the time of WRITELANE insertion at SILowerSGPRSpills pass, we only have the virtual registers with D124196. That's the reason we need to iterate over all BBs looking for WRITELANE instructions later during PrologEpilogInserter.

nhaehnle added inline comments.Jun 29 2022, 6:23 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1266

Do you have a plan to fix this? And what about the correctness of D124196, see the question about the register allocator introducing COPYs when allocating VGPRs.

I have a feeling that the compiler really does need to track at some level the nature of VGPRs, i.e. whether they're used at lane scope or at wave scope, in order to fix this potential miscompilation due to reg alloc. Once you have that, you don't need the scan for V_WRITELANE either.

cdevadas added inline comments.Jun 29 2022, 8:53 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1266

Do you have a plan to fix this?

Yes, I am planning to fix it. Will see the approach you're talking about to detect lane scope and wave scope operations.

Ping.
@nhaehnle can you unblock this?

nhaehnle added inline comments.Oct 27 2022, 1:19 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1270–1273

For V_WRITELANE_B32: Why is this still needed, given that the code calls allocateWWMSpill when allocating an SGPR-to-VGPR spill lane?

For V_READLANE_B32: This seems like an unnecessary and unacceptable pessimization. V_READLANE_B32 doesn't modify the register, so there's no need to spill anything of the source register based on it.

nhaehnle accepted this revision.Oct 27 2022, 2:06 AM

We had a detailed offline discussion. The scan for v_writelane_b32/v_readlane_b32 is a (bad) heuristic for which physical VGPRs are used in a WWM fashion, so that a corresponding prolog spill / epilog reload is necessary. What we really should be doing is to remember this information when virtual registers are mapped to physical ones, which will involve calling allocateWWMSpill directly from the register allocator. I think that marking this as a TODO here, and then following up with a separate change that does the direct call of allocateWWMSpill from register allocation is a decent plan.

This revision is now accepted and ready to land.Oct 27 2022, 2:06 AM
foad added a comment.Oct 27 2022, 2:36 AM

I think that marking this as a TODO here, and then following up with a separate change that does the direct call of allocateWWMSpill from register allocation is a decent plan.

Is that something you could do in a separate pass post-RA but pre-VirtRegRewriter, by looking at the VirtRegMap analysis?

I think that marking this as a TODO here, and then following up with a separate change that does the direct call of allocateWWMSpill from register allocation is a decent plan.

Is that something you could do in a separate pass post-RA but pre-VirtRegRewriter, by looking at the VirtRegMap analysis?

A separate pass would be possible with Greedy allocator. But with RegAllocFast, VRM won't be available. Planning to use delegate method (see D134950) in RegAllocFast to notify targets when a physical register is allocated.

cdevadas updated this revision to Diff 471099.Oct 27 2022, 3:08 AM

Added a TODO to call allocateWWMSpill elsewhere in a better way to identify the VGPRs used for SGPR spilling.

I am confused why the parent/child relationship among the changes are removed, I think there are still some dependencies among this and other changes?

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1273

Why not define a separate pseudo instruction like V_SPILL_SGPR_TO_VGPR, which will be translated to V_WRITELANE at expandPostRAPseudo()? This should help us differentiate a writelane/readlane that have no wwm behavior from the sgpr spill/reload to vgpr.

I am confused why the parent/child relationship among the changes are removed, I think there are still some dependencies among this and other changes?

Sorry about that. I have set them initially. But later broke the chain while arc patching them into a downstream compiler copy. The arc patch was causing trouble as it patches all reviews in the relationship chain when there are conflicts in the intermediate patches.
I put them back now. This is the parent review and there are other 9 patches in the chain ending at D124196.

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1273

Yes, that appeared to be a better approach than looking for WRITELANE/READLANE directly. Still, the looping is an overhead and we should move it entirely from here. Capturing them during the regalloc would be the efficient way. I'm planning a post-patch to optimize it.

arsenm accepted this revision.Nov 1 2022, 1:29 PM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1267–1272

Can you open an issue for this?

arsenm accepted this revision.Dec 14 2022, 11:33 AM
This revision was landed with ongoing or failed builds.Dec 16 2022, 10:17 PM
This revision was automatically updated to reflect the committed changes.