Since the writelane instruction used for SGPR spills can
modify inactive lanes, the callee must preserve the VGPR
this instruction modifies even if it was marked Caller-saved.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
Nice change(s).
I’m not sure why the writelane registers are added as live-in to every block. Is the same happening for WWM registers and VGPRs used for SGPR spills?
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
1267–1269 | Why do we need these registers as a live-in to every block? (do we also need them as live-out if they should be alive everywhere?) | |
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h | ||
485 | I guess if it’s not deterministic, the spills and restores will be in an undefined order and we get different results when compiling multiple times |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
1267–1269 | These registers shouldn't be reused at the intermediate points and hence they must live-in all paths from the def point (prolog spills) to the use (the restores in the epilog). There will be liveness verifier errors otherwise. | |
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h | ||
495 | Yes, will clean it up. |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
1267–1269 | This is true for the globally reserved registers, like used for WWM. The VGPRs which are now allocated normally for SGPR spills do not need to be live in everywhere |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
1263 | Any recommendation? |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
1263 | Logically I think this belongs in PEI::calculateCallFrameInfo, but this isn't really a reasonable thing to add to the generic code. |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
1263 | Yes, we can't accommodate the code into PEI::calculateCallFrameInfo. The processFunctionBeforeFrameFinalized is also not a viable option. At this moment, determineCalleeSaves seemed a better place. The need to iterate over the MBB, in this case, can't be avoided. |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
1263 | I don't like it, but I guess we can go with this for now. Is there a way to assert that this was only computed once? I know at one point I had a patch that tried to speculatively call this to see what spills will be later inserted |
I don't think the premise of this patch is correct. A source-level writelane intrinsic is not allowed to overwrite inactive lanes. If it does, then that's UB. If you want wave-wide writelane in source, the way to achieve that would be to use WWM.
Also, the name "lane VGPRs" is misleading. Isn't more like "wave-wide VGPRs", actually?
Are you saying the usage of writelane intrinsic as specified in test kernel_writelane_intrinsic added with llvm/test/CodeGen/AMDGPU/spill-writelane-vgprs.ll doesn't really cause any inactive lane overwrite?
I remember having such usages being mentioned for a potential inactive lane overwrite issue.
Sorry about the late response (was behind a priority work).
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
1262 | @nhaehnle the general use of writelane (writelane intrinsic) currently being considered for spilling is due to the fact that we identify all writelane instructions at this point. |
Removed auto, changed LaneVGPRs into WWMVGPRs, and added relevant comments/descriptions.
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
102 | This function picks a new VGPR for FP/BP spilling and the actual spilling happens at emitPrologue/emitEpilogue. | |
1273 | Yes, they are. See SIRegisterInfo::getReservedRegs, I added some comments below. | |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
705 | Reserves WWM Reserved registers accumulated during SIPreAllocateWWMRegs. | |
715 | Reserves the VGPRs used for SGPR spills, accumulated during SILowerSGPRSpills. |
Optimized addToWWMVGPRs to early return if it is an entry function.
Also, added more details in the description for WWMVGPRs.
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
102 | But this case uses an ordinary mov to copy to VGPR and spills directly to memory, not with writelane. We currently don't try to spill into a lane (although we could/should, at least merging the FP and BP into one slot) |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
102 | Nope. This case exactly a spill to VGPR lane using WRITELANE instruction. What you are talking about is case 4 which is handled in the else part below. |
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp | ||
---|---|---|
267 | I don't understand why you would encounter the same register twice. This should assert the insert is a new insert? |
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp | ||
---|---|---|
267 | The main place where we call this function is from determineCalleeSaves where we walk through each BB and identify every WRITELANE instruction. They will appear multiple times for different SGPR spills to separate lanes of the same register. |
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h | ||
---|---|---|
485 | Why is the frame index optional? The only case where you wouldn't have this is the entry function case, where you skip the insertion. It might be clearer to make this non-optional and rename to WWMSpills |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
1273 | I think we ought to either use the live in strategy or mark as reserved, but not both. Should clean this up in a later patch |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
1263 | I don't understand why this is necessary in the first place. Can't you call allocateWWMSpill when the V_WRITELANE_B32 for spilling is inserted? Actually, isn't that already happening? |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
1263 | Yes, that's the behavior with the upstream compiler today. This patch is a prerequisite for D124196 which changes the SGPR spilling strategy by spilling them into virtual VGPRs instead of using the physical registers. These virtual registers get physVGPRs during the second iteration of allocation passes that allocates only VGPR classes. |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
1263 | Do you have a plan to fix this? And what about the correctness of D124196, see the question about the register allocator introducing COPYs when allocating VGPRs. I have a feeling that the compiler really does need to track at some level the nature of VGPRs, i.e. whether they're used at lane scope or at wave scope, in order to fix this potential miscompilation due to reg alloc. Once you have that, you don't need the scan for V_WRITELANE either. |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
1263 |
Yes, I am planning to fix it. Will see the approach you're talking about to detect lane scope and wave scope operations. |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
1267–1270 | For V_WRITELANE_B32: Why is this still needed, given that the code calls allocateWWMSpill when allocating an SGPR-to-VGPR spill lane? For V_READLANE_B32: This seems like an unnecessary and unacceptable pessimization. V_READLANE_B32 doesn't modify the register, so there's no need to spill anything of the source register based on it. |
We had a detailed offline discussion. The scan for v_writelane_b32/v_readlane_b32 is a (bad) heuristic for which physical VGPRs are used in a WWM fashion, so that a corresponding prolog spill / epilog reload is necessary. What we really should be doing is to remember this information when virtual registers are mapped to physical ones, which will involve calling allocateWWMSpill directly from the register allocator. I think that marking this as a TODO here, and then following up with a separate change that does the direct call of allocateWWMSpill from register allocation is a decent plan.
Is that something you could do in a separate pass post-RA but pre-VirtRegRewriter, by looking at the VirtRegMap analysis?
A separate pass would be possible with Greedy allocator. But with RegAllocFast, VRM won't be available. Planning to use delegate method (see D134950) in RegAllocFast to notify targets when a physical register is allocated.
Added a TODO to call allocateWWMSpill elsewhere in a better way to identify the VGPRs used for SGPR spilling.
I am confused why the parent/child relationship among the changes are removed, I think there are still some dependencies among this and other changes?
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
1270 | Why not define a separate pseudo instruction like V_SPILL_SGPR_TO_VGPR, which will be translated to V_WRITELANE at expandPostRAPseudo()? This should help us differentiate a writelane/readlane that have no wwm behavior from the sgpr spill/reload to vgpr. |
Sorry about that. I have set them initially. But later broke the chain while arc patching them into a downstream compiler copy. The arc patch was causing trouble as it patches all reviews in the relationship chain when there are conflicts in the intermediate patches.
I put them back now. This is the parent review and there are other 9 patches in the chain ending at D124196.
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
1270 | Yes, that appeared to be a better approach than looking for WRITELANE/READLANE directly. Still, the looping is an overhead and we should move it entirely from here. Capturing them during the regalloc would be the efficient way. I'm planning a post-patch to optimize it. |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
1264–1269 | Can you open an issue for this? |
Why is anything being added to WWM registers here?