This commit fixes the DPP sequence in the atomic optimizer (which was previously missing the row_shr:3 step).
Repository: rL LLVM
Updated to bring in an additional fix to remove read_register exec and replace it with a ballot.
(removed a comment here that was wrong)
lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp, lines 316–317 (On Diff #185510):
So I hadn't noticed this before, but I think the wwm intrinsic should be applied *after* the readlane below. With wwm before readlane, there's a theoretical possibility that register allocation splits the live range of the value and inserts a V_MOV in between, which ends up executed with bit 63 disabled, leading to an incorrect result from the readlane.
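The failure mode described here can be sketched with a toy model. Everything below is hypothetical for illustration (a 64-lane "register" as a Python list, made-up values); it models only the interaction between EXEC-masked writes and readlane:

```python
# Toy model: a 64-lane register, a copy that only writes lanes enabled
# in EXEC, and a readlane that ignores EXEC entirely.
LANES = 64

def masked_copy(dst, src, exec_mask):
    """A V_MOV under an EXEC mask: only enabled lanes are written."""
    return [src[i] if (exec_mask >> i) & 1 else dst[i] for i in range(LANES)]

def readlane(reg, lane):
    """V_READLANE reads any lane regardless of EXEC."""
    return reg[lane]

# After wwm, every lane holds a valid value (here, the scan result).
wwm_result = list(range(LANES))

# Register allocation splits the live range: the value is copied into a
# fresh register by a V_MOV that happens to run with bit 63 clear in EXEC.
exec_without_63 = (1 << 63) - 1          # bit 63 disabled
fresh_reg = [-1] * LANES                 # uninitialised garbage
fresh_reg = masked_copy(fresh_reg, wwm_result, exec_without_63)

# A readlane of lane 63 from the copy now sees the garbage.
print(readlane(fresh_reg, 63))    # -1, the garbage value
print(readlane(wwm_result, 63))   # 63, the value it should have seen
```

Applying wwm after the readlane sidesteps this, because any live-range split then happens on the scalar result rather than on the per-lane value.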
lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp, lines 316–317 (On Diff #185510):
Oh yeah. I was assuming that because readlane ignores the exec mask it would be fine, but I can see why that might be an issue. I'll change the code.
Not really an area I'm 100% sure about - but looks ok to me. One of the other reviewers will have to sign off too.
Minor niggle on the comment (if my understanding is correct).
lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp, line 303 (On Diff #185929):
Is this comment still true?
I don't understand this fix. Surely a reduction is done with just power-of-two shifts. Why do we need the shift by 3 as well? What is the extra wf_sr1 dpp at the start for?
lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp, line 258 (On Diff #185929):
Do you need the setConvergent? The intrinsic is already marked convergent in the .td file. This might also apply to the setConvergent calls lower down, but I haven't checked.
lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp, line 258 (On Diff #185929):
Probably not, but adding it shouldn't do any harm (I use setConvergent on every other convergent thing in the file, so best to be consistent, I think!).
But I still don't understand it:
1. Why do you want an exclusive scan? Surely what you're trying to do is just "sum" up all lanes into lane 63, which is an inclusive scan.
2. Can't you do an exclusive scan with power-of-two shifts like an inclusive scan, but just with the wf_sr1 on the front? (Although I think that gives the wrong answer due to (1).)
3. Isn't the only thing wrong with this code before this fix that you forgot to put the bank masks on steps 2, 3 and 4? (Although you're correct to remove the unnecessary intermediate wwm intrinsic calls.)
No, we need both. You need an exclusive scan to know the individual lane's position within the single atomic operation, but you also need the reduction (sum) to know how much the atomic operation is adding in the first place. So I do an exclusive scan to get our position, add on our own lane's value to get the inclusive scan, and readlane 63 to get the total reduction across the wavefront.
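The relationship between the three results can be sketched in a few lines (the per-lane values below are made up for illustration):

```python
# One value per lane of a 64-wide wavefront (made-up data).
values = [3, 1, 4, 1, 5, 9, 2, 6] * 8

# Exclusive scan: each lane's offset into the single combined atomic op.
exclusive = [sum(values[:i]) for i in range(64)]

# Inclusive scan: the exclusive scan plus the lane's own value.
inclusive = [e + v for e, v in zip(exclusive, values)]

# readlane 63 of the inclusive scan is the wavefront-wide reduction,
# i.e. the amount the single atomic actually adds.
reduction = inclusive[63]
print(reduction == sum(values))   # True
```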
> Can't you do an exclusive scan with power-of-two shifts like an inclusive scan, but just with the wf_sr1 on the front? (Although I think that gives the wrong answer due to (1).)
No, you need to do the shifts by 1, 2 and 3. You can see the reasoning here: https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
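For what it's worth, here is a toy simulation of that DPP sequence as I read the linked article: row_shr shifts within each 16-lane row with bound_ctrl:0 semantics (out-of-range lanes read 0), the first three adds apply the shift to the *original* value, and the bank/row masks restrict which lanes are written. The helper names are made up and this models only the lane arithmetic, not real hardware, so treat it as a sketch:

```python
ROW, N = 16, 64

def row_shr(src, n):
    # Toy model of row_shr:n with bound_ctrl:0 - shift right within each
    # 16-lane row; lanes whose source falls outside the row read 0.
    return [src[i - n] if i - n >= (i // ROW) * ROW else 0 for i in range(N)]

def scan(x):
    v = list(x)
    # v1 += shr(v0) for shifts 1, 2, 3 of the ORIGINAL value: afterwards
    # every lane holds the sum of itself and up to 3 predecessors.
    for n in (1, 2, 3):
        s = row_shr(x, n)
        v = [v[i] + s[i] for i in range(N)]
    # shr:4 bank_mask:0xe and shr:8 bank_mask:0xc - only the listed
    # 4-lane banks within each row are written.
    for n, banks in ((4, {1, 2, 3}), (8, {2, 3})):
        s = row_shr(v, n)
        v = [v[i] + s[i] if (i % ROW) // 4 in banks else v[i] for i in range(N)]
    # row_bcast:15 row_mask:0xa - rows 1 and 3 add lane 15 / lane 47.
    v = [v[i] + v[15] if 16 <= i < 32 else v[i] + v[47] if 48 <= i else v[i]
         for i in range(N)]
    # row_bcast:31 row_mask:0xc - rows 2 and 3 add lane 31.
    v = [v[i] + v[31] if i >= 32 else v[i] for i in range(N)]
    return v

x = [(7 * i + 3) % 11 for i in range(N)]
expected = [sum(x[: i + 1]) for i in range(N)]
print(scan(x) == expected)   # True
```

The shift by 3 matters because the first three adds all shift the original value, building a sliding window of 4 per lane; from there the masked shifts by 4 and 8 complete each row, and the two broadcasts stitch the rows together.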
> Isn't the only thing wrong with this code before this fix that you forgot to put the bank masks on steps 2, 3 and 4? (Although you're correct to remove the unnecessary intermediate wwm intrinsic calls.)
There were two bugs:
- The exclusive scan component returned the wrong result in some cases because the shift-by-3 wasn't there.
- We need to do a ballot instead of read_register, because read_register was sometimes being misidentified as requiring WWM and thus giving us garbage results for the atomic load.
Ah, right, I see about the need for the exclusive and inclusive scan results.
I checked https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ and I didn't see any reasoning about why you need the shift by 3. In the diagram there, I reckon you could change instruction 2 to have the result of instruction 1 as both operands, and omit instruction 3 (the shift by 3), and get the same result, saving one instruction. However that might actually take one wait state longer, assuming you can't schedule other stuff into the middle of it, which you probably can't.
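That suggested variant can be checked under a toy lane model (row_shr with bound_ctrl:0 semantics within 16-lane rows; helper names are made up, and this sketches only the lane arithmetic): replacing the first three adds with v1 = v0 + shr1(v0) followed by v1 = v1 + shr2(v1) reaches the same window-of-4 state, so the rest of the sequence still produces the inclusive scan.

```python
ROW, N = 16, 64

def row_shr(src, n):
    # Toy model of row_shr:n with bound_ctrl:0 - shift right within each
    # 16-lane row; lanes whose source falls outside the row read 0.
    return [src[i - n] if i - n >= (i // ROW) * ROW else 0 for i in range(N)]

def scan_variant(x):
    s = row_shr(x, 1)
    v = [x[i] + s[i] for i in range(N)]     # each lane: window of 2
    s = row_shr(v, 2)
    v = [v[i] + s[i] for i in range(N)]     # each lane: window of 4
    # Remainder of the sequence is unchanged: shr:4 bank_mask:0xe,
    # shr:8 bank_mask:0xc, then the two row_bcast steps.
    for n, banks in ((4, {1, 2, 3}), (8, {2, 3})):
        s = row_shr(v, n)
        v = [v[i] + s[i] if (i % ROW) // 4 in banks else v[i] for i in range(N)]
    v = [v[i] + v[15] if 16 <= i < 32 else v[i] + v[47] if 48 <= i else v[i]
         for i in range(N)]
    v = [v[i] + v[31] if i >= 32 else v[i] for i in range(N)]
    return v

x = [(5 * i + 1) % 13 for i in range(N)]
print(scan_variant(x) == [sum(x[: i + 1]) for i in range(N)])   # True
```

As noted, whether this is actually faster depends on the wait states between dependent DPP operations, which the model above says nothing about.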
So I'm happy now.
llvm/trunk/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp, line 256:
This leaves the declaration of Context on line 214 unused. I'm getting build errors with -Wunused-variable.
llvm/trunk/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp, line 256:
There's a later change that removes it. See https://llvm.org/svn/llvm-project/llvm/trunk@353704 by Benny Kramer.