Page MenuHomePhabricator

[AMDGPU] Fix DPP sequence in atomic optimizer.
ClosedPublic

Authored by sheredom on Tue, Feb 5, 1:42 AM.

Details

Summary

This commit fixes the DPP sequence in the atomic optimizer (which was previously missing the row_shr:3 step).

Diff Detail

Event Timeline

sheredom created this revision.Tue, Feb 5, 1:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptTue, Feb 5, 1:42 AM
sheredom updated this revision to Diff 185510.Wed, Feb 6, 1:46 AM

Updated to bring in an additional fix to remove read_register exec and replace it with a ballot.

nhaehnle added a comment.EditedThu, Feb 7, 3:54 AM

(removed a comment here that was wrong)

lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
319–320

So I hadn't noticed this before, but I think the wwm intrinsic shouldn't be applied *after* the readlane below.

With wwm before readlane, there's a theoretical possibility that register allocation splits the live range of the value and inserts a V_MOV in between which ends up executed with bit 63 disabled, leading to an incorrect results from the readlane.

sheredom marked an inline comment as done.Thu, Feb 7, 5:00 AM
sheredom added inline comments.
lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
319–320

Oh yeah - so I was assuming because readlane ignores the exec mask it should be fine, but I can see why that might be an issue. I'll change the code.

sheredom updated this revision to Diff 185929.Fri, Feb 8, 2:05 AM

Make the final readlane be in the WWM section too as per the review comments.

sheredom marked an inline comment as done.Fri, Feb 8, 2:06 AM
dstuttard accepted this revision.Mon, Feb 11, 2:36 AM

Not really an area I'm 100% sure about - but looks ok to me. One of the other reviewers will have to sign off too.
Minor niggle on the comment (if my understanding is correct).

lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
306

Is this comment still true?

This revision is now accepted and ready to land.Mon, Feb 11, 2:36 AM
sheredom updated this revision to Diff 186208.Mon, Feb 11, 2:50 AM

Fix comment that should now say exclusive scan instead of inclusive scan.

sheredom marked an inline comment as done.Mon, Feb 11, 2:51 AM
tpr added a comment.Mon, Feb 11, 3:15 AM

I don't understand this fix. Surely a reduction is done with just power of two shifts. Why do we need the shift by 3 as well? What is the extra wf_sr1 dpp at the start for?

lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
265

Do you need the setConvergent? It's already marked setConvergent in the .td file. This might also apply to the setConvergent calls lower down, but I haven't checked.

tpr added a comment.Mon, Feb 11, 3:19 AM

Oh, it's because you've switched to an exclusive scan.

sheredom marked 2 inline comments as done.Mon, Feb 11, 3:25 AM
sheredom added inline comments.
lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
265

Probably not, but it shouldn't harm this by adding it (I use SetConvergent on every other convergent thing in the file, so best to be consistent I think!).

tpr added a comment.Mon, Feb 11, 3:42 AM

But I still don't understand it:

  1. Why do you want an exclusive scan? Surely what you're trying to do is just "sum" up all lanes into lane 63, which is an inclusive scan.
  2. Can't you do an exclusive scan with powers of 2 shifts like an inclusive scan, but just with the wf_sr1 on the front? (Although I think that gives the wrong answer due to (1)).
  3. Isn't the only thing wrong with this code before this fix that you forgot to put the bank masks on steps 2, 3 and 4? (Although you're correct to remove the unnecessary intermediate wwm intrinsic calls.)
sheredom marked an inline comment as done.Mon, Feb 11, 3:49 AM
In D57737#1392667, @tpr wrote:

But I still don't understand it:

  1. Why do you want an exclusive scan? Surely what you're trying to do is just "sum" up all lanes into lane 63, which is an inclusive scan.

No we need both - you need an exclusive scan to know the individual lane's position in the single atomic operation, but you also need the reduction (sum) to know how much we are adding in the atomic operation in the first place. So I do an exclusive scan to get our position, then add on our own lane's value to get the inclusive scan, and readlane 63 to get the total reduction across the wavefront.

  1. Can't you do an exclusive scan with powers of 2 shifts like an inclusive scan, but just with the wf_sr1 on the front? (Although I think that gives the wrong answer due to (1)).

No you need to do the shift 1,2,3 - can see the reasoning why in here https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

  1. Isn't the only thing wrong with this code before this fix that you forgot to put the bank masks on steps 2, 3 and 4? (Although you're correct to remove the unnecessary intermediate wwm intrinsic calls.)

There was two bugs:

  1. The exclusive scan component returned the wrong result in some cases because the shift-by-3 wasn't there.
  2. We need to do a ballot instead of read_register because read_register was sometimes being misidentified as requiring WWM and thus giving us garbage results for the atomc load.
tpr accepted this revision.Mon, Feb 11, 5:04 AM

Ah, right, I see about the need for the exclusive and inclusive scan results.

I checked https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ and I didn't see any reasoning about why you need the shift by 3. In the diagram there, I reckon you could change instruction 2 to have the result of instruction 1 as both operands, and omit instruction 3 (the shift by 3), and get the same result, saving one instruction. However that might actually take one wait state longer, assuming you can't schedule other stuff into the middle of it, which you probably can't.

So I'm happy now.

This revision was automatically updated to reflect the committed changes.