Page MenuHomePhabricator

[AMDGPU] Use reductions instead of scans in the atomic optimizer
ClosedPublic

Authored by foad on Mar 19 2021, 7:52 AM.

Details

Summary

If the result of an atomic operation is not used then it can be more
efficient to build a reduction across all lanes instead of a scan. Do
this for GFX10, where the permlanex16 instruction makes it viable. For
wave64 this saves a couple of dpp operations. For wave32 it saves one
readlane (which are generally bad for performance) and one dpp
operation.

Diff Detail

Event Timeline

foad created this revision.Mar 19 2021, 7:52 AM
foad requested review of this revision.Mar 19 2021, 7:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 19 2021, 7:52 AM
b-sumner added inline comments.Mar 19 2021, 11:25 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
299

This requires all lanes to be active. Are we guaranteed that the work group size will be a integer multiple of the wave size?

foad added inline comments.Mar 19 2021, 11:36 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
299

The reduction or scan runs in whole wave mode. All lanes are active.

299

... and lanes that weren't active to start with are set to an appropriate identity value for the operation.

b-sumner added inline comments.Mar 19 2021, 12:25 PM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
299

But suppose the launched grid has size 66. That means one wave has only 2 active lanes, and I'm not aware that WWM can actually activate the rest of them.

foad added inline comments.Mar 19 2021, 2:02 PM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
299

That's exactly what WWM does: unconditionally activates all lanes. You can see that in the tests in this patch (both before and after my changes): s_or_saveexec_b64 s[0:1], -1 sets all bits in the exec mask.

critson accepted this revision.Mar 23 2021, 1:14 AM

LGTM - this seems like a good use of GFX10 row_xmask.

Please see minor comments.

llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
358

Should this V_PERMLANEX16 be guarded as well?
Or at least have an assert?

llvm/lib/Target/AMDGPU/SIDefines.h
675

Is it worth silencing linting of this enum?

711

ROW_SHARE0 is defined, but not used?
Not that I am against it existing.

This revision is now accepted and ready to land.Mar 23 2021, 1:14 AM
foad updated this revision to Diff 332583.Mar 23 2021, 2:44 AM

Address review comments, and use an early-out.

foad marked 3 inline comments as done.Mar 23 2021, 2:47 AM
foad added inline comments.
llvm/lib/Target/AMDGPU/SIDefines.h
711

Yeah, I just added it for consistency before I had worked out which ones I would actually need.

This revision was automatically updated to reflect the committed changes.
foad marked an inline comment as done.