If the result of an atomic operation is not used then it can be more
efficient to build a reduction across all lanes instead of a scan. Do
this for GFX10, where the permlanex16 instruction makes it viable. For
wave64 this saves a couple of dpp operations. For wave32 it saves one
readlane (readlanes are generally bad for performance) and one dpp
operation.
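
For illustration, here is a minimal sketch of what such a reduction can look like when emitted through IRBuilder, using the `update.dpp`, `permlanex16` and `readlane` intrinsics. The function name, the `BinOp` callback, the `Identity` value and the literal `0x160` (standing in for the row_xmask DPP control from SIDefines.h) are assumptions for the sketch, not the patch's actual code:

```cpp
// Sketch only: wave-wide reduction of a 32-bit value V, leaving the result
// in every lane (wave32) or in a scalar (wave64).
#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicsAMDGPU.h"

using namespace llvm;

static Value *
buildWaveReduction(IRBuilder<> &B, Value *V, Value *Identity, bool IsWave32,
                   function_ref<Value *(Value *, Value *)> BinOp) {
  Type *Int32Ty = B.getInt32Ty();

  // Reduce within each row of 16 lanes. 0x160 is assumed to be the
  // row_xmask:0 DPP control; OR-ing in 1, 2, 4, 8 gives the xmask
  // "butterfly" steps.
  for (unsigned Idx = 0; Idx < 4; ++Idx) {
    Value *Dpp = B.CreateIntrinsic(
        Intrinsic::amdgcn_update_dpp, Int32Ty,
        {Identity, V, B.getInt32(0x160 | (1u << Idx)), B.getInt32(0xf),
         B.getInt32(0xf), B.getFalse()});
    V = BinOp(V, Dpp);
  }

  // Combine the two rows of each row pair (32 lanes) with permlanex16,
  // which swaps the 16-lane halves between adjacent rows.
  Value *Swapped = B.CreateIntrinsic(
      Intrinsic::amdgcn_permlanex16, {},
      {V, V, B.getInt32(-1), B.getInt32(-1), B.getFalse(), B.getFalse()});
  V = BinOp(V, Swapped);

  if (IsWave32)
    return V; // Every lane already holds the full reduction.

  // Wave64: combine one lane from each 32-lane half with readlane, giving
  // a scalar result, which is all the atomic needs.
  Value *Lane0 =
      B.CreateIntrinsic(Intrinsic::amdgcn_readlane, {}, {V, B.getInt32(0)});
  Value *Lane32 =
      B.CreateIntrinsic(Intrinsic::amdgcn_readlane, {}, {V, B.getInt32(32)});
  return BinOp(Lane0, Lane32);
}
```

In the pass this sequence would run under WWM so that every lane contributes; see the exec-mask discussion in the timeline below.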
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:299
This requires all lanes to be active. Are we guaranteed that the work group size will be an integer multiple of the wave size?
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:299
But suppose the launched grid has size 66. That means one wave has only 2 active lanes, and I'm not aware that WWM can actually activate the rest of them.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:299
That's exactly what WWM does: it unconditionally activates all lanes. You can see that in the tests in this patch (both before and after my changes): `s_or_saveexec_b64 s[0:1], -1` sets all bits in the exec mask.
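
As background to this exchange: the way a pass like this gets all lanes enabled is by routing the reduced value through the WWM intrinsic, which forces everything the value depends on to be computed with a full exec mask. A minimal sketch, using the same setup as the sketch above (`llvm.amdgcn.strict.wwm` is the current spelling; older trees used `llvm.amdgcn.wwm`):

```cpp
// Sketch: anything the result of strict.wwm transitively depends on is
// computed with the exec mask forced to all ones, which is what the
// s_or_saveexec_b64 s[0:1], -1 in the tests implements.
static Value *runUnderWWM(IRBuilder<> &B, Value *Reduced) {
  return B.CreateIntrinsic(Intrinsic::amdgcn_strict_wwm, Reduced->getType(),
                           {Reduced});
}
```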
LGTM - this seems like a good use of GFX10 row_xmask.
Please see minor comments.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:358
Should this V_PERMLANEX16 be guarded as well?
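
For reference, "guarded" here presumably means gating on the subtarget. A hypothetical shape of such a guard; `NeedResult`, `hasPermLaneX16()`, `buildReduction` and `buildScan` are assumed names, not quoted from the patch:

```cpp
// Hypothetical: only take the permlanex16-based reduction on subtargets
// that have the instruction and when the atomic's result is unused;
// otherwise fall back to the scan.
if (!NeedResult && ST->hasPermLaneX16())
  V = buildReduction(B, V, Identity); // result unused: cheaper
else
  V = buildScan(B, V, Identity); // result needed per lane
```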
llvm/lib/Target/AMDGPU/SIDefines.h:674
Is it worth silencing linting of this enum?

llvm/lib/Target/AMDGPU/SIDefines.h:710
ROW_SHARE0 is defined, but not used?
llvm/lib/Target/AMDGPU/SIDefines.h:710
Yeah, I just added it for consistency before I had worked out which ones I would actually need.
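
For context, the additions under discussion look roughly like this; the names follow the review, and the values are assumed from the GFX10 DPP encoding (row_share occupies 0x150-0x15F and row_xmask 0x160-0x16F):

```cpp
// Assumed sketch of the GFX10 entries in the DppCtrl enum; ROW_SHARE0 is
// the enumerator the review notes is currently unused.
enum DppCtrl : unsigned {
  ROW_SHARE_FIRST = 0x150,
  ROW_SHARE0 = ROW_SHARE_FIRST,
  ROW_SHARE_LAST = 0x15F,
  ROW_XMASK_FIRST = 0x160,
  ROW_XMASK0 = ROW_XMASK_FIRST,
  ROW_XMASK_LAST = 0x16F,
};
```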