If the result of an atomic operation is not used then it can be more
efficient to build a reduction across all lanes instead of a scan. Do
this for GFX10, where the permlanex16 instruction makes it viable. For
wave64 this saves a couple of dpp operations. For wave32 it saves one
readlane (which is generally bad for performance) and one dpp
operation.
Repository: rG LLVM Github Monorepo
Event Timeline
| llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | |
|---|---|
| 299 | This requires all lanes to be active. Are we guaranteed that the work group size will be an integer multiple of the wave size? |
| llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | |
|---|---|
| 299 | But suppose the launched grid has size 66. That means one wave has only 2 active lanes, and I'm not aware that WWM can actually activate the rest of them. |
| llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | |
|---|---|
| 299 | That's exactly what WWM does: unconditionally activates all lanes. You can see that in the tests in this patch (both before and after my changes): `s_or_saveexec_b64 s[0:1], -1` sets all bits in the exec mask. |
LGTM - this seems like a good use of GFX10 `row_xmask`.
Please see minor comments.
| llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | |
|---|---|
| 358 | Should this V_PERMLANEX16 be guarded as well? |

| llvm/lib/Target/AMDGPU/SIDefines.h | |
|---|---|
| 675 | Is it worth silencing linting of this enum? |
| 711 | ROW_SHARE0 is defined, but not used? |
| llvm/lib/Target/AMDGPU/SIDefines.h | |
|---|---|
| 711 | Yeah, I just added it for consistency before I had worked out which ones I would actually need. |