This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add support for Whole Wavefront Mode
ClosedPublic

Authored by cwabbott on Jul 17 2017, 5:50 PM.

Details

Summary

Whole Wavefront Wode (WWM) is similar to WQM, except that all of the
lanes are always enabled, regardless of control flow. This is required
for implementing wavefront reductions in non-uniform control flow, where
we need to use the inactive lanes to propagate intermediate results, so
they need to be enabled. We need to propagate WWM to uses (unless
they're explicitly marked as exact) so that they also propagate
intermediate results correctly. We do the analysis and exec mask munging
during the WQM pass, since there are interactions with WQM for things
that require both WQM and WWM. For simplicity, WWM is entirely
block-local -- blocks are never WWM on entry or exit of a block, and WWM
is not propagated to the block level. This means that computations
involving WWM cannot involve control flow, but we only ever plan to use
WWM for a few limited purposes (none of which involve control flow)
anyways.

Shaders can ask for WWM using the @llvm.amdgcn.wwm intrinsic. There
isn't yet a way to turn WWM off -- that will be added in a future
change.

Finally, it turns out that turning on inactive lanes causes a number of
problems with register allocation. While the best long-term solution
seems like teaching LLVM's register allocator about predication, for now
we need to add some hacks to prevent ourselves from getting into trouble
due to constraints that aren't currently expressed in LLVM. For the gory
details, see the comments at the top of SIFixWWMLiveness.cpp.

Diff Detail

Repository
rL LLVM

Event Timeline

cwabbott created this revision.Jul 17 2017, 5:50 PM
tpr edited edge metadata.Jul 19 2017, 9:25 AM

Looks good to me, but will leave to Nicolai to review.

nhaehnle edited edge metadata.Jul 21 2017, 7:28 AM

Looks mostly good, but I do have a couple of comments.

lib/Target/AMDGPU/SIFixWWMLiveness.cpp
26 ↗(On Diff #106993)

*don't

100–101 ↗(On Diff #106993)

Duplicate function call.

118 ↗(On Diff #106993)

Upper case variable names.

168 ↗(On Diff #106993)

Variable names are upper case in LLVM style.

183 ↗(On Diff #106993)

Variable names are upper case in LLVM style.

lib/Target/AMDGPU/SIWholeQuadMode.cpp
139 ↗(On Diff #106993)

LLVM style is upper-case variable names.

619–625 ↗(On Diff #106993)

This could be an S_OR_SAVEEXEC_B64.

784–785 ↗(On Diff #106993)

This should be simpler with MI->setDesc on the previous patch :)

test/CodeGen/AMDGPU/wqm.ll
127–185 ↗(On Diff #106993)

Maybe I have jetlag blindness, but those two tests seem to be the same.

Also, there should be a test where the wwm computation does use some value from the predecessor block (e.g. use %hi as the offset in the load).

cwabbott added inline comments.Jul 26 2017, 5:08 PM
test/CodeGen/AMDGPU/wqm.ll
127–185 ↗(On Diff #106993)

They're not, since the second one uses lacks the iadd after the llvm.amdgcn.wwm intrinsic. Instead, the intrinsic result is passed directly to the phi. Without the early clobber on the WWM intrinsic, we would coalesce the resulting copy with the other phi sources, and the WWM write would overwrite the inactive channels that should be 0. There's a similar thing going on with the above, but this is easier to verify in a less fragile way -- we just need to make sure that a v_mov_b32 gets emitted after the WWM computation. I'm currently jetlagged too, but I think that's why I added this test.

Good point about the additional test though.

cwabbott added inline comments.Jul 27 2017, 5:54 PM
lib/Target/AMDGPU/SIWholeQuadMode.cpp
619–625 ↗(On Diff #106993)

This is a little tricky, since S_OR_SAVEEXEC_B64 clobbers SCC while the S_MOV_B64 doesn't, so we need to be more careful about where we put it. But I didn't like how I was handling prepareInsertion() in the face of WWM anyways, so I've refactored it to make using S_OR_SAVEEXEC_B64 possible.

cwabbott updated this revision to Diff 108735.Jul 28 2017, 3:50 PM
  • Fix style issues.
  • Rebased on using setDesc() in lowerCopyInstrs()
  • Reworked logic for determining safe points to transition where SCC isn't live, to make using s_or_saveexec_b64 for entering WWM possible.
  • Added tests for WWM across multiple basic blocks and correct SCC handling. For the latter, I couldn't figure out how to make an IR test that exercised that path, so I made a MIR test for it.
nhaehnle accepted this revision.Aug 2 2017, 2:23 AM

One remaining style nitpick. LGTM with that fixed.

lib/Target/AMDGPU/SIFixWWMLiveness.cpp
183 ↗(On Diff #106993)

Gentle reminder :)

test/CodeGen/AMDGPU/wqm.ll
127–185 ↗(On Diff #106993)

Wow, that's subtle stuff. Thanks for the explanation!

This revision is now accepted and ready to land.Aug 2 2017, 2:23 AM
arsenm added inline comments.Aug 2 2017, 10:41 AM
lib/Target/AMDGPU/SIWholeQuadMode.cpp
633 ↗(On Diff #108735)

Formatting

619–625 ↗(On Diff #106993)

I think this pass should probably not be producing the save exec instructions directly. We re-form these later and this may be a problem with -O0

test/CodeGen/AMDGPU/wqm.ll
104 ↗(On Diff #108735)

An -O0 run line would be nice

arsenm added inline comments.Aug 2 2017, 10:45 AM
lib/Target/AMDGPU/SIISelLowering.cpp
3300–3301 ↗(On Diff #108735)

Shouldn't this have a chain?

test/CodeGen/AMDGPU/wqm.ll
123 ↗(On Diff #108735)

Need some additional tests with wwm using different types

arsenm added inline comments.Aug 2 2017, 10:48 AM
lib/Target/AMDGPU/SIInstructions.td
133 ↗(On Diff #108735)

I think you should keep the out name to match others, i.e. $sdst

lib/Target/AMDGPU/SIWholeQuadMode.cpp
713 ↗(On Diff #108735)

Why remove this? It should definitely exit early for compute

test/CodeGen/AMDGPU/wqm.mir
35–54 ↗(On Diff #108735)

You can delete all of this

cwabbott added inline comments.Aug 2 2017, 11:28 AM
lib/Target/AMDGPU/SIISelLowering.cpp
3300–3301 ↗(On Diff #108735)

No, because it's really just a copy instruction that happens to also enable WWM for its sources -- it doesn't have any side-effects.

lib/Target/AMDGPU/SIWholeQuadMode.cpp
713 ↗(On Diff #108735)

Because OpenGL compute shaders (and geometry shaders, and tesselation shaders etc.) can all use WWM for doing non-uniform wavefront reductions, so we at least need to scan the program for WWM instructions. ROCm will probably want it at some point too.

619–625 ↗(On Diff #106993)

The existing pass already uses S_AND_SAVEEXEC_B64, so I don't see a reason not to use it here; I was just trying to be consistent. If we want to let that pass to handle it, it should probably be a separate change. Also, because of that, any problem with -O0 is probably an already-existing bug (although I doubt this pass got much testing with -O0 before), so I don't want to block anything on that. I can try adding an -O0 line to the test though.

cwabbott added inline comments.Aug 2 2017, 6:41 PM
test/CodeGen/AMDGPU/wqm.ll
104 ↗(On Diff #108735)

I tried adding that, but one of the pre-existing functions in the test assert-fails. Something about it wanting to use scratch with -O0, but it isn't set up correctly. I'd be inclined to punt on this, since it's not related to WWM.

cwabbott updated this revision to Diff 109476.Aug 2 2017, 7:06 PM
  • fix style issues
  • add test for WWM with integers
  • remove unneccesary stuff from MIR test
cwabbott marked 2 inline comments as done.Aug 2 2017, 7:08 PM
cwabbott updated this revision to Diff 109477.Aug 2 2017, 7:09 PM
  • rename dest of EXIT_WWM to be consistent
cwabbott marked an inline comment as done.Aug 2 2017, 7:10 PM
cwabbott updated this revision to Diff 109619.Aug 3 2017, 1:06 PM

Fix one last style issue.

cwabbott updated this revision to Diff 109626.Aug 3 2017, 1:50 PM

Remove unneccesary stuff from fix-wwm-liveness.mir, based on Matt's comment for wqm.mir.

This revision was automatically updated to reflect the committed changes.