This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add support for Whole Wavefront Mode
ClosedPublic

Authored by cwabbott on Jul 17 2017, 5:50 PM.

Details

Summary

Whole Wavefront Wode (WWM) is similar to WQM, except that all of the
lanes are always enabled, regardless of control flow. This is required
for implementing wavefront reductions in non-uniform control flow, where
we need to use the inactive lanes to propagate intermediate results, so
they need to be enabled. We need to propagate WWM to uses (unless
they're explicitly marked as exact) so that they also propagate
intermediate results correctly. We do the analysis and exec mask munging
during the WQM pass, since there are interactions with WQM for things
that require both WQM and WWM. For simplicity, WWM is entirely
block-local -- blocks are never WWM on entry or exit of a block, and WWM
is not propagated to the block level. This means that computations
involving WWM cannot involve control flow, but we only ever plan to use
WWM for a few limited purposes (none of which involve control flow)
anyways.

Shaders can ask for WWM using the @llvm.amdgcn.wwm intrinsic. There
isn't yet a way to turn WWM off -- that will be added in a future
change.

Finally, it turns out that turning on inactive lanes causes a number of
problems with register allocation. While the best long-term solution
seems like teaching LLVM's register allocator about predication, for now
we need to add some hacks to prevent ourselves from getting into trouble
due to constraints that aren't currently expressed in LLVM. For the gory
details, see the comments at the top of SIFixWWMLiveness.cpp.

Event Timeline

cwabbott created this revision.Jul 17 2017, 5:50 PM
tpr edited edge metadata.Jul 19 2017, 9:25 AM

Looks good to me, but will leave to Nicolai to review.

nhaehnle edited edge metadata.Jul 21 2017, 7:28 AM

Looks mostly good, but I do have a couple of comments.

lib/Target/AMDGPU/SIFixWWMLiveness.cpp
27

*don't

101–102

Duplicate function call.

119

Upper case variable names.

169

Variable names are upper case in LLVM style.

184

Variable names are upper case in LLVM style.

lib/Target/AMDGPU/SIWholeQuadMode.cpp
146

LLVM style is upper-case variable names.

623–629

This could be an S_OR_SAVEEXEC_B64.

802–803

This should be simpler with MI->setDesc on the previous patch :)

test/CodeGen/AMDGPU/wqm.ll
127–185

Maybe I have jetlag blindness, but those two tests seem to be the same.

Also, there should be a test where the wwm computation does use some value from the predecessor block (e.g. use %hi as the offset in the load).

cwabbott added inline comments.Jul 26 2017, 5:08 PM
test/CodeGen/AMDGPU/wqm.ll
127–185

They're not, since the second one uses lacks the iadd after the llvm.amdgcn.wwm intrinsic. Instead, the intrinsic result is passed directly to the phi. Without the early clobber on the WWM intrinsic, we would coalesce the resulting copy with the other phi sources, and the WWM write would overwrite the inactive channels that should be 0. There's a similar thing going on with the above, but this is easier to verify in a less fragile way -- we just need to make sure that a v_mov_b32 gets emitted after the WWM computation. I'm currently jetlagged too, but I think that's why I added this test.

Good point about the additional test though.

cwabbott added inline comments.Jul 27 2017, 5:54 PM
lib/Target/AMDGPU/SIWholeQuadMode.cpp
623–629

This is a little tricky, since S_OR_SAVEEXEC_B64 clobbers SCC while the S_MOV_B64 doesn't, so we need to be more careful about where we put it. But I didn't like how I was handling prepareInsertion() in the face of WWM anyways, so I've refactored it to make using S_OR_SAVEEXEC_B64 possible.

cwabbott updated this revision to Diff 108735.Jul 28 2017, 3:50 PM
  • Fix style issues.
  • Rebased on using setDesc() in lowerCopyInstrs()
  • Reworked logic for determining safe points to transition where SCC isn't live, to make using s_or_saveexec_b64 for entering WWM possible.
  • Added tests for WWM across multiple basic blocks and correct SCC handling. For the latter, I couldn't figure out how to make an IR test that exercised that path, so I made a MIR test for it.
nhaehnle accepted this revision.Aug 2 2017, 2:23 AM

One remaining style nitpick. LGTM with that fixed.

lib/Target/AMDGPU/SIFixWWMLiveness.cpp
184

Gentle reminder :)

test/CodeGen/AMDGPU/wqm.ll
127–185

Wow, that's subtle stuff. Thanks for the explanation!

This revision is now accepted and ready to land.Aug 2 2017, 2:23 AM
arsenm added inline comments.Aug 2 2017, 10:41 AM
lib/Target/AMDGPU/SIWholeQuadMode.cpp
623–629

I think this pass should probably not be producing the save exec instructions directly. We re-form these later and this may be a problem with -O0

633

Formatting

test/CodeGen/AMDGPU/wqm.ll
104

An -O0 run line would be nice

arsenm added inline comments.Aug 2 2017, 10:45 AM
lib/Target/AMDGPU/SIISelLowering.cpp
3300–3301

Shouldn't this have a chain?

test/CodeGen/AMDGPU/wqm.ll
123

Need some additional tests with wwm using different types

arsenm added inline comments.Aug 2 2017, 10:48 AM
lib/Target/AMDGPU/SIInstructions.td
133

I think you should keep the out name to match others, i.e. $sdst

lib/Target/AMDGPU/SIWholeQuadMode.cpp
713

Why remove this? It should definitely exit early for compute

test/CodeGen/AMDGPU/wqm.mir
36–55

You can delete all of this

cwabbott added inline comments.Aug 2 2017, 11:28 AM
lib/Target/AMDGPU/SIISelLowering.cpp
3300–3301

No, because it's really just a copy instruction that happens to also enable WWM for its sources -- it doesn't have any side-effects.

lib/Target/AMDGPU/SIWholeQuadMode.cpp
623–629

The existing pass already uses S_AND_SAVEEXEC_B64, so I don't see a reason not to use it here; I was just trying to be consistent. If we want to let that pass to handle it, it should probably be a separate change. Also, because of that, any problem with -O0 is probably an already-existing bug (although I doubt this pass got much testing with -O0 before), so I don't want to block anything on that. I can try adding an -O0 line to the test though.

713

Because OpenGL compute shaders (and geometry shaders, and tesselation shaders etc.) can all use WWM for doing non-uniform wavefront reductions, so we at least need to scan the program for WWM instructions. ROCm will probably want it at some point too.

cwabbott added inline comments.Aug 2 2017, 6:41 PM
test/CodeGen/AMDGPU/wqm.ll
104

I tried adding that, but one of the pre-existing functions in the test assert-fails. Something about it wanting to use scratch with -O0, but it isn't set up correctly. I'd be inclined to punt on this, since it's not related to WWM.

cwabbott updated this revision to Diff 109476.Aug 2 2017, 7:06 PM
  • fix style issues
  • add test for WWM with integers
  • remove unneccesary stuff from MIR test
cwabbott marked 2 inline comments as done.Aug 2 2017, 7:08 PM
cwabbott updated this revision to Diff 109477.Aug 2 2017, 7:09 PM
  • rename dest of EXIT_WWM to be consistent
cwabbott marked an inline comment as done.Aug 2 2017, 7:10 PM
cwabbott updated this revision to Diff 109619.Aug 3 2017, 1:06 PM

Fix one last style issue.

cwabbott updated this revision to Diff 109626.Aug 3 2017, 1:50 PM

Remove unneccesary stuff from fix-wwm-liveness.mir, based on Matt's comment for wqm.mir.

This revision was automatically updated to reflect the committed changes.