
[AMDGPU] Add pseudo "old" and "wqm_mode" source to all DPP instructions
ClosedPublic

Authored by cwabbott on Jun 27 2017, 3:47 PM.

Details

Summary

All instructions with the DPP modifier may not write to certain lanes of
the output if bound_ctrl=1 is set or any bits in bank_mask or row_mask
aren't set, so the destination register may be both defined and modified.
The right way to handle this is to add a constraint that the destination
register is the same as one of the inputs. We could tie the destination
to the first source, but that would be too restrictive for some use-cases
where we want the destination to be some other value before the
instruction executes. Instead, add a fake "old" source and tie it to the
destination. Effectively, the "old" source defines what value unwritten
lanes will get. We'll expose this functionality to users with a new
intrinsic later.
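To make the write gating concrete, here is a simplified per-lane model (a sketch for illustration only -- `dpp_mov` and its parameters are hypothetical names, and bound_ctrl handling is elided; in this model, unwritten lanes simply keep the tied "old" value):

```python
# Simplified model of DPP write gating: lanes masked off by row_mask or
# bank_mask, or whose DPP fetch falls outside the wavefront, are not
# written, so the destination keeps the value of the tied "old" source.
# (bound_ctrl handling is elided for brevity.)

def dpp_mov(src, old, row_mask=0xF, bank_mask=0xF, dpp_ctrl=None):
    n = len(src)  # wavefront size: 64 lanes on GCN
    out = []
    for lane in range(n):
        row, bank = lane // 16, (lane % 16) // 4
        if not (row_mask >> row) & 1 or not (bank_mask >> bank) & 1:
            out.append(old[lane])  # lane gated off: keep "old" value
            continue
        src_lane = dpp_ctrl(lane) if dpp_ctrl else lane
        if 0 <= src_lane < n:
            out.append(src[src_lane])
        else:
            out.append(old[lane])  # invalid fetch: lane not written
    return out
```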

Also, we want to use DPP instructions for computing derivatives, which
means we need to set WQM for them. We also need to enable the entire
wavefront when using DPP intrinsics to implement nonuniform subgroup
reductions, since otherwise we'll get incorrect results in some cases.
To accommodate this, add a new operand to all DPP instructions which will
be interpreted by the SI WQM pass. This will be exposed with a new
intrinsic later. We'll also add support for Whole Wavefront Mode later.

I also fixed llvm.amdgcn.mov.dpp to overwrite the source and fixed up
the test. However, I could also keep the old behavior (where lanes that
aren't written are undefined) if people want it.

This change is more of an RFC, since some assembler tests are failing
and I have no idea why. Also, this seemed quite hairy, and I'm not sure
if this is the best way to hook everything up. Should I be creating
separate pseudo-instructions with these extra sources? Any guidance on
that would be appreciated.

Diff Detail

Repository
rL LLVM

Event Timeline

cwabbott created this revision.Jun 27 2017, 3:47 PM
tpr added a comment.Jun 28 2017, 5:14 AM

Hi Connor. We have also been thinking about issues around dpp, wqm and wwm inside AMD. Something we may want to see is a way in machine instructions to express that a dpp operand can be combined into an alu op, and then the write gating (bound_ctrl=1, row and bank masks) affects the result of the alu op, not the result of the dpp move. However I guess that is a more extensive change to the definition of a whole class of instructions. I'd be interested to hear your thoughts.

Being able to combine dpp move, alu op and write gating also affects how to express it at the IR intrinsic level. It seems to me that the write gating needs to be a separate intrinsic, with instruction selection spotting that it can all be combined into a single instruction.

cwabbott added a comment.EditedJun 28 2017, 10:50 AM
In D34716#793614, @tpr wrote:

Hi Connor. We have also been thinking about issues around dpp, wqm and wwm inside AMD. Something we may want to see is a way in machine instructions to express that a dpp operand can be combined into an alu op, and then the write gating (bound_ctrl=1, row and bank masks) affects the result of the alu op, not the result of the dpp move. However I guess that is a more extensive change to the definition of a whole class of instructions. I'd be interested to hear your thoughts.

Being able to combine dpp move, alu op and write gating also affects how to express it at the IR intrinsic level. It seems to me that the write gating needs to be a separate intrinsic, with instruction selection spotting that it can all be combined into a single instruction.

Yeah, I've already added such an intrinsic in D34718. You can also see my implementation of the inclusive scan kernel in Mesa here. Each round generates IR like:

%1 = call i32 @llvm.amdgcn.update.dpp(i32 0, i32 %0, <dpp_ctrl, etc.>)
%3 = add i32 %1, %2

The key part here is putting the identity of the operation as the "old" source, which works regardless of operation -- this lets it be folded to something like:

V_ADD_I32_dpp  %3, old:%1, src0:%0, src1:%2 <dpp_ctrl, etc.>

This becomes a single instruction with %3 and %1 tied to the same register. This should just be a matter of writing a few ISel patterns, although I haven't done that yet, since I've been concentrating on getting it working first.
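As a sanity check on why the identity goes in the "old" slot, here's a lane-level sketch of the scan (illustrative Python, not the Mesa code; `row_shr` row-boundary behavior is ignored and the whole wavefront is treated as one row):

```python
# Sketch of the inclusive scan: each DPP shift seeds shifted-out lanes
# with the operation's identity (0 for iadd), so the following add is a
# no-op for those lanes regardless of which operation is being scanned.

def shifted(vals, shift, identity):
    # models V_MOV_B32_dpp row_shr:<shift> with the "old" source = identity
    return [vals[i - shift] if i - shift >= 0 else identity
            for i in range(len(vals))]

def inclusive_scan_add(vals):
    n, acc = len(vals), list(vals)
    shift = 1
    while shift < n:
        sh = shifted(acc, shift, 0)             # identity of iadd is 0
        acc = [a + b for a, b in zip(acc, sh)]  # the fused V_ADD_I32_dpp step
        shift *= 2
    return acc
```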

So, one thing that's not clear to me is the semantics of how the update.dpp intrinsic is supposed to enable WQM or WWM. In your sequence of instructions, if you just put a WQM/WWM flag on the update.dpp intrinsic, how does LLVM know whether the regular ALU intrinsics in between should run in WQM/WWM or not?

Tim had an interesting proposal for that, which involved a pair of intrinsics:

llvm.amdgcn.helpervalue(src, helpervalue) --> returns src for active lanes and helpervalue for other lanes

llvm.amdgcn.wwm(src) --> returns src for active lanes and undefined/poison (my choice of words, not Tim's) for other lanes, but guarantees that the computations leading to src are executed "as-if" in WWM.

llvm.amdgcn.wqm(src) --> analogous

I'm writing "as-if", because not all computations leading up to src actually need to be in WWM: llvm.amdgcn.helpervalue can act as a "barrier" to the propagation of WWM. So if you think of the graph of WWM computations, .helpervalue acts as a source, and .wwm acts as a sink.
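A toy lane-vector model of that source/sink picture (the names and semantics here are my paraphrase of the proposal, not LLVM's actual intrinsics):

```python
# helpervalue acts as a WWM "source": inactive lanes receive a defined
# helper value. wwm acts as the "sink": the computation in between runs
# across all lanes, and only active lanes are defined afterwards.

UNDEF = object()  # stands in for undefined/poison

def helpervalue(src, helper, exec_mask):
    return [s if live else helper for s, live in zip(src, exec_mask)]

def wwm(src, exec_mask):
    return [s if live else UNDEF for s, live in zip(src, exec_mask)]
```

A reduction would then bracket its computation between the two: helpervalue seeds the inactive lanes with the operation's identity, and wwm hands the full-width result back to active lanes.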

I think this proposal goes a long way towards clarifying which operations actually need WQM/WWM. One issue that occurred to me today is that the semantics are unclear when control flow is involved. Two basic examples to think about:

v = some computation
if (cond) {
   t1 = f(v)
   r1 = wwm(t1)
} else {
   t2 = f(v)
   r2 = wwm(t2)
}

I believe the desirable semantics here are clear, though they may require some compiler work. Basically, you want the entire vector of v to be equal at the start of both blocks. This requires ensuring that no part of it gets overwritten during the first block we go through.

The much more problematic case is:

if (cond) {
  v1 = ...
} else {
  v2 = ...
}
v = wwm(phi(v1, v2))

What does v look like? Specifically, what's in the inactive lanes? Perhaps the best thing we can do is say that the active lanes come from the predecessor block they went through, and all the other lanes come from one of the two blocks, though it is undefined which one.

So, one thing that's not clear to me is the semantics of how the update.dpp intrinsic is supposed to enable WQM or WWM. In your sequence of instructions, if you just put a WQM/WWM flag on the update.dpp intrinsic, how does LLVM know whether the regular ALU intrinsics in between should run in WQM/WWM or not?

Tim had an interesting proposal for that, which involved a pair of intrinsics:

llvm.amdgcn.helpervalue(src, helpervalue) --> returns src for active lanes and helpervalue for other lanes

In D34719, I added llvm.amdgcn.set.inactive, which does exactly what you describe. I left that out of the example in my comment, but you can see it in the Mesa implementation I posted (in particular, look at ac_build_reduce(), ac_build_inclusive_scan(), and ac_build_exclusive_scan()).

llvm.amdgcn.wwm(src) --> returns src for active lanes and undefined/poison (my choice of words, not Tim's) for other lanes, but guarantees that the computations leading to src are executed "as-if" in WWM.

llvm.amdgcn.wqm(src) --> analogous

I'm writing "as-if", because not all computations leading up to src actually need to be in WWM: llvm.amdgcn.helpervalue can act as a "barrier" to the propagation of WWM. So if you think of the graph of WWM computations, .helpervalue acts as a source, and .wwm acts as a sink.

Hmm, this might be an interesting approach. I think that setting wqm_ctrl to WQM on a DPP instruction is essentially equivalent to calling llvm.amdgcn.wqm on the result and then replacing all uses with the result of llvm.amdgcn.wqm (and similarly for llvm.amdgcn.wwm). I can see how having a separate pseudo-instruction might be a little cleaner, though. And it would be nice for us to stop pretending that we can figure out what needs WWM/WQM based on the instruction itself, since that very much depends on what you're using the instruction for and what the API demands.

One thing that strikes me is that while your definition is sufficient for WWM, it isn't for WQM -- for derivatives, GL says that we actually do have to care about the values of things in helper invocations. The program has to behave as if it's always in WQM, except for loads and stores, so just assuming that helper lanes are undefined/poison isn't valid (except where loads from memory are involved). I think we can just strengthen the definition of llvm.amdgcn.wqm a little, to say that helper lanes must have the correct value as if everything was computed in WQM.

Also, with these two intrinsics, we still wouldn't be able to express that some computation must happen in exact mode. This matters for DPP instructions and store instructions with side effects. I'm not sure if we'll ever want to use DPP instructions in exact mode, but we definitely need to care about store instructions. I guess we can just keep the current logic for making sure that stores are executed in exact mode, although it certainly seems kinda hack-ish, especially if the goal is to get rid of special assumptions about instructions needing WQM/WWM/Exact.

I think this proposal goes a long way towards clarifying which operations actually need WQM/WWM. One issue that occurred to me today is that the semantics are unclear when control flow is involved. Two basic examples to think about:

v = some computation
if (cond) {
   t1 = f(v)
   r1 = wwm(t1)
} else {
   t2 = f(v)
   r2 = wwm(t2)
}

I believe the desirable semantics here are clear, though they may require some compiler work. Basically, you want the entire vector of v to be equal at the start of both blocks. This requires ensuring that no part of it gets overwritten during the first block we go through.

I think the extra edge will guarantee that that's the case already. And we certainly already do have similar problems with WQM, where you have to consider v live during the first block in case some WQM operation clobbers it.

The much more problematic case is:

if (cond) {
  v1 = ...
} else {
  v2 = ...
}
v = wwm(phi(v1, v2))

What does v look like? Specifically, what's in the inactive lanes? Perhaps the best thing we can do is say that the active lanes come from the predecessor block they went through, and all the other lanes come from one of the two blocks, though it is undefined which one.

If you take the "as-if" semantics to heart, then the inactive lanes should have the value they would have if the whole program were executed in WWM -- that is, the block they come from should depend on what the value of cond would be if we executed the entire thing in WWM. In fact, if you replace "WWM" with "WQM" everywhere, then GL already mandates this behavior, and we implement it in the existing WQM pass. I chose not to implement it in WWM, since we're only ever generating WWM things ourselves with a matching llvm.amdgcn.set.inactive that tightly contains the "WWM-ness", and I doubt we'll ever need to care about these types of examples.

One thing that strikes me is that while your definition is sufficient for WWM, it isn't for WQM -- for derivatives, GL says that we actually do have to care about the values of things in helper invocations. The program has to behave as if it's always in WQM, except for loads and stores, so just assuming that helper lanes are undefined/poison isn't valid (except where loads from memory are involved). I think we can just strengthen the definition of llvm.amdgcn.wqm a little, to say that helper lanes must have the correct value as if everything was computed in WQM.

Actually, scratch that -- after thinking about it some more, the "as-if" semantics should be enough for GL. You can't really observe the effects of helper invocations except for when you take a derivative, so as long as you stick a llvm.amdgcn.wqm intrinsic after every derivative, you should be fine.
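For reference, the reason helper lanes must hold real values at derivatives can be seen in a 2x2-quad model (an assumed lane layout, for illustration only):

```python
# Fine-grained dFdx over one 2x2 pixel quad, laid out [v00, v10, v01, v11]:
# each pixel's derivative subtracts its horizontal neighbor, so even a
# helper (inactive) lane's value feeds an active pixel's result.

def dfdx_fine(quad):
    d_top = quad[1] - quad[0]  # shared by both top pixels
    d_bot = quad[3] - quad[2]  # shared by both bottom pixels
    return [d_top, d_top, d_bot, d_bot]
```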

cwabbott updated this revision to Diff 106995.Jul 17 2017, 5:54 PM

Remove wqm_mode in favor of llvm.amdgcn.wqm and llvm.amdgcn.wwm intrinsics.

nhaehnle edited edge metadata.Jul 20 2017, 2:58 PM

One question; apart from that, this looks good.

lib/Target/AMDGPU/VOP2Instructions.td
278 ↗(On Diff #106995)

Is the wqm_ctrl still correct?

cwabbott added inline comments.Jul 26 2017, 1:26 PM
lib/Target/AMDGPU/VOP2Instructions.td
278 ↗(On Diff #106995)

No, good catch!

cwabbott updated this revision to Diff 108738.Jul 28 2017, 3:54 PM

Remove leftover $wqm_ctrl.

cwabbott marked 2 inline comments as done.Jul 28 2017, 3:55 PM
nhaehnle accepted this revision.Aug 2 2017, 2:26 AM

LGTM

This revision is now accepted and ready to land.Aug 2 2017, 2:26 AM
cwabbott updated this revision to Diff 109666.Aug 3 2017, 5:48 PM

Fix assembling DPP instructions. Also, adopt a more conservative version of
D34715. In particular, we ignore Constraints/DisableEncoding from the original
instruction for the DPP version. The only instruction with any special
constraints is MAC, because of its fake third source, and there it doesn't make
sense to keep the fake third source since it has to be the same as the normal
"old" source anyways. We can revisit this if something else comes up, but I
think this is a good plan for now.

Some comments on what I noticed. Probably best if somebody else has a look as well.

lib/Target/AMDGPU/VOP1Instructions.td
84–85 ↗(On Diff #109666)

Duplicate setting of Constraints/DisableEncoding.

lib/Target/AMDGPU/VOP2Instructions.td
105–106 ↗(On Diff #109666)

Duplicate setting of Constraints/DisableEncoding.

lib/Target/AMDGPU/VOPInstructions.td
451–452 ↗(On Diff #109666)

Both here and in VOP_SDWA_Real, it looks like there are duplicated pairs of let Constraints/DisableEncoding lines. One of those pairs should be removed.

cwabbott added inline comments.Aug 4 2017, 8:48 AM
lib/Target/AMDGPU/VOPInstructions.td
451–452 ↗(On Diff #109666)

These duplicates already exist in master now, and removing them would be unrelated to the current change, so I thought I'd keep them for now -- I can make a separate change that removes them.

cwabbott updated this revision to Diff 109779.Aug 4 2017, 10:54 AM

Remove spurious change to AMDGPUAsmParser.cpp

It looks like Sam has worked a lot on the assembler, including adding support for DPP instructions, so I'm adding him for the assembler bits. I'd like to get this in before I leave next week, though.

SamWot accepted this revision.Aug 5 2017, 12:57 AM
This revision was automatically updated to reflect the committed changes.