This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap
ClosedPublic

Authored by mareko on Oct 5 2018, 1:34 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

mareko created this revision.Oct 5 2018, 1:34 PM
nhaehnle added inline comments.Oct 8 2018, 7:55 AM
include/llvm/IR/IntrinsicsAMDGPU.td
397–399 ↗(On Diff #168522)

I don't like having this as a pointer. It really isn't, so we're really kidding ourselves here.

This could just be an i32. Alternatively, and honestly preferably, this should be just the pointer to the honest-to-goodness address, and the waveID a separate operand. It should be possible to add the DAG nodes to build the M0 value in LowerINTRINSIC_W_CHAIN. Unless the DAG combine is unable to optimize that? (Or possibly even then, and we should fix the DAG combines...)

401–403 ↗(On Diff #168522)

The ordering / scope / isVolatile is a bit weird, but it's weird in the same way as the existing intrinsics, so fine by me.

lib/Target/AMDGPU/SIISelLowering.cpp
5175–5176 ↗(On Diff #168522)

Use report_fatal_error (to support the non-Mesa compiler case).

5193 ↗(On Diff #168522)

Same here.

mareko added inline comments.Oct 8 2018, 8:14 AM
include/llvm/IR/IntrinsicsAMDGPU.td
397–399 ↗(On Diff #168522)

That was what the previous version did and it produced worse code. It selected correct instructions, but we want CSE of the m0 expression across basic blocks and simplifications based on the numbers of known zero bits in the inputs. Doing it in LowerINTRINSIC_W_CHAIN is too late for those.

nhaehnle added inline comments.Oct 8 2018, 9:26 AM
include/llvm/IR/IntrinsicsAMDGPU.td
397–399 ↗(On Diff #168522)

I see, that sucks. We really need better CSE at the MachineInstr level.

Well, the problem is that pretending that this thing is a pointer is pretty far from correct. Since it makes for a cleaner "API", I'd personally prefer to have the real pointer here, and the waveID as a separate argument, even if it means a few additional instructions.

mareko added inline comments.Oct 8 2018, 2:56 PM
include/llvm/IR/IntrinsicsAMDGPU.td
397–399 ↗(On Diff #168522)

I wouldn't like more instructions, because I may be instruction-bound. The pointer is made by inttoptr anyway, currently even inttoptr((NULL << 16) | waveID) = inttoptr(waveID). inttoptr is most likely to be used in all cases, because GDS ordered count offsets are constant and have only 14 bits that can change. i32 for m0 would be better.

mareko added inline comments.Oct 8 2018, 2:59 PM
include/llvm/IR/IntrinsicsAMDGPU.td
397–399 ↗(On Diff #168522)

One optimization for drivers is to pass a non-zero GDS offset in a user SGPR in the high 16bits, so that the shader doesn't have to shift. Then it's just OR'd with WaveID. If the offset is 0, there is no offset to OR.

arsenm added inline comments.Oct 9 2018, 7:18 PM
include/llvm/IR/IntrinsicsAMDGPU.td
397–399 ↗(On Diff #168522)

I've made a few attempts to improve CSE involving M0 so I think it's possible to fix this.

This shouldn't be a fake pointer

arsenm added a comment.Oct 9 2018, 7:19 PM

Does this need to be marked as isSourceOfDivergence?

Does this need to be marked as isSourceOfDivergence?

Now that you mention it, yes, even though it's for stupid reasons: I believe the ds_ordered_count instruction executes only in a single lane, so it's intuitively a uniform operation; however, it returns its result only in lane 0, so it's formally non-uniform.

Does this need to be marked as isSourceOfDivergence?

Now that you mention it, yes, even though it's for stupid reasons: I believe the ds_ordered_count instruction executes only in a single lane, so it's intuitively a uniform operation; however, it returns its result only in lane 0, so it's formally non-uniform.

ds_ordered_count hangs if more than 1 lane is active.

Would everybody tolerate "i32 m0" in the intrinsic?

I might change the intrinsic to add the option to insert "s_and_saveexec s[N:M], 1" and "s_mov_b64 exec, s[N:M]" around the intrinsic to get an optimal single-lane block.

If I make m0 integer, DS_ORDERED_COUNT won't be a mem node.

Any comments?

mareko updated this revision to Diff 175389.Nov 26 2018, 8:39 PM

Use report_fatal_error, add SourceOfDivergence lines

FYI, if there is no feedback on January 15, I'll commit this to get it into LLVM 8.0.

This revision was not accepted when it landed; it landed in state Needs Review.Jan 16 2019, 7:47 AM
This revision was automatically updated to reflect the committed changes.