This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SI: Add llvm.amdgcn.s.waitcnt.all intrinsic
ClosedPublic

Authored by nhaehnle on Apr 17 2016, 3:20 PM.

Details

Summary

So it appears that to guarantee some of the ordering requirements of a GLSL
memoryBarrier() executed in the shader, we need to emit an s_waitcnt.

(We can't use an s_barrier, because memoryBarrier() may appear anywhere in
the shader, in particular it may appear in non-uniform control flow.)

Diff Detail

Repository
rL LLVM

Event Timeline

nhaehnle updated this revision to Diff 54014.Apr 17 2016, 3:20 PM
nhaehnle retitled this revision from to AMDGPU/SI: Add llvm.amdgcn.s.waitcnt.all intrinsic.
nhaehnle updated this object.
nhaehnle added a subscriber: llvm-commits.
arsenm edited edge metadata.Apr 17 2016, 3:28 PM

What are the memory barrier requirements for GLSL? Do you need consistency between multiple workgroups?

include/llvm/IR/IntrinsicsAMDGPU.td
71 ↗(On Diff #54014)

I think the intrinsic should be int_amdgpu_s_waitcnt, and we should expose the configuration bits with an input argument.

arsenm added inline comments.Apr 18 2016, 8:10 AM
include/llvm/IR/IntrinsicsAMDGPU.td
71 ↗(On Diff #54014)

I'm not sure waitcnt should be directly exposed, and it should probably be a memfence intrinsic. However, I'm not clear if what is really wanted here is the cache flush intrinsics like is necessary for OpenCL 2.0

This also doesn't need to be convergent

Yes, we need consistency between all shader invocations, which can span all the CUs and SEs on the chip. There isn't really a notion of workgroups for GLSL graphics shaders. Basically, the instruction needs to make sure that all past memory writes by the shader (actually, only 'coherent' and 'volatile' ones) are visible to all other shaders. I'm not sure about what OpenCL needs.

With this patch, the idea is to implement this by setting glc=1 on the coherent/volatile writes and using a wait. I believe (but have not tried) that an alternative would be to always use glc=0 and wait + explicitly request an L1 cache flush at the memory barrier.

Tom, do you want the numeric counts as input, or just bits that indicate whether to wait for vm/exp/lgkm?

Tom, do you want the numeric counts as input, or just bits that indicate whether to wait for vm/exp/lgkm?

Just the bits. So, the input to the intrinsic and the instruction are the same.

nhaehnle updated this revision to Diff 54311.Apr 19 2016, 6:42 PM
nhaehnle edited edge metadata.

Changed the intrinsic to take a single argument.

Also changed the SIInsertWaits logic so that the intrinsic will be delayed
up to the next counter-incrementing instruction or the next "natural" wait
(and then merged with it).

tstellarAMD accepted this revision.Apr 25 2016, 2:03 PM
tstellarAMD edited edge metadata.

LGTM.

This revision is now accepted and ready to land.Apr 25 2016, 2:03 PM
This revision was automatically updated to reflect the committed changes.