So it appears that to guarantee some of the ordering requirements of a GLSL
memoryBarrier() executed in the shader, we need to emit an s_waitcnt.
(We can't use an s_barrier, because memoryBarrier() may appear anywhere in
the shader, in particular it may appear in non-uniform control flow.)
I think the intrinsic should be int_amdgpu_s_waitcnt, and we should expose the configuration bits with an input argument.