It turns out that `__builtin_amdgcn_s_barrier()` alone does not emit a
fence. We somehow got away with this, having assumed it would work because
the NVIDIA path, where we just emit a `__syncthreads`, is (hopefully)
correct. After talking to @arsenm we now (mostly) align with the OpenCL
barrier implementation [1] and emit explicit fences for AMDGPUs.
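
For illustration, the fence/barrier/fence pattern described above might look roughly like the sketch below. The helper name `workgroupBarrierWithFences` is hypothetical and is not the runtime's actual function. The release/acquire ordering and the "workgroup" scope are assumptions based on the usual shape of such a barrier, not a copy of the OpenCL implementation. The host branch exists only so the sketch compiles off-device; it performs no cross-thread synchronization.

```cpp
#include <atomic>

// Hypothetical helper, for illustration only (not the runtime's real API).
inline void workgroupBarrierWithFences() {
#if defined(__AMDGCN__)
  // Release fence: publish this thread's earlier writes to the workgroup.
  __builtin_amdgcn_fence(__ATOMIC_RELEASE, "workgroup");
  // Execution barrier only; on its own it does not order memory operations.
  __builtin_amdgcn_s_barrier();
  // Acquire fence: observe what the other threads published before the barrier.
  __builtin_amdgcn_fence(__ATOMIC_ACQUIRE, "workgroup");
#else
  // Host fallback (assumption): fences only, there is no barrier to wait on.
  std::atomic_thread_fence(std::memory_order_release);
  std::atomic_thread_fence(std::memory_order_acquire);
#endif
}
```

On the NVIDIA path no such wrapper is needed under the assumption stated above, since only a `__syncthreads` is emitted there.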

This seems to have been the underlying cause of #59759, but I am not 100%
certain. There is a chance it simply hides the problem.