This function mimics the std::atomic_thread_fence function from
<atomic>. It currently has no uses in the source tree, but it will be
used by the proposed RPC client for GPU mode support. Support for
applying memory orderings directly to GPU atomics on shared memory
resources varies, so the implementation will use relaxed atomics and
explicit memory fences.
Some additional work may be needed to map this to NVPTX system-level
fences.
Do we need a guard to check if the builtin __atomic_thread_fence is available?
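
If a guard does turn out to be needed, one possible shape is sketched
below. This is an assumption about how it could look, not something the
patch does. The name guarded_thread_fence is hypothetical, and the
fallback to the legacy __sync_synchronize builtin (a full barrier) is
only one conservative option.

```cpp
// Compilers without __has_builtin treat every query as "not available".
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

// Hypothetical guarded fence. `order` is one of the __ATOMIC_* constants
// predefined by Clang and GCC, e.g. __ATOMIC_ACQUIRE or __ATOMIC_SEQ_CST.
inline void guarded_thread_fence(int order) {
#if __has_builtin(__atomic_thread_fence)
  __atomic_thread_fence(order);
#else
  // Assumed fallback: ignore the requested ordering and issue a full
  // barrier, which is at least as strong as any ordering asked for.
  (void)order;
  __sync_synchronize();
#endif
}
```
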