According to PTX ISA .volatile has the same memory synchronization
semantics as .relaxed.sys, so it can be used to implement monotonic
atomic loads and stores. This is important for OpenMP's atomic
construct where
- 'read's and 'write's are lowered to atomic loads and stores, and
- an update of float or double types are lowered into a cmpxchg loop.
(Note that PTX could do better because it has atom.add.f{32,64} but
LLVM's atomicrmw instruction only allows integer types.)
Higher levels of atomicity (like acquire and release) need additional
synchronization properties which were added with PTX ISA 6.0 / sm_70.
So using these instructions still results in an error.
I'd be more explicit -- in order to lower atomic loads with stronger guarantees we would need to have to use .release/.acquire which are only available in ....
Same for the tryStore() below.
Maybe add TODO to check if we *are* compiling for sm_70 and use ld/st with .release/.acquire qualifiers then.