The parallel level counter is accessed simultaneously by many threads. Combined with the atomic operations used in full-runtime SPMD mode (triggered, for example, by dynamic scheduling), this may lead to incorrect results caused by compiler optimizations. According to the CUDA Toolkit documentation (https://docs.nvidia.com/cuda/archive/8.0/cuda-c-programming-guide/index.html#volatile-qualifier):
The compiler is free to optimize reads and writes to global or shared memory (for example, by caching global reads into registers or L1 cache) as long as it respects the memory ordering semantics of memory fence functions and memory visibility semantics of synchronization functions.
These optimizations can be disabled using the volatile keyword: If a variable located in global or shared memory is declared as volatile, the compiler assumes that its value can be changed or used at any time by another thread and therefore any reference to this variable compiles to an actual memory read or write instruction.
This is especially important when thread divergence is mixed with atomic operations. In our case, the missing qualifier may lead to undefined behavior or thread deadlock, especially on CUDA 9 and later. The change is required for SPMD mode with the full runtime, and for SPMD mode without the full runtime when atomic operations are used.
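As an illustration, a minimal CUDA sketch of the pattern described above (identifiers such as `parallelLevel`, `incParallelLevel`, and `getParallelLevel` are hypothetical, not the runtime's actual names): without `volatile`, the compiler may keep the counter cached in a register across a divergent region, so a thread never observes the update made by another thread and can spin or deadlock.

```cuda
// Hypothetical sketch of a per-team parallel level counter in shared memory.
// The volatile qualifier forces every reference to compile to an actual
// shared-memory load or store, so updates made by one (possibly diverged)
// thread are visible to the others.
__device__ __shared__ volatile unsigned parallelLevel;

__device__ void incParallelLevel() {
  // Atomic read-modify-write; the volatile qualifier is cast away because
  // atomicAdd already guarantees a real memory access.
  atomicAdd((unsigned *)&parallelLevel, 1u);
}

__device__ unsigned getParallelLevel() {
  // Plain read: volatile ensures this is a real load, not a stale register
  // value hoisted out of a loop by the compiler.
  return parallelLevel;
}
```

Note that `volatile` here only disables the caching optimization quoted above; it is not a substitute for memory fences or synchronization where ordering between threads is also required.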