Patched by Gregory Rodgers.
It uses 64bit atomicCAS to achieve the atomicMax on sm_30. The 64 bit atomicMax is not available on sm_30.
Differential D46185
[OpenMP] Allow nvptx sm_30 to be used as an offloading device grokos on Apr 27 2018, 6:02 AM. Authored by
Details
Patched by Gregory Rodgers. It uses 64bit atomicCAS to achieve the atomicMax on sm_30. The 64 bit atomicMax is not available on sm_30.
Diff Detail
Event TimelineComment Actions I don't think this code does atomic max. Actually, I don't think it does anything at all. First off, old_value and elem are variables which do not change inside the while-loop. So the first loop condition is either always true or always false. Secondly, atomicCAS returns the value that existed at address Buffer (regardless of whether or not the CAS operation succeeded) and not a boolean result. Also, old_value must be updated before each CAS attempt because some other thread may have changed it. A correct implementation would look somewhat like this: uint64_t old_value = *Buffer; uint64_t expected_old_value; do { expected_old_value = old_value; old_value = atomicCAS((unsigned long long *) Buffer, (unsigned long long) expected_old_value, (unsigned long long) max(expected_old_value, elem)); } while (expected_old_value != old_value); What this code does is: check that the value at Buffer is still what I expect it to be, i.e. no other thread has changed it, and replace it with max(old_value, elem). Comment Actions Fully agree with George's analysis here. The proposed implementation may be tweaked to early exit if elem < *Buffer. Anyway, why do you want to support sm_30? Is there a particular hardware that you want to support? Why does AMD care?!? Comment Actions I will double check with Greg on the way atomicCAS used in the code. For the last question here. We have sm_30 machine for testing. Besides, from my perspective, I care the overall implementation of openmp, particularly the offloading part, not necessarily amdgcn code only :) Comment Actions Okay, then the question is: Will the implementation get better if it supports 6 years old hardware with yet another code path that only few people care about and that will get less testing? From intuition I doubt that because extra code for old hardware won't get much support and attention by definition. Comment Actions I see your point. It will always be the more people testing the better, especially for people playing with devices. For the moment, my feeling is that this is a small regression we should try to fix before we give up the device. Comment Actions I agree that George's RMW proposed code is correct. This was my first attempt at an RMW code. Maybe we should implement atomicMax as a device function in architecture-specific (e.g sm_30) device library. This way the code in loop.cu can remain just a call to atomicMax. Such an implementation would need an overloaded atomicMax. Comment Actions As there has been no update from the original author for 1.5 years and what is posted here doesn't work anyway, I am abandoning this patch. If for any reason we need to enable sm_30 devices to be used for offloading, we can do it later on as part of redesigning the runtime by factoring out common logic. |