Page MenuHomePhabricator

[OpenMP] Allow nvptx sm_30 to be used as an offloading device
AbandonedPublic

Authored by grokos on Apr 27 2018, 6:02 AM.

Details

Summary

Patched by Gregory Rodgers.

It uses 64bit atomicCAS to achieve the atomicMax on sm_30. The 64 bit atomicMax is not available on sm_30.

Diff Detail

Repository
rOMP OpenMP

Event Timeline

guansong created this revision.Apr 27 2018, 6:02 AM

I don't think this code does atomic max. Actually, I don't think it does anything at all. First off, old_value and elem are variables which do not change inside the while-loop. So the first loop condition is either always true or always false. Secondly, atomicCAS returns the value that existed at address Buffer (regardless of whether or not the CAS operation succeeded) and not a boolean result. Also, old_value must be updated before each CAS attempt because some other thread may have changed it.

A correct implementation would look somewhat like this:

uint64_t old_value = *Buffer;
uint64_t expected_old_value;
do {
  expected_old_value = old_value;
  old_value = atomicCAS((unsigned long long *) Buffer,
                        (unsigned long long) expected_old_value,
                        (unsigned long long) max(expected_old_value, elem));
} while (expected_old_value != old_value);

What this code does is: check that the value at Buffer is still what I expect it to be, i.e. no other thread has changed it, and replace it with max(old_value, elem).

I don't think this code does atomic max. Actually, I don't think it does anything at all. First off, old_value and elem are variables which do not change inside the while-loop. So the first loop condition is either always true or always false. Secondly, atomicCAS returns the value that existed at address Buffer (regardless of whether or not the CAS operation succeeded) and not a boolean result. Also, old_value must be updated before each CAS attempt because some other thread may have changed it.

A correct implementation would look somewhat like this:

uint64_t old_value = *Buffer;
uint64_t expected_old_value;
do {
  expected_old_value = old_value;
  old_value = atomicCAS((unsigned long long *) Buffer,
                        (unsigned long long) expected_old_value,
                        (unsigned long long) max(expected_old_value, elem));
} while (expected_old_value != old_value);

What this code does is: check that the value at Buffer is still what I expect it to be, i.e. no other thread has changed it, and replace it with max(old_value, elem).

Fully agree with George's analysis here. The proposed implementation may be tweaked to early exit if elem < *Buffer.

Anyway, why do you want to support sm_30? Is there a particular hardware that you want to support? Why does AMD care?!?

I don't think this code does atomic max. Actually, I don't think it does anything at all. First off, old_value and elem are variables which do not change inside the while-loop. So the first loop condition is either always true or always false. Secondly, atomicCAS returns the value that existed at address Buffer (regardless of whether or not the CAS operation succeeded) and not a boolean result. Also, old_value must be updated before each CAS attempt because some other thread may have changed it.

A correct implementation would look somewhat like this:

uint64_t old_value = *Buffer;
uint64_t expected_old_value;
do {
  expected_old_value = old_value;
  old_value = atomicCAS((unsigned long long *) Buffer,
                        (unsigned long long) expected_old_value,
                        (unsigned long long) max(expected_old_value, elem));
} while (expected_old_value != old_value);

What this code does is: check that the value at Buffer is still what I expect it to be, i.e. no other thread has changed it, and replace it with max(old_value, elem).

Fully agree with George's analysis here. The proposed implementation may be tweaked to early exit if elem < *Buffer.

Anyway, why do you want to support sm_30? Is there a particular hardware that you want to support? Why does AMD care?!?

I will double check with Greg on the way atomicCAS used in the code. For the last question here. We have sm_30 machine for testing. Besides, from my perspective, I care the overall implementation of openmp, particularly the offloading part, not necessarily amdgcn code only :)

Anyway, why do you want to support sm_30? Is there a particular hardware that you want to support? Why does AMD care?!?

I will double check with Greg on the way atomicCAS used in the code. For the last question here. We have sm_30 machine for testing. Besides, from my perspective, I care the overall implementation of openmp, particularly the offloading part, not necessarily amdgcn code only :)

Okay, then the question is: Will the implementation get better if it supports 6 years old hardware with yet another code path that only few people care about and that will get less testing? From intuition I doubt that because extra code for old hardware won't get much support and attention by definition.

Anyway, why do you want to support sm_30? Is there a particular hardware that you want to support? Why does AMD care?!?

I will double check with Greg on the way atomicCAS used in the code. For the last question here. We have sm_30 machine for testing. Besides, from my perspective, I care the overall implementation of openmp, particularly the offloading part, not necessarily amdgcn code only :)

Okay, then the question is: Will the implementation get better if it supports 6 years old hardware with yet another code path that only few people care about and that will get less testing? From intuition I doubt that because extra code for old hardware won't get much support and attention by definition.

I see your point. It will always be the more people testing the better, especially for people playing with devices. For the moment, my feeling is that this is a small regression we should try to fix before we give up the device.

I agree that George's RMW proposed code is correct. This was my first attempt at an RMW code. Maybe we should implement atomicMax as a device function in architecture-specific (e.g sm_30) device library. This way the code in loop.cu can remain just a call to atomicMax. Such an implementation would need an overloaded atomicMax.

Are we still interested in this patch or should we abandon it?

grokos commandeered this revision.Aug 23 2019, 11:44 AM
grokos edited reviewers, added: guansong; removed: grokos.
grokos abandoned this revision.Aug 23 2019, 11:47 AM

As there has been no update from the original author for 1.5 years and what is posted here doesn't work anyway, I am abandoning this patch. If for any reason we need to enable sm_30 devices to be used for offloading, we can do it later on as part of redesigning the runtime by factoring out common logic.