This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
libomptarget/deviceRTLs/nvptx/src/
-
deviceRTLs/
-
nvptx/
-
src/
-
loop.cu

Differential D46185

[OpenMP] Allow nvptx sm_30 to be used as an offloading device
AbandonedPublic

Authored by grokos on Apr 27 2018, 6:02 AM.

Download Raw Diff

Details

Reviewers

gregrodgers
guansong
jdoerfert

Summary

Patched by Gregory Rodgers.

It uses 64bit atomicCAS to achieve the atomicMax on sm_30. The 64 bit atomicMax is not available on sm_30.

Diff Detail

Repository: rOMP OpenMP

Event Timeline

guansong created this revision.Apr 27 2018, 6:02 AM

guansong added a reviewer: gregrodgers.Apr 27 2018, 6:07 AM

I don't think this code does atomic max. Actually, I don't think it does anything at all. First off, old_value and elem are variables which do not change inside the while-loop. So the first loop condition is either always true or always false. Secondly, atomicCAS returns the value that existed at address Buffer (regardless of whether or not the CAS operation succeeded) and not a boolean result. Also, old_value must be updated before each CAS attempt because some other thread may have changed it.

A correct implementation would look somewhat like this:

uint64_t old_value = *Buffer;
uint64_t expected_old_value;
do {
  expected_old_value = old_value;
  old_value = atomicCAS((unsigned long long *) Buffer,
                        (unsigned long long) expected_old_value,
                        (unsigned long long) max(expected_old_value, elem));
} while (expected_old_value != old_value);

What this code does is: check that the value at Buffer is still what I expect it to be, i.e. no other thread has changed it, and replace it with max(old_value, elem).

In D46185#1081111, @grokos wrote:
I don't think this code does atomic max. Actually, I don't think it does anything at all. First off, old_value and elem are variables which do not change inside the while-loop. So the first loop condition is either always true or always false. Secondly, atomicCAS returns the value that existed at address Buffer (regardless of whether or not the CAS operation succeeded) and not a boolean result. Also, old_value must be updated before each CAS attempt because some other thread may have changed it.

A correct implementation would look somewhat like this:
uint64_t old_value = *Buffer;
uint64_t expected_old_value;
do {
  expected_old_value = old_value;
  old_value = atomicCAS((unsigned long long *) Buffer,
                        (unsigned long long) expected_old_value,
                        (unsigned long long) max(expected_old_value, elem));
} while (expected_old_value != old_value);
What this code does is: check that the value at Buffer is still what I expect it to be, i.e. no other thread has changed it, and replace it with max(old_value, elem).

Fully agree with George's analysis here. The proposed implementation may be tweaked to early exit if elem < *Buffer.

Anyway, why do you want to support sm_30? Is there a particular hardware that you want to support? Why does AMD care?!?

In D46185#1081124, @Hahnfeld wrote:
In D46185#1081111, @grokos wrote:
I don't think this code does atomic max. Actually, I don't think it does anything at all. First off, old_value and elem are variables which do not change inside the while-loop. So the first loop condition is either always true or always false. Secondly, atomicCAS returns the value that existed at address Buffer (regardless of whether or not the CAS operation succeeded) and not a boolean result. Also, old_value must be updated before each CAS attempt because some other thread may have changed it.

A correct implementation would look somewhat like this:
uint64_t old_value = *Buffer;
uint64_t expected_old_value;
do {
  expected_old_value = old_value;
  old_value = atomicCAS((unsigned long long *) Buffer,
                        (unsigned long long) expected_old_value,
                        (unsigned long long) max(expected_old_value, elem));
} while (expected_old_value != old_value);
What this code does is: check that the value at Buffer is still what I expect it to be, i.e. no other thread has changed it, and replace it with max(old_value, elem).
Fully agree with George's analysis here. The proposed implementation may be tweaked to early exit if elem < *Buffer.

Anyway, why do you want to support sm_30? Is there a particular hardware that you want to support? Why does AMD care?!?

I will double check with Greg on the way atomicCAS used in the code. For the last question here. We have sm_30 machine for testing. Besides, from my perspective, I care the overall implementation of openmp, particularly the offloading part, not necessarily amdgcn code only :)

In D46185#1081362, @guansong wrote:

In D46185#1081124, @Hahnfeld wrote:

Anyway, why do you want to support sm_30? Is there a particular hardware that you want to support? Why does AMD care?!?

I will double check with Greg on the way atomicCAS used in the code. For the last question here. We have sm_30 machine for testing. Besides, from my perspective, I care the overall implementation of openmp, particularly the offloading part, not necessarily amdgcn code only :)

Okay, then the question is: Will the implementation get better if it supports 6 years old hardware with yet another code path that only few people care about and that will get less testing? From intuition I doubt that because extra code for old hardware won't get much support and attention by definition.

In D46185#1081368, @Hahnfeld wrote:

In D46185#1081362, @guansong wrote:

In D46185#1081124, @Hahnfeld wrote:

Anyway, why do you want to support sm_30? Is there a particular hardware that you want to support? Why does AMD care?!?

I will double check with Greg on the way atomicCAS used in the code. For the last question here. We have sm_30 machine for testing. Besides, from my perspective, I care the overall implementation of openmp, particularly the offloading part, not necessarily amdgcn code only :)

Okay, then the question is: Will the implementation get better if it supports 6 years old hardware with yet another code path that only few people care about and that will get less testing? From intuition I doubt that because extra code for old hardware won't get much support and attention by definition.

I see your point. It will always be the more people testing the better, especially for people playing with devices. For the moment, my feeling is that this is a small regression we should try to fix before we give up the device.

I agree that George's RMW proposed code is correct. This was my first attempt at an RMW code. Maybe we should implement atomicMax as a device function in architecture-specific (e.g sm_30) device library. This way the code in loop.cu can remain just a call to atomicMax. Such an implementation would need an overloaded atomicMax.

Are we still interested in this patch or should we abandon it?

Herald added a subscriber: jfb. · View Herald TranscriptOct 22 2018, 6:27 AM

grokos commandeered this revision.Aug 23 2019, 11:44 AM

grokos edited reviewers, added: guansong; removed: grokos.

Herald added a reviewer: jdoerfert. · View Herald TranscriptAug 23 2019, 11:44 AM

As there has been no update from the original author for 1.5 years and what is posted here doesn't work anyway, I am abandoning this patch. If for any reason we need to enable sm_30 devices to be used for offloading, we can do it later on as part of redesigning the runtime by factoring out common logic.

Revision Contents

Path

Size

libomptarget/

deviceRTLs/

nvptx/

src/

loop.cu

8 lines

Diff 144320

libomptarget/deviceRTLs/nvptx/src/loop.cu

Show First 20 Lines • Show All 751 Lines • ▼ Show 20 Lines	if (tid == 0)
*Buffer = 0; // Reset to minimum loop iteration value.		*Buffer = 0; // Reset to minimum loop iteration value.

// Barrier.		// Barrier.
syncWorkersInGenericMode(NumThreads);		syncWorkersInGenericMode(NumThreads);

// Atomic max of iterations.		// Atomic max of iterations.
uint64_t varArray = (uint64_t )array;		uint64_t varArray = (uint64_t )array;
uint64_t elem = varArray[i];		uint64_t elem = varArray[i];
		#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 350
(void)atomicMax((unsigned long long int *)Buffer,		(void)atomicMax((unsigned long long int *)Buffer,
(unsigned long long int)elem);		(unsigned long long int)elem);
		#else
		uint64_t old_value = *Buffer;
		while (old_value < elem && !atomicCAS((unsigned long long *)Buffer,
		(unsigned long long)old_value,
		(unsigned long long)elem)) {
		};
		#endif

// Barrier.		// Barrier.
syncWorkersInGenericMode(NumThreads);		syncWorkersInGenericMode(NumThreads);

// Read max value and update thread private array.		// Read max value and update thread private array.
varArray[i] = *Buffer;		varArray[i] = *Buffer;

// Barrier.		// Barrier.
syncWorkersInGenericMode(NumThreads);		syncWorkersInGenericMode(NumThreads);
}		}
}		}