This is an archive of the discontinued LLVM Phabricator instance.

[Libomptarget] Remove volatile from NVPTX work function
Accepted · Public

Authored by jhuber6 on Jul 19 2021, 2:01 PM.

Details

Summary

Currently the NVPTX work function is marked volatile. This prevents some
optimizations from using this value.
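For reference, a sketch of the kind of declaration being changed. The names
omptarget_nvptx_workFn and omptarget_nvptx_WorkFn follow the deviceRTL's naming
conventions but are reconstructed here, not copied from the patch:

// Before: the shared slot holding the work function was volatile-qualified.
extern volatile __shared__ omptarget_nvptx_WorkFn omptarget_nvptx_workFn;

// After: plain-qualified; the optimizer may now keep the loaded pointer
// in a register instead of re-reading shared memory at every use.
extern __shared__ omptarget_nvptx_WorkFn omptarget_nvptx_workFn;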

Diff Detail

Event Timeline

jhuber6 requested review of this revision. Jul 19 2021, 2:01 PM
jhuber6 created this revision.
Herald added a project: Restricted Project. Jul 19 2021, 2:01 PM
This revision is now accepted and ready to land. Jul 19 2021, 2:13 PM
This revision was landed with ongoing or failed builds. Jul 19 2021, 5:03 PM
This revision was automatically updated to reflect the committed changes.

This seems hazardous.

The value is written from one warp and then read from another. It should be atomic-qualified, but isn't, because CUDA doesn't do that.

Previously it was volatile, which happens to get treated in a broadly similar fashion to atomic.

So before it was a data race, but one that the compiler was unlikely to miscompile. Now it's a non-volatile data race. That is indeed more likely to be transformed.

I don't think we can reasonably assume this change is safe. Making it a relaxed atomic instead would be better.
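A sketch of that relaxed-atomic alternative using the Clang __atomic builtins;
the slot name, its type, and the two helper functions are placeholders, not
deviceRTL identifiers:

typedef void (*WorkFn)();

// Hypothetical shared slot, written by one warp and read by the others.
extern __shared__ WorkFn workFn;

__device__ void publishWork(WorkFn Fn) {
  // Relaxed store: no ordering on its own, but no UB data race either.
  __atomic_store_n(&workFn, Fn, __ATOMIC_RELAXED);
}

__device__ WorkFn readWork() {
  // Relaxed load: removes the UB of a racing plain load; ordering still
  // comes from the surrounding barrier.
  return __atomic_load_n(&workFn, __ATOMIC_RELAXED);
}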

JonChesterfield reopened this revision. Jul 20 2021, 4:07 AM
This revision is now accepted and ready to land. Jul 20 2021, 4:07 AM

What prevents us from just doing an atomic write to it in parallel.cu:79? Might slow it down though.

If the volatile property is a problem when calling the function, how about keeping the variable volatile, but assigning the value to a local variable (an explicit read of the volatile variable) and then calling through the local function pointer?
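A sketch of that suggestion, in the same plain-C++ style as the example later in this thread (all names are hypothetical):

typedef void (*WorkFn)();

extern volatile WorkFn workFn; // the variable itself stays volatile

void runWork() {
  WorkFn Local = workFn; // one explicit read of the volatile variable
  Local();               // the call goes through a plain local copy
}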

IIRC this is used by a server warp to tell workers what to do. So the worker needs to perform each load to learn whether it needs to do something else or stop execution.

Caching the loaded value (because it is no longer volatile) would be my guess at how this change could break things, so caching it manually would seem to have the same problem.
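For context, a hypothetical shape of that worker loop; this is an illustration, not the actual deviceRTL code:

typedef void (*WorkFn)();

extern volatile WorkFn workFn; // hypothetical: the shared slot, still volatile
void barrier();                // hypothetical: synchronizes all threads

void workerLoop() {
  for (;;) {
    barrier();          // wait for the server warp to publish work
    WorkFn Fn = workFn; // must be a fresh load on every iteration
    if (!Fn)
      break;            // a null work function means: stop execution
    Fn();               // execute the published work
    barrier();          // signal completion before the next round
  }
}

If the load could be cached in a register across iterations, the loop would spin on a stale pointer; today it is the opaque barrier() call, not just volatile, that prevents that.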

This is not needed. If it were needed, none of the other variables here would work; take execution_param on line 58, for example. The same thing happens there: one thread writes, all threads read.
We use proper synchronization through barriers, which ensures we see shared memory updates. If you think this does not work, please provide an example, or at least a reference to
documentation that would indicate otherwise.

I think your reasoning is equivalent to:

void barrier();

static int g;

void set(int x) { g = x; }

int get()
{   
    int t = g;
    barrier();
    return g + t;
}

If the barrier() call is commented out, the loads in get() fold. If the variable is volatile, both loads are emitted.

If the external barrier() call is present, both loads are emitted.
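For comparison, the IR with the external barrier() present keeps both loads. This listing is reconstructed by hand, so names, attributes, and metadata may differ from real compiler output:

define dso_local i32 @get() local_unnamed_addr {
  %1 = load i32, i32* @_ZL1g, align 4   ; int t = g;
  call void @barrier()                  ; opaque call: g may change here
  %2 = load i32, i32* @_ZL1g, align 4   ; the second load survives
  %3 = add nsw i32 %2, %1               ; g + t
  ret i32 %3
}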

If the barrier is not external, and does not change that global, e.g.

int other;
extern "C" void barrier(){other++;}

then the generated IR again only reads the variable once.

define dso_local i32 @get() local_unnamed_addr #0 {
  %1 = load i32, i32* @_ZL1g, align 4, !tbaa !3    ; the only remaining load of g
  %2 = load i32, i32* @other, align 4, !tbaa !3
  %3 = add nsw i32 %2, 1
  store i32 %3, i32* @other, align 4, !tbaa !3     ; inlined barrier(): other++
  %4 = shl nsw i32 %1, 1                           ; g + t folded to g << 1
  ret i32 %4
}

So I think, at present, the 'barrier' call works as a thread fence as far as the compiler is concerned, in that it doesn't move loads across it.

Given better cross-function analysis, would the property of 'any call => fence' hold? I'm not confident it holds today within LLVM, though I think it must do in clang.

In particular, as far as I can tell C and C++ require synchronisation via atomic variables, not merely fences on non-atomics, and LLVM tends to reflect C++ semantics. So I think one thread writing the variable while others read is a data race and we are avoiding being burned by that by conservative optimisations around function calls.

Changing the loads/stores on this variable to relaxed atomic firmly kills the UB data race, and will interact properly with llvm.fence and (hopefully) the nvptx barrier.

So I think we should use relaxed loads/stores where we currently use volatile variables, instead of relying on clang not miscompiling our data race.

> So I think, at present, the 'barrier' call works as a thread fence as far as the compiler is concerned, in that it doesn't move loads across it.

Yes, it does. Potentially impacted accesses are not moved across it. Loads and stores of shared memory are impacted by such a fence.

> Given better cross-function analysis, would the property of 'any call => fence' hold? I'm not confident it holds today within LLVM, though I think it must do in clang.

It would. There is no (non-buggy) analysis that would remove the barrier's impact on shared memory accesses.

> In particular, as far as I can tell C and C++ require synchronisation via atomic variables, not merely fences on non-atomics, and LLVM tends to reflect C++ semantics.

No. As long as there is no race, there is no need for atomics. The barrier synchronizes sufficiently and also acts as a memory fence. There is no problem in C/C++/IR here.

> So I think one thread writing the variable while others read is a data race and we are avoiding being burned by that by conservative optimisations around function calls.

Assuming no synchronization between the reads and the write, you are correct. However, neither volatile nor atomic would actually make this work as required if there were
a race without them. Let's talk about atomics, as volatile is far worse. If you do the reads and write atomically, there is no race and therefore no UB, great. However, it would
also mean the workers might read the value before it was written, which is not so great. The only way this scheme works is that we write -> synchronize -> read, and that is what we do.
As such, the reads and write do not need to be atomic (or volatile).
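A sketch of that write -> synchronize -> read scheme, with hypothetical names in the same plain-C++ style as the earlier example:

typedef void (*WorkFn)();

WorkFn workFn; // plain: neither volatile nor atomic
void barrier(); // hypothetical: synchronizes all threads, acts as a fence

// Server warp: publish, then synchronize.
void publish(WorkFn Fn) {
  workFn = Fn; // plain write
  barrier();   // workers cannot reach their read before this point
}

// Worker: synchronize, then read.
WorkFn receive() {
  barrier();     // happens-after the server's write
  return workFn; // plain read; ordered by the barrier, so no race
}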

> Changing the loads/stores on this variable to relaxed atomic firmly kills the UB data race, and will interact properly with llvm.fence and (hopefully) the nvptx barrier.

There is no data race, nor is there UB. There is also no need for it to "interact properly" with the barrier.

> So I think we should use relaxed loads/stores where we currently use volatile variables, instead of relying on clang not miscompiling our data race.

See above.