This is an archive of the discontinued LLVM Phabricator instance.

[Libomptarget] Remove volatile from NVPTX work function
Accepted · Public

Authored by jhuber6 on Jul 19 2021, 2:01 PM.

Details

Summary

Currently the NVPTX work function is marked volatile. This prevents some
optimizations from using this value.
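For reference, a sketch of the kind of declaration being changed. The names
omptarget_nvptx_workFn and omptarget_nvptx_WorkFn follow the deviceRTL's naming
conventions but are reconstructed here, not copied from the patch:

// Before: the shared slot holding the work function was volatile-qualified.
extern volatile __shared__ omptarget_nvptx_WorkFn omptarget_nvptx_workFn;

// After: plain-qualified; the optimizer may now keep the loaded pointer
// in a register instead of re-reading shared memory at every use.
extern __shared__ omptarget_nvptx_WorkFn omptarget_nvptx_workFn;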

Diff Detail

Event Timeline

jhuber6 requested review of this revision. Jul 19 2021, 2:01 PM
jhuber6 created this revision.
Herald added a project: Restricted Project. Jul 19 2021, 2:01 PM
This revision is now accepted and ready to land. Jul 19 2021, 2:13 PM
This revision was landed with ongoing or failed builds. Jul 19 2021, 5:03 PM
This revision was automatically updated to reflect the committed changes.

This seems hazardous.

The value is written from one warp and then read from another. It should be atomic-qualified, but isn't, because CUDA doesn't do that.

Previously it was volatile, which happens to get treated in a broadly similar fashion to atomic.

So before it was a data race, but one that the compiler was unlikely to miscompile. Now it's a non-volatile data race. That is indeed more likely to be transformed.

I don't think we can reasonably assume this change is safe. Making it a relaxed atomic instead would be better.
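A sketch of that relaxed-atomic alternative using the Clang __atomic builtins;
the slot name, its type, and the two helper functions are placeholders, not
deviceRTL identifiers:

typedef void (*WorkFn)();

// Hypothetical shared slot, written by one warp and read by the others.
extern __shared__ WorkFn workFn;

__device__ void publishWork(WorkFn Fn) {
  // Relaxed store: no ordering on its own, but no UB data race either.
  __atomic_store_n(&workFn, Fn, __ATOMIC_RELAXED);
}

__device__ WorkFn readWork() {
  // Relaxed load: removes the UB of a racing plain load; ordering still
  // comes from the surrounding barrier.
  return __atomic_load_n(&workFn, __ATOMIC_RELAXED);
}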

JonChesterfield reopened this revision. Jul 20 2021, 4:07 AM
This revision is now accepted and ready to land. Jul 20 2021, 4:07 AM

What prevents us from just doing an atomic write to it in parallel.cu:79? Might slow it down though.

If the volatile property is a problem when calling the function, how about keeping the variable volatile, but assigning the value to a local variable (an explicit read of the volatile variable) and then calling through the local function pointer?
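A sketch of that suggestion, in the same plain-C++ style as the example later in this thread (all names are hypothetical):

typedef void (*WorkFn)();

extern volatile WorkFn workFn; // the variable itself stays volatile

void runWork() {
  WorkFn Local = workFn; // one explicit read of the volatile variable
  Local();               // the call goes through a plain local copy
}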

IIRC this is used by a server warp to tell workers what to do. So the worker needs to perform each load to learn whether it needs to do something else or stop execution.

Caching the loaded value (because it is no longer volatile) would be my guess at how this change could break things, so caching it manually would seem to have the same problem.
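For context, a hypothetical shape of that worker loop; this is an illustration, not the actual deviceRTL code:

typedef void (*WorkFn)();

extern volatile WorkFn workFn; // hypothetical: the shared slot, still volatile
void barrier();                // hypothetical: synchronizes all threads

void workerLoop() {
  for (;;) {
    barrier();          // wait for the server warp to publish work
    WorkFn Fn = workFn; // must be a fresh load on every iteration
    if (!Fn)
      break;            // a null work function means: stop execution
    Fn();               // execute the published work
    barrier();          // signal completion before the next round
  }
}

If the load could be cached in a register across iterations, the loop would spin on a stale pointer; today it is the opaque barrier() call, not just volatile, that prevents that.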

This is not needed. If it were needed, none of the other variables here would work; take execution_param on line 58, for example. The same thing happens there: one thread writes, all threads read.
We use proper synchronization through barriers, which ensures we see shared memory updates. If you think this does not work, please provide an example, or at least a reference to
documentation that would indicate otherwise.

I think your reasoning is equivalent to:

void barrier();

static int g;

void set(int x) { g = x; }

int get()
{   
    int t = g;
    barrier();
    return g + t;
}

If the barrier() call is commented out, the loads in get() fold. If the variable is volatile, both loads are emitted.

If the external barrier() call is present, both loads are emitted.
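For comparison, the IR with the external barrier() present keeps both loads. This listing is reconstructed by hand, so names, attributes, and metadata may differ from real compiler output:

define dso_local i32 @get() local_unnamed_addr {
  %1 = load i32, i32* @_ZL1g, align 4   ; int t = g;
  call void @barrier()                  ; opaque call: g may change here
  %2 = load i32, i32* @_ZL1g, align 4   ; the second load survives
  %3 = add nsw i32 %2, %1               ; g + t
  ret i32 %3
}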

If the barrier is not external, and does not change that global, e.g.

int other;
extern "C" void barrier(){other++;}

then the generated IR again only reads the variable once.

define dso_local i32 @get() local_unnamed_addr #0 {
  %1 = load i32, i32* @_ZL1g, align 4, !tbaa !3    ; the only remaining load of g
  %2 = load i32, i32* @other, align 4, !tbaa !3
  %3 = add nsw i32 %2, 1
  store i32 %3, i32* @other, align 4, !tbaa !3     ; inlined barrier(): other++
  %4 = shl nsw i32 %1, 1                           ; g + t folded to g << 1
  ret i32 %4
}

So I think, at present, the 'barrier' call works as a thread fence as far as the compiler is concerned, in that it doesn't move loads across it.

Given better cross-function analysis, would the property of 'any call => fence' hold? I'm not confident it holds today within LLVM, though I think it must do in clang.

In particular, as far as I can tell C and C++ require synchronisation via atomic variables, not merely fences on non-atomics, and LLVM tends to reflect C++ semantics. So I think one thread writing the variable while others read is a data race and we are avoiding being burned by that by conservative optimisations around function calls.

Changing the loads/stores on this variable to relaxed atomic firmly kills the UB data race, and will interact properly with llvm.fence and (hopefully) the nvptx barrier.

So I think we should use relaxed loads/stores where we currently use volatile variables, instead of relying on clang not miscompiling our data race.

> So I think, at present, the 'barrier' call works as a thread fence as far as the compiler is concerned, in that it doesn't move loads across it.

Yes, it does. Potentially impacted accesses are not moved across it. Loads and stores of shared memory are impacted by such a fence.

> Given better cross-function analysis, would the property of 'any call => fence' hold? I'm not confident it holds today within LLVM, though I think it must do in clang.

It would. There is no (non-buggy) analysis that would remove the barrier's impact on shared memory accesses.

> In particular, as far as I can tell C and C++ require synchronisation via atomic variables, not merely fences on non-atomics, and LLVM tends to reflect C++ semantics.

No. As long as there is no race, there is no need for atomics. The barrier synchronizes sufficiently and also acts as a memory fence. There is no problem in C/C++/IR here.

> So I think one thread writing the variable while others read is a data race and we are avoiding being burned by that by conservative optimisations around function calls.

Assuming no synchronization between the reads and the write, you are correct. However, neither volatile nor atomic would actually make this work as required if there were
a race without them. Let's talk about atomics, as volatile is far worse. If you do the reads and write atomically, there is no race and therefore no UB, great. However, it would
also mean the workers might read the value before it was written, which is not so great. The only way this scheme works is that we write -> synchronize -> read, and that is what we do.
As such, the reads and write do not need to be atomic (or volatile).
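A sketch of that write -> synchronize -> read scheme, with hypothetical names in the same plain-C++ style as the earlier example:

typedef void (*WorkFn)();

WorkFn workFn; // plain: neither volatile nor atomic
void barrier(); // hypothetical: synchronizes all threads, acts as a fence

// Server warp: publish, then synchronize.
void publish(WorkFn Fn) {
  workFn = Fn; // plain write
  barrier();   // workers cannot reach their read before this point
}

// Worker: synchronize, then read.
WorkFn receive() {
  barrier();     // happens-after the server's write
  return workFn; // plain read; ordered by the barrier, so no race
}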

> Changing the loads/stores on this variable to relaxed atomic firmly kills the UB data race, and will interact properly with llvm.fence and (hopefully) the nvptx barrier.

There is no data race, nor is there UB. There is also no need for it to "interact properly" with the barrier.

> So I think we should use relaxed loads/stores where we currently use volatile variables, instead of relying on clang not miscompiling our data race.

See above.