
[OpenMP] Unified entry point for SPMD & generic kernels in the device RTL
Accepted · Public

Authored by jdoerfert on May 6 2021, 12:00 AM.

Details

Summary

In the spirit of TRegions [0], this patch provides a simpler and uniform
interface for a kernel to set up the device runtime. The OMPIRBuilder is
used so the logic can be reused in Flang. A custom state machine will be
generated in the follow-up patch.

The "surplus" threads of the "master warp" will no longer exit early, so
we need to use non-aligned barriers. The new runtime will not have an
extra warp but will also require these non-aligned barriers.

[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11

This was in parts extracted from D59319.

Diff Detail

Event Timeline

jdoerfert created this revision.May 6 2021, 12:00 AM
jdoerfert requested review of this revision.May 6 2021, 12:00 AM
Herald added projects: Restricted Project, Restricted Project, Restricted Project.May 6 2021, 12:00 AM
NOTE: not all tests have been updated, only *codegen.cpp ones.
jdoerfert updated this revision to Diff 343301.May 6 2021, 12:02 AM

Remove Itanium mangling change

ABataev added inline comments.May 6 2021, 4:37 AM
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu
65–68

Why not __syncthreads? It is safer to use __syncthreads as it is convergent. It would also be good to mark this code as convergent somehow, to avoid incorrect optimizations.

jdoerfert added inline comments.May 6 2021, 7:51 AM
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu
65–68

The problem is that __syncthreads is basically a bar.sync, which is a barrier.sync.aligned, if I understood everything properly. This worked so far because the "main thread" (lane 0, last warp) was alone in its warp and all other threads had been terminated. Now we simplify the control flow (and later get rid of the last warp) such that the threads of the last warp and the main thread will hit different barriers: the former hit the one in the state machine while the latter will be in parallel_51. The .aligned version doesn't allow that. Does that make sense?

I'm not concerned about convergent, though; we solved that wholesale: we mark all functions that clang compiles for the GPU via OpenMP target offloading as convergent (IIRC). The entire device runtime is certainly convergent.
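For readers unfamiliar with the distinction, here is a minimal sketch of the two barrier flavors. This is illustrative only, not the deviceRTL code; the function names and the barrier id are invented:

```cuda
// __syncthreads() lowers to "bar.sync 0", i.e. barrier.sync.aligned:
// every thread of the CTA must reach this exact instruction.
__device__ void aligned_barrier() { __syncthreads(); }

// The non-aligned form only requires that the participating threads use
// the same barrier id; they may reach textually different instructions.
__device__ void unaligned_barrier(unsigned Id) {
  asm volatile("barrier.sync %0;" ::"r"(Id) : "memory");
}
```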

ABataev accepted this revision.May 6 2021, 8:26 AM

LG

openmp/libomptarget/deviceRTLs/interface.h
421

Formatting

This revision is now accepted and ready to land.May 6 2021, 8:26 AM
JonChesterfield requested changes to this revision.May 6 2021, 8:34 AM

What are the required semantics of the barrier operations? Amdgcn builds them on shared memory, so probably needs a change to the corresponding target_impl to match

This revision now requires changes to proceed.May 6 2021, 8:34 AM

> What are the required semantics of the barrier operations? Amdgcn builds them on shared memory, so probably needs a change to the corresponding target_impl to match

I have *not* tested AMDGCN but I was not expecting a problem. The semantics I need here are:
- warp N, thread 0 hits a barrier instruction I0
- warp N, threads 1-31 hit a barrier instruction I1
- the entire warp synchronizes and moves on.
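A hedged sketch of that requirement (the function name and barrier id are invented for illustration): thread 0 and threads 1-31 sit at textually different barrier instructions that share an id, and all of them proceed together:

```cuda
// Invented example: two different barrier instructions, same id.
__device__ void state_machine_step(unsigned Id) {
  if (threadIdx.x % 32 == 0) {
    // warp N, thread 0: barrier instruction I0
    asm volatile("barrier.sync %0;" ::"r"(Id) : "memory");
  } else {
    // warp N, threads 1-31: barrier instruction I1
    asm volatile("barrier.sync %0;" ::"r"(Id) : "memory");
  }
  // All threads naming barrier Id have now synchronized, even though
  // they arrived at different instructions; the aligned form (bar.sync)
  // does not permit this.
}
```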

One hazard is the amdgpu devicertl only has one barrier. D102016 makes it simpler to add a second. I'd guess we want named_sync to call one barrier and syncthreads to call a different one, so we should probably rename those functions. The LDS barrier implementation needs to know how many threads to wait for, we may be OK passing 'all the threads' down from the __syncthreads entry point.

The other is the single instruction pointer per wavefront, like pre-volta nvidia cards (which I believe we also expect to work). I'm not sure whether totally independent barriers will work, or whether we'll need to arrange for thread 0 and thread 1-31 to call the two different barriers at the same point in control flow.
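If a second, software-built barrier were needed, one classic construction is a sense-reversing counting barrier in shared memory. The sketch below is a CUDA-flavored illustration of the idea, not the amdgcn devicertl implementation; all names are invented, and Count/Sense are assumed to be zero-initialized shared-memory words:

```cuda
// Hypothetical counting barrier: NumThreads threads call this, each
// waits until all of them have arrived, then all proceed.
__device__ void counting_barrier(unsigned *Count, unsigned *Sense,
                                 unsigned NumThreads) {
  __threadfence_block();                    // order writes before the barrier
  unsigned MySense = *Sense;                // snapshot the current phase
  if (atomicAdd(Count, 1u) == NumThreads - 1) {
    *Count = 0;                             // last arriver resets the count
    atomicExch(Sense, MySense ^ 1u);        // ... and flips the phase
  } else {
    while (atomicAdd(Sense, 0u) == MySense)
      ;                                     // spin until the phase flips
  }
  __threadfence_block();                    // order reads after the barrier
}
```

Note that this ties into the single-instruction-pointer concern above: on lockstep hardware, threads of one wavefront spinning here while their siblings sit at a different barrier may never make progress.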

So what do you want me to change for this patch now?

> So what do you want me to change for this patch now?

Equivalent change to amdgpu target_impl to the nvptx target_impl, which looks like syncthreads should call a new barrier.

IIUC this has run successfully even without that change, so hopefully that's sufficient evidence that we won't regress on amdgpu. I'd like to get miniqmc running locally to verify as well, but we may not be able to wait for that.

openmp/libomptarget/deviceRTLs/common/src/omptarget.cu
195

why are these weak?

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu
65–68

amdgcn presumably needs the same change. Add a barrier and call it from __kmpc_impl_syncthreads.

I think barrier.sync defaults to all threads when the second argument is omitted, so we can use the corresponding kmpc call to get the num_threads argument for it.
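In PTX, `barrier.sync` indeed takes an optional second operand: with only an id it waits for all threads of the CTA, while with a count it waits for that many threads (the count must be a multiple of the warp size). A sketch of the two entry points, with illustrative names:

```cuda
// Count omitted: all threads of the CTA participate.
__device__ void impl_syncthreads(unsigned Id) {
  asm volatile("barrier.sync %0;" ::"r"(Id) : "memory");
}

// Only NumThreads threads participate; per the PTX ISA, NumThreads
// must be a multiple of the warp size (32).
__device__ void impl_named_sync(unsigned Id, unsigned NumThreads) {
  asm volatile("barrier.sync %0, %1;" ::"r"(Id), "r"(NumThreads) : "memory");
}
```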

So we don't need changes? I'm not sure what the problem here is.

openmp/libomptarget/deviceRTLs/common/src/omptarget.cu
195

So we do not perform IPO but will inline them. If we performed IPO we would specialize the arguments, even though we still potentially want to change the mode from non-SPMD to SPMD.

JonChesterfield accepted this revision.May 12 2021, 12:16 PM

I'm not certain what this 'aligned' limitation for nvptx syncthreads is, but I can't think of a corresponding one for amdgcn. So we may not need the LDS barrier construction, and it'll be much faster if we don't.

This was reported working on amdgpu by a third party against an earlier trunk build, but sadly the current trunk seems to have regressed (debugging offline). So I have no reason to believe this doesn't work, and some reason to believe it will. Objection withdrawn.

The code itself always looked fine; I was only nervous about the changes to concurrency primitives in nvptx.

This revision is now accepted and ready to land.May 12 2021, 12:16 PM
openmp/libomptarget/deviceRTLs/common/src/omptarget.cu
195

as discussed offline, weak_odr or drop the weak

Update tests

jdoerfert updated this revision to Diff 346621.May 19 2021, 7:59 PM

Drop the weak attribute, will solve the problem differently