This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Introduce stream pool to make sure the correctness of device synchronization
ClosedPublic

Authored by tianshilei1992 on Apr 3 2020, 9:17 AM.

Details

Summary

In previous patch, in order to optimize performance, we only synchronize once
for each target region. The syncrhonization is via stream synchronization.
However, in the extreme situation, the performce might be bad. Consider the
following case: There is a task that requires transferring huge amount of data
(call many times of data transferring function). It is scheduled to the first
stream. And then we have 255 very light tasks scheduled to the remaining 255
streams (by default we have 256 streams). They can be finished before we do
synchronization at the end of the first task. Next, we get another very huge
task. It will be scheduled again to the first stream. Now the first task
finishes its kernel launch and call stream synchronization. Right now, the
stream already contains two kernels, and the synchronization will wait until the
two kernels finish instead of just the first one for the first task.

In this patch, we introduce stream pool. After each synchronization, the stream
will be returned back to the pool to make sure that for each synchronization,
only expected operations are waited.

Diff Detail

Event Timeline

tianshilei1992 created this revision.Apr 3 2020, 9:17 AM

I like this a lot. I have one high-level question and a request for documentation:
Shouldn't we hide the stream stuff behind the async_info object instead?
I mean, we could avoid accidental misuse by restricting the getStream/returnStream to the constructor/destructor of async_info for example.
We should also unify the naming.

Maybe, put the entire "stream" (=queue) management in a class. All the initialization happens there, all the tear down as well, and when you make it friends with async_info we can restrict access too.

WDYT?

openmp/libomptarget/plugins/cuda/src/rtl.cpp
366

This is clever. So clever in fact that we need a comment above StreamPool (or similar) describing how the different values interact and how the StreamPool evolves. Basically how it works.

I like this a lot. I have one high-level question and a request for documentation:
Shouldn't we hide the stream stuff behind the async_info object instead?
I mean, we could avoid accidental misuse by restricting the getStream/returnStream to the constructor/destructor of async_info for example.
We should also unify the naming.

Maybe, put the entire "stream" (=queue) management in a class. All the initialization happens there, all the tear down as well, and when you make it friends with async_info we can restrict access too.

WDYT?

Agree. Will create a new class StreamManager.

Wrap all stream-related operations into StreamManager.

tianshilei1992 marked an inline comment as done.Apr 3 2020, 6:21 PM

Rebase my patch

jdoerfert added inline comments.Apr 9 2020, 8:34 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
251

No need to put your name here, git blame would know anyway.

We should move all the CUDA stuff into the stream manager and the stream manager out of the RTLDeviceInfoTy but both can be done later. For now just replace the friends with a getter.

Removed friend functions. Later I'll prepare a new patch to improve encapsulation.

tianshilei1992 marked an inline comment as done.Apr 10 2020, 9:22 AM
jdoerfert added inline comments.Apr 10 2020, 9:58 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
103

Later we can think abut making this a vector of structs instead as we tend to access them per device anyway. We can then align the structs such that different processes interacting with different devices don't really interfere with each other.

118

Outline this into a helper. Also the error checking should be a helper.

So something like:

/// ....
bool checkResult(CUResult err, const char *ErrMsg) {
  if (err == CUDA_SUCCESS)
    return true;
   DP(ErrMsg);
   CUDA_ERR_STRING(err);
   return false;
}

/// ...
bool switchToDevice(DeviceId) {
  CUresult err = cuCtxSetCurrent((*ContextsPtr)[DeviceId]);
  return checkResult(err, "Error when creating CUDA stream to resize stream pool\n");
}

and then use it eveywhere. Should cut down the size and duplication.

137

If there is no reason ever to provide a nullptr as CtxPtr make it a reference everywhere instead.

150

Don't you need to create the sreams, e.g., call resizeStreamPool instead?

538–539

I was expecting this to be part of the StreamManager. If it can only happen at this point, maybe we want a StreamManager::initializeDevice method or similar. We can then also avoid exposing the stream pool to the outside.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
103

That would be ideal.

137

My bad. At the very beginning I thought by some chances it will be nullptr but it turns out that will never happen. :-)

150

Here I follow the initial design which is later initialization. In the constructor, it only allocate memory but does not initialize.

538–539

Yes, we could do that. But what I'm thinking is, do we assume cuCtxSetCurrent is called before we call StreamManager::initializeDevice, or we will set the context again in StreamManager::initializeDevice?

tianshilei1992 added inline comments.
openmp/libomptarget/plugins/cuda/src/rtl.cpp
118

We can do the first one. As for the second function, it must be the member function. Let's do that in next patch which moves every CUDA related things into class.

jdoerfert added inline comments.Apr 10 2020, 4:04 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
118

ok

538–539

Once could move both to the StreamManager::initializeDevice, set ctx and create streams. One could then either reset the context or just document that the context is set by initializeDevice. This is a one time cost so setting the context twice is not my main concern.

tianshilei1992 added inline comments.Apr 10 2020, 5:39 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
538–539

Yes, in the next patch, all these issues are no long existing. Here let's assume that when call StreamManager::initializeDevice, the right context has been set.

This revision is now accepted and ready to land.Apr 10 2020, 7:10 PM
This revision was automatically updated to reflect the committed changes.