This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Use primary context in CUDA plugin
ClosedPublic

Authored by ye-luo on Jun 28 2020, 8:12 AM.

Details

Summary

Retaining per device primary context is preferred to creating a context owned by the plugin.

From CUDA documentation

  1. Note that the use of multiple CUcontext s per device within a single process will substantially degrade performance and is strongly discouraged. Instead, it is highly recommended that the implicit one-to-one device-to-context mapping for the process provided by the CUDA Runtime API be used." from https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DRIVER.html
  2. Right under cuCtxCreate. In most cases it is recommended to use cuDevicePrimaryCtxRetain. https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g65dc0012348bc84810e2103a40d8e2cf
  3. The primary context is unique per device and shared with the CUDA runtime API. These functions allow integration with other libraries using CUDA. https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PRIMARY__CTX.html#group__CUDA__PRIMARY__CTX

Two issues are addressed by this patch:

  1. Not using the primary context caused interoperability issue with libraries like cublas, cusolver. CUBLAS_STATUS_EXECUTION_FAILED and cudaErrorInvalidResourceHandle
  2. On OLCF summit, "Error returned from cuCtxCreate" and "CUDA error is: invalid device ordinal"

Regarding the flags of the primary context. If it is inactive, we set CU_CTX_SCHED_BLOCKING_SYNC. If it is already active, we respect the current flags.

Diff Detail

Event Timeline

ye-luo created this revision.Jun 28 2020, 8:12 AM
tianshilei1992 added a comment.EditedJun 28 2020, 8:54 PM

Is there any side affect for this method? If no, I think it might be better to directly use it such that we don't need an extra environment variable to control it.

I'm not aware of any side effect but others may have different opinions. Since it is low effort to keep both, I made the choice to keep the old behavior accessible.

I'm not aware of any side effect but others may have different opinions. Since it is low effort to keep both, I made the choice to keep the old behavior accessible.

More options mean more configurations we should test. Not that we do it right now but we should. Let's see if anyone else has an opinion about this.

(In general, all patches must be sent to the respective -commits list. This also makes feedback more likely.)

Retaining per device primary context is preferred to creating a context owned by the plugin.
CUDA driver API documentation recommends this.

Do you have a link for this? From a users / admin perspective, my only concern is that libomptarget should only "block" devices that are actually used. This is important for interactive machines that are configured in exclusive mode. It looks like cuDevicePrimaryCtxRetain does this, but maybe you can test that it's indeed working this way?

(In general, all patches must be sent to the respective -commits list. This also makes feedback more likely.)

Retaining per device primary context is preferred to creating a context owned by the plugin.
CUDA driver API documentation recommends this.

Do you have a link for this? From a users / admin perspective, my only concern is that libomptarget should only "block" devices that are actually used. This is important for interactive machines that are configured in exclusive mode. It looks like cuDevicePrimaryCtxRetain does this, but maybe you can test that it's indeed working this way?

  1. Note that the use of multiple CUcontext s per device within a single process will substantially degrade performance and is strongly discouraged. Instead, it is highly recommended that the implicit one-to-one device-to-context mapping for the process provided by the CUDA Runtime API be used." from https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DRIVER.html
  2. Right under cuCtxCreate. In most cases it is recommended to use cuDevicePrimaryCtxRetain. https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g65dc0012348bc84810e2103a40d8e2cf
  3. The primary context is unique per device and shared with the CUDA runtime API. These functions allow integration with other libraries using CUDA. https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PRIMARY__CTX.html#group__CUDA__PRIMARY__CTX

libomptarget only engages a device if user requests it. I have no problems running 6 MPI on 6 GPUs within a single node. Each MPI owns one GPU exclusively.
I don't fully understand what you mean "blocking". I'm able to run my offload application + linux GUI on my desktop with only 1 GPU. So it is not blocking the GPU anyway, the time slicing of sharing a GPU still works.

(In general, all patches must be sent to the respective -commits list. This also makes feedback more likely.)

Sorry, I was not aware of -commits list. What is it for exclusively? If there is policy or instruction , please point me.
I added OpenMP project tag. I you watch OpenMP project on differential. I expect you get notifications. Is it not the case?

I'm not aware of any side effect but others may have different opinions. Since it is low effort to keep both, I made the choice to keep the old behavior accessible.

More options mean more configurations we should test. Not that we do it right now but we should. Let's see if anyone else has an opinion about this.

If no one objects it, I'm happy to delete the old behavior.

(In general, all patches must be sent to the respective -commits list. This also makes feedback more likely.)

Sorry, I was not aware of -commits list. What is it for exclusively? If there is policy or instruction , please point me.
I added OpenMP project tag. I you watch OpenMP project on differential. I expect you get notifications. Is it not the case?

It's in the very first part of https://www.llvm.org/docs/Phabricator.html, the primary usage documentation for Phabricator within LLVM. IIRC it's automatically added if you add the LLVM repository and touch a file under openmp/. The project tags are a more recent invention and I haven't seen a thread making it mandatory to get updates on submitted patches.

Do you have a link for this? From a users / admin perspective, my only concern is that libomptarget should only "block" devices that are actually used. This is important for interactive machines that are configured in exclusive mode. It looks like cuDevicePrimaryCtxRetain does this, but maybe you can test that it's indeed working this way?

  1. Note that the use of multiple CUcontext s per device within a single process will substantially degrade performance and is strongly discouraged. Instead, it is highly recommended that the implicit one-to-one device-to-context mapping for the process provided by the CUDA Runtime API be used." from https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DRIVER.html
  2. Right under cuCtxCreate. In most cases it is recommended to use cuDevicePrimaryCtxRetain. https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g65dc0012348bc84810e2103a40d8e2cf
  3. The primary context is unique per device and shared with the CUDA runtime API. These functions allow integration with other libraries using CUDA. https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PRIMARY__CTX.html#group__CUDA__PRIMARY__CTX

Based on the documentation, using the primary context makes sense and I don't feel strongly that there's a need to keep the old behavior.

libomptarget only engages a device if user requests it. I have no problems running 6 MPI on 6 GPUs within a single node. Each MPI owns one GPU exclusively.
I don't fully understand what you mean "blocking". I'm able to run my offload application + linux GUI on my desktop with only 1 GPU. So it is not blocking the GPU anyway, the time slicing of sharing a GPU still works.

Ok, I'll take your word that it's equivalent to the current behavior. With "blocking" I was referring to the driver configuration that ensures each device is only accessible by a single process (the second one to try will get an error). This broke often enough with MPI + PGI's OpenACC runtime in the past...

It seems no one objects to removing the old behavior. I will update the patch.

ye-luo updated this revision to Diff 274318.Jun 29 2020, 8:12 PM
ye-luo edited the summary of this revision. (Show Details)

I have updated the patch removing the old behavior. @Hahnfeld would you approve?

This patch drops the CU_CTX_SCHED_BLOCKING_SYNC property currently selected for the context. Is this intended? Should we add another function call to request this behavior for the primary context?

This patch drops the CU_CTX_SCHED_BLOCKING_SYNC property currently selected for the context. Is this intended? Should we add another function call to request this behavior for the primary context?

That is good point. We depend on the synchronous behavior in some cases in the RTL.

This patch drops the CU_CTX_SCHED_BLOCKING_SYNC property currently selected for the context. Is this intended? Should we add another function call to request this behavior for the primary context?

cuDevicePrimaryCtxSetFlags does allow me to set the flag. However, I don't have enough knowledge to say which flag CU_CTX_SCHED_BLOCKING_SYNC or not setting it, namely depends on the current PrimaryCtx flag.
I also don't feel right having libomptarget making a choice for all others.

My primary goal is to get the current issue fixed and get applications up and running.
If you prefer setting CU_CTX_SCHED_BLOCKING_SYNC to accept the patch, I can add that. It is not my primary concern and it is a separate topic and difficult.
There is no guarantee it will not be overwritten by another library or user code overwrites the flag for the primary context.

This patch drops the CU_CTX_SCHED_BLOCKING_SYNC property currently selected for the context. Is this intended? Should we add another function call to request this behavior for the primary context?

That is good point. We depend on the synchronous behavior in some cases in the RTL.

Are you sure this is the right flag you need?

ye-luo updated this revision to Diff 274662.Jun 30 2020, 6:29 PM
ye-luo edited the summary of this revision. (Show Details)

Updated description regarding setting CU_CTX_SCHED_BLOCKING_SYNC.

tianshilei1992 added a comment.EditedJun 30 2020, 7:08 PM

This patch drops the CU_CTX_SCHED_BLOCKING_SYNC property currently selected for the context. Is this intended? Should we add another function call to request this behavior for the primary context?

That is good point. We depend on the synchronous behavior in some cases in the RTL.

Are you sure this is the right flag you need?

I did some investigation and finally think BLOCKING_SYNC might be a good option here, but I also would like to hear from others.
Basically we have three options here: SPIN, YIELD and the BLOCKING_SYNC.

  • SPIN: In most cases this is not a good option but it might be better for a really tiny kernel.
  • YIELD: Like mentioned in CUDA documentation, it can increase latency, but can increase the performance of CPU threads performing work in parallel with the GPU. But for OpenMP, this level of yield may not help too much because it is not very common to have thread oversubscription in OpenMP.
  • BLOCKING_SYNC: I guess chances are that it works with interruption. This is kind of a balanced option which will not increase too much latency, and also will not waste the resource, especially considering that we might be going to use unshackled threads for all target tasks.
  • AUTO: In fact this option will be used if there is no flag specified. According to CUDA documentation, the behavior depends on the number of CUDA contexts in current process versus the number of logical processors in the system. Either SPIN or YIELD will be used. BLOCKING_SYNC will only be chosen on Tegra devices.
This comment was removed by ye-luo.

I did some investigation and finally think BLOCKING_SYNC might be a good option here, but I also would like to hear from others.
Basically we have three options here: SPIN, YIELD and the BLOCKING_SYNC.

  • SPIN: In most cases this is not a good option but it might be better for a really tiny kernel.
  • YIELD: Like mentioned in CUDA documentation, it can increase latency, but can increase the performance of CPU threads performing work in parallel with the GPU. But for OpenMP, this level of yield may not help too much because it is not very common to have thread oversubscription in OpenMP.
  • BLOCKING_SYNC: I guess chances are that it works with interruption. This is kind of a balanced option which will not increase too much latency, and also will not waste the resource, especially considering that we might be going to use unshackled threads for all target tasks.
  • AUTO: In fact this option will be used if there is no flag specified. According to CUDA documentation, the behavior depends on the number of CUDA contexts in current process versus the number of logical processors in the system. Either SPIN or YIELD will be used. BLOCKING_SYNC will only be chosen on Tegra devices.

From you analysis BLOCKING_SYNC looks best from an OpenMP perspective. We can make that the default and maybe introduce an env var LIBOMPTARGET_CUDA_CTX_SCHED_POLICY so the user can override it (this is better to be done in another patch).

ye-luo added a comment.Jul 2 2020, 5:58 AM

@protze.joachim Is the current way of handling CU_CTX_SCHED_BLOCKING_SYNC OK?
I mean still set the flag as CU_CTX_SCHED_BLOCKING_SYNC if the primary context is not activated prior to libomptarget.
Allowing users to choose a scheme will be a separate patch.
I'd like to get this patch in as soon as possible because the 11 release branch will be created soon and the current issue is a stopper for applications using libomptarget.

My comment was just an observation based on the code modification. I have not much understanding of the implications of this flag other than the few lines I read in the documentation and the comparison @tianshilei1992 posted.

The intention of your latest change makes sense to me, I just think that there could be more flags active (see inline comment)

I leave it to others to judge whether this is the desired behavior. As an alternative, you could create a new context in case the primary context uses a different scheduling policy.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
428

I think, it should be

if (! (FormerPrimaryCtxFlags & CU_CTX_SCHED_BLOCKING_SYNC))

or

if ((FormerPrimaryCtxFlags & CU_CTX_SCHED_MASK) != CU_CTX_SCHED_BLOCKING_SYNC)

They might be equivalent, if only one scheduling policy can be set at a time.

ye-luo updated this revision to Diff 275156.Jul 2 2020, 10:13 AM
ye-luo marked an inline comment as done.Jul 2 2020, 10:16 AM
ye-luo added inline comments.
openmp/libomptarget/plugins/cuda/src/rtl.cpp
428

I adopted CU_CTX_SCHED_MASK. So far only one flag is allowed but we never know the future.

jdoerfert accepted this revision.Jul 6 2020, 8:52 PM

Not overwriting the user/system choice is the right thing to do.

I'm fine with the warning in case there is a different option set in the existing context, but from what I understand there is no correctness but only a performance implication anyway.

This seems to have addressed all concerns and I doubt it will change the behavior on most systems anyway. On the ones it does, it is important to do so.

Thanks for the fix!

One nit, otherwise LGTM.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
420

We should initialize these, I never trust runtime calls to do that for some reason.

This revision is now accepted and ready to land.Jul 6 2020, 8:52 PM
ye-luo marked an inline comment as done.Jul 6 2020, 9:01 PM
ye-luo added inline comments.
openmp/libomptarget/plugins/cuda/src/rtl.cpp
420

Apart from trusting the runtime calls, I don't see any fallback/default value makes sense. There are no "unknown" flags mentioned in the documentation. We do check returned status of these runtime calls. It should be safe.

jdoerfert added inline comments.Jul 7 2020, 5:10 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
420

Anything that is not an undefined value, which easily results in non-deterministic UB, is fine with me. Pick 0 for both.

ye-luo updated this revision to Diff 276032.Jul 7 2020, 6:12 AM
ye-luo marked an inline comment as done.
ye-luo added inline comments.
openmp/libomptarget/plugins/cuda/src/rtl.cpp
420

Initialized them with 0

ye-luo marked 3 inline comments as done.Jul 7 2020, 6:14 AM
This revision was automatically updated to reflect the committed changes.