This is an archive of the discontinued LLVM Phabricator instance.

[Libomptarget] Don't report lack of CUDA devices
Closed · Public

Authored by jdenny on Jul 22 2022, 10:13 AM.

Details

Summary

Sometimes libomptarget's CUDA plugin produces unhelpful diagnostics
about a lack of CUDA devices before an application runs:

$ clang -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa hello-world.c
$ ./a.out
CUDA error: Error returned from cuInit
CUDA error: no CUDA-capable device is detected
Hello World: 4

This can happen when the CUDA plugin was built but all CUDA devices
are currently disabled in some manner, perhaps because
CUDA_VISIBLE_DEVICES is set to the empty string. As shown in the
above example, it can even happen when we haven't compiled the
application for offloading to CUDA.

The following code from openmp/libomptarget/plugins/cuda/src/rtl.cpp
appears to be intended to handle this case, and it chooses not to
write a diagnostic to stderr unless debugging is enabled:

if (NumberOfDevices == 0) {
  DP("There are no devices supporting CUDA.\n");
  return;
}

The problem is that the above code is never reached because the
earlier cuInit returns CUDA_ERROR_NO_DEVICE. This patch handles
that cuInit case in the same manner as the above code handles the
NumberOfDevices == 0 case.
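
The control flow described above can be sketched as follows. This is a minimal, self-contained illustration and not the committed diff. It avoids the CUDA headers so that it builds anywhere, and the names InitResult, Outcome, debugPrint, and initPlugin are hypothetical stand-ins: in the real plugin the result comes from cuInit as a CUresult, and the quiet message goes through the DP macro.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the CUresult values that matter here.
enum class InitResult { Success, NoDevice, OtherError };

// What the plugin ends up doing after its initialization attempt.
enum class Outcome { Ready, SkippedQuietly, Failed };

// Mimics the idea behind DP(): the message is recorded only when
// debugging is enabled. Log stands in for stderr.
static void debugPrint(bool DebugEnabled, std::string &Log,
                       const std::string &Msg) {
  if (DebugEnabled)
    Log += Msg;
}

// Assumed shape of the initialization path after the patch.
Outcome initPlugin(InitResult InitErr, int NumberOfDevices, bool DebugEnabled,
                   std::string &Log) {
  assert(NumberOfDevices >= 0 && "device count cannot be negative");

  // New: "no device" from initialization is treated like a zero device
  // count, so it is silent unless debugging is enabled.
  if (InitErr == InitResult::NoDevice) {
    debugPrint(DebugEnabled, Log, "There are no devices supporting CUDA.\n");
    return Outcome::SkippedQuietly;
  }

  // Any other initialization failure is still reported unconditionally.
  if (InitErr != InitResult::Success) {
    Log += "CUDA error: Error returned from cuInit\n";
    return Outcome::Failed;
  }

  // Pre-existing check, reached only when initialization succeeded.
  if (NumberOfDevices == 0) {
    debugPrint(DebugEnabled, Log, "There are no devices supporting CUDA.\n");
    return Outcome::SkippedQuietly;
  }

  return Outcome::Ready;
}
```

With debugging disabled, calling initPlugin(InitResult::NoDevice, 0, false, Log) leaves Log empty, which corresponds to the application running without the two "CUDA error" lines shown in the example at the top of the summary.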

Event Timeline

jdenny created this revision. Jul 22 2022, 10:13 AM
jdenny requested review of this revision. Jul 22 2022, 10:13 AM
tianshilei1992 accepted this revision. Jul 22 2022, 10:34 AM

LGTM. Thanks for the improvement!

This revision is now accepted and ready to land. Jul 22 2022, 10:34 AM
This revision was landed with ongoing or failed builds. Jul 22 2022, 11:50 AM
This revision was automatically updated to reflect the committed changes.

Thanks for the quick review.

ye-luo added a subscriber: ye-luo. Aug 4 2022, 1:16 PM

I guess your machine has the NVIDIA driver installed but there is no GPU.
When there is no NVIDIA driver, the CUDA plugin fails to load instead:

Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.cuda.so': libcuda.so.1: cannot open shared object file: No such file or directory!

jdenny added a comment. Aug 8 2022, 7:25 AM

> I guess your machine has the NVIDIA driver installed but there is no GPU.

On my laptop, I saw the problem when I disabled the (discrete) NVIDIA GPU in favor of integrated graphics, or when I set CUDA_VISIBLE_DEVICES to the empty string.

The machine that originally motivated this change has CUDA installed but not the NVIDIA driver. This patch helped that case too. However, that machine also showed other strange behavior that I don't have time to pursue right now, and I ultimately recommended -DLIBOMPTARGET_BUILD_CUDA_PLUGIN=False to get around it. (I would have reported the behavior upstream, but it might be specific to Clacc.) Anyway, my point is that I'm not yet sure that things always work right with a disabled NVIDIA driver.

ye-luo added a comment. Aug 8 2022, 7:38 AM

Thanks for the info. With your patch:

clang++ -fopenmp --offload-arch=sm_80,gfx906 main.cpp
CUDA_VISIBLE_DEVICES="" ./a.out # runs fine on the AMD GPU.

So it is good.