This is an archive of the discontinued LLVM Phabricator instance.

[libomptarget][cuda] Call v2 functions explicitly
ClosedPublic

Authored by JonChesterfield on Jan 22 2021, 4:29 PM.

Details

Summary

[libomptarget][cuda] Call v2 functions explicitly

rtl.cpp calls functions like cuMemFree that are replaced by a macro
in cuda.h with cuMemFree_v2. This patch changes the source to use
the v2 names consistently.

See also D95104, D95155 for the idea. Alternatives are to use a mixture,
e.g. call the macro names and explictly dlopen the _v2 names, or to keep
the current status where the symbols are replaced by macros in both files

Diff Detail

Event Timeline

JonChesterfield requested review of this revision.Jan 22 2021, 4:29 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 22 2021, 4:29 PM
This revision is now accepted and ready to land.Jan 22 2021, 4:34 PM
This revision was landed with ongoing or failed builds.Jan 23 2021, 12:33 PM
This revision was automatically updated to reflect the committed changes.
kkwli0 added a subscriber: kkwli0.Jan 25 2021, 7:05 AM

I am seeing these errors.

/llvm-project/openmp/runtime/src/kmp_runtime.cpp:36:10: fatal error: 'llvm/Support/TimeProfiler.h' file not found
#include "llvm/Support/TimeProfiler.h"
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
make[5]: *** [openmp/runtime/src/CMakeFiles/omp.dir/kmp_runtime.cpp.o] Error 1
make[5]: *** Waiting for unfinished jobs....
/llvm-project/openmp/libomptarget/plugins/cuda/src/rtl.cpp:469:21: error: use of undeclared identifier 'cuDevicePrimaryCtxRelease_v2'; did you mean 'cuDevicePrimaryCtxRelease'?
        checkResult(cuDevicePrimaryCtxRelease_v2(Device),
                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
                    cuDevicePrimaryCtxRelease
/usr/local/cuda/include/cuda.h:3437:18: note: 'cuDevicePrimaryCtxRelease' declared here
CUresult CUDAAPI cuDevicePrimaryCtxRelease(CUdevice dev);
                 ^
/llvm-project/openmp/libomptarget/plugins/cuda/src/rtl.cpp:509:13: error: use of undeclared identifier 'cuDevicePrimaryCtxSetFlags_v2'; did you mean 'cuDevicePrimaryCtxSetFlags'?
      Err = cuDevicePrimaryCtxSetFlags_v2(Device, CU_CTX_SCHED_BLOCKING_SYNC);
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            cuDevicePrimaryCtxSetFlags
/usr/local/cuda/include/cuda.h:3502:18: note: 'cuDevicePrimaryCtxSetFlags' declared here
CUresult CUDAAPI cuDevicePrimaryCtxSetFlags(CUdevice dev, unsigned int flags);
                 ^
make[4]: *** [openmp/runtime/src/CMakeFiles/omp.dir/all] Error 2
make[4]: *** Waiting for unfinished jobs....

I think missing TimeProfiler.h is a known problem, independent from this. It can be worked around locally by disabling the profiling.

The error: use of undeclared identifier 'cuDevicePrimaryCtxRelease_v2'; did you mean 'cuDevicePrimaryCtxRelease'? is a hazard though. This means that at least one person (you!) is using a cuda version which is old enough to not use the _v2 appended functions.

We can unblock you by going back to the non-suffixed versions in rtl.cpp and adding the macros back to dynamic_cuda/cuda.h. That will mean the following combinations work:

  • directly linking against an old cuda
  • directly linking against a new cuda
  • dlopen of a new cuda

What version of cuda are you using? It might be useful to know when things changed.

I think we can dlopen cuda, and try to load cuMemcpyDtoHAsync_v2, and then try to load cuMemcpyDtoHAsync on failure. That would get us some degree of robustness to different symbol versions.

We can unblock you by going back to the non-suffixed versions in rtl.cpp and adding the macros back to dynamic_cuda/cuda.h. That will mean the following combinations work:

Let's do that, we revert this for now.

Patch at D95367. Can't do a simple revert as we also need to back out e5e448aafa7699c17f78aaffb001b665b607e5ae

What version of cuda are you using? It might be useful to know when things changed.

I think we can dlopen cuda, and try to load cuMemcpyDtoHAsync_v2, and then try to load cuMemcpyDtoHAsync on failure. That would get us some degree of robustness to different symbol versions.

$ cat /usr/local/cuda/version.txt 
CUDA Version 10.2.89

What version of cuda are you using? It might be useful to know when things changed.

I think we can dlopen cuda, and try to load cuMemcpyDtoHAsync_v2, and then try to load cuMemcpyDtoHAsync on failure. That would get us some degree of robustness to different symbol versions.

$ cat /usr/local/cuda/version.txt 
CUDA Version 10.2.89

This means dyn_cuda won't work for this version (and others I assume). We might need to look for both symbols in dyn_cuda, wdyt Jon?

This means dyn_cuda won't work for this version (and others I assume). We might need to look for both symbols in dyn_cuda, wdyt Jon?

Based on the macro renaming used to patch application code to work with either, it seems the calling conventions are unchanged.

If so, logic that looks up foo_v2 first, and then tries foo second, and only fails if both are missing seems valid.

The clean way to do that is probably:

  • rtl.cpp calls cuMemcpyDtoHAsync
  • dynamic_cuda/cuda.h does not have a macro that replaces that with cuMemcpyDtoHAsync_v2
  • dynamic_cuda/cuda.cpp implements cuMemcpyDtoHAsync as a call to a function pointer which is initialised with cuMemcpyDtoHAsync_v2 or cuMemcpyDtoHAsync