This is an archive of the discontinued LLVM Phabricator instance.

[WIP] Move part of nvptx devicertl under clang
AbandonedPublic

Authored by JonChesterfield on Jan 24 2021, 10:36 AM.

Details

Summary

[WIP] Move part of nvptx devicertl under clang

Example of moving the devicertl functions that depend on cuda
version under clang, so they can be injected at application
build time.

The original idea was to use the intrinsic definitions from
__clang_cuda_intrinsics, but that header needs a lot of cuda
specific setup to compile and includes part of the cuda sdk.
It's therefore difficult to compile as openmp.

This implements the code in headers and will work for c++ with
openmp, but not necessarily for C as the inline functions may not
be instantiated. It will also be a problem for fortran openmp.

I'm inclined to do something broadly equivalent to this, but in
the library. It means clang would need to link against devicertl.bc
and against a small cuda version specific devicertl_tbd.bc.

Diff Detail

Event Timeline

JonChesterfield requested review of this revision.Jan 24 2021, 10:36 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJan 24 2021, 10:36 AM
clang/lib/Driver/ToolChains/Clang.cpp
1204

Logic very like this could pick out a second, small devicertl bitcode library

clang/lib/Headers/openmp_wrappers/__clang_openmp_devicertl_cuda_ge90.h
18

linkonce and linkonce_odr can both be discarded, but these symbols need to survive until devicertl is linked

39

calling into intrinsics would remove this messy expression, if we can work out how to reliably compile parts of the cuda sdk as openmp

these functions probably can't be instantiated on the host, so when changing the devicertl build over to openmp we will also need to guard these with variant / macro

openmp/libomptarget/deviceRTLs/amdgcn/src/target_impl.hip
49

Changing from c++ to c name mangling seems fine here. We don't consistently use one or the other in the devicertl. Using c mangling is simpler if the implementation is in a header.

In general we're moving to the direction that target specific implementation will be compiled along with user code, which is fantastic. In this way, we only need to provide one bitcode library for one target. The change in FE lacks of some efficiency. If user code has multiple files, target specific header will be included multiple times, thus compiled multiple times. A more efficient way is to change the workflow of the driver, probably in the following way:

  1. Compile target implementation t.bc
  2. Link t.bc and libomptarget-[arch].bc to libomptarget.bc
  3. Compile user code, which is also multiple steps. libomptarget.bc is fed into FE in this step.
  4. Remaining steps...
clang/lib/Driver/ToolChains/Clang.cpp
1204

can we just use one header with different macros, like what we're using now?

JonChesterfield abandoned this revision.Jan 26 2021, 5:06 PM

Abandoned in favour of multiple instantiations of the devicertl, which works across all languages and doesn't require hacks to clang