This is an archive of the discontinued LLVM Phabricator instance.

[RFC][flang] Experimental device build of Flang runtime.
ClosedPublic

Authored by vzakhari on May 22 2023, 5:22 PM.

Details

Summary

These are initial changes to experiment with building the Fortran runtime
as a CUDA or OpenMP target offload library.

The initial patch defines a set of macros that have to be used consistently
in Flang runtime source code so that it can be built for different
offload devices using different programming models (CUDA, HIP, OpenMP target
offload). Currently supported modes are:

  • CUDA: Flang runtime may be built as a fatlib for the host and a set of CUDA architectures specified during the build. The packaging of the device code is done by the CUDA toolchain and may differ from toolchain to toolchain.
  • OpenMP offload:
    • host_device mode: Flang runtime may be built as a fatlib for the host and a set of OpenMP offload architectures. The packaging of the device code is done by the OpenMP offload compiler and may differ from compiler to compiler.

OpenMP offload 'nohost' mode is a TODO, intended to match the build setup
of libomptarget/DeviceRTL. Flang runtime will be built as an LLVM bitcode
library using the Clang/LLVM toolchain. The host part of the library
will be "empty", so there will be two distributable objects: the host
Flang runtime, and a dummy host library carrying the device Flang runtime
pieces packaged using clang-offload-packager and clang.

In all supported modes, enabling parts of Flang runtime for device
compilation can be done incrementally, to keep the individual patches
reviewable. Note that at any point in time the resulting library may have
unresolved references to parts of Flang runtime that have not yet been enabled.

Example cmake/make commands for building with Clang for NVPTX target:
cmake \
-DFLANG_EXPERIMENTAL_CUDA_RUNTIME=ON \
-DCMAKE_CUDA_ARCHITECTURES=80 \
-DCMAKE_C_COMPILER=/clang_nvptx/bin/clang \
-DCMAKE_CXX_COMPILER=/clang_nvptx/bin/clang++ \
-DCMAKE_CUDA_COMPILER=/clang_nvptx/bin/clang \
/llvm-project/flang/runtime/
make -j FortranRuntime

Example cmake/make commands for building with Clang OpenMP offload:
cmake \
-DFLANG_EXPERIMENTAL_OMP_OFFLOAD_BUILD="host_device" \
-DCMAKE_C_COMPILER=clang \
-DCMAKE_CXX_COMPILER=clang++ \
-DFLANG_OMP_DEVICE_ARCHITECTURES="sm_80" \
../flang/runtime/
make -j FortranRuntime

Diff Detail

Event Timeline

vzakhari created this revision.May 22 2023, 5:22 PM
Herald added a project: Restricted Project.
vzakhari requested review of this revision.May 22 2023, 5:22 PM
vzakhari edited the summary of this revision. (Show Details)
vzakhari edited the summary of this revision. (Show Details)May 22 2023, 5:25 PM
domada added a subscriber: domada.May 23 2023, 3:19 AM

Tagging a few more potential reviewers. Thanks for posting this.

Is the current plan for AMDGPU to compile this as HIP?

Long term, are there plans to move off CUDA as the source language, e.g. to freestanding C++ or the device-only variant of OpenMP, which essentially serves as a C++ dialect with GPU extensions for conjuring specific IR?

jhuber6 added a comment.EditedMay 23 2023, 6:02 AM

As mentioned in the RFC, I think the easiest way to accomplish this would be to use the "new driver" that's currently opt-in for CUDA. I would also recommend using OpenMP instead if we want this library to be generic, but we may also be able to have a separate build using HIP. There's some documentation on the compilation pipeline for offloading at https://clang.llvm.org/docs/OffloadingDesign.html and a talk at https://www.youtube.com/watch?v=4NnzymmQe7k. Right now, to support multiple GPUs, we simply provide redundant builds. It's inefficient space-wise, but it's the simplest solution to cross-architecture compatibility without relying on mandatory LLVM passes or config libraries. Creating a library that can be linked with the proposed Fortran pipeline would work as follows:

clang++ -x cu cuda.cpp --offload-arch=sm_70,sm_80 -foffload-lto -fvisibility=hidden --offload-new-driver -fgpu-rdc -c
llvm-ar rcs libcuda.a cuda.o

This will create a fatbinary containing code for sm_70 and sm_80, similar to CUDA's support. We use -foffload-lto to improve performance, and -fvisibility=hidden allows LTO to optimize out unused symbols. We should then be able to link this libcuda.a with the Fortran application the user is trying to compile. The only change needed here is a special option to make the kind metadata in the fat binary empty, so we don't try to build code for the CUDA runtime when we link it.

klausler added inline comments.May 23 2023, 8:31 AM
flang/include/flang/Runtime/entry-names.h
30

As I mentioned, "decl" seems weird to me when applied to both declarations and definitions.

I suggest "RTENTRY" or "RTAPI", but please feel free to retain RTDECL if you think it is best.

flang/runtime/device-tools.h
15

file*

25

This could be called "fill_n" in the Fortran::runtime namespace, and defined via "using std::fill_n;" by default.

vzakhari marked 2 inline comments as done.Jun 2 2023, 12:42 PM

Thank you all for the reviews and the comments on discourse. I will upload the updated changes shortly. Please let me know if the new approach could work for all of us.

flang/include/flang/Runtime/entry-names.h
30

Thank you for the review, Peter!

I decided to go with two macros: RTDECL and RTDEF. Having the two does not make much sense in the context of this patch, but it may help with other applications. For example, I wanted to experiment with Clang's cpu_dispatch feature for the Flang runtime exported entry points. With the two macros I could expand RTDECL into __attribute__((cpu_dispatch(cpu1,cpu2,generic))) and RTDEF into __attribute__((cpu_specific(cpu1,cpu2,generic))).
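A hypothetical sketch of that experiment, kept as a header fragment (cpu_dispatch/cpu_specific are Clang/ICC x86 multiversioning attributes, so a portable build needs a fallback; the CPU names here are placeholders):

```cpp
// Illustrative only -- not the committed RTDECL/RTDEF definitions.
#if defined(__has_attribute)
#if __has_attribute(cpu_dispatch) && defined(__x86_64__)
#define RT_HAS_CPU_DISPATCH 1
#endif
#endif

#ifdef RT_HAS_CPU_DISPATCH
// Declarations become the dispatcher; each definition targets one CPU.
#define RTDECL __attribute__((cpu_dispatch(ivybridge, generic)))
#define RTDEF(cpu) __attribute__((cpu_specific(cpu)))
#else
// Non-x86 or non-Clang builds: plain declarations and definitions.
#define RTDECL
#define RTDEF(cpu)
#endif
```

With this split, headers would use RTDECL and the source files RTDEF, which is exactly why a single shared macro would not suffice.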

vzakhari updated this revision to Diff 527938.Jun 2 2023, 12:43 PM
vzakhari retitled this revision from [RFC][flang] Experimental CUDA build of Flang runtime. to [RFC][flang] Experimental device build of Flang runtime..
vzakhari edited the summary of this revision. (Show Details)
vzakhari edited the summary of this revision. (Show Details)Jun 2 2023, 12:44 PM

There will be instructions on the Flang webpage?

There will be instructions on the Flang webpage?

Yes, sure, I can add the build instructions for Flang runtime to https://github.com/llvm/llvm-project/blob/main/flang/docs/GettingStarted.md

Do we really need to support "old" CUDA style offloading/linking? If so, we also need HIP, SYCL, ...
I would suggest only to support the "new" offload linker. That's where Clang (driver) is headed anyway, I think.

If we did that, we could also drop the CUDA stuff, and benefit from fun things like "auto declare target" for all the internal functions we use. All we would really annotate as "__device__/declare target" is the interface. And we could properly internalize the rest as we do with the OpenMP DeviceRTL.

flang/include/flang/ISO_Fortran_binding.h
131

The fact that we now have some but not all members available on the device is not ideal, IMHO. I would have hoped the opposite approach would at least be tried: compile the entire thing for the device, and opt out where necessary.

flang/runtime/terminator.h
39

FWIW, we should just support them (by defining our own "ABI", one for all GPU archs). Basically, the "printf" special handling generalized.

Do we really need to support "old" CUDA style offloading/linking?

Yes, I would like to be able to build Flang runtime for NVIDIA GPUs so that the resulting format is compatible with nvfortran driver, i.e. the Flang runtime library can be linked using nvlink.

If so, we also need HIP, SYCL, ...

I added the new macros such that they can be used for HIP and SYCL as well. E.g. for HIP setting RT_API_ATTRS to __host__ __device__ should work as well as for CUDA; for SYCL, setting SYCL_EXTERNAL in RTDEF should probably be enough. The CMake changes will probably differ depending on the used model, but at least the source code will be somewhat uniform.

I would suggest only to support the "new" offload linker. That's where Clang (driver) is headed anyway, I think.

If we would do that, we could also drop the CUDA stuff, and benefit from fun things like "auto declare target" for all the internal functions we use. All we would really annotate as "__device__/declare target" is the interface.

Actually, the "auto declare target" is already used in this patch. "declare target" is only in effect for the definitions (see RT_EXT_API_GROUP_BEGIN/END usage in transformational.cpp).

And we could properly internalize the rest as we do with the OpenMP DeviceRTL.

I am not sure what you mean by this, could you please clarify what you mean by "internalize"?

vzakhari added inline comments.Jun 2 2023, 4:26 PM
flang/include/flang/ISO_Fortran_binding.h
131

I was trying to minimize the changes for the initial patch. Going forward, I agree that enabling all class methods or even the whole single file makes more sense.

tblah added a subscriber: tblah.Jun 5 2023, 2:00 AM
klausler accepted this revision.Jun 8 2023, 9:08 AM
This revision is now accepted and ready to land.Jun 8 2023, 9:08 AM
vzakhari updated this revision to Diff 529768.Jun 8 2023, 4:17 PM

Updated build doc and rebased.

vzakhari updated this revision to Diff 530116.Jun 9 2023, 4:53 PM

Reverted accidentally removed __SIZEOF_INT128__ guard. This should fix the Windows build.

Matt added a subscriber: Matt.Jun 21 2023, 11:40 AM
vzakhari updated this revision to Diff 534717.Jun 26 2023, 12:54 PM

Rebase before merging.

This revision was automatically updated to reflect the committed changes.