There are various runtime calls in the device runtime with unused, or
always fixed, arguments. This is bad for all sorts of reasons. Clean up
two before as we match them in OpenMPOpt now.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
This is definitely not NFC and breaks API compatibility (but apparently nobody cares anymore?).
This is the device RTL. I am not aware we (want to) keep the API stable. If we are, I'm not sure why:
- Dynamic linking (among other things) is not really an option so people that link against the device runtime (should) do so statically.
- Linking against an old device runtime with a new clang seems unreasonable to me. If you replace clang you must replace the static runtime as the new clang might use new functions.
Same response as above. What is the use case here which we want to continue to support?
clang/lib/CodeGen/CGOpenMPRuntimeNVPTX.cpp | ||
---|---|---|
42 | I think, instead the optimizer can try to detect if the runtime library is used by the kernel and switch this flag to 0 if no runtime calls are used in the kernel. For non-SPMD mode in most cases, the runtime is required, but in some cases, it can be disabled. |
I'm not sure we have a consensus on api stability. Usually llvm allows mixing libraries and compilers from different sources, e.g. the various libunwind or compiler-rt vs libgcc. Libomptarget in general appears to be considered fixed and has external users (intel, maybe gcc).
The device runtime would be ill served by this default. This is the only openmp device runtime library which works with llvm. It's statically linked, usually as bitcode when performance is important. The code used to handle target offloading for nvptx is spread across the compiler and the runtime, probably not optimally.
I'm not familiar with the gcc-nvptx-openmp implementation. Reading through https://gcc.gnu.org/wiki/Offloading strongly suggests a totally independent implementation to this one. I don't think gcc can be using this runtime library for nvptx. It definitely doesn't for amdgcn. Proprietary compilers might be using this code, but we can have no duty of care to toolchains that use this code without telling us they're doing so.
Therefore the only backwards/forwards compatibility we can strive for is between different versions of clang and the device runtime. That seems potentially useful - one could use a release clang binary while working on the deviceRTL or vice versa. It's an expensive developer convenience though.
We would pay with things like the API rot fixed above. Introducing a faster lowering for an openmp construct would mean a redundant path through clang and some version checking to guess which runtime library we're targeting, which is not presently versioned. Likewise moving code between compiler and runtime becomes much more expensive to implement. Getting it wrong is inevitable given our test coverage.
I think the project is much better served by assuming that the runtime library used by clang is the one from the same hash in the monorepo. That leaves us free to fix debt and improve performance, at the price of needing to build clang from (near to) trunk while developing the rtl.
Perhaps we can embrace API stability later on, once we have things like versioning and a solid optimisation pipeline in place, especially if gcc wants to use the deviceRTL for nvptx. Now is too early.
clang/lib/CodeGen/CGOpenMPRuntimeNVPTX.cpp | ||
---|---|---|
42 | If we can detect that no runtime calls are used, we should be able to do better than passing a different argument. E.g. delete some setup calls. Failing that, if we want to pass an argument which says 'actually don't do any work', it shouldn't be the same argument used to check whether the runtime has been initialised. |
No, I'm not. Long before that, we relied on the API stability and already have some runtime calls marked as deprecated. Especially taking into account, that libomp can be built separately.
clang/lib/CodeGen/CGOpenMPRuntimeNVPTX.cpp | ||
---|---|---|
42 | No, I don't think you can do this in all cases |
I don't think gcc can be using this runtime library for nvptx.
Yes, and: We are (going to) use clang specific intrinsics to avoid CUDA (soon).
Use of the new library with the previous version of the compiler.
Except that you cannot generally expect this to work. In our supported use case the library is kept as bitcode (LLVM-IR). Bitcode is not backward compatible. An old toolchain (clang, llvm-link, ...) cannot be fed new IR and be expected to work. So, we are already not able to give a stability guarantee here, why pretend we do. The bitcode runtime has to be kept in-sync with the toolchain.
clang/lib/CodeGen/CGOpenMPRuntimeNVPTX.cpp | ||
---|---|---|
42 | We can detect all we want but switching it *does not have any effect*. |
Yes, the existing v# naming and deprecated comments should also go.
What can libomp be built by separately? Nvcc and gcc don't use this runtime. That leaves us with downstream proprietary compilers derived from clang that are already stuck carrying extensive compatibility patches and usually ship as one large toolchain blob which only needs to be internally self consistent.
Especially taking into account, that libomp can be built separately.
This is *not* affecting libomp in any way.
libomptarget and device runtimes are part of libomp. If you're going to remove some params, you'll need to modify the runtime functions too, no?
Answered already: the previous version of the compiler with the new version of the runtime.
Still cannot be expected to work: https://reviews.llvm.org/D83268#2135951
Are there other use cases?
There is still compatibility between clang10 and clang11. Or they are incompatible in LLVM IR level? Also, there was a mode (I don't remember if it was removed or not) where the runtime library could be linked as .a library, without LLVM IR inlining.
That is the point. They might or might not be, right? There is no guarantee they are.
Also, there was a mode (I don't remember if it was removed or not) where the runtime library could be linked as .a library, without LLVM IR inlining.
That mode is deprecated.
Better to ask the users. Maybe, send an RFC to openmp-devs?
Also, there was a mode (I don't remember if it was removed or not) where the runtime library could be linked as .a library, without LLVM IR inlining.
That mode is deprecated.
As mentioned before, no it is not.
Despite the similarity in name, libomp and libomptarget are distinct libraries, this was a conscious design choice.
FWIW, this patch does *not* modify libomptarget either. This modifies *the device runtime*, aka. libomptarget-nvptx-sm_XXX.bc.
I'm neither, and I've long argued that being able to build the OpenMP runtime(s) without Clang trunk is an important use case. These arguments have gone largely unheard, so I'll not join this discussion once more.
__kmpc_spmd_kernel_init is always called with RequiresDataSharing == 0
Specifically, it's only called from clang, and emitSPMDEntryHeader unconditionally passes zero to it
I.e. I think there's more stuff that can be cleaned up in the theme of the above, suggest in later patches
This patch breaks compilation of previously working code.
I added the following to openmp/libomptarget/test/offloading/offloading_success.c:
+// RUN: %libomptarget-compile-run-and-check-nvptx64-nvidia-cuda
which results in
# command stderr: ptxas offloading_success-openmp-nvptx64-nvidia-cuda.s, line 173; error : Call has wrong number of parameters ptxas fatal : Ptx assembly aborted due to errors clang-12: error: ptxas command failed with exit code 255 (use -v to see invocation)
The file offloading_success-openmp-nvptx64-nvidia-cuda.s contains:
.func (.param .b32 func_retval0) __kmpc_kernel_parallel ( .param .b64 __kmpc_kernel_parallel_param_0, .param .b32 __kmpc_kernel_parallel_param_1 ) ;
The same file then calls the function with one argument:
call.uni (retval0), __kmpc_kernel_parallel, ( param0 );
For the clang 11 release, we should either fix the codegen or revert this patch.
I carefully made sure, that the freshly built clang was used to execute the test. I opened https://bugs.llvm.org/show_bug.cgi?id=46836 to track the issue and made it release blocker.
I think, instead the optimizer can try to detect if the runtime library is used by the kernel and switch this flag to 0 if no runtime calls are used in the kernel. For non-SPMD mode in most cases, the runtime is required, but in some cases, it can be disabled.