This is an archive of the discontinued LLVM Phabricator instance.

[libomptarget] Move supporti.h to support.cu
Closed, Public

Authored by JonChesterfield on Nov 12 2019, 9:43 AM.

Details

Summary

[libomptarget] Move supporti.h to support.cu
Reimplementation of D69652, without the unity build and refactors.
Will need a clean build of libomptarget as the CMakeLists changed.

Diff Detail

Event Timeline

Herald added a project: Restricted Project.

What about the performance with this version of the runtime?

The nvcc-built runtime will be slower as nvcc doesn't support LTO. Fixable by changing to a unity build in a subsequent patch.

The bitcode library itself will be slightly less optimised as we combine the bitcode with llvm-link but don't run any subsequent optimisation passes on it. After the bitcode library is combined with the application, we'll have the same code as before. I don't think losing the always_inline attribute (which was implied by forceinline) makes any difference here as the functions are small.
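
For illustration, a minimal sketch of the kind of helper affected, using hypothetical names rather than the real supporti.h functions: the old header-style definition carries the implied always_inline, while the new out-of-line definition relies on ordinary inlining heuristics once the bitcode library is linked into the application.

```cuda
// Illustration only: hypothetical helpers in the style of supporti.h /
// support.cu, not the real deviceRTL functions.

// Old layout: defined in the header, always_inline implied by __forceinline__.
static __forceinline__ __device__ unsigned warpIdHeaderStyle() {
  return threadIdx.x / warpSize;
}

// New layout: defined out of line in support.cu. There is no always_inline,
// but the body is still visible once the bitcode library is linked into the
// application, so a function this small should be inlined by the usual
// heuristics anyway.
__device__ unsigned warpIdOutOfLine() {
  return threadIdx.x / warpSize;
}
```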

We don't have any bitcode library level optimisations at present. I'd like to internalise the non-api symbols and run opt across it during the build. That's essentially a build time optimisation for the clients of the library and orthogonal to this change.

jdoerfert accepted this revision. Nov 12 2019, 5:46 PM

Assuming this doesn't break the build as the old version did, it still looks good to me.

What about the performance with this version of the runtime?

The nvcc-built runtime will be slower as nvcc doesn't support LTO. Fixable by changing to a unity build in a subsequent patch.

That is fine. Cleaning up the code structure and preparing for bigger changes is way more important than small variations in our untracked and non-release version performance.

The bitcode library itself will be slightly less optimised as we combine the bitcode with llvm-link but don't run any subsequent optimisation passes on it. After the bitcode library is combined with the application, we'll have the same code as before. I don't think losing the always_inline attribute (which was implied by forceinline) makes any difference here as the functions are small.

Arguably, we need to look into issues like "not inlined" in device compilation separately and more generically. Marking all runtime functions always_inline but not the user code is less helpful than adjusting the inline threshold based on the target.

We don't have any bitcode library level optimisations at present. I'd like to internalise the non-api symbols and run opt across it during the build. That's essentially a build time optimisation for the clients of the library and orthogonal to this change.

I'm unsure what you mean in the first part. Internalizing symbols and aggressively removing unneeded parts of the device runtime is for sure what we want to do, maybe even for API symbols.

This revision is now accepted and ready to land. Nov 12 2019, 5:46 PM

Assuming this doesn't break the build as the old version did, it still looks good to me.

It does not break the build, I checked it.

Arguably, we need to look into issues like "not inlined" in device compilation separately and more generically. Marking all runtime functions always_inline but not the user code is less helpful than adjusting the inline threshold based on the target.

I'd go further. "inlined" can also be an issue, either where it produces N copies of a slow function (e.g. printf) where there should be one or where it thrashes the instruction cache. Beyond the scope of this patch, but agreed that it's a task.
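
A minimal sketch of that case, again with a hypothetical helper: a function whose body is large or slow is better kept as a single out-of-line copy than inlined at every call site.

```cuda
// Illustration only: a hypothetical slow helper where inlining hurts rather
// than helps.
#include <cstdio>

// A single out-of-line copy avoids duplicating the body at every call site
// and bloating the instruction cache.
__device__ __noinline__ void slowDebugPrint(const char *Msg, int Value) {
  // Device-side printf is slow regardless; one shared copy is plenty.
  printf("%s: %d\n", Msg, Value);
}
```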

We don't have any bitcode library level optimisations at present. I'd like to internalise the non-api symbols and run opt across it during the build. That's essentially a build time optimisation for the clients of the library and orthogonal to this change.

I'm unsure what you mean in the first part. Internalizing symbols and aggressively removing unneeded parts of the device runtime is for sure what we want to do, maybe even for API symbols.

The nvptx deviceRTL is N individually optimised bitcode files combined with llvm-link. Anything that opt -O2 or similar would do to optimise across the previously separate files is currently repeated for every application at application compile time, whereas we could reasonably do it once, by invoking opt before installing the bitcode library. This extra opt pass would probably change the code in the final executable in minor ways, but the motivation is link time reduction for the application code.

Symbol internalizing is great. Non-inlined functions can have bespoke calling conventions, symbols in application code can no longer collide with those in the library, dead stuff can be discarded. We have a bitcode support library that can probably be merged into deviceRTL and then itself entirely internalised. Loads of good options from postponing machine code generation.
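
A minimal sketch of the intended effect, with hypothetical names; in practice this would be applied to the llvm-linked bitcode (an internalize-style pass keeping only the exported entry points) rather than written in the source, but the resulting linkage split is the same.

```cuda
// Illustration only: the linkage split that internalisation produces.
// Names are hypothetical, not real deviceRTL entry points.

// Exported entry point: keeps external linkage so application code can
// reach it after the bitcode library is linked in.
extern "C" __device__ void __hypothetical_omp_entry(int Arg);

// Non-API helper: internal linkage, so it cannot collide with application
// symbols, can use bespoke calling conventions, and is dropped if unused.
static __device__ int internalHelper(int Arg) { return Arg * 2; }

extern "C" __device__ void __hypothetical_omp_entry(int Arg) {
  (void)internalHelper(Arg);
}
```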

It does not break the build, I checked it.

That's a huge relief, thank you.

I've got the bitcode libraries building on my nvptx machine again, which should decrease the frequency of such problems. For the curious, the CMake logic looks for llvm-link next to clang and gives up if it isn't there, which it wasn't on my machine. It might be worth making that CMake more verbose, at least printing out why it isn't building the bitcode library.

The performance justification here is a bit hand-wavy. Inferring performance changes from bitcode changes isn't ideal. I suspect that, in addition to targeted correctness tests for deviceRTL, we should have microbenchmarks doing things like invoking the reduction runtime call on various sizes of data.
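
A minimal sketch of the shape such a microbenchmark could take, timing a hypothetical stand-in rather than the real reduction entry point, with the size varied across runs.

```cuda
// Illustration only: microbenchmark shape for a deviceRTL-style routine,
// using a hypothetical stand-in instead of the real reduction runtime call.

__device__ static int hypotheticalReduce(const int *Data, int N) {
  int Sum = 0;
  for (int I = 0; I < N; ++I)
    Sum += Data[I];
  return Sum;
}

__global__ void benchReduce(const int *Data, int N, long long *Cycles,
                            int *Result) {
  long long Start = clock64();            // per-SM cycle counter
  *Result = hypotheticalReduce(Data, N);  // vary N across runs
  *Cycles = clock64() - Start;
}
```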

This revision was automatically updated to reflect the committed changes.