This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Replace function pointer uses in GPU state machine
Closed, Public

Authored by jdoerfert on Jul 6 2020, 6:07 PM.

Details

Summary

In non-SPMD mode we create state-machine-like code to identify the
parallel region the GPU worker threads should execute next. The
identification uses the parallel region function pointer, as that allows
it to work even if the kernel (=target region) and the parallel region
are in separate TUs. However, taking the address of a function comes
with various downsides. With this patch we identify the most common
situation and replace the function pointer use with a dummy global
symbol (for identification purposes only). That is, if the parallel
region is only called from a single target region (or kernel), we
identify it not by its function pointer but by a new global symbol.

Fixes PR46450.
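
For illustration, a minimal C-style sketch of the before/after shape described above (all names are invented for this example; the actual transformation operates on LLVM IR emitted by the OpenMP codegen):

// Before: the worker state machine compares the work-function pointer
// against the parallel region's address, so the address is taken.
extern void parallel_region(void); // outlined parallel region (invented name)

void worker_before(void (*WorkFn)(void)) {
  if (WorkFn == &parallel_region) // address-taken use this patch eliminates
    parallel_region();            // direct call once identified
}

// After: a dummy global acts as the identifier. Its value is irrelevant;
// only its unique address matters. The region is still called directly,
// but its address is no longer taken anywhere.
char parallel_region_ID;

void worker_after(void *WorkFnID) {
  if (WorkFnID == (void *)&parallel_region_ID)
    parallel_region();
}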

Diff Detail

Event Timeline

jdoerfert created this revision. Jul 6 2020, 6:07 PM
arsenm added a subscriber: arsenm. Jul 6 2020, 6:29 PM
arsenm added inline comments.
llvm/lib/Transforms/IPO/OpenMPOpt.cpp
970–971

if (CachedKernel)
  return *CachedKernel;

978

*CachedValue
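
For context, the caching shape these two suggestions point at looks roughly like this (a minimal sketch; the Kernel typedef, the map name, and the surrounding logic are assumptions based on the comments, not the patch verbatim):

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/Optional.h"
#include "llvm/IR/Function.h"

using Kernel = llvm::Function *; // assumed typedef for a kernel entry

// Cache: None = not computed yet, nullptr = known to have no unique kernel.
static llvm::DenseMap<llvm::Function *, llvm::Optional<Kernel>> UniqueKernelMap;

static Kernel getUniqueKernelFor(llvm::Function &F) {
  llvm::Optional<Kernel> &CachedKernel = UniqueKernelMap[&F];
  if (CachedKernel)
    return *CachedKernel; // early return on a cache hit, as suggested
  CachedKernel = nullptr; // placeholder; the real code walks F's uses here
  return *CachedKernel;
}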

llvm/test/Transforms/OpenMP/gpu_state_machine_function_ptr_replacement.ll
41

These tests seem really big

279–281

Mostly unneeded metadata?

JonChesterfield added a comment.

That's interesting. AMDGPU does not handle function pointers well and I suspect NVPTX has considerable performance overhead for them too. If a parallel region is only called from a single target region, it is always passed the same function pointer, so we can specialise the state machine. I think this machinery is equivalent to specialising the parallel region call.

The general case involves calling one parallel region runtime function with various different function pointers. Devirtualising that is fairly difficult. For another time.

For this simpler case, I think this transform is equivalent to specialising the various kmpc*parallel calls on a given function pointer. The callees are available when using a bitcode deviceRTL.

IIRC function specialisation / partial evaluation is one of the classic compiler optimisations that LLVM doesn't really do. It's difficult to define a good cost model, and C exposes function pointer comparison. What we could implement for this is an attribute-driven one, where we mark the function pointer arguments in the deviceRTL with such an attribute and use LTO. Avoid specialising a function whose address escapes.
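
To make that concrete, an attribute-driven variant might look roughly like the following sketch (entirely hypothetical: the annotation string, names, and runtime signature are invented, not an existing deviceRTL API):

// DeviceRTL side: mark the entry whose function-pointer argument we are
// willing to specialise on. "specialize_fnptr" is a made-up annotation,
// not an existing Clang or LLVM attribute.
__attribute__((annotate("specialize_fnptr")))
void rtl_parallel(void (*WorkFn)(void)) {
  /* bookkeeping ... */
  WorkFn(); // indirect call that specialisation would make direct
}

// What an LTO-time specialisation pass could conceptually produce when the
// only pointer ever passed is &region0 (and region0's address never
// otherwise escapes):
extern void region0(void);
void rtl_parallel_region0(void) {
  /* bookkeeping ... */
  region0(); // direct call; no address-taken use remains
}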

I like this patch. It's a clear example of an effective OpenMP-specific optimisation. It just happens to run very close to specialisation, which may not be that much harder to implement if we cheat on the cost model.

tianshilei1992 added inline comments. Jul 6 2020, 7:44 PM
llvm/lib/Transforms/IPO/OpenMPOpt.cpp
1087

Probably we need to set Changed to true here?

jdoerfert updated this revision to Diff 275927. Jul 7 2020, 12:12 AM
jdoerfert marked 6 inline comments as done.

Addressed comments

> That's interesting. AMDGPU does not handle function pointers well and I suspect NVPTX has considerable performance overhead for them too. If a parallel region is only called from a single target region, it is always passed the same function pointer, so we can specialise the state machine. I think this machinery is equivalent to specialising the parallel region call.

The problem here was the spurious call edge from an unrelated kernel to the outlined parallel function. ptxas then needed more registers for a trivial kernel as it was "thought" to call the outlined function.
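
A hypothetical sketch of that spurious edge (invented names): because the outlined function's address is taken, every indirect call in every kernel's state machine is assumed to possibly reach it:

// A register-hungry outlined region; its address is taken by the codegen
// of the kernel it actually belongs to (in another TU, perhaps).
extern void heavy_outlined_region(void);

// An unrelated, trivial kernel: its state machine still contains an
// indirect call, and ptxas must assume that call may reach any
// address-taken function, heavy_outlined_region included, so the trivial
// kernel's register allocation is inflated.
void trivial_kernel_worker(void (*WorkFn)(void)) {
  if (WorkFn)
    WorkFn();
}
// Once no address-taken use of heavy_outlined_region survives, the set of
// possible indirect-call targets shrinks and the spurious edge disappears.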

> The general case involves calling one parallel region runtime function with various different function pointers. Devirtualising that is fairly difficult. For another time.

> For this simpler case, I think this transform is equivalent to specialising the various kmpc*parallel calls on a given function pointer. The callees are available when using a bitcode deviceRTL.

> IIRC function specialisation / partial evaluation is one of the classic compiler optimisations that LLVM doesn't really do. It's difficult to define a good cost model, and C exposes function pointer comparison. What we could implement for this is an attribute-driven one, where we mark the function pointer arguments in the deviceRTL with such an attribute and use LTO. Avoid specialising a function whose address escapes.

> I like this patch. It's a clear example of an effective OpenMP-specific optimisation. It just happens to run very close to specialisation, which may not be that much harder to implement if we cheat on the cost model.

Specialization is (soonish) coming to the Attributor ;)


llvm/test/Transforms/OpenMP/gpu_state_machine_function_ptr_replacement.ll
279–281

Interestingly, that is what our device runtime looks like. For reasons I haven't understood yet, it has all these "null is aligned" annotations. CUDA is weird.

Anyway, I can strip this down too.

jdoerfert edited the summary of this revision. Jul 7 2020, 5:08 AM
JonChesterfield accepted this revision. Jul 7 2020, 4:57 PM

I haven't been able to apply this to the aomp tree (for reasons unrelated to this patch), but by inspection I think it's sound. I like the conservative pattern matching approach.

The function pointer specialisation alternative is more complicated than I suggested above: because the pointer gets stored in local state and then loaded, it isn't readily available at each call site to specialise on.
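
A minimal sketch of that obstacle, with invented names: the pointer is written to shared state by one thread and loaded by the workers, so no call site carries a constant function-pointer argument to specialise on:

// Shared slot written by the main thread, read by the workers.
static void (*SharedWorkFn)(void);

extern void parallel_region(void);

void main_thread(void) {
  SharedWorkFn = &parallel_region; // stored into shared state ...
}

void worker_thread(void) {
  void (*WorkFn)(void) = SharedWorkFn; // ... and loaded here, not passed
  if (WorkFn)
    WorkFn(); // indirect call on a loaded value; nothing to specialise on
}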

llvm/test/Transforms/OpenMP/gpu_state_machine_function_ptr_replacement.ll
41

Agreed. I wonder if it's worth restructuring the OpenMP codegen to favour emitting functions instead of blocks with interesting control flow, such that tests like these look more like a linear sequence of named function calls. Said functions would then be inlined downstream of the codegen to produce the same IR we see here.

This revision is now accepted and ready to land. Jul 7 2020, 4:57 PM
This revision was automatically updated to reflect the committed changes.