This is an archive of the discontinued LLVM Phabricator instance.

[HIP][LLVM][Opt][AMDGPU][RFC] Add LLVM support for C++ Parallel Algorithm Offload
ClosedPublic

Authored by AlexVlx on Jul 20 2023, 9:24 AM.

Details

Summary

This patch adds the LLVM changes needed by the standard algorithm offload feature being proposed here: https://discourse.llvm.org/t/rfc-adding-c-parallel-algorithm-offload-support-to-clang-llvm/72159/1. The verbose documentation is included in the head of the patch series, with all other patches targetting Clang. What we do here is add two passes, one mandatory and one optional:

  1. HipStdParAcceleratorCodeSelectionPass is mandatory, depends on CallGraphAnalysis, and implements the following transform:
    • Traverse the call-graph, and check for functions that are roots for accelerator execution (at the moment, these are GPU kernels exclusively, and would originate in the accelerator specific algorithm library the toolchain uses as an implementation detail);
    • Starting from a root, do a BFS to find all functions that are reachable (called directly or indirectly via a call- chain) and record them;
    • After having done the above for all roots in the Module, we have the computed the set of reachable functions, which is the union of roots and functions reachable from roots;
    • All functions that are not in the reachable set are removed; for the special case where the reachable set is empty we completely clear the module;
  2. HipStdParAllocationInterpositionPass is optional, is meant as a fallback with restricted functionality for cases where on-demand paging is unavailable on a platform, and implements the following transform:
    • Iterate all functions in a Module;
    • If a function's name is in a predefined set of allocation / deallocation that the runtime implementation is allowed and expected to interpose, replace all its uses with the equivalent accelerator aware function, iff the latter is available;
      • If the accelerator aware equivalent is unavailable we warn, but compilation will go ahead, which means that it is possible to get issues around the accelerator trying to access inaccessible memory at run time;
    • We rely on direct name matching as opposed to using the new alloc-kind family of attributes and / or the LibCall analysis pass because some of the legacy functions that need replacing would not carry the former or be identified by the latter.

This concludes the patch set around adding support for HIP C++ Parallel Algorithm Offload.

Diff Detail

Event Timeline

AlexVlx created this revision.Jul 20 2023, 9:24 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 20 2023, 9:24 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
AlexVlx requested review of this revision.Jul 20 2023, 9:24 AM
AlexVlx updated this revision to Diff 544969.Jul 27 2023, 3:45 PM
AlexVlx removed a reviewer: Anastasia.

Add support for dealing with builtins that are unavailable on the target.

AlexVlx updated this revision to Diff 547039.Aug 3 2023, 4:07 PM

Update diff, handling of globals was wrong in that it'd mess up extern __shared__ cases.

AlexVlx updated this revision to Diff 549166.Aug 10 2023, 2:06 PM

Add support for handling incompatible ASM.

AlexVlx updated this revision to Diff 551299.Aug 17 2023, 3:29 PM

Add handling of thread_local variables that are used by accelerator reachable functions: since this is currently unsupported, we error out with as tidy a diagnostic as possible.

AlexVlx updated this revision to Diff 552687.Aug 23 2023, 6:15 AM
AlexVlx retitled this revision from [LLVM][Opt][RFC] Add LLVM support for C++ Parallel Algorithm Offload to [HIP][LLVM][Opt][RFC] Add LLVM support for C++ Parallel Algorithm Offload.
AlexVlx edited the summary of this revision. (Show Details)

Updating this to reflect the outcome of the RFC, which is that we're adding it as a HIP extension exclusively. Added support for handling thread_local, which is unsupported.

Are these passes called when compiling C++ to bitcode? If so, then device functions not called by kernels in the same module will be removed, right? Then we cannot support calling device functions in a different TU. Should these passes be moved to the llvm codegen pipelines so that they will only be called post-linking?

llvm/lib/Transforms/HipStdPar/HipStdPar.cpp
63
74

Is this condition correct?

AMDGPU backend supports non-constant TLS?

136

Any reason to change the linkage to weak?

Are these passes called when compiling C++ to bitcode? If so, then device functions not called by kernels in the same module will be removed, right? Then we cannot support calling device functions in a different TU. Should these passes be moved to the llvm codegen pipelines so that they will only be called post-linking?

The selection pass is run post-linking, please see https://reviews.llvm.org/D155775, more specifically the changes done to how we form the lld invocation. The allocator interposition stuff is run / driven by Clang, however it is not predicated on having the full module available / running after linking, and wouldn't benefit from it.

llvm/lib/Transforms/HipStdPar/HipStdPar.cpp
63

Will fix, thanks.

74

Typo, will fix, good catch!

136

Yes, this is a precursor to adding actual support for globals, which will entail binding them. Since at the moment they are not handled (will require a RT addition), we're giving them weak linkage toe eschew the need to define (bind) them at code object load time, otherwise for declarations this'd lead to a load time error. The other option would be to turn all declarations into definitions, with the initialiser being poison (but then this'd need to be undone when we add support).

AlexVlx added inline comments.Sep 12 2023, 4:24 PM
llvm/lib/Transforms/HipStdPar/HipStdPar.cpp
74

Actually, scratch that, I misremembered -> the predicate should read as Is this Constant Used? and not Is only ConstantUsed i.e. it returns true iff there are non-constant uses, so we only early out iff there are ONLY constexpr uses. So yes, the condition is correct, but the predicate is confusing, a bit:)

AlexVlx updated this revision to Diff 556686.Sep 13 2023, 9:41 AM

Use correct capitalisation in helper.

AlexVlx marked an inline comment as done.Sep 13 2023, 9:41 AM
yaxunl accepted this revision.Sep 13 2023, 12:05 PM

LGTM. Thanks.

This revision is now accepted and ready to land.Sep 13 2023, 12:05 PM
JonChesterfield accepted this revision.Sep 15 2023, 6:52 AM

LG here too, thanks for working on this

This revision was landed with ongoing or failed builds.Oct 10 2023, 4:02 AM
This revision was automatically updated to reflect the committed changes.
AlexVlx updated this revision to Diff 557671.Oct 10 2023, 5:51 AM
AlexVlx retitled this revision from [HIP][LLVM][Opt][RFC] Add LLVM support for C++ Parallel Algorithm Offload to [HIP][LLVM][Opt][AMDGPU][RFC] Add LLVM support for C++ Parallel Algorithm Offload.
AlexVlx removed reviewers: tra, jlebar, pekka.jaaskelainen.

Fix SHARED_LIB build.

AlexVlx updated this revision to Diff 557687.Oct 12 2023, 2:39 AM

Fix OCaml build.