This is an archive of the discontinued LLVM Phabricator instance.

[HIP][Clang][CodeGen][RFC] Add codegen support for C++ Parallel Algorithm Offload
ClosedPublic

Authored by AlexVlx on Jul 20 2023, 8:25 AM.

Details

Summary

This patch adds the CodeGen changes needed by the standard algorithm offload feature being proposed here: https://discourse.llvm.org/t/rfc-adding-c-parallel-algorithm-offload-support-to-clang-llvm/72159/1, which will only be available for the HIP language on AMD targets. The verbose documentation is included in the head of the patch series. This change concludes the set of additions needed in Clang, and essentially relaxes restrictions on what gets emitted on the device path, when compiling in hipstdpar mode (after the previous patch relaxed restrictions on what is semantically correct):

  1. Unless a function is explicitly marked __host__, it will get emitted, whereas before only __device__ and __global__ functions would be emitted;
  2. Unsupported builtins are ignored as opposed to being marked as an error, as the decision on their validity is deferred to the hipstdpar specific code selection pass we are adding, which will be the topic of the final patch in this series;
  3. We add the stdpar specific passes to the opt pipeline, independent of optimisation level:
    • When compiling for the accelerator / offload device, we add a code selection pass;
    • When compiling for the host, iff the user requested it via the --hipstdpar-interpose-alloc flag, we add a pass which replaces canonical allocation / deallocation functions with accelerator aware equivalents.

A test to validate that unannotated functions get correctly emitted is added as well. Please note that __device__, __global__ and __host__ are used to match existing nomenclature, they would not be present in user code.

Diff Detail

Event Timeline

AlexVlx created this revision.Jul 20 2023, 8:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 20 2023, 8:25 AM
Herald added a subscriber: ormris. · View Herald Transcript
AlexVlx requested review of this revision.Jul 20 2023, 8:25 AM
efriedma requested changes to this revision.Jul 20 2023, 11:44 AM
efriedma added a subscriber: efriedma.
efriedma added inline comments.
clang/lib/CodeGen/CGBuiltin.cpp
5785

This doesn't make sense; we can't just ignore bits of the source code. I guess this is related to "the decision on their validity is deferred", but I don't see how you expect this to work.

This revision now requires changes to proceed.Jul 20 2023, 11:44 AM
AlexVlx added inline comments.Jul 20 2023, 2:08 PM
clang/lib/CodeGen/CGBuiltin.cpp
5785

This is one of the weirder parts, so let's consider the following example:

cpp
void foo() { __builtin_ia32_pause(); }
void bar() { __builtin_trap(); }

void baz(const vector<int>& v) {
    return for_each(par_unseq, cbegin(v), cend(v), [](auto&& x) { if (x == 42) bar(); });
}

In the case above, what we'd offload to the accelerator, and ask the target BE to lower, is the implementation of for_each, and bar, because it is reachable from the latter. foo is not reachable by any execution path on the accelerator side, however it includes a builtin that is unsupported by the accelerator (unless said accelerator is x86, which is not impossible, but not something we're dealing with at the moment). If we were to actually error out early, in the FE, in these cases, there's almost no appeal to what is being proposed, because standard headers, as well as other libraries, are littered with various target specific builtins that are not going to be supported. This all builds on the core invariant of this model / extension / thingamabob, which is that the algorithms, and only the algorithms, are targets for offload. It thus follows that as long as code that is reachable from an algorithm's implementation is safe, all is fine, but we cannot know this in the FE / on an AST level, because we need the actual CFG. This part is handled in LLVM in the SelectAcceleratorCodePass that's in the last patch in this series.

Now, you will quite correctly observe that there's nothing preventing an user from calling foo in the callable they pass to an algorithm; they might read the docs / appreciate that this won't work, but even there they are not safe, because there via some opaque call chain they might end up touching some unsupported builtin. My intuition here, which is reflected above in letting builtins just flow through, is that such cases are better served with a compile time error, which is what will obtain once the target BE chokes trying to lower an unsupported builtin. It's not going to be a beautiful error, and we could probably prettify it somewhat if we were to check after we've done the accelerator code selection pass, but it will happen at compile time. Another solution would be to emit these as traps (poison + trap for value returning ones), but I am concerned that it would lead to really fascinating debug journeys.

Having said this, if there's a better way to deal with these scenarios, it would be rather nice. Similarly, if the above doesn't make sense, please let me know.

efriedma added inline comments.Jul 20 2023, 3:27 PM
clang/lib/CodeGen/CGBuiltin.cpp
5785

Oh, I see; you "optimistically" compile everything assuming it might run on the accelerator, then run LLVM IR optimizations, then determine late which bits of code will actually run on the accelerator, which then prunes the code which shouldn't run.

I'm not sure I really like this... would it be possible to infer which functions need to be run on the accelerator based on the AST? I mean, if your API takes a lambda expression that runs on the accelerator, you can mark the lambda's body as "must be emitted for GPU", then recursively mark all the functions referred to by the lambda.

Emiting errors lazily from the backend means you get different diagnostics depending on the optimization level.

If you do go with this codegen-based approach, it's not clear to me how you detect that a forbidden builtin was called; if you skip the error handling, you just get a literal "undef".

AlexVlx added inline comments.Jul 21 2023, 5:25 AM
clang/lib/CodeGen/CGBuiltin.cpp
5785

I'm not sure I really like this... - actually, I am not a big fan either, however I think it's about the best one can do, given the constraints (consume standard C++, no annotations on the user side etc.). Having tried a few times in the past (and at least once in a different compiler), I don't quite think this can be done on an AST level. It would add some fairly awkward checking during template instantiation (no way to know earlier that a CallableFoo was passed to an offloadable algorithm), and it's a bit unwieldy to basically compute the CFG on the AST and mark reachable Callees at that point. Ignoring those, the main reason for which we cannot do this is that the interface is not constrained to only take lambdas, but callables in general, and that includes pointers to function as well. We don't deal with those today, but plan to, and there's a natural solution when operating on IR, assuming closed / internalised Modules (which is the case at least for AMDGPU at the moment). The final challenge pertains to the AST being per TU, with no cross-TU visibility, whereas with IR you can either pre-link the BC (implicitly or LTO) and then operate on the entire compilation. This is a problem with cases where foo defined in TU0 is reachable from algorithm_bar_offloaded_impl in TU1. So TL;DR, I think it would be more complex to do this on the AST and would end up more brittle / less future proof.

In what regards how to do deferred diagnostics, it think it can be done like this (I crossed streams in my prior reply when discussing this part, so it's actually nonsense): instead of emitting undef here, we can emit a builtin with the same signature, but with the name suffixed with e.g. (__stdpar_unsupported) or something similar. Then, when doing the reachability computation later, if we stumble upon a node in the CFG that contains a builtin suffixed with __stdpar_unsupported we error out, and can provide nice diagnostics since we'd have the call-chain handy. Thoughts?

Since we need to support -O0, we need to be prepared that we may not be able to remove all the calls of unsupported functions even though they may never be called at run time.

We could simply replace them with traps in the middle end. This should work if such functions are not called at run time. The only issue is that if they are called at run time, how do we tell users that they used unsupported functions and where. A trap does not help since it only crashes the program without useful information.

We could emit calls of any unsupported functions as calls of __clang_unsupported(file_name, line_number, function_name).

In the middle-end pass where we eliminate functions not referenced by kernels, we could emit reports about calls of __clang_unsupported under a certain -R option. We could turn on that option for -stdpar in clang driver.

We can emit printf of file_name, line_number and function_name for the first active lane then emit trap for a call of __clang_unsupported(file_name, line_number, function_name) under an option in the middle-end pass to facilitate users debugging their code.

@yaxunl interesting point - are you worried about cases where due to missing inlining / const prop an indirect call site that can be replaced with a direct one would remain indirect? I think the problem in that case would actually be different, in that possibly reachable functions would not be identified as such and would be erroneously removed. I'm not sure there's any case where we'd fail to remove a meant to be unreachable function. We can definitely go with the __clang_unsupported approach, but I think I'd prefer these to be compile time errors rather than remarks + runtime printf, not in the least because printf adds some overhead. A way to ensure we don't "miss a spot" might be to check after removal for any remaining unsupported builtins, instead of doing it during reachability computation (this is coupled with the special naming from the prior post).

In what regards how to do deferred diagnostics, it think it can be done like this (I crossed streams in my prior reply when discussing this part, so it's actually nonsense): instead of emitting undef here, we can emit a builtin with the same signature, but with the name suffixed with e.g. (__stdpar_unsupported) or something similar. Then, when doing the reachability computation later, if we stumble upon a node in the CFG that contains a builtin suffixed with __stdpar_unsupported we error out, and can provide nice diagnostics since we'd have the call-chain handy. Thoughts?

Sure, something like that. If you stick a SourceLocation on it, you can even recover the original clang source location.

We can definitely go with the __clang_unsupported approach, but I think I'd prefer these to be compile time errors rather than remarks + runtime printf, not in the least because printf adds some overhead.

The overhead should be pretty minimal if the code doesn't actually run.

So TL;DR, I think it would be more complex to do this on the AST and would end up more brittle / less future proof.

Since we need to support -O0

The biggest downside of working in the backend is that it becomes very hard for users to predict what will compile, and will not compile. Particularly if you want to support -O0. (I was sort of assuming you just wouldn't support -O0.) If you work on the AST, fewer constructs will be accepted, but you can actually define rules about which constructs will/will not be accepted.

yaxunl added a comment.EditedJul 21 2023, 10:35 AM

@yaxunl interesting point - are you worried about cases where due to missing inlining / const prop an indirect call site that can be replaced with a direct one would remain indirect? I think the problem in that case would actually be different, in that possibly reachable functions would not be identified as such and would be erroneously removed. I'm not sure there's any case where we'd fail to remove a meant to be unreachable function. We can definitely go with the __clang_unsupported approach, but I think I'd prefer these to be compile time errors rather than remarks + runtime printf, not in the least because printf adds some overhead. A way to ensure we don't "miss a spot" might be to check after removal for any remaining unsupported builtins, instead of doing it during reachability computation (this is coupled with the special naming from the prior post).

For programs having multiple TUs we cannot decide whether an unsupported function is used by a kernel during the compilation of a single TU. We can only decide that when we have the IR for the whole program. Currently, the HIP toolchain uses LTO of lld for multiple TUs, I am not sure whether we can emit clang diagnostics from lld. If not, then we need to use remarks. If we are confident to remove most unreachable unsupported functions at -O0, we may not need to use printf at run time. Remarks at LTO should be sufficient.

AlexVlx updated this revision to Diff 544954.Jul 27 2023, 3:30 PM
AlexVlx removed a reviewer: eli.friedman.

This adds more ecumenical handling of unsupported builtins, as per the review discussion (a suffixed equivalent stub is emitted instead); it's paired with an associated change in accelerator code selection pass, where the actual check for these stubs occurs. I've also adjusted where the latter pass gets added to the opt pipeline, for the AMDGCN target; for the latter it's better, for the moment, to run it later because we essentially do LTCG, and therefore can unambiguously determine reachability by operating on the full module.

efriedma added inline comments.Aug 2 2023, 4:20 PM
clang/lib/CodeGen/CGBuiltin.cpp
5786

Else-after-return.

clang/lib/CodeGen/CodeGenModule.cpp
5339

You can't just pretend a thread-local variable isn't thread-local. If the intent here is that thread-local variables are illegal in device code, you need to figure out some way to produce a diagnostic. (Maybe by generating a call to __stdpar_unsupported_threadlocal or something like that if code tries to refer to such a variable.)

AlexVlx added inline comments.Aug 2 2023, 6:44 PM
clang/lib/CodeGen/CodeGenModule.cpp
5339

Oh, this is actually an error that slipped through, I botched the diff it appears, I'll correct it, apologies.

AlexVlx updated this revision to Diff 547024.Aug 3 2023, 3:32 PM

Remove noise, correct style.

AlexVlx marked an inline comment as done.Aug 3 2023, 3:33 PM
AlexVlx updated this revision to Diff 548022.Aug 7 2023, 6:40 PM

Extend handling of unsupported builtins to include dealing with the target attribute.

efriedma accepted this revision.Aug 8 2023, 11:44 AM

LGTM (but please don't merge until we reach consensus on the overall feature)

This revision is now accepted and ready to land.Aug 8 2023, 11:44 AM

LGTM (but please don't merge until we reach consensus on the overall feature)

Of course, and thank you for the review. Please, do stick around if you don't mind, because this'll still get at least one update.

yaxunl accepted this revision.Aug 8 2023, 1:30 PM

LGTM from HIP side. Thanks.

arsenm added inline comments.Aug 8 2023, 1:32 PM
clang/lib/CodeGen/BackendUtil.cpp
1105–1106

Formatting

keryell added a subscriber: keryell.Aug 8 2023, 4:44 PM
keryell added inline comments.
clang/lib/CodeGen/CGBuiltin.cpp
5785

There is a lot of interesting design information in this discussion thread which will be lost forever after this is merged.
Is there a way to keep a summary as a comment somewhere to help the future readers/maintainers/historians?

AlexVlx updated this revision to Diff 549097.Aug 10 2023, 10:44 AM

Add support for handling certain cases of unambiguously accelerator unsupported ASM i.e. cases where constraints are clearly mismatched. When that happens, we instead emit an ASM__stdpar_unsupported stub which takes as its single argument the constexpr string value of the ASM block. Later, in the AcceleratorCodeSelection pass, if such a stub is reachable from an accelerator callable, we error out and print the offending ASM alongside the location.

AlexVlx marked 2 inline comments as done.Aug 10 2023, 10:45 AM
AlexVlx updated this revision to Diff 549101.Aug 10 2023, 10:47 AM

Fix typo.

yaxunl added inline comments.Aug 10 2023, 10:54 AM
clang/lib/CodeGen/CGStmt.cpp
2424

maybe prefix with __ to avoid potential name collision with users' code

AlexVlx updated this revision to Diff 549159.Aug 10 2023, 1:52 PM

Switch to __ASM prefix.

AlexVlx marked an inline comment as done.Aug 10 2023, 1:52 PM
AlexVlx updated this revision to Diff 552575.Aug 22 2023, 7:19 PM
AlexVlx edited the summary of this revision. (Show Details)

Updating to reflect the outcome of the RFC, which is that this will be added as a HIP extension exclusively.

AlexVlx retitled this revision from [Clang][CodeGen][RFC] Add codegen support for C++ Parallel Algorithm Offload to [HIP][Clang][CodeGen][RFC] Add codegen support for C++ Parallel Algorithm Offload.Aug 22 2023, 7:20 PM
AlexVlx updated this revision to Diff 557672.Oct 10 2023, 5:59 AM
AlexVlx removed reviewers: tra, jlebar.

Rebase.

AlexVlx updated this revision to Diff 557673.Oct 10 2023, 8:10 AM

Use unmangled names in test.

AlexVlx updated this revision to Diff 557712.Oct 15 2023, 5:18 PM
AlexVlx edited the summary of this revision. (Show Details)

Rebase.

This revision was automatically updated to reflect the committed changes.

clang/test/CodeGenHipStdPar/unannotated-functions-get-emitted.cpp is failing: https://green.lab.llvm.org/green/job/clang-stage1-cmake-RA-incremental/38041/testReport/junit/Clang/CodeGenHipStdPar/unannotated_functions_get_emitted_cpp/

project/clang/test/CodeGenHipStdPar/unannotated-functions-get-emitted.cpp:15:22: error: NO-HIPSTDPAR-DEV: expected string not found in input
// NO-HIPSTDPAR-DEV: define {{.*}} void @bar({{.*}})
                     ^
<stdin>:1:1: note: scanning from here
; ModuleID = '/Users/buildslave/jenkins/workspace/clang-stage1-cmake-RA-incremental/llvm-project/clang/test/CodeGenHipStdPar/unannotated-functions-get-emitted.cpp'
^
<stdin>:7:1: note: possible intended match here
define void @bar(ptr noundef %a, float noundef %b) #0 {
^

It looks like it may be due to the matcher having whitespace on both sides of {{.*}}, while the output only has a single space between define and void, but I'm not too well versed in FileCheck edge cases to know for sure.

clang/test/CodeGenHipStdPar/unannotated-functions-get-emitted.cpp is failing: https://green.lab.llvm.org/green/job/clang-stage1-cmake-RA-incremental/38041/testReport/junit/Clang/CodeGenHipStdPar/unannotated_functions_get_emitted_cpp/

project/clang/test/CodeGenHipStdPar/unannotated-functions-get-emitted.cpp:15:22: error: NO-HIPSTDPAR-DEV: expected string not found in input
// NO-HIPSTDPAR-DEV: define {{.*}} void @bar({{.*}})
                     ^
<stdin>:1:1: note: scanning from here
; ModuleID = '/Users/buildslave/jenkins/workspace/clang-stage1-cmake-RA-incremental/llvm-project/clang/test/CodeGenHipStdPar/unannotated-functions-get-emitted.cpp'
^
<stdin>:7:1: note: possible intended match here
define void @bar(ptr noundef %a, float noundef %b) #0 {
^

It looks like it may be due to the matcher having whitespace on both sides of {{.*}}, while the output only has a single space between define and void, but I'm not too well versed in FileCheck edge cases to know for sure.

Thank you for the ping... this is pretty confusing since it's not tripping any of the buildbots, or flaring locally, let me look into it.

AlexVlx updated this revision to Diff 557733.Oct 17 2023, 7:32 AM

Simplify test.

This revision was landed with ongoing or failed builds.Oct 17 2023, 7:42 AM