This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contents

-
clang/
-
lib/CodeGen/
-
CodeGen/
1/1
BackendUtil.cpp
3/6
CGBuiltin.cpp
1/1
CGStmt.cpp
-
CMakeLists.txt
-
CodeGenFunction.cpp
2/2
CodeGenModule.cpp
-
test/CodeGenHipStdPar/
-
CodeGenHipStdPar/
-
unannotated-functions-get-emitted.cpp
-
unsupported-ASM.cpp
-
unsupported-builtins.cpp

Differential D155850

[HIP][Clang][CodeGen][RFC] Add codegen support for C++ Parallel Algorithm Offload
ClosedPublic

Authored by AlexVlx on Jul 20 2023, 8:25 AM.

Details

Reviewers

yaxunl
rjmccall
arsenm
efriedma
jhuber6

Commits

rG791b890c468e: [HIP][Clang][CodeGen] Simplify test for `hipstdpar`
rGdd5d65adb641: [HIP][Clang][CodeGen] Add CodeGen support for `hipstdpar`

Summary

This patch adds the CodeGen changes needed by the standard algorithm offload feature being proposed here: https://discourse.llvm.org/t/rfc-adding-c-parallel-algorithm-offload-support-to-clang-llvm/72159/1, which will only be available for the HIP language on AMD targets. The verbose documentation is included in the head of the patch series. This change concludes the set of additions needed in Clang, and essentially relaxes restrictions on what gets emitted on the device path, when compiling in hipstdpar mode (after the previous patch relaxed restrictions on what is semantically correct):

- Unless a function is explicitly marked __host__, it will get emitted, whereas before only __device__ and __global__ functions would be emitted;
- Unsupported builtins are ignored as opposed to being marked as an error, as the decision on their validity is deferred to the hipstdpar specific code selection pass we are adding, which will be the topic of the final patch in this series;
- We add the stdpar specific passes to the opt pipeline, independent of optimisation level:
  - When compiling for the accelerator / offload device, we add a code selection pass;
  - When compiling for the host, iff the user requested it via the --hipstdpar-interpose-alloc flag, we add a pass which replaces canonical allocation / deallocation functions with accelerator aware equivalents.

A test to validate that unannotated functions get correctly emitted is added as well. Please note that __device__, __global__ and __host__ are used to match existing nomenclature, they would not be present in user code.
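
The emission and pipeline rules in the summary can be modelled in a few lines of plain C++. This is only a sketch of the rules as the summary states them, not the Clang implementation: the `Annotation` enum, `emittedOnDevice`, `extraPasses`, and the returned pass labels are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-ins for the annotations named in the summary; these are not
// Clang's internal types.
enum class Annotation { None, Host, Device, Global };

// Device-side emission rule: __device__ and __global__ functions are always
// emitted; in hipstdpar mode unannotated functions are emitted as well, and
// only functions explicitly marked __host__ are skipped.
bool emittedOnDevice(Annotation a, bool hipStdPar) {
  switch (a) {
  case Annotation::Device:
  case Annotation::Global:
    return true;
  case Annotation::Host:
    return false;
  case Annotation::None:
    return hipStdPar;
  }
  assert(false && "unknown annotation");
  return false;
}

// Extra passes added regardless of optimisation level. The strings are
// descriptive labels chosen for this sketch, not registered pass names.
std::vector<std::string> extraPasses(bool hipStdPar, bool isDevice,
                                     bool interposeAlloc) {
  std::vector<std::string> passes;
  if (!hipStdPar)
    return passes;
  if (isDevice)
    passes.push_back("accelerator-code-selection");
  else if (interposeAlloc)
    passes.push_back("allocation-interposition");
  return passes;
}
```

Under this model, `emittedOnDevice(Annotation::None, true)` holds while `emittedOnDevice(Annotation::None, false)` does not, and a host compilation without the interposition flag gets no extra pass.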

Diff Detail

Event Timeline

AlexVlx created this revision. Jul 20 2023, 8:25 AM

Herald added a project: Restricted Project. Jul 20 2023, 8:25 AM

Herald added a subscriber: ormris.

AlexVlx requested review of this revision. Jul 20 2023, 8:25 AM

Herald added subscribers: cfe-commits, wdng. Jul 20 2023, 8:25 AM

AlexVlx added a parent revision: D155833: [HIP][Clang][Sema][RFC] Add Sema support for C++ Parallel Algorithm Offload. Jul 20 2023, 8:25 AM

AlexVlx added a child revision: D155856: [HIP][LLVM][Opt][AMDGPU][RFC] Add LLVM support for C++ Parallel Algorithm Offload. Jul 20 2023, 9:24 AM

efriedma requested changes to this revision. Jul 20 2023, 11:44 AM

efriedma added a subscriber: efriedma.

efriedma added inline comments.

clang/lib/CodeGen/CGBuiltin.cpp
5542	This doesn't make sense; we can't just ignore bits of the source code. I guess this is related to "the decision on their validity is deferred", but I don't see how you expect this to work.

This revision now requires changes to proceed. Jul 20 2023, 11:44 AM

Harbormaster completed remote builds in B246932: Diff 542488. Jul 20 2023, 11:45 AM

AlexVlx added inline comments. Jul 20 2023, 2:08 PM

clang/lib/CodeGen/CGBuiltin.cpp
5542	This is one of the weirder parts, so let's consider the following example:

    void foo() { __builtin_ia32_pause(); }
    void bar() { __builtin_trap(); }
    void baz(const vector<int>& v) {
      return for_each(par_unseq, cbegin(v), cend(v), [](auto&& x) { if (x == 42) bar(); });
    }

In the case above, what we'd offload to the accelerator, and ask the target BE to lower, is the implementation of `for_each`, and `bar`, because it is reachable from the former. `foo` is not reachable by any execution path on the accelerator side, however it includes a builtin that is unsupported by the accelerator (unless said accelerator is x86, which is not impossible, but not something we're dealing with at the moment). If we were to actually error out early, in the FE, in these cases, there's almost no appeal to what is being proposed, because standard headers, as well as other libraries, are littered with various target specific builtins that are not going to be supported.

This all builds on the core invariant of this model / extension / thingamabob, which is that the algorithms, and only the algorithms, are targets for offload. It thus follows that as long as code that is reachable from an algorithm's implementation is safe, all is fine, but we cannot know this in the FE / on an AST level, because we need the actual CFG. This part is handled in LLVM in the `SelectAcceleratorCodePass` that's in the last patch in this series.

Now, you will quite correctly observe that there's nothing preventing a user from calling `foo` in the callable they pass to an algorithm; they might read the docs / appreciate that this won't work, but even then they are not safe, because via some opaque call chain they might end up touching some unsupported builtin. My intuition here, which is reflected above in letting builtins just flow through, is that such cases are better served with a compile time error, which is what will obtain once the target BE chokes trying to lower an unsupported builtin. It's not going to be a beautiful error, and we could probably prettify it somewhat if we were to check after we've done the accelerator code selection pass, but it will happen at compile time. Another solution would be to emit these as traps (poison + trap for value returning ones), but I am concerned that it would lead to really fascinating debug journeys.

Having said this, if there's a better way to deal with these scenarios, it would be rather nice. Similarly, if the above doesn't make sense, please let me know.

efriedma added inline comments. Jul 20 2023, 3:27 PM

clang/lib/CodeGen/CGBuiltin.cpp
5542	Oh, I see; you "optimistically" compile everything assuming it might run on the accelerator, then run LLVM IR optimizations, then determine late which bits of code will actually run on the accelerator, which then prunes the code which shouldn't run. I'm not sure I really like this... would it be possible to infer which functions need to be run on the accelerator based on the AST? I mean, if your API takes a lambda expression that runs on the accelerator, you can mark the lambda's body as "must be emitted for GPU", then recursively mark all the functions referred to by the lambda. Emitting errors lazily from the backend means you get different diagnostics depending on the optimization level. If you do go with this codegen-based approach, it's not clear to me how you detect that a forbidden builtin was called; if you skip the error handling, you just get a literal "undef".

AlexVlx added inline comments. Jul 21 2023, 5:25 AM

clang/lib/CodeGen/CGBuiltin.cpp
5542	`I'm not sure I really like this...` - actually, I am not a big fan either, however I think it's about the best one can do, given the constraints (consume standard C++, no annotations on the user side etc.). Having tried a few times in the past (and at least once in a different compiler), I don't quite think this can be done on an AST level. It would add some fairly awkward checking during template instantiation (no way to know earlier that a `CallableFoo` was passed to an offloadable algorithm), and it's a bit unwieldy to basically compute the CFG on the AST and mark reachable Callees at that point.

Ignoring those, the main reason for which we cannot do this is that the interface is not constrained to only take lambdas, but callables in general, and that includes pointers to function as well. We don't deal with those today, but plan to, and there's a natural solution when operating on IR, assuming closed / internalised Modules (which is the case at least for AMDGPU at the moment). The final challenge pertains to the AST being per TU, with no cross-TU visibility, whereas with IR you can pre-link the BC (implicitly or via LTO) and then operate on the entire compilation. This is a problem with cases where `foo` defined in TU0 is reachable from `algorithm_bar_offloaded_impl` in TU1. So TL;DR, I think it would be more complex to do this on the AST and would end up more brittle / less future proof.

As for how to do deferred diagnostics, I think it can be done like this (I crossed streams in my prior reply when discussing this part, so it's actually nonsense): instead of emitting undef here, we can emit a builtin with the same signature, but with the name suffixed with e.g. (`__stdpar_unsupported`) or something similar. Then, when doing the reachability computation later, if we stumble upon a node in the CFG that contains a builtin suffixed with `__stdpar_unsupported` we error out, and can provide nice diagnostics since we'd have the call-chain handy. Thoughts?
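
The suffixed-stub scheme proposed in this comment can be sketched outside LLVM as a reachability walk over a call graph that stops at the first stub and reports the call chain. This is a toy model under stated assumptions: the `CallGraph` alias, `findUnsupportedChain`, `exampleGraph`, and the `for_each_ok` / `for_each_bad` / `lambda_*` node names are hypothetical, and the real check lives in the accelerator code selection pass of the child revision, which may work differently.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Call graph keyed by function name; values are the direct callees.
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Suffix proposed in the review for stubs replacing unsupported builtins.
static const std::string kSuffix = "__stdpar_unsupported";

bool isUnsupportedStub(const std::string &name) {
  return name.size() >= kSuffix.size() &&
         name.compare(name.size() - kSuffix.size(), kSuffix.size(),
                      kSuffix) == 0;
}

// Depth-first walk that keeps the current call chain in `chain`.
static bool walk(const CallGraph &g, const std::string &fn,
                 std::set<std::string> &visited,
                 std::vector<std::string> &chain) {
  if (!visited.insert(fn).second)
    return false; // already explored, or on the current path (cycle)
  chain.push_back(fn);
  if (isUnsupportedStub(fn))
    return true;
  auto it = g.find(fn);
  if (it != g.end())
    for (const auto &callee : it->second)
      if (walk(g, callee, visited, chain))
        return true;
  chain.pop_back();
  return false;
}

// Returns the call chain from `root` to the first reachable stub, or an
// empty vector when no stub is reachable.
std::vector<std::string> findUnsupportedChain(const CallGraph &g,
                                              const std::string &root) {
  std::set<std::string> visited;
  std::vector<std::string> chain;
  walk(g, root, visited, chain);
  assert(chain.empty() || isUnsupportedStub(chain.back()));
  return chain;
}

// The foo / bar example from earlier in the thread, plus a second
// (hypothetical) algorithm whose callable reaches foo.
CallGraph exampleGraph() {
  return {
      {"foo", {"__builtin_ia32_pause__stdpar_unsupported"}},
      {"bar", {"__builtin_trap"}},
      {"for_each_ok", {"lambda_ok"}},
      {"lambda_ok", {"bar"}},
      {"for_each_bad", {"lambda_bad"}},
      {"lambda_bad", {"foo"}},
  };
}
```

With this graph, walking from `for_each_ok` finds nothing, so the unreachable `foo` is harmless, whereas walking from `for_each_bad` yields the chain ending in the suffixed stub, which is the information a diagnostic would need.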

Since we need to support -O0, we need to be prepared that we may not be able to remove all the calls of unsupported functions even though they may never be called at run time.

We could simply replace them with traps in the middle end. This should work if such functions are not called at run time. The only issue is that if they are called at run time, how do we tell users that they used unsupported functions and where. A trap does not help since it only crashes the program without useful information.

We could emit calls of any unsupported functions as calls of __clang_unsupported(file_name, line_number, function_name).

In the middle-end pass where we eliminate functions not referenced by kernels, we could emit reports about calls of __clang_unsupported under a certain -R option. We could turn on that option for -stdpar in clang driver.

We can emit printf of file_name, line_number and function_name for the first active lane then emit trap for a call of __clang_unsupported(file_name, line_number, function_name) under an option in the middle-end pass to facilitate users debugging their code.

@yaxunl interesting point - are you worried about cases where due to missing inlining / const prop an indirect call site that can be replaced with a direct one would remain indirect? I think the problem in that case would actually be different, in that possibly reachable functions would not be identified as such and would be erroneously removed. I'm not sure there's any case where we'd fail to remove a function that is meant to be unreachable. We can definitely go with the __clang_unsupported approach, but I think I'd prefer these to be compile time errors rather than remarks + runtime printf, not least because printf adds some overhead. A way to ensure we don't "miss a spot" might be to check after removal for any remaining unsupported builtins, instead of doing it during reachability computation (this is coupled with the special naming from the prior post).

As for how to do deferred diagnostics, I think it can be done like this (I crossed streams in my prior reply when discussing this part, so it's actually nonsense): instead of emitting undef here, we can emit a builtin with the same signature, but with the name suffixed with e.g. (__stdpar_unsupported) or something similar. Then, when doing the reachability computation later, if we stumble upon a node in the CFG that contains a builtin suffixed with __stdpar_unsupported we error out, and can provide nice diagnostics since we'd have the call-chain handy. Thoughts?

Sure, something like that. If you stick a SourceLocation on it, you can even recover the original clang source location.

We can definitely go with the __clang_unsupported approach, but I think I'd prefer these to be compile time errors rather than remarks + runtime printf, not least because printf adds some overhead.

The overhead should be pretty minimal if the code doesn't actually run.

So TL;DR, I think it would be more complex to do this on the AST and would end up more brittle / less future proof.

Since we need to support -O0

The biggest downside of working in the backend is that it becomes very hard for users to predict what will compile, and will not compile. Particularly if you want to support -O0. (I was sort of assuming you just wouldn't support -O0.) If you work on the AST, fewer constructs will be accepted, but you can actually define rules about which constructs will/will not be accepted.

In D155850#4523051, @AlexVlx wrote:

@yaxunl interesting point - are you worried about cases where due to missing inlining / const prop an indirect call site that can be replaced with a direct one would remain indirect? I think the problem in that case would actually be different, in that possibly reachable functions would not be identified as such and would be erroneously removed. I'm not sure there's any case where we'd fail to remove a function that is meant to be unreachable. We can definitely go with the __clang_unsupported approach, but I think I'd prefer these to be compile time errors rather than remarks + runtime printf, not least because printf adds some overhead. A way to ensure we don't "miss a spot" might be to check after removal for any remaining unsupported builtins, instead of doing it during reachability computation (this is coupled with the special naming from the prior post).

For programs having multiple TUs we cannot decide whether an unsupported function is used by a kernel during the compilation of a single TU. We can only decide that when we have the IR for the whole program. Currently, the HIP toolchain uses lld's LTO for multiple TUs, and I am not sure whether we can emit clang diagnostics from lld. If not, then we need to use remarks. If we are confident that we can remove most unreachable unsupported functions at -O0, we may not need to use printf at run time. Remarks at LTO should be sufficient.

This adds more ecumenical handling of unsupported builtins, as per the review discussion (a suffixed equivalent stub is emitted instead); it's paired with an associated change in accelerator code selection pass, where the actual check for these stubs occurs. I've also adjusted where the latter pass gets added to the opt pipeline, for the AMDGCN target; for the latter it's better, for the moment, to run it later because we essentially do LTCG, and therefore can unambiguously determine reachability by operating on the full module.

Harbormaster completed remote builds in B248711: Diff 544954. Jul 27 2023, 3:30 PM

efriedma added inline comments. Aug 2 2023, 4:20 PM

clang/lib/CodeGen/CGBuiltin.cpp
5559	Else-after-return.
clang/lib/CodeGen/CodeGenModule.cpp
5315	You can't just pretend a thread-local variable isn't thread-local. If the intent here is that thread-local variables are illegal in device code, you need to figure out some way to produce a diagnostic. (Maybe by generating a call to __stdpar_unsupported_threadlocal or something like that if code tries to refer to such a variable.)

AlexVlx added inline comments. Aug 2 2023, 6:44 PM

clang/lib/CodeGen/CodeGenModule.cpp
5315	Oh, this is actually an error that slipped through, I botched the diff it appears, I'll correct it, apologies.

Remove noise, correct style.

AlexVlx marked an inline comment as done. Aug 3 2023, 3:33 PM

Harbormaster completed remote builds in B250189: Diff 547024. Aug 3 2023, 3:33 PM

Extend handling of unsupported builtins to include dealing with the target attribute.

Harbormaster completed remote builds in B250964: Diff 548022. Aug 7 2023, 9:04 PM

LGTM (but please don't merge until we reach consensus on the overall feature)

This revision is now accepted and ready to land. Aug 8 2023, 11:44 AM

In D155850#4570336, @efriedma wrote:

LGTM (but please don't merge until we reach consensus on the overall feature)

Of course, and thank you for the review. Please, do stick around if you don't mind, because this'll still get at least one update.

LGTM from HIP side. Thanks.

arsenm added inline comments. Aug 8 2023, 1:32 PM

clang/lib/CodeGen/BackendUtil.cpp
1101–1102	Formatting

keryell added a subscriber: keryell. Aug 8 2023, 4:44 PM

keryell added inline comments.

clang/lib/CodeGen/CGBuiltin.cpp
5542	There is a lot of interesting design information in this discussion thread which will be lost forever after this is merged. Is there a way to keep a summary as a comment somewhere to help the future readers/maintainers/historians?

Add support for handling certain cases of unambiguously accelerator-unsupported ASM, i.e. cases where constraints are clearly mismatched. When that happens, we instead emit an ASM__stdpar_unsupported stub which takes as its single argument the constexpr string value of the ASM block. Later, in the AcceleratorCodeSelection pass, if such a stub is reachable from an accelerator callable, we error out and print the offending ASM alongside the location.
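
The decision described in this update can be modelled as follows. This is a hedged sketch, not the CodeGen code: `AsmBlock`, `ToyTarget`, `lowerAsm`, and the constraint strings in the usage note are made up for illustration and say nothing about which constraints a real target accepts. The stub name follows the final diff (`__ASM__hipstdpar_unsupported`), and the exception stands in for the assertion that fires outside hipstdpar device compilation.

```cpp
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// An inline asm statement reduced to its text and constraint strings.
struct AsmBlock {
  std::string text;
  std::vector<std::string> constraints; // outputs followed by inputs
};

// Toy target description: the set of constraint strings it accepts.
struct ToyTarget {
  std::set<std::string> accepted;
  bool validate(const std::string &c) const { return accepted.count(c) != 0; }
};

// Returns a textual description of what would be emitted for the block.
// In hipstdpar device mode an invalid constraint yields a call to the stub
// carrying the asm text; otherwise an invalid constraint is a hard failure.
std::string lowerAsm(const AsmBlock &a, const ToyTarget &t,
                     bool hipStdParDevice) {
  for (const auto &c : a.constraints) {
    if (t.validate(c))
      continue;
    if (!hipStdParDevice)
      throw std::logic_error("invalid constraint: " + c);
    return "call void @__ASM__hipstdpar_unsupported(\"" + a.text + "\")";
  }
  return "asm \"" + a.text + "\"";
}
```

For a toy target accepting only `=v` and `v`, a block with constraint `=q` lowers to the stub call in hipstdpar device mode, and the later reachability check decides whether that stub is an error.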

AlexVlx marked 2 inline comments as done. Aug 10 2023, 10:45 AM

Fix typo.

yaxunl added inline comments. Aug 10 2023, 10:54 AM

clang/lib/CodeGen/CGStmt.cpp
2422	maybe prefix with `__` to avoid potential name collision with users' code

Switch to __ASM prefix.

AlexVlx marked an inline comment as done. Aug 10 2023, 1:52 PM

Harbormaster completed remote builds in B251777: Diff 549159. Aug 10 2023, 7:14 PM

Updating to reflect the outcome of the RFC, which is that this will be added as a HIP extension exclusively.

AlexVlx retitled this revision from [Clang][CodeGen][RFC] Add codegen support for C++ Parallel Algorithm Offload to [HIP][Clang][CodeGen][RFC] Add codegen support for C++ Parallel Algorithm Offload. Aug 22 2023, 7:20 PM

Harbormaster completed remote builds in B254240: Diff 552575. Aug 22 2023, 7:47 PM

AlexVlx added a reviewer: jhuber6. Aug 27 2023, 3:43 PM

Rebase.

Harbormaster completed remote builds in B257807: Diff 557672. Oct 10 2023, 7:21 AM

Use unmangled names in test.

Harbormaster completed remote builds in B257808: Diff 557673. Oct 10 2023, 8:46 AM

Rebase.

Harbormaster completed remote builds in B257824: Diff 557712. Oct 15 2023, 6:26 PM

Closed by commit rGdd5d65adb641: [HIP][Clang][CodeGen] Add CodeGen support for `hipstdpar` (authored by AlexVlx). Oct 17 2023, 3:41 AM

This revision was automatically updated to reflect the committed changes.

AlexVlx added a commit: rGdd5d65adb641: [HIP][Clang][CodeGen] Add CodeGen support for `hipstdpar`.

clang/test/CodeGenHipStdPar/unannotated-functions-get-emitted.cpp is failing: https://green.lab.llvm.org/green/job/clang-stage1-cmake-RA-incremental/38041/testReport/junit/Clang/CodeGenHipStdPar/unannotated_functions_get_emitted_cpp/

project/clang/test/CodeGenHipStdPar/unannotated-functions-get-emitted.cpp:15:22: error: NO-HIPSTDPAR-DEV: expected string not found in input
// NO-HIPSTDPAR-DEV: define {{.*}} void @bar({{.*}})
                     ^
<stdin>:1:1: note: scanning from here
; ModuleID = '/Users/buildslave/jenkins/workspace/clang-stage1-cmake-RA-incremental/llvm-project/clang/test/CodeGenHipStdPar/unannotated-functions-get-emitted.cpp'
^
<stdin>:7:1: note: possible intended match here
define void @bar(ptr noundef %a, float noundef %b) #0 {
^

It looks like it may be due to the matcher having whitespace on both sides of {{.*}}, while the output only has a single space between define and void, but I'm not too well versed in FileCheck edge cases to know for sure.
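
The whitespace hypothesis above can be checked in isolation with `std::regex`. This is a rough model, not FileCheck itself: each `{{.*}}` block is translated by hand to `.*`, literal parentheses are escaped, and the `dso_local` line is a hypothetical example of output with something between `define` and `void`. The relaxed pattern is one possible fix and not necessarily what the follow-up commit did.

```cpp
#include <regex>
#include <string>

// True if `pattern` (ECMAScript syntax) matches anywhere in `line`.
bool lineMatches(const std::string &pattern, const std::string &line) {
  return std::regex_search(line, std::regex(pattern));
}

// The failing bot's output line, copied from the report above.
static const std::string kBotLine =
    "define void @bar(ptr noundef %a, float noundef %b) #0 {";

// Hypothetical output with an attribute between `define` and `void`.
static const std::string kAttrLine =
    "define dso_local void @bar(ptr noundef %a, float noundef %b) #0 {";

// Hand translation of: define {{.*}} void @bar({{.*}})
// The literal spaces on both sides of .* require two spaces in the input.
static const std::string kOriginal = R"re(define .* void @bar\(.*\))re";

// Relaxed variant with no literal space after the wildcard.
static const std::string kRelaxed = R"re(define .*void @bar\(.*\))re";
```

In this model `kOriginal` matches `kAttrLine` but not `kBotLine`, because `define void` contains only one space, while `kRelaxed` matches both lines.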

In D155850#4654146, @hnrklssn wrote:
clang/test/CodeGenHipStdPar/unannotated-functions-get-emitted.cpp is failing: https://green.lab.llvm.org/green/job/clang-stage1-cmake-RA-incremental/38041/testReport/junit/Clang/CodeGenHipStdPar/unannotated_functions_get_emitted_cpp/
project/clang/test/CodeGenHipStdPar/unannotated-functions-get-emitted.cpp:15:22: error: NO-HIPSTDPAR-DEV: expected string not found in input
// NO-HIPSTDPAR-DEV: define {{.*}} void @bar({{.*}})
                     ^
<stdin>:1:1: note: scanning from here
; ModuleID = '/Users/buildslave/jenkins/workspace/clang-stage1-cmake-RA-incremental/llvm-project/clang/test/CodeGenHipStdPar/unannotated-functions-get-emitted.cpp'
^
<stdin>:7:1: note: possible intended match here
define void @bar(ptr noundef %a, float noundef %b) #0 {
^
It looks like it may be due to the matcher having whitespace on both sides of {{.*}}, while the output only has a single space between define and void, but I'm not too well versed in FileCheck edge cases to know for sure.

Thank you for the ping... this is pretty confusing since it's not tripping any of the buildbots, or flaring locally, let me look into it.

Simplify test.

Harbormaster completed remote builds in B257838: Diff 557733. Oct 17 2023, 7:33 AM

This revision was landed with ongoing or failed builds. Oct 17 2023, 7:42 AM

AlexVlx added a commit: rG791b890c468e: [HIP][Clang][CodeGen] Simplify test for `hipstdpar`.

Revision Contents

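Path                                                                  Size
clang/lib/CodeGen/BackendUtil.cpp                                     5 lines
clang/lib/CodeGen/CGBuiltin.cpp                                       26 lines
clang/lib/CodeGen/CGStmt.cpp                                          37 lines
clang/lib/CodeGen/CMakeLists.txt                                      1 line
clang/lib/CodeGen/CodeGenFunction.cpp                                 12 lines
clang/lib/CodeGen/CodeGenModule.cpp                                   7 lines
clang/test/CodeGenHipStdPar/unannotated-functions-get-emitted.cpp     19 lines
clang/test/CodeGenHipStdPar/unsupported-ASM.cpp                       10 lines
clang/test/CodeGenHipStdPar/unsupported-builtins.cpp                  8 lines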

Diff 557733

clang/lib/CodeGen/BackendUtil.cpp

Context not available.
 #include "llvm/Transforms/Scalar/EarlyCSE.h"
 #include "llvm/Transforms/Scalar/GVN.h"
 #include "llvm/Transforms/Scalar/JumpThreading.h"
+#include "llvm/Transforms/HipStdPar/HipStdPar.h"
 #include "llvm/Transforms/Utils/Debugify.h"
 #include "llvm/Transforms/Utils/EntryExitInstrumenter.h"
 #include "llvm/Transforms/Utils/ModuleUtils.h"
Context not available.
     return;
   }

+  if (LangOpts.HIPStdPar && !LangOpts.CUDAIsDevice &&
+      LangOpts.HIPStdParInterposeAlloc)
+    MPM.addPass(HipStdParAllocationInterpositionPass());
+
   // Now that we have all of the passes ready, run them.
   {
     PrettyStackTraceString CrashInfo("Optimizer");
Context not available.

clang/lib/CodeGen/CGBuiltin.cpp

Context not available.
   return nullptr;
 }

+static RValue EmitHipStdParUnsupportedBuiltin(CodeGenFunction *CGF,
+                                              const FunctionDecl *FD) {
+  auto Name = FD->getNameAsString() + "__hipstdpar_unsupported";
+  auto FnTy = CGF->CGM.getTypes().GetFunctionType(FD);
+  auto UBF = CGF->CGM.getModule().getOrInsertFunction(Name, FnTy);
+
+  SmallVector<Value *, 16> Args;
+  for (auto &&FormalTy : FnTy->params())
+    Args.push_back(llvm::PoisonValue::get(FormalTy));
+
+  return RValue::get(CGF->Builder.CreateCall(UBF, Args));
+}
+
 RValue CodeGenFunction::EmitBuiltinExpr(const GlobalDecl GD, unsigned BuiltinID,
                                         const CallExpr *E,
                                         ReturnValueSlot ReturnValue) {
Context not available.
     llvm_unreachable("Bad evaluation kind in EmitBuiltinExpr");
   }

+  if (getLangOpts().HIPStdPar && getLangOpts().CUDAIsDevice)
+    return EmitHipStdParUnsupportedBuiltin(this, FD);
+
   ErrorUnsupported(E, "builtin function");

   // Unknown builtin, for now just dump it out and return undef.
Context not available.
                                         unsigned BuiltinID, const CallExpr *E,
                                         ReturnValueSlot ReturnValue,
                                         llvm::Triple::ArchType Arch) {
+  // When compiling in HipStdPar mode we have to be conservative in rejecting
+  // target specific features in the FE, and defer the possible error to the
+  // AcceleratorCodeSelection pass, wherein iff an unsupported target builtin is
+  // referenced by an accelerator executable function, we emit an error.
+  // Returning nullptr here leads to the builtin being handled in
+  // EmitStdParUnsupportedBuiltin.
+  if (CGF->getLangOpts().HIPStdPar && CGF->getLangOpts().CUDAIsDevice &&
+      Arch != CGF->getTarget().getTriple().getArch())
+    return nullptr;
+
   switch (Arch) {
   case llvm::Triple::arm:
   case llvm::Triple::armeb:
Context not available.

clang/lib/CodeGen/CGStmt.cpp

Context not available.
   }
 }

+static void EmitHipStdParUnsupportedAsm(CodeGenFunction *CGF,
+                                        const AsmStmt &S) {
+  constexpr auto Name = "__ASM__hipstdpar_unsupported";
+
+  StringRef Asm;
+  if (auto GCCAsm = dyn_cast<GCCAsmStmt>(&S))
+    Asm = GCCAsm->getAsmString()->getString();
+
+  auto &Ctx = CGF->CGM.getLLVMContext();
+
+  auto StrTy = llvm::ConstantDataArray::getString(Ctx, Asm);
+  auto FnTy = llvm::FunctionType::get(llvm::Type::getVoidTy(Ctx),
+                                      {StrTy->getType()}, false);
+  auto UBF = CGF->CGM.getModule().getOrInsertFunction(Name, FnTy);
+
+  CGF->Builder.CreateCall(UBF, {StrTy});
+}
+
 void CodeGenFunction::EmitAsmStmt(const AsmStmt &S) {
   // Pop all cleanup blocks at the end of the asm statement.
   CodeGenFunction::RunCleanupsScope Cleanups(*this);
Context not available.
   SmallVector<TargetInfo::ConstraintInfo, 4> OutputConstraintInfos;
   SmallVector<TargetInfo::ConstraintInfo, 4> InputConstraintInfos;

-  for (unsigned i = 0, e = S.getNumOutputs(); i != e; i++) {
+  bool IsHipStdPar = getLangOpts().HIPStdPar && getLangOpts().CUDAIsDevice;
+  bool IsValidTargetAsm = true;
+  for (unsigned i = 0, e = S.getNumOutputs(); i != e && IsValidTargetAsm; i++) {
     StringRef Name;
     if (const GCCAsmStmt *GAS = dyn_cast<GCCAsmStmt>(&S))
       Name = GAS->getOutputName(i);
     TargetInfo::ConstraintInfo Info(S.getOutputConstraint(i), Name);
     bool IsValid = getTarget().validateOutputConstraint(Info); (void)IsValid;
-    assert(IsValid && "Failed to parse output constraint");
+    if (IsHipStdPar && !IsValid)
+      IsValidTargetAsm = false;
+    else
+      assert(IsValid && "Failed to parse output constraint");
     OutputConstraintInfos.push_back(Info);
   }

-  for (unsigned i = 0, e = S.getNumInputs(); i != e; i++) {
+  for (unsigned i = 0, e = S.getNumInputs(); i != e && IsValidTargetAsm; i++) {
	StringRef Name;	StringRef Name;
	if (const GCCAsmStmt *GAS = dyn_cast<GCCAsmStmt>(&S))	if (const GCCAsmStmt *GAS = dyn_cast<GCCAsmStmt>(&S))
	Name = GAS->getInputName(i);	Name = GAS->getInputName(i);
	TargetInfo::ConstraintInfo Info(S.getInputConstraint(i), Name);	TargetInfo::ConstraintInfo Info(S.getInputConstraint(i), Name);
	bool IsValid =	bool IsValid =
	getTarget().validateInputConstraint(OutputConstraintInfos, Info);	getTarget().validateInputConstraint(OutputConstraintInfos, Info);
	assert(IsValid && "Failed to parse input constraint"); (void)IsValid;	if (IsHipStdPar && !IsValid)
		IsValidTargetAsm = false;
		else
		assert(IsValid && "Failed to parse input constraint");
	InputConstraintInfos.push_back(Info);	InputConstraintInfos.push_back(Info);
	}	}

		if (!IsValidTargetAsm)
		return EmitHipStdParUnsupportedAsm(this, S);

	std::string Constraints;	std::string Constraints;

	std::vector<LValue> ResultRegDests;	std::vector<LValue> ResultRegDests;
Context not available.
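The control flow added to `EmitAsmStmt` above can be sketched in isolation as follows. This is a simplified model, not the Clang code: `classifyAsm`, `AsmLowering` and the `Validator` callback are hypothetical names, and the callback stands in for `validateOutputConstraint` / `validateInputConstraint`. As in the diff, validation stops at the first constraint the target rejects, and the assertion only fires outside hipstdpar device compilation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical result type: emit the asm normally, or emit the
// "unsupported" placeholder call instead.
enum class AsmLowering { Native, UnsupportedPlaceholder };

// Hypothetical stand-in for the target's constraint validation.
using Validator = std::function<bool(const std::string &)>;

AsmLowering classifyAsm(const std::vector<std::string> &Outputs,
                        const std::vector<std::string> &Inputs,
                        const Validator &IsValidForTarget,
                        bool IsHipStdParDevice) {
  bool IsValidTargetAsm = true;

  // Shared loop for outputs and inputs; it exits early once a constraint
  // has been rejected, mirroring the "&& IsValidTargetAsm" loop condition.
  auto Check = [&](const std::vector<std::string> &Constraints) {
    for (std::size_t I = 0; I != Constraints.size() && IsValidTargetAsm; ++I) {
      bool IsValid = IsValidForTarget(Constraints[I]);
      if (IsHipStdParDevice && !IsValid)
        IsValidTargetAsm = false; // defer the error to a later pass
      else
        assert(IsValid && "Failed to parse constraint");
    }
  };

  Check(Outputs);
  Check(Inputs);

  return IsValidTargetAsm ? AsmLowering::Native
                          : AsmLowering::UnsupportedPlaceholder;
}
```

With a validator that, purely for illustration, accepts only `"r"`, the x86 constraints from unsupported-ASM.cpp (`"=q"`, `"+g"`) are classified as `UnsupportedPlaceholder` in hipstdpar device mode.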

clang/lib/CodeGen/CMakeLists.txt

Context not available.
	Extensions	Extensions
	FrontendHLSL	FrontendHLSL
	FrontendOpenMP	FrontendOpenMP
		HIPStdPar
	IPO	IPO
	IRPrinter	IRPrinter
	IRReader	IRReader
Context not available.

clang/lib/CodeGen/CodeGenFunction.cpp

Context not available.
	std::string MissingFeature;	std::string MissingFeature;
	llvm::StringMap<bool> CallerFeatureMap;	llvm::StringMap<bool> CallerFeatureMap;
	CGM.getContext().getFunctionFeatureMap(CallerFeatureMap, FD);	CGM.getContext().getFunctionFeatureMap(CallerFeatureMap, FD);
		// When compiling in HipStdPar mode we have to be conservative in rejecting
		// target specific features in the FE, and defer the possible error to the
		// AcceleratorCodeSelection pass, wherein iff an unsupported target builtin is
		// referenced by an accelerator executable function, we emit an error.
		bool IsHipStdPar = getLangOpts().HIPStdPar && getLangOpts().CUDAIsDevice;
	if (BuiltinID) {	if (BuiltinID) {
	StringRef FeatureList(CGM.getContext().BuiltinInfo.getRequiredFeatures(BuiltinID));	StringRef FeatureList(CGM.getContext().BuiltinInfo.getRequiredFeatures(BuiltinID));
	if (!Builtin::evaluateRequiredTargetFeatures(	if (!Builtin::evaluateRequiredTargetFeatures(
	FeatureList, CallerFeatureMap)) {	FeatureList, CallerFeatureMap) && !IsHipStdPar) {
	CGM.getDiags().Report(Loc, diag::err_builtin_needs_feature)	CGM.getDiags().Report(Loc, diag::err_builtin_needs_feature)
	<< TargetDecl->getDeclName()	<< TargetDecl->getDeclName()
	<< FeatureList;	<< FeatureList;
Context not available.
	return false;	return false;
	}	}
	return true;	return true;
	}))	}) && !IsHipStdPar)
	CGM.getDiags().Report(Loc, diag::err_function_needs_feature)	CGM.getDiags().Report(Loc, diag::err_function_needs_feature)
	<< FD->getDeclName() << TargetDecl->getDeclName() << MissingFeature;	<< FD->getDeclName() << TargetDecl->getDeclName() << MissingFeature;
	} else if (!FD->isMultiVersion() && FD->hasAttr<TargetAttr>()) {	} else if (!FD->isMultiVersion() && FD->hasAttr<TargetAttr>()) {
Context not available.

	for (const auto &F : CalleeFeatureMap) {	for (const auto &F : CalleeFeatureMap) {
	if (F.getValue() && (!CallerFeatureMap.lookup(F.getKey()) ||	if (F.getValue() && (!CallerFeatureMap.lookup(F.getKey()) ||
	!CallerFeatureMap.find(F.getKey())->getValue()))	!CallerFeatureMap.find(F.getKey())->getValue()) &&
		!IsHipStdPar)
	CGM.getDiags().Report(Loc, diag::err_function_needs_feature)	CGM.getDiags().Report(Loc, diag::err_function_needs_feature)
	<< FD->getDeclName() << TargetDecl->getDeclName() << F.getKey();	<< FD->getDeclName() << TargetDecl->getDeclName() << F.getKey();
	}	}
Context not available.

clang/lib/CodeGen/CodeGenModule.cpp

Context not available.
	GV->setComdat(TheModule.getOrInsertComdat(GV->getName()));	GV->setComdat(TheModule.getOrInsertComdat(GV->getName()));
	Emitter.finalize(GV);	Emitter.finalize(GV);

	return ConstantAddress(GV, GV->getValueType(), Alignment);	return ConstantAddress(GV, GV->getValueType(), Alignment);
	}	}

	ConstantAddress CodeGenModule::GetWeakRefReference(const ValueDecl *VD) {	ConstantAddress CodeGenModule::GetWeakRefReference(const ValueDecl *VD) {
Context not available.
	!Global->hasAttr<CUDAConstantAttr>() &&	!Global->hasAttr<CUDAConstantAttr>() &&
	!Global->hasAttr<CUDASharedAttr>() &&	!Global->hasAttr<CUDASharedAttr>() &&
	!Global->getType()->isCUDADeviceBuiltinSurfaceType() &&	!Global->getType()->isCUDADeviceBuiltinSurfaceType() &&
	!Global->getType()->isCUDADeviceBuiltinTextureType())	!Global->getType()->isCUDADeviceBuiltinTextureType() &&
		!(LangOpts.HIPStdPar &&
		isa<FunctionDecl>(Global) &&
		!Global->hasAttr<CUDAHostAttr>()))
	return;	return;
	} else {	} else {
	// We need to emit host-side 'shadows' for all global	// We need to emit host-side 'shadows' for all global
Context not available.
		efriedma: You can't just pretend a thread-local variable isn't thread-local. If the intent here is that thread-local variables are illegal in device code, you need to figure out some way to produce a diagnostic. (Maybe by generating a call to __stdpar_unsupported_threadlocal or something like that if code tries to refer to such a variable.)
		AlexVlx (author): Oh, this is actually an error that slipped through, I botched the diff it appears, I'll correct it, apologies.
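The change to the device-side early return in CodeGenModule.cpp above, as it applies to functions, can be summarized by a small predicate. This is a sketch under stated assumptions: `FnAttrs` and `emittedOnDevice` are hypothetical names, only the host / device / global attributes are modelled, and variables (constant, shared, surface and texture types) are deliberately left out.

```cpp
// Hypothetical, reduced view of the attributes a function declaration carries.
struct FnAttrs {
  bool Host = false;   // __host__
  bool Device = false; // __device__
  bool Global = false; // __global__
};

// Is the function emitted when compiling for the device?
bool emittedOnDevice(const FnAttrs &A, bool HipStdPar) {
  // Unchanged rule: explicitly device-side functions are always emitted.
  if (A.Device || A.Global)
    return true;
  // New rule in hipstdpar mode: anything not explicitly marked __host__
  // is emitted as well.
  return HipStdPar && !A.Host;
}
```

This matches what unannotated-functions-get-emitted.cpp checks: the unannotated `foo` is emitted only in hipstdpar mode, while the `__device__` function `bar` is emitted in both modes.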

clang/test/CodeGenHipStdPar/unannotated-functions-get-emitted.cpp

This file was added.

				// RUN: %clang_cc1 -x hip -emit-llvm -fcuda-is-device \
				// RUN: -o - %s | FileCheck --check-prefix=NO-HIPSTDPAR-DEV %s

				// RUN: %clang_cc1 --hipstdpar -emit-llvm -fcuda-is-device \
				// RUN: -o - %s | FileCheck --check-prefix=HIPSTDPAR-DEV %s

				#define __device__ __attribute__((device))

				// NO-HIPSTDPAR-DEV-NOT: {{.}}void @foo({{.}})
				// HIPSTDPAR-DEV: {{.}}void @foo({{.}})
				extern "C" void foo(float *a, float b) {
				*a = b;
				}

				// NO-HIPSTDPAR-DEV: {{.}}void @bar({{.}})
				// HIPSTDPAR-DEV: {{.}}void @bar({{.}})
				extern "C" __device__ void bar(float *a, float b) {
				*a = b;
				}

clang/test/CodeGenHipStdPar/unsupported-ASM.cpp

This file was added.

				// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -aux-triple x86_64-unknown-linux-gnu \
				// RUN: --hipstdpar -x hip -emit-llvm -fcuda-is-device -o - %s | FileCheck %s

				#define __global__ __attribute__((global))

				__global__ void foo(int i) {
				asm ("addl %2, %1; seto %b0" : "=q" (i), "+g" (i) : "r" (i));
				}

				// CHECK: declare void @__ASM__hipstdpar_unsupported([{{.*}}])

clang/test/CodeGenHipStdPar/unsupported-builtins.cpp

This file was added.

				// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -aux-triple x86_64-unknown-linux-gnu \
				// RUN: --hipstdpar -x hip -emit-llvm -fcuda-is-device -o - %s | FileCheck %s

				#define __global__ __attribute__((global))

				__global__ void foo() { return __builtin_ia32_pause(); }

				// CHECK: declare void @__builtin_ia32_pause__hipstdpar_unsupported()