This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Replace function pointer uses in GPU state machine
Closed, Public

Authored by jdoerfert on Jul 6 2020, 6:07 PM.

Details

Summary

In non-SPMD mode we create state-machine-like code to identify the
parallel region the GPU worker threads should execute next. The
identification uses the parallel region function pointer, as that allows
it to work even if the kernel (=target region) and the parallel region
are in separate TUs. However, taking the address of a function comes
with various downsides. With this patch we identify the most common
situation and replace the function pointer use with a dummy global
symbol (for identification purposes only). That is, if the parallel
region is only called from a single target region (or kernel), we
identify it not by its function pointer but by a new global symbol.

Fixes PR46450.
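
For illustration, a minimal C-style sketch of the before/after shape described above (all names are invented for this example; the actual transformation operates on LLVM IR emitted by the OpenMP codegen):

// Before: the worker state machine compares the work-function pointer
// against the parallel region's address, so the address is taken.
extern void parallel_region(void); // outlined parallel region (invented name)

void worker_before(void (*WorkFn)(void)) {
  if (WorkFn == &parallel_region) // address-taken use this patch eliminates
    parallel_region();            // direct call once identified
}

// After: a dummy global acts as the identifier. Its value is irrelevant;
// only its unique address matters. The region is still called directly,
// but its address is no longer taken anywhere.
char parallel_region_ID;

void worker_after(void *WorkFnID) {
  if (WorkFnID == (void *)&parallel_region_ID)
    parallel_region();
}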

Diff Detail

Event Timeline

jdoerfert created this revision. Jul 6 2020, 6:07 PM
arsenm added a subscriber: arsenm. Jul 6 2020, 6:29 PM
arsenm added inline comments.
llvm/lib/Transforms/IPO/OpenMPOpt.cpp
970–971

if (CachedKernel)
  return *CachedKernel;

978

*CachedValue
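
For context, the caching shape these two suggestions point at looks roughly like this (a minimal sketch; the Kernel typedef, the map name, and the surrounding logic are assumptions based on the comments, not the patch verbatim):

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/Optional.h"
#include "llvm/IR/Function.h"

using Kernel = llvm::Function *; // assumed typedef for a kernel entry

// Cache: None = not computed yet, nullptr = known to have no unique kernel.
static llvm::DenseMap<llvm::Function *, llvm::Optional<Kernel>> UniqueKernelMap;

static Kernel getUniqueKernelFor(llvm::Function &F) {
  llvm::Optional<Kernel> &CachedKernel = UniqueKernelMap[&F];
  if (CachedKernel)
    return *CachedKernel; // early return on a cache hit, as suggested
  CachedKernel = nullptr; // placeholder; the real code walks F's uses here
  return *CachedKernel;
}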

llvm/test/Transforms/OpenMP/gpu_state_machine_function_ptr_replacement.ll
41

These tests seem really big

279–281

Mostly unneeded metadata?

JonChesterfield added a comment.

That's interesting. AMDGPU does not handle function pointers well and I suspect NVPTX has considerable performance overhead for them too. If a parallel region is only called from a single target region, it is always passed the same function pointer, so we can specialise the state machine. I think this machinery is equivalent to specialising the parallel region call.

The general case involves calling one parallel region runtime function with various different function pointers. Devirtualising that is fairly difficult. For another time.

For this simpler case, I think this transform is equivalent to specialising the various kmpc*parallel calls on a given function pointer. The callees are available when using a bitcode deviceRTL.

IIRC function specialisation / partial evaluation is one of the classic compiler optimisations that LLVM doesn't really do. It's difficult to define a good cost model, and C exposes function pointer comparison. What we could implement for this is an attribute-driven one, where we mark the function pointer arguments in the deviceRTL with such an attribute and use LTO. Avoid specialising a function whose address escapes.
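
To make that concrete, an attribute-driven variant might look roughly like the following sketch (entirely hypothetical: the annotation string, names, and runtime signature are invented, not an existing deviceRTL API):

// DeviceRTL side: mark the entry whose function-pointer argument we are
// willing to specialise on. "specialize_fnptr" is a made-up annotation,
// not an existing Clang or LLVM attribute.
__attribute__((annotate("specialize_fnptr")))
void rtl_parallel(void (*WorkFn)(void)) {
  /* bookkeeping ... */
  WorkFn(); // indirect call that specialisation would make direct
}

// What an LTO-time specialisation pass could conceptually produce when the
// only pointer ever passed is &region0 (and region0's address never
// otherwise escapes):
extern void region0(void);
void rtl_parallel_region0(void) {
  /* bookkeeping ... */
  region0(); // direct call; no address-taken use remains
}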

I like this patch. It's a clear example of an effective OpenMP-specific optimisation. It just happens to run very close to specialisation, which may not be that much harder to implement if we cheat on the cost model.

tianshilei1992 added inline comments. Jul 6 2020, 7:44 PM
llvm/lib/Transforms/IPO/OpenMPOpt.cpp
1087

Probably we need to set Changed to true here?

jdoerfert updated this revision to Diff 275927. Jul 7 2020, 12:12 AM
jdoerfert marked 6 inline comments as done.

Addressed comments

> That's interesting. AMDGPU does not handle function pointers well and I suspect NVPTX has considerable performance overhead for them too. If a parallel region is only called from a single target region, it is always passed the same function pointer, so we can specialise the state machine. I think this machinery is equivalent to specialising the parallel region call.

The problem here was the spurious call edge from an unrelated kernel to the outlined parallel function. ptxas then needed more registers for a trivial kernel as it was "thought" to call the outlined function.
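
A hypothetical sketch of that spurious edge (invented names): because the outlined function's address is taken, every indirect call in every kernel's state machine is assumed to possibly reach it:

// A register-hungry outlined region; its address is taken by the codegen
// of the kernel it actually belongs to (in another TU, perhaps).
extern void heavy_outlined_region(void);

// An unrelated, trivial kernel: its state machine still contains an
// indirect call, and ptxas must assume that call may reach any
// address-taken function, heavy_outlined_region included, so the trivial
// kernel's register allocation is inflated.
void trivial_kernel_worker(void (*WorkFn)(void)) {
  if (WorkFn)
    WorkFn();
}
// Once no address-taken use of heavy_outlined_region survives, the set of
// possible indirect-call targets shrinks and the spurious edge disappears.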

> The general case involves calling one parallel region runtime function with various different function pointers. Devirtualising that is fairly difficult. For another time.

> For this simpler case, I think this transform is equivalent to specialising the various kmpc*parallel calls on a given function pointer. The callees are available when using a bitcode deviceRTL.

> IIRC function specialisation / partial evaluation is one of the classic compiler optimisations that LLVM doesn't really do. It's difficult to define a good cost model, and C exposes function pointer comparison. What we could implement for this is an attribute-driven one, where we mark the function pointer arguments in the deviceRTL with such an attribute and use LTO. Avoid specialising a function whose address escapes.

> I like this patch. It's a clear example of an effective OpenMP-specific optimisation. It just happens to run very close to specialisation, which may not be that much harder to implement if we cheat on the cost model.

Specialization is (soonish) coming to the Attributor ;)


llvm/test/Transforms/OpenMP/gpu_state_machine_function_ptr_replacement.ll
279–281

Interestingly, that is what our device runtime looks like. For reasons I haven't understood yet, it has all these "null is aligned" annotations. CUDA is weird.

Anyway, I can strip this down too.

jdoerfert edited the summary of this revision. Jul 7 2020, 5:08 AM
JonChesterfield accepted this revision. Jul 7 2020, 4:57 PM

I haven't been able to apply this to the aomp tree (for reasons unrelated to this patch), but by inspection I think it's sound. I like the conservative pattern matching approach.

The function pointer specialisation alternative is more complicated than I suggested above: because the pointer gets stored in local state and then loaded, it isn't readily available at each call site to specialise on.
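
A minimal sketch of that obstacle, with invented names: the pointer is written to shared state by one thread and loaded by the workers, so no call site carries a constant function-pointer argument to specialise on:

// Shared slot written by the main thread, read by the workers.
static void (*SharedWorkFn)(void);

extern void parallel_region(void);

void main_thread(void) {
  SharedWorkFn = &parallel_region; // stored into shared state ...
}

void worker_thread(void) {
  void (*WorkFn)(void) = SharedWorkFn; // ... and loaded here, not passed
  if (WorkFn)
    WorkFn(); // indirect call on a loaded value; nothing to specialise on
}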

llvm/test/Transforms/OpenMP/gpu_state_machine_function_ptr_replacement.ll
41

Agreed. I wonder if it's worth restructuring the OpenMP codegen to favour emitting functions instead of blocks with interesting control flow, such that tests like these look more like a linear sequence of named function calls. Said functions would then be inlined downstream of the codegen to produce the same IR we see here.

This revision is now accepted and ready to land. Jul 7 2020, 4:57 PM
This revision was automatically updated to reflect the committed changes.