This is an archive of the discontinued LLVM Phabricator instance.

[Clang][OpenMP] Delay emission of __kmpc_alloc_shared for escaped VLAs
ClosedPublic

Authored by doru1004 on Jun 27 2023, 8:11 AM.

Details

Summary

This patch fixes an issue with the use of __kmpc_alloc_shared to allocate dynamically sized VLAs on GPUs when the declaration escapes its enclosing context. For example:

#pragma omp target teams distribute
for (int i=0; i<M; i++) {
  int N = 10;
  double A[N];

  #pragma omp parallel for
  for(int j=0; j<N; j++) {
    A[j] = j;
  }
}

This will generate a pair of __kmpc_alloc_shared / __kmpc_free_shared calls to handle the allocation and deallocation of A inside the target region, but the emission is delayed until the VLA size is available in user code.
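Roughly, the intended device-side lowering looks like the following hand-written sketch (the runtime entry points are the usual OpenMP device runtime ones; the exact signatures, casts, and placement are simplified and illustrative, not compiler output):

// Approximate device runtime entry points assumed by this sketch.
void *__kmpc_alloc_shared(size_t Bytes);
void __kmpc_free_shared(void *Ptr, size_t Bytes);

// Conceptually, the loop body above becomes:
for (int i = 0; i < M; i++) {
  int N = 10;                                                    // size is computed here, in user code
  double *A = (double *)__kmpc_alloc_shared(N * sizeof(double)); // emission delayed until N exists
  // ... the inner parallel for writes A[j] = j ...
  __kmpc_free_shared(A, N * sizeof(double));                     // paired free at end of scope
}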

Event Timeline

doru1004 created this revision.Jun 27 2023, 8:11 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 27 2023, 8:11 AM
doru1004 requested review of this revision.Jun 27 2023, 8:11 AM

So this is implementing the stacksave using __kmpc_alloc_shared instead? It makes sense since the OpenMP standard expects sharing for the stack. I wonder how this interfaces with -fopenmp-cuda-mode.

clang/lib/CodeGen/CGDecl.cpp
1603

Does NVPTX handle this already? If not, is there a compelling reason to exclude NVPTX? Otherwise we should check whether we are compiling for an OpenMP device.

Add the runtime test?

clang/lib/CodeGen/CGDecl.cpp
587

Better to pass it as const reference

589

Wrong param name, use Camel

1603

OpenMPIsDevice?

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1085

Why is this code removed?

1125

Use std::make_pair(VoidPtr, Size).

doru1004 added inline comments.Jun 27 2023, 8:47 AM
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1085

I could not understand why this code is here in the first place, since it doesn't seem that it could ever work correctly (and it doesn't seem to be covered by any existing tests). Maybe I'm wrong, but that was my understanding of it. What seems to happen is that this code attempts to emit a __kmpc_alloc_shared before the actual size calculation is emitted. So if the VLA size is something the user defines, such as int N = 10;, then that code will not have been emitted at this point. When the expression computing the size of the VLA uses N, the code deleted here would simply fail to find the VLA size while attempting to emit the __kmpc_alloc_shared. The emission of the VLA as __kmpc_alloc_shared needs to happen after the expression for the size is emitted.

jhuber6 added inline comments.Jun 27 2023, 8:49 AM
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1085

I'm pretty sure I was the one that wrote this code, and at the time I don't recall it really working. I remember there was something else that expected this to be here, but for what purpose I do not recall. VLAs were never tested or used.

ABataev added inline comments.Jun 27 2023, 9:08 AM
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1085

They are tested; check test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp, for example, where a VLA is captured implicitly. I assume this should not be AMDGCN specific.

doru1004 added inline comments.Jun 27 2023, 9:28 AM
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1085

Oh I see, so this code path would cover the case where the VLA is defined outside the target region? I'm surprised I haven't seen any lit test failures for AMD GPUs; maybe this kind of test only exists for NVPTX. I'll add a test for AMD GPUs in that case.

doru1004 added inline comments.Jun 27 2023, 9:32 AM
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1085

Edit: the VLA is defined outside the target region => the VLA size is defined outside the target region
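For reference, a constructed example of that case (the VLA size comes from outside the target region, so it is already available when the function prologue runs; this is my reading of the discussion, not a test from this patch):

void foo(int N) {
  #pragma omp target teams distribute
  for (int i = 0; i < N; i++) {
    double A[N];                   // size N is captured from outside the target region
    #pragma omp parallel for
    for (int j = 0; j < N; j++)
      A[j] = j;
  }
}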

doru1004 updated this revision to Diff 535186.Jun 27 2023, 4:29 PM
doru1004 marked 3 inline comments as done.
doru1004 added inline comments.
clang/lib/CodeGen/CGDecl.cpp
1603

Does NVPTX support dynamic allocas?

ABataev added inline comments.Jun 27 2023, 4:43 PM
clang/lib/CodeGen/CGDecl.cpp
1603

It does not matter here; it depends on the runtime library implementation. The compiler just has to emit the proper runtime calls, everything else is part of the runtime support.

arsenm added inline comments.Jun 28 2023, 4:38 AM
clang/lib/CodeGen/CGDecl.cpp
1603

I think I heard that recent PTX introduced new instructions for it. AMDGPU codegen just happens to be broken because we don't properly restore the stack afterwards. When I added the support we had no way of testing (and still don't, really; __builtin_alloca doesn't handle a non-0 stack address space correctly).

doru1004 added inline comments.Jun 28 2023, 6:55 AM
clang/lib/CodeGen/CGDecl.cpp
1603

If NVPTX supports that, then there is no reason to have NVPTX avoid emitting allocas (i.e. the condition stays as it is right now), but I am willing to reach a consensus, so please let me know what you would all prefer.

arsenm added inline comments.Jun 28 2023, 7:23 AM
clang/lib/CodeGen/CGDecl.cpp
1603

Frontends seem to have a tradition of working around missing features in codegen; I think you should just pass through the correct IR and leave the backend bugs to the backends.

I think it's better to just limit it to AMDGPU for now.
BTW, it might be worth checking whether heap-to-stack will push it back to the stack.

> I think it's better to just limit it to AMDGPU for now.
> BTW, it might be worth checking whether heap-to-stack will push it back to the stack.

If you're really going to go for backend workarounds, it should be special-casing the known broken targets with a FIXME explaining why, not a positive check for where it's enabled.

> I think it's better to just limit it to AMDGPU for now.

I rather doubt this is a good decision. Better to support it for all targets. NVPTX supports(ed) (IIRC) static allocation and internal management of the shared memory (not sure whether that is true for the new library). If not, then we need at least to diagnose that this feature is not supported.

> BTW, it might be worth checking whether heap-to-stack will push it back to the stack.

doru1004 updated this revision to Diff 536059.Jun 29 2023, 5:19 PM
doru1004 retitled this revision from [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions to [Clang][OpenMP] Delay emission of __kmpc_alloc_shared for escaped VLAs .
doru1004 edited the summary of this revision. (Show Details)
doru1004 added a comment.EditedJun 29 2023, 5:28 PM

I have modified the patch to do only one thing rather than several things, as the previous patch did. Essentially, this patch now only handles the delayed emission of the __kmpc_alloc_shared for a VLA that could not be emitted in the prologue of the function. It is now very precise in terms of which VLAs it will transform into __kmpc_alloc_shared, i.e. only the ones that were previously attempted in the prologue and could not be emitted because their size was missing (had not been emitted yet).

I have dropped the previous intention of emitting __kmpc_alloc_shared for thread-local variables that have a dynamic size. I am emitting dynamic allocas (as the test shows), which will fail in the backend as expected. This behavior needs to be resolved separately in the backend, according to @arsenm, and any workaround in the frontend would have to live in a standalone patch that can be reverted once the backend is fixed.

ABataev added inline comments.Jun 30 2023, 4:56 AM
clang/lib/CodeGen/CGDecl.cpp
589

The name of the variable hides the type; this is a potential warning or even an error.

1605–1609

I think you can drop triple checks and rely completely on RT.isDelayedVariableLengthDecl(*this, &D) result here

clang/lib/CodeGen/CodeGenFunction.cpp
2164–2174

Fix var naming

doru1004 added inline comments.Jun 30 2023, 7:25 AM
clang/lib/CodeGen/CGDecl.cpp
1605–1609

I tried it, but there is a lit test (which I cannot identify) that hangs when offloading to the host (I think), so it has to be an actual GPU. Any ideas?

ABataev added inline comments.Jun 30 2023, 8:25 AM
clang/lib/CodeGen/CGDecl.cpp
1605–1609

Make isDelayedVariableLengthDecl virtual in the base CGOpenMPRuntime, make it return false by default, and return true in the GPU implementation. This should fix the problem, I hope.
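A minimal stand-alone sketch of that suggestion (class and member names mirror the ones discussed in the review, but the signatures are simplified stand-ins, not the actual Clang code):

// Base runtime: conservative default, no delayed emission on non-GPU devices.
struct CGOpenMPRuntimeSketch {
  virtual ~CGOpenMPRuntimeSketch() = default;
  virtual bool isDelayedVariableLengthDecl() const { return false; }
};

// GPU runtime: overrides the default. In the real patch this consults the
// per-function list of delayed variable-length declarations.
struct CGOpenMPRuntimeGPUSketch : CGOpenMPRuntimeSketch {
  bool isDelayedVariableLengthDecl() const override { return true; }
};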

doru1004 updated this revision to Diff 536288.Jun 30 2023, 9:24 AM
doru1004 marked 3 inline comments as done.
doru1004 added inline comments.
clang/lib/CodeGen/CGDecl.cpp
1605–1609

It worked, thank you for the suggestion!!

doru1004 marked an inline comment as done.Jun 30 2023, 10:11 AM
ABataev added inline comments.Jun 30 2023, 10:16 AM
clang/lib/CodeGen/CGDecl.cpp
590–591
auto &RT = static_cast<CGOpenMPRuntimeGPU &>(...);
1605–1606

No need to cast to CGOpenMPRuntimeGPU since isDelayedVariableLengthDecl is a member of CGOpenMPRuntime.

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1116–1120
return llvm::is_contained(I->getSecond().DelayedVariableLengthDecls, VD);
1152

pass it here and in other places as const reference

clang/lib/CodeGen/CodeGenFunction.h
2806
  1. Is it possible that VariableArrayType does not have VLA size?
  2. Fix param name
doru1004 added inline comments.Jun 30 2023, 10:23 AM
clang/lib/CodeGen/CodeGenFunction.h
2806

@ABataev How would point 1 happen?

ABataev added inline comments.Jun 30 2023, 10:37 AM
clang/lib/CodeGen/CodeGenFunction.h
2806

You're adding a function that checks whether a VLA type has a VLA size. I'm asking: is it possible for a VLA type to not have a VLA size at all? Why do you need this function?

doru1004 updated this revision to Diff 536321.Jun 30 2023, 10:50 AM
doru1004 marked 4 inline comments as done.
doru1004 added inline comments.
clang/lib/CodeGen/CGDecl.cpp
1605–1606

RT is also used further down to call getKmpcAllocShared().

clang/lib/CodeGen/CodeGenFunction.h
2806

This function checks if the expression of the size of the VLA has already been emitted and can be used.

doru1004 updated this revision to Diff 536322.Jun 30 2023, 10:52 AM
ABataev added inline comments.Jun 30 2023, 10:53 AM
clang/lib/CodeGen/CodeGenFunction.cpp
2168

Use VLASizeMap.find() instead

clang/lib/CodeGen/CodeGenFunction.h
2806

Why can the emission of the VLA size be delayed?

doru1004 added inline comments.Jun 30 2023, 10:55 AM
clang/lib/CodeGen/CodeGenFunction.h
2806

Because the size of the VLA is emitted in the user code, and the prologue of the function happens before that. The emission of the VLA needs to be delayed until its size has been emitted in the user code.
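A minimal sketch of the "has the size been emitted yet?" check being discussed at this point (VLASizeMap mirrors the CodeGenFunction member of the same name; the helper name and the surrounding types are illustrative stand-ins, and the final patch later moves away from this check):

#include <unordered_map>

struct Expr;   // stand-in for the VLA size expression
struct Value;  // stand-in for the emitted llvm::Value

struct CodeGenFunctionSketch {
  // Filled in when the size expression of a VLA is emitted.
  std::unordered_map<const Expr *, Value *> VLASizeMap;

  // True once the size expression has been emitted, i.e. the delayed
  // __kmpc_alloc_shared can be placed after this point.
  bool hasVLASize(const Expr *SizeExpr) const {
    return VLASizeMap.find(SizeExpr) != VLASizeMap.end();
  }
};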

doru1004 updated this revision to Diff 536326.Jun 30 2023, 11:08 AM
doru1004 marked an inline comment as done.
ABataev added inline comments.Jun 30 2023, 11:32 AM
clang/lib/CodeGen/CodeGenFunction.h
2806

This is a very fragile approach. Can you instead try to improve the markAsEscaped function and fix the insertion of VD into EscapedVariableLengthDecls: if the declaration is internal to the target region, insert it into DelayedVariableLengthDecls instead?

doru1004 added inline comments.Jun 30 2023, 12:08 PM
clang/lib/CodeGen/CodeGenFunction.h
2806

I am not sure what the condition would be, at that point, to choose between one list or the other. I'm not sure what you mean by the declaration being internal to the target region.

doru1004 added inline comments.Jun 30 2023, 2:11 PM
clang/lib/CodeGen/CodeGenFunction.h
2806

Any thoughts? As far as I can tell all VLAs that reach that point belong in DelayedVariableLengthDecls

doru1004 added inline comments.Jun 30 2023, 3:59 PM
clang/lib/CodeGen/CodeGenFunction.h
2806

@ABataev I cannot think of a condition to use for the distinction in markAsEscaped(). Could you please explain in more detail what you want me to check? I can make the rest of the changes happen, no problem, but I don't know what the condition is. Unless you tell me otherwise, I think the best condition is to check whether the VLA size has been emitted (i.e. that it is part of the VLASize list), in which case the code as it is now is fine.

ABataev added inline comments.Jun 30 2023, 4:03 PM
clang/lib/CodeGen/CodeGenFunction.h
2806

Can you check that the declaration is not captured in the target context? If it is not captured, it is declared in the target region and should be emitted as delayed.

doru1004 added inline comments.Jun 30 2023, 4:39 PM
clang/lib/CodeGen/CodeGenFunction.h
2806

How do I check that? There doesn't seem to be a list of captured variables available at that point in the code.

doru1004 added inline comments.Jun 30 2023, 4:47 PM
clang/lib/CodeGen/CodeGenFunction.h
2806

So the complication is that the same declaration is both captured and not captured at the same time: it can be declared inside a teams distribute (not captured) but captured by an inner parallel for. I think I can come up with something, though.

ABataev added inline comments.Jun 30 2023, 4:55 PM
clang/lib/CodeGen/CodeGenFunction.h
2806

Need to check the captures in the target regions only

doru1004 added inline comments.Jun 30 2023, 5:51 PM
clang/lib/CodeGen/CodeGenFunction.h
2806

I cannot get a handle on the target directive in the markAsEscaped function in order to look at its captures.

doru1004 updated this revision to Diff 536489.Jun 30 2023, 5:57 PM

@ABataev This is as close as I could get to what you wanted. I don't know how to get hold of the target directive so late in the emission process, i.e. in the markAsEscaped function. The target directive doesn't get visited by the checker for escaped variables, so I cannot get the list of captures from it.

In any case the patch is good to go. It no longer relies on VLA size checks.

ABataev added inline comments.Jul 3 2023, 5:39 AM
clang/lib/CodeGen/CGDecl.cpp
1606
  1. use static_cast<CGOpenMPRuntimeGPU &>(CGM.getOpenMPRuntime())
  2. It will crash if your device is not a GPU. Better to make getKmpcAllocShared and getKmpcFreeShared virtual (just like isDelayedVariableLengthDecl) in the base CGOpenMPRuntime, since they may be required not only for GPU-based devices.
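A simplified sketch of that second item, with the default implementations trapping for non-GPU devices (the member names follow the review, the signatures are stand-ins, and std::abort stands in for llvm_unreachable):

#include <cstddef>
#include <cstdio>
#include <cstdlib>

struct CGOpenMPRuntimeSketch {
  virtual ~CGOpenMPRuntimeSketch() = default;

  // Default implementations: these entry points are not expected to be
  // reached for non-GPU devices.
  virtual void *getKmpcAllocShared(std::size_t /*Bytes*/) {
    std::fprintf(stderr, "getKmpcAllocShared is only supported for GPU devices\n");
    std::abort(); // stands in for llvm_unreachable(...)
  }
  virtual void getKmpcFreeShared(void * /*Ptr*/, std::size_t /*Bytes*/) {
    std::fprintf(stderr, "getKmpcFreeShared is only supported for GPU devices\n");
    std::abort();
  }
};

// The GPU runtime overrides these to emit calls to __kmpc_alloc_shared and
// __kmpc_free_shared (omitted here).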
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
262

Yep, this is what I meant. The only question: do you really need this new parameter? CGF.CapturedStmtInfo provides the list of captures and you can try to use it.
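A minimal illustration of the capture-based distinction this points at (CapturedStmtInfo providing a lookup that returns null for an uncaptured declaration mirrors Clang's CGCapturedStmtInfo; the surrounding types and the helper are simplified, hypothetical stand-ins):

struct VarDecl;   // stand-in for the declaration being classified
struct FieldDecl; // stand-in for the capture field

struct CapturedStmtInfoSketch {
  virtual ~CapturedStmtInfoSketch() = default;
  // Returns the capture field for VD, or nullptr if VD is not captured
  // by the target region's captured statement.
  virtual const FieldDecl *lookup(const VarDecl *VD) const = 0;
};

// A declaration that is not captured by the target region is declared inside
// it, so its __kmpc_alloc_shared emission must be delayed until the size is known.
inline bool needsDelayedAllocShared(const CapturedStmtInfoSketch &Info,
                                    const VarDecl *VD) {
  return Info.lookup(VD) == nullptr;
}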

1088–1092

Do you still need this check?

doru1004 updated this revision to Diff 537478.Jul 5 2023, 1:22 PM
doru1004 marked an inline comment as done.
doru1004 marked 8 inline comments as done.
doru1004 updated this revision to Diff 537485.Jul 5 2023, 1:40 PM
ABataev added inline comments.Jul 5 2023, 1:58 PM
clang/lib/CodeGen/CGDecl.cpp
1606

Check the second item, please; it is better to make all new member functions virtual and handle them for non-GPU devices too.

doru1004 added inline comments.Jul 5 2023, 2:04 PM
clang/lib/CodeGen/CGDecl.cpp
1606

The support I am adding is only meant for GPUs. I am not sure why we need to consider non-GPUs; there already exists VLA handling for non-GPUs and that should be used.

ABataev added inline comments.Jul 5 2023, 2:08 PM
clang/lib/CodeGen/CGDecl.cpp
1606
  1. It will crash the compiler if your device is not a GPU (say, a CPU).
  2. I'm not asking you to implement it for non-GPU devices, I'm asking you to provide a common interface. The general implementation should just call llvm_unreachable, nothing else.
doru1004 added inline comments.Jul 5 2023, 2:10 PM
clang/lib/CodeGen/CGOpenMPRuntime.h
699–710

@ABataev I have added the interface entries here.

ABataev added inline comments.Jul 5 2023, 2:14 PM
clang/lib/CodeGen/CGDecl.cpp
591

Same, just CGOpenMPRuntime &RT = CGM.getOpenMPRuntime();

1605

Here and in other places, just remove the cast to CGOpenMPRuntimeGPU; CGM.getOpenMPRuntime() already provides virtual functions, use them directly without the cast:

CGOpenMPRuntime &RT = CGM.getOpenMPRuntime();
clang/lib/CodeGen/CGOpenMPRuntime.h
699–710

Then you're already good; just do not cast to CGOpenMPRuntimeGPU, use CGM.getOpenMPRuntime() directly since it already has these member functions.

doru1004 updated this revision to Diff 537498.Jul 5 2023, 2:23 PM
doru1004 marked 2 inline comments as done.
ABataev accepted this revision.Jul 6 2023, 4:36 AM

LG with a nit

clang/lib/CodeGen/CGDecl.cpp
19

You can remove this include

This revision is now accepted and ready to land.Jul 6 2023, 4:36 AM
doru1004 updated this revision to Diff 537706.Jul 6 2023, 7:12 AM
doru1004 marked an inline comment as done.

@ABataev thank you for the review! I have now fixed the last nit and will commit the patch soon!