This is an archive of the discontinued LLVM Phabricator instance.

[Clang][OpenMP] Delay emission of __kmpc_alloc_shared for escaped VLAs
ClosedPublic

Authored by doru1004 on Jun 27 2023, 8:11 AM.

Details

Summary

This patch fixes an issue with the use of __kmpc_alloc_shared to allocate dynamically sized VLAs on GPUs when the declaration escapes its enclosing context. For example:

#pragma omp target teams distribute
for (int i=0; i<M; i++) {
  int N = 10;
  double A[N];

  #pragma omp parallel for
  for(int j=0; j<N; j++) {
    A[j] = j;
  }
}

This will generate a pair of __kmpc_alloc_shared / __kmpc_free_shared calls to handle the allocation and deallocation of A inside the target region, but the emission is delayed until the VLA size is available in user code.
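Roughly, the intended device-side lowering looks like the following hand-written sketch (the runtime entry points are the usual OpenMP device runtime ones; the exact signatures, casts, and placement are simplified and illustrative, not compiler output):

// Approximate device runtime entry points assumed by this sketch.
void *__kmpc_alloc_shared(size_t Bytes);
void __kmpc_free_shared(void *Ptr, size_t Bytes);

// Conceptually, the loop body above becomes:
for (int i = 0; i < M; i++) {
  int N = 10;                                                    // size is computed here, in user code
  double *A = (double *)__kmpc_alloc_shared(N * sizeof(double)); // emission delayed until N exists
  // ... the inner parallel for writes A[j] = j ...
  __kmpc_free_shared(A, N * sizeof(double));                     // paired free at end of scope
}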

Event Timeline

doru1004 created this revision.Jun 27 2023, 8:11 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 27 2023, 8:11 AM
doru1004 requested review of this revision.Jun 27 2023, 8:11 AM

So this is implementing the stacksave using __kmpc_alloc_shared instead? It makes sense since the OpenMP standard expects sharing for the stack. I wonder how this interfaces with -fopenmp-cuda-mode.

clang/lib/CodeGen/CGDecl.cpp
1603

Does NVPTX handle this already? If not, is there a compelling reason to exclude NVPTX? Otherwise we should check whether we are compiling for an OpenMP device.

Add the runtime test?

clang/lib/CodeGen/CGDecl.cpp
587

Better to pass it as const reference

589

Wrong param name, use Camel

1603

OpenMPIsDevice?

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1085

Why is this code removed?

1125

Use std::make_pair(VoidPtr, Size).

doru1004 added inline comments.Jun 27 2023, 8:47 AM
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1085

I could not understand why this code is here in the first place, since it doesn't seem that it could ever work correctly (and it doesn't seem to be covered by any existing tests). Maybe I'm wrong, but that was my understanding of it. What seems to happen is that this code attempts to emit a __kmpc_alloc_shared before the actual size calculation is emitted. So if the VLA size is something the user defines, such as int N = 10;, then that code will not have been emitted at this point. When the expression computing the size of the VLA uses N, the code deleted here would simply fail to find the VLA size while attempting to emit the __kmpc_alloc_shared. The emission of the VLA as __kmpc_alloc_shared needs to happen after the expression for the size is emitted.

jhuber6 added inline comments.Jun 27 2023, 8:49 AM
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1085

I'm pretty sure I was the one that wrote this code, and at the time I don't recall it really working. I remember there was something else that expected this to be here, but for what purpose I do not recall. VLAs were never tested or used.

ABataev added inline comments.Jun 27 2023, 9:08 AM
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1085

They are tested; check test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp, for example, where a VLA is captured implicitly. I assume this should not be AMDGCN specific.

doru1004 added inline comments.Jun 27 2023, 9:28 AM
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1085

Oh I see, so this code path would cover the case where the VLA is defined outside the target region? I'm surprised I haven't seen any lit test failures for AMD GPUs; maybe this kind of test only exists for NVPTX. I'll add a test for AMD GPUs in that case.

doru1004 added inline comments.Jun 27 2023, 9:32 AM
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1085

Edit: the VLA is defined outside the target region => the VLA size is defined outside the target region
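For reference, a constructed example of that case (the VLA size comes from outside the target region, so it is already available when the function prologue runs; this is my reading of the discussion, not a test from this patch):

void foo(int N) {
  #pragma omp target teams distribute
  for (int i = 0; i < N; i++) {
    double A[N];                   // size N is captured from outside the target region
    #pragma omp parallel for
    for (int j = 0; j < N; j++)
      A[j] = j;
  }
}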

doru1004 updated this revision to Diff 535186.Jun 27 2023, 4:29 PM
doru1004 marked 3 inline comments as done.
doru1004 added inline comments.
clang/lib/CodeGen/CGDecl.cpp
1603

Does NVPTX support dynamic allocas?

ABataev added inline comments.Jun 27 2023, 4:43 PM
clang/lib/CodeGen/CGDecl.cpp
1603

It does not matter here; it depends on the runtime library implementation. The compiler just has to emit the proper runtime calls, everything else is part of the runtime support.

arsenm added inline comments.Jun 28 2023, 4:38 AM
clang/lib/CodeGen/CGDecl.cpp
1603

I think I heard that recent PTX introduced new instructions for it. AMDGPU codegen just happens to be broken because we don't properly restore the stack afterwards. When I added the support we had no way of testing (and still don't, really; __builtin_alloca doesn't handle a non-0 stack address space correctly).

doru1004 added inline comments.Jun 28 2023, 6:55 AM
clang/lib/CodeGen/CGDecl.cpp
1603

If NVPTX supports that, then there is no reason to have NVPTX avoid emitting allocas (i.e. the condition stays as it is right now), but I am willing to reach a consensus, so please let me know what you would all prefer.

arsenm added inline comments.Jun 28 2023, 7:23 AM
clang/lib/CodeGen/CGDecl.cpp
1603

Frontends seem to have a tradition of working around missing features in codegen; I think you should just pass through the correct IR and leave the backend bugs to the backends.

I think it's better to just limit it to AMDGPU for now.
BTW, it might be worth checking whether heap-to-stack will push it back to the stack.

> I think it's better to just limit it to AMDGPU for now.
> BTW, it might be worth checking whether heap-to-stack will push it back to the stack.

If you're really going to go for backend workarounds, it should be special-casing the known broken targets with a FIXME explaining why, not a positive check for where it's enabled.

> I think it's better to just limit it to AMDGPU for now.

I rather doubt this is a good decision. Better to support it for all targets. NVPTX supports(ed) (IIRC) static allocation and internal management of the shared memory (not sure whether that is true for the new library). If not, then we need at least to diagnose that this feature is not supported.

> BTW, it might be worth checking whether heap-to-stack will push it back to the stack.

doru1004 updated this revision to Diff 536059.Jun 29 2023, 5:19 PM
doru1004 retitled this revision from [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions to [Clang][OpenMP] Delay emission of __kmpc_alloc_shared for escaped VLAs .
doru1004 edited the summary of this revision. (Show Details)
doru1004 added a comment.EditedJun 29 2023, 5:28 PM

I have modified the patch to do only one thing rather than several things, as the previous patch did. Essentially, this patch now only handles the delayed emission of the __kmpc_alloc_shared for a VLA that could not be emitted in the prologue of the function. It is now very precise in terms of which VLAs it will transform into __kmpc_alloc_shared, i.e. only the ones that were previously attempted in the prologue and could not be emitted because their size was missing (had not been emitted yet).

I have dropped the previous intention of emitting __kmpc_alloc_shared for thread-local variables that have a dynamic size. I am emitting dynamic allocas (as the test shows), which will fail in the backend as expected. This behavior needs to be resolved separately in the backend, according to @arsenm, and any workaround in the frontend would have to live in a standalone patch that can be reverted once the backend is fixed.

ABataev added inline comments.Jun 30 2023, 4:56 AM
clang/lib/CodeGen/CGDecl.cpp
589

The name of the variable hides the type; this is a potential warning or even an error.

1605–1609

I think you can drop triple checks and rely completely on RT.isDelayedVariableLengthDecl(*this, &D) result here

clang/lib/CodeGen/CodeGenFunction.cpp
2164–2174

Fix var naming

doru1004 added inline comments.Jun 30 2023, 7:25 AM
clang/lib/CodeGen/CGDecl.cpp
1605–1609

I tried it, but there is a lit test (which I cannot identify) that hangs when offloading to the host (I think), so it has to be an actual GPU. Any ideas?

ABataev added inline comments.Jun 30 2023, 8:25 AM
clang/lib/CodeGen/CGDecl.cpp
1605–1609

Make isDelayedVariableLengthDecl virtual in the base CGOpenMPRuntime, make it return false by default, and return true in the GPU implementation. This should fix the problem, I hope.
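A minimal stand-alone sketch of that suggestion (class and member names mirror the ones discussed in the review, but the signatures are simplified stand-ins, not the actual Clang code):

// Base runtime: conservative default, no delayed emission on non-GPU devices.
struct CGOpenMPRuntimeSketch {
  virtual ~CGOpenMPRuntimeSketch() = default;
  virtual bool isDelayedVariableLengthDecl() const { return false; }
};

// GPU runtime: overrides the default. In the real patch this consults the
// per-function list of delayed variable-length declarations.
struct CGOpenMPRuntimeGPUSketch : CGOpenMPRuntimeSketch {
  bool isDelayedVariableLengthDecl() const override { return true; }
};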

doru1004 updated this revision to Diff 536288.Jun 30 2023, 9:24 AM
doru1004 marked 3 inline comments as done.
doru1004 added inline comments.
clang/lib/CodeGen/CGDecl.cpp
1605–1609

It worked, thank you for the suggestion!!

doru1004 marked an inline comment as done.Jun 30 2023, 10:11 AM
ABataev added inline comments.Jun 30 2023, 10:16 AM
clang/lib/CodeGen/CGDecl.cpp
590–591
auto &RT = static_cast<CGOpenMPRuntimeGPU &>(...);
1605–1606

No need to cast to CGOpenMPRuntimeGPU since isDelayedVariableLengthDecl is a member of CGOpenMPRuntime.

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1116–1120
return llvm::is_contained(I->getSecond().DelayedVariableLengthDecls, VD);
1152

pass it here and in other places as const reference

clang/lib/CodeGen/CodeGenFunction.h
2806
  1. Is it possible that VariableArrayType does not have VLA size?
  2. Fix param name
doru1004 added inline comments.Jun 30 2023, 10:23 AM
clang/lib/CodeGen/CodeGenFunction.h
2806

@ABataev How would point 1 happen?

ABataev added inline comments.Jun 30 2023, 10:37 AM
clang/lib/CodeGen/CodeGenFunction.h
2806

You're adding a function that checks whether a VLA type has a VLA size. I'm asking: is it possible for a VLA type to not have a VLA size at all? Why do you need this function?

doru1004 updated this revision to Diff 536321.Jun 30 2023, 10:50 AM
doru1004 marked 4 inline comments as done.
doru1004 added inline comments.
clang/lib/CodeGen/CGDecl.cpp
1605–1606

RT is also used further down to call getKmpcAllocShared().

clang/lib/CodeGen/CodeGenFunction.h
2806

This function checks if the expression of the size of the VLA has already been emitted and can be used.

doru1004 updated this revision to Diff 536322.Jun 30 2023, 10:52 AM
ABataev added inline comments.Jun 30 2023, 10:53 AM
clang/lib/CodeGen/CodeGenFunction.cpp
2168

Use VLASizeMap.find() instead

clang/lib/CodeGen/CodeGenFunction.h
2806

Why can the emission of the VLA size be delayed?

doru1004 added inline comments.Jun 30 2023, 10:55 AM
clang/lib/CodeGen/CodeGenFunction.h
2806

Because the size of the VLA is emitted in the user code, and the prologue of the function happens before that. The emission of the VLA needs to be delayed until its size has been emitted in the user code.
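A minimal sketch of the "has the size been emitted yet?" check being discussed at this point (VLASizeMap mirrors the CodeGenFunction member of the same name; the helper name and the surrounding types are illustrative stand-ins, and the final patch later moves away from this check):

#include <unordered_map>

struct Expr;   // stand-in for the VLA size expression
struct Value;  // stand-in for the emitted llvm::Value

struct CodeGenFunctionSketch {
  // Filled in when the size expression of a VLA is emitted.
  std::unordered_map<const Expr *, Value *> VLASizeMap;

  // True once the size expression has been emitted, i.e. the delayed
  // __kmpc_alloc_shared can be placed after this point.
  bool hasVLASize(const Expr *SizeExpr) const {
    return VLASizeMap.find(SizeExpr) != VLASizeMap.end();
  }
};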

doru1004 updated this revision to Diff 536326.Jun 30 2023, 11:08 AM
doru1004 marked an inline comment as done.
ABataev added inline comments.Jun 30 2023, 11:32 AM
clang/lib/CodeGen/CodeGenFunction.h
2806

This is a very fragile approach. Can you instead try to improve the markAsEscaped function and fix the insertion of VD into EscapedVariableLengthDecls: if the declaration is internal to the target region, insert it into DelayedVariableLengthDecls instead?

doru1004 added inline comments.Jun 30 2023, 12:08 PM
clang/lib/CodeGen/CodeGenFunction.h
2806

I am not sure what the condition would be, at that point, to choose between one list or the other. I'm not sure what you mean by the declaration being internal to the target region.

doru1004 added inline comments.Jun 30 2023, 2:11 PM
clang/lib/CodeGen/CodeGenFunction.h
2806

Any thoughts? As far as I can tell all VLAs that reach that point belong in DelayedVariableLengthDecls

doru1004 added inline comments.Jun 30 2023, 3:59 PM
clang/lib/CodeGen/CodeGenFunction.h
2806

@ABataev I cannot think of a condition to use for the distinction in markAsEscaped(). Could you please explain in more detail what you want me to check? I can make the rest of the changes happen, no problem, but I don't know what the condition is. Unless you tell me otherwise, I think the best condition is to check whether the VLA size has been emitted (i.e. that it is part of the VLASize list), in which case the code as it is now is fine.

ABataev added inline comments.Jun 30 2023, 4:03 PM
clang/lib/CodeGen/CodeGenFunction.h
2806

Can you check that the declaration is not captured in the target context? If it is not captured, it is declared in the target region and should be emitted as delayed.

doru1004 added inline comments.Jun 30 2023, 4:39 PM
clang/lib/CodeGen/CodeGenFunction.h
2806

How do I check that? There doesn't seem to be a list of captured variables available at that point in the code.

doru1004 added inline comments.Jun 30 2023, 4:47 PM
clang/lib/CodeGen/CodeGenFunction.h
2806

So the complication is that the same declaration is both captured and not captured at the same time: it can be declared inside a teams distribute (not captured) but captured by an inner parallel for. I think I can come up with something, though.

ABataev added inline comments.Jun 30 2023, 4:55 PM
clang/lib/CodeGen/CodeGenFunction.h
2806

Need to check the captures in the target regions only

doru1004 added inline comments.Jun 30 2023, 5:51 PM
clang/lib/CodeGen/CodeGenFunction.h
2806

I cannot get a handle on the target directive in the markAsEscaped function in order to look at its captures.

doru1004 updated this revision to Diff 536489.Jun 30 2023, 5:57 PM

@ABataev This is as close as I could get to what you wanted. I don't know how to get hold of the target directive so late in the emission process, i.e. in the markAsEscaped function. The target directive doesn't get visited by the checker for escaped variables, so I cannot get the list of captures from it.

In any case the patch is good to go. It no longer relies on VLA size checks.

ABataev added inline comments.Jul 3 2023, 5:39 AM
clang/lib/CodeGen/CGDecl.cpp
1606
  1. use static_cast<CGOpenMPRuntimeGPU &>(CGM.getOpenMPRuntime())
  2. It will crash if your device is not a GPU. Better to make getKmpcAllocShared and getKmpcFreeShared virtual (just like isDelayedVariableLengthDecl) in the base CGOpenMPRuntime, since they may be required not only for GPU-based devices.
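A simplified sketch of that second item, with the default implementations trapping for non-GPU devices (the member names follow the review, the signatures are stand-ins, and std::abort stands in for llvm_unreachable):

#include <cstddef>
#include <cstdio>
#include <cstdlib>

struct CGOpenMPRuntimeSketch {
  virtual ~CGOpenMPRuntimeSketch() = default;

  // Default implementations: these entry points are not expected to be
  // reached for non-GPU devices.
  virtual void *getKmpcAllocShared(std::size_t /*Bytes*/) {
    std::fprintf(stderr, "getKmpcAllocShared is only supported for GPU devices\n");
    std::abort(); // stands in for llvm_unreachable(...)
  }
  virtual void getKmpcFreeShared(void * /*Ptr*/, std::size_t /*Bytes*/) {
    std::fprintf(stderr, "getKmpcFreeShared is only supported for GPU devices\n");
    std::abort();
  }
};

// The GPU runtime overrides these to emit calls to __kmpc_alloc_shared and
// __kmpc_free_shared (omitted here).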
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
262

Yep, this is what I meant. The only question: do you really need this new parameter? CGF.CapturedStmtInfo provides the list of captures and you can try to use it.
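A minimal illustration of the capture-based distinction this points at (CapturedStmtInfo providing a lookup that returns null for an uncaptured declaration mirrors Clang's CGCapturedStmtInfo; the surrounding types and the helper are simplified, hypothetical stand-ins):

struct VarDecl;   // stand-in for the declaration being classified
struct FieldDecl; // stand-in for the capture field

struct CapturedStmtInfoSketch {
  virtual ~CapturedStmtInfoSketch() = default;
  // Returns the capture field for VD, or nullptr if VD is not captured
  // by the target region's captured statement.
  virtual const FieldDecl *lookup(const VarDecl *VD) const = 0;
};

// A declaration that is not captured by the target region is declared inside
// it, so its __kmpc_alloc_shared emission must be delayed until the size is known.
inline bool needsDelayedAllocShared(const CapturedStmtInfoSketch &Info,
                                    const VarDecl *VD) {
  return Info.lookup(VD) == nullptr;
}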

1088–1092

Do you still need this check?

doru1004 updated this revision to Diff 537478.Jul 5 2023, 1:22 PM
doru1004 marked an inline comment as done.
doru1004 marked 8 inline comments as done.
doru1004 updated this revision to Diff 537485.Jul 5 2023, 1:40 PM
ABataev added inline comments.Jul 5 2023, 1:58 PM
clang/lib/CodeGen/CGDecl.cpp
1606

Check the second item, please; it is better to make all new member functions virtual and handle them for non-GPU devices too.

doru1004 added inline comments.Jul 5 2023, 2:04 PM
clang/lib/CodeGen/CGDecl.cpp
1606

The support I am adding is only meant for GPUs. I am not sure why we need to consider non-GPUs; there already exists VLA handling for non-GPUs and that should be used.

ABataev added inline comments.Jul 5 2023, 2:08 PM
clang/lib/CodeGen/CGDecl.cpp
1606
  1. It will crash the compiler if your device is not a GPU (say, a CPU).
  2. I'm not asking you to implement it for non-GPU devices, I'm asking you to provide a common interface. The general implementation should just call llvm_unreachable, nothing else.
doru1004 added inline comments.Jul 5 2023, 2:10 PM
clang/lib/CodeGen/CGOpenMPRuntime.h
699–710

@ABataev I have added the interface entries here.

ABataev added inline comments.Jul 5 2023, 2:14 PM
clang/lib/CodeGen/CGDecl.cpp
591

Same, just CGOpenMPRuntime &RT = CGM.getOpenMPRuntime();

1605

Here and in other places, just remove the cast to CGOpenMPRuntimeGPU; CGM.getOpenMPRuntime() already provides virtual functions, use them directly without the cast:

CGOpenMPRuntime &RT = CGM.getOpenMPRuntime();
clang/lib/CodeGen/CGOpenMPRuntime.h
699–710

Then you're already good; just do not cast to CGOpenMPRuntimeGPU, use CGM.getOpenMPRuntime() directly since it already has these member functions.

doru1004 updated this revision to Diff 537498.Jul 5 2023, 2:23 PM
doru1004 marked 2 inline comments as done.
ABataev accepted this revision.Jul 6 2023, 4:36 AM

LG with a nit

clang/lib/CodeGen/CGDecl.cpp
19

You can remove this include

This revision is now accepted and ready to land.Jul 6 2023, 4:36 AM
doru1004 updated this revision to Diff 537706.Jul 6 2023, 7:12 AM
doru1004 marked an inline comment as done.

@ABataev thank you for the review! I have now fixed the last nit and will commit the patch soon!