Page MenuHomePhabricator

[OpenMP] Codegen aggregate for outlined function captures
AcceptedPublic

Authored by ggeorgakoudis on May 8 2021, 8:16 AM.

Details

Summary

Parallel regions are outlined as functions with capture variables explicitly generated as distinct parameters in the function's argument list. That complicates the fork_call interface in the OpenMP runtime: (1) the fork_call is variadic since there is a variable number of arguments to forward to the outlined function, (2) wrapping/unwrapping arguments happens in the OpenMP runtime, which is sub-optimal, has been a source of ABI bugs, and has a hardcoded limit (16) in the number of arguments, (3) forwarded arguments must cast to pointer types, which complicates debugging. This patch avoids those issues by aggregating captured arguments in a struct to pass to the fork_call.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I got this after changing __kmpc_impl_malloc to return 0xdeadbeef. So, this confirms that missing malloc implementation is the root cause.

Memory access fault by GPU node-4 (Agent handle: 0x1bc5000) on address 0xdeadb000. Reason: Page not present or supervisor privilege.

Nice! In that case I think the way to go is to audit the (probably few) places where kmpc_impl_malloc are called and add a check for whether the return value is 0. With that in place we can reland this and get more graceful failure (at a guess we should fall back to the host when gpu memory is exhausted? or maybe just print a 'out of gpu heap memory' style message and abort, don't know).

I got this after changing __kmpc_impl_malloc to return 0xdeadbeef. So, this confirms that missing malloc implementation is the root cause.

Memory access fault by GPU node-4 (Agent handle: 0x1bc5000) on address 0xdeadb000. Reason: Page not present or supervisor privilege.

Nice! In that case I think the way to go is to audit the (probably few) places where kmpc_impl_malloc are called and add a check for whether the return value is 0. With that in place we can reland this and get more graceful failure (at a guess we should fall back to the host when gpu memory is exhausted? or maybe just print a 'out of gpu heap memory' style message and abort, don't know).

We should only fail to remove the kmpc_shared_alloc with O0. Since we need kmpc_shared_alloc for all non-trivial codes, they would always fail on AMDGPU. That said,
why is the shared memory stack not catching this. It's a 64 byte stack for the main thread and we are looking at at 24 byte allocation for declare_mapper_target.cpp.
Can you determine why first two conditionals in __kmpc_alloc_shared don't catch this and return proper memory?

ggeorgakoudis reopened this revision.Sep 28 2021, 10:01 AM
This revision is now accepted and ready to land.Sep 28 2021, 10:01 AM

Update memory allocation for aggregate argument.
Introduce runtime interface to allocate from local memory,
when in SPMD mode, or heap, when in generic.

Herald added a project: Restricted Project. · View Herald TranscriptSep 28 2021, 10:25 AM

@pdhaliwal @JonChesterfield @ronlieb I updated the aggregate argument memory allocation to use an alloca instead of malloc'ing in SPMD mode, which should resolve your issue. Could someone please test the updated patch and give me feedback before landing?

[AMD Official Use Only]

Hi George,
I will do it now ...

i backed up to your reverted patch, and applied this one.
I see some new errors

libomptarget :: amdgcn-amd-amdhsa :: mapping/declare_mapper_target.cpp
libomptarget :: amdgcn-amd-amdhsa :: mapping/declare_mapper_target_data.cpp
libomptarget :: amdgcn-amd-amdhsa :: mapping/declare_mapper_target_data_enter_exit.cpp
libomptarget :: amdgcn-amd-amdhsa :: mapping/declare_mapper_target_update.cpp
libomptarget :: amdgcn-amd-amdhsa :: offloading/parallel_offloading_map.cpp
libomptarget :: amdgcn-amd-amdhsa :: offloading/taskloop_offload_nowait.cpp

Dont know if Jon is around, so i will ask Singh @pdhaliwal if he can dig into it a bit more.

i backed up to your reverted patch, and applied this one.
I see some new errors

libomptarget :: amdgcn-amd-amdhsa :: mapping/declare_mapper_target.cpp
libomptarget :: amdgcn-amd-amdhsa :: mapping/declare_mapper_target_data.cpp
libomptarget :: amdgcn-amd-amdhsa :: mapping/declare_mapper_target_data_enter_exit.cpp
libomptarget :: amdgcn-amd-amdhsa :: mapping/declare_mapper_target_update.cpp
libomptarget :: amdgcn-amd-amdhsa :: offloading/parallel_offloading_map.cpp
libomptarget :: amdgcn-amd-amdhsa :: offloading/taskloop_offload_nowait.cpp

Dont know if Jon is around, so i will ask Singh @pdhaliwal if he can dig into it a bit more.

Hmm, thanks @ronlieb, @pdhaliwal please let me know what fails.

Apologies for late reply. Most of the tests now do not try to call malloc, so no page fault errors. But all of them are producing wrong results. For e.g. declare_mapper_target.cpp produces Sum = 132608 with the patch applied. Similarly for other tests as well. So don't know what's happening yet.

Apologies for late reply. Most of the tests now do not try to call malloc, so no page fault errors. But all of them are producing wrong results. For e.g. declare_mapper_target.cpp produces Sum = 132608 with the patch applied. Similarly for other tests as well. So don't know what's happening yet.

Thanks @pdhaliwal. I get the right result on nvidia. Please let me know when you get to the bottom of it.

I modified the declare_mapper_target to print the contents of array after target region and found the following output:

2 3 4 5 6 7 8 9 10 11 Sum = 65

Program:

#include <cstdio>
#include <cstdlib>

#define NUM 10

int main() {
  int *c= new int[NUM];
  for (int i = 0; i < NUM; i++) {
    c[i] = 1;
  }
#pragma omp target teams distribute  parallel for map(tofrom: c[0:NUM])
  for (int i = 0; i < NUM; i++) {
    c[i]++;
  }
  int sum = 0;
  for (int i = 0; i < NUM; i++) {
    sum += c[i];
    printf("%d ", c[i]);
  }
  // CHECK: Sum = 2048
  printf("Sum = %d\n", sum);
  return 0;
}

Different variant of the same program is producing correct output,

#include <cstdio>
#include <cstdlib>

#define NUM 10

int main() {
  int *c= new int[NUM];
  for (int i = 0; i < NUM; i++) {
    c[i] = 1;
  }

  int *b = new int[NUM];
#pragma omp target teams distribute  parallel for map(tofrom: c[0:NUM], b[0:NUM])
  for (int i = 0; i < NUM; i++) {
    b[i] = c[i] + 1;
  }
  int sum = 0;
  for (int i = 0; i < NUM; i++) {
    sum += b[i];
    printf("%d ", b[i]);
  }
  // CHECK: Sum = 2048
  printf("Sum = %d\n", sum);
  return 0;
}

Output (this is the right answer):

2 2 2 2 2 2 2 2 2 2 Sum = 20

On internal amd-stg-open branch, this patch works fine, so issue is only with the trunk.
I compared the generated IR before and after applying this patch, I didn't see anything suspicious. (but can't be 100% sure).

I have created a patch (D111218) with fix for amdgcn. This is a temporary fix. I will still keep on looking into it until I find a real root cause.

Update the interface for allocating/sharing the struct aggregate
Simplify invoking tasks

ggeorgakoudis requested review of this revision.Nov 9 2021, 7:44 AM

TODO update tests

Update tests
Fix for attributes to kmpc_alloc_aggregate_arg
Do not emit allocations if there are no arguments in the aggregate

ABataev accepted this revision.Dec 1 2021, 9:26 AM

LG wit a nit

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1623

Enclose into braces too

This revision is now accepted and ready to land.Dec 1 2021, 9:26 AM

Rebase, address comment, update few tests

ggeorgakoudis marked 2 inline comments as done.Dec 1 2021, 11:36 AM
JonChesterfield added a comment.EditedDec 1 2021, 11:42 AM

This works approximately as well as trunk does for me, provided D114865 is also applied. My baseline is not totally solid but I think there's a credible chance this would pass the buildbot, provided D114865 went in first.

Ron reports two new failures with this applied,
libomptarget :: amdgcn-amd-amdhsa :: offloading/bug51781.c
libomptarget :: amdgcn-amd-amdhsa :: offloading/bug51982.c

My local sm_75 box with this patch applied (and otherwise a clean build) claims failures in

libomptarget :: nvptx64-nvidia-cuda :: offloading/bug49334.cpp
libomptarget :: nvptx64-nvidia-cuda :: offloading/bug51781.c
libomptarget :: nvptx64-nvidia-cuda-newRTL :: offloading/bug49021.cpp
libomptarget :: nvptx64-nvidia-cuda-newRTL :: offloading/bug49334.cpp
libomptarget :: nvptx64-nvidia-cuda-newRTL :: offloading/bug51781.c
jdoerfert accepted this revision.Dec 22 2021, 2:45 PM

Can we land this? AMD issues seems resolved.

[AMD Official Use Only]

@Singh, Pushpinder is this resolved?
You were most recently working on it.

Thx

pdhaliwal added a comment.EditedDec 22 2021, 7:32 PM

I am seeing a lot of failures on nvptx machine (sm_70, cuda11.4) with this patch,

libomptarget :: nvptx64-nvidia-cuda :: offloading/bug49021.cpp
libomptarget :: nvptx64-nvidia-cuda :: offloading/bug49334.cpp
libomptarget :: nvptx64-nvidia-cuda :: offloading/bug49779.cpp
libomptarget :: nvptx64-nvidia-cuda :: offloading/bug51781.c
libomptarget :: nvptx64-nvidia-cuda :: offloading/bug51982.c
libomptarget :: nvptx64-nvidia-cuda :: unified_shared_memory/close_enter_exit.c
libomptarget :: nvptx64-nvidia-cuda :: unified_shared_memory/close_modifier.c
libomptarget :: nvptx64-nvidia-cuda :: unified_shared_memory/shared_update.c
libomptarget :: nvptx64-nvidia-cuda-newRTL :: offloading/bug49021.cpp
libomptarget :: nvptx64-nvidia-cuda-newRTL :: offloading/bug49334.cpp
libomptarget :: nvptx64-nvidia-cuda-newRTL :: offloading/bug51781.c
libomptarget :: nvptx64-nvidia-cuda-newRTL :: unified_shared_memory/close_enter_exit.c
libomptarget :: nvptx64-nvidia-cuda-newRTL :: unified_shared_memory/close_modifier.c
libomptarget :: nvptx64-nvidia-cuda-newRTL :: unified_shared_memory/shared_update.c

On amdgcn, these are the tests failing,

libomptarget :: amdgcn-amd-amdhsa :: offloading/bug49021.cpp
libomptarget :: amdgcn-amd-amdhsa :: offloading/bug51781.c
libomptarget :: amdgcn-amd-amdhsa :: offloading/bug51982.c
libomptarget :: amdgcn-amd-amdhsa-newRTL :: offloading/bug49021.cpp
libomptarget :: amdgcn-amd-amdhsa-newRTL :: offloading/bug51781.c

I added https://github.com/llvm/llvm-project/issues/54654 documenting what I found when testing this patch on amdgpu.

@ggeorgakoudis Can you please rebase this patch on top of main? Thanks.

Herald added a project: Restricted Project. · View Herald TranscriptMar 30 2022, 12:05 PM

As discussed in https://github.com/llvm/llvm-project/issues/54654, this needs to be added for SPMDization with this patch. Not sure whether further handling is required.

diff --git a/llvm/lib/Transforms/IPO/OpenMPOpt.cpp b/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
index c4736521e475..23cfa6fe5e27 100644

  • a/llvm/lib/Transforms/IPO/OpenMPOpt.cpp

+++ b/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
@@ -4260,6 +4260,7 @@ struct AAKernelInfoCallSite : AAKernelInfo {

case OMPRTL___kmpc_nvptx_parallel_reduce_nowait_v2:
case OMPRTL___kmpc_nvptx_teams_reduce_nowait_v2:
case OMPRTL___kmpc_nvptx_end_reduce_nowait:

+ case OMPRTL___kmpc_alloc_aggregate_arg:

  break;
case OMPRTL___kmpc_distribute_static_init_4:
case OMPRTL___kmpc_distribute_static_init_4u:

I added https://github.com/llvm/llvm-project/issues/54654 documenting what I found when testing this patch on amdgpu.

@ggeorgakoudis Can you please rebase this patch on top of main? Thanks.

Hey @dhruvachak. Unfortunately I can't find time lately to work on this patch. Would you like to take over?

I added https://github.com/llvm/llvm-project/issues/54654 documenting what I found when testing this patch on amdgpu.

@ggeorgakoudis Can you please rebase this patch on top of main? Thanks.

Hey @dhruvachak. Unfortunately I can't find time lately to work on this patch. Would you like to take over?

@ggeorgakoudis I rebased the sources on top of main and resolved conflicts in my local workspace. I haven't updated the clang/llvm tests. There are tests that fail on amdgpu that I am investigating. One example is https://github.com/llvm/llvm-project/issues/54654.

dhruvachak added inline comments.Apr 8 2022, 12:35 PM
llvm/include/llvm/Frontend/OpenMP/OMPKinds.def
930

NoCapture attributes for the parameters need to be removed. See https://github.com/llvm/llvm-project/issues/54654

reverse ping. Are there outstanding issues with this?

I rebased and resolved conflicts just now and got the compiler built. I did not update the tests, hence not updating this review. I see the following outstanding issues:

(1) make check-libomptarget produces a bunch of failures with the following compile-time assertion. So my rebased patch is not interacting correctly with opaque pointers. It is the same assertion for all the failures.
llvm-project/llvm/include/llvm/IR/Type.h:384: llvm::Type* llvm::Type::getNonOpaquePointerElementType() const: Assertion `NumContainedTys && "Attempting to get element type of opaque pointer"' failed.

(2) From earlier investigation a couple of months back, this patch uses device alloc and will fail if device allocation is not implemented (e.g. in main branch of amdgpu). Most of these failures are seen at -O0, OpenMPOpt is able to optimize them away at higher opt levels. Are we ok with these failures at -O0?

(3) There were a few issues found regarding SPDMization, NoCaptureAttrs, alignment that should be applied to this patch. I have those changes on a local branch.

Also, make sure to remove all deviceRTL files and probably reset the autogenerated tests to upstream (and re-generate) before you merge (or reupload).

I rebased and resolved conflicts just now and got the compiler built. I did not update the tests, hence not updating this review. I see the following outstanding issues:

(1) make check-libomptarget produces a bunch of failures with the following compile-time assertion. So my rebased patch is not interacting correctly with opaque pointers. It is the same assertion for all the failures.
llvm-project/llvm/include/llvm/IR/Type.h:384: llvm::Type* llvm::Type::getNonOpaquePointerElementType() const: Assertion `NumContainedTys && "Attempting to get element type of opaque pointer"' failed.

See my comment below. I think that's the issue.

(2) From earlier investigation a couple of months back, this patch uses device alloc and will fail if device allocation is not implemented (e.g. in main branch of amdgpu). Most of these failures are seen at -O0, OpenMPOpt is able to optimize them away at higher opt levels. Are we ok with these failures at -O0?

It used __kmpc_alloc_shared, which should in theory work with O0 (also for AMDGPU) but in practice might not, especially if it has to fallback to malloc. We are working on malloc support right now. This should not stop us. No reasonable code runs with O0 (on AMDGPU) right now.

(3) There were a few issues found regarding SPDMization, NoCaptureAttrs, alignment that should be applied to this patch. I have those changes on a local branch.

Apply them, I can look over everything again.

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1553

This doesn't work anymore with opaque pointers, IIRC. We should remember the type and pass to this place.

Also, make sure to remove all deviceRTL files and probably reset the autogenerated tests to upstream (and re-generate) before you merge (or reupload).

I rebased and resolved conflicts just now and got the compiler built. I did not update the tests, hence not updating this review. I see the following outstanding issues:

(1) make check-libomptarget produces a bunch of failures with the following compile-time assertion. So my rebased patch is not interacting correctly with opaque pointers. It is the same assertion for all the failures.
llvm-project/llvm/include/llvm/IR/Type.h:384: llvm::Type* llvm::Type::getNonOpaquePointerElementType() const: Assertion `NumContainedTys && "Attempting to get element type of opaque pointer"' failed.

See my comment below. I think that's the issue.

(2) From earlier investigation a couple of months back, this patch uses device alloc and will fail if device allocation is not implemented (e.g. in main branch of amdgpu). Most of these failures are seen at -O0, OpenMPOpt is able to optimize them away at higher opt levels. Are we ok with these failures at -O0?

It used __kmpc_alloc_shared, which should in theory work with O0 (also for AMDGPU) but in practice might not, especially if it has to fallback to malloc. We are working on malloc support right now. This should not stop us. No reasonable code runs with O0 (on AMDGPU) right now.

(3) There were a few issues found regarding SPDMization, NoCaptureAttrs, alignment that should be applied to this patch. I have those changes on a local branch.

Apply them, I can look over everything again.

Yes, I will apply the changes, refresh the tests, and re-upload.

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1553

Thanks. Changing this fixed the assertions.

dhruvachak added inline comments.Jul 8 2022, 11:31 AM
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
3591

This should be VoidTy now that GlobalArgs type has changed.

Is there an llvm/utils script to update clang tests that have RUN lines at the top? An example is clang/test/OpenMP/debug_threadprivate_copyin.c.

Is there an llvm/utils script to update clang tests that have RUN lines at the top? An example is clang/test/OpenMP/debug_threadprivate_copyin.c.

You can create the run lines with the llvm/utils/update_cc_test_checks.py script but those tests have manual lines for now.
I usually run llvm/utils/update_cc_test_checks.py -u clang/test/OpenMP/*.{c,cpp} to update all autogenerated tests.

Is there an llvm/utils script to update clang tests that have RUN lines at the top? An example is clang/test/OpenMP/debug_threadprivate_copyin.c.

You can create the run lines with the llvm/utils/update_cc_test_checks.py script but those tests have manual lines for now.
I usually run llvm/utils/update_cc_test_checks.py -u clang/test/OpenMP/*.{c,cpp} to update all autogenerated tests.

Okay, I will convert those few manual OpenMP tests to autogen format.

How about the AST ones? Do they have to be manually updated? Example: clang/test/AST/ast-dump-openmp-distribute-parallel-for-simd.c

jdoerfert added a comment.EditedJul 8 2022, 1:19 PM

>>! In D102107#3639615, @dhruvachak wrote:

Is there an llvm/utils script to update clang tests that have RUN lines at the top? An example is clang/test/OpenMP/debug_threadprivate_copyin.c.

You can create the run lines with the llvm/utils/update_cc_test_checks.py script but those tests have manual lines for now.
I usually run llvm/utils/update_cc_test_checks.py -u clang/test/OpenMP/*.{c,cpp} to update all autogenerated tests.

Okay, I will convert those few manual OpenMP tests to autogen format.

How about the AST ones? Do they have to be manually updated? Example: clang/test/AST/ast-dump-openmp-distribute-parallel-for-simd.c

For these ones I have a script locally (attached) that need some manual doing but it helps:

1. run the ast dump and store the result (same as RUN line), e.g.,
  {F23722650} clang -cc1 -internal-isystem /data/build/llvm-project/lib/clang/13.0.0/include -nostdsysteminc -triple x86_64-unknown-unknown -fopenmp -verify -ast-dump /data/src/llvm-project/clang/test/AST/ast-dump-openmp-begin-declare-variant_template_4.cpp &> /tmp/ast
2. python3 ast_dump_2_check.py /tmp/ast CHECK
3. replace the check lines with the content of /tmp/ast.check

>>! In D102107#3639615, @dhruvachak wrote:

Is there an llvm/utils script to update clang tests that have RUN lines at the top? An example is clang/test/OpenMP/debug_threadprivate_copyin.c.

You can create the run lines with the llvm/utils/update_cc_test_checks.py script but those tests have manual lines for now.
I usually run llvm/utils/update_cc_test_checks.py -u clang/test/OpenMP/*.{c,cpp} to update all autogenerated tests.

Okay, I will convert those few manual OpenMP tests to autogen format.

How about the AST ones? Do they have to be manually updated? Example: clang/test/AST/ast-dump-openmp-distribute-parallel-for-simd.c

For these ones I have a script locally (attached) that need some manual doing but it helps:

1. run the ast dump and store the result (same as RUN line), e.g.,
  {F23722650} clang -cc1 -internal-isystem /data/build/llvm-project/lib/clang/13.0.0/include -nostdsysteminc -triple x86_64-unknown-unknown -fopenmp -verify -ast-dump /data/src/llvm-project/clang/test/AST/ast-dump-openmp-begin-declare-variant_template_4.cpp &> /tmp/ast
2. python3 ast_dump_2_check.py /tmp/ast CHECK
3. replace the check lines with the content of /tmp/ast.check

Thanks. I followed the above steps and regenerated a couple of the AST tests but they still fail. Perhaps I am missing some options?

I currently have a handful of clang test failures where regen did not work. I am going to update the patch, post the current test results, and we can figure out how to regen the rest before we land this patch.

Thanks. I followed the above steps and regenerated a couple of the AST tests but they still fail. Perhaps I am missing some options?

I currently have a handful of clang test failures where regen did not work. I am going to update the patch, post the current test results, and we can figure out how to regen the rest before we land this patch.

So, generate check lines for new tests in a separate patch first.
For the AST ones, you need to take the run line of the test, not what I posted there. If it doesn't work, one needs to check why. Hard to diagnose and I don't remember if there is something else. Maybe you need to only include part of it?

jhuber6 added a comment.EditedJul 8 2022, 5:48 PM

Thanks. I followed the above steps and regenerated a couple of the AST tests but they still fail. Perhaps I am missing some options?

I currently have a handful of clang test failures where regen did not work. I am going to update the patch, post the current test results, and we can figure out how to regen the rest before we land this patch.

Sometimes if update_cc_test_check.py -u ${test} doesn't work you either just need to run it twice so the line numbers get updated on the kernel functions, or you can try taking the command line directly from the top of the file and running it again with that instead of -u. A few options aren't handled properly via the update with -u and need to be run again completely.

dhruvachak updated this revision to Diff 443399.Jul 8 2022, 5:53 PM

Fixed opaque pointer miscompile.
Added alloc_aggregate_arg entry point to OpenMPOpt SPMD list.
Fixed nocapture attribute of kmpc_alloc_aggregate_arg,
Added align attribute for call to
kmpc_alloc_shared.
Updated (most) failing clang tests.

Thanks. I followed the above steps and regenerated a couple of the AST tests but they still fail. Perhaps I am missing some options?

I currently have a handful of clang test failures where regen did not work. I am going to update the patch, post the current test results, and we can figure out how to regen the rest before we land this patch.

So, generate check lines for new tests in a separate patch first.

Not sure what you mean by new tests. make check-clang has a few failures on existing tests. I think all of them are regen issues. I will post the results.

For the AST ones, you need to take the run line of the test, not what I posted there. If it doesn't work, one needs to check why. Hard to diagnose and I don't remember if there is something else. Maybe you need to only include part of it?

Yes, I took the run line of the test. The regen worked OK, I removed the old CHECK lines and added the new ones. But make check-clang still flags it as a failure. As you said, we need to understand why.

dhruvachak added inline comments.Jul 8 2022, 6:02 PM
llvm/lib/Transforms/IPO/OpenMPOpt.cpp
4266

@jdoerfert Is this enough to enable SPMDization or is further handling required?

dhruvachak added inline comments.Jul 8 2022, 6:04 PM
llvm/lib/Transforms/IPO/OpenMPOpt.cpp
4266

Just to be clear, this change does allow SPMDization now but want to make sure nothing else is missing.

Results from "make check-clang":

Failed Tests (14):

Clang :: AST/ast-dump-openmp-distribute-parallel-for-simd.c
Clang :: AST/ast-dump-openmp-distribute-parallel-for.c
Clang :: AST/ast-dump-openmp-target-teams-distribute-parallel-for-simd.c
Clang :: AST/ast-dump-openmp-target-teams-distribute-parallel-for.c
Clang :: AST/ast-dump-openmp-teams-distribute-parallel-for-simd.c
Clang :: AST/ast-dump-openmp-teams-distribute-parallel-for.c
Clang :: CodeGenCXX/observe-noexcept.cpp
Clang :: OpenMP/declare_variant_construct_codegen_1.c
Clang :: OpenMP/nvptx_lambda_pointer_capturing.cpp
Clang :: OpenMP/remarks_parallel_in_multiple_target_state_machines.c
Clang :: OpenMP/remarks_parallel_in_target_state_machine.c
Clang :: OpenMP/target_in_reduction_codegen.cpp
Clang :: SemaCXX/static-assert.cpp
Clang :: utils/update_cc_test_checks/generated-funcs.test

Testing Time: 29.33s

Skipped          :     4
Unsupported      :  1478
Passed           : 29406
Expectedly Failed:    27
Failed           :    14

Need to check the following again.

clang/test/AST/ast-dump-openmp-distribute-parallel-for.c was regenerated and part of the patch but the test still fails. The other regenerated AST tests are not part of this patch, they seem to fail even after regen.

Need to regen CodeGenCXX, SemaCXX, and utils tests (3 total).

I tried converting the OpenMP manual CHECK tests to the autogen format. Some of them still fail as above, don't know why.

Need to know how to regen the OpenMP remarks tests.

make check-openmp passes on amdgpu. Need to check on nvptx.

Testing Time: 39.95s

Unsupported      : 143
Passed           : 563
Expectedly Failed:  14

[100%] Built target check-openmp
[100%] Built target check-openmp

It seems the buildbot didn't actually test this patch but an old one, still:

The checks for this tests are not updated:
target_teams_distribute_parallel_for_order_codegen.cpp
target_in_reduction_codegen.cpp
nvptx_lambda_capturing.cpp
nvptx_lambda_pointer_capturing.cpp

Similar to clang/test/OpenMP/declare_variant_construct_codegen_1.c, we should manually update the few fork calls in clang/test/OpenMP/declare_variant_construct_codegen_1.c.

Can you share the output of the AST dump tests and the new check lines, so what run produces and the file we give to Filechec to verify it.

clang/test/OpenMP/declare_variant_construct_codegen_1.c
1052

Something went wrong here. Might be easier to manually change the kmpc_forc_call line (should not be much more)

Can you share the output of the AST dump tests and the new check lines, so what run produces and the file we give to Filechec to verify it.

I looked at the AST test output and the CHECK lines more carefully. Turns out the full path was embedded in some of the CHECK lines causing the failures. I corrected those manually and those AST tests now pass. I will move on to the other failures.

Regenerated clang tests, make check-clang passes

Rebased on top of a recent commit. Both check-clang and check-openmp (on amdgpu) pass.

Testing Time: 30.73s

Skipped          :     4
Unsupported      :  1480
Passed           : 29554
Expectedly Failed:    27

[100%] Built target check-clang

On amdgpu:

Testing Time: 42.65s

Unsupported      : 145
Passed           : 570
Expectedly Failed:  14

[100%] Built target check-openmp
[100%] Built target check-openmp

@jdoerfert With this patch, additional remarks are being generated. Please check whether the new OMP121 remarks in the following tests are OK.

Clang :: OpenMP/remarks_parallel_in_multiple_target_state_machines.c
Clang :: OpenMP/remarks_parallel_in_target_state_machine.c

All changes from my end are in. Please review.

@jdoerfert With this patch, additional remarks are being generated. Please check whether the new OMP121 remarks in the following tests are OK.

Clang :: OpenMP/remarks_parallel_in_multiple_target_state_machines.c
Clang :: OpenMP/remarks_parallel_in_target_state_machine.c

Can you send me the device IR generated for these (-save-temps). I need to check what's happening and building the patch myself will take a while.

@jdoerfert Attached are the device IR files, generated with -save-temps.

Pointing out the recent changes at the corresponding source locations.

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1570

Added align attribute for call to __kmpc_alloc_shared.

clang/lib/CodeGen/CGOpenMPRuntimeGPU.h
265

CapturedVarsElemTypes added to handle opaque pointers.

clang/lib/CodeGen/CodeGenFunction.h
3346

CapturedVarsElemTypes introduced to handle opaque pointers.

llvm/include/llvm/Frontend/OpenMP/OMPKinds.def
929

Fixed attribute of __kmpc_alloc_aggregate_arg,

llvm/lib/Transforms/IPO/OpenMPOpt.cpp
4266

Added alloc_aggregate_arg entry point to OpenMPOpt SPMD list.