
[OpenMP] Codegen aggregate for outlined function captures
Accepted · Public

Authored by ggeorgakoudis on May 8 2021, 8:16 AM.

Details

Summary

Parallel regions are outlined as functions whose captured variables are explicitly generated as distinct parameters in the function's argument list. That complicates the fork_call interface in the OpenMP runtime: (1) fork_call is variadic, since a variable number of arguments must be forwarded to the outlined function; (2) wrapping/unwrapping arguments happens in the OpenMP runtime, which is sub-optimal, has been a source of ABI bugs, and imposes a hardcoded limit (16) on the number of arguments; (3) forwarded arguments must be cast to pointer types, which complicates debugging. This patch avoids those issues by aggregating the captured arguments in a struct that is passed to fork_call.

Diff Detail

Unit Tests: Failed

Time | Test
200 ms | x64 debian > LLVM.Transforms/OpenMP::spmdization.ll
Script: -- : 'RUN: at line 2'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt --mtriple=amdgcn-amd-amdhsa --data-layout=A5 -S -passes=openmp-opt < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/Transforms/OpenMP/spmdization.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/llvm/test/Transforms/OpenMP/spmdization.ll --check-prefixes=AMDGPU

Event Timeline

There are a very large number of changes, so older changes are hidden.
ggeorgakoudis reopened this revision. Jul 27 2021, 11:04 AM
This revision is now accepted and ready to land. Jul 27 2021, 11:04 AM
jhuber6 added inline comments. Jul 27 2021, 11:05 AM
clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1597

This needs to include the size of the accompanying push after D106496.

Rebase, update tests

Add previous and upper bound to the aggregate for combined distribute directives.
Update tests.

Remove unnecessary comments.

ggeorgakoudis requested review of this revision. Sep 16 2021, 4:57 PM
jhuber6 accepted this revision. Sep 20 2021, 5:49 AM

LGTM, this passed the usual tests and I didn't see any leftover allocations.

This revision is now accepted and ready to land. Sep 20 2021, 5:49 AM

Fix for clang-tidy

Rebase and update tests

This revision was landed with ongoing or failed builds. Sep 21 2021, 10:51 AM
This revision was automatically updated to reflect the committed changes.

Seeing buildbot failures after this patch landed: https://lab.llvm.org/staging/#/builders/183/builds/1598

ggeorgakoudis added a comment. Edited Sep 21 2021, 11:46 AM

Seeing buildbot failures after this patch landed: https://lab.llvm.org/staging/#/builders/183/builds/1598

Looking at it, @ronlieb, thanks for reporting.

Seeing buildbot failures after this patch landed: https://lab.llvm.org/staging/#/builders/183/builds/1598

This looks like another AMDGPU issue; the code in question doesn't do anything AMDGPU-specific.
@ronlieb @JonChesterfield How do we debug this?

Please revert the patch so our buildbot can return to green, and we (me or Jon) can look into it with urgency today, as it should be reproducible.

@ronlieb can you apply this to amd-stg-open? If it breaks there we have a chance of trying a debugger on it. @dpalermo might be available again now.

@jdoerfert I debug stuff like this by inspection, guesswork and a DIY printf implementation that is itself not totally robust. Very occasionally the thing can be isolated as a unit test. If we're lucky a debug llvm + debug rocr build will be more verbose about what is going wrong.

Please revert the patch so our buildbot can return to green, and we (me or Jon) can look into it with urgency today, as it should be reproducible.

Sounds good. @ggeorgakoudis let's revert and wait for input.

@pdhaliwal
I will pass the problem over to Pushpinder Singh, who should be waking up soon.

George, thank you for reverting it. I can reproduce the issue on a local system.
Building the latest (revert present) passes; reverting the revert, it fails.

One test that fails, run from the build directory:

export LOC=$(pwd); cd $LOC/runtimes/runtimes-bins/openmp && /usr/bin/python3.8 $LOC/./bin/llvm-lit -vv --show-unsupported --show-xfail -j 32 $LOC/runtimes/runtimes-bins/openmp/libomptarget/test/amdgcn-amd-amdhsa/mapping/declare_mapper_target_data.cpp

command stderr:

[GPU Memory Error] Addr: 0x0 Reason: Page not present or supervisor privilege.
Memory access fault by GPU node-2 (Agent handle: 0x18ae1d0) on address (nil). Reason: Page not present or supervisor privilege.

From the IR diff, it looks like this patch adds a use of the kmpc_alloc_shared method. These methods likely won't work on AMDGPU, as device malloc is not available. Not sure what could be done apart from marking those tests as XFAIL on amdgcn. :(

From the IR diff, it looks like this patch adds a use of the kmpc_alloc_shared method. These methods likely won't work on AMDGPU, as device malloc is not available. Not sure what could be done apart from marking those tests as XFAIL on amdgcn. :(

That's a good theory. We could confirm it by patching the amdgpu malloc to return 0xdeadbeef or similar instead of 0 and seeing if that number shows up in the invalid memory access error. If so, there are two problems:
1/ malloc on the gpu can fail, so we're missing a check on the return code of malloc in the devicertl
2/ increased importance of getting malloc running on amdgpu
The openmp in rocm/aomp does have a malloc, so it would also be interesting to see if it runs OK with this patch applied.

I got this after changing __kmpc_impl_malloc to return 0xdeadbeef, so this confirms that the missing malloc implementation is the root cause.

Memory access fault by GPU node-4 (Agent handle: 0x1bc5000) on address 0xdeadb000. Reason: Page not present or supervisor privilege.

I got this after changing __kmpc_impl_malloc to return 0xdeadbeef, so this confirms that the missing malloc implementation is the root cause.

Memory access fault by GPU node-4 (Agent handle: 0x1bc5000) on address 0xdeadb000. Reason: Page not present or supervisor privilege.

Nice! In that case, I think the way to go is to audit the (probably few) places where kmpc_impl_malloc is called and add a check for whether the return value is 0. With that in place we can reland this and get a more graceful failure (at a guess, we should fall back to the host when gpu memory is exhausted, or maybe just print an 'out of gpu heap memory' style message and abort; I don't know).

I got this after changing __kmpc_impl_malloc to return 0xdeadbeef, so this confirms that the missing malloc implementation is the root cause.

Memory access fault by GPU node-4 (Agent handle: 0x1bc5000) on address 0xdeadb000. Reason: Page not present or supervisor privilege.

Nice! In that case, I think the way to go is to audit the (probably few) places where kmpc_impl_malloc is called and add a check for whether the return value is 0. With that in place we can reland this and get a more graceful failure (at a guess, we should fall back to the host when gpu memory is exhausted, or maybe just print an 'out of gpu heap memory' style message and abort; I don't know).

We should only fail to remove the kmpc_shared_alloc at O0. Since we need kmpc_shared_alloc for all non-trivial codes, they would always fail on AMDGPU. That said, why is the shared memory stack not catching this? It's a 64-byte stack for the main thread, and we are looking at a 24-byte allocation for declare_mapper_target.cpp.
Can you determine why the first two conditionals in __kmpc_alloc_shared don't catch this and return proper memory?

ggeorgakoudis reopened this revision. Sep 28 2021, 10:01 AM
This revision is now accepted and ready to land. Sep 28 2021, 10:01 AM

Update memory allocation for the aggregate argument.
Introduce a runtime interface that allocates from local memory
when in SPMD mode, or from the heap when in generic mode.

Herald added a project: Restricted Project. · View Herald Transcript. Sep 28 2021, 10:25 AM

@pdhaliwal @JonChesterfield @ronlieb I updated the aggregate argument memory allocation to use an alloca instead of malloc'ing in SPMD mode, which should resolve your issue. Could someone please test the updated patch and give me feedback before landing?

[AMD Official Use Only]

Hi George,
I will do it now ...

I backed up to your reverted patch and applied this one.
I see some new errors:

libomptarget :: amdgcn-amd-amdhsa :: mapping/declare_mapper_target.cpp
libomptarget :: amdgcn-amd-amdhsa :: mapping/declare_mapper_target_data.cpp
libomptarget :: amdgcn-amd-amdhsa :: mapping/declare_mapper_target_data_enter_exit.cpp
libomptarget :: amdgcn-amd-amdhsa :: mapping/declare_mapper_target_update.cpp
libomptarget :: amdgcn-amd-amdhsa :: offloading/parallel_offloading_map.cpp
libomptarget :: amdgcn-amd-amdhsa :: offloading/taskloop_offload_nowait.cpp

Don't know if Jon is around, so I will ask Singh @pdhaliwal if he can dig into it a bit more.

I backed up to your reverted patch and applied this one.
I see some new errors:

libomptarget :: amdgcn-amd-amdhsa :: mapping/declare_mapper_target.cpp
libomptarget :: amdgcn-amd-amdhsa :: mapping/declare_mapper_target_data.cpp
libomptarget :: amdgcn-amd-amdhsa :: mapping/declare_mapper_target_data_enter_exit.cpp
libomptarget :: amdgcn-amd-amdhsa :: mapping/declare_mapper_target_update.cpp
libomptarget :: amdgcn-amd-amdhsa :: offloading/parallel_offloading_map.cpp
libomptarget :: amdgcn-amd-amdhsa :: offloading/taskloop_offload_nowait.cpp

Don't know if Jon is around, so I will ask Singh @pdhaliwal if he can dig into it a bit more.

Hmm, thanks @ronlieb. @pdhaliwal, please let me know what fails.

Apologies for the late reply. Most of the tests now do not try to call malloc, so there are no page fault errors, but all of them produce wrong results. E.g., declare_mapper_target.cpp produces Sum = 132608 with the patch applied, and similarly for the other tests. So I don't know what's happening yet.

Apologies for the late reply. Most of the tests now do not try to call malloc, so there are no page fault errors, but all of them produce wrong results. E.g., declare_mapper_target.cpp produces Sum = 132608 with the patch applied, and similarly for the other tests. So I don't know what's happening yet.

Thanks @pdhaliwal. I get the right result on nvidia. Please let me know when you get to the bottom of it.

I modified declare_mapper_target to print the contents of the array after the target region and found the following output:

2 3 4 5 6 7 8 9 10 11 Sum = 65

Program:

#include <cstdio>
#include <cstdlib>

#define NUM 10

int main() {
  int *c= new int[NUM];
  for (int i = 0; i < NUM; i++) {
    c[i] = 1;
  }
#pragma omp target teams distribute  parallel for map(tofrom: c[0:NUM])
  for (int i = 0; i < NUM; i++) {
    c[i]++;
  }
  int sum = 0;
  for (int i = 0; i < NUM; i++) {
    sum += c[i];
    printf("%d ", c[i]);
  }
  // CHECK: Sum = 2048
  printf("Sum = %d\n", sum);
  return 0;
}

A different variant of the same program produces correct output:

#include <cstdio>
#include <cstdlib>

#define NUM 10

int main() {
  int *c= new int[NUM];
  for (int i = 0; i < NUM; i++) {
    c[i] = 1;
  }

  int *b = new int[NUM];
#pragma omp target teams distribute  parallel for map(tofrom: c[0:NUM], b[0:NUM])
  for (int i = 0; i < NUM; i++) {
    b[i] = c[i] + 1;
  }
  int sum = 0;
  for (int i = 0; i < NUM; i++) {
    sum += b[i];
    printf("%d ", b[i]);
  }
  // CHECK: Sum = 2048
  printf("Sum = %d\n", sum);
  return 0;
}

Output (this is the right answer):

2 2 2 2 2 2 2 2 2 2 Sum = 20

On the internal amd-stg-open branch this patch works fine, so the issue is only with trunk.
I compared the generated IR before and after applying this patch and didn't see anything suspicious (but I can't be 100% sure).

I have created a patch (D111218) with a fix for amdgcn. This is a temporary fix; I will keep looking into it until I find the real root cause.

Update the interface for allocating/sharing the struct aggregate.
Simplify invoking tasks.

ggeorgakoudis requested review of this revision. Nov 9 2021, 7:44 AM

TODO update tests

Update tests.
Fix attributes for kmpc_alloc_aggregate_arg.
Do not emit allocations if there are no arguments in the aggregate.

ABataev accepted this revision. Dec 1 2021, 9:26 AM

LG with a nit

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1604

Enclose in braces too

This revision is now accepted and ready to land. Dec 1 2021, 9:26 AM

Rebase, address comment, update a few tests

ggeorgakoudis marked 2 inline comments as done. Dec 1 2021, 11:36 AM
JonChesterfield added a comment. Edited Dec 1 2021, 11:42 AM

This works approximately as well as trunk does for me, provided D114865 is also applied. My baseline is not totally solid but I think there's a credible chance this would pass the buildbot, provided D114865 went in first.

Ron reports two new failures with this applied:
libomptarget :: amdgcn-amd-amdhsa :: offloading/bug51781.c
libomptarget :: amdgcn-amd-amdhsa :: offloading/bug51982.c

My local sm_75 box with this patch applied (and otherwise a clean build) claims failures in

libomptarget :: nvptx64-nvidia-cuda :: offloading/bug49334.cpp
libomptarget :: nvptx64-nvidia-cuda :: offloading/bug51781.c
libomptarget :: nvptx64-nvidia-cuda-newRTL :: offloading/bug49021.cpp
libomptarget :: nvptx64-nvidia-cuda-newRTL :: offloading/bug49334.cpp
libomptarget :: nvptx64-nvidia-cuda-newRTL :: offloading/bug51781.c
jdoerfert accepted this revision. Dec 22 2021, 2:45 PM

Can we land this? The AMD issues seem resolved.


@Singh, Pushpinder is this resolved?
You were most recently working on it.

Thx

pdhaliwal added a comment. Edited Dec 22 2021, 7:32 PM

I am seeing a lot of failures on an nvptx machine (sm_70, cuda 11.4) with this patch:

libomptarget :: nvptx64-nvidia-cuda :: offloading/bug49021.cpp
libomptarget :: nvptx64-nvidia-cuda :: offloading/bug49334.cpp
libomptarget :: nvptx64-nvidia-cuda :: offloading/bug49779.cpp
libomptarget :: nvptx64-nvidia-cuda :: offloading/bug51781.c
libomptarget :: nvptx64-nvidia-cuda :: offloading/bug51982.c
libomptarget :: nvptx64-nvidia-cuda :: unified_shared_memory/close_enter_exit.c
libomptarget :: nvptx64-nvidia-cuda :: unified_shared_memory/close_modifier.c
libomptarget :: nvptx64-nvidia-cuda :: unified_shared_memory/shared_update.c
libomptarget :: nvptx64-nvidia-cuda-newRTL :: offloading/bug49021.cpp
libomptarget :: nvptx64-nvidia-cuda-newRTL :: offloading/bug49334.cpp
libomptarget :: nvptx64-nvidia-cuda-newRTL :: offloading/bug51781.c
libomptarget :: nvptx64-nvidia-cuda-newRTL :: unified_shared_memory/close_enter_exit.c
libomptarget :: nvptx64-nvidia-cuda-newRTL :: unified_shared_memory/close_modifier.c
libomptarget :: nvptx64-nvidia-cuda-newRTL :: unified_shared_memory/shared_update.c

On amdgcn, these are the failing tests:

libomptarget :: amdgcn-amd-amdhsa :: offloading/bug49021.cpp
libomptarget :: amdgcn-amd-amdhsa :: offloading/bug51781.c
libomptarget :: amdgcn-amd-amdhsa :: offloading/bug51982.c
libomptarget :: amdgcn-amd-amdhsa-newRTL :: offloading/bug49021.cpp
libomptarget :: amdgcn-amd-amdhsa-newRTL :: offloading/bug51781.c

I added https://github.com/llvm/llvm-project/issues/54654 documenting what I found when testing this patch on amdgpu.

@ggeorgakoudis Can you please rebase this patch on top of main? Thanks.

Herald added a project: Restricted Project. · View Herald Transcript. Mar 30 2022, 12:05 PM

As discussed in https://github.com/llvm/llvm-project/issues/54654, this needs to be added for SPMDization with this patch. I am not sure whether further handling is required.

diff --git a/llvm/lib/Transforms/IPO/OpenMPOpt.cpp b/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
index c4736521e475..23cfa6fe5e27 100644
--- a/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
+++ b/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
@@ -4260,6 +4260,7 @@ struct AAKernelInfoCallSite : AAKernelInfo {
     case OMPRTL___kmpc_nvptx_parallel_reduce_nowait_v2:
     case OMPRTL___kmpc_nvptx_teams_reduce_nowait_v2:
     case OMPRTL___kmpc_nvptx_end_reduce_nowait:
+    case OMPRTL___kmpc_alloc_aggregate_arg:
       break;
     case OMPRTL___kmpc_distribute_static_init_4:
     case OMPRTL___kmpc_distribute_static_init_4u:

I added https://github.com/llvm/llvm-project/issues/54654 documenting what I found when testing this patch on amdgpu.

@ggeorgakoudis Can you please rebase this patch on top of main? Thanks.

Hey @dhruvachak. Unfortunately I can't find time lately to work on this patch. Would you like to take over?

I added https://github.com/llvm/llvm-project/issues/54654 documenting what I found when testing this patch on amdgpu.

@ggeorgakoudis Can you please rebase this patch on top of main? Thanks.

Hey @dhruvachak. Unfortunately I can't find time lately to work on this patch. Would you like to take over?

@ggeorgakoudis I rebased the sources on top of main and resolved conflicts in my local workspace. I haven't updated the clang/llvm tests. There are tests failing on amdgpu that I am investigating; one example is https://github.com/llvm/llvm-project/issues/54654.

dhruvachak added inline comments. Apr 8 2022, 12:35 PM
llvm/include/llvm/Frontend/OpenMP/OMPKinds.def
907

The NoCapture attributes for the parameters need to be removed. See https://github.com/llvm/llvm-project/issues/54654.