This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] WIP: Attempt to fix clang frontend codegen issue
Needs ReviewPublic

Authored by ivanrodriguez3753 on Aug 22 2023, 2:50 PM.

Details

Reviewers
jdoerfert
Group Reviewers
Restricted Project
Summary

It seems that the OpenMP CodeGen is incorrectly generating a pointer for a size calculation on the combined entry of a partially mapped struct. Here is the reduced test case:

scrubbed-user@scrubbed-server: cat reduced.cpp
#include <omp.h>
#include <cassert>
#include <iostream>

#define N 1000

struct T {
  int dep_1[N];
  int dep_2[N];
};

using namespace std;
int main() {
  #define SMALL 2
  T t;
  #pragma omp target map(tofrom: t.dep_1, t.dep_2[0:SMALL])
  {
    for (int i = 0; i < SMALL; i++) {
      t.dep_1[i] = 1;
      t.dep_2[i] = 1;
    }
  }

  for (int i = 0; i < SMALL; i++) {
    assert(t.dep_1[i] == 1);
    assert(t.dep_2[i] == 1);
  }
}

Originally, we were mapping t.dep_2[0:N], but I reduced to the smallest size that still breaks the runtime. We'll see why we need at least 2 in a second...
Here is some output from the runtime library crashing

scrubbed-user@scrubbed-server: /ptmp/scrubbed-user/llvm-project/build/bin/clang++ -I /ptmp/scrubbed-user/llvm-project/build/projects/openmp/runtime/src -L /ptmp/scrubbed-user/llvm-project/build/projects/openmp/libomptarget/ -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908 reduced.cpp -g
scrubbed-user@scrubbed-server: LIBOMPTARGET_DEBUG=1 ./a.out # only including relevant output, run yourself for the full verbose debug messaging

PluginInterface --> Entry point 0x0000000000000000 maps to __omp_offloading_4e_6ccfb3ae_main_l16 (0x000055b886d524d8)
Libomptarget --> Entry  0: Base=0x00007ffd9cac727c, Begin=0x00007ffd9cac727c, Size=4004, Type=0x20, Name=unknown
Libomptarget --> Entry  1: Base=0x00007ffd9cac727c, Begin=0x00007ffd9cac727c, Size=4000, Type=0x1000000000003, Name=unknown
Libomptarget --> Entry  2: Base=0x00007ffd9cac727c, Begin=0x00007ffd9cac821c, Size=8, Type=0x1000000000003, Name=unknown

a.out:237581 terminated with signal 6 at PC=7f409bf30c6b SP=7ffd9cac6a00.  Backtrace:
/lib64/libc.so.6(gsignal+0x10d)[0x7f409bf30c6b]
/lib64/libc.so.6(abort+0x177)[0x7f409bf32305]
/ptmp/scrubbed-user/llvm-project/build/bin/../lib/libomptarget.so.18git(+0x7452c1)[0x7f409ca652c1]
/ptmp/scrubbed-user/llvm-project/build/bin/../lib/libomptarget.so.18git(+0x740fe6)[0x7f409ca60fe6]
/ptmp/scrubbed-user/llvm-project/build/bin/../lib/libomptarget.so.18git(__tgt_target_kernel+0xe5)[0x7f409ca60335]
./a.out(+0x3385)[0x55b8850d5385]
/lib64/libc.so.6(__libc_start_main+0xef)[0x7f409bf1b24d]
./a.out(+0x312a)[0x55b8850d512a]

If my understanding is correct, the combined entry should have a size equal to the highest pointer minus the lowest pointer (in the most ideal scenario). I'm not sure if upstream clang uses a tight or loose bounding box for the combined entry, but in any case, it's wrong. It should be either 4008 or 8000, depending on whether we are being clever or not.

Running in GDB:

scrubbed-user@scrubbed-server: gdb a.out
(gdb) r
Starting program: /cray/css/users/scrubbed-user/sandbox/test_cleanup/target_depend_lvalue_01/reduced/upstream_build/a.out 
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.31-150300.46.1.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffceb2c700 (LWP 247765)]
[New Thread 0x7ffece1ff700 (LWP 247766)]
[Thread 0x7ffece1ff700 (LWP 247766) exited]
Libomptarget message: explicit extension not allowed: host address specified is 0x00007fffffff786c (8 bytes), but device allocation maps to host at 0x00007fffffff68cc (4004 bytes)
Libomptarget error: Call to getTargetPointer returned null pointer (device failure or illegal mapping).
Libomptarget error: Call to targetDataBegin failed, abort target.
Libomptarget error: Failed to process data before launching the kernel.
Libomptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
reduced.cpp:16:3: Libomptarget fatal error 1: failure of target construct while offloading is mandatory

Thread 1 "a.out" received signal SIGABRT, Aborted.
0x00007ffff62ccc6b in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: zypper install comgr5.5.0-debuginfo-2.5.0.50500-sles153.63.x86_64 hip-runtime-amd5.5.0-debuginfo-5.5.30201.50500-sles153.63.x86_64 hsa-rocr5.5.0-debuginfo-1.8.0.50500-sles153.63.x86_64 libatomic1-debuginfo-12.2.1+git416-150000.1.7.1.x86_64 libdrm2-debuginfo-2.4.107-150400.1.8.x86_64 libdrm_amdgpu1-debuginfo-2.4.107-150400.1.8.x86_64 libefa1-debuginfo-38.1-150400.4.6.x86_64 libelf1-debuginfo-0.185-150400.5.3.1.x86_64 libfabric1-debuginfo-1.13.2-150400.1.73.x86_64 libffi7-debuginfo-3.2.1.git259-10.8.x86_64 libgcc_s1-debuginfo-12.2.1+git416-150000.1.7.1.x86_64 libibverbs1-debuginfo-38.1-150400.4.6.x86_64 libinfinipath4-debuginfo-3.3-5.3.1.x86_64 libjansson4-debuginfo-2.9-1.24.x86_64 libncurses6-debuginfo-6.1-150000.5.12.1.x86_64 libnl3-200-debuginfo-3.3.0-1.29.x86_64 libnuma1-debuginfo-2.0.14.20.g4ee5e0c-150400.1.24.x86_64 libpsm_infinipath1-debuginfo-3.3-5.3.1.x86_64 librdmacm1-debuginfo-38.1-150400.4.6.x86_64 libstdc++6-debuginfo-12.2.1+git416-150000.1.7.1.x86_64 libuuid1-debuginfo-2.37.2-150400.8.14.1.x86_64 libz1-debuginfo-1.2.11-150000.3.39.1.x86_64
(gdb) info stack
#0  0x00007ffff62ccc6b in raise () from /lib64/libc.so.6
#1  0x00007ffff62ce305 in abort () from /lib64/libc.so.6
#2  0x00007ffff6e012c1 in handleTargetOutcome (Success=false, Loc=0x55555555bc18) at /ptmp/scrubbed-user/llvm-project/openmp/libomptarget/src/omptarget.cpp:303
#3  0x00007ffff6dfcfe6 in targetKernel<AsyncInfoTy> (Loc=0x55555555bc18, DeviceId=0, NumTeams=1, ThreadLimit=0, HostPtr=0x555555559320 <.__omp_offloading_4e_6ccfb3ae_main_l16.region_id>, 
    KernelArgs=0x7fffffff67f8) at /ptmp/scrubbed-user/llvm-project/openmp/libomptarget/src/interface.cpp:308
#4  0x00007ffff6dfc335 in __tgt_target_kernel (Loc=0x55555555bc18, DeviceId=-1, NumTeams=-1, ThreadLimit=0, HostPtr=0x555555559320 <.__omp_offloading_4e_6ccfb3ae_main_l16.region_id>, 
    KernelArgs=0x7fffffff67f8) at /ptmp/scrubbed-user/llvm-project/openmp/libomptarget/src/interface.cpp:333
#5  0x000055555555a8e8 in main () at reduced.cpp:16
(gdb) s
Single stepping until exit from function raise,
which has no line number information.

a.out:245985 terminated with signal 6 at PC=7ffff62ccc6b SP=7fffffff6050.  Backtrace:
/lib64/libc.so.6(gsignal+0x10d)[0x7ffff62ccc6b]
/lib64/libc.so.6(abort+0x177)[0x7ffff62ce305]
/ptmp/scrubbed-user/llvm-project/build/bin/../lib/libomptarget.so.18git(+0x7452c1)[0x7ffff6e012c1]
/ptmp/scrubbed-user/llvm-project/build/bin/../lib/libomptarget.so.18git(+0x740fe6)[0x7ffff6dfcfe6]
/ptmp/scrubbed-user/llvm-project/build/bin/../lib/libomptarget.so.18git(__tgt_target_kernel+0xe5)[0x7ffff6dfc335]
/cray/css/users/scrubbed-user/sandbox/test_cleanup/target_depend_lvalue_01/reduced/upstream_build/a.out(+0x68e8)[0x55555555a8e8]
/lib64/libc.so.6(__libc_start_main+0xef)[0x7ffff62b724d]
/cray/css/users/scrubbed-user/sandbox/test_cleanup/target_depend_lvalue_01/reduced/upstream_build/a.out(+0x668a)[0x55555555a68a]
[Thread 0x7fffceb2c700 (LWP 247765) exited]
[Inferior 1 (process 245985) exited with code 01]

Running again except changing the combined entry size to 4008, note the process exits normally

(gdb) b /ptmp/scrubbed-user/llvm-project/openmp/libomptarget/src/interface.cpp:

malformed linespec error: unexpected end of input
(gdb) b /ptmp/scrubbed-user/llvm-project/openmp/libomptarget/src/interface.cpp:329
Breakpoint 1 at 0x7ffff6dfc2d7: file /ptmp/scrubbed-user/llvm-project/openmp/libomptarget/src/interface.cpp, line 329.
(gdb) r
Starting program: /cray/css/users/scrubbed-user/sandbox/test_cleanup/target_depend_lvalue_01/reduced/upstream_build/a.out 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffceb2c700 (LWP 261095)]
[New Thread 0x7ffece1ff700 (LWP 261096)]
[Thread 0x7ffece1ff700 (LWP 261096) exited]

Thread 1 "a.out" hit Breakpoint 1, __tgt_target_kernel (Loc=0x55555555bc18, DeviceId=-1, NumTeams=-1, ThreadLimit=0, HostPtr=0x555555559320 <.__omp_offloading_4e_6ccfb3ae_main_l16.region_id>, KernelArgs=0x7fffffff67f8) at /ptmp/scrubbed-user/llvm-project/openmp/libomptarget/src/interface.cpp:329
329       if (KernelArgs->Flags.NoWait)
(gdb) p KernelArgs->ArgSizes[0]
$1 = 4004
(gdb) set KernelArgs->ArgSizes[0]=4008
(gdb) p KernelArgs->ArgSizes[0]
$2 = 4008
(gdb) c
Continuing.
[Thread 0x7fffceb2c700 (LWP 261095) exited]
[Inferior 1 (process 259669) exited normally]

So, it looks like the frontend is generating a size incorrectly, since it works when we hack via gdb to give it the size we think it should be.

As an additional data point, Cray's compiler (which I have access to because I work here) is failing with a different but more or less equivalent error message from our OpenMP offloading runtime (CRAY_ACC_DEBUG is a user facing debug flag similar to upstream llvm's LIBOMPTARGET_DEBUG):

scrubbed-user@scrubbed-server: cc -fopenmp ../reduced.cpp
scrubbed-user@scrubbed-server: CRAY_ACC_DEBUG=2 ./a.out
ACC: Version 5.0 of HIP already initialized, runtime version 50530201
ACC: Get Device 0
ACC: Set Thread Context
ACC: Start transfer 3 items from reduced.cpp:16
ACC:       allocate 'unknown' (4004 bytes)
ACC:       member, copy to acc 't.dep_1' (4000 bytes)
ACC: libcrayacc/acc_present.c:679 CRAY_ACC_ERROR - Host region (7ffe9957034c to 7ffe99570354) overlaps present region (7ffe9956f3ac to 7ffe99570350 index 0) but is not contained for 't.dep_2[0:2]' from reduced.cpp:16
scrubbed-user@scrubbed-server: CRAY_ACC_DEBUG=3 ./a.out
ACC: __tgt_register_requires: flags = NONE
ACC: __tgt_register_lib
ACC:   NumDeviceImages=1
ACC:   Device Images:
ACC:   Image location: 0x200c52 - 0x201fd2
ACC:   Processing valid image
ACC:   NumEntries=1
ACC:   Image entries:
ACC:   __omp_offloading_4e_6ccfb3ae_main_l16
ACC:     {
ACC:         addr=0x200ac0
ACC:         size=0
ACC:         flags=0
ACC:     }
ACC:   NumHostEntries=1
ACC:   Host entries:
ACC:   __omp_offloading_4e_6ccfb3ae_main_l16
ACC:     {
ACC:         addr=0x200ac0
ACC:         size=0
ACC:         flags=0
ACC:     }
ACC: __tgt_target_kernel(device_id=-1, host_ptr=0x200ac0, arg_num=3)
ACC: __internal_tgt_target_teams(device_id=-1, host_ptr=0x200ac0, arg_num=3, num_teams=1, thread_limit=0)
ACC: Version 5.0 of HIP already initialized, runtime version 50530201
ACC: Get Device 0
ACC: Compute level 9.0
ACC: Device Name: 
ACC: Number of cus 120
ACC: Device name 
ACC: AMD GCN arch name: gfx908:sramecc+:xnack-
ACC: Max shared memory 65536
ACC: Max thread blocks per cu 8
ACC: Max concurrent kernels 8
ACC: Async table size 8
ACC: Total GPU memory 34342961152
ACC: Available GPU memory 34309406720
ACC: Set Thread Context
ACC: Establish link bewteen libcrayacc and libcraymp
ACC:   libcrayacc interface v6
ACC:    libcraymp interface v6
ACC:    loading module data
ACC: __internal_tgt_target_teams(device_id=-1, host_ptr=0x200ac0, arg_num=3, num_teams=1, thread_limit=1)
ACC:   [0] 0x7ffd3634543c base 0x7ffd3634543c begin 0x7ffd3634543c : 4004 bytes type=0x20 (TARGET_PARAM) name (unknown)
ACC:   [1] 0x7ffd3634543c base 0x7ffd3634543c begin 0x7ffd3634543c : 4000 bytes type=0x1000000000003 (TO FROM MEMBER_OF) name (t.dep_1)
ACC:   [2] 0x7ffd363463dc base 0x7ffd3634543c begin 0x7ffd363463dc : 8 bytes type=0x1000000000003 (TO FROM MEMBER_OF) name (t.dep_2[0:2])
ACC: Start transfer 3 items from reduced.cpp:16
ACC:   flags: NEED_POST_PHASE
ACC: 
ACC:   Transfer Phase
ACC:   Trans 1
ACC:       Simple transfer of 'unknown' (4004 bytes)
ACC:            host ptr 7ffd3634543c
ACC:            acc  ptr 0
ACC:            flags: ALLOCATE ACQ_PRESENT REG_PRESENT
ACC:            memory not found in present table
ACC:            allocate (4004 bytes)
ACC:              get new reusable memory, added entry
ACC:            new allocated ptr (7fb81a200000)
ACC:            add to present table index 0: host 7ffd3634543c to 7ffd363463e0, acc 7fb81a200000
ACC:            new acc ptr 7fb81a200000
ACC: 
ACC:   Trans 2
ACC:   Trans 3
ACC:   Post Transfer Phase
ACC:   Trans 1
ACC:   Trans 2
ACC:       Simple transfer of 't.dep_1' (4000 bytes)
ACC:            host ptr 7ffd3634543c
ACC:            acc  ptr 0
ACC:            flags: COPY_HOST_TO_ACC REG_PRESENT DIR_MEMBER_UPDATE
ACC:            host region 7ffd3634543c to 7ffd363463dc found in present table index 0 (ref count 1)
ACC:            copy host to acc (7ffd3634543c to 7fb81a200000)
ACC:                internal copy host to acc (host 7ffd3634543c to acc 7fb81a200000) size = 4000
ACC: 
ACC:   Trans 3
ACC:       Simple transfer of 't.dep_2[0:2]' (8 bytes)
ACC:            host ptr 7ffd363463dc
ACC:            acc  ptr 0
ACC:            flags: COPY_HOST_TO_ACC REG_PRESENT DIR_MEMBER_UPDATE
ACC: libcrayacc/acc_present.c:679 CRAY_ACC_ERROR - Host region (7ffd363463dc to 7ffd363463e4) overlaps present region (7ffd3634543c to 7ffd363463e0 index 0) but is not contained for 't.dep_2[0:2]' from reduced.cpp:16
ACC: __tgt_unregister_lib
ACC: Start executing pending destructors

The same gdb trick works using the executable generated by Cray's compiler.

Let's change the reduced test case to map t.dep_2[0:N], and compare to a working test case. This working test case is identical except it maps all of t.dep_2, with no slice.
The following are snippets from -S -emit-llvm, from the broken and working cases respectively:

broken:

define dso_local noundef i32 @main() #4 !dbg !929 {
entry:
  %retval = alloca i32, align 4
  %t = alloca %struct.T, align 4
  %.offload_baseptrs = alloca [3 x ptr], align 8
  %.offload_ptrs = alloca [3 x ptr], align 8
  %.offload_mappers = alloca [3 x ptr], align 8
  %.offload_sizes = alloca [3 x i64], align 8
  %kernel_args = alloca %struct.__tgt_kernel_arguments, align 8
  %i = alloca i32, align 4
  store i32 0, ptr %retval, align 4
  call void @llvm.dbg.declare(metadata ptr %t, metadata !930, metadata !DIExpression()), !dbg !938
  %dep_1 = getelementptr inbounds %struct.T, ptr %t, i32 0, i32 0, !dbg !939
  %dep_2 = getelementptr inbounds %struct.T, ptr %t, i32 0, i32 1, !dbg !941
  %arrayidx = getelementptr inbounds [1000 x i32], ptr %dep_2, i64 0, i64 0, !dbg !942
  %0 = getelementptr i32, ptr %arrayidx, i32 1, !dbg !943
  %1 = ptrtoint ptr %0 to i64, !dbg !943
  %2 = ptrtoint ptr %dep_1 to i64, !dbg !943
  %3 = sub i64 %1, %2, !dbg !943
  %4 = sdiv exact i64 %3, ptrtoint (ptr getelementptr (i8, ptr null, i32 1) to i64), !dbg !943
  call void @llvm.memcpy.p0.p0.i64(ptr align 8 %.offload_sizes, ptr align 8 @.offload_sizes, i64 24, i1 false)

working:

define dso_local noundef i32 @main() #4 !dbg !929 {
entry:
  %retval = alloca i32, align 4
  %t = alloca %struct.T, align 4
  %.offload_baseptrs = alloca [3 x ptr], align 8
  %.offload_ptrs = alloca [3 x ptr], align 8
  %.offload_mappers = alloca [3 x ptr], align 8
  %.offload_sizes = alloca [3 x i64], align 8
  %kernel_args = alloca %struct.__tgt_kernel_arguments, align 8
  %i = alloca i32, align 4
  store i32 0, ptr %retval, align 4
  call void @llvm.dbg.declare(metadata ptr %t, metadata !930, metadata !DIExpression()), !dbg !938
  %dep_1 = getelementptr inbounds %struct.T, ptr %t, i32 0, i32 0, !dbg !939
  %dep_2 = getelementptr inbounds %struct.T, ptr %t, i32 0, i32 1, !dbg !941
  %0 = getelementptr [1000 x i32], ptr %dep_2, i32 1, !dbg !942
  %1 = ptrtoint ptr %0 to i64, !dbg !942
  %2 = ptrtoint ptr %dep_1 to i64, !dbg !942
  %3 = sub i64 %1, %2, !dbg !942
  %4 = sdiv exact i64 %3, ptrtoint (ptr getelementptr (i8, ptr null, i32 1) to i64), !dbg !942
  call void @llvm.memcpy.p0.p0.i64(ptr align 8 %.offload_sizes, ptr align 8 @.offload_sizes, i64 24, i1 false)

It's a little subtle but the key is in the high pointer used for the pointer difference. The type of getelementptr used for the broken test case is an i32, while in the working test case it is an [1000 x i32]. In the context of our test case, this explains the 4004 byte size (as opposed to 4008 or 8000, again depending on whether or not we're being clever with our bounding box).

In CGOpenMPRuntime.cpp, both test cases go through

} else {
  LowestElem = LB =
      CGF.EmitOMPSharedLValue(I->getAssociatedExpression())
          .getAddress(CGF);
}

in generateInfoForComponentList. EmitOMPSharedLValue seems like it'll handle an arbitrarily long list of components like a.b.c.ptr->whatever, but it will return the last component it generated. In our case, it is the array slice.

LowestElem is later copied over to HighestElem, and PartialStruct is updated. It really seems like StructRangeInfoTy is only meant to hold DIRECT struct members, because the high pointer is emitted with a hardcoded GEP instruction of offset 1, CreateConstGEP1_32, in emitCombinedEntry:

// Size is (addr of {highest+1} element) - (addr of lowest element)
llvm::Value *HB = HBAddr.getPointer();
llvm::Value *HAddr = CGF.Builder.CreateConstGEP1_32(
    HBAddr.getElementType(), HB, /*Idx0=*/1);
llvm::Value *CLAddr = CGF.Builder.CreatePointerCast(LB, CGF.VoidPtrTy);
llvm::Value *CHAddr = CGF.Builder.CreatePointerCast(HAddr, CGF.VoidPtrTy);
llvm::Value *Diff = CGF.Builder.CreatePtrDiff(CGF.Int8Ty, CHAddr, CLAddr);
llvm::Value *Size = CGF.Builder.CreateIntCast(Diff, CGF.Int64Ty,
                                              /*isSigned=*/false);
CombinedInfo.Sizes.push_back(Size);

This PR addresses that be going backwards in the component list until we get to the second to last component (as in a direct member of the struct in question). It fixes the broken test case but breaks quite a few tests. Here's check-clang-openmp before and after this PR:

Unsupported:   12
Passed     : 1354
Unsupported:   12
Passed     : 1334
Failed     :   20

I'm very unexperienced with clang's frontend codegen and was hoping for some pointers, as well as opinions about the broken test case. At the very least, this serves as a bug report. Any misunderstandings on my part, or missing context?

Some question:

  1. Can someone confirm or deny that PartialStruct is supposed to only hold direct members?
  2. Can the while loop rely on the GEP instruction dyn_cast?
  3. If this idea of a solution is appropriate, should it be implemented as I did, or when the pointer is created, as where I left the comment, or a change to PartialStruct to keep track of or differentiate between direct members and transitive members? Maybe we could instead use the HighestElem plus its offset, instead of the CreateConstGEP1_32?

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptAug 22 2023, 2:50 PM
ivanrodriguez3753 requested review of this revision.Aug 22 2023, 2:50 PM
Herald added a project: Restricted Project. · View Herald Transcript
cchen added a subscriber: cchen.Aug 22 2023, 2:55 PM
  1. Always provide full context in the patch.
  2. It looks like we're counting the pointer size (or the size of just the first element of the array), because we do not account array section here, either just the pointer or the first element only (most probably second option). Need to teach the compiler about array section here.
  1. Always provide full context in the patch.

Sure, would you mind mentioning what's missing? Or do you mean the stuff from debug output being shortened, as well as LLVM IR being only snippets?

Also, should the underlying issue and test case be filed as an issue on Github? I wasn't sure since this revision includes the bug and a description

  1. Always provide full context in the patch.

Sure, would you mind mentioning what's missing? Or do you mean the stuff from debug output being shortened, as well as LLVM IR being only snippets?

I mean, provide full diff context, read the docs how to upload patches for the review.
https://www.llvm.org/docs/Phabricator.html

Also, should the underlying issue and test case be filed as an issue on Github? I wasn't sure since this revision includes the bug and a description

If you're going to fix this yourself - probably it is not necessary to create a bug report. Otherwise, please go ahead and create a bug report.