This patch adds Fortran Lowering support for the OpenMP Target Data Op with associated region.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
We probably need to update the Conversion in MLIR to OpenMP to handle the region as well. If we add a loop inside the target data construct, I believe currently it will fail to convert to LLVM.
https://github.com/llvm/llvm-project/blob/15b90805bcb8a567525a0a3605904f3ea59490dd/mlir/lib/Conversion/OpenMPToLLVM/OpenMPToLLVM.cpp#L102
flang/test/Lower/OpenMP/target_data.f90 | ||
---|---|---|
115 | Can we add a CHECK for the contents as well? |
LG.
flang/test/Lower/OpenMP/target_data.f90 | ||
---|---|---|
114 | I have not looked at how the lowering to LLVM IR works. I have a question here. When the array is mapped to the address-space of the target, I assume there will be a separate allocation there and it should be the new allocation that should be used in the body of this region. The fir.coordinate below continues to use the current fir.alloca in the host. How will this work. Generally, we handle this by making it an argument of the region and using it instead. The argument will be materialised as an argument in the target. Hence the use in the body of the construct need not be changed. |
flang/test/Lower/OpenMP/target_data.f90 | ||
---|---|---|
114 | I haven't explicitly handled this in the IRBuilder. I am just now familiarising myself with the issue. There is a patch up for review from Jan Sjodin in our team for alloca address space handling in FIR->LLVMIR here D144203. My guess for now is that we would need that patch to handle this scenario. Do you think this is something that needs addressing in this patch? I plan on putting another patch for review for the FIR->LLVMIR conversion handling of region for Target Data. |
flang/test/Lower/OpenMP/target_data.f90 | ||
---|---|---|
114 | I would like to understand how this works fully before proceeding in order to ensure that we are doing this correctly. My concern is primarily because this seems similar to privatisation clauses and we handle privatisation a bit differently due to some issues. Doesn't the IRBuilder patch (https://reviews.llvm.org/D142914) lower target data map? |
flang/test/Lower/OpenMP/target_data.f90 | ||
---|---|---|
114 | Yes, lowering of Target Data with region is handled. But as you mentioned in your earlier comments, I haven't made the argument local to the region. I am currently looking into what needs to be done for handling the map arguments inside of the region. |
Hi Kiran,
Here's what the final llvm IR looks like using clang vs flang-new.
From what I can see clang is still using the same alloca inside the region body, and furthermore if you change the map type to to, from, tofrom and alloc, the only difference in IR is the @.offload_maptypes global variable.
As I said previously, I am getting to grips with offloading as well. From reading the specification it seems the use_device_addr and use_device_ptr clauses are used to deal with the interaction between host and device data maps.
Please let me know if I am missing something here.
Fortran code:
subroutine openmp_target_data integer :: i !$omp target data map(tofrom: a) i = 99 !$omp end target data end subroutine openmp_target_data
flang generated llvm ir:
; ModuleID = 'LLVMDialectModule' source_filename = "LLVMDialectModule" target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" target triple = "aarch64-unknown-linux-gnu" %struct.ident_t = type { i32, i32, i32, i32, ptr } @0 = private unnamed_addr constant [26 x i8] c";test.mlir;unknown;4;10;;\00", align 1 @1 = private unnamed_addr constant [23 x i8] c";unknown;unknown;0;0;;\00", align 1 @2 = private unnamed_addr constant %struct.ident_t { i32 0, i32 2, i32 0, i32 22, ptr @1 }, align 8 @.offload_maptypes = private unnamed_addr constant [1 x i64] [i64 3] @.offload_mapnames = private constant [1 x ptr] [ptr @0] declare ptr @malloc(i64) declare void @free(ptr) define void @_QPopenmp_target_data() { %1 = alloca [1 x ptr], align 8 %2 = alloca [1 x ptr], align 8 %3 = alloca [1 x i64], align 8 %4 = alloca float, i64 1, align 4 %5 = alloca i32, i64 1, align 4 br label %entry entry: ; preds = %0 %6 = getelementptr inbounds [1 x ptr], ptr %1, i32 0, i32 0 store ptr %4, ptr %6, align 8 %7 = getelementptr inbounds [1 x ptr], ptr %2, i32 0, i32 0 store ptr %4, ptr %7, align 8 %8 = getelementptr inbounds [1 x i64], ptr %3, i32 0, i32 0 store i64 ptrtoint (ptr getelementptr (ptr, ptr null, i32 1) to i64), ptr %8, align 8 %9 = getelementptr inbounds [1 x ptr], ptr %1, i32 0, i32 0 %10 = getelementptr inbounds [1 x ptr], ptr %2, i32 0, i32 0 %11 = getelementptr inbounds [1 x i64], ptr %3, i32 0, i32 0 call void @__tgt_target_data_begin_mapper(ptr @2, i64 -1, i32 1, ptr %9, ptr %10, ptr %11, ptr @.offload_maptypes, ptr @.offload_mapnames, ptr null) br label %omp.data.region omp.data.region: ; preds = %entry store i32 99, ptr %5, align 4 br label %omp.region.cont omp.region.cont: ; preds = %omp.data.region %12 = getelementptr inbounds [1 x ptr], ptr %1, i32 0, i32 0 %13 = getelementptr inbounds [1 x ptr], ptr %2, i32 0, i32 0 %14 = getelementptr inbounds [1 x i64], ptr %3, i32 0, i32 0 call void @__tgt_target_data_end_mapper(ptr @2, i64 -1, i32 1, ptr %12, ptr %13, ptr %14, ptr @.offload_maptypes, ptr @.offload_mapnames, ptr null) ret void } ; Function Attrs: nounwind declare void @__tgt_target_data_begin_mapper(ptr, i64, i32, ptr, ptr, ptr, ptr, ptr, ptr) #0 ; Function Attrs: nounwind declare void @__tgt_target_data_end_mapper(ptr, i64, i32, ptr, ptr, ptr, ptr, ptr, ptr) #0 attributes #0 = { nounwind } !llvm.module.flags = !{!0} !0 = !{i32 2, !"Debug Info Version", i32 3}
C code:
#include <omp.h> int main() { int i; #pragma omp target data map(alloc : i) { i = 99; } return 0; }
clang generated llvm ir ( clang -fopenmp -fopenmp-extensions -fopenmp-offload-mandatory -fopenmp-targets=amdgcn-amd-amdhsa -S -emit-llvm test.c -o test.ll )
; __CLANG_OFFLOAD_BUNDLE____START__ openmp-amdgcn-amd-amdhsa-gfx90a ; ModuleID = 'test.c' source_filename = "test.c" target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7" target triple = "amdgcn-amd-amdhsa" @__omp_rtl_debug_kind = weak_odr hidden addrspace(1) constant i32 0 @__omp_rtl_assume_teams_oversubscription = weak_odr hidden addrspace(1) constant i32 0 @__omp_rtl_assume_threads_oversubscription = weak_odr hidden addrspace(1) constant i32 0 @__omp_rtl_assume_no_thread_state = weak_odr hidden addrspace(1) constant i32 0 @__omp_rtl_assume_no_nested_parallelism = weak_odr hidden addrspace(1) constant i32 0 !llvm.module.flags = !{!0, !1, !2, !3, !4, !5} !llvm.ident = !{!6} !0 = !{i32 1, !"amdgpu_code_object_version", i32 400} !1 = !{i32 1, !"wchar_size", i32 4} !2 = !{i32 7, !"openmp", i32 50} !3 = !{i32 7, !"openmp-device", i32 50} !4 = !{i32 8, !"PIC Level", i32 2} !5 = !{i32 7, !"frame-pointer", i32 2} !6 = !{!"AOMP_STANDALONE_16.0-4 clang version 17.0.0 (ssh://dpalermo@gerrit-git.amd.com:29418/lightning/ec/llvm-project a5701609098bc48735f5feafe84084a93c00639f)"} ; __CLANG_OFFLOAD_BUNDLE____END__ openmp-amdgcn-amd-amdhsa-gfx90a ; __CLANG_OFFLOAD_BUNDLE____START__ host-x86_64-unknown-linux-gnu ; ModuleID = '/tmp/test-117796.bc' source_filename = "test.c" target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-unknown-linux-gnu" %struct.ident_t = type { i32, i32, i32, i32, ptr } @.offload_sizes = private unnamed_addr constant [1 x i64] [i64 4] @.offload_maptypes = private unnamed_addr constant [1 x i64] [i64 3] @0 = private unnamed_addr constant [23 x i8] c";unknown;unknown;0;0;;\00", align 1 @1 = private unnamed_addr constant %struct.ident_t { i32 0, i32 2, i32 0, i32 22, ptr @0 }, align 8 ; Function Attrs: noinline nounwind optnone uwtable define dso_local i32 @main() #0 { entry: %retval = alloca i32, align 4 %i = alloca i32, align 4 %.offload_baseptrs = alloca [1 x ptr], align 8 %.offload_ptrs = alloca [1 x ptr], align 8 %.offload_mappers = alloca [1 x ptr], align 8 store i32 0, ptr %retval, align 4 %0 = getelementptr inbounds [1 x ptr], ptr %.offload_baseptrs, i32 0, i32 0 store ptr %i, ptr %0, align 8 %1 = getelementptr inbounds [1 x ptr], ptr %.offload_ptrs, i32 0, i32 0 store ptr %i, ptr %1, align 8 %2 = getelementptr inbounds [1 x ptr], ptr %.offload_mappers, i64 0, i64 0 store ptr null, ptr %2, align 8 %3 = getelementptr inbounds [1 x ptr], ptr %.offload_baseptrs, i32 0, i32 0 %4 = getelementptr inbounds [1 x ptr], ptr %.offload_ptrs, i32 0, i32 0 call void @__tgt_target_data_begin_mapper(ptr @1, i64 -1, i32 1, ptr %3, ptr %4, ptr @.offload_sizes, ptr @.offload_maptypes, ptr null, ptr null) store i32 99, ptr %i, align 4 %5 = getelementptr inbounds [1 x ptr], ptr %.offload_baseptrs, i32 0, i32 0 %6 = getelementptr inbounds [1 x ptr], ptr %.offload_ptrs, i32 0, i32 0 call void @__tgt_target_data_end_mapper(ptr @1, i64 -1, i32 1, ptr %5, ptr %6, ptr @.offload_sizes, ptr @.offload_maptypes, ptr null, ptr null) ret i32 0 } ; Function Attrs: nounwind declare void @__tgt_target_data_begin_mapper(ptr, i64, i32, ptr, ptr, ptr, ptr, ptr, ptr) #1 ; Function Attrs: nounwind declare void @__tgt_target_data_end_mapper(ptr, i64, i32, ptr, ptr, ptr, ptr, ptr, ptr) #1 attributes #0 = { noinline nounwind optnone uwtable "frame-pointer"="all" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" } attributes #1 = { nounwind } !llvm.module.flags = !{!0, !1, !2, !3, !4, !5} !llvm.ident = !{!6} !0 = !{i32 1, !"wchar_size", i32 4} !1 = !{i32 7, !"openmp", i32 50} !2 = !{i32 8, !"PIC Level", i32 2} !3 = !{i32 7, !"PIE Level", i32 2} !4 = !{i32 7, !"uwtable", i32 2} !5 = !{i32 7, !"frame-pointer", i32 2} !6 = !{!"AOMP_STANDALONE_16.0-4 clang version 17.0.0 (ssh://dpalermo@gerrit-git.amd.com:29418/lightning/ec/llvm-project a5701609098bc48735f5feafe84084a93c00639f)"} ; __CLANG_OFFLOAD_BUNDLE____END__ host-x86_64-unknown-linux-gnu
Thanks,
Akash
In my previous comment I made a mistake. The C code map_type should be tofrom. The IR is still correct and generated with tofrom as the map_type.
Sorry to be making these requests. But the following example is mapping a and not i. Would it be OK for you to give the IR for this?
subroutine openmp_target_data integer :: i !$omp target data map(tofrom: a) i = 99 !$omp end target data end subroutine openmp_target_data
Sorry to be making these requests. But the following example is mapping a and not i. Would it be OK for you to give the IR for this?
subroutine openmp_target_data integer :: i !$omp target data map(tofrom: a) i = 99 !$omp end target data end subroutine openmp_target_data
Sorry about that. Here's the correct IR.
Fortran Code:
subroutine openmp_target_data integer :: i !$omp target data map(tofrom: i) i = 99 !$omp end target data end subroutine openmp_target_data
FIR ( flang-new -fc1 -emit-fir -fopenmp test.f90 -o test.fir ):
module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<"dlti.endianness", "little">, #dlti.dl_entry<i64, dense<64> : vector<2xi32>>, #dlti.dl_entry<f80, dense<128> : vector<2xi32>>, #dlti.dl_entry<i1, dense<8> : vector<2xi32>>, #dlti.dl_entry<i8, dense<8> : vector<2xi32>>, #dlti.dl_entry<i16, dense<16> : vector<2xi32>>, #dlti.dl_entry<i32, dense<32> : vector<2xi32>>, #dlti.dl_entry<f16, dense<16> : vector<2xi32>>, #dlti.dl_entry<f64, dense<64> : vector<2xi32>>, #dlti.dl_entry<f128, dense<128> : vector<2xi32>>>, fir.defaultkind = "a1c4d8i4l4r4", fir.kindmap = "", llvm.data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128", llvm.target_triple = "x86_64-unknown-linux-gnu"} { func.func @_QPopenmp_target_data() { %0 = fir.alloca i32 {bindc_name = "i", uniq_name = "_QFopenmp_target_dataEi"} omp.target_data map((tofrom -> %0 : !fir.ref<i32>)) { %c99_i32 = arith.constant 99 : i32 fir.store %c99_i32 to %0 : !fir.ref<i32> omp.terminator } return } }
LLVMIR ( fir-opt --split-input-file --cfg-conversion --fir-to-llvm-ir="target=aarch64-unknown-linux-gnu" test.fir -o test.mlir ):
module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<"dlti.endianness", "little">, #dlti.dl_entry<i64, dense<64> : vector<2xi32>>, #dlti.dl_entry<f80, dense<128> : vector<2xi32>>, #dlti.dl_entry<i1, dense<8> : vector<2xi32>>, #dlti.dl_entry<i8, dense<8> : vector<2xi32>>, #dlti.dl_entry<i16, dense<16> : vector<2xi32>>, #dlti.dl_entry<i32, dense<32> : vector<2xi32>>, #dlti.dl_entry<f16, dense<16> : vector<2xi32>>, #dlti.dl_entry<f64, dense<64> : vector<2xi32>>, #dlti.dl_entry<f128, dense<128> : vector<2xi32>>>, fir.defaultkind = "a1c4d8i4l4r4", fir.kindmap = "", llvm.data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128", llvm.target_triple = "aarch64-unknown-linux-gnu"} { llvm.func @_QPopenmp_target_data() { %0 = llvm.mlir.constant(1 : i64) : i64 %1 = llvm.alloca %0 x i32 {bindc_name = "i", in_type = i32, operand_segment_sizes = array<i32: 0, 0>, uniq_name = "_QFopenmp_target_dataEi"} : (i64) -> !llvm.ptr<i32> omp.target_data map((tofrom -> %1 : !llvm.ptr<i32>)) { %2 = llvm.mlir.constant(99 : i32) : i32 llvm.store %2, %1 : !llvm.ptr<i32> omp.terminator } llvm.return } }
llvm IR ( mlir-translate -allow-unregistered-dialect --mlir-to-llvmir test.mlir -o test.ll ):
; ModuleID = 'LLVMDialectModule' source_filename = "LLVMDialectModule" target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" target triple = "aarch64-unknown-linux-gnu" %struct.ident_t = type { i32, i32, i32, i32, ptr } @0 = private unnamed_addr constant [26 x i8] c";test.mlir;unknown;4;10;;\00", align 1 @1 = private unnamed_addr constant [23 x i8] c";unknown;unknown;0;0;;\00", align 1 @2 = private unnamed_addr constant %struct.ident_t { i32 0, i32 2, i32 0, i32 22, ptr @1 }, align 8 @.offload_maptypes = private unnamed_addr constant [1 x i64] [i64 3] @.offload_mapnames = private constant [1 x ptr] [ptr @0] declare ptr @malloc(i64) declare void @free(ptr) define void @_QPopenmp_target_data() { %1 = alloca [1 x ptr], align 8 %2 = alloca [1 x ptr], align 8 %3 = alloca [1 x i64], align 8 %4 = alloca i32, i64 1, align 4 br label %entry entry: ; preds = %0 %5 = getelementptr inbounds [1 x ptr], ptr %1, i32 0, i32 0 store ptr %4, ptr %5, align 8 %6 = getelementptr inbounds [1 x ptr], ptr %2, i32 0, i32 0 store ptr %4, ptr %6, align 8 %7 = getelementptr inbounds [1 x i64], ptr %3, i32 0, i32 0 store i64 ptrtoint (ptr getelementptr (ptr, ptr null, i32 1) to i64), ptr %7, align 8 %8 = getelementptr inbounds [1 x ptr], ptr %1, i32 0, i32 0 %9 = getelementptr inbounds [1 x ptr], ptr %2, i32 0, i32 0 %10 = getelementptr inbounds [1 x i64], ptr %3, i32 0, i32 0 call void @__tgt_target_data_begin_mapper(ptr @2, i64 -1, i32 1, ptr %8, ptr %9, ptr %10, ptr @.offload_maptypes, ptr @.offload_mapnames, ptr null) br label %omp.data.region omp.data.region: ; preds = %entry store i32 99, ptr %4, align 4 br label %omp.region.cont omp.region.cont: ; preds = %omp.data.region %11 = getelementptr inbounds [1 x ptr], ptr %1, i32 0, i32 0 %12 = getelementptr inbounds [1 x ptr], ptr %2, i32 0, i32 0 %13 = getelementptr inbounds [1 x i64], ptr %3, i32 0, i32 0 call void @__tgt_target_data_end_mapper(ptr @2, i64 -1, i32 1, ptr %11, ptr %12, ptr %13, ptr @.offload_maptypes, ptr @.offload_mapnames, ptr null) ret void } ; Function Attrs: nounwind declare void @__tgt_target_data_begin_mapper(ptr, i64, i32, ptr, ptr, ptr, ptr, ptr, ptr) #0 ; Function Attrs: nounwind declare void @__tgt_target_data_end_mapper(ptr, i64, i32, ptr, ptr, ptr, ptr, ptr, ptr) #0 attributes #0 = { nounwind } !llvm.module.flags = !{!0} !0 = !{i32 2, !"Debug Info Version", i32 3}
Thanks,
Akash
Thanks, that looks good. The runtime is in charge of mapping and copying the data. I guess it will transparently map the address from the host pointer to the device address space since it stores the mapping.
That target data construct has device, if, map, use_device_addr, use_device_ptr clauses. Are all these supported? If not could you add TODO failures for the ones that are not supported. For the supported ones, could you add tests?
Yes, these things were already addressed in this patch : https://reviews.llvm.org/D142357. This only adds the region lowering support for Target Data.
flang/test/Lower/OpenMP/target_data.f90 | ||
---|---|---|
120 | Is there a document that I can read to understand how the OpenMP MLIR dialect is going to work with generic MLIR optimizations? Sorry for hijacking this review, but I am curious about the following case: %0 = fir.alloca !fir.array<1024xi32> %1 = fir.coordinate_of %0, %cstN // use of %1 omp.target_data map((tofrom -> %0)) { %2 = fir.coordinate_of %0, %cstN // use of %2 } How is it guaranteed that the above will not be optimized by some MLIR pass into the following: %0 = fir.alloca !fir.array<1024xi32> %1 = fir.coordinate_of %0, %cstN // use of %1 omp.target_data map((tofrom -> %0)) { // %2 = fir.coordinate_of %0, %cstN // eliminated or hoisted // use of %1 // there is no any association between the references inside the region and the references in the map clauses } In other words, is there something in OMP dialect operations's definitions that prevents other MLIR passes to introduce new live-ins into the OMP regions? This may not be a problem for target data map, but it may be a problem for target data use_device_ptr and target map. |
flang/test/Lower/OpenMP/target_data.f90 | ||
---|---|---|
120 | Thanks @vzakhari for bringing up this issue. There is nothing at the moment in the OpenMP dialect that might prevent this. The reduction declare OpenMP operation is marked as isolated from above to prevent operations from being hoisted out or in. Would you know, If you prefer, we can discuss this in MLIR discourse to get opinions from others as well. |
All types that can be present in a map clause. Should be most of the following, please consult the restrictions for the clause.
char, char array
integer, real, complex and arrays of these with both fixed and variable length
pointer, allocatable, assumed shape arrays
derived type
derived type components, array-sections
polymorphic
This patch can go ahead.
I would recommend the following follow-up work:
- If all the above types are working then please add tests for them. I assume you are talking about the end-to-end flow once https://reviews.llvm.org/D142914 lands.
- Remove the use_device_addr and use_device_ptr Operands from the MLIR operation if the presence of those oeprands would make the semantics of the operation different.
- Change the type of map_operands to OpenMP_PointerLikeType.
I have not looked at how the lowering to LLVM IR works. I have a question here. When the array is mapped to the address-space of the target, I assume there will be a separate allocation there and it should be the new allocation that should be used in the body of this region. The fir.coordinate below continues to use the current fir.alloca in the host. How will this work. Generally, we handle this by making it an argument of the region and using it instead. The argument will be materialised as an argument in the target. Hence the use in the body of the construct need not be changed.