Add lowering support for the use_device_ptr and use_Device_addr clauses for the Target Data directive.
Depends on D152822
Differential D152824
[MLIR][OpenMP]Add Flang lowering support for device_ptr and device_addr clauses TIFitis on Jun 13 2023, 9:10 AM. Authored by
Details Add lowering support for the use_device_ptr and use_Device_addr clauses for the Target Data directive. Depends on D152822
Diff Detail
Event TimelineComment Actions I made that change when lowering to llvm IR in mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp for (auto &devPtrOp : useDevPtrOperands) { llvm::Value *mapOpValue = moduleTranslation.lookupValue(devPtrOp); const auto &arg = region.addArgument(devPtrOp.getType(), devPtrOp.getLoc()); moduleTranslation.mapValue(arg, info.DevicePtrInfoMap[mapOpValue].second); replaceAllUsesInRegionWith(devPtrOp, arg, region); } Comment Actions The modelling for the Operation (target data) should be changed to represent both use_device_addr and use_device_addr as a block argument. The device_addr example from your test will look like the following. func.func @_QPomp_target_device_addr() { %0 = fir.alloca !fir.box<!fir.ptr<i32>> {bindc_name = "a", uniq_name = "_QFomp_target_device_addrEa"} %1 = fir.zero_bits !fir.ptr<i32> %2 = fir.embox %1 : (!fir.ptr<i32>) -> !fir.box<!fir.ptr<i32>> fir.store %2 to %0 : !fir.ref<!fir.box<!fir.ptr<i32>>> omp.target_data map((tofrom -> %0 : !fir.ref<!fir.box<!fir.ptr<i32>>>)) use_device_addr(%daddr -> %0 : !fir.ref<!fir.box<!fir.ptr<i32>>>) { %c10_i32 = arith.constant 10 : i32 %3 = fir.load %daddr : !fir.ref<!fir.box<!fir.ptr<i32>>> %4 = fir.box_addr %3 : (!fir.box<!fir.ptr<i32>>) -> !fir.ptr<i32> fir.store %c10_i32 to %4 : !fir.ptr<i32> omp.terminator } return } While lowering to LLVM, you can hence do this in fewer steps. for (auto &devPtrOp : useDevPtrOperands) { llvm::Value *mapOpValue = moduleTranslation.lookupValue(devPtrOp); const auto &arg = getTheArgumentAssociatedWith(devPtrOp) moduleTranslation.mapValue(arg, info.DevicePtrInfoMap[mapOpValue].second); } For handling entry block arguments, you can refer to both the wsloop op or the atomic update op. Comment Actions Hi, Also, if we do decide to do it. Would I have to use Variadic_of_Variadic for storing the map between the device_addr_value and orig_value or do you have anything else in mind? Comment Actions It will not increase the lowering complexity. For the MLIR to LLVMIR lowering it will simplify it.
It is incorrect to use the host pointer in the region and the modelling will not be in line with the standard.
This step is only modelling it accurately. All processing (generating runtime calls, getting the GPU pointer etc) will all be done by the OpenMP IRBuilder.
Variadic_of_Variadic might work. But the verifier should check that there are only two entries per inner Variadic. You can also implement it as two operand list/arrays. Comment Actions Thinking of this again, we might not need a change in the operands for use_device_ptr or use_device_addr since we are only adding block arguments. Comment Actions Are there any examples of this currently? I saw wsloop and atomic update but they aren't exactly what we're doing here I think. When lowering we still need to correspond the device addresses to their original host address counterpart to correctly set the mapping, type etc. I was thinking of achieving this by having 2 separate operands each for device_ptr and addr like Variadic<OpenMP_PointerLikeType>:$use_device_ptr_device and Variadic<OpenMP_PointerLikeType>:$use_device_ptr_host for the device_ptr clause. Comment Actions I will try to find out something for reference today.
Block arguments are not available outside the operation and thus cannot be operands. Comment Actions My current implementation generates the following FIR. Fortran: subroutine omp_target_data integer :: a a = 10 !$omp target data map(tofrom: a) use_device_ptr(a) a = 20 !$omp end target data a = 30 end subroutine omp_target_data FIR: func.func @_QPomp_target_data() { %0 = fir.alloca i32 {bindc_name = "a", uniq_name = "_QFomp_target_dataEa"} %c10_i32 = arith.constant 10 : i32 fir.store %c10_i32 to %0 : !fir.ref<i32> %1 = fir.alloca i32 {bindc_name = "a", pinned, uniq_name = "_QFomp_target_dataEa"} omp.target_data map((tofrom -> %0 : !fir.ref<i32>)) use_device_ptr((%1 -> %0 : !fir.ref<i32>)) { %c20_i32 = arith.constant 20 : i32 fir.store %c20_i32 to %1 : !fir.ref<i32> omp.terminator } %c30_i32 = arith.constant 30 : i32 fir.store %c30_i32 to %0 : !fir.ref<i32> return } Is the above code more or less what you are expecting? Comment Actions Also, my initial plan was to have OMPIRBuilder entirely handle codegen for both clang and mlir. As such I have already moved the device ptr and addr related codegen from clang to OMPIRBuilder. Comment Actions No. What I was thinking was something like the following. func.func @_QPomp_target_data() { %0 = fir.alloca i32 {bindc_name = "a", uniq_name = "_QFomp_target_dataEa"} %c10_i32 = arith.constant 10 : i32 fir.store %c10_i32 to %0 : !fir.ref<i32> omp.target_data map((tofrom -> %0 : !fir.ref<i32>)) use_device_ptr((%1 -> %0 : !fir.ref<i32>)) { ^bb0(%1: !fir.ref<i32>): %c20_i32 = arith.constant 20 : i32 fir.store %c20_i32 to %1 : !fir.ref<i32> omp.terminator } %c30_i32 = arith.constant 30 : i32 fir.store %c30_i32 to %0 : !fir.ref<i32> return }
MLIR will not be creating a private version. It will only be providing a block argument that you can use to lower easily to the private version created by the IRBuilder. (I believe, you previously reported that it was easy to use block arguments to lower to LLVM IR.) Comment Actions Yes having a block argument works. I am just finding it tricky to add it here in the frontend lowering stage. I will see if I am able to generate something like the code you shared above. Comment Actions It should be similar to the worksharing loop, just that this will be for the device_ptr clause and not the index variable. Comment Actions I've updated the patch. I am not sure if this is the correct way of adding the block arguments. It is generating the following code currently. I don't know why it's adding the new %0 variable and using it inside the region. func.func @_QPomp_target_data() { %0 = fir.alloca i32 {adapt.valuebyref} %1 = fir.alloca i32 {bindc_name = "a", uniq_name = "_QFomp_target_dataEa"} %c10_i32 = arith.constant 10 : i32 fir.store %c10_i32 to %1 : !fir.ref<i32> omp.target_data map((tofrom -> %1 : !fir.ref<i32>)) use_device_ptr((%1 -> %1 : !fir.ref<i32>)) { ^bb0(%arg0: i32): fir.store %arg0 to %0 : !fir.ref<i32> %c20_i32 = arith.constant 20 : i32 fir.store %c20_i32 to %0 : !fir.ref<i32> omp.terminator } %c30_i32 = arith.constant 30 : i32 fir.store %c30_i32 to %1 : !fir.ref<i32> return }
Comment Actions I was suggesting that you can do something similar as the Worksharing Loop's boyd creation, the same code will not work. Only a subset of what is implemented for the worksharing loop's arguments are required here. Basically, you have to create a block in the body of the operation with the appropriate block argument types and locations. The argument types will have the same type and location as the type of the device_ptr operands. SmallVector<Type> tiv; // fill this with the type of the device_ptr SmallVector<Location> locs; // fill this with the location of the device_ptr // Create a block in the body of the Operation with as many block arguments as there are elements in the device_ptr clause. firOpBuilder.createBlock(&op.getRegion(), {}, tiv, locs); // Now get these block arguments and bind them to the device_ptr symbols // so that it is the block arguments that get used for these symbols in the // body of the operation // Note: here the `args` are the device_ptr symbols. int argIndex = 0; for (const Fortran::semantics::Symbol *arg : args) { mlir::Value val = fir::getBase(op.getRegion().front().getArgument(argIndex)); converter.bindSymbol(*arg, val); argIndex++; } Note: The above is pseudocode only intended to convey the idea. Comment Actions @kiranchandramohan Thanks a lot for helping me with the pseudocode. I have updated the patch and it seems to be working as intended except when using pointers. The following is the code currently being generated: Fortran: subroutine omp_target_data integer :: a, b, c !$omp target data map(tofrom: a, b, c) use_device_ptr(a) use_device_addr(b) a = 20 b = 2 c = 10 !$omp end target data a = 30 b = 3 c = 10 end subroutine omp_target_data FIR: func.func @_QPomp_target_data() { %0 = fir.alloca i32 {bindc_name = "a", uniq_name = "_QFomp_target_dataEa"} %1 = fir.alloca i32 {bindc_name = "b", uniq_name = "_QFomp_target_dataEb"} %2 = fir.alloca i32 {bindc_name = "c", uniq_name = "_QFomp_target_dataEc"} omp.target_data map((tofrom -> %0 : !fir.ref<i32>), (tofrom -> %1 : !fir.ref<i32>), (tofrom -> %2 : !fir.ref<i32>)) use_device_ptr(%0 : !fir.ref<i32>) use_device_addr(%1 : !fir.ref<i32>) { ^bb0(%arg0: !fir.ref<i32>, %arg1: !fir.ref<i32>): %c20_i32 = arith.constant 20 : i32 fir.store %c20_i32 to %arg0 : !fir.ref<i32> %c2_i32 = arith.constant 2 : i32 fir.store %c2_i32 to %arg1 : !fir.ref<i32> %c10_i32_0 = arith.constant 10 : i32 fir.store %c10_i32_0 to %2 : !fir.ref<i32> omp.terminator } %c30_i32 = arith.constant 30 : i32 fir.store %c30_i32 to %0 : !fir.ref<i32> %c3_i32 = arith.constant 3 : i32 fir.store %c3_i32 to %1 : !fir.ref<i32> %c10_i32 = arith.constant 10 : i32 fir.store %c10_i32 to %2 : !fir.ref<i32> return } However, if I change it to integer, pointer :: a, b, c then I am getting the following assertion failure llvm-project/flang/lib/Lower/Bridge.cpp:3446: auto (anonymous namespace)::FirConverter::genAssignment(const Fortran::evaluate::Assignment &)::(anonymous class)::operator()(const Fortran::evaluate::Assignment::Intrinsic &) const: Assertion 'isFuncResultDesignator(assign.lhs) && "type mismatch"' failed. This seems to be generating from the store inside the region and binding its symbol. I wasn't able to come up with a quick fix for it. If you have any suggestions that would be great!!!
Comment Actions Please test with the use_device_ptr or use_device_addr tests in the gfortran testsuite. This is a reasonable start for modelling device_ptr and device_addr. More changes will probably be required to model the dataflow and integrating with the data-mapping clause. Using a block argument hopefully makes lowering to LLVM easier. LG. Please wait for one more acceptance.
Comment Actions Addressed reviewer comments.
|
Is fir::factory::getNonDeferredLenParams not available here?