This is an archive of the discontinued LLVM Phabricator instance.

[MLIR][OpenMP]Add Flang lowering support for device_ptr and device_addr clauses
ClosedPublic

Authored by TIFitis on Jun 13 2023, 9:10 AM.

Details

Summary

Add lowering support for the use_device_ptr and use_Device_addr clauses for the Target Data directive.

Depends on D152822

Diff Detail

Event Timeline

TIFitis created this revision.Jun 13 2023, 9:10 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJun 13 2023, 9:10 AM
TIFitis requested review of this revision.Jun 13 2023, 9:10 AM

Weren't you planning to make the device_ptr a block argument operand?

Weren't you planning to make the device_ptr a block argument operand?

I made that change when lowering to llvm IR in mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
Here's a snippet of code showing how I did it. If there's a better way of doing this I'd be happy to know.

for (auto &devPtrOp : useDevPtrOperands) {
  llvm::Value *mapOpValue = moduleTranslation.lookupValue(devPtrOp);
  const auto &arg =
      region.addArgument(devPtrOp.getType(), devPtrOp.getLoc());
  moduleTranslation.mapValue(arg,
                             info.DevicePtrInfoMap[mapOpValue].second);
  replaceAllUsesInRegionWith(devPtrOp, arg, region);
}

Weren't you planning to make the device_ptr a block argument operand?

I made that change when lowering to llvm IR in mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
Here's a snippet of code showing how I did it. If there's a better way of doing this I'd be happy to know.

for (auto &devPtrOp : useDevPtrOperands) {
  llvm::Value *mapOpValue = moduleTranslation.lookupValue(devPtrOp);
  const auto &arg =
      region.addArgument(devPtrOp.getType(), devPtrOp.getLoc());
  moduleTranslation.mapValue(arg,
                             info.DevicePtrInfoMap[mapOpValue].second);
  replaceAllUsesInRegionWith(devPtrOp, arg, region);
}

Weren't you planning to make the device_ptr a block argument operand?

I made that change when lowering to llvm IR in mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
Here's a snippet of code showing how I did it. If there's a better way of doing this I'd be happy to know.

for (auto &devPtrOp : useDevPtrOperands) {
  llvm::Value *mapOpValue = moduleTranslation.lookupValue(devPtrOp);
  const auto &arg =
      region.addArgument(devPtrOp.getType(), devPtrOp.getLoc());
  moduleTranslation.mapValue(arg,
                             info.DevicePtrInfoMap[mapOpValue].second);
  replaceAllUsesInRegionWith(devPtrOp, arg, region);
}

The modelling for the Operation (target data) should be changed to represent both use_device_addr and use_device_addr as a block argument.
May be something like omp.target_data use_device_addr(%daddr -> %a), where %daddr is an entry block argument and %daddr will be used inside the region and not %a.
Note that this might require a custom printer and parser.
This will be a more accurate modelling since the standard explicitly says that the address will be a device address (and not the host address) and hence it is incorrect to use the host address in the body of the region.

The device_addr example from your test will look like the following.

func.func @_QPomp_target_device_addr() {
  %0 = fir.alloca !fir.box<!fir.ptr<i32>> {bindc_name = "a", uniq_name = "_QFomp_target_device_addrEa"}
  %1 = fir.zero_bits !fir.ptr<i32>
  %2 = fir.embox %1 : (!fir.ptr<i32>) -> !fir.box<!fir.ptr<i32>>
  fir.store %2 to %0 : !fir.ref<!fir.box<!fir.ptr<i32>>>
  omp.target_data   map((tofrom -> %0 : !fir.ref<!fir.box<!fir.ptr<i32>>>))   use_device_addr(%daddr -> %0 : !fir.ref<!fir.box<!fir.ptr<i32>>>) {
    %c10_i32 = arith.constant 10 : i32
    %3 = fir.load %daddr : !fir.ref<!fir.box<!fir.ptr<i32>>>
    %4 = fir.box_addr %3 : (!fir.box<!fir.ptr<i32>>) -> !fir.ptr<i32>
    fir.store %c10_i32 to %4 : !fir.ptr<i32>
    omp.terminator
  }
  return
}

While lowering to LLVM, you can hence do this in fewer steps.

for (auto &devPtrOp : useDevPtrOperands) {
  llvm::Value *mapOpValue = moduleTranslation.lookupValue(devPtrOp);
  const auto &arg = getTheArgumentAssociatedWith(devPtrOp)
  moduleTranslation.mapValue(arg,
                             info.DevicePtrInfoMap[mapOpValue].second);
}

For handling entry block arguments, you can refer to both the wsloop op or the atomic update op.

Weren't you planning to make the device_ptr a block argument operand?

I made that change when lowering to llvm IR in mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
Here's a snippet of code showing how I did it. If there's a better way of doing this I'd be happy to know.

for (auto &devPtrOp : useDevPtrOperands) {
  llvm::Value *mapOpValue = moduleTranslation.lookupValue(devPtrOp);
  const auto &arg =
      region.addArgument(devPtrOp.getType(), devPtrOp.getLoc());
  moduleTranslation.mapValue(arg,
                             info.DevicePtrInfoMap[mapOpValue].second);
  replaceAllUsesInRegionWith(devPtrOp, arg, region);
}

Weren't you planning to make the device_ptr a block argument operand?

I made that change when lowering to llvm IR in mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
Here's a snippet of code showing how I did it. If there's a better way of doing this I'd be happy to know.

for (auto &devPtrOp : useDevPtrOperands) {
  llvm::Value *mapOpValue = moduleTranslation.lookupValue(devPtrOp);
  const auto &arg =
      region.addArgument(devPtrOp.getType(), devPtrOp.getLoc());
  moduleTranslation.mapValue(arg,
                             info.DevicePtrInfoMap[mapOpValue].second);
  replaceAllUsesInRegionWith(devPtrOp, arg, region);
}

The modelling for the Operation (target data) should be changed to represent both use_device_addr and use_device_addr as a block argument.
May be something like omp.target_data use_device_addr(%daddr -> %a), where %daddr is an entry block argument and %daddr will be used inside the region and not %a.
Note that this might require a custom printer and parser.
This will be a more accurate modelling since the standard explicitly says that the address will be a device address (and not the host address) and hence it is incorrect to use the host address in the body of the region.

The device_addr example from your test will look like the following.

func.func @_QPomp_target_device_addr() {
  %0 = fir.alloca !fir.box<!fir.ptr<i32>> {bindc_name = "a", uniq_name = "_QFomp_target_device_addrEa"}
  %1 = fir.zero_bits !fir.ptr<i32>
  %2 = fir.embox %1 : (!fir.ptr<i32>) -> !fir.box<!fir.ptr<i32>>
  fir.store %2 to %0 : !fir.ref<!fir.box<!fir.ptr<i32>>>
  omp.target_data   map((tofrom -> %0 : !fir.ref<!fir.box<!fir.ptr<i32>>>))   use_device_addr(%daddr -> %0 : !fir.ref<!fir.box<!fir.ptr<i32>>>) {
    %c10_i32 = arith.constant 10 : i32
    %3 = fir.load %daddr : !fir.ref<!fir.box<!fir.ptr<i32>>>
    %4 = fir.box_addr %3 : (!fir.box<!fir.ptr<i32>>) -> !fir.ptr<i32>
    fir.store %c10_i32 to %4 : !fir.ptr<i32>
    omp.terminator
  }
  return
}

While lowering to LLVM, you can hence do this in fewer steps.

for (auto &devPtrOp : useDevPtrOperands) {
  llvm::Value *mapOpValue = moduleTranslation.lookupValue(devPtrOp);
  const auto &arg = getTheArgumentAssociatedWith(devPtrOp)
  moduleTranslation.mapValue(arg,
                             info.DevicePtrInfoMap[mapOpValue].second);
}

For handling entry block arguments, you can refer to both the wsloop op or the atomic update op.

Hi,
I see what you mean. At a glance it looks to me like it would make the lowering to both MLIR and later to LLVM IR a little bit more complex. But I don't see any obvious benefit from doing it this way.
Wouldn't we want all the processing for the omp ops to be done in one stage?

Also, if we do decide to do it. Would I have to use Variadic_of_Variadic for storing the map between the device_addr_value and orig_value or do you have anything else in mind?

Weren't you planning to make the device_ptr a block argument operand?

I made that change when lowering to llvm IR in mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
Here's a snippet of code showing how I did it. If there's a better way of doing this I'd be happy to know.

for (auto &devPtrOp : useDevPtrOperands) {
  llvm::Value *mapOpValue = moduleTranslation.lookupValue(devPtrOp);
  const auto &arg =
      region.addArgument(devPtrOp.getType(), devPtrOp.getLoc());
  moduleTranslation.mapValue(arg,
                             info.DevicePtrInfoMap[mapOpValue].second);
  replaceAllUsesInRegionWith(devPtrOp, arg, region);
}

Weren't you planning to make the device_ptr a block argument operand?

I made that change when lowering to llvm IR in mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
Here's a snippet of code showing how I did it. If there's a better way of doing this I'd be happy to know.

for (auto &devPtrOp : useDevPtrOperands) {
  llvm::Value *mapOpValue = moduleTranslation.lookupValue(devPtrOp);
  const auto &arg =
      region.addArgument(devPtrOp.getType(), devPtrOp.getLoc());
  moduleTranslation.mapValue(arg,
                             info.DevicePtrInfoMap[mapOpValue].second);
  replaceAllUsesInRegionWith(devPtrOp, arg, region);
}

The modelling for the Operation (target data) should be changed to represent both use_device_addr and use_device_addr as a block argument.
May be something like omp.target_data use_device_addr(%daddr -> %a), where %daddr is an entry block argument and %daddr will be used inside the region and not %a.
Note that this might require a custom printer and parser.
This will be a more accurate modelling since the standard explicitly says that the address will be a device address (and not the host address) and hence it is incorrect to use the host address in the body of the region.

The device_addr example from your test will look like the following.

func.func @_QPomp_target_device_addr() {
  %0 = fir.alloca !fir.box<!fir.ptr<i32>> {bindc_name = "a", uniq_name = "_QFomp_target_device_addrEa"}
  %1 = fir.zero_bits !fir.ptr<i32>
  %2 = fir.embox %1 : (!fir.ptr<i32>) -> !fir.box<!fir.ptr<i32>>
  fir.store %2 to %0 : !fir.ref<!fir.box<!fir.ptr<i32>>>
  omp.target_data   map((tofrom -> %0 : !fir.ref<!fir.box<!fir.ptr<i32>>>))   use_device_addr(%daddr -> %0 : !fir.ref<!fir.box<!fir.ptr<i32>>>) {
    %c10_i32 = arith.constant 10 : i32
    %3 = fir.load %daddr : !fir.ref<!fir.box<!fir.ptr<i32>>>
    %4 = fir.box_addr %3 : (!fir.box<!fir.ptr<i32>>) -> !fir.ptr<i32>
    fir.store %c10_i32 to %4 : !fir.ptr<i32>
    omp.terminator
  }
  return
}

While lowering to LLVM, you can hence do this in fewer steps.

for (auto &devPtrOp : useDevPtrOperands) {
  llvm::Value *mapOpValue = moduleTranslation.lookupValue(devPtrOp);
  const auto &arg = getTheArgumentAssociatedWith(devPtrOp)
  moduleTranslation.mapValue(arg,
                             info.DevicePtrInfoMap[mapOpValue].second);
}

For handling entry block arguments, you can refer to both the wsloop op or the atomic update op.

Hi,
I see what you mean. At a glance it looks to me like it would make the lowering to both MLIR and later to LLVM IR a little bit more complex.

It will not increase the lowering complexity. For the MLIR to LLVMIR lowering it will simplify it.

But I don't see any obvious benefit from doing it this way.

It is incorrect to use the host pointer in the region and the modelling will not be in line with the standard.

Wouldn't we want all the processing for the omp ops to be done in one stage?

This step is only modelling it accurately. All processing (generating runtime calls, getting the GPU pointer etc) will all be done by the OpenMP IRBuilder.

Also, if we do decide to do it. Would I have to use Variadic_of_Variadic for storing the map between the device_addr_value and orig_value or do you have anything else in mind?

Variadic_of_Variadic might work. But the verifier should check that there are only two entries per inner Variadic. You can also implement it as two operand list/arrays.

Thinking of this again, we might not need a change in the operands for use_device_ptr or use_device_addr since we are only adding block arguments.

Thinking of this again, we might not need a change in the operands for use_device_ptr or use_device_addr since we are only adding block arguments.

Are there any examples of this currently? I saw wsloop and atomic update but they aren't exactly what we're doing here I think.

When lowering we still need to correspond the device addresses to their original host address counterpart to correctly set the mapping, type etc. I was thinking of achieving this by having 2 separate operands each for device_ptr and addr like Variadic<OpenMP_PointerLikeType>:$use_device_ptr_device and Variadic<OpenMP_PointerLikeType>:$use_device_ptr_host for the device_ptr clause.

Thinking of this again, we might not need a change in the operands for use_device_ptr or use_device_addr since we are only adding block arguments.

Are there any examples of this currently? I saw wsloop and atomic update but they aren't exactly what we're doing here I think.

I will try to find out something for reference today.

When lowering we still need to correspond the device addresses to their original host address counterpart to correctly set the mapping, type etc. I was thinking of achieving this by having 2 separate operands each for device_ptr and addr like Variadic<OpenMP_PointerLikeType>:$use_device_ptr_device and Variadic<OpenMP_PointerLikeType>:$use_device_ptr_host for the device_ptr clause.

Block arguments are not available outside the operation and thus cannot be operands.

TIFitis added a comment.EditedJun 19 2023, 7:13 AM

My current implementation generates the following FIR.

Fortran:

subroutine omp_target_data
    integer :: a
    a = 10
    !$omp target data map(tofrom: a) use_device_ptr(a)
        a = 20
    !$omp end target data
    a = 30
end subroutine omp_target_data

FIR:

func.func @_QPomp_target_data() {
  %0 = fir.alloca i32 {bindc_name = "a", uniq_name = "_QFomp_target_dataEa"}
  %c10_i32 = arith.constant 10 : i32
  fir.store %c10_i32 to %0 : !fir.ref<i32>
  %1 = fir.alloca i32 {bindc_name = "a", pinned, uniq_name = "_QFomp_target_dataEa"}
  omp.target_data   map((tofrom -> %0 : !fir.ref<i32>))   use_device_ptr((%1 -> %0 : !fir.ref<i32>)) {
    %c20_i32 = arith.constant 20 : i32
    fir.store %c20_i32 to %1 : !fir.ref<i32>
    omp.terminator
  }
  %c30_i32 = arith.constant 30 : i32
  fir.store %c30_i32 to %0 : !fir.ref<i32>
  return
}

Is the above code more or less what you are expecting?

Also, my initial plan was to have OMPIRBuilder entirely handle codegen for both clang and mlir.

As such I have already moved the device ptr and addr related codegen from clang to OMPIRBuilder.
The IRBuilder creates private versions of the operands for us in llvm. However, since MLIR would also be creating private versions for us now, we would have to add in code in the IRBuilder to treat clang and mlir separately which I'd like to avoid as much as possible.

My current implementation generates the following FIR.

Fortran:

subroutine omp_target_data
    integer :: a
    a = 10
    !$omp target data map(tofrom: a) use_device_ptr(a)
        a = 20
    !$omp end target data
    a = 30
end subroutine omp_target_data

FIR:

func.func @_QPomp_target_data() {
  %0 = fir.alloca i32 {bindc_name = "a", uniq_name = "_QFomp_target_dataEa"}
  %c10_i32 = arith.constant 10 : i32
  fir.store %c10_i32 to %0 : !fir.ref<i32>
  %1 = fir.alloca i32 {bindc_name = "a", pinned, uniq_name = "_QFomp_target_dataEa"}
  omp.target_data   map((tofrom -> %0 : !fir.ref<i32>))   use_device_ptr((%1 -> %0 : !fir.ref<i32>)) {
    %c20_i32 = arith.constant 20 : i32
    fir.store %c20_i32 to %1 : !fir.ref<i32>
    omp.terminator
  }
  %c30_i32 = arith.constant 30 : i32
  fir.store %c30_i32 to %0 : !fir.ref<i32>
  return
}

Is the above code more or less what you are expecting?

No. What I was thinking was something like the following.

func.func @_QPomp_target_data() {
  %0 = fir.alloca i32 {bindc_name = "a", uniq_name = "_QFomp_target_dataEa"}
  %c10_i32 = arith.constant 10 : i32
  fir.store %c10_i32 to %0 : !fir.ref<i32>
  omp.target_data   map((tofrom -> %0 : !fir.ref<i32>))   use_device_ptr((%1 -> %0 : !fir.ref<i32>)) {
  ^bb0(%1: !fir.ref<i32>):
    %c20_i32 = arith.constant 20 : i32
    fir.store %c20_i32 to %1 : !fir.ref<i32>
    omp.terminator
  }
  %c30_i32 = arith.constant 30 : i32
  fir.store %c30_i32 to %0 : !fir.ref<i32>
  return
}

However, since MLIR would also be creating private versions for us now, we would have to add in code in the IRBuilder to treat clang and mlir separately which I'd like to avoid as much as possible.

MLIR will not be creating a private version. It will only be providing a block argument that you can use to lower easily to the private version created by the IRBuilder. (I believe, you previously reported that it was easy to use block arguments to lower to LLVM IR.)

My current implementation generates the following FIR.

Fortran:

subroutine omp_target_data
    integer :: a
    a = 10
    !$omp target data map(tofrom: a) use_device_ptr(a)
        a = 20
    !$omp end target data
    a = 30
end subroutine omp_target_data

FIR:

func.func @_QPomp_target_data() {
  %0 = fir.alloca i32 {bindc_name = "a", uniq_name = "_QFomp_target_dataEa"}
  %c10_i32 = arith.constant 10 : i32
  fir.store %c10_i32 to %0 : !fir.ref<i32>
  %1 = fir.alloca i32 {bindc_name = "a", pinned, uniq_name = "_QFomp_target_dataEa"}
  omp.target_data   map((tofrom -> %0 : !fir.ref<i32>))   use_device_ptr((%1 -> %0 : !fir.ref<i32>)) {
    %c20_i32 = arith.constant 20 : i32
    fir.store %c20_i32 to %1 : !fir.ref<i32>
    omp.terminator
  }
  %c30_i32 = arith.constant 30 : i32
  fir.store %c30_i32 to %0 : !fir.ref<i32>
  return
}

Is the above code more or less what you are expecting?

No. What I was thinking was something like the following.

func.func @_QPomp_target_data() {
  %0 = fir.alloca i32 {bindc_name = "a", uniq_name = "_QFomp_target_dataEa"}
  %c10_i32 = arith.constant 10 : i32
  fir.store %c10_i32 to %0 : !fir.ref<i32>
  omp.target_data   map((tofrom -> %0 : !fir.ref<i32>))   use_device_ptr((%1 -> %0 : !fir.ref<i32>)) {
  ^bb0(%1: !fir.ref<i32>):
    %c20_i32 = arith.constant 20 : i32
    fir.store %c20_i32 to %1 : !fir.ref<i32>
    omp.terminator
  }
  %c30_i32 = arith.constant 30 : i32
  fir.store %c30_i32 to %0 : !fir.ref<i32>
  return
}

However, since MLIR would also be creating private versions for us now, we would have to add in code in the IRBuilder to treat clang and mlir separately which I'd like to avoid as much as possible.

MLIR will not be creating a private version. It will only be providing a block argument that you can use to lower easily to the private version created by the IRBuilder. (I believe, you previously reported that it was easy to use block arguments to lower to LLVM IR.)

Yes having a block argument works. I am just finding it tricky to add it here in the frontend lowering stage. I will see if I am able to generate something like the code you shared above.

My current implementation generates the following FIR.

Fortran:

subroutine omp_target_data
    integer :: a
    a = 10
    !$omp target data map(tofrom: a) use_device_ptr(a)
        a = 20
    !$omp end target data
    a = 30
end subroutine omp_target_data

FIR:

func.func @_QPomp_target_data() {
  %0 = fir.alloca i32 {bindc_name = "a", uniq_name = "_QFomp_target_dataEa"}
  %c10_i32 = arith.constant 10 : i32
  fir.store %c10_i32 to %0 : !fir.ref<i32>
  %1 = fir.alloca i32 {bindc_name = "a", pinned, uniq_name = "_QFomp_target_dataEa"}
  omp.target_data   map((tofrom -> %0 : !fir.ref<i32>))   use_device_ptr((%1 -> %0 : !fir.ref<i32>)) {
    %c20_i32 = arith.constant 20 : i32
    fir.store %c20_i32 to %1 : !fir.ref<i32>
    omp.terminator
  }
  %c30_i32 = arith.constant 30 : i32
  fir.store %c30_i32 to %0 : !fir.ref<i32>
  return
}

Is the above code more or less what you are expecting?

No. What I was thinking was something like the following.

func.func @_QPomp_target_data() {
  %0 = fir.alloca i32 {bindc_name = "a", uniq_name = "_QFomp_target_dataEa"}
  %c10_i32 = arith.constant 10 : i32
  fir.store %c10_i32 to %0 : !fir.ref<i32>
  omp.target_data   map((tofrom -> %0 : !fir.ref<i32>))   use_device_ptr((%1 -> %0 : !fir.ref<i32>)) {
  ^bb0(%1: !fir.ref<i32>):
    %c20_i32 = arith.constant 20 : i32
    fir.store %c20_i32 to %1 : !fir.ref<i32>
    omp.terminator
  }
  %c30_i32 = arith.constant 30 : i32
  fir.store %c30_i32 to %0 : !fir.ref<i32>
  return
}

However, since MLIR would also be creating private versions for us now, we would have to add in code in the IRBuilder to treat clang and mlir separately which I'd like to avoid as much as possible.

MLIR will not be creating a private version. It will only be providing a block argument that you can use to lower easily to the private version created by the IRBuilder. (I believe, you previously reported that it was easy to use block arguments to lower to LLVM IR.)

Yes having a block argument works. I am just finding it tricky to add it here in the frontend lowering stage. I will see if I am able to generate something like the code you shared above.

It should be similar to the worksharing loop, just that this will be for the device_ptr clause and not the index variable.

TIFitis updated this revision to Diff 532706.Jun 19 2023, 10:33 AM

[WIP] Add block arguments.

I've updated the patch. I am not sure if this is the correct way of adding the block arguments.

It is generating the following code currently. I don't know why it's adding the new %0 variable and using it inside the region.

func.func @_QPomp_target_data() {
  %0 = fir.alloca i32 {adapt.valuebyref}
  %1 = fir.alloca i32 {bindc_name = "a", uniq_name = "_QFomp_target_dataEa"}
  %c10_i32 = arith.constant 10 : i32
  fir.store %c10_i32 to %1 : !fir.ref<i32>
  omp.target_data   map((tofrom -> %1 : !fir.ref<i32>))   use_device_ptr((%1 -> %1 : !fir.ref<i32>)) {
  ^bb0(%arg0: i32):
    fir.store %arg0 to %0 : !fir.ref<i32>
    %c20_i32 = arith.constant 20 : i32
    fir.store %c20_i32 to %0 : !fir.ref<i32>
    omp.terminator
  }
  %c30_i32 = arith.constant 30 : i32
  fir.store %c30_i32 to %1 : !fir.ref<i32>
  return
}
flang/lib/Lower/OpenMP.cpp
931

The arg has type i32 here instead of fir.ref<i32> and as a result getting an error when adding it as an operand to dev_ptr clause.

I've updated the patch. I am not sure if this is the correct way of adding the block arguments.

It is generating the following code currently. I don't know why it's adding the new %0 variable and using it inside the region.

func.func @_QPomp_target_data() {
  %0 = fir.alloca i32 {adapt.valuebyref}
  %1 = fir.alloca i32 {bindc_name = "a", uniq_name = "_QFomp_target_dataEa"}
  %c10_i32 = arith.constant 10 : i32
  fir.store %c10_i32 to %1 : !fir.ref<i32>
  omp.target_data   map((tofrom -> %1 : !fir.ref<i32>))   use_device_ptr((%1 -> %1 : !fir.ref<i32>)) {
  ^bb0(%arg0: i32):
    fir.store %arg0 to %0 : !fir.ref<i32>
    %c20_i32 = arith.constant 20 : i32
    fir.store %c20_i32 to %0 : !fir.ref<i32>
    omp.terminator
  }
  %c30_i32 = arith.constant 30 : i32
  fir.store %c30_i32 to %1 : !fir.ref<i32>
  return
}

I was suggesting that you can do something similar as the Worksharing Loop's boyd creation, the same code will not work. Only a subset of what is implemented for the worksharing loop's arguments are required here.

Basically, you have to create a block in the body of the operation with the appropriate block argument types and locations. The argument types will have the same type and location as the type of the device_ptr operands.

SmallVector<Type> tiv; // fill this with the type of the device_ptr      
SmallVector<Location> locs; // fill this with the location of the device_ptr
// Create a block in the body of the Operation with as many block arguments as there are elements in the device_ptr clause.
firOpBuilder.createBlock(&op.getRegion(), {}, tiv, locs);

 // Now get these block arguments and bind them to the device_ptr symbols
 // so that it is the block arguments that get used for these symbols in the
 // body of the operation
 // Note: here the `args` are the device_ptr symbols.
 int argIndex = 0;
  for (const Fortran::semantics::Symbol *arg : args) {
    mlir::Value val =
        fir::getBase(op.getRegion().front().getArgument(argIndex));
    converter.bindSymbol(*arg, val);
    argIndex++;
  }

Note: The above is pseudocode only intended to convey the idea.

TIFitis updated this revision to Diff 532919.Jun 20 2023, 7:30 AM

Added block arguments.

@kiranchandramohan Thanks a lot for helping me with the pseudocode.

I have updated the patch and it seems to be working as intended except when using pointers.

The following is the code currently being generated:

Fortran:

subroutine omp_target_data
    integer :: a, b, c
    !$omp target data map(tofrom: a, b, c) use_device_ptr(a) use_device_addr(b)
        a = 20
        b = 2
        c = 10
    !$omp end target data
    a = 30
    b = 3
    c = 10
end subroutine omp_target_data

FIR:

func.func @_QPomp_target_data() {
  %0 = fir.alloca i32 {bindc_name = "a", uniq_name = "_QFomp_target_dataEa"}
  %1 = fir.alloca i32 {bindc_name = "b", uniq_name = "_QFomp_target_dataEb"}
  %2 = fir.alloca i32 {bindc_name = "c", uniq_name = "_QFomp_target_dataEc"}
  omp.target_data   map((tofrom -> %0 : !fir.ref<i32>), (tofrom -> %1 : !fir.ref<i32>), (tofrom -> %2 : !fir.ref<i32>))   use_device_ptr(%0 : !fir.ref<i32>) use_device_addr(%1 : !fir.ref<i32>) {
  ^bb0(%arg0: !fir.ref<i32>, %arg1: !fir.ref<i32>):
    %c20_i32 = arith.constant 20 : i32
    fir.store %c20_i32 to %arg0 : !fir.ref<i32>
    %c2_i32 = arith.constant 2 : i32
    fir.store %c2_i32 to %arg1 : !fir.ref<i32>
    %c10_i32_0 = arith.constant 10 : i32
    fir.store %c10_i32_0 to %2 : !fir.ref<i32>
    omp.terminator
  }
  %c30_i32 = arith.constant 30 : i32
  fir.store %c30_i32 to %0 : !fir.ref<i32>
  %c3_i32 = arith.constant 3 : i32
  fir.store %c3_i32 to %1 : !fir.ref<i32>
  %c10_i32 = arith.constant 10 : i32
  fir.store %c10_i32 to %2 : !fir.ref<i32>
  return
}

However, if I change it to integer, pointer :: a, b, c then I am getting the following assertion failure

llvm-project/flang/lib/Lower/Bridge.cpp:3446: auto (anonymous namespace)::FirConverter::genAssignment(const Fortran::evaluate::Assignment &)::(anonymous class)::operator()(const Fortran::evaluate::Assignment::Intrinsic &) const: Assertion 'isFuncResultDesignator(assign.lhs) && "type mismatch"' failed.

This seems to be generating from the store inside the region and binding its symbol. I wasn't able to come up with a quick fix for it. If you have any suggestions that would be great!!!

flang/lib/Lower/OpenMP.cpp
937

I think the issue here is that we are binding the symbol to an MLIR value. The original symbol is bound to an ExtendedValue (fir::MutableBoxValue). When some checks are done to see whether the value is a box of type MutableBoxValue, it fails. And hence it does not insert the load and box_addr instructions.

I think the solution is to create an ExtendedValue of the matching type.

if sym is bound to MutableBoxValue box
  converter.bindSymbol(*sym, fir::MutableBoxValue(val, box.lenparams, {}));
else if sym is bound to CharBoxValue
  converter.bindSymbol(*sym, fir::CharBoxValue(
else
  converter.bindSymbol(*sym, val);

Note: This code is just to convey the idea.

Can you try something like this and see whether it works for the different types and boxes. If there are issues, we might have to chat with the lowering team.

TIFitis updated this revision to Diff 533287.Jun 21 2023, 9:00 AM

Fixed support for pointer type.

TIFitis marked an inline comment as done.Jun 21 2023, 9:01 AM
TIFitis added inline comments.
flang/lib/Lower/OpenMP.cpp
937

Thanks for the pointer, it worked.

I have only added support for the PointerType here as UseDevice clause operands must always be pointers as per standard. I checked with integer and character pointers and both are working fine.

TIFitis marked an inline comment as done.Jun 21 2023, 9:06 AM
flang/lib/Lower/OpenMP.cpp
443

Is fir::factory::getNonDeferredLenParams not available here?

927–941

Can you move this into a createBodyOfTargetOp function?

flang/test/Lower/OpenMP/target.f90
172

Are these tests valid for device_ptr? Should they be of type c_ptr as per the standard?

TIFitis updated this revision to Diff 533379.Jun 21 2023, 1:38 PM

Addressed reviewer comments

TIFitis marked 3 inline comments as done.Jun 21 2023, 1:40 PM
TIFitis added inline comments.
flang/lib/Lower/OpenMP.cpp
443

Sorry about that, I've fixed it.

753–758

Are these error messages alright?

flang/test/Lower/OpenMP/target.f90
172

I've updated the test. Let me know if its okay now.

Please test with the use_device_ptr or use_device_addr tests in the gfortran testsuite.

This is a reasonable start for modelling device_ptr and device_addr. More changes will probably be required to model the dataflow and integrating with the data-mapping clause. Using a block argument hopefully makes lowering to LLVM easier.

LG. Please wait for one more acceptance.

flang/lib/Lower/OpenMP.cpp
733

Nit: remove auto here and elsewhere where the type is not in the RHS.

748

Since all boxes are not MutableBoxes, this is probably not correct.

753–758

If you are pressed for time, I would recommend passing through only the types that are required and marking everything else as a TODO. Please consult the standard, gfortran testsuite, and your team for the types that are allowed for device_ptr, device_addr.

This revision is now accepted and ready to land.Jun 22 2023, 2:42 AM
TIFitis updated this revision to Diff 533559.Jun 22 2023, 4:43 AM
TIFitis marked 5 inline comments as done.

Addressed reviewer comments.

flang/lib/Lower/OpenMP.cpp
748

Thanks, I've updated this.

753–758

I have added the same checks for the use_device operands that we had added earlier for the map clause operands. Hope that's satisfactory for now.

AFAIK use_device clauses are more restrictive as they only allow pointers. I'll update this in the future as we add support for more types.