This is an archive of the discontinued LLVM Phabricator instance.

[flang][OpenMP] Lowering support for atomic capture
ClosedPublic

Authored by NimishMishra on Jun 8 2022, 12:56 AM.

Details

Summary

This patch adds lowering support for atomic capture operation. First is created a region (without any operand) for the atomic capture operation. Then based on one of the following configurations...

  1. [update-stmt, capture-stmt]
  2. [capture-stmt, update-stmt]
  3. [capture-stmt, write-stmt]

... the lowering proceeds by creating these individual operations inside the atomic capture's region.

Diff Detail

Event Timeline

NimishMishra created this revision.Jun 8 2022, 12:56 AM
NimishMishra requested review of this revision.Jun 8 2022, 12:56 AM

This follows after D125668.

TODO: handling pointers in atomic capture construct

Thanks for working on this. I have a few comments on generated IR and changes in OpenMPDialect.cpp. Please excuse the formatting issues, if any as I'm commenting this from a mobile :')

flang/lib/Lower/OpenMP.cpp
1071

Shouldn't this be done in semantics?

flang/test/Lower/OpenMP/atomic-capture.f90
35

Shouldn't all of this (34-41) be wrapped in an omp.atomic.read and omp.atomic.update operation? Why can't we generate that here? Relaxing the restrictions on omp.capture is not a solution for this when it's possible to express this same thing in current syntax.

mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
936

Is this move required?

941

Can you please add a testcase for this in mlir/test/Dialect/OpenMP/ops.mlir? I'm having trouble understanding why this is required.

NimishMishra added inline comments.Jun 9 2022, 9:33 PM
flang/lib/Lower/OpenMP.cpp
1071

This is the best approach I could think of in order to understand which capture construct combination to lower to. So I put basic "structural" checks for v=x and x=x op expr statements. There are more semantic checks attached with these. I assume we rely on the semantics phase to take care of them.

These helper functions here are only doing structural checks.

flang/test/Lower/OpenMP/atomic-capture.f90
35

I do not understand. What do you mean by "wrapped" in a read and write operation?

Generally, the read operation is not problematic. In the write operation however, the FIR for expression evaluation is involved. I was attempting to control insertion points to make this expression evaluation "outside" the omp.atomic.capture, but I couldn't do it.

However, I am not too sure if it is wrong to put the FIR related to expression evaluation within the capture block (alongside the omp.atomic.write). Please correct me if I am missing something here.

mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
936

No. It's unintentional. I will revert it.

shraiysh added inline comments.Jun 9 2022, 10:20 PM
flang/test/Lower/OpenMP/atomic-capture.f90
34–43

If you can generate this, then that's the most accurate imo.

This does not violate the semantics of atomic construct because it clearly says that -

Only the read and write of the location designated by x are performed mutually atomically. Neither the evaluation of expr or expr_list, nor the write to the location designated by v, need be atomic with respect to the read or write of the location designated by x.

36–42

I do not understand. What do you mean by "wrapped" in a read and write operation?

This is what I meant.

However, I am not too sure if it is wrong to put the FIR related to expression evaluation within the capture block (alongside the omp.atomic.write).

Ideally, the evaluation of the expression should not be inside the atomic region at all. However, if that's somehow not possible, it should be pushed inside an omp.atomic.update expression because omp.atomic.update operation supports multiple operations in it's region. If that cannot be done too, we should justify adding this relaxation to omp.atomic.capture independent of Flang by answering "what kind of capture executions cannot be expressed in MLIR with the current syntax?". As long as they can be expressed, it should be the job of flang to lower appropriately.

NimishMishra added inline comments.Jun 9 2022, 10:38 PM
flang/test/Lower/OpenMP/atomic-capture.f90
36–42

Okay. Then I will work on moving the evaluation of the expression outside the atomic region. I think I have a way to do that.

shraiysh added inline comments.Jun 9 2022, 11:03 PM
flang/lib/Lower/OpenMP.cpp
1113–1114

This should not always be true. If the function call on rhs does not use the variable on the lhs, then this is an atomic write statement.

Modeling it as an update sort of works, in the sense that there is no visible change in behavior of the program, but the generated IR will not be entirely accurate. If it is very hard to deduce write here, maybe mention it as a todo?

1115–1117

I meant can something like this work? Relying on semantics for a valid binary operator.

NimishMishra added inline comments.Jun 9 2022, 11:09 PM
flang/lib/Lower/OpenMP.cpp
1113–1114

You are right. I missed it. I will revisit these functions and see what can be done.

Improved design of the solution

NimishMishra added inline comments.Aug 1 2022, 4:55 AM
flang/test/Lower/OpenMP/atomic-capture.f90
82

I had a discussion to make at this point. The verifier for atomic capture checks if opsInRegion.size() == 3 i.e. if the number of operations in a region were 3.

This issue came up before during lowering of omp.atomic.write inside atomic capture, since a write statement has expression evaluation which also takes up space. @shraiysh suggested to keep this expression evaluation outside the omp.atomic.capture, since is what I have done currently.

However, with pointers, the issue has resurfaced. This particular b = a lowers as

%0 = allocate c
%1 = allocate d
%2 = allocate a
%3 = allocate b

omp.atomic.capture{
   omp.atomic.update{......}

 %4 = load a
  % 5 = load b
  omp.atomic.read %5 = %4

 omp.terminator
}

The discussion I wish to have is whether the verification method here should be changed, or should we evaluate all LHS and RHS of an atomic assignment statement beforehand. To change the verification, I was thinking like the following:

  1. Ensure in a list of operations, there are exactly two omp.atomic operations and exactly one omp.terminator operations
  1. Last operation in the region is omp.terminator
  1. Second last operation in the region is necessarily a omp.atomic operation

Ping for review!

Apologies for the delay in review. Please let me know what you think. I will join the next OpenMP call and we can maybe discuss this.

flang/test/Lower/OpenMP/atomic-capture.f90
82

The generated code should look like the following -

// %a_ptr : !fir.ref<!fir.ref<i32>> %b_ptr = !fir.ref<!fir.ref<i32>>
%a_addr = load %a_ptr : !fir.ref<!fir.ref<i32>>
%b_addr = load %b_ptr : !fir.ref<!fir.ref<i32>>
// %a_addr : !fir.ref<i32>, %b_addr : !fir.ref<i32>
omp.atomic.capture{
  omp.atomic.update %a_addr : !fir.ref<i32> {
  ^bb0(%a_val: i32):
    %b_val = load %b_addr : !fir.ref<i32>
    %temp = arith.addi %a_val, %b_val : i32
    omp.yield %temp
  }
  omp.atomic.read %b_addr = %a_addr
  omp.terminator
}

I get the sense that this is convoluted on flang's end to generate it. It is however not a good idea to relax the constraints on omp.capture because of this. I will reiterate that if we come up with an OpenMP atomic evaluation that cannot be expressed by hand with the omp.atomic.capture operation, then we should definitely change it. Just because flang isn't able to generate it - this isn't a good enough reason to alter the operation. This, and the fact that if we change the number of operations inside atomic capture, we have to worry about lowering it to LLVM IR - which will be harder as the operation relaxes. If you strongly need to relax the constraints on omp.atomic.capture, we should first make sure that the relaxed version translates properly to LLVM IR for execution (probably as a separate patch).

I wanted to put an idea out - maybe to ease the difficulty of generation of omp.atomic.capture, we can define an fir.omp.atomic.capture operation that accepts multiple operations under it. Then during canonicalization (or some other pass) in FIR we can push the unnecessary operations (load a, load b in your example) outside the fir.omp.atomic.capture operation to generate omp.atomic.capture operation. Does that sound like it would make the implementation more straightforward?

NimishMishra added inline comments.Aug 30 2022, 6:52 PM
flang/test/Lower/OpenMP/atomic-capture.f90
82

Ok. I will try to keep loading of the two addresses outside the capture region entirely.

Changed design of the patch to evaluate LHS and RHS of the two assignment statements before generating the omp.atomic.capture operation.

flang/test/Lower/OpenMP/atomic-capture.f90
82

@shraiysh Does this IR look ok?

shraiysh added inline comments.Aug 31 2022, 9:19 PM
flang/test/Lower/OpenMP/atomic-capture.f90
82

Yes, the current testcases look perfect to me and I cannot spot any errors in them. Thanks for the patience and to get it to work 👏 .

I have not reviewed the code itself, but functionality wise, it looks okay to me. Please feel free to go ahead without my approval for the code as it might be sometime before I get time to review the code. If because of some review comments you happen to change the testcases before landing this, then let me know and I will review the updated testcases as soon as I can.

The following test case fails.

program main
  implicit none
  integer, parameter :: n1 = 10
  integer, parameter :: n2 = 100
  integer, parameter :: n = 30
  integer :: idx(n2)
  integer :: i
  integer(1) :: xi1(n1), yi1(n1), zi1(n1), oi1(n1), pi1(n1), qi1(n1), expecti1(n1)
  integer(8) :: xi8(n1), yi8(n1), zi8(n1), oi8(n1), pi8(n1), qi8(n1), expecti8(n1)
  logical :: rst(n) = .false.

  do i = 1, n2
    idx(i) = mod(i, n1) + 1
  end do
  ! add integer(1)
  xi1 = 0
  yi1 = 0
  zi1 = 0
  expecti1 = [38, -52, -42, -32, -22, -12, -2, 8, 18, 28]
  call atomic_capture_addi1(xi1, yi1, zi1, oi1, pi1, qi1, idx, n2)
  rst(1:10) = xi1 .eq. expecti1
  rst(11:20) = yi1 .eq. expecti1
  rst(21:30) = zi1 .eq. expecti1
  if (any(rst .neqv. .true.)) STOP 1
  print *, "PASS"

contains
  integer(1) function fi1(i)
    integer :: i
    fi1 = i
  end function
  integer(8) function fi8(i)
    integer :: i
    fi8 = i
  end function
  subroutine atomic_capture_addi1(x, y, z, o, p, q, idx, n)
    integer(1) :: x(*), y(*), z(*), o(*), p(*), q(*)
    integer :: idx(*)
    integer :: i, n
    !$omp parallel do shared(x, y, z, o, p, q, idx, n)
    do i = 1, n
      !$omp atomic capture
      x(idx(i)) = x(idx(i)) + fi1(i)
      o(idx(i)) = x(idx(i))
      !$omp end atomic
      !$omp atomic capture
      y(idx(i)) = fi1(i) + y(idx(i))
      p(idx(i)) = y(idx(i))
      !$omp end atomic
      !$omp atomic capture
      z(idx(i)) = z(idx(i)) + fi8(i)
      q(idx(i)) = z(idx(i))
      !$omp end atomic
    end do
  end subroutine
end program main
$ gfortran -fopenmp test.f90 && ./a.out
 PASS
$ flang-new -flang-experimental-exec -fopenmp test.f90 
flang-new: /home/qpx/compilers/llvm-community/omp-dev/llvm-project/flang/lib/Lower/OpenMP.cpp:1591: void genOmpAtomicUpdateStatement(Fortran::lower::AbstractConverter&, Fortran::lower::pft::Evaluation&, mlir::Value, mlir::Type, const Fortran::parser::Variable&, const Fortran::parser::Expr&, const Fortran::parser::OmpAtomicClauseList*, const Fortran::parser::OmpAtomicClauseList*): Assertion `name && name->symbol && "No symbol attached to atomic update variable"' failed.
peixin requested changes to this revision.Sep 29 2022, 6:31 PM
This revision now requires changes to proceed.Sep 29 2022, 6:31 PM

Added a fix to handle Array Refs in atomic update constructs

kiranchandramohan requested changes to this revision.Feb 19 2023, 11:15 AM

See comment inline about the latest change.

Since we do not have a mechanism for handling array element, I think you can add a hard TODO for the array element case. We can move ahead with the rest of the patch. File a github issue for the array element case.

flang/test/Lower/OpenMP/atomic-capture.f90
144–154

I believe the semantics of atomic.update is that it will load the address that is provided to it and that will be available as the basic block argument ARG. The body of the update will use this loaded value ARG to perform the update and yield the updated value which will be stored again at the address.

I think this update operation will not work as expected since it is not using the automatically loaded value in ARG and it is loading the value at the address passed to the atomic.update op and adds the constant to it. This will end up computing which is not what we want.
numbers(1) = numbers(1) + numbers(1) + 10.

Let me know if I missed a point.

This revision now requires changes to proceed.Feb 19 2023, 11:15 AM
flang/test/Lower/OpenMP/atomic-capture.f90
144–154

I think I got that wrong. It will not double add since the new code is not touching the ARG.

I tried fetching the patch but it shows some issues. Could you rebase? I would like to have a look at the IR that is generated.

Particularly, we have to check,

  1. whether the following load is an atomic load in the LLVM IR.
!CHECK: %[[array_element_inner:.*]] = fir.load %[[array_element_ref]]
  1. whether the loop containing cmpxchg is well-formed.
NimishMishra added inline comments.Mar 9 2023, 6:16 AM
flang/test/Lower/OpenMP/atomic-capture.f90
144–154

Hi Kiran.

I was trying to understand the requirement here. I do not completely understand why cmpxchg could be a problem here? Wouldn't the compare and the exchange happen inside the atomic region.

I mean I am just trying to understand how the generated IR should look like, to make sure I am doing it correctly.

Rebased with main. Fixed generation of omp.atomic.read

NimishMishra added inline comments.Mar 22 2023, 8:35 PM
flang/test/Lower/OpenMP/atomic-capture.f90
144–154

Hi Kiran.

The LLVM IR is as follows. Does it look ok? The load is within the Atomic Update block, so it should be fine right?

llvm.func @_QParray_refs() {                                                                                                                 
  %0 = llvm.mlir.constant(1.000000e+01 : f32) : f32                                                                                         
  %1 = llvm.mlir.constant(1 : i64) : i64                                                                                                     
  %2 = llvm.alloca %1 x !llvm.array<5 x f32> {bindc_name = "numbers", in_type = !fir.array<5xf32>, operand_segment_sizes = array<i32: 0, 0>,
uniq_name = "_QFarray_refsEnumbers"} : (i64) -> !llvm.ptr<array<5 x f32>>                                                                   
  %3 = llvm.alloca %1 x f32 {bindc_name = "x", in_type = f32, operand_segment_sizes = array<i32: 0, 0>, uniq_name = "_QFarray_refsEx"} : (i64
) -> !llvm.ptr<f32>                                                                                                                         
  %4 = llvm.getelementptr %2[0, 0] : (!llvm.ptr<array<5 x f32>>) -> !llvm.ptr<f32>                                                           
  omp.atomic.capture   {                                                                                                                     
    omp.atomic.update   %4 : !llvm.ptr<f32> {                                                                                               
    ^bb0(%arg0: f32):                                                                                                                       
      %5 = llvm.load %4 : !llvm.ptr<f32>                                                                                                     
      %6 = llvm.fadd %5, %0  {fastmathFlags = #llvm.fastmath<contract>} : f32
      omp.yield(%6 : f32)                                             
    }                                                                 
    omp.atomic.read %3 = %4   : !llvm.ptr<f32>, f32
  }         
  llvm.return
NimishMishra added inline comments.Mar 22 2023, 11:23 PM
flang/test/Lower/OpenMP/atomic-capture.f90
144–154

Apologies. That is the MLIR dialect. Please find the LLVM IR below. It also has a cmpxchg instruction

The test case is simple:

!$omp atomic capture
      x = y
      y = x + y
!$omp end capture
; ModuleID = 'FIRModule'
source_filename = "FIRModule"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

declare ptr @malloc(i64)

declare void @free(ptr)

define void @_QPtest() {
  %x.new.val = alloca float, align 4
  %1 = alloca float, i64 1, align 4
  %2 = alloca float, i64 1, align 4
  store float 2.000000e+01, ptr %1, align 4
  store float 1.000000e+01, ptr %2, align 4
  br label %entry

entry:                                            ; preds = %0
  %.atomic.load = load atomic i32, ptr %2 monotonic, align 4
  br label %.atomic.cont

.atomic.cont:                                     ; preds = %.atomic.cont, %entry
  %3 = phi i32 [ %.atomic.load, %entry ], [ %8, %.atomic.cont ]
  %.atomic.fltCast = bitcast i32 %3 to float
  %4 = load float, ptr %1, align 4
  %5 = fadd contract float %4, %.atomic.fltCast
  store float %5, ptr %x.new.val, align 4
  %6 = load i32, ptr %x.new.val, align 4
  %7 = cmpxchg ptr %2, i32 %3, i32 %6 monotonic monotonic, align 4
  %8 = extractvalue { i32, i1 } %7, 0
  %9 = extractvalue { i32, i1 } %7, 1
  br i1 %9, label %.atomic.exit, label %.atomic.cont

.atomic.exit:                                     ; preds = %.atomic.cont
  store float %.atomic.fltCast, ptr %1, align 4
  ret void
}

!llvm.module.flags = !{!0}

!0 = !{i32 2, !"Debug Info Version", i32 3}
NimishMishra added inline comments.Mar 22 2023, 11:28 PM
flang/test/Lower/OpenMP/atomic-capture.f90
144–154

And for the array reference test case

!$omp atomic capture
      x(1) = y(1)
      y(1) = x(1) + y(1)
!$omp end capture

The following IR is generated:

; ModuleID = 'FIRModule'
source_filename = "FIRModule"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

declare ptr @malloc(i64)

declare void @free(ptr)

define void @_QPtest() {
  %x.new.val = alloca float, align 4
  %1 = alloca float, i64 1, align 4
  %2 = alloca [5 x float], i64 1, align 4
  %3 = alloca [5 x float], i64 1, align 4
  %4 = getelementptr [5 x float], ptr %2, i32 0, i32 0
  store float 2.000000e+01, ptr %4, align 4
  %5 = getelementptr [5 x float], ptr %3, i32 0, i32 0
  store float 1.000000e+01, ptr %5, align 4
  %6 = load float, ptr %4, align 4
  %7 = load float, ptr %5, align 4
  %8 = fadd contract float %6, %7
  store float %8, ptr %1, align 4
  br label %entry

entry:                                            ; preds = %0
  %.atomic.load = load atomic i32, ptr %4 monotonic, align 4
  br label %.atomic.cont

.atomic.cont:                                     ; preds = %.atomic.cont, %entry
  %9 = phi i32 [ %.atomic.load, %entry ], [ %13, %.atomic.cont ]
  %.atomic.fltCast = bitcast i32 %9 to float
  %10 = load float, ptr %5, align 4
  store float %10, ptr %x.new.val, align 4
  %11 = load i32, ptr %x.new.val, align 4
  %12 = cmpxchg ptr %4, i32 %9, i32 %11 monotonic monotonic, align 4
  %13 = extractvalue { i32, i1 } %12, 0
  %14 = extractvalue { i32, i1 } %12, 1
  br i1 %14, label %.atomic.exit, label %.atomic.cont

.atomic.exit:                                     ; preds = %.atomic.cont
  store float %10, ptr %5, align 4
  ret void
}

!llvm.module.flags = !{!0}

!0 = !{i32 2, !"Debug Info Version", i32 3}
flang/test/Lower/OpenMP/atomic-capture.f90
144–154

The concern here is that the atomically loaded value is not used in the update operation.

AFAIU, the cmpxchg instruction only updates the location if the value that is currently at that location equals the value that is used for the update. So, if we are doing y = y + x. An initial atomic load is made of y (=y_old) and it is added with the value at x to obtain the value y_old + x. Before storing this value at y, it is checked that the current resident value at y is equal to y_old. The problem here (for the array-element case) is that the value to be used for updating is not obtained using the atomically loaded value y_old, but it is using a different value and that does not seem correct. Also, the update operation (addition here) has to be inside the loop since it addition should be performed on the atomically loaded value.

NimishMishra added inline comments.Mar 29 2023, 7:32 PM
flang/test/Lower/OpenMP/atomic-capture.f90
144–154

I understand now. Thank you.

The patch can not go ahead in its current form then. Do you any suggestions on how to go forward with fixing it then? Johannes did mention an alternative strategy some time back, but I am not sure how to start on that direction. Can you give some initial direction?

Please remove the handling of the array element case and remove the test for it. We can handle array elements in a separate patch.

I believe the rest of the code looks good. Thanks for the patience and the changes.

flang/lib/Lower/OpenMP.cpp
1017–1018

For the array element case, please add a Not Yet Implemented TODO. We can handle this separately.

This revision was not accepted when it landed; it landed in state Needs Review.May 3 2023, 9:48 PM
This revision was automatically updated to reflect the committed changes.