This is an archive of the discontinued LLVM Phabricator instance.

[flang][OpenMP] Support privatization for single construct
ClosedPublic

Authored by peixin on Jun 25 2022, 11:18 PM.

Details

Summary

This supports the lowering of private and firstprivate clauses in single
construct. The alloca ops are emitted in the entry block according to
https://llvm.org/docs/Frontend/PerformanceTips.html#use-of-allocas, and
the load/store ops are emitted in the single region. The data race
problem is handled in OMPIRBuilder. That is, the barrier is emitted in
OMPIRBuilder.

Co-authored-by: Nimish Mishra <neelam.nimish@gmail.com>

Diff Detail

Event Timeline

peixin created this revision.Jun 25 2022, 11:18 PM
peixin requested review of this revision.Jun 25 2022, 11:18 PM

I am preparing one patch to fix the same data-race problem in classic-flang. I noticed it seems the barrier is not necessary for task and single constructs. BTW, our test team found that the data race problem is more obvious when both firstprivate and lastprivate exist, even not only for pointer variables. Once we agree on a final decision on the data-race problem, I will submit one similar patch to classic-flang, too.

peixin planned changes to this revision.Jul 7 2022, 6:08 PM
domada added a subscriber: domada.Sep 13 2022, 4:02 AM

@peixin What is status of this patch? Are you planning to modify this patch or is it ready for the review?

@domada I will update this patch later this week.

peixin updated this revision to Diff 460962.Sep 16 2022, 8:41 PM
peixin edited the summary of this revision. (Show Details)

Rebase and address the data race problem for single construct.

domada added inline comments.Sep 19 2022, 1:37 AM
flang/test/Lower/OpenMP/single.f90
2

Why do you use bbc instead of flang-new?

peixin added inline comments.Sep 19 2022, 1:49 AM
flang/test/Lower/OpenMP/single.f90
2

I use both bbc and flang-new for lowering test. I also remove the fir-opt check since it should not be here.

@kiranchandramohan planned to change the lowering tests (removing the lowering to LLVMIR tests under flang/test/Lower/OpenMP) after upstreaming is done. @kiranchandramohan When do you plan to change them?

For me the patch is ok, but I think it would be better, if somebody else will also review it.

kiranchandramohan requested changes to this revision.Sep 19 2022, 2:52 PM

Thanks @peixin for this patch.

I have a couple of questions/concerns. See comments inline.

flang/lib/Lower/OpenMP.cpp
370

The barrier is created by the OpenMPIRBuilder. copyprivate's barrier can be handled separately I guess.
https://github.com/llvm/llvm-project/blob/29b37f319ac310e9c023d7c707ecfe1f709807ae/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp#L3352

flang/test/Lower/OpenMP/single.f90
2

Sorry about the delay here. It is on my list. I will get to it soon.

mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
249

The single is inlined usually, is the OpenMPIRBuilder behaviour different? We might want to change the name of the Interface if we really want the alloca to be inside the single region. Have you checked the Clang behaviour?

This revision now requires changes to proceed.Sep 19 2022, 2:52 PM
peixin updated this revision to Diff 462876.Sep 26 2022, 5:49 AM
peixin edited the summary of this revision. (Show Details)

Thanks @kiranchandramohan for pointing out the problem.

You are right. The barrier is emitted in OMPIRBuilder, and the copyprivate should be handled in OMPIRBuilder, too. Clang emits the private/firstprivate in CodeGen (https://github.com/llvm/llvm-project/blob/2188cf9fa4d012b3ce484f9e97b66581569c1157/clang/lib/CodeGen/CGStmtOpenMP.cpp#L4196-L4197) before and after single runtime calls. Also, check the following case:

void sub(float a, double b);
float x = 1.0;
double y = 2.0;
void test() {
  #pragma omp single private(x) firstprivate(y)
  {
    sub(x, y);
  }
}

The generated IR is as follows:

@x = dso_local global float 1.000000e+00, align 4
@y = dso_local global double 2.000000e+00, align 8
@0 = private unnamed_addr constant [23 x i8] c";unknown;unknown;0;0;;\00", align 1
@1 = private unnamed_addr constant %struct.ident_t { i32 0, i32 2, i32 0, i32 22, ptr @0 }, align 8
@2 = private unnamed_addr constant %struct.ident_t { i32 0, i32 322, i32 0, i32 22, ptr @0 }, align 8

; Function Attrs: noinline nounwind optnone uwtable
define dso_local void @test() #0 {
entry:
  %y = alloca double, align 8
  %x = alloca float, align 4
  %0 = call i32 @__kmpc_global_thread_num(ptr @1)
  %1 = call i32 @__kmpc_single(ptr @1, i32 %0)
  %2 = icmp ne i32 %1, 0
  br i1 %2, label %omp_if.then, label %omp_if.end

omp_if.then:                                      ; preds = %entry
  %3 = load double, ptr @y, align 8
  store double %3, ptr %y, align 8
  %4 = load float, ptr %x, align 4
  %5 = load double, ptr %y, align 8
  call void @sub(float noundef %4, double noundef %5)
  call void @__kmpc_end_single(ptr @1, i32 %0)
  br label %omp_if.end

omp_if.end:                                       ; preds = %omp_if.then, %entry
  call void @__kmpc_barrier(ptr @2, i32 %0)
  ret void
}

The private/firstprivate are both in single region. So, there is no need to do any additional work in flang lowering for it.

BTW, filed one ticket for one missed sema check for firstprivate clause. https://github.com/llvm/llvm-project/issues/57983

Clang emits the private/firstprivate in CodeGen (https://github.com/llvm/llvm-project/blob/2188cf9fa4d012b3ce484f9e97b66581569c1157/clang/lib/CodeGen/CGStmtOpenMP.cpp#L4196-L4197) before and after single runtime calls.

Is it OK to not add the Interface to the Single operation in this case? And keep the interface for operations that are outlined only and reinstating its original name OutlineableOpenMPOpInterface.

Clang emits the private/firstprivate in CodeGen (https://github.com/llvm/llvm-project/blob/2188cf9fa4d012b3ce484f9e97b66581569c1157/clang/lib/CodeGen/CGStmtOpenMP.cpp#L4196-L4197) before and after single runtime calls.

Is it OK to not add the Interface to the Single operation in this case? And keep the interface for operations that are outlined only and reinstating its original name OutlineableOpenMPOpInterface.

Of course. The interface is mainly used to get the alloca block currently. The following would work for the single construct.

diff --git a/flang/lib/Lower/OpenMP.cpp b/flang/lib/Lower/OpenMP.cpp
index e1cba68e1686..4f2824f2c591 100644
--- a/flang/lib/Lower/OpenMP.cpp
+++ b/flang/lib/Lower/OpenMP.cpp
@@ -203,6 +203,8 @@ static bool privatizeVars(Op &op, Fortran::lower::AbstractConverter &converter,
   bool needBarrier = false;
   if (mlir::isa<mlir::omp::SectionOp>(op))
     firOpBuilder.setInsertionPointToStart(&op.getRegion().back());
+  else if (mlir::isa<mlir::omp::SingleOp>(op))
+    firOpBuilder.setInsertionPointToStart(&op.getRegion().front());
   else
     firOpBuilder.setInsertionPointToStart(firOpBuilder.getAllocaBlock());
   for (auto sym : privatizedSymbols) {

Do you prefer this solution?

Clang emits the private/firstprivate in CodeGen (https://github.com/llvm/llvm-project/blob/2188cf9fa4d012b3ce484f9e97b66581569c1157/clang/lib/CodeGen/CGStmtOpenMP.cpp#L4196-L4197) before and after single runtime calls.

Is it OK to not add the Interface to the Single operation in this case? And keep the interface for operations that are outlined only and reinstating its original name OutlineableOpenMPOpInterface.

Of course. The interface is mainly used to get the alloca block currently. The following would work for the single construct.

diff --git a/flang/lib/Lower/OpenMP.cpp b/flang/lib/Lower/OpenMP.cpp
index e1cba68e1686..4f2824f2c591 100644
--- a/flang/lib/Lower/OpenMP.cpp
+++ b/flang/lib/Lower/OpenMP.cpp
@@ -203,6 +203,8 @@ static bool privatizeVars(Op &op, Fortran::lower::AbstractConverter &converter,
   bool needBarrier = false;
   if (mlir::isa<mlir::omp::SectionOp>(op))
     firOpBuilder.setInsertionPointToStart(&op.getRegion().back());
+  else if (mlir::isa<mlir::omp::SingleOp>(op))
+    firOpBuilder.setInsertionPointToStart(&op.getRegion().front());
   else
     firOpBuilder.setInsertionPointToStart(firOpBuilder.getAllocaBlock());
   for (auto sym : privatizedSymbols) {

Do you prefer this solution?

Why wouldn't getAllocaBlock work by itself? From our understanding of Clang it is not required for the private allocas to be inside the runtime calls, so getAllocaBlock will hoist it to the nearest region that will be outlined. Isn't that sufficient?

Why wouldn't getAllocaBlock work by itself? From our understanding of Clang it is not required for the private allocas to be inside the runtime calls, so getAllocaBlock will hoist it to the nearest region that will be outlined. Isn't that sufficient?

firOpBuilder.getAllocaBlock() is to call getEntryBlock() if it is not OutlineableOpenMPOpInterface, which is to call getFunction().front(). For the following case, it will hoist the private/firstprivate operations outside the parallel region, which is wrong.

subroutine single_private(x, y)
  real :: x
  real(8) :: y

  !$omp parallel
  !$omp single private(x) firstprivate(y)
  call bar(x, y)
  !$omp end single
  !$omp end parallel
end subroutine single_private

Why wouldn't getAllocaBlock work by itself? From our understanding of Clang it is not required for the private allocas to be inside the runtime calls, so getAllocaBlock will hoist it to the nearest region that will be outlined. Isn't that sufficient?

firOpBuilder.getAllocaBlock() is to call getEntryBlock() if it is not OutlineableOpenMPOpInterface, which is to call getFunction().front(). For the following case, it will hoist the private/firstprivate operations outside the parallel region, which is wrong.

subroutine single_private(x, y)
  real :: x
  real(8) :: y

  !$omp parallel
  !$omp single private(x) firstprivate(y)
  call bar(x, y)
  !$omp end single
  !$omp end parallel
end subroutine single_private

The code (as given below) tries to get a parent of type OutlineableOpenMPOpInterface which the parallel operation has. If this is not happening then there is a bug.

/// Get the block for adding Allocas.
mlir::Block *fir::FirOpBuilder::getAllocaBlock() {
  auto iface =
      getRegion().getParentOfType<mlir::omp::OutlineableOpenMPOpInterface>();
  return iface ? iface.getAllocaBlock() : getEntryBlock();
}

Why wouldn't getAllocaBlock work by itself? From our understanding of Clang it is not required for the private allocas to be inside the runtime calls, so getAllocaBlock will hoist it to the nearest region that will be outlined. Isn't that sufficient?

firOpBuilder.getAllocaBlock() is to call getEntryBlock() if it is not OutlineableOpenMPOpInterface, which is to call getFunction().front(). For the following case, it will hoist the private/firstprivate operations outside the parallel region, which is wrong.

subroutine single_private(x, y)
  real :: x
  real(8) :: y

  !$omp parallel
  !$omp single private(x) firstprivate(y)
  call bar(x, y)
  !$omp end single
  !$omp end parallel
end subroutine single_private

The code (as given below) tries to get a parent of type OutlineableOpenMPOpInterface which the parallel operation has. If this is not happening then there is a bug.

/// Get the block for adding Allocas.
mlir::Block *fir::FirOpBuilder::getAllocaBlock() {
  auto iface =
      getRegion().getParentOfType<mlir::omp::OutlineableOpenMPOpInterface>();
  return iface ? iface.getAllocaBlock() : getEntryBlock();
}

Right, that case works OK. For the following case, the private clause works ok using firOpBuilder.getAllocaBlock(). However, the implementation of firstprivate clause is not reasonable. The alloca ops can be hoisted outside the single region, but the store op for firstprivate clause should be in single region. Clang put all load and store ops in the single region. There is no execution error if put all alloca, load, and store ops outside single region. However, it has worse performance. Actually, moving alloca inside single region will result in less memory usage since it will only allocate one copy of variable for the single thread instead of making the copy for each thread.

 subroutine single_private(x, y)
  real :: x
  real(8) :: y

  !$omp single private(x) firstprivate(y)
  call bar(x, y)
  !$omp end single
end subroutine single_private

Also for the wsloop, the generated IR seems to be not OK, either.

subroutine firstprivate_complex2(arg1, arg2)
        complex(4) :: arg1
        complex(8) :: arg2

!$OMP DO FIRSTPRIVATE(arg1, arg2)
do i = 1,10
        call foo(arg1, arg2)
enddo
!$OMP end DO

end subroutine
func.func @_QPfirstprivate_complex2(%arg0: !fir.ref<!fir.complex<4>> {fir.bindc_name = "arg1"}, %arg1: !fir.ref<!fir.complex<8>> {fir.bindc_name = "arg2"}) {
  %0 = fir.alloca !fir.complex<4> {bindc_name = "arg1", pinned, uniq_name = "_QFfirstprivate_complex2Earg1"}
  %1 = fir.load %arg0 : !fir.ref<!fir.complex<4>>
  fir.store %1 to %0 : !fir.ref<!fir.complex<4>>
  %2 = fir.alloca !fir.complex<8> {bindc_name = "arg2", pinned, uniq_name = "_QFfirstprivate_complex2Earg2"}
  %3 = fir.load %arg1 : !fir.ref<!fir.complex<8>>
  fir.store %3 to %2 : !fir.ref<!fir.complex<8>>
  %4 = fir.alloca i32 {adapt.valuebyref}
  %5 = fir.alloca i32 {bindc_name = "i", uniq_name = "_QFfirstprivate_complex2Ei"}
  %c1_i32 = arith.constant 1 : i32
  %c10_i32 = arith.constant 10 : i32
  %c1_i32_0 = arith.constant 1 : i32
  omp.wsloop   for  (%arg2) : i32 = (%c1_i32) to (%c10_i32) inclusive step (%c1_i32_0) {
    fir.store %arg2 to %4 : !fir.ref<i32>
    fir.call @_QPfoo(%0, %2) : (!fir.ref<!fir.complex<4>>, !fir.ref<!fir.complex<8>>) -> ()
    omp.yield
  }
  return
}

For some loop indexes, there is no need to allocate the memory and perform the firstprivate operations. For example, running do i = 1, 10 using 8 threads in the second round will only need 2 threads. Why do we do it (private/firstprivate) for each thread? The privatization should be after __kmpc_for_static_init and before __kmpc_for_static_fini if in static schedule.

Why wouldn't getAllocaBlock work by itself? From our understanding of Clang it is not required for the private allocas to be inside the runtime calls, so getAllocaBlock will hoist it to the nearest region that will be outlined. Isn't that sufficient?

firOpBuilder.getAllocaBlock() is to call getEntryBlock() if it is not OutlineableOpenMPOpInterface, which is to call getFunction().front(). For the following case, it will hoist the private/firstprivate operations outside the parallel region, which is wrong.

subroutine single_private(x, y)
  real :: x
  real(8) :: y

  !$omp parallel
  !$omp single private(x) firstprivate(y)
  call bar(x, y)
  !$omp end single
  !$omp end parallel
end subroutine single_private

The code (as given below) tries to get a parent of type OutlineableOpenMPOpInterface which the parallel operation has. If this is not happening then there is a bug.

/// Get the block for adding Allocas.
mlir::Block *fir::FirOpBuilder::getAllocaBlock() {
  auto iface =
      getRegion().getParentOfType<mlir::omp::OutlineableOpenMPOpInterface>();
  return iface ? iface.getAllocaBlock() : getEntryBlock();
}

Right, that case works OK. For the following case, the private clause works ok using firOpBuilder.getAllocaBlock(). However, the implementation of firstprivate clause is not reasonable. The alloca ops can be hoisted outside the single region, but the store op for firstprivate clause should be in single region. Clang put all load and store ops in the single region. There is no execution error if put all alloca, load, and store ops outside single region. However, it has worse performance. Actually, moving alloca inside single region will result in less memory usage since it will only allocate one copy of variable for the single thread instead of making the copy for each thread.

 subroutine single_private(x, y)
  real :: x
  real(8) :: y

  !$omp single private(x) firstprivate(y)
  call bar(x, y)
  !$omp end single
end subroutine single_private

Also for the wsloop, the generated IR seems to be not OK, either.

subroutine firstprivate_complex2(arg1, arg2)
        complex(4) :: arg1
        complex(8) :: arg2

!$OMP DO FIRSTPRIVATE(arg1, arg2)
do i = 1,10
        call foo(arg1, arg2)
enddo
!$OMP end DO

end subroutine
func.func @_QPfirstprivate_complex2(%arg0: !fir.ref<!fir.complex<4>> {fir.bindc_name = "arg1"}, %arg1: !fir.ref<!fir.complex<8>> {fir.bindc_name = "arg2"}) {
  %0 = fir.alloca !fir.complex<4> {bindc_name = "arg1", pinned, uniq_name = "_QFfirstprivate_complex2Earg1"}
  %1 = fir.load %arg0 : !fir.ref<!fir.complex<4>>
  fir.store %1 to %0 : !fir.ref<!fir.complex<4>>
  %2 = fir.alloca !fir.complex<8> {bindc_name = "arg2", pinned, uniq_name = "_QFfirstprivate_complex2Earg2"}
  %3 = fir.load %arg1 : !fir.ref<!fir.complex<8>>
  fir.store %3 to %2 : !fir.ref<!fir.complex<8>>
  %4 = fir.alloca i32 {adapt.valuebyref}
  %5 = fir.alloca i32 {bindc_name = "i", uniq_name = "_QFfirstprivate_complex2Ei"}
  %c1_i32 = arith.constant 1 : i32
  %c10_i32 = arith.constant 10 : i32
  %c1_i32_0 = arith.constant 1 : i32
  omp.wsloop   for  (%arg2) : i32 = (%c1_i32) to (%c10_i32) inclusive step (%c1_i32_0) {
    fir.store %arg2 to %4 : !fir.ref<i32>
    fir.call @_QPfoo(%0, %2) : (!fir.ref<!fir.complex<4>>, !fir.ref<!fir.complex<8>>) -> ()
    omp.yield
  }
  return
}

For some loop indexes, there is no need to allocate the memory and perform the firstprivate operations. For example, running do i = 1, 10 using 8 threads in the second round will only need 2 threads. Why do we do it (private/firstprivate) for each thread? The privatization should be after __kmpc_for_static_init and before __kmpc_for_static_fini if in static schedule.

I think, in general, LLVM recommends for performance reasons to put allocas in the entry block. See https://llvm.org/docs/Frontend/PerformanceTips.html#use-of-allocas.

Yes, there will be some over-allocation. But also note that if we did not hoist and there was a loop outside the single construct then this will lead to run-away allocations without a stacksave and stackrestore.

The case of a loop is a little more complicated, since we need to place the allocas for loops in the header of the loop which is not available in the loop form, it is only available when we break the loop into control-flow form. So it is not possible to do it in the high-level MLIR.

These issues will probably be resolved when we move privatisation clauses to the OpenMP dialect and handle it in the OpenMPIRBuilder.

The alloca ops can be hoisted outside the single region, but the store op for firstprivate clause should be in single region.

Yes, this is important. Otherwise, any updates to the value will not be available. I don't remember whether we fixed the issue or whether it is still present. It was captured in https://github.com/flang-compiler/f18-llvm-project/issues/1171#issuecomment-1119997442

peixin updated this revision to Diff 463567.Sep 28 2022, 8:01 AM
peixin edited the summary of this revision. (Show Details)

I think, in general, LLVM recommends for performance reasons to put allocas in the entry block. See https://llvm.org/docs/Frontend/PerformanceTips.html#use-of-allocas.

Yes, there will be some over-allocation. But also note that if we did not hoist and there was a loop outside the single construct then this will lead to run-away allocations without a stacksave and stackrestore.

The case of a loop is a little more complicated, since we need to place the allocas for loops in the header of the loop which is not available in the loop form, it is only available when we break the loop into control-flow form. So it is not possible to do it in the high-level MLIR.

These issues will probably be resolved when we move privatisation clauses to the OpenMP dialect and handle it in the OpenMPIRBuilder.

The alloca ops can be hoisted outside the single region, but the store op for firstprivate clause should be in single region.

Yes, this is important. Otherwise, any updates to the value will not be available. I don't remember whether we fixed the issue or whether it is still present. It was captured in https://github.com/flang-compiler/f18-llvm-project/issues/1171#issuecomment-1119997442

I see. Thanks for the explanations. Borrowed some code from D133686 (add @NimishMishra as co-author) for changing the insertion point. Emit the alloca ops at the beginning of the entry block and load/store ops in the single region.

The issue in wsloop hasn't been fixed. It may need to be fixed until the privatization is done in MLIR.

This revision is now accepted and ready to land.Oct 5 2022, 3:31 AM
This revision was automatically updated to reflect the committed changes.