This is an archive of the discontinued LLVM Phabricator instance.

[OMPIRBuilder][MLIR] Support ordered clause specified with parameter
AbandonedPublic

Authored by peixin on Dec 26 2021, 6:26 PM.

Details

Summary

When the ordered clause is specified with parameter n, the n outermost loops form a
doacross loop nest. Add applyDoacrossLoop to emit the doacross loop
"init" and "fini" runtime calls in the OpenMP IRBuilder. Add one virtual
clause to the WsLoop MLIR op to store the doacross loop bounds info.

In addition, move the barrier runtime call to the front of the "after" basic
block, and set the insertion point at the end of the "after" basic block.
With this change, lowering to LLVM IR is supported when a dynamic schedule
is specified and the collapse value is greater than 1. Also add a test
case.

Diff Detail

Unit Tests: Failed

Event Timeline

peixin created this revision. Dec 26 2021, 6:26 PM
peixin requested review of this revision. Dec 26 2021, 6:26 PM
peixin updated this revision to Diff 396260. Dec 26 2021, 8:09 PM

Fix failed clang test cases.

To make the review work easier, I would like to give a brief explanation of the design of the OMPIRBuilder support for the ordered clause with a parameter. First of all, the ordered clause cannot work alone to make the code region execute in order; the ordered clause and the ordered construct must cooperate to make the code region execute in order. For an ordered clause specified with a parameter, the outer n (the parameter) loops form the doacross loop nest, and the OpenMP runtime function kmpc_doacross_init is generated to initialize the loop bounds info of the doacross loop nest. For an ordered construct with a depend clause, it posts/waits on the corresponding iteration according to the index specified in the ordered depend directive.
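For reference, below is a rough, source-level sketch of the runtime call sequence this corresponds to for an ordered(1) worksharing-loop. The entry-point declarations follow the LLVM OpenMP runtime (kmp.h, simplified here); the surrounding loop is only illustrative and is not code from this patch.

#include <cstdint>

// Entry points and the per-dimension bounds descriptor as declared in the
// LLVM OpenMP runtime; ident_t (source location info) is left opaque here.
struct ident_t;
struct kmp_dim { int64_t lo, up, st; };
extern "C" void __kmpc_doacross_init(ident_t *, int32_t gtid, int32_t num_dims,
                                     const kmp_dim *dims);
extern "C" void __kmpc_doacross_wait(ident_t *, int32_t gtid, const int64_t *vec);
extern "C" void __kmpc_doacross_post(ident_t *, int32_t gtid, const int64_t *vec);
extern "C" void __kmpc_doacross_fini(ident_t *, int32_t gtid);

// Illustrative shape of an ordered(1) loop; in reality the iterations are
// handed out by the worksharing-loop scheduler.
void doacross_example(ident_t *loc, int32_t gtid, int64_t lb, int64_t ub,
                      int64_t st) {
  kmp_dim dims[1] = {{lb, ub, st}};
  __kmpc_doacross_init(loc, gtid, /*num_dims=*/1, dims);
  for (int64_t i = lb; i <= ub; i += st) {
    int64_t vec[1] = {i - 1};
    __kmpc_doacross_wait(loc, gtid, vec);  // !$omp ordered depend(sink: i-1)
    // ... loop body ...
    vec[0] = i;
    __kmpc_doacross_post(loc, gtid, vec);  // !$omp ordered depend(source)
  }
  __kmpc_doacross_fini(loc, gtid);
}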

Clang transforms the doacross loop nest into a new one with a lower bound of 0 and a step of 1. However, this is really not necessary. The OpenMP runtime can handle the doacross loop nest regardless of whether the step is positive or negative (https://github.com/llvm/llvm-project/blob/7c3cf4c2c0689be1a08b8a1326703ec5770de471/openmp/runtime/src/kmp_csupport.cpp#L4050-L4058). The doacross loop nest is independent of the worksharing-loop.
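For reference, the per-dimension trip-count handling at that link distinguishes the sign of the step roughly as follows (a paraphrase with simplified names, not a verbatim copy of the runtime source):

#include <cstdint>

uint64_t doacross_trip_count(int64_t lo, int64_t up, int64_t st) {
  if (st == 1)                // most common case
    return up - lo + 1;
  if (st > 0)                 // positive step, assumes up >= lo
    return (uint64_t)(up - lo) / (uint64_t)st + 1;
  return (uint64_t)(lo - up) / (uint64_t)(-st) + 1; // negative step, lo >= up
}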

When lowering the parse-tree to MLIR (https://github.com/flang-compiler/f18-llvm-project/pull/1370/commits/75a8db9c0f7f8c21c2720a794a46afc950ccd0ff), the loop bounds info (lower bounds, upper bounds, and steps) of the doacross loop nest is collected. The loop bounds info can be passed directly as the fourth argument of the kmpc_doacross_init call, and using the expression value of the ordered depend directive as the argument of kmpc_doacross_wait/post makes it work (https://github.com/flang-compiler/f18-llvm-project/pull/1368).

Meinersbur added a comment. Edited Jan 28 2022, 2:16 PM

Thank you for the summary, it was helpful.

With this change, lowering to LLVM IR is supported when dynamic schedule is specified and collapse value is greater than 1. Also add the test case.

Could you explain what goes bad when you do not do this?

Clang transforms the doacross loop nest into a new one with a lower bound of 0 and a step of 1. However, this is really not necessary. The OpenMP runtime can handle the doacross loop nest regardless of whether the step is positive or negative (https://github.com/llvm/llvm-project/blob/7c3cf4c2c0689be1a08b8a1326703ec5770de471/openmp/runtime/src/kmp_csupport.cpp#L4050-L4058). The doacross loop nest is independent of the worksharing-loop.

While it does contain code for it, it is also wrong in edge cases:

  1. If lo is larger than up (I assume there must be a check for this somewhere, but I don't know who is responsible for checking it; the compiler-emitted code?)
  2. If lo - up overflows, in particular if the loop counter variable itself is int64_t. The trip count itself doesn't even need to be large if the increment value is large as well.
  3. Potentially if the loop counter variable is uint64_t and lo/up larger than 2^63.

An integer loop variable must be introduced anyway for loops over iterators, we might just as well normalize everything to a simplified logical iteration space and not have to bother with overflows later.
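As an illustration of that normalization (names and structure are mine, not the OMPIRBuilder API): the loop runs over a logical iteration space 0 .. trip_count-1 and the user induction variable is recomputed from the logical one, so later arithmetic never has to reason about the original bounds or step sign.

#include <cstdint>

// Minimal sketch, assuming inclusive bounds, lb <= ub for step > 0,
// ub <= lb for step < 0, and step != INT64_MIN.
void normalized_loop(int64_t lb, int64_t ub, int64_t step,
                     void (*body)(int64_t)) {
  // Trip count computed once in an unsigned type so the subtraction cannot
  // trigger signed overflow.
  uint64_t trip_count =
      step > 0 ? ((uint64_t)ub - (uint64_t)lb) / (uint64_t)step + 1
               : ((uint64_t)lb - (uint64_t)ub) / (uint64_t)(-step) + 1;
  for (uint64_t logical_iv = 0; logical_iv < trip_count; ++logical_iv)
    body((int64_t)((uint64_t)lb + (uint64_t)step * logical_iv)); // user IV
}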

llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
424

Did you consider making doacross part of an existing call like applyDynamicWorkshareLoop? What are the reasons against it? If it is a potential collapseLoops that loses the information about the dimensionality of the original loop nest, did you consider adding that information to CanonicalLoopInfo so that it can be preserved?

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
1718–1720

If the body of the loop is just an assert, enclose the entire loop in an #ifndef NDEBUG.

1729–1732
1744
llvm/unittests/Frontend/OpenMPIRBuilderTest.cpp
2025–2115

I don't think this kind of checking is useful. It does not make clear what properties are actually relevant, and it is very difficult to update even if, e.g., just the allocas are ordered differently.

I suggest having only some sanity checks, such as checking for the existence of a call to __kmpc_doacross_fini.

mlir/test/Dialect/OpenMP/invalid.mlir
123

Why this change?

Thanks @Meinersbur for the review and good comments.

With this change, lowering to LLVM IR is supported when dynamic schedule
is specified and collapse value is greater than 1. Also add the test
case.

Could you explain what goes bad when you do not do this?

For a dynamic schedule, it overrides the afterIP. But when the collapse value is greater than 1, it should use the afterIP stored before transforming the collapsed loops. You can check the changes of this patch in mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp.

Clang transforms the doacross loop nest into a new one with a lower bound of 0 and a step of 1. However, this is really not necessary. The OpenMP runtime can handle the doacross loop nest regardless of whether the step is positive or negative (https://github.com/llvm/llvm-project/blob/7c3cf4c2c0689be1a08b8a1326703ec5770de471/openmp/runtime/src/kmp_csupport.cpp#L4050-L4058). The doacross loop nest is independent of the worksharing-loop.

While it does contain code for it, it is also wrong in edge cases:

  1. If lo is larger than up (I assume there must be a check for this somewhere, but I don't know who is responsible for checking it; the compiler-emitted code?)

Do you mean lo is greater than up and the step is positive? Normalization doesn't check this case, either. I think that is a problem of the user code.

  2. If lo - up overflows, in particular if the loop counter variable itself is int64_t. The trip count itself doesn't even need to be large if the increment value is large as well.

If lo - up overflows, computing the trip count also overflows, so the worksharing-loop does not seem to support it either.
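A small illustration of that point (my own example, not from the patch, using the __int128 extension to show the out-of-range value): the subtraction can overflow int64_t even though the loop only has a handful of iterations, because the increment is huge.

#include <cstdint>
#include <iostream>

int main() {
  // Inclusive range -2^62 .. 2^62 with step 2^62 has just 3 iterations
  // (-2^62, 0, 2^62), yet up - lo == 2^63 does not fit in int64_t, so the
  // signed subtraction in "(up - lo) / st + 1" would overflow.
  int64_t lo = INT64_MIN / 2;        // -2^62
  int64_t up = INT64_MAX / 2 + 1;    //  2^62
  int64_t st = up;                   //  2^62
  __int128 diff = (__int128)up - lo; //  2^63 > INT64_MAX
  std::cout << (int64_t)(diff / st + 1) << " iterations\n"; // prints: 3
  return 0;
}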

  3. Potentially if the loop counter variable is uint64_t and lo/up larger than 2^63.

For flang, the type of the loop counter variable must be int32_t or int64_t, and there is no unsigned 64-bit integer in Fortran. For clang, the uint64_t is converted into int64_t, and no uint64_t is passed to __kmpc_doacross_init. For example, the lower bound is -1 if it is declared as unsigned long long lb = ULLONG_MAX;. In this case, for the statement for (unsigned long long i = ULLONG_MAX; i >= 1; i--), the computation here is trace_count = (uint64_t) (-1 - 1) / 1 + 1 = ULLONG_MAX, which is correct.
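The arithmetic above can be checked directly (an illustration only; the casts mirror how the bounds end up being interpreted after conversion to signed 64-bit):

#include <cassert>
#include <climits>
#include <cstdint>

int main() {
  // for (unsigned long long i = ULLONG_MAX; i >= 1; i--):
  // the bounds reinterpreted as signed 64-bit values.
  int64_t lo = (int64_t)ULLONG_MAX; // -1
  int64_t up = 1;
  int64_t st = -1;
  // Negative-step branch of the runtime's trip-count computation.
  uint64_t trip_count = (uint64_t)(lo - up) / (uint64_t)(-st) + 1;
  assert(trip_count == ULLONG_MAX); // matches the real iteration count
  return 0;
}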

An integer loop variable must be introduced anyway for loops over iterators, we might just as well normalize everything to a simplified logical iteration space and not have to bother with overflows later.

For clang, the loop variable can be normalized according to the operator in the for loop, such as <, >, <=, >=, !=. But for flang, it is not easy to know whether lb is greater than ub or not. Consider the following example:

!$omp do ordered(1)
do i = lb, up, step
  !$omp ordered depend(sink: i-1)
  call func(i-1)
  ...
  !$omp ordered depend(source)
enddo
!$omp end do

For the ordered depend directive, it is hard to know how to transform i-1 into the argument of __kmpc_doacross_wait. Actually, normalization in clang is not correct in all cases. I found one bug as follows:

#include <iostream>
using namespace std;

int main() {
  int i, i_lb = 1, i_ub = 10, i_step = 1;
  int a[10];

  for (i = 0; i < 10; i++)
    a[i] = 1;

  #pragma omp parallel num_threads(9)
  #pragma omp for ordered(1)
  for (i = i_lb; i != i_ub; i = i + i_step) {
    #pragma omp ordered depend(sink: i-1)
    a[i] = a[i-1] + 1;
    #pragma omp ordered depend(source)
  }

  for (i = 0; i < 10; i++)
    cout << a[i] << " ";
  cout << endl;
  return 0;
}
$ clang++ case.cpp && ./a.out
1 2 3 4 5 6 7 8 9 10 
$ clang++ case.cpp -fopenmp && ./a.out
1 1 1 1 1 1 1 1 1 1 

The problem is that clang thinks i_lb is greater than i_ub in this case, so the normalization of depend(sink: i-1) is wrong.
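To make the issue concrete (my own illustration, not code from clang or this patch): without normalization the sink vector for depend(sink: i-1) is simply i-1, while with a normalized loop the frontend has to map i-1 back to a logical iteration number, which requires knowing the direction of the loop. With a condition like i != i_ub the direction cannot be determined statically, so a wrong guess yields a wrong sink index.

#include <cstdint>

// Sink value passed to __kmpc_doacross_wait when the original iteration
// space is kept: just the user-level expression.
int64_t sink_unnormalized(int64_t i) { return i - 1; }

// Sink value after normalizing the loop to 0 .. n-1: only correct when lb
// and step really describe the original loop's start and direction.
int64_t sink_normalized(int64_t i, int64_t lb, int64_t step) {
  return ((i - 1) - lb) / step;
}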

For other comments, will fix them later.

llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
424

The doacross loop only inserts the init and fini calls into the current worksharing-loop. If making it part of a call like applyDynamicWorkshareLoop, three variants would be needed, i.e., applyDoacrossDynamicLoop, applyDoacrossStaticLoop, and applyDoacrossStaticChunkLoop, which is too redundant.

The worksharing loop is commonly used, but ordered(n) is not. In some workloads, there is even no ordered(n) clause. Adding the doacross loop info into CanonicalLoopInfo would have some cost, which is not necessary. What do you think?

mlir/test/Dialect/OpenMP/invalid.mlir
123

The doacross loop was not implemented before. If the ordered value is greater than 1, there is one virtual doacross clause attached by this patch. This check only checks whether there is an ordered clause.

In the last OpenMP Flang technical call, we got the information from the OpenMP community, via @Meinersbur, that the implementation of the ordered directive and clause is under discussion. Currently, in the LLVM OpenMP library and the clang frontend, the doacross loop is independent from the worksharing loop. The OpenMP community is discussing whether to fix the canonical loop instead of forming one new doacross loop, considering the performance issue and edge cases such as overflow. We plan to delay the work on lowering the ordered directive and clause, so this PR is closed for now and may be reopened in the future.

peixin abandoned this revision. Feb 22 2022, 10:33 PM