This is an archive of the discontinued LLVM Phabricator instance.

[OpenMPIRBuilder] Implement static-chunked workshare-loop schedules.
ClosedPublic

Authored by Meinersbur on Nov 22 2021, 8:50 PM.

Details

Summary

Add applyStaticChunkedWorkshareLoop method implementing static schedule when chunk-size is specified. Unlike a static schedule without chunk-size (where chunk-size is chosen by the runtime such that each thread receives one chunk), we need two nested loops: one for looping over the iterations of a chunk, and a second for looping over all chunks assigned to the threads.

This patch includes the following related changes:

  • Adapt applyWorkshareLoop to triage between the schedule types, now possible since all schedules have been implemented. The default schedule is assumed to be non-chunked static, as without OpenMPIRBuilder.
  • Remove the chunk parameter from applyStaticWorkshareLoop, it is ignored by the runtime. Change the value for the value passed to the init function to 0, as without OpenMPIRBuilder.
  • Refactor CanonicalLoopInfo::setTripCount and CanonicalLoopInfo::mapIndVar as used by both, applyStaticWorkshareLoop and applyStaticChunkedWorkshareLoop.
  • Enable Clang to use the OpenMPIRBuilder in the presence of the schedule clause.

Diff Detail

Event Timeline

Meinersbur created this revision.Nov 22 2021, 8:50 PM
Meinersbur requested review of this revision.Nov 22 2021, 8:50 PM
Herald added projects: Restricted Project, Restricted Project, Restricted Project. · View Herald TranscriptNov 22 2021, 8:50 PM
  • Simicolon after LLVM_FALLTHROUGH
  • Changes requested by @peixin
  • Remove unused code
  • Fix mlir test

@Meinersbur Please rebase on main. The function "getPreheader()" was moved into OMPIRBuilder.h.

  • Rebase
  • clang-format

@Meinersbur Please rebase on main. The function "getPreheader()" was moved into OMPIRBuilder.h.

I rebased, but I am not sure what you are referring to. getPreheader() always was in OMPIRBuilder.h

@Meinersbur Please rebase on main. The function "getPreheader()" was moved into OMPIRBuilder.h.

I rebased, but I am not sure what you are referring to. getPreheader() always was in OMPIRBuilder.h

getPreheader() was in OMPIRBuilder.cpp before you rebase in your last update here. That's why I let you rebase since I failed to git apply your last patch in main branch. It's not important now. Please forget about that.

getPreheader() was in OMPIRBuilder.cpp before you rebase in your last update here. That's why I let you rebase since I failed to git apply your last patch in main branch. It's not important now. Please forget about that.

D114368 (which this patch depends on) moves getPreheder() to the .cpp files (because it has become more than a simple getter)

getPreheader() was in OMPIRBuilder.cpp before you rebase in your last update here. That's why I let you rebase since I failed to git apply your last patch in main branch. It's not important now. Please forget about that.

D114368 (which this patch depends on) moves getPreheder() to the .cpp files (because it has become more than a simple getter)

Thanks a lot. Now I get it.

peixin added a comment.Dec 9 2021, 6:59 AM

Can you check the following example by applying this patch on fir-dev?

program main
  integer :: i, N = 10
  real :: x = 0

  !$omp do schedule(static, 2)
  do i = 3, N
    x = x + i
  end do
  !$omp end do

  print *, x
end

Test running result:

$ gfortran test.f90 -fopenmp && ./a.out
   52.0000000    
$ bbc -fopenmp -emit-fir test.f90
$ tco test.mlir -o test.ll
$ clang++ -lFortran_main -lFortranRuntime -lFortranDecimal -lomp -o a.out test.ll
$ ./a.out
 7.

When you change "schedule(static, 2)" into "schedule(static, 1)", the running result is 3.0.

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
1702

Can you move "Value *ThreadNum = getOrCreateThreadID(SrcLoc);" after "Builder.CreateStore(One, PStride);" in order that the "kmpc_global_thread_num" call is right before the "kmpc_static_init" call to keep consistence with others?

1799

Please remove the space between "ScheduleKind" and "OMP_SCHEDULE_Default"? Also for the following switch cases.

@Meinersbur Here is the c++ code test. Without the chunk size specified, the running result using OMPIRBuilder is correct.

#include<iostream>

int main() {
  int i, N = 10;
  float x = 0.0;

  #pragma omp for schedule(static, 2)
  for(i = 3; i <= N; i++) {
    x = x + i;
  }

  std::cout << "x = " << x << std::endl;

  return 0;
}
$ clang++ chunk-1.cpp -fopenmp -fopenmp-enable-irbuilder && ./a.out
x = 7
$ clang++ chunk-1.cpp -fopenmp && ./a.out
x = 52

BTW, please rebase on main. There is one conflict about function getOrCreateSrcLocStr.

  • Rebase

Still to do: fix bug report

libomp implemented a special case that I was not considering: When not in a parallel environment, it would ignore the chunksize and execute the entire iteration space as a single chunk. That is, we cannot assume that the chunk loop iteration count is chunksize must compute it from the lower and upper bound returned from the runtime call.

Thanks for the reproducer!

Can you rebase this? I cannot apply this patch on current main branch.

When I investigated the edge cases you mentioned in D116292. Found one unsupported case as follows

#include <climits>
#include <iostream>
using namespace std;

void func(unsigned long long lb, unsigned long long ub, unsigned long long step) {
  unsigned long long i;
  #pragma omp for schedule(static, 1)
  for (i = lb; i > ub; i -= step) {
    cout << i << endl;
  }
}

int main() {
  unsigned long long lb, ub, step;
  lb = ULLONG_MAX;
  ub = ULLONG_MAX / 10;
  step = ULLONG_MAX / 10;
  cout << "lb: " << lb << endl;
  cout << "ub: " << ub << endl;
  cout << "step: " << step << endl;

  func(lb, ub, step);

  cout << endl;
  return 0;
}
$ clang++ temp.cpp -fopenmp && ./a.out
lb: 18446744073709551615
ub: 1844674407370955161
step: 1844674407370955161
18446744073709551615
16602069666338596454
14757395258967641293
12912720851596686132
11068046444225730971
9223372036854775810
7378697629483820649
5534023222112865488
3689348814741910327
1844674407370955166
$ clang++ temp.cpp -fopenmp -fopenmp-enable-irbuilder
clang-14: /home/qpx/compilers/llvm-community/static-chunk-codegen/llvm-project/llvm/lib/IR/Instructions.cpp:506: void llvm::CallInst::init(llvm::FunctionType*, llvm::Value*, llvm::ArrayRef<llvm::Value*>, llvm::ArrayRef<llvm::OperandBundleDefT<llvm::Value*> >, const llvm::Twine&): Assertion `(i >= FTy->getNumParams() || FTy->getParamType(i) == Args[i]->getType()) && "Calling a function with a bad signature!"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:

This is also for schedule(static).

  • Fix chunksize/tripcount type mismatch
  • Test with different iv sizes
Meinersbur added a comment.EditedJan 29 2022, 1:31 PM

@peixin Thanks for testing edge cases. You hit multiple issues:

  1. Chunksize was i32, combining it with a i64 induction variable caused an error. Fixed the latest update of this patch.
  2. OpenMPIRBuilder currently doesn't work really with exceptions. See D115216 for a start of a discussion with #pragma omp parallel. Support for irregular exits (exceptions, cancellation, destructors) out of CanonicalLoopInfo is what I was working on recently. Use -fno-exceptions to work around.
  3. There is an off-by-one error that I already fixed in my development branch. Upstream patch here: D118542

Result with these fixes for me is:

lb: 18446744073709551615
ub: 1844674407370955161
step: 1844674407370955161
18446744073709551615
16602069666338596454
14757395258967641293
12912720851596686132
11068046444225730971
9223372036854775810
7378697629483820649
5534023222112865488
3689348814741910327
1844674407370955166

Note that this does not involve __kmpc_doacross_init code in libomp you pointed-to in D116292. This uses __kmpc_for_static_init calls of which there are 4 variants for (signed/unsigned x 32/64 bits). To do __kmpc_doacross_init correctly, it would also need at least have variants for signed/unsigned (or one working internally with signed 128 bits).

Thanks for the fix. The fix of off-by-one issue looks ok to me. Will continue reviewing other parts in one week due to the Spring Festival in China.

Except for three nits. LGTM.

llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
1451

Nit: integer -> 64-bit integer?

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
1702

This comment is not addressed.

1799

The extra space is not removed.

This revision was not accepted when it landed; it landed in state Needs Review.Feb 28 2022, 4:18 PM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
Meinersbur marked 4 inline comments as done.
Meinersbur added inline comments.Mar 1 2022, 5:51 AM
llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
1451

not necessarily, we do not require a specific integer size. For instance, __kmpc_for_static_init_4u takes a 32-bit integer. It is up to the applyXYZ function to zext/trunc it when necessary.

peixin added inline comments.Mar 1 2022, 7:39 PM
llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
1451

Got it. Thanks.

Herald added a project: Restricted Project. · View Herald TranscriptMar 1 2022, 7:39 PM