This is an archive of the discontinued LLVM Phabricator instance.

[mlir][OpenMP] Added ReductionClauseInterface
ClosedPublic

Authored by shraiysh on Mar 24 2022, 7:41 AM.

Details

Summary

This patch adds the ReductionClauseInterface and adds reduction
support for the omp.parallel operation.
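
For context, a rough sketch of what this could look like, modeled on the existing omp.wsloop reduction syntax in the dialect tests (names such as @add_f32, @parallel_reduction, and %acc are placeholders, and the exact assembly format may differ from the final patch):

omp.reduction.declare @add_f32 : f32
init {
^bb0(%arg: f32):
  %zero = arith.constant 0.0 : f32
  omp.yield (%zero : f32)
}
combiner {
^bb0(%lhs: f32, %rhs: f32):
  %sum = arith.addf %lhs, %rhs : f32
  omp.yield (%sum : f32)
}

func @parallel_reduction(%acc: !llvm.ptr<f32>) {
  %c1 = arith.constant 1.0 : f32
  // The reduction clause can now appear on omp.parallel as well as on omp.wsloop.
  omp.parallel reduction(@add_f32 -> %acc : !llvm.ptr<f32>) {
    omp.reduction %c1, %acc : !llvm.ptr<f32>
    omp.terminator
  }
  return
}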

Diff Detail

Event Timeline

shraiysh created this revision. · Mar 24 2022, 7:41 AM
Herald added a project: Restricted Project. · View Herald Transcript · Mar 24 2022, 7:41 AM
shraiysh requested review of this revision. · Mar 24 2022, 7:41 AM
shraiysh updated this revision to Diff 417958. · Mar 24 2022, 9:41 AM

Address pre-merge check failure.

Herald added a project: Restricted Project. · View Herald Transcript · Mar 24 2022, 9:41 AM
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
713–714

Will this work correctly if there are nested parallel regions but the reduction is on the outer region?

shraiysh added inline comments. · Mar 24 2022, 10:20 AM
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
713–714

That is an interesting question. I just translated the functionality from wsloop to the interface. I could not find clear documentation of this being an error.

I tried running a nested parallel region in C without a reduction clause on the inner construct, and it results in a race condition:

#pragma omp parallel reduction(+:i)
  {
    #pragma omp parallel
    {
      for(int j = 0; j < 10000; j++)
        i += 1;
    }
  }

This means that the instruction i += 1 is treated as a normal operation and not as an "omp.reduction" operation. So, to have similar behavior in Fortran, the PFT-to-MLIR lowering should generate a normal operation for i += 1 and not an "omp.reduction" operation. Based on that interpretation, omp.reduction must always be enclosed by an operation with ReductionClauseInterface. Let me know if something seems incorrect.

shraiysh added inline comments. · Mar 24 2022, 10:29 AM
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
713–714

omp.reduction must always be enclosed by an operation with ReductionClauseInterface.

To be more precise: omp.reduction must always be enclosed by an operation with ReductionClauseInterface, where the accumulator is part of the innermost such operation's reduction clause.
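
As a hypothetical illustration of this check (placeholder names such as @add_i32 and %acc, with the reduction declaration elided): the first form below would verify, while the second would be rejected because the accumulator passed to omp.reduction is not listed on the enclosing construct's reduction clause.

// Accepted: %acc appears on the reduction clause of the enclosing omp.parallel.
omp.parallel reduction(@add_i32 -> %acc : !llvm.ptr<i32>) {
  omp.reduction %c1, %acc : !llvm.ptr<i32>
  omp.terminator
}

// Rejected: %other is not listed on any enclosing reduction clause.
omp.parallel reduction(@add_i32 -> %acc : !llvm.ptr<i32>) {
  omp.reduction %c1, %other : !llvm.ptr<i32>
  omp.terminator
}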

mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
713–714

OK. I am not sure whether the check is stronger than what the standard suggests. Compilers seem to be OK with it. I haven't looked into the standard in detail, but I did not find anything at a quick glance.

The following test seemed to work fine.

int i = 0;
#pragma omp parallel reduction(+:i)
#pragma omp for 
for(int j = 0; j < 10; j++)
  i += 1;
shraiysh added inline comments. · Mar 24 2022, 11:07 AM
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
713–714

Yes, I am not suggesting that this is an error. I am saying that in the code above, i += 1 will not be executed atomically (at least that's what happens with clang and gcc for C). Thus, it is not handled as a "reduction". This means that while lowering this code, i += 1 was lowered like normal code (without any reduction-specific atomic handling). So the FIR for this, following clang's semantics for nested constructs, will be something like the following:

omp.parallel reduction( ... ) {
  omp.wsloop for (...) {
    // Using omp.reduction here would be wrong - that would mean an atomic reduction.
    %1 = fir.load %i
    %c1 = arith.constant 1 : i32
    %2 = arith.addi %1, %c1 : i32
    fir.store %2 to %i
  }
}

It is not an error for the frontend - the IR for a reduction clause on the inner construct is simply different. It is the job of the frontend to lower only reductions in the immediate scope as omp.reduction, and the nested ones as normal fir operations. Let me know if that is not clear.
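
For contrast, a rough sketch (same placeholder style as the snippet above) of the case where the reduction clause is also on the inner construct, so the increment in its immediate scope is expressed with omp.reduction:

omp.parallel reduction( ... ) {
  omp.wsloop reduction( ... -> %i : ... ) for (...) {
    // The contribution is in the immediate scope of the wsloop's reduction
    // clause, so it is expressed as omp.reduction rather than load/add/store.
    %c1 = arith.constant 1 : i32
    omp.reduction %c1, %i : ...
    omp.yield
  }
  omp.terminator
}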

mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
713–714

Yes, I am not suggesting that this is an error. I am saying that in the code above, i += 1 will not be executed atomically (at least that's what happens with clang and gcc for C). Thus, it is not handled as a "reduction".

The IR I saw seemed to be similar to the case where the reduction is on the #pragma omp for. Intuitively, a reduction on parallel would say that there are private copies for each thread in the parallel region; in each of these threads the addition will happen sequentially, and across threads, finally, at the end of the region, the results will be accumulated atomically. So it seems to be a legitimate case.
Are you suggesting that while this is not an error, we do not have a way to represent this reduction in the current state of reduction handling in the dialect?

shraiysh added inline comments. · Mar 24 2022, 7:20 PM
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
713–714

Are you suggesting that while this is not an error, we do not have a way to represent this reduction in the current state of reduction handling in the dialect?

No, we do have a way to represent it. I will try to elaborate.

There are three cases:

  • Reduction on both constructs
omp_set_max_active_levels(2);
#pragma omp parallel reduction(+:i)
{
  #pragma omp parallel for reduction(+:i)
  for(int j = 0; j < 100000; j++)
      i += 1;
}
printf("i = %d\n", i);

On a 32-core machine, this gives i = 3200000, as expected.

  • Reduction on the outer construct only
omp_set_max_active_levels(2);
#pragma omp parallel reduction(+:i)
{
  #pragma omp parallel for
  for(int j = 0; j < 100000; j++)
      i += 1;
}
printf("i = %d\n", i);

This will give a different answer on each run. We have 32x32 threads executing the i += 1 statement. The answers differ because, while the outer construct makes sure that the 32 threads it spawns mutate i atomically, the threads spawned by the inner construct make no such guarantee, and hence the concurrent edits.

  • Reduction on the inner construct only
omp_set_max_active_levels(2);
#pragma omp parallel
{
  #pragma omp parallel for reduction(+:i)
  for(int j = 0; j < 100000; j++)
      i += 1;
}
printf("i = %d\n", i);

Again, this gives a different answer on each run. This is very similar to the second case. Again we have 32x32 threads here. While all the threads generated by the inner construct are atomic with respect to each other (each set of 32 child threads is atomic internally), they are not globally atomic. For ease of understanding, let's say T(m, n) is the instance of the statement i += 1 executed by the n-th inner-construct thread that is a child of the m-th outer-construct thread. With this notation, T(1,1), T(1,2), ..., T(1,32) are atomic with respect to each other, but they are not atomic with T(2,1), and hence the concurrent edits.

The IR I saw seemed to be similar to the case where the reduction is on the #pragma omp for. Intuitively, a reduction on parallel would say that there are private copies for each thread in the parallel region; in each of these threads the addition will happen sequentially, and across threads, finally, at the end of the region, the results will be accumulated atomically. So it seems to be a legitimate case.

Yes, it is a legitimate case, and the IRs are similar, but as I pointed out above, neither of them treats this as a "reduction". Only the first case (a reduction clause on both constructs) treats this as a proper reduction. With the current implementation of omp.reduction, we can have the same semantics as gcc/clang. (I am assuming the gcc and clang behavior to be accurate about nesting, as the standard doesn't mention it.) We can connect over Slack to figure this out if you'd like to. Apologies for the lengthy explanation. Please let me know if anything seems incorrect.

mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
713–714

@shraiysh I agree with the nested parallel case because there are two sets of threads. But I was pointing to the case where there is an outer parallel and an inner work-sharing loop.

omp.parallel {
  omp.wsloop {
  }
}

For this case, it seems the reduction can appear on either of the operations and it should not be rejected by the verifier.
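
Concretely, a hypothetical sketch of that shape (placeholder names such as @add_i32 and %acc, reduction declaration elided), where the accumulator is listed on the outer omp.parallel and the omp.reduction sits inside the inner omp.wsloop:

omp.parallel reduction(@add_i32 -> %acc : !llvm.ptr<i32>) {
  omp.wsloop for (...) {
    %c1 = arith.constant 1 : i32
    // %acc is not listed on the wsloop's reduction clause, but it is listed
    // on the enclosing omp.parallel, so the verifier should not reject this.
    omp.reduction %c1, %acc : !llvm.ptr<i32>
    omp.yield
  }
  omp.terminator
}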

shraiysh added inline comments. · Mar 25 2022, 3:35 AM
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
713–714

Hmm okay, now I understand. I had not tried it with just the worksharing construct. In that case, I will relax this requirement to cover all levels of parent constructs. Thanks for the patience and the discussion!

shraiysh updated this revision to Diff 418182. · Mar 25 2022, 4:22 AM

Relaxed the check for accumulator.

shraiysh updated this revision to Diff 418183. · Mar 25 2022, 4:24 AM

Added tests for the new checks. This patch is ready for review now.

shraiysh marked 4 inline comments as done. · Mar 25 2022, 11:31 PM
This revision is now accepted and ready to land. · Mar 27 2022, 4:02 PM
This revision was automatically updated to reflect the committed changes.