This patch adds the ReductionClauseInterface and also adds reduction
support for omp.parallel operation.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp | ||
---|---|---|
713–714 | Will this work correctly if there are nested parallel regions but the reduction is on the outer region? |
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp | ||
---|---|---|
713–714 | That is an interesting question. I just translated the functionality from wsloop to interface. I could not find clear documentation of this being an error. I tried running a nested parallel region in C without reduction clause on the inner construct and it results in a race condition - #pragma omp parallel reduction(+:i) { #pragma omp parallel { for(int j = 0; j < 10000; j++) i += 1; } } This means that the instruction i+=1 is treated as a normal operation and not an "omp.reduction" operation. So, I think to have similar behavior in Fortran, PFT to MLIR should generate normal operation for i+=1 and not an "omp.reduction" operation. Based on that interpretation, omp.reduction must always be enclosed with an operation with ReductionClauseInterface. Let me know if something seems incorrect. |
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp | ||
---|---|---|
713–714 |
more precise: omp.reduction must always be enclosed with an operation with ReductionClauseInterface where the accumulator is a part of the innermost such operation's reduction clause. |
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp | ||
---|---|---|
713–714 | OK. I am not sure whether the check is stronger than what the standard suggests. Compilers seem to be OK with it. I haven't looked into the standard in detail but did not find anything on a quick glance. The following test seemed to work fine. int i=0 #pragma omp parallel reduction(+:i) #pragma omp for for(int j = 0; j < 10; j++) i += 1; |
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp | ||
---|---|---|
713–714 | Yes, I am not suggesting that this is an error. I am saying that in the code above, i+=1 will not be executed atomically (atleast that's what is happening with clang and gcc for C). Thus, it is not handling it as a "reduction". This means that while lowering this code, i+=1 was lowered like normal code (without any reduction specific atomic handling). So, the FIR for this following clang semantics for nested constructs will be something like the following - omp.parallel reduction( ... ){ omp.wsloop for (...) { // omp.reduction here will be wrong - because that means atomic reduction. %1 = fir.load %i %2 = arith.addi %i, i32 1 fir.store %2, %i } } It is not an error for the frontend - the IR for reduction clause on the internal construct is diff. It is the job of frontend to lower only reductions in the immediate scope as omp.reduction and the nested ones as normal fir operations. Let me know if that is not clear. |
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp | ||
---|---|---|
713–714 |
The IR i saw seemed to be similar to the case where the reduction is in the #pragma omp for. Intuitively the reduction on parallel would say that there are private copies for each thread in the parallel region, In each of these threads the addition will have happen sequentially, and across threads finally at the end of the region the results will be accumulated atomically. So it seems to be a legitimate case. |
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp | ||
---|---|---|
713–714 |
No, we do have a way to represent it. I will try to elaborate. There are three cases:
omp_set_max_active_levels(2); #pragma omp parallel reduction(+:i) { #pragma omp parallel for reduction(+:i) for(int j = 0; j < 100000; j++) i += 1; } printf("i = %d\n", i); On a 32 core machine, this makes the value of i=3200000 as expected.
omp_set_max_active_levels(2); #pragma omp parallel reduction(+:i) { #pragma omp parallel for for(int j = 0; j < 100000; j++) i += 1; } printf("i = %d\n", i); This will always give different answers. We have 32x32 threads executing the i+=1 statement. The answers are different because while the outer construct makes sure that all 32 threads because of the outer construct mutate i atomically, the threads spawned by the inner construct make no such guarantee and hence the concurrent edits.
omp_set_max_active_levels(2); #pragma omp parallel { #pragma omp parallel for reduction(+:i) for(int j = 0; j < 100000; j++) i += 1; } printf("i = %d\n", i); Again, this gives different answers on each run. This is very similar to the second case. Again we have 32x32 threads here. Here, while all the threads generated by inner construct are atomic wrt each other (all sets of 32 child threads are atomic internally) they are not globally atomic. For ease of understanding, lets assume T(m, n) is the instance of the statement i+=1 executed by the n-th inner construct thread which is the child of m-th outer construct thread. With this notation, T(1,1), T(1,2), ... T(1,32) are atomic but they are not atomic with T(2,1) and hence the concurrent edits.
Yes, it is a legitimate case, and the IRs are similar but as I pointed out above, they are both not seeing this as a "reduction". Only in the first case (reduction clause on both) treats this as a proper reduction. With the current implementation of omp.reduction, we can have the same semantics as gcc/clang. (I am assuming gcc and clang behavior to be accurate about nesting as the standard doesn't mention it). We can connect over slack to figure this out if you'd like to. Apologies for the lengthy explanation. Please let me know if anything seems incorrect. |
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp | ||
---|---|---|
713–714 | @shraiysh I agree with the nested parallel case because there are two sets of threads. But I was pointing to the case where there is an outer parallel and an inner work-sharing loop. omp.parallel { omp.wsloop { } } For this case, it seems the reduction can appear on either of the operations and it should not be rejected by the verifier. |
mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp | ||
---|---|---|
713–714 | Hmm okay, now I understand. I had not tried with just worksharing construct. In that case, I will relax this requirement to cover all levels of parents. Thanks for the patience and the discussion! |
Will this work correctly if there are nested parallel regions but the reduction is on the outer region?