Page MenuHomePhabricator

[MLIR][OpenMP] Add omp.do operation
Needs RevisionPublic

Authored by DavidTruby on Aug 17 2020, 6:25 AM.

Details

Summary

This adds a simple definition of a "workshare loop" operation for
the OpenMP MLIR dialect, excluding the "reduction" and "allocate"
clauses and without a custom parser and pretty printer.

The schedule clause also does not yet accept the modifiers that are
permitted in OpenMP 5.0.

Diff Detail

Event Timeline

DavidTruby created this revision.Aug 17 2020, 6:25 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptAug 17 2020, 6:25 AM
DavidTruby requested review of this revision.Aug 17 2020, 6:25 AM
llvm/include/llvm/Frontend/OpenMP/OMP.td
120

For the default clause in parallel we have used a prefix "def" to fix this issue. I think we need to standardize this. Would converting the first letter to caps be a reasonable workaround since reserved keywords do not have a letter with caps as first letter?

clementval added inline comments.Aug 18 2020, 11:34 AM
llvm/include/llvm/Frontend/OpenMP/OMP.td
120

That's probably a good idea to have a standard way to do that. +1 for the first letter capitalized if it works.

llvm/include/llvm/Frontend/OpenMP/OMP.td
120

I have filed an issue https://bugs.llvm.org/show_bug.cgi?id=47225 regarding StrEnumAttr not accepting reserved C++ keywords.

A few general comments about the design of the workshare loop operation.

  1. Should the iteration interval (start, end) and step be part of the loop operation or should we have one more version of omp do?

Openmp for/do loops with static scheduling are implemented by the two following runtime call. How will these values be provided to the OpenMP IRBuilder if these are not carried in the loop operation?
void kmpc for static init 4 ( ident_t ∗ loc, kmp int32 gtid, kmp int32 schedtype, kmp int32 ∗ plastiter, kmp int32 ∗ plower, kmp int32 ∗ pupper, kmp int32 ∗ pstride, kmp int32 incr, kmp int32 chunk )

  1. Should the loop operation be Anyregion or just one block? (Note: @clementval felt it was OK with both anyregion or single block region for OpenACC loop.)

Unlike parallel, the loop operation will always have a loop associated with it. All higher-level dialects have loop operations. But since we lower to OpenMP + LLVM dialect (and llvm has no loop operation) we need to have any region and not just one block. An alternative would be to have two omp.do operations one which sits with the do loop (of one block) and one which sits without (any region).

  1. Another version of omp.do suggested was a single block region with iteration interval and step size. This is similar to loops in other dialects like affine. This has the advantage of transforming to other dialects like affine or recreating the affine transformations as omp.do, as well as keeping it free of optimizations of other dialects. But this version alone will not be able to capture the various loops permitted by OpenMP and Fortran/C++ because there can be branches in Fortran code. Another issue is that we have to retain the OpenMP information somehow (to generate runtime calls) so fully transforming to other dialect loops is not possible.
mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
141

Is this a single value or a list of values?

142–143

Is there a schedule modifier also?

144

This should be an attribute.
"The parameter of the collapse clause must be a constant positive integer expression."

146

This should also be an attribute.
"The parameter of the ordered clause must be a constant positive integer expression if specified."

ftynse added inline comments.Mon, Aug 31, 12:36 AM
llvm/include/llvm/Frontend/OpenMP/OMP.td
123

Nit: could we have a space after comma?

DavidTruby added inline comments.Mon, Sep 7, 6:09 AM
mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
141

it's one step value per element in linear_vars. So a list of values.
The two lists should always be the same length, but I don't think there's a way to enforce it here (it should be enforced later)

142–143

I'm leaving the schedule modifier for a later patch, as it requires more changes to the OMP.td file. I've added a clarification on this to the commit message.

DavidTruby updated this revision to Diff 290276.Mon, Sep 7, 6:29 AM

Address review comments.

kiranchandramohan accepted this revision.Mon, Sep 7, 1:50 PM

@DavidTruby, In general, I was asking whether all clause enumeration values can begin with a Capital letter? Just the reserved keywords would look odd right?

Also, can you mark fixed comments as done?

I guess for the general workshare loop design issues we can have an RFC in discourse. But this patch can go ahead.

mlir/test/Dialect/OpenMP/ops.mlir
142

Nit: extra line.

This revision is now accepted and ready to land.Mon, Sep 7, 1:50 PM

@DavidTruby, In general, I was asking whether all clause enumeration values can begin with a Capital letter? Just the reserved keywords would look odd right?

Oh I see. Yes, we can do that. I'll make the same change for parallel in a separate patch.

DavidTruby marked 3 inline comments as done.Tue, Sep 8, 4:52 AM
DavidTruby updated this revision to Diff 290474.Tue, Sep 8, 6:47 AM

Capitalise all schedule clause values.

DavidTruby marked 2 inline comments as done.Tue, Sep 8, 6:49 AM
clementval accepted this revision.Thu, Sep 10, 8:35 AM
ftynse requested changes to this revision.Thu, Sep 10, 8:54 AM

I guess for the general workshare loop design issues we can have an RFC in discourse. But this patch can go ahead.

I haven't seen answers to the questions about lowering to LLVM IR + OpenMP runtime, and it sounds suboptimal to push the patch before discussing and agreeing on the actual design.

In particular, it is not clear to me how this construct will connect to loops and what the lowering flow is. Does it expect an scf.for/scf.parallel as the only nested op? Is there a plan for a separate omp.for? How long do loops persist when we go to LLVM, given that OpenMPIRBuilder does not handle loop constructs and we really want to avoid converting loops to CFG during MLIR->LLVM IR translation.

This revision now requires changes to proceed.Thu, Sep 10, 8:54 AM

In particular, it is not clear to me how this construct will connect to loops and what the lowering flow is. Does it expect an scf.for/scf.parallel as the only nested op? Is there a plan for a separate omp.for? How long do loops persist when we go to LLVM, given that OpenMPIRBuilder does not handle loop constructs and we really want to avoid converting loops to CFG during MLIR->LLVM IR translation.

The last part is unclear to me TBH. What exactly do you expect to do with OpenMP worksharing loops on this level which is problematic with CFGs?

What exactly do you expect to do with OpenMP worksharing loops on this level which is problematic with CFGs?

It's the inverse that is problematic: having loop ops where CFG is expected. I would like to avoid seeing something like

omp.do <...> {
  omp.for %i = <...> {
    llvm.store <...>
    // other LLVM operations here
  }
}
// LLVM operations as CFG here

go into mlir-translate, which will have to outline and lower the loop during _translation_ in this case.

What exactly do you expect to do with OpenMP worksharing loops on this level which is problematic with CFGs?

It's the inverse that is problematic: having loop ops where CFG is expected. I would like to avoid seeing something like

omp.do <...> {
  omp.for %i = <...> {
    llvm.store <...>
    // other LLVM operations here
  }
}
// LLVM operations as CFG here

go into mlir-translate, which will have to outline and lower the loop during _translation_ in this case.

I think I might simply misunderstand because I'm oblivious to the "MLIR way" but the loop belongs to the OpenMP directive.
It has to be the OpenMPIRBuilder that lowers it into a CFG (eventually) because it is not "a fortran/mlir/affine/... loop" but an OpenMP worksharing loop with all what that entails.

It has to be the OpenMPIRBuilder that lowers it into a CFG (eventually) because it is not "a fortran/mlir/affine/... loop" but an OpenMP worksharing loop with all what that entails.

If there is an OpenMPIRBuilder::CreateForLoop or a plan to have it? So far, it looks a there is a non-negligible amount of code in Clang that emits the IR for loops, and replicating that code in mlir-translate is a no go.

It has to be the OpenMPIRBuilder that lowers it into a CFG (eventually) because it is not "a fortran/mlir/affine/... loop" but an OpenMP worksharing loop with all what that entails.

If there is an OpenMPIRBuilder::CreateForLoop or a plan to have it? So far, it looks a there is a non-negligible amount of code in Clang that emits the IR for loops, and replicating that code in mlir-translate is a no go.

It has to be the OpenMPIRBuilder that lowers it into a CFG (eventually) because it is not "a fortran/mlir/affine/... loop" but an OpenMP worksharing loop with all what that entails.

If there is an OpenMPIRBuilder::CreateForLoop or a plan to have it?

Yes.

So far, it looks a there is a non-negligible amount of code in Clang that emits the IR for loops, and replicating that code in mlir-translate is a no go.

All OpenMP related LLVM-IR code generation is (eventually) going to be moved into OpenMPIRBuilder so we do not duplicate the rather nontrivial parts in two places which would become a maintenance nightmare. That is (among other things) the point ;)

It has to be the OpenMPIRBuilder that lowers it into a CFG (eventually) because it is not "a fortran/mlir/affine/... loop" but an OpenMP worksharing loop with all what that entails.

If there is an OpenMPIRBuilder::CreateForLoop or a plan to have it? So far, it looks a there is a non-negligible amount of code in Clang that emits the IR for loops, and replicating that code in mlir-translate is a no go.

It has to be the OpenMPIRBuilder that lowers it into a CFG (eventually) because it is not "a fortran/mlir/affine/... loop" but an OpenMP worksharing loop with all what that entails.

If there is an OpenMPIRBuilder::CreateForLoop or a plan to have it?

Yes.

Is there a written version somewhere we can see?

So far, it looks a there is a non-negligible amount of code in Clang that emits the IR for loops, and replicating that code in mlir-translate is a no go.

All OpenMP related LLVM-IR code generation is (eventually) going to be moved into OpenMPIRBuilder so we do not duplicate the rather nontrivial parts in two places which would become a maintenance nightmare. That is (among other things) the point ;)

This sounds great! What I am missing is how this connects to MLIR. Two particular issues: how the OpenMP dialect interacts with other dialects, and can the OpenMPIRBuilder be designed so as to eventually take an mlir::Builder instead of llvm::IRBuilder<> and produce LLVM dialect instead of LLVM; there may be others. So far, the implementation of omp.do proposed here does not align, if not contradicts, the original RFC. Bottomline, this deserves a proper discussion in a forum with better visibility than code review. Part of the discussion seemingly happens in flang channels and should be somehow summarized for mlir folks as OpenMP dialect being independent of any Fortran construct was a condition for accepting it in core.

I guess for the general workshare loop design issues we can have an RFC in discourse. But this patch can go ahead.

I haven't seen answers to the questions about lowering to LLVM IR + OpenMP runtime, and it sounds suboptimal to push the patch before discussing and agreeing on the actual design.

In particular, it is not clear to me how this construct will connect to loops and what the lowering flow is. Does it expect an scf.for/scf.parallel as the only nested op? Is there a plan for a separate omp.for? How long do loops persist when we go to LLVM, given that OpenMPIRBuilder does not handle loop constructs and we really want to avoid converting loops to CFG during MLIR->LLVM IR translation.

Thanks @ftynse for the feedback. Yes, the whole flow for the OpenMP worksharing loop requires a detailed discussion in discourse. I was only suggesting that since the operation in this patch only models what is an OpenMP do directive it can go ahead. The RFC for the workshare operation is next on @DavidTruby's list. It is fine to have the RFC discussion before submitting this patch.

It has to be the OpenMPIRBuilder that lowers it into a CFG (eventually) because it is not "a fortran/mlir/affine/... loop" but an OpenMP worksharing loop with all what that entails.

If there is an OpenMPIRBuilder::CreateForLoop or a plan to have it? So far, it looks a there is a non-negligible amount of code in Clang that emits the IR for loops, and replicating that code in mlir-translate is a no go.

It has to be the OpenMPIRBuilder that lowers it into a CFG (eventually) because it is not "a fortran/mlir/affine/... loop" but an OpenMP worksharing loop with all what that entails.

If there is an OpenMPIRBuilder::CreateForLoop or a plan to have it?

Yes.

Is there a written version somewhere we can see?

A written version of the plan? Yes. A written version of the code, not yet.

So far, it looks a there is a non-negligible amount of code in Clang that emits the IR for loops, and replicating that code in mlir-translate is a no go.

All OpenMP related LLVM-IR code generation is (eventually) going to be moved into OpenMPIRBuilder so we do not duplicate the rather nontrivial parts in two places which would become a maintenance nightmare. That is (among other things) the point ;)

This sounds great!
[...]
and can the OpenMPIRBuilder be designed so as to eventually take an mlir::Builder instead of llvm::IRBuilder<> and produce LLVM dialect instead of LLVM;
[...]

So far that was not on my TODO list and it seems like a lot of work assuming you do not port various other things into MLIR land. Could you help me understand what we would gain by generating LLVM dialect?

It has to be the OpenMPIRBuilder that lowers it into a CFG (eventually) because it is not "a fortran/mlir/affine/... loop" but an OpenMP worksharing loop with all what that entails.

If there is an OpenMPIRBuilder::CreateForLoop or a plan to have it? So far, it looks a there is a non-negligible amount of code in Clang that emits the IR for loops, and replicating that code in mlir-translate is a no go.

It has to be the OpenMPIRBuilder that lowers it into a CFG (eventually) because it is not "a fortran/mlir/affine/... loop" but an OpenMP worksharing loop with all what that entails.

If there is an OpenMPIRBuilder::CreateForLoop or a plan to have it?

Yes.

Is there a written version somewhere we can see?

A written version of the plan? Yes. A written version of the code, not yet.

Could you post a link?

So far, it looks a there is a non-negligible amount of code in Clang that emits the IR for loops, and replicating that code in mlir-translate is a no go.

All OpenMP related LLVM-IR code generation is (eventually) going to be moved into OpenMPIRBuilder so we do not duplicate the rather nontrivial parts in two places which would become a maintenance nightmare. That is (among other things) the point ;)

This sounds great!
[...]
and can the OpenMPIRBuilder be designed so as to eventually take an mlir::Builder instead of llvm::IRBuilder<> and produce LLVM dialect instead of LLVM;
[...]

So far that was not on my TODO list and it seems like a lot of work assuming you do not port various other things into MLIR land. Could you help me understand what we would gain by generating LLVM dialect?

Would you mind moving this to discourse / mailing list?

It has to be the OpenMPIRBuilder that lowers it into a CFG (eventually) because it is not "a fortran/mlir/affine/... loop" but an OpenMP worksharing loop with all what that entails.

If there is an OpenMPIRBuilder::CreateForLoop or a plan to have it? So far, it looks a there is a non-negligible amount of code in Clang that emits the IR for loops, and replicating that code in mlir-translate is a no go.

It has to be the OpenMPIRBuilder that lowers it into a CFG (eventually) because it is not "a fortran/mlir/affine/... loop" but an OpenMP worksharing loop with all what that entails.

If there is an OpenMPIRBuilder::CreateForLoop or a plan to have it?

Yes.

Is there a written version somewhere we can see?

A written version of the plan? Yes. A written version of the code, not yet.

Could you post a link?

Sure: http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-May/000197.html

So far, it looks a there is a non-negligible amount of code in Clang that emits the IR for loops, and replicating that code in mlir-translate is a no go.

All OpenMP related LLVM-IR code generation is (eventually) going to be moved into OpenMPIRBuilder so we do not duplicate the rather nontrivial parts in two places which would become a maintenance nightmare. That is (among other things) the point ;)

This sounds great!
[...]
and can the OpenMPIRBuilder be designed so as to eventually take an mlir::Builder instead of llvm::IRBuilder<> and produce LLVM dialect instead of LLVM;
[...]

So far that was not on my TODO list and it seems like a lot of work assuming you do not port various other things into MLIR land. Could you help me understand what we would gain by generating LLVM dialect?

Would you mind moving this to discourse / mailing list?

No, feel free to reply to the thread above or start a new one.

Sure: http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-May/000197.html

@jdoerfert This is probably the old flang-dev mailing list.

Is it OK for @jdoerfert and @ftynse If I initiate the discussion in discourse?

I'm fine with anything that is not a code review, which most people would just ignore

Sure: http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-May/000197.html

@jdoerfert This is probably the old flang-dev mailing list.

Is it OK for @jdoerfert and @ftynse If I initiate the discussion in discourse?

I'm not following discourse.

Sure: http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-May/000197.html

@jdoerfert This is probably the old flang-dev mailing list.

Is it OK for @jdoerfert and @ftynse If I initiate the discussion in discourse?

I'm not following discourse.

That's the main discussion channel for MLIR, we expect RFCs to go there. We can go to llvm-dev@ for the OpenMPIRBuilder discussion, I think it is actually the best place given how it cuts across LLVM, Clang and Flang.

OK. I had almost typed in discourse. I primarily wanted to ask the following question before posing a general question in llvm-dev about the dialect or the OpenMP IRBuilder.

Consider that we want to parallelize an SCF loop with OpenMP. We can add the omp.do operation around the loop. This would look like.

func @some_op_inside_loop(%arg0: index, %arg1: index, %arg2: index) {
omp.do {
  scf.for %i = %arg0 to %arg1 step %arg2 {
    "some.op"(%i) : (index) -> ()
  }
}
  return
}

One way to pass this to the OpenMP IR Builder would be as follows. The loop exists as control flow. (Question: Can we have index, start, end and step variables here as operands or attributes?)

  llvm.func @some_op_inside_loop(%arg0: !llvm.i64, %arg1: !llvm.i64, %arg2: !llvm.i64) {
omp.do {
    llvm.br ^bb1(%arg0 : !llvm.i64)
  ^bb1(%0: !llvm.i64):  // 2 preds: ^bb0, ^bb2
    %1 = llvm.icmp "slt" %0, %arg1 : !llvm.i64
    llvm.cond_br %1, ^bb2, ^bb3
  ^bb2:  // pred: ^bb1
    "some.op"(%0) : (!llvm.i64) -> ()
    %2 = llvm.add %0, %arg2 : !llvm.i64
    llvm.br ^bb1(%2 : !llvm.i64)
  ^bb3:  // pred: ^bb1
}
    llvm.return
  }

Another way would be as follows where the loop exists without the control flow of the loop.

llvm.func @some_op_inside_loop(%arg0: !llvm.i64, %arg1: !llvm.i64, %arg2: !llvm.i64) {
omp.do index(%i: !llvm.i64) start(%arg0: !llvm.i64) stop(%arg1: !llvm.i64) step(%arg2: !llvm.i64) {
    "some.op"(%i) : (!llvm.i64) -> ()
}
  return
}

The question I have for @jdoerfert is which one of these will be preferable for the OpenMP IRBuilder? And the question I have for @ftynse is what is the issue that is there in these schemes?

bondhugula requested changes to this revision.Sun, Sep 13, 11:17 PM
bondhugula added a subscriber: bondhugula.

Great to see this. Some minor comments.

llvm/include/llvm/Frontend/OpenMP/OMP.td
120

Please terminate with a period.

mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
134

This description is missing a customary example in the end. (Please use triple backticks to include an example.)

144

Please Confined and an IntMinValue argument to force this to be positive. It'll be automatically enforced and verified. As an example, please see AllocLikeOp in the standard dialect.

let arguments = (ins Variadic<Index>:$value,
                  Confined<OptionalAttr<I64Attr>, [IntMinValue<0>]>:$alignment);
bondhugula added a comment.EditedSun, Sep 13, 11:30 PM

I guess for the general workshare loop design issues we can have an RFC in discourse. But this patch can go ahead.

I haven't seen answers to the questions about lowering to LLVM IR + OpenMP runtime, and it sounds suboptimal to push the patch before discussing and agreeing on the actual design.

In particular, it is not clear to me how this construct will connect to loops and what the lowering flow is. Does it expect an scf.for/scf.parallel as the only nested op? Is there a plan for a separate omp.for? How long do loops persist when we go to LLVM, given that OpenMPIRBuilder does not handle loop constructs and we really want to avoid converting loops to CFG during MLIR->LLVM IR translation.

+1 on having clarity on the lowering part to LLVM dialect. I think the op itself doesn't place any restrictions on what's inside. There could be scf.for ops nested inside or surrounding an omp.do given that they both use the same type subsystem (standard types). One would expect an scf.parallel to be converted to an omp.do? The remaining scf.for's would be lowered to LLVM dialect the usual way. It looks like we need some sort of a pre-pass on the LLVM dialect to lower/handle the omp.do and some of the implementation therein would have to duplicate/reuse the logic of scf.for lowering. @DavidTruby @kiranchandramohan - do you have an example sketch of how the LLVM dialect would like for a simple / minimal omp.do loop right before it's translated out to LLVM IR proper? Would it require duplicating LLVM's OpenMPIRBuilder functionality here on the LLVM dialect?

Thanks @bondhugula for the review and comments. Very helpful. We will post out an RFC in a week or so and take it forward from there. Will try to answer your questions in the RFC. We will likely need some help to complete the flow.

jdoerfert added a comment.EditedMon, Sep 14, 9:18 PM

I forgot to submit the first part of this response a few days ago, apologies.

OK. I had almost typed in discourse. I primarily wanted to ask the following question before posing a general question in llvm-dev about the dialect or the OpenMP IRBuilder.

Consider that we want to parallelize an SCF loop with OpenMP. We can add the omp.do operation around the loop. This would look like.

func @some_op_inside_loop(%arg0: index, %arg1: index, %arg2: index) {
omp.do {
  scf.for %i = %arg0 to %arg1 step %arg2 {
    "some.op"(%i) : (index) -> ()
  }
}
  return
}

This is not parallelization but worksharing, but I think I get what you're saying.

One way to pass this to the OpenMP IR Builder would be as follows. The loop exists as control flow. (Question: Can we have index, start, end and step variables here as operands or attributes?)

  llvm.func @some_op_inside_loop(%arg0: !llvm.i64, %arg1: !llvm.i64, %arg2: !llvm.i64) {
omp.do {
    llvm.br ^bb1(%arg0 : !llvm.i64)
  ^bb1(%0: !llvm.i64):  // 2 preds: ^bb0, ^bb2
    %1 = llvm.icmp "slt" %0, %arg1 : !llvm.i64
    llvm.cond_br %1, ^bb2, ^bb3
  ^bb2:  // pred: ^bb1
    "some.op"(%0) : (!llvm.i64) -> ()
    %2 = llvm.add %0, %arg2 : !llvm.i64
    llvm.br ^bb1(%2 : !llvm.i64)
  ^bb3:  // pred: ^bb1
}
    llvm.return
  }

Another way would be as follows where the loop exists without the control flow of the loop.

llvm.func @some_op_inside_loop(%arg0: !llvm.i64, %arg1: !llvm.i64, %arg2: !llvm.i64) {
omp.do index(%i: !llvm.i64) start(%arg0: !llvm.i64) stop(%arg1: !llvm.i64) step(%arg2: !llvm.i64) {
    "some.op"(%i) : (!llvm.i64) -> ()
}
  return
}

The question I have for @jdoerfert is which one of these will be preferable for the OpenMP IRBuilder? And the question I have for @ftynse is what is the issue that is there in these schemes?

First, the loop belongs to the omp.do. If you want to lower an omp.do (w/ or w/o the OpenMPIRBuilder) you need the loop information (=bounds + step). The loop body is irrelevant at this point.
The interface will somewhat look like this:

InsertPos CreateWorksharingLoop(..., LowerBound, UpperBound, Step, ..., BodyCodeGenCallback)

I guess for the general workshare loop design issues we can have an RFC in discourse. But this patch can go ahead.

I haven't seen answers to the questions about lowering to LLVM IR + OpenMP runtime, and it sounds suboptimal to push the patch before discussing and agreeing on the actual design.

In particular, it is not clear to me how this construct will connect to loops and what the lowering flow is. Does it expect an scf.for/scf.parallel as the only nested op? Is there a plan for a separate omp.for? How long do loops persist when we go to LLVM, given that OpenMPIRBuilder does not handle loop constructs and we really want to avoid converting loops to CFG during MLIR->LLVM IR translation.

+1 on having clarity on the lowering part to LLVM dialect. I think the op itself doesn't place any restrictions on what's inside. There could be scf.for ops nested inside or surrounding an omp.do given that they both use the same type subsystem (standard types).
One would expect an scf.parallel to be converted to an omp.do?

You can lower scf.parallel into omp parallel for/do *if* you prove the body does not contain (certain) OpenMP runtime calls and directives. So without analysis you cannot.

The remaining scf.for's would be lowered to LLVM dialect the usual way. It looks like we need some sort of a pre-pass on the LLVM dialect to lower/handle the omp.do and some of the implementation therein would have to duplicate/reuse the logic of scf.for lowering.

I don't see why this would be the case. As mentioned above, the loop, whatever "op" it might be, belongs to the omp do. There is no omp do without loop, there is no "loop" once omp do has been lowered (to runtime calls). The omp do lowering is also not duplicating scf.for code if you use the OpenMPIRBuilder.

@DavidTruby @kiranchandramohan - do you have an example sketch of how the LLVM dialect would like for a simple / minimal omp.do loop right before it's translated out to LLVM IR proper? Would it require duplicating LLVM's OpenMPIRBuilder functionality here on the LLVM dialect?

Please, do not duplicate OpenMPIRBuilder functionality.

The remaining scf.for's would be lowered to LLVM dialect the usual way. It looks like we need some sort of a pre-pass on the LLVM dialect to lower/handle the omp.do and some of the implementation therein would have to duplicate/reuse the logic of scf.for lowering.

I don't see why this would be the case. As mentioned above, the loop, whatever "op" it might be, belongs to the omp do. There is no omp do without loop, there is no "loop" once omp do has been lowered (to runtime calls). The omp do lowering is also not duplicating scf.for code if you use the OpenMPIRBuilder.

I see. So you are suggesting just preserving the omp.do all the way into the LLVM dialect and then use the OpenMPIRBuilder in the MLIR LLVM dialect to LLVM IR *translator*?

You can lower scf.parallel into omp parallel for/do *if* you prove the body does not contain (certain) OpenMP runtime calls and directives. So without analysis you cannot.

It's generally hard to prove that something is not contained in a region in MLIR. You can always have "unknown.op"() in there, and you'd have to treat it conservatively by assuming it may be an equivalent of the forbidden call or it lowers to such call.

The presence of additional constraints on what is allowed inside the loop body suggests that OpenMP loops should a different operation than scf.for or scf.parallel, potentially sharing interfaces/traits with the latter. That being said, introducing yet another "loop-like" operation also sounds suboptimal.

The question I have for @jdoerfert is which one of these will be preferable for the OpenMP IRBuilder? And the question I have for @ftynse is what is the issue that is there in these schemes?

If scf.for persists until the point where we perform the MLIR-to-LLVM-IR translation, the translation will have to do (the equivalent of) scf-to-std and std-to-llvm passes. which contradicts the "progressive lowering" principle of MLIR. (Note that we originally had a translation from the standard dialect directly to LLVM IR, but we introduced the LLVM dialect specifically to avoid that translation growing in complexity). If there is no explicit loop, the translator will have to analyze the CFG and essentially raise back to the loop form to be able to call the OpenMP IRBuilder, which would expect loop bounds.

To actually make the loops persist, we would need some fine-grained control over the SCF-to-std lowering that would ignore loops contained in omp.do somehow, but only the outermost (?). The entire layering of scf-to-std lowering depending on the OpenMP dialect is not clear to me. There are also hard edges on dialect mixture due to type system differences: SCF does not work on LLVM types and LLVM operations don't work on standard types. We'd need cast operations, type conversions and canonicalizations that remove redundant back-and-forth cast chains, all in translation, which sounds messy and poorly maintainable.

I don't see why this would be the case. As mentioned above, the loop, whatever "op" it might be, belongs to the omp do. There is no omp do without loop, there is no "loop" once omp do has been lowered (to runtime calls). The omp do lowering is also not duplicating scf.for code if you use the OpenMPIRBuilder.

It sounds like a satisfactory solution if we have an explicit loop-like construct in the OpenMP dialect, compatible with the LLVM dialect type system. At translation time, it is expected to contain LLVM+OpenMP dialect inside and the translation itself is straightforward thanks to a dedicated OpenMPIRBuilder method. It's still suboptimal to test this within MLIR, but we could trust OpenMPIRBuilder to be tested properly in LLVM.

Please, do not duplicate OpenMPIRBuilder functionality.

+1.

I see. So you are suggesting just preserving the omp.do all the way into the LLVM dialect and then use the OpenMPIRBuilder in the MLIR LLVM dialect to LLVM IR *translator*?

This was one of my initial concerns, the translator should be kept simple and OpenMPIRBuilder did not have a "create loop" function when I looked at it. So it was unclear to me if the folks implementing OpenMP intend to replicate it within MLIR, extend it in LLVM or do something else, hence my request for an RFC+discussion.

Two other options that are worth considering are:

  • using an attribute to annotate SCF loops as OpenMP loops; this adds a verification hook without requiring to duplicate an operation but the layering is still not very clear to me;
  • making OpenMPIRBuilder somehow extensible so that it can build the LLVM dialect instead of LLVM IR, at which point we can have an OpenMP-to-LLVM dialect conversion that uses it directly and is testable within MLIR, and an actually trivial translation.

I would be interested to hear from @mehdi_amini and @nicolasvasilache among others... @jdoerfert seems to have an account on Discourse, so he would normally receive an email if mentioned explicitly even if he doesn't follow the forum in general.

The question I have for @jdoerfert is which one of these will be preferable for the OpenMP IRBuilder? And the question I have for @ftynse is what is the issue that is there in these schemes?
First, the loop belongs to the omp.do. If you want to lower an omp.do (w/ or w/o the OpenMPIRBuilder) you need the loop information (=bounds + step). The loop body is irrelevant at this point. The interface will somewhat look like this: InsertPos CreateWorksharingLoop(..., LowerBound, UpperBound, Step, ..., BodyCodeGenCallback)

Thanks @jdoerfert. I am thinking that there is some additional requirement since (for.eg. for a static schedule) the kmpc_static_init call has to be inserted in the loop header, the kmpc_static_fini has to be inserted in the loop footer.

The question I have for @jdoerfert is which one of these will be preferable for the OpenMP IRBuilder? And the question I have for @ftynse is what is the issue that is there in these schemes?

If scf.for persists until the point where we perform the MLIR-to-LLVM-IR translation, the translation will have to do (the equivalent of) scf-to-std and std-to-llvm passes. which contradicts the "progressive lowering" principle of MLIR. (Note that we originally had a translation from the standard dialect directly to LLVM IR, but we introduced the LLVM dialect specifically to avoid that translation growing in complexity). If there is no explicit loop, the translator will have to analyze the CFG and essentially raise back to the loop form to be able to call the OpenMP IRBuilder, which would expect loop bounds.

To actually make the loops persist, we would need some fine-grained control over the SCF-to-std lowering that would ignore loops contained in omp.do somehow, but only the outermost (?). The entire layering of scf-to-std lowering depending on the OpenMP dialect is not clear to me. There are also hard edges on dialect mixture due to type system differences: SCF does not work on LLVM types and LLVM operations don't work on standard types. We'd need cast operations, type conversions and canonicalizations that remove redundant back-and-forth cast chains, all in translation, which sounds messy and poorly maintainable.

The broad plan is to always to get to OpenMP operation with LLVM dialect. So the plan is not for scf to persist until MLIR to LLVM IR translation. The plan is to have a conversion pattern (or does this fall under transformations?) which converts an scf.for nested inside an omp.do to a loop like operation in the OpenMP dialect. A user can invoke this with the -convert-openmp-to-llvm conversion option with mlir-opt. The question I had is that most of the loops in MLIR seems to be SizedRegion<1>. Is there any issue with having a loop like operation (with bounds, step) and is Anyregion with multiple blocks? The spv.loop seems to be AnyRegion but is a collection of blocks with control flow but has a well-defined structure and no bounds.

I don't see why this would be the case. As mentioned above, the loop, whatever "op" it might be, belongs to the omp do. There is no omp do without loop, there is no "loop" once omp do has been lowered (to runtime calls). The omp do lowering is also not duplicating scf.for code if you use the OpenMPIRBuilder.

It sounds like a satisfactory solution if we have an explicit loop-like construct in the OpenMP dialect, compatible with the LLVM dialect type system. At translation time, it is expected to contain LLVM+OpenMP dialect inside and the translation itself is straightforward thanks to a dedicated OpenMPIRBuilder method. It's still suboptimal to test this within MLIR, but we could trust OpenMPIRBuilder to be tested properly in LLVM.

OK. I was worried you had an objection here.

Please, do not duplicate OpenMPIRBuilder functionality.

+1.

I see. So you are suggesting just preserving the omp.do all the way into the LLVM dialect and then use the OpenMPIRBuilder in the MLIR LLVM dialect to LLVM IR *translator*?

This was one of my initial concerns, the translator should be kept simple and OpenMPIRBuilder did not have a "create loop" function when I looked at it. So it was unclear to me if the folks implementing OpenMP intend to replicate it within MLIR, extend it in LLVM or do something else, hence my request for an RFC+discussion.

There were more things to discuss, like

  1. should we have both the directive like omp.do operation and the loop like omp.do operation (name can be omp.wsloop) or should there be only the loop like omp.do operation. @DavidTruby was writing an RFC on this.
  2. How flang fits into the picture. The FIR developers informed that scf.for might not be in the path and it is possible that fir.do loop operation might get converted directly to LLVM dialect. In this case there has to be additional conversions or passes in Flang to convert and fir.do inside an omp.do. Can the loop like omp.do operation be the target of fir.do that is concurrent?
  3. Should we handle affine.for nested inside an omp.do? Is affine.for guaranteed to be converted to scf.for at some time and will the handling of scf.for inside omp.do automatically kick in?
  4. Should we use the information that the omp.do loop is parallel to do additional optimisation or even convert to something like affine loops if it is possible?

Two other options that are worth considering are:

  • using an attribute to annotate SCF loops as OpenMP loops; this adds a verification hook without requiring to duplicate an operation but the layering is still not very clear to me;

I was thinking for operations that are already parallel there will be a conversion operation to convert to the OpenMP dialect. -convert-scf-parallel-to-openmp. This can convert the parallel scf loop to the loop like operation in the OpenMP dialect.

  • making OpenMPIRBuilder somehow extensible so that it can build the LLVM dialect instead of LLVM IR, at which point we can have an OpenMP-to-LLVM dialect conversion that uses it directly and is testable within MLIR, and an actually trivial translation.

I would be interested to hear from @mehdi_amini and @nicolasvasilache among others... @jdoerfert seems to have an account on Discourse, so he would normally receive an email if mentioned explicitly even if he doesn't follow the forum in general.

We are OK with submitting the RFC in discourse.

DavidTruby added a comment.EditedFri, Sep 18, 7:15 AM

I have submitted an RFC on the openmp do loop design here: https://llvm.discourse.group/t/openmp-worksharing-loop-rfc/1815