This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP Dialect] Add omp.canonical_loop operation.
Needs ReviewPublic

Authored by shraiysh on Jul 19 2023, 5:03 PM.

Details

Summary

This patch continues the work of D147658. It adds the omp.canonical_loop operation as the basic block for everything loop-related in OpenMP, such as worksharing-loop, distribute, loop transformation, etc.

In contrast to the current omp.wsloop approach

  • Loop-related semantics need to be implemented only once
  • Is composable with OpenMP loop transformations such as unrolling, tiling.
  • Is supposed to eventually support non-rectangular loops
  • Supports expressing non-perfectly nested loops

This patch only adds the MLIR representation; to something useful, I still have to implement lowering from Flang with at least the DO construct, and lowering to LLVM-IR using the OpenMPIRBuilder.

The pretty syntax currently is

omp.canonical_loop $iv in [0, %tripcount) { ... }

where [0, %tripcount) represents the half-open integer range of an OpenMP logical iteration space. Unbalanced parentheses/brackets and 0 keyword might not be universally liked. I could think of alternatives such as

omp.canonical_loop $iv = range(%tripcount) { ... }

Diff Detail

Event Timeline

Meinersbur created this revision.Jul 19 2023, 5:03 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 19 2023, 5:03 PM
Meinersbur requested review of this revision.Jul 19 2023, 5:03 PM

Thanks @Meinersbur for this patch. It is also exciting to see your patches in MLIR. :)

This patch only adds the MLIR representation; to something useful, I still have to implement lowering from Flang with at least the DO construct, and lowering to LLVM-IR using the OpenMPIRBuilder.

I will give this a quick try to see how this fits in with the lowering from Flang. Previously, we were against the [0,tripcount) representation, now with a better understanding we can hopefully make it work without much hassle.

mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
440

SingleBlockImplicitTerminator will enforce the requirement for a single block in the loop and hence would disallow branches inside the loop.

443
443

While this representation definitely conforms to one of the representations mentioned in the OpenMP canonical loop definition, it does not directly cover all the forms. We might have to word it slightly differently.

462
mlir/test/Dialect/OpenMP/cli.mlir
81–85

Nit: empty lines

shraiysh added a subscriber: shraiysh.EditedAug 25 2023, 7:15 AM

Hi @Meinersbur. Thank you for this patch. This looks like it would make the IR cleaner. It isn't clear to me how the loop info result of this operation is going to be used in the IR. Can you please elaborate on that? Also, an example of how it would look under omp.parallel would be helpful to understand.

jsjodin added inline comments.
mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
523

I'm trying to understand how the yield works. What determines if an inner canonical loop should/must be yielded and be part or the result of an outer canonical loop? Can there be multiple non-nested canonical loops inside the body of a canonical loop?

Meinersbur added inline comments.Aug 25 2023, 12:23 PM
mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
523

The inner loop should always be yielded if to be considered a canonical loop nest in the OpenMP sense. Multiple loops are returned for deeper loops nests -- see cli.mlir for examples.

There can be other non-canonical loops nested in the loop body. They are basically ignored.

OpenMP does not allow two canonical loops at the same level (e.g. sequentially executed), only nested within the other.

jsjodin added inline comments.Aug 25 2023, 1:53 PM
mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
523

The inner loop should always be yielded if to be considered a canonical loop nest in the OpenMP sense. Multiple loops are returned for deeper loops nests -- see cli.mlir for examples.

There can be other non-canonical loops nested in the loop body. They are basically ignored.

OpenMP does not allow two canonical loops at the same level (e.g. sequentially executed), only nested within the other.

Thanks for the clarification, adding this op looks reasonable to me.

domada added a subscriber: domada.Aug 28 2023, 7:26 AM

Hi Michael,
thank you very much for your work.

Could you provide a high level (Fortran/C) example with OpenMP pragmas and show the expected MLIR code? Do I understand correctly that the code:

!$omp parallel do
do index_ = 1, 10
    do_body(index)
end do
!$omp end parallel do

will be lowered to the MLIR code:

omp.parallel {
   omp. do {
      omp.canonical_loop %index : i32 in [0, 10) {
          do_body(%index)
      }
   }
}
shraiysh commandeered this revision.EditedAug 28 2023, 5:29 PM
shraiysh added a reviewer: Meinersbur.

After discussing with Michael, I will be trying to implement this operation. @domada I have added an example of intended use in the operation with teams distribute construct. Hope it helps.

mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
440

The idea behind allowing only one block is to be able to statically infer the exact CanonicalLoopInfo objects to apply transformations to. We discussed about branching, and while it has not been added right now, we were considering adding an omp.region (name suggestions welcome) later, which would allow us to have branching within canonical loop, while having concrete information about CLIs.

%outer, %inner = omp.canonical_loop {
  %innertmp = omp.canonical_loop {
    omp.region {
      // ... branch within canonical loop
    }
    omp.yield
  }
  omp.yield %innertmp
}

What do you think about this, @kiranchandramohan? (@Meinersbur please correct me if I am wrong).

443

Is the current wording okay?

shraiysh updated this revision to Diff 554116.Aug 28 2023, 5:29 PM

Address comments and nits

shraiysh marked 5 inline comments as done.Aug 28 2023, 5:33 PM
shraiysh added reviewers: raghavendhra, dpalermo.
shraiysh updated this revision to Diff 554190.Aug 28 2023, 11:01 PM

Fix build by adding enums and attributes in CMakeLists.txt

After discussing with Michael, I will be trying to implement this operation. @domada I have added an example of intended use in the operation with teams distribute construct. Hope it helps.

I am thinking we might not have to name the canonical loops and use yield, since the structure is simply a list of nested loops they can be identified by nesting level if they need to be referred to at all.
So could we just nest the operations instead like below?

omp.collapse {
  omp.canonical_loop {
    omp.canonical _loop {
      omp.region { .... }
    }
  }
}
shraiysh added a comment.EditedAug 29 2023, 11:27 AM

I like @jsjodin's idea about nesting these. It avoids generating problematic IR like this

%cli1, %cli2, %cli3 = omp.canonical_loop ... {
  %cli2, %cli3 = omp.canonical_loop ... {
    %cli3 = omp.canonical_loop ... {
    }
    omp.yield(%cli3)
  }
  omp.yield(%cli2, %cli3)
}
%mergedcli = omp.collapse(%cli1, %cli3) // Problem here - collapsing innermost and outermost loops

What do you think @Meinersbur?

I like @jsjodin's idea about nesting these. It avoids generating problematic IR like this

%cli1, %cli2, %cli3 = omp.canonical_loop ... {
  %cli2, %cli3 = omp.canonical_loop ... {
    %cli3 = omp.canonical_loop ... {
    }
    omp.yield(%cli3)
  }
  omp.yield(%cli2, %cli3)
}
%mergedcli = omp.collapse(%cli1, %cli3) // Problem here - collapsing innermost and outermost loops

What do you think @Meinersbur?

Nesting is sufficient to model upto OpenMP 5.2. But it will not be able to model clauses like apply that are coming in the next version of the standard. @Meinersbur was very particular about this and clarified in https://discourse.llvm.org/t/mlir-summit-openmp-roundtable-discussion-summary/66574/4.

I like @jsjodin's idea about nesting these. It avoids generating problematic IR like this

%cli1, %cli2, %cli3 = omp.canonical_loop ... {
  %cli2, %cli3 = omp.canonical_loop ... {
    %cli3 = omp.canonical_loop ... {
    }
    omp.yield(%cli3)
  }
  omp.yield(%cli2, %cli3)
}
%mergedcli = omp.collapse(%cli1, %cli3) // Problem here - collapsing innermost and outermost loops

What do you think @Meinersbur?

Nesting is sufficient to model upto OpenMP 5.2. But it will not be able to model clauses like apply that are coming in the next version of the standard. @Meinersbur was very particular about this and clarified in https://discourse.llvm.org/t/mlir-summit-openmp-roundtable-discussion-summary/66574/4.

Looking at the example on discourse I still don't see why it is not sufficient to nest operations, given that an index can be used to refer to any canonical loop in a nest. The example shows this:

%outer = omp.canonical_loop for (%c0) to (%c10) step (%c1) {
  %inner = omp.canonical_loop for (%d0) to (%d10) step (%d1) {
    ..
  }
}
%tiled:4 = omp.tile loops(%outer,%inner) { tile_sizes=[4,4] } ;
omp.ws loops(%tiled#0)

An equivalent nested version would be:

omp.ws {
  omp.tile_loops { tilze_sizes = [4,4], loops = [0,1] } {
    omp.canonical_loop for (%c0) to (%c10) step (%c1) {
      omp.canonical_loop for (%d0) to (%d10) step (%d1) {
        ..
      }
    }
  }
}

I like @jsjodin's idea about nesting these. It avoids generating problematic IR like this

%cli1, %cli2, %cli3 = omp.canonical_loop ... {
  %cli2, %cli3 = omp.canonical_loop ... {
    %cli3 = omp.canonical_loop ... {
    }
    omp.yield(%cli3)
  }
  omp.yield(%cli2, %cli3)
}
%mergedcli = omp.collapse(%cli1, %cli3) // Problem here - collapsing innermost and outermost loops

What do you think @Meinersbur?

Nesting is sufficient to model upto OpenMP 5.2. But it will not be able to model clauses like apply that are coming in the next version of the standard. @Meinersbur was very particular about this and clarified in https://discourse.llvm.org/t/mlir-summit-openmp-roundtable-discussion-summary/66574/4.

Looking at the example on discourse I still don't see why it is not sufficient to nest operations, given that an index can be used to refer to any canonical loop in a nest. The example shows this:

%outer = omp.canonical_loop for (%c0) to (%c10) step (%c1) {
  %inner = omp.canonical_loop for (%d0) to (%d10) step (%d1) {
    ..
  }
}
%tiled:4 = omp.tile loops(%outer,%inner) { tile_sizes=[4,4] } ;
omp.ws loops(%tiled#0)

An equivalent nested version would be:

omp.ws {
  omp.tile_loops { tilze_sizes = [4,4], loops = [0,1] } {
    omp.canonical_loop for (%c0) to (%c10) step (%c1) {
      omp.canonical_loop for (%d0) to (%d10) step (%d1) {
        ..
      }
    }
  }
}

I meant @Meinersbur 's reply (https://discourse.llvm.org/t/mlir-summit-openmp-roundtable-discussion-summary/66574/5) to my question, that shows an example that tiles the loop and then unrolls the inner-loop. Copying it here for reference.

#pragma omp tile sizes(4) apply(intratile:unroll)
for (int i = 0; i < 64; ++i) ;

which is equivalent to:

#pragma omp 
for (int i1 = 0; i1 < 64; i1+=4) 
  #pragma omp unroll
  for (int i = i1; i < i1+4; ++i) ;

I like @jsjodin's idea about nesting these. It avoids generating problematic IR like this

%cli1, %cli2, %cli3 = omp.canonical_loop ... {
  %cli2, %cli3 = omp.canonical_loop ... {
    %cli3 = omp.canonical_loop ... {
    }
    omp.yield(%cli3)
  }
  omp.yield(%cli2, %cli3)
}
%mergedcli = omp.collapse(%cli1, %cli3) // Problem here - collapsing innermost and outermost loops

What do you think @Meinersbur?

Nesting is sufficient to model upto OpenMP 5.2. But it will not be able to model clauses like apply that are coming in the next version of the standard. @Meinersbur was very particular about this and clarified in https://discourse.llvm.org/t/mlir-summit-openmp-roundtable-discussion-summary/66574/4.

Looking at the example on discourse I still don't see why it is not sufficient to nest operations, given that an index can be used to refer to any canonical loop in a nest. The example shows this:

%outer = omp.canonical_loop for (%c0) to (%c10) step (%c1) {
  %inner = omp.canonical_loop for (%d0) to (%d10) step (%d1) {
    ..
  }
}
%tiled:4 = omp.tile loops(%outer,%inner) { tile_sizes=[4,4] } ;
omp.ws loops(%tiled#0)

An equivalent nested version would be:

omp.ws {
  omp.tile_loops { tilze_sizes = [4,4], loops = [0,1] } {
    omp.canonical_loop for (%c0) to (%c10) step (%c1) {
      omp.canonical_loop for (%d0) to (%d10) step (%d1) {
        ..
      }
    }
  }
}

I meant @Meinersbur 's reply (https://discourse.llvm.org/t/mlir-summit-openmp-roundtable-discussion-summary/66574/5) to my question, that shows an example that tiles the loop and then unrolls the inner-loop. Copying it here for reference.

#pragma omp tile sizes(4) apply(intratile:unroll)
for (int i = 0; i < 64; ++i) ;

which is equivalent to:

#pragma omp 
for (int i1 = 0; i1 < 64; i1+=4) 
  #pragma omp unroll
  for (int i = i1; i < i1+4; ++i) ;

Yes. that is the example I was referring to. I just copied the MLIR, but didn't bother to put in the OpenMP code, so thanks for doing that.

I meant @Meinersbur 's reply (https://discourse.llvm.org/t/mlir-summit-openmp-roundtable-discussion-summary/66574/5) to my question, that shows an example that tiles the loop and then unrolls the inner-loop. Copying it here for reference.

#pragma omp tile sizes(4) apply(intratile:unroll)
for (int i = 0; i < 64; ++i) ;

which is equivalent to:

#pragma omp 
for (int i1 = 0; i1 < 64; i1+=4) 
  #pragma omp unroll
  for (int i = i1; i < i1+4; ++i) ;

Yes. that is the example I was referring to. I just copied the MLIR, but didn't bother to put in the OpenMP code, so thanks for doing that.

I think, I am missing the point here. But just to confirm, the example above has only a single loop on which a tiling transformation is performed followed by an unroll on the inner loop. The approach proposed in this patch I believe would model it as:

#pragma omp tile sizes(4) apply(intratile:unroll)
for (int i = 0; i < 64; ++i) ;

%lp = omp.canonical_loop for (%c0) to (%c64) step (%c1) {
    ..
}
%tiled:2 = omp.tile loops(%lp) { tile_sizes=[4] } 
%unrolled = omp.unroll loops(%tiled#1)

Representing this with the nested approach would require specifying which loop the unroll should apply and since it is not the outermost loop, we will have to specify with an index. I am using loop(n) with an integer argument to specify the loop at the depth that we want (0 for the loop enclosed, 1 for the following loop and so on).

omp.unroll loop(1)
  omp.tile loop (0) { tile_sizes=[4] } {
    omp.canonical_loop for (%c0) to (%c64) step (%c1) {
      ..
    }
  }
}

I don't know whether coming up with an integer argument is always possible. We will have to effectively interpret the whole sequence of transformations to come up with this number. And with loop-fission (not in 6.0) there will be multiple loops at a nesting level and to index them accurately, each nesting level will need a multi-dimensional index.

mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
440

Sounds OK to me. omp.region looks like a nice Op. Might be useful for easing lowering in other cases too.

Keep in mind that this will be required quite soon since even an if statement gets converted to branches by the time we are in LLVM dialect.

443

Yes, the wording looks fine.

I meant @Meinersbur 's reply (https://discourse.llvm.org/t/mlir-summit-openmp-roundtable-discussion-summary/66574/5) to my question, that shows an example that tiles the loop and then unrolls the inner-loop. Copying it here for reference.

#pragma omp tile sizes(4) apply(intratile:unroll)
for (int i = 0; i < 64; ++i) ;

which is equivalent to:

#pragma omp 
for (int i1 = 0; i1 < 64; i1+=4) 
  #pragma omp unroll
  for (int i = i1; i < i1+4; ++i) ;

Yes. that is the example I was referring to. I just copied the MLIR, but didn't bother to put in the OpenMP code, so thanks for doing that.

I think, I am missing the point here. But just to confirm, the example above has only a single loop on which a tiling transformation is performed followed by an unroll on the inner loop. The approach proposed in this patch I believe would model it as:

#pragma omp tile sizes(4) apply(intratile:unroll)
for (int i = 0; i < 64; ++i) ;

%lp = omp.canonical_loop for (%c0) to (%c64) step (%c1) {
    ..
}
%tiled:2 = omp.tile loops(%lp) { tile_sizes=[4] } 
%unrolled = omp.unroll loops(%tiled#1)

Representing this with the nested approach would require specifying which loop the unroll should apply and since it is not the outermost loop, we will have to specify with an index. I am using loop(n) with an integer argument to specify the loop at the depth that we want (0 for the loop enclosed, 1 for the following loop and so on).

Yes, that is correct, and it looks like in the example code if I understand the notation correctly we are using indices to specify the loops. 'Does '''%tiled:2''' mean it produces two loops and does '''%tiled#1''' mean index 1 of the loops?

omp.unroll loop(1)
  omp.tile loop (0) { tile_sizes=[4] } {
    omp.canonical_loop for (%c0) to (%c64) step (%c1) {
      ..
    }
  }
}

I don't know whether coming up with an integer argument is always possible. We will have to effectively interpret the whole sequence of transformations to come up with this number. And with loop-fission (not in 6.0) there will be multiple loops at a nesting level and to index them accurately, each nesting level will need a multi-dimensional index.

Ah okay, I think that I wasn't clear about what the nesting operations do with respect to the loop information. Each operation would transform the loop information, so internally it would be similar to the current proposal. It is preferable to avoid having a representation that allows multiple ways to create malformed IR, which I think the nesting would avoid.

There are two possible approaches for the unrolled operation here -

%unrolled1 = omp.unroll loops(%tiled) at(1)
%unrolled2 = omp.unroll loops(%tiled#1)

The good thing with the first approach, with at(1) - is that because we have both the loops as input, we transform the inner loop and we have the new single loop as output which can directly be used in other constructs. Note that this is similar to Jan's suggestion about nesting and maintaining indices.
The problem with the second approach is that the input is only the inner loop and the output (%unrolled2) should represent a structured block (not a loop) which should be captured within the loop %tiled#0 somehow. (maybe an omp.merge_canonical_loop_with_body operation?).

There are two possible approaches for the unrolled operation here -

%unrolled1 = omp.unroll loops(%tiled) at(1)
%unrolled2 = omp.unroll loops(%tiled#1)

I realized that we cannot really do the first approach, because unroll could be present without nested loops. The nesting of openmp constructs could help with this. It might require complex indexing though. @kiranchandramohan do we have the details on loop-fission or the loop-modifiers for apply (like intratile)? I checked the technical report document here but it doesn't mention too much detail about these modifiers.

Moved to https://github.com/llvm/llvm-project/pull/65380 because phabricator has been unresponsive lately (I have to hit reply multiple times before it sends one).