This is an archive of the discontinued LLVM Phabricator instance.

[mlir] Add conversion from SCF parallel loops to OpenMP
ClosedPublic

Authored by ftynse on Nov 23 2020, 10:45 AM.

Details

Summary

Introduce a conversion pass from SCF parallel loops to OpenMP dialect
constructs - parallel region and workshare loop. Loops with reductions are not
supported because the OpenMP dialect cannot model them yet.

The conversion currently targets only one level of parallelism, i.e. only
one top-level omp.parallel operation is produced even if there are nested
scf.parallel operations that could be mapped to omp.wsloop. Nested
parallelism support is left for future work.
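
As a rough illustration of the intended mapping (the omp.wsloop assembly syntax below is approximate and loop-body terminators are omitted), the pass is expected to rewrite

  scf.parallel (%i) = (%lb) to (%ub) step (%step) {
    // loop body
  }

into

  omp.parallel {
    omp.wsloop (%i) : index = (%lb) to (%ub) step (%step) {
      // loop body
    }
    omp.terminator
  }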

Diff Detail

Event Timeline

ftynse created this revision. Nov 23 2020, 10:45 AM
ftynse requested review of this revision. Nov 23 2020, 10:45 AM
ftynse updated this revision to Diff 307123. Nov 23 2020, 10:54 AM

Add cmake

ftynse updated this revision to Diff 307124. Nov 23 2020, 11:00 AM

Sprinkle around a bit more documentation

Nice. Looks like you missed adding test cases.

mlir/lib/Conversion/SCFToOpenMP/SCFToOpenMP.cpp
25

Brief doc comment please.

58

Missing doc comment.

78

Comment here or somewhere in the block.

88

func.getContext() and make this method static?

ftynse updated this revision to Diff 307133. Nov 23 2020, 11:26 AM

Forgotten git add

ftynse updated this revision to Diff 307134. Nov 23 2020, 11:29 AM
ftynse marked 3 inline comments as done.

Address (most of) the review

ftynse updated this revision to Diff 307201. Nov 23 2020, 2:32 PM
ftynse marked an inline comment as done.

Address review

Thanks for this patch.

We should be able to do something similar for the Fortran "Do concurrent" loop. @schweitz

mlir/lib/Conversion/SCFToOpenMP/SCFToOpenMP.cpp
24

Nit: OpenMP parallel loop or OpenMP parallel + workshare loop.

68

Are you not interested in nested parallelism?

SouraVX added a comment. Edited Nov 24 2020, 7:46 AM

Please ignore if this is out of context.
So Flang lowers all loops into fir.do_loop (which is essentially a superset of scf.for), and then it may lower them to scf.for (depending on feasibility). So the question I have is: what is the intended lowering targeted here for the OpenMP worksharing loop?

FORTRAN -> fir.loop -> scf.for -> ws.loop
OR 
FORTRAN -> fir.loop -> ws.loop

OR something else ?

Is this already discussed/finalized in an RFC?
If this patch intends to implement some part of the RFC discussion, would you mind adding the link here?
Thanks!

This is fully orthogonal to Flang. I have code that comes from higher-level abstractions in core MLIR (Linalg) to SCF, and I need this code to run in parallel on a platform that supports OpenMP. Flang can choose to target the SCF dialect or the OpenMP dialect regardless of what is done here. If you think this needs an RFC within the core MLIR context, I can send one, but to me this looks like relatively straightforward plumbing work.
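
For example, something along the lines of the following pipeline should then take Linalg-generated parallel loops to OpenMP constructs (pass names approximate):

  mlir-opt -convert-linalg-to-parallel-loops -convert-scf-to-openmp input.mlir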

mlir/lib/Conversion/SCFToOpenMP/SCFToOpenMP.cpp
24

This pattern does not create the OpenMP parallel operation, only the workshare loop, so I maintain that the original comment is correct and the suggested replacement would be misleading.

68

I don't have a use case for it as of yet. We can add more features when we need them.

LGTM.

Do you need the lowering to LLVM IR also sometime soon?

Previously, did you mention something about a parallelisation strategy and SCF ops carrying an attribute to determine whether they should be lowered to OpenMP parallel loops?

@SouraVX we can discuss the lowering separately for Flang. Since we decided to represent the work-sharing loop as a loop-like operation (omp.wsloop), lowering directly to the work-sharing loop from the parse tree will be the straightforward method. If there are difficulties with that, we can think of lowering to fir.do_loop and then converting to omp.wsloop, but I think this might need modification of fir.do_loop or the addition of a directive-like operation in FIR.

mlir/lib/Conversion/SCFToOpenMP/SCFToOpenMP.cpp
68

OK. Can the commit message/description carry this info?

This revision is now accepted and ready to land. Nov 24 2020, 9:15 AM
ftynse edited the summary of this revision. Nov 24 2020, 9:24 AM
ftynse marked 2 inline comments as done. Nov 24 2020, 9:34 AM

Thanks for the quick review!

I am in the process of writing a translation to LLVM IR, happy to use anything that is already available.

We discussed parallelization strategies in several ODMs, but there was no firm decision. Most of the discussion was in the GPU context, I believe, where we need not only to say which loops remain parallel but also how they are mapped to the device. We can add annotations that this pass would consume. Alternatively, it is possible to first convert to scf.for any scf.parallel that should not go to OpenMP, and then just run a blanket conversion of all remaining parallel operations. We already have the former conversion implemented; we can just as well add a way to control it at a finer grain.

We can expose any parts of this transformation somewhere in Utils.h if you find them useful for Flang.

mlir/lib/Conversion/SCFToOpenMP/SCFToOpenMP.cpp
68

Sure, will put this into the commit message and add a TODO here.

LGTM.

Do you need the lowering to LLVM IR also sometime soon?

Previously, did you mention something about a parallelisation strategy and SCF ops carrying an attribute to determine whether they should be lowered to OpenMP parallel loops?

@SouraVX we can discuss the lowering separately for Flang. Since we decided to represent the work-sharing loop as a loop-like operation (omp.wsloop), lowering directly to the work-sharing loop from the parse tree will be the straightforward method. If there are difficulties with that, we can think of lowering to fir.do_loop and then converting to omp.wsloop, but I think this might need modification of fir.do_loop or the addition of a directive-like operation in FIR.

Thanks for the quick review!

I am in the process of writing a translation to LLVM IR, happy to use anything that is already available.

@Meinersbur added a canonical loop in the OpenMPIRBuilder (https://reviews.llvm.org/D90830). The idea was to create the worksharing loop using this canonical loop.
FYI, we have a call (4pm BST/11am ET) for the OpenMP for Flang work where the OpenMP dialect work is also discussed. Links to the call and the minutes can be found below.
https://teams.microsoft.com/l/meetup-join/19%3ameeting_ZjhiMzZiOWMtNTU5Ni00YzljLWEwNDctYjZkNTZhODg2ZTIx%40thread.v2/0?context=%7b%22Tid%22%3a%223dd8961f-e488-4e60-8e11-a82d994e183d%22%2c%22Oid%22%3a%22ce0d0f4a-6fbf-4ba7-b0c7-20184e3bcb53%22%7d
https://docs.google.com/document/d/1yA-MeJf6RYY-ZXpdol0t7YoDoqtwAyBhFLr5thu5pFI/edit#

We discussed parallelization strategies in several ODMs, but there was no firm decision. Most of the discussion was in the GPU context, I believe, where we need not only to say which loops remain parallel but also how they are mapped to the device. We can add annotations that this pass would consume. Alternatively, it is possible to first convert to scf.for any scf.parallel that should not go to OpenMP, and then just run a blanket conversion of all remaining parallel operations. We already have the former conversion implemented; we can just as well add a way to control it at a finer grain.

We can expose any parts of this transformation somewhere in Utils.h if you find them useful for Flang.

OK, thanks for the explanation. We might need it (not sure though).

This revision was automatically updated to reflect the committed changes.
ftynse marked 2 inline comments as done.

LGTM.

Do you need the lowering to LLVM IR also sometime soon?

I have something that works here: https://reviews.llvm.org/D92055, but it needs some discussion.

Previously, did you mention something about a parallelisation strategy and SCF ops carrying an attribute to determine whether they should be lowered to OpenMP parallel loops?

@SouraVX we can discuss the lowering separately for Flang. Since we decided to represent the work-sharing loop as a loop-like operation (omp.wsloop), lowering directly to the work-sharing loop from the parse tree will be the straightforward method. If there are difficulties with that, we can think of lowering to fir.do_loop and then converting to omp.wsloop, but I think this might need modification of fir.do_loop or the addition of a directive-like operation in FIR.

Thanks for the quick review!

I am in the process of writing a translation to LLVM IR, happy to use anything that is already available.

@Meinersbur added a canonical loop in the OpenMPIRBuilder (https://reviews.llvm.org/D90830). The idea was to create the worksharing loop using this canonical loop.

Familiar names :) I indeed intended to use the canonical loop creation functionality.

FYI, we have a call (4pm BST/11am ET) for the OpenMP for Flang work where the OpenMP dialect work is also discussed. Links to the call and the minutes can be found below.
https://teams.microsoft.com/l/meetup-join/19%3ameeting_ZjhiMzZiOWMtNTU5Ni00YzljLWEwNDctYjZkNTZhODg2ZTIx%40thread.v2/0?context=%7b%22Tid%22%3a%223dd8961f-e488-4e60-8e11-a82d994e183d%22%2c%22Oid%22%3a%22ce0d0f4a-6fbf-4ba7-b0c7-20184e3bcb53%22%7d
https://docs.google.com/document/d/1yA-MeJf6RYY-ZXpdol0t7YoDoqtwAyBhFLr5thu5pFI/edit#

Good to know, thanks! I may drop by some time.

We discussed parallelization strategies in several ODMs, but there was no firm decision. Most of the discussion was in the GPU context, I believe, where we need not only to say which loops remain parallel but also how they are mapped to the device. We can add annotations that this pass would consume. Alternatively, it is possible to first convert to scf.for any scf.parallel that should not go to OpenMP, and then just run a blanket conversion of all remaining parallel operations. We already have the former conversion implemented; we can just as well add a way to control it at a finer grain.

We can expose any parts of this transformation somewhere in Utils.h if you find them useful for Flang.

OK, thanks for the explanation. We might need it (not sure though).

Don't hesitate to ping me.