This is an archive of the discontinued LLVM Phabricator instance.

[OpenMPIRBuilder} Add capturing of parameters to pass to omp::parallel
Needs ReviewPublic

Authored by llitchev on Nov 16 2020, 11:17 AM.

Details

Reviewers
jdoerfert
ftynse
Summary

The omp::ParallelOp is translated to a callback that is called for each thread. It uses varargs, but the parameter passing is not working properly with SSE(UP) parameter types. Thus, the need to capture the parameters into an alloca-ed struct and pass that to the callback.

Diff Detail

Event Timeline

llitchev created this revision.Nov 16 2020, 11:17 AM
llitchev requested review of this revision.Nov 16 2020, 11:17 AM
rriddle added inline comments.Nov 16 2020, 11:20 AM
mlir/lib/Target/LLVMIR/ModuleTranslation.cpp
934

I would not expect the translation to modify the input module.

llitchev added inline comments.Nov 16 2020, 10:26 PM
mlir/lib/Target/LLVMIR/ModuleTranslation.cpp
934

Thanks! That makes sense - I just never thought that a translator is actually translating, input and not modifying it. Moving this code to OpenMPToLLVM converter.

ftynse requested changes to this revision.Nov 17 2020, 3:43 PM
ftynse added inline comments.
mlir/lib/Target/LLVMIR/ModuleTranslation.cpp
934

Translation is supposed to be as simple as possible. We have a dedicated pass preparing an MLIR module in the LLVM dialect for translation - LegalizeForExport. Something similar can be introduced on the OpenMP dialect, and the translator just reject the inputs it cannot handle.

(On a side note, we'd better refactor the translator in such a way that it no longer needs to know about OpenMP)

This revision now requires changes to proceed.Nov 17 2020, 3:43 PM

Thanks @llitchev for this patch.

FYI @jdoerfert was suggesting to fix this issue in the OpenMPIRBuilder by setting /* AggregateArgs */ to true in the CodeExtractor in /llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp.
This hit couple of issues,

  1. When the ArtificialEntryBlock is removed in OpenMPIRBuilder::finalize() there are some uses that are left hanging.
  2. Once the above was fixed it hit an assertion in the following place.

OpenMPIRBuilder::createParallel

assert(OutlinedFn.arg_size() >= 2 &&
       "Expected at least tid and bounded tid as arguments");
llitchev updated this revision to Diff 305963.Nov 17 2020, 8:18 PM

Created a small pass to capture omp::ParallelOp parameters.

llitchev marked an inline comment as done.Nov 17 2020, 8:20 PM
llitchev added inline comments.
mlir/lib/Target/LLVMIR/ModuleTranslation.cpp
934

I created a simple pass that goes after the ConvertOpenMPToLLVM one. It looks like a better place for this than the translator.

Thanks, this looks better! I have a couple of further comments.

mlir/include/mlir/Conversion/OpenMPToLLVM/ConvertOpenMPToLLVM.h
30 ↗(On Diff #305963)

This could be a function pass instead.

mlir/include/mlir/Conversion/Passes.td
227 ↗(On Diff #305963)

Please try to fit 80 cols.

228 ↗(On Diff #305963)

Triple backslashes? We could just use single quotes inside or the code-block syntax outside...

mlir/lib/Conversion/OpenMPToLLVM/OpenMPToLLVM.cpp
27 ↗(On Diff #305963)

Why is this necessary?

59 ↗(On Diff #305963)

Nit: elide trivial braces

64 ↗(On Diff #305963)

Nit: MLIR uses auto when it improves readability, e.g. if the type is already mentioned on the RHS (casts) or if it's too long to spell out. LLVMType looks just fine here.

mlir/lib/Target/LLVMIR/ModuleTranslation.cpp
20

Please drop these

mlir/test/Conversion/OpenMPToLLVM/openmp_float-parallel_param.mlir
3

Could we please make this test minimal and only exercise the functionality that the patch is adding? I don't think we need anything about main or _mlir_ciface or the entire initialization block here. We can use the Test dialect that supports unregistered ops as opaque producers or users of values.

42

The check pattern doesn't look like valid MLIR, I am surprised pre-merge checks haven't complained.

47

Prefer CHECK over CHECK-NEXT unless the semantics of the IR changes when two operations are not adjacent. The need to change another, unrelated test in these patch is the a good illustration why :)

rriddle added inline comments.Nov 18 2020, 4:13 AM
mlir/lib/Conversion/OpenMPToLLVM/OpenMPToLLVM.cpp
24 ↗(On Diff #305963)

Use the auto generated base class instead of PassWrapper.

llitchev updated this revision to Diff 306203.Nov 18 2020, 1:09 PM
llitchev marked 10 inline comments as done.

Addressed some CR feedback on this diff.

mlir/include/mlir/Conversion/OpenMPToLLVM/ConvertOpenMPToLLVM.h
30 ↗(On Diff #305963)

This pass operates on LLVM Dialect IR. I tried to have it as a function pass (from the very beginning), but the function pass doesn't take the LLVMFuncOp (maybe I am missing something, though). Thanks!

mlir/include/mlir/Conversion/Passes.td
227 ↗(On Diff #305963)

I ran the git clang-format origin/master. I thought it should have fix it. Thanks!

mlir/test/Conversion/OpenMPToLLVM/openmp_float-parallel_param.mlir
42

I have no idea why it didn't get caught.

jdoerfert requested changes to this revision.Nov 19 2020, 7:58 AM

It doesn't make sense to do this here. The OpenMPIRBuilder is used in other places that have the same problem, especially OpenMPOpt. The logic needs to be part of the OpenMPIRBuilder.

This revision now requires changes to proceed.Nov 19 2020, 7:58 AM

@jdoerfert That makes sense. I looked at the code and that is LLVMIR, so Ill start moving things over.

llitchev planned changes to this revision.Dec 11 2020, 11:41 AM

Just finished significant testing ofg this with our AI codegen. The fix in https://reviews.llvm.org/D92189 whows some issues with number of arguments passed as varargs. The number of varargs that starts showing problems is different for different HW (it seems different for SpirV, Vulcan, MultiCore). I will update this PR.

llitchev updated this revision to Diff 311301.Dec 11 2020, 12:53 PM

Implemented the closure approach in the OMPIRBuilder only.

The implementation is fully encapsulated in the OMPIRBuilder only (no changes outside of this file).
It also addresses the issue/limitation of the current implementation with having more than 15 upward defined Values, passed to the parallel region as varargs.

Herald added a project: Restricted Project. · View Herald TranscriptDec 11 2020, 12:53 PM
llitchev planned changes to this revision.Dec 11 2020, 1:00 PM

I suspect there will be conflicts at this time. Posting the source here, so some of the previous commenters could look at it.

I think the overall approach is good to solve the problem with max 16 arguments. Is this based on current master? I left remarks wrt style and other things below, I will need to go over the logic again after those are addressed.

llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
664

Nit: use SmallVectorImpl w/o the size. No const on the pointers (*const); doesn't mean much anyway.

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
452

If it has to be an instruction, use cast. If it might no be, simply get the current instruction point from the builder.

698

This should make some of the code introduced by D92189 obsolete, right?

844

Use range loops above whenever possible. Use LLVM naming style for variables please, so first letter capitalized. Also no llvm::.
Prefer Insertion point guards over manually saving restoring (potentially adding a explicit scope { ... }).
I don't think we need to iterate over the entire set of instructions of the outer function, do we? Check how D92189 identifies communicated values.

mlir/lib/Conversion/OpenMPToLLVM/OpenMPToLLVM.cpp
17 ↗(On Diff #311301)

leftover.

llitchev updated this revision to Diff 311475.Dec 13 2020, 4:05 PM
llitchev marked 5 inline comments as done.

Addressed CR feedback.

llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
664

Using the Impl now.

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
452

Using the current position from the Builder (moving one forward, so it is not the end() ).

698

Yes! Most of it. Now the data is just wrapped into a capture struct before calling the synthetic ..._fork function.

844

You are right ... That was one of the changes I wanted to make to the Diff. Now iterate over the parallel region blocks only.

llitchev planned changes to this revision.Dec 13 2020, 4:06 PM

Needed merges with the latest master.

llitchev retitled this revision from Add capturing of parameters to pass to omp::parallel to [OpenMPIRBuilder} Add capturing of parameters to pass to omp::parallel.Dec 14 2020, 5:21 AM
llitchev updated this revision to Diff 311557.Dec 14 2020, 5:25 AM

Merged in master.

Removed unnecessary (now) code from D92189.

Herald added a project: Restricted Project. · View Herald TranscriptDec 14 2020, 5:25 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
llitchev updated this revision to Diff 311648.Dec 14 2020, 10:32 AM

Fixed a failing Windows test. The issue is that that the order of the operands for add operation has changed. I cant see how these changes could cause the issue, but it is a failing test that blocks push of this Diff.

llitchev updated this revision to Diff 311698.Dec 14 2020, 1:51 PM

Added separate tests for Windows for specific tests.

On Windows the registers for the add operation are swapped. This test is completely unrelated to this change, but it fails.

Thanks for continuing to work on this.

Change the commit message as it is working for non integer types now, just not for "too many". Some more comments below.

llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
119

wasn't that a spelling error before? and one \param is enough ;)

126

I believe we need to keep the Inner parameter here. The reasons can be found in the discussion of D92476. Short story: The callback might have the original value in a map with information attached. If we pass in only the "reloaded" value the callback cannot determine what the original was.

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
441

How do we know that -- is valid here? Couldn't Loc point to the begin of a function? If possible, let's just use Loc.IP.

Thanks! I have a couple of comments, but I will defer to @jdoerfert for approval in any case.

llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
676
678

Nit: it looks like this file uses IP rather than InsPoint for names related to insertion points

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
760

Nit: I think https://llvm.org/docs/CodingStandards.html#don-t-evaluate-end-every-time-through-a-loop applies to .size() the same way it applies to .end()

795
796

Nit: trailing dot

799

Nit: please reserve before pushing back in a loop

807–808

Nit: Builder.restoreIP(CaptureAllocaInsPoint) looks shorter

810–815

I suppose you may want to have alloca inserted in a block (function entry) different from the one where you store into the memory. You need to store just before calling the fork function (or, at least, so that the store postdominates all stored values). Looking at the function API, I would have assumed CaptureAllocaInsPoint to be an insertion point at the function entry block specifically for allocas, where these insertvalues are invalid.

826–829

Can we rather take each captured value and enumerate its uses, replacing those within the parallel block set?

llvm/test/CodeGen/XCore/threads.ll
84–140 ↗(On Diff #311698)

These look irrelevant to the patch, but seem to fix a breakage upstream. Would you mind committing this separately?

mlir/test/Conversion/OpenMPToLLVM/openmp_float-parallel_param.mlir
2

Changes to MLIR are no longer necessary

llitchev updated this revision to Diff 313039.Dec 21 2020, 1:55 AM
llitchev marked 10 inline comments as done.

Addressed CR.

llitchev added inline comments.Dec 21 2020, 1:56 AM
llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
119

Fixed.

678

No need to store this value anymore. Used the InsertBB->getTerminator(), thus guaranteeing the alloca and stores are just before the fork call (they were before that call too, since the ThreadID was called last), so even if more codegen is introduced in the future the logic deals with it.

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
441

There is always an instruction before - the ThreadID was always generated, and that is what -- points to. Changed it to use InsertBB->getTerminator(). It is much more sturdy this way. Even if the codegen is changed, the alloca, insert and store will be generated always just before the forkCall.

760

Fixed.

807–808

Refactored. No need to store the InsertPoint.

810–815

Now it is guaranteed that. the codegen of the alloca, insert, and stores are done just before the forkCall. Even if the codegen changes in the future. It was the case before because the code was generated after the ThreadID getting call (which was just before the fork).

826–829

That was the first implementation I had. The issues was that the uses() was not returning all the uses (particularly the ones introduced by the loop unroller - spent bunch of time debugging it). Iterating to all the instruction parameters of the parallelRegions just works.

llvm/test/CodeGen/XCore/threads.ll
84–140 ↗(On Diff #311698)

OK

mlir/test/Conversion/OpenMPToLLVM/openmp_float-parallel_param.mlir
2

Yes. This just exposes the original issue I had. I thought it is useful to have a test that verifies the underlined functionality works for MLIR.

llitchev updated this revision to Diff 313041.Dec 21 2020, 2:04 AM

Removed the changes from threads.ll.

I'll pull this is a new Diff.

llitchev updated this revision to Diff 313048.Dec 21 2020, 2:11 AM

Some minor optimizations related to CR.

llitchev updated this revision to Diff 313077.Dec 21 2020, 3:57 AM

Fixed a casing issue with a local var.

jdoerfert added inline comments.Jan 6 2021, 8:12 PM
llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
790
  1. If we would need this, remove the Counter stuff everywhere, if you want to iterate a container: for (const T& : Container)
  2. BlockParents seems to be a set with the blocks, we already have that, it's called ParallelRegionBlockSet, simply pass it in.
  3. Why don't we use the Inputs and Outputs set computed by the findInputsOutputs call. Those are the live-in and live-out values of the parallel region.
801
812

The alloca needs to go in the OuterAllocaIP passed in by the caller of CreateParallel.

827

I'm not too happy with this insert/extract value scheme. Without further optimization (-O0) this might not be lowered properly. Why don't we create a GEP and load/store to the appropriate location instead?

837

Instead of doing this, unpack/load the location in the PrivHelper like we did before. Also, pass the loaded value as Inner to the PrivCB so that the callback has both the original value V and the reload Inner.

Ping @llitchev. Would you have time to take this forward?

Ping @llitchev. Would you have time to take this forward?

I think @ggeorgakoudis is working on an alternative API solution, we might need to pick up the MLIR parts though.

Ping @llitchev. Would you have time to take this forward?

I think @ggeorgakoudis is working on an alternative API solution, we might need to pick up the MLIR parts though.

Yes, I have a solution for OMPIRBuilder. It hinges on https://reviews.llvm.org/D96854 to use the CodeExtractor for building the aggregate.

ftynse resigned from this revision.Aug 27 2021, 12:17 AM