This is an archive of the discontinued LLVM Phabricator instance.

[mlir] Make `Regions`s `cloneInto` multithread-readable
ClosedPublic

Authored by zero9178 on Apr 17 2022, 3:23 PM.

Details

Summary

Prior to this patch, cloneInto would do a simple walk over the blocks and contained operations and clone and map them as it encounters them. As finishing touch it then remaps any successor and operands it has remapped during that process.

This is generally fine, but sadly leads to a lot of uses of both operations and blocks from the source region, in the cloned operations in the target region. Those uses lead to writes in the use-def list of the operations, making cloneInto never thread safe.

This patch reimplements cloneInto in three steps to avoid ever creating any extra uses on elements in the source region:

  • It first creates the mapping of all blocks and block operands
  • It then clones all operations to create the mapping of all operation results, but does not yet clone any regions or set the operands
  • After all operation results have been mapped, it now sets the operations operands and clones their regions.

That way it is now possible to call cloneInto from multiple threads if the Region or Operation is isolated-from-above. This allows creating copies of functions or to use mlir::inlineCall with the same source region from multiple threads. In the general case, the method is thread-safe if through cloning, no new uses of Values from outside the cloned Operation/Region are created. This can be ensured by mapping any outside operands via the BlockAndValueMapping to Values owned by the caller thread.

While I was at it, I also reworked the clone method of Operation a little bit and added a proper options class to avoid having a cloneWithoutRegionsAndOperands method, and be more extensible in the future. cloneWithoutRegions is now also a simple wrapper that calls clone with the proper options set. That way all the operation cloning code is now contained solely within clone.

Regarding testing: I have no clue what an automated test for thread safety would look like, nor whether that is possible. I used TSAN on my own project to find uses of mlir::inlineCall making writes to the source callable, creating race conditions. After this patch, TSAN no longer reports any issues in my project.

Diff Detail

Event Timeline

zero9178 created this revision.Apr 17 2022, 3:23 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 17 2022, 3:23 PM
zero9178 requested review of this revision.Apr 17 2022, 3:23 PM
zero9178 updated this revision to Diff 423320.Apr 17 2022, 3:25 PM

Remove redundant includes

rriddle requested changes to this revision.Apr 17 2022, 6:00 PM
rriddle added inline comments.
mlir/lib/IR/Operation.cpp
526

The style of the code base has changed at this point, please drop the duplicated method documentation from here.

mlir/lib/IR/Region.cpp
117

We shouldn't need to track every new operation, we can just iterate over the inserted blocks alongside the original blocks.

124–125

Can you pull this out into a separate variable?

This revision now requires changes to proceed.Apr 17 2022, 6:00 PM
zero9178 updated this revision to Diff 423347.Apr 18 2022, 1:42 AM

Address review comments

(Can't wait for C++17 support btw)

Mogball added inline comments.Apr 18 2022, 6:10 PM
mlir/include/mlir/IR/Operation.h
75

Can you expand the documentation here?

78

What does it mean for all flags to be false?

79

Provide a parameter constructor? CloneOptions(bool cloneRegions, bool cloneOperands).

81

What does it mean for all flags to be true?

84–86
92–93

What is the use case of cloning operations with zero operands? E.g. what happens if I clone an op with its regions but without operands. All nested ops have zero operands?

92–93
125–137
zero9178 updated this revision to Diff 423639.Apr 19 2022, 8:42 AM
zero9178 marked an inline comment as not done.

Address review comments

zero9178 marked 9 inline comments as done.Apr 19 2022, 8:47 AM
zero9178 added inline comments.
mlir/include/mlir/IR/Operation.h
75

I tried to elaborate a bit further but I am not 100% sure what is missing.

92–93

Nested operations are currently unaffected by the cloning options. Aka they are not recursive and only affect the top level operation. Nested operations are cloned in their entirety.
A use case for cloning an operation without its operands but with its regions might be cloning the operation to outline it eg. (Although this could also be achieved via the mapper). Generally speaking, the point of not cloning operands is to avoid creating a use of those operands as that is not thread safe.

Mogball added inline comments.Apr 19 2022, 11:41 AM
mlir/include/mlir/IR/Operation.h
75

You should clearly state that the "parts of an operation" include whether the regions are recursively cloned and whether the operands are cloned and that these options are passed to clone methods.

92–93

Makes sense. But what if the region is not isolated from above? Could cloning the region recursively add a use to a value defined above and create a race condition?

zero9178 updated this revision to Diff 423697.Apr 19 2022, 12:00 PM
zero9178 marked an inline comment as done.

Address review comments regarding documentation

zero9178 added inline comments.Apr 19 2022, 12:05 PM
mlir/include/mlir/IR/Operation.h
92–93

If the region is not isolated from above then a use of a value not defined by any of the operations being cloned may still lead to a race condition yes. So only cloning an isolated from above region is generally read only.
One may still avoid a race condition in your case however if all the outside defined values are mapped by the BlockAndValueMapping to Values owned by the current thread.

Have you measured the performance of the new clone to ensure parity? Inlining in TF can get kind of heavy...

mlir/include/mlir/IR/Operation.h
91

Why has this option been removed?

92–93

Right. Could you update the patch description to reflect that bit of a wrinkle? It'd also be great if the threadsafe-ness of cloning were briefly documented somewhere in the code so that this doesn't become MLIR street knowledge.

mlir/lib/IR/Region.cpp
142

Don't create a temporary vector.

zero9178 marked 2 inline comments as done.Apr 20 2022, 5:01 AM
zero9178 added a subscriber: wsmoses.

Have you measured the performance of the new clone to ensure parity? Inlining in TF can get kind of heavy...

I tried to create a meaningful benchmark by taking an MLIR file produced by my compiler prior to my inliner pass, which consists of 2812 lines of code. I then modified the main function to contain lots of calls to a large function.
First run of ./pylir-opt benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading | wc -l lead to an output of 454056 lines of MLIR.
I then used hyperfine to try and measure a difference:

[markus@dell-xps13 bin]$ hyperfine --prepare "sleep 20"  './pylir-opt benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading' './pylir-opt-old benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading'
Benchmark 1: ./pylir-opt benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading
  Time (mean ± σ):      4.230 s ±  0.053 s    [User: 4.141 s, System: 0.069 s]
  Range (min … max):    4.158 s …  4.308 s    10 runs
 
Benchmark 2: ./pylir-opt-old benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading
  Time (mean ± σ):      4.137 s ±  0.073 s    [User: 4.048 s, System: 0.068 s]
  Range (min … max):    4.009 s …  4.229 s    10 runs
 
Summary
  './pylir-opt-old benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading' ran
    1.02 ± 0.02 times faster than './pylir-opt benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading'

The benchmark was run on my Linux laptop running Fedora 35 and a i7 7500U. I used a prepare statement that sleeps 20 seconds before execution to try to circumvent thermal throttling of my laptop. For the same reason I also double checked the test results by running them in opposite order, which lead to the same results.

pylir-opt is my projects version of mlir-opt which linked against MLIR with my patch applied. pylir-opt-old is the same but links against a version without my patch applied. Both were compiled in release mode and used LTO when compiling my projects code.

To summarize, this patch seemingly regresses a tiny bit performance wise on average running on average 100ms slower in the benchmark, with a margin of error of around 50 ms. Of note however is that I had to really increase the problem size to get a good measurement. An initial measurement with max-recursion-depth=100 as parameter yielded a runtime of around 1 seconds eg. and an output file of roughly half the lines, where I failed to measure a consistent difference.
Another thing of note is that I did not find out through further profiling with Intel VTune any obvious causes. Most CPU time was spent in mlir::detail::walk and the self CPU time of Region::cloneInto was essentially nothing.

mlir/include/mlir/IR/Operation.h
91

This options was only very recently added in https://reviews.llvm.org/D122531
The bug essentially boiled down to the results of the operation being mapped before regions were being cloned.
This patch happens to also be a fix to that bug as through the single clone implementation, that ordering issue is also resolved.
Due to the recency of @wsmoses patch as well as the only code using that parameter being removed it essentially became dead code that I doubt any downstream users have yet to use as well.
(I initially had it as part of CloneOptions until I realised it was redundant)

mlir/lib/IR/Region.cpp
142

I am not sure how else it'd be possible to implement. mlir::ValueRange doesn't have a constructor that takes an iterator range of Values, only one taking an ArrayRef of Values, which this code calls. llvm::map_range is also just a range of iterators computing the elements and results lazily, that does not have any storage. Hence the llvm::to_vector to materialize it and construct the mlir::ValueRange with (which I think requires contiguous memory?).

zero9178 updated this revision to Diff 423873.EditedApr 20 2022, 5:16 AM
zero9178 marked 2 inline comments as done.

Document the conditions for when it is safe to call Operations clone and Regions cloneInto methods from multiple threads.

zero9178 edited the summary of this revision. (Show Details)Apr 20 2022, 5:19 AM
zero9178 updated this revision to Diff 423899.Apr 20 2022, 6:53 AM

Hoist vector creation out of loop, allowing reuse of capacity of the vector. Performance is now up to parity if not a little better (although in the margin of error):

[markus@dell-xps13 bin]$ hyperfine --prepare 'sleep 20'  './pylir-opt-old benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading' './pylir-opt benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading'
Benchmark 1: ./pylir-opt-old benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading
  Time (mean ± σ):      4.073 s ±  0.037 s    [User: 3.983 s, System: 0.068 s]
  Range (min … max):    4.014 s …  4.148 s    10 runs
 
Benchmark 2: ./pylir-opt benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading
  Time (mean ± σ):      4.061 s ±  0.080 s    [User: 3.962 s, System: 0.078 s]
  Range (min … max):    3.963 s …  4.166 s    10 runs
 
Summary
  './pylir-opt benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading' ran
    1.00 ± 0.02 times faster than './pylir-opt-old benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading'
[markus@dell-xps13 bin]$ hyperfine --prepare 'sleep 20'  './pylir-opt benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading' './pylir-opt-old benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading'
Benchmark 1: ./pylir-opt benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading
  Time (mean ± σ):      4.025 s ±  0.077 s    [User: 3.927 s, System: 0.078 s]
  Range (min … max):    3.932 s …  4.156 s    10 runs
 
Benchmark 2: ./pylir-opt-old benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading
  Time (mean ± σ):      4.071 s ±  0.062 s    [User: 3.984 s, System: 0.065 s]
  Range (min … max):    3.979 s …  4.188 s    10 runs
 
Summary
  './pylir-opt benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading' ran
    1.01 ± 0.02 times faster than './pylir-opt-old benchmark.mlir --pylir-trial-inliner="min-callee-size-reduction=0 max-recursion-depth=200" --mlir-disable-threading'
Mogball accepted this revision.Apr 20 2022, 11:15 AM

Thanks for checking the performance!

Looks mostly good to me, but you'll want to wait for River

mlir/lib/IR/Region.cpp
142

You could change it so that when cloning operands, instead of zero, it sets them all to null, and then loop over them setting them here to avoid creating the vector.

rriddle accepted this revision.Apr 21 2022, 12:31 AM

LG

mlir/include/mlir/IR/Operation.h
79

Missing punctuation at the end of this sentence.

96
99
129
mlir/lib/IR/Region.cpp
85–88

Should likely clarify here that it isn't thread safe in some cases (when adding references to values in other regions).

112

Prefer using (), unless there isn't an actual constructor here.

126

nit: Spell out auto here.

127–128

The separate variable doesn't add much, can we just drop it? We could also drop the braces afterwards as well.

136–137

Can you just inlined these variables?

This revision is now accepted and ready to land.Apr 21 2022, 12:31 AM
zero9178 updated this revision to Diff 424151.Apr 21 2022, 4:42 AM
zero9178 marked 8 inline comments as done.

Address review comments

This revision was landed with ongoing or failed builds.Apr 21 2022, 4:43 AM
This revision was automatically updated to reflect the committed changes.

Thanks to the both of you for the very thorough review! Very happy with the result :)