This is an archive of the discontinued LLVM Phabricator instance.

[mlir] Transform scf.parallel to scf.for + async.execute
ClosedPublic

Authored by ezhulenev on Oct 22 2020, 7:25 AM.

Details

Summary

Depends On D89958

  1. Adds async.group/async.await_all to group together multiple async tokens/values
  2. Rewrites the scf.parallel operation into multiple concurrent async.execute operations over non-overlapping subranges of the original loop.

Example:

scf.parallel (%i, %j) = (%lbi, %lbj) to (%ubi, %ubj) step (%si, %sj) {
  "do_some_compute"(%i, %j): () -> ()
}

Converted to:

%c0 = constant 0 : index
%c1 = constant 1 : index

// Compute blocks sizes for each induction variable.
%num_blocks_i = ... : index
%num_blocks_j = ... : index
%block_size_i = ... : index
%block_size_j = ... : index

// Create an async group to track async execute ops.
%group = async.create_group

scf.for %bi = %c0 to %num_blocks_i step %c1 {
  %block_start_i = ... : index
  %block_end_i   = ... : index

  scf.for %bj = %c0 to %num_blocks_j step %c1 {
    %block_start_j = ... : index
    %block_end_j   = ... : index

    // Execute the body of original parallel operation for the current
    // block.
    %token = async.execute {
      scf.for %i = %block_start_i to %block_end_i step %si {
        scf.for %j = %block_start_j to %block_end_j step %sj {
          "do_some_compute"(%i, %j): () -> ()
        }
      }
    }

    // Add produced async token to the group.
    async.add_to_group %token, %group
  }
}

// Await completion of all async.execute operations.
async.await_all %group

In this example the outer loops launch the inner block-level loops as separate
async.execute operations, which are executed concurrently.

At the end, async.await_all waits for the completion of all async.execute operations.
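As an illustration of the transformation (not part of the patch), here is a Python sketch of the same blocking scheme for a one-dimensional loop. The names `parallel_for`, `body`, and the fixed `num_blocks` are hypothetical stand-ins for the pass's block-size heuristics; a thread pool plays the role of async.execute and a list of futures plays the role of the async group:

```python
import math
from concurrent.futures import ThreadPoolExecutor, wait

def parallel_for(lb, ub, step, body, num_blocks=4):
    """Model of the scf.parallel -> blocked scf.for + async.execute rewrite.

    Assumes a positive step; `num_blocks` stands in for the pass's
    block-count heuristic."""
    trip_count = max(0, math.ceil((ub - lb) / step))
    block_size = math.ceil(trip_count / num_blocks)  # ceildiv
    group = []  # plays the role of async.create_group
    with ThreadPoolExecutor() as pool:
        for b in range(num_blocks):  # the outer scf.for over blocks
            block_start = lb + b * block_size * step
            block_end = min(lb + (b + 1) * block_size * step, ub)

            def block(start=block_start, end=block_end):
                for i in range(start, end, step):  # the inner scf.for
                    body(i)

            # async.execute + async.add_to_group
            group.append(pool.submit(block))
        wait(group)  # async.await_all
```

Each block covers a disjoint subrange, so the bodies can run concurrently without coordinating with each other.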

Diff Detail

Event Timeline

ezhulenev created this revision.Oct 22 2020, 7:25 AM
Herald added a reviewer: aartbik.
Herald added a project: Restricted Project.
ezhulenev requested review of this revision.Oct 22 2020, 7:25 AM
ezhulenev edited the summary of this revision. (Show Details)Oct 22 2020, 7:30 AM
ezhulenev added reviewers: herhut, mehdi_amini.
ezhulenev edited the summary of this revision. (Show Details)Oct 22 2020, 7:34 AM
ezhulenev updated this revision to Diff 300634.Oct 26 2020, 4:33 AM

Minor code cleanup

ezhulenev updated this revision to Diff 300706.Oct 26 2020, 9:49 AM

Fix style guide violations

herhut added inline comments.Oct 29 2020, 10:01 AM
mlir/include/mlir/Dialect/Async/IR/AsyncBase.td
63

This should discuss the semantics a bit more. In particular, that this acts like a mutable collection.

An alternative could be to thread through the token and have a join operation that takes n tokens and produces a new one.

I cannot judge the tradeoffs right now, but it seems worth a design discussion, especially around how this would lower to other runtimes like TFRT.

mehdi_amini added inline comments.Oct 29 2020, 10:01 PM
mlir/include/mlir/Dialect/Async/IR/AsyncBase.td
63–65
mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
86

Isn't this the deprecated form of builders? (@ftynse ?)

156
157

Needs a clarification on the lifetime, it seems like it isn't explicitly destroyed right now (and because it is mutable, it isn't really a pure value), so it isn't clear to me how it works.

mlir/lib/Conversion/AsyncToLLVM/AsyncToLLVM.cpp
841

I think this is a variadic template, for convenience.

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
57

(typo)

mlir/lib/ExecutionEngine/AsyncRuntime.cpp
53

I think that's what I'm talking above: seems like leaking at the moment?

57

Do we also have a leaking issue with tokens?

ftynse added inline comments.Oct 30 2020, 2:43 AM
mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
86

It is indeed. Please use OpBuilderDAG instead - https://mlir.llvm.org/docs/OpDefinitions/#custom-builder-methods.

111

Nit: it should be possible to omit the [] for empty lists.

192

We tend to use Index for most things pertaining to positions. Any reason why this is I64 specifically?

222

Syntax nit: placing attr-dict _before_ operands is very uncommon in MLIR, any reason for this?

mlir/lib/Dialect/Async/IR/Async.cpp
150

I am pretty sure something in block argument lists does not expect a null type (that's what the default constructor gives you) and will eventually crash. Do we have a test that covers the case of valueType not being a ValueType?

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
122

This wouldn't work correctly for negative numbers.

There is a pending diff adding ceildiv operation to std, I'd just target that when it lands.

ezhulenev updated this revision to Diff 301883.Oct 30 2020, 6:46 AM
ezhulenev marked 10 inline comments as done.

Fix typos + use OpBuilderDAG + address comments.

mlir/include/mlir/Dialect/Async/IR/AsyncBase.td
63

The problem with a join operation is that the number of tokens to be awaited is not known statically, so there is a need for something like std::vector in MLIR that can accumulate values of any type. async.group is basically a std::vector<AsyncToken> with a very limited API, implemented as runtime intrinsics.

(re TFRT integration, it's in cl 339040831 async_runtime.{h.cc})
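To make the "mutable collection with a very limited API" point concrete, here is a minimal Python sketch of what the runtime side of a group could look like. The class and method names are hypothetical, not the actual AsyncRuntime API; the comments map each method to its dialect operation:

```python
import threading

class AsyncGroup:
    """Hypothetical sketch of a runtime async group: a mutable set of
    pending tokens that can only grow and is awaited as a whole."""

    def __init__(self):  # async.create_group
        self._pending = 0
        self._cv = threading.Condition()

    def add_token(self):  # async.add_to_group
        with self._cv:
            self._pending += 1

    def complete_token(self):  # called when an async.execute body finishes
        with self._cv:
            self._pending -= 1
            if self._pending == 0:
                self._cv.notify_all()

    def await_all(self):  # async.await_all
        with self._cv:
            self._cv.wait_for(lambda: self._pending == 0)
```

Note the collection only ever needs a counter at runtime: membership of individual tokens is never queried, only the count of still-pending ones.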

mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
157

Currently the runtime owns all the allocated values (tokens, values, groups) and destroys them in its destructor. For TFRT, the idea is that for each kernel/op invocation we create a short-lived AsyncRuntime instance, which tracks all the tokens/values, and destroy it when the kernel completes (async_runtime.h in cl/339040831; sorry, it's an internal-only thing for now).

The other option is to add reference counting intrinsics to all runtime types (token, value, group), but then it gets tricky how to do the reference counting correctly.

222

No. Moved all attr-dict to the end

mlir/lib/Dialect/Async/IR/Async.cpp
150

Changed null type to operand.getType(), so it can safely fail in verifier.

mlir/lib/ExecutionEngine/AsyncRuntime.cpp
53

Yes, right now runtime is not doing any lifetime management of created tokens/values. If we agree on "runtime owns them all" approach I'll port it from TFRT runtime implementation.

mehdi_amini added inline comments.Oct 30 2020, 8:33 AM
mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
157

Currently the runtime owns all the allocated values (tokens, values, groups) and destroys them in its destructor.

When is the destructor invoked? I don't have a clear understanding of the lifetime right now; since the runtime context is not explicitly modeled, I can't reason about it here.

ezhulenev added inline comments.Oct 30 2020, 9:54 AM
mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
157

In the case of mlir-cpu-runner + a shared lib, it will be destructed during the shutdown process (something like a static std::unique_ptr<Runtime>). For users that do JIT + ExecutionEngine, it will be their responsibility to bind the async API symbols at runtime to a short-lived instance of the AsyncRT and properly destroy it.

mehdi_amini added inline comments.Oct 30 2020, 10:44 AM
mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
157

In the case of mlir-cpu-runner + a shared lib, it will be destructed during the shutdown process

That seems morally equivalent to not releasing at all to me at this point.
I'd rather find a more principled way to manage the lifetime here; do you see a way to reason about it by construction?

ezhulenev added inline comments.Oct 30 2020, 11:19 AM
mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
157

Yeah, my current proposal is just to sweep the memory management problems under the rug :) (at least temporarily)

Option #1 (easy):
Add an async.destroy operation, and require that it be called explicitly to destroy all async tokens/values/groups that are no longer needed. It then becomes the responsibility of the IR builder to place this op correctly. In the case of async parallel for, this is trivial to do.

Option #2 (hard(er)?):
During async-to-LLVM lowering, infer the points where mlirAsyncRuntimeDestroy(...) API calls must be inserted. This gets tricky very quickly because, for example, an async.token can be captured by multiple async.execute operations, so there is no single point for async.destroy.

The only option I see is to do reference counting on async values (tokens, ...):

  1. AddRef when passed to async.execute
  2. DropRef on captured values/tokens when async.execute completes
  3. DropRef when an async value does not leave its "scope" (region), e.g. it is created inside a region (function/for/...) and does not escape it. I have a vague idea how to do that; it's kind of similar to C++ destructors being called automatically, but it's hard to define what a "scope" is in MLIR.
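The reference-counting scheme in these three rules can be modeled with a tiny Python stand-in for the runtime side. The AddRef/DropRef names follow the list above; the class itself is hypothetical, not an existing runtime API:

```python
import threading

class RefCountedAsyncValue:
    """Hypothetical ref-counted async value (token/value/group): it is
    destroyed exactly when the last reference is dropped."""

    def __init__(self, destructor):
        self._count = 1  # the creator holds the initial reference
        self._destructor = destructor
        self._lock = threading.Lock()

    def add_ref(self):  # e.g. when captured by an async.execute region
        with self._lock:
            assert self._count > 0, "reviving a destroyed value"
            self._count += 1

    def drop_ref(self):  # e.g. when the capturing region completes
        with self._lock:
            self._count -= 1
            destroy = self._count == 0
        if destroy:  # run the destructor outside the lock
            self._destructor()
```

The tricky part the thread is discussing is not this mechanism but where the compiler inserts the add_ref/drop_ref calls.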
ezhulenev added inline comments.Oct 30 2020, 1:58 PM
mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
157

Another question is how to do reference counting of async values: add reference counting ops async.ref.add/async.ref.drop to the async dialect, add them to some "intermediate" host-async dialect (reference counting might not make much sense for host-vs-GPU asynchronicity and GPU codegen), or just insert calls to runtime intrinsics during async-to-LLVM lowering.

The most principled solution I can think of is an async-ref-counting pass that runs before lowering to LLVM.

func @fn0() -> !async.token {
  %t0 = async.execute { ... }
  %t1 = async.execute[%t0] {...}
  return %t1
}

func @fn1(%arg0 : !async.token) {
  ...
  return
}

func @fn2() {
  %t = call @fn0()
  ...
  async.await %t
  return
}

mlir-opt -async-ref-counting produces something like:

func @fn0() -> !async.token {
  %t0 = async.execute { ... }

  async.ref.add %t0
  %t1 = async.execute[%t0] {  
     ...
     async.ref.drop %t0
  }

  async.ref.drop %t0
  async.ref.drop %t1

  async.ref.add %t1
  return %t1 : !async.token
}

func @fn1(%arg0 : !async.token) {
  ...
  async.ref.drop %arg0
  return
} 

func @fn2() {
  %t = call @fn0()

  async.ref.add %t
  call @fn1(%t)
  ...
  async.await %t
  async.ref.drop %t
  return
}

And then async.ref.* ops lowered to async runtime intrinsics.

mehdi_amini added inline comments.Oct 30 2020, 3:08 PM
mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
157

Few thoughts on the refcount and a proposal:

%t1 = async.execute[%t0] {  
   ...
   async.ref.drop %t0
}

I think you can ref.drop at the beginning of the execute, right? (assuming the token isn't used in the region, of course).

async.ref.add %t0
%t1 = async.execute[%t0] {  
   ...
   async.ref.drop %t0
}

It seems like we could make this part of the IR semantics for async.execute and leave it up to the lowering/implementation: it would always increment the ref-count and drop it like here, but implicitly.

That said, the advantage of the explicit method is that:

%t0 = async.execute { ... }

async.ref.add %t0
%t1 = async.execute[%t0] {  
   ...
   async.ref.drop %t0
}
async.ref.drop %t0

can be optimized to:

%t0 = async.execute { ... }
%t1 = async.execute[%t0] {  
   ...
   async.ref.drop %t0
}
async.ref.drop %t1

async.ref.add %t1
return %t1 : !async.token

Is this correct? Isn't %t1 refcount 0 after the drop and so gets deleted?

What about this proposal: when assigning a token to an SSA value inside a region, it always conceptually has a refcount of 1 (the actual refcount may be higher, but the responsibility of this region is to ultimately decrement it by one). We increment the refcount by as many SSA users as there are for this token, minus one (if there are no users, we actually decrement the counter). It is the responsibility of each token user to either forward it as-is (to a new SSA value) or decrement the refcount.

It seems like it should compose well, including with the return statement.

func @fn0() -> !async.token {
  %t0 = async.execute { ... }
  // %t0 has a refcount of 1 and only one user, do nothing with refcount adjustments
  
  %t1 = async.execute[%t0] {   // on execution, drop the refcount for %t0, 
     ...
  }
  // %t1 has a refcount of 1 and a single user, do nothing with refcount adjustments
  return %t1 : !async.token
}

func @fn1(%arg0 : !async.token) {
  // %arg0 has a refcount of 1, but 0 users, so we decrement immediately.
  async.ref.drop %arg0
  ...
  return
} 

func @fn2() {
  %t = call @fn0()
  // %t starts with a refcount of 1, has two SSA uses, add one
  async.ref.add %t, 1
  call @fn1(%t) // function call like any other consumer will naturally drop it by one.
  ...
  async.await %t // await always drops the refcount internally.
  return
}

(I should write a design doc, otherwise I'll have nothing for my next perf ;))
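The counting rule in this proposal boils down to a per-definition adjustment that a hypothetical async-ref-counting pass could compute. A sketch (the function name is made up, not part of the patch):

```python
def ref_count_adjustment(num_ssa_users):
    """Per the proposal above: a fresh async SSA value conceptually starts
    at refcount 1 and each user consumes one reference, so the defining
    region emits (users - 1) async.ref.add ops, or a single async.ref.drop
    when the value is unused."""
    if num_ssa_users == 0:
        return -1  # emit one async.ref.drop right after the definition
    return num_ssa_users - 1  # emit this many async.ref.add increments
```

This matches the @fn2 example: %t has two SSA uses (call @fn1 and async.await), so the pass inserts exactly one async.ref.add.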

ezhulenev added inline comments.Oct 30 2020, 3:53 PM
mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
157

Is this correct? Isn't %t1 refcount 0 after the drop and so gets deleted?

Yup, my mistake. I was thinking ahead and assumed that the drop/add sequence can be optimized away.

What about this proposal: when assigning a token to an SSA value inside a region, it always has conceptually ...

Yeah, it seems like that will work, and it's consistent with the Swift calling convention (https://github.com/apple/swift/blob/main/docs/SIL.rst#reference-counts): references are passed at +1 and the callee consumes them (drops the ref, if I understand correctly). And BefExecutor does exactly this, with +number-of-uses at the beginning of BEF execution.

The only problem is that the reference counting lowering should know how to update all the users of async SSA values (drop refs inside functions, etc.). For example, if an async.value is passed to scf.if, the lowering should know how to update the then/else regions, but maybe op interfaces will help here. Initially it can be restricted to functions and async ops.

I'll send a separate CL with a proper reference-counting async runtime next week.

mehdi_amini accepted this revision.Oct 30 2020, 7:26 PM
mehdi_amini added inline comments.
mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
157

OK LG!

This revision is now accepted and ready to land.Oct 30 2020, 7:26 PM
ftynse accepted this revision.Nov 5 2020, 1:45 AM

LGTM after replacing the divup function with rewriter.create<SignedCeilDivIOp> and adding populateStdExpandDivsRewritePatterns into the lowering mix.

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
122

https://reviews.llvm.org/D89726 has landed; we now have ceildiv, which expands to a longer sequence of operations that supports negative values. I expect the canonicalizer to remove the unnecessary ones if it knows the signs of all operands.
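For reference, the semantics that the expanded ceildiv sequence has to implement can be sketched in Python, where floor division makes a negative-safe ceiling division a one-liner. This is only a model of the intended semantics, not the std-dialect expansion itself:

```python
def ceildiv(a, b):
    # Python's // rounds toward negative infinity, so negating the
    # dividend and the result yields rounding toward positive infinity.
    # Unlike the naive (a + b - 1) // b, this is also correct when a
    # or b is negative.
    return -(-a // b)
```

This is the property the comment above is about: a purely positive formula like (a + b - 1) / b silently miscomputes trip counts for loops with negative bounds or steps.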

ezhulenev updated this revision to Diff 303931.Nov 9 2020, 10:40 AM

Use SignedCeilDivIOp to compute parallel loop trip counts and other useful numbers

ezhulenev marked 2 inline comments as done.Nov 9 2020, 11:04 AM
herhut added inline comments.Nov 13 2020, 2:52 AM
mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
157

Another way to model this would be to mark the operations as an allocation with a dedicated allocation resource (TokenStorageResource) and then have the normal bufferization placement of free operations handle this. It would insert copies (inc_rc) and frees (dec_rc) for you.

See also https://llvm.discourse.group/t/remove-tight-coupling-of-the-bufferdeallocation-pass-to-std-and-linalg-operations/2162/7 for the proposal to make this work.

The advantage would be that we end up with a single system that manages the lifetime of allocated resources.

ezhulenev updated this revision to Diff 305068.Nov 13 2020, 3:02 AM

Update tests to handle new function syntax

ezhulenev added inline comments.Nov 13 2020, 3:11 AM
mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
157

Also npcomp needs reference counting for some of the types.

For async ref counting I have a working solution based on the NumberOfExecutions analysis (https://reviews.llvm.org/rGbb0d5f767dd7cf34a92ba2af2d6fdb206d883e8c) and async runtime support in https://reviews.llvm.org/D90716 (still needs to be updated to use the recently merged analysis).

This revision was landed with ongoing or failed builds.Nov 13 2020, 4:03 AM
This revision was automatically updated to reflect the committed changes.
ezhulenev added inline comments.Nov 16 2020, 3:52 AM
mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
157

I've implemented another version of async ref counting based on liveness analysis (somewhat similar to buffer deallocation) in https://reviews.llvm.org/D90716; however, it has subtle differences because of the need to handle async.execute operations.