This is an archive of the discontinued LLVM Phabricator instance.


Differential D89963

[mlir] Transform scf.parallel to scf.for + async.execute
ClosedPublic

Authored by ezhulenev on Oct 22 2020, 7:25 AM.

Details

Reviewers

ftynse
aartbik
herhut
mehdi_amini

Commits

rGc30ab6c2a307: [mlir] Transform scf.parallel to scf.for + async.execute

Summary

Depends On D89958

Adds async.group and async.await_all to group together multiple async tokens/values.
Rewrites the scf.parallel operation into multiple concurrent async.execute operations over non-overlapping subranges of the original loop.

Example:

scf.parallel (%i, %j) = (%lbi, %lbj) to (%ubi, %ubj) step (%si, %sj) {
  "do_some_compute"(%i, %j): () -> ()
}

Converted to:

%c0 = constant 0 : index
%c1 = constant 1 : index

// Compute block sizes for each induction variable.
%num_blocks_i = ... : index
%num_blocks_j = ... : index
%block_size_i = ... : index
%block_size_j = ... : index

// Create an async group to track async execute ops.
%group = async.create_group

scf.for %bi = %c0 to %num_blocks_i step %c1 {
  %block_start_i = ... : index
  %block_end_i   = ... : index

  scf.for %bj = %c0 to %num_blocks_j step %c1 {
    %block_start_j = ... : index
    %block_end_j   = ... : index

    // Execute the body of original parallel operation for the current
    // block.
    %token = async.execute {
      scf.for %i = %block_start_i to %block_end_i step %si {
        scf.for %j = %block_start_j to %block_end_j step %sj {
          "do_some_compute"(%i, %j): () -> ()
        }
      }
    }

    // Add produced async token to the group.
    async.add_to_group %token, %group
  }
}

// Await completion of all async.execute operations.
async.await_all %group

In this example the outer loops launch the inner block-level loops as separate
async.execute operations, which are executed concurrently.

At the end it waits for the completion of all async.execute operations.
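The block-size computations are elided (`...`) in the summary above. As a rough illustration only, the following C++ sketch shows one way a single loop dimension could be split into non-overlapping blocks. The names `splitIntoBlocks` and `ceilDivPositive`, and the idea of passing a target block count, are assumptions made for this sketch and are not taken from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Half-open range [start, end) of induction-variable values for one block.
using Range = std::pair<int64_t, int64_t>;

// Ceiling division for a non-negative numerator and a positive denominator.
int64_t ceilDivPositive(int64_t a, int64_t b) {
  assert(a >= 0 && b > 0);
  return (a + b - 1) / b;
}

// Splits `for (i = lb; i < ub; i += step)` into at most `numBlocks`
// contiguous, non-overlapping sub-ranges that keep the original step.
// (Hypothetical helper; the patch computes these values as MLIR index ops.)
std::vector<Range> splitIntoBlocks(int64_t lb, int64_t ub, int64_t step,
                                   int64_t numBlocks) {
  assert(step > 0 && numBlocks > 0);
  std::vector<Range> blocks;
  if (ub <= lb)
    return blocks; // empty loop, nothing to launch
  int64_t tripCount = ceilDivPositive(ub - lb, step);
  int64_t blockSize = ceilDivPositive(tripCount, numBlocks); // iterations per block
  for (int64_t first = 0; first < tripCount; first += blockSize) {
    int64_t start = lb + first * step;
    int64_t end = std::min(ub, start + blockSize * step);
    blocks.emplace_back(start, end);
  }
  return blocks;
}
```

Each returned range would correspond to one `async.execute` region in the rewritten IR; a 2-D loop would apply the same split independently to each induction variable.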

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ezhulenev created this revision. Oct 22 2020, 7:25 AM

Herald added a reviewer: ftynse. Oct 22 2020, 7:25 AM

Herald added a reviewer: aartbik.

Herald added a project: Restricted Project.

Herald added subscribers: rdzhabarov, tatianashp, msifontes and 15 others.

ezhulenev requested review of this revision. Oct 22 2020, 7:25 AM

Herald added subscribers: stephenneuendorffer, nicolasvasilache. Oct 22 2020, 7:25 AM

ezhulenev edited the summary of this revision. Oct 22 2020, 7:30 AM

ezhulenev added reviewers: herhut, mehdi_amini.

Fix typo

ezhulenev edited the summary of this revision. Oct 22 2020, 7:34 AM

Typo

Harbormaster completed remote builds in B76040: Diff 299966. Oct 22 2020, 7:57 AM

Harbormaster completed remote builds in B76042: Diff 299969. Oct 22 2020, 8:03 AM

Harbormaster completed remote builds in B76041: Diff 299968. Oct 22 2020, 8:08 AM

Minor code cleanup

Harbormaster completed remote builds in B76379: Diff 300634. Oct 26 2020, 4:51 AM

Fix style guide violations

Harbormaster completed remote builds in B76429: Diff 300706. Oct 26 2020, 10:05 AM

Rebase

Harbormaster completed remote builds in B76887: Diff 301566. Oct 29 2020, 4:38 AM

herhut added inline comments. Oct 29 2020, 10:01 AM

mlir/include/mlir/Dialect/Async/IR/AsyncBase.td
63	This should discuss the semantics a bit more, in particular that this acts like a mutable collection. An alternative could be to thread the token through and have a `join` operation that takes n tokens and produces a new one. I cannot judge the tradeoffs right now, but it seems worth a design discussion, especially around how this would lower to other runtimes like TFRT.

mehdi_amini added inline comments. Oct 29 2020, 10:01 PM

mlir/include/mlir/Dialect/Async/IR/AsyncBase.td
63–65	(suggested edit)
mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
89	Isn't this the deprecated form of builders? (@ftynse ?)
161	(suggested edit)
162	Needs a clarification on the lifetime: it seems like it isn't explicitly destroyed right now (and because it is mutable, it isn't really a pure value), so it isn't clear to me how it works.
mlir/lib/Conversion/AsyncToLLVM/AsyncToLLVM.cpp
841	I think this is a variadic template, for convenience.
mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
57	(typo)
mlir/lib/ExecutionEngine/AsyncRuntime.cpp
53	I think that's what I'm talking about above: seems like leaking at the moment?
55	Do we also have a leaking issue with tokens?

ftynse added inline comments. Oct 30 2020, 2:43 AM

mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
89	It is indeed. Please use OpBuilderDAG instead - https://mlir.llvm.org/docs/OpDefinitions/#custom-builder-methods.
115	Nit: it should be possible to omit the `[]` for empty lists.
197	We tend to use Index for most things pertaining to positions. Any reason why this is I64 specifically?
227	Syntax nit: placing attr-dict _before_ operands is very uncommon in MLIR, any reason for this?
mlir/lib/Dialect/Async/IR/Async.cpp
150	I am pretty sure something in block argument lists does not expect a null type (that's what the default constructor gives you) and will eventually crash. Do we have a test that covers the case of `valueType` not being a `ValueType`?
mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
122	This wouldn't work correctly for negative numbers. There is a pending diff adding a ceildiv operation to std; I'd just target that when it lands.

Fix typos + use OpBuilderDAG + address comments.

mlir/include/mlir/Dialect/Async/IR/AsyncBase.td
63	The problem with a `join` operation is that the number of tokens to be awaited is not known statically, so there is a need for something like `std::vector` in MLIR that can accumulate values of any type. `async.group` is basically a `std::vector<AsyncToken>` with a very limited API implemented as runtime intrinsics. (Re TFRT integration: it's in cl/339040831, async_runtime.{h,cc}.)
mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
162	Currently the runtime owns all the allocated values (tokens, values, groups) and destroys them in its destructor. For TFRT the idea is that for each kernel/op invocation we create a short-lived AsyncRuntime instance, which tracks all the tokens/values, and destroy it when the kernel completes (async_runtime.h in cl/339040831; sorry, it's internal-only for now). The other option is to add reference counting intrinsics to all runtime types (token, value, group), but then it gets tricky to do the reference counting correctly.
227	No. Moved all attr-dict to the end.
mlir/lib/Dialect/Async/IR/Async.cpp
150	Changed null type to `operand.getType()`, so it can safely fail in the verifier.
mlir/lib/ExecutionEngine/AsyncRuntime.cpp
53	Yes, right now the runtime is not doing any lifetime management of created tokens/values. If we agree on the "runtime owns them all" approach, I'll port it from the TFRT runtime implementation.
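To make the "`async.group` is basically a `std::vector<AsyncToken>`" remark above concrete, here is a minimal C++ sketch of a group with the limited API described in the summary (add a token, await all). It uses `std::future<void>` as a stand-in for an async token; the names `AsyncTokenHandle` and `TokenGroup` are hypothetical and this is not the runtime code from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// Stand-in for an async token: signals completion of one task.
// (Assumption for this sketch; the real runtime defines its own token type.)
using AsyncTokenHandle = std::future<void>;

// A growable collection of tokens whose size is only known at run time,
// which is why a fixed-arity `join` operation is not sufficient.
class TokenGroup {
public:
  // Loosely mirrors `async.add_to_group %token, %group`.
  void add(AsyncTokenHandle token) {
    assert(token.valid() && "token must refer to a shared state");
    tokens.push_back(std::move(token));
  }

  // Loosely mirrors `async.await_all %group`: blocks until every token
  // added so far has completed.
  void awaitAll() {
    for (AsyncTokenHandle &token : tokens)
      token.wait();
  }

  std::size_t size() const { return tokens.size(); }

private:
  std::vector<AsyncTokenHandle> tokens;
};
```

A caller could add one `std::async(...)` result per loop block and then call `awaitAll()` once after the launching loops, which matches the shape of the rewritten IR in the summary.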

Harbormaster completed remote builds in B77036: Diff 301883. Oct 30 2020, 7:09 AM

mehdi_amini added inline comments. Oct 30 2020, 8:33 AM

mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
162	> Currently the runtime owns all the allocated values (tokens, values, groups) and destroys them in destructor.
	When is the destructor invoked? I don't have a clear understanding of the lifetime right now, it seems like the runtime context not being explicitly modeled does not allow to reason about it here right now.

ezhulenev added inline comments. Oct 30 2020, 9:54 AM

mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
162	In the case of mlir-cpu-runner + shared lib, it will be destructed during the shutdown process (something like `static std::unique_ptr<Runtime>`). For users that do JIT + ExecutionEngine, it will be their responsibility to bind the async API symbols at runtime to a "short-lived" instance of the AsyncRT and properly destroy it.
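The "runtime owns them all" approach described above can be sketched as follows. This is only an illustration of the ownership idea under stated assumptions: `OwningRuntime`, `OwnedToken`, and the `liveTokens` counter are hypothetical names, and the counter exists purely so the destruction can be observed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Number of tokens currently alive; only here to make destruction observable.
int liveTokens = 0;

struct OwnedToken {
  OwnedToken() { ++liveTokens; }
  ~OwnedToken() { --liveTokens; }
  bool ready = false;
};

// Hypothetical runtime that owns every token it hands out. Callers receive
// raw, non-owning pointers; nothing is freed until the runtime is destroyed.
class OwningRuntime {
public:
  OwnedToken *createToken() {
    tokens.push_back(std::make_unique<OwnedToken>());
    return tokens.back().get();
  }

  std::size_t numTokens() const { return tokens.size(); }

private:
  // All tokens are released together in ~OwningRuntime().
  std::vector<std::unique_ptr<OwnedToken>> tokens;
};
```

With a process-wide `static std::unique_ptr<OwningRuntime>`, destruction only happens at shutdown, which is the concern raised in the reply that follows; a short-lived instance per kernel invocation bounds the lifetime instead.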

mehdi_amini added inline comments. Oct 30 2020, 10:44 AM

mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
162	> In case of mlir-cpu-runner + shared lib it will be destructed during shut down process
	That seems morally equivalent to not releasing to me at this point. I'd rather find a more principled way to manage the lifetime here; do you see a way to reason about it by construction?

ezhulenev added inline comments. Oct 30 2020, 11:19 AM

mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
162	Yeah, my current proposal is just to sweep memory management problems under the rug :) (at least temporarily).

	Option #1 (easy):
	Add an `async.destroy` operation, and require that it be called explicitly to destroy all async tokens/values/groups that are no longer needed. It then becomes the responsibility of the IR builder to place this op correctly. In the case of async parallel for it is trivial to do.

	Option #2 (hard(er)?):
	During async to LLVM lowering, infer the points where `mlirAsyncRuntimeDestroy(...)` API calls must be inserted. It gets tricky very quickly because, for example, an `async.token` can be captured by multiple `async.execute` operations, and there is no single point for `async.destroy`.

	The only option I see is to do reference counting on async values (tokens, ...):

	AddRef when passed to `async.execute`
	DropRef on captured values/tokens when `async.execute` completes
	DropRef when an async value does not leave the "scope" (region), e.g. it is created inside a region (function/for/...) and does not leave it. I have a vague idea how to do that; it's kind of similar to C++ destructors being called automatically, but it's hard to define what a "scope" is in MLIR.

ezhulenev added inline comments. Oct 30 2020, 1:58 PM

mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
162	Another question is how to do reference counting of async values: add reference counting ops to the async dialect (`async.ref.add`/`async.ref.drop`), add them to some "intermediate" host-async dialect (reference counting might not make much sense for host-vs-GPU asynchronicity and GPU codegen), or just insert calls to runtime intrinsics during async to LLVM lowering.

	The principled solution I can think of is an `async-ref-counting` pass that runs before lowering to LLVM.

	func @fn0() -> !async.token {
	  %t0 = async.execute { ... }
	  %t1 = async.execute[%t0] {...}
	  return %t1
	}

	func @fn1(%arg0 : !async.token) {
	  ...
	  return
	}

	func @fn2() {
	  %t = call @fn0()
	  ...
	  async.await %t
	  return
	}

	`mlir-opt -async-ref-counting` produces something like:

	func @fn0() -> !async.token {
	  %t0 = async.execute { ... }

	  async.ref.add %t0
	  %t1 = async.execute[%t0] {
	     ...
	     async.ref.drop %t0
	  }

	  async.ref.drop %t0
	  async.ref.drop %t1

	  async.ref.add %t1
	  return %t1 : !async.token
	}

	func @fn1(%arg0 : !async.token) {
	  ...
	  async.ref.drop %arg0
	  return
	}

	func @fn2() {
	  %t = call @fn0()

	  async.ref.add %t
	  call @fn1(%t)
	  ...
	  async.await %t
	  async.ref.drop %t
	  return
	}

	And then the `async.ref.*` ops are lowered to async runtime intrinsics.
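As a hedged illustration of what the proposed `async.ref.add`/`async.ref.drop` ops might lower to, the C++ sketch below implements a pair of reference-counting intrinsics on a token. The names `RefCountedToken`, `addRef`, and `dropRef` are assumptions for this sketch (the actual runtime support was developed separately in D90716), and the `destroyed` flag exists only so the deletion can be observed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical reference-counted token. It starts with one reference,
// held by whoever created it.
struct RefCountedToken {
  explicit RefCountedToken(bool *destroyedFlag) : destroyed(destroyedFlag) {}
  ~RefCountedToken() {
    if (destroyed)
      *destroyed = true; // lets a caller observe the deletion
  }
  std::atomic<int32_t> refCount{1};
  bool *destroyed;
};

// Would correspond to `async.ref.add`: take `count` extra references.
void addRef(RefCountedToken *token, int32_t count = 1) {
  assert(count > 0);
  token->refCount.fetch_add(count);
}

// Would correspond to `async.ref.drop`: release `count` references and
// delete the token when the last one goes away.
void dropRef(RefCountedToken *token, int32_t count = 1) {
  assert(count > 0);
  int32_t previous = token->refCount.fetch_sub(count);
  assert(previous >= count && "dropped more references than were held");
  if (previous == count)
    delete token;
}
```

In the `@fn0` example above, the `async.ref.add %t0` before the second `async.execute` keeps the token alive until the region's own `async.ref.drop %t0` runs, even if the enclosing function drops its reference first.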

mehdi_amini added inline comments. Oct 30 2020, 3:08 PM

mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
162	Few thoughts on the refcount and a proposal:

	%t1 = async.execute[%t0] {
	   ...
	   async.ref.drop %t0
	}

	I think you can ref.drop at the beginning of the execute, right? (assuming the token isn't used in the region of course).

	async.ref.add %t0
	%t1 = async.execute[%t0] {
	   ...
	   async.ref.drop %t0
	}

	It seems like we could make it part of the IR semantics for `async.execute` and leave it up to the lowering/implementation: it'll always increment the ref-count and drop it like here, but implicitly.

	That said, the advantage of the explicit method is that:

	%t0 = async.execute { ... }

	async.ref.add %t0
	%t1 = async.execute[%t0] {
	   ...
	   async.ref.drop %t0
	}
	async.ref.drop %t0

	can be optimized to:

	%t0 = async.execute { ... }
	%t1 = async.execute[%t0] {
	   ...
	   async.ref.drop %t0
	}

	async.ref.drop %t1

	async.ref.add %t1
	return %t1 : !async.token

	Is this correct? Isn't %t1 refcount 0 after the drop and so gets deleted?

	What about this proposal: when assigning a token to an SSA value inside a region, it always has conceptually a refcount of 1 (the actual refcount may be higher, but the responsibility of this region is to decrement it by one ultimately). We increment the refcount for as many SSA users there are for this token minus one (if there are no users we actually decrement the counter). It is the responsibility of the token user to either forward it as-is (to a new SSA value) or decrement the refcount. It seems like it should be composing well, including with return statement.

	func @fn0() -> !async.token {
	  %t0 = async.execute { ... }
	  // %t0 has a refcount of 1 and only one user, do nothing with refcount adjustments
	  %t1 = async.execute[%t0] {
	    // on execution, drop the refcount for %t0,
	    ...
	  }
	  // %t1 has a refcount of 1 and a single user, do nothing with refcount adjustments
	  return %t1 : !async.token
	}

	func @fn1(%arg0 : !async.token) {
	  // %arg0 has a refcount of 1, but 0 user, we decrement immediately.
	  async.ref.drop %arg0
	  ...
	  return
	}

	func @fn2() {
	  %t = call @fn0()
	  // %t starts with a refcount of 1, has two SSA uses, add one
	  async.ref.add %t, 1
	  call @fn1(%t)  // function call like any other consumer will naturally drop it by one.
	  ...
	  async.await %t  // await always drop the refcount internally.
	  return
	}

	(I should write a design doc, otherwise I'll have nothing for my next perf ;))
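The counting rule in this proposal (start at a conceptual refcount of 1, adjust by the number of SSA uses minus one, and let every user drop one reference) can be checked with a few lines of arithmetic. The C++ sketch below is only a model of that rule; the function names are hypothetical and it assumes every user drops its reference rather than forwarding the token.

```cpp
#include <cassert>
#include <cstdint>

// Adjustment to emit right after a token is bound to an SSA value:
//   > 0  -> equivalent of `async.ref.add %t, <n>`
//   == 0 -> nothing to emit
//   < 0  -> equivalent of `async.ref.drop %t` (the value has no users)
int64_t refCountAdjustment(int64_t numUses) {
  assert(numUses >= 0);
  return numUses - 1;
}

// Models the lifetime of one token: the conceptual +1 it starts with, the
// adjustment above, and one drop per user. Assumes each user drops its
// reference instead of forwarding the token to a new SSA value.
int64_t refCountAfterAllUses(int64_t numUses) {
  int64_t refCount = 1;                    // conceptual refcount at definition
  refCount += refCountAdjustment(numUses); // adjustment emitted by the pass
  refCount -= numUses;                     // each user drops one reference
  return refCount;
}
```

For `@fn2` above, `%t` has two uses, so the adjustment is +1 (the `async.ref.add %t, 1`), and the count returns to zero after the call and the await; for `@fn1`, `%arg0` has no uses, so the adjustment is an immediate drop.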

ezhulenev added inline comments. Oct 30 2020, 3:53 PM

mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
162	> Is this correct? Isn't %t1 refcount 0 after the drop and so gets deleted?
	Yup, my mistake. I was thinking ahead and assumed that the drop/add sequence can be optimized away.

	> What about this proposal: when assigning a token to an SSA value inside a region, it always has conceptually ...
	Yeah, seems like that will work, and it's consistent with the Swift calling convention (https://github.com/apple/swift/blob/main/docs/SIL.rst#reference-counts): references passed as +1 and caller consumes it (drops ref, if I understand correctly). And BefExecutor is doing exactly this with +number-of-uses in the beginning of BEF execution.

	The only problem is that the reference counting lowering should know how to update all the users of async SSA values (drop ref inside functions, etc.). For example, if an `async.value` is passed to `scf.for`, the async lowering should know how to update then/else regions, but maybe op interfaces will help here. Initially it can be restricted to functions and async ops.

	I'll send a separate CL with a proper reference counting async runtime next week.

mehdi_amini accepted this revision. Oct 30 2020, 7:26 PM

mehdi_amini added inline comments.

mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
162	OK LG!

This revision is now accepted and ready to land. Oct 30 2020, 7:26 PM

ezhulenev added a child revision: D90716: [mlir] Automatic reference counting for Async values + runtime support for ref counted objects. Nov 3 2020, 1:57 PM

LGTM after replacing divup function with rewriter.create<SignedCeilDivIOp> and adding the populateStdExpandDivsRewritePatterns into the lowering mix.

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
122	https://reviews.llvm.org/D89726 has landed, we now have `ceildiv` that expands to a longer sequence of operations that supports negative values. I expect the canonicalizer to remove the unnecessary ones if it knows the sign of all operands.
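To illustrate why a plain "divide up" helper breaks for negative operands, the C++ sketch below compares a common positive-only formula with a ceiling division that handles all sign combinations. The exact form of the original `divup` helper is not shown in this review, so `divUpNaive` is an assumed typical implementation, and `ceilDivSigned` is a sketch of the intended semantics rather than the expansion emitted for `SignedCeilDivIOp`.

```cpp
#include <cassert>
#include <cstdint>

// Assumed typical "divup" helper: only correct when a >= 0 and b > 0,
// because C++ integer division truncates toward zero.
int64_t divUpNaive(int64_t a, int64_t b) { return (a + b - 1) / b; }

// Ceiling division that is correct for any sign combination (b != 0).
int64_t ceilDivSigned(int64_t a, int64_t b) {
  assert(b != 0);
  int64_t quotient = a / b; // truncated toward zero
  // Truncation already rounds up when the exact quotient is negative.
  // Only a positive, inexact quotient (operands with equal signs and a
  // non-zero remainder) has to be bumped by one.
  if (a % b != 0 && ((a < 0) == (b < 0)))
    ++quotient;
  return quotient;
}
```

For example, the exact value of -4 / 3 is about -1.33, so the ceiling is -1; `ceilDivSigned(-4, 3)` returns -1, while `divUpNaive(-4, 3)` evaluates `(-4 + 2) / 3` and truncates to 0.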

Use SignedCeilDivIOp to compute parallel loop trip counts and other useful numbers

ezhulenev marked 2 inline comments as done. Nov 9 2020, 11:04 AM

Harbormaster completed remote builds in B78157: Diff 303931. Nov 9 2020, 11:06 AM

herhut added inline comments. Nov 13 2020, 2:52 AM

mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
162	Another way to model this would be to mark the operations as an allocation with a dedicated allocation resource (`TokenStorageResource`) and then have the normal bufferization placement of free operations handle this. It would insert copies (`inc_rc`) and frees (`dec_rc`) for you. See also https://llvm.discourse.group/t/remove-tight-coupling-of-the-bufferdeallocation-pass-to-std-and-linalg-operations/2162/7 for the proposal to make this work. The advantage would be that we end up with a single system that manages the lifetime of allocated resources.

Update tests to handle new function syntax

ezhulenev added inline comments. Nov 13 2020, 3:11 AM

mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
162	Also `npcomp` needs reference counting for some of the types. For async ref counting I have a working solution based on the `NumberOfExecutions` analysis (https://reviews.llvm.org/rGbb0d5f767dd7cf34a92ba2af2d6fdb206d883e8c) and async runtime support in https://reviews.llvm.org/D90716 (still needs to be updated to use the recently merged analysis).

Harbormaster completed remote builds in B78735: Diff 305068. Nov 13 2020, 3:17 AM

Rebased

Harbormaster completed remote builds in B78737: Diff 305075. Nov 13 2020, 4:02 AM

This revision was landed with ongoing or failed builds. Nov 13 2020, 4:03 AM

Closed by commit rGc30ab6c2a307: [mlir] Transform scf.parallel to scf.for + async.execute (authored by ezhulenev).

This revision was automatically updated to reflect the committed changes.

ezhulenev added a commit: rGc30ab6c2a307: [mlir] Transform scf.parallel to scf.for + async.execute.

ezhulenev added inline comments. Nov 16 2020, 3:52 AM

mlir/include/mlir/Dialect/Async/IR/AsyncOps.td
162	I've implemented another version of async ref counting based on liveness analysis (somewhat similar to buffer dealloc) in https://reviews.llvm.org/D90716; however, it has subtle differences because of the need to handle `async.execute` operations.

Herald added a subscriber: teijeong. Nov 16 2020, 3:52 AM

Revision Contents

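Path                                                                        Size
mlir/include/mlir/Dialect/Async/CMakeLists.txt                              6 lines
mlir/include/mlir/Dialect/Async/IR/Async.h                                  6 lines
mlir/include/mlir/Dialect/Async/IR/AsyncBase.td                             10 lines
mlir/include/mlir/Dialect/Async/IR/AsyncOps.td                              94 lines
mlir/include/mlir/Dialect/Async/Passes.h                                    32 lines
mlir/include/mlir/Dialect/Async/Passes.td                                   27 lines
mlir/include/mlir/ExecutionEngine/AsyncRuntime.h                            20 lines
mlir/include/mlir/InitAllPasses.h                                           2 lines
mlir/integration_test/Dialect/Async/CPU/lit.local.cfg                       5 lines
mlir/integration_test/Dialect/Async/CPU/test-async-parallel-for-1d.mlir     66 lines
mlir/integration_test/Dialect/Async/CPU/test-async-parallel-for-2d.mlir     93 lines
mlir/lib/Conversion/AsyncToLLVM/AsyncToLLVM.cpp                             153 lines
mlir/lib/Dialect/Async/CMakeLists.txt                                       1 line
mlir/lib/Dialect/Async/IR/Async.cpp                                         47 lines
mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp                      278 lines
mlir/lib/Dialect/Async/Transforms/CMakeLists.txt                            17 lines
mlir/lib/Dialect/Async/Transforms/PassDetail.h                              30 lines
mlir/lib/ExecutionEngine/AsyncRuntime.cpp                                   65 lines
mlir/test/Conversion/AsyncToLLVM/convert-to-llvm.mlir                       39 lines
mlir/test/Dialect/Async/async-parallel-for.mlir                             44 lines
mlir/test/Dialect/Async/ops.mlir                                            14 lines
mlir/test/mlir-cpu-runner/async-group.mlir                                  40 lines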

Diff 305093

mlir/include/mlir/Dialect/Async/CMakeLists.txt

  add_subdirectory(IR)
+
+ set(LLVM_TARGET_DEFINITIONS Passes.td)
+ mlir_tablegen(Passes.h.inc -gen-pass-decls -name Async)
+ add_public_tablegen_target(MLIRAsyncPassIncGen)
+
+ add_mlir_doc(Passes -gen-pass-doc AsyncPasses ./)

mlir/include/mlir/Dialect/Async/IR/Async.h

[41 lines not shown]
  public:
    using Base::Base;

    /// Get or create an async ValueType with the provided value type.
    static ValueType get(Type valueType);

    Type getValueType();
  };

+ /// The group type to represent async tokens or values grouped together.
+ class GroupType : public Type::TypeBase<GroupType, Type, TypeStorage> {
+ public:
+   using Base::Base;
+ };
+
  } // namespace async
  } // namespace mlir

  #define GET_OP_CLASSES
  #include "mlir/Dialect/Async/IR/AsyncOps.h.inc"

  #include "mlir/Dialect/Async/IR/AsyncOpsDialect.h.inc"

  #endif // MLIR_DIALECT_ASYNC_IR_ASYNC_H

mlir/include/mlir/Dialect/Async/IR/AsyncBase.td

[50 lines not shown]
    let typeDescription = [{
      `async.value` represents a value returned by asynchronous operations,
      which may or may not be available currently, but will be available at some
      point in the future.
    }];

    Type valueType = type;
  }

+ def Async_GroupType : DialectType<AsyncDialect,
+     CPred<"$_self.isa<::mlir::async::GroupType>()">, "group type">,
+     BuildableType<"$_builder.getType<::mlir::async::GroupType>()"> {
+   let typeDescription = [{
+     `async.group` represent a set of async tokens or values and allows to

herhut (Not Done):
This should discuss the semantics a bit more, in particular that this acts like a mutable collection.

An alternative could be to thread the token through and have a join operation that takes n tokens and produces a new one.

I cannot judge the tradeoffs right now, but it seems worth a design discussion, especially around how this would lower to other runtimes like TFRT.

ezhulenev (Author, Done):
The problem with a join operation is that the number of tokens to be awaited is not known statically, so there is a need for something like std::vector in MLIR that can accumulate values of any type. async.group is basically a std::vector<AsyncToken> with a very limited API implemented as runtime intrinsics.

(Re TFRT integration: it's in cl/339040831, async_runtime.{h,cc}.)

+     execute async operations on all of them together (e.g. wait for the
+     completion of all/any of them).

mehdi_amini (Done), suggested edit:
    let typeDescription = [{
-     `async.group` is a type that allows to group together multiple async tokens
-     or values and execute async operations on all of them together (e.g. wait
+     `async.group` represent a set of async tokens or values
+     and execute async operations on all of them together (e.g. wait
      for the completion of all/any of them).
    }];
  }

+   }];
+ }
+
  def Async_AnyValueType : DialectType<AsyncDialect,
      CPred<"$_self.isa<::mlir::async::ValueType>()">,
      "async value type">;

  def Async_AnyValueOrTokenType : AnyTypeOf<[Async_AnyValueType,
                                             Async_TokenType]>;

  #endif // ASYNC_BASE_TD

mlir/include/mlir/Dialect/Async/IR/AsyncOps.td

[75 lines not shown] def Async_ExecuteOp :
    let results = (outs Async_TokenType:$token,
                        Variadic<Async_AnyValueType>:$results);
    let regions = (region SizedRegion<1>:$body);

    let printer = [{ return ::print(p, *this); }];
    let parser = [{ return ::parse$cppClass(parser, result); }];
    let verifier = [{ return ::verify(*this); }];
+
+   let skipDefaultBuilders = 1;
+   let builders = [
+     OpBuilderDAG<(ins "TypeRange":$resultTypes, "ValueRange":$dependencies,
+       "ValueRange":$operands,
+       CArg<"function_ref<void(OpBuilder &, Location, ValueRange)>",

mehdi_amini (Done):
Isn't this the deprecated form of builders? (@ftynse ?)

ftynse (Done):
It is indeed. Please use OpBuilderDAG instead - https://mlir.llvm.org/docs/OpDefinitions/#custom-builder-methods.

+            "nullptr">:$bodyBuilder)>,
+   ];
+
+   let extraClassDeclaration = [{
+     using BodyBuilderFn =
+         function_ref<void(OpBuilder &, Location, ValueRange)>;
+   }];
  }

  def Async_YieldOp :
      Async_Op<"yield", [HasParent<"ExecuteOp">, NoSideEffect, Terminator]> {
    let summary = "terminator for Async execute operation";
    let description = [{
      The `async.yield` is a special terminator operation for the block inside
      `async.execute` operation.
    }];

    let arguments = (ins Variadic<AnyType>:$operands);

-   let assemblyFormat = "attr-dict ($operands^ `:` type($operands))?";
+   let assemblyFormat = "($operands^ `:` type($operands))? attr-dict";

    let verifier = [{ return ::verify(*this); }];
  }

- def Async_AwaitOp : Async_Op<"await", [NoSideEffect]> {
+ def Async_AwaitOp : Async_Op<"await"> {

ftynse (Done):
Nit: it should be possible to omit the [] for empty lists.

    let summary = "waits for the argument to become ready";
    let description = [{
      The `async.await` operation waits until the argument becomes ready, and for
      the `async.value` arguments it unwraps the underlying value

      Example:

      ```mlir
[18 lines not shown] def Async_AwaitOp : Async_Op<"await"> {
    let extraClassDeclaration = [{
      Optional<Type> getResultType() {
        if (getResultTypes().empty()) return None;
        return getResultTypes()[0];
      }
    }];

    let assemblyFormat = [{
-     attr-dict $operand `:` custom<AwaitResultType>(
+     $operand `:` custom<AwaitResultType>(
        type($operand), type($result)
-     )
+     ) attr-dict
    }];

    let verifier = [{ return ::verify(*this); }];
  }

+ def Async_CreateGroupOp : Async_Op<"create_group", [NoSideEffect]> {
+   let summary = "creates an empty async group";
+   let description = [{
+     The `async.create_group` allocates an empty async group. Async tokens or

mehdi_amini (Done), suggested edit:
    let description = [{
-     The `async.create_group` creates and empty async group. Async tokens or
+     The `async.create_group` allocate an empty async group. Async tokens or
      values can be added to this group later.

+     values can be added to this group later.

mehdi_amini (Not Done):
Needs a clarification on the lifetime: it seems like it isn't explicitly destroyed right now (and because it is mutable, it isn't really a pure value), so it isn't clear to me how it works.

ezhulenev (Author, Done):
Currently the runtime owns all the allocated values (tokens, values, groups) and destroys them in its destructor. For TFRT the idea is that for each kernel/op invocation we create a short-lived AsyncRuntime instance, which tracks all the tokens/values, and destroy it when the kernel completes (async_runtime.h in cl/339040831; sorry, it's internal-only for now).

The other option is to add reference counting intrinsics to all runtime types (token, value, group), but then it gets tricky to do the reference counting correctly.

mehdi_amini (Not Done):
> Currently the runtime owns all the allocated values (tokens, values, groups) and destroys them in destructor.

When is the destructor invoked? I don't have a clear understanding of the lifetime right now, it seems like the runtime context not being explicitly modeled does not allow to reason about it here right now.

ezhulenev (Author, Not Done):
In the case of mlir-cpu-runner + shared lib, it will be destructed during the shutdown process (something like static std::unique_ptr<Runtime>). For users that do JIT + ExecutionEngine, it will be their responsibility to bind the async API symbols at runtime to a "short-lived" instance of the AsyncRT and properly destroy it.

mehdi_amini (Not Done):
> In case of mlir-cpu-runner + shared lib it will be destructed during shut down process

That seems morally equivalent to not releasing to me at this point.
I'd rather find a more principled way to manage the lifetime here; do you see a way to reason about it by construction?

ezhulenev (Author, Done):
Yeah, my current proposal is just to sweep memory management problems under the rug :) (at least temporarily).

Option #1 (easy):
Add an async.destroy operation, and require that it be called explicitly to destroy all async tokens/values/groups that are no longer needed. It then becomes the responsibility of the IR builder to place this op correctly. In the case of async parallel for it is trivial to do.

Option #2 (hard(er)?):
During async to LLVM lowering, infer the points where mlirAsyncRuntimeDestroy(...) API calls must be inserted. It gets tricky very quickly because, for example, an async.token can be captured by multiple async.execute operations, and there is no single point for async.destroy.

The only option I see is to do reference counting on async values (tokens, ...):

AddRef when passed to async.execute
DropRef on captured values/tokens when async.execute completes
DropRef when an async value does not leave the "scope" (region), e.g. it is created inside a region (function/for/...) and does not leave it. I have a vague idea how to do that; it's kind of similar to C++ destructors being called automatically, but it's hard to define what a "scope" is in MLIR.

ezhulenev (Author, Done):
Another question is how to do reference counting of async values: add reference counting ops to the async dialect (async.ref.add/async.ref.drop), add them to some "intermediate" host-async dialect (reference counting might not make much sense for host-vs-GPU asynchronicity and GPU codegen), or just insert calls to runtime intrinsics during async to LLVM lowering.

The principled solution I can think of is an async-ref-counting pass that runs before lowering to LLVM.

func @fn0() -> !async.token {
  %t0 = async.execute { ... }
  %t1 = async.execute[%t0] {...}
  return %t1
}

func @fn1(%arg0 : !async.token) {
  ...
  return
}

func @fn2() {
  %t = call @fn0()
  ...
  async.await %t
  return
}

mlir-opt -async-ref-counting produces something like:

func @fn0() -> !async.token {
  %t0 = async.execute { ... }

  async.ref.add %t0
  %t1 = async.execute[%t0] {  
     ...
     async.ref.drop %t0
  }

  async.ref.drop %t0
  async.ref.drop %t1

  async.ref.add %t1
  return %t1 : !async.token
}

func @fn1(%arg0 : !async.token) {
  ...
  async.ref.drop %arg0
  return
} 

func @fn2() {
  %t = call @fn0()

  async.ref.add %t
  call @fn1(%t)
  ...
  async.await %t
  async.ref.drop %t
  return
}

The async.ref.* ops are then lowered to async runtime intrinsics.

mehdi_amini:

A few thoughts on the refcount and a proposal:

%t1 = async.execute[%t0] {  
   ...
   async.ref.drop %t0
}

I think you can ref.drop at the beginning of the execute right? (assuming the token isn't used in the region of course).

async.ref.add %t0
%t1 = async.execute[%t0] {  
   ...
   async.ref.drop %t0
}

It seems like we could make it part of the IR semantics for async.execute and leave it up to the lowering/implementation: it'll always increment the ref-count and drop it like here, but implicitly.

That said the advantage of the explicit method is that:

%t0 = async.execute { ... }

async.ref.add %t0
%t1 = async.execute[%t0] {  
   ...
   async.ref.drop %t0
}
async.ref.drop %t0

can be optimized to:

%t0 = async.execute { ... }
%t1 = async.execute[%t0] {  
   ...
   async.ref.drop %t0
}

async.ref.drop %t1

async.ref.add %t1
return %t1 : !async.token

Is this correct? Isn't %t1 refcount 0 after the drop and so gets deleted?

What about this proposal: when assigning a token to an SSA value inside a region, it always has conceptually a refcount of 1 (the actual refcount may be higher, but the responsibility of this region is to decrement it by one ultimately). We increment the refcount by the number of SSA users of this token minus one (if there are no users we actually decrement the counter). It is the responsibility of the token user to either forward it as-is (to a new SSA value) or decrement the refcount.

It seems like it should compose well, including with return statements.

func @fn0() -> !async.token {
  %t0 = async.execute { ... }
  // %t0 has a refcount of 1 and only one user, do nothing with refcount adjustments
  
  %t1 = async.execute[%t0] {   // on execution, drop the refcount for %t0, 
     ...
  }
  // %t1 has a refcount of 1 and a single user, do nothing with refcount adjustments
  return %t1 : !async.token
}

func @fn1(%arg0 : !async.token) {
  // %arg0 has a refcount of 1, but 0 user, we decrement immediately.
  async.ref.drop %arg0
  ...
  return
} 

func @fn2() {
  %t = call @fn0()
  // %t starts with a refcount of 1, has two SSA uses, add one
  async.ref.add %t, 1
  call @fn1(%t) // function call like any other consumer will naturally drop it by one.
  ...
  async.await %t // await always drop the refcount internally.
  return
}
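The bookkeeping in this proposal reduces to simple arithmetic, sketched below. The function names are hypothetical, and the model assumes that every SSA user consumes exactly one reference (a user that forwards the token hands that reference on rather than dropping it, which is counted the same way here).

```cpp
#include <cassert>
#include <cstdint>

// Net adjustment to apply right after a token value is defined, given that it
// conceptually starts with a refcount of 1 and has `numUsers` SSA users.
// Positive means "add this many references", negative means "drop".
inline int64_t refCountAdjustment(int64_t numUsers) { return numUsers - 1; }

// Refcount left over once every user has consumed its reference.
// Under the proposal this should always be zero.
inline int64_t finalRefCount(int64_t numUsers) {
  int64_t count = 1;                     // conceptual initial reference
  count += refCountAdjustment(numUsers); // async.ref.add / async.ref.drop
  count -= numUsers;                     // each user consumes one reference
  return count;
}
```

For the listing above: `%t` in `@fn2` has two users, so the adjustment is +1; `%arg0` in `@fn1` has no users, so the adjustment is -1, i.e. an immediate drop.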

(I should write a design doc, otherwise I'll have nothing for my next perf ;))

ezhulenev (Author):

> Is this correct? Isn't %t1 refcount 0 after the drop and so gets deleted?

Yup, my mistake. I was thinking ahead and assumed that the drop/add sequence can be optimized away.

> What about this proposal: when assigning a token to an SSA value inside a region, it always has conceptually ...

Yeah, seems like that will work, and it's consistent with the Swift calling convention (https://github.com/apple/swift/blob/main/docs/SIL.rst#reference-counts): references are passed at +1 and the callee consumes them (drops the ref, if I understand correctly). And BefExecutor is doing exactly this with +number-of-uses at the beginning of BEF execution.

The only problem is that the reference counting lowering should know how to update all the users of async SSA values (drop the ref inside functions, etc.). For example, if an async value is passed to an op with regions, such as scf.for or scf.if, the lowering should know how to update the nested regions (loop body, then/else); maybe op interfaces will help here. Initially it can be restricted to functions and async ops.

I'll send a separate CL with a proper reference counting async runtime next week.

mehdi_amini:

OK LG!

herhut:

Another way to model this would be to mark the operations as an allocation with a dedicated allocation resource (TokenStorageResource) and then have the normal bufferization placement of free operations handle this. It would insert copies (inc_rc) and frees (dec_rc) for you.

The advantage would be that we end up with a single system that manages the lifetime of allocated resources.

ezhulenev (Author):

Also npcomp needs reference counting for some of the types.

For async ref counting I have a working solution based on the NumberOfExecutions analysis (https://reviews.llvm.org/rGbb0d5f767dd7cf34a92ba2af2d6fdb206d883e8c) and async runtime support in https://reviews.llvm.org/D90716 (it still needs to be updated to use the recently merged analysis).

ezhulenev (Author):

I've implemented another version of async ref counting based on liveness analysis (somewhat similar to buffer dealloc) in https://reviews.llvm.org/D90716, however it has subtle differences because of the need to handle async.execute operations.


Example:

```mlir

%0 = async.create_group

...

async.await_all %0

```

}];

let arguments = (ins );

let results = (outs Async_GroupType:$result);

let assemblyFormat = "attr-dict";

}

def Async_AddToGroupOp : Async_Op<"add_to_group", []> {

let summary = "adds an async token or value to the group";

let description = [{

The `async.add_to_group` adds an async token or value to the async group.

Returns the rank of the added element in the group. This rank is fixed

for the group lifetime.

Example:

```mlir

%0 = async.create_group

%1 = ... : !async.token

%2 = async.add_to_group %1, %0 : !async.token

```

}];

let arguments = (ins Async_AnyValueOrTokenType:$operand,

Async_GroupType:$group);

let results = (outs Index:$rank);

ftynse:

We tend to use Index for most things pertaining to positions. Any reason why this is I64 specifically?

let assemblyFormat = "$operand `,` $group `:` type($operand) attr-dict";

}

def Async_AwaitAllOp : Async_Op<"await_all", []> {

let summary = "waits for all the async tokens or values in the group to "

"become ready";

let description = [{

The `async.await_all` operation waits until all the tokens or values in the

group become ready.

Example:

```mlir

%0 = async.create_group

%1 = ... : !async.token

%2 = async.add_to_group %1, %0 : !async.token

%3 = ... : !async.token

%4 = async.add_to_group %3, %0 : !async.token

async.await_all %0

```

}];

let arguments = (ins Async_GroupType:$operand);

let results = (outs);

let assemblyFormat = "$operand attr-dict";

ftynse:

Syntax nit: placing attr-dict _before_ operands is very uncommon in MLIR, any reason for this?

ezhulenev (Author):

No. Moved all attr-dict to the end.

}

#endif // ASYNC_OPS

mlir/include/mlir/Dialect/Async/Passes.h

This file was added.

				//===- Passes.h - Async pass entry points ------------------------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This header file defines prototypes that expose pass constructors.
				//
				//===----------------------------------------------------------------------===//

				#ifndef MLIR_DIALECT_ASYNC_PASSES_H_
				#define MLIR_DIALECT_ASYNC_PASSES_H_

				#include "mlir/Pass/Pass.h"

				namespace mlir {

				std::unique_ptr<OperationPass<FuncOp>> createAsyncParallelForPass();

				//===----------------------------------------------------------------------===//
				// Registration
				//===----------------------------------------------------------------------===//

				/// Generate the code for registering passes.
				#define GEN_PASS_REGISTRATION
				#include "mlir/Dialect/Async/Passes.h.inc"

				} // namespace mlir

				#endif // MLIR_DIALECT_ASYNC_PASSES_H_

mlir/include/mlir/Dialect/Async/Passes.td

This file was added.

				//===-- Passes.td - Async pass definition file --------------*- tablegen -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef MLIR_DIALECT_ASYNC_PASSES
				#define MLIR_DIALECT_ASYNC_PASSES

				include "mlir/Pass/PassBase.td"

				def AsyncParallelFor : FunctionPass<"async-parallel-for"> {
				  let summary = "Convert scf.parallel operations to multiple async regions "
				                "executed concurrently for non-overlapping iteration ranges";
				  let constructor = "mlir::createAsyncParallelForPass()";
				  let options = [
				    Option<"numConcurrentAsyncExecute", "num-concurrent-async-execute",
				      "int32_t", /*default=*/"4",
				      "The number of async.execute operations that will be used for concurrent "
				      "loop execution.">
				  ];
				  let dependentDialects = ["async::AsyncDialect", "scf::SCFDialect"];
				}

				#endif // MLIR_DIALECT_ASYNC_PASSES
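The pass summary speaks of "non-overlapping iteration ranges", with `num-concurrent-async-execute` bounding how many ranges there are. One way such a split can be computed is sketched below. The formulas are an assumption for illustration; the ones emitted by AsyncParallelFor.cpp are not visible in this part of the diff, and `tripCount`, `splitIntoBlocks`, and `Block` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Half-open range [first, second) of iteration numbers (0-based), not of
// induction variable values. Iteration k maps to the value lb + k * step.
using Block = std::pair<int64_t, int64_t>;

// Number of iterations of `for (i = lb; i < ub; i += step)` with step > 0.
inline int64_t tripCount(int64_t lb, int64_t ub, int64_t step) {
  assert(step > 0 && "this sketch assumes a positive step");
  return ub <= lb ? 0 : (ub - lb + step - 1) / step;
}

// Splits the loop into at most `numBlocks` non-overlapping blocks that
// together cover every iteration exactly once.
inline std::vector<Block> splitIntoBlocks(int64_t lb, int64_t ub, int64_t step,
                                          int64_t numBlocks) {
  assert(numBlocks > 0 && "need at least one block");
  std::vector<Block> blocks;
  const int64_t trips = tripCount(lb, ub, step);
  if (trips == 0)
    return blocks;
  const int64_t blockSize = (trips + numBlocks - 1) / numBlocks; // ceil div
  for (int64_t start = 0; start < trips; start += blockSize)
    blocks.emplace_back(start, std::min(start + blockSize, trips));
  return blocks;
}
```

Working in iteration numbers rather than induction values keeps the arithmetic the same for negative bounds and non-unit steps, which are the cases the 1-D integration test exercises.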

mlir/include/mlir/ExecutionEngine/AsyncRuntime.h

 //===- AsyncRuntime.h - Async runtime reference implementation ------------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
 //
 // This file declares basic Async runtime API for supporting Async dialect
 // to LLVM dialect lowering.
 //
 //===----------------------------------------------------------------------===//

 #ifndef MLIR_EXECUTIONENGINE_ASYNCRUNTIME_H_
 #define MLIR_EXECUTIONENGINE_ASYNCRUNTIME_H_

+#include <stdint.h>
+
 #ifdef _WIN32
 #ifndef MLIR_ASYNCRUNTIME_EXPORT
 #ifdef mlir_async_runtime_EXPORTS
 // We are building this library
 #define MLIR_ASYNCRUNTIME_EXPORT __declspec(dllexport)
 #define MLIR_ASYNCRUNTIME_DEFINE_FUNCTIONS
 #else
 // We are using this library
 #define MLIR_ASYNCRUNTIME_EXPORT __declspec(dllimport)
 #endif // mlir_async_runtime_EXPORTS
 #endif // MLIR_ASYNCRUNTIME_EXPORT
 #else
 #define MLIR_ASYNCRUNTIME_EXPORT
 #define MLIR_ASYNCRUNTIME_DEFINE_FUNCTIONS
 #endif // _WIN32

 //===----------------------------------------------------------------------===//
 // Async runtime API.
 //===----------------------------------------------------------------------===//

 // Runtime implementation of `async.token` data type.
 typedef struct AsyncToken MLIR_AsyncToken;

+// Runtime implementation of `async.group` data type.
+typedef struct AsyncGroup MLIR_AsyncGroup;
+
 // Async runtime uses LLVM coroutines to represent asynchronous tasks. Task
 // function is a coroutine handle and a resume function that continue coroutine
 // execution from a suspension point.
 using CoroHandle = void *;           // coroutine handle
 using CoroResume = void (*)(void *); // coroutine resume function

 // Create a new `async.token` in not-ready state.
 extern "C" MLIR_ASYNCRUNTIME_EXPORT AsyncToken *mlirAsyncRuntimeCreateToken();

+// Create a new `async.group` in empty state.
+extern "C" MLIR_ASYNCRUNTIME_EXPORT AsyncGroup *mlirAsyncRuntimeCreateGroup();
+
+extern "C" MLIR_ASYNCRUNTIME_EXPORT int64_t
+mlirAsyncRuntimeAddTokenToGroup(AsyncToken *, AsyncGroup *);
+
 // Switches `async.token` to ready state and runs all awaiters.
 extern "C" MLIR_ASYNCRUNTIME_EXPORT void
 mlirAsyncRuntimeEmplaceToken(AsyncToken *);

 // Blocks the caller thread until the token becomes ready.
 extern "C" MLIR_ASYNCRUNTIME_EXPORT void
 mlirAsyncRuntimeAwaitToken(AsyncToken *);

+// Blocks the caller thread until the elements in the group become ready.
+extern "C" MLIR_ASYNCRUNTIME_EXPORT void
+mlirAsyncRuntimeAwaitAllInGroup(AsyncGroup *);
+
 // Executes the task (coro handle + resume function) in one of the threads
 // managed by the runtime.
 extern "C" MLIR_ASYNCRUNTIME_EXPORT void mlirAsyncRuntimeExecute(CoroHandle,
                                                                  CoroResume);

 // Executes the task (coro handle + resume function) in one of the threads
 // managed by the runtime after the token becomes ready.
 extern "C" MLIR_ASYNCRUNTIME_EXPORT void
 mlirAsyncRuntimeAwaitTokenAndExecute(AsyncToken *, CoroHandle, CoroResume);

+// Executes the task (coro handle + resume function) in one of the threads
+// managed by the runtime after the all members of the group become ready.
+extern "C" MLIR_ASYNCRUNTIME_EXPORT void
+mlirAsyncRuntimeAwaitAllInGroupAndExecute(AsyncGroup *, CoroHandle, CoroResume);
+
 //===----------------------------------------------------------------------===//
 // Small async runtime support library for testing.
 //===----------------------------------------------------------------------===//

 extern "C" MLIR_ASYNCRUNTIME_EXPORT void mlirAsyncRuntimePrintCurrentThreadId();

 #endif // MLIR_EXECUTIONENGINE_ASYNCRUNTIME_H_
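The group API declared above (create a group, add elements and get back a rank, block until all elements are ready) could be backed by something as small as a counter guarded by a mutex. The class below is a sketch under that assumption only: `SketchGroup` and its methods are hypothetical, and the implementation in AsyncRuntime.cpp is not shown in this section and may differ.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical, simplified model of an async group.
class SketchGroup {
public:
  // Registers one more pending element and returns its rank, i.e. its
  // position in the order of addition. Ranks never change afterwards.
  int64_t add() {
    std::lock_guard<std::mutex> lock(mutex_);
    ++pending_;
    return nextRank_++;
  }

  // Marks one previously added element as ready and wakes up waiters.
  void markReady() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      --pending_;
    }
    ready_.notify_all();
  }

  // Blocks the caller until every added element has been marked ready.
  void awaitAll() {
    std::unique_lock<std::mutex> lock(mutex_);
    ready_.wait(lock, [this] { return pending_ == 0; });
  }

  // Number of elements that are not ready yet.
  int64_t pending() {
    std::lock_guard<std::mutex> lock(mutex_);
    return pending_;
  }

private:
  std::mutex mutex_;
  std::condition_variable ready_;
  int64_t pending_ = 0;
  int64_t nextRank_ = 0;
};
```

In a multi-threaded setting each task would call `markReady()` from its worker thread while the launching thread sits in `awaitAll()`; the group must outlive all of those calls.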

mlir/include/mlir/InitAllPasses.h

(10 lines not shown)
 //
 //===----------------------------------------------------------------------===//

 #ifndef MLIR_INITALLPASSES_H_
 #define MLIR_INITALLPASSES_H_

 #include "mlir/Conversion/Passes.h"
 #include "mlir/Dialect/Affine/Passes.h"
+#include "mlir/Dialect/Async/Passes.h"
 #include "mlir/Dialect/GPU/Passes.h"
 #include "mlir/Dialect/LLVMIR/Transforms/Passes.h"
 #include "mlir/Dialect/Linalg/Passes.h"
 #include "mlir/Dialect/Quant/Passes.h"
 #include "mlir/Dialect/SCF/Passes.h"
 #include "mlir/Dialect/SPIRV/Passes.h"
 #include "mlir/Dialect/Shape/Transforms/Passes.h"
 #include "mlir/Dialect/StandardOps/Transforms/Passes.h"
(15 lines not shown)
 inline void registerAllPasses() {
   // General passes
   registerTransformsPasses();

   // Conversion passes
   registerConversionPasses();

   // Dialect passes
   registerAffinePasses();
+  registerAsyncPasses();
   registerGPUPasses();
   registerLinalgPasses();
   LLVM::registerLLVMPasses();
   quant::registerQuantPasses();
   registerSCFPasses();
   registerShapePasses();
   spirv::registerSPIRVPasses();
   registerStandardPasses();
   tosa::registerTosaOptPasses();
 }

 } // namespace mlir

 #endif // MLIR_INITALLPASSES_H_

mlir/integration_test/Dialect/Async/CPU/lit.local.cfg

This file was added.

import sys

# No JIT on win32.
if sys.platform == 'win32':
    config.unsupported = True

mlir/integration_test/Dialect/Async/CPU/test-async-parallel-for-1d.mlir

This file was added.

				// RUN: mlir-opt %s -async-parallel-for \
				// RUN: -convert-async-to-llvm \
				// RUN: -convert-scf-to-std \
				// RUN: -convert-std-to-llvm \
				// RUN: | mlir-cpu-runner \
				// RUN: -e entry -entry-point-result=void -O0 \
				// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
				// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext\
				// RUN: | FileCheck %s --dump-input=always

				func @entry() {
				%c0 = constant 0.0 : f32
				%c1 = constant 1 : index
				%c2 = constant 2 : index
				%c3 = constant 3 : index

				%lb = constant 0 : index
				%ub = constant 9 : index

				%A = alloc() : memref<9xf32>
				%U = memref_cast %A : memref<9xf32> to memref<*xf32>

				// 1. %i = (0) to (9) step (1)
				scf.parallel (%i) = (%lb) to (%ub) step (%c1) {
				%0 = index_cast %i : index to i32
				%1 = sitofp %0 : i32 to f32
				store %1, %A[%i] : memref<9xf32>
				}
				// CHECK: [0, 1, 2, 3, 4, 5, 6, 7, 8]
				call @print_memref_f32(%U): (memref<*xf32>) -> ()

				scf.parallel (%i) = (%lb) to (%ub) step (%c1) {
				store %c0, %A[%i] : memref<9xf32>
				}

				// 2. %i = (0) to (9) step (2)
				scf.parallel (%i) = (%lb) to (%ub) step (%c2) {
				%0 = index_cast %i : index to i32
				%1 = sitofp %0 : i32 to f32
				store %1, %A[%i] : memref<9xf32>
				}
				// CHECK: [0, 0, 2, 0, 4, 0, 6, 0, 8]
				call @print_memref_f32(%U): (memref<*xf32>) -> ()

				scf.parallel (%i) = (%lb) to (%ub) step (%c1) {
				store %c0, %A[%i] : memref<9xf32>
				}

				// 3. %i = (-20) to (-11) step (3)
				%lb0 = constant -20 : index
				%ub0 = constant -11 : index
				scf.parallel (%i) = (%lb0) to (%ub0) step (%c3) {
				%0 = index_cast %i : index to i32
				%1 = sitofp %0 : i32 to f32
				%2 = constant 20 : index
				%3 = addi %i, %2 : index
				store %1, %A[%3] : memref<9xf32>
				}
				// CHECK: [-20, 0, 0, -17, 0, 0, -14, 0, 0]
				call @print_memref_f32(%U): (memref<*xf32>) -> ()

				dealloc %A : memref<9xf32>
				return
				}

				func @print_memref_f32(memref<*xf32>) attributes { llvm.emit_c_interface }

mlir/integration_test/Dialect/Async/CPU/test-async-parallel-for-2d.mlir

This file was added.

				// RUN: mlir-opt %s -async-parallel-for \
				// RUN: -convert-async-to-llvm \
				// RUN: -convert-scf-to-std \
				// RUN: -convert-std-to-llvm \
				// RUN: | mlir-cpu-runner \
				// RUN: -e entry -entry-point-result=void -O0 \
				// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
				// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext\
				// RUN: | FileCheck %s --dump-input=always

				func @entry() {
				%c0 = constant 0.0 : f32
				%c1 = constant 1 : index
				%c2 = constant 2 : index
				%c8 = constant 8 : index

				%lb = constant 0 : index
				%ub = constant 8 : index

				%A = alloc() : memref<8x8xf32>
				%U = memref_cast %A : memref<8x8xf32> to memref<*xf32>

				// 1. (%i, %j) = (0, 0) to (8, 8) step (1, 1)
				scf.parallel (%i, %j) = (%lb, %lb) to (%ub, %ub) step (%c1, %c1) {
				%0 = muli %i, %c8 : index
				%1 = addi %j, %0 : index
				%2 = index_cast %1 : index to i32
				%3 = sitofp %2 : i32 to f32
				store %3, %A[%i, %j] : memref<8x8xf32>
				}

				// CHECK: [0, 1, 2, 3, 4, 5, 6, 7]
				// CHECK-NEXT: [8, 9, 10, 11, 12, 13, 14, 15]
				// CHECK-NEXT: [16, 17, 18, 19, 20, 21, 22, 23]
				// CHECK-NEXT: [24, 25, 26, 27, 28, 29, 30, 31]
				// CHECK-NEXT: [32, 33, 34, 35, 36, 37, 38, 39]
				// CHECK-NEXT: [40, 41, 42, 43, 44, 45, 46, 47]
				// CHECK-NEXT: [48, 49, 50, 51, 52, 53, 54, 55]
				// CHECK-NEXT: [56, 57, 58, 59, 60, 61, 62, 63]
				call @print_memref_f32(%U): (memref<*xf32>) -> ()

				scf.parallel (%i, %j) = (%lb, %lb) to (%ub, %ub) step (%c1, %c1) {
				store %c0, %A[%i, %j] : memref<8x8xf32>
				}

				// 2. (%i, %j) = (0, 0) to (8, 8) step (2, 1)
				scf.parallel (%i, %j) = (%lb, %lb) to (%ub, %ub) step (%c2, %c1) {
				%0 = muli %i, %c8 : index
				%1 = addi %j, %0 : index
				%2 = index_cast %1 : index to i32
				%3 = sitofp %2 : i32 to f32
				store %3, %A[%i, %j] : memref<8x8xf32>
				}

				// CHECK: [0, 1, 2, 3, 4, 5, 6, 7]
				// CHECK-NEXT: [0, 0, 0, 0, 0, 0, 0, 0]
				// CHECK-NEXT: [16, 17, 18, 19, 20, 21, 22, 23]
				// CHECK-NEXT: [0, 0, 0, 0, 0, 0, 0, 0]
				// CHECK-NEXT: [32, 33, 34, 35, 36, 37, 38, 39]
				// CHECK-NEXT: [0, 0, 0, 0, 0, 0, 0, 0]
				// CHECK-NEXT: [48, 49, 50, 51, 52, 53, 54, 55]
				// CHECK-NEXT: [0, 0, 0, 0, 0, 0, 0, 0]
				call @print_memref_f32(%U): (memref<*xf32>) -> ()

				scf.parallel (%i, %j) = (%lb, %lb) to (%ub, %ub) step (%c1, %c1) {
				store %c0, %A[%i, %j] : memref<8x8xf32>
				}

				// 3. (%i, %j) = (0, 0) to (8, 8) step (1, 2)
				scf.parallel (%i, %j) = (%lb, %lb) to (%ub, %ub) step (%c1, %c2) {
				%0 = muli %i, %c8 : index
				%1 = addi %j, %0 : index
				%2 = index_cast %1 : index to i32
				%3 = sitofp %2 : i32 to f32
				store %3, %A[%i, %j] : memref<8x8xf32>
				}

				// CHECK: [0, 0, 2, 0, 4, 0, 6, 0]
				// CHECK-NEXT: [8, 0, 10, 0, 12, 0, 14, 0]
				// CHECK-NEXT: [16, 0, 18, 0, 20, 0, 22, 0]
				// CHECK-NEXT: [24, 0, 26, 0, 28, 0, 30, 0]
				// CHECK-NEXT: [32, 0, 34, 0, 36, 0, 38, 0]
				// CHECK-NEXT: [40, 0, 42, 0, 44, 0, 46, 0]
				// CHECK-NEXT: [48, 0, 50, 0, 52, 0, 54, 0]
				// CHECK-NEXT: [56, 0, 58, 0, 60, 0, 62, 0]
				call @print_memref_f32(%U): (memref<*xf32>) -> ()

				dealloc %A : memref<8x8xf32>

				return
				}

				func @print_memref_f32(memref<*xf32>) attributes { llvm.emit_c_interface }
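The CHECK lines in this test can be reproduced with a small model of a blocked 2-D loop. The sketch below visits the 8x8 iteration space block by block, where each `(bi, bj)` block stands for the work one async.execute region would receive; the blocks run sequentially here, and the block-size formula is an assumption rather than what the pass necessarily emits. `runBlocked2d` and `Matrix` are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

using Matrix = std::array<std::array<float, 8>, 8>;

// Emulates `scf.parallel (%i, %j) = (0, 0) to (8, 8) step (stepI, stepJ)`
// storing i * 8 + j at A[i, j], with the iteration space cut into blocks.
inline Matrix runBlocked2d(int64_t stepI, int64_t stepJ, int64_t blocksPerDim) {
  assert(stepI > 0 && stepJ > 0 && blocksPerDim > 0);
  Matrix a{}; // zero-initialized, like the reset loops in the test
  const int64_t ub = 8;
  const int64_t tripsI = (ub + stepI - 1) / stepI;
  const int64_t tripsJ = (ub + stepJ - 1) / stepJ;
  const int64_t sizeI = (tripsI + blocksPerDim - 1) / blocksPerDim;
  const int64_t sizeJ = (tripsJ + blocksPerDim - 1) / blocksPerDim;
  for (int64_t bi = 0; bi < tripsI; bi += sizeI) {
    for (int64_t bj = 0; bj < tripsJ; bj += sizeJ) {
      // Body of one block: iteration numbers map back to induction values.
      for (int64_t ki = bi; ki < std::min(bi + sizeI, tripsI); ++ki) {
        for (int64_t kj = bj; kj < std::min(bj + sizeJ, tripsJ); ++kj) {
          const int64_t i = ki * stepI;
          const int64_t j = kj * stepJ;
          a[i][j] = static_cast<float>(i * 8 + j);
        }
      }
    }
  }
  return a;
}
```

Because the blocks never overlap, the result does not depend on the order in which they run, which is what makes it safe to hand each block to a separate async region.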

mlir/lib/Conversion/AsyncToLLVM/AsyncToLLVM.cpp

Show All 28 Lines
// Prefix for functions outlined from `async.execute` op regions.		// Prefix for functions outlined from `async.execute` op regions.
static constexpr const char kAsyncFnPrefix[] = "async_execute_fn";		static constexpr const char kAsyncFnPrefix[] = "async_execute_fn";

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Async Runtime C API declaration.		// Async Runtime C API declaration.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

static constexpr const char *kCreateToken = "mlirAsyncRuntimeCreateToken";		static constexpr const char *kCreateToken = "mlirAsyncRuntimeCreateToken";
		static constexpr const char *kCreateGroup = "mlirAsyncRuntimeCreateGroup";
static constexpr const char *kEmplaceToken = "mlirAsyncRuntimeEmplaceToken";		static constexpr const char *kEmplaceToken = "mlirAsyncRuntimeEmplaceToken";
static constexpr const char *kAwaitToken = "mlirAsyncRuntimeAwaitToken";		static constexpr const char *kAwaitToken = "mlirAsyncRuntimeAwaitToken";
		static constexpr const char *kAwaitGroup = "mlirAsyncRuntimeAwaitAllInGroup";
static constexpr const char *kExecute = "mlirAsyncRuntimeExecute";		static constexpr const char *kExecute = "mlirAsyncRuntimeExecute";
		static constexpr const char *kAddTokenToGroup =
		"mlirAsyncRuntimeAddTokenToGroup";
static constexpr const char *kAwaitAndExecute =		static constexpr const char *kAwaitAndExecute =
"mlirAsyncRuntimeAwaitTokenAndExecute";		"mlirAsyncRuntimeAwaitTokenAndExecute";
		static constexpr const char *kAwaitAllAndExecute =
		"mlirAsyncRuntimeAwaitAllInGroupAndExecute";

namespace {		namespace {
// Async Runtime API function types.		// Async Runtime API function types.
struct AsyncAPI {		struct AsyncAPI {
static FunctionType createTokenFunctionType(MLIRContext *ctx) {		static FunctionType createTokenFunctionType(MLIRContext *ctx) {
return FunctionType::get({}, {TokenType::get(ctx)}, ctx);		return FunctionType::get({}, {TokenType::get(ctx)}, ctx);
}		}

		static FunctionType createGroupFunctionType(MLIRContext *ctx) {
		return FunctionType::get({}, {GroupType::get(ctx)}, ctx);
		}

static FunctionType emplaceTokenFunctionType(MLIRContext *ctx) {		static FunctionType emplaceTokenFunctionType(MLIRContext *ctx) {
return FunctionType::get({TokenType::get(ctx)}, {}, ctx);		return FunctionType::get({TokenType::get(ctx)}, {}, ctx);
}		}

static FunctionType awaitTokenFunctionType(MLIRContext *ctx) {		static FunctionType awaitTokenFunctionType(MLIRContext *ctx) {
return FunctionType::get({TokenType::get(ctx)}, {}, ctx);		return FunctionType::get({TokenType::get(ctx)}, {}, ctx);
}		}

		static FunctionType awaitGroupFunctionType(MLIRContext *ctx) {
		return FunctionType::get({GroupType::get(ctx)}, {}, ctx);
		}

static FunctionType executeFunctionType(MLIRContext *ctx) {		static FunctionType executeFunctionType(MLIRContext *ctx) {
auto hdl = LLVM::LLVMType::getInt8PtrTy(ctx);		auto hdl = LLVM::LLVMType::getInt8PtrTy(ctx);
auto resume = resumeFunctionType(ctx).getPointerTo();		auto resume = resumeFunctionType(ctx).getPointerTo();
return FunctionType::get({hdl, resume}, {}, ctx);		return FunctionType::get({hdl, resume}, {}, ctx);
}		}

		static FunctionType addTokenToGroupFunctionType(MLIRContext *ctx) {
		auto i64 = IntegerType::get(64, ctx);
		return FunctionType::get({TokenType::get(ctx), GroupType::get(ctx)}, {i64},
		ctx);
		}

static FunctionType awaitAndExecuteFunctionType(MLIRContext *ctx) {		static FunctionType awaitAndExecuteFunctionType(MLIRContext *ctx) {
auto hdl = LLVM::LLVMType::getInt8PtrTy(ctx);		auto hdl = LLVM::LLVMType::getInt8PtrTy(ctx);
auto resume = resumeFunctionType(ctx).getPointerTo();		auto resume = resumeFunctionType(ctx).getPointerTo();
return FunctionType::get({TokenType::get(ctx), hdl, resume}, {}, ctx);		return FunctionType::get({TokenType::get(ctx), hdl, resume}, {}, ctx);
}		}

		static FunctionType awaitAllAndExecuteFunctionType(MLIRContext *ctx) {
		auto hdl = LLVM::LLVMType::getInt8PtrTy(ctx);
		auto resume = resumeFunctionType(ctx).getPointerTo();
		return FunctionType::get({GroupType::get(ctx), hdl, resume}, {}, ctx);
		}

// Auxiliary coroutine resume intrinsic wrapper.		// Auxiliary coroutine resume intrinsic wrapper.
static LLVM::LLVMType resumeFunctionType(MLIRContext *ctx) {		static LLVM::LLVMType resumeFunctionType(MLIRContext *ctx) {
auto voidTy = LLVM::LLVMType::getVoidTy(ctx);		auto voidTy = LLVM::LLVMType::getVoidTy(ctx);
auto i8Ptr = LLVM::LLVMType::getInt8PtrTy(ctx);		auto i8Ptr = LLVM::LLVMType::getInt8PtrTy(ctx);
return LLVM::LLVMType::getFunctionTy(voidTy, {i8Ptr}, false);		return LLVM::LLVMType::getFunctionTy(voidTy, {i8Ptr}, false);
}		}
};		};
} // namespace		} // namespace

// Adds Async Runtime C API declarations to the module.		// Adds Async Runtime C API declarations to the module.
static void addAsyncRuntimeApiDeclarations(ModuleOp module) {		static void addAsyncRuntimeApiDeclarations(ModuleOp module) {
auto builder = OpBuilder::atBlockTerminator(module.getBody());		auto builder = OpBuilder::atBlockTerminator(module.getBody());

MLIRContext *ctx = module.getContext();		MLIRContext *ctx = module.getContext();
Location loc = module.getLoc();		Location loc = module.getLoc();

if (!module.lookupSymbol(kCreateToken))		if (!module.lookupSymbol(kCreateToken))
builder.create<FuncOp>(loc, kCreateToken,		builder.create<FuncOp>(loc, kCreateToken,
AsyncAPI::createTokenFunctionType(ctx));		AsyncAPI::createTokenFunctionType(ctx));

		if (!module.lookupSymbol(kCreateGroup))
		builder.create<FuncOp>(loc, kCreateGroup,
		AsyncAPI::createGroupFunctionType(ctx));

if (!module.lookupSymbol(kEmplaceToken))		if (!module.lookupSymbol(kEmplaceToken))
builder.create<FuncOp>(loc, kEmplaceToken,		builder.create<FuncOp>(loc, kEmplaceToken,
AsyncAPI::emplaceTokenFunctionType(ctx));		AsyncAPI::emplaceTokenFunctionType(ctx));

if (!module.lookupSymbol(kAwaitToken))		if (!module.lookupSymbol(kAwaitToken))
builder.create<FuncOp>(loc, kAwaitToken,		builder.create<FuncOp>(loc, kAwaitToken,
AsyncAPI::awaitTokenFunctionType(ctx));		AsyncAPI::awaitTokenFunctionType(ctx));

		if (!module.lookupSymbol(kAwaitGroup))
		builder.create<FuncOp>(loc, kAwaitGroup,
		AsyncAPI::awaitGroupFunctionType(ctx));

if (!module.lookupSymbol(kExecute))		if (!module.lookupSymbol(kExecute))
builder.create<FuncOp>(loc, kExecute, AsyncAPI::executeFunctionType(ctx));		builder.create<FuncOp>(loc, kExecute, AsyncAPI::executeFunctionType(ctx));

		if (!module.lookupSymbol(kAddTokenToGroup))
		builder.create<FuncOp>(loc, kAddTokenToGroup,
		AsyncAPI::addTokenToGroupFunctionType(ctx));

if (!module.lookupSymbol(kAwaitAndExecute))		if (!module.lookupSymbol(kAwaitAndExecute))
builder.create<FuncOp>(loc, kAwaitAndExecute,		builder.create<FuncOp>(loc, kAwaitAndExecute,
AsyncAPI::awaitAndExecuteFunctionType(ctx));		AsyncAPI::awaitAndExecuteFunctionType(ctx));

		if (!module.lookupSymbol(kAwaitAllAndExecute))
		builder.create<FuncOp>(loc, kAwaitAllAndExecute,
		AsyncAPI::awaitAllAndExecuteFunctionType(ctx));
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// LLVM coroutines intrinsics declarations.		// LLVM coroutines intrinsics declarations.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

static constexpr const char *kCoroId = "llvm.coro.id";		static constexpr const char *kCoroId = "llvm.coro.id";
static constexpr const char *kCoroSizeI64 = "llvm.coro.size.i64";		static constexpr const char *kCoroSizeI64 = "llvm.coro.size.i64";
[437 lines not shown]

namespace {		namespace {
class AsyncRuntimeTypeConverter : public TypeConverter {		class AsyncRuntimeTypeConverter : public TypeConverter {
public:		public:
AsyncRuntimeTypeConverter() { addConversion(convertType); }		AsyncRuntimeTypeConverter() { addConversion(convertType); }

static Type convertType(Type type) {		static Type convertType(Type type) {
MLIRContext *ctx = type.getContext();		MLIRContext *ctx = type.getContext();
// Convert async tokens to opaque pointers.		// Convert async tokens and groups to opaque pointers.
if (type.isa<TokenType>())		if (type.isa<TokenType, GroupType>())
return LLVM::LLVMType::getInt8PtrTy(ctx);		return LLVM::LLVMType::getInt8PtrTy(ctx);
return type;		return type;
}		}
};		};
} // namespace		} // namespace

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Convert types for all call operations to lowered async types.		// Convert types for all call operations to lowered async types.
[18 lines not shown]
		rewriter.replaceOpWithNewOp<CallOp>(op, resultTypes, call.callee(),
call.getOperands());		call.getOperands());

return success();		return success();
}		}
};		};
} // namespace		} // namespace

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// async.await op lowering to mlirAsyncRuntimeAwaitToken function call.		// async.create_group op lowering to mlirAsyncRuntimeCreateGroup function call.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

namespace {		namespace {
class AwaitOpLowering : public ConversionPattern {		class CreateGroupOpLowering : public ConversionPattern {
public:		public:
explicit AwaitOpLowering(		explicit CreateGroupOpLowering(MLIRContext *ctx)
		: ConversionPattern(CreateGroupOp::getOperationName(), 1, ctx) {}

		LogicalResult
		matchAndRewrite(Operation *op, ArrayRef<Value> operands,
		ConversionPatternRewriter &rewriter) const override {
		auto retTy = GroupType::get(op->getContext());
		rewriter.replaceOpWithNewOp<CallOp>(op, kCreateGroup, retTy);
		return success();
		}
		};
		} // namespace

		//===----------------------------------------------------------------------===//
		// async.add_to_group op lowering to runtime function call.
		//===----------------------------------------------------------------------===//

		namespace {
		class AddToGroupOpLowering : public ConversionPattern {
		public:
		explicit AddToGroupOpLowering(MLIRContext *ctx)
		: ConversionPattern(AddToGroupOp::getOperationName(), 1, ctx) {}

		LogicalResult
		matchAndRewrite(Operation *op, ArrayRef<Value> operands,
		ConversionPatternRewriter &rewriter) const override {
		// Currently we can only add tokens to the group.
		auto addToGroup = cast<AddToGroupOp>(op);
		if (!addToGroup.operand().getType().isa<TokenType>())
		return failure();

		auto i64 = IntegerType::get(64, op->getContext());
		rewriter.replaceOpWithNewOp<CallOp>(op, kAddTokenToGroup, i64, operands);
		return success();
		}
		};
		} // namespace

		//===----------------------------------------------------------------------===//
		// async.await and async.await_all op lowerings to the corresponding async
		// runtime function calls.
		//===----------------------------------------------------------------------===//

		namespace {

		template <typename AwaitType, typename AwaitableType>
		class AwaitOpLoweringBase : public ConversionPattern {
		protected:
		explicit AwaitOpLoweringBase(
MLIRContext *ctx,		MLIRContext *ctx,
const llvm::DenseMap<FuncOp, CoroMachinery> &outlinedFunctions)		const llvm::DenseMap<FuncOp, CoroMachinery> &outlinedFunctions,
: ConversionPattern(AwaitOp::getOperationName(), 1, ctx),		StringRef blockingAwaitFuncName, StringRef coroAwaitFuncName)
outlinedFunctions(outlinedFunctions) {}		: ConversionPattern(AwaitType::getOperationName(), 1, ctx),
		outlinedFunctions(outlinedFunctions),
		blockingAwaitFuncName(blockingAwaitFuncName),
		coroAwaitFuncName(coroAwaitFuncName) {}

		public:
LogicalResult		LogicalResult
matchAndRewrite(Operation *op, ArrayRef<Value> operands,		matchAndRewrite(Operation *op, ArrayRef<Value> operands,
ConversionPatternRewriter &rewriter) const override {		ConversionPatternRewriter &rewriter) const override {
// We can only await on the token operand. Async values are not supported.		// We can only await on the `AwaitableType` (for `await` it can be
auto await = cast<AwaitOp>(op);		// only a `token`, for `await_all` it is a `group`).
if (!await.operand().getType().isa<TokenType>())		auto await = cast<AwaitType>(op);
		if (!await.operand().getType().template isa<AwaitableType>())
return failure();		return failure();

// Check if `async.await` is inside the outlined coroutine function.		// Check if await operation is inside the outlined coroutine function.
auto func = await.getParentOfType<FuncOp>();		auto func = await.template getParentOfType<FuncOp>();
auto outlined = outlinedFunctions.find(func);		auto outlined = outlinedFunctions.find(func);
const bool isInCoroutine = outlined != outlinedFunctions.end();		const bool isInCoroutine = outlined != outlinedFunctions.end();

Location loc = op->getLoc();		Location loc = op->getLoc();

// Inside regular function we convert await operation to the blocking		// Inside regular function we convert await operation to the blocking
// async API await function call.		// async API await function call.
if (!isInCoroutine)		if (!isInCoroutine)
rewriter.create<CallOp>(loc, Type(), kAwaitToken,		rewriter.create<CallOp>(loc, Type(), blockingAwaitFuncName,
ValueRange(op->getOperand(0)));		ValueRange(op->getOperand(0)));

// Inside the coroutine we convert await operation into coroutine suspension		// Inside the coroutine we convert await operation into coroutine suspension
// point, and resume execution asynchronously.		// point, and resume execution asynchronously.
if (isInCoroutine) {		if (isInCoroutine) {
const CoroMachinery &coro = outlined->getSecond();		const CoroMachinery &coro = outlined->getSecond();

OpBuilder builder(op);		OpBuilder builder(op);
MLIRContext *ctx = op->getContext();		MLIRContext *ctx = op->getContext();

// A pointer to coroutine resume intrinsic wrapper.		// A pointer to coroutine resume intrinsic wrapper.
auto resumeFnTy = AsyncAPI::resumeFunctionType(ctx);		auto resumeFnTy = AsyncAPI::resumeFunctionType(ctx);
auto resumePtr = builder.create<LLVM::AddressOfOp>(		auto resumePtr = builder.create<LLVM::AddressOfOp>(
loc, resumeFnTy.getPointerTo(), kResume);		loc, resumeFnTy.getPointerTo(), kResume);

// Save the coroutine state: @llvm.coro.save		// Save the coroutine state: @llvm.coro.save
auto coroSave = builder.create<LLVM::CallOp>(		auto coroSave = builder.create<LLVM::CallOp>(
loc, LLVM::LLVMTokenType::get(ctx),		loc, LLVM::LLVMTokenType::get(ctx),
builder.getSymbolRefAttr(kCoroSave), ValueRange(coro.coroHandle));		builder.getSymbolRefAttr(kCoroSave), ValueRange(coro.coroHandle));

// Call async runtime API to resume a coroutine in the managed thread when		// Call async runtime API to resume a coroutine in the managed thread when
// the async await argument becomes ready.		// the async await argument becomes ready.
SmallVector<Value, 3> awaitAndExecuteArgs = {		SmallVector<Value, 3> awaitAndExecuteArgs = {
await.getOperand(), coro.coroHandle, resumePtr.res()};		await.getOperand(), coro.coroHandle, resumePtr.res()};
builder.create<CallOp>(loc, Type(), kAwaitAndExecute,		builder.create<CallOp>(loc, Type(), coroAwaitFuncName,
awaitAndExecuteArgs);		awaitAndExecuteArgs);

// Split the entry block before the await operation.		// Split the entry block before the await operation.
addSuspensionPoint(coro, coroSave.getResult(0), op);		addSuspensionPoint(coro, coroSave.getResult(0), op);
}		}

// Original operation was replaced by function call or suspension point.		// Original operation was replaced by function call or suspension point.
rewriter.eraseOp(op);		rewriter.eraseOp(op);

return success();		return success();
}		}

private:		private:
const llvm::DenseMap<FuncOp, CoroMachinery> &outlinedFunctions;		const llvm::DenseMap<FuncOp, CoroMachinery> &outlinedFunctions;
		StringRef blockingAwaitFuncName;
		StringRef coroAwaitFuncName;
		};

		// Lowering for `async.await` operation (only token operands are supported).
		class AwaitOpLowering : public AwaitOpLoweringBase<AwaitOp, TokenType> {
		using Base = AwaitOpLoweringBase<AwaitOp, TokenType>;

		public:
		explicit AwaitOpLowering(
		MLIRContext *ctx,
		const llvm::DenseMap<FuncOp, CoroMachinery> &outlinedFunctions)
		: Base(ctx, outlinedFunctions, kAwaitToken, kAwaitAndExecute) {}
};		};

		// Lowering for `async.await_all` operation.
		class AwaitAllOpLowering : public AwaitOpLoweringBase<AwaitAllOp, GroupType> {
		using Base = AwaitOpLoweringBase<AwaitAllOp, GroupType>;

		public:
		explicit AwaitAllOpLowering(
		MLIRContext *ctx,
		const llvm::DenseMap<FuncOp, CoroMachinery> &outlinedFunctions)
		: Base(ctx, outlinedFunctions, kAwaitGroup, kAwaitAllAndExecute) {}
		};

} // namespace		} // namespace

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

namespace {		namespace {
struct ConvertAsyncToLLVMPass		struct ConvertAsyncToLLVMPass
: public ConvertAsyncToLLVMBase<ConvertAsyncToLLVMPass> {		: public ConvertAsyncToLLVMBase<ConvertAsyncToLLVMPass> {
void runOnOperation() override;		void runOnOperation() override;
[40 lines not shown]
		void ConvertAsyncToLLVMPass::runOnOperation() {
MLIRContext *ctx = &getContext();		MLIRContext *ctx = &getContext();

// Convert async dialect types and operations to LLVM dialect.		// Convert async dialect types and operations to LLVM dialect.
AsyncRuntimeTypeConverter converter;		AsyncRuntimeTypeConverter converter;
OwningRewritePatternList patterns;		OwningRewritePatternList patterns;

populateFuncOpTypeConversionPattern(patterns, ctx, converter);		populateFuncOpTypeConversionPattern(patterns, ctx, converter);
patterns.insert<CallOpOpConversion>(ctx);		patterns.insert<CallOpOpConversion>(ctx);
patterns.insert<AwaitOpLowering>(ctx, outlinedFunctions);		patterns.insert<CreateGroupOpLowering, AddToGroupOpLowering>(ctx);
		patterns.insert<AwaitOpLowering, AwaitAllOpLowering>(ctx, outlinedFunctions);
		mehdi_amini: I think this is a variadic template, for convenience.

ConversionTarget target(*ctx);		ConversionTarget target(*ctx);
target.addLegalDialect<LLVM::LLVMDialect>();		target.addLegalDialect<LLVM::LLVMDialect>();
target.addIllegalDialect<AsyncDialect>();		target.addIllegalDialect<AsyncDialect>();
target.addDynamicallyLegalOp<FuncOp>(		target.addDynamicallyLegalOp<FuncOp>(
[&](FuncOp op) { return converter.isSignatureLegal(op.getType()); });		[&](FuncOp op) { return converter.isSignatureLegal(op.getType()); });
target.addDynamicallyLegalOp<CallOp>(		target.addDynamicallyLegalOp<CallOp>(
[&](CallOp op) { return converter.isLegal(op.getResultTypes()); });		[&](CallOp op) { return converter.isLegal(op.getResultTypes()); });
[9 lines not shown]

mlir/lib/Dialect/Async/CMakeLists.txt

	add_subdirectory(IR)			add_subdirectory(IR)
				add_subdirectory(Transforms)

mlir/lib/Dialect/Async/IR/Async.cpp

[15 lines not shown]

void AsyncDialect::initialize() {		void AsyncDialect::initialize() {
addOperations<		addOperations<
#define GET_OP_LIST		#define GET_OP_LIST
#include "mlir/Dialect/Async/IR/AsyncOps.cpp.inc"		#include "mlir/Dialect/Async/IR/AsyncOps.cpp.inc"
>();		>();
addTypes<TokenType>();		addTypes<TokenType>();
addTypes<ValueType>();		addTypes<ValueType>();
		addTypes<GroupType>();
}		}

/// Parse a type registered to this dialect.		/// Parse a type registered to this dialect.
Type AsyncDialect::parseType(DialectAsmParser &parser) const {		Type AsyncDialect::parseType(DialectAsmParser &parser) const {
StringRef keyword;		StringRef keyword;
if (parser.parseKeyword(&keyword))		if (parser.parseKeyword(&keyword))
return Type();		return Type();

[17 lines not shown]
void AsyncDialect::printType(Type type, DialectAsmPrinter &os) const {		void AsyncDialect::printType(Type type, DialectAsmPrinter &os) const {
TypeSwitch<Type>(type)		TypeSwitch<Type>(type)
.Case<TokenType>([&](TokenType) { os << "token"; })		.Case<TokenType>([&](TokenType) { os << "token"; })
.Case<ValueType>([&](ValueType valueTy) {		.Case<ValueType>([&](ValueType valueTy) {
os << "value<";		os << "value<";
os.printType(valueTy.getValueType());		os.printType(valueTy.getValueType());
os << '>';		os << '>';
})		})
		.Case<GroupType>([&](GroupType) { os << "group"; })
.Default([](Type) { llvm_unreachable("unexpected 'async' type kind"); });		.Default([](Type) { llvm_unreachable("unexpected 'async' type kind"); });
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
/// ValueType		/// ValueType
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

namespace mlir {		namespace mlir {
[69 lines not shown]
		if (index.hasValue()) {
regions.push_back(RegionSuccessor(getResults()));		regions.push_back(RegionSuccessor(getResults()));
return;		return;
}		}

// Otherwise the successor is the body region.		// Otherwise the successor is the body region.
regions.push_back(RegionSuccessor(&body()));		regions.push_back(RegionSuccessor(&body()));
}		}

		void ExecuteOp::build(OpBuilder &builder, OperationState &result,
		TypeRange resultTypes, ValueRange dependencies,
		ValueRange operands, BodyBuilderFn bodyBuilder) {

		result.addOperands(dependencies);
		result.addOperands(operands);

		ftynse: I am pretty sure something in block argument lists does not expect a null type (that's what the default constructor gives you) and will eventually crash. Do we have a test that covers the case of `valueType` not being a `ValueType`?
		ezhulenev (author): Changed null type to `operand.getType()`, so it can safely fail in verifier.
		// Add derived `operand_segment_sizes` attribute based on parsed operands.
		int32_t numDependencies = dependencies.size();
		int32_t numOperands = operands.size();
		auto operandSegmentSizes = DenseIntElementsAttr::get(
		VectorType::get({2}, IntegerType::get(32, result.getContext())),
		{numDependencies, numOperands});
		result.addAttribute(kOperandSegmentSizesAttr, operandSegmentSizes);

		// First result is always a token, and then `resultTypes` wrapped into
		// `async.value`.
		result.addTypes({TokenType::get(result.getContext())});
		for (Type type : resultTypes)
		result.addTypes(ValueType::get(type));

		// Add a body region with block arguments as unwrapped async value operands.
		Region *bodyRegion = result.addRegion();
		bodyRegion->push_back(new Block);
		Block &bodyBlock = bodyRegion->front();
		for (Value operand : operands) {
		auto valueType = operand.getType().dyn_cast<ValueType>();
		bodyBlock.addArgument(valueType ? valueType.getValueType()
		: operand.getType());
		}

		// Create the default terminator if the builder is not provided and if the
		// expected result is empty. Otherwise, leave this to the caller
		// because we don't know which values to return from the execute op.
		if (resultTypes.empty() && !bodyBuilder) {
		OpBuilder::InsertionGuard guard(builder);
		builder.setInsertionPointToStart(&bodyBlock);
		builder.create<async::YieldOp>(result.location, ValueRange());
		} else if (bodyBuilder) {
		OpBuilder::InsertionGuard guard(builder);
		builder.setInsertionPointToStart(&bodyBlock);
		bodyBuilder(builder, result.location, bodyBlock.getArguments());
		}
		}

static void print(OpAsmPrinter &p, ExecuteOp op) {		static void print(OpAsmPrinter &p, ExecuteOp op) {
p << op.getOperationName();		p << op.getOperationName();

// [%tokens,...]		// [%tokens,...]
if (!op.dependencies().empty())		if (!op.dependencies().empty())
p << " [" << op.dependencies() << "]";		p << " [" << op.dependencies() << "]";

// (%value as %unwrapped: !async.value<!arg.type>, ...)		// (%value as %unwrapped: !async.value<!arg.type>, ...)
[170 lines not shown]

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp

This file was added.

				//===- AsyncParallelFor.cpp - Implementation of Async Parallel For --------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This file implements scf.parallel to scf.for + async.execute conversion pass.
				//
				//===----------------------------------------------------------------------===//

				#include "PassDetail.h"
				#include "mlir/Dialect/Async/IR/Async.h"
				#include "mlir/Dialect/Async/Passes.h"
				#include "mlir/Dialect/SCF/SCF.h"
				#include "mlir/Dialect/StandardOps/IR/Ops.h"
				#include "mlir/IR/BlockAndValueMapping.h"
				#include "mlir/IR/PatternMatch.h"
				#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

				using namespace mlir;
				using namespace mlir::async;

				#define DEBUG_TYPE "async-parallel-for"

				namespace {

				// Rewrite scf.parallel operation into multiple concurrent async.execute
				// operations over non overlapping subranges of the original loop.
				//
				// Example:
				//
				// scf.parallel (%i, %j) = (%lbi, %lbj) to (%ubi, %ubj) step (%si, %sj) {
				// "do_some_compute"(%i, %j): () -> ()
				// }
				//
				// Converted to:
				//
				// %c0 = constant 0 : index
				// %c1 = constant 1 : index
				//
				// // Compute block sizes for each induction variable.
				// %num_blocks_i = ... : index
				// %num_blocks_j = ... : index
				// %block_size_i = ... : index
				// %block_size_j = ... : index
				//
				// // Create an async group to track async execute ops.
				// %group = async.create_group
				//
				// scf.for %bi = %c0 to %num_blocks_i step %c1 {
				// %block_start_i = ... : index
				// %block_end_i = ... : index
				//
				// scf.for %bj = %c0 to %num_blocks_j step %c1 {
				// %block_start_j = ... : index
				mehdi_amini: (typo)
				// %block_end_j = ... : index
				//
				// // Execute the body of original parallel operation for the current
				// // block.
				// %token = async.execute {
				// scf.for %i = %block_start_i to %block_end_i step %si {
				// scf.for %j = %block_start_j to %block_end_j step %sj {
				// "do_some_compute"(%i, %j): () -> ()
				// }
				// }
				// }
				//
				// // Add produced async token to the group.
				// async.add_to_group %token, %group
				// }
				// }
				//
				// // Await completion of all async.execute operations.
				// async.await_all %group
				//
				// In this example outer loop launches inner block level loops as separate async
				// execute operations which will be executed concurrently.
				//
				// At the end it waits for the completion of all async execute operations.
				//
				struct AsyncParallelForRewrite : public OpRewritePattern<scf::ParallelOp> {
				public:
				AsyncParallelForRewrite(MLIRContext *ctx, int numConcurrentAsyncExecute)
				: OpRewritePattern(ctx),
				numConcurrentAsyncExecute(numConcurrentAsyncExecute) {}

				LogicalResult matchAndRewrite(scf::ParallelOp op,
				PatternRewriter &rewriter) const override;

				private:
				int numConcurrentAsyncExecute;
				};

				struct AsyncParallelForPass
				: public AsyncParallelForBase<AsyncParallelForPass> {
				AsyncParallelForPass() = default;
				void runOnFunction() override;
				};

				} // namespace

				LogicalResult
				AsyncParallelForRewrite::matchAndRewrite(scf::ParallelOp op,
				PatternRewriter &rewriter) const {
				// We do not currently support rewrite for parallel op with reductions.
				if (op.getNumReductions() != 0)
				return failure();

				MLIRContext *ctx = op.getContext();
				Location loc = op.getLoc();

				// Index constants used below.
				auto indexTy = IndexType::get(ctx);
				auto zero = IntegerAttr::get(indexTy, 0);
				auto one = IntegerAttr::get(indexTy, 1);
				auto c0 = rewriter.create<ConstantOp>(loc, indexTy, zero);
				auto c1 = rewriter.create<ConstantOp>(loc, indexTy, one);

				// Shorthand for signed integer ceil division operation.
				auto divup = [&](Value x, Value y) -> Value {
				ftynse: This wouldn't work correctly for negative numbers. There is a pending diff adding ceildiv operation to std, I'd just target that when it lands.
				ftynse: https://reviews.llvm.org/D89726 has landed, we now have `ceildiv` that expands to a longer sequence of operations that supports negative values. I expect the canonicalizer to remove the unnecessary ones if it knows the sign of all operands.
				return rewriter.create<SignedCeilDivIOp>(loc, x, y);
				};

				// Compute trip count for each loop induction variable:
				// tripCount = divUp(upperBound - lowerBound, step);
				SmallVector<Value, 4> tripCounts(op.getNumLoops());
				for (size_t i = 0; i < op.getNumLoops(); ++i) {
				auto lb = op.lowerBound()[i];
				auto ub = op.upperBound()[i];
				auto step = op.step()[i];
				auto range = rewriter.create<SubIOp>(loc, ub, lb);
				tripCounts[i] = divup(range, step);
				}

				// The target number of concurrent async.execute ops.
				auto numExecuteOps = rewriter.create<ConstantOp>(
				loc, indexTy, IntegerAttr::get(indexTy, numConcurrentAsyncExecute));

				// Block size configuration for each induction variable.

				// We try to use the maximum available concurrency in the outer dimensions
				// first, assuming that the parallel induction variables correspond to some
				// multidimensional access. E.g. in (%d0, %d1, ..., %dn) = (<from>) to (<to>)
				// we will try to parallelize iteration along %d0. If %d0 is too small,
				// we'll parallelize iteration over %d1, and so on.
				SmallVector<Value, 4> targetNumBlocks(op.getNumLoops());
				SmallVector<Value, 4> blockSize(op.getNumLoops());
				SmallVector<Value, 4> numBlocks(op.getNumLoops());

				// Compute block size and number of blocks along the first induction variable.
				targetNumBlocks[0] = numExecuteOps;
				blockSize[0] = divup(tripCounts[0], targetNumBlocks[0]);
				numBlocks[0] = divup(tripCounts[0], blockSize[0]);

				// Assign remaining available concurrency to other induction variables.
				for (size_t i = 1; i < op.getNumLoops(); ++i) {
				targetNumBlocks[i] = divup(targetNumBlocks[i - 1], numBlocks[i - 1]);
				blockSize[i] = divup(tripCounts[i], targetNumBlocks[i]);
				numBlocks[i] = divup(tripCounts[i], blockSize[i]);
				}

				// Create an async.group to wait on all async tokens from async execute ops.
				auto group = rewriter.create<CreateGroupOp>(loc, GroupType::get(ctx));

				// Build a scf.for loop nest from the parallel operation.

				// Lower/upper bounds for nested block-level computations.
				SmallVector<Value, 4> blockLowerBounds(op.getNumLoops());
				SmallVector<Value, 4> blockUpperBounds(op.getNumLoops());
				SmallVector<Value, 4> blockInductionVars(op.getNumLoops());

				using LoopBodyBuilder =
				std::function<void(OpBuilder &, Location, Value, ValueRange)>;
				using LoopBuilder = std::function<LoopBodyBuilder(size_t loopIdx)>;

				// Builds inner loop nest inside async.execute operation that does all the
				// work concurrently.
				LoopBuilder workLoopBuilder = [&](size_t loopIdx) -> LoopBodyBuilder {
				return [&, loopIdx](OpBuilder &b, Location loc, Value iv, ValueRange args) {
				blockInductionVars[loopIdx] = iv;

				// Continue building async loop nest.
				if (loopIdx < op.getNumLoops() - 1) {
				b.create<scf::ForOp>(
				loc, blockLowerBounds[loopIdx + 1], blockUpperBounds[loopIdx + 1],
				op.step()[loopIdx + 1], ValueRange(), workLoopBuilder(loopIdx + 1));
				b.create<scf::YieldOp>(loc);
				return;
				}

				// Copy the body of the parallel op with new loop bounds.
				BlockAndValueMapping mapping;
				mapping.map(op.getInductionVars(), blockInductionVars);

				for (auto &bodyOp : op.getLoopBody().getOps())
				b.clone(bodyOp, mapping);
				};
				};

				// Builds a loop nest that does async execute op dispatching.
				LoopBuilder asyncLoopBuilder = [&](size_t loopIdx) -> LoopBodyBuilder {
				return [&, loopIdx](OpBuilder &b, Location loc, Value iv, ValueRange args) {
				auto lb = op.lowerBound()[loopIdx];
				auto ub = op.upperBound()[loopIdx];
				auto step = op.step()[loopIdx];

				// Compute lower bound for the current block:
				// blockLowerBound = iv * blockSize * step + lowerBound
				auto s0 = b.create<MulIOp>(loc, iv, blockSize[loopIdx]);
				auto s1 = b.create<MulIOp>(loc, s0, step);
				auto s2 = b.create<AddIOp>(loc, s1, lb);
				blockLowerBounds[loopIdx] = s2;

				// Compute upper bound for the current block:
				// blockUpperBound = min(upperBound,
				// blockLowerBound + blockSize * step)
				auto e0 = b.create<MulIOp>(loc, blockSize[loopIdx], step);
				auto e1 = b.create<AddIOp>(loc, e0, s2);
				auto e2 = b.create<CmpIOp>(loc, CmpIPredicate::slt, e1, ub);
				auto e3 = b.create<SelectOp>(loc, e2, e1, ub);
				blockUpperBounds[loopIdx] = e3;

				// Continue building async dispatch loop nest.
				if (loopIdx < op.getNumLoops() - 1) {
				b.create<scf::ForOp>(loc, c0, numBlocks[loopIdx + 1], c1, ValueRange(),
				asyncLoopBuilder(loopIdx + 1));
				b.create<scf::YieldOp>(loc);
				return;
				}

				// Build the inner loop nest that will do the actual work inside the
				// `async.execute` body region.
				auto executeBodyBuilder = [&](OpBuilder &executeBuilder,
				Location executeLoc,
				ValueRange executeArgs) {
				executeBuilder.create<scf::ForOp>(executeLoc, blockLowerBounds[0],
				blockUpperBounds[0], op.step()[0],
				ValueRange(), workLoopBuilder(0));
				executeBuilder.create<async::YieldOp>(executeLoc, ValueRange());
				};

				auto execute = b.create<ExecuteOp>(
				loc, /*resultTypes=*/TypeRange(), /*dependencies=*/ValueRange(),
				/*operands=*/ValueRange(), executeBodyBuilder);
				auto rankType = IndexType::get(ctx);
				b.create<AddToGroupOp>(loc, rankType, execute.token(), group.result());
				b.create<scf::YieldOp>(loc);
				};
				};

				// Start building a loop nest from the first induction variable.
				rewriter.create<scf::ForOp>(loc, c0, numBlocks[0], c1, ValueRange(),
				asyncLoopBuilder(0));

				// Wait for the completion of all subtasks.
				rewriter.create<AwaitAllOp>(loc, group.result());

				// Erase the original parallel operation.
				rewriter.eraseOp(op);

				return success();
				}

				void AsyncParallelForPass::runOnFunction() {
				MLIRContext *ctx = &getContext();

				OwningRewritePatternList patterns;
				patterns.insert<AsyncParallelForRewrite>(ctx, numConcurrentAsyncExecute);

				if (failed(applyPatternsAndFoldGreedily(getFunction(), std::move(patterns))))
				signalPassFailure();
				}

				std::unique_ptr<OperationPass<FuncOp>> mlir::createAsyncParallelForPass() {
				return std::make_unique<AsyncParallelForPass>();
				}
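
Reader's note: the block-size recurrences in `matchAndRewrite` above (`targetNumBlocks`, `blockSize`, `numBlocks`) can be easier to follow on plain integers. The following is a minimal standalone sketch, not part of the diff. The names `ceilDiv`, `BlockConfig` and `computeBlocks` are hypothetical, and the sketch assumes every trip count and the concurrency target are at least 1, whereas the pass emits `SignedCeilDivIOp` on runtime values.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: ceil division restricted to x >= 0 and y > 0.
// The pass itself emits SignedCeilDivIOp, which also covers negative values.
static int64_t ceilDiv(int64_t x, int64_t y) {
  assert(x >= 0 && y > 0 && "sketch only handles x >= 0 and y > 0");
  return (x + y - 1) / y;
}

// Hypothetical result type: block size and block count for one loop dimension.
struct BlockConfig {
  int64_t blockSize;
  int64_t numBlocks;
};

// Mirrors the recurrences from the diff on plain integers:
//   blockSize[i]       = ceilDiv(tripCounts[i], targetNumBlocks[i])
//   numBlocks[i]       = ceilDiv(tripCounts[i], blockSize[i])
//   targetNumBlocks[i] = ceilDiv(targetNumBlocks[i-1], numBlocks[i-1])
// with targetNumBlocks[0] = numConcurrentAsyncExecute.
static std::vector<BlockConfig>
computeBlocks(const std::vector<int64_t> &tripCounts,
              int64_t numConcurrentAsyncExecute) {
  std::vector<BlockConfig> result;
  int64_t target = numConcurrentAsyncExecute;
  for (int64_t tripCount : tripCounts) {
    assert(tripCount >= 1 && "sketch assumes non-empty loops");
    int64_t blockSize = ceilDiv(tripCount, target);
    int64_t numBlocks = ceilDiv(tripCount, blockSize);
    result.push_back({blockSize, numBlocks});
    // Concurrency not used by this dimension is handed to the next one.
    target = ceilDiv(target, numBlocks);
  }
  return result;
}
```

With trip counts `{2, 100}` and a target of 8 concurrent execute ops, the outer dimension can only supply 2 blocks, so the remaining factor of 4 goes to the inner dimension, which is split into 4 blocks of 25 iterations.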

mlir/lib/Dialect/Async/Transforms/CMakeLists.txt

This file was added.

				add_mlir_dialect_library(MLIRAsyncTransforms
				AsyncParallelFor.cpp

				ADDITIONAL_HEADER_DIRS
				${MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/Async

				DEPENDS
				MLIRAsyncPassIncGen

				LINK_LIBS PUBLIC
				MLIRIR
				MLIRAsync
				MLIRSCF
				MLIRPass
				MLIRTransforms
				MLIRTransformUtils
				)

mlir/lib/Dialect/Async/Transforms/PassDetail.h

This file was added.

				//===- PassDetail.h - Async Pass class details ------------------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef DIALECT_ASYNC_TRANSFORMS_PASSDETAIL_H_
				#define DIALECT_ASYNC_TRANSFORMS_PASSDETAIL_H_

				#include "mlir/IR/Dialect.h"
				#include "mlir/Pass/Pass.h"

				namespace mlir {

				namespace async {
				class AsyncDialect;
				} // namespace async

				namespace scf {
				class SCFDialect;
				} // namespace scf

				#define GEN_PASS_CLASSES
				#include "mlir/Dialect/Async/Passes.h.inc"

				} // namespace mlir

				#endif // DIALECT_ASYNC_TRANSFORMS_PASSDETAIL_H_

mlir/lib/ExecutionEngine/AsyncRuntime.cpp

	[9 lines not shown]
	// to LLVM dialect lowering.			// to LLVM dialect lowering.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "mlir/ExecutionEngine/AsyncRuntime.h"			#include "mlir/ExecutionEngine/AsyncRuntime.h"

	#ifdef MLIR_ASYNCRUNTIME_DEFINE_FUNCTIONS			#ifdef MLIR_ASYNCRUNTIME_DEFINE_FUNCTIONS

				#include <atomic>
	#include <condition_variable>			#include <condition_variable>
	#include <functional>			#include <functional>
	#include <iostream>			#include <iostream>
	#include <mutex>			#include <mutex>
	#include <thread>			#include <thread>
	#include <vector>			#include <vector>

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// Async runtime API.			// Async runtime API.
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	struct AsyncToken {			struct AsyncToken {
	bool ready = false;			bool ready = false;
	std::mutex mu;			std::mutex mu;
	std::condition_variable cv;			std::condition_variable cv;
	std::vector<std::function<void()>> awaiters;			std::vector<std::function<void()>> awaiters;
	};			};

				struct AsyncGroup {
				std::atomic<int> pendingTokens{0};
				std::atomic<int> rank{0};
				std::mutex mu;
				std::condition_variable cv;
				std::vector<std::function<void()>> awaiters;
				};

	// Create a new `async.token` in not-ready state.			// Create a new `async.token` in not-ready state.
	extern "C" AsyncToken *mlirAsyncRuntimeCreateToken() {			extern "C" AsyncToken *mlirAsyncRuntimeCreateToken() {
	AsyncToken *token = new AsyncToken;			AsyncToken *token = new AsyncToken;
	return token;			return token;
	}			}

				// Create a new `async.group` in empty state.
				extern "C" MLIR_ASYNCRUNTIME_EXPORT AsyncGroup *mlirAsyncRuntimeCreateGroup() {
				AsyncGroup *group = new AsyncGroup;
				mehdi_amini (inline comment, Not Done): I think that's what I was talking about above: this seems to be leaking at the moment?
				ezhulenev (author, Done): Yes, right now the runtime does not do any lifetime management of created tokens/values. If we agree on the "runtime owns them all" approach, I'll port it from the TFRT runtime implementation.
				return group;
				}
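
The inline thread above leaves ownership of tokens and groups open. As a point of comparison only, here is a minimal sketch of one alternative, intrusive reference counting, in which each party that holds a token drops its own reference and the last drop frees the object. `RefCounted` and `SketchToken` are hypothetical names introduced for illustration; this is neither the "runtime owns them all" approach mentioned in the reply nor the TFRT implementation.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical base class for illustration; not part of this patch.
class RefCounted {
public:
  explicit RefCounted(int initial = 1) : refs(initial) {}
  virtual ~RefCounted() = default;

  // Takes `n` additional references.
  void addRef(int n = 1) { refs.fetch_add(n, std::memory_order_relaxed); }

  // Releases `n` references. Returns true if this call destroyed the object.
  bool dropRef(int n = 1) {
    int prev = refs.fetch_sub(n, std::memory_order_acq_rel);
    assert(prev >= n && "dropping more references than are held");
    if (prev == n) {
      delete this;
      return true;
    }
    return false;
  }

private:
  std::atomic<int> refs;
};

// Hypothetical token: starts with two references, one for the producer that
// marks it ready and one for the consumer that awaits it.
struct SketchToken : RefCounted {
  SketchToken() : RefCounted(2) {}
  bool ready = false;
};
```

With this scheme neither the await path nor the emplace path needs to know which of them runs last; each simply calls `dropRef()` when it is finished with the token.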

				extern "C" MLIR_ASYNCRUNTIME_EXPORT int64_t
				mlirAsyncRuntimeAddTokenToGroup(AsyncToken *token, AsyncGroup *group) {
				std::unique_lock<std::mutex> lockToken(token->mu);
				std::unique_lock<std::mutex> lockGroup(group->mu);

				group->pendingTokens.fetch_add(1);

				auto onTokenReady = [group]() {
				// Run all group awaiters if it was the last token in the group.
				if (group->pendingTokens.fetch_sub(1) == 1) {
				group->cv.notify_all();
				for (auto &awaiter : group->awaiters)
				awaiter();
				}
				};

				if (token->ready)
				onTokenReady();
				else
				token->awaiters.push_back([onTokenReady]() { onTokenReady(); });

				return group->rank.fetch_add(1);
				}

	// Switches `async.token` to ready state and runs all awaiters.			// Switches `async.token` to ready state and runs all awaiters.
	extern "C" void mlirAsyncRuntimeEmplaceToken(AsyncToken *token) {			extern "C" void mlirAsyncRuntimeEmplaceToken(AsyncToken *token) {
	std::unique_lock<std::mutex> lock(token->mu);			std::unique_lock<std::mutex> lock(token->mu);
	token->ready = true;			token->ready = true;
	token->cv.notify_all();			token->cv.notify_all();
	for (auto &awaiter : token->awaiters)			for (auto &awaiter : token->awaiters)
	awaiter();			awaiter();
	}			}

	extern "C" void mlirAsyncRuntimeAwaitToken(AsyncToken *token) {			extern "C" void mlirAsyncRuntimeAwaitToken(AsyncToken *token) {
	std::unique_lock<std::mutex> lock(token->mu);			std::unique_lock<std::mutex> lock(token->mu);
	if (!token->ready)			if (!token->ready)
	token->cv.wait(lock, [token] { return token->ready; });			token->cv.wait(lock, [token] { return token->ready; });
	delete token;			}
	mehdi_amini (inline comment, Not Done): Do we also have a leaking issue with tokens?

				extern "C" MLIR_ASYNCRUNTIME_EXPORT void
				mlirAsyncRuntimeAwaitAllInGroup(AsyncGroup *group) {
				std::unique_lock<std::mutex> lock(group->mu);
				if (group->pendingTokens != 0)
				group->cv.wait(lock, [group] { return group->pendingTokens == 0; });
	}			}
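
The group functions in this file implement a countdown: adding a token increments a pending counter, each ready token decrements it, and the last decrement wakes blocked waiters and runs deferred callbacks. The following is a self-contained sketch of that pattern, not the patch's code. `SketchGroup` and `runGroupDemo` are hypothetical names. One deliberate difference, stated as an assumption: the counter here is a plain `int` that is only touched under the mutex, because the version in the diff appears to decrement and notify without holding `group->mu` when the callback fires from a token awaiter, which may permit a missed wakeup in `mlirAsyncRuntimeAwaitAllInGroup`.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the runtime's group type; not the patch's struct.
class SketchGroup {
public:
  // Registers one more pending member and returns its rank in the group.
  int64_t add() {
    std::lock_guard<std::mutex> lock(mu);
    ++pending;
    return rank++;
  }

  // Marks one member as ready. The last member to become ready wakes blocked
  // waiters and runs the deferred callbacks outside the lock.
  void memberReady() {
    std::vector<std::function<void()>> toRun;
    {
      std::lock_guard<std::mutex> lock(mu);
      assert(pending > 0 && "memberReady called more often than add");
      if (--pending != 0)
        return;
      toRun.swap(awaiters);
    }
    cv.notify_all();
    for (auto &fn : toRun)
      fn();
  }

  // Blocks the caller until no members are pending.
  void awaitAll() {
    std::unique_lock<std::mutex> lock(mu);
    cv.wait(lock, [this] { return pending == 0; });
  }

  // Runs `fn` immediately if nothing is pending, otherwise defers it until
  // the last pending member becomes ready.
  void awaitAllAndExecute(std::function<void()> fn) {
    {
      std::lock_guard<std::mutex> lock(mu);
      if (pending != 0) {
        awaiters.push_back(std::move(fn));
        return;
      }
    }
    fn();
  }

private:
  std::mutex mu;
  std::condition_variable cv;
  int pending = 0;
  int64_t rank = 0;
  std::vector<std::function<void()>> awaiters;
};

// Adds `n` members, completes each on its own thread, waits for all of them,
// and returns how many completions were observed after the wait.
inline int runGroupDemo(int n) {
  SketchGroup group;
  std::atomic<int> completed{0};
  std::vector<std::thread> workers;
  for (int i = 0; i < n; ++i) {
    group.add();
    workers.emplace_back([&] {
      ++completed;
      group.memberReady();
    });
  }
  group.awaitAll();
  int seen = completed.load();
  for (auto &t : workers)
    t.join();
  return seen;
}
```

As in the patch, the counter can reach zero between two `add()` calls if an earlier member finishes quickly, so callers should finish adding members before waiting on the group.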

	extern "C" void mlirAsyncRuntimeExecute(CoroHandle handle, CoroResume resume) {			extern "C" void mlirAsyncRuntimeExecute(CoroHandle handle, CoroResume resume) {
	#if LLVM_ENABLE_THREADS			#if LLVM_ENABLE_THREADS
	std::thread thread([handle, resume]() { (*resume)(handle); });			std::thread thread([handle, resume]() { (*resume)(handle); });
	thread.detach();			thread.detach();
	#else			#else
	(*resume)(handle);			(*resume)(handle);
	#endif			#endif
	}			}

	extern "C" void mlirAsyncRuntimeAwaitTokenAndExecute(AsyncToken *token,			extern "C" void mlirAsyncRuntimeAwaitTokenAndExecute(AsyncToken *token,
	CoroHandle handle,			CoroHandle handle,
	CoroResume resume) {			CoroResume resume) {
	std::unique_lock<std::mutex> lock(token->mu);			std::unique_lock<std::mutex> lock(token->mu);

	auto execute = [token, handle, resume]() {			auto execute = [handle, resume]() {
	mlirAsyncRuntimeExecute(handle, resume);			mlirAsyncRuntimeExecute(handle, resume);
	delete token;
	};			};

	if (token->ready)			if (token->ready)
	execute();			execute();
	else			else
	token->awaiters.push_back([execute]() { execute(); });			token->awaiters.push_back([execute]() { execute(); });
	}			}

				extern "C" MLIR_ASYNCRUNTIME_EXPORT void
				mlirAsyncRuntimeAwaitAllInGroupAndExecute(AsyncGroup *group, CoroHandle handle,
				CoroResume resume) {
				std::unique_lock<std::mutex> lock(group->mu);

				auto execute = [handle, resume]() {
				mlirAsyncRuntimeExecute(handle, resume);
				};

				if (group->pendingTokens == 0)
				execute();
				else
				group->awaiters.push_back([execute]() { execute(); });
				}

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// Small async runtime support library for testing.			// Small async runtime support library for testing.
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	extern "C" void mlirAsyncRuntimePrintCurrentThreadId() {			extern "C" void mlirAsyncRuntimePrintCurrentThreadId() {
	static thread_local std::thread::id thisId = std::this_thread::get_id();			static thread_local std::thread::id thisId = std::this_thread::get_id();
	std::cout << "Current thread id: " << thisId << "\n";			std::cout << "Current thread id: " << thisId << "\n";
	}			}

	#endif // MLIR_ASYNCRUNTIME_DEFINE_FUNCTIONS			#endif // MLIR_ASYNCRUNTIME_DEFINE_FUNCTIONS

mlir/test/Conversion/AsyncToLLVM/convert-to-llvm.mlir

	// CHECK: llvm.call @llvm.coro.save			// CHECK: llvm.call @llvm.coro.save
	// CHECK: call @mlirAsyncRuntimeAwaitTokenAndExecute(%arg0, %[[HDL_1]],			// CHECK: call @mlirAsyncRuntimeAwaitTokenAndExecute(%arg0, %[[HDL_1]],
	// CHECK: llvm.call @llvm.coro.suspend			// CHECK: llvm.call @llvm.coro.suspend

	// Emplace result token after second resumption.			// Emplace result token after second resumption.
	// CHECK: store %arg1, %arg2[%c0] : memref<1xf32>			// CHECK: store %arg1, %arg2[%c0] : memref<1xf32>
	// CHECK: call @mlirAsyncRuntimeEmplaceToken(%[[RET_1]])			// CHECK: call @mlirAsyncRuntimeEmplaceToken(%[[RET_1]])

				// -----

				// CHECK-LABEL: async_group_await_all
				func @async_group_await_all(%arg0: f32, %arg1: memref<1xf32>) {
				// CHECK: %0 = call @mlirAsyncRuntimeCreateGroup()
				%0 = async.create_group

				// CHECK: %[[TOKEN:.*]] = call @async_execute_fn
				%token = async.execute { async.yield }
				// CHECK: call @mlirAsyncRuntimeAddTokenToGroup(%[[TOKEN]], %0)
				async.add_to_group %token, %0 : !async.token

				// CHECK: call @async_execute_fn_0
				async.execute {
				async.await_all %0
				async.yield
				}

				// CHECK: call @mlirAsyncRuntimeAwaitAllInGroup(%0)
				async.await_all %0

				return
				}

				// Function outlined from the async.execute operation.
				// CHECK: func private @async_execute_fn_0(%arg0: !llvm.ptr<i8>)
				// CHECK: %[[RET_1:.*]] = call @mlirAsyncRuntimeCreateToken()
				// CHECK: %[[HDL_1:.*]] = llvm.call @llvm.coro.begin

				// Suspend coroutine in the beginning.
				// CHECK: call @mlirAsyncRuntimeExecute(%[[HDL_1]],
				// CHECK: llvm.call @llvm.coro.suspend

				// Suspend coroutine second time waiting for the group.
				// CHECK: llvm.call @llvm.coro.save
				// CHECK: call @mlirAsyncRuntimeAwaitAllInGroupAndExecute(%arg0, %[[HDL_1]],
				// CHECK: llvm.call @llvm.coro.suspend

				// Emplace result token.
				// CHECK: call @mlirAsyncRuntimeEmplaceToken(%[[RET_1]])

mlir/test/Dialect/Async/async-parallel-for.mlir

This file was added.

				// RUN: mlir-opt %s -async-parallel-for | FileCheck %s

				// CHECK-LABEL: @loop_1d
				func @loop_1d(%arg0: index, %arg1: index, %arg2: index, %arg3: memref<?xf32>) {
				// CHECK: %[[GROUP:.*]] = async.create_group
				// CHECK: scf.for
				// CHECK: %[[TOKEN:.*]] = async.execute {
				// CHECK: scf.for
				// CHECK: store
				// CHECK: async.yield
				// CHECK: }
				// CHECK: async.add_to_group %[[TOKEN]], %[[GROUP]]
				// CHECK: async.await_all %[[GROUP]]
				scf.parallel (%i) = (%arg0) to (%arg1) step (%arg2) {
				%one = constant 1.0 : f32
				store %one, %arg3[%i] : memref<?xf32>
				}

				return
				}

				// CHECK-LABEL: @loop_2d
				func @loop_2d(%arg0: index, %arg1: index, %arg2: index, // lb, ub, step
				%arg3: index, %arg4: index, %arg5: index, // lb, ub, step
				%arg6: memref<?x?xf32>) {
				// CHECK: %[[GROUP:.*]] = async.create_group
				// CHECK: scf.for
				// CHECK: scf.for
				// CHECK: %[[TOKEN:.*]] = async.execute {
				// CHECK: scf.for
				// CHECK: scf.for
				// CHECK: store
				// CHECK: async.yield
				// CHECK: }
				// CHECK: async.add_to_group %[[TOKEN]], %[[GROUP]]
				// CHECK: async.await_all %[[GROUP]]
				scf.parallel (%i0, %i1) = (%arg0, %arg3) to (%arg1, %arg4)
				step (%arg2, %arg5) {
				%one = constant 1.0 : f32
				store %one, %arg6[%i0, %i1] : memref<?x?xf32>
				}

				return
				}
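
The tests above only check the shape of the rewritten IR: an outer `scf.for` over blocks, an `async.execute` per block, and an inner `scf.for` over the iterations of that block. How the pass sizes the blocks is elided in the revision summary and is not visible in this part of the diff. The sketch below shows one plausible way to cut a 1-D range into non-overlapping sub-ranges using ceiling division. `splitRange` and its `numBlocks` parameter are hypothetical and are an assumption, not the pass's actual formula.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Splits the iteration space {lb, lb + step, ...} below `ub` into at most
// `numBlocks` contiguous half-open sub-ranges [start, end). Iterating each
// sub-range from `start` with the original `step` visits every original
// iteration exactly once. Hypothetical helper for illustration only.
inline std::vector<std::pair<int64_t, int64_t>>
splitRange(int64_t lb, int64_t ub, int64_t step, int64_t numBlocks) {
  assert(step > 0 && numBlocks > 0 && "expected positive step and block count");
  std::vector<std::pair<int64_t, int64_t>> blocks;
  if (ub <= lb)
    return blocks;

  // Number of iterations of the original loop (ceiling division).
  int64_t tripCount = (ub - lb + step - 1) / step;
  // Iterations assigned to each block (ceiling division again).
  int64_t blockSize = (tripCount + numBlocks - 1) / numBlocks;

  for (int64_t first = 0; first < tripCount; first += blockSize) {
    int64_t last = std::min(first + blockSize, tripCount);
    int64_t start = lb + first * step;
    int64_t end = std::min(lb + last * step, ub);
    blocks.emplace_back(start, end);
  }
  return blocks;
}
```

Each returned pair would correspond to one `async.execute` region in the 1-D case; a 2-D loop would apply the same split independently per induction variable.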

mlir/test/Dialect/Async/ops.mlir

	}			}

	// CHECK-LABEL: @await_value			// CHECK-LABEL: @await_value
	func @await_value(%arg0: !async.value<f32>) -> f32 {			func @await_value(%arg0: !async.value<f32>) -> f32 {
	// CHECK: async.await %arg0			// CHECK: async.await %arg0
	%0 = async.await %arg0 : !async.value<f32>			%0 = async.await %arg0 : !async.value<f32>
	return %0 : f32			return %0 : f32
	}			}

				// CHECK-LABEL: @create_group_and_await_all
				func @create_group_and_await_all(%arg0: !async.token, %arg1: !async.value<f32>) -> index {
				%0 = async.create_group

				// CHECK: async.add_to_group %arg0
				// CHECK: async.add_to_group %arg1
				%1 = async.add_to_group %arg0, %0 : !async.token
				%2 = async.add_to_group %arg1, %0 : !async.value<f32>
				async.await_all %0

				%3 = addi %1, %2 : index
				return %3 : index
				}

mlir/test/mlir-cpu-runner/async-group.mlir

This file was added.

				// RUN: mlir-opt %s -convert-async-to-llvm \
				// RUN: -convert-std-to-llvm \
				// RUN: | mlir-cpu-runner \
				// RUN: -e main -entry-point-result=void -O0 \
				// RUN: -shared-libs=%linalg_test_lib_dir/libmlir_c_runner_utils%shlibext \
				// RUN: -shared-libs=%linalg_test_lib_dir/libmlir_runner_utils%shlibext \
				// RUN: -shared-libs=%linalg_test_lib_dir/libmlir_async_runtime%shlibext \
				// RUN: | FileCheck %s

				func @main() {
				%group = async.create_group

				%token0 = async.execute { async.yield }
				%token1 = async.execute { async.yield }
				%token2 = async.execute { async.yield }
				%token3 = async.execute { async.yield }
				%token4 = async.execute { async.yield }

				%0 = async.add_to_group %token0, %group : !async.token
				%1 = async.add_to_group %token1, %group : !async.token
				%2 = async.add_to_group %token2, %group : !async.token
				%3 = async.add_to_group %token3, %group : !async.token
				%4 = async.add_to_group %token4, %group : !async.token

				%token5 = async.execute {
				async.await_all %group
				async.yield
				}

				%group0 = async.create_group
				%5 = async.add_to_group %token5, %group0 : !async.token
				async.await_all %group0

				// CHECK: Current thread id: [[THREAD:.*]]
				call @mlirAsyncRuntimePrintCurrentThreadId(): () -> ()

				return
				}

				func @mlirAsyncRuntimePrintCurrentThreadId() -> ()
