This is an archive of the discontinued LLVM Phabricator instance.

arnab-oss added inline comments.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228
%0 = gpu.wait async
%memref, %asyncToken = gpu.alloc async [%0] () : memref<5xf16>
gpu.wait [%0]

In the above example, we cannot exactly remove the %0 async dependency from the 2nd gpu.wait, right @csigg?

bondhugula added inline comments. Apr 5 2022, 12:24 AM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228	@arnab-oss Does %memref have any uses in this example? If it doesn't, this is all dead code. Is the rule eliminating these now? Can you add this test case?

bondhugula added inline comments. Apr 8 2022, 11:22 PM

mlir/test/Dialect/GPU/canonicalize.mlir
12	Nit: You can just instead use `CHECK-NEXT: return` for a stronger and more direct check.

Herald added a reviewer: ThomasRaoux. Apr 8 2022, 11:22 PM

bondhugula requested changes to this revision. Apr 9 2022, 8:09 PM

bondhugula added inline comments.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1248–1258	These conditions are all completely wrong: you can't erase a gpu.wait op just because its result token has no uses. Also, you can't simply replace its uses (which could be empty) by its async dependencies and replace the op. Can you please read through the `gpu.wait` doc description (which is pretty straightforward and simple) before adding these patterns?

This revision now requires changes to proceed. Apr 9 2022, 8:09 PM

mehdi_amini added inline comments. Apr 9 2022, 8:13 PM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1230	Please remove.
1248–1258	I'm confused, the description seems to exactly call for this:
> If the op contains the `async` keyword, it returns a new async token which is synchronized with the op arguments. This new token is merely a shortcut to the argument list, and one could replace the uses of the result with the arguments for the same effect.

bondhugula added inline comments. Apr 9 2022, 8:18 PM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1233	In fact, (1) is just a trivial special case of (3), but anyway, both (1) and (3) are wrong. You can't erase the gpu.wait op if its token has no uses. The op synchronizes the device/host (internally, it destroys the stream). You can only replace it if its result token has at least one use that is another gpu.wait op.

bondhugula added inline comments. Apr 9 2022, 8:20 PM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1248–1258	If none of those result uses are another gpu.wait op, you've eliminated a device synchronization incorrectly.

bondhugula added inline comments. Apr 9 2022, 8:33 PM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1248–1258	Actually, the documentation is messed up here, it says:
> If the op does not contain the `async` keyword, it does not return a new async token but blocks until all ops producing the async dependency tokens finished execution.

But the op can definitely return a new token even without the async keyword. That's how a fresh token is created in the first place!

mehdi_amini added inline comments. Apr 9 2022, 8:34 PM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1248–1258	How so? What is specific to `gpu.wait` in how it is using a token that does not apply to the other users? Also: what kind of synchronization does an asynchronous gpu.wait without users provide? (if you can also explain how you interpret what the doc says).

Dismissing change request.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1248–1258	Sorry, looks like I was mistaken on this part (whenever you return an async token and also have async dependencies).

This revision now requires review to proceed. Apr 9 2022, 8:55 PM

Herald added a reviewer: bondhugula. Apr 9 2022, 8:55 PM

LGTM - mostly doc/code comment fix suggestions.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1232	%t = gpu.wait ... ops where %t has no uses (regardless of async dependencies).
1232–1234	Nit: The order of these doesn't match the order in the implementation. Permute.
1233	gpu.wait ops that neither have any async dependencies nor return any token.
1242–1243	Erase gpu.wait ops that neither have any async dependencies nor return any async token.
1248–1249	Replace uses of %t1 = gpu.wait async [%t0] ops with %t0 and erase the op.

This revision is now accepted and ready to land. Apr 9 2022, 9:03 PM

mehdi_amini added inline comments. Apr 9 2022, 9:14 PM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1253	Ideally this should be implemented as a folder I think.

bondhugula added inline comments. Apr 9 2022, 11:27 PM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1253	Yes, I think this pattern can be the folding hook since we aren't erasing any other ops.

mehdi_amini added inline comments. Apr 10 2022, 10:20 PM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1253	Unfortunately I think the folding hook can't implement conditional deletion, only replacements. Can it? Otherwise the `eraseOp()` can't be moved there.

csigg added inline comments. Apr 11 2022, 12:14 AM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228
> %0 = gpu.wait async
> %memref, %asyncToken = gpu.alloc async [%0] () : memref<5xf16>
> gpu.wait [%0]
> In the above example, we cannot exactly remove the %0 async dependency from the 2nd gpu.wait right @csigg ?

I think we can. The second `gpu.wait` does not synchronize with anything. The order of the `gpu.alloc` and second `gpu.wait` can be swapped, unless `gpu.wait` also uses `%asyncToken` (which would still allow you to remove `%0`).

bondhugula added inline comments. Apr 12 2022, 3:57 AM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228	But if you remove the `gpu.wait`, you've left a stream "undestroyed". Also, we shouldn't be converting an async alloc to a sync alloc -- the latter won't even lower through gpu-to-llvm - so I'm confused here.

csigg added inline comments. Apr 12 2022, 6:41 AM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228
> But if you remove the gpu.wait, you've left a stream "undestroyed".

I assume there is a `gpu.wait [<dependency of %asyncToken>]` somewhere. Otherwise the input is not really valid.

> Also, we shouldn't be converting an async alloc to a sync alloc

That's not what's happening.

%0 = gpu.wait async
%memref, %asyncToken = gpu.alloc async [%0] () : memref<5xf16>
gpu.wait [%0]

should be folded to

%0 = gpu.wait async
%memref, %asyncToken = gpu.alloc async [%0] () : memref<5xf16>
gpu.wait []

should be folded to

%0 = gpu.wait async
%memref, %asyncToken = gpu.alloc async [%0] () : memref<5xf16>
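
The two folding steps above can be illustrated with a small sketch in plain C++ that models the rewrites on a toy IR. This is not the MLIR API: `ToyOp`, `definingOp`, `hasUses`, `rewriteOnce`, and `canonicalize` are hypothetical names, async tokens are plain integers, and the rules only mirror the patterns discussed in this review (drop a dependency produced by a dependency-free `gpu.wait`; erase a `gpu.wait` with no dependencies and no token; erase an async `gpu.wait` whose token is unused; forward the single dependency of an async `gpu.wait` to its users).

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Toy stand-in for an op in a single block; this is NOT the MLIR API.
struct ToyOp {
  bool isWait;               // true for gpu.wait, false for any other async op
  std::vector<int> deps;     // ids of the async tokens this op depends on
  std::optional<int> token;  // id of the async token it returns, if any
};

// Returns the op that produces `token`, or nullptr if there is none.
const ToyOp *definingOp(const std::vector<ToyOp> &ops, int token) {
  for (const ToyOp &op : ops)
    if (op.token && *op.token == token)
      return &op;
  return nullptr;
}

// Returns true if any op lists `token` among its dependencies.
bool hasUses(const std::vector<ToyOp> &ops, int token) {
  for (const ToyOp &op : ops)
    if (std::find(op.deps.begin(), op.deps.end(), token) != op.deps.end())
      return true;
  return false;
}

// Applies at most one rewrite; returns true when the op list changed.
bool rewriteOnce(std::vector<ToyOp> &ops) {
  for (std::size_t i = 0; i < ops.size(); ++i) {
    ToyOp &op = ops[i];
    if (!op.isWait)
      continue;

    // Drop dependencies produced by a gpu.wait that has no dependencies.
    auto trivial = [&](int dep) {
      const ToyOp *def = definingOp(ops, dep);
      return def && def->isWait && def->deps.empty();
    };
    auto newEnd = std::remove_if(op.deps.begin(), op.deps.end(), trivial);
    if (newEnd != op.deps.end()) {
      op.deps.erase(newEnd, op.deps.end());
      return true;
    }

    // Erase a gpu.wait with neither dependencies nor a result token, or an
    // async gpu.wait whose result token has no uses.
    if ((op.deps.empty() && !op.token) ||
        (op.token && !hasUses(ops, *op.token))) {
      ops.erase(ops.begin() + i);
      return true;
    }

    // %t1 = gpu.wait async [%t0]: forward %t0 to the users of %t1, then erase.
    if (op.token && op.deps.size() == 1) {
      const int from = *op.token;
      const int to = op.deps.front();
      for (ToyOp &user : ops)
        std::replace(user.deps.begin(), user.deps.end(), from, to);
      ops.erase(ops.begin() + i);
      return true;
    }
  }
  return false;
}

// Runs the rewrites to a fixed point. Every rewrite removes a dependency or
// an op, so the loop terminates.
void canonicalize(std::vector<ToyOp> &ops) {
  while (rewriteOnce(ops)) {
  }
}
```

Feeding this model the three-op example (an async wait producing token 0, an alloc depending on token 0, and a blocking wait on token 0) first empties the dependency list of the last wait and then erases it, leaving only the first two ops.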

bondhugula added inline comments. Apr 12 2022, 6:55 AM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228
I think I see where the difference in understanding is. I was going by how the lowering functions worked (to CUDA streams) as opposed to the op's semantics. In the snippet below, `%asyncToken` doesn't need to have a use later: when the `wait` on `%0` happens, the stream is synchronized and destroyed: this means the memcpy will be completed (even though %asyncToken isn't specified as a dep on it) and anything else attached to that stream. And one doesn't need another use of `%asyncToken` -- this IR currently lowers and executes correctly as intended AFAIU. In fact, deleting the second `gpu.wait` would eliminate a synchronization and violate semantics.

%0 = gpu.wait async
%memref, %asyncToken = gpu.alloc async [%0] () : memref<5xf16>
gpu.wait [%0]

However, per the doc, the second gpu.wait would really "block" on the completion of the first op which is a trivial block and so can be eliminated.

bondhugula added inline comments. Apr 12 2022, 4:45 PM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228
@csigg Furthermore, based on what you said, the following would be valid IR:

%0 = gpu.wait async
%memref, %asyncToken = gpu.alloc async [%0] () : memref<5xf16>
gpu.wait [%0]
gpu.wait [%asyncToken]

But won't this crash on lowering today since it would destroy the same stream twice?
https://github.com/llvm/llvm-project/blob/32f3633171aa9d7352e9507c12d219efb48899a0/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp#L560

mehdi_amini added inline comments. Apr 12 2022, 9:39 PM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228	The notion of stream being destroyed in a wait or being left undestroyed is a bit confusing to me, and I can't connect this to the concept of token and async dependencies. Seems like some lowering issues to me more than anything to fix at this level of abstraction.

mehdi_amini added inline comments. Apr 12 2022, 9:45 PM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1211	Why do multiple uses prevent optimizing it away?
1222	Seems like instead of handling by looking at the user, we should root the pattern on the user and look for the operands. That way it would handle the "multiple uses" above, and it could operate by eliminating operands from the operands list of the gpu.wait. An async gpu.wait without users could be just deleted.

bondhugula added inline comments. Apr 13 2022, 2:25 AM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228	I agree - this all makes sense to me and I think @csigg's change can safely be applied here.
1248–1258	A gpu.wait that returns a token doesn't provide any synchronization. We are talking about a gpu.wait that doesn't return an async token but has async deps. This as you know will block till those producers are complete. If one of those producers is a gpu.wait, we can erase it -- this is fine per the semantics of the op and this is why @csigg's suggestion above can be applied.

bondhugula added inline comments. Apr 13 2022, 2:27 AM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1222
> An async gpu.wait without users could be just deleted.

The pattern further below already does this: see line 1218.

Addressed comments by @csigg, @bondhugula

Harbormaster completed remote builds in B159414: Diff 422466. Apr 13 2022, 4:42 AM

bondhugula accepted this revision. Apr 13 2022, 5:38 AM

Closed by commit rG392d55c1e2d7: [MLIR][GPU] Add canonicalization patterns for folding simple gpu.wait ops. (authored by arnab-oss, committed by bondhugula). Apr 14 2022, 12:02 AM

This revision was automatically updated to reflect the committed changes.

bondhugula added a commit: rG392d55c1e2d7: [MLIR][GPU] Add canonicalization patterns for folding simple gpu.wait ops..

Revision Contents

Path	Size
mlir/include/mlir/Dialect/GPU/GPUOps.td	2 lines
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp	72 lines
mlir/test/Dialect/GPU/canonicalize.mlir	28 lines

Diff 422753

mlir/include/mlir/Dialect/GPU/GPUOps.td

  def GPU_WaitOp : GPU_Op<"wait", [GPU_AsyncOpInterface]> {
    ...
    }];

    let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies);
    let results = (outs Optional<GPU_AsyncToken>:$asyncToken);

    let assemblyFormat = [{
      custom<AsyncDependencies>(type($asyncToken), $asyncDependencies) attr-dict
    }];
+
+   let hasCanonicalizer = 1;
  }

  def GPU_AllocOp : GPU_Op<"alloc", [
      GPU_AsyncOpInterface,
      AttrSizedOperandSegments
    ]> {

    let summary = "GPU memory allocation operation.";

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp

  }

  LogicalResult MemsetOp::fold(ArrayRef<Attribute> operands,
                               SmallVectorImpl<::mlir::OpFoldResult> &results) {
    return foldMemRefCast(*this);
  }

  //===----------------------------------------------------------------------===//
+ // GPU_WaitOp
+ //===----------------------------------------------------------------------===//
+
+ namespace {
+
+ /// Remove gpu.wait op use of gpu.wait op def without async dependencies.
+ /// %t = gpu.wait async [] // No async dependencies.
+ /// ... gpu.wait ... [%t, ...] // %t can be removed.
+ struct EraseRedundantGpuWaitOpPairs : public OpRewritePattern<WaitOp> {
+ public:
+   using OpRewritePattern::OpRewritePattern;
+
+   LogicalResult matchAndRewrite(WaitOp op,
+                                 PatternRewriter &rewriter) const final {
+     auto predicate = [](Value value) {
+       auto wait_op = value.getDefiningOp<WaitOp>();
+       return wait_op && wait_op->getNumOperands() == 0;
+     };
+     if (llvm::none_of(op.asyncDependencies(), predicate))
+       return failure();
+     SmallVector<Value> validOperands;
+     for (Value operand : op->getOperands()) {
+       if (predicate(operand))
+         continue;
+       validOperands.push_back(operand);
+     }
+     op->setOperands(validOperands);
+     return success();
+   }
+ };

+ /// Simplify trivial gpu.wait ops for the following patterns.
+ /// 1. %t = gpu.wait async ... ops, where %t has no uses (regardless of async
+ /// dependencies).
+ /// 2. %t1 = gpu.wait async [%t0], in this case, we can replace uses of %t1 with
+ /// %t0.
+ /// 3. gpu.wait [] ops, i.e gpu.wait ops that neither have any async
+ /// dependencies nor return any token.
+ struct SimplifyGpuWaitOp : public OpRewritePattern<WaitOp> {
+ public:
+   using OpRewritePattern::OpRewritePattern;

csigg added inline comments.

  namespace {
- /// Fold away redundant gpu.wait ops of the following pattern.
- /// %t = gpu.wait async
- /// gpu.wait [%t]
+ /// Remove gpu.wait op use of gpu.wait op def without async dependencies.
+ /// %t = gpu.wait async [] // No async dependencies.
+ /// ... gpu.wait ... [%t, ...] // %t can be removed.
  struct EraseRedundantGpuWaitOpPairs : public OpRewritePattern<WaitOp> {
  public:
    using OpRewritePattern::OpRewritePattern;
    LogicalResult matchAndRewrite(WaitOp op,
                                  PatternRewriter &rewriter) const final {
-     // We check whether `op` produce result `asyncToken`.
-     Value token = op.asyncToken();
-     if (!token)
-       return failure();
-     // We check whether `op` has any async dependencies or not.
-     if (!op.asyncDependencies().empty())
-       return failure();
-     // If token do not have single use, we cannot fold away gpu.wait ops.
-     if (!token.hasOneUse())
-       return failure();
-     // If the only op operating on `token` is not a gpu.wait op, we cannot fold
-     // away gpu.wait ops.
-     auto tokenUser = dyn_cast<mlir::gpu::WaitOp>(*token.user_begin());
-     if (!tokenUser)
-       return failure();
-     // If `waitOp` produces any token, we cannot fold away the gpu.wait ops.
-     if (tokenUser.asyncToken())
-       return failure();
-     // `waitOp` should have only single async dependency.
-     if (!llvm::hasSingleElement(tokenUser.asyncDependencies()))
+     auto predicate = [](Value value) {
+       auto wait_op = value.getDefiningOp<WaitOp>();
+       return wait_op && wait_op->getNumOperands() == 0;
+     };
+     if (llvm::none_of(op.asyncDependencies(), predicate))
        return failure();
-     rewriter.eraseOp(tokenUser);
-     rewriter.eraseOp(op);
+     op->setOperands(llvm::to_vector<4>(
+         llvm::make_filter_range(op.asyncDependencies(), predicate)));
      return success();
    }
  };
  // clang-format off

Untested, but would it work to simply remove the async dependency and let the other canonicalizer take care of erasing ops?

+   LogicalResult matchAndRewrite(WaitOp op,
+                                 PatternRewriter &rewriter) const final {
+     // Erase gpu.wait ops that neither have any async dependencies nor return
+     // any async token.
+     if (op.asyncDependencies().empty() && !op.asyncToken()) {
+       rewriter.eraseOp(op);
+       return success();
+     }
+     // Replace uses of %t1 = gpu.wait async [%t0] ops with %t0 and erase the op.
+     if (llvm::hasSingleElement(op.asyncDependencies()) && op.asyncToken()) {
+       rewriter.replaceOp(op, op.asyncDependencies());
+       return success();
+     }
+     // Erase %t = gpu.wait async ... ops, where %t has no uses.
+     if (op.asyncToken() && op.asyncToken().use_empty()) {
+       rewriter.eraseOp(op);
+       return success();
+     }
+     return failure();
+   }
+ };
+
+ } // end anonymous namespace


+ void WaitOp::getCanonicalizationPatterns(RewritePatternSet &results,
+                                          MLIRContext *context) {
+   results.add<EraseRedundantGpuWaitOpPairs, SimplifyGpuWaitOp>(context);
+ }


+ //===----------------------------------------------------------------------===//
  // GPU_AllocOp
  //===----------------------------------------------------------------------===//

  LogicalResult AllocOp::verify() {
    auto memRefType = memref().getType().cast<MemRefType>();
    if (static_cast<int64_t>(dynamicSizes().size()) !=
        memRefType.getNumDynamicDims())

mlir/test/Dialect/GPU/canonicalize.mlir

  // RUN: mlir-opt %s -canonicalize --split-input-file -allow-unregistered-dialect | FileCheck %s

+ // Fold all the gpu.wait ops as they are redundant.
+ // CHECK-LABEL: func @fold_wait_op_test1
+ func @fold_wait_op_test1() {
+   %1 = gpu.wait async
+   gpu.wait []
+   %3 = gpu.wait async
+   gpu.wait [%3]
+   return
+ }
+ // CHECK-NOT: gpu.wait
+
+ // Replace uses of gpu.wait op with its async dependency.
+ // CHECK-LABEL: func @fold_wait_op_test2
+ func @fold_wait_op_test2(%arg0: i1) -> (memref<5xf16>, memref<5xf16>) {
+   %0 = gpu.wait async
+   %memref, %asyncToken = gpu.alloc async [%0] () : memref<5xf16>
+   gpu.wait [%0]
+   %1 = gpu.wait async [%0]
+   %memref_0, %asyncToken_0 = gpu.alloc async [%1] () : memref<5xf16>
+   gpu.wait [%1]
+   return %memref, %memref_0 : memref<5xf16>, memref<5xf16>
+ }
+ // CHECK-NEXT: %[[TOKEN0:.*]] = gpu.wait async
+ // CHECK-NEXT: gpu.alloc async [%[[TOKEN0]]] ()
+ // CHECK-NEXT: %[[TOKEN1:.*]] = gpu.wait async
+ // CHECK-NEXT: gpu.alloc async [%[[TOKEN1]]] ()
+ // CHECK-NEXT: return
+
  // CHECK-LABEL: @memcpy_after_cast
  func @memcpy_after_cast(%arg0: memref<10xf32>, %arg1: memref<10xf32>) {
    // CHECK-NOT: memref.cast
    // CHECK: gpu.memcpy
    %0 = memref.cast %arg0 : memref<10xf32> to memref<?xf32>
    %1 = memref.cast %arg1 : memref<10xf32> to memref<?xf32>
    gpu.memcpy %0, %1 : memref<?xf32>, memref<?xf32>
    return


[MLIR][GPU] Add canonicalization patterns for folding simple gpu.wait ops.
Closed · Public
