This is an archive of the discontinued LLVM Phabricator instance.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1077	Nit: canoot -> cannot
1084	Nit: othet -> other
1096	Shouldn't this rather be a separate canonicalizer that removes `%t = gpu.wait async` + `gpu.wait [%t]` pairs?
mlir/test/Dialect/GPU/canonicalize.mlir
14	Wouldn't this more naturally be %4, and the canonicalizer should `rewriter.replaceOp(op, asyncDependencies[0])`?

Harbormaster completed remote builds in B153326: Diff 414057. Mar 9 2022, 4:22 AM

Addressed review comments by @csigg.

Harbormaster completed remote builds in B153340: Diff 414072. Mar 9 2022, 5:28 AM

mehdi_amini added inline comments. Mar 9 2022, 5:47 AM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1082	Strictly speaking, it should also check that it does not have a read effect, I think (it is unlikely, but an operation could be "print_and_free(%mem)"). Also, isn't there an API to check the effect per operand? What if an operation has two operands but frees only one of them? (One easy way out would be to match the GpuDealloc op specifically instead, or check that this op has a single operand.) Also, in general this transformation seems like "dead store elimination"; isn't it something that can be implemented generically (dialect independent, purely effect based)?

bondhugula requested changes to this revision. Mar 9 2022, 7:21 AM

bondhugula added inline comments.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1082	> so, in general this transformation seems like "dead store elimination", isn't it something that can be implemented generically (dialect independent, purely effect based)?
	Regarding this last comment: this issue came up in the past too (in one of my revisions, where we had a discussion). While there is a general pattern here and a semblance of reuse, a dead store elimination pass can't by itself be a substitute for a folding hook here because: (1) a separate `dse` pass won't lead to early simplification the way canonicalization and folding hooks do; (2) one will run into phase ordering issues between other folding/canonicalizations and such DSE. So it's ideal to have such O(1) and O(use-def chain length) work in a folding hook. Moreover, in this case there is a connection to the `wait` op and the async token, so I think we do need the folding hook.
1082	+1 to what Mehdi says on associating the effect to the operand. But I think there is API missing here and one would need to get the value associated with each `Free` effect to see if it is `dest`. This check is currently conservative.

This revision now requires changes to proceed. Mar 9 2022, 7:21 AM

bondhugula added inline comments. Mar 9 2022, 7:30 AM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1082	> Strictly speaking it should also check that it does not have a read effect I think (it is unlikely but an operation could be "print_and_free(%mem)").
	That's right. As written, this pattern's check is incorrect. It needs to check that `Free` is the only effect that this `op` can have on `dest`. However, the pattern only checks whether the op has a `Free` effect on some `Value`.
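
A minimal sketch of the stricter check being asked for here, written against a toy effect model rather than the MLIR side-effect API. `EffectKind`, `EffectInstance`, `hasAnyFree`, and `onlyFreesDest` are hypothetical names, integer ids stand in for `Value`s, and an effect with no associated value is assumed to possibly touch `dest`.

```cpp
#include <cassert>
#include <vector>

// Toy stand-ins for memory effects; none of these are MLIR types.
enum class EffectKind { Allocate, Free, Read, Write };

constexpr int kUnknownValue = -1;

struct EffectInstance {
  EffectKind kind;
  int value; // id of the value the effect applies to, or kUnknownValue
};

// The check as first written: is there a Free effect on *any* value?
bool hasAnyFree(const std::vector<EffectInstance> &effects) {
  for (const EffectInstance &e : effects)
    if (e.kind == EffectKind::Free)
      return true;
  return false;
}

// Stricter check: every effect that may touch `dest` is a Free, and at
// least one Free explicitly names `dest`.
bool onlyFreesDest(const std::vector<EffectInstance> &effects, int dest) {
  assert(dest >= 0 && "dest must be a known value id");
  bool freesDest = false;
  for (const EffectInstance &e : effects) {
    bool mayTouchDest = e.value == dest || e.value == kUnknownValue;
    if (!mayTouchDest)
      continue;
    if (e.kind != EffectKind::Free)
      return false; // e.g. the read in "print_and_free(%mem)"
    if (e.value == dest)
      freesDest = true;
  }
  return freesDest;
}
```

With this model, an op that reads and frees `dest`, or one that frees a different operand, passes `hasAnyFree` but is rejected by `onlyFreesDest`.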

Addressed comments by @bondhugula

Harbormaster completed remote builds in B153546: Diff 414353. Mar 10 2022, 6:07 AM

bondhugula added inline comments. Mar 10 2022, 8:39 AM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1085	You don't need `mlir::` - you already have a `using` for that.
1326	Nit: Fold -> Erase?
1332	Typo here.
1339	does not ... a single use
1339	`ops` or single op that you are handling here?
1354	`eraseOp(waitOp)`
mlir/test/Dialect/GPU/canonicalize.mlir
24	Missing check that the `gpu.wait` has been folded away.

Addressed comments by @bondhugula

Harbormaster completed remote builds in B153720: Diff 414588. Mar 10 2022, 11:19 PM

LGTM - thanks.

mlir/test/Dialect/GPU/canonicalize.mlir
29	Nit: Punctuate correctly.

This revision is now accepted and ready to land. Mar 11 2022, 3:41 AM

Addressed comments by @bondhugula

Harbormaster completed remote builds in B153897: Diff 414817. Mar 12 2022, 3:19 AM

bondhugula accepted this revision. Mar 12 2022, 6:56 AM

I think the gpu.wait canonicalizer could be cleaning up more cases:

%unused = gpu.wait async ...: eraseOp(op)
gpu.wait []: eraseOp(op)
%t1 = gpu.wait async [%t0]: replaceOp(op, t0)
%t = gpu.wait async ... + gpu.wait {async} [%t, ...]: drop %t from async dependencies.
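
A rough sketch of how the first three cases in this list could be told apart, using a toy description of a `gpu.wait` op instead of the MLIR classes. `WaitInfo`, `WaitAction`, and `classifyWait` are hypothetical names; the fourth case is not modeled because it needs def-use information about each dependency.

```cpp
#include <cassert>

// Toy description of one gpu.wait op; not an MLIR type.
struct WaitInfo {
  bool isAsync;   // true when the op produces an async token
  bool tokenUsed; // true when that token has at least one use
  int numDeps;    // number of async dependencies
};

enum class WaitAction { Erase, ForwardDependency, Keep };

WaitAction classifyWait(const WaitInfo &w) {
  assert(w.numDeps >= 0 && "dependency count cannot be negative");
  // %unused = gpu.wait async ...   -> eraseOp(op)
  if (w.isAsync && !w.tokenUsed)
    return WaitAction::Erase;
  // gpu.wait []                    -> eraseOp(op)
  if (!w.isAsync && w.numDeps == 0)
    return WaitAction::Erase;
  // %t1 = gpu.wait async [%t0]     -> replaceOp(op, t0)
  if (w.isAsync && w.numDeps == 1)
    return WaitAction::ForwardDependency;
  return WaitAction::Keep;
}
```

The unused-token test runs first, so an async wait whose token is dead is erased even when it has a single dependency.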

In D121279#3378709, @csigg wrote:

> I think the gpu.wait canonicalizer could be cleaning up more cases:
>
> %unused = gpu.wait async ...: eraseOp(op)
> gpu.wait []: eraseOp(op)
> %t1 = gpu.wait async [%t0]: replaceOp(op, t0)
> %t = gpu.wait async ... + gpu.wait {async} [%t, ...]: drop %t from async dependencies.

Would this fit logically in this revision or a separate revision for gpu.wait canonicalizer? This revision is meant to erase away trivial gpu.memcpy and ancillary stuff. A full-fledged gpu.wait folder/canonicalizer should ideally go into a separate commit.

In D121279#3381610, @bondhugula wrote:

> Would this fit logically in this revision or a separate revision for gpu.wait canonicalizer? This revision is meant to erase away trivial gpu.memcpy and ancillary stuff. A full-fledged gpu.wait folder/canonicalizer should ideally go into a separate commit.

Sorry, I should have been more clear, none of my comments were meant to gate the revision.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1077	Why can the memcpy not be removed if it's a block argument?
1080	Would 'OnVal' be better than 'OnDest'?
1101	Should this also handle the common case of tokens threaded through all gpu ops?
	    if (op.asyncDependencies().size() > 1 ||
	        op.asyncDependencies().empty() == op.asyncToken())
	      return failure()
	    rewriter.replaceOp(op, op.asyncDependencies());
	This would also take care of not updating the op but still returning `success()`.
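
A small sketch of the bail-out condition suggested here, reduced to plain booleans so it can be checked exhaustively. `bailsCompact` and `bailsExpanded` are hypothetical helper names; `numDeps` stands for `op.asyncDependencies().size()` and `hasToken` for whether `op.asyncToken()` is non-null. The second function spells out the same condition as three separate early returns.

```cpp
#include <cassert>

// Compact form: bail out when there is more than one dependency, or when
// "no dependencies" and "has a token" agree (nothing valid to forward).
bool bailsCompact(unsigned numDeps, bool hasToken) {
  return numDeps > 1 || (numDeps == 0) == hasToken;
}

// Expanded form: one early return per rejected situation.
bool bailsExpanded(unsigned numDeps, bool hasToken) {
  if (numDeps > 1)
    return true; // several dependencies cannot replace a single token
  if (numDeps == 0 && hasToken)
    return true; // a token result but nothing to replace it with
  if (numDeps != 0 && !hasToken)
    return true; // a dependency but no result to replace
  return false;  // 0 deps / no token, or 1 dep / 1 token
}
```

Both forms accept exactly two situations: no dependencies with no token, and one dependency with one token. In those two cases the number of replacement values matches the number of results.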

Addressed comments by @csigg.

Harbormaster completed remote builds in B154283: Diff 415369. Mar 15 2022, 4:10 AM

csigg added inline comments. Mar 16 2022, 11:49 PM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1100–1103	Really just an idea, would this be easier to read?

Addressed review comments and rebased on latest master.

Herald added a reviewer: ThomasRaoux. Apr 14 2022, 5:10 AM

Harbormaster completed remote builds in B159663: Diff 422822. Apr 14 2022, 5:28 AM

Fixed build issues.

Harbormaster completed remote builds in B160023: Diff 423352. Apr 18 2022, 4:27 AM

Removed trailing white spaces.

Harbormaster completed remote builds in B160205: Diff 423583. Apr 19 2022, 4:25 AM

Closed by commit rG12f55cac69d8: [MLIR][GPU] Add canonicalizer for gpu.memcpy (authored by arnab-oss, committed by bondhugula). Apr 19 2022, 5:25 AM

This revision was automatically updated to reflect the committed changes.

bondhugula added a commit: rG12f55cac69d8: [MLIR][GPU] Add canonicalizer for gpu.memcpy.

I am sorry, but we have clear evidence that this miscompiles some TensorFlow GPU tests, so I am applying a revert. I will check with them to provide a reproducer.

MaskRay added a reverting change: rGae46b3e01faa: Revert D121279 "[MLIR][GPU] Add canonicalizer for gpu.memcpy". Apr 21 2022, 8:55 AM

Can you please provide the test case for which incorrect IR is generated?
I will provide the fix. Thanks in advance.

In D121279#3464917, @arnab-oss wrote:

> Can you please provide the test case for which incorrect IR is generated?
> I will provide the fix. Thanks in advance.

Folks will try to get a self-contained repro, but in the interim, a test case in the TF repo that fails with this change and passes without it is:
bazel test --config=cuda --compilation_mode=opt --test_env=XLA_FLAGS=--xla_gpu_bef_executable --copt=-Wno-error //tensorflow/compiler/xla/tests:concat_test_gpu
For simplicity in reproducing, you would probably need to follow the instructions for a local LLVM repo in tensorflow/compiler/mlir.

You also need --//tensorflow/compiler/xla/service/gpu:enable_xlir=true
bazel test --config=cuda --compilation_mode=opt --//tensorflow/compiler/xla/service/gpu:enable_xlir=true --test_env=XLA_FLAGS=--xla_gpu_bef_executable --copt=-Wno-error //tensorflow/compiler/xla/tests:concat_test_gpu

Here is a repro:

func @copy(%arg0: memref<1xi8>, %arg1: memref<i1>) {
  %0 = arith.constant 0 : index
  %1 = memref.view %arg0[%0][] : memref<1xi8> to memref<i1>
  gpu.memcpy  %1, %arg1 : memref<i1>, memref<i1>
  func.return
}

mlir-opt --canonicalize removes the memcpy when it really shouldn't.

In D121279#3466770, @csigg wrote:

> Here is a repro:
>
> func @copy(%arg0: memref<1xi8>, %arg1: memref<i1>) {
>   %0 = arith.constant 0 : index
>   %1 = memref.view %arg0[%0][] : memref<1xi8> to memref<i1>
>   gpu.memcpy  %1, %arg1 : memref<i1>, memref<i1>
>   func.return
> }
>
> mlir-opt --canonicalize removes the memcpy when it really shouldn't.

This is a big miss. The pattern will have to be made very conservative or instead be implemented as part of a copy-removal pass where aliasing information can be used. However, strictly speaking, the aliasing check available with AliasAnalysis is often just O(use-def) chain -- so one should be able to use something like bool doMemRefsAlias(Value memRefA, Value memRefB) from a canonicalization pattern or a folding hook. There isn't really any caching needed or caching happening there.
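
To illustrate why the repro slips past the pattern, here is a toy model of value definitions. It is not the MLIR `AliasAnalysis` API; `DefKind`, `ValueDef`, `underlyingBase`, `destIsBlockArgument`, and `storageIsLocalAlloc` are hypothetical names. The pattern's check looks only at `dest` itself, while a conservative check walks through view-like ops to the value that owns the storage.

```cpp
#include <cassert>
#include <vector>

// How a value is defined in the toy model.
enum class DefKind { BlockArgument, Alloc, View };

struct ValueDef {
  DefKind kind;
  int source; // index of the viewed value when kind == View, otherwise -1
};

// Follows view-like definitions back to the value that owns the storage.
int underlyingBase(const std::vector<ValueDef> &values, int v) {
  assert(v >= 0 && v < static_cast<int>(values.size()));
  while (values[v].kind == DefKind::View) {
    v = values[v].source;
    assert(v >= 0 && v < static_cast<int>(values.size()));
  }
  return v;
}

// The check from the pattern: inspects `dest` only.
bool destIsBlockArgument(const std::vector<ValueDef> &values, int dest) {
  return values[dest].kind == DefKind::BlockArgument;
}

// Conservative alternative: the storage behind `dest` must be a local alloc.
bool storageIsLocalAlloc(const std::vector<ValueDef> &values, int dest) {
  return values[underlyingBase(values, dest)].kind == DefKind::Alloc;
}
```

In the repro, `%1` is a `memref.view` of `%arg0`, so the block-argument test on `%1` fails to fire while the base-walking test refuses the rewrite. A real implementation would also have to account for the users of every alias of the buffer, which this sketch does not attempt.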

bondhugula added inline comments. Apr 22 2022, 1:42 AM

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1077	This is the condition that's completely arbitrary. You can change this check to iterate through all uses and check if alloc/dealloc and this memcpy are the only ops that use this op? A more powerful removal has to be done anyway in a separate pass.
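
One possible reading of this suggestion, sketched on a toy op model rather than on MLIR operations: accept the rewrite only when the destination is defined by an alloc and every user is either a dealloc or the memcpy under consideration. `OpKind`, `OpRef`, and `onlyAllocDeallocAndCopy` are hypothetical names, and classifying a real op as alloc or dealloc is assumed to happen elsewhere.

```cpp
#include <cassert>
#include <vector>

enum class OpKind { Alloc, Dealloc, Memcpy, Other };

// Toy reference to an operation: its kind plus a unique id.
struct OpRef {
  OpKind kind;
  int id;
};

// True when `definer` is an alloc and all `users` of the destination are
// deallocs or the memcpy identified by `memcpyId`.
bool onlyAllocDeallocAndCopy(const OpRef &definer,
                             const std::vector<OpRef> &users, int memcpyId) {
  assert(memcpyId >= 0 && "memcpy id must be valid");
  if (definer.kind != OpKind::Alloc)
    return false; // covers block arguments and views, which have other definers
  for (const OpRef &user : users) {
    bool isThisCopy = user.kind == OpKind::Memcpy && user.id == memcpyId;
    if (!isThisCopy && user.kind != OpKind::Dealloc)
      return false;
  }
  return true;
}
```

Requiring an alloc as the definer replaces the block-argument test with a positive condition, so a destination produced by `memref.view` is rejected as well.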

Revision Contents

Path	Size
mlir/include/mlir/Dialect/GPU/GPUOps.td	1 line
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp	52 lines
mlir/test/Dialect/GPU/canonicalize.mlir	23 lines

Diff 414057

mlir/include/mlir/Dialect/GPU/GPUOps.td

  def GPU_MemcpyOp : GPU_Op<"memcpy", [GPU_AsyncOpInterface]> {
  ...
    let results = (outs Optional<GPU_AsyncToken>:$asyncToken);

    let assemblyFormat = [{
      custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
      $dst`,` $src `:` type($dst)`,` type($src) attr-dict
    }];
    let hasFolder = 1;
    let hasVerifier = 1;
+   let hasCanonicalizer = 1;
  }

  def GPU_MemsetOp : GPU_Op<"memset",
      [GPU_AsyncOpInterface, AllElementTypesMatch<["dst", "value"]>]> {

    let summary = "GPU memset operation";

    let description = [{
  ...

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp

  ...
    if (asyncTokenType)
      printer << "async ";
    if (asyncDependencies.empty())
      return;
    printer << "[";
    llvm::interleaveComma(asyncDependencies, printer);
    printer << "]";
  }

+ namespace {
+ /// Erases a common case of copy ops where a destination value is used only by
+ /// the copy op, alloc and dealloc ops.
+ struct EraseTrivialCopyOp : public OpRewritePattern<MemcpyOp> {
+   using OpRewritePattern<MemcpyOp>::OpRewritePattern;
+   LogicalResult matchAndRewrite(MemcpyOp op,
+                                 PatternRewriter &rewriter) const override {
+     Value dest = op.dst();
+     // If `dest` is a block argument, we canoot remove `op`.

csigg (Done): Nit: canoot -> cannot

csigg (Done): Why can the memcpy not be removed if it's a block argument?

bondhugula (Not Done): This is the condition that's completely arbitrary. You can change this check to iterate through all uses and check if alloc/dealloc and this memcpy are the only ops that use this op? A more powerful removal has to be done anyway in a separate pass.

+     if (dest.isa<BlockArgument>())
+       return failure();

+     auto isDeallocLikeOp = [](Operation *op) {

csigg (Done): Would 'OnVal' be better than 'OnDest'?

+       auto memOp = dyn_cast<MemoryEffectOpInterface>(op);
+       return memOp && memOp.hasEffect<MemoryEffects::Free>();

mehdi_amini (Done): Strictly speaking, it should also check that it does not have a read effect, I think (it is unlikely, but an operation could be "print_and_free(%mem)").
Also, isn't there an API to check the effect per operand? What if an operation has two operands but frees only one of them?
(One easy way out would be to match the GpuDealloc op specifically instead, or check that this op has a single operand.)
Also, in general this transformation seems like "dead store elimination"; isn't it something that can be implemented generically (dialect independent, purely effect based)?

bondhugula (Done):
> so, in general this transformation seems like "dead store elimination", isn't it something
> that can be implemented generically (dialect independent, purely effect based)?
Regarding this last comment: this issue came up in the past too (in one of my revisions, where we had a discussion). While there is a general pattern here and a semblance of reuse, a dead store elimination pass can't by itself be a substitute for a folding hook here because:
- a separate dse pass won't lead to early simplification the way canonicalization and folding hooks do,
- one will run into phase ordering issues between other folding/canonicalizations and such DSE; so it's ideal to have such O(1) and O(use-def chain length) work in a folding hook.
Moreover, in this case there is a connection to the wait op and the async token, so I think we do need the folding hook.

bondhugula (Done):
> Strictly speaking it should also check that it does not have a read effect I think (it is
> unlikely but an operation could be "print_and_free(%mem)").
That's right. As written, this pattern's check is incorrect. It needs to check that Free is the only effect that this op can have on dest. However, the pattern only checks whether the op has a Free effect on some Value.

bondhugula (Done): +1 to what Mehdi says on associating the effect to the operand. But I think there is API missing here and one would need to get the value associated with each Free effect to see if it is dest. This check is currently conservative.

+     };

+     // We can erase `op` iff `dest` has no othet use apart from its

csigg (Done): Nit: othet -> other

+     // use by `op` and dealloc ops.

bondhugula (Done): You don't need `mlir::` - you already have a `using` for that.

+     if (llvm::any_of(dest.getUsers(), [isDeallocLikeOp, op](Operation *user) {
+           return user != op && !isDeallocLikeOp(user);
+         }))
+       return failure();
+     ValueRange asyncDependencies = op.asyncDependencies();
+     // Check that the async token we are going to erase has no other uses.
+     if (!op.asyncToken() || op.asyncToken().use_empty())
+       rewriter.eraseOp(op);

+     // Remove redundant gpu.wait op. If `op` has a single async dependency

csigg (Done): Shouldn't this rather be a separate canonicalizer that removes `%t = gpu.wait async` + `gpu.wait [%t]` pairs?

+     // token, and the token value has a single user (other than `op`, of type
+     // gpu.wait, we can erase the gpu.wait op, along with the op defining the
+     // async token.
+     if (asyncDependencies.size() == 1 && asyncDependencies[0].hasOneUse()) {
+       if (auto waitOp = dyn_cast<WaitOp>(*asyncDependencies[0].user_begin())) {

csigg (Done): Should this also handle the common case of tokens threaded through all gpu ops?

    if (op.asyncDependencies().size() > 1 ||
        op.asyncDependencies().empty() == op.asyncToken())
      return failure()
    rewriter.replaceOp(op, op.asyncDependencies());

This would also take care of not updating the op but still returning success().

+         if (!waitOp.asyncToken() && waitOp.asyncDependencies().size() == 1) {
+           rewriter.eraseOp(waitOp);

csigg (Done):

      return failure();
    - if (op.asyncDependencies().size() > 1 ||
    -     ((op.asyncDependencies().empty() && op.asyncToken()) ||
    -      (!op.asyncDependencies().empty() && !op.asyncToken())))
    + if (op.asyncDependencies().size() > 1)
    +   return failure();
    + if (op.asyncDependencies().empty() && op.asyncToken())
    +   return failure();
    + if (!op.asyncDependencies().empty() && !op.asyncToken())
        return failure();
      rewriter.replaceOp(op, op.asyncDependencies());

Really just an idea, would this be easier to read?

+           rewriter.eraseOp(asyncDependencies[0].getDefiningOp());
+         }
+     return success();
+   }
+ };
+ } // end anonymous namespace
+ void MemcpyOp::getCanonicalizationPatterns(RewritePatternSet &results,
+                                            MLIRContext *context) {
+   results.add<EraseTrivialCopyOp>(context);
+ }

  //===----------------------------------------------------------------------===//
  // GPU_SubgroupMmaLoadMatrixOp
  //===----------------------------------------------------------------------===//
  LogicalResult SubgroupMmaLoadMatrixOp::verify() {
    auto srcType = srcMemref().getType();
    auto resType = res().getType();
    auto resMatrixType = resType.cast<gpu::MMAMatrixType>();
  ...
  #include "mlir/Dialect/GPU/GPUOpInterfaces.cpp.inc"
  #include "mlir/Dialect/GPU/GPUOpsEnums.cpp.inc"
  #define GET_ATTRDEF_CLASSES
  #include "mlir/Dialect/GPU/GPUOpsAttributes.cpp.inc"
  #define GET_OP_CLASSES
  #include "mlir/Dialect/GPU/GPUOps.cpp.inc"

bondhugula (Done): Nit: Fold -> Erase?

bondhugula (Done): Typo here.

bondhugula (Done): does not ... a single use

bondhugula (Done): `ops` or single op that you are handling here?

bondhugula (Done): `eraseOp(waitOp)`

mlir/test/Dialect/GPU/canonicalize.mlir

  // RUN: mlir-opt %s -canonicalize --split-input-file -allow-unregistered-dialect | FileCheck %s

+ // CHECK-LABEL: func @fold_memcpy_op
+ func @fold_memcpy_op(%arg0: i1) {
+   %cst = arith.constant 3.343820e-05 : f16
+   %cst_0 = arith.constant 0.000000e+00 : f16
+   %1 = memref.alloc() : memref<400x1024x1024x1xf16>
+   %2 = gpu.wait async
+   %memref, %asyncToken = gpu.alloc async [%2] () : memref<400x1024x1024x1xf16>
+   gpu.wait [%2]
+   affine.store %cst, %memref[0, 0, 0, 0] : memref<400x1024x1024x1xf16>
+   %3 = gpu.wait async
+   %4 = gpu.memcpy async [%3] %1, %memref : memref<400x1024x1024x1xf16>, memref<400x1024x1024x1xf16>
+   gpu.wait [%3]

csigg (Done): Wouldn't this more naturally be %4, and the canonicalizer should `rewriter.replaceOp(op, asyncDependencies[0])`?

+   %5 = scf.if %arg0 -> (i1) {
+     memref.dealloc %1 : memref<400x1024x1024x1xf16>
+     scf.yield %arg0 : i1
+   } else {
+     memref.dealloc %1 : memref<400x1024x1024x1xf16>
+     scf.yield %arg0 : i1
+   }
+   return
+ }
+ // CHECK-NOT: gpu.memcpy

bondhugula (Done): Missing check that the `gpu.wait` has been folded away.

  // CHECK-LABEL: @memcpy_after_cast
  func @memcpy_after_cast(%arg0: memref<10xf32>, %arg1: memref<10xf32>) {
    // CHECK-NOT: memref.cast
    // CHECK: gpu.memcpy

bondhugula (Done): Nit: Punctuate correctly.

    %0 = memref.cast %arg0 : memref<10xf32> to memref<?xf32>
    %1 = memref.cast %arg1 : memref<10xf32> to memref<?xf32>
    gpu.memcpy %0, %1 : memref<?xf32>, memref<?xf32>
    return
  }

  // CHECK-LABEL: @memset_after_cast
  func @memset_after_cast(%arg0: memref<10xf32>, %arg1: f32) {
  ...

[MLIR][GPU] Add canonicalizer for gpu.memcpy
ClosedPublic