
[MLIR][GPU] Add canonicalizer for gpu.memcpy
ClosedPublic

Authored by arnab-oss on Mar 9 2022, 3:53 AM.

Details

Summary

Fold away the gpu.memcpy op when the only uses of its
destination are the memcpy op in question and the destination's
allocation and deallocation ops.
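
As an illustrative sketch (the names and shapes are assumptions, not the revision's actual test case), the fold targets IR where the destination is written by the copy but never read before being freed:

func.func @dead_copy(%size : index, %src : memref<?xf32>) {
  %t0 = gpu.wait async
  %dst, %t1 = gpu.alloc async [%t0] (%size) : memref<?xf32>
  // %dst is only used by the alloc, this memcpy, and the dealloc,
  // so the copy (and with it the alloc/dealloc) is dead.
  %t2 = gpu.memcpy async [%t1] %dst, %src : memref<?xf32>, memref<?xf32>
  %t3 = gpu.dealloc async [%t2] %dst : memref<?xf32>
  gpu.wait [%t3]
  return
}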

Diff Detail

Event Timeline

arnab-oss created this revision.Mar 9 2022, 3:53 AM
Herald added a project: Restricted Project. · View Herald Transcript
arnab-oss requested review of this revision.Mar 9 2022, 3:53 AM
arnab-oss updated this revision to Diff 414057.Mar 9 2022, 3:59 AM

Set the hasVerifier field in gpu.memcpy, which had been removed earlier by mistake.

csigg added inline comments.Mar 9 2022, 4:12 AM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1077

Nit: canoot -> cannot

1084

Nit: othet -> other

1096

Shouldn't this rather be a separate canonicalizer that removes %t = gpu.wait async + gpu.wait [%t] pairs?
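
For illustration (an assumed IR shape, not code from the patch), such a redundant pair looks like:

%t = gpu.wait async   // produce a fresh token with no dependencies
gpu.wait [%t]         // immediately consume it; the pair is a no-op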

mlir/test/Dialect/GPU/canonicalize.mlir
14

Wouldn't this more naturally be %4, and the canonicalizer should rewriter.replaceOp(op, asyncDependencies[0])?

arnab-oss updated this revision to Diff 414072.EditedMar 9 2022, 5:11 AM
arnab-oss marked 2 inline comments as done.

Addressed review comments by @csigg.

mehdi_amini added inline comments.Mar 9 2022, 5:47 AM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1082

Strictly speaking, it should also check that the op does not have a read effect, I think (it is unlikely, but an operation could be "print_and_free(%mem)").

Also, isn't there an API to check the effect per operand? What if an operation has two operands but frees only one of them?

(One easy way out would be to match the GpuDealloc op specifically instead, or to check that this op has a single operand.)

Also, in general this transformation seems like "dead store elimination"; isn't it something that can be implemented generically (dialect independent, purely effect based)?

bondhugula requested changes to this revision.Mar 9 2022, 7:21 AM
bondhugula added inline comments.
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1082

> Also, in general this transformation seems like "dead store elimination"; isn't it something that can be implemented generically (dialect independent, purely effect based)?

Reg. this last comment: this issue came up in the past too (on one of my revisions, where we had a discussion). While there is a general pattern here and some semblance of reuse, a dead store elimination pass can't by itself substitute for a folding hook here because:

  1. a separate DSE pass won't lead to early simplification the way canonicalization and folding hooks do, and
  2. one will run into phase-ordering issues between other foldings/canonicalizations and such a DSE pass; so it's ideal to keep such O(1) and O(use-def chain length) checks in a folding hook.

Moreover, in this case there is a connection to the wait op and the async token, so I think we do need the folding hook.

1082

+1 to what Mehdi says on associating the effect to the operand. But I think there is API missing here and one would need to get the value associated with each Free effect to see if it is dest. This check is currently conservative.

This revision now requires changes to proceed.Mar 9 2022, 7:21 AM
bondhugula added inline comments.Mar 9 2022, 7:30 AM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1082

> Strictly speaking, it should also check that the op does not have a read effect, I think (it is unlikely, but an operation could be "print_and_free(%mem)").

That's right. As written, this pattern's check is incorrect: it needs to check that Free is the only effect this op can have on dest, but the pattern is only checking whether the op has a Free effect on some Value.
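
A sketch of the corrected check (this is illustrative, not the patch's actual code; it uses the standard MemoryEffectOpInterface, whose effect instances carry the value they apply to):

// Return true if every effect `user` has on `dest` is a Free effect.
static bool onlyFreesValue(Operation *user, Value dest) {
  auto iface = dyn_cast<MemoryEffectOpInterface>(user);
  if (!iface)
    return false; // unknown effects: be conservative
  SmallVector<MemoryEffects::EffectInstance> effects;
  iface.getEffects(effects);
  for (auto &effect : effects)
    if (effect.getValue() == dest &&
        !isa<MemoryEffects::Free>(effect.getEffect()))
      return false; // a non-Free effect (e.g. Read) on dest
  return true;
}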

arnab-oss updated this revision to Diff 414353.Mar 10 2022, 5:44 AM
arnab-oss marked an inline comment as done.

Addressed comments by @bondhugula

bondhugula added inline comments.Mar 10 2022, 8:39 AM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1085

You don't need mlir:: - you already have a using for that.

1401

Nit: Fold -> Erase?

1407

Typo here.

1414

does not ... a single use

1414

ops or single op that you are handling here?

1429

eraseOp(waitOp)

mlir/test/Dialect/GPU/canonicalize.mlir
24

Missing check that the gpu.wait has been folded away.

arnab-oss marked 6 inline comments as done.

Addressed comments by @bondhugula

bondhugula accepted this revision.Mar 11 2022, 3:41 AM

LGTM - thanks.

mlir/test/Dialect/GPU/canonicalize.mlir
29

Nit: Punctuate correctly.

This revision is now accepted and ready to land.Mar 11 2022, 3:41 AM
arnab-oss updated this revision to Diff 414817.Mar 12 2022, 3:01 AM

Addressed comments by @bondhugula

bondhugula accepted this revision.Mar 12 2022, 6:56 AM

I think the gpu.wait canonicalizer could be cleaning up more cases:

  • %unused = gpu.wait async ...: eraseOp(op)
  • gpu.wait []: eraseOp(op)
  • %t1 = gpu.wait async [%t0]: replaceOp(op, t0)
  • %t = gpu.wait async ... + gpu.wait {async} [%t, ...]: drop %t from the async dependencies.
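
For instance, the third case rewrites IR of this shape (an illustrative sketch):

%t1 = gpu.wait async [%t0]
gpu.wait [%t1]

into uses of %t0 directly, since an async wait on a single token adds no ordering:

gpu.wait [%t0]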

> I think the gpu.wait canonicalizer could be cleaning up more cases:
>
>   • %unused = gpu.wait async ...: eraseOp(op)
>   • gpu.wait []: eraseOp(op)
>   • %t1 = gpu.wait async [%t0]: replaceOp(op, t0)
>   • %t = gpu.wait async ... + gpu.wait {async} [%t, ...]: drop %t from the async dependencies.

Would this fit logically in this revision, or in a separate revision for the gpu.wait canonicalizer? This revision is meant to erase away trivial gpu.memcpy ops and ancillary stuff. A full-fledged gpu.wait folder/canonicalizer should ideally go into a separate commit.

csigg added a comment.Mar 15 2022, 2:50 AM

> Would this fit logically in this revision, or in a separate revision for the gpu.wait canonicalizer? This revision is meant to erase away trivial gpu.memcpy ops and ancillary stuff. A full-fledged gpu.wait folder/canonicalizer should ideally go into a separate commit.

Sorry, I should have been clearer; none of my comments were meant to gate the revision.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1077

Why can the memcpy not be removed if it's a block argument?

1080

Would 'OnVal' be better than 'OnDest'?

1101

Should this also handle the common case of a token threaded through all gpu ops?

if (op.asyncDependencies().size() > 1 ||
    op.asyncDependencies().empty() == static_cast<bool>(op.asyncToken()))
  return failure();
rewriter.replaceOp(op, op.asyncDependencies());

This would also take care of not updating the op but still returning success().

arnab-oss updated this revision to Diff 415369.Mar 15 2022, 3:20 AM
arnab-oss marked 9 inline comments as done.

Addressed comments by @csigg.

csigg added inline comments.Mar 16 2022, 11:49 PM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1100–1103

Really just an idea, would this be easier to read?

arnab-oss updated this revision to Diff 422822.Apr 14 2022, 5:10 AM
arnab-oss marked 2 inline comments as done.

Addressed review comments and rebased on latest master.

arnab-oss updated this revision to Diff 423352.Apr 18 2022, 4:05 AM

Fixed build issues.

arnab-oss updated this revision to Diff 423583.Apr 19 2022, 3:46 AM

Removed trailing white spaces.

This revision was automatically updated to reflect the committed changes.

I am sorry, but we have clear evidence this miscompiles some TensorFlow GPU tests, so I am applying a revert. I will check with them to provide a reproducer.

Can you please provide the test case for which incorrect IR is generated?
I will provide the fix. Thanks in advance.

> Can you please provide the test case for which incorrect IR is generated? I will provide the fix. Thanks in advance.

Folks will try to get a self-contained repro, but in the interim, a test case that fails with this change and passes without it is bazel test --config=cuda --compilation_mode=opt --test_env=XLA_FLAGS=--xla_gpu_bef_executable --copt=-Wno-error //tensorflow/compiler/xla/tests:concat_test_gpu in the TF repo. To simplify the repro, you'd probably need to follow the instructions for a local llvm repo in tensorflow/compiler/mlir.

You also need --//tensorflow/compiler/xla/service/gpu:enable_xlir=true
bazel test --config=cuda --compilation_mode=opt --//tensorflow/compiler/xla/service/gpu:enable_xlir=true --test_env=XLA_FLAGS=--xla_gpu_bef_executable --copt=-Wno-error //tensorflow/compiler/xla/tests:concat_test_gpu

csigg added a comment.Apr 22 2022, 1:03 AM

Here is a repro:

func @copy(%arg0: memref<1xi8>, %arg1: memref<i1>) {
  %0 = arith.constant 0 : index
  %1 = memref.view %arg0[%0][] : memref<1xi8> to memref<i1>
  gpu.memcpy  %1, %arg1 : memref<i1>, memref<i1>
  func.return
}

mlir-opt --canonicalize removes the memcpy when it really shouldn't.

> Here is a repro:
>
> func @copy(%arg0: memref<1xi8>, %arg1: memref<i1>) {
>   %0 = arith.constant 0 : index
>   %1 = memref.view %arg0[%0][] : memref<1xi8> to memref<i1>
>   gpu.memcpy  %1, %arg1 : memref<i1>, memref<i1>
>   func.return
> }
>
> mlir-opt --canonicalize removes the memcpy when it really shouldn't.

This is a big miss. The pattern will have to be made very conservative, or instead be implemented as part of a copy-removal pass where aliasing information can be used. However, strictly speaking, the aliasing check available with AliasAnalysis is often just O(use-def chain) -- so one should be able to use something like bool doMemRefsAlias(Value memRefA, Value memRefB) from a canonicalization pattern or a folding hook. There isn't really any caching needed or happening there.
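
For the repro above, such an O(use-def chain) check could walk view-like ops back to the root buffer. A minimal sketch (ViewLikeOpInterface and getViewSource() are real MLIR API; the helper itself is hypothetical, and a real implementation must stay conservative when the roots differ, since e.g. two distinct function arguments may still alias):

// Hypothetical helper in the spirit of the doMemRefsAlias suggestion.
// Returns true only when both values trace to the same root buffer.
static bool haveSameUnderlyingBuffer(Value a, Value b) {
  // Walk view-like ops (memref.view, memref.subview, ...) back to the root.
  while (auto view = a.getDefiningOp<ViewLikeOpInterface>())
    a = view.getViewSource();
  while (auto view = b.getDefiningOp<ViewLikeOpInterface>())
    b = view.getViewSource();
  return a == b;
}

In the repro, %1 traces through memref.view back to %arg0, so the destination visibly escapes beyond its alloc/dealloc/memcpy uses and the fold must bail.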

bondhugula added inline comments.Apr 22 2022, 1:42 AM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1077

This is the condition that's completely arbitrary. You could change this check to iterate through all uses and verify that the alloc, the dealloc, and this memcpy are the only ops that use this value. A more powerful removal has to be done in a separate pass anyway.