Fold away gpu.memcpy op when only uses of dest are
the memcpy op in question, its allocation and deallocation
ops.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp | ||
---|---|---|
1077 | Nit: canoot -> cannot | |
1084 | Nit: othet -> other | |
1096 | Shouldn't this rather be a separate canonicalizer that removes %t = gpu.wait async + gpu.wait [%t] pairs? | |
mlir/test/Dialect/GPU/canonicalize.mlir | ||
14 | Wouldn't this more natually be %4, and the canonicalizer should rewriter.replaceOp(op, asyncDependencies[0])? |
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp | ||
---|---|---|
1082 | Strictly speaking it should also check that it does not have a read effect I think (it is unlikely but an operation could be "print_and_free(%mem)". Also isn't there an API to check the effect per operand? What is an operation has two operands but frees one of them? (One easy way out would be to match the GpuDealloc op specifically intead, or check that this op has a single operand) Also, in general this transformation seems like "dead store elimination", isn't it something that can be implementation generically (dialect independent, purely effect based)? |
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp | ||
---|---|---|
1082 |
Reg. this last comment: this issue came up in the past too (one of my revisions where we had a discussion). While there is a general pattern here and semblance of a reuse, a dead store elimination pass can't by itself be a substitute for a folding hook here because:
Moreover, in this case, there is a connection to the wait op and aysnc token here; so I think we do need the folding hook. | |
1082 | +1 to what Mehdi says on associating the effect to the operand. But I think there is API missing here and one would need to get the value associated with each Free effect to see if it is dest. This check is currently conservative. |
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp | ||
---|---|---|
1082 |
That's right. The way it is, this pattern's check is incorrect. It needs to check that Free is the only effect that this op can have on dest. However, the pattern is checking if the op has a Free effect on some Value. |
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp | ||
---|---|---|
1085 | You don't need mlir:: - you already have a using for that. | |
1312 | Nit: Fold -> Erase? | |
1318 | Typo here. | |
1325 | does not ... a single use | |
1325 | ops or single op that you are handling here? | |
1340 | eraseOp(waitOp) | |
mlir/test/Dialect/GPU/canonicalize.mlir | ||
24 | Missing check that the gpu.wait has been folded away. |
I think the gpu.wait canonicalizer could be cleaning up more cases:
- %unused gpu.wait async ...: eraseOp(op)
- gpu.wait []: eraseOp(op)
- %t1 = gpu.wait async [%t0]: replaceOp(op, t0)
- %t = gpu.wait async ... + gpu.wait {async} [%t, ...]: drop %t from async dependencies.
Would this fit logically in this revision or a separate revision for gpu.wait canonicalizer? This revision is meant to erase away trivial gpu.memcpy and ancillary stuff. A full-fledged gpu.wait folder/canonicalizer should ideally go into a separate commit.
Sorry, I should have been more clear, none of my comments were meant to gate the revision.
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp | ||
---|---|---|
1077 | Why can the memcpy not be removed if it's a block argument? | |
1080 | Would 'OnVal' be better than 'OnDest'? | |
1101 | Should this also handle the common case of a tokens threaded through all gpu ops? if (op.asyncDependencies().size() > 1 || op.asyncDependencies().empty() == op.asyncToken()) return failure() rewriter.replaceOp(op, op.asyncDependencies()); This would also take care of not updating the op but still returning success(). |
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp | ||
---|---|---|
1100–1103 | Really just an idea, would this be easier to read? |
I am sorry but we have clear evidence this miscompiles some tensorflow gpu tests, so I am applying a revert. I will check with them to provide a reproduce.
Can you please provide the test case for which incorrect IR is generated?
I will provide the fix. Thanks in advance.
Folks'll try getting self-contained repro, but in interim test case that fails with this but passes without is bazel test --config=cuda --compilation_mode=opt --test_env=XLA_FLAGS=--xla_gpu_bef_executable --copt=-Wno-error //tensorflow/compiler/xla/tests:concat_test_gpu in TF repo. You'd need to probably follow the instructions for local llvm repo in tensorflow/compiler/mlir for simplicity in repro
You also need --//tensorflow/compiler/xla/service/gpu:enable_xlir=true
bazel test --config=cuda --compilation_mode=opt --//tensorflow/compiler/xla/service/gpu:enable_xlir=true --test_env=XLA_FLAGS=--xla_gpu_bef_executable --copt=-Wno-error //tensorflow/compiler/xla/tests:concat_test_gpu
Here is a repro:
func @copy(%arg0: memref<1xi8>, %arg1: memref<i1>) { %0 = arith.constant 0 : index %1 = memref.view %arg0[%0][] : memref<1xi8> to memref<i1> gpu.memcpy %1, %arg1 : memref<i1>, memref<i1> func.return }
mlir-opt --canonicalize removes the memcpy when it really shouldn't.
This is a big miss. The pattern will have to be made very conservative or instead be implemented as part of a copy-removal pass where aliasing information can be used. However, strictly speaking, the aliasing check available with AliasAnalysis is often just O(use-def) chain -- so one should be able to use something like bool doMemRefsAlias(Value memRefA, Value memRefB) from a canonicalization pattern or a folding hook. There isn't really any caching needed or caching happening there.
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp | ||
---|---|---|
1077 | This is the condition that's completely arbitrary. You can change this check to iterate through all uses and check if alloc/dealloc and this memcpy are the only ops that use this op? A more powerful removal has to be done anyway in a separate pass. |
Nit: canoot -> cannot