This is an archive of the discontinued LLVM Phabricator instance.

[mlir][sparse][gpu] generate proper memcpy in/out host and device
ClosedPublic

Authored by aartbik on Apr 18 2023, 8:57 PM.

Details

Summary

Host registration is a convenient way to get CUDA kernels
running, but it may be slow and does not work for all buffers
(such as global constants). This revision instead uses proper
alloc/copy/dealloc chains for buffers, issued asynchronously
to increase overlap. The host registration mechanism is
kept under a flag for the output, just for experimentation
purposes while this project ramps up.
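
To make the summary concrete, the following is a hand-written sketch of what an asynchronous alloc/copy/dealloc chain can look like in the MLIR gpu dialect. It is not the actual output of this revision's pass. The names @kernels, @scale, @run, %host and %dev, the block size of 256, and the toy kernel body are all assumptions chosen for illustration.

```mlir
// Illustrative sketch only; all symbol and value names are hypothetical.
module attributes {gpu.container_module} {
  gpu.module @kernels {
    // Toy kernel: doubles every element of the device buffer.
    gpu.func @scale(%buf: memref<?xf64>, %n: index) kernel {
      %bid = gpu.block_id x
      %bdim = gpu.block_dim x
      %tid = gpu.thread_id x
      %base = arith.muli %bid, %bdim : index
      %i = arith.addi %base, %tid : index
      %inb = arith.cmpi ult, %i, %n : index
      scf.if %inb {
        %two = arith.constant 2.0 : f64
        %v = memref.load %buf[%i] : memref<?xf64>
        %w = arith.mulf %v, %two : f64
        memref.store %w, %buf[%i] : memref<?xf64>
      }
      gpu.return
    }
  }

  func.func @run(%host: memref<?xf64>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c256 = arith.constant 256 : index
    %n = memref.dim %host, %c0 : memref<?xf64>
    %nblocks = arith.ceildivui %n, %c256 : index

    // Each op consumes the token of its predecessor, forming one async chain.
    %t0 = gpu.wait async
    %dev, %t1 = gpu.alloc async [%t0] (%n) : memref<?xf64>
    // Copy in: host -> device (destination comes first).
    %t2 = gpu.memcpy async [%t1] %dev, %host : memref<?xf64>, memref<?xf64>
    %t3 = gpu.launch_func async [%t2] @kernels::@scale
        blocks in (%nblocks, %c1, %c1) threads in (%c256, %c1, %c1)
        args(%dev : memref<?xf64>, %n : index)
    // Copy out: device -> host, then release the device buffer.
    %t4 = gpu.memcpy async [%t3] %host, %dev : memref<?xf64>, memref<?xf64>
    %t5 = gpu.dealloc async [%t4] %dev : memref<?xf64>
    // Block the host until the whole chain has completed.
    gpu.wait [%t5]
    return
  }
}
```

By contrast, the host registration approach would cast the host buffer to an unranked memref, apply gpu.host_register to it, and pass the host buffer to the kernel directly, with no explicit copies.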

Event Timeline

aartbik created this revision. Apr 18 2023, 8:57 PM
Herald added a project: Restricted Project. Apr 18 2023, 8:57 PM
aartbik requested review of this revision. Apr 18 2023, 8:57 PM
Peiming added inline comments. Apr 20 2023, 10:20 AM
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
132–133

This reads weird... maybe remove the first "that"?

183

When will this condition be true? It seems it is false all the time (at least for now)?

aartbik updated this revision to Diff 515468. Apr 20 2023, 1:49 PM
aartbik marked 2 inline comments as done.

addressed comments, added TODO

aartbik added inline comments. Apr 20 2023, 1:50 PM
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
132–133

fixed the comment

183

I agree this was a bit of hidden "commented out" code. I added a TODO to investigate this and/or couple it with a compiler option.

(I really want to keep the code around for fast experimentation, also with our intern starting soon ;-)

332

note that the TODO is here

Peiming accepted this revision. Apr 20 2023, 2:14 PM
This revision is now accepted and ready to land. Apr 20 2023, 2:14 PM