This is an archive of the discontinued LLVM Phabricator instance.

[mlir] Add gpu async integration test.
ClosedPublic

Authored by csigg on Jan 11 2021, 8:18 AM.

Diff Detail

Event Timeline

csigg created this revision.Jan 11 2021, 8:18 AM
csigg requested review of this revision.Jan 11 2021, 8:18 AM

Just curious, are these tests candidates for mlir/integration_test rather than mlir/test?

csigg updated this revision to Diff 315933.Jan 11 2021, 2:07 PM

Rebase.

csigg updated this revision to Diff 316156.Jan 12 2021, 10:18 AM

Trying to upload again with D94399 as base.

Just curious, are these tests candidates for mlir/integration_test rather than mlir/test?

The cuda-runner and rocm-runner tests have traditionally been in the test directory.
But yeah, they might fit better in the integration_test directory.
I would move them in a separate change though.
That said, is there any specific reason we have two test directories?

I would move them in a separate change though.
That said, is there any specific reason we have two test directories?

Of course, I was not suggesting to make that part of this change.

The test directory is for fast running, light weight unit tests. The integration test directory is for (potentially) slightly longer running, heavier weight tests. Invoking full runners (like we do for mlir-cpu-runner) always seem good candidates. The latter is run less often to save resources.

It is of course, up to the test "owners" ultimately to make the call

herhut added inline comments.Jan 15 2021, 7:20 AM
mlir/lib/Conversion/GPUCommon/ConvertLaunchFuncToRuntimeCalls.cpp
300

Long term we have to find a better way to do these things. For now this is fine (just writing this here for documentation).

mlir/test/mlir-cuda-runner/async.mlir
15

Is this needed?

60

Why not explicitly copy back, too?

mlir/tools/mlir-cuda-runner/mlir-cuda-runner.cpp
115

Just wondering whether all of this could also become a textual pass specification. Then the cuda runner would essentially only contain of a special pass that wraps the creation of cubin blobs (the createConvertGPUKernelToBlobPass) and otherwise uses a specified pipeline. Then we would not need to bake in the async lowering here.

csigg added inline comments.Jan 20 2021, 4:42 AM
mlir/test/mlir-cuda-runner/async.mlir
15

It is not required for correctness if we copy the result as you suggested below.

But without page-locking, the memcpy becomes synchronous and the second and third memcpy would schedule (and execute) after the first has completed even without synchronization.

60

I had that initially, but I wanted to keep the number of GPU operations minimal.

mlir/tools/mlir-cuda-runner/mlir-cuda-runner.cpp
115

Good idea, I will try that.

csigg updated this revision to Diff 317844.Jan 20 2021, 4:59 AM

Rebase.

herhut accepted this revision.Jan 20 2021, 8:51 AM

Thanks for the explanations!

This revision is now accepted and ready to land.Jan 20 2021, 8:51 AM
csigg updated this revision to Diff 321169.Feb 3 2021, 11:11 AM

Rebase.

This revision was automatically updated to reflect the committed changes.