The cuda-runner and rocm-runner tests have traditionally been in the test directory.
But yeah, they might fit better in the integration_test directory.
I would move them in a separate change though.
That said, is there any specific reason we have two test directories?
Of course, I was not suggesting to make that part of this change.
The test directory is for fast running, light weight unit tests. The integration test directory is for (potentially) slightly longer running, heavier weight tests. Invoking full runners (like we do for mlir-cpu-runner) always seem good candidates. The latter is run less often to save resources.
It is of course, up to the test "owners" ultimately to make the call
Long term we have to find a better way to do these things. For now this is fine (just writing this here for documentation).
Is this needed?
Why not explicitly copy back, too?
Just wondering whether all of this could also become a textual pass specification. Then the cuda runner would essentially only contain of a special pass that wraps the creation of cubin blobs (the createConvertGPUKernelToBlobPass) and otherwise uses a specified pipeline. Then we would not need to bake in the async lowering here.
It is not required for correctness if we copy the result as you suggested below.
But without page-locking, the memcpy becomes synchronous and the second and third memcpy would schedule (and execute) after the first has completed even without synchronization.
I had that initially, but I wanted to keep the number of GPU operations minimal.
Good idea, I will try that.