This is an archive of the discontinued LLVM Phabricator instance.

[mlir][gpu][mlir-cuda-runner] Refactor ConvertKernelFuncToCubin to be generic.
ClosedPublic

Authored by whchung on May 18 2020, 10:04 AM.

Details

Summary

Make ConvertKernelFuncToCubin pass to be generic:

  • Rename to ConvertKernelFuncToBlob.
  • Allow specifying triple, target chip, target features.
  • Initializing LLVM backend is supplied by a callback function.
  • Lowering process from MLIR module to LLVM module is via another callback.
  • Change mlir-cuda-runner to adopt the revised pass.
  • Add new tests for lowering to ROCm HSA code object (HSACO).
  • Tests for CUDA and ROCm are kept in separate directories.

Diff Detail

Event Timeline

whchung created this revision.May 18 2020, 10:04 AM
Herald added a project: Restricted Project. · View Herald Transcript
whchung updated this revision to Diff 264670.May 18 2020, 10:06 AM

Remove unused codes.

Cool, nice to see ROCm support coming!

mlir/lib/Conversion/GPUToROCm/ConvertKernelFuncToHsaco.cpp
50 ↗(On Diff #264670)

This is a lot of common code. Could we have a base class, e.g., `GpuKernelToLLVMGeneratedBlobPass', that factors out all the common logic but has extensions points for the LLVM target to use, etc.?

146 ↗(On Diff #264670)

ptx -> hsa?

whchung updated this revision to Diff 264930.May 19 2020, 8:43 AM
whchung marked an inline comment as done.

Address part of code review comments.

whchung marked an inline comment as done.May 19 2020, 8:53 AM
whchung added inline comments.
mlir/lib/Conversion/GPUToROCm/ConvertKernelFuncToHsaco.cpp
50 ↗(On Diff #264670)

@herhut I would argue factoring out common logic to a base class is a premature optimization at this point.

In D80167, logic for lowering gpu.launch to runtime API invocations are more or less the same between the two platforms so I think it's good to make the logic generic.

In this patch however, passes which get GPUModule to binary blobs essentially do following tasks:

  1. acquire GPUModule, and acquire a LLVM context lock.
  2. initialize LLVM backend.
  3. translate the GPUModule to llvm/nvvm/rocdl dialect.
  4. invoke LLVM backend, specify triple / datalayout / target features.
  5. invoke platform-specific binary creator (on NV platform is cubin generator, on AMD platform it's LLVM lld).
  6. convert the binary into a StringAttr.
  1. and 6) are platform-neutral, but 2) - 5) are highly platform-dependent and may change from time to time.

Therefore, I'd like to keep both paths separate for now. Once ROCm side of logic become more matured we can consider factoring out common logic in subsequent patches.

whchung updated this revision to Diff 264948.May 19 2020, 9:25 AM

Address warnings found by clang-tidy.

whchung updated this revision to Diff 264989.May 19 2020, 12:10 PM

Revise function naming to better depict formats used in the process.

whchung updated this revision to Diff 264993.May 19 2020, 12:22 PM

Fix build errors.

whchung updated this revision to Diff 265000.May 19 2020, 12:54 PM

Fix build errors again.

whchung updated this revision to Diff 265027.May 19 2020, 1:43 PM

clang-tidy.

herhut added inline comments.May 20 2020, 3:24 AM
mlir/lib/Conversion/GPUToROCm/ConvertKernelFuncToHsaco.cpp
50 ↗(On Diff #264670)

The code is pretty much identical, except for

  1. the target triple
  2. the need for initializing an LLVM Target because of the triple,
  3. the use of a different annotation name
  4. a different compiler callback from ISA to blob.
  1. The target could also be passed in as a parameter.
  1. I am not even sure we need a different name, but that could just be a parameter to the pass.
  1. is already passed in and goes from string -> vector. So there is nothing platform dependent there.

Remains 2, which we could also be a callback that is passed in, or the user has to make sure that the target is initialized. It does not need to be part of the pass itself.

The only future extension I can envision is a different configuration of the lowering pipeline during code generation. That could also be done via a callback that gets a LegacyPassManager as parameter. Similar to how the LLVM lowering can be configured.

Also, if we split this now, it will likely diverge, making later refactoring harder.

whchung marked an inline comment as done.May 20 2020, 10:28 AM
whchung added inline comments.
mlir/lib/Conversion/GPUToROCm/ConvertKernelFuncToHsaco.cpp
50 ↗(On Diff #264670)

@herhut Points taken. I'm working on finalizing the implementation of mlir-rocm-runner. Let me revise the patch after its done in a couple of days.

whchung updated this revision to Diff 265806.May 22 2020, 3:00 PM

Rewrite the patch to make the pass be generic between CUDA+NVPTX and ROCm+AMDGPU.

whchung retitled this revision from [mlir][gpu][rocdl] Introduce GPUToROCm conversion passes. to [mlir][gpu][mlir-cuda-runner] Refactor ConvertKernelFuncToCubin to be generic..May 22 2020, 3:01 PM
whchung edited the summary of this revision. (Show Details)

@herhut I've revised the patch so ConvertGpuKernelToCubin pass is now ConvertGpuKernelToBlob and works on both CUDA and ROCm platform. Could you help review it once again? Thanks.

In my downstream fork I also have the implementation of mlir-rocm-runner ready, pending on this patch.

herhut accepted this revision.May 25 2020, 7:07 AM

Thanks. I think this avoids a lot of code duplication. I would prefer to remove the initialize target callback and move that into the caller.

With that addressed I am happy to see this land.

mlir/include/mlir/Conversion/GPUCommon/GPUCommonPass.h
39

Do we need to actually do this in this pass or could we do this in the code that constructs the pipeline? If the calling context has to provide this callback, it could also just call the init itself.

40

Could this be a std::function<std::unique_ptr<llvm::Module>(Operation *)> instead, signalling failure by returning a nullptr?

55

Nit: to target object

mlir/lib/Conversion/GPUCommon/ConvertKernelFuncToBlob.cpp
93

lob -> blob

This revision is now accepted and ready to land.May 25 2020, 7:07 AM
whchung updated this revision to Diff 266631.May 27 2020, 12:42 PM

Revise the patch addressing code review comments.

whchung marked 4 inline comments as done.May 27 2020, 12:42 PM
whchung updated this revision to Diff 266649.May 27 2020, 1:34 PM

Remove obsolete comment.

rriddle added inline comments.May 27 2020, 1:50 PM
mlir/include/mlir/Conversion/GPUCommon/GPUCommonPass.h
12

Is this header really necessary?

whchung marked 2 inline comments as done.May 27 2020, 2:09 PM
whchung added inline comments.
mlir/include/mlir/Conversion/GPUCommon/GPUCommonPass.h
12

@rriddle it's necessary so llvm::Module is visible. Forward declaration is not possible because sizeof(llvm::Module) would be used in one of the unit test.

whchung updated this revision to Diff 266676.May 27 2020, 2:35 PM
whchung marked an inline comment as done.

Remove unused headers.

herhut accepted this revision.May 28 2020, 1:50 AM

Thanks!

This revision was automatically updated to reflect the committed changes.
mlir/lib/Conversion/GPUToCUDA/CMakeLists.txt