Page MenuHomePhabricator

[HIP] Fix rocm detection
ClosedPublic

Authored by yaxunl on Jun 30 2020, 9:19 PM.

Details

Summary

Do not detect device library by default in rocm detector. Only detect device library in HIP toolchain.

Detect rocm path by version file in host toolchains.

Also added detecting rocm version and printing rocm installation path and version with -v.

Fixed include path and device library detection for ROCm 3.5.

Diff Detail

Event Timeline

yaxunl created this revision.Jun 30 2020, 9:19 PM
arsenm added inline comments.Jul 1 2020, 6:48 AM
clang/lib/Driver/ToolChains/AMDGPU.cpp
196

I don't think the detection should fail if this is missing

yaxunl marked an inline comment as done.Jul 1 2020, 9:36 AM
yaxunl added inline comments.
clang/lib/Driver/ToolChains/AMDGPU.cpp
196

why? this file is unique to HIP installation. If it is missing, the installation is broken.

arsenm added inline comments.Jul 1 2020, 5:41 PM
clang/lib/Driver/ToolChains/AMDGPU.cpp
196

Because I should be able to use this without a complete hip installation. Without a specified version, it should assume the most modern layout. This will for example break pointing --rocm-path at the device library build directory for library tests

arsenm added inline comments.Jul 1 2020, 5:44 PM
clang/lib/Driver/ToolChains/AMDGPU.cpp
196

I also don't see what value checking the version really provides; it may be informative to print it, but I don't think it's useful to derive information from it

yaxunl marked an inline comment as done.Jul 2 2020, 4:42 AM
yaxunl added inline comments.
clang/lib/Driver/ToolChains/AMDGPU.cpp
196

what is the directory structure of your most modern layout?

arsenm added inline comments.Jul 6 2020, 1:49 PM
clang/lib/Driver/ToolChains/AMDGPU.cpp
196

/opt/rocm/amdgcn/bitcode/foo.bc

The plan is to remove this and rely on symlinks in the resource directory though

tra added inline comments.Jul 6 2020, 1:57 PM
clang/lib/Driver/ToolChains/AMDGPU.cpp
196

I also don't see what value checking the version really provides; it may be informative to print it, but I don't think it's useful to derive information from it

In CUDA it's used to detect which back-end features to enable (they depend on what's supported by ptxas supplied by that CUDA version). I don't think that would be relevant to AMDGPU as it does not need external dependencies to generate GPU code.

It may be useful for filesystem layout detection. E.g. where to find bitcode files and what names to expect. This part seems to be relevant for ROCm, but I'm not sure if the layout is closely tied to the version.

yaxunl marked 4 inline comments as done.Jul 6 2020, 2:00 PM
yaxunl added inline comments.
clang/lib/Driver/ToolChains/AMDGPU.cpp
196

We are required to support previous ROCm releases for certain time range. To do that we need to detect HIP version and enable/disable certain HIP features based on HIP version when necessary.

Therefore if we have a new directory structure for ROCm, that directory structure should contain a HIP version file so that we can detect HIP version.

yaxunl marked 4 inline comments as done.Jul 6 2020, 2:39 PM
yaxunl added inline comments.
clang/lib/Driver/ToolChains/AMDGPU.cpp
196

Currently clang includes some wrapper headers by default, which does not work with ROCm 3.5 since those device functions are defined in HIP headers of ROCm 3.5. To support ROCm 3.5, we have to disable including those wrapper headers. This is one example why we need detect HIP version.

196

I think we need to separate HIP runtime detection and device library detection and also separate hasValidHIPRuntime and hasValidDeviceLibrary. ROCm toolchain only need to make sure hasValidDeviceLibrary but HIP toolchain need both hasValidDeviceLibrary and hasValidHIPRuntime.

arsenm added inline comments.Jul 7 2020, 6:43 AM
clang/lib/Driver/ToolChains/AMDGPU.cpp
196

Regardless of whether there's a version file or if does anything, I think the absence of one implies do the most modern thing

tra added inline comments.Jul 7 2020, 10:18 AM
clang/lib/Driver/ToolChains/AMDGPU.cpp
196

"The most modern thing" is an ever moving target. If the failure modes keep changing it will create a source of new problems every time something changes. Probably not a big issue in this case, but with (eventually) wide enough user base there will be some.

I assume that AMD does not have much of legacy user base yet for compilations with clang, so defaulting to a recent version may be sensible (unlike CUDA where clang had to deal with a lot of existing CUDA users using old CUDA versions). That said, I'd recommend pinning it to a specific version, so the expectations remain stable. Then we can document the failure/fallback in a way users can deal with -- if no version file is found, clang assumes version x.y and expects to find files X, Y and Z in directory A, B, C. You can specify the locations manually with --something-something-X=A or specify the version with --hip-version=3.7.

Considering that clang is being used from various derived tools, you may not always have the luxury of having the whole ROCm installation around and need to be able to override the expectations via command line flags.

If we have a way to override the version, then it does not matter all that much which version we pick as a fallback. If the guess is wrong, we can always correct it with a flag.

arsenm added inline comments.Jul 7 2020, 11:33 AM
clang/lib/Driver/ToolChains/AMDGPU.cpp
196

Unlike CUDA we're not really chasing some 3rd party thing that periodically drops unpredictable changes. We know what the most modern thing is on tip of tree

tra added inline comments.Jul 7 2020, 11:58 AM
clang/lib/Driver/ToolChains/AMDGPU.cpp
196

Unlike CUDA we're not really chasing some 3rd party thing that periodically drops unpredictable changes. We know what the most modern thing is on tip of tree

We're in agreement on this part. If lack of the version file is non-fatal, I'm fine with whatever version AMD picks as the default.

I'm just pointing out that you'll likely need a way to keep compiler working even if the filesystem does not have everything you expect for the full end-to-end compilation, be it the version file or bitcode. Reasonable default is fine. Overrides can be implemented later, if necessary.

yaxunl updated this revision to Diff 277095.Jul 10 2020, 10:35 AM
yaxunl marked 2 inline comments as done.

Separate detecting ROCm path and device library path.

Assume 3.5 as default HIP version. If -rocm-path is specified, do not require version file to exist.

Added --hip-version to override detected HIP version.

Fixed default option for -std and -fhip-use-new-launch-api.

tra accepted this revision.Jul 10 2020, 11:43 AM

LGTM in principle. Few style comments.

clang/lib/Driver/ToolChains/AMDGPU.cpp
161

A comment about expected version format would be helpful here.

Perhaps it could be simplified by splitting the string on "." into an array in one call instead of doing it one element at a time.

244–257

It would be nice to collect the list of directories to check into an array and then loop over it -- the same way you already do for the InstallPath candidates. Makes it easier to add/change search logic.

E.g. in the end we would get something like this:

for (path: {InstallPath, make_path(InstallPath, "lib"), make_path(InstallPath, "lib", "bitcode"), ... }) {
   if (CheckDeviceLib(path)) {
      // record path as the correct location
     return;
   }
}
248–249

Creation of the path from base+suffixes seems to be a common pattern to be extracted into a lambda. I'm surprised it's not provided by LLVM itself.

248–250

CheckDeviceLib should probably take the path to check as an argument and so should scanLibDevicePath.

299

Nit: IsRocm35 is a bit misleading as it would be true for versions other than 3.5.
Rename it to something closer matching the intent? IsRocm35OrOlder ot, perhaps, hipUsesRuntimeWrapper.

333

You could use CC1Args.append({"foo", "bar"}) here.

clang/lib/Driver/ToolChains/HIP.cpp
283

Nit: Path was fine.

This revision is now accepted and ready to land.Jul 10 2020, 11:43 AM
yaxunl marked 11 inline comments as done.Jul 10 2020, 3:41 PM

thanks. will fix when commit

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptJul 10 2020, 8:21 PM

This is causing a test case to fail on mac http://45.33.8.238/mac/17048/step_7.txt

This breaks check-clang on macOS: http://45.33.8.238/mac/17053/step_7.txt

The output contains

5: clang: warning: argument unused during compilation: '--rocm-path=/Users/thakis/src/llvm-project/clang/test/Driver/Inputs/rocm' [-Wunused-command-line-argument]

which if true would explain the failure.

That bot uses the GN build, but I verified that the test fails the same way in the CMake build.

Please take a look, and if it takes a while to fix, please revert while you investigate.

clang/test/Driver/Inputs/rocm/bin/.hipVersion
2

Is this true? I don't see any cmake code generating this. Did I miss it?

…oh, njames said that already 5h ago :) I guess I'll wait another hour or two, and then I'll revert.

This breaks check-clang on macOS: http://45.33.8.238/mac/17053/step_7.txt

The output contains

5: clang: warning: argument unused during compilation: '--rocm-path=/Users/thakis/src/llvm-project/clang/test/Driver/Inputs/rocm' [-Wunused-command-line-argument]

which if true would explain the failure.

That bot uses the GN build, but I verified that the test fails the same way in the CMake build.

Please take a look, and if it takes a while to fix, please revert while you investigate.

should be fixed by 5d2c3e031a6861b3e95673d0e238c09938dd9c0d

Great, thanks :)