Spack is a package management tool extensively used by HPC community.
As ROCm packages are built by Spack by HPC community, we need to teach
clang driver to detect ROCm installation built by Spack.
Details
- Reviewers
tra - Commits
- rG34d1a5c7b18f: [HIP] Support Spack packages
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
clang/lib/Driver/ToolChains/AMDGPU.cpp | ||
---|---|---|
25–31 | void function can not return true, so the comment needs updating. Also, Using Path as both an input and an output is odd. We should probably return the full path or an empty string and call the function findSPACKPackage. | |
27–28 | How are users expected to handle cases when they do have multiple rocm versions installed with spack (I assume that having multiple packages *is* possible. E.g. package-4.0, package-3.7 should be able to co-exist). They may have legitimate reasons to have more than one rocm version installed *and* they may need to be able to build with a particular one. Can users use --hip-path=specific/hip/version/in-spack ? | |
clang/lib/Driver/ToolChains/ROCm.h | ||
63 | Nit: using SmallString here and everywhere else in this patch does not buy us anything. std::string would be fine. |
clang/lib/Driver/ToolChains/AMDGPU.cpp | ||
---|---|---|
27–28 | So, --hip-path is expected to point to the HIP runtime install directory and could presumably be used to point anywhere, including an arbitrary spack package. OK, this part works. | |
32 | Based on how the function is used, perhaps we should make it | |
383–398 | So, normally --hip-path would point to the HIP package installation. I see several potential issues with that.
I'd be OK to derive spack-based paths if user didn't provide the paths explicitly, but I don't think we should add any magic to explicit overrides. |
clang/lib/Driver/ToolChains/AMDGPU.cpp | ||
---|---|---|
32 | Will use PackageName and RocmVersion as arguments. However, we use llvm::sys::path to manipulate paths and llvm::sys::path is using llvm::SmallString. If we return std::string or use StringRef for path argument, we have to convert them back and force. | |
383–398 | Will make SPACKReleaseStr a member of Candidate and allow mixed Spack candidates and non-Spack candidates. If clang path contains llvm-amdgpu but is not a Spack package, clang will still look for non-Spack ROCm candidates. Also added parent of real path of clang as ROCm candidate. If clang is invoked through a sym link which is not under the real ROCm directory, clang will still be able to detect it. Also added --print-rocm-search-dirs for testing this. When --hip-path/--rocm-path is set, the specified ROCm/HIP path is used. Spack detection is disabled, since the candidate specified by --rocm-path or --hip-path has empty SPACKReleaseStr. |
clang/include/clang/Driver/Options.td | ||
---|---|---|
3566–3567 | Perhaps the existing print-search-dirs option would do the job? | |
clang/lib/Driver/ToolChains/AMDGPU.cpp | ||
32 | It should still work. The use cases below look like this: auto SPACKPath = findSPACKPackage(D, Path, "rocm-device-libs", Candidate.SPACKReleaseStr); if (!SPACKPath.empty()) Path = SPACKPath; } for (auto SubDir : SubDirs) llvm::sys::path::append(Path, SubDir); It's the Pack that needs to be a SmallString. Assigning an std::string to it should work -- it has operator=(StringRef) and StringRef can be created from a std::string. I'll leave it up to you. If std::string is inconvenient to use, the SmallString is fine. | |
186–188 | We call getInstallationPathCandidates() more more than one place, so we may end up with the same path added multiple times. I think population of the candidate list should be separated from reading it back, and should be done only once. if (!isCandidateListReady) initCandidateList(); // populates ROCmSearchDirs, sets isCandidateListReady return ROCmSearchDirs; | |
213 | What would be a consequence of mistakingly detecting a non-SPACK installation as SPACK? I'm still not quite comfortable with using a path name, which can be arbitrarily named by a user, as a way to detect particular installation method. It's probably OK to rely on the path naming scheme, but there are still corner cases it can't handle. /pkg/llvm-amdgpu-hash -> /pkg/some-other-real-dir-name /usr/local/bin/clang -> /pkg/llvm-amdgpu-hash/bin/clang If I compile with /usr/local/bin/clang neither the realpath nor the unresolved symlink will have 'llvm-amdgpu'. Granted, it's a somewhat contrived setup. Using the name as the hint would work for the standard setups, so we can live with that, provided that we can always tell compiler where to find the files it needs, if the user has an unusual install. | |
383–398 |
Maybe we could just enable that with -v. We already are printing various locations for GCC installations, etc. | |
clang/lib/Driver/ToolChains/ROCm.h | ||
48 | I'd add a bool isSPACK() method to consolidate the SPACK/non-SPACK determination into a single location. |
clang/include/clang/Driver/Options.td | ||
---|---|---|
3566–3567 | The concern is similar to https://reviews.llvm.org/D79210: There may be tools parsing print-search-dirs output, changing it may break those tools. | |
clang/lib/Driver/ToolChains/AMDGPU.cpp | ||
186–188 | We check whether ROCmSearchDirs is populated on line 166 and only populate it if not. | |
213 | llvm-amdgpu Spack installation does not add extra files. It uses normal llvm cmake files to build and differs from a non-Spack llvm build only by name. However, its name follows a very unusual pattern llvm-amdgpu-<release>-<hash>, where <hash> is a string of 32 alphanumerical characters. I think we could detect it by checking the length of <hash>. | |
383–398 | The situation is somehow similar to https://reviews.llvm.org/D79210 Another reason not to print this with -v is that normally users may not need this information unless clang cannot find ROCm and users want to know why. | |
clang/lib/Driver/ToolChains/ROCm.h | ||
48 | done |
LGTM modulo couple of nits.
clang/include/clang/Driver/Options.td | ||
---|---|---|
3566–3567 | OK. New option it is then, | |
clang/lib/Driver/ToolChains/AMDGPU.cpp | ||
163–164 | I'm not quite sure which part of the code is related to the do not do strict check. Perhaps move the comment closer to the code it describes and add the details what we'd do otherwise, so one can understand what's missing. Also, adding a comment describing what the function does (returns candidate list, populating it once, if necessary) would be great. | |
186–188 | Agreed, the code works. That said, the fact that I didn't notice that is a point in case -- it's not obvious. | |
213 | I think we're getting beyond the point of diminishing returns here. If we don't have a way to identify an SPACK package directly, detecting it by the name will never be 100% reliable. The current code will be correct most of the time. For those few cases where it may get it wrong, youser should be able to specify the correct path directly. No need to complicate it further -- it's not going to buy us much. | |
364–370 | This could be simplified to auto MaybeSpackPath = Candidate.isSPACK() ? findSPACKPackage() : ""; auto Path = MaybeSpackPath.empty() ? CandidatePath : MaybeSpackPath; We could further simplify it if we move the isSpack() check into findSPACKPackage and make it return an empty string if it's not. | |
383–398 | OK. |
Hi @tra and @yaxunl, I'm commenting as a reviewer of the spack pull request for the rocm 4.2.0 ecosystem. First of all: thanks for caring about spack installations, that's highly appreciated.
However, this patch does not seem the right approach to me, for a couple reasons:
- LLVM should not be aware of spack -- it doesn't make sense. Spack doesn't even have a stable 1.0 release, expect it to break and to be inconsistent between versions.
- It's incorrect in multiple ways: (a) the directory structure name is configurable in spack, not everybody has this <name>-<version>-<hash> structure, so you shouldn't rely on it, (b) you are still not guaranteed to pick up the correct llvm-amdgpu because it may be installed in a chained repo on a shared HPC system, and it won't be in the same prefix folder at all (c) spack's main selling point is it supports multiple installs of one package (a hash also changes for the same llvm version if a downstream dep such as zlib is updated), this patch just bails when it detect multiple installations
Let's step back a bit. The problem you seem to be solving is the circular dependency between llvm and rocm-device-libs which are separate packages in spack; clang can't know where rocm-device-libs is at compile time because rocm-device-libs depends on llvm.
Clearly the solution is to build llvm together with rocm-device-libs, as explained in the readme of https://github.com/RadeonOpenCompute/ROCm-Device-Libs, we can fix that in spack by adding a resource in the llvm package to pull in more sources, and LLVM can happily live without knowing about spack.
Thanks,
Harmen
I've created a pull request to spack here: https://github.com/spack/spack/pull/23859, hopefully that's enough to revert this patch.
Edit: sorry, I may have accidentally changed the state of this issue.
Thanks for your feedback.
Before this patch, clang assumes llvm, HIP and device lib directory follow ROCm directory structure. Since the option -hip-path and environment variable HIP_PATH have not been introduced to clang, there was no way for clang to specify or deduce HIP path if it does not follow ROCm directory structure. At that time, I thought the only way to support Spack package was to let clang know its naming convention and relative directory structure among llvm, HIP and device lib. That's what motivated this patch.
The patch should not cause circular dependency on HIP or device library. There was a bug causing clang to fail if it did not find Spack HIP package or multiple Spack HIP packages. I apologize for the trouble caused by that. That bug was fixed by https://reviews.llvm.org/D102556. With that fix, clang will not fail if no Spack HIP package detected or multiple Spack HIP packages are detected.
With that fix, Spack llvm-amdgpu package does not need to depend on Spack HIP and device-lib packages. llvm-amdgpu package can be built and installed first. Then device-lib package can be built with clang as before, since device-lib is written in OpenCL and the OpenCL code is compiled with -nogpulib, therefore clang should not try to detect device-lib when building device lib. Also clang does not need HIP to compile device-lib.
Now that clang supports -hip-path option and HIP_PATH environment variable, it is possible to specify HIP location for Spack. The detection of Spack HIP and device-lib path become an optional feature for users' convenience. Without auto-detection of HIP and device-lib path, users have to use lengthy options -hip-device-lib-path and -hip-path to specify them, which is painful. That's why we want to detect them automatically, like how clang auto-detect gcc, msvc, CUDA SDK, and ROCm. Such auto-detection is optional. If clang cannot auto-detect HIP or device-lib, it just assumes they are not installed and will not fail. Users are always able to override it by using -hip-device-lib-path and -hip-path. Therefore for users it is not a loss to have auto-detection. Hopefully, by following proper heuristics based on naming and location conventions, we may be able to detect HIP and device-lib location automatically for common users, so that they can use clang out of box to compile HIP programs without specifying HIP and device-lib path explicitly. For cases where clang cannot auto-detect HIP and device-lib locations, users can always specify them with options or environment variables.
Hi Yaxunl,
The patch should not cause circular dependency on HIP or device library.
I'm not saying this patch introduces a circular dependency, I'm saying you are trying to solve an already existing circular dependency (clang needs device libs at runtime, but device libs need llvm to compile).
Let's talk about my PR here: https://github.com/spack/spack/pull/23859. It's building rocm-device-libs as part of llvm-amdgpu by configuring llvm with -DLLVM_EXTERNAL_PROJECTS=device-libs.
If you checkout that pr, run spack install hip@4.2.0, what you get is get is:
$ spack find -p llvm-amdgpu@4.2.0 llvm-amdgpu@4.2.0 /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi
And indeed the bitcode is there:
$ find /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi -iname '*.bc' /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/amdgcn/bitcode/oclc_isa_version_1033.bc /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/amdgcn/bitcode/ocml.bc /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/amdgcn/bitcode/hip.bc ...
Now when I used this --print-rocm-search-dirs flag on clang without other flags, what I see is:
$ /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/bin/clang++ --print-rocm-search-dirs -x hip hi.cc ROCm installation search path (Spack 4.2.0): /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0 ROCm installation search path: /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/lib/clang/12.0.0 ROCm installation search path: /opt/rocm ... clang-12: error: cannot find ROCm device library. Provide its path via --rocm-path or --rocm-device-lib-path, or pass -nogpulib to build without ROCm device library.
Now can you make it such that clang will search the llvm prefix itself?
In fact, if you replace all 3 search paths JUST with CMAKE_INSTALL_PREFIX path, it will find the correct bitcode files independent of whether you're using spack or whether you're bundling the rocm packages the traditional way in with the shared prefix in /opt/rocm... So, what I'd like to see is this:
$ /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/bin/clang++ --print-rocm-search-dirs -x hip hi.cc ROCm installation search path: /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/
and
$ /opt/rocm/llvm/bin/clang++ --print-rocm-search-dirs -x hip hi.cc ROCm installation search path: /opt/rocm/llvm
Doesn't that make a whole lot more sense than informing llvm about spack?
Edit: it seems like the amdgcn directory in the distributions is in /opt/rocm/amdgcn, but moving them inside /opt/rocm/llvm/amdgcn could hopefully be an option? I mean, the omp bitcode files are already in the llvm prefix...
Finally, it doesn't settle locating hipcc (nor .hipVersion), but that shouldn't be a real issue from the point of view of a spack user, since spack already sets the -I and -L flags for you when you make your package depend on hip. Note that spack already sets many hip-related environment variables anyways when you run spack load hipcc@4.2.0...
Last comment on this: on many HPC systems you find a system install of rocm in /opt/rocm, but the user may have reasons to install rocm themselves with spack (maybe a debug build). So, another thing that searching only CMAKE_INSTALL_PREFIX for bitcode files would solve (unless overriden by env variable/flag), is accidentally falling back to /opt/rocm when the spack detection does not work.
Yes I can make clang to search that path for device libs. I will open a review for that and remove the old detection of device libs for spack.
I still think auto detection of HIP path is a desirable feature, especially for interactive use of clang in a shell. This also allows clang to maintain certain level of backward compatibility with old HIP runtime by detecting HIP version.
Perhaps the existing print-search-dirs option would do the job?