Page MenuHomePhabricator

[HIP] Support Spack packages
AcceptedPublic

Authored by yaxunl on Feb 23 2021, 3:04 PM.

Details

Summary

Spack is a package management tool extensively used by HPC community.
As ROCm packages are built by Spack by HPC community, we need to teach
clang driver to detect ROCm installation built by Spack.

Diff Detail

Event Timeline

yaxunl created this revision.Feb 23 2021, 3:04 PM
yaxunl requested review of this revision.Feb 23 2021, 3:04 PM
tra added inline comments.Feb 24 2021, 10:10 AM
clang/lib/Driver/ToolChains/AMDGPU.cpp
25–31

void function can not return true, so the comment needs updating.

Also, Using Path as both an input and an output is odd. We should probably return the full path or an empty string and call the function findSPACKPackage.

27–28

How are users expected to handle cases when they do have multiple rocm versions installed with spack (I assume that having multiple packages *is* possible. E.g. package-4.0, package-3.7 should be able to co-exist). They may have legitimate reasons to have more than one rocm version installed *and* they may need to be able to build with a particular one.

Can users use --hip-path=specific/hip/version/in-spack ?

clang/lib/Driver/ToolChains/ROCm.h
63

Nit: using SmallString here and everywhere else in this patch does not buy us anything. std::string would be fine.
My own rule of thumb is to use standard types by default and optimized types when they are needed.

yaxunl marked 3 inline comments as done.Feb 24 2021, 5:31 PM
yaxunl added inline comments.
clang/lib/Driver/ToolChains/AMDGPU.cpp
25–31

done

27–28

Yes. --hip-path is introduced for that purpose.

clang/lib/Driver/ToolChains/ROCm.h
63

thanks. will use std::string

yaxunl updated this revision to Diff 326254.Feb 24 2021, 5:36 PM
yaxunl marked 3 inline comments as done.

revised by Artem's comments

tra added inline comments.Feb 25 2021, 11:20 AM
clang/lib/Driver/ToolChains/AMDGPU.cpp
27–28

So, --hip-path is expected to point to the HIP runtime install directory and could presumably be used to point anywhere, including an arbitrary spack package. OK, this part works.

32

Based on how the function is used, perhaps we should make it
static std::string findSPACKPackage(const Driver &D, StringRef Path, StringRef PackageName, StringRef RocmVersion) and fold the prefix construction (and possibly version non-emptiness check) into the function.

383–398

So, normally --hip-path would point to the HIP package installation.
However, is clang installation path contains llvm-amdgpu (and thus CanInfo.SPACKReleaseStr is set to whatever version may follow),
then --hip-path (and other candidate paths) directories will also be treated as an SPACK package directory and, if the hip package is found there, then *that* package directory will be used as the actual path to be used.

I see several potential issues with that.

  • if clang path contains llvm-amdgpu, but it is *not* installed as an SPACK package We have no control over user's choice of the name for the installation path.
  • what if clang is installed as an SPCAK, but is invoked via a symlink with the name that does not contain llvm-amdgpu
  • the fact that --hip-path/--rocm-path behavior may change based on clang's executable path is rather peculiar. User-supplied parameters should not change their meaning depending on what you call the executable.

I'd be OK to derive spack-based paths if user didn't provide the paths explicitly, but I don't think we should add any magic to explicit overrides.

yaxunl marked 3 inline comments as done.Mar 1 2021, 12:28 PM
yaxunl added inline comments.
clang/lib/Driver/ToolChains/AMDGPU.cpp
32

Will use PackageName and RocmVersion as arguments.

However, we use llvm::sys::path to manipulate paths and llvm::sys::path is using llvm::SmallString. If we return std::string or use StringRef for path argument, we have to convert them back and force.

383–398

Will make SPACKReleaseStr a member of Candidate and allow mixed Spack candidates and non-Spack candidates. If clang path contains llvm-amdgpu but is not a Spack package, clang will still look for non-Spack ROCm candidates.

Also added parent of real path of clang as ROCm candidate. If clang is invoked through a sym link which is not under the real ROCm directory, clang will still be able to detect it. Also added --print-rocm-search-dirs for testing this.

When --hip-path/--rocm-path is set, the specified ROCm/HIP path is used. Spack detection is disabled, since the candidate specified by --rocm-path or --hip-path has empty SPACKReleaseStr.

yaxunl updated this revision to Diff 327243.Mar 1 2021, 12:30 PM
yaxunl marked 2 inline comments as done.

revised by Artem's comments

tra added inline comments.Mar 1 2021, 1:53 PM
clang/include/clang/Driver/Options.td
3566–3567

Perhaps the existing print-search-dirs option would do the job?

clang/lib/Driver/ToolChains/AMDGPU.cpp
32

It should still work. The use cases below look like this:

  auto SPACKPath = findSPACKPackage(D, Path, "rocm-device-libs",
                                    Candidate.SPACKReleaseStr);
  if (!SPACKPath.empty())
    Path = SPACKPath;
}
for (auto SubDir : SubDirs)
  llvm::sys::path::append(Path, SubDir);

It's the Pack that needs to be a SmallString. Assigning an std::string to it should work -- it has operator=(StringRef) and StringRef can be created from a std::string.

I'll leave it up to you. If std::string is inconvenient to use, the SmallString is fine.

186–188

We call getInstallationPathCandidates() more more than one place, so we may end up with the same path added multiple times.

I think population of the candidate list should be separated from reading it back, and should be done only once.
E.g.

if (!isCandidateListReady) 
   initCandidateList(); // populates ROCmSearchDirs, sets isCandidateListReady
return ROCmSearchDirs;
213

What would be a consequence of mistakingly detecting a non-SPACK installation as SPACK?

I'm still not quite comfortable with using a path name, which can be arbitrarily named by a user, as a way to detect particular installation method.
Do SPACK packages have something *inside* the package directory that we could use to tell that it is indeed an SPACK package? Some sort of metadata or version file, perhaps?

It's probably OK to rely on the path naming scheme, but there are still corner cases it can't handle.
I could have a setup when the path-name based detection would still fail despite considering both the reaolved and unresolved symlink names.
E.g.

/pkg/llvm-amdgpu-hash -> /pkg/some-other-real-dir-name
/usr/local/bin/clang -> /pkg/llvm-amdgpu-hash/bin/clang

If I compile with /usr/local/bin/clang neither the realpath nor the unresolved symlink will have 'llvm-amdgpu'.

Granted, it's a somewhat contrived setup. Using the name as the hint would work for the standard setups, so we can live with that, provided that we can always tell compiler where to find the files it needs, if the user has an unusual install.

383–398

Also added --print-rocm-search-dirs for testing this.

Maybe we could just enable that with -v. We already are printing various locations for GCC installations, etc.

clang/lib/Driver/ToolChains/ROCm.h
48

I'd add a bool isSPACK() method to consolidate the SPACK/non-SPACK determination into a single location.

yaxunl marked 6 inline comments as done.Mar 2 2021, 2:45 PM
yaxunl added inline comments.
clang/include/clang/Driver/Options.td
3566–3567

The concern is similar to https://reviews.llvm.org/D79210:

There may be tools parsing print-search-dirs output, changing it may break those tools.

clang/lib/Driver/ToolChains/AMDGPU.cpp
186–188

We check whether ROCmSearchDirs is populated on line 166 and only populate it if not.

213

llvm-amdgpu Spack installation does not add extra files. It uses normal llvm cmake files to build and differs from a non-Spack llvm build only by name. However, its name follows a very unusual pattern llvm-amdgpu-<release>-<hash>, where <hash> is a string of 32 alphanumerical characters. I think we could detect it by checking the length of <hash>.

383–398

The situation is somehow similar to https://reviews.llvm.org/D79210

Another reason not to print this with -v is that normally users may not need this information unless clang cannot find ROCm and users want to know why.

clang/lib/Driver/ToolChains/ROCm.h
48

done

yaxunl updated this revision to Diff 327596.Mar 2 2021, 2:47 PM
yaxunl marked 5 inline comments as done.

revised by Artem's comments

tra accepted this revision.Mar 2 2021, 3:36 PM

LGTM modulo couple of nits.

clang/include/clang/Driver/Options.td
3566–3567

OK. New option it is then,

clang/lib/Driver/ToolChains/AMDGPU.cpp
163–164

I'm not quite sure which part of the code is related to the do not do strict check. Perhaps move the comment closer to the code it describes and add the details what we'd do otherwise, so one can understand what's missing.

Also, adding a comment describing what the function does (returns candidate list, populating it once, if necessary) would be great.

186–188

Agreed, the code works. That said, the fact that I didn't notice that is a point in case -- it's not obvious.
It's much easier to keep track of what's going on when one function does one things without changing its behavior from one call to another.
I'd at least add a comment around line 166 saying that just return the list if we've already populated it.

213

I think we're getting beyond the point of diminishing returns here. If we don't have a way to identify an SPACK package directly, detecting it by the name will never be 100% reliable. The current code will be correct most of the time. For those few cases where it may get it wrong, youser should be able to specify the correct path directly. No need to complicate it further -- it's not going to buy us much.

364–370

This could be simplified to

auto MaybeSpackPath = Candidate.isSPACK() ? findSPACKPackage() : "";
auto Path = MaybeSpackPath.empty() ? CandidatePath : MaybeSpackPath;

We could further simplify it if we move the isSpack() check into findSPACKPackage and make it return an empty string if it's not.

383–398

OK.

This revision is now accepted and ready to land.Mar 2 2021, 3:36 PM
yaxunl marked 4 inline comments as done.Mar 5 2021, 7:34 PM
yaxunl added inline comments.
clang/lib/Driver/ToolChains/AMDGPU.cpp
163–164

will do

186–188

will do

213

sure. will not check hash length.

364–370

will move isSPACK() inside findSPACKPackage

This revision was landed with ongoing or failed builds.Mar 6 2021, 5:42 AM
This revision was automatically updated to reflect the committed changes.
yaxunl marked 4 inline comments as done.
Herald added a project: Restricted Project. · View Herald TranscriptMar 6 2021, 5:42 AM
haampie added a subscriber: haampie.EditedMay 22 2021, 7:12 AM

Hi @tra and @yaxunl, I'm commenting as a reviewer of the spack pull request for the rocm 4.2.0 ecosystem. First of all: thanks for caring about spack installations, that's highly appreciated.

However, this patch does not seem the right approach to me, for a couple reasons:

  1. LLVM should not be aware of spack -- it doesn't make sense. Spack doesn't even have a stable 1.0 release, expect it to break and to be inconsistent between versions.
  2. It's incorrect in multiple ways: (a) the directory structure name is configurable in spack, not everybody has this <name>-<version>-<hash> structure, so you shouldn't rely on it, (b) you are still not guaranteed to pick up the correct llvm-amdgpu because it may be installed in a chained repo on a shared HPC system, and it won't be in the same prefix folder at all (c) spack's main selling point is it supports multiple installs of one package (a hash also changes for the same llvm version if a downstream dep such as zlib is updated), this patch just bails when it detect multiple installations

Let's step back a bit. The problem you seem to be solving is the circular dependency between llvm and rocm-device-libs which are separate packages in spack; clang can't know where rocm-device-libs is at compile time because rocm-device-libs depends on llvm.

Clearly the solution is to build llvm together with rocm-device-libs, as explained in the readme of https://github.com/RadeonOpenCompute/ROCm-Device-Libs, we can fix that in spack by adding a resource in the llvm package to pull in more sources, and LLVM can happily live without knowing about spack.

Thanks,

Harmen

haampie reopened this revision.EditedMay 22 2021, 2:23 PM

I've created a pull request to spack here: https://github.com/spack/spack/pull/23859, hopefully that's enough to revert this patch.

Edit: sorry, I may have accidentally changed the state of this issue.

This revision is now accepted and ready to land.May 22 2021, 2:23 PM

Hi @tra and @yaxunl, I'm commenting as a reviewer of the spack pull request for the rocm 4.2.0 ecosystem. First of all: thanks for caring about spack installations, that's highly appreciated.

However, this patch does not seem the right approach to me, for a couple reasons:

  1. LLVM should not be aware of spack -- it doesn't make sense. Spack doesn't even have a stable 1.0 release, expect it to break and to be inconsistent between versions.
  2. It's incorrect in multiple ways: (a) the directory structure name is configurable in spack, not everybody has this <name>-<version>-<hash> structure, so you shouldn't rely on it, (b) you are still not guaranteed to pick up the correct llvm-amdgpu because it may be installed in a chained repo on a shared HPC system, and it won't be in the same prefix folder at all (c) spack's main selling point is it supports multiple installs of one package (a hash also changes for the same llvm version if a downstream dep such as zlib is updated), this patch just bails when it detect multiple installations

Let's step back a bit. The problem you seem to be solving is the circular dependency between llvm and rocm-device-libs which are separate packages in spack; clang can't know where rocm-device-libs is at compile time because rocm-device-libs depends on llvm.

Clearly the solution is to build llvm together with rocm-device-libs, as explained in the readme of https://github.com/RadeonOpenCompute/ROCm-Device-Libs, we can fix that in spack by adding a resource in the llvm package to pull in more sources, and LLVM can happily live without knowing about spack.

Thanks,

Harmen

Thanks for your feedback.

Before this patch, clang assumes llvm, HIP and device lib directory follow ROCm directory structure. Since the option -hip-path and environment variable HIP_PATH have not been introduced to clang, there was no way for clang to specify or deduce HIP path if it does not follow ROCm directory structure. At that time, I thought the only way to support Spack package was to let clang know its naming convention and relative directory structure among llvm, HIP and device lib. That's what motivated this patch.

The patch should not cause circular dependency on HIP or device library. There was a bug causing clang to fail if it did not find Spack HIP package or multiple Spack HIP packages. I apologize for the trouble caused by that. That bug was fixed by https://reviews.llvm.org/D102556. With that fix, clang will not fail if no Spack HIP package detected or multiple Spack HIP packages are detected.

With that fix, Spack llvm-amdgpu package does not need to depend on Spack HIP and device-lib packages. llvm-amdgpu package can be built and installed first. Then device-lib package can be built with clang as before, since device-lib is written in OpenCL and the OpenCL code is compiled with -nogpulib, therefore clang should not try to detect device-lib when building device lib. Also clang does not need HIP to compile device-lib.

Now that clang supports -hip-path option and HIP_PATH environment variable, it is possible to specify HIP location for Spack. The detection of Spack HIP and device-lib path become an optional feature for users' convenience. Without auto-detection of HIP and device-lib path, users have to use lengthy options -hip-device-lib-path and -hip-path to specify them, which is painful. That's why we want to detect them automatically, like how clang auto-detect gcc, msvc, CUDA SDK, and ROCm. Such auto-detection is optional. If clang cannot auto-detect HIP or device-lib, it just assumes they are not installed and will not fail. Users are always able to override it by using -hip-device-lib-path and -hip-path. Therefore for users it is not a loss to have auto-detection. Hopefully, by following proper heuristics based on naming and location conventions, we may be able to detect HIP and device-lib location automatically for common users, so that they can use clang out of box to compile HIP programs without specifying HIP and device-lib path explicitly. For cases where clang cannot auto-detect HIP and device-lib locations, users can always specify them with options or environment variables.

haampie added a comment.EditedMon, May 24, 1:41 PM

Hi Yaxunl,

The patch should not cause circular dependency on HIP or device library.

I'm not saying this patch introduces a circular dependency, I'm saying you are trying to solve an already existing circular dependency (clang needs device libs at runtime, but device libs need llvm to compile).

Let's talk about my PR here: https://github.com/spack/spack/pull/23859. It's building rocm-device-libs as part of llvm-amdgpu by configuring llvm with -DLLVM_EXTERNAL_PROJECTS=device-libs.

If you checkout that pr, run spack install hip@4.2.0, what you get is get is:

$ spack find -p llvm-amdgpu@4.2.0
llvm-amdgpu@4.2.0  /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi

And indeed the bitcode is there:

$ find /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi -iname '*.bc'
/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/amdgcn/bitcode/oclc_isa_version_1033.bc
/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/amdgcn/bitcode/ocml.bc
/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/amdgcn/bitcode/hip.bc
...

Now when I used this --print-rocm-search-dirs flag on clang without other flags, what I see is:

$ /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/bin/clang++ --print-rocm-search-dirs -x hip hi.cc 
ROCm installation search path (Spack 4.2.0): /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0
ROCm installation search path: /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/lib/clang/12.0.0
ROCm installation search path: /opt/rocm
...
clang-12: error: cannot find ROCm device library. Provide its path via --rocm-path or --rocm-device-lib-path, or pass -nogpulib to build without ROCm device library.

Now can you make it such that clang will search the llvm prefix itself?

In fact, if you replace all 3 search paths JUST with CMAKE_INSTALL_PREFIX path, it will find the correct bitcode files independent of whether you're using spack or whether you're bundling the rocm packages the traditional way in with the shared prefix in /opt/rocm... So, what I'd like to see is this:

$ /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/bin/clang++ --print-rocm-search-dirs -x hip hi.cc 
ROCm installation search path: /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/

and

$ /opt/rocm/llvm/bin/clang++ --print-rocm-search-dirs -x hip hi.cc 
ROCm installation search path: /opt/rocm/llvm

Doesn't that make a whole lot more sense than informing llvm about spack?

Edit: it seems like the amdgcn directory in the distributions is in /opt/rocm/amdgcn, but moving them inside /opt/rocm/llvm/amdgcn could hopefully be an option? I mean, the omp bitcode files are already in the llvm prefix...

Finally, it doesn't settle locating hipcc (nor .hipVersion), but that shouldn't be a real issue from the point of view of a spack user, since spack already sets the -I and -L flags for you when you make your package depend on hip. Note that spack already sets many hip-related environment variables anyways when you run spack load hipcc@4.2.0...

Last comment on this: on many HPC systems you find a system install of rocm in /opt/rocm, but the user may have reasons to install rocm themselves with spack (maybe a debug build). So, another thing that searching only CMAKE_INSTALL_PREFIX for bitcode files would solve (unless overriden by env variable/flag), is accidentally falling back to /opt/rocm when the spack detection does not work.

foad removed a subscriber: foad.Tue, May 25, 2:01 AM

Hi Yaxunl,

The patch should not cause circular dependency on HIP or device library.

I'm not saying this patch introduces a circular dependency, I'm saying you are trying to solve an already existing circular dependency (clang needs device libs at runtime, but device libs need llvm to compile).

Let's talk about my PR here: https://github.com/spack/spack/pull/23859. It's building rocm-device-libs as part of llvm-amdgpu by configuring llvm with -DLLVM_EXTERNAL_PROJECTS=device-libs.

If you checkout that pr, run spack install hip@4.2.0, what you get is get is:

$ spack find -p llvm-amdgpu@4.2.0
llvm-amdgpu@4.2.0  /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi

And indeed the bitcode is there:

$ find /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi -iname '*.bc'
/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/amdgcn/bitcode/oclc_isa_version_1033.bc
/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/amdgcn/bitcode/ocml.bc
/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/amdgcn/bitcode/hip.bc
...

Now when I used this --print-rocm-search-dirs flag on clang without other flags, what I see is:

$ /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/bin/clang++ --print-rocm-search-dirs -x hip hi.cc 
ROCm installation search path (Spack 4.2.0): /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0
ROCm installation search path: /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/lib/clang/12.0.0
ROCm installation search path: /opt/rocm
...
clang-12: error: cannot find ROCm device library. Provide its path via --rocm-path or --rocm-device-lib-path, or pass -nogpulib to build without ROCm device library.

Now can you make it such that clang will search the llvm prefix itself?

Yes I can make clang to search that path for device libs. I will open a review for that and remove the old detection of device libs for spack.

I still think auto detection of HIP path is a desirable feature, especially for interactive use of clang in a shell. This also allows clang to maintain certain level of backward compatibility with old HIP runtime by detecting HIP version.

opened https://reviews.llvm.org/D103281 to fix device lib detection for spack