This is an archive of the discontinued LLVM Phabricator instance.

[clang] Make amdgpu-arch tool work on Windows
ClosedPublic

Authored by yaxunl on Jun 25 2023, 10:22 AM.

Details

Summary

Currently amdgpu-arch tool detects AMD GPU by dynamically
loading HSA runtime shared library and using HSA API's,
which is not available on Windows.

This patch makes it work on Windows by dynamically loading
HIP runtime dll and using HIP API's.

Diff Detail

Event Timeline

yaxunl created this revision.Jun 25 2023, 10:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 25 2023, 10:22 AM
yaxunl requested review of this revision.Jun 25 2023, 10:22 AM
arsenm added a subscriber: arsenm.

Unrelated but can we get this to start reporting xnack and ecc?

clang/tools/amdgpu-arch/AMDGPUArchByHIP.cpp
81

single quotes around '\n'

92

llvm::outs

clang/tools/amdgpu-arch/AMDGPUArchByHSA.cpp
114

llvm::outs()?

Unrelated but can we get this to start reporting xnack and ecc?

A lot of CMake relies on this just being an ordered list of architectures, so we'd probably need to make that an opt-in thing.

clang/tools/amdgpu-arch/AMDGPUArch.cpp
52

Doesn't LLVM know if it's being built for Windows? Maybe we should key off of that instead and then conditionally add_sources for a single function that satisfies the same "print all the architectures" thing.

Unrelated but can we get this to start reporting xnack and ecc?

Using HIP runtime reports xnack and ecc. It reports the target ID including the accurate xnack and ecc feature which is ready to be used by clang.

yaxunl marked 3 inline comments as done.Jun 30 2023, 8:11 AM
yaxunl added inline comments.
clang/tools/amdgpu-arch/AMDGPUArch.cpp
52

When this code is compiled on Windows, the compiler predefines _WIN32, so it should work.

I tried to tweak cmake files of amdgpu-arch to selectively add source files for Windows and non-windows but it did not work. If you have a file in that directory that is not included in any target, cmake will report an error. Seems there is a mechanism in CMake files for clang tools not allowing any 'dangling' source files.

clang/tools/amdgpu-arch/AMDGPUArchByHIP.cpp
81

will do

92

will do

clang/tools/amdgpu-arch/AMDGPUArchByHSA.cpp
114

will do

yaxunl updated this revision to Diff 536257.Jun 30 2023, 8:15 AM
yaxunl marked 3 inline comments as done.
yaxunl added a reviewer: arsenm.

revised by comments

I tested the program on Windows and Linux and it works.

jhuber6 added inline comments.Jun 30 2023, 8:42 AM
clang/tools/amdgpu-arch/AMDGPUArch.cpp
52

The proper way to do that is to add it to a new subdirectory and conditionally do add_subdirectory. Something like

HSA/GetAMDGPUArch.cpp
HIP/GetAMDGPUArch.cpp

It's not a big deal, but I just feel like including unused symbols in the binary on Linux isn't ideal. Up to you if you want to put in the effort.

yaxunl marked 2 inline comments as done.Jun 30 2023, 8:49 AM
yaxunl added inline comments.
clang/tools/amdgpu-arch/AMDGPUArch.cpp
52

The HIP version actually works on both Linux and Windows. I am not sure whether one day we want to use it on Linux too since it supports target ID features.

Also, I kind of think it is overkill to have separate directories for Windows and Linux for this simple program.

Also w.r.t. target-id, I'm wondering what a good solution would be. Right now the main usage of amdgpu-arch is both to detect the -mcpu / -march in CMake and to fill in the architecture via --offload-arch=native or -fopenmp-target=amdgcn-amd-amdhsa. We may want to make a flag to specify if we want to include target-id information in the reported architectures.

arsenm added inline comments.Jun 30 2023, 8:58 AM
clang/tools/amdgpu-arch/AMDGPUArch.cpp
52

Why can't you get the target id features through the HSA path? I think there's value in going through the lowest level component to get the information

jhuber6 added inline comments.Jun 30 2023, 9:02 AM
clang/tools/amdgpu-arch/AMDGPUArch.cpp
52

This should be what we do in the OpenMP runtime, should be able to add that feature.

uint32_t name_len;                                                                                                                                                                      
err = hsa_isa_get_info_alt(isa, HSA_ISA_INFO_NAME_LENGTH, &name_len);                                                                                                                        
if (err != HSA_STATUS_SUCCESS) {                                                                                                                                                                                                   
  DP("Error getting ISA info length\n");                                                                                                                                                                                           
  return err;                                                                                                                                                                                                                      
}                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                   
char TargetID[name_len];                                                                                                                                                                                                           
err = hsa_isa_get_info_alt(isa, HSA_ISA_INFO_NAME, TargetID);                                                                                                                                                                      
if (err != HSA_STATUS_SUCCESS) {                                                                                                                                                                                                   
  DP("Error getting ISA info name\n");                                                                                                                                                                                             
  return err;                                                                                                                                                                                                                      
}

It's unfortunate that HSA does not work on Windows so we need to split this stuff up.

jhuber6 added inline comments.Jun 30 2023, 9:07 AM
clang/tools/amdgpu-arch/AMDGPUArch.cpp
52

Just realized this needs the isa, ignore that. I'll need to scrounge around the HSA info enums to see if this is queriable there for the agent.

arsenm added inline comments.Jul 7 2023, 5:41 AM
clang/tools/amdgpu-arch/AMDGPUArch.cpp
55

The HIP path should work on linux too. I generally think we should build as much code as possible on all hosts, so how about

#ifndef _WIN32
  if (tryHSA())
    return 0;
#endif

tryHIP()
jhuber6 added inline comments.Jul 7 2023, 5:43 AM
clang/tools/amdgpu-arch/AMDGPUArch.cpp
55

That'd be fine, I'm in favor of sticking to HSA since it's a smaller runtime that's more reasonable to build standalone without the whole ROCm stack.

yaxunl updated this revision to Diff 538170.Jul 7 2023, 9:19 AM
yaxunl marked an inline comment as done.

revised by comments

yaxunl marked 5 inline comments as done.Jul 7 2023, 9:20 AM
yaxunl added inline comments.
clang/tools/amdgpu-arch/AMDGPUArch.cpp
55

done

jhuber6 added inline comments.Jul 7 2023, 9:28 AM
clang/tools/amdgpu-arch/AMDGPUArch.cpp
48–52

Are we missing something here ? They look the same.

yaxunl marked an inline comment as done.Jul 7 2023, 10:54 AM
yaxunl added inline comments.
clang/tools/amdgpu-arch/AMDGPUArch.cpp
48–52

sorry my mistake. will fix.

jdoerfert added inline comments.Jul 7 2023, 10:56 AM
clang/tools/amdgpu-arch/AMDGPUArchByHIP.cpp
97

Where is the call to this?

yaxunl updated this revision to Diff 538207.Jul 7 2023, 11:02 AM
yaxunl marked an inline comment as done.

revised by comments

yaxunl marked an inline comment as done.Jul 7 2023, 11:04 AM
yaxunl added inline comments.
clang/tools/amdgpu-arch/AMDGPUArchByHIP.cpp
97

forgot to call this in main. fixed.

This revision is now accepted and ready to land.Jul 7 2023, 11:05 AM
This revision was landed with ongoing or failed builds.Jul 7 2023, 9:32 PM
This revision was automatically updated to reflect the committed changes.
yaxunl marked an inline comment as done.
Herald added a project: Restricted Project. · View Herald TranscriptJul 7 2023, 9:32 PM

The right thing to do on Linux for this is to query the driver directly. That is, the kernel should populate some string under /sys that we read. That isn't yet implemented. Does windows happen to have that functionality available?

(landed here while trying to work out why tests aren't running because we now print errors about failing to load libamdhip64.so when hsa fails)

clang/tools/amdgpu-arch/AMDGPUArch.cpp
55

I can't think of a case on linux where HIP would work and HSA would not, given that HIP calls into HSA to do the same query. So I think this fallback path only contributes to line noise when HSA doesn't load,

./bin/amdgpu-arch 
Failed to 'dlopen' libhsa-runtime64.so
Failed to load libamdhip64.so: libamdhip64.so: cannot open shared object file: No such file or directory

The right thing to do on Linux for this is to query the driver directly. That is, the kernel should populate some string under /sys that we read. That isn't yet implemented.

It should definitely not do that. That's what this redundant thing does . The kernel doesn't know the names of these devices. The kernel knows different names that map to PCI ids that are not the same as the gfx numbers. The compiler should not be responsible for maintaining yet another name mapping table and should go through a real API

Does windows happen to have that functionality available?

This sounds very un-windows like. I assume the equivalent is digging around in the registry

arsenm added inline comments.Jul 10 2023, 4:28 AM
clang/tools/amdgpu-arch/AMDGPUArchByHIP.cpp
81

Should print some kind of stringified error codes for all of these

JonChesterfield added a comment.EditedJul 10 2023, 4:32 AM

The right thing to do on Linux for this is to query the driver directly. That is, the kernel should populate some string under /sys that we read. That isn't yet implemented.

It should definitely not do that. That's what this redundant thing does . The kernel doesn't know the names of these devices. The kernel knows different names that map to PCI ids that are not the same as the gfx numbers. The compiler should not be responsible for maintaining yet another name mapping table and should go through a real API

There's a lot of pci id to gfx906 style tables lying around already. There used to be one in roct, last time I looked people wanted to move that to somewhere else. I don't really want to copy/paste it.

The problem with using the proper API via HSA or similar is twofold:

  • we use this tool to enable tests, which means HSA has to exist before building clang or the tests don't run and HSA now requires clang to build
  • if you open the driver too many times at once it fails to open, so running a parallel build that uses this tool doesn't work on fast machines
  • if you open the driver too many times at once it fails to open, so running a parallel build that uses this tool doesn't work on fast machines

Why would this happen? Seems like a bug to fix?

The problem with using the proper API via HSA or similar is twofold:

  • we use this tool to enable tests, which means HSA has to exist before building clang or the tests don't run and HSA now requires clang to build

I don't follow this. You don't need this to work to perform the build and test build. You may need it to execute the tests, but if HSA doesn't exist they won't be able to run anyway. If a build is invoking these tools at cmake time it's just broken

  • if you open the driver too many times at once it fails to open, so running a parallel build that uses this tool doesn't work on fast machines

Why would this happen? Seems like a bug to fix?

Jon is probably referring to a recurring problem we've noticed with the libc tests on HSA that they will sometimes fail when running with multiple threads, see https://lab.llvm.org/staging/#/builders/247/builds/2599/steps/10/logs/stdio. Haven't been able to track down whether or not that's a bug in the implementation or interface somewhere.

The problem with using the proper API via HSA or similar is twofold:

  • we use this tool to enable tests, which means HSA has to exist before building clang or the tests don't run and HSA now requires clang to build

I don't follow this. You don't need this to work to perform the build and test build. You may need it to execute the tests, but if HSA doesn't exist they won't be able to run anyway. If a build is invoking these tools at cmake time it's just broken

And the libomptarget build is in fact doing that, but it shouldn't have to. What it's doing actually seems really unreasonable. It's only building the locally found targets when it should be building all targetable devices. The inconvenience there is that's too many devices, so as a build time hack you should be able to opt-in to a restricted subset. Even better would be if we would only build a copy for a reasonable subset of targets (i.e. one per generation where there's actually some semblance of compatibility). Or could just capitulate and rely on the hacks device libs does

  • if you open the driver too many times at once it fails to open, so running a parallel build that uses this tool doesn't work on fast machines

Why would this happen? Seems like a bug to fix?

It definitely annoys me. The argument is you can't usefully run some large N number of programs at the same time anyway and the driver failing to open is a rate limit. The problem is there are things we could usefully do, like this query, without needing to run a kernel as well. The net effect is we don't run tests widely in parallel because they fail if we do, for this and possibly other reasons.

The test detection is an awkward compromise between people who want to run the GPU tests and people who don't, and reflects diverse hardware in use and variation on whether cuda / hsa are installed.

And the libomptarget build is in fact doing that, but it shouldn't have to. What it's doing actually seems really unreasonable. It's only building the locally found targets when it should be building all targetable devices. The inconvenience there is that's too many devices, so as a build time hack you should be able to opt-in to a restricted subset. Even better would be if we would only build a copy for a reasonable subset of targets (i.e. one per generation where there's actually some semblance of compatibility). Or could just capitulate and rely on the hacks device libs does

The libomptarget build uses it to determine if it should build the tests mostly, we don't want to configure tests for a system that cannot support them. The libc tests however requires it to set the architecture for its test configuration since we can't support multiple test architectures at the same time, it required too much work so I shelved that. We more or less just say "If you've got HSA / CUDA we expect to run tests".