
[OffloadArch] Library to query properties of current offload architecture
Needs Review · Public

Authored by saiislam on Jul 28 2021, 7:31 AM.



offload-arch: Print offload architecture(s) for the current system, or
print offload runtime capabilities of the current system,
or look up information about offload architectures,
or print offload requirements for an application binary


offload-arch [ Options ] [ Optional lookup-value ]

With no options, offload-arch prints the value for the first visible
offload-arch in the system. This can be used by various clang
frontends. For example, to compile for OpenMP offloading on your current
system, invoke clang with the following command:
   clang -fopenmp -fopenmp-targets=`offload-arch` foo.c

If an optional lookup-value is specified, offload-arch will
check if the value is either a valid offload-arch or a codename
and display the values associated with that offload-arch or codename.
For example, this provides all information for offload-arch gfx906:

offload-arch gfx906 -v


     -m  Print device code name (often found in pci.ids file)
     -n  Print numeric pci-id
     -t  Print clang offload triple to use for the offload arch.
     -c  Print offload capabilities of the current system.
         This option is used by the language runtime to select an image
         when multiple offload images are available in the binary.
         A capability must exist for each requirement of the selected image.
     -a  Print values for all devices. Don't stop at first visible device.
     -v  Verbose = -a -m -n -t
         For all devices, print codename, numeric value and triple

     The options -a and -v will show the offload-arch for all pci-ids that could
     offload, even if they are not visible. Otherwise, the options -m, -n, -t,
     or no option will only show information for the first visible device.

Other Options:

-h  Print this help message
-f  <filename> Print offload requirements including offload-arch for
    each offload image compiled into an application binary file.
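
The selection rule stated under -c ("a capability must exist for each requirement of the selected image") can be sketched as follows. This is a minimal illustration, not the runtime's actual code: the function names are hypothetical, and it assumes that requirements and capabilities are whitespace-separated tokens, as in the `gfx906   sramecc+ xnack-` output of `offload-arch -c`, and that tokens are compared by exact string match.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split a whitespace-separated line such as "gfx906   sramecc+ xnack-"
// into its individual tokens.
std::vector<std::string> splitTokens(const std::string &line) {
  std::istringstream in(line);
  std::vector<std::string> tokens;
  for (std::string tok; in >> tok;)
    tokens.push_back(tok);
  return tokens;
}

// Assumption: an image is usable when every one of its requirement tokens
// appears verbatim among the capability tokens reported for the system.
bool imageIsCompatible(const std::string &requirements,
                       const std::string &capabilities) {
  const std::vector<std::string> caps = splitTokens(capabilities);
  const std::vector<std::string> reqs = splitTokens(requirements);
  return std::all_of(reqs.begin(), reqs.end(), [&](const std::string &req) {
    return std::find(caps.begin(), caps.end(), req) != caps.end();
  });
}

// Return the index of the first compatible image, or -1 if none fits.
int selectImage(const std::vector<std::string> &imageRequirements,
                const std::string &capabilities) {
  for (std::size_t i = 0; i < imageRequirements.size(); ++i)
    if (imageIsCompatible(imageRequirements[i], capabilities))
      return static_cast<int>(i);
  return -1;
}
```

With this sketch, a binary carrying images for `gfx908` and `gfx906 xnack-` would select the second image on a system whose capabilities are `gfx906 sramecc+ xnack-`.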

There are aliases (symbolic links) 'amdgpu-arch', 'nvidia-arch',
and 'intelhd-arch' to the offload-arch tool. These aliases return 1
if, respectively, no AMD, no Nvidia, or no IntelHD GPUs are found.
These aliases are useful to determine if architecture-specific
offloading tests should be run, or to conditionally load
architecture-specific software.
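
As a rough sketch of how a test driver might use these aliases, the helper below runs a tool and checks its exit status. The helper name is hypothetical, and it assumes the aliases exit with status 0 when a matching GPU is found (the summary only states that they return 1 when none is found). It is POSIX-only because it relies on shell redirection and the `<sys/wait.h>` macros.

```cpp
#include <cstdlib>
#include <string>
#include <sys/wait.h> // POSIX: WIFEXITED, WEXITSTATUS

// Run `tool` through the shell, discard its output, and report whether it
// exited normally with status 0.
bool toolSucceeds(const std::string &tool) {
  const std::string command = tool + " > /dev/null 2>&1";
  const int status = std::system(command.c_str());
  return status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

A driver could then guard AMD-specific tests with something like `if (toolSucceeds("amdgpu-arch")) { ... }`. A tool that is not installed also counts as a failure, because the shell reports a non-zero status for it.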

Originally authored by Greg Rodgers (@gregrodgers).


Event Timeline

saiislam created this revision. · Jul 28 2021, 7:31 AM
saiislam requested review of this revision. · Jul 28 2021, 7:31 AM
Herald added projects: Restricted Project, Restricted Project, Restricted Project. · Jul 28 2021, 7:31 AM
saiislam updated this revision to Diff 362377. · Jul 28 2021, 7:36 AM

Fixed typos

ye-luo added a subscriber: ye-luo. · Aug 4 2021, 8:16 AM

I am testing with aomp 13.0-5 on Ubuntu 20.04.2 LTS (Focal Fossa).

yeluo@epyc-server:~$ offload-arch -a
ERROR: offload-arch not found for 10de:2486.
yeluo@epyc-server:~$ offload-arch -c
gfx906   sramecc+ xnack-
yeluo@epyc-server:~$ offload-arch -n
gfx906 1002:66AF

My second GPU is an NVIDIA 3060Ti (sm_86).
I build my app daily with -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_80.

An sm_80 binary is able to run on sm_86.

tra added a subscriber: tra. · Aug 4 2021, 10:20 AM

> My second GPU is an NVIDIA 3060Ti (sm_86).
> I build my app daily with -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_80.
>
> An sm_80 binary is able to run on sm_86.

Keep in mind that binaries compiled for sm_80 will likely run a lot slower on sm_86. sm_86 has distinctly different hardware, and the code generated for sm_80 will be sub-optimal for it.
I don't have Ampere cards to compare, but sm_70 binaries running on sm_75 reached only about 1/2 of the speed of the same code compiled for sm_75 when operating on fp16.

NVIDIA didn't provide a performance tuning guide for Ampere, but here's what it had to say about Volta/Turing:

> Any binary compiled for Volta will run on Turing, but Volta binaries using Tensor Cores will only be able to reach half of Turing's Tensor Core peak performance.
> Recompiling the binary specifically for Turing would allow it to reach the peak performance.

yaxunl added a comment. · Aug 4 2021, 2:23 PM

This only works on Linux. Either make it work on both Linux and Windows, or restrict it to Linux in CMakeLists.txt; otherwise it breaks the LLVM build on Windows.


It is better to use LLVM or standard C++ functions for directory operations, since dirent.h is not available in MSVC. Even though this utility only works on Linux for now, keeping it platform-neutral makes it easier to port to Windows.
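
One portable way to enumerate a directory without dirent.h is C++17's `std::filesystem`, sketched below. The helper name and the choice of directory are hypothetical, since the code under review is not shown here. Inside LLVM itself, `llvm::sys::fs::directory_iterator` would be the analogous facility.

```cpp
#include <algorithm>
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Return the sorted names of the entries directly under `dir`, or an empty
// list if the directory cannot be opened. The error_code overloads are used
// so that a missing or unreadable directory does not throw.
std::vector<std::string> listEntryNames(const fs::path &dir) {
  std::vector<std::string> names;
  std::error_code ec;
  fs::directory_iterator it(dir, ec);
  const fs::directory_iterator end;
  while (!ec && it != end) {
    names.push_back(it->path().filename().string());
    it.increment(ec);
  }
  // Directory iteration order is unspecified, so sort for stable output.
  std::sort(names.begin(), names.end());
  return names;
}
```

The same source compiles with MSVC, GCC, and Clang, provided the compiler is in C++17 mode.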


It would be much simpler to use the HIP API to get the device name and capabilities, e.g. gfx906:xnack+:sramecc-

It will work on both Linux and Windows. On Linux, the availability of the HIP runtime is the same as that of the HSA runtime. On Windows, the HIP runtime is shipped with the display driver, whereas the HSA runtime is not available.
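
However the string is obtained (the HIP query itself is not shown here), a value such as `gfx906:xnack+:sramecc-` still has to be split into an architecture name and feature settings. The sketch below assumes the colon-separated layout of that example, with each feature ending in `+` or `-`. The `TargetId` type and `parseTargetId` function are hypothetical names used for illustration.

```cpp
#include <map>
#include <sstream>
#include <string>

struct TargetId {
  std::string arch;                     // e.g. "gfx906"
  std::map<std::string, bool> features; // e.g. xnack -> true, sramecc -> false
};

// Split "arch:feature+:feature-" into the architecture name and a map from
// feature name to on/off. Fields without a trailing '+' or '-' are ignored.
TargetId parseTargetId(const std::string &text) {
  TargetId id;
  std::istringstream in(text);
  std::getline(in, id.arch, ':');
  for (std::string field; std::getline(in, field, ':');) {
    if (field.size() < 2)
      continue;
    const char sign = field.back();
    if (sign != '+' && sign != '-')
      continue;
    id.features[field.substr(0, field.size() - 1)] = (sign == '+');
  }
  return id;
}
```

For the example above, this yields the architecture `gfx906` with `xnack` enabled and `sramecc` disabled.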

JonChesterfield added a comment. · Edited · Aug 11 2021, 6:40 AM

> On Linux the availability of HIP runtime is the same as HSA runtime

This is probably not true. If ROCm is installed somewhere, both HIP and HSA runtimes are available. If building from source, HSA is much quicker and easier to build than the HIP runtimes.

jdoerfert added inline comments. · Aug 18 2021, 8:09 AM

the _aot_ names are not great.


licence is wrong.


This is not AMD.

cchen added a subscriber: cchen. · Aug 19 2021, 1:37 PM