Subarchitectures for multi-file compilation specified using -fopenmp-targets,
-Xopenmp-target, and -march were not getting added to the
<Triple, Set(Archs)> map DerivedArchs.
Details
- Reviewers
jdoerfert JonChesterfield jhuber6 yaxunl
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Sorry never noticed this revision. The purpose of this patch seems to be supporting something like this
clang input.c -fopenmp -fopenmp-targets=nvptx64 -Xopenmp-target=nvptx64 -march=sm_70 -Xopenmp-target=nvptx64 -march=sm_80
Right now the above works if you replace -march= with --offload-arch=. Currently the offloading tools use a "bound" architecture to tie a specific architecture with a job, which is what allows us to offload to multiple architectures. If there is no bound architecture gives, we instead use the -march= option, and if that is not present we derive it. It would be possible to set the bound architecture via -march if we wanted to. But I'm not sure if it's necessary given that it would just be an alternate syntax for --offload-arch=.
-Xopenmp-target -march used to be the only option to target a specific sub arch before --offload-arch. But, it doesn't support multiple archs.
This patch relies on infra used by --offload-arch to support this verbose method of specifying multiple archs.
Use case: people already familiar with -Xopenmp-target -march option are likely to use the same for multiple archs, until they learn about shorthand representation, --offload-arch.
Overloading the meaning of -march might not work here. Typically -march is just checked via Args.getLastArg(), so repeated uses just override the last one. I'm not exactly sure what the expected use is however, maybe @tra can help there. Although we could consider ones contained inside of -Xopenmp-target= to be different. That being said, if we wanted to support this I think the easiest way to do it would be to add handling for -march in the source here.
Overloading the meaning of -march might not work here. Typically -march is just checked via Args.getLastArg(), so repeated uses just override the last one. I'm not exactly sure what the expected use is however, maybe @tra can help there. Although we could consider ones contained inside of -Xopenmp-target= to be different. That being said, if we wanted to support this I think the easiest way to do it would be to add handling for -march in the source here.
-march is just not flexible enough. E.g. AMDGPU has GPUs that have the same -march, but are considered to be different offloading targets.
At some point we should start consolidating the ways we can specify an offload target and try to avoid adding new ones until then.
Agreed, that was my intention with making --offload-arch work for everything the same way in the new driver. The difference with CUDA / HIP and OpenMP right now is the default behavior if nothing was given. For CUDA / HIP we just default the bound architecture to something like sm_35 and gfx803 I believe. For OpenMP we keep the bound architecture empty which signals us to check the value of -march= and use that if present, or default to something more intelligent. Right now there's CLANG_OPENMP_NVPTX_DEFAULT_ARCH, which is defined by CMake to be the architecture of the system used to build clang, and amdgpu-arch which is just a program that runs at compile time. I'm not sure if there would be a desire to make CUDA / HIP adhere to this as well with the new driver.
That does not make sense to me. Most of the time clang would be built on a machine without a GPU, so I don't understand how one would derive a sensible value for CLANG_OPENMP_NVPTX_DEFAULT_ARCH there.
The vast majority of users will use official release builds of clang and that has no conceivable way to give a sensible default for any specific user. Any guess would be OK sometimes, but it would be wrong most of the time.
I'm all for providing a sensible default, but there's no such thing when it comes for GPUs. CUDA falls back on the oldest supported GPU architecture which has the only benefit of working for occasional manual tinkering and is being consistently wrong for about everyone and forcing them to specify the actual offload targets relevant to their use case.
So far it's the least bad and somewhat consistent approach I've seen.
It just defaults to sm_35 if CUDA isn't present on the system IIRC. Alternatively we could ship a tool to derive it at compile time.
I'm all for providing a sensible default, but there's no such thing when it comes for GPUs. CUDA falls back on the oldest supported GPU architecture which has the only benefit of working for occasional manual tinkering and is being consistently wrong for about everyone and forcing them to specify the actual offload targets relevant to their use case.
So far it's the least bad and somewhat consistent approach I've seen.
Sometimes people get tricked into thinking it works by the JIT performed on the PTX output. There's an argument to be made that we shouldn't support any defaults at all, since architectures like AMDGPU provide no such mutual compatibility.
As it happens, recent CUDA releases ship with bin/__nvcc_device_query which prints out the list of SM capabilities of the GPUs it sees.
Even that may not be the right value. E.g. only some of the GPUs on the machine may be intended for compute. It's not that uncommon to have a puny card to drive the display and one or more compute cards we actually want to compile for. There's no point compiling for a GPU variant which will never do any compute.
Interesting, may be worthwhile to query that if it exists, though AMD does this with amdgpu-arch which has led to problems for me in the past. But even if it could be wrong it will still spit out an architecture that would run on at least one local card rather than zero.
We do have existing precedent of -march=native, so it may make sense to introduce --offload-arch=native, though it may be a bit too ambiguous, considering that there may be more than one likely chouce -- multiple GPUs, possibly a mix from different vendors. Perhaps we would need something more specific, like --offload-arch=native-nvidia.
That's an interesting idea, we could put that functionality in OpenMP using this method as well.
-Xopenmp-target=nvptx64 -march=native
For CUDA / HIP we could also potentially just make users specify which toolchain the native applies to somewhat like -Xopenmp-target. Might be a more general solution than attaching it to the architecture string.