Page MenuHomePhabricator

[Clang][OpenMP] Process multi-arch compilation options given via -march
Needs ReviewPublic

Authored by saiislam on Jun 17 2022, 12:11 PM.

Details

Summary

Subarchitectures for multi-file compilation specified using -fopenmp-targets,
-Xopenmp-target, and -march were not getting added to the
<Triple, Set(Archs)> map DerivedArchs.

Diff Detail

Event Timeline

saiislam created this revision.Jun 17 2022, 12:11 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 17 2022, 12:11 PM
Herald added a subscriber: guansong. · View Herald Transcript
saiislam requested review of this revision.Jun 17 2022, 12:11 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 17 2022, 12:11 PM
saiislam updated this revision to Diff 438004.Jun 17 2022, 12:15 PM

clang-formatted.

saiislam edited the summary of this revision. (Show Details)Jun 17 2022, 12:17 PM

Sorry never noticed this revision. The purpose of this patch seems to be supporting something like this

clang input.c -fopenmp -fopenmp-targets=nvptx64 -Xopenmp-target=nvptx64 -march=sm_70 -Xopenmp-target=nvptx64 -march=sm_80

Right now the above works if you replace -march= with --offload-arch=. Currently the offloading tools use a "bound" architecture to tie a specific architecture with a job, which is what allows us to offload to multiple architectures. If there is no bound architecture gives, we instead use the -march= option, and if that is not present we derive it. It would be possible to set the bound architecture via -march if we wanted to. But I'm not sure if it's necessary given that it would just be an alternate syntax for --offload-arch=.

Sorry never noticed this revision. The purpose of this patch seems to be supporting something like this

clang input.c -fopenmp -fopenmp-targets=nvptx64 -Xopenmp-target=nvptx64 -march=sm_70 -Xopenmp-target=nvptx64 -march=sm_80

Right now the above works if you replace -march= with --offload-arch=. Currently the offloading tools use a "bound" architecture to tie a specific architecture with a job, which is what allows us to offload to multiple architectures. If there is no bound architecture gives, we instead use the -march= option, and if that is not present we derive it. It would be possible to set the bound architecture via -march if we wanted to. But I'm not sure if it's necessary given that it would just be an alternate syntax for --offload-arch=.

-Xopenmp-target -march used to be the only option to target a specific sub arch before --offload-arch. But, it doesn't support multiple archs.
This patch relies on infra used by --offload-arch to support this verbose method of specifying multiple archs.

Use case: people already familiar with -Xopenmp-target -march option are likely to use the same for multiple archs, until they learn about shorthand representation, --offload-arch.

jhuber6 added a subscriber: tra.Wed, Jul 13, 10:33 AM

-Xopenmp-target -march used to be the only option to target a specific sub arch before --offload-arch. But, it doesn't support multiple archs.
This patch relies on infra used by --offload-arch to support this verbose method of specifying multiple archs.

Use case: people already familiar with -Xopenmp-target -march option are likely to use the same for multiple archs, until they learn about shorthand representation, --offload-arch.

Overloading the meaning of -march might not work here. Typically -march is just checked via Args.getLastArg(), so repeated uses just override the last one. I'm not exactly sure what the expected use is however, maybe @tra can help there. Although we could consider ones contained inside of -Xopenmp-target= to be different. That being said, if we wanted to support this I think the easiest way to do it would be to add handling for -march in the source here.

tra added a comment.Wed, Jul 13, 10:55 AM

Overloading the meaning of -march might not work here. Typically -march is just checked via Args.getLastArg(), so repeated uses just override the last one. I'm not exactly sure what the expected use is however, maybe @tra can help there. Although we could consider ones contained inside of -Xopenmp-target= to be different. That being said, if we wanted to support this I think the easiest way to do it would be to add handling for -march in the source here.

-march is just not flexible enough. E.g. AMDGPU has GPUs that have the same -march, but are considered to be different offloading targets.

At some point we should start consolidating the ways we can specify an offload target and try to avoid adding new ones until then.

At some point we should start consolidating the ways we can specify an offload target and try to avoid adding new ones until then.

Agreed, that was my intention with making --offload-arch work for everything the same way in the new driver. The difference with CUDA / HIP and OpenMP right now is the default behavior if nothing was given. For CUDA / HIP we just default the bound architecture to something like sm_35 and gfx803 I believe. For OpenMP we keep the bound architecture empty which signals us to check the value of -march= and use that if present, or default to something more intelligent. Right now there's CLANG_OPENMP_NVPTX_DEFAULT_ARCH, which is defined by CMake to be the architecture of the system used to build clang, and amdgpu-arch which is just a program that runs at compile time. I'm not sure if there would be a desire to make CUDA / HIP adhere to this as well with the new driver.

tra added a comment.Wed, Jul 13, 11:25 AM

Right now there's CLANG_OPENMP_NVPTX_DEFAULT_ARCH, which is defined by CMake to be the architecture of the system used to build clang

That does not make sense to me. Most of the time clang would be built on a machine without a GPU, so I don't understand how one would derive a sensible value for CLANG_OPENMP_NVPTX_DEFAULT_ARCH there.
The vast majority of users will use official release builds of clang and that has no conceivable way to give a sensible default for any specific user. Any guess would be OK sometimes, but it would be wrong most of the time.

I'm all for providing a sensible default, but there's no such thing when it comes for GPUs. CUDA falls back on the oldest supported GPU architecture which has the only benefit of working for occasional manual tinkering and is being consistently wrong for about everyone and forcing them to specify the actual offload targets relevant to their use case.
So far it's the least bad and somewhat consistent approach I've seen.

Right now there's CLANG_OPENMP_NVPTX_DEFAULT_ARCH, which is defined by CMake to be the architecture of the system used to build clang

That does not make sense to me. Most of the time clang would be built on a machine without a GPU, so I don't understand how one would derive a sensible value for CLANG_OPENMP_NVPTX_DEFAULT_ARCH there.
The vast majority of users will use official release builds of clang and that has no conceivable way to give a sensible default for any specific user. Any guess would be OK sometimes, but it would be wrong most of the time.

It just defaults to sm_35 if CUDA isn't present on the system IIRC. Alternatively we could ship a tool to derive it at compile time.

I'm all for providing a sensible default, but there's no such thing when it comes for GPUs. CUDA falls back on the oldest supported GPU architecture which has the only benefit of working for occasional manual tinkering and is being consistently wrong for about everyone and forcing them to specify the actual offload targets relevant to their use case.
So far it's the least bad and somewhat consistent approach I've seen.

Sometimes people get tricked into thinking it works by the JIT performed on the PTX output. There's an argument to be made that we shouldn't support any defaults at all, since architectures like AMDGPU provide no such mutual compatibility.

tra added a comment.Wed, Jul 13, 11:44 AM

It just defaults to sm_35 if CUDA isn't present on the system IIRC. Alternatively we could ship a tool to derive it at compile time.

As it happens, recent CUDA releases ship with bin/__nvcc_device_query which prints out the list of SM capabilities of the GPUs it sees.

Even that may not be the right value. E.g. only some of the GPUs on the machine may be intended for compute. It's not that uncommon to have a puny card to drive the display and one or more compute cards we actually want to compile for. There's no point compiling for a GPU variant which will never do any compute.

It just defaults to sm_35 if CUDA isn't present on the system IIRC. Alternatively we could ship a tool to derive it at compile time.

As it happens, recent CUDA releases ship with bin/__nvcc_device_query which prints out the list of SM capabilities of the GPUs it sees.

Even that may not be the right value. E.g. only some of the GPUs on the machine may be intended for compute. It's not that uncommon to have a puny card to drive the display and one or more compute cards we actually want to compile for. There's no point compiling for a GPU variant which will never do any compute.

Interesting, may be worthwhile to query that if it exists, though AMD does this with amdgpu-arch which has led to problems for me in the past. But even if it could be wrong it will still spit out an architecture that would run on at least one local card rather than zero.

tra added a comment.Wed, Jul 13, 1:03 PM

Interesting, may be worthwhile to query that if it exists, though AMD does this with amdgpu-arch which has led to problems for me in the past. But even if it could be wrong it will still spit out an architecture that would run on at least one local card rather than zero.

We do have existing precedent of -march=native, so it may make sense to introduce --offload-arch=native, though it may be a bit too ambiguous, considering that there may be more than one likely chouce -- multiple GPUs, possibly a mix from different vendors. Perhaps we would need something more specific, like --offload-arch=native-nvidia.

Interesting, may be worthwhile to query that if it exists, though AMD does this with amdgpu-arch which has led to problems for me in the past. But even if it could be wrong it will still spit out an architecture that would run on at least one local card rather than zero.

We do have existing precedent of -march=native, so it may make sense to introduce --offload-arch=native, though it may be a bit too ambiguous, considering that there may be more than one likely choice -- multiple GPUs, possibly a mix from different vendors. Perhaps we would need something more specific, like --offload-arch=native-nvidia.

That's an interesting idea, we could put that functionality in OpenMP using this method as well.

-Xopenmp-target=nvptx64 -march=native

For CUDA / HIP we could also potentially just make users specify which toolchain the native applies to somewhat like -Xopenmp-target. Might be a more general solution than attaching it to the architecture string.