This is an archive of the discontinued LLVM Phabricator instance.

Differential D128090

[Clang][OpenMP] Process multi-arch compilation options given via -march
Abandoned, Public

Authored by saiislam on Jun 17 2022, 12:11 PM.

Details

Reviewers

jdoerfert
JonChesterfield
jhuber6
yaxunl

Summary

Subarchitectures for multi-file compilation specified using -fopenmp-targets,
-Xopenmp-target, and -march were not getting added to the
<Triple, Set(Archs)> map DerivedArchs.
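As a rough illustration of the data structure the summary refers to, the sketch below models a <Triple, Set(Archs)> map with standard containers. The names DerivedArchMap and addDerivedArch are hypothetical and are not part of the Clang driver; the actual patch uses llvm::StringMap<llvm::DenseSet<StringRef>>.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the driver's <Triple, Set(Archs)> map.
using DerivedArchMap = std::map<std::string, std::set<std::string>>;

// Record one sub-architecture for a triple. Because the value is a set,
// naming the same architecture twice for a triple has no further effect.
void addDerivedArch(DerivedArchMap &Map, const std::string &Triple,
                    const std::string &Arch) {
  assert(!Triple.empty() && !Arch.empty() && "triple and arch are required");
  Map[Triple].insert(Arch);
}
```

Under these assumptions, adding sm_70 and sm_80 for nvptx64 leaves a single key holding two architectures, which is the shape the multi-arch handling needs.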

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

saiislam created this revision. Jun 17 2022, 12:11 PM

Herald added a project: Restricted Project. Jun 17 2022, 12:11 PM

Herald added a subscriber: guansong.

saiislam requested review of this revision. Jun 17 2022, 12:11 PM

Herald added a project: Restricted Project. Jun 17 2022, 12:11 PM

Herald added subscribers: cfe-commits, sstefan1, MaskRay.

clang-formatted.

saiislam edited the summary of this revision. Jun 17 2022, 12:17 PM

Harbormaster completed remote builds in B170584: Diff 438004. Jun 17 2022, 1:33 PM

Sorry, never noticed this revision. The purpose of this patch seems to be to support something like this:

clang input.c -fopenmp -fopenmp-targets=nvptx64 -Xopenmp-target=nvptx64 -march=sm_70 -Xopenmp-target=nvptx64 -march=sm_80

Right now the above works if you replace -march= with --offload-arch=. Currently the offloading tools use a "bound" architecture to tie a specific architecture to a job, which is what allows us to offload to multiple architectures. If no bound architecture is given, we instead use the -march= option, and if that is not present we derive it. It would be possible to set the bound architecture via -march if we wanted to, but I'm not sure it's necessary given that it would just be an alternate syntax for --offload-arch=.
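The fallback order described in this comment (bound architecture first, then -march=, then a derived value) can be sketched roughly as follows. The function selectOffloadArch is a hypothetical helper written for illustration, not a Clang API, and an empty string stands in for "not given".

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of Clang: pick the architecture for one
// offloading job. An empty string means the corresponding value was not given.
std::string selectOffloadArch(const std::string &BoundArch,
                              const std::string &MArch,
                              const std::string &DerivedDefault) {
  assert(!DerivedDefault.empty() && "a derived fallback must always exist");
  if (!BoundArch.empty())
    return BoundArch;    // set per job, e.g. via --offload-arch=
  if (!MArch.empty())
    return MArch;        // value of -march=, if one was passed
  return DerivedDefault; // otherwise whatever the toolchain derives
}
```

With these assumptions, selectOffloadArch("", "sm_70", "sm_35") yields "sm_70", while a non-empty bound architecture takes precedence over both other inputs.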

In D128090#3648210, @jhuber6 wrote:
Sorry, never noticed this revision. The purpose of this patch seems to be to support something like this:
clang input.c -fopenmp -fopenmp-targets=nvptx64 -Xopenmp-target=nvptx64 -march=sm_70 -Xopenmp-target=nvptx64 -march=sm_80
Right now the above works if you replace -march= with --offload-arch=. Currently the offloading tools use a "bound" architecture to tie a specific architecture to a job, which is what allows us to offload to multiple architectures. If no bound architecture is given, we instead use the -march= option, and if that is not present we derive it. It would be possible to set the bound architecture via -march if we wanted to, but I'm not sure it's necessary given that it would just be an alternate syntax for --offload-arch=.

-Xopenmp-target -march used to be the only option to target a specific sub arch before --offload-arch. But, it doesn't support multiple archs.
This patch relies on infra used by --offload-arch to support this verbose method of specifying multiple archs.

Use case: people already familiar with -Xopenmp-target -march option are likely to use the same for multiple archs, until they learn about shorthand representation, --offload-arch.

In D128090#3648879, @saiislam wrote:

-Xopenmp-target -march used to be the only option to target a specific sub arch before --offload-arch. But, it doesn't support multiple archs.
This patch relies on infra used by --offload-arch to support this verbose method of specifying multiple archs.

Use case: people already familiar with -Xopenmp-target -march option are likely to use the same for multiple archs, until they learn about shorthand representation, --offload-arch.

Overloading the meaning of -march might not work here. Typically -march is just checked via Args.getLastArg(), so repeated uses just override the last one. I'm not exactly sure what the expected use is however, maybe @tra can help there. Although we could consider ones contained inside of -Xopenmp-target= to be different. That being said, if we wanted to support this I think the easiest way to do it would be to add handling for -march in the source here.
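To make the concern concrete, here is a small sketch contrasting a last-one-wins lookup with a lookup that collects every occurrence. It works on plain strings; lastMArch and allMArch are illustrative names and only approximate the behavior attributed to Args.getLastArg() above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative stand-ins only; these are not Clang driver APIs.
static const std::string MArchPrefix = "-march=";

// True if Arg begins with "-march=".
static bool isMArch(const std::string &Arg) {
  return Arg.compare(0, MArchPrefix.size(), MArchPrefix) == 0;
}

// Last-one-wins lookup: returns the value of the final -march= flag,
// or an empty string if none is present.
std::string lastMArch(const std::vector<std::string> &Args) {
  std::string Result;
  for (const std::string &A : Args)
    if (isMArch(A))
      Result = A.substr(MArchPrefix.size());
  return Result;
}

// Collect-all lookup: returns every -march= value in command-line order.
std::vector<std::string> allMArch(const std::vector<std::string> &Args) {
  std::vector<std::string> Result;
  for (const std::string &A : Args)
    if (isMArch(A))
      Result.push_back(A.substr(MArchPrefix.size()));
  return Result;
}
```

Given the two flags -march=sm_70 and -march=sm_80, the first function sees only sm_80, whereas the second sees both; multi-arch support via -march would need the second behavior.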

Overloading the meaning of -march might not work here. Typically -march is just checked via Args.getLastArg(), so repeated uses just override the last one. I'm not exactly sure what the expected use is however, maybe @tra can help there. Although we could consider ones contained inside of -Xopenmp-target= to be different. That being said, if we wanted to support this I think the easiest way to do it would be to add handling for -march in the source here.

-march is just not flexible enough. E.g. AMDGPU has GPUs that have the same -march, but are considered to be different offloading targets.

At some point we should start consolidating the ways we can specify an offload target and try to avoid adding new ones until then.

In D128090#3648984, @tra wrote:

At some point we should start consolidating the ways we can specify an offload target and try to avoid adding new ones until then.

Agreed, that was my intention with making --offload-arch work for everything the same way in the new driver. The difference with CUDA / HIP and OpenMP right now is the default behavior if nothing was given. For CUDA / HIP we just default the bound architecture to something like sm_35 and gfx803 I believe. For OpenMP we keep the bound architecture empty which signals us to check the value of -march= and use that if present, or default to something more intelligent. Right now there's CLANG_OPENMP_NVPTX_DEFAULT_ARCH, which is defined by CMake to be the architecture of the system used to build clang, and amdgpu-arch which is just a program that runs at compile time. I'm not sure if there would be a desire to make CUDA / HIP adhere to this as well with the new driver.

In D128090#3648999, @jhuber6 wrote:

Right now there's CLANG_OPENMP_NVPTX_DEFAULT_ARCH, which is defined by CMake to be the architecture of the system used to build clang

That does not make sense to me. Most of the time clang would be built on a machine without a GPU, so I don't understand how one would derive a sensible value for CLANG_OPENMP_NVPTX_DEFAULT_ARCH there.
The vast majority of users will use official release builds of clang and that has no conceivable way to give a sensible default for any specific user. Any guess would be OK sometimes, but it would be wrong most of the time.

I'm all for providing a sensible default, but there's no such thing when it comes to GPUs. CUDA falls back on the oldest supported GPU architecture, which has the sole benefit of working for occasional manual tinkering; it is consistently wrong for just about everyone and forces them to specify the actual offload targets relevant to their use case.
So far it's the least bad and somewhat consistent approach I've seen.

In D128090#3649059, @tra wrote:

In D128090#3648999, @jhuber6 wrote:

Right now there's CLANG_OPENMP_NVPTX_DEFAULT_ARCH, which is defined by CMake to be the architecture of the system used to build clang

That does not make sense to me. Most of the time clang would be built on a machine without a GPU, so I don't understand how one would derive a sensible value for CLANG_OPENMP_NVPTX_DEFAULT_ARCH there.
The vast majority of users will use official release builds of clang and that has no conceivable way to give a sensible default for any specific user. Any guess would be OK sometimes, but it would be wrong most of the time.

It just defaults to sm_35 if CUDA isn't present on the system IIRC. Alternatively we could ship a tool to derive it at compile time.

I'm all for providing a sensible default, but there's no such thing when it comes to GPUs. CUDA falls back on the oldest supported GPU architecture, which has the sole benefit of working for occasional manual tinkering; it is consistently wrong for just about everyone and forces them to specify the actual offload targets relevant to their use case.
So far it's the least bad and somewhat consistent approach I've seen.

Sometimes people get tricked into thinking it works by the JIT performed on the PTX output. There's an argument to be made that we shouldn't support any defaults at all, since architectures like AMDGPU provide no such mutual compatibility.

In D128090#3649125, @jhuber6 wrote:

It just defaults to sm_35 if CUDA isn't present on the system IIRC. Alternatively we could ship a tool to derive it at compile time.

As it happens, recent CUDA releases ship with bin/__nvcc_device_query which prints out the list of SM capabilities of the GPUs it sees.

Even that may not be the right value. E.g. only some of the GPUs on the machine may be intended for compute. It's not that uncommon to have a puny card to drive the display and one or more compute cards we actually want to compile for. There's no point compiling for a GPU variant which will never do any compute.

In D128090#3649202, @tra wrote:

In D128090#3649125, @jhuber6 wrote:

It just defaults to sm_35 if CUDA isn't present on the system IIRC. Alternatively we could ship a tool to derive it at compile time.

As it happens, recent CUDA releases ship with bin/__nvcc_device_query which prints out the list of SM capabilities of the GPUs it sees.

Even that may not be the right value. E.g. only some of the GPUs on the machine may be intended for compute. It's not that uncommon to have a puny card to drive the display and one or more compute cards we actually want to compile for. There's no point compiling for a GPU variant which will never do any compute.

Interesting, may be worthwhile to query that if it exists, though AMD does this with amdgpu-arch which has led to problems for me in the past. But even if it could be wrong it will still spit out an architecture that would run on at least one local card rather than zero.

In D128090#3649235, @jhuber6 wrote:

Interesting, may be worthwhile to query that if it exists, though AMD does this with amdgpu-arch which has led to problems for me in the past. But even if it could be wrong it will still spit out an architecture that would run on at least one local card rather than zero.

We do have existing precedent of -march=native, so it may make sense to introduce --offload-arch=native, though it may be a bit too ambiguous, considering that there may be more than one likely choice -- multiple GPUs, possibly a mix from different vendors. Perhaps we would need something more specific, like --offload-arch=native-nvidia.
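The ambiguity raised here can be illustrated with a sketch of how a "native" request might be resolved against a list of detected GPUs. Everything in it is hypothetical: DetectedGpu, resolveNative, and the vendor strings are assumptions for illustration, and the thread only proposes the native and native-nvidia spellings.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical description of one GPU found on the build machine.
struct DetectedGpu {
  std::string Vendor; // e.g. "nvidia" or "amd" (assumed spellings)
  std::string Arch;   // e.g. "sm_80" or "gfx90a"
};

// Hypothetical resolver: "native" expands to every detected architecture,
// "native-<vendor>" only to that vendor's, and anything else passes through.
std::set<std::string> resolveNative(const std::string &Request,
                                    const std::vector<DetectedGpu> &Gpus) {
  const std::string VendorPrefix = "native-";
  std::string Vendor; // empty means "no vendor filter"
  if (Request.compare(0, VendorPrefix.size(), VendorPrefix) == 0)
    Vendor = Request.substr(VendorPrefix.size());
  else if (Request != "native")
    return {Request}; // an explicit architecture, nothing to resolve

  std::set<std::string> Archs;
  for (const DetectedGpu &G : Gpus)
    if (Vendor.empty() || G.Vendor == Vendor)
      Archs.insert(G.Arch);
  return Archs;
}
```

On a machine with one NVIDIA and one AMD card, plain "native" resolves to two architectures from different toolchains under this sketch, which is the ambiguity a vendor-qualified spelling would avoid.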

In D128090#3649579, @tra wrote:

In D128090#3649235, @jhuber6 wrote:

Interesting, may be worthwhile to query that if it exists, though AMD does this with amdgpu-arch which has led to problems for me in the past. But even if it could be wrong it will still spit out an architecture that would run on at least one local card rather than zero.

We do have existing precedent of -march=native, so it may make sense to introduce --offload-arch=native, though it may be a bit too ambiguous, considering that there may be more than one likely choice -- multiple GPUs, possibly a mix from different vendors. Perhaps we would need something more specific, like --offload-arch=native-nvidia.

That's an interesting idea, we could put that functionality in OpenMP using this method as well.

-Xopenmp-target=nvptx64 -march=native

For CUDA / HIP we could also potentially just make users specify which toolchain the native applies to somewhat like -Xopenmp-target. Might be a more general solution than attaching it to the architecture string.

saiislam abandoned this revision. Sep 7 2023, 11:41 PM

Herald added a subscriber: jplehr. Sep 7 2023, 11:41 PM

Revision Contents

Path	Size
clang/include/clang/Driver/Driver.h	5 lines
clang/lib/Driver/Driver.cpp	36 lines

Diff 438004

clang/include/clang/Driver/Driver.h

[406 lines not shown; enclosing context: public:]
   /// @name Primary Functionality
   /// @{

   /// CreateOffloadingDeviceToolChains - create all the toolchains required to
   /// support offloading devices given the programming models specified in the
   /// current compilation. Also, update the host tool chain kind accordingly.
   void CreateOffloadingDeviceToolChains(Compilation &C, InputList &Inputs);

+  /// GetTargetInfoFromMArch - extract sub-architecture from -march flag used
+  /// with -fopenmp-targets and -Xopenmp-target options.
+  bool GetTargetInfoFromMArch(
+      Compilation &C, llvm::StringMap<llvm::DenseSet<StringRef>> &DerivedArchs);
+
   /// BuildCompilation - Construct a compilation object for a command
   /// line argument vector.
   ///
   /// \return A compilation, or 0 if none was built for the given
   /// argument vector. A null return value does not necessarily
   /// indicate an error condition, the diagnostics should be queried
   /// to determine if an error occurred.
   Compilation *BuildCompilation(ArrayRef<const char *> Args);
[343 lines not shown]

clang/lib/Driver/Driver.cpp

[716 lines not shown; enclosing context: if (RT == OMPRT_Unknown) {]
     else
       // FIXME: We could use a nicer diagnostic here.
       Diag(diag::err_drv_unsupported_opt) << "-fopenmp";
   }

   return RT;
 }

+bool Driver::GetTargetInfoFromMArch(
+    Compilation &C, llvm::StringMap<llvm::DenseSet<StringRef>> &DerivedArchs) {
+  StringRef OpenMPTargetArch;
+  for (Arg *A : C.getInputArgs()) {
+    if (A->getOption().matches(options::OPT_Xopenmp_target_EQ)) {
+      StringRef OpenMPTargetTriple = StringRef(A->getValue(0));
+      llvm::Triple TargetTriple(OpenMPTargetTriple);
+
+      for (auto *V : A->getValues()) {
+        StringRef VStr = StringRef(V);
+        if (VStr.startswith("-march=") || VStr.startswith("--march=")) {
+          OpenMPTargetArch = VStr.split('=').second;
+          CudaArch Arch = StringToCudaArch(StringRef(OpenMPTargetArch));
+          if (Arch == CudaArch::UNKNOWN) {
+            C.getDriver().Diag(clang::diag::err_drv_cuda_bad_gpu_arch)
+                << OpenMPTargetArch;
+            C.setContainsError();
+            return false;
+          }
+
+          if (!OpenMPTargetTriple.empty() && !OpenMPTargetArch.empty()) {
+            DerivedArchs[OpenMPTargetTriple].insert(OpenMPTargetArch);
+          }
+        }
+        A->claim();
+      }
+    }
+  }
+
+  return true;
+}
+
 void Driver::CreateOffloadingDeviceToolChains(Compilation &C,
                                               InputList &Inputs) {

   //
   // CUDA/HIP
   //
   // We need to generate a CUDA/HIP toolchain if any of the inputs has a CUDA
   // or HIP type. However, mixed CUDA/HIP compilation is not supported.
[74 lines not shown; enclosing context: if (IsOpenMPOffloading) {]
     // attempt to derive the appropriate toolchains from the arguments.
     if (Arg *OpenMPTargets =
             C.getInputArgs().getLastArg(options::OPT_fopenmp_targets_EQ)) {
       if (OpenMPTargets && !OpenMPTargets->getNumValues()) {
         Diag(clang::diag::warn_drv_empty_joined_argument)
             << OpenMPTargets->getAsString(C.getInputArgs());
         return;
       }
+      // Process legacy option -fopenmp-targets -Xopenmp-target and -march
+      auto status = GetTargetInfoFromMArch(C, DerivedArchs);
+      if (!status)
+        return;
       llvm::copy(OpenMPTargets->getValues(), std::back_inserter(OpenMPTriples));
     } else if (C.getInputArgs().hasArg(options::OPT_offload_arch_EQ) &&
                !IsHIP && !IsCuda) {
       const ToolChain *HostTC = C.getSingleOffloadToolChain<Action::OFK_Host>();
       auto AMDTriple = getHIPOffloadTargetTriple(*this, C.getInputArgs());
       auto NVPTXTriple = getNVIDIAOffloadTargetTriple(*this, C.getInputArgs(),
                                                       HostTC->getTriple());

[5,472 lines not shown]