This is an archive of the discontinued LLVM Phabricator instance.

I'm concerned that if we make it a top-level option, it's likely to be cargo-culted and (mis)used as a sledgehammer in cases where it's not needed. I think the option should remain hidden.

While thresholds do need to be tweaked, in my experience it happens very rarely.
When it does, most of the time it's sufficient to apply __attribute__((always_inline)) to a few functions where it matters.
If AMDGPU bumps into the limit too often, perhaps it's the default threshold value that needs to be changed.
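As an illustration of the source-level fix, here is a minimal sketch of forcing one hot helper to inline regardless of the inliner's threshold. The names `axpy` and `saxpy_sum` are hypothetical, and the code is plain host C++ so it builds anywhere; in HIP or CUDA code the helper would typically also carry `__device__`.

```cpp
#include <cstddef>

// Hypothetical hot helper. The attribute asks the compiler to inline it at
// every call site, independent of the inliner's cost threshold.
__attribute__((always_inline)) inline float axpy(float a, float x, float y) {
  return a * x + y;
}

// Hypothetical caller with a hot loop where the call overhead would matter.
float saxpy_sum(float a, const float *x, const float *y, std::size_t n) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < n; ++i)
    sum += axpy(a, x[i], y[i]);
  return sum;
}
```

Only the functions that matter get annotated, so the global threshold stays untouched.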

If we do add an option to control inlining threshold, then we should also consider doing the same for other thresholds.
For instance, loop unrolling thresholds in my experience need bumping up about as often as the inlining ones.
Similarly, most of the time the issue can be dealt with at the source code level with #pragma unroll.
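A comparable sketch for the unrolling case, assuming a loop whose trip count is known at compile time. The names `kTaps` and `fir` are hypothetical. Clang honors `#pragma unroll`; a compiler that does not recognize the pragma may simply ignore it, possibly with a warning.

```cpp
// Hypothetical fixed trip count, known at compile time.
constexpr int kTaps = 8;

// Dot product over kTaps elements. The pragma requests full unrolling of
// this one loop without touching any global unrolling threshold.
float fir(const float *samples, const float *coeffs) {
  float acc = 0.0f;
#pragma unroll
  for (int i = 0; i < kTaps; ++i)
    acc += samples[i] * coeffs[i];
  return acc;
}
```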

Perhaps we should generalize this patch to deal with a wider range of thresholds.
E.g. we could have something like --gpu-threshold threshold-kind=x which would expand to appropriate cc1 options for GPU sub-compilations.
It would also be nice to handle it in a way that can be used by both CUDA and HIP w/o having to copy/paste code.

Also, this patch would not be necessary if we had a generalized way to specify options for particular offload targets. Alas, we don't have it yet.

In D99233#2656446, @tra wrote:

> I'm concerned that if we make it a top-level option, it's likely to be cargo-culted and (mis)used as a sledgehammer in cases where it's not needed. I think the option should remain hidden.
>
> While thresholds do need to be tweaked, in my experience it happens very rarely.
> When it does, most of the time it's sufficient to apply __attribute__((always_inline)) to a few functions where it matters.
> If AMDGPU bumps into the limit too often, perhaps it's the default threshold value that needs to be changed.

Currently ROCm builds all math libs and frameworks with an LLVM option that inlines all functions for the AMDGPU target. We cannot simply remove that option and use the default inline threshold, since that would cause performance degradations. We cannot use -mllvm -inline-threshold=x directly either, since it would affect both host and device compilation. We need an option that sets the inline threshold for the GPU only so that we can fine-tune it. I agree that this option should be hidden, since it is intended for compiler development.

> If we do add an option to control inlining threshold, then we should also consider doing the same for other thresholds.
> For instance, loop unrolling thresholds in my experience need bumping up about as often as the inlining ones.
> Similarly, most of the time the issue can be dealt with at the source code level with #pragma unroll.
>
> Perhaps we should generalize this patch to deal with a wider range of thresholds.
> E.g. we could have something like --gpu-threshold threshold-kind=x which would expand to appropriate cc1 options for GPU sub-compilations.

My concern with --gpu-threshold threshold-kind=x is that it needs custom handling, e.g. setting default values, letting the last option win if multiple options are set. Using separate options allows standard handling of these options.
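To make the custom handling concrete, here is a rough sketch of what a combined `--gpu-threshold kind=x` option might require. Everything in it is hypothetical: the option was only proposed in this thread, the function `parseGpuThresholds` does not exist in Clang, and the default values are placeholders rather than LLVM's real defaults.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

using Thresholds = std::map<std::string, unsigned>;

// Parses values such as "inline=1000" in command-line order.
// Returns std::nullopt on a malformed value or an unknown threshold kind.
std::optional<Thresholds>
parseGpuThresholds(const std::vector<std::string> &values) {
  // Placeholder defaults, chosen only for illustration.
  Thresholds result = {{"inline", 100}, {"unroll", 100}};
  for (const std::string &v : values) {
    const std::size_t eq = v.find('=');
    if (eq == std::string::npos || eq == 0 || eq + 1 == v.size())
      return std::nullopt; // missing kind or missing number
    const std::string kind = v.substr(0, eq);
    const std::string num = v.substr(eq + 1);
    auto it = result.find(kind);
    if (it == result.end())
      return std::nullopt; // unknown threshold kind
    const bool digits =
        std::all_of(num.begin(), num.end(),
                    [](unsigned char c) { return std::isdigit(c) != 0; });
    if (!digits || num.size() > 9)
      return std::nullopt; // not a small non-negative integer
    // A later occurrence overwrites an earlier one: the last option wins.
    it->second = static_cast<unsigned>(std::stoul(num));
  }
  return result;
}
```

With separate options, the defaulting and last-one-wins behavior come from the driver's standard option handling instead of hand-written code like this.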

> It would also be nice to handle it in a way that can be used by both CUDA and HIP w/o having to copy/paste code.

I can add it as a ToolChain member function so that different ToolChains can use it.

> Also, this patch would not be necessary if we had a generalized way to specify options for particular offload targets. Alas, we don't have it yet.

The planned new option for offloading will be a more generic solution; however, I expect it will take time to develop and be adopted.

> The planned new option for offloading will be a more generic solution; however, I expect it will take time to develop and be adopted.

Agreed. OK, let's use a hidden option until we have a better way of dealing with this.

clang/include/clang/Driver/Options.td, line 960:
This option is only handled at the top level, so it does not need `CC1Option`. It does need `HelpHidden`. Also, the option should probably be `-fgpu-inline-threshold=...` as it's a parameter tweak, and not something more serious, like `--offload-arch`. Naming is hard. :-)

yaxunl marked an inline comment as done. Apr 21 2021, 11:34 AM

yaxunl added inline comments.

clang/include/clang/Driver/Options.td, line 960:
will do

Revised per Artem's comments.

yaxunl retitled this revision from [HIP] Add option --gpu-inline-threshold to [HIP] Add option -fgpu-inline-threshold. Apr 21 2021, 12:45 PM

tra accepted this revision. Apr 21 2021, 1:03 PM

This revision is now accepted and ready to land. Apr 21 2021, 1:03 PM

Harbormaster completed remote builds in B100075: Diff 339353. Apr 21 2021, 1:50 PM

This revision was landed with ongoing or failed builds. Apr 21 2021, 2:19 PM

Closed by commit rG5a2d78b16397: [HIP] Add option -fgpu-inline-threshold (authored by yaxunl).

This revision was automatically updated to reflect the committed changes.

yaxunl added a commit: rG5a2d78b16397: [HIP] Add option -fgpu-inline-threshold.

Herald added a project: Restricted Project. Apr 21 2021, 2:19 PM

Revision Contents

Path                                       Size
clang/include/clang/Driver/Options.td      3 lines
clang/lib/Driver/ToolChains/Clang.cpp      10 lines
clang/test/Driver/hip-options.hip          5 lines

Diff 339380

clang/include/clang/Driver/Options.td

 defm gpu_exclude_wrong_side_overloads : BoolFOption<"gpu-exclude-wrong-side-overloads",
   PosFlag<SetTrue, [CC1Option], "Always exclude wrong side overloads">,
   NegFlag<SetFalse, [], "Exclude wrong side overloads only if there are same side overloads">,
   BothFlags<[HelpHidden], " in overloading resolution for CUDA/HIP">>;
 def gpu_max_threads_per_block_EQ : Joined<["--"], "gpu-max-threads-per-block=">,
   Flags<[CC1Option]>,
   HelpText<"Default max threads per block for kernel launch bounds for HIP">,
   MarshallingInfoInt<LangOpts<"GPUMaxThreadsPerBlock">, "1024">,
   ShouldParseIf<hip.KeyPath>;
+def fgpu_inline_threshold_EQ : Joined<["-"], "fgpu-inline-threshold=">,
+  Flags<[HelpHidden]>,
+  HelpText<"Inline threshold for device compilation for CUDA/HIP">;
 def gpu_instrument_lib_EQ : Joined<["--"], "gpu-instrument-lib=">,
   HelpText<"Instrument device library for HIP, which is a LLVM bitcode containing "
   "__cyg_profile_func_enter and __cyg_profile_func_exit">;
 defm gpu_sanitize : BoolFOption<"gpu-sanitize", EmptyKPM, DefaultFalse,
   PosFlag<SetTrue, [], "Enable sanitizer for AMDGPU target">,
   NegFlag<SetFalse>>;
 def cuid_EQ : Joined<["-"], "cuid=">, Flags<[CC1Option]>,
   HelpText<"An ID for compilation unit, which should be the same for the same "

clang/lib/Driver/ToolChains/Clang.cpp

   if (IsCuda || IsHIP) {
     auto CUID = cast<InputAction>(SourceAction)->getId();
     if (!CUID.empty())
       CmdArgs.push_back(Args.MakeArgString(Twine("-cuid=") + Twine(CUID)));
   }

   if (IsHIP)
     CmdArgs.push_back("-fcuda-allow-variadic-functions");

+  if (IsCudaDevice || IsHIPDevice) {
+    StringRef InlineThresh =
+        Args.getLastArgValue(options::OPT_fgpu_inline_threshold_EQ);
+    if (!InlineThresh.empty()) {
+      std::string ArgStr =
+          std::string("-inline-threshold=") + InlineThresh.str();
+      CmdArgs.append({"-mllvm", Args.MakeArgStringRef(ArgStr)});
+    }
+  }
+
   // OpenMP offloading device jobs take the argument -fopenmp-host-ir-file-path
   // to specify the result of the compile phase on the host, so the meaningful
   // device declarations can be identified. Also, -fopenmp-is-device is passed
   // along to tell the frontend that it is generating code for a device, so that
   // only the relevant declarations are emitted.
   if (IsOpenMPDevice) {
     CmdArgs.push_back("-fopenmp-is-device");
     if (OpenMPDeviceInput) {
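The device-only forwarding in the Clang.cpp change can be mirrored in a standalone sketch that needs no LLVM headers. This is not the driver's code: `JobKind`, `lastInlineThreshold`, and `addGpuInlineThreshold` are hypothetical stand-ins for the driver's job flags, `Args.getLastArgValue`, and the new block, assuming arguments arrive as plain strings.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the driver's IsCudaDevice / IsHIPDevice flags.
enum class JobKind { Host, CudaDevice, HipDevice };

// Returns the value of the last -fgpu-inline-threshold= argument, if any.
std::optional<std::string>
lastInlineThreshold(const std::vector<std::string> &driverArgs) {
  const std::string prefix = "-fgpu-inline-threshold=";
  std::optional<std::string> last;
  for (const std::string &arg : driverArgs)
    if (arg.compare(0, prefix.size(), prefix) == 0)
      last = arg.substr(prefix.size()); // later occurrences override earlier
  return last;
}

// Appends "-mllvm -inline-threshold=N" to device jobs only, so the host
// compilation keeps the default threshold.
void addGpuInlineThreshold(JobKind kind,
                           const std::vector<std::string> &driverArgs,
                           std::vector<std::string> &cmdArgs) {
  if (kind == JobKind::Host)
    return;
  const std::optional<std::string> value = lastInlineThreshold(driverArgs);
  if (!value || value->empty())
    return;
  cmdArgs.push_back("-mllvm");
  cmdArgs.push_back("-inline-threshold=" + *value);
}
```

Calling `addGpuInlineThreshold` once per job with the same driver arguments leaves the host command line unchanged and extends only the device one, which is the behavior the THRESH and THRESH-NOT checks in hip-options.hip look for.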

clang/test/Driver/hip-options.hip

 // FIX-OVERLOAD: clang{{.*}} "-triple" "x86_64-unknown-linux-gnu" {{.*}} "-fgpu-exclude-wrong-side-overloads" "-fgpu-defer-diag"

 // Check -mconstructor-aliases is not passed to device compilation.

 // RUN: %clang -### -target x86_64-unknown-linux-gnu -nogpuinc -nogpulib \
 // RUN:   --cuda-gpu-arch=gfx906 %s 2>&1 | FileCheck -check-prefix=CTA %s
 // CTA: clang{{.*}} "-triple" "x86_64-unknown-linux-gnu" {{.*}} "-mconstructor-aliases"
 // CTA-NOT: clang{{.*}} "-triple" "amdgcn-amd-amdhsa" {{.*}} "-mconstructor-aliases"
+
+// RUN: %clang -### -target x86_64-unknown-linux-gnu -nogpuinc -nogpulib \
+// RUN:   --offload-arch=gfx906 -fgpu-inline-threshold=1000 %s 2>&1 | FileCheck -check-prefix=THRESH %s
+// THRESH: clang{{.*}} "-triple" "amdgcn-amd-amdhsa" {{.*}} "-mllvm" "-inline-threshold=1000"
+// THRESH-NOT: clang{{.*}} "-triple" "x86_64-unknown-linux-gnu" {{.*}} "-inline-threshold=1000"
