This is an archive of the discontinued LLVM Phabricator instance.

clang/lib/Driver/ToolChains/Cuda.cpp
433	I think this would be contrary to the expectation that lack of `-O` in clang means `do not optimize`, and it generally implies the whole compilation chain, including the assembler. Matching whatever NVIDIA tools do is an insufficient reason for breaking this assumption, IMO. If you do want to run optimized ptxas on unoptimized PTX, you can use `-Xcuda-ptxas -O3`.

hdelan added inline comments. Jan 5 2022, 2:34 AM

clang/lib/Driver/ToolChains/Cuda.cpp
433	I think for the average user, consistency across the `ptxjitcompiler` and `ptxas` is far more important than assuming that no `-O` means no optimization. I think most users will assume that, with no `-O`, whatever tools are being used will take their default optimization level, which in the case of clang is `-O0` and in the case of `ptxas` is `-O3`. We have had a few bugs with `ptxas`/`ptxjitcompiler` at higher optimization levels, which were quite hard to pin down since offline `ptxas` and `ptxjitcompiler` were using different optimisation levels, making bugs appear in one and not the other. Of course we are aware of this now, but this inconsistency can result in bugs that are difficult to diagnose. Having consistency between the `ptxjitcompiler` and `ptxas` is therefore of practical benefit, whereas if we leave it as is, with `ptxas` defaulting to `-O0`, the benefit is purely semantic and not practical.

tra added inline comments. Jan 5 2022, 10:52 AM

clang/lib/Driver/ToolChains/Cuda.cpp
433	> I think for the average user, consistency across the ptxjitcompiler and ptxas is far more important than assuming that no -O means no optimization.

The default is intended to provide the fewest surprises for the most users. There are more users of clang as a CUDA compiler than users of clang as a CUDA compiler who care about consistency with ptxjitcompiler. My point is that the improvements for a subset of users should be balanced against usability in the common case. In this case the benefit does not justify the downsides, IMO. Please add me as a reviewer when the patch is ready for public review and we'll discuss it in the wider LLVM community.

hdelan added inline comments. Jan 31 2022, 1:56 AM

clang/lib/Driver/ToolChains/Cuda.cpp
433	We have come to the same conclusion that it is best to leave this unchanged upstream. However, this change has been made locally in `intel/llvm`.

Closing revision

Revision Contents

Path	Size
clang/lib/Driver/ToolChains/Cuda.cpp	9 lines
clang/test/Driver/cuda-external-tools.cu	6 lines

Diff 397236

clang/lib/Driver/ToolChains/Cuda.cpp

@@ void NVPTX::Assembler::ConstructJob(Compilation &C, const JobAction &JA,
   if (DIKind == EmitSameDebugInfoAsHost) {
     // ptxas does not accept -g option if optimization is enabled, so
     // we ignore the compiler's -O* options if we want debug info.
     CmdArgs.push_back("-g");
     CmdArgs.push_back("--dont-merge-basicblocks");
     CmdArgs.push_back("--return-at-end");
   } else if (Arg *A = Args.getLastArg(options::OPT_O_Group)) {
     // Map the -O we received to -O{0,1,2,3}.
-    //
-    // TODO: Perhaps we should map host -O2 to ptxas -O3. -O3 is ptxas's
-    // default, so it may correspond more closely to the spirit of clang -O2.

     // -O3 seems like the least-bad option when -Osomething is specified to
     // clang but it isn't handled below.
     StringRef OOpt = "3";
     if (A->getOption().matches(options::OPT_O4) ||
         A->getOption().matches(options::OPT_Ofast))
       OOpt = "3";
     else if (A->getOption().matches(options::OPT_O0))
       OOpt = "0";
     else if (A->getOption().matches(options::OPT_O)) {
       // -Os, -Oz, and -O(anything else) map to -O2, for lack of better options.
       OOpt = llvm::StringSwitch<const char *>(A->getValue())
                  .Case("1", "1")
                  .Case("2", "2")
                  .Case("3", "3")
                  .Case("s", "2")
                  .Case("z", "2")
                  .Default("2");
     }
     CmdArgs.push_back(Args.MakeArgString(llvm::Twine("-O") + OOpt));
   } else {
-    // If no -O was passed, pass -O0 to ptxas -- no opt flag should correspond
-    // to no optimizations, but ptxas's default is -O3.
-    CmdArgs.push_back("-O0");
+    // If no -O was passed, pass -O3 to ptxas -- this makes ptxas's
+    // optimization level the same as the ptxjitcompiler.
+    CmdArgs.push_back("-O3");
   }
   if (DIKind == DebugDirectivesOnly)
     CmdArgs.push_back("-lineinfo");

   // Pass -v to ptxas if it was passed to the driver.
   if (Args.hasArg(options::OPT_v))
     CmdArgs.push_back("-v");
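
The mapping in the hunk above can be restated outside the driver. The following is a minimal sketch, not the clang implementation: `ptxasOptFlag` and its `defaultLevel` parameter are hypothetical names introduced here, it uses only the standard library rather than LLVM's `Arg` and `StringSwitch`, and it leaves out the debug-info branch that forces `-g`. It takes the text that follows `-O` on the clang command line (or nothing if no `-O` was given) and returns the flag handed to ptxas.

```cpp
#include <optional>
#include <string>

// Hypothetical helper (not part of clang): maps the value of the last clang
// -O flag, e.g. "0", "2", "s", "fast", to the -O flag passed to ptxas.
// std::nullopt means no -O flag was given on the command line.
// defaultLevel models the line this revision changes: "0" matches upstream,
// "3" matches the proposed patch.
std::string ptxasOptFlag(const std::optional<std::string> &clangO,
                         const std::string &defaultLevel = "0") {
  if (!clangO)
    return "-O" + defaultLevel;

  const std::string &level = *clangO;

  // -O4 and -Ofast have no ptxas equivalent, so they map to -O3.
  if (level == "4" || level == "fast")
    return "-O3";

  // -O0 through -O3 are passed through unchanged.
  if (level == "0" || level == "1" || level == "2" || level == "3")
    return "-O" + level;

  // -Os, -Oz, and any other value map to -O2, for lack of better options.
  return "-O2";
}
```

With the upstream default, `ptxasOptFlag(std::nullopt)` returns `-O0`; passing `"3"` as `defaultLevel` models the behaviour proposed in this revision and returns `-O3`. `-Os` and `-Oz` map to `-O2` in either case.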

clang/test/Driver/cuda-external-tools.cu

 // RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,DBG %s

 // --no-cuda-noopt-device-debug overrides --cuda-noopt-device-debug.
 // RUN: %clang -### -target x86_64-linux-gnu --cuda-noopt-device-debug \
 // RUN: --offload-arch=sm_35 --cuda-path=%S/Inputs/CUDA/usr/local/cuda \
 // RUN: --no-cuda-noopt-device-debug -O2 -c %s 2>&1 \
 // RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,OPT2 %s

-// Regular compile without -O. This should result in us passing -O0 to ptxas.
+// Regular compile without -O. This should result in us passing -O3 to ptxas.
 // RUN: %clang -### -target x86_64-linux-gnu -c %s 2>&1 \
 // RUN: --offload-arch=sm_35 --cuda-path=%S/Inputs/CUDA/usr/local/cuda \
-// RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,OPT0 %s
+// RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,OPT3 %s

 // Regular compiles with -Os and -Oz. For lack of a better option, we map
 // these to ptxas -O3.
 // RUN: %clang -### -target x86_64-linux-gnu -Os -c %s 2>&1 \
 // RUN: --offload-arch=sm_35 --cuda-path=%S/Inputs/CUDA/usr/local/cuda \
 // RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,OPT2 %s
 // RUN: %clang -### -target x86_64-linux-gnu -Oz -c %s 2>&1 \
 // RUN: --offload-arch=sm_35 --cuda-path=%S/Inputs/CUDA/usr/local/cuda \
[15 unchanged lines not shown]
 // 32-bit compile when generating relocatable device code.
 // RUN: %clang -### -target i386-linux-gnu -fgpu-rdc -c %s 2>&1 \
 // RUN: --offload-arch=sm_35 --cuda-path=%S/Inputs/CUDA/usr/local/cuda \
 // RUN: | FileCheck -check-prefixes=CHECK,ARCH32,SM35,RDC %s

 // Compile with -fintegrated-as. This should still cause us to invoke ptxas.
 // RUN: %clang -### -target x86_64-linux-gnu -fintegrated-as -c %s 2>&1 \
 // RUN: --offload-arch=sm_35 --cuda-path=%S/Inputs/CUDA/usr/local/cuda \
-// RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,OPT0 %s
+// RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,OPT3 %s
 // Check that we still pass -c when generating relocatable device code.
 // RUN: %clang -### -target x86_64-linux-gnu -fintegrated-as -fgpu-rdc -c %s 2>&1 \
 // RUN: --offload-arch=sm_35 --cuda-path=%S/Inputs/CUDA/usr/local/cuda \
 // RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,RDC %s

 // Check -Xcuda-ptxas and -Xcuda-fatbinary
 // RUN: %clang -### -target x86_64-linux-gnu -c -Xcuda-ptxas -foo1 \
 // RUN: --offload-arch=sm_35 --cuda-path=%S/Inputs/CUDA/usr/local/cuda \

Change the default optimisation level of PTXAS from -O0 to -O3. This makes the optimisation levels of PTXAS and the ptxjitcompiler equal (ptxjitcompiler defaults to -O3).
Abandoned, Public

Unit Tests: Failed
