This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] cmake option LIBOMPTARGET_NVPTX_MAX_SM for nvptx device RTL
Closed, Public

Authored by ye-luo on Sep 23 2020, 3:14 PM.

Details

Reviewers

jdoerfert
JonChesterfield

Commits

rGffd159d8e919: [OpenMP] cmake option LIBOMPTARGET_NVPTX_MAX_SM for nvptx device RTL

Summary

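This allows MAX_SM to be customized for non-flagship GPUs, which reduces graphics memory usage.

In addition, the size has so far been hard-coded only up to __CUDA_ARCH__ 700, which is already a hassle for 800.
This patch introduces a MAX_SM value for 800 and makes unknown future architectures fail at compile time unless MAX_SM is set explicitly.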

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ye-luo created this revision. Sep 23 2020, 3:14 PM

Herald added subscribers: guansong, yaxunl, mgorny. Sep 23 2020, 3:14 PM

ye-luo requested review of this revision. Sep 23 2020, 3:14 PM

Herald added a reviewer: jdoerfert. Sep 23 2020, 3:14 PM

Herald added a subscriber: sstefan1.

Change seems reasonable. Amdgcn could benefit from the same, e.g. when trying to get APU systems with about 8 CUs to run OpenMP code. I suggest we do that in a different patch if someone asks for it.

I'd like to get rid of the structure this macro controls entirely but don't have a good time estimate for that. This looks like a good idea in the meantime.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
Line 62: Can we distinguish between GA100 and GA102? This structure is large, so oversizing wastes significant memory.

ye-luo added inline comments. Sep 23 2020, 5:11 PM

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
Line 62: GA100 is __CUDA_ARCH__ 800. GA102 is 860. There are also 700, 720, and 750. I don't really feel the need to add more resolution, because LIBOMPTARGET_NVPTX_MAX_SM can be used instead.
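The resolution under discussion can be sketched as an ordinary function. This is an illustration only: `default_max_sm` is a hypothetical name that does not exist in the runtime, and its thresholds are copied from the `target_impl.h` hunk in this revision. It shows that arch 800 (GA100) and arch 860 (GA102) receive the same default of 108 unless an override is supplied.

```cpp
// Hypothetical helper mirroring the MAX_SM preprocessor ladder in this revision.
// `override_max_sm` stands in for the LIBOMPTARGET_NVPTX_MAX_SM cmake option;
// 0 means "option not set". The real header uses #error for arch >= 900,
// which is modelled here by returning -1.
constexpr int default_max_sm(int cuda_arch, int override_max_sm = 0) {
  return override_max_sm > 0 ? override_max_sm
       : cuda_arch >= 900    ? -1
       : cuda_arch >= 800    ? 108
       : cuda_arch >= 700    ? 84
       : cuda_arch >= 600    ? 56
                             : 16;
}

static_assert(default_max_sm(800) == 108, "GA100 default");
static_assert(default_max_sm(860) == 108, "GA102 falls into the same bucket");
static_assert(default_max_sm(860, 84) == 84, "explicit override wins");
```

Under this reading, a GA102 user who wants the tighter bound has to pass the cmake option; the header alone does not tell the two chips apart.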

JonChesterfield added inline comments. Sep 23 2020, 6:02 PM

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
Line 62: It could matter to someone with a GA102 who hasn't read the cmake. Back of envelope math suggests there's a little under a gigabyte of allocated but unused memory between 84 and 108.

ye-luo added inline comments. Sep 23 2020, 6:08 PM

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
Line 62: On arch 600, my measurement between 56 and 6 indicates a difference of about 500 MB. So I expect a difference of about 200 MB, which should matter little to GA102 owners. The RTX 3070 has 8 GB.

JonChesterfield added inline comments. Sep 23 2020, 6:13 PM

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
Line 62: Measuring beats mental arithmetic against a different arch. Amdgpu was 2.1 GB with 64, so about 30 MB per SM. Sort of glad to hear nvptx is smaller per SM.
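The estimates in this thread can be reproduced with a few lines of arithmetic. The sketch below assumes the footprint scales linearly with MAX_SM and that the arch 600 measurement carries over to arch 800; both are assumptions, not results from the patch. The names are made up for illustration, and "2.1 GB" is taken as 2100 MB.

```cpp
// Per-SM footprint derived from two measurements taken at different MAX_SM values.
constexpr double mb_per_sm(double delta_mb, int sm_hi, int sm_lo) {
  return delta_mb / (sm_hi - sm_lo);
}

// nvptx, arch 600: about 500 MB difference between MAX_SM 56 and 6.
constexpr double nvptx_mb_per_sm = mb_per_sm(500.0, 56, 6);

// Memory reserved but unused on a GA102 (84 SMs) when MAX_SM defaults to 108.
constexpr double ga102_oversize_mb = nvptx_mb_per_sm * (108 - 84);

// amdgpu: about 2.1 GB with 64, measured from zero.
constexpr double amdgpu_mb_per_sm = mb_per_sm(2100.0, 64, 0);

static_assert(nvptx_mb_per_sm == 10.0, "500 MB over 50 SMs");
static_assert(ga102_oversize_mb == 240.0, "24 extra SMs at 10 MB each");
```

With these inputs the oversizing comes out near 240 MB, the same order as the "about 200 MB" quoted above and well below the back-of-envelope gigabyte, while amdgpu lands near 33 MB per SM.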

LGTM. Change looks safe as-is and we can add finer granularity later.

This revision is now accepted and ready to land. Sep 24 2020, 7:46 AM

Closed by commit rGffd159d8e919: [OpenMP] cmake option LIBOMPTARGET_NVPTX_MAX_SM for nvptx device RTL (authored by ye-luo, committed by tianshilei1992). Sep 24 2020, 9:40 AM

This revision was automatically updated to reflect the committed changes.

tianshilei1992 added a commit: rGffd159d8e919: [OpenMP] cmake option LIBOMPTARGET_NVPTX_MAX_SM for nvptx device RTL.

Herald added a project: Restricted Project. Sep 24 2020, 9:40 AM

Revision Contents

Path                                                      Size
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt       9 lines
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h    17 lines

Diff 294095

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt

[76 lines not shown; context: if(LIBOMPTARGET_DEP_CUDA_FOUND)]
 set(LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES ${default_capabilities} CACHE STRING
 "List of CUDA Compute Capabilities to be used to compile the NVPTX device RTL.")
 string(REPLACE "," ";" nvptx_sm_list ${LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES})

 foreach(sm ${nvptx_sm_list})
 set(CUDA_ARCH ${CUDA_ARCH} -gencode arch=compute_${sm},code=sm_${sm})
 endforeach()

+# Override default MAX_SM in src/target_impl.h if requested
+if (DEFINED LIBOMPTARGET_NVPTX_MAX_SM)
+set(MAX_SM_DEFINITION "-DMAX_SM=${LIBOMPTARGET_NVPTX_MAX_SM}")
+endif()
+
 # Activate RTL message dumps if requested by the user.
 set(LIBOMPTARGET_NVPTX_DEBUG FALSE CACHE BOOL
 "Activate NVPTX device RTL debug messages.")
 if(${LIBOMPTARGET_NVPTX_DEBUG})
 set(CUDA_DEBUG -DOMPTARGET_NVPTX_DEBUG=-1 -g --ptxas-options=-v)
 endif()

 # NVPTX runtime library has to be statically linked. Dynamic linking is not
 # yet supported by the CUDA toolchain on the device.
 set(BUILD_SHARED_LIBS OFF)
 set(CUDA_SEPARABLE_COMPILATION ON)
 list(APPEND CUDA_NVCC_FLAGS -I${devicertl_base_directory}
 -I${devicertl_nvptx_directory}/src)
 cuda_add_library(omptarget-nvptx STATIC ${cuda_src_files} ${omp_data_objects}
-OPTIONS ${CUDA_ARCH} ${CUDA_DEBUG})
+OPTIONS ${CUDA_ARCH} ${CUDA_DEBUG} ${MAX_SM_DEFINITION})

 # Install device RTL under the lib destination folder.
 install(TARGETS omptarget-nvptx ARCHIVE DESTINATION "${OPENMP_INSTALL_LIBDIR}")

 target_link_libraries(omptarget-nvptx ${CUDA_LIBRARIES})


 # Check if we can create an LLVM bitcode implementation of the runtime library
[46 lines not shown; context: foreach(sm ${nvptx_sm_list})]

 # Compile CUDA files to bitcode.
 set(bc_files "")
 foreach(src ${cuda_src_files})
 get_filename_component(infile ${src} ABSOLUTE)
 get_filename_component(outfile ${src} NAME)

 add_custom_command(OUTPUT ${outfile}-sm_${sm}.bc
-COMMAND ${LIBOMPTARGET_NVPTX_SELECTED_CUDA_COMPILER} ${bc_flags} ${cuda_arch}
+COMMAND ${LIBOMPTARGET_NVPTX_SELECTED_CUDA_COMPILER} ${bc_flags} ${cuda_arch} ${MAX_SM_DEFINITION}
 -c ${infile} -o ${outfile}-sm_${sm}.bc
 DEPENDS ${infile}
 IMPLICIT_DEPENDS CXX ${infile}
 COMMENT "Building LLVM bitcode ${outfile}-sm_${sm}.bc"
 VERBATIM
 )
 set_property(DIRECTORY APPEND PROPERTY ADDITIONAL_MAKE_CLEAN_FILES ${outfile}-sm_${sm}.bc)

[29 lines not shown]
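The effect of passing `-DMAX_SM=...` on the compiler command line can be sketched in isolation. Everything here except the `MAX_SM` and `OMP_STATE_COUNT` macro names is hypothetical: `FakeOmpState`, its 64-byte size, `state_pool`, and `pool_bytes` are stand-ins chosen for illustration, and the real state object in the device RTL is much larger. The point is only that a statically allocated pool sized by MAX_SM shrinks when the definition is overridden.

```cpp
#include <cstddef>

// Normally supplied by the build, e.g. -DMAX_SM=20. The fallback below is the
// arch 600 default from target_impl.h and applies only when no -D is given.
#ifndef MAX_SM
#define MAX_SM 56
#endif
#define OMP_STATE_COUNT 32

// Hypothetical stand-in for the per-team state object (real size differs).
struct FakeOmpState {
  char bytes[64];
};

// One slot per (SM, state index) pair, reserved statically.
FakeOmpState state_pool[MAX_SM * OMP_STATE_COUNT];

// Size the pool would have for a given MAX_SM, for comparison.
constexpr std::size_t pool_bytes(std::size_t max_sm) {
  return max_sm * OMP_STATE_COUNT * sizeof(FakeOmpState);
}

static_assert(sizeof(state_pool) == pool_bytes(MAX_SM), "pool scales with MAX_SM");
```

Compiling the same file with `-DMAX_SM=20` instead of the fallback would reduce the pool in proportion, which is the mechanism the cmake option relies on.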

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h

[41 lines not shown]
 #define L1_BARRIER (1)

 // Maximum number of preallocated arguments to an outlined parallel/simd function.
 // Anything more requires dynamic memory allocation.
 #define MAX_SHARED_ARGS 20

 // Maximum number of omp state objects per SM allocated statically in global
 // memory.
-#if __CUDA_ARCH__ >= 700
+#if __CUDA_ARCH__ >= 600
 #define OMP_STATE_COUNT 32
+#else
+#define OMP_STATE_COUNT 16
+#endif
+
+#if !defined(MAX_SM)
+#if __CUDA_ARCH__ >= 900
+#error unsupported compute capability, define MAX_SM via LIBOMPTARGET_NVPTX_MAX_SM cmake option
+#elif __CUDA_ARCH__ >= 800
+// GA100 design has a maxinum of 128 SMs but A100 product only has 108 SMs
+// GA102 design has a maxinum of 84 SMs
+#define MAX_SM 108
+#elif __CUDA_ARCH__ >= 700
 #define MAX_SM 84
 #elif __CUDA_ARCH__ >= 600
-#define OMP_STATE_COUNT 32
 #define MAX_SM 56
 #else
-#define OMP_STATE_COUNT 16
 #define MAX_SM 16
 #endif
+#endif

 #define OMP_ACTIVE_PARALLEL_LEVEL 128

 // Data sharing related quantities, need to match what is used in the compiler.
 enum DATA_SHARING_SIZES {
 // The maximum number of workers in a kernel.
 DS_Max_Worker_Threads = 992,
 // The size reserved for data in a shared memory slot.
[142 lines not shown]