Details

Reviewers

jdoerfert
tra
tianshilei1992
JonChesterfield
kevinsala
sivachandra
michaelrj
lntue

Commits

rG2bef46d2ad87: [libc] Add a loader utility for NVPTX architectures for testing

Summary

This patch adds a loader utility, nvptx_loader, that uses the CUDA driver
API to launch NVPTX images. It takes a GPU image on the command line and
launches the _start kernel with the appropriate arguments. The _start
kernel is provided by the already implemented nvptx/start.cpp, so an
application with a main function can be compiled and run as follows.

clang++ --target=nvptx64-nvidia-cuda main.cpp crt1.o -march=sm_70 -o image
./nvptx_loader image args to kernel

This implementation is not tested and does not yet support RPC. Supporting
RPC requires further development to work around NVIDIA-specific
limitations in atomics and linking.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jhuber6 created this revision. · Mar 22 2023, 6:09 PM

Herald added projects: Restricted Project, Restricted Project. · Mar 22 2023, 6:09 PM

Herald added subscribers: libc-commits, mikhail.ramalho, mattd and 4 others.

jhuber6 requested review of this revision. · Mar 22 2023, 6:09 PM

Harbormaster completed remote builds in B221189: Diff 507571. · Mar 22 2023, 6:16 PM

The argv/env setup is the same as on amdgpu, with different names for the allocators. Maybe pull that into a header that takes functions for the alloc/memcpy and call it from both loaders?

There's a comment about fine grain memory in this that suggests a diff between the two would show few differences, better to avoid the copy paste where we can.

Moving device copying functions into a common utility.
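
The shared-helper idea raised above can be sketched as a template that takes the allocator as a parameter. This is a minimal host-only sketch, not the contents of the patch's Loader.h: the name `copy_string_array` and its signature are assumptions made for illustration, and in a real loader the allocator would presumably wrap the platform's host-visible allocation call (for example `cuMemAllocHost` on NVPTX).

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helper (name and signature are assumptions): copies an array
// of C strings using a caller-supplied allocator so that two loaders could
// share the same logic and differ only in how memory is obtained.
template <typename Allocator>
void *copy_string_array(int count, char **strings, Allocator alloc) {
  // One extra slot keeps the copied pointer array null-terminated.
  void *dev_array =
      alloc((static_cast<std::size_t>(count) + 1) * sizeof(char *));
  if (!dev_array)
    return nullptr;

  for (int i = 0; i < count; ++i) {
    std::size_t size = std::strlen(strings[i]) + 1;
    void *dev_str = alloc(size);
    if (!dev_str)
      return nullptr; // Earlier allocations are leaked on this failure path.
    std::memcpy(dev_str, strings[i], size);
    static_cast<void **>(dev_array)[i] = dev_str;
  }
  static_cast<void **>(dev_array)[count] = nullptr;
  return dev_array;
}
```

On the host the helper can be exercised with `std::malloc` as the allocator; a loader would pass a lambda around its own allocation routine instead.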

Herald added subscribers: kosarev, kerbowa, jvesely. · Mar 23 2023, 6:12 AM

Harbormaster completed remote builds in B221297: Diff 507717. · Mar 23 2023, 6:19 AM

Looks reasonable, one nit below. Let's give others a chance to comment too.

libc/utils/gpu/loader/Loader.h
43–54 ↗	(On Diff #507717)	probably with a different fn name, but isn't this the same?

jhuber6 added inline comments. · Mar 23 2023, 10:02 AM

libc/utils/gpu/loader/Loader.h
43–54 ↗	(On Diff #507717)	Good point, same code after getting the size.

Addressing comments.

Harbormaster completed remote builds in B221348: Diff 507790. · Mar 23 2023, 10:35 AM

Error handling could be better but otherwise looks ok here

libc/utils/gpu/loader/Loader.h
21 ↗	(On Diff #507790)	I'd expect this to catch returning 0/null and propagate that from the interface. Shared memory feels more likely to run out than address space. Might be worth rewriting this to a single alloc - on HSA each of those allocations will be rounded up to a multiple of 4k internally. Doing both would mean we can return null on failure without having to pass a deallocator along for the failure path. Not blocking at this time though, it's probably difficult to hit that exhaustion from libc startup.
libc/utils/gpu/loader/amdgpu/Loader.cpp
284 ↗	(On Diff #507790)	This probably should check the return codes
libc/utils/gpu/loader/nvptx/Loader.cpp
84	That's inconsistent with ignoring errors on the other path

Forgot to check errors on the AMD implementation.

Return nullptr early if the allocation returns null.

jhuber6 added inline comments. · Mar 23 2023, 11:26 AM

libc/utils/gpu/loader/Loader.h
21 ↗	(On Diff #507790)	That's a good point, we should be able to allocate a single big block which would be more efficient. But for testing purposes it's likely not to be an issue. We can adjust it later.
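
The single-allocation idea discussed in this thread can be sketched as follows. This is an illustration only and not code from the patch: the helper name `copy_string_array_single_block` is hypothetical, and the sketch assumes the allocator returns memory whose address is valid on both host and device, as the listing assumes for `cuMemAllocHost`.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: packs the pointer table and all string bytes into one
// allocation, so a failed allocation leaves nothing behind to free.
template <typename Allocator>
void *copy_string_array_single_block(int count, char **strings,
                                     Allocator alloc) {
  // Layout: [count + 1 pointers, null-terminated][string bytes ...]
  std::size_t table_size =
      (static_cast<std::size_t>(count) + 1) * sizeof(char *);
  std::size_t total = table_size;
  for (int i = 0; i < count; ++i)
    total += std::strlen(strings[i]) + 1;

  void *block = alloc(total);
  if (!block)
    return nullptr; // Only one allocation was attempted.

  char **table = static_cast<char **>(block);
  char *cursor = static_cast<char *>(block) + table_size;
  for (int i = 0; i < count; ++i) {
    std::size_t size = std::strlen(strings[i]) + 1;
    std::memcpy(cursor, strings[i], size);
    table[i] = cursor;
    cursor += size;
  }
  table[count] = nullptr;
  return block;
}
```

Because every string lives inside the same block as the pointer table, the caller can release everything with one matching free call and the failure path needs no deallocator.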

JonChesterfield added inline comments. · Mar 23 2023, 11:28 AM

libc/utils/gpu/loader/amdgpu/Loader.cpp
284 ↗	(On Diff #507790)	Agents allow access can fail, let's treat that the same as oom as they look the same to the caller. Leaking on the failure path is fine by me.

JonChesterfield added inline comments. · Mar 23 2023, 11:29 AM

libc/utils/gpu/loader/amdgpu/Loader.cpp
291 ↗	(On Diff #507831)	handle_error calls around these returned pointers?

jhuber6 added inline comments. · Mar 23 2023, 11:30 AM

libc/utils/gpu/loader/amdgpu/Loader.cpp
284 ↗	(On Diff #507790)	So, the only way it can fail is if the runtime is uninitialized, the arguments are null, or the pointer isn't allowed to have access. I think we can statically assert that we won't meet those criteria in usage.

jhuber6 added inline comments. · Mar 23 2023, 11:31 AM

libc/utils/gpu/loader/amdgpu/Loader.cpp
291 ↗	(On Diff #507831)	Probably better than being lazy and waiting for it to segfault on the GPU like I do now.

Checking allocation return values.

Harbormaster completed remote builds in B221385: Diff 507836. · Mar 23 2023, 11:48 AM

jhuber6 added a child revision: D146846: [libc] Implement the RPC client / server for NVPTX. · Mar 24 2023, 2:07 PM

LG, I think

libc/utils/gpu/loader/amdgpu/Loader.cpp
81 ↗	(On Diff #507836)	Nit: you have this one twice now.

This revision is now accepted and ready to land. · Mar 24 2023, 3:49 PM

Closed by commit rG2bef46d2ad87: [libc] Add a loader utility for NVPTX architectures for testing (authored by jhuber6). · Mar 24 2023, 6:05 PM

This revision was automatically updated to reflect the committed changes.

jhuber6 added a commit: rG2bef46d2ad87: [libc] Add a loader utility for NVPTX architectures for testing.

Diff 507571

libc/utils/gpu/loader/CMakeLists.txt

	add_library(gpu_loader OBJECT Main.cpp)
	target_include_directories(gpu_loader PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})

	find_package(hsa-runtime64 QUIET 1.2.0 HINTS ${CMAKE_INSTALL_PREFIX} PATHS /opt/rocm)
	if(hsa-runtime64_FOUND)
	  add_subdirectory(amdgpu)
	else()
	  message(STATUS "Skipping HSA loader for gpu target, no HSA was detected")
	endif()

				find_package(CUDAToolkit QUIET)
				if(CUDAToolkit_FOUND)
				  add_subdirectory(nvptx)
				else()
				  message(STATUS "Skipping CUDA loader for gpu target, no CUDA was detected")
				endif()

	# Add a custom target to be used for testing.
	if(TARGET amdhsa_loader AND LIBC_GPU_TARGET_ARCHITECTURE_IS_AMDGPU)
	  add_custom_target(libc.utils.gpu.loader)
	  add_dependencies(libc.utils.gpu.loader amdhsa_loader)
	  set_target_properties(
	    libc.utils.gpu.loader
	    PROPERTIES
	      EXECUTABLE "$<TARGET_FILE:amdhsa_loader>"
	  )
	endif()

libc/utils/gpu/loader/nvptx/CMakeLists.txt

This file was added.

				add_executable(nvptx_loader Loader.cpp)
				add_dependencies(nvptx_loader libc.src.__support.RPC.rpc)

				target_include_directories(nvptx_loader PRIVATE ${LIBC_SOURCE_DIR})
				target_link_libraries(nvptx_loader
				PRIVATE
				gpu_loader
				CUDA::cuda_driver
				)

libc/utils/gpu/loader/nvptx/Loader.cpp

This file was added.

				//===-- Loader Implementation for NVPTX devices --------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This file implements a simple loader to run images supporting the NVPTX
				// architecture. The file launches the '_start' kernel which should be provided
				// by the device application start code and ultimately call the 'main'
				// function.
				//
				//===----------------------------------------------------------------------===//

				#include "cuda.h"
				#include <cstddef>
				#include <cstdint>
				#include <cstdio>
				#include <cstdlib>
				#include <cstring>

				/// The arguments to the '_start' kernel.
				struct kernel_args_t {
				int argc;
				void *argv;
				void *envp;
				void *ret;
				void *inbox;
				void *outbox;
				void *buffer;
				};

				static void handle_error(CUresult err) {
				if (err == CUDA_SUCCESS)
				return;

				const char *err_str = nullptr;
				CUresult result = cuGetErrorString(err, &err_str);
				if (result != CUDA_SUCCESS)
				fprintf(stderr, "Unknown Error\n");
				else
				fprintf(stderr, "%s\n", err_str);
				exit(1);
				}

				int load(int argc, char **argv, char **envp, void *image, size_t size) {
				if (CUresult err = cuInit(0))
				handle_error(err);

				// Obtain the first device found on the system.
				CUdevice device;
				if (CUresult err = cuDeviceGet(&device, 0))
				handle_error(err);

				// Initialize the CUDA context and claim it for this execution.
				CUcontext context;
				if (CUresult err = cuDevicePrimaryCtxRetain(&context, device))
				handle_error(err);
				if (CUresult err = cuCtxSetCurrent(context))
				handle_error(err);

				// Initialize a non-blocking CUDA stream to execute the kernel.
				CUstream stream;
				if (CUresult err = cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING))
				handle_error(err);

				// Load the image into a CUDA module.
				CUmodule binary;
				if (CUresult err = cuModuleLoadDataEx(&binary, image, 0, nullptr, nullptr))
				handle_error(err);

				// look up the '_start' kernel in the loaded module.
				CUfunction function;
				if (CUresult err = cuModuleGetFunction(&function, binary, "_start"))
				handle_error(err);

				// Allocate pinned memory on the host to hold the pointer array for the
				// copied argv and allow the GPU device to access it.
				void *dev_argv;
				if (CUresult err = cuMemAllocHost(&dev_argv, sizeof(char *) * argc))
				handle_error(err);

				// Copy each string in the argument vector to shared memory on the device.
				for (int i = 0; i < argc; ++i) {
				size_t size = strlen(argv[i]) + 1;
				void *dev_str;
				if (CUresult err = cuMemAllocHost(&dev_str, size))
				handle_error(err);
				// Load the host memory buffer with the pointer values of the newly
				// allocated strings.
				std::memcpy(dev_str, argv[i], size);
				static_cast<void **>(dev_argv)[i] = dev_str;
				}

				// Allocate fine-grained memory on the host to hold the pointer array for the
				// copied environment array and allow the GPU agent to access it.
				int envc = 0;
				for (char **env = envp; *env != 0; ++env)
				++envc;
				void *dev_envp;
				if (CUresult err = cuMemAllocHost(&dev_envp, sizeof(char *) * envc))
				handle_error(err);

				for (int i = 0; i < envc; ++i) {
				size_t size = strlen(envp[i]) + 1;
				void *dev_str;
				if (CUresult err = cuMemAllocHost(&dev_str, size))
				handle_error(err);
				// Load the host memory buffer with the pointer values of the newly
				// allocated strings.
				std::memcpy(dev_str, envp[i], size);
				static_cast<void **>(dev_envp)[i] = dev_str;
				}

				// Allocate space for the return pointer and initialize it to zero.
				CUdeviceptr dev_ret;
				if (CUresult err = cuMemAlloc(&dev_ret, sizeof(int)))
				handle_error(err);
				if (CUresult err = cuMemsetD32(dev_ret, 0, 1))
				handle_error(err);

				// Set up the arguments to the '_start' kernel on the GPU.
				// TODO: Setup RPC server implementation;
				uint64_t args_size = sizeof(kernel_args_t);
				kernel_args_t args;
				std::memset(&args, 0, args_size);
				args.argc = argc;
				args.argv = dev_argv;
				args.envp = dev_envp;
				args.ret = reinterpret_cast<void *>(dev_ret);
				void *args_config[] = {CU_LAUNCH_PARAM_BUFFER_POINTER, &args,
				CU_LAUNCH_PARAM_BUFFER_SIZE, &args_size,
				CU_LAUNCH_PARAM_END};

				// Call the kernel with the given arguments.
				if (CUresult err =
				cuLaunchKernel(function, /*gridDimX=*/1, /*gridDimY=*/1,
				/*gridDimZ=*/1, /*blockDimX=*/1, /*blockDimY=*/1,
				/*blockDimZ=*/1, 0, stream, nullptr, args_config))
				handle_error(err);

				// TODO: Query the RPC server periodically while the kernel is running.
				while (cuStreamQuery(stream) == CUDA_ERROR_NOT_READY)
				;

				// Copy the return value back from the kernel and wait.
				int host_ret = 0;
				if (CUresult err = cuMemcpyDtoH(&host_ret, dev_ret, sizeof(int)))
				handle_error(err);

				if (CUresult err = cuStreamSynchronize(stream))
				handle_error(err);

				// Destroy the context and the loaded binary.
				if (CUresult err = cuModuleUnload(binary))
				handle_error(err);
				if (CUresult err = cuDevicePrimaryCtxRelease(device))
				handle_error(err);
				return host_ret;
				}
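
The listing sizes the environment array by walking envp until it reaches the terminating null pointer. A host-only sketch of that counting step is shown here; the helper name `count_entries` is hypothetical and does not appear in the patch.

```cpp
#include <cstddef>

// Hypothetical helper: counts the entries of a null-terminated pointer array
// such as envp. The loop advances over the array of pointers and stops at the
// first null entry, so the array must be null-terminated.
inline int count_entries(char **array) {
  int count = 0;
  for (char **entry = array; *entry != nullptr; ++entry)
    ++count;
  return count;
}
```

The returned count is what the loader would multiply by `sizeof(char *)` when sizing the copied pointer array.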

This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add a loader utility for NVPTX architectures for testing
ClosedPublic
