Details

Reviewers

jdoerfert
tra
tianshilei1992
JonChesterfield
kevinsala
sivachandra
michaelrj
lntue

Commits

rG2bef46d2ad87: [libc] Add a loader utility for NVPTX architectures for testing

Summary

This patch adds a loader utility targeting the CUDA driver API to launch
NVPTX images called nvptx_loader. This takes a GPU image on the
command line and launches the _start kernel with the appropriate
arguments. The _start kernel is provided by the already implemented
nvptx/start.cpp. So, an application with a main function can be
compiled and run as follows.

clang++ --target=nvptx64-nvidia-cuda main.cpp crt1.o -march=sm_70 -o image
./nvptx_loader image args to kernel

This implementation is not tested and does not yet support RPC. This
requires further development to work around NVIDIA specific limitations
in atomics and linking.
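
As a rough illustration of the flow described above, the sketch below reads an image file into memory and forwards the remaining command-line arguments to a `load`-style entry point. `read_image`, `run_loader`, and the `load_fn` alias are hypothetical names that do not come from the patch. The choice to drop the loader's own name so that the image path becomes `argv[0]` is inferred from the usage line, not confirmed by the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical alias matching the shape of load() declared in Loader.h.
using load_fn = int (*)(int argc, char **argv, char **envp, void *image,
                        std::size_t size);

// Hypothetical helper: read the whole file at `path` into `out`.
// Returns false if the file cannot be opened or fully read.
bool read_image(const char *path, std::vector<char> &out) {
  assert(path != nullptr && "path must not be null");
  FILE *file = std::fopen(path, "rb");
  if (!file)
    return false;
  std::fseek(file, 0, SEEK_END);
  long size = std::ftell(file);
  std::fseek(file, 0, SEEK_SET);
  if (size < 0) {
    std::fclose(file);
    return false;
  }
  out.resize(static_cast<std::size_t>(size));
  std::size_t read = std::fread(out.data(), 1, out.size(), file);
  std::fclose(file);
  return read == out.size();
}

// Hypothetical driver: argv[1] names the image, the rest go to the kernel.
int run_loader(int argc, char **argv, char **envp, load_fn load) {
  if (argc < 2) {
    std::fprintf(stderr, "USAGE: %s <image> [args...]\n", argv[0]);
    return EXIT_FAILURE;
  }
  std::vector<char> image;
  if (!read_image(argv[1], image)) {
    std::fprintf(stderr, "Failed to read image '%s'\n", argv[1]);
    return EXIT_FAILURE;
  }
  // Assumption: skip the loader's own name so the image path becomes argv[0].
  return load(argc - 1, &argv[1], envp, image.data(), image.size());
}
```

Passing the backend entry point as a function pointer keeps the sketch independent of either GPU runtime; in the patch itself each backend simply defines `load` and links against the shared `gpu_loader` object.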

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jhuber6 created this revision. Mar 22 2023, 6:09 PM

Herald added projects: Restricted Project, Restricted Project. Mar 22 2023, 6:09 PM

Herald added subscribers: libc-commits, mikhail.ramalho, mattd and 4 others.

jhuber6 requested review of this revision. Mar 22 2023, 6:09 PM

Harbormaster completed remote builds in B221189: Diff 507571. Mar 22 2023, 6:16 PM

The argv/env setup is the same as on amdgpu, with different names for the allocators. Maybe pull that into a header that takes functions for the alloc/memcpy and call it from both loaders?

There's a comment about fine grain memory in this that suggests a diff between the two would show few differences, better to avoid the copy paste where we can.

Moving device copying functions into a common utility.

Herald added subscribers: kosarev, kerbowa, jvesely. Mar 23 2023, 6:12 AM

Harbormaster completed remote builds in B221297: Diff 507717. Mar 23 2023, 6:19 AM

Looks reasonable, one nit below. Let's give others a chance to comment too.

libc/utils/gpu/loader/Loader.h
43–54	probably with a different fn name, but isn't this the same?

jhuber6 added inline comments. Mar 23 2023, 10:02 AM

libc/utils/gpu/loader/Loader.h
43–54	Good point, same code after getting the size.

Addressing comments.

Harbormaster completed remote builds in B221348: Diff 507790. Mar 23 2023, 10:35 AM

Error handling could be better but otherwise looks ok here

libc/utils/gpu/loader/Loader.h
21	I'd expect this to catch returning 0/null and propagate that from the interface. Shared memory feels more likely to run out than address space. Might be worth rewriting this to a single alloc - on HSA each of those allocations will be rounded up to a multiple of 4k internally. Doing both would mean we can return null on failure without having to pass a deallocator along for the failure path. Not blocking at this time though, it's probably difficult to hit that exhaustion from libc startup.
libc/utils/gpu/loader/amdgpu/Loader.cpp
294	This probably should check the return codes
libc/utils/gpu/loader/nvptx/Loader.cpp
84	That's inconsistent with ignoring errors on the other path

Forgot to check errors on the AMD implementation.

Return nullptr early if the allocation returns null.

jhuber6 added inline comments. Mar 23 2023, 11:26 AM

libc/utils/gpu/loader/Loader.h
21	That's a good point, we should be able to allocate a single big block which would be more efficient. But for testing purposes it's likely not to be an issue. We can adjust it later.
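
The single-allocation idea discussed in this thread could look roughly like the sketch below. It is not part of the patch: `copy_argument_vector_single` is a hypothetical name, and the trailing null entry in the pointer table is an addition of this sketch. It assumes `alloc` returns memory suitably aligned for pointers and that host addresses inside the block are valid on the device, as with the fine-grained or pinned host memory the loaders use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

/// Hypothetical variant of copy_argument_vector that calls `alloc` once.
/// Block layout: [argc pointers][null pointer][string 0][string 1]...
template <typename Allocator>
void *copy_argument_vector_single(int argc, char **argv, Allocator alloc) {
  assert(argc >= 0 && "argc must be non-negative");

  // First pass: size of the pointer table plus every string and terminator.
  std::size_t table_size =
      (static_cast<std::size_t>(argc) + 1) * sizeof(char *);
  std::size_t total = table_size;
  for (int i = 0; i < argc; ++i)
    total += std::strlen(argv[i]) + 1;

  // One allocation, so there is nothing to clean up on the failure path.
  void *block = alloc(total);
  if (block == nullptr)
    return nullptr;

  // Second pass: pack the strings after the table and record their addresses.
  char **table = static_cast<char **>(block);
  char *cursor = static_cast<char *>(block) + table_size;
  for (int i = 0; i < argc; ++i) {
    std::size_t size = std::strlen(argv[i]) + 1;
    std::memcpy(cursor, argv[i], size);
    table[i] = cursor;
    cursor += size;
  }
  table[argc] = nullptr;
  return block;
}
```

Because only one block is requested, a null return can be propagated directly without passing a deallocator, and on HSA the per-allocation rounding mentioned above would be paid once instead of once per string.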

JonChesterfield added inline comments. Mar 23 2023, 11:28 AM

libc/utils/gpu/loader/amdgpu/Loader.cpp
294	Agents allow access can fail, let's treat that the same as oom as they look the same to the caller. Leaking on the failure path is fine by me.

JonChesterfield added inline comments. Mar 23 2023, 11:29 AM

libc/utils/gpu/loader/amdgpu/Loader.cpp
298	handle_error calls around these returned pointers?

jhuber6 added inline comments. Mar 23 2023, 11:30 AM

libc/utils/gpu/loader/amdgpu/Loader.cpp
294	So, the only way it can fail is if the runtime is uninitialized, the arguments are null, or the pointer isn't allowed to have access. I think we can statically assert that we won't meet those criteria in usage.
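
For reference, the alternative raised in this thread (reporting a failed access grant the same way as out-of-memory) might be sketched as follows. This is not what the patch does. `make_checked_allocator` is a hypothetical helper, and its `allocate` and `allow_access` parameters are stand-ins for the HSA calls, assumed here to return 0 on success, so that the sketch compiles without the HSA runtime.

```cpp
#include <cassert>
#include <cstddef>

/// Hypothetical helper: build an allocator callback that returns nullptr when
/// either the allocation or the access grant fails.
/// `allocate(size, &ptr)` and `allow_access(ptr)` are stand-ins for the HSA
/// calls and are assumed to return 0 on success.
template <typename AllocFn, typename AccessFn>
auto make_checked_allocator(AllocFn allocate, AccessFn allow_access) {
  return [=](std::size_t size) -> void * {
    void *ptr = nullptr;
    if (allocate(size, &ptr) != 0)
      return nullptr;
    assert(ptr != nullptr && "allocate reported success but returned null");
    // A failed grant looks the same as out-of-memory to the caller. The block
    // is deliberately leaked on this path, as suggested in the review.
    if (allow_access(ptr) != 0)
      return nullptr;
    return ptr;
  };
}
```

The resulting callable has the shape that `copy_argument_vector` expects, so the caller needs only one null check to cover both failure modes.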

jhuber6 added inline comments. Mar 23 2023, 11:31 AM

libc/utils/gpu/loader/amdgpu/Loader.cpp
298	Probably better than being lazy and waiting for it to segfault on the GPU like I do now.

Checking allocation return values.

Harbormaster completed remote builds in B221385: Diff 507836. Mar 23 2023, 11:48 AM

jhuber6 added a child revision: D146846: [libc] Implement the RPC client / server for NVPTX. Mar 24 2023, 2:07 PM

LG, I think

libc/utils/gpu/loader/amdgpu/Loader.cpp
81	Nit: you have this one twice now.

This revision is now accepted and ready to land. Mar 24 2023, 3:49 PM

Closed by commit rG2bef46d2ad87: [libc] Add a loader utility for NVPTX architectures for testing (authored by jhuber6). Mar 24 2023, 6:05 PM

This revision was automatically updated to reflect the committed changes.

jhuber6 added a commit: rG2bef46d2ad87: [libc] Add a loader utility for NVPTX architectures for testing.

Diff 508257

libc/utils/gpu/loader/CMakeLists.txt

  add_library(gpu_loader OBJECT Main.cpp)
  target_include_directories(gpu_loader PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})

  find_package(hsa-runtime64 QUIET 1.2.0 HINTS ${CMAKE_INSTALL_PREFIX} PATHS /opt/rocm)
  if(hsa-runtime64_FOUND)
    add_subdirectory(amdgpu)
  else()
    message(STATUS "Skipping HSA loader for gpu target, no HSA was detected")
  endif()

+ find_package(CUDAToolkit QUIET)
+ if(CUDAToolkit_FOUND)
+   add_subdirectory(nvptx)
+ else()
+   message(STATUS "Skipping CUDA loader for gpu target, no CUDA was detected")
+ endif()
+
  # Add a custom target to be used for testing.
  if(TARGET amdhsa_loader AND LIBC_GPU_TARGET_ARCHITECTURE_IS_AMDGPU)
    add_custom_target(libc.utils.gpu.loader)
    add_dependencies(libc.utils.gpu.loader amdhsa_loader)
    set_target_properties(
      libc.utils.gpu.loader
      PROPERTIES
        EXECUTABLE "$<TARGET_FILE:amdhsa_loader>"
    )
  endif()

libc/utils/gpu/loader/Loader.h

  //===-- Generic device loader interface -----------------------------------===//
  //
  // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
  // See https://llvm.org/LICENSE.txt for license information.
  // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
  //
  //===----------------------------------------------------------------------===//

+ #ifndef LLVM_LIBC_UTILS_GPU_LOADER_LOADER_H
+ #define LLVM_LIBC_UTILS_GPU_LOADER_LOADER_H
+
+ #include <cstring>
  #include <stddef.h>

  /// Generic interface to load the \p image and launch execution of the _start
  /// kernel on the target device. Copies \p argc and \p argv to the device.
  /// Returns the final value of the `main` function on the device.
  int load(int argc, char **argv, char **envp, void *image, size_t size);

+ /// Copy the system's argument vector to GPU memory allocated using \p alloc.
+ template <typename Allocator>
+ void *copy_argument_vector(int argc, char **argv, Allocator alloc) {
+   void *dev_argv = alloc(argc * sizeof(char *));
+   if (dev_argv == nullptr)
+     return nullptr;
+
+   for (int i = 0; i < argc; ++i) {
+     size_t size = strlen(argv[i]) + 1;
+     void *dev_str = alloc(size);
+     if (dev_str == nullptr)
+       return nullptr;
+
+     // Load the host memory buffer with the pointer values of the newly
+     // allocated strings.
+     std::memcpy(dev_str, argv[i], size);
+     static_cast<void **>(dev_argv)[i] = dev_str;
+   }
+   return dev_argv;
+ };
+
+ /// Copy the system's environment to GPU memory allocated using \p alloc.
+ template <typename Allocator>
+ void *copy_environment(char **envp, Allocator alloc) {
+   int envc = 0;
+   for (char **env = envp; *env != 0; ++env)
+     ++envc;
+
+   return copy_argument_vector(envc, envp, alloc);
+ };
+
+ #endif

libc/utils/gpu/loader/amdgpu/Loader.cpp

[69 lines hidden] static void handle_error(hsa_status_t code) {

    const char *desc;
    if (hsa_status_string(code, &desc) != HSA_STATUS_SUCCESS)
      desc = "Unknown error";
    fprintf(stderr, "%s\n", desc);
    exit(EXIT_FAILURE);
  }

+ static void handle_error(const char *msg) {
+   fprintf(stderr, "%s\n", msg);
+   exit(EXIT_FAILURE);
+ }
+
  /// Generic interface for iterating using the HSA callbacks.
  template <typename elem_ty, typename func_ty, typename callback_ty>
  hsa_status_t iterate(func_ty func, callback_ty cb) {
    auto l = [](elem_ty elem, void *data) -> hsa_status_t {
      callback_ty *unwrapped = static_cast<callback_ty *>(data);
      return (*unwrapped)(elem);
    };
    return func(l, static_cast<void *>(&cb));

[188 lines hidden] int load(int argc, char **argv, char **envp, void *image, size_t size) {

    void *args;
    if (hsa_status_t err = hsa_amd_memory_pool_allocate(kernargs_pool, args_size,
                                                        /*flags=*/0, &args))
      handle_error(err);
    hsa_amd_agents_allow_access(1, &dev_agent, nullptr, args);

    // Allocate fine-grained memory on the host to hold the pointer array for the
    // copied argv and allow the GPU agent to access it.
-   void *dev_argv;
-   if (hsa_status_t err =
-           hsa_amd_memory_pool_allocate(finegrained_pool, argc * sizeof(char *),
-                                        /*flags=*/0, &dev_argv))
-     handle_error(err);
-   hsa_amd_agents_allow_access(1, &dev_agent, nullptr, dev_argv);
-
-   // Copy each string in the argument vector to global memory on the device.
-   for (int i = 0; i < argc; ++i) {
-     size_t size = strlen(argv[i]) + 1;
-     void *dev_str;
-     if (hsa_status_t err = hsa_amd_memory_pool_allocate(finegrained_pool, size,
-                                                         /*flags=*/0, &dev_str))
-       handle_error(err);
-     hsa_amd_agents_allow_access(1, &dev_agent, nullptr, dev_str);
-     // Load the host memory buffer with the pointer values of the newly
-     // allocated strings.
-     std::memcpy(dev_str, argv[i], size);
-     static_cast<void **>(dev_argv)[i] = dev_str;
-   }
+   auto allocator = [&](uint64_t size) -> void * {
+     void *dev_ptr = nullptr;
+     if (hsa_status_t err = hsa_amd_memory_pool_allocate(finegrained_pool, size,
+                                                         /*flags=*/0, &dev_ptr))
+       handle_error(err);
+     hsa_amd_agents_allow_access(1, &dev_agent, nullptr, dev_ptr);
+     return dev_ptr;
+   };
+   void *dev_argv = copy_argument_vector(argc, argv, allocator);
+   if (!dev_argv)
+     handle_error("Failed to allocate device argv");

    // Allocate fine-grained memory on the host to hold the pointer array for the
    // copied environment array and allow the GPU agent to access it.
-   int envc = 0;
-   for (char **env = envp; *env != 0; ++env)
-     ++envc;
-   void *dev_envp;
-   if (hsa_status_t err =
-           hsa_amd_memory_pool_allocate(finegrained_pool, envc * sizeof(char *),
-                                        /*flags=*/0, &dev_envp))
-     handle_error(err);
-   hsa_amd_agents_allow_access(1, &dev_agent, nullptr, dev_envp);
-   for (int i = 0; i < envc; ++i) {
-     size_t size = strlen(envp[i]) + 1;
-     void *dev_str;
-     if (hsa_status_t err = hsa_amd_memory_pool_allocate(finegrained_pool, size,
-                                                         /*flags=*/0, &dev_str))
-       handle_error(err);
-     hsa_amd_agents_allow_access(1, &dev_agent, nullptr, dev_str);
-     // Load the host memory buffer with the pointer values of the newly
-     // allocated strings.
-     std::memcpy(dev_str, envp[i], size);
-     static_cast<void **>(dev_envp)[i] = dev_str;
-   }
+   void *dev_envp = copy_environment(envp, allocator);
+   if (!dev_envp)
+     handle_error("Failed to allocate device environment");

    // Allocate space for the return pointer and initialize it to zero.
    void *dev_ret;
    if (hsa_status_t err =
            hsa_amd_memory_pool_allocate(coarsegrained_pool, sizeof(int),
                                         /*flags=*/0, &dev_ret))
      handle_error(err);
    hsa_amd_memory_fill(dev_ret, 0, sizeof(int));

[128 lines hidden]

libc/utils/gpu/loader/nvptx/CMakeLists.txt

This file was added.

				add_executable(nvptx_loader Loader.cpp)
				add_dependencies(nvptx_loader libc.src.__support.RPC.rpc)

				target_include_directories(nvptx_loader PRIVATE ${LIBC_SOURCE_DIR})
				target_link_libraries(nvptx_loader
				PRIVATE
				gpu_loader
				CUDA::cuda_driver
				)

libc/utils/gpu/loader/nvptx/Loader.cpp

This file was added.

				//===-- Loader Implementation for NVPTX devices --------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This file implements a simple loader to run images supporting the NVPTX
				// architecture. The file launches the '_start' kernel which should be provided
				// by the device application start code and ultimately call the 'main'
				// function.
				//
				//===----------------------------------------------------------------------===//

				#include "Loader.h"

				#include "cuda.h"
				#include <cstddef>
				#include <cstdio>
				#include <cstdlib>
				#include <cstring>

				/// The arguments to the '_start' kernel.
				struct kernel_args_t {
				int argc;
				void *argv;
				void *envp;
				void *ret;
				void *inbox;
				void *outbox;
				void *buffer;
				};

				static void handle_error(CUresult err) {
				if (err == CUDA_SUCCESS)
				return;

				const char *err_str = nullptr;
				CUresult result = cuGetErrorString(err, &err_str);
				if (result != CUDA_SUCCESS)
				fprintf(stderr, "Unknown Error\n");
				else
				fprintf(stderr, "%s\n", err_str);
				exit(1);
				}

				static void handle_error(const char *msg) {
				fprintf(stderr, "%s\n", msg);
				exit(EXIT_FAILURE);
				}

				int load(int argc, char **argv, char **envp, void *image, size_t size) {
				if (CUresult err = cuInit(0))
				handle_error(err);

				// Obtain the first device found on the system.
				CUdevice device;
				if (CUresult err = cuDeviceGet(&device, 0))
				handle_error(err);

				// Initialize the CUDA context and claim it for this execution.
				CUcontext context;
				if (CUresult err = cuDevicePrimaryCtxRetain(&context, device))
				handle_error(err);
				if (CUresult err = cuCtxSetCurrent(context))
				handle_error(err);

				// Initialize a non-blocking CUDA stream to execute the kernel.
				CUstream stream;
				if (CUresult err = cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING))
				handle_error(err);

				// Load the image into a CUDA module.
				CUmodule binary;
				if (CUresult err = cuModuleLoadDataEx(&binary, image, 0, nullptr, nullptr))
				handle_error(err);

				// Look up the '_start' kernel in the loaded module.
				CUfunction function;
				if (CUresult err = cuModuleGetFunction(&function, binary, "_start"))
				handle_error(err);

				// Allocate pinned memory on the host to hold the pointer array for the
				// copied argv and allow the GPU device to access it.
				auto allocator = [&](uint64_t size) -> void * {
				void *dev_ptr;
				if (CUresult err = cuMemAllocHost(&dev_ptr, size))
				handle_error(err);
				return dev_ptr;
				};
				void *dev_argv = copy_argument_vector(argc, argv, allocator);
				if (!dev_argv)
				handle_error("Failed to allocate device argv");

				// Allocate pinned memory on the host to hold the pointer array for the
				// copied environment array and allow the GPU device to access it.
				void *dev_envp = copy_environment(envp, allocator);
				if (!dev_envp)
				handle_error("Failed to allocate device environment");

				// Allocate space for the return pointer and initialize it to zero.
				CUdeviceptr dev_ret;
				if (CUresult err = cuMemAlloc(&dev_ret, sizeof(int)))
				handle_error(err);
				if (CUresult err = cuMemsetD32(dev_ret, 0, 1))
				handle_error(err);

				// Set up the arguments to the '_start' kernel on the GPU.
				// TODO: Setup RPC server implementation;
				uint64_t args_size = sizeof(kernel_args_t);
				kernel_args_t args;
				std::memset(&args, 0, args_size);
				args.argc = argc;
				args.argv = dev_argv;
				args.envp = dev_envp;
				args.ret = reinterpret_cast<void *>(dev_ret);
				void *args_config[] = {CU_LAUNCH_PARAM_BUFFER_POINTER, &args,
				CU_LAUNCH_PARAM_BUFFER_SIZE, &args_size,
				CU_LAUNCH_PARAM_END};

				// Call the kernel with the given arguments.
				if (CUresult err =
				cuLaunchKernel(function, /*gridDimX=*/1, /*gridDimY=*/1,
				/*gridDimZ=*/1, /*blockDimX=*/1, /*blockDimY=*/1,
				/*blockDimZ=*/1, 0, stream, nullptr, args_config))
				handle_error(err);

				// TODO: Query the RPC server periodically while the kernel is running.
				while (cuStreamQuery(stream) == CUDA_ERROR_NOT_READY)
				;

				// Copy the return value back from the kernel and wait.
				int host_ret = 0;
				if (CUresult err = cuMemcpyDtoH(&host_ret, dev_ret, sizeof(int)))
				handle_error(err);

				if (CUresult err = cuStreamSynchronize(stream))
				handle_error(err);

				// Destroy the context and the loaded binary.
				if (CUresult err = cuModuleUnload(binary))
				handle_error(err);
				if (CUresult err = cuDevicePrimaryCtxRelease(device))
				handle_error(err);
				return host_ret;
				}

This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add a loader utility for NVPTX architectures for testing
ClosedPublic
