This is an archive of the discontinued LLVM Phabricator instance.

streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp
167 ↗	(On Diff #71468)	Huh, does clang-format actually do this? If so maybe that's worth filing a bug -- that is a strange choice.
189 ↗	(On Diff #71468)	This whole dance is going to destroy the beautiful efficiency gains you were after, right? It sort of seems like the only way to make this work would be to have a different args-packing class for each platform. But I am not sure how to do that without introducing virtual function calls, which would also destroy your beautiful efficiency gains. At least let's use llvm::SmallVector so we don't have to malloc anything. And maybe add a comment that we may need to come back and improve this?
streamexecutor/unittests/CoreTests/CUDATest.cpp
142 ↗	(On Diff #71468)	Doesn't match ns name above. (It's going to be technically UB for something other than the compiler to put anything into __foo.)

This revision is now accepted and ready to land. Sep 15 2016, 8:54 AM

Comment on dyn-shared-memory arg efficiency

streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp
167 ↗	(On Diff #71468)	Yes, clang-format actually does this. I'll create a simple reproducer and file a bug.
189 ↗	(On Diff #71468)	While optimizing this internally, our approach was that the dynamic shared memory case was not common, and we would accept being inefficient in that case, as long as we were efficient in the case of no dynamic shared memory. So, the idea is that, other than the check for `ArgumentArray.getArgumentCount() == 0`, the no-dynamic-shared-memory case should take advantage of the efficiency gains in the quirky packed argument array design. Definitely let me know if I've done something that has broken that case. For the general dynamic shared memory case, I couldn't think of a good way to make it efficient for CUDA and OpenCL at the same time. As you mentioned, it might require virtual function calls, which would hurt both cases. But again, I think we're OK to be less efficient in this case. Actually, now that I think of it, we could be much more efficient in the most important specific case of dynamic shared memory--the case where there is only one dynamic shared memory argument, and it is the first one. That would work for all CUDA cases, and OpenCL users could take advantage of it as well by choosing to write their kernels in that way. For now, I've switched to using llvm::SmallVector (thanks for that idea!) and wrote a comment describing how we might improve this in the future.
streamexecutor/unittests/CoreTests/CUDATest.cpp
142 ↗	(On Diff #71468)	Thanks for catching that. I changed it from the example code to avoid UB, but I missed the comment.

jlebar added inline comments. Sep 15 2016, 10:57 AM

streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp
189 ↗	(On Diff #71520)	"our approach was that the dynamic shared memory case was not common, and we would accept being inefficient in that case, as long as we were efficient in the case of no dynamic shared memory." I feel like these things usually are unimportant until they're not. Which is to say, I'm totally onboard with doing a slow thing for a case we think is uncommon, so long as we're not painting ourselves into a corner. I like your plan of (if the case arises) telling people who want their code to be fast to put a single shared memory parameter at the front of their param pack and then optimizing for that case.
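
The "single shared memory parameter at the front" plan could look roughly like the sketch below. It is not StreamExecutor code: `ArgKind`, `LaunchArgs`, and `tryLeadingSharedFastPath` are hypothetical names standing in for `KernelArgumentType` and part of the body of `launch`, and the raw-pointer interface is an assumption made so that the example compiles without the StreamExecutor or LLVM headers.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for KernelArgumentType.
enum class ArgKind { Value, SharedDeviceMemory };

// What would be handed to the launch call: the dynamic shared memory byte
// count plus the addresses of the remaining kernel arguments.
struct LaunchArgs {
  std::size_t SharedMemoryBytes;
  void **Addresses;
};

// Returns true and fills Out when there is exactly one shared memory argument
// and it is the first argument. Returns false otherwise, so the caller can
// fall back to the general path that copies the non-shared addresses.
inline bool tryLeadingSharedFastPath(void **Addresses, const std::size_t *Sizes,
                                     const ArgKind *Kinds, std::size_t Count,
                                     std::size_t SharedCount, LaunchArgs &Out) {
  assert(SharedCount <= Count);
  if (SharedCount != 1 || Kinds[0] != ArgKind::SharedDeviceMemory)
    return false;
  // The non-shared addresses already sit contiguously after slot 0, so no
  // copy is needed: skip the first slot and reuse the rest of the array.
  Out.SharedMemoryBytes = Sizes[0];
  Out.Addresses = Addresses + 1;
  return true;
}
```

The point of the design is that the fast path only does a comparison and a pointer increment, so it avoids both the temporary vector and any virtual dispatch.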

Closed by commit rL281635: [SE] Support CUDA dynamic shared memory (authored by jhen). Sep 15 2016, 11:19 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
parallel-libs/trunk/streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp	41 lines
parallel-libs/trunk/streamexecutor/unittests/CoreTests/CMakeLists.txt	5 lines
parallel-libs/trunk/streamexecutor/unittests/CoreTests/CUDATest.cpp	215 lines

Diff 71527

parallel-libs/trunk/streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp

 Error CUDAPlatformDevice::launch(
     const void *PlatformStreamHandle, BlockDimensions BlockSize,
     GridDimensions GridSize, const void *PKernelHandle,
     const PackedKernelArgumentArrayBase &ArgumentArray) {
   CUfunction Function =
       reinterpret_cast<CUfunction>(const_cast<void *>(PKernelHandle));
   CUstream Stream =
       reinterpret_cast<CUstream>(const_cast<void *>(PlatformStreamHandle));
-  // TODO(jhen): Deal with shared memory arguments.
-  unsigned SharedMemoryBytes = 0;
-  void **ArgumentAddresses = const_cast<void **>(ArgumentArray.getAddresses());
-  return CUresultToError(cuLaunchKernel(Function, GridSize.X, GridSize.Y,
-                                        GridSize.Z, BlockSize.X, BlockSize.Y,
-                                        BlockSize.Z, SharedMemoryBytes, Stream,
-                                        ArgumentAddresses, nullptr),
-                         "cuLaunchKernel");
+  auto Launch = [Function, Stream, BlockSize,
+                 GridSize](size_t SharedMemoryBytes, void **ArgumentAddresses) {
+    return CUresultToError(
+        cuLaunchKernel(Function, //
+                       GridSize.X, GridSize.Y, GridSize.Z, //
+                       BlockSize.X, BlockSize.Y, BlockSize.Z, //
+                       SharedMemoryBytes, Stream, ArgumentAddresses, nullptr),
+        "cuLaunchKernel");
+  };
+
+  void **ArgumentAddresses = const_cast<void **>(ArgumentArray.getAddresses());
+  size_t SharedArgumentCount = ArgumentArray.getSharedCount();
+  if (SharedArgumentCount) {
+    // The argument handling in this case is not very efficient. We may need to
+    // come back and optimize it later.
+    //
+    // Perhaps introduce another branch for the case where there is exactly one
+    // shared memory argument and it is the first one. This is the only case
+    // that will be used for compiler-generated CUDA kernels, and OpenCL users
+    // can choose to take advantage of it by combining their dynamic shared
+    // memory arguments and putting them first in the kernel signature.
+    unsigned SharedMemoryBytes = 0;
+    size_t ArgumentCount = ArgumentArray.getArgumentCount();
+    llvm::SmallVector<void *, 16> NonSharedArgumentAddresses(
+        ArgumentCount - SharedArgumentCount);
+    size_t NonSharedIndex = 0;
+    for (size_t I = 0; I < ArgumentCount; ++I)
+      if (ArgumentArray.getType(I) == KernelArgumentType::SHARED_DEVICE_MEMORY)
+        SharedMemoryBytes += ArgumentArray.getSize(I);
+      else
+        NonSharedArgumentAddresses[NonSharedIndex++] = ArgumentAddresses[I];
+    return Launch(SharedMemoryBytes, NonSharedArgumentAddresses.data());
+  }
+  return Launch(0, ArgumentAddresses);
 }

 Error CUDAPlatformDevice::copyD2H(const void *PlatformStreamHandle,
                                   const void *DeviceSrcHandle,
                                   size_t SrcByteOffset, void *HostDst,
                                   size_t DstByteOffset, size_t ByteCount) {
   return CUresultToError(
       cuMemcpyDtoHAsync(
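
For readers without the StreamExecutor headers at hand, the general (slow) path in the new `launch` body can be modeled in isolation. The sketch below is an illustration only: `ArgKind`, `SplitArgs`, and `splitSharedArguments` are hypothetical names, the three parallel vectors are an assumed simplification of `PackedKernelArgumentArrayBase`, and `std::vector` is swapped in for `llvm::SmallVector` so that the example builds without LLVM.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for KernelArgumentType.
enum class ArgKind { Value, SharedDeviceMemory };

struct SplitArgs {
  std::size_t SharedMemoryBytes = 0;
  std::vector<void *> NonSharedAddresses;
};

// Mirrors the loop in the diff: a shared memory argument contributes only its
// size to the running total, and every other argument contributes its address,
// preserving the original order.
inline SplitArgs splitSharedArguments(const std::vector<void *> &Addresses,
                                      const std::vector<std::size_t> &Sizes,
                                      const std::vector<ArgKind> &Kinds) {
  assert(Addresses.size() == Sizes.size() && Sizes.size() == Kinds.size());
  SplitArgs Result;
  Result.NonSharedAddresses.reserve(Addresses.size());
  for (std::size_t I = 0; I < Addresses.size(); ++I) {
    if (Kinds[I] == ArgKind::SharedDeviceMemory)
      Result.SharedMemoryBytes += Sizes[I];
    else
      Result.NonSharedAddresses.push_back(Addresses[I]);
  }
  return Result;
}
```

In the real code the byte total is passed as the shared memory size argument of `cuLaunchKernel` and the filtered address array as its kernel parameters, which is why shared memory arguments must not appear in that array.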

parallel-libs/trunk/streamexecutor/unittests/CoreTests/CMakeLists.txt

+if(STREAM_EXECUTOR_ENABLE_CUDA_PLATFORM)
+  set(CUDA_TEST_SOURCES CUDATest.cpp)
+endif()
+
 add_se_unittest(
   CoreTests
   DeviceTest.cpp
   KernelSpecTest.cpp
   PackedKernelArgumentArrayTest.cpp
   StreamTest.cpp
+  ${CUDA_TEST_SOURCES}
 )

parallel-libs/trunk/streamexecutor/unittests/CoreTests/CUDATest.cpp

+//===-- CUDATest.cpp - Tests for CUDA platform ----------------------------===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+///
+/// \file
+/// This file contains the unit tests for CUDA platform code.
+///
+//===----------------------------------------------------------------------===//
+
+#include "streamexecutor/StreamExecutor.h"
+
+#include "gtest/gtest.h"
+
+namespace {
+
+namespace compilergen {
+using SaxpyKernel =
+    streamexecutor::Kernel<float, streamexecutor::GlobalDeviceMemory<float>,
+                           streamexecutor::GlobalDeviceMemory<float>>;
+
+const char *SaxpyPTX = R"(
+.version 4.3
+.target sm_20
+.address_size 64
+
+.visible .entry saxpy(.param .f32 A, .param .u64 X, .param .u64 Y) {
+  .reg .f32 %AValue;
+  .reg .f32 %XValue;
+  .reg .f32 %YValue;
+  .reg .f32 %Result;
+
+  .reg .b64 %XBaseAddrGeneric;
+  .reg .b64 %YBaseAddrGeneric;
+  .reg .b64 %XBaseAddrGlobal;
+  .reg .b64 %YBaseAddrGlobal;
+  .reg .b64 %XAddr;
+  .reg .b64 %YAddr;
+  .reg .b64 %ThreadByteOffset;
+
+  .reg .b32 %TID;
+
+  ld.param.f32 %AValue, [A];
+  ld.param.u64 %XBaseAddrGeneric, [X];
+  ld.param.u64 %YBaseAddrGeneric, [Y];
+  cvta.to.global.u64 %XBaseAddrGlobal, %XBaseAddrGeneric;
+  cvta.to.global.u64 %YBaseAddrGlobal, %YBaseAddrGeneric;
+  mov.u32 %TID, %tid.x;
+  mul.wide.u32 %ThreadByteOffset, %TID, 4;
+  add.s64 %XAddr, %ThreadByteOffset, %XBaseAddrGlobal;
+  add.s64 %YAddr, %ThreadByteOffset, %YBaseAddrGlobal;
+  ld.global.f32 %XValue, [%XAddr];
+  ld.global.f32 %YValue, [%YAddr];
+  fma.rn.f32 %Result, %AValue, %XValue, %YValue;
+  st.global.f32 [%XAddr], %Result;
+  ret;
+}
+)";
+
+static streamexecutor::MultiKernelLoaderSpec SaxpyLoaderSpec = []() {
+  streamexecutor::MultiKernelLoaderSpec Spec;
+  Spec.addCUDAPTXInMemory("saxpy", {{{2, 0}, SaxpyPTX}});
+  return Spec;
+}();
+
+using SwapPairsKernel =
+    streamexecutor::Kernel<streamexecutor::SharedDeviceMemory<int>,
+                           streamexecutor::GlobalDeviceMemory<int>, int>;
+
+const char *SwapPairsPTX = R"(
+.version 4.3
+.target sm_20
+.address_size 64
+
+.extern .shared .align 4 .b8 SwapSpace[];
+
+.visible .entry SwapPairs(.param .u64 InOut, .param .u32 InOutSize) {
+  .reg .b64 %InOutGeneric;
+  .reg .b32 %InOutSizeValue;
+
+  .reg .b32 %LocalIndex;
+  .reg .b32 %PartnerIndex;
+  .reg .b32 %ThreadsPerBlock;
+  .reg .b32 %BlockIndex;
+  .reg .b32 %GlobalIndex;
+
+  .reg .b32 %GlobalIndexBound;
+  .reg .pred %GlobalIndexTooHigh;
+
+  .reg .b64 %InOutGlobal;
+  .reg .b64 %GlobalByteOffset;
+  .reg .b64 %GlobalAddress;
+
+  .reg .b32 %InitialValue;
+  .reg .b32 %SwappedValue;
+
+  .reg .b64 %SharedBaseAddr;
+  .reg .b64 %LocalWriteByteOffset;
+  .reg .b64 %LocalReadByteOffset;
+  .reg .b64 %SharedWriteAddr;
+  .reg .b64 %SharedReadAddr;
+
+  ld.param.u64 %InOutGeneric, [InOut];
+  ld.param.u32 %InOutSizeValue, [InOutSize];
+  mov.u32 %LocalIndex, %tid.x;
+  mov.u32 %ThreadsPerBlock, %ntid.x;
+  mov.u32 %BlockIndex, %ctaid.x;
+  mad.lo.s32 %GlobalIndex, %ThreadsPerBlock, %BlockIndex, %LocalIndex;
+  and.b32 %GlobalIndexBound, %InOutSizeValue, -2;
+  setp.ge.s32 %GlobalIndexTooHigh, %GlobalIndex, %GlobalIndexBound;
+  @%GlobalIndexTooHigh bra END;
+
+  cvta.to.global.u64 %InOutGlobal, %InOutGeneric;
+  mul.wide.s32 %GlobalByteOffset, %GlobalIndex, 4;
+  add.s64 %GlobalAddress, %InOutGlobal, %GlobalByteOffset;
+  ld.global.u32 %InitialValue, [%GlobalAddress];
+  mul.wide.s32 %LocalWriteByteOffset, %LocalIndex, 4;
+  mov.u64 %SharedBaseAddr, SwapSpace;
+  add.s64 %SharedWriteAddr, %SharedBaseAddr, %LocalWriteByteOffset;
+  st.shared.u32 [%SharedWriteAddr], %InitialValue;
+  bar.sync 0;
+  xor.b32 %PartnerIndex, %LocalIndex, 1;
+  mul.wide.s32 %LocalReadByteOffset, %PartnerIndex, 4;
+  add.s64 %SharedReadAddr, %SharedBaseAddr, %LocalReadByteOffset;
+  ld.shared.u32 %SwappedValue, [%SharedReadAddr];
+  st.global.u32 [%GlobalAddress], %SwappedValue;
+
+END:
+  ret;
+}
+)";
+
+static streamexecutor::MultiKernelLoaderSpec SwapPairsLoaderSpec = []() {
+  streamexecutor::MultiKernelLoaderSpec Spec;
+  Spec.addCUDAPTXInMemory("SwapPairs", {{{2, 0}, SwapPairsPTX}});
+  return Spec;
+}();
+} // namespace compilergen
+
+namespace se = ::streamexecutor;
+namespace cg = ::compilergen;
+
+class CUDATest : public ::testing::Test {
+public:
+  CUDATest()
+      : Platform(getOrDie(se::PlatformManager::getPlatformByName("CUDA"))),
+        Device(getOrDie(Platform->getDevice(0))),
+        Stream(getOrDie(Device.createStream())) {}
+
+  se::Platform *Platform;
+  se::Device Device;
+  se::Stream Stream;
+};
+
+TEST_F(CUDATest, Saxpy) {
+  float A = 42.0f;
+  std::vector<float> HostX = {0, 1, 2, 3};
+  std::vector<float> HostY = {4, 5, 6, 7};
+  size_t ArraySize = HostX.size();
+
+  cg::SaxpyKernel Kernel =
+      getOrDie(Device.createKernel<cg::SaxpyKernel>(cg::SaxpyLoaderSpec));
+
+  se::RegisteredHostMemory<float> RegisteredX =
+      getOrDie(Device.registerHostMemory<float>(HostX));
+  se::RegisteredHostMemory<float> RegisteredY =
+      getOrDie(Device.registerHostMemory<float>(HostY));
+
+  se::GlobalDeviceMemory<float> X =
+      getOrDie(Device.allocateDeviceMemory<float>(ArraySize));
+  se::GlobalDeviceMemory<float> Y =
+      getOrDie(Device.allocateDeviceMemory<float>(ArraySize));
+
+  Stream.thenCopyH2D(RegisteredX, X)
+      .thenCopyH2D(RegisteredY, Y)
+      .thenLaunch(ArraySize, 1, Kernel, A, X, Y)
+      .thenCopyD2H(X, RegisteredX);
+  se::dieIfError(Stream.blockHostUntilDone());
+
+  std::vector<float> ExpectedX = {4, 47, 90, 133};
+  EXPECT_EQ(ExpectedX, HostX);
+}
+
+TEST_F(CUDATest, DynamicSharedMemory) {
+  std::vector<int> HostPairs = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
+  std::vector<int> HostResult(HostPairs.size(), 0);
+  int ArraySize = HostPairs.size();
+
+  cg::SwapPairsKernel Kernel = getOrDie(
+      Device.createKernel<cg::SwapPairsKernel>(cg::SwapPairsLoaderSpec));
+
+  se::RegisteredHostMemory<int> RegisteredPairs =
+      getOrDie(Device.registerHostMemory<int>(HostPairs));
+  se::RegisteredHostMemory<int> RegisteredResult =
+      getOrDie(Device.registerHostMemory<int>(HostResult));
+
+  se::GlobalDeviceMemory<int> Pairs =
+      getOrDie(Device.allocateDeviceMemory<int>(ArraySize));
+  auto SharedMemory =
+      se::SharedDeviceMemory<int>::makeFromElementCount(ArraySize);
+
+  Stream.thenCopyH2D(RegisteredPairs, Pairs)
+      .thenLaunch(ArraySize, 1, Kernel, SharedMemory, Pairs, ArraySize)
+      .thenCopyD2H(Pairs, RegisteredResult);
+  se::dieIfError(Stream.blockHostUntilDone());
+
+  std::vector<int> ExpectedPairs = {1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10};
+  EXPECT_EQ(ExpectedPairs, HostResult);
+}
+
+} // namespace
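
The expected values in the two tests can be checked on the host without a GPU. The following is a rough reference model of what the two PTX kernels compute, written under the assumption of a single thread block that covers the whole array, as in the `thenLaunch(ArraySize, 1, ...)` calls above. `saxpyReference` and `swapPairsReference` are hypothetical helper names, not part of StreamExecutor.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Models the saxpy kernel: each thread computes fma(A, X[i], Y[i]) and stores
// the result back into X, not Y.
inline void saxpyReference(float A, std::vector<float> &X,
                           const std::vector<float> &Y) {
  assert(X.size() == Y.size());
  for (std::size_t I = 0; I < X.size(); ++I)
    X[I] = std::fma(A, X[I], Y[I]);
}

// Models the SwapPairs kernel for one block. Threads at or beyond the largest
// even index bound (size & -2 in the PTX) exit early and leave their element
// untouched. The remaining threads copy their element into shared memory,
// wait at the barrier, and then read their partner's slot (index ^ 1).
inline void swapPairsReference(std::vector<int> &InOut) {
  std::size_t Bound = InOut.size() & ~static_cast<std::size_t>(1);
  // Phase 1: everything written to shared memory before the barrier.
  std::vector<int> Shared(InOut.begin(), InOut.begin() + Bound);
  // Phase 2: each thread writes back the value stored by its partner.
  for (std::size_t I = 0; I < Bound; ++I)
    InOut[I] = Shared[I ^ 1];
}
```

The shared buffer here holds one `int` per participating thread, which matches the test sizing its `SharedDeviceMemory<int>` with `makeFromElementCount(ArraySize)`.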
