This is an archive of the discontinued LLVM Phabricator instance.

streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp
167	Huh, does clang-format actually do this? If so maybe that's worth filing a bug -- that is a strange choice.
189	This whole dance is going to destroy the beautiful efficiency gains you were after, right? It sort of seems like the only way to make this work would be to have a different args-packing class for each platform. But I am not sure how to do that without introducing virtual function calls, which would also destroy your beautiful efficiency gains. At least let's use llvm::SmallVector so we don't have to malloc anything. And maybe add a comment that we may need to come back and improve this?
streamexecutor/unittests/CoreTests/CUDATest.cpp
142	Doesn't match ns name above. (It's going to be technically UB for something other than the compiler to put anything into __foo.)

This revision is now accepted and ready to land. Sep 15 2016, 8:54 AM

Comment on dyn-shared-memory arg efficiency

streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp
167	Yes, clang-format actually does this. I'll create a simple reproducer and file a bug.
189	While optimizing this internally, our approach was that the dynamic shared memory case was not common, and we would accept being inefficient in that case, as long as we were efficient in the case of no dynamic shared memory. So, the idea is that, other than the check for `ArgumentArray.getArgumentCount() == 0`, the no-dynamic-shared-memory case should take advantage of the efficiency gains in the quirky packed argument array design. Definitely let me know if I've done something that has broken that case. For the general dynamic shared memory case, I couldn't think of a good way to make it efficient for CUDA and OpenCL at the same time. As you mentioned, it might require virtual function calls, which would hurt both cases. But again, I think we're OK to be less efficient in this case. Actually, now that I think of it, we could be much more efficient in the most important specific case of dynamic shared memory--the case where there is only one dynamic shared memory argument, and it is the first one. That would work for all CUDA cases, and OpenCL users could take advantage of it as well by choosing to write their kernels in that way. For now, I've switched to using llvm::SmallVector (thanks for that idea!) and wrote a comment describing how we might improve this in the future.
streamexecutor/unittests/CoreTests/CUDATest.cpp
142	Thanks for catching that. I changed it from the example code to avoid UB, but I missed the comment.
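
The point of the llvm::SmallVector suggestion above is that a container with inline capacity does not touch the heap until it outgrows that capacity. The following is a minimal, self-contained sketch of the idea only. InlinePtrBuffer is a hypothetical stand-in written for illustration; it is not llvm::SmallVector and does not reproduce its API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration of an inline-capacity pointer buffer. The first
// InlineCapacity elements live inside the object itself; heap storage is
// only used once that capacity is exceeded.
template <std::size_t InlineCapacity> class InlinePtrBuffer {
public:
  void push_back(void *P) {
    if (Count < InlineCapacity) {
      Inline[Count] = P;
    } else {
      // First overflow: move the inline elements to the heap copy.
      if (Count == InlineCapacity)
        Heap.assign(Inline.begin(), Inline.end());
      Heap.push_back(P);
    }
    ++Count;
  }

  void *operator[](std::size_t I) {
    assert(I < Count && "index out of range");
    return data()[I];
  }

  // Contiguous storage, suitable for passing as a void** argument array.
  void **data() {
    return Count <= InlineCapacity ? Inline.data() : Heap.data();
  }

  std::size_t size() const { return Count; }
  bool usesHeap() const { return Count > InlineCapacity; }

private:
  std::array<void *, InlineCapacity> Inline{};
  std::vector<void *> Heap;
  std::size_t Count = 0;
};
```

With a capacity chosen to cover typical kernel argument counts, the dynamic-shared-memory path would usually stay on the stack, which is the property the review asks for.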

jlebar added inline comments. Sep 15 2016, 10:57 AM

streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp
189	> our approach was that the dynamic shared memory case was not common, and we would accept being inefficient in that case, as long as we were efficient in the case of no dynamic shared memory.
	I feel like these things usually are unimportant until they're not. Which is to say, I'm totally onboard with doing a slow thing for a case we think is uncommon, so long as we're not painting ourselves into a corner. I like your plan of (if the case arises) telling people who want their code to be fast to put a single shared memory parameter at the front of their param pack and then optimizing for that case.
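
The plan discussed in this thread (no copy when there is no shared argument, no copy when a single shared argument comes first, copy otherwise) could be sketched roughly as follows. This is not the committed code: ArgType, LaunchPlan, and planLaunch are hypothetical names, and the sketch assumes the address array is ordered like the kernel arguments and that a shared argument's size is given in bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the argument metadata discussed in the review.
enum class ArgType { Value, GlobalDeviceMemory, SharedDeviceMemory };

struct LaunchPlan {
  std::size_t SharedMemoryBytes; // dynamic shared memory to request
  void **ArgumentAddresses;      // addresses to hand to the launch call
  bool Copied;                   // true only when addresses were copied
};

// Decides how to launch. Scratch is written only in the general case and
// must outlive the returned plan.
LaunchPlan planLaunch(const std::vector<ArgType> &Types,
                      const std::vector<std::size_t> &Sizes,
                      void **Addresses, std::vector<void *> &Scratch) {
  assert(Types.size() == Sizes.size());
  std::size_t SharedCount = 0;
  for (ArgType T : Types)
    if (T == ArgType::SharedDeviceMemory)
      ++SharedCount;

  // Common case: no shared arguments, forward the packed array untouched.
  if (SharedCount == 0)
    return {0, Addresses, false};

  // Proposed fast path: one shared argument at the front. Skipping it is a
  // pointer increment, so no copy is needed.
  if (SharedCount == 1 && Types.front() == ArgType::SharedDeviceMemory)
    return {Sizes.front(), Addresses + 1, false};

  // General case: sum the shared sizes and copy the remaining addresses.
  Scratch.clear();
  std::size_t Bytes = 0;
  for (std::size_t I = 0; I < Types.size(); ++I) {
    if (Types[I] == ArgType::SharedDeviceMemory)
      Bytes += Sizes[I];
    else
      Scratch.push_back(Addresses[I]);
  }
  return {Bytes, Scratch.data(), true};
}
```

Only the last branch allocates or copies, so kernels written with a single leading shared parameter would pay the same cost as kernels with none.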

Closed by commit rL281635: [SE] Support CUDA dynamic shared memory (authored by jhen). Sep 15 2016, 11:19 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                        Size
streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp    33 lines
streamexecutor/unittests/CoreTests/CMakeLists.txt           5 lines
streamexecutor/unittests/CoreTests/CUDATest.cpp             215 lines

Diff 71468

streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp

 Error CUDAPlatformDevice::launch(
     const void *PlatformStreamHandle, BlockDimensions BlockSize,
     GridDimensions GridSize, const void *PKernelHandle,
     const PackedKernelArgumentArrayBase &ArgumentArray) {
   CUfunction Function =
       reinterpret_cast<CUfunction>(const_cast<void *>(PKernelHandle));
   CUstream Stream =
       reinterpret_cast<CUstream>(const_cast<void *>(PlatformStreamHandle));
-  // TODO(jhen): Deal with shared memory arguments.
-  unsigned SharedMemoryBytes = 0;
-  void **ArgumentAddresses = const_cast<void **>(ArgumentArray.getAddresses());
-  return CUresultToError(cuLaunchKernel(Function, GridSize.X, GridSize.Y,
-                                        GridSize.Z, BlockSize.X, BlockSize.Y,
-                                        BlockSize.Z, SharedMemoryBytes, Stream,
-                                        ArgumentAddresses, nullptr),
-                         "cuLaunchKernel");
+  auto Launch = [Function, Stream, BlockSize,
+                 GridSize](size_t SharedMemoryBytes, void **ArgumentAddresses) {
+    return CUresultToError(
+        cuLaunchKernel(Function,                              //
+                       GridSize.X, GridSize.Y, GridSize.Z,    //
+                       BlockSize.X, BlockSize.Y, BlockSize.Z, //
+                       SharedMemoryBytes, Stream, ArgumentAddresses, nullptr),
+        "cuLaunchKernel");
+  };
+
+  void **ArgumentAddresses = const_cast<void **>(ArgumentArray.getAddresses());
+  size_t SharedArgumentCount = ArgumentArray.getSharedCount();
+  if (SharedArgumentCount) {
+    unsigned SharedMemoryBytes = 0;
+    size_t ArgumentCount = ArgumentArray.getArgumentCount();
+    std::vector<void *> NonSharedArgumentAddresses(ArgumentCount -
+                                                   SharedArgumentCount);
+    size_t NonSharedIndex = 0;
+    for (size_t I = 0; I < ArgumentCount; ++I)
+      if (ArgumentArray.getType(I) == KernelArgumentType::SHARED_DEVICE_MEMORY)
+        SharedMemoryBytes += ArgumentArray.getSize(I);
+      else
+        NonSharedArgumentAddresses[NonSharedIndex++] = ArgumentAddresses[I];
+    return Launch(SharedMemoryBytes, NonSharedArgumentAddresses.data());
+  }
+  return Launch(0, ArgumentAddresses);
 }

 Error CUDAPlatformDevice::copyD2H(const void *PlatformStreamHandle,
                                   const void *DeviceSrcHandle,
                                   size_t SrcByteOffset, void *HostDst,
                                   size_t DstByteOffset, size_t ByteCount) {
   return CUresultToError(
       cuMemcpyDtoHAsync(

streamexecutor/unittests/CoreTests/CMakeLists.txt

+if(STREAM_EXECUTOR_ENABLE_CUDA_PLATFORM)
+    set(CUDA_TEST_SOURCES CUDATest.cpp)
+endif()
+
 add_se_unittest(
     CoreTests
     DeviceTest.cpp
     KernelSpecTest.cpp
     PackedKernelArgumentArrayTest.cpp
     StreamTest.cpp
+    ${CUDA_TEST_SOURCES}
 )

streamexecutor/unittests/CoreTests/CUDATest.cpp

This file was added.

				//===-- CUDATest.cpp - Tests for CUDA platform ----------------------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				///
				/// \file
				/// This file contains the unit tests for CUDA platform code.
				///
				//===----------------------------------------------------------------------===//

				#include "streamexecutor/StreamExecutor.h"

				#include "gtest/gtest.h"

				namespace {

				namespace compilergen {
				using SaxpyKernel =
				streamexecutor::Kernel<float, streamexecutor::GlobalDeviceMemory<float>,
				streamexecutor::GlobalDeviceMemory<float>>;

				const char *SaxpyPTX = R"(
				.version 4.3
				.target sm_20
				.address_size 64

				.visible .entry saxpy(.param .f32 A, .param .u64 X, .param .u64 Y) {
				.reg .f32 %AValue;
				.reg .f32 %XValue;
				.reg .f32 %YValue;
				.reg .f32 %Result;

				.reg .b64 %XBaseAddrGeneric;
				.reg .b64 %YBaseAddrGeneric;
				.reg .b64 %XBaseAddrGlobal;
				.reg .b64 %YBaseAddrGlobal;
				.reg .b64 %XAddr;
				.reg .b64 %YAddr;
				.reg .b64 %ThreadByteOffset;

				.reg .b32 %TID;

				ld.param.f32 %AValue, [A];
				ld.param.u64 %XBaseAddrGeneric, [X];
				ld.param.u64 %YBaseAddrGeneric, [Y];
				cvta.to.global.u64 %XBaseAddrGlobal, %XBaseAddrGeneric;
				cvta.to.global.u64 %YBaseAddrGlobal, %YBaseAddrGeneric;
				mov.u32 %TID, %tid.x;
				mul.wide.u32 %ThreadByteOffset, %TID, 4;
				add.s64 %XAddr, %ThreadByteOffset, %XBaseAddrGlobal;
				add.s64 %YAddr, %ThreadByteOffset, %YBaseAddrGlobal;
				ld.global.f32 %XValue, [%XAddr];
				ld.global.f32 %YValue, [%YAddr];
				fma.rn.f32 %Result, %AValue, %XValue, %YValue;
				st.global.f32 [%XAddr], %Result;
				ret;
				}
				)";

				static streamexecutor::MultiKernelLoaderSpec SaxpyLoaderSpec = []() {
				streamexecutor::MultiKernelLoaderSpec Spec;
				Spec.addCUDAPTXInMemory("saxpy", {{{2, 0}, SaxpyPTX}});
				return Spec;
				}();

				using SwapPairsKernel =
				streamexecutor::Kernel<streamexecutor::SharedDeviceMemory<int>,
				streamexecutor::GlobalDeviceMemory<int>, int>;

				const char *SwapPairsPTX = R"(
				.version 4.3
				.target sm_20
				.address_size 64

				.extern .shared .align 4 .b8 SwapSpace[];

				.visible .entry SwapPairs(.param .u64 InOut, .param .u32 InOutSize) {
				.reg .b64 %InOutGeneric;
				.reg .b32 %InOutSizeValue;

				.reg .b32 %LocalIndex;
				.reg .b32 %PartnerIndex;
				.reg .b32 %ThreadsPerBlock;
				.reg .b32 %BlockIndex;
				.reg .b32 %GlobalIndex;

				.reg .b32 %GlobalIndexBound;
				.reg .pred %GlobalIndexTooHigh;

				.reg .b64 %InOutGlobal;
				.reg .b64 %GlobalByteOffset;
				.reg .b64 %GlobalAddress;

				.reg .b32 %InitialValue;
				.reg .b32 %SwappedValue;

				.reg .b64 %SharedBaseAddr;
				.reg .b64 %LocalWriteByteOffset;
				.reg .b64 %LocalReadByteOffset;
				.reg .b64 %SharedWriteAddr;
				.reg .b64 %SharedReadAddr;

				ld.param.u64 %InOutGeneric, [InOut];
				ld.param.u32 %InOutSizeValue, [InOutSize];
				mov.u32 %LocalIndex, %tid.x;
				mov.u32 %ThreadsPerBlock, %ntid.x;
				mov.u32 %BlockIndex, %ctaid.x;
				mad.lo.s32 %GlobalIndex, %ThreadsPerBlock, %BlockIndex, %LocalIndex;
				and.b32 %GlobalIndexBound, %InOutSizeValue, -2;
				setp.ge.s32 %GlobalIndexTooHigh, %GlobalIndex, %GlobalIndexBound;
				@%GlobalIndexTooHigh bra END;

				cvta.to.global.u64 %InOutGlobal, %InOutGeneric;
				mul.wide.s32 %GlobalByteOffset, %GlobalIndex, 4;
				add.s64 %GlobalAddress, %InOutGlobal, %GlobalByteOffset;
				ld.global.u32 %InitialValue, [%GlobalAddress];
				mul.wide.s32 %LocalWriteByteOffset, %LocalIndex, 4;
				mov.u64 %SharedBaseAddr, SwapSpace;
				add.s64 %SharedWriteAddr, %SharedBaseAddr, %LocalWriteByteOffset;
				st.shared.u32 [%SharedWriteAddr], %InitialValue;
				bar.sync 0;
				xor.b32 %PartnerIndex, %LocalIndex, 1;
				mul.wide.s32 %LocalReadByteOffset, %PartnerIndex, 4;
				add.s64 %SharedReadAddr, %SharedBaseAddr, %LocalReadByteOffset;
				ld.shared.u32 %SwappedValue, [%SharedReadAddr];
				st.global.u32 [%GlobalAddress], %SwappedValue;

				END:
				ret;
				}
				)";

				static streamexecutor::MultiKernelLoaderSpec SwapPairsLoaderSpec = []() {
				streamexecutor::MultiKernelLoaderSpec Spec;
				Spec.addCUDAPTXInMemory("SwapPairs", {{{2, 0}, SwapPairsPTX}});
				return Spec;
				}();
				} // namespace __compilergen

				namespace se = ::streamexecutor;
				namespace cg = ::compilergen;

				class CUDATest : public ::testing::Test {
				public:
				CUDATest()
				: Platform(getOrDie(se::PlatformManager::getPlatformByName("CUDA"))),
				Device(getOrDie(Platform->getDevice(0))),
				Stream(getOrDie(Device.createStream())) {}

				se::Platform *Platform;
				se::Device Device;
				se::Stream Stream;
				};

				TEST_F(CUDATest, Saxpy) {
				float A = 42.0f;
				std::vector<float> HostX = {0, 1, 2, 3};
				std::vector<float> HostY = {4, 5, 6, 7};
				size_t ArraySize = HostX.size();

				cg::SaxpyKernel Kernel =
				getOrDie(Device.createKernel<cg::SaxpyKernel>(cg::SaxpyLoaderSpec));

				se::RegisteredHostMemory<float> RegisteredX =
				getOrDie(Device.registerHostMemory<float>(HostX));
				se::RegisteredHostMemory<float> RegisteredY =
				getOrDie(Device.registerHostMemory<float>(HostY));

				se::GlobalDeviceMemory<float> X =
				getOrDie(Device.allocateDeviceMemory<float>(ArraySize));
				se::GlobalDeviceMemory<float> Y =
				getOrDie(Device.allocateDeviceMemory<float>(ArraySize));

				Stream.thenCopyH2D(RegisteredX, X)
				.thenCopyH2D(RegisteredY, Y)
				.thenLaunch(ArraySize, 1, Kernel, A, X, Y)
				.thenCopyD2H(X, RegisteredX);
				se::dieIfError(Stream.blockHostUntilDone());

				std::vector<float> ExpectedX = {4, 47, 90, 133};
				EXPECT_EQ(ExpectedX, HostX);
				}

				TEST_F(CUDATest, DynamicSharedMemory) {
				std::vector<int> HostPairs = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
				std::vector<int> HostResult(HostPairs.size(), 0);
				int ArraySize = HostPairs.size();

				cg::SwapPairsKernel Kernel = getOrDie(
				Device.createKernel<cg::SwapPairsKernel>(cg::SwapPairsLoaderSpec));

				se::RegisteredHostMemory<int> RegisteredPairs =
				getOrDie(Device.registerHostMemory<int>(HostPairs));
				se::RegisteredHostMemory<int> RegisteredResult =
				getOrDie(Device.registerHostMemory<int>(HostResult));

				se::GlobalDeviceMemory<int> Pairs =
				getOrDie(Device.allocateDeviceMemory<int>(ArraySize));
				auto SharedMemory =
				se::SharedDeviceMemory<int>::makeFromElementCount(ArraySize);

				Stream.thenCopyH2D(RegisteredPairs, Pairs)
				.thenLaunch(ArraySize, 1, Kernel, SharedMemory, Pairs, ArraySize)
				.thenCopyD2H(Pairs, RegisteredResult);
				se::dieIfError(Stream.blockHostUntilDone());

				std::vector<int> ExpectedPairs = {1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10};
				EXPECT_EQ(ExpectedPairs, HostResult);
				}

				} // namespace
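
For readers checking the expected values in the two tests, the PTX kernels can be modelled on the host. The sketch below is an assumption-laden reading of the PTX above, not code from the revision: saxpyReference and swapPairsReference are hypothetical helpers, and the swap model covers only the single-block launch the test uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host model of the saxpy kernel: each element of X becomes A * X[i] + Y[i],
// written back into X as the PTX store to %XAddr does.
std::vector<float> saxpyReference(float A, std::vector<float> X,
                                  const std::vector<float> &Y) {
  assert(X.size() == Y.size());
  for (std::size_t I = 0; I < X.size(); ++I)
    X[I] = A * X[I] + Y[I];
  return X;
}

// Host model of SwapPairs for one block: every index below the even bound
// reads its partner's value (index XOR 1), as staged through shared memory
// in the kernel. A trailing unpaired element is left unchanged.
std::vector<int> swapPairsReference(const std::vector<int> &In) {
  std::vector<int> Out = In;
  // Mirrors `and.b32 %GlobalIndexBound, %InOutSizeValue, -2`.
  std::size_t Bound = In.size() & ~static_cast<std::size_t>(1);
  for (std::size_t I = 0; I < Bound; ++I)
    Out[I] = In[I ^ 1];
  return Out;
}
```

Applied to the inputs in the tests, these models give the same values as ExpectedX and ExpectedPairs.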


[SE] Support CUDA dynamic shared memory (Closed, Public)
