This is an archive of the discontinued LLVM Phabricator instance.

Differential D24528

[SE] Pack global dev handle addresses
Closed, Public

Authored by jhen on Sep 13 2016, 4:16 PM.

Details

Reviewers

jlebar

Commits

rGb38d8a3a3baa: [SE] Pack global dev handle addresses
rL281424: [SE] Pack global dev handle addresses

Summary

We were packing global device memory handles in
PackedKernelArgumentArray, but as I was implementing the CUDA
platform, I realized that CUDA wants the address of the handle, not the
handle itself. So this patch switches to packing the address of the
handle.
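
To make the distinction concrete, here is a minimal host-only sketch. `FakeDeviceMemory`, `readPointerArgument`, and `packedAddressRecoversHandle` are hypothetical names for illustration, not StreamExecutor or CUDA APIs. It assumes a launcher that, like CUDA's `cuLaunchKernel` with its `kernelParams` array, treats each packed entry as a pointer to the argument's value and copies `sizeof(void *)` bytes from it.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for GlobalDeviceMemoryBase: owns an opaque handle.
class FakeDeviceMemory {
public:
  explicit FakeDeviceMemory(const void *Handle) : Handle(Handle) {}
  const void *getHandle() const { return Handle; }
  const void *const *getHandleAddress() const { return &Handle; }

private:
  const void *Handle;
};

// Simulates a launcher reading one pointer-sized argument value from the
// location named by a packed entry (an assumption modeled on kernelParams).
inline const void *readPointerArgument(const void *PackedAddress) {
  assert(PackedAddress != nullptr);
  const void *Value = nullptr;
  std::memcpy(&Value, PackedAddress, sizeof(Value));
  return Value;
}

// Returns true if reading through the packed handle address yields the handle.
inline bool packedAddressRecoversHandle() {
  static int PretendDeviceBuffer[4] = {1, 2, 3, 4};
  FakeDeviceMemory Mem(PretendDeviceBuffer);
  // Pack the address of the handle, as this patch does.
  const void *Addresses[1] = {Mem.getHandleAddress()};
  return readPointerArgument(Addresses[0]) == Mem.getHandle();
}
```

Had the handle itself been packed, a launcher of this kind would read pointer-sized bytes from wherever the handle points rather than the handle value, which is the mismatch the summary describes.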

Diff Detail

Repository: rL LLVM

Event Timeline

jhen updated this revision to Diff 71256. Sep 13 2016, 4:16 PM

jhen retitled this revision to [SE] Pack global dev handle addresses.

jhen updated this object.

jhen added a reviewer: jlebar.

jhen added subscribers: parallel_libs-commits, jprice.

Herald added a subscriber: jlebar. Sep 13 2016, 4:16 PM

jlebar added inline comments. Sep 13 2016, 4:26 PM

streamexecutor/include/streamexecutor/DeviceMemory.h
Line 170 (on Diff #71256): Same question as last patch: What happens if this guy is moved? Specifically, would it be a problem if it were moved after calling thenLaunch but before the driver actually launches the kernel?

Warning about kernel launch args

streamexecutor/include/streamexecutor/DeviceMemory.h
Line 170 (on Diff #71256): Yes, it would be a problem if the argument were moved at the wrong time. This was a choice we made internally to reduce kernel launch overhead, which apparently could make up 5% of some applications' run time. In response to your question, I added a warning to the `Stream::thenLaunch` method telling users not to touch kernel launch arguments from other threads.

jlebar added inline comments. Sep 13 2016, 4:54 PM

streamexecutor/include/streamexecutor/DeviceMemory.h
Line 170 (on Diff #71268): Clearly in the multithreaded case you can't modify a GlobalDeviceMemoryBase concurrently with a call to thenLaunch(). That's true of any parameters to any function call, so I am not sure that's even worth warning about.

What I was worried about was a single-threaded case, something like this:

  GlobalDeviceMemory<int> other_mem;
  {
    GlobalDeviceMemory<int> mem;
    Stream->thenLaunch(foo_kernel, mem);
    other_mem = std::move(mem);
  }
  Stream->block();

Is this safe? That is, do we use &mem.Handle only within thenLaunch? (We don't actually have to have a separate scope and so on for us to hit this same problem.)

(I have to admit, if any of this is 5% of some applications' runtime, it seems like we could do a lot better even than what we have here. I'm not sure how, but hot code is hot...)

Remove warning

streamexecutor/include/streamexecutor/DeviceMemory.h
Line 170 (on Diff #71268): Oh, that's right. I think we spoke about this before. There should be no problem with your example: the platform should be responsible for copying the arguments (using memcpy) before returning from a launch call. This matches the guarantee currently provided by the CUDA driver library. I removed the unnecessary warning.
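
The guarantee described in this reply can be illustrated with a small host-only sketch. `FakeMemory`, `FakeStream`, `launch`, and `moveAfterLaunchIsSafe` are hypothetical names, not the StreamExecutor API. The sketch assumes the platform copies each pointer-sized argument with memcpy before the launch call returns, so the packed addresses are no longer needed once it has returned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical movable owner of an opaque handle.
class FakeMemory {
public:
  FakeMemory() = default;
  explicit FakeMemory(const void *Handle) : Handle(Handle) {}
  FakeMemory(FakeMemory &&Other) noexcept
      : Handle(std::exchange(Other.Handle, nullptr)) {}
  FakeMemory &operator=(FakeMemory &&Other) noexcept {
    Handle = std::exchange(Other.Handle, nullptr);
    return *this;
  }
  const void *getHandle() const { return Handle; }
  const void *const *getHandleAddress() const { return &Handle; }

private:
  const void *Handle = nullptr;
};

// Hypothetical stream that snapshots argument values at launch time.
class FakeStream {
public:
  // Copies one pointer-sized value from each packed address before returning.
  void launch(const void *const *Addresses, std::size_t Count) {
    assert(Addresses != nullptr || Count == 0);
    Snapshot.resize(Count);
    for (std::size_t I = 0; I < Count; ++I)
      std::memcpy(&Snapshot[I], Addresses[I], sizeof(void *));
  }
  // The value the "kernel" would see for argument I.
  const void *argument(std::size_t I) const { return Snapshot[I]; }

private:
  std::vector<const void *> Snapshot;
};

// Mirrors the single-threaded scenario from the review: move after launch.
inline bool moveAfterLaunchIsSafe() {
  static int Buffer = 0;
  FakeStream Stream;
  FakeMemory Other;
  {
    FakeMemory Mem(&Buffer);
    const void *Addresses[1] = {Mem.getHandleAddress()};
    Stream.launch(Addresses, 1);
    Other = std::move(Mem); // Mem's handle slot is now null.
  }
  // The snapshot taken inside launch still holds the original handle.
  return Stream.argument(0) == &Buffer && Other.getHandle() == &Buffer;
}
```

If `launch` stored the packed addresses instead of the values, the later move (or the end of the inner scope) would leave it with a stale pointer, which is the hazard raised in the question.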

> I think we spoke about this before.

Oh, I remember this now. :) Okay, sgtm.

This revision is now accepted and ready to land. Sep 13 2016, 5:01 PM

Closed by commit rL281424: [SE] Pack global dev handle addresses (authored by jhen). Sep 13 2016, 5:07 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                                                      Size
parallel-libs/trunk/streamexecutor/examples/HostSaxpy.cpp                                 4 lines
parallel-libs/trunk/streamexecutor/include/streamexecutor/DeviceMemory.h                  3 lines
parallel-libs/trunk/streamexecutor/include/streamexecutor/PackedKernelArgumentArray.h     27 lines
parallel-libs/trunk/streamexecutor/unittests/CoreTests/PackedKernelArgumentArrayTest.cpp  12 lines

Diff 71277

parallel-libs/trunk/streamexecutor/examples/HostSaxpy.cpp

[27 lines not shown]
 namespace __compilergen {
 using SaxpyKernel =
     streamexecutor::Kernel<float, streamexecutor::GlobalDeviceMemory<float>,
                            streamexecutor::GlobalDeviceMemory<float>, size_t>;

 // Wrapper function converts argument addresses to arguments.
 void SaxpyWrapper(const void *const *ArgumentAddresses) {
   Saxpy(*static_cast<const float *>(ArgumentAddresses[0]),
-        static_cast<float *>(const_cast<void *>(ArgumentAddresses[1])),
-        static_cast<float *>(const_cast<void *>(ArgumentAddresses[2])),
+        *static_cast<float **>(const_cast<void *>(ArgumentAddresses[1])),
+        *static_cast<float **>(const_cast<void *>(ArgumentAddresses[2])),
         *static_cast<const size_t *>(ArgumentAddresses[3]));
 }

 // The wrapper function is what gets registered.
 static streamexecutor::MultiKernelLoaderSpec SaxpyLoaderSpec = []() {
   streamexecutor::MultiKernelLoaderSpec Spec;
   Spec.addHostFunction("Saxpy", SaxpyWrapper);
   return Spec;
[49 lines not shown]
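
As a rough illustration of how a host wrapper unpacks arguments once handle addresses are packed, the sketch below uses hypothetical names (`saxpy`, `saxpyWrapper`, `runSaxpy`) and its own SAXPY signature. It assumes that on the host platform the handle is simply the host pointer to the data, so each memory entry points at a `float *` and must be dereferenced once.

```cpp
#include <cassert>
#include <cstddef>

// Plain host SAXPY: Y = A * X + Y.
inline void saxpy(float A, const float *X, float *Y, std::size_t N) {
  for (std::size_t I = 0; I < N; ++I)
    Y[I] = A * X[I] + Y[I];
}

// Every entry points at an argument value. For the memory arguments the value
// is the handle (assumed here to be a float *), so the entry is treated as a
// pointer to that handle and dereferenced once.
inline void saxpyWrapper(const void *const *ArgumentAddresses) {
  assert(ArgumentAddresses != nullptr);
  saxpy(*static_cast<const float *>(ArgumentAddresses[0]),
        *static_cast<float *const *>(ArgumentAddresses[1]),
        *static_cast<float *const *>(ArgumentAddresses[2]),
        *static_cast<const std::size_t *>(ArgumentAddresses[3]));
}

// Packs the address of each argument, calls the wrapper, and returns the sum
// of the updated Y vector.
inline float runSaxpy() {
  float A = 2.0f;
  float X[2] = {1.0f, 2.0f};
  float Y[2] = {10.0f, 20.0f};
  std::size_t N = 2;
  float *XHandle = X; // stands in for a device memory handle
  float *YHandle = Y;
  const void *Addresses[4] = {&A, &XHandle, &YHandle, &N};
  saxpyWrapper(Addresses);
  return Y[0] + Y[1];
}
```

With this layout the scalar and memory arguments are unpacked the same way: each entry is cast to a pointer to the argument's type and dereferenced.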

parallel-libs/trunk/streamexecutor/include/streamexecutor/DeviceMemory.h

[127 lines not shown]
 ///
 /// For example, in the OpenCL platform, the handle is a pointer to a _cl_mem
 /// handle object which really is completely opaque to the user.
 class GlobalDeviceMemoryBase {
 public:
   /// Returns an opaque handle to the underlying memory.
   const void *getHandle() const { return Handle; }

+  /// Returns the address of the opaque handle as stored by this object.
+  const void *const *getHandleAddress() const { return &Handle; }
+
   // Cannot copy because the handle must be owned by a single object.
   GlobalDeviceMemoryBase(const GlobalDeviceMemoryBase &) = delete;
   GlobalDeviceMemoryBase &operator=(const GlobalDeviceMemoryBase &) = delete;

 protected:
   /// Creates a GlobalDeviceMemoryBase from a handle and a byte count.
   GlobalDeviceMemoryBase(Device *D, const void *Handle, size_t ByteCount)
       : TheDevice(D), Handle(Handle), ByteCount(ByteCount) {}
[132 lines not shown]

parallel-libs/trunk/streamexecutor/include/streamexecutor/PackedKernelArgumentArray.h

[158 lines not shown]
 private:

   // Pack a normal, non-device-memory argument.
   template <typename T> void PackOneArgument(size_t Index, const T &Argument) {
     Addresses[Index] = &Argument;
     Sizes[Index] = sizeof(T);
     Types[Index] = KernelArgumentType::VALUE;
   }

-  // Pack a GlobalDeviceMemoryBase argument.
-  void PackOneArgument(size_t Index, const GlobalDeviceMemoryBase &Argument) {
-    Addresses[Index] = Argument.getHandle();
-    Sizes[Index] = sizeof(void *);
-    Types[Index] = KernelArgumentType::GLOBAL_DEVICE_MEMORY;
-  }
-
-  // Pack a GlobalDeviceMemoryBase pointer argument.
-  void PackOneArgument(size_t Index, GlobalDeviceMemoryBase *Argument) {
-    Addresses[Index] = Argument->getHandle();
-    Sizes[Index] = sizeof(void *);
-    Types[Index] = KernelArgumentType::GLOBAL_DEVICE_MEMORY;
-  }
-
-  // Pack a const GlobalDeviceMemoryBase pointer argument.
-  void PackOneArgument(size_t Index, const GlobalDeviceMemoryBase *Argument) {
-    Addresses[Index] = Argument->getHandle();
-    Sizes[Index] = sizeof(void *);
-    Types[Index] = KernelArgumentType::GLOBAL_DEVICE_MEMORY;
-  }
-
   // Pack a GlobalDeviceMemory<T> argument.
   template <typename T>
   void PackOneArgument(size_t Index, const GlobalDeviceMemory<T> &Argument) {
-    Addresses[Index] = Argument.getHandle();
+    Addresses[Index] = Argument.getHandleAddress();
     Sizes[Index] = sizeof(void *);
     Types[Index] = KernelArgumentType::GLOBAL_DEVICE_MEMORY;
   }

   // Pack a GlobalDeviceMemory<T> pointer argument.
   template <typename T>
   void PackOneArgument(size_t Index, GlobalDeviceMemory<T> *Argument) {
-    Addresses[Index] = Argument->getHandle();
+    Addresses[Index] = Argument->getHandleAddress();
     Sizes[Index] = sizeof(void *);
     Types[Index] = KernelArgumentType::GLOBAL_DEVICE_MEMORY;
   }

   // Pack a const GlobalDeviceMemory<T> pointer argument.
   template <typename T>
   void PackOneArgument(size_t Index, const GlobalDeviceMemory<T> *Argument) {
-    Addresses[Index] = Argument->getHandle();
+    Addresses[Index] = Argument->getHandleAddress();
     Sizes[Index] = sizeof(void *);
     Types[Index] = KernelArgumentType::GLOBAL_DEVICE_MEMORY;
   }

   // Pack a SharedDeviceMemory argument.
   template <typename T>
   void PackOneArgument(size_t Index, const SharedDeviceMemory<T> &Argument) {
     ++SharedCount;
[40 lines not shown]

parallel-libs/trunk/streamexecutor/unittests/CoreTests/PackedKernelArgumentArrayTest.cpp

[70 lines not shown]
 TEST_F(DeviceMemoryPackingTest, SingleValue) {
   auto Array = se::make_kernel_argument_pack(Value);
   ExpectEqual(&Value, sizeof(Value), Type::VALUE, Array, 0);
   EXPECT_EQ(1u, Array.getArgumentCount());
   EXPECT_EQ(0u, Array.getSharedCount());
 }

 TEST_F(DeviceMemoryPackingTest, SingleTypedGlobal) {
   auto Array = se::make_kernel_argument_pack(TypedGlobal);
-  ExpectEqual(TypedGlobal.getHandle(), sizeof(void *),
+  ExpectEqual(TypedGlobal.getHandleAddress(), sizeof(void *),
               Type::GLOBAL_DEVICE_MEMORY, Array, 0);
   EXPECT_EQ(1u, Array.getArgumentCount());
   EXPECT_EQ(0u, Array.getSharedCount());
 }

 TEST_F(DeviceMemoryPackingTest, SingleTypedGlobalPointer) {
   auto Array = se::make_kernel_argument_pack(&TypedGlobal);
-  ExpectEqual(TypedGlobal.getHandle(), sizeof(void *),
+  ExpectEqual(TypedGlobal.getHandleAddress(), sizeof(void *),
               Type::GLOBAL_DEVICE_MEMORY, Array, 0);
   EXPECT_EQ(1u, Array.getArgumentCount());
   EXPECT_EQ(0u, Array.getSharedCount());
 }

 TEST_F(DeviceMemoryPackingTest, SingleConstTypedGlobalPointer) {
   const se::GlobalDeviceMemory<int> *ArgumentPointer = &TypedGlobal;
   auto Array = se::make_kernel_argument_pack(ArgumentPointer);
-  ExpectEqual(TypedGlobal.getHandle(), sizeof(void *),
+  ExpectEqual(TypedGlobal.getHandleAddress(), sizeof(void *),
               Type::GLOBAL_DEVICE_MEMORY, Array, 0);
   EXPECT_EQ(1u, Array.getArgumentCount());
   EXPECT_EQ(0u, Array.getSharedCount());
 }

 TEST_F(DeviceMemoryPackingTest, SingleTypedShared) {
   auto Array = se::make_kernel_argument_pack(TypedShared);
   ExpectEqual(nullptr, TypedShared.getByteCount(), Type::SHARED_DEVICE_MEMORY,
[21 lines not shown]

 TEST_F(DeviceMemoryPackingTest, PackSeveralArguments) {
   const se::GlobalDeviceMemory<int> *TypedGlobalPointer = &TypedGlobal;
   const se::SharedDeviceMemory<int> *TypedSharedPointer = &TypedShared;
   auto Array = se::make_kernel_argument_pack(Value, TypedGlobal, &TypedGlobal,
                                              TypedGlobalPointer, TypedShared,
                                              &TypedShared, TypedSharedPointer);
   ExpectEqual(&Value, sizeof(Value), Type::VALUE, Array, 0);
-  ExpectEqual(TypedGlobal.getHandle(), sizeof(void *),
+  ExpectEqual(TypedGlobal.getHandleAddress(), sizeof(void *),
               Type::GLOBAL_DEVICE_MEMORY, Array, 1);
-  ExpectEqual(TypedGlobal.getHandle(), sizeof(void *),
+  ExpectEqual(TypedGlobal.getHandleAddress(), sizeof(void *),
               Type::GLOBAL_DEVICE_MEMORY, Array, 2);
-  ExpectEqual(TypedGlobal.getHandle(), sizeof(void *),
+  ExpectEqual(TypedGlobal.getHandleAddress(), sizeof(void *),
               Type::GLOBAL_DEVICE_MEMORY, Array, 3);
   ExpectEqual(nullptr, TypedShared.getByteCount(), Type::SHARED_DEVICE_MEMORY,
               Array, 4);
   ExpectEqual(nullptr, TypedShared.getByteCount(), Type::SHARED_DEVICE_MEMORY,
               Array, 5);
   ExpectEqual(nullptr, TypedShared.getByteCount(), Type::SHARED_DEVICE_MEMORY,
               Array, 6);
   EXPECT_EQ(7u, Array.getArgumentCount());
   EXPECT_EQ(3u, Array.getSharedCount());
 }

 } // namespace