This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contents

streamexecutor/
  examples/
    HostSaxpy.cpp
  include/streamexecutor/
    DeviceMemory.h
    PackedKernelArgumentArray.h
    Stream.h
  unittests/CoreTests/
    PackedKernelArgumentArrayTest.cpp

Differential D24528

[SE] Pack global dev handle addresses
ClosedPublic

Authored by jhen on Sep 13 2016, 4:16 PM.

Details

Reviewers

jlebar

Commits

rGb38d8a3a3baa: [SE] Pack global dev handle addresses
rL281424: [SE] Pack global dev handle addresses

Summary

We were packing global device memory handles in
PackedKernelArgumentArray, but as I was implementing the CUDA
platform, I realized that CUDA wants the address of the handle, not the
handle itself. So this patch switches to packing the address of the
handle.
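
The difference between packing a handle and packing its address can be sketched as follows. This is a minimal illustration, not the StreamExecutor code: `FakeDeviceMemory`, `packHandleAddress`, and `readPackedHandle` are hypothetical names, and the launcher is modeled on the CUDA driver convention in which each slot of the argument array points to memory holding the argument value.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for GlobalDeviceMemoryBase: it owns an opaque handle.
class FakeDeviceMemory {
public:
  explicit FakeDeviceMemory(const void *Handle) : Handle(Handle) {}

  // The opaque handle itself (what was packed before this change).
  const void *getHandle() const { return Handle; }

  // The address at which this object stores the handle (what is packed now).
  const void *const *getHandleAddress() const { return &Handle; }

private:
  const void *Handle;
};

// Packs one argument by storing the address of the handle, so that every
// slot in the array points at an argument value.
inline void packHandleAddress(const void **Addresses, std::size_t Index,
                              const FakeDeviceMemory &Argument) {
  Addresses[Index] = Argument.getHandleAddress();
}

// Models a launcher that reads each argument through the pointer stored in
// its slot, as a CUDA-style driver would.
inline const void *readPackedHandle(const void *const *Addresses,
                                    std::size_t Index) {
  return *static_cast<const void *const *>(Addresses[Index]);
}
```

With this layout a value argument and a device memory argument are treated uniformly: both slots hold a pointer to the bytes that the launcher should pass to the kernel.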

Diff Detail

Event Timeline

jhen updated this revision to Diff 71256. · Sep 13 2016, 4:16 PM

jhen retitled this revision to [SE] Pack global dev handle addresses.

jhen updated this object.

jhen added a reviewer: jlebar.

jhen added subscribers: parallel_libs-commits, jprice.

Herald added a subscriber: jlebar. · Sep 13 2016, 4:16 PM

jlebar added inline comments. · Sep 13 2016, 4:26 PM

streamexecutor/include/streamexecutor/DeviceMemory.h
170	Same question as last patch: What happens if this guy is moved? Specifically, would it be a problem if it were moved after calling thenLaunch but before the driver actually launches the kernel?

Warning about kernel launch args

streamexecutor/include/streamexecutor/DeviceMemory.h
170	Yes, it would be a problem if the argument were moved at the wrong time. This was a choice we made internally to reduce the kernel launch overhead, which apparently could make up 5% of some applications' run time. In response to your question, I added a warning to the `Stream::thenLaunch` method telling users not to touch kernel launch arguments from other threads.

jlebar added inline comments. · Sep 13 2016, 4:54 PM

streamexecutor/include/streamexecutor/DeviceMemory.h
170	Clearly, in the multithreaded case you can't modify a GlobalDeviceMemoryBase concurrently with a call to thenLaunch(). That's true of any parameters to any function call, so I am not sure that's even worth warning about.

What I was worried about was a single-threaded case, something like this:

  GlobalDeviceMemory<int> other_mem;
  {
    GlobalDeviceMemory<int> mem;
    Stream->thenLaunch(foo_kernel, mem);
    other_mem = std::move(mem);
  }
  Stream->block();

Is this safe? That is, do we use &mem.Handle only within thenLaunch? (We don't actually need a separate scope and so on to hit this same problem.)

(I have to admit, if any of this is 5% of some applications' runtime, it seems like we could do a lot better even than what we have here. I'm not sure how, but hot code is hot...)

Remove warning

streamexecutor/include/streamexecutor/DeviceMemory.h
170	Oh, that's right. I think we spoke about this before. There should be no problem with your example: the platform should be responsible for copying the arguments (using memcpy) before returning from a launch call. This matches the guarantee currently provided by the CUDA driver library. I removed the unnecessary warning.

> I think we spoke about this before.

Oh, I remember this now. :) Okay, sgtm.
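
The guarantee discussed above (the platform copies argument bytes before the launch call returns) can be sketched with a mock launcher. Everything here is hypothetical: `FakeDeviceMemory` and `FakePlatform` are illustration-only names, and the real platform interface may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical move-only owner of an opaque handle.
class FakeDeviceMemory {
public:
  explicit FakeDeviceMemory(const void *Handle) : Handle(Handle) {}
  FakeDeviceMemory(const FakeDeviceMemory &) = delete;
  FakeDeviceMemory &operator=(const FakeDeviceMemory &) = delete;
  FakeDeviceMemory(FakeDeviceMemory &&Other) noexcept : Handle(Other.Handle) {
    Other.Handle = nullptr;
  }
  FakeDeviceMemory &operator=(FakeDeviceMemory &&Other) noexcept {
    if (this != &Other) {
      Handle = Other.Handle;
      Other.Handle = nullptr;
    }
    return *this;
  }
  const void *const *getHandleAddress() const { return &Handle; }

private:
  const void *Handle;
};

// Hypothetical platform launcher. It copies the bytes of every argument
// with memcpy before returning, so the packed addresses only need to stay
// valid for the duration of the launch call itself.
class FakePlatform {
public:
  void launch(const void *const *Addresses, const std::size_t *Sizes,
              std::size_t Count) {
    Pending.clear();
    for (std::size_t I = 0; I < Count; ++I) {
      std::vector<unsigned char> Bytes(Sizes[I]);
      std::memcpy(Bytes.data(), Addresses[I], Sizes[I]);
      Pending.push_back(std::move(Bytes));
    }
  }

  // Returns the handle value that was captured for argument I at launch.
  const void *capturedHandle(std::size_t I) const {
    const void *Handle = nullptr;
    std::memcpy(&Handle, Pending[I].data(), sizeof(Handle));
    return Handle;
  }

private:
  std::vector<std::vector<unsigned char>> Pending;
};
```

Under this assumption, moving or destroying the memory object after the launch call returns does not affect the captured argument, which is the single-threaded case raised in the review.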

This revision is now accepted and ready to land. · Sep 13 2016, 5:01 PM

Closed by commit rL281424: [SE] Pack global dev handle addresses (authored by jhen). · Sep 13 2016, 5:07 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                                    Size
streamexecutor/examples/HostSaxpy.cpp                                   4 lines
streamexecutor/include/streamexecutor/DeviceMemory.h                    3 lines
streamexecutor/include/streamexecutor/PackedKernelArgumentArray.h       27 lines
streamexecutor/include/streamexecutor/Stream.h                          7 lines
streamexecutor/unittests/CoreTests/PackedKernelArgumentArrayTest.cpp    12 lines

Diff 71268

streamexecutor/examples/HostSaxpy.cpp

	Show All 27 Lines
	namespace __compilergen {			namespace __compilergen {
	using SaxpyKernel =			using SaxpyKernel =
	streamexecutor::Kernel<float, streamexecutor::GlobalDeviceMemory<float>,			streamexecutor::Kernel<float, streamexecutor::GlobalDeviceMemory<float>,
	streamexecutor::GlobalDeviceMemory<float>, size_t>;			streamexecutor::GlobalDeviceMemory<float>, size_t>;

	// Wrapper function converts argument addresses to arguments.			// Wrapper function converts argument addresses to arguments.
	void SaxpyWrapper(const void *const *ArgumentAddresses) {			void SaxpyWrapper(const void *const *ArgumentAddresses) {
	Saxpy(*static_cast<const float *>(ArgumentAddresses[0]),			Saxpy(*static_cast<const float *>(ArgumentAddresses[0]),
	static_cast<float *>(const_cast<void *>(ArgumentAddresses[1])),			*static_cast<float **>(const_cast<void *>(ArgumentAddresses[1])),
	static_cast<float *>(const_cast<void *>(ArgumentAddresses[2])),			*static_cast<float **>(const_cast<void *>(ArgumentAddresses[2])),
	*static_cast<const size_t *>(ArgumentAddresses[3]));			*static_cast<const size_t *>(ArgumentAddresses[3]));
	}			}

	// The wrapper function is what gets registered.			// The wrapper function is what gets registered.
	static streamexecutor::MultiKernelLoaderSpec SaxpyLoaderSpec = []() {			static streamexecutor::MultiKernelLoaderSpec SaxpyLoaderSpec = []() {
	streamexecutor::MultiKernelLoaderSpec Spec;			streamexecutor::MultiKernelLoaderSpec Spec;
	Spec.addHostFunction("Saxpy", SaxpyWrapper);			Spec.addHostFunction("Saxpy", SaxpyWrapper);
	return Spec;			return Spec;
	▲ Show 20 Lines • Show All 49 Lines • Show Last 20 Lines
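
How a host wrapper might unpack arguments once handle addresses are packed can be sketched as below. This is an assumption-laden illustration rather than the file's actual contents: `SaxpySketch` and `SaxpyWrapperSketch` are hypothetical names, and the sketch assumes that on a host platform the handle is simply a pointer to the float buffer.

```cpp
#include <cassert>
#include <cstddef>

// Plain host implementation of Y = A * X + Y (illustrative stand-in).
inline void SaxpySketch(float A, float *X, float *Y, std::size_t N) {
  for (std::size_t I = 0; I < N; ++I)
    Y[I] += A * X[I];
}

// Hypothetical wrapper. Every slot holds the address of an argument value.
// For the scalar arguments that value is the scalar itself. For the buffer
// arguments the value is the handle (assumed here to be a float pointer),
// so the slot must be dereferenced once to recover the buffer pointer.
inline void SaxpyWrapperSketch(const void *const *ArgumentAddresses) {
  SaxpySketch(*static_cast<const float *>(ArgumentAddresses[0]),
              *static_cast<float *const *>(ArgumentAddresses[1]),
              *static_cast<float *const *>(ArgumentAddresses[2]),
              *static_cast<const std::size_t *>(ArgumentAddresses[3]));
}
```

The extra dereference on the two buffer slots is the host-side counterpart of switching the packer from getHandle() to getHandleAddress().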

streamexecutor/include/streamexecutor/DeviceMemory.h

Show First 20 Lines • Show All 127 Lines • ▼ Show 20 Lines
///		///
/// For example, in the OpenCL platform, the handle is a pointer to a _cl_mem		/// For example, in the OpenCL platform, the handle is a pointer to a _cl_mem
/// handle object which really is completely opaque to the user.		/// handle object which really is completely opaque to the user.
class GlobalDeviceMemoryBase {		class GlobalDeviceMemoryBase {
public:		public:
/// Returns an opaque handle to the underlying memory.		/// Returns an opaque handle to the underlying memory.
const void *getHandle() const { return Handle; }		const void *getHandle() const { return Handle; }

		/// Returns the address of the opaque handle as stored by this object.
		const void *const *getHandleAddress() const { return &Handle; }

// Cannot copy because the handle must be owned by a single object.		// Cannot copy because the handle must be owned by a single object.
GlobalDeviceMemoryBase(const GlobalDeviceMemoryBase &) = delete;		GlobalDeviceMemoryBase(const GlobalDeviceMemoryBase &) = delete;
GlobalDeviceMemoryBase &operator=(const GlobalDeviceMemoryBase &) = delete;		GlobalDeviceMemoryBase &operator=(const GlobalDeviceMemoryBase &) = delete;

protected:		protected:
/// Creates a GlobalDeviceMemoryBase from a handle and a byte count.		/// Creates a GlobalDeviceMemoryBase from a handle and a byte count.
GlobalDeviceMemoryBase(Device *D, const void *Handle, size_t ByteCount)		GlobalDeviceMemoryBase(Device *D, const void *Handle, size_t ByteCount)
: TheDevice(D), Handle(Handle), ByteCount(ByteCount) {}		: TheDevice(D), Handle(Handle), ByteCount(ByteCount) {}
Show All 15 Lines	GlobalDeviceMemoryBase &operator=(GlobalDeviceMemoryBase &&Other) noexcept {
Other.Handle = nullptr;		Other.Handle = nullptr;
Other.ByteCount = 0;		Other.ByteCount = 0;
return *this;		return *this;
}		}

~GlobalDeviceMemoryBase();		~GlobalDeviceMemoryBase();

Device *TheDevice; // Pointer to the device on which this memory lives.		Device *TheDevice; // Pointer to the device on which this memory lives.
const void *Handle; // Platform-dependent value representing allocated memory.		const void *Handle; // Platform-dependent value representing allocated memory.
size_t ByteCount; // Size in bytes of this allocation.		size_t ByteCount; // Size in bytes of this allocation.
};		};

/// Typed wrapper around the "void *"-like GlobalDeviceMemoryBase class.		/// Typed wrapper around the "void *"-like GlobalDeviceMemoryBase class.
///		///
/// For example, GlobalDeviceMemory<int> is a simple wrapper around		/// For example, GlobalDeviceMemory<int> is a simple wrapper around
/// GlobalDeviceMemoryBase that represents a buffer of integers stored in global		/// GlobalDeviceMemoryBase that represents a buffer of integers stored in global
/// device memory.		/// device memory.
▲ Show 20 Lines • Show All 100 Lines • Show Last 20 Lines

streamexecutor/include/streamexecutor/PackedKernelArgumentArray.h

Show First 20 Lines • Show All 158 Lines • ▼ Show 20 Lines	private:

// Pack a normal, non-device-memory argument.		// Pack a normal, non-device-memory argument.
template <typename T> void PackOneArgument(size_t Index, const T &Argument) {		template <typename T> void PackOneArgument(size_t Index, const T &Argument) {
Addresses[Index] = &Argument;		Addresses[Index] = &Argument;
Sizes[Index] = sizeof(T);		Sizes[Index] = sizeof(T);
Types[Index] = KernelArgumentType::VALUE;		Types[Index] = KernelArgumentType::VALUE;
}		}

// Pack a GlobalDeviceMemoryBase argument.
void PackOneArgument(size_t Index, const GlobalDeviceMemoryBase &Argument) {
Addresses[Index] = Argument.getHandle();
Sizes[Index] = sizeof(void *);
Types[Index] = KernelArgumentType::GLOBAL_DEVICE_MEMORY;
}

// Pack a GlobalDeviceMemoryBase pointer argument.
void PackOneArgument(size_t Index, GlobalDeviceMemoryBase *Argument) {
Addresses[Index] = Argument->getHandle();
Sizes[Index] = sizeof(void *);
Types[Index] = KernelArgumentType::GLOBAL_DEVICE_MEMORY;
}

// Pack a const GlobalDeviceMemoryBase pointer argument.
void PackOneArgument(size_t Index, const GlobalDeviceMemoryBase *Argument) {
Addresses[Index] = Argument->getHandle();
Sizes[Index] = sizeof(void *);
Types[Index] = KernelArgumentType::GLOBAL_DEVICE_MEMORY;
}

// Pack a GlobalDeviceMemory<T> argument.		// Pack a GlobalDeviceMemory<T> argument.
template <typename T>		template <typename T>
void PackOneArgument(size_t Index, const GlobalDeviceMemory<T> &Argument) {		void PackOneArgument(size_t Index, const GlobalDeviceMemory<T> &Argument) {
Addresses[Index] = Argument.getHandle();		Addresses[Index] = Argument.getHandleAddress();
Sizes[Index] = sizeof(void *);		Sizes[Index] = sizeof(void *);
Types[Index] = KernelArgumentType::GLOBAL_DEVICE_MEMORY;		Types[Index] = KernelArgumentType::GLOBAL_DEVICE_MEMORY;
}		}

// Pack a GlobalDeviceMemory<T> pointer argument.		// Pack a GlobalDeviceMemory<T> pointer argument.
template <typename T>		template <typename T>
void PackOneArgument(size_t Index, GlobalDeviceMemory<T> *Argument) {		void PackOneArgument(size_t Index, GlobalDeviceMemory<T> *Argument) {
Addresses[Index] = Argument->getHandle();		Addresses[Index] = Argument->getHandleAddress();
Sizes[Index] = sizeof(void *);		Sizes[Index] = sizeof(void *);
Types[Index] = KernelArgumentType::GLOBAL_DEVICE_MEMORY;		Types[Index] = KernelArgumentType::GLOBAL_DEVICE_MEMORY;
}		}

// Pack a const GlobalDeviceMemory<T> pointer argument.		// Pack a const GlobalDeviceMemory<T> pointer argument.
template <typename T>		template <typename T>
void PackOneArgument(size_t Index, const GlobalDeviceMemory<T> *Argument) {		void PackOneArgument(size_t Index, const GlobalDeviceMemory<T> *Argument) {
Addresses[Index] = Argument->getHandle();		Addresses[Index] = Argument->getHandleAddress();
Sizes[Index] = sizeof(void *);		Sizes[Index] = sizeof(void *);
Types[Index] = KernelArgumentType::GLOBAL_DEVICE_MEMORY;		Types[Index] = KernelArgumentType::GLOBAL_DEVICE_MEMORY;
}		}

// Pack a SharedDeviceMemory argument.		// Pack a SharedDeviceMemory argument.
template <typename T>		template <typename T>
void PackOneArgument(size_t Index, const SharedDeviceMemory<T> &Argument) {		void PackOneArgument(size_t Index, const SharedDeviceMemory<T> &Argument) {
++SharedCount;		++SharedCount;
Show All 40 Lines

streamexecutor/include/streamexecutor/Stream.h

Show First 20 Lines • Show All 98 Lines • ▼ Show 20 Lines	public:

/// Entrains onto the stream of operations a kernel launch with the given		/// Entrains onto the stream of operations a kernel launch with the given
/// arguments.		/// arguments.
///		///
/// These arguments can be device memory types like GlobalDeviceMemory<T> and		/// These arguments can be device memory types like GlobalDeviceMemory<T> and
/// SharedDeviceMemory<T>, or they can be primitive types such as int. The		/// SharedDeviceMemory<T>, or they can be primitive types such as int. The
/// allowable argument types are determined by the template parameters to the		/// allowable argument types are determined by the template parameters to the
/// Kernel argument.		/// Kernel argument.
		///
		/// \warning
		/// This function passes the addresses of its \p Arguments to the underlying
		/// platform launcher. If those addresses become invalidated because another
		/// thread touches an argument, this call will fail in strange-looking ways,
		/// so be sure that no other threads are touching the arguments to this
		/// function until it returns.
template <typename... ParameterTs>		template <typename... ParameterTs>
Stream &thenLaunch(BlockDimensions BlockSize, GridDimensions GridSize,		Stream &thenLaunch(BlockDimensions BlockSize, GridDimensions GridSize,
const Kernel<ParameterTs...> &K,		const Kernel<ParameterTs...> &K,
const ParameterTs &... Arguments) {		const ParameterTs &... Arguments) {
auto ArgumentArray =		auto ArgumentArray =
make_kernel_argument_pack<ParameterTs...>(Arguments...);		make_kernel_argument_pack<ParameterTs...>(Arguments...);
setError(PDevice->launch(PlatformStreamHandle, BlockSize, GridSize,		setError(PDevice->launch(PlatformStreamHandle, BlockSize, GridSize,
K.getPlatformHandle(), ArgumentArray));		K.getPlatformHandle(), ArgumentArray));
▲ Show 20 Lines • Show All 199 Lines • Show Last 20 Lines

streamexecutor/unittests/CoreTests/PackedKernelArgumentArrayTest.cpp

Show First 20 Lines • Show All 70 Lines • ▼ Show 20 Lines	TEST_F(DeviceMemoryPackingTest, SingleValue) {
auto Array = se::make_kernel_argument_pack(Value);		auto Array = se::make_kernel_argument_pack(Value);
ExpectEqual(&Value, sizeof(Value), Type::VALUE, Array, 0);		ExpectEqual(&Value, sizeof(Value), Type::VALUE, Array, 0);
EXPECT_EQ(1u, Array.getArgumentCount());		EXPECT_EQ(1u, Array.getArgumentCount());
EXPECT_EQ(0u, Array.getSharedCount());		EXPECT_EQ(0u, Array.getSharedCount());
}		}

TEST_F(DeviceMemoryPackingTest, SingleTypedGlobal) {		TEST_F(DeviceMemoryPackingTest, SingleTypedGlobal) {
auto Array = se::make_kernel_argument_pack(TypedGlobal);		auto Array = se::make_kernel_argument_pack(TypedGlobal);
ExpectEqual(TypedGlobal.getHandle(), sizeof(void *),		ExpectEqual(TypedGlobal.getHandleAddress(), sizeof(void *),
Type::GLOBAL_DEVICE_MEMORY, Array, 0);		Type::GLOBAL_DEVICE_MEMORY, Array, 0);
EXPECT_EQ(1u, Array.getArgumentCount());		EXPECT_EQ(1u, Array.getArgumentCount());
EXPECT_EQ(0u, Array.getSharedCount());		EXPECT_EQ(0u, Array.getSharedCount());
}		}

TEST_F(DeviceMemoryPackingTest, SingleTypedGlobalPointer) {		TEST_F(DeviceMemoryPackingTest, SingleTypedGlobalPointer) {
auto Array = se::make_kernel_argument_pack(&TypedGlobal);		auto Array = se::make_kernel_argument_pack(&TypedGlobal);
ExpectEqual(TypedGlobal.getHandle(), sizeof(void *),		ExpectEqual(TypedGlobal.getHandleAddress(), sizeof(void *),
Type::GLOBAL_DEVICE_MEMORY, Array, 0);		Type::GLOBAL_DEVICE_MEMORY, Array, 0);
EXPECT_EQ(1u, Array.getArgumentCount());		EXPECT_EQ(1u, Array.getArgumentCount());
EXPECT_EQ(0u, Array.getSharedCount());		EXPECT_EQ(0u, Array.getSharedCount());
}		}

TEST_F(DeviceMemoryPackingTest, SingleConstTypedGlobalPointer) {		TEST_F(DeviceMemoryPackingTest, SingleConstTypedGlobalPointer) {
const se::GlobalDeviceMemory<int> *ArgumentPointer = &TypedGlobal;		const se::GlobalDeviceMemory<int> *ArgumentPointer = &TypedGlobal;
auto Array = se::make_kernel_argument_pack(ArgumentPointer);		auto Array = se::make_kernel_argument_pack(ArgumentPointer);
ExpectEqual(TypedGlobal.getHandle(), sizeof(void *),		ExpectEqual(TypedGlobal.getHandleAddress(), sizeof(void *),
Type::GLOBAL_DEVICE_MEMORY, Array, 0);		Type::GLOBAL_DEVICE_MEMORY, Array, 0);
EXPECT_EQ(1u, Array.getArgumentCount());		EXPECT_EQ(1u, Array.getArgumentCount());
EXPECT_EQ(0u, Array.getSharedCount());		EXPECT_EQ(0u, Array.getSharedCount());
}		}

TEST_F(DeviceMemoryPackingTest, SingleTypedShared) {		TEST_F(DeviceMemoryPackingTest, SingleTypedShared) {
auto Array = se::make_kernel_argument_pack(TypedShared);		auto Array = se::make_kernel_argument_pack(TypedShared);
ExpectEqual(nullptr, TypedShared.getByteCount(), Type::SHARED_DEVICE_MEMORY,		ExpectEqual(nullptr, TypedShared.getByteCount(), Type::SHARED_DEVICE_MEMORY,
Show All 21 Lines

TEST_F(DeviceMemoryPackingTest, PackSeveralArguments) {		TEST_F(DeviceMemoryPackingTest, PackSeveralArguments) {
const se::GlobalDeviceMemory<int> *TypedGlobalPointer = &TypedGlobal;		const se::GlobalDeviceMemory<int> *TypedGlobalPointer = &TypedGlobal;
const se::SharedDeviceMemory<int> *TypedSharedPointer = &TypedShared;		const se::SharedDeviceMemory<int> *TypedSharedPointer = &TypedShared;
auto Array = se::make_kernel_argument_pack(Value, TypedGlobal, &TypedGlobal,		auto Array = se::make_kernel_argument_pack(Value, TypedGlobal, &TypedGlobal,
TypedGlobalPointer, TypedShared,		TypedGlobalPointer, TypedShared,
&TypedShared, TypedSharedPointer);		&TypedShared, TypedSharedPointer);
ExpectEqual(&Value, sizeof(Value), Type::VALUE, Array, 0);		ExpectEqual(&Value, sizeof(Value), Type::VALUE, Array, 0);
ExpectEqual(TypedGlobal.getHandle(), sizeof(void *),		ExpectEqual(TypedGlobal.getHandleAddress(), sizeof(void *),
Type::GLOBAL_DEVICE_MEMORY, Array, 1);		Type::GLOBAL_DEVICE_MEMORY, Array, 1);
ExpectEqual(TypedGlobal.getHandle(), sizeof(void *),		ExpectEqual(TypedGlobal.getHandleAddress(), sizeof(void *),
Type::GLOBAL_DEVICE_MEMORY, Array, 2);		Type::GLOBAL_DEVICE_MEMORY, Array, 2);
ExpectEqual(TypedGlobal.getHandle(), sizeof(void *),		ExpectEqual(TypedGlobal.getHandleAddress(), sizeof(void *),
Type::GLOBAL_DEVICE_MEMORY, Array, 3);		Type::GLOBAL_DEVICE_MEMORY, Array, 3);
ExpectEqual(nullptr, TypedShared.getByteCount(), Type::SHARED_DEVICE_MEMORY,		ExpectEqual(nullptr, TypedShared.getByteCount(), Type::SHARED_DEVICE_MEMORY,
Array, 4);		Array, 4);
ExpectEqual(nullptr, TypedShared.getByteCount(), Type::SHARED_DEVICE_MEMORY,		ExpectEqual(nullptr, TypedShared.getByteCount(), Type::SHARED_DEVICE_MEMORY,
Array, 5);		Array, 5);
ExpectEqual(nullptr, TypedShared.getByteCount(), Type::SHARED_DEVICE_MEMORY,		ExpectEqual(nullptr, TypedShared.getByteCount(), Type::SHARED_DEVICE_MEMORY,
Array, 6);		Array, 6);
EXPECT_EQ(7u, Array.getArgumentCount());		EXPECT_EQ(7u, Array.getArgumentCount());
EXPECT_EQ(3u, Array.getSharedCount());		EXPECT_EQ(3u, Array.getSharedCount());
}		}

} // namespace		} // namespace