This is an archive of the discontinued LLVM Phabricator instance.

General question: if we have variants of these memcpy methods that take ElementCount parameters to allow for partial copies to/from the device allocations, should we also have variants with an Offset parameter as well to allow for partial copies that don't start at the origin?

streamexecutor/include/streamexecutor/Executor.h
71	If this is going to be backing onto something like `cuMemHostRegister`, don't we need the size of the allocation as well?
144	memory
streamexecutor/include/streamexecutor/PlatformInterfaces.h
146	registerHostMemory

jlebar added inline comments. Aug 16 2016, 2:09 PM

streamexecutor/include/streamexecutor/Executor.h
48	Should we take this by reference?
110	I wonder if we should assert that the two arrays' sizes are the same, in this case. Same for the H2D function.

Respond to jprice's comments

In D23577#517138, @jprice wrote:

General question: if we have variants of these memcpy methods that take ElementCount parameters to allow for partial copies to/from the device allocations, should we also have variants with an Offset parameter as well to allow for partial copies that don't start at the origin?

Yes, I think allowing an offset argument is a good idea. I made that change to the Executor and Stream memcpy methods.
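
As a rough illustration of what an offset-taking partial copy has to validate, here is a minimal host-only sketch. `FakeDeviceMemory` and `copyD2H` are hypothetical names made up for this example (a `std::vector` stands in for the device allocation); they are not the StreamExecutor API, and errors are reported as strings rather than `llvm::Error` to keep the sketch self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a device allocation: a host vector plays the role
// of device memory so that the sketch can run anywhere.
template <typename T> struct FakeDeviceMemory {
  std::vector<T> Storage;
  std::size_t getElementCount() const { return Storage.size(); }
};

// Copies ElementCount elements, starting SrcElementOffset elements into
// DeviceSrc, to HostDst. Returns an empty string on success and a message on
// failure (assumption: the real interface would return an Error instead).
template <typename T>
std::string copyD2H(const FakeDeviceMemory<T> &DeviceSrc, T *HostDst,
                    std::size_t HostDstCount, std::size_t ElementCount,
                    std::size_t SrcElementOffset = 0) {
  const std::size_t SrcCount = DeviceSrc.getElementCount();
  // Written as a subtraction so that offset + count cannot overflow.
  if (SrcElementOffset > SrcCount ||
      ElementCount > SrcCount - SrcElementOffset)
    return "copy range exceeds the device source";
  if (ElementCount > HostDstCount)
    return "copy range exceeds the host destination";
  std::copy_n(DeviceSrc.Storage.data() + SrcElementOffset, ElementCount,
              HostDst);
  return std::string();
}
```

The offset defaults to zero, so callers that copy from the origin do not have to mention it.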

streamexecutor/include/streamexecutor/Executor.h
48	The internal SE code uses the convention that a `DeviceMemory<T>*` is passed wherever a `T*` would be passed for host data and a `const DeviceMemory<T>&` would be passed wherever a `const T*` would be used for host data. Do you think it would be better to pass `DeviceMemory<T>&` for `T*` and `const DeviceMemory<T>&` for `const T*`, or something like that?
71	You're totally right. Thanks for catching that.
110	To me that feels too restrictive. Without that check do you think it would be better to rename the method? Maybe synchronousMemcpyWholeSrc or something. I like it how it is now, but I don't want the interface to be confusing.

jlebar added inline comments. Aug 16 2016, 3:11 PM

streamexecutor/include/streamexecutor/Executor.h
48	Do you think it would be better to pass DeviceMemory<T>& for T* and const DeviceMemory<T>& for const T*, or something like that? Yes, I think so, since a DeviceMemory<T> is essentially a T*. Or put another way, when you allocate host memory, you get a T*, and when you allocate device memory, you get a DeviceMemory<T>.
109	Hm, maybe we should match the arg order of memcpy, given that we're using that name.
110	To me that feels too restrictive. I guess it's a question of what we're optimizing for. It seems to me that much of the time, the dst will be of the same size as the src. In such cases, you'd want to know if you accidentally mismatched the sizes, because that's going to leave part of your dst uninitialized. If you really want to copy into only part of dst, you can make that explicit by creating a new array slice: vector<int> dst(100); synchronousMemcpyD2H(src, makeMutableArrayRef(dst.data(), DeviceSrc.getElementCount())); (makeMutableArrayRef doesn't exist at the moment, but that's easily fixed...) Or alternatively, you could just call the other overload: vector<int> dst(100); synchronousMemcpyD2H(src, dst, DeviceSrc.getElementCount()); Both cases are explicit that we're doing something other than simply copying everything from src to dst.
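
The trade-off discussed above (a strict whole-array copy next to an explicit-count overload) might look roughly like the following host-only sketch. `copyAll` and `copyPrefix` are hypothetical names chosen for the example, and plain vectors stand in for both the device and host sides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Three-argument form: the caller states how many elements to copy, so a
// destination that is larger than the copy is visibly intentional.
template <typename T>
bool copyPrefix(const std::vector<T> &Src, std::vector<T> &Dst,
                std::size_t ElementCount) {
  if (ElementCount > Src.size() || ElementCount > Dst.size())
    return false;
  std::copy_n(Src.begin(), ElementCount, Dst.begin());
  return true;
}

// Two-argument form: copies everything and rejects mismatched sizes, so an
// accidental mismatch cannot silently leave part of Dst unwritten.
template <typename T>
bool copyAll(const std::vector<T> &Src, std::vector<T> &Dst) {
  if (Src.size() != Dst.size())
    return false;
  return copyPrefix(Src, Dst, Src.size());
}
```

With this split, copying into only part of a larger destination still works, but the call site has to say so by passing the count.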

jprice added inline comments. Aug 16 2016, 3:43 PM

streamexecutor/include/streamexecutor/Executor.h
102	Should be passing `SrcElementOffset * sizeof(T)` through to the platform executor as well now.
135	Missing offset argument as above.
166	Missing offset argument as above.
streamexecutor/include/streamexecutor/Stream.h
122–124	Missing offset argument as above.
163–165	Missing offset argument as above.
202–205	Missing offset argument as above.

Add missing offset args in raw memcpys

jhen marked 6 inline comments as done. Aug 16 2016, 3:50 PM

jhen added inline comments.

streamexecutor/include/streamexecutor/Executor.h
102	Good catch! I promise to write a bunch of unit tests to catch this kind of thing once we have the interface decided.

Reverse src/dst memcpy arg order

streamexecutor/include/streamexecutor/Executor.h
48	The thing I would really miss from the current interface is the way the user has to specify a `DeviceMemory` as a copy destination by taking its address before passing it as an argument to the copy function. I like how this makes it hard for the user to accidentally pass the source as the destination and vice versa. Is there a good way to keep this nice feature and match the model of `DeviceMemory` as a pointer more closely?
110	Okay, that sounds like a good reason to me.

jlebar added inline comments. Aug 16 2016, 5:41 PM

streamexecutor/include/streamexecutor/Executor.h
48	Is there a good way to keep this nice feature and match the model of DeviceMemory as a pointer more closely? const-ness will get us some of the way there. I could imagine designing an API that would let you do MutableArrayRef<int> dst(...); dst = CopyFromDevice(src); But, in addition to requiring some substantial magic, that wouldn't work with our `.then` chaining.

I had a thought about MemcpyD2H and friends last night. Not to make you change it again, but it's kind of weird that the function name says "D2H" but the args are (host_ptr, device_ptr). So long as it's called "memcpy", I think we probably should match memcpy's ordering, but what if we just called it "copy"?

That might somewhat speak to your concern about users getting the order wrong.

In D23577#518094, @jlebar wrote:

I had a thought about MemcpyD2H and friends last night. Not to make you change it again, but it's kind of weird that the function name says "D2H" but the args are (host_ptr, device_ptr). So long as it's called "memcpy", I think we probably should match memcpy's ordering, but what if we just called it "copy"?

That might somewhat speak to your concern about users getting the order wrong.

Yes, the D2H, etc names were the reason for the original (src, dst) argument ordering, so I think there is a potential for confusion with the current ordering.

For an alternative solution that really helps the user know what they are doing, how about introducing different wrapper classes for src and dst and requiring the user to wrap the argument before passing it to the copy method:

// Wrapper classes.
template <typename T> class CopySrc;
template <typename T> class CopyDst;

// Copy function decl.
template <typename T>
Error synchronousCopy(CopyDst<T> Dst, CopySrc<T> Src, size_t ElementCount);

// Helper functions "to" and "from" to construct wrapper instances.
template <typename T> CopySrc<T> from(const GlobalDeviceMemory<T> &DeviceSrc, size_t Offset = 0);
template <typename T> CopySrc<T> from(const T *HostSrc, size_t Offset = 0);
template <typename T> CopyDst<T> to(GlobalDeviceMemory<T> DeviceDst, size_t Offset = 0);
template <typename T> CopyDst<T> to(T *HostDst, size_t Offset = 0);

// User code that calls a copy.
Executor.synchronousCopy(to(DeviceMemory, DeviceElementOffset), from(HostPtr), ElementCount);

This seems to me like a pretty foolproof interface, and I like that it allows the user to specify the offset along with the pointer only if an offset is desired.

What do you think?
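
To make the wrapper idea concrete, here is a small host-only sketch of how `to`/`from` could make the copy direction part of the argument types. It is only an illustration under simplifying assumptions: the wrappers hold raw pointers, there are no device overloads, and everything lives in a made-up `sketch` namespace.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

namespace sketch {

// Distinct wrapper types for the two roles, so the roles cannot be swapped
// without a compile error.
template <typename T> struct CopySrc { const T *Ptr; };
template <typename T> struct CopyDst { T *Ptr; };

// Helpers that build the wrappers; the offset is optional and in elements.
template <typename T> CopySrc<T> from(const T *Src, std::size_t Offset = 0) {
  return {Src + Offset};
}
template <typename T> CopyDst<T> to(T *Dst, std::size_t Offset = 0) {
  return {Dst + Offset};
}

// The copy function only accepts (destination, source) in that order.
// No bounds checking here; a real interface would need it.
template <typename T>
void synchronousCopy(CopyDst<T> Dst, CopySrc<T> Src, std::size_t ElementCount) {
  std::copy_n(Src.Ptr, ElementCount, Dst.Ptr);
}

} // namespace sketch
```

A call such as `sketch::synchronousCopy(sketch::from(A), sketch::to(B), N)` does not compile, because `CopySrc<T>` cannot be matched against the `CopyDst<T>` parameter.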

Device memcpy uses DeviceArrayRef arg

The to and from functions ended up not being so elegant because ADL doesn't happen for the host arguments (llvm::ArrayRef, llvm::MutableArrayRef). So, instead, I made two classes just to wrap the device arrays: DeviceArrayRef and DeviceMutableArrayRef. I think this makes a nice symmetry between host and device and makes the user be explicit about which argument is src and which is dst.

The one thing I don't like is that the DeviceArrayRef constructor takes a size_t offset as its second argument, and this might get confusing because the llvm::ArrayRef takes a size_t size as its second argument.

Once we get this synchronous memcpy interface the way we like it, I'll also fix the other functions like free and asynchronous copies.

I'm not crazy about how verbose simple operations become with the current
revision.

Just trying to play with this API:

se::GlobalDeviceMemory<int> DMem;
int *HPtr;
MutableArrayRef<int> HArrRef;

// These two operations seem like they're going to be the most common:

// Copy everything, assert sizes equal.
E.syncCopyD2H(DMem, HArrRef);

// Up to user to ensure that HPtr is big enough.  (We could leave off "length"
// and copy everything, but since we're passing a raw pointer here, seems like
// we should be explicit about the length.)
E.syncCopyD2H(DMem, HPtr, length);

// These two together suggest that we should have a third overload:

// Copy length elements, assert sizes are less than or equal to length.
E.syncCopyD2H(DMem, HArrRef, length);

These three are probably sufficient for everything other than copying starting
at a nonzero offset within DMem? I guess that's where makeMutableArrayRef
comes in. But since this is going to be an uncommon operation, I don't think
we want to uglify all of the other call sites to make this work.

What if we called our class GlobalDeviceMemorySlice<T>. Not calling it
"ArrayRef" would leave us free to modify its API somewhat:

We could have an implicit constructor from GlobalDeviceMemory<T>, so we don't have to explicitly call makeMutableArrayRef.

We wouldn't need mutable and non-mutable versions of this; we could just use const to differentiate.

If it makes sense, we could even make it so that the way to construct a GlobalDeviceMemorySlice from a GlobalDeviceMemory is by calling functions on GlobalDeviceMemory. This would resolve the ambiguity around offset vs. length. Here's a not well-thought-through API:

GlobalDeviceMemory<int> DMem; E.syncCopyD2H(DMem.drop_front(5).drop_back(42).take_first(11));

I suspect that simply copying slice(), drop_front(), and drop_back() from
ArrayRef would be pretty good.
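
For reference, a slice type that borrows `slice()`, `drop_front()`, and `drop_back()` from `ArrayRef` could be sketched as below. `MemorySlice` is a hypothetical name, it wraps a host pointer instead of a device allocation, and the accessor names are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>

// A non-owning view of Count elements starting Offset elements into Base.
template <typename T> class MemorySlice {
public:
  MemorySlice(T *Base, std::size_t ElementCount)
      : Base(Base), Offset(0), Count(ElementCount) {}

  std::size_t getElementOffset() const { return Offset; }
  std::size_t getElementCount() const { return Count; }
  T *data() const { return Base + Offset; }

  // Skips DropCount elements, then keeps TakeCount elements.
  MemorySlice slice(std::size_t DropCount, std::size_t TakeCount) const {
    assert(DropCount <= Count && TakeCount <= Count - DropCount &&
           "slice out of range");
    MemorySlice Result(*this);
    Result.Offset += DropCount;
    Result.Count = TakeCount;
    return Result;
  }

  MemorySlice drop_front(std::size_t N = 1) const {
    assert(N <= Count && "dropping more elements than exist");
    return slice(N, Count - N);
  }

  MemorySlice drop_back(std::size_t N = 1) const {
    assert(N <= Count && "dropping more elements than exist");
    return slice(0, Count - N);
  }

private:
  T *Base;
  std::size_t Offset;
  std::size_t Count;
};
```

Because every operation returns a new slice, calls chain naturally, and the offset versus length ambiguity is settled by the method name instead of by argument position.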

Many overloads and DeviceArraySlice

jlebar's last design suggestion seems reasonable to me. I implemented it (just for the D2H direction for now). If people like it, I will do the other directions too (H2D, D2D) and write a bunch of tests.

jlebar added inline comments. Aug 18 2016, 1:44 PM

streamexecutor/include/streamexecutor/DeviceMemory.h
101 ↗	(On Diff #68471)	If this must wrap specifically a GlobalDeviceMemory object, we should probably have those words in the name. Maybe "GlobalDeviceMemorySlice"? Or is that too long? It also seems kind of weird to me to focus on "arrays" in the name, function names, and comments, since it's no more or less of an array than GlobalDeviceMemory, but we don't name/describe it in the same way.
106 ↗	(On Diff #68471)	We should assert that these are in range.
110 ↗	(On Diff #68471)	I think getOffset() and getCount() might be better names, but we should definitely be consistent with the names in GlobalDeviceMemory, as you've done here.
116 ↗	(On Diff #68471)	We should think about what exactly we want to be an error in these functions. I'm not sure that silently accepting out-of-range values makes sense.
131 ↗	(On Diff #68471)	Hm... Maybe ArrayRef's two-arg slice() function is more useful than take_first(). take_first(n) then just becomes slice(0, n), which is fairly clear I think? And I think the common operation will likely be "starting at offset k, take n elements", which is not trivial to do using the functions here. I don't think we need both the one-arg slice() and drop_front, as they're equivalent. Sorry if I misled you on this one. In that example, I meant just to illustrate how we'd chain calls together, not to suggest specifically the API. But I don't think I was so clear about that.
192 ↗	(On Diff #68471)	We could avoid having these functions entirely by having an asSlice() function, but I'm not sure that's better than this API. However, it would be nice not to reimplement these tricky functions. Could we just do e.g. return DeviceArraySlice<ElemT>(*this).drop_front(DropCount); ?
streamexecutor/include/streamexecutor/Executor.h
156	My thought is that we'd have an implicit one-arg constructor in DeviceArraySlice<T> so we wouldn't need this overload. DeviceArraySlice<T>::DeviceArraySlice<T>(const GlobalDeviceMemory<T>& Mem); (In the same way, llvm::ArrayRef has an implicit one-arg constructor that accepts an std::vector<T>.)

Respond to jlebar's comments

streamexecutor/include/streamexecutor/DeviceMemory.h
110 ↗	(On Diff #68471)	I renamed them to `getOffset()` and `getCount()`. I also renamed the `GlobalDeviceMemory<T>` method to `getCount()` for consistency.
116 ↗	(On Diff #68471)	ArrayRef uses assert, so I did that here too.
192 ↗	(On Diff #68471)	`asSlice()` sounds good to me. I put that in and removed all of these.
streamexecutor/include/streamexecutor/Executor.h
156	I tried this and unfortunately it doesn't seem to work perfectly. I think the template type inference is getting in the way of selecting the implicit constructor. For example, I have to call `Executor.syncCopyD2H<int>(DeviceA, llvm::MutableArrayRef<int>(Host))` in my test code, where I explicitly specify the function template argument for the `syncCopyD2H` function. Otherwise, I get a compile error that says it ignored the candidate template because it could not match `GlobalDeviceMemorySlice` against `GlobalDeviceMemory`. I removed the `syncCopyD2H` overloads for `GlobalDeviceMemory` for now, so that users have to explicitly supply the template argument or call `asSlice()`. Should I add back the overloads in order to get around this problem?
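
The deduction problem described above can be reproduced in a few lines. Template argument deduction does not consider user-defined conversions, so a function template taking `Slice<T>` cannot deduce `T` from a `Memory<T>` argument even when an implicit converting constructor exists. The names below (`Memory`, `Slice`, `countElements`) are hypothetical stand-ins, not the StreamExecutor types.

```cpp
#include <cassert>
#include <cstddef>

template <typename T> struct Memory {
  T *Ptr;
  std::size_t Count;
};

template <typename T> struct Slice {
  // Implicit converting constructor, analogous to the one proposed for the
  // slice type in the review.
  Slice(const Memory<T> &M) : Ptr(M.Ptr), Count(M.Count) {}
  T *Ptr;
  std::size_t Count;
};

// Base overload. On its own, countElements(Mem) with a Memory<int> argument
// fails to compile: T cannot be deduced through the implicit conversion.
template <typename T> std::size_t countElements(Slice<T> S) { return S.Count; }

// Forwarding overload: deduces T from Memory<T> directly, then converts
// explicitly and calls the base overload.
template <typename T> std::size_t countElements(const Memory<T> &M) {
  return countElements(Slice<T>(M));
}
```

Without the forwarding overload, callers would have to write `countElements<int>(Mem)` or build the slice themselves, which matches the two workarounds mentioned in the comment.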

I think this is probably good, modulo a few relatively minor things.

streamexecutor/include/streamexecutor/DeviceMemory.h
117 ↗	(On Diff #68721)	Nit, maybe we can reword this so it doesn't just repeat the function's name.
120 ↗	(On Diff #68721)	This is kind of confusing to me, I think because "element offset from base memory" isn't so idiomatic. Maybe we can reword this?
148 ↗	(On Diff #68721)	Why do we do std::min here instead of asserting?
streamexecutor/include/streamexecutor/Executor.h
100	Ohhh. This is why they're called getElementOffset and getElementCount. I guess it's not so bad if we don't have any data structure that exposes the byte offset / byte count. But even if not I find this a kind of compelling reason for those names. What do you think?
122	Maybe we can just create an arrayref and call one of the other overloads?
139	typo
141	I might actually suggest making this function the base overload; it seems like the most strongly-typed and general?
205	Should I add back the overloads in order to get around this problem? Argh. That's the common case, so I guess, yes. I guess you only need it for the two-arg overloads and not the three-arg overload. :-/

Respond to jlebar's comments 2

streamexecutor/include/streamexecutor/DeviceMemory.h
148 ↗	(On Diff #68721)	I missed that one (and it was buggy as it was anyway). Thanks for catching that!
streamexecutor/include/streamexecutor/Executor.h
100	Yes, I definitely prefer the `getElementOffset` and `getElementCount` names. Sorry about the confusion. I've changed all the names back now to include "Element".
205	I put the overloads back. I couldn't even get the one that took `T*` as its second parameter to work without the overload, so I had to put them all back. I even tried to define a conversion operator for `GlobalDeviceMemory<T>` to `GlobalDeviceMemorySlice<T>`, but that didn't work either.

jlebar added inline comments. Aug 22 2016, 11:41 AM

streamexecutor/include/streamexecutor/Executor.h
177	Well, this is a lot of overloads, but I guess it could be worse. :) My only thought is that it would be nice not to duplicate so much of the comments. But other than that, I think you can go forth and do this for real.

Add H2D and D2D functions and tests

Memory "free" no longer takes wrapper pointer

Okay, I've got all the synchronousCopy functions in for all directions (D2H, H2D, D2D) and all the tests are in, so I think this CL is ready for final review.

streamexecutor/include/streamexecutor/Executor.h
177	I got rid of most of the comment duplication by referencing earlier functions in the comments for later functions.

jlebar added inline comments. Aug 22 2016, 10:03 PM

streamexecutor/include/streamexecutor/Executor.h
93	Nit, can we make these two inequalities have the same direction?
streamexecutor/include/streamexecutor/PlatformInterfaces.h
160	Should we specify the device synchronicity semantics for these calls (and also in Executor, I guess)? The non-"synchronous" functions clearly are synchronous within a stream. But for these I have no idea what they're supposed to do.
streamexecutor/include/streamexecutor/Stream.h
111	I guess I expected that we'd match the Executor's interface, with the overloads and such. But maybe you plan to do that in a separate patch?
177	comma placement

Update Stream copy interface

streamexecutor/include/streamexecutor/Stream.h
111	I was thinking of doing it in a separate patch, but changed my mind. Now these interfaces should be updated too.

Just some mild pedantry from me.

streamexecutor/include/streamexecutor/Executor.h
199	H2D
284	D2D
streamexecutor/include/streamexecutor/Stream.h
178	H2D
215	D2D

Fix method names in error strings

In D23577#523434, @jprice wrote:

Just some mild pedantry from me.

:) Thanks for your help

I think this is officially unwieldy. Which I accept the blame for. :)

It looks good to me, and I think if we want to change things, it will probably be a lot easier to do so once this is in.

streamexecutor/include/streamexecutor/PlatformInterfaces.h
161	Ah, but there can be no device work enqueued after the call starts but before it completes because this is host-synchronous! (Unless you can touch Streams on multiple host threads?) So maybe the thing to say is simply that this does not block any ongoing device calls?
streamexecutor/include/streamexecutor/Stream.h
111	may
284	This review is getting so long I'm tempted to say it will be easier to check this in as-is and edit things like this later. But I think it would be nice to consider whether there's a way to reduce the comment copypasta -- there's a lot of it, and I think it obscures the API. I think in an earlier patch you did some doxygen "function group" magic? Again, maybe better to save this for later.

This revision is now accepted and ready to land. Aug 23 2016, 10:44 PM

Clarify device synchronicity claims

In D23577#523872, @jlebar wrote:

I think this is officially unwieldy. Which I accept the blame for. :)

Yeah, it might be a bit of a maintenance problem, but I do think it will be much nicer for the user to use, so I think it's been an overall improvement.

streamexecutor/include/streamexecutor/Stream.h
284	I definitely think I can cut down the comments a lot and make the API more understandable in the process. As you suggested, I'll check this in now and then do that in the next CL.

Closed by commit rL279640: [StreamExecutor] Executor add synchronous methods (authored by jhen). Aug 24 2016, 10:06 AM

This revision was automatically updated to reflect the committed changes.

Not sure what the protocol is for finding issues in patches that have already been committed - should I be posting to parallel_libs-dev instead?

This issue only showed up when I tried to actually write some code that uses all this stuff.

parallel-libs/trunk/streamexecutor/include/streamexecutor/Executor.h
44 ↗	(On Diff #69133)	There's a mismatch between the return types of these two `allocateDeviceMemory` methods. The `PlatformExecutor` version returns a `GlobalDeviceMemoryBase`, whereas this one tries to return a `GlobalDeviceMemory<T>`. This causes compilation failure if this `Executor::allocateDeviceMemory<T>()` method is actually used. Because this method is templated, it needs to actually be called somewhere for the issue to show up, and it doesn't appear to be covered in the unit tests.
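
The reason this kind of mismatch can go unnoticed is that the body of a member function template is only fully checked when it is instantiated. The sketch below shows one possible shape of a fix, under the assumption that the typed handle can be built explicitly from the untyped one. All names (`UntypedMemory`, `TypedMemory`, `FakePlatform`, `FakeExecutor`) are hypothetical and the `Expected<>` wrapper is left out for brevity.

```cpp
#include <cassert>
#include <cstddef>

// Untyped handle, as a platform layer might return it.
struct UntypedMemory {
  void *Handle;
  std::size_t ByteCount;
};

// Typed wrapper; the explicit constructor means an UntypedMemory never
// converts to it silently.
template <typename T> class TypedMemory {
public:
  explicit TypedMemory(UntypedMemory Base) : Base(Base) {}
  std::size_t getElementCount() const { return Base.ByteCount / sizeof(T); }

private:
  UntypedMemory Base;
};

struct FakePlatform {
  // Pretend allocation: records the size without touching real memory.
  UntypedMemory allocate(std::size_t ByteCount) {
    return UntypedMemory{nullptr, ByteCount};
  }
};

class FakeExecutor {
public:
  template <typename T> TypedMemory<T> allocate(std::size_t ElementCount) {
    // Writing `return Platform.allocate(...);` here would be ill-formed, but
    // compilers typically report it only once allocate<T> is instantiated.
    return TypedMemory<T>(Platform.allocate(ElementCount * sizeof(T)));
  }

private:
  FakePlatform Platform;
};
```

A unit test that calls the template at least once, for example `FakeExecutor().allocate<int>(8)`, is enough to force the instantiation and surface such an error.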

Revision Contents

Path	Size
streamexecutor/
include/
streamexecutor/
Executor.h	160 lines
PlatformInterfaces.h	96 lines
Stream.h	101 lines
Utils/
Error.h	4 lines
lib/
Utils/
Error.cpp	4 lines
unittests/
CMakeLists.txt	10 lines
ExecutorTest.cpp	140 lines
StreamTest.cpp	42 lines

Diff 68289

streamexecutor/include/streamexecutor/Executor.h

	/// The Executor class which represents a single device of a specific platform.			/// The Executor class which represents a single device of a specific platform.
	///			///
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef STREAMEXECUTOR_EXECUTOR_H			#ifndef STREAMEXECUTOR_EXECUTOR_H
	#define STREAMEXECUTOR_EXECUTOR_H			#define STREAMEXECUTOR_EXECUTOR_H

	#include "streamexecutor/KernelSpec.h"			#include "streamexecutor/KernelSpec.h"
				#include "streamexecutor/PlatformInterfaces.h"
	#include "streamexecutor/Utils/Error.h"			#include "streamexecutor/Utils/Error.h"

	namespace streamexecutor {			namespace streamexecutor {

	class KernelInterface;			class KernelInterface;
	class PlatformExecutor;
	class Stream;			class Stream;

	class Executor {			class Executor {
	public:			public:
	explicit Executor(PlatformExecutor *PExecutor);			explicit Executor(PlatformExecutor *PExecutor);
	virtual ~Executor();			virtual ~Executor();

	/// Gets the kernel implementation for the underlying platform.			/// Gets the kernel implementation for the underlying platform.
	virtual Expected<std::unique_ptr<KernelInterface>>			virtual Expected<std::unique_ptr<KernelInterface>>
	getKernelImplementation(const MultiKernelLoaderSpec &Spec) {			getKernelImplementation(const MultiKernelLoaderSpec &Spec) {
	// TODO(jhen): Implement this.			// TODO(jhen): Implement this.
	return nullptr;			return nullptr;
	}			}

	Expected<std::unique_ptr<Stream>> createStream();			Expected<std::unique_ptr<Stream>> createStream();

				/// Allocates an array of ElementCount entries of type T in device memory.
				template <typename T>
				Expected<GlobalDeviceMemory<T>> allocateDeviceMemory(size_t ElementCount) {
				return PExecutor->allocateDeviceMemory(ElementCount * sizeof(T));
				}

				/// Frees memory previously allocated with allocateDeviceMemory.
				template <typename T> Error freeDeviceMemory(GlobalDeviceMemory<T> *Memory) {
				return PExecutor->freeDeviceMemory(Memory);
				}

				/// Allocates an array of ElementCount entries of type T in host memory.
				///
				/// Host memory allocated by this function can be used for asynchronous memory
				/// copies on streams. See Stream::thenMemcpyD2H and Stream::thenMemcpyH2D.
				template <typename T> Expected<T *> allocateHostMemory(size_t ElementCount) {
				return PExecutor->allocateHostMemory(ElementCount * sizeof(T));
				}

				/// Frees memory previously allocated with allocateHostMemory.
				template <typename T> Error freeHostMemory(T *Memory) {
				return PExecutor->freeHostMemory(Memory);
				}

				/// Registers a previously allocated host array of type T for asynchronous
				/// memory operations.
				///
				/// Host memory registered by this function can be used for asynchronous
				/// memory copies on streams. See Stream::thenMemcpyD2H and
				/// Stream::thenMemcpyH2D.
				template <typename T>
				Error registerHostMemory(T *Memory, size_t ElementCount) {
				return PExecutor->registerHostMemory(Memory, ElementCount * sizeof(T));
				}

				/// Unregisters host memory previously registered by registerHostMemory.
				template <typename T> Error unregisterHostMemory(T *Memory) {
				return PExecutor->unregisterHostMemory(Memory);
				}

				/// Host-synchronously copies a slice of an array of elements of type T from
				/// device memory to host memory.
				///
				/// The calling host thread is blocked until the copy completes. Can be used
				/// with any host memory, the host memory does not have to be allocated with
				/// allocateHostMemory or registered with registerHostMemory.
				template <typename T>
				Error synchronousMemcpyD2H(llvm::MutableArrayRef<T> HostDst,
				const GlobalDeviceMemory<T> &DeviceSrc,
				size_t ElementCount, size_t SrcElementOffset = 0) {
				if (SrcElementOffset + ElementCount > DeviceSrc.getElementCount())
				return make_error("copying too many elements, " +
				llvm::Twine(ElementCount) + ", at element offset " +
				llvm::Twine(SrcElementOffset) +
				" from device memory array of element count " +
				llvm::Twine(DeviceSrc.getElementCount()));
				else if (ElementCount > HostDst.size())
				return make_error(
				"copying too many elements, " + llvm::Twine(ElementCount) +
				", to host array of element count " + llvm::Twine(HostDst.size()));
				return PExecutor->synchronousMemcpyD2H(HostDst.data(), DeviceSrc,
				ElementCount * sizeof(T),
				SrcElementOffset * sizeof(T));
				}

				/// Just like synchronousMemcpyD2H above, but copies the entire source array
				/// to the destination.
				template <typename T>
				Error synchronousMemcpyD2H(llvm::MutableArrayRef<T> HostDst,
				const GlobalDeviceMemory<T> &DeviceSrc) {
				if (DeviceSrc.getElementCount() != HostDst.size()) {
				return make_error("synchronousMemcpyD2H device source element count, " +
				llvm::Twine(DeviceSrc.getElementCount()) +
				" ,does not match host destination element count, " +
				llvm::Twine(HostDst.size()));
				}
				return synchronousMemcpyD2H(HostDst, DeviceSrc,
				DeviceSrc.getElementCount());
				}

				/// Host-synchronously copies a slice of an array of elements of type T from
				/// host memory to device memory.
				///
				/// The calling host thread is blocked until the copy completes. Can be used
				/// with any host memory, the host memory does not have to be allocated with
				/// allocateHostMemory or registered with registerHostMemory.
				template <typename T>
				Error synchronousMemcpyH2D(GlobalDeviceMemory<T> *DeviceDst,
				llvm::ArrayRef<T> HostSrc, size_t ElementCount,
				size_t DstElementOffset = 0) {
				if (ElementCount > HostSrc.size())
				return make_error(
				"copying too many elements, " + llvm::Twine(ElementCount) +
				", from host array of element count " + llvm::Twine(HostSrc.size()));
				else if (DstElementOffset + ElementCount > DeviceDst->getElementCount())
				jprice: Missing offset argument as above.
				return make_error("copying too many elements, " +
				llvm::Twine(ElementCount) +
				", to device memory array of element count " +
				llvm::Twine(DeviceDst->getElementCount()) +
				jlebar: typo
				" at element offset " + llvm::Twine(DstElementOffset));
				return PExecutor->synchronousMemcpyH2D(DeviceDst, HostSrc.data(),
				jlebar: I might actually suggest making this function the base overload; it seems like the most strongly-typed and general?
				ElementCount * sizeof(T),
				DstElementOffset * sizeof(T));
				}
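
				The offset handling in the overload above follows one pattern: validate in element units, then convert to byte counts for the platform layer. A minimal standalone sketch of that pattern follows. `rangeFits` and `copyToOffset` are hypothetical names, raw host pointers stand in for device memory, and the subtraction-based range check is a variation that avoids wrap-around in `Offset + Count`; the patch itself uses the additive form.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <type_traits>

// Hypothetical helper: true when [Offset, Offset + Count) lies inside an
// array of Size elements. Written with subtraction so the sum cannot wrap.
inline bool rangeFits(std::size_t Offset, std::size_t Count,
                      std::size_t Size) {
  return Offset <= Size && Count <= Size - Offset;
}

// Copies Count elements from Src to Dst starting at element DstOffset.
// Returns an empty string on success, otherwise an error message.
template <typename T>
std::string copyToOffset(T *Dst, std::size_t DstSize, const T *Src,
                         std::size_t SrcSize, std::size_t Count,
                         std::size_t DstOffset = 0) {
  static_assert(std::is_trivially_copyable<T>::value,
                "memcpy requires a trivially copyable element type");
  if (Count > SrcSize)
    return "copying too many elements, " + std::to_string(Count) +
           ", from host array of element count " + std::to_string(SrcSize);
  if (!rangeFits(DstOffset, Count, DstSize))
    return "copying too many elements, " + std::to_string(Count) +
           ", to array of element count " + std::to_string(DstSize) +
           " at element offset " + std::to_string(DstOffset);
  // Element units become byte units only at the point of the raw copy.
  if (Count != 0)
    std::memcpy(reinterpret_cast<char *>(Dst) + DstOffset * sizeof(T), Src,
                Count * sizeof(T));
  return std::string();
}
```

				Copying three elements to offset 2 of a six-element destination succeeds; the same copy at offset 4 is rejected because it would run past the end.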
				jprice: memory

				/// Just like synchronousMemcpyH2D above, but copies the entire source array
				/// to the destination.
				template <typename T>
				Error synchronousMemcpyH2D(GlobalDeviceMemory<T> *DeviceDst,
				llvm::ArrayRef<T> HostSrc) {
				if (HostSrc.size() != DeviceDst->getElementCount()) {
				return make_error("synchronousMemcpyH2D host source element count, " +
				llvm::Twine(HostSrc.size()) +
				" ,does not match device destination element count, " +
				llvm::Twine(DeviceDst->getElementCount()));
				}
				jlebar: My thought is that we'd have an implicit one-arg constructor in DeviceArraySlice<T> so we wouldn't need this overload: `DeviceArraySlice<T>::DeviceArraySlice<T>(const GlobalDeviceMemory<T>& Mem);` (In the same way, llvm::ArrayRef has an implicit one-arg constructor that accepts an std::vector<T>.)
				jhen: I tried this and unfortunately it doesn't seem to work perfectly. I think the template type inference is getting in the way of selecting the implicit constructor. For example, I have to call `Executor.syncCopyD2H<int>(DeviceA, llvm::MutableArrayRef<int>(Host))` in my test code, where I explicitly specify the function template argument for the `syncCopyD2H` function. Otherwise, I get a compile error that says it ignored the candidate template because it could not match `GlobalDeviceMemorySlice` against `GlobalDeviceMemory`. I removed the `syncCopyD2H` overloads for `GlobalDeviceMemory` for now, so that users have to explicitly supply the template argument or call `asSlice()`. Should I add back the overloads in order to get around this problem?
				return synchronousMemcpyH2D(DeviceDst, HostSrc, HostSrc.size());
				}

				/// Host-synchronously copies a slice of an array of elements of type T from
				/// one place in device memory to another.
				template <typename T>
				Error synchronousMemcpyD2D(GlobalDeviceMemory<T> *DeviceDst,
				const GlobalDeviceMemory<T> &DeviceSrc,
				size_t ElementCount, size_t SrcElementOffset = 0,
				size_t DstElementOffset = 0) {
				jprice: Missing offset argument as above.
				if (SrcElementOffset + ElementCount > DeviceSrc.getElementCount())
				return make_error("copying too many elements, " +
				llvm::Twine(ElementCount) + ", at element offset " +
				llvm::Twine(SrcElementOffset) +
				" from device memory array of element count " +
				llvm::Twine(DeviceSrc.getElementCount()));
				else if (DstElementOffset + ElementCount > DeviceDst->getElementCount())
				return make_error("copying too many elements, " +
				llvm::Twine(ElementCount) +
				", to device memory array of element count " +
				llvm::Twine(DeviceDst->getElementCount()) +
				jlebar: Well, this is a lot of overloads, but I guess it could be worse. :) My only thought is that it would be nice not to duplicate so much of the comments. But other than that, I think you can go forth and do this for real.
				jhen: I got rid of most of the comment duplication by referencing earlier functions in the comments for later functions.
				" at element offset " + llvm::Twine(DstElementOffset));
				return PExecutor->synchronousMemcpyD2D(
				DeviceDst, DeviceSrc, ElementCount * sizeof(T),
				SrcElementOffset * sizeof(T), DstElementOffset * sizeof(T));
				}

				/// Just like synchronousMemcpyD2D above, but copies the entire source array
				/// to the destination.
				template <typename T>
				Error synchronousMemcpyD2D(GlobalDeviceMemory<T> *DeviceDst,
				const GlobalDeviceMemory<T> &DeviceSrc) {
				if (DeviceSrc.getElementCount() != DeviceDst->getElementCount()) {
				return make_error("synchronousMemcpyH2D device source element count, " +
				llvm::Twine(DeviceSrc.getElementCount()) +
				" ,does not match device destination element count, " +
				llvm::Twine(DeviceDst->getElementCount()));
				}
				return synchronousMemcpyD2D(DeviceDst, DeviceSrc,
				DeviceSrc.getElementCount());
				}

	private:			private:
				jprice: H2D
	PlatformExecutor *PExecutor;			PlatformExecutor *PExecutor;
	};			};

	} // namespace streamexecutor			} // namespace streamexecutor

	#endif // STREAMEXECUTOR_EXECUTOR_H			#endif // STREAMEXECUTOR_EXECUTOR_H
				jlebar: "Should I add back the overloads in order to get around this problem?" Argh. That's the common case, so I guess, yes. I guess you only need it for the two-arg overloads and not the three-arg overload. :-/
				jhen: I put the overloads back. I couldn't even get the one that took `T*` as its second parameter to work without the overload, so I had to put them all back. I even tried to define a conversion operator for `GlobalDeviceMemory<T>` to `GlobalDeviceMemorySlice<T>`, but that didn't work either.
				jprice: D2D
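
				The deduction problem jhen ran into above is standard C++ behaviour: template argument deduction does not consider user-defined implicit conversions, so a `Slice<T>` parameter cannot deduce `T` from a `Memory<T>` argument even when a converting constructor exists. The sketch below reproduces that with hypothetical stand-in types (`Memory`, `Slice`, `countOf`); it is not the StreamExecutor code.

```cpp
#include <cstddef>

// Stand-in for GlobalDeviceMemory<T>.
template <typename T> struct Memory {
  std::size_t Count;
};

// Stand-in for GlobalDeviceMemorySlice<T>, implicitly constructible from
// Memory<T>.
template <typename T> struct Slice {
  Slice(const Memory<T> &M) : Count(M.Count) {}
  std::size_t Count;
};

// Base function: works on slices.
template <typename T> std::size_t countOf(Slice<T> S) { return S.Count; }

// Forwarding overload. Without it, `countOf(Mem)` does not compile for a
// Memory<int> argument: T cannot be deduced through the converting
// constructor. Here T is deduced from Memory<T> directly and the conversion
// is spelled out.
template <typename T> std::size_t countOf(const Memory<T> &M) {
  return countOf(Slice<T>(M));
}
```

				With only the first `countOf`, callers would have to write `countOf<int>(Mem)` or build the slice themselves, which matches the workaround described in the thread; the forwarding overload restores the plain `countOf(Mem)` call.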

streamexecutor/include/streamexecutor/PlatformInterfaces.h

[... 70 lines not shown ...]	public:
/// Launches a kernel on the given stream.		/// Launches a kernel on the given stream.
virtual Error launch(PlatformStreamHandle *S, BlockDimensions BlockSize,		virtual Error launch(PlatformStreamHandle *S, BlockDimensions BlockSize,
GridDimensions GridSize, const KernelBase &Kernel,		GridDimensions GridSize, const KernelBase &Kernel,
const PackedKernelArgumentArrayBase &ArgumentArray) {		const PackedKernelArgumentArrayBase &ArgumentArray) {
return make_error("launch not implemented for platform " + getName());		return make_error("launch not implemented for platform " + getName());
}		}

/// Copies data from the device to the host.		/// Copies data from the device to the host.
virtual Error memcpyD2H(PlatformStreamHandle *S,		///
		/// HostDst should have been allocated by allocateHostMemory or registered
		/// with registerHostMemory.
		virtual Error memcpyD2H(PlatformStreamHandle *S, void *HostDst,
const GlobalDeviceMemoryBase &DeviceSrc,		const GlobalDeviceMemoryBase &DeviceSrc,
void *HostDst, size_t ByteCount) {		size_t ByteCount, size_t SrcByteOffset = 0) {
return make_error("memcpyD2H not implemented for platform " + getName());		return make_error("memcpyD2H not implemented for platform " + getName());
}		}

/// Copies data from the host to the device.		/// Copies data from the host to the device.
virtual Error memcpyH2D(PlatformStreamHandle *S, const void *HostSrc,		///
GlobalDeviceMemoryBase *DeviceDst, size_t ByteCount) {		/// HostSrc should have been allocated by allocateHostMemory or registered
		/// with registerHostMemory.
		virtual Error memcpyH2D(PlatformStreamHandle *S,
		GlobalDeviceMemoryBase *DeviceDst,
		const void *HostSrc, size_t ByteCount,
		size_t DstByteOffset = 0) {
return make_error("memcpyH2D not implemented for platform " + getName());		return make_error("memcpyH2D not implemented for platform " + getName());
}		}

/// Copies data from one device location to another.		/// Copies data from one device location to another.
virtual Error memcpyD2D(PlatformStreamHandle *S,		virtual Error memcpyD2D(PlatformStreamHandle *S,
		GlobalDeviceMemoryBase *DeviceDst,
const GlobalDeviceMemoryBase &DeviceSrc,		const GlobalDeviceMemoryBase &DeviceSrc,
GlobalDeviceMemoryBase *DeviceDst, size_t ByteCount) {		size_t ByteCount, size_t SrcByteOffset = 0,
		size_t DstByteOffset = 0) {
return make_error("memcpyD2D not implemented for platform " + getName());		return make_error("memcpyD2D not implemented for platform " + getName());
}		}

/// Blocks the host until the given stream completes all the work enqueued up		/// Blocks the host until the given stream completes all the work enqueued up
/// to the point this function is called.		/// to the point this function is called.
virtual Error blockHostUntilDone(PlatformStreamHandle *S) {		virtual Error blockHostUntilDone(PlatformStreamHandle *S) {
return make_error("blockHostUntilDone not implemented for platform " +		return make_error("blockHostUntilDone not implemented for platform " +
getName());		getName());
}		}

		/// Allocates untyped device memory of a given size in bytes.
		virtual Expected<GlobalDeviceMemoryBase>
		allocateDeviceMemory(size_t ByteCount) {
		return make_error("allocateDeviceMemory not implemented for platform " +
		getName());
		}

		/// Frees device memory previously allocated by allocateDeviceMemory.
		virtual Error freeDeviceMemory(GlobalDeviceMemoryBase *Memory) {
		return make_error("freeDeviceMemory not implemented for platform " +
		getName());
		}

		/// Allocates untyped host memory of a given size in bytes.
		///
		/// Host memory allocated via this method is suitable for use with memcpyH2D
		/// and memcpyD2H.
		virtual Expected<void *> allocateHostMemory(size_t ByteCount) {
		return make_error("allocateHostMemory not implemented for platform " +
		getName());
		}

		/// Frees host memory allocated by allocateHostMemory.
		virtual Error freeHostMemory(void *Memory) {
		return make_error("freeHostMemory not implemented for platform " +
		getName());
		}

		/// Registers previously allocated host memory so it can be used with
		/// memcpyH2D and memcpyD2H.
		virtual Error registerHostMemory(void *Memory, size_t ByteCount) {
		return make_error("registerHostMemory not implemented for platform " +
		jprice: registerHostMemory
		getName());
		}

		/// Unregisters host memory previously registered with registerHostMemory.
		virtual Error unregisterHostMemory(void *Memory) {
		return make_error("unregisterHostMemory not implemented for platform " +
		getName());
		}

		/// Copies the given number of bytes from device memory to host memory.
		///
		/// Blocks the calling host thread until the copy is completed. Can operate on
		/// any host memory, not just registered host memory or host memory allocated
		/// by allocateHostMemory.
		jlebar: Should we specify the device synchronicity semantics for these calls (and also in Executor, I guess)? The non-"synchronous" functions clearly are synchronous within a stream. But for these I have no idea what they're supposed to do.
		virtual Error synchronousMemcpyD2H(void *HostDst,
		jlebar: Ah, but there can be no device work enqueued after the call starts but before it completes because this is host-synchronous! (Unless you can touch Streams on multiple host threads?) So maybe the thing to say is simply that this does not block any ongoing device calls?
		const GlobalDeviceMemoryBase &DeviceSrc,
		size_t ByteCount,
		size_t SrcByteOffset = 0) {
		return make_error("synchronousMemcpyD2H not implemented for platform " +
		getName());
		}

		/// Copies the given number of bytes from host memory to device memory.
		///
		/// Blocks the calling host thread until the copy is completed. Can operate on
		/// any host memory, not just registered host memory or host memory allocated
		/// by allocateHostMemory.
		virtual Error synchronousMemcpyH2D(GlobalDeviceMemoryBase *DeviceDst,
		const void *HostSrc, size_t ByteCount,
		size_t DstByteOffset = 0) {
		return make_error("synchronousMemcpyH2D not implemented for platform " +
		getName());
		}

		/// Copies the given number of bytes from one location to another in device
		/// memory.
		virtual Error synchronousMemcpyD2D(GlobalDeviceMemoryBase *DeviceDst,
		const GlobalDeviceMemoryBase &DeviceSrc,
		size_t ByteCount, size_t SrcByteOffset = 0,
		size_t DstByteOffset = 0) {
		return make_error("synchronousMemcpyD2D not implemented for platform " +
		getName());
		}
};		};

} // namespace streamexecutor		} // namespace streamexecutor

#endif // STREAMEXECUTOR_PLATFORMINTERFACES_H		#endif // STREAMEXECUTOR_PLATFORMINTERFACES_H

streamexecutor/include/streamexecutor/Stream.h

[... 100 lines not shown ...]	public:

/// Entrain onto the stream a memcpy of a given number of elements from a		/// Entrain onto the stream a memcpy of a given number of elements from a
/// device source to a host destination.		/// device source to a host destination.
///		///
/// HostDst must be a pointer to host memory allocated by		/// HostDst must be a pointer to host memory allocated by
/// Executor::allocateHostMemory or otherwise allocated and then		/// Executor::allocateHostMemory or otherwise allocated and then
/// registered with Executor::registerHostMemory.		/// registered with Executor::registerHostMemory.
template <typename T>		template <typename T>
Stream &thenMemcpyD2H(const GlobalDeviceMemory<T> &DeviceSrc,		Stream &thenMemcpyD2H(llvm::MutableArrayRef<T> HostDst,
llvm::MutableArrayRef<T> HostDst, size_t ElementCount) {		const GlobalDeviceMemory<T> &DeviceSrc,
if (ElementCount > DeviceSrc.getElementCount())		size_t ElementCount, size_t SrcElementOffset = 0) {
		jlebar: I guess I expected that we'd match the Executor's interface, with the overloads and such. But maybe you plan to do that in a separate patch?
		jhen: I was thinking of doing it in a separate patch, but changed my mind. Now these interfaces should be updated too.
		jlebar: may
		if (ElementCount + SrcElementOffset > DeviceSrc.getElementCount())
setError("copying too many elements, " + llvm::Twine(ElementCount) +		setError("copying too many elements, " + llvm::Twine(ElementCount) +
", from device memory array of size " +		", at element offset " + llvm::Twine(SrcElementOffset) +
		" from device memory array of element count " +
llvm::Twine(DeviceSrc.getElementCount()));		llvm::Twine(DeviceSrc.getElementCount()));
else if (ElementCount > HostDst.size())		else if (ElementCount > HostDst.size())
setError("copying too many elements, " + llvm::Twine(ElementCount) +		setError("copying too many elements, " + llvm::Twine(ElementCount) +
", to host array of size " + llvm::Twine(HostDst.size()));		", to host array of element count " +
		llvm::Twine(HostDst.size()));
else		else
setError(PExecutor->memcpyD2H(ThePlatformStream.get(), DeviceSrc,		setError(PExecutor->memcpyD2H(ThePlatformStream.get(), HostDst.data(),
HostDst.data(), ElementCount * sizeof(T)));		DeviceSrc, ElementCount * sizeof(T),
		SrcElementOffset * sizeof(T)));
		jprice: Missing offset argument as above.
return *this;		return *this;
}		}

/// Same as thenMemcpyD2H above, but copies the entire source to the		/// Same as thenMemcpyD2H above, but copies the entire source to the
/// destination.		/// destination.
template <typename T>		template <typename T>
Stream &thenMemcpyD2H(const GlobalDeviceMemory<T> &DeviceSrc,		Stream &thenMemcpyD2H(llvm::MutableArrayRef<T> HostDst,
llvm::MutableArrayRef<T> HostDst) {		const GlobalDeviceMemory<T> &DeviceSrc) {
return thenMemcpyD2H(DeviceSrc, HostDst, DeviceSrc.getElementCount());		if (DeviceSrc.getElementCount() != HostDst.size()) {
		setError("thenMemcpyD2H device source element count, " +
		llvm::Twine(DeviceSrc.getElementCount()) +
		" ,does not match host destination element count, " +
		llvm::Twine(HostDst.size()));
		return *this;
		}
		return thenMemcpyD2H(HostDst, DeviceSrc, DeviceSrc.getElementCount());
}		}

/// Entrain onto the stream a memcpy of a given number of elements from a host		/// Entrain onto the stream a memcpy of a given number of elements from a host
/// source to a device destination.		/// source to a device destination.
///		///
/// HostSrc must be a pointer to host memory allocated by		/// HostSrc must be a pointer to host memory allocated by
/// Executor::allocateHostMemory or otherwise allocated and then		/// Executor::allocateHostMemory or otherwise allocated and then
/// registered with Executor::registerHostMemory.		/// registered with Executor::registerHostMemory.
template <typename T>		template <typename T>
Stream &thenMemcpyH2D(llvm::ArrayRef<T> HostSrc,		Stream &thenMemcpyH2D(GlobalDeviceMemory<T> *DeviceDst,
GlobalDeviceMemory<T> *DeviceDst, size_t ElementCount) {		llvm::ArrayRef<T> HostSrc, size_t ElementCount,
		size_t DstElementOffset = 0) {
if (ElementCount > HostSrc.size())		if (ElementCount > HostSrc.size())
setError("copying too many elements, " + llvm::Twine(ElementCount) +		setError("copying too many elements, " + llvm::Twine(ElementCount) +
", from host array of size " + llvm::Twine(HostSrc.size()));		", from host array of element count " +
else if (ElementCount > DeviceDst->getElementCount())		llvm::Twine(HostSrc.size()));
		else if (DstElementOffset + ElementCount > DeviceDst->getElementCount())
setError("copying too many elements, " + llvm::Twine(ElementCount) +		setError("copying too many elements, " + llvm::Twine(ElementCount) +
", to device memory array of size " +		", to device memory array of element count " +
llvm::Twine(DeviceDst->getElementCount()));		llvm::Twine(DeviceDst->getElementCount()) +
		" at element offset " + llvm::Twine(DstElementOffset));
else		else
setError(PExecutor->memcpyH2D(ThePlatformStream.get(), HostSrc.data(),		setError(PExecutor->memcpyH2D(ThePlatformStream.get(), DeviceDst,
DeviceDst, ElementCount * sizeof(T)));		HostSrc.data(), ElementCount * sizeof(T),
		DstElementOffset * sizeof(T)));
		jprice: Missing offset argument as above.
return *this;		return *this;
}		}

/// Same as thenMemcpyH2D above, but copies the entire source to the		/// Same as thenMemcpyH2D above, but copies the entire source to the
/// destination.		/// destination.
template <typename T>		template <typename T>
Stream &thenMemcpyH2D(llvm::ArrayRef<T> HostSrc,		Stream &thenMemcpyH2D(GlobalDeviceMemory<T> *DeviceDst,
GlobalDeviceMemory<T> *DeviceDst) {		llvm::ArrayRef<T> HostSrc) {
return thenMemcpyH2D(HostSrc, DeviceDst, HostSrc.size());		if (HostSrc.size() != DeviceDst->getElementCount()) {
		setError("thenMemcpyH2D host source element count, " +
		llvm::Twine(HostSrc.size()) +
		" ,does not match device destination element count, " +
		jlebar: comma placement
		llvm::Twine(DeviceDst->getElementCount()));
		jprice: H2D
		return *this;
		}
		return thenMemcpyH2D(DeviceDst, HostSrc, HostSrc.size());
}		}

/// Entrain onto the stream a memcpy of a given number of elements from a		/// Entrain onto the stream a memcpy of a given number of elements from a
/// device source to a device destination.		/// device source to a device destination.
template <typename T>		template <typename T>
Stream &thenMemcpyD2D(const GlobalDeviceMemory<T> &DeviceSrc,		Stream &thenMemcpyD2D(GlobalDeviceMemory<T> *DeviceDst,
GlobalDeviceMemory<T> *DeviceDst, size_t ElementCount) {		const GlobalDeviceMemory<T> &DeviceSrc,
if (ElementCount > DeviceSrc.getElementCount())		size_t ElementCount, size_t SrcElementOffset = 0,
		size_t DstElementOffset = 0) {
		if (SrcElementOffset + ElementCount > DeviceSrc.getElementCount())
setError("copying too many elements, " + llvm::Twine(ElementCount) +		setError("copying too many elements, " + llvm::Twine(ElementCount) +
", from device memory array of size " +		", at element offset " + llvm::Twine(SrcElementOffset) +
		" from device memory array of element count " +
llvm::Twine(DeviceSrc.getElementCount()));		llvm::Twine(DeviceSrc.getElementCount()));
else if (ElementCount > DeviceDst->getElementCount())		else if (DstElementOffset + ElementCount > DeviceDst->getElementCount())
setError("copying too many elements, " + llvm::Twine(ElementCount) +		setError("copying too many elements, " + llvm::Twine(ElementCount) +
", to device memory array of size " +		", to device memory array of element count " +
llvm::Twine(DeviceDst->getElementCount()));		llvm::Twine(DeviceDst->getElementCount()) +
		" at element offset " + llvm::Twine(DstElementOffset));
else		else
setError(PExecutor->memcpyD2D(ThePlatformStream.get(), DeviceSrc,		setError(PExecutor->memcpyD2D(ThePlatformStream.get(), DeviceDst,
DeviceDst, ElementCount * sizeof(T)));		DeviceSrc, ElementCount * sizeof(T),
		SrcElementOffset * sizeof(T),
		DstElementOffset * sizeof(T)));
		jprice: Missing offset argument as above.
return *this;		return *this;
}		}

/// Same as thenMemcpyD2D above, but copies the entire source to the		/// Same as thenMemcpyD2D above, but copies the entire source to the
/// destination.		/// destination.
template <typename T>		template <typename T>
Stream &thenMemcpyD2D(const GlobalDeviceMemory<T> &DeviceSrc,		Stream &thenMemcpyD2D(GlobalDeviceMemory<T> *DeviceDst,
GlobalDeviceMemory<T> *DeviceDst) {		const GlobalDeviceMemory<T> &DeviceSrc) {
return thenMemcpyD2D(DeviceSrc, DeviceDst, DeviceSrc.getElementCount());		if (DeviceSrc.getElementCount() != DeviceDst->getElementCount()) {
		setError("thenMemcpyH2D device source element count, " +
		jprice: D2D
		llvm::Twine(DeviceSrc.getElementCount()) +
		" ,does not match device destination element count, " +
		llvm::Twine(DeviceDst->getElementCount()));
		return *this;
		}
		return thenMemcpyD2D(DeviceDst, DeviceSrc, DeviceSrc.getElementCount());
}		}

/// Blocks the host code, waiting for the operations entrained on the stream		/// Blocks the host code, waiting for the operations entrained on the stream
/// (enqueued up to this point in program execution) to complete.		/// (enqueued up to this point in program execution) to complete.
///		///
/// Returns true if there are no errors on the stream.		/// Returns true if there are no errors on the stream.
bool blockHostUntilDone() {		bool blockHostUntilDone() {
Error E = PExecutor->blockHostUntilDone(ThePlatformStream.get());		Error E = PExecutor->blockHostUntilDone(ThePlatformStream.get());
[... 39 lines not shown ...]	private:
llvm::Optional<std::string> ErrorMessage;		llvm::Optional<std::string> ErrorMessage;

Stream(const Stream &) = delete;		Stream(const Stream &) = delete;
void operator=(const Stream &) = delete;		void operator=(const Stream &) = delete;
};		};

} // namespace streamexecutor		} // namespace streamexecutor

#endif // STREAMEXECUTOR_STREAM_H		#endif // STREAMEXECUTOR_STREAM_H
		jlebar: This review is getting so long I'm tempted to say it will be easier to check this in as-is and edit things like this later. But I think it would be nice to consider whether there's a way to reduce the comment copypasta -- there's a lot of it, and I think it obscures the API. I think in an earlier patch you did some doxygen "function group" magic? Again, maybe better to save this for later.
		jhen: I definitely think I can cut down the comments a lot and make the API more understandable in the process. As you suggested, I'll check this in now and then do that in the next CL.
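
		The `thenMemcpy*` methods in Stream.h above share one shape: validate, record an error on the stream instead of returning it, and return `*this` so calls can be chained. A rough standalone sketch of that shape follows. `MiniStream` and its members are hypothetical, a counter replaces the real platform call, and two behaviours are assumptions that the excerpt does not show: later operations are skipped once an error is recorded, and the first error is the one kept.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical miniature of a chained stream interface.
class MiniStream {
public:
  // Validates a copy request and "enqueues" it by bumping a counter.
  MiniStream &thenCopy(std::size_t ElementCount, std::size_t SrcSize,
                       std::size_t DstSize) {
    if (Failed) // Assumption: work after a failure is not enqueued.
      return *this;
    if (ElementCount > SrcSize)
      setError("copying too many elements, " + std::to_string(ElementCount) +
               ", from array of element count " + std::to_string(SrcSize));
    else if (ElementCount > DstSize)
      setError("copying too many elements, " + std::to_string(ElementCount) +
               ", to array of element count " + std::to_string(DstSize));
    else
      ++CopiesEnqueued;
    return *this;
  }

  bool isOK() const { return !Failed; }
  const std::string &errorMessage() const { return ErrorMessage; }
  std::size_t copiesEnqueued() const { return CopiesEnqueued; }

private:
  void setError(std::string Message) {
    Failed = true;
    ErrorMessage = std::move(Message);
  }

  bool Failed = false;
  std::string ErrorMessage;
  std::size_t CopiesEnqueued = 0;
};
```

		A chain such as `S.thenCopy(4, 10, 10).thenCopy(20, 10, 10).thenCopy(2, 10, 10)` enqueues only the first copy and leaves the error from the second call on the stream for the caller to inspect once.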

streamexecutor/include/streamexecutor/Utils/Error.h

	[... 163 lines not shown ...]

	#include "llvm/Support/Error.h"			#include "llvm/Support/Error.h"

	namespace streamexecutor {			namespace streamexecutor {

	using llvm::consumeError;			using llvm::consumeError;
	using llvm::Error;			using llvm::Error;
	using llvm::Expected;			using llvm::Expected;
	using llvm::StringRef;			using llvm::Twine;

	// Makes an Error object from an error message.			// Makes an Error object from an error message.
	Error make_error(StringRef Message);			Error make_error(Twine Message);

	// Consumes the input error and returns its error message.			// Consumes the input error and returns its error message.
	//			//
	// Assumes the input was created by the make_error function above.			// Assumes the input was created by the make_error function above.
	std::string consumeAndGetMessage(Error &&E);			std::string consumeAndGetMessage(Error &&E);

	} // namespace streamexecutor			} // namespace streamexecutor

	#endif // STREAMEXECUTOR_UTILS_ERROR_H			#endif // STREAMEXECUTOR_UTILS_ERROR_H

streamexecutor/lib/Utils/Error.cpp

	[... 38 lines not shown ...]
	};			};

	char StreamExecutorError::ID = 0;			char StreamExecutorError::ID = 0;

	} // namespace			} // namespace

	namespace streamexecutor {			namespace streamexecutor {

	Error make_error(StringRef Message) {			Error make_error(Twine Message) {
	return llvm::make_error<StreamExecutorError>(Message);			return llvm::make_error<StreamExecutorError>(Message.str());
	}			}

	std::string consumeAndGetMessage(Error &&E) {			std::string consumeAndGetMessage(Error &&E) {
	if (!E) {			if (!E) {
	return "success";			return "success";
	}			}
	std::string Message;			std::string Message;
	llvm::handleAllErrors(std::move(E),			llvm::handleAllErrors(std::move(E),
	[&Message](const StreamExecutorError &SEE) {			[&Message](const StreamExecutorError &SEE) {
	Message = SEE.getErrorMessage();			Message = SEE.getErrorMessage();
	});			});
	return Message;			return Message;
	}			}

	} // namespace streamexecutor			} // namespace streamexecutor

streamexecutor/lib/unittests/CMakeLists.txt

	add_executable(			add_executable(
				executor_test
				ExecutorTest.cpp)
				target_link_libraries(
				executor_test
				streamexecutor
				${GTEST_BOTH_LIBRARIES}
				${CMAKE_THREAD_LIBS_INIT})
				add_test(ExecutorTest executor_test)

				add_executable(
	kernel_test			kernel_test
	KernelTest.cpp)			KernelTest.cpp)
	target_link_libraries(			target_link_libraries(
	kernel_test			kernel_test
	streamexecutor			streamexecutor
	${GTEST_BOTH_LIBRARIES}			${GTEST_BOTH_LIBRARIES}
	${CMAKE_THREAD_LIBS_INIT})			${CMAKE_THREAD_LIBS_INIT})
	add_test(KernelTest kernel_test)			add_test(KernelTest kernel_test)
	[... 32 lines not shown ...]

streamexecutor/lib/unittests/ExecutorTest.cpp

This file was added.

				//===-- ExecutorTest.cpp - Tests for Executor -----------------------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				///
				/// \file
				/// This file contains the unit tests for Executor code.
				///
				//===----------------------------------------------------------------------===//

				#include <cstdlib>
				#include <cstring>

				#include "streamexecutor/Executor.h"
				#include "streamexecutor/PlatformInterfaces.h"

				#include "gtest/gtest.h"

				namespace {

				namespace se = ::streamexecutor;

				class MockPlatformExecutor : public se::PlatformExecutor {
				public:
				~MockPlatformExecutor() override {}

				std::string getName() const override { return "MockPlatformExecutor"; }

				se::Expected<std::unique_ptr<se::PlatformStreamHandle>>
				createStream() override {
				return se::make_error("not implemented");
				}

				se::Expected<se::GlobalDeviceMemoryBase>
				allocateDeviceMemory(size_t ByteCount) override {
				return se::GlobalDeviceMemoryBase(std::malloc(ByteCount));
				}

				se::Error freeDeviceMemory(se::GlobalDeviceMemoryBase *Memory) override {
				std::free(const_cast<void *>(Memory->getHandle()));
				return se::Error::success();
				}

				se::Expected<void *> allocateHostMemory(size_t ByteCount) override {
				return std::malloc(ByteCount);
				}

				se::Error freeHostMemory(void *Memory) override {
				std::free(Memory);
				return se::Error::success();
				}

				se::Error synchronousMemcpyD2H(void *HostDst,
				const se::GlobalDeviceMemoryBase &DeviceSrc,
				size_t ByteCount,
				size_t SrcByteOffset) override {
				std::memcpy(HostDst, static_cast<const char *>(DeviceSrc.getHandle()) +
				SrcByteOffset,
				ByteCount);
				return se::Error::success();
				}

				se::Error synchronousMemcpyH2D(se::GlobalDeviceMemoryBase *DeviceDst,
				const void *HostSrc, size_t ByteCount,
				size_t DstByteOffset) override {
				std::memcpy(
				static_cast<char *>(const_cast<void *>(DeviceDst->getHandle())) +
				DstByteOffset,
				HostSrc, ByteCount);
				return se::Error::success();
				}

				se::Error synchronousMemcpyD2D(se::GlobalDeviceMemoryBase *DeviceDst,
				const se::GlobalDeviceMemoryBase &DeviceSrc,
				size_t ByteCount, size_t SrcByteOffset,
				size_t DstByteOffset) override {
				std::memcpy(
				static_cast<char *>(const_cast<void *>(DeviceDst->getHandle())) +
				DstByteOffset,
				static_cast<const char *>(DeviceSrc.getHandle()) + SrcByteOffset,
				ByteCount);
				return se::Error::success();
				}
				};

				/// Test fixture to hold objects used by tests.
				class ExecutorTest : public ::testing::Test {
				public:
				ExecutorTest()
				: DeviceA(se::GlobalDeviceMemory<int>::makeFromElementCount(HostA, 10)),
				DeviceB(se::GlobalDeviceMemory<int>::makeFromElementCount(HostB, 10)),
				Executor(&PExecutor) {}

				// Device memory is backed by host arrays.
				int HostA[10];
				se::GlobalDeviceMemory<int> DeviceA;
				int HostB[10];
				se::GlobalDeviceMemory<int> DeviceB;

				// Host memory to be used as actual host memory.
				int Host[10];

				MockPlatformExecutor PExecutor;
				se::Executor Executor;
				};

				TEST_F(ExecutorTest, MemcpyCorrectSize) {
				EXPECT_FALSE(static_cast<bool>(
				Executor.synchronousMemcpyH2D(&DeviceA, llvm::ArrayRef<int>(Host))));
				EXPECT_FALSE(static_cast<bool>(Executor.synchronousMemcpyD2H(
				llvm::MutableArrayRef<int>(Host), DeviceA)));
				EXPECT_FALSE(
				static_cast<bool>(Executor.synchronousMemcpyD2D(&DeviceB, DeviceA)));
				}

				TEST_F(ExecutorTest, MemcpyH2DTooManyElements) {
				se::Error E =
				Executor.synchronousMemcpyH2D(&DeviceA, llvm::ArrayRef<int>(Host), 20);
				EXPECT_TRUE(static_cast<bool>(E));
				se::consumeError(std::move(E));
				}

				TEST_F(ExecutorTest, MemcpyD2HTooManyElements) {
				se::Error E = Executor.synchronousMemcpyD2H(llvm::MutableArrayRef<int>(Host),
				DeviceA, 20);
				EXPECT_TRUE(static_cast<bool>(E));
				se::consumeError(std::move(E));
				}

				TEST_F(ExecutorTest, MemcpyD2DTooManyElements) {
				se::Error E = Executor.synchronousMemcpyD2D(&DeviceB, DeviceA, 20);
				EXPECT_TRUE(static_cast<bool>(E));
				se::consumeError(std::move(E));
				}

				} // namespace
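
The `TooManyElements` tests above expect an error when the requested element count exceeds what either side of the copy can hold. A minimal standalone sketch of that kind of bounds check, extended with the element offset discussed in the review, is shown below. `FakeDeviceMemory` and `checkedCopyD2H` are hypothetical names for illustration only, and a `std::string` stands in for `se::Error`; this is not the actual `se::Executor` implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a typed device allocation backed by host memory,
// similar in spirit to how the test fixture backs DeviceA with HostA.
struct FakeDeviceMemory {
  int *Data;
  std::size_t ElementCount;
};

// Copies ElementCount ints from Src (starting at SrcElementOffset) to HostDst.
// Returns an empty string on success or an error message on failure.
// Assumption: std::string replaces se::Error to keep the sketch standalone.
std::string checkedCopyD2H(int *HostDst, std::size_t HostCount,
                           const FakeDeviceMemory &Src,
                           std::size_t ElementCount,
                           std::size_t SrcElementOffset = 0) {
  assert(HostDst != nullptr && "host destination must not be null");
  if (ElementCount > HostCount)
    return "copying too many elements to host destination";
  // Written to avoid unsigned overflow in Offset + Count.
  if (SrcElementOffset > Src.ElementCount ||
      ElementCount > Src.ElementCount - SrcElementOffset)
    return "copying too many elements from device source";
  std::memcpy(HostDst, Src.Data + SrcElementOffset,
              ElementCount * sizeof(int));
  return {};
}
```

Checking the offset before subtracting it keeps the comparison safe even when the offset itself is out of range, which a naive `Offset + Count > Size` test would not guarantee for very large values.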

streamexecutor/lib/unittests/StreamTest.cpp

[34 lines not shown]	public:

std::string getName() const override { return "MockPlatformExecutor"; }		std::string getName() const override { return "MockPlatformExecutor"; }

se::Expected<std::unique_ptr<se::PlatformStreamHandle>>		se::Expected<std::unique_ptr<se::PlatformStreamHandle>>
createStream() override {		createStream() override {
return nullptr;		return nullptr;
}		}

se::Error memcpyD2H(se::PlatformStreamHandle *,		se::Error memcpyD2H(se::PlatformStreamHandle *, void *HostDst,
const se::GlobalDeviceMemoryBase &DeviceSrc,		const se::GlobalDeviceMemoryBase &DeviceSrc,
void *HostDst, size_t ByteCount) override {		size_t ByteCount, size_t SrcByteOffset) override {
std::memcpy(HostDst, DeviceSrc.getHandle(), ByteCount);		std::memcpy(HostDst, static_cast<const char *>(DeviceSrc.getHandle()) +
		SrcByteOffset,
		ByteCount);
return se::Error::success();		return se::Error::success();
}		}

se::Error memcpyH2D(se::PlatformStreamHandle *, const void *HostSrc,		se::Error memcpyH2D(se::PlatformStreamHandle *,
se::GlobalDeviceMemoryBase *DeviceDst,		se::GlobalDeviceMemoryBase *DeviceDst,
size_t ByteCount) override {		const void *HostSrc, size_t ByteCount,
std::memcpy(const_cast<void *>(DeviceDst->getHandle()), HostSrc, ByteCount);		size_t DstByteOffset) override {
		std::memcpy(
		static_cast<char *>(const_cast<void *>(DeviceDst->getHandle())) +
		DstByteOffset,
		HostSrc, ByteCount);
return se::Error::success();		return se::Error::success();
}		}

se::Error memcpyD2D(se::PlatformStreamHandle *,		se::Error memcpyD2D(se::PlatformStreamHandle *,
const se::GlobalDeviceMemoryBase &DeviceSrc,
se::GlobalDeviceMemoryBase *DeviceDst,		se::GlobalDeviceMemoryBase *DeviceDst,
size_t ByteCount) override {		const se::GlobalDeviceMemoryBase &DeviceSrc,
std::memcpy(const_cast<void *>(DeviceDst->getHandle()),		size_t ByteCount, size_t SrcByteOffset,
DeviceSrc.getHandle(), ByteCount);		size_t DstByteOffset) override {
		std::memcpy(
		static_cast<char *>(const_cast<void *>(DeviceDst->getHandle())) +
		DstByteOffset,
		static_cast<const char *>(DeviceSrc.getHandle()) + SrcByteOffset,
		ByteCount);
return se::Error::success();		return se::Error::success();
}		}
};		};

/// Test fixture to hold objects used by tests.		/// Test fixture to hold objects used by tests.
class StreamTest : public ::testing::Test {		class StreamTest : public ::testing::Test {
public:		public:
StreamTest()		StreamTest()
[11 lines not shown]	protected:
// Host memory to be used as actual host memory.		// Host memory to be used as actual host memory.
int Host[10];		int Host[10];

MockPlatformExecutor PExecutor;		MockPlatformExecutor PExecutor;
se::Stream Stream;		se::Stream Stream;
};		};

TEST_F(StreamTest, MemcpyCorrectSize) {		TEST_F(StreamTest, MemcpyCorrectSize) {
Stream.thenMemcpyH2D(llvm::ArrayRef<int>(Host), &DeviceA);		Stream.thenMemcpyH2D(&DeviceA, llvm::ArrayRef<int>(Host));
EXPECT_TRUE(Stream.isOK());		EXPECT_TRUE(Stream.isOK());

Stream.thenMemcpyD2H(DeviceA, llvm::MutableArrayRef<int>(Host));		Stream.thenMemcpyD2H(llvm::MutableArrayRef<int>(Host), DeviceA);
EXPECT_TRUE(Stream.isOK());		EXPECT_TRUE(Stream.isOK());

Stream.thenMemcpyD2D(DeviceA, &DeviceB);		Stream.thenMemcpyD2D(&DeviceB, DeviceA);
EXPECT_TRUE(Stream.isOK());		EXPECT_TRUE(Stream.isOK());
}		}

TEST_F(StreamTest, MemcpyH2DTooManyElements) {		TEST_F(StreamTest, MemcpyH2DTooManyElements) {
Stream.thenMemcpyH2D(llvm::ArrayRef<int>(Host), &DeviceA, 20);		Stream.thenMemcpyH2D(&DeviceA, llvm::ArrayRef<int>(Host), 20);
EXPECT_FALSE(Stream.isOK());		EXPECT_FALSE(Stream.isOK());
}		}

TEST_F(StreamTest, MemcpyD2HTooManyElements) {		TEST_F(StreamTest, MemcpyD2HTooManyElements) {
Stream.thenMemcpyD2H(DeviceA, llvm::MutableArrayRef<int>(Host), 20);		Stream.thenMemcpyD2H(llvm::MutableArrayRef<int>(Host), DeviceA, 20);
EXPECT_FALSE(Stream.isOK());		EXPECT_FALSE(Stream.isOK());
}		}

TEST_F(StreamTest, MemcpyD2DTooManyElements) {		TEST_F(StreamTest, MemcpyD2DTooManyElements) {
Stream.thenMemcpyD2D(DeviceA, &DeviceB, 20);		Stream.thenMemcpyD2D(&DeviceB, DeviceA, 20);
EXPECT_FALSE(Stream.isOK());		EXPECT_FALSE(Stream.isOK());
}		}

} // namespace		} // namespace
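
The `StreamTest` cases above call a `thenMemcpy*` method and then query `Stream.isOK()` instead of inspecting a returned error. One common way to build such an interface is a "sticky error": the stream records the first failure and reports it through `isOK()`. The sketch below illustrates that pattern under stated assumptions. `FakeStream` and `thenCopy` are hypothetical names, and skipping later operations after a failure is a choice made for this sketch, not something the diff confirms about `se::Stream`.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stream that remembers the first error it encounters.
class FakeStream {
public:
  // Copies ElementCount ints from Src to Dst if both sides are large enough.
  // Returns *this so that calls can be chained.
  FakeStream &thenCopy(int *Dst, std::size_t DstCount, const int *Src,
                       std::size_t SrcCount, std::size_t ElementCount) {
    if (!isOK())
      return *this; // Assumption: operations after a failure are skipped.
    if (ElementCount > DstCount || ElementCount > SrcCount) {
      ErrorMessage = "copying too many elements";
      return *this;
    }
    std::memcpy(Dst, Src, ElementCount * sizeof(int));
    return *this;
  }

  // True until some operation has failed.
  bool isOK() const { return ErrorMessage.empty(); }

  const std::string &getError() const { return ErrorMessage; }

private:
  std::string ErrorMessage; // Empty means no error so far.
};
```

With this shape a caller can chain several operations and check `isOK()` once at the end, which is how the tests above use `Stream`.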

[StreamExecutor] Executor add synchronous methods (Closed, Public)

Details

Diff Detail

Event Timeline