This is an archive of the discontinued LLVM Phabricator instance.

[StreamExecutor] Executor add synchronous methods
ClosedPublic

Authored by jhen on Aug 16 2016, 12:38 PM.

Details

Summary

Add Executor methods that block the host until completion. Since these
methods are host-synchronous, they don't require Stream arguments.

Diff Detail

Event Timeline

jhen updated this revision to Diff 68243.Aug 16 2016, 12:38 PM
jhen retitled this revision from to [StreamExecutor] Executor add synchronous methods.
jhen updated this object.
jhen added a reviewer: jlebar.
jhen added subscribers: parallel_libs-commits, jprice.

General question: if we have variants of these memcpy methods that take ElementCount parameters to allow for partial copies to/from the device allocations, should we also have variants with an Offset parameter as well to allow for partial copies that don't start at the origin?

streamexecutor/include/streamexecutor/Executor.h
71

If this is going to be backing onto something like cuMemHostRegister, don't we need the size of the allocation as well?

144

memory

streamexecutor/include/streamexecutor/PlatformInterfaces.h
146

registerHostMemory

jlebar added inline comments.Aug 16 2016, 2:09 PM
streamexecutor/include/streamexecutor/Executor.h
48

Should we take this by reference?

110

I wonder if we should assert that the two arrays' sizes are the same, in this case. Same for the H2D function.

jhen updated this revision to Diff 68261.Aug 16 2016, 2:54 PM
jhen marked 3 inline comments as done.
  • Respond to jprice's comments
jhen added a comment.Aug 16 2016, 2:56 PM

General question: if we have variants of these memcpy methods that take ElementCount parameters to allow for partial copies to/from the device allocations, should we also have variants with an Offset parameter as well to allow for partial copies that don't start at the origin?

Yes, I think allowing an offset argument is a good idea. I made that change to the Executor and Stream memcpy methods

streamexecutor/include/streamexecutor/Executor.h
48

The internal SE code uses the convention that a DeviceMemory<T>* is passed wherever a T* would be passed for host data and a const DeviceMemory<T>& would be passed wherever a const T* would be used for host data.

Do you think it would be better to pass DeviceMemory<T>& for T* and const DeviceMemory<T>& for const T*, or something like that?

71

You're totally right. Thanks for catching that.

110

To me that feels too restrictive. Without that check do you think it would be better to rename the method? Maybe synchronousMemcpyWholeSrc or something. I like it how it is now, but I don't want the interface to be confusing.

jlebar added inline comments.Aug 16 2016, 3:11 PM
streamexecutor/include/streamexecutor/Executor.h
48

Do you think it would be better to pass DeviceMemory<T>& for T* and const DeviceMemory<T>& for const T*, or something like that?

Yes, I think so, since a DeviceMemory<T> is essentially a T*. Or put another way, when you allocate host memory, you get a T*, and when you allocate device memory, you get a DeviceMemory<T>.

109

Hm, maybe we should match the arg order of memcpy, given that we're using that name.

110

To me that feels too restrictive.

I guess it's a question of what we're optimizing for.

It seems to me that much of the time, the dst will be of the same size as the src. In such cases, you'd want to know if you accidentally mismatched the sizes, because that's going to leave part of your dst uninitialized.

If you really want to copy into only part of dst, you can make that explicit by creating a new array slice:

vector<int> dst(100);
synchronousMemcpyD2H(src, makeMutableArrayRef(dst.data(), DeviceSrc.getElementCount()));

(makeMutableArrayRef doesn't exist at the moment, but that's easily fixed...)

Or alternatively, you could just call the other overload:

vector<int> dst(100);
synchronousMemcpyD2H(src, dst, DeviceSrc.getElementCount());

Both cases are explicit that we're doing something other than simply copying everything from dst to src.

jprice added inline comments.Aug 16 2016, 3:43 PM
streamexecutor/include/streamexecutor/Executor.h
102

Should be passing SrcElementOffset * sizeof(T) through to the platform executor as well now.

135

Missing offset argument as above.

166

Missing offset argument as above.

streamexecutor/include/streamexecutor/Stream.h
122–124

Missing offset argument as above.

163–165

Missing offset argument as above.

202–205

Missing offset argument as above.

jhen updated this revision to Diff 68272.Aug 16 2016, 3:49 PM
  • Add missing offset args in raw memcpys
jhen marked 6 inline comments as done.Aug 16 2016, 3:50 PM
jhen added inline comments.
streamexecutor/include/streamexecutor/Executor.h
102

Good catch! I promise to write a bunch of unit tests to catch this kind of thing once we have the interface decided.

jhen updated this revision to Diff 68289.Aug 16 2016, 5:05 PM
jhen marked 3 inline comments as done.
  • Reverse src/dst memcpy arg order
streamexecutor/include/streamexecutor/Executor.h
48

The thing I would really miss from the current interface is the way the user has to specify a DeviceMemory as a copy destination by taking its address before passing it as an argument to the copy function. I like how this makes it hard for the user to accidentally pass the source as the destination and vice versa. Is there a good way to keep this nice feature and match the model of DeviceMemory as a pointer more closely?

110

Okay, that sounds like a good reason to me.

jlebar added inline comments.Aug 16 2016, 5:41 PM
streamexecutor/include/streamexecutor/Executor.h
48

Is there a good way to keep this nice feature and match the model of DeviceMemory as a pointer more closely?

const-ness will get us some of the way there.

I could imaging designing an API that would let you do

MutableArrayRef<int> dst(...);
dst = CopyFromDevice(src);

But, in addition to requiring some substantial magic, that wouldn't work with our .then chaining.

jlebar edited edge metadata.Aug 17 2016, 8:16 AM

I had a thought about MemcpyD2H and friends last night. Not to make you change it again, but it's kind of weird that the function name says "D2H" but the args are (host_ptr, device_ptr). So long as it's called "memcpy", I think we probably should match memcpy's ordering, but what if we just called it "copy"?

That might somewhat speak to your concern about users getting the order wrong.

jhen added a comment.Aug 17 2016, 9:55 AM

I had a thought about MemcpyD2H and friends last night. Not to make you change it again, but it's kind of weird that the function name says "D2H" but the args are (host_ptr, device_ptr). So long as it's called "memcpy", I think we probably should match memcpy's ordering, but what if we just called it "copy"?

That might somewhat speak to your concern about users getting the order wrong.

Yes, the D2H, etc names were the reason for the original (src, dst) argument ordering, so I think there is a potential for confusion with the current ordering.

For an alternative solution that really helps the user know what they are doing, how about introducing different wrapper classes for src and dst and requiring the user to wrap the argument before passing it to the copy method:

// Wrapper classes.
template <typename T> class CopySrc;
template <typename T> class CopyDst;

// Copy function decl.
template <typename T>
Error synchronousCopy(CopyDst<T> Src, CopySrc<T> Dst, size_t ElementCount);

// Helper functions "to" and "from" to construct wrapper instances.
template <typename T> CopySrc<T> from(const GlobalDeviceMemory<T> &DeviceSrc, size_t Offset = 0);
template <typename T> CopySrc<T> from(const T *HostSrc, size_t Offset = 0);
template <typename T> CopyDst<T> to(GlobalDeviceMemory<T> DeviceDst, size_t Offset = 0);
template <typename T> CopyDst<T> to(T *HostDst, size_t Offset = 0);

// User code that calls a copy.
Executor.synchronousCopy(to(DeviceMemory, DeviceElementOffset), from(HostPtr), ElementCount);

This seems to me like a pretty foolproof interface, and I like that it allows the user to specify the offset along with the pointer only if an offset is desired.

What do you think?

jhen updated this revision to Diff 68425.Aug 17 2016, 2:32 PM
jhen edited edge metadata.
  • Device memcpy uses DeviceArrayRef arg
jhen added a comment.Aug 17 2016, 2:37 PM

The to and from functions ended up not being so elegant because ADL doesn't happen for the host arguments (llvm::ArrayRef, llvm::MutableArrayRef). So, instead, I made two classes just to wrap the device arrays: DeviceArrayRef and DeviceMutableArrayRef. I think this makes a nice symmetry between host and device and makes the user be explicit about which argument is src and which is dst.

The one thing I don't like is that the DeviceArrayRef constructor takes a size_t offset as its second argument, and this might get confusing because the llvm::ArrayRef takes a size_t size as its second argument.

Once we get this synchronous memcpy interface the way we like it, I'll also fix the other functions like free and asynchronous copies.

jlebar added a comment.EditedAug 17 2016, 3:28 PM

I'm not crazy about how verbose simple operations become with the current
revision.

Just trying to play with this API:

se::GlobalDeviceMemory<int> DMem;
int *HPtr;
MutableArrayRef<int> HArrRef;

// These two operations seem like they're going to be the most common:

// Copy everything, assert sizes equal.
E.syncCopyD2H(DMem, HArrRef);

// Up to user to ensure that HPtr is big enough.  (We could leave off "length"
// and copy everything, but since we're passing a raw pointer here, seems like
// we should be explicit about the length.)
E.syncCopyD2H(DMem, HPtr, length);

// These two together suggest that we should have a third overload:

// Copy length elements, assert sizes are less than or equal to length.
E.syncCopyD2H(DMem, HArrRef, length);

These three are probably sufficient for everything other than copying starting
at a nonzero offset within DMem? I guess that's where makeMutableArrayRef
comes in. But since this is going to be an uncommon operation, I don't think
we want to uglify all of the other call sites to make this work.

What if we called our class GlobalDeviceMemorySlice<T>. Not calling it
"ArrayRef" would leave us free to modify its API somewhat:

  • We could have an implicit constructor from GlobalDeviceMemory<T>, so we don't have to explicitly call makeMutableArrayRef.
  • We wouldn't need mutable and non-mutable versions of this; we could just use const to differentiate.
  • If it makes sense, we could even make it so that the way to construct a GlobalDeviceMemorySlice from a GlobalDeviceMemory is by calling functions on GlobalDeviceMemory. This would resolve the ambiguity around offset vs. length. Here's a not well-thought-through API:

    GlobalDeviceMemory<int> DMem; E.syncCopyD2H(DMem.drop_front(5).drop_back(42).take_first(11));

I suspect that simply copying slice(), drop_front(), and drop_back() from
ArrayRef would be pretty good.

jhen updated this revision to Diff 68471.Aug 17 2016, 6:19 PM
  • Many overloads and DeviceArraySlice
jhen added a comment.Aug 17 2016, 6:20 PM

jlebar's last design suggestion seems reasonable to me. I implemented it (just for the D2H direction for now). If people like it, I will do the other directions too (H2D, D2D) and write a bunch of tests.

jlebar added inline comments.Aug 18 2016, 1:44 PM
streamexecutor/include/streamexecutor/DeviceMemory.h
102

If this must wrap specifically a GlobalDeviceMemory object, we should probably have those words in the name. Maybe "GlobalDeviceMemorySlice"? Or is that too long?

It also seems kind of weird to me to focus on "arrays" in the name, function names, and comments, since it's no more or less as much of an array as GlobalDeviceMemory, but we don't name/describe it in the same way.

107

We should assert that these are in range.

111

I think getOffset() and getCount() might be better names, but we should definitely be consistent with the names in GlobalDeviceMemory, as you've done here.

117

We should think about what exactly we want to be an error in these functions. I'm not sure that silently accepting out-of-range values makes sense.

132

Hm... Maybe ArrayRef's two-arg slice() function is more useful than take_first(). take_first(n) then just becomes slice(0, n), which is fairly clear I think? And I think the common operation will likely be "starting at offset k, take n elements", which it's not trivial to do using the functions here.

I don't think we need both the one-arg slice() and drop_front, as they're equivalent.

Sorry if I mislead you on this one. In that example, I meant just to illustrate how we'd chain calls together, not to suggest specifically the API. But I don't think I was so clear about that.

215

We could avoid having these functions entirely by having an asSlice() function, but I'm not sure that's better than this API. However, it would be nice not to reimplement these tricky functions. Could we just do e.g.

return DeviceArraySlice<ElemT>(*this).drop_front(DropCount);

?

streamexecutor/include/streamexecutor/Executor.h
156

My thought is that we'd have an implicit one-arg constructor in DeviceArraySlice<T> so we wouldn't need this overload.

DeviceArraySlice<T>::DeviceArraySlice<T>(const GlobalDeviceMemory<T>& Mem);

(In the same way, llvm::ArrayRef has an implicit one-arg constructor that accepts an std::vector<T>.)

jhen updated this revision to Diff 68721.Aug 19 2016, 12:20 PM
jhen marked 6 inline comments as done.
  • Respond to jlebar's comments
streamexecutor/include/streamexecutor/DeviceMemory.h
111

I renamed them to getOffset() and getCount(). I also renamed the GlobalDeviceMemory<T> method to getCount() for consistency.

117

ArrayRef uses assert, so I did that here too.

215

asSlice() sounds good to me. I put that in and removed all of these.

streamexecutor/include/streamexecutor/Executor.h
156

I tried this and unfortunately it doesn't seem to work perfectly. I think the template type inference is getting in the way of selecting the implicit constructor. For example, I have to call Executor.syncCopyD2H<int>(DeviceA, llvm::MutableArray<int>(Host)) in my test code, where I explicitly specify the function template argument for the syncCopyD2H function. Otherwise, I get a compile error that says it ignored the candidate template because it could not match GlobalDeviceMemorySlice agains GlobalDeviceMemory.

I removed the syncCopyD2H overloads for GlobalDeviceMemory for now, so that users have to explicitly supply the template argument or call asSlice(). Should I add back the overloads in order to get around this problem?

I think this is probably good, modulo a few relatively minor things.

streamexecutor/include/streamexecutor/DeviceMemory.h
117

Nit, maybe we can reword this so it doesn't just repeat the function's name.

120

This is kind of confusing to me, I think because "element offset from base memory" isn't so idiomatic. Maybe we can reword this?

148

Why do we do std::min here instead of asserting?

streamexecutor/include/streamexecutor/Executor.h
100

Ohhh. *This* is why they're called getElementOffset and getElementCount.

I guess it's not so bad if we don't have any data structure that exposes the byte offset / byte count. But even if not I find this a kind of compelling reason for those names.

What do you think?

122

Maybe we can just create an arrayref and call one of the other overloads?

139

typo

141

I might actually suggest making this function the base overload; it seems like the most strongly-typed and general?

150

Should I add back the overloads in order to get around this problem?

Argh. That's the common case, so I guess, yes. I guess you only need it for the two-arg overloads and not the three-arg overload. :-/

jhen updated this revision to Diff 68737.Aug 19 2016, 3:01 PM
jhen marked 8 inline comments as done.
  • Respond to jlebar's comments 2
streamexecutor/include/streamexecutor/DeviceMemory.h
148

I missed that one (and it was buggy as it was anyway). Thanks for catching that!

streamexecutor/include/streamexecutor/Executor.h
100

Yes, I definitely prefer the getElementOffset and getElementCount names. Sorry about the confusion. I've changed all the names back now to include "Element".

150

I put the overloads back. I couldn't even get the one that took T* as its second parameter to work without the overload, so I had to put them all back. I even tried to define a conversion operator for GlobalDeviceMemory<T> to GlobalDeviceMemorySlice<T>, but that didn't work either.

jlebar added inline comments.Aug 22 2016, 11:41 AM
streamexecutor/include/streamexecutor/Executor.h
177

Well, this is a lot of overloads, but I guess it could be worse. :)

My only thought is that it would be nice not to duplicate so much of the comments. But other than that, I think you can go forth and do this for real.

jhen updated this revision to Diff 68925.Aug 22 2016, 4:07 PM
  • Add H2D and D2D functions and tests
jhen updated this revision to Diff 68926.Aug 22 2016, 4:13 PM
  • Memory "free" no longer takes wrapper pointer
jhen added a comment.Aug 22 2016, 4:14 PM

Okay, I've got all the synchronousCopy functions in for all directions (D2H, H2D, D2D) and all the tests are in, so I think this CL is ready for final review.

streamexecutor/include/streamexecutor/Executor.h
177

I got rid of most of the comment duplication by referencing earlier functions in the comments for later functions.

jlebar added inline comments.Aug 22 2016, 10:03 PM
streamexecutor/include/streamexecutor/Executor.h
93

Nit, can we make these two inequalities have the same direction?

streamexecutor/include/streamexecutor/PlatformInterfaces.h
160

Should we specify the device synchronicity semantics for these calls (and also in Executor, I guess)? The non-"synchronous" functions clearly are synchronous within a stream. But for these I have no idea what they're supposed to do.

streamexecutor/include/streamexecutor/Stream.h
111

I guess I expected that we'd match the Executor's interface, with the overloads and such. But maybe you plan to do that in a separate patch?

177

comma placement

jhen updated this revision to Diff 69021.Aug 23 2016, 11:47 AM
jhen marked 4 inline comments as done.
  • Update Stream copy interface
streamexecutor/include/streamexecutor/Stream.h
111

I was thinking of doing it in a separate patch, but changed my mind. Now these interfaces should be updated too.

Just some mild pedantry from me.

streamexecutor/include/streamexecutor/Executor.h
199

H2D

284

D2D

streamexecutor/include/streamexecutor/Stream.h
178

H2D

215

D2D

jhen updated this revision to Diff 69030.Aug 23 2016, 1:35 PM
  • Fix method names in error strings
jhen marked 4 inline comments as done.Aug 23 2016, 1:36 PM

Just some mild pedantry from me.

:) Thanks for your help

jlebar accepted this revision.Aug 23 2016, 10:44 PM
jlebar edited edge metadata.

I think this is officially unwieldy. Which I accept the blame for. :)

It looks good to me, and I think if we want to change things, it will probably be a lot easier to do so once this is in.

streamexecutor/include/streamexecutor/PlatformInterfaces.h
161

Ah, but there can be no device work enqueued after the call starts but before it completes because this is host-synchronous! (Unless you can touch Streams on multiple host threads?)

So maybe the thing to say is simply that this does not block any ongoing device calls?

streamexecutor/include/streamexecutor/Stream.h
111

may

284

This review is getting so long I'm tempted to say it will be easier to check this in as-is and edit things like this later. But I think it would be nice to consider whether there's a way to reduce the comment copypasta -- there's a lot of it, and I think it obscures the API. I think in an earlier patch you did some doxygen "function group" magic?

Again, maybe better to save this for later.

This revision is now accepted and ready to land.Aug 23 2016, 10:44 PM
jhen updated this revision to Diff 69132.Aug 24 2016, 9:54 AM
jhen marked 2 inline comments as done.
jhen edited edge metadata.
  • Clarify device synchronicity claims
jhen added a comment.Aug 24 2016, 9:55 AM

I think this is officially unwieldy. Which I accept the blame for. :)

Yeah, it might be a bit of a maintenance problem, but I do think it will be much nicer for the user to use, so I think it's been an overall improvement.

streamexecutor/include/streamexecutor/Stream.h
284

I definitely think I can cut down the comments a lot and make the API more understandable in the process. As you suggested, I'll check this in now and then do that in the next CL.

This revision was automatically updated to reflect the committed changes.

Not sure what the protocol is for finding issues in patches that have already been committed - should I be posting to parallel_libs-dev instead?

This issue only showed up when I tried to actually write some code that uses all this stuff.

parallel-libs/trunk/streamexecutor/include/streamexecutor/Executor.h
44 ↗(On Diff #69133)

There's a mismatch between the return types of these two allocateDeviceMemory methods. The PlatformExecutor version returns a GlobalDeviceMemoryBase, whereas this one tries to return a GlobalDeviceMemory<T>.

This causes compilation failure if this Executor::allocateDeviceMemory<T>() method is actually used. Because this method is templated, it needs to actually be called somewhere for the issue to show up, and it doesn't appear to be covered in the unit tests.