Add Executor methods that block the host until completion. Since these
methods are host-synchronous, they don't require Stream arguments.
Details
Diff Detail
Event Timeline
General question: if we have variants of these memcpy methods that take ElementCount parameters to allow for partial copies to/from the device allocations, should we also have variants with an Offset parameter as well to allow for partial copies that don't start at the origin?
streamexecutor/include/streamexecutor/Executor.h | ||
---|---|---|
71 | If this is going to be backing onto something like cuMemHostRegister, don't we need the size of the allocation as well? | |
144 | memory | |
streamexecutor/include/streamexecutor/PlatformInterfaces.h | ||
145 | registerHostMemory |
Yes, I think allowing an offset argument is a good idea. I made that change to the Executor and Stream memcpy methods
streamexecutor/include/streamexecutor/Executor.h | ||
---|---|---|
48 | The internal SE code uses the convention that a DeviceMemory<T>* is passed wherever a T* would be passed for host data and a const DeviceMemory<T>& would be passed wherever a const T* would be used for host data. Do you think it would be better to pass DeviceMemory<T>& for T* and const DeviceMemory<T>& for const T*, or something like that? | |
71 | You're totally right. Thanks for catching that. | |
110 | To me that feels too restrictive. Without that check do you think it would be better to rename the method? Maybe synchronousMemcpyWholeSrc or something. I like it how it is now, but I don't want the interface to be confusing. |
streamexecutor/include/streamexecutor/Executor.h | ||
---|---|---|
48 |
Yes, I think so, since a DeviceMemory<T> is essentially a T*. Or put another way, when you allocate host memory, you get a T*, and when you allocate device memory, you get a DeviceMemory<T>. | |
109 | Hm, maybe we should match the arg order of memcpy, given that we're using that name. | |
110 |
I guess it's a question of what we're optimizing for. It seems to me that much of the time, the dst will be of the same size as the src. In such cases, you'd want to know if you accidentally mismatched the sizes, because that's going to leave part of your dst uninitialized. If you really want to copy into only part of dst, you can make that explicit by creating a new array slice: vector<int> dst(100); synchronousMemcpyD2H(src, makeMutableArrayRef(dst.data(), DeviceSrc.getElementCount())); (makeMutableArrayRef doesn't exist at the moment, but that's easily fixed...) Or alternatively, you could just call the other overload: vector<int> dst(100); synchronousMemcpyD2H(src, dst, DeviceSrc.getElementCount()); Both cases are explicit that we're doing something other than simply copying everything from dst to src. |
streamexecutor/include/streamexecutor/Executor.h | ||
---|---|---|
102 | Should be passing SrcElementOffset * sizeof(T) through to the platform executor as well now. | |
135 | Missing offset argument as above. | |
166 | Missing offset argument as above. | |
streamexecutor/include/streamexecutor/Stream.h | ||
123–124 | Missing offset argument as above. | |
157–158 | Missing offset argument as above. | |
189–191 | Missing offset argument as above. |
streamexecutor/include/streamexecutor/Executor.h | ||
---|---|---|
102 | Good catch! I promise to write a bunch of unit tests to catch this kind of thing once we have the interface decided. |
- Reverse src/dst memcpy arg order
streamexecutor/include/streamexecutor/Executor.h | ||
---|---|---|
48 | The thing I would really miss from the current interface is the way the user has to specify a DeviceMemory as a copy destination by taking its address before passing it as an argument to the copy function. I like how this makes it hard for the user to accidentally pass the source as the destination and vice versa. Is there a good way to keep this nice feature and match the model of DeviceMemory as a pointer more closely? | |
110 | Okay, that sounds like a good reason to me. |
streamexecutor/include/streamexecutor/Executor.h | ||
---|---|---|
48 |
const-ness will get us some of the way there. I could imaging designing an API that would let you do MutableArrayRef<int> dst(...); dst = CopyFromDevice(src); But, in addition to requiring some substantial magic, that wouldn't work with our .then chaining. |
I had a thought about MemcpyD2H and friends last night. Not to make you change it again, but it's kind of weird that the function name says "D2H" but the args are (host_ptr, device_ptr). So long as it's called "memcpy", I think we probably should match memcpy's ordering, but what if we just called it "copy"?
That might somewhat speak to your concern about users getting the order wrong.
Yes, the D2H, etc names were the reason for the original (src, dst) argument ordering, so I think there is a potential for confusion with the current ordering.
For an alternative solution that really helps the user know what they are doing, how about introducing different wrapper classes for src and dst and requiring the user to wrap the argument before passing it to the copy method:
// Wrapper classes. template <typename T> class CopySrc; template <typename T> class CopyDst; // Copy function decl. template <typename T> Error synchronousCopy(CopyDst<T> Src, CopySrc<T> Dst, size_t ElementCount); // Helper functions "to" and "from" to construct wrapper instances. template <typename T> CopySrc<T> from(const GlobalDeviceMemory<T> &DeviceSrc, size_t Offset = 0); template <typename T> CopySrc<T> from(const T *HostSrc, size_t Offset = 0); template <typename T> CopyDst<T> to(GlobalDeviceMemory<T> DeviceDst, size_t Offset = 0); template <typename T> CopyDst<T> to(T *HostDst, size_t Offset = 0); // User code that calls a copy. Executor.synchronousCopy(to(DeviceMemory, DeviceElementOffset), from(HostPtr), ElementCount);
This seems to me like a pretty foolproof interface, and I like that it allows the user to specify the offset along with the pointer only if an offset is desired.
What do you think?
The to and from functions ended up not being so elegant because ADL doesn't happen for the host arguments (llvm::ArrayRef, llvm::MutableArrayRef). So, instead, I made two classes just to wrap the device arrays: DeviceArrayRef and DeviceMutableArrayRef. I think this makes a nice symmetry between host and device and makes the user be explicit about which argument is src and which is dst.
The one thing I don't like is that the DeviceArrayRef constructor takes a size_t offset as its second argument, and this might get confusing because the llvm::ArrayRef takes a size_t size as its second argument.
Once we get this synchronous memcpy interface the way we like it, I'll also fix the other functions like free and asynchronous copies.
I'm not crazy about how verbose simple operations become with the current
revision.
Just trying to play with this API:
se::GlobalDeviceMemory<int> DMem; int *HPtr; MutableArrayRef<int> HArrRef; // These two operations seem like they're going to be the most common: // Copy everything, assert sizes equal. E.syncCopyD2H(DMem, HArrRef); // Up to user to ensure that HPtr is big enough. (We could leave off "length" // and copy everything, but since we're passing a raw pointer here, seems like // we should be explicit about the length.) E.syncCopyD2H(DMem, HPtr, length); // These two together suggest that we should have a third overload: // Copy length elements, assert sizes are less than or equal to length. E.syncCopyD2H(DMem, HArrRef, length);
These three are probably sufficient for everything other than copying starting
at a nonzero offset within DMem? I guess that's where makeMutableArrayRef
comes in. But since this is going to be an uncommon operation, I don't think
we want to uglify all of the other call sites to make this work.
What if we called our class GlobalDeviceMemorySlice<T>. Not calling it
"ArrayRef" would leave us free to modify its API somewhat:
- We could have an implicit constructor from GlobalDeviceMemory<T>, so we don't have to explicitly call makeMutableArrayRef.
- We wouldn't need mutable and non-mutable versions of this; we could just use const to differentiate.
- If it makes sense, we could even make it so that the way to construct a GlobalDeviceMemorySlice from a GlobalDeviceMemory is by calling functions on GlobalDeviceMemory. This would resolve the ambiguity around offset vs. length. Here's a not well-thought-through API:
GlobalDeviceMemory<int> DMem; E.syncCopyD2H(DMem.drop_front(5).drop_back(42).take_first(11));
I suspect that simply copying slice(), drop_front(), and drop_back() from
ArrayRef would be pretty good.
jlebar's last design suggestion seems reasonable to me. I implemented it (just for the D2H direction for now). If people like it, I will do the other directions too (H2D, D2D) and write a bunch of tests.
streamexecutor/include/streamexecutor/DeviceMemory.h | ||
---|---|---|
101 ↗ | (On Diff #68471) | If this must wrap specifically a GlobalDeviceMemory object, we should probably have those words in the name. Maybe "GlobalDeviceMemorySlice"? Or is that too long? It also seems kind of weird to me to focus on "arrays" in the name, function names, and comments, since it's no more or less as much of an array as GlobalDeviceMemory, but we don't name/describe it in the same way. |
106 ↗ | (On Diff #68471) | We should assert that these are in range. |
110 ↗ | (On Diff #68471) | I think getOffset() and getCount() might be better names, but we should definitely be consistent with the names in GlobalDeviceMemory, as you've done here. |
116 ↗ | (On Diff #68471) | We should think about what exactly we want to be an error in these functions. I'm not sure that silently accepting out-of-range values makes sense. |
131 ↗ | (On Diff #68471) | Hm... Maybe ArrayRef's two-arg slice() function is more useful than take_first(). take_first(n) then just becomes slice(0, n), which is fairly clear I think? And I think the common operation will likely be "starting at offset k, take n elements", which it's not trivial to do using the functions here. I don't think we need both the one-arg slice() and drop_front, as they're equivalent. Sorry if I mislead you on this one. In that example, I meant just to illustrate how we'd chain calls together, not to suggest specifically the API. But I don't think I was so clear about that. |
192 ↗ | (On Diff #68471) | We could avoid having these functions entirely by having an asSlice() function, but I'm not sure that's better than this API. However, it would be nice not to reimplement these tricky functions. Could we just do e.g. return DeviceArraySlice<ElemT>(*this).drop_front(DropCount); ? |
streamexecutor/include/streamexecutor/Executor.h | ||
156 | My thought is that we'd have an implicit one-arg constructor in DeviceArraySlice<T> so we wouldn't need this overload. DeviceArraySlice<T>::DeviceArraySlice<T>(const GlobalDeviceMemory<T>& Mem); (In the same way, llvm::ArrayRef has an implicit one-arg constructor that accepts an std::vector<T>.) |
- Respond to jlebar's comments
streamexecutor/include/streamexecutor/DeviceMemory.h | ||
---|---|---|
110 ↗ | (On Diff #68471) | I renamed them to getOffset() and getCount(). I also renamed the GlobalDeviceMemory<T> method to getCount() for consistency. |
116 ↗ | (On Diff #68471) | ArrayRef uses assert, so I did that here too. |
192 ↗ | (On Diff #68471) | asSlice() sounds good to me. I put that in and removed all of these. |
streamexecutor/include/streamexecutor/Executor.h | ||
156 | I tried this and unfortunately it doesn't seem to work perfectly. I think the template type inference is getting in the way of selecting the implicit constructor. For example, I have to call Executor.syncCopyD2H<int>(DeviceA, llvm::MutableArray<int>(Host)) in my test code, where I explicitly specify the function template argument for the syncCopyD2H function. Otherwise, I get a compile error that says it ignored the candidate template because it could not match GlobalDeviceMemorySlice agains GlobalDeviceMemory. I removed the syncCopyD2H overloads for GlobalDeviceMemory for now, so that users have to explicitly supply the template argument or call asSlice(). Should I add back the overloads in order to get around this problem? |
I think this is probably good, modulo a few relatively minor things.
streamexecutor/include/streamexecutor/DeviceMemory.h | ||
---|---|---|
117 ↗ | (On Diff #68721) | Nit, maybe we can reword this so it doesn't just repeat the function's name. |
120 ↗ | (On Diff #68721) | This is kind of confusing to me, I think because "element offset from base memory" isn't so idiomatic. Maybe we can reword this? |
148 ↗ | (On Diff #68721) | Why do we do std::min here instead of asserting? |
streamexecutor/include/streamexecutor/Executor.h | ||
100 | Ohhh. *This* is why they're called getElementOffset and getElementCount. I guess it's not so bad if we don't have any data structure that exposes the byte offset / byte count. But even if not I find this a kind of compelling reason for those names. What do you think? | |
122 | Maybe we can just create an arrayref and call one of the other overloads? | |
139 | typo | |
141 | I might actually suggest making this function the base overload; it seems like the most strongly-typed and general? | |
187 |
Argh. That's the common case, so I guess, yes. I guess you only need it for the two-arg overloads and not the three-arg overload. :-/ |
- Respond to jlebar's comments 2
streamexecutor/include/streamexecutor/DeviceMemory.h | ||
---|---|---|
148 ↗ | (On Diff #68721) | I missed that one (and it was buggy as it was anyway). Thanks for catching that! |
streamexecutor/include/streamexecutor/Executor.h | ||
100 | Yes, I definitely prefer the getElementOffset and getElementCount names. Sorry about the confusion. I've changed all the names back now to include "Element". | |
187 | I put the overloads back. I couldn't even get the one that took T* as its second parameter to work without the overload, so I had to put them all back. I even tried to define a conversion operator for GlobalDeviceMemory<T> to GlobalDeviceMemorySlice<T>, but that didn't work either. |
streamexecutor/include/streamexecutor/Executor.h | ||
---|---|---|
177 | Well, this is a lot of overloads, but I guess it could be worse. :) My only thought is that it would be nice not to duplicate so much of the comments. But other than that, I think you can go forth and do this for real. |
Okay, I've got all the synchronousCopy functions in for all directions (D2H, H2D, D2D) and all the tests are in, so I think this CL is ready for final review.
streamexecutor/include/streamexecutor/Executor.h | ||
---|---|---|
177 | I got rid of most of the comment duplication by referencing earlier functions in the comments for later functions. |
streamexecutor/include/streamexecutor/Executor.h | ||
---|---|---|
93 | Nit, can we make these two inequalities have the same direction? | |
streamexecutor/include/streamexecutor/PlatformInterfaces.h | ||
159 | Should we specify the device synchronicity semantics for these calls (and also in Executor, I guess)? The non-"synchronous" functions clearly are synchronous within a stream. But for these I have no idea what they're supposed to do. | |
streamexecutor/include/streamexecutor/Stream.h | ||
112 | I guess I expected that we'd match the Executor's interface, with the overloads and such. But maybe you plan to do that in a separate patch? | |
172 | comma placement |
- Update Stream copy interface
streamexecutor/include/streamexecutor/Stream.h | ||
---|---|---|
112 | I was thinking of doing it in a separate patch, but changed my mind. Now these interfaces should be updated too. |
I think this is officially unwieldy. Which I accept the blame for. :)
It looks good to me, and I think if we want to change things, it will probably be a lot easier to do so once this is in.
streamexecutor/include/streamexecutor/PlatformInterfaces.h | ||
---|---|---|
160 | Ah, but there can be no device work enqueued after the call starts but before it completes because this is host-synchronous! (Unless you can touch Streams on multiple host threads?) So maybe the thing to say is simply that this does not block any ongoing device calls? | |
streamexecutor/include/streamexecutor/Stream.h | ||
111 | may | |
263 | This review is getting so long I'm tempted to say it will be easier to check this in as-is and edit things like this later. But I think it would be nice to consider whether there's a way to reduce the comment copypasta -- there's a lot of it, and I think it obscures the API. I think in an earlier patch you did some doxygen "function group" magic? Again, maybe better to save this for later. |
Yeah, it might be a bit of a maintenance problem, but I do think it will be much nicer for the user to use, so I think it's been an overall improvement.
streamexecutor/include/streamexecutor/Stream.h | ||
---|---|---|
263 | I definitely think I can cut down the comments a lot and make the API more understandable in the process. As you suggested, I'll check this in now and then do that in the next CL. |
Not sure what the protocol is for finding issues in patches that have already been committed - should I be posting to parallel_libs-dev instead?
This issue only showed up when I tried to actually write some code that uses all this stuff.
parallel-libs/trunk/streamexecutor/include/streamexecutor/Executor.h | ||
---|---|---|
44 ↗ | (On Diff #69133) | There's a mismatch between the return types of these two allocateDeviceMemory methods. The PlatformExecutor version returns a GlobalDeviceMemoryBase, whereas this one tries to return a GlobalDeviceMemory<T>. This causes compilation failure if this Executor::allocateDeviceMemory<T>() method is actually used. Because this method is templated, it needs to actually be called somewhere for the issue to show up, and it doesn't appear to be covered in the unit tests. |
Should we take this by reference?