Improve the error-prone interface that allows users to pass host
pointers that haven't been registered to asynchronous copy methods. In
CUDA, this is an extremely easy error to make, and instead of failing at
runtime, it succeeds and gives the right answers by turning the async
copy into a sync copy. So, you silently get a huge performance
degradation if you misuse the old interface. This new interface should
prevent that.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
After an offline chat with jlebar, we decided to get rid of the allocateHostMemory functions in the device interface. We thought it was confusing that the first patch in this review made a RegisteredHostMemory object that only sometimes owned its underlying pointer.
By removing allocateHostMemory the object never has to own its underlying pointer. Possible concerns with removing this function:
- The interface may be slightly more annoying for users because they have to allocate their own memory.
- Performance may be worse if users have to make two calls instead of just one to get registered host memory.
- Memory that users allocate themselves may not have optimal alignment for the registration process.
If any of these problems are later shown to be a bigger issue, we can re-introduce the host allocation functions with a different interface than the proposal from the first patch.
There are a couple of options that might work for the new interface:
- Have allocateHostMemory return a pair<RegisteredHostMemory, std::unique_ptr> so that the unique_ptr is responsible for freeing the memory and the RegisteredHostMemory is responsible for un-registering it. (Maybe a problem if the unique_ptr goes out of scope first.)
- Create a new type OwnedRegisteredHostMemory that always owns the memory and the registration of the memory. Have methods to create a slice out of this new type, and support this type as an argument to all the thenCopy Stream methods (std::enable_if should make that pretty easy).
For the record, in addition to using std::enable_if to get the Stream thenCopy functions to accept either mutable or immutable slice arguments, I also tried providing an implicit ctor to make immutable slices from mutable slices and designing the functions to take only immutable slices. Result: as usual, template type inference does not consider implicit type conversions, so the user would have to specify the ElemT type parameter in order to get that to work. That's the reason I stuck with std::enable_if.
I talked to Chandler about this slicing thing...
His suggestion was to give up on deducing T in e.g. thenCopyH2D. Instead, we'd write function signatures like the following:
template <typename HostTy, typename DeviceTy> thenCopyH2D(HostTy &&Src, DeviceTy &&Dst);
We have constraints on HostTy and DeviceTy that we still need to enforce. Maybe our constraint is that they both expose a typedef named ElementTy and they are convertible to RegisteredHostMemorySlice<ElementTy> and GlobalMemorySlice<ElementTy> respectively. In this case we'd write something like
static_assert(std::is_same<HostTy::ElementTy, DeviceTy::ElementTy>::value, "..."); RegisteredHostMemorySlice<HostTy::ElemTy> HostSlice(Src); GlobalMemorySlice<DeviceTy::ElemTy> DeviceSlice(Dst);
If this compiles, we're good to go. If it doesn't compile, that means you passed bad types to this function.
One nice thing about this is that it would let us unify the mutable / immutable slices into a single type (I think):
RegisteredHostMemorySlice<const HostTy::ElemTy> ImmutableSlice; RegisteredHostMemorySlice<HostTy::ElemTy> MutableSlice;
This would solve our combinatorial explosion problem in Stream.h, and wouldn't require complex enable_if shenanigans. (Although it involves its own shenanigans, I suppose.)
What do you think?
| streamexecutor/include/streamexecutor/HostMemory.h | ||
|---|---|---|
| 161 ↗ | (On Diff #70766) | It doesn't matter much, but this base class is kind of annoying, partially because we have the added overhead of this division (shift) every time we get the size, but more because it's another class kicking around. It seems like the main reason we need the base class is to make the out-of-line destructor work without including device.h here. But we can do something similar without all the fuss of a base class. We could define an out-of-line function in a "detail" namespace, or we could have a base class with just one function, defined out-of-line, that we call from our destructor. | 
| 32 ↗ | (On Diff #70720) | Can we define in what way the memory is "registered" here? | 
I really like the idea of using completely templated Src and Dst arguments. I've implemented that in this patch, and I think everyone can agree that it really cleaned up the code.
I'm not 100% sure that combining the mutable and immutable host slices in one class is the right choice, but I've also made that change so we can take a look and see what folks think.
| streamexecutor/include/streamexecutor/HostMemory.h | ||
|---|---|---|
| 33 ↗ | (On Diff #70766) | I added comments to each of these classes to say the memory is registered in the sense of Device::registerHostMemory. | 
| 161 ↗ | (On Diff #70766) | I made a function called internal::destroyRegisteredHostMemoryInternals and got rid of the base class. | 
| streamexecutor/include/streamexecutor/HostMemory.h | ||
|---|---|---|
| 45 ↗ | (On Diff #70889) | I'm not sure that a const_cast is the right thing to do. Instead, could we have two constructors, one which takes a const RHM<ElemT> and one which takes a non-const RHM<ElemT>? Also, how does this work when ElemT = "const int" and we get an RHM<ElemT>? I would expect you need some const-decay here. | 
| 45 ↗ | (On Diff #70889) | If we go with this approach, we should make sure we can convert from RHMSlice<const T> to RHMSlice<T>, which I don't think works atm. | 
| 48 ↗ | (On Diff #70889) | Hmmm... In a vacuum, I think we'd want just one overload, ElemT *getPointer() const; This matches what C++17's std::slice (nee array_view) does. And it sort of makes sense: We already have constness in the template type to determine whether or not we can modify the underlying ElemTs. (By the way, if we stick with this approach, we should explain in a comment about putting const T in the template param.) But I see that we have the two overloads on RHM, and there they make more sense. So...I am not sure what we should do here. One option is we could also make RHM unconditionally give out non-const pointers (and mutable slices). That would make it behave like unique_ptr -- if I have a const unique_ptr<int[]>, the get() function still gives me a non-const pointer back. I guess one argument against merging the slice classes is that it seems like we're having difficulty keeping them straight within Stream.h. So I think I'm OK splitting them back out for now. It shouldn't complicate the Stream interface any. | 
| 99 ↗ | (On Diff #70889) | Nit, clang-format is being conservative and not re-merging strings, but this should probably be reflowed. | 
| streamexecutor/include/streamexecutor/Stream.h | ||
| 172 ↗ | (On Diff #70889) | Is it worth merging this with the overload that explicitly takes slices, or do you think that would be too confusing? | 
| 178 ↗ | (On Diff #70889) | There should be const in some of these? | 
- Remove allocateHostMemory
- Fix broken comments found by jprice
- implement chanderc's suggestion
- Combine more thenCopy functions as templates
| streamexecutor/include/streamexecutor/HostMemory.h | ||
|---|---|---|
| 45 ↗ | (On Diff #70889) | Switching back to mutable array. | 
| 48 ↗ | (On Diff #70889) | Switching back to mutable array. | 
| 99 ↗ | (On Diff #70889) | Actually, clang-format really seems to like this. It won't seem to break after the && even if I manually combine the string literals into one long one. So I'll just let clang-format have its way here. | 
| streamexecutor/include/streamexecutor/Stream.h | ||
| 172 ↗ | (On Diff #70889) | Yes, I did it now. I don't know why I didn't before. | 
| 178 ↗ | (On Diff #70889) | Using mutable array now. | 
Ship it.
| streamexecutor/include/streamexecutor/HostMemory.h | ||
|---|---|---|
| 94 ↗ | (On Diff #70934) | While I'm thinking about it, we should add LLVM_ATTRIBUTE_UNUSED_RESULT to these (and other similar functions) in a separate patch. |