We were packing global device memory handles in
PackedKernelArgumentArray, but as I was implementing the CUDA
platform, I realized that CUDA wants the address of the handle, not the
handle itself. So this patch switches to packing the address of the
handle.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
streamexecutor/include/streamexecutor/DeviceMemory.h | ||
---|---|---|
170 ↗ | (On Diff #71256) | Same question as last patch: What happens if this guy is moved? Specifically, would it be a problem if it were moved after calling thenLaunch but before the driver actually launches the kernel? |
- Warning about kernel launch args
streamexecutor/include/streamexecutor/DeviceMemory.h | ||
---|---|---|
170 ↗ | (On Diff #71256) | Yes, it would be a problem if the argument was moved at the wrong time. This was a choice we made internally to reduce the kernel launch overhead. Apparently it could make up 5% of some applications' run-time. In response to your question, I added a warning message to the Stream::thenLaunch method. This message tells users not to touch kernel launch arguments from other threads. |
streamexecutor/include/streamexecutor/DeviceMemory.h | ||
---|---|---|
170 ↗ | (On Diff #71268) | Clearly in the multithreaded case you can't modify a GlobalDeviceMemoryBase concurrently with a call to thenLaunch(). That's true of any parameters to any function call, so I am not sure that's even worth warning about. What I was worried about was a single-threaded case, something like this: GlobalDeviceMemory<int> other_mem; { GlobalDeviceMemory<int> mem; Stream->thenLaunch(foo_kernel, mem); other_mem = std::move(mem); } Stream->block(); Is this safe? That is, do we use &mem.Handle only within thenLaunch? (We don't actually have to have a separate scope and so on for us to hit this same problem.) (I have to admit, if any of this is 5% of some applications' runtime, it seems like we could do a lot better even than what we have here. I'm not sure how, but hot code is hot...) |
- Remove warning
streamexecutor/include/streamexecutor/DeviceMemory.h | ||
---|---|---|
170 ↗ | (On Diff #71268) | Oh, that's right. I think we spoke about this before. There should be no problem with your example, the platform should be responsible for copying the arguments (using memcpy) before returning from a launch call. This will match the guarantee currently provided by the CUDA driver library. I removed the unnecessary warning. |