Add proper handling for shared memory arguments in the CUDA platform. Also add
unit tests for CUDA.
streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp

Line 167: Huh, does clang-format actually do this? If so maybe that's worth filing a bug -- that is a strange choice.

Line 189: This whole dance is going to destroy the beautiful efficiency gains you were after, right? It sort of seems like the only way to make this work would be to have a different args-packing class for each platform. But I am not sure how to do that without introducing virtual function calls, which would *also* destroy your beautiful efficiency gains.

At least let's use llvm::SmallVector so we don't have to malloc anything. And maybe add a comment that we may need to come back and improve this?
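The repacking under discussion might look roughly like the following standard-C++ sketch. The `ArgInfo` layout and `repackArguments` name are invented for illustration, not StreamExecutor's actual API; in the real code `llvm::SmallVector` would replace the `std::vector` here so small argument lists stay off the heap:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: repack argument addresses for a CUDA launch,
// folding shared-memory slots into a byte count and keeping only the
// non-shared argument addresses (the layout cuLaunchKernel expects via
// its kernelParams and sharedMemBytes parameters).
struct ArgInfo {
  const void *Address; // nullptr for shared-memory arguments
  size_t SharedBytes;  // nonzero only for shared-memory arguments
};

// Returns the total dynamic shared memory size and fills OutAddresses
// with the addresses of the non-shared arguments, in order.
size_t repackArguments(const std::vector<ArgInfo> &Args,
                       std::vector<const void *> &OutAddresses) {
  size_t SharedMemBytes = 0;
  OutAddresses.clear();
  OutAddresses.reserve(Args.size()); // llvm::SmallVector would keep this
                                     // in inline stack storage instead
  for (const ArgInfo &A : Args) {
    if (A.SharedBytes != 0)
      SharedMemBytes += A.SharedBytes; // summed, not passed by address
    else
      OutAddresses.push_back(A.Address);
  }
  return SharedMemBytes;
}
```

The copy only happens on the launch path that actually has shared-memory arguments; the common case can hand the packed array to the driver untouched.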
streamexecutor/unittests/CoreTests/CUDATest.cpp

Line 142: Doesn't match the namespace name above. (It's going to be technically UB for something other than the compiler to put anything into __foo.)
- Comment on dyn-shared-memory arg efficiency
streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp

Line 167: Yes, clang-format actually does this. I'll create a simple reproducer and file a bug.

Line 189: While optimizing this internally, our approach was that the dynamic shared memory case was not common, and we would accept being inefficient in that case, as long as we were efficient in the case of no dynamic shared memory. So the idea is that, other than the check for ArgumentArray.getArgumentCount() == 0, the no-dynamic-shared-memory case should take advantage of the efficiency gains in the quirky packed argument array design. Definitely let me know if I've done something that has broken that case.

For the general dynamic shared memory case, I couldn't think of a good way to make it efficient for CUDA and OpenCL at the same time. As you mentioned, it might require virtual function calls, which would hurt both cases. But again, I think we're OK to be less efficient in this case.

Actually, now that I think of it, we could be much more efficient in the most important specific case of dynamic shared memory: the case where there is only one dynamic shared memory argument, and it is the first one. That would work for all CUDA cases, and OpenCL users could take advantage of it as well by choosing to write their kernels in that way.

For now, I've switched to using llvm::SmallVector (thanks for that idea!) and wrote a comment describing how we might improve this in the future.
streamexecutor/unittests/CoreTests/CUDATest.cpp

Line 142: Thanks for catching that. I changed it from the example code to avoid UB, but I missed the comment.
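For context on the UB remark: any identifier beginning with a double underscore is reserved to the implementation in C++, so a test declaring its own entities inside a namespace like __foo is technically undefined behavior. A minimal sketch of the safe renaming (all names here are invented for illustration, not the test's real ones):

```cpp
// Reserved: user code must not declare entities in a namespace whose
// name begins with a double underscore; such names belong to the
// implementation, and using them is technically undefined behavior.
// namespace __compilergen { constexpr unsigned SharedMemBytes = 1024; }

// Safe: a user-chosen name with no reserved prefix.
namespace compilergen {
// Hypothetical stand-in for whatever the test declares.
constexpr unsigned SharedMemBytes = 1024;
} // namespace compilergen
```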
streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp

Line 189: I feel like these things usually are unimportant until they're not. Which is to say, I'm totally onboard with doing a slow thing for a case we think is uncommon, so long as we're not painting ourselves into a corner. I like your plan of (if the case arises) telling people who want their code to be fast to put a single shared memory parameter at the front of their param pack and then optimizing for that case.
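The special case endorsed above could be dispatched with a cheap check before any repacking happens. A hypothetical sketch, assuming the packed argument array can report its shared-memory layout in O(1) (the names `choosePath`, `SharedArgCount`, and `FirstArgIsShared` are invented for illustration):

```cpp
#include <cstddef>

enum class LaunchPath {
  Fast,             // no dynamic shared memory: use the packed array as-is
  OneLeadingShared, // one shared arg, first slot: skip slot 0, pass N-1 args
  SlowRepack        // general case: copy non-shared args into a SmallVector
};

// Pick the cheapest launch strategy the argument layout allows.
LaunchPath choosePath(size_t SharedArgCount, bool FirstArgIsShared) {
  if (SharedArgCount == 0)
    return LaunchPath::Fast;
  if (SharedArgCount == 1 && FirstArgIsShared)
    return LaunchPath::OneLeadingShared;
  return LaunchPath::SlowRepack;
}
```

Only the last branch pays for a copy, so kernels written with a single leading shared-memory parameter never leave the fast paths.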