Page MenuHomePhabricator

[SE] Support CUDA dynamic shared memory
ClosedPublic

Authored by jhen on Sep 14 2016, 5:28 PM.

Details

Summary

Add proper handling for shared memory arguments in the CUDA platform. Also add
in unit tests for CUDA.

Diff Detail

Repository
rL LLVM

Event Timeline

jhen updated this revision to Diff 71468.Sep 14 2016, 5:28 PM
jhen retitled this revision from to [SE] Support CUDA dynamic shared memory.
jhen updated this object.
jhen added a reviewer: jlebar.
jhen added subscribers: parallel_libs-commits, jprice.
jlebar accepted this revision.Sep 15 2016, 8:54 AM
jlebar edited edge metadata.
jlebar added inline comments.
streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp
167 ↗(On Diff #71468)

Huh, does clang-format actually do this? If so maybe that's worth filing a bug -- that is a strange choice.

189 ↗(On Diff #71468)

This whole dance is going to destroy the beautiful efficiency gains you were after, right?

It sort of seems like the only way to make this work would be to have a different args-packing class for each platform. But I am not sure how to do that without introducing virtual function calls, which would *also* destroy your beautiful efficiency gains.

At least let's use llvm::SmallVector so we don't have to malloc anything. And maybe add a comment that we may need to come back and improve this?

streamexecutor/unittests/CoreTests/CUDATest.cpp
142 ↗(On Diff #71468)

Doesn't match ns name above. (It's going to be technically UB for something other than the compiler to put anything into __foo.)

This revision is now accepted and ready to land.Sep 15 2016, 8:54 AM
jhen updated this revision to Diff 71520.Sep 15 2016, 9:53 AM
jhen marked 2 inline comments as done.
jhen edited edge metadata.
  • Comment on dyn-shared-memory arg efficiency
streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp
167 ↗(On Diff #71468)

Yes, clang-format actually does this. I'll create a simple reproducer and file a bug.

189 ↗(On Diff #71468)

While optimizing this internally, our approach was that the dynamic shared memory case was not common, and we would accept being inefficient in that case, as long as we were efficient in the case of no dynamic shared memory.

So, the idea is that, other than the check for ArgumentArray.getArgumentCount() == 0, the no-dynamic-shared-memory case should take advantage of the efficiency gains in the quirky packed argument array design. Definitely let me know if I've done something that has broken that case.

For the general dynamic shared memory case, I couldn't think of a good way to make it efficient for CUDA and OpenCL at the same time. As you mentioned, it might require virtual function calls, which would hurt both cases. But again, I think we're OK to be less efficient in this case.

Actually, now that I think of it, we could be much more efficient in the most important specific case of dynamic shared memory--the case where there is only one dynamic shared memory argument, and it is the first one. That would work for all CUDA cases, and OpenCL users could take advantage of it as well by choosing to write their kernels in that way. For now, I've switched to using llvm::SmallVector (thanks for that idea!) and wrote a comment describing how we might improve this in the future.

streamexecutor/unittests/CoreTests/CUDATest.cpp
142 ↗(On Diff #71468)

Thanks for catching that. I changed it from the example code to avoid UB, but I missed the comment.

jlebar added inline comments.Sep 15 2016, 10:57 AM
streamexecutor/lib/platforms/cuda/CUDAPlatformDevice.cpp
189 ↗(On Diff #71520)

our approach was that the dynamic shared memory case was not common, and we would accept being inefficient in that case, as long as we were efficient in the case of no dynamic shared memory.

I feel like these things usually are unimportant until they're not. Which is to say, I'm totally onboard with doing a slow thing for a case we think is uncommon, so long as we're not painting ourselves into a corner.

I like your plan of (if the case arises) telling people who want their code to be fast to put a single shared memory parameter at the front of their param pack and then optimizing for that case.

This revision was automatically updated to reflect the committed changes.