This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Added partial support for CUDA-9.1
ClosedPublic

Authored by tra on Jan 24 2018, 4:13 PM.

Details

Summary

Clang can use CUDA-9.1 now, though new builtins (__hmma_m32n8k16*) are not implemented yet.

The major change is that headers in CUDA-9.1 went through substantial
changes that started in CUDA-9.0 which required substantial changes
in the cuda compatibility headers provided by clang.

There are two major issues:

  • CUDA SDK no longer provides declarations for libdevice functions.
  • A lot of device-side functions have become nvcc's builtins and CUDA headers no longer contain their implementations.

This patch changes the way CUDA headers are handled if we compile
with CUDA 9.x. Both 9.0 and 9.1 are affected.

  • Clang provides its own declarations of libdevice functions.
  • For CUDA-9.x clang now provides implementation of device-side 'standard library' functions using libdevice.

This patch should not affect compilation with CUDA-8. There may be
some observable differences for CUDA-9.0, though they are not expected
to affect functionality.

Tested: CUDA test-suite tests for all supported combinations of:

CUDA: 7.0,7.5,8.0,9.0,9.1
GPU: sm_20, sm_35, sm_60, sm_70

Diff Detail

Repository
rC Clang

Event Timeline

tra created this revision.Jan 24 2018, 4:13 PM

Gosh I think the chances we get through this without a copy-paste error are way too high. I'm not sure what to do about it, though. I don't have the focus to check this as carefully as it needs.

Perhaps we need to test this in the test-suite (eventually)? We've seen that even when our own implementation is correct, sometimes nvptx breaks it, and so we need to test e2e *anyway*...

clang/lib/Driver/ToolChains/Cuda.cpp
125 ↗(On Diff #131372)

Perhaps update to CUDA 9+

clang/lib/Headers/__clang_cuda_device_functions.h
32 ↗(On Diff #131372)

I don't think we should need __forceinline__? I'm sure they had that in their headers, but clang doesn't rely on its inliner for correctness, unlike some *other* compilers. :)

I'm also not sure about static. I think we mark all CUDA code as internal, so it shouldn't be necessary? If they're not static, they should be marked as inline, though.

51 ↗(On Diff #131372)

Sanity check: Ignoring the __a argument here is correct?

clang/lib/Headers/__clang_cuda_runtime_wrapper.h
143 ↗(On Diff #131372)

a particular (or "our particular", if you like)

144 ↗(On Diff #131372)

Sentence doesn't read right if you skip the parens.

But also: Our header provides *everything* from cuda/device_functions_decls.h, not just the things that were once in that header and, in newer CUDA versions, are no longer there, right? I'm actually even more confused now that I read the big header, because here we say the header "defines or declares" these functions, but that suggests that for every function, it decides whether to define or to declare it, but that's not the case...

155 ↗(On Diff #131372)

Should we be pushing/popping the macro? That is, do we want this macro exposed to user code?

jlebar accepted this revision.Jan 25 2018, 3:45 PM
This revision is now accepted and ready to land.Jan 25 2018, 3:45 PM
tra marked 2 inline comments as done.Jan 25 2018, 5:16 PM
tra added inline comments.
clang/lib/Headers/__clang_cuda_device_functions.h
32 ↗(On Diff #131372)

The idea is to make those wrappers always inlined, so they are effectively replaced with a call to __nv_* libdevice function. Whether the callee itself get inlined is controlled by the attributes in the libdevice bitcode.

__attribute__((always_inline)) which __forceinline__ expands to is handled differently from regular inline. We have Transforms/IPO/AlwaysInliner.cpp which handles them with -O1. Higher opt levels will presumably inline these functions, too. Without always_inline these wrappers will remain uninlined at -O1. Replacing it with just inline will work OK (only affects -O1 in a minor way), but __forceinline__ is a bit closer to what I want from these wrappers.

As for the static, unused non-static functions will be emitted by clang -- as far as clang is concerned they are externally visible. I've checked and we do end up emitting them in PTX, even though they don't have .visible directive (I think that's what you referred to when you mentioned 'mark as internal'). I don't think we want that.

I'll keep this combination for now. We can adjust it later, if necessary.

51 ↗(On Diff #131372)

Yes. It's a leftover from the old versions and has been deprecated in CUDA-8:

__DEPRECATED__("Please use __brkpt() instead.") void brkpt(int c = 0)

CUDA-9.1 does not have it at all.

clang/lib/Headers/__clang_cuda_runtime_wrapper.h
144 ↗(On Diff #131372)

I should rephrase that then.

__clang_cuda_device_functions.h does two jobs. It always includes __clang_cuda_libdevice_declares.h which provides declarations for the libdevice functions. Those are gone in CUDA-9.1.

It also provides definitions for the standard library and __* device-side functions that call libdevice.

I can explicitly include _clang_cuda_libdevice_declares.h here and update the comment to make more sense.

155 ↗(On Diff #131372)

We do push __THROW at the beginning of this header and pop it closer to the end, so this redefinition is not visible outside.

jlebar added inline comments.Jan 25 2018, 5:22 PM
clang/lib/Headers/__clang_cuda_device_functions.h
32 ↗(On Diff #131372)

Can you help me understand what makes you want this particular inlining behavior (i.e., always inline, even at -O0) for these wrappers?

Does e.g. libc++ do the same thing for the functions it has that are thin wrappers around e.g. libc functions?

As for the static, unused non-static functions will be emitted by clang -- as far as clang is concerned they are externally visible. I've checked and we do end up emitting them in PTX, even though they don't have .visible directive (I think that's what you referred to when you mentioned 'mark as internal'). I don't think we want that.

Agree. This is kind of unfortunate, though -- it's not How You Write Headers. Could we add a comment?

clang/lib/Headers/__clang_cuda_runtime_wrapper.h
144 ↗(On Diff #131372)

I think that would help clarify things, for sure.

tra added inline comments.Jan 25 2018, 5:32 PM
clang/lib/Headers/__clang_cuda_device_functions.h
32 ↗(On Diff #131372)

libc++ does seem to use _LIBCPP_ALWAYS_INLINE for a lot of wrapper-like functions:

E.g. libcxx/include/__locale:

535:    _LIBCPP_ALWAYS_INLINE
536-    const char_type* tolower(char_type* __low, const char_type* __high) const
537-    {
538-        return do_tolower(__low, __high);
539-    }
jlebar added inline comments.Jan 25 2018, 5:34 PM
clang/lib/Headers/__clang_cuda_device_functions.h
32 ↗(On Diff #131372)

sgtm then!

tra updated this revision to Diff 131650.Jan 26 2018, 1:51 PM

Addressed Justin's comments.

tra marked 12 inline comments as done.Jan 26 2018, 1:52 PM
tra updated this revision to Diff 131894.Jan 29 2018, 4:01 PM

Rebased to HEAD.

This revision was automatically updated to reflect the committed changes.
This revision was automatically updated to reflect the committed changes.