This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Implement __ldg using intrinsics.
ClosedPublic

Authored by jlebar on May 5 2016, 1:02 PM.

Details

Summary

Previously it was implemented as inline asm in the CUDA headers.

This change allows us to use the [addr+imm] addressing mode when
executing ld.global.nc instructions. This translates into a 1.3x
speedup on some benchmarks that call this instruction from within an
unrolled loop.

Diff Detail

Repository
rL LLVM

Event Timeline

jlebar updated this revision to Diff 56331.May 5 2016, 1:02 PM
jlebar retitled this revision from to [CUDA] Implement __ldg using intrinsics..
jlebar updated this object.
jlebar added reviewers: tra, rsmith.
jlebar added subscribers: cfe-commits, jhen.
majnemer added inline comments.
include/clang/Basic/BuiltinsNVPTX.def
569–603 ↗(On Diff #56331)

Would it be crazy to instead provide a generic builtin? Would cut down on the number of variants...

__builtin_add_overflow is an example of such a builtin.

jlebar added inline comments.May 5 2016, 1:40 PM
include/clang/Basic/BuiltinsNVPTX.def
569–603 ↗(On Diff #56331)

Art is going to send you flowers. :) He and I just had an argument about this.

I think this isn't an unreasonable thing to want, but I think it's beneficial to be consistent with our existing API. So if we offer a generic thing for ldg, it would be nice to have one for atomics above, which are basically the same.

So I told Art I'd prefer to add it to our list.

jlebar added inline comments.May 5 2016, 1:43 PM
include/clang/Basic/BuiltinsNVPTX.def
569–603 ↗(On Diff #56331)

Oh, another thing is that, you really see the benefit of having a generic builtin when you start hitting the combinatorial explosion of all the different kinds of loads. Like, as-is it's not so bad, but if you want to support all forms of ld.global.nc, there are four different caching behaviors. Supporting all forms of ld is way worse.

Which is to say, if we're going to do the generic thing, it seems like we benefit the most by making it generic on more than the types. But we're not ready to do that; I don't think most of these loads even exist in llvm atm.

http://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-ld

Art pointed me to the fact that CUDA 8 adds a bunch more load intrinsics, and I said ohmygosh maybe we *do* want to do the variadic intrinsic thing here.

But now looking at how __builtin_add_overflow is implemented, we'd need special sema checking to make it work. We would also need some sort of argument promotion logic to make the value and pointer into the same types. In both cases it seems like maybe it's better to leave this stuff to clang, rather than trying to write a buggy implementation ourselves?

Even with the many new load intrinsics, listing all the intrinsics is a relatively small part of the code required. The majority of the code necessary is in our CUDA header, but even with a variadic builtin, that would be hard to reduce without some serious template magic, and that would be doubly difficult to do without exposing crummy diagnostics to users.

What do you all think?

tra edited edge metadata.May 9 2016, 11:35 AM

OK. Let's stick with __ldg for now.

Art pointed out that static_assert is c++11-only.

I'll just remove them and make a note to move them into the CUDA test-suite stuff Art is working on.

jlebar updated this revision to Diff 56603.May 9 2016, 11:38 AM
jlebar edited edge metadata.

Remove static_asserts.

jlebar added a reviewer: rnk.May 12 2016, 2:29 PM

Friendly ping. This is a big help with some Tensorflow benchmarks.

rsmith added inline comments.May 17 2016, 12:16 PM
include/clang/Basic/BuiltinsNVPTX.def
569–603 ↗(On Diff #56603)

It sounds like the combinatorial explosion will be unmanageable if we don't switch to using a generic builtin for the full suite of 'ld' operations, so it seems worthwhile to do that now. This would also be consistent with how we handle the somewhat-similar builtin __builtin_nontemporal_load.

rsmith accepted this revision.May 19 2016, 3:36 PM
rsmith edited edge metadata.

After offline discussion: we don't know for sure whether we're going to hit the combinatorial explosion in future or not. Let's go ahead with this as-is for now, then, with the explicit acknowledgement that we reserve the right to replace these builtins with a single type-generic builtin in the future.

This revision is now accepted and ready to land.May 19 2016, 3:36 PM
This revision was automatically updated to reflect the committed changes.