[CUDA] Added __hmma_m16n16k16_* builtins to support mma instructions in sm_70

Authored by tra on Tue, Oct 10, 9:53 AM.

Diff Detail

tra created this revision. Tue, Oct 10, 9:53 AM
jlebar accepted this revision. Wed, Oct 11, 9:40 AM
jlebar added inline comments.
9726 ↗(On Diff #118408)

weird indentation?

9733 ↗(On Diff #118408)

Urg, this isn't a bool? Do we want it to be?

9761 ↗(On Diff #118408)

Accidentally left over?

9762 ↗(On Diff #118408)


Maybe it would be better to write it as "Return value has type [[f16 x 2] x NumResults]."?

9784 ↗(On Diff #118408)

Nit, at this point it's probably better to assign NumResults in each branch, since there are only two. clang should make sure that we don't accidentally use it uninitialized.
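
A minimal sketch of the pattern suggested here, not the code from the patch: the variable is declared without an initializer and assigned in each of the two branches, so a path that forgets the assignment can be caught by clang's uninitialized-variable warnings (for example `-Wsometimes-uninitialized`). `FragmentKind` and `numResultsFor` are hypothetical names, and the 4/8 counts are assumed for illustration only.

```cpp
#include <cassert>

// Hypothetical stand-in for the two cases being distinguished.
enum class FragmentKind { PackedHalf, Float };

// NumResults has no initializer on purpose: each branch must assign it,
// so the compiler can flag a path that leaves it unset.
unsigned numResultsFor(FragmentKind Kind) {
  unsigned NumResults;
  if (Kind == FragmentKind::PackedHalf) {
    NumResults = 4; // assumed count for the packed-half case
  } else {
    assert(Kind == FragmentKind::Float && "unexpected fragment kind");
    NumResults = 8; // assumed count for the float case
  }
  return NumResults;
}
```

Initializing the variable to a default at its declaration would silence the warning and hide a forgotten branch, which is why the per-branch assignment is preferable when there are only two cases.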

9786 ↗(On Diff #118408)

s/are using/use/

9800 ↗(On Diff #118408)

spacing. (Probably just worth clang-formatting this and the other patch.)

9802 ↗(On Diff #118408)

Nit, we know that there won't ever be more than 8 elements...

This revision is now accepted and ready to land. Wed, Oct 11, 9:40 AM
tra updated this revision to Diff 118636. Wed, Oct 11, 10:12 AM
tra marked 6 inline comments as done.

Addressed Justin's comments.

9726 ↗(On Diff #118408)

My emacs and clang-format keep fighting case indentation... Fixed.

9733 ↗(On Diff #118408)

There are no explicit declarations for these builtins in the CUDA headers. Callers of these builtins pass 0/1, and the corresponding intrinsic described in the NVVM IR spec shows the argument type as i32, so I've made the type an integer in clang.

9762 ↗(On Diff #118408)

That was part of the leftover block. The particular types are irrelevant here. All we care about is storing whatever the intrinsic call returned (4 or 8 elements of v2f16 or float) in the destination array (which is int[]).
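
To illustrate the point about types being irrelevant, here is a host-side sketch, not the CodeGen from the patch. It assumes each result element is 32 bits wide and copies the raw bits into an `int` destination without interpreting them. `storeResults` and `roundTripsExactly` are hypothetical helpers introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies NumResults 32-bit values into an int
// destination bit-for-bit, regardless of what the values represent.
template <typename T>
void storeResults(int *Dst, const T *Results, std::size_t NumResults) {
  static_assert(sizeof(T) == sizeof(int), "each result must match int in size");
  assert(Dst != nullptr && Results != nullptr);
  std::memcpy(Dst, Results, NumResults * sizeof(int));
}

// Stores one float through the int buffer and reads it back, to show
// that the bits survive the trip unchanged.
bool roundTripsExactly(float Value) {
  float In[4] = {Value, Value, Value, Value};
  int Out[4] = {0, 0, 0, 0};
  storeResults(Out, In, 4);
  float Back = 0.0f;
  std::memcpy(&Back, &Out[0], sizeof(Back));
  return Back == Value;
}
```

The destination only needs to hold the right number of 32-bit slots; whoever reads the buffer later decides whether a slot is a float or a pair of halves.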

9802 ↗(On Diff #118408)

We have two extra arguments -- destination buffer and stride.

jlebar added inline comments. Wed, Oct 11, 10:47 AM
9733 ↗(On Diff #118408)


This revision was automatically updated to reflect the committed changes.