Previously it was implemented as inline asm in the CUDA headers.
This change lets us use the [addr+imm] addressing mode when emitting
ld.global.nc instructions, which yields a 1.3x speedup on some
benchmarks that issue this instruction from within an unrolled loop.
Would it be crazy to provide a single generic builtin instead? That would cut down on the number of variants.
__builtin_add_overflow is an example of such a builtin.
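
For reference, a minimal sketch of what "generic" means for __builtin_add_overflow, written in plain C rather than CUDA. The same builtin name accepts different integer types and picks the operation from its argument types, so no per-type variants are needed. The wrapper and check function names below are hypothetical and exist only for illustration; this says nothing about how a generic load builtin would actually be spelled.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wrapper: signed int addition with overflow detection.
 * __builtin_add_overflow (GCC and Clang) returns true on overflow and
 * stores the wrapped result through the third argument. */
static bool add_int_overflows(int a, int b, int *out) {
    return __builtin_add_overflow(a, b, out);
}

/* Hypothetical wrapper: same builtin, now with 64-bit unsigned operands.
 * No separate "u64" variant of the builtin is required. */
static bool add_u64_overflows(uint64_t a, uint64_t b, uint64_t *out) {
    return __builtin_add_overflow(a, b, out);
}

/* Small self-check exercising both wrappers. */
static void check_add_overflow(void) {
    int i;
    uint64_t u;

    assert(!add_int_overflows(2, 3, &i) && i == 5);
    assert(add_int_overflows(INT_MAX, 1, &i));

    assert(!add_u64_overflows(1, 2, &u) && u == 3);
    assert(add_u64_overflows(UINT64_MAX, 1, &u));
}
```

Calling check_add_overflow() from a test harness should pass silently when built with GCC or Clang.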