Normally math functions are forwarded to the __nv_* counterparts provided by
CUDA's libdevice bitcode. However, the __nv_rint*() functions there have a bug:
they use round(), which rounds halfway cases away from zero instead of to the
nearest even integer, so rint(2.5f) produces 3.0 instead of the expected 2.0.
The broken bitcode
is not actually used by NVCC itself, which has both a workaround in the CUDA
headers and, in recent versions, correct implementations in NVCC's built-ins.
This patch implements an equivalent workaround and directs rint/rintf to
__builtin_rint/__builtin_rintf, which produce correct results.
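
The difference between the two behaviours can be illustrated on the host with a
minimal sketch. It is not the code from this patch: the function names
emulated_libdevice_rintf and fixed_rintf are hypothetical, the first only
imitates the round()-based behaviour described above, and the sketch assumes a
GCC- or Clang-compatible compiler (for __builtin_rintf) running in the default
round-to-nearest floating-point mode.

```cpp
#include <cmath>

// Hypothetical stand-in for the buggy behaviour described above:
// round() sends halfway cases away from zero.
inline float emulated_libdevice_rintf(float x) {
  return std::round(x);
}

// Hypothetical stand-in for the workaround: forward to the compiler
// builtin, which honours the current rounding mode (round-half-to-even
// by default).
inline float fixed_rintf(float x) {
  return __builtin_rintf(x);
}
```

With these definitions, emulated_libdevice_rintf(2.5f) yields 3.0f while
fixed_rintf(2.5f) yields 2.0f. The two agree on non-halfway inputs and on
halfway inputs whose away-from-zero neighbour is even, such as 3.5f.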