Previously, we translated llvm.round to PTX cvt.rni, which rounds to the
even integer when the source is equidistant between two integers. This
is not correct, as llvm.round should round such ties away from zero. This
change lowers llvm.round to a round-away-from-zero implementation through
target-specific custom lowering.
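
The difference between the two tie-breaking rules can be sketched in plain C++. This is only an illustration and not the NVPTX lowering from this change: the helper names round_half_away and round_half_even are hypothetical, and the thresholds 0.5 and 2^23 are assumptions chosen to make the float arithmetic safe.

```cpp
#include <cmath>

// Hypothetical helper, not the actual NVPTX lowering: rounds to the nearest
// integer, with ties going away from zero, as llvm.round requires.
float round_half_away(float x) {
  float ax = std::fabs(x);
  // At 2^23 and above every float is already an integer; the negated
  // comparison also passes NaN and infinities through unchanged.
  if (!(ax < 8388608.0f))
    return x;
  // Below 0.5 the result is a signed zero. Handling this range separately
  // avoids 0.49999997f + 0.5f rounding up to 1.0f in float arithmetic.
  if (ax < 0.5f)
    return std::copysign(0.0f, x);
  // Otherwise add 0.5 with the sign of x and truncate toward zero.
  return std::trunc(x + std::copysign(0.5f, x));
}

// Round-half-to-even in the default rounding mode, which is the tie
// behavior the description attributes to cvt.rni.
float round_half_even(float x) { return std::nearbyint(x); }
```

With these helpers, round_half_away(2.5f) yields 3.0f while round_half_even(2.5f) yields 2.0f, which is the mismatch described above.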
A few affected tests are modified to no longer check for cvt.rni. Instead,
they check for the constants used in implementing round. We are also
adding CUDA runnable tests to test-suites/External/CUDA that check the
values produced by llvm.round.
Do we have FP32-related constants defined somewhere in the LLVM tree?
It would be easier to understand if we could use 1 << SIGN_BIT_SHIFT here, or ((0 + EXP_OFFSET) << EXP_SHIFT) | mantissa below.
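
A rough sketch of what such named constants could look like, assuming the standard IEEE 754 binary32 layout (1 sign bit, 8 exponent bits with bias 127, 23 mantissa bits). All names here, including pow2_bits, SIGN_MASK, and bits_to_float, are hypothetical and are not existing LLVM definitions; the values 0.5 and 2^23 are only illustrative, since the patch's actual constants are not shown here.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical names for the IEEE 754 binary32 layout; they follow the
// review comment and are not existing LLVM definitions.
constexpr std::uint32_t SIGN_BIT_SHIFT = 31; // sign occupies bit 31
constexpr std::uint32_t EXP_SHIFT = 23;      // exponent field starts at bit 23
constexpr std::int32_t EXP_OFFSET = 127;     // exponent bias

// Bit pattern of 2^exp with a zero mantissa (normal range only).
constexpr std::uint32_t pow2_bits(std::int32_t exp) {
  return static_cast<std::uint32_t>(exp + EXP_OFFSET) << EXP_SHIFT;
}

constexpr std::uint32_t SIGN_MASK = 1u << SIGN_BIT_SHIFT;

// The named expressions reproduce the usual magic numbers.
static_assert(SIGN_MASK == 0x80000000u, "sign bit mask");
static_assert(pow2_bits(0) == 0x3F800000u, "bits of 1.0f");
static_assert(pow2_bits(-1) == 0x3F000000u, "bits of 0.5f");
static_assert(pow2_bits(23) == 0x4B000000u, "bits of 2^23");

// Reinterpret a bit pattern as a float without violating aliasing rules.
float bits_to_float(std::uint32_t bits) {
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}
```

Written this way, a literal such as 0x3F000000 becomes pow2_bits(-1), which makes the intended value visible at the use site.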