In current Clang, the OpenMP NVPTX toolchain resolves math functions to their host versions. For example, a call to sqrt() in a target region produces an LLVM IR call that looks like this:
call double @sqrt(double %1)
This patch makes math functions in OpenMP NVPTX target regions resolve to the same device math functions that CUDA code calls. For example, for sqrt we get:
call double @llvm.nvvm.sqrt.rn.d(double %1)
This is necessary for both correctness and performance.
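For illustration, here is a minimal sketch of the kind of source this affects. The function name target_sqrt is hypothetical and not part of the patch. The behavior described above would apply when building with offloading enabled (for example with -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda). Without OpenMP enabled, the pragma is ignored and the call runs on the host.

```cpp
#include <cmath>

// Hypothetical helper, not part of the patch: computes sqrt(x) inside an
// OpenMP target region. When offloading to NVPTX, the sqrt call below is
// the kind of call whose lowering the patch description discusses.
double target_sqrt(double x) {
  double result = 0.0;
  // Scalars are firstprivate in a target region by default, so the result
  // is mapped back explicitly. Without OpenMP enabled, the pragma is
  // ignored and the block simply runs on the host.
#pragma omp target map(to: x) map(from: result)
  {
    result = std::sqrt(x);
  }
  return result;
}
```

Calling target_sqrt(4.0) from host code is enough to exercise the target region. Inspecting the device-side IR for that call should show which sqrt implementation was selected.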
Could you elaborate on why you don't want the builtins?
Builtins are enabled and are useful for CUDA. What makes their use different for OpenMP?
Are you doing it to guarantee that math functions remain unresolved in IR so you could link them in from external bitcode?