MSVC header files use vectorcall to differentiate overloaded functions, which
causes failures for the AMDGPU target. This is because clang does not check a
function's calling convention against the target the function is compiled for.
This patch checks the calling convention using the proper target info.
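
For illustration only, the idea can be sketched as a toy model. Every name below (ToyTargetInfo, FuncTarget, checkCallConv) is hypothetical and is not clang's actual API, and the assumption that the host target accepts vectorcall while the device target does not is just an example setup. The sketch checks the convention only against the target(s) a function will actually be emitted for.

```cpp
#include <cassert>

// Hypothetical toy model; none of these names are clang's real API.
enum class CallConv { C, VectorCall };
enum class FuncTarget { Host, Device, HostDevice };
enum class CheckResult { OK, Ignored };

// Stand-in for per-target information about supported calling conventions.
struct ToyTargetInfo {
  bool supportsVectorCall;

  CheckResult check(CallConv cc) const {
    if (cc == CallConv::VectorCall && !supportsVectorCall)
      return CheckResult::Ignored;
    return CheckResult::OK;
  }
};

// Check the calling convention against the target(s) the function is
// compiled for, rather than against whichever target is being compiled now.
inline CheckResult checkCallConv(CallConv cc, FuncTarget ft,
                                 const ToyTargetInfo &host,
                                 const ToyTargetInfo &device) {
  const bool checkHost = ft != FuncTarget::Device;
  const bool checkDevice = ft != FuncTarget::Host;

  CheckResult result = CheckResult::OK;
  if (checkHost)
    result = host.check(cc);
  if (result == CheckResult::OK && checkDevice)
    result = device.check(cc);
  return result;
}
```

In this model a host-only function keeps vectorcall even when the device target lacks it, which is the situation the MSVC headers run into.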
Please sink these declarations into the CUDA-specific block. Also, please add some comments to explain why different logic is needed for CUDA.