If this is removed, builtins-amdgcn.cu fails. That test is CUDA with an LDS pointer passed as a flat pointer, i.e. the argument arrives as a cast from addrspace(3) to flat. The generic builtin handling further down in this file does the same.
There was a TargetInfo hook, getCUDABuiltinAddressSpace, which was introduced by Matt. Its default implementation maps any address space to the default address space 0. For some reason, it was never given a target-specific implementation that maps the address space specified in the builtin def to the real one. As a result, all builtin functions have generic pointer parameters for CUDA, so the cast is needed here when calling the intrinsic.
We could consider fixing that. For this patch, I think we still need the cast.
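
To make the hook behaviour discussed above concrete, here is a toy model in plain C++. It is only a sketch: ToyTargetInfo, ToyAMDGPUTargetInfo, AddrSpace and needsAddrSpaceCast are hypothetical names invented for illustration and are not clang's real classes or signatures. The only thing taken from the discussion is that the default hook collapses every address space to 0.

```cpp
#include <cassert>

// Hypothetical stand-in for the address spaces mentioned in this thread.
enum class AddrSpace : unsigned { Flat = 0, LDS = 3 };

// Toy model of a TargetInfo-like class (not clang's real TargetInfo).
struct ToyTargetInfo {
  virtual ~ToyTargetInfo() = default;
  // Models the default described above: whatever address space the
  // builtin def names, the parameter ends up in address space 0.
  virtual AddrSpace getCUDABuiltinAddressSpace(unsigned /*DefAS*/) const {
    return AddrSpace::Flat;
  }
};

// Hypothetical target-specific override that keeps the address space
// from the builtin def (the possible fix mentioned above).
struct ToyAMDGPUTargetInfo : ToyTargetInfo {
  AddrSpace getCUDABuiltinAddressSpace(unsigned DefAS) const override {
    return static_cast<AddrSpace>(DefAS);
  }
};

// A cast is needed whenever the builtin's parameter address space differs
// from the address space the intrinsic expects.
inline bool needsAddrSpaceCast(const ToyTargetInfo &TI, unsigned DefAS,
                               AddrSpace IntrinsicAS) {
  return TI.getCUDABuiltinAddressSpace(DefAS) != IntrinsicAS;
}
```

In this model, needsAddrSpaceCast is true for an addrspace(3) builtin parameter under ToyTargetInfo and false under ToyAMDGPUTargetInfo. That matches the reasoning for keeping the cast in this patch.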