It would be good to add a test.
Both NVCC and clang currently lower memcpy to an explicit loop. I'm not sure what effect (if any) allowing memcpy/memset libcalls would have on performance. We may want to benchmark it before landing.
I doubt I have the proper setup to do such benchmarking. I care about malloc/free; this was just a follow-up because the same paragraph of the CUDA documentation says memcpy/memset are available too.
I'm fine with dropping this for now.
I'd incorporate the changes into your free/malloc patch, but leave them commented out with a TODO explaining that memcpy/memset are available but disabled until we can show that enabling them is beneficial.
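Roughly what I have in mind, as a sketch only. This is a self-contained stand-in, not the real code: the `LibInfo` type, the `LibFunc` enumerators, and `initializeForGPU` are hypothetical names made up for illustration, and the actual patch should use whatever the existing target setup code already uses.

```cpp
#include <bitset>
#include <cstddef>

// Hypothetical stand-in for a per-target table of available library functions.
enum LibFunc : std::size_t { Malloc, Free, Memcpy, Memset, NumLibFuncs };

struct LibInfo {
  std::bitset<NumLibFuncs> Available;

  void disableAllFunctions() { Available.reset(); }
  void setAvailable(LibFunc F) { Available.set(F); }
  bool has(LibFunc F) const { return Available.test(F); }
};

// Hypothetical setup routine for the GPU target (assumed name).
inline LibInfo initializeForGPU() {
  LibInfo TLI;
  TLI.disableAllFunctions();

  // malloc and free are available in device code.
  TLI.setAvailable(Malloc);
  TLI.setAvailable(Free);

  // TODO: memcpy and memset are also available in device code, but they stay
  // disabled until benchmarks show that emitting libcalls instead of an
  // explicit loop is beneficial.
  // TLI.setAvailable(Memcpy);
  // TLI.setAvailable(Memset);

  return TLI;
}
```

With this shape, enabling the two functions later is a matter of uncommenting two lines once someone has benchmark numbers, and the TODO records why they are off in the meantime.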