Instead of calling the CUDA runtime to arrange function arguments,
the new API constructs the arguments in a local array and launches
the kernel with __cudaLaunchKernel().
The old API has been deprecated and is expected to go away
in the next CUDA release.
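
As a rough illustration of the argument-array approach described above, here is a minimal host-only C++ sketch. It is not the code the compiler generates, and it does not call CUDA. `fakeLaunchKernel`, `axpyBody` and `axpyStub` are hypothetical names invented for this example; `fakeLaunchKernel` only stands in for the runtime launch entry point so that the sketch runs without a GPU. The point is the shape of the host-side stub: it collects the address of each argument into one local array and makes a single launch call.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the runtime launch entry point; NOT a CUDA API.
// It receives one array that holds a pointer to each kernel argument.
using KernelBody = void (*)(void **args);

static int launch_count = 0;

int fakeLaunchKernel(KernelBody body, void **args) {
    assert(body != nullptr && args != nullptr);
    ++launch_count;
    body(args);  // a real runtime would enqueue the kernel on the device
    return 0;
}

// Stand-in for the kernel body, y[i] = a * x[i] + y[i], run on the host.
// It unpacks its arguments from the array in declaration order.
void axpyBody(void **args) {
    float a = *static_cast<float *>(args[0]);
    float *x = *static_cast<float **>(args[1]);
    float *y = *static_cast<float **>(args[2]);
    std::size_t n = *static_cast<std::size_t *>(args[3]);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Host-side stub: gather the addresses of the arguments into a local
// array, then launch once. No per-argument runtime calls are needed.
int axpyStub(float a, float *x, float *y, std::size_t n) {
    void *args[] = {&a, &x, &y, &n};
    return fakeLaunchKernel(axpyBody, args);
}
```

Calling `axpyStub(2.0f, x, y, n)` updates `y` in place. The argument array lives on the stub's stack, so it only needs to stay valid for the duration of the launch call.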