This takes a somewhat different approach than D11666, which got rolled back.
Codegen postpones emitting an instantiated kernel function template until it's used.
If the kernel is used only from the host side (which is normally the case), we never
emit it, because on the device side we don't emit the host code that uses it.
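
A minimal sketch of the situation described above, for illustration only. The
names scale and scale_on_device are hypothetical and do not come from this
change or its tests. The kernel template is instantiated solely by a host-side
launch, so nothing that survives device-side compilation refers to
scale<float>.

```cuda
#include <cuda_runtime.h>

// Kernel template. No device code refers to it.
template <typename T>
__global__ void scale(T *data, T factor) {
  data[threadIdx.x] *= factor;
}

// Host-only helper (hypothetical). Its body is not emitted during
// device-side compilation, so the use of scale<float> below is the
// only thing that could trigger emission of the kernel.
float scale_on_device(float value, float factor) {
  float *d = nullptr;
  cudaMalloc(&d, sizeof(float));
  cudaMemcpy(d, &value, sizeof(float), cudaMemcpyHostToDevice);

  // Host-side launch: the sole use of scale<float> in this file.
  scale<float><<<1, 1>>>(d, factor);

  // A blocking cudaMemcpy on the default stream waits for the kernel.
  float out = 0.0f;
  cudaMemcpy(&out, d, sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d);
  return out;
}
```

If emission is deferred until a use is seen, the device-side output for a file
like this would lack scale<float> and the launch would have nothing to run.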
This change allows CUDA kernels to be emitted on the device side unconditionally.
It's overly conservative and may emit more functions than we really need, but it
guarantees that kernels launched from the host side do exist on the device side.
If this ever causes problems, there are other ways to address the issue,
though they are more invasive and are currently not worth the trouble.