Make NVPTXISelDAGToDAG able to emit cached loads (LDG) for pointer induction variables.
Also fix a latent bug where LDG was not restricted to kernel functions. I believe this could not be triggered so far, because we do not currently infer that a pointer is global outside a kernel function, and only loads of global pointers are considered for cached loads.
This brings a 30% performance gain on some eigen3-based Google-internal CUDA benchmarks, where LLVM introduces a pointer induction variable and previously could not use LDG for the loads through it.
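
For illustration only, here is a rough sketch of the two loop shapes involved. It is plain host C++, not the benchmark code and not part of the patch; the function names are hypothetical. In a CUDA kernel the `in` parameter would be a `const __restrict__` global pointer. In the second form the loaded address comes from a loop-carried pointer rather than directly from the kernel argument, which is the shape described above.

```cpp
#include <cstddef>

// Indexed form: every load address is derived directly from the base
// pointer `in`, so tracing the address back to the argument is direct.
float sum_indexed(const float* __restrict__ in, std::size_t n) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i)
    acc += in[i];
  return acc;
}

// Pointer-induction form: `p` is carried across iterations (a PHI node in
// LLVM IR), so the load goes through `p` rather than through `in` itself.
// The optimizer may produce this shape from the indexed loop above.
float sum_pointer_iv(const float* __restrict__ in, std::size_t n) {
  float acc = 0.0f;
  for (const float *p = in, *end = in + n; p != end; ++p)
    acc += *p;
  return acc;
}
```

Both functions compute the same result; only the form of the load address differs.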
Can't you get the data layout from F now?