nvgpu.device_async_copy is lowered to the cp.async PTX instruction. However, the NVPTX backend does not support all of its modes, in particular the one needed for zero padding. The current MLIR implementation therefore generates inline assembly for it.
This work simplifies PTX generation for nvgpu.device_async_copy and implements it through the NVVMToLLVM pass.
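
For readers unfamiliar with the zero-padding mode mentioned above, here is a minimal host-side sketch of the semantics: cp.async takes a copy size and an optional source size, and when the source size is smaller, the remaining destination bytes are filled with zeros. The function name modelAsyncCopyZeroFill and its parameters are hypothetical and exist only for illustration; this is not code from MLIR or the NVPTX backend, and it models only the data movement, not the asynchrony.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical host-side model of the zero-padding form of cp.async.
// Copies srcBytes bytes from src to dst, then zero-fills the remaining
// (cpBytes - srcBytes) bytes of the destination.
inline void modelAsyncCopyZeroFill(void *dst, const void *src,
                                   std::size_t cpBytes, std::size_t srcBytes) {
  // PTX leaves a source size larger than the copy size undefined;
  // clamping here is an assumption of this model only.
  if (srcBytes > cpBytes)
    srcBytes = cpBytes;
  std::memcpy(dst, src, srcBytes);
  std::memset(static_cast<unsigned char *>(dst) + srcBytes, 0,
              cpBytes - srcBytes);
}
```

With a 16-byte copy size and a 4-byte source size, the first 4 destination bytes come from the source and the other 12 are zero, which is the behavior the lowering has to express in PTX.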
Depends on D154060
It is great that we can rationalize this more cleanly. Thanks for this!