Add a simple transform operation to the NVGPU extension that performs
software pipelining of copies to shared memory. The functionality is
extremely minimalistic in this version and only supports copies from
global to shared memory inside an scf.for loop with either
vector.transfer or nvgpu.device_async_copy operations when
pipelining preconditions are already satisfied in the IR. This is the
minimally useful version that uses the more general loop pipeliner in an
NVGPU-specific way. Further extensions and orthogonalizations will be
necessary.
This required a change to the loop pipeliner itself to properly
propagate errors should the predicate generator fail.
This is loosely inspired from the vesion in IREE, but has less unsafe
assumptions and more principled way of communicating decisions.
nit: for a load into shared memory ?