This change removes the requirement that the row stride be statically known when
converting vector.transfer_read and vector.transfer_write to distributed
SIMT operations in the nvgpu lowering path. It also adds a check to verify
that the last dimension of the source memref is statically known to have stride
1, since the conversion logic assumes this. No other changes should be required,
because the generated vector.load operations only ever span the last
dimension. The precondition checks for vector.transfer_read/write are moved
into the nvgpu utilities.
The change is NFC with respect to the GPU dialect lowering path.
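The relaxed precondition can be illustrated with a minimal Python sketch. This is a hypothetical model, not the actual MLIR C++ implementation; `None` stands in for a dynamic stride, and the helper name is invented for illustration:

```python
# Hypothetical sketch of the relaxed stride precondition; not the
# actual MLIR C++ code. A stride of None models a dynamic stride.

def transfer_preconditions_hold(strides):
    """Return True if a memref with these strides may be lowered.

    Only the innermost (last) stride must be statically known to be 1;
    outer (row) strides may now be dynamic, because the generated
    vector.load ops only ever span the last dimension.
    """
    if not strides:
        return False
    # The last stride must be static (not None) and exactly 1.
    return strides[-1] == 1

# Dynamic row stride is now accepted.
assert transfer_preconditions_hold([None, 1])
# A dynamic or non-unit innermost stride is still rejected.
assert not transfer_preconditions_hold([None, None])
assert not transfer_preconditions_hold([8, 2])
```

Before this change, the sketch would also have required every outer stride to be static; only the innermost-stride requirement remains.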