This patch fixes and simplifies the ldmatrix affine map arithmetic by abstracting the affine expressions in terms of pitch-linear layout (strided and contiguous dimensions). Then it applies the maps for strided and contiguous dimensions in row-major and col-major.
LdMatrixOp collaboratively (32 threads in a warp) load tiles (8 row x 128b col) of data. It can load either x1, x2, x4 tiles. Additionally, it can transpose at 16-bit granularity when moving data from the Shared Memory to registers.
This patch fixes affine map (laneid -> coordinate index a thread points in a tile).
- Loading x4 tiles needs all 32 lanes T0-31 point to a contiguous chunk of 128b. The issue was exposed when running this case.
- Loading x2 tiles and x1 needs T0-15 threads and T0-7 threads points to contiguous chunk of 128b. The patch is NFC for these cases.