ldmatrix transpose can only be used with types that are 16bits wide.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
The intrinsic returns i32 values. So I thought as long as the data type is <= 32 bits and you are reading 8x128bit rows (8xf16, 4xf32, 16xi8, etc), then there is no problem.
In your test you are reading a 8x8xf32 B operand. So I was under the impression such an operand could be loaded with two ldmatrix calls, which loads two 8x128bit tiles. The distributed values (one per tile / thread) would be returned as two i32 values.
I thought that with transpose loading 32bits element was wrong because the 32bits value would be read as 2xf16 and when transposed would end up on different rows. Is that not the case? There are some miscompile when using ldmatrix in this case but this could be due to a different reason. Do you expect the transpose version of the op to work for 32bits?
You're right, I got mixed up with the transpose vs non-transpose versions. Transpose definitely needs the fp16 constraint.
Thanks for finding this, LGTM!