This is an archive of the discontinued LLVM Phabricator instance.

[mlir][VectorToGPU] Fix bug generating incorrect ldmatrix ops
ClosedPublic

Authored by ThomasRaoux on Jun 1 2022, 7:46 PM.

Details

Summary

ldmatrix transpose can only be used with types that are 16bits wide.

Diff Detail

Event Timeline

ThomasRaoux created this revision.Jun 1 2022, 7:46 PM
Herald added a project: Restricted Project. · View Herald Transcript
ThomasRaoux requested review of this revision.Jun 1 2022, 7:46 PM
christopherbate added a comment.EditedJun 1 2022, 9:57 PM

The intrinsic returns i32 values. So I thought as long as the data type is <= 32 bits and you are reading 8x128bit rows (8xf16, 4xf32, 16xi8, etc), then there is no problem.

In your test you are reading a 8x8xf32 B operand. So I was under the impression such an operand could be loaded with two ldmatrix calls, which loads two 8x128bit tiles. The distributed values (one per tile / thread) would be returned as two i32 values.

The intrinsic returns i32 values. So I thought as long as the data type is <= 32 bits and you are reading 8x128bit rows (8xf16, 4xf32, 16xi8, etc), then there is no problem.

In your test you are reading a 8x8xf32 B operand. So I was under the impression such an operand could be loaded with two ldmatrix calls, which loads two 8x128bit tiles. The distributed values (one per tile / thread) would be returned as two i32 values.

I thought that with transpose loading 32bits element was wrong because the 32bits value would be read as 2xf16 and when transposed would end up on different rows. Is that not the case? There are some miscompile when using ldmatrix in this case but this could be due to a different reason. Do you expect the transpose version of the op to work for 32bits?

christopherbate accepted this revision.Jun 2 2022, 1:31 PM

The intrinsic returns i32 values. So I thought as long as the data type is <= 32 bits and you are reading 8x128bit rows (8xf16, 4xf32, 16xi8, etc), then there is no problem.

In your test you are reading a 8x8xf32 B operand. So I was under the impression such an operand could be loaded with two ldmatrix calls, which loads two 8x128bit tiles. The distributed values (one per tile / thread) would be returned as two i32 values.

I thought that with transpose loading 32bits element was wrong because the 32bits value would be read as 2xf16 and when transposed would end up on different rows. Is that not the case? There are some miscompile when using ldmatrix in this case but this could be due to a different reason. Do you expect the transpose version of the op to work for 32bits?

You're right, I got mixed up with the transpose vs non-transpose versions. Transpose definitely needs the fp16 constraint.

Thanks for finding this, LGTM!

This revision is now accepted and ready to land.Jun 2 2022, 1:31 PM