
[Matrix] Use aarch64.udot for 4x4 tiling for i8 matrixes (WIP).
Needs ReviewPublic

Authored by fhahn on Apr 6 2020, 7:14 AM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

This patch processes matrix multiplies of i8 matrixes in 4x4 tiles and uses
aarch64.udot to compute the result of each 4x4 multiply.
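
As a rough illustration of the idea, here is a scalar Python model of a 4-lane unsigned dot product and one way a 4x4 tile could be built from it. The names udot and tile_4x4 and the packing scheme (left-operand rows packed into one vector, one right-operand column repeated four times) are assumptions made for this sketch; the shuffles the patch actually emits may differ.

```python
def udot(acc, a, b):
    """Scalar model of a 4-lane unsigned dot product.

    acc holds 4 accumulators (32-bit lanes); a and b hold 16 unsigned
    8-bit values each. Lane i accumulates the dot product of bytes
    4*i .. 4*i+3 of a and b.
    """
    a, b = list(a), list(b)
    assert len(acc) == 4 and len(a) == 16 and len(b) == 16
    assert all(0 <= x <= 255 for x in a + b)
    out = []
    for lane in range(4):
        s = acc[lane]
        for k in range(4):
            s += a[4 * lane + k] * b[4 * lane + k]
        out.append(s & 0xFFFFFFFF)  # wrap to 32 bits
    return out


def tile_4x4(a_rows, b_cols):
    """Multiply a 4x4 u8 tile (hypothetical packing, for illustration).

    a_rows[i] is row i of the left operand, b_cols[j] is column j of the
    right operand. Returns the result as a list of 4 columns.
    """
    packed_rows = [x for row in a_rows for x in row]  # 16 bytes, row by row
    result_cols = []
    for col in b_cols:
        # Repeating the column lets each lane pair it with a different row.
        result_cols.append(udot([0, 0, 0, 0], packed_rows, list(col) * 4))
    return result_cols
```

In this model one udot call yields a full column of the 4x4 result tile, so a tile needs four calls per depth-4 step.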

This patch lowers store(matrix.multiply(transpose(load()), load())) as
described above. As the first operand is transposed, we can access the
rows of the transposed operand by loading the columns of the original
matrix directly.
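
To make the layout argument concrete, the following sketch assumes column-major storage (the default layout for the matrix intrinsics) and plain Python lists as memory. The helper names load_column and multiply_transposed_lhs are hypothetical and not part of the patch.

```python
def load_column(mem, rows, j):
    """Return column j of a column-major matrix with `rows` rows."""
    return mem[j * rows:(j + 1) * rows]


def multiply_transposed_lhs(a_mem, b_mem, k, m, n):
    """Compute transpose(A) * B for column-major A (k x m) and B (k x n).

    Row i of transpose(A) is column i of A, so both operands of every
    dot product are contiguous slices and no strided access is needed.
    Returns the m x n result in column-major order.
    """
    c_mem = [0] * (m * n)
    for j in range(n):
        b_col = load_column(b_mem, k, j)
        for i in range(m):
            a_col = load_column(a_mem, k, i)  # row i of transpose(A)
            c_mem[j * m + i] = sum(x * y for x, y in zip(a_col, b_col))
    return c_mem
```

Because each result element is a dot product of two contiguous columns, this shape maps naturally onto a dot-product instruction.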

Note that @llvm.matrix.multiply does not make a distinction between
unsigned & signed multiplication for integers and this patch arbitrarily
uses udot. We will probably have to add integer multiply variants for
signed & unsigned in the future. Also, the way this is currently
integrated needs a bit more work. It would probably be good to expose a
hook through which targets can be queried for the kernels they can
implement efficiently.
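
A small sketch of why the signed/unsigned choice matters once i8 products are accumulated into wider lanes: the same bit patterns give different sums under the two interpretations. The helper names are made up for this example.

```python
def as_signed8(x):
    """Reinterpret an 8-bit pattern (0..255) as a signed value (-128..127)."""
    return x - 256 if x >= 128 else x


def dot4_unsigned(a, b):
    """Widening dot product treating all bytes as unsigned."""
    return sum(x * y for x, y in zip(a, b))


def dot4_signed(a, b):
    """Widening dot product treating all bytes as signed."""
    return sum(as_signed8(x) * as_signed8(y) for x, y in zip(a, b))


a = [0xFF, 1, 2, 3]  # 0xFF is 255 unsigned, -1 signed
b = [2, 1, 1, 1]
unsigned_result = dot4_unsigned(a, b)  # 255*2 + 1 + 2 + 3
signed_result = dot4_signed(a, b)      # (-1)*2 + 1 + 2 + 3
```

The two results agree modulo 2^8, so the difference only becomes visible when the accumulator is wider than the inputs, which is the case for udot.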

Finally, the shuffles generated for the current lowering seem to
produce awful code for now, but the main goal of the patch is to
illustrate how target-specific instructions can be used when lowering
matrix intrinsics.

Event Timeline

fhahn created this revision.Apr 6 2020, 7:14 AM
fhahn updated this revision to Diff 255328.Apr 6 2020, 7:25 AM

Add tests to illustrate the generated IR.

fhahn added a comment.Apr 7 2020, 6:49 AM

I shared the WIP patch to illustrate how matrix intrinsics could be lowered using target intrinsics. I won't have time to work on this in the near future, but if anyone would be interested in picking this up in the meantime, that would be great :)

Thanks Florian, we are happy to pick this up.
+ @samparker , @dmgreen

fhahn updated this revision to Diff 256037.Apr 8 2020, 8:49 AM

Small update to preserve loop info.

+1 on seeing similar efforts on *all* matrix intrinsics, like transpose

fhahn added a comment.Jun 5 2020, 2:26 PM

I've just put up D81308, which uses the same approach to generate loops for the regular tiled matrix multiplication. I'll work towards getting the initial infrastructure in place; after that, the target-specific follow-ups should be more straightforward.

fhahn updated this revision to Diff 289931.Sep 4 2020, 6:13 AM

Rebase on current trunk.