This processes matrix multiplies of i8 matrices in 4x4 tiles and uses
aarch64.udot to compute the result of the 4x4 multiplies.
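
To make the tiling concrete, here is a minimal scalar sketch of what a
udot-style 4x4 tile multiply computes. It is not the code from this
patch: the names udotLane, Tile8, Tile32 and tileMultiplyAccumulate are
hypothetical, and the operand layout (rows of A and columns of B stored
contiguously, result tile column-major) is an assumption chosen for
illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of one 32-bit lane of a udot-style operation: the
// accumulator plus the sum of four unsigned 8-bit products.
uint32_t udotLane(uint32_t Acc, const uint8_t *A, const uint8_t *B) {
  for (int K = 0; K < 4; ++K)
    Acc += uint32_t(A[K]) * uint32_t(B[K]);
  return Acc;
}

using Tile8 = std::array<uint8_t, 16>;   // 4x4 tile of i8 values
using Tile32 = std::array<uint32_t, 16>; // 4x4 tile of i32 accumulators

// C += A * B on 4x4 tiles (hypothetical layout, see lead-in):
//   ARows[4 * I + K] = A[I][K]  (rows of A are contiguous)
//   BCols[4 * J + K] = B[K][J]  (columns of B are contiguous)
//   C[4 * J + I]     = C[I][J]  (column-major result)
void tileMultiplyAccumulate(Tile32 &C, const Tile8 &ARows,
                            const Tile8 &BCols) {
  for (int J = 0; J < 4; ++J) {
    for (int I = 0; I < 4; ++I) {
      assert(4 * J + I < 16 && "tile index out of range");
      C[4 * J + I] = udotLane(C[4 * J + I], &ARows[4 * I], &BCols[4 * J]);
    }
  }
}
```

Each result element is one 4-element dot product, so a 4x4 tile needs
16 such lanes; a vector udot computes four lanes per instruction.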
This patch lowers store(matrix.multiply(transpose(load()), load())) as
described above. As the first operand is transposed we can access the
rows of the transposed operands by loading the columns of the original
load directly.
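
The following sketch illustrates why no explicit transpose is needed,
assuming the column-major layout that the matrix intrinsics use by
default. The helper names (at, transposedRowSlow, transposedRowDirect)
are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Element (Row, Col) of a column-major matrix with NumRows rows.
inline uint8_t at(const std::vector<uint8_t> &M, size_t NumRows, size_t Row,
                  size_t Col) {
  return M[Col * NumRows + Row];
}

// Row I of transpose(M), built element by element. M is a
// NumRows x NumCols column-major matrix, so transpose(M)[I][K] = M[K][I].
std::vector<uint8_t> transposedRowSlow(const std::vector<uint8_t> &M,
                                       size_t NumRows, size_t NumCols,
                                       size_t I) {
  assert(M.size() == NumRows * NumCols && I < NumCols);
  std::vector<uint8_t> Row(NumRows);
  for (size_t K = 0; K < NumRows; ++K)
    Row[K] = at(M, NumRows, K, I);
  return Row;
}

// The same row, read directly: it is column I of M, which is already
// contiguous in memory, so a single wide load suffices.
const uint8_t *transposedRowDirect(const std::vector<uint8_t> &M,
                                   size_t NumRows, size_t I) {
  return M.data() + I * NumRows;
}
```

Both helpers yield the same NumRows bytes for any valid I; the direct
variant just avoids the strided gather.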
Note that @llvm.matrix.multiply does not make a distinction between
unsigned & signed multiplication for integers and this patch arbitrarily
uses udot. We probably have to add integer multiply variants for signed &
unsigned in the future. Also, the way this is currently integrated needs
a bit more work. It would probably be good to expose a hook through
which targets can be queried for the kernels they can implement
efficiently.
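
To illustrate why the signedness matters, the sketch below computes the
same 4-element dot product twice, once treating the bytes as unsigned
(as udot does) and once as signed. The function names are hypothetical;
the signed variant models what a signed counterpart such as sdot would
compute and is not part of this patch.

```cpp
#include <cstdint>

// Dot product of four bytes, interpreted as unsigned 8-bit values.
uint32_t dot4Unsigned(const uint8_t *A, const uint8_t *B) {
  uint32_t Acc = 0;
  for (int K = 0; K < 4; ++K)
    Acc += uint32_t(A[K]) * uint32_t(B[K]);
  return Acc;
}

// Dot product of the same four bytes, reinterpreted as signed 8-bit
// values (two's complement, so 0xFF becomes -1).
int32_t dot4Signed(const uint8_t *A, const uint8_t *B) {
  int32_t Acc = 0;
  for (int K = 0; K < 4; ++K)
    Acc += int32_t(static_cast<int8_t>(A[K])) *
           int32_t(static_cast<int8_t>(B[K]));
  return Acc;
}
```

The two interpretations agree as long as every byte is below 0x80 and
diverge as soon as one operand has its top bit set, which is why a
single udot-based lowering cannot be correct for both cases.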
Finally, the shuffles generated for the current lowering seem to
generate awful code for now, but the main goal of the patch is to
illustrate how target-specific instructions can be used when lowering
matrix intrinsics.