This processes matrix multiplies of i8 matrices in 4x4 tiles and uses
aarch64.udot to compute the result of the 4x4 multiplies.
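
For readers unfamiliar with the instruction, here is a scalar reference
model of the vector form of udot: each 32-bit lane accumulates the dot
product of four unsigned bytes. The names udotModel, U8x16 and U32x4 are
hypothetical and only serve this illustration; this is a sketch of the
instruction's arithmetic, not of the code the patch emits.

```cpp
#include <array>
#include <cstdint>

// Hypothetical types standing in for a 16 x i8 and a 4 x i32 vector.
using U8x16 = std::array<std::uint8_t, 16>;
using U32x4 = std::array<std::uint32_t, 4>;

// Scalar model of one udot: lane L of the accumulator gets the sum of
// a[4L+k] * b[4L+k] for k = 0..3, with bytes widened as unsigned.
U32x4 udotModel(U32x4 acc, const U8x16 &a, const U8x16 &b) {
  for (int lane = 0; lane < 4; ++lane)
    for (int k = 0; k < 4; ++k)
      acc[lane] += static_cast<std::uint32_t>(a[4 * lane + k]) *
                   static_cast<std::uint32_t>(b[4 * lane + k]);
  return acc;
}
```

One such operation yields four dot products of length four, which is
why a 4x4 tile is a natural unit of work here.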

This patch lowers store(matrix.multiply(transpose(load()), load())) as
described above. As the first operand is transposed, we can access the
rows of the transposed operand by directly loading the columns of the
originally loaded matrix.
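
A minimal scalar sketch of why the transpose helps, assuming the default
column-major layout of the matrix intrinsics: row i of transpose(A) is
column i of A, so both operands of every dot product are contiguous
4-byte runs in memory. The names Tile8, Tile32 and multiplyTransposedTile
are hypothetical, and the 32-bit result type is an assumption made for
the illustration.

```cpp
#include <array>
#include <cstdint>

// Hypothetical 4x4 tiles stored column-major: element (row, col) lives
// at index col * 4 + row.
using Tile8 = std::array<std::uint8_t, 16>;
using Tile32 = std::array<std::uint32_t, 16>;

// Computes C = transpose(A) * B for a single 4x4 tile.
Tile32 multiplyTransposedTile(const Tile8 &a, const Tile8 &b) {
  Tile32 c{};
  for (int j = 0; j < 4; ++j) {
    for (int i = 0; i < 4; ++i) {
      std::uint32_t sum = 0;
      // transpose(A)(i, k) == A(k, i) == a[i * 4 + k], i.e. column i
      // of A read front to back; B(k, j) == b[j * 4 + k] is column j.
      for (int k = 0; k < 4; ++k)
        sum += static_cast<std::uint32_t>(a[i * 4 + k]) *
               static_cast<std::uint32_t>(b[j * 4 + k]);
      c[j * 4 + i] = sum;
    }
  }
  return c;
}
```

Each inner loop over k is exactly the four-element dot product that a
single udot lane computes.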

Note that @llvm.matrix.multiply does not make a distinction between
unsigned & signed multiplication for integers and this patch arbitrarily
uses udot. We probably have to add integer multiply variants for signed &
unsigned in the future. Also, the way this is currently integrated needs
a bit more work. It would probably be good to expose a hook where
targets can be queried for which kernels can be implemented efficiently
on the target.
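
To make the signed versus unsigned point concrete, the sketch below
computes a dot product over the same i8 bit patterns twice, once
widening the bytes as unsigned (as udot does) and once as signed (as
the corresponding AArch64 sdot instruction does). The helper names
dotUnsigned and dotSigned are hypothetical. The results differ as soon
as any byte has its top bit set and the products are accumulated in a
wider type.

```cpp
#include <cstdint>

// Widens each byte as unsigned before multiplying.
std::int32_t dotUnsigned(const std::uint8_t *a, const std::uint8_t *b,
                         int n) {
  std::int32_t sum = 0;
  for (int k = 0; k < n; ++k)
    sum += static_cast<std::int32_t>(a[k]) * static_cast<std::int32_t>(b[k]);
  return sum;
}

// Reinterprets the same bits as signed i8 before multiplying.
std::int32_t dotSigned(const std::uint8_t *a, const std::uint8_t *b,
                       int n) {
  std::int32_t sum = 0;
  for (int k = 0; k < n; ++k)
    sum += static_cast<std::int32_t>(static_cast<std::int8_t>(a[k])) *
           static_cast<std::int32_t>(static_cast<std::int8_t>(b[k]));
  return sum;
}
```

For example, the byte 0xFF contributes 255 to the unsigned sum but -1
to the signed one.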

Finally, the shuffles generated for the current lowering seem to
result in awful code for now, but the main goal of the patch is to
illustrate how target-specific instructions can be used when lowering
matrix intrinsics.