This patch adds initial fusion for load/multiply/store chains of matrix
operations.
The patch consists of roughly two parts:
1. Code generation for a fused load/multiply/store chain (LowerMatrixMultiplyFused).
First, we ensure that the loads of both multiply operands do not alias the store. If they may alias, we create new non-aliasing copies of the operands. Note that this may introduce new basic blocks. Then we split the block containing the multiply at the multiply and return the remainder of the original block, so that analysis can continue there (see 2.). Finally, we process TileSize x TileSize blocks: we load tiles from the input operands, multiply them and store the result.
2. Identify fusion candidates.
To identify candidates for fusion, we look for @llvm.matrix.multiply calls whose operands are loads and whose result has a single use in a store. To avoid generating unnecessary code for loads that later get fused, we first do a pass over the function that only tries to fuse instructions, while keeping track of all other instructions with shape information. After fusion is finished, we continue with the regular code generation for the remaining instructions with shape information.
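
For illustration only, the shape of the code described in part 1 can be sketched as ordinary C++ operating on column-major arrays. This is not the code emitted by the patch: the names fusedMultiply and overlaps are hypothetical, the runtime overlap test merely stands in for the alias check done on the IR, and edge tiles are handled by clamping, which is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: true if [a, a + an) and [b, b + bn) overlap.
static bool overlaps(const double *a, std::size_t an,
                     const double *b, std::size_t bn) {
  std::less<const double *> lt; // total order, even across distinct arrays
  return lt(a, b + bn) && lt(b, a + an);
}

// Computes C = A * B for column-major A (M x K), B (K x N), C (M x N).
// Element (row, col) of a matrix with R rows lives at index col * R + row.
void fusedMultiply(const double *A, const double *B, double *C,
                   std::size_t M, std::size_t K, std::size_t N,
                   std::size_t TileSize) {
  assert(TileSize > 0 && "tile size must be positive");

  // Step 1: if an operand may alias the destination, work on a copy so
  // that storing a result tile cannot clobber inputs that are still needed.
  std::vector<double> ACopy, BCopy;
  if (overlaps(A, M * K, C, M * N)) {
    ACopy.assign(A, A + M * K);
    A = ACopy.data();
  }
  if (overlaps(B, K * N, C, M * N)) {
    BCopy.assign(B, B + K * N);
    B = BCopy.data();
  }

  // Step 2: produce the result one tile at a time.
  std::vector<double> Tile(TileSize * TileSize);
  for (std::size_t j = 0; j < N; j += TileSize) {
    std::size_t TN = std::min(TileSize, N - j); // columns in this tile
    for (std::size_t i = 0; i < M; i += TileSize) {
      std::size_t TM = std::min(TileSize, M - i); // rows in this tile
      std::fill(Tile.begin(), Tile.end(), 0.0);

      // Load tiles of A and B along the shared dimension and accumulate.
      for (std::size_t k = 0; k < K; k += TileSize) {
        std::size_t TK = std::min(TileSize, K - k);
        for (std::size_t jj = 0; jj < TN; ++jj)
          for (std::size_t kk = 0; kk < TK; ++kk)
            for (std::size_t ii = 0; ii < TM; ++ii)
              Tile[jj * TileSize + ii] +=
                  A[(k + kk) * M + (i + ii)] * B[(j + jj) * K + (k + kk)];
      }

      // Store the finished tile into the destination.
      for (std::size_t jj = 0; jj < TN; ++jj)
        for (std::size_t ii = 0; ii < TM; ++ii)
          C[(j + jj) * M + (i + ii)] = Tile[jj * TileSize + ii];
    }
  }
}
```

The operand copies in step 1 are what make the in-place case (for example X = X * X) safe: without them, storing an early result tile would overwrite input elements that later tiles still read.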