The LLVM inliner encourages inlining of single-block callees by giving them a higher threshold. However, some large single-block callees still cannot be inlined, even though many of their instructions are redundant and could be removed after inlining.
The motivating example is a fully unrolled 3x3 matrix multiplication. It loads every element of matrices a and b three times because of the stores between the loads. The SROA analysis can figure out that these stores can be simplified, so the redundant loads should also be treated as free. Running sroa and gvn after inlining the callee removes 54% of the instructions.
define void @outer(%struct.matrix* %a, %struct.matrix* %b) {
  %c = alloca %struct.matrix
  call void @matrix_multiply(%struct.matrix* %a, %struct.matrix* %b, %struct.matrix* %c)
  ret void
}

define void @matrix_multiply(%struct.matrix* %a, %struct.matrix* %b, %struct.matrix* %c) {
This simple patch looks for repeated loads in the callee. It stops looking for loads as soon as the callee contains a store that cannot be simplified or a function call. This restriction could be relaxed to find more CSE opportunities, at the cost of a more expensive analysis.
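The heuristic described above can be illustrated with a toy model in Python. This is only a sketch, not the code in the patch: the `Inst` record, its `simplifiable` flag, and the function name `count_repeated_loads` are hypothetical, and the sketch assumes that repeated loads found before the stopping point are still counted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inst:
    """Hypothetical stand-in for an instruction in a single-block callee."""
    op: str                     # "load", "store", "call", or any other opcode
    addr: str = ""              # symbolic address, used by loads and stores
    simplifiable: bool = False  # assumed: the store is expected to fold away after inlining


def count_repeated_loads(block):
    """Count loads whose address was already loaded earlier in the block.

    The scan stops at the first call or at the first store that cannot be
    simplified, since either one may change memory between two loads.
    """
    seen = set()
    repeated = 0
    for inst in block:
        if inst.op == "call":
            break
        if inst.op == "store" and not inst.simplifiable:
            break
        if inst.op == "load":
            if inst.addr in seen:
                repeated += 1   # candidate for CSE once the callee is inlined
            else:
                seen.add(inst.addr)
    return repeated


# A fragment shaped like the unrolled matrix multiply: the same elements
# are reloaded after each (simplifiable) store to the result matrix.
example = [
    Inst("load", "a[0][0]"), Inst("load", "b[0][0]"),
    Inst("store", "c[0][0]", simplifiable=True),
    Inst("load", "a[0][0]"), Inst("load", "b[0][0]"),
    Inst("store", "c[0][1]", simplifiable=True),
    Inst("load", "a[0][0]"),
]
print(count_repeated_loads(example))
```

In this model, each repeated load would be a candidate for a cost reduction in the inline cost estimate. How the real patch weighs these loads against the threshold is not shown here.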
I tested the patch with SPEC20xx and llvm-test-suite using O3+LTO/O3/O2/Os on x86 and AArch64. Only spec2006/milc is affected, and only with O3+LTO. On x86, performance improves by 3.6% and code size increases by 0.07%. On AArch64, performance improves by 4.8% and code size is unchanged.