This patch adds unscaled loads and sign-extend loads to the TII getMemOpBaseRegImmOfs API, which is used to control clustering in the MI scheduler. This is done to create more opportunities for load pairing. I've also added the scaled LDRSWui instruction, which was missing from the list of scaled instructions. Finally, I've added support in shouldClusterLoads for clustering adjacent sext and zext loads, which can also be paired by the load/store optimizer.
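To make the sext/zext clustering idea concrete, here is a minimal standalone sketch. It is not the code from the patch and does not use LLVM's types: the Opc enum, the Load struct, and shouldClusterSketch are hypothetical names invented for illustration. It assumes scaled immediates are in units of the access size and unscaled immediates are in bytes, and it treats a sign-extending load as equivalent to its zero-extending counterpart when deciding whether two loads are adjacent.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical opcode set for illustration only. The names mirror AArch64
// load opcodes, but this is not the LLVM opcode enum.
enum class Opc { LDRWui, LDRSWui, LDURWi, LDURSWi, LDRXui, LDURXi };

// Hypothetical stand-in for a machine instruction's memory operand.
struct Load {
  Opc Opcode;
  unsigned BaseReg;
  int64_t Imm; // as encoded: element units if scaled, bytes if unscaled
};

// Width of the memory access in bytes.
int64_t accessBytes(Opc O) {
  switch (O) {
  case Opc::LDRXui:
  case Opc::LDURXi:
    return 8;
  default:
    return 4;
  }
}

bool isUnscaled(Opc O) {
  return O == Opc::LDURWi || O == Opc::LDURSWi || O == Opc::LDURXi;
}

// Normalize the immediate to a byte offset (assumption: scaled forms
// multiply the immediate by the access size).
int64_t byteOffset(const Load &L) {
  return isUnscaled(L.Opcode) ? L.Imm : L.Imm * accessBytes(L.Opcode);
}

// Map a sign-extending opcode to its zero-extending counterpart so that
// sext and zext loads compare equal for clustering purposes.
Opc canonical(Opc O) {
  switch (O) {
  case Opc::LDRSWui:
    return Opc::LDRWui;
  case Opc::LDURSWi:
    return Opc::LDURWi;
  default:
    return O;
  }
}

// Cluster only when both loads share a base register, have the same
// canonical opcode, and B starts exactly where A ends.
bool shouldClusterSketch(const Load &A, const Load &B) {
  if (A.BaseReg != B.BaseReg)
    return false;
  if (canonical(A.Opcode) != canonical(B.Opcode))
    return false;
  assert(accessBytes(A.Opcode) == accessBytes(B.Opcode));
  return byteOffset(A) + accessBytes(A.Opcode) == byteOffset(B);
}
```

In this sketch, an LDRWui at scaled offset 0 and an LDRSWui at scaled offset 1 from the same base register are treated as a clustering candidate, while a scaled/unscaled mix is rejected for simplicity.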
Overall, this patch increases the number of unscaled pairs by about 3% for Spec2006. I saw similar results for Spec2000. Also, I didn't see any serious changes in the register allocator statistics (see below).
Below is a summary of the llvm stats when comparing without and with this patch. Each entry gives the absolute change followed by the percentage change in parentheses. For example, the first stat indicates that 148 (or ~3.18%) more ldps are generated from unscaled loads, and the second that the total number of paired instructions increases by 317 (or 0.64%).
Summary:
 148 (3.18)  aarch64-ldst-opt  - Number of load/store from unscaled generated
 317 (0.64)  aarch64-ldst-opt  - Number of load/store pair instructions generated
  10 (0.42)  aarch64-ldst-opt  - Number of post-index updates folded
-272 (-0.01) asm-printer       - Number of machine instrs printed
-936 (-0.00) assembler         - Number of emitted object file bytes
   6 (0.00)  assembler         - Number of evaluated fixups
   6 (0.00)  mccodeemitter     - Number of MC fixups created.
-272 (-0.01) mccodeemitter     - Number of MC instructions emitted.
  11 (0.00)  mcexpr            - Number of MCExpr evaluations
  96 (0.00)  pei               - Number of bytes used for stack in all functions
   4 (0.00)  regalloc          - Number of copies inserted for splitting
  -1 (-0.00) regalloc          - Number of identity moves eliminated after rewriting
   3 (0.01)  regalloc          - Number of interferences evicted
   2 (0.22)  regalloc          - Number of live ranges fractured by DCE
  15 (0.01)  regalloc          - Number of new live ranges queued
   8 (0.00)  regalloc          - Number of registers assigned
   4 (0.01)  regalloc          - Number of registers unassigned
   1 (0.00)  regalloc          - Number of rematerialized defs for spilling
  -1 (-0.03) regalloc          - Number of rematerialized defs for splitting
   2 (0.01)  regalloc          - Number of spill slots allocated
   1 (0.00)  regalloc          - Number of spilled live ranges
   5 (0.04)  regalloc          - Number of split global live ranges
  -2 (-0.10) regalloc          - Number of split local live ranges
   2 (0.01)  regalloc          - Number of splits finished
   2 (0.01)  regalloc          - Number of splits that were simple
  29 (0.07)  slotindexes       - Number of local renumberings
   1 (0.03)  stackslotcoloring - Number of stack slots eliminated due to coloring
   1 (0.00)  tailduplication   - Additional instructions due to tail duplication
   1 (0.04)  tailduplication   - Number of dead blocks removed
Passed all correctness tests for EEMBC, Spec200X, and the llvm test-suite. Performance results look to be mostly noise, with minor improvements here and there.
Chad
Do we always merge zext with sext loads? I would think this should be controlled by the subtarget check we do for combining narrow loads, since this combination has the same trade-off between load and arithmetic instructions and the same increased dependency chain.