This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][GlobalISel] Add a frame-index CSE optimization during post-select.
Abandoned, Public

Authored by aemerson on Aug 27 2021, 5:08 PM.

Details

Summary

Frame index operands are replaced by physical registers during frame
lowering. Doing so, however, may require multiple instructions to
materialize the right stack pointer offset. In some cases, like G_MEMCPY
expansion, we might have many memory operations using a frame index +
offset addressing mode. If the object being referenced is too far from the
stack pointer, multiple stack address instructions could be generated for
each memory operation.
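
To make the cost concrete, here is a rough sketch that counts how many ADD instructions are needed to form `sp + offset`. It assumes only the AArch64 `ADD (immediate)` encoding: a 12-bit unsigned immediate, optionally shifted left by 12. The name `addsNeededForSpOffset` is hypothetical, and this is not the logic used by LLVM's frame lowering; it only illustrates why distant stack objects are more expensive to address.

```cpp
#include <cstdint>

// Hypothetical helper (not an LLVM API): number of ADD-immediate instructions
// needed to compute sp + Offset, assuming each ADD encodes a 12-bit unsigned
// immediate that may optionally be shifted left by 12 bits.
// Returns -1 when the offset does not fit in 24 bits and some other sequence
// (for example, materializing the constant in a register) would be required.
int addsNeededForSpOffset(uint64_t Offset) {
  if (Offset >= (1ull << 24))
    return -1;
  uint64_t Low = Offset & 0xfff;          // bits [11:0], plain imm12
  uint64_t High = (Offset >> 12) & 0xfff; // bits [23:12], imm12 with LSL #12
  int Count = (Low != 0) + (High != 0);
  // An offset of zero still needs one instruction to copy sp into a register.
  return Count == 0 ? 1 : Count;
}
```

Under these assumptions, an offset such as 4104 needs two ADDs, and that pair would be repeated for every memory operation whose offset cannot be folded into the load or store itself.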

This optimization tries to mitigate the problem by essentially CSE'ing
away frame index operands within a block and replacing their uses with the
vreg of the frame index. It is gated on the stack being large enough that
materialization costs are likely to be saved (a threshold of 2K bytes,
landed on empirically using CTMark at -Os) and on the block having enough
uses for the savings to be significant if it does fire.
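
The gating and rewrite described above can be sketched on a toy model. Everything here is an assumption made for illustration: `MemOp` is a stand-in struct rather than an LLVM `MachineInstr`, `cseFrameIndices` is a hypothetical name, and `MinUsesPerBlock` is an invented placeholder because the summary does not state the actual use-count threshold. Only the 2048-byte stack size comes from the summary.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Toy stand-in for a memory instruction; not an LLVM MachineInstr.
struct MemOp {
  int FrameIndex; // -1 if the address is not a frame index
  int64_t Offset; // byte offset from the base
  int BaseVReg;   // -1 until a shared base vreg is assigned
};

// 2048 mirrors the "2K bytes" from the summary. The minimum use count is an
// invented placeholder, not a value taken from the patch.
constexpr uint64_t MinStackSize = 2048;
constexpr std::size_t MinUsesPerBlock = 4;

// Rewrites the ops of one block so that every frame index with enough uses is
// addressed through a single shared vreg. Returns the number of vregs created.
int cseFrameIndices(std::vector<MemOp> &Block, uint64_t StackSize,
                    int &NextVReg) {
  if (StackSize < MinStackSize)
    return 0; // small frames: offsets are assumed cheap to materialize

  // Count uses of each frame index within the block.
  std::map<int, std::size_t> Uses;
  for (const MemOp &Op : Block)
    if (Op.FrameIndex >= 0)
      ++Uses[Op.FrameIndex];

  // Replace qualifying frame index operands with one shared vreg each.
  std::map<int, int> BaseFor; // frame index -> shared vreg
  for (MemOp &Op : Block) {
    if (Op.FrameIndex < 0 || Uses[Op.FrameIndex] < MinUsesPerBlock)
      continue;
    auto It = BaseFor.find(Op.FrameIndex);
    if (It == BaseFor.end())
      It = BaseFor.emplace(Op.FrameIndex, NextVReg++).first;
    Op.BaseVReg = It->second;
    Op.FrameIndex = -1;
  }
  return static_cast<int>(BaseFor.size());
}
```

In this model a frame index used four or more times in a block ends up with one address computation instead of one per memory operation, while rarely used frame indices and small frames are left alone.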

Event Timeline

aemerson created this revision. Aug 27 2021, 5:08 PM
aemerson requested review of this revision. Aug 27 2021, 5:08 PM

The code size results at -Os on CTMark:

Program             before        after        diff 
 bullet              475904       475912       0.0%
 SPASS               412780       412780       0.0%
 consumer-typeset    419140       419140       0.0%
 kc                  432528       432528       0.0%
 lencod              430124       430124       0.0%
 sqlite3             287628       287628       0.0%
 tramp3d-v4          367780       367756      -0.0%
 7zip-benchmark      570124       569984      -0.0%
 clamscan            383940       383688      -0.1%
 pairlocalalign      249760       249180      -0.2%
 Geomean difference                           -0.0%
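
For readers checking the last row, the sketch below shows one way such a figure can be derived. It assumes the "Geomean difference" is the geometric mean of the per-program after/before ratios, expressed as a percentage change; the function name `geomeanDiffPercent` is hypothetical and the tool that produced the table may compute it differently.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Geometric-mean size change, in percent, over (before, after) size pairs.
// Assumes Sizes is non-empty and every size is positive.
double geomeanDiffPercent(
    const std::vector<std::pair<double, double>> &Sizes) {
  double LogSum = 0.0;
  for (const auto &Entry : Sizes)
    LogSum += std::log(Entry.second / Entry.first); // log of after/before
  // exp(mean of logs) is the geometric mean of the ratios.
  return (std::exp(LogSum / Sizes.size()) - 1.0) * 100.0;
}
```

Fed the ten (before, after) pairs from the table, this yields a small negative change of a few hundredths of a percent, which is consistent with the -0.0% shown once rounded to one decimal place.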

How is this different from LocalStackSlotAllocation?

aemerson abandoned this revision. Aug 30 2021, 4:15 PM

How is this different from LocalStackSlotAllocation?

I wasn't familiar with the workings of that pass; it does look like it's trying to do the same thing.