This is an archive of the discontinued LLVM Phabricator instance.

Add ArgumentCopyElision MI pass
AbandonedPublic

Authored by rnk on Jan 5 2017, 5:24 PM.

Details

Summary

LLVM frequently copies arguments from the stack when generating
unoptimized code or when an argument's address is taken. This pass
identifies stores of loads of argument stack slots in the entry block,
deletes the store, and replaces all uses of the frame index with the
original load's frame index.

It is currently disabled by default under the flag
-enable-arg-copy-elim.

In a 32-bit release build of Chrome, this saves 20K from both
chrome_child.dll and chrome.dll. Each DLL is about 50MB, so this is less
than 0.05% for each, so it's not what I'd hoped, but it's simple enough
to keep. In the future, this pass could be extended to handle aggregates
that have been split into multiple LLVM arguments in clang.

The potential wins in a debug build are much greater because all
arguments will live in memory, but I think we should do this whole thing
in fast isel instead. First, x86 fast isel needs to know how to lower
arguments in memory. When we do that, it should be easy to add the
IR-equivalent pattern matching to identify allocas that are immediately
initialized by arguments passed in memory and use the appropriate
argument frame index from the beginning of ISel.

Fixes PR26328

Event Timeline

rnk updated this revision to Diff 83329.Jan 5 2017, 5:24 PM
rnk retitled this revision from to Add ArgumentCopyElision MI pass.
rnk updated this object.
rnk added reviewers: chandlerc, MatzeB, qcolombet, inglorion, hans.
rnk added a subscriber: llvm-commits.
inglorion edited edge metadata.Jan 6 2017, 10:11 AM

Besides the effect on binary size, do we know what other effects this has? For example, what is the effect on build time and run-time performance?

rnk added a comment.Jan 6 2017, 5:15 PM

I haven't measured either yet.

Build time effect should be below the noise because it is O(entry block instructions) if there are no address taken parameters, which seems to be the common case given that this doesn't have much impact on overall size. If there are address-taken parameters, it iterates all instructions once to replace frame indices. I don't think it's common practice to measure for compile time regressions when adding one more linear pass every function.

The runtime effects will definitely be positive, since we're eliding copies, but given how low the size impact was, I don't expect it will be measurable either :(. Maybe we should chat about different ways we could measure it.

Right. The reason I asked is that it seems to me that enabling the pass will be neutral or better for code size and run time, while not costing much or possibly even being beneficial at compile time (we do a bit of extra analysis on a few instructions, but then we lower the number of instructions that the rest of the build operates on). If that is indeed the case, I would be in favor of enabling the new pass by default.

rnk updated this revision to Diff 84058.Jan 11 2017, 5:18 PM
rnk edited edge metadata.
  • Remove leftover fastisel test
  • Add ARM test
  • Clarify comments
rnk added a comment.Jan 13 2017, 11:13 AM

Any thoughts on this? I think the basic algorithm is sound, but I'd appreciate it if someone took a quick glance.

MatzeB edited edge metadata.Jan 13 2017, 11:16 AM

Can you explain why this needs an own pass. My first intuition would have been that this should be simple to solve at the SelectionDAG level.

inglorion requested changes to this revision.Jan 13 2017, 11:42 AM
inglorion edited edge metadata.

The implementation looks good to me. I think the ARM test is missing a few cases.

test/CodeGen/ARM/argument-copy-elision.ll
5 ↗(On Diff #84058)

I expect this should have the retaddr_taken and size_mismatch tests, too. If not, I think you can remove this declaration.

This revision now requires changes to proceed.Jan 13 2017, 11:42 AM
rnk added a comment.Jan 13 2017, 1:07 PM

Can you explain why this needs an own pass. My first intuition would have been that this should be simple to solve at the SelectionDAG level.

Initially I wanted it to be independent of ISel because we have 3 (!) ISel implementations. In particular, I was hoping to improve our -O0 code so that it's less bulky. However, I realized that x86 fastisel doesn't even handle arguments in memory, so I gave up on that. So, I guess it would be better to do this as part of SDAG.

Ideally, we'd do this as part of FunctionLoweringInfo::set so that the StaticAlloca map is populated with the fixed argument frame indices from the beginning of ISel, but we need to get the target to lower formal arguments first, which happens later. I could try lowering formal arguments up front, but that seemed more invasive than doing a standalone pass.

rnk abandoned this revision.Feb 7 2017, 11:54 AM

I rewrote this as an IR analysis and uploaded that as D29668