This is an archive of the discontinued LLVM Phabricator instance.

[ArgumentPromotion] Allow the frontend to specify the maximum number of elements to promote on a per-function basis via metadata.
Needs Review · Public

Authored by pcwalton on Nov 5 2022, 1:25 PM.

Details

Reviewers
nikic
Summary

In Rust, argument promotion on destructor functions (core::ptr::drop_in_place)
is an important optimization, because those functions are marked cold in
exceptional paths and as such are rarely inlined, making them often the only
functions standing in the way of SROA. The default cap on the number of
elements to promote, 2, is too conservative for these specific functions in
Rust. At the same time, we don't want to raise the element cap across the
board, because that could make argument promotion unprofitable in some
circumstances. Additionally, argument promotion only runs at -O3 today, but for
these functions argument promotion is so important that the Rust frontend wants
to run the pass at any optimization level.

This patch addresses both of these problems by introducing a new piece of
per-function metadata, !argpromotion !{i64}. The i64 value represents the
maximum number of elements that argument promotion will promote into and
overrides the MaxElements setting that the pass itself defaults to. At -O3, the
Rust frontend can run argument promotion as usual with MaxElements set to 2; at
-O2, it can run the pass with MaxElements set to 0. The frontend will tag
destructor functions with !argpromotion set to some high value, perhaps 8 or 16.
This should allow better optimization for typical Rust code, especially code
that uses iterators frequently.
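
The override rule described in the summary can be sketched in a few lines of standalone C++. This is an illustration only, not code from the patch: the names ArgPromotionCap, effectiveMaxElements, and withinCap are hypothetical, and it assumes, as the summary implies, that a cap of 0 means "promote nothing" unless the function carries its own metadata.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a function's "!argpromotion !{i64 N}" metadata.
// std::nullopt models a function that carries no such metadata.
using ArgPromotionCap = std::optional<std::uint64_t>;

// Pick the element cap for one function: per-function metadata, when present,
// overrides the pass-wide MaxElements setting.
constexpr std::uint64_t effectiveMaxElements(ArgPromotionCap metadataCap,
                                             std::uint64_t passMaxElements) {
  return metadataCap.value_or(passMaxElements);
}

// Assumption: an argument is eligible only if the number of pieces it would be
// split into does not exceed the effective cap (so a cap of 0 rejects everything).
constexpr bool withinCap(std::uint64_t numElements, ArgPromotionCap metadataCap,
                         std::uint64_t passMaxElements) {
  return numElements <= effectiveMaxElements(metadataCap, passMaxElements);
}

// Small self-check mirroring the -O3 / -O2 scenarios from the summary.
inline void selfCheck() {
  assert(withinCap(2, std::nullopt, 2));        // -O3, untagged: default cap of 2
  assert(!withinCap(5, std::nullopt, 2));       // -O3, untagged: too many elements
  assert(!withinCap(1, std::nullopt, 0));       // -O2, untagged: nothing promoted
  assert(withinCap(5, ArgPromotionCap{16}, 0)); // -O2, tagged destructor: promoted
}
```

Under these assumptions, running the pass with MaxElements set to 0 leaves untagged functions alone while tagged destructor functions are still promoted up to their own cap.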

Event Timeline

pcwalton created this revision. Nov 5 2022, 1:25 PM
Herald added a project: Restricted Project. Nov 5 2022, 1:25 PM
pcwalton requested review of this revision. Nov 5 2022, 1:25 PM
nikic added a comment. Nov 5 2022, 1:40 PM

Could you please share some of the motivating (unreduced) IR samples? Making this configurable via metadata sounds sensible in principle, but I'm not sure your overall plan ("The frontend will tag destructor functions with !argpromotion set to some high value, perhaps 8 or 16") is viable.

pcwalton added a comment. Edited Nov 5 2022, 5:22 PM

Sure. Here's backtrace::symbolize::gimli::macho::Object::parse in backtrace-rs.

No argument promotion: https://gist.github.com/pcwalton/38ff053dd1973ca383b85183e13de0fc

With argument promotion with elements = 99 (converting hidden to internal so arg promotion works): https://gist.github.com/pcwalton/6195ccd7b1b49cde8920f3e65c048b84

Notice that:

  • Many of the structures have more than 2 elements.
  • There are no memcpys in parse() with argument promotion. Without it, there are 384 bytes in 13 stack-to-stack memcpys. Teaching memcpyopt to eliminate these would require a lot of expensive flow-sensitive analysis, if it's feasible at all.

Here is the codegen without the argument promotion: https://gist.github.com/pcwalton/6c589ee875d5f6a639d2139990db66d2

And with argument promotion with max elements at 99: https://gist.github.com/pcwalton/f359acd836a61ceb8669c6ce823d141e

Notice that, with the argument promotion, we reduce the size of the stack in that function from 408 bytes to 328 bytes, a savings of about 20%. The main remaining memcpy-like codegen segment is basically just copying into the retptr, which we may be able to eliminate in other ways.

Looking at this again, there are still quite a few spills that effectively turn into memcpys, so this isn't a panacea, but I'll take a ~20% stack size win. In any case, turning the fields into SSA values will probably be a prerequisite for further optimizations.

pcwalton added a comment. Edited Nov 5 2022, 10:39 PM

As an alternative, I wonder if we could teach SROA to do a form of argument promotion for nocapture noalias readonly dereferenceable aligned arguments. If the only thing blocking an alloca from SROA is being passed into a function by nocapture noalias readonly dereferenceable aligned pointer, insert a new alloca and memcpy to "reform" the structure right before the call, change the call to pass a pointer to the new alloca containing the "reformed" structure, and then the original alloca becomes SROAable. This could avoid all those ugly spills and reloads in the drop_in_place function bodies... assuming we could come up with some kind of heuristic to know when it's worth inserting copies to perform SROA in these instances.
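
To make the "reform the structure before the call" idea concrete, here is a source-level C++ analogue. It is only an illustration under stated assumptions, not LLVM IR and not code from any patch: the names Pair, sum, before, and after are hypothetical, and sum stands in for a callee whose pointer argument would be nocapture and readonly.

```cpp
#include <cassert>
#include <cstdint>

struct Pair {
  std::int64_t a;
  std::int64_t b;
};

// Stand-in for the callee: it only reads through the pointer and does not
// retain it (the nocapture/readonly situation described above).
std::int64_t sum(const Pair *p) { return p->a + p->b; }

// Before: `local` has its address passed to sum(), so the whole struct has to
// stay in memory and cannot be split into independent scalars.
std::int64_t before(std::int64_t x) {
  Pair local{x, x + 1};
  local.a *= 2;
  return sum(&local) + local.b;
}

// After: the fields live as separate scalars. Just before the call, the
// struct is re-formed in a fresh temporary (the inserted alloca plus copy),
// and the call receives a pointer to that temporary instead.
std::int64_t after(std::int64_t x) {
  std::int64_t a = x;
  std::int64_t b = x + 1;
  a *= 2;
  Pair reformed{a, b};       // copy made only for the call
  return sum(&reformed) + b; // callee is read-only, so b is still valid here
}

// Both versions compute the same result for a given input.
inline void selfCheck(std::int64_t x) { assert(before(x) == after(x)); }
```

The copy into the temporary is the cost the heuristic mentioned above would have to weigh against the benefit of keeping the original fields in registers.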

There is a new argpromotion pass in the works:
https://reviews.llvm.org/D119013

Can't the heuristics be improved instead?
Offloading this decision to (each) front-end seems rather sub-optimal.

> As an alternative, I wonder if we could teach SROA to do a form of argument promotion for nocapture noalias readonly dereferenceable aligned arguments. If the only thing blocking an alloca from SROA is being passed into a function by nocapture noalias readonly dereferenceable aligned pointer, insert a new alloca and memcpy to "reform" the structure right before the call, change the call to pass a pointer to the new alloca containing the "reformed" structure, and then the original alloca becomes SROAable. This could avoid all those ugly spills and reloads in the drop_in_place function bodies... assuming we could come up with some kind of heuristic to know when it's worth inserting copies to perform SROA in these instances.

FWIW, that is pretty much exactly what I'm doing in https://reviews.llvm.org/D113520. It seems to work, but it needs some legality check, because it can end up trying to promote the backing alloca over and over again.

> FWIW, that is pretty much exactly what I'm doing in https://reviews.llvm.org/D113520. It seems to work, but it needs some legality check, because it can end up trying to promote the backing alloca over and over again.

Yeah, I fixed your patch, but I think it still miscompiles part of rustc. I'm working on a fix now. :)