This is an archive of the discontinued LLVM Phabricator instance.

Differential D137497

[ArgumentPromotion] Allow the frontend to specify the maximum number of elements to promote on a per-function basis via metadata.
Needs Review · Public

Authored by pcwalton on Nov 5 2022, 1:25 PM.

Details

Reviewers

nikic

Summary

In Rust, argument promotion on destructor functions (core::ptr::drop_in_place)
is an important optimization, because those functions are marked cold in
exceptional paths and as such are rarely inlined, making them often the only
functions standing in the way of SROA. The default cap on the number of
elements to promote, 2, is too conservative for these specific functions in
Rust. At the same time, we don't want to raise the element cap across the
board, because that could make argument promotion unprofitable in some
circumstances. Additionally, argument promotion only runs at -O3 today, but for
these functions argument promotion is so important that the Rust frontend wants
to run the pass at any optimization level.

This patch addresses both of these problems by introducing a new piece of
per-function metadata, !argpromotion !{i64}. The i64 value represents the
maximum number of elements that argument promotion will promote into and
overrides the MaxElements setting that the pass itself defaults to. At -O3, the
Rust frontend can run argument promotion as usual with MaxElements set to 2; at
-O2, it can run the pass with MaxElements set to 0. The frontend will tag
destructor functions with !argpromotion set to some high value, perhaps 8 or
16.

This should allow better optimization for typical Rust code, especially code
that uses iterators frequently.
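
The override rule described in the summary can be modeled outside of LLVM roughly as follows. This is a minimal sketch, not the pass's actual code: `effectiveMaxElements` and `shouldAttemptPromotion` are hypothetical names, and `std::optional` stands in for the presence or absence of an !argpromotion attachment on the function.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical model of the limit selection described in the summary.
// `metadataLimit` stands in for the i64 operand of !argpromotion, if present;
// `passDefault` stands in for the pass-level MaxElements setting.
inline unsigned effectiveMaxElements(std::optional<std::uint64_t> metadataLimit,
                                     unsigned passDefault) {
  if (metadataLimit)
    return static_cast<unsigned>(*metadataLimit);
  return passDefault;
}

// In this model, an effective limit of 0 means "skip this function", which is
// how a frontend could run the pass at -O2 and only touch tagged functions.
inline bool shouldAttemptPromotion(std::optional<std::uint64_t> metadataLimit,
                                   unsigned passDefault) {
  return effectiveMaxElements(metadataLimit, passDefault) != 0;
}
```

Under these assumptions, an untagged function at -O2 (pass default 0) is skipped, while a destructor tagged with a limit of 8 is still considered.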

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	30 ms	x64 debian > LLVM.Transforms/ArgumentPromotion::arg-promotion-metadata.ll

Event Timeline

pcwalton created this revision. · Nov 5 2022, 1:25 PM

Herald added a project: Restricted Project. · Nov 5 2022, 1:25 PM

Herald added subscribers: ormris, JDevlieghere, hiraditya.

pcwalton requested review of this revision. · Nov 5 2022, 1:25 PM

Herald added a project: Restricted Project. · Nov 5 2022, 1:25 PM

Herald added a subscriber: llvm-commits.

pcwalton added a reviewer: nikic. · Nov 5 2022, 1:28 PM

Could you please share some of the motivating (unreduced) IR samples? Making this configurable via metadata sounds sensible in principle, but I'm not sure your overall plan ("The frontend will tag destructor functions with !argpromotion set to some high value, perhaps 8 or 16") is viable.

Harbormaster completed remote builds in B196315: Diff 473455. · Nov 5 2022, 2:17 PM

tschuett added a subscriber: tschuett. · Nov 5 2022, 2:50 PM

Sure. Here's backtrace::symbolize::gimli::macho::Object::parse in backtrace-rs.

No argument promotion: https://gist.github.com/pcwalton/38ff053dd1973ca383b85183e13de0fc

With argument promotion with elements = 99 (converting hidden to internal so arg promotion works): https://gist.github.com/pcwalton/6195ccd7b1b49cde8920f3e65c048b84

Notice that:

Many of the structures have more elements than 2.

There are no memcpys in parse() with argument promotion. Without it there are 384 bytes in 13 stack to stack memcpys. Teaching memcpyopt to eliminate these would require a lot of expensive flow sensitive analysis if it's even feasible at all.

Here is the codegen without the argument promotion: https://gist.github.com/pcwalton/6c589ee875d5f6a639d2139990db66d2

And with argument promotion with max elements at 99: https://gist.github.com/pcwalton/f359acd836a61ceb8669c6ce823d141e

Notice that, with the argument promotion, we reduce the size of the stack in that function from 432 bytes to 328 bytes, a 24% savings. The main remaining memcpy-like codegen segment is basically just copying into the retptr, which we may be able to eliminate in other ways.

Looking at this again, there are still quite a few spills that turn into effectively memcpys, so this isn't a panacea, but I'll take a ~25% stack size win. In any case turning the fields into SSA values will probably be a prerequisite for further optimizations.

As an alternative, I wonder if we could teach SROA to do a form of argument promotion for nocapture noalias readonly dereferenceable aligned arguments. If the only thing blocking an alloca from SROA is being passed into a function by nocapture noalias readonly dereferenceable aligned pointer, insert a new alloca and memcpy to "reform" the structure right before the call, change the call to pass a pointer to the new alloca containing the "reformed" structure, and then the original alloca becomes SROAable. This could avoid all those ugly spills and reloads in the drop_in_place function bodies... assuming we could come up with some kind of heuristic to know when it's worth inserting copies to perform SROA in these instances.
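
A rough source-level analogue of that idea, written in C++ rather than LLVM IR, is sketched below. All names here (`Pair`, `sumFields`, `callerBefore`, `callerAfter`) are hypothetical, and the sketch only illustrates the shape of the rewrite; whether SROA could decide to do this profitably is exactly the open heuristic question.

```cpp
struct Pair {
  int a;
  int b;
};

// Stand-in for a callee that only reads through a non-escaping pointer
// (the nocapture noalias readonly dereferenceable case).
int sumFields(const Pair *p) { return p->a + p->b; }

// Before: the address of `local` is passed to a call, so the whole
// aggregate has to stay in memory for its entire lifetime.
int callerBefore(int x, int y) {
  Pair local{x, y};
  local.a += 1;
  return sumFields(&local);
}

// After: the fields live as independent scalars, and a fresh temporary
// is "reformed" from them only right before the call.
int callerAfter(int x, int y) {
  int a = x;
  int b = y;
  a += 1;
  Pair reformed{a, b};
  return sumFields(&reformed);
}
```

Both callers compute the same result; the difference is that in the second form only the short-lived `reformed` temporary needs a stack slot.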

There is a new argpromotion pass in the works:
https://reviews.llvm.org/D119013

Can't the heuristics be improved instead?
Offloading this decision to (each) front-end seems rather sub-optimal.

In D137497#3910548, @pcwalton wrote:

As an alternative, I wonder if we could teach SROA to do a form of argument promotion for nocapture noalias readonly dereferenceable aligned arguments. If the only thing blocking an alloca from SROA is being passed into a function by nocapture noalias readonly dereferenceable aligned pointer, insert a new alloca and memcpy to "reform" the structure right before the call, change the call to pass a pointer to the new alloca containing the "reformed" structure, and then the original alloca becomes SROAable. This could avoid all those ugly spills and reloads in the drop_in_place function bodies... assuming we could come up with some kind of heuristic to know when it's worth inserting copies to perform SROA in these instances.

FWIW, that is rather exactly what I'm doing in https://reviews.llvm.org/D113520. It seems to work, but it needs some legality check, because it can end up trying to promote the backing alloca over and over again.

In D137497#3912021, @lebedev.ri wrote:

FWIW, that is rather exactly what I'm doing in https://reviews.llvm.org/D113520. It seems to work, but it needs some legality check, because it can end up trying to promote the backing alloca over and over again.

Yeah, I fixed your patch, but I think it still miscompiles part of rustc. I'm working on a fix now. :)

Revision Contents

Path	Size
llvm/lib/Transforms/IPO/ArgumentPromotion.cpp	13 lines
llvm/test/Transforms/ArgumentPromotion/arg-promotion-metadata.ll	95 lines

Diff 473455

llvm/lib/Transforms/IPO/ArgumentPromotion.cpp

   do {
     LocalChange = false;

     FunctionAnalysisManager &FAM =
         AM.getResult<FunctionAnalysisManagerCGSCCProxy>(C, CG).getManager();

     bool IsRecursive = C.size() > 1;
     for (LazyCallGraph::Node &N : C) {
       Function &OldF = N.getFunction();
-      Function *NewF = promoteArguments(&OldF, FAM, MaxElements, IsRecursive);
+      // If the IR specifies a maximum element count via !argpromotion
+      // metadata, take that into account.
+      unsigned MaxElems = MaxElements;
+      if (MDNode *MD = OldF.getMetadata("argpromotion"))
+        MaxElems =
+            mdconst::extract<ConstantInt>(MD->getOperand(0))->getZExtValue();
+
+      if (MaxElems == 0)
+        continue;
+
+      Function *NewF = promoteArguments(&OldF, FAM, MaxElems, IsRecursive);
       if (!NewF)
         continue;
       LocalChange = true;

       // Directly substitute the functions in the call graph. Note that this
       // requires the old function to be completely dead and completely
       // replaced by the new function. It does no call graph updates, it merely
       // swaps out the particular function mapped to a particular node in the

llvm/test/Transforms/ArgumentPromotion/arg-promotion-metadata.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; Tests that !argpromotion metadata allows the frontend to specify the maximum
				; number of elements we choose to split.
				; RUN: opt < %s -passes=argpromotion -S | FileCheck %s

				%struct.Foo = type { i32, i32, i32, i32 }

				@constant = private constant %struct.Foo { i32 1, i32 2, i32 3, i32 4 }, align 4

				; Argument promotion defaults to only promoting 2 elements; with no metadata
				; we won't promote anything.
				define internal void @no_metadata(ptr noundef dereferenceable(4) align 4 %0) {
				; CHECK-LABEL: @no_metadata(
				; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds [[STRUCT_FOO:%.*]], ptr [[TMP0:%.*]], i32 0, i32 0
				; CHECK-NEXT: [[TMP3:%.*]] = load i32, ptr [[TMP2]], align 4
				; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds [[STRUCT_FOO]], ptr [[TMP0]], i32 0, i32 1
				; CHECK-NEXT: [[TMP5:%.*]] = load i32, ptr [[TMP4]], align 4
				; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds [[STRUCT_FOO]], ptr [[TMP0]], i32 0, i32 2
				; CHECK-NEXT: [[TMP7:%.*]] = load i32, ptr [[TMP6]], align 4
				; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds [[STRUCT_FOO]], ptr [[TMP0]], i32 0, i32 3
				; CHECK-NEXT: [[TMP9:%.*]] = load i32, ptr [[TMP8]], align 4
				; CHECK-NEXT: ret void
				;
				%2 = getelementptr inbounds %struct.Foo, ptr %0, i32 0, i32 0
				%3 = load i32, ptr %2, align 4
				%4 = getelementptr inbounds %struct.Foo, ptr %0, i32 0, i32 1
				%5 = load i32, ptr %4, align 4
				%6 = getelementptr inbounds %struct.Foo, ptr %0, i32 0, i32 2
				%7 = load i32, ptr %6, align 4
				%8 = getelementptr inbounds %struct.Foo, ptr %0, i32 0, i32 3
				%9 = load i32, ptr %8, align 4
				ret void
				}

				; When we override the maximum number of elements with metadata, we promote
				; arguments.
				define internal void @metadata_4(ptr noundef dereferenceable(4) align 4 %0) !argpromotion !{i64 4} {
				; CHECK-LABEL: @metadata_4(
				; CHECK-NEXT: ret void
				;
				%2 = getelementptr inbounds %struct.Foo, ptr %0, i32 0, i32 0
				%3 = load i32, ptr %2, align 4
				%4 = getelementptr inbounds %struct.Foo, ptr %0, i32 0, i32 1
				%5 = load i32, ptr %4, align 4
				%6 = getelementptr inbounds %struct.Foo, ptr %0, i32 0, i32 2
				%7 = load i32, ptr %6, align 4
				%8 = getelementptr inbounds %struct.Foo, ptr %0, i32 0, i32 3
				%9 = load i32, ptr %8, align 4
				ret void
				}

				; The metadata requested a maximum number of 3 elements, but we have 4, so
				; don't promote anything.
				define internal void @metadata_3(ptr noundef dereferenceable(4) align 4 %0) !argpromotion !{i64 3} {
				; CHECK-LABEL: @metadata_3(
				; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds [[STRUCT_FOO:%.*]], ptr [[TMP0:%.*]], i32 0, i32 0
				; CHECK-NEXT: [[TMP3:%.*]] = load i32, ptr [[TMP2]], align 4
				; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds [[STRUCT_FOO]], ptr [[TMP0]], i32 0, i32 1
				; CHECK-NEXT: [[TMP5:%.*]] = load i32, ptr [[TMP4]], align 4
				; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds [[STRUCT_FOO]], ptr [[TMP0]], i32 0, i32 2
				; CHECK-NEXT: [[TMP7:%.*]] = load i32, ptr [[TMP6]], align 4
				; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds [[STRUCT_FOO]], ptr [[TMP0]], i32 0, i32 3
				; CHECK-NEXT: [[TMP9:%.*]] = load i32, ptr [[TMP8]], align 4
				; CHECK-NEXT: ret void
				;
				%2 = getelementptr inbounds %struct.Foo, ptr %0, i32 0, i32 0
				%3 = load i32, ptr %2, align 4
				%4 = getelementptr inbounds %struct.Foo, ptr %0, i32 0, i32 1
				%5 = load i32, ptr %4, align 4
				%6 = getelementptr inbounds %struct.Foo, ptr %0, i32 0, i32 2
				%7 = load i32, ptr %6, align 4
				%8 = getelementptr inbounds %struct.Foo, ptr %0, i32 0, i32 3
				%9 = load i32, ptr %8, align 4
				ret void
				}

				define i32 @main() {
				; CHECK-LABEL: @main(
				; CHECK-NEXT: call void @no_metadata(ptr noundef align 4 dereferenceable(4) @constant)
				; CHECK-NEXT: [[CONSTANT_VAL:%.*]] = load i32, ptr @constant, align 4
				; CHECK-NEXT: [[TMP1:%.*]] = getelementptr i8, ptr @constant, i64 4
				; CHECK-NEXT: [[CONSTANT_VAL1:%.*]] = load i32, ptr [[TMP1]], align 4
				; CHECK-NEXT: [[TMP2:%.*]] = getelementptr i8, ptr @constant, i64 8
				; CHECK-NEXT: [[CONSTANT_VAL2:%.*]] = load i32, ptr [[TMP2]], align 4
				; CHECK-NEXT: [[TMP3:%.*]] = getelementptr i8, ptr @constant, i64 12
				; CHECK-NEXT: [[CONSTANT_VAL3:%.*]] = load i32, ptr [[TMP3]], align 4
				; CHECK-NEXT: call void @metadata_4(i32 [[CONSTANT_VAL]], i32 [[CONSTANT_VAL1]], i32 [[CONSTANT_VAL2]], i32 [[CONSTANT_VAL3]])
				; CHECK-NEXT: call void @metadata_3(ptr noundef align 4 dereferenceable(4) @constant)
				; CHECK-NEXT: ret i32 0
				;
				call void @no_metadata(ptr noundef dereferenceable(4) align 4 @constant)
				call void @metadata_4(ptr noundef dereferenceable(4) align 4 @constant)
				call void @metadata_3(ptr noundef dereferenceable(4) align 4 @constant)
				ret i32 0
				}
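
The three expectations in this test can be summarized with a small standalone model. This is a sketch under stated assumptions rather than the pass's real logic: `wouldPromote` and `kFooParts` are hypothetical names, and the model only captures the element-count cap, ignoring every other legality check argument promotion performs.

```cpp
// Hypothetical model: a pointer argument whose loads split into `numParts`
// pieces is promoted only when a nonzero `limit` is at least `numParts`.
constexpr bool wouldPromote(unsigned numParts, unsigned limit) {
  return limit != 0 && numParts <= limit;
}

// %struct.Foo has four i32 fields, and each test function loads all of them.
constexpr unsigned kFooParts = 4;

static_assert(!wouldPromote(kFooParts, 2), "@no_metadata: default cap of 2");
static_assert(wouldPromote(kFooParts, 4), "@metadata_4: cap raised to 4");
static_assert(!wouldPromote(kFooParts, 3), "@metadata_3: cap of 3 is too low");
```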