
[PGO] IRPGO pre-cleanup pass changes
ClosedPublic

Authored by xur on Jun 15 2016, 1:31 PM.

Details

Summary

This patch adds a selected set of cleanup passes, including a pre-inline pass, before LLVM IR PGO instrumentation. The pre-inliner is only intended to apply the obvious/trivial inlines before instrumentation, so that much less instrumentation is needed to get good profiling information. This drastically improves the instrumented-code performance for large C++ applications. Another benefit is the context-sensitive counts, which can potentially improve the PGO optimizations.
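To make the instrumentation saving concrete, here is a toy model (function name `counterUpdates` and all numbers are illustrative, not from the patch) of why inlining a trivial single-basic-block callee reduces instrumentation work: the callee's entry counter, bumped once per dynamic call, folds into a caller block that is already counted, and the counts that remain are per-callsite, i.e. context sensitive.

```cpp
#include <cassert>

// Toy model of dynamic counter updates with and without pre-inlining a
// trivial single-basic-block callee. All numbers are made up for
// illustration; real instrumentation places counters on a minimum
// spanning tree of the CFG, but the effect is the same: a single-block
// inlinee adds no new edges to count in the caller.
long counterUpdates(long callerOwnUpdates, long callsToTrivialCallee,
                    bool preInlined) {
  // Without pre-inlining, the callee's entry counter is incremented on
  // every dynamic call. After inlining, its single block merges into a
  // caller block that already has a counter, so no extra update runs.
  long calleeUpdates = preInlined ? 0 : callsToTrivialCallee;
  return callerOwnUpdates + calleeUpdates;
}
```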

This was previously discussed here:
http://lists.llvm.org/pipermail/llvm-dev/2015-August/089425.html

Here I re-collected some of the data to show the impact, using the following:
(1) SPEC2006 C and C++ programs
(2) some Google internal benchmarks (C++ applications of various sizes)
(3) an instrumented LLVM compiler

The metrics collected are:
(1) instrumented code runtime performance.
https://docs.google.com/document/d/1DCP7KPjtitHMotX2KgTO8PADd9zjLzPJgSXZ8egd6Mk/edit?usp=sharing

(2) program text size (i.e., the sum of the sizes of sections that have both an address and a name, minus the sizes of writable sections)
https://docs.google.com/document/d/1ItOetOEc1vTpU54EBjQK4O0LfulFIv3gs9XXnmzwqdo/edit?usp=sharing

(3) program binary size
https://docs.google.com/document/d/1nmiR6He0cxMFqMzWE-X7FdhL5jSzdS9h7WLxDrfmSxQ/edit?usp=sharing
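As a sketch of how metric (2) above could be computed, the following toy code applies that definition to a made-up section list (the `Section` struct, `programTextSize`, and all values are hypothetical stand-ins, not the actual measurement script):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of the "program text size" metric defined above:
// the sum of the sizes of sections that have both an address and a
// name, minus the sizes of writable sections.
struct Section {
  std::string name;
  unsigned long addr;
  unsigned long size;
  bool writable;
};

long programTextSize(const std::vector<Section> &sections) {
  long total = 0;
  for (const Section &s : sections) {
    if (s.addr != 0 && !s.name.empty())
      total += static_cast<long>(s.size); // counted: has address and name
    if (s.writable)
      total -= static_cast<long>(s.size); // writable sections subtracted
  }
  return total;
}
```

Under this definition a writable, allocated section such as .data nets out to zero, and unallocated sections such as debug info are excluded entirely.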

Diff Detail

Event Timeline

xur updated this revision to Diff 60891.Jun 15 2016, 1:31 PM
xur retitled this revision from to [PGO] IRPGO pre-cleanup pass changes.
xur updated this object.
xur added reviewers: silvas, vsk, davidxl, mehdi_amini, tejohnson.
xur added a subscriber: llvm-commits.
xur added a comment.Jun 15 2016, 1:41 PM

The results are also attached here as PDFs.

mehdi_amini edited edge metadata.Jun 15 2016, 2:08 PM

I really don't like the fact that the PGO pipeline will be different from the non-PGO pipeline (other than what is required for instrumentation). I wonder what other people will think of that.

lib/Transforms/IPO/PassManagerBuilder.cpp
128

Why two options?

228

These seem pretty arbitrary numbers?

245

Where is this list coming from? How is it computed?

vsk edited edge metadata.Jun 15 2016, 2:12 PM

This looks great! Thanks for putting together the detailed performance reports.

I noticed that there is one small slowdown with 401.bzip2. Do you know what aspect of the pipeline is responsible for the change? What were your criteria for picking the pre-inline pipeline?

lib/Transforms/IPO/PassManagerBuilder.cpp
122

Is zero or more needed?

xur added a comment.Jun 15 2016, 2:25 PM

comments are inlined.

lib/Transforms/IPO/PassManagerBuilder.cpp
122

probably not needed.

128

This was adopted from the earlier implementation, where we had a separate inline pass. Since we are now using the standard inliner with different thresholds, I think we can get rid of the preinline option and keep the disable-preinline option.

228

No. They are the current Os and Oz thresholds.

245

I came up with this list. I'm open to adding more passes.

silvas added inline comments.Jun 15 2016, 2:52 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
245

In testing our games, we did not observe a significant reduction in instrumentation overhead from adding passes after pre-inlining.

For example, we started with the set of passes initially proposed in http://reviews.llvm.org/D15828. Then we removed individual passes until there was a significant change in performance of the instrumented build. We were able to remove all of them except for the inlining pass.

I.e. only the inlining pass contributed significantly to reducing overhead of the instrumented build. I would be interested to see the results of this experiment on a subset of your benchmarks.

I believe the reason for this is that in the codebases we were testing (PS4 games), there are a very large number of calls to trivial single-BB functions (e.g. Vec4::operator+). On one codebase I analyzed, over 80% of all counts in the profile are from single-BB functions.

silvas edited edge metadata.Jun 15 2016, 3:09 PM

I really don't like the fact that the PGO pipeline will be different from the non-PGO pipeline (other than what is required for instrumentation). I wonder what other people will think of that.

This was one of my initial concerns too: http://lists.llvm.org/pipermail/llvm-dev/2015-August/089058.html
Jake and I (actually mostly Jake) did some basic measurements related to this and did not find any significant difference.

Rong also did some measurements related to this in the initial RFC thread: http://lists.llvm.org/pipermail/llvm-dev/2015-August/089425.html

I agree though: generally speaking, we should avoid running too many passes before instrumentation precisely because they cause divergence between the PGO and non-PGO pipelines. Thankfully Jake and I found that on our games only pre-inlining was needed (running other optimizations before instrumentation did not help significantly for our test cases beyond the benefit of just pre-inlining).

(note: due to PR27299, a run of simplifycfg after inlining is needed. But this run of simplifycfg doesn't really affect the performance of the instrumented build at all)

xur added a comment.Jun 15 2016, 3:26 PM

to vsk:

I did some analysis of the slowdown on bzip2: with the pre-instrumentation inliner, we are actually more aggressive in the late simple inline. Here is the relevant call chain:
BZ2_compressBlock() -> sendMTFValues() -> bsW()
BZ2_compressBlock() calls sendMTFValues() one time (1 call site), and
sendMTFValues() has 64 call sites to bsW().

With the pre-inliner, we inline sendMTFValues() into BZ2_compressBlock().
In the simple inline, we then inline all 64 calls to bsW() into BZ2_compressBlock().

Without the pre-inliner, we inline 2 calls to bsW() in sendMTFValues() and then decide to defer inlining the other calls to bsW(). But somehow we do not inline sendMTFValues() into BZ2_compressBlock().

I have yet to investigate why the deferral decision changed in the simple inliner. I think this is a rare case that we happened to hit.

davidxl added inline comments.Jun 15 2016, 4:03 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
238

A more effective cleanup should probably include SROA + EarlyCSE

239

I doubt the JumpThreading pass will make any difference here. Can you experiment with removing it?

240

Experiment with removing this pass too?

242

CFG simplification and inst combine look ok

243

Remove GVN pass?

eraman added a subscriber: eraman.Jun 17 2016, 5:44 PM
In D21405#459392, @xur wrote:

to vsk:

I did some analysis of the slowdown on bzip2: with the pre-instrumentation inliner, we are actually more aggressive in the late simple inline. Here is the relevant call chain:
BZ2_compressBlock() -> sendMTFValues() -> bsW()
BZ2_compressBlock() calls sendMTFValues() one time (1 call site), and
sendMTFValues() has 64 call sites to bsW().

With the pre-inliner, we inline sendMTFValues() into BZ2_compressBlock().
In the simple inline, we then inline all 64 calls to bsW() into BZ2_compressBlock().

Without the pre-inliner, we inline 2 calls to bsW() in sendMTFValues() and then decide to defer inlining the other calls to bsW(). But somehow we do not inline sendMTFValues() into BZ2_compressBlock().

I have yet to investigate why the deferral decision changed in the simple inliner. I think this is a rare case that we happened to hit.

Rong asked me to look into this. There is a bug in the deferral logic, and if it is fixed, the default inliner will also produce a code size increase (and possibly a performance regression).
A brief description of the deferral logic: when we inline a B->C callsite and B has local linkage, we look at all callers of B (say A_i). If the cost of B->C inlining exceeds the delta (threshold - cost) of the A->B inlining, we check whether the overall cost of the B->C inlining plus all the A_i->B inlinings is less than the cost of the B->C inlining alone; if it is, the inlining is deferred. The idea is that delaying the B->C inlining will allow B to be inlined into all of its callers, so that the out-of-line body of B can be removed and C subsequently inlined into A, resulting in an overall cost (a proxy for code size) reduction. To account for the fact that the body of B could be removed, a negative cost (-15000) is applied.

Now, in this case (A: BZ2_compressBlock(), B: sendMTFValues(), C: bsW()), there is only one caller of B (A). When the A->B inline cost is computed, the cost analysis also applies the -15000 bonus. In other words, the deferral logic under-estimates the cost of A->B inlining by 15000, and most B->C callsites are deferred (because the cost of A->B + B->C is less than the cost of B->C after this under-estimation). But when we then consider A->B inlining, the cost exceeds the threshold (since we don't apply the -15000 bonus twice), and once that fails, the C callsites never get inlined.

The fix is simple - apply the negative cost correctly - but that will result in all B->C callsites being inlined (and no inlining of the A->B callsite), resulting in a code size regression.
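The double application of the bonus can be sketched in a few lines. The -15000 value matches the bonus described above; the function names and the raw cost are illustrative stand-ins, not the real InlineCost implementation:

```cpp
#include <cassert>

// Toy model of the accounting bug described above. B has a single
// caller A and local linkage, so the "body can be removed" bonus of
// -15000 applies; the bug is that it gets applied twice.
const int LastCallToStaticBonus = -15000;

// Secondary cost used by the buggy deferral logic: it adds the bonus
// to an A->B cost that the cost analysis has already discounted.
int buggySecondaryCost(int rawCostAtoB) {
  int reportedCostAtoB = rawCostAtoB + LastCallToStaticBonus; // applied once
  return reportedCostAtoB + LastCallToStaticBonus;            // and again
}

// Correct accounting applies the bonus exactly once.
int correctSecondaryCost(int rawCostAtoB) {
  return rawCostAtoB + LastCallToStaticBonus;
}
```

With a hypothetical raw A->B cost of 500, the buggy path reports 500 - 15000 - 15000 = -29500, exactly 15000 below the correct value, which is why deferring the B->C callsites looks artificially attractive.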

xur updated this revision to Diff 62287.Jun 29 2016, 2:51 PM
xur edited edge metadata.

Here is the updated patch. Changes are:
(1) Removed the preinline option, as suggested by Mehdi.
(2) Updated the cleanup pass list.

I used Google internal benchmarks to measure the performance. The cleanup passes give a ~2% geomean improvement in profile-generate performance. The largest saving is 17%.

David's suggestion to use SROA and EarlyCSE is good. It has a positive impact on a number of programs (the largest is ~10%).

JumpThreading also helps a few programs (~3%).

In this patch, I propose to use
SROA + EarlyCSE + CFGSimplification + InstructionCombining
and remove the rest.

davidxl added inline comments.Jul 8 2016, 10:43 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
213

Do you have a text size comparison (for the profile-use phase) with and without pre-cleanup? If pre-cleanup always reduces the final size, there is no need to special-case here. If not, we should disable pre-cleanup when Os or Oz is specified.

216

Needs to have an option to control the value.

xur added a comment.Jul 8 2016, 11:06 AM

There is a small text size increase in the profile-use compilation. Here is the data using Google benchmarks ("-" means a size increase).
`benchmark nocleanup-size cleanup-delta
C++_benchmark01 223956862 -2.47%
C++_benchmark02 83380721 -1.06%
C++_benchmark03 163282214 -0.56%
C++_benchmark04 623589887 -0.16%
C++_benchmark05 510498950 -0.41%
C++_benchmark06 4523393 -3.07%
C++_benchmark07 4523028 -3.07%
C++_benchmark08 4523953 -3.05%
C_benchmark09 17632563 -0.70%
C_benchmark10 17633287 -0.69%
C++_benchmark11 4402197 -3.00%
C++_benchmark12 147280528 -1.13%
C++_benchmark14 148784695 -1.01%
C_benchmark15 2829725 -0.69%
C++_benchmark16 41095218 -1.01%
C++_benchmark17 115356091 -0.58%
C++_benchmark18 115327817 -0.63%
C++_benchmark19 240183470 -1.39%
C++_benchmark20 4453343 -2.96%
C_benchmark21 1729644 -1.67%
C++_benchmark22 4336754 -2.67%

total size 2479324340 -0.80%
mean of changes -1.52%
`

I'll disable the cleanup pass in Os and Oz and use a parameterized threshold.

mehdi_amini added inline comments.Jul 8 2016, 11:07 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
228

Uh? Since we have Os/Oz as function attributes, it seems strange to me to have this hard-coded at a global level.

mehdi_amini added inline comments.Jul 8 2016, 11:09 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
228

Also, is it duplicating llvm::computeThresholdFromOptLevels?

mehdi_amini added inline comments.Jul 8 2016, 11:11 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
216

Apparently -inline-threshold will override.

davidxl added inline comments.Jul 8 2016, 11:28 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
216

I don't think that is the right option to use. The -inline-threshold option is used to control the regular inliner -- the tuning option of the pre-cleanup should not affect it.

228

It does not. The pre-inliner's role is to do pre-profiling cleanup, not full-scale inlining, so it needs to be more conservative (no profile guidance is available at this point). It should not be controlled in the same way as the regular inline phase.

228

There is no mechanism to do threshold tuning at function granularity. Even if there were one, using it might not help the goal of reducing the final overall size.

mehdi_amini added inline comments.Jul 8 2016, 11:41 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
216

OK, makes sense.

228

OK.

xur updated this revision to Diff 63272.Jul 8 2016, 11:46 AM

Integrated David's review comments:
(1) Don't perform the cleanup passes for size optimizations.
(2) Add an option to control the pre-inline threshold.

xur added a comment.Jul 8 2016, 1:45 PM

These cleanup passes also have a performance impact on the profile-use compilation.

In Google benchmarks, compared to running without these cleanup passes, we are seeing a ~14% improvement for 1 benchmark, ~2% to ~4% for 3 benchmarks, and ~1% for 2 other benchmarks.

If we run these cleanup passes at O2 (without FDO), we only see a small improvement (~1%) on 2 benchmarks.
We think a large part of the profile-use improvement comes from better context-sensitive profiles.

Note that all the numbers were obtained with the existing profile-use infrastructure, which is not yet well tuned.

davidxl added inline comments.Jul 8 2016, 1:50 PM
test/Transforms/PGOProfile/preinline.ll
8

Also add a check that the call to bar() is actually eliminated.

xur updated this revision to Diff 63315.Jul 8 2016, 2:01 PM
xur marked an inline comment as done.

Check that function bar() is eliminated in the test case.

davidxl edited edge metadata.Jul 11 2016, 10:07 AM

The results are really good. LGTM, but wait to see whether other reviewers have further comments.

I don't have any other comment, don't wait for me.

test/Transforms/PGOProfile/preinline.ll
9

Maybe add a one-line comment here saying why you have this test?

davidxl accepted this revision.Jul 13 2016, 10:24 AM
davidxl edited edge metadata.

lgtm

(with the test comment requested by mehdi).

This revision is now accepted and ready to land.Jul 13 2016, 10:24 AM
This revision was automatically updated to reflect the committed changes.

I really don't like the fact that the PGO pipeline will be different from the non-PGO pipeline (other than what is required for instrumentation). I wonder what other people will think of that.

This was one of my initial concerns too: http://lists.llvm.org/pipermail/llvm-dev/2015-August/089058.html
Jake and I (actually mostly Jake) did some basic measurements related to this and did not find any significant difference.

Rong also did some measurements related to this in the initial RFC thread: http://lists.llvm.org/pipermail/llvm-dev/2015-August/089425.html

I agree though: generally speaking, we should avoid running too many passes before instrumentation precisely because they cause divergence between the PGO and non-PGO pipelines. Thankfully Jake and I found that on our games only pre-inlining was needed (running other optimizations before instrumentation did not help significantly for our test cases beyond the benefit of just pre-inlining).

I'm glad this is just pre-inlining; I strongly agree that we should not diverge here.

I actually continue to disagree with pre-inlining. I have said this repeatedly in the past, and I'm not sure where we reached consensus that pre-inlining was the right design, or where the discussion of the tradeoffs took place. The closest I could find was the thread you linked to, Sean, but I don't actually see a lot of discussion of alternatives there, nor do I see much real discussion of the inliner at all beyond the fact that yes, it does work.

I would be much more in favor of a design which instead of changing the pipeline to optimize differently (and thus having PGO vs. normal divergence) either:

  • Teach the instrumentation pass to do the necessary (possibly interprocedural) analysis to instrument only in places where it will be cheap, or
  • Teach the existing pipeline to actually optimize and transform the instrumentation code to remove the overhead as necessary.

This way we have a single optimization pipeline that we turn profile info on and off of, rather than two pipelines.

I would be much more in favor of a design which instead of changing the pipeline to optimize differently (and thus having PGO vs. normal divergence) either:

  • Teach the instrumentation pass to do the necessary (possibly interprocedural) analysis to instrument only in places where it will be cheap, or
  • Teach the existing pipeline to actually optimize and transform the instrumentation code to remove the overhead as necessary.

I totally agree that this would be much better; however, there is something that was mentioned that I don't know how to address: the counters can actually get more precise due to the split context after inlining.