- When optnone is not present, add optsize for Os and minsize for Oz.
- Pass Os/Oz to (old/new PM) LTO pipeline setup.
- Ran SPEC2017 & test-suite MultiSource to confirm the size reduction are comparable with NonLTO.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Time | Test | |
---|---|---|
380 ms | lld.wasm/lto::Unknown Unit Message ("") |
Event Timeline
This is going in the wrong direction IMO. We already have a way of specifying the size optimization level, which is to specify -Os or -Oz at compile time. Why is that not sufficient?
Thanks for the quick feedback, @pcc. :-) I saw your comments somewhere on this.
I don't know why is this changing directions. It only gives the user choice to optimize for size at link stage.
They still have the freedom the do this at compile stage.
From user perspective, if O1, O2 etc. are valid options for linker stage,
I don't see a reason Os and Oz can not be used or not do the right thing.
With ThinLTO, many decisions are postponed til linker stage to have more flexibility. I guess this is one of them?
Unit tests: fail. 60542 tests passed, 1 failed and 726 were skipped.
failed: lld.wasm/lto/opt-level.ll
clang-tidy: fail. Please fix clang-tidy findings.
clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.
Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml
Unit tests: fail. 60542 tests passed, 1 failed and 726 were skipped.
failed: lld.ELF/note.s
clang-tidy: fail. Please fix clang-tidy findings.
clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.
Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml
- when optnone is not present, add optsize for Os and minsize for Oz
- add Os Oz function attribute test.
- add Os Oz pipeline test
Unit tests: fail. 60543 tests passed, 1 failed and 726 were skipped.
failed: Clang.CodeGen/thinlto-debug-pm.c
clang-tidy: unknown.
clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.
Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml
Unit tests: fail. 60543 tests passed, 1 failed and 726 were skipped.
failed: Clang.CodeGen/thinlto-debug-pm.c
clang-tidy: unknown.
clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.
Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml
Unit tests: fail. 61849 tests passed, 1 failed and 781 were skipped.
failed: Clang.CodeGen/thinlto-debug-pm.c
clang-tidy: unknown.
clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.
Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml
Unit tests: fail. 61858 tests passed, 1 failed and 781 were skipped.
failed: Clang.CodeGen/thinlto-debug-pm.c
clang-tidy: unknown.
clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.
Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml
I think that failure can be fixed with something like this:
diff --git a/clang/test/CodeGen/thinlto-debug-pm.c b/clang/test/CodeGen/thinlto-debug-pm.c index 5f449d493af..185a8c8fb8b 100644 --- a/clang/test/CodeGen/thinlto-debug-pm.c +++ b/clang/test/CodeGen/thinlto-debug-pm.c @@ -13,17 +13,17 @@ // O0123sz-NEWPM: Running analysis: PassInstrumentationAnalysis // O0123sz-NEWPM: Starting llvm::Module pass manager run. // O0123sz-NEWPM: Running pass: WholeProgramDevirtPass -// O0123sz-NEWPM: Running analysis: InnerAnalysisManagerProxy<llvm::AnalysisManager<llvm::Function>, llvm::Module> +// O0123sz-NEWPM: Running analysis: InnerAnalysisManagerProxy<llvm::FunctionAnalysisManager, llvm::Module> // O0123sz-NEWPM: Running pass: LowerTypeTestsPass // O0123sz-NEWPM: Invalidating all non-preserved analyses for: -// O0123sz-NEWPM: Invalidating analysis: InnerAnalysisManagerProxy<llvm::AnalysisManager<llvm::Function>, llvm::Module> +// O0123sz-NEWPM: Invalidating analysis: InnerAnalysisManagerProxy<llvm::FunctionAnalysisManager, llvm::Module> // O123sz-NEWPM: Running pass: ForceFunctionAttrsPass // O123sz-NEWPM: Running pass: PassManager<llvm::Module> // O123sz-NEWPM: Starting llvm::Module pass manager run. // O123sz-NEWPM: Running pass: PGOIndirectCallPromotion // O123sz-NEWPM: Running analysis: ProfileSummaryAnalysis // O123sz-NEWPM: Running pass: InferFunctionAttrsPass -// O123sz-NEWPM: Running analysis: InnerAnalysisManagerProxy<llvm::AnalysisManager<llvm::Function>, llvm::Module> +// O123sz-NEWPM: Running analysis: InnerAnalysisManagerProxy<llvm::FunctionAnalysisManager, llvm::Module> // O123sz-NEWPM: Running pass: ModuleToFunctionPassAdaptor<llvm::PassManager<llvm::Function> > // O123sz-NEWPM: Running analysis: PassInstrumentationAnalysis on foo // O123sz-NEWPM: Starting llvm::Function pass manager run. @@ -41,9 +41,9 @@ // O123sz-NEWPM: Running pass: CalledValuePropagationPass // O123sz-NEWPM: Running pass: GlobalOptPass // O123sz-NEWPM: Invalidating all non-preserved analyses for: -// O123sz-NEWPM: Invalidating analysis: InnerAnalysisManagerProxy<llvm::AnalysisManager<llvm::Function>, llvm::Module> +// O123sz-NEWPM: Invalidating analysis: InnerAnalysisManagerProxy<llvm::FunctionAnalysisManager, llvm::Module> // O123sz-NEWPM: Running pass: ModuleToFunctionPassAdaptor<llvm::PromotePass> -// O123sz-NEWPM: Running analysis: InnerAnalysisManagerProxy<llvm::AnalysisManager<llvm::Function>, llvm::Module> +// O123sz-NEWPM: Running analysis: InnerAnalysisManagerProxy<llvm::FunctionAnalysisManager, llvm::Module> // O123sz-NEWPM: Running analysis: DominatorTreeAnalysis on foo // O123sz-NEWPM: Running analysis: PassInstrumentationAnalysis on foo // O123sz-NEWPM: Running analysis: AssumptionAnalysis on foo @@ -57,20 +57,20 @@ // O123sz-NEWPM: Running analysis: BasicAA on foo // O123sz-NEWPM: Running analysis: ScopedNoAliasAA on foo // O123sz-NEWPM: Running analysis: TypeBasedAA on foo -// O123sz-NEWPM: Running analysis: OuterAnalysisManagerProxy<llvm::AnalysisManager<llvm::Module>, llvm::Function> on foo +// O123sz-NEWPM: Running analysis: OuterAnalysisManagerProxy<llvm::ModuleAnalysisManager, llvm::Function> on foo // O123sz-NEWPM: Running pass: SimplifyCFGPass on foo // O123sz-NEWPM: Running analysis: TargetIRAnalysis on foo // O123sz-NEWPM: Finished llvm::Function pass manager run. -// O123sz-NEWPM: Running pass: RequireAnalysisPass<llvm::GlobalsAA, llvm::Module> +// O123sz-NEWPM: Running pass: RequireAnalysisPass<llvm::GlobalsAA, llvm::Module, llvm::AnalysisManager<llvm::Module>> // O123sz-NEWPM: Running analysis: GlobalsAA // O123sz-NEWPM: Running analysis: CallGraphAnalysis -// O123sz-NEWPM: Running pass: RequireAnalysisPass<llvm::ProfileSummaryAnalysis, llvm::Module> -// O123sz-NEWPM: Running pass: ModuleToPostOrderCGSCCPassAdaptor<llvm::DevirtSCCRepeatedPass<llvm::PassManager<llvm::LazyCallGraph::SCC, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&> > > -// O123sz-NEWPM: Running analysis: InnerAnalysisManagerProxy<llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>, llvm::Module> +// O123sz-NEWPM: Running pass: RequireAnalysisPass<llvm::ProfileSummaryAnalysis, llvm::Module, llvm::AnalysisManager<llvm::Module>> +// O123sz-NEWPM: Running pass: ModuleToPostOrderCGSCCPassAdaptor<llvm::DevirtSCCRepeatedPass<llvm::PassManager<LazyCallGraph::SCC, llvm::CGSCCAnalysisManager, llvm::LazyCallGraph &, llvm::CGSCCUpdateResult &> > > +// O123sz-NEWPM: Running analysis: InnerAnalysisManagerProxy<llvm::CGSCCAnalysisManager, llvm::Module> // O123sz-NEWPM: Running analysis: LazyCallGraphAnalysis // O123sz-NEWPM: Running analysis: FunctionAnalysisManagerCGSCCProxy on (foo) // O123sz-NEWPM: Running analysis: PassInstrumentationAnalysis on (foo) -// O123sz-NEWPM: Running analysis: OuterAnalysisManagerProxy<llvm::AnalysisManager<llvm::Module>, llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&> on (foo) +// O123sz-NEWPM: Running analysis: OuterAnalysisManagerProxy<llvm::ModuleAnalysisManager, LazyCallGraph::SCC, llvm::LazyCallGraph &> // O123sz-NEWPM: Starting CGSCC pass manager run. // O123sz-NEWPM: Running pass: InlinerPass on (foo) // O123sz-NEWPM: Running pass: PostOrderFunctionAttrsPass on (foo) @@ -91,8 +91,8 @@ // O23sz-NEWPM: Running pass: TailCallElimPass on foo // O123sz-NEWPM: Running pass: SimplifyCFGPass on foo // O123sz-NEWPM: Running pass: ReassociatePass on foo -// O123sz-NEWPM: Running pass: RequireAnalysisPass<llvm::OptimizationRemarkEmitterAnalysis, llvm::Function> on foo -// O123sz-NEWPM: Running pass: FunctionToLoopPassAdaptor<llvm::PassManager<llvm::Loop, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&> > on foo +// O123sz-NEWPM: Running pass: RequireAnalysisPass<llvm::OptimizationRemarkEmitterAnalysis, llvm::Function, llvm::AnalysisManager<llvm::Function>> +// O123sz-NEWPM: Running pass: FunctionToLoopPassAdaptor<llvm::PassManager<llvm::Loop, llvm::LoopAnalysisManager, llvm::LoopStandardAnalysisResults &, llvm::LPMUpdater &> > // O123sz-NEWPM: Starting llvm::Function pass manager run. // O123sz-NEWPM: Running pass: LoopSimplifyPass on foo // O123sz-NEWPM: Running analysis: LoopAnalysis on foo @@ -100,7 +100,7 @@ // O123sz-NEWPM: Finished llvm::Function pass manager run. // O123sz-NEWPM: Running pass: SimplifyCFGPass on foo // O123sz-NEWPM: Running pass: InstCombinePass on foo -// O123sz-NEWPM: Running pass: FunctionToLoopPassAdaptor<llvm::PassManager<llvm::Loop, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&> > on foo +// O123sz-NEWPM: Running pass: FunctionToLoopPassAdaptor<llvm::PassManager<llvm::Loop, llvm::LoopAnalysisManager, llvm::LoopStandardAnalysisResults &, llvm::LPMUpdater &> > // O123sz-NEWPM: Starting llvm::Function pass manager run. // O123sz-NEWPM: Running pass: LoopSimplifyPass on foo // O123sz-NEWPM: Running pass: LCSSAPass on foo @@ -137,7 +137,7 @@ // O123sz-NEWPM: Running pass: GlobalDCEPass // O123sz-NEWPM: Running pass: EliminateAvailableExternallyPass // O123sz-NEWPM: Running pass: ReversePostOrderFunctionAttrsPass -// O123sz-NEWPM: Running pass: RequireAnalysisPass<llvm::GlobalsAA, llvm::Module> +// O123sz-NEWPM: Running pass: RequireAnalysisPass<llvm::GlobalsAA, llvm::Module, llvm::AnalysisManager<llvm::Module>> // O123sz-NEWPM: Running pass: ModuleToFunctionPassAdaptor<llvm::PassManager<llvm::Function> > // O123sz-NEWPM: Starting llvm::Function pass manager run. // O123sz-NEWPM: Running pass: Float2IntPass on foo @@ -149,7 +149,7 @@ // O123sz-NEWPM: Finished llvm::Function pass manager run. // O123sz-NEWPM: Running pass: LoopDistributePass on foo // O123sz-NEWPM: Running analysis: ScalarEvolutionAnalysis on foo -// O123sz-NEWPM: Running analysis: InnerAnalysisManagerProxy<llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>, llvm::Function> on foo +// O123sz-NEWPM: Running analysis: InnerAnalysisManagerProxy<llvm::LoopAnalysisManager, llvm::Function> // O123sz-NEWPM: Running pass: LoopVectorizePass on foo // O123sz-NEWPM: Running analysis: BlockFrequencyAnalysis on foo // O123sz-NEWPM: Running analysis: BranchProbabilityAnalysis on foo @@ -160,7 +160,7 @@ // O123sz-NEWPM: Running pass: LoopUnrollPass on foo // O123sz-NEWPM: Running pass: WarnMissedTransformationsPass on foo // O123sz-NEWPM: Running pass: InstCombinePass on foo -// O123sz-NEWPM: Running pass: RequireAnalysisPass<llvm::OptimizationRemarkEmitterAnalysis, llvm::Function> on foo +// O123sz-NEWPM: Running pass: RequireAnalysisPass<llvm::OptimizationRemarkEmitterAnalysis, llvm::Function, llvm::AnalysisManager<llvm::Function>> on foo // O123sz-NEWPM: Running pass: FunctionToLoopPassAdaptor<llvm::LICMPass> on foo // O123sz-NEWPM: Starting llvm::Function pass manager run. // O123sz-NEWPM: Running pass: LoopSimplifyPass on foo
EDIT: Had to make more changes than I thought. Above passes the test.
Thank you. It turns out that GCC&Clang disagree on __PRETTY_FUNCTION__ for type alias (using). will fix those.
Unit tests: pass. 61859 tests passed, 0 failed and 781 were skipped.
clang-tidy: unknown.
clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.
Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml
I would be very interested in this patch, because for me to use ThinLTO without size regressions I need to set the optimization size level in the linker (PassManagerBuilder.SizeLevel etc).
This patch seems mostly correct to me, except for the function attributes. These function attributes (optsize, minsize) should IMHO be set in the frontend, not in the linker.
Apart from that, would it be reasonable to implement this in the form of -plugin-opt=Os and -plugin-opt=Oz? That is perhaps a cleaner UI (doesn't need a new flag!) and more cleanly maps to PassBuilder::Oz and the like.
I just saw this. I know it is a good idea to have choice during link time for the pipeline configuration and from your benchmark it also has size impact on the output. I also feel like this is going in the wrong direction as if I have part of the object files built with -O3 and part built with -Oz, I need to make a choice of unoptimized part of the program.
Before I say yes or no to this patch, have you figured out what are the passes that causes the most size regression? Ideally, with function attributes on the function, it shouldn't be much size impact on the output.
Why can't you build the object files with -Os in the first place instead of changing the optimization level between the compilation and the linking phases?
Apparently there is also another patch that tries to do something very similar: D81223.
I'm not changing the optimization level. The bitcode files are built with an equivalent of -Oz and have the optsize and minsize attributes. It's the optimization passes in the linker (ThinLTO) that don't respect these attributes.
Unfortunately, there is an impact. I did a quick test with a small program (around 4.7kB compiled code) and it looks like the LoopRotate pass is the main culprit. If MaxHeaderSize is set to 0 instead of -1, the code size regression is avoided. The following hacky patch avoids the code size increase:
diff --git a/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp b/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp index aa916345954d..99be1926cf34 100644 --- a/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp +++ b/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp @@ -450,7 +450,7 @@ void PassManagerBuilder::addFunctionSimplificationPasses( // TODO: Investigate promotion cap for O1. MPM.add(createLICMPass(LicmMssaOptCap, LicmMssaNoAccForPromotionCap)); // Rotate Loop - disable header duplication at -Oz - MPM.add(createLoopRotatePass(SizeLevel == 2 ? 0 : -1, PrepareForLTO)); + MPM.add(createLoopRotatePass(0/*SizeLevel == 2 ? 0 : -1*/, PrepareForLTO)); // TODO: Investigate promotion cap for O1. MPM.add(createLICMPass(LicmMssaOptCap, LicmMssaNoAccForPromotionCap)); if (EnableSimpleLoopUnswitch) @@ -917,7 +917,7 @@ void PassManagerBuilder::populateModulePassManager( // Re-rotate loops in all our loop nests. These may have fallout out of // rotated form due to GVN or other transformations, and the vectorizer relies // on the rotated form. Disable header duplication at -Oz. - MPM.add(createLoopRotatePass(SizeLevel == 2 ? 0 : -1, PrepareForLTO)); + MPM.add(createLoopRotatePass(0 /*SizeLevel == 2 ? 0 : -1*/, PrepareForLTO)); // Distribute loops to allow partial vectorization. I.e. isolate dependences // into separate loop that would otherwise inhibit vectorization. This is
So, should all passes just look at the optsize and minsize attributes instead of the SizeLevel? In other words, should PassManagerBuilder.SizeLevel be removed and should passes only look at function attributes instead of SizeLevel? Because at the moment, it's a weird mix of both. IMHO size level should either all go via function attributes or via a flag, not something in between as it is now.
Also, if size level is done via function attributes, why not optimization level? There is already optnone. I'm not saying that's better, but right now I don't see the logic in this whole system.
After some more testing on a larger amount of code (many small programs, together over 1MB in binary size), LoopRotate indeed seems to be the culprit. I'm now looking into a patch to LoopRotate to respect the optsize function attribute.
EDIT: see D119342.
I agree, I don't know (other than history) why we couldn't move towards removing PassManagerBuilder.SizeLevel?
Also, if size level is done via function attributes, why not optimization level? There is already optnone. I'm not saying that's better, but right now I don't see the logic in this whole system.
Despite what gcc and clang exposes to their users, at the LLVM level we don't have a single dimension on which to put O1/O2/O3 compared to Os/Oz. These also may not make sense for every single compiler out-there: O1/O2/O3 for clang may not be the right pass pipeline for my proprietary shader compiler.
Another reason why O1/O2/O3 are not making much such to be in the IR is that the IR is intended to be stored and reloaded any time in the middle of pipeline. LTO is an example of this, so we don't really want to store the "list of pass to run" in the IR.
Finally, we don't want to (or we can't really...) teach passes about whether they should execute during O1/O2/O3, while optnone is just a "fuse" to disable them all. It is also convenient to have optnone as a function attribute because it allows to selectively disable the optimizer on a per function basis. On the other hand because of the nature of the pass pipeline, it can't be tweaked on a per function basis (what about Module passes?).
So O1/O2/O3 is somehow just like an alias for an arbitrary list of passes (not really arbitrary, it is a good "default" one for clang-style IR), on the other hand the optsize/minsize are driving heuristic and can be orthogonal to the pass pipeline / controlled on a per-function basis and made available to every pass: they convey an "optimization goal" that applies to every pass individually.
@mehdi_amini thanks for explaining! D119342 moves slightly closer to removing SizeLevel from the pass pipeline setup.
In other news, I found a workaround that can be used to avoid the size increase due to LoopRotate (until D119342 is merged). Basically, just pass the flags -mllvm --rotation-max-header-size=0 to ld.lld when compiling with -Oz.
I left a comment on D119342 - I think that is the right way to go. As mentioned there, there is still some legacy handling of options passed down from the driver, but over time we've been trying to move things to use function attributes, for the reasons @mehdi_amini mentioned. To expand on one of the reasons Mehdi mentioned: using function attributes naturally handles LTO linking in the case where one file is compiled -flto -Os and another is compiled -flto -O2 the same way as if you compiled the two files with those different flags all the way down to native code without LTO. Using the approach in this pass you would be forced to pick either -Os or -O2 for both files at LTO link time.
clang-tidy: warning: invalid case style for member 'ltoo' [readability-identifier-naming]