
[llvm] Machine Learned policy for inlining -Oz
Needs Review · Public

Authored by mtrofin on Apr 8 2020, 2:04 PM.

Details

Summary

[llvm] Machine Learned policy for inlining -Oz

This change is a placeholder, used to reference the submitted changes related to this work.

Authors: Mircea Trofin, Yundi Qian, Eugene Brevdo

Please refer to the RFC for more details.

The change introduces a machine learned policy for -Oz inlining as a build-time opt-in.

There are two opt-in modes: 'release' and 'development'.

'Release' is the 'day-to-day compiler use'. A pre-trained ML model is compiled ahead of time into a native library, which is then used to make inlining
decisions. Determinism is ensured because the model is unchanged as the compiler runs.

Just like with hand-written heuristics, there is no formal guarantee that the policy performs well, only empirical results on benchmarks, based on which we derive a belief of general applicability.

'Development' is the mode used to train a model. Training happens offline, through reinforcement learning, by providing a training algorithm with traces of decisions made by the compiler on a training corpus (IR modules).

This initial change introduces all the 'release' mode components, together with a reference, pre-trained model, as well as key training mode components. More training mode components will be provided in subsequent changes. The reference model was trained on a Google-internal corpus of ~25K IR modules, and appears to generalize well to clang, opt, SPEC2006 (which are code bases sharing little to nothing with the training corpus), as well as some internal binaries. We observe as much as 6% size reduction (opt), and generally ~3%, when compared to clang -Oz.

To enable either/both modes, reference the buildbot script, as well as the buildbot definition.

TL;DR:

  • Release mode.

Get the tensorflow pip package, find out where it is installed:

python3 -m pip install --upgrade pip
python3 -m pip install --user tf_nightly==2.3.0.dev20200528
export TF_PIP=$(sudo -u buildbot python3 -m pip show tf_nightly | grep Location | cut -d ' ' -f 2)

Now setup the build by passing cmake -DTENSORFLOW_AOT_PATH=${TF_PIP}/tensorflow

  • Development

Get the tensorflow C API library from https://www.tensorflow.org/install/lang_c. Install it somewhere, e.g. $TF_API. Then pass -DTENSORFLOW_API_PATH=${TF_API} to cmake.

Note that TF_API points to the directory containing the include and lib directories provided by the install package.

The Release and Development modes may co-exist.

To opt-in to one of the modes (including 'default', which is the manual heuristic), pass (via -mllvm) -enable-ml-advisor={default|release|development}. Note that, for development, you also need to pass the path to the size estimator model, and/or the path to the saved model currently under training - see lib/Analysis/DevelopmentModeInlineAdvisor.cpp.

Diff Detail

Unit Tests: Failed

Time: 180 ms · Test: linux > Polly.ScopInfo::Unknown Unit Message ("")
Script: -- : 'RUN: at line 1'; opt -polly-process-unprofitable -polly-remarks-minimal -polly-use-llvm-names -polly-import-jscop-dir=/mnt/disks/ssd0/agent/llvm-project/polly/test/ScopInfo -polly-codegen-verify -basic-aa -scoped-noalias -tbaa -polly-scops -analyze < /mnt/disks/ssd0/agent/llvm-project/polly/test/ScopInfo/memcpy-raw-source.ll

Event Timeline

dblaikie added inline comments.Apr 8 2020, 3:58 PM
llvm/lib/Analysis/ML/IRToNativeSizeLearning.cpp
31–32 ↗(On Diff #256104)

LLVM style guide encourages using namespace/global-scope static for file-local functions, and only using anonymous namespaces for types: https://llvm.org/docs/CodingStandards.html#anonymous-namespaces

llvm/lib/Analysis/ML/InliningAdvisor.cpp
480 ↗(On Diff #256104)

Generally if you're passing a lambda that'll be only run immediately (or even within the scope it's declared, but not immediately) - I'd suggest using [&] for capture, treating the lambda scope the same as other scopes (like for/while/if/etc) - that implicitly have access to all outer variables

492–494 ↗(On Diff #256104)

Generally LLVM code doesn't use {} on single line blocks.

llvm/lib/Analysis/ML/InliningModelFeatureMaps.h
16–32 ↗(On Diff #256104)

Enumerators need a prefix: https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly (or use an enum class, that forces using the enum name as a prefix)

34–54 ↗(On Diff #256104)

static things in headers are problematic (since they define new/distinct variables in every translation unit that includes the header) - you can use inline accessor functions, or maybe constexpr variables... (I forget/don't know the specific linkage rules there) or declared in the header and defined in a cpp file

llvm/lib/Analysis/ML/InliningModelRunnerProduction.h
56–63 ↗(On Diff #256104)

Ah, this is more of the "two implementations of the same entities" discussed earlier - yeah, I'd be in favor of avoiding this, as it seems like it'll make the code quite a bit harder to work with (since it'll only compile in one mode at a time, changes to these APIs will need different builds to validate that both implementations are correctly updated, etc.), plus concerns around testing, etc.

116 ↗(On Diff #256104)

Prefer = default where possible

llvm/lib/Analysis/ML/InliningModelRunnerTraining.h
24–45 ↗(On Diff #256104)

More static things in headers which will need to move to an implementation file

172–173 ↗(On Diff #256104)

Looks like an extra set of () here that could be removed

sstefan1 resigned from this revision.Apr 9 2020, 1:38 AM
sstefan1 added a subscriber: sstefan1.
lkail added a subscriber: lkail.Apr 9 2020, 2:23 AM
simoll added a subscriber: simoll.Apr 9 2020, 7:00 AM
mtrofin updated this revision to Diff 257139.Apr 13 2020, 4:17 PM
mtrofin marked 14 inline comments as done.

Small changes

mtrofin added inline comments.Apr 13 2020, 4:20 PM
llvm/lib/Analysis/InliningAdvisor.cpp
32–34 ↗(On Diff #256104)

If I understand it correctly, the idea would be to:

  • detect optional dependencies (or be hinted where they are).
  • if any/all of them are available, compile dependent code
  • use flags to pick the right implementation, if more than one is available. In extreme, if the compiler is compiled with both 'rel' and 'dev' modes, one would have 3 flag possibilities for the inliner: 'manual heuristic', 'ml-rel mode', and 'ml-dev mode'

I can see how this can result in cleaner code - probably all those ifdefs get removed, for example.

I like it!

llvm/lib/Analysis/ML/InliningModelFeatureMaps.h
34–54 ↗(On Diff #256104)

Ack - this sorts itself out with your earlier suggestion of building depending on availability of dependencies.

llvm/lib/Analysis/ML/InliningModelRunnerProduction.h
56–63 ↗(On Diff #256104)

Ack - addressing with the 'build everything if deps available'

mtrofin updated this revision to Diff 257146.Apr 13 2020, 4:35 PM

renamed FeatureList

mtrofin updated this revision to Diff 258465.Apr 17 2020, 6:16 PM
mtrofin marked an inline comment as done.

Incorporated feedback to implement the 2 modes as buildable depending on presence of dependencies, meaning they may both be present.

mtrofin updated this revision to Diff 259105.Apr 21 2020, 2:21 PM

Update/integrate


Hey - thanks for having a go at what I was describing, but this isn't /quite/ what I was thinking and still involves quite a few different bits of macro usage across different situations (some conditional functionality implemented in the build files, some of it in source files - the two different implementations (trivial/null/empty and non-trivial) #ifdef'd out, etc)

At a high level:

  • I think the preprocessor macros provided by the build system should be named after the libraries that have been detected, not the functionality those libraries are intended to provide (rather than LLVM_USE_ML_POLICY_DEV, it'd be something more like the ZLIB one (LLVM_ENABLE_TF, LLVM_ENABLE_TF_LITE, or something?))
  • I think it'd be clearer if all the conditional was done in the source/header files, not mixed between that and the build files.
  • If possible, I'd like to avoid having two different implementations of multiple classes like that - what I'd picture was one factory function that says "get the thing" and that implementation would check the runtime parameter from the user and call unconditionally provided (but conditionally implemented) functions that retrieve implementations of an interface. There would be a default/null implementation, if needed (if it's easier to have that than to have the callers check for presence/absence) - a separate class, rather than a #ifdef implementation of another class.
llvm/lib/Analysis/ML/InliningAdvisor.cpp
47–48 ↗(On Diff #258465)

LLVM doesn't usually use top-level const like this, FWIW - I'd suggest avoiding it since it does things like disable assignability, etc.

(also you don't necessarily have to write that ctor - you can use braced init to construct such a thing memberwise without having a ctor written)

117 ↗(On Diff #258465)

Drop the () around F here (instead delete F;).

But perhaps more generally: could DeletedFunctions contain unique_ptrs, so there's less manual memory management here? (instead this function could be DeletedFunctions.clear();)

469 ↗(On Diff #258465)

Generally I'd encourage using [&] for captures in any lambda that doesn't escape its scope, doubly so if it doesn't escape the statement it's defined in.

llvm/lib/Analysis/ML/Rel/InliningModelRunnerRel.cpp
24–32 ↗(On Diff #258465)

Externally visible definitions should go in headers (unless this is a definition of a pimpl or the like) - otherwise this type should be defined in an anonymous namespace (so it can't collide with other implementation details in other files)

https://llvm.org/docs/CodingStandards.html#anonymous-namespaces

mtrofin updated this revision to Diff 260689.Apr 28 2020, 10:15 AM
mtrofin marked 6 inline comments as done.

feedback

mtrofin added inline comments.Apr 28 2020, 10:16 AM
llvm/lib/Analysis/InliningAdvisor.cpp
32–34 ↗(On Diff #256104)

Updated along these lines.

llvm/lib/Analysis/ML/InliningAdvisor.cpp
47–48 ↗(On Diff #258465)

Removed the const and changed to explicit readonly accessors. I do want to underscore this is an immutable object - easier to read / maintain. If it's OK style-wise, I prefer the explicit ctor for readability, too.

117 ↗(On Diff #258465)

I need to do lookups in the DeletedFunctions set. That would require some gymnastics with std::unique_ptr around find, I think. Not sure it's worth the complexity in this case - is there an alternative?

dblaikie added inline comments.May 2 2020, 4:51 PM
llvm/lib/Analysis/ML/InliningAdvisor.cpp
136 ↗(On Diff #260689)

This'll be implicitly deleted, FWIW (similarly the assignment operators will be implicitly deleted or omitted as necessary)

173–178 ↗(On Diff #260689)

again, top level const like this is a bit quirky/uncommon in LLVM, but not outright banned

Even if the struct isn't mutable - your vector is, and the equivalent of mutation could be achieved by removing an element from the vector and adding a new one with the desired values, so the const members aren't really providing the invariants you'd like them to.

209–217 ↗(On Diff #260689)

Similarly - not sure if making these members const carries its weight - essentially it's to ensure these members aren't modified during "recordInlining"? (I don't see any other code that could be accessing these? But then I guess I wonder why aren't they private if they're only accessed by that member? Perhaps I'm missing something)

272–274 ↗(On Diff #260689)

LLVM usually skips braces on single line (some people go so far as to omit braces on single statement (even if that statement has to be wrapped over several lines)) blocks - and several single line blocks in this code skip braces - so this one (& maybe others in this patch? I haven't looked) seems inconsistent.

465 ↗(On Diff #260689)

This is often written as just "if (x.count(y))" but up to you. (about 30:1 ratio in favor of that compared to > 0 or != 0)

47–48 ↗(On Diff #258465)

But it's not really immutable (well, even less now, without const members) since it can be reassigned (OldCSI = CallSiteInfo(NewCB, NewH))

& what's the readability benefit you have in mind in terms of having a ctor?

It looks like this is only constructed in one place, and everything else refers to it via const ref - so I'm not sure it'd present a significant loss of readability if it were a plain struct?

(this isn't a "you must not do it this way", but a bit of a discussion to understand if this/things like this are worthwhile compared to many other use cases that have simple structs, etc)

117 ↗(On Diff #258465)

Ah, right. Yeah - maybe leave a FIXME to fix this once we're using C++20, which supports heterogenous lookup.

Or might be worth using SmallPtrSet (https://llvm.org/docs/ProgrammersManual.html#llvm-adt-smallptrset-h) here anyway (which doesn't & probably can't support unique_ptr elements anyway, just an unfortunate reality) for reduced memory usage/better cache locality.

mtrofin marked 2 inline comments as done.May 2 2020, 5:43 PM
mtrofin added inline comments.
llvm/lib/Analysis/ML/InliningAdvisor.cpp
173–178 ↗(On Diff #260689)

I see value in const helping a reader understand where and when mutation may come. Here, it says "when constructed, then can't change". So it helps the reader see the whole object as one value, where parts don't shift.

47–48 ↗(On Diff #258465)

Maybe I should just pass the params directly, come to think of it.

What I mean re. readability is reading stuff like callingAFunction(param1, param2..., {value1, value2},...). If I read this code, I have no idea what that tuple means, I have to go to the definition and maybe implementation of callingAFunction. Instead, callingAFunction(..., CallSiteInfo(value1, value2)) adds a bit more info. It's not much extra, but it has no perf cost, so why not. The concern is also wrt what happens later, if the type is constructed in more places.

Similarly, a naked struct would get me chasing more for "where is this field set". The getter-only design may help a reader understand this is meant to be a set-once value.

Anyway, if I just pass the scalar params, all this becomes moot.

dblaikie added inline comments.May 2 2020, 7:29 PM
llvm/lib/Analysis/ML/InliningAdvisor.cpp
173–178 ↗(On Diff #260689)

I think the inconsistency with the rest of the codebase would raise more questions than it'd answer - but it's sufficiently locally scoped it's not a huge deal. Mostly trying to discuss this not just in this case, but in the broader context of coding practices in LLVM in general.

47–48 ↗(On Diff #258465)

Maybe I should just pass the params directly, come to think of it.

If it's just two parameters, to one function/function call - yeah, doesn't seem like it's adding a lot of value.

Sometimes if a function ends up with too many arguments, grouping them into a struct can help.

What I mean re. readability is reading stuff like callingAFunction(param1, param2..., {value1, value2},...). If I read this code, I have no idea what that tuple means, I have to go to the definition and maybe implementation of callingAFunction. Instead, callingAFunction(..., CallSiteInfo(value1, value2)) adds a bit more info. It's not much extra, but it has no perf cost, so why not. The concern is also wrt what happens later, if the type is constructed in more places.

Ah, thanks for explaining - callers would still be able to write CallSiteInfo{x, y} with a struct, if the 'CallSiteInfo' helps describe things. I'd tend to leave it up to the caller to decide what's most readable? (though I agree there are probably a few places in LLVM that get a bit over-enthusiastic about {}). Also, as it happens - even with the struct as-written, you could still write the less-legible version. You'd have to mark the ctor "explicit" to disallow bare {x, y} construction. (which would still allow CallSiteInfo{x, y}, FWIW).

Similarly, a naked struct would get me chasing more for "where is this field set". The getter-only design may help a reader understand this is meant to be a set-once value.

*nod* if it's only passed across the boundary into a function to group parameters, hopefully those scopes aren't too complicated/find where it's created/used/set/etc.

(admittedly, shorter functions helps here - big long functions risk code mutating parameters in ways that might be surprising, etc - so the benefit of using "const" for function parameters ("void f1(const int i) { /* really long body where it's nice to know 'i' isn't mutated */ }") but even that wouldn't need the struct to be intrinsically immutable)

There's some value, from my perspective, in having types act like basic (what some folks call "vocabulary") types as much as possible - C++ goes to quite some lengths to make it possible, and it reduces surprises/curiosities, at least I find, when types work that way. If I see a non-copyable type, for instance, I worry that someone's keeping pointers to it in some "interesting" way that means moving/copying it would be problematic.

Anyway, if I just pass the scalar params, all this becomes moot.

Sure enough, but hopefully useful design discussions.

mtrofin updated this revision to Diff 262497.May 6 2020, 4:36 PM

keeping it in sync

asl added a subscriber: asl.Jun 17 2020, 4:09 AM
asl added inline comments.
llvm/cmake/modules/TensorFlowCompile.cmake
15 ↗(On Diff #269637)

I believe you need to provide --target_triple here as well. Otherwise you'd end up with an ELF object on macOS.

phosek added a subscriber: phosek.Jun 24 2020, 1:22 AM
phosek added inline comments.
llvm/lib/Analysis/ML/InlineModelFeatureMaps.h
14 ↗(On Diff #269637)

This is missing a <string> include, which Clang complains about.

mtrofin marked 2 inline comments as done.Jun 24 2020, 8:03 AM
mtrofin added inline comments.
llvm/lib/Analysis/ML/InlineModelFeatureMaps.h
14 ↗(On Diff #269637)

Thanks - I'll patch it correctly in D81515 and then here.

phosek added inline comments.Jun 24 2020, 11:11 AM
llvm/lib/Analysis/ML/CMakeLists.txt
19 ↗(On Diff #269637)

I think line 19 and line 25 should be swapped: TensorFlow C API should be used with LLVM_HAVE_TF_API and LLVMMLPoliciesXLA with LLVM_HAS_TF_AOT. I got link errors with the current version, but everything works after swapping those two lines.

mtrofin marked 2 inline comments as done.Jun 24 2020, 11:32 AM
mtrofin added inline comments.
llvm/lib/Analysis/ML/CMakeLists.txt
19 ↗(On Diff #269637)

Correct, yes. I'll make sure these are correct once I rebase.

mtrofin updated this revision to Diff 279314.Jul 20 2020, 11:31 AM

All work in the initial patch was upstreamed; clearing the changes in this patch, so we may use it just for tracking.

mtrofin updated this revision to Diff 279316.Jul 20 2020, 11:33 AM

(trying again - clearing this change)

mtrofin edited the summary of this revision. Jul 20 2020, 11:48 AM
mtrofin updated this revision to Diff 281016.Jul 27 2020, 12:19 PM

Added RFC doc as a md file, for reference (doesn't need to be landed)

uenoku added a comment.EditedAug 13 2020, 7:33 AM

Hi, I want to use my TensorFlow model in LLVM with MLModelRunner, like ReleaseModeModelRunner.cpp, but I have difficulty compiling the model, especially the TensorFlow AOT compilation of the model.
It would be really helpful if you could point me to the code which defines the computation graph and the signature. Is such a script already landed or published somewhere?

Hi, I want to use my TensorFlow model in LLVM with MLModelRunner, like ReleaseModeModelRunner.cpp, but I have difficulty compiling the model, especially the TensorFlow AOT compilation of the model.
It would be really helpful if you could point me to the code which defines the computation graph and the signature. Is such a script already landed or published somewhere?

We are about to publish the training algorithm, I'll update here when that's done - meanwhile, a possible way to unblock would be to inspect the expected inputs and outputs of the graph from the protobuf itself - it is in textual form - or using tensorboard, I believe it has a facility for this.

Note that the only interface between LLVM and the model is the input and output signatures, the internal structure of the model can differ.

uenoku added a comment.EditedAug 13 2020, 8:27 AM

Hi, I want to use my TensorFlow model in LLVM with MLModelRunner, like ReleaseModeModelRunner.cpp, but I have difficulty compiling the model, especially the TensorFlow AOT compilation of the model.
It would be really helpful if you could point me to the code which defines the computation graph and the signature. Is such a script already landed or published somewhere?

We are about to publish the training algorithm, I'll update here when that's done - meanwhile, a possible way to unblock would be to inspect the expected inputs and outputs of the graph from the protobuf itself - it is in textual form - or using tensorboard, I believe it has a facility for this.

Note that the only interface between LLVM and the model is the input and output signatures, the internal structure of the model can differ.

Great, I'm looking forward to it :) Thanks for the advice. I'll try again.

mtrofin edited the summary of this revision. Sep 16 2020, 10:46 AM
mtrofin edited the summary of this revision.