This implementation uses a pre-trained model which is statically
compiled into a native function.
RFC: http://lists.llvm.org/pipermail/llvm-dev/2020-April/140763.html
Differential D81515
[llvm] Release-mode ML InlineAdvisor
Authored by mtrofin on Jun 9 2020, 4:09 PM
Event Timeline

Comment Actions
Including the models in the LLVM tree is problematic. I'm not sure there's a formal policy on this, but generally part of being an open-source project is that the source is available in a human-readable format. With the exception of a few regression tests for binary parsers, the entire LLVM tree is human-readable. A model clearly doesn't count as human-readable.

If it isn't practical to train the model as part of the LLVM build (because it would take too long), it might make sense to commit binary files. There's some precedent for this in-tree: lowering for shuffles on some targets is based on a precomputed table, built using a utility that isn't run as part of the normal build process. But I would expect reproducible instructions for how to generate the files.

Comment Actions
Indeed, training as part of the build would be impractical. But that still doesn't mean we need binary files. I believe there are two concerns:
Comment Actions
If there's some standardized binary format for models, that might be okay? By analogy, there are some PNG files in the documentation; we don't insist people use XPM or something like that. There are some technical reasons to prefer text, though: it would allow someone to identify or diff the contents of the files without specialized tools.

I'm more concerned about adding an opaque matrix of coefficients nobody can reproduce into the codebase. I think before we commit a generated model, the training tool needs to be committed, and someone needs to verify they can independently reproduce the generated model using that tool. I think it's important we set the right precedent here.

Comment Actions
It's the tensorflow format for models - https://www.tensorflow.org/guide/saved_model
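For context, this is roughly what that format looks like from the TensorFlow side - a minimal, illustrative sketch (a toy model, not the committed inliner model), assuming a TF 2.x environment:

  import os
  import tensorflow as tf

  class Toy(tf.Module):
    # A trivial model: one variable plus one traced function.
    def __init__(self):
      super().__init__()
      self.coefficients = tf.Variable([[1.0, 2.0]])

    @tf.function(input_signature=[tf.TensorSpec([1, 2], tf.float32)])
    def run(self, x):
      return tf.reduce_sum(self.coefficients * x)

  # Saving produces a directory: a saved_model.pb graph proto plus a
  # variables/ checkpoint holding the weights (and typically an assets/ dir).
  tf.saved_model.save(Toy(), "/tmp/toy_model")
  print(sorted(os.listdir("/tmp/toy_model")))

The committed model under llvm/lib/Analysis/models/inliner/ should have the same layout: the graph is a protobuf, and the weights live in the variables checkpoint.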
We are on the same page - we do plan to release the training tools for developers wishing to produce their own models. It may be natural to do that step first, but in this case, we believe the staging described in the RFC has some merit (we should have described our motivation in the RFC, come to think of it). The main motivation for starting with the LLVM components (both 'release mode' and 'development mode', which I plan to submit next), and then making the training tools available (in a separate repository), is that having the LLVM components available allows for quicker experimentation by our partner teams, thus letting us parallelize the upstreaming of the training components with further ML exploration with those teams.

IIUC, being an experimental feature that is conditionally compiled in LLVM, this staging wouldn't have any material downside to anyone, while helping us maintain velocity. Importantly, because this is an optionally-built component, there should be no impact on "business as usual" LLVM developers; in particular, the build bots testing this feature are pointing to the silent master.

Comment Actions
Can you also rebase the patch?
Comment Actions
Feedback
Comment Actions
Hi, your git commit contains extra Phabricator tags. You can drop Reviewers:, Subscribers:, Tags:, and the text Summary: from the git commit with the following script:

  arcfilter () {
    arc amend
    git log -1 --pretty=%B | awk '/Reviewers:|Subscribers:/{p=1} /Reviewed By:|Differential Revision:/{p=0} !p && !/^Summary:$/ {sub(/^Summary: /,"");print}' | git commit --amend --date=now -F -
  }

Reviewed By: is considered important by some people. Please keep the tag. (I have updated my script to use --date=now, setting the author date to the committer date.)

https://reviews.llvm.org/D80978 contains a git pre-push hook to automate this.
Comment Actions
Hi Mircea,

Could you also provide information on which specific tf-nightly and protobuf versions you used to save the two frozen models? Unfortunately, I don't seem to be able to load the models using a number of tf-nightly versions and am receiving:

  google.protobuf.message.DecodeError: Error parsing message

After further investigation, I noticed this was done using TF's new SavedModel method and Keras: https://tensorflow.google.cn/tutorials/keras/save_and_load?hl=en#save_checkpoints_during_training

Would you provide scripts to load the model and see the layers?

Thanks,
Comment Actions
Hello Amir, to answer the first question (but I think you figured that out already), the authoritative versions are captured in the bot script, available at https://github.com/google/ml-compiler-opt/blob/master/buildbot/buildbot_init.sh

Re. the second question, visualization - this is a question for Yundi, Gaurav, or Eugene (they are the ML experts). I'll venture "tensorboard" as an answer, but I'll make sure they give the authoritative one in a moment.

Comment Actions
You should be able to use tensorboard, but you need to first import the model into tensorboard with https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/import_pb_to_tensorboard.py. Something like

  python import_pb_to_tensorboard.py --model_dir=llvm/lib/Analysis/models/inliner/ --log_dir=/tmp/inliner

should work. Then you'll be able to run tensorboard on the log_dir. Here's a hosted visualization from tensorboard for your convenience: https://tensorboard.dev/experiment/C45o0HjZTPGRSqpOrdkbeg/#graphs

Comment Actions
Thanks.

(1) May I ask what was the reason behind using a tf-nightly rather than a tensorflow release? Tensorboard's "duplicate plugins for name projector" error turned out to be a common issue when multiple tensorboard packages are installed, as happens when trying tf-nightly alongside a release. Removing the duplicate tensorboard fixed the issue.

(4) Will you also release the training scripts for brewing the ir2native model here as well: https://github.com/google/ml-compiler-opt ?

Thanks,
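(For anyone else hitting the DecodeError above: the committed model's signatures can also be dumped directly in Python, without going through tensorboard - a minimal sketch, assuming a TF 2.x install and a path relative to an LLVM checkout:)

  import tensorflow as tf

  model = tf.saved_model.load("llvm/lib/Analysis/models/inliner/")
  for name, fn in model.signatures.items():
    # Each signature is a concrete function with typed inputs and outputs.
    print("signature:", name)
    print("  inputs :", fn.structured_input_signature)
    print("  outputs:", fn.structured_outputs)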
Comment Actions
Historic reason - at the time we started upstreaming the work, the necessary changes to the pip package were not yet in a release package.

Thanks for pointing it out - updated the script. One of the build bots was also having issues for this reason; it must have been a recent change (or the bots hadn't been rebooted in a while).
To confirm, now that we're using the release 2.3.0 tensorflow pip package, this shouldn't be an issue anymore, correct?
IR2Native is used for RL training algorithms where we want partial rewards. That's what we initially did, but then we got better characteristics from training algorithms using just the final reward (== the .text size in the native object). We abandoned partial-rewards training for the short term. We suspect it will start making sense again when we incorporate more global context than we currently do (currently, the global context is really thin - node/edge counts, uses, and a measure of the initial DAG position). So this is a long way of saying: we should probably yank out IR2Native right now, for code simplicity, but we haven't gotten around to doing it.
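(For concreteness, the final reward mentioned above is derived from the .text size of the emitted native object; a minimal sketch of computing such a reward - not the actual training pipeline, which is planned for google/ml-compiler-opt - assuming llvm-size is on PATH:)

  import subprocess

  def text_size(object_file):
    # 'llvm-size -A' prints one section per line: name, size, address.
    out = subprocess.run(["llvm-size", "-A", object_file],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
      parts = line.split()
      if parts and parts[0] == ".text":
        return int(parts[1])
    raise ValueError(".text section not found in " + object_file)

  def final_reward(baseline_obj, policy_obj):
    # Positive when the learned policy shrinks .text relative to the baseline.
    base = text_size(baseline_obj)
    return (base - text_size(policy_obj)) / float(base)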
Comment Actions
Yes. I confirm: with TF 2.3.0 and Tensorboard 2.3.0, pip3 install tensorflow==2.3 --user did the job.

I see. So there are two questions:

(Q1) Could you provide a definition of the IR2Native final/optimal partial rewards? I'd assume it was the final iteration of model weights when training was stopped; however, what was the stop condition here?

(Q2) To make sense of it, let's consider:

(2-2) Inference Phase:

Thanks,
Comment Actions
IR2Native was trained through supervised learning: we captured the features after the last inlining, and then also captured the final native size of that function (at asm-printing time) as the label.
IR2Native was trained completely separately: at one point, we captured the feature|label tuples from a corpus. Then we did supervised learning on that dataset and obtained the IR2Native model. After that, we only used the IR2Native model in inference mode any time we wanted to train the inliner model. The IR used for the training sessions was different (same overall codebase, but unrelated points in time). We didn't retrain IR2Native before training the inliner either.
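(To make the setup above concrete, here is a minimal sketch of that supervised step - hypothetical feature count and layer sizes, with random placeholder data standing in for the captured feature|label corpus; it is not the actual training code, which is planned for google/ml-compiler-opt:)

  import numpy as np
  import tensorflow as tf

  # Placeholder corpus: rows are features captured after the last inlining,
  # labels are the final native (.text) size of the corresponding function.
  features = np.random.rand(1024, 4).astype("float32")
  labels = np.random.rand(1024, 1).astype("float32")

  model = tf.keras.Sequential([
      tf.keras.layers.Dense(32, activation="relu", input_shape=(4,)),
      tf.keras.layers.Dense(1),
  ])
  model.compile(optimizer="adam", loss="mse")
  model.fit(features, labels, epochs=10, batch_size=64)

  # Export in the SavedModel format so it can later be used in inference mode
  # while training the inliner policy.
  tf.saved_model.save(model, "/tmp/ir2native_model")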
Add a comment here describing briefly how to download TF packages and set AOT_PATH?
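(A minimal sketch of the second half of that - locating the pip-installed tensorflow package so it can be passed to the LLVM cmake configuration; the variable name TENSORFLOW_AOT_PATH is taken from this patch series and should be double-checked against the final CMake:)

  import os
  import tensorflow as tf

  # Emit the cmake argument pointing at the installed tensorflow package.
  print("-DTENSORFLOW_AOT_PATH=" + os.path.dirname(tf.__file__))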