This is an archive of the discontinued LLVM Phabricator instance.

[llvm-profdata] Add support for weighted merge of profile data (2nd try)
ClosedPublic

Authored by slingn on Dec 7 2015, 2:13 PM.

Details

Summary

This change adds support for specifying a weight when merging profile data with the llvm-profdata tool.
Weights are specified by using the --weighted-input=<weight>,<filename> option. Input files not specified
with this option (normal positional list after options) are given a default weight of 1.

Adding support for arbitrary weighting of input profile data allows for relative importance to be placed on the
input data from multiple training runs.

Both sampled and instrumented profiles are supported.

Diff Detail

Repository
rL LLVM

Event Timeline

slingn updated this revision to Diff 42106.Dec 7 2015, 2:13 PM
slingn retitled this revision from to [llvm-profdata] Add support for weighted merge of profile data (2nd try).
slingn updated this object.
slingn added a reviewer: davidxl.
slingn added a subscriber: llvm-commits.
slingn added a comment.Dec 7 2015, 2:14 PM

Diff 1 is the baseline change from D14547. The fix for the Windows test failure isn't quite 'trivial' so I'm going to make that Diff 2.

slingn updated this revision to Diff 42107.Dec 7 2015, 2:15 PM

Fix for Windows test failure.

davidxl edited edge metadata.Dec 7 2015, 2:29 PM
davidxl added a subscriber: davidxl.

What kind of windows test failure was there?

David

slingn added a comment.Dec 7 2015, 2:51 PM

Failed builds:

http://lab.llvm.org:8011/builders/clang-x64-ninja-win7/builds/7858
http://lab.llvm.org:8011/builders/clang-x86-win2008-selfhost/builds/4929

Before we try to parse the weight from each input file we check if the file exists with sys::fs::exists(). This was done to allow file names to still include the ':' delimiter character - especially important on Windows since many file paths contain ':'.

Unfortunately, this check isn't foolproof as it appears to not handle the case of mixed path delimiters as we have when running the tests:

llvm-profdata.EXE merge D:\buildslave\clang-x64-ninja-win7\llvm\test\tools\llvm-profdata/Inputs/invalid-count-later.proftext D:\buildslave\clang-x64-ninja-win7\llvm\test\tools\llvm-profdata/Inputs/invalid-count-later.profdata -o D:\buildslave\clang-x64-ninja-win7\stage2\test\tools\llvm-profdata\Output\text-format-errors.test.tmp.out

slingn added a comment.Dec 7 2015, 2:51 PM

BTW - I was able to reproduce the failure and verify the fix on Windows 7 / Visual Studio 2013.

davidxl added inline comments.Dec 7 2015, 3:23 PM
tools/llvm-profdata/llvm-profdata.cpp
198 ↗(On Diff #42107)

Can you add a comment indicating this works around a bug in sys::fs::exists()?

slingn updated this revision to Diff 42120.Dec 7 2015, 3:57 PM
slingn edited edge metadata.

Updated for davidxl's comments.

slingn added inline comments.Dec 7 2015, 4:00 PM
tools/llvm-profdata/llvm-profdata.cpp
198 ↗(On Diff #42120)

It isn't exclusively a work-around for the issue with sys::fs::exists() - but I've tried to make the explanation clearer in the comments.

With this check the user will get a meaningful error message (file not found) on UNIX in the (obscure) situation of an invalid path containing ':'.

slingn added a comment.Dec 7 2015, 5:27 PM

OK, so it's not a bug in sys::fs::exists(). It's a bug in the test script text-format-errors.text:

1- Detect invalid count
RUN: not llvm-profdata show %p/Inputs/invalid-count-later.proftext 2>&1 | FileCheck %s --check-prefix=INVALID-COUNT-LATER
RUN: not llvm-profdata merge %p/Inputs/invalid-count-later.proftext %p/Inputs/invalid-count-later.profdata -o %t.out 2>&1 | FileCheck %s --check-prefix=INVALID-COUNT-LATER
INVALID-COUNT-LATER: error: {{.*}}invalid-count-later.proftext: Malformed instrumentation profile data

I stepped through with a debugger:

"C:/Users/Nathan/Sandbox/llvm/build/Debug/bin\llvm-profdata.EXE" "merge" "C:\Users\Nathan\Sandbox\llvm\test\tools\llvm-profdata/Inputs/invalid-count-later.proftext" "C:\Users\Nathan\Sandbox\llvm\test\tools\llvm-profdata/Inputs/invalid-count-later.profdata" "-o" "C:\Users\Nathan\Sandbox\llvm\build\test\tools\llvm-profdata\Output\text-format-errors.test.tmp.out"

Input 1: C:\Users\Nathan\Sandbox\llvm\test\tools\llvm-profdata/Inputs/invalid-count-later.proftext
Input 2: C:\Users\Nathan\Sandbox\llvm\test\tools\llvm-profdata/Inputs/invalid-count-later.profdata
                                                                                          ^^^^^^^^

And what do you know, the first mixed path exists and the check works just fine. The problem is that the second path really doesn't exist - so sys::fs::exists() rightly points this out:

bad-hash.proftext   compat.profdata.v2   foo3bar3-1.proftext           sample-profile.proftext      weight-sample-foo.proftext
bar3-1.proftext     empty.proftext       gcc-sample-profile.gcov       text-format-errors.text.bin
basic.proftext      extra-word.proftext  inline-samples.afdo           weight-instr-bar.profdata
c-general.profraw   foo3-1.proftext      invalid-count-later.proftext  weight-instr-foo.profdata
compat.profdata.v1  foo3-2.proftext      no-counts.proftext            weight-sample-bar.proftext
slingn added a comment.Dec 7 2015, 5:30 PM

I still think a check is a good idea for the edge case mentioned above (input contains ':' but is not specifying a weight and does not exist) so that the error message is correct. But thankfully I think since sys::fs::exists() works we can just check if the split path exists - if it doesn't fall back to default behavior. Hopefully that's a more appealing way to do this.

I think the right fix is:

StringRef WeightStr;
​std::tie(FileName, WeightStr) = Input.rsplit(':');
​ if (!sys::fs::exists(FileName) && !sys::fs::exists(Input)) {

  // report missing file error;
}
else if (sys::fs::exists(Input)) {
  // add input 
 }
else {
   assert (!WeightStr.empty());
   // check WeightStr format 
   ...
}

slingn updated this revision to Diff 42211.Dec 8 2015, 1:14 PM

Updated for davidxl and bogner's comments.

Improves handling of invalid paths vs. invalid weights. Adds tests.

silvas added a subscriber: silvas.Dec 8 2015, 2:38 PM

Given the subtleties we've encountered, I propose we switch over to using the syntax --weighted-file=<weight>,<filename> (or similar). This eliminates the special cases and is easier to document without subtly lying about the true behavior.

The subtlety is pretty well understood and the solution to it is clean

  • I don't see a strong reason to introduce a new option.

David

From the end user's perspective the <filename>:<weight> syntax seems the most intuitive and concise. And no matter what delimiter we choose (even with a separate option) there will still be edge cases for us to differentiate between file names containing ',' and the user trying but failing to specify a weight.

silvas requested changes to this revision.Dec 10 2015, 3:13 PM
silvas added a reviewer: silvas.

From the end user's perspective the <filename>:<weight> syntax seems the most intuitive and concise. And no matter what delimiter we choose (even with a separate option) there will still be edge cases for us to differentiate between file names containing ',' and the user trying but failing to specify a weight.

I disagree. With <filename>:<weight> the way to robustly invoke the tool programmatically is non-obvious; for example, the presence of random other files will change our interpretation and cause an error. A tool like this should only look at the files it is told to look at, and no more; with <filename>:<weight> we cannot even begin to approach that standard since it is ambiguous which files we should even look at.

With an explicit option, it is something like '--weighted-file=' + str(weight) + ',' + filename which always works as expected (when the weight and filename are valid, which is the case I'm talking about).

Also, note that nothing about this is "intuitive" (unless you know of analogous tools that use that particular syntax, but I'm not aware of any). Intuition comes from past experience. People will learn about this feature of llvm-profdata from our documentation; it isn't trying to be compatible with anything and user's don't have any preconceived notions about how it should work. Actually, I would go so far as to say that your statement is backwards: using an option *is* intuitive because users already are already aware of the caveats of having a file whose name might be ambiguous (e.g. starts with -) and are aware of the notion of passing options to a command-line program.

Or to put it another way: we should not invent a new command line syntax. Options are the standard way to pass more than a filename.

Furthermore, using an option, we can always go back and add <filename>:<weight> if there is resounding user demand for a "more concise" syntax.

This revision now requires changes to proceed.Dec 10 2015, 3:13 PM

Adding a few others for review since this has expanded beyond 'fix the Windows test issue' for the reverted D14547.

I misunderstood your proposal. Your proposal is a modified version of
the originally proposed option that requires weight to be specified
for every file (and with implicit ordering assumption). To specify
more than one file with weight, does user have to use multiple
--weight-file option?

David

I misunderstood your proposal. Your proposal is a modified version of
the originally proposed option that requires weight to be specified
for every file (and with implicit ordering assumption). To specify
more than one file with weight, does user have to use multiple
--weight-file option?

Yes. What I'm suggesting is precisely like the current patch, except that --weighted-file=<weight>,<filename> is used instead of <filename>:<weight>. (I do not care about --weighted-file syntax per se; it could be -weighted-file, -file-with-weight, etc. whatever is most convenient to implement. The fundamental thing is to reuse the user intuition about passing options in order to avoid exposing users to another special case (argument syntax is already a special case, but one that users are aware of).

When a file is specified via --weighted-file, does it need to be
specified again later in the command line or it is optional?

David

slingn added a comment.EditedDec 10 2015, 3:38 PM

There's one --weighted-file=<weight>,<filename> per file to merge, right?

Regarding intuition - I know I have trouble remembering the exact name of options I don't use very often. I might have a chance with a single ':'. :)

There's one --weighted-file=<weight>,<filename> per file to merge, right?

Regarding intuition - I know I have trouble remember the exact name of options I don't use very often. I might have a chance with a single ':'. :)

That is not intuition; that is memorableness. The reality is that it is unlikely that a user will type in the merge step by hand except maybe to try it out once or twice at first (in which case it's all new to them anyway).

It is more important that once a user's automation has been taught to invoke the merge step, that it robustly continues to work as intended. Also, it is important for us to be able to easily communicate to them how to robustly invoke the tool.

OK - that makes sense to me if automated merging is expected to be the most common workflow.

There's one --weighted-file=<weight>,<filename> per file to merge, right?

Sorry, missed this in my last message. Yes. Any <filename> can be replaced with --weighted-file=<weight>,<filename>. Just <filename> is equivalent to --weighted-file=1,<filename>.

A couple examples are the easiest way to explain it (especially to users) I think:

(something like this would be useful to include in the docs I think)

# Basic usage
llvm-profdata merge foo.profdata bar.profdata baz.profdata

# `foo.profdata` is especially important, multiply its counts by 10.
llvm-profdata merge --file-with-weight=10,foo.profdata bar.profdata baz.profdata
# Exactly equivalent to the previous invocation (explicit form; useful for programmatic invocation)
llvm-profdata merge --file-with-weight=10,foo.profdata --file-with-weight=1,bar.profdata --file-with-weight=1,baz.profdata

sounds good to me.

David

slingn updated this revision to Diff 42580.Dec 11 2015, 2:19 PM
slingn edited edge metadata.

Updated for silvas comments.

Some nits but otherwise this LGTM.

docs/CommandGuide/llvm-profdata.rst
39 ↗(On Diff #42580)

Can you add a description of the weight stuff here and show the 3 example invocations I mentioned in http://reviews.llvm.org/D15306#307727 ?

test/tools/llvm-profdata/weight-instr.test
6 ↗(On Diff #42580)

The documentation says -weighted-input do we accept double dash or single dash or both? Let's make sure our tests are using the same command line invocations as our documentation indicates.

9 ↗(On Diff #42580)

Just for safety's sake, can you use CHECK-NEXT on the subsequent lines?

20 ↗(On Diff #42580)

Is there a more useful name than WEIGHT1? Maybe WEIGHT_UNIT_SCALING or something?

33 ↗(On Diff #42580)

Can you use a more meaningful name than WEIGHT2 Maybe WEIGHT_3x_5x?

57 ↗(On Diff #42580)

Can you come up with some semantically meaningful naming for the CHECK lines? Here and elsewhere.

test/tools/llvm-profdata/weight-sample.test
56 ↗(On Diff #42580)

These types of tests (that depend on system error messages or file paths) are notoriously brittle. I suggest searching around and looking at the patterns that other tests are using to know which regexes to use.

For example:

tools/llvm-size/basic.test:ENOENT: {{.*}}llvm-size{{(\.EXE|\.exe)?}}: {{.*}}.blah: {{[Nn]}}o such file or directory
slingn updated this revision to Diff 42751.Dec 14 2015, 12:12 PM
slingn marked 7 inline comments as done.
slingn edited edge metadata.

Updated for silvas feedback.

slingn updated this object.Dec 14 2015, 12:12 PM
dnovillo edited edge metadata.Dec 14 2015, 12:52 PM

I like the new way of specifying weighted input better. Thanks! This LGTM with a couple of changes below.

include/llvm/ProfileData/SampleProf.h
144 ↗(On Diff #42751)

We should be indicating overflows somehow here. Could you for now assert that an overflow hasn't happened? Adding a FIXME for doing a more principled handling is fine for now.

189 ↗(On Diff #42751)

Since we're using satmul, might as well change the addition to a saturating addition.

slingn updated this revision to Diff 42779.Dec 14 2015, 2:40 PM
slingn marked 2 inline comments as done.
slingn edited edge metadata.

Updated for dnovillo feedback.

This revision was automatically updated to reflect the committed changes.