[clang-tidy] Store checks profiling info as JSON files
ClosedPublic

Authored by lebedev.ri on May 8 2018, 2:44 PM.

Details

Summary

Continuation of D46504.

Example output:

$ clang-tidy -enable-check-profile -store-check-profile=. -checks=-*,readability-function-size source.cpp
$ # Note that there won't be timings table printed to the console.
$ cat *.json
{
"file": "/path/to/source.cpp",
"timestamp": "2018-05-16 16:13:18.717446360",
"profile": {
  "time.clang-tidy.readability-function-size.wall": 1.0421266555786133e+00,
  "time.clang-tidy.readability-function-size.user": 9.2088400000005421e-01,
  "time.clang-tidy.readability-function-size.sys": 1.2418899999999974e-01
}
}

There are two arguments that control profile storage:

  • -store-check-profile=<prefix>

    By default reports are printed in tabulated format to stderr. When this option is passed, these per-TU profiles are instead stored as JSON. If the prefix is not an absolute path, it is considered to be relative to the directory from where you have run :program:clang-tidy. All . and .. patterns in the path are collapsed, and symlinks are resolved.

    Example: Let's suppose you have a source file named example.cpp, located in /source directory.
    • If you specify -store-check-profile=/tmp, then the profile will be saved to /tmp/<timestamp>-example.cpp.json
    • If you run :program:clang-tidy from within /foo directory, and specify -store-check-profile=., then the profile will still be saved to /foo/<timestamp>-example.cpp.json

Diff Detail

Repository
rL LLVM
There are a very large number of changes, so older changes are hidden. Show Older Changes

I think will be good idea to store data in JSON format too.

docs/ReleaseNotes.rst
60 ↗(On Diff #145790)

May be Profile information could be stored in SSV format. sounds better?

docs/clang-tidy/index.rst
762 ↗(On Diff #145790)

Please use ` for command line options and other non-C/C++ constructs.

I think will be good idea to store data in JSON format too.

Yeah, i have thought about it, and i'm not sure.
The output is so dumb so there isn't even much point in using anything more advanced than CSV.

rja added a subscriber: rja.May 8 2018, 3:46 PM

+1 for JSON

lebedev.ri planned changes to this revision.May 9 2018, 5:32 AM
lebedev.ri added inline comments.
docs/clang-tidy/index.rst
762 ↗(On Diff #145790)

`` is used for command line options elsewhere in this document.
Is there a reference somewhere?

lebedev.ri updated this revision to Diff 145901.May 9 2018, 5:41 AM
lebedev.ri retitled this revision from [clang-tidy] Store checks profiling info as CSV files to [clang-tidy] Store checks profiling info as YAML files.
lebedev.ri added reviewers: george.karpenkov, NoQ.
  • Deduplicate code by switching to already-existing YAML printer infrastructure from TimerGroup
  • Switch to YAML. It's kinda ugly, but maybe better than having to manually construct CSV.. :/
  • When the output prefix does not yet exist, still store the profile as YAML, don't fail at make_real
  • When storing profile to file, don't print it to screen. I have tried it, and it is just too noisy for no apparent benefit.
lebedev.ri updated this revision to Diff 145906.May 9 2018, 6:29 AM
lebedev.ri edited the summary of this revision. (Show Details)
  • Produce valid-er YAML.
lebedev.ri added inline comments.May 9 2018, 6:56 AM
docs/clang-tidy/index.rst
783 ↗(On Diff #145906)

Hmm, thinking about it a bit more, i think it should also contain the "file",

{
  "file": "/source/example.cpp",
  "profile": {
    "time.clang-tidy.readability-function-size.wall": 9.5367431640625000e-05,
    "time.clang-tidy.readability-function-size.user": 7.2000000000002617e-05,
    "time.clang-tidy.readability-function-size.sys": 1.8999999999999920e-05
  }
}
alexfh added a comment.May 9 2018, 7:22 AM
In D46602#1092111, @rja wrote:

+1 for JSON

Could you explain why would you use YAML or JSON for this? How are you going to use/process this data?

alexfh added a comment.May 9 2018, 7:34 AM

Roman, it looks to me that a simpler storage scheme would be sufficient. For example, YYYYMMDDhhmmss-InputFileName.cpp.csv. Main things are: 1. include a timestamp, so there's no need to overwrite old results, 2. include just the name of the file without any parent directories, 3. put all outputs into the same directory. This way we wouldn't have to create a directory structure and think about stripping a certain prefix (btw, utilities like patch just specify the number of path components to remove from the start, not the actual substring). WDYT?

In D46602#1092111, @rja wrote:

+1 for JSON

Could you explain why would you use YAML or JSON for this? How are you going to use/process this data?

I personally don't have a preference here.
The "YAML" may be better because that is what already supported by TimerGroup, which allowed to drop the table printer, too.

The next step will be a python script:

  1. Takes either a file names (these generated .yaml), or a prefix, If it is a prefix, it GLOBS all the .yaml's in there.
  2. Load all the files from step 1.
  3. Print global report from data collected (reduce(+) after grouping by check name) from all the files (what D45931 did)
  4. Maybe print report about outliers - without "reduce(+) after grouping by check name", with percentages to the total time spent on TU. This is where the filename (and YAML) would be helpful.
  5. ???
  6. Run on codebases, find performance problems, contribute fixes
  7. Profit!

Roman, it looks to me that a simpler storage scheme would be sufficient. For example, YYYYMMDDhhmmss-InputFileName.cpp.csv.
Main things are:

  1. include a timestamp, so there's no need to overwrite old results,

Of the input source file?

I don't like this, because when working on trying to improve the performance of the check,
one wouldn't touch the source file, only the clang-tidy sources. So either you would
have to touch sources (and if they are not writable?), or remove .csv beforehand,
or not output to file, but redirect output.

Neither of these possibilities sound great to me.

  1. include just the name of the file without any parent directories,

That won't work, there are duplicate filenames even in LLVM.

$ find -iname Path.cpp
./lib/Support/Path.cpp
./unittests/Support/Path.cpp
  1. put all outputs into the same directory. This way we wouldn't have to create a directory structure and think about stripping a certain prefix (btw, utilities like patch just specify the number of path components to remove from the start, not the actual substring). WDYT?

I'm not particularly looking forward to having being forced to have n thousands of reports in a single directory :)

Roman, it looks to me that a simpler storage scheme would be sufficient. For example, YYYYMMDDhhmmss-InputFileName.cpp.csv.
Main things are:

  1. include a timestamp, so there's no need to overwrite old results,

Of the input source file?

I don't like this, because when working on trying to improve the performance of the check,
one wouldn't touch the source file, only the clang-tidy sources. So either you would
have to touch sources (and if they are not writable?), or remove .csv beforehand,
or not output to file, but redirect output.

... also, a new report with a new name will be created each time the filetime changes, so not
only will it be fun from tooling point of view, but it will also leave old reports in-place..

Neither of these possibilities sound great to me.

  1. include just the name of the file without any parent directories,

That won't work, there are duplicate filenames even in LLVM.

$ find -iname Path.cpp
./lib/Support/Path.cpp
./unittests/Support/Path.cpp
  1. put all outputs into the same directory. This way we wouldn't have to create a directory structure and think about stripping a certain prefix (btw, utilities like patch just specify the number of path components to remove from the start, not the actual substring). WDYT?

I'm not particularly looking forward to having being forced to have n thousands of reports in a single directory :)

lebedev.ri updated this revision to Diff 145995.May 9 2018, 1:11 PM
lebedev.ri edited the summary of this revision. (Show Details)
  • Make json less flat, store source filename in it.

Roman, it looks to me that a simpler storage scheme would be sufficient. For example, YYYYMMDDhhmmss-InputFileName.cpp.csv.
Main things are:

  1. include a timestamp, so there's no need to overwrite old results,

Of the input source file?

No, current timestamp, when each profile gets dumped.

  1. include just the name of the file without any parent directories,

That won't work, there are duplicate filenames even in LLVM.

That's why I suggested to include a timestamp. Each result file will get a unique timestamp as a part of its name (unless the resolution of the timestamp is too coarse - compared to clang-tidy run time - to allow collisions).

$ find -iname Path.cpp
./lib/Support/Path.cpp
./unittests/Support/Path.cpp

> 3. put all outputs into the same directory. This way we wouldn't have to create a directory structure and think about stripping a certain prefix (btw, utilities like patch just specify the number of path components to remove from the start, not the actual substring). WDYT?
 
 I'm not particularly looking forward to having being forced to have n thousands of reports in a single directory :)

What kind of problems do you expect to have with this? Are you going to look at the result files manually or use a script to aggregate them?

Roman, it looks to me that a simpler storage scheme would be sufficient. For example, YYYYMMDDhhmmss-InputFileName.cpp.csv.
Main things are:

  1. include a timestamp, so there's no need to overwrite old results,

Of the input source file?

No, current timestamp, when each profile gets dumped.

Ah.
Then i don't understand this at all. Why would we want to do that, exactly?
Just so that we can avoid creating directory structure? Why do we want to avoid that?

  1. include just the name of the file without any parent directories,

That won't work, there are duplicate filenames even in LLVM.

That's why I suggested to include a timestamp. Each result file will get a unique timestamp as a part of its name (unless the resolution of the timestamp is too coarse - compared to clang-tidy run time - to allow collisions).

$ find -iname Path.cpp
./lib/Support/Path.cpp
./unittests/Support/Path.cpp

> 3. put all outputs into the same directory. This way we wouldn't have to create a directory structure and think about stripping a certain prefix (btw, utilities like patch just specify the number of path components to remove from the start, not the actual substring). WDYT?
 
 I'm not particularly looking forward to having being forced to have n thousands of reports in a single directory :)

What kind of problems do you expect to have with this? Are you going to look at the result files manually or use a script to aggregate them?

alexfh requested changes to this revision.May 14 2018, 7:20 AM

Roman, it looks to me that a simpler storage scheme would be sufficient. For example, YYYYMMDDhhmmss-InputFileName.cpp.csv.
Main things are:

  1. include a timestamp, so there's no need to overwrite old results,

Of the input source file?

No, current timestamp, when each profile gets dumped.

Ah.
Then i don't understand this at all. Why would we want to do that, exactly?
Just so that we can avoid creating directory structure? Why do we want to avoid that?

If clang-tidy is invoked manually, a simpler naming scheme with less configuration options would be easier to use, in particular:

  1. no the need to specify the -store-check-profile-elide-prefix= option;
  2. it's easier to see all results (no need to use find, just ls /output/directory or ls /output/directory/20180514* to see today's results, for example);
  3. no chance of filename collision, and thus no chance of losing older results by just running clang-tidy again.

If clang-tidy is started by a script (e.g. run-clang-tidy.py or a modification of it specialized on profiling), then a specific naming scheme doesn't matter much, since the script can create a separate directory for each run or for each shard. I don't yet see how repeating the directory structure of the input files would make profiling results easier or more convenient to use. A relation between the input file and the result file can still be maintained in some other form (for example, by specifying the whole invocation inside the results file - if we want to use YAML).

This revision now requires changes to proceed.May 14 2018, 7:20 AM

Roman, it looks to me that a simpler storage scheme would be sufficient. For example, YYYYMMDDhhmmss-InputFileName.cpp.csv.
Main things are:

  1. include a timestamp, so there's no need to overwrite old results,

Of the input source file?

No, current timestamp, when each profile gets dumped.

Ah.
Then i don't understand this at all. Why would we want to do that, exactly?
Just so that we can avoid creating directory structure? Why do we want to avoid that?

If clang-tidy is invoked manually, a simpler naming scheme with less configuration options would be easier to use, in particular:

  1. no the need to specify the -store-check-profile-elide-prefix= option;
  2. it's easier to see all results (no need to use find, just ls /output/directory or ls /output/directory/20180514* to see today's results, for example);
  3. no chance of filename collision, and thus no chance of losing older results by just running clang-tidy again.

How do i reflect that in tests? The output name will basically be random.

If clang-tidy is started by a script (e.g. run-clang-tidy.py or a modification of it specialized on profiling), then a specific naming scheme doesn't matter much, since the script can create a separate directory for each run or for each shard. I don't yet see how repeating the directory structure of the input files would make profiling results easier or more convenient to use. A relation between the input file and the result file can still be maintained in some other form (for example, by specifying the whole invocation inside the results file - if we want to use YAML).

Roman, it looks to me that a simpler storage scheme would be sufficient. For example, YYYYMMDDhhmmss-InputFileName.cpp.csv.
Main things are:

  1. include a timestamp, so there's no need to overwrite old results,

Of the input source file?

No, current timestamp, when each profile gets dumped.

Ah.
Then i don't understand this at all. Why would we want to do that, exactly?
Just so that we can avoid creating directory structure? Why do we want to avoid that?

If clang-tidy is invoked manually, a simpler naming scheme with less configuration options would be easier to use, in particular:

  1. no the need to specify the -store-check-profile-elide-prefix= option;
  2. it's easier to see all results (no need to use find, just ls /output/directory or ls /output/directory/20180514* to see today's results, for example);
  3. no chance of filename collision, and thus no chance of losing older results by just running clang-tidy again.

How do i reflect that in tests? The output name will basically be random.

Why random? It will follow a certain rule that can be verified. If we decide to name the file YYYYMMDD_InputFileName.cpp.yaml, for example, then the test can verify the file matching ????????_InputFileName.cpp.yaml exists and contains whatever it should contain. If we want the test to be stricter, it could verify that the ???????? part matches a certain pattern (say, 20\d\d[0-1][0-9][0-3][0-9]). Additionally, it could verify that after a short sleep a file with a different ???????? part is generated.

lebedev.ri edited the summary of this revision. (Show Details)
lebedev.ri added subscribers: aaron.ballman, JonasToth.
  • Get rid of 'prefix elision'
  • Don't use full source file name, only the filename
  • Prefix that filename with ISO8601-like timestamp of the profile creation.

I think this is worse than before, horrible and ugly, but hey, i'm not the code owner.

lebedev.ri added a project: Restricted Project.May 14 2018, 10:56 AM

How do i reflect that in tests? The output name will basically be random.

Why random?

I was obviously talking about the FileCheck stuff, how to specify such a dynamic filename there..

lebedev.ri edited the summary of this revision. (Show Details)
  • Rebased
  • Slightly refactor how the profile storage params are computed, move that into ClangTidyProfiling::StorageParams helper struct.
  • Store timestamp in the json too, since it's already in the filename. May be useful for garbage collection and alike.

A couple of comments from a cursory look. I'll try to look closer later this week.

clang-tidy/ClangTidyDiagnosticConsumer.h
173 ↗(On Diff #147071)

The name is misleading. I'd suggest using a name that unambiguously states that we're setting a prefix for profiling output files. Some ideas: setProfilePathPrefix, setStoredProfilePathPrefix, setProfileStorePathPrefix, setProfileResultsPathPrefix.

174 ↗(On Diff #147071)

getProfileStorageParams?

lebedev.ri marked 2 inline comments as done.

Rename getter/setter functions.

Thank you for taking a look!

Quuxplusone added inline comments.
clang-tidy/tool/ClangTidyMain.cpp
188 ↗(On Diff #147120)

Drive-by nit: Each other help string ends with a newline ()"), on a line by itself). This one presumably should too.

lebedev.ri added inline comments.May 16 2018, 11:13 AM
clang-tidy/tool/ClangTidyMain.cpp
188 ↗(On Diff #147120)

Hmm, yeah, at least for consistency.

george.karpenkov accepted this revision.May 30 2018, 4:26 PM

LGTM with a nit to me, but maybe clang-tidy code owners would need to sign that off as well.

clang-tidy/ClangTidy.h
232 ↗(On Diff #147120)

doxygen comment for new param?

LGTM with a nit to me

Thank you!

I suspect that at least temporarily we will end up with two different tooling sets
to further post-process these results, since i really love to write bicycles and
then hopefully throw them away :)

but maybe clang-tidy code owners would need to sign that off as well.

Yeah, that is time-consuming as usual.

aaron.ballman added inline comments.May 31 2018, 5:10 AM
clang-tidy/ClangTidyProfiling.cpp
72 ↗(On Diff #147120)

Missing a whitespace before the quote for the file name.

clang-tidy/tool/ClangTidyMain.cpp
334 ↗(On Diff #147120)

The identifiers here don't match the coding convention (should be ProcessPrefix and Input by convention).

342–347 ↗(On Diff #147120)

This creates a TOCTOU bug; can you call real_path() without checking for existence and instead check the error code to decide whether to spit out an error or return AbsolutePath?

lebedev.ri updated this revision to Diff 149610.Jun 2 2018, 5:31 AM
lebedev.ri marked 6 inline comments as done.

Rebased.
Addressed review notes.

clang-tidy/tool/ClangTidyMain.cpp
342–347 ↗(On Diff #147120)

Hmm, i think this can just go away.

aaron.ballman added inline comments.Jun 5 2018, 12:12 AM
clang-tidy/ClangTidy.h
230–232 ↗(On Diff #149610)

Minor wording nits to fix grammar, how about: "...the profile will not be output to stderr, but will instead be stored as a JSON file in the specified directory."

clang-tidy/ClangTidyProfiling.cpp
47 ↗(On Diff #149610)

I'm not keen that we call this printAsJSON() when the docs talk about writing out YAML. While all JSON is valid YAML, that feels like trivia we shouldn't encode in the function name. What do you think to renaming to printAsYAML() instead?

77 ↗(On Diff #149610)

I'm not keen that we call printAsJSON when the docs talk about writing out YAML. While all JSON is valid YAML, that feels like trivia we shouldn't encode in the function name. What do you think to renaming to printAsYAML() instead?

docs/clang-tidy/index.rst
257 ↗(On Diff #149610)

Spurious change?

757 ↗(On Diff #149610)

use -> use the

758 ↗(On Diff #149610)

will be output to stderr

802 ↗(On Diff #149610)

in -> in the

lebedev.ri updated this revision to Diff 149933.Jun 5 2018, 4:10 AM
lebedev.ri marked 5 inline comments as done.

Rebase ontop of rL333994, address review notes, canonicalize to JSON.

lebedev.ri retitled this revision from [clang-tidy] Store checks profiling info as YAML files to [clang-tidy] Store checks profiling info as JSON files.Jun 5 2018, 4:10 AM
lebedev.ri edited the summary of this revision. (Show Details)
lebedev.ri added inline comments.
clang-tidy/ClangTidyProfiling.cpp
47 ↗(On Diff #149610)

But the actual printer is called TG->printJSONValues()..
Let's go the other way around, and canonicalize to JSON.

aaron.ballman accepted this revision.Jun 5 2018, 4:44 AM

LGTM!

clang-tidy/ClangTidyProfiling.cpp
47 ↗(On Diff #149610)

Even better!

LGTM!

Thank you for the review.

Waiting on @alexfh ...

This revision is now accepted and ready to land.Jun 6 2018, 7:55 AM

LG

Thank you for the review!

This revision was automatically updated to reflect the committed changes.