This is an archive of the discontinued LLVM Phabricator instance.

-ftime-report switch support in Clang
AcceptedPublic

Authored by avt77 on Feb 21 2018, 8:01 AM.

Details

Summary

The current support of the feature produces only 2 lines in report:

  • Some general Code Generation Time
  • Total time of Backend Consumer actions

This patch extends Clang time report with the following lines:

-------------------------------------------------------------------------

===== Clang Parser =====

-------------------------------------------------------------------------

Total Execution Time: 0.0160 seconds (0.0189 wall clock)

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
 0.0080 ( 66.7%)   0.0040 (100.0%)   0.0120 ( 75.0%)   0.0114 ( 60.3%)  Parse Template
 0.0040 ( 33.3%)   0.0000 (  0.0%)   0.0040 ( 25.0%)   0.0057 ( 30.2%)  Parse Function Body
 0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0011 (  5.8%)  PP Macro Expansion
 0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0005 (  2.4%)  Scope manipulation
 0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.5%)  Annotation operations
 0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.5%)  PP Macro Call Args
 0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.3%)  PP Find Handler
 0.0120 (100.0%)   0.0040 (100.0%)   0.0160 (100.0%)   0.0189 (100.0%)  Total

We could add more lines in the report but it'd be nice to understand what we'd like to see.

Diff Detail

Repository
rL LLVM

Event Timeline

avt77 created this revision.Feb 21 2018, 8:01 AM
  • Just a warning: I'm a bit skeptical timers will work reliable for things that happen in well under a millisecond (did you do some sanity checking? i.e. do all the timer roughly add up to the time spent in the frontend?)
  • In the same vain I wonder if you add overhead by adding timers to functions that are likely called hundreds of times for a typical source file. Have you measured a release build with and without your patch.
avt77 added a comment.Feb 23 2018, 2:33 AM
  • Just a warning: I'm a bit skeptical timers will work reliable for things that happen in well under a millisecond (did you do some sanity checking? i.e. do all the timer roughly add up to the time spent in the frontend?)
  • In the same vain I wonder if you add overhead by adding timers to functions that are likely called hundreds of times for a typical source file. Have you measured a release build with and without your patch.
  1. You can see in the report very small numbers - it seems eveything works properly. But maybe you have some concrect examples to check?
  2. I'm going to build bootsraps with and w/o timers - I'll be back with results when it's done.

But the main idea of this version of the patch was to get understanding about interesting timers: what exactly do we want to see in report? Parser only? Sema? Some special parts of Clang Code Gen?...

Hi, thanks for working on this! Having better timers in clang would be really great.

But the main idea of this version of the patch was to get understanding about interesting timers: what exactly do we want to see in report? Parser only? Sema? Some special parts of Clang Code Gen?...

I think it would be a good idea to include timers on specific parts of the compiler that can often lead to big compile times, off the top of my head: constant evaluation, template instantiation, analysis based warnings, macro expansion (looks like you've already done this), and name lookup would all be really nice to have. Having these general timers like "Parse Template" that track the time spent in a parsing function don't really seem that valuable to me because they don't really give any indication as to how that time was actually spent. If the parser calls into sema then sema could start doing any number of things that would be included in the parser's timer, and nobody would have any insight into what actually needs to be optimized. What do you think about that?

include/clang/Lex/Preprocessor.h
145–146 ↗(On Diff #135260)

Making these non-static members is a waste of memory. Please make them static constexpr or just inline them into where they're used.

lib/Parse/ParseStmt.cpp
102 ↗(On Diff #135260)

Please don't include these unrelated whitespace changes in the diff!

avt77 updated this revision to Diff 138021.Mar 12 2018, 8:55 AM

I changed the timers. Now we could get something like here:

-------------------------------------------------------------------------

===== Frontend =====

-------------------------------------------------------------------------

Total Execution Time: 0.3760 seconds (0.3659 wall clock)

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
 0.2080 ( 59.1%)   0.0160 ( 66.7%)   0.2240 ( 59.6%)   0.2157 ( 58.9%)  Handle Top Level Decl
 0.1120 ( 31.8%)   0.0000 (  0.0%)   0.1120 ( 29.8%)   0.1152 ( 31.5%)  Handle Translation Unit
 0.0200 (  5.7%)   0.0080 ( 33.3%)   0.0280 (  7.4%)   0.0253 (  6.9%)  Handle Tag Decl Required Definition
 0.0080 (  2.3%)   0.0000 (  0.0%)   0.0080 (  2.1%)   0.0097 (  2.6%)  Handle Inline Function Definition
 0.0040 (  1.1%)   0.0000 (  0.0%)   0.0040 (  1.1%)   0.0000 (  0.0%)  Init Backend Consumer
 0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Link In Modules
 0.3520 (100.0%)   0.0240 (100.0%)   0.3760 (100.0%)   0.3659 (100.0%)  Total

-------------------------------------------------------------------------

===== Clang Parser =====

-------------------------------------------------------------------------

Total Execution Time: 6.7240 seconds (6.7310 wall clock)

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
 4.5520 ( 76.9%)   0.5560 ( 68.8%)   5.1080 ( 76.0%)   5.1109 ( 75.9%)  Parse Top Level Decl
 0.6680 ( 11.3%)   0.1200 ( 14.9%)   0.7880 ( 11.7%)   0.7732 ( 11.5%)  Parse Template
 0.6440 ( 10.9%)   0.1240 ( 15.3%)   0.7680 ( 11.4%)   0.7459 ( 11.1%)  Parse Function Definition
 0.0360 (  0.6%)   0.0040 (  0.5%)   0.0400 (  0.6%)   0.0654 (  1.0%)  Scope manipulation
 0.0120 (  0.2%)   0.0040 (  0.5%)   0.0160 (  0.2%)   0.0267 (  0.4%)  PP Macro Call Args
 0.0040 (  0.1%)   0.0000 (  0.0%)   0.0040 (  0.1%)   0.0059 (  0.1%)  PP Append Macro
 0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  0.0%)  Handle Pragma Directive
 0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0003 (  0.0%)  PP Find Handler
 5.9160 (100.0%)   0.8080 (100.0%)   6.7240 (100.0%)   6.7310 (100.0%)  Total

-------------------------------------------------------------------------

===== Sema =====

-------------------------------------------------------------------------

Total Execution Time: 3.0480 seconds (3.0127 wall clock)

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
 1.0760 ( 38.9%)   0.0960 ( 33.8%)   1.1720 ( 38.5%)   1.1687 ( 38.8%)  Act On End Of Translation Unit: Common case
 1.0760 ( 38.9%)   0.0960 ( 33.8%)   1.1720 ( 38.5%)   1.1681 ( 38.8%)  Act On End Of Translation Unit: TUKind != TU_Prefix
 0.2520 (  9.1%)   0.0440 ( 15.5%)   0.2960 (  9.7%)   0.2534 (  8.4%)  Handle Declarator
 0.1160 (  4.2%)   0.0120 (  4.2%)   0.1280 (  4.2%)   0.1453 (  4.8%)  Act On Id Expression
 0.0520 (  1.9%)   0.0120 (  4.2%)   0.0640 (  2.1%)   0.0827 (  2.7%)  Classify Name
 0.0520 (  1.9%)   0.0120 (  4.2%)   0.0640 (  2.1%)   0.0612 (  2.0%)  Check Implicit Conversion
 0.0480 (  1.7%)   0.0040 (  1.4%)   0.0520 (  1.7%)   0.0485 (  1.6%)  Check Call
 0.0560 (  2.0%)   0.0040 (  1.4%)   0.0600 (  2.0%)   0.0465 (  1.5%)  Lookup Template Name
 0.0280 (  1.0%)   0.0000 (  0.0%)   0.0280 (  0.9%)   0.0160 (  0.5%)  Act On String Literal
 0.0040 (  0.1%)   0.0040 (  1.4%)   0.0080 (  0.3%)   0.0137 (  0.5%)  Check Builtin Function Call
 0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0049 (  0.2%)  Act On Predefined Expr
 0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0022 (  0.1%)  Check X86 Builtin Function Call
 0.0040 (  0.1%)   0.0000 (  0.0%)   0.0040 (  0.1%)   0.0014 (  0.0%)  Act On Dependent Id Expression
 0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Try To Recover With Call
 0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Try Expr As Call
 2.7640 (100.0%)   0.2840 (100.0%)   3.0480 (100.0%)   3.0127 (100.0%)  Total

-------------------------------------------------------------------------

===== Include Files =====

-------------------------------------------------------------------------

Total Execution Time: 0.0600 seconds (0.0540 wall clock)

 ---User Time---   --User+System--   ---Wall Time---  --- Name ---
 0.0320 ( 53.3%)   0.0320 ( 53.3%)   0.0292 ( 54.0%)  Lookup File2
 0.0200 ( 33.3%)   0.0200 ( 33.3%)   0.0218 ( 40.3%)  Lookup File
 0.0040 (  6.7%)   0.0040 (  6.7%)   0.0016 (  3.0%)  Find Usable Module For Header
 0.0040 (  6.7%)   0.0040 (  6.7%)   0.0014 (  2.7%)  Should Enter Include File
 0.0600 (100.0%)   0.0600 (100.0%)   0.0540 (100.0%)  Total

The exact picture very depends on content of the compiling file but in general it's clear what we have.
About the performance question: I compared boostrap times produced by pure trunk compiler and my compiler with timers. I got the following results (obviously I used the patched compiler w/o option -ftime-report):

Trunk compiler (4 bootstraps were created):
real 43m19.845s
user 161m22.448s
sys 6m57.480s

real 42m54.171s
user 160m45.444s
sys 6m50.236s

real 43m43.098s
user 161m8.112s
sys 6m46.944s

real 42m59.542s
user 160m48.304s
sys 6m48.616s

The patched compiler (3 bootstraps):

real 44m13.653s
user 162m15.588s
sys 6m55.864s

real 44m7.108s
user 162m7.740s
sys 6m47.848s

real 43m18.520s
user 162m14.008s
sys 6m44.828s

As we see in the worst case we have degradation less than 1.2%. From my point of view it's absolutely acceptable.
Any comments, suggestions, objections?

RKSimon resigned from this revision.Mar 18 2018, 7:23 AM
RKSimon added a subscriber: RKSimon.

Hi,

I think it would be really instrumental to have a finer timing report, so thanks for your work! I tested the patch on CTMark with O0-g and didn't see detectable change either, so I think from compile-time impact this change is good. The patch needs some clean-up, but otherwise looks good to me.

Thanks,
Michael

lib/CodeGen/CodeGenModule.cpp
59 ↗(On Diff #138021)

No timers are added to this file, thus the header is redundant.

64–65 ↗(On Diff #138021)

These are unused.

lib/Frontend/FrontendAction.cpp
898–899 ↗(On Diff #138021)

Unused variable.

lib/Lex/Pragma.cpp
32 ↗(On Diff #138021)

Please don't reorder header files in the patch (you can commit such changes separately before the patch).

44 ↗(On Diff #138021)

Why do we need "Pass.h" here?

89 ↗(On Diff #138021)

What does the number 22 mean (and other similar numbers in the names)?

137 ↗(On Diff #138021)

Typo in "pppragma"?

lib/Parse/Parser.cpp
1468–1470 ↗(On Diff #138021)

Not-cleaned leftover of some sort?

1662 ↗(On Diff #138021)

Please don't include unnecessary whitespace changes (it applies to some other changes too).

avt77 added inline comments.Apr 9 2018, 2:07 AM
lib/Lex/Pragma.cpp
44 ↗(On Diff #138021)

Because we need llvm::TimePassesIsEnabled.

89 ↗(On Diff #138021)

I renamed all such names but in fact it's only a simple timer ID that's why we could use any name here.

137 ↗(On Diff #138021)

No, stays for PreProcessor Pragma

avt77 updated this revision to Diff 141621.Apr 9 2018, 5:13 AM

I fixed all issues raised by mzolotukhin and answered on his questions. I'm waiting for LGTM now.

mzolotukhin accepted this revision.Apr 9 2018, 8:27 AM

LGTM. Please don't forget to commit headers sorting (and other NFC if any) in a separate patch.

Thanks,
Michael

This revision is now accepted and ready to land.Apr 9 2018, 8:27 AM
This revision was automatically updated to reflect the committed changes.

@davezarzycki remarks in https://reviews.llvm.org/D45485 that this breaks the shared build. The proposed fix there is to make several of clang's modules depend on LLVM's IR library ("Core"). This seems weird to me for two reasons, one architectural, one form a build point of view:

  1. The modules growing the dep don't really depend on IR, they just need that one bool that happens to be defined there. That bool is called TimePassesIsEnabled which is a reasonable bool to live in IR, but this patch starts using that bool for meanings other than "should we time passes?". It instead uses the same bool to decide if clang should print a bunch of timing info. We probably should have a separate bool in clang and key this off that and make -ftime-report set both.
  1. From a build PoV, depending on Core means explicitly depending on TableGen processing the Attributes.td and Intrinsics.td files in include/llvm/IR, which needlessly (explicitly) serializes the build.

Also, please add cfe-commits to clang changes. Since this wasn't here and the patch wasn't seen by clang folks, since ftime-report-template-decl.cppwarnings is still failing on the bots (e.g. http://lab.llvm.org:8011/builders/llvm-clang-lld-x86_64-scei-ps4-windows10pro-fast/builds/16388) and since it breaks the shared bot, maybe we should revert for now and reland when the issues are addressed?

rsmith added a subscriber: rsmith.Apr 10 2018, 8:22 AM

Also, please add cfe-commits to clang changes. Since this wasn't here and the patch wasn't seen by clang folks, since ftime-report-template-decl.cppwarnings is still failing on the bots (e.g. http://lab.llvm.org:8011/builders/llvm-clang-lld-x86_64-scei-ps4-windows10pro-fast/builds/16388) and since it breaks the shared bot, maybe we should revert for now and reland when the issues are addressed?

Yes, please. This was not appropriately reviewed, breaks bots, and appears to be wrong/inappropriate in multiple ways.

Could you help me with revert of the committed patch?
I'll do all necessary changes but I don't know how I can revert the patch.

avt77 reopened this revision.Apr 10 2018, 10:50 AM

Obviously, this patch was not ready for commit that's why I reopen the revison.
But I don't know how to revert the patch from trunk: please, help me.

This revision is now accepted and ready to land.Apr 10 2018, 10:50 AM

A revert in practice just means undoing the changes. For example, while both git and svn have revert subcommands, you can also just take the original diff/patch, pipe it into patch -R, and if there are no conflicts, commit the result.

A revert in practice just means undoing the changes. For example, while both git and svn have revert subcommands, you can also just take the original diff/patch, pipe it into patch -R, and if there are no conflicts, commit the result.

Unfortunately, revert does not work with committed changes:

revert: Restore pristine working copy state (undo local changes).
usage: revert PATH...

Revert changes in the working copy at or within PATH, and remove
conflict markers as well, if any.
  • This subcommand does not revert already committed changes.**

But I'll try with patch -R. Tnx.

while both git and svn have revert subcommands

This is kind of misleading; yes, both git and svn have subcommands named "revert", but "svn revert" doesn't have the right meaning. You have to use "svn merge" to revert a committed change.

Reverted:

$ svn merge -c -329714 .

  • Reverse-merging r329714 into '.':

A test/Frontend/ftime-report-template-decl.cpp

  • Recording mergeinfo for reverse merge of r329714 into '.': U .

$ svn merge -c -329693 .

  • Reverse-merging r329693 into '.':

U test/Frontend/ftime-report-template-decl.cpp

  • Recording mergeinfo for reverse merge of r329693 into '.': U .

$ svn merge -c -329684 .

  • Reverse-merging r329684 into '.':

D test/Frontend/ftime-report-template-decl.cpp
U include/clang/Frontend/FrontendAction.h
U lib/Parse/ParseTemplate.cpp
U lib/Frontend/FrontendAction.cpp
U lib/Lex/Pragma.cpp
U lib/CodeGen/CodeGenAction.cpp
U include/clang/Parse/Parser.h
U include/clang/Sema/Sema.h
U include/clang/Lex/HeaderSearch.h
U lib/Parse/Parser.cpp
U lib/Frontend/ASTMerge.cpp
U lib/Frontend/FrontendActions.cpp
U lib/Frontend/CompilerInstance.cpp
U lib/Sema/SemaChecking.cpp
U lib/Sema/SemaExpr.cpp
U lib/Sema/Sema.cpp
U lib/Sema/SemaDecl.cpp
U lib/Sema/SemaTemplate.cpp
U lib/Lex/PPMacroExpansion.cpp
U lib/Lex/HeaderSearch.cpp

  • Recording mergeinfo for reverse merge of r329684 into '.': U .

$ svn commit -m 'Revert r329684 (and follow-ups 329693, 329714). See discussion on https://reviews.llvm.org/D43578.'
Sending include/clang/Frontend/FrontendAction.h
Sending include/clang/Lex/HeaderSearch.h
Sending include/clang/Parse/Parser.h
Sending include/clang/Sema/Sema.h
Sending lib/CodeGen/CodeGenAction.cpp
Sending lib/Frontend/ASTMerge.cpp
Sending lib/Frontend/CompilerInstance.cpp
Sending lib/Frontend/FrontendAction.cpp
Sending lib/Frontend/FrontendActions.cpp
Sending lib/Lex/HeaderSearch.cpp
Sending lib/Lex/PPMacroExpansion.cpp
Sending lib/Lex/Pragma.cpp
Sending lib/Parse/ParseTemplate.cpp
Sending lib/Parse/Parser.cpp
Sending lib/Sema/Sema.cpp
Sending lib/Sema/SemaChecking.cpp
Sending lib/Sema/SemaDecl.cpp
Sending lib/Sema/SemaExpr.cpp
Sending lib/Sema/SemaTemplate.cpp
Transmitting file data ...................done
Committing transaction...
Committed revision 329739.

Reverted:

Many thanks for your help!

Hi @efriedma – Sorry about that. I haven't directly used svn in years thanks to the git svn bridge.

avt77 added a comment.EditedApr 11 2018, 3:18 AM

@davezarzycki remarks in https://reviews.llvm.org/D45485 that this breaks the shared build. The proposed fix there is to make several of clang's modules depend on LLVM's IR library ("Core"). This seems weird to me for two reasons, one architectural, one form a build point of view:

  1. The modules growing the dep don't really depend on IR, they just need that one bool that happens to be defined there. That bool is called TimePassesIsEnabled which is a reasonable bool to live in IR, but this patch starts using that bool for meanings other than "should we time passes?". It instead uses the same bool to decide if clang should print a bunch of timing info. We probably should have a separate bool in clang and key this off that and make -ftime-report set both.
  1. From a build PoV, depending on Core means explicitly depending on TableGen processing the Attributes.td and Intrinsics.td files in include/llvm/IR, which needlessly (explicitly) serializes the build.

In fact the current trunk already depends on TimePassesIsEnabled (w/o this patch applied):

$ find clang -name \*.cpp | xargs grep TimePassesIsEnabled
clang/lib/CodeGen/CodeGenAction.cpp: llvm::TimePassesIsEnabled = TimePasses;
clang/lib/CodeGen/CodeGenAction.cpp: if (llvm::TimePassesIsEnabled)
clang/lib/CodeGen/CodeGenAction.cpp: if (llvm::TimePassesIsEnabled)
clang/lib/CodeGen/CodeGenAction.cpp: if (llvm::TimePassesIsEnabled) {
clang/lib/CodeGen/CodeGenAction.cpp: if (llvm::TimePassesIsEnabled) {
clang/lib/CodeGen/CodeGenAction.cpp: if (llvm::TimePassesIsEnabled)
clang/lib/CodeGen/CodeGenAction.cpp: if (llvm::TimePassesIsEnabled)
clang/lib/CodeGen/CodeGenAction.cpp: if (llvm::TimePassesIsEnabled) {
clang/lib/CodeGen/CodeGenAction.cpp: if (llvm::TimePassesIsEnabled) {
clang/lib/CodeGen/BackendUtil.cpp: TimeRegion Region(llvm::TimePassesIsEnabled ? &CodeGenerationTime : nullptr);
clang/lib/CodeGen/BackendUtil.cpp: TimeRegion Region(llvm::TimePassesIsEnabled ? &CodeGenerationTime : nullptr);

But I agree that such dependence is not OK. I'll create a separate bool instead of TimePassesIsEnabled but the question is should I remove the above usage of TimePassesIsEnabled as well? Or maybe it should be a separate pre-patch? Or it could be left as it is?
And one more question: should we introduce a new command line switch say 'ffrontend-time-report' or we should use the current 'ftime-report'?

We also see an assertion failure prior to the revert. At r329738:

$ cat test.cpp
template <class> struct A {
  template <class _Other> using rebind_alloc = _Other;
};
template <class _Alloc> struct _Wrap_alloc {
  template <class _Other>
  using rebind_alloc = typename A<_Alloc>::template rebind_alloc<_Other>;
  template <class> using rebind = _Wrap_alloc;
};
_Wrap_alloc<int>::rebind<int> w;

$ clang++ -c test.cpp

$ clang++ -c test.cpp -ftime-report
clang-7: llvm/lib/Support/Timer.cpp:133: void llvm::Timer::startTimer(): Assertion `!Running && "Cannot start a running timer"' failed.
avt77 added a comment.Apr 11 2018, 7:23 AM

We also see an assertion failure prior to the revert. At r329738:

Yes, you're right - thaks again for this test case.

avt77 updated this revision to Diff 142198.Apr 12 2018, 9:15 AM
avt77 added a reviewer: davezarzycki.

I removed the dependence on TimePassesIsEnabled (as @davezarzycki sugested) and fixed the issue with failed test (tnx to @russell.gallop). As result the patch was redesigned.

It wasn't my suggestion. @thakis wrote: "We probably should have a separate bool in clang and key this off that and make -ftime-report set both."

It wasn't my suggestion. @thakis wrote: "We probably should have a separate bool in clang and key this off that and make -ftime-report set both."

Sorry, but it's done now. Any comments, suggestions, requirements?

Last time I looked at doing this, I found that LLVM's timer infrastructure was fundamentally unsuitable for adding timers like these to Clang. The big problems are that Clang switches between "layers" and pieces that should be independently accounted at a very large number of call sites (eg, there are hundreds of functions on Sema that the Parser calls), and that Clang makes heavy use of recursion and delegation between functions that we might want to time (and generally has a very complex call graph, matching the complexity of the grammars that it parses). Unless there have been major changes, none of that actually works in LLVM's timer infrastructure: the time allocated to recursive and mutually-recursive functions is misaccounted (double-counted or worse), and there's no real way to extract the amount of time spent parsing a particular kind of declaration without including all the time spent in nested subactions (such as instantiating unrelated templates, generating code, and so on) -- and even enumerating all the barriers between "time spent in parsing" and "time spent in semantic analysis" does not seem reasonable. Generating user-friendly but misleading and likely wrong timing data seems like it would be a big problem here.

So I don't think this patch is reasonable for that reason. I'm also not sure whether this, as is, is addressing a pressing use case -- for Clang developers, existing non-invasive profiling tools (such as linux-perftools) are likely to work a lot better for identifying where in the Clang source code we're spending time. And Clang users typically don't care which function we're in (unless they're filing a bug, where again a profiler is probably a better tool).

However, I do think we could make -ftime-report vastly more useful to Clang users. Specifically, I think the useful, actionable feedback we can give to users would be to tell them which parts of their source code are taking a long time to compile. Profiling information that can describe -- for instance -- the time spent instantiating the N slowest templates, or handling the N slowest functions, or evaluating the N slowest constant expressions, or parsing the N slowest #included files, seems like it would be incredibly valuable. To make that work, I think we'll need to teach LLVM's timer infrastructure to properly separate out "self" time from "children" time for timers in the same group, and may need other infrastructure improvements.

Last time I looked at doing this, I found that LLVM's timer infrastructure was fundamentally unsuitable for adding timers like these to Clang.

Thank you for this answer. As I understand we should close both this review and D45619 but you'd like to see something like MS has here, right?

kimgr added a comment.Apr 14 2018, 1:06 AM

Heh, my e-mail response was eaten along the way. Continued here:

On Sat, Apr 14, 2018 at 1:53 AM, Richard Smith - zygoloid via
Phabricator via cfe-commits <cfe-commits@lists.llvm.org> wrote:

rsmith added a comment.

So I don't think this patch is reasonable for that reason. I'm also not sure whether this, as is, is addressing a pressing use case -- for Clang developers, existing non-invasive profiling tools (such as linux-perftools) are likely to work a lot better for identifying where in the Clang source code we're spending time. And Clang users typically don't care which function we're in (unless they're filing a bug, where again a profiler is probably a better tool).

However, I do think we could make -ftime-report vastly more useful to Clang users. Specifically, I think the useful, actionable feedback we can give to users would be to tell them which parts of their source code are taking a long time to compile. Profiling information that can describe -- for instance -- the time spent instantiating the N slowest templates, or handling the N slowest functions, or evaluating the N slowest constant expressions, or parsing the N slowest #included files, seems like it would be incredibly valuable. To make that work, I think we'll need to teach LLVM's timer infrastructure to properly separate out "self" time from "children" time for timers in the same group, and may need other infrastructure improvements.

I disagreed up until the last paragraph :)

That's exactly the crux of what most users need to know -- which parts
of my source code are causing the biggest build slow-down? The summary
information from -ftime-report can give a hint, but a detailed
breakdown would of course be great!

I disagreed up until the last paragraph :)

Can you say which things you disagree with? There seem to be two important facts here:

  1. LLVM's timing infrastructure cannot correctly report timings when one function containing a timing region calls another. That will happen all the time with this patch applied.
  2. Clang's architecture means that if you time the amount of time spent in, say, the "ParseTemplate" function, you have not computed the amount of time spent parsing templates, because the parser calls into Sema, which might perform tasks that are only somewhat related to parsing the template (such as performing other instantiations), and likewise Sema calls into other layers (such as AST file deserialization and code generation).

The first might have been fixed in LLVM at this point, but I can't see any evidence of that in LLVM's Timer implementation. And the second seems even more fundamental. But if there's some way around that, which would allow us to produce (correct!) numbers for times spent parsing / whatever else, without, for instance, adding a correctness requirement that we annotate every Parser -> Sema entry point with a timer, then that's definitely something we should discuss.

Last time I looked at doing this, I found that LLVM's timer infrastructure was fundamentally unsuitable for adding timers like these to Clang.

Thank you for this answer. As I understand we should close both this review and D45619

I think D45619 is a good change, and I'd like to see that get committed.

If LLVM's timer infrastructure can be improved to correctly track time in recursive and mutually-recursive timing regions (and I believe it can be), we could continue with this patch; if you want to pursue that, my first questions would be:

Who is the audience for this information?
What information do they want from a time report?
How do we present that information in a way that's not misleading (given Clang's architecture)?
Can we deliver useful value compared to a modern, dedicated profiling tool?

but you'd like to see something like MS has here, right?

Yes, that's the kind of information I think we should be looking at adding here -- I think that's exactly what Clang users would be looking for when they reach out to a compilation time report to try to figure out why their compile is slow.

kimgr added a comment.Apr 14 2018, 9:27 AM

I disagreed up until the last paragraph :)

Can you say which things you disagree with? There seem to be two important facts here:

  1. LLVM's timing infrastructure cannot correctly report timings when one function containing a timing region calls another. That will happen all the time with this patch applied.
  2. Clang's architecture means that if you time the amount of time spent in, say, the "ParseTemplate" function, you have not computed the amount of time spent parsing templates, because the parser calls into Sema, which might perform tasks that are only somewhat related to parsing the template (such as performing other instantiations), and likewise Sema calls into other layers (such as AST file deserialization and code generation).

The first might have been fixed in LLVM at this point, but I can't see any evidence of that in LLVM's Timer implementation. And the second seems even more fundamental. But if there's some way around that, which would allow us to produce (correct!) numbers for times spent parsing / whatever else, without, for instance, adding a correctness requirement that we annotate every Parser -> Sema entry point with a timer, then that's definitely something we should discuss.

Ah, I see. I read your original message as dismissive as in users don't need to know where time is spent.

I tried to make the case that users do care where time is spent, because it can give a hint as to what part of their physical/logical design they can work on to get better compile times.

But I think you were really saying that LLVM and Clang as designed can't provide anything perfect, only slightly misleading information?

Sorry for the misunderstanding.

  • Kim

Hi,

I've been monitoring compile-time for quite a while, so let my put my 2 cents here too.

Who is the audience for this information?
What information do they want from a time report?
How do we present that information in a way that's not misleading (given Clang's architecture)?

I would find the timers extremely useful. Even if they overlap, they still would 1) be a good indicator of a newly introduced problem and 2) give a rough idea where frontend time is spent. I agree that it wouldn't be super-accurate, but the numbers we usually operate are quite high (e.g. <some part> started to take 1.5x time). When we decide to investigate it deeper, we can use more accurate tools, such as profilers. All that said, if we can make timers more accurate/not-overlapping, that surely would be helpful as well.

Can we deliver useful value compared to a modern, dedicated profiling tool?

Having timers, especially producing results in the same format, as existing LLVM timers, would be much more convenient than using profilers. With timers I can simply add a compile-time flag to my test-suite run and get all the numbers in the end of my usual run. With profilers it's a bit more complicated.

Speaking of timers overlapping and mutual calling: LLVM timers also have this problem, and e.g. if there is problem is in some analysis (say, ScalarEvolution), we see indications in several other passes using this analysis (say, IndVarSimplify and LoopStrengthReduction). While it does not immediately show the problematic spot, it gives you pretty strong hints to where to look at first. So, having even not-perfect timers is still useful.

Also, I apologize for LGTMing the patch earlier while it was not properly reviewed - I didn't notice it didn't have cfe-commits in subscribers, and it had been waiting for a review for quite some time (so I assumed that all interested parties had their chance to look at it).

Thanks,
Michael

Who is the audience for this information?
What information do they want from a time report?
How do we present that information in a way that's not misleading (given Clang's architecture)?

I would find the timers extremely useful. Even if they overlap, they still would 1) be a good indicator of a newly introduced problem and 2) give a rough idea where frontend time is spent. I agree that it wouldn't be super-accurate, but the numbers we usually operate are quite high (e.g. <some part> started to take 1.5x time). When we decide to investigate it deeper, we can use more accurate tools, such as profilers.

What kinds of <some part> would be useful to you? (How do you want the runtime of Clang broken down?) Are vertical slices (through all Clang's various layers and potentially parts of LLVM) -- as this patch will produce -- useful, or would you really want horizontal slices (as in, what part of Clang is actually spending the time)? Or just anything that's basically expected to be consistent from run to run, so you can identify that something has slowed down, even if you don't have enough information to really know what?

All that said, if we can make timers more accurate/not-overlapping, that surely would be helpful as well.

I think we need to fix the overlap issue as a prerequisite to adding timers with large amounts of overlap, especially self-overlap. Otherwise the numbers will likely do more harm than good, as they will significantly misattribute runtime. Fortunately, I think that should only require some relatively small changes to the LLVM timer infrastructure.

Can we deliver useful value compared to a modern, dedicated profiling tool?

Having timers, especially producing results in the same format, as existing LLVM timers, would be much more convenient than using profilers. With timers I can simply add a compile-time flag to my test-suite run and get all the numbers in the end of my usual run. With profilers it's a bit more complicated.

Well, that depends on the profiler. With some profilers, you can just attach a profiler to your entire compilation and then ask it what the hottest functions were. But sure, if you have existing infrastructure built around -ftime-report, I can see that it makes sense to reuse that infrastructure.

Speaking of timers overlapping and mutual calling: LLVM timers also have this problem, and e.g. if there is problem is in some analysis (say, ScalarEvolution), we see indications in several other passes using this analysis (say, IndVarSimplify and LoopStrengthReduction). While it does not immediately show the problematic spot, it gives you pretty strong hints to where to look at first. So, having even not-perfect timers is still useful.

While LLVM may have some overlap between regions, the vertical timing slices are still pretty well aligned with the horizontal functionality slices. I expect the problems in Clang to be a lot worse. Suppose you enter N levels of parsing templates, and then trigger a whole bunch of template instantiation, AST deserialization, and code generation. Let's say that takes 1s in total. With this patch, I think we'd end up attributing Ns of compile time to "parsing templates", which is clearly very far from the truth, but will likely be listed as (by far) the slowest thing in the compilation. This does not even rise to the level of "not-perfect", it's going to be actively misleading in a lot of cases, and won't even necessarily point anywhere near the problematic spot.

I think with this patch and no fix to the overlap problem, we will find ourselves frequently fielding bugs where people say "X is slow" when it actually isn't. Plus I think we have the opportunity to deliver something that's hugely useful and not much harder to achieve (providing timing information that relates back to the source code), and I'd prefer we spend our complexity budget for this feature there.

Also, I apologize for LGTMing the patch earlier while it was not properly reviewed - I didn't notice it didn't have cfe-commits in subscribers, and it had been waiting for a review for quite some time (so I assumed that all interested parties had their chance to look at it).

No problem, these things happen :) And thank you for your feedback, it's very useful.

What kinds of <some part> would be useful to you? (How do you want the runtime of Clang broken down?) Are vertical slices (through all Clang's various layers and potentially parts of LLVM) -- as this patch will produce -- useful, or would you really want horizontal slices (as in, what part of Clang is actually spending the time)? Or just anything that's basically expected to be consistent from run to run, so you can identify that something has slowed down, even if you don't have enough information to really know what?

For me "something has slowed down" would be enough. I.e. even if "parse templates" would be erroneously attributed to 90% time spent in front-end, I would be able to see a jump from 90% to 95%. While these numbers are not reflecting the actual ratio, they still will indicate changes.

I would find horizontal slicing more interesting, as we can get vertical slices from profilers. I.e. if in LLVM profile I see that time spent in APInt::add has grown a lot, I'd like to know which pass started to use it more extensively - it's rarely the case that APInt::add slowed down itself.

I think we need to fix the overlap issue as a prerequisite to adding timers with large amounts of overlap, especially self-overlap. Otherwise the numbers will likely do more harm than good, as they will significantly misattribute runtime. Fortunately, I think that should only require some relatively small changes to the LLVM timer infrastructure.

I definitely would support that :)

Well, that depends on the profiler. With some profilers, you can just attach a profiler to your entire compilation and then ask it what the hottest functions were. But sure, if you have existing infrastructure built around -ftime-report, I can see that it makes sense to reuse that infrastructure.

Yeah, that was exactly my point.

While LLVM may have some overlap between regions, the vertical timing slices are still pretty well aligned with the horizontal functionality slices. I expect the problems in Clang to be a lot worse. Suppose you enter N levels of parsing templates, and then trigger a whole bunch of template instantiation, AST deserialization, and code generation. Let's say that takes 1s in total. With this patch, I think we'd end up attributing Ns of compile time to "parsing templates", which is clearly very far from the truth, but will likely be listed as (by far) the slowest thing in the compilation. This does not even rise to the level of "not-perfect", it's going to be actively misleading in a lot of cases, and won't even necessarily point anywhere near the problematic spot.
I think with this patch and no fix to the overlap problem, we will find ourselves frequently fielding bugs where people say "X is slow" when it actually isn't. Plus I think we have the opportunity to deliver something that's hugely useful and not much harder to achieve (providing timing information that relates back to the source code), and I'd prefer we spend our complexity budget for this feature there.

I see your point, and I agree it would be really misleading if e.g the sum of timers won't match the total time due to self-overlaps. I think adding these timers still might be worthwhile as my guess is that no one usually uses them anyway unless they know what to expect from the reported timers, but I might be mistaken here. And anyway, if you think it's possible to fix the issue first, it's totally fine with me.

Thanks,
Michael

avt77 added a comment.Apr 16 2018, 1:42 AM

I think D45619 is a good change, and I'd like to see that get committed.

Could you give me LGTM in this case? I'm going to publish "recursive"(ovelaped) timers but I'd like to base it on D45619.

avt77 added a comment.Apr 16 2018, 1:51 AM

I think we need to fix the overlap issue as a prerequisite to adding timers with large amounts of overlap, especially self-overlap. Otherwise the numbers will likely do more harm than good, as they will significantly misattribute runtime. Fortunately, I think that should only require some relatively small changes to the LLVM timer infrastructure.

JFYI: the given patch introduces non-recursive (self-overlaps) timers only. I'm going to introduce self-overlaps timers when D45619 is committed.

jsji removed a subscriber: jsji.Oct 1 2018, 6:44 AM
davezarzycki resigned from this revision.May 22 2019, 6:53 AM