This is an archive of the discontinued LLVM Phabricator instance.

[Clang] Fix compile time regression caused by D126061.
ClosedPublic

Authored by mboehme on Jun 17 2022, 2:14 PM.

Details

Summary

As noted by @nikic, D126061 causes a compile-time regression of about
0.5% on -O0 builds:

http://llvm-compile-time-tracker.com/compare.php?from=7acc88be0312c721bc082ed9934e381d297f4707&to=8c7b64b5ae2a09027c38db969a04fc9ddd0cd6bb&stat=instructions

This happens because, in a number of places, D126061 creates an
additional local variable of type ParsedAttributes. In the large
majority of cases, no attributes are added to this ParsedAttributes,
but it turns out that creating and then destroying an empty
ParsedAttributes is a relatively expensive operation.

The reason is that AttributePool uses a TinyPtrVector as its
underlying vector class, and the technique that TinyPtrVector employs
to achieve its extreme memory frugality makes its begin() and end()
member functions relatively slow. The ParsedAttributes destructor
iterates over the attributes in its AttributePool, so even destroying
an empty ParsedAttributes pays this cost.

The fix for this is to replace TinyPtrVector in ParsedAttributes and
AttributePool with SmallVector. ParsedAttributes and
AttributePool objects are only ever allocated on the stack (they're
not part of the AST), and only a small number of these objects are live
at any given time, so they don't need the extreme memory frugality of
TinyPtrVector.
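To illustrate the trade-off, here is a minimal sketch of the two storage strategies. The types and names below are hypothetical stand-ins, not LLVM's actual implementation; in particular, the real TinyPtrVector packs both representations into a single tagged pointer, which makes the branching in begin()/end() even more pronounced than in this two-field model.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of TinyPtrVector's storage strategy: either a single
// element stored inline, or a heap-allocated overflow vector.
class TinyModel {
  int* Single = nullptr;            // inline slot for exactly one element
  std::vector<int*>* Big = nullptr; // overflow storage for two or more

public:
  ~TinyModel() { delete Big; }

  void push_back(int* P) {
    if (Big) {
      Big->push_back(P);
    } else if (!Single) {
      Single = P; // empty -> store inline, no allocation
    } else {
      // Second element: spill both into a heap vector.
      Big = new std::vector<int*>{Single, P};
      Single = nullptr;
    }
  }

  // The cost described above: every call must re-derive which
  // representation is active before it can produce an iterator.
  int** begin() {
    if (Big)
      return Big->data();
    return Single ? &Single : nullptr;
  }
  int** end() {
    if (Big)
      return Big->data() + Big->size();
    return Single ? &Single + 1 : nullptr;
  }
  std::size_t size() { return static_cast<std::size_t>(end() - begin()); }
};
```

A SmallVector, by contrast, always keeps its elements in one contiguous buffer (inline until it overflows), so begin() is an unconditional load of a single pointer; that is what makes the ParsedAttributes destructor's iteration over its AttributePool cheap even in the common empty case.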

I've confirmed with valgrind that this patch does not increase heap
memory usage, and it actually makes compiles slightly faster than they
were before D126061.

Here are instruction count measurements (obtained with callgrind)
running clang -c MultiSource/Applications/JM/lencod/parsetcommon.c
(a file from llvm-test-suite that exhibited a particularly large
compile-time regression):

7acc88be0312c721bc082ed9934e381d297f4707
(baseline one commit before D126061 landed)
102,280,068 instructions

8c7b64b5ae2a09027c38db969a04fc9ddd0cd6bb
(the patch that landed D126061)
103,289,454 instructions
(+0.99% relative to baseline)

This patch applied onto
8c7b64b5ae2a09027c38db969a04fc9ddd0cd6bb
101,117,584 instructions
(-1.14% relative to baseline)

Diff Detail

Event Timeline

mboehme created this revision.Jun 17 2022, 2:14 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 17 2022, 2:14 PM
mboehme requested review of this revision.Jun 17 2022, 2:14 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript

CI breakage is the usual libomp breakage and appears to be unrelated.

aaron.ballman accepted this revision.Jun 21 2022, 9:46 AM

LGTM, great catch! That's not something I would have expected to be at the root of this performance issue. Thank you for the quick turnaround on the fix!

This revision is now accepted and ready to land.Jun 21 2022, 9:46 AM
This revision was landed with ongoing or failed builds.Jun 21 2022, 2:15 PM
This revision was automatically updated to reflect the committed changes.

Unfortunately, it looks as if this patch did not fix the compile-time regression entirely. It did so for the particular source file I tested above, but other source files are still slower to compile than before D126061.

For reference, here is the compile-time regression caused by D126061:

http://llvm-compile-time-tracker.com/compare.php?from=7acc88be0312c721bc082ed9934e381d297f4707&to=8c7b64b5ae2a09027c38db969a04fc9ddd0cd6bb&stat=instructions

This is a 0.35% regression (geomean) on -O0 compiles.

Here is the improvement due to this patch (D128097):

http://llvm-compile-time-tracker.com/compare.php?from=1c2b756cd6f9f9408863fb0e91f55731f81b46d9&to=0d300da799b06931eb6b974198d683548a8c8392&stat=instructions

This is only a 0.18% improvement (geomean) on -O0 compiles.

Here’s the cumulative change from before D126061 landed until after this patch (D128097) landed:

http://llvm-compile-time-tracker.com/compare.php?from=7acc88be0312c721bc082ed9934e381d297f4707&to=0d300da799b06931eb6b974198d683548a8c8392&stat=instructions

This confirms that we’re still seeing a 0.22% regression (geomean) on -O0 compiles.

I’ll do some more investigation to see what else is causing the slowdown. I’ll also try to run the whole test suite locally, though I had some trouble with this the first time I tried it.

Disclaimer: I'm not a front-end guy.

After running the 7zip benchmark under perf and comparing the origin (7acc88be031) with the fix (0d300da799b0), the following diff can be seen:

Instructions change | Symbol
+0.12% | clang::Preprocessor::getMacroDefinition
+0.07% | GetDeclSpecTypeForDeclarator
+0.04% | clang::Parser::ParseDirectDeclarator
+0.03% | clang::Sema::ActOnFunctionDeclarator
+0.03% | clang::Sema::ProcessDeclAttributes

In commit 8c7b64b5ae2a, these two code changes caught my eye as (maybe) relevant to the perf diff above:

  1. The change in Parser::ParseDeclGroup. That function looks quite heavy, and it's a caller of ParseDirectDeclarator/ActOnFunctionDeclarator (and even getMacroDefinition?).
  2. The change in GetDeclSpecTypeForDeclarator. After 8c7b64b5ae2a, distributeTypeAttrsFromDeclarator is called unconditionally, which may matter.

Hope it helps a bit in finding a proper solution for the slowdown.

Disclaimer: I'm not a front-end guy.

After running the 7zip benchmark under perf and comparing the origin (7acc88be031) with the fix (0d300da799b0), the following diff can be seen:

Thanks a lot for helping me out here!

Instructions change | Symbol
+0.12% | clang::Preprocessor::getMacroDefinition

Hm, this surprises me. AFAICT, all of the relevant places that this gets called from are in the lexer… and I don’t think my patch should have affected what the parser does with the lexer. I almost suspect that this regression is due to some other change that happened between 7acc88be031 and 0d300da799b0?

+0.07% | GetDeclSpecTypeForDeclarator

I think you may potentially have found the reason for this – see below.

+0.04% | clang::Parser::ParseDirectDeclarator
+0.03% | clang::Sema::ActOnFunctionDeclarator
+0.03% | clang::Sema::ProcessDeclAttributes

An increase here is expected, as my patch added new code to ProcessDeclAttributes() to handle the “sliding” logic; this increase seems hard to avoid.

In commit 8c7b64b5ae2a, these two code changes caught my eye as (maybe) relevant to the perf diff above:

  1. The change in Parser::ParseDeclGroup. That function looks quite heavy, and it's a caller of ParseDirectDeclarator/ActOnFunctionDeclarator (and even getMacroDefinition?).

Here’s what changed in that function:

-  ParsingDeclarator D(*this, DS, Context);
+  // Consume all of the attributes from `Attrs` by moving them to our own local
+  // list. This ensures that we will not attempt to interpret them as statement
+  // attributes higher up the callchain.
+  ParsedAttributes LocalAttrs(AttrFactory);
+  LocalAttrs.takeAllFrom(Attrs);
+  ParsingDeclarator D(*this, DS, LocalAttrs, Context);

(Attrs is a new parameter of ParseDeclGroup().)

I’m not sure that this change per se had a significant impact on compile time though, as ParseDeclGroup() itself doesn’t show up in the diff above. The change also shouldn’t really affect how much time we spend in ParseDirectDeclarator() or ActOnFunctionDeclarator().

I’m inclined to say that the extra time spent in ParseDirectDeclarator() and ActOnFunctionDeclarator() is due to some other change that happened in the meantime, even though I realize that this sounds like a cheap excuse…
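The intent of that change can be sketched as follows. These are hypothetical stand-in types (only the takeAllFrom name comes from the quoted diff); the point is that the caller's attribute list is drained into a function-local one, so nothing further up the call chain can see, and misinterpret, those attributes:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for ParsedAttributes.
struct ParsedAttrs {
  std::vector<int> Attrs;

  // Move every attribute out of Other, leaving it empty -- mirroring the
  // semantics of ParsedAttributes::takeAllFrom in the quoted diff.
  void takeAllFrom(ParsedAttrs& Other) {
    Attrs.insert(Attrs.end(), Other.Attrs.begin(), Other.Attrs.end());
    Other.Attrs.clear();
  }

  bool empty() const { return Attrs.empty(); }
};

void parseDeclGroup(ParsedAttrs& Attrs) {
  // Consume the caller's attributes into a local list so that callers
  // higher up cannot later reinterpret them as statement attributes.
  ParsedAttrs LocalAttrs;
  LocalAttrs.takeAllFrom(Attrs);
  // ... LocalAttrs would now be handed to the declarator ...
}
```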

  2. The change in GetDeclSpecTypeForDeclarator. After 8c7b64b5ae2a, distributeTypeAttrsFromDeclarator is called unconditionally, which may matter.

Thanks for pointing this out to me – this could be the explanation for the 0.07% slowdown that you saw in GetDeclSpecTypeForDeclarator(), as this is the only change in that function. Of course, the other possibility is that GetDeclSpecTypeForDeclarator() is now getting called more often, but I don’t immediately see why that should be the case.

I had removed the if statement because it’s superfluous from a behavioral point of view, but it may well be important for performance. I’ve prepared https://reviews.llvm.org/D128439 to re-add that statement.
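The guard that D128439 re-adds can be sketched like this (all names are hypothetical stand-ins for the real Clang code; the point is that an early check skips a non-trivial call entirely on the common attribute-free path, rather than relying on the callee to do nothing):

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the real Clang types.
struct Attr { int Kind; };
struct Declarator {
  std::vector<Attr> Attrs;
  bool hasAttributes() const { return !Attrs.empty(); }
};

int DistributeCalls = 0; // instrumentation for this sketch only

// Stand-in for distributeTypeAttrsFromDeclarator: behaviorally a no-op when
// there are no attributes, but still a non-trivial call on a hot path.
void distributeTypeAttrs(Declarator& D) {
  ++DistributeCalls;
  // ... walk D.Attrs and move type attributes to where they apply ...
}

void getDeclSpecType(Declarator& D) {
  // The restored guard: in the common case of a declarator with no
  // attributes, don't pay for the call at all.
  if (D.hasAttributes())
    distributeTypeAttrs(D);
  // ... rest of the type computation ...
}
```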

I unfortunately haven’t yet been able to get the test suite running. Since you do have a running setup, do you think you could test the compile time impact of that patch on 7zip? Please do this only if it’s not a big deal – otherwise, I’ll just submit the patch for review, as it seems that it should have at least a small positive influence on performance.

Again, thanks for your input here!

To add to the points that @yurai007 pointed out, I’ve done some more investigation of my own.

Compile-time regression does not seem to reproduce locally

Unfortunately, I still haven’t been able to get the test suite running locally. However, I’ve looked at some individual lencod input files which I’m able to compile and which still show a slowdown on http://llvm-compile-time-tracker.com/ even after https://reviews.llvm.org/D128097.

What is confusing to me is that I am not able to reproduce the slowdown. For comparison, here are the instruction counts from http://llvm-compile-time-tracker.com/ and the counts I am measuring locally.

At 7acc88be0312c721bc082ed9934e381d297f4707 (the commit directly before https://reviews.llvm.org/D126061):

Input file | Tracker | Own measurement
lencod/mb_access.c | 139M | 162,006,571
lencod/md_low.c | 227M | 264,179,977

At 0d300da799b06931eb6b974198d683548a8c8392 (the commit that landed https://reviews.llvm.org/D128097):

Input file | Tracker | Own measurement
lencod/mb_access.c | 141M (+0.88%) | 161,322,974 (-0.42%)
lencod/md_low.c | 228M (+0.57%) | 263,455,713 (-0.27%)

Note that not only is the direction of the change different, but so is the absolute magnitude of the instruction counts.

I built Clang with -DCMAKE_BUILD_TYPE=Release and ran Clang with -O0 -g -c -o /dev/null. Maybe http://llvm-compile-time-tracker.com/ is doing something slightly different here? However, I suspect the more important factor for the direction of the change may be my local system headers, which are likely different to what http://llvm-compile-time-tracker.com/ is using. At any rate, this is frustrating because I’m not able to reproduce the issue. If it comes to it, would it be possible to obtain a preprocessed version of the input files (i.e. run with -E)?

Overall, compile times have improved in terms of cycle count (though not instruction count)

All of the above may be a moot point because I’ve noticed that the picture looks significantly different when looking at instructions versus cycle counts. The numbers I quoted in an earlier comment are for instruction counts (which is also the metric that @nikic used when originally raising this issue on https://reviews.llvm.org/D126061). Here are the same comparisons I showed earlier, with numbers taken from http://llvm-compile-time-tracker.com, but looking at cycle counts in addition to instruction counts (geomean for -O0 compiles in each case, titles link to the compile time tracker):

Compile-time regression caused by D126061
Instructions: +0.35%
Cycles: +0.30%

Improvement due to this patch (D128097)
Instructions: -0.18%
Cycles: -1.54%

Cumulative change from before D126061 landed until after this patch (D128097) landed
Instructions: +0.22%
Cycles: -1.09%

So in terms of cycle count, which is arguably the more important metric, the overall effect of the two changes is an improvement of over 1% in compile times. It would of course be great if we could reduce the regression in instruction counts too, but given that we have a marked improvement in cycle count, I think that may already be an acceptable result?

@nikic, I would appreciate your input here.

nikic added a comment.Jun 23 2022, 8:17 AM

@mboehme Cycle counts are very noisy, and it's pretty much impossible to determine whether they changed by looking at a single commit, unless the differences are huge (like 10%). In this case, the commit got "lucky" and the next commit goes back to the previous level. This is why I only look at instruction counts.

And yes, there may be environment differences, either due to system headers, or due to the used host compiler (which is GCC 9.3 for llvm-compile-time-tracker).

As this patch did mitigate half the regression, and the regression wasn't particularly large in the first place, I personally wouldn't bother spending too much time to investigate this further.

@mboehme Cycle counts are very noisy, and it's pretty much impossible to determine whether they changed by looking at a single commit, unless the differences are huge (like 10%). In this case, the commit got "lucky" and the next commit goes back to the previous level. This is why I only look at instruction counts.

Thanks for the clarification!

And yes, there may be environment differences, either due to system headers, or due to the used host compiler (which is GCC 9.3 for llvm-compile-time-tracker).

As this patch did mitigate half the regression, and the regression wasn't particularly large in the first place, I personally wouldn't bother spending too much time to investigate this further.

In that case, I think I'll be content with the performance gains I've achieved here.