This is an archive of the discontinued LLVM Phabricator instance.

[Lexer] Don't merge macro args from different macro files
ClosedPublic

Authored by vsk on May 18 2016, 4:41 PM.

Details

Summary

The lexer sets the end location of macro arguments incorrectly *if*,
while merging consecutive args to fit into a single SLocEntry, it finds
args which come from different macro files.

Fix the issue by using separate SLocEntries in this situation.

This fixes a code coverage crasher (rdar://problem/26181005). Because
the lexer reported end locations for certain macro args incorrectly, we
would generate bogus coverage mappings with negative line offsets.

Diff Detail

Event Timeline

vsk updated this revision to Diff 57710.May 18 2016, 4:41 PM
vsk retitled this revision from to [Lexer] Don't merge macro args from different macro files.
vsk updated this object.
vsk added reviewers: akyrtzi, doug.gregor.
vsk added a subscriber: cfe-commits.
vsk updated this revision to Diff 57711.May 18 2016, 4:47 PM
  • Add some comments to the unit test.
vsk updated this revision to Diff 57810.May 19 2016, 9:17 AM
  • Fix explanation of the test case in test/CoverageMapping.
vsk added a comment.May 19 2016, 1:47 PM

I discussed this bug with Argyrios off-list, who lgtm'd on the condition that it doesn't introduce a performance regression. He suggested preprocessing Cocoa.h to stress the patch. After running a stabilization script, I used this command to stress RelNoAsserts builds of clang both with and without this patch.

for I in $(seq 1 100); do time $CC -F /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/System/Library/Frameworks -E Cocoa.h -o /dev/null; done

The results are basically in the noise (link to raw data: https://ghostbin.com/paste/r6cyh):

Compiler            Unpatched TOT  Patched TOT
Avg. wall time (s)  0.21709        0.21608
Std. deviation      0.02101        0.02219

I also made sure that the preprocessed sources emitted by the two compilers are the same.

This revision was automatically updated to reflect the committed changes.

I know this was sped up slightly in 3339c568c43e4644f02289e5edfc78c860f19c9f, but this change makes updateConsecutiveMacroArgTokens the hottest function in clang in a bottom-up profile of an entire build of the Linux kernel. It thrashes the one-entry LastFileIDLookup cache, and we end up looking up the same FileID again and again and again for each token when we expand nested function-like macros.

Is there anything we can do to speed this up? Is there any way to record which FileID corresponds to a given Token so that we don't have to keep rematerializing that? Is it possible to find whether two SourceLocations correspond to the same FileID without having to figure out which FileID in particular each belongs to?

I discussed this bug with Argyrios off-list, who lgtm'd on the condition that it doesn't introduce a performance regression.

Well, I'd say it's a performance regression, though perhaps reported 5 years too late.

vsk added a comment.May 20 2021, 10:19 AM

I know this was sped up slightly in 3339c568c43e4644f02289e5edfc78c860f19c9f, but this change makes updateConsecutiveMacroArgTokens the hottest function in clang in a bottom-up profile of an entire build of the Linux kernel. It thrashes the one-entry LastFileIDLookup cache, and we end up looking up the same FileID again and again and again for each token when we expand nested function-like macros.

Is there anything we can do to speed this up? Is there any way to record which FileID corresponds to a given Token so that we don't have to keep rematerializing that? Is it possible to find whether two SourceLocations correspond to the same FileID without having to figure out which FileID in particular each belongs to?

Perhaps you could try:

  • using SourceManager::isInFileID(NextLoc, getFileID(CurLoc), ...) (to halve the number of necessary getFileID lookups), or
  • using a 2-element cache in getFileID?

I discussed this bug with Argyrios off-list, who lgtm'd on the condition that it doesn't introduce a performance regression.

Well, I'd say it's a performance regression, though perhaps reported 5 years too late.

If the performance issue manifests on Linux kernel sources from 5 years ago, then sure, I'd agree :).

sammccall added subscribers: hokein, sammccall.EditedSep 29 2022, 4:21 AM

I know this was sped up slightly in 3339c568c43e4644f02289e5edfc78c860f19c9f, but this change makes updateConsecutiveMacroArgTokens the hottest function in clang in a bottom-up profile of an entire build of the Linux kernel. It thrashes the one-entry LastFileIDLookup cache, and we end up looking up the same FileID again and again and again for each token when we expand nested function-like macros.

Is there anything we can do to speed this up?

@hokein and I spent some time looking at this (initially trying to understand behavior, now performance).

Short version is:

  • we can *simplify* the code a lot; we think it's now just partitioning based on FileID, and this can be done more clearly. This may yield some speedups at the margin.
  • in terms of performance: I suspect when clang is built by GCC it's doing roughly 2x as much work as when it's built by clang. @nickdesaulniers can you tell me which you're measuring/deploying? To give some idea if we're likely to actually help much...

Behavior: partitioning by file IDs

I think we're back to the original (<2011) behavior of just partitioning by file IDs:

  • the original patch clearly intended this to merge tokens across file IDs, and the comments still claim this
  • then a bugfix banned merging file+macro or macro+file
  • then this patch banned merging macro+macro
  • meanwhile, there's no code disallowing file+file, but I don't think it's actually possible to achieve: you can't have an #include or an eof inside a macro arg, and I don't know how else to switch between files.

Performance (good case)

The current obfuscated version *is* faster than the pre-2011 version because we avoid getFileID() when testing the file+macro, macro+file, and *some* macro+macro cases (when the locations happen to be >50 apart).
When we see a run of N consecutive nearby macro tokens, we do 2*(N-1) getFileID() calls.

We can reduce the number of getFileID() calls by caching FileID bounds (the expensive part is looking up the SLocEntry - given that we can hit-test a SourceLocation against it with simple arithmetic).

However, getFileID() has a one-element SLocEntry cache, so this may only be a marginal improvement.

// Tokens A1 A2 A3 B1 B2

isWrittenInSameFile(A1, A2);
  getFileID(A1); // miss
  getFileID(A2);
isWrittenInSameFile(A2, A3);
  getFileID(A2);
  getFileID(A3);
isWrittenInSameFile(A3, B1);
  getFileID(A3);
  getFileID(B1); // miss
isWrittenInSameFile(B1, B2);
  getFileID(B1);
  getFileID(B2);

All the getFileID() calls we could avoid are the cached ones. It's probably still a win (the cache lookup logic is kinda messy), but probably not huge.

Performance (bad case)

However, the implementation of isWrittenInSameFile is getFileID(X) == getFileID(Y), and it's unspecified which gets evaluated first. GCC generally evaluates right-to-left (https://godbolt.org/z/M4bs74Tbj), producing substantially more misses:

isWrittenInSameFile(A1, A2);
  getFileID(A2); // miss
  getFileID(A1);
isWrittenInSameFile(A2, A3);
  getFileID(A3);
  getFileID(A2);
isWrittenInSameFile(A3, B1);
  getFileID(B1); // miss
  getFileID(A3); // miss
isWrittenInSameFile(B1, B2);
  getFileID(B2); // miss
  getFileID(B1);

So I'd expect we can improve GCC-built-clang's performance by maybe 2x by caching externally.
(Or by having isWrittenInSameFile try to cleverly evaluate whichever arg matches the cache first, but I have no idea whether that will work well across clang)


@hokein or I will try to find time to take a stab at this.

Herald added a project: Restricted Project.Sep 29 2022, 4:21 AM

I know this was sped up slightly in 3339c568c43e4644f02289e5edfc78c860f19c9f, but this change makes updateConsecutiveMacroArgTokens the hottest function in clang in a bottom-up profile of an entire build of the Linux kernel. It thrashes the one-entry LastFileIDLookup cache, and we end up looking up the same FileID again and again and again for each token when we expand nested function-like macros.

Is there anything we can do to speed this up?

@hokein and I spent some time looking at this (initially trying to understand behavior, now performance).

Wonderful! I'm glad to see you also identified this change in particular. Thanks as well for looking into this code.

Short version is:

  • we can *simplify* the code a lot; we think it's now just partitioning based on FileID, and this can be done more clearly. This may yield some speedups at the margin.
  • in terms of performance: I suspect when clang is built by GCC it's doing roughly 2x as much work as when it's built by clang. @nickdesaulniers can you tell me which you're measuring/deploying? To give some idea if we're likely to actually help much...

In all of my personal measurements, I've been bootstrapping clang with AOSP's clang, and updateConsecutiveMacroArgTokens is still the worst method in profiles.

But I wouldn't be surprised if other Linux distros like RHEL bootstrap their clang distribution via GCC. @tstellar or @serge-sans-paille or @nikic might know. We did get a curious comment from a kernel developer recently claiming that clang was "twice as slow as GCC", which didn't make much sense; not sure if it was an exaggeration vs. a precise measurement, but I wouldn't be surprised if the evaluation order you identified plays into this, making the worst method even slower. I'll try to find a link to the thread...

My summer intern @justinstitt looked into this case again briefly. We found that minor improvements to SourceManager::isWrittenInSameFile to avoid a few more calls to getFileID were irrelevant in terms of performance improvement. But we also didn't consider GCC-built clang; we were using clang to bootstrap clang.


Meanwhile, besides evaluating the high-level logic in TokenLexer and how it might be improved, I think there's potentially an opportunity for an "AoS vs. SoA" speedup in SourceManager. SourceManager::LoadedSLocEntryTable is a llvm::SmallVector<SrcMgr::SLocEntry>. SourceManager::getFileIDLoaded only really cares about the SLocEntry's Offset. I suspect we could get major memory locality wins by packing those into a standalone vector so that we could search them faster.

@hokein or I will try to find time to take a stab at this.

Awesome, please keep me in the loop.

Thanks Nick for the info! No kernel experience here, so if you have any particular suggestions about how to measure the workload you care about, it'd be much appreciated (e.g. are there particular files that are slow enough to measure in isolation, or is it better to do a full build?)

But I wouldn't be surprised if other Linux distros like RHEL bootstrap their clang distribution via GCC. @tstellar or @serge-sans-paille or @nikic might know. We did get a curious comment from a kernel developer recently claiming that clang was "twice as slow as GCC", which didn't make much sense; not sure if it was an exaggeration vs. a precise measurement, but I wouldn't be surprised if the evaluation order you identified plays into this, making the worst method even slower. I'll try to find a link to the thread...

I'm sure this is common enough to be worth fixing if it's a real effect, which I need to confirm. (MSVC is right-to-left too.)

Meanwhile, besides evaluating the high-level logic in TokenLexer and how it might be improved, I think there's potentially an opportunity for an "AoS vs. SoA" speedup in SourceManager. SourceManager::LoadedSLocEntryTable is a llvm::SmallVector<SrcMgr::SLocEntry>. SourceManager::getFileIDLoaded only really cares about the SLocEntry's Offset. I suspect we could get major memory locality wins by packing those into a standalone vector so that we could search them faster.

Ah, great point. SLocEntry is 24 bytes while Offset is only 4. SLocEntry is an important public API, but Offset is ~only used in SourceManager, so that refactoring might be doable. I guess we can cheaply prototype by redundantly storing offset *both* in a separate array used for search and in the SLocEntry.

@hokein or I will try to find time to take a stab at this.

Awesome, please keep me in the loop.

Will do!

But I wouldn't be surprised if other Linux distros like RHEL bootstrap their clang distribution via GCC. @tstellar or @serge-sans-paille or @nikic might know. We did get a curious comment from a kernel developer recently claiming that clang was "twice as slow as GCC", which didn't make much sense; not sure if it was an exaggeration vs. a precise measurement, but I wouldn't be surprised if the evaluation order you identified plays into this, making the worst method even slower. I'll try to find a link to the thread...

For Fedora/RHEL we used to build Clang with GCC -- for Clang 15, we switched to building with Clang. No idea what other distros do. Maybe worth mentioning that the builds on https://llvm-compile-time-tracker.com/ are done with GCC.

@hokein or I will try to find time to take a stab at this.

Awesome, please keep me in the loop.

Will do!

https://reviews.llvm.org/D134942 is my attempt.

I discussed this bug with Argyrios off-list, who lgtm'd on the condition that it doesn't introduce a performance regression.

Well, I'd say it's a performance regression, though perhaps reported 5 years too late.

This patch does increase the number of SLocEntries. At least for SemaExpr.cpp, prior to this patch it was ~298K; after this patch it is ~315K (a ~5% increase).

Thanks Nick for the info! No kernel experience here, so if you have any particular suggestions about how to measure the workload you care about, it'd be much appreciated (e.g. are there particular files that are slow enough to measure in isolation, or is it better to do a full build?)

@sammccall I wrote up instructions for how to profile a Linux kernel build with LLVM.
https://github.com/ClangBuiltLinux/profiling/tree/main/perf
A build on a 72+ threaded workstation should only take ~1 minute. Can you please give it a shot and let me know off-thread if you encounter any issues?
One thing I found about that workflow: just this week I upgraded to a Zen 2-based Threadripper workstation. It appears that Zen 2 has issues using per-thread LBR.
https://www.spinics.net/lists/linux-perf-users/msg23103.html
(There are follow-up responses that I don't see yet in the archive, but it looks like there are pending Linux kernel patches to get that working.)
https://lore.kernel.org/lkml/166155216401.401.5809694678609694438.tip-bot2@tip-bot2/
https://lore.kernel.org/lkml/20220829113347.295-1-ravi.bangoria@amd.com/
So you might want to ensure you're running those instructions on an Intel box, for now. I'm also happy to hop on a virtual call with you and @hokein anytime.

hokein added a comment.Oct 4 2022, 5:16 AM

Meanwhile, besides evaluating the high-level logic in TokenLexer and how it might be improved, I think there's potentially an opportunity for an "AoS vs. SoA" speedup in SourceManager. SourceManager::LoadedSLocEntryTable is a llvm::SmallVector<SrcMgr::SLocEntry>. SourceManager::getFileIDLoaded only really cares about the SLocEntry's Offset. I suspect we could get major memory locality wins by packing those into a standalone vector so that we could search them faster.

Ah, great point. SLocEntry is 24 bytes while Offset is only 4.

And SLocEntry spends 4 of those bytes on padding alone, which is bad :(

SLocEntry is an important public API, but Offset is ~only used in SourceManager, so that refactoring might be doable. I guess we can cheaply prototype by redundantly storing offset *both* in a separate array used for search and in the SLocEntry.

This is an interesting idea. I got a quick prototype of adding an in-parallel offset table in SourceManager:

  • clang::SourceManager::getFileIDLocal 2.45% -> 1.57% (reduce by 30%+)
  • SourceManager memory usage is increased by ~10%: SemaExpr.cpp 12.6MB -> 14.3MB

The improvement of getFileIDLocal seems promising, but the memory increase is a concern (10% is not small; maybe it is OK compared to the actual AST size).

An alternative is to restructure SLocEntry and the underlying storage in SourceManager; that would give us both performance and memory improvements, but we would need to make a significant change to SourceManager.

A build on a 72+ threaded workstation should only take ~1 minute. Can you please give it a shot and let me know off-thread if you encounter any issues?

Thanks for writing down the instructions; they're useful. It works for me (on an Intel-based workstation).

Meanwhile, besides evaluating the high-level logic in TokenLexer and how it might be improved, I think there's potentially an opportunity for an "AoS vs. SoA" speedup in SourceManager. SourceManager::LoadedSLocEntryTable is a llvm::SmallVector<SrcMgr::SLocEntry>. SourceManager::getFileIDLoaded only really cares about the SLocEntry's Offset. I suspect we could get major memory locality wins by packing those into a standalone vector so that we could search them faster.

Ah, great point. SLocEntry is 24 bytes while Offset is only 4.

And SLocEntry spends 4 of those bytes on padding alone, which is bad :(

SLocEntry is an important public API, but Offset is ~only used in SourceManager, so that refactoring might be doable. I guess we can cheaply prototype by redundantly storing offset *both* in a separate array used for search and in the SLocEntry.

Do we need to store both the whole SLocEntry and a copy of the Offset, or can we just store the Offset (or perhaps individual arrays of the pieces of an SLocEntry)? Perhaps we can lazily materialize an SLocEntry only when needed, if ever?

This is an interesting idea. I got a quick prototype of adding an in-parallel offset table in SourceManager:

  • clang::SourceManager::getFileIDLocal 2.45% -> 1.57% (reduce by 30%+)
  • SourceManager memory usage is increased by ~10%: SemaExpr.cpp 12.6MB -> 14.3MB

How did you measure the memory usage of an individual class? (I think we should move this discussion to LLVM Discourse for more visibility.)

The improvement of getFileIDLocal seems promising, but the memory increase is a concern (10% is not small; maybe it is OK compared to the actual AST size).

At this point, I'll pay it. Unless it regresses peak RSS of the compiler, I don't care.

An alternative is to restructure SLocEntry and the underlying storage in SourceManager; that would give us both performance and memory improvements, but we would need to make a significant change to SourceManager.

At this point, I think it's worth it.