This is an archive of the discontinued LLVM Phabricator instance.

[clang][modules] Account for non-affecting inputs in `ASTWriter`
ClosedPublic

Authored by jansvoboda11 on Oct 24 2022, 10:31 AM.

Details

Summary

In D106876, we stopped serializing module map files that didn't affect compilation of the current module.

However, since each SourceLocation is simply an offset into SourceManager's global buffer of concatenated input files, these offsets need to be adjusted during serialization. Otherwise, they can incorrectly point past the end of the buffer or into a subsequent input file.

This patch starts adjusting SourceLocations, FileIDs and other SourceManager offsets in ASTWriter.
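
Conceptually, the adjustment subtracts from each offset the total size of the non-affecting inputs that precede it. A minimal sketch of that idea, assuming hypothetical names (not the code from this patch; the actual writer uses a more compact representation discussed in the comments below):

#include <cstdint>
#include <vector>

// Each omitted, non-affecting input used to occupy [Begin, End) in the
// concatenated offset space; any offset past it must shrink accordingly.
struct OmittedInput {
  uint64_t Begin, End;
};

uint64_t getAdjustedOffset(uint64_t Offset,
                           const std::vector<OmittedInput> &Omitted) {
  uint64_t Adjustment = 0;
  for (const OmittedInput &I : Omitted)
    if (I.End <= Offset) // the omitted input lies entirely before this offset
      Adjustment += I.End - I.Begin;
  return Offset - Adjustment;
}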

Diff Detail

Event Timeline

jansvoboda11 created this revision. Oct 24 2022, 10:31 AM
Herald added a project: Restricted Project. Oct 24 2022, 10:31 AM
jansvoboda11 requested review of this revision. Oct 24 2022, 10:31 AM
Herald added a subscriber: cfe-commits.

This looks reasonable. Have you measured the performance impact of this change?

I have done a comparison between this patch and https://github.com/apple/llvm-project/pull/5451 (which, instead of leaving non-affecting input files out, serializes one extra bit saying whether each input is affecting or not). Both versions carried some additional patches that refine the computation of non-affecting module maps, which I plan to upstream shortly. I used Clang's -ftime-trace on a reasonably large project (1680 TUs, 638 implicitly-built modules). The results say that serialization got 7.09% slower and deserialization 1.01% slower. The serialization slowdown I understand, but I expected deserialization to get faster, since we now have less of the SourceManager to look through. Overall, PCM compilation got 4.35% slower.

I tried optimizing this patch a bit. Instead of creating a compact data structure and using binary search to find the preceding non-affecting file, I now store the adjustment information for each FileID in a vector. During deserialization, the FileID is simply used as an index into SLocEntryInfos. That didn't yield any measurable improvement in performance, though. I think the regression must be coming from the SourceLocation/Offset to FileID translation.

I don't see any obvious way to work around that. SourceManager::getFileIDLocal() already implements some optimizations that make accessing nearby offsets fast. A separate SourceManager would avoid this bottleneck, but I'm not sure how much work that would entail (seems substantial).
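
As a rough model of this per-FileID variant (hypothetical names, not the actual code): the writer precomputes one adjustment per input file, but every query still has to translate a raw offset to its file index first, which is the step suspected of causing the regression.

#include <cstdint>
#include <vector>

struct FileEntryInfo {
  uint64_t StartOffset; // where this file's offsets begin
  uint64_t Adjustment;  // bytes of preceding non-affecting input to subtract
};

// Stand-in for the SourceLocation/Offset -> FileID translation; conceptually
// what SourceManager::getFileIDLocal() does with a binary search.
// Assumes Files is non-empty, sorted by StartOffset, and covers Offset.
size_t getFileIndex(uint64_t Offset, const std::vector<FileEntryInfo> &Files) {
  size_t Lo = 0, Hi = Files.size();
  while (Hi - Lo > 1) {
    size_t Mid = Lo + (Hi - Lo) / 2;
    if (Files[Mid].StartOffset <= Offset)
      Lo = Mid;
    else
      Hi = Mid;
  }
  return Lo;
}

uint64_t getAdjustedOffset(uint64_t Offset,
                           const std::vector<FileEntryInfo> &Files) {
  return Offset - Files[getFileIndex(Offset, Files)].Adjustment;
}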

@Bigcheese LMK if you're fine with the performance implications here.

New version with flat vector + FileID indices, replacing the previous compact representation & binary search approach.

I don't think you need to call getFileID() inside getAdjustment(). There is also an opportunity for a peephole, if getAdjustment() remains expensive.

I've left some comments on the previous version of the patch since it's not obvious to me how to avoid the getFileID() call in the new version.

The serialization slowdown I understand, but I expected deserialization to get faster, since we now have less of SourceManager to look through.

Seems worth digging into the deserialization regression. Does the PCM actually get smaller and the ranges more condensed?

One quick test would be to manufacture a situation where two output PCMs would previously have different non-affecting inputs, but now should be bit-for-bit identical. Are they, in fact, bit-for-bit identical? If not, maybe there's something funny to look into...

clang/include/clang/Serialization/ASTWriter.h
449–452

Can you collect a histogram for how big these vectors are? Can we avoid pointer chasing in the common case by making them SmallVector of some size during lookup?

clang/lib/Serialization/ASTWriter.cpp
2049–2050

Can we shift this getAdjustedOffset() computation to after deciding whether to skip the record?

5293–5294

How often does getAdjustment() return the same answer in consecutive calls? If at all common, this would likely benefit from a peephole:

// New ASTWriter members (sketch):
Optional<SLocRange> CachedAdjustmentRange;
Optional<SourceLocation::UIntTy> CachedAdjustment;

SourceLocation::UIntTy
ASTWriter::getAdjustment(SourceLocation::UIntTy Offset) const {
  // Loaded offsets and the no-non-affecting-inputs case need no adjustment.
  //
  // How fast is isLoadedOffset()? Can/should we add a peephole, or is it just bit
  // manipulation? (I seem to remember it checking the high bit or something, but if
  // it's doing some sort of lookup, maybe it should be in the slow path so it can
  // get cached by CachedAdjustment.)
  if (PP->getSourceManager().isLoadedOffset(Offset) ||
      NonAffectingInputs.empty())
    return 0;

  // Check CachedAdjustment.
  if (CachedAdjustment && CachedAdjustmentRange->includes(Offset))
    return *CachedAdjustment;

  // Call getAdjustmentSlow(), which updates CachedAdjustment and
  // CachedAdjustmentRange. It's out-of-line so that getAdjustment() can easily
  // be inlined without inlining the slow path.
  //
  // CachedAdjustmentRange would cover the "gap" between this adjustment level
  // and the next one (its end would be UINTMAX if the offset is past the last
  // non-affecting range).
  return getAdjustmentSlow(Offset);
}
5300–5301

Why do you need to call getFileID() here?

Instead, I would expect this to be a search through a range of offsets (e.g., see my suggestion at https://reviews.llvm.org/D106876#3869247 -- DroppedMMs contains SourceLocations, not FileIDs).

Two benefits:

  1. You don't need to call getFileID() to look up an offset.
  2. You can merge adjacent non-affecting files (shrinking the search/storage significantly).
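
A rough sketch of what such an offset-range search could look like, with adjacent non-affecting inputs already merged and each entry carrying a cumulative size (hypothetical names, not the code from this patch):

#include <algorithm>
#include <cstdint>
#include <vector>

struct MergedRange {
  uint64_t End;            // end offset of this merged non-affecting range
  uint64_t CumulativeSize; // omitted bytes in this range and all earlier ones
};

// Assumes Ranges is sorted by End and Offset does not fall inside an omitted
// range (locations in non-affecting files shouldn't be written at all).
uint64_t getAdjustment(uint64_t Offset,
                       const std::vector<MergedRange> &Ranges) {
  // First range whose end lies beyond Offset; every range before it is
  // omitted space that precedes the offset.
  auto It = std::upper_bound(
      Ranges.begin(), Ranges.end(), Offset,
      [](uint64_t O, const MergedRange &R) { return O < R.End; });
  return It == Ranges.begin() ? 0 : std::prev(It)->CumulativeSize;
}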
clang/test/Modules/non-affecting-module-maps-source-locations.m
32 ↗(On Diff #470211)

This is exercising the code, but it could do one better and check if the output PCMs are bit-for-bit identical when we (now) expect them to be.

Maybe you could do this by having two run lines: one that includes -I %t/second and another that doesn't. Then check if the output PCMs are equal.

dexonsmith added inline comments. Oct 27 2022, 8:07 AM
clang/test/Modules/non-affecting-module-maps-source-locations.m
32 ↗(On Diff #470211)

(Or if the PCM isn't bit-for-bit identical yet, maybe at least the AST block should be...)

Avoid looking up FileID for an offset.

I tried implementing your suggestion (merging ranges of adjacent non-affecting files and avoiding FileID lookups), but the numbers from -ftime-trace are very noisy. I got more stable data by measuring clock cycles and instruction counts, but nothing conclusive yet.

Compilation of CompilerInvocation.cpp with implicit modules.

  • previous approach with vector + FileID lookup: +0.64% cycles and +1.68% instructions,
  • current approach with merged SourceRanges: +0.38% cycles and +1.11% instructions.

I'll post here as I experiment more and get more data.

clang/lib/Serialization/ASTWriter.cpp
5300–5301

My reasoning was that if we search through a range of offsets, we're doing conceptually the same thing as getFileID() (which already has some optimizations baked in). Maybe the non-affecting files are indeed adjacent and we'll be able to merge most of them. I'll give it a shot and report back.

Nice; that seems like a bit of an improvement.

I'm curious; are system modules allowed to be non-affecting yet, or are they still assumed to be affecting? (It's the system modules that I think are most likely to be adjacent.)

My intuition is that there is likely some peephole that would be quite effective, that might not be useful for general getFileID() lookups.

  • I already suggested "same as last lookup?"... I'm curious if that'll help. Maybe that's already in getFileID(), but now that you've factored out that call, it could be useful to replicate.
  • You could also try: "past the the last non-affecting module?"
  • You could also try: "before the first non-affecting module?"

I suspect you could collect some data to guide this, such as, for loaded locations (you could ignore "local" locations since they already have a peephole):

  • Histogram of "loaded" vs. "between" vs. "after" non-affecting modules.
  • Histogram of "same as last" vs. "same as last-1" vs. "different from last 2".
  • [...]

Other things that might be useful to know:

  • What effect is the merging having (or would it have)? (i.e., what's the histogram of "adjacent" non-affecting files? (e.g.: 9 ranges of non-affecting files, with two blocks of 5 files and seven blocks of 1 (which aren't adjacent to any others)))
  • Is there a change in cycles/instructions when the module cache is hot? (presumably the common case)
  • Are the PCM artifacts smaller?
  • Are the PCMs bit-for-bit identical now when a non-affecting module is added to the input? (If not, why not?)
  • What's the data for implicitly-discovered, explicitly-built modules?

I'm curious; are system modules allowed to be non-affecting yet, or are they still assumed to be affecting? (It's the system modules that I think are most likely to be adjacent.)

Yes, all the measurements are done with system modules allowed to be non-affecting.

My intuition is that there is likely some peephole that would be quite effective, that might not be useful for general getFileID() lookups.

  • I already suggested "same as last lookup?"... I'm curious if that'll help. Maybe that's already in getFileID(), but now that you've factored out that call, it could be useful to replicate.

I tried that and we seemingly never hit that case. I think that's because these two optimizations take precedence:

  • You could also try: "past the the last non-affecting module?"
  • You could also try: "before the first non-affecting module?"

These avoid the binary search for the vast majority of calls to getAdjustment() with local offsets.

I suspect you could collect some data to guide this, such as, for loaded locations (you could ignore "local" locations since they already have a peephole):

  • Histogram of "loaded" vs. "between" vs. "after" non-affecting modules.
  • Histogram of "same as last" vs. "same as last-1" vs. "different from last 2".
  • [...]

Loaded locations make up 68% of all calls to getAdjustment(), so it's good it's the first thing we check for. Around 4% of all calls are for locations before the first non-affecting module. That's the second special case. Around 28% are pointing after the last non-affecting module, and the current revision has a special case for that as well. I think it would make sense to prioritize this check. Only the remaining 0.3% of calls do the actual binary search.
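
A sketch of the check order this suggests, reusing the style of the earlier peephole sketch (NonAffectingRanges, TotalNonAffectingSize, and getAdjustmentSlow() are assumed names, not the actual members):

SourceLocation::UIntTy
ASTWriter::getAdjustment(SourceLocation::UIntTy Offset) const {
  // Loaded offsets live above all local offsets (they grow down from 2^31),
  // so this check must come before the past-the-last-range check.
  if (NonAffectingRanges.empty() ||
      PP->getSourceManager().isLoadedOffset(Offset))
    return 0;                                    // ~68% of calls
  if (Offset >= NonAffectingRanges.back().End)   // ~28%: past the last range
    return TotalNonAffectingSize;
  if (Offset < NonAffectingRanges.front().Begin) // ~4%: before the first range
    return 0;
  return getAdjustmentSlow(Offset);              // ~0.3%: binary search
}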

Other things that might be useful to know:

  • What effect is the merging having (or would it have)? (i.e., what's the histogram of "adjacent" non-affecting files? (e.g.: 9 ranges of non-affecting files, with two blocks of 5 files and seven blocks of 1 (which aren't adjacent to any others)))

Big one. We usually get between 70-100 non-affecting module maps, but they merge into 4-6 consecutive regions.
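
As a toy illustration of that merging (hypothetical names): inputs that sit back-to-back in the offset space collapse into a single range, which is how 70-100 non-affecting module maps can shrink to a handful of search entries.

#include <cstdint>
#include <utility>
#include <vector>

using OffsetRange = std::pair<uint64_t, uint64_t>; // [begin, end)

// Assumes the input ranges are sorted by begin offset.
std::vector<OffsetRange> mergeAdjacent(const std::vector<OffsetRange> &Sorted) {
  std::vector<OffsetRange> Merged;
  for (const OffsetRange &R : Sorted) {
    if (!Merged.empty() && Merged.back().second == R.first)
      Merged.back().second = R.second; // extend the previous range
    else
      Merged.push_back(R);
  }
  return Merged;
}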

  • Is there a change in cycles/instructions when the module cache is hot? (presumably the common case)

I didn't notice this (but didn't look for it specifically). How could that affect performance for PCM writes?

  • Are the PCM artifacts smaller?

Yes, we leave out the 70-100 SLocEntries. Nothing much changes otherwise.

  • Are the PCMs bit-for-bit identical now when a non-affecting module is added to the input? (If not, why not?)

They are not in the current revision, but I'll create a patch that makes them so (we also need to adjust the NumCreatedFileIDs).

  • What's the data for implicitly-discovered, explicitly-built modules?

I didn't measure that. I don't expect much of a difference, though. The scanner has an empty AST, so the number of SourceLocations will be pretty low. (Other parts of the PCM file, like the metadata, don't write many SourceManager offsets.) Even if there were a small regression, I'd be fine with it, since we'll be moving off of implicit builds in the scanner. For the explicit builds themselves, this shouldn't affect performance at all: implicit module map search is disabled, and all necessary/affecting module maps are deserialized from the PCM files or provided on the command line.

jansvoboda11 marked 5 inline comments as done. Nov 1 2022, 1:17 PM
jansvoboda11 added inline comments.
clang/include/clang/Serialization/ASTWriter.h
449–452

Usually 4-6 elements. Making them a SmallVector<T, 8> didn't affect performance, though.

clang/lib/Serialization/ASTWriter.cpp
5293–5294

Not that often, see my top-level comment.

5300–5301

This ended up being faster due to merging of non-affecting files. Thanks for the suggestion!

clang/test/Modules/non-affecting-module-maps-source-locations.m
32 ↗(On Diff #470211)

Yes, I'll probably drop this test entirely and just check the PCM files are bit-for-bit identical when a non-affecting file is not loaded at all.

  • Is there a change in cycles/instructions when the module cache is hot? (presumably the common case)

I didn't notice this (but didn't look for it specifically). How could that affect performance for PCM writes?

It wouldn't check write perf, but it's part of the overall build perf impact. This being neutral (no regression) could help to justify landing the change even when there's a penalty for a cold module cache. My interest in (implicitly-discovered) explicitly-built modules was similar (if that's neutral, then a regression here is less critical).

Partly, trying to dig into why read speeds got slower. But maybe that was noise that went away when you switched to cycles/instructions?

Loaded locations make up 68% of all calls to getAdjustment(), so it's good it's the first thing we check for. Around 4% of all calls are for locations before the first non-affecting module. That's the second special case. Around 28% are pointing after the last non-affecting module, and the current revision has a special case for that as well. I think it would make sense to prioritize this check. Only the remaining 0.3% of calls do the actual binary search.

Great; looking forward to seeing new numbers.

(BTW, if you check before/after last non-affecting module, does one of those subsume the is-loaded check entirely? Looking at isLoadedOffset() makes me think it might. If so, maybe you can replace that check with an assertion.)

  • Are the PCMs bit-for-bit identical now when a non-affecting module is added to the input? (If not, why not?)

They are not in the current revision, but I'll create a patch that makes them so (we also need to adjust the NumCreatedFileIDs).

Great. Seems like another motivation to land this change (despite a regression), as it can impact meta-build perf if/when artifacts are being archived/shared in a larger context. Specifically, if the PCM artifacts are more stable, then:

  • They take up less aggregate space in CAS storage.
  • Later build steps get more cache hits.

Overall this patch seems like a good step to me; I don't think 1-2% for a single compilation on a cold cache is that bad, assuming a hot cache doesn't regress:

  • Across a single build, the cache gets hot pretty quickly as most compilations reuse the same modules.
  • Across incremental builds, most of the cache (at least system modules) will (usually) be hot.
  • For explicit builds, it sounds like there's no regression anyway.
clang/include/clang/Basic/SourceManager.h
1831–1833 ↗(On Diff #471641)

The logic for isLoadedOffset() suggests that it could maybe be subsumed by the "location past the end" check?

clang/test/Modules/non-affecting-module-maps-source-locations.m
32 ↗(On Diff #470211)

That sounds great.

jansvoboda11 marked 4 inline comments as done. Nov 1 2022, 4:35 PM

Partly, trying to dig into why read speeds got slower. But maybe that was noise that went away when you switched to cycles/instructions?
Great; looking forward to seeing new numbers.

Ah, I forgot to mention this. Building the modules is now only 0.2% slower and importing them 1.2% faster (compared to PCMs with all input files serialized).

clang/include/clang/Basic/SourceManager.h
1831–1833 ↗(On Diff #471641)

I don't think so - we don't want to adjust loaded offsets. Their invariant is that they grow from 2^31 downwards.

We do want to adjust local offsets past the last non-affecting file though.

Ah, I forgot to mention this. Building the modules is now only 0.2% slower and importing them 1.2% faster (compared to PCMs with all input files serialized).

Awesome. All upside then :).

I added a few nitpick-y comments inline. I can take another look once you've made the PCMs bit-for-bit identical and updated the test (is that happening here, or in a separate review)?

clang/include/clang/Basic/SourceManager.h
1831–1833 ↗(On Diff #471641)

Oops, right!

clang/lib/Serialization/ASTWriter.cpp
4535

Not sure this comment adds much on top of the code on the next line. A sentence before the for loop describing the overall approach might be useful though.

4547–4551

You can reduce nesting by inverting this condition and using continue.

4559–4560

I'd slightly prefer the comment *before* the if, due to how folding tends to work in editors (you can see the comment even when the code is folded). This probably relies on dropping the else (see below).

4565

I suggest a continue before the else to avoid adding nesting for the insertion.
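
A generic before/after illustration of the continue-based nesting reduction suggested in these comments (toy code, not the actual ASTWriter loop):

#include <vector>

struct InputFile {
  bool Affecting;
};

// Nested form: the long body sits inside an if/else.
void writeNested(const std::vector<InputFile> &Files) {
  for (const InputFile &F : Files) {
    if (F.Affecting) {
      // ... long body writing the record ...
    } else {
      // ... record the non-affecting file ...
    }
  }
}

// Inverted condition + continue: the non-affecting case is handled up front
// and the long body loses one level of nesting.
void writeFlat(const std::vector<InputFile> &Files) {
  for (const InputFile &F : Files) {
    if (!F.Affecting) {
      // ... record the non-affecting file ...
      continue;
    }
    // ... long body writing the record ...
  }
}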

jansvoboda11 marked an inline comment as done.

Rebase, decrease nesting, test using diff

jansvoboda11 marked 4 inline comments as done. Nov 1 2022, 5:29 PM
dexonsmith accepted this revision. Nov 1 2022, 5:37 PM

LGTM, with one suggestion for the test inline.

clang/test/Modules/add-remove-irrelevant-module-map.m
30

Maybe the CC1s should add -verify and test-simple.m should have // expected-no-diagnostics to help protect against bitrot?

This revision is now accepted and ready to land. Nov 1 2022, 5:37 PM
This revision was landed with ongoing or failed builds. Nov 1 2022, 7:33 PM
This revision was automatically updated to reflect the committed changes.