Page MenuHomePhabricator

[clang][modules] Track included files per submodule
Needs ReviewPublic

Authored by jansvoboda11 on Nov 1 2021, 2:23 AM.

Details

Summary

When building a module consisting of submodules, the preprocessor keeps a "global" state that for each header file tracks (amongst other things) the number of times it was included/imported. This information is serialized into the PCM file.

When importing anything from such module (either the top-level module or any of its submodules), this number is merged into the state of the importing preprocessor.

This can incorrectly prevent imports of textual headers (see attached tests). This patch fixes this bug by making the number of times a header was included more fine-grained and tracks it on per-submodule basis. This information is lazily deserialized when first importing each (sub)module.

(This patch is an alternative approach to the same issue addressed in D104344.)

Depends on D114095.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I'm interested in hearing some feedback whether the direction I'm taking here makes sense.

There are a couple of TODOs (mostly on optimizing away unnecessary maps) and a few modules-ts tests are failing.

@rsmith left a suggestion on D104344 to track this information in Preprocessor::SubmoduleState, which has similar semantics to what I'm doing with Preprocessor::IncludeMap (dividing the state during submodule compilation). The issue is that it (Preprocessor::Submodules) is really only enabled when Clang gets invoked with -fmodules-local-submodule-visibility. To keep its current semantics and make it usable for our use-case (which needs to work even without the flag), I think we'd need to always track the submodule state and conditionally merge the outer and local macro states as we now do for IncludeMap in PPLexerChange.cpp. I'm not really sure this is correct/feasible.

Any opinions on this?

jansvoboda11 edited the summary of this revision. (Show Details)Nov 1 2021, 2:54 AM
jansvoboda11 edited the summary of this revision. (Show Details)

I'm not going to cover the entire change, some parts I need to consider more carefully.

There can be other reasons to keep IncludeMap out of SubmoduleState but I'm not sure the local submodule visibility is the right reason. I might be reading the code incorrectly but looks like CurSubmoduleState is used when local submodule visibility is disabled. The difference is it's tracking the aggregate state instead of per-submodule state. Need to experiment more but so far tracking includes in SubmoduleState follows the spirit of local submodule visibility. Though it's not guaranteed it'll work perfectly from the technical perspective.

Also I think we'll need to increase granularity to track other HeaderFileInfo attributes, not just NumIncludes. Don't have a test case to illustrate that right now and no need to change that now but something to keep in mind.

clang/lib/Lex/PPLexerChange.cpp
706

How many includes are expected to be here? Are this only immediate includes or also transitive?

Asking to evaluate how expensive iterating through the includes can get.

Avoid copying data between submodules

There can be other reasons to keep IncludeMap out of SubmoduleState but I'm not sure the local submodule visibility is the right reason. I might be reading the code incorrectly but looks like CurSubmoduleState is used when local submodule visibility is disabled. The difference is it's tracking the aggregate state instead of per-submodule state. Need to experiment more but so far tracking includes in SubmoduleState follows the spirit of local submodule visibility. Though it's not guaranteed it'll work perfectly from the technical perspective.

Yes, CurSubmoduleState is being used unconditionally. However, without local submodule visibility enabled, it always points to NullSubmoduleState. Only with the feature enabled does it point to the current submodule state (stored in Submodules). The change happens in Preprocessor::{Enter,Leave}Submodule.

Also I think we'll need to increase granularity to track other HeaderFileInfo attributes, not just NumIncludes. Don't have a test case to illustrate that right now and no need to change that now but something to keep in mind.

That's interesting. I think HeaderFileInfo::isImport should definitely be tracked in the preprocessor, not in HeaderFileInfo. The fact that the header was #imported is not an intrinsic property of the file itself, but rather a preprocessor state. Can you think of other fields that don't really belong to HeaderFileInfo?

Based on your feedback, I simplified the patch quite a bit. We're no longer copying the include state between submodules. In its current form, this patch essentially moves HeaderFileInfo::NumIncludes into Preprocessor::NumIncludes and still uses it as the source of truth.
However, we're also tracking NumIncludes separately in each submodule and serializing this into the PCM. Instead of merging NumIncludes of the whole module when loading it (which we did before), we can merge NumIncludes only of the modules we actually import.

clang/lib/Lex/Preprocessor.cpp
1331

Iterating over all FileEntries is probably not very efficient, as Volodymyr mentioned. Thinking about how to make this more efficient...

Just a few comments on implementation details. The only high-level piece to call out is that I wonder if NumIncludes could/should be simplified (semantically) to a Boolean in a prep commit.

clang/include/clang/Lex/Preprocessor.h
722–725

I'm not sure about this choice. UIDs are unlikely to be adjacent and in SubmoduleIncludeState, since a given submodule is unlikely to have "most" files. Also, memory usage characteristics could be "weird" if the FileManager is being reused between different compilations.

DenseMap<unsigned, unsigned> will have the same number of allocations (i.e., just 1). If UIDs really are dense and near each other in one of the submodules then that map will be a little bigger than necessary (~2x), but it should be better for the rest of them.

clang/lib/Lex/Preprocessor.cpp
1331

My suggestion above to drop FileEntryMap in favour of a simple DenseMap would help a bit, just iterating through the files actually included by the submodules.

Further, I wonder if "num-includes"/file/submodule (unsigned) is actually useful, vs. "was-included"/file/submodule (bool). The only observer I see is HeaderSearch::PrintStats() but maybe I missed something?

If I'm right and we can switch to bool, then NumIncludes becomes a DenseSet<FileEntry *> IncludedFiles (or DenseSet<unsigned> for UIDs). (BTW, I also wondered if you could rotate the map, using File as the outer key, and then use bitsets for the sbumodules, but I doubt this is better, since most files are only included by a few submodules, right?)

Then you can just do a set union here. Also simplifies bitcode serialization.

(If a bool/set is sufficient, then I'd suggest landing that first/separately, before adding the per-submodule granularity in this patch.)

clang/lib/Serialization/ASTWriter.cpp
2486–2487

A vector of maps would be an improvement, but that'll still be a lot of allocations.

Leading me toward a simple flat vector + sort.

struct IncludeToSerialize { // Probably more straightforward than a std::tuple...
  StringRef Filename;
  unsigned SMID;
  unsigned NumIncludes;

  bool operator<(const IncludeToSerialize &RHS) const {
    if (SMID != RHS.SMID)
      return SMID < RHS.SMID;
    int Diff = Filename.compare(RHS.Filename);
    assert(Diff && "Expected unique SMID / Filename pairs");
    return Diff < 0;
  }
};
SmallVector<IncludeToSerialize> IncludesToSerialize;
// ...
    IncludesToSerialize.push_back({Filename, LocalSMID, NumIncludes});
// ...
llvm::sort(IncludesToSerialize);
for (const IncludeToSerialize &SI : IncludesToSerialize) {
  // emit record
}

(Or if there are duplicates expected to be encountered and ignored, you can remove the assertion and use stable_sort + unique + erase.)

2512

I wonder, will the Filename already be serialized elsewhere? Could an ID from that be reused here, rather than writing the filename again? (Maybe that'd need a bigger refactor of some sort to create a filename table?)

Stepping back, it looks like this is always eagerly loaded.

  • Could it be lazily-loaded by submodule?
  • Could it be lazily-loaded by filename?

In the former case, seems like a single record per submodule makes sense, with a single blob that can be decoded on-demand.

In the latter case, maybe it should be rotated, and stored a single record per filename as a blob that can be lazily decoded:

<filename-size> <filename> <num-submodules> (<smid> <num-includes>)+

That's interesting. I think HeaderFileInfo::isImport should definitely be tracked in the preprocessor, not in HeaderFileInfo. The fact that the header was #imported is not an intrinsic property of the file itself, but rather a preprocessor state. Can you think of other fields that don't really belong to HeaderFileInfo?

After checking HeaderFileInfo, looks like isImport is the only other field that should be tracked in the preprocessor. I had in mind a case where a hidden submodule imports a file with x-macros and then a visible submodule includes this header twice with different macros. First include would go through because NumIncludes == 0, and the second one shouldn't because NumIncludes == 1 && isImport == true. The import in the hidden submodule is incorrect but errors in unused headers shouldn't break actually used headers.

Call getFileInfo in Preprocessor::EnterMainSourceFile. This ensures deserialization of HeaderFileInfo, which seems to be necessary with modules-ts enabled.

Fix deserialization, improve (sub)module state tracking

Make loading of (sub)module includes lazy

jansvoboda11 retitled this revision from WIP: [clang][modules] Granular tracking of includes to [clang][modules] Track number of includes per submodule.Nov 9 2021, 1:31 AM
jansvoboda11 edited the summary of this revision. (Show Details)
jansvoboda11 marked 2 inline comments as done.Nov 9 2021, 1:53 AM

Thanks for your feedback @dexonsmith. I reworked the patch to use more sensible data structures as suggested, and made the AST deserialization lazy (on (sub)module import).

I think the only thing to figure out is the exact structure of the serialized information - whether we're fine with serializing all transitively included files in each AST file, or whether we'd like to fetch that information from other AST files instead.

I removed the WIP tag and would be happy to gather more feedback.

clang/include/clang/Lex/Preprocessor.h
722–725

You're right that UIDs of files included in a single submodule are unlikely to have "most" files. Thanks for pointing that out.

In the latest revision, I switched to llvm::DenseMap<const FileEntry *, unsigned>.

clang/lib/Lex/Preprocessor.cpp
1331

For each file, we need to have three distinct states: not included at all, included exactly once (firstTimeLexingFile), included more than once. This means we can't use a single DenseSet.

But we could use a DenseMap<Key, bool>, where "not included at all" can be expressed as being absent from the map, exactly once as having true in the map and more than once as having false in the map. Alternatively, we could use two DenseSet instances to encode the same, but I don't think having two lookups per file to determine stuff is efficient.

I can look into this in a follow-up patch.

clang/lib/Serialization/ASTWriter.cpp
2486–2487

In the latest revision, I ended up sorting just based on Filename, since this is now explicitly stored per submodule.

2512

We already serialize Filename elsewhere, but only for local input files. Here we need transitive closure of all included input files.

I'm still unsure whether it's fine to store transitively included files here or if we should look that information up in the respective AST files.

The current solution looks like it will bloat sizes of the AST files, but I think the transitive closure is already being stored for HeaderFileInfo anyways, so it shouldn't be that big of a deal?

2512

Thinking about it some more, the current implementation will be duplicating a lot of filenames between submodules of the same module. We might need to extract that to some common storage that we can refer to with simple integer offsets...

jansvoboda11 marked an inline comment as done.

Make clang-format happy.

Store only direct includes in the PCM (compared to transitive stored previously), use InputFile ID (instead of full filesystem path).

Also: add test of transitive includes, fix bug in VisibleModuleSet::setVisible.

Implementation looks a lot cleaner!

I'd still like to drop NumIncludes first if possible because I think it'll be easier to reason about without this extra layer of complexity. Also, that'd mitigate the potential regression in .pcm size.

(Note: I'd be more comfortable with @vsapsai and/or @rsmith reviewing the overall direction; I just jumped in for the implementation details.)

clang/include/clang/Lex/HeaderSearch.h
133–139 ↗(On Diff #386139)

Looks like this is already dead code? If so, please separate out and commit ahead of time (e.g., now).

clang/include/clang/Serialization/ModuleFile.h
400

Each StringMapEntry is going to have a pretty big allocation here, for a 512B vector. Given that it doesn't need to be after creation, it'd be more efficient to use this pattern:

llvm::StringMap<ArrayRef<uint64_t>> SubmoduleIncludedFiles;
SpecificBumpPtrAlloc<uint64_t> SubmoduleIncludedFilesAlloc;

// later
MutableArrayRef<uint64_t> Files(SubmoduleIncludedFiles.Allocate(Record.size()), Record.size());
llvm::copy(Record, Files.begin());
SubmoduleIncludedFiles[Key] = Files;

That said, I feel like this should just be:

DenseMap<StringRef, StringRef> SubmoduleIncludedFiles;

The key can point at the name of the submodule, which must already exist somewhere without needing a StringMap to create a new copy of it. The value is a lazily deserialized blob.

clang/lib/Lex/Preprocessor.cpp
1331

Seems like a DenseSet could still be used by having HeaderInfo pass back the WasInserted bit from the insertion to the preprocessor, and threading it through to Preprocessor::HandleEndOfFile (the only caller of FirstTimeLexingFile):

bool IsFirst = Set.insert(Key).second;

The threading doesn't seem too hard. Looking at main:

  • Preprocessor::HandleHeaderIncludeOrImport calls HeaderInfo::ShouldEnterIncludeFile. This does the ++FI.NumIncludes (going from 0 to 1). Instead, it could be IsFirst = !FI.WasIncluded; FI.WasIncluded = true;, then return IsFirst somehow. (Then your patch can pull IsFirst from the insert().second).
  • Preprocessor::HandleHeaderIncludeOrImport calls Preprocessor::EnterSourceFile. This creates a new Lexer for that file. IsFirst can be stored on that Lexer.
  • Preprocessor::HandleEndOfFile calls FirstTimeLexingFile. Instead, it can check the new accessor CurLexer->isFirstTimeLexing().

I can look into this in a follow-up patch.

Follow-up might be okay, but it'd be nice to remove an axis of complexity before adding a new one if it's reasonable. E.g., it'll be easier to debug emergent issues from changing it to a simple set since there's less machinery to worry about.

clang/lib/Serialization/ASTReader.cpp
5738–5740

This looks lazy, but a bunch of work was just done to decode the Record from bitcode. To make this actually lazy, you can encode the data in a blob, which doesn't have to be decoded until it's used.

clang/lib/Serialization/ASTWriter.cpp
2249–2253

Why does the count need to be encoded?

The only observer is Preprocessor::HandleEndOfFile. If it gets called again for this file, it'll be after ++NumIncludes.

As we've discussed earlier, tracking isImport shouldn't be done per .pcm and here is the test case https://gist.github.com/vsapsai/a2d2bd19c54c24540495fd9b262106aa I'm not sure it is worth adding the second #include as the test fails just with one.

Overall, the change seems more complicated than it has to be. I need to check it carefully to see what can be simplified. And I need to check in debugger how and when AST reading is triggered. Looks like not all of that is lazy but I need to check the compiled code, not my guesses.

Didn't go in-depth for serialization/deserialization. When we add tracking isImport on per-submodule basis, do you think AST reading/writing would change significantly?

clang/include/clang/Lex/ExternalPreprocessorSource.h
47

I think it is better for understanding and more convenient to use some using instead of duplicating llvm::DenseMap<const FileEntry *, unsigned> in multiple places.

clang/include/clang/Lex/Preprocessor.h
771–777

I think the interplay between CurSubmoduleIncludeState, IncludedFiles, and CurSubmoduleState is pretty complicated. Recently I've realized that it can be beneficial to distinguish include tracking for the purpose of serializing per submodule and for the purpose of deciding if should enter a file. In D114051 I've tried to illustrate this approach. There are some tests failing but hope the main idea is still understandable.

One of my big concerns is tracking VisibleModules in multiple places. D114051 demonstrates one of the ways to deal with it but I think it is more important for you to know the problem I was trying to solve, not the solution I came up with.

clang/lib/Basic/Module.cpp
653

Was meaning to make this fix for a long time but couldn't test it. Thanks for finally fixing it!

clang/lib/Lex/Preprocessor.cpp
1330

If I drop checking getLocalSubmoduleIncludes, no tests are failing. But it seems like this call is required. How can we test it?

Rebase on top of extracted patches.

jansvoboda11 marked 5 inline comments as done.Nov 17 2021, 12:14 PM

Implementation looks a lot cleaner!

I'd still like to drop NumIncludes first if possible because I think it'll be easier to reason about without this extra layer of complexity. Also, that'd mitigate the potential regression in .pcm size.

Done in D114096. Thanks for the feedback.

clang/include/clang/Lex/HeaderSearch.h
133–139 ↗(On Diff #386139)

Done in D114092.

clang/include/clang/Serialization/ModuleFile.h
400

I switched to StringRef value in the latest revision. Unfortunately, had to use std::string as the key instead of StringRef, since getFullModuleName constructs the string on heap. That forced me to use std::map, too. I'll explore using something else entirely as the key.

clang/lib/Lex/Preprocessor.cpp
1331

Extracted into D114093.

jansvoboda11 marked 4 inline comments as done.Nov 17 2021, 12:26 PM

Didn't go in-depth for serialization/deserialization. When we add tracking isImport on per-submodule basis, do you think AST reading/writing would change significantly?

I think moving isImport into Preprocessor can be done in a similar way to how we handle the number of includes (or rather "has been included/imported" behavior), so we should be able to reuse some parts of this patch there.

One question that remains to be answered is whether to keep two separate DenseSet<const FileEntry *> instances, or merge them into something like:

struct IncludeInfo {
  bool IsImportedOrIncluded; // currently Preprocessor::IncludedFiles
  bool IsImported;           // currently HeaderFileInfo::isImport
}
llvm::DenseMap<const FileEntry *, IncludeInfo> IncludedFilesInfo;

Thanks a lot for the feedback, I appreciate it.

clang/include/clang/Lex/ExternalPreprocessorSource.h
47

Yeah, spelling the type out everywhere is a bit unwieldy. It's a bit better with llvm::DenseSet<const FileEntry *> in the latest revision, but still not pretty. I wanted to avoid pulling Preprocessor.h everywhere, but will probably reconsider doing that.

clang/include/clang/Lex/Preprocessor.h
771–777

Thanks a ton, this must've taken quite a bit of time to put together. I agree your approach is much simpler. I'll investigate how it behaves on larger projects, but probably will end up adopting it in this patch.

I have done some minor tweaks locally to fix the failing PCH tests, Modules/import-textual-noguard.mm doesn't make much sense to me, I think I'll end up updating that test.

clang/lib/Lex/Preprocessor.cpp
1330

I think this should kick in when importing a submodule from the same module. I'll try to come up with a test case.

jansvoboda11 marked an inline comment as done.Nov 17 2021, 12:28 PM
jansvoboda11 marked an inline comment as done.
dexonsmith added inline comments.Nov 17 2021, 1:05 PM
clang/include/clang/Serialization/ModuleFile.h
400

Oh, if the key isn't being kept alive elsewhere, you can/should use StringMap<StringRef>, which is strictly better than std::map<std::string, StringRef> (except for non-deterministic iteration order). I suggested otherwise because I assumed the key already existed.

Also, if the map isn't being deleted from (much), might as well bump-ptr-allocate it:

BumpPtrAllocator SubmoduleIncludedFilesAlloc;
StringMap<StringRef, BumpPtrAllocator> SubmoduleIncludedFiles; // iniitalized with teh Alloc above.

I'll explore using something else entirely as the key.

Not necessarily important; just if the key was already (e.g., stored in its Module or something) you might as well use it. Unless, maybe if the Module is known to be constructed already, you could use its address?

400

I switched to StringRef value in the latest revision. Unfortunately, had to use std::string as the key instead of StringRef, since getFullModuleName constructs the string on heap. That forced me to use std::map, too. I'll explore using something else entirely as the key.

jansvoboda11 retitled this revision from [clang][modules] Track number of includes per submodule to [clang][modules] Track included files per submodule.Nov 18 2021, 9:37 AM
jansvoboda11 edited the summary of this revision. (Show Details)

Rebase, add type alias, improve map allocation in ASTReader.

Apply part of Volodymyr's cleanup.

Add missed newline.

jansvoboda11 marked 5 inline comments as done.Nov 18 2021, 10:16 AM

I think AST format for IncludedFiles was discussed here, so I'll continue here though the bulk of implementation is in D114095 now. Have you compared the size of resulting .pcm files when you are using a bitvector compared to a list of included headers? In my quick check (which is not a perfect comparison, to be honest) bitvector approach takes more space. For example, Darwin.pcm is 7320 bytes bigger, UIKit.pcm is 2388 bytes bigger, and entire modules cache is 14KB bigger. I haven't checked the details of the discrepancies, so curious if you have some insights already. For the record, I was testing with

echo '#import <UIKit/UIKit.h>' | path/to/built/bin/clang -fsyntax-only -isysroot "$(xcrun --sdk iphoneos --show-sdk-path)" -target arm64-apple-ios -fmodules -fmodules-cache-path=modules.noindex -x objective-c -

I think AST format for IncludedFiles was discussed here, so I'll continue here though the bulk of implementation is in D114095 now. Have you compared the size of resulting .pcm files when you are using a bitvector compared to a list of included headers? In my quick check (which is not a perfect comparison, to be honest) bitvector approach takes more space. For example, Darwin.pcm is 7320 bytes bigger, UIKit.pcm is 2388 bytes bigger, and entire modules cache is 14KB bigger. I haven't checked the details of the discrepancies, so curious if you have some insights already. For the record, I was testing with

echo '#import <UIKit/UIKit.h>' | path/to/built/bin/clang -fsyntax-only -isysroot "$(xcrun --sdk iphoneos --show-sdk-path)" -target arm64-apple-ios -fmodules -fmodules-cache-path=modules.noindex -x objective-c -

Replied in D114095.