This is an archive of the discontinued LLVM Phabricator instance.

Remove non-affecting module maps from PCM files.
ClosedPublic

Authored by ilyakuteev on Jul 27 2021, 7:27 AM.

Details

Summary

Problem:
PCM file includes references to all module maps used in compilation which created PCM. This problem leads to PCM-rebuilds in distributed compilations as some module maps could be missing in isolated compilation. (For example in our distributed build system we create a temp folder for every compilation with only modules and headers that are needed for that particular command).

Solution:
Add only affecting module map files to a PCM-file.

Diff Detail

Event Timeline

ilyakuteev requested review of this revision.Jul 27 2021, 7:27 AM
ilyakuteev created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptJul 27 2021, 7:27 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

I think module maps that specify textual headers should be considered inputs if we entered any of those textual headers; we won't treat those as being imported. In addition to considering the current module and its imports, perhaps we should walk the preprocessor's headers list and add the module maps corresponding to the modules of all headers that were entered?

ilyakuteev updated this revision to Diff 365969.

Added modules for headers known to HeadersSearch, including textual ones.

@rsmith Can you please check is this a correct way to add headers known to preprocessor? (Lines 173-191)

Hello everyone!

We have very slow remote compilations without this patch. :(. I'll be very grateful for your review.

This should have a testcase. I would imagine you can test this by constructing a situation where you build a module twice with different sets of module map files being considered and checking that you get bit-for-bit identical outputs. (For example: set up a temporary area for the test case files, build a module there, then copy in an irrelevant module map file into a directory named by a -I and build the module again.)

clang/lib/Serialization/ASTWriter.cpp
164

Modulemap -> ModuleMap for consistency please.

168

You're walking this list in a non-deterministic order; consider using a SmallVector here and popping elements from the end instead of the front in the loop below (ie, depth-first traversal instead of breadth-first).

183–184

I don't think this is right: I think we should consider every file that we've entered in this compilation, regardless of whether it's part of a module we're compiling. (We support modes where we'll enter modular headers despite not compiling them.) Can you replace this with just if (!HFI) continue; and still get the optimization you're looking for?

186

This should be findAllModulesForHeader: in the case where there is more than one module covering a header, the additional module maps can matter in some contexts.

1537–1538

This will not work correctly if a module map file is reachable via multiple different paths. Consider using FileEntry* comparisons instead here.

rsmith accepted this revision.Sep 14 2021, 2:40 PM

Please apply the lint suggestions; otherwise, looks good.

This revision is now accepted and ready to land.Sep 14 2021, 2:40 PM
ilyakuteev marked 5 inline comments as done.

Hello everyone! Ping :), need more approves for this patch.

@rsmith Thank you for your helpful tips. Can you please review the latest change? I fixed some tests and applied clang format suggestions.

ilyakuteev set the repository for this revision to rG LLVM Github Monorepo.Sep 27 2021, 7:11 AM
ilyakuteev closed this revision.Sep 27 2021, 10:59 AM
ilyakuteev reopened this revision.Sep 27 2021, 11:02 AM
This revision is now accepted and ready to land.Sep 27 2021, 11:02 AM
ilyakuteev closed this revision.Sep 27 2021, 11:13 AM
ilyakuteev reopened this revision.Sep 27 2021, 2:34 PM
This revision is now accepted and ready to land.Sep 27 2021, 2:34 PM

@rsmith can you please help with committing patch to trunk? Do not sure if I have correct rights to do it. Were are several failing tests, seems like it is caused by bugs in pre-merge checks (For example this one on win: https://github.com/google/llvm-premerge-checks/issues/353). Thanks!

Hello everyone, can somebody please help me commit this patch to trunk?

Hello everyone, can somebody please help me commit this patch to trunk?

Hi, I'm happy to commit this for you. What author name and email would you like to be associated with the commit?

ilyakuteev added a comment.EditedNov 16 2021, 4:55 AM

Hi, I am very glad for your help. :)
Author name: Ilya Kuteev
email: ilyakuteev@yandex-team.ru

This revision was automatically updated to reflect the committed changes.

Hi @ilyakuteev, this patch is causing some issues downstream. (See clang/test/Modules/non-affecting-module-maps-source-locations.m here.)

I think they all boil down to the fact that a SourceLocation is a simple offset into SourceManager's buffer of concatenated input files. By not serializing some input files, we make that buffer shorter by leaving out some of its parts. This means that a SourceLocation that used to point at the end of the last file might now be pointing outside the new, shorter buffer. More generally, any SourceLocation pointing after any of the removed input files is now bogus. It seems that we should be computing the set of non-affecting input files at the start of serialization, and then any SourceLocation we serialize, we need to adjust to account for the changes in the input file buffer. I'm not sure how efficiently we can implement this, given that we might need to adjust every SourceLocation. And we have lots of them.

We rely on this feature for "implicitly discovered, explicitly built modules". For our purpose, the (simpler) downstream patch changes the PCM file so that it still contains all input files, preventing the issue I just described, but includes a bit that says whether a file was affecting or not. My question is: would that approach work for your use-case? You mentioned you want to avoid rebuilds in distributed compilation when some non-affecting input files are missing. You could avoid that by not validating non-affecting input files in ASTReader. One potential issue I see with this approach is that two PCMs for module M produced on two machines with different sets of non-affecting input files are not interchangeable. They will have different contents and therefore different signature. This could lead to rebuilds of importers that expect one signature for PCM M but (after some distributed re-shuffling), the other PCM ends up on the machine, creating a mismatch between the expected and actual PCM signature.

Please, let me know your thoughts.

Herald added a project: Restricted Project. · View Herald TranscriptOct 19 2022, 11:22 AM
jansvoboda11 added inline comments.Oct 19 2022, 11:26 AM
clang/lib/Serialization/ASTWriter.cpp
183–184

@rsmith It appears that between your approval and when the patch got committed, your suggested change was reverted. Can you please explain the situation where we enter modular headers without compiling them?

jansvoboda11 added inline comments.Oct 19 2022, 11:29 AM
clang/lib/Serialization/ASTWriter.cpp
1549

Why do we never consider system module maps non-affecting? (i.e. always consider them affecting)

As an end goal, it seems strictly better for build systems / distribution / artifact storage / caching if the output PCM has been canonicalized as-if the module maps weren't found, rather than tracking both which module maps were discovered and which to ignore. Ideally, we'd find a way to canonicalize the output efficiently and correctly... along the lines of this patch, but with a bug fix somewhere.

It may not be too hard/expensive to translate the source locations. E.g., you could build up a vector:

SmallVector<SourceLocRange> DroppedMMs;

// fill up DroppedMMs with MMs...
...

// sort to create map.
sort(DroppedMMs);

// accumulate offsets.
SmallVector<uint64_t> AccumulatedOffset;
uint64_t Total = 0;
AccumulatedOffset.reserve(DroppedMMs.size() + 1);
AccumulatedOffset.push_back(0);
for (MM : DroppedMMs)
  AccumulatedOffset.push_back(Total += MM.size());

// later, translate during serialization
Error serializeSourceLoc(SourceLoc SL) {
  if (DroppedMMs.empty())
    return SerializeSourceLocImpl(SL);
  auto I = llvm::lower_bound(DroppedMMs, SL);
  assert((I == MMs.end() || !I->contains(SL)) &&
         "Serializing a location from an ignored module map???");
  return serializeSourceLocImpl(SL - AccumulatedOffset[I - MMs.begin()]);
}

Then a std::lower_bound into DroppedMMs tells you how much to adjust any given SourceLoc by. Serializing a source location would go through this map. Presumably, the number of dropped files will be relatively small (number of ignored module maps) so the binary search should be fast. Probably there would be good peepholes for common cases (such as tracking LastDroppedMM to avoid repeating the same search).

A further (more involved) approach would be to separate module maps into a separate SourceManager, so that their source locations don't affect other input files. Then only module map source locations would need to be translated during serialization. (Now that FileManager has the capability to remap file contents, I think the commit that merged the SourceManagers could be effectively reverted.)

A further (more involved) approach would be to separate module maps into a separate SourceManager, so that their source locations don't affect other input files. Then only module map source locations would need to be translated during serialization. (Now that FileManager has the capability to remap file contents, I think the commit that merged the SourceManagers could be effectively reverted.)

Throwing another idea out there... As a sort of compromise to avoid changing serialization/deserialization too much, maybe a hybrid could make sense:

  • During compilation, keep newly-found/discovered module maps in a separate SourceManager.
  • Translate all serialized module map locations to the main SourceManager during serialization. Could do this up front (collected non-ignored module maps) or maybe even on-demand (need to translate this location so add it to the main source manager).

(Or something along these lines...)

I agree that avoiding serializing non-affecting input files is the better approach. Your code is more or less what I had in mind, thanks for sketching it out!
The number of ignored module maps is typically around 70 - 110 on macOS (when you allow system module maps to be treated as non-affecting), and those are at the start of the SourceManager block. I might implement this approach and test out if there's a noticeable performance impact. Maybe we could use something similar to the optimization in SourceManager::getFileIDLoaded().

I remember having a separate SourceManager came up before, when investigating other issues. Though I think you want to merge SourceManagers earlier than on serialization. I think other data structures might hold a SourceLocation pointing into the module-map-specific SourceManager. How can you tell which SourceManager it came from? This could be prevented by merging the module-map-specific SourceManager into the main one when returning non-null result from Module *ModuleMap::lookupModule(StringRef Name), were that the only way to get a module.

I agree that avoiding serializing non-affecting input files is the better approach. Your code is more or less what I had in mind, thanks for sketching it out!
The number of ignored module maps is typically around 70 - 110 on macOS (when you allow system module maps to be treated as non-affecting), and those are at the start of the SourceManager block. I might implement this approach and test out if there's a noticeable performance impact. Maybe we could use something similar to the optimization in SourceManager::getFileIDLoaded().

Interesting. In that case it'd probably be profitable to do a pass after sorting to merge adjacent SourceLoc ranges. If there are many ignored system modules they may well be immediately adjacent and they'd resolve to a single range.

I remember having a separate SourceManager came up before, when investigating other issues. Though I think you want to merge SourceManagers earlier than on serialization. I think other data structures might hold a SourceLocation pointing into the module-map-specific SourceManager. How can you tell which SourceManager it came from?

I'm curious whether this was a problem before, when there were two SourceManagers (before they were combined into one). It's possible that it's just obvious from context which SourceManager to use; that there isn't ambiguity in practice (because of where the SourceLocs are stored, or the source language, or something?).

Note that fully splitting the source manager might be worth doing (eventually) as a follow up, even if there aren't performance problems with the remapping approach. (Unless this patch-and-the-bug-fixes will address all the performance problems? I don't think it totally solves the quadratic use of SourceLocation bit space by module maps, but maybe it does?)

This could be prevented by merging the module-map-specific SourceManager into the main one when returning non-null result from Module *ModuleMap::lookupModule(StringRef Name), were that the only way to get a module.

Ah, yeah, on-demand, but sooner than I was thinking. SGTM!