This is an archive of the discontinued LLVM Phabricator instance.

[clangd] [C++20] [Modules] Support C++20 modules for clangd
Needs Revision · Public

Authored by ChuanqiXu on Jun 16 2023, 2:17 AM.

Details

Summary

Try to address https://github.com/clangd/clangd/issues/1293. See the link for some design ideas.

What this patch does:

  • Offers an option "-experimental-modules-support" to gate the new feature, so that however immature it is, it won't affect current users. For the rest of this page, we'll assume the option is enabled.
  • When we load a compilation database, we try to scan every TU recorded in it, using the same process as clang-scan-deps, to find the module-related files. From the scanning results we build a module dependency graph.
  • Every time we update a file, all the affected files (e.g., when we change a header) are re-scanned. This is necessary since we don't know whether the change introduced anything module-related. We then update the module graph accordingly.
  • When we want to build a file, we first check whether the BMIs of all its dependencies are already built (third-party modules excluded). If yes, we go ahead and compile it. If not, we wait for the modules manager to build all the dependent BMIs; whether that build succeeds or fails, we are resumed afterwards. Note that this implies the BMIs are built lazily.
  • When we compile a file, all the options (-fmodule-file=<module-name>=<module-path>) that specify the locations of BMIs are dropped, and the new modules global compilation database inserts a new search path pointing at the BMIs built by clangd itself (sketched below). This way we're not version-locked with the compiler the user uses, and we won't affect the user's build.
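
A minimal sketch of that rewrite, assuming a hypothetical helper (-fmodule-file= and -fprebuilt-module-path= are real clang options; the function itself is illustrative, not the patch's actual code):

#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/Twine.h"
#include <string>
#include <vector>

// Drop user-specified BMI locations and point clang at clangd's own BMI
// directory instead, so clangd never consumes BMIs from the user's build.
void adjustForClangdBMIs(std::vector<std::string> &Args, llvm::StringRef BMIDir) {
  llvm::erase_if(Args, [](llvm::StringRef Arg) {
    return Arg.starts_with("-fmodule-file=");
  });
  Args.push_back(("-fprebuilt-module-path=" + BMIDir).str());
}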

Missing functionalities

The major missing functionality is that when we update a module unit, its users won't update automatically. For example,

// b.cppm
export module b;
export int bb = 43;

// a.cpp
import b;
int aa = bb;

After initialization, code intelligence correctly shows the value of bb in a.cpp as 43. But if we change the value of bb to 44 in b.cppm, the value of bb displayed in a.cpp is still 43. We can get the newest result by inserting and deleting an empty line in a.cpp and saving it. (Any change should work.)

The reason I don't address this in the patch is that the patch itself is already big; the larger it is, the harder it is to review. Also, I think it is usable at some level. So let's address this in later patches.

Other problems

The major problem I see now is that we can't handle third-party modules, i.e., modules whose source code is not in the current project. Users can still get hints from clangd if the BMIs of the third-party modules are built ahead of time, which I think covers a lot of use cases. But this breaks our goal of not being locked to the same compiler version.

I think we can only solve this issue after SG15 solves it. I know SG15 is discussing how to let modules communicate across library boundaries, and it is not wise to reinvent wheels against SG15. So let's wait for that.

Unable to make the clangd-built BMIs persist yet

In the above link, @nridge asks to persist the clangd-built BMIs so that we can reuse them across clangd invocations. This is not addressed in the patch, partly for the same reason as above: the current patch is already large. I also find it nontrivial, since we need to check the consistency of a built BMI with the source code. I know clang has similar functionality, but I feel we may need some code to adapt it. So I chose not to implement it in the first step.

Performance

I tested this with a modularized library: https://github.com/alibaba/async_simple/tree/CXX20Modules. This library has 3 modules (async_simple, std and asio) and 65 module units (note that a module consists of multiple module units). Both the std module and the asio module have 100k+ lines of code (maybe more, I didn't count), and async_simple itself has 8k lines of code. This is the scale of the project.

I opened a file at the end of the dependency chain and restarted the clangd server; the log shows it takes 10s to get things ready.

Hmmm, not a good number, actually. But I found the major reason is that building the BMIs is too slow, and the major potential improvements live on the compiler side: the compiler should offer an option to build BMIs in a lightweight way. I don't feel we can do a lot on the clangd side.

I think the current performance issues are mainly related to the number of module units and the lines of code in them. That is, it doesn't matter if we have a big project as long as it has no, or only small-scale, module units.

Plans

This is clearly not great, but I feel it is basically workable. It is a little awkward to land this at the end of the release cycle, but I still want to try to land it in clang17, as it is an important feature for modules users.

If we can land this, I plan to implement the lightweight BMI mode in clang first, and the persistent BMI feature after that.

Diff Detail

Event Timeline

ChuanqiXu created this revision. Jun 16 2023, 2:17 AM
Herald added a project: Restricted Project. Jun 16 2023, 2:17 AM
ChuanqiXu requested review of this revision. Jun 16 2023, 2:17 AM
ChuanqiXu updated this revision to Diff 532056. Jun 16 2023, 2:26 AM
ChuanqiXu edited the summary of this revision. Jun 16 2023, 2:40 AM
ChuanqiXu edited the summary of this revision. Jun 16 2023, 3:06 AM
ChuanqiXu added reviewers: sammccall, nridge.
ChuanqiXu added a subscriber: nridge.
ChuanqiXu edited the summary of this revision. Jun 16 2023, 3:11 AM
ChuanqiXu retitled this revision from [WIP] [clangd] [C++20] [Modules] Support C++20 modules for clangd to [clangd] [C++20] [Modules] Support C++20 modules for clangd. Jun 16 2023, 3:16 AM
sammccall added a comment (edited). Jun 16 2023, 12:27 PM

Thanks for putting this together, I'm going to study it carefully and try it out!

That said, there are two large issues that I think should be addressed in the design (though not necessarily *implemented* now).
I'll be upfront: these are things without which $EMPLOYER will not be able to use this at all.
Since there isn't really room for multiple modules implementations in clangd, I'd like to make sure these fit into the design, or can be bolted on.

In the past, we've been reasonably successful at finding extension points that let clangd scale to unreasonable codebase sizes, while doing the right thing for smaller projects too. (Index, build system integration, VFS support, etc). To quote from the bug:

you will need an alternative for them (luckily most of them already are willing to invest in custom tooling), but scanning all project files should still be fine for the vast majority of clangd users

For historical context: clangd was that custom tooling designed to scale to huge monorepos. Google needs to continue to pay the cost of such support for new features, and we're trying to make that happen real soon now...


support for clang header modules

Google has a large deployment of clangd, serving a codebase that builds significant parts as clang header modules for performance reasons. There is no near-term plan of adopting C++20 modules (multiple reasons, which I probably can't represent well).
We've been disabling header modules for clangd & other tools for a long time. But it's come to a head, and we expect to get someone working on enabling modules in clangd in ~2 months.
As I understand it, Meta are in a similar situation (though their solution is to version-lock clangd with the toolchain, keep modules enabled, and accept that some things are broken).

Our specific build system setup is that all module-build actions and inputs are explicit (-fmodule-file=, -fmodule-map-file=). The build system will not produce usable PCMs due to version skew.
I know that Meta folks *would* like to make use of available PCMs from the build system.


support for large projects

We have multiple codebases large enough that touching every file to discover dependencies isn't feasible.
The largest internal one had 2B LOC 7 years ago, and is now much larger. But even Chrome, for example, has 20M. Such projects are prone to adopt modules (in some form) for build scalability.
Apart from the concrete scanning of deps, keeping the full project module graph in memory won't always be possible. (It's a perfectly reasonable default implementation though).

(Sorry, hit send too soon)

I suspect the answer for header modules is that we can study this patch and understand what the equivalents of graph nodes/deps/names/scanning look like for explicit header modules, and understand that we'll be able to abstractify some names and add some levels of indirection and it'll all work out.

For large project support, I suspect a bit more thought is needed.
We'll need some abstraction layer (like CompilationDatabase is to compile_commands.json) that exposes enough data to run the algorithms we need, without exposing so much that you have to hold the whole graph in memory. It could be backed by in-memory depscan results, or a build-system artifact, or a live query of the build system.
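
To make the shape of that extension point concrete, a rough sketch of such a coarse-grained interface; every name in it is hypothetical:

#include "llvm/ADT/StringRef.h"
#include <optional>
#include <string>
#include <vector>

// Hypothetical abstraction over the module graph, analogous to what
// CompilationDatabase is to compile_commands.json. Backends might be an
// in-memory depscan, a build-system artifact, or a live build-system query.
class ModuleDependencyIndex {
public:
  virtual ~ModuleDependencyIndex() = default;
  // Module names that must be importable before File can be built.
  virtual std::vector<std::string> requiredModules(llvm::StringRef File) = 0;
  // The source file providing the interface for module Name, if known.
  virtual std::optional<std::string> sourceForModule(llvm::StringRef Name) = 0;
};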
For "every time we update a file, all the affected files (e.g., we changed a header file) will be re-scanned" - we need a way to express this more abstractly than "re-scanned", and more narrowly than "all the affected files" - at least it needs to be limited to all the files that could themselves affect any open file.

(Going to dig into the design now and come back with more thoughts)

The major problem I see now is that we can't handle third-party modules, i.e., modules whose source code is not in the current project. Users can still get hints from clangd if the BMIs of the third-party modules are built ahead of time, which I think covers a lot of use cases. But this breaks our goal of not being locked to the same compiler version.

I think we can only solve this issue after SG15 solves it. I know SG15 is discussing how to let modules communicate across library boundaries, and it is not wise to reinvent wheels against SG15. So let's wait for that.

For context, SG15 is the ISO C++ standards committee's Tooling Study Group. One proposal being looked at that may be relevant here is P2581 - Specifying the Interoperability of Built Module Interface Files.

ChuanqiXu added a comment (edited). Jun 16 2023, 10:50 PM

That said, there are two large issues that I think should be addressed in the design (though not necessarily *implemented* now).

Yeah, totally agreed. Design is pretty important, especially in open-source software. I'm pretty open to any ideas.


support for clang header modules

Google has a large deployment of clangd, serving a codebase that builds significant parts as clang header modules for performance reasons. There is no near-term plan of adopting C++20 modules (multiple reasons, which I probably can't represent well).
We've been disabling header modules for clangd & other tools for a long time. But it's come to a head, and we expect to get someone working on enabling modules in clangd in ~2 months.
As I understand it, Meta are in a similar situation (though their solution is to version-lock clangd with the toolchain, keep modules enabled, and accept that some things are broken).

I suspect the answer for header modules is that we can study this patch and understand what the equivalents of graph nodes/deps/names/scanning look like for explicit header modules, and understand that we'll be able to abstractify some names and add some levels of indirection and it'll all work out.

Yeah. On the one hand, I always think that named modules are far different from header modules (including clang header modules and header units) for tools, since it makes sense for tools to fall back to traditional inclusion when they see header units or header modules. For example,
for clang header modules, IIUC, clangd can work if we rewrite the compile commands to strip the clang-header-module-related options, and we can reuse the rewriting process from this patch. (I know there are some exceptions due to macros.) On the other hand, I'm curious how you handle/model clang explicit header modules with a compilation database? I haven't looked into this. Are the headers entries in the compilation database? And is it possible for one header to have multiple BMIs? Such problems don't look easy to handle, so I prefer to fall back to inclusions.

Another reason for clangd to treat header modules as headers is that building BMIs is not cheap. Even if we developed a mode that creates BMIs in a lightweight way, it would still not be free. So from a performance perspective, too, I think it is better to treat header modules as headers.

Our specific build system setup is that all module-build actions and inputs are explicit (-fmodule-file=, -fmodule-map-file=). The build system will not produce usable PCMs due to version skew.
I know that Meta folks *would* like to make use of available PCMs from the build system.

Also, it looks to me like Meta's method is not the solution we want. It is basically what I tried for named modules half a year ago: I did nothing and just let clangd fall back to clang to handle everything.


support for large projects

Apart from the concrete scanning of deps, keeping the full project module graph in memory won't always be possible. (It's a perfectly reasonable default implementation though).

We'll need some abstraction layer (like CompilationDatabase is to compile_commands.json) that exposes enough data to run the algorithms we need, without exposing so much that you have to hold the whole graph in memory. It could be backed by in-memory depscan results, or a build-system artifact, or a live query of the build system.

Yeah, I agree performance is a key point. In the current design, the ModulesDependencyGraph is only visible to ModulesManager, so I feel we've left room for further optimization, though I don't have concrete ideas yet. The current patch exposes only 5 interfaces to outer users (currently only TUScheduler): UpdateNode(PathRef), isReadyToCompile(PathRef), HasInvalidDependencies(PathRef), addCallbackAfterReady(PathRef, std::function<void()>) and GenerateModuleInterfacesFor(PathRef). These don't leak any internal data to outer users. So while it is fine to adopt ideas that enhance scalability, I feel it is OK to move forward now, since we've left room for further optimization. (Premature optimization is the root of all evil. :))
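
For reference, a minimal sketch of that surface, reconstructed from the list above (the exact signatures in the patch may differ):

#include "llvm/ADT/StringRef.h"
#include <functional>

using PathRef = llvm::StringRef; // clangd's alias for file-path parameters

class ModulesManager {
public:
  // Re-scan File and update the module dependency graph.
  void UpdateNode(PathRef File);
  // True if the BMIs of all of File's dependencies are already built.
  bool isReadyToCompile(PathRef File);
  // True if building some dependency of File failed.
  bool HasInvalidDependencies(PathRef File);
  // Invoke Callback once File's dependencies are ready (or have failed).
  void addCallbackAfterReady(PathRef File, std::function<void()> Callback);
  // Build BMIs for all of File's (transitive) module dependencies.
  void GenerateModuleInterfacesFor(PathRef File);
};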

For "every time we update a file, all the affected files (e.g., we changed a header file) will be re-scanned" - we need a way to express this more abstractly than "re-scanned", and more narrowly than "all the affected files" - at least it needs to be limited to all the files that could themselves affect any open file.

Yeah, actually this patch doesn't make that explicit. Currently we achieve it implicitly by having TUScheduler::update (which is called every time the TU changes) call ModulesManager::UpdateNode(PathRef). What I want to do is to let ModulesManager::UpdateNode(PathRef Node) call TUScheduler::update for the users of Node.

So here "re-scanned" means to call TUScheduler::update. And "all the affected files" is not reflected in the patch actually. Do you have suggestions to improve this or do you feel it is OK to make it implicitly as now?

Destroyerrrocket requested changes to this revision. Jun 18 2023, 5:15 AM
Destroyerrrocket added inline comments.
clang-tools-extra/clangd/ModulesManager.cpp
414–415

This is a bug; the second move is invalid. You could make a copy.

This revision now requires changes to proceed. Jun 18 2023, 5:15 AM
ChuanqiXu updated this revision to Diff 532502. Jun 18 2023, 7:20 PM

Address comments.

clang-tools-extra/clangd/ModulesManager.cpp
414–415

Done. Thanks for looking at this. I changed it to a new signature for the callbacks with a bool argument.

clang-tools-extra/clangd/ModulesManager.cpp
414–415

No problem! I'd love to help :)

Destroyerrrocket accepted this revision. Jun 19 2023, 4:40 AM
This revision is now accepted and ready to land. Jun 19 2023, 4:40 AM
nridge resigned from this revision. Jul 11 2023, 11:34 PM

I'm sorry, I feel like I don't have a good enough level of insight into the requirements to usefully critique this patch, nor the bandwidth to take a detailed look through the implementation right now.

I think it's best for me to resign as reviewer for the time being, and leave the review in Sam's capable hands.

Sam, I hope this is ok; if you'd like a second opinion on any particular point, or a second pair of eyes on the implementation, please feel free to re-add me and I will do my best to make time to weigh in.

ilya-biryukov requested changes to this revision. Aug 7 2023, 9:25 AM

Hi @ChuanqiXu,

Sam is on vacation now, but we are in the same team and I am responding on behalf of the team.

First, sorry for not getting to this review earlier.
The change is quite big, and, as Sam mentioned, we want to make sure this scales to our (unreasonable) project sizes and supports header modules.
That is to say, we feel that the stakes are high for getting it right.

In order to move forward with this change, we need to do some homework to figure out the longer-term strategy for modules in Clangd.
We plan to come back with answers, but it may take a few months because planning this urgently is tough.
I do want to stress that we want to get modules in Clangd right, and we plan to work on it.

Sorry for these delays and confusion.

This revision now requires changes to proceed. Aug 7 2023, 9:25 AM

Got it. Being patient is not bad, and it is good enough to know that we are moving forward : )

BTW, I have a question about supporting header modules (including clang header modules and C++20 header units) in static analysis tools (including clangd): why can't we simply fall back to including the corresponding headers? So I still feel it is not a problem to support header modules in static tools, since I feel the semantics of header modules are almost transparent to other tools.

BTW, I have a question about supporting header modules (including clang header modules and C++20 header units) in static analysis tools (including clangd): why can't we simply fall back to including the corresponding headers? So I still feel it is not a problem to support header modules in static tools, since I feel the semantics of header modules are almost transparent to other tools.

We could, and it mostly works. In fact, this is what we do internally at Google at the build-system level, since our source tools don't support header modules but the build does.
There are problems with this approach, though:

  • Modules provide build-time improvements that add up if they are widely used in the codebase. This leaves source tools much slower than builds, which results in slowness and timeouts when source tools are used.
  • There are still incompatibilities between header modules and the preprocessor, and we occasionally get code that builds but that source tools don't work on.

I think we should be open to special-casing the header modules if they turn out to be hard to support compared to C++20 Modules.
However, my intuition is that this would not be the case: the major problems are actually quite similar between the two (versioning of the PCMs we consume, when PCMs need to be rebuilt, etc.), and we get header modules almost for free if we solve those problems for C++20 Modules.
But we do want to make sure we don't miss anything important for header modules in the design phase as that's actually something used by Google today and it would be valuable to solve the aforementioned problems.

ChuanqiXu added a comment (edited). Aug 8 2023, 7:35 PM

Got it. The explanation makes sense. A well-designed and scalable solution is what I (and probably everyone) want.

Then, in terms of expectations, the difference between supporting header modules and C++20 named modules from the user's side may be:

  • Not supporting header modules in clangd is mainly a speed issue, plus some corner cases.
  • Not supporting C++20 named modules in clangd makes it pretty hard for users to use named modules in practice at all.

In other words, it may not be super bad not to support header modules in clangd, but it is super bad for named modules. So I am wondering if we can set an expected time point for supporting named modules (even only experimentally) in clangd. For example, would it be a reasonable expectation to have named modules support in clangd in clang18? Clang18 will be released in the first week of March 2024, which is still roughly 7 months away, and I guess that time span may be sufficient. How do you feel about this? If we reach consensus on that, we will need to move forward from this patch if we don't have a solution by December at the latest, since it takes time to review and experiment with this further.

BTW, I don't mind someone else taking this job over completely, as long as we can get the support in time. I am not a clangd developer : )

For example, would it be a reasonable expectation to have named modules support in clangd in clang18? Clang18 will be released in the first week of March 2024, which is still roughly 7 months away, and I guess that time span may be sufficient. How do you feel about this? If we reach consensus on that, we will need to move forward from this patch if we don't have a solution by December at the latest, since it takes time to review and experiment with this further.

BTW, I don't mind someone else taking this job over completely, as long as we can get the support in time. I am not a clangd developer : )

Setting some deadlines totally makes sense, and the Clang 18 release looks like a very reasonable target.
Give us a few days to see if we can set this up. I hope to be back with concrete commitments by the end of this week or early next week.

Sorry for the long radio silence here. There's a lot to chew on, and I put it off too long. Thanks for your patience!

I agree we should get experimental modules support landed in some form on the LLVM 18 timeline.
It's fine if this doesn't have extension points for large projects, or for header modules, and if it'll take some refactoring later to get there. Still, I think we should consider them in the architecture (partly because a second impl helps reason about interfaces).

I do think it's important we play nicely with the rest of clangd infrastructure: things like threading/context, VFS clean-ness, ability to easily write gtests for modules-enabled ASTs.

This initial pass probably mostly comes across as a list of complaints... I'd like to be more constructive & provide ideas, but this is long already so I'll follow up with that.

I've tried to limit to mostly high-level comments (or at least details that require a bit of thought!)
A couple more that don't fit anywhere:


Indexing

Much of clangd's functionality relies on having an up-to-date index of the transitively included headers (and now, modules).
While we're more relaxed about having a full-project index, this preamble index really is essential.

It's infeasible to build this index from the PCM files themselves: this means fully deserializing them and is too expensive (we tried this with PCH).
Traditionally we indexed the preamble's ASTContext before serializing the PCH, though recently this is optionally async.
(You can see this in onPreambleAST in ClangdServer.cpp and trace from there.)

I think indexing can be left out of this initial patch, but we should work out where/how it fits and leave a comment!


Scope and incremental development

There are a lot of moving pieces to this. That's useful to understand how they all fit together (which is necessary whether they're in the same patch or not).
However it makes it difficult to review, and to drive the review to a conclusion. I think we should change the scope of the initial patch to establish a small subset that works and that we understand well:

Don't attempt any cross-file or cross-version coordination: i.e. don't try to reuse BMIs between different files, don't try to reuse BMIs between (preamble) reparses of the same file, don't try to persist the module graph. Instead, when building a preamble, synchronously scan for the module graph, build the required PCMs on the single preamble thread with filenames private to that preamble, and then proceed to build the preamble.
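
A rough sketch of that per-preamble flow (every type and helper below is a hypothetical placeholder, just to show the shape of the suggestion):

// 1. Synchronously discover the module graph for this TU; 2. build the
// required PCMs on this single preamble thread, dependencies first, into
// a directory private to this preamble; 3. only then build the preamble.
PreambleData buildPreambleWithModules(PathRef MainFile) {
  ModuleGraph Graph = scanModuleGraphFor(MainFile);
  for (const ModuleNode &M : Graph.topologicalOrder())
    buildBMI(M, privateModuleDirFor(MainFile));
  return buildPreamble(MainFile);
}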

Obviously this no longer provides any performance benefits from modules! But it gets us to a modules-based build that works, and these aspects can be added on top in a relatively clear way.
It avoids a lot of tricky problems:

  • scheduling
  • (lack of) transactionality of reads/writes from the module graph
  • the possibility of BMI version conflicts
  • various events that can invalidate the module graph
  • rebuilding of BMIs when source files change (the current patch deliberately doesn't do this to keep the scope down, but that means the result is both complex and incorrect, which is a difficult situation to get out of)

I think this also encourages us to unbundle some of the "ModulesManager"'s responsibilities and will lead to a better API there.

Do consider the lifecycle/namespacing of BMI files on disk

While we may eventually want to reuse BMIs between (sequential or concurrent) clangd instances, this is complicated and IMO won't happen soon (needs to deal with source versioning, I believe).

Therefore for now we should treat the BMIs as transient, conceptually like in-memory objects. In practice we want to store them on disk for size reasons, but these should have transient tempfile names that will never conflict, should be deleted on shutdown etc. (It's fine to do basic delete-on-destruction and leave crash-handling etc out for now).
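
A minimal sketch of the delete-on-destruction idea using LLVM's filesystem helpers (the class itself is hypothetical):

#include "llvm/ADT/SmallString.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/FileSystem.h"

// Owns a uniquely named on-disk .pcm, removed when the owner goes away;
// unique temp names mean concurrent clangd instances never conflict.
class TransientBMI {
public:
  TransientBMI() {
    // Error handling elided for brevity; real code must check this.
    (void)llvm::sys::fs::createTemporaryFile("clangd-module", "pcm", Path);
  }
  ~TransientBMI() { (void)llvm::sys::fs::remove(Path); }
  llvm::StringRef path() const { return Path; }

private:
  llvm::SmallString<128> Path;
};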

clang-tools-extra/clangd/ModulesManager.cpp
50

the docs suggest this tool is going to spawn threads.

clangd needs to retain control of thread spawning, e.g. to ensure clangd::Context is propagated (required for config, tracing and others), thread priorities are set appropriately etc.

55

It looks like this performs IO for the dependency scanning.
The files may not be on a real filesystem, IO on source files should go through the configured VFS (see ThreadsafeFS from which you can obtain a concrete FS for the desired operation).

56

it seems like the library under some circumstances expects to be able to write output to physical disk, and choose the path to do so (MakeformatOutputPath?)

What purpose does this serve here, and is it possible to avoid it?

While potentially it's OK to write to a temporary directory, we need to ensure concurrent clangd instances don't conflict, the files get eventually cleaned up after crashes, we're not writing too much in environments without physical disks etc.

106

note that clangd invocations are not necessarily sequential, there may be multiple clangd instances operating on the same codebase at the same time.

So we either need to ensure this store can be shared across instances without races, or treat it as transient (separate dir for each clangd process, clean up on shutdown and on crash)

320

This seems to be the only place that BMI path is cleared, which will cause the BMI to be rebuilt next time it is needed, correct?

We need to rebuild the BMI every time the content of a module has changed, though: we need the exposed APIs to be up-to-date (and the SourceLocations etc).
Nothing seems to be doing this, so it seems we're going to keep using stale copies of headers forever.

(CDB.watch isn't anywhere close to sufficient here: it's best effort, it's async, and it only tells us when compile flags changed, not when the file content did).

525

this doesn't seem like a valid thing to assert here, or anywhere

we don't hold the lock at this point, so the graph may have changed since whatever precondition the caller established.
In particular, it may now have invalid dependencies.

(This seems like a manifestation of the idea that we need some idea of snapshots in the graph and versioning of the BMIs we create)

561

How do we handle the following scenario?

  1. we get a call to GenerateModuleInterfacesFor("a.cpp"), where a.cpp => X => Z
  2. we recursively build Z.pcm#1 and X.pcm and we're just about to return...
  3. now somehow we find out that Z changed (and clear its BMI path)
  4. and we get a call to GenerateModuleInferfacesFor("b.cpp"), where b.cpp => Y => Z
  5. now we recursively build Z.pcm#2 and Y.pcm
  6. finally the original call to GenerateModuleInterfacesFor("a.cpp") returns

  • In this end state, we have modules for X and Z, but they can't be used together: X.pcm was built against Z.pcm#1 but we have Z.pcm#2. So a.cpp produces a mysterious PCM-related error.
  • If we solve this by invalidating X's BMI path when we invalidate Z's, then we just end up with a.cpp failing to compile with a missing PCM.
  • if we solve this by having GenerateModuleInterfacesFor basically loop until everything is up to date, then we stop guaranteeing forward progress when the system is busy, and *still* race because things can get out of date as soon as we return

I believe solving this class of problem probably means explicitly accounting for versions, and keeping refcounted versioned PCMs that are pinned while we're using them.
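
A sketch of that pinning idea (hypothetical types, not a concrete API proposal): each build of a module yields a new immutable version, and consumers keep the exact versions they were built against alive via shared ownership.

#include <memory>
#include <string>
#include <vector>

// One immutable build of one module. Rebuilding a module creates a new
// object with a fresh PCM path; existing versions are never mutated.
struct BuiltModuleVersion {
  std::string ModuleName;
  std::string PCMPath; // unique per version, e.g. Z-<version>.pcm
  // Pin the exact dependency versions this PCM was built against.
  std::vector<std::shared_ptr<const BuiltModuleVersion>> Deps;
};
// In the scenario above, a.cpp would hold shared_ptrs to X and (through
// X's Deps) Z#1, so building Z#2 for b.cpp cannot invalidate a.cpp's set.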

clang-tools-extra/clangd/ModulesManager.h
28

these details seem unused outside the cpp file (and untested), so should be in the cpp file - this would make it easier to understand which part of this file is the interface

56

Concretely, for the implementation where we scan the whole project up front, the data structure here can be an in-memory graph.

However rather than having fine-grained methods to query attributes of the graph, I think we should work out which high-level/coarse-grained queries we should support.

The current API leads to locking/unlocking the graph many separate times to obtain/update different information, and it's very difficult to reason about whether the behavior is correct if the graph changes in between. Coarse-grained single queries can be made atomic and queried once.

This will also result in an interface much more suitable for other implementation strategies, such as querying a build system that understands modules. (We will need to have this available as an extension point, as scanning the whole project does not scale to larger codebases)

116

I'd like to choose a different name than "third-party", which often (at least inside LLVM and inside google) refers to "vendored" libraries where we *do* always have headers and often build entirely from source.

Moreover, I think at least initially we should not support these in any way, and ensure we produce an error for the import statement.
In the majority and baseline case, the compiler may not be clang, and is probably at least a different version with incompatible BMI files.
(There may eventually be cases where we want to try making use of externally provided BMIs, but this is mostly orthogonal to third-party-ness, I think)

This means the signature here shouldn't need to call out third-party modules as special - these should be modeled the same way as a dependency we couldn't build a BMI for, for some other reason.

204

Generally "ModulesManager" is worryingly vague as a set of responsibilities, and this class looks like it does too much: provides threadsafe access to the module graph, performs actual scanning and BMI-building logic, and also acts as the scheduler.

I think we'll need to separate these concerns out into separate APIs, but it's probably best if we think about this later after understanding how to address more concrete questions.

clang-tools-extra/clangd/TUScheduler.cpp
867

This call to IsReadyToCompile seems to have a logic race:

  • we've previously queried the module manager as part of preparing the compile command
  • now we're querying it again based on the filename
  • what happens if the graph has changed significantly in the meantime?
871

making ModulesCV a member etc, exposing waitForModulesBuilt etc suggests we're going to wait on this for multiple threads. But I don't see that here.

If this is merely local, can we have:

Notification N;
addCallbackAfterReady([&] { N.notify(); });
N.wait();

or even just give ModuleMgr a blocking API?

874

we need to be able to handle shutdown while waiting for modules

874

here the ASTWorker is apparently blocking on all the modules being up-to-date (though the impl doesn't seem to actually achieve up-to-date-ness, I assume that's the intent).

This creates a big performance cliff where modifying a low-level header causes all features to stop working until everything that transitively depends on it is rebuilt. We used to have this cliff with preambles, though through extensive effort we've eliminated it via a separate preamble thread, tolerance for stale preambles, preamble patching etc.

I think the fix is roughly:

  • we assume imports are in the preamble (I think the language guarantees this idea works, though clang's actual preamble implementation may or may not support this right now)
  • the preambleworker should be blocking on modules being ready, not the astworker. This (usually) takes it off the critical interactive path
  • typically the preamble will now be a fairly small thing that's mostly references to PCMs
  • we need to ensure the PCMs survive (and aren't overwritten by newer versions) for as long as the preamble is alive and used. Some sort of owner class stored in PreambleData can achieve this.

(one day it may be possible to eliminate the use of preamble altogether in favor of some modules-based solution, with header modules inferred for non-modularized code. But obviously we can't do that until this is non-experimental, so the preamble is the best place for modules to live for now)

clang-tools-extra/clangd/test/modules.test
1

A smoke lit test is great, it's not realistic to achieve good test coverage of features this way though (and generally we don't try).

We'll need to work out how to get TestTU-based tests to work with modular builds.

@sammccall Hi, Sam. Thanks for your high-quality comments! They are valuable. All the low-level inline comments are helpful, but I didn't reply to them, given the direction suggested in the higher-level comments.

I'll restate your suggestion to make sure there's no misunderstanding:

While we should leave room for future development, we should do the following in the initial patch:

Don't attempt any cross-file or cross-version coordination: i.e. don't try to reuse BMIs between different files, don't try to reuse BMIs between (preamble) reparses of the same file, don't try to persist the module graph. Instead, when building a preamble, synchronously scan for the module graph, build the required PCMs on the single preamble thread with filenames private to that preamble, and then proceed to build the preamble.

Do I understand correctly? If so, I fully agree with the direction. We can go slowly, as long as we keep moving forward.

Then I'd like to leave this patch as-is for reference and create new patches following the suggestion.

Don't attempt any cross-file or cross-version coordination: i.e. don't try to reuse BMIs between different files, don't try to reuse BMIs between (preamble) reparses of the same file, don't try to persist the module graph. Instead, when building a preamble, synchronously scan for the module graph, build the required PCMs on the single preamble thread with filenames private to that preamble, and then proceed to build the preamble.

Do I understand correctly? If so, I fully agree with the direction. We can go slowly, as long as we keep moving forward.

Then I'd like to leave this patch as-is for reference and create new patches following the suggestion.

Yes, that's the suggestion, and that plan makes sense to me, thanks!

I did some more thinking about this (having a concrete implementation helps a lot!) and had a couple more thoughts.
At some point we should write down a design somewhere, need to strike a balance between doing it early enough to be useful but late enough that we've understood!

Dep scanning - roles

IIUC we do this for two reasons:

  • to identify what module names we must have PCMs for in order to build a given TU (either an open file, or a module we're building as PCM)
  • to build a database mapping module name => filename, which we compose with the CDB to know how to build a PCM for a given module name

I think it would be good to clearly separate these. The latter is simpler, more performance-critical, async, and is probably not used at all if the build system can tell us this mapping.
The latter is more complex, and will always be needed synchronously for the mainfile regardless of the build system.

Dep scanning - implementation

The dep scanner seems to work by getting the compile command and running the preprocessor. This is fairly heavyweight, and I can't see anywhere it's going into single-file mode - is it really reading all #included headers? This is certainly not workable for reparses of the mainfile (when no headers have changed).

It seems unnecessary: the standard seems to go to some lengths to ensure that we (almost) only need to lex the top-level file:

  • module and import decls can't be produced by macros (seems to be the effect of the pp-module directive etc)
  • module and import decls can't be #included (definition of module-file and [cpp.import] rules)

The wrinkle I see is that some PP usage is possible: the module name can be produced by a macro, and imports can be #ifdefd. I think the former is very unlikely (like #include MACRO_NAME) and we can simply not support it, and the latter will just result in us overestimating the deps, which seems OK.
You have more context here though, and maybe I'm misreading the restrictions placed by the standard. Today clang doesn't seem to enforce most of these sorts of restrictions, which is probably worth fixing if they're real.

(This doesn't apply to header modules: it's perfectly possible to include a textual header which includes a modular header, and it's impossible to know without actually preprocessing. This divergence is nasty, but I don't think we should pessimize standard modules for it).
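
As a toy illustration of the "only lex the top-level file" idea above (this is not the real scanner: it ignores comments, #if blocks, partitions and header units, and every name is illustrative):

#include <sstream>
#include <string>
#include <vector>

// Collect imported module names by looking only at top-level lines that
// begin with an import (or export import) declaration.
std::vector<std::string> scanTopLevelImports(const std::string &Code) {
  std::vector<std::string> Names;
  std::istringstream In(Code);
  for (std::string Line; std::getline(In, Line);) {
    size_t Begin = Line.find_first_not_of(" \t");
    if (Begin == std::string::npos)
      continue;
    std::string Tail = Line.substr(Begin);
    if (Tail.rfind("export import ", 0) == 0)
      Tail = Tail.substr(14);
    else if (Tail.rfind("import ", 0) == 0)
      Tail = Tail.substr(7);
    else
      continue;
    Names.push_back(Tail.substr(0, Tail.find(';'))); // module name before ';'
  }
  return Names;
}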

Interaction with preamble

At a high level, import decls should be processed with the preamble: they should change infrequently, rebuilding modules is expensive, coarse-grained work, we want to make the same policy decisions on whether to use stale PCMs or block on fresh ones etc.
However they don't appear in a prefix of the file, and this is pretty important to how the preamble action works, so exactly in what sense are they part of the preamble?

I believe the best answer is:

  • "preamble" is really a set of required artifacts + conditions to check for validity
  • import foo in a file means foo.pcm is a required artifact, not that preamble.pcm contains an ImportDecl

So given this code:

module;
#include <x>
export module foo;
import dep1;
module :private;
import dep2;

The "preamble region" should end after #include <x> and preamble.pcm should contain the AST & PP state up to that point.
Meanwhile dep1.pcm and dep2.pcm are separate PCM files that will be loaded on each parse.
For a preamble to be usable at all, we need to have built preamble.pcm, dep1.pcm, dep2.pcm.
For a preamble to be up-to-date, the preamble region + set of imported modules must be unchanged, the preamble.pcm must be up to date with respect to its sources, and the module PCMs must be up-to-date with respect to their sources. (unclear exactly how to implement the latter, may need to add extra tracking)

(When building this module as a PCM we may not want to treat dep2 as a dependency or parse the private fragment... but this is not relevant to preambles as we won't be using a preamble for this anyway)

Dep scanning - roles

IIUC we do this for two reasons:

  • to identify what module names we must have PCMs for in order to build a given TU (either an open file, or a module we're building as PCM)
  • to build a database mapping module name => filename, which we compose with the CDB to know how to build a PCM for a given module name

I think it would be good to clearly separate these. The latter is simpler, more performance-critical, async, and is probably not used at all if the build system can tell us this mapping.
The latter is more complex, and will always be needed synchronously for the mainfile regardless of the build system.

I think the second instance of "the latter" was meant to be "the former" :)

At a high level, import decls should be processed with the preamble: [...]
However they don't appear in a prefix of the file

This point is not obvious to me: do you mean that there are coding styles that place import statements further down in the file, after non-trivial declarations?

If not, what stops us from altering the definition of "preamble" to something like "everything before the first declaration which is not an import/module declaration"?

  • to identify what module names we must have PCMs for in order to build a given TU (either an open file, or a module we're building as PCM)
  • to build a database mapping module name => filename, which we compose with the CDB to know how to build a PCM for a given module name

I think it would be good to clearly separate these. The latter is simpler, more performance-critical, async, and is probably not used at all if the build system can tell us this mapping.
The latter is more complex, and will always be needed synchronously for the mainfile regardless of the build system.

I think the second instance of "the latter" was meant to be "the former" :)

Oops, yes!

At a high level, import decls should be processed with the preamble: [...]
However they don't appear in a prefix of the file

This point is not obvious to me: do you mean that there are coding styles that place import statements further down in the file, after non-trivial declarations?

Yes:

  • inside a module unit, imports must appear directly underneath the module declaration, above non-trivial declarations
  • but non-module TUs can place imports anywhere in the file (at TU scope)

If not, what stops us from altering the definition of "preamble" to something like "everything before the first declaration which is not an import/module declaration"?

We *have* to find and build *all* the PCMs that will be imported - unlike with the current preamble PCH, we can't give up on some of them and fall back to textual inclusion.

We could also have the preamble region cover the imports that happen to be at the top - i.e. the PCH would contain these ImportDecls. But I don't know that actually achieves anything meaningful over stopping the preamble before the imports. Either way, each time we use the preamble we'll have to load the imported PCMs.


I just realized my ideas for simplifying the dep scanning may not work: as well as named modules we could also have imports of header units, which presumably means we need to at least build an include path and stat files on it to resolve a header unit name :-(

Don't attempt any cross-file or cross-version coordination: i.e. don't try to reuse BMIs between different files, don't try to reuse BMIs between (preamble) reparses of the same file, don't try to persist the module graph. Instead, when building a preamble, synchronously scan for the module graph, build the required PCMs on the single preamble thread with filenames private to that preamble, and then proceed to build the preamble.

Do I understand correctly? If so, I fully agree with the direction. We can go slowly, as long as we keep moving forward.

Then I'd like to leave this patch as-is for reference and create new patches following the suggestion.

Yes, that's the suggestion, and that plan makes sense to me, thanks!

I did some more thinking about this (having a concrete implementation helps a lot!) and had a couple more thoughts.
At some point we should write down a design somewhere, need to strike a balance between doing it early enough to be useful but late enough that we've understood!

Yeah, then let's turn that page into a design-ideas discussion page.

Dep scanning - roles

IIUC we do this for two reasons:

  • to identify what module names we must have PCMs for in order to build a given TU (either an open file, or a module we're building as PCM)
  • to build a database mapping module name => filename, which we compose with the CDB to know how to build a PCM for a given module name

I think it would be good to clearly separate these. The latter is simpler, more performance-critical, async, and is probably not used at all if the build system can tell us this mapping.
The latter is more complex, and will always be needed synchronously for the mainfile regardless of the build system.

Yes, agreed.

Dep scanning - implementation

The dep scanner seems to work by getting the compile command and running the preprocessor. This is fairly heavyweight, and I can't see anywhere it's going into single-file mode - is it really reading all #included headers? This is certainly not workable for reparses of the mainfile (when no headers have changed).

It seems unnecessary: the standard seems to go to some lengths to ensure that we (almost) only need to lex the top-level file:

  • module and import decls can't be produced by macros (seems to be the effect of the pp-module directive etc)
  • module and import decls can't be #included (definition of module-file and [cpp.import] rules)

The wrinkle I see is that some PP usage is possible: the module name can be produced by a macro, and imports can be #ifdefd. I think the former is very unlikely (like #include MACRO_NAME) and we can simply not support it, and the latter will just result in us overestimating the deps, which seems OK.
You have more context here though, and maybe I'm misreading the restrictions placed by the standard. Today clang doesn't seem to enforce most of these sorts of restrictions, which is probably worth fixing if they're real.

(This doesn't apply to header modules: it's perfectly possible to include a textual header which includes a modular header, and it's impossible to know without actually preprocessing. This divergence is nasty, but I don't think we should pessimize standard modules for it).

There are some problems due to the complexities of the standard...

The takeaway may be:

  • module and import decls can't be produced by macros (seems to be the effect of the pp-module directive etc)
    • yes
  • module and import decls can't be #included (definition of module-file and [cpp.import] rules)
    • the module declarations can't be #included.
    • but import decls can be #included in some cases. See the discussion of https://github.com/llvm/llvm-project/issues/59688 for details. The explanation is:
      • the wording (http://eel.is/c++draft/cpp.import#3) is "If a pp-import is produced by source file inclusion (including by the rewrite produced when a #include directive names an importable header) while processing the group of a module-file, the program is ill-formed."
      • and the definition of module-file (http://eel.is/c++draft/cpp.pre#nt:module-file) is pp-global-module-fragment pp-module group pp-private-module-fragment.
      • so the phrase the group of a module-file only refers to the group in the definition of module-file literally. We can't expand the grammar.
  • the module name can be produced by a macro
    • yes
  • imports can be #ifdefd.
    • yes. And this is a pretty important use-case for using modules in practice.

So possibly we have to look into the #includes when scanning.

Interaction with preamble

At a high level, import decls should be processed with the preamble: they should change infrequently, rebuilding modules is expensive, coarse-grained work, we want to make the same policy decisions on whether to use stale PCMs or block on fresh ones etc.
However they don't appear in a prefix of the file, and this is pretty important to how the preamble action works, so exactly in what sense are they part of the preamble?

I believe the best answer is:

  • "preamble" is really a set of required artifacts + conditions to check for validity
  • import foo in a file means foo.pcm is a required artifact, not that preamble.pcm contains an ImportDecl

So given this code:

module;
#include <x>
export module foo;
import dep1;
module :private;
import dep2;

The "preamble region" should end after #include <x> and preamble.pcm should contain the AST & PP state up to that point.
Meanwhile dep1.pcm and dep2.pcm are separate PCM files that will be loaded on each parse.
For a preamble to be usable at all, we need to have built preamble.pcm, dep1.pcm, dep2.pcm.
For a preamble to be up-to-date, the preamble region + set of imported modules must be unchanged, the preamble.pcm must be up to date with respect to its sources, and the module PCMs must be up-to-date with respect to their sources. (unclear exactly how to implement the latter, may need to add extra tracking)

(When building this module as a PCM we may not want to treat dep2 as a dependency or parse the private fragment... but this is not relevant to preambles as we won't be using a preamble for this anyway)

Agreed, basically. To make it clear: I think what you proposed is to make clang::clangd::PreambleData contain the paths (or wrapping data structures) to dep1.pcm and dep2.pcm, and to have other parts of clangd interact with dep1.pcm or dep2.pcm through clang::clangd::PreambleData?
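
Something like the following shape, where the module fields are hypothetical additions (clangd's actual PreambleData has nothing like them today):

#include <memory>
#include <vector>

// Hypothetical owner of one built module file; destruction would remove
// the on-disk PCM (compare the transient-BMI discussion above).
class OwnedModuleFile;

struct PreambleDataWithModules {
  // ... existing PreambleData members (PCH handle, includes, etc.) ...
  // New: the module files this preamble was built against (dep1.pcm,
  // dep2.pcm, ...); shared ownership keeps them alive and un-replaced
  // for as long as any consumer of this preamble exists.
  std::vector<std::shared_ptr<const OwnedModuleFile>> RequiredModules;
};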

For a preamble to be up-to-date, the preamble region + set of imported modules must be unchanged, the preamble.pcm must be up to date with respect to its sources, and the module PCMs must be up-to-date with respect to their sources. (unclear exactly how to implement the latter, may need to add extra tracking)

For this issue (and the related ABA issue you gave inline comments about), my thought was that code intelligence doesn't have to meet super strict real-time requirements with respect to the compilation system. I mean, in my experience, seeing out-of-date results from code intelligence is much less bad than seeing them from the compilation system. (I know this may not be accepted.)

inside a module unit, imports must appear directly underneath the module declaration, above non-trivial declarations

Then this is not quite correct. With the explanation in the same link above (https://github.com/llvm/llvm-project/issues/59688), the following example may be valid:

// a.cpp
export module a;

// b.hpp
import a;

// c.cpp
module;
#include <b.hpp>
export module c;

@sammccall here is a question (or double check) about the intended initial version.

This is the requirement for the initial version:

Don't attempt any cross-file or cross-version coordination: i.e. don't try to reuse BMIs between different files, don't try to reuse BMIs between (preamble) reparses of the same file, don't try to persist the module graph. Instead, when building a preamble, synchronously scan for the module graph, build the required PCMs on the single preamble thread with filenames private to that preamble, and then proceed to build the preamble.

And all of us agree that managing the module files will be the preamble's job at the end of the day. I just want to ask, or double check: it might be OK not to tie the module files to the preamble in the initial version, right?

Since the definition (or description) of preamble is:

A preamble can be reused between multiple versions of the file until invalidated by a modification to a header, compile commands or modification to relevant part of the current file.

And it looks like that is beyond the scope of our requirements for the initial patch.

Aside: I've been doing some investigation into how modules+clangd could work in our huge monorepo (specifically bazel + distributed build cluster).
It looks feasible (with some serious effort) to get all BMI/index/etc data we need for transitive modules to be generated by a copy of clangd running in the distributed build system itself, and likely this will perform better than anything involving building in-process.
So it seems plausible to add a coarse-grained extension point for this, and assume that the "find-and-build-modules" part we're adding now doesn't have to scale to that size. (We'd still like it to scale to projects the size of Chromium, which is roughly 10x LLVM and our usual proxy for "large local project"). I think we can make some simplifying assumptions even beyond the experimental stage:

  • the module graph will fit in memory
  • the compilation database will fit in memory
  • the set of files in the project is enumerable
  • reading all the source files in the project is possible: takes minutes but not hours

@sammccall here is a question (or double check) about the intended initial version.

This is the requirement for the initial version:

Don't attempt any cross-file or cross-version coordination: i.e. don't try to reuse BMIs between different files, don't try to reuse BMIs between (preamble) reparses of the same file, don't try to persist the module graph. Instead, when building a preamble, synchronously scan for the module graph, build the required PCMs on the single preamble thread with filenames private to that preamble, and then proceed to build the preamble.

And all of us agree that managing the module files will be the preamble's job at the end of the day. I just want to ask, or double check: it might be OK not to tie the module files to the preamble in the initial version, right?

Since the definition (or description) of preamble is:

A preamble can be reused between multiple versions of the file until invalidated by a modification to a header, compile commands or modification to relevant part of the current file.

And it looks like that is beyond the scope of our requirements for the initial patch.

On the one hand, if building modules in the AST works and is simpler, sure we can delay the "preamble optimization" for them.

But I don't really expect this is the case:

  • as we've established, you can reach import statements inside the preamble region (#include a textual header that itself contains an import). So you need to build at least some of the required BMIs before building the preamble, doing it when parsing the main AST is not enough. (I believe the current version of the patch will just fail to parse some code in the preamble in this case).
  • rebuilding BMIs with every preamble (not trying to reuse/cache/version them yet) will be pretty slow. But if we rebuild them every main-file reparse (i.e. every keystroke!) I think this is going to be intolerably slow to the point of not even being useful as a prototype.
  • I only see two things that are really different between building during preamble vs building during AST, and neither are a lot of work:
    • invalidation: we need the preamble to be invalid if imports outside the preamble region have changed, or if the inputs of modules themselves have changed. But I think we can punt on all of this for now, and just not rebuild the preamble sometimes when we should. (Workaround: manually touch the preamble region after adding imports)
    • we need to attach the modules to PreambleData, rather than having them local to ParsedAST::build. But this really does not seem like much work at all - as we're not (yet) sharing modules through the filesystem, encapsulating the lifetime of a generated module in an object is something we should be doing right from the start anyway.
  • given this, adding the code to ASTWorker and then moving it to PreambleWorker later seems like it doesn't save anything much over doing it right in the first place

@sammccall @nridge while I am looking for the initial support for modules in clangd, I failed to find the mechanism to update files after I update a header file.

e.g., when I am opening the following file:

// a.cc
#include "a.h"
...

and there is a concurrent update to a.h. How can the ASTWorker of a.cc know such changes so that it can update the corresponding Preamble of a.cc?

In the comments of ClangdServer::reparseOpenFilesIfNeeded(), I see:

/// Requests a reparse of currently opened files using their latest source.
/// This will typically only rebuild if something other than the source has
/// changed (e.g. the CDB yields different flags, or files included in the
/// preamble have been modified).

So I thought this is what I want. However, I can't find a caller of reparseOpenFilesIfNeeded whose semantics match that behavior. The two callers of reparseOpenFilesIfNeeded I found are ClangdLSPServer::applyConfiguration() and ClangdLSPServer::onDocumentDidSave(), and neither of them matches the description "files included in the preamble have been modified".

So I want to ask what's the behavior when I update a header and where is the corresponding code. Thanks.

However, I can't find a caller of reparseOpenFilesIfNeeded whose semantics match that behavior. The two callers of reparseOpenFilesIfNeeded I found are ClangdLSPServer::applyConfiguration() and ClangdLSPServer::onDocumentDidSave(), and neither of them matches the description "files included in the preamble have been modified".

So I want to ask what's the behavior when I update a header and where is the corresponding code. Thanks.

I'm afraid onDocumentDidSave() is all we have for now. It detects changes to the header when editing the header in the client (when the header is saved). I don't believe we have a mechanism for detecting changes to the header made in other ways.

If/when we want to add such a mechanism, I think the way to do it is using didChangeWatchedFiles (there is some discussion there about why LSP recommends servers delegate file-watching to the client rather than implementing file-watching in the server).

However, I can't find a caller of reparseOpenFilesIfNeeded whose semantics match that behavior. The two callers of reparseOpenFilesIfNeeded I found are ClangdLSPServer::applyConfiguration() and ClangdLSPServer::onDocumentDidSave(), and neither of them matches the description "files included in the preamble have been modified".

So I want to ask what's the behavior when I update a header and where is the corresponding code. Thanks.

I'm afraid onDocumentDidSave() is all we have for now. It detects changes to the header when editing the header in the client (when the header is saved). I don't believe we have a mechanism for detecting changes to the header made in other ways.

IIUC, when we open a.cpp and b.h (there is no relationship between them), and we edit and save b.h, then the ASTWorker of a.cpp will receive a request to update a.cpp. Did I understand that correctly? I imagined there might be an early exit point in the path of ASTWorker::update or PreambleThread::update if we detect that the preamble hasn't actually changed, but I failed to find it.

If/when we want to add such a mechanism, I think the way to do it is using didChangeWatchedFiles (there is some discussion there about why LSP recommends servers delegate file-watching to the client rather than implementing file-watching in the server).

Got it. And I am wondering whether the reason we didn't implement it is that it's actually not so bad, since a user generally won't open too many tabs. Do I understand right? If that is the case, maybe we need to look at it in the future, since it may become a concern with modules.

@sammccall @nridge while I am looking for the initial support for modules in clangd, I failed to find the mechanism to update files after I update a header file.

e.g., when I am opening the following file:

// a.cc
#include "a.h"
...

and there is a concurrent update to a.h. How can the ASTWorker of a.cc know such changes so that it can update the corresponding Preamble of a.cc?

The PrecompiledPreamble structure (defined in clang, not clangd) consists of a handle to the *.pch file, and also a list of filenames that it was built from. We can test whether it's out of date by stat()ing the input files (PrecompiledPreamble::CanReuse).

Once upon a time, clangd used this in a simple way: the ASTWorker always called clangd::buildPreamble(inputs, old preamble) (https://github.com/llvm/llvm-project/blob/release/10.x/clang-tools-extra/clangd/Preamble.cpp#L101-L107), which would just return the old one if it was still valid.

For a while now we've had async preambles, which are more complicated but use the same underlying mechanism. Each file has an ASTWorker thread and a PreambleThread. When the ASTWorker thread wants to reuse the preamble, it notifies the PreambleThread "hey, maybe rebuild the preamble?" but meanwhile it charges on using the old stale preamble. The PreambleThread asynchronously performs the validity check, rebuilds the preamble if needed, and eventually informs the ASTWorker, which ensures the up-to-date preamble is eventually used.

This is a "pull" system: we only check if the preamble is valid when we tried to use it, i.e. when the main-file changed. If you just touch a header on disk but do nothing else, we won't rebuild either the main file or the preamble.

In the comments of ClangdServer::reparseOpenFilesIfNeeded(), I see:

/ Requests a reparse of currently opened files using their latest source.
/ This will typically only rebuild if something other than the source has
/ changed (e.g. the CDB yields different flags, or files included in the
/ preamble have been modified).

Because the above is a pull system, editing a header doesn't update e.g. diagnostics in files that include that header.
So this is a workaround: it requests all files to be rebuilt from current sources, so we pull on all preambles.
Then at every layer we try to ensure this does no work if nothing has changed :-)

In practice we call this in response to didSave (user edited+saved a header in their editor), we could potentially call it in response to changes on disk as Nathan suggested, and I think we have a custom integration somewhere that calls it when we know externally that compile flags have changed.

Got it. Thanks for your explanation. I can continue my experiment : )