This is an archive of the discontinued LLVM Phabricator instance.

[clang][modules] Delay creating `IdentifierInfo` for names of explicit modules
ClosedPublic

Authored by jansvoboda11 on Oct 11 2021, 6:29 AM.

Details

Summary

When using explicit Clang modules, some declarations might unexpectedly become invisible.

This is caused by the mechanism that loads PCM files passed via -fmodule-file=<path> and creates an IdentifierInfo for the module name. The IdentifierInfo creation takes place when the ASTReader is in a weird state, with modules that are loaded but not yet set up properly. This patch delays the creation of IdentifierInfo until the ASTReader is done with reading the PCM.

Note that the -fmodule-file=<name>=<path> form of the argument doesn't suffer from this issue, since it doesn't create IdentifierInfo for the module name.

Diff Detail

Event Timeline

jansvoboda11 requested review of this revision.Oct 11 2021, 6:29 AM
jansvoboda11 created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptOct 11 2021, 6:29 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

It'd be great if someone could clarify why the extra IdentifierInfo would trip up name resolution.

Suggestions on naming the test file are also welcome.

It seems like using CachedModuleLoads wasn't quite right in the first place. I wonder if -fmodule-file=<pcm> could/should use a different map (maybe one that doesn't exist yet), instead of replacing the key in CachedModuleLoads.

It'd be great if someone could clarify why the extra IdentifierInfo would trip up name resolution.

My guess would be that the attribute parsing does a lookup of "Identifier" and finds the entity introduced by -fmodule-file instead of the one that's part of the declaration. Not sure though.

Are you sure this is only for Objective-C interfaces? Doesn't also happen if you name a module after a C++ class or something, and reference it from an attribute?

Suggestions on naming the test file are also welcome.

Doesn't really seem to be about visibility. Maybe module-name-used-by-objc-bridge?

clang/include/clang/Lex/ModuleMap.h
101–102 ↗(On Diff #378646)

I wonder, could -fmodule-file=... be using this map instead (already a StringMap), and the map below reserved for "real" identifiers?

104–106 ↗(On Diff #378646)

In builds with lots of fine-grained submodules, this will add many tiny allocations (one for each map entry). Maybe use StringMap<Module*,BumpPtrAllocator> to group them together?

Leave existing cache, create new cache indexed by module names.

Are you sure this is only for Objective-C interfaces? Doesn't also happen if you name a module after a C++ class or something, and reference it from an attribute?

From my experiments, I don't think this is specific to Objective-C interfaces. I can reproduce this with:

// Interface.h
struct Interface;

// module-name-used-by-objc-bridge.m
void fn(struct Interface*);

// without this patch:
warning: declaration of 'struct Interface' will not be visible outside of this function

// with this patch: success, no diagnostics

However, this seems to be specific to objc_bridge and friends (objc_bridge_mutable, objc_bridge_related). I'm not super familiar with attributes, but these seem to be the only ones that can accept an undefined type identifier.

Run clang-format.

dexonsmith accepted this revision.Oct 12 2021, 10:29 AM

I wonder if it would be more clear going forward to name this additional cache by its purpose, so it doesn't sound more general than it really is. As currently named, it might look like all top-level modules should be stored in that map instead of the IdentifierInfo map.

Once you update the naming and/or comments to avoid that potential pitfall, this LGTM.

clang/include/clang/Lex/ModuleMap.h
101–102 ↗(On Diff #378646)

Can you explain in the commit message why this map can't be expanded to serve this purpose?

104–106 ↗(On Diff #379003)

Would be good to say why this is separate from CachedModuleLoads. Maybe, "for when an IdentifierInfo isn't available".

Or, should this be called CachedExplicitModuleLoads and be more clear about the purpose?

  • Use this for explicit modules from external sources.
  • Use CachedModuleLoads as a cache for source references where there's an IdentifierInfo.
714–717 ↗(On Diff #379003)

If you change the name of the member, please change the API to match.

This revision is now accepted and ready to land.Oct 12 2021, 10:29 AM

Use clearer wording in member name & documentation, use std::string instead of StringRef for storing module name to solve lifetime issues (surfaced in CI).

I'm concerned that this is adding complexity to paper over a bug elsewhere. Creating an IdentifierInfo should never cause unrelated uses of that identifier to change their meaning or behavior. My guess is that there's a bug in how we resolve or merge identifier information in the AST reader.

I'm concerned that this is adding complexity to paper over a bug elsewhere. Creating an IdentifierInfo should never cause unrelated uses of that identifier to change their meaning or behavior. My guess is that there's a bug in how we resolve or merge identifier information in the AST reader.

Thanks for the information. You're right that the issue is most likely somewhere around ASTReader, not name resolution.

My understanding is that Preprocessor::getIdentifierInfo may trigger AST deserialization. The problem here seems to be that we're calling getIdentifierInfo before the preprocessor's IdentifierTable is set up with "external identifier info lookup" (aka ASTReader). So the call to IdentifierTable::get creates new IdentifierInfo for "Interface" instead of looking it up in the AST. Further calls to getIdentifierInfo resolve to this new IdentifierInfo and the ObjCInterfaceDecl never gets deserialized.

@rsmith, is my understanding correct? If so, I'm not sure there's another bug hiding here: it's just an unexpected series of events that should probably be prevented (hence this patch).

My understanding is that Preprocessor::getIdentifierInfo may trigger AST deserialization. The problem here seems to be that we're calling getIdentifierInfo before the preprocessor's IdentifierTable is set up with "external identifier info lookup" (aka ASTReader). So the call to IdentifierTable::get creates new IdentifierInfo for "Interface" instead of looking it up in the AST. Further calls to getIdentifierInfo resolve to this new IdentifierInfo and the ObjCInterfaceDecl never gets deserialized.

Interesting, that could explain what we're seeing. If that's it, then this is an invalidation bug: changing the presence or behavior of the external identifier resolver should trigger invalidation of all existing identifiers so that we will properly query the external source.

But I'm missing something from that explanation: the IdentifierTable::ExternalLookup is set in the constructor and never changed from that point onwards, so I don't think it's the case that the external lookup hasn't been set yet at the point where we create the identifier. And ASTReader::ReadAST invalidates all existing identifiers when we load a new AST file (search for the calls to setOutOfDate(true)), so if the lookup results for Interface change we should invalidate the identifier properly. Given the kinds of entities affected by this, I wonder if there is something incorrect in the way that ObjCInterfaceDecls are handled or some Objective-C-specific name lookup issue? (ObjC lookups certainly have some modules-specific bugs due to be implemented as raw DeclContext::lookup calls instead of using Sema's name lookup mechanism, but I think that only applies to qualified lookups.)

My understanding is that Preprocessor::getIdentifierInfo may trigger AST deserialization. The problem here seems to be that we're calling getIdentifierInfo before the preprocessor's IdentifierTable is set up with "external identifier info lookup" (aka ASTReader). So the call to IdentifierTable::get creates new IdentifierInfo for "Interface" instead of looking it up in the AST. Further calls to getIdentifierInfo resolve to this new IdentifierInfo and the ObjCInterfaceDecl never gets deserialized.

Interesting, that could explain what we're seeing. If that's it, then this is an invalidation bug: changing the presence or behavior of the external identifier resolver should trigger invalidation of all existing identifiers so that we will properly query the external source.

But I'm missing something from that explanation: the IdentifierTable::ExternalLookup is set in the constructor and never changed from that point onwards,

I was wrong about that. Not sure how I missed the setter :)

Nonetheless: ASTReader::ReadAST calls ASTReader::ReadASTBlock, which sets the ASTReader as the external identifier lookup source. Then, once the recursive deserialization is done, ASTReader::ReadAST marks any pre-existing identifiers as out-of-date.

Perhaps the problem is that the identifiers attached to these attributes are somehow being loaded from within ReadASTBlock, before they get invalidated? That'd explain what's going wrong. If that's right, then I think this patch would not fully fix the bug: you'd see the same thing if you loaded a module that used the identifier Interface, then triggered deserialization of that identifier so it was marked up-to-date, then loaded the problematic module from your testcase. The fix for that would be to ensure that ReadASTBlock doesn't deserialize these (or any) identifiers. But I don't see where ReadASTBlock is triggering a load of any identifiers either...

I've stepped through this in a debugger. The problem is that the CompilerInstance is setting up an ASTReaderListener that is informed whenever a module is loaded, and that listener is called when the AST file has been added to the module manager but hasn't had its header read yet. That listener then then attempts to create an IdentifierInfo, which won't work, because the ASTReader is in a weird state, with modules that are loaded but not yet set up properly. Here's a backtrace:

#5  0x0000000009ddff20 in clang::IdentifierTable::get (this=0xffb60e8, Name=...) at include/clang/Basic/IdentifierTable.h:600
#6  0x000000000a81d41c in clang::Preprocessor::getIdentifierInfo (this=0xffb5ef8, Name=...) at clang/include/clang/Lex/Preprocessor.h:1244
#7  0x000000000a819464 in (anonymous namespace)::ReadModuleNames::ReadModuleName (this=0xffc3100, ModuleName=...) at clang/lib/Frontend/CompilerInstance.cpp:573
#8  0x000000000ab3d0f6 in clang::ChainedASTReaderListener::ReadModuleName (this=0x1000bdd0, ModuleName=...) at clang/lib/Serialization/ASTReader.cpp:158
#9  0x000000000ab5dd42 in clang::ASTReader::ReadControlBlock (this=0xffd2fe0, F=..., Loaded=..., ImportedBy=0x0, ClientLoadCapabilities=0) at clang/lib/Serialization/ASTReader.cpp:2906
#10 0x000000000ab5ec50 in clang::ASTReader::ReadASTCore (this=0xffd2fe0, FileName=..., Type=clang::serialization::MK_ExplicitModule, ImportLoc=..., ImportedBy=0x0, Loaded=..., ExpectedSize=0, ExpectedModTime=0, ExpectedSignature=..., ClientLoadCapabilities=0) at clang/lib/Serialization/ASTReader.cpp:4572
#11 0x000000000ab6684b in clang::ASTReader::ReadAST (this=0xffd2fe0, FileName=..., Type=clang::serialization::MK_ExplicitModule, ImportLoc=..., ClientLoadCapabilities=0, Imported=0x0)

What actually fails here is that we've not yet loaded the information about the identifier table in the Interface module, so we mark that slot in the identifier resolver as being up to date even though it absolutely isn't.

So, given that ReadModuleName needs to be robust against being called when half-way through loading modules, when it's not safe to touch any part of the AST or identifier table, I agree that it shouldn't be calling getIdentifierInfo. (I wish we didn't have these incredibly fragile callbacks that are run at a time when it's not safe to do most things, but it's not clear that's really fixable.) But I think we can still avoid the two different forms of cache here. What do you think about reverting the changes to ModuleMap.h and localizing the fix to the ASTReaderListener:

--- a/clang/lib/Frontend/CompilerInstance.cpp
+++ b/clang/lib/Frontend/CompilerInstance.cpp
@@ -565,25 +565,28 @@ namespace {
 // the files we were handed.
 struct ReadModuleNames : ASTReaderListener {
   Preprocessor &PP;
-  llvm::SmallVector<IdentifierInfo*, 8> LoadedModules;
+  llvm::SmallVector<std::string, 8> LoadedModules;
 
   ReadModuleNames(Preprocessor &PP) : PP(PP) {}
 
   void ReadModuleName(StringRef ModuleName) override {
-    LoadedModules.push_back(PP.getIdentifierInfo(ModuleName));
+    // Keep the module name as a string for now. It's not safe to create a new
+    // IdentifierInfo from an ASTReader callback.
+    LoadedModules.push_back(ModuleName.str());
   }
 
   void registerAll() {
     ModuleMap &MM = PP.getHeaderSearchInfo().getModuleMap();
-    for (auto *II : LoadedModules)
-      MM.cacheModuleLoad(*II, MM.findModule(II->getName()));
+    for (const std::string &ModuleName : LoadedModules)
+      MM.cacheModuleLoad(*PP.getIdentifierInfo(ModuleName),
+                         MM.findModule(ModuleName));
     LoadedModules.clear();
   }
 
   void markAllUnavailable() {
-    for (auto *II : LoadedModules) {
-      if (Module *M = PP.getHeaderSearchInfo().getModuleMap().findModule(
-              II->getName())) {
+    for (const std::string &ModuleName : LoadedModules) {
+      if (Module *M =
+              PP.getHeaderSearchInfo().getModuleMap().findModule(ModuleName)) {
         M->HasIncompatibleModuleFile = true;
 
         // Mark module as available if the only reason it was unavailable
jansvoboda11 retitled this revision from [clang][modules] Stop creating `IdentifierInfo` for names of explicit modules to [clang][modules] Delay creating `IdentifierInfo` for names of explicit modules.
jansvoboda11 edited the summary of this revision. (Show Details)

Thanks for your analysis, Richard. You're right that we're incorrectly marking the identifier as up-to-date.

I've rolled back my changes to ModuleMap cache and delayed the creation of IdentifierInfo. I'll commit if CI is happy.