This is an archive of the discontinued LLVM Phabricator instance.

[modules] Track how headers are included by different submodules.
Changes PlannedPublic

Authored by vsapsai on Jun 15 2021, 6:28 PM.

Details

Summary

Currently in a PCM file we are storing HeaderSearch state from
building an entire module. This is too coarse because for a non-modular
header isImport == true means it was imported by *some* header in the
module but doesn't capture if this module header is reachable itself. In
practice it can cause code

#import <Framework/Header.h>
#import <Unmodularized/Unmodularized.h>

to skip "Unmodularized.h" header because another "Framework/Invisible.h"
imported it. As the result declarations from "Unmodularized.h" aren't
accessible despite direct import.

Fix it by tracking isImport and NumIncludes per submodule including
a header. And on reading AST use the values only for visible submodules.

rdar://64773328

Diff Detail

Event Timeline

vsapsai created this revision.Jun 15 2021, 6:28 PM
vsapsai requested review of this revision.Jun 15 2021, 6:28 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 15 2021, 6:28 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
vsapsai updated this revision to Diff 352326.Jun 15 2021, 8:09 PM

Split out smaller parts as separate reviews.

vsapsai added a subscriber: Restricted Project.
vsapsai added a subscriber: iains.
vsapsai updated this revision to Diff 352572.Jun 16 2021, 3:45 PM

Test when a non-modular header *is* visible through a module import (mostly for completeness).

vsapsai updated this revision to Diff 352580.Jun 16 2021, 4:18 PM

Add a test to verify MarkFileIncludedFromModule does |= isImport instead of = isImport.

Ping. It would be useful to know if this approach makes sense. If so, I can test it on a bigger codebase and see if there are any problems.

If my understanding is correct, this patch only changes how we track HeaderFileInfo::{isImport,NumIncludes} and makes sure to store the necessary info into AST files. Right?

Can you explain how values of these members can prevent Clang from importing the header and how this interacts with visibility?

FYI, a clangd test is failing in CI.

If my understanding is correct, this patch only changes how we track HeaderFileInfo::{isImport,NumIncludes} and makes sure to store the necessary info into AST files. Right?

Yes, you are correct.

Can you explain how values of these members can prevent Clang from importing the header and how this interacts with visibility?

Currently, in AST we basically store the state of HeaderSearch after building a module. And isImport means "if any of the headers in the module has used #import", NumIncludes means "how many times all of the headers in the module have included the given header". So in the test case import-nonmodular.m after building the module 'Modularized' we have for the header Unmodularized.h isImport = true, NumIncludes = 1. So when we see #import <Unmodularized/Unmodularized.h> we decide not to enter the header because according to PCM data it was already imported once. But when we try to use UnmodularizedStruct, we cannot do that because it is present in Modularized.Invisible but unfortunately not visible.

So my fix is to track HeaderFileInfo::{isImport,NumIncludes} per including submodule. And when we encounter include/import in the source code, try to load ExternalHFI from ExternalSource, we use isImport and NumIncludes only from the modules that are [transitively] visible at this point. This way we don't use isImport and NumIncludes from irrelevant modular headers.

Another option is to treat unmodularized headers as submodules and update their visibility when we see #import <Unmodularized/Unmodularized.h>. But that's pretty complicated and I'm not sure it is possible when we have the same non-modular header included from different modules. Then we have multiple copies of the same fake submodule and it's not clear which one to use, which one to update.

jansvoboda11 accepted this revision.Aug 31 2021, 5:19 AM

I see, thanks for the explanation. This LGTM provided the failing clangd test gets fixed.

This revision is now accepted and ready to land.Aug 31 2021, 5:19 AM
vsapsai added inline comments.Sep 1 2021, 4:28 PM
clang/lib/Serialization/ASTWriter.cpp
1671

Note for future readers (including myself). Can save 2 bytes because not writing Data.HFI.NumIncludes but need 4 extra bytes for Data.HFI.ModuleIncludersMap stop value.

vsapsai updated this revision to Diff 370117.Sep 1 2021, 5:02 PM

Clarify some comments; rebase for a fresh pre-commit testing.

vsapsai retitled this revision from [modules] Track how headers are included by different modules. to [modules] Track how headers are included by different submodules..Sep 1 2021, 5:04 PM
vsapsai edited the summary of this revision. (Show Details)
rsmith added a comment.Sep 1 2021, 5:31 PM

I wonder if perhaps we're tracking this state in the wrong way. The "has been included" information for #pragma once / #import should behave exactly like macro definition visibility: it should be reset whenever we enter a new "clean slate" state and should be saved and restored suitably when switching visibility states. If we track, for each header file, a map from module to include information, then we'll be paying a fairly high cost: each time we see a #include we'll need to look up every currently-visible module in the map to see if the header was included in that module. It might be better to track an up-to-date list of the headers that have been either #imported in the current preprocessor state (including via imported modules) or that have been both #included and contain #pragma once. You could stash that in the Preprocessor::SubmoduleState so that it's suitably swapped out when we switch to a new preprocessing state.

clang/include/clang/Lex/HeaderSearch.h
115

This seems like a very expensive thing to include in HeaderFileInfo, which we may have many thousands of in some compilations. (I have similar concerns about Aliases FWIW.)

clang/include/clang/Serialization/ASTBitCodes.h
44

Do we really need to bump the version number? (Does this affect something we read before we read the Clang version number and reject on a version mismatch?)

clang/lib/Serialization/ASTReader.cpp
1913–1917

This doesn't do the right thing in local submodule visibility mode. We need to compute visibility locally from within HeaderSearch based on the currently visible modules at the point of the query instead.

That's a good point. Let me check how we track macros, I haven't thought about that approach. And I haven't considered using Preprocessor::SubmoduleState, was too excited HeaderSearch::ShouldEnterIncludeFile works correctly with the updated data.

clang/include/clang/Lex/HeaderSearch.h
115

My fault. The fact that the struct is tightly packed, should have been a hint.

clang/include/clang/Serialization/ASTBitCodes.h
44

Nope, there is no logic for VERSION_MAJOR. Just wanted to avoid old clang writing HeaderFileInfoTrait in one format and new clang reading it in the new format. But if that is resolved by Clang version number, no need for VERSION_MAJOR bump.

clang/lib/Serialization/ASTReader.cpp
1913–1917

Is it wrong from general local submodule visibility approach or do you have a test case in mind? Don't need a specific case, general direction can help to come up with a test case.

In practice it was working because we were reading this part of PCM file only when HeaderFileInfo was requested and that was at the point when modules had correct visibility. Let me try with multiple unmodularized headers, that can be broken and it can be a good test case on its own.

vsapsai planned changes to this revision.Sep 1 2021, 7:35 PM

Investigate less heavy-weight approaches similar to those used for tracking macros from different submodules.