This is an archive of the discontinued LLVM Phabricator instance.

[clang][extract-api] Use relative includes
ClosedPublic

Authored by zixuw on Apr 14 2022, 7:52 PM.

Details

Summary

This patch transforms the given input headers to relative (angled)
include names using header search entries and some heuritics.
For example: /Path/To/Header.h will be included as <Header.h> with a
search path of -I /Path/To/; and
/Path/To/Framework.framework/Headers/Header.h will be included as
<Framework/Header.h>, given a search path of -F /Path/To.
Headermaps will also be queried in reverse to find a spelled name to
include headers.


Original Summary:

Proof-of-concept patch. Needs more work.
Transform input headers to relative (angle) includes.
getRelativeIncludeName walks HeaderSearchOptions::UserEntries to try
to find a relative include stem. Problems:

  • do we want the exact logic of HeaderSearch::suggestPathToFileForDiagnostics?
  • in ExtractAPIAction::PrepareToExecuteAction where the includes are transformed, we don't have a pre-processor yet, so we don't have access to a lot of useful information like HeaderSearch, headermaps, DirectoryLookup etc.
  • we might not always want to transform an absolute path because the resulting relative include name might get remapped in a headermap, for example in test known_files_only_hmap.c. But how does it work with modules where we need relative includes? Is the setup in known_files_only_hmap even valid?
  • reverse lookup in headermap when deciding if the symbol is coming from a knwon file: I think my current implementation is correct and necessary to solve the headermap problem, but might have problems due to the above point: an unknown file might be mapped to the transformed relative name of a known input.

I think the key problem is that, by passing in an input
/absolute/path/header.h, do we mean to process the actual content of
that particular header, or do we mean to use it just as a "lable" and
the thing we really want to process is whatever this lable points to
with the given search paths and headermaps?
In the example of frameworks, I think the latter is true: we want to
use DSTROOT to pass in the headers because DSTROOT has a defined
framework layout so that we can transform the paths to framework-style
relative includes by matching the layout pattern *.framework/Header/
and rely on headermaps/VFS/module/search paths to find the intended
headers.
In that sense, the known_files_only_hmap test is a really strange
setup with ambiguious intention and context.

Diff Detail

Event Timeline

zixuw created this revision.Apr 14 2022, 7:52 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 14 2022, 7:52 PM
zixuw requested review of this revision.Apr 14 2022, 7:52 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 14 2022, 7:52 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
zixuw added inline comments.Apr 14 2022, 8:25 PM
clang/lib/ExtractAPI/ExtractAPIConsumer.cpp
87

I still need to dig into this logic and see what kinds of special cases it's trying to handle, and do we really need these instead of simply File.startswith(Dir).

zixuw updated this revision to Diff 423011.Apr 14 2022, 9:00 PM

Add test case to demonstrate the framework case.

dang added inline comments.Apr 15 2022, 12:33 AM
clang/lib/ExtractAPI/ExtractAPIConsumer.cpp
65

This does rely on the path separator being "/". If stick with this regex, do we need to convert all paths to POSIX format? I think the best thing to do is to iterate through the given path components and match for just "(.+)\\.framework" to match just the framework name and the use the base name of the file directly instead of matching it via the regex as well...

200–201

Not sure that just accounting for the header map case is the correct thing to do here. Ideally we would like to know what the include string was, e.g. <MyFramework/MyHeader.h> and match that with how we included one of the original files. This would account for all remapping functionality.

we might not always want to transform an absolute path because the resulting relative include name might get remapped in a headermap, for example in test known_files_only_hmap.c. But how does it work with modules where we need relative includes? Is the setup in known_files_only_hmap even valid?

I think, in most cases, this shouldn't matter because if the header path input doesn't match the location stored in the header map, they should still have the same source content. The same should be true with header search resolution with modules & vfsoverlay

dang added a comment.Apr 19 2022, 5:48 AM

we might not always want to transform an absolute path because the resulting relative include name might get remapped in a headermap, for example in test known_files_only_hmap.c. But how does it work with modules where we need relative includes? Is the setup in known_files_only_hmap even valid?

I think, in most cases, this shouldn't matter because if the header path input doesn't match the location stored in the header map, they should still have the same source content. The same should be true with header search resolution with modules & vfsoverlay

Agreed, I think it would be classified as a user error to remap to different source content via headermap.

we might not always want to transform an absolute path because the resulting relative include name might get remapped in a headermap, for example in test known_files_only_hmap.c. But how does it work with modules where we need relative includes? Is the setup in known_files_only_hmap even valid?

I think, in most cases, this shouldn't matter because if the header path input doesn't match the location stored in the header map, they should still have the same source content. The same should be true with header search resolution with modules & vfsoverlay

Agreed, I think it would be classified as a user error to remap to different source content via headermap.

This is happening today in Xcode. Headers that are mapped from the DSTROOT back to the SRCROOT can be different, because they are not simply copied. Xcode also runs unifdef on them.

dang added a comment.Apr 21 2022, 6:57 AM

we might not always want to transform an absolute path because the resulting relative include name might get remapped in a headermap, for example in test known_files_only_hmap.c. But how does it work with modules where we need relative includes? Is the setup in known_files_only_hmap even valid?

I think, in most cases, this shouldn't matter because if the header path input doesn't match the location stored in the header map, they should still have the same source content. The same should be true with header search resolution with modules & vfsoverlay

Agreed, I think it would be classified as a user error to remap to different source content via headermap.

This is happening today in Xcode. Headers that are mapped from the DSTROOT back to the SRCROOT can be different, because they are not simply copied. Xcode also runs unifdef on them.

What do you mean? The headers in the DSTROOT don't have the ifdef? Either way, this shouldn't matter because the declarations we are processing would be the same across both versions of the header?

zixuw updated this revision to Diff 424959.Apr 25 2022, 10:37 AM
  • Use first match in getRelativeIncludeName
  • Try to get Preprocessor and reverse lookup headermaps in getRelativeIncludeName
  • Use getRelativeIncludeName to check for known files in LocationFileChecker
  • Misc fixes & adjustments

The problem is that we are trying to shorten the input file paths in PrepareToExecuteAction, where the CompilerInstance is still primal and doesn't even have a FileManager that we could use. That makes it hard (if possible at all) to reverse lookup headermaps and use the spelled names if we failed to find a search path prefix.
The impact is that we won't be able to get an angled include name for cases where an input doesn't reside in any of the normal search paths, but a headermap entry maps some name to that path. Or we might get the wrong include name where there are both headermap entries and search paths that match the input path and the ordering matters.

zixuw updated this revision to Diff 425017.Apr 25 2022, 1:27 PM
  • Create FileManager in PrepareToExecuteAction
  • Use FileManager to load headermaps and reverse lookup mappings
  • Use FileSystem to correctly get working directory and make absolute paths
zixuw added inline comments.Apr 25 2022, 3:33 PM
clang/lib/ExtractAPI/ExtractAPIConsumer.cpp
134

One problem I can see in this right now is that there might be multiple headermaps that together construct a chain of mapping, for example

// A.hmap
Header.h -> Product/Header.h

// B.hmap
Product/Header.h -> /Absolute/path/to/Header.h

Then this iteration would conclude on using Product/Header.h and missed the first mapping (if the command line goes clang -extract-api -I A.hmap -I B.hmap ...).

It's also possible that a headermap and a search path together resolve the header. For example:

//A.hmap
Header.h -> Product/Header.h

// Invocation:
clang -extract-api /Some/Path/to/Product/Header.h -I A.hmap -I /Some/path/to/

I think we need to keep records and iterate backwards on the search paths again to get the full reverse lookup

zixuw added inline comments.Apr 25 2022, 3:39 PM
clang/lib/ExtractAPI/ExtractAPIConsumer.cpp
134

Or, iterate once in reverse and find the last match? 🤔

zixuw added inline comments.Apr 25 2022, 5:25 PM
clang/lib/ExtractAPI/ExtractAPIConsumer.cpp
134

Another thing just came to me: with multiple headermaps chaining remaps, which is the "correct" way to include a header?

// A.hmap
Header.h -> Product/Header.h

// B.hmap
Product/Header.h -> /Absolute/path/to/Header.h

In the context of Darwin frameworks, we'd use <Framework/Header.h> so we don't want to follow the chain all the way backwards.
But without any context or build system rationales of why multiple headermaps with remap chains are generated in the first place, I'm feeling that we don't nearly have enough information for this if we're only talking about headermap as it is and refraining from adopting heuristics.

zixuw updated this revision to Diff 425083.Apr 25 2022, 5:48 PM
  • Rewrite commit message for preparation
  • Remove shortening paths based on the current working directory: it does not work with angled includes, and unnecessary for our use
  • XFAIL test known_files_only_hmap.c, as it is not a valid setup with a questionable headermap. Need to determine how to fix that test case or just discard it, as the new relative_include.m test also checks that symbols from external files are dropped
zixuw retitled this revision from [POC][WIP] Use relative include in extract-api to [clang][extract-api] Use relative includes.Apr 25 2022, 5:51 PM
zixuw edited the summary of this revision. (Show Details)
dang added inline comments.Apr 28 2022, 4:13 AM
clang/lib/ExtractAPI/ExtractAPIConsumer.cpp
134

Two things:

  • If we want an exhaustive search, then it would make sense to do what would actually happen in a compilation. Iterate forward and find all matches, then iterate forwards again with each of the matches from the previous step until we find all terminal matches and then heuristically pick the "best one" probably the shortest one.
  • The "correct" way of including a header would certainly be #include <Product/Header.h> for a Darwin framework. Nonetheless if the search paths and headermaps setup means that #include "Header.h" or #include <Header.h> is a possible way of getting to the same file I see no harm in doing that. As long as shortening a given file results in deterministic results it should work fine. I think it would be a user error if shortening a file path to say <Header.h> meant that including it would lead to a completely different file (that is with different declarations/definitions).
922

I think getRelativeIncludeName is slightly wrong. It doesn't seem like it accounts for matches specifically made with -iquote arguments. My understanding is that we should record that information at that point and include the file using the appropriate form #include "RelativeName" or #include <RelativeName>. It would probably be best if getRelativeIncludeName included the quote type in the string directly.

clang/test/ExtractAPI/known_files_only_hmap.c
-175

I think you can safely delete this test.

zixuw marked an inline comment as done.Apr 28 2022, 10:16 AM
zixuw updated this revision to Diff 426540.May 2 2022, 4:15 PM
  • Delete test known_files_only_hmap
  • Handle quoted includes
  • Attempt to fix Windows fails by converting backslashes before matching the framework regex
  • Update test relative_include
zixuw marked 4 inline comments as done.May 2 2022, 4:16 PM
zixuw updated this revision to Diff 426547.May 2 2022, 5:29 PM

Convert file path to use slashes for headermap reverse lookup

dang accepted this revision.May 3 2022, 8:53 AM

Minor comment LGTM otherwise

clang/lib/ExtractAPI/ExtractAPIConsumer.cpp
203

Should we be tracking if the include was quoted or not? In principle, that could mean a different redirect in header search and therefore potentially mapping to different content.

This revision is now accepted and ready to land.May 3 2022, 8:53 AM
zixuw updated this revision to Diff 426803.May 3 2022, 12:24 PM
  • Capture whether the include is quoted in KnownFiles
  • Misc: use getInputBufferName() instead of string literal
zixuw marked an inline comment as done.May 3 2022, 12:54 PM
dang accepted this revision.May 4 2022, 2:15 AM

LGTM

dang added a comment.May 4 2022, 3:35 AM

Since this is a new test can we use the approach in https://reviews.llvm.org/D124634 to check for diagnostics output.

zixuw added a comment.May 4 2022, 9:59 AM

Since this is a new test can we use the approach in https://reviews.llvm.org/D124634 to check for diagnostics output.

I used FileCheck to check the input buffer dump in this test, but I'll try to move to cc1 and fix the diagnostics check.

zixuw updated this revision to Diff 427061.May 4 2022, 10:27 AM

Update test case to use cc1 instead of driver

This revision was landed with ongoing or failed builds.May 4 2022, 10:28 AM
This revision was automatically updated to reflect the committed changes.
zixuw added a comment.May 4 2022, 10:45 AM

Accidentally added another test case in my local workspace. Removed in 5f841c71fc2cc77c92f526791cd7a938bcac69aa