This is an archive of the discontinued LLVM Phabricator instance.

[clang] Update ModuleMap::getModuleMapFile* to use FileEntryRef
ClosedPublic

Authored by benlangmuir on Oct 4 2022, 3:17 PM.

Details

Summary

Update SourceManager::ContentCache::OrigEntry to keep the original FileEntryRef, and use that to enable ModuleMap::getModuleMapFile* to return the original FileEntryRef. This change should be NFC for most users of SourceManager::ContentCache, but it could affect behaviour for users of getNameAsRequested such as in compileModuleImpl. I have not found a way to detect that difference without additional functional changes, other than incidental cases like changes from / to \ on Windows so there is no new test.

Note: this should fix the Windows failure on https://reviews.llvm.org/D134923 which was caused by an incidental getFile call mutating LastRef on the FileEntry.

Diff Detail

Event Timeline

benlangmuir created this revision.Oct 4 2022, 3:17 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 4 2022, 3:17 PM
Herald added a subscriber: arphaman. · View Herald Transcript
benlangmuir requested review of this revision.Oct 4 2022, 3:17 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 4 2022, 3:17 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
jansvoboda11 accepted this revision.Oct 4 2022, 5:52 PM

LGTM with a couple of small suggestions.

clang/include/clang/Basic/SourceManager.h
1066

Could you wrap F in OptionalFileEntryRefDegradesToFileEntryPtr to handle the conversion for you?

clang/lib/Basic/SourceManager.cpp
402

Nit: Could use auto.

clang/lib/Lex/ModuleMap.cpp
628

Nit: Use OptionalFileEntryRefDegradesToFileEntryPtr?

646

Nit: Use OptionalFileEntryRefDegradesToFileEntryPtr?

1028

Nit: Use OptionalFileEntryRefDegradesToFileEntryPtr? Maybe we should return that from getModuleMapFileForUniquing in the first place.

This revision is now accepted and ready to land.Oct 4 2022, 5:52 PM
bnbarham accepted this revision.Oct 5 2022, 9:51 AM
bnbarham added inline comments.
clang/include/clang/Basic/SourceManager.h
146–155

I haven't checked all uses, but do we still need all of OrigEntry/ContentsEntry/Filename? Seems like a single FileEntryRef should encapsulate all the information we need there.

1066

OrigEntry is a OptionalFileEntryRefDegradesToFileEntryPtr, so this should just be able to be return slot.getFile().getContentCache().OrigEntry?

clang/lib/Lex/ModuleMap.cpp
1028

I like the idea of returning OptionalFileEntryRefDegradesToFileEntryPtr until we move all this to use FileEntryRef as well.

benlangmuir edited the summary of this revision. (Show Details)
  • Used OptionalFileEntryRefDegradesToFileEntryPtr at two callsites
  • Attempted to fix Windows path / vs \ in test.
benlangmuir added inline comments.Oct 5 2022, 10:32 AM
clang/include/clang/Basic/SourceManager.h
146–155

Yes, see my change to the FIXME above. Before we can get rid of Filename, OrigEntry has to be non-Optional, which means the code that works with virtual buffers needs to use a virtual file as well. I don't think it's a huge deal, but it seemed orthogonal from the rest of this change. ContentsEntry is unrelated, as it is not used for its filename only for its contents.

1066

Correct, this was just a leftover from before I decided to make OrigEntry use OptionalFileEntryRefDegradesToFileEntryPtr.

clang/lib/Lex/ModuleMap.cpp
1028

I would prefer to keep returning Optional<FileEntryRef>. My reasoning is that OptionalFileEntryRefDegradesToFileEntryPtr should only be used on an API like this function when it would otherwise cause a lot of unnecessary workarounds at the callsites. For ContentCache::OrigEntry, for example, that is clearly the case. For getModuleMapFileForUniquing in my latest patch there are 11 total calls to this function, and only two of them would be improved by OptionalFileEntryRefDegradesToFileEntryPtr. On balance, think it's better to just put the degrading pointer at those 2 callsites and keep the API itself clean.

goncharov added subscribers: dexonsmith, goncharov.EditedOct 6 2022, 4:04 AM

That change might be problematic for content addressing storages. E.g. clang/test/Driver/cl-pch-showincludes.cpp started to fail in my setup as clang/test/Driver/header{0,1,3,4}.h are all identical and can be symlinked. cc @dexonsmith

That change might be problematic for content addressing storages. E.g. clang/test/Driver/cl-pch-showincludes.cpp started to fail in my setup as clang/test/Driver/header{0,1,3,4}.h are all identical and can be symlinked

What is the failure you're seeing? I would expect this change to make clang more consistent about symlinks, because it preserves which path was originally accessed. But maybe there's an edge case somewhere.

That change might be problematic for content addressing storages. E.g. clang/test/Driver/cl-pch-showincludes.cpp started to fail in my setup as clang/test/Driver/header{0,1,3,4}.h are all identical and can be symlinked. cc @dexonsmith

FWIW, I agree with Ben that this change seems like it should improve consistency for symlinked content. Knowing the failure mode / having reproduction steps would be helpful to track down your corner case!

(There are many subtle interactions around this stuff in clang so it’s hard to make forward progress.)

That change might be problematic for content addressing storages. E.g. clang/test/Driver/cl-pch-showincludes.cpp started to fail in my setup as clang/test/Driver/header{0,1,3,4}.h are all identical and can be symlinked. cc @dexonsmith

FWIW, I agree with Ben that this change seems like it should improve consistency for symlinked content. Knowing the failure mode / having reproduction steps would be helpful to track down your corner case!

(There are many subtle interactions around this stuff in clang so it’s hard to make forward progress.)

What I see in this clang/test/Driver/cl-pch-showincludes.cpp is that e.g. for run on :25 it now emits

Note: including file: ...header2.h
Note: including file: ...header1.h
Note: including file: ...header1.h

instead of

Note: including file: ...header2.h
Note: including file: ...header1.h
Note: including file: ...header3.h

as header3 -> header1. It's possbile to make it deterministic by making headers unique though.

That change might be problematic for content addressing storages. E.g. clang/test/Driver/cl-pch-showincludes.cpp started to fail in my setup as clang/test/Driver/header{0,1,3,4}.h are all identical and can be symlinked. cc @dexonsmith

FWIW, I agree with Ben that this change seems like it should improve consistency for symlinked content. Knowing the failure mode / having reproduction steps would be helpful to track down your corner case!

(There are many subtle interactions around this stuff in clang so it’s hard to make forward progress.)

What I see in this clang/test/Driver/cl-pch-showincludes.cpp is that e.g. for run on :25 it now emits

Note: including file: ...header2.h
Note: including file: ...header1.h
Note: including file: ...header1.h

instead of

Note: including file: ...header2.h
Note: including file: ...header1.h
Note: including file: ...header3.h

as header3 -> header1. It's possbile to make it deterministic by making headers unique though.

And can you share environment details?

  • What OS are you on?
  • Which test files are linked to which?
  • Are these symbolic links or hard links?
  • (etc.)

Maybe you can even share steps to reproduce your environment?

It's possbile to make it deterministic by making headers unique though.

Also, this is the first time you've mentioned non-determinism. Is this non-deterministic?

Okay, I was able to reproduce by symlinking all the 0-byte header files to header0 (any choice probably works). The behaviour is deterministic before and after my change.

This was only passing by luck in this setup, because it was relying on mutation of FileEntry->LastRef which mutates the path of header1 to header3. We do not actually track the difference in filename here, it just happened to match the behaviour without symlinks for this case because of the mutation. Note: the mutation is not triggered specifically by the include, it's anything that looks up that path in the file manager, so there is no guarantee that you would get header3 if something triggered accessing the filename header1 again at the wrong time.

I think my change is fine here and we should just update the test files so they will not be accidentally linked. If someone wants to work on tracking the filenames independently for each include that would be fine, but as long as we only have one name it should be the first one not the "one whose path was most recently accessed in FileManager".

@goncharov Was this the only test effected? If so, here's a fix: https://reviews.llvm.org/D135373

Okay, I was able to reproduce by symlinking all the 0-byte header files to header0 (any choice probably works). The behaviour is deterministic before and after my change.

This was only passing by luck in this setup, because it was relying on mutation of FileEntry->LastRef which mutates the path of header1 to header3. We do not actually track the difference in filename here, it just happened to match the behaviour without symlinks for this case because of the mutation. Note: the mutation is not triggered specifically by the include, it's anything that looks up that path in the file manager, so there is no guarantee that you would get header3 if something triggered accessing the filename header1 again at the wrong time.

I think my change is fine here and we should just update the test files so they will not be accidentally linked. If someone wants to work on tracking the filenames independently for each include that would be fine, but as long as we only have one name it should be the first one not the "one whose path was most recently accessed in FileManager".

@goncharov Was this the only test effected? If so, here's a fix: https://reviews.llvm.org/D135373

I believe so, thank you for the fix! (I was mostly trying to confirm that symlink setups will be fine)

I was mostly trying to confirm that symlink setups will be fine

I don't know if we promise this is supported anywhere, but I know we've made this kind of test change to force unique files several times in the past.

hans added a subscriber: hans.Oct 13 2022, 10:07 AM

Heads up that we're seeing a problem in Chromium that bisects to this change. (http://crbug.com/1373836)

In our case what's happening is that the compiler is getting the name of an included file wrong, which ends up in the debug info (and also shows in the preprocessor output). What makes it interesting is that this only reproduces on a remote build system, not when building locally.

I see there's discussion about symlinks of identical files, and I just realized that in our case the files are indeed identical. I haven't heard back from the build system folks, but I wouldn't be surprised if they symlink these against some content-addressed storage system.

hans added a comment.Oct 13 2022, 10:25 AM

Here's a reproducer of what's happening in our case. It's using the preprocessor because that's the easiest, but the real problem for us is that the debug info is pointing to the wrong file (and that it's different depending on the symlinkedness of the include).

$ cat /tmp/foo.h
#ifdef X
#undef X
int x;
#endif

#ifdef Y
#undef Y
int y;
#endif

$ ln -s /tmp/foo.h /tmp/bar.h

$ cat /tmp/a.cc
#define X
#include "foo.h"
#define Y
#include "bar.h"

$ build/bin/clang -E -o - /tmp/a.cc
# 1 "/tmp/a.cc"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 437 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "/tmp/a.cc" 2

# 1 "/tmp/foo.h" 1


int x;
# 3 "/tmp/a.cc" 2

# 1 "/tmp/foo.h" 1







int y;
# 5 "/tmp/a.cc" 2

Note that the second # 1 "/tmp/foo.h" 1 should really be bar.h.

hans added a comment.Oct 17 2022, 5:57 AM

The build system folks replied saying they're not using symlinks, but hard links for compiler inputs. Dropping the -s flag in the repro above demonstrates this.

Here's a new version of the reproducer that's a little less convoluted:

$ echo 1, 2, 3 > /tmp/foo.h
$ ln /tmp/foo.h /tmp/bar.h
$ cat > /tmp/a.cc
int foo_table[] = {
#include "/tmp/foo.h"
};
int bar_table[] = {
#include "/tmp/bar.h"
};
$ build/bin/clang -E -o - /tmp/a.cc
# 1 "/tmp/a.cc"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 437 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "/tmp/a.cc" 2
int foo_table[] = {
# 1 "/tmp/foo.h" 1
1, 2, 3
# 3 "/tmp/a.cc" 2
};
int bar_table[] = {
# 1 "/tmp/foo.h" 1                   <------ Should be bar.h.
1, 2, 3
# 6 "/tmp/a.cc" 2
};

The build system folks replied saying they're not using symlinks, but hard links for compiler inputs. Dropping the -s flag in the repro above demonstrates this.

I think this only happened to work before. If /tmp/a.cc changes to these file contents:

int foo_table[] = {
  #include "/tmp/foo.h"
};
int bar_table[] = {
  #include "/tmp/bar.h"
};
int foo2_table[] = {
  #include "/tmp/foo.h"
};

then I'm getting the following with my system clang:

# 1 "/tmp/a.cc"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 414 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "/tmp/a.cc" 2
 int foo_table[] = {
# 1 "/tmp/foo.h" 1
1, 2, 3
# 3 "/tmp/a.cc" 2
 };
 int bar_table[] = {
# 1 "/tmp/bar.h" 1
1, 2, 3
# 6 "/tmp/a.cc" 2
 };
 int foo2_table[] = {
# 1 "/tmp/bar.h" 1
1, 2, 3
# 9 "/tmp/a.cc" 2
 };

Note that foo2_table now points at bar.h but should point at foo.h. I presume after the present patch it'll point at foo.h (although I admit I didn't test).

The root cause is the same, and matches @benlangmuir's analysis in response to the testcase failure on Windows:

  • The FileManager is merging files based on their observed inode (regardless of symlink or hardlink).
  • Previously, filenames were reported based on the most recent access path: "last-ref".
  • Now, they're reported based on the original access path: "first-ref".
  • In neither case is it sound.

Here are a few solutions I can see:

  1. Change FileManager to only merge files with matching "realpath", seeing through symlinks but not hardlinks (could use observed inode collisions as a hint to do an lstat to avoid running an lstat every time). Then SourceManager will give a different FileID to these files.
  2. Change SourceManager to have different FileID to these files, even though they were merged by FileManager.
  3. Change the #include stack to track the accessed filename (the FileEntryRef) in addition to the SLoc, even though they have the same FileID.
  4. Require clients to keep contents distinct if they it matters (the (surprising) status quo).

What's in tree (and has been for a long time) is (4). (1) seems like the right solution to me.

As a heuristic on top of (4), "last-ref" (before this patch) probably gets the answer the user expects more often than "first-ref" (this patch) does, which I assume is why it was originally implemented. However, it's unpredictable and requires mutating a bunch of state where mutations are hard to reason about.

IMO, "first-ref" is easier to make sound, since it's an immutable property, so this patch is an incremental improvement; it makes the fuzzy modeling easier to observe but also perhaps more predictable. If we care about fixing hardlink include names ("correct-ref") we should fix them in all cases (ideally with (1)), but I wonder how urgent it is.

@hans, WDYT?

(catching up after I was away last week) I agree with @dexonsmith's analysis above. One historical note:

As a heuristic on top of (4), "last-ref" (before this patch) probably gets the answer the user expects more often than "first-ref" (this patch) does, which I assume is why it was originally implemented.

It was originally implemented this way because it works out more often for module-related files, and vfs overlays; debug info and preprocessor output of this kind was not considered. This is part of a workaround for lack of proper FileEntryRef support across the compiler that we should unwind over time.

hans added a comment.Oct 20 2022, 6:10 AM

The build system folks replied saying they're not using symlinks, but hard links for compiler inputs. Dropping the -s flag in the repro above demonstrates this.

I think this only happened to work before. If /tmp/a.cc changes to these file contents:

int foo_table[] = {
  #include "/tmp/foo.h"
};
int bar_table[] = {
  #include "/tmp/bar.h"
};
int foo2_table[] = {
  #include "/tmp/foo.h"
};

then I'm getting the following with my system clang:

# 1 "/tmp/a.cc"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 414 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "/tmp/a.cc" 2
 int foo_table[] = {
# 1 "/tmp/foo.h" 1
1, 2, 3
# 3 "/tmp/a.cc" 2
 };
 int bar_table[] = {
# 1 "/tmp/bar.h" 1
1, 2, 3
# 6 "/tmp/a.cc" 2
 };
 int foo2_table[] = {
# 1 "/tmp/bar.h" 1
1, 2, 3
# 9 "/tmp/a.cc" 2
 };

Note that foo2_table now points at bar.h but should point at foo.h. I presume after the present patch it'll point at foo.h (although I admit I didn't test).

Interesting, thanks!

The root cause is the same, and matches @benlangmuir's analysis in response to the testcase failure on Windows:

  • The FileManager is merging files based on their observed inode (regardless of symlink or hardlink).
  • Previously, filenames were reported based on the most recent access path: "last-ref".
  • Now, they're reported based on the original access path: "first-ref".
  • In neither case is it sound.

Agreed.

Here are a few solutions I can see:

  1. Change FileManager to only merge files with matching "realpath", seeing through symlinks but not hardlinks (could use observed inode collisions as a hint to do an lstat to avoid running an lstat every time). Then SourceManager will give a different FileID to these files.
  2. Change SourceManager to have different FileID to these files, even though they were merged by FileManager.
  3. Change the #include stack to track the accessed filename (the FileEntryRef) in addition to the SLoc, even though they have the same FileID.
  4. Require clients to keep contents distinct if they it matters (the (surprising) status quo).

What's in tree (and has been for a long time) is (4). (1) seems like the right solution to me.

As a heuristic on top of (4), "last-ref" (before this patch) probably gets the answer the user expects more often than "first-ref" (this patch) does, which I assume is why it was originally implemented. However, it's unpredictable and requires mutating a bunch of state where mutations are hard to reason about.

IMO, "first-ref" is easier to make sound, since it's an immutable property, so this patch is an incremental improvement; it makes the fuzzy modeling easier to observe but also perhaps more predictable. If we care about fixing hardlink include names ("correct-ref") we should fix them in all cases (ideally with (1)), but I wonder how urgent it is.

@hans, WDYT?

I feel I don't have enough background here to say whether (1) would work. Why is FileManager merging files in the first place?

This is not urgent (at least not for us), but it's a pretty surprising behavior.

One thought, which I'm not sure is relevant, is that this is only observable for headers which are included more than once, which is rare because normally there are include guards (or pragma once). isFileMultipleIncludeGuarded isn't tracked by the FileManager, but otherwise maybe that could have been a useful heuristic: don't merge header files without include guards?

hans added a comment.Nov 1 2022, 6:09 AM

Relatedly, we ran into a problem with clang-cl /showIncludes not including all files in its output when they're linked: https://github.com/llvm/llvm-project/issues/58726

One thought, which I'm not sure is relevant, is that this is only observable for headers which are included more than once, which is rare because normally there are include guards (or pragma once). isFileMultipleIncludeGuarded isn't tracked by the FileManager, but otherwise maybe that could have been a useful heuristic: don't merge header files without include guards?

Maybe that heuristic would work a level up, in SourceManager.

  • If multiple-include-guarded, create one FileID per FileEntry.
  • Else, create one FileID per FileEntryRef.

Although it gets complicated with language features like #import in Objective-C, where textual inclusion is implicitly multiple-include-guarded. Consider a file included both using #include (not guarded) and as #import (guarded).

And I'm not sure we really want to split the FileIDs... that seems like a potential performance regression. Instead, I think we just want to track (at the use site) how the file was referenced.

Relatedly, we ran into a problem with clang-cl /showIncludes not including all files in its output when they're linked: https://github.com/llvm/llvm-project/issues/58726

Interestingly, that discussion points out that -MD works correctly:

$ clang -MD -c /tmp/a.cc && cat a.d     
a.o: /tmp/a.cc /tmp/foo.h /tmp/bar.h

I presume this is because we've pushed FileEntryRef through dependency tracking. Just need to push it through more places, I think.

Relatedly, we ran into a problem with clang-cl /showIncludes not including all files in its output when they're linked: https://github.com/llvm/llvm-project/issues/58726

Interestingly, that discussion points out that -MD works correctly:

I commented on the issue about what's going on here.