This is an archive of the discontinued LLVM Phabricator instance.

[clang] refactor FileManager::GetUniqueIDMapping
AbandonedPublic

Authored by rmaz on Feb 6 2023, 9:22 AM.

Details

Summary

Refactor FileManager::GetUniqueIDMapping to populate an array of
OptionalFileEntryRefDegradesToFileEntryPtr instead of UID to
FileEntry *. This should allow the path serialization of input files
to use the paths used when looking up a file entry, instead of the
last reference.

Per the FileManager documentation, VirtualFileEntries is a subset of
SeenFileEntries. This means we only have to iterate over the latter
to create a complete UID mapping.

Diff Detail

Event Timeline

rmaz created this revision.Feb 6 2023, 9:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 6 2023, 9:22 AM
Herald added a subscriber: mgrang. · View Herald Transcript
rmaz requested review of this revision.Feb 6 2023, 9:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 6 2023, 9:22 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
rmaz updated this revision to Diff 495191.Feb 6 2023, 9:52 AM

Sort by UIDs, then Name

jansvoboda11 added inline comments.Feb 6 2023, 10:09 AM
clang/include/clang/Basic/FileManager.h
314

Since we're already modifying the two only users of this function, maybe we could use Optional<FileEntryRef> instead of OptionalFileEntryRefDegradesToFileEntryPtr (which we want to eventually remove)?

clang/lib/Basic/FileManager.cpp
623–625

Why is this necessary? IIUC, FileEntryRef has this logic in getBaseMapEntry(). I would expect FileEntryRef(Entry) to be correct regardless of what's in Value->V.

clang/lib/Serialization/ASTWriter.cpp
175

I'm afraid that iterating over file entries in "non-deterministic" order can cause non-determinism down the line. For example, calling HeaderSearch::findAllModulesForHeader() in the loop can trigger deserialization of HeaderFileInfo from loaded PCMs. That code fills the Module::Headers vectors, which we serialize without sorting first.

1918–1921

Nit: I'd slightly prefer:

if (A->getUID() != B->getUID())
  return A->getUID() < B->getUID();

return A->getName() < B->getName();

In general, I think this approach makes sense.

clang/lib/Serialization/ASTWriter.cpp
1917

Nit: Probably not necessary to take FileEntryRef by reference.

1931

This will insert duplicate entries (with the same key) into the on-disk hash table. I don't know if that's problematic, just thought I'd call it out.

rmaz added inline comments.Feb 6 2023, 10:34 AM
clang/include/clang/Basic/FileManager.h
314

We could do that with less churn once more or the called methods in the loop over the files are refactored in D142724, but I'm fine with either approach.

clang/lib/Basic/FileManager.cpp
623–625

I borrowed this from: https://github.com/llvm/llvm-project/blob/main/clang/lib/Basic/FileManager.cpp#L399-L408

I assumed this was necessary from @benlangmuir 's comment on D142780:

Also, it will never give you a vfs mapped path since it's skipping those (V.dyn_cast<FileEntry *>()).

Are you saying we should just do something like:

for (const auto &Entry : SeenFileEntries) {
    if (llvm::ErrorOr<FileEntryRef::MapValue> Value = Entry.getValue())
        Entries.push_back(FileEntryRef(Entry));
}

?

clang/lib/Serialization/ASTWriter.cpp
175

Do you think we should move the sorting into the getSeenFileEntries() method?

1931

Not sure I follow, why would we have FileEntrys with duplicate names?

jansvoboda11 added inline comments.Feb 6 2023, 11:02 AM
clang/lib/Basic/FileManager.cpp
623–625

Oh, okay, so the goal is to skip VFS-mapped file entries and only serialize the on-disk ones into PCMs. That makes sense given we do actual FileEntry comparison in HeaderFileInfoTrait::EqualKey() (instead of plain comparison of filenames that I assumed). I think that for VFS-mapped entries, we always have a separate element in the SeenFileEntries map that represents the on-disk file entry. So we should be skipping the VFS-mapped entries entirely instead of considering their remapping target entries again.

If my understanding is correct, we should do something like this:

if (llvm::ErrorOr<FileEntryRef::MapValue> Value = Entry.getValue())
  // Ignore VFS-mapped entries.
  if (Value->V.dyn_cast<FileEntry *>())
    Entries.push_back(FileEntryRef(Entry));
clang/lib/Serialization/ASTWriter.cpp
175

I think so, yes.

1931

Sorry, I had assumed we only do lookups in the on-disk hash table based on filenames (instead of full FileEntry comparison) and that for that reason, we should serialize all FileEntryRef into the PCM, even the VFS-mapped ones. But I now see that you're trying to only serialize the on-disk FileEntryRefs, which will not cause duplication. Ignore my comment here.

rmaz updated this revision to Diff 495259.Feb 6 2023, 1:01 PM

Don't return virtual file entries

rmaz marked 9 inline comments as done.Feb 6 2023, 1:02 PM
rmaz edited the summary of this revision. (Show Details)Feb 6 2023, 1:06 PM
jansvoboda11 added inline comments.Feb 6 2023, 5:13 PM
clang/include/clang/Basic/FileManager.h
314

Hmm, thinking about this more, it doesn't make sense for this function to return empty Optional<FileEntryRef>. Changing the parameter type to SmallVectorImpl<FileEntryRef> & and updating the clients to use . instead of -> is the most clear/correct option IMO.

I think the function name & documentation should point out that we're only returning files non-vfs-redirected files (and vfs-redirected files that don't use the external name).

clang/lib/Basic/FileManager.cpp
20

Already included in FileManager.h.

clang/lib/Serialization/ASTWriter.cpp
39

Already provided by FileManager.h.

jansvoboda11 added inline comments.Feb 6 2023, 5:14 PM
clang/lib/Basic/FileManager.cpp
623

Nit: I think dyn_cast() could be replaced by is().

rmaz updated this revision to Diff 495512.Feb 7 2023, 6:35 AM
rmaz edited the summary of this revision. (Show Details)

Use FileEntryRef, rename method

rmaz marked 4 inline comments as done.Feb 7 2023, 6:36 AM

This should allow the path serialization of input files to use the paths used when looking up a file entry, instead of the last reference.

Isn't this at odds with not having the VFS-mapped paths?

It's not obvious to me why we want these specific semantics. Elsewhere we have tried to preserve the virtual paths as well as the vfsoverlay files needed to interpret them. Is there a reason the current approach is better? I feel like there may be context here I'm lacking

rmaz added a comment.Feb 8 2023, 6:14 AM

This should allow the path serialization of input files to use the paths used when looking up a file entry, instead of the last reference.

Isn't this at odds with not having the VFS-mapped paths?

It's not obvious to me why we want these specific semantics. Elsewhere we have tried to preserve the virtual paths as well as the vfsoverlay files needed to interpret them. Is there a reason the current approach is better? I feel like there may be context here I'm lacking

My understanding for why we were dropping the virtual paths is that we would already have the on-disk path entries anyway, but @jansvoboda11 may know more.

The reason for the change from FileEntry to FileEntryRef is so we can produce deterministic pcm output regardless of if the files are regular files or hard links pointing to the same inode. Currently this can result in varying paths and numbers of inputs being serialized.

+1 for determinism. Okay I spent some time trying to understand how this code is used and the non-virtual paths make some sense now. I am a bit skeptical about this on-disk-hash-table by filepath but that's separate from this patch.

clang/lib/Serialization/ASTWriter.cpp
175

I think you can remove header_file_size.

rmaz updated this revision to Diff 497309.Feb 14 2023, 6:54 AM

remove header_file_size()

rmaz marked an inline comment as done.Feb 14 2023, 6:54 AM
rmaz abandoned this revision.Apr 11 2023, 7:15 AM