Per the FileManager documentation, VirtualFileEntries is a subset of SeenFileEntries. This means we only have to iterate over the latter to create a complete UID mapping, which in turn allows us to map each UID to a FileEntryRef instead of a FileEntry.
Iterating over SeenFileEntries and skipping VirtualFileEntries looks good to me.
I'm not sure about changing this to FileEntryRef though. The way this API works, you only get one entry per unique file, which is well suited to FileEntry *, since it has the same uniquing behaviour. In this case you're going to get a FileEntryRef, but *which* ref you get is non-deterministic if there were multiple refs for the same file (it depends on hash table iteration order). Also, it will never give you a VFS-mapped path, since it's skipping those (V.dyn_cast<FileEntry *>()).
I think if we want to change this to FileEntryRef it needs to be deterministic which ref you get.
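To make the concern concrete, here is a minimal sketch of one way the choice of ref could be made independent of hash table iteration order. It is a toy model, not clang code: ToyFileRef, ToySeenEntries and buildUIDMapping are hypothetical names, and the tie-break rule (keep the lexicographically smallest name per UID) is an assumption picked only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical stand-in, not a clang type: a name plus the UID of the file it
// resolves to. Several names may share one UID (e.g. hard links).
struct ToyFileRef {
  std::string Name;
  std::uint64_t UID;
};

// Toy model of a seen-entries table: keyed by name, iterated in an
// unspecified order.
using ToySeenEntries = std::unordered_map<std::string, std::uint64_t>;

// Build a UID -> ref mapping whose result does not depend on iteration order,
// by keeping the lexicographically smallest name for each UID.
// (Assumption: this tie-break rule is illustrative, not what clang does.)
std::map<std::uint64_t, ToyFileRef> buildUIDMapping(const ToySeenEntries &Seen) {
  std::map<std::uint64_t, ToyFileRef> Mapping;
  for (const auto &KV : Seen) {
    const std::string &Name = KV.first;
    const std::uint64_t UID = KV.second;
    auto It = Mapping.find(UID);
    if (It == Mapping.end()) {
      Mapping.emplace(UID, ToyFileRef{Name, UID});
    } else {
      assert(It->second.UID == UID);
      // Same file seen under another name: keep the smaller name.
      if (Name < It->second.Name)
        It->second.Name = Name;
    }
  }
  return Mapping;
}
```

Any total order over the refs would work equally well here; the point is only that the selection must not fall out of whichever ref the table happens to yield first.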
I think this might be the root of the problem we are seeing. Depending on the build configuration, our build inputs are sometimes hard links, and identical inputs then point to the same inode. In that case we see non-deterministic header paths serialized in pcm files. IIUC the header files are serialized based on their unique ID, so it wouldn't be possible to handle this case. Is that right?
If compiling a single pcm accesses multiple hard links with the same UID, then it would not be possible to use the set of UIDs to get the "right path". At best we could make it get a deterministic path -- e.g. if we tracked the order of access.
If compiling a single pcm accesses only one hard link but it's an implicit module build and the overall compiler invocation accesses another one with the same UID, you are currently in the same situation as above, though in theory you could avoid it by not sharing the FileManager across implicit module builds. In clang-scan-deps we don't share the FileManager and we get away with it because we have a separate filesystem caching layer. In a normal implicit modules build I don't know how much this would cost.
If you're on a platform that supports fcntl with F_GETPATH such as Darwin, another possible source of non-determinism is that the underlying OS may return a non-deterministic RealPath from openFileForRead when there are multiple hard links.
With you so far. Is there a reason we need to use the UIDs in this case? Would it be possible to refactor GetUniqueIDMapping to instead populate an array with all the FileEntryRefs that had been seen and serialize that instead?
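A rough sketch of what that could look like, as a toy model rather than a patch: every distinct name is recorded in first-access order, and the whole list is serialized, so two hard links with the same UID both survive. ToySeenRef, ToySeenRefLog, noteAccess and serialize are hypothetical names, and the tab-separated output format is an assumption made only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in, not a clang type: a name as it was requested, plus
// the UID of the file it resolved to.
struct ToySeenRef {
  std::string Name;
  std::uint64_t UID;
};

// Records every distinct name in the order it was first accessed. Refs are
// not collapsed by UID, so hard links to one inode each keep their own entry.
class ToySeenRefLog {
public:
  void noteAccess(const std::string &Name, std::uint64_t UID) {
    auto Result = KnownNames.emplace(Name, UID);
    if (!Result.second) {
      // Name already logged; in this toy model a name maps to one UID.
      assert(Result.first->second == UID);
      return;
    }
    Refs.push_back(ToySeenRef{Name, UID});
  }

  const std::vector<ToySeenRef> &refs() const { return Refs; }

private:
  std::vector<ToySeenRef> Refs;                              // access order
  std::unordered_map<std::string, std::uint64_t> KnownNames; // dedup by name
};

// Emit one "UID<TAB>Name" line per ref, in first-access order. The output
// depends only on the access sequence, not on hash table iteration order.
std::string serialize(const ToySeenRefLog &Log) {
  std::string Out;
  for (const ToySeenRef &R : Log.refs())
    Out += std::to_string(R.UID) + "\t" + R.Name + "\n";
  return Out;
}
```

The output is only as deterministic as the access sequence itself, so this helps only if the order in which files are opened is stable across builds.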