This is an archive of the discontinued LLVM Phabricator instance.

[clang][Modules] Increase the Entropy of ModuleManager Map Keys
AbandonedPublic

Authored by CodaFi on Aug 14 2020, 10:55 AM.

Details

Summary

The ModuleManager's use of FileEntry nodes as the keys for its map of
loaded modules is less than ideal. Uniqueness for FileEntry nodes is
maintained by FileManager, which in turn uses inode numbers on hosts
that support that. When coupled with the module cache's proclivity for
turning over and deleting stale PCMs, this means entries for different
module files can wind up reusing the same underlying inode. When this
happens, subsequent accesses to the Modules map will disagree on the
ModuleFile associated with a given file.

It's fine to use the file management utilities to guarantee the presence
of module data, but we need a better source of key material that is
invariant with respect to these OS-level details. Using file paths alone
is a similarly frought solution because the ASTWriter performs a custom
canonicalization step (that is not equivalent to path canonicalization)
that renders keys from loaded AST cores useless for looking up cached
entries.

To mitigate the effects of inode reuse, increase the entropy of the key
material by incorporating the modtime and size. This ultimately
decreases the likelihood that a PCM that is swapped on disk will confuse
the cache, but it does not eliminate the possibility of collisions.

rdar://48443680

Diff Detail

Event Timeline

CodaFi created this revision.Aug 14 2020, 10:55 AM
CodaFi requested review of this revision.Aug 14 2020, 10:55 AM

Overall, looks like a reasonable approach to solve inode reuse. The problem with filenames is that they might not be canonicalized and the same file can be known by different filenames. I'm trying to think through the consequences in the following scenarios:

  • same name but file content has changed;
  • different names but refer to the same file.

It's good to imagine these attack vectors. But I think the module cache being a relatively fault-tolerant and compiler-controlled system mitigates a lot of the damage you could cause by a well-timed "attack" in these scenarios:

  • same name but file content has changed;

If there is a cache entry, the signature check that occurs after the lookup succeeds should catch most shenanigans. Assuming an attacker is able to craft a PCM with an equivalent signature to the victim PCM, and was able to time it such that the PCM were replaced after a subsequent read, you could definitely run into problems. But our "attackers" in most scenarios are usually other cc1 and swiftc invocations trying to build the same module, so we should see signature changes at least.

  • different names but refer to the same file.

Then we'll waste space in the cache, but this requires the ability to predict the layout of the module cache ahead of time. It shouldn't affect the consistency of the entries in the table to do extra work - assuming you don't combine this approach with the scenario described above.

I'd also note here that the InMemoryModuleCache is already using a StringMap keyed by file names for its PCM table. You can see this patch as a kind of harmonization between the two approaches.

CodaFi updated this revision to Diff 285868.Aug 15 2020, 3:57 PM
CodaFi updated this revision to Diff 285873.Aug 15 2020, 5:33 PM
aprantl added inline comments.
clang/include/clang/Serialization/ModuleManager.h
122

Is it literally the file name, or something like the absolute realpath? And just because I'm curious: Is this the name of the .pcm or of the module map file?

122

I just realized @vsapsai already asked the same question :-)

CodaFi updated this revision to Diff 285874.Aug 15 2020, 6:38 PM
CodaFi updated this revision to Diff 285875.Aug 15 2020, 7:26 PM
CodaFi added inline comments.Aug 15 2020, 7:35 PM
clang/include/clang/Serialization/ModuleManager.h
122

It's the file path the module cache has computed for the PCM. I could try to use the real path to the file, but I'm not sure how portable/stable that interface is relative to this one.

CodaFi added a subscriber: rsmith.Aug 15 2020, 7:49 PM

Okay, I'm done throwing revisions at the bots. This windows-only failure is bizarre. @rsmith Do you have any insight into what's going wrong here?

CodaFi updated this revision to Diff 286201.Aug 17 2020, 10:19 PM

Figured it out for myself. The test is forming paths that are using non-canonical path separators. Naively using those as keys means that the subsequent canonicalization done by the ASTWriter renders the keys useless for lookups into these structures. I'm going to switch to FileEntry::tryGetRealPathName since it's quite literally what ASTWriter is doing as part of its canonicalization phase. I worry about that as a solution in general though. In the future, it would be great to expose the canonicalization utilities in the ASTWriter as a more general kind of facility that could be shared between the implementations so we don't desync things again.

CodaFi updated this revision to Diff 286334.Aug 18 2020, 10:12 AM
CodaFi edited the summary of this revision. (Show Details)
CodaFi marked 2 inline comments as done.

Switched tactics here. Rather than just change the source of the entropy, let's increase it from just inodes to (64-bits of inode) plus (file size) plus (mod time). It is still possible to defeat this scheme, but it means an attacker would have to replace the PCM with one that has been padded out to the same size then backdate its modtime to match the one in the cache - or some cascading failure of the syscalls providing these data conspires to make this happen.

CodaFi updated this revision to Diff 286361.Aug 18 2020, 11:06 AM
aprantl added inline comments.Aug 18 2020, 1:39 PM
clang/include/clang/Serialization/ModuleManager.h
122

If it's the path to the .pcm there's no point in wasting time on realpath — there should only be one module cache path and we don't care where exactly it is on disk, and the paths inside the module cache ought to be unique anyway, because we just computed them. Thanks!

aprantl added inline comments.Aug 18 2020, 1:40 PM
clang/include/clang/Serialization/ModuleManager.h
62

Can you add a doxygen comment explaining why we compute our own hashing as opposed to using the FileEntry pointer?

62

Basically what you wrote in the description of the review...

CodaFi updated this revision to Diff 286460.Aug 18 2020, 7:35 PM

@aprantl Good idea. Updated.

aprantl accepted this revision.Aug 19 2020, 9:33 AM
This revision is now accepted and ready to land.Aug 19 2020, 9:33 AM
CodaFi marked an inline comment as done.Aug 19 2020, 1:11 PM
CodaFi updated this revision to Diff 286679.Aug 19 2020, 3:40 PM
CodaFi retitled this revision from [clang][Modules] Use File Names Instead of inodes As Loaded Module Keys to [clang][Modules] Increase the Entropy of ModuleManager Map Keys.
CodaFi updated this revision to Diff 286697.Aug 19 2020, 5:45 PM
CodaFi abandoned this revision.Aug 28 2020, 4:21 PM

We have tested this proposed change out on our CI systems and have seen no relief from the symptoms of inode reuse with this approach. Abandoning this revision in favor of a more narrow fix.