This is an archive of the discontinued LLVM Phabricator instance.

[Support] [Windows] Don't check file access time in equivalent(file_status, file_status)
ClosedPublic

Authored by mstorsjo on Feb 16 2023, 1:59 AM.

Details

Summary

The sys::fs::equivalent(file_status, file_status) function is meant to
judge whether two file_status structs denote the same individual file.
On Unix, this is implemented by only comparing the st_dev and st_ino
numbers (from stat), ignoring all other fields.

On Windows, lacking reliable fields corresponding to st_dev and st_ino,
equivalent(file_status, file_status) compares essentially all fields.
However, since 1e39ef331b2b78baec84fdb577d497511cc46bd5
(https://reviews.llvm.org/D18456), file_status also contains the file
access time.

Including the file access time in equivalent(file_status, file_status)
makes it possible to spuriously break. In particular, when invoking
equivalent(Twine, Twine), with two paths, there's a race condition - the
function calls status() on both input paths. Even if the two paths
are the same, the comparison can fail, if the file was accessed
between the two status() calls.

Thus, it seems like the inclusion of the access time in
equivalent(file_status, file_status) was a mistake.

This race condition can cause spurious failures when linking with
LLD/ELF, where LLD uses equivalent() to check whether a file
exists within a specific sysroot, and that sysroot is accessed by other
processes concurrently.

This fixes downstream issue
https://github.com/msys2/MINGW-packages/issues/15695.

Diff Detail

Event Timeline

mstorsjo created this revision.Feb 16 2023, 1:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 16 2023, 1:59 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
mstorsjo requested review of this revision.Feb 16 2023, 1:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 16 2023, 1:59 AM

(I forgot to mention the attribution in the patch description; it'll be included in the real git commit in the end anyway, but the patch is authored by R. Voggenauer.)

hans added a comment.Feb 16 2023, 4:01 AM

Interesting, that must have been a pain to track down.

Does the same issue also affect the LastWriteTime fields? I.e., if someone touches a file between calls?

From https://learn.microsoft.com/en-us/windows/win32/api/fileapi/ns-fileapi-by_handle_file_information it sounds like just comparing the LastWriteTime and VolumeSerialNumber should be enough to check file identities. Perhaps we should just do that?

Does the same issue also affect the LastWriteTime fields? I.e., if someone touches a file between calls?

I guess it could do... But that's also probably a quite fuzzy distinction to try to make; if a file is touched, it's clearly still the same file. But if it is rewritten (but with the same contents) - is it still the same? That probably depends on whether whoever writes it opens/truncates the existing file or writes a new one, replacing the old one (possibly with an atomic rename). In the latter cases, it's clearly no longer the same file.

From https://learn.microsoft.com/en-us/windows/win32/api/fileapi/ns-fileapi-by_handle_file_information it sounds like just comparing the LastWriteTime and VolumeSerialNumber should be enough to check file identities. Perhaps we should just do that?

You mean FileIndex, not LastWriteTime? I guess that could be reasonable - but that also feels like a more risky change.

In that case I'd prefer to do it in two steps; first this, to get rid of the known-wrong reliance on last access (which should be quite safe), then secondly stripping out the other checks that are documented to not be necessary (which also probably is fine, but which is a bigger change).

hans accepted this revision.Feb 16 2023, 5:35 AM

Does the same issue also affect the LastWriteTime fields? I.e., if someone touches a file between calls?

I guess it could do... But that's also probably a quite fuzzy distinction to try to make; if a file is touched, it's clearly still the same file. But if it is rewritten (but with the same contents) - is it still the same? That probably depends on whether whoever writes it opens/truncates the existing file or writes a new one, replacing the old one (possibly with an atomic rename). In the latter cases, it's clearly no longer the same file.

From https://learn.microsoft.com/en-us/windows/win32/api/fileapi/ns-fileapi-by_handle_file_information it sounds like just comparing the LastWriteTime and VolumeSerialNumber should be enough to check file identities. Perhaps we should just do that?

You mean FileIndex, not LastWriteTime? I guess that could be reasonable - but that also feels like a more risky change.

Oops, yes I meant FileIndex.

In that case I'd prefer to do it in two steps; first this, to get rid of the known-wrong reliance on last access (which should be quite safe), then secondly stripping out the other checks that are documented to not be necessary (which also probably is fine, but which is a bigger change).

Sounds good to me.

This revision is now accepted and ready to land.Feb 16 2023, 5:35 AM
This revision was landed with ongoing or failed builds.Feb 17 2023, 12:57 AM
This revision was automatically updated to reflect the committed changes.

From https://learn.microsoft.com/en-us/windows/win32/api/fileapi/ns-fileapi-by_handle_file_information it sounds like just comparing the LastWriteTime and VolumeSerialNumber should be enough to check file identities. Perhaps we should just do that?

You mean FileIndex, not LastWriteTime? I guess that could be reasonable - but that also feels like a more risky change.

Oops, yes I meant FileIndex.

When diving into this next step, I noticed two things.

  1. We already have file_status::getUniqueID() which seems to do exactly what we want:
UniqueID file_status::getUniqueID() const {
  // The file is uniquely identified by the volume serial number along
  // with the 64-bit file identifier.
  uint64_t FileID = (static_cast<uint64_t>(FileIndexHigh) << 32ULL) |
                    static_cast<uint64_t>(FileIndexLow);

  return UniqueID(VolumeSerialNumber, FileID);
}

So we could simplify equivalent() (and other calls within the same file) to just call this neat method. That's nice.

  1. The docs contain another rather annoying detail though. The next paragraph in the referenced docs say this:

The ReFS file system, introduced with Windows Server 2012, includes 128-bit file identifiers. To retrieve the 128-bit file identifier use the GetFileInformationByHandleEx function with FileIdInfo to retrieve the FILE_ID_INFO structure. The 64-bit identifier in this structure is not guaranteed to be unique on ReFS.

So with that in mind, we'd need to extend the call to getStatus() to call GetFileInformationByHandleEx(FileHandle, FileIdInfo, &IdInfo) to get extended info about the file ID. We would need to extend the file_status struct, https://github.com/llvm/llvm-project/blob/llvmorg-17-init/llvm/include/llvm/Support/FileSystem.h#L225-L238, to include this. That's no problem. But we'd also need to extend fs::UniqueID, https://github.com/llvm/llvm-project/blob/llvmorg-17-init/llvm/include/llvm/Support/FileSystem/UniqueID.h#L26-L49, with that, and that seems to be much more annoying. We probably can't rely on 128 bit integers being available everywhere LLVM itself runs (since it can run on e.g. 32 bit hosts).

hans added a comment.Feb 17 2023, 8:27 AM

From https://learn.microsoft.com/en-us/windows/win32/api/fileapi/ns-fileapi-by_handle_file_information it sounds like just comparing the LastWriteTime and VolumeSerialNumber should be enough to check file identities. Perhaps we should just do that?

You mean FileIndex, not LastWriteTime? I guess that could be reasonable - but that also feels like a more risky change.

Oops, yes I meant FileIndex.

When diving into this next step, I noticed two things.

  1. We already have file_status::getUniqueID() which seems to do exactly what we want:
UniqueID file_status::getUniqueID() const {
  // The file is uniquely identified by the volume serial number along
  // with the 64-bit file identifier.
  uint64_t FileID = (static_cast<uint64_t>(FileIndexHigh) << 32ULL) |
                    static_cast<uint64_t>(FileIndexLow);

  return UniqueID(VolumeSerialNumber, FileID);
}

So we could simplify equivalent() (and other calls within the same file) to just call this neat method. That's nice.

  1. The docs contain another rather annoying detail though. The next paragraph in the referenced docs say this:

The ReFS file system, introduced with Windows Server 2012, includes 128-bit file identifiers. To retrieve the 128-bit file identifier use the GetFileInformationByHandleEx function with FileIdInfo to retrieve the FILE_ID_INFO structure. The 64-bit identifier in this structure is not guaranteed to be unique on ReFS.

So with that in mind, we'd need to extend the call to getStatus() to call GetFileInformationByHandleEx(FileHandle, FileIdInfo, &IdInfo) to get extended info about the file ID. We would need to extend the file_status struct, https://github.com/llvm/llvm-project/blob/llvmorg-17-init/llvm/include/llvm/Support/FileSystem.h#L225-L238, to include this. That's no problem. But we'd also need to extend fs::UniqueID, https://github.com/llvm/llvm-project/blob/llvmorg-17-init/llvm/include/llvm/Support/FileSystem/UniqueID.h#L26-L49, with that, and that seems to be much more annoying. We probably can't rely on 128 bit integers being available everywhere LLVM itself runs (since it can run on e.g. 32 bit hosts).

That's annoying.

I suppose we could add another uint64_t for the high bits in UniqueID, but then we'd have to propagate it to everyone consuming UniqueIDs. I don't know how annoying that would be. Maybe it would be enough to hash combine the high bits with the low :-]

We would need to extend the file_status struct, https://github.com/llvm/llvm-project/blob/llvmorg-17-init/llvm/include/llvm/Support/FileSystem.h#L225-L238, to include this. That's no problem. But we'd also need to extend fs::UniqueID, https://github.com/llvm/llvm-project/blob/llvmorg-17-init/llvm/include/llvm/Support/FileSystem/UniqueID.h#L26-L49, with that, and that seems to be much more annoying. We probably can't rely on 128 bit integers being available everywhere LLVM itself runs (since it can run on e.g. 32 bit hosts).

That's annoying.

I suppose we could add another uint64_t for the high bits in UniqueID, but then we'd have to propagate it to everyone consuming UniqueIDs. I don't know how annoying that would be.

I tried to grep around in the monorepo, and it seems like it would be significantly annoying to do that - lots of things rely on UniqueIDs consisting of a pair only, and the data types being uint64_t.

We could of course just keep comparing individual fields in the sys::fs::equivalent() function, like we do now, instead of factorizing it to use getUniqueID() - but that's not really a complete fix, since getUniqueID() is used by lots of code too.

Maybe it would be enough to hash combine the high bits with the low :-]

Hah, I guess that could be one way to go about it ;-)

One would hope that the implementations of file systems that use 128 bit IDs is such that it primarily uses the bits that map to FileIndexLow/FileIndexHigh, so that unless you actually do have more than 2^64 files, you won't run into issues by just using the older fields.