This is an archive of the discontinued LLVM Phabricator instance.

Differential D155579

[Windows] Avoid using FileIndex for unique IDs
ClosedPublic

Authored by mstorsjo on Jul 18 2023, 4:30 AM.

Details

Reviewers

aganea
thieta
kadircet
zero9178
alvinhochun
hans
aaron.ballman

Commits

rG02a375478349: [Windows] Avoid using FileIndex for unique IDs

Summary

The FileIndex values returned from GetFileInformationByHandle are
considered stable and uniquely identifying a file, as long as the
handle is open. When handles are closed, there are no guarantees
for their stability or uniqueness. On some file systems (such as
NTFS), the indices are documented to be stable even across handles.
But with some file systems, in particular network mounts, file
indices can be reused very soon after handles are closed.

When such file indices are used for LLVM's UniqueID, files are
considered duplicates as soon as the filesystem driver happens to
have used the same file index for the handle used to inspect the
file. This caused widespread, non-obvious (seemingly random)
breakage. This can happen e.g. if running on a directory that is
shared via Remote Desktop or VirtualBox.

To avoid the issue, use a hash of the canonicalized path for the
file as unique identifier, instead of using FileIndex.
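As a rough illustration of the idea in the paragraph above, the sketch below derives a 64-bit identifier from a path string that has already been canonicalized. The name hashCanonicalPath and the choice of FNV-1a are assumptions made for this example; this summary does not say which hash function the patch uses.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not taken from the patch): FNV-1a over the bytes of
// the UTF-16 code units of a path that the OS has already canonicalized.
inline std::uint64_t hashCanonicalPath(const std::u16string &CanonicalPath) {
  std::uint64_t Hash = 14695981039346656037ULL; // FNV-1a 64-bit offset basis
  const std::uint64_t Prime = 1099511628211ULL; // FNV-1a 64-bit prime
  for (char16_t Unit : CanonicalPath) {
    Hash = (Hash ^ (Unit & 0xFF)) * Prime;        // low byte
    Hash = (Hash ^ ((Unit >> 8) & 0xFF)) * Prime; // high byte
  }
  return Hash;
}

// Usage: the ID depends only on the canonical spelling of the path.
inline void hashCanonicalPathExample() {
  const std::u16string A = u"\\Device\\HarddiskVolume1\\src\\a.c";
  const std::u16string B = u"\\Device\\HarddiskVolume1\\src\\a.h";
  assert(hashCanonicalPath(A) == hashCanonicalPath(A));
  assert(hashCanonicalPath(A) != hashCanonicalPath(B));
}
```

Because the identifier is a pure function of the path text, it cannot be affected by a file system driver recycling index values after a handle is closed.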

This fixes https://github.com/llvm/llvm-project/issues/61401 and
https://github.com/llvm/llvm-project/issues/22079.

Performance wise, this adds (usually) one extra call to
GetFinalPathNameByHandleW for each call to getStatus(). A test
case such as running clang-scan-deps becomes around 1% slower
with this change, which is considered tolerable.

Change the equivalent() function to use getUniqueID instead of
checking individual file_status fields. The
equivalent(Twine,Twine,bool& result) function calls status() on
each path successively, without keeping the file handles open,
which also is prone to such false positives. This also gets rid
of checks of other superfluous fields in the
equivalent(file_status, file_status) function - the unique ID of
a file should be enough (that is what is done for Unix anyway).
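A minimal sketch of what "compare only the unique ID" means is shown below. The types UniqueIDSketch and StatusSketch and their field names are simplified stand-ins invented for this example, not the actual declarations in FileSystem.h.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for a (device, file) identifier pair.
struct UniqueIDSketch {
  std::uint64_t Device;
  std::uint64_t File;
};

inline bool operator==(const UniqueIDSketch &A, const UniqueIDSketch &B) {
  return A.Device == B.Device && A.File == B.File;
}

// Hypothetical status record: an ID plus a field that is NOT consulted.
struct StatusSketch {
  UniqueIDSketch ID;
  std::uint64_t Size;
};

// Two statuses name the same file exactly when their IDs match; sizes,
// timestamps and similar fields are deliberately ignored.
inline bool equivalentSketch(const StatusSketch &A, const StatusSketch &B) {
  return A.ID == B.ID;
}

inline void equivalentSketchExample() {
  StatusSketch A{{1, 42}, 100};
  StatusSketch B{{1, 42}, 200}; // same ID, different size
  StatusSketch C{{1, 43}, 100}; // different ID, same size
  assert(equivalentSketch(A, B));
  assert(!equivalentSketch(A, C));
}
```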

This comes with one known caveat: For hardlinks, each name for
the file now gets a different UniqueID, and equivalent() considers
them different. While that's not ideal, occasional false negatives
for equivalent() are usually not that fatal (the cases where we strictly
do need to deduplicate files with different path names are quite
rare) compared to the issues caused by false positives for
equivalent() (where we'd deduplicate and omit totally distinct files).

The FileIndex is documented to be stable on NTFS though, so ideally
we could maybe have used it in the majority of cases. That would
require a heuristic for whether we can rely on FileIndex or not.
We considered using the existing function is_local_internal for that;
however that caused an unacceptable performance regression
(clang-scan-deps became 38% slower in one test, even more than that
in another test).

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

mstorsjo created this revision. Jul 18 2023, 4:30 AM

Herald added a project: Restricted Project. Jul 18 2023, 4:30 AM

Herald added subscribers: arphaman, hiraditya.

mstorsjo requested review of this revision. Jul 18 2023, 4:30 AM

Herald added a subscriber: wangpc.

@thieta IIRC you had a usecase where you scan through large numbers of files with LLVM code somewhere. Are you able to take this for a spin to make sure it doesn't affect the performance of your usecase too much?

In D155579#4509952, @mstorsjo wrote:

@thieta IIRC you had a usecase where you scan through large numbers of files with LLVM code somewhere. Are you able to take this for a spin to make sure it doesn't affect the performance of your usecase too much?

Will do, but probably not some time real soon since I am on vacation with the family now.

Harbormaster completed remote builds in B246165: Diff 541456. Jul 18 2023, 10:28 AM

aaron.ballman added inline comments. Aug 7 2023, 11:41 AM

llvm/include/llvm/Support/FileSystem.h
238–239	Some comments explaining that the hash is only expected to be valid if `ReliableFileIndex` is `false` may help.
llvm/lib/Support/Windows/Path.inc
741–742
752–755

mstorsjo added inline comments. Aug 7 2023, 12:49 PM

llvm/include/llvm/Support/FileSystem.h
238–239	Yep, that'd be useful, will add that!
llvm/lib/Support/Windows/Path.inc
741–742	Yes, I'd love to do that - but there are gotos that jump past this spot, and the gotos can't cross an initialization.

aaron.ballman added inline comments. Aug 8 2023, 5:07 AM

llvm/lib/Support/Windows/Path.inc
741–742	Ugh... "thanks" goto! :-D
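The inline suggestion itself is not preserved in this archive, but the reply points at a C++ rule: a goto may not jump past a declaration with an initializer into a region where that variable is still in scope. The sketch below shows the usual workaround of separating declaration from assignment. The function name pickId and the constant 42 are invented for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical function shaped like the situation in the reply above: an
// early `goto` whose label lies after the declaration of a local variable.
inline std::uint64_t pickId(bool EarlyFailure) {
  if (EarlyFailure)
    goto handle_failed;

  // Writing `std::uint64_t PathHash = 42;` on one line would make the goto
  // above ill-formed, because it would bypass the initialization of a
  // variable that is still in scope at the label.
  std::uint64_t PathHash; // scalar without initializer: may be jumped over
  PathHash = 42;
  return PathHash;

handle_failed:
  return 0;
}

inline void pickIdExample() {
  assert(pickId(true) == 0);
  assert(pickId(false) == 42);
}
```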

Add a comment clarifying that PathHash is only valid if ReliableFileIndex is false.

Harbormaster completed remote builds in B252270: Diff 549818. Aug 14 2023, 2:09 AM

LGTM! I wish there was a reasonable way to test this, but given the subject of the change, I don't think test coverage is practical.

This revision is now accepted and ready to land. Aug 14 2023, 6:08 AM

In D155579#4510035, @thieta wrote:

In D155579#4509952, @mstorsjo wrote:

@thieta IIRC you had a usecase where you scan through large numbers of files with LLVM code somewhere. Are you able to take this for a spin to make sure it doesn't affect the performance of your usecase too much?

Will do, but probably not some time real soon

@thieta Do you have time to try this out in the near future? I'd like to at least know of any potential performance impact before landing this.

(Ideally I'd like to have this in a release, but this is something that definitely should cook in git main for a month or two before considering including in a release. At that point we're probably close to the late 17.0.x, where it's probably too risky to include, so I guess this is practically 18.x material in any case?)

In D155579#4587472, @mstorsjo wrote:

(Ideally I'd like to have this in a release, but this is something that definitely should cook in git main for a month or two before considering including in a release. At that point we're probably close to the late 17.0.x, where it's probably too risky to include, so I guess this is practically 18.x material in any case?)

FWIW, I agree that this kind of change needs a bit longer of bake time to be confident in its correctness, so I think 17.x is unlikely at this point.

In D155579#4587472, @mstorsjo wrote:

In D155579#4510035, @thieta wrote:

In D155579#4509952, @mstorsjo wrote:

@thieta IIRC you had a usecase where you scan through large numbers of files with LLVM code somewhere. Are you able to take this for a spin to make sure it doesn't affect the performance of your usecase too much?

Will do, but probably not some time real soon

@thieta Do you have time to try this out in the near future? I'd like to at least know of any potential performance impact before landing this.

@mstorsjo I think a good test for you would be to run clang-scan-deps on a large CDB .json, multithreaded, and see if there’s a difference before/after patch.

In D155579#4588145, @aganea wrote:

@mstorsjo I think a good test for you would be to run clang-scan-deps on a large CDB .json, multithreaded, and see if there’s a difference before/after patch.

That sounds like a good thing to test. Does anyone have a suitable setup with such files ready that could run the tests? I don't really run Windows (other than VMs for testing things occasionally) or do development in it, I mostly try to cross compile things - and I haven't used clang-scan-deps before either.

In D155579#4588227, @mstorsjo wrote:

In D155579#4588145, @aganea wrote:

@mstorsjo I think a good test for you would be to run clang-scan-deps on a large CDB .json, multithreaded, and see if there’s a difference before/after patch.

That sounds like a good thing to test. Does anyone have a suitable setup with such files ready that could run the tests? I don't really run Windows (other than VMs for testing things occasionally) or do development in it, I mostly try to cross compile things - and I haven't used clang-scan-deps before either.

I set this up and managed to get some numbers now.

As test subject, I configured a build of llvm+clang, with llvm-mingw built with and without this patch. After configuring the build, I ran tim clang-scan-deps.exe --compilation-database=compile_commands.json > out 2> err (using https://github.com/sgraham/tim). Before the change, it took consistently real: 0m17.156s to run, while after the change it took around real: 0m23.703s. So this task became around 38% slower with this change - that's a notable slowdown.

I tried to look into what is causing so much slowdown. Overall, this patch adds one call to realPathFromHandle (if status was called on a file descriptor without known path), which calls GetFinalPathNameByHandleW once, or twice if the buffer wasn't large enough. (We could resize the new buffer on line 792 to MAX_PATH to avoid needing to do this twice in most cases.) This adds almost no extra runtime in my tests (perhaps most calls are with a path?). Then we call is_local_internal, which first calls GetVolumePathNameW followed by GetDriveTypeW. Here the vast majority of the runtime seems to be in GetVolumePathNameW: by omitting the rest of is_local_internal and just doing one single call to GetVolumePathNameW, the runtime increases from 17 to 23 seconds.

So the calls to GetFinalPathNameByHandleW and GetDriveTypeW are irrelevant for performance here, and GetVolumePathNameW is the only thing that causes all the extra runtime cost added here. I don't see any great ways around that. (One may be tempted to manually reimplement GetVolumePathNameW to just get the first segment of the path - but that breaks the cases where volumes are mounted in the system hierarchy etc.)

With that in mind, are @aaron.ballman and @aganea still ok with going ahead and landing this?

An alternative would be to just skip the call to is_local_internal and entirely switch to the hash based logic and never use FileIndex. I guess that'd mostly work, but the current patch tries to retain the use of FileIndex where reasonable (which probably should work better whenever e.g. symlinks or hardlinks are in use - which aren't common but do exist).

Another option could be to add a debug opt to disable this new code path, and skip all the canonicalization and reliability check. It could be one way or the other; either enabled by default or disabled by default. But probably better to enable it by default (this new code path) and let people add the flag if they need more performance.

Let me do some tests today with our codebase, see how much of an impact there is. I’d like also to see if this impacts overall Clang build times.

Hmm. This level of performance hit is not great - it means I will probably have to carry an internal patch to disable it if we don't expose a compile time define or option.

I will try to test this in the coming week. It has just been busy with a lot of other things right now.

Ouch, that is a sizeable compile time performance regression -- I think that's too much expense for everyone on Windows to pay given how often users are likely to hit the issues this solves. Removing my acceptance of the patch while we consider what to do.

This revision now requires changes to proceed. Aug 21 2023, 8:27 AM

Unfortunately in our case the situation is even worse than what Martin was suggesting. My timings are 2x slower after this patch than without. Tested with a stage2 clang-cl built with ThinLTO and all optimizations on.

Build times for our game project, which is based on UE 5.2, on a 32c/64t Threadripper:

before patch: 8 min 13 sec
after patch: 15 min 20 sec

Same project, clang-scan-deps:

before patch: 1 min 48 sec
after patch: 3 min 47 sec

Even something as simple as that is slower after this patch:

PS C:\test> cat main.cpp
int main() { return 0; }

PS C:\test> Measure-Command { C:\before_patch\clang-cl.exe /c .\main.cpp }
...(omitted)...
TotalMilliseconds : 40.5117

PS C:\test> Measure-Command { C:\after_patch\clang-cl.exe /c .\main.cpp }
...(omitted)...
TotalMilliseconds : 57.4749
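For readers comparing the timings quoted in this thread, the relative slowdown can be computed as (after - before) / before. The helper below is only a convenience sketch. The name slowdownPercent is made up, and the inputs in the usage function are the measurements reported above converted to seconds or milliseconds.

```cpp
#include <cassert>

// Relative slowdown in percent: how much longer `After` takes than `Before`.
inline double slowdownPercent(double Before, double After) {
  assert(Before > 0 && "baseline must be positive");
  return (After - Before) / Before * 100.0;
}

inline void slowdownPercentExample() {
  // clang-scan-deps on an llvm+clang build: 17.156 s -> 23.703 s.
  double ScanDeps = slowdownPercent(17.156, 23.703);
  assert(ScanDeps > 38.0 && ScanDeps < 38.5);
  // UE 5.2 based project build: 8 min 13 s (493 s) -> 15 min 20 s (920 s).
  double GameBuild = slowdownPercent(493.0, 920.0);
  assert(GameBuild > 86.0 && GameBuild < 87.0);
  // Trivial main.cpp with clang-cl: 40.5117 ms -> 57.4749 ms.
  double Trivial = slowdownPercent(40.5117, 57.4749);
  assert(Trivial > 41.5 && Trivial < 42.5);
}
```

By this measure the full project build is roughly 87% slower and even the trivial compile is roughly 42% slower. Both are well above the 38% first measured with clang-scan-deps.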

It seems one of root issues is the call to GetVolumePathNameW inside is_local_internal, which is absolutely terrible. It generates 12 (!!!) separate kernel calls for each folder component in the path. The deeper your files are, the longer it takes to execute.
Also worth noting that Clang calls llvm::sys::status twice per file:

once in DiagnoseInputExistence.
then in FileSystemStatCache::get.

Also these status calls are done on every single file or folder that Clang fiddles with, the sysroot folders, output files... Most of the time spent during our build is spent in the kernel (darker blue part):

If we had a resident daemon process that was caching inputs during the entire build, this patch probably wouldn't be a problem. One option, as I was suggesting, would be to hide this new behavior behind a (disabled) option, and tell users about it. But I'm not sure how useful it would be.
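To illustrate why repeated status queries for the same path are avoidable in principle, here is a sketch of in-process memoization. It is not the resident daemon mentioned above and not Clang's actual FileSystemStatCache. The class StatusCacheSketch and its int-valued result are assumptions for this example, and it ignores invalidation when files change.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical memoizing wrapper: the expensive query runs at most once per
// distinct path string within one process.
class StatusCacheSketch {
public:
  explicit StatusCacheSketch(std::function<int(const std::string &)> Query)
      : Query(std::move(Query)) {}

  int get(const std::string &Path) {
    auto It = Cache.find(Path);
    if (It != Cache.end())
      return It->second; // cache hit: no call into the OS
    int Result = Query(Path);
    Cache.emplace(Path, Result);
    return Result;
  }

private:
  std::function<int(const std::string &)> Query;
  std::unordered_map<std::string, int> Cache;
};

inline void statusCacheSketchExample() {
  int Calls = 0;
  StatusCacheSketch Cache([&Calls](const std::string &Path) {
    ++Calls; // stands in for an expensive status() call
    return static_cast<int>(Path.size());
  });
  assert(Cache.get("main.cpp") == 8);
  assert(Cache.get("main.cpp") == 8);
  assert(Calls == 1); // second lookup was served from the cache
}
```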

In D155579#4604241, @aganea wrote:

Unfortunately in our case the situation is even worse than what Martin was suggesting. My timings are 2x slower after this patch than without. Tested with a stage2 clang-cl built with ThinLTO and all optimizations on.

Build times for our game project, which is based on UE 5.2, on a 32c/64t Threadripper:

before patch: 8 min 13 sec

after patch: 15 min 20 sec

Same project, clang-scan-deps:

before patch: 1 min 48 sec

after patch: 3 min 47 sec

Ouch, that's quite seriously bad. I had expected that the overhead would be noticeable in something like clang-scan-deps, but for actual compilation where most effort is spent on other, actually compute intensive things, that's really quite spectacular.

It seems one of root issues is the call to GetVolumePathNameW inside is_local_internal, which is absolutely terrible. It generates 12 (!!!) separate kernel calls for each folder component in the path. The deeper your files are, the longer it takes to execute.

Oh, wow, that explains things...

One option like I was suggesting would be to hide this new behavior behind a (disabled) option, and tell users about it. But I'm not sure how useful it would be.

Yeah that's probably not too useful in the long run. When you hit this issue, you get extremely confusing error behaviours, to the point that I believe we can't diagnose and suggest the option to the user. I wouldn't expect any regular user to figure out to enable the option really. So whatever we do, it should probably work pretty much automatically.

You who might have more use of nontrivial Windows build scenarios - how much impact would it be to go all in on the new path canonicalization + hash approach, i.e. ditching the file index entirely? Performance wise I would believe that it would be no significant change to before. The only thing that matters probably is how it behaves wrt symlinks/hardlinks, if those are present and in use.

In D155579#4604541, @mstorsjo wrote:

In D155579#4604241, @aganea wrote:

One option like I was suggesting would be to hide this new behavior behind a (disabled) option, and tell users about it. But I'm not sure how useful it would be.

Yeah that's probably not too useful in the long run. When you hit this issue, you get extremely confusing error behaviours, to the point that I believe we can't diagnose and suggest the option to the user. I wouldn't expect any regular user to figure out to enable the option really. So whatever we do, it should probably work pretty much automatically.

Agreed, I don't think an option really helps all that much.

You who might have more use of nontrivial Windows build scenarios - how much impact would it be to go all in on the new path canonicalization + hash approach, i.e. ditching the file index entirely? Performance wise I would believe that it would be no significant change to before. The only thing that matters probably is how it behaves wrt symlinks/hardlinks, if those are present and in use.

A related question for hashes -- what about case insensitivity of file paths and slash direction? Presumably we want F:\foo\Bar to be the same path as F:/FOO/BAR?

In D155579#4606503, @aaron.ballman wrote:

In D155579#4604541, @mstorsjo wrote:

You who might have more use of nontrivial Windows build scenarios - how much impact would it be to go all in on the new path canonicalization + hash approach, i.e. ditching the file index entirely? Performance wise I would believe that it would be no significant change to before. The only thing that matters probably is how it behaves wrt symlinks/hardlinks, if those are present and in use.

A related question for hashes -- what about case insensitivity of file paths and slash direction? Presumably we want F:\foo\Bar to be the same path as F:/FOO/BAR?

Yes, that's something that certainly would come up a lot. The approach taken by this patch is that we run all paths through realPathFromHandle, which calls GetFinalPathNameByHandleW, https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getfinalpathnamebyhandlew, which should return a canonicalized path name for this file.
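To make the question concrete, the sketch below is a deliberately naive, string-only stand-in that lowercases ASCII letters and flips slashes. It is not what the patch does, since the patch delegates canonicalization to the OS through GetFinalPathNameByHandleW. The name naiveNormalize is invented, and the function ignores relative segments, short names, links and non-ASCII case folding.

```cpp
#include <cassert>
#include <string>

// Naive illustration only: ASCII lowercasing plus '/' -> '\\'.
inline std::string naiveNormalize(std::string Path) {
  for (char &C : Path) {
    if (C == '/')
      C = '\\';
    else if (C >= 'A' && C <= 'Z')
      C = static_cast<char>(C - 'A' + 'a');
  }
  return Path;
}

inline void naiveNormalizeExample() {
  // The two spellings from the question only compare equal after
  // normalization; hashing the raw strings would give two different IDs.
  assert(std::string("F:\\foo\\Bar") != std::string("F:/FOO/BAR"));
  assert(naiveNormalize("F:\\foo\\Bar") == naiveNormalize("F:/FOO/BAR"));
}
```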

In D155579#4606508, @mstorsjo wrote:

In D155579#4606503, @aaron.ballman wrote:

In D155579#4604541, @mstorsjo wrote:

You who might have more use of nontrivial Windows build scenarios - how much impact would it be to go all in on the new path canonicalization + hash approach, i.e. ditching the file index entirely? Performance wise I would believe that it would be no significant change to before. The only thing that matters probably is how it behaves wrt symlinks/hardlinks, if those are present and in use.

A related question for hashes -- what about case insensitivity of file paths and slash direction? Presumably we want F:\foo\Bar to be the same path as F:/FOO/BAR?

Yes, that's something that certainly would come up a lot. The approach taken by this patch is that we run all paths through realPathFromHandle, which calls GetFinalPathNameByHandleW, https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getfinalpathnamebyhandlew, which should return a canonicalized path name for this file.

I guess the follow-up then is that we should check how GetFinalPathNameByHandleW behaves wrt potential symlinks and hardlinks - if it canonicalizes them as we'd like, then I guess it should be quite safe to always go with the hashing.

In D155579#4604541, @mstorsjo wrote:

You who might have more use of nontrivial Windows build scenarios - how much impact would it be to go all in on the new path canonicalization + hash approach, i.e. ditching the file index entirely? Performance wise I would believe that it would be no significant change to before. The only thing that matters probably is how it behaves wrt symlinks/hardlinks, if those are present and in use.

I guess we can try it, but we might be trading one problem for another. Fun stuff like this. We would need a rigorous test plan, since this can't be tested automatically. And a strategy for applying the test plan since a single individual might not have access to all hardware resources to execute the test plan (perhaps sharing the testing across several individuals).

I'd like to suggest again the solution I put forward in D146490, which could only be enabled on a case-by-case basis, for short-lived apps like Clang or LLD. Other, or long-lived applications like ClangD would still use the current (unreliable) trunk behavior for IDs. At least that shovels all the OS/drivers problems to the OS, and leaves us with the optimization problem (of open handles), which could be manageable.

We can also try to improve your patch somehow:

In D155579#4604241, @aganea wrote:

It seems one of root issues is the call to GetVolumePathNameW inside is_local_internal, which is absolutely terrible. It generates 12 (!!!) separate kernel calls for each folder component in the path. The deeper your files are, the longer it takes to execute.

Correction, the issue is with GetDriveTypeW not GetVolumePathNameW. I realized that they do this because of reparse points, and because internally they call other APIs that each opens the path and checks for reparse points, again. For example, I don't understand why internally they call RegSetValueExW? (inside GetDriveTypeW) I wonder if there could be a different way to find whether the file is on a network drive or not.

Also worth noting that Clang calls llvm::sys::status twice per file:

once in DiagnoseInputExistence.

then in FileSystemStatCache::get.

We could also fix this, which would maybe dampen the impact of this patch.

In D155579#4606800, @aganea wrote:

In D155579#4604541, @mstorsjo wrote:

You who might have more use of nontrivial Windows build scenarios - how much impact would it be to go all in on the new path canonicalization + hash approach, i.e. ditching the file index entirely? Performance wise I would believe that it would be no significant change to before. The only thing that matters probably is how it behaves wrt symlinks/hardlinks, if those are present and in use.

I guess we can try it, but we might be trading one problem for another. Fun stuff like this.

Oh, interesting. However, I don't think that seems like a big blocker for us - if VOLUME_NAME_NT works reliably across file systems, we could just use that. We don't need a filename to display anywhere here, we just need a canonical string representation for a file.

We would need a rigorous test plan, since this can't be tested automatically. And a strategy for applying the test plan since a single individual might not have access to all hardware resources to execute the test plan (perhaps sharing the testing across several individuals).

I don't quite see why such a rigorous test plan would be needed here, compared to how we do most other things? If it seems to work reasonably in the cases we've got access to testing and we don't know of any other specific problematic cases, we could land it, and expect to hear back if it causes something else?

I'd like to suggest again the solution I put forward in D146490, which could only be enabled on a case-by-case basis, for short-lived apps like Clang or LLD. Other, or long-lived applications like ClangD would still use the current (unreliable) trunk behavior for IDs. At least that shovels all the OS/drivers problems to the OS, and leaves us with the optimization problem (of open handles), which could be manageable.

I'm quite hesitant of that solution - it's complex and increases resource consumption - and keeping files open longer than necessary always causes issues, especially on Windows. Admittedly, it's not that big of an issue for short-lived processes. Having separate solutions for short-lived and long-lived processes doesn't feel too great though.

In D155579#4604241, @aganea wrote:

It seems one of root issues is the call to GetVolumePathNameW inside is_local_internal, which is absolutely terrible. It generates 12 (!!!) separate kernel calls for each folder component in the path. The deeper your files are, the longer it takes to execute.

Correction, the issue is with GetDriveTypeW not GetVolumePathNameW. I realized that they do this because of reparse points, and because internally they call other APIs that each opens the path and checks for reparse points, again. For example, I don't understand why internally they call RegSetValueExW? (inside GetDriveTypeW) I wonder if there could be a different way to find whether the file is on a network drive or not.

Hmm, in my testing before, it was definitely GetVolumePathNameW that was slow.

FWIW, I tested symlinks and hardlinks with Clang with this patch. Symlinks do get resolved to the target file, so the path hashing approach works just as well as the file index there. For hard links, path hashing is inferior to using the file index though - the file index knows that two different paths are the same file, but GetFinalPathNameByHandleW doesn't canonicalize them to one single name (which is the correct behaviour for hard links - all names are equal and independent of each other).
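The same distinction can be explored with portable C++17 std::filesystem, as a rough analogue of the Windows experiment described above. In the sketch below, the name checkLinks, the file names, and the scratch directory are all assumptions. On a typical POSIX file system one would expect the symlink to canonicalize to the target while the hard link keeps its own name; creating symlinks on Windows may require extra privileges.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

struct LinkCheck {
  bool SymlinkSamePath;  // canonical(symlink) == canonical(target)
  bool HardlinkSamePath; // canonical(hardlink) == canonical(target)
  bool HardlinkSameFile; // fs::equivalent(hardlink, target)
};

// Creates target.txt, sym.txt and hard.txt inside ScratchDir and compares
// a path-based notion of identity with a file-based one.
inline LinkCheck checkLinks(const fs::path &ScratchDir) {
  const fs::path Base = fs::absolute(ScratchDir);
  fs::create_directories(Base);
  const fs::path Target = Base / "target.txt";
  const fs::path Sym = Base / "sym.txt";
  const fs::path Hard = Base / "hard.txt";
  fs::remove(Sym); // make the sketch re-runnable
  fs::remove(Hard);
  {
    std::ofstream Out(Target);
    Out << "x";
  }
  assert(fs::exists(Target));
  fs::create_symlink(Target, Sym);
  fs::create_hard_link(Target, Hard);

  LinkCheck Result;
  Result.SymlinkSamePath = fs::canonical(Sym) == fs::canonical(Target);
  Result.HardlinkSamePath = fs::canonical(Hard) == fs::canonical(Target);
  Result.HardlinkSameFile = fs::equivalent(Hard, Target);
  return Result;
}
```

An identifier derived from the canonical path follows the first two fields, so it merges symlinks with their targets but treats each hard link name as a separate file.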

If we consider hard links on Windows something we don't need to cater for, we could use unconditional path hashing, combined with canonicalization with VOLUME_NAME_NT if the default VOLUME_NAME_DOS is problematic, and that should be quite performant.

Changed to always use a hashed path instead of nFileIndexHigh/nFileIndexLow, without any heuristics for when to do this. With this, I'm measuring a slowdown of around 1% in the test with clang-scan-deps.

How does this behave in your testcase @aganea?

I applied the suggestion regarding wonky file system drivers from Rust, and using VOLUME_NAME_NT, which works with such drivers too (tested with ImDisk, and confirmed that the default VOLUME_NAME_DOS fails there). We don't need to do fallbacking/testing between different formats here as we're not interested in the actual contents of the canonicalized path (we're not using it for anything), we're just interested in its identity.

For the potential case if GetFinalPathNameByHandleW still fails with VOLUME_NAME_NT, fall back on the old behaviour of using nFileIndexHigh/nFileIndexLow.
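The fallback described above can be sketched as a simple selection. The names FileIdSketch and chooseFileId are hypothetical, std::hash stands in for whatever hash the patch uses, and an empty optional stands in for a failed canonicalization. Combining the two 32-bit index halves by shifting is the conventional way to form a 64-bit value, shown here as an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>

struct FileIdSketch {
  std::uint64_t Value;
  bool FromPathHash; // false means the file index was used instead
};

// Prefer a hash of the canonical path; fall back to the file index only
// when no canonical path could be obtained.
inline FileIdSketch
chooseFileId(const std::optional<std::u16string> &CanonicalPath,
             std::uint32_t IndexHigh, std::uint32_t IndexLow) {
  if (CanonicalPath)
    return {std::hash<std::u16string>{}(*CanonicalPath), true};
  return {(static_cast<std::uint64_t>(IndexHigh) << 32) | IndexLow, false};
}

inline void chooseFileIdExample() {
  const std::u16string Path = u"\\Device\\HarddiskVolume1\\a.c";
  // With a canonical path, the index values do not influence the ID.
  FileIdSketch A = chooseFileId(Path, 1, 2);
  FileIdSketch B = chooseFileId(Path, 9, 9);
  assert(A.FromPathHash && B.FromPathHash && A.Value == B.Value);
  // Without one, the two index halves are combined.
  FileIdSketch C = chooseFileId(std::nullopt, 1, 2);
  assert(!C.FromPathHash && C.Value == ((1ULL << 32) | 2));
}
```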

The only known drawback in this case is that we won't be able to notice two names pointing at the same hardlinked file - but I would believe that to be a compromise that we can afford to make on Windows (where hardlinks generally are rare).

Harbormaster completed remote builds in B256512: Diff 555711. Sep 4 2023, 6:06 AM

In D155579#4636964, @mstorsjo wrote:

How does this behave in your testcase @aganea?

I don't see any difference between this patch and the baseline, so that's good news.

There are two failing tests however (related to hard links):

FAIL: LLVM-Unit :: Support/./SupportTests.exe/40/82 (5085 of 72863)
******************** TEST 'LLVM-Unit :: Support/./SupportTests.exe/40/82' FAILED ********************
Script(shard):
--
GTEST_OUTPUT=json:C:\git\llvm-project\stage2_rpmalloc\unittests\Support\.\SupportTests.exe-LLVM-Unit-27536-40-82.json GTEST_SHUFFLE=0 GTEST_TOTAL_SHARDS=82 GTEST_SHARD_INDEX=40 C:\git\llvm-project\stage2_rpmalloc\unittests\Support\.\SupportTests.exe
--

Script:
--
C:\git\llvm-project\stage2_rpmalloc\unittests\Support\.\SupportTests.exe --gtest_filter=FileSystemTest.Unique
--
Test Directory: C:\Users\AGANEA~1\AppData\Local\Temp\lit-tmp-_2_vvve2\file-system-test-e7decf
C:\git\llvm-project\llvm\unittests\Support\Path.cpp(716): error: Expected equality of these values:
  D2
    Which is: 16-byte object <48-04 40-84 00-00 00-00 C6-FC EA-25 9F-15 C7-9E>
  F1
    Which is: 16-byte object <48-04 40-84 00-00 00-00 E1-BF 0D-B1 58-7A 27-07>

C:\git\llvm-project\llvm\unittests\Support\Path.cpp:716
Expected equality of these values:
  D2
    Which is: 16-byte object <48-04 40-84 00-00 00-00 C6-FC EA-25 9F-15 C7-9E>
  F1
    Which is: 16-byte object <48-04 40-84 00-00 00-00 E1-BF 0D-B1 58-7A 27-07>


********************
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90
FAIL: LLVM-Unit :: Support/./SupportTests.exe/45/82 (58768 of 72863)
******************** TEST 'LLVM-Unit :: Support/./SupportTests.exe/45/82' FAILED ********************
Script(shard):
--
GTEST_OUTPUT=json:C:\git\llvm-project\stage2_rpmalloc\unittests\Support\.\SupportTests.exe-LLVM-Unit-27536-45-82.json GTEST_SHUFFLE=0 GTEST_TOTAL_SHARDS=82 GTEST_SHARD_INDEX=45 C:\git\llvm-project\stage2_rpmalloc\unittests\Support\.\SupportTests.exe
--

Script:
--
C:\git\llvm-project\stage2_rpmalloc\unittests\Support\.\SupportTests.exe --gtest_filter=FileSystemTest.TempFiles
--
Test Directory: C:\Users\AGANEA~1\AppData\Local\Temp\lit-tmp-_2_vvve2\file-system-test-de39a5
C:\git\llvm-project\llvm\unittests\Support\Path.cpp(914): error: Value of: equal
  Actual: false
Expected: true
C:\git\llvm-project\llvm\unittests\Support\Path.cpp(917): error: Value of: fs::equivalent(A, B)
  Actual: false
Expected: true

C:\git\llvm-project\llvm\unittests\Support\Path.cpp:914
Value of: equal
  Actual: false
Expected: true
C:\git\llvm-project\llvm\unittests\Support\Path.cpp:917
Value of: fs::equivalent(A, B)
  Actual: false
Expected: true


********************
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90..
********************
Failed Tests (2):
  LLVM-Unit :: Support/./SupportTests.exe/FileSystemTest/TempFiles
  LLVM-Unit :: Support/./SupportTests.exe/FileSystemTest/Unique


Testing Time: 654.15s
  Skipped          :    56
  Unsupported      :  2452
  Passed           : 90408
  Expectedly Failed:   198
  Failed           :     2

Exclude tests that test equality of hardlinks, making it clearer that this is a known intentional missing feature here.

Harbormaster completed remote builds in B256844: Diff 556237. Sep 8 2023, 2:38 AM

LGTM, thank you Martin!

In D155579#4641716, @aganea wrote:

LGTM, thank you Martin!

Great that it looks good to you now, thanks!

As @aaron.ballman put a negative review in before, on the version with massive performance issues, do you want to have a look at this again now? If everyone is good with it, I’d consider landing it maybe some time next week. And this is clearly way too disruptive to even consider backporting indeed; having it cooking in git main until 18.x seems good.

LGTM! I am a bit uncomfortable about the hardlink behavior (it seems plausible that some folks may use hardlinks to specify paths to include to things like the Windows SDK, so we could run into them). I think landing this early in the 18.x cycle gives us quite a bit of bake time to see if there's fallout in practice, but we should definitely watch the issues list once this lands and again once we start putting out 18.x rcs.

I think this probably should come with a release note, 1) to talk about the fix and 2) to call out the behavioral change with hardlinks

llvm/unittests/Support/Path.cpp
714 ↗	(On Diff #556237)	Would this make sense?
917 ↗	(On Diff #556237)	same here as above

This revision is now accepted and ready to land. Sep 11 2023, 1:17 PM

mstorsjo added inline comments. Sep 11 2023, 1:21 PM

llvm/unittests/Support/Path.cpp
714 ↗	(On Diff #556237)	I guess that makes sense, although it’s a bit problematic to fix (without trading in the other bugs we’re trying to fix here). I hope that if someone tries to fix it, that they check up on why we ended up here.

In D155579#4643546, @aaron.ballman wrote:

LGTM! I am a bit uncomfortable about the hardlink behavior (it seems plausible that some folks may use hardlinks to specify paths to include to things like the Windows SDK, so we could run into them).

Just another note on this; hardlinks generally still work - the kinda unusual scenario that would break, is if you refer to the same file with different path names and rely on LLVM to deduplicate those files for you. I honestly believe that case to be kinda rare. The cases where we really want deduplication to happen for files with distinct paths feels kinda rare to me, in general. I.e. I think false negatives for distinct different path names isn't that bad. (False negatives for the same path name would be bad though. And with the previous implementation with FileIndex on unusual file systems, that would also be possible.) On the other hand, any case of a false positive for equivalent() for actually distinct files is totally fatal, and produces extremely confusing errors.

In D155579#4643546, @aaron.ballman wrote:

I think landing this early in the 18.x cycle gives us quite a bit of bake time to see if there's fallout in practice, but we should definitely watch the issues list once this lands and again once we start putting out 18.x rcs.

I think this probably should come with a release note, 1) to talk about the fix and 2) to call out the behavioral change with hardlinks

Yep - I'm updating the commit message for the final solution we settled on, and adding context about other solutions that were considered, and writing a release note for it, and I'll go ahead and land it later today.

Updated with the suggested FIXME markings, and added a release note for it. (I placed the release note in the section for "Changes to the Windows Target", even if this isn't about code generation for a Windows target, but about LLVM running on a Windows host. Instead of creating a separate section for this, I guessed that this reaches the target audience anyway.)

mstorsjo retitled this revision from [Windows] Avoid using FileIndex for unique IDs on network mounts to [Windows] Avoid using FileIndex for unique IDs.Sep 12 2023, 1:30 AM

mstorsjo edited the summary of this revision.

aaron.ballman added inline comments.Sep 12 2023, 5:18 AM

llvm/docs/ReleaseNotes.rst
123–124 ↗	(On Diff #556534)	Please add a link to the issues this resolves (that helps folks connect the dots as to why the changes happened in the first place).

Add references to the fixed bugs in the release notes.

This revision was landed with ongoing or failed builds.Sep 12 2023, 12:12 PM

Closed by commit rG02a375478349: [Windows] Avoid using FileIndex for unique IDs (authored by mstorsjo).

This revision was automatically updated to reflect the committed changes.

mstorsjo added a commit: rG02a375478349: [Windows] Avoid using FileIndex for unique IDs.

Revision Contents

Path	Size
llvm/include/llvm/Support/FileSystem.h	8 lines
llvm/lib/Support/Windows/Path.inc	68 lines

Diff 549818

llvm/include/llvm/Support/FileSystem.h

  class file_status : public basic_file_status {
  [...]
  dev_t fs_st_dev = 0;
  nlink_t fs_st_nlinks = 0;
  ino_t fs_st_ino = 0;
  #elif defined (_WIN32)
  uint32_t NumLinks = 0;
  uint32_t VolumeSerialNumber = 0;
  uint32_t FileIndexHigh = 0;
  uint32_t FileIndexLow = 0;
+ // The PathHash field is only valid if ReliableFileIndex is false.
+ uint64_t PathHash = 0;

aaron.ballman: Some comments explaining that the hash is only expected to be valid if `ReliableFileIndex` is `false` may help.

mstorsjo: Yep, that'd be useful, will add that!

+ bool ReliableFileIndex = true;
  #endif

  public:
  file_status() = default;

  explicit file_status(file_type Type) : basic_file_status(Type) {}

  #if defined(LLVM_ON_UNIX)
  file_status(file_type Type, perms Perms, dev_t Dev, nlink_t Links, ino_t Ino,
  time_t ATime, uint32_t ATimeNSec,
  time_t MTime, uint32_t MTimeNSec,
  uid_t UID, gid_t GID, off_t Size)
  : basic_file_status(Type, Perms, ATime, ATimeNSec, MTime, MTimeNSec,
  UID, GID, Size),
  fs_st_dev(Dev), fs_st_nlinks(Links), fs_st_ino(Ino) {}
  #elif defined(_WIN32)
  file_status(file_type Type, perms Perms, uint32_t LinkCount,
  uint32_t LastAccessTimeHigh, uint32_t LastAccessTimeLow,
  uint32_t LastWriteTimeHigh, uint32_t LastWriteTimeLow,
  uint32_t VolumeSerialNumber, uint32_t FileSizeHigh,
  uint32_t FileSizeLow, uint32_t FileIndexHigh,
- uint32_t FileIndexLow)
+ uint32_t FileIndexLow, bool ReliableFileIndex, uint64_t PathHash)
  : basic_file_status(Type, Perms, LastAccessTimeHigh, LastAccessTimeLow,
  LastWriteTimeHigh, LastWriteTimeLow, FileSizeHigh,
  FileSizeLow),
  NumLinks(LinkCount), VolumeSerialNumber(VolumeSerialNumber),
- FileIndexHigh(FileIndexHigh), FileIndexLow(FileIndexLow) {}
+ FileIndexHigh(FileIndexHigh), FileIndexLow(FileIndexLow),
+ PathHash(PathHash), ReliableFileIndex(ReliableFileIndex) {}
  #endif

  UniqueID getUniqueID() const;
  uint32_t getLinkCount() const;
  };

  /// @}
  /// @name Physical Operators
  [... remaining lines not shown ...]

llvm/lib/Support/Windows/Path.inc

[... 153 lines not shown ...]

if (UTF16ToUTF8(PathName.data(), PathName.size(), PathNameUTF8))

return "";

llvm::sys::path::make_preferred(PathNameUTF8);

return std::string(PathNameUTF8.data());

}

UniqueID file_status::getUniqueID() const {

// The file is uniquely identified by the volume serial number along

- // with the 64-bit file identifier.

+ // with the 64-bit file identifier, as long as the handle that produced

// the file index is open. After closing the handle, the file index may

// be reused.

// The actual stability of file indices depends on the filesystem driver.

// For modern file systems (NTFS), the indices generally are stable. (For

// ReFS, the full file index actually is 128 bit.) For FAT, the indices

// change if the file system is defragmented. However, some network mounts

// reuse file indices very eagerly when handles are closed. (In particular,

// this happens with mounts used for directory sharing in Remote Desktop

// and VirtualBox.)

// If we believe the file index isn't reliable, use a hash of the

// canonicalized path instead, as a more stable identifier.

if (!ReliableFileIndex)

return UniqueID(VolumeSerialNumber, PathHash);

uint64_t FileID = (static_cast<uint64_t>(FileIndexHigh) << 32ULL) |

static_cast<uint64_t>(FileIndexLow);

return UniqueID(VolumeSerialNumber, FileID);

}
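The selection logic in `getUniqueID()` above can be exercised in isolation. The following is a portable sketch, not the LLVM code: `SketchStatus`, `packFileIndex` and `fileComponent` are hypothetical names that mirror the fields shown in this diff, so the bit packing and the fallback to the path hash can be checked without any Windows API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the Windows-only file_status fields in this diff.
struct SketchStatus {
  uint32_t VolumeSerialNumber;
  uint32_t FileIndexHigh;
  uint32_t FileIndexLow;
  uint64_t PathHash;       // only meaningful if ReliableFileIndex is false
  bool ReliableFileIndex;
};

// Combine the two 32-bit halves into one 64-bit file identifier.
uint64_t packFileIndex(uint32_t High, uint32_t Low) {
  return (static_cast<uint64_t>(High) << 32) | static_cast<uint64_t>(Low);
}

// Pick the "file" half of the unique ID: the path hash when the index
// is not trusted, otherwise the packed file index.
uint64_t fileComponent(const SketchStatus &S) {
  if (!S.ReliableFileIndex)
    return S.PathHash;
  return packFileIndex(S.FileIndexHigh, S.FileIndexLow);
}

// Usage: the same index halves give different results depending on
// whether the index is considered reliable.
void demoFileComponent() {
  SketchStatus Local{7, 1, 2, 99, true};
  SketchStatus Remote{7, 1, 2, 99, false};
  assert(fileComponent(Local) == 0x0000000100000002ULL);
  assert(fileComponent(Remote) == 99);
}
```

The cast to `uint64_t` before the shift matters: shifting a 32-bit value left by 32 would be undefined behavior.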

ErrorOr<space_info> disk_space(const Twine &Path) {

ULARGE_INTEGER Avail, Total, Free;

[... 471 lines not shown ...]

bool can_execute(const Twine &Path) {

return !access(Path, AccessMode::Execute) ||

!access(Path + ".exe", AccessMode::Execute);

}

bool equivalent(file_status A, file_status B) {

assert(status_known(A) && status_known(B));

- return A.FileIndexHigh == B.FileIndexHigh &&

- A.FileIndexLow == B.FileIndexLow && A.FileSizeHigh == B.FileSizeHigh &&

- A.FileSizeLow == B.FileSizeLow &&

- A.LastWriteTimeHigh == B.LastWriteTimeHigh &&

- A.LastWriteTimeLow == B.LastWriteTimeLow &&

- A.VolumeSerialNumber == B.VolumeSerialNumber;

+ return A.getUniqueID() == B.getUniqueID();

}

std::error_code equivalent(const Twine &A, const Twine &B, bool &result) {

file_status fsA, fsB;

if (std::error_code ec = status(A, fsA))

return ec;

if (std::error_code ec = status(B, fsB))

return ec;

[... 28 lines not shown ...]

static file_type file_type_from_attrs(DWORD Attrs) {

return (Attrs & FILE_ATTRIBUTE_DIRECTORY) ? file_type::directory_file

: file_type::regular_file;

}

static perms perms_from_attrs(DWORD Attrs) {

return (Attrs & FILE_ATTRIBUTE_READONLY) ? (all_read | all_exe) : all_all;

}

- static std::error_code getStatus(HANDLE FileHandle, file_status &Result) {

+ static std::error_code getStatus(HANDLE FileHandle,

+ SmallVectorImpl<wchar_t> &Path,

+ file_status &Result) {

if (FileHandle == INVALID_HANDLE_VALUE)

goto handle_status_error;

switch (::GetFileType(FileHandle)) {

default:

llvm_unreachable("Don't know anything about this file type");

case FILE_TYPE_UNKNOWN: {

DWORD Err = ::GetLastError();

[... 11 lines not shown ...]

case FILE_TYPE_PIPE:

Result = file_status(file_type::fifo_file);

return std::error_code();

}

BY_HANDLE_FILE_INFORMATION Info;

if (!::GetFileInformationByHandle(FileHandle, &Info))

goto handle_status_error;

bool PathCanonicalized;

PathCanonicalized = false;

aaron.ballman (suggested edit):

goto handle_status_error;
- bool PathCanonicalized;
- PathCanonicalized = false;
+ bool PathCanonicalized = false;
if (Path.empty()) {

mstorsjo: Yes, I'd love to do that - but there are gotos that jump past this spot, and the gotos can't cross an initialization.

aaron.ballman: Ugh... "thanks" goto! :-D
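The restriction mentioned in the reply is a C++ language rule: a `goto` may not jump forward into the scope of a variable past its initializer, but it may jump past a scalar that is declared without an initializer and assigned afterwards. A minimal sketch, where `classify` is a hypothetical function and not part of the patch:

```cpp
#include <cassert>

// Hypothetical helper with the same control-flow shape as getStatus():
// an early goto to an error label at the end of the function.
int classify(int Handle) {
  if (Handle < 0)
    goto on_error; // jumps forward past the declaration below

  // Writing `bool Canonicalized = false;` here would be ill-formed,
  // because the goto above would bypass the initialization while the
  // variable is still in scope at the label. Declaring first and
  // assigning second is allowed.
  bool Canonicalized;
  Canonicalized = false;
  if (Handle == 0)
    Canonicalized = true;
  return Canonicalized ? 1 : 0;

on_error:
  return -1;
}

// Usage: all three paths through the function.
void demoClassify() {
  assert(classify(-1) == -1);
  assert(classify(0) == 1);
  assert(classify(5) == 0);
}
```

Wrapping the declaration in its own block scope would be another way to satisfy the rule, at the cost of restructuring the function.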

if (Path.empty()) {

realPathFromHandle(FileHandle, Path);

PathCanonicalized = true;

}

bool IsLocal;

if (std::error_code EC = is_local_internal(Path, IsLocal))

return EC;

bool ReliableFileIndex;

uint64_t PathHash;

ReliableFileIndex = true;

PathHash = 0;

aaron.ballman (suggested edit):

return EC;
- bool ReliableFileIndex;
- uint64_t PathHash;
- ReliableFileIndex = true;
- PathHash = 0;
+ bool ReliableFileIndex = true;
+ uint64_t PathHash = 0;
if (!IsLocal) {

if (!IsLocal) {

// File indices aren't necessarily stable after closing the file handle.

// As a heuristic, consider files that are local to have stable (enough)

// file indices. Some network mounts are known to eagerly reuse file

// indices.

// If the index isn't considered reliable, hash the canonicalized path

// instead.

ReliableFileIndex = false;

if (!PathCanonicalized)

realPathFromHandle(FileHandle, Path);

PathHash = hash_combine_range(Path.begin(), Path.end());

}

Result = file_status(

file_type_from_attrs(Info.dwFileAttributes),

perms_from_attrs(Info.dwFileAttributes), Info.nNumberOfLinks,

Info.ftLastAccessTime.dwHighDateTime, Info.ftLastAccessTime.dwLowDateTime,

Info.ftLastWriteTime.dwHighDateTime, Info.ftLastWriteTime.dwLowDateTime,

Info.dwVolumeSerialNumber, Info.nFileSizeHigh, Info.nFileSizeLow,

- Info.nFileIndexHigh, Info.nFileIndexLow);

+ Info.nFileIndexHigh, Info.nFileIndexLow, ReliableFileIndex, PathHash);

return std::error_code();

handle_status_error:

DWORD LastError = ::GetLastError();

if (LastError == ERROR_FILE_NOT_FOUND || LastError == ERROR_PATH_NOT_FOUND)

Result = file_status(file_type::file_not_found);

else if (LastError == ERROR_SHARING_VIOLATION)

Result = file_status(file_type::type_unknown);

else

Result = file_status(file_type::status_error);

return mapWindowsError(LastError);

}

static std::error_code getStatus(HANDLE FileHandle, file_status &Result) {

SmallVector<wchar_t, 128> path_utf16;

return getStatus(FileHandle, path_utf16, Result);

}

std::error_code status(const Twine &path, file_status &result, bool Follow) {

SmallString<128> path_storage;

SmallVector<wchar_t, 128> path_utf16;

StringRef path8 = path.toStringRef(path_storage);

if (isReservedName(path8)) {

result = file_status(file_type::character_file);

return std::error_code();

[... 13 lines not shown ...]

std::error_code status(const Twine &path, file_status &result, bool Follow) {

ScopedFileHandle h(

::CreateFileW(path_utf16.begin(), 0, // Attributes only.

FILE_SHARE_DELETE | FILE_SHARE_READ | FILE_SHARE_WRITE,

NULL, OPEN_EXISTING, Flags, 0));

if (!h)

return getStatus(INVALID_HANDLE_VALUE, result);

- return getStatus(h, result);

+ return getStatus(h, path_utf16, result);

}

std::error_code status(int FD, file_status &Result) {

HANDLE FileHandle = reinterpret_cast<HANDLE>(_get_osfhandle(FD));

return getStatus(FileHandle, Result);

}

std::error_code status(file_t FileHandle, file_status &Result) {

[... 790 lines not shown ...]
