This implements the directory watcher on Windows. It does the most
naive thing for simplicity. ReadDirectoryChangesW is used to monitor
the changes. However, in order to support interruption, we must use
overlapped IO, which allows us to use the blocking, synchronous
mechanism. We create a thread to post the notification to the consumer
to allow the monitoring to continue. The two threads communicate via a
locked queue.
Details
- Reviewers
amccarth - Commits
- rG5d74c4351175: DirectoryWatcher: add an implementation for Windows
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Time | Test | |
---|---|---|
1,320 ms | linux > libarcher.parallel::parallel-nosuppression.c |
Event Timeline
Overall looks good.
clang/lib/DirectoryWatcher/windows/DirectoryWatcher-windows.cpp | ||
---|---|---|
95 | Can this be an assert with some message or some explanation of the hEvent return value? What happens if hEvent is non-zero on Release builds? |
I wonder if we should unit test this functionality by having some tests that create and remove files that are watched. I'm not 100% convinced that is a great idea, but not having test coverage for the change is also not really a great idea either. Thoughts welcome.
clang/lib/DirectoryWatcher/windows/DirectoryWatcher-windows.cpp | ||
---|---|---|
22 | You should include llvm/Support/Windows/WindowsSupport.h not Windows.h directly. | |
71–76 | hDirectory -> Directory | |
80 | hDirectory -> Directory | |
86 | You should strip the Hungarian notation prefixes and ensure all the identifiers meet our usual naming rules, I'll stop bringing them up. | |
87 | Is a smart pointer required here or could you use std::vector<WCHAR> and reserve the space that way? | |
101 | You can drop the top-level const on value types. | |
170 | A newline above this line would be helpful for visual distinction. | |
232 | I think we should assert that -- calling this on a file isn't likely to behave in a good way. | |
238 | More top-level consts |
clang/lib/DirectoryWatcher/windows/DirectoryWatcher-windows.cpp | ||
---|---|---|
95 | I debated this myself which is why I added the assert. Honestly, if this fails, there is very little that can be done. This is creating an anonymous (unnamed) event. The failure here would be caused by out-of-memory conditions (you're dead anyways) or system is completely out of resources (you're dead anyways). I don't know of any recovery in that situation. I really would prefer to avoid the two-phase construction which is going to be required if we want to handle that error scenario. The event only makes sense after we have the directory handle, so I suppose that we could setup the event prior to the construction of the watcher itself, but that is just as much two-phase construction as adding an initialize method. |
clang/lib/DirectoryWatcher/windows/DirectoryWatcher-windows.cpp | ||
---|---|---|
87 | Sure, I can convert this to a std::vector<WCHAR> instead. |
@aaron.ballman - I completely agree with you about the testing. The interfaces are tested via https://reviews.llvm.org/source/llvm-github/browse/master/clang/unittests/DirectoryWatcher/DirectoryWatcherTest.cpp, which now that I look at again, seems to need an additional case for system name.
I'm sorry. I haven't had to time to review the entire change yet, but I thought I'd share some early feedback now and take more of a look on Monday.
The high level issues on my mind:
I'm wondering whether this has been overcomplicated with the overlapped IO. If the monitoring thread used FindFirstChangeNotificationW to get a waitable handle and then used, then you'd be able to call ReadDirectoryChangesW synchronously. In order to allow the parent thread signal to quit, they'd just need an Event and the monitor thread would use WaitForMultipleObjects to wait for either handle to become signaled. Maybe I'm overlooking something, but it might be worth a few minutes of consideration.
We'll also have to think about how to test this.
The lower level issues that I've spotted are inlined.
clang/lib/DirectoryWatcher/windows/DirectoryWatcher-windows.cpp | ||
---|---|---|
33 | The ovlIO name isn't consistent with LLVM style. | |
36 | If it were me, I'd probably make this a std::vector.
Also, I'd probably go with a slightly more descriptive name, like Notifications rather than Buffer. | |
82 | There's a lot going on in this constructor. Is this how the other implementations are arranged? Would it make sense to just initialize the object, and save most of the actual work to a Watch method? | |
87 |
| |
88 | I don't think you want to ignore the return value, since it'll tell you exactly how many characters you actually got back (or whether there was an error). Again, I recommend using realPathFromHandle from Support. | |
94 | No real difference here, but, for consistency, please make this CreateEventW with the explicit -W suffix. | |
189 | I think this leaks the ovlIO.hEvent. After you've joined both threads, make sure to call ::CloseHandle(). |
There already is testing coverage for this - I just missed the CMake changes.
clang/lib/DirectoryWatcher/windows/DirectoryWatcher-windows.cpp | ||
---|---|---|
36 | The alignas is because the documentation states that the buffer should be DWORD aligned. It is more for pedantic reasons rather than anything else. I think that making it a catastrophic failure is a good thing though - it would catch the error :) You are correct about the allocation - it is once per watch. I'll rename it at least. | |
82 | Largely the same. However, the majority of the "work" is actually the thread proc for the two threads. | |
87 | I didn't know about realPathFromHandle - I prefer that actually. |
clang/lib/DirectoryWatcher/windows/DirectoryWatcher-windows.cpp | ||
---|---|---|
87 | Actually, realPathFromHandle is private to Path.cpp :-( |
Some of this is nitpicky/opinionated, but the race condition is real. We need a reliable way to signal the watcher thread when it's time to exit. Options I see are:
- Use the FindFirstChange function to get a handle to wait on for the directory change and create a separate event to signal when it's time to exit. The watcher thread would use WaitForMultipleObjects. If' it's a directory change, then it can make a synchronous call the ReadDirectoryChangesW, knowing that there's info available. (It could possibly even do the callback at that point, without the need for a separate handler thread.)
- Continue to use the ReadDirectoryChangesW with overlapped IO, but, instead of waiting in GetOverlappedResult, it would first use WaitForMultipleObjects on the event in the overlapped IO and a distinct event used to tell the threat to exit (as in #1).
clang/lib/DirectoryWatcher/windows/DirectoryWatcher-windows.cpp | ||
---|---|---|
36 | But it's still an arbitrarily-sized buffer in the middle of a class definition. If you change your mind about how big to make it, it changes the definition of the class. The buffer is going to be accessed using pointer arithmetic, which is generally dangerous. Moving the buffer out of the class avoids both of those problems. The alignas here is _not_ pedantic, it's essential. Without it, you could easily have an alignment problem. But if you used a vector, you'd know that it would always be suitably aligned. | |
82 | Let me put it another way. Constructors cannot return errors and LLVM does use exceptions, so things that can fail generally shouldn't be in the constructor. This code is accessing the file system, creating and event, and spawning two threads. Any of those things can fail, but you've got no way to let the caller know whether something went wrong. If the lambdas were really short, then it would be easy to see that they're thread procs. But they're not, so they're hard to find and understand. If they were private member functions with descriptive names, the code would be easier to understand. | |
92 | Using Buffer.data() when you've only reserved space is undefined behavior. You should used resize instead of reserve and then pass the size rather than the capacity. Be aware that, while unlikely, this could still fail. The directory could have been removed or renamed between calls or the caller could have passed a bad handle to begin with. | |
195 | I don't think this is a reliable way to get the watcher thread to exit. You've overloaded the meaning of the event object and are racing the i/o system who plans to use the event for its own purposes. Suppose the watcher thread is waiting in GetOverlappedResult. If you set the event, then the other fields of the OVERLAPPED struct are in an indeterminant state. I don't see how the watcher thread can distinguish your exit signal from a normal completion. In another case, suppose to set the event when the watcher thread has just completed a GetOverlappedResult but hasn't yet started the next ReadDirectoryChangesW. The watcher thread won't notice the signal. It'll just loop around and the next ReadDirectoryChangesW will reset it. This will hang. This one reason why I suggested FindFirstChange and WaitForMultipleObjects. It lets you have distinct events objects for distinct events. It also has the benefit that you wouldn't need the overlapped IO and you possibly could combine the watcher and handler functions into a single thread. | |
clang/unittests/DirectoryWatcher/CMakeLists.txt | ||
1 | I'm not a Cmake expert, but I"m curious way MATCHES "Linux" but STREQUAL Windows. |
The reason that I added the overlapped event was specifically for the cancellation and overlooked the fact that it will be used by the kernel side as well, thanks for catching that!
clang/unittests/DirectoryWatcher/CMakeLists.txt | ||
---|---|---|
1 | MATCHES is a regular expression, and overly expensive. The STREQUAL is a string comparison (ala strcmp). |
clang/lib/DirectoryWatcher/windows/DirectoryWatcher-windows.cpp | ||
---|---|---|
82 | Sure, moving the thread procs into a member function is reasonable. Making that request would've been more clear :) |
clang/lib/DirectoryWatcher/windows/DirectoryWatcher-windows.cpp | ||
---|---|---|
15 | Left-overs :-( |
Remove unnecessary include, fix an incorrect wait (verified that unit tests now pass!)
We're seeing the tests for this fail in Chromium's Clang builds: https://bugs.chromium.org/p/chromium/issues/detail?id=1137737
I'll try increasing the test's timeout for now and see if that helps.
The longer timeout didn't help :(
I'm not sure what's different about the machine where this is failing. Maybe it's some filesystem issue due to being a VM?
Any ideas for good printfs or similar that could be added to figure out exactly what part is failing?
If I had to guess, my money would be on a deadlock. To unblock, I'd
propose reverting this patch until we can figure it out.
During the review, a deadlock was fixed related to the watcher thread, but
perhaps we missed one for the notifier thread.
I've also been seeing some failures on phab reviews, e.g. https://reviews.llvm.org/D89188.
This patch was reverted a while back because a couple DirectoryWatcher tests routinely timed out on build bots even though they work when run in isolation.
I believe the problem is that, on a machine busy doing a build, the startup of the watcher thread is delayed (either because the scheduler algorithm or the thread pool policy). Thus the "initial scan" on the test thread can complete _before_ the watcher thread has called ReadDirectoryChangesW. This leaves a window of time where the test can change a file that the watcher thread will miss.
To test this hypothesis, I took this patch and created one more event called WatcherThreadReady. I have the watcher thread set this event after successfully calling ReadDirectoryChangesW. In the constructor, I changed:
if (WaitForInitialSync) InitialScan();
to
if (WaitForInitialSync) { ::WaitForSingleObject(WatcherThreadReady, 10000); InitialScan(); }
This is crude, but it seems to be effective. The tests pass reliably for me when my machine is fully loaded. I didn't use an INFINITE timeout because it seems possibly missing a file change is less bad than hanging forever. I didn't even bother to check the result from the wait because there's nothing sane to do besides "best effort" if something goes wrong. I used a Windows event because those are most familiar to me and it's Windows-only code. But it certainly could be done with other type of synchronization object.
There may be more elegant ways to solve this, but something like this directly addresses the root cause with fairly minimal changes.
I wonder if the Linux and Mac implementations might suffer from a similar window but the bug is rare because of differences in thread scheduler.
I was able to play around with this further yesterday evening. You are correct - the issue is the load preventing the watcher thread from spinning up. I was able to reproduce this issue and resolve it by adding in a synchronization point (boolean + mutex + condition variable) before returning the directory watcher to ensure that the RDC was setup. I looked through all the previous failure cases as well as the one that I was finally able to reproduce - it is always the initial notification that we missed (because the thread took too long to come up). In the process I did do a few minor alterations as well. I would like to get additional testing over the weekend on the bots so people aren't impacted if there turns out to be some other subtle threading issue. However, running this in a tight loop locally seemed to be pretty stable (switching the builds to a SSD locally did help uncover the flakiness) so I am confident that this should be safe. I'm happy to address any additional comments in post-commit review.
We still occasionally (every couple of runs) see these tests hang on Windows in both Debug and Release. Unfortunately, I don't have access to the machines running the tests to debug the tests while they are hanging and I haven't had a chance to try to reproduce locally.
Interesting, are the logs from the runs available? I have run the test ~10000 times locally and its been stable. Perhaps the logs can show what is going on.
This is the best I can do from the online builds. I'll try and repro locally as well:
FAIL: Clang-Unit :: DirectoryWatcher/./DirectoryWatcherTests.exe/DirectoryWatcherTest.InitialScanAsync (75980 of 75980) ******************** TEST 'Clang-Unit :: DirectoryWatcher/./DirectoryWatcherTests.exe/DirectoryWatcherTest.InitialScanAsync' FAILED ******************** Script: -- D:\a\_work\1\b\llvm\Debug\tools\clang\unittests\DirectoryWatcher\.\DirectoryWatcherTests.exe --gtest_filter=DirectoryWatcherTest.InitialScanAsync -- Note: Google Test filter = DirectoryWatcherTest.InitialScanAsync [==========] Running 1 test from 1 test suite. [----------] Global test environment set-up. [----------] 1 test from DirectoryWatcherTest [ RUN ] DirectoryWatcherTest.InitialScanAsync ******************** Testing: 0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90.. Terminate batch job (Y/N)? interrupted by user, skipping remaining tests
One thing we've run into in the past is that running some of these tests on a drive other than the OS drive can cause weird failure. I am not sure if that may be the case here as well.
As long as it's a local drive, that _shouldn't_ be a problem. I always run tests on a different drive than the OS system drive.
If it's a network drive, then, yeah, that would likely be a problem. If I recall correctly, ReadDirectoryChangesW has substantial limitations when pointed at a remote drive. The implementation should probably check that and signal an "unsupported" error.
Also note that Stella's sample log looks slightly different than the failures we were reproducing. It's almost as if the initial scan never finished. I haven't looked at that code, but I wonder if the file iteration is stuck in some kind of loop due to links or mount points or something.
I wasn't able to reproduce this locally by running *just* the DirectoryWatcher tests. I'm now running all of the clang tests repeatedly to see if I can get a repro that way.
The online tests appear to always hang either in the InitialScanAsync or InvalidatedWatcherAsync based on the log from the failures. We are using a non-OS drive for the tests, but it is not a network drive. The hang is also very consistent in our online testing - every couple of runs, sometimes more often. I suspect one of the reasons I cannot reproduce it locally with ease is that the test machines are able to run more tests in parallel and faster.
[==========] Running 1 test from 1 test suite. [----------] Global test environment set-up. [----------] 1 test from DirectoryWatcherTest [ RUN ] DirectoryWatcherTest.InvalidatedWatcherAsync ******************** Testing: 0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90.. Terminate batch job (Y/N)? interrupted by user, skipping remaining tests
We also see check-all timeout recently (fairly consistently), see https://bugs.chromium.org/p/chromium/issues/detail?id=1221702
Since Stella reported problems with this too, I speculatively reverted it (and follow-ups) in fb32de9e97af0921242a021e30020ffacf7aa6e2 for now.
Thanks!
I'm still trying to reproduce locally by running bigger and bigger sets of tests. check-clang doesn't appear to reproduce the issue for me, all of our online tests run check-all, so I am trying that next.
FWIW https://lab.llvm.org/buildbot/#/builders/123 has been red for several days after this landed too (eg https://lab.llvm.org/buildbot/#/builders/123/builds/4545)
I don't see a reason to include <atomic> here.