This is an archive of the discontinued LLVM Phabricator instance.

[lit] Detect Inconsistent File Access Times
Changes PlannedPublic

Authored by stuij on Feb 23 2023, 4:12 AM.

Details

Summary

I was seeing failing tests in my builds due to another process (in my
case, antivirus) changing the access time of files during the tests,
so tests using touch -a were failing.

This change modifies how we check if a system has consistent atime. We
have a list of platforms where we know these tests don't work (NetBSD,
Windows) with their existing reasons documented.

On other platforms, we do the same check that some of the tests
themselves end up doing: set the atime using touch, and then check it
was preserved with ls. I have found this check to be robust on my
system, but it's relying on the process updating the atime to race
with the process reading the atime, so it may not be as robust as we
want. Only if the check is successful, then you get a consistent-atime
feature which you can make your test REQUIRE.

This updates all existing tests which mention atime or use touch -a to
REQUIRE: consistent-atime, fixing their flakiness.

Diff Detail

Event Timeline

lenary created this revision.Feb 23 2023, 4:12 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald Transcript
Herald added a reviewer: Restricted Project. · View Herald Transcript
Herald added subscribers: pmatos, asb, ormris and 5 others. · View Herald Transcript
lenary requested review of this revision.Feb 23 2023, 4:12 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptFeb 23 2023, 4:12 AM
lenary added a subscriber: mgorny.

Adding more reviewers. @mgorny you disabled some tests on NetBSD in the past for the same reason (rG92dc7dce4a6f117a497ced1650bc48e5b658f0ea), this just updates how they were disabled.

This looks reasonable to me, with the caveat that I don't know a huge amount about how the different OSes access time systems work. One question though: if your antivirus was causing flakiness (as opposed to outright always-fails), won't it just move that flakiness into whether the REQUIRES calculation returns true or not (i.e. it could spuriously do so, causing the tests to be enabled but then potential still be flaky?

llvm/utils/lit/lit/llvm/config.py
165

Is "hopefully" really needed here?

This looks reasonable to me, with the caveat that I don't know a huge amount about how the different OSes access time systems work. One question though: if your antivirus was causing flakiness (as opposed to outright always-fails), won't it just move that flakiness into whether the REQUIRES calculation returns true or not (i.e. it could spuriously do so, causing the tests to be enabled but then potential still be flaky?

I'm definitely not an expert in OS atimes either, beyond the comments that have accumulated in the repo so far, and knowing that some filesystems prefer to have it switched off for performance reasons.

The flakiness I (and colleagues) have seen seems to correspond to computer load. In my case, I think the underlying issue is a race between the touch -a -t <timestamp> <file> (which in these test creates <file>), then the ls -ul <file>, and the antivirus, which is watching for file creation, and will also access the file. This is also one reason why I put "hopefully" in the comment - we're hoping to get the same race during lit configuration. One reason I think we're more likely to is that this code is run before we spin up lots of parallel threads to start testing, so the test machine should be under less load for this check, which should allow the antivirus to get in before the ls -ul, if all goes to plan (it's a race, so this is hard to be sure about).

I can see your point about this moving the flakiness, but I think that this check is more likely to fail in the other direction: marking the tests as unsupported when they might be ok. In that case, we lose coverage when we didn't need to.

llvm/utils/lit/lit/llvm/config.py
165

I'm hedging in this comment, in part because we're trying to find a race condition experimentally, and also because this will cause lit to fatal error before running any tests if touch exits non-zero for any reason.

int3 added a subscriber: int3.Feb 23 2023, 7:27 AM
int3 added inline comments.
llvm/utils/lit/lit/llvm/config.py
165

if we know for sure that NetBSD and Windows don't support this, why not hardcode those values in, and only try the experimental check otherwise?

Thanks very much for looking into this, the flakiness of these tests has been bugging me for ages.

llvm/utils/lit/lit/llvm/config.py
171

It looks like this command will be run on Windows. I think it will fail and cause False to be returned, which is the desired result, but this appears to be by accident rather than design. Therefore I'm inclined to agree with @int3 that a hard-coded check would be preferable.

Updates incoming to add a specific check for netbsd and windows, which are currently excluding the affected tests anyway.

llvm/utils/lit/lit/llvm/config.py
165

Yeah, this seems pragmatic, I will add it back in again - excluding NetBSD and Windows.

171

I am going to add the hardcoded checks, but I think touch is available in windows, it should be in the same directory as all the git binaries.

llvm/utils/lit/lit/llvm/config.py
171

This is definitely tangential to the change, but in case it's useful to know: conventionally only C:\Program Files\Git\bin is added to the path on Windows, not C:\Program Files\Git\usr\bin. C:\Program Files\Git\bin only contains bash.exe, git.exe and sh.exe.

lenary added inline comments.Feb 27 2023, 4:05 AM
llvm/utils/lit/lit/llvm/config.py
171

Something weirder is happening in _find_git_windows_unix_tools (from https://reviews.llvm.org/D84380), but I think it's probably right just to early exit with false on Windows.

lenary updated this revision to Diff 500742.Feb 27 2023, 5:22 AM
lenary retitled this revision from [lit] Detect Consistent File Access Times to [lit] Detect Inconsistent File Access Times.
lenary edited the summary of this revision. (Show Details)
lenary marked 2 inline comments as done.Feb 27 2023, 5:23 AM

Hopefully this addresses the feedback so far.

I see some of these tests previously had UNSUPPORTED: system-netbsd but not UNSUPPORTED: system-windows - do you know why?

llvm/utils/lit/lit/llvm/config.py
190

This could end up matching the wrong part of the string, for example if the temporary file happened to have 1995 in its name. A regex match for '\b1995\b' would be more reliable.

jhenderson added inline comments.Feb 28 2023, 12:39 AM
llvm/utils/lit/lit/llvm/config.py
171

I've got no objection to explicitly omitting Windows, but just wanted to point out that LLVM requires various Unix-like tools to be available according to https://llvm.org/docs/GettingStarted.html#software. Whilst touch isn't explicitly specified, I'd be surprised if you managed to get access to that explicit list without including touch inadvertently (it's also worth noting that there are several other tests that use touch and other utilities that are intended to work on Windows). I thought that Git\usr\bin was recommended to be included in people's paths for Windows, as the way to get access to these tools, but I couldn't find that recommendation, so maybe that's just our downstream recommendation.

I see some of these tests previously had UNSUPPORTED: system-netbsd but not UNSUPPORTED: system-windows - do you know why?

I'm not entirely sure why, but I have some ideas:

It looks like the pruning logic for both LTO and Clang modules is the same, using llvm::sys::fs::file_status::getLastAccessedTime() (but, this of course has to use the underlying system information), which is:

I've actually just done a more thorough search, and there are other tests that use touch -a (a few in clang/test/PCH, some others in clang/test/Modules). I presume this comes down to logic as to whether the code uses getLastAccessedTime which seems to be used:

  • in the modules cache prune logic (pruneModuleCache in clang/lib/Frontend/CompilerInstance.cpp)
  • in the LTO cache pruning logic (llvm::pruneCache)
  • in FileCollector, as used by Modules (llvm/include/llvm/Support/FileCollector.h)
  • in FilePermissionsApplier, as used by llvm-dwarfutil and llvm-objcopy (declared in llvm/include/llvm/Support/FileUtilities.h)

I cannot really see any correlation between the logic under test, and the use of touch -a or touch -a -m vs just touch. I'm not sure what the right "fix" should be any more. I guess I'm just happy for everyone who is not running into this issue.

lenary updated this revision to Diff 501838.Mar 2 2023, 5:38 AM
lenary marked an inline comment as done.
lenary added inline comments.Mar 2 2023, 5:38 AM
llvm/utils/lit/lit/llvm/config.py
190

Done.

This revision is now accepted and ready to land.Mar 2 2023, 8:59 AM
lenary marked an inline comment as done.Mar 10 2023, 3:18 AM

@jhenderson @int3 I think I have addressed your feedback - are you happy for me to land this?

Related to my inline comment, your changes will result in some tests being disabled on Windows that weren't before (at least one of the tests pass for me even on my machine where atime is disabled). I think we need to understand why these tests pass on Windows before losing test coverage by disabling them.

llvm/utils/lit/lit/llvm/config.py
172

As mentioned, I don't think this is a fair statement: many tests use touch or ls, and work fine on Windows. That being said, the top Google result for how to detect if access time is enabled on Windows yields the following command: fsutil behavior query disablelastaccess, which prints some information about its state that could be easily queried in python to give a more correct answer for Windows.

simonwallis2 accepted this revision.Apr 11 2023, 12:10 AM
lenary planned changes to this revision.Apr 12 2023, 2:38 AM

To be clear: I've not yet addressed @jhenderson's comments above, yet, which may require changes to the patch.

michaelplatings commandeered this revision.Apr 12 2023, 8:50 AM
michaelplatings edited reviewers, added: lenary; removed: michaelplatings.
lenary resigned from this revision.Apr 13 2023, 10:25 AM
stuij added a subscriber: stuij.Jun 23 2023, 4:16 AM
stuij commandeered this revision.Aug 22 2023, 8:00 AM
stuij added a reviewer: michaelplatings.
michaelplatings resigned from this revision.Sep 11 2023, 8:30 AM