This is an archive of the discontinued LLVM Phabricator instance.

[Support] PR42623: Avoid setting the delete-on-close bit if a TempFile doesn't reside on a local drive
ClosedPublic

Authored by rdwampler on Jun 14 2020, 6:28 AM.

Details

Summary

On Windows, after commit 881ba104656c40098d4bc90c52613c08136f0fe1, tools
using TempFile would error with "bad file descriptor" when writing the file
on a network drive. It appears that setting the delete-on-close bit via
SetFileInformationByHandle/FileDispositionInfo prevented it from
accessing the file on network drives, and although using FILE_DISPOSITION_INFO
seems to work, it causes other troubles.

Diff Detail

Event Timeline

rdwampler created this revision.Jun 14 2020, 6:28 AM

I noticed this first when using llvm-ar (LLVM 10), which worked on mapped network drive with LLVM 6.0 and bisected it to the commit in the message. I'm not familiar with Windows api so maybe there is a better way?

rdwampler updated this revision to Diff 270642.Jun 14 2020, 3:43 PM

That commit (which was reviewed here https://reviews.llvm.org/D48051) suggests this wasn't about the occasional temp file not getting deleted. It was to work around surprising behavior on Windows when a temp file is mapped into memory, which, as I recall, was the underlying cause in what appeared to be cache corruption for LTO.

I think it best if @pcc and/or @inglorion weigh in.

BTW, is it common to create temp files on a network drive? Is that done in distributed build systems? For a single-machine build, it would seem counter productive to use a network drive for a temp file.

That commit (which was reviewed here https://reviews.llvm.org/D48051) suggests this wasn't about the occasional temp file not getting deleted. It was to work around surprising behavior on Windows when a temp file is mapped into memory, which, as I recall, was the underlying cause in what appeared to be cache corruption for LTO.

I think it best if @pcc and/or @inglorion weigh in.

BTW, is it common to create temp files on a network drive? Is that done in distributed build systems? For a single-machine build, it would seem counter productive to use a network drive for a temp file.

I have a similar setup as described in here https://bugs.llvm.org/show_bug.cgi?id=42623 for local development. That bug report also includes a link to a GitHub issue where others where having the same issue. So, it's probably not uncommon to build on a drive attached as a network drive to a VM.

rnk requested changes to this revision.Sep 15 2020, 11:32 AM

The situation where a temp file won't be delete without the delete-on-close bit set seems rare so just avoid setting it entirely.

In my experience, it's actually very common for LLVM tools to leak temp files, so we really need to preserve this functionality, it can't just be removed.

We could accept a more targeted fix. Does the call to CreateFile fail with the delete on close disposition? If so, we could retry without the disposition if it fails.

This revision now requires changes to proceed.Sep 15 2020, 11:32 AM

We could accept a more targeted fix. Does the call to CreateFile fail with the delete on close disposition? If so, we could retry without the disposition if it fails.

The call to CreateFile does succeed, but it’s the subsequent writes to the file handle that fails. Perhaps, we can try to detect if we are creating a file on a network drive using GetDriveType and if so avoid setting delete on close?

rdwampler retitled this revision from [Support] PR42623: Avoid setting the delete-on-close bit for TempFile to [Support] PR42623: Avoid setting the delete-on-close bit if a TempFile doesn't reside on a local drive.Sep 30 2020, 8:24 AM
rdwampler edited the summary of this revision. (Show Details)
rdwampler updated this revision to Diff 295298.Sep 30 2020, 8:31 AM

Reverted change that removed code setting the file disposition.
Instead, added check to see if we are on a non-local filesystem (i.e., network drive), if so the disposition isn't set.

rnk accepted this revision.Oct 29 2020, 4:13 PM

lgtm

Sorry for the delay! It's been a busy fall.

This revision is now accepted and ready to land.Oct 29 2020, 4:13 PM
This revision was landed with ongoing or failed builds.Oct 30 2020, 10:44 AM
This revision was automatically updated to reflect the committed changes.
nikic added a subscriber: nikic.Feb 27 2021, 12:11 PM

Just FTR, this change had to be reverted in Rust's LLVM fork because it breaks compilation on Windows 7, see https://github.com/rust-lang/rust/issues/81051.

sbc100 added a subscriber: sbc100.Mar 5 2021, 8:31 AM

Just FTR, this change had to be reverted in Rust's LLVM fork because it breaks compilation on Windows 7, see https://github.com/rust-lang/rust/issues/81051.

Emscripten users also seem to be effect by this. Specifically windows 7 users are seeing link failure: "wasm-ld: error: failed to write the output file: permission denied".

https://github.com/emscripten-core/emscripten/issues/13067

Just FTR, this change had to be reverted in Rust's LLVM fork because it breaks compilation on Windows 7, see https://github.com/rust-lang/rust/issues/81051.

Emscripten users also seem to be effect by this. Specifically windows 7 users are seeing link failure: "wasm-ld: error: failed to write the output file: permission denied".

https://github.com/emscripten-core/emscripten/issues/13067

This change was to fix a very narrow use case. Since it affecting all Windows 7 users, I think we should revert it. I can investigate if there's another way to work around the original issue. But until then, is there still time to get this revert in the 12.0.0 release @tstellar?

Just FTR, this change had to be reverted in Rust's LLVM fork because it breaks compilation on Windows 7, see https://github.com/rust-lang/rust/issues/81051.

Emscripten users also seem to be effect by this. Specifically windows 7 users are seeing link failure: "wasm-ld: error: failed to write the output file: permission denied".

https://github.com/emscripten-core/emscripten/issues/13067

This change was to fix a very narrow use case. Since it affecting all Windows 7 users, I think we should revert it. I can investigate if there's another way to work around the original issue. But until then, is there still time to get this revert in the 12.0.0 release @tstellar?

FWIW, we've got a downstream user who has run into an issue on our LLVM 10 based toolchain, which looks likely to benefit from this fix as-is (we're still waiting on confirmation, as we don't have access to a Mach-O machine needed to reproduce the issue they were seeing). Consequently, there's a soft oppose from me to reverting this without an alternative fix.

Perhaps there should be a wider discussion on Windows support - it's been over a year since Windows 7 was end of life, so we can't really expect LLVM developers or build bots to verify that changes work on such machines (since they'd have inherent security issues). If we continue to support Windows 7, we're likely to break things in future releases too, which won't be picked up until potentially quite late on. I can't see anywhere obvious on the LLVM website where the minimum supported Windows version is documented unfortunately (although it's possible I've missed it).

FWIW, we've got a downstream user who has run into an issue on our LLVM 10 based toolchain, which looks likely to benefit from this fix as-is (we're still waiting on confirmation, as we don't have access to a Mach-O machine needed to reproduce the issue they were seeing).

Just to update that the downstream user has now confirmed that they don't see the issue with the LLVM 12 rc2 (but do with previous LLVM versions), presumably due to the fix here.

aganea added a subscriber: aganea.Mar 18 2021, 7:02 PM

It would be nice to retain Windows 7 support for now, seeing the number of users affected by this issue.

llvm/lib/Support/Windows/Path.inc
407

The error occurs on a second call with setDeleteDisposition(..., false) initiated by TempFile::keep(). A simple repro from https://bugs.llvm.org/show_bug.cgi?id=48378#c0 can be found here: https://pastebin.com/v51m3uBU

We could enclose this new block of code into if (Delete) { ... }. eg.

static std::error_code setDeleteDisposition(HANDLE Handle, bool Delete) {
  if (Delete) {
    // First, check if the file is on a network (non-local) drive. If so, don't
    // set DeleteFile to true, since it prevents opening the file for writes.
    SmallVector<wchar_t, 128> FinalPath;
    ...
    // At this point, the file is on a local drive.
  }

  FILE_DISPOSITION_INFO Disposition;
  Disposition.DeleteFile = Delete;
  if (!SetFileInformationByHandle(Handle, FileDispositionInfo, &Disposition,
                                  sizeof(Disposition)))
    return mapWindowsError(::GetLastError());
  return std::error_code();
}

The original report from @rdwampler suggests that the problem is not SetFileInformationByHandle, but writes that happen later in the process. In this particular case, the writes already happened.

WDYT?

It would be nice to retain Windows 7 support for now, seeing the number of users affected by this issue.

The point may be a bit moot, since there's been no sign of a revert being picked up in the 12.0.0 release candidates, so it looks like Windows 7 support is going to be broken for at least some use cases with LLVM 12. I guess someone could propose an alternative fix for a future RC or 12.0.1, which downstream projects could adopt if needed.

llvm/lib/Support/Windows/Path.inc
407

Not sure who the WDYT is directed at here, but I'm happy for alternative approaches that solve the same issue.

rdwampler added inline comments.Mar 19 2021, 5:48 AM
llvm/lib/Support/Windows/Path.inc
407

I'm leaning towards reverting this change and letting the callers handle this logic. Currently, the only caller is TempFile.

Please see D99212 for a fix for Windows 7.

FYI, I had a user report a bug regarding this to me recently, https://github.com/mstorsjo/llvm-mingw/issues/233. This fix makes TempFile simply leave the files behind, when on a network drive. Ping @rnk @amccarth
@jhenderson @rdwampler.

Background: TempFile::create calls createUniqueFile with OF_Delete set, which passes it down through createUniqueFile, createUniqueEntity, openFileForReadWrite, openFile, openNativeFile, which calls setDeleteDisposition if OF_Delete was set.

With this fix, setDeleteDisposition returns early (with a success error code) even though it didn't set the delete disposition as requested.

It should be simple to carry a bool flag in the TempFile class, to remove the file manually when closed. We'd still leak temp files when interrupted, but it'd be way better than leaking temp files on every single successful run too.

I considered fixing it by making setDeleteDisposition take a bool &Delete which can be updated indicating whether it actually was marked as to be implicitly deleted or not, but we'd need to plumb such a return value back through all those layers. I have a WIP patch for that, but it's ugly.

I also considered making TempFile::create check if the delete disposition flag actually was set correctly - however, one can't do GetFileInformationByHandleEx with FileDispositionInfo, apparently this is a write-only attribute. (The MSDN docs seem to agree.) So we need to carry the information from whoever called setDeleteDisposition back to the TempFile class.

We can't remove the OF_Delete flag from the call in TempFile::create, because the flag is needed for adding the DELETE flag to the dwDesiredAccess attribute to CreateFileW. We can't remove the setDeleteDisposition call from openNativeFile without breaking the OF_Delete flag as it is documented.

We could do a separate, duplicate call to setDeleteDisposition in TempFile::create, to allow checking whether it seems to work or not, but that's a bit ugly.

WDYT?

Having TempFile attempt to manually delete the underlying file seems like a reasonable approach to me.

Having TempFile attempt to manually delete the underlying file seems like a reasonable approach to me.

Thanks! Do you have any opinions on the least bad way of plumbing this signal through from setDeleteDisposition in openNativeFile down to TempFile::create?

andrewevstyukhin added inline comments.
llvm/lib/Support/Windows/Path.inc
408

Hello,
I have a failure at this line when linking at RAM-disk https://sourceforge.net/projects/imdisk-toolkit/ under Windows 10.

DWORD CountChars = ::GetFinalPathNameByHandleW(

returns error 40 (ERROR_INVALID_FUNCTION, function_not_supported).