Page MenuHomePhabricator

WIP: Support: Add vfs::OutputBackend to virtualize compiler outputs
Needs ReviewPublic

Authored by dexonsmith on Jan 26 2021, 8:29 PM.

Details

Summary

[This is supporting an RFC (https://lists.llvm.org/pipermail/cfe-dev/2021-January/067576.html / https://lists.llvm.org/pipermail/llvm-dev/2021-January/148124.html); a follow up patch has initial adoption in clang::CompilerInstance; marked WIP for now.]

WIP: Support: Add vfs::OutputBackend to virtualize compiler outputs

Add OutputBackend to the llvm::vfs namespace to virtualize compiler
outputs. The primary interface includes the following API:

  • getDirectory() returns an OutputDirectory, which is a proxy backend that uses the requested directory as a working directory.
  • createFile() creates a file and returns OutputFile for writing to it.
  • createDirectory() creates a directory and then calls getDirectory(). The default implementation assumes creating a directory is a no-op, to support backends that don't have a concept of a directories without files in them.
  • createUniqueFile() and createUniqueDirectory(), versions of the above that avoid creating entitity names that already exist. (StableUniqueEntityAdaptor<> can be used by backends that don't have a way to check what exists.)

This patch contains a number of backends, including:

  • NullOutputBackend, for silently ignoring all outputs.
  • OnDiskOutputBackend, for writing to disk.
  • InMemoryOutputBackend, for writing to an InMemoryFileSystem.
  • MirroringOutputBackend, for writing to multiple backends.
  • FilteringOutputBackend, for filtering which outputs get written to the underlying backend.

There are also a few helpers for creating other backends:

  • ProxyOutputBackend forwards all calls to an underlying backend.
  • StableUniqueEntityAdaptor<> adds stable implementations of createUniqueFileImpl() and createUniqueDirectoryImpl().

The primary interface for OutputFile includes the following API:

  • takeOS() returns a std::unique_ptr<raw_pwrite_stream> for writing (getOS() gives a raw pointer to the stream if it hasn't been taken).
  • close() commits the file (to disk, in-memory, etc.).
  • getPath() gets the path to the output.
  • ~Output erases the output if it hasn't been opened.

OutputFile has support for buffering content. One use case is for
concrete subclasses that don't support streams. Another is to support
passing the same content to multiple underlying OutputFile instances
(such as for MirroringOutputBackend). OutputFile::close() calls
storeContentBuffer() with a ContentBuffer if this support was used;
otherwise, it calls storeStreamedContent().

An OutputConfig is passed when creating a file or directory to
communicate semantics about the output. There are a few flags:

  • Text, for outputs that should be opened in text mode.
  • Volatile, for outputs that other processes may modify after write. (OnDiskOutputBackend has support for writing its bytes to an mmap'ed region and passing a file-backed buffer on to other backends. This optimization is turned off for Volatile outputs.)
  • NoCrashCleanup, for outputs that should not be removed on crash.
  • NoAtomicWrite, for outputs that should be written directly, instead of moved atomically into place.
  • NoImplyCreateDirectories, for outputs that require the parent path to already exist.
  • NoOverwrite, for outputs that are not allowed to clobber any existing content.

Most of these flags are ignorable by backends that aren't associated
with a filesystem.

TODO:

  • Fix all the header docs. They've mostly bitrotted.
  • Document and enforce threading guarantees. Tentatively, all OutputBackend APIs safe to call concurrently. An OutputFile cannot be used concurrently, but two files from the same backend can be.
  • Test OutputDirectory.
  • Test createDirectory().
  • Test createDirectories().
  • Test createUnique*().

Diff Detail

Event Timeline

dexonsmith created this revision.Jan 26 2021, 8:29 PM
dexonsmith requested review of this revision.Jan 26 2021, 8:29 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 26 2021, 8:29 PM
dexonsmith updated this revision to Diff 320970.Feb 2 2021, 7:21 PM
dexonsmith retitled this revision from WIP: Support: Add vfs::OutputManager to virtualize compiler outputs to WIP: Support: Add vfs::OutputBackend to virtualize compiler outputs.
dexonsmith edited the summary of this revision. (Show Details)

Update description and patch following some feedback on the RFC.

  • Simplify OutputConfig, restricting it to semantic information about a specific output. Sink all backend configuration to flags on the backends themselves.
  • Remove OutputManager, instead exposing OutputBackend directly.
  • Merge Output and OutputDestination into a single class called OutputFile, and rename the API for creating them to OutputBackend::createFile().
  • Add support for working directories via OutputDirectory, and add OutputBackend::getDirectory() and OutputBackend::createDirectory().
  • Add support for createUniqueFile() and createUniqueDirectory(), both heavily used in clang. Backends without read access can use StableUniqueEntityAdaptor to implement these.

Some detailed comments to go with the RFC thread!

llvm/include/llvm/Support/OutputBackend.h
64 ↗(On Diff #323434)

I think a struct in the style of FrontendOptions or so might have better ergonomics:

  • access is a bit more direct
  • can easily provide sensible defaults rather than having to negate half the option names
133 ↗(On Diff #323434)

It would be helpful for a client to have few "valid" bits to reason about:

  • open/closed
  • erased/not-erased
  • stream taken/not taken

I think erased is gone now, and just remains in the comments, which is nice!
Is it necessary to support both getting and taking the stream, or could we get away with one?

152 ↗(On Diff #323434)

Could consider calling this commit, which hints at the behavior when calling vs not calling it.

172 ↗(On Diff #323434)

As you mentioned on the mail, this seems a little wrong to expose here, I'd agree with removing it.

I think the only question it really answers is "what string was passed to OutputBackend::createFileImpl" - this may be useful, but it's easily conflated with ideas like "where can I read this file from". So I like the idea that path strings only appear in close proximity to OutputBackends.

177 ↗(On Diff #323434)

ContentBuffer seems handy for implementing output-files that buffer in memory, but does support it really need to be baked into the base class?
It causes a confusing structure in OutputFile where half of the interface is about dealing with streams and the other half is dealing with buffers, and it's hard to tell which is fundamental.

My best understanding is that ContentBuffer exists to avoid copies in some cases where the output is buffered into memory. If a copy was OK, we'd just use string/stringref, and if the output was being written to disk, we'd use a stream. It seems surprising if this really is a win overall. Is it possible to start with something simpler and evolve to something fast, with benchmarks?

(If it were gone from the main interface, it seems that it would be simple enough for e.g. InMemoryOutputDestination to implement takeOSImpl() and close() on top of a ContentBuffer member, for MirroringOutput to simply write its buffer to the output stream, etc.)

204 ↗(On Diff #323434)

this seems like a surprising place to strike a balance, it implies:

  • it's more important to optimize away null in the mirroring file than to encapsulate nullness as an implementation detail
  • but it's more important to encapsulate nullness than to optimize in other places (e.g. actually writing output!)

If nullness enables important optimizations (seems plausible) I'd prefer just making this public.

251 ↗(On Diff #323434)

ThreadSafeRefCountedBase?

Or even better, can we just use std::shared_ptr?

(I think the nominal reason for still having IntrusiveRefCntPtr is that it's cheaper when thread-safety is not required, but it is here)

262 ↗(On Diff #323434)

My understanding is the use of Twine in the VFS interfaces is a (hard-to-fix) mistake - in practice we do not compose paths using techniques that Twines support well, we mostly end up rendering twines to strings to pass them to other APIs anyway, and the potential performance benefits of Twine are unlikely to be meaningful in the face of IO anyway.

Can we use StringRef?

270 ↗(On Diff #323434)

This is pretty fuzzy to me - it sounds like this is the API both for creating a directory, and creating an output that encapsulates the notion of working directory.

What do you do if you want only one of these? In particular, if you want to track the output directory but only create if you write any outputs. I guess: you can't use this API, and need to track working directory externally instead.

For backends that *do* have a concept of directory existence, is it *required* to call this function or can you just write to a path and have parent directories created? I guess: it's optional by default but the NoImplyCreateDirectories controls this.


These aren't showstoppers, but if something simpler would work equally well, that would be nice. e.g.

  • providing "createDirectory" to do the IO, and making OutputDirectory a helper class instead of part of the interface
  • only dealing in absolute paths - the platform bundles working-directory handling with IO operations, but it's not like we can actually use the OS's working directory...
437 ↗(On Diff #323434)

The patterns here (inheritance, state and templates) seem quite complicated and I'm having a tough time following it.
Is this modeled on existing code somewhere? Maybe we can provide a simpler version of this initially, or leave the unique-entity stuff to separate patches?

437 ↗(On Diff #323434)

Stable seems to mean something significant here, but I don't know what it is.

450 ↗(On Diff #323434)

I'm not 100% sure of the structure of the stable-entity stuff, but it looks like this state needs to be synchronized for multithreaded access.

601 ↗(On Diff #323434)

Can we use errorCodeToError(std::errc::file_exists) for this?

(with createFileError if it's critical to copy the filename in here)

627 ↗(On Diff #323434)

this doesn't seem like a useful public API, because temp files and renaming are low-level implementation details of particular backends. Even if you know the backend is in use, it's hard to imaging handling "rename errors" differently than other IO errors, apart from the error message.

Can this be createStringError("Could not rename...", EC) - we do actually have an underlying error code from the OS

703 ↗(On Diff #323434)

why is it necessary to allow externally accessing/modifying the settings after construction?

llvm/lib/Support/OutputBackend.cpp
865 ↗(On Diff #323434)

doesn't the contract say File1 may have taken the content?