diff --git a/llvm/docs/ContentAddressableStorage.md b/llvm/docs/ContentAddressableStorage.md new file mode 100644 --- /dev/null +++ b/llvm/docs/ContentAddressableStorage.md @@ -0,0 +1,120 @@ +# Content Addressable Storage + +## Introduction to CAS + +Content Addressable Storage, or `CAS`, is a storage system where it assigns +unique addresses to the data stored. It is very useful for data deduplicaton +and creating unique identifiers. + +Unlikely other kind of storage system like file system, CAS is immutable. It +is more reliable to model a computation when representing the inputs and outputs +of the computation using objects stored in CAS. + +The basic unit of the CAS library is a CASObject, where it contains: + +* Data: arbitrary data +* References: references to other CASObject + +It can be conceptually modeled as something like: + +``` +struct CASObject { + ArrayRef Data; + ArrayRef Refs; +} +``` + +Such abstraction can allow simple composition of CASObjects into a DAG to +represent complicated data structure while still allowing data deduplication. +Note you can compare two DAGs by just comparing the CASObject hash of two +root nodes. + + + +## LLVM CAS Library User Guide + +The CAS-like storage provided in LLVM is `llvm::cas::ObjectStore`. +To reference a CASObject, there are few different abstractions provided +with different trade-offs: + +### ObjectRef + +`ObjectRef` is a lightweight reference to a CASObject stored in the CAS. +This is the most commonly used abstraction and it is cheap to copy/pass +along. It has following properties: + +* `ObjectRef` is only meaningful within the `ObjectStore` that created the ref. +`ObjectRef` created by different `ObjectStore` cannot be cross-referenced or +compared. +* `ObjectRef` doesn't guarantee the existence of the CASObject it points to. An +explicitly load is required before accessing the data stored in CASObject. +This load can also fail, for reasons like but not limited to: object does +not exist, corrupted CAS storage, operation timeout, etc. +* If two `ObjectRef` are equal, it is guarantee that the object they point to +(if exists) are identical. If they are not equal, the underlying objects are +guaranteed to be not the same. + +### ObjectProxy + +`ObjectProxy` represents a loaded CASObject. With an `ObjectProxy`, the +underlying stored data and references can be accessed without the need +of error handling. The class APIs also provide convenient methods to +access underlying data. The lifetime of the underlying data is equal to +the lifetime of the instance of `ObjectStore` unless explicitly copied. + +### CASID + +`CASID` is the hash identifier for CASObjects. It owns the underlying +storage for hash value so it can be expensive to copy and compare depending +on the hash algorithm. `CASID` is generally only useful in rare situations +like printing raw hash value or exchanging hash values between different +CAS instances with the same hashing schema. + +### ObjectStore + +`ObjectStore` is the CAS-like object storage. It provides API to save +and load CASObjects, for example: + +``` +ObjectRef A, B, C; +Expected Stored = ObjectStore.store("data", {A, B}); +Expected Loaded = ObjectStore.getProxy(C); +``` + +It also provides APIs to convert between `ObjectRef`, `ObjectProxy` and +`CASID`. + + + +## CAS Library Implementation Guide + +The LLVM ObjectStore APIs are designed so that it is easy to add +customized CAS implementation that are interchangeable with builtin +CAS implementations. + +To add your own implementation, you just need to add a subclass to +`llvm::cas::ObjectStore` and implement all its pure virtual methods. +To be interchangeable with LLVM ObjectStore, the new CAS implementation +needs to conform to following contracts: + +* Different CASObject stored in the ObjectStore needs to have a different hash +and result in a different `ObjectRef`. Vice versa, same CASObject should have +same hash and same `ObjectRef`. Note two different CASObjects with identical +data but different references are considered different objects. +* `ObjectRef`s are comparable within the same `ObjectStore` instance, and can +be used to determine the equality of the underlying CASObjects. +* The loaded objects from the ObjectStore need to have the lifetime to be at +least as long as the ObjectStore itself. + +If not specified, the behavior can be implementation defined. For example, +`ObjectRef` can be used to point to a loaded CASObject so +`ObjectStore` never fails to load. It is also legal to use a stricter model +than required. For example, an `ObjectRef` that can be used to compare +objects between different `ObjectStore` instances is legal but user +of the ObjectStore should not depend on this behavior. + +For CAS library implementer, there is also a `ObjectHandle` class that +is an internal representation of a loaded CASObject reference. +`ObjectProxy` is just a pair of `ObjectHandle` and `ObjectStore`, because +just like `ObjectRef`, `ObjectHandle` is only useful when paired with +the ObjectStore that knows about the loaded CASObject. diff --git a/llvm/docs/Reference.rst b/llvm/docs/Reference.rst --- a/llvm/docs/Reference.rst +++ b/llvm/docs/Reference.rst @@ -15,6 +15,7 @@ BranchWeightMetadata Bugpoint CommandGuide/index + ContentAddressableStorage ConvergenceAndUniformity Coroutines DependenceGraphs/index @@ -224,3 +225,6 @@ :doc:`ConvergenceAndUniformity` A description of uniformity analysis in the presence of irreducible control flow, and its implementation. + +:doc:`ContentAddressableStorage` + A reference guide for using LLVM's CAS library. diff --git a/llvm/include/llvm/CAS/CASID.h b/llvm/include/llvm/CAS/CASID.h new file mode 100644 --- /dev/null +++ b/llvm/include/llvm/CAS/CASID.h @@ -0,0 +1,131 @@ +//===- llvm/CAS/CASID.h -----------------------------------------*- C++ -*-===// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===----------------------------------------------------------------------===// + +#ifndef LLVM_CAS_CASID_H +#define LLVM_CAS_CASID_H + +#include "llvm/ADT/ArrayRef.h" +#include "llvm/ADT/DenseMapInfo.h" +#include "llvm/ADT/StringRef.h" + +namespace llvm { + +class raw_ostream; + +namespace cas { + +class CASID; + +/// Context for CAS identifiers. +class CASContext { +public: + virtual ~CASContext(); + + /// Get an identifer for the schema used by this CAS context. Two CAS + /// instances should return \c true for this identifier if and only if their + /// CASIDs are safe to compare by hash. This is used by \a + /// CASID::equalsImpl(). + virtual StringRef getHashSchemaIdentifier() const = 0; + +protected: + /// Print \p ID to \p OS. + virtual void printIDImpl(raw_ostream &OS, const CASID &ID) const = 0; + + friend class CASID; +}; + +/// Unique identifier for a CAS object. +/// +/// Locally, stores an internal CAS identifier that's specific to a single CAS +/// instance. It's guaranteed not to change across the view of that CAS, but +/// might change between runs. +/// +/// It also has \a CASIDContext pointer to allow comparison of these +/// identifiers. If two CASIDs are from the same CASIDContext, they can be +/// compared directly. If they are, then \a +/// CASIDContext::getHashSchemaIdentifier() is compared to see if they can be +/// compared by hash, in which case the result of \a getHash() is compared. +class CASID { +public: + void dump() const; + void print(raw_ostream &OS) const { + return getContext().printIDImpl(OS, *this); + } + friend raw_ostream &operator<<(raw_ostream &OS, const CASID &ID) { + ID.print(OS); + return OS; + } + std::string toString() const; + + ArrayRef getHash() const; + + friend bool operator==(const CASID &LHS, const CASID &RHS) { + // EmptyKey or TombstoneKey. + if (!LHS.Context || !RHS.Context) + return false; + + // CASIDs are equal when they have the same hash schema and same hash value. + return LHS.Context->getHashSchemaIdentifier() == + RHS.Context->getHashSchemaIdentifier() && + LHS.Hash == RHS.Hash; + } + + friend bool operator!=(const CASID &LHS, const CASID &RHS) { + return !(LHS == RHS); + } + + friend hash_code hash_value(const CASID &ID) { + ArrayRef Hash = ID.getHash(); + return hash_combine_range(Hash.begin(), Hash.end()); + } + + const CASContext &getContext() const { + assert(Context && "Tombstone or empty key for DenseMap?"); + return *Context; + } + + static CASID getDenseMapEmptyKey() { + return CASID(nullptr, DenseMapInfo::getEmptyKey().str()); + } + static CASID getDenseMapTombstoneKey() { + return CASID(nullptr, DenseMapInfo::getTombstoneKey().str()); + } + + CASID() = delete; + + static CASID create(const CASContext *Context, StringRef Hash) { + return CASID(Context, Hash.str()); + } + +private: + CASID(const CASContext *Context, std::string &&Hash) + : Context(Context), Hash(std::move(Hash)) {} + + const CASContext *Context; + std::string Hash; +}; + +} // namespace cas + +template <> struct DenseMapInfo { + static cas::CASID getEmptyKey() { return cas::CASID::getDenseMapEmptyKey(); } + + static cas::CASID getTombstoneKey() { + return cas::CASID::getDenseMapTombstoneKey(); + } + + static unsigned getHashValue(cas::CASID ID) { + return (unsigned)hash_value(ID); + } + + static bool isEqual(cas::CASID LHS, cas::CASID RHS) { return LHS == RHS; } +}; + +} // namespace llvm + +#endif // LLVM_CAS_CASID_H diff --git a/llvm/include/llvm/CAS/CASReference.h b/llvm/include/llvm/CAS/CASReference.h new file mode 100644 --- /dev/null +++ b/llvm/include/llvm/CAS/CASReference.h @@ -0,0 +1,205 @@ +//===- llvm/CAS/CASReference.h ----------------------------------*- C++ -*-===// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===----------------------------------------------------------------------===// + +#ifndef LLVM_CAS_CASREFERENCE_H +#define LLVM_CAS_CASREFERENCE_H + +#include "llvm/ADT/ArrayRef.h" +#include "llvm/ADT/DenseMapInfo.h" +#include "llvm/ADT/StringRef.h" + +namespace llvm { + +class raw_ostream; + +namespace cas { + +class ObjectStore; + +class ObjectHandle; +class ObjectRef; + +/// Base class for references to things in \a ObjectStore. +class ReferenceBase { +protected: + struct DenseMapEmptyTag {}; + struct DenseMapTombstoneTag {}; + static constexpr uint64_t getDenseMapEmptyRef() { return -1ULL; } + static constexpr uint64_t getDenseMapTombstoneRef() { return -2ULL; } + +public: + /// Get an internal reference. + uint64_t getInternalRef(const ObjectStore &ExpectedCAS) const { +#if LLVM_ENABLE_ABI_BREAKING_CHECKS + assert(CAS == &ExpectedCAS && "Extracting reference for the wrong CAS"); +#endif + return InternalRef; + } + + unsigned getDenseMapHash() const { + return (unsigned)llvm::hash_value(InternalRef); + } + bool isDenseMapEmpty() const { return InternalRef == getDenseMapEmptyRef(); } + bool isDenseMapTombstone() const { + return InternalRef == getDenseMapTombstoneRef(); + } + bool isDenseMapSentinel() const { + return isDenseMapEmpty() || isDenseMapTombstone(); + } + +protected: + void print(raw_ostream &OS, const ObjectHandle &This) const; + void print(raw_ostream &OS, const ObjectRef &This) const; + + bool hasSameInternalRef(const ReferenceBase &RHS) const { + assert((!isDenseMapSentinel() && !RHS.isDenseMapSentinel()) && + "Invalid reference"); +#if LLVM_ENABLE_ABI_BREAKING_CHECKS + assert(CAS == RHS.CAS && "Cannot compare across CAS instances"); +#endif + return InternalRef == RHS.InternalRef; + } + +protected: + friend class ObjectStore; + ReferenceBase(const ObjectStore *CAS, uint64_t InternalRef, bool IsHandle) + : InternalRef(InternalRef) { +#if LLVM_ENABLE_ABI_BREAKING_CHECKS + this->CAS = CAS; +#endif + assert(InternalRef != getDenseMapEmptyRef() && "Reserved for DenseMapInfo"); + assert(InternalRef != getDenseMapTombstoneRef() && + "Reserved for DenseMapInfo"); + } + explicit ReferenceBase(DenseMapEmptyTag) + : InternalRef(getDenseMapEmptyRef()) {} + explicit ReferenceBase(DenseMapTombstoneTag) + : InternalRef(getDenseMapTombstoneRef()) {} + +private: + uint64_t InternalRef; + +#if LLVM_ENABLE_ABI_BREAKING_CHECKS + const ObjectStore *CAS = nullptr; +#endif +}; + +/// Reference to an object in a \a ObjectStore instance. +/// +/// If you have an ObjectRef, you can point at it from new nodes with \a +/// ObjectStore::store(), but you don't know anything about it. "Loading" the +/// object is a separate step that may not have happened yet, and which can fail +/// (due to filesystem corruption) or introduce latency (if downloading from a +/// remote store). +/// +/// \a ObjectStore::store() takes a list of these, and these are returned by \a +/// ObjectStore::forEachRef() and \a ObjectStore::readRef(), which are accessors +/// for nodes, and \a ObjectStore::getReference(). +/// +/// \a ObjectStore::load() will load the referenced object, and returns \a +/// ObjectHandle, a variant that knows what kind of entity it is. \a +/// +/// This is a wrapper around a \c uint64_t (and a \a ObjectStore instance when +/// assertions are on). If necessary, it can be deconstructed and reconstructed +/// using \a Reference::getInternalRef() and \a +/// Reference::getFromInternalRef(), this should only happen as the CAS +/// implementation details, not called by the users of the CAS. +class ObjectRef : public ReferenceBase { + struct DenseMapTag {}; + +public: + friend bool operator==(const ObjectRef &LHS, const ObjectRef &RHS) { + return LHS.hasSameInternalRef(RHS); + } + friend bool operator!=(const ObjectRef &LHS, const ObjectRef &RHS) { + return !(LHS == RHS); + } + + /// Allow a reference to be recreated after it's deconstructed. + static ObjectRef getFromInternalRef(const ObjectStore &CAS, + uint64_t InternalRef) { + return ObjectRef(CAS, InternalRef); + } + + static ObjectRef getDenseMapEmptyKey() { + return ObjectRef(DenseMapEmptyTag{}); + } + static ObjectRef getDenseMapTombstoneKey() { + return ObjectRef(DenseMapTombstoneTag{}); + } + + /// Print internal ref and/or CASID. Only suitable for debugging. + void print(raw_ostream &OS) const { return ReferenceBase::print(OS, *this); } + + LLVM_DUMP_METHOD void dump() const; + +private: + friend class ObjectStore; + friend class ReferenceBase; + using ReferenceBase::ReferenceBase; + ObjectRef(const ObjectStore &CAS, uint64_t InternalRef) + : ReferenceBase(&CAS, InternalRef, /*IsHandle=*/false) { + assert(InternalRef != -1ULL && "Reserved for DenseMapInfo"); + assert(InternalRef != -2ULL && "Reserved for DenseMapInfo"); + } + explicit ObjectRef(DenseMapEmptyTag T) : ReferenceBase(T) {} + explicit ObjectRef(DenseMapTombstoneTag T) : ReferenceBase(T) {} + explicit ObjectRef(ReferenceBase) = delete; +}; + +/// Handle to a loaded object in a \a ObjectStore instance. +/// +/// ObjectHandle encapulates a *loaded* object in the CAS. You need one +/// of these to inspect the content of an object: to look at its stored +/// data and references. +class ObjectHandle : public ReferenceBase { +public: + friend bool operator==(const ObjectHandle &LHS, const ObjectHandle &RHS) { + return LHS.hasSameInternalRef(RHS); + } + friend bool operator!=(const ObjectHandle &LHS, const ObjectHandle &RHS) { + return !(LHS == RHS); + } + + /// Print internal ref and/or CASID. Only suitable for debugging. + void print(raw_ostream &OS) const { return ReferenceBase::print(OS, *this); } + + LLVM_DUMP_METHOD void dump() const; + +private: + friend class ObjectStore; + friend class ReferenceBase; + using ReferenceBase::ReferenceBase; + explicit ObjectHandle(ReferenceBase) = delete; + ObjectHandle(const ObjectStore &CAS, uint64_t InternalRef) + : ReferenceBase(&CAS, InternalRef, /*IsHandle=*/true) {} +}; + +} // namespace cas + +template <> struct DenseMapInfo { + static cas::ObjectRef getEmptyKey() { + return cas::ObjectRef::getDenseMapEmptyKey(); + } + + static cas::ObjectRef getTombstoneKey() { + return cas::ObjectRef::getDenseMapTombstoneKey(); + } + + static unsigned getHashValue(cas::ObjectRef Ref) { + return Ref.getDenseMapHash(); + } + + static bool isEqual(cas::ObjectRef LHS, cas::ObjectRef RHS) { + return LHS == RHS; + } +}; + +} // namespace llvm + +#endif // LLVM_CAS_CASREFERENCE_H diff --git a/llvm/include/llvm/CAS/ObjectStore.h b/llvm/include/llvm/CAS/ObjectStore.h new file mode 100644 --- /dev/null +++ b/llvm/include/llvm/CAS/ObjectStore.h @@ -0,0 +1,327 @@ +//===- llvm/CAS/ObjectStore.h -----------------------------------------*- C++ +//-*-===// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===----------------------------------------------------------------------===// + +#ifndef LLVM_CAS_OBJECTSTORE_H +#define LLVM_CAS_OBJECTSTORE_H + +#include "llvm/ADT/StringRef.h" +#include "llvm/CAS/CASID.h" +#include "llvm/CAS/CASReference.h" +#include "llvm/Support/Error.h" +#include "llvm/Support/FileSystem.h" +#include +#include + +namespace llvm { + +class MemoryBuffer; + +namespace cas { + +class ObjectStore; +class ObjectProxy; + +/// Content-addressable storage for objects. +/// +/// Conceptually, objects are stored in a "unique set". +/// +/// - Objects are immutable ("value objects") that are defined by their +/// content. They are implicitly deduplicated by content. +/// - Each object has a unique identifier (UID) that's derived from its content, +/// called a \a CASID. +/// - This UID is a fixed-size (strong) hash of the transitive content of a +/// CAS object. +/// - It's comparable between any two CAS instances that have the same \a +/// CASIDContext::getHashSchemaIdentifier(). +/// - The UID can be printed (e.g., \a CASID::toString()) and it can parsed +/// by the same or a different CAS instance with \a +/// ObjectStore::parseID(). +/// - An object can be looked up by content or by UID. +/// - \a store() is "get-or-create" methods, writing an object if it +/// doesn't exist yet, and return a ref to it in any case. +/// - \a loadObject(const CASID&) looks up an object by its UID. +/// - Objects can reference other objects, forming an arbitrary DAG. +/// +/// The \a ObjectStore interface has a few ways of referencing objects: +/// +/// - \a ObjectRef encapsulates a reference to something in the CAS. It is an +/// opaque type that references an object inside a specific CAS. It is +/// implementation defined if the underlying object exists or not for an +/// ObjectRef, and it can used to speed up CAS lookup as an implementation +/// detail. However, you don't know anything about the underlying objects. +/// "Loading" the object is a separate step that may not have happened +/// yet, and which can fail (e.g. due to filesystem corruption) or introduce +/// latency (if downloading from a remote store). +/// - \a ObjectHandle encapulates a *loaded* object in the CAS. You need one of +/// these to inspect the content of an object: to look at its stored +/// data and references. This is internal to CAS implementation and not +/// availble from CAS public APIs. +/// - \a CASID: the UID for an object in the CAS, obtained through \a +/// ObjectStore::getID() or \a ObjectStore::parseID(). This is a valid CAS +/// identifier, but may reference an object that is unknown to this CAS +/// instance. +/// - \a ObjectProxy pairs an ObjectHandle (subclass) with a ObjectStore, and +/// wraps access APIs to avoid having to pass extra parameters. It is the +/// object used for accessing underlying data and refs by CAS users. +/// +/// Both ObjectRef and ObjectHandle are lightweight, wrapping a `uint64_t`. +/// Doing anything with them requires an ObjectStore. As a convenience, +/// ObjectProxy pairs an ObjectStore with ObjectHandle, and provides the +/// interfaces to read information from CAS Objects for users. CASID contains +/// the actual hash value of the object, thus it is expensive to create/copy, +/// and also much slower to access corresponding object. It is only useful when +/// you need to print the hash value for the objects, or exchange common +/// objects between different ObjectStore instances with the same CASContext. +/// +/// There are a few options for accessing content of objects, with different +/// lifetime tradeoffs: +/// +/// - \a getData()/getDataString() return StringRef with lifetime is guaranteed +/// to last as long as \a ObjectStore. +/// - \a getMemoryBuffer() returns a \a MemoryBuffer whose lifetime +/// is independent of the CAS (it can live longer). +/// - \a readRef() and \a forEachRef() iterate through the references in an +/// object. There is no lifetime assumption. +/// +class ObjectStore { + friend class ObjectProxy; + friend class ReferenceBase; + +public: + /// @name CASID Operations + /// { + + /// Get a \p CASID from a \p ID, which should have been generated by \a + /// CASID::print(). This succeeds as long as \a validateID() would pass. The + /// object may be unknown to this CAS instance. + virtual Expected parseID(StringRef ID) = 0; + + /// Get an ID for \p Ref. + virtual CASID getID(ObjectRef Ref) const = 0; + + /// Get a reference to the object called \p ID. + /// + /// Returns \c std::nullopt if not stored in this CAS. + virtual std::optional getReference(const CASID &ID) const = 0; + + /// Validate the underlying object referred by CASID. + virtual Error validate(const CASID &ID) = 0; + + /// } + /// @name Store Objects + /// { + /// Store object into ObjectStore. + virtual Expected store(ArrayRef Refs, + ArrayRef Data) = 0; + + /// Store object from StringRef. + Expected storeFromString(ArrayRef Refs, + StringRef String) { + return store(Refs, arrayRefFromStringRef(String)); + } + + /// Helper functions to store object and returns a ObjectProxy. + Expected createProxy(ArrayRef Refs, StringRef Data); + + /// Default implementation reads \p FD and calls \a storeNode(). Does not + /// take ownership of \p FD; the caller is responsible for closing it. + /// + /// If \p Status is sent in it is to be treated as a hint. Implementations + /// must protect against the file size potentially growing after the status + /// was taken (i.e., they cannot assume that an mmap will be null-terminated + /// where \p Status implies). + /// + /// Returns the \a CASID and the size of the file. + Expected + storeFromOpenFile(sys::fs::file_t FD, + std::optional Status = std::nullopt) { + return storeFromOpenFileImpl(FD, Status); + } + + /// } + /// @name Get Objects + /// { + /// Create ObjectProxy from CASID. If the object doesn't exit, get an error. + Expected getProxy(const CASID &ID); + /// Create ObjectProxy from ObjectRef. If the object can't be loaded, get an + /// error. + Expected getProxy(ObjectRef Ref); + + /// } +protected: + /// @name Implementation Details for underlying CAS. + /// { + /// Get a Ref from Handle. + virtual ObjectRef getReference(ObjectHandle Handle) const = 0; + + /// Load the object referenced by \p Ref. + /// + /// Errors if the object cannot be loaded. + virtual Expected load(ObjectRef Ref) = 0; + + /// Get an ID for \p Handle. + virtual CASID getID(ObjectHandle Handle) const = 0; + + /// Get the size of some data. + virtual uint64_t getDataSize(ObjectHandle Node) const = 0; + + /// Methods for handling objects. + virtual Error forEachRef(ObjectHandle Node, + function_ref Callback) const = 0; + virtual ObjectRef readRef(ObjectHandle Node, size_t I) const = 0; + virtual size_t getNumRefs(ObjectHandle Node) const = 0; + virtual ArrayRef getData(ObjectHandle Node, + bool RequiresNullTerminator = false) const = 0; + + /// Get ObjectRef from open file. + virtual Expected + storeFromOpenFileImpl(sys::fs::file_t FD, + std::optional Status) = 0; + /// } + /// @name Helper functions for implementing CAS. + /// { + + /// Get a lifetime-extended StringRef pointing at \p Data. + /// + /// Depending on the CAS implementation, this may involve in-memory storage + /// overhead. + StringRef getDataString(ObjectHandle Node) { + return toStringRef(getData(Node)); + } + + /// Get a lifetime-extended MemoryBuffer pointing at \p Data. + /// + /// Depending on the CAS implementation, this may involve in-memory storage + /// overhead. + std::unique_ptr + getMemoryBuffer(ObjectHandle Node, StringRef Name = "", + bool RequiresNullTerminator = true); + + /// Helper function to chain `load(ObjectRef)` into ObjectProxy. + Expected getProxy(Expected Ref); + + /// Read the data from \p Data into \p OS. + uint64_t readData(ObjectHandle Node, raw_ostream &OS, uint64_t Offset = 0, + uint64_t MaxBytes = -1ULL) const { + ArrayRef Data = getData(Node); + assert(Offset < Data.size() && "Expected valid offset"); + Data = Data.drop_front(Offset).take_front(MaxBytes); + OS << toStringRef(Data); + return Data.size(); + } + + /// Allow ObjectStore implementations to create internal handles. +#define MAKE_CAS_HANDLE_CONSTRUCTOR(HandleKind) \ + HandleKind make##HandleKind(uint64_t InternalRef) const { \ + return HandleKind(*this, InternalRef); \ + } + MAKE_CAS_HANDLE_CONSTRUCTOR(ObjectHandle) + MAKE_CAS_HANDLE_CONSTRUCTOR(ObjectRef) +#undef MAKE_CAS_HANDLE_CONSTRUCTOR + + /// Create an unknown object error. + static Error createUnknownObjectError(const CASID &ID); + + /// } + +public: + /// Print the ObjectStore internals for debugging purpose. + virtual void print(raw_ostream &) const {} + void dump() const; + + /// Get CASContext + const CASContext &getContext() const { return Context; } + + virtual ~ObjectStore(); + +protected: + ObjectStore(const CASContext &Context) : Context(Context) {} + +private: + const CASContext &Context; +}; + +/// Reference to an abstract hierarchical node, with data and references. +/// Reference is passed by value and is expected to be valid as long as the \a +/// ObjectStore is. +/// +/// TODO: Expose \a ObjectStore::readData() and only call \a +/// ObjectStore::getDataString() when asked. +class ObjectProxy { +public: + const ObjectStore &getCAS() const { return *CAS; } + ObjectStore &getCAS() { return *CAS; } + CASID getID() const { return CAS->getID(H); } + ObjectRef getRef() const { return CAS->getReference(H); } + size_t getNumReferences() const { return CAS->getNumRefs(H); } + ObjectRef getReference(size_t I) const { return CAS->readRef(H, I); } + + operator CASID() const { return getID(); } + CASID getReferenceID(size_t I) const { + std::optional ID = getCAS().getID(getReference(I)); + assert(ID && "Expected reference to be first-class object"); + return *ID; + } + + /// Visit each reference in order, returning an error from \p Callback to + /// stop early. + Error forEachReference(function_ref Callback) const { + return CAS->forEachRef(H, Callback); + } + Error forEachReferenceID(function_ref Callback) const { + return CAS->forEachRef(H, [&](ObjectRef Ref) { + std::optional ID = getCAS().getID(Ref); + assert(ID && "Expected reference to be first-class object"); + return Callback(*ID); + }); + } + + std::unique_ptr + getMemoryBuffer(StringRef Name = "", + bool RequiresNullTerminator = true) const; + + /// Get the content of the node. Valid as long as the CAS is valid. + StringRef getData() const { return CAS->getDataString(H); } + + friend bool operator==(const ObjectProxy &Proxy, ObjectRef Ref) { + return Proxy.getRef() == Ref; + } + friend bool operator==(ObjectRef Ref, const ObjectProxy &Proxy) { + return Proxy.getRef() == Ref; + } + friend bool operator!=(const ObjectProxy &Proxy, ObjectRef Ref) { + return !(Proxy.getRef() == Ref); + } + friend bool operator!=(ObjectRef Ref, const ObjectProxy &Proxy) { + return !(Proxy.getRef() == Ref); + } + +public: + ObjectProxy() = delete; + + static ObjectProxy load(ObjectStore &CAS, ObjectHandle Node) { + return ObjectProxy(CAS, Node); + } + +private: + ObjectProxy(ObjectStore &CAS, ObjectHandle H) : CAS(&CAS), H(H) {} + + ObjectStore *CAS; + ObjectHandle H; +}; + +Expected> +createPluginCAS(StringRef PluginPath, + ArrayRef PluginArgs = std::nullopt); +std::unique_ptr createInMemoryCAS(); + +} // namespace cas +} // namespace llvm + +#endif // LLVM_CAS_OBJECTSTORE_H diff --git a/llvm/lib/CAS/BuiltinCAS.h b/llvm/lib/CAS/BuiltinCAS.h new file mode 100644 --- /dev/null +++ b/llvm/lib/CAS/BuiltinCAS.h @@ -0,0 +1,148 @@ +//===- BuiltinCAS.h ---------------------------------------------*- C++ -*-===// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===----------------------------------------------------------------------===// + +#ifndef LLVM_LIB_CAS_BUILTINCAS_H +#define LLVM_LIB_CAS_BUILTINCAS_H + +#include "llvm/ADT/StringRef.h" +#include "llvm/CAS/ObjectStore.h" +#include "llvm/Support/BLAKE3.h" +#include "llvm/Support/Error.h" +#include +#include + +namespace llvm::cas::builtin { + +/// Current hash type for the internal CAS. +/// +/// FIXME: This should be configurable via an enum to allow configuring the hash +/// function. The enum should be sent into \a createInMemoryCAS() and \a +/// createOnDiskCAS(). +/// +/// This is important (at least) for future-proofing, when we want to make new +/// CAS instances use BLAKE7, but still know how to read/write BLAKE3. +/// +/// Even just for BLAKE3, it would be useful to have these values: +/// +/// BLAKE3 => 32B hash from BLAKE3 +/// BLAKE3_16B => 16B hash from BLAKE3 (truncated) +/// +/// ... where BLAKE3_16 uses \a TruncatedBLAKE3<16>. +/// +/// Motivation for a truncated hash is that it's cheaper to store. It's not +/// clear if we always (or ever) need the full 32B, and for an ephemeral +/// in-memory CAS, we almost certainly don't need it. +/// +/// Note that the cost is linear in the number of objects for the builtin CAS +/// and embedded action cache, since we're using internal offsets and/or +/// pointers as an optimization. +/// +/// However, it's possible we'll want to hook up a local builtin CAS to, e.g., +/// a distributed generic hash map to use as an ActionCache. In that scenario, +/// the transitive closure of the structured objects that are the results of +/// the cached actions would need to be serialized into the map, something +/// like: +/// +/// "action::" -> "0123" +/// "object::0123" -> "3,4567,89AB,CDEF,9,some data" +/// "object::4567" -> ... +/// "object::89AB" -> ... +/// "object::CDEF" -> ... +/// +/// These references would be full cost. +using HasherT = BLAKE3; +using HashType = decltype(HasherT::hash(std::declval &>())); + +class BuiltinCASContext : public CASContext { + void printIDImpl(raw_ostream &OS, const CASID &ID) const final; + +public: + /// Get the name of the hash for any table identifiers. + /// + /// FIXME: This should be configurable via an enum, with at the following + /// values: + /// + /// "BLAKE3" => 32B hash from BLAKE3 + /// "BLAKE3.16" => 16B hash from BLAKE3 (truncated) + /// + /// Enum can be sent into \a createInMemoryCAS() and \a createOnDiskCAS(). + static StringRef getHashName() { return "BLAKE3"; } + StringRef getHashSchemaIdentifier() const final { + static const std::string ID = + ("llvm.cas.builtin.v2[" + getHashName() + "]").str(); + return ID; + } + + static const BuiltinCASContext &getDefaultContext(); + + BuiltinCASContext() = default; +}; + +class BuiltinCAS : public ObjectStore { +public: + BuiltinCAS() : ObjectStore(BuiltinCASContext::getDefaultContext()) {} + + Expected parseID(StringRef Reference) final; + + virtual Expected parseIDImpl(ArrayRef Hash) = 0; + + Expected store(ArrayRef Refs, + ArrayRef Data) final; + virtual Expected storeImpl(ArrayRef ComputedHash, + ArrayRef Refs, + ArrayRef Data) = 0; + + Expected + storeFromOpenFileImpl(sys::fs::file_t FD, + std::optional Status) override; + virtual Expected + storeFromNullTerminatedRegion(ArrayRef ComputedHash, + sys::fs::mapped_file_region Map) { + return storeImpl(ComputedHash, std::nullopt, + makeArrayRef(Map.data(), Map.size())); + } + + /// Both builtin CAS implementations provide lifetime for free, so this can + /// be const, and readData() and getDataSize() can be implemented on top of + /// it. + virtual ArrayRef getDataConst(ObjectHandle Node) const = 0; + + ArrayRef getData(ObjectHandle Node, + bool RequiresNullTerminator) const final { + // BuiltinCAS Objects are always null terminated. + return getDataConst(Node); + } + uint64_t getDataSize(ObjectHandle Node) const final { + return getDataConst(Node).size(); + } + + Error createUnknownObjectError(const CASID &ID) const { + return createStringError(std::make_error_code(std::errc::invalid_argument), + "unknown object '" + ID.toString() + "'"); + } + + Error createCorruptObjectError(const CASID &ID) const { + return createStringError(std::make_error_code(std::errc::invalid_argument), + "corrupt object '" + ID.toString() + "'"); + } + + Error createCorruptStorageError() const { + return createStringError(std::make_error_code(std::errc::invalid_argument), + "corrupt storage"); + } + + Error validate(const CASID &ID) final; +}; + +// FIXME: Proxy not portable. Maybe also error-prone? +constexpr StringLiteral DefaultDirProxy = "/^llvm::cas::builtin::default"; +constexpr StringLiteral DefaultDir = "llvm.cas.builtin.default"; + +} // namespace llvm::cas::builtin + +#endif // LLVM_LIB_CAS_BUILTINCAS_H diff --git a/llvm/lib/CAS/BuiltinCAS.cpp b/llvm/lib/CAS/BuiltinCAS.cpp new file mode 100644 --- /dev/null +++ b/llvm/lib/CAS/BuiltinCAS.cpp @@ -0,0 +1,131 @@ +//===- BuiltinCAS.cpp -------------------------------------------*- C++ -*-===// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===----------------------------------------------------------------------===// + +#include "BuiltinCAS.h" +#include "BuiltinObjectHasher.h" +#include "llvm/ADT/StringExtras.h" +#include "llvm/Support/Alignment.h" +#include "llvm/Support/MemoryBuffer.h" +#include "llvm/Support/Process.h" + +using namespace llvm; +using namespace llvm::cas; +using namespace llvm::cas::builtin; + +static StringRef getCASIDPrefix() { return "llvmcas://"; } + +Expected BuiltinCAS::parseID(StringRef Reference) { + if (!Reference.consume_front(getCASIDPrefix())) + return createStringError(std::make_error_code(std::errc::invalid_argument), + "invalid cas-id '" + Reference + "'"); + + // FIXME: Allow shortened references? + if (Reference.size() != 2 * sizeof(HashType)) + return createStringError(std::make_error_code(std::errc::invalid_argument), + "wrong size for cas-id hash '" + Reference + "'"); + + std::string Binary; + if (!tryGetFromHex(Reference, Binary)) + return createStringError(std::make_error_code(std::errc::invalid_argument), + "invalid hash in cas-id '" + Reference + "'"); + + return parseIDImpl(arrayRefFromStringRef(Binary)); +} + +void BuiltinCASContext::printIDImpl(raw_ostream &OS, const CASID &ID) const { + SmallString<64> Hash; + toHex(ID.getHash(), /*LowerCase=*/true, Hash); + OS << getCASIDPrefix() << Hash; +} + +const BuiltinCASContext &BuiltinCASContext::getDefaultContext() { + static BuiltinCASContext DefaultContext; + return DefaultContext; +} + +static size_t getPageSize() { + static int PageSize = sys::Process::getPageSizeEstimate(); + return PageSize; +} + +Expected +BuiltinCAS::storeFromOpenFileImpl(sys::fs::file_t FD, + std::optional Status) { + int PageSize = getPageSize(); + + if (!Status) { + Status.emplace(); + if (std::error_code EC = sys::fs::status(FD, *Status)) + return errorCodeToError(EC); + } + + constexpr size_t MinMappedSize = 4 * 4096; + auto readWithStream = [&]() -> Expected { + // FIXME: MSVC: SmallString + SmallString<4 * 4096 * 2> Data; + if (Error E = sys::fs::readNativeFileToEOF(FD, Data, MinMappedSize)) + return std::move(E); + return store(std::nullopt, makeArrayRef(Data.data(), Data.size())); + }; + + // Check whether we can trust the size from stat. + if (Status->type() != sys::fs::file_type::regular_file && + Status->type() != sys::fs::file_type::block_file) + return readWithStream(); + + if (Status->getSize() < MinMappedSize) + return readWithStream(); + + std::error_code EC; + sys::fs::mapped_file_region Map(FD, sys::fs::mapped_file_region::readonly, + Status->getSize(), + /*offset=*/0, EC); + if (EC) + return errorCodeToError(EC); + + // If the file is guaranteed to be null-terminated, use it directly. Note + // that the file size may have changed from ::stat if this file is volatile, + // so we need to check for an actual null character at the end. + ArrayRef Data(Map.data(), Map.size()); + HashType ComputedHash = + BuiltinObjectHasher::hashObject(*this, std::nullopt, Data); + if (!isAligned(Align(PageSize), Data.size()) && Data.end()[0] == 0) + return storeFromNullTerminatedRegion(ComputedHash, std::move(Map)); + return storeImpl(ComputedHash, std::nullopt, Data); +} + +Expected BuiltinCAS::store(ArrayRef Refs, + ArrayRef Data) { + return storeImpl(BuiltinObjectHasher::hashObject(*this, Refs, Data), + Refs, Data); +} + +Error BuiltinCAS::validate(const CASID &ID) { + auto Ref = getReference(ID); + if (!Ref) + return createUnknownObjectError(ID); + + auto Handle = load(*Ref); + if (!Handle) + return Handle.takeError(); + + auto Proxy = ObjectProxy::load(*this, *Handle); + SmallVector Refs; + if (auto E = Proxy.forEachReference([&](ObjectRef Ref) -> Error { + Refs.push_back(Ref); + return Error::success(); + })) + return E; + + ArrayRef Data(Proxy.getData().data(), Proxy.getData().size()); + auto Hash = BuiltinObjectHasher::hashObject(*this, Refs, Data); + if (!ID.getHash().equals(Hash)) + return createCorruptObjectError(ID); + + return Error::success(); +} diff --git a/llvm/lib/CAS/BuiltinObjectHasher.h b/llvm/lib/CAS/BuiltinObjectHasher.h new file mode 100644 --- /dev/null +++ b/llvm/lib/CAS/BuiltinObjectHasher.h @@ -0,0 +1,73 @@ +//===- BuiltinObjectHasher.h ------------------------------------*- C++ -*-===// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===----------------------------------------------------------------------===// + +#ifndef LLVM_CAS_BUILTINOBJECTHASHER_H +#define LLVM_CAS_BUILTINOBJECTHASHER_H + +#include "llvm/ADT/StringRef.h" +#include "llvm/CAS/ObjectStore.h" +#include "llvm/Support/Endian.h" + +namespace llvm { +namespace cas { + +template class BuiltinObjectHasher { +public: + using HashT = decltype(HasherT::hash(std::declval &>())); + + static HashT hashObject(const ObjectStore &CAS, ArrayRef Refs, + ArrayRef Data) { + BuiltinObjectHasher H; + H.updateSize(Refs.size()); + for (const ObjectRef &Ref : Refs) + H.updateRef(CAS, Ref); + H.updateArray(Data); + return H.finish(); + } + +private: + HashT finish() { return Hasher.final(); } + + void updateRef(const ObjectStore &CAS, ObjectRef Ref) { + updateID(CAS.getID(Ref)); + } + + void updateID(const CASID &ID) { + // NOTE: Does not hash the size of the hash. That's a CAS implementation + // detail that shouldn't leak into the UUID for an object. + ArrayRef Hash = ID.getHash(); + assert(Hash.size() == sizeof(HashT) && + "Expected object ref to match the hash size"); + Hasher.update(Hash); + } + + void updateArray(ArrayRef Bytes) { + updateSize(Bytes.size()); + Hasher.update(Bytes); + } + + void updateArray(ArrayRef Bytes) { + updateArray(makeArrayRef(reinterpret_cast(Bytes.data()), + Bytes.size())); + } + + void updateSize(uint64_t Size) { + Size = support::endian::byte_swap(Size, support::endianness::little); + Hasher.update( + makeArrayRef(reinterpret_cast(&Size), sizeof(Size))); + } + + BuiltinObjectHasher() = default; + ~BuiltinObjectHasher() = default; + HasherT Hasher; +}; + +} // namespace cas +} // namespace llvm + +#endif // LLVM_CAS_BUILTINOBJECTHASHER_H diff --git a/llvm/lib/CAS/CMakeLists.txt b/llvm/lib/CAS/CMakeLists.txt new file mode 100644 --- /dev/null +++ b/llvm/lib/CAS/CMakeLists.txt @@ -0,0 +1,8 @@ +add_llvm_component_library(LLVMCAS + BuiltinCAS.cpp + InMemoryCAS.cpp + ObjectStore.cpp + + ADDITIONAL_HEADER_DIRS + ${LLVM_MAIN_INCLUDE_DIR}/llvm/CAS +) diff --git a/llvm/lib/CAS/InMemoryCAS.cpp b/llvm/lib/CAS/InMemoryCAS.cpp new file mode 100644 --- /dev/null +++ b/llvm/lib/CAS/InMemoryCAS.cpp @@ -0,0 +1,331 @@ +//===- InMemoryCAS.cpp ------------------------------------------*- C++ -*-===// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===----------------------------------------------------------------------===// + +#include "BuiltinCAS.h" +#include "BuiltinObjectHasher.h" +#include "llvm/ADT/LazyAtomicPointer.h" +#include "llvm/ADT/PointerIntPair.h" +#include "llvm/ADT/PointerUnion.h" +#include "llvm/ADT/TrieRawHashMap.h" +#include "llvm/Support/Allocator.h" +#include "llvm/Support/ThreadSafeAllocator.h" + +using namespace llvm; +using namespace llvm::cas; +using namespace llvm::cas::builtin; + +namespace { + +class InMemoryObject; + +/// Index of referenced IDs (map: Hash -> InMemoryObject*). Uses +/// LazyAtomicPointer to coordinate creation of objects. +using InMemoryIndexT = + ThreadSafeTrieRawHashMap, + sizeof(HashType)>; + +/// Values in \a InMemoryIndexT. \a InMemoryObject's point at this to access +/// their hash. +using InMemoryIndexValueT = InMemoryIndexT::value_type; + +class InMemoryObject { +public: + enum class Kind { + /// Node with refs and data. + RefNode, + + /// Node with refs and data co-allocated. + InlineNode, + + Max = InlineNode, + }; + + Kind getKind() const { return IndexAndKind.getInt(); } + const InMemoryIndexValueT &getIndex() const { + assert(IndexAndKind.getPointer()); + return *IndexAndKind.getPointer(); + } + + ArrayRef getHash() const { return getIndex().Hash; } + + InMemoryObject() = delete; + InMemoryObject(InMemoryObject &&) = delete; + InMemoryObject(const InMemoryObject &) = delete; + +protected: + InMemoryObject(Kind K, const InMemoryIndexValueT &I) : IndexAndKind(&I, K) {} + +private: + enum Counts : int { + NumKindBits = 2, + }; + PointerIntPair IndexAndKind; + static_assert((1U << NumKindBits) <= alignof(InMemoryIndexValueT), + "Kind will clobber pointer"); + static_assert(((int)Kind::Max >> NumKindBits) == 0, "Kind will be truncated"); + +public: + inline ArrayRef getData() const; + + inline ArrayRef getRefs() const; +}; + +class InMemoryRefObject : public InMemoryObject { +public: + static constexpr Kind KindValue = Kind::RefNode; + static bool classof(const InMemoryObject *O) { + return O->getKind() == KindValue; + } + + ArrayRef getRefsImpl() const { return Refs; } + ArrayRef getRefs() const { return Refs; } + ArrayRef getDataImpl() const { return Data; } + ArrayRef getData() const { return Data; } + + static InMemoryRefObject &create(function_ref Allocate, + const InMemoryIndexValueT &I, + ArrayRef Refs, + ArrayRef Data) { + void *Mem = Allocate(sizeof(InMemoryRefObject)); + return *new (Mem) InMemoryRefObject(I, Refs, Data); + } + +private: + InMemoryRefObject(const InMemoryIndexValueT &I, + ArrayRef Refs, ArrayRef Data) + : InMemoryObject(KindValue, I), Refs(Refs), Data(Data) { + assert(isAddrAligned(Align(8), this) && "Expected 8-byte alignment"); + assert(isAddrAligned(Align(8), Data.data()) && "Expected 8-byte alignment"); + assert(*Data.end() == 0 && "Expected null-termination"); + } + + ArrayRef Refs; + ArrayRef Data; +}; + +class InMemoryInlineObject : public InMemoryObject { +public: + static constexpr Kind KindValue = Kind::InlineNode; + static bool classof(const InMemoryObject *O) { + return O->getKind() == KindValue; + } + + ArrayRef getRefs() const { return getRefsImpl(); } + ArrayRef getRefsImpl() const { + return makeArrayRef( + reinterpret_cast(this + 1), NumRefs); + } + + ArrayRef getData() const { return getDataImpl(); } + ArrayRef getDataImpl() const { + ArrayRef Refs = getRefs(); + return makeArrayRef( + reinterpret_cast(Refs.data() + Refs.size()), DataSize); + } + + static InMemoryInlineObject & + create(function_ref Allocate, + const InMemoryIndexValueT &I, ArrayRef Refs, + ArrayRef Data) { + void *Mem = Allocate(sizeof(InMemoryInlineObject) + + sizeof(uintptr_t) * Refs.size() + Data.size() + 1); + return *new (Mem) InMemoryInlineObject(I, Refs, Data); + } + +private: + InMemoryInlineObject(const InMemoryIndexValueT &I, + ArrayRef Refs, + ArrayRef Data) + : InMemoryObject(KindValue, I), NumRefs(Refs.size()), + DataSize(Data.size()) { + auto *BeginRefs = reinterpret_cast(this + 1); + llvm::copy(Refs, BeginRefs); + auto *BeginData = reinterpret_cast(BeginRefs + NumRefs); + llvm::copy(Data, BeginData); + BeginData[Data.size()] = 0; + } + uint32_t NumRefs; + uint32_t DataSize; +}; + +/// In-memory CAS database. +class InMemoryCAS : public BuiltinCAS { +public: + Expected parseIDImpl(ArrayRef Hash) final { + return getID(indexHash(Hash)); + } + + Expected storeImpl(ArrayRef ComputedHash, + ArrayRef Refs, + ArrayRef Data) final; + + Expected + storeFromNullTerminatedRegion(ArrayRef ComputedHash, + sys::fs::mapped_file_region Map) override; + + CASID getID(const InMemoryIndexValueT &I) const { + StringRef Hash = toStringRef(I.Hash); + return CASID::create(&getContext(), Hash); + } + CASID getID(const InMemoryObject &O) const { return getID(O.getIndex()); } + + ObjectHandle getObjectHandle(const InMemoryObject &Node) const { + assert(!(reinterpret_cast(&Node) & 0x1ULL)); + return makeObjectHandle(reinterpret_cast(&Node)); + } + + Expected load(ObjectRef Ref) override { + return getObjectHandle(asInMemoryObject(Ref)); + } + + InMemoryIndexValueT &indexHash(ArrayRef Hash) { + return *Index + .insertLazy(Hash, + [](auto ValueConstructor) { + ValueConstructor.emplace(nullptr); + }) + .first; + } + + /// TODO: Consider callers to actually do an insert and to return a handle to + /// the slot in the trie. + const InMemoryObject *getInMemoryObject(CASID ID) const { + assert(ID.getContext().getHashSchemaIdentifier() == + getContext().getHashSchemaIdentifier() && + "Expected ID from same hash schema"); + if (InMemoryIndexT::const_pointer P = Index.find(ID.getHash())) + return P->Data; + return nullptr; + } + + const InMemoryObject &getInMemoryObject(ObjectHandle OH) const { + return *reinterpret_cast( + (uintptr_t)OH.getInternalRef(*this)); + } + + const InMemoryObject &asInMemoryObject(ReferenceBase Ref) const { + uintptr_t P = Ref.getInternalRef(*this); + return *reinterpret_cast(P); + } + ObjectRef toReference(const InMemoryObject &O) const { + return makeObjectRef(reinterpret_cast(&O)); + } + + CASID getID(ObjectRef Ref) const final { return getIDImpl(Ref); } + CASID getID(ObjectHandle Ref) const final { return getIDImpl(Ref); } + CASID getIDImpl(ReferenceBase Ref) const { + return getID(asInMemoryObject(Ref)); + } + + std::optional getReference(const CASID &ID) const final { + if (const InMemoryObject *Object = getInMemoryObject(ID)) + return toReference(*Object); + return std::nullopt; + } + ObjectRef getReference(ObjectHandle Handle) const final { + return toReference(asInMemoryObject(Handle)); + } + + ArrayRef getDataConst(ObjectHandle Node) const final { + return cast(asInMemoryObject(Node)).getData(); + } + + InMemoryCAS() = default; + +private: + size_t getNumRefs(ObjectHandle Node) const final { + return getInMemoryObject(Node).getRefs().size(); + } + ObjectRef readRef(ObjectHandle Node, size_t I) const final { + return toReference(*getInMemoryObject(Node).getRefs()[I]); + } + Error forEachRef(ObjectHandle Node, + function_ref Callback) const final; + + /// Index of referenced IDs (map: Hash -> InMemoryObject*). Mapped to nullptr + /// as a convenient way to store hashes. + /// + /// - Insert nullptr on lookups. + /// - InMemoryObject points back to here. + InMemoryIndexT Index; + + ThreadSafeAllocator Objects; + ThreadSafeAllocator> + MemoryMaps; +}; + +} // end anonymous namespace + +ArrayRef InMemoryObject::getData() const { + if (auto *Derived = dyn_cast(this)) + return Derived->getDataImpl(); + return cast(this)->getDataImpl(); +} + +ArrayRef InMemoryObject::getRefs() const { + if (auto *Derived = dyn_cast(this)) + return Derived->getRefsImpl(); + return cast(this)->getRefsImpl(); +} + +Expected +InMemoryCAS::storeFromNullTerminatedRegion(ArrayRef ComputedHash, + sys::fs::mapped_file_region Map) { + // Look up the hash in the index, initializing to nullptr if it's new. + ArrayRef Data(Map.data(), Map.size()); + auto &I = indexHash(ComputedHash); + + // Load or generate. + auto Allocator = [&](size_t Size) -> void * { + return Objects.Allocate(Size, alignof(InMemoryObject)); + }; + auto Generator = [&]() -> const InMemoryObject * { + return &InMemoryRefObject::create(Allocator, I, std::nullopt, Data); + }; + const InMemoryObject &Node = + cast(I.Data.loadOrGenerate(Generator)); + + // Save Map if the winning node uses it. + if (auto *RefNode = dyn_cast(&Node)) + if (RefNode->getData().data() == Map.data()) + new (MemoryMaps.Allocate(1)) sys::fs::mapped_file_region(std::move(Map)); + + return toReference(Node); +} + +Expected InMemoryCAS::storeImpl(ArrayRef ComputedHash, + ArrayRef Refs, + ArrayRef Data) { + // Look up the hash in the index, initializing to nullptr if it's new. + auto &I = indexHash(ComputedHash); + + // Create the node. + SmallVector InternalRefs; + for (ObjectRef Ref : Refs) + InternalRefs.push_back(&asInMemoryObject(Ref)); + auto Allocator = [&](size_t Size) -> void * { + return Objects.Allocate(Size, alignof(InMemoryObject)); + }; + auto Generator = [&]() -> const InMemoryObject * { + return &InMemoryInlineObject::create(Allocator, I, InternalRefs, Data); + }; + return toReference(cast(I.Data.loadOrGenerate(Generator))); +} + +Error InMemoryCAS::forEachRef(ObjectHandle Handle, + function_ref Callback) const { + auto &Node = getInMemoryObject(Handle); + for (const InMemoryObject *Ref : Node.getRefs()) + if (Error E = Callback(toReference(*Ref))) + return E; + return Error::success(); +} + +std::unique_ptr cas::createInMemoryCAS() { + return std::make_unique(); +} diff --git a/llvm/lib/CAS/ObjectStore.cpp b/llvm/lib/CAS/ObjectStore.cpp new file mode 100644 --- /dev/null +++ b/llvm/lib/CAS/ObjectStore.cpp @@ -0,0 +1,111 @@ +//===- ObjectStore.cpp ------------------------------------------*- C++ -*-===// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===----------------------------------------------------------------------===// + +#include "llvm/CAS/ObjectStore.h" +#include "llvm/Support/Debug.h" +#include "llvm/Support/FileSystem.h" +#include "llvm/Support/SmallVectorMemoryBuffer.h" + +using namespace llvm; +using namespace llvm::cas; + +CASContext::~CASContext() {} +ObjectStore::~ObjectStore() {} + +LLVM_DUMP_METHOD void CASID::dump() const { print(dbgs()); } +LLVM_DUMP_METHOD void ObjectStore::dump() const { print(dbgs()); } +LLVM_DUMP_METHOD void ObjectRef::dump() const { print(dbgs()); } +LLVM_DUMP_METHOD void ObjectHandle::dump() const { print(dbgs()); } + +std::string CASID::toString() const { + std::string S; + raw_string_ostream(S) << *this; + return S; +} + +ArrayRef CASID::getHash() const { + return arrayRefFromStringRef(Hash); +} + +static void printReferenceBase(raw_ostream &OS, StringRef Kind, + uint64_t InternalRef, std::optional ID) { + OS << Kind << "=" << InternalRef; + if (ID) + OS << "[" << *ID << "]"; +} + +void ReferenceBase::print(raw_ostream &OS, const ObjectHandle &This) const { + assert(this == &This); + + std::optional ID; +#if LLVM_ENABLE_ABI_BREAKING_CHECKS + if (CAS) + ID = CAS->getID(This); +#endif + printReferenceBase(OS, "object-handle", InternalRef, ID); +} + +void ReferenceBase::print(raw_ostream &OS, const ObjectRef &This) const { + assert(this == &This); + + std::optional ID; +#if LLVM_ENABLE_ABI_BREAKING_CHECKS + if (CAS) + ID = CAS->getID(This); +#endif + printReferenceBase(OS, "object-ref", InternalRef, ID); +} + +std::unique_ptr +ObjectStore::getMemoryBuffer(ObjectHandle Node, StringRef Name, + bool RequiresNullTerminator) { + return MemoryBuffer::getMemBuffer( + toStringRef(getData(Node, RequiresNullTerminator)), Name, + RequiresNullTerminator); +} + +Expected ObjectStore::getProxy(const CASID &ID) { + std::optional Ref = getReference(ID); + if (!Ref) + return createUnknownObjectError(ID); + + std::optional H; + if (Error E = load(*Ref).moveInto(H)) + return std::move(E); + + return ObjectProxy::load(*this, *H); +} + +Expected ObjectStore::getProxy(ObjectRef Ref) { + return getProxy(load(Ref)); +} + +Expected ObjectStore::getProxy(Expected H) { + if (!H) + return H.takeError(); + return ObjectProxy::load(*this, *H); +} + +Error ObjectStore::createUnknownObjectError(const CASID &ID) { + return createStringError(std::make_error_code(std::errc::invalid_argument), + "unknown object '" + ID.toString() + "'"); +} + +Expected ObjectStore::createProxy(ArrayRef Refs, + StringRef Data) { + Expected Ref = store(Refs, arrayRefFromStringRef(Data)); + if (!Ref) + return Ref.takeError(); + return getProxy(*Ref); +} + +std::unique_ptr +ObjectProxy::getMemoryBuffer(StringRef Name, + bool RequiresNullTerminator) const { + return CAS->getMemoryBuffer(H, Name, RequiresNullTerminator); +} diff --git a/llvm/lib/CMakeLists.txt b/llvm/lib/CMakeLists.txt --- a/llvm/lib/CMakeLists.txt +++ b/llvm/lib/CMakeLists.txt @@ -9,6 +9,7 @@ add_subdirectory(InterfaceStub) add_subdirectory(IRPrinter) add_subdirectory(IRReader) +add_subdirectory(CAS) add_subdirectory(CodeGen) add_subdirectory(BinaryFormat) add_subdirectory(Bitcode) diff --git a/llvm/unittests/CAS/CASTestConfig.h b/llvm/unittests/CAS/CASTestConfig.h new file mode 100644 --- /dev/null +++ b/llvm/unittests/CAS/CASTestConfig.h @@ -0,0 +1,36 @@ +//===- CASTestConfig.h ----------------------------------------------------===// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===----------------------------------------------------------------------===// + +#include "llvm/CAS/ObjectStore.h" +#include "llvm/Config/llvm-config.h" +#include "llvm/Support/FileSystem.h" +#include "llvm/Testing/Support/Error.h" +#include "llvm/Testing/Support/SupportHelpers.h" +#include "gtest/gtest.h" + +#ifndef LLVM_UNITTESTS_CASTESTCONFIG_H +#define LLVM_UNITTESTS_CASTESTCONFIG_H + +struct CASTestingEnv { + std::unique_ptr CAS; +}; + +class CASTest + : public testing::TestWithParam> { +protected: + std::optional NextCASIndex; + + std::unique_ptr createObjectStore() { + auto TD = GetParam()(++(*NextCASIndex)); + return std::move(TD.CAS); + } + void SetUp() { NextCASIndex = 0; } + void TearDown() { NextCASIndex = std::nullopt; } +}; + +#endif diff --git a/llvm/unittests/CAS/CASTestConfig.cpp b/llvm/unittests/CAS/CASTestConfig.cpp new file mode 100644 --- /dev/null +++ b/llvm/unittests/CAS/CASTestConfig.cpp @@ -0,0 +1,22 @@ +//===- CASTestConfig.cpp --------------------------------------------------===// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===----------------------------------------------------------------------===// + +#include "CASTestConfig.h" +#include "llvm/CAS/ObjectStore.h" +#include "gtest/gtest.h" + +using namespace llvm; +using namespace llvm::cas; + +CASTestingEnv createInMemory(int I) { + std::unique_ptr CAS = createInMemoryCAS(); + return CASTestingEnv{std::move(CAS)}; +} + +INSTANTIATE_TEST_SUITE_P(InMemoryCAS, CASTest, + ::testing::Values(createInMemory)); diff --git a/llvm/unittests/CAS/CMakeLists.txt b/llvm/unittests/CAS/CMakeLists.txt new file mode 100644 --- /dev/null +++ b/llvm/unittests/CAS/CMakeLists.txt @@ -0,0 +1,12 @@ +set(LLVM_LINK_COMPONENTS + Support + CAS + TestingSupport + ) + +add_llvm_unittest(CASTests + CASTestConfig.cpp + ObjectStoreTest.cpp + ) + +target_link_libraries(CASTests PRIVATE LLVMTestingSupport) diff --git a/llvm/unittests/CAS/ObjectStoreTest.cpp b/llvm/unittests/CAS/ObjectStoreTest.cpp new file mode 100644 --- /dev/null +++ b/llvm/unittests/CAS/ObjectStoreTest.cpp @@ -0,0 +1,280 @@ +//===- ObjectStoreTest.cpp ------------------------------------------------===// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===----------------------------------------------------------------------===// + +#include "llvm/CAS/ObjectStore.h" +#include "llvm/Config/llvm-config.h" +#include "llvm/Support/FileSystem.h" +#include "llvm/Testing/Support/Error.h" +#include "llvm/Testing/Support/SupportHelpers.h" +#include "gtest/gtest.h" + +#include "CASTestConfig.h" + +using namespace llvm; +using namespace llvm::cas; + +TEST_P(CASTest, PrintIDs) { + std::unique_ptr CAS = createObjectStore(); + + std::optional ID1, ID2; + ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, "1").moveInto(ID1), + Succeeded()); + ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, "2").moveInto(ID2), + Succeeded()); + EXPECT_NE(ID1, ID2); + std::string PrintedID1 = ID1->toString(); + std::string PrintedID2 = ID2->toString(); + EXPECT_NE(PrintedID1, PrintedID2); + + std::optional ParsedID1, ParsedID2; + ASSERT_THAT_ERROR(CAS->parseID(PrintedID1).moveInto(ParsedID1), Succeeded()); + ASSERT_THAT_ERROR(CAS->parseID(PrintedID2).moveInto(ParsedID2), Succeeded()); + EXPECT_EQ(ID1, ParsedID1); + EXPECT_EQ(ID2, ParsedID2); +} + +TEST_P(CASTest, Blobs) { + std::unique_ptr CAS1 = createObjectStore(); + StringRef ContentStrings[] = { + "word", + "some longer text std::string's local memory", + R"(multiline text multiline text multiline text multiline text +multiline text multiline text multiline text multiline text multiline text +multiline text multiline text multiline text multiline text multiline text +multiline text multiline text multiline text multiline text multiline text +multiline text multiline text multiline text multiline text multiline text +multiline text multiline text multiline text multiline text multiline text)", + }; + + SmallVector IDs; + for (StringRef Content : ContentStrings) { + // Use StringRef::str() to create a temporary std::string. This could cause + // problems if the CAS is storing references to the input string instead of + // copying it. + std::optional Blob; + ASSERT_THAT_ERROR(CAS1->createProxy(std::nullopt, Content).moveInto(Blob), + Succeeded()); + IDs.push_back(Blob->getID()); + + // Check basic printing of IDs. + EXPECT_EQ(IDs.back().toString(), IDs.back().toString()); + if (IDs.size() > 2) + EXPECT_NE(IDs.front().toString(), IDs.back().toString()); + } + + // Check that the blobs give the same IDs later. + for (int I = 0, E = IDs.size(); I != E; ++I) { + std::optional Blob; + ASSERT_THAT_ERROR( + CAS1->createProxy(std::nullopt, ContentStrings[I]).moveInto(Blob), + Succeeded()); + EXPECT_EQ(IDs[I], Blob->getID()); + } + + // Run validation on all CASIDs. + for (int I = 0, E = IDs.size(); I != E; ++I) + ASSERT_THAT_ERROR(CAS1->validate(IDs[I]), Succeeded()); + + // Check that the blobs can be retrieved multiple times. + for (int I = 0, E = IDs.size(); I != E; ++I) { + for (int J = 0, JE = 3; J != JE; ++J) { + std::optional Buffer; + ASSERT_THAT_ERROR(CAS1->getProxy(IDs[I]).moveInto(Buffer), Succeeded()); + EXPECT_EQ(ContentStrings[I], Buffer->getData()); + } + } + + // Confirm these blobs don't exist in a fresh CAS instance. + std::unique_ptr CAS2 = createObjectStore(); + for (int I = 0, E = IDs.size(); I != E; ++I) { + std::optional Proxy; + EXPECT_THAT_ERROR(CAS2->getProxy(IDs[I]).moveInto(Proxy), Failed()); + } + + // Insert into the second CAS and confirm the IDs are stable. Getting them + // should work now. + for (int I = IDs.size(), E = 0; I != E; --I) { + auto &ID = IDs[I - 1]; + auto &Content = ContentStrings[I - 1]; + std::optional Blob; + ASSERT_THAT_ERROR(CAS2->createProxy(std::nullopt, Content).moveInto(Blob), + Succeeded()); + EXPECT_EQ(ID, Blob->getID()); + + std::optional Buffer; + ASSERT_THAT_ERROR(CAS2->getProxy(ID).moveInto(Buffer), Succeeded()); + EXPECT_EQ(Content, Buffer->getData()); + } +} + +TEST_P(CASTest, BlobsBig) { + // A little bit of validation that bigger blobs are okay. Climb up to 1MB. + std::unique_ptr CAS = createObjectStore(); + SmallString<256> String1 = StringRef("a few words"); + SmallString<256> String2 = StringRef("others"); + while (String1.size() < 1024U * 1024U) { + std::optional ID1; + std::optional ID2; + ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, String1).moveInto(ID1), + Succeeded()); + ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, String1).moveInto(ID2), + Succeeded()); + ASSERT_THAT_ERROR(CAS->validate(*ID1), Succeeded()); + ASSERT_THAT_ERROR(CAS->validate(*ID2), Succeeded()); + ASSERT_EQ(ID1, ID2); + + String1.append(String2); + ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, String2).moveInto(ID1), + Succeeded()); + ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, String2).moveInto(ID2), + Succeeded()); + ASSERT_THAT_ERROR(CAS->validate(*ID1), Succeeded()); + ASSERT_THAT_ERROR(CAS->validate(*ID2), Succeeded()); + ASSERT_EQ(ID1, ID2); + String2.append(String1); + } + + // Specifically check near 1MB for objects large enough they're likely to be + // stored externally in an on-disk CAS and will be near a page boundary. + SmallString<0> Storage; + const size_t InterestingSize = 1024U * 1024ULL; + const size_t SizeE = InterestingSize + 2; + if (Storage.size() < SizeE) + Storage.resize(SizeE, '\01'); + for (size_t Size = InterestingSize - 2; Size != SizeE; ++Size) { + StringRef Data(Storage.data(), Size); + std::optional Blob; + ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, Data).moveInto(Blob), + Succeeded()); + ASSERT_EQ(Data, Blob->getData()); + ASSERT_EQ(0, Blob->getData().end()[0]); + } +} + +TEST_P(CASTest, LeafNodes) { + std::unique_ptr CAS1 = createObjectStore(); + StringRef ContentStrings[] = { + "word", + "some longer text std::string's local memory", + R"(multiline text multiline text multiline text multiline text +multiline text multiline text multiline text multiline text multiline text +multiline text multiline text multiline text multiline text multiline text +multiline text multiline text multiline text multiline text multiline text +multiline text multiline text multiline text multiline text multiline text +multiline text multiline text multiline text multiline text multiline text)", + }; + + SmallVector Nodes; + SmallVector IDs; + for (StringRef Content : ContentStrings) { + // Use StringRef::str() to create a temporary std::string. This could cause + // problems if the CAS is storing references to the input string instead of + // copying it. + std::optional Node; + ASSERT_THAT_ERROR( + CAS1->store(std::nullopt, arrayRefFromStringRef(Content)) + .moveInto(Node), + Succeeded()); + Nodes.push_back(*Node); + + // Check basic printing of IDs. + IDs.push_back(CAS1->getID(*Node)); + EXPECT_EQ(IDs.back().toString(), IDs.back().toString()); + EXPECT_EQ(Nodes.front(), Nodes.front()); + EXPECT_EQ(Nodes.back(), Nodes.back()); + EXPECT_EQ(IDs.front(), IDs.front()); + EXPECT_EQ(IDs.back(), IDs.back()); + if (Nodes.size() <= 1) + continue; + EXPECT_NE(Nodes.front(), Nodes.back()); + EXPECT_NE(IDs.front(), IDs.back()); + } + + // Check that the blobs give the same IDs later. + for (int I = 0, E = IDs.size(); I != E; ++I) { + std::optional Node; + ASSERT_THAT_ERROR(CAS1->store(std::nullopt, arrayRefFromStringRef( + ContentStrings[I])) + .moveInto(Node), + Succeeded()); + EXPECT_EQ(IDs[I], CAS1->getID(*Node)); + } + + // Check that the blobs can be retrieved multiple times. + for (int I = 0, E = IDs.size(); I != E; ++I) { + for (int J = 0, JE = 3; J != JE; ++J) { + std::optional Object; + ASSERT_THAT_ERROR(CAS1->getProxy(IDs[I]).moveInto(Object), Succeeded()); + ASSERT_TRUE(Object); + EXPECT_EQ(ContentStrings[I], Object->getData()); + } + } + + // Confirm these blobs don't exist in a fresh CAS instance. + std::unique_ptr CAS2 = createObjectStore(); + for (int I = 0, E = IDs.size(); I != E; ++I) { + std::optional Object; + EXPECT_THAT_ERROR(CAS2->getProxy(IDs[I]).moveInto(Object), Failed()); + } + + // Insert into the second CAS and confirm the IDs are stable. Getting them + // should work now. + for (int I = IDs.size(), E = 0; I != E; --I) { + auto &ID = IDs[I - 1]; + auto &Content = ContentStrings[I - 1]; + std::optional Node; + ASSERT_THAT_ERROR( + CAS2->store(std::nullopt, arrayRefFromStringRef(Content)) + .moveInto(Node), + Succeeded()); + EXPECT_EQ(ID, CAS2->getID(*Node)); + + std::optional Object; + ASSERT_THAT_ERROR(CAS2->getProxy(ID).moveInto(Object), Succeeded()); + ASSERT_TRUE(Object); + EXPECT_EQ(Content, Object->getData()); + } +} + +TEST_P(CASTest, NodesBig) { + std::unique_ptr CAS = createObjectStore(); + + // Specifically check near 1MB for objects large enough they're likely to be + // stored externally in an on-disk CAS, and such that one of them will be + // near a page boundary. + SmallString<0> Storage; + constexpr size_t InterestingSize = 1024U * 1024ULL; + constexpr size_t WordSize = sizeof(void *); + + // Start much smaller to account for headers. + constexpr size_t SizeB = InterestingSize - 8 * WordSize; + constexpr size_t SizeE = InterestingSize + 1; + if (Storage.size() < SizeE) + Storage.resize(SizeE, '\01'); + + SmallVector CreatedNodes; + // Avoid checking every size because this is an expensive test. Just check + // for data that is 8B-word-aligned, and one less. Also appending the created + // nodes as the references in the next block to check references are created + // correctly. + for (size_t Size = SizeB; Size < SizeE; Size += WordSize) { + for (bool IsAligned : {false, true}) { + StringRef Data(Storage.data(), Size - (IsAligned ? 0 : 1)); + std::optional Node; + ASSERT_THAT_ERROR(CAS->createProxy(CreatedNodes, Data).moveInto(Node), + Succeeded()); + ASSERT_EQ(Data, Node->getData()); + ASSERT_EQ(0, Node->getData().end()[0]); + ASSERT_EQ(Node->getNumReferences(), CreatedNodes.size()); + CreatedNodes.emplace_back(Node->getRef()); + } + } + + for (auto ID : CreatedNodes) + ASSERT_THAT_ERROR(CAS->validate(CAS->getID(ID)), Succeeded()); +} diff --git a/llvm/unittests/CMakeLists.txt b/llvm/unittests/CMakeLists.txt --- a/llvm/unittests/CMakeLists.txt +++ b/llvm/unittests/CMakeLists.txt @@ -20,6 +20,7 @@ add_subdirectory(BinaryFormat) add_subdirectory(Bitcode) add_subdirectory(Bitstream) +add_subdirectory(CAS) add_subdirectory(CodeGen) add_subdirectory(DebugInfo) add_subdirectory(Debuginfod)