This is an archive of the discontinued LLVM Phabricator instance.

Differential D133716

[CAS] Add LLVMCAS library with InMemoryCAS implementation
Needs Review · Public

Authored by steven_wu on Sep 12 2022, 10:43 AM.

Details

Reviewers

rnk
dblaikie
benlangmuir
jyknight
dexonsmith

Summary

Add the llvm::cas::ObjectStore abstraction and InMemoryCAS as an in-memory
CAS object store implementation.

The ObjectStore models its objects as:

Content: An array of bytes for the data to be stored.
Refs: An array of references to other objects in the ObjectStore.

Each CAS object can be identified by a unique ID/hash.

ObjectStore supports the following general actions:

Expected<ID> store(Content, ArrayRef<Ref>)
Expected<Ref> get(ID)

It also introduces the following types for interacting with a CAS ObjectStore:

CASID: Hash representation of a CAS object, together with its context, to help print and compare CASIDs.
ObjectRef: A lightweight reference to an object in the ObjectStore. It is implementation-defined, so it can be optimized for read/store/references depending on the implementation.
ObjectHandle: A CAS-internal lightweight handle to a loaded object in the ObjectStore. The underlying data for the object is guaranteed to be available, and no error handling is required to access it. This is not exposed to users of the CAS through the ObjectStore APIs.
ObjectProxy: A proxy through which users of the CAS interact with the data inside a CAS object. It bundles an ObjectHandle and an ObjectStore instance.
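
To make the object model in the summary concrete, here is a minimal, self-contained sketch of a content-addressed store. It is not the implementation in this patch: `ToyObjectStore`, `ToyID` and `toyHash` are hypothetical names, the methods return plain values rather than `Expected<...>`, and 64-bit FNV-1a is only an assumed stand-in for a real cryptographic hash, so collisions are possible in principle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a CAS ID. A real store would use a
// cryptographic hash; FNV-1a only keeps the sketch short.
using ToyID = std::uint64_t;

inline ToyID toyHash(const std::string &Bytes) {
  ToyID H = 14695981039346656037ULL; // FNV-1a 64-bit offset basis
  for (unsigned char C : Bytes) {
    H ^= C;
    H *= 1099511628211ULL; // FNV-1a 64-bit prime
  }
  return H;
}

struct ToyObject {
  std::string Data;        // Content: arbitrary bytes
  std::vector<ToyID> Refs; // Refs: IDs of other objects in the same store
};

class ToyObjectStore {
  std::map<ToyID, ToyObject> Objects;

public:
  // The ID is derived from both the refs and the data, so equal inputs
  // always map to the same ID and are stored only once.
  ToyID store(const std::string &Data, const std::vector<ToyID> &Refs) {
    std::string Buffer = std::to_string(Refs.size()) + ":";
    for (ToyID R : Refs)
      Buffer += std::to_string(R) + ",";
    Buffer += Data;
    ToyID ID = toyHash(Buffer);
    auto Result = Objects.emplace(ID, ToyObject{Data, Refs});
    // If the ID was already present, it must name identical content.
    assert(Result.first->second.Data == Data &&
           Result.first->second.Refs == Refs && "toy hash collision");
    return ID;
  }

  // Returns nullptr when no object with this ID is stored.
  const ToyObject *get(ToyID ID) const {
    auto It = Objects.find(ID);
    return It == Objects.end() ? nullptr : &It->second;
  }

  std::size_t size() const { return Objects.size(); }
};
```

Storing the same leaf twice yields a single ID, and two roots built from equal children receive equal IDs. That is the property that lets two object graphs be compared by their root IDs alone.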

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

steven_wu created this revision. Sep 12 2022, 10:43 AM

Herald added a project: Restricted Project. Sep 12 2022, 10:43 AM

Herald added subscribers: ributzka, hiraditya, mgorny.

steven_wu requested review of this revision. Sep 12 2022, 10:43 AM

Herald added a project: Restricted Project. Sep 12 2022, 10:43 AM

steven_wu added parent revisions: D133713: [Support] Introduce ThreadSafeAllocator, D133714: [ADT] Introduce LazyAtomicPointer, D133715: [ADT] Add TrieRawHashMap. Sep 12 2022, 10:43 AM

Harbormaster completed remote builds in B186196: Diff 459516. Sep 12 2022, 11:09 AM

tschuett added a subscriber: tschuett. Sep 12 2022, 11:11 AM

ychen added a subscriber: ychen. Sep 12 2022, 11:14 AM

akyrtzi added a subscriber: akyrtzi. Sep 12 2022, 12:06 PM

peterwaller-arm added a subscriber: peterwaller-arm. Sep 15 2022, 3:34 AM

steven_wu removed parent revisions: D133715: [ADT] Add TrieRawHashMap, D133714: [ADT] Introduce LazyAtomicPointer. Sep 19 2022, 3:21 PM

steven_wu edited parent revisions, added: D133715: [ADT] Add TrieRawHashMap; removed: D133713: [Support] Introduce ThreadSafeAllocator.

aganea added a subscriber: aganea. Sep 27 2022, 8:12 AM

aprantl added a subscriber: aprantl. Oct 5 2022, 8:45 AM

russell.gallop added a subscriber: russell.gallop. Nov 9 2022, 3:34 PM

Ping.

Now with context of the LLVM talk, I hope it makes the review easier. Will be really appreciated for any feedback to get CAS upstream. Feel free to reach out to me for any questions.

In D133716#3958319, @steven_wu wrote:

Ping.

Now with context of the LLVM talk, I hope it makes the review easier. Will be really appreciated for any feedback to get CAS upstream. Feel free to reach out to me for any questions.

FTR, I'll be happy to review most/many patches in this area once the core patches land (and also contribute to design discussions), but for the core patches (where I was the original author) I'm not sure I have enough perspective to spot what's missing. As such, I'm going to resign as a reviewer, but I'll still be "subscribed" to the email thread. (Feel free to add me as a subscriber/reviewer to anything else in this area as well...)

In my opinion, much of this can evolve in-tree after landing, and the RFC had consensus to move forward with landing this stuff in main following the usual code review. It's important for someone outside the original authors (a few Apple people, plus me!) to at least take a cursory look to ensure that it's well-enough documented and tested for others to contribute. Maybe an in-depth review isn't needed, but I'm sure that'd also be welcome if someone can contribute one.

Alternatively, if someone thinks the design needs a deeper look before landing, that sounds great too; but I think it'd be helpful for Steven to have signal either way.

+@MatzeB, in case you are able to contribute a review; IIRC, it was your design feedback in the RFC video call that caused us to simplify the interface, moving filesystem concepts like "blobs" and "trees" out of the core CAS interface.

dexonsmith added inline comments. Nov 29 2022, 4:06 PM

llvm/include/llvm/ADT/StringExtras.h
61–74 ↗	(On Diff #459516)	Oh, just noticed these changes; I suggest separating these out and adding quick unit tests.

Split StringExtras changes

steven_wu added a parent revision: D139035: [ADT] Add more ArrayRef <-> StringRef conversion functions. Nov 30 2022, 10:58 AM

Harbormaster completed remote builds in B200325: Diff 479020. Nov 30 2022, 11:05 AM

dblaikie added inline comments. Dec 1 2022, 5:04 PM

llvm/include/llvm/CAS/CASID.h
27	This could be private to avoid polluting autocomplete and such, maybe? (alternatively it's probably not too costly to have the dtor virtual and out of line to act as an anchor - I assume CASContexts aren't being created/torn down with any great frequency such that some dtor indirection would be especially costly?)
109–110	if you have a `std::string` member, should probably take the parameter as `std::string` and `std::move` it into the member - in cases where the caller can discard a string that'll be more efficient, and in cases where the caller can't, it's not especially worse. (though should the member be `std::string`? should it be `SmallString` or anything else?)
llvm/include/llvm/CAS/CASReference.h
63	Only the last of these checks is limited to `LLVM_ENABLE_ABI_BREAKING_CHECKS` yeah? Might be worth pulling the other two out of the preprocessor conditional, so they get run more broadly?
107	expect -> extract? or something else?
111–114	If clients aren't expected to need to do it, could we leave that out until some necessary use comes up?
163	Since `Handle` and `Ref` aren't especially self documenting (while reading the docs for `Ref` I wondered if it should be called `Handle` - until I got to the bit where it explained that that exists too/separately) - is there something that'd be more explicit about the difference between these? (like, I wonder if `ObjectHandle` could be `Object`? - not sure that'd make `ObjectRef` more obvious, though... )
llvm/include/llvm/CAS/ObjectStore.h
50	do they "reference" other objects - or does an object consist of other objects/include those objects in some sense? (maybe there's no real difference, and this current language is better... not sure)
54–58	Not clear from this what the difference between an `ObjectRef` and a `CASID` are - is that worth more details?
62–65	Ah. May be worth splitting up the documentation here more clearly between public and private concepts, rather than having this internal detail interleaved with external ones?
91	anchor's only useful if it's virtual? Maybe just make the dtor out of line and use that as an anchor?
110–111	what sort of validity is of concern here?
175–176	this lets you walk which other objects reference this object? I guess you've probably got some pretty core use cases/need for this - but dealing with updating use lists in LLVM IR at least makes me ask: Do you need to be able to walk the uses of an object?
llvm/lib/CAS/InMemoryCAS.cpp
263	I'd probably put the `inline` keyword here and leave it off the declaration (otherwise I'd read this code and wonder how it's valid/doesn't produce duplicate definitions) - wonder what's more common in the LLVM project/codebase. No big deal either way, though.
llvm/unittests/CAS/ObjectStoreTest.cpp
162	not sure I understand the mention of std::string here - since this is `StringRef`? (maybe this is some remnant of a different version of the code)
174–176	Rather than testing this via UB, could it be tested via pointer equality? (eg: when querying the CAS, check that the data points to the same place as the data passed in)

Address review feedback.

@dblaikie Thanks for reviewing. Feedback inline.

llvm/include/llvm/CAS/CASID.h
27	This is a private method? But sure, this is not a common object to create or destroy. dtor as anchor sounds fine.
109–110	I don't have much preference for the internal data type. CASID should only be relevant when you actually need the hash representation, often when you need to print out the value for outside users. Otherwise, `ObjectRef` is always a better choice.
llvm/include/llvm/CAS/CASReference.h
107	This comment is out-dated. The builtin kind of the object is removed from the latest implementation to simplify the interface so it is easier to use and implement
111–114	It actually means that only the CAS implementers should use those method (for example, used in BuiltinCAS) but users of the CAS should not use those methods because `InternalRef` has no meaning outside the CAS internal. I rewrite it to clear up a bit.
163	`ObjectRef` is just a reference, and you can't access anything in the object it points to, until you load the object using ObjectRef, which will turn into an `ObjectHandle`. The load can have latency or can fail. Maybe the comments in ObjectStore.h is a better explanation with a bigger picture? Or should we put all the documentation into one place so it is easier to read?
llvm/include/llvm/CAS/ObjectStore.h
50	It is a real `reference`, like a pointer reference. But this is up to the CAS implementation, as I can see it is totally possible to implement a CAS the directly embeds the "referenced" object without breaking the contract of the API, but there is really no good reason to do that.
54–58	I refine the docs a bit. Let me know if it is clearer to read.
110–111	This is more a debugging function that can check if the integrity of the CAS has been broken. The current implementation for builtin CAS is to rehash the object fetched from CASID and make sure the hash matches. If not, there is either a bug in CAS or the data has been corrupted.
175–176	No, the reference in CAS goes only one direction. This iterates through all the objects that is referenced by the current object and you don't know how many parents references you. In general, there is even no API for CAS to iterate through all the objects to compute a use-list and that is by design.
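
The integrity check described in the reply above (rehash the stored object and compare the result against its ID) can be sketched in a few lines. This is an illustration only: `ToyEntry`, `toyHash`, `makeEntry` and `validateEntry` are hypothetical names, and FNV-1a is an assumed stand-in for whatever hash the builtin CAS actually uses.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>

// Assumed stand-in hash (64-bit FNV-1a), used only to keep the sketch
// self-contained.
inline std::uint64_t toyHash(const std::string &Bytes) {
  std::uint64_t H = 14695981039346656037ULL; // offset basis
  for (unsigned char C : Bytes) {
    H ^= C;
    H *= 1099511628211ULL; // prime
  }
  return H;
}

struct ToyEntry {
  std::uint64_t ID; // hash recorded when the object was stored
  std::string Data; // bytes currently held by the store
};

// Rehash the stored bytes and compare with the recorded ID. A mismatch
// means either a bug in the store or corrupted data.
inline bool validateEntry(const ToyEntry &E) {
  return toyHash(E.Data) == E.ID;
}

inline ToyEntry makeEntry(std::string Data) {
  // Braced initialization evaluates left to right, so the hash is
  // computed before the string is moved into the entry.
  ToyEntry E{toyHash(Data), std::move(Data)};
  assert(validateEntry(E)); // a freshly created entry is always consistent
  return E;
}
```

A freshly created entry validates, while an entry whose bytes were modified after the ID was recorded does not.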

Harbormaster completed remote builds in B201800: Diff 481046. Dec 7 2022, 1:55 PM

Cleanup patch to move unrelated changes.

Harbormaster completed remote builds in B201802: Diff 481048. Dec 7 2022, 1:57 PM

dblaikie added inline comments. Dec 7 2022, 3:30 PM

llvm/include/llvm/CAS/CASID.h
109–110	Cool - I see you updated the parameter to `std::string&&` - generally I'd expect it to be `std::string` value, not a reference. That way the caller can move into the parameter if they have a string to discard, or allow the copy if they want to keep copies in the caller for whatever reason.
llvm/include/llvm/CAS/CASReference.h
163	Still feels like the names could be more self-descriptive, even with/independent of documentation improvements. `Ref` and `Handle` both tend to connote a thing you can use to inspect the thing being referred to or handled. (if anything, I guess I'd actually expect a Handle to be the inaccessible version and the Ref to be the accessible version (thinking StringRef, ArrayRef, etc - where the data is directly accessible through that abstraction, whereas handles sometimes you have to pass back into some other object/API to then get the data (like a file handle/file descriptor))) Seems like ObjectRef is maybe more suitably called ObjectID?
llvm/include/llvm/CAS/ObjectStore.h
60	Presumably it can also fail if the object isn't in the given CAS? (maybe that's the more obvious/simpler to document example?)
62–65	ping on this
110–111	Might be worth a few more words in the description maybe about "consistency" or somesuch?

steven_wu added inline comments. Dec 8 2022, 9:50 AM

llvm/include/llvm/CAS/CASReference.h
163	I don't have strong opinion on names. Maybe @dexonsmith @benlangmuir @akyrtzi can also provide some feedbacks for names? In my opinion, `ObjectRef` is named that way because it is like a reference type (just an opaque pointer) and you need to "dereference" it to access underlying data, just that the `dereference` might fail in CAS. The difference between `ObjectRef` and `ObjectHandle` is mostly providing the flexibility for a remote CAS so there is no need to traffic all the data if you only dealing with refs. Also `ObjectRef` is the public type we encourage to use, just like other `Ref` types like `StringRef` and `ArrayRef`. It is cheap and allows quick fetch of the underlying data (at least for the builtin CAS we provided). It doesn't really contains ID, and you can't compare `Ref` from different ObjectStore. I thought about rename `CASID` to `ObjectID` to keep the name consistent, but there is no other more benefit. I am up for renaming types to make it more self-descriptive but I would like to reach an agreement before doing so because we have lots of downstream code needs to upstream and I would like to avoid repeated renames.
llvm/include/llvm/CAS/ObjectStore.h
60	This question is complicated. Maybe we should write down a spec for the CAS APIs? To this question, the current spec for when to lookup for a CAS object to make sure it exists is totally implementation defined. For builtinCAS here, ObjectRef always points to existing object, unless the integrity of the builtinCAS is broken. For a remote CAS, you might actually want to have ObjectRef to be unverified to avoid a roundtrip to remote to validate a ObjectRef. Also to you previous point of split up public/private interface document, is it better if I create a separate docs in `llvm/docs` to explain different concepts from the views of: 1. CAS users 2. CAS implementors?

benlangmuir added inline comments. Dec 8 2022, 10:23 AM

llvm/include/llvm/CAS/CASReference.h
163	I agree with Steven; I think `ObjectRef` deserves the good name here, since that's the one that clients of an `ObjectStore` will work with. `ObjectHandle` is only used by implementors of an `ObjectStore`. I agree that having both `Ref` and `Handle` is confusing - maybe we should rename Handle to something more verbose and descriptive like `LoadedObjectRef`.

> Seems like ObjectRef is maybe more suitably called ObjectID?

I think this is confusing with `CASID`, which really is an ID. I don't have a strong opinion on whether `CASID` should be renamed to `ObjectID` or not for consistency.

benlangmuir added inline comments. Dec 8 2022, 10:28 AM

llvm/include/llvm/CAS/ObjectStore.h
60	> This question is complicated. Maybe we should write down a spec for the CAS APIs? To this question, the current spec for when to lookup for a CAS object to make sure it exists is totally implementation defined

This complication is only for the implementation though, right? Clients of the object store shouldn't assume the object exists in the CAS, and like @dblaikie said, loading can fail if it is not. I don't think we want to promise that builtin CAS will always succeed when loading, since we might want to change that kind of detail later.

dexonsmith added inline comments. Dec 8 2022, 11:01 AM

llvm/include/llvm/CAS/CASReference.h
163	In my mind, the name `ObjectID` was "taken", since I originally thought we'd do a `s/CASID/ObjectID/g` at some point (maybe this patch / before upstreaming is the right time?). Maybe it would be even better to do `s/CASID/std::string/`.

(Historical note: an earlier design had neither ObjectRef nor ObjectHandle, just CASID. CASIDs were tied to a specific CAS instance, had odd lifetimes, and tried to serve all three purposes. We figured out that this was awkward, especially for distributed/remote CAS implementations.)

`ObjectRef` is a reference to an object (possibly an ID, possibly some internal CAS pointer). As @steven_wu points out, having an `ObjectRef` doesn't necessarily mean you know anything about the content of an object. `ObjectHandle` promises that you can access the object.

I think to explain the difference, it helps to have a specific example. If you have a few objects:

object-id: 0
data: "some small data"

object-id: 1
data: [4GB of data]

object-id: 2
data: "a header"
refs: 0, 1

If you have an `ObjectHandle` for object 2, then you can see that the data is "a header" and you can get the `ObjectRef`s for the objects it references: 0 and 1. Those `ObjectRef`s allow you to talk about objects 0 and 1 without necessarily loading / downloading them, which is important because object 1 is 4GB big.

* You can create a new object that references them, and you can get a serialized reference (CASID/ObjectID/std::string) for referring to it in other contexts.
* Or, you can "load" an `ObjectRef`, get an `ObjectHandle` for that object, and look at its content/refs. If you have a distributed CAS, or the object is big, or [etc.], this action might have significant latency. I imagine we'd eventually want to add an API to allow asynchronous loading.

It's a good point that "ref" here bears no relation to the "ref" in `ArrayRef`. That makes it confusing.

It is identifier-like, even if it doesn't know how to serialize itself into a string (you need the CAS instance to string-ify it for you). At a high level, we have three concepts:

* a serialized, context-free identifier (currently CASID)
* an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)
* a context-sensitive reference to an object known to "be local/loaded" (currently ObjectHandle)

Maybe the following rename could work:

* Delete CASID and replace its usage with `std::string`
* Rename ObjectRef to ObjectID
* Maybe: Rename ObjectHandle to ObjectRef

I don't have a strong opinion about whether that results in a better world. The "current" names are kind of etched into my brain so it's hard to evaluate. @steven_wu and @benlangmuir, WDYT? Is there something I'm missing/forgetting that makes this a bad idea? @dblaikie, WDYT? Maybe you have other name ideas after the above explanation?
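
The three-object example in the comment above can be sketched as a toy model of the reference/handle split. Everything here is hypothetical (`ToyRef`, `ToyLoaded`, `ToyRemoteStore`); refs are plain indices rather than hashes, a 1 MiB string stands in for the 4GB object, and a byte counter stands in for the cost of loading from a remote store.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Opaque reference: cheap to copy, says nothing about the content.
struct ToyRef {
  std::size_t Index;
};

// Loaded view: data and refs are directly readable.
struct ToyLoaded {
  const std::string *Data;
  const std::vector<ToyRef> *Refs;
};

class ToyRemoteStore {
  struct Entry {
    std::string Data;
    std::vector<ToyRef> Refs;
  };
  // A deque keeps element addresses stable across push_back, so
  // pointers handed out by load() stay valid.
  std::deque<Entry> Entries;
  std::size_t BytesLoaded = 0;

public:
  // Creating an object only needs refs to its children, not their data.
  ToyRef store(std::string Data, std::vector<ToyRef> Refs) {
    for (const ToyRef &R : Refs)
      assert(R.Index < Entries.size() && "ref must come from this store");
    Entries.push_back(Entry{std::move(Data), std::move(Refs)});
    return ToyRef{Entries.size() - 1};
  }

  // Loading is the step that may be slow or fail. The byte counter
  // models the transfer cost of fetching the object's data.
  std::optional<ToyLoaded> load(ToyRef Ref) {
    if (Ref.Index >= Entries.size())
      return std::nullopt;
    const Entry &E = Entries[Ref.Index];
    BytesLoaded += E.Data.size();
    return ToyLoaded{&E.Data, &E.Refs};
  }

  std::size_t bytesLoaded() const { return BytesLoaded; }
};
```

In this model, loading the header object and then creating a second object that points at the large one touches only the header's bytes; the large object is never loaded.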

dexonsmith added inline comments. Dec 8 2022, 11:09 AM

llvm/include/llvm/CAS/CASReference.h
163	@benlangmuir, just seeing your reply. I'd have expected clients to work with both `ObjectRef` and `ObjectHandle`; but you have much more experience than I do. (I guess the `ObjectHandle` is only indirectly useful to clients. For any serious work, you want to create a proxy that wraps an `ObjectHandle` and navigates the content in a structured way.) Agreed that `ObjectRef` ought to have a good / easy-to-type name.

steven_wu added inline comments. Dec 8 2022, 11:19 AM

llvm/include/llvm/CAS/CASReference.h
163	@dexonsmith Correct, we clean up the interface that no public methods from `ObjectStore` return `ObjectHandle` anymore so users access data in object through ObjectProxy. One less data type to deal with for CAS users and ObjectProxy has slightly better interfaces to type as well. `ObjectHandle` is type for CAS implementor only now to represent loaded object.
llvm/include/llvm/CAS/ObjectStore.h
60	True. I will update the wording.

akyrtzi added inline comments. Dec 8 2022, 11:25 AM

llvm/include/llvm/CAS/CASReference.h
163	> I'd have expected clients to work with both `ObjectRef` and `ObjectHandle`

I don't think `ObjectHandle` has any value for clients, clients will be using `ObjectProxy`.

> an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)

Something to highlight about this: For a "remote CAS" implementation you practically don't know that "an object is known to exist" until you try to load it. From that respect there is essentially no functional difference between `CASID` and `ObjectRef` in such a context.

dexonsmith added inline comments. Dec 8 2022, 11:25 AM

llvm/include/llvm/CAS/CASReference.h
163	Ah, that sounds great (I apologize for not reading the patch itself). Agreed with @benlangmuir that LoadedObjectHandle seems pretty reasonable if clients never have to use it. Does my understanding the roles still stand? (Maybe that example could/should be used somewhere in the docs to clarify.) Also: Do we still need CASID? Why not just use `std::string`? If we don't need CASID and can delete it, WDYT of renaming ObjectRef to ObjectID? (It could be thought of as an "opaque" identifier.) Another idea: rename `ObjectRef` to `ObjectPtr`?

dexonsmith added inline comments. Dec 8 2022, 11:29 AM

llvm/include/llvm/CAS/CASReference.h
163	> Something to highlight about this: For a "remote CAS" implementation you practically don't know that "an object is known to exist" until you try to load it. From that respect there is essentially no functional difference between CASID and ObjectRef in such a context.

Ah, is that because a remote CAS may have garbage-collected the object? So, then, `ObjectRef` is just an object we know how to point at? (Or do you just mean that loading could fail because the remote CAS goes down?)

benlangmuir added inline comments. Dec 8 2022, 11:50 AM

llvm/include/llvm/CAS/CASReference.h
163	> Ah, is that because a remote CAS may have garbage-collected the object? So, then, ObjectRef is just an object we know how to point at? (Or do you just mean that loading could fail because the remote CAS goes down?)

We do not check whether an object exists at all when creating the ObjectRef for a remote CAS: it's a wasted round trip over the network since you can't guarantee the object still exists later, and don't get a performance win for knowing it existed in the past. This is different with the local in-memory and on-disk CASes, because they can check for existence cheaply and provide a faster ref implementation once we know the object exists.

> a serialized, context-free identifier (currently CASID)

There are two levels here:

* a serialized, context-free identifier (std::string, e.g. "llvmcas://<serialized hash>")
* a context-sensitive identifier hash containing raw hash bytes (CASID) -- you need a CASContext to use it

> an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)

Not known to exist anywhere, but otherwise yes.

> a context-sensitive reference to an object known to "be local/loaded" (currently ObjectHandle)

Correct.

> Also: Do we still need CASID? Why not just use `std::string`?

So the question here is whether CASID is pulling its weight as a client-visible type. It should be more efficient than the serialized string, since you have the hash bytes immediately available rather than parsing them out; it can be smaller for the same reason. I don't know if we have measured this -- @steven_wu, @akyrtzi any thoughts?

> If we don't need CASID and can delete it, WDYT of renaming ObjectRef to ObjectID? (It could be thought of as an "opaque" identifier.)

If we can drop CASID, I'm fine with this.

> Another idea: rename `ObjectRef` to `ObjectPtr`?

I find Ref clearer than Ptr for what it does.

akyrtzi added inline comments. Dec 8 2022, 1:00 PM

llvm/include/llvm/CAS/CASReference.h
163	I like the idea of dropping `CASID` entirely. I'm also fine with renaming `ObjectRef` to `ObjectID`.

I would also suggest that we change

`virtual Expected<ObjectHandle> load(ObjectRef Ref)`

to

`virtual Expected<Optional<ObjectHandle>> load(ObjectRef Ref)`

The optional indicates whether the object existed or not, and we reserve errors for catastrophic failures (e.g. "network went down"). The client could choose to turn the "object doesn't exist" `None` into an error (or not, it depends on the context) but this case should be distinguishable from `ObjectStore`'s API.
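
The three-way result proposed above (found, missing, or failed) can be illustrated without LLVM types. In this sketch, which only approximates the suggestion, `std::variant<std::optional<T>, ToyError>` is an assumed stand-in for `Expected<Optional<T>>`, and `ToyLoadStore`, `ToyError`, `LoadOutcome` and `classify` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <variant>

struct ToyError {
  std::string Message;
};

// Assumed stand-in for Expected<Optional<std::string>>: either a
// possibly-empty value or an error.
using ToyLoadResult = std::variant<std::optional<std::string>, ToyError>;

class ToyLoadStore {
  std::map<int, std::string> Objects;
  bool Online = true;

public:
  void put(int Ref, std::string Data) { Objects[Ref] = std::move(Data); }
  void setOnline(bool Value) { Online = Value; }

  ToyLoadResult load(int Ref) const {
    if (!Online)
      return ToyError{"network went down"}; // catastrophic failure
    auto It = Objects.find(Ref);
    if (It == Objects.end())
      return std::optional<std::string>(); // object does not exist
    return std::optional<std::string>(It->second);
  }
};

enum class LoadOutcome { Found, Missing, Failed };

// Callers can tell "missing" apart from "failed" and decide for
// themselves whether a missing object is an error.
inline LoadOutcome classify(const ToyLoadResult &R) {
  assert(!R.valueless_by_exception());
  if (std::holds_alternative<ToyError>(R))
    return LoadOutcome::Failed;
  const auto &Maybe = std::get<std::optional<std::string>>(R);
  return Maybe ? LoadOutcome::Found : LoadOutcome::Missing;
}
```

With a single error channel the last two outcomes would collapse into one; keeping the optional inside the result is what preserves the distinction.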

steven_wu added inline comments. Dec 8 2022, 1:42 PM

llvm/include/llvm/CAS/CASReference.h
163	I tried to get rid of `CASID` but there are some desirable features that decide to keep it for now. For example, it distinguishes itself from a printable string. For builtinCAS, you have a printable string that is something like: `llvmcas://abcdefg...`, while CASID is the raw hash value (0xabcdefg...) + the context ("llvm.builtin.cas.v1[BLAKE3]"). If we get rid of CASID, you will either have a std::string representation that is `llvmcas://` then you need to parse it every time you want access to raw hash, or you decide to have `std::string` to store the raw hash value, then: 1. you can't store a context anywhere. 2. It is confusing if a function is taking a printable string or a raw hash value, since both of them are std::string.
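
The parsing cost mentioned above can be made concrete with a small sketch that turns a printable identifier into raw hash bytes. Only the `llvmcas://` scheme comes from this thread; the lowercase-hex encoding of the hash, and the names `parseToyID` and `hexValue`, are assumptions made for illustration and may not match the builtin CAS.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

// Returns the value of a lowercase hex digit, or -1 if C is not one.
inline int hexValue(char C) {
  if (C >= '0' && C <= '9')
    return C - '0';
  if (C >= 'a' && C <= 'f')
    return C - 'a' + 10;
  return -1;
}

// Parses "llvmcas://<hex digits>" into raw bytes. The hex encoding is an
// assumption of this sketch, not a documented format.
inline std::optional<std::vector<std::uint8_t>>
parseToyID(std::string_view Printable) {
  constexpr std::string_view Scheme = "llvmcas://";
  if (Printable.substr(0, Scheme.size()) != Scheme)
    return std::nullopt;
  Printable.remove_prefix(Scheme.size());
  if (Printable.empty() || Printable.size() % 2 != 0)
    return std::nullopt;

  std::vector<std::uint8_t> Hash;
  for (std::size_t I = 0; I < Printable.size(); I += 2) {
    int Hi = hexValue(Printable[I]);
    int Lo = hexValue(Printable[I + 1]);
    if (Hi < 0 || Lo < 0)
      return std::nullopt;
    Hash.push_back(static_cast<std::uint8_t>(Hi * 16 + Lo));
  }
  assert(Hash.size() == Printable.size() / 2);
  return Hash;
}
```

Every caller that holds only the printable form has to repeat this validation and decoding before it can use the hash, which is the overhead a dedicated raw-hash type avoids.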

akyrtzi added inline comments. Dec 8 2022, 1:48 PM

llvm/include/llvm/CAS/CASReference.h
163	> then you need to parse it every time you want access to raw hash

Why do I need to parse the string to get the raw hash, I should be able to get the raw hash from the `ObjectRef`/`ObjectID`, right? Once I parse a "llvmcas://" string into an `ObjectRef`/`ObjectID` there should be no need to keep the string around anymore.

dexonsmith added inline comments. Dec 8 2022, 1:48 PM

llvm/include/llvm/CAS/CASReference.h
163	What is "every time"? I.e., why do you need to get the hash from the CASID? Could/should some/many places that currently use CASID use an ObjectRef instead? IOW, maybe some current uses of CASID should switch to ObjectRef or ObjectProxy, and the rest can then be converted to std::string. But I'm not sure. Only asking the question to ensure that you've considered it (vs. momentum of code written with CASID before ObjectRef existed).

dexonsmith added inline comments. Dec 8 2022, 2:51 PM

llvm/include/llvm/CAS/CASReference.h
163	Paging in old memories, I think where CASID remains useful is if you're talking about the same object in two different instances of a CAS (maybe one instance is in-memory, and the other is on-disk or remote), and both have the same CASContext. IIRC, you share a CASContext iff the std::strings mean the same thing. E.g., should be true for any CAS that has a std::string ID starting with `llvmcas://`, as long as both instances use the same hash function (currently always BLAKE3, right?). (Unless this changed...)

If a client is doing a lot of this, having an optimized representation of the hash/identifier could be useful; it avoids repeatedly serializing/parsing/validating std::strings just to communicate a hash from one CAS instance to another. (In this scenario, it's also maybe nice to avoid a malloc/dealloc, but the hash will always overflow std::string's small storage... that was the motivation to avoid std::string originally.)

I imagine we'll want to do that sort of thing at some point, but not sure how urgent it is (maybe it's already needed?), or whether the optimization is really necessary (maybe the overhead is hidden by other things that are orders of magnitude more expensive?)...

Outside of that multi-CAS-instance-same-CAScontext scenario, I'm not sure I see a reason to avoid using ObjectRef (or std::string).

Another thought: if we don't drop CASID, it could be renamed to ObjectHash.

* ObjectHash: the raw hash of the object, shareable within a CASContext (né CASID)
* ObjectID: an opaque identifier for a specific CAS instance (né ObjectRef)
* LoadedObjectHandle: implementation detail, points at "loaded" object in a CAS instance (né ObjectHandle)
* ObjectProxy: LoadedObjectHandle+ObjectStore+nice APIs.

akyrtzi added inline comments. Dec 8 2022, 3:06 PM

llvm/include/llvm/CAS/CASReference.h
163	> it avoids repeatedly serializing/parsing/validating std::strings just to communicate a hash from one CAS instance to another.

Hmm, it's still unclear to me why you need to traffic "llvmcas://" strings for that; the `ObjectRef/ID` that you got from one instance should be able to provide the hash bytes for the other. You don't need to re-validate the hash because the instances are context-compatible.

dexonsmith added inline comments. Dec 8 2022, 3:21 PM

llvm/include/llvm/CAS/CASReference.h
163	IIRC, CASID is the only container for the raw hash recognized by ObjectStore. If you take it away, and you want to traffic in raw hashes you need to invent a new container for them (or just let people use ArrayRef). If you think it’s useful to traffic in raw hashes, then I suggest keeping CASID/ObjectHash, since it has some error checking and dumping built-in.

akyrtzi added inline comments. Dec 8 2022, 4:08 PM

llvm/include/llvm/CAS/CASReference.h
163	> or just let people use ArrayRef

I'm personally fine with making it that the way you get an `ObjectID` is either via parsing a string identifier or via passing the `ArrayRef` hash bytes (for example, say I stored the hash as raw bytes in a file and now I want to get an `ObjectID` out of these bytes). Like a string identifier from a command-line option, the raw hash bytes will be a transient input that I will use to get an `ObjectID` in order to proceed with the rest of the work, I shouldn't need to be "trafficking" both hash bytes and ObjectIDs, at the same time.

In general, as a client I'd like to have only 3 concepts that I need to be concerned about:

* A string identifier for a CAS object that can be parsed and printed (e.g. the "llvmcas://" strings)
* A type I can get as a valid ID after parsing a string identifier or after passing the hash bytes for the object (e.g. `ObjectID`)
* A type that contains the loaded data for the object. (e.g. `ObjectProxy`)

An `ObjectHash` type could be useful for the implementations but I find it an unnecessary concept for the users of `ObjectStore` API.

dexonsmith added inline comments. Dec 8 2022, 4:39 PM

llvm/include/llvm/CAS/CASReference.h
163	Sure; encapsulation might be overkill. I don’t feel strongly!

steven_wu mentioned this in D132455: [ADT] add ConcurrentHashtable class. Jan 3 2023, 10:15 AM

Rebase for std::optional

Harbormaster completed remote builds in B205514: Diff 486058. Jan 3 2023, 12:44 PM

Add docs to the reference page that address CAS from different perspectives.

Still uses the old names, pending a global rename to better names.

Harbormaster completed remote builds in B206002: Diff 486696. Jan 5 2023, 3:33 PM

Update patch after HashMap rename

Harbormaster completed remote builds in B206582: Diff 487492. Jan 9 2023, 12:02 PM

Update after makeArrayRef deprecation.

Harbormaster completed remote builds in B208299: Diff 489895. Jan 17 2023, 1:53 PM

thevinster added a subscriber: thevinster. Jan 22 2023, 5:48 PM

Rebase patch

Herald added a subscriber: wangpc. Jul 31 2023, 2:11 PM

Harbormaster completed remote builds in B249316: Diff 545803. Jul 31 2023, 2:13 PM

Matt added a subscriber: Matt. Aug 27 2023, 10:56 PM

JamesWidman added a subscriber: JamesWidman. Aug 28 2023, 11:03 AM

Revision Contents

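Path	Size
llvm/docs/ContentAddressableStorage.md	120 lines
llvm/docs/Reference.rst	4 lines
llvm/include/llvm/CAS/CASID.h	131 lines
llvm/include/llvm/CAS/CASReference.h	205 lines
llvm/include/llvm/CAS/ObjectStore.h	327 lines
llvm/lib/CAS/BuiltinCAS.h	148 lines
llvm/lib/CAS/BuiltinCAS.cpp	131 lines
llvm/lib/CAS/BuiltinObjectHasher.h	73 lines
llvm/lib/CAS/CMakeLists.txt	8 lines
llvm/lib/CAS/InMemoryCAS.cpp	331 lines
llvm/lib/CAS/ObjectStore.cpp	111 lines
llvm/lib/CMakeLists.txt	1 line
llvm/unittests/CAS/CASTestConfig.h	36 lines
llvm/unittests/CAS/CASTestConfig.cpp	22 lines
llvm/unittests/CAS/CMakeLists.txt	12 lines
llvm/unittests/CAS/ObjectStoreTest.cpp	280 lines
llvm/unittests/CMakeLists.txt	1 line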

Diff 545803

llvm/docs/ContentAddressableStorage.md

This file was added.

				# Content Addressable Storage

				## Introduction to CAS

				Content Addressable Storage, or `CAS`, is a storage system where it assigns
				unique addresses to the data stored. It is very useful for data deduplicaton
				and creating unique identifiers.

				Unlike other kinds of storage systems, such as file systems, CAS is immutable.
				A computation can be modeled more reliably when its inputs and outputs
				are represented as objects stored in a CAS.

				The basic unit of the CAS library is a CASObject, which contains:

				* Data: arbitrary data
				* References: references to other CASObjects

				It can be conceptually modeled as something like:

				```
				struct CASObject {
				  ArrayRef<char> Data;
				  ArrayRef<CASObject*> Refs;
				};
				```

				Such an abstraction allows simple composition of CASObjects into a DAG to
				represent complicated data structures while still allowing data deduplication.
				Note that you can compare two DAGs by just comparing the CASObject hashes of
				their two root nodes.
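
The point about comparing root hashes can be illustrated with a small standalone sketch. It is not part of the patch and does not use the LLVM API: `ToyObject`, `ToyID`, and `computeID` are hypothetical names, and a 64-bit FNV-1a digest stands in for the cryptographic hash a real CAS would use.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy identifier; a real CAS would use a much wider cryptographic hash.
using ToyID = std::uint64_t;

// Fold one byte into a running FNV-1a state.
inline ToyID mixByte(ToyID H, unsigned char B) {
  return (H ^ B) * 1099511628211ULL;
}

// Fold a 64-bit value into the state, least significant byte first.
inline ToyID mixU64(ToyID H, std::uint64_t V) {
  for (int I = 0; I < 8; ++I)
    H = mixByte(H, static_cast<unsigned char>(V >> (8 * I)));
  return H;
}

// Hypothetical object model: data plus the IDs of referenced objects.
struct ToyObject {
  std::string Data;
  std::vector<ToyID> Refs;
};

// The ID covers the data and the IDs of all references, so it transitively
// covers the whole DAG reachable from the object.
inline ToyID computeID(const ToyObject &O) {
  ToyID H = 14695981039346656037ULL; // FNV-1a offset basis.
  H = mixU64(H, O.Data.size());
  for (unsigned char C : O.Data)
    H = mixByte(H, C);
  H = mixU64(H, O.Refs.size());
  for (ToyID R : O.Refs)
    H = mixU64(H, R);
  return H;
}
```

Because each node's ID already includes the IDs of its children, two roots built from the same leaves in the same order get the same ID, and a change anywhere below a root changes the root's ID (up to hash collisions, which this toy hash does not rule out).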



				## LLVM CAS Library User Guide

				The CAS-like storage provided in LLVM is `llvm::cas::ObjectStore`.
				To reference a CASObject, there are a few different abstractions provided
				with different trade-offs:

				### ObjectRef

				`ObjectRef` is a lightweight reference to a CASObject stored in the CAS.
				This is the most commonly used abstraction and it is cheap to copy and pass
				along. It has the following properties:

				* `ObjectRef` is only meaningful within the `ObjectStore` that created it.
				`ObjectRef`s created by different `ObjectStore`s cannot be cross-referenced or
				compared.
				* `ObjectRef` doesn't guarantee the existence of the CASObject it points to. An
				explicit load is required before accessing the data stored in the CASObject.
				This load can also fail, for reasons including but not limited to: the object
				does not exist, the CAS storage is corrupted, or the operation times out.
				* If two `ObjectRef`s are equal, the objects they point to (if they exist) are
				guaranteed to be identical. If they are not equal, the underlying objects are
				guaranteed to be different.
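
As a rough illustration of these properties (not the LLVM API), the sketch below models a store in which creating a reference never checks for existence and loading is the step that can fail. `ToyRef`, `ToyRemoteStore`, `getReference`, and `put` are hypothetical names, and `std::optional` stands in for the `Expected<...>` result a real load would return.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for ObjectRef: an opaque value that is only
// meaningful to the store that produced it.
struct ToyRef {
  std::uint64_t Internal;
  friend bool operator==(ToyRef L, ToyRef R) {
    return L.Internal == R.Internal;
  }
  friend bool operator!=(ToyRef L, ToyRef R) { return !(L == R); }
};

class ToyRemoteStore {
  // Objects that are actually available for loading.
  std::map<std::uint64_t, std::string> Present;

public:
  // Creating a reference does not check that the object exists.
  ToyRef getReference(std::uint64_t ID) const { return ToyRef{ID}; }

  // Make an object available under the given ID.
  void put(std::uint64_t ID, std::string Data) {
    Present[ID] = std::move(Data);
  }

  // Loading is the step that can fail; here the only modeled failure is
  // a missing object.
  std::optional<std::string> load(ToyRef R) const {
    auto It = Present.find(R.Internal);
    if (It == Present.end())
      return std::nullopt;
    return It->second;
  }
};
```

Two references obtained for the same ID compare equal without touching the object data, while a reference to a missing object is still a valid value that only fails when loaded.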

				### ObjectProxy

				`ObjectProxy` represents a loaded CASObject. With an `ObjectProxy`, the
				underlying stored data and references can be accessed without the need
				of error handling. The class APIs also provide convenient methods to
				access underlying data. The lifetime of the underlying data is equal to
				the lifetime of the instance of `ObjectStore` unless explicitly copied.

				### CASID

				`CASID` is the hash identifier for CASObjects. It owns the underlying
				storage for the hash value, so it can be expensive to copy and compare depending
				on the hash algorithm. `CASID` is generally only useful in rare situations
				like printing the raw hash value or exchanging hash values between different
				CAS instances that use the same hashing schema.

				### ObjectStore

				`ObjectStore` is the CAS-like object storage. It provides APIs to save
				and load CASObjects, for example:

				```
				ObjectRef A, B, C;
				Expected<ObjectRef> Stored = ObjectStore.store("data", {A, B});
				Expected<ObjectProxy> Loaded = ObjectStore.getProxy(C);
				```

				It also provides APIs to convert between `ObjectRef`, `ObjectProxy` and
				`CASID`.
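
To make the store/load round trip concrete, here is a minimal standalone sketch of a deduplicating in-memory store. It is only an illustration under stated assumptions: `ToyInMemoryStore`, `ToyNode`, and `ToyRef` are hypothetical names, the full object content is used as the lookup key instead of a hash, and no error handling is modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical reference type: an index into one particular store.
using ToyRef = std::size_t;

struct ToyNode {
  std::string Data;
  std::vector<ToyRef> Refs;
  // Order by content so a node can be used as a map key.
  friend bool operator<(const ToyNode &L, const ToyNode &R) {
    return std::tie(L.Data, L.Refs) < std::tie(R.Data, R.Refs);
  }
};

class ToyInMemoryStore {
  // A deque keeps references to stored nodes valid as the store grows.
  std::deque<ToyNode> Nodes;
  std::map<ToyNode, ToyRef> Index;

public:
  // Storing identical content twice returns the same reference.
  ToyRef store(std::string Data, std::vector<ToyRef> Refs) {
    ToyNode N{std::move(Data), std::move(Refs)};
    auto It = Index.find(N);
    if (It != Index.end())
      return It->second;
    ToyRef R = Nodes.size();
    Nodes.push_back(N);
    Index.emplace(std::move(N), R);
    return R;
  }

  // Access a stored node; throws std::out_of_range for an unknown reference.
  const ToyNode &get(ToyRef R) const { return Nodes.at(R); }
};
```

In this sketch, storing `"data"` with references `{A, B}` twice yields the same `ToyRef`, while the same data with a different reference list yields a different one.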



				## CAS Library Implementation Guide

				The LLVM ObjectStore APIs are designed so that it is easy to add
				customized CAS implementations that are interchangeable with the builtin
				CAS implementations.

				To add your own implementation, you just need to add a subclass of
				`llvm::cas::ObjectStore` and implement all its pure virtual methods.
				To be interchangeable with the LLVM ObjectStore, the new CAS implementation
				needs to conform to the following contracts:

				* Different CASObjects stored in the ObjectStore need to have different hashes
				and result in different `ObjectRef`s. Conversely, the same CASObject should have
				the same hash and the same `ObjectRef`. Note that two CASObjects with identical
				data but different references are considered different objects.
				* `ObjectRef`s are comparable within the same `ObjectStore` instance, and can
				be used to determine the equality of the underlying CASObjects.
				* Objects loaded from the ObjectStore need to have a lifetime at
				least as long as the ObjectStore itself.

				Behavior that is not specified above is implementation defined. For example,
				an implementation may make `ObjectRef` point to an already loaded CASObject so
				that loading from the `ObjectStore` never fails. It is also legal to use a
				stricter model than required. For example, an implementation may produce
				`ObjectRef`s that can be compared across different `ObjectStore` instances, but
				users of the ObjectStore should not depend on this behavior.
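
One way to read the contract is as a set of checks that can be run against any implementation. The sketch below is a simplified model, not the real `llvm::cas::ObjectStore` interface: `ToyStoreInterface`, `ToyMapStore`, `ToyDataOnlyStore`, and `satisfiesContract` are hypothetical names, and the interface is reduced to a single `store` method.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using ToyRef = std::uint64_t;

// Hypothetical, heavily reduced stand-in for an object store interface.
class ToyStoreInterface {
public:
  virtual ~ToyStoreInterface() = default;
  virtual ToyRef store(const std::string &Data,
                       const std::vector<ToyRef> &Refs) = 0;
};

// Conforming implementation: identity depends on both data and references.
class ToyMapStore final : public ToyStoreInterface {
  std::map<std::pair<std::string, std::vector<ToyRef>>, ToyRef> Known;

public:
  ToyRef store(const std::string &Data,
               const std::vector<ToyRef> &Refs) override {
    auto Key = std::make_pair(Data, Refs);
    auto It = Known.find(Key);
    if (It != Known.end())
      return It->second;
    ToyRef R = Known.size();
    Known.emplace(std::move(Key), R);
    return R;
  }
};

// Non-conforming implementation: it ignores references entirely.
class ToyDataOnlyStore final : public ToyStoreInterface {
  std::map<std::string, ToyRef> Known;

public:
  ToyRef store(const std::string &Data, const std::vector<ToyRef> &) override {
    auto It = Known.find(Data);
    if (It != Known.end())
      return It->second;
    ToyRef R = Known.size();
    Known.emplace(Data, R);
    return R;
  }
};

// Checks the first two contract items on a fresh store.
inline bool satisfiesContract(ToyStoreInterface &S) {
  ToyRef Leaf = S.store("leaf", {});
  ToyRef Other = S.store("other", {});
  bool SameIsSame = S.store("leaf", {}) == Leaf;
  bool DataMatters = Leaf != Other;
  bool RefsMatter = S.store("n", {Leaf}) != S.store("n", {Other});
  return SameIsSame && DataMatters && RefsMatter;
}
```

The second implementation fails the check because it treats objects with identical data but different references as the same object, which the contract explicitly forbids. The lifetime requirement is not modeled here.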

				For CAS library implementers, there is also an `ObjectHandle` class that
				is an internal representation of a loaded CASObject reference.
				`ObjectProxy` is just a pair of `ObjectHandle` and `ObjectStore`, because
				just like `ObjectRef`, `ObjectHandle` is only useful when paired with
				the ObjectStore that knows about the loaded CASObject.
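
The handle/proxy pairing can be sketched in a few lines. This is an assumption-laden toy rather than the patch's classes: `ToyHandle`, `ToyStore`, and `ToyProxy` are hypothetical names, and the handle is simply an index that only its own store can interpret.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical handle: meaningless without the store that issued it.
struct ToyHandle {
  std::size_t Index;
};

class ToyStore {
  // A deque keeps references to loaded data stable as more is added.
  std::deque<std::string> Loaded;

public:
  ToyHandle add(std::string Data) {
    Loaded.push_back(std::move(Data));
    return ToyHandle{Loaded.size() - 1};
  }

  // Only the store knows how to turn a handle into data.
  const std::string &getData(ToyHandle H) const { return Loaded.at(H.Index); }
};

// Hypothetical proxy: bundles the handle with its store so callers can
// reach the data without passing the store around separately.
class ToyProxy {
  const ToyStore *CAS;
  ToyHandle H;

public:
  ToyProxy(const ToyStore &CAS, ToyHandle H) : CAS(&CAS), H(H) {}
  const std::string &getData() const { return CAS->getData(H); }
};
```

The proxy adds no state of its own beyond the pair; its value is that the data accessors no longer need a separate store argument.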

llvm/docs/Reference.rst

(9 lines not shown; context: .. toctree::)
  :hidden:

  Atomics
  BitCodeFormat
  BlockFrequencyTerminology
  BranchWeightMetadata
  Bugpoint
  CommandGuide/index
+ ContentAddressableStorage
  ConvergenceAndUniformity
  ConvergentOperations
  Coroutines
  DependenceGraphs/index
  ExceptionHandling
  Extensions
  FaultMaps
  FuzzingLLVM
(197 lines not shown; context: :doc:`PointerAuth`)
  support in the backend.

  :doc:`YamlIO`
  A reference guide for using LLVM's YAML I/O library.

  :doc:`ConvergenceAndUniformity`
  A description of uniformity analysis in the presence of irreducible
  control flow, and its implementation.

+ :doc:`ContentAddressableStorage`
+ A reference guide for using LLVM's CAS library.

llvm/include/llvm/CAS/CASID.h

This file was added.

				//===- llvm/CAS/CASID.h ------------------------------------------ C++ --===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_CAS_CASID_H
				#define LLVM_CAS_CASID_H

				#include "llvm/ADT/ArrayRef.h"
				#include "llvm/ADT/DenseMapInfo.h"
				#include "llvm/ADT/StringRef.h"

				namespace llvm {

				class raw_ostream;

				namespace cas {

				class CASID;

				/// Context for CAS identifiers.
				class CASContext {
				public:
				virtual ~CASContext();
				dblaikie (Not Done): This could be private to avoid polluting autocomplete and such, maybe? (alternatively it's probably not too costly to have the dtor virtual and out of line to act as an anchor - I assume CASContexts aren't being created/torn down with any great frequency such that some dtor indirection would be especially costly?)
				steven_wu (Author, Done): This is a private method? But sure, this is not a common object to create or destroy. dtor as anchor sounds fine.

				/// Get an identifier for the schema used by this CAS context. Two CAS
				/// instances should return the same identifier if and only if their
				/// CASIDs are safe to compare by hash. This is used by \a
				/// CASID::equalsImpl().
				virtual StringRef getHashSchemaIdentifier() const = 0;

				protected:
				/// Print \p ID to \p OS.
				virtual void printIDImpl(raw_ostream &OS, const CASID &ID) const = 0;

				friend class CASID;
				};

				/// Unique identifier for a CAS object.
				///
				/// Locally, stores an internal CAS identifier that's specific to a single CAS
				/// instance. It's guaranteed not to change across the view of that CAS, but
				/// might change between runs.
				///
				/// It also has a \a CASContext pointer to allow comparison of these
				/// identifiers. If two CASIDs are from the same CASContext, they can be
				/// compared directly. If they are not, then \a
				/// CASContext::getHashSchemaIdentifier() is compared to see if they can be
				/// compared by hash, in which case the result of \a getHash() is compared.
				class CASID {
				public:
				void dump() const;
				void print(raw_ostream &OS) const {
				return getContext().printIDImpl(OS, *this);
				}
				friend raw_ostream &operator<<(raw_ostream &OS, const CASID &ID) {
				ID.print(OS);
				return OS;
				}
				std::string toString() const;

				ArrayRef<uint8_t> getHash() const;

				friend bool operator==(const CASID &LHS, const CASID &RHS) {
				// EmptyKey or TombstoneKey.
				if (!LHS.Context || !RHS.Context)
				return false;

				// CASIDs are equal when they have the same hash schema and same hash value.
				return LHS.Context->getHashSchemaIdentifier() ==
				RHS.Context->getHashSchemaIdentifier() &&
				LHS.Hash == RHS.Hash;
				}

				friend bool operator!=(const CASID &LHS, const CASID &RHS) {
				return !(LHS == RHS);
				}

				friend hash_code hash_value(const CASID &ID) {
				ArrayRef<uint8_t> Hash = ID.getHash();
				return hash_combine_range(Hash.begin(), Hash.end());
				}

				const CASContext &getContext() const {
				assert(Context && "Tombstone or empty key for DenseMap?");
				return *Context;
				}

				static CASID getDenseMapEmptyKey() {
				return CASID(nullptr, DenseMapInfo<StringRef>::getEmptyKey().str());
				}
				static CASID getDenseMapTombstoneKey() {
				return CASID(nullptr, DenseMapInfo<StringRef>::getTombstoneKey().str());
				}

				CASID() = delete;

				static CASID create(const CASContext *Context, StringRef Hash) {
				return CASID(Context, Hash.str());
				}

				private:
				CASID(const CASContext *Context, std::string &&Hash)
				: Context(Context), Hash(std::move(Hash)) {}

				const CASContext *Context;
				std::string Hash;
				dblaikie (Not Done): if you have a `std::string` member, should probably take the parameter as `std::string` and `std::move` it into the member - in cases where the caller can discard a string that'll be more efficient, and in cases where the caller can't, it's not especially worse. (though should the member be `std::string`? should it be `SmallString` or anything else?)
				steven_wu (Author, Done): I don't have much preference for the internal data type. CASID should only be relevant when you actually need the hash representation, often when you need to print out the value for outside users. Otherwise, `ObjectRef` is always a better choice.
				dblaikie (Not Done): Cool - I see you updated the parameter to `std::string&&` - generally I'd expect it to be `std::string` value, not a reference. That way the caller can move into the parameter if they have a string to discard, or allow the copy if they want to keep copies in the caller for whatever reason.
				};

				} // namespace cas

				template <> struct DenseMapInfo<cas::CASID> {
				static cas::CASID getEmptyKey() { return cas::CASID::getDenseMapEmptyKey(); }

				static cas::CASID getTombstoneKey() {
				return cas::CASID::getDenseMapTombstoneKey();
				}

				static unsigned getHashValue(cas::CASID ID) {
				return (unsigned)hash_value(ID);
				}

				static bool isEqual(cas::CASID LHS, cas::CASID RHS) { return LHS == RHS; }
				};

				} // namespace llvm

				#endif // LLVM_CAS_CASID_H
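
The comparison rule in `operator==` above (equal only when both contexts are non-null, the schema identifiers match, and the hash bytes match) can be tried out in isolation. The following is a standalone model with hypothetical names (`ToyContext`, `ToyCASID`); it mirrors the logic shown in the header but does not depend on LLVM.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for CASContext: only the schema identifier matters
// for comparison.
struct ToyContext {
  std::string Schema;
};

// Hypothetical stand-in for CASID.
struct ToyCASID {
  const ToyContext *Context; // Null models a DenseMap empty/tombstone key.
  std::string Hash;

  friend bool operator==(const ToyCASID &L, const ToyCASID &R) {
    // Sentinel keys never compare equal, not even to themselves.
    if (!L.Context || !R.Context)
      return false;
    // Equal when the hash schema and the hash value both match; the two
    // context objects themselves may be distinct.
    return L.Context->Schema == R.Context->Schema && L.Hash == R.Hash;
  }
  friend bool operator!=(const ToyCASID &L, const ToyCASID &R) {
    return !(L == R);
  }
};
```

Under this rule, IDs from two different store instances compare equal as long as the instances share a schema, which is the cross-instance use case discussed in the review thread.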

llvm/include/llvm/CAS/CASReference.h

This file was added.

				//===- llvm/CAS/CASReference.h ----------------------------------- C++ --===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_CAS_CASREFERENCE_H
				#define LLVM_CAS_CASREFERENCE_H

				#include "llvm/ADT/ArrayRef.h"
				#include "llvm/ADT/DenseMapInfo.h"
				#include "llvm/ADT/StringRef.h"

				namespace llvm {

				class raw_ostream;

				namespace cas {

				class ObjectStore;

				class ObjectHandle;
				class ObjectRef;

				/// Base class for references to things in \a ObjectStore.
				class ReferenceBase {
				protected:
				struct DenseMapEmptyTag {};
				struct DenseMapTombstoneTag {};
				static constexpr uint64_t getDenseMapEmptyRef() { return -1ULL; }
				static constexpr uint64_t getDenseMapTombstoneRef() { return -2ULL; }

				public:
				/// Get an internal reference.
				uint64_t getInternalRef(const ObjectStore &ExpectedCAS) const {
				#if LLVM_ENABLE_ABI_BREAKING_CHECKS
				assert(CAS == &ExpectedCAS && "Extracting reference for the wrong CAS");
				#endif
				return InternalRef;
				}

				unsigned getDenseMapHash() const {
				return (unsigned)llvm::hash_value(InternalRef);
				}
				bool isDenseMapEmpty() const { return InternalRef == getDenseMapEmptyRef(); }
				bool isDenseMapTombstone() const {
				return InternalRef == getDenseMapTombstoneRef();
				}
				bool isDenseMapSentinel() const {
				return isDenseMapEmpty() || isDenseMapTombstone();
				}

				protected:
				void print(raw_ostream &OS, const ObjectHandle &This) const;
				void print(raw_ostream &OS, const ObjectRef &This) const;

				bool hasSameInternalRef(const ReferenceBase &RHS) const {
				assert((!isDenseMapSentinel() && !RHS.isDenseMapSentinel()) &&
				"Invalid reference");
				#if LLVM_ENABLE_ABI_BREAKING_CHECKS
				assert(CAS == RHS.CAS && "Cannot compare across CAS instances");
				dblaikie (Done): Only the last of these checks is limited to `LLVM_ENABLE_ABI_BREAKING_CHECKS` yeah? Might be worth pulling the other two out of the preprocessor conditional, so they get run more broadly?
				#endif
				return InternalRef == RHS.InternalRef;
				}

				protected:
				friend class ObjectStore;
				ReferenceBase(const ObjectStore *CAS, uint64_t InternalRef, bool IsHandle)
				: InternalRef(InternalRef) {
				#if LLVM_ENABLE_ABI_BREAKING_CHECKS
				this->CAS = CAS;
				#endif
				assert(InternalRef != getDenseMapEmptyRef() && "Reserved for DenseMapInfo");
				assert(InternalRef != getDenseMapTombstoneRef() &&
				"Reserved for DenseMapInfo");
				}
				explicit ReferenceBase(DenseMapEmptyTag)
				: InternalRef(getDenseMapEmptyRef()) {}
				explicit ReferenceBase(DenseMapTombstoneTag)
				: InternalRef(getDenseMapTombstoneRef()) {}

				private:
				uint64_t InternalRef;

				#if LLVM_ENABLE_ABI_BREAKING_CHECKS
				const ObjectStore *CAS = nullptr;
				#endif
				};

				/// Reference to an object in a \a ObjectStore instance.
				///
				/// If you have an ObjectRef, you can point at it from new nodes with \a
				/// ObjectStore::store(), but you don't know anything about it. "Loading" the
				/// object is a separate step that may not have happened yet, and which can fail
				/// (due to filesystem corruption) or introduce latency (if downloading from a
				/// remote store).
				///
				/// \a ObjectStore::store() takes a list of these, and these are returned by \a
				/// ObjectStore::forEachRef() and \a ObjectStore::readRef(), which are accessors
				/// for nodes, and \a ObjectStore::getReference().
				///
				/// \a ObjectStore::load() will load the referenced object, and returns \a
				/// ObjectHandle, a variant that knows what kind of entity it is. \a
				///
				/// This is a wrapper around a \c uint64_t (and a \a ObjectStore instance when
				dblaikie (Not Done): expect -> extract? or something else?
				steven_wu (Author, Done): This comment is out-dated. The builtin kind of the object is removed from the latest implementation to simplify the interface so it is easier to use and implement
				/// assertions are on). If necessary, it can be deconstructed and reconstructed
				/// using \a ReferenceBase::getInternalRef() and \a
				/// ObjectRef::getFromInternalRef(). This should only happen inside a CAS
				/// implementation; users of the CAS should not call these methods.
				class ObjectRef : public ReferenceBase {
				struct DenseMapTag {};

				dblaikie (Not Done): If clients aren't expected to need to do it, could we leave that out until some necessary use comes up?
				steven_wu (Author, Done): It actually means that only the CAS implementers should use those methods (for example, as used in BuiltinCAS) but users of the CAS should not use those methods because `InternalRef` has no meaning outside the CAS internals. I rewrote it to clear it up a bit.
				public:
				friend bool operator==(const ObjectRef &LHS, const ObjectRef &RHS) {
				return LHS.hasSameInternalRef(RHS);
				}
				friend bool operator!=(const ObjectRef &LHS, const ObjectRef &RHS) {
				return !(LHS == RHS);
				}

				/// Allow a reference to be recreated after it's deconstructed.
				static ObjectRef getFromInternalRef(const ObjectStore &CAS,
				uint64_t InternalRef) {
				return ObjectRef(CAS, InternalRef);
				}

				static ObjectRef getDenseMapEmptyKey() {
				return ObjectRef(DenseMapEmptyTag{});
				}
				static ObjectRef getDenseMapTombstoneKey() {
				return ObjectRef(DenseMapTombstoneTag{});
				}

				/// Print internal ref and/or CASID. Only suitable for debugging.
				void print(raw_ostream &OS) const { return ReferenceBase::print(OS, *this); }

				LLVM_DUMP_METHOD void dump() const;

				private:
				friend class ObjectStore;
				friend class ReferenceBase;
				using ReferenceBase::ReferenceBase;
				ObjectRef(const ObjectStore &CAS, uint64_t InternalRef)
				: ReferenceBase(&CAS, InternalRef, /*IsHandle=*/false) {
				assert(InternalRef != -1ULL && "Reserved for DenseMapInfo");
				assert(InternalRef != -2ULL && "Reserved for DenseMapInfo");
				}
				explicit ObjectRef(DenseMapEmptyTag T) : ReferenceBase(T) {}
				explicit ObjectRef(DenseMapTombstoneTag T) : ReferenceBase(T) {}
				explicit ObjectRef(ReferenceBase) = delete;
				};

				/// Handle to a loaded object in a \a ObjectStore instance.
				///
				/// ObjectHandle encapsulates a loaded object in the CAS. You need one
				/// of these to inspect the content of an object: to look at its stored
				/// data and references.
				class ObjectHandle : public ReferenceBase {
				public:
				friend bool operator==(const ObjectHandle &LHS, const ObjectHandle &RHS) {
				return LHS.hasSameInternalRef(RHS);
				dblaikie (Not Done): Since `Handle` and `Ref` aren't especially self documenting (while reading the docs for `Ref` I wondered if it should be called `Handle` - until I got to the bit where it explained that that exists too/separately) - is there something that'd be more explicit about the difference between these? (like, I wonder if `ObjectHandle` could be `Object`? - not sure that'd make `ObjectRef` more obvious, though... )
				steven_wu (Author, Done): `ObjectRef` is just a reference, and you can't access anything in the object it points to, until you load the object using ObjectRef, which will turn into an `ObjectHandle`. The load can have latency or can fail. Maybe the comments in ObjectStore.h are a better explanation with a bigger picture? Or should we put all the documentation into one place so it is easier to read?
				dblaikie (Not Done): Still feels like the names could be more self-descriptive, even with/independent of documentation improvements. `Ref` and `Handle` both tend to connote a thing you can use to inspect the thing being referred to or handled. (if anything, I guess I'd actually expect a Handle to be the inaccessible version and the Ref to be the accessible version (thinking StringRef, ArrayRef, etc - where the data is directly accessible through that abstraction, whereas handles sometimes you have to pass back into some other object/API to then get the data (like a file handle/file descriptor)))

				Seems like ObjectRef is maybe more suitably called ObjectID?
				steven_wu (Author, Done): I don't have a strong opinion on names. Maybe @dexonsmith @benlangmuir @akyrtzi can also provide some feedback on names?

				In my opinion, `ObjectRef` is named that way because it is like a reference type (just an opaque pointer) and you need to "dereference" it to access underlying data, just that the `dereference` might fail in CAS. The difference between `ObjectRef` and `ObjectHandle` is mostly providing the flexibility for a remote CAS so there is no need to transfer all the data if you are only dealing with refs. Also `ObjectRef` is the public type we encourage people to use, just like other `Ref` types like `StringRef` and `ArrayRef`. It is cheap and allows quick fetch of the underlying data (at least for the builtin CAS we provided). It doesn't really contain an ID, and you can't compare `Ref`s from different ObjectStores.

				I thought about renaming `CASID` to `ObjectID` to keep the names consistent, but there is no other benefit.

				I am up for renaming types to make them more self-descriptive but I would like to reach an agreement before doing so because we have lots of downstream code that needs to be upstreamed and I would like to avoid repeated renames.
				dexonsmith (Not Done): In my mind, the name `ObjectID` was "taken", since I originally thought we'd do a `s/CASID/ObjectID/g` at some point (maybe this patch / before upstreaming is the right time?). Maybe it would be even better to do `s/CASID/std::string/`.

				(Historical note: an earlier design had neither ObjectRef nor ObjectHandle, just CASID. CASIDs were tied to a specific CAS instance, had odd lifetimes, and tried to serve all three purposes. We figured out that this was awkward, especially for distributed/remote CAS implementations.)

				`ObjectRef` is a reference to an object (possibly an ID, possibly some internal CAS pointer). As @steven_wu points out, having an `ObjectRef` doesn't necessarily mean you know anything about the content of an object. `ObjectHandle` promises that you can access the object.

				I think to explain the difference, it helps to have a specific example. If you have a few objects:

				    object-id: 0
				    data: "some small data"

				    object-id: 1
				    data: [4GB of data]

				    object-id: 2
				    data: "a header"
				    refs: 0, 1

				If you have an `ObjectHandle` for object 2, then you can see that the data is "a header" and you can get the `ObjectRef`s for the objects it references: 0 and 1. Those `ObjectRef`s allow you to talk about objects 0 and 1 without necessarily loading / downloading them, which is important because object 1 is 4GB big.

				- You can create a new object that references them, and you can get a serialized reference (CASID/ObjectID/std::string) for referring to it in other contexts.
				- Or, you can "load" an `ObjectRef`, get an `ObjectHandle` for that object, and look at its content/refs. If you have a distributed CAS, or the object is big, or [etc.], this action might have significant latency. I imagine we'd eventually want to add an API to allow asynchronous loading.

				It's a good point that "ref" here bears no relation to the "ref" in `ArrayRef`. That makes it confusing. It is identifier-like, even if it doesn't know how to serialize itself into a string (you need the CAS instance to string-ify it for you).

				At a high level, we have three concepts:

				- a serialized, context-free identifier (currently CASID)
				- an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)
				- a context-sensitive reference to an object known to "be local/loaded" (currently ObjectHandle)

				Maybe the following rename could work:

				- Delete CASID and replace its usage with `std::string`
				- Rename ObjectRef to ObjectID
				- Maybe: Rename ObjectHandle to ObjectRef

				I don't have a strong opinion about whether that results in a better world. The "current" names are kind of etched into my brain so it's hard to evaluate.

				@steven_wu and @benlangmuir, WDYT? Is there something I'm missing/forgetting that makes this a bad idea?

				@dblaikie, WDYT? Maybe you have other name ideas after the above explanation?
				benlangmuir (Not Done): I agree with Steven; I think `ObjectRef` deserves the good name here, since that's the one that clients of an `ObjectStore` will work with. `ObjectHandle` is only used by implementors of an `ObjectStore`. I agree that having both `Ref` and `Handle` is confusing - maybe we should rename Handle to something more verbose and descriptive like `LoadedObjectRef`.

				> Seems like ObjectRef is maybe more suitably called ObjectID?

				I think this is confusing with `CASID`, which really is an ID. I don't have a strong opinion on whether `CASID` should be renamed to `ObjectID` or not for consistency.
				dexonsmith (Not Done): @benlangmuir, just seeing your reply. I'd have expected clients to work with both `ObjectRef` and `ObjectHandle`; but you have much more experience than I do. (I guess the `ObjectHandle` is only indirectly useful to clients. For any serious work, you want to create a proxy that wraps an `ObjectHandle` and navigates the content in a structured way.)

				Agreed that `ObjectRef` ought to have a good / easy-to-type name.
				steven_wu (Author, Done): @dexonsmith Correct, we cleaned up the interface so that no public methods from `ObjectStore` return `ObjectHandle` anymore, and users access data in an object through ObjectProxy. That is one less data type to deal with for CAS users, and ObjectProxy has slightly better interfaces to type as well. `ObjectHandle` is now a type for CAS implementors only, representing a loaded object.
				dexonsmith (Not Done): Ah, that sounds great (I apologize for not reading the patch itself). Agreed with @benlangmuir that LoadedObjectHandle seems pretty reasonable if clients never have to use it.

				Does my understanding of the roles still stand? (Maybe that example could/should be used somewhere in the docs to clarify.)

				Also:

				- Do we still need CASID? Why not just use `std::string`?
				- If we don't need CASID and can delete it, WDYT of renaming ObjectRef to ObjectID? (It could be thought of as an "opaque" identifier.)

				Another idea: rename `ObjectRef` to `ObjectPtr`?
				akyrtzi (Not Done):
				> I'd have expected clients to work with both `ObjectRef` and `ObjectHandle`

				I don't think `ObjectHandle` has any value for clients, clients will be using `ObjectProxy`.

				> an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)

				Something to highlight about this: For a "remote CAS" implementation you practically don't know that "an object is known to exist" until you try to load it. From that respect there is essentially no functional difference between `CASID` and `ObjectRef` in such a context.
				dexonsmith (Not Done):
				> Something to highlight about this: For a "remote CAS" implementation you practically don't know that "an object is known to exist" until you try to load it. From that respect there is essentially no functional difference between CASID and ObjectRef in such a context.

				Ah, is that because a remote CAS may have garbage-collected the object? So, then, `ObjectRef` is just an object we know how to point at? (Or do you just mean that loading could fail because the remote CAS goes down?)
				benlangmuir (Not Done):
				> Ah, is that because a remote CAS may have garbage-collected the object? So, then, ObjectRef is just an object we know how to point at? (Or do you just mean that loading could fail because the remote CAS goes down?)

				We do not check whether an object exists at all when creating the ObjectRef for a remote CAS: it's a wasted round trip over the network since you can't guarantee the object still exists later, and don't get a performance win for knowing it existed in the past. This is different with the local in-memory and on-disk CASes, because they can check for existence cheaply and provide a faster ref implementation once we know the object exists.

				> a serialized, context-free identifier (currently CASID)

				There are two levels here:

				- a serialized, context-free identifier (std::string, e.g. "llvmcas://<serialized hash>")
				- a context-sensitive identifier hash containing raw hash bytes (CASID) -- you need a CASContext to use it

				> an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)

				Not known to exist anywhere, but otherwise yes.

				> a context-sensitive reference to an object known to "be local/loaded" (currently ObjectHandle)

				Correct.

				> Also: Do we still need CASID? Why not just use `std::string`?

				So the question here is whether CASID is pulling its weight as a client-visible type. It should be more efficient than the serialized string, since you have the hash bytes immediately available rather than parsing them out; it can be smaller for the same reason. I don't know if we have measured this -- @steven_wu, @akyrtzi any thoughts?

				> If we don't need CASID and can delete it, WDYT of renaming ObjectRef to ObjectID? (It could be thought of as an "opaque" identifier.)

				If we can drop CASID, I'm fine with this.

				> Another idea: rename `ObjectRef` to `ObjectPtr`?

				I find Ref clearer than Ptr for what it does.
				akyrtzi (Not Done): I like the idea of dropping `CASID` entirely. I'm also fine with renaming `ObjectRef` to `ObjectID`.

				I would also suggest that we change

				    virtual Expected<ObjectHandle> load(ObjectRef Ref)

				to

				    virtual Expected<Optional<ObjectHandle>> load(ObjectRef Ref)

				The optional indicates whether the object existed or not, and we reserve errors for catastrophic failures (e.g. "network went down"). The client could choose to turn the "object doesn't exist" `None` into an error (or not, it depends on the context) but this case should be distinguishable from `ObjectStore`'s API.
				steven_wu (Author): I tried to get rid of `CASID`, but it has some desirable features, so I decided to keep it for now. For example, it distinguishes itself from a printable string. For builtinCAS, you have a printable string that is something like `llvmcas://abcdefg...`, while CASID is the raw hash value (0xabcdefg...) plus the context ("llvm.builtin.cas.v1[BLAKE3]").
				If we get rid of CASID, you will either have a std::string representation that is `llvmcas://`, in which case you need to parse it every time you want access to the raw hash, or you decide to have a `std::string` store the raw hash value, in which case:
				1. You can't store a context anywhere.
				2. It is confusing whether a function takes a printable string or a raw hash value, since both of them are std::string.
				dexonsmith: What is "every time"? I.e., why do you need to get the hash from the CASID? Could/should some/many places that currently use CASID use an ObjectRef instead? IOW, maybe some current uses of CASID should switch to ObjectRef or ObjectProxy, and the rest can then be converted to std::string. But I'm not sure. Only asking the question to ensure that you've considered it (vs. momentum of code written with CASID before ObjectRef existed).
				akyrtzi:
				> then you need to parse it every time you want access to raw hash
				Why do I need to parse the string to get the raw hash? I should be able to get the raw hash from the `ObjectRef`/`ObjectID`, right? Once I parse a "llvmcas://" string into an `ObjectRef`/`ObjectID`, there should be no need to keep the string around anymore.
				dexonsmith: Paging in old memories, I think where CASID remains useful is if you're talking about the same object in two different instances of a CAS (maybe one instance is in-memory, and the other is on-disk or remote), and both have the same CASContext. IIRC, you share a CASContext iff the std::strings mean the same thing. E.g., should be true for any CAS that has a std::string ID starting with `llvmcas://`, as long as both instances use the same hash function (currently always BLAKE3, right?). (Unless this changed...)
				If a client is doing a lot of this, having an optimized representation of the hash/identifier could be useful; it avoids repeatedly serializing/parsing/validating std::strings just to communicate a hash from one CAS instance to another. (In this scenario, it's also maybe nice to avoid a malloc/dealloc, but the hash will always overflow std::string's small storage... that was the motivation to avoid std::string originally.)
				I imagine we'll want to do that sort of thing at some point, but not sure how urgent it is (maybe it's already needed?), or whether the optimization is really necessary (maybe the overhead is hidden by other things that are orders of magnitude more expensive?)...
				Outside of that multi-CAS-instance-same-CASContext scenario, I'm not sure I see a reason to avoid using ObjectRef (or std::string).
				Another thought: if we don't drop CASID, it could be renamed to ObjectHash.
				- ObjectHash: the raw hash of the object, shareable within a CASContext (né CASID)
				- ObjectID: an opaque identifier for a specific CAS instance (né ObjectRef)
				- LoadedObjectHandle: implementation detail, points at "loaded" object in a CAS instance (né ObjectHandle)
				- ObjectProxy: LoadedObjectHandle+ObjectStore+nice APIs.
				akyrtzi:
				> it avoids repeatedly serializing/parsing/validating std::strings just to communicate a hash from one CAS instance to another.
				Hmm, it's still unclear to me why you need to traffic "llvmcas://" strings for that; the `ObjectRef/ID` that you got from one instance should be able to provide the hash bytes for the other. You don't need to re-validate the hash because the instances are context-compatible.
				dexonsmith: IIRC, CASID is the only container for the raw hash recognized by ObjectStore. If you take it away, and you want to traffic in raw hashes, you need to invent a new container for them (or just let people use ArrayRef). If you think it's useful to traffic in raw hashes, then I suggest keeping CASID/ObjectHash, since it has some error checking and dumping built-in.
				akyrtzi:
				> or just let people use ArrayRef
				I'm personally fine with making it so that the way you get an `ObjectID` is either via parsing a string identifier or via passing the `ArrayRef` hash bytes (for example, say I stored the hash as raw bytes in a file and now I want to get an `ObjectID` out of these bytes). Like a string identifier from a command-line option, the raw hash bytes will be a transient input that I will use to get an `ObjectID` in order to proceed with the rest of the work; I shouldn't need to be "trafficking" both hash bytes and ObjectIDs at the same time.
				In general, as a client I'd like to have only 3 concepts that I need to be concerned about:
				- A string identifier for a CAS object that can be parsed and printed (e.g. the "llvmcas://" strings)
				- A type I can get as a valid ID after parsing a string identifier or after passing the hash bytes for the object (e.g. `ObjectID`)
				- A type that contains the loaded data for the object (e.g. `ObjectProxy`)
				An `ObjectHash` type could be useful for the implementations, but I find it an unnecessary concept for the users of the `ObjectStore` API.
				dexonsmith: Sure; encapsulation might be overkill. I don't feel strongly!
				}
				friend bool operator!=(const ObjectHandle &LHS, const ObjectHandle &RHS) {
				return !(LHS == RHS);
				}

				/// Print internal ref and/or CASID. Only suitable for debugging.
				void print(raw_ostream &OS) const { return ReferenceBase::print(OS, *this); }

				LLVM_DUMP_METHOD void dump() const;

				private:
				friend class ObjectStore;
				friend class ReferenceBase;
				using ReferenceBase::ReferenceBase;
				explicit ObjectHandle(ReferenceBase) = delete;
				ObjectHandle(const ObjectStore &CAS, uint64_t InternalRef)
				: ReferenceBase(&CAS, InternalRef, /*IsHandle=*/true) {}
				};

				} // namespace cas

				template <> struct DenseMapInfo<cas::ObjectRef> {
				static cas::ObjectRef getEmptyKey() {
				return cas::ObjectRef::getDenseMapEmptyKey();
				}

				static cas::ObjectRef getTombstoneKey() {
				return cas::ObjectRef::getDenseMapTombstoneKey();
				}

				static unsigned getHashValue(cas::ObjectRef Ref) {
				return Ref.getDenseMapHash();
				}

				static bool isEqual(cas::ObjectRef LHS, cas::ObjectRef RHS) {
				return LHS == RHS;
				}
				};

				} // namespace llvm

				#endif // LLVM_CAS_CASREFERENCE_H
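
The inline thread in this file proposes that loading distinguish "the object does not exist" from catastrophic failures such as the network going down. The following is a minimal, dependency-free sketch of that three-way outcome, not the patch's API: `ToyRemoteStore`, `RemoteRef`, `LoadedHandle`, `LoadResult` and `classify` are hypothetical names, and a plain struct of two `std::optional` values stands in for `Expected<Optional<ObjectHandle>>`.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins; none of these types come from the patch.
struct RemoteRef { std::uint64_t Key; };            // opaque ref, may point at nothing
struct LoadedHandle { const std::string *Data; };   // loaded object, data is available

// Stand-in for Expected<Optional<ObjectHandle>>:
//   Error set              -> catastrophic failure
//   no Error, no Handle    -> object does not exist
//   no Error, Handle set   -> object loaded
struct LoadResult {
  std::optional<std::string> Error;
  std::optional<LoadedHandle> Handle;
};

enum class LoadState { Loaded, Missing, Failed };

class ToyRemoteStore {
public:
  void put(std::uint64_t Key, std::string Data) { Objects[Key] = std::move(Data); }
  void setOffline(bool Value) { Offline = Value; }

  // Creating a ref never checks for existence, mirroring the remote-CAS case
  // described in the thread (no round trip just to validate a ref).
  RemoteRef makeRef(std::uint64_t Key) const { return RemoteRef{Key}; }

  LoadResult load(RemoteRef Ref) const {
    if (Offline)
      return LoadResult{std::string("store unreachable"), std::nullopt};
    auto It = Objects.find(Ref.Key);
    if (It == Objects.end())
      return LoadResult{std::nullopt, std::nullopt}; // missing, but not an error
    return LoadResult{std::nullopt, LoadedHandle{&It->second}};
  }

private:
  std::unordered_map<std::uint64_t, std::string> Objects;
  bool Offline = false;
};

// Callers decide whether "Missing" should be promoted to an error.
inline LoadState classify(const LoadResult &R) {
  if (R.Error)
    return LoadState::Failed;
  return R.Handle ? LoadState::Loaded : LoadState::Missing;
}
```

With this shape, a client that treats a cache miss as normal can branch on `Missing` without inspecting an error message, while a client that requires the object can convert `Missing` into its own error.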

llvm/include/llvm/CAS/ObjectStore.h

This file was added.

				//===- llvm/CAS/ObjectStore.h -----------------------------------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_CAS_OBJECTSTORE_H
				#define LLVM_CAS_OBJECTSTORE_H

				#include "llvm/ADT/StringRef.h"
				#include "llvm/ADT/StringExtras.h"
				#include "llvm/CAS/CASID.h"
				#include "llvm/CAS/CASReference.h"
				#include "llvm/Support/Error.h"
				#include "llvm/Support/FileSystem.h"
				#include <cstddef>
				#include <optional>

				namespace llvm {

				class MemoryBuffer;

				namespace cas {

				class ObjectStore;
				class ObjectProxy;

				/// Content-addressable storage for objects.
				///
				/// Conceptually, objects are stored in a "unique set".
				///
				/// - Objects are immutable ("value objects") that are defined by their
				/// content. They are implicitly deduplicated by content.
				/// - Each object has a unique identifier (UID) that's derived from its content,
				/// called a \a CASID.
				/// - This UID is a fixed-size (strong) hash of the transitive content of a
				/// CAS object.
				/// - It's comparable between any two CAS instances that have the same \a
				/// CASIDContext::getHashSchemaIdentifier().
				/// - The UID can be printed (e.g., \a CASID::toString()) and it can be parsed
				/// by the same or a different CAS instance with \a
				/// ObjectStore::parseID().
				/// - An object can be looked up by content or by UID.
				/// - \a store() is a "get-or-create" method, writing an object if it
				/// doesn't exist yet, and returning a ref to it in any case.
				/// - \a getProxy(const CASID&) looks up an object by its UID.
				/// - Objects can reference other objects, forming an arbitrary DAG.
				///
				dblaikie: do they "reference" other objects - or does an object consist of other objects/include those objects in some sense? (maybe there's no real difference, and this current language is better... not sure)
				steven_wu (Author): It is a real `reference`, like a pointer reference. But this is up to the CAS implementation: it is entirely possible to implement a CAS that directly embeds the "referenced" object without breaking the contract of the API, but there is really no good reason to do that.
				/// The \a ObjectStore interface has a few ways of referencing objects:
				///
				/// - \a ObjectRef encapsulates a reference to something in the CAS. It is an
				/// opaque type that references an object inside a specific CAS. It is
				/// implementation defined if the underlying object exists or not for an
				/// ObjectRef, and it can be used to speed up CAS lookup as an implementation
				/// detail. However, you don't know anything about the underlying objects.
				/// "Loading" the object is a separate step that may not have happened
				dblaikie: Not clear from this what the difference between an `ObjectRef` and a `CASID` is - is that worth more details?
				steven_wu (Author): I refined the docs a bit. Let me know if it is clearer to read.
				/// yet, and which can fail (e.g. due to filesystem corruption) or introduce
				/// latency (if downloading from a remote store).
				dblaikie: Presumably it can also fail if the object isn't in the given CAS? (maybe that's the more obvious/simpler to document example?)
				steven_wu (Author): This question is complicated. Maybe we should write down a spec for the CAS APIs? To this question, the current spec for when to look up a CAS object to make sure it exists is totally implementation defined. For builtinCAS here, ObjectRef always points to an existing object, unless the integrity of the builtinCAS is broken. For a remote CAS, you might actually want ObjectRef to be unverified, to avoid a round trip to the remote just to validate an ObjectRef.
				Also, to your previous point about splitting up the public/private interface documentation: is it better if I create a separate doc in `llvm/docs` to explain the different concepts from the views of:
				1. CAS users
				2. CAS implementors?
				benlangmuir:
				> This question is complicated. Maybe we should write down a spec for the CAS APIs? To this question, the current spec for when to lookup for a CAS object to make sure it exists is totally implementation defined
				This complication is only for the implementation though, right? Clients of the object store shouldn't assume the object exists in the CAS, and like @dblaikie said, loading can fail if it does not. I don't think we want to promise that builtin CAS will always succeed when loading, since we might want to change that kind of detail later.
				steven_wu (Author): True. I will update the wording.
				/// - \a ObjectHandle encapsulates a loaded object in the CAS. You need one of
				/// these to inspect the content of an object: to look at its stored
				/// data and references. This is internal to the CAS implementation and not
				/// available from the CAS public APIs.
				/// - \a CASID: the UID for an object in the CAS, obtained through \a
				dblaikie: Ah. May be worth splitting up the documentation here more clearly between public and private concepts, rather than having this internal detail interleaved with external ones?
				dblaikie: ping on this
				/// ObjectStore::getID() or \a ObjectStore::parseID(). This is a valid CAS
				/// identifier, but may reference an object that is unknown to this CAS
				/// instance.
				/// - \a ObjectProxy pairs an ObjectHandle (subclass) with an ObjectStore, and
				/// wraps access APIs to avoid having to pass extra parameters. It is the
				/// object used for accessing underlying data and refs by CAS users.
				///
				/// Both ObjectRef and ObjectHandle are lightweight, wrapping a `uint64_t`.
				/// Doing anything with them requires an ObjectStore. As a convenience,
				/// ObjectProxy pairs an ObjectStore with ObjectHandle, and provides the
				/// interfaces to read information from CAS Objects for users. CASID contains
				/// the actual hash value of the object, so it is expensive to create and
				/// copy, and much slower to use for accessing the corresponding object. It is
				/// only useful when you need to print the hash value of an object, or exchange
				/// common objects between different ObjectStore instances with the same
				/// CASContext.
				///
				/// There are a few options for accessing content of objects, with different
				/// lifetime tradeoffs:
				///
				/// - \a getData()/getDataString() return a StringRef whose lifetime is
				/// guaranteed to last as long as the \a ObjectStore.
				/// - \a getMemoryBuffer() returns a \a MemoryBuffer whose lifetime
				/// is independent of the CAS (it can live longer).
				/// - \a readRef() and \a forEachRef() iterate through the references in an
				/// object. There is no lifetime assumption.
				///
				dblaikie: anchor's only useful if it's virtual? Maybe just make the dtor out of line and use that as an anchor?
				class ObjectStore {
				friend class ObjectProxy;
				friend class ReferenceBase;

				public:
				/// @name CASID Operations
				/// {

				/// Get a \p CASID from a \p ID, which should have been generated by \a
				/// CASID::print(). This succeeds as long as \a validateID() would pass. The
				/// object may be unknown to this CAS instance.
				virtual Expected<CASID> parseID(StringRef ID) = 0;

				/// Get an ID for \p Ref.
				virtual CASID getID(ObjectRef Ref) const = 0;

				/// Get a reference to the object called \p ID.
				///
				/// Returns \c std::nullopt if not stored in this CAS.
				virtual std::optional<ObjectRef> getReference(const CASID &ID) const = 0;
				dblaikie: what sort of validity is of concern here?
				steven_wu (Author): This is more of a debugging function that checks whether the integrity of the CAS has been broken. The current implementation for builtin CAS rehashes the object fetched via the CASID and makes sure the hash matches. If not, there is either a bug in the CAS or the data has been corrupted.
				dblaikie: Might be worth a few more words in the description maybe about "consistency" or somesuch?

				/// Validate the underlying object referred to by CASID.
				virtual Error validate(const CASID &ID) = 0;

				/// }
				/// @name Store Objects
				/// {
				/// Store object into ObjectStore.
				virtual Expected<ObjectRef> store(ArrayRef<ObjectRef> Refs,
				ArrayRef<char> Data) = 0;

				/// Store object from StringRef.
				Expected<ObjectRef> storeFromString(ArrayRef<ObjectRef> Refs,
				StringRef String) {
				return store(Refs, arrayRefFromStringRef<char>(String));
				}

				/// Helper function to store an object and return an ObjectProxy.
				Expected<ObjectProxy> createProxy(ArrayRef<ObjectRef> Refs, StringRef Data);

				/// Default implementation reads \p FD and calls \a store(). Does not
				/// take ownership of \p FD; the caller is responsible for closing it.
				///
				/// If \p Status is sent in it is to be treated as a hint. Implementations
				/// must protect against the file size potentially growing after the status
				/// was taken (i.e., they cannot assume that an mmap will be null-terminated
				/// where \p Status implies).
				///
				/// Returns an \a ObjectRef for the stored file content.
				Expected<ObjectRef>
				storeFromOpenFile(sys::fs::file_t FD,
				std::optional<sys::fs::file_status> Status = std::nullopt) {
				return storeFromOpenFileImpl(FD, Status);
				}

				/// }
				/// @name Get Objects
				/// {
				/// Create ObjectProxy from CASID. If the object doesn't exist, get an error.
				Expected<ObjectProxy> getProxy(const CASID &ID);
				/// Create ObjectProxy from ObjectRef. If the object can't be loaded, get an
				/// error.
				Expected<ObjectProxy> getProxy(ObjectRef Ref);

				/// }
				protected:
				/// @name Implementation Details for underlying CAS.
				/// {
				/// Get a Ref from Handle.
				virtual ObjectRef getReference(ObjectHandle Handle) const = 0;

				/// Load the object referenced by \p Ref.
				///
				/// Errors if the object cannot be loaded.
				virtual Expected<ObjectHandle> load(ObjectRef Ref) = 0;

				/// Get an ID for \p Handle.
				virtual CASID getID(ObjectHandle Handle) const = 0;

				/// Get the size of some data.
				virtual uint64_t getDataSize(ObjectHandle Node) const = 0;

				/// Methods for handling objects.
				virtual Error forEachRef(ObjectHandle Node,
				function_ref<Error(ObjectRef)> Callback) const = 0;
				dblaikie: this lets you walk which other objects reference this object? I guess you've probably got some pretty core use cases/need for this - but dealing with updating use lists in LLVM IR at least makes me ask: Do you need to be able to walk the uses of an object?
				steven_wu (Author): No, references in the CAS go in only one direction. This iterates through all the objects that are referenced by the current object; you don't know how many parents reference you. In general, there is not even an API for the CAS to iterate through all the objects to compute a use-list, and that is by design.
				virtual ObjectRef readRef(ObjectHandle Node, size_t I) const = 0;
				virtual size_t getNumRefs(ObjectHandle Node) const = 0;
				virtual ArrayRef<char> getData(ObjectHandle Node,
				bool RequiresNullTerminator = false) const = 0;

				/// Get ObjectRef from open file.
				virtual Expected<ObjectRef>
				storeFromOpenFileImpl(sys::fs::file_t FD,
				std::optional<sys::fs::file_status> Status) = 0;
				/// }
				/// @name Helper functions for implementing CAS.
				/// {

				/// Get a lifetime-extended StringRef pointing at \p Data.
				///
				/// Depending on the CAS implementation, this may involve in-memory storage
				/// overhead.
				StringRef getDataString(ObjectHandle Node) {
				return toStringRef(getData(Node));
				}

				/// Get a lifetime-extended MemoryBuffer pointing at \p Data.
				///
				/// Depending on the CAS implementation, this may involve in-memory storage
				/// overhead.
				std::unique_ptr<MemoryBuffer>
				getMemoryBuffer(ObjectHandle Node, StringRef Name = "",
				bool RequiresNullTerminator = true);

				/// Helper function to chain `load(ObjectRef)` into ObjectProxy.
				Expected<ObjectProxy> getProxy(Expected<ObjectHandle> Ref);

				/// Read the data from \p Data into \p OS.
				uint64_t readData(ObjectHandle Node, raw_ostream &OS, uint64_t Offset = 0,
				uint64_t MaxBytes = -1ULL) const {
				ArrayRef<char> Data = getData(Node);
				assert(Offset < Data.size() && "Expected valid offset");
				Data = Data.drop_front(Offset).take_front(MaxBytes);
				OS << toStringRef(Data);
				return Data.size();
				}

				/// Allow ObjectStore implementations to create internal handles.
				#define MAKE_CAS_HANDLE_CONSTRUCTOR(HandleKind) \
				HandleKind make##HandleKind(uint64_t InternalRef) const { \
				return HandleKind(*this, InternalRef); \
				}
				MAKE_CAS_HANDLE_CONSTRUCTOR(ObjectHandle)
				MAKE_CAS_HANDLE_CONSTRUCTOR(ObjectRef)
				#undef MAKE_CAS_HANDLE_CONSTRUCTOR

				/// Create an unknown object error.
				static Error createUnknownObjectError(const CASID &ID);

				/// }

				public:
				/// Print the ObjectStore internals for debugging purposes.
				virtual void print(raw_ostream &) const {}
				void dump() const;

				/// Get CASContext
				const CASContext &getContext() const { return Context; }

				virtual ~ObjectStore();

				protected:
				ObjectStore(const CASContext &Context) : Context(Context) {}

				private:
				const CASContext &Context;
				};

				/// Reference to an abstract hierarchical node, with data and references.
				/// Reference is passed by value and is expected to be valid as long as the \a
				/// ObjectStore is.
				///
				/// TODO: Expose \a ObjectStore::readData() and only call \a
				/// ObjectStore::getDataString() when asked.
				class ObjectProxy {
				public:
				const ObjectStore &getCAS() const { return *CAS; }
				ObjectStore &getCAS() { return *CAS; }
				CASID getID() const { return CAS->getID(H); }
				ObjectRef getRef() const { return CAS->getReference(H); }
				size_t getNumReferences() const { return CAS->getNumRefs(H); }
				ObjectRef getReference(size_t I) const { return CAS->readRef(H, I); }

				operator CASID() const { return getID(); }
				CASID getReferenceID(size_t I) const {
				std::optional<CASID> ID = getCAS().getID(getReference(I));
				assert(ID && "Expected reference to be first-class object");
				return *ID;
				}

				/// Visit each reference in order, returning an error from \p Callback to
				/// stop early.
				Error forEachReference(function_ref<Error(ObjectRef)> Callback) const {
				return CAS->forEachRef(H, Callback);
				}
				Error forEachReferenceID(function_ref<Error(CASID)> Callback) const {
				return CAS->forEachRef(H, [&](ObjectRef Ref) {
				std::optional<CASID> ID = getCAS().getID(Ref);
				assert(ID && "Expected reference to be first-class object");
				return Callback(*ID);
				});
				}

				std::unique_ptr<MemoryBuffer>
				getMemoryBuffer(StringRef Name = "",
				bool RequiresNullTerminator = true) const;

				/// Get the content of the node. Valid as long as the CAS is valid.
				StringRef getData() const { return CAS->getDataString(H); }

				friend bool operator==(const ObjectProxy &Proxy, ObjectRef Ref) {
				return Proxy.getRef() == Ref;
				}
				friend bool operator==(ObjectRef Ref, const ObjectProxy &Proxy) {
				return Proxy.getRef() == Ref;
				}
				friend bool operator!=(const ObjectProxy &Proxy, ObjectRef Ref) {
				return !(Proxy.getRef() == Ref);
				}
				friend bool operator!=(ObjectRef Ref, const ObjectProxy &Proxy) {
				return !(Proxy.getRef() == Ref);
				}

				public:
				ObjectProxy() = delete;

				static ObjectProxy load(ObjectStore &CAS, ObjectHandle Node) {
				return ObjectProxy(CAS, Node);
				}

				private:
				ObjectProxy(ObjectStore &CAS, ObjectHandle H) : CAS(&CAS), H(H) {}

				ObjectStore *CAS;
				ObjectHandle H;
				};

				Expected<std::unique_ptr<ObjectStore>>
				createPluginCAS(StringRef PluginPath,
				ArrayRef<std::string> PluginArgs = std::nullopt);
				std::unique_ptr<ObjectStore> createInMemoryCAS();

				} // namespace cas
				} // namespace llvm

				#endif // LLVM_CAS_OBJECTSTORE_H
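
The object model documented in this header (immutable objects made of data plus references, deduplicated by content, with references going one way only) can be illustrated with a small standalone sketch. This is a toy under stated assumptions, not the InMemoryCAS implementation: `ToyObjectStore`, `ToyID` and `ToyRef` are hypothetical names, and 64-bit FNV-1a stands in for BLAKE3 purely to avoid dependencies. FNV-1a is not collision resistant, so a collision here would silently alias two objects.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy types; not part of the patch.
using ToyID = std::uint64_t; // plays the role of CASID (hash of the content)
using ToyRef = std::size_t;  // plays the role of ObjectRef (opaque index)

class ToyObjectStore {
public:
  // Get-or-create: storing identical (Refs, Data) twice yields the same ref.
  ToyRef store(const std::vector<ToyRef> &Refs, const std::string &Data) {
    ToyID ID = hashObject(Refs, Data);
    auto It = Index.find(ID);
    if (It != Index.end())
      return It->second; // already present, nothing is written
    Objects.push_back(Object{ID, Refs, Data});
    Index.emplace(ID, Objects.size() - 1);
    return Objects.size() - 1;
  }

  ToyID getID(ToyRef Ref) const { return Objects[Ref].ID; }
  const std::string &getData(ToyRef Ref) const { return Objects[Ref].Data; }

  // References go one way only: from an object to the objects it points at.
  // There is deliberately no way to enumerate the parents of an object.
  template <typename Fn> void forEachRef(ToyRef Ref, Fn Callback) const {
    for (ToyRef Child : Objects[Ref].Refs)
      Callback(Child);
  }

private:
  struct Object {
    ToyID ID;
    std::vector<ToyRef> Refs;
    std::string Data;
  };

  // Feed the 8 bytes of V into an FNV-1a state, least significant byte first.
  static void mix(ToyID &H, std::uint64_t V) {
    for (int I = 0; I < 8; ++I) {
      H ^= (V >> (8 * I)) & 0xff;
      H *= 1099511628211ULL;
    }
  }

  // The ID covers the IDs of the referenced objects plus the data, so it
  // depends on the transitive content of the object.
  ToyID hashObject(const std::vector<ToyRef> &Refs,
                   const std::string &Data) const {
    ToyID H = 14695981039346656037ULL;
    mix(H, Refs.size());
    for (ToyRef R : Refs)
      mix(H, Objects[R].ID);
    for (unsigned char C : Data)
      mix(H, C);
    return H;
  }

  std::vector<Object> Objects;
  std::map<ToyID, ToyRef> Index;
};
```

Because a ref can only be obtained from an earlier `store()` call, an object can only reference objects that already exist, which is why the reference graph is always a DAG.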

llvm/lib/CAS/BuiltinCAS.h

This file was added.

				//===- BuiltinCAS.h ---------------------------------------------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIB_CAS_BUILTINCAS_H
				#define LLVM_LIB_CAS_BUILTINCAS_H

				#include "llvm/ADT/StringRef.h"
				#include "llvm/CAS/ObjectStore.h"
				#include "llvm/Support/BLAKE3.h"
				#include "llvm/Support/Error.h"
				#include <cstddef>
				#include <optional>

				namespace llvm::cas::builtin {

				/// Current hash type for the internal CAS.
				///
				/// FIXME: This should be configurable via an enum to allow configuring the hash
				/// function. The enum should be sent into \a createInMemoryCAS() and \a
				/// createOnDiskCAS().
				///
				/// This is important (at least) for future-proofing, when we want to make new
				/// CAS instances use BLAKE7, but still know how to read/write BLAKE3.
				///
				/// Even just for BLAKE3, it would be useful to have these values:
				///
				/// BLAKE3 => 32B hash from BLAKE3
				/// BLAKE3_16B => 16B hash from BLAKE3 (truncated)
				///
				/// ... where BLAKE3_16B uses \a TruncatedBLAKE3<16>.
				///
				/// Motivation for a truncated hash is that it's cheaper to store. It's not
				/// clear if we always (or ever) need the full 32B, and for an ephemeral
				/// in-memory CAS, we almost certainly don't need it.
				///
				/// Note that the cost is linear in the number of objects for the builtin CAS
				/// and embedded action cache, since we're using internal offsets and/or
				/// pointers as an optimization.
				///
				/// However, it's possible we'll want to hook up a local builtin CAS to, e.g.,
				/// a distributed generic hash map to use as an ActionCache. In that scenario,
				/// the transitive closure of the structured objects that are the results of
				/// the cached actions would need to be serialized into the map, something
				/// like:
				///
				/// "action:<schema>:<key>" -> "0123"
				/// "object:<schema>:0123" -> "3,4567,89AB,CDEF,9,some data"
				/// "object:<schema>:4567" -> ...
				/// "object:<schema>:89AB" -> ...
				/// "object:<schema>:CDEF" -> ...
				///
				/// These references would be full cost.
				using HasherT = BLAKE3;
				using HashType = decltype(HasherT::hash(std::declval<ArrayRef<uint8_t> &>()));

				class BuiltinCASContext : public CASContext {
				void printIDImpl(raw_ostream &OS, const CASID &ID) const final;

				public:
				/// Get the name of the hash for any table identifiers.
				///
				/// FIXME: This should be configurable via an enum, with at least the following
				/// values:
				///
				/// "BLAKE3" => 32B hash from BLAKE3
				/// "BLAKE3.16" => 16B hash from BLAKE3 (truncated)
				///
				/// Enum can be sent into \a createInMemoryCAS() and \a createOnDiskCAS().
				static StringRef getHashName() { return "BLAKE3"; }
				StringRef getHashSchemaIdentifier() const final {
				static const std::string ID =
				("llvm.cas.builtin.v2[" + getHashName() + "]").str();
				return ID;
				}

				static const BuiltinCASContext &getDefaultContext();

				BuiltinCASContext() = default;
				};

				class BuiltinCAS : public ObjectStore {
				public:
				BuiltinCAS() : ObjectStore(BuiltinCASContext::getDefaultContext()) {}

				Expected<CASID> parseID(StringRef Reference) final;

				virtual Expected<CASID> parseIDImpl(ArrayRef<uint8_t> Hash) = 0;

				Expected<ObjectRef> store(ArrayRef<ObjectRef> Refs,
				ArrayRef<char> Data) final;
				virtual Expected<ObjectRef> storeImpl(ArrayRef<uint8_t> ComputedHash,
				ArrayRef<ObjectRef> Refs,
				ArrayRef<char> Data) = 0;

				Expected<ObjectRef>
				storeFromOpenFileImpl(sys::fs::file_t FD,
				std::optional<sys::fs::file_status> Status) override;
				virtual Expected<ObjectRef>
				storeFromNullTerminatedRegion(ArrayRef<uint8_t> ComputedHash,
				sys::fs::mapped_file_region Map) {
				return storeImpl(ComputedHash, std::nullopt,
				ArrayRef(Map.data(), Map.size()));
				}

				/// Both builtin CAS implementations provide lifetime for free, so this can
				/// be const, and readData() and getDataSize() can be implemented on top of
				/// it.
				virtual ArrayRef<char> getDataConst(ObjectHandle Node) const = 0;

				ArrayRef<char> getData(ObjectHandle Node,
				bool RequiresNullTerminator) const final {
				// BuiltinCAS Objects are always null terminated.
				return getDataConst(Node);
				}
				uint64_t getDataSize(ObjectHandle Node) const final {
				return getDataConst(Node).size();
				}

				Error createUnknownObjectError(const CASID &ID) const {
				return createStringError(std::make_error_code(std::errc::invalid_argument),
				"unknown object '" + ID.toString() + "'");
				}

				Error createCorruptObjectError(const CASID &ID) const {
				return createStringError(std::make_error_code(std::errc::invalid_argument),
				"corrupt object '" + ID.toString() + "'");
				}

				Error createCorruptStorageError() const {
				return createStringError(std::make_error_code(std::errc::invalid_argument),
				"corrupt storage");
				}

				Error validate(const CASID &ID) final;
				};

				// FIXME: Proxy not portable. Maybe also error-prone?
				constexpr StringLiteral DefaultDirProxy = "/^llvm::cas::builtin::default";
				constexpr StringLiteral DefaultDir = "llvm.cas.builtin.default";

				} // namespace llvm::cas::builtin

				#endif // LLVM_LIB_CAS_BUILTINCAS_H
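
The printable form of a builtin CAS ID discussed in this review is the `llvmcas://` prefix followed by the hex-encoded hash. The round trip between that string and the raw hash bytes can be sketched without LLVM as follows. This is an illustration only: `ToyHash`, `printToyID` and `parseToyID` are hypothetical names, the 32-byte size is assumed from the "32B hash from BLAKE3" note above, and `std::optional` replaces the `Expected<CASID>` error reporting used by `parseID()`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical 32-byte hash, the size of a full BLAKE3 digest.
constexpr std::size_t ToyHashSize = 32;
using ToyHash = std::array<std::uint8_t, ToyHashSize>;

// Print as "llvmcas://" followed by lowercase hex.
inline std::string printToyID(const ToyHash &Hash) {
  static const char Digits[] = "0123456789abcdef";
  std::string Out = "llvmcas://";
  for (std::uint8_t Byte : Hash) {
    Out.push_back(Digits[Byte >> 4]);
    Out.push_back(Digits[Byte & 0xf]);
  }
  return Out;
}

// Value of one hex digit, or -1 if C is not a hex digit.
inline int hexValue(char C) {
  if (C >= '0' && C <= '9')
    return C - '0';
  if (C >= 'a' && C <= 'f')
    return C - 'a' + 10;
  if (C >= 'A' && C <= 'F')
    return C - 'A' + 10;
  return -1;
}

// Returns std::nullopt for a missing prefix, a wrong length, or a non-hex
// digit: the same three rejections made by the parseID() in this patch.
inline std::optional<ToyHash> parseToyID(const std::string &Text) {
  const std::string Prefix = "llvmcas://";
  if (Text.compare(0, Prefix.size(), Prefix) != 0)
    return std::nullopt;
  if (Text.size() - Prefix.size() != 2 * ToyHashSize)
    return std::nullopt;
  ToyHash Hash{};
  for (std::size_t I = 0; I < ToyHashSize; ++I) {
    int Hi = hexValue(Text[Prefix.size() + 2 * I]);
    int Lo = hexValue(Text[Prefix.size() + 2 * I + 1]);
    if (Hi < 0 || Lo < 0)
      return std::nullopt;
    Hash[I] = static_cast<std::uint8_t>(Hi * 16 + Lo);
  }
  return Hash;
}
```

Parsing only validates the shape of the identifier. As the `parseID()` documentation notes, a successfully parsed ID may still name an object that is unknown to a given CAS instance.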

llvm/lib/CAS/BuiltinCAS.cpp

This file was added.

				//===- BuiltinCAS.cpp -------------------------------------------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "BuiltinCAS.h"
				#include "BuiltinObjectHasher.h"
				#include "llvm/ADT/StringExtras.h"
				#include "llvm/Support/Alignment.h"
				#include "llvm/Support/MemoryBuffer.h"
				#include "llvm/Support/Process.h"

				using namespace llvm;
				using namespace llvm::cas;
				using namespace llvm::cas::builtin;

				static StringRef getCASIDPrefix() { return "llvmcas://"; }

				Expected<CASID> BuiltinCAS::parseID(StringRef Reference) {
				if (!Reference.consume_front(getCASIDPrefix()))
				return createStringError(std::make_error_code(std::errc::invalid_argument),
				"invalid cas-id '" + Reference + "'");

				// FIXME: Allow shortened references?
				if (Reference.size() != 2 * sizeof(HashType))
				return createStringError(std::make_error_code(std::errc::invalid_argument),
				"wrong size for cas-id hash '" + Reference + "'");

				std::string Binary;
				if (!tryGetFromHex(Reference, Binary))
				return createStringError(std::make_error_code(std::errc::invalid_argument),
				"invalid hash in cas-id '" + Reference + "'");

				return parseIDImpl(arrayRefFromStringRef(Binary));
				}

				void BuiltinCASContext::printIDImpl(raw_ostream &OS, const CASID &ID) const {
				SmallString<64> Hash;
				toHex(ID.getHash(), /*LowerCase=*/true, Hash);
				OS << getCASIDPrefix() << Hash;
				}

				const BuiltinCASContext &BuiltinCASContext::getDefaultContext() {
				static BuiltinCASContext DefaultContext;
				return DefaultContext;
				}

				static size_t getPageSize() {
				static int PageSize = sys::Process::getPageSizeEstimate();
				return PageSize;
				}

				Expected<ObjectRef>
				BuiltinCAS::storeFromOpenFileImpl(sys::fs::file_t FD,
				std::optional<sys::fs::file_status> Status) {
				int PageSize = getPageSize();

				if (!Status) {
				Status.emplace();
				if (std::error_code EC = sys::fs::status(FD, *Status))
				return errorCodeToError(EC);
				}

				constexpr size_t MinMappedSize = 4 * 4096;
				auto readWithStream = [&]() -> Expected<ObjectRef> {
				// FIXME: MSVC: SmallString<MinMappedSize * 2>
				SmallString<4 * 4096 * 2> Data;
				if (Error E = sys::fs::readNativeFileToEOF(FD, Data, MinMappedSize))
				return std::move(E);
				return store(std::nullopt, ArrayRef(Data.data(), Data.size()));
				};

				// Check whether we can trust the size from stat.
				if (Status->type() != sys::fs::file_type::regular_file &&
				Status->type() != sys::fs::file_type::block_file)
				return readWithStream();

				if (Status->getSize() < MinMappedSize)
				return readWithStream();

				std::error_code EC;
				sys::fs::mapped_file_region Map(FD, sys::fs::mapped_file_region::readonly,
				Status->getSize(),
				/*offset=*/0, EC);
				if (EC)
				return errorCodeToError(EC);

				// If the file is guaranteed to be null-terminated, use it directly. Note
				// that the file size may have changed from ::stat if this file is volatile,
				// so we need to check for an actual null character at the end.
				ArrayRef<char> Data(Map.data(), Map.size());
				HashType ComputedHash =
				BuiltinObjectHasher<HasherT>::hashObject(*this, std::nullopt, Data);
				if (!isAligned(Align(PageSize), Data.size()) && Data.end()[0] == 0)
				return storeFromNullTerminatedRegion(ComputedHash, std::move(Map));
				return storeImpl(ComputedHash, std::nullopt, Data);
				}

				Expected<ObjectRef> BuiltinCAS::store(ArrayRef<ObjectRef> Refs,
				ArrayRef<char> Data) {
				return storeImpl(BuiltinObjectHasher<HasherT>::hashObject(*this, Refs, Data),
				Refs, Data);
				}

				Error BuiltinCAS::validate(const CASID &ID) {
				auto Ref = getReference(ID);
				if (!Ref)
				return createUnknownObjectError(ID);

				auto Handle = load(*Ref);
				if (!Handle)
				return Handle.takeError();

				auto Proxy = ObjectProxy::load(*this, *Handle);
				SmallVector<ObjectRef> Refs;
				if (auto E = Proxy.forEachReference([&](ObjectRef Ref) -> Error {
				Refs.push_back(Ref);
				return Error::success();
				}))
				return E;

				ArrayRef<char> Data(Proxy.getData().data(), Proxy.getData().size());
				auto Hash = BuiltinObjectHasher<HasherT>::hashObject(*this, Refs, Data);
				if (!ID.getHash().equals(Hash))
				return createCorruptObjectError(ID);

				return Error::success();
				}
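
The textual ID form handled by `parseID` and `printIDImpl` above (a fixed `llvmcas://` prefix followed by the hash as lowercase hex) can be sketched in standard C++ as follows. This is a minimal stand-alone sketch, not the patch's code: `ToyHashSize`, `printID`, `hexValue`, and this free-standing `parseID` are hypothetical names, and the 4-byte hash width is an assumption chosen to keep the example short, since the real `HashType` is defined elsewhere in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for sizeof(HashType); the real width is not assumed.
constexpr std::size_t ToyHashSize = 4;
constexpr std::string_view Prefix = "llvmcas://";

// Print: prefix followed by two lowercase hex digits per hash byte.
std::string printID(const std::vector<std::uint8_t> &Hash) {
  assert(Hash.size() == ToyHashSize && "unexpected hash width");
  static const char Digits[] = "0123456789abcdef";
  std::string Out(Prefix);
  for (std::uint8_t B : Hash) {
    Out.push_back(Digits[B >> 4]);
    Out.push_back(Digits[B & 0xF]);
  }
  return Out;
}

// Returns the value of a hex digit, or -1 if C is not a hex digit.
static int hexValue(char C) {
  if (C >= '0' && C <= '9') return C - '0';
  if (C >= 'a' && C <= 'f') return C - 'a' + 10;
  if (C >= 'A' && C <= 'F') return C - 'A' + 10;
  return -1;
}

// Parse: reject a missing prefix, a wrong length, or a non-hex digit,
// mirroring the three error paths in the listing above.
std::optional<std::vector<std::uint8_t>> parseID(std::string_view Ref) {
  if (Ref.substr(0, Prefix.size()) != Prefix)
    return std::nullopt;
  Ref.remove_prefix(Prefix.size());
  if (Ref.size() != 2 * ToyHashSize)
    return std::nullopt;
  std::vector<std::uint8_t> Hash;
  for (std::size_t I = 0; I < Ref.size(); I += 2) {
    int Hi = hexValue(Ref[I]), Lo = hexValue(Ref[I + 1]);
    if (Hi < 0 || Lo < 0)
      return std::nullopt;
    Hash.push_back(static_cast<std::uint8_t>((Hi << 4) | Lo));
  }
  return Hash;
}
```

Printing and then parsing an ID in this sketch yields the original bytes, which is the round trip the `PrintIDs` unit test checks against the real implementation.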

llvm/lib/CAS/BuiltinObjectHasher.h

This file was added.

				//===- BuiltinObjectHasher.h ------------------------------------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_CAS_BUILTINOBJECTHASHER_H
				#define LLVM_CAS_BUILTINOBJECTHASHER_H

				#include "llvm/ADT/StringRef.h"
				#include "llvm/CAS/ObjectStore.h"
				#include "llvm/Support/Endian.h"

				namespace llvm {
				namespace cas {

				template <class HasherT> class BuiltinObjectHasher {
				public:
				using HashT = decltype(HasherT::hash(std::declval<ArrayRef<uint8_t> &>()));

				static HashT hashObject(const ObjectStore &CAS, ArrayRef<ObjectRef> Refs,
				ArrayRef<char> Data) {
				BuiltinObjectHasher H;
				H.updateSize(Refs.size());
				for (const ObjectRef &Ref : Refs)
				H.updateRef(CAS, Ref);
				H.updateArray(Data);
				return H.finish();
				}

				private:
				HashT finish() { return Hasher.final(); }

				void updateRef(const ObjectStore &CAS, ObjectRef Ref) {
				updateID(CAS.getID(Ref));
				}

				void updateID(const CASID &ID) {
				// NOTE: Does not hash the size of the hash. That's a CAS implementation
				// detail that shouldn't leak into the UUID for an object.
				ArrayRef<uint8_t> Hash = ID.getHash();
				assert(Hash.size() == sizeof(HashT) &&
				"Expected object ref to match the hash size");
				Hasher.update(Hash);
				}

				void updateArray(ArrayRef<uint8_t> Bytes) {
				updateSize(Bytes.size());
				Hasher.update(Bytes);
				}

				void updateArray(ArrayRef<char> Bytes) {
				updateArray(ArrayRef(reinterpret_cast<const uint8_t *>(Bytes.data()),
				Bytes.size()));
				}

				void updateSize(uint64_t Size) {
				Size = support::endian::byte_swap(Size, support::endianness::little);
				Hasher.update(
				ArrayRef(reinterpret_cast<const uint8_t *>(&Size), sizeof(Size)));
				}

				BuiltinObjectHasher() = default;
				~BuiltinObjectHasher() = default;
				HasherT Hasher;
				};

				} // namespace cas
				} // namespace llvm

				#endif // LLVM_CAS_BUILTINOBJECTHASHER_H
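
The byte stream that `hashObject` feeds to the hasher can be made visible by writing it to a buffer instead of hashing it. The sketch below is a stand-alone illustration under that assumption: `frameObject`, `appendSize`, and `Bytes` are hypothetical names, and no real hash function is involved. It follows the order used in the header above: the reference count as a little-endian 64-bit value, each referenced hash with no length prefix, then the data size as a little-endian 64-bit value, then the data.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append Size as 8 little-endian bytes, independent of host endianness.
static void appendSize(Bytes &Out, std::uint64_t Size) {
  for (int I = 0; I < 8; ++I)
    Out.push_back(static_cast<std::uint8_t>(Size >> (8 * I)));
}

// Build the framed byte sequence for an object with the given referenced
// hashes and data. A real hasher would consume these bytes incrementally.
Bytes frameObject(const std::vector<Bytes> &RefHashes, std::string_view Data) {
  Bytes Out;
  appendSize(Out, RefHashes.size());
  for (const Bytes &Hash : RefHashes)
    Out.insert(Out.end(), Hash.begin(), Hash.end()); // no per-hash size
  appendSize(Out, Data.size());
  Out.insert(Out.end(), Data.begin(), Data.end());
  return Out;
}
```

Because both counts are part of the stream, an object with one reference and empty data frames differently from an object with no references whose data happens to equal that reference's hash bytes, even though the payload bytes are the same.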

llvm/lib/CAS/CMakeLists.txt

This file was added.

				add_llvm_component_library(LLVMCAS
				BuiltinCAS.cpp
				InMemoryCAS.cpp
				ObjectStore.cpp

				ADDITIONAL_HEADER_DIRS
				${LLVM_MAIN_INCLUDE_DIR}/llvm/CAS
				)

llvm/lib/CAS/InMemoryCAS.cpp

This file was added.

				//===- InMemoryCAS.cpp ------------------------------------------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "BuiltinCAS.h"
				#include "BuiltinObjectHasher.h"
				#include "llvm/ADT/LazyAtomicPointer.h"
				#include "llvm/ADT/PointerIntPair.h"
				#include "llvm/ADT/PointerUnion.h"
				#include "llvm/ADT/TrieRawHashMap.h"
				#include "llvm/Support/Allocator.h"
				#include "llvm/Support/ThreadSafeAllocator.h"

				using namespace llvm;
				using namespace llvm::cas;
				using namespace llvm::cas::builtin;

				namespace {

				class InMemoryObject;

				/// Index of referenced IDs (map: Hash -> InMemoryObject*). Uses
				/// LazyAtomicPointer to coordinate creation of objects.
				using InMemoryIndexT =
				ThreadSafeTrieRawHashMap<LazyAtomicPointer<const InMemoryObject>,
				sizeof(HashType)>;

				/// Values in \a InMemoryIndexT. \a InMemoryObject's point at this to access
				/// their hash.
				using InMemoryIndexValueT = InMemoryIndexT::value_type;

				class InMemoryObject {
				public:
				enum class Kind {
				/// Node with refs and data.
				RefNode,

				/// Node with refs and data co-allocated.
				InlineNode,

				Max = InlineNode,
				};

				Kind getKind() const { return IndexAndKind.getInt(); }
				const InMemoryIndexValueT &getIndex() const {
				assert(IndexAndKind.getPointer());
				return *IndexAndKind.getPointer();
				}

				ArrayRef<uint8_t> getHash() const { return getIndex().Hash; }

				InMemoryObject() = delete;
				InMemoryObject(InMemoryObject &&) = delete;
				InMemoryObject(const InMemoryObject &) = delete;

				protected:
				InMemoryObject(Kind K, const InMemoryIndexValueT &I) : IndexAndKind(&I, K) {}

				private:
				enum Counts : int {
				NumKindBits = 2,
				};
				PointerIntPair<const InMemoryIndexValueT *, NumKindBits, Kind> IndexAndKind;
				static_assert((1U << NumKindBits) <= alignof(InMemoryIndexValueT),
				"Kind will clobber pointer");
				static_assert(((int)Kind::Max >> NumKindBits) == 0, "Kind will be truncated");

				public:
				inline ArrayRef<char> getData() const;

				inline ArrayRef<const InMemoryObject *> getRefs() const;
				};

				class InMemoryRefObject : public InMemoryObject {
				public:
				static constexpr Kind KindValue = Kind::RefNode;
				static bool classof(const InMemoryObject *O) {
				return O->getKind() == KindValue;
				}

				ArrayRef<const InMemoryObject *> getRefsImpl() const { return Refs; }
				ArrayRef<const InMemoryObject *> getRefs() const { return Refs; }
				ArrayRef<char> getDataImpl() const { return Data; }
				ArrayRef<char> getData() const { return Data; }

				static InMemoryRefObject &create(function_ref<void *(size_t Size)> Allocate,
				const InMemoryIndexValueT &I,
				ArrayRef<const InMemoryObject *> Refs,
				ArrayRef<char> Data) {
				void *Mem = Allocate(sizeof(InMemoryRefObject));
				return *new (Mem) InMemoryRefObject(I, Refs, Data);
				}

				private:
				InMemoryRefObject(const InMemoryIndexValueT &I,
				ArrayRef<const InMemoryObject *> Refs, ArrayRef<char> Data)
				: InMemoryObject(KindValue, I), Refs(Refs), Data(Data) {
				assert(isAddrAligned(Align(8), this) && "Expected 8-byte alignment");
				assert(isAddrAligned(Align(8), Data.data()) && "Expected 8-byte alignment");
				assert(*Data.end() == 0 && "Expected null-termination");
				}

				ArrayRef<const InMemoryObject *> Refs;
				ArrayRef<char> Data;
				};

				class InMemoryInlineObject : public InMemoryObject {
				public:
				static constexpr Kind KindValue = Kind::InlineNode;
				static bool classof(const InMemoryObject *O) {
				return O->getKind() == KindValue;
				}

				ArrayRef<const InMemoryObject *> getRefs() const { return getRefsImpl(); }
				ArrayRef<const InMemoryObject *> getRefsImpl() const {
				return ArrayRef(reinterpret_cast<const InMemoryObject *const *>(this + 1),
				NumRefs);
				}

				ArrayRef<char> getData() const { return getDataImpl(); }
				ArrayRef<char> getDataImpl() const {
				ArrayRef<const InMemoryObject *> Refs = getRefs();
				return ArrayRef(reinterpret_cast<const char *>(Refs.data() + Refs.size()),
				DataSize);
				}

				static InMemoryInlineObject &
				create(function_ref<void *(size_t Size)> Allocate,
				const InMemoryIndexValueT &I, ArrayRef<const InMemoryObject *> Refs,
				ArrayRef<char> Data) {
				void *Mem = Allocate(sizeof(InMemoryInlineObject) +
				sizeof(uintptr_t) * Refs.size() + Data.size() + 1);
				return *new (Mem) InMemoryInlineObject(I, Refs, Data);
				}

				private:
				InMemoryInlineObject(const InMemoryIndexValueT &I,
				ArrayRef<const InMemoryObject *> Refs,
				ArrayRef<char> Data)
				: InMemoryObject(KindValue, I), NumRefs(Refs.size()),
				DataSize(Data.size()) {
				auto *BeginRefs = reinterpret_cast<const InMemoryObject **>(this + 1);
				llvm::copy(Refs, BeginRefs);
				auto *BeginData = reinterpret_cast<char *>(BeginRefs + NumRefs);
				llvm::copy(Data, BeginData);
				BeginData[Data.size()] = 0;
				}
				uint32_t NumRefs;
				uint32_t DataSize;
				};

				/// In-memory CAS database.
				class InMemoryCAS : public BuiltinCAS {
				public:
				Expected<CASID> parseIDImpl(ArrayRef<uint8_t> Hash) final {
				return getID(indexHash(Hash));
				}

				Expected<ObjectRef> storeImpl(ArrayRef<uint8_t> ComputedHash,
				ArrayRef<ObjectRef> Refs,
				ArrayRef<char> Data) final;

				Expected<ObjectRef>
				storeFromNullTerminatedRegion(ArrayRef<uint8_t> ComputedHash,
				sys::fs::mapped_file_region Map) override;

				CASID getID(const InMemoryIndexValueT &I) const {
				StringRef Hash = toStringRef(I.Hash);
				return CASID::create(&getContext(), Hash);
				}
				CASID getID(const InMemoryObject &O) const { return getID(O.getIndex()); }

				ObjectHandle getObjectHandle(const InMemoryObject &Node) const {
				assert(!(reinterpret_cast<uintptr_t>(&Node) & 0x1ULL));
				return makeObjectHandle(reinterpret_cast<uintptr_t>(&Node));
				}

				Expected<ObjectHandle> load(ObjectRef Ref) override {
				return getObjectHandle(asInMemoryObject(Ref));
				}

				InMemoryIndexValueT &indexHash(ArrayRef<uint8_t> Hash) {
				return *Index
				.insertLazy(Hash,
				[](auto ValueConstructor) {
				ValueConstructor.emplace(nullptr);
				})
				.first;
				}

				/// TODO: Consider callers to actually do an insert and to return a handle to
				/// the slot in the trie.
				const InMemoryObject *getInMemoryObject(CASID ID) const {
				assert(ID.getContext().getHashSchemaIdentifier() ==
				getContext().getHashSchemaIdentifier() &&
				"Expected ID from same hash schema");
				if (InMemoryIndexT::const_pointer P = Index.find(ID.getHash()))
				return P->Data;
				return nullptr;
				}

				const InMemoryObject &getInMemoryObject(ObjectHandle OH) const {
				return *reinterpret_cast<const InMemoryObject *>(
				(uintptr_t)OH.getInternalRef(*this));
				}

				const InMemoryObject &asInMemoryObject(ReferenceBase Ref) const {
				uintptr_t P = Ref.getInternalRef(*this);
				return *reinterpret_cast<const InMemoryObject *>(P);
				}
				ObjectRef toReference(const InMemoryObject &O) const {
				return makeObjectRef(reinterpret_cast<uintptr_t>(&O));
				}

				CASID getID(ObjectRef Ref) const final { return getIDImpl(Ref); }
				CASID getID(ObjectHandle Ref) const final { return getIDImpl(Ref); }
				CASID getIDImpl(ReferenceBase Ref) const {
				return getID(asInMemoryObject(Ref));
				}

				std::optional<ObjectRef> getReference(const CASID &ID) const final {
				if (const InMemoryObject *Object = getInMemoryObject(ID))
				return toReference(*Object);
				return std::nullopt;
				}
				ObjectRef getReference(ObjectHandle Handle) const final {
				return toReference(asInMemoryObject(Handle));
				}

				ArrayRef<char> getDataConst(ObjectHandle Node) const final {
				return cast<InMemoryObject>(asInMemoryObject(Node)).getData();
				}

				InMemoryCAS() = default;

				private:
				size_t getNumRefs(ObjectHandle Node) const final {
				return getInMemoryObject(Node).getRefs().size();
				}
				ObjectRef readRef(ObjectHandle Node, size_t I) const final {
				return toReference(*getInMemoryObject(Node).getRefs()[I]);
				}
				Error forEachRef(ObjectHandle Node,
				function_ref<Error(ObjectRef)> Callback) const final;

				/// Index of referenced IDs (map: Hash -> InMemoryObject*). Mapped to nullptr
				/// as a convenient way to store hashes.
				///
				/// - Insert nullptr on lookups.
				/// - InMemoryObject points back to here.
				InMemoryIndexT Index;

				ThreadSafeAllocator<BumpPtrAllocator> Objects;
				ThreadSafeAllocator<SpecificBumpPtrAllocator<sys::fs::mapped_file_region>>
				MemoryMaps;
				};

				} // end anonymous namespace

				dblaikie: I'd probably put the `inline` keyword here and leave it off the declaration (otherwise I'd read this code and wonder how it's valid/doesn't produce duplicate definitions) - wonder what's more common in the LLVM project/codebase. No big deal either way, though.
				ArrayRef<char> InMemoryObject::getData() const {
				if (auto *Derived = dyn_cast<InMemoryRefObject>(this))
				return Derived->getDataImpl();
				return cast<InMemoryInlineObject>(this)->getDataImpl();
				}

				ArrayRef<const InMemoryObject *> InMemoryObject::getRefs() const {
				if (auto *Derived = dyn_cast<InMemoryRefObject>(this))
				return Derived->getRefsImpl();
				return cast<InMemoryInlineObject>(this)->getRefsImpl();
				}

				Expected<ObjectRef>
				InMemoryCAS::storeFromNullTerminatedRegion(ArrayRef<uint8_t> ComputedHash,
				sys::fs::mapped_file_region Map) {
				// Look up the hash in the index, initializing to nullptr if it's new.
				ArrayRef<char> Data(Map.data(), Map.size());
				auto &I = indexHash(ComputedHash);

				// Load or generate.
				auto Allocator = [&](size_t Size) -> void * {
				return Objects.Allocate(Size, alignof(InMemoryObject));
				};
				auto Generator = [&]() -> const InMemoryObject * {
				return &InMemoryRefObject::create(Allocator, I, std::nullopt, Data);
				};
				const InMemoryObject &Node =
				cast<InMemoryObject>(I.Data.loadOrGenerate(Generator));

				// Save Map if the winning node uses it.
				if (auto *RefNode = dyn_cast<InMemoryRefObject>(&Node))
				if (RefNode->getData().data() == Map.data())
				new (MemoryMaps.Allocate(1)) sys::fs::mapped_file_region(std::move(Map));

				return toReference(Node);
				}

				Expected<ObjectRef> InMemoryCAS::storeImpl(ArrayRef<uint8_t> ComputedHash,
				ArrayRef<ObjectRef> Refs,
				ArrayRef<char> Data) {
				// Look up the hash in the index, initializing to nullptr if it's new.
				auto &I = indexHash(ComputedHash);

				// Create the node.
				SmallVector<const InMemoryObject *> InternalRefs;
				for (ObjectRef Ref : Refs)
				InternalRefs.push_back(&asInMemoryObject(Ref));
				auto Allocator = [&](size_t Size) -> void * {
				return Objects.Allocate(Size, alignof(InMemoryObject));
				};
				auto Generator = [&]() -> const InMemoryObject * {
				return &InMemoryInlineObject::create(Allocator, I, InternalRefs, Data);
				};
				return toReference(cast<InMemoryObject>(I.Data.loadOrGenerate(Generator)));
				}

				Error InMemoryCAS::forEachRef(ObjectHandle Handle,
				function_ref<Error(ObjectRef)> Callback) const {
				auto &Node = getInMemoryObject(Handle);
				for (const InMemoryObject *Ref : Node.getRefs())
				if (Error E = Callback(toReference(*Ref)))
				return E;
				return Error::success();
				}

				std::unique_ptr<ObjectStore> cas::createInMemoryCAS() {
				return std::make_unique<InMemoryCAS>();
				}
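
dblaikie's inline comment on `InMemoryObject::getData()` concerns where the `inline` keyword is written for a member function defined outside its class. The patch puts it on the in-class declaration; the suggested alternative puts it on the out-of-class definition. Both forms are valid C++. A minimal sketch of the two placements, using a hypothetical `Shape` class that is not part of the patch:

```cpp
#include <cassert>

class Shape {
public:
  explicit Shape(int Sides) : Sides(Sides) {}

  // Style used in the patch: `inline` on the declaration only.
  inline int sides() const;

  // Style suggested in review: plain declaration here...
  int doubledSides() const;

private:
  int Sides;
};

// ...no keyword needed here, because the declaration already said `inline`.
int Shape::sides() const { return Sides; }

// ...and `inline` on the definition, so a reader of this line can see why
// placing it in a header would not cause duplicate-definition errors.
inline int Shape::doubledSides() const { return 2 * Sides; }
```

In a `.cpp` file with the classes in an anonymous namespace, as in the listing above, the choice is mostly about readability.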

llvm/lib/CAS/ObjectStore.cpp

This file was added.

				//===- ObjectStore.cpp ------------------------------------------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "llvm/CAS/ObjectStore.h"
				#include "llvm/Support/Debug.h"
				#include "llvm/Support/FileSystem.h"
				#include "llvm/Support/SmallVectorMemoryBuffer.h"

				using namespace llvm;
				using namespace llvm::cas;

				CASContext::~CASContext() {}
				ObjectStore::~ObjectStore() {}

				LLVM_DUMP_METHOD void CASID::dump() const { print(dbgs()); }
				LLVM_DUMP_METHOD void ObjectStore::dump() const { print(dbgs()); }
				LLVM_DUMP_METHOD void ObjectRef::dump() const { print(dbgs()); }
				LLVM_DUMP_METHOD void ObjectHandle::dump() const { print(dbgs()); }

				std::string CASID::toString() const {
				std::string S;
				raw_string_ostream(S) << *this;
				return S;
				}

				ArrayRef<uint8_t> CASID::getHash() const {
				return arrayRefFromStringRef<uint8_t>(Hash);
				}

				static void printReferenceBase(raw_ostream &OS, StringRef Kind,
				uint64_t InternalRef, std::optional<CASID> ID) {
				OS << Kind << "=" << InternalRef;
				if (ID)
				OS << "[" << *ID << "]";
				}

				void ReferenceBase::print(raw_ostream &OS, const ObjectHandle &This) const {
				assert(this == &This);

				std::optional<CASID> ID;
				#if LLVM_ENABLE_ABI_BREAKING_CHECKS
				if (CAS)
				ID = CAS->getID(This);
				#endif
				printReferenceBase(OS, "object-handle", InternalRef, ID);
				}

				void ReferenceBase::print(raw_ostream &OS, const ObjectRef &This) const {
				assert(this == &This);

				std::optional<CASID> ID;
				#if LLVM_ENABLE_ABI_BREAKING_CHECKS
				if (CAS)
				ID = CAS->getID(This);
				#endif
				printReferenceBase(OS, "object-ref", InternalRef, ID);
				}

				std::unique_ptr<MemoryBuffer>
				ObjectStore::getMemoryBuffer(ObjectHandle Node, StringRef Name,
				bool RequiresNullTerminator) {
				return MemoryBuffer::getMemBuffer(
				toStringRef(getData(Node, RequiresNullTerminator)), Name,
				RequiresNullTerminator);
				}

				Expected<ObjectProxy> ObjectStore::getProxy(const CASID &ID) {
				std::optional<ObjectRef> Ref = getReference(ID);
				if (!Ref)
				return createUnknownObjectError(ID);

				std::optional<ObjectHandle> H;
				if (Error E = load(*Ref).moveInto(H))
				return std::move(E);

				return ObjectProxy::load(*this, *H);
				}

				Expected<ObjectProxy> ObjectStore::getProxy(ObjectRef Ref) {
				return getProxy(load(Ref));
				}

				Expected<ObjectProxy> ObjectStore::getProxy(Expected<ObjectHandle> H) {
				if (!H)
				return H.takeError();
				return ObjectProxy::load(*this, *H);
				}

				Error ObjectStore::createUnknownObjectError(const CASID &ID) {
				return createStringError(std::make_error_code(std::errc::invalid_argument),
				"unknown object '" + ID.toString() + "'");
				}

				Expected<ObjectProxy> ObjectStore::createProxy(ArrayRef<ObjectRef> Refs,
				StringRef Data) {
				Expected<ObjectRef> Ref = store(Refs, arrayRefFromStringRef<char>(Data));
				if (!Ref)
				return Ref.takeError();
				return getProxy(*Ref);
				}

				std::unique_ptr<MemoryBuffer>
				ObjectProxy::getMemoryBuffer(StringRef Name,
				bool RequiresNullTerminator) const {
				return CAS->getMemoryBuffer(H, Name, RequiresNullTerminator);
				}

llvm/lib/CMakeLists.txt

 	include(LLVM-Build)

 	# `Demangle', `Support' and `TableGen' libraries are added on the top-level
 	# CMakeLists.txt

 	add_subdirectory(IR)
 	add_subdirectory(FuzzMutate)
 	add_subdirectory(FileCheck)
 	add_subdirectory(InterfaceStub)
 	add_subdirectory(IRPrinter)
 	add_subdirectory(IRReader)
+	add_subdirectory(CAS)
 	add_subdirectory(CodeGen)
 	add_subdirectory(BinaryFormat)
 	add_subdirectory(Bitcode)
 	add_subdirectory(Bitstream)
 	add_subdirectory(DWARFLinker)
 	add_subdirectory(DWARFLinkerParallel)
 	add_subdirectory(Extensions)
 	add_subdirectory(Frontend)

llvm/unittests/CAS/CASTestConfig.h

This file was added.

				//===- CASTestConfig.h ----------------------------------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "llvm/CAS/ObjectStore.h"
				#include "llvm/Config/llvm-config.h"
				#include "llvm/Support/FileSystem.h"
				#include "llvm/Testing/Support/Error.h"
				#include "llvm/Testing/Support/SupportHelpers.h"
				#include "gtest/gtest.h"

				#ifndef LLVM_UNITTESTS_CASTESTCONFIG_H
				#define LLVM_UNITTESTS_CASTESTCONFIG_H

				struct CASTestingEnv {
				std::unique_ptr<llvm::cas::ObjectStore> CAS;
				};

				class CASTest
				: public testing::TestWithParam<std::function<CASTestingEnv(int)>> {
				protected:
				std::optional<int> NextCASIndex;

				std::unique_ptr<llvm::cas::ObjectStore> createObjectStore() {
				auto TD = GetParam()(++(*NextCASIndex));
				return std::move(TD.CAS);
				}
				void SetUp() { NextCASIndex = 0; }
				void TearDown() { NextCASIndex = std::nullopt; }
				};

				#endif

llvm/unittests/CAS/CASTestConfig.cpp

This file was added.

				//===- CASTestConfig.cpp --------------------------------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "CASTestConfig.h"
				#include "llvm/CAS/ObjectStore.h"
				#include "gtest/gtest.h"

				using namespace llvm;
				using namespace llvm::cas;

				CASTestingEnv createInMemory(int I) {
				std::unique_ptr<ObjectStore> CAS = createInMemoryCAS();
				return CASTestingEnv{std::move(CAS)};
				}

				INSTANTIATE_TEST_SUITE_P(InMemoryCAS, CASTest,
				::testing::Values(createInMemory));

llvm/unittests/CAS/CMakeLists.txt

This file was added.

				set(LLVM_LINK_COMPONENTS
				Support
				CAS
				TestingSupport
				)

				add_llvm_unittest(CASTests
				CASTestConfig.cpp
				ObjectStoreTest.cpp
				)

				target_link_libraries(CASTests PRIVATE LLVMTestingSupport)

llvm/unittests/CAS/ObjectStoreTest.cpp

This file was added.

				//===- ObjectStoreTest.cpp ------------------------------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "llvm/CAS/ObjectStore.h"
				#include "llvm/Config/llvm-config.h"
				#include "llvm/Support/FileSystem.h"
				#include "llvm/Testing/Support/Error.h"
				#include "llvm/Testing/Support/SupportHelpers.h"
				#include "gtest/gtest.h"

				#include "CASTestConfig.h"

				using namespace llvm;
				using namespace llvm::cas;

				TEST_P(CASTest, PrintIDs) {
				std::unique_ptr<ObjectStore> CAS = createObjectStore();

				std::optional<CASID> ID1, ID2;
				ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, "1").moveInto(ID1),
				Succeeded());
				ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, "2").moveInto(ID2),
				Succeeded());
				EXPECT_NE(ID1, ID2);
				std::string PrintedID1 = ID1->toString();
				std::string PrintedID2 = ID2->toString();
				EXPECT_NE(PrintedID1, PrintedID2);

				std::optional<CASID> ParsedID1, ParsedID2;
				ASSERT_THAT_ERROR(CAS->parseID(PrintedID1).moveInto(ParsedID1), Succeeded());
				ASSERT_THAT_ERROR(CAS->parseID(PrintedID2).moveInto(ParsedID2), Succeeded());
				EXPECT_EQ(ID1, ParsedID1);
				EXPECT_EQ(ID2, ParsedID2);
				}

				TEST_P(CASTest, Blobs) {
				std::unique_ptr<ObjectStore> CAS1 = createObjectStore();
				StringRef ContentStrings[] = {
				"word",
				"some longer text std::string's local memory",
				R"(multiline text multiline text multiline text multiline text
				multiline text multiline text multiline text multiline text multiline text
				multiline text multiline text multiline text multiline text multiline text
				multiline text multiline text multiline text multiline text multiline text
				multiline text multiline text multiline text multiline text multiline text
				multiline text multiline text multiline text multiline text multiline text)",
				};

				SmallVector<CASID> IDs;
				for (StringRef Content : ContentStrings) {
				// Use StringRef::str() to create a temporary std::string. This could cause
				// problems if the CAS is storing references to the input string instead of
				// copying it.
				std::optional<ObjectProxy> Blob;
				ASSERT_THAT_ERROR(CAS1->createProxy(std::nullopt, Content).moveInto(Blob),
				Succeeded());
				IDs.push_back(Blob->getID());

				// Check basic printing of IDs.
				EXPECT_EQ(IDs.back().toString(), IDs.back().toString());
				if (IDs.size() > 2)
				EXPECT_NE(IDs.front().toString(), IDs.back().toString());
				}

				// Check that the blobs give the same IDs later.
				for (int I = 0, E = IDs.size(); I != E; ++I) {
				std::optional<ObjectProxy> Blob;
				ASSERT_THAT_ERROR(
				CAS1->createProxy(std::nullopt, ContentStrings[I]).moveInto(Blob),
				Succeeded());
				EXPECT_EQ(IDs[I], Blob->getID());
				}

				// Run validation on all CASIDs.
				for (int I = 0, E = IDs.size(); I != E; ++I)
				ASSERT_THAT_ERROR(CAS1->validate(IDs[I]), Succeeded());

				// Check that the blobs can be retrieved multiple times.
				for (int I = 0, E = IDs.size(); I != E; ++I) {
				for (int J = 0, JE = 3; J != JE; ++J) {
				std::optional<ObjectProxy> Buffer;
				ASSERT_THAT_ERROR(CAS1->getProxy(IDs[I]).moveInto(Buffer), Succeeded());
				EXPECT_EQ(ContentStrings[I], Buffer->getData());
				}
				}

				// Confirm these blobs don't exist in a fresh CAS instance.
				std::unique_ptr<ObjectStore> CAS2 = createObjectStore();
				for (int I = 0, E = IDs.size(); I != E; ++I) {
				std::optional<ObjectProxy> Proxy;
				EXPECT_THAT_ERROR(CAS2->getProxy(IDs[I]).moveInto(Proxy), Failed());
				}

				// Insert into the second CAS and confirm the IDs are stable. Getting them
				// should work now.
				for (int I = IDs.size(), E = 0; I != E; --I) {
				auto &ID = IDs[I - 1];
				auto &Content = ContentStrings[I - 1];
				std::optional<ObjectProxy> Blob;
				ASSERT_THAT_ERROR(CAS2->createProxy(std::nullopt, Content).moveInto(Blob),
				Succeeded());
				EXPECT_EQ(ID, Blob->getID());

				std::optional<ObjectProxy> Buffer;
				ASSERT_THAT_ERROR(CAS2->getProxy(ID).moveInto(Buffer), Succeeded());
				EXPECT_EQ(Content, Buffer->getData());
				}
				}

				TEST_P(CASTest, BlobsBig) {
				// A little bit of validation that bigger blobs are okay. Climb up to 1MB.
				std::unique_ptr<ObjectStore> CAS = createObjectStore();
				SmallString<256> String1 = StringRef("a few words");
				SmallString<256> String2 = StringRef("others");
				while (String1.size() < 1024U * 1024U) {
				std::optional<CASID> ID1;
				std::optional<CASID> ID2;
				ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, String1).moveInto(ID1),
				Succeeded());
				ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, String1).moveInto(ID2),
				Succeeded());
				ASSERT_THAT_ERROR(CAS->validate(*ID1), Succeeded());
				ASSERT_THAT_ERROR(CAS->validate(*ID2), Succeeded());
				ASSERT_EQ(ID1, ID2);

				String1.append(String2);
				ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, String2).moveInto(ID1),
				Succeeded());
				ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, String2).moveInto(ID2),
				Succeeded());
				ASSERT_THAT_ERROR(CAS->validate(*ID1), Succeeded());
				ASSERT_THAT_ERROR(CAS->validate(*ID2), Succeeded());
				ASSERT_EQ(ID1, ID2);
				String2.append(String1);
				}

				// Specifically check near 1MB for objects large enough they're likely to be
				// stored externally in an on-disk CAS and will be near a page boundary.
				SmallString<0> Storage;
				const size_t InterestingSize = 1024U * 1024ULL;
				const size_t SizeE = InterestingSize + 2;
				if (Storage.size() < SizeE)
				Storage.resize(SizeE, '\01');
				for (size_t Size = InterestingSize - 2; Size != SizeE; ++Size) {
				StringRef Data(Storage.data(), Size);
				std::optional<ObjectProxy> Blob;
				ASSERT_THAT_ERROR(CAS->createProxy(std::nullopt, Data).moveInto(Blob),
				Succeeded());
				ASSERT_EQ(Data, Blob->getData());
				ASSERT_EQ(0, Blob->getData().end()[0]);
				}
				}
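The `ASSERT_EQ(0, Blob->getData().end()[0])` check in `BlobsBig` reads the byte one past the end of the returned data, so it appears to require that the store keeps blob data NUL-terminated. Because `Storage` is filled with `'\01'`, a store that handed back a view into the caller's buffer would not satisfy it. The standalone sketch below illustrates that property without using the LLVM CAS API. `ToyBlob` and `hasTrailingNul` are hypothetical names, and `std::string` is used only because it guarantees a terminator after its contents.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical holder that owns a copy of a blob's bytes (not the LLVM API).
// std::string guarantees that data()[size()] is '\0', which provides the
// "readable terminator past the end" property the test checks for.
class ToyBlob {
public:
  explicit ToyBlob(std::string_view Data) : Bytes(Data) {}

  // Returns a view of the owned bytes; the terminator is not part of the view.
  std::string_view getData() const { return Bytes; }

private:
  std::string Bytes;
};

// Returns true if the byte just past the end of the view is NUL. Only call
// this on views whose backing storage is known to extend at least one byte
// past the end of the view.
bool hasTrailingNul(std::string_view Data) {
  return Data.data()[Data.size()] == '\0';
}
```

A view into the middle of a `'\01'`-filled buffer fails `hasTrailingNul`, while the view returned by `ToyBlob::getData()` passes, which is the distinction the test seems designed to catch.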

				TEST_P(CASTest, LeafNodes) {
				std::unique_ptr<ObjectStore> CAS1 = createObjectStore();
				StringRef ContentStrings[] = {
				"word",
				Inline comment from dblaikie (unsubmitted, not done): Not sure I understand the mention of std::string here, since this is a `StringRef`? (Maybe this is a remnant of a different version of the code.)
				"some longer text std::string's local memory",
				R"(multiline text multiline text multiline text multiline text
				multiline text multiline text multiline text multiline text multiline text
				multiline text multiline text multiline text multiline text multiline text
				multiline text multiline text multiline text multiline text multiline text
				multiline text multiline text multiline text multiline text multiline text
				multiline text multiline text multiline text multiline text multiline text)",
				};

				SmallVector<ObjectRef> Nodes;
				SmallVector<CASID> IDs;
				for (StringRef Content : ContentStrings) {
				// Use StringRef::str() to create a temporary std::string. This could cause
				// problems if the CAS is storing references to the input string instead of
				// copying it.
				Inline comment from dblaikie (unsubmitted, not done): Rather than testing this via UB, could it be tested via pointer equality? (E.g., when querying the CAS, check that the data points to the same place as the data passed in.)
				std::optional<ObjectRef> Node;
				ASSERT_THAT_ERROR(
				CAS1->store(std::nullopt, arrayRefFromStringRef<char>(Content))
				.moveInto(Node),
				Succeeded());
				Nodes.push_back(*Node);

				// Check basic printing of IDs.
				IDs.push_back(CAS1->getID(*Node));
				EXPECT_EQ(IDs.back().toString(), IDs.back().toString());
				EXPECT_EQ(Nodes.front(), Nodes.front());
				EXPECT_EQ(Nodes.back(), Nodes.back());
				EXPECT_EQ(IDs.front(), IDs.front());
				EXPECT_EQ(IDs.back(), IDs.back());
				if (Nodes.size() <= 1)
				continue;
				EXPECT_NE(Nodes.front(), Nodes.back());
				EXPECT_NE(IDs.front(), IDs.back());
				}

				// Check that the blobs give the same IDs later.
				for (int I = 0, E = IDs.size(); I != E; ++I) {
				std::optional<ObjectRef> Node;
				ASSERT_THAT_ERROR(CAS1->store(std::nullopt, arrayRefFromStringRef<char>(
				ContentStrings[I]))
				.moveInto(Node),
				Succeeded());
				EXPECT_EQ(IDs[I], CAS1->getID(*Node));
				}

				// Check that the blobs can be retrieved multiple times.
				for (int I = 0, E = IDs.size(); I != E; ++I) {
				for (int J = 0, JE = 3; J != JE; ++J) {
				std::optional<ObjectProxy> Object;
				ASSERT_THAT_ERROR(CAS1->getProxy(IDs[I]).moveInto(Object), Succeeded());
				ASSERT_TRUE(Object);
				EXPECT_EQ(ContentStrings[I], Object->getData());
				}
				}

				// Confirm these blobs don't exist in a fresh CAS instance.
				std::unique_ptr<ObjectStore> CAS2 = createObjectStore();
				for (int I = 0, E = IDs.size(); I != E; ++I) {
				std::optional<ObjectProxy> Object;
				EXPECT_THAT_ERROR(CAS2->getProxy(IDs[I]).moveInto(Object), Failed());
				}

				// Insert into the second CAS and confirm the IDs are stable. Getting them
				// should work now.
				for (int I = IDs.size(), E = 0; I != E; --I) {
				auto &ID = IDs[I - 1];
				auto &Content = ContentStrings[I - 1];
				std::optional<ObjectRef> Node;
				ASSERT_THAT_ERROR(
				CAS2->store(std::nullopt, arrayRefFromStringRef<char>(Content))
				.moveInto(Node),
				Succeeded());
				EXPECT_EQ(ID, CAS2->getID(*Node));

				std::optional<ObjectProxy> Object;
				ASSERT_THAT_ERROR(CAS2->getProxy(ID).moveInto(Object), Succeeded());
				ASSERT_TRUE(Object);
				EXPECT_EQ(Content, Object->getData());
				}
				}
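One way to act on the inline suggestion in `LeafNodes` (checking pointer equality rather than relying on a dangling temporary) might look like the sketch below. It is standalone C++ and does not use the LLVM CAS API: `ToyStore`, `store`, `getData`, and `storesIndependentCopy` are hypothetical stand-ins, and the toy store is assumed to copy its input. In the real test, the analogous check would presumably compare the address of the data returned by `Object->getData()` with the address of the buffer passed to `store`.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <string_view>

// Hypothetical stand-in for an object store that copies its input.
// Not the LLVM API; all names are for illustration only.
class ToyStore {
public:
  // Stores a copy of Data and returns an index used as a handle.
  std::size_t store(std::string_view Data) {
    Objects.emplace_back(Data);
    return Objects.size() - 1;
  }

  // Returns a view of the stored bytes for a handle returned by store().
  std::string_view getData(std::size_t Handle) const {
    return Objects[Handle];
  }

private:
  // std::deque keeps existing elements in place when new ones are appended,
  // so previously returned views stay valid.
  std::deque<std::string> Objects;
};

// Returns true if the store holds its own copy of the input: the stored
// bytes compare equal to the input but live at a different address.
bool storesIndependentCopy(ToyStore &Store, std::string_view Input) {
  std::size_t Handle = Store.store(Input);
  std::string_view Stored = Store.getData(Handle);
  return Stored == Input && Stored.data() != Input.data();
}
```

Whether the stored address should differ from or match the input depends on whether the store under test is expected to copy; the sketch assumes copying. Either way, the check is well-defined because the input buffer is still alive when the addresses are compared.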

				TEST_P(CASTest, NodesBig) {
				std::unique_ptr<ObjectStore> CAS = createObjectStore();

				// Specifically check near 1MB for objects large enough they're likely to be
				// stored externally in an on-disk CAS, and such that one of them will be
				// near a page boundary.
				SmallString<0> Storage;
				constexpr size_t InterestingSize = 1024U * 1024ULL;
				constexpr size_t WordSize = sizeof(void *);

				// Start much smaller to account for headers.
				constexpr size_t SizeB = InterestingSize - 8 * WordSize;
				constexpr size_t SizeE = InterestingSize + 1;
				if (Storage.size() < SizeE)
				Storage.resize(SizeE, '\01');

				SmallVector<ObjectRef, 4> CreatedNodes;
				// Avoid checking every size because this is an expensive test. Just check
				// for data that is 8B-word-aligned, and one less. Also appending the created
				// nodes as the references in the next block to check references are created
				// correctly.
				for (size_t Size = SizeB; Size < SizeE; Size += WordSize) {
				for (bool IsAligned : {false, true}) {
				StringRef Data(Storage.data(), Size - (IsAligned ? 0 : 1));
				std::optional<ObjectProxy> Node;
				ASSERT_THAT_ERROR(CAS->createProxy(CreatedNodes, Data).moveInto(Node),
				Succeeded());
				ASSERT_EQ(Data, Node->getData());
				ASSERT_EQ(0, Node->getData().end()[0]);
				ASSERT_EQ(Node->getNumReferences(), CreatedNodes.size());
				CreatedNodes.emplace_back(Node->getRef());
				}
				}

				for (auto ID : CreatedNodes)
				ASSERT_THAT_ERROR(CAS->validate(CAS->getID(ID)), Succeeded());
				}
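The loop in `NodesBig` passes the nodes created so far as the references of the next node and then checks `getNumReferences()`. The standalone sketch below mirrors that structure with a toy store. It does not use the LLVM CAS API: `ToyNodeStore`, `Handle`, and `buildChain` are hypothetical, and deduplication is keyed on the full (references, data) pair rather than on a hash, purely to keep the example deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node store (not the LLVM API). Each node holds bytes plus
// references (handles) to previously stored nodes. Identical
// (references, data) pairs share a single handle.
class ToyNodeStore {
public:
  using Handle = std::size_t;

  Handle store(const std::vector<Handle> &Refs, const std::string &Data) {
    auto Key = std::make_pair(Refs, Data);
    auto It = Index.find(Key);
    if (It != Index.end())
      return It->second; // Already stored: reuse the existing handle.
    Handle H = Nodes.size();
    Nodes.push_back(Key);
    Index.emplace(std::move(Key), H);
    return H;
  }

  std::size_t getNumReferences(Handle H) const {
    return Nodes[H].first.size();
  }
  const std::string &getData(Handle H) const { return Nodes[H].second; }

private:
  using Node = std::pair<std::vector<Handle>, std::string>;
  std::vector<Node> Nodes;       // Node contents, indexed by handle.
  std::map<Node, Handle> Index;  // Lookup used for deduplication.
};

// Builds a chain in which every new node references all nodes created so
// far, and checks the reference count of each node as it is created.
std::vector<ToyNodeStore::Handle> buildChain(ToyNodeStore &Store,
                                             std::size_t Count) {
  std::vector<ToyNodeStore::Handle> Created;
  for (std::size_t I = 0; I != Count; ++I) {
    std::string Data(I + 1, '\01');
    ToyNodeStore::Handle H = Store.store(Created, Data);
    assert(Store.getNumReferences(H) == Created.size());
    Created.push_back(H);
  }
  return Created;
}
```

Building the same chain twice against one store yields the same handles, which is the toy analogue of the ID stability that the tests above check with `getID`.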

llvm/unittests/CMakeLists.txt

(14 unchanged lines not shown)
 endfunction()

 add_subdirectory(ADT)
 add_subdirectory(Analysis)
 add_subdirectory(AsmParser)
 add_subdirectory(BinaryFormat)
 add_subdirectory(Bitcode)
 add_subdirectory(Bitstream)
+add_subdirectory(CAS)
 add_subdirectory(CodeGen)
 add_subdirectory(DebugInfo)
 add_subdirectory(Debuginfod)
 add_subdirectory(Demangle)
 add_subdirectory(DWARFLinkerParallel)
 add_subdirectory(ExecutionEngine)
 add_subdirectory(FileCheck)
 add_subdirectory(Frontend)
(24 unchanged lines not shown)

Revision Contents

Diff 545803
