Page MenuHomePhabricator

[CAS] Add LLVMCAS library with InMemoryCAS implementation
Needs ReviewPublic

Authored by steven_wu on Sep 12 2022, 10:43 AM.

Details

Summary

Add llvm::cas::ObjectStore abstraction and InMemoryCAS as a in-memory
CAS object store implementation.

The ObjectStore models its objects as:

  • Content: An array of bytes for the data to be stored.
  • Refs: An array of references to other objects in the ObjectStore.

And each CAS Object can be idenfied with an unqine ID/Hash.

ObjectStore supports following general action:

  • Expected<ID> store(Content, ArrayRef<Ref>)
  • Expected<Ref> get(ID)

It also introduces following types to interact with a CAS ObjectStore:

  • CASID: Hash representation for an CAS Objects with its context to help print/compare CASIDs.
  • ObjectRef: A light-weight ref for an object in the ObjectStore. It is implementation defined so it can be optimized for read/store/references depending on the implementation.
  • ObjectHandle: A CAS internal light-weight handle to an loaded object in the ObjectStore. Underlying data for the object is guaranteed to be available and no error handling is required to access data. This is not exposed to the users of CAS from ObjectStore APIs.
  • ObjectProxy: A proxy for the users of CAS to interact with the data inside CAS Object. It bundles a ObjectHandle and an ObjectStore instance.

Diff Detail

Unit TestsFailed

TimeTest
60,140 msx64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/non-overloaded::vloxseg.c
Script: -- : 'RUN: at line 3'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -cc1 -internal-isystem /var/lib/buildkite-agent/builds/llvm-project/build/lib/clang/16/include -nostdsysteminc -triple riscv64 -target-feature +v -target-feature +zfh -target-feature +experimental-zvfh -disable-O0-optnone -emit-llvm /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/non-overloaded/vloxseg.c -o - | /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -S -passes=mem2reg | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck --check-prefix=CHECK-RV64 /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/non-overloaded/vloxseg.c
60,210 msx64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/non-overloaded::vluxseg.c
Script: -- : 'RUN: at line 3'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -cc1 -internal-isystem /var/lib/buildkite-agent/builds/llvm-project/build/lib/clang/16/include -nostdsysteminc -triple riscv64 -target-feature +v -target-feature +zfh -target-feature +experimental-zvfh -disable-O0-optnone -emit-llvm /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/non-overloaded/vluxseg.c -o - | /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -S -passes=mem2reg | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck --check-prefix=CHECK-RV64 /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/non-overloaded/vluxseg.c
60,120 msx64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/overloaded::vloxseg.c
Script: -- : 'RUN: at line 3'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -cc1 -internal-isystem /var/lib/buildkite-agent/builds/llvm-project/build/lib/clang/16/include -nostdsysteminc -triple riscv64 -target-feature +v -target-feature +zfh -target-feature +experimental-zvfh -disable-O0-optnone -emit-llvm /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/overloaded/vloxseg.c -o - | /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -S -passes=mem2reg | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck --check-prefix=CHECK-RV64 /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/overloaded/vloxseg.c
60,280 msx64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/overloaded::vluxseg.c
Script: -- : 'RUN: at line 3'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -cc1 -internal-isystem /var/lib/buildkite-agent/builds/llvm-project/build/lib/clang/16/include -nostdsysteminc -triple riscv64 -target-feature +v -target-feature +zfh -target-feature +experimental-zvfh -disable-O0-optnone -emit-llvm /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/overloaded/vluxseg.c -o - | /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -S -passes=mem2reg | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck --check-prefix=CHECK-RV64 /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/overloaded/vluxseg.c

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Herald added a project: Restricted Project. · View Herald TranscriptSep 12 2022, 10:43 AM
steven_wu requested review of this revision.Sep 12 2022, 10:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 12 2022, 10:43 AM
ychen added a subscriber: ychen.Sep 12 2022, 11:14 AM
aganea added a subscriber: aganea.Sep 27 2022, 8:12 AM
aprantl added a subscriber: aprantl.Oct 5 2022, 8:45 AM

Ping.

Now with context of the LLVM talk, I hope it makes the review easier. Will be really appreciated for any feedback to get CAS upstream. Feel free to reach out to me for any questions.

dexonsmith resigned from this revision.Nov 29 2022, 4:02 PM
dexonsmith added subscribers: MatzeB, dexonsmith.

Ping.

Now with context of the LLVM talk, I hope it makes the review easier. Will be really appreciated for any feedback to get CAS upstream. Feel free to reach out to me for any questions.

FTR, I'll be happy to review most/many patches in this area once the core patches land (and also contribute to design discussions), but for the core patches (where I was the original author) I'm not sure I have enough perspective to spot what's missing. As such, I'm going to resign as a reviewer, but I'll still be "subscribed" to the email thread. (Feel free to add me as a subscriber/reviewer to anything else in this area as well...)

In my opinion, much of this can evolve in-tree after landing, and the RFC had consensus to move forward with landing this stuff in main following the usual code review. It's important for someone outside the original authors (a few Apple people, plus me!) to at least take a cursory look to ensure that it's well-enough documented and tested for others to contribute. Maybe an in-depth review isn't needed, but I'm sure that'd also be welcome if someone can contribute one.

Alternatively, if someone thinks the design needs a deeper look before landing, that sounds great too; but I think it'd be helpful for Steven to have signal either way.

+@MatzeB, in case you are able to contribute a review; IIRC, it was your design feedback in the RFC video call that caused us to simplify the interface, moving filesystem concepts like "blobs" and "trees" out of the core CAS interface.

dexonsmith added inline comments.Nov 29 2022, 4:06 PM
llvm/include/llvm/ADT/StringExtras.h
61–74 ↗(On Diff #459516)

Oh, just noticed these changes; I suggest separating these out and adding quick unit tests.

Split StringExtras changes

dblaikie added inline comments.Dec 1 2022, 5:04 PM
llvm/include/llvm/CAS/CASID.h
27

This could be private to avoid polluting autocomplete and such, maybe? (alternatively it's probably not too costly to have the dtor virtual and out of line to act as an anchor - I assume CASContexts aren't being created/torn down with any great frequency such that some dtor indirection would be especially costly?)

109–110

if you have a std::string member, should probably take the parameter as std::string and std::move it into the member - in cases where the caller can discard a string that'll be more efficient, and in cases where the caller can't, it's not especially worse.

(though shoudl the member be std::string? should it be SmallString or anything else?)

llvm/include/llvm/CAS/CASReference.h
63

Only the last of these checks is limited to LLVM_ENABLE_ABI_BREAKING_CHECKS yeah? Might be worth pulling the other two out of the preprocessor conditional, so they get run more broadly?

107

expect -> extract? or something else?

111–114

If clients aren't expected to need to do it, could we leave that out until some necessary use comes up?

163

Since Handle and Ref aren't especially self documenting (while reading the docs for Ref I wondered if it should be called Handle - until I got to the bit where it explained that that exists too/separately) - is there something that'd be more explicit about the difference between these? (like, I wonder if ObjectHandle could be Object? - not sure that'd make ObjectRef more obvious, though... )

llvm/include/llvm/CAS/ObjectStore.h
50

do they "reference" other objects - or does an object consist of other objects/include those objects in some sense? (maybe there's no real difference, and this current language is better... not sure)

54–58

Not clear from this what the difference between an ObjectRef and a CASID are - is that worth more details?

62–65

Ah. May be worth splitting up the documentation here more clearly between public and private concepts, rather than having this internal detail interleaved with external ones?

91

anchor's only useful if it's virtual?
Maybe jus tmake the dtor out of line and use that as an anchor?

110–111

what sort of validity is of concern here?

175–176

this lets you walk which other objects reference this object?
I guess you've probably got some pretty core use cases/need for this - but dealing with updating use lists in LLVM IR at least makes me ask: Do you need to be able to walk the uses of an object?

llvm/lib/CAS/InMemoryCAS.cpp
263

I'd probably put the inline keyword here and leave it off the declaration (otherwise I'd read this code and wonder how it's valid/doesn't produce duplicate definitions) - wonder what's more common in the LLVM project/codebase. No big deal either way, though.

llvm/unittests/CAS/ObjectStoreTest.cpp
162

not sure I understand the mention of std::string here - since this is StringRef? (maybe this is some remnant of a different version of the code)

174–176

Rather than testing this via UB, could it be tested via pointer equality? (eg: when querying the CAS, check that the data points to the same place as the data pasesd in)

steven_wu updated this revision to Diff 481046.Dec 7 2022, 1:53 PM
steven_wu marked 2 inline comments as done.

Address review feedback.

@dblaikie Thanks for reviewing. Feedback inline.

llvm/include/llvm/CAS/CASID.h
27

This is a private method? But sure this is not a common object to create or destory. dtor as anchor sounds fine.

109–110

I don't have much preference for the internal data type. CASID should only be relevant when you actually need the hash representation, often when you need to print out the value for outside users. Otherwise, ObjectRef is always a better choice.

llvm/include/llvm/CAS/CASReference.h
107

This comment is out-dated. The builtin kind of the object is removed from the latest implementation to simplify the interface so it is easier to use and implement

111–114

It actually means that only the CAS implementers should use those method (for example, used in BuiltinCAS) but users of the CAS should not use those methods because InternalRef has no meaning outside the CAS internal. I rewrite it to clear up a bit.

163

ObjectRef is just a reference, and you can't access anything in the object it points to, until you load the object using ObjectRef, which will turn into an ObjectHandle. The load can have latency or can fail.

Maybe the comments in ObjectStore.h is a better explanation with a bigger picture? Or should we put all the documentation into one place so it is easier to read?

llvm/include/llvm/CAS/ObjectStore.h
50

It is a real reference, like a pointer reference. But this is up to the CAS implementation, as I can see it is totally possible to implement a CAS the directly embeds the "referenced" object without breaking the contract of the API, but there is really no good reason to do that.

54–58

I refine the docs a bit. Let me know if it is clearer to read.

110–111

This is more a debugging function that can check if the integrity of the CAS has been broken. The current implementation for builtin CAS is to rehash the object fetched from CASID and make sure the hash matches. If not, there is either a bug in CAS or the data has been corrupted.

175–176

No, the reference in CAS goes only one direction. This iterates through all the objects that is referenced by the current object and you don't know how many parents references you.

In general, there is even no API for CAS to iterate through all the objects to compute a use-list and that is by design.

steven_wu updated this revision to Diff 481048.Dec 7 2022, 1:56 PM

Cleanup patch to move unrelated changes.

dblaikie added inline comments.Dec 7 2022, 3:30 PM
llvm/include/llvm/CAS/CASID.h
109–110

Cool - I see you updated the parameter to std::string&& - generally I'd expect it to be std::string value, not a reference. That way the caller can move into the parameter if they have a string to discard, or allow the copy if they want to keep copies in the caller for whatever reason.

llvm/include/llvm/CAS/CASReference.h
163

Still feels like the names could be more self-descriptive, even with/independent of documentation improvements. Ref and Handle both tend to connote a thing you can use to inspect the thing being referred to or handled. (if anything, I guess I'd actually expect a Handle to be the inaccessible version and the Ref to be the accessible version (thinking StringRef, ArrayRef, etc - where the data is directly accessible through that abstraction, whereas handles sometimes you have to pass back into some other object/API to then get the data (like a file handle/file descriptor)))

Seems like ObjectRef is maybe more suitably called ObjectID?

llvm/include/llvm/CAS/ObjectStore.h
60

Presumably it can also fail if the object isn't in the given CAS? (maybe that's the more obvious/simpler to document example?)

62–65

ping on this

110–111

Might be worth a few more words in the description maybe about "consistency" or somesuch?

steven_wu added inline comments.Dec 8 2022, 9:50 AM
llvm/include/llvm/CAS/CASReference.h
163

I don't have strong opinion on names. Maybe @dexonsmith @benlangmuir @akyrtzi can also provide some feedbacks for names?

In my opinion, ObjectRef is named that way because it is like a reference type (just an opaque pointer) and you need to "dereference" it to access underlying data, just that the dereference might fail in CAS. The difference between ObjectRef and ObjectHandle is mostly providing the flexibility for a remote CAS so there is no need to traffic all the data if you only dealing with refs.

Also ObjectRef is the public type we encourage to use, just like other Ref types like StringRef and ArrayRef. It is cheap and allows quick fetch of the underlying data (at least for the builtin CAS we provided). It doesn't really contains ID, and you can't compare Ref from different ObjectStore.

I thought about rename CASID to ObjectID to keep the name consistent, but there is no other more benefit. I am up for renaming types to make it more self-descriptive but I would like to reach an agreement before doing so because we have lots of downstream code needs to upstream and I would like to avoid repeated renames.

llvm/include/llvm/CAS/ObjectStore.h
60

This question is complicated. Maybe we should write down a spec for the CAS APIs? To this question, the current spec for when to lookup for a CAS object to make sure it exists is totally implementation defined.

For builtinCAS here, ObjectRef always points to existing object, unless the integrity of the builtinCAS is broken. For a remote CAS, you might actually want to have ObjectRef to be unverified to avoid a roundtrip to remote to validate a ObjectRef.

Also to you previous point of split up public/private interface document, is it better if I create a separate docs in llvm/docs to explain different concepts from the views of: 1. CAS users 2. CAS implementors?

benlangmuir added inline comments.Dec 8 2022, 10:23 AM
llvm/include/llvm/CAS/CASReference.h
163

I agree with Steven; I think ObjectRef deserves the good name here, since that's the one that clients of an ObjectStore will work with. ObjectHandle is only used by implementors of an ObjectStore. I agree that having both Ref and Handle is confusing - maybe we should rename Handle to something more verbose and descriptive like LoadedObjectRef.

Seems like ObjectRef is maybe more suitably called ObjectID?

I think this is confusing with CASID, which really is an ID. I don't have a strong opinion on whether CASID should be renamed to ObjectID or not for consistency.

benlangmuir added inline comments.Dec 8 2022, 10:28 AM
llvm/include/llvm/CAS/ObjectStore.h
60

This question is complicated. Maybe we should write down a spec for the CAS APIs? To this question, the current spec for when to lookup for a CAS object to make sure it exists is totally implementation defined

This complication is only for the implementation though, right? Clients of the object store shouldn't assume the object exists in the CAS, and like @dblaikie said, loading can fail if it is not. I don't think we want to promise that builtin CAS will always succeed when loading, since we might want to change that kind of detail later.

dexonsmith added inline comments.Dec 8 2022, 11:01 AM
llvm/include/llvm/CAS/CASReference.h
163

In my mind, the name ObjectID was "taken", since I originally thought we'd do a s/CASID/ObjectID/g at some point (maybe this patch / before upstreaming is the right time?). Maybe it would be even better to do s/CASID/std::string/.

(Historical note: an earlier design had neither ObjectRef nor ObjectHandle, just CASID. CASIDs were tied to a specific CAS instance, had odd lifetimes, and tried to serve all three purposes. We figured out that this was awkward, especially for distributed/remote CAS implementations.)

ObjectRef is a reference to an object (possibly an ID, possible some internal CAS pointer). As @steven_wu points out, having an ObjectRef doesn't necessarily mean you know anything about the content of an object.

ObjectHandle promises that you can access the object.

I think to explain the difference, it helps to have a specific example. If you have a few objects:

object-id: 0
data: "some small data"

object-id: 1
data: [4GB of data]

object-id: 2
data: "a header"
refs: 0, 1

If you have an ObjectHandle for object 2, then you can see that the data is "a header" and you can get the ObjectRefs for the objects it references: 0 and 1.

Those ObjectRefs allow you to talk about objects 0 and 1 without necessarily loading / downloading them, which is important because object 1 is 4GB big. You can create a new object that references them, and you can get a serialized reference (CASID/ObjectID/std::string) for referring to it in other contexts.

Or, you can "load" an ObjectRef, get an ObjectHandle for that object, and look at its content/refs. If you have a distributed CAS, or the object is big, or [etc.], this action might have significant latency. I imagine we'd eventually want to add an API to allow asynchronous loading.

It's a good point that "ref" here bears no relation to the "ref" in ArrayRef. That makes it confusing. It is identifier-like, even if it doesn't know how to serialize itself into a string (you need the CAS instance to string-ify it for you).

At a high level, we have three concepts:

  • a serialized, context-free identifier (currently CASID)
  • an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)
  • a context-sensitive reference to an object known to "be local/loaded" (currently ObjectHandle)

Maybe the following rename could work:

  1. Delete CASID and replace its usage with std::string
  2. Rename ObjectRef to ObjectID
  3. Maybe: Rename ObjectHandle to ObjectRef

I don't have a strong opinion about whether that results in a better world. The "current" names are kind of etched into my brain so it's hard to evaluate.

@steven_wu and @benlangmuir, WDYT? Is there something I'm missing/forgetting that makes this a bad idea?

@dblaikie, WDYT? Maybe you have other name ideas after the above explanation?

dexonsmith added inline comments.Dec 8 2022, 11:09 AM
llvm/include/llvm/CAS/CASReference.h
163

@benlangmuir, just seeing your reply. I'd have expected clients to work with both ObjectRef and ObjectHandle; but you have much more experience than I do. (I guess the ObjectHandle is only indirectly useful to clients. For any serious work, you want to create a proxy that wraps an ObjectHandle and navigates the content in a structured way.)

Agreed that ObjectRef ought to have a good / easy-to-type name.

steven_wu added inline comments.Dec 8 2022, 11:19 AM
llvm/include/llvm/CAS/CASReference.h
163

@dexonsmith Correct, we clean up the interface that no public methods from ObjectStore return ObjectHandle anymore so users access data in object through ObjectProxy. One less data type to deal with for CAS users and ObjectProxy has slightly better interfaces to type as well.

ObjectHandle is type for CAS implementor only now to represent loaded object.

llvm/include/llvm/CAS/ObjectStore.h
60

True. I will update the wording.

akyrtzi added inline comments.Dec 8 2022, 11:25 AM
llvm/include/llvm/CAS/CASReference.h
163

I'd have expected clients to work with both ObjectRef and ObjectHandle

I don't think ObjectHandle has any value for clients, clients will be using ObjectProxy.

an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)

Something to highlight about this: For a "remote CAS" implementation you practically don't know that "an object is known to exist" until you try to load it. From that respect there is essentially no functional difference between CASID and ObjectRef in such a context.

dexonsmith added inline comments.Dec 8 2022, 11:25 AM
llvm/include/llvm/CAS/CASReference.h
163

Ah, that sounds great (I apologize for not reading the patch itself). Agreed with @benlangmuir that LoadedObjectHandle seems pretty reasonable if clients never have to use it.

Does my understanding the roles still stand? (Maybe that example could/should be used somewhere in the docs to clarify.)

Also:

  • Do we still need CASID? Why not just use std::string?
  • If we don't need CASID and can delete it, WDYT of renaming ObjectRef to ObjectID? (It could be thought of as an "opaque" identifier.)
  • Another idea: rename ObjectRef to ObjectPtr?
dexonsmith added inline comments.Dec 8 2022, 11:29 AM
llvm/include/llvm/CAS/CASReference.h
163

Something to highlight about this: For a "remote CAS" implementation you practically don't know that "an object is known to exist" until you try to load it. From that respect there is essentially no functional difference between CASID and ObjectRef in such a context.

Ah, is that because a remote CAS may have garbage-collected the object? So, then, ObjectRef is just an object we know how to point at?

(Or do you just mean that loading could fail because the remote CAS goes down?)

benlangmuir added inline comments.Dec 8 2022, 11:50 AM
llvm/include/llvm/CAS/CASReference.h
163

Ah, is that because a remote CAS may have garbage-collected the object? So, then, ObjectRef is just an object we know how to point at?
(Or do you just mean that loading could fail because the remote CAS goes down?)

We do not check whether an object exists at all when creating the ObjectRef for a remote CAS: it's a wasted round trip over the network since you can't guarantee the object still exists later, and don't get a performance win for knowing it existed in the past. This is different with the local in-memory and on-disk CASes, because they can check for existence cheaply and provide a faster ref implementation once we know the object exists.

  • a serialized, context-free identifier (currently CASID)

There are two levels here:

  • a serialized, context-free identifier (std::string, e.g. "llvmcas://<serialized hash>")
  • a context-sensitive identifier hash containing raw hash bytes (CASID) -- you need a CASContext to use it

an opaque, context-sensitive reference to an object known to "exist somewhere" (currently ObjectRef)

Not known to exist anywhere, but otherwise yes.

a context-sensitive reference to an object known to "be local/loaded" (currently ObjectHandle)

Correct.

Also:

  • Do we still need CASID? Why not just use std::string?

So the question here is whether CASID is pulling its weight as a client-visible type. It should be more efficient than the serialized string, since you have the hash bytes immediately available rather than parsing them out; it can be smaller for the same reason. I don't know if have measured this -- @steven_wu, @akyrtzi any thoughts?

  • If we don't need CASID and can delete it, WDYT of renaming ObjectRef to ObjectID? (It could be thought of as an "opaque" identifier.)

If we can drop CASID, I'm fine with this.

  • Another idea: rename ObjectRef to ObjectPtr?

I find Ref clearer than Ptr for what it does.

akyrtzi added inline comments.Dec 8 2022, 1:00 PM
llvm/include/llvm/CAS/CASReference.h
163

I like the idea of dropping CASID entirely. I'm also fine with renaming ObjectRef to ObjectID.
I would also suggest that we change

virtual Expected<ObjectHandle> load(ObjectRef Ref)

to

virtual Expected<Optional<ObjectHandle>> load(ObjectRef Ref)

The optional indicates whether the object existed or not, and we reserve errors for catastrophic failures (e.g. "network went down"). The client could choose to turn the "object doesn't exist" None into an error (or not, it depends on the context) but this case should be distinguishable from ObjectStore's API.

steven_wu added inline comments.Dec 8 2022, 1:42 PM
llvm/include/llvm/CAS/CASReference.h
163

I tried to get rid of CASID but there are some desirable features that decide to keep it for now. For example, it distinguishes itself from a printable string.

For builtinCAS, you have a printable string that is something like: llvmcas://abcdefg..., while CASID is the raw hash value (0xabcdefg...) + the context ("llvm.builtin.cas.v1[BLAKE3]"). If we get rid of CASID, you will either have a std::string representation that is llvmcas:// then you need to parse it every time you want access to raw hash, or you decide to have std::string to store the raw hash value, then: 1. you can't store a context anywhere. 2. It is confusing if a function is taking a printable string or a raw hash value, since both of them are std::string.

akyrtzi added inline comments.Dec 8 2022, 1:48 PM
llvm/include/llvm/CAS/CASReference.h
163

then you need to parse it every time you want access to raw hash

Why do I need to parse the string to get the raw hash, I should be able to get the raw hash from the ObjectRef/ObjectID, right?
Once I parse a "llvmcas://" string into an ObjectRef/ObjectID there should be no need to keep the string around anymore.

dexonsmith added inline comments.Dec 8 2022, 1:48 PM
llvm/include/llvm/CAS/CASReference.h
163

What is "every time"? I.e., why do you need to get the hash from the CASID? Could/should some/many places that currently use CASID use an ObjectRef instead?

IOW, maybe some current uses of CASID should switch to ObjectRef or ObjectProxy, and the rest can then be converted to std::string. But I'm not sure. Only asking the question to ensure that you've considered it (vs. momentum of code written with CASID before ObjectRef existed).

dexonsmith added inline comments.Dec 8 2022, 2:51 PM
llvm/include/llvm/CAS/CASReference.h
163

Paging in old memories, I think where CASID remains useful is if you're talking about the same object in two different instances of a CAS (maybe one instance is in-memory, and the other is on-disk or remote), and both have the same CASContext. IIRC, you share a CASContext iff the std::strings mean the same thing. E.g., should be true for any CAS that has a std::string ID starting with llvmcas://, as long as both instances use the same hash function (currently always BLAKE3, right?). (Unless this changed...)

If a client is doing a lot of this, having an optimized representation of the hash/identifier could be useful; it avoids repeatedly serializing/parsing/validating std::strings just to communicate a hash from one CAS instance to another. (In this scenario, it's also maybe nice to avoid a malloc/dealloc, but the hash will always overflow std::string's small storage... that was the motivation to avoid std::string originally.)

I imagine we'll want to do that sort of thing at some point, but not sure how urgent it is (maybe it's already needed?), or whether the optimization is really necessary (maybe the overhead is hidden by other things that are orders of magnitude more expensive?)...

Outside of that multi-CAS-instance-same-CAScontext scenario, I'm not sure I see a reason to avoid using ObjectRef (or std::string).

Another thought: if we don't drop CASID, it could be renamed to ObjectHash.

  • ObjectHash: the raw hash of the object, shareable within a CASContext (né CASID)
  • ObjectID: an opaque identifier for a specific CAS instance (né ObjectRef)
  • LoadedObjectHandle: implementation detail, points at "loaded" object in a CAS instance (né ObjectHandle)
  • ObjectProxy: LoadedObjectHandle+ObjectStore+nice APIs.
akyrtzi added inline comments.Dec 8 2022, 3:06 PM
llvm/include/llvm/CAS/CASReference.h
163

it avoids repeatedly serializing/parsing/validating std::strings just to communicate a hash from one CAS instance to another.

Hmm, it's still unclear to me why you need to traffic "llvmcas://" strings for that; the ObjectRef/ID that you got from one instance should be able to provide the hash bytes for the other. You don't need to re-validate the hash because the instances are context-compatible.

dexonsmith added inline comments.Dec 8 2022, 3:21 PM
llvm/include/llvm/CAS/CASReference.h
163

IIRC, CASID is the only container for the raw hash recognized by ObjectStore. If you take it away, and you want to traffic in raw hashes you need to invent a new container for them (or just let people use ArrayRef).

If you think it’s useful to traffic in raw hashes, then I suggest keeping CASID/ObjectHash, since it has some error checking and dumping built-in.

akyrtzi added inline comments.Dec 8 2022, 4:08 PM
llvm/include/llvm/CAS/CASReference.h
163

or just let people use ArrayRef

I'm personally fine with making it that the way you get an ObjectID is either via parsing a string identifier or via passing the ArrayRef hash bytes (for example, say I stored the hash as raw bytes in a file and now I want to get an ObjectID out of these bytes).

Like a string identifier from a command-line option, the raw hash bytes will be a transient input that I will use to get an ObjectID in order to proceed with the rest of the work, I shouldn't need to be "trafficking" both hash bytes and ObjectIDs, at the same time.

In general, as a client I'd like to have only 3 concepts that I need to be concerned about:

  1. A string identifier for a CAS object that can be parsed and printed (e.g. the "llvmcas://" strings)
  2. A type I can get as a valid ID after parsing a string identifier or after passing the hash bytes for the object (e.g. ObjectID)
  3. A type that contains the loaded data for the object. (e.g. ObjectProxy)

An ObjectHash type could be useful for the implementations but I find it an unnecessary concept for the users of ObjectStore API.

dexonsmith added inline comments.Dec 8 2022, 4:39 PM
llvm/include/llvm/CAS/CASReference.h
163

Sure; encapsulation might be overkill. I don’t feel strongly!

steven_wu updated this revision to Diff 486058.Tue, Jan 3, 12:43 PM

Rebase for std::optional

steven_wu updated this revision to Diff 486696.Thu, Jan 5, 3:32 PM

Add docs in the reference page that address CAS in different prespective.

Still use old names before do a global rename to a better name.

steven_wu updated this revision to Diff 487492.Mon, Jan 9, 10:14 AM

Update patch after HashMap rename

Update after makeArrayRef depreciation.