This is an archive of the discontinued LLVM Phabricator instance.

[llvm] compression classes
Needs ReviewPublic

Authored by ckissane on Jul 25 2022, 2:26 PM.

Details

Summary

Adds a CompressionKind "enum" with values:
CompressionKind::Unknown ≈ 255
CompressionKind::Zlib ≈ 1
CompressionKind::ZStd ≈ 2

also note: OptionalCompressionKind is typedef'ed as Optional<CompressionKind>, and NoneType() is used to indicate no compression.

The CompressionKind "enum" has overrides for several operators:

CompressionKind::operator uint8_t() const
(Self-explanatory)

CompressionKind::operator bool() const
(true if supported, false otherwise)

bool operator==(CompressionKind a, CompressionKind b)

(Self-explanatory)

none compression is represented simply as a none type for use cases that will use Optional<CompressionKind>.
once you have a CompressionKind itself you can pass it around as a value (because a CompressionKind is just a struct containing one uint8_t (a fake "enum")).
due to my operator overloading, you can do stuff like this:

llvm::compression::OptionalCompressionKind OptionalCompressionScheme =
          llvm::compression::getOptionalCompressionKind(CompressionSchemeId);
      if (!OptionalCompressionScheme) {
        return llvm::MemoryBuffer::getMemBuffer(Blob, Name, true);
      }
      llvm::compression::CompressionKind CompressionScheme =
          *OptionalCompressionScheme;
      if (!CompressionScheme) {
        Error("compression class " +
              (CompressionScheme->Name + " is not available").str());
        return nullptr;
      }
      SmallVector<uint8_t, 0> Uncompressed;
      if (llvm::Error E = CompressionScheme->decompress(
              llvm::arrayRefFromStringRef(Blob), Uncompressed, Record[0])) {
        Error("could not decompress embedded file contents: " +
              llvm::toString(std::move(E)));
        return nullptr;
      }
      return llvm::MemoryBuffer::getMemBufferCopy(
          llvm::toStringRef(Uncompressed), Name);

(excerpt from ASTReader.cpp)

you can even do calls like

llvm::compression::CompressionKind::Zlib->decompress(...)
(this can be useful in cases like elf decompression where you might switch on debug section type and in a switch case for DebugCompressionType::Z)

I also believe similar semantics to the nullptr suggestion have been achieved, due to the bool cast returning supported status, and the Unknown type acting as a nullptr of sorts, being always unsupported and -1 (255) as a uint8. Optional(Unknown) is returned from getOptionalCompressionKind(uint8_t) when it is not 0 (NoneType()), 1 (Optional(Zlib)), or 2 (Optional(ZStd)).

In places where explict compression must be used CompressionKind can be passed around, and possibly optional (sometimes none) compression is represented as llvm::Optional<CompressionKind> which I have type aliased as OptionalCompressionKind

Diff Detail

Event Timeline

ckissane created this revision.Jul 25 2022, 2:26 PM
ckissane requested review of this revision.Jul 25 2022, 2:26 PM
Herald added projects: Restricted Project, Restricted Project, Restricted Project. · View Herald TranscriptJul 25 2022, 2:26 PM
ckissane updated this revision to Diff 447469.Jul 25 2022, 2:28 PM
  • format
  • merge fix
dblaikie added inline comments.Jul 25 2022, 2:33 PM
clang/lib/Serialization/ASTWriter.cpp
2003–2004

Doesn't this cause slicing & end up with the base implementation?

(also the base class CompressionAlgorithm has no virtual functions, so I'm not sure how this is meant to work - does this code all work? Then I must be missing some things - how does this work?)

MaskRay added a comment.EditedJul 25 2022, 2:41 PM

Thanks for experimenting the refactoring. My gut feeling is that

  • for inheritance llvm/lib/Support/Compression.cpp introduces quite a bit of complexity.
  • BestSpeedCompression/DefaultCompression/BestSizeCompression may be kinda weird. Not all algorithms may need all of three.
  • this new interface does not make parallel compression/decompression easier.

I mentioned this on https://groups.google.com/g/generic-abi/c/satyPkuMisk/m/xRqMj8M3AwAJ

I know the paradox of choice:) Certainly that we should not add a plethora of bzip2, xz, lzo, brotli, etc to generic-abi. I don't think users are so fond of using a different format for a slight different read/write/compression ratio/memory usage need. They want to pick one format which performs well across a wide variety of workloads. (Adding a new format also introduces complexity on their build system side. They need to teach DWARF consumers they use. It's not a small undertaking.)

I think for a long time llvm/lib/Support/Compression.cpp will stay with just zlib and zstd. There is a question whether we want to make the heavier abstraction now. There is a possibility if a llvm-project use case needs an alternative (it needs very strong arguments not using zstd), it has the other approach that not using llvm/Support/Compression.h

ckissane added inline comments.Jul 25 2022, 3:05 PM
clang/lib/Serialization/ASTWriter.cpp
2003–2004

You are correct to observe that this patch does not fully pass around pointers to instances of the classes, however, because I don't pass pointers and the currently repetitive nature of the compression classes, this still functions correctly.
In short, a follow-up patch (which I will shortly upload) will convert this to using class instances and passing those around.
Including reworking functions throughout llvm-project to take advantage of this.
I am aiming to take this 2 step process to cut down on making an already large pass larger.
Let me know if you have any concerns or ideas.

dblaikie added inline comments.Jul 25 2022, 3:25 PM
clang/lib/Serialization/ASTWriter.cpp
2003–2004

But I'm not sure how this patch works correctly - wouldn't the line below (CompressionScheme.supported()) call CompressionAlgorithm::supported() which always returns false?

ckissane added inline comments.Jul 26 2022, 2:12 PM
clang/lib/Serialization/ASTWriter.cpp
2003–2004

good catch

ckissane updated this revision to Diff 447863.Jul 26 2022, 3:57 PM
  • feat: compression class + clang ast serial zstd option + zstd elf + ztsd objcopy support
  • [MC] fix merge issues with removal of GNU compression
  • fix compression class inheritence and passing
  • update usages of -enable-name-compression=false to be -name-compression=none
  • do not exit on parse of bad compression cl::opt
Herald added a project: Restricted Project. · View Herald TranscriptJul 26 2022, 3:57 PM
Herald added subscribers: Restricted Project, Enna1, mgorny. · View Herald Transcript
ckissane retitled this revision from [Support] compression classes to compression classes.Jul 26 2022, 4:02 PM
ckissane updated this revision to Diff 448134.Jul 27 2022, 12:49 PM
  • fix compression class usage in some profile data tests

Any chance this could be split up to be more directly comparable to https://reviews.llvm.org/D130458 ?

clang-tools-extra/clangd/index/Serialization.cpp
35

We're generally trying to avoid global ctors in LLVM. So at most this should be a static local variable in a function that accesses the algorithm (though perhaps these "compression algorithm" classes shouldn't be accessible directly, and only through singleton accessors in the defined alongside - like there's no reason for LLVM to contain more than one instance of ZlibCompressionAlgorithm, I think?)

llvm/include/llvm/Support/Compression.h
72

Rather than supported() maybe the accessor functions could return nullptr when support isn't available?

if (CompressionAlgorithm *A = getZstdCompressionScheme())

etc.

Though I guess that doesn't allow for a default implementation - I guess an alternative function could be CompressionAlgorithm& getCompressionSchemeOrNone(Zstd) which always gives a valid CompressionAlgorithm by giving the do-nothing compression algorithm when the specified one is not available.

But I guess we don't generally want to silently fallback to null compression, because the streams we're producing always need to know if they have to emit headers, etc, or not? So maybe there's no need for a default?

ckissane updated this revision to Diff 448146.Jul 27 2022, 1:38 PM
  • tiny cleanup of using NoneCompressionAlgorithm::AlgorithmId

Any chance this could be split up to be more directly comparable to https://reviews.llvm.org/D130458 ?

yes definitely! Doing so now!

ckissane updated this revision to Diff 448176.Jul 27 2022, 2:58 PM
  • fix compression inheritence and some more compression class helpers
ckissane edited the summary of this revision. (Show Details)Jul 27 2022, 2:59 PM
ckissane added inline comments.
llvm/lib/ProfileData/Coverage/CoverageMappingWriter.cpp
54

note the helpers such as when(bool), whensupported() and notNone()

ckissane updated this revision to Diff 448181.Jul 27 2022, 3:06 PM
  • use CompressionScheme->notNone() in InstrProf
ckissane added inline comments.Jul 27 2022, 3:10 PM
clang-tools-extra/clangd/index/Serialization.cpp
35

your idea seems correct to me, however
some algorithms, such as zstd, support running in multiple threads, so this might influence our decision.

dblaikie added inline comments.Jul 27 2022, 3:16 PM
clang-tools-extra/clangd/index/Serialization.cpp
35

So long as the compression algorithm objects are stateless, this would be acceptable - such objects would be thread safe for multiple concurrent users.

ckissane retitled this revision from compression classes to [llvm] compression classes.Jul 27 2022, 3:31 PM
ckissane marked 2 inline comments as done.Jul 27 2022, 3:38 PM

marked outdated comments as done

dblaikie added inline comments.Jul 27 2022, 3:43 PM
llvm/lib/ProfileData/Coverage/CoverageMappingWriter.cpp
50–51

This seems a bit too convoluted for me.

I'd think something like:

if (DoInstrProfNameCompression) {
  if (CompressionAlgorithm *C = getZlibCompressionAlgorithm())
    C->compress(...);
}

Or even have getCompressionAlgorithm(SupportCompressionType::Zlib) (like that could be the only entry point - no need for algorithm-specific accessors, that function would have one switch over SupportCompressionType, returning null if Unknown or Null were passed, or if the requested algorithm was not available)

I'm not sure I understand the 'when'/'whenSupported' stuff and whether there's any value/need for more details to be communicated in the not-available case other than 'false'/null/nothing (like, if it needs to communicate a reason for non-availability, that's more involved than returning null from some factory/accessor function).

llvm/lib/Support/Compression.cpp
98–102

Maybe these don't need to be static members - if there are singleton insntances of the algorithms, they could be members of those singletons instead (possibly in the base/impl class - the derived classes could pass these values into the base ctor to initialize members in the impl or base - they could even be const public members, avoid the need for accessors (at least avoiding the need for virtual accessors, but hopefully avoiding accessors entirely))

leonardchan added inline comments.Jul 28 2022, 11:40 AM
clang-tools-extra/clangd/index/Serialization.cpp
196–197

Will this leak?

llvm/include/llvm/Support/Compression.h
57–59

Does the uncompress version of this just end up calling into the other uncompress function? If so, we could probably just have one decompress virtual method here and the one that accepts a SmallVectorImpl just calls into the virtual decompress rather than have two virtual methods that would do the same thing. It looks like you've done that in CompressionAlgorithmImpl, but I think it could be moved here.

61–62

Perhaps add some comments for these functions? At least for me, it's not entirely clear what these are for.

67–68

Perhaps it would be simpler to just have the individual subclasses inherit from CompressionAlgorithm rather than have them all go through CompressionAlgorithmImpl? It looks like each child class with methods like getAlgorithmId can just return the static values themselves rather than passing them up to a parent to be returned. I think unless some static polymorphism is needed here, CRTP might not be needed here.

llvm/lib/Support/Compression.cpp
68

Does NoneCompressionAlgorithm imply there's no compression at all? If so, I would think these methods should be empty.

170–171

Is the purpose of UnknownCompressionAlgorithm to be the default instance here? If so, would it be better perhaps to just omit this and have an llvm_unreachable in the default case below? I would assume users of this function should just have the right compression scheme ID they need and any error checking on if something is a valid ID would be done before calling this.

ckissane updated this revision to Diff 448469.Jul 28 2022, 4:15 PM
  • make compression singletons
ckissane updated this revision to Diff 448473.Jul 28 2022, 4:30 PM
  • fix usage of CompressionAlgorithmFromId
MaskRay added a comment.EditedJul 29 2022, 12:01 PM

I'd like to make a few arguments for the current namespace+free function design, as opposed to the class+member function design as explored in this patch (but thanks for the exploration!).
Let's discuss several use cases.

(a) if a use case just calls compress/uncompress. The class design has slightly more boilerplate as it needs to get the algorithm class, a new instance, or a singleton instance.
For each new use, the number of lines may not differ, but the involvement of a a static class member or an instance make the reader wonder whether the object will be reused or thrown away.
There is some slight cognitive burden.
The class design has a non-trivial one-shot cost to have a function returning the singleton instance.

(b) zlib compress/uncompress immediately following an availability check.

// free function
if (!compression::zlib::isAvailable())
  errs() << "cannot compress: " << compression::zlib::buildConfigurationHint();

// class
auto *algo = !compression::ZlibCompression;
if (!algo->isAvailable()) {
  errs() << "cannot compress: " << algo->buildConfigurationHint();
}

(c) zlib/zstd compress/uncompress immediately following an availability check.

// free function
if (!compression::isAvailable(format))
  errs() << "cannot compress: " << compression::buildConfigurationHint(format);

// class
std::unique_ptr<Compression> algo = make_compression(format);
if (!algo->isAvailable()) {
  errs() << "cannot compress: " << algo->buildConfigurationHint();
}

(d) compress/uncompress and an availability check are apart.

// free function
no change

// class
Store (the pointer to the) the algorithm object somewhere, or construct the pointer/object twice.
leonardchan added inline comments.Jul 29 2022, 12:03 PM
llvm/lib/Support/Compression.cpp
30–32

Perhaps for each of these, you could instead have something like:

ZStdCompressionAlgorithm *getZStdCompressionAlgorithm() {
  static ZStdCompressionAlgorithm* instance = new ZStdCompressionAlgorithm;
  return instance;
}

This way the instances are only new'd when they're actually used.

I'd like to make a few arguments for the current namespace+free function design, as opposed to the class+member function design as explored in this patch (but thanks for the exploration!).
Let's discuss several use cases.

(a) if a use case just calls compress/uncompress. The class design has slightly more boilerplate as it needs to get the algorithm class, a new instance, or a singleton instance.
For each new use, the number of lines may not differ, but the involvement of a a static class member or an instance make the reader wonder whether the object will be reused or thrown away.
There is some slight cognitive burden.
The class design has a non-trivial one-shot cost to have a function returning the singleton instance.

Though there must've been a condition that dominates this use somewhere - I'd suggest that condition could be where the algorithm is retrieved, and then passed to this code to use unconditionally.

If the algorithm object is const and raw pointers/references are used, I think it makes it clear to the reader that there's no ownership here, and it's not stateful when compressing/decompressing.

(b) zlib compress/uncompress immediately following an availability check.

// free function
if (!compression::zlib::isAvailable())
  errs() << "cannot compress: " << compression::zlib::buildConfigurationHint();

// class
auto *algo = !compression::ZlibCompression;
if (!algo->isAvailable()) {
  errs() << "cannot compress: " << algo->buildConfigurationHint();
}

I think maybe this code might end up looking like:

Algo *algo = getAlgo(Zlib)
if (!algo)
  errs() ...

It's possible that this function would return non-null even for a non-available algorithm if we wanted to communicate other things (like the cmake macro name to enable to add the functionality)

(c) zlib/zstd compress/uncompress immediately following an availability check.

// free function
if (!compression::isAvailable(format))
  errs() << "cannot compress: " << compression::buildConfigurationHint(format);

// class
std::unique_ptr<Compression> algo = make_compression(format);
if (!algo->isAvailable()) {
  errs() << "cannot compress: " << algo->buildConfigurationHint();
}

I don't think there's a need for unique_ptr here - algorithms can be constant singletons, referenced via raw const pointers/references without ownership.

& this example doesn't include the code that does the compression/decompression, which seems part of the discussion & part I find nice in that the type of compression used matches the type used in the check necessarily rather than being passed into two APIs independently.

(d) compress/uncompress and an availability check are apart.

// free function
no change

// class
Store (the pointer to the) the algorithm object somewhere, or construct the pointer/object twice.

The benefit here is that it's harder for the test to become separated from the usage - for the usage to end up becoming unconditional/incorrectly guarded.

llvm/lib/Support/Compression.cpp
30–32

Yep, I'd mentioned/suggested that (so, seconding here) elsewhere encouraging these to be singletons: https://reviews.llvm.org/D130516#3683384

And they don't even need to be 'new'd in that case, this would be fine:

ZstdCompressionAlgorithm &getZstdCompressionAlgorithm() {
  static ZstdCompressionAlgorithm C;
  return C;
}

Though I think maybe we don't need individual access to the algorithms, and it'd be fine to have only a single entry point like this:

CompressionAlgorithm *getCompressionAlgorithm(DebugCompressionType T) {
  switch (T) {
  case DebugCompressionType::ZStd: {
    static zstd::CompressionAlgorithm Zstd;
    if (zstd::isAvailable())
      return &Zstd;
  }
  ...
  }
  return nullptr;
}

(or, possibly, we want to return non-null even if it isn't available, if we include other things (like the configure macro name - so callers can use that name to print helpful error messages - but then they have to explicitly check if the algorithm is available after the call))

98–102

I don't think there's particular value in these being constexpr members - and maybe we don't need these at all just yet/could leave them out for now? It'd be great to reduce this whole patch to something more comparable with https://reviews.llvm.org/D130458

If you have plans for these other properties it might be helpful to understand what they are - they might help inform the design discussion.

(if we are keeping tnhese properties, including the string version of the name, etc - I'd think the way to do it would be for the base algorithm class to have non-static members to store these, and derived algorithm classes to pass the values into the base ctor to be stored in the members - they could even be const public members of the algorithm to be accessed directly, rather than via accessor functions (& certainly not virtual accessor functions))

ckissane added inline comments.Aug 1 2022, 10:27 AM
llvm/lib/Support/Compression.cpp
30–32

they currently already have singleton behavior i.e. llvm::compression::ZStdCompressionAlgorithm::Instance is the only place new ZStdCompressionAlgorithm() can be put into because the constructor is protected.

I'd rather not achieve "This way the instances are only new'd when they're actually used."
Because the rewards of that are relatively small, but it will make the code more verbose, I think the current pattern allows the best of both worlds of the namespace approach:
(llvm::compression::zlib::compress becomes llvm::compression::ZlibCompression->compress)
but they can be passed as class instances.

(still lots of outstanding comments from my last round, so I won't reiterate those - but waiting for some responses to them)

llvm/lib/Support/Compression.cpp
30–32

Global constructors are to be avoided in LLVM: https://llvm.org/docs/CodingStandards.html#do-not-use-static-constructors
(also these objects don't need to be dynamically allocated with new - they can be directly allocated (as static locals though, not as globals))

ckissane added a subscriber: phosek.Aug 1 2022, 2:33 PM
ckissane updated this revision to Diff 449346.Aug 2 2022, 10:44 AM
  • feat compression "enum" with methods
MaskRay added a comment.EditedAug 2 2022, 11:00 AM

I'd like to make a few arguments for the current namespace+free function design, as opposed to the class+member function design as explored in this patch (but thanks for the exploration!).
Let's discuss several use cases.

(a) if a use case just calls compress/uncompress. The class design has slightly more boilerplate as it needs to get the algorithm class, a new instance, or a singleton instance.
For each new use, the number of lines may not differ, but the involvement of a a static class member or an instance make the reader wonder whether the object will be reused or thrown away.
There is some slight cognitive burden.
The class design has a non-trivial one-shot cost to have a function returning the singleton instance.

Though there must've been a condition that dominates this use somewhere - I'd suggest that condition could be where the algorithm is retrieved, and then passed to this code to use unconditionally.

If the algorithm object is const and raw pointers/references are used, I think it makes it clear to the reader that there's no ownership here, and it's not stateful when compressing/decompressing.

A pointer to a singleton compression class is isomorphic to an enum class CompressionType variable.
Using an enum variable doesn't lose any usage pattern we can do with a pointer to a singleton compression class.
An enum variable allows more patterns, as the allowed values are enumerable (we don't need to worry about -Wswitch for the uses).

Say, we do

auto *algo = !compression::ZlibCompression;
if (!algo)
  ...


algo->compress(...);

either together or apart, the result is similar to the following but with (IMO) slightly larger cognitive burden:

if (!compression::isAvailable(format))
  ...

compression::compress(format);

(b) zlib compress/uncompress immediately following an availability check.

// free function
if (!compression::zlib::isAvailable())
  errs() << "cannot compress: " << compression::zlib::buildConfigurationHint();

// class
auto *algo = !compression::ZlibCompression;
if (!algo->isAvailable()) {
  errs() << "cannot compress: " << algo->buildConfigurationHint();
}

I think maybe this code might end up looking like:

Algo *algo = getAlgo(Zlib)
if (!algo)
  errs() ...

It's possible that this function would return non-null even for a non-available algorithm if we wanted to communicate other things (like the cmake macro name to enable to add the functionality)

I think this is similarly achieved with an enum variable.
With the class based approach, a pointer has a static type of the ancestor compression class and a dynamic type of any possible algorithm.
This is not different from that: the enum variable may have a value the enum class supports.

(c) zlib/zstd compress/uncompress immediately following an availability check.

// free function
if (!compression::isAvailable(format))
  errs() << "cannot compress: " << compression::buildConfigurationHint(format);

// class
std::unique_ptr<Compression> algo = make_compression(format);
if (!algo->isAvailable()) {
  errs() << "cannot compress: " << algo->buildConfigurationHint();
}

I don't think there's a need for unique_ptr here - algorithms can be constant singletons, referenced via raw const pointers/references without ownership.

& this example doesn't include the code that does the compression/decompression, which seems part of the discussion & part I find nice in that the type of compression used matches the type used in the check necessarily rather than being passed into two APIs independently.

Thanks for clarification. Then this fits my "singleton compression classes are isomorphic to an enum CompressionType variable" argument :)

(d) compress/uncompress and an availability check are apart.

// free function
no change

// class
Store (the pointer to the) the algorithm object somewhere, or construct the pointer/object twice.

The benefit here is that it's harder for the test to become separated from the usage - for the usage to end up becoming unconditional/incorrectly guarded.

ckissane updated this revision to Diff 449355.Aug 2 2022, 11:22 AM
  • make a zlib corruption check specific
ckissane updated this revision to Diff 449365.Aug 2 2022, 11:42 AM
  • trim down compression api: remove supported()
ckissane edited the summary of this revision. (Show Details)Aug 2 2022, 12:01 PM
ckissane edited the summary of this revision. (Show Details)Aug 2 2022, 12:09 PM

I'd like to make a few arguments for the current namespace+free function design, as opposed to the class+member function design as explored in this patch (but thanks for the exploration!).
Let's discuss several use cases.

(a) if a use case just calls compress/uncompress. The class design has slightly more boilerplate as it needs to get the algorithm class, a new instance, or a singleton instance.
For each new use, the number of lines may not differ, but the involvement of a a static class member or an instance make the reader wonder whether the object will be reused or thrown away.
There is some slight cognitive burden.
The class design has a non-trivial one-shot cost to have a function returning the singleton instance.

Though there must've been a condition that dominates this use somewhere - I'd suggest that condition could be where the algorithm is retrieved, and then passed to this code to use unconditionally.

If the algorithm object is const and raw pointers/references are used, I think it makes it clear to the reader that there's no ownership here, and it's not stateful when compressing/decompressing.

A pointer to a singleton compression class is isomorphic to an enum class CompressionType variable.

I don't mean to suggest that either design is fundamentally more or less functional - I'm totally OK with/agree that both design directions allow the implementation of all the desired final/end-user-visible functionality.

I'm trying to make a point about which, I think, achieves that goal in a "better" way - that's the space of design discussions, I think - what kinds of (developer, maintenance, etc) costs different designs incur.

Using an enum variable doesn't lose any usage pattern we can do with a pointer to a singleton compression class.

I agree that either design doesn't change what's possible - I do, though, think that the "usage patterns" are meaningfully different between the two designs.

An enum variable allows more patterns, as the allowed values are enumerable (we don't need to worry about -Wswitch for the uses).

Say, we do

auto *algo = !compression::ZlibCompression;
if (!algo)
  ...


algo->compress(...);

either together or apart, the result is similar to the following but with (IMO) slightly larger cognitive burden:

if (!compression::isAvailable(format))
  ...

compression::compress(format);

Specifically two APIs that are related (it's important/necessary to check for availability before calling compress or decompress) in their contracts but unrelated in their API use makes it easier to misuse the APIs and have a situation where the availability check doesn't cover the usage. That's what I think is important/important to discuss here.

(b) zlib compress/uncompress immediately following an availability check.

// free function
if (!compression::zlib::isAvailable())
  errs() << "cannot compress: " << compression::zlib::buildConfigurationHint();

// class
auto *algo = !compression::ZlibCompression;
if (!algo->isAvailable()) {
  errs() << "cannot compress: " << algo->buildConfigurationHint();
}

I think maybe this code might end up looking like:

Algo *algo = getAlgo(Zlib)
if (!algo)
  errs() ...

It's possible that this function would return non-null even for a non-available algorithm if we wanted to communicate other things (like the cmake macro name to enable to add the functionality)

I think this is similarly achieved with an enum variable.
With the class based approach, a pointer has a static type of the ancestor compression class and a dynamic type of any possible algorithm.
This is not different from that: the enum variable may have a value the enum class supports.

I agree that the code is similar in either case, but with a small difference that is important to me - that accessing the algorithm necessarily (to some degree - you could still have code that doesn't test the condition/dereferences null, the same way that code can dereference an empty Optional without checking first - but at least the API I'm suggesting makes clear there's a connection between availability and usage).

(c) zlib/zstd compress/uncompress immediately following an availability check.

// free function
if (!compression::isAvailable(format))
  errs() << "cannot compress: " << compression::buildConfigurationHint(format);

// class
std::unique_ptr<Compression> algo = make_compression(format);
if (!algo->isAvailable()) {
  errs() << "cannot compress: " << algo->buildConfigurationHint();
}

I don't think there's a need for unique_ptr here - algorithms can be constant singletons, referenced via raw const pointers/references without ownership.

& this example doesn't include the code that does the compression/decompression, which seems part of the discussion & part I find nice in that the type of compression used matches the type used in the check necessarily rather than being passed into two APIs independently.

Thanks for clarification. Then this fits my "singleton compression classes are isomorphic to an enum CompressionType variable" argument :)

I don't understand what you're saying here. Could you rephrase/expand a bit?

ckissane updated this revision to Diff 449375.Aug 2 2022, 12:11 PM
  • CompressionKind: clean up param names to == op
ckissane added a comment.EditedAug 2 2022, 12:17 PM

@dblaikie, @MaskRay I think I have worked out something that is the best of both worlds:
none compression is represented simply as a none type for use cases that will use Optional<CompressionKind>.
once you have a CompressionKind itself you can pass it around as a value (because a CompressionKind is just a struct containing one uint8_t (a fake "enum")).
due to my operator overloading, you can do stuff like this:

llvm::compression::OptionalCompressionKind OptionalCompressionScheme =
          llvm::compression::getOptionalCompressionKind(CompressionSchemeId);
      if (!OptionalCompressionScheme) {
        return llvm::MemoryBuffer::getMemBuffer(Blob, Name, true);
      }
      llvm::compression::CompressionKind CompressionScheme =
          *OptionalCompressionScheme;
      if (!CompressionScheme) {
        Error("compression class " +
              (CompressionScheme->getName() + " is not available").str());
        return nullptr;
      }
      SmallVector<uint8_t, 0> Uncompressed;
      if (llvm::Error E = CompressionScheme->decompress(
              llvm::arrayRefFromStringRef(Blob), Uncompressed, Record[0])) {
        Error("could not decompress embedded file contents: " +
              llvm::toString(std::move(E)));
        return nullptr;
      }
      return llvm::MemoryBuffer::getMemBufferCopy(
          llvm::toStringRef(Uncompressed), Name);

(excerpt from ASTReader.cpp)

you can even do calls like

llvm::compression::CompressionKind::Zlib->decompress(...)
(this can be useful in cases like elf decompression where you might switch on debug section type and in a switch case for DebugCompressionType::Z)

ckissane edited the summary of this revision. (Show Details)Aug 2 2022, 12:18 PM
ckissane edited the summary of this revision. (Show Details)Aug 2 2022, 12:25 PM

The current code here still seems more complicated than I'd prefer - looks like currently the size/speed/default levels are currently unused, so maybe we can omit those for now, knowing they will be added?
And the CompressionKind with all its operator overloads seems like a lot of surface area that is pretty non-obvious for usage - boolean testable, logical operator overloads, etc.
Could we have only one decompress/compress function each, for now?
& maybe leave out the name/enum from the base class for now, add it in later (& I think I mentionted in another comment those properties can be non-virtual, maybe even direct const members - passed into the base through the ctor from the derived class)

Maybe it's easier if I either post a patch, or at least more explicitly flesh out what I'm picturing/proposing/suggesting:
Header:

struct CompressionAlgorithm {
  virtual void Compress(...);
  virtual void Decompress(...);
};
enum class CompressionType {
  Zlib, Zstd
};
CompressionAlgorithm *getCompressionAlgorithm(CompressionType);

Implementation:

#if LLVM_ENABLE_ZLIB
struct ZlibCompressionAlgorthim : CompressionAlgorithm {
  void Compress(...) { ... }
  void Decompress(...) { ...}
}
#endif
...
CompressionAlgorithm *getCompressionAlgorithm(CompressionType T) {
  switch (T) {
  case CompressionType::Zlib: {
#if LLVM_ENABLE_ZLIB
    static ZlibCompressionAlgorithm A;
    return &A;
#else
    break;
#endif
  }
...
  }
  return nullptr;
}

Usage:

if (CompressionAlgorithm *C = getCompressionAlgorithm(CompressionType::Zlib) {
  C->compress(...);
}

And, yeah, I think it'd be suitable to eventually add name, type, size/speed/default levels:

struct CompressionAlgorithm {
  const StringRef Name;
  const CompressionType Type;
  const int DefaultLevel;
  const int BestSizeLevel;
  const int BestSpeedLevel;
  virtual void Compress(...);
  virtual void Decompress(...);
protected:
  CompressionAlgorithm(StringRef Name, CompressionType Type, ...) : Name(Name), Type(Type), ... {}
}
struct ZlibCompressionAlgorithm : CompressionAlgorithm {
  ZlibCompressionAlgorithm() : CompressionAlgorithm("zlib", CompressionType::Zlib, 5, 10, 1) { }
  /* as before */
};
...

Though those can be added as needed - good to keep in mind that they're a useful direction to go, but might simplify the review/discussion to omit them for now.

I think I have worked out something that is the best of both worlds:

I think @MaskRay's main concern, which I share to a degree, is that there's a lot of code/complexity here that doesn't currently seem warranted by the size of the problem. So adding more implementation complexity to this patch, even if it does provide some of the benefits (though I don't think the ability to do boolean logic, etc, is the main concern either myself or @MaskRay have - either way we'll have the enum) it's adding a lot more implementation complexity, which is something we're trying to address/reduce.

I'd like to make a few arguments for the current namespace+free function design, as opposed to the class+member function design as explored in this patch (but thanks for the exploration!).
Let's discuss several use cases.

(a) if a use case just calls compress/uncompress. The class design has slightly more boilerplate as it needs to get the algorithm class, a new instance, or a singleton instance.
For each new use, the number of lines may not differ, but the involvement of a a static class member or an instance make the reader wonder whether the object will be reused or thrown away.
There is some slight cognitive burden.
The class design has a non-trivial one-shot cost to have a function returning the singleton instance.

Though there must've been a condition that dominates this use somewhere - I'd suggest that condition could be where the algorithm is retrieved, and then passed to this code to use unconditionally.

If the algorithm object is const and raw pointers/references are used, I think it makes it clear to the reader that there's no ownership here, and it's not stateful when compressing/decompressing.

A pointer to a singleton compression class is isomorphic to an enum class CompressionType variable.

I don't mean to suggest that either design is fundamentally more or less functional - I'm totally OK with/agree that both design directions allow the implementation of all the desired final/end-user-visible functionality.

I'm trying to make a point about which, I think, achieves that goal in a "better" way - that's the space of design discussions, I think - what kinds of (developer, maintenance, etc) costs different designs incur.

[...]

I don't understand what you're saying here. Could you rephrase/expand a bit?

Maybe it's easier if I either post a patch, or at least more explicitly flesh out what I'm picturing/proposing/suggesting:

I agree that it is easier if you post a patch so that we can have discussion on concrete code. I think we are now at the "a picture is worth a thousand words" stage:)

The current code here still seems more complicated than I'd prefer - looks like currently the size/speed/default levels are currently unused, so maybe we can omit those for now, knowing they will be added?
And the CompressionKind with all its operator overloads seems like a lot of surface area that is pretty non-obvious for usage - boolean testable, logical operator overloads, etc.
Could we have only one decompress/compress function each, for now?
& maybe leave out the name/enum from the base class for now, add it in later (& I think I mentionted in another comment those properties can be non-virtual, maybe even direct const members - passed into the base through the ctor from the derived class)

Yes, I can continue to trim down the implementation! I agree with your sentiment.

Maybe it's easier if I either post a patch, or at least more explicitly flesh out what I'm picturing/proposing/suggesting:
Header:

struct CompressionAlgorithm {
  virtual void Compress(...);
  virtual void Decompress(...);
};
enum class CompressionType {
  Zlib, Zstd
};
CompressionAlgorithm *getCompressionAlgorithm(CompressionType);

Implementation:

#if LLVM_ENABLE_ZLIB
struct ZlibCompressionAlgorthim : CompressionAlgorithm {
  void Compress(...) { ... }
  void Decompress(...) { ...}
}
#endif
...
CompressionAlgorithm *getCompressionAlgorithm(CompressionType T) {
  switch (T) {
  case CompressionType::Zlib: {
#if LLVM_ENABLE_ZLIB
    static ZlibCompressionAlgorithm A;
    return &A;
#else
    break;
#endif
  }
...
  }
  return nullptr;
}

I agree with some of this, I have some strong thoughts I would like to work out about the whole nullptr being none or unsupported a little preemptively IMO.

Usage:

if (CompressionAlgorithm *C = getCompressionAlgorithm(CompressionType::Zlib) {

C->compress(...);

}

currently, you can do

if (CompressionKind C = CompressionKind::Zlib) {
  C->compress(...);
}

And, yeah, I think it'd be suitable to eventually add name, type, size/speed/default levels:

struct CompressionAlgorithm {
  const StringRef Name;
  const CompressionType Type;
  const int DefaultLevel;
  const int BestSizeLevel;
  const int BestSpeedLevel;
  virtual void Compress(...);
  virtual void Decompress(...);
protected:
  CompressionAlgorithm(StringRef Name, CompressionType Type, ...) : Name(Name), Type(Type), ... {}
}
struct ZlibCompressionAlgorithm : CompressionAlgorithm {
  ZlibCompressionAlgorithm() : CompressionAlgorithm("zlib", CompressionType::Zlib, 5, 10, 1) { }
  /* as before */
};
...

Though those can be added as needed - good to keep in mind that they're a useful direction to go, but might simplify the review/discussion to omit them for now.

ckissane updated this revision to Diff 449421.EditedAug 2 2022, 1:51 PM

compression api: greatly simplify implementation
+ cuts around 100 lines of code from compression.h and compression.cpp

ckissane edited the summary of this revision. (Show Details)Aug 2 2022, 1:54 PM

Yes, I can continue to trim down the implementation! I agree with your sentiment.

Thanks! This update helps - though I think we'll still want to further isolate the different pieces of this change/reduce this further.

I agree with some of this, I have some strong thoughts I would like to work out about the whole nullptr being none or unsupported a little preemptively IMO.

Could you clarify what use cases you have in mind that require the nuance between none and unsupported? (arguably this accessor function could assert when passed None - would that make it simpler? Then the only null return would be unsupported)

Usage:

if (CompressionAlgorithm *C = getCompressionAlgorithm(CompressionType::Zlib) {

C->compress(...);

}

currently, you can do

if (CompressionKind C = CompressionKind::Zlib) {
  C->compress(...);
}

The implementation complexity is a concern too, though. I think having CompressionKind, boolean conversions and logical operator overloads, etc, in addition to the CompressionAlgorithm doesn't seem to provide enough to justify the complexity - but perhaps I'm missing some context/understanding of the values those features provide?

What sort of use cases do you have in mind that necessitate that complexity/functionality? (specifically I see a lot of || llvm::NoneType() which seems really obtuse/unclear why a user of the API would think to do that/understand that was the right/necessary thing to do)

Maybe for comparison purposes it'd be good not to replace this API but to add it on top of the underlying API so only one callsite can be updated, rather than all the changes necessary to update all clients in one go (& in this way maybe omit some of the functionality in this first patch, since it won't have to cover all use cases - eg: those extra fields (name/compression levels (best size/speed/default), etc) could possibly be omitted from the first version of this patch, so the patch only adds enough functionality for one of the compresion clients (like MC, to match/compare with @MaskRay's patch) rather than all of them)

llvm/include/llvm/Support/Compression.h
48–51

Could we skip this wrapper & just have the underlying function (& also we shouldn't be overloading by case like this anyway - please name all the decompress/compress functions with the same case/spelling)

ckissane added inline comments.Aug 3 2022, 11:20 AM
llvm/include/llvm/Support/Compression.h
48–51

good point

ckissane updated this revision to Diff 449772.Aug 3 2022, 1:30 PM
  • remove uppercase Compress, Decompress
ckissane updated this revision to Diff 450093.Aug 4 2022, 12:05 PM
  • remove compression kind || && overload
ckissane edited the summary of this revision. (Show Details)Aug 4 2022, 12:08 PM

@dblaikie @MaskRay I would like it if you could all take another look.

In response to @dblaikie 's comments about implementation weight I have greatly simplified the implementation, including removing extra capitalized function overloads (Compress, Decompress), and removing || and && operator overrides. Also adopting the recently suggested class impl in part.

I also believe similar semantics to the nullptr suggestion have been achieved, due to the bool cast returning supported status, and the Unknown type acting as a nullptr of sorts, being always unsupported and -1 (255) as a uint8. Optional(Unknown) is returned from getOptionalCompressionKind(uint8_t) when it is not 0 (NoneType()), 1 (Optional(Zlib)), or 2 (Optional(ZStd)).

In places where explict compression must be used CompressionKind can be passed around, and possibly optional (sometimes none) compression is represented as llvm::Optional<CompressionKind> which I have type aliased as OptionalCompressionKind

ckissane edited the summary of this revision. (Show Details)Aug 4 2022, 2:56 PM
ckissane edited the summary of this revision. (Show Details)
ckissane updated this revision to Diff 450169.Aug 4 2022, 4:12 PM
ckissane edited the summary of this revision. (Show Details)
  • fix some nits in Compression.h
ckissane updated this revision to Diff 450176.Aug 4 2022, 4:23 PM
  • move if into switch
leonardchan added inline comments.Aug 4 2022, 5:39 PM
clang-tools-extra/clangd/index/Serialization.cpp
251–252

nit: I think error accepts format-like arguments, so you could have something similar to above with:

return error("Compressed string table, but {0} is unavailable", CompressionScheme->Name);
clang/lib/Serialization/ASTReader.cpp
1469–1471
1475–1476

I think this Error takes a StringRef, so I think you could have:

return Error("compression class " + CompressionScheme->Name + " is not available");
lld/ELF/InputSection.cpp
1230

typename might not be needed here

llvm/include/llvm/Support/Compression.h
28–112
40

this-> might not be needed here

86

I think this cast might not be needed

106

I think this cast might not be needed

llvm/lib/ObjCopy/ELF/ELFObject.cpp
442–461

What's the explanation for having the llvm_unreachable branch and getting the compression type? I would've thought this section would just be:

if (Error Err1 = compression::CompressionKind::Zlib->decompress(
        Compressed, DecompressedContent, static_cast<size_t>(Sec.Size))) {
  return createStringError(errc::invalid_argument,
                           "'" + Sec.Name +
                               "': " + toString(std::move(Err1)));
}

which looks like it would have identical behavior to the old code.

530–537

Same here. Should this just be compression::CompressionKind::Zlib->compress(OriginalData, CompressedData);? If this is in preparation for the ELF+zstd changes, perhaps we should save those for another patch once that lands?

llvm/lib/Support/Compression.cpp
53–79

nit: add overrides to be more explicit these are virtual methods

83–95

If the llvm_unreachables should be the default implementation for all subclasses, perhaps the [de]compress methods should be regular virtual with these default implementations rather than abstract virtual.

152

I think this cast might not be needed

152–155

Same here

179–182
dblaikie added inline comments.Aug 4 2022, 11:47 PM
llvm/lib/ProfileData/Coverage/CoverageMappingWriter.cpp
54–60

This still seems like a lot of hoops to jump through - why "noneIfUnsupported" rather than either having the compression scheme (I think it could be the CompressionAlgorithm itself, rather than having the separate OptionalCompressionKind abstraction) either be null itself, or expose an "isAvailable" operation directly on the CompressionAlgorithm?

Even if the CompressionKind/OptionalCompressionKind/CompressionAlgorithm abstractions are kept, I'm not sure why the above code is preferred over, say:

if (Compress && DoInstrProfNameCompression && OptionalCompressionScheme /* .isAvailable(), if we want to be more explicit */) {
  ...
}

What's the benefit that noneIfUnsupported is providing? (& generally I'd expect the Compress && DoInstrProfNameCompression to be tested/exit early before even naming/constructing/querying/doing anything with the compression scheme/algorithm/etc - so there'd be no need to combine the tests for availability and the tests for whether compression was requested)

Perhaps this API is motivated by a desire to implement something much closer to the original code than is necessary/suitable? Or some other use case/benefit I'm not quite understanding yet?

ckissane added inline comments.Aug 5 2022, 11:09 AM
llvm/lib/ProfileData/Coverage/CoverageMappingWriter.cpp
54–60

I shall remove noneIfUnsupported. You express good points, we can simply check if(OptionalCompressionScheme && *OptionalCompressionScheme) where necessary.

ckissane added inline comments.Aug 5 2022, 11:11 AM
llvm/lib/ProfileData/Coverage/CoverageMappingWriter.cpp
54–60

though it will make a lot of existing code patterns less clear, and more verbose

ckissane added inline comments.Aug 5 2022, 11:13 AM
llvm/lib/ProfileData/Coverage/CoverageMappingWriter.cpp
54–60

and sometimes you really do need to re code the exact thing noneIfUnsupported encapsulates...

dblaikie added inline comments.Aug 5 2022, 11:21 AM
llvm/lib/ProfileData/Coverage/CoverageMappingWriter.cpp
54–60

Are there examples within LLVM that you can show compare/contrast noneIfUnsupported helps?

ckissane updated this revision to Diff 450338.Aug 5 2022, 11:32 AM
  • cleanup some compression nits
ckissane added inline comments.Aug 5 2022, 11:36 AM
llvm/lib/ProfileData/Coverage/CoverageMappingWriter.cpp
54–60

yes, I'll paste a couple here

ckissane added inline comments.Aug 5 2022, 11:59 AM
llvm/lib/ProfileData/Coverage/CoverageMappingWriter.cpp
54–60

Ok, So I believe I was mistaken.
In older versions of this patch there was a none compression implementation that just did a memcpy, this made a natural fall through state, which made this type of pattern advantageous.
However, this is no longer the case.
Hence I will remove this without further adue.
Thank you for your astute observation!

ckissane updated this revision to Diff 450343.Aug 5 2022, 12:03 PM
  • remove OptionalCompressionKind noneIfUnsupported
ckissane added inline comments.Aug 5 2022, 12:06 PM
clang/lib/Serialization/ASTReader.cpp
1475–1476

unfortunately that doesn't work (I tried)

ckissane updated this revision to Diff 450345.Aug 5 2022, 12:09 PM
  • format error string
ckissane updated this revision to Diff 450392.Aug 5 2022, 2:00 PM
  • fix InputSection decompress issue
ckissane updated this revision to Diff 450401.Aug 5 2022, 2:08 PM
  • Merge remote-tracking branch 'origin/main' into ckissane.compression-class-simple

This is looking pretty close to what I've been picturing - the only thing remaining is that I think we could get rid of CompressionKind and access the CompressionAlgorithm directly - basically the contents of CompressionKind::operator-> could be a free/public function const CompressionAlgorithm &getCompressionAlgorithm(enum class CompressionKind); - and have it return a reference to the local static implementation, with a none implementation (for those profile cases where you want to pass in an algorithm if it's available, or none) and one implementation for each of zlib/zstd?

clang-tools-extra/clangd/index/Serialization.cpp
238

What purpose is this expression serving? (isn't it a tautology, given that CompressionScheme was initialized a couple of lines back with CompressionKind::Zlib?

clang/lib/Serialization/ASTWriter.cpp
2003–2009

Generally LLVM style rolls these together

llvm/lib/Object/Decompressor.cpp
41–50

Maybe leave this code more like it was before - it can turn into a switch over ELFCompressionSchemeId when Zstd is added here.

llvm/lib/ProfileData/Coverage/CoverageMappingReader.cpp
122–132

Be nice to share the same CompressionKind

llvm/lib/ProfileData/Coverage/CoverageMappingWriter.cpp
51–59

Why/how does OptionalCompressionKind end up in here, compared with this (suggested edit)?

llvm/lib/ProfileData/InstrProf.cpp
465

Yeah, this seems awkward, but I see what you're getting at - if you're going to pass around an algorithm to use, and this particular kind of use case wants to collapse the "algorithm not available" and "no compression was requested" - so, yeah, for this sort of use case I can see the valid in having a null compression algorithm implementation. That shouldn't add a lot of complexity to the implementation and simplify the usage so it doesn't have two layers of "not present" state.

492

Presumably this doesn't need to test OptionalCompressionScheme here, though, since it's tested in the implementation?

llvm/lib/ProfileData/SampleProfReader.cpp
881–886

Nice to pull out a common variable rather than accessing CompressionKind::Zlib twice independently (otherwise we probably might as well go with the more direct API @MaskRay has proposed)

ckissane updated this revision to Diff 451247.Aug 9 2022, 12:58 PM
  • address review comments
ckissane added a comment.EditedAug 9 2022, 2:10 PM

This is looking pretty close to what I've been picturing - the only thing remaining is that I think we could get rid of CompressionKind and access the CompressionAlgorithm directly - basically the contents of CompressionKind::operator-> could be a free/public function const CompressionAlgorithm &getCompressionAlgorithm(enum class CompressionKind); - and have it return a reference to the local static implementation, with a none implementation (for those profile cases where you want to pass in an algorithm if it's available, or none) and one implementation for each of zlib/zstd?

I can see what you are asking for, however since its behavior is essentially the same, and still uses both enum values and class implementations, I don't see any practical advantages (though if there is something I am failing to observe let me know).
Additionally, I can see some disadvantages in largely increasing code verbosity across the codebase.
A snippet of an example from elf: if(CompressionKind::Zlib) CompressionKind::Zlib->compress...
would turn into if(getCompressionAlgorithm(CompressionKind::Zlib)) getCompressionAlgorithm(CompressionKind::Zlib)->compress...
(assuming bool operator overload is also moved to class)
Because of this I am leaning away from such an implementation change.

I have only taken very brief look at the new version. Having an enum class CompressionKind with a parallel CompressionAlgorithm seems redundant.
friend CompressionAlgorithm *CompressionKind::operator->() const; looks magical.

I hope that someone insisting on object-oriented design can put up a version with less boilerplate to compete with D130506.

This is looking pretty close to what I've been picturing - the only thing remaining is that I think we could get rid of CompressionKind and access the CompressionAlgorithm directly - basically the contents of CompressionKind::operator-> could be a free/public function const CompressionAlgorithm &getCompressionAlgorithm(enum class CompressionKind); - and have it return a reference to the local static implementation, with a none implementation (for those profile cases where you want to pass in an algorithm if it's available, or none) and one implementation for each of zlib/zstd?

I can see what you are asking for, however since its behavior is essentially the same, and still uses both enum values and class implementations, I don't see any practical advantages (though if there is something I am failing to observe let me know).

The intent is to simplify both implementation and usage because this currently feels like overkill for a fairly small abstraction benefit (compared to @MaskRay's posted alternative, for instance) - we're abstracting only a handful of use cases over only 2 implementations, so this shouldn't be too complicated/overengineered.

Additionally, I can see some disadvantages in largely increasing code verbosity across the codebase.
A snippet of an example from elf: if(CompressionKind::Zlib) CompressionKind::Zlib->compress...
would turn into if(getCompressionAlgorithm(CompressionKind::Zlib)) getCompressionAlgorithm(CompressionKind::Zlib)->compress...

I think in either case this should be changed to something like this (specifically avoid writing CompressionKind::Zlib twice):

if (auto C = CompressionKind::Zlib)
  C->compress...

(assuming bool operator overload is also moved to class)
Because of this I am leaning away from such an implementation change.

I have only taken very brief look at the new version. Having an enum class CompressionKind with a parallel CompressionAlgorithm seems redundant.
friend CompressionAlgorithm *CompressionKind::operator->() const; looks magical.

I hope that someone insisting on object-oriented design can put up a version with less boilerplate to compete with D130506.

Posted something more comparable to D130506 in D131638 - hard to compare, though - D130506 is additive, whereas D131638 and this D130516 are more replacements - though a different with my change there is at least for the initial patch leaving the old APIs in place, with the intent to incrementally change the usages until the old API can be removed.