This is an archive of the discontinued LLVM Phabricator instance.

[Support] compression proposal for a enum->spec->impl approach
Needs ReviewPublic

Authored by ckissane on Aug 16 2022, 2:00 PM.

Details

Summary

[compression] move to a enum->spec->impl approach
+ None compression is signified by a nullptr for a CompressionSpecRef
+ Lack of support (lack of impl) is signified by a nullptr for a CompressionImplRef
+ move away from fancy fake enum approach
An alternative/hybrid between D130516 and D130506 and D131638

Diff Detail

Event Timeline

ckissane created this revision.Aug 16 2022, 2:00 PM
ckissane requested review of this revision.Aug 16 2022, 2:00 PM
Herald added projects: Restricted Project, Restricted Project, Restricted Project. · View Herald TranscriptAug 16 2022, 2:00 PM
ckissane edited the summary of this revision. (Show Details)Aug 16 2022, 2:02 PM

Probably easier to review if it's framed somewhat more like @MaskRay and my examples - backwards compatible, without fixing all the API users as we can discuss those separately.

Though it is useful to have one or two example uses - but having all the API uses being updated in one patch could lead us to discussing the same issues repeatedly if we try to review it all in one go.

llvm/include/llvm/Object/Decompressor.h
52–53

I think the member should be the CompressionImpl, rather than the specref - null if no decompression is requested/needed, and non-null if it's required/provided by the consumeCompressedSectionHeader.

llvm/include/llvm/Support/Compression.h
34–35

Ref when it's actually a pointer might be confusing - not sure if these add more value over using the raw pointers themselves?

37–38

Probably don't need both of these, just the one that takes the enum type?

39

I don't think we need this function?

43

This should probably be a const member & the CompressionImpl's probably immutable, but could be const too, just to be explicit

45

Probably don't need this member - it should be communicated by the non-null-ness of Implementation?

47–49

These could/should probably go in the Implementation since they're not useful when the algorithm isn't available anyway?

86

the imple could have a pointer back to the spec, rather than having to do another lookup/need another function? (though, if the levels are moved to the impl, maybe this API is not needed?)

94–97

Generally we don't want more variables that need global constructors in LLVM - so these should probably be function-local statics in functions instead.
(I don't think we need a CompressionSpecRef for Unknown or None, though)

llvm/lib/Object/Decompressor.cpp
51

This probably isn't a useful error message for a user. And this code is unreachable/untestable, right? The above code would've already errored out on "unsupported compression type"?

ckissane added inline comments.Aug 16 2022, 2:47 PM
llvm/lib/Object/Decompressor.cpp
51

true, it is essentially a repeat of the above

ckissane updated this revision to Diff 453136.Aug 16 2022, 2:51 PM
  • remove Supported member of CompressionSpec
ckissane updated this revision to Diff 453139.Aug 16 2022, 2:57 PM
  • move compression level members from spec to impl
ckissane marked an inline comment as done.Aug 16 2022, 2:58 PM
ckissane added inline comments.
llvm/include/llvm/Support/Compression.h
47–49

done

ckissane marked an inline comment as done.Aug 16 2022, 2:58 PM
ckissane updated this revision to Diff 453140.Aug 16 2022, 3:04 PM
  • compression: remove some usage sugar from
ckissane added inline comments.Aug 16 2022, 3:06 PM
llvm/include/llvm/Support/Compression.h
94–97

these are just shortcuts to the function local statics of CompressionSpecRef getCompressionSpec(uint8_t Kind)

ckissane updated this revision to Diff 453143.Aug 16 2022, 3:32 PM
  • remove extra includes of ADT/Optional

(still a fair few unhandled comments from the last round - guess work to address those is still ongoing)

llvm/include/llvm/Support/Compression.h
94–97

Yeah, though they're still globals with non-trivial construction, which is to be avoided in LLVM ( https://llvm.org/docs/CodingStandards.html#do-not-use-static-constructors )

So they should either be removed (I think that's probably the right tradeoff, and the one I showed in my proposal - callers that want only one specific compression algorithm can pay the small runtime overhead of going through the switch unnecessarily) or replaced with global functions that use function local statics so the initialization is only paid when the functions are called, and not by every program that links in LLVM for any reason.

ckissane added inline comments.Aug 17 2022, 10:38 AM
llvm/include/llvm/Support/Compression.h
94–97

I concur with removal

ckissane updated this revision to Diff 453379.Aug 17 2022, 11:40 AM
  • remove CompressionSpecRefs::{Zlib,ZStd}
ckissane marked 6 inline comments as done.Aug 17 2022, 11:44 AM
ckissane added inline comments.
llvm/include/llvm/Support/Compression.h
39

removed

45

removed Supported bool

86

levels are moved to the impl, [thus] this API is not needed

llvm/lib/Object/Decompressor.cpp
51

redundancy removed

ckissane updated this revision to Diff 453404.Aug 17 2022, 1:09 PM
ckissane marked 3 inline comments as done.
  • remove Compression{Spec,Impl}Ref typedef
ckissane updated this revision to Diff 453409.Aug 17 2022, 1:13 PM
  • small compression test cleanup
ckissane updated this revision to Diff 453723.Aug 18 2022, 11:20 AM
  • fix static initialization skip errors in getCompressionSpec
ckissane marked an inline comment as done.Aug 18 2022, 11:27 AM
ckissane added inline comments.
llvm/include/llvm/Object/Decompressor.h
52–53

note that null if no decompression is requested/needed falls under the pattern for a CompressionSpec* however as a decompressor *assumes* that it is not none, it could still make sense to make it a CompressionImpl* as you request, which is null if unsupported (and error) and non-null if it's provided by the consumeCompressedSectionHeader.

llvm/include/llvm/Support/Compression.h
34–35

removed typdefs

ckissane updated this revision to Diff 453774.Aug 18 2022, 1:58 PM
ckissane marked an inline comment as done.
  • CompressionImplementation member on Decompressor instead of Spec
ckissane marked an inline comment as done.Aug 18 2022, 2:00 PM

@dblaikie I've handled most of your comments, can you take another look?

dblaikie added inline comments.Aug 18 2022, 2:18 PM
llvm/include/llvm/Object/Decompressor.h
52–53

I don't follow, sorry. All the error handling is done in consumeCompressedSectionHeader - the only thing done after/outside that is one unconditional CompressionScheme->Implementation->decompress - so it doesn't need the CompressionScheme only the impl, which it's going to use unconditionally anyway. The Decompressor::decompress will never be called if the header failed to parse/if there isn't a decompressor ready to handle it.

At least so far as I can see from the code?

llvm/lib/ProfileData/InstrProf.cpp
494

looks like this could be changed to pass the implementation, without the spec? (the caller doesn't need/use the spec)

llvm/lib/Support/Compression.cpp
187

Looks like this comment ( https://reviews.llvm.org/D131992#inline-1270577 ) got lost, as there's still two getCompressionSpec functions - please remove the one taking a raw uint8_t, leaving only the enum version. Callers should deal with validating that they're passing a valid enum to that function, and then getCompressionSpec can return a reference, instead of a pointer - since it'd only be passed a valid compression kind?

I think UnknownD (also, what's the D suffix meant to mean? I can't figure it out from context) can probably be removed, and callers can deal with that case themselves?

llvm/unittests/Support/CompressionTest.cpp
87–88

Probably don't need a named variable for this, since it's only used once on the next line? Roll the getCompressionSpec call into the line below?

89

Across this whole patch I think long variable names are probably making things harder to read/glance at than helpful - especially for shorter lived variables (or ones where the scope cares mostly about the variable anyway, so it's not getting lost amongst other variables/intentions in the code) could you use shorter variable names? (even single letter names or acronyms would probably suffice here - but Spec and Impl might be good for the unit test?)

But also, like I said previously - it'd be good to pull out all the use case changes maybe into another patch - so they can be reviewed separately & this review can focus on the API design (though I realize they're connected - it's helpful to se the use case to understand the design impact, but maybe with one or two small examples, rather than having to change every usage in one go)

ckissane updated this revision to Diff 453803.Aug 18 2022, 2:53 PM
  • shorten compression related variable names
ckissane marked 2 inline comments as done.Aug 18 2022, 2:57 PM
ckissane added inline comments.
llvm/include/llvm/Support/Compression.h
37–38

removed one

llvm/lib/ProfileData/InstrProf.cpp
494

(the caller doesn't need/use the spec)...
It definitely might in the future because it writes a header, that (in the conceivably near future) (like one of my follow up patches) could contain the uint 8 value of the compression type from the Spec, and 0 if uncompressed.

Both D131638 and this patch want to use classes with inheritance to support different compression algorithms.
There have been many versions but I think I have never received a convincing argument how such an inheritance based design looks better / is less error-prone than free functions (D130506).

@MaskRay I think this change is probably the best point of comparison/demonstration of this patch direction (taking some minor liberties with this patch to simplify things somewhat in ways that have already been discussed/seem quite reasonable - specifically having getCompresisonSpec return a reference and the enum having either no "unknown" value, or getCompressionSpec asserting on the unknown value):

// removing the hardcoded zlib compression kind parameter in favor of what the code would look like in the near future once it's parameterized on a compression kind
CompressionImplementation *Compressor = getCompressionSpec(CompressionAlgorithmKind)->Implementation;
bool DoCompression = Compress && DoInstrProfNameCompression && Compressor;
if (DoCompression) {
  Compressor->compress(arrayRefFromStringRef(FilenamesStr),
                                  CompressedStr,
                                  Compressor->BestSizeLevel);
}

I think a reasonable speculation about what the alternative would look like in the a non-OO design would be something like:

bool DoCompression = Compress && DoInstrProfNameCompression && isAvailable(CompressionAlgorithmKind);
if (DoCompression) {
  compress(CompressionAlgorithmKind, arrayRefFromStringRef(FilenamesStr), CompressedStr, getBestSizeLevel(CompressionAlgorithmKind));

And having these three correlated calls (isAvailable, compress, and getBestSizeLevel) all made independently/have to have the same compression algorithm kind passed to them independently seems more error prone & redundant (not in an actual execution/runtime efficiency perspective - I don't think any of these different designs would have materially different runtime performance - but in terms of "duplicate work, and work that could diverge/become inconsistent" - essentially the duplicate work/repeated switch statements, one in each compression::* generic entry point seems like a code smell to me that points towards a design that doesn't have that duplication) rather than tying these calls together and making the lookup happen once and then using the features after that. Also exposing the compression algorithm along with the availability in one operation (not exposing compress/decompress/size APIs when the algorithm isn't available) seems to have similar benefits.

I agree it's somewhat overkill for two types of compression, but I don't think it's so burdensome as to be avoided either. I probably wouldn't've put quite this much consideration into the design if I were doing it myself amongst other things, but I appreciate that @ckissane has & I'm pretty happy with where this is now.

A few things still leftover that I think I've mentioned to @ckissane offline:

  • getCompressionSpec returning by reference
  • more incremental patches - implementing the generic API in terms of the old one (or keeping the old one as a wrapper around the generic one) with at most one or two uses of the API (if any) in the first commit - then subsequent API migrations in future commits, eventually removing the old API once the migration is complete
  • having only the getCompressionSpec(CompressionKind) and dropping the version that takes uint8_t - callers should handle mapping from whatever domain-specific format/representation they have into the CompressionKind enum, there's no need for the raw version here
  • Probably removing the 'unknown' CompressionKind - if a caller needs to handle having "maybe a compression kind, or none" they can wrap CompressionKind in Optional themselves and handle not looking up the algorithm if no compression kind is desired/requested
llvm/lib/ProfileData/InstrProf.cpp
494

That should/can wait for the future change.

& I think at that point we could get into what the right design might be - and I was thinking maybe the right design might be for the CompressionImplementation to have a pointer back to its spec. But in general, lets defer that design decision for later - it's not too expensive to change what the parameter types are once the data is needed later.

(again, this might be even simpler addressed by not including this change at all up-front - keeping the old API and leaving these callers until you're patching the callers to add the new functionality of being able to choose between Zlib and Zstd - rather than, for now, plumbing in the new API without any change in functionality - the plumbing + functionality together might be easire to understand the motivation for and review, and doing it separately from most of the generic design work will isolate the usage design discussion so each piece can move forward separately, for the most part)

MaskRay added a comment.EditedAug 30 2022, 1:12 AM

(I was out of town last week so I did not respond then.)

@MaskRay I think this change is probably the best point of comparison/demonstration of this patch direction (taking some minor liberties with this patch to simplify things somewhat in ways that have already been discussed/seem quite reasonable - specifically having getCompresisonSpec return a reference and the enum having either no "unknown" value, or getCompressionSpec asserting on the unknown value):

I am sorry to disagree... As previously mentioned this is an expression problem.
When we maintain no state in the compression algorithms, the OO style has more boilerplate in llvm/lib/Support and in call sites, so I prefer the FP style (free function style) instead.

(Adding a compression algorithm has a significant barrier. I think we tend to add functions more than adding types. The FP style optimizes for adding new functions (while the OO style optimizes for adding new classes), so the FP style wins here.)

// removing the hardcoded zlib compression kind parameter in favor of what the code would look like in the near future once it's parameterized on a compression kind
CompressionImplementation *Compressor = getCompressionSpec(CompressionAlgorithmKind)->Implementation;
bool DoCompression = Compress && DoInstrProfNameCompression && Compressor;
if (DoCompression) {
  Compressor->compress(arrayRefFromStringRef(FilenamesStr),
                                  CompressedStr,
                                  Compressor->BestSizeLevel);
}

I think a reasonable speculation about what the alternative would look like in the a non-OO design would be something like:

bool DoCompression = Compress && DoInstrProfNameCompression && isAvailable(CompressionAlgorithmKind);
if (DoCompression) {
  compress(CompressionAlgorithmKind, arrayRefFromStringRef(FilenamesStr), CompressedStr, getBestSizeLevel(CompressionAlgorithmKind));

Most call sites do not specify the compression level, so getBestSizeLevel(CompressionAlgorithmKind) is omitted.

And having these three correlated calls (isAvailable, compress, and getBestSizeLevel) all made independently/have to have the same compression algorithm kind passed to them independently seems more error prone & redundant (not in an actual execution/runtime efficiency perspective - I don't think any of these different designs would have materially different runtime performance - but in terms of "duplicate work, and work that could diverge/become inconsistent" - essentially the duplicate work/repeated switch statements, one in each compression::* generic entry point seems like a code smell to me that points towards a design that doesn't have that duplication) rather than tying these calls together and making the lookup happen once and then using the features after that. Also exposing the compression algorithm along with the availability in one operation (not exposing compress/decompress/size APIs when the algorithm isn't available) seems to have similar benefits.

In the FP style (free function style), CompressionAlgorithmKind is passed in multiple places. It is as if, in the OO style, the Compressor object is passed in multiple places.

CompressionAlgorithmKind is an scoped enum and there is hardly another variable of the same type in the scope of CompressionAlgorithmKind. So I do not see how the FP style is more error-prone.
To me, not checking isAvailable(CompressionAlgorithmKind) is similar to not checking the nullness of Compressor.