This is an archive of the discontinued LLVM Phabricator instance.

[Bitcode] Add abbreviation for STRUCT_NAME when the name is not char6
Needs Review · Public

Authored by sammccall on May 19 2022, 4:05 PM.

Details

Reviewers
ilya-biryukov
Summary

When emitting bitcode for a C++ file, TYPE.STRUCT_NAME entries are a significant
part of the size. A typical name is "struct.std::_Vector_base.618", and
the record contents are the sequence of characters in that name.

These records are efficiently encoded as arrays of 6-bit chars if each
char is representable in char6 encoding: [A-Za-z0-9._]
This does not include ":", so very few C++ names are encoded this way: 0.4% in
the file I checked. ("<", ">" and space are also common and not encodable.)
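As a rough illustration of the char6 rule above, the following Python sketch checks whether a name fits the [A-Za-z0-9._] alphabet and reports the offending characters. The helper names `is_char6` and `non_char6_chars` are hypothetical and are not taken from the patch or from LLVM.

```python
import string

# The 64-symbol char6 alphabet described in the summary: [A-Za-z0-9._]
CHAR6_ALPHABET = set(string.ascii_letters + string.digits + "._")


def is_char6(name: str) -> bool:
    """Return True if every character of `name` is in the char6 alphabet."""
    return all(c in CHAR6_ALPHABET for c in name)


def non_char6_chars(name: str) -> list[str]:
    """Return the sorted set of characters that block char6 encoding."""
    return sorted(set(name) - CHAR6_ALPHABET)


if __name__ == "__main__":
    for name in ["struct.std::_Vector_base.618", "struct.Foo_1"]:
        print(name, is_char6(name), non_char6_chars(name))
```

For the example name from the summary, the only offending character is ":", which is enough to push the whole record onto the fallback path.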

Before this patch, the fallback is to use unabbreviated encoding: each
character is a vbr6. A vbr6 chunk carries 5 payload bits, so for ~all
characters (ASCII >= 0x20) this means two chunks, i.e. 12 bits per character.
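The 12-bit figure follows from how variable bit rate fields work: each 6-bit chunk holds 5 payload bits plus a continuation bit, so any value of 32 or more needs at least two chunks. A minimal sketch of that calculation, with `vbr_bits` as a hypothetical helper rather than an LLVM function:

```python
def vbr_bits(value: int, width: int = 6) -> int:
    """Number of bits used to emit `value` as a VBR field with `width`-bit chunks.

    Each chunk carries (width - 1) payload bits; the top bit of a chunk
    signals that another chunk follows.
    """
    if value < 0:
        raise ValueError("VBR fields hold unsigned values")
    payload = width - 1
    threshold = 1 << payload
    chunks = 1
    while value >= threshold:
        value >>= payload
        chunks += 1
    return chunks * width


if __name__ == "__main__":
    for ch in ["\t", " ", ":", "z"]:
        print(repr(ch), vbr_bits(ord(ch)))
```

Only control characters below 0x20 fit in a single chunk, which is why printable names pay 12 bits per character on this path.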

After this patch, the fallback is to encode the characters as fixed8
arrays. This saves 4 bits per character (and also 6 bits per
unabbreviated record).
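To make the per-character saving concrete, this sketch compares the character payload of one name under both fallbacks. It counts only the bits spent on the characters themselves; record headers, abbreviation IDs and array length fields are deliberately left out, so the totals are not full record sizes. The function names are hypothetical.

```python
def vbr6_char_bits(ch: str) -> int:
    """Bits for one character emitted as a vbr6 (5 payload bits per 6-bit chunk)."""
    value, chunks = ord(ch), 1
    while value >= 32:
        value >>= 5
        chunks += 1
    return chunks * 6


def payload_bits_unabbreviated(name: str) -> int:
    """Character payload when each character is a vbr6 (old fallback)."""
    return sum(vbr6_char_bits(c) for c in name)


def payload_bits_fixed8(name: str) -> int:
    """Character payload when each character is a fixed 8-bit field (new fallback)."""
    if any(ord(c) > 0xFF for c in name):
        raise ValueError("fixed8 can only hold values up to 255")
    return 8 * len(name)


if __name__ == "__main__":
    name = "struct.std::_Vector_base.618"
    old = payload_bits_unabbreviated(name)
    new = payload_bits_fixed8(name)
    print(len(name), old, new, old - new)  # 28 336 224 112
```

For the 28-character example name this is 336 versus 224 bits, i.e. the 4 bits per character stated above.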

On my test file (bitcode from clang-tools-extra/clangd/ParsedAST.cpp):

overall size               -18% (113 => 93 kB)
STRUCT_NAME fraction       47% => 37%
STRUCT_NAME average size   -33% (451 => 301)

Event Timeline

sammccall created this revision. May 19 2022, 4:05 PM
Herald added a project: Restricted Project. May 19 2022, 4:05 PM
sammccall requested review of this revision. May 19 2022, 4:05 PM