This is an archive of the discontinued LLVM Phabricator instance.

Replace long type names in IR with hashes
AbandonedPublic

Authored by sepavloff on Nov 27 2017, 10:00 AM.

Details

Summary

If a source file extensively uses templates, resulting LLVM IR may have
types with huge names. It may occur if a record type is defined in a class.
In this case its type name contains all declaration contexts and, if a
declaration context is a class specialization, it is specified with
template parameters.

This change implements transformation of long IR type names. If name length
exceeds some limit, it is truncated and SHA1 hash of its full name is
appended to the obtained abbreviated name. Such solution could reduce memory
footprint and still keep names usable for identification.

To implement this algorithm functions PrintTemplateArgumentList were changed.
They try to make their output a valid C++ code. For this purpose they ensure
that digraph '<:' and tokens '>>' are not formed by inserting space between
characters. The implementation prints template arguments into a separate
stream and then put its content into 'uplevel' stream adding space before
and/or after the text if necessary. Such implementation prevents from using
special stream implementations because the intermediate stream is always of
the same type. To cope with this problem, a new flag in PrintingPolicy is
introduced, which turns off checks for undesirable character sequences. In
this case the intermediate stream becomes unneeded and printing occurs into
the same stream.

Event Timeline

sepavloff created this revision.Nov 27 2017, 10:00 AM
rnk added a subscriber: rnk.Nov 27 2017, 1:25 PM

It's not clear to me that this abbreviation functionality should live in Support. Clang probably wants enough control over this (assuming we're doing it) that the logic should live in clang.

I also think we might want to try solving this a completely different way: just don't bother emitting more than two template arguments for IR type names. We don't really need to worry about type name uniqueness or matching them across TUs.

lib/AST/TypePrinter.cpp
1532–1534

static is nicer than a short anonymous namespace.

1543

Just use SmallString<0> for Buffer. No wasted stack space, no extra unique_ptr. It will allocate memory if it needs it.

1588–1589

It's usually nicer to define free functions in namespaces as void clang::printTemplateArgumentList(.... This ensures that nobody can mess up the namespace scope or forget to include the header that declares the function. It also sometimes turns link errors into compile errors.

rjmccall edited edge metadata.Nov 27 2017, 8:39 PM

In Swift's IRGen, we have an option controlling whether to emit meaningful value names. That option can be set directly from the command line, but it defaults to whether we're emitting IR assembly (i.e. .ll, not .bc), which means that the clients most interested in getting meaningful value names — humans, presumably — always get good names, but nobody else pays for them. I have many, many times wished that we'd taken that same approach in Clang instead of basing it on build configuration.

If type names are a significant burden on compile times, should we just start taking that same approach? Just don't use real type names in .bc output unless we're asked to. Maybe we can eventually retroactively use the same option for value names.

I agree with Reid that it's really weird for there to be a raw_ostream that automatically rewrites the string contents to be a hash when some limit is reached; that does not feel like generalizable technology.

include/clang/AST/PrettyPrinter.h
231

This saves, what, a few spaces from a few thousand IR type names? I'm skeptical that this even offsets the code-size impact of adding this option.

include/clang/AST/Type.h
4623

I like this refactor, but since it's the majority of your patch, please split it out (it can use post-commit review) and make this patch just about your actual change.

lib/AST/TypePrinter.cpp
1543

Well, it might as well have some stack storage, but otherwise I agree.

sepavloff updated this revision to Diff 124587.Nov 28 2017, 9:00 AM

Updated patch as part of it was committed in rL319178

sepavloff marked 3 inline comments as done.Nov 28 2017, 9:22 AM
In D40508#936617, @rnk wrote:

It's not clear to me that this abbreviation functionality should live in Support. Clang probably wants enough control over this (assuming we're doing it) that the logic should live in clang.

I also think we might want to try solving this a completely different way: just don't bother emitting more than two template arguments for IR type names. We don't really need to worry about type name uniqueness or matching them across TUs.

Actually the intention is to have unique type names across different TUs.
I will publish the relevant patch a bit latter, but the problem we are solving is in incorrect behavior of llvm-link. If one TU contains opaque type and the other has the type definition, these two types must be merged into one. The merge procedure relies on type names, so it is important to have the same type names for types in different TUs that are equivalent in the sense of C++.

include/clang/AST/PrettyPrinter.h
231

Not these few spaces themselves make the issue. The real evil is that to insert these spaces, printTemplateArgumentList had to print each template parameter into intermediate stream. We could try to use raw_abbrev_ostream to reduce memory consumption, it would not work because these intermediate streams are of type raw_svector_ostream and all these huge parameter texts would be materialized first and only then would go to compacting stream.
If no need to maintain compilable output, intermediate streams are not needed and all input can go directly to raw_abbrev_ostream.

lib/AST/TypePrinter.cpp
1532–1534

Yes, but this is function template. It won't create symbol in object file. Actually anonymous namespace has no effect here, it is only a documentation hint.

In Swift's IRGen, we have an option controlling whether to emit meaningful value names. That option can be set directly from the command line, but it defaults to whether we're emitting IR assembly (i.e. .ll, not .bc), which means that the clients most interested in getting meaningful value names — humans, presumably — always get good names, but nobody else pays for them. I have many, many times wished that we'd taken that same approach in Clang instead of basing it on build configuration.

If type names are a significant burden on compile times, should we just start taking that same approach? Just don't use real type names in .bc output unless we're asked to. Maybe we can eventually retroactively use the same option for value names.

This is indeed reasonable approach, I will implement it the subsequent patch. Actually we could vary name length to achieve better readability or same memory, as the hash is calculated for entire type name and remains the same.

I agree with Reid that it's really weird for there to be a raw_ostream that automatically rewrites the string contents to be a hash when some limit is reached; that does not feel like generalizable technology.

It reduces type names and at the same time keeps type uniqueness across TUs. It also does not require massive changes in printing methods. Probably the intent will be more clear when I publish the next patch of this set.

D40567 presents a patch that adds template argument names to class template specializations. It also describes the use case in which type names are used to identify type across different TUs.
Adding template parameters obviously increase memory consumption. This patch provides a way to reduce it.
Note that this issue arises only if source is compiled to bitcode (.ll or .bc). If compilation is made to machine representation (.s or .o), IR type name are not needed.

My skepticism about the raw_ostream is not about the design of having a custom raw_ostream subclass, it's that that subclass could conceivably be re-used by some other client. It feels like it belongs as an internal hack in Clang absent some real evidence that someone else would use it.

lib/AST/TypePrinter.cpp
1532–1534

Nonetheless, we generally prefer to use 'static' on internal functions, even function templates, instead of putting them in anonymous namespaces.

My skepticism about the raw_ostream is not about the design of having a custom raw_ostream subclass, it's that that subclass could conceivably be re-used by some other client. It feels like it belongs as an internal hack in Clang absent some real evidence that someone else would use it.

This class can be helpful in various cases where string identifier must persistently identify an object and memory consumption must be low. It may be:

  • Name mangling,
  • Symbol obfuscation,
  • More convenient replacement for raw_sha1_ostream (the latter produces binary result, while raw_abbrev_ostream produces text),
  • If we introduce an option that allows a user to specify how many symbols of full type name are kept in abbreviated name, then llvm-link may see types named as foo<int> and 2544896211ff669ed44dccd265f4c9163b340190, if they come from modules compiled with different options. To find out that these are the same type, it must have access to the same machinery.
lib/AST/TypePrinter.cpp
1532–1534

OK, fixed in rL319290.

My skepticism about the raw_ostream is not about the design of having a custom raw_ostream subclass, it's that that subclass could conceivably be re-used by some other client. It feels like it belongs as an internal hack in Clang absent some real evidence that someone else would use it.

This class can be helpful in various cases where string identifier must persistently identify an object and memory consumption must be low. It may be:

  • Name mangling,
  • Symbol obfuscation,
  • More convenient replacement for raw_sha1_ostream (the latter produces binary result, while raw_abbrev_ostream produces text),

There's no requirement to persistently identify an object here.

  • If we introduce an option that allows a user to specify how many symbols of full type name are kept in abbreviated name, then llvm-link may see types named as foo<int> and 2544896211ff669ed44dccd265f4c9163b340190, if they come from modules compiled with different options. To find out that these are the same type, it must have access to the same machinery.

The LLVM linking model does not actually depend on struct type names matching. My understanding is that, at best, it uses that as a heuristic for deciding whether to make an effort to unify two types, but it's not something that's ultimately supposed to impact IR semantics.

If we needed IR type names to match reliably, we would use a mangled name, not a pretty-print.

My skepticism about the raw_ostream is not about the design of having a custom raw_ostream subclass, it's that that subclass could conceivably be re-used by some other client. It feels like it belongs as an internal hack in Clang absent some real evidence that someone else would use it.

This class can be helpful in various cases where string identifier must persistently identify an object and memory consumption must be low. It may be:

  • If we introduce an option that allows a user to specify how many symbols of full type name are kept in abbreviated name, then llvm-link may see types named as foo<int> and 2544896211ff669ed44dccd265f4c9163b340190, if they come from modules compiled with different options. To find out that these are the same type, it must have access to the same machinery.

The LLVM linking model does not actually depend on struct type names matching. My understanding is that, at best, it uses that as a heuristic for deciding whether to make an effort to unify two types, but it's not something that's ultimately supposed to impact IR semantics.

It is mainly true with an exception, when llvm-link resolves opaque types it relies on type name only. And this behavior creates the issue that D40567 tries to resolve.

If we needed IR type names to match reliably, we would use a mangled name, not a pretty-print.

There is no requirement for IR type name to be an identifier, so pretty-print fits the need of type identification.

My skepticism about the raw_ostream is not about the design of having a custom raw_ostream subclass, it's that that subclass could conceivably be re-used by some other client. It feels like it belongs as an internal hack in Clang absent some real evidence that someone else would use it.

This class can be helpful in various cases where string identifier must persistently identify an object and memory consumption must be low. It may be:

  • If we introduce an option that allows a user to specify how many symbols of full type name are kept in abbreviated name, then llvm-link may see types named as foo<int> and 2544896211ff669ed44dccd265f4c9163b340190, if they come from modules compiled with different options. To find out that these are the same type, it must have access to the same machinery.

The LLVM linking model does not actually depend on struct type names matching. My understanding is that, at best, it uses that as a heuristic for deciding whether to make an effort to unify two types, but it's not something that's ultimately supposed to impact IR semantics.

It is mainly true with an exception, when llvm-link resolves opaque types it relies on type name only. And this behavior creates the issue that D40567 tries to resolve.

It is not clear from that report what the actual problem is. Two incomplete types get merged by the linker? So what?

If we needed IR type names to match reliably, we would use a mangled name, not a pretty-print.

There is no requirement for IR type name to be an identifier, so pretty-print fits the need of type identification.

Not really; pretty-printing drops a lot of information that is pertinent in a stable identifier, like scopes and so on, and makes arbitrary decisions about other things, like where to insert spaces, namespace qualifiers, etc.

The LLVM linking model does not actually depend on struct type names matching. My understanding is that, at best, it uses that as a heuristic for deciding whether to make an effort to unify two types, but it's not something that's ultimately supposed to impact IR semantics.

It is mainly true with an exception, when llvm-link resolves opaque types it relies on type name only. And this behavior creates the issue that D40567 tries to resolve.

It is not clear from that report what the actual problem is. Two incomplete types get merged by the linker? So what?

llvm-link is expected to produce IR that is semantically consistent with the program initially represented by a set of TUs. In this case it is not true. A function defined in source as foo(ABC<int>&) is converted by linking to foo<int*> &) and this breaks initial semantics.

If we needed IR type names to match reliably, we would use a mangled name, not a pretty-print.

There is no requirement for IR type name to be an identifier, so pretty-print fits the need of type identification.

Not really; pretty-printing drops a lot of information that is pertinent in a stable identifier, like scopes and so on, and makes arbitrary decisions about other things, like where to insert spaces, namespace qualifiers, etc.

Type name mangling indeed is attractive solution. It has at least the advantages:

  • It is reversible (in theory).
  • It can be more compact. For instance, there is no need to spell type name that is encountered already, a some kind of reference is sufficient.

And there are arguments against it:

  • It make working with IR harder for developers as readability is broken,
  • Type name in IR is mostly a decoration, with the exception of rare case of opaque type resolution, so type name mangling may be considered as an overkill.

On the other hand pretty-printing can be finely tuned so that all necessary information appears in its result. As there is no requirements on compatibility of type names in bitcode files, things like number of spaces look not so important, it is enough that the same version of clang was used to compile bc files that are linked by llvm-link. After all it is readable.

Are you trying to use LLVM struct type identity to infer information about the source program? That is not and has never been a guarantee.

sepavloff abandoned this revision.Feb 26 2018, 8:40 PM

Using llvm type names is considered a wrong direction.