This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Merge identical block descriptor global variables
ClosedPublic

Authored by ahatanak on Aug 15 2018, 8:23 AM.

Details

Summary

Currently, clang generates a new block descriptor global variable for each block literal. This patch merges block descriptors that are identical inside and across translation units using the same approach taken in r339438.

To enable merging identical block descriptors, the information of each member of a block descriptor is encoded into the block descriptor name. Also, the linkage of the block descriptor is linkonce_odr unless the copy or dispose helper is internal.

Diff Detail

Repository
rC Clang

Event Timeline

ahatanak created this revision.Aug 15 2018, 8:23 AM

A few points I forgot to mention:

  • This optimization kicks in only in NonGC mode. I don't think we need to care much about GC anymore, so I think that's OK.
  • There is a lot of redundancy among the copy/dispose helper function strings and the block layout string in the block descriptor name (they all encode the information about the captures), which can make the descriptor name long. If that becomes a problem, it's possible to encode the information in a way that shortens the descriptor name, but that would probably make the function that generates the name more complex.

A few points I forgot to mention:

  • This optimization kicks in only in NonGC mode. I don't think we need to care much about GC anymore, so I think that's OK.

Yes, that's fine.

  • There is a lot of redundancy among the copy/dispose helper function strings and the block layout string in the block descriptor name (they all encode the information about the captures), which can make the descriptor name long. If that becomes a problem, it's possible to encode the information in a way that shortens the descriptor name, but that would probably make the function that generates the name more complex.

It should be straightforward to at least get merge the copy/dispose helper function names, right? Those make basically the same pass over the captures and just check for slightly different things; the block-descriptor summary just has to include a little of both.

To unique the block layout string as well, we'd just have to cover the captures for which we haven't added any other description. If you want to punt on that, I think that's fine.

ahatanak updated this revision to Diff 161137.Aug 16 2018, 3:57 PM

Try harder to shorten the names of block descriptor global variables.

The strings extracted from the names of the copy and dispose helpers are merged whenever it is possible to do so. The block layout string doesn't include information about __strong, __weak, or byref captures unless the copy/dispose helpers are missing.

rsmith added a subscriber: rsmith.Aug 16 2018, 4:17 PM
rsmith added inline comments.
lib/CodeGen/CGBlocks.cpp
239–245

Would it make sense to also mark this constant as unnamed_addr?

ahatanak added inline comments.Aug 16 2018, 4:24 PM
lib/CodeGen/CGBlocks.cpp
239–245

Yes. I don't think there is any harm in marking it as unnamed_addr. However, unfortunately it looks like ld64 doesn't try to merge unnamed_addr global variables.

ahatanak updated this revision to Diff 161155.Aug 16 2018, 7:36 PM
ahatanak marked an inline comment as done.

Mark the block descriptor global variable as unnamed_addr.

rjmccall accepted this revision.Aug 16 2018, 11:00 PM
This revision is now accepted and ready to land.Aug 16 2018, 11:00 PM
This revision was automatically updated to reflect the committed changes.
This revision was automatically updated to reflect the committed changes.

This revision broke blocks on all ELF targets. The block descriptors' symbol names can now include the @ character, which is reserved on ELF platforms as a separator between symbol name and symbol version. As a result, nothing containing a block that has an Objective-C object argument will link. Please add mangling similar to that in the GNUstep runtime to avoid this.

lib/CodeGen/CGBlocks.cpp
163

Specifically, this line. It is unsafe to use an Objective-C encoding directly as a symbol name on ELF.

The GNUstep documentation I found replaces '@' with '\1'. Would that fix the problem?

https://github.com/gnustep/libobjc2/blob/master/ABIDoc/abi.tex

The GNUstep documentation I found replaces '@' with '\1'. Would that fix the problem?

https://github.com/gnustep/libobjc2/blob/master/ABIDoc/abi.tex

Yup. If this mangling is needed outside of CGObjCGNU, it would be good to take the code that does this mangling and move it somewhere that can be reused. I don't know if there are any characters that are valid in Objective-C type encoding but aren't valid in Windows PE/COFF symbols, but having a hook to fix them in a single place sounds like a good idea.