This is an archive of the discontinued LLVM Phabricator instance.

Function block naming - add hash of parameter type to end of block name
Needs ReviewPublic

Authored by alexbdv on Feb 18 2020, 11:03 PM.

Details

Summary

Function blocks don't have a name specified in source code. Currently their symbol name is based on the parent function's name and an index.
Ex: _ZN1Parent_Funciton_block_invoke_1
One issue that happens with the current naming scheme is that in subsequent builds, the name of the function block can change even if the function block or the parent function has changed. This presents issues for tools that use symbol names to identify changes in source code / tracking of binary metrics.

The proposed solution here is to change the incremental number to a hash of the block parameter types.
Ex: _ZN1Parent_Funciton_block_invoke_38172 (38172 is the hash of the parameters)
Therefore, the function block name will only change if its parameters or the parent function name changes. And will now remain stable regardless if new function blocks are added.

Diff Detail

Event Timeline

alexbdv created this revision.Feb 18 2020, 11:03 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 18 2020, 11:03 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
vsk edited reviewers, added: ahatanak, erik.pilkington, dexonsmith; removed: vsk.Feb 19 2020, 8:21 AM

At a high level the idea sounds good to me. For out OSes, symbol name instability results in serious performance regressions as this invalidates text/data orderfiles. Why shouldn’t this be on by default?

Also, could you add tests to cover this change?

alexbdv added a subscriber: vsk.Feb 19 2020, 11:20 AM

@vsk - sure will add tests when removing from RFC.
As for making it default - would rather have this under a flag as hashing the block contents does have some overhead and I imagine this feature wouldn't be beneficial in most scenarios. Also, unexpectedly (by default) changing the name of the function blocks might have a negative impact on some existing workflows.

vsk added a comment.Feb 19 2020, 1:44 PM

As for making it default - would rather have this under a flag as hashing the block contents does have some overhead and I imagine this feature wouldn't be beneficial in most scenarios. Also, unexpectedly (by default) changing the name of the function blocks might have a negative impact on some existing workflows.

I agree there is a non-zero compile-time overhead, but I expect it to be very small. Istm the relative cost of adding a flag would be much higher, in terms of added complexity and adoption time.

What kind of workflow would this change break? Part of the motivation seems to be that workflows that rely on blocks having stable names are brittle. This seems like it makes things strictly less brittle.

Thanks for working on this, I agree it's a really important problem.

I'm as optimistic as Vedant that this is the right approach though.

  • On compile time, I do think there's reason to be concerned, since dumping IR was fairly expensive last I checked due to numbering metadata. Have you tried measuring a the time for a bootstrap of Clang?
  • On stability, I'm skeptical this will be an improvement in the general case. Metadata numbering is unstable, and having that as an input isn't great. It could even change between compiler versions.
  • I'm also worried that the hash will change any time someone touches the body of the block. That could even cause a change if other code changes a typedef which is used inside the block.

Is the current numbering per-module? Can we change that somehow to per-function? Then blocks would be numbered by order within a function, which seems fairly stable.

alexbdv added a comment.EditedFeb 19 2020, 5:20 PM

@vsk - about breaking existing workflows - I was referring only to if / when this gets shipped out as the default - all the names for the function blocks will change and this might cause issue with tooling that relies on symbol names being consistent across builds.

@dexonsmith

  • Just to make sure - this isn't dumping LLVM IR but a textual representation of the AST. Does this have the same issues with metadata / metadata numbering that LLVM IR has, or you were referring to the AST text dump, not llvm IR ? Also I don't think that clang would be a good test case here - it doesn't have many block functions.
  • Stability wise - I'm still not sure if this has the metadata numbering issue (since it's AST text representation), perhaps you can tell me how to check.
  • Correct, the hash might change if the block contents changes - this is kind of the intended use for this. As the flag mentions, it is hash-based.

Moving to a per-function index-based approach seems like the correct default approach. I would still like to have the hash version under a flag though. Moving to per-function index-based should be as simple as not adding the discriminator at all. Since function blocks contain the parent's function's name and given that llvm auto-renames duplicate symbols by adding an incremental number - the end result is that function blocks will be incrementally named based on the order that they have in the parent function.

alexbdv added a comment.EditedFeb 22 2020, 5:42 PM

@dexonsmith I did a benchmark with the worst case example I can come up with - 1000 regular functions and 1000 blocks : https://paste.ofcode.org/CFU6b46nuAA6ymxUEpkzka
I didn't manage to measure any performance decrease (the performance decrease was within the noise of repeated runs) - so seems the hashing is insignificant compile-time-wise even in worst case scenario (where most of the code is in function blocks).
And here is the actual text that is being hashed - the textual representation of the AST - this is not llvm IR with metadata: https://paste.ofcode.org/bfMzhbvHz9HRT7mVMe48Mx

@vsk / @dexonsmith - I've added some more info in latest comments. Let me know if I can provide more info / context on this to be able to reach a conclusion. Or if you think it is clear at this point that the hash-based approach is a no-go.

There are a few concerns about the approach:
1> Compile time: dumping AST as string then hashing the string. Alex measured it on a synthetic benchmark, it shows insignificant impact.
2> Stability: from my understanding, the main goal of this patch is to increase stability of the symbol name so it will not change if the relevant code for the block is not changing. It may not be as stable when the compiler version changes.
Using per-function ID improves the stability compared to per-module ID. @alexbdv do you have rough ideas on how much better the proposed approach is in terms of stability, comparing to per-function ID?
3> Using a compiler flag may slow down the adoption.

@dexonsmith: How can we move this forward? Do you have any other suggestion?

Thanks!
Manman

clang/lib/AST/Mangle.cpp
52

Should it be AST instead of IR in the comment here?

56

Is this needed to have deterministic behavior?

alexbdv marked an inline comment as done.Apr 8 2020, 12:27 PM
alexbdv added inline comments.
clang/lib/AST/Mangle.cpp
56

Correct.

@dexonsmith - any suggestions to move forward on this ?

What do you think about using the block type instead of a hash? That should be stable across compiler versions, changes to the block body, and changes to referenced declarations. You'd still have to number blocks of the same type within the same function, but that should be a much smaller problem. It'd also allow demanglers to provide more information about the symbol, e.g. a demangler could print:

block invocation function for block of type void (^)(int) in main

As opposed to what it prints today, which doesn't really tell you much about which block you're looking at:

block invocation function for block in main.

alexbdv updated this revision to Diff 257508.Apr 14 2020, 2:55 PM

@erik.pilkington / @vsk / @dexonsmith - how block name = hash(parent_function_name + block_type) ?
So any blocks that are inside the same parent function + have the same type will get an incremental number to de-duplicate names.

alexbdv marked 2 inline comments as done.Apr 14 2020, 3:05 PM

@erik.pilkington / @vsk / @dexonsmith - how block name = hash(parent_function_name + block_type) ?
So any blocks that are inside the same parent function + have the same type will get an incremental number to de-duplicate names.

Do these need to be behind a hash, or can we just put that directly in the block name? The name is currently __<mangled-function-name>_block_invoke followed by an optional integer discriminator. We could call it __<mangled-function-name>_block_invoke_<mangled-block-type> followed by an optional discriminator. WDYT?

clang/lib/AST/Mangle.cpp
86–87

Seems like this patch is addressing this FIXME. Can it be removed?

alexbdv updated this revision to Diff 257616.Apr 15 2020, 12:33 AM

@dexonsmith - I think that should work - like this ?

Yes, I like the look of that. Can you also update the demangler to reverse this nicely? @erik.pilkington can help there. There are two copies. Usually you modify the copy in libcxxabi/ and then run a script to copy the changes into llvm/.

alexbdv updated this revision to Diff 259181.EditedApr 22 2020, 12:27 AM

@dexonsmith, @erik.pilkington - how about this ?

____Z7my_mainv_block_invoke_ZTSi_ZTSj_ZTSPVK3sss

demangled as:
invocation function for block with params '(int, unsigned int, sss const volatile*)' in my_main()

previously was:
invocation function for block in my_main()

The demangler changes should also be made in libcxxabi/src/demangle, as there is a copy of the demangler there. This change also needs tests, there should be demangler tests (in libcxxabi/test/test_demangle.cpp), and IRGen tests (in clang/test/CodeGen) that check the mangled name. Finally, can you please add full contexts to your diffs? (by running e.g. git show -U99999).

clang/lib/AST/Mangle.cpp
39–50

We should also probably do this for global blocks, which are handled separately below.

42

Might be cleaner as a range-for: for (ParmVarDecl *PVD : BD->parameters)

46

I'm not sure why Out2 exists.

llvm/include/llvm/Demangle/ItaniumDemangle.h
5537 ↗(On Diff #259181)

Can you make a special BlockInvocationFunction AST node that stores the parameter types as AST nodes (rather than strings) in a NodeArray? Right now it looks like this code is leaking, since initializeOutputStream allocates a buffer that it expects the user of OutputStream to manage (the ownership is a bit awkward, but its meant to accommodate the API of __cxa_demangle).

alexbdv updated this revision to Diff 259487.Apr 22 2020, 11:25 PM
Herald added a project: Restricted Project. · View Herald Transcript
Herald added a reviewer: Restricted Project. · View Herald Transcript
alexbdv marked 6 inline comments as done.Apr 22 2020, 11:27 PM

For tests - the existing block tests should be enough, just need to update them.
I updated a few - want to make sure changes are final before updating all the tests to not have to update tests again.

llvm/include/llvm/Demangle/ItaniumDemangle.h
5537 ↗(On Diff #259181)

Resolved with better memory management.

jhenderson added inline comments.Apr 23 2020, 12:28 AM
llvm/test/tools/llvm-cxxfilt/invalid.test
1 ↗(On Diff #259487)

I'm not sure this test needs updating with an additional case?

This test is just to show what llvm-cxxfilt prints when an input is not a valid mangled name (I think). I think the existing block_invoke example is there merely to show triple underscores can be handled, but I actually am doubtful that it really belongs here, as that sounds more like belonging in the demangler testing.

NB. I haven't actually looked at the history of this test or the source code to verify anything I just said.

clang/lib/AST/Mangle.cpp
41

@dexonsmith, @erik.pilkington - how about this ?

____Z7my_mainv_block_invoke_ZTSi_ZTSj_ZTSPVK3sss

That doesn't look quite right, _ZTSi means the typeinfo symbol for int, but we really should just be printing the type, i. I think we should add a new member on MangleContext (in Mangle.h) for mangling block invocation functions, then implement it in ItaniumMangle.cpp to call CXXNameMangler::mangleType for each parameter (the MS mangling can probably just be left as-is, I don't think it matters all that much for blocks).

llvm/include/llvm/Demangle/ItaniumDemangle.h
5537 ↗(On Diff #259181)

Okay, but now this is just leaking in the buffer you allocate in DescOS. There isn't any way for this approach to work, because you need the temporary buffer you're allocating to live until after the SpecialName is printed, and the AST nodes do not get destroyed. Can you please do what I said above, and create a BlockInvocationFunction node that has a NodeArray of parameters? You can look at e.g. FunctionType above for an example of this.

alexbdv updated this revision to Diff 259684.Apr 23 2020, 12:52 PM
alexbdv marked an inline comment as done.

@erik.pilkington How about just adding numeric hash of block parameters for now ? This way we don't have to change the mangling / demangling scheme at all.
The mangling / demangling changes bring me into parts of LLVM that I'm not familiar with.
(PS - I still have to update tests).

@erik.pilkington would the hash-based numbering be OK for now ?

erik.pilkington edited reviewers, added: rjmccall; removed: Restricted Project.Apr 29 2020, 12:35 PM

@erik.pilkington would the hash-based numbering be OK for now ?

Feel free to drop the demangler changes for now, but I would prefer including the output of the mangled type in the symbol rather than it's hash.

@erik.pilkington would the hash-based numbering be OK for now ?

Feel free to drop the demangler changes for now, but I would prefer including the output of the mangled type in the symbol rather than it's hash.

Dropping the demangler changes will introduce a regression for demangling blocks... I don't think we should do that.

@dexonsmith what regression are you referring to ?
What this change is effectively doing now is changing the numbering of the blocks from incremental to hash-based.
So the demangler functionality remains the same (i think) - I saw that it is ignoring the (currently incrememntal) number, so changing it shouldn't make a difference.
Did I miss anything ?

@dexonsmith what regression are you referring to ?
What this change is effectively doing now is changing the numbering of the blocks from incremental to hash-based.
So the demangler functionality remains the same (i think) - I saw that it is ignoring the (currently incrememntal) number, so changing it shouldn't make a difference.
Did I miss anything ?

I think you missed @erik.pilkington's comment, which I was replying to.

alexbdv updated this revision to Diff 261978.May 4 2020, 6:02 PM

Updating tests.

alexbdv retitled this revision from [RFC] Add hash of block contents to function block names to Function block naming - add hash of parameter type to end of block name.May 4 2020, 6:03 PM
alexbdv updated this revision to Diff 262272.May 5 2020, 5:41 PM
alexbdv marked an inline comment as done.
alexbdv edited the summary of this revision. (Show Details)
alexbdv updated this revision to Diff 262281.May 5 2020, 6:31 PM

@dexonsmith @erik.pilkington The change is final now, could we get this in ?

Could I please get a review on this ? Thanks :) !

Could I please get a review on this ? Thanks :) !

I still think it makes more sense to directly encode the type instead of hashing it. I'm happy to make the demangler changes for whatever we land on, so feel free to ignore that part of it.

@dexonsmith - Are you OK with that ?

@dexonsmith - Are you OK with that ?

SGTM. I suggest @erik.pilkington lands the demangler patch before landing this so there isn't a window where we mangle something that can't be demangled.

JonasToth resigned from this revision.Sep 13 2020, 2:56 AM
dexonsmith resigned from this revision.Oct 7 2020, 11:21 AM
dexonsmith added a reviewer: Bigcheese.
dexonsmith added a subscriber: Bigcheese.

I'll let @erik.pilkington and @Bigcheese finish this review.