Page MenuHomePhabricator

[RFC] Add hash of block contents to function block names
Needs ReviewPublic

Authored by alexbdv on Feb 18 2020, 11:03 PM.

Details

Summary

Function blocks don't have a name specified in source code. Currently their symbol name is based on the parent function's name and an index.
Ex: _ZN1Parent_Funciton_block_invoke_1
One issue that happens with the current naming scheme is that in subsequent builds, the name of the function block can change even if the function block or the parent function has changed. This presents issues for tools that use symbol names to identify changes in source code / tracking of binary metrics.

The proposed solution here is to add a flag (default=off) that enables adding a hash of the block contents to the block name.
Ex: _ZN1Parent_Funciton_block_invoke_38172 (38172 is the hash)
Therefore, the function block name will only change if its content or the parent function name changes. And will now remain stable regardless if new function blocks are added.

Diff Detail

Event Timeline

alexbdv created this revision.Feb 18 2020, 11:03 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 18 2020, 11:03 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
vsk edited reviewers, added: ahatanak, erik.pilkington, dexonsmith; removed: vsk.Feb 19 2020, 8:21 AM

At a high level the idea sounds good to me. For out OSes, symbol name instability results in serious performance regressions as this invalidates text/data orderfiles. Why shouldn’t this be on by default?

Also, could you add tests to cover this change?

alexbdv added a subscriber: vsk.Feb 19 2020, 11:20 AM

@vsk - sure will add tests when removing from RFC.
As for making it default - would rather have this under a flag as hashing the block contents does have some overhead and I imagine this feature wouldn't be beneficial in most scenarios. Also, unexpectedly (by default) changing the name of the function blocks might have a negative impact on some existing workflows.

vsk added a comment.Feb 19 2020, 1:44 PM

As for making it default - would rather have this under a flag as hashing the block contents does have some overhead and I imagine this feature wouldn't be beneficial in most scenarios. Also, unexpectedly (by default) changing the name of the function blocks might have a negative impact on some existing workflows.

I agree there is a non-zero compile-time overhead, but I expect it to be very small. Istm the relative cost of adding a flag would be much higher, in terms of added complexity and adoption time.

What kind of workflow would this change break? Part of the motivation seems to be that workflows that rely on blocks having stable names are brittle. This seems like it makes things strictly less brittle.

Thanks for working on this, I agree it's a really important problem.

I'm as optimistic as Vedant that this is the right approach though.

  • On compile time, I do think there's reason to be concerned, since dumping IR was fairly expensive last I checked due to numbering metadata. Have you tried measuring a the time for a bootstrap of Clang?
  • On stability, I'm skeptical this will be an improvement in the general case. Metadata numbering is unstable, and having that as an input isn't great. It could even change between compiler versions.
  • I'm also worried that the hash will change any time someone touches the body of the block. That could even cause a change if other code changes a typedef which is used inside the block.

Is the current numbering per-module? Can we change that somehow to per-function? Then blocks would be numbered by order within a function, which seems fairly stable.

alexbdv added a comment.EditedFeb 19 2020, 5:20 PM

@vsk - about breaking existing workflows - I was referring only to if / when this gets shipped out as the default - all the names for the function blocks will change and this might cause issue with tooling that relies on symbol names being consistent across builds.

@dexonsmith

  • Just to make sure - this isn't dumping LLVM IR but a textual representation of the AST. Does this have the same issues with metadata / metadata numbering that LLVM IR has, or you were referring to the AST text dump, not llvm IR ? Also I don't think that clang would be a good test case here - it doesn't have many block functions.
  • Stability wise - I'm still not sure if this has the metadata numbering issue (since it's AST text representation), perhaps you can tell me how to check.
  • Correct, the hash might change if the block contents changes - this is kind of the intended use for this. As the flag mentions, it is hash-based.

Moving to a per-function index-based approach seems like the correct default approach. I would still like to have the hash version under a flag though. Moving to per-function index-based should be as simple as not adding the discriminator at all. Since function blocks contain the parent's function's name and given that llvm auto-renames duplicate symbols by adding an incremental number - the end result is that function blocks will be incrementally named based on the order that they have in the parent function.

alexbdv added a comment.EditedFeb 22 2020, 5:42 PM

@dexonsmith I did a benchmark with the worst case example I can come up with - 1000 regular functions and 1000 blocks : https://paste.ofcode.org/CFU6b46nuAA6ymxUEpkzka
I didn't manage to measure any performance decrease (the performance decrease was within the noise of repeated runs) - so seems the hashing is insignificant compile-time-wise even in worst case scenario (where most of the code is in function blocks).
And here is the actual text that is being hashed - the textual representation of the AST - this is not llvm IR with metadata: https://paste.ofcode.org/bfMzhbvHz9HRT7mVMe48Mx

@vsk / @dexonsmith - I've added some more info in latest comments. Let me know if I can provide more info / context on this to be able to reach a conclusion. Or if you think it is clear at this point that the hash-based approach is a no-go.

There are a few concerns about the approach:
1> Compile time: dumping AST as string then hashing the string. Alex measured it on a synthetic benchmark, it shows insignificant impact.
2> Stability: from my understanding, the main goal of this patch is to increase stability of the symbol name so it will not change if the relevant code for the block is not changing. It may not be as stable when the compiler version changes.
Using per-function ID improves the stability compared to per-module ID. @alexbdv do you have rough ideas on how much better the proposed approach is in terms of stability, comparing to per-function ID?
3> Using a compiler flag may slow down the adoption.

@dexonsmith: How can we move this forward? Do you have any other suggestion?

Thanks!
Manman

clang/lib/AST/Mangle.cpp
52

Should it be AST instead of IR in the comment here?

56

Is this needed to have deterministic behavior?