This is an archive of the discontinued LLVM Phabricator instance.

[Debuginfo][llvm-dwarfdump][dsymutil] Add dsymutil compatibility dump.
Needs ReviewPublic

Authored by avl on Apr 20 2022, 3:21 AM.

Details

Summary

This patch adds a new kind of dump into the llvm-dwarfdump utility.
This new kind of dump can be used to compare semantically the same but
binary different DWARF files produced by dsymutil. f.e. these two should match:

dsymutil intput-file -o - | llvm-dwarfdump --dsymutil-compat-dump - -o - | md5
dsymutil --no-odr intput-file -o - | llvm-dwarfdump --dsymutil-compat-dump - -o - | md5

The idea of the dump is that it skips type dies. Instead, it prints type names.
Additionally, it skips dies offsets, strings offsets, etc...

"llvm-dwarfdump -a":

0x0000002a:   DW_TAG_class_type
                DW_AT_name      ("class1")

0x00000071:   DW_TAG_subprogram
                DW_AT_name      ("foo1")
                DW_AT_low_pc    (0x0000000000001000)
                DW_AT_high_pc   (0x0000000000001010)
                DW_AT_type      (0x0000002a "class1")

"llvm-dwarfdump --dsymutil-compat-dump":

== DW_TAG_subprogram [1]
   DW_AT_name "foo1"
   DW_AT_low_pc 0x0000000000001000
   DW_AT_high_pc 0x0000000000001010
   DW_AT_type "class1"

use cases for this dump:

  1. to compare result produced by "dsymutil" and "dsymutil --no-odr".
  2. to compare result produced by current upstream dsymutil and another one from https://reviews.llvm.org/D86539.
  3. to compare result produced by current upstream dsymutil and another one from https://reviews.llvm.org/D96035.

Diff Detail

Event Timeline

avl created this revision.Apr 20 2022, 3:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 20 2022, 3:21 AM
avl requested review of this revision.Apr 20 2022, 3:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 20 2022, 3:21 AM
avl edited the summary of this revision. (Show Details)Apr 20 2022, 3:22 AM

I don't think this would be enough to make dumps very comparable, would it? We'd have to sort DIEs or something (based on their semantic content) to make things very comparable, beyond only omitting offsets?

(& also this mode probably wouldn't be dsym specific, I'd guess - but some comparable/generic/semantic dump?)

avl added a comment.Apr 20 2022, 11:57 AM

I don't think this would be enough to make dumps very comparable, would it?

It works for dsymutil. Following commands produce equal md5 value for many llvm binaries(for others, they produce incompatible value which points to the real problem).

dsymutil intput-file -o - | llvm-dwarfdump --dsymutil-compat-dump - -o - | md5
dsymutil --no-odr intput-file -o - | llvm-dwarfdump --dsymutil-compat-dump - -o - | md5

We'd have to sort DIEs or something (based on their semantic content) to make things very comparable, beyond only omitting offsets?

This patch not only omit offsets but also skip dies. Currently it prints only dies which have
DW_AT_lowpc or DW_AT_ranges or parameters dies. For such dies the patch assumes(to have successful comparison) they are in the same order
as they are in the other input. Other dies are skipped. It allows not to depend on the order or amount of types dies.
The types are compared by names which are printed by this patch.

(& also this mode probably wouldn't be dsym specific, I'd guess - but some comparable/generic/semantic dump?)

Above assumption on the order of dies makes it dsymutil specific(and currently it would be most useful to compare dsymutil output). We can make it more general. But then, we would need to sort dies somehow as you suggested. I propose we can start from dsymutil specific solution and then make it more general.

Alrighty, I'll leave it to the dsymutil folks to hash this out - I'm not fundamentally opposed even in the current rather narrow form.

This is mostly outside my area of expertees, so I'm not going to review this more widely, but the new option needs adding to the Command Guide documentation for llvm-dwarfdump.

Instead of doing a dumping thing, it might be a good idea to have a comparison mode build into llvm-dsymutil that can compare the contents of two files maybe? As David Blaikie suggested, this output format won't help diff things as we enable more complex garbage collection and type uniquing. If we do want a dump format that can be diffed, it should probably be output in a format that sorts things a bit better. Like sorting all functions by address and making a common output format for them. Types would be output sorted by type name, but it would be harder to diff especially since the other llvm-dsymutil did more simple type uniquing and the new modifications you are trying to get in really merge all of the types. There is already a "--diff" option to this tool that omits all addresses and offsets to help diff two files so chunks that were removed can easily be identified, does that mode not work for you?

avl added a comment.Apr 23 2022, 8:14 AM

Instead of doing a dumping thing, it might be a good idea to have a comparison mode build into llvm-dsymutil that can compare the contents of two files maybe?

Agreed, creating some smarter tool(doing complex comparisons of DIEs trees) makes sense. Probably it should not be llvm-dsymutil. It might be llvm-dwarfdump or some separate tool.

Though, I think the dump suggested in this patch is still quite useful, especially for type de-duplication optimization. Please inspect the example of the usage of this dump in this message later.

As David Blaikie suggested, this output format won't help diff things as we enable more complex garbage collection and type uniquing. If we do want a dump format that can be diffed, it should probably be output in a format that sorts things a bit better. Like sorting all functions by address and making a common output format for them. Types would be output sorted by type name, but it would be harder to diff especially since the other llvm-dsymutil did more simple type uniquing and the new modifications you are trying to get in really merge all of the types.

If we think that adding any sorting will make this patch better - I am happy to add it.

Though, even without such sorting, this patch is quite powerful. The first thing is that it completely does not depend on the type(DW_TAG_*_type and other) dies. Type dies could be in any order or duplicated any number of times. Only names are printed for types(no need to sort type dies). The second thing is that dsymutil does not change the order of DW_TAG_subprogram dies. Thus, we can have an effective comparison without sorting DW_TAG_subprogram dies. We will need to sort them if there are some other tools that change order of DW_TAG_subprogram dies.

This patch also works good when types de-duplication is done in completely different way.
With some small modifications for declarations it helps to compare output of current dsymutil and output of dsymutil from the patch D96035(even if it works in non-deterministic mode).

There is already a "--diff" option to this tool that omits all addresses and offsets to help diff two files so chunks that were removed can easily be identified, does that mode not work for you?

No, it does not. f.e. if type deduplication is done then the final DWARF files(with ODR and without ODR) will contain different number of type dies, even if they are semantically the same. Thus dumps created with "--diff" option will not match. Someone need to inspect differences(which might be quite complex) to understand whether they are semantically equal. At the same time dumps created with "--dsymutil-compat-dump" option will be 100% equal. Let`s see the example:

dsymutil llvm-strings

llvm-dwarfdump --debug-info --diff llvm-strings.dSYM/Contents/Resources/DWARF/llvm-strings > llvm-strings-diff-dump
llvm-dwarfdump --dsymutil-compat-dump llvm-strings.dSYM/Contents/Resources/DWARF/llvm-strings > llvm-strings-compat-dump

dsymutil --no-odr llvm-strings

llvm-dwarfdump --debug-info --diff llvm-strings.dSYM/Contents/Resources/DWARF/llvm-strings > llvm-strings-noodr-diff-dump
llvm-dwarfdump --dsymutil-compat-dump llvm-strings.dSYM/Contents/Resources/DWARF/llvm-strings > llvm-strings-noodr-compat-dump


diff llvm-strings-diff-dump llvm-strings-noodr-diff-dump | wc -c | awk '{print $1/1000"K"}'
68563,8K

diff llvm-strings-compat-dump llvm-strings-noodr-compat-dump | wc -c | awk '{print $1/1000"K"}'
0K

One of the secret why dsymutil-compat-dump does not have a lot of differencies is that it skips much.
Probably we can make it more detailed. But even in this form it is able to catch errors. f.e. it shows
that dsymutil applied for "opt" binary has a bug inside odr deduplication optimization:

diff opt-compat-dump opt-noodr-compat-dump 
12158120c12158120
<      DW_AT_specification "operator()": "llvm::VPBasicBlock *"
---
>      DW_AT_specification "operator()": "const llvm::VPBasicBlock *"

The "const" qualifier is lost.

Thus, In short, a powerful tool comparing type die trees would be useful. But it would also be much more complex. While this dump is quite simple and can help to find some classes of errors.

avl updated this revision to Diff 424882.Apr 25 2022, 5:23 AM

renamed option to less specific variant (dsymutil-compat-dump -> type-compat-dump)
added doc description.

I am all for this patch if it helps you test and verify new features in dsymutil and llvm-dwarfutil progress. So I am ok with this, but I am not a code owner of the LLVM DWARF tools so I will let someone else give the final ok

Comparing different DWARF output in a more abstract way is exactly what the llvm-dva tool is designed to do; unfortunately it isn't upstream yet, although @CarlosAlbertoEnciso is working on generating the patches now.
I'd prefer not to have a works-in-a-few-cases-for-one-tool feature come in when a better solution is around the corner.

When you used this to compare the output of dsymutil changes, what was the input binary? Does this generate meaningful output for bigger programs like dsymutil itself?

In my experience, when you change the dsymutil algorithm that marks dies as kept, they end up in different subtrees. Is that something this tool can deal with? My original assumption was that it can't, but then I don't understand how this was useful for the other two patches.

Comparing different DWARF output in a more abstract way is exactly what the llvm-dva tool is designed to do; unfortunately it isn't upstream yet, although @CarlosAlbertoEnciso is working on generating the patches now.
I'd prefer not to have a works-in-a-few-cases-for-one-tool feature come in when a better solution is around the corner.

+1 if that tool is actually coming. @CarlosAlbertoEnciso or @probinson do you have a rough timeline in mind?

Comparing different DWARF output in a more abstract way is exactly what the llvm-dva tool is designed to do; unfortunately it isn't upstream yet, although @CarlosAlbertoEnciso is working on generating the patches now.
I'd prefer not to have a works-in-a-few-cases-for-one-tool feature come in when a better solution is around the corner.

+1 if that tool is actually coming. @CarlosAlbertoEnciso or @probinson do you have a rough timeline in mind?

I'm even now doing the last round of internal reviews before Carlos posts the patch set.

avl added a comment.EditedMay 6 2022, 6:49 AM

When you used this to compare the output of dsymutil changes, what was the input binary? Does this generate meaningful output for bigger programs like dsymutil itself?

yes it does. I used it for all llvm&clang binaries.

In my experience, when you change the dsymutil algorithm that marks dies as kept, they end up in different subtrees. Is that something this tool can deal with? My original assumption was that it can't, but then I don't understand how this was useful for the other two patches.

Yes, this tool can deal with different subtrees(type subtrees). The idea is that it skips type subtrees. The types are compared by string representation. Following is the example:

a)

0x0000001b:   DW_TAG_const_type
                DW_AT_type      (0x0000002a "class1")

0x0000002a:   DW_TAG_class_type
                DW_AT_name      ("class1")

0x00000071:   DW_TAG_subprogram
                DW_AT_name      ("foo1")
                DW_AT_low_pc    (0x0000000000001000)
                DW_AT_high_pc   (0x0000000000001010)
                DW_AT_type      (0x0000001b "const class1")


b)

0x0000001b:   DW_TAG_class_type
                DW_AT_name      ("class1")

0x0000002a:   DW_TAG_const_type
                DW_AT_type      (0x0000001b "class1")

0x00000071:   DW_TAG_subprogram
                DW_AT_name      ("foo1")
                DW_AT_low_pc    (0x0000000000001000)
                DW_AT_high_pc   (0x0000000000001010)
                DW_AT_type      (0x0000002a "const class1")

For both above examples --type-compat-dump will skip types descriptions and print only type names:

== DW_TAG_subprogram [1]
   DW_AT_name "foo1"
   DW_AT_low_pc 0x0000000000001000
   DW_AT_high_pc 0x0000000000001010
   DW_AT_type "const class1"

That allows to catch errors if type trees were incorrect. F.e. if "DW_AT_type "const class1"" look like this

DW_AT_type "class1"
DW_AT_type "const class2"
DW_AT_type "class1*".

That solution works not for all possible dies trees modifications, but for only type trees.