This is an archive of the discontinued LLVM Phabricator instance.

[DWARF] v5 implementation of string offsets tables - producer side
ClosedPublic

Authored by wolfgangp on Jan 12 2018, 5:01 PM.

Details

Summary

Generate DWARF v5 string offsets tables and use the DW_FORM_strx? forms to refer to indexed strings. All compile and type units in the compilation share one contribution. There are 3 test cases:

str-offsets-table.ll: Simple test for one unit in non-split and split scenarios.
str-offsets-multiple-cus.ll: Multiple compile units and a type unit
str-offsets-form.ll: Test if we generate DW_FORM_strx2 (in addition to DW_FORM_strx1).

One caveat is that there is no testing for the DW_FORM_strx{3,4} forms. I'm not quite sure how to effect this without a humongous source that uses > 64K strings (for strx3) or > 24M strings (for strx4), other than perhaps a generating script. However, the code that triggers these forms is very simple, so a test for strx2 may be sufficient.

Diff Detail

Repository
rL LLVM

Event Timeline

wolfgangp created this revision.Jan 12 2018, 5:01 PM
aprantl accepted this revision.Jan 12 2018, 5:57 PM

The code looks fairly straightforward. LGTM, thanks!

This revision is now accepted and ready to land.Jan 12 2018, 5:57 PM
dblaikie added inline comments.Jan 15 2018, 11:41 AM
lib/CodeGen/AsmPrinter/DIE.cpp
608 ↗(On Diff #129725)

Remove the return after unreachable.

lib/CodeGen/AsmPrinter/DwarfDebug.cpp
604–605 ↗(On Diff #129725)

This looks like it's inconsistent with the code.

The code looks like it sets the symbol for both the skeleton and non-skeleton holders in the case of split DWARF, doesn't it?

Perhaps it should be:

if (useSplitDwarf)
  SkeletonHolder...
else
  InfoHolder...

?

(or that could be rewritten as:

(useSplitDwarf ? SkeletonHolder : InfoHolder).setStringOffsetsStartSym(...);
2060–2062 ↗(On Diff #129725)

This is duplicated in a couple of places - could it be reasonably moved into some common place, like the DwarfUnit ctor? (not sure if that has some common setup code for adding attributes common to all units - maybe there is no such common place? not sure)

lib/CodeGen/AsmPrinter/DwarfDebug.h
265 ↗(On Diff #129725)

What's "segmented" mean/imply/add/clarify in this context?

lib/CodeGen/AsmPrinter/DwarfFile.cpp
44 ↗(On Diff #129725)

maybe just say "split units"

lib/CodeGen/AsmPrinter/DwarfUnit.cpp
251 ↗(On Diff #129725)

Would these comparisons be better written as "Index >= 0xff", etc?

test/DebugInfo/Generic/string-offsets-form.ll
8–13 ↗(On Diff #129725)

To get a really large string index that doesn't need lots of IR to be cleaned up, maybe just use one giant enum with lots of enumerators? Tie the enum to the IR with a single global, same as the struct here.

test/DebugInfo/Generic/string-offsets-multiple-cus.ll
17–38 ↗(On Diff #129725)

Could these be simplified to all use the same construct for the strings. Otherwise it makes the test a bit more confusing/cryptic by implying that functions, global variahbles, and structs, might be meaningfully different for string emission - which shouldn't be true.

(eg: all use global variables, all use enums (I'd say enums are probably a good call), or whatever)

Ah, I see, type units are relevant - you could probably make this work by having named/external linkage enums (which will go in type units) and anonymous/internal linkage enums (which won't go in type units - but can still be emitted by using an enumerator to initialize a global/external linkage variable, I think? That should cause the enumeration to go into the enum list on the CU)

49 ↗(On Diff #129725)

Maybe rather than naming it CU1_STROFF, name it STROFFBASE or something, since it's meant to be the same for all the units?

71–72 ↗(On Diff #129725)

Is each type unit getting its own string offset? I'd expect type units to use the string offset table of the CU they were attached to, so this should be the same as CU1_STROFF?

I mean it's a tradeoff - using separate string offset sections per TU means they can be dropped if this copy of the TU is dropped. But it also means more duplication - the same string offset having to be duplicated in the CU and every TU if the same string appears in all/several of them.

110–111 ↗(On Diff #129725)

Would it be worth printing out the index number in the section dumping to make it easier to look them up? Then the matching wouldn't need 5 trivial matches, you could match the index from the unit dumping to the string here. (& similarly in the non-TU case above)

test/DebugInfo/Generic/string-offsets-table.ll
13–21 ↗(On Diff #129725)

Some comments about which parts of this code are important/what situation is being created/tested for (& ensuring the code is as simple as possible for that goal) might be handy.

wolfgangp updated this revision to Diff 130344.Jan 17 2018, 6:42 PM
wolfgangp marked 8 inline comments as done.

Addressed most of David's review comments and commented on the ones I didn't address.

Changed the test cases to use enums instead of structs. This reduced their size. Added more comments and tightened them up a bit.

Fixed a couple of bugs:

  • strx3 handling was missing in couple of places.
  • compilations without a .debug_str section still generated a .debug_str_offsets section (a header). They don't any more.
lib/CodeGen/AsmPrinter/DwarfDebug.cpp
604–605 ↗(On Diff #129725)

Right. I wasn't an issue since the label wasn't used but it was created unnecessarily.

2060–2062 ↗(On Diff #129725)

It's a bit awkward, since we don't know in the DwarfUnit ctor whether we are dealing with a split unit. And even in the DwarfCompileUnit ctor we don't know whether we're constructing a Skeleton unit or a (possibly split) compile unit. The only ctor we could move it to would be the DwarfTypeUnit ctor, but that would still only cover one of the 3 cases. Perhaps there is a way to refactor all this but I don't see a good way to centralize it at the moment.

lib/CodeGen/AsmPrinter/DwarfDebug.h
265 ↗(On Diff #129725)

I meant to indicate by the name that the DWARF v5 offsets tables is a sequence of contributions preceded by a headers rather than one simple array of string offsets. I added a comment.

lib/CodeGen/AsmPrinter/DwarfUnit.cpp
251 ↗(On Diff #129725)

Looks simpler. I reversed the order to check for strx4 first with the rationale that with a very large number of strings the majority will be strx4 and so we hit that first in the comparison chain.

test/DebugInfo/Generic/string-offsets-multiple-cus.ll
49 ↗(On Diff #129725)

Right.

71–72 ↗(On Diff #129725)

At the moment all the units in a compilation share the same string offsets table. I corrected all the xxx_STROFF lit variables. I'm not sure if it's really worth it to drop the string offsets that belong to a dropped unit. I would think the gain would be small.

110–111 ↗(On Diff #129725)

That would require an independent change to the dumper. I would prefer to do this in a follow-on patch, it that's OK, along with improving the test case.

probinson added inline comments.
test/DebugInfo/Generic/string-offsets-multiple-cus.ll
71–72 ↗(On Diff #129725)

The thing is if we have a relocation from the string offsets to the string, and don't drop the entry in the offsets table, the linker can't drop the string, even if there are no actual references left. That would be the significant savings.

Ping.

Dave, I think I addressed most of your comments, and commented on the ones I didn't. On the subject of tradeoffs wrt shared string offsets contributions, one idea would be to give each type unit (but not compile unit) its own contribution for easy removability by the linker. However, I'd like to do that or any other improvements as a follow-on patch if that's OK with you.

dblaikie accepted this revision.Jan 25 2018, 4:06 PM

Sounds alright - thanks (:

test/DebugInfo/Generic/string-offsets-form.ll
20–22 ↗(On Diff #130344)

You can probably just check the debug_info and skip checking the abbrev here (since you are checking the form kind in the debug_info).

Though how do you know the first enumeration_type would have a strx2? Actually I'm a bit surprised the enumeration_type's name would be strx2 (would've guessed its string would be computed early and before all the enumerators?)?

wolfgangp marked an inline comment as done.Jan 25 2018, 5:39 PM
wolfgangp added inline comments.
test/DebugInfo/Generic/string-offsets-form.ll
20–22 ↗(On Diff #130344)

The order in which the strings are created depends on the order in which the DIEs are created, and the enumeration_type DIE is created after all the enumerator DIEs. At least that's what I gather from DwarfUnit::constructEnumTypeDie() in DwarfUnit.cpp.

As far as the abbrevs are concerned, abbrevs for parent DIEs seem to come before child DIEs, so the enumeration type DIE's abbrev would come first in the table.

But all this is implementation dependent. I agree that the check isn't really needed. I'll get rid of it before I commit. Thanks for the review.

This revision was automatically updated to reflect the committed changes.
wolfgangp marked an inline comment as done.