Page MenuHomePhabricator

[DebugInfo] Support DWARFv5 index sections.
ClosedPublic

Authored by ikudrin on Mar 10 2020, 9:09 AM.

Details

Summary

DWARFv5 defines index sections in package files in a slightly different way than the pre-standard GNU proposal, see Section 7.3.5 in the DWARF standard and https://gcc.gnu.org/wiki/DebugFissionDWP for GNU proposal. The main concern here is values for section identifiers, which are partially overlapped with changed meanings. The patch adds support for v5 index sections and resolves that difficulty by defining a set of identifiers for internal use which can represent and distinct values of both standards.

Diff Detail

Event Timeline

ikudrin created this revision.Mar 10 2020, 9:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 10 2020, 9:09 AM

I haven't digested the patch yet, but I am wondering if you've seen the recent discussion (DWP mixed (DWARFv4/pre-standard + DWARFv5) content) on dwarf-discuss (link1, link2), which is very relevant for this patch.

If you have any opinions on that, it's not too late to join in. :)

I haven't digested the patch yet, but I am wondering if you've seen the recent discussion (DWP mixed (DWARFv4/pre-standard + DWARFv5) content) on dwarf-discuss (link1, link2), which is very relevant for this patch.

If you have any opinions on that, it's not too late to join in. :)

+1 to that.

If not joining the conversation - the summary is basically: To support v4 and v5 units in a single DWP, extend the v5 index format with new-old columns: DW_SECT_LOC and 9 and DW_SECT_MACINFO at 10.

So probably keep the authoritative enum as this extended version of v5 - including DW_SECT_TYPES and 2, LOC and 9, and MACINFO at 10. And emit this extended index format any time the inputs contain at least one v5 unit.

Thanks for the links! What a coincidence...

I believe that this patch is more or less compatible with any approach which might be taken. The idea is that there is a set of constants for internal use and functions to translate them to/from external representation and both constants and translation functions might be adjusted when needed. In any case, the general design remains the same.

I believe that this patch is more or less compatible with any approach which might be taken. The idea is that there is a set of constants for internal use and functions to translate them to/from external representation and both constants and translation functions might be adjusted when needed. In any case, the general design remains the same.

That's true, but I'm not sure it is really the best solution. This way we will have three numbering schemes (four if you count the "index" thingy in llvm-dwp) floating around: the "gnu" scheme, the "official" scheme, and the "internal" scheme. That's quite a lot to keep in ones head at once.

I'm wondering if if wouldn't be simpler to have two complete enums -- with the "gnu" scheme, and one with the "official" scheme -- and then to internally use the enum matching the on-disk format currently in use. To insulate the users from having to guess the right enum to use, we could add a series of accessors: getLocOffset, getMacInfoOffset, ..., which would use the appropriate enum based on the index version (or assert if there is no such constant in the given version).

WDYT?

Regardless of the outcome of that, I think it would be good to split this patch up and separate the enum shuffling from the new functionality (does this only add parsing support for v5 indexes, or is there something more?).

I believe that this patch is more or less compatible with any approach which might be taken. The idea is that there is a set of constants for internal use and functions to translate them to/from external representation and both constants and translation functions might be adjusted when needed. In any case, the general design remains the same.

That's true, but I'm not sure it is really the best solution.

Well, I do not pretend this to be a perfect solution, but anything I can imagine has its drawbacks. The whole problem comes from overlapping values in both standards, and any solution has to fix that somehow.

This way we will have three numbering schemes (four if you count the "index" thingy in llvm-dwp) floating around: the "gnu" scheme, the "official" scheme, and the "internal" scheme. That's quite a lot to keep in ones head at once.

However, you do not need to worry about all the schemas everywhere. The only place they meet is translation functions. Most of the code just use internal encoding, which is enough.

I'm wondering if if wouldn't be simpler to have two complete enums -- with the "gnu" scheme, and one with the "official" scheme -- and then to internally use the enum matching the on-disk format currently in use. To insulate the users from having to guess the right enum to use, we could add a series of accessors: getLocOffset, getMacInfoOffset, ..., which would use the appropriate enum based on the index version (or assert if there is no such constant in the given version).

WDYT?

Can't see how the accessors simplify the things. For me, getLocOffset() is almost the same as getOffset(DW_SECT_LOC) (or getOffset(DW_SECT_GNU_LOC)). That is no more than an agreement. Note that this patch does not have things like DW_SECT_GNU_INFO. There is only one constant for each section, so their usage is quite determined. The only exception is DW_SECT_MACRO and DW_SECT_GNU_MACRO, I just did not want to overcomplicate translation functions before receiving the feedback about the approach in general.

Regardless of the outcome of that, I think it would be good to split this patch up and separate the enum shuffling from the new functionality (does this only add parsing support for v5 indexes, or is there something more?).

I'll prepare a separate review for the refactoring in llvm-dwp.cpp. As for the new identifiers and all shuffling stuff around, I am not sure it is really valuable to separate them because without the parser of v5 indexes they are meaningless and just dead code. Anyway, the splitting should wait until we decide whether the approach is viable in general.

That's true, but I'm not sure it is really the best solution.

Well, I do not pretend this to be a perfect solution, but anything I can imagine has its drawbacks. The whole problem comes from overlapping values in both standards, and any solution has to fix that somehow.

Can't argue with that. :)

This way we will have three numbering schemes (four if you count the "index" thingy in llvm-dwp) floating around: the "gnu" scheme, the "official" scheme, and the "internal" scheme. That's quite a lot to keep in ones head at once.

However, you do not need to worry about all the schemas everywhere. The only place they meet is translation functions. Most of the code just use internal encoding, which is enough.

Well, that's where the accessor functions would come in. If everyone is calling that, then they are the only ones that need to handle the translation. One thing that bugs me about the current approach is that there are two more-or-less standardized enumerations, but we don't have enum types for either of those -- instead we have an enum containing the mangled internal constants.

However, I concede that this is just a matter of presentation, and the "internal" enum would be essentially implemented by this set of accessors.

What would you say to something different instead -- we do something similar to what was done with location lists, where we "upgrade" the format to the newest version during parsing? This is almost what you are doing right now -- the only difference is that the "internal" enum would no longer be internal -- it would actually match the on-disk format of a v5 index. This v5 enum would contain the official DWARFv5 constants as well as the new extensions we want to introduce for mixed 5+4 indices.

This means that if we adopt the currently proposed 5+4 approach (which is looking increasingly likely -- if you hadn't posted this patch, I would probably be working on it now), there will only be two enums. But until we actually start writing files like this, the new extension constants will still only be kind of internal, and if there is a change in the mixed index approach we can always shuffle things around here.

Regardless of the outcome of that, I think it would be good to split this patch up and separate the enum shuffling from the new functionality (does this only add parsing support for v5 indexes, or is there something more?).

I'll prepare a separate review for the refactoring in llvm-dwp.cpp. As for the new identifiers and all shuffling stuff around, I am not sure it is really valuable to separate them because without the parser of v5 indexes they are meaningless and just dead code. Anyway, the splitting should wait until we decide whether the approach is viable in general.

Some of the files in this patch are only touched because of the introduction of _GNU in the DW_SECT constants. It would be clearer if that are done in an NFC patch. Also the parsing of a v5 index and making sure that a v5 dwp compile unit can find its location lists correctly look like they could be separate (you already have separate tests for those two things anyway).

But yea, that can wait until we have consensus on the overall direction.

ikudrin updated this revision to Diff 249939.Mar 12 2020, 8:13 AM
  • Use values for clashing identifiers proposed by @dblaikie.
  • Convert all unknown section identifiers into a special value, DW_SECT_EXT_unknown; Use an optional parallel array to keep the raw values of unknown identifiers.
  • Split the refactoring part in llvm-dwp.cpp into a separate patch, D76067.

There are some other changes that can be split into separate patches. I will make the series when the direction of this patch is approved in general.

This is almost what you are doing right now -- the only difference is that the "internal" enum would no longer be internal -- it would actually match the on-disk format of a v5 index. This v5 enum would contain the official DWARFv5 constants as well as the new extensions we want to introduce for mixed 5+4 indices.

Yep, this would be the direction I would suggest/encourage. It seems that if the goal is to have one index in a DWP file (which seems reasonable) then all future index versions will have to support column indexes all previous DWARF sections - the DWARFvN enum can then be used to describe all the previous versions as well.

All we'd do is store a "what's the highest DWARF version we have seen in parsing all the inputs (either input CUs in .dwo files, or input cu_indexes in .dwp files)" and use that number to determine what version index we produce (& knowing that those inputs can only contain the limited columns expressable in that DWARF/index version, so we won't have undescribable columns in our internal representation when we're going to emit the index).

This is almost what you are doing right now -- the only difference is that the "internal" enum would no longer be internal -- it would actually match the on-disk format of a v5 index. This v5 enum would contain the official DWARFv5 constants as well as the new extensions we want to introduce for mixed 5+4 indices.

Yep, this would be the direction I would suggest/encourage. It seems that if the goal is to have one index in a DWP file (which seems reasonable) then all future index versions will have to support column indexes all previous DWARF sections - the DWARFvN enum can then be used to describe all the previous versions as well.

So, what are the differences with the last update, apart from that DWARFSectionKind is still internal? I believe we should not put the extensions into the official part before they are approved.

I'm not really up to speed with the .debug_*_index sections, so I haven't looked really at the overall approach. I've just provided some basic stylistic comments.

llvm/include/llvm/DebugInfo/DWARF/DWARFUnitIndex.h
22

I'd guess this should probably use doxygen-style comments?

Same goes for below.

70

lowerCamelCase for function names. Same below.

123

"items in columns ColumnKinds" doesn't read well to me. I'm not sure if its missing punctuation, an extra word, or what.

llvm/lib/DebugInfo/DWARF/DWARFUnitIndex.cpp
105

Probably out of scope for this change, but this should return an llvm::Error instead to say why parsing failed.

135

also lay -> are

llvm/test/DebugInfo/X86/dwp-v2-cu-index.s
2

Might be worth a brief comment at the top of this test saying this is the pre-standard GCC version.

llvm/test/DebugInfo/X86/dwp-v2-loc.s
1 ↗(On Diff #249939)

I might have missed something, but is this relevant? I thought this patch was for supporting .debug_cu_index?

llvm/test/DebugInfo/X86/dwp-v2-tu-index.s
2

Add a comment again about "v2".

llvm/tools/llvm-dwp/llvm-dwp.cpp
616

I see the above error message starts with a capital letter, but more generally I think we try to use lower-case for error messages. Maybe worth doing it right here and changing the above line in a separate change?

651

Same comment as above.

ikudrin updated this revision to Diff 250537.Mar 16 2020, 6:01 AM
ikudrin marked 10 inline comments as done.

@jhenderson, thank you for the comments!

  • Made comments for DWARFSectionKind, serializeSectionKind() and deserializeSectionKind() in doxygen style.
  • Renamed function.
  • Fixed wording in a comment.
  • Added descriptions to the tests.

@dblaikie, @labath, as far as I can understand, the patch complies with your vision. The main difference is that the enum is still intended for internal use only, but it probably should not go to the public part before the proposed values are accepted. Anyway, even while the proposal of the combined index is not approved, I believe that the patch is useful per se because it allows reading standard index sections and can later be easily extended for combined indexes. The patch does not restrict that movement. Please, correct me if I misunderstand anything.

ikudrin added inline comments.Mar 16 2020, 6:03 AM
llvm/include/llvm/DebugInfo/DWARF/DWARFUnitIndex.h
123

Thanks for noticing that. It was an ugly result of multiple edits. Rephrased again. Hope it is better now.

llvm/lib/DebugInfo/DWARF/DWARFUnitIndex.cpp
105

Added to my backlog, but do not mind if anyone willing to fix that.

llvm/test/DebugInfo/X86/dwp-v2-loc.s
1 ↗(On Diff #249939)

The patch adjusts the code in the constructor of DWARFUnit which reads the location table. This test checks that pre-v5 units read their location tables from .debug_loc.dwo sections. Its counterpart, dwp-v5-loclists.s checks that v5 units use .debug_loclists.dwo. These changes might be probably extracted later to a separate patch.

jhenderson added inline comments.Mar 17 2020, 2:51 AM
llvm/test/DebugInfo/X86/dwp-v5-cu-index.s
2

that is compliant

llvm/test/DebugInfo/X86/dwp-v5-tu-index.s
2

that is compliant

llvm/tools/llvm-dwp/llvm-dwp.cpp
616

Ping?

651

Ping?

ikudrin updated this revision to Diff 250731.Mar 17 2020, 5:15 AM
ikudrin marked 8 inline comments as done.
  • Fixed messages in llvm-dwp.cpp.
  • Fixed comments in the tests.
llvm/tools/llvm-dwp/llvm-dwp.cpp
616

Fixed. Sorry for forgetting those. D76277 is added for changing the existing messages.

Thanks. My comments have all been addressed. I'm happy once somebody else looks at the more technical aspects.

@dblaikie, @labath, as far as I can understand, the patch complies with your vision. The main difference is that the enum is still intended for internal use only, but it probably should not go to the public part before the proposed values are accepted. Anyway, even while the proposal of the combined index is not approved, I believe that the patch is useful per se because it allows reading standard index sections and can later be easily extended for combined indexes. The patch does not restrict that movement. Please, correct me if I misunderstand anything.

Sorry about the delay. It took a while to get this to the top of my stack.

Yes, this "complies with my/our vission", but looking at the UnknownColumnIds field, I am starting to have second thoughts about that vision. :( Being able to represent "unknown" columns and at the same time mapping all columns to a internal representation makes things a bit awkward.

The reason this wasn't a problem for location lists is that you cannot "skip over" an unknown DW_LLE entry -- it terminates the parse right away.

However, that is not the case for debug_?u_index, where you can easily ignore unknown columns. In that light, I am wondering if it wouldn't be better after all to stick to the on-disk column numbers internally (but maybe still keep the "unified" v5 enum in the public interface). @dblaikie, what do you make of that?

(btw, is there a test case for the "unknown column" code path?)

Side note: This has become complicated enough it might be worth separating these patches now rather than later - changes to dumping should be separate from changes to llvm-dwp, for instance.

@dblaikie, @labath, as far as I can understand, the patch complies with your vision. The main difference is that the enum is still intended for internal use only, but it probably should not go to the public part before the proposed values are accepted. Anyway, even while the proposal of the combined index is not approved, I believe that the patch is useful per se because it allows reading standard index sections and can later be easily extended for combined indexes. The patch does not restrict that movement. Please, correct me if I misunderstand anything.

Sorry about the delay. It took a while to get this to the top of my stack.

Yes, this "complies with my/our vission", but looking at the UnknownColumnIds field, I am starting to have second thoughts about that vision. :( Being able to represent "unknown" columns and at the same time mapping all columns to a internal representation makes things a bit awkward.

The reason this wasn't a problem for location lists is that you cannot "skip over" an unknown DW_LLE entry -- it terminates the parse right away.

However, that is not the case for debug_?u_index, where you can easily ignore unknown columns. In that light, I am wondering if it wouldn't be better after all to stick to the on-disk column numbers internally (but maybe still keep the "unified" v5 enum in the public interface). @dblaikie, what do you make of that?

(btw, is there a test case for the "unknown column" code path?)

I'm undecided - yeah, it is awkward having an extra data structure to store the original parsed values & then remapping them back, etc. Alternatively we could use a single 64 bit value - and store unknown values shifted up into the top 32 bits (really I sort of wish they'd just used only 1 byte for these values - seems unreasonable that we'd need 256 sections... clearly we don't need them now)? & shift them back if the values too high & report it as unknown. Sort of doing something /like/ if the DWARF spec actually had an range reserved for implementation extensions, which will hopefully be added in the future.

But I wouldn't be completely opposed to the data structure keeping things in its parsed form & printing out based on the version of the index, etc. While having an API for querying based on the v5-with-extensions identifiers, though that seems a bit awkward.

llvm/include/llvm/DebugInfo/DWARF/DWARFUnitIndex.h
47–48

Probably not arbitrarily - in the sense that this is an extension that consumers/producers will need to agree to - so maybe saying that ("non-standard extension"/"proposed for standardization" or something to that effect) and/or linking to the dwarf-discuss thread to support why these values were chosen & they can't be changed arbitrarily.

llvm/lib/DebugInfo/DWARF/DWARFUnitIndex.cpp
100–108

What endianness is this encoded with? If it's little endian, then a 2 byte field with 2 bytes of zero padding should be the same as reading a 4 bytes, or the other way around, right? So perhaps we could just always read it as 2 bytes then 2 bytes of padding rather than having to the version/reset/reread dance here?

135

Should we be fixing it up here - or should we avoid setting it incorrectly in the first place?

ikudrin marked 3 inline comments as done.Mar 18 2020, 7:18 AM

(btw, is there a test case for the "unknown column" code path?)

Yes, it is checked in llvm/test/DebugInfo/X86/debug-cu-index-unknown-section.s, which was added in D75609 and then extended in D75668.

As for unknown columns in general, I believe they are not that important to complicate the code too much. Before D75609, llvm-dwarfdump just crashed when saw them. dwarfdump prints some useless (for a user) error message. An unknown column cannot be used by clients of the library because they do not know what to do with it. Dumping is the only reason to support unknown identifiers, and that should be done as simple as possible. If the current implementation seems too complicated, we can consider, for example, dropping printing raw IDs for unknown sections.

llvm/include/llvm/DebugInfo/DWARF/DWARFUnitIndex.h
47–48

As far as the enum is internal, no one should really worry about the actual values; they are not important and do not need any kind of proof. They may be really arbitrary, that will change nothing. That is what I meant when said "more or less".

The plan is that this patch supports DWARFv5 unit indexes, but not the proposed combined indexes. When the combined indexes are approved, there will be another patch, which moves the enum with all extensions in the public space. At that moment the factual values will become important, and the references to the descriptive document will be added. Do you think it will be possible to add such a document to the DWARF Wiki?

llvm/lib/DebugInfo/DWARF/DWARFUnitIndex.cpp
100–108

The endianness comes from the input file. I cannot find anything in the standard or the pre-standard proposal that would restrict it to be only little-endian. Thus, we should handle both cases.

135

As it is written at the moment, we have to pass something to distinct TU and CU indexes, for v2 cases. We may pass, say, a bool, but true and false are not that descriptive as DW_SECT_INFO and DW_SECT_TYPES. I believe that for the purpose of this patch, this fix is the simplest thing to do.

BTW, to support the combined TU index, the code should be adjusted anyway because where probably will be two "main" columns, one for v2 type units in .debug_types.dwo and another for v5 type units in .debug_info.dwo.

dblaikie added inline comments.Mar 18 2020, 2:11 PM
llvm/include/llvm/DebugInfo/DWARF/DWARFUnitIndex.h
47–48

Hmm, I'm confused then - ah, OK - so you've added the enum to support encoding the version 2 and version 5 tables into one internal data structure, but haven't extended it to actually dump or use (for parsing: eg to find the debug_loc.dwo contribution for a v4 unit described by a v5 index) them when parsing/rendering a v5 index.

OK, sorry I hadn't realized that. Then, yes, the comment makes sense for now. Perhaps "the values are only used internally/not parsed from input files (if these values appear in input files they will be considered "unknown")" would be more suitable?

The plan is that this patch supports DWARFv5 unit indexes, but not the proposed combined indexes. When the combined indexes are approved, there will be another patch, which moves the enum with all extensions in the public space. At that moment the factual values will become important, and the references to the descriptive document will be added. Do you think it will be possible to add such a document to the DWARF Wiki?

Given the DWARF committee is not in session at the moment (I think) & it'll be a while before another spec is published - I think it'll be necessary and appropriate to implement support for the extension columns in llvm-dwarfdump at least before/when they're implemented in llvm-dwp (which will be soon) to support testing that functionality & working with such files.

Might be able to put something on the DWARF wiki, but I don't know much about it/how things go up there.

llvm/test/DebugInfo/X86/dwp-v5-loclists.s
1–3 ↗(On Diff #250731)

Might be possible/better to test this without debug_abbrev and debug_info - make the test shorter & dump only the loclists section itself? Yeah, not exactly valid, but no reason the dumper shouldn't support it and it'd be a more targeted test (apply this suggestion to other tests if possible/acceptable too)

ikudrin marked 2 inline comments as done.Mar 18 2020, 10:29 PM
ikudrin added inline comments.
llvm/include/llvm/DebugInfo/DWARF/DWARFUnitIndex.h
47–48

Perhaps "the values are only used internally/not parsed from input files (if these values appear in input files they will be considered "unknown")" would be more suitable?

The comment says something similar in lines 52-53. Do you think it should be extended?

I think it'll be necessary and appropriate to implement support for the extension columns in llvm-dwarfdump at least before/when they're implemented in llvm-dwp (which will be soon) to support testing that functionality & working with such files.

I think the same. The only concern in my side is that the proposal should be formulated as an RFC or similar document before implementing it in the code so that all the implementations can reference the same source. For my taste, a link to the middle of a forum conversation cannot be considered as a reliable source.

llvm/test/DebugInfo/X86/dwp-v5-loclists.s
1–3 ↗(On Diff #250731)

This test is for changes in the constructor of DWARFUnit. It checks that DWARFUnit takes locations from the right place, so all four sections are necessary.

dblaikie added inline comments.Mar 19 2020, 3:33 PM
llvm/include/llvm/DebugInfo/DWARF/DWARFUnitIndex.h
47–48

I think it might be simpler to remove the "more or less arbitrary" part - it invites questions - just how much less?

And maybe rephrase the "is intended to be" and make it "The enum is for internal use only & doesn't correspond to any input/output constants".

But otherwise I guess this is fairly temporary so not worth haggling over.

llvm/test/DebugInfo/X86/dwp-v5-loclists.s
1–3 ↗(On Diff #250731)

AH, fair enough - thanks!

(btw, is there a test case for the "unknown column" code path?)

Yes, it is checked in llvm/test/DebugInfo/X86/debug-cu-index-unknown-section.s, which was added in D75609 and then extended in D75668.

Got it. Thanks.

As for unknown columns in general, I believe they are not that important to complicate the code too much. Before D75609, llvm-dwarfdump just crashed when saw them. dwarfdump prints some useless (for a user) error message. An unknown column cannot be used by clients of the library because they do not know what to do with it. Dumping is the only reason to support unknown identifiers, and that should be done as simple as possible. If the current implementation seems too complicated, we can consider, for example, dropping printing raw IDs for unknown sections.

Yeah, I agree that they are not very important, but it would be a pitty to lose them. OTOH, the lazily-initialized parallel array solution seems like it's more complicated that it should be. Maybe we drop the "lazy" part, rename it to RawColumnKinds and always store both? It's not like that's going to waste a bunch of memory, and it will be easier to understand (I hope).

ikudrin updated this revision to Diff 253874.Mar 31 2020, 7:14 AM
ikudrin marked 4 inline comments as done.
  • Updated the comment for DWARFSectionKind.
  • Simplified the storing of raw section identifiers.
  • Moved independent changes into separate patches.
ikudrin updated this revision to Diff 253879.Mar 31 2020, 7:21 AM
  • Removed DWARFUnitIndex::getVersion() as it is related to the other patch.
labath accepted this revision.Apr 1 2020, 2:36 AM

Thanks for doing this. This looks fine to me. @dblaikie, @jhenderson, do you have any additional comments?

llvm/lib/DebugInfo/DWARF/DWARFUnitIndex.cpp
38–39

maybe a small helper like isKnownSectionKind ?

This revision is now accepted and ready to land.Apr 1 2020, 2:36 AM
jhenderson accepted this revision.Apr 1 2020, 2:52 AM

Nothing else from me, thanks.

ikudrin updated this revision to Diff 254207.Apr 1 2020, 8:13 AM
ikudrin marked an inline comment as done.

Thanks, @jhenderson, @labath!

  • Added a helper function isKnownV5SectionID().
dblaikie accepted this revision.Apr 1 2020, 3:48 PM

Thanks for doing this. This looks fine to me. @dblaikie, @jhenderson, do you have any additional comments?

nah, nothing dire that comes to mind - thanks for checking