
Pass `CompileUnit *` along `DWARFDIE` for DWZ
Changes Planned · Public

Authored by jankratochvil on Jan 22 2020, 8:36 AM.

Details

Summary

This patch passes CompileUnit * along DWARFDIE for DWZ.

Diff Detail

Event Timeline

jankratochvil created this revision.Jan 22 2020, 8:36 AM
jankratochvil edited the summary of this revision. (Show Details)Jan 22 2020, 10:22 AM

It's been a while since we had that discussion, but the way I'm remembering it, sizeof(DWARFDIE) was not my biggest concern with this stuff (though I may have been echoing the concerns of some other people). What I was trying to prevent is diverging even more from the llvm dwarf implementation. In that sense, changing lldb's DWARFDIE without a matching change to the llvm counterpart is equally bad. I don't think the first two options are ever going to make it into llvm. The third might, but even that may be a hard sell -- a slightly more tractable option might be to reuse the low-order bits in the DWARFDIE via PointerIntPair. We might be able to get those eight bits that way.
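For reference, the PointerIntPair trick mentioned above relies on the guaranteed-zero low-order bits of an aligned pointer. A self-contained sketch of the idea (simplified for illustration, not llvm::PointerIntPair itself):

```cpp
#include <cassert>
#include <cstdint>

// Sketch: a pointer to data aligned to (1 << Bits) bytes has Bits zero
// low bits we can reuse for a small integer tag, PointerIntPair-style.
template <typename T, unsigned Bits> class PtrIntPair {
  static_assert(alignof(T) >= (1u << Bits), "not enough spare low bits");
  uintptr_t value_ = 0;

public:
  PtrIntPair(T *ptr, unsigned tag)
      : value_(reinterpret_cast<uintptr_t>(ptr) | tag) {
    assert(tag < (1u << Bits) && "tag too wide for the spare bits");
  }
  T *pointer() const {
    return reinterpret_cast<T *>(value_ & ~uintptr_t((1u << Bits) - 1));
  }
  unsigned tag() const { return unsigned(value_ & ((1u << Bits) - 1)); }
};
```

With 8-byte-aligned pointees this yields three bits per pointer, so the "eight bits" would have to come from combining the spare bits of both pointers inside DWARFDIE.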

But all of that seems peripheral to me. The main thing I was trying to achieve was to have the low-level DWARF code (DWARFUnit, DWARFDIE, ... -- basically the things that are in the llvm dwarf parser right now) not require access to anything outside of what is available to them. If there's a thing which requires more info, it should be built on top of this lower layer, not trying to reach out from inside it. Once it's outside, a lot of options become available. If there is a lldb_private::CompileUnit in the vicinity of this code, it could just fetch the language from there. Or, we can have some sort of a SuperDWARFDIE, which includes the language, or the main compile unit (either lldb or dwarf) to pass around as a single entity (using PointerIntPair would be an optimization of that). Or, instead of SuperDWARFDIE we could use a DIERef, a user_id_t, or some of the other reference kinds we have already..
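A minimal sketch of what such a "SuperDWARFDIE" wrapper could look like. All types here are hypothetical stand-ins (CompileUnit, DWARFDIE, and the language accessor are placeholders, not LLDB's real classes):

```cpp
#include <cassert>

// Illustrative stand-ins for the lldb_private types.
struct CompileUnit { int language; };
struct DWARFDIE { const char *name; };

// "SuperDWARFDIE": the plain DIE plus the main compile unit it is being
// parsed into, carried as one entity instead of widening DWARFDIE itself.
struct SuperDWARFDIE {
  DWARFDIE die;
  CompileUnit *main_cu; // may be nullptr when no CU is known

  // Fetch the language from the nearby CompileUnit, as suggested above.
  int GetLanguage() const { return main_cu ? main_cu->language : 0; }
};
```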

As for this patch, I don't see any big problem with changing the representation in this case, but given that its motivation is to be able to later change DWARFDIE, I don't think it's really useful/needed.

The test changes are cool, and I'd encourage you to split them off into a separate patch.

jankratochvil planned changes to this revision.Thu, Jan 23, 10:29 AM

Is this something we can store in the DWARFUnit? Seems like it would be nice to be able to ask the DWARFUnit for its DWZ Unit? Then no changes are needed to DWARFDie since it could get the language by checking the DWARFUnit first and then fall back to the DWZ DwarfUnit that is a member of the first DWARFUnit?


DWARFDebugInfoEntry is a part of DWARFPartialUnit::m_die_array (DWARFPartialUnit = DW_TAG_partial_unit). One DWARFPartialUnit can be (and in real-world DWZ debug info is) included (by DW_TAG_imported_unit::DW_AT_import) into two different DW_TAG_compile_units, each having a different DW_TAG_compile_unit::DW_AT_language.
So no, the language cannot be stored in any DWARFUnit, nor in DWARFDebugInfoEntry. Unless we make a new copy of DWARFPartialUnit for each DW_TAG_imported_unit::DW_AT_import, which was my former implementation (2), which you turned down as needlessly inefficient.
Your follow-up is appreciated.
(I still need to prepare my reply to @labath.)
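The sharing problem described above can be sketched with stand-in types (none of these are LLDB's real classes, and the language values are arbitrary): one partial unit's DIE array is referenced from two main compile units, each carrying its own DW_AT_language:

```cpp
#include <cassert>
#include <vector>

// One shared partial unit, imported (DW_TAG_imported_unit/DW_AT_import)
// by two compile units declaring different DW_AT_language values.
struct PartialUnit {
  std::vector<const char *> dies; // shared DIE array, no language of its own
};
struct MainCompileUnit {
  int language;          // DW_AT_language of this DW_TAG_compile_unit
  PartialUnit *imported; // DW_AT_import target
};

// The language of a DIE inside the partial unit depends on which main CU
// we reached it through, so it cannot be a property of the unit itself.
int LanguageOfImportedDie(const MainCompileUnit &cu) { return cu.language; }
```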

BTW, I am currently trying to clean up the dwo/dwp code, and here (I think) I ran into a very similar problem to what you are encountering.

Dwo units currently contain a backlink to the skeleton unit that uses them, and we have a fairly convoluted system (that's the part I'm trying to clean up) to ensure that each time we are parsing a dwo unit, we are able to go back to the skeleton unit which uses it (and then to a lldb_private::CompileUnit). Now, in principle, there is nothing wrong with that because the skeleton and split units are in a 1:1 relationship. However, the llvm parser does not have such a backlink. And although I don't think it would be unreasonable to have the llvm parser include this backlink, given that the dwz unit cannot make use of that (not 1:1), I think it would be a good idea to come up with a different solution for this, and then have both dwo and dwz use that.

Overall I think the intention of passing another DWARFUnit pointer to identify the "compile unit that we're actually parsing stuff into" is reasonable (as with DWZ, the idea that a DWARFUnit somehow corresponds to an lldb CompileUnit is completely breaking down). But I don't think we should be making that a part of the DWARFDIE (and thereby infiltrating it into everything). Instead I think it should be something on top of that. Whether that needs to be a separate class/struct (std::pair<DWARFUnit, DWARFDIE>?) or just an additional argument to the functions that need it is still not fully clear to me...

jankratochvil retitled this revision from `DWARFASTParserClang::m_decl_ctx_to_die` `DWARFDIE`->`DIERef` for DWZ to Pass `CompileUnit *` along `DWARFDIE` for DWZ.
jankratochvil edited the summary of this revision. (Show Details)

Whether that needs to be a separate class/struct (std::pair<DWARFUnit, DWARFDIE>?) or just an additional argument to the functions that need it is still not fully clear to me...

I also do not find that clear.

In this code there are:

  • 19 FIXMEindex - we need to store DWARFCompileUnit * in the manual/apple/debug_names indexes; they store CompileUnit * now, but I think that could be fixed and converted in some places
  • 11 FIXMEunused - I think in these cases (non-DWARF debug info) nullptr is OK for the CompileUnit *, as nobody uses it there
  • 3 FIXMEDeclContext - those (in fact 2) cases may be tricky; I have tried it but started to get lost in it, hopefully solvable
  • 1 FIXMETypeuser_id_t - it is needed only for SBModule::GetTypeByID(), and I think one needs to embed the DWZ main CU into lldb::user_id_t only in SBValue::GetID(). One cannot embed it into all lldb::user_id_t instances inside LLDB, as then the DWZ main CU would need to be in DWARFDIE, and that has been rejected.

Does this look like an acceptable approach to pursue further?

This patch applies on top of the non-regressing implementation using DWARFDIE::m_main_cu:

  • git clone -b dwz git://git.jankratochvil.net/lldb

This patch is only expected to compile; it has no chance of running without crashes so far.

labath added a comment.Wed, Feb 5, 9:35 AM

Thanks for exploring this.

From a high-level, this approach looks like it could be viable. However, you have taken this a lot further than I would have expected. What I was expecting to see would be that a bunch of the DWARF internal functions would get an extra argument (be it a CompileUnit, DWARFUnit, or whatever), but I did not expect that this would also appear on the "generic" interfaces. My idea was that the external interface would keep getting just a single user_id_t, but then during the process of decoding that integer into a DWARFDie, we would also produce the additional CompileUnit object.

Now, that doesn't mean that adding a CompileUnit argument to these functions too is a bad idea. Quite the opposite -- _if it actually works_, it would be great because it would give us a lot more freedom in encoding the information into the user_id_t value. However, I have some doubts about that. For example, right now we don't create any CompileUnit objects for types in DWARF type units, and I'm not sure how much we want to change that because of the sheer number of type units. That means we'd still have to have a way to encode a precise type location in a user_id_t without a CompileUnit, though maybe we would be able to assume that a CompileUnit is always available in the DWZ case (?).

This duality does make me think that we'd be better off in not adding the CompileUnit arguments and encoding everything into a user_id_t. However, I am not sure if we really have a choice here -- it may be that this information is so complex that it will not be possible to reasonably encode it into 64 bits.

PS. It would be easier to evaluate this if this was based on master instead of some patch which is taking the source in a direction we don't want to go.

jankratochvil planned changes to this revision.Wed, Feb 5, 11:56 AM

From a high-level, this approach looks like it could be viable. However, you have taken this a lot further than I would have expected. What I was expecting to see would be that a bunch of the DWARF internal functions would get an extra argument (be it a CompileUnit, DWARFUnit, or whatever), but I did not expect that this would also appear on the "generic" interfaces. My idea was that the external interface would keep getting just a single user_id_t, but then during the process of decoding that integer into a DWARFDie, we would also produce the additional CompileUnit object.

In that case there need to be two kinds of user_id_t: one internal to DWARF, not containing the MainCU, and one for "generic" interfaces, which does contain the MainCU. Because internally, if user_id_t contains the MainCU, then DWARFDIE and DIERef must also contain the MainCU - see for example ElaboratingDIEIterator calling DWARFDIE::GetID, where there is no chance to know the MainCU (without heavy refactoring to pass it down the call chain) - and the code does not really need to know the MainCU; it is just using user_id_t as a unique DIE identifier.

Now, that doesn't mean that adding a CompileUnit argument to these functions too is a bad idea. Quite the opposite -- _if it actually works_, it would be great because it would give us a lot more freedom in encoding the information into the user_id_t value. However, I have some doubts about that. For example, right now we don't create any CompileUnit objects for types in DWARF type units, and I'm not sure how much we want to change that because of the sheer number of type units. That means we'd still have to have a way to encode a precise type location in a user_id_t without a CompileUnit, though maybe we would be able to assume that a CompileUnit is always available in the DWZ case (?).

Yes, if there is no CompileUnit available then DWZ is not used for that type and the DWARFUnit itself can be used. CompileUnit * can remain nullptr when not known.

The new plan:

  • SymbolFile and the SB* API should stay unchanged, with MainCUs encoded into the user_id_t they use
  • types like Function do contain a CompileUnit *, but it may still be easier for the APIs to encode the MainCU into their user_id_t.
  • indexes currently contain DIERef; they should instead contain a new MainCU-extended variant of DIERef.
  • DWARFASTParser (+ DWARFASTParserClang) API - there must be an additional CompileUnit * parameter, as it uses DWARFDIE. (DWARFDIE cannot fit the MainCU into its 16 bytes.)
  • DWARFMappedHash must also get a new MainCU field, I think.
  • NameToDIE should use the new MainCU-enhanced DIERef.
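The DWARFASTParser point, i.e. threading a CompileUnit * alongside the DIE instead of widening DWARFDIE, could look roughly like this (all names and the dummy body are illustrative stand-ins, not the real LLDB API):

```cpp
#include <cassert>

// Stand-ins for the real LLDB types; this only illustrates the shape of
// adding a main-CU parameter next to the DIE, as proposed above.
struct CompileUnit { int id; };
struct DWARFDIE { const char *name; };

// Hypothetical parser entry point: the DIE stays 16 bytes, and the main CU
// travels as an explicit extra argument, nullptr when not known (non-DWZ).
int ParseTypeFromDWARF(CompileUnit *main_cu, const DWARFDIE &die) {
  return main_cu ? main_cu->id : -1; // dummy body for illustration
}
```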

Given that user_id_t and DIERef will exist both with and without MainCU (depending on which code is using them) it would be nice to make a new MainCU-extended inherited type for them so that their mistaken use is compile-time checked. But user_id_t is just an integer typedef so it would need to be class-wrapped. For user_id_t one needs to also make a new MainCU-extended type of UserID.
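The class-wrapping idea could be sketched as two distinct id types whose explicit constructors prevent accidental mixing at compile time (names are hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// A DWARF-internal id that does not carry the MainCU.
class DwarfLocalID {
  uint64_t value_;
public:
  explicit DwarfLocalID(uint64_t v) : value_(v) {}
  uint64_t value() const { return value_; }
};

// The "generic"-interface id that does carry the MainCU.
class MainCUUserID {
  uint64_t value_;
public:
  explicit MainCUUserID(uint64_t v) : value_(v) {}
  uint64_t value() const { return value_; }
};

// A function taking MainCUUserID will not accept a DwarfLocalID, nor a raw
// integer, thanks to the explicit constructors and distinct class types.
uint64_t Resolve(MainCUUserID id) { return id.value(); }
```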

This duality does make me think that we'd be better off in not adding the CompileUnit arguments and encoding everything into a user_id_t. However, I am not sure if we really have a choice here -- it may be that this information is so complex that it will not be possible to reasonably encode it into 64 bits.

My DWZ branch (git clone -b dwz git://git.jankratochvil.net/lldb) already encodes the MainCU into all of DWARFDIE (growing from 16 B to 24 B), DIERef (staying 8 B) and user_id_t (staying 8 B, reducing the DWO unit number from 31 bits to 29 bits). One can encode the MainCU into the DWO field as the two can never exist together.
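One possible shape of such an encoding, with made-up field widths that merely illustrate reusing the DWO-unit field for a DWZ main-CU index behind a small discriminator (the branch's actual layout may differ):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 64-bit user_id_t layout: low bits hold the DIE offset, the
// middle field holds either a DWO unit number or a DWZ main-CU index (they
// never coexist for one DIE), and the top bits say which one it is.
constexpr uint64_t kDieOffsetBits = 32;
constexpr uint64_t kUnitBits = 29; // DWO unit *or* DWZ main-CU index
constexpr uint64_t kKindBits = 3;  // discriminator
static_assert(kDieOffsetBits + kUnitBits + kKindBits == 64, "must pack");

enum class UnitKind : uint64_t { None = 0, Dwo = 1, DwzMainCU = 2 };

uint64_t Encode(UnitKind kind, uint64_t unit, uint64_t die_offset) {
  assert(unit < (uint64_t(1) << kUnitBits));
  assert(die_offset < (uint64_t(1) << kDieOffsetBits));
  return (uint64_t(kind) << (kUnitBits + kDieOffsetBits)) |
         (unit << kDieOffsetBits) | die_offset;
}
uint64_t DieOffset(uint64_t id) {
  return id & ((uint64_t(1) << kDieOffsetBits) - 1);
}
uint64_t Unit(uint64_t id) {
  return (id >> kDieOffsetBits) & ((uint64_t(1) << kUnitBits) - 1);
}
UnitKind Kind(uint64_t id) {
  return UnitKind(id >> (kUnitBits + kDieOffsetBits));
}
```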

PS. It would be easier to evaluate this if this was based on master instead of some patch which is taking the source in a direction we don't want to go.

I need to verify that the new refactoring satisfies all the requirements of the existing functional, non-regressing DWZ implementation I have. I would then have to create a second proof-of-concept patch based on master to post here, which I hope is not required for your review before the real implementation. I understand the final refactoring patch will be based on master and DWZ will be rebased on top of it.

From a high-level, this approach looks like it could be viable. However, you have taken this a lot further than I would have expected. What I was expecting to see would be that a bunch of the DWARF internal functions would get an extra argument (be it a CompileUnit, DWARFUnit, or whatever), but I did not expect that this would also appear on the "generic" interfaces. My idea was that the external interface would keep getting just a single user_id_t, but then during the process of decoding that integer into a DWARFDie, we would also produce the additional CompileUnit object.

In that case there need to be two kinds of user_id_t: one internal to DWARF, not containing the MainCU, and one for "generic" interfaces, which does contain the MainCU. Because internally, if user_id_t contains the MainCU, then DWARFDIE and DIERef must also contain the MainCU - see for example ElaboratingDIEIterator calling DWARFDIE::GetID, where there is no chance to know the MainCU (without heavy refactoring to pass it down the call chain) - and the code does not really need to know the MainCU; it is just using user_id_t as a unique DIE identifier.

Well.. I think we should avoid using user_id_t in places where we don't have enough information to construct one. I have removed the usage in ElaboratingDIEIterator with 034c2c6771d318f962bbf001f00ee11e542e1180.

...
The new plan:

  • SymbolFile and the SB* API should stay unchanged, with MainCUs encoded into the user_id_t they use
  • types like Function do contain a CompileUnit *, but it may still be easier for the APIs to encode the MainCU into their user_id_t.

SG, I agree that would be better, since it requires less refactoring throughout.

  • indexes currently contain DIERef; they should instead contain a new MainCU-extended variant of DIERef.

The indexes need to *produce* an "extended" DIERef. Internally, they can contain whatever they want. This is particularly important for the manual index, where simply introducing a new field would be a pretty big memory hit, I think. However, it should be fairly easy to design a more compact encoding scheme specifically for the use in the manual index.

In fact, if we make the index functions produce results piecemeal rather than in one big chunk, then we could make them return DWARFDIEs (and compile units or whatever) directly. The reason these functions return DIERefs now is that DIERefs are cheap to construct, and we don't want to construct all DWARFDIEs if the user only cares about one result. If the user can abort iteration, then we can get rid of DIERefs in these functions, as the first thing that the user does to them anyway is to convert them to a DWARFDIE.
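The piecemeal, abortable iteration described above could be sketched with a callback that receives fully-formed results and can stop the scan early (all types here are stand-ins, not the real index classes):

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Stand-in for a fully-formed result the index would hand out directly,
// instead of returning a whole vector of cheap DIERefs up front.
struct DWARFDIE { std::string name; };

// The index invokes the callback per match and stops as soon as the
// callback returns false, so unneeded DIEs are never materialized.
void FindByName(const std::vector<DWARFDIE> &index, const std::string &name,
                const std::function<bool(const DWARFDIE &)> &callback) {
  for (const DWARFDIE &die : index)
    if (die.name == name && !callback(die))
      return; // caller aborted the iteration
}
```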

  • DWARFASTParser (+DWARFASTParserClang) API - there must be an additional CompileUnit * parameter as it uses DWARFDIE. (DWARFDIE cannot fit MainCU into its 16 bytes.)

Yep. It may be interesting to try this out on regular dwo first. Right now, dwo dodges this problem by storing a CompileUnit pointer on both skeleton and split units. If we can make this work without the split unit back pointer, then we can move closer to how things work in llvm, and also pave the way for the dwz stuff.

  • DWARFMappedHash also must have new MainCU field I think.

This is only used in conjunction with apple accelerators. I don't think dwz will ever be a thing there, so you can just ignore dwz here (type units and dwo are also ignored here).

  • NameToDIE should be using the new MainCU-enhanced DIERef.

This is only used in the manual index, where we can do something custom if needed.

Given that user_id_t and DIERef will exist both with and without MainCU (depending on which code is using them) it would be nice to make a new MainCU-extended inherited type for them so that their mistaken use is compile-time checked. But user_id_t is just an integer typedef so it would need to be class-wrapped. For user_id_t one needs to also make a new MainCU-extended type of UserID.

I think it would be good to have only one kind of "user id". What are the cases where you need a main-cu-less user id?

PS. It would be easier to evaluate this if this was based on master instead of some patch which is taking the source in a direction we don't want to go.

I need to verify that the new refactoring satisfies all the requirements of the existing functional, non-regressing DWZ implementation I have. I would then have to create a second proof-of-concept patch based on master to post here, which I hope is not required for your review before the real implementation. I understand the final refactoring patch will be based on master and DWZ will be rebased on top of it.

This is enough to get a very high-level idea of where you're going, but it gets tricky when I want to look at some specific details. For instance, some of the changes in this patch seem trivial and non-controversial, but that's only because they are based on changes in the other patch, and those are the changes I had problems with.