Page MenuHomePhabricator

[VERY-WIP] [ItaniumDemangle] Add ability to re-serialize the parsed AST
AbandonedPublic

Authored by labath on Aug 10 2018, 11:36 PM.

Details

Reviewers
None
Summary

This is kind of an alternative to D50586. Instead of modifying the parsed string in the middle of parsing, we "remangle" the (possibly modified name" into a new string.

This isn't complete yet (I've needed to modify some parts of the AST construction to preserve enough information for remangling, but I haven't done this for all nodes), and it needs a lot of cleaning up. In particular, it does not yet expose any way to actually modify the parsed AST.

However, I am putting it out here, to show a possibly cleaner (and definitely more powerful) alternative to the mentioned patch.

Diff Detail

Event Timeline

labath created this revision.Aug 10 2018, 11:36 PM

This is really cool, thanks for putting this together! In order to properly recreate the mangled name, we would have to track a far amount more information in the AST, looks like you have a pretty good start. I think that the set of transformations that we could support here would be pretty limited. For instance, substituting int for int* in a mangled name is probably a non-starter because we would have to add a substitution, which means that we would have to track substitutions from the AST (in order to tell when int* already appeared in the mangled name and should be remangled as a substitution), and adjust future substitutions numbers, which sounds like it would be hard to get right. Just replacing one <builtin-type> for another should be fine though (since those can't be substitutions), as should the constructor thing you mentioned.

Given that, I'm skeptical that this is the right way forward here. Maybe adding some way for clients to hijack the parser would be a better approach, i.e. we could make the parser a CRTP base, and LLDB could override parseType() or parseCtorDtor() as needed. Thats would kinda be a generalization of https://reviews.llvm.org/D50586. Any thoughts?

This is really cool, thanks for putting this together! In order to properly recreate the mangled name, we would have to track a far amount more information in the AST, looks like you have a pretty good start. I think that the set of transformations that we could support here would be pretty limited. For instance, substituting int for int* in a mangled name is probably a non-starter because we would have to add a substitution, which means that we would have to track substitutions from the AST (in order to tell when int* already appeared in the mangled name and should be remangled as a substitution), and adjust future substitutions numbers, which sounds like it would be hard to get right. Just replacing one <builtin-type> for another should be fine though (since those can't be substitutions), as should the constructor thing you mentioned.

That's a good point. I was aware that substitutions will be a tricky, but I thought this would only be a problem when actually updating the substitution nodes themselves (because of the duplicated information between the "index" and "node" members). It did not occur to me that adding/removing any non-trivial (substitutable) node would invalidate these substitution nodes. That could be fixed, but it would require a bit more effort -- e.g., making sure all substitution accesses go through the central vector of substitutable node (and substitution nodes just store the index into the vector), then after every (notrivial) modification going through all substitution nodes and updating their indices. However, that could be overkill given that I don't anticipate needing any complex transformations like this.

Given that, I'm skeptical that this is the right way forward here. Maybe adding some way for clients to hijack the parser would be a better approach, i.e. we could make the parser a CRTP base, and LLDB could override parseType() or parseCtorDtor() as needed. Thats would kinda be a generalization of https://reviews.llvm.org/D50586. Any thoughts?

That's an interesting idea. The slight hickup I see is that for my use case, I only need to update C1->C2 for the top-level function (and not, e.g. for the C1 instances nested inside local names). This kind of thing is easy to do on the AST, but not from within the parseCtorDtor() function. I suspect that may be true for other transformations as well (I'm not sure if there is a good spec of what should exactly be substituted by D50586, but I think this could be an issue there too). I suppose that could be made to work by overriding parseLocalName (and anything else that may embed a function encoding), but it won't be ideal, because we would need to duplicate some of the parsing logic:

disableC1substitution();
Node *Encoding = parseEncoding();
enableC1substitution();
if (consumeIf('s'))
  // do stuff

if (consumeIf('d'))
  // do other stuff

Node *Entity = parseName(State);
...

Or we could just not support substitution in local names (I think LLDB has bigger problems when working with local objects than missing constructors).

However, I also think that just allowing the AST transformations that we can guarantee don't run into the substitution problems you mentioned (E.g. by making only the "kind" field of the BuiltinType node and the "flavor" field of the CtorDtor node mutable), would be fine too. Then, if anyone wants to do another modification, he can just mark the extra field mutable (provide a setter), and implement any fixups needed. With any luck, all fixups will be trivial, and we will never run into the case of needing a substitution-invalidating transformation.

That's an interesting idea. The slight hickup I see is that for my use case, I only need to update C1->C2 for the top-level function (and not, e.g. for the C1 instances nested inside local names). This kind of thing is easy to do on the AST, but not from within the parseCtorDtor() function. I suspect that may be true for other transformations as well (I'm not sure if there is a good spec of what should exactly be substituted by D50586, but I think this could be an issue there too). I suppose that could be made to work by overriding parseLocalName (and anything else that may embed a function encoding), but it won't be ideal, because we would need to duplicate some of the parsing logic:

disableC1substitution();
Node *Encoding = parseEncoding();
enableC1substitution();
if (consumeIf('s'))
  // do stuff

if (consumeIf('d'))
  // do other stuff

Node *Entity = parseName(State);
...

Or we could just not support substitution in local names (I think LLDB has bigger problems when working with local objects than missing constructors).

Couldn't you just track the <encoding> depth by overriding parseEncoding, and only do the analysis in parseCtorDtor for top-level encodings?

However, I also think that just allowing the AST transformations that we can guarantee don't run into the substitution problems you mentioned (E.g. by making only the "kind" field of the BuiltinType node and the "flavor" field of the CtorDtor node mutable), would be fine too. Then, if anyone wants to do another modification, he can just mark the extra field mutable (provide a setter), and implement any fixups needed. With any luck, all fixups will be trivial, and we will never run into the case of needing a substitution-invalidating transformation.

I'm not opposed to this if you put the time into getting it right, but just for the record my concerns are:

  1. This will be a fairly intrusive change to the demangler
  2. Adding extra info to the AST, and the indirection required to track substitution and template-param numbers will have a perf cost, which may be a dealbreaker depending on the numbers (<5% is probably fine, say)
  3. This is a really hard to get right/heavyweight way of solving this specific problem.
  4. Not broadly useful because the amount of changes you can make without breaking substitutions is limited.

That's an interesting idea. The slight hickup I see is that for my use case, I only need to update C1->C2 for the top-level function (and not, e.g. for the C1 instances nested inside local names). This kind of thing is easy to do on the AST, but not from within the parseCtorDtor() function. I suspect that may be true for other transformations as well (I'm not sure if there is a good spec of what should exactly be substituted by D50586, but I think this could be an issue there too). I suppose that could be made to work by overriding parseLocalName (and anything else that may embed a function encoding), but it won't be ideal, because we would need to duplicate some of the parsing logic:

disableC1substitution();
Node *Encoding = parseEncoding();
enableC1substitution();
if (consumeIf('s'))
  // do stuff

if (consumeIf('d'))
  // do other stuff

Node *Entity = parseName(State);
...

Or we could just not support substitution in local names (I think LLDB has bigger problems when working with local objects than missing constructors).

Couldn't you just track the <encoding> depth by overriding parseEncoding, and only do the analysis in parseCtorDtor for top-level encodings?

You're right again. That does sound like it could work. I guess I should have given this idea more thought. Now I feel silly.

Given this, I think I'm convinced that going the CRTP route is a much better option. I'll shelve this and try to make that work instead.

I have one process question though. What's the current workflow for merging changes between the demangler copies in libc++abi and llvm? Right now they're mostly identical except for a few macros, but CRTP will require putting a lot of stuff into the header (and into a namespace), so they will diverge more. Do you think it would make sense to factor the common bits into a separate (.inc ?) file which can be kept identical? Then cxa_demangle.cpp could just #include that file into an anonymous namespace and llvm could include that into the llvm namespace (from a header file).

Given this, I think I'm convinced that going the CRTP route is a much better option. I'll shelve this and try to make that work instead.

Okay, great! Glad we're on the same page. This was a neat idea, thanks for taking the time to investigate it.

I have one process question though. What's the current workflow for merging changes between the demangler copies in libc++abi and llvm? Right now they're mostly identical except for a few macros, but CRTP will require putting a lot of stuff into the header (and into a namespace), so they will diverge more. Do you think it would make sense to factor the common bits into a separate (.inc ?) file which can be kept identical? Then cxa_demangle.cpp could just #include that file into an anonymous namespace and llvm could include that into the llvm namespace (from a header file).

I have an ugly perl script that I use to patch changes from libcxxabi to llvm (see the end of this message). The basic idea for compatibility in libcxxabi is ItaniumDemangle.cpp maps to libcxxabi/src/cxa_demangle, and the other headers in llvm/lib/Demangle/* map to their equivalents in libcxxabi/src/demangle/*. I would like to keep them identical to keep patching easy, so any changes to the file structure in llvm should also be done in libcxxabi.
I think that we should move the Node hierarchy into it's own file ItaniumNode.h, and the CRTP parser into one, ItaniumParser.h which would include ItaniumNode.h, and put all the declarations in the namespace llvm::itanium_demangler (or an anon namespace in libcxxabi). That would allow us to implement ItaniumPartialDemangler in a separate file (which wouldn't map to anything in libcxxabi), which would be nice. ItaniumDemangle.cpp could just be a #include "ItaniumParser.h" and the implementation of itaniumDemangle(), or __cxa_demangle in libcxxabi. Does that sounds reasonable? CCing @rsmith, who expressed some interest in moving the AST/parser to a header, and @dexonsmith, who's had some thoughts on this before.

#!/usr/bin/env perl

use warnings;
use strict;

my $diff = `git show $ARGV[0] libcxxabi/src`;

$diff =~ s/libcxxabi\/src\/cxa_demangle.cpp/llvm\/lib\/Demangle\/ItaniumDemangle.cpp/g;
$diff =~ s/libcxxabi\/src\/demangle\/Utility.h/llvm\/lib\/Demangle\/Utility.h/g;
$diff =~ s/libcxxabi\/src\/demangle\/Compiler.h/llvm\/lib\/Demangle\/Compiler.h/g;
$diff =~ s/libcxxabi\/src\/demangle\/StringView.h/llvm\/lib\/Demangle\/StringView.h/g;
$diff =~ s/_LIBCPP_UNREACHABLE\(\)/LLVM_BUILTIN_UNREACHABLE/g;
$diff =~ s/_LIBCPP_FALLTHROUGH\(\)/LLVM_FALLTHROUGH/g;
$diff =~ s/DUMP_METHOD/LLVM_DUMP_METHOD/g;
$diff =~ s/__cxxabiv1/llvm/g;
$diff =~ s/_LIBCXXABI_FUNC_VIS//g;
$diff =~ s/__cxa_demangle/itaniumDemangle/g;

open OUTPUT, "| patch -p1";
local $SIG{PIPE} = sub { die "pipe broken" };
print OUTPUT $diff;

I have one process question though. What's the current workflow for merging changes between the demangler copies in libc++abi and llvm? Right now they're mostly identical except for a few macros, but CRTP will require putting a lot of stuff into the header (and into a namespace), so they will diverge more. Do you think it would make sense to factor the common bits into a separate (.inc ?) file which can be kept identical? Then cxa_demangle.cpp could just #include that file into an anonymous namespace and llvm could include that into the llvm namespace (from a header file).

I have an ugly perl script that I use to patch changes from libcxxabi to llvm (see the end of this message). The basic idea for compatibility in libcxxabi is ItaniumDemangle.cpp maps to libcxxabi/src/cxa_demangle, and the other headers in llvm/lib/Demangle/* map to their equivalents in libcxxabi/src/demangle/*. I would like to keep them identical to keep patching easy, so any changes to the file structure in llvm should also be done in libcxxabi.
I think that we should move the Node hierarchy into it's own file ItaniumNode.h, and the CRTP parser into one, ItaniumParser.h which would include ItaniumNode.h, and put all the declarations in the namespace llvm::itanium_demangler (or an anon namespace in libcxxabi). That would allow us to implement ItaniumPartialDemangler in a separate file (which wouldn't map to anything in libcxxabi), which would be nice. ItaniumDemangle.cpp could just be a #include "ItaniumParser.h" and the implementation of itaniumDemangle(), or __cxa_demangle in libcxxabi. Does that sounds reasonable? CCing @rsmith, who expressed some interest in moving the AST/parser to a header, and @dexonsmith, who's had some thoughts on this before.

I don't have a strong opinion on how to keep the two versions of the demangler in sync, but I do think it's really important that they do not diverge. Perhaps we should add a test that the two versions are in sync, check the perl script in somewhere, and document the patching process?

Longer term, I suspect we'll want to pull the demangler guts out into a separate project that LLVM and libcxxabi both depend on (and expose in different ways)... but having LLVM depend on another project would be chaotic when we don't have a monorepo.

labath abandoned this revision.Nov 26 2018, 3:01 AM