Page MenuHomePhabricator

[Demangler] [DO NOT SUBMIT] Initial patch for Microsoft demangler.
Needs ReviewPublic

Authored by ruiu on Jun 26 2017, 10:51 PM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

This patch is not intended to be submitted as-is. I'm sending this out
now to get comments if there's anyone who has an opinion on the
implementation strategy.

So this is an initial patch for a Microsoft-style symbol demangler.
To give you an overview, I'll describe the code from 10,000 feet view.

  • First of all, this demangler is not based on the exiting Itanium demangler which resides in the same directory. This is because the existing demangler seems too complicated to me and reported to be too slow.

    I designed this demangler with a goal to reduce the number of memory allocations as possible. This hasn't yet optimized hard, because I don't want to do that until the feature is complete, but I think you can see it from the code.
  • Internally, all mangled symbols are converted to ASTs (or Types) and then converted to strings. This simplifies the code a lot. Particularly, constructing a string representing a C++ type is hard because you have to append new strings to both ends of an intermediate result. By using ASTs, you can eliminate that complexity.

    I also believe that this is faster than constructing strings directly.
  • This demangler does not use any LLVM code. So, there's no use of StringRef, ArrayRef, etc. I had to write my own small String class, which is a bit silly, but this is intentional as this demangler is supposed to be used in other programs such as libcxxabi as well.

I don't ask you guys to do regular code review on this patch because the
patch is incomplete. Particularly, this patch

  • does not provide a support for demanglign C++ special functions such as operator overloads, except the ctor and dtor, and
  • does not provide a "proper" external function (whatever it means). The current microsoftDemangle function is provided just for testing.

I guess that this patch might be fun to read. Enjoy.

Event Timeline

ruiu created this revision.Jun 26 2017, 10:51 PM
davide added a subscriber: davide.Jun 26 2017, 11:21 PM
majnemer added inline comments.
llvm/lib/Demangle/MicrosoftDemangle.cpp
492–507

Isn't 'E' only for 64-bit?

600–619

These should be handled as normal structs and unions.

811

symbo -> symbol.

eugene added a subscriber: eugene.Jun 27 2017, 3:05 PM
eugene added inline comments.
llvm/lib/Demangle/MicrosoftDemangle.cpp
190

Here a below a lot of function names violate LLVM's conventions (http://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly)

Please let me know if that document is obsolete.

ruiu added inline comments.Jun 27 2017, 3:19 PM
llvm/lib/Demangle/MicrosoftDemangle.cpp
190

That's intentional. This file is intended to be used not only in LLVM but also in other projects. So this file does not use any LLVM functions and does not follow the LLVM coding style. I did not make that decision -- I just followed ItaniumDemangler.cpp which exists in the same directory.

rnk added a subscriber: rnk.Jun 28 2017, 11:32 AM
rnk added inline comments.
llvm/lib/Demangle/MicrosoftDemangle.cpp
168

We will probably need to separate these for function templates.

More comments later when I have some time to look in more detail.

llvm/lib/Demangle/MicrosoftDemangle.cpp
96

What about vectorcall?

304–314

Can you add a similar comment above the parse() function that describes the high level grammar, similar to how you've done here with this one subsection?

332

What is the significance of 'A' and 'P'?

412–413

vectorcall is Q

419

s/structors/destructors/? Otherwise I'm not sure what this means.

429

Not all of these are valid for all types. For example, unaligned is not valid for functions. Do we care about this?

448–449

This is not quite right. After the F comes an additional storage class. For example, if it's simply unaligned, it will be FA. const __unaligned will be FB. const volatile __unaligned would be FD. etc.

450–451

Same is true here. You need to also consume the following character and or in the result until you get a terminating sequence

456

Can we have another grammar summary here?

492–505

This is not quite right. This A and B here corresponds to the storage class returned from read_storage_class(). So you need to be prepared to handle every PE + <any possible return value of read_storage_class()>, including the complex return values like FA and FIB, etc.

507

Same is true here.

ruiu added a comment.Jun 28 2017, 1:51 PM

Zach,

You don't want to take a look into this patch in detail because this patch is incomplete. I know this has a lot of correctness issues, and I'm fixing all of them now. The point of sending this patch early is to get feedback for the design of this demangler, as this demangler is completely different from the existing Itanium demangler.

rnk added a comment.Jun 28 2017, 2:02 PM

Broadly speaking, having a general Type or Node class that forms a tree seems like the right approach. I'm sure it'll grow in complexity over time. Pre-allocating enough of them to demangle most function prototypes without memory allocation is also great. I think this is the right direction.

It might be worth completely eliminating the STL dependency so that this can be run reliably without memory allocation in a crashing process, but it's probably not worth it for now.

So the idea is to convert the mangled name to an AST and then convert the AST to a string. That makes sense. But is the string form of the result always enough? Will there ever be a need for a caller to get the AST or some other representation?

compnerd added inline comments.
llvm/lib/Demangle/MicrosoftDemangle.cpp
34

Isnt this effectively std::string_view or StringRef?

190

I think that we should adhere to the LLVM style. The itanium demangler was imported stock from libc++abi where it must adhere to the C++ ABI specification. I expect that over time it will move back to the LLVM style.

332

It is just the way that the numbers are encoded. 'A' = 0, 'P' = 15 (base 16-encoding).

ruiu added a comment.Jun 29 2017, 1:03 PM

So the idea is to convert the mangled name to an AST and then convert the AST to a string. That makes sense. But is the string form of the result always enough? Will there ever be a need for a caller to get the AST or some other representation?

This is an interesting question. It could imagine that it might be useful, but honestly I don't come up with a use case in which you want to access internal ASTs.

llvm/lib/Demangle/MicrosoftDemangle.cpp
34

Yes it is. This file is designed to be usable outside LLVM.

ruiu updated this revision to Diff 104934.Jun 30 2017, 2:18 PM

Over the last few days, I made a bunch of changes to the demanler to improve it.
It's not complete yet, but it started working really well. It can now demangle
complex C++ symbols as you can see in MicrosoftDemangleTest.cpp file.

Preliminary performance testing shows that this demangler is pretty fast.
I ran this and Microsoft's UnDecorateSymbolName for the test cases in
MicrosoftDemangleTest.cpp.

It showed that my demangler is 45% faster (or, finishs benchmark in 55% time)
than the Microsoft's demangler. This result seems very promising.

I think the remaining tasks are to add more types, add more tests and fix bugs.
I don't expect any major design changes. So if you want to review this, you can
do that now. I'm sending this out today because I'm on vacation next week, so
I'll address review comments in the week after next.

Will add more comments later.

llvm/lib/Demangle/MicrosoftDemangle.cpp
143

Missing vectorcall.

190

Agree with this. I would like it if we can adhere to LLVM style wherever possible.

318

isdigit(*input.p)

319

Why are we adding 1? If *input.p == '0' then wouldn't we want to return the number 0?

357

The comment here mentions that this is only for 64-bit, but we expect it always. Won't this generate an error in non-x86?

382

What's the +1 for?

394

You can save 2 lines if you invert this conditional and break on the negation. You can drop the curly braces as well as the continue statement.

532–537

Why not use read_storage_class() here? This misses all the cases like unaligned and restrict, and it also treats C and D as equivalent, which doesn't seem right. C is just volatile, while D is const volatile. Regardless, I think you can just say ty.sclass = read_storage_class();

544

You should also handle 3A6, which is a reference to a function.

559

Should this be Private | Static | FFar?

561

Should this be Private | Virtual | Far?

562

Are G and H not used?

605

vectorcall is Q

656

Can we have comments above every read_xxx function that describes the grammar of what is being read?

657

If you write this code:

nullptr_t NullPtr;

Then this is mangled as ?NullPtr@@3$$TA. Because of the 3 it makes it into this function, but nothing in this function handles a $$. In fact, the only place in the entire demangler that handles $$ is in the processing of an array, and it expects $$C. So this looks to be missing.

703

Can you sort this list in alphabetical order?

720
case 'S': return Char16;
case 'U': return Char32;
737

The same problem here as mentioned earlier. For 32-bit, it's A.

Note that both can theoretically appear in the same mangled name, for example: void foo(int * __ptr32 PA, int * __ptr64 PB);

Also, I think most (all?) of the original comments I had still apply.

zturner added inline comments.Jun 30 2017, 4:45 PM
llvm/lib/Demangle/MicrosoftDemangle.cpp
631

Support for restrict and unaligned got removed, it seems like? See original comments about how to correctly support those.

666

Here's a few more.

3A6  Reference to function
3P6   Rvalue Reference to Function
3P8  Pointer to Member Function
3AEAP8 Reference to Member Function Pointer
3$$QEAP8  Rvalue reference to member function pointer  (wtf?)

This is the second time I've seen this 3$$, so we should probably figure out what it means (it occurred in the variable of type nullptr_t also.

zturner added inline comments.Jun 30 2017, 4:56 PM
llvm/lib/Demangle/MicrosoftDemangle.cpp
663–666

Ok, I think I get it. Here's some pointers to member functions:

Ok I think I get it. Here's some pointers to member functions:

PtrMemberFunction@@3P8TheClass@@EAAXXZEQ1
PtrPtrMemberFunction@@3PEAP8TheClass@@EAAXXZEA
PtrPtrPtrMemberFunction@@3PEAPEAP8TheClass@@EAAXXZEA

The key observation here is that the 8 (8 here means member function pointer, as opposed to 6 in the case you've handled) doesn't necessarily come after the P. So this logic will not work here. I think you need to sink this conditional into the read_pointee function below. First you read the P, then you call read_pointee, which sees E (64-bit, and indicates that the storage class needs to be read), then storage class A (no storage class), then indirect recursion back into read_var_type(). It sees P again (pointer), then calls read_pointee again, this time there is no E (64-bit) or A (32-bit), but instead there is the number 8 (member function).

The same would apply to non-member functions, and the algorithm handles constness, reference, etc as well. So this should handle the general case.

pcc added a subscriber: pcc.Jan 30 2018, 5:41 PM

I've been using https://github.com/rui314/msvc-demangler which I understand is basically the same code as this.

Here are a few names which it was unable to demangle:

$ undname '??1B@@UAE@XZ'
E expected, but got AE@XZ
$ undname '?foo@B@@UAEXXZ'
E expected, but got AEXXZ
$ undname '?foo@B@@$4PPPPPPPM@A@AEXXZ'
unknown func class: $4PPPPPPPM@A@AEXXZ

They were all derived from the IR from the clang test case clang/test/CodeGenCXX/microsoft-abi-virtual-inheritance.cpp.