This is an archive of the discontinued LLVM Phabricator instance.

[PDB] Introduce `MSVCUndecoratedNameParser`
ClosedPublic

Authored by aleksandr.urakov on Sep 25 2018, 5:47 AM.

Details

Summary

This patch introduces the simple MSVCUndecoratedNameParser. It is needed for parsing names of PDB symbols corresponding to template instantiations. For example, for the name

`operator<<A>'::`2'::B::operator>

we can't just split the name with :: (as it is implemented for now) to retrieve its scopes. This parser processes such names in a more correct way.

Diff Detail

Repository
rLLDB LLDB

Event Timeline

I think you should look at CPlusPlusLanguage::MethodName. It already contains a parser (in fact, two of them) of c++ names, and I think it should be easy to extend it to do what you want.

I think you should look at CPlusPlusLanguage::MethodName. It already contains a parser (in fact, two of them) of c++ names, and I think it should be easy to extend it to do what you want.

I agree with Pavel here. Try to use and extend CPlusPlusLanguage::MethodName as needed. I believe it was recently backed by a new clang parser that knows how to chop up C++ demangled names

Ok, I'll look into that, thanks!

Try to use and extend CPlusPlusLanguage::MethodName as needed. I believe it was recently backed by a new clang parser that knows how to chop up C++ demangled names

It seems that CPlusPlusLanguage::MethodName is backed by LLDB CPlusPlusNameParser, which can't parse demangled names... Can you tell me, please, how is called a new Clang parser you have mentioned? May be I'll use it directly instead of PDBNameParser, or will back PDBNameParser by it (if the interface will be not very convenient)?

Hui added a subscriber: Hui.Sep 30 2018, 8:58 AM

Try to use and extend CPlusPlusLanguage::MethodName as needed. I believe it was recently backed by a new clang parser that knows how to chop up C++ demangled names

It seems that CPlusPlusLanguage::MethodName is backed by LLDB CPlusPlusNameParser, which can't parse demangled names... Can you tell me, please, how is called a new Clang parser you have mentioned? May be I'll use it directly instead of PDBNameParser, or will back PDBNameParser by it (if the interface will be not very convenient)?

Maybe you can try RichManglingContext::FromCxxMethodName(the_demangled_name). Parse it wtih RichManglingContext::ParseFunctionBaseName and extract the full/base name or function decl context (NS) if any with related methods.
No need to introduce a new one.

Ok, I'll look at this, thank you!

It seems that CPlusPlusLanguage::MethodName is backed by LLDB CPlusPlusNameParser, which can't parse demangled names...

What makes you say that? If you look at the MethodName unit tests (unittests/Language/CPlusPlus/CPlusPlusLanguageTest.cpp), you will see that they all operate on demangled names -- there isn't a mangled name in the whole file.

Using RichManglingContext would work too, though I'm not sure if that will buy you anything, as that is just a wrapper around this parser.

I've tried to parse with it a name like

N0::`unnamed namespase'::Name

and it can't parse it correctly. May be it just can't parse MSVC demangled names?

Unfortunately, I can't look at the tests right now, I have a vacation. I'll look at these a week later, ok?

labath added a comment.Oct 1 2018, 5:04 AM

I've tried to parse with it a name like

N0::`unnamed namespase'::Name

and it can't parse it correctly. May be it just can't parse MSVC demangled names?

I expect the backqoutes are confusing it. If you try something without anonymous namespaces, it should work fine, and adding support for them shouldn't be too hard (though we may run also into problems with function pointers or other funky names, if they don't demangle the same way as with itanium).

Regardless of how this is exactly implemented, I think it's important to make the CPlusPlusLanguage::MethodName class understand these MSVC names, as this class is already used in a bunch of places. So, if it chokes on MSVC names, you're bound to run into more problems down the line.

Unfortunately, I can't look at the tests right now, I have a vacation. I'll look at these a week later, ok?

That's fine. There's no hurry here..

Hello!

I just have tried to patch CPlusPlusNameParser in the way to support MSVC demangled names, but there is a problem. CPlusPlusNameParser splits an incoming name in tokens with clang::Lexer. I've lexed the next name:

`anonymous namespace'::foo

The lexer treats the first character (a grave accent) as an unknown token, and it's ok for our purposes. Then it sees an identifier (anonymous), a keyword (namespace), and it's ok too. But the problem is with the last part of the string. The lexer sees an apostrophe and supposes that it's a character constant, it looks for a closing apostrophe, don't find it and treats all the line ending ('::foo) as a single unknown token.

It is possible to somehow make clang::Lexer lex MSVC demangled names correctly, but I'm not sure if it is the right place to do it. And it may have then some side effects during lexing a real code.

Another option is to somehow preprocess the name before lexing and replace all paired apostrophes with grave accents, and after lexing replace with apostrophes back, and make CPlusPlusNameParser understand unknown grave accent tokens. But it's a bit tricky, may be you can suggest some better solution?

Hello!

I just have tried to patch CPlusPlusNameParser in the way to support MSVC demangled names, but there is a problem. CPlusPlusNameParser splits an incoming name in tokens with clang::Lexer. I've lexed the next name:

`anonymous namespace'::foo

The lexer treats the first character (a grave accent) as an unknown token, and it's ok for our purposes. Then it sees an identifier (anonymous), a keyword (namespace), and it's ok too. But the problem is with the last part of the string. The lexer sees an apostrophe and supposes that it's a character constant, it looks for a closing apostrophe, don't find it and treats all the line ending ('::foo) as a single unknown token.

It is possible to somehow make clang::Lexer lex MSVC demangled names correctly, but I'm not sure if it is the right place to do it. And it may have then some side effects during lexing a real code.

Another option is to somehow preprocess the name before lexing and replace all paired apostrophes with grave accents, and after lexing replace with apostrophes back, and make CPlusPlusNameParser understand unknown grave accent tokens. But it's a bit tricky, may be you can suggest some better solution?

Just handle the anonymous namespace' thing specially before passing to CPlusPlusNameParser`.

Just handle the anonymous namespace' thing specially before passing to CPlusPlusNameParser`.

Yes, it's an interesting idea to somehow preprocess an MSVC demangled name and make a GCC demangled name from it (and make an MSVC-like name back after parsing). But then we need to handle not only anonymous namespaces, also things like this:

`operator<<A>'::`2'::B::operator>

Such a preprocessing will be comparable to the current implementation of PDBNameParser by complexity (or even more complex). I'll try to somehow estimate the complexity of this approach, thanks.

`operator<<A>'::`2'::B::operator>

The reason we had to use clang lexer for parsing itanium names is because parsing itanium demangled names is tricky precisely for cases like these. If the MSVC demangler makes these cases trivial by enclosing them in quotes, maybe a separate (simpler) parser is not such a bad idea.

However, I still think this should be done within the scope of CPlusPlusLanguage::MethodName otherwise, you'll have to special case MSVC for all existing uses of this class.

Yes, it's simpler to move it to the CPlusPlusLanguage::MethodName (or CPlusPlusNameParser?) I think. The only question left is how to differentiate MSVC demangled names from others? May be it would be ok to treat name as an MSVC name if it contains a grave accent? Because we probably already can parse MSVC names without grave accents with CPlusPlusLanguage::MethodName.

aleksandr.urakov retitled this revision from [PDB] Introduce `PDBNameParser` to [PDB] Introduce `MSVCUndecoratedNameParser`.
aleksandr.urakov edited the summary of this revision. (Show Details)
aleksandr.urakov added a reviewer: shafik.

Update the diff according to the discussion, making it possible to parse MSVC demangled names by CPlusPlusLanguage. The old PDB plugin still uses MSVCUndecoratedNameParser directly because:

  • we are sure that the name in PDB is an MSVC name;
  • it has a more convenient interface, especially for restoring namespaces from the parsed name.

Update the diff according to the discussion, making it possible to parse MSVC demangled names by CPlusPlusLanguage. The old PDB plugin still uses MSVCUndecoratedNameParser directly because:

  • we are sure that the name in PDB is an MSVC name;
  • it has a more convenient interface, especially for restoring namespaces from the parsed name.

So I had an interesting solution to this while working on the native pdb plugin. it is impossible to use it with the old pdb plugin, but given that it works flawlessly for the native pdb plugin, depending on how urgent your need is, maybe you can just put off working on this until you're ready to move over to the native pdb plugin?

Basically the idea is that the raw PDB contains mangled type names for every type. You can see this by dumping types using llvm-pdbutil, as follows (I just picked a random one from my build directory).

D:\src\llvmbuild\ninja-x64>bin\llvm-pdbutil.exe dump -types bin\sancov.pdb | grep -A 2 LF_STRUCT | more
    0x1001 | LF_STRUCTURE [size = 88] ``anonymous-namespace'::RawCoverage`
             unique name: `.?AURawCoverage@?A0xa74cdb40@@`
             vtable: <no type>, base list: <no type>, field list: <no type>
--
    0x100A | LF_STRUCTURE [size = 212] `std::default_delete<std::set<unsigned __int64,std::less<unsigned __int64>,std::allocator<unsigned __int64> > >`
             unique name: `.?AU?$default_delete@V?$set@_KU?$less@_K@std@@V?$allocator@_K@2@@std@@@std@@`
             vtable: <no type>, base list: <no type>, field list: <no type>
--
    0x102B | LF_STRUCTURE [size = 88] ``anonymous-namespace'::FileHeader`
             unique name: `.?AUFileHeader@?A0xa74cdb40@@`
             vtable: <no type>, base list: <no type>, field list: <no type>
--
    0x1031 | LF_STRUCTURE [size = 112] `std::default_delete<llvm::MemoryBuffer>`
             unique name: `.?AU?$default_delete@VMemoryBuffer@llvm@@@std@@`
             vtable: <no type>, base list: <no type>, field list: <no type>
--
    0x1081 | LF_STRUCTURE [size = 304] `llvm::AlignedCharArrayUnion<std::unique_ptr<llvm::MemoryBuffer,std::default_delete<llvm::MemoryBuffer> >,char,char,char,char,char,char,char,char,char>`
             unique name: `.?AU?$AlignedCharArrayUnion@V?$unique_ptr@VMemoryBuffer@llvm@@U?$default_delete@VMemoryBuffer@llvm@@@std@@@std@@DDDDDDDDD@llvm@@`
             vtable: <no type>, base list: <no type>, field list: <no type>
--
    0x1082 | LF_STRUCTURE [size = 176] `llvm::AlignedCharArrayUnion<std::error_code,char,char,char,char,char,char,char,char,char>`
             unique name: `.?AU?$AlignedCharArrayUnion@Verror_code@std@@DDDDDDDDD@llvm@@`
             vtable: <no type>, base list: <no type>, field list: <no type>

So the interesting thing here is this "unique name" field. This is not possible to access via DIA SDK but it gives us complete rich information about the type that is otherwise impossible. We don't even have to guess, because we can just demangle the name. And coincidentally, I recently just finished writing an Microsoft ABI demangler which is now in LLVM. :) This .?AU syntax is non-standard, but it was easy for me to figure out, and I hacked up our demangle library to support this prefix (it's not checked in yet). And basically everything that comes after it exactly matches a mangled type.

So, just to give an example. Instead of teaching CPlusPlusNameParser to handle `anonymous namespace'::RawCoverage, we simply demangle .?AURawCoverage@?A0xa74cdb40@@, and we get back a vector of 2 strings which are `anonymous namespace' and RawCoverage. But instead of just that, there are so many other benefits. Since PDB doesn't contain rich information about template parameters, all we could do until now is just say create an entry in the AST that says "there's a type with this enormously long name that contains angle brackets and other junk". But with this technique, we could actually create legitimate template decls in the AST the way it's supposed to be.

There is obviously a lot of complexity in doing it here, but I think long term it will be a richer experience if we parse the mangled name than if we parse the demangled name. But it's only possible with the native plugin.

What do you think?

What do you think?

Yes, it's a really cool idea! When I was starting the implementation of the parser from this patch, I thought that it would be good to have mangled names instead - then we could retrieve fully structured names (with all its scope specifiers, template parameters etc.), but I didn't know that we actually have them on the lower level!

I want to join the development of the new PDB plugin, but some time later - may be in a month or two. I want to contribute now all changes I made to support expressions on Windows, and then I have some LLVM unrelated work to do. But I think that the way you suggest to solve the problem from the patch is the really right way to do it, and I'm planning to implement it when I'll join the new plugin development.

But is the MSVC demangled names parsing really necessary for CPlusPlusLanguage? Can such names ever somehow occur there? May be (if they can't) we could move this parser back to the old PDB plugin, and then drop it as a weirder solution when the new plugin will be done? Then we could commit this as a solution for the old PDB plugin to proceed with some dependent (and not related to the old PDB plugin) patches?

It's not fully clear to me from the previous comments if you are proceeding with this or not, but in case you are, I have made comments inline. I see that you've added some lit tests, but I also think you it would be good add some unit tests for the name parser functionality per-se (similar to the existing name parsing tests), as it is much easier to see what is going on from those.

Right now, I don't think this solution can be specific to the "old" PDB plugin, as this functionality is used from other places as well (RichManglingContext being the most important one). Maybe once we start using the fancy MSVC demangler there, we can revisit that. (But given that the declared intent of that class is to chop up demangled names, I think it would make sense to keep this there even then. Unless it turns out we can delete the whole class at that point.)

source/Plugins/Language/CPlusPlus/MSVCUndecoratedNameParser.cpp
62–64 ↗(On Diff #171707)

Rewrite this (and all other instances of StringRef -> char * -> StringRef roundtripping) in terms of StringRef functions.
Maybe something like: emplace_back(name.take_front(i-1), name.slice(last_base_start, i-1)); ?

source/Plugins/Language/CPlusPlus/MSVCUndecoratedNameParser.h
35–39 ↗(On Diff #171707)

Could we replace these by something like ArrayRef<MSVCUndecoratedNameSpecifier> GetSpecifiers()

aleksandr.urakov marked 2 inline comments as done.Nov 1 2018, 3:06 AM

Thank you for comments! I've updated the patch.

aleksandr.urakov updated this revision to Diff 172102.
labath accepted this revision.Nov 3 2018, 3:38 AM

Thanks for your patience. This looks good to me now.

This revision is now accepted and ready to land.Nov 3 2018, 3:38 AM