This is an archive of the discontinued LLVM Phabricator instance.

[LLD] [COFF] Use the unified llvm demangle frontend function. NFC.
ClosedPublic

Authored by mstorsjo on Sep 6 2019, 1:57 PM.

Details

Summary

This isn't a strict NFC patch, as technically, as symbols starting with __Z also could be demangled now. As the namespace starting with two underscores is restricted to the platform implementation in at least C, this shouldn't be much of an issue in practice.

Diff Detail

Event Timeline

mstorsjo created this revision.Sep 6 2019, 1:57 PM
rnk added a comment.Sep 6 2019, 4:07 PM

Should the _ prefix (for itanium symbols in i386 mode) be stripped by the COFF wrapper function, or by the common demangle function (matching _Z or __Z)?

Maybe, I forget if the extra _ comes before the __imp_ prefix or after it when dllimported. But there's another consideration, which is that this code could hypothetically be shared with a MachO linker implementation, where the extra underscore prefix is also present. IMO it's best for any generic demangler to match any number of underscores followed by Z.

rnk added a comment.Sep 6 2019, 4:07 PM

Hit send too soon...

Or did you mean merging the two demangler functions at the llvm level as well?

I'd be in favor of doing that as well, if it's easy. :)

In D67301#1661748, @rnk wrote:

Should the _ prefix (for itanium symbols in i386 mode) be stripped by the COFF wrapper function, or by the common demangle function (matching _Z or __Z)?

Maybe, I forget if the extra _ comes before the __imp_ prefix or after it when dllimported.

It's after the __imp_ prefix, so the demangler could handle it.

But there's another consideration, which is that this code could hypothetically be shared with a MachO linker implementation, where the extra underscore prefix is also present. IMO it's best for any generic demangler to match any number of underscores followed by Z.

I'm not entirely decided on whether I think this is a good thing or not.

The caller knows with certainty whether there should be an extra underscore prefix or not, as it is well defined (always in MachO, never in ELF, plus on i386 in COFF). But if we make the demangler implicitly handle both cases, it can also accidentally e.g. try to demangle any symbol which just starts with a capital Z, without any real underscore prefix, on i386 in COFF or in MachO. (I realize the current code in LLD I just committed also has this issue though.) So if we make it a requirement for the caller to compensate for the prefix, it is less convenient, but also less ambiguous.

WDYT?

ruiu added a comment.Sep 8 2019, 10:36 PM

Merging the demanglers for Itanium and MSVC seems fine at the lld level because we handle only these two mangling schemes. But isn't there any mangling scheme than those two? Can all mangling schemes be distinguished reliably? If not, we can't provide a unified demangle function.

As to the underscore prefix, I prefer making it a caller's responsibility to strip it before calling a demangler. As you mentioned, callers always know which is the case.

But if we make the demangler implicitly handle both cases, it can also accidentally e.g. try to demangle any symbol which just starts with a capital Z, without any real underscore prefix, on i386 in COFF or in MachO. (I realize the current code in LLD I just committed also has this issue though.)

Just for the record, correcting myself here, no, that can't happen as the code stands right now after all. On i386, a leading underscore is consumed if present before calling the demangler.

rnk added a comment.Sep 9 2019, 3:52 PM
In D67301#1661748, @rnk wrote:

Should the _ prefix (for itanium symbols in i386 mode) be stripped by the COFF wrapper function, or by the common demangle function (matching _Z or __Z)?

Maybe, I forget if the extra _ comes before the __imp_ prefix or after it when dllimported.

It's after the __imp_ prefix, so the demangler could handle it.

But there's another consideration, which is that this code could hypothetically be shared with a MachO linker implementation, where the extra underscore prefix is also present. IMO it's best for any generic demangler to match any number of underscores followed by Z.

I'm not entirely decided on whether I think this is a good thing or not.

The caller knows with certainty whether there should be an extra underscore prefix or not, as it is well defined (always in MachO, never in ELF, plus on i386 in COFF). But if we make the demangler implicitly handle both cases, it can also accidentally e.g. try to demangle any symbol which just starts with a capital Z, without any real underscore prefix, on i386 in COFF or in MachO. (I realize the current code in LLD I just committed also has this issue though.) So if we make it a requirement for the caller to compensate for the prefix, it is less convenient, but also less ambiguous.

WDYT?

I would hope that demangling a non-C++ symbol that happens to start with Z would simply fail and return the original string. In practice, we have tools like demumble that work on text streams and look for any sequence of _+Z${alphanum}+ and they work well enough. But I don't feel strongly, I see the argument for safety.

It turns out most of this discussion already is moot; the llvm demangle library already has a common unified function, which takes itanium names with 1-4 leading underscores. That makes this patch even simpler, mostly just rip out a lot of duplicated code from lld. This makes things a bit looser (the previous lld frontend to itaniumDemangle required it to start with exactly _Z), but as this is the current behaviour of the common demangle library, that's probably not an issue in practice.

mstorsjo updated this revision to Diff 219534.Sep 10 2019, 6:36 AM
mstorsjo retitled this revision from [LLD] Unify the demangleItanium and demangleMSVC functions. NFC. to [LLD] Use the unified llvm demangle frontend function. NFC..
mstorsjo edited the summary of this revision. (Show Details)
mstorsjo added a reviewer: jhenderson.

Updated to use the already unified llvm demangler frontend function, removing a lot of duplicated code in lld.

I'm adding another entry point to it, Optional<std::string> tryDemangle(const std::string &MangledName), that easier allows distinguishing between whether it actually did demangle or not, used in the COFF code.

MaskRay added a comment.EditedSep 10 2019, 7:43 AM

This change is not NFC. This will cause every port of LLD to demangle more names than which were allowed before.

tryDemangle can decode more names than _Z __Z ___Z ____Z prefixed names, as can be seen from the source:

// <mangled-name> ::= _Z <encoding>
//                ::= <type>
// extension      ::= ___Z <encoding> _block_invoke
// extension      ::= ___Z <encoding> _block_invoke<decimal-digit>+
// extension      ::= ___Z <encoding> _block_invoke_<decimal-digit>+
template <typename Derived, typename Alloc>
Node *AbstractManglingParser<Derived, Alloc>::parse() {

I hope at least for ELF, you can keep the "_Z" prefix check. It doesn't need the ObjC "___Z" extension.

This change is not NFC. This will cause every port of LLD to demangle more names than which were allowed before.

I hope at least for ELF, you can keep the "_Z" prefix check. It doesn't need the ObjC "___Z" extension.

Ok, let's keep it strict then.

I'd suggest adding a parameter to the llvm global demangle convenience function, to decide between whether it should require a strict prefix (only _Z) or if 1-4 underscores are allowed.

mstorsjo updated this revision to Diff 219584.Sep 10 2019, 12:25 PM

Extended the llvm demangle function to take an optional parameter for requesting it to only demangle symbols with a stricter prefix check (only allowing _Z).

Added test cases in lld for cases where we don't want demangling to happen (which all pass in the current master as well).

rnk added a comment.Sep 10 2019, 1:06 PM

I hope at least for ELF, you can keep the "_Z" prefix check. It doesn't need the ObjC "___Z" extension.

Are you sure? Surely users could use -fblocks to get those manglings, or use ObjC on Linux.

In D67301#1665171, @rnk wrote:

I hope at least for ELF, you can keep the "_Z" prefix check. It doesn't need the ObjC "___Z" extension.

Are you sure? Surely users could use -fblocks to get those manglings, or use ObjC on Linux.

We haven't even heard people complaining about lack of demangling of __Z, ___Z or ___Z prefixed names. I still lean towards keeping the "_Z" check for ELF to avoid false positive.

llvm/include/llvm/Demangle/Demangle.h
12 ↗(On Diff #219584)

I'm not sure this is supported.

LLVMSupport depends on LLVMDemangle. LLVMDemangle probably shouldn't have any dependency on other components.

mstorsjo marked an inline comment as done.Sep 10 2019, 9:15 PM
mstorsjo added inline comments.
llvm/include/llvm/Demangle/Demangle.h
12 ↗(On Diff #219584)

Hmm, good point.

I can do without the Optional here then, that form is only used in one place anyway.

mstorsjo updated this revision to Diff 219661.Sep 10 2019, 11:52 PM

Removed use of llvm::Optional in the demangle library, adapted COFF/Symbols.cpp to do a check for "demangled != demangleInput" instead of checking an Optional return value.

sbc100 added inline comments.Sep 11 2019, 10:04 AM
lld/COFF/Symbols.cpp
38

I think the convention is to use /*StrictPrefix=*/true. i.e. no spaces and an equals.

lld/wasm/Symbols.cpp
303 ↗(On Diff #219661)

Haven't we lost the checking of config->demangle here?

mstorsjo marked 2 inline comments as done.Sep 11 2019, 10:38 AM
mstorsjo added inline comments.
lld/COFF/Symbols.cpp
38

Ok, will change all of them to that form

lld/wasm/Symbols.cpp
303 ↗(On Diff #219661)

Oops, yes indeed. (There wasn't any testcase for that, apparently.)

mstorsjo marked an inline comment as done.Sep 11 2019, 12:26 PM
mstorsjo added inline comments.
lld/wasm/Symbols.cpp
303 ↗(On Diff #219661)

Sorry, yes, there is a test for it, but I don't have the wasm target enabled, so that particular test doesn't get run.

mstorsjo updated this revision to Diff 219762.Sep 11 2019, 12:27 PM

Tweaked the syntax for parameter literal names in comments, reinstated a check for config->demangle in the wasm backend.

In D67301#1665171, @rnk wrote:

I hope at least for ELF, you can keep the "_Z" prefix check. It doesn't need the ObjC "___Z" extension.

Are you sure? Surely users could use -fblocks to get those manglings, or use ObjC on Linux.

We haven't even heard people complaining about lack of demangling of __Z, ___Z or ___Z prefixed names. I still lean towards keeping the "_Z" check for ELF to avoid false positive.

@rnk - do you want to follow up on this further, or should we move forward with this in this form (which is pretty much NFC now).

In D67301#1665171, @rnk wrote:

I hope at least for ELF, you can keep the "_Z" prefix check. It doesn't need the ObjC "___Z" extension.

Are you sure? Surely users could use -fblocks to get those manglings, or use ObjC on Linux.

We haven't even heard people complaining about lack of demangling of __Z, ___Z or ___Z prefixed names. I still lean towards keeping the "_Z" check for ELF to avoid false positive.

@rnk - do you want to follow up on this further, or should we move forward with this in this form (which is pretty much NFC now).

Ping @rnk

ruiu added inline comments.Sep 24 2019, 5:34 AM
lld/COFF/Symbols.cpp
39–40

Does this mean __imp_foo is printed as just __imp_foo but any C++ mangled name after __imp_ is printed as __declspec(dllimport) ...? It feels to me that always "demangle" __imp prefix is more consistent.

mstorsjo marked an inline comment as done.Sep 24 2019, 5:37 AM
mstorsjo added inline comments.
lld/COFF/Symbols.cpp
39–40

Yes, that's correct - that's how it's done right now.

Sure, we could change that, but that would be a separate change, as this tries to be an NFC refactoring.

rnk added a comment.Sep 24 2019, 8:39 AM

We haven't even heard people complaining about lack of demangling of __Z, ___Z or ___Z prefixed names. I still lean towards keeping the "_Z" check for ELF to avoid false positive.

Weighing this desire to not change functionality against the cost of adding an extra boolean default argument to control the behavior, I'd prefer to change behavior to be consistent everywhere.

I guess we can't predict which bug is more likely: failing to demangle something the user wanted us to demangle, or unintentionally demangling a real C symbol starting with Z. Regardless of which bug appears down the road, I favor the solution that takes less code and makes LLVM tools behave consistently.

In D67301#1680905, @rnk wrote:

We haven't even heard people complaining about lack of demangling of __Z, ___Z or ___Z prefixed names. I still lean towards keeping the "_Z" check for ELF to avoid false positive.

Weighing this desire to not change functionality against the cost of adding an extra boolean default argument to control the behavior, I'd prefer to change behavior to be consistent everywhere.

I guess we can't predict which bug is more likely: failing to demangle something the user wanted us to demangle, or unintentionally demangling a real C symbol starting with Z. Regardless of which bug appears down the road, I favor the solution that takes less code and makes LLVM tools behave consistently.

Is there a scenario where both Itanium and Microsoft mangling schemes should be tried? (mingw?) If no, separate entry points for itaniumDemangle and microsoftDemangle may be better. I did not expect the solution to end up with an extra parameter StrictPrefix. The additional complexity was not what I intended. If mingw has to live with both mangling schemes, I want to first check if the demangling code will be used in more than one places. If yes, I am not against deleting StrictPrefix. The overhead to demangle __Z, ___Z or ___Z prefixed names for ELF should be small.

In D67301#1680905, @rnk wrote:

We haven't even heard people complaining about lack of demangling of __Z, ___Z or ___Z prefixed names. I still lean towards keeping the "_Z" check for ELF to avoid false positive.

Weighing this desire to not change functionality against the cost of adding an extra boolean default argument to control the behavior, I'd prefer to change behavior to be consistent everywhere.

I guess we can't predict which bug is more likely: failing to demangle something the user wanted us to demangle, or unintentionally demangling a real C symbol starting with Z. Regardless of which bug appears down the road, I favor the solution that takes less code and makes LLVM tools behave consistently.

Is there a scenario where both Itanium and Microsoft mangling schemes should be tried? (mingw?)

No, practically, only one of them should be possible at a given time. But I don't see why that is relevant to the discussion at hand.

If no, separate entry points for itaniumDemangle and microsoftDemangle may be better.

The distinction between itanium and microsoft isn't an issue here, at all.

They have strictly very different mangling; microsoft mangled names start with a ? while itanium mangled ones start with _Z. There is no risk of mixing them up at all.

On i386, all symbols (except for microsoft mangled and certain other calling conventions) start with an extra underscore, so itanium mangled names start with __Z. This is currently handled by always manually stripping one underscore on i386 before calling the demangler, but could be skipped as well if we use a more tolerant demangler. It could also be kept to avoid the risk of demangling a symbol like Znwy that originally didn't start with an underscore.

I did not expect the solution to end up with an extra parameter StrictPrefix. The additional complexity was not what I intended. If mingw has to live with both mangling schemes, I want to first check if the demangling code will be used in more than one places. If yes, I am not against deleting StrictPrefix. The overhead to demangle __Z, ___Z or ___Z prefixed names for ELF should be small.

It's not at all about the two different mangling schemes.

It's only about whether we artifically should make the general itanium demangler routine more strict than it is right now, or accept that there's a small risk of demangling something that isn't supposed to, but this is already a risk that all existing callers of the general llvm demangle library are ok with.

In D67301#1680905, @rnk wrote:

We haven't even heard people complaining about lack of demangling of __Z, ___Z or ___Z prefixed names. I still lean towards keeping the "_Z" check for ELF to avoid false positive.

Weighing this desire to not change functionality against the cost of adding an extra boolean default argument to control the behavior, I'd prefer to change behavior to be consistent everywhere.

I guess we can't predict which bug is more likely: failing to demangle something the user wanted us to demangle, or unintentionally demangling a real C symbol starting with Z. Regardless of which bug appears down the road, I favor the solution that takes less code and makes LLVM tools behave consistently.

Is there a scenario where both Itanium and Microsoft mangling schemes should be tried? (mingw?)

No, practically, only one of them should be possible at a given time. But I don't see why that is relevant to the discussion at hand.

If a call site of llvm::demangle deals either exclusively Itanium mangled names, or exclusively MSVC-style mangled names. I don't see why we must have a single entry point (llvm::demangle) for demangling. Two entry points (for Itanium and MSVC-style) may be better. It avoids the problem that some platforms are unnecessarily affected by special rules from some particular platforms.

If no, separate entry points for itaniumDemangle and microsoftDemangle may be better.

The distinction between itanium and microsoft isn't an issue here, at all.

They have strictly very different mangling; microsoft mangled names start with a ? while itanium mangled ones start with _Z. There is no risk of mixing them up at all.

On i386, all symbols (except for microsoft mangled and certain other calling conventions) start with an extra underscore, so itanium mangled names start with __Z. This is currently handled by always manually stripping one underscore on i386 before calling the demangler, but could be skipped as well if we use a more tolerant demangler. It could also be kept to avoid the risk of demangling a symbol like Znwy that originally didn't start with an underscore.

I did not expect the solution to end up with an extra parameter StrictPrefix. The additional complexity was not what I intended. If mingw has to live with both mangling schemes, I want to first check if the demangling code will be used in more than one places. If yes, I am not against deleting StrictPrefix. The overhead to demangle __Z, ___Z or ___Z prefixed names for ELF should be small.

It's not at all about the two different mangling schemes.

It's only about whether we artifically should make the general itanium demangler routine more strict than it is right now, or accept that there's a small risk of demangling something that isn't supposed to, but this is already a risk that all existing callers of the general llvm demangle library are ok with.

First, I'd prefer a solution that does not need to annotate call sites like: demangle(demangleInput, /*strictPrefix=*/true). This is prerequisite.

Next, if there is some way to not demangle __Z, ___Z or ___Z prefixed names for ELF, that will be great.. If mingw or apple may need more underscores, a new function will look good to me. That will document their extensions to the Itanium mangling scheme. However, if doing this would increase code complexity significantly, I'm fine with using a unified interface.

If a call site of llvm::demangle deals either exclusively Itanium mangled names, or exclusively MSVC-style mangled names. I don't see why we must have a single entry point (llvm::demangle) for demangling. Two entry points (for Itanium and MSVC-style) may be better. It avoids the problem that some platforms are unnecessarily affected by special rules from some particular platforms.

No, there's one single call site in the COFF linker, which right now does this:

demangleMSVC()
if (failedToDemangle && mingw)
    demangleItanium()

As the itanium and MSVC mangling schemes are completely orthogonal, this is just pointless, as the unified demangle function already can distinguish between the two just fine. This is what spurred this patch (requested by @rnk) to begin with.

The single entry point, llvm::demangle, which handles both MSVC and itanium, already exists, doing exactly this. It just happens to be more lax when it comes to itanium mangling.

First, I'd prefer a solution that does not need to annotate call sites like: demangle(demangleInput, /*strictPrefix=*/true). This is prerequisite.

So is the issue that the extra parameter is very verbose? Would llvm::demangleStrict(demangleInput) be better?

Next, if there is some way to not demangle __Z, ___Z or ___Z prefixed names for ELF, that will be great.. If mingw or apple may need more underscores, a new function will look good to me. That will document their extensions to the Itanium mangling scheme. However, if doing this would increase code complexity significantly, I'm fine with using a unified interface.

The point of the patchset is to remove redundant boilerplate code in lld, when there's a general unified function in the llvm demangle library already which does this. There's no point in a new function for mingw/apple that would support more underscores, as the existing llvm::demangle function does it just fine. If you want the ELF linker to be very strict about demangling, we could add a llvm::demangleStrict that the ELF linker could use, if that's any better?

Or just keep a small wrapper in lld, like this:

std::string demangleItanium(std::string input) {
  if (!input.startswith("_Z"))
    return input;
  return llvm::demangle(input);
}
mstorsjo updated this revision to Diff 221706.Sep 25 2019, 3:44 AM
mstorsjo retitled this revision from [LLD] Use the unified llvm demangle frontend function. NFC. to [LLD] [COFF] Use the unified llvm demangle frontend function. NFC..
mstorsjo edited the summary of this revision. (Show Details)

As it seems to be hard to agree on what should be done and how, I'm trying to split this up into separate smaller bits, which hopefully should get this patch moving forward easier.

rnk added a subscriber: compnerd.Sep 25 2019, 6:31 AM

Is there a scenario where both Itanium and Microsoft mangling schemes should be tried? (mingw?)

Well, I was pushing for exactly the opposite. The schemes are easily distinguished: leading ? or leading _+Z. It's easy to imagine a user using LLD to link together mingw (or even just @compnerd's x86_64-windows-itanium objects) and MSVC COFF files into the same program as long as they only talk to each other through plain C symbols. If we just handle everything all at once, there's less likelihood that down the line someone will file a bug asking us to add the other scheme along some strange codepath.

Anyway, this is my viewpoint, but I'm not writing the code, so @mstorsjo, please do not feel obligated to push this any further. :)

@rnk - Does the smaller scope of this patch make sense to you? It retains the point of unifying all the different manglings for the COFF linker behind the single unified interface, but without touching the rest of LLD at the same time.

rnk accepted this revision.Sep 26 2019, 11:16 AM

OK, so this is NFC for ELF, but has the relaxed __Z handling for COFF. lgtm

Sorry for the goose chase. :)

This revision is now accepted and ready to land.Sep 26 2019, 11:16 AM
In D67301#1684515, @rnk wrote:

OK, so this is NFC for ELF, but has the relaxed __Z handling for COFF. lgtm

Yup, it uses the relaxed _Z handling which can match __Z as well (which should be not too plausible, and that's reserved identifiers in C). But on i386, I manually still strip out the leading underscore, if present, so a C level identifier which just starts with capital Z, which as an cdecl symbol on i386 ends up as _Z..., won't match. (This is a slightly more plausible scenario in my opinion.)

This extra handling just uses two pretty harmless lines of code, which I don't even touch in this patch, if (config->machine == I386) demangleInput.consume_front("_");. They could be dropped, making the demangling even more lax, but I don't see much harm in keeping them.

I'll push this (plus the two other smaller ELF related cleanups that were ok'd) tomorrow then.

This revision was automatically updated to reflect the committed changes.