This is an archive of the discontinued LLVM Phabricator instance.

[demangler] Initial support for the new Rust mangling scheme
ClosedPublic

Authored by tmiasko on Apr 28 2021, 5:08 AM.

Details

Summary

Add a demangling support for a small subset of a new Rust mangling
scheme, with complete support planned as a follow up work.

Intergate Rust demangling into llvm-cxxfilt and use llvm-cxxfilt for
end-to-end testing. The new Rust mangling scheme uses "_R" as a prefix,
which makes it easy to disambiguate it from other mangling schemes.

The public API is modeled after __cxa_demangle / llvm::itaniumDemangle,
since potential candidates for further integration use those.

Diff Detail

Event Timeline

tmiasko created this revision.Apr 28 2021, 5:08 AM
tmiasko requested review of this revision.Apr 28 2021, 5:08 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 28 2021, 5:08 AM
dblaikie added inline comments.Apr 29 2021, 2:24 PM
llvm/include/llvm/Demangle/RustDemangle.h
9

Got a link to a spec for the mangling scheme that could go somewhere up here? (and/or in the .cpp file)

31–32

Most of the LLVM codebase doesn't use const members (can do weird things with assignability, for instance - not that this type needs to be assignable) - so I'd be inclined to stick with that convention and not make this const either.

32

Does the itanium demangler have a similar recursion limit handler? I didn't find one while quickly glancing around - would be good to understand any differences/tradeoffs here.

36

This comment seems incorrect - the demangle function is passed a string that includes the "_R" prefix, and parses that before doing anything else?

50

Looks like this part of the API isn't designed to be drop-in compatible with the itanium demangler API, yeah? (because the itanium one takes the buffer in the ctor rather than in a separate function, and takes the buffer as start/end rather than start/length)

If that's the case - any particular reason for this divergence and if we're diverging anyway, perhaps this could use a StringView parameter type rather than a decomposed start+length?

llvm/lib/Demangle/RustDemangle.cpp
46

If this took the argument by StringView (well, I guess, even if it didn't - could always build a StringView earlier and pass data()+size() here), then the StringView could be constructed earlier and StringView's startsWith(StringView) function could be used for the _R prefix check)

60–63

Any chance of using less raw memory management here? (been some recent discussions about the complexities of memory management in the itanium demangler that, if possible, it'd be good to avoid ( https://reviews.llvm.org/D100800 ) - but if it's all wrapped up in the "wanting to conform to the same API" (ie: the API design itself is awkward, but compatibility trumps fixing that awkwardness) - such is life)

94

The decimal-number and instantiating-crate parts aren't implemented in this patch, I take it? Could you include a FIXME that mentions that?

141–142

Is this correct/am I understanding this correctly, that some parsed components are not rendered in the demangling? Is that intentional/valid - that multiple mangled names could demangle to the same string/be ambiguous? Or is this a case of some feature not being implemented yet?

167–168

LLVM doesn't tend to use const local variables - I'd probably skip the const here & elsewhere.

169

This is here in case the decimal-number and initial bytes value would be ambiguous?

178–183
190

Nothing's currently using (& so nothing's currently testing) the return value of this function - probably best to omit it until it's needed/tested.

208–211

Maybe, like parseDecimalNumber below (though pick either way - so long as they're both the same), it might be nicer to move the 'Value' declaration further down and have this code be if (consume...) { return 0; }?

(though for both cases, if there's some nice way to avoid these special cases & have them fall out of the main loop/algorithm, that'd be even better I think - though I couldn't come up with a way off-hand)

209–211

LLVM style generally omits {} from single line blocks - remove these here and anywhere else in the patch.

238

This particular check could be written maybe more clearly as "Value == Max" but I see it's a pattern for overflow testing - perhaps it could be combined in a utility function for that purposes, so the 1 here matches up with the 1 in the += below. (& used for the "+= D" & overflow check in the next function)

263

This (& similarly in other places) could be constexpr to indicate it's a compile-time constant? (& distinct from the "don't generally use const on locals in LLVM)

264

This one 0 value uses 'u' but most/none of the others do - maybe drop it here for consistency?

284–289

Is this function worth it compared to calling "print(Ident.Name)"? (or at least could/should this function be implemented as "print(Ident.Name)" since the error checking is already done in print(StringView) and doesn't need to be done again in this caller?)

llvm/test/Demangle/rust.test
4–5

I'd have a bit of an easier time following the test cases if there was a comment describing the parse tree - would that be worth including?

26–32

Maybe this comment could be more descriptive - currently the implementation doesn't support any mangling version? (is the plan to support this currently, or is the version field purely forward-looking/for future use?) Maybe the comment could read "Only the null mangling version is currently supported" or something like that?

I'd probably only test 0 if that's the case - doesn't seem like 1 adds especially more coverage? But don't mind.

34–35

This is an attempt to overflow the version number, if it were parsed at all? Not sure it's an interesting test case.

tmiasko updated this revision to Diff 341862.Apr 30 2021, 5:44 AM
tmiasko marked 17 inline comments as done.

address review comments

@dblaikie thanks for review!

llvm/include/llvm/Demangle/RustDemangle.h
9

Included a link to a specification in the .cpp file.

32

In Rust mangling scheme, backreferences are given as offsets in the mangled symbol. Without any limits or additional protections, it would be easy to create a symbol that would lead to a stack overflow during demangling. I would consider a recursion limit to be necessary for fuzzing.

As far as I understand this isn't a problem for itanium demangler, where substitutions cannot refer to components that are yet to be demangled. Although the implementation has some form of protection against stack overflow in the form of Printing field.

36

I reorganized the code so that input is assigned only after removing the prefix.

The motivation for this becomes more apparent later; backreferences start from the end of the prefix.

50

I expect that rustDemangle will be the entry point, so there are no compatibility concerns here.

Replaced the arguments with a StringView.

llvm/lib/Demangle/RustDemangle.cpp
60–63

The motivation was to be compatible with itanium demangler API. I feel that interface will be easier to integrate.

If you see trade-offs differently, I am open to changing it to avoid manual memory management. Just consider how a different API would fit in in places where itaniumDemangle is used right now.

Internally, for the demangler itself, it is of little significance if demangled output is appended to an OutputStream or say an std::string.

94

Added a FXIME for <instantiating-crate>, and removed [<decimal-number>] which specification reserves for an encoding version. Encoding version is not in use currently, so there is no reason to parse it.

141–142

Yes, some components are not shown.

In the case here we have two kinds of namespaces: implementation internal namespaces using lowercase letters, and special well-known namespaces using uppercase letters. The former are not intended to be shown. The latter currently includes closures C and shims S and are shown by the existing demanglers.

This is yet to be implemented. I added a FIXME.

169

Exactly. "_" resolves an ambiguity in the case identifier itself starts with a digit or "_". Added a comment reflecting that.

178–183

I would find it convenient to leave this unaddressed for the time being. Currently implementation has no build time dependency on any generated headers or libraries, and using STLExtras.h introduces both.

llvm/test/Demangle/rust.test
4–5

I tried a few different notations, but even for short symbols they feel too verbose to be included in tests. For example:

_RNvNtCshGpAVYOtgW1_1a1b1c 

a::b::c

(symbol-name "_R"
  (path "N" (namespace "v")
    (path "N" (namespace "t")
      (path "C" (identifier (disambiguator "s" (base-62-number "hGpAVYOtgW1_")) "1a"))
      (identifier "1b"))
    (identifier "1c")))

I also have another parser that produces output as below, if that would be helpful I could publish it:

"" <symbol-name>
"_R" [<decimal-number>] <path> [<instantiating-crate>]
"_R" <path> [<instantiating-crate>]
"_R" "N" <namespace> <path> <identifier> [<instantiating-crate>]
"_RN" <namespace> <path> <identifier> [<instantiating-crate>]
"_RNv" <path> <identifier> [<instantiating-crate>]
"_RNv" "N" <namespace> <path> <identifier> <identifier> [<instantiating-crate>]
"_RNvN" <namespace> <path> <identifier> <identifier> [<instantiating-crate>]
"_RNvNt" <path> <identifier> <identifier> [<instantiating-crate>]
"_RNvNt" "C" <identifier> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtC" <identifier> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtC" [<disambiguator>] <undisambiguated-identifier> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtC" ["s" <base-62-number>] <undisambiguated-identifier> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtCs" <base-62-number> <undisambiguated-identifier> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_" <undisambiguated-identifier> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_" ["u"] <decimal-number> ["_"] <bytes> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_" <decimal-number> ["_"] <bytes> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1" ["_"] <byte> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1" <byte> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a" <identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a" [<disambiguator>] <undisambiguated-identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a" <undisambiguated-identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a" ["u"] <decimal-number> ["_"] <bytes> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a" <decimal-number> ["_"] <bytes> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1" ["_"] <byte> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1" <byte> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b" <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b" [<disambiguator>] <undisambiguated-identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b" <undisambiguated-identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b" ["u"] <decimal-number> ["_"] <bytes> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b" <decimal-number> ["_"] <bytes> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b1" ["_"] <byte> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b1" <byte> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b1c" [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b1c" [<path>]
"_RNvNtCshGpAVYOtgW1_1a1b1c"
26–32

Added a comment explaining the situation with encoding version, and removed other test cases.

tmiasko updated this revision to Diff 341897.Apr 30 2021, 7:25 AM

Add functions for checked arithmetic.

dblaikie added inline comments.Apr 30 2021, 5:23 PM
llvm/lib/Demangle/RustDemangle.cpp
60–63

Yeah, fair - don't think there's much to do about that, short of some unique_ptr with custom std::freeing deleter (& maybe some utility to do realloc with such a thing), but best not to roll that into this patch - if you wanted to do some cleanup like that, probably implementing it first in the itanium demangler & this common OutputStream abstraction, then port it to the Rust one (either before/after this code is committed).

141–142

I see this notion is referenced in the mangling spec document ("Like other disambiguation information, this path would usually not actually be shown by demanglers.") but this is still quite surprising to me - is there precedent for this, do any itanium or Windows manglings produce ambiguous demanglings?

If there isn't precedent for this, I think it'd be worth having a broader discussion about it (maybe a llvm-dev thread).

178–183

std::all_of can be used instead, if the standard library is suitable here

254–255

Might be worth rolling the Error updating into mulAssign/addAssign? Remove a bit more duplication/chance of mistakes of not setting the Error flag?

llvm/test/Demangle/rust.test
4–5

The hand-rendered version doesn't seem too bad, really - but up to you.

tmiasko updated this revision to Diff 342236.May 2 2021, 6:28 AM

Set error directly in mulAssign & addAssign

tmiasko marked 5 inline comments as done and an inline comment as not done.May 2 2021, 8:27 AM
tmiasko added inline comments.
llvm/lib/Demangle/RustDemangle.cpp
141–142

One case in itanium that results in the same f()::a demangling for both
statics, would be the following:

void f() {
  { static int a; }
  { static int a; }
}

The symbols corresponding to constructors & destructors in itanium are
ambiguous after demangling as well.

254–255

I moved responsibility for setting the error into mulAssign/addAssign.

llvm/test/Demangle/rust.test
4–5

I would rather not maintain parse tree for test cases.

But, see https://gist.github.com/tmiasko/7c31853276d94af57c18ca7522badd1c for a parser producing a parse tree.

dblaikie accepted this revision.May 2 2021, 9:53 AM

Looks good - thanks!

llvm/lib/Demangle/RustDemangle.cpp
141–142

Oh, right - thanks for the examples!

This revision is now accepted and ready to land.May 2 2021, 9:53 AM
tschuett added a comment.EditedMay 2 2021, 10:27 AM

Stupid question! Why does the rust demangler has to follow the API of

__cxa_demangle

instead of a modern API with llvm::Error and a BumpPtrAllocator ? Or SmallString and StringRef.

tmiasko marked 2 inline comments as done.May 2 2021, 12:12 PM

Stupid question! Why does the rust demangler has to follow the API of

__cxa_demangle

instead of a modern API with llvm::Error and a BumpPtrAllocator ? Or SmallString and StringRef.

It is hard to realize the benefits of an improved API when it is inconsistent with API offered by other demanglers (earlier comments regarding API can be found in https://reviews.llvm.org/D101444?id=341152#inline-959049).

I don't have commit access. Could you commit this for me? Thank you.

Tomasz Miąsko <tomasz.miasko@gmail.com>

This revision was automatically updated to reflect the committed changes.