This is an archive of the discontinued LLVM Phabricator instance.

[Demangle][Rust] Parse binders
ClosedPublic

Authored by tmiasko on May 18 2021, 3:19 PM.

Details

Diff Detail

Event Timeline

tmiasko created this revision.May 18 2021, 3:19 PM
tmiasko requested review of this revision.May 18 2021, 3:19 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 18 2021, 3:19 PM

Yeah, I think there's a bug in the bound count handling (off by one) due to the quirk in the parseOptionalBase62 parsing?

llvm/lib/Demangle/RustDemangle.cpp
471–476

Might be nice to have this written once/reused (rather than duplicated here between 'R' and 'Q' for instance) - not sure if it can be generalized over the use in demangleGenericArg too - all of them use the <lifetime> Production, but there's no common implementation of/function for that production - maybe there could be?

549

Judging by the phrasing, was this meant to say "Inputs exceeding the bound are /valid/" rather than invalid?

Maybe it'd be more suitable to have a fixed upper bound on the size of a demangled name in general instead of these special cases? I guess that could cover the recursion and this bound on bound lifetimes...

That might be more predictable for users and for testing, etc.

707

This seems like a really subtle and confusing API, especially given the existing "+ 1" behavior of the Base 62 encoding. I've been staring at the code in demangleFnSig+demangleBinder to figure out the first binder test case for a while before I realized that.

Any chance this could use uintmax for the invalid value, rather than offsetting the result in this way?

llvm/test/Demangle/rust.test
237–238

I'm having a hard time understanding this test in the context of the spec.

The spec says that _ encodes zero. So the G_ should be zero, right? Is there somewhere in the spec that says that <binder> mangled value is then 1 larger than that? (so a binder value 0 (_) is actually for 1 binder)?

tmiasko updated this revision to Diff 347197.May 22 2021, 5:03 AM
  • Reuse the implementation for R and Q productions.
  • Reword the comment regarding invalid binders.
tmiasko added inline comments.May 22 2021, 5:10 AM
llvm/lib/Demangle/RustDemangle.cpp
471–476

I reused the implementation between R and Q productions.

549

I reworded the comment and added an explanation why those inputs are invalid.

I think we need at least two different limitation mechanisms. A recursion limit to avoid unbounded recursion once we start parsing backreferences, especially those that don't produce any output. And another limit to handle large output.

I decided to check for invalid binders specifically since they are uniquely problematic; short strings like _RIC5crateFGAAAAA_EuE already generate gigabytes of output. The proposed bound rejects only invalid inputs, so it shouldn't be surprising for users in practice.

A general limit on the output size would be alternative option. Although it requires answering the hard question what the limit should be.

707

There are two layers of +1 which makes it quite confusing, but this is intended encoding of optional base62 numbers.

llvm/test/Demangle/rust.test
237–238

This looks like an omission in the specification The binder production should be marked as optional, and then the presence of G participates in the encoding of the value in analogous way to s in disambiguators:

If the disambiguator is not present, this index is 0. If it is of the form s_ then the index is 1. If it is of the form s<base-62-digit>_ then the index is <base-62-digit> + 2.

Does this clarify the situation?

I will look into updating the specification.

tmiasko updated this revision to Diff 347199.May 22 2021, 5:22 AM

Rebuild after a change to a parent revision

dblaikie added inline comments.May 24 2021, 5:09 PM
llvm/lib/Demangle/RustDemangle.cpp
471–476

any chance of having a "demangleLifetime" that can be used in "demangleGenericArg" as well as here? (nice to have a function for each production, so it's easier to find the implementation of that production in one place)

549

Fair enough - thanks for the details!

707

Fair enough - maybe some more quotations from the spec to explain this would be useful?

llvm/test/Demangle/rust.test
237–238

Ah, OK - yeah, spec update and some more words in the implementation (see other comment) would probably be handy. Thanks!

tmiasko updated this revision to Diff 347892.May 26 2021, 3:36 AM
tmiasko marked 4 inline comments as done.

Extend comment about +1 behaviour when parsing optional base-62 numbers.

llvm/lib/Demangle/RustDemangle.cpp
471–476

Yeah it would be nice to unify those, but there is slight difference in a treatment of erased lifetimes:

  • In generic arguments erased lifetimes are always printed, since order is significant (although rustc does not mangle any lifetimes arguments when all lifetimes are erased).
  • In references they are not printed (rustc does not mangle them in this context, so a difference in demangler behaviour would be unobservable in practice).
  • In trait objects (yet to be implemented) lifetimes are mandatory, but there is no reason to print erased lifetimes.
707

I extended the comment and added a few concrete examples.

llvm/test/Demangle/rust.test
237–238

The proposed update to the specification https://github.com/rust-lang/rfcs/pull/3130.

dblaikie accepted this revision.May 26 2021, 12:59 PM

Sounds good, thanks!

llvm/lib/Demangle/RustDemangle.cpp
471–476

having parameters to the demangle function to support these different behaviors doesn't seem totally unreasonable, while still having central handling? (or having variants "demangleLifetimeForGenericArg", etc)

This revision is now accepted and ready to land.May 26 2021, 12:59 PM
tmiasko updated this revision to Diff 349186.Jun 2 2021, 12:42 AM

Rename demangleBinder into demangleOptionalBinder

tmiasko updated this revision to Diff 349187.Jun 2 2021, 12:46 AM

Rerun build

This revision was automatically updated to reflect the committed changes.