This is an archive of the discontinued LLVM Phabricator instance.

[Demangle][Rust] Parse non-ASCII identifiers
ClosedPublic

Authored by tmiasko on Jun 16 2021, 3:28 AM.

Details

Summary

Rust allows use of non-ASCII identifiers, which in Rust mangling scheme
are encoded using Punycode.

The encoding deviates from the standard by using an underscore as the
separator between ASCII part and a base-36 encoding of non-ASCII
characters (avoiding hypen-minus in the symbol name). Other than that,
the encoding follows the standard, and the decoder implemented here in
turn follows the one given in RFC 3492.

To avoid an extra intermediate memory allocation while decoding
Punycode, the interface of OutputStream is extended with an insert
method.

Diff Detail

Event Timeline

tmiasko created this revision.Jun 16 2021, 3:28 AM
tmiasko requested review of this revision.Jun 16 2021, 3:28 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJun 16 2021, 3:28 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
tmiasko updated this revision to Diff 352927.Jun 18 2021, 12:44 AM
  • Avoid else after return.
dblaikie accepted this revision.Jun 28 2021, 4:20 PM

I haven't looked in detail at the punycode implementation - I guess this'll get some workout when tested against rust's output (leaving some chance of symmetric encode/decode bugs) & that's probably good enough.

libcxxabi/src/demangle/Utility.h
129

Could use an Arrayref<const char> (Or StringRef) instead of separate const char * and size_t parameters. Not necessary, there's some value in the symmetry with standard library functions (though I guess they're more likely to use begin/end rather than begin/length), etc.

131–136

I'd probably use an early return here to reduce indentation.

llvm/lib/Demangle/RustDemangle.cpp
539–540

Is this error path tested?

tmiasko updated this revision to Diff 355789.Jul 1 2021, 12:16 AM
tmiasko marked 2 inline comments as done.
  • Rebase and resolve conflicts with D104362.
  • Return early on zero length insert.
tmiasko added subscribers: erik.pilkington, ldionne.

Thanks for looking David!

In addition to the test cases here, I was also compiling source code with randomly generated non-ASCII identifiers and subsequently cross-validating demangled symbols with other implementations. The only issue I observed at the time, was that compilation time is quadratic with respect to the number of multibyte characters in the source code ...

@ldionne @erik.pilkington could you take a look at the new insert method added to demangling Utility.h? (I synchronized changes with a copy in libcxxabi).

libcxxabi/src/demangle/Utility.h
129

The API is based on the standard std::string::insert with the same type signature. I would prefer to leave this as is.

llvm/lib/Demangle/RustDemangle.cpp
539–540

This error code path is covered by:

; Invalid ABI with punycode.

CHECK: _RIC8functionFKu3n3hEuE
       _RIC8functionFKu3n3hEuE

If Punycode were to be supported in this position, it would demangle to function::<unsafe extern "☃" fn()>.

Oh, I forgot this didn't land yet and didn't follow up. I wasn't exactly sure how to interpret blocking review by libc++abi. @dblaikie is your review sufficient for landing this? If not, @ldionne could you take a look at the small part of changes in libcxxabi?

Oh, I forgot this didn't land yet and didn't follow up. I wasn't exactly sure how to interpret blocking review by libc++abi. @dblaikie is your review sufficient for landing this? If not, @ldionne could you take a look at the small part of changes in libcxxabi?

Yeah, I'm fairly comfortable approving a change like this - but now that you mention the libc++abi surface area here: Are other features of Utility.h tested (unit tested?) somewhere? Could you add testing for this new function too?

Also, admittedly I don't fully understand how this LLVM change ended up needing a change to libc++abi - how do these pieces connect/depend on each other?

tmiasko updated this revision to Diff 376608.Oct 1 2021, 12:37 PM

Rebase and add a unit test for OutputStream::insert

Yeah, I'm fairly comfortable approving a change like this - but now that you mention the libc++abi surface area here: Are other features of Utility.h tested (unit tested?) somewhere? Could you add testing for this new function too?

Also, admittedly I don't fully understand how this LLVM change ended up needing a change to libc++abi - how do these pieces connect/depend on each other?

I added a unit test for OutputStream::insert. It is also indirectly tested by the Rust demangler of course.

libcxxabi carries its own copy of Itanium demangler to implement __cxa_demangle. Those changes are just keeping those two in sync.

dblaikie accepted this revision.Oct 1 2021, 12:59 PM

Yeah, I'm fairly comfortable approving a change like this - but now that you mention the libc++abi surface area here: Are other features of Utility.h tested (unit tested?) somewhere? Could you add testing for this new function too?

Also, admittedly I don't fully understand how this LLVM change ended up needing a change to libc++abi - how do these pieces connect/depend on each other?

I added a unit test for OutputStream::insert. It is also indirectly tested by the Rust demangler of course.

libcxxabi carries its own copy of Itanium demangler to implement __cxa_demangle. Those changes are just keeping those two in sync.

Ah, so the issue is that the itanium demangling support is intended to be kept identical between these two places - and that includes a utility header which is, on the LLVM side, used by Rust demangling (which doesn't exist in libcxxabi) - so it's an update to this common header, which is to be kept in sync though there's no (current) use for the extra functionality on the libcxxabi side, because it's not used by the itanium demangling, only the rust demangling. (but good to keep it in sync anyway, in case someone does end up using it in itanium demangling, etc)

Fair enough. Thanks for the help!

This revision was not accepted when it landed; it landed in state Needs Review.Oct 1 2021, 1:14 PM
This revision was automatically updated to reflect the committed changes.