This is an archive of the discontinued LLVM Phabricator instance.

[Demangle] Add support for D symbols back referencing
ClosedPublic

Authored by ljmf00 on Oct 8 2021, 8:24 AM.

Details

Summary
This patch adds support for identifier back referencing allowing compressed
mangled names by avoiding repetitiveness.

Signed-off-by: Luís Ferreira <contact@lsferreira.net>

Diff Detail

Event Timeline

ljmf00 created this revision.Oct 8 2021, 8:24 AM
ljmf00 requested review of this revision.Oct 8 2021, 8:24 AM

Looks like this is a bit undertested - the number encoding supports uppercase values, but I don't think the test case tests any uppercase (multi-digit/character encoded) values, for instance? As well as various (I count about 3-4 or so, at least) different encoding failure cases that should probably be tested too.

llvm/lib/Demangle/DLangDemangle.cpp
73–84

Perhaps it'd make sense if this took Mangled as a const char** in/out, and returned Optional<Ret>? (both callers overwrite their Mangled parameter with the return result anyway - so seems simpler for it to be in/out?) if Optional isn't available, maybe long min, -1, something could be the invalid result value?

(similarly for decodeBackref)

222–223

is this an overflow check? Can it be reached if the other overflow check at 379 succeeds? (is it tested?)

238–239

Looks like this condition's probably untested? Should it be removed or turned into an assertion?

ljmf00 updated this revision to Diff 397753.Jan 5 2022, 5:30 PM
ljmf00 updated this revision to Diff 397754.Jan 5 2022, 5:35 PM
ljmf00 added inline comments.
llvm/lib/Demangle/DLangDemangle.cpp
73–84

As is, I understand your point, but what if I want to preserve Mangled and use decodeBackref/decodeBackrefPos as an expression for an if, strcmp or else? e.g. On the patch I introduce integer literal values I use decodeBackref like the following:

if (decodeBackref(Mangled, &Backref) == nullptr)
{
    //...
}

If I pass a reference of Mangled and use old Mangled inside, I would need to preserve a copy of its old position to make sure it won't be a "dirty" value. e.g. back reference "AAAA0", will produce a dirty Mangled and can produce bad behaviour afterwards.

222–223

This is a safe check when casting to long. The above overflow check is for unsigned long, but Ret is a long value. This is being done that way because signed overflow checks are considered undefined behaviour, since there is various ways to represent a signed value in different CPU targets. I checked locally, _D8demangleQDXXXXXXXXXXXXx triggers it with the fuzzer when that code is commented, so I'm going to add this to the test suite.

238–239

Well this shouldn't happen and since those methods are not exposed somewhere else, the compiler can optimize this out easily. I think it is reasonable to keep this check as a safety measure, although if we do some action, I maybe prefer to assert it. In my vision, triggering an assert on, e.g. a failed demangling for a linker error is worse than just outputting the mangled symbol anyway (nullptr will make llvm::demangle fallback to the initial Mangled reference). It is also worth mentioning that removing this when assert are disabled in release mode, can produce wrong and hard to debug behaviour.

dblaikie added inline comments.Jan 5 2022, 7:28 PM
llvm/lib/Demangle/DLangDemangle.cpp
73–84

Fair enough, if there are future use cases that won't immediately overwrite Mangled, sounds good.

Ret should probably be passed by reference here, and in decodeBackref if they can't be null?

222–223

The fuzzer trips ubsan with this <= 0 check commented out, for this test case? OK, great - thanks for explaining & adding the test case!

Hmm, I'm still not quite sure how this overflow could occur, if the check on line 214 passed. Could you walk me through a specific value of Val at the start of the loop that passes the initial overflow check, but then fails this one? And/or could you explain what the - 25 is for in the earlier overflow check. I'm reading that as "make sure it's not too close to the end, so there's room to add the next value" but maybe it's for some other purpose?

238–239

Sorry, not sure I'm following. Generally if the code can't be reached, it should be an assert, not a runtime check - adding defensive coding to support behavior if there are other bugs in the code isn't something we should be doing. This ensures the code doesn't have "haunted graveyard" kind of effects where code becomes hard to maintain because it's unclear why it's there in the first place, isn't tested, etc. Assertions document the intended invariants in a piece of code that make it easier for developers to make changes in the future.

It is also worth mentioning that removing this when assert are disabled in release mode, can produce wrong and hard to debug behaviour.

It can produce wrong/hard to debug behavior if there's some other bug in the code, yes? Generally the first thing LLVM developers do if they encounter problematic behavior, is run it with an assertions-enabled build which should help diagnose the issue. There's some further discussion here: https://llvm.org/docs/CodingStandards.html#assert-liberally

ljmf00 updated this revision to Diff 399037.Jan 11 2022, 12:12 PM

I simplified the symbol back referencing. I thought that referencing a back reference was permitted by the ABI spec, but it was pointing to the same position, so it is redundant to add two back-references. I also added a comment on the nullptr tests to describe what it is testing. In the case of Qa, it references itself and hangs.

llvm/lib/Demangle/DLangDemangle.cpp
73–84

Fixed. I will also introduce a patch in the stack to make decodeNumber use Ret by reference, instead.

222–223

The fuzzer trips with Asan, since Ret value is used to offset the current position to discover the back reference. Since it is documented in decodeBackrefPos that Ret is required to be > 0 on success, negative values aren't checked, bad offset memory is dereferenced later, and it eventually segfaults.

All the math operations are made with unsigned long and then casted to a long value, at the end. Lets consider a 64-bit unsigned long, with max capacity of (2^64 - 1). If we consider ((ulong.max - 25) / 26) - 10 (709490156681136589), I can safely do both operations of multiply and add (709490156681136589 * 26 + 25 == 18446744073709551339) and still have a negative value when casting it to long. This would obviously depend on how signed numbers are represented in hardware but will likely produce -277 if two's complement. https://godbolt.org/z/1GaMzzcxj

In another perspective, the last check is here to make sure that a value in unsigned long or long values are the same.

238–239

Right, your points made me change my point of view on that and makes perfect sense, since this is supposed to never ever happen. I was skeptical by the fact that it would produce a wrong behaviour if asserts were disabled, but this end up being easier to debug if, with the same input we trigger the assert.

dblaikie accepted this revision.Jan 11 2022, 6:04 PM

Looks good, thanks!

llvm/lib/Demangle/DLangDemangle.cpp
80

Since the function already communicates failure via a null return - should this maybe be an invariant/assertion/postcondition ("Ret is always >=0 on success, and unspecified on failure" or something like that?)

222–223

Ah, thanks for walking me through it!

This revision is now accepted and ready to land.Jan 11 2022, 6:04 PM
ljmf00 updated this revision to Diff 399182.Jan 11 2022, 7:23 PM
ljmf00 added inline comments.
llvm/lib/Demangle/DLangDemangle.cpp
80

Right. Changed the documentation as suggested.

This revision was landed with ongoing or failed builds.Jan 12 2022, 2:02 PM
This revision was automatically updated to reflect the committed changes.