This is an archive of the discontinued LLVM Phabricator instance.

[ELF] Add a spell corrector for "undefined symbol" diagnostics
ClosedPublic

Authored by MaskRay on Aug 31 2019, 9:16 AM.

Details

Summary

Non-undefined symbols with Levenshtein distance 1 or a transposition are
suggestion candidates. This is probably good enough and it can suggest
some missing/superfluous qualifiers: const, restrict, volatile, & and &&
ref-qualifier, e.g.

 error: undefined symbol: foo(int*)
 >>> referenced by b.o:(.text+0x1)
+>>> did you mean: foo(int const*)
+>>> defined in: a.o

 error: undefined symbol: foo(int*&)
 >>> referenced by b.o:(.text+0x1)
+>>> did you mean: foo(int*)
+>>> defined in: b.o

Event Timeline

MaskRay created this revision.Aug 31 2019, 9:16 AM
MaskRay updated this revision to Diff 218226.Aug 31 2019, 9:34 AM

Fix a RUN line

MaskRay edited the summary of this revision. (Show Details)Aug 31 2019, 9:38 AM
ruiu added inline comments.Sep 1 2019, 9:46 PM
ELF/Relocations.cpp
797

Can you factor out the new code into a new function and then add a function comment as to what this is?

MaskRay updated this revision to Diff 218286.EditedSep 1 2019, 10:43 PM

Add a new function correctSpelling

(I was thinking if the code can be shared with COFF and wasm. We can define it in lld/Common,
but the common part is just ~25 lines, and a non-inline function will harm performance)

ruiu added inline comments.Sep 2 2019, 12:51 AM
ELF/Relocations.cpp
694

Please add a file comment saying that this function is for "did you mean?" feature.

721

IIUC, this function costs (N+1)*74 where N is the length of a symbol name. Symbol names can be very long. Doesn't this too slow?

grimar added inline comments.Sep 2 2019, 2:27 AM
ELF/Relocations.cpp
721

It's enabled for the first two undefined symbols. Even with large N, like 100 it shouldn't probably be too slow for the error reporting case?

727

Seems it would be simpler to make suggest return const Symbol * (on success) or nullptr (on fail)?
You should be able to early return nicely then:

if (const Symbol *S = suggest(newName))
  return S;

and avoid doing things like

if (corrected)
  break;
829

It feels for would look a bit better?

for (size_t I = 0; I<undefs.size(); ++I)
  if (!undefs[I].locs.empty())
    reportUndefinedSymbol<ELFT>(undefs[I], i < 2);

Interesting idea. I suspect we'll need to modify it over time in response to feedback. For example, what if there are several candidates, should we report them all, give one stronger weighting etc. One thing I don't think it will catch, which is my most common and least favourite typo, is a transposition which is usually a distance of 2. I'm less concerned about the algorithm speed at the moment as it should only trigger in an error case, and if it aids debugging then it is time well spent.

ELF/Relocations.cpp
757

I think you mean to use spellCorrector as a verb (do spelling correction), but it reads like a noun (spellCorrector is some object that does spelling correction). I suggest doSpellingCorrection or correctSpelling.

peter.smith added inline comments.Sep 2 2019, 2:51 AM
ELF/Relocations.cpp
694

I suggest getAlternativeSpelling as there could be many symbols within the same distance, particularly sym1, sym2, sym9 for correctSpelling(sym)

MaskRay updated this revision to Diff 218328.Sep 2 2019, 4:18 AM
MaskRay marked 10 inline comments as done.
MaskRay edited the summary of this revision. (Show Details)

Interesting idea. I suspect we'll need to modify it over time in response to feedback. For example, what if there are several candidates, should we report them all, give one stronger weighting etc.

Agree. We can wait for user feedback.

One thing I don't think it will catch, which is my most common and least favourite typo, is a transposition which is usually a distance of 2.

Added transposition.

I'm less concerned about the algorithm speed at the moment as it should only trigger in an error case, and if it aids debugging then it is time well spent.

MaskRay added inline comments.Sep 2 2019, 4:27 AM
ELF/Relocations.cpp
721

I think the complexity is fine. I avoid making the size of symtab a factor in the time complexity. Doing mutation in the input is probably fine.

In a large executable, I get a long mangled name of N=4398 bytes.
O(74*N) is not too bad.

727

Thanks. This simplification is available since this is now a new function.

757

Will use correctSpelling

829

I'll use

for (auto it : enumerate(undefs))

ruiu added inline comments.Sep 2 2019, 4:39 AM
ELF/Relocations.cpp
721

Maybe you are right. A symbol that is 4398 bytes long is perhaps an outlier, but even for a 1000 byte long symbol, it is 74,000 hash table lookup, which may not too bad. Did you run this code against a long symbol name? If it doesn't feel sluggish, I'm fine.

MaskRay marked an inline comment as done.Sep 2 2019, 5:41 AM
MaskRay added inline comments.
ELF/Relocations.cpp
721

I tested linking a large executable. It took 19.8s in total.

For the long symbol (sym.getName().size() = 4379), if I patched getAlternativeSpelling to attempt all suggestions before returning, suggest was called 665682 times, and it took about 0.76s, a small portion of the whole link time.

(As a comparison, symVector.size() = 4195114.)

thakis added a comment.Sep 2 2019, 6:28 PM

This is a cool idea, but I'm struggling to come up with a scenario where it's actually useful :)

For this to fire and be a true positive, you'd have to have an incorrect declaration but a correct definition of a function. I can't remember ever running into this – and the compiler seems to be in a better place to print a diagnostic for this case in almost all cases (except for if the TU with the correct definition doesn't happen to see the incorrect declaration).

This is a cool idea, but I'm struggling to come up with a scenario where it's actually useful :)

For this to fire and be a true positive, you'd have to have an incorrect declaration but a correct definition of a function. I can't remember ever running into this – and the compiler seems to be in a better place to print a diagnostic for this case in almost all cases (except for if the TU with the correct definition doesn't happen to see the incorrect declaration).

The file that provides the definition may have a mismatch of the declaration and the definition, e.g.

// a.h
void foo(int &a);

// a.cc
#include "a.h"
void foo(int a) {}

// b.cc
#include "a.h"
void bar(int a) { foo(a); }

% ld.lld a.o b.o            
ld.lld: error: undefined symbol: foo(int&)
>>> referenced by b.cc
>>>               b.o:(bar(int))
>>> did you mean: foo(int)
>>> defined in: a.o
grimar added a comment.Sep 3 2019, 2:42 AM

It looks fine to me I think. I probably have no more comments except 2 nits below.

ELF/Relocations.cpp
700

I think LLD coding style requires curly bracers here, since you have more than one line under if.

829

I am not sure enumerate is better (it is not consistent and this approach uses auto), I'll leave it to others.

grimar added inline comments.Sep 3 2019, 2:43 AM
ELF/Relocations.cpp
829

(not consistent with the LLD code I meant. Which doesn't use it anywhere, preferring 'for')

thakis added a comment.Sep 3 2019, 4:28 AM

For this to fire and be a true positive, you'd have to have an incorrect declaration but a correct definition of a function. I can't remember ever running into this – and the compiler seems to be in a better place to print a diagnostic for this case in almost all cases (except for if the TU with the correct definition doesn't happen to see the incorrect declaration).

The file that provides the definition may have a mismatch of the declaration and the definition, e.g.

Well, I guess we'll see how useful it turns out to be in practice :)

ruiu accepted this revision.Sep 4 2019, 1:33 AM

LGTM

ELF/Relocations.cpp
718

I'd add a comment here to describe this loop. This loop creates all possible strings of edit distance 1 as typo correction candidates and see whether they exist or not.

721

I still have a little concern about the speed because 0.76s is not a negligible delay. When you get an undefined error, you are likely to be in a edit-compile-link cycle, and in the cycle you are likely to get an error message as soon as possible. That being said, 0.76s is probably a worst-case number, and if it becomes a problem, we may be able to improve this code, so I'm fine for now.

This revision is now accepted and ready to land.Sep 4 2019, 1:33 AM
MaskRay updated this revision to Diff 218613.Sep 4 2019, 1:58 AM
MaskRay marked 6 inline comments as done.

Add a comment associated with the loop

ELF/Relocations.cpp
721

The pessimistic case has a symbol of 4379 bytes long. I think the length of a typical mangled name is much smaller than 1000.

This is a failed attempt and it tries all suggestions. A successful attempt may take just half of the time.

This is an extremely large executable with 4195114 symbols. A typical application has much less.

The pessimistic 0.76s case assumes the worse case in every factor. For a typical application, I'd be surprised if the typo correction takes even 5% of the pessimistic 0.76s case, i.e 0.038s.

This revision was automatically updated to reflect the committed changes.