This is an archive of the discontinued LLVM Phabricator instance.

Show correct column nr. when multi-byte utf8 chars are used.
Needs ReviewPublic

Authored by erikjv on Jun 1 2017, 3:14 AM.

Details

Reviewers
bkramer
klimek
Summary

Previously, the column number in a diagnostic would be the byte position
in the line. This results in incorrect column numbers when a multi-byte
UTF-8 character would be present in the input. This change corrects for
those multi-byte characters and for zero-length diacritic marks.

This fixes PR21144.

Diff Detail

Event Timeline

erikjv created this revision.Jun 1 2017, 3:14 AM

Correctly counting columns is a bit more complicated that that... for example, consider what happens if you replace ideëen with idez̈en. See https://stackoverflow.com/questions/3634627/how-to-know-the-preferred-display-width-in-columns-of-unicode-characters .

erikjv updated this revision to Diff 117660.Oct 4 2017, 5:28 AM
erikjv edited the summary of this revision. (Show Details)
yvvan added a subscriber: yvvan.Oct 25 2017, 11:38 PM

I didn't really search for it before, but it looks like LLVM already has a routine for computing column widths? See llvm::sys::unicode::columnWidthUTF8.

There are some tools which parse clang diagnostic output; we might need a flag to control this. Not sure who would know about that?

lib/Basic/SourceManager.cpp
1501

Instead of adding a parameter to getColumnNumber, it would probably make sense to just make this caller correct the column number afterwards.

erikjv updated this revision to Diff 124903.Nov 30 2017, 3:26 AM

I moved all code to the TextDiagnostics, so all other interfaces still get byte offsets.

Still worried about the effect on tools which parse clang diagnostics... please send a message to cfe-dev. Hopefully we'll get responses there.

Godin added a subscriber: Godin.May 22 2018, 8:29 AM
lelf added a subscriber: lelf.May 19 2019, 7:07 PM