This is an archive of the discontinued LLVM Phabricator instance.

[WIP] Fix Rewriter
AbandonedPublic

Authored by jkorous on Aug 29 2020, 1:24 PM.

Details

Reviewers
arphaman
akyrtzi
Summary

I feel like I found a bug but I am both not entirely sure and running out of ideas. I have these two tests that I assume we want to be passing but they are currently failing.

I found what I feel is an issue when text replacements (using Rewriter exposed in libclang API I am working on) were off by one char at the end. After some digging I'd say it's because the way range lengths are calculated. I feel it's all about the fact that we have two representations of ranges - either it's a pair of indices to a buffer where the semantics are as expected - start of the range, one past the end of the range. Then we also have "token ranges" where the semantics are different - it's start and exactly the end. And I feel the we're converting or calculating the lengths in not-completly-consistent fashion. Alternatively it could be due to some kind of range conversion in libclang.

I tried couple different approaches but haven't found any that would both feel sensible AND won't fail significant number of tests.

I tried removing this in int Rewriter::getRangeSize(const CharSourceRange &Range, RewriteOptions opts) const { which felt like not the right thing to do but surprisingly no test failed.

// Adjust the end offset to the end of the last token, instead of being the
// start of the last token if this is a token range.
if (Range.isTokenRange())
  EndOff += Lexer::MeasureTokenLength(Range.getEnd(), *SourceMgr, *LangOpts);

I tried this which I felt is the right thing to do but 70 failing tests disagreed:

 int Rewriter::getRangeSize(SourceRange Range, RewriteOptions opts) const {
-  return getRangeSize(CharSourceRange::getTokenRange(Range), opts);
+  return getRangeSize(CharSourceRange::getCharRange(Range), opts);
 }

I guess I still need to digest this comment in tests:

CharSourceRange CRange; // covers exact char range
CharSourceRange TRange; // extends CRange to whole tokens
SourceRange SRange;     // different type but behaves like TRange

I will try to wrap my head around this after the weekend.

Diff Detail

Event Timeline

jkorous created this revision.Aug 29 2020, 1:24 PM
jkorous requested review of this revision.Aug 29 2020, 1:24 PM

@jkorous if the ambiguity between char/token ranges is a problem, how about adding a getRangeCharSize that takes an unambiguous CharSourceRange range, or even making getRangeSize take the CharSourceRange instead?

Honestly, the more I'm looking into this, the more puzzled I am - please assume that my previous conclusions weren't necessarily correct. I started adding notes to the code so I can keep track of my thoughts - I'll update the revision.

Based on how we handle the different range types it seems that actually all of them are half-open intervals except CharSourceRange when it represents a token range - then it's basically a closed interval in token sense but the end offset doesn't refer to end of the last token but the begin of the last token.

Say, I want to represent tokens (int, foo) in this code:

int foo = 5;

then the offsets for CharSourceRange token range would be:

int foo = 5;
^   ^
|   |

Which means - we remember first and last token (not one-past-last token) but the way we represent the last token is offset of its first character.

I am sorry - I don't understand the suggestion. CharSourceRange is ambiguous by design - the semantics are controlled by the IsTokenRange member - when it's true it implies "token range" semantics described above and when it's false it implies "just a half-open range of characters". So I don't think we could really make things clearer unless we introduce another type.
https://clang.llvm.org/doxygen/classclang_1_1CharSourceRange.html

Also, Rewriter::getRangeSize is already overloaded - one overload takes CharSourceRange as input and seems like it handles both cases (token range or half-open char range) correctly. What I don't understand yet is why the other overload that takes SourceRange as input makes assumption that it actually represents a closed token range.

jkorous updated this revision to Diff 289054.Aug 31 2020, 5:56 PM

Added a random test and a bunch of notes (to be nuked before anything is committed).

AFAIK SourceRange is supposed to always represent a token range (begin loc points to beginning of first token, end loc points to beginning of last token). For representing a character range, CharSourceRange should be used, though IMO its IsTokenRange member was a mistake, CharSourceRange should have only being used to represent half-open character-based range. This distinction is thankfully more clear on the Swift side.

So, to me it seems that the only problematic code-path is when I'm plumbing source range from libclang API (where I naturally need to convert from CXSourceRange to some internal type) to Rewriter as Rewriter::getRangeSize() makes an assumption about the input which is missing from its signature and documentation (and by proxy from other Rewriter methods).

The only thing that prevents me from changing the implementation like this:

 int Rewriter::getRangeSize(SourceRange Range, RewriteOptions opts) const {
-  return getRangeSize(CharSourceRange::getTokenRange(Range), opts);
+  return getRangeSize(CharSourceRange::getCharRange(Range), opts);
 }

is the 70 failing tests which sounds like this is actually expected and I'd break a lot of stuff.

AFAIK SourceRange is supposed to always represent a token range (begin loc points to beginning of first token, end loc points to beginning of last token). For representing a character range, CharSourceRange should be used, though IMO its IsTokenRange member was a mistake, CharSourceRange should have only being used to represent half-open character-based range. This distinction is thankfully more clear on the Swift side.

Thanks a lot for this clarification! So, does that mean the issue is actually here?

static inline SourceRange translateCXSourceRange(CXSourceRange R) {
  return SourceRange(SourceLocation::getFromRawEncoding(R.begin_int_data),
                     SourceLocation::getFromRawEncoding(R.end_int_data));
}
jkorous added inline comments.Aug 31 2020, 6:08 PM
clang/tools/libclang/CIndex.cpp
6830

@akyrtzi this seems to be conflicting to the notion of SourceRange always being a closed token range. Or is it more nuanced or context-dependent?

Thanks a lot for this clarification! So, does that mean the issue is actually here?

static inline SourceRange translateCXSourceRange(CXSourceRange R) {
  return SourceRange(SourceLocation::getFromRawEncoding(R.begin_int_data),
                     SourceLocation::getFromRawEncoding(R.end_int_data));
}

I think it depends on what that CXSourceRange represents (how was it created).

Thanks a lot for this clarification! So, does that mean the issue is actually here?

static inline SourceRange translateCXSourceRange(CXSourceRange R) {
  return SourceRange(SourceLocation::getFromRawEncoding(R.begin_int_data),
                     SourceLocation::getFromRawEncoding(R.end_int_data));
}

I think it depends on what that CXSourceRange represents (how was it created).

Oh, I am embarrassed to admit that it never crossed my mind that semantics might not be a type invariant here... Ok, thanks, this is really helpful.

We may have places in the code where SourceRange is used as a pair of locations, and those locations are character locations instead of token ones, so essentially the information of whether the range is token-based or character-based gets lost, and we get into trouble when passing such a SourceRange to APIs that assume token-based.

jkorous added a comment.EditedSep 1 2020, 9:50 AM

When we return source ranges from functions in libclang we use cxloc::translateSourceRange to convert them from SourceRange semantics to half-open character range CXSourceRange. I checked the cxloc::translateSourceRange implementation and also that we use it if not everywhere then very often.

/// Translate a Clang source range into a CIndex source range.
///
/// Clang internally represents ranges where the end location points to the
/// start of the token at the end. However, for external clients it is more
/// useful to have a CXSourceRange be a proper half-open interval. This routine
/// does the appropriate translation.
CXSourceRange cxloc::translateSourceRange(const SourceManager &SM,
                                          const LangOptions &LangOpts,
                                          const CharSourceRange &R) {

What this means that outside of libclang API boundary we have half-open character ranges but in libclang implementation we use the closed token ranges.

This also means IIUC that whenever we're passing a range to any libclang function we have to find start of the last token in the range and move the end of the range there. This sounds not great and for certain use-cases might cause noticeable performance regression.

jkorous abandoned this revision.Sep 1 2020, 5:10 PM

I believe this is the correct approach: https://reviews.llvm.org/D86990

jkorous added inline comments.Sep 1 2020, 5:12 PM
clang/tools/libclang/CIndex.cpp
6830

Maybe this is the only mis-use of SourceRange. It's somewhat natural that input for tokenization procedure isn't formulated in terms of tokens. We might still consider adding a specific type for this to keep things type safe.