This is an archive of the discontinued LLVM Phabricator instance.

Differential D86840

[WIP] Fix Rewriter
Abandoned, Public

Authored by jkorous on Aug 29 2020, 1:24 PM.

Details

Reviewers

arphaman
akyrtzi

Summary

I think I found a bug, but I am not entirely sure and I am running out of ideas. I have two tests that I assume we want to pass, but they currently fail.

I found what looks like an issue when text replacements (using the Rewriter exposed in the libclang API I am working on) were off by one character at the end. After some digging, I'd say it is because of the way range lengths are calculated. We have two representations of ranges. One is a pair of indices into a buffer with the expected semantics: start of the range and one past the end of the range. The other is a "token range" with different semantics: start of the range and exactly the end. I feel that we are converting between them, or calculating their lengths, in a not completely consistent fashion. Alternatively, it could be due to some kind of range conversion in libclang.

I tried a couple of different approaches but haven't found any that both feels sensible and doesn't fail a significant number of tests.

I tried removing the following from int Rewriter::getRangeSize(const CharSourceRange &Range, RewriteOptions opts) const, which did not feel like the right thing to do, but surprisingly no test failed:

// Adjust the end offset to the end of the last token, instead of being the
// start of the last token if this is a token range.
if (Range.isTokenRange())
  EndOff += Lexer::MeasureTokenLength(Range.getEnd(), *SourceMgr, *LangOpts);

I tried this, which I felt was the right thing to do, but 70 failing tests disagreed:

 int Rewriter::getRangeSize(SourceRange Range, RewriteOptions opts) const {
-  return getRangeSize(CharSourceRange::getTokenRange(Range), opts);
+  return getRangeSize(CharSourceRange::getCharRange(Range), opts);
 }

I guess I still need to digest this comment in tests:

CharSourceRange CRange; // covers exact char range
CharSourceRange TRange; // extends CRange to whole tokens
SourceRange SRange;     // different type but behaves like TRange

I will try to wrap my head around this after the weekend.

Event Timeline

jkorous created this revision. Aug 29 2020, 1:24 PM

Herald added subscribers: danielkiss, dexonsmith. Aug 29 2020, 1:24 PM

jkorous requested review of this revision. Aug 29 2020, 1:24 PM

@jkorous if the ambiguity between char/token ranges is a problem, how about adding a getRangeCharSize that takes an unambiguous CharSourceRange range, or even making getRangeSize take the CharSourceRange instead?

Honestly, the more I'm looking into this, the more puzzled I am - please assume that my previous conclusions weren't necessarily correct. I started adding notes to the code so I can keep track of my thoughts - I'll update the revision.

Based on how we handle the different range types, it seems that all of them are actually half-open intervals except CharSourceRange when it represents a token range. In that case it is basically a closed interval in the token sense, but the end offset refers to the beginning of the last token rather than to its end.

Say, I want to represent tokens (int, foo) in this code:

int foo = 5;

then the offsets for CharSourceRange token range would be:

int foo = 5;
^   ^
|   |

Which means - we remember first and last token (not one-past-last token) but the way we represent the last token is offset of its first character.
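The offsets in the example above can be worked through in a small self-contained sketch. Everything here is hypothetical and does not use clang: toyTokenLength is a crude stand-in for Lexer::MeasureTokenLength, and ToyTokenRange / ToyCharRange are illustrative types, not clang classes. It only shows that the token range (0, 4) for the tokens (int, foo) corresponds to the half-open character range [0, 7).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical toy lexer rule (not clang's): a token is a maximal run of
// [A-Za-z0-9_], or else a single character. Off must point at a token start.
std::size_t toyTokenLength(const std::string &Buf, std::size_t Off) {
  auto IsWord = [](char C) {
    return std::isalnum(static_cast<unsigned char>(C)) || C == '_';
  };
  if (Off >= Buf.size())
    return 0;
  if (!IsWord(Buf[Off]))
    return 1;
  std::size_t End = Off;
  while (End < Buf.size() && IsWord(Buf[End]))
    ++End;
  return End - Off;
}

// Token range: Begin and End are the start offsets of the first and last token.
struct ToyTokenRange { std::size_t Begin, End; };
// Char range: half-open interval [Begin, End).
struct ToyCharRange { std::size_t Begin, End; };

// Extend the end of a token range to one past the last character of its
// last token, which yields the equivalent half-open character range.
ToyCharRange toCharRange(const std::string &Buf, ToyTokenRange R) {
  return ToyCharRange{R.Begin, R.End + toyTokenLength(Buf, R.End)};
}

void demoTokenRange() {
  const std::string Code = "int foo = 5;";
  ToyCharRange C = toCharRange(Code, ToyTokenRange{0, 4});
  assert(C.Begin == 0 && C.End == 7);
  assert(Code.substr(C.Begin, C.End - C.Begin) == "int foo");
}
```

Calling demoTokenRange() exercises the conversion on the snippet from the comment above.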

I am sorry - I don't understand the suggestion. CharSourceRange is ambiguous by design - the semantics are controlled by the IsTokenRange member - when it's true it implies "token range" semantics described above and when it's false it implies "just a half-open range of characters". So I don't think we could really make things clearer unless we introduce another type.
https://clang.llvm.org/doxygen/classclang_1_1CharSourceRange.html

Also, Rewriter::getRangeSize is already overloaded. One overload takes CharSourceRange as input and seems to handle both cases (token range or half-open char range) correctly. What I don't understand yet is why the other overload, which takes SourceRange as input, assumes that it actually represents a closed token range.

jkorous added a reviewer: akyrtzi. Aug 31 2020, 5:39 PM

Added a random test and a bunch of notes (to be nuked before anything is committed).

AFAIK SourceRange is supposed to always represent a token range (begin loc points to the beginning of the first token, end loc points to the beginning of the last token). For representing a character range, CharSourceRange should be used, though IMO its IsTokenRange member was a mistake; CharSourceRange should only have been used to represent a half-open character-based range. This distinction is thankfully clearer on the Swift side.

So, to me it seems that the only problematic code-path is when I'm plumbing source range from libclang API (where I naturally need to convert from CXSourceRange to some internal type) to Rewriter as Rewriter::getRangeSize() makes an assumption about the input which is missing from its signature and documentation (and by proxy from other Rewriter methods).

The only thing that prevents me from changing the implementation like this:

 int Rewriter::getRangeSize(SourceRange Range, RewriteOptions opts) const {
-  return getRangeSize(CharSourceRange::getTokenRange(Range), opts);
+  return getRangeSize(CharSourceRange::getCharRange(Range), opts);
 }

is the 70 failing tests, which suggests that the current behavior is actually expected and that I'd break a lot of stuff.
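A minimal sketch of how the off-by-one can arise, assuming the size of a token range is computed as "end offset plus length of the token at the end, minus start offset" as in the snippet quoted in the summary. The helper names (toyTokenLength, charRangeSize, tokenRangeSize) are hypothetical and the lexer rule is a toy, not clang's. It uses the same "void foo(){ }" input and the offsets 5 and 8 from the tests in this revision.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical toy lexer rule (not clang's): a token is a maximal run of
// [A-Za-z0-9_], or else a single character. Off must point at a token start.
std::size_t toyTokenLength(const std::string &Buf, std::size_t Off) {
  auto IsWord = [](char C) {
    return std::isalnum(static_cast<unsigned char>(C)) || C == '_';
  };
  if (Off >= Buf.size())
    return 0;
  if (!IsWord(Buf[Off]))
    return 1;
  std::size_t End = Off;
  while (End < Buf.size() && IsWord(Buf[End]))
    ++End;
  return End - Off;
}

// Size when [Begin, End) is a half-open character range.
std::size_t charRangeSize(std::size_t Begin, std::size_t End) {
  return End - Begin;
}

// Size when End is the start offset of the last token in the range.
std::size_t tokenRangeSize(const std::string &Buf, std::size_t Begin,
                           std::size_t End) {
  return End + toyTokenLength(Buf, End) - Begin;
}

void demoOffByOne() {
  const std::string Code = "void foo(){ }";
  // "foo" as a half-open character range [5, 8): three characters.
  assert(charRangeSize(5, 8) == 3);
  // The same offsets read as a token range: the token at offset 8 is "(",
  // so its length is added and the result is one character too long.
  assert(tokenRangeSize(Code, 5, 8) == 4);
  // The token range that really covers only "foo" starts and ends at 5.
  assert(tokenRangeSize(Code, 5, 5) == 3);
}
```

Under these assumptions, passing half-open offsets through an interface that expects token-range semantics over-counts by the length of whatever token starts at the end offset, which is one character in this input.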

In D86840#2248453, @akyrtzi wrote:

AFAIK SourceRange is supposed to always represent a token range (begin loc points to the beginning of the first token, end loc points to the beginning of the last token). For representing a character range, CharSourceRange should be used, though IMO its IsTokenRange member was a mistake; CharSourceRange should only have been used to represent a half-open character-based range. This distinction is thankfully clearer on the Swift side.

Thanks a lot for this clarification! So, does that mean the issue is actually here?

static inline SourceRange translateCXSourceRange(CXSourceRange R) {
  return SourceRange(SourceLocation::getFromRawEncoding(R.begin_int_data),
                     SourceLocation::getFromRawEncoding(R.end_int_data));
}

jkorous added inline comments. Aug 31 2020, 6:08 PM

clang/tools/libclang/CIndex.cpp
6830	@akyrtzi this seems to conflict with the notion of `SourceRange` always being a closed token range. Or is it more nuanced or context-dependent?

In D86840#2248463, @jkorous wrote:
Thanks a lot for this clarification! So, does that mean the issue is actually here?
static inline SourceRange translateCXSourceRange(CXSourceRange R) {
  return SourceRange(SourceLocation::getFromRawEncoding(R.begin_int_data),
                     SourceLocation::getFromRawEncoding(R.end_int_data));
}

I think it depends on what that CXSourceRange represents (how was it created).

In D86840#2248467, @akyrtzi wrote:
In D86840#2248463, @jkorous wrote:
Thanks a lot for this clarification! So, does that mean the issue is actually here?
static inline SourceRange translateCXSourceRange(CXSourceRange R) {
  return SourceRange(SourceLocation::getFromRawEncoding(R.begin_int_data),
                     SourceLocation::getFromRawEncoding(R.end_int_data));
}
I think it depends on what that CXSourceRange represents (how was it created).

Oh, I am embarrassed to admit that it never crossed my mind that semantics might not be a type invariant here... Ok, thanks, this is really helpful.

We may have places in the code where SourceRange is used as a pair of locations, and those locations are character locations instead of token ones, so essentially the information of whether the range is token-based or character-based gets lost, and we get into trouble when passing such a SourceRange to APIs that assume token-based.

When we return source ranges from functions in libclang we use cxloc::translateSourceRange to convert them from SourceRange semantics to half-open character range CXSourceRange. I checked the cxloc::translateSourceRange implementation and also that we use it if not everywhere then very often.

/// Translate a Clang source range into a CIndex source range.
///
/// Clang internally represents ranges where the end location points to the
/// start of the token at the end. However, for external clients it is more
/// useful to have a CXSourceRange be a proper half-open interval. This routine
/// does the appropriate translation.
CXSourceRange cxloc::translateSourceRange(const SourceManager &SM,
                                          const LangOptions &LangOpts,
                                          const CharSourceRange &R) {

What this means is that outside of the libclang API boundary we have half-open character ranges, but in the libclang implementation we use closed token ranges.

This also means, IIUC, that whenever we pass a range to any libclang function, we have to find the start of the last token in the range and move the end of the range there. This does not sound great, and for certain use cases it might cause a noticeable performance regression.
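The reverse conversion described above can be sketched as follows. This is an illustration under stated assumptions, not libclang's implementation: the types, toyTokenLength, and toTokenRange are hypothetical, and the lexer rule is a toy. The point is only that finding the start of the last token requires walking the tokens in the range, so the cost grows with the range length.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical toy lexer rule (not clang's): a token is a maximal run of
// [A-Za-z0-9_], or else a single character. Off must point at a token start.
std::size_t toyTokenLength(const std::string &Buf, std::size_t Off) {
  auto IsWord = [](char C) {
    return std::isalnum(static_cast<unsigned char>(C)) || C == '_';
  };
  if (Off >= Buf.size())
    return 0;
  if (!IsWord(Buf[Off]))
    return 1;
  std::size_t End = Off;
  while (End < Buf.size() && IsWord(Buf[End]))
    ++End;
  return End - Off;
}

struct ToyCharRange { std::size_t Begin, End; };   // half-open [Begin, End)
struct ToyTokenRange { std::size_t Begin, End; };  // starts of first/last token

// Walks every token in [R.Begin, R.End) to find where the first and last ones
// start. Returns false if the range holds no token or a token straddles R.End.
bool toTokenRange(const std::string &Buf, ToyCharRange R, ToyTokenRange &Out) {
  std::size_t Off = R.Begin, FirstStart = R.Begin, LastStart = R.Begin;
  bool Found = false;
  while (Off < R.End && Off < Buf.size()) {
    if (std::isspace(static_cast<unsigned char>(Buf[Off]))) {
      ++Off;  // whitespace between tokens
      continue;
    }
    if (!Found)
      FirstStart = Off;
    LastStart = Off;
    Found = true;
    Off += toyTokenLength(Buf, Off);
  }
  if (!Found || Off > R.End)
    return false;
  Out = ToyTokenRange{FirstStart, LastStart};
  return true;
}

void demoReverseConversion() {
  const std::string Code = "int foo = 5;";
  ToyTokenRange T{0, 0};
  // [0, 7) covers "int foo", so the last token starts at offset 4.
  assert(toTokenRange(Code, ToyCharRange{0, 7}, T) && T.Begin == 0 && T.End == 4);
  // [4, 6) cuts "foo" in half, so there is no matching token range.
  assert(!toTokenRange(Code, ToyCharRange{4, 6}, T));
}
```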

jkorous mentioned this in D86990: [libclang] Source range conversion. Sep 1 2020, 5:04 PM

I believe this is the correct approach: https://reviews.llvm.org/D86990

jkorous added inline comments. Sep 1 2020, 5:12 PM

clang/tools/libclang/CIndex.cpp
6830	Maybe this is the only mis-use of `SourceRange`. It's somewhat natural that input for tokenization procedure isn't formulated in terms of tokens. We might still consider adding a specific type for this to keep things type safe.
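One way to read the "specific type" idea is to make the interval convention part of the type, so that a character range cannot silently reach an API that expects a token range. The sketch below is purely illustrative: ToyCharRange, ToyTokenRange, and replaceText are hypothetical names, not clang or libclang types, and nothing here is proposed in the revision itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical distinct types: the interval convention is part of the type.
struct ToyCharRange { std::size_t Begin, End; };   // half-open [Begin, End)
struct ToyTokenRange { std::size_t Begin, End; };  // starts of first/last token

// Neither type converts implicitly to the other, so mixing them up is a
// compile error instead of an off-by-one at run time.
static_assert(!std::is_convertible<ToyCharRange, ToyTokenRange>::value,
              "char ranges must not decay into token ranges");
static_assert(!std::is_convertible<ToyTokenRange, ToyCharRange>::value,
              "token ranges must not decay into char ranges");

// A rewriter-like entry point that only accepts half-open character ranges.
std::string replaceText(const std::string &Buf, ToyCharRange R,
                        const std::string &NewText) {
  assert(R.Begin <= R.End && R.End <= Buf.size());
  return Buf.substr(0, R.Begin) + NewText + Buf.substr(R.End);
}

void demoTypedRanges() {
  const std::string Code = "void foo(){ }";
  assert(replaceText(Code, ToyCharRange{5, 8}, "bar") == "void bar(){ }");
  // replaceText(Code, ToyTokenRange{5, 5}, "bar") would not compile.
}
```

The input and offsets mirror the RewriteRegression test in this revision; the trade-off is that every boundary between the two conventions then needs an explicit conversion function.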

Revision Contents

Path                                        Size
clang/lib/Rewrite/Rewriter.cpp              4 lines
clang/tools/libclang/CIndex.cpp             3 lines
clang/tools/libclang/CXSourceLocation.h     1 line
clang/unittests/Rewrite/RewriterTest.cpp    15 lines
clang/unittests/libclang/LibclangTest.cpp   19 lines

Diff 289054

clang/lib/Rewrite/Rewriter.cpp

@@ 162 lines not shown @@ if (I != RewriteBuffers.end()) {
     EndOff = RB.getMappedOffset(EndOff, opts.IncludeInsertsAtEndOfRange);
     StartOff = RB.getMappedOffset(StartOff, !opts.IncludeInsertsAtBeginOfRange);
   }

   // Adjust the end offset to the end of the last token, instead of being the
   // start of the last token if this is a token range.
   if (Range.isTokenRange())
     EndOff += Lexer::MeasureTokenLength(Range.getEnd(), *SourceMgr, *LangOpts);
+  // TODO: This means that CharSourceRange is half-open otherwise we're getting the wrong number here (think about a single character range).
   return EndOff-StartOff;
 }

+// TODO: This means that "the input SourceRange" is half-open token range which is conflicting with how we use SourceRange in translateCXSourceRange.
 int Rewriter::getRangeSize(SourceRange Range, RewriteOptions opts) const {
   return getRangeSize(CharSourceRange::getTokenRange(Range), opts);
 }

 /// getRewrittenText - Return the rewritten form of the text in the specified
 /// range. If the start or end of the range was unrewritable or if they are
 /// in different buffers, this returns an empty string.
 ///
@@ 111 lines not shown @@
 }

 bool Rewriter::InsertTextAfterToken(SourceLocation Loc, StringRef Str) {
   if (!isRewritable(Loc)) return true;
   FileID FID;
   unsigned StartOffs = getLocationOffsetAndFileID(Loc, FID);
   RewriteOptions rangeOpts;
   rangeOpts.IncludeInsertsAtBeginOfRange = false;
+  // TODO: This means getRangeSize(SourceRange&) really expects half-closed token range.
   StartOffs += getRangeSize(SourceRange(Loc, Loc), rangeOpts);
   getEditBuffer(FID).InsertText(StartOffs, Str, /*InsertAfter*/true);
   return false;
 }

 /// RemoveText - Remove the specified text region.
 bool Rewriter::RemoveText(SourceLocation Start, unsigned Length,
                           RewriteOptions opts) {
@@ 168 lines not shown @@

clang/tools/libclang/CIndex.cpp

@@ 152 lines not shown @@ if (EndLoc.isValid() && EndLoc.isMacroID() &&
     EndLoc = Expansion.getEnd();
     IsTokenRange = Expansion.isTokenRange();
   }
   if (IsTokenRange && EndLoc.isValid()) {
     unsigned Length =
         Lexer::MeasureTokenLength(SM.getSpellingLoc(EndLoc), SM, LangOpts);
     EndLoc = EndLoc.getLocWithOffset(Length);
   }
+  // TODO: This means that CharSourceRange is half-open even when not token range because the return type is half-open (per doxygen) and for non-token ranges we're not converting the end in any way.
   CXSourceRange Result = {
       {&SM, &LangOpts}, R.getBegin().getRawEncoding(), EndLoc.getRawEncoding()};
   return Result;
 }

 //===----------------------------------------------------------------------===//
 // Cursor visitor.
 //===----------------------------------------------------------------------===//
@@ 6,652 lines not shown @@ static void getTokens(ASTUnit *CXXUnit, SourceRange Range,
   if (Invalid)
     return;

   Lexer Lex(SourceMgr.getLocForStartOfFile(BeginLocInfo.first),
             CXXUnit->getASTContext().getLangOpts(), Buffer.begin(),
             Buffer.data() + BeginLocInfo.second, Buffer.end());
   Lex.SetCommentRetentionState(true);

+  // TODO: This means SourceRange is half-open as we stop before the end.
   // Lex tokens until we hit the end of the range.
   const char *EffectiveBufferEnd = Buffer.data() + EndLocInfo.second;
   Token Tok;
   bool previousWasAt = false;
   do {
     // Lex the next token
     Lex.LexFromRawLexer(Tok);
     if (Tok.is(tok::eof))
@@ 2,274 lines not shown @@

clang/tools/libclang/CXSourceLocation.h

@@ 60 lines not shown @@ return translateSourceRange(Context.getSourceManager(),
                               Context.getLangOpts(),
                               CharSourceRange::getTokenRange(R));
 }

 static inline SourceLocation translateSourceLocation(CXSourceLocation L) {
   return SourceLocation::getFromRawEncoding(L.int_data);
 }

+// TODO: This means that SourceRange is just a half-open interval of chars because that's the semantics we specify in libclang API.
 static inline SourceRange translateCXSourceRange(CXSourceRange R) {
   return SourceRange(SourceLocation::getFromRawEncoding(R.begin_int_data),
                      SourceLocation::getFromRawEncoding(R.end_int_data));
 }


 }} // end namespace: clang::cxloc

 #endif

clang/unittests/Rewrite/RewriterTest.cpp

@@ 69 lines not shown @@ TEST(Rewriter, ReplaceTextRangeTypes) {
   // get ^~~~~ = "0;"
   RangeTypeTest T(Code, 42, 44);
   T.Rewrite.ReplaceText(T.CRange, "foo");
   EXPECT_EQ(T.Rewrite.getRewrittenText(T.makeCharRange(42, 47)), "foogc;");
   T.Rewrite.ReplaceText(T.TRange, "bar");
   EXPECT_EQ(T.Rewrite.getRewrittenText(T.makeCharRange(42, 47)), "bar;");
   T.Rewrite.ReplaceText(T.SRange, "0");
   EXPECT_EQ(T.Rewrite.getRewrittenText(T.makeCharRange(42, 47)), "0;");

+TEST(Rewriter, RangeSize) {
+  StringRef Code = "void foo(){ }";
+
+  RangeTypeTest T(Code, 0, 13);
+  const auto Len = T.Rewrite.getRangeSize(T.makeCharRange(5, 8).getAsRange(), Rewriter::RewriteOptions{});
+  EXPECT_EQ(Len, 3);
+}
+
+TEST(Rewriter, RewriteRegression) {
+  StringRef Code = "void foo(){ }";
+
+  RangeTypeTest T(Code, 0, 13);
+  T.Rewrite.ReplaceText(T.makeCharRange(5, 8).getAsRange(), "bar");
+  EXPECT_EQ(T.Rewrite.getRewrittenText(T.makeCharRange(0, 13)), "void bar(){ }");
 }

 } // anonymous namespace

clang/unittests/libclang/LibclangTest.cpp

 //===- unittests/libclang/LibclangTest.cpp --- libclang tests -------------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//

 #include "clang-c/Index.h"
+#include "clang/Basic/SourceLocation.h"
 #include "llvm/ADT/StringRef.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/Path.h"
 #include "llvm/Support/raw_ostream.h"
 #include "gtest/gtest.h"
 #include "TestUtils.h"
 #include <fstream>
@@ 713 lines not shown @@ TEST_F(LibclangSerializationTest, TokenKindsAreCorrectAfterLoading) {

   std::string ASTName = "test.ast";
   WriteFile(ASTName, "");

   ASSERT_TRUE(SaveAndLoadTU(ASTName));

   CheckTokenKinds();
 }

+TEST_F(LibclangParseTest, TranslateSourceRange) {
+  std::string Main = "main.c";
+  WriteFile(Main, "int main(void);\n");
+
+  ClangTU = clang_parseTranslationUnit(Index, Main.c_str(), nullptr, 0, nullptr, 0, TUFlags);
+
+  CXFile File = clang_getFile(ClangTU, Main.c_str());
+  CXSourceLocation B = clang_getLocation(ClangTU, File, 1, 4);
+  CXSourceLocation E = clang_getLocation(ClangTU, File, 1, 8);
+
+  CXSourceRange Rng = clang_getRange(B, E);
+
+  CXToken *Tokens;
+  unsigned int NumTokens;
+  clang_tokenize(ClangTU, Rng, &Tokens, &NumTokens);
+  ASSERT_EQ(NumTokens, 1);
+}