Page MenuHomePhabricator

[libclang] Source range conversion
ClosedPublic

Authored by jkorous on Sep 1 2020, 5:04 PM.

Details

Reviewers
arphaman
akyrtzi
Summary

I believe this is the correct conversion of CXSourceRange objects that are passed to libclang API and we should use it consistently everywhere. If it turns out that CXSourceRange objects can have multiple different semantics we should definitely introduce one type per semantic and fix libclang API to be type safe.

However, that is a bigger task and given the downsides (necessity to pass CXTranslationUnit everywhere, potential performance impact of the tokenization) we might end up with a completely different solution.

For context - test proposals + discussion:
https://reviews.llvm.org/D86840

Diff Detail

Event Timeline

jkorous created this revision.Sep 1 2020, 5:04 PM
jkorous requested review of this revision.Sep 1 2020, 5:04 PM

SourceRange cxloc::translateCharRangeToTokenRange(CXTranslationUnit TU, CXSourceRange R) is an expensive operation and I'm concerned it will be easy to start calling at places and introduce performance degradation.

Would it be viable to change return type of SourceRange translateCXSourceRange(CXSourceRange R) to CharSourceRange translateCXSourceRange(CXSourceRange R), assuming that the semantics of CXSourceRange is that it represents a character-based range? Then pass it along as CharSourceRange and if there are internal clang APIs that need to adjust to accept CharSourceRanges then we enhance them.

@akyrtzi IIUC you are saying we should try to push the char-range semantics from libclang API as deep as possible to see if we could avoid this annoying conversion. I think that's a reasonable long-term strategy.
But I also think we'll inevitably run into some clang APIs that use SourceRange for valid reasons and we won't be able to change those. Some other clang APIs might require significant amount of work to be enhanced.

Ultimately, given this conversion isn't part of public libclang API and has explicit semantics, I would keep it - I'd just add a warning to its doc comment.

On practical level - I need this conversion for my other work and I can't go just now and refactor the Rewriter interface.
https://reviews.llvm.org/D86992

@akyrtzi IIUC you are saying we should try to push the char-range semantics from libclang API as deep as possible to see if we could avoid this annoying conversion. I think that's a reasonable long-term strategy.
But I also think we'll inevitably run into some clang APIs that use SourceRange for valid reasons and we won't be able to change those. Some other clang APIs might require significant amount of work to be enhanced.

Ultimately, given this conversion isn't part of public libclang API and has explicit semantics, I would keep it - I'd just add a warning to its doc comment.

On practical level - I need this conversion for my other work and I can't go just now and refactor the Rewriter interface.
https://reviews.llvm.org/D86992

AFAICT the Rewriter already supports CharSourceRange see from Rewriter.h:

/// getRangeSize - Return the size in bytes of the specified range if they
/// are in the same file.  If not, this returns -1.
int getRangeSize(SourceRange Range,
                 RewriteOptions opts = RewriteOptions()) const;
int getRangeSize(const CharSourceRange &Range,
                 RewriteOptions opts = RewriteOptions()) const;

/// getRewrittenText - Return the rewritten form of the text in the specified
/// range.  If the start or end of the range was unrewritable or if they are
/// in different buffers, this returns an empty string.
///
/// Note that this method is not particularly efficient.
std::string getRewrittenText(CharSourceRange Range) const;

/// getRewrittenText - Return the rewritten form of the text in the specified
/// range.  If the start or end of the range was unrewritable or if they are
/// in different buffers, this returns an empty string.
///
/// Note that this method is not particularly efficient.
std::string getRewrittenText(SourceRange Range) const {
  return getRewrittenText(CharSourceRange::getTokenRange(Range));
}

In fact CharSourceRange is a more natural type for the Rewriter to use since the Rewriter wants to resolve all ranges down to byte offsets in order to do its thing. And CXSourceRange is already character based, so if you end up doing:

CXSourceRange (character) -> (convert) SourceRange (token) -> Rewriter (convert to character) -> Rewriter (work with byte offset)

You add unnecessary conversion steps, compared to:

CXSourceRange (character) -> CharSourceRange (character) -> Rewriter (work with byte offset)

where no conversion is needed.

You are absolutely right! I initially didn't realize the overloads of Rewriter methods can be of importance and haven't looked back since.

So, I'm adding the conversion you suggested and removing the illogical one to not confuse anyone.

Thanks!

akyrtzi added inline comments.Sep 2 2020, 2:20 PM
clang/tools/libclang/CXSourceLocation.h
74–78

"SourceRange" -> "CharSourceRange"

78

Minor nitpick but maybe rename to "translateCXRangeToCharRange", to make the name more clear?

jkorous updated this revision to Diff 289839.Sep 3 2020, 5:26 PM
  • fix the doc comment
  • rename
akyrtzi accepted this revision.Sep 3 2020, 5:45 PM
This revision is now accepted and ready to land.Sep 3 2020, 5:45 PM
jkorous closed this revision.Sep 4 2020, 10:56 AM

I forgot to add the magic "Differential revision:" prefix to the commit message :(

Landed as https://reviews.llvm.org/rGbaf3c77bd9f6baf60a09ef3625fef84080642b72

commit baf3c77bd9f6baf60a09ef3625fef84080642b72 (HEAD -> master, llvm.org/master)
Author: Jan Korous <jkorous@apple.com>
Date:   Wed Sep 2 13:11:35 2020 -0700

    [libclang] Add translateCXRangeToCharRange conversion

    Add new conversion with clearly specified semantics.

    https://reviews.llvm.org/D86990