This is an archive of the discontinued LLVM Phabricator instance.

[clang][lex] Enable Lexer to grow its buffer
Needs ReviewPublic

Authored by sunho on Feb 2 2023, 12:28 AM.

Details

Summary

Change Lexer to use offsets instead of direct pointers to buffer so that even if we swap the buffer address in the middle, Lexer will be still functional.

Incremental input (via clang-repl, cling, etc) adds code line by line growing the TU. One of the last elements which needs to support growing is the source code buffer. One of the challenges is that when we grow the buffer, practically the buffer address can change. Since Lexer is using direct pointer to some point in buffer, once buffer is swapped every pointer needs to be updated including all trivial local variables -- which is very challenging to do without sacrificing readability of code.

This change solves this issue nicely. Since we will be only adding code at the back of the buffer, the offsets are always constant even if we grow the buffer many times and all the access to new buffer will be valid. We do add a number of indirections to BufferStart, but performance impact on actual compile time turned out to be negligible. The only visible performance trend seems to be 0.5%~0.7% increase in instruction count.

The debian failure is due to some clang-format issue which is unrelated to this change.

Diff Detail

Event Timeline

sunho created this revision.Feb 2 2023, 12:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 2 2023, 12:28 AM
sunho retitled this revision from Format to [clang] Change Lexer to use offsets instead of direct pointer.Feb 2 2023, 12:34 AM
sunho edited the summary of this revision. (Show Details)
sunho retitled this revision from [clang] Change Lexer to use offsets instead of direct pointer to [clang][lex] Change Lexer to use offsets instead of direct pointer.
sunho updated this revision to Diff 494968.Feb 5 2023, 6:03 PM
This comment was removed by sunho.
sunho updated this revision to Diff 494976.Feb 5 2023, 6:51 PM
This comment was removed by sunho.
sunho edited the summary of this revision. (Show Details)Feb 6 2023, 2:42 AM
sunho edited the summary of this revision. (Show Details)
sunho edited the summary of this revision. (Show Details)Feb 6 2023, 2:47 AM
sunho edited the summary of this revision. (Show Details)
sunho published this revision for review.Feb 6 2023, 3:05 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 6 2023, 3:05 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
sunho edited the summary of this revision. (Show Details)Feb 6 2023, 3:26 AM
sunho edited the summary of this revision. (Show Details)Feb 6 2023, 3:28 AM
sunho edited the summary of this revision. (Show Details)
sunho retitled this revision from [clang][lex] Change Lexer to use offsets instead of direct pointer to [clang][lex] Enable Lexer to grow its buffer.Feb 6 2023, 3:38 AM
shafik added a subscriber: shafik.Feb 6 2023, 6:40 PM
shafik added inline comments.
clang/lib/Lex/Lexer.cpp
1203

I wonder do we really need to do these pointer gymnastics, maybe making this a member function would eliminate the need for it.

1815–1816

nit

WG21 is meeting all this week, so a bunch of folks who should take a look at this may not get around to it right away.

junaire added a subscriber: junaire.Feb 6 2023, 8:49 PM
sunho edited the summary of this revision. (Show Details)Feb 8 2023, 5:08 PM
sunho added a reviewer: davrec.Feb 8 2023, 5:10 PM
v.g.vassilev added inline comments.Feb 9 2023, 2:45 AM
clang/lib/Lex/Lexer.cpp
211

Is that an outdated comment? If not maybe elaborate why this is wrong.

sunho added inline comments.Feb 9 2023, 9:07 AM
clang/lib/Lex/Lexer.cpp
211

It's indeed outdated comment. I'll remove it.

1203

Yes, we can change it to offset here.

We should probably add some tests here. Alternatively we can add the tests from https://reviews.llvm.org/D143148 but that'd make this patch bulkier and probably harder to review.

Only had a chance to give it a once over, I will look through more closely later, def by this weekend. Main thing is I think we shouldn't be exposing the buffer pointers after this change, i.e. no public function should return const char *, unless I'm missing something. If that box is checked and performance cost is negligible I'd give this the thumbs up.

clang/include/clang/Lex/Lexer.h
307–310

I think I'd like this to return unsigned; i.e. I think after this patch we should not even be publicly exposing buffer locations as pointers, IIUC. A brief search for uses of getBufferLocation() (there aren't many) suggests this would be probably be fine and indeed would get rid of some unnecessary pointer arithmetic. And indeed if anything really needs that const char * that might be a red flag to investigate further.

609

FWIW it sucks that uint32_t is already sprinkled throughout the interface alongside unsigned, wish just one was used consistently, but that does not need to be addressed in this patch.

This is an impressive amount of work. I think it makes sense!
Thanks a lot for doing that work.
I only have a few nits after a first review of this.

clang/include/clang/Lex/Lexer.h
89–90

Should that use SourceLocation::UIntTy?
Looking at comments in SourceManager, I think there was an attempt at supporting > 2GB file but I don't think it got anywhere.
Nevertheless, using SourceLocation::UIntTy would arguably be more consistent

It does seem to be a huge undertaking to change it though, I'm not sure it would be worth it at all. There would be far bigger issues with ridiculously large source files anyway.

609

here, uint32_t is a codepoint, should arguably be char32_t instead. but agreed, not in this patch.

clang/lib/Lex/Lexer.cpp
213
1203

Agreed, that would be nice! (In all places getBuffer().data() is used)

1353
1378–1379
1740
1745

Ditto in all similar places, I think it reads easier

tschuett added inline comments.
clang/include/clang/Lex/Lexer.h
89–90

I am a bit afraid that unsigned has different sizes on different platforms. At least a using BufferOffsetType = uint64_t; would be nice.

Suggested a few adjustments in LexTokenInternal you might want to test for perf improvements (orthogonal to this patch, but could boost its numbers :).
And also noted that Lexer::getBuffer() has same issue as getBufferLocation() re potential for pointer invalidation IIUC. At a minimum we need comments on these functions explaining any risks; better still to remove them from the public interface. If downstream users use these functions and complain, good - they need to be aware of this change.

clang/include/clang/Lex/Lexer.h
285

Same issue as with getBufferLocation(), publicly returning it permits possible pointer invalidation. Fortunately I only see it used in a single spot (prior to this patch anyway) which can be easily eliminated IIUC. Yank this function? Or make private/append "Unsafe" to name (and explain in comments)?

307–310

Looking more closely I see that getCurrentBufferOffset returns the unsigned, and this patch already changes some getBufferLocation usages to getCurrentBufferOffset. So, I say either yank it or make it private or append "Unsafe" and explain in comments.

clang/lib/Lex/Lexer.cpp
2949–2950

indent

2973–2976

indent

3624–3627
for (isHorizontalWhitespace(BufferStart[++CurOffset]);;)
  ;

might save a few instructions? Worth trying since this function is the main perf-critical one.

3733–3746

Spitballing again for possible minor perf improvements:

if (char Char0 = BufferStart[CurOffset] == '/' && !inKeepCommentMode()) {
  if (char Char1 = BufferStart[CurOffset + 1] == '/' && LineComment && 
      (LangOpts.CPlusPlus || !LangOpts.TraditionalCPP)) {
    if (SkipLineComment(Result, CurOffset + 2, TokAtPhysicalStartOfLine))
      return true; // There is a token to return.
    goto SkipIgnoredUnits;
  } else if (Char1 == '*') {
    if (SkipBlockComment(Result, CurOffset + 2, TokAtPhysicalStartOfLine))
      return true; // There is a token to return.
    goto SkipIgnoredUnits;
  }
} else if (isHorizontalWhitespace(Char0)) {
  goto SkipHorizontalWhitespace;
}
davrec added inline comments.Feb 20 2023, 10:28 AM
clang/lib/Lex/Lexer.cpp
3624–3627

^ Ignore, erroneous :)

In general, I think this makes sense. However:

This change solves this issue nicely. Since we will be only adding code at the back of the buffer, the offsets are always constant even if we grow the buffer many times and all the access to new buffer will be valid. We do add a number of indirections to BufferStart, but performance impact on actual compile time turned out to be negligible. The only visible performance trend seems to be 0.5%~0.7% increase in instruction count.

Can you give some performance numbers for how this impacts compile time performance for some large C and C++ projects? https://llvm-compile-time-tracker.com/ might be of help in gathering that data, FWIW.

In terms of whether to use unsigned vs a specific type; I don't have a strong opinion, but it'd be good to at least static_assert properties we care about (like the type being at least 32 bits). We've had efforts in the past to allow for > 2GB source files, so I weakly think it would make sense to use a 64-bit value explicitly up front, but I have no idea how that changes the performance characteristics for the changes either.

In terms of whether to expose access to pointers to the underlying buffer through lexer APIs... I sort of agree with @davrec that it would be good to avoid exposing those interfaces given that the pointer values are likely to be invalidated when the buffer grows. My instinct is that Clang developers are unlikely to consider that buffer to be something that can be invalidated, so if we retain an interface to get a pointer to the buffer when you get to the point of actually allowing it to grow, it'd be nice if we can find some way to forcefully grow the buffer to help catch misuses in Clang when running the llvm-lit tests.

The changes made (from what I've seen, I haven't reviewed every line) make sense to me. The amount of change does make me a bit nervous though.

In terms of whether to use unsigned vs a specific type; I don't have a strong opinion, but it'd be good to at least static_assert properties we care about (like the type being at least 32 bits). We've had efforts in the past to allow for > 2GB source files, so I weakly think it would make sense to use a 64-bit value explicitly up front, but I have no idea how that changes the performance characteristics for the changes either.

I like the idea of using a strong type for two reasons: 1) normal type safety stuff, and 2) it would enable a platform dependent and/or configurable type; I don't see any reason for use of a 64-bit offset on an ILP-32 platform, so no point in adding the overhead there.

In terms of whether to expose access to pointers to the underlying buffer through lexer APIs... I sort of agree with @davrec that it would be good to avoid exposing those interfaces given that the pointer values are likely to be invalidated when the buffer grows. My instinct is that Clang developers are unlikely to consider that buffer to be something that can be invalidated, so if we retain an interface to get a pointer to the buffer when you get to the point of actually allowing it to grow, it'd be nice if we can find some way to forcefully grow the buffer to help catch misuses in Clang when running the llvm-lit tests.

Perhaps such access can be facilitated via a shared_ptr-like handle that dynamically records whether direct access is in use; something like a read-lock. Attempts to grow the buffer could then assert that no such use is outstanding.