This is an archive of the discontinued LLVM Phabricator instance.

[Preamble] Fixed preamble breaking with BOM presence (and particularly, fluctuating BOM presence)
ClosedPublic

Authored by cameron314 on Sep 5 2017, 12:56 PM.

Details

Summary

This patch fixes preamble skipping when the preamble region includes a byte order mark (BOM). Previously, parsing would fail if preamble PCH generation was enabled and a BOM was present.

This also fixes preamble invalidation when a BOM appears or disappears. This may seem to be an obscure edge case, but it happens regularly with IDEs that pass buffer overrides that never (or always) have a BOM, yet the underlying file from the initial parse that generated a PCH might (or might not) have a BOM.

I've included a test case for these scenarios.

Note: This depends on the test infrastructure introduced in D37474.

Diff Detail

Repository
rL LLVM

Event Timeline

cameron314 created this revision.Sep 5 2017, 12:56 PM
ilya-biryukov edited edge metadata.
ilya-biryukov added subscribers: cfe-commits, klimek, bkramer.
  • How are various preprocessor offests (and SourceLocation offsets) are calculated? Do they account for BOM presence and ignore it?
  • Are there potential problems we may run into because of the changing offsets? Could we add tests checking changing the offsets does not matter?
  • Should we add checks that BOM was removed or added, but not changed? I would not expect preamble to be reusable "as is" if BOM (and therefore, input encoding) changed.
include/clang/Frontend/PrecompiledPreamble.h
102 ↗(On Diff #113898)

Let's leave this class's interface immutable. It is used concurrently in clangd and having a mutable method like this would break the code.

Passing new PreambleBounds to AddImplicitPreamble and setting the offsets accordingly would do the trick, leave the interface immutable and make the fact that offsets might change more evident.

191 ↗(On Diff #113898)

Let's store original PreambleBounds instead of PreambleEndsAtStartOfLine and PreambleOffset.
It would make the code easier to read.

include/clang/Lex/Lexer.h
50 ↗(On Diff #113898)

Maybe pick a name that clearly states that it's a BOM size?
Or add a comment indicating that it's a BOM offset.

639 ↗(On Diff #113898)

Maybe leave the old name? Doesn't SkipBytes captures the new semantics just as good?

lib/Frontend/PrecompiledPreamble.cpp
195 ↗(On Diff #113898)

Could you inline usages of this function and remove it?

unittests/Frontend/PchPreambleTest.cpp
190 ↗(On Diff #113898)

We're not really testing that preamble was reused.
Maybe return a flag from ASTUnit::Reparse to indicate if preamble was reused and check it here?

Thanks for the response!

How are various preprocessor offests (and SourceLocation offsets) are calculated? Do they account for BOM presence and ignore it?

Everything is in byte offsets; the SourceLocation after the BOM is not the same as before the BOM. The lexer automatically skips the BOM at the beginning of the file if it sees one (Lexer::InitLexer), and everything else works normally after that. The start of the first line is after the BOM, if any, which means it doesn't affect line/column numbers.

Are there potential problems we may run into because of the changing offsets? Could we add tests checking changing the offsets does not matter?

That's a good point; I've looked into it and the PCH for the preamble is parsed using just the buffer slice that contains the preamble, excluding any BOM. That means that when we resume parsing later on a main buffer with a BOM, the SourceLocations within the preamble itself will be off. However, normally this doesn't matter since the only things in the preamble are preprocessor directives, whose positions are very rarely used. (I should note at this point that we've been using a variant of this patch in production for a few years without any problem.) So, we have two choices: Either parse the preamble with the BOM and throw out the preamble/PCH when the BOM presence changes from the main buffer, or slice the buffer when using a preamble PCH so that it never has a BOM during parsing. I'm leaning towards the second option, since it's a little cleaner and lets the preamble be reused more easily; the only downside is that an external consumer would not be able to use any absolute offsets from the AST (note that line/column offsets would be identical) in the original buffer if it has a BOM -- but in any case, absolute offsets are usually useless without the buffer itself, which if obtained from clang would always be the correct buffer.

Should we add checks that BOM was removed or added, but not changed? I would not expect preamble to be reusable "as is" if BOM (and therefore, input encoding) changed.

I'm not sure I understand this point. Clang only understands UTF-8; the BOM is either present or not, but the encoding never changes. (And the BOM itself is always the same byte sequence too.) It has no impact on the file contents.

include/clang/Frontend/PrecompiledPreamble.h
102 ↗(On Diff #113898)

Fair point, I'll change this.

191 ↗(On Diff #113898)

Again, good point, I'll change this.

include/clang/Lex/Lexer.h
50 ↗(On Diff #113898)

I can see how this might be confusing. I'll add a comment.

639 ↗(On Diff #113898)

SkipBytes moves relative to the current position, but the lexer skips the BOM implicitly on construction; I don't want to skip it twice. SetByteOffset is absolute, which makes it simple and clear to use without having to reason about implicit past state.

lib/Frontend/PrecompiledPreamble.cpp
195 ↗(On Diff #113898)

I could; I think it makes sense to leave the wrapper, though, since the ASTUnit deals with the PrecompiledPreamble at its level of abstraction, and the PrecompiledPreamble deals with the lexer at its level of abstraction.

unittests/Frontend/PchPreambleTest.cpp
190 ↗(On Diff #113898)

We are; if it wasn't reused, the header would have been opened again and the last assert on GetFileReadCount below would fail.

Are there potential problems we may run into because of the changing offsets? Could we add tests checking changing the offsets does not matter?

That's a good point; I've looked into it and the PCH for the preamble is parsed using just the buffer slice that contains the preamble, excluding any BOM. That means that when we resume parsing later on a main buffer with a BOM, the SourceLocations within the preamble itself will be off. However, normally this doesn't matter since the only things in the preamble are preprocessor directives, whose positions are very rarely used. (I should note at this point that we've been using a variant of this patch in production for a few years without any problem.) So, we have two choices: Either parse the preamble with the BOM and throw out the preamble/PCH when the BOM presence changes from the main buffer, or slice the buffer when using a preamble PCH so that it never has a BOM during parsing. I'm leaning towards the second option, since it's a little cleaner and lets the preamble be reused more easily; the only downside is that an external consumer would not be able to use any absolute offsets from the AST (note that line/column offsets would be identical) in the original buffer if it has a BOM -- but in any case, absolute offsets are usually useless without the buffer itself, which if obtained from clang would always be the correct buffer.

Maybe there's a third option option to remove the BOM from the buffer before passing it to clang?
Could you elaborate on your use-case a little more? Is there no way to consistently always pass buffers either with or without BOM?

Out of two options you mention discarding preamble on BOM changes seems like an easy option that is both correct and won't make a difference in performance since BOM rarely changes.
Looking at your use-case, it sounds like you'll only have 1 extra reparse of preamble, which is probably fine.

Should we add checks that BOM was removed or added, but not changed? I would not expect preamble to be reusable "as is" if BOM (and therefore, input encoding) changed.

I'm not sure I understand this point. Clang only understands UTF-8; the BOM is either present or not, but the encoding never changes. (And the BOM itself is always the same byte sequence too.) It has no impact on the file contents.

Sure, it's not something clang supports, it's an edge-case when clang receives "malformed" input. Does lexer only skip utf-8 BOM, but not other versions of BOM?
But you're right, it's highly unlikely anything will break in that case.

unittests/Frontend/PchPreambleTest.cpp
190 ↗(On Diff #113898)

Missed that, thanks. Looks good.
Maybe add a comment explicitly noting that?

Maybe there's a third option option to remove the BOM from the buffer before passing it to clang?
Could you elaborate on your use-case a little more? Is there no way to consistently always pass buffers either with or without BOM?
Out of two options you mention discarding preamble on BOM changes seems like an easy option that is both correct and won't make a difference in performance since BOM rarely changes.
Looking at your use-case, it sounds like you'll only have 1 extra reparse of preamble, which is probably fine.

In my particular use case, when the file is remapped by the IDE, there's never a BOM. But when it's not remapped, the real file may or may not have a BOM. Since the file goes back and forth between mapped and unmapped depending on whether it's saved, the BOM presence can change quite frequently, and we don't really have control over it (the BOM can change on disk too). This is a common use case for anyone integrating clang/libclang into an IDE; the rarity of UTF-8 BOMs on platforms other than Windows probably obscured this until now.

I think since we can handle the changing BOM presence in the preamble gracefully, we should. I'll draft a patch that does the slicing correctly so that the offsets are always valid.

Sure, it's not something clang supports, it's an edge-case when clang receives "malformed" input. Does lexer only skip utf-8 BOM, but not other versions of BOM?
But you're right, it's highly unlikely anything will break in that case.

Ah, I see. Yes, the lexer only skips a UTF-8 BOM, but I seem to recall seeing some code that detects BOMs in other encodings and emits an error (in the driver, possibly?).

unittests/Frontend/PchPreambleTest.cpp
190 ↗(On Diff #113898)

Sure, will do.

Here's an updated patch. The code required to make it work is much simpler when the BOM is simply ignored :-)

Parsing errors on preamble additions and removals are definitely bad and should be fixed.
But I would argue that the right approach is to invalidate the preamble and rebuild it on BOM changes.

Current fix in ASTUnit just hides an error in the underlying APIs. For example, all other other clients of PrecompiledPreamble are still broken.

lib/Frontend/ASTUnit.cpp
1262 ↗(On Diff #114228)

It seems that having only this chunk would fix your issue.
Everything else is just a non-functional refactoring, maybe let's focus on that part (and tests) in this review and send the rest as a separate change?
To keep refactoring and functional changes logically separate.

erikjv added a subscriber: erikjv.Sep 8 2017, 5:30 AM

Will this fix PR25023 and PR21144?

Will this fix PR25023 and PR21144?

PR25023 should be fixed by this change. It is essentially a repro of the same bug.
Could we add a c-index-test-based test here to make sure we addressed that particular use-case?

The state of PR21144 won't be affected, as this change does not touch the code invoked during normal compilation without preambles.
If PR21144 is fixed in a way that would make SourceLocations the same regardless if BOM was present or not, we might have a better guarantee that nothing will break in case we want to reuse preamble between BOM/non-BOM versions.

It seems there's other users of PrecompiledPreamble that would have to be fixed, yes. If we go with my original fix of taking into account the BOM in the preamble bounds, there's no way of reusing the PCH when the BOM appears/disappears. I still maintain this is a common use case for IDE-type clients. This type of performance bug is very hard to track down.

@erikjv: Yes, I think this will fix PR25023.
PR21144 is unrelated; clang uses UTF-8 byte offsets instead of logical-character offsets for column numbers, which makes sense to me.

lib/Frontend/ASTUnit.cpp
1262 ↗(On Diff #114228)

Yes, with this form of the fix, the other changes are mostly cosmetic. I could simply revert them, it's not worth the hassle of submitting another patch.

cameron314 edited the summary of this revision. (Show Details)

Alright, I've changed the patch so that the preamble takes into account the BOM presence and is invalidated when it changes. This automatically fixes all clients of PrecompiledPreamble, and ensures that all SourceLocations are always consistent when using a PCH generated from the preamble.

I think this should do the trick!

ilya-biryukov accepted this revision.Sep 14 2017, 7:42 AM

See my comments about removing StartOffset field, but other than that looks good.

include/clang/Lex/Lexer.h
52 ↗(On Diff #115104)

We could simplify it further by removing StartOffset, leaving only Size.
If you look at the code, it always uses StartOffset + Size now, which is effectively size with BOM.
What do you think?

lib/Frontend/PrecompiledPreamble.cpp
227 ↗(On Diff #115104)

Maybe store BOM bytes in PreambleBytes too? Would that allow to get rid of StartOffset field (see other comment)?

This revision is now accepted and ready to land.Sep 14 2017, 7:42 AM
cameron314 added inline comments.Sep 14 2017, 9:24 AM
include/clang/Lex/Lexer.h
52 ↗(On Diff #115104)

Yeah, I thought of that, but it's still nice to have the two separated.

ASTUnit.cpp, for example, checks if Size != 0 to determine if there's a preamble (without considering the offset). This isn't in the diff because it was already like that.

ilya-biryukov added inline comments.Sep 14 2017, 9:40 AM
include/clang/Lex/Lexer.h
52 ↗(On Diff #115104)

Could we simply return Size = 0 from ComputePreambleBounds if we simply skipped BOM and the preamble itself is empty?
My concern is that currently it's very easy to forget adding StartOffset and simply use Size when writing code that uses PreambleBounds. If we only have Size, probability of mistakes is much lower.

cameron314 added inline comments.Sep 14 2017, 12:40 PM
include/clang/Lex/Lexer.h
52 ↗(On Diff #115104)

Alright, sold. There's already other places that use the size without checking the offset, it turns out.

Final diff. Test passes!

This revision was automatically updated to reflect the committed changes.