This is an archive of the discontinued LLVM Phabricator instance.

[clang][AST] Support AST files larger than 512M
ClosedPublic

Authored by DmitryPolukhin on Mar 23 2020, 2:00 AM.

Details

Summary

Clang uses 32-bit integers for storing bit offsets from the beginning of
the file that results in 512M limit on AST file. This diff replaces
absolute offsets with relative offsets from the beginning of
corresponding data structure when it is possible. And uses 64-bit
offsets for DeclOffests and TypeOffssts because these coder AST
section may easily exceeds 512M alone.

This diff breaks AST file format compatibility so VERSION_MAJOR bumped.

Test Plan:
Existing clang AST serialization tests
Tested on clangd with ~700M and ~900M preamble files

Diff Detail

Event Timeline

DmitryPolukhin created this revision.Mar 23 2020, 2:00 AM

clang-format for ASTBitCodes.h

This looks reasonable to me, though I'm not familiar enough with the code to do a good review.

Just wanted to confirm my understanding: the problematic offsets that are now section-relative instead of file-relative are only used in two sections of the file (macro directives and sloc entries). These sections seem likely to be small relative to others that hold e.g. the AST. So this should raise the effective size limit by a lot, more likely 100x than 5?

(We've coincidentally started to see large crashing preambles this week, so thanks for this! I think we probably don't need to add any detection/recovery if this limit is high enough)

This looks reasonable to me, though I'm not familiar enough with the code to do a good review.

Just wanted to confirm my understanding: the problematic offsets that are now section-relative instead of file-relative are only used in two sections of the file (macro directives and sloc entries). These sections seem likely to be small relative to others that hold e.g. the AST. So this should raise the effective size limit by a lot, more likely 100x than 5?

Yes, problematic offsets are now section relative and the sections several times smaller than the whole file. In suspect in some artificial cases these section can be pretty large. For example, you can create source file with only macro and you can exceed limit for on macro size alone. Moreover I'm not sure that files bigger than 4G can be processed correctly. Because there are other places where 32 bit numbers are used for some IDs or byte offsets.

(We've coincidentally started to see large crashing preambles this week, so thanks for this! I think we probably don't need to add any detection/recovery if this limit is high enough)

I added asserts for all new places where overflow can happen but perhaps we also need something for release builds. If you have cases when you reach size limit, it would be very helpful if you can try this patch on these cases. I tested it on couple ~700M files but my examples relatively unified and follow the same pattern.

Use 64-bit offsets for TypesOffsests and DeclOffsests

DmitryPolukhin retitled this revision from [clang][AST] User relative offsets inside AST file to [clang][AST] Support AST files larger than 512M.Mar 28 2020, 7:57 AM
DmitryPolukhin edited the summary of this revision. (Show Details)
xbolva00 added inline comments.
clang/lib/Serialization/ASTReader.cpp
3102

use C++ cast style

Resolve merge conflict, revert clang-format for ASTBitCodes.h to avoid further conflicts, use C++ cast

DmitryPolukhin marked an inline comment as done.Mar 28 2020, 11:42 AM

Rebase, all tests passes locally

sammccall added inline comments.Mar 30 2020, 4:43 AM
clang/include/clang/Serialization/ASTBitCodes.h
220–221

Is there one of these for every decl in the module? It seems like we're probably giving up a good fraction of the 4% increase for just using absolute 64 bit offsets everywhere :-( Is there still a significant gain from using section-relative elsewhere?

If there are indeed lots of these, giving up 4 bytes to padding (in addition to the wide offset) seems unfortunate and because we memcpy the structs into the AST file seems like a sad reason :-)
Can we align this to 4 bytes instead?
(e.g. by splitting into two fields and encapsulating the few direct accesses, though there's probably a neater way)

DmitryPolukhin marked an inline comment as done.Mar 30 2020, 12:19 PM
DmitryPolukhin added inline comments.
clang/include/clang/Serialization/ASTBitCodes.h
220–221

Yes, it seems that all referenced decls should have an entry in this table. 4% increase was counted on previous diff before I discovered DeclOffsets and TypeOffsets. This diff uses relative offsets for previous cases and increase is only 2 additional 64-bit integers per file. DeclOffsets are about 4.7% and TypeOffsets are about 2.5% of the whole file. With this diff both number will be 2x, splitting DeclOffset in two separate arrays could make the increase only 1.5x for DeclOffsets. If approach from this diff looks fine, I can split array of DeclOffset into 2 separate arrays. It could be a bit less efficient because of worse memory locality but will save some space.

@rsmith, @dexonsmith, @jdoerfert could you please take a look to this diff?
If you think that there are other reviewers who might have more context AST persistence, please add them.

dexonsmith resigned from this revision.Apr 2 2020, 1:16 PM
dexonsmith added a reviewer: Bigcheese.
dexonsmith added a subscriber: Bigcheese.

@rsmith, @dexonsmith, @jdoerfert could you please take a look to this diff?
If you think that there are other reviewers who might have more context AST persistence, please add them.

@Bigcheese, can you take a look?

hokein added a subscriber: hokein.Apr 7 2020, 6:10 AM

I'm not familiar with the code, but this patch does fix the large-preamble crashes we encountered internally.

sammccall accepted this revision.Apr 14 2020, 4:57 AM

We're starting to see quite a lot of production crashes with this and would very much like a fix.
At this point it seems like we're best moving ahead and if someone comes up with a better idea, we can improve it later.

clang/include/clang/Serialization/ASTBitCodes.h
220–221

Or even just 32-bit BitOffsetLow and BitOffsetHigh fields so the overall struct is 32-bit aligned... then you'd still be local?

This revision is now accepted and ready to land.Apr 14 2020, 4:57 AM

Split BitOffset in DeclOffset in high/low parts; rebase

Split BitOffset in DeclOffset in high/low parts; rebase

DmitryPolukhin marked an inline comment as done.Apr 15 2020, 2:19 AM

@sammccall thank you for review!
I'll wait for one more day for additional feedback if any, and land this diff.

This revision was automatically updated to reflect the committed changes.

I am not sure, but maybe this patch causes an undefined behavior?
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-fast/builds/40472/steps/check-clang%20ubsan/logs/stdio

/b/sanitizer-x86_64-linux-fast/build/llvm-project/clang/lib/Serialization/ASTReader.cpp:6327:28: runtime error: load of misaligned address 0x7feac40a55bc for type 'const uint64_t' (aka 'const unsigned long'), which requires 8 byte alignment
0x7feac40a55bc: note: pointer points here
  00 00 00 00 a3 c7 01 00  00 00 00 00 0c c8 01 00  00 00 00 00 29 c8 01 00  00 00 00 00 7a cc 01 00
              ^ 
    #0 0x3be2fe4 in clang::ASTReader::TypeCursorForIndex(unsigned int) /b/sanitizer-x86_64-linux-fast/build/llvm-project/clang/lib/Serialization/ASTReader.cpp:6327:28
    #1 0x3be30a0 in clang::ASTReader::readTypeRecord(unsigned int) /b/sanitizer-x86_64-linux-fast/build/llvm-project/clang/lib/Serialization/ASTReader.cpp:6348:24
    #2 0x3bd3d4a in clang::ASTReader::GetType(unsigned int) /b/sanitizer-x86_64-linux-fast/build/llvm-project/clang/lib/Serialization/ASTReader.cpp:6985:26
    ...

(I am in the blamelist too of that build, so that's why I am sniffing.)

I am not sure, but maybe this patch causes an undefined behavior?
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-fast/builds/40472/steps/check-clang%20ubsan/logs/stdio

/b/sanitizer-x86_64-linux-fast/build/llvm-project/clang/lib/Serialization/ASTReader.cpp:6327:28: runtime error: load of misaligned address 0x7feac40a55bc for type 'const uint64_t' (aka 'const unsigned long'), which requires 8 byte alignment
0x7feac40a55bc: note: pointer points here
  00 00 00 00 a3 c7 01 00  00 00 00 00 0c c8 01 00  00 00 00 00 29 c8 01 00  00 00 00 00 7a cc 01 00
              ^ 
    #0 0x3be2fe4 in clang::ASTReader::TypeCursorForIndex(unsigned int) /b/sanitizer-x86_64-linux-fast/build/llvm-project/clang/lib/Serialization/ASTReader.cpp:6327:28
    #1 0x3be30a0 in clang::ASTReader::readTypeRecord(unsigned int) /b/sanitizer-x86_64-linux-fast/build/llvm-project/clang/lib/Serialization/ASTReader.cpp:6348:24
    #2 0x3bd3d4a in clang::ASTReader::GetType(unsigned int) /b/sanitizer-x86_64-linux-fast/build/llvm-project/clang/lib/Serialization/ASTReader.cpp:6985:26
    ...

(I am in the blamelist too of that build, so that's why I am sniffing.)

Temporary reverted in a8f85da9f538a400dfea00e4954e403bf5f3269c

DmitryPolukhin reopened this revision.Apr 16 2020, 10:29 AM
This revision is now accepted and ready to land.Apr 16 2020, 10:29 AM

Split TypeOffsets to low/high parts too

sammccall accepted this revision.Apr 16 2020, 1:42 PM

A little sad it's not possible to just 64-bit align the typeoffsets array, but 32 seems to be the magic number.

clang/include/clang/Serialization/ASTBitCodes.h
225

may be worth mentioning that blobs in bitstream are specifically 32-bit aligned as part of the motivation here.

228

consider making this a class and these members private

256

Alternatively you could factor this as a class UnderalignedInt64, use that for type-offsets, and embed it in DeclOffset...

DmitryPolukhin marked 3 inline comments as done.

Comments resolved

clang/include/clang/Serialization/ASTBitCodes.h
228

I kept members public because they public for all other similar struct in the file. I think initial idea to keep them public was to have POD struct but it is actually not the case due to user defined constructor.

This revision was automatically updated to reflect the committed changes.
JackKemp99 added inline comments.
clang/lib/Serialization/ASTWriter.cpp
1908

clangd crashes here:

DmitryPolukhin added inline comments.Feb 10 2021, 9:26 AM
clang/lib/Serialization/ASTWriter.cpp
1908

@JackKemp99 It means that you have a source file that has number of source location that exceeds 32-bit offset. Do you have a reproducer that you can share?

Is it something Boost related? There is a discussion about too many source location with Boost https://lists.llvm.org/pipermail/cfe-dev/2021-February/067599.html Your example may not exceed amount in of SourceLocation IDs but exceeds lower bound on offsets during preamble generation.