This is an archive of the discontinued LLVM Phabricator instance.

[clang][AST] Improve AST Reader/Writer memory footprint
ClosedPublic

Authored by weiwang on May 3 2021, 4:02 PM.

Details

Summary

Reduce memory footprint of AST Reader/Writer:

  1. Adjust internal data containers' element type.
  2. Switch to set for deduplication of deferred diags.

Diff Detail

Event Timeline

weiwang created this revision.May 3 2021, 4:02 PM
weiwang requested review of this revision.May 3 2021, 4:02 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2021, 4:03 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
weiwang added a comment.EditedMay 3 2021, 4:11 PM

We've seen a huge memory footprint from AST Reader/Writer in a single CU with module enabled from internal workloads. Upon further analysis, the content of vector DeclsToCheckForDeferredDiags seems mostly redundant. In one case, 1,734,387,685 out of 1,734,404,000 elements are the same. While this may indicate something wrong with the source itself, it also suggests that compiler would be better to perform deduplication on this type of Decl ID.

This change also tries to restrict the data length of several other vectors which only deal with 32-bit unsigned int.

yaxunl added a comment.May 3 2021, 7:16 PM

Decls in Sema::DeclsToCheckForDeferredDiags is supposed to be unique. Therefore the fact that '1,734,387,685 out of 1,734,404,000 elements are the same' is surprising. Did this happen when you compile the source code and write AST? What language was the source code? C++, OpenMP, or CUDA? What was the decl that got duplicated? Thanks.

Decls in Sema::DeclsToCheckForDeferredDiags is supposed to be unique. Therefore the fact that '1,734,387,685 out of 1,734,404,000 elements are the same' is surprising. Did this happen when you compile the source code and write AST? What language was the source code? C++, OpenMP, or CUDA? What was the decl that got duplicated? Thanks.

Sorry for the late reply.

This happens during a single compilation instance, and emits a module file from a big list of module map files and headers. I believe the code is pure C++. Because of the huge amount of duplications, the memory RSS shoots to over 50GB. I'll update once having more information.

Finally dealt with the other issues I need to take care. Let's resume the discussion.

I printed out the decls that are duplicated. The list of very long, but the top ones all seem from glibc headers. For example (the number after 'c:' is the duplication count),

static inline __uint16_t __bswap_16(__uint16_t __bsx) {
    return ((__uint16_t)((((__bsx) >> 8) & 255) | (((__bsx) & 255) << 8)));
}
c:8581627
static inline __uint16_t __bswap_16(__uint16_t __bsx)
c:8581627
static inline __uint32_t __bswap_32(__uint32_t __bsx) {
    return ((((__bsx) & 4278190080U) >> 24) | (((__bsx) & 16711680U) >> 8) | (((__bsx) & 65280U) << 8) | (((__bsx) & 255U) << 24));
}
c:8581627
static inline __uint32_t __bswap_32(__uint32_t __bsx)
c:8581627
static inline __uint64_t __bswap_64(__uint64_t __bsx) {
    return ((((__bsx) & 18374686479671623680ULL) >> 56) | (((__bsx) & 71776119061217280ULL) >> 40) | (((__bsx) & 280375465082880ULL) >> 24) |   (((__bsx) & 1095216660480ULL) >> 8) | (((__bsx) & 4278190080ULL) << 8) | (((__bsx) & 16711680ULL) << 24) | (((__bsx) & 65280ULL) << 40) |       (((__bsx) & 255ULL) << 56));
}
c:8581627
static inline __uint64_t __bswap_64(__uint64_t __bsx)
c:8581627
static inline __uint16_t __uint16_identity(__uint16_t __x) {
    return __x;
}
c:8581627
static inline __uint32_t __uint32_identity(__uint32_t __x) {
    return __x;
}
c:8581627
static inline __uint64_t __uint64_identity(__uint64_t __x) {
    return __x;
}
c:8581627
inline int iscanonical(float __val) {
    return ((void)(typeof (__val))(__val) , 1);
}
c:8581627
inline int iscanonical(double __val) {
    return ((void)(typeof (__val))(__val) , 1);
}
c:8581627
inline int iscanonical(long double __val) {
    return __iscanonicall(__val);
}
c:8581627
inline int issignaling(float __val) {
    return __issignalingf(__val);
}
c:8581627
inline int issignaling(double __val) {
    return __issignaling(__val);
}
c:8581627
inline int issignaling(long double __val) {
    return __issignalingl(__val);
}
c:8581627
static int __call(float __x, float __y) throw() {
    return __iseqsigf(__x, __y);
}
c:8581627
static int __call(double __x, double __y) throw() {
    return __iseqsig(__x, __y);
}
c:8581627
static int __call(long double __x, long double __y) throw() {
    return __iseqsigl(__x, __y);
}
c:8581627

I think the source codebase is a mix of conventional headers and modules, I don't know if that makes any difference.

I think the root cause might be duplicated decls are added to Sema::DeclsToCheckForDeferredDiags defined in

https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Sema/Sema.h#L1789

When compiling source codes, a decl is added only once. However if modules are imported, duplicate decls may be added.

We need to avoid adding duplicate decls to Sema::DeclsToCheckForDeferredDiags. However we cannot simply change it to a set since the order is important, otherwise the error message for later code may show up earlier, causing confusion for users. I would suggest to change its type to SetVector, which keeps the order and also avoids duplicates.

weiwang added a comment.EditedMay 17 2021, 3:51 PM

I think the root cause might be duplicated decls are added to Sema::DeclsToCheckForDeferredDiags defined in

https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Sema/Sema.h#L1789

When compiling source codes, a decl is added only once. However if modules are imported, duplicate decls may be added.

We need to avoid adding duplicate decls to Sema::DeclsToCheckForDeferredDiags. However we cannot simply change it to a set since the order is important, otherwise the error message for later code may show up earlier, causing confusion for users. I would suggest to change its type to SetVector, which keeps the order and also avoids duplicates.

Thanks for the suggestion! It does make more sense to use SetVector here. I will try this and report back.

Also I think such decls are first read from AST files with ASTReader::ReadAST and filled into ASTReader::DeclsToCheckForDeferredDiags (a SmallVector), then into Sema::DeclsToCheckForDeferredDiags as external source in Sema::emitDeferredDiags. I Wonder if we set ASTReader::DeclsToCheckForDeferredDiags to be llvm::SmallSetVector as this diff currently does, would it eliminate duplication all along the way?

Tried to make Sema::DeclsToCheckForDeferredDiags llvm::SmallSetVector. The heap RSS did drop significantly (from peak 100GB to 59GB) , but not as good as the current fix (peak 26GB), which makes ASTReader::DeclsToCheckForDeferredDiags llvm::SmallSetVector.

I think the reason is that the duplicated decls are read from multiple module file sources (ASTReader::ReadAST() -> ASTReader::ReadASTBlock()), then stored into ASTReader::DeclsToCheckForDeferredDiags, then goes into Sema::DeclsToCheckForDeferredDiags in ASTReader::ReadDeclsToCheckForDeferredDiags(). Doing dedup at the early stage when the decls were just read in ASTReader is more effective at reducing RSS.

Tried to make Sema::DeclsToCheckForDeferredDiags llvm::SmallSetVector. The heap RSS did drop significantly (from peak 100GB to 59GB) , but not as good as the current fix (peak 26GB), which makes ASTReader::DeclsToCheckForDeferredDiags llvm::SmallSetVector.

I think the reason is that the duplicated decls are read from multiple module file sources (ASTReader::ReadAST() -> ASTReader::ReadASTBlock()), then stored into ASTReader::DeclsToCheckForDeferredDiags, then goes into Sema::DeclsToCheckForDeferredDiags in ASTReader::ReadDeclsToCheckForDeferredDiags(). Doing dedup at the early stage when the decls were just read in ASTReader is more effective at reducing RSS.

What if you use SmallSetVector for both Sema::DeclsToCheckForDeferredDiags and ASTReader::DeclsToCheckForDeferredDiags? Does it cause extra memory usage compared to using it only for ASTReader::DeclsToCheckForDeferredDiags? Thanks.

weiwang added a comment.EditedMay 20 2021, 1:25 PM

Tried to make Sema::DeclsToCheckForDeferredDiags llvm::SmallSetVector. The heap RSS did drop significantly (from peak 100GB to 59GB) , but not as good as the current fix (peak 26GB), which makes ASTReader::DeclsToCheckForDeferredDiags llvm::SmallSetVector.

I think the reason is that the duplicated decls are read from multiple module file sources (ASTReader::ReadAST() -> ASTReader::ReadASTBlock()), then stored into ASTReader::DeclsToCheckForDeferredDiags, then goes into Sema::DeclsToCheckForDeferredDiags in ASTReader::ReadDeclsToCheckForDeferredDiags(). Doing dedup at the early stage when the decls were just read in ASTReader is more effective at reducing RSS.

What if you use SmallSetVector for both Sema::DeclsToCheckForDeferredDiags and ASTReader::DeclsToCheckForDeferredDiags? Does it cause extra memory usage compared to using it only for ASTReader::DeclsToCheckForDeferredDiags? Thanks.

There would be a slight increase in memory usage since SmallSetVector requires more memory than just SmallVector internally, but given the majority the RSS comes from duplicated decls, I don't think it's an issue by making both SmallSetVector. If you think it's better to change both, I'll update the diff.

Tried to make Sema::DeclsToCheckForDeferredDiags llvm::SmallSetVector. The heap RSS did drop significantly (from peak 100GB to 59GB) , but not as good as the current fix (peak 26GB), which makes ASTReader::DeclsToCheckForDeferredDiags llvm::SmallSetVector.

I think the reason is that the duplicated decls are read from multiple module file sources (ASTReader::ReadAST() -> ASTReader::ReadASTBlock()), then stored into ASTReader::DeclsToCheckForDeferredDiags, then goes into Sema::DeclsToCheckForDeferredDiags in ASTReader::ReadDeclsToCheckForDeferredDiags(). Doing dedup at the early stage when the decls were just read in ASTReader is more effective at reducing RSS.

What if you use SmallSetVector for both Sema::DeclsToCheckForDeferredDiags and ASTReader::DeclsToCheckForDeferredDiags? Does it cause extra memory usage compared to using it only for ASTReader::DeclsToCheckForDeferredDiags? Thanks.

There would be a slight increase in memory usage since SmallSetVector requires more memory than just SmallVector internally, but given the majority the RSS comes from duplicated decls, I don't think it's an issue by making both SmallSetVector. If you think it's better to change both, I'll update the diff.

Let's use SmallSetVector for both. Thanks.

weiwang updated this revision to Diff 346866.May 20 2021, 3:00 PM

make both ASTReader::DeclsToCheckForDeferredDiags and Sema::DeclsToCheckForDeferredDiags SmallSetVector

yaxunl accepted this revision.May 20 2021, 3:16 PM

LGTM. Thanks.

This revision is now accepted and ready to land.May 20 2021, 3:16 PM
This revision was landed with ongoing or failed builds.May 20 2021, 3:35 PM
This revision was automatically updated to reflect the committed changes.

Thanks for the approval!

Just want to understand the list of "decls to check for deferred diagnostics" better, where are these decls coming from? And why do they need to be checked for warnings? I see decls from libc are in the list, but I have no idea why are they selected.

Thanks for the approval!

Just want to understand the list of "decls to check for deferred diagnostics" better, where are these decls coming from? And why do they need to be checked for warnings? I see decls from libc are in the list, but I have no idea why are they selected.

For offloading languages e.g. OpenMP/CUDA/HIP, there are apparent errors in functions shared between host and device. However, unless these functions are sure to be emitted on device or host, these errors should not be emitted. These errors are so called deferred error messages. The function decls which need to be checked are recorded. After AST is finalized, the AST of these functions are iterated. If a function is found sure to be emitted, the deferred error message in it are emitted.

Thanks for the approval!

Just want to understand the list of "decls to check for deferred diagnostics" better, where are these decls coming from? And why do they need to be checked for warnings? I see decls from libc are in the list, but I have no idea why are they selected.

For offloading languages e.g. OpenMP/CUDA/HIP, there are apparent errors in functions shared between host and device. However, unless these functions are sure to be emitted on device or host, these errors should not be emitted. These errors are so called deferred error messages. The function decls which need to be checked are recorded. After AST is finalized, the AST of these functions are iterated. If a function is found sure to be emitted, the deferred error message in it are emitted.

Thanks! So the DeclsToCheckForDeferredDiags contains the candidate decls to be checked. The decls are selected because they would generate diags in the context of offloading languages, but whether or not an diag will be emitted is deferred till traversal of the AST is performed. I still don't quite understand why would libc functions be in the candidate list? They look simple enough (and I think they've been there forever). For example __uint16_identity (https://code.woboq.org/userspace/glibc/bits/uintn-identity.h.html#32),

static inline __uint16_t __uint16_identity(__uint16_t __x) {
    return __x;
}

I don't see how this function would generate diag either on host or device.

Thanks for the approval!

Just want to understand the list of "decls to check for deferred diagnostics" better, where are these decls coming from? And why do they need to be checked for warnings? I see decls from libc are in the list, but I have no idea why are they selected.

For offloading languages e.g. OpenMP/CUDA/HIP, there are apparent errors in functions shared between host and device. However, unless these functions are sure to be emitted on device or host, these errors should not be emitted. These errors are so called deferred error messages. The function decls which need to be checked are recorded. After AST is finalized, the AST of these functions are iterated. If a function is found sure to be emitted, the deferred error message in it are emitted.

Thanks! So the DeclsToCheckForDeferredDiags contains the candidate decls to be checked. The decls are selected because they would generate diags in the context of offloading languages, but whether or not an diag will be emitted is deferred till traversal of the AST is performed. I still don't quite understand why would libc functions be in the candidate list? They look simple enough (and I think they've been there forever). For example __uint16_identity (https://code.woboq.org/userspace/glibc/bits/uintn-identity.h.html#32),

static inline __uint16_t __uint16_identity(__uint16_t __x) {
    return __x;
}

I don't see how this function would generate diag either on host or device.

A function may call another function which contains deferred diagnostics.