When building huge code bases, it's not uncommon to see FoldingSet entries like the following in perf reports:
  1.56%  0.47%  clang    clang-14       [.] llvm::FoldingSetBase::FindNodeOrInsertPos
  0.30%  0.01%  clang    clang-14       [.] llvm::ContextualFoldingSet<clang::FunctionProtoType, clang::ASTContext&>::NodeEquals
  0.25%  0.02%  clang    clang-14       [.] llvm::FoldingSetBase::InsertNode
  0.23%  0.12%  clang    clang-14       [.] llvm::FoldingSetBase::GrowBucketCount
  0.22%  0.21%  clang    clang-14       [.] llvm::FoldingSetNodeID::AddPointer
  0.47%  0.06%  clang    clang-14       [.] llvm::FoldingSetBase::InsertNode

or

  1.12%  0.75%  clang++  libLLVM-13.so  [.] llvm::FoldingSetBase::GrowBucketCount
  0.49%  0.48%  clang++  libLLVM-13.so  [.] llvm::FoldingSetNodeID::AddPointer
  0.41%  0.09%  clang++  libLLVM-13.so  [.] llvm::FoldingSetNodeID::operator==

etc.
Among the many FoldingSet users, the most notable seem to be ASTContext and CodeGenTypes.
The reasons we spend a non-trivial amount of time in FoldingSet calls from there are the following:
- The default FoldingSet capacity of 2^6 items is very often not enough. For PointerTypes/ElaboratedTypes/ParenTypes it is common to see the set grow to 256 or 512 items; FunctionProtoTypes can easily exceed a 1k-item capacity, growing to 4k or even 8k.
- The cost of FoldingSetBase::GrowBucketCount itself is not very bad (pure reallocations are rather cheap thanks to BumpPtrAllocator). What matters is the high collision rate when a lot of items end up in the same bucket, slowing down FoldingSetBase::FindNodeOrInsertPos and thrashing the CPU cache (items with the same hash are organized in an intrusive linked list which needs to be traversed).
- Lack of AddInteger/AddPointer and ComputeHash inlining slows down NodeEquals/Profile/operator== calls. Inlining makes the FunctionProtoType/PointerType/ElaboratedType/ParenType Profile functions faster, but since NodeEquals is still called indirectly through a function pointer from FindNodeOrInsertPos, there is room for further inlining improvements (see the sketch after this list).
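The sketch below (using a hypothetical KeyNode type, not from the patch) shows the two knobs discussed above against the in-tree llvm/ADT/FoldingSet.h API: raising the initial bucket count via the Log2InitSize constructor argument, and keeping the Profile/AddPointer/AddInteger path visible to the optimizer:

```cpp
#include "llvm/ADT/FoldingSet.h"

// Hypothetical node type used only for illustration.
struct KeyNode : llvm::FoldingSetNode {
  const void *Ptr;
  unsigned Kind;
  KeyNode(const void *P, unsigned K) : Ptr(P), Kind(K) {}

  // A trivial Profile; keeping it in a header lets AddPointer/AddInteger
  // be inlined at the call site.
  void Profile(llvm::FoldingSetNodeID &ID) const {
    ID.AddPointer(Ptr);
    ID.AddInteger(Kind);
  }
};

void example(const void *P) {
  // Log2InitSize = 9 -> 512 buckets up front instead of the default 2^6 = 64,
  // avoiding several GrowBucketCount rehashes for hot sets.
  llvm::FoldingSet<KeyNode> Set(/*Log2InitSize=*/9);

  llvm::FoldingSetNodeID ID;
  ID.AddPointer(P);
  ID.AddInteger(42u);

  void *InsertPos = nullptr;
  if (KeyNode *Existing = Set.FindNodeOrInsertPos(ID, InsertPos))
    (void)Existing;                              // reuse the canonical node
  else
    Set.InsertNode(new KeyNode(P, 42u), InsertPos);
}
```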
After addressing the above issues, I built Linux (with the default config) on isolated CPU cores in a quiet x86-64 Linux environment.
The compile-time statistics diff produced by perf before and after the change is as follows:
instructions: -0.4%, cycles: -0.9%
The text-size change of the resulting Clang binary is below +0.1%.
As in https://reviews.llvm.org/D118169, the speedup from this patch is expected to be less significant for code bases containing smaller translation units.
It's probably good to give that value a meaningful (and constexpr) variable name, as it's used in several places.
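One possible shape of such a named constant is sketched below; the identifier and its location are assumptions for illustration, not part of the patch:

```cpp
namespace llvm {
// Default number of buckets a FoldingSet starts with: 2^6 = 64.
constexpr unsigned DefaultFoldingSetLog2InitSize = 6;
} // namespace llvm
```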