This is an archive of the discontinued LLVM Phabricator instance.

[modules] [pch] Do not deserialize all lazy template specializations when looking for one.
Needs Review · Public

Authored by v.g.vassilev on Dec 19 2017, 2:55 PM.

Details

Summary

Currently, we load all lazy template specializations when we search for a suitable specialization of a template. This is especially suboptimal with modules. If module B specializes a template from module A, the ASTReader only reads the specialization's DeclID; this is observed to be especially pathological when module B is the STL. However, the template instantiator (almost immediately afterwards) calls findSpecialization, which in turn loads all lazy specializations to give an answer.

This patch teaches findSpecialization to work with lazy specializations without having to deserialize their full content. Along with the DeclID it provides a cross-TU stable ODRHash of the template arguments, which is enough to decide whether we already have a suitable specialization and exactly which decls to deserialize (we load all decls with the same hash to avoid potential collisions).
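To make the idea concrete, here is a minimal sketch of the lookup with invented names (the actual patch works on the ASTReader and the template decl classes): each lazy specialization is kept as a pair of the ODR hash of its template arguments and its DeclID, and only entries whose hash matches the query are deserialized.

#include <cstdint>
#include <vector>

using DeclID = uint32_t;

// Hypothetical record kept per not-yet-deserialized specialization.
struct LazySpecialization {
  uint32_t ArgsODRHash; // cross-TU stable hash of the template arguments
  DeclID ID;            // deserialized only if the hash matches a query
};

// Return the DeclIDs worth deserializing for a query hash; hash collisions
// are resolved after the matching decls have been loaded.
std::vector<DeclID> candidatesFor(uint32_t QueryHash,
                                  const std::vector<LazySpecialization> &Lazy) {
  std::vector<DeclID> Result;
  for (const LazySpecialization &LS : Lazy)
    if (LS.ArgsODRHash == QueryHash)
      Result.push_back(LS.ID);
  return Result;
}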

While this makes finding a template specialization more memory-efficient, we are far from done. There are still a few places which trigger eager deserialization of template specializations: the places where we require completion of the redeclaration chain.

Diff Detail

Event Timeline

v.g.vassilev created this revision.Dec 19 2017, 2:55 PM
v.g.vassilev added a reviewer: rtrieu.

Fixed a comment typo.

I think you should track lazy partial specializations separately from lazy full specializations -- we need to load all the partial specializations when doing partial specialization selection, and don't want to load all full specializations at the same time.

lib/Serialization/ASTReaderDecl.cpp
92 ↗(On Diff #127606)

Move these subexpressions to separate statements; the two "read" calls are not sequenced here.
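For illustration, a self-contained sketch of the issue (toy reader type, not the actual ASTReaderDecl.cpp code): when both values are read as arguments of one call, their evaluation order is unspecified and the pair can come out swapped; separate statements pin the order.

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Toy reader: each read() consumes the next value from the stream.
struct Reader {
  std::vector<uint32_t> Data;
  std::size_t Pos = 0;
  uint32_t read() { return Data[Pos++]; }
};

std::pair<uint32_t, uint32_t> readHashAndDeclID(Reader &R) {
  // Bad: both reads are arguments of one call, so their evaluation order is
  // unspecified and the pair may be built in either order:
  //   return std::make_pair(R.read(), R.read());

  // Good: separate statements force the intended read order.
  uint32_t Hash = R.read();
  uint32_t ID = R.read();
  return {Hash, ID};
}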

3972 ↗(On Diff #127606)

Likewise here.

v.g.vassilev marked 2 inline comments as done.

Address inline comments: order the read operations.

Teach ASTReader::CompleteRedeclChain to load only the template specializations with the same template argument list.

lib/Serialization/ASTReaderDecl.cpp
92 ↗(On Diff #127606)

Good catch!

rsmith added a comment.Jan 2 2018, 2:08 PM

This is looking like a really good start; we'll still want to do something about partial specializations, but this should already help a lot.

Can you produce some sample performance numbers for this change? This is replacing a linear-time algorithm (with a high constant) with a quadratic-time algorithm (with a lower constant), and it'd be useful to confirm that's not going to be a pressing problem in the short term. (In the medium term, we should move to using an on-disk hash table for template specialization lookup, at least for templates with large numbers of specializations.)

include/clang/AST/DeclTemplate.h
762–765 ↗(On Diff #128465)

No space between operator and (, please.

lib/AST/DeclTemplate.cpp
209 ↗(On Diff #128465)

Can we remove the loaded specializations from the list / zero them out, so we know we've already asked the ExternalASTSource for them?

235–242 ↗(On Diff #128465)

I'm a little concerned about this: a debug build could perform more deserialization than an optimized build here. It's probably not a big deal, but it could lead to some surprising debugging experiences down the line (where a debug build works but an optimized build misbehaves).

Perhaps we could assert that there are no lazy specializations with the relevant hash in our list before performing this lookup? (This'd require that we remove them as we load them.)
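A hypothetical shape of that assertion (types and names invented for the sketch): once the entries for a hash have been deserialized and removed from the lazy list, a later lookup for the same hash should find nothing left to load.

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct LazySpecialization {
  uint32_t ArgsODRHash;
  uint32_t DeclID;
};

// Called right before an in-memory lookup for QueryHash: by then every lazy
// entry with that hash should already have been loaded and dropped.
void assertNoneLeft(uint32_t QueryHash,
                    const std::vector<LazySpecialization> &Lazy) {
  assert(std::none_of(Lazy.begin(), Lazy.end(),
                      [&](const LazySpecialization &LS) {
                        return LS.ArgsODRHash == QueryHash;
                      }) &&
         "lazy specializations with this hash should have been loaded");
  (void)QueryHash;
  (void)Lazy;
}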

lib/Serialization/ASTReader.cpp
7009–7010 ↗(On Diff #128465)

} and else on same line, please.

v.g.vassilev marked 4 inline comments as done.

Address comments:

  • Fix style;
  • Do not potentially deserialize a specialization in debug mode.

Loading of lazy partial template specializations should not trigger loading of full template specializations.

Often during the selection process we need to load all partial template specializations. It is very rare to need to do so for full template specializations.

I have created a simple(-ish) benchmark targeted at stress-testing modules with a good number of template specializations. I use the mp11 library from Boost. I am building two modules which contain around 1K specializations (mostly full specializations). The example can be found here. I use a small fraction of them. This patch deserializes (only) 1117 specializations in around 0.2 seconds, whereas without it clang deserializes 1905 specializations in 0.4 s.

I believe this can be improved further and I am investigating, but that should not be a blocker for this patch.

Zero IsPartial and improve comments in assert and free text.

Reverse template specialization order.

Reduce hash collisions for template specializations.

Currently, when we hash a tag type, the visitor calls ODRHash::AddDecl, which mostly relies on the decl name to give a distinct hash value. The types coming from template specializations have very similar properties (including decl names); for those we need to provide more information in order to disambiguate them.

This patch adds the template arguments of the template specialization decl corresponding to its type. This reduces the number of deserializations further, from 1117 down to 451.
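A rough sketch of the disambiguation idea using generic string hashing (illustrative only; the real ODRHash walks AST nodes): folding the template arguments into the hash keeps specializations that share a decl name, such as vector<int> and vector<double>, from colliding.

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

static std::size_t combine(std::size_t Seed, std::size_t Value) {
  return Seed ^ (Value + 0x9e3779b9 + (Seed << 6) + (Seed >> 2));
}

// Hash a specialization from its template name plus its arguments, so that
// specializations sharing a decl name still get distinct hash values.
std::size_t hashSpecialization(const std::string &TemplateName,
                               const std::vector<std::string> &Args) {
  std::size_t H = std::hash<std::string>{}(TemplateName);
  for (const std::string &Arg : Args)
    H = combine(H, std::hash<std::string>{}(Arg));
  return H;
}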

Reduce the deserializations further, from 451 to 449, by providing a more complete implementation of ODRHash::AddTemplateArgument.

Kudos @rtrieu!

Finer grained clang stats can be seen here.

rsmith added a comment.Jan 5 2018, 3:55 PM

I just tried this on one of our larger builds. The good news is that it appears to work: the compilation still succeeds, and significantly fewer nodes are deserialized. The bad news is that the final .cc file compilation got slower, from 42s to 86s. (There's a lot of noise here, and the builds might have been on vastly different machines, so take this with a pinch of salt for now.) Sample stats:

  • "types read" is down from 30% to 17%
  • "declarations read" is down from 34% to 23%
  • number of ClassTemplateSpecializations read has decreased by 30%, number of CXXRecordDecls read is down 25%
  • total ASTContext memory usage is down by 12%

Add sanity check.

We assert that the ODR hash of the template arguments is the same as the one from the actually found template specialization.

This way we managed to catch a few collisions in the ODRHash logic.

bruno resigned from this revision.Nov 9 2020, 12:32 PM
tapaswenipathak commandeered this revision.Jun 25 2022, 11:30 AM
tapaswenipathak added a reviewer: v.g.vassilev.
Herald added a project: Restricted Project. Jun 25 2022, 11:30 AM
v.g.vassilev commandeered this revision.Feb 6 2023, 11:42 AM
v.g.vassilev edited reviewers, added: tapaswenipathak; removed: v.g.vassilev.
v.g.vassilev edited reviewers, added: dblaikie; removed: tapaswenipathak.Mar 22 2023, 5:48 AM
v.g.vassilev added a subscriber: Restricted Project.

Add me as a reviewer to add this to my stack.

v.g.vassilev added inline comments.Apr 4 2023, 8:51 AM
clang/lib/Serialization/ASTWriterDecl.cpp
288

We should not store the lazy specialization information as an array of items because that makes the search slow. Instead we should use the OnDiskHashTable approach which we use already to store the identifier data.
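The gist of that suggestion, sketched with in-memory containers rather than the actual on-disk machinery: replace the linear scan over (hash, DeclID) records with a keyed lookup so a single query no longer touches every entry.

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using DeclID = uint32_t;

// Array layout: every query walks all N stored entries.
std::vector<DeclID>
linearLookup(uint32_t Hash,
             const std::vector<std::pair<uint32_t, DeclID>> &All) {
  std::vector<DeclID> Out;
  for (const auto &Entry : All)
    if (Entry.first == Hash)
      Out.push_back(Entry.second);
  return Out;
}

// Hash-table layout: a query touches only the bucket for its key.
using SpecTable = std::unordered_multimap<uint32_t, DeclID>;

std::vector<DeclID> tableLookup(uint32_t Hash, const SpecTable &Table) {
  std::vector<DeclID> Out;
  auto Range = Table.equal_range(Hash);
  for (auto It = Range.first; It != Range.second; ++It)
    Out.push_back(It->second);
  return Out;
}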

ChuanqiXu added inline comments.Jul 10 2023, 7:27 PM
clang/include/clang/AST/DeclTemplate.h
786–788

Maybe this can save some space.

clang/lib/AST/DeclTemplate.cpp
366

TPL here looks unused.

clang/lib/Serialization/ASTWriterDecl.cpp
288

Do you want to implement it in this patch? Or is this a note for future optimizations?

I tested this with our internal use cases with modules and it looks good. Things compile correctly and we saved 10% compilation time. (This may be due to different code bases.) But some local in-tree tests failed. I took some time to investigate them but I didn't get any insights : (

But some local in-tree tests failed. I took some time to investigate them but I didn't get any insights : (

If the failures involve template template parameters, please note that you need https://reviews.llvm.org/D153003 for that to properly hash and find in the list of lazy specializations.

clang/lib/AST/DeclTemplate.cpp
366

This is required because of the argument forwarding from findSpecializationImpl below...

SAtacker added inline comments.Jul 11 2023, 4:18 AM
clang/lib/Serialization/ASTWriterDecl.cpp
288

I tried it here https://reviews.llvm.org/D144831, but to my surprise it gives worse performance in my testing environment, which generates random specializations per module and a single main file.

shafik added a subscriber: shafik.Jul 11 2023, 8:34 AM
shafik added inline comments.
clang/lib/AST/DeclTemplate.cpp
337

/*=false*/ - is this leftover code that was meant to be removed?

559

nit

1292

Was /*=false*/ meant to be removed?

1304
Hahnfeld added inline comments.Jul 11 2023, 8:53 AM
clang/include/clang/AST/DeclTemplate.h
786–788

Can you elaborate why this would save space? The best practice is to order data members by size to avoid padding gaps. As far as I can see, this is already done here.
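For reference, a small illustration of that practice (hypothetical members, typical 64-bit layout assumed): placing the 8-byte member first removes the padding holes.

#include <cstdint>

struct Padded {   // typically 24 bytes: 4 + 4 (padding) + 8 + 4 + 4 (padding)
  uint32_t A;
  uint64_t B;
  uint32_t C;
};

struct Packed {   // typically 16 bytes: 8 + 4 + 4
  uint64_t B;
  uint32_t A;
  uint32_t C;
};

static_assert(sizeof(Packed) <= sizeof(Padded),
              "reordering by size never costs space here");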

ChuanqiXu added inline comments.Jul 11 2023, 9:07 AM
clang/include/clang/AST/DeclTemplate.h
786–788

Oh, sorry. This was an oversight.

But some local in-tree tests failed. I took some time to investigate them but I didn't get any insights : (

If the failures involve template template parameters, please note that you need https://reviews.llvm.org/D153003 for that to properly hash and find in the list of lazy specializations.

Got it. This is on my TODO list. (Feel free to take it if you're in a hurry. Just tell me when you start, so that we can avoid reinventing the wheel.)

v.g.vassilev added inline comments.Jul 11 2023, 9:25 AM
clang/lib/Serialization/ASTWriterDecl.cpp
288

Do you want to implement it in this patch? Or is this a note for future optimizations?

That was the intent, since it was blocking the Google folks. If that's not the case anymore, then maybe we can do it as a separate improvement...

v.g.vassilev marked 7 inline comments as done.

Address most of the comments. I will need some help with the Modules/pr60085.cppm failure. I suspect we pass some instantiation to CodeGen by iterator index, and that no longer works now that the instantiations are lazily triggered upon use.

clang/lib/AST/DeclTemplate.cpp
337

That was not my intent. I added /*=false*/ to keep track of the actual default argument.

However, you are right: we never use it without passing the correct argument. I will drop the default argument.

366

Are you suggesting to make this replacement: TPL -> /*TPL*/?

v.g.vassilev added inline comments.Jul 12 2023, 12:17 AM
clang/test/Modules/odr_hash.cpp
2905 ↗(On Diff #539409)

I realize that change is probably incorrect since we want to show both places where things differ.

I will need some help with the Modules/pr60085.cppm failure.

I am looking at https://reviews.llvm.org/D154914, so I may not be able to look into this deeply right now. If there are concrete and small issues, maybe I can try to take a look. Hope this helps.

clang/lib/AST/DeclTemplate.cpp
366

I am not sure. I'm just wondering whether a static checker will emit warnings for unused arguments.

Fix a compilation error.

Fix the odr_hash.cpp failure.

Address most of the comments. I will need some help with the Modules/pr60085.cppm failure. I suspect we pass some instantiation to CodeGen by iterator index, and that no longer works now that the instantiations are lazily triggered upon use.

In fact it looks like it fails in exactly the same way as described in the original report https://github.com/llvm/llvm-project/issues/60085. The commit message in https://github.com/llvm/llvm-project/commit/78e48977a6e67 hints that the issue surprisingly went away on its own. I suspect that test case stopped reproducing the underlying issue by chance. That probably means that this patch is not breaking anything but is exposing an underlying problem...

Address most of the comments. I will need some help with the Modules/pr60085.cppm failure. I suspect we pass some instantiation to CodeGen by iterator index, and that no longer works now that the instantiations are lazily triggered upon use.

In fact it looks like it fails in exactly the same way as described in the original report https://github.com/llvm/llvm-project/issues/60085. The commit message in https://github.com/llvm/llvm-project/commit/78e48977a6e67 hints that the issue surprisingly went away on its own. I suspect that test case stopped reproducing the underlying issue by chance. That probably means that this patch is not breaking anything but is exposing an underlying problem...

Got it. I understand it is frustrating to fix an additional bug during development. Also it is generally not good to revert a valid test case.

Are you in a hurry to land this soon? (e.g., land this for clang17) If not, I suggest waiting for me to fix the underlying issue after clang17 gets released. And if I take too long to fix it (e.g., 9.30), given the special nature of the patch, I think we can revert that test case and land this one at that time. How do you feel about the idea?

Address most of the comments. I will need some help with the Modules/pr60085.cppm failure. I suspect we pass some instantiation to CodeGen by iterator index, and that no longer works now that the instantiations are lazily triggered upon use.

In fact it looks like it fails in exactly the same way as described in the original report https://github.com/llvm/llvm-project/issues/60085. The commit message in https://github.com/llvm/llvm-project/commit/78e48977a6e67 hints that the issue surprisingly went away on its own. I suspect that test case stopped reproducing the underlying issue by chance. That probably means that this patch is not breaking anything but is exposing an underlying problem...

Got it. I understand it is frustrating to fix an additional bug during development. Also it is generally not good to revert a valid test case.

I agree. Here is a reduced version of the same test case:

// RUN: rm -rf %t
// RUN: mkdir %t
// RUN: split-file %s %t
//
// RUN: %clang_cc1 -std=c++20 -triple %itanium_abi_triple %t/d.cppm \
// RUN:     -emit-module-interface -o %t/d.pcm
// RUN: %clang_cc1 -std=c++20 -triple %itanium_abi_triple %t/c.cppm \
// RUN:     -emit-module-interface -o %t/c.pcm -fmodule-file=%t/d.pcm
// RUN: %clang_cc1 -std=c++20 -triple %itanium_abi_triple %t/b.cppm \
// RUN:     -emit-module-interface -o %t/b.pcm -fmodule-file=%t/d.pcm
// RUN: %clang_cc1 -std=c++20 -triple %itanium_abi_triple %t/a.cppm \
// RUN:     -emit-module-interface -o %t/a.pcm -fmodule-file=%t/d.pcm \
// RUN:     -fmodule-file=%t/c.pcm -fmodule-file=%t/b.pcm 
// RUN: %clang_cc1 -std=c++20 -triple %itanium_abi_triple %t/a.pcm \
// RUN:     -S -emit-llvm -disable-llvm-passes -o - | FileCheck %t/a.cppm
//
// Use -fmodule-file=<module-name>=<BMI-path>
// RUN: %clang_cc1 -std=c++20 -triple %itanium_abi_triple %t/d.cppm \
// RUN:     -emit-module-interface -o %t/d.pcm
// RUN: %clang_cc1 -std=c++20 -triple %itanium_abi_triple %t/c.cppm \
// RUN:     -emit-module-interface -o %t/c.pcm -fmodule-file=%t/d.pcm
// RUN: %clang_cc1 -std=c++20 -triple %itanium_abi_triple %t/b.cppm \
// RUN:     -emit-module-interface -o %t/b.pcm -fmodule-file=%t/d.pcm
// RUN: %clang_cc1 -std=c++20 -triple %itanium_abi_triple %t/a.cppm \
// RUN:     -emit-module-interface -o %t/a.pcm -fmodule-file=%t/d.pcm \
// RUN:     -fmodule-file=%t/c.pcm -fmodule-file=%t/b.pcm 
// RUN: %clang_cc1 -std=c++20 -triple %itanium_abi_triple %t/a.pcm \
// RUN:     -S -emit-llvm -disable-llvm-passes -o - | FileCheck %t/a.cppm

//--- d.cppm
export module d;

export template<typename>
struct integer {
  using type = int;

  static constexpr auto value() {
    return 0;
  }

};

export constexpr void ddd(auto const v) {
  v.value();
}

//--- c.cppm
export module c;

import d;

int use = integer<int>().value();

//--- b.cppm
export module b;

import d;

using use = integer<int>::type;

//--- a.cppm
export module a;

import d;

export void a() {
  ddd(integer<int>());
}

Are you in a hurry to land this soon? (e.g., land this for clang17) If not, I suggest waiting for me to fix the underlying issue after clang17 gets released. And if I take too long to fix it (e.g., 9.30), given the special nature of the patch, I think we can revert that test case and land this one at that time. How do you feel about the idea?

We are not in a hurry - this patch was developed in 2017 :) It can probably wait a few more years ;) I had some spare cycles to work on it, that's all.

I do not think we can land it without asking @rsmith or @dblaikie to test it on their internal builds, to make sure it does not regress performance...

@alexfh, could you test this patch on your internal infrastructure and check whether it actually decreases the memory footprint and/or compile times? I believe this was the only blocker to landing it as is.