This is an archive of the discontinued LLVM Phabricator instance.

[clang] Instantiate alias templates with sugar
Closed, Public

Authored by mizvekov on Oct 23 2022, 1:53 PM.

Details

Summary

This makes use of the changes introduced in D134604, in order to
instantiate alias templates with a final sugared substitution.

This comes at no additional relevant cost.
Since we don't track / unique them in specializations, we wouldn't be
able to resugar them later anyway.

Signed-off-by: Matheus Izvekov <mizvekov@gmail.com>
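
To illustrate what "sugar" means in the summary above, here is a minimal sketch. It is not code from the patch, and the names Int, Ptr and aliasIsTransparent are made up for the example. Type sugar records how a type was written (for instance through a typedef) without changing the type itself. Assuming the behaviour described in the summary, instantiating Ptr<Int> can keep the Int spelling in the resulting type instead of collapsing it to int; how that shows up in diagnostics or AST dumps depends on the clang version.

```cpp
#include <type_traits>

// A typedef is pure sugar: 'Int' and 'int' are the same type.
using Int = int;

// A (hypothetical) alias template. Instantiating it substitutes T into 'T *'.
template <class T> using Ptr = T *;

// Whatever sugar the compiler keeps internally, the language-level type is
// unchanged: Ptr<Int> is exactly 'int *'.
constexpr bool aliasIsTransparent() {
  return std::is_same<Ptr<Int>, int *>::value;
}

static_assert(aliasIsTransparent(), "Ptr<Int> must be the same type as int *");
```

The static_assert shows that sugar never affects program semantics; it only affects how the type is presented to the user and how much the AST remembers about the source spelling.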

Event Timeline

mizvekov created this revision. Oct 23 2022, 1:53 PM
mizvekov requested review of this revision. Oct 23 2022, 1:53 PM
Herald added projects: Restricted Project, Restricted Project, Restricted Project. Oct 23 2022, 1:53 PM
mizvekov added reviewers: davrec, Restricted Project. Oct 23 2022, 2:54 PM
erichkeane accepted this revision. Oct 24 2022, 7:02 AM
This revision is now accepted and ready to land. Oct 24 2022, 7:02 AM
This revision was automatically updated to reflect the committed changes.
mizvekov reopened this revision. Oct 31 2022, 8:22 AM
This revision is now accepted and ready to land. Oct 31 2022, 8:22 AM
mizvekov updated this revision to Diff 472021. Oct 31 2022, 8:23 AM
This revision was landed with ongoing or failed builds. Oct 31 2022, 9:58 AM
This revision was automatically updated to reflect the committed changes.
alexfh added a subscriber: alexfh. Nov 7 2022, 5:16 AM

Hi Matheus, 279fe6281d2ca5b2318c7437316c28750feaac8d causes a compilation timeout on some of our internal files. We're trying to get a test case we can share, but so far the only information I can provide is the difference in the compiler's perf profile:

OK:

3.08%  compiler.OK  compiler.OK         [.] llvm::FoldingSetBase::GrowBucketCount(unsigned int, llvm::FoldingSetBase::FoldingSet
2.82%  compiler.OK  compiler.OK         [.] llvm::FoldingSetBase::FindNodeOrInsertPos(llvm::FoldingSetNodeID const&, void*&, llv
2.32%  compiler.OK  compiler.OK         [.] llvm::FoldingSet<clang::ElaboratedType>::NodeEquals(llvm::FoldingSetBase const*, llv
1.86%  compiler.OK  compiler.OK         [.] clang::Decl::castFromDeclContext(clang::DeclContext const*)
1.61%  compiler.OK  compiler.OK         [.] clang::TypeLoc::getFullDataSizeForType(clang::QualType)
1.56%  compiler.OK  compiler.OK         [.] clang::TypeLoc::getNextTypeLocImpl(clang::TypeLoc)
1.47%  compiler.OK  compiler.OK         [.] clang::Sema::CheckTemplateArgumentList(clang::TemplateDecl*, clang::SourceLocation,

Bad:

61.07%  compiler.bad  compiler.bad        [.] llvm::FoldingSet<clang::UsingType>::NodeEquals(llvm::FoldingSetBase const*, llvm::F
 8.14%  compiler.bad  compiler.bad        [.] clang::UsingType::Profile(llvm::FoldingSetNodeID&, clang::UsingShadowDecl const*, c
 3.63%  compiler.bad  compiler.bad        [.] llvm::FoldingSetBase::FindNodeOrInsertPos(llvm::FoldingSetNodeID const&, void*&, ll
 1.95%  compiler.bad  compiler.bad        [.] llvm::FoldingSetNodeID::operator==(llvm::FoldingSetNodeID const&) const
 0.69%  compiler.bad  compiler.bad        [.] llvm::FoldingSetBase::GrowBucketCount(unsigned int, llvm::FoldingSetBase::FoldingSe
 0.56%  compiler.bad  compiler.bad        [.] llvm::FoldingSet<clang::ElaboratedType>::NodeEquals(llvm::FoldingSetBase const*, ll
 0.49%  compiler.bad  compiler.bad        [.] clang::Sema::CheckTemplateArgumentList(clang::TemplateDecl*, clang::SourceLocation,

Hi @alexfh. It looks like somehow we may be creating a crazy amount of extra UsingTypes in your test scenario.

There is a clang flag that prints some performance statistics, including the number of AST nodes created. I can't look up the spelling, as I am on vacation and on a cellphone for the next two weeks, but I believe it is documented.

One idea to get a reduction here is to perhaps tie your creduce interestingness test to that UsingType count.
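
The idea of tying the interestingness test to the UsingType count could be sketched as follows. This is only an illustration under stated assumptions: the function names and the threshold are hypothetical, the stats text is assumed to have been captured into a string by a wrapper script that is not shown, and the line format "N Using types, 40 each (... bytes)" is taken from the statistics quoted in this thread.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns N from a line of the form "    N Using types, 40 each (... bytes)",
// or -1 if the text contains no such line. Matching both "Using" and
// "types," avoids confusing it with the "Using decls" line of the Decl stats.
long usingTypeCount(const std::string &Stats) {
  std::istringstream In(Stats);
  std::string Line;
  while (std::getline(In, Line)) {
    std::istringstream Fields(Line);
    long Count = 0;
    std::string Kind, Word;
    if (Fields >> Count >> Kind >> Word && Kind == "Using" && Word == "types,")
      return Count;
  }
  return -1;
}

// Hypothetical interestingness predicate: a reduced candidate is kept only
// while it still creates at least Threshold UsingTypes.
bool stillInteresting(const std::string &Stats, long Threshold) {
  assert(Threshold > 0 && "threshold must be positive");
  return usingTypeCount(Stats) >= Threshold;
}
```

A reduction driver would then accept a candidate only when stillInteresting returns true for the stats of that candidate, so the reducer cannot shrink the file into something that no longer triggers the UsingType growth.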

alexfh added a comment. Nov 7 2022, 6:54 PM

Something related to using types has definitely changed. Here is the diff of what -Xclang=-print-stats prints for clang edf1a2e89340c8fa64a679e7d4ec2b5ee92ec40f (a few commits before this one) versus clang ee1f132d2c4d399be711275a62698ea9e766c199 (a few commits after this one):

 *** AST Context Stats:
-  2570192 types total.
+  2647700 types total.
     2 Decayed types, 48 each (96 bytes)
     194 ConstantArray types, 56 each (10864 bytes)
     58 DependentSizedArray types, 64 each (3712 bytes)
     28 IncompleteArray types, 40 each (1120 bytes)
     18 Atomic types, 40 each (720 bytes)
     2 Attributed types, 48 each (96 bytes)
     62 Builtin types, 24 each (1488 bytes)
     268 Decltype types, 40 each (10720 bytes)
     10 Auto types, 48 each (480 bytes)
     39 DeducedTemplateSpecialization types, 48 each (1872 bytes)
-    1763 DependentName types, 48 each (84624 bytes)
-    126 DependentTemplateSpecialization types, 48 each (6048 bytes)
-    1034975 Elaborated types, 48 each (49678800 bytes)
-    5301 FunctionProto types, 40 each (212040 bytes)
+    1759 DependentName types, 48 each (84432 bytes)
+    129 DependentTemplateSpecialization types, 48 each (6192 bytes)
+    1074359 Elaborated types, 48 each (51569232 bytes)
+    5329 FunctionProto types, 40 each (213160 bytes)
     1174 InjectedClassName types, 40 each (46960 bytes)
     110 MemberPointer types, 48 each (5280 bytes)
-    444 PackExpansion types, 40 each (17760 bytes)
+    424 PackExpansion types, 40 each (16960 bytes)
     129 Paren types, 40 each (5160 bytes)
-    1221 Pointer types, 40 each (48840 bytes)
-    1582 LValueReference types, 40 each (63280 bytes)
-    525 RValueReference types, 40 each (21000 bytes)
-    19 SubstTemplateTypeParmPack types, 48 each (912 bytes)
-    2138 SubstTemplateTypeParm types, 40 each (85520 bytes)
+    1174 Pointer types, 40 each (46960 bytes)
+    1598 LValueReference types, 40 each (63920 bytes)
+    555 RValueReference types, 40 each (22200 bytes)
+    33 SubstTemplateTypeParmPack types, 48 each (1584 bytes)
+    859 SubstTemplateTypeParm types, 40 each (34360 bytes)
     84 Enum types, 32 each (2688 bytes)
     1334 Record types, 32 each (42688 bytes)
     1276440 TemplateSpecialization types, 40 each (51057600 bytes)
     43125 TemplateTypeParm types, 40 each (1725000 bytes)
     1071 Typedef types, 40 each (42840 bytes)
     197249 UnaryTransform types, 48 each (9467952 bytes)
-    669 Using types, 40 each (26760 bytes)
+    40052 Using types, 40 each (1602080 bytes)
     32 Vector types, 40 each (1280 bytes)
-Total bytes = 112674200
+Total bytes = 116089696

The test is still being reduced.

@alexfh Thanks!

While there is a huge increase in the number of UsingTypes, the total still seems reasonable and does not explain the perf hit.

Perhaps this is a case of bad hashing and they are all falling into the same bucket?

cc @sam.mcall for awareness of UsingType issue.
It may be some very simple problem; I just can't look at the code right now.
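
The "same bucket" hypothesis above can be illustrated with a small sketch. It uses std::unordered_set rather than LLVM's FoldingSet, so it only demonstrates the general effect and says nothing about what UsingType::Profile actually does; all names in it are hypothetical. When every key hashes to the same value, each insertion has to be compared against all earlier keys, so the number of equality checks grows quadratically with the number of unique nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Equality functor that counts how often the container compares two keys.
struct CountingEq {
  std::size_t *Calls;
  bool operator()(int A, int B) const {
    ++*Calls;
    return A == B;
  }
};

// Deliberately degenerate hash: every key lands in the same bucket.
struct ConstantHash {
  std::size_t operator()(int) const { return 0; }
};

// Reasonable hash for small integers: distinct keys get distinct hashes.
struct IdentityHash {
  std::size_t operator()(int X) const { return static_cast<std::size_t>(X); }
};

// Inserts the keys 0..N-1 and reports how many key comparisons were needed.
template <class Hash> std::size_t comparisonsForInserts(int N) {
  assert(N >= 0);
  std::size_t Calls = 0;
  std::unordered_set<int, Hash, CountingEq> Set(16, Hash{}, CountingEq{&Calls});
  for (int I = 0; I < N; ++I)
    Set.insert(I);
  return Calls;
}
```

Comparing comparisonsForInserts<ConstantHash>(N) with comparisonsForInserts<IdentityHash>(N) for growing N shows the gap: with the constant hash, a profile would be dominated by the equality function, which is the shape of the "Bad" profile where NodeEquals takes most of the time.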

Given the broad impact of this in our code I'm inclined to revert the patch
to unblock us. The test case I have so far is still too large, but I hope
to get something shareable tomorrow.

alexfh added a comment. Nov 8 2022, 4:26 AM

I've reduced the test case to an initializer with a few thousand std::variant elements, which is compiled around 2 times slower by clang with this patch than by clang before the patch:

$ ../clang-base -fsyntax-only q2.cc  -ftime-report ; ../clang-exp -fsyntax-only q2.cc  -ftime-report
===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 7.4495 seconds (7.4498 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   7.2215 (100.0%)   0.2280 (100.0%)   7.4495 (100.0%)   7.4498 (100.0%)  Clang front-end timer
   7.2215 (100.0%)   0.2280 (100.0%)   7.4495 (100.0%)   7.4498 (100.0%)  Total

===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 13.7677 seconds (13.7686 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  13.5666 (100.0%)   0.2011 (100.0%)  13.7677 (100.0%)  13.7686 (100.0%)  Clang front-end timer
  13.5666 (100.0%)   0.2011 (100.0%)  13.7677 (100.0%)  13.7686 (100.0%)  Total

When I double the number of array elements, the parsing time after the patch grows by a larger factor:

$ ../clang-base -fsyntax-only q2.cc  -ftime-report ; ../clang-exp -fsyntax-only q2.cc  -ftime-report
===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 14.1165 seconds (14.1173 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  13.7642 (100.0%)   0.3523 (100.0%)  14.1165 (100.0%)  14.1173 (100.0%)  Clang front-end timer
  13.7642 (100.0%)   0.3523 (100.0%)  14.1165 (100.0%)  14.1173 (100.0%)  Total

===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 41.6697 seconds (41.6729 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  41.2583 (100.0%)   0.4114 (100.0%)  41.6697 (100.0%)  41.6729 (100.0%)  Clang front-end timer
  41.2583 (100.0%)   0.4114 (100.0%)  41.6697 (100.0%)  41.6729 (100.0%)  Total

Thus, the patch makes compilation time depend non-linearly on the number of certain elements in the AST.

I'm going to revert the patch for now and let you figure this out when it's convenient for you. Have a nice vacation!
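
The non-linear growth can be put into numbers with a rough sketch. It assumes that compile time behaves like c * n^k in the number of array elements n, so doubling n multiplies the time by 2^k and k can be estimated as log2 of the time ratio. The function names are hypothetical; the timings are the "Total Execution Time" values from the four -ftime-report runs above.

```cpp
#include <cassert>
#include <cmath>

// Estimates k in time ~ c * n^k from two timings taken before and after
// doubling n: the time ratio is 2^k, so k = log2(ratio).
double growthExponent(double SecondsBefore, double SecondsAfterDoubling) {
  assert(SecondsBefore > 0 && SecondsAfterDoubling > 0);
  return std::log2(SecondsAfterDoubling / SecondsBefore);
}

// clang before the patch: 7.4495 s, then 14.1165 s with twice the elements.
double baselineExponent() { return growthExponent(7.4495, 14.1165); }

// clang with the patch: 13.7677 s, then 41.6697 s with twice the elements.
double patchedExponent() { return growthExponent(13.7677, 41.6697); }
```

Under these assumptions the baseline comes out at roughly 0.92, which is close to linear, while the patched compiler comes out at roughly 1.6. Two data points per compiler are too few to fit a complexity class, so this only supports the qualitative observation that the patched build scales worse than linearly.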

Another problem with this patch: there is another pattern (or multiple different patterns?) in the code that results in an increase of around 3x in clang memory usage after this patch. The output of -print-stats doesn't make it clear where the additional allocations come from:

--- at_culprit	2022-11-10 13:16:52.000000000 +0100
+++ before_culprit	2022-11-10 13:21:44.000000000 +0100
@@ -1,340 +1,338 @@
-
 STATISTICS:
 
 *** Semantic Analysis Stats:
 0 SFINAE diagnostics trapped.
 
 Number of memory regions: 600
 Bytes used: 13579328
 Bytes allocated: 13631488
 Bytes wasted: 52160 (includes alignment, etc)
 
 *** Analysis Based Warnings Stats:
 168155 functions analyzed (0 w/o CFGs).
   697398 CFG blocks built.
   4 average CFG blocks per function.
   503 max CFG blocks per function.
 0 functions analyzed for uninitialiazed variables
   0 variables analyzed.
   0 average variables per function.
   0 max variables per function.
   0 block visits.
   0 average block visits per function.
   0 max block visits per function.
 
 *** AST Context Stats:
-  19535407 types total.
+  19808684 types total.
     30 Decayed types, 48 each (1440 bytes)
     585 ConstantArray types, 56 each (32760 bytes)
     2792 DependentSizedArray types, 64 each (178688 bytes)
-    96 IncompleteArray types, 40 each (3840 bytes)
+    90 IncompleteArray types, 40 each (3600 bytes)
     38 Atomic types, 40 each (1520 bytes)
     32 Attributed types, 48 each (1536 bytes)
     62 Builtin types, 24 each (1488 bytes)
     1 Complex types, 40 each (40 bytes)
     64274 Decltype types, 40 each (2570960 bytes)
     17491 Auto types, 48 each (839568 bytes)
     114 DeducedTemplateSpecialization types, 48 each (5472 bytes)
-    798961 DependentName types, 48 each (38350128 bytes)
-    527 DependentTemplateSpecialization types, 48 each (25296 bytes)
-    5241210 Elaborated types, 48 each (251578080 bytes)
-    588059 FunctionProto types, 40 each (23522360 bytes)
+    798160 DependentName types, 48 each (38311680 bytes)
+    467 DependentTemplateSpecialization types, 48 each (22416 bytes)
+    4735944 Elaborated types, 48 each (227325312 bytes)
+    585169 FunctionProto types, 40 each (23406760 bytes)
     5631 InjectedClassName types, 40 each (225240 bytes)
     4684 MemberPointer types, 48 each (224832 bytes)
     2 ObjCObjectPointer types, 40 each (80 bytes)
     2 ObjCObject types, 40 each (80 bytes)
     1 ObjCInterface types, 48 each (48 bytes)
-    30517 PackExpansion types, 40 each (1220680 bytes)
-    2700 Paren types, 40 each (108000 bytes)
-    393866 Pointer types, 40 each (15754640 bytes)
-    208529 LValueReference types, 40 each (8341160 bytes)
-    176267 RValueReference types, 40 each (7050680 bytes)
-    814 SubstTemplateTypeParmPack types, 48 each (39072 bytes)
-    3432531 SubstTemplateTypeParm types, 40 each (137301240 bytes)
+    30567 PackExpansion types, 40 each (1222680 bytes)
+    2694 Paren types, 40 each (107760 bytes)
+    395432 Pointer types, 40 each (15817280 bytes)
+    207973 LValueReference types, 40 each (8318920 bytes)
+    178946 RValueReference types, 40 each (7157840 bytes)
+    764 SubstTemplateTypeParmPack types, 48 each (36672 bytes)
+    4213429 SubstTemplateTypeParm types, 40 each (168537160 bytes)
     1596 Enum types, 32 each (51072 bytes)
     577336 Record types, 32 each (18474752 bytes)
     5892736 TemplateSpecialization types, 40 each (235709440 bytes)
     1776985 TemplateTypeParm types, 40 each (71079400 bytes)
     2 TypeOfExpr types, 32 each (64 bytes)
     188177 Typedef types, 40 each (7527080 bytes)
     124634 UnaryTransform types, 48 each (5982432 bytes)
-    4068 Using types, 40 each (162720 bytes)
+    1787 Using types, 40 each (71480 bytes)
     57 Vector types, 40 each (2280 bytes)
-Total bytes = 826368168
+Total bytes = 833249832
 5151/505661 implicit default constructors created
 21741/525877 implicit copy constructors created
 18531/522651 implicit move constructors created
 9247/526033 implicit copy assignment operators created
 5875/517861 implicit move assignment operators created
 23158/524964 implicit destructors created
 
-Number of memory regions: 1846
-Bytes used: 11780764491
-Bytes allocated: 12214755831
-Bytes wasted: 433991340 (includes alignment, etc)
+Number of memory regions: 1697
+Bytes used: 4995782175
+Bytes allocated: 5403406263
+Bytes wasted: 407624088 (includes alignment, etc)
 
 *** Decl Stats:
   4825242 decls total.
     21328 AccessSpec decls, 40 each (853120 bytes)
     2 Empty decls, 40 each (80 bytes)
     1 ExternCContext decls, 72 each (72 bytes)
     4 FileScopeAsm decls, 56 each (224 bytes)
     7779 Friend decls, 64 each (497856 bytes)
     6 LifetimeExtendedTemporary decls, 72 each (432 bytes)
     1721 LinkageSpec decls, 80 each (137680 bytes)
     1548 Using decls, 88 each (136224 bytes)
     5 Label decls, 80 each (400 bytes)
     1746 Namespace decls, 112 each (195552 bytes)
     3 NamespaceAlias decls, 96 each (288 bytes)
     1 ObjCInterface decls, 128 each (128 bytes)
     2 BuiltinTemplate decls, 72 each (144 bytes)
     5979 ClassTemplate decls, 88 each (526152 bytes)
     70734 FunctionTemplate decls, 88 each (6224592 bytes)
     30972 TypeAliasTemplate decls, 88 each (2725536 bytes)
     101 VarTemplate decls, 88 each (8888 bytes)
     164 TemplateTemplateParm decls, 88 each (14432 bytes)
     1600 Enum decls, 160 each (256000 bytes)
     538740 CXXRecord decls, 144 each (77578560 bytes)
     569602 ClassTemplateSpecialization decls, 184 each (104806768 bytes)
     1861 ClassTemplatePartialSpecialization decls, 208 each (387088 bytes)
     1776943 TemplateTypeParm decls, 80 each (142155440 bytes)
     198773 TypeAlias decls, 96 each (19082208 bytes)
     39828 Typedef decls, 88 each (3504864 bytes)
     11 UsingDirective decls, 88 each (968 bytes)
     1 UsingPack decls, 64 each (64 bytes)
     2759 UsingShadow decls, 80 each (220720 bytes)
     552 ConstructorUsingShadow decls, 104 each (57408 bytes)
     4 Binding decls, 72 each (288 bytes)
     14688 Field decls, 80 each (1175040 bytes)
     205002 Function decls, 168 each (34440336 bytes)
     500 CXXDeductionGuide decls, 184 each (92000 bytes)
     170230 CXXMethod decls, 168 each (28598640 bytes)
     122968 CXXConstructor decls, 176 each (21642368 bytes)
     6139 CXXConversion decls, 176 each (1080464 bytes)
     26174 CXXDestructor decls, 184 each (4816016 bytes)
     28025 NonTypeTemplateParm decls, 88 each (2466200 bytes)
     187031 Var decls, 104 each (19451224 bytes)
     2 Decomposition decls, 104 each (208 bytes)
     56907 ImplicitParam decls, 104 each (5918328 bytes)
     578834 ParmVar decls, 104 each (60198736 bytes)
     2278 VarTemplateSpecialization decls, 152 each (346256 bytes)
     1 VarTemplatePartialSpecialization decls, 176 each (176 bytes)
     5172 EnumConstant decls, 80 each (413760 bytes)
     80 IndirectField decls, 72 each (5760 bytes)
     131 UnresolvedUsingValue decls, 88 each (11528 bytes)
     148310 StaticAssert decls, 64 each (9491840 bytes)
 Total bytes = 549521056
 
 *** Stmt/Expr Stats:
   10418265 stmts/exprs total.
     38 GCCAsmStmt, 88 each (3344 bytes)
     193 BreakStmt, 8 each (1544 bytes)
     122 CXXForRangeStmt, 96 each (11712 bytes)
     250016 CompoundStmt, 16 each (4000256 bytes)
     14 ContinueStmt, 8 each (112 bytes)
     174667 DeclStmt, 24 each (4192008 bytes)
     73 DoStmt, 32 each (2336 bytes)
     465 ForStmt, 56 each (26040 bytes)
     5 GotoStmt, 24 each (120 bytes)
     6954 IfStmt, 16 each (111264 bytes)
     552 NullStmt, 8 each (4416 bytes)
     204436 ReturnStmt, 16 each (3270976 bytes)
     2143 CaseStmt, 24 each (51432 bytes)
     15 DefaultStmt, 32 each (480 bytes)
     1128 SwitchStmt, 24 each (27072 bytes)
     18 AttributedStmt, 16 each (288 bytes)
     12432 ConditionalOperator, 48 each (596736 bytes)
     9 ArrayInitIndexExpr, 16 each (144 bytes)
     9 ArrayInitLoopExpr, 32 each (288 bytes)
     911 ArraySubscriptExpr, 32 each (29152 bytes)
     2 ArrayTypeTraitExpr, 56 each (112 bytes)
     36 AtomicExpr, 88 each (3168 bytes)
     468424 BinaryOperator, 32 each (14989568 bytes)
     601 CompoundAssignOperator, 48 each (28848 bytes)
     41398 CXXBindTemporaryExpr, 32 each (1324736 bytes)
     402039 CXXBoolLiteralExpr, 16 each (6432624 bytes)
     55746 CXXConstructExpr, 40 each (2229840 bytes)
     12046 CXXTemporaryObjectExpr, 48 each (578208 bytes)
     1063 CXXDefaultArgExpr, 32 each (34016 bytes)
     88 CXXDefaultInitExpr, 32 each (2816 bytes)
     2453 CXXDeleteExpr, 32 each (78496 bytes)
     10625 CXXDependentScopeMemberExpr, 72 each (765000 bytes)
     21 CXXFoldExpr, 64 each (1344 bytes)
     37 CXXInheritedCtorInitExpr, 32 each (1184 bytes)
     6020 CXXNewExpr, 56 each (337120 bytes)
     4369 CXXNoexceptExpr, 32 each (139808 bytes)
     1401 CXXNullPtrLiteralExpr, 16 each (22416 bytes)
     35 CXXPseudoDestructorExpr, 80 each (2800 bytes)
     12046 CXXScalarValueInitExpr, 24 each (289104 bytes)
     244 CXXStdInitializerListExpr, 24 each (5856 bytes)
     65726 CXXThisExpr, 16 each (1051616 bytes)
     1270 CXXTypeidExpr, 32 each (40640 bytes)
     1752 CXXUnresolvedConstructExpr, 32 each (56064 bytes)
     414990 CallExpr, 24 each (9959760 bytes)
     39603 CXXMemberCallExpr, 24 each (950472 bytes)
     24985 CXXOperatorCallExpr, 32 each (799520 bytes)
     18 BuiltinBitCastExpr, 40 each (720 bytes)
     16798 CStyleCastExpr, 40 each (671920 bytes)
     3635 CXXFunctionalCastExpr, 40 each (145400 bytes)
     1628 CXXConstCastExpr, 48 each (78144 bytes)
     7 CXXDynamicCastExpr, 48 each (336 bytes)
     3503 CXXReinterpretCastExpr, 48 each (168144 bytes)
     35765 CXXStaticCastExpr, 48 each (1716720 bytes)
     1338486 ImplicitCastExpr, 24 each (32123664 bytes)
     254 CharacterLiteral, 24 each (6096 bytes)
     33019 CompoundLiteralExpr, 40 each (1320760 bytes)
     77 ConvertVectorExpr, 40 each (3080 bytes)
     1833471 DeclRefExpr, 32 each (58671072 bytes)
     72073 DependentScopeDeclRefExpr, 56 each (4036088 bytes)
     125 FloatingLiteral, 32 each (4000 bytes)
     1189793 ConstantExpr, 24 each (28555032 bytes)
     14532 ExprWithCleanups, 24 each (348768 bytes)
     687 ImplicitValueInitExpr, 16 each (10992 bytes)
     73875 InitListExpr, 64 each (4728000 bytes)
     211313 IntegerLiteral, 32 each (6762016 bytes)
     4917 LambdaExpr, 32 each (157344 bytes)
     1071069 MaterializeTemporaryExpr, 24 each (25705656 bytes)
     100739 MemberExpr, 48 each (4835472 bytes)
     98 NoInitExpr, 16 each (1568 bytes)
     23 OffsetOfExpr, 40 each (920 bytes)
     524730 OpaqueValueExpr, 24 each (12593520 bytes)
     417645 UnresolvedLookupExpr, 64 each (26729280 bytes)
     28632 UnresolvedMemberExpr, 80 each (2290560 bytes)
     18741 PackExpansionExpr, 32 each (599712 bytes)
     64820 ParenExpr, 32 each (2074240 bytes)
     31080 ParenListExpr, 24 each (745920 bytes)
     2629 PredefinedExpr, 16 each (42064 bytes)
     178 ShuffleVectorExpr, 40 each (7120 bytes)
     4690 SizeOfPackExpr, 40 each (187600 bytes)
     2 SourceLocExpr, 32 each (64 bytes)
     3 StmtExpr, 32 each (96 bytes)
     16336 StringLiteral, 16 each (261376 bytes)
     563909 SubstNonTypeTemplateParmExpr, 40 each (22556360 bytes)
     1387 SubstNonTypeTemplateParmPackExpr, 40 each (55480 bytes)
     196451 TypeTraitExpr, 24 each (4714824 bytes)
     47956 UnaryExprOrTypeTraitExpr, 32 each (1534592 bytes)
     271478 UnaryOperator, 24 each (6515472 bytes)
     5 LabelStmt, 32 each (160 bytes)
     468 WhileStmt, 16 each (7488 bytes)
 Total bytes = 303422696
 
 STATISTICS FOR 'test.ii':
 
 *** Preprocessor Stats:
 8332 directives found:
   434 #define.
   0 #undef.
   #include/#include_next/#import:
     2 source files entered.
     0 max include stack depth
   0 #if/#ifndef/#ifdef.
   0 #else/#elif/#elifdef/#elifndef.
   0 #endif.
   499 #pragma.
 0 #if/#ifndef#ifdef regions skipped
 0/0/0 obj/fn/builtin macros expanded, 0 on the fast path.
 0 token paste (##) operations performed, 0 on the fast path.
 
 Preprocessor Memory: 83335B total
   BumpPtr: 49152
   Macro Expanded Tokens: 384
   Predefines Buffer: 16383
   Macros: 16384
   #pragma push_macro Info: 0
   Poison Reasons: 1024
   Comment Handlers: 8
 
 *** Identifier Table Stats:
 # Identifiers:   24471
 # Empty Buckets: 8297
 Hash density (#identifiers per bucket): 0.746796
 Ave identifier length: 15.237873
 Max identifier length: 71
 
 Number of memory regions: 243
 Bytes used: 1376197
 Bytes allocated: 1466368
 Bytes wasted: 90171 (includes alignment, etc)
 
 *** HeaderSearch Stats:
 1 files tracked.
   0 #import/#pragma once files.
   0 #include/#include_next/#import.
     0 #includes skipped due to the multi-include optimization.
 0 framework lookups.
 0 subframework lookups.
 
 *** Source Manager Stats:
 1 files mapped, 2 mem buffers mapped.
 1083 local SLocEntry's allocated (49128 bytes of capacity), 7520180B of Sloc address space used.
 0 loaded SLocEntries allocated, 0B of Sloc address space used.
 7499274 bytes of files mapped, 1 files with line #'s computed, 0 files with macro args computed.
 FileID scans: 2194 linear, 8644 binary.
 
 
 *** File Manager Stats:
 1 real files found, 2 real dirs found.
 0 virtual files found, 0 virtual dirs found.
 2 dir lookups, 2 dir cache misses.
 1 file lookups, 1 file cache misses.
 
 ===-------------------------------------------------------------------------===
                           ... Statistics Collected ...
 ===-------------------------------------------------------------------------===
 
  1232865 asm-printer           - Number of machine instrs printed
   182898 assembler             - Number of emitted assembler fragments - align
   591195 assembler             - Number of emitted assembler fragments - data
     7823 assembler             - Number of emitted assembler fragments - fill
  1122581 assembler             - Number of emitted assembler fragments - total
  1140582 assembler             - Number of fragment layouts
 85464904 assembler             - Number of emitted object file bytes
        2 assembler             - Number of assembler layout and relaxation steps
   551614 assembler             - Number of evaluated fixups
     8859 assume-queries        - Number of Queries into an assume assume bundles
    20543 dagcombine            - Number of dag nodes combined
    14563 dwarfehprepare        - Number of cleanup landing pads remaining
    48174 dwarfehprepare        - Number of functions with nounwind
    32482 dwarfehprepare        - Number of functions with unwind
        2 file-search           - Number of directory cache misses.
        2 file-search           - Number of directory lookups.
        1 file-search           - Number of file cache misses.
        1 file-search           - Number of file lookups.
    37954 isel                  - Number of blocks selected using DAG
  1455524 isel                  - Number of times dag isel has to try another path
    80656 isel                  - Number of entry blocks encountered
   109377 isel                  - Number of blocks selected entirely by fast isel
     2983 isel                  - Number of dead insts removed on failure
     9165 isel                  - Number of entry blocks where fast isel failed to lower arguments
   190004 isel                  - Number of instructions fast isel failed on
   581285 isel                  - Number of instructions fast isel selected
   213242 isel                  - Number of insts selected by target-independent selector
   365061 isel                  - Number of insts selected by target-specific selector
  7860818 mcexpr                - Number of MCExpr evaluations
      897 phi-node-elimination  - Number of phis lowered
     3127 pre-RA-sched          - Number of loads clustered together
  3176769 prologepilog          - Number of bytes used for stack in all functions
    80656 prologepilog          - Number of functions seen in PEI
   586056 regalloc              - Number of copies coalesced
    52504 regalloc              - Number of loads added
    44368 regalloc              - Number of stores added
    80656 stackmaps             - Number of functions skipped
    80656 stackmaps             - Number of functions visited
     9514 twoaddressinstruction - Number of two-address instructions
        1 x86-isel              - Number of tail calls

The only major difference is in the total memory allocation ('-' lines are with this patch, '+' lines are before it or after the revert):

-Number of memory regions: 1846
-Bytes used: 11780764491
-Bytes allocated: 12214755831
+Number of memory regions: 1697
+Bytes used: 4995782175
+Bytes allocated: 5403406263

Note that the Clang 16 branch is coming up approximately on January 24, and this needs to be reverted or the perf regression fixed by then. @mizvekov: if you or someone else doesn't have a solution or revert for this by January 13th, @aaron.ballman and I will begin the process to revert this ourselves.

I see that this was already reverted in a5c18fcf6e7ffeea72aaf079477caf2ac1d641bc