Page MenuHomePhabricator

[clang] de-duplicate methods from AST files
Needs ReviewPublic

Authored by rmaz on Sep 10 2021, 2:09 PM.

Details

Summary

This diff will de-duplicate methods read from AST files before
inserting them in to a global method pool list. When reading
ObjCMethodDecl from AST files we can end up with a significant
amount of duplication when modules contain redeclarations of
system framework methods. For instance a common pattern is to
redeclare -(instancetype)init with NS_UNAVAILABLE, which
results in the entire ObjCMethodList for init being serialized
in each module with this redeclaration.

Measuring this against our codebase for files that use -fmodules
shows an overall 19% compile time improvement, and in some cases
as much as 79% for files with a lot of modular dependencies.

Diff Detail

Event Timeline

rmaz requested review of this revision.Sep 10 2021, 2:09 PM
rmaz created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptSep 10 2021, 2:09 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript

This looks good to me in general. Since it should not change functionality, it may not be possible to write a test case.

Manman

clang/lib/Serialization/ASTReader.cpp
8194 ↗(On Diff #372010)

Does it make sense to check for duplication inside addMethodToGlobalList, as the function goes through the list as well? Maybe it is slower, as we will need to go through the list for each method, instead of a lookup.

rmaz added inline comments.Sep 10 2021, 3:14 PM
clang/lib/Serialization/ASTReader.cpp
8194 ↗(On Diff #372010)

Yes, you are right, it is slower as we need to do a list traverse per insert rather than per selector lookup. I also profiled keeping some global state along with the MethodPool so that the set didn't have to be rebuilt each time, but the performance difference was negligible.

@dexonsmith @bruno: are you okay with this change? It looks good to me :]

Thanks,
Manman

dexonsmith added a subscriber: vsapsai.

Thanks for looking at this!

I have a couple of comments inline. @vsapsai, can you also take a look?

clang/lib/Serialization/ASTReader.cpp
8188–8190 ↗(On Diff #372010)

I find quadratic algorithms a bit scary, even when current benchmarks don't expose problems. For example, it seems like this could blow up on the following workload:

  • one module adds many things to the global method list
  • there are many (fine-grained) modules that transitively load that module

Or have I misunderstood the complexity here?

8194 ↗(On Diff #372010)

Can you take another look at the approach you experimented with, putting the global state in the MethodPool? Besides the performance difference (you measured it as negligible but it'd avoid the concern I have about uncommon workloads hitting quadratic blowups), it'd also provide consistency in behaviour between callers of addMethodToGlobalList... and enable you to add a unit test for the API change to MethodPool.

rmaz added inline comments.Sep 14 2021, 11:37 AM
clang/lib/Serialization/ASTReader.cpp
8188–8190 ↗(On Diff #372010)

Yes, I take your point, if the method pool generation updates inbetween each of the later modules then it is possible to hit this case.

8194 ↗(On Diff #372010)

I'll update with this approach, it should allow for moving the set insert logic into addMethodToGlobalList in this case.

I'm a little bit confused here. Where is the speed-up coming from? Is it because ObjCMethodList for init not being serialized? I'm asking because right now I don't see how seen helps with that.

My current understanding is that Sema::addMethodToGlobalList is too expensive and seen reduces the number of calls to it. So maybe it makes sense to invest into making it faster? For example, Sema::MatchTwoMethodDeclarations can return early if both method decl pointers are equal. But I haven't done any measurements, that's just an example.

rmaz added a comment.Sep 14 2021, 4:46 PM

I'm a little bit confused here. Where is the speed-up coming from? Is it because ObjCMethodList for init not being serialized? I'm asking because right now I don't see how seen helps with that.

My current understanding is that Sema::addMethodToGlobalList is too expensive and seen reduces the number of calls to it. So maybe it makes sense to invest into making it faster? For example, Sema::MatchTwoMethodDeclarations can return early if both method decl pointers are equal. But I haven't done any measurements, that's just an example.

The speedup is coming from reducing the number of times Sema::addMethodToGlobalList is called when looking up methods loaded from modules. From what I can see each serialized module will contain an ObjCMethodList with all the methods from all visible modules at the point of serialization. This means that if you have a lot of modules that redeclare -(id)init, which is a common pattern in our code, then each serialized module will contain a lot of shared method decls, which we do not de-duplicate. To illustrate this, here are the top 5 instance method counts sorted by amount of duplicates (using pointer comparison) from a pass of the ReadMethodPoolVisitor for a single file:

selector# methods# unique# duplicated
init982516528173
init21559301225
copy1398633765
init1027270757
init948410538

Without having a set to deduplicate I'm not sure how we could make Sema::addMethodToGlobalList fast enough, wouldn't it require a list traversal for each insert, making it O(n^2)?

rmaz updated this revision to Diff 372597.Sep 14 2021, 5:19 PM

Updated to use a GlobalMethods struct which contains the global
method lists as well as the sets used to de-duplicate added methods.

rmaz updated this revision to Diff 372598.Sep 14 2021, 5:21 PM

update to fix lint warnings

dexonsmith added inline comments.Sep 15 2021, 10:38 AM
clang/include/clang/Sema/Sema.h
1422–1425

Do these two sets really need to be per-selector (per GlobalMethods object in GlobalMethodPool)? I feel like they could be global to Sema, as long as the same ObjCMethodDecl is never added to GlobalMethodPool for more than one selector. (It seems plausible we have that property since presumably ObjCMethodDecl::getSelector is always the key used for inserting into GlobalMethodPool below... can you confirm?)

Assuming you can pull the sets out to represent the GlobalMethodPool as a whole, then I suggest creating a data structure for GlobalMethodPool that encapsulates the DenseMap and the two sets (maybe: rename "GlobalMethods" to "GlobalMethodLists", rename "GlobalMethodPool" to "GlobalMethods", and give the new data structure the name "GlobalMethodPool"). Best to do it as multiple commits: NFC commit(s) to refactor (renames, create new type update call sites to use new APIs, etc.), and a final commit that changes the functionality (adding the set behaviour) but that doesn't need to touch call sites.

On the other hand, if the sets really need to be per-selector (please explain why if so...), please use SmallPtrSet here instead of DenseSet to avoid allocation in the common case of 1 decl per selector. I'd suggest encapsulating the list/set together somehow (maybe starting with an NFC refactoring to add a "list/set" data structure at the top-level (maybe rename ObjCMethodList => ObjCMethodListNode, and the new list has a private member called "Head" and necessary APIs for insertion/etc), then in a second commit adding the SmallPtrSet into the list data structure and use it to gate the "insert" operation).

rmaz added inline comments.Sep 16 2021, 8:05 AM
clang/include/clang/Sema/Sema.h
1422–1425

I feel like they could be global to Sema, as long as the same ObjCMethodDecl is never added to GlobalMethodPool for more than one selector. (It seems plausible we have that property since presumably ObjCMethodDecl::getSelector is always the key used for inserting into GlobalMethodPool below... can you confirm?)

Yes, that is correct. We could also use a single set instead of two as the instance and factory methods would have different decls anyway.

Assuming you can pull the sets out to represent the GlobalMethodPool as a whole, then I suggest creating a data structure for GlobalMethodPool that encapsulates the DenseMap and the two sets

I'll go with this approach with the single set, starting with a data structure refactor diff.

rmaz updated this revision to Diff 373072.Sep 16 2021, 2:27 PM

Update with single DenseSet per GlobalMethodPool

dexonsmith added inline comments.Sep 16 2021, 2:44 PM
clang/include/clang/Sema/Sema.h
1434–1436

Hmm, I was imagining that the set would be more encapsulated than this, not just stored in the same place.

I'm wondering if the following could be done in a prep commit:

  • Change Sema::addMethodToGlobalList to a private member function of GlobalMethodPool.
  • Make GlobalMethodPool::insert private
  • Add GlobalMethodPool::addMethod(ObjCMethodDecl*,bool,bool), which does the second half of Sema::AddMethodToGlobalPool (the parts that don't need Sema's other fields), and change the latter to call the former.

WDYT?

The speedup is coming from reducing the number of times Sema::addMethodToGlobalList is called when looking up methods loaded from modules. From what I can see each serialized module will contain an ObjCMethodList with all the methods from all visible modules at the point of serialization.

Thanks for the explanation! I'm still curious to reproduce the problem locally and have created a test case generator https://gist.github.com/vsapsai/f9d3603dde95eebd23248da4d7b4f5ec It creates a chain of .m -> Synthesized9 -> Synthesized8 -> Synthesized7 ->... Does it represent the structure of the code you are dealing with?

rmaz added a comment.Sep 17 2021, 9:20 AM

Thanks for the explanation! I'm still curious to reproduce the problem locally and have created a test case generator https://gist.github.com/vsapsai/f9d3603dde95eebd23248da4d7b4f5ec It creates a chain of .m -> Synthesized9 -> Synthesized8 -> Synthesized7 ->... Does it represent the structure of the code you are dealing with?

The case we have is more like:

.m   -> A -> long list of partially shared deps -> Foundation
     -> B -> long list of partially shared deps -> Foundation
     -> C -> long list of partially shared deps -> Foundation
     .... * a few hundred

So we have a file that imports a lot of modules, in the hundreds. Each of those modules has multiple ObjC interfaces with -(id)init NS_UNAVAILABLE and imports Foundation, UIKit and also a large number of libraries that are shared across the top level imports. This will result in A.pcm, B.pcm and C.pcm including hundreds or thousands of init decls that are the same, from system frameworks or whatever modules are shared between the top level imports.

IIUC the code currently serializes the entire ObjCMethodList for a module for every declared method, including the methods that are not part of that module. When deserializing we don't descend into module dependencies as the entire method list would already be deserialized, but that doesn't help for modules that aren't directly dependent. Is this right? If so it seems another approach could be to only serialize the methods declared in that module itself, and during deserialization we would have to load the methods from all dependent modules.

rmaz added inline comments.Sep 17 2021, 9:55 AM
clang/include/clang/Sema/Sema.h
1434–1436

Definitely neater, will take a look at this later today.

rmaz added inline comments.Sep 17 2021, 11:36 AM
clang/include/clang/Sema/Sema.h
1434–1436

This might need a slightly different approach, as for the insert use case we have:

Sema &S = *getSema();
  Sema::GlobalMethodPool::iterator Pos =
      S.MethodPool.insert(std::make_pair(Sel, Sema::GlobalMethodPool::Lists()))
          .first;

  Pos->second.first.setBits(Visitor.getInstanceBits());
  Pos->second.first.setHasMoreThanOneDecl(Visitor.instanceHasMoreThanOneDecl());
  Pos->second.second.setBits(Visitor.getFactoryBits());
  Pos->second.second.setHasMoreThanOneDecl(Visitor.factoryHasMoreThanOneDecl());

  // Add methods to the global pool *after* setting hasMoreThanOneDecl, since
  // when building a module we keep every method individually and may need to
  // update hasMoreThanOneDecl as we add the methods.
  addMethodsToPool(S, Visitor.getInstanceMethods(), Pos->second.first);
  addMethodsToPool(S, Visitor.getFactoryMethods(), Pos->second.second);

At the moment we fetch a method list, modify its state and then start inserting the methods. If we move to something like MethodPool.addMethod(ObjCMethodDecl *) we will have to look up the method list for each insert, and we would need extra methods to set the state first on the method list. How about something like MethodPool.addMethods(ArrayRef<ObjCMethodDecl*> methods, unsigned instanceBits, bool hasMoreThanOneDecl)? Then we only need two list lookups per selector as before and we can handle the list state update before insert.

The case we have is more like:

.m   -> A -> long list of partially shared deps -> Foundation
     -> B -> long list of partially shared deps -> Foundation
     -> C -> long list of partially shared deps -> Foundation
     .... * a few hundred

So we have a file that imports a lot of modules, in the hundreds. Each of those modules has multiple ObjC interfaces with -(id)init NS_UNAVAILABLE and imports Foundation, UIKit and also a large number of libraries that are shared across the top level imports. This will result in A.pcm, B.pcm and C.pcm including hundreds or thousands of init decls that are the same, from system frameworks or whatever modules are shared between the top level imports.

IIUC the code currently serializes the entire ObjCMethodList for a module for every declared method, including the methods that are not part of that module. When deserializing we don't descend into module dependencies as the entire method list would already be deserialized, but that doesn't help for modules that aren't directly dependent. Is this right? If so it seems another approach could be to only serialize the methods declared in that module itself, and during deserialization we would have to load the methods from all dependent modules.

I have created a different test synthesizer synthesize-shared-framework-chain-test.py to reproduce the described framework layout. In addMethodsToPool added a set seen to count how many methods we can skip and we don't skip anything in the modules but there are a lot of duplicates in .m file compilation. Also I've noticed that the size of .pcm increases the longer chain of modules is reachable from the module (19K for Shared0.pcm vs 38K for Shared49.pcm). Haven't checked if it is because of ObjCMethodList or for another reason.

I don't remember for sure but I don't think there is a consistent policy about a module storing transitive data or only data it owns. I suspect we might be using both approaches and need to check each case separately.

I don't remember for sure but I don't think there is a consistent policy about a module storing transitive data or only data it owns. I suspect we might be using both approaches and need to check each case separately.

I imagine we ideally want only owned data, except where there's a significant performance benefit for caching transitive data. I know identifier lookup tables include transitive data if there are "enough" transitive modules to justify it (see @rsmith's commits from 6-8 years ago related to multi-file on-disk hash tables).

dexonsmith added inline comments.Sep 20 2021, 8:52 AM
clang/include/clang/Sema/Sema.h
1434–1436

That seems reasonable to me.

Looks like in METHOD_POOL we don't need the transitive closure for performance reasons. And we are kinda trying to store only data the module owns

// Only write this selector if it's not in an existing AST or something
// changed.

But even if a single method is in the module we are building, we'd serialize entire list of methods for this selector. I've tried a proof of concept https://reviews.llvm.org/D110123 to see what happens if we are more aggressive with skipping methods coming from other modules. It's not entirely correct as we are missing some transitive methods and I have doubts it is the best implementation. But for me it cuts down the compilation time of the synthetic test case and we aren't dealing with any duplicates, so that seems like a viable approach. In bad news this change doesn't fail any tests, so looks like this area isn't sufficiently well tested.

What folks are thinking about writing less in METHOD_POOL?

rmaz added a comment.Sep 21 2021, 7:49 AM

What folks are thinking about writing less in METHOD_POOL?

I prefer the idea of it, but I think the ReadMethodPoolVisitor also has to be changed for this to work. When it finds a selector in a module it will return true, which causes the search to stop descending into dependent modules:

if (!Visitor(*CurrentModule))
  continue;

// The visitor has requested that cut off visitation of any
// module that the current module depends on. To indicate this
// behavior, we mark all of the reachable modules as having been visited.

Wouldn't this logic have to be changed to ensure we pick up all the transitive methods from dependent modules?

What folks are thinking about writing less in METHOD_POOL?

I prefer the idea of it, but I think the ReadMethodPoolVisitor also has to be changed for this to work. When it finds a selector in a module it will return true, which causes the search to stop descending into dependent modules:

if (!Visitor(*CurrentModule))
  continue;

// The visitor has requested that cut off visitation of any
// module that the current module depends on. To indicate this
// behavior, we mark all of the reachable modules as having been visited.

Wouldn't this logic have to be changed to ensure we pick up all the transitive methods from dependent modules?

You are right, suggested implementation is insufficient and I've added a test to https://reviews.llvm.org/D110123 to catch it.

Right now MODULE_POOL seems to be caching all transitive modules but it does so poorly in case of shared dependencies and sometimes the cache is empty forcing you to collect methods from imported modules. So it looks like we are somewhere in the middle on caching --- non-caching spectrum. I'd prefer to go entirely to one of the ends and for that I see 2 options:

  1. Deduplicate methods from shared dependencies when reading method pools from imported modules.
  2. Serialize only methods owned by the current module (and change ReadMethodPoolVisitor appropriately).

I think the option 2 would be better as it seems to be easier to understand. Implementation complexity seems to be comparable, runtime seems to be better for option 2. Also given that collecting methods from a module is basically InstanceMethods.append(Data.Instance.begin(), Data.Instance.end()), I don't think caching methods from transitive dependencies saves us processing time.

  1. Serialize only methods owned by the current module (and change ReadMethodPoolVisitor appropriately).

Would that require visiting all in-memory modules every time there's a global method pool lookup?

If so, that sounds expensive... and similar to the problem that had to be solved to make identifier lookup fast. Maybe the same approach can/should be used here?

Or if not, can you clarify for me?

rmaz added a comment.Sep 24 2021, 1:23 PM

@vsapsai just checking on where we are with this one. Is the current plan to investigate an approach only serializing the methods declared in each module, or are we good to go ahead with the set de-duplication approach? I tried profiling with D110123 and with the following patch:

diff --git a/clang/lib/Serialization/ASTReader.cpp b/clang/lib/Serialization/ASTReader.cpp
index 279067a8ac8b..aaaca0aff9ab 100644
--- a/clang/lib/Serialization/ASTReader.cpp
+++ b/clang/lib/Serialization/ASTReader.cpp
@@ -8130,7 +8130,7 @@ namespace serialization {
       FactoryBits = Data.FactoryBits;
       InstanceHasMoreThanOneDecl = Data.InstanceHasMoreThanOneDecl;
       FactoryHasMoreThanOneDecl = Data.FactoryHasMoreThanOneDecl;
-      return true;
+      return false;
     }

but this was a fair bit slower than the deduplication approach, and for some files would never complete, presumably stuck in an infinite loop of dependent modules. Is there a way to exclude the serialization of methods from dependent modules but make the method lookup more efficient somehow?

  1. Serialize only methods owned by the current module (and change ReadMethodPoolVisitor appropriately).

Would that require visiting all in-memory modules every time there's a global method pool lookup?

If so, that sounds expensive... and similar to the problem that had to be solved to make identifier lookup fast. Maybe the same approach can/should be used here?

Or if not, can you clarify for me?

We use ReadMethodPoolVisitor only from ASTReader::ReadMethodPool. I don't know all the cases when it is called but it is called at least on encountering @end to handle methods declared in an interface. With the proposed change the behavior is the following:

  • the first time we encounter a selector, we walk the entire transitive closure of dependencies
  • when ReadMethodPool is called again for the same selector, we check only immediate dependencies but exit early and don't drill into transitive dependencies due to
// If we've already searched this module file, skip it now.
if (M.Generation <= PriorGeneration)
  return true;
  • when we add a new dependency and trigger ReadMethodPool again, we visit only the added dependency and its unvisited transitive dependencies. We exit early for already visited modules.

So from the module visiting perspective there are no regressions. I'm not super happy with repeated early returns for immediate dependencies but we were doing it already. This behavior seems to be orthogonal and we can address it separately if needed.

@vsapsai just checking on where we are with this one. Is the current plan to investigate an approach only serializing the methods declared in each module, or are we good to go ahead with the set de-duplication approach? I tried profiling with D110123 and with the following patch:

diff --git a/clang/lib/Serialization/ASTReader.cpp b/clang/lib/Serialization/ASTReader.cpp
index 279067a8ac8b..aaaca0aff9ab 100644
--- a/clang/lib/Serialization/ASTReader.cpp
+++ b/clang/lib/Serialization/ASTReader.cpp
@@ -8130,7 +8130,7 @@ namespace serialization {
       FactoryBits = Data.FactoryBits;
       InstanceHasMoreThanOneDecl = Data.InstanceHasMoreThanOneDecl;
       FactoryHasMoreThanOneDecl = Data.FactoryHasMoreThanOneDecl;
-      return true;
+      return false;
     }

but this was a fair bit slower than the deduplication approach, and for some files would never complete, presumably stuck in an infinite loop of dependent modules. Is there a way to exclude the serialization of methods from dependent modules but make the method lookup more efficient somehow?

That patch looks correct, I was experimenting with exactly the same local change. Have you tried D110123 on your original build? In my testing with synthetic test case it looks as good as set deduplication. In its current state D110123 isn't really representative for actual code bases because it relies on method lists of specific shape (all non-transitive methods coming after all transitive methods). You can try to add deduplication stats to see that there is less deduplication but it's not entirely eliminated. At least that's my understanding which can be flawed. Though the fact that clang didn't complete for some files is super strange and concerning, have no idea why is it happening and will need to experiment on a real-life project.

I still think it is worth to pursue writing only owned methods in METHOD_POOL. Deduplication with a DenseSet works but I still expect it to be more expensive than avoiding duplicates by construction. As for the next step I was thinking about changing ASTMethodPoolTrait to skip transitive methods during serialization and it should help with eliminating all the duplicates. With that change we can test and see how it works.

rmaz added a comment.Tue, Sep 28, 11:15 AM

That patch looks correct, I was experimenting with exactly the same local change. Have you tried D110123 on your original build? In my testing with synthetic test case it looks as good as set deduplication.

I am getting some pretty unexpected results from this approach. For the files I could get to complete below are the compilation times for each approach for the 10 slowest (in the baseline) files:

filebaselineset deduplicationno external methods in pcm
Total95.5775.30306.46
IGSFCVC.m4.571.5313.71
IGVFVC.mm4.191.6127.93
IGMFAHC.mm4.001.939.94
IGMFLCVC.m3.891.5512.30
PLRFPPC.mm3.193.263.99
IGBFPVC.m3.181.1128.92
IGBVC.m2.370.9817.42
PLRRM.mm1.991.992.31
IGBSPSC.m1.880.8514.38

All of the most interesting files were left out of these results as I could not get compilation to complete for the no external methods approach.

I still think it is worth to pursue writing only owned methods in METHOD_POOL. Deduplication with a DenseSet works but I still expect it to be more expensive than avoiding duplicates by construction. As for the next step I was thinking about changing ASTMethodPoolTrait to skip transitive methods during serialization and it should help with eliminating all the duplicates. With that change we can test and see how it works.

I will take a go at this approach to eliminate all external methods and compare to see if this exhibits similar behaviour.

rmaz added a comment.Tue, Sep 28, 12:14 PM

Avoiding serializing external methods in ASTMethodPoolTrait is working successfully, although the times are not as good as using set deduplication:

filebaselineset dedupeno external
Total149.6892.97109.83
IGMFVC.mm25.755.386.35
IGMFLADS.m15.973.304.61
IGIMFVC.mm6.702.983.77
IGSFCVC.m4.571.531.87
IGVFVC.mm4.191.611.85
IGMFAHC.mm4.001.932.41
IGMFLCVC.m3.891.551.96
PLRFPPC.mm3.193.264.05
IGBFPVC.m3.181.111.36

Where the no external numbers are using D110648

I've updated D110123 to be the way I was planning it to be and I was testing locally with it. So far, with this change the speedup over a baseline is ~20%, though measurements aren't super rigorous. I.e., no multiple runs to warm up the caches, no control for other background activity. At least, with this precision can claim there is no slow down in compilation time.

What's interesting, I was able to trigger more diagnostic. Specifically, I got warnings about length ambiguity because in NSStatusItem it is CGFloat, while in NSString it is NSUInteger. I also had more diagnostic that's unclear how it is triggered. It can be a project issue or a bug somewhere else, need to check it more carefully.

In theory, "set dedupe" approach should be slower than "no external" as we are iterating through shared dependencies. But in practice it isn't, which is puzzling. My current theory is that "no external" also feeds into correctness, so we might be doing more [correct] method overloading checks.

rmaz added a comment.Thu, Sep 30, 7:40 AM

What's interesting, I was able to trigger more diagnostic. Specifically, I got warnings about length ambiguity because in NSStatusItem it is CGFloat, while in NSString it is NSUInteger. I also had more diagnostic that's unclear how it is triggered. It can be a project issue or a bug somewhere else, need to check it more carefully.

Yes, I also had a couple of files fail to compile due to mismatched (or differently matched) selectors.

In theory, "set dedupe" approach should be slower than "no external" as we are iterating through shared dependencies. But in practice it isn't, which is puzzling. My current theory is that "no external" also feeds into correctness, so we might be doing more [correct] method overloading checks.

Well, wouldn't the tradeoff there be that we now have to descend into all dependent modules during selector lookup when we didn't have to previously? And this would do more work as only a small subset of those modules would have a selector match, which in the current case has already been handled and serialized on a single method list.

What's interesting, I was able to trigger more diagnostic. Specifically, I got warnings about length ambiguity because in NSStatusItem it is CGFloat, while in NSString it is NSUInteger. I also had more diagnostic that's unclear how it is triggered. It can be a project issue or a bug somewhere else, need to check it more carefully.

Yes, I also had a couple of files fail to compile due to mismatched (or differently matched) selectors.

Diagnostic about -length is due to isAcceptableMethodMismatch returning different results depending on the order of methods. Probably other diagnostic is caused by a similar issue. Will need to add tests and to fix isAcceptableMethodMismatch.

In theory, "set dedupe" approach should be slower than "no external" as we are iterating through shared dependencies. But in practice it isn't, which is puzzling. My current theory is that "no external" also feeds into correctness, so we might be doing more [correct] method overloading checks.

Well, wouldn't the tradeoff there be that we now have to descend into all dependent modules during selector lookup when we didn't have to previously? And this would do more work as only a small subset of those modules would have a selector match, which in the current case has already been handled and serialized on a single method list.

My assumption was that all dependent modules are in memory at this point. And we visit transitive modules only once, so I don't expect it to be a big performance hit (though I can be wrong). And deserializing methods from different modules shouldn't be more work because we are deserializing fewer methods than with "set dedupe". Maybe the order of methods matters? I can see us short circuiting in some cases but not the others. Though that's not a very plausible explanation for 20-25% discrepancy in compilation time.

rmaz added a comment.Mon, Oct 4, 3:23 PM

My assumption was that all dependent modules are in memory at this point. And we visit transitive modules only once, so I don't expect it to be a big performance hit (though I can be wrong). And deserializing methods from different modules shouldn't be more work because we are deserializing fewer methods than with "set dedupe".

I added some stats collection for the number of methods deserialized, here are the results from the slowest file in the above table:

Method# Module File Lookups# Module File Hits# Instance Methods Deserialized# Class Methods Deserialized
Baseline340231298200529298
Set Dedupe34023129137436781
No external4526596040101704

So while the no external approach has ~3.5x fewer methods to deserialize, it has to visit ~7.5x as many module files to deserialize methods from.

These are interesting results. I'm curious to measure the time spent in ASTReader::ReadMethodPool.

I'm planning to preserve the order of methods roughly by s/InstanceMethods.append/InstanceMethods.prepend/ in ReadMethodPoolVisitor. isAcceptableMethodMismatch is fixable but "id" can have multiple methods applicable and existing source code relies on specific methods to be first in the methods' lists. Changing the order would break existing working code and we cannot afford that. Curious to see how the preserved order might impact the performance.

rmaz added a comment.Fri, Oct 8, 12:23 PM

These are interesting results. I'm curious to measure the time spent in ASTReader::ReadMethodPool.

Updated numbers from today, looks like we added a lot more modular deps since last week.

Method# Module File Lookups# Module File Hits# Instance Methods Deserialized# Class Methods Deserializedms in ASTReader::ReadMethodPool
Baseline13914057725430411990715759
Set Dedupe1391405773881519125421
No external157677159550362428430

The time taken in ASTReader::ReadMethodPool ends up very close. This was previously dominated by Sema::addMethodToGlobalList, and both approaches will end up with the same number of calls to this method, so maybe this is not so surprising. The difference comes down to: is it faster to descend into all build dependent modules to deserialize only the methods you need or is it faster to deserialize from the first module with overlap and skip the duplicate inserts with a hashmap lookup. The numbers would suggest the latter, and this approach also has the benefit of requiring no other changes as the method insert order to the global list should be identical.

It's good to know ASTReader::ReadMethodPool is pretty close for both approaches. And as far as I understand, it includes the code

ReadMethodPoolVisitor Visitor(*this, Sel, PriorGeneration);
ModuleMgr.visit(Visitor);

so the module visiting doesn't seem to be too expensive.

I can be wrong but I like writing less data as it can result in savings for more projects. For example, if compile time is dominated by method deserialization and not by Sema::addMethodToGlobalList, we still should see an improvement. I can try to experiment on a bunch of other projects and their heavy .m files and report the results, if you'd like.

Also I've updated D110123 to preserve the order of methods (mostly, there are still some differences). Works better than the previous approach, compilation time difference is within the noise.

rmaz added a comment.Thu, Oct 14, 9:58 AM

I can be wrong but I like writing less data as it can result in savings for more projects. For example, if compile time is dominated by method deserialization and not by Sema::addMethodToGlobalList, we still should see an improvement. I can try to experiment on a bunch of other projects and their heavy .m files and report the results, if you'd like.

I think that would be a useful comparison, yes. For our code I consistently measure this approach as 10-20% slower on average, with very few outliers where the set deduplication approach is slower.

Sorry for the delay, I haven't finished testing more projects yet but here are my preliminary numbers

FileBaseline (s)Set Dedupe (s)No external (s)Set Dedupe (percentage of baseline)No external (percentage of baseline)
Project1. File11.261465750.9317558750.94367412573.863%74.808%
Project1. File21.4443218751.08683951.09580587575.249%75.870%
Project1. File31.24750250.940327250.94294387575.377%75.587%
Project1. File41.1977150.983316250.989868582.099%82.646%
Project1. File51.2524813750.943399250.94576675.322%75.511%

Methodology: clear a modules cache, compile a file once to pre-populate the cache, compile file 8 times and measure elapsed time, take the time average. Didn't include files using a precompiled header because that makes cleaning a module cache trickier. Standard deviation of the measurements is consistent between different approaches and doesn't go above 0.015, so I haven't included it here. "Project1" is an app, so it should have a fairly long chain of dependencies, though I haven't checked it in details.

So it looks like "no external" approach is slightly but consistently slower 😢 than "set dedupe" approach. I'm curious to get the results for an empty module cache because clean builds are also important for us. And if you see anything suspicious in the methodology, please let me know.

rmaz added a comment.Wed, Oct 20, 3:54 PM

Methodology: clear a modules cache, compile a file once to pre-populate the cache, compile file 8 times and measure elapsed time, take the time average.

This is the same approach I used, although with 3 tries.

So it looks like "no external" approach is slightly but consistently slower 😢 than "set dedupe" approach.

This agrees with what I see with our code too.

I'm curious to get the results for an empty module cache because clean builds are also important for us.

I should measure this too. What would you suggest for the approach, to clean the module cache before each build, retry and average? How much weight should be given to the clean vs populated module cache numbers?

And for the empty module cache the numbers are

FileBaseline (s)Set Dedupe (s)No external (s)Set Dedupe (percentage of baseline)No external (percentage of baseline)
Project1. File114.08067313.629611813.695448496.797%97.264%
Project1. File215.227346614.611179814.793587295.954%97.151%
Project1. File314.004756813.598817413.615552697.101%97.221%
Project1. File413.044987412.586731612.800584496.487%98.126%
Project1. File514.375667413.792857813.756204295.946%95.691% 💪

Though in this case the variance is much higher than before. Roughly it is 10 times higher and can reach 0.25. With such variance I think the difference between different fix approaches isn't meaningful and even comparing with the baseline is shaky. I haven't investigated further but suspect that the time of building modules dominates addMethodToGlobalList time and we cannot notice a meaningful difference.

Methodology note: take average of 5 compilations instead of 8 as each compilation is longer.

I'm curious to get the results for an empty module cache because clean builds are also important for us.

I should measure this too. What would you suggest for the approach, to clean the module cache before each build, retry and average? How much weight should be given to the clean vs populated module cache numbers?

I am compiling just a single file and before each compilation I

  1. Delete ModuleCache.noindex
  2. Create ModuleCache.noindex directory
  3. Create ModuleCache.noindex/Session.modulevalidation file

Then retry multiple times and average. Given my experience, calculating standard deviation can be useful too.

I'm wondering what your numbers will be but for me I think they are fairly meaningless due to high deviation. That's why I don't think it's worth assigning any weight to the clean module cache build (at least for individual files).

Curiously, the results for the second project are unexpected and bizarre.

FileBaseline (s)Set Dedupe (s)No external (s)Set Dedupe (percentage of baseline)No external (percentage of baseline)
Project2. File10.3785821250.3183643750.31831987584.094%84.082%
Project2. File20.2458131250.249316750.25345175101.425%103.107%
Project2. File30.246433250.2605548750.261508105.730%106.117%
Project2. File40.247455250.2568960.253167625103.815%102.308%
Project2. File50.24831350.2561638750.2635545103.161%106.138%

Standard deviation goes down again and doesn't exceed 0.02. This project is a big desktop app, so expect it to have long-isa dependency chains. Also, anecdotally, when I compared baseline and "no external" full project builds, "no external" was 20% faster.

My understanding is that due to the short compilation times the measurements are less reliable (haven't done anything heroic to make sure nothing else was running on my machine). And maybe these files are inherently more volatile because when I was selecting them for measurements, their compile time was >20s and now we have ~0.25s (with clean module cache it is still below 20s).

rmaz added a comment.EditedThu, Oct 21, 2:59 PM

I am seeing the clean build times behaving the same as the populated ones, just slower:

FileBaseline (s)Set DedupeNo External
IGMFVC.mm230194195

So given these numbers are we good to go ahead with set dedupe approach?

So given these numbers are we good to go ahead with set dedupe approach?

I'd rather get an opinion on this from other reviewers. It's not purely a numbers game, there can be other reasons to prefer one solution over another. I am biased, so I don't think I can make this call. If reviewers want a short summary instead of checking the entire discussion, we can prepare that.

If anybody is interested in the raw measurement data, I can provide it. The data is suitable for calculating the statistical significance between different methods using t-test.

So given these numbers are we good to go ahead with set dedupe approach?

I'd rather get an opinion on this from other reviewers. It's not purely a numbers game, there can be other reasons to prefer one solution over another. I am biased, so I don't think I can make this call. If reviewers want a short summary instead of checking the entire discussion, we can prepare that.

Can you summarize how each of the two proposed new architectures differ from the baseline?

Note also that clang/lib/Serialization/MultiOnDiskHashTable.h is an option if there's a tradeoff between redundantly repeating information in module files (optimize for storage) and the number of module files that need to be visited (optimize for having information available). I think I mentioned it way up thread (indirectly, when talking about identifier tables), but I'm not sure if it was considered.

If anybody is interested in the raw measurement data, I can provide it. The data is suitable for calculating the statistical significance between different methods using t-test.

We are targeting the use case where in impl.m we have

@import ImmediateDep1;
@import ImmediateDep2;
...
@import ImmediateDepN;

and each of ImmediateDep has @import SharedDep;.

For simplicity we'll consider that we are working with a single selector as each of them is processed separately and independently. When clang encounters a selector, it wants to collect methods corresponding to this selector from all modules to a GlobalMethodPool. As in GlobalMethodPool we store methods in a list, adding a new method requires traversing this list. It is possible to change the list to something else but this change isn't about that, in O(L*X) complexity we are reducing X (which can be ~100000) and not L (which is closer to ~1000).

In the baseline METHOD_POOL for each "ImmediateDep" contains methods from "SharedDep". When we add methods to the GlobalMethodPool, we try to add methods from all "ImmediateDep". As the result we iterate through GlobalMethodPool method list multiple time for each method from "SharedDep" as they are available in each "ImmediateDep".

Richard's idea is to put a DenseSet of encountered methods in front of GlobalMethodPool method list. This way duplicates from "SharedDep" can be detected and discarded quickly so we traverse a list only for each unique method and not for duplicates.

My idea is not to store any duplicates in METHOD_POOL. This is achieved by each module storing only their own methods and not storing any [transitively] imported methods. In turn, populating GlobalMethodPool requires traversing the full dependency graph once and loading methods from METHOD_POOL in corresponding modules compared to traversing only immediate dependencies as in the baseline.

I believe MultiOnDiskHashTable isn't really applicable to this performance issue. Even if we use it to store methods, it doesn't affect the amount of methods we process. And right now processing all the duplicates is the expensive part.