This is an archive of the discontinued LLVM Phabricator instance.

[ThinLTO] Make the cache key independent of the module identifier paths
ClosedPublic

Authored by akyrtzi on May 22 2023, 3:20 PM.

Details

Summary

Otherwise there are cache misses just from changing the name of a path, even though the input modules did not change.

rdar://109672225

Diff Detail

Unit TestsFailed

Event Timeline

akyrtzi created this revision.May 22 2023, 3:20 PM
akyrtzi requested review of this revision.May 22 2023, 3:20 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 22 2023, 3:20 PM
This revision is now accepted and ready to land.May 22 2023, 3:56 PM
akyrtzi updated this revision to Diff 524540.May 22 2023, 5:31 PM

Use Index.getModule() to get both ModuleId and the hash, so that we do only one lookup for both.

This revision was landed with ongoing or failed builds.May 22 2023, 9:45 PM
This revision was automatically updated to reflect the committed changes.

@tejohnson Can you please confirm the correctness of this change?

This change has caused major compile-time regressions in Rust (compiled modules no longer get reused in incremental builds), because modules get added to the index in different orders across compilations. On the surface, it seems like we could easily work around this by adding a sort over the module key. However, the modules involved in the compilation can actually change, e.g. if references to symbols in a module are added/removed so that the module does not need to be linked at all (or starts needing to be linked). Even with a sort, this is going to shift the module indices, and result in at least unnecessary cache invalidation (I'm not sure it can result in result in incorrect cache reuse).

The basic premise of this patch (that we can use the module order instead of the module key) seems to be premised on a specific compilation model that is not valid for all thin lto consumers.

What about changing the code to sort using the module hash:

   llvm::sort(ImportModulesVector,
              [](const ImportModule &Lhs, const ImportModule &Rhs) -> bool {
-               return Lhs.getId() < Rhs.getId();
+               return Lhs.getHash() < Rhs.getHash();
              });

would that resolve your issue?

nikic added a comment.Jul 27 2023, 6:50 AM

What about changing the code to sort using the module hash:

   llvm::sort(ImportModulesVector,
              [](const ImportModule &Lhs, const ImportModule &Rhs) -> bool {
-               return Lhs.getId() < Rhs.getId();
+               return Lhs.getHash() < Rhs.getHash();
              });

would that resolve your issue?

From a cursory test that seems to work.

@tejohnson Can you please confirm the correctness of this change?

This change has caused major compile-time regressions in Rust (compiled modules no longer get reused in incremental builds), because modules get added to the index in different orders across compilations. On the surface, it seems like we could easily work around this by adding a sort over the module key. However, the modules involved in the compilation can actually change, e.g. if references to symbols in a module are added/removed so that the module does not need to be linked at all (or starts needing to be linked). Even with a sort, this is going to shift the module indices, and result in at least unnecessary cache invalidation (I'm not sure it can result in result in incorrect cache reuse).

The basic premise of this patch (that we can use the module order instead of the module key) seems to be premised on a specific compilation model that is not valid for all thin lto consumers.

Thanks for the report, yeah I can see how using the module id here can result in spurious differences. Using the module hash should address the issue I think.

What about changing the code to sort using the module hash:

   llvm::sort(ImportModulesVector,
              [](const ImportModule &Lhs, const ImportModule &Rhs) -> bool {
-               return Lhs.getId() < Rhs.getId();
+               return Lhs.getHash() < Rhs.getHash();
              });

would that resolve your issue?

This is better, @nikic can you confirm?

The module ID is an older concept that predates the module hash. We should probably remove that from the in-memory index completely, since it just takes up space and can lead to confusion about what values are stable and should be used. The numeric id is utilized in the Bitcode format for compactness, but I don't think we need it in memory anymore (quick scan of the codebase suggests not). Let me see if I can remove that once this issue is fixed.

I've just confirmed that using the hash works for the original, unreduced test cases as well.

The module ID is an older concept that predates the module hash. We should probably remove that from the in-memory index completely, since it just takes up space and can lead to confusion about what values are stable and should be used. The numeric id is utilized in the Bitcode format for compactness, but I don't think we need it in memory anymore (quick scan of the codebase suggests not). Let me see if I can remove that once this issue is fixed.

D156730