This is an archive of the discontinued LLVM Phabricator instance.

This change has caused major compile-time regressions in Rust (compiled modules no longer get reused in incremental builds), because modules get added to the index in different orders across compilations. On the surface, it seems like we could easily work around this by adding a sort over the module key. However, the set of modules involved in the compilation can actually change, e.g. if references to symbols in a module are added or removed so that the module no longer needs to be linked at all (or starts needing to be linked). Even with a sort, this is going to shift the module indices and cause at least unnecessary cache invalidation (I'm not sure whether it can also result in incorrect cache reuse).

The basic premise of this patch (that we can use the module order instead of the module key) seems to rely on a specific compilation model that is not valid for all ThinLTO consumers.
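To make the index-shifting concern concrete, here is a minimal sketch. It is not LLVM code: `assignIds`, `assignIdsSorted` and `idOf` are hypothetical helpers, and the sketch assumes that module IDs are simply handed out sequentially in registration order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical model (not LLVM's data structure): each module key gets the
// next sequential ID in the order it is registered.
std::map<std::string, uint64_t>
assignIds(const std::vector<std::string> &ModuleKeys) {
  std::map<std::string, uint64_t> Ids;
  for (const std::string &Key : ModuleKeys) {
    const uint64_t NextId = Ids.size();
    Ids.emplace(Key, NextId); // no-op if Key is already registered
  }
  return Ids;
}

// Same as above, but sorts the keys first (the "easy workaround").
std::map<std::string, uint64_t>
assignIdsSorted(std::vector<std::string> ModuleKeys) {
  std::sort(ModuleKeys.begin(), ModuleKeys.end());
  return assignIds(ModuleKeys);
}

// Look up the ID of a module that must already be registered.
uint64_t idOf(const std::map<std::string, uint64_t> &Ids,
              const std::string &Key) {
  auto It = Ids.find(Key);
  assert(It != Ids.end() && "Module not registered");
  return It->second;
}
```

With the keys `c`, `a`, `b` registered in that order, `assignIds` gives `c` the ID 0 while the sorted variant gives it 2. Once `a` drops out of the link, even the sorted variant gives `c` the ID 1, so anything derived from the ID changes although `c` itself did not.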

What about changing the code to sort using the module hash:

   llvm::sort(ImportModulesVector,
              [](const ImportModule &Lhs, const ImportModule &Rhs) -> bool {
-               return Lhs.getId() < Rhs.getId();
+               return Lhs.getHash() < Rhs.getHash();
              });

would that resolve your issue?

In D151165#4538667, @akyrtzi wrote:
What about changing the code to sort using the module hash:
   llvm::sort(ImportModulesVector,
              [](const ImportModule &Lhs, const ImportModule &Rhs) -> bool {
-               return Lhs.getId() < Rhs.getId();
+               return Lhs.getHash() < Rhs.getHash();
              });
would that resolve your issue?

From a cursory test that seems to work.

In D151165#4538190, @nikic wrote:

@tejohnson Can you please confirm the correctness of this change?

This change has caused major compile-time regressions in Rust (compiled modules no longer get reused in incremental builds), because modules get added to the index in different orders across compilations. On the surface, it seems like we could easily work around this by adding a sort over the module key. However, the set of modules involved in the compilation can actually change, e.g. if references to symbols in a module are added or removed so that the module no longer needs to be linked at all (or starts needing to be linked). Even with a sort, this is going to shift the module indices and cause at least unnecessary cache invalidation (I'm not sure whether it can also result in incorrect cache reuse).

The basic premise of this patch (that we can use the module order instead of the module key) seems to rely on a specific compilation model that is not valid for all ThinLTO consumers.

Thanks for the report. Yeah, I can see how using the module ID here can result in spurious differences. I think using the module hash should address the issue.

In D151165#4538667, @akyrtzi wrote:
What about changing the code to sort using the module hash:
   llvm::sort(ImportModulesVector,
              [](const ImportModule &Lhs, const ImportModule &Rhs) -> bool {
-               return Lhs.getId() < Rhs.getId();
+               return Lhs.getHash() < Rhs.getHash();
              });
would that resolve your issue?

This is better, @nikic can you confirm?

The module ID is an older concept that predates the module hash. We should probably remove that from the in-memory index completely, since it just takes up space and can lead to confusion about what values are stable and should be used. The numeric id is utilized in the Bitcode format for compactness, but I don't think we need it in memory anymore (quick scan of the codebase suggests not). Let me see if I can remove that once this issue is fixed.

I've just confirmed that using the hash works for the original, unreduced test cases as well.
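For readers following along, the difference between the two sort orders can be sketched in isolation. The types below are simplified, hypothetical stand-ins for the ones in the patch, and `toyKey` is a toy FNV-style mix used only for illustration; it is not the hasher LTO.cpp uses. The sketch also assumes that no two imported modules share a hash.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified, hypothetical stand-ins for the types used in the patch.
using ModuleHash = std::array<uint32_t, 5>;

struct ImportModule {
  uint64_t Id;     // order in which the module was added to the index
  ModuleHash Hash; // content hash of the module
};

enum class SortBy { Id, Hash };

// Return the module hashes in the order they would be fed to the hasher.
std::vector<ModuleHash> orderedHashes(std::vector<ImportModule> Imports,
                                      SortBy Order) {
  std::sort(Imports.begin(), Imports.end(),
            [Order](const ImportModule &Lhs, const ImportModule &Rhs) {
              return Order == SortBy::Id ? Lhs.Id < Rhs.Id
                                         : Lhs.Hash < Rhs.Hash;
            });
  std::vector<ModuleHash> Hashes;
  Hashes.reserve(Imports.size());
  for (const ImportModule &M : Imports)
    Hashes.push_back(M.Hash);
  return Hashes;
}

// Toy FNV-style mixing of the ordered hashes (illustration only).
uint64_t toyKey(const std::vector<ModuleHash> &Hashes) {
  uint64_t Key = 14695981039346656037ULL;
  for (const ModuleHash &H : Hashes)
    for (uint32_t Word : H) {
      Key ^= Word;
      Key *= 1099511628211ULL;
    }
  return Key;
}

// Usage sketch: the same two modules, registered in opposite orders.
void demoStableOrder() {
  const ModuleHash A{{1, 2, 3, 4, 5}}, B{{9, 8, 7, 6, 5}};
  const std::vector<ImportModule> Run1{{0, A}, {1, B}};
  const std::vector<ImportModule> Run2{{0, B}, {1, A}};
  // Sorting by ID feeds the hashes in a registration-dependent order.
  assert(orderedHashes(Run1, SortBy::Id) != orderedHashes(Run2, SortBy::Id));
  // Sorting by hash feeds them in the same order, so the key matches.
  assert(toyKey(orderedHashes(Run1, SortBy::Hash)) ==
         toyKey(orderedHashes(Run2, SortBy::Hash)));
}
```

Because the hash depends only on module contents, the resulting order does not change when modules are registered in a different order or when unrelated modules enter or leave the link.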

nikic mentioned this in D156525: [ThinLTO] Use module hash instead of module ID for cache key. Jul 28 2023, 5:26 AM

nikic mentioned this in rG279c2971951c: [ThinLTO] Use module hash instead of module ID for cache key. Jul 28 2023, 7:39 AM

In D151165#4538756, @tejohnson wrote:

The module ID is an older concept that predates the module hash. We should probably remove that from the in-memory index completely, since it just takes up space and can lead to confusion about what values are stable and should be used. The numeric id is utilized in the Bitcode format for compactness, but I don't think we need it in memory anymore (quick scan of the codebase suggests not). Let me see if I can remove that once this issue is fixed.

D156730

Revision Contents

Path                                                       Size
llvm/include/llvm/IR/ModuleSummaryIndex.h                  7 lines
llvm/lib/LTO/LTO.cpp                                       39 lines
llvm/test/ThinLTO/X86/cache-decoupled-from-filenames.ll    27 lines

Diff 524563

llvm/include/llvm/IR/ModuleSummaryIndex.h

@@ ... @@ public:

   /// Return module entry for module with the given \p ModPath.
   ModuleInfo *getModule(StringRef ModPath) {
     auto It = ModulePathStringTable.find(ModPath);
     assert(It != ModulePathStringTable.end() && "Module not registered");
     return &*It;
   }

+  /// Return module entry for module with the given \p ModPath.
+  const ModuleInfo *getModule(StringRef ModPath) const {
+    auto It = ModulePathStringTable.find(ModPath);
+    assert(It != ModulePathStringTable.end() && "Module not registered");
+    return &*It;
+  }
+
   /// Check if the given Module has any functions available for exporting
   /// in the index. We consider any module present in the ModulePathStringTable
   /// to have exported functions.
   bool hasExportedFunctions(const Module &M) const {
     return ModulePathStringTable.count(M.getModuleIdentifier());
   }

   const TypeIdSummaryMapTy &typeIds() const { return TypeIdMap; }

llvm/lib/LTO/LTO.cpp

@@ ... @@ for (uint64_t GUID : ExportsGUID) {
     // The export list can impact the internalization, be conservative here
     Hasher.update(ArrayRef<uint8_t>((uint8_t *)&GUID, sizeof(GUID)));
   }

   // Include the hash for every module we import functions from. The set of
   // imported symbols for each module may affect code generation and is
   // sensitive to link order, so include that as well.
   using ImportMapIteratorTy = FunctionImporter::ImportMapTy::const_iterator;
-  std::vector<ImportMapIteratorTy> ImportModulesVector;
+  struct ImportModule {
+    ImportMapIteratorTy ModIt;
+    const ModuleSummaryIndex::ModuleInfo *ModInfo;
+
+    StringRef getIdentifier() const { return ModIt->getKey(); }
+    const FunctionImporter::FunctionsToImportTy &getFunctions() const {
+      return ModIt->second;
+    }
+
+    const ModuleHash &getHash() const { return ModInfo->second.second; }
+    uint64_t getId() const { return ModInfo->second.first; }
+  };
+
+  std::vector<ImportModule> ImportModulesVector;
   ImportModulesVector.reserve(ImportList.size());

   for (ImportMapIteratorTy It = ImportList.begin(); It != ImportList.end();
        ++It) {
-    ImportModulesVector.push_back(It);
+    ImportModulesVector.push_back({It, Index.getModule(It->getKey())});
   }
+  // Order using moduleId integer which is based on the order the module was
+  // added.
   llvm::sort(ImportModulesVector,
-             [](const ImportMapIteratorTy &Lhs, const ImportMapIteratorTy &Rhs)
-                 -> bool { return Lhs->getKey() < Rhs->getKey(); });
-  for (const ImportMapIteratorTy &EntryIt : ImportModulesVector) {
-    auto ModHash = Index.getModuleHash(EntryIt->first());
+             [](const ImportModule &Lhs, const ImportModule &Rhs) -> bool {
+               return Lhs.getId() < Rhs.getId();
+             });
+  for (const ImportModule &Entry : ImportModulesVector) {
+    auto ModHash = Entry.getHash();
     Hasher.update(ArrayRef<uint8_t>((uint8_t *)&ModHash[0], sizeof(ModHash)));

-    AddUint64(EntryIt->second.size());
-    for (auto &Fn : EntryIt->second)
+    AddUint64(Entry.getFunctions().size());
+    for (auto &Fn : Entry.getFunctions())
       AddUint64(Fn);
   }

   // Include the hash for the resolved ODR.
   for (auto &Entry : ResolvedODR) {
     Hasher.update(ArrayRef<uint8_t>((const uint8_t *)&Entry.first,
                                     sizeof(GlobalValue::GUID)));
     Hasher.update(ArrayRef<uint8_t>((const uint8_t *)&Entry.second,
@@ ... @@ for (auto &GS : DefinedGlobals) {
     Hasher.update(
         ArrayRef<uint8_t>((const uint8_t *)&Linkage, sizeof(Linkage)));
     AddUsedCfiGlobal(GS.first);
     AddUsedThings(GS.second);
   }

   // Imported functions may introduce new uses of type identifier resolutions,
   // so we need to collect their used resolutions as well.
-  for (auto &ImpM : ImportList)
-    for (auto &ImpF : ImpM.second) {
-      GlobalValueSummary *S = Index.findSummaryInModule(ImpF, ImpM.first());
+  for (const ImportModule &ImpM : ImportModulesVector)
+    for (auto &ImpF : ImpM.getFunctions()) {
+      GlobalValueSummary *S =
+          Index.findSummaryInModule(ImpF, ImpM.getIdentifier());
       AddUsedThings(S);
       // If this is an alias, we also care about any types/etc. that the aliasee
       // may reference.
       if (auto *AS = dyn_cast_or_null<AliasSummary>(S))
         AddUsedThings(AS->getBaseObject());
     }

   auto AddTypeIdSummary = [&](StringRef TId, const TypeIdSummary &S) {

llvm/test/ThinLTO/X86/cache-decoupled-from-filenames.ll

This file was added.

; RUN: rm -rf %t && mkdir -p %t/1 %t/2 %t/3 %t/4
; RUN: opt -module-hash -module-summary %s -o %t/t.bc
; RUN: opt -module-hash -module-summary %S/Inputs/cache-import-lists1.ll -o %t/1/a.bc
; RUN: opt -module-hash -module-summary %S/Inputs/cache-import-lists2.ll -o %t/2/b.bc

; Tests that the hash for t is insensitive to the bitcode module filenames.

; RUN: rm -rf %t/cache
; RUN: llvm-lto2 run -cache-dir %t/cache -o %t.o %t/t.bc %t/1/a.bc %t/2/b.bc -r=%t/t.bc,main,plx -r=%t/t.bc,f1,lx -r=%t/t.bc,f2,lx -r=%t/1/a.bc,f1,plx -r=%t/1/a.bc,linkonce_odr,plx -r=%t/2/b.bc,f2,plx -r=%t/2/b.bc,linkonce_odr,lx
; RUN: ls %t/cache | count 3

; RUN: cp %t/1/a.bc %t/4/d.bc
; RUN: cp %t/2/b.bc %t/3/k.bc
; RUN: llvm-lto2 run -cache-dir %t/cache -o %t.o %t/t.bc %t/4/d.bc %t/3/k.bc -r=%t/t.bc,main,plx -r=%t/t.bc,f1,lx -r=%t/t.bc,f2,lx -r=%t/4/d.bc,f1,plx -r=%t/4/d.bc,linkonce_odr,plx -r=%t/3/k.bc,f2,plx -r=%t/3/k.bc,linkonce_odr,lx
; RUN: ls %t/cache | count 3

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define void @main() {
  call void @f1()
  call void @f2()
  ret void
}

declare void @f1()
declare void @f2()

[ThinLTO] Make the cache key independent of the module identifier paths
Closed, Public
