This is an archive of the discontinued LLVM Phabricator instance.

[ThinLTO] Allow multiple summary entries.
AbandonedPublic

Authored by fhahn on Jul 6 2017, 12:41 PM.

Details

Summary

When combining modules with entries that allow multiple definitions,
the combined index can contain multiple summaries for a single GUID.
Unless I miss something here, we should be able to continue by just
picking the first summary, if all entries in the list allow
multiple definitions.

I am not sure if we should relax the assertion here or select a single
summary when building the index?

Diff Detail

Event Timeline

fhahn created this revision.Jul 6 2017, 12:41 PM
tejohnson edited edge metadata.Jul 6 2017, 2:23 PM

How are you invoking this? Typically this path is invoked for distributed builds, when the individual index files for each backend were written during the thin link (e.g. via gold plugin's -thinlto-index-only option). In that case, we only expect a single summary to exist per GUID, since it is indicating which copy to import. We don't typically pass in the entire combined index (which would cause other issues as I believe we are simply importing every summary in the index on this path).

fhahn added a comment.Jul 11 2017, 7:38 AM

To reproduce the issue you could use

+; Check that we only add a single summary entry for multiple definitions
+; of a linkonce_odr function
+
+; RUN: opt -module-summary %s -o %t1.bc
+; RUN: opt -module-summary %s -o %t2.bc
+; RUN: llvm-lto -thinlto-action=thinlink -o %t3.bc %t1.bc %t2.bc
+; RUN: llvm-bcanalyzer -dump %t3.bc | FileCheck %s
+
+define linkonce_odr void @foo(i8*) {
+  ret void
+}
+; CHECK: <GLOBALVAL_SUMMARY_BLOCK
+; CHECK:  <VALUE_GUID
+; CHECK-NEXT:  <COMBINED
+; CHECK-NOT:  <COMBINED
+; CHECK: </GLOBALVAL_SUMMARY_BLOCK>

I've been debugging this issue using an index + bitcode files provided by a third party, I'll try to get information on how they generated the index.

To reproduce the issue you could use

+; Check that we only add a single summary entry for multiple definitions
+; of a linkonce_odr function
+
+; RUN: opt -module-summary %s -o %t1.bc
+; RUN: opt -module-summary %s -o %t2.bc
+; RUN: llvm-lto -thinlto-action=thinlink -o %t3.bc %t1.bc %t2.bc
+; RUN: llvm-bcanalyzer -dump %t3.bc | FileCheck %s
+
+define linkonce_odr void @foo(i8*) {
+  ret void
+}
+; CHECK: <GLOBALVAL_SUMMARY_BLOCK
+; CHECK:  <VALUE_GUID
+; CHECK-NEXT:  <COMBINED
+; CHECK-NOT:  <COMBINED
+; CHECK: </GLOBALVAL_SUMMARY_BLOCK>

I've been debugging this issue using an index + bitcode files provided by a third party, I'll try to get information on how they generated the index.

I understand how you can get multiple definitions of linkonce functions into a combined index dumped to a file. But the clang code you are changing here is only invoked for a ThinLTO backend compile via clang (to support distributed builds), and should not be passed the full combined index. As mentioned in my first response, this isn't going to do what you want anyway, as the ThinLTO backend invoked via this path will import *everything* with a summary in the provided index (see a few lines down from your change, where we populate the ImportList entry for each summary in the provided index). Clang for a ThinLTO backend should only be passed an individual module's slice of the index, which can be produced for each of the input modules from llvm-lto via the -thinlto-action=distributedindexes option, or if linking via gold, with the -plugin-opt=thinlto-index-only option.

How is the third party trying to build with ThinLTO? If they are attempting distributed ThinLTO builds, then they need to invoke the thin link in such a way that they get these individual index files.

This error happens when I try to triage a thinLTO failure on ARM.

Initially I got some error like
Instruction does not dominate all uses!

%205 = bitcast i1 (%"class.blink::LayoutObject"*)** %194 to i8*, !dbg !51180
%200 = getelementptr i8, i8* %205, i32 ptrtoint (i8* @__typeid__ZTSN5blink12LayoutObjectE_100_byte to i32), !dbg !51170

LLVM ERROR: Broken function found, compilation aborted!
clang-5.0: error: linker command failed with exit code 1 (use -v to see invocation)

I then used distributed thinLTO to try to find a reduced test case.
I use -Wl,-plugin-opt,thinlto-index-only=file to get the index file for each bite code and I
run
$cmd -fthinlto-index=${obj}.thinlto.bc -x ir ${obj} -o a.o -c

to get the error shown in this change.

This error happens when I try to triage a thinLTO failure on ARM.

Initially I got some error like
Instruction does not dominate all uses!

%205 = bitcast i1 (%"class.blink::LayoutObject"*)** %194 to i8*, !dbg !51180
%200 = getelementptr i8, i8* %205, i32 ptrtoint (i8* @__typeid__ZTSN5blink12LayoutObjectE_100_byte to i32), !dbg !51170

LLVM ERROR: Broken function found, compilation aborted!
clang-5.0: error: linker command failed with exit code 1 (use -v to see invocation)

I then used distributed thinLTO to try to find a reduced test case.
I use -Wl,-plugin-opt,thinlto-index-only=file to get the index file for each bite code and I
run
$cmd -fthinlto-index=${obj}.thinlto.bc -x ir ${obj} -o a.o -c

to get the error shown in this change.

That's really odd - I'm not sure why you would get more than one summary for a GUID in the individual index files. Can you package up your thin link inputs and command for me to take a look?

pcc added a subscriber: pcc.Jul 11 2017, 12:43 PM

I've sent a reproduce test case to tejohnson.

I've sent a reproduce test case to tejohnson.

FYI I tracked down what is going on here and reproduced with a small test case. Essentially, two copies of a linkonce odr function were optimized somewhat differently by the compile step and have different instruction counts, with the prevailing one being larger. Depending on the order they are encountered during the thin link's callgraph traversal, we may initially decide to import the smaller non-prevailing copy, and later decide to import the larger prevailing copy. This should be fixed in the function importer, which needs to make a decision about which copy should be imported (the fix is simple, I just need to decide which fix is "right").

This is different than the situation Florian encountered, which is due to passing the full combined index to clang when invoking a distributed backend process, which we don't want users to do. Florian, I'd still like to understand better what the third party you are fixing this for is trying to do, so I can advise.

fhahn added a comment.Jul 12 2017, 2:24 PM

It's @yunlian, so if the issue you described above covers his failure, than that's great. @tejohnson do you have time to work on a fix soon? Otherwise I could give it a try, I would probably need a few pointers though, as I am not really familiar with ThinLTO.

Passing in the full combined index I generated manually was just a way to produce a similar failure to what I was seeing with the index @yunlian sent me.

mehdi_amini edited edge metadata.Jul 12 2017, 5:29 PM

It's @yunlian, so if the issue you described above covers his failure, than that's great. @tejohnson do you have time to work on a fix soon? Otherwise I could give it a try, I would probably need a few pointers though, as I am not really familiar with ThinLTO.

A good starting point is static const GlobalValueSummary *selectCallee() in llvm/lib/Transforms/IPO/FunctionImport.cpp ; but it is likely not trivial to solve.

It's @yunlian, so if the issue you described above covers his failure, than that's great. @tejohnson do you have time to work on a fix soon? Otherwise I could give it a try, I would probably need a few pointers though, as I am not really familiar with ThinLTO.

A good starting point is static const GlobalValueSummary *selectCallee() in llvm/lib/Transforms/IPO/FunctionImport.cpp ; but it is likely not trivial to solve.

Not necessarily nontrivial from a coding perspective - but the question is how do we want to solve it? E.g. I have a very simple fix that addressed the problem in my test case, by always selecting the first one (assumes that the first is the prevailing, which is generally true).

Should we always select the prevailing copy? E.g. just return the first summary (which is likely to belong to the prevailing copy) if any summary in the list is under the threshold. In my test case and in Yunlian's, we ended up selecting the prevailing copy eventually. But it's possible that only one or more non-prevailing copies will ever be selected. We could just assume that if any copy is under the threshold, that the prevailing is still the best copy to import. But I suppose the prevailing copy could be significantly larger if the inlining during the compile step was very different.

Or, we could always import the copy with the smallest instruction count. But if we have profile data, that profile data likely was collected on the (same) prevailing copy and perhaps that copy was just more aggressively inlined (and might be the better optimized than a non-prevailing copy with a smaller instruction count, which might not have profile data if the early inlining was different leading to the profile not matching when compiling that copy).

Or, we could keep the same selection algorithm, and ensure that we stick with whatever was the first summary selected (in the test case, that was the smaller non-prevailing copy). I.e. keep track of which GUIDs/summaries are already imported for a module, and ensure we don't pick a different one (we do want to reprocess it though if we ever encounter a call to that GUID with a higher threshold, since we may decide to import more of its callees).

Or, we could go back later and see if we have selected more than one summary for a GUID, and pick one (the first one? that's what essentially ends up happening when we have multiple and are doing in-process ThinLTO backends, the IRLinker will likely just keep the first one and ignore the subsequent copies that we try to link in). But that requires some additional post-processing that would be good to avoid if possible.

Thoughts?

Teresa: can you describe a bit more how we end up importing two summaries for the same GUID? I would expect that after importing one, we don't even come to try to import another since we already have one.

Teresa: can you describe a bit more how we end up importing two summaries for the same GUID? I would expect that after importing one, we don't even come to try to import another since we already have one.

Because of this:
http://llvm-cs.pcc.me.uk/lib/Transforms/IPO/FunctionImport.cpp#261

/// Since the traversal of the call graph is DFS, we can revisit a function
/// a second time with a higher threshold. In this case, it is added back to
/// the worklist with the new threshold.

The goal is that if we revisit with a higher threshold (which we do in the test case), that we want to reconsider downstream callees again. E.g. if we have foo() calls bar() calls baz(), and we first decided to import bar() into foo's module, but the threshold was not high enough to import baz, if we later encounter this call from foo->bar again with a higher threshold (because we hit it on a path that was either hotter or shorter), we want to reconsider whether to import baz().

The issue here is that when we reconsider "foo" we pick a different copy (earlier in the list, but skipped initially due to having an instruction count above the first (lower) threshold).

Thanks, very clear :)

I would expect that if we reprocess a GUID we invalidate the previous import of the same GUID. Right now my impression is that the issue is that ImportList[ExportModulePath][VI.getGUID()] is indexed on the importing module. So it'd require for every new import to loop over the ImportList map and delete the GUID on each inner map.

Alternatively (and likely preferably from a compile-time point of view to limit the list of import), we could keep a map of GUID->Summary and reuse it before trying to select a new callee.

Thanks, very clear :)

I would expect that if we reprocess a GUID we invalidate the previous import of the same GUID. Right now my impression is that the issue is that ImportList[ExportModulePath][VI.getGUID()] is indexed on the importing module. So it'd require for every new import to loop over the ImportList map and delete the GUID on each inner map.

Right, that was the processing I was hoping to avoid.

Alternatively (and likely preferably from a compile-time point of view to limit the list of import), we could keep a map of GUID->Summary and reuse it before trying to select a new callee.

Yep, that was my third option. I only hesitate because it seems like it might be better to select the prevailing copy when we can, but perhaps it doesn't matter much.

Alternatively (and likely preferably from a compile-time point of view to limit the list of import), we could keep a map of GUID->Summary and reuse it before trying to select a new callee.

Yep, that was my third option. I only hesitate because it seems like it might be better to select the prevailing copy when we can, but perhaps it doesn't matter much.

I'm not sure I understand what is your first option? Always selecting the prevailing copy?
Unless doing a topological order for visiting (which I don't think we could reasonably do), I don't see how you can ensure that: on your first visit you may see the prevailing not eligible, but it may become eligible on the second visit, at which point you need to do something about the first copy you already selected.

Alternatively (and likely preferably from a compile-time point of view to limit the list of import), we could keep a map of GUID->Summary and reuse it before trying to select a new callee.

Yep, that was my third option. I only hesitate because it seems like it might be better to select the prevailing copy when we can, but perhaps it doesn't matter much.

I'm not sure I understand what is your first option? Always selecting the prevailing copy?
Unless doing a topological order for visiting (which I don't think we could reasonably do), I don't see how you can ensure that: on your first visit you may see the prevailing not eligible, but it may become eligible on the second visit, at which point you need to do something about the first copy you already selected.

My first option was if any copy is under the threshold, simply pick the prevailing copy (it may be over threshold, but assume we want that one anyway). Another possibility is to only allow importing of the prevailing copy, if it is over the inst limit, too bad. The main reason I think that could be better is when we have PGO - the prevailing copy would for sure have the matched PGO data, but the non-prevailing copies may not.

My first option was if any copy is under the threshold, simply pick the prevailing copy (it may be over threshold, but assume we want that one anyway). Another possibility is to only allow importing of the prevailing copy, if it is over the inst limit, too bad. The main reason I think that could be better is when we have PGO - the prevailing copy would for sure have the matched PGO data, but the non-prevailing copies may not.

Ah makes sense, both way are fine with me.

I'd rather limit to the prevailing copy though and too bad if it is over the limit, to avoid the possibility of too large importing just because there was a small non-prevailing somewhere, but likely a rare case.

Limiting to the prevailing copy means that we could prune the index early after symbol resolution to keep only the prevailing summary, which could speed-up all the process (and possibly get more accurate dead-stripping if we don't use the prevailing information there already?).

My first option was if any copy is under the threshold, simply pick the prevailing copy (it may be over threshold, but assume we want that one anyway). Another possibility is to only allow importing of the prevailing copy, if it is over the inst limit, too bad. The main reason I think that could be better is when we have PGO - the prevailing copy would for sure have the matched PGO data, but the non-prevailing copies may not.

Ah makes sense, both way are fine with me.

I'd rather limit to the prevailing copy though and too bad if it is over the limit, to avoid the possibility of too large importing just because there was a small non-prevailing somewhere, but likely a rare case.

SGTM. That's also very simple to implement. I've just been to swamped with other things so far to do it. Just chatted with Yunlian, he is not blocked.

Limiting to the prevailing copy means that we could prune the index early after symbol resolution to keep only the prevailing summary, which could speed-up all the process (and possibly get more accurate dead-stripping if we don't use the prevailing information there already?).

I think we want to keep the other copies so we can mark their new linkage types (e.g. demoted to available_externally) for use in the backends. Dead stripping should use prevailing info, think there is a todo there.

My first option was if any copy is under the threshold, simply pick the prevailing copy (it may be over threshold, but assume we want that one anyway). Another possibility is to only allow importing of the prevailing copy, if it is over the inst limit, too bad. The main reason I think that could be better is when we have PGO - the prevailing copy would for sure have the matched PGO data, but the non-prevailing copies may not.

Ah makes sense, both way are fine with me.

I'd rather limit to the prevailing copy though and too bad if it is over the limit, to avoid the possibility of too large importing just because there was a small non-prevailing somewhere, but likely a rare case.

SGTM. That's also very simple to implement. I've just been to swamped with other things so far to do it. Just chatted with Yunlian, he is not blocked.

D35436

fhahn abandoned this revision.Jul 15 2017, 7:11 AM

My understanding is that D35436 prevents the case described in this review from happening. Thanks for having a look!