This is an archive of the discontinued LLVM Phabricator instance.

ThinLTO: add early "dead-stripping" on the Index
ClosedPublic

Authored by tejohnson on Aug 13 2016, 11:57 PM.

Details

Summary

Using the linker-supplied list of "preserved" symbols, we can compute
the list of "dead" symbols, i.e. the one that are not reachable from
a "preserved" symbol transitively on the reference graph.
Right now we are using this information to mark these functions as
non-eligible for import.

The impact is two folds:

  • Reduction of compile time: we don't import these functions anywhere or import the function these symbols are calling.
  • The limited number of import/export leads to better internalization.

Diff Detail

Repository
rL LLVM

Event Timeline

mehdi_amini retitled this revision from to ThinLTO: add early "dead-stripping" on the Index.
mehdi_amini updated this object.
mehdi_amini added reviewers: tejohnson, davidxl.
mehdi_amini added a subscriber: llvm-commits.
tejohnson added inline comments.Aug 15 2016, 7:34 AM
include/llvm/Transforms/IPO/FunctionImport.h
98 ↗(On Diff #67971)

Please also add this handling to LTO.cpp so that the optimization is available for all client linkers. As a bonus, then DeadSymbols doesn't need ot be optional. For LTO.cpp, the set of preserved GUIDs is currently computed just below the call to ComputeCrossModuleImport as ExportedGUIDs, and can simply be moved up so that it can be passed to computeDeadSymbols.

Alternatively, do we want to instead pass in the set of preserved GUIDs to ComputeCrossModuleImport and invoke computeDeadSymbols there, or do we anticipate that there will be other uses outside of ComptueCrossModuleImport? If not, then I think it is cleaner to do the dead symbol computation there.

lib/LTO/ThinLTOCodeGenerator.cpp
529 ↗(On Diff #67971)

There are now a bunch of duplicated copies of the same sequence in this source file: collectDefinedGVSummariesPerModule, computeGUIDPreservedSymbols, computeDeadSymbols, ComputeCrossModuleImport. Probably time to refactor this sequence into a helper? Ok as a follow-on patch if you prefer though.

lib/Transforms/IPO/FunctionImport.cpp
467 ↗(On Diff #67971)

Do you mean we should only make the prevailing copy's references live here?

This should be pretty straightforward, just pull out the detection of prevailing copies from resolveWeakForLinkerInIndex and pass in a callback. For LTO.cpp, the isPrevailing lambda definition is just below the call to ComputeCrossModuleImport and could be moved up. Ok as a follow on patch though.

473 ↗(On Diff #67971)

Why emit the referenced GUID twice here and in the below loop, rather than just once with the default type (which is already uint64_t so first cast unnecessary).

test/ThinLTO/X86/deadstrip.ll
15 ↗(On Diff #67971)

Might want to add that bar* are dead because they should have been inlined into main.

Add support in the new LTO API

mehdi_amini added inline comments.Aug 22 2016, 9:01 AM
include/llvm/Transforms/IPO/FunctionImport.h
98 ↗(On Diff #68788)

I tried moved the computation to ComputeCrossModuleImport. But if you look at LTO.cpp runThinLTO() we reuse the DeadSymbols set for the lambda isExported used as a callback for thinLTOInternalizeAndPromoteInIndex

lib/LTO/ThinLTOCodeGenerator.cpp
529 ↗(On Diff #68788)

The move to the new API should fix it!

lib/Transforms/IPO/FunctionImport.cpp
467 ↗(On Diff #68788)

I tried to do it, but figure it'll be easy after the ThinLTOCodeGenerator is moved the new API, so I don't have multiple call sites to convert to pass a lambda, so I'll leave it for a future patch :)

473 ↗(On Diff #68788)

That was just debugging.

Address comments

clang-format

tejohnson added inline comments.Aug 24 2016, 7:36 PM
lib/LTO/LTO.cpp
200 ↗(On Diff #68868)

This should be:

GlobalRes.VisibleToRegularObj |= Res.VisibleToRegularObj;

with that fix the gold tests pass.

201 ↗(On Diff #68868)

New braces around else block not needed

603 ↗(On Diff #68868)

If a symbol is in ExportedGUIDs, wouldn't it necessarily also be in GUIDPreservedSymbols and therefore not dead (if it is VisibleToRegularObj the partition would also have been set to External). So do we even need the ExportedGUIDs check here? And if not, do we need to compute it?

test/ThinLTO/X86/deadstrip.ll
18 ↗(On Diff #68868)

Should be %t.out.1.3.import.bc

tejohnson edited edge metadata.Aug 24 2016, 8:21 PM

3 of the spec cpu2006 benchmarks also fail due to undefined refs.

mehdi_amini marked 10 inline comments as done.
mehdi_amini edited edge metadata.

Address comments, thanks!

mehdi_amini marked an inline comment as done.Aug 24 2016, 8:33 PM
mehdi_amini marked 2 inline comments as done.Aug 24 2016, 8:34 PM

Forgot to commit the test fix...

Unfortunately the spec linking failures were with the fix to VisibleInRegularObj, so will still need to dig into those.

lib/LTO/LTO.cpp
673 ↗(On Diff #69198)

Think this is dead now.

tejohnson commandeered this revision.Dec 19 2016, 1:33 PM
tejohnson edited reviewers, added: mehdi_amini; removed: tejohnson.

Taking this over as discussed last week on IRC. Have an updated patch and some analysis to add to it.

pcc added a subscriber: pcc.Dec 19 2016, 1:35 PM
tejohnson updated this revision to Diff 81995.Dec 19 2016, 1:39 PM
tejohnson removed a subscriber: pcc.

Updated patch.
All invocations of IsExported should check that:

(!DeadSymbols.count(GUID) && ExportedGUIDs.count(GUID))

I had earlier suggested that we only needed to check that !DeadSymbols,
but this is wrong and was detected when I fixed the llvm-lto2 based
test (which was previously not actually checking the output).
I also updated the IsExported lambdas in ThinLTOCodeGenerator.cpp
to use the same test, so that DeadSymbols are filtered out, otherwise
the benefit would not be seen there.

There are still correctness issues, which I will send an update on shortly.

This is getting undefined references in the link of the SPEC C++ apps. Looking at one of the cases, it is caused by incorrectly determining that some references are dead when in fact they have a reference via the llvm.global_ctors variable. The index correctly reflects the chain of references starting at the llvm.global_ctors variable.

However, this symbol is never processed by the symbol resolution handling because it is skipped when looking for symbols in InputFile::Symbol::shouldSkip() - this is because while it is flagged as SF_Global, it is also flagged as SF_FormatSpecific (anything starting with "llvm." is in ModuleSymbolTable::getSymbolFlags (this is not new logic, although it was moved here recently). (Changing the InputFile to not skip this causes issues because the linker doesn't know the semantics of this special appending variable and complains about dup symbols.)

It seems like we need to add anything referenced from at least some of these "FormatSpecific" values as they are potential roots in the dead symbol analysis. The best way I can think of off the top of my head is to write a special InputFile symbol walker that finds these, then somehow cache them so we can tell LTO about them (the symbols are walked in gold when we first look at each file to see if we want to claim it, whereas LTO is not constructed until the linker has analyzed all symbols).

mehdi_amini edited edge metadata.Dec 19 2016, 2:33 PM

As an alternative, have you considered having a bit "used" (or "do not dead strip") that would express in the summary exactly what we want?

As an alternative, have you considered having a bit "used" (or "do not dead strip") that would express in the summary exactly what we want?

Yes, that would be another alternative, I thought about that but was hoping it could be avoided, but maybe not easily. The bit would probably need to be set for any variable that starts with "llvm." to be safely conservative (e.g. we should also consider anything in llvm.used or llvm.compiler.used as potentially live).

pcc edited edge metadata.Dec 19 2016, 2:48 PM

This information will probably need to be encoded in the summary somehow once we're at the point where we no longer need to load the module during symbol resolution.

I think at this point we just need to collect a set of GC roots in add{Regular,Thin}LTO in the same way that we collect the set of used variables with collectUsedGlobalVariables.

In D23488#627154, @pcc wrote:

This information will probably need to be encoded in the summary somehow once we're at the point where we no longer need to load the module during symbol resolution.

At that point we will have a separately loadable symbol table in the bitcode file, but there again I guess we will not know anything about these llvm.* symbols in the table, right?

I think at this point we just need to collect a set of GC roots in add{Regular,Thin}LTO in the same way that we collect the set of used variables with collectUsedGlobalVariables.

Presumably using a new flag added to the summary, right? Or did you have a different approach in mind for now?

Right now the patch uses the set of preserved GUIDs computed from the GlobalResolution info as the roots in runThinLTO (see invocation of computeDeadSymbols). So probably we would just use the summary flag in computeDeadSymbols to add to the set of preserved GUIDs passed in. I.e. not in addThinLTO (this is for index-based internalization/dead stripping, so I don't think regular LTO comes into play).

As an alternative, have you considered having a bit "used" (or "do not dead strip") that would express in the summary exactly what we want?

Yes, that would be another alternative, I thought about that but was hoping it could be avoided, but maybe not easily. The bit would probably need to be set for any variable that starts with "llvm." to be safely conservative (e.g. we should also consider anything in llvm.used or llvm.compiler.used as potentially live).

llvm.used is definitely a must-be root, llvm.compiler.used are supposed to be able to be dead-stripped at link-time.

pcc added a comment.Dec 19 2016, 4:04 PM
In D23488#627154, @pcc wrote:

This information will probably need to be encoded in the summary somehow once we're at the point where we no longer need to load the module during symbol resolution.

At that point we will have a separately loadable symbol table in the bitcode file, but there again I guess we will not know anything about these llvm.* symbols in the table, right?

I was envisioning that the live bits in the global summary wouldn't be applied to the llvm.* symbols themselves but to the globals referenced by them.

I think at this point we just need to collect a set of GC roots in add{Regular,Thin}LTO in the same way that we collect the set of used variables with collectUsedGlobalVariables.

Presumably using a new flag added to the summary, right? Or did you have a different approach in mind for now?

Yes, it could be done in the summary. I was also thinking you could have a function called something like collectGCRoots that does something like this:

void collectGCRoots(Module *M, SmallSet<GlobalValue *> &GCRoots) {
  collectUsedGlobalVariables(M, GCRoots, true);
  collectUsedGlobalVariables(M, GCRoots, false);
  // similarly for llvm.global_ctors and llvm.global_dtors
}

then use GCRoots to decide whether to set the bit in GlobalResolution. But now that I think about it, I think I prefer the approach of setting live bits in the summary, for example it would seem to make it much simpler to pull the live lists of type identifiers out of the combined summary for CFI/devirtualization.

Also think about how this interacts with regular LTO. You probably also want anything referenced from the regular LTO module to be treated as a GC root.

Right now the patch uses the set of preserved GUIDs computed from the GlobalResolution info as the roots in runThinLTO (see invocation of computeDeadSymbols). So probably we would just use the summary flag in computeDeadSymbols to add to the set of preserved GUIDs passed in. I.e. not in addThinLTO (this is for index-based internalization/dead stripping, so I don't think regular LTO comes into play).

Yes, or change the computeDeadSymbols function to be implemented only in terms of the live bits. Then adding the GC roots from the linker (i.e. symbols referenced from regular objects or a regular LTO module) would be implemented by setting live bits on the roots.

pcc added a comment.Dec 19 2016, 4:18 PM

As an alternative, have you considered having a bit "used" (or "do not dead strip") that would express in the summary exactly what we want?

Yes, that would be another alternative, I thought about that but was hoping it could be avoided, but maybe not easily. The bit would probably need to be set for any variable that starts with "llvm." to be safely conservative (e.g. we should also consider anything in llvm.used or llvm.compiler.used as potentially live).

llvm.used is definitely a must-be root, llvm.compiler.used are supposed to be able to be dead-stripped at link-time.

I think we will need logic for removing entries from llvm.compiler.used then. Consider this scenario:

module A:

@llvm.compiler.used = [@f]

define internal void @f() {
  call void @g()
  ret void
}

declare void @g()

module B:

define void @g() {
  ret void
}

In this case we can't dead strip f by internalizing as that would end up keeping the reference to the dead function g, we need to actually remove it.

In D23488#627212, @pcc wrote:

As an alternative, have you considered having a bit "used" (or "do not dead strip") that would express in the summary exactly what we want?

Yes, that would be another alternative, I thought about that but was hoping it could be avoided, but maybe not easily. The bit would probably need to be set for any variable that starts with "llvm." to be safely conservative (e.g. we should also consider anything in llvm.used or llvm.compiler.used as potentially live).

llvm.used is definitely a must-be root, llvm.compiler.used are supposed to be able to be dead-stripped at link-time.

I think we will need logic for removing entries from llvm.compiler.used then. Consider this scenario:

module A:

@llvm.compiler.used = [@f]

define internal void @f() {
  call void @g()
  ret void
}

declare void @g()

module B:

define void @g() {
  ret void
}

In this case we can't dead strip f by internalizing as that would end up keeping the reference to the dead function g, we need to actually remove it.

Yes. And since the llvm.compiler.used symbols aren't supposed to be touched by the compiler, I'm not sure what the legality is of removing from there for this optimization done in the thin link + thinlto backend. Probably not worth special casing llvm.compiler.used - just treat it conservatively (as gc roots).

In D23488#627205, @pcc wrote:
In D23488#627154, @pcc wrote:

This information will probably need to be encoded in the summary somehow once we're at the point where we no longer need to load the module during symbol resolution.

At that point we will have a separately loadable symbol table in the bitcode file, but there again I guess we will not know anything about these llvm.* symbols in the table, right?

I was envisioning that the live bits in the global summary wouldn't be applied to the llvm.* symbols themselves but to the globals referenced by them.

Either one works for the index-based dead stripping. But my point was we need something to flag this in the summary because these symbols won't be in the bitcode symbol table (essentially the situation we're running into here).

I think at this point we just need to collect a set of GC roots in add{Regular,Thin}LTO in the same way that we collect the set of used variables with collectUsedGlobalVariables.

Presumably using a new flag added to the summary, right? Or did you have a different approach in mind for now?

Yes, it could be done in the summary. I was also thinking you could have a function called something like collectGCRoots that does something like this:

void collectGCRoots(Module *M, SmallSet<GlobalValue *> &GCRoots) {
  collectUsedGlobalVariables(M, GCRoots, true);
  collectUsedGlobalVariables(M, GCRoots, false);
  // similarly for llvm.global_ctors and llvm.global_dtors
}

then use GCRoots to decide whether to set the bit in GlobalResolution. But now that I think about it, I think I prefer the approach of setting live bits in the summary, for example it would seem to make it much simpler to pull the live lists of type identifiers out of the combined summary for CFI/devirtualization.

We don't want to parse the module and invoke collectUsedGlobalVariables when we do the thin link, which is where we need this info. Oh - I see that we are currently doing this. ISTR that this is only temporary (I hope - we don't want to have to parse the IR in the ThinLink!). So it would have to be done during the compile step (where we could do what you suggest above). And then we put it in the summary. Although like I mentioned above, we can just flag the llvm.* variables as the live roots, that's simpler than walking all of their references and flagging them.

Also think about how this interacts with regular LTO. You probably also want anything referenced from the regular LTO module to be treated as a GC root.

Good point - I think we would miss those right now with this patch.

Right now the patch uses the set of preserved GUIDs computed from the GlobalResolution info as the roots in runThinLTO (see invocation of computeDeadSymbols). So probably we would just use the summary flag in computeDeadSymbols to add to the set of preserved GUIDs passed in. I.e. not in addThinLTO (this is for index-based internalization/dead stripping, so I don't think regular LTO comes into play).

Yes, or change the computeDeadSymbols function to be implemented only in terms of the live bits. Then adding the GC roots from the linker (i.e. symbols referenced from regular objects or a regular LTO module) would be implemented by setting live bits on the roots.

We don't want to parse the module and invoke collectUsedGlobalVariables when we do the thin link, which is where we need this info. Oh - I see that we are currently doing this. ISTR that this is only temporary (I hope - we don't want to have to parse the IR in the ThinLink!).

Actually, looking at how this info is used right now, it seems like what I am talking about doing here will allow us to remove the build of the module and invocation of collectUsedGlobalVariables from addThinLTO. It is currently being used to help identify exported GUIDs to prevent internalization, but by marking these as live roots in the summary we can do the same thing: treating the flagged live roots as potentially exported and not dead.

So it would have to be done during the compile step (where we could do what you suggest above). And then we put it in the summary. Although like I mentioned above, we can just flag the llvm.* variables as the live roots, that's simpler than walking all of their references and flagging them.

To do what I describe above (use the summary flagging to obviate the collectUsedGlobalVariables invocation from the thin link), we'll need to flag the references of the llvm.* variables.

pcc added a comment.Dec 19 2016, 5:07 PM
In D23488#627205, @pcc wrote:
In D23488#627154, @pcc wrote:

This information will probably need to be encoded in the summary somehow once we're at the point where we no longer need to load the module during symbol resolution.

At that point we will have a separately loadable symbol table in the bitcode file, but there again I guess we will not know anything about these llvm.* symbols in the table, right?

I was envisioning that the live bits in the global summary wouldn't be applied to the llvm.* symbols themselves but to the globals referenced by them.

Either one works for the index-based dead stripping. But my point was we need something to flag this in the summary because these symbols won't be in the bitcode symbol table (essentially the situation we're running into here).

Sure, that makes sense. I was thinking about whether we'd also need a separate "used" bit in the bitcode symbol table, so that we can apply the right semantics to those symbols in both regular and thin LTO. In that case we'd actually want to mark the globals referenced by llvm.used rather than just marking llvm.used itself (because the refs wouldn't be available in the BC symtab). But I think that's probably best dealt with orthogonally.

I think at this point we just need to collect a set of GC roots in add{Regular,Thin}LTO in the same way that we collect the set of used variables with collectUsedGlobalVariables.

Presumably using a new flag added to the summary, right? Or did you have a different approach in mind for now?

Yes, it could be done in the summary. I was also thinking you could have a function called something like collectGCRoots that does something like this:

void collectGCRoots(Module *M, SmallSet<GlobalValue *> &GCRoots) {
  collectUsedGlobalVariables(M, GCRoots, true);
  collectUsedGlobalVariables(M, GCRoots, false);
  // similarly for llvm.global_ctors and llvm.global_dtors
}

then use GCRoots to decide whether to set the bit in GlobalResolution. But now that I think about it, I think I prefer the approach of setting live bits in the summary, for example it would seem to make it much simpler to pull the live lists of type identifiers out of the combined summary for CFI/devirtualization.

We don't want to parse the module and invoke collectUsedGlobalVariables when we do the thin link, which is where we need this info. Oh - I see that we are currently doing this. ISTR that this is only temporary (I hope - we don't want to have to parse the IR in the ThinLink!).

Right, this is the thing we want to avoid doing with the bitcode symbol table.

So it would have to be done during the compile step (where we could do what you suggest above). And then we put it in the summary. Although like I mentioned above, we can just flag the llvm.* variables as the live roots, that's simpler than walking all of their references and flagging them.

Works for me.

pcc added a comment.Dec 19 2016, 5:14 PM

We don't want to parse the module and invoke collectUsedGlobalVariables when we do the thin link, which is where we need this info. Oh - I see that we are currently doing this. ISTR that this is only temporary (I hope - we don't want to have to parse the IR in the ThinLink!).

Actually, looking at how this info is used right now, it seems like what I am talking about doing here will allow us to remove the build of the module and invocation of collectUsedGlobalVariables from addThinLTO. It is currently being used to help identify exported GUIDs to prevent internalization, but by marking these as live roots in the summary we can do the same thing: treating the flagged live roots as potentially exported and not dead.

I don't think that just removing the call to collectUsedGlobalVariables would be sufficient to avoid creating the module. We currently still need the module to create symbol names and compute module flags for the LTO client.

So it would have to be done during the compile step (where we could do what you suggest above). And then we put it in the summary. Although like I mentioned above, we can just flag the llvm.* variables as the live roots, that's simpler than walking all of their references and flagging them.

To do what I describe above (use the summary flagging to obviate the collectUsedGlobalVariables invocation from the thin link), we'll need to flag the references of the llvm.* variables.

Aren't the llvm.* variables and their references available in the summary? Presumably even once we move to the bitcode symbol table we could treat llvm.* as GC roots by following their references, no?

In D23488#627265, @pcc wrote:

We don't want to parse the module and invoke collectUsedGlobalVariables when we do the thin link, which is where we need this info. Oh - I see that we are currently doing this. ISTR that this is only temporary (I hope - we don't want to have to parse the IR in the ThinLink!).

Actually, looking at how this info is used right now, it seems like what I am talking about doing here will allow us to remove the build of the module and invocation of collectUsedGlobalVariables from addThinLTO. It is currently being used to help identify exported GUIDs to prevent internalization, but by marking these as live roots in the summary we can do the same thing: treating the flagged live roots as potentially exported and not dead.

I don't think that just removing the call to collectUsedGlobalVariables would be sufficient to avoid creating the module. We currently still need the module to create symbol names and compute module flags for the LTO client.

The symbol name creation is down in the InputFile and will go away with the new bitcode symbol table, right?
Where are the module flags needed?

So it would have to be done during the compile step (where we could do what you suggest above). And then we put it in the summary. Although like I mentioned above, we can just flag the llvm.* variables as the live roots, that's simpler than walking all of their references and flagging them.

To do what I describe above (use the summary flagging to obviate the collectUsedGlobalVariables invocation from the thin link), we'll need to flag the references of the llvm.* variables.

Aren't the llvm.* variables and their references available in the summary?

Yes

Presumably even once we move to the bitcode symbol table we could treat llvm.* as GC roots by following their references, no?

For the computeDeadSymbols this works. But what that doesn't get us is treating those as external (subsuming the handling of the Used variables in addSymbolToGlobalRes), so that they aren't internalized.

pcc added a comment.Dec 19 2016, 5:46 PM
In D23488#627265, @pcc wrote:

We don't want to parse the module and invoke collectUsedGlobalVariables when we do the thin link, which is where we need this info. Oh - I see that we are currently doing this. ISTR that this is only temporary (I hope - we don't want to have to parse the IR in the ThinLink!).

Actually, looking at how this info is used right now, it seems like what I am talking about doing here will allow us to remove the build of the module and invocation of collectUsedGlobalVariables from addThinLTO. It is currently being used to help identify exported GUIDs to prevent internalization, but by marking these as live roots in the summary we can do the same thing: treating the flagged live roots as potentially exported and not dead.

I don't think that just removing the call to collectUsedGlobalVariables would be sufficient to avoid creating the module. We currently still need the module to create symbol names and compute module flags for the LTO client.

The symbol name creation is down in the InputFile and will go away with the new bitcode symbol table, right?

Right.

Where are the module flags needed?

Sorry, I meant the symbol flags (i.e. ModuleSymbolTable::getSymbolFlags).

So it would have to be done during the compile step (where we could do what you suggest above). And then we put it in the summary. Although like I mentioned above, we can just flag the llvm.* variables as the live roots, that's simpler than walking all of their references and flagging them.

To do what I describe above (use the summary flagging to obviate the collectUsedGlobalVariables invocation from the thin link), we'll need to flag the references of the llvm.* variables.

Aren't the llvm.* variables and their references available in the summary?

Yes

Presumably even once we move to the bitcode symbol table we could treat llvm.* as GC roots by following their references, no?

For the computeDeadSymbols this works. But what that doesn't get us is treating those as external (subsuming the handling of the Used variables in addSymbolToGlobalRes), so that they aren't internalized.

Sure, that's what I was thinking about here:

I was thinking about whether we'd also need a separate "used" bit in the bitcode symbol table, so that we can apply the right semantics to those symbols in both regular and thin LTO. In that case we'd actually want to mark the globals referenced by llvm.used rather than just marking llvm.used itself (because the refs wouldn't be available in the BC symtab). But I think that's probably best dealt with orthogonally.

So it would have to be done during the compile step (where we could do what you suggest above). And then we put it in the summary. Although like I mentioned above, we can just flag the llvm.* variables as the live roots, that's simpler than walking all of their references and flagging them.

We could also leave the llvm.* entirely out of the summary, and just flag their reference as "used".
Another alternative is to leave the summary unchanged, and have the computeDeadSymbols to add to the roots getGUID('llvm.used'); getGUID('llvm.global_ctors'); etc (just get the same list as the global-dce pass.

lib/Transforms/IPO/FunctionImport.cpp
337 ↗(On Diff #81995)

Is this intended?

In D23488#627212, @pcc wrote:

[...]
In this case we can't dead strip f by internalizing as that would end up keeping the reference to the dead function g, we need to actually remove it.

Right, it's probably not worth it in practice. There is likely not commonly enough compiler.used cases. We can always reconsider later.

tejohnson added inline comments.Dec 21 2016, 8:05 PM
lib/Transforms/IPO/FunctionImport.cpp
337 ↗(On Diff #81995)

Looks like it snuck in with your early version of the patch. I'll remove it.

tejohnson updated this revision to Diff 82304.Dec 21 2016, 8:08 PM
tejohnson edited edge metadata.

Added flag to the summary to enable flagging llvm.* values as live,
and add those to the live worklist when doing index-based dead value
analysis. With this change all SPEC cpu2006 C/C++ benchmarks pass.

mehdi_amini added inline comments.Dec 21 2016, 8:53 PM
lib/Analysis/ModuleSummaryAnalysis.cpp
197 ↗(On Diff #82304)

s/std::string/StringRef/

305 ↗(On Diff #82304)

Not sure about this after reading the comment above, Also, any values used but not defined within module level asm should be listed on the llvm.used or llvm.compiler.used global and marked as referenced from there.

lib/Bitcode/Reader/BitcodeReader.cpp
792 ↗(On Diff #82304)

One line comment for the Version < 3 (it makes sense to me now, but I'm not sure I'll see it immediately if I have to touch this in a few months).

lib/LTO/LTO.cpp
847 ↗(On Diff #82304)

What if Res.second.VisibleToRegularObj && Res.second.IRName.empty()? Should we assert instead?

603 ↗(On Diff #68868)

What is the answer to your question here?

In D23488#627205, @pcc wrote:

Also think about how this interacts with regular LTO. You probably also want anything referenced from the regular LTO module to be treated as a GC root.

Good point - I think we would miss those right now with this patch.

I just realized I haven't addressed this yet. Looks like it requires adding a new flag into the SymbolResolution, since we can't currently distinguish between a value referenced between different ThinLTO partitions and the case where we have a reference between a ThinLTO partition and the regular LTO partition - in both cases the resulting GlobalResolution Partition will be External. I'll probably have to add a VisibleToRegularLTOPartition flag onto the SymbolResolution and set it when the Partition == External and we have seen a reference from Partition 0. I'll do that and add a test.

lib/Analysis/ModuleSummaryAnalysis.cpp
197 ↗(On Diff #82304)

Fixed

305 ↗(On Diff #82304)

Note that part of the comment is referring to "values used but not defined" within the module level asm. Whereas we are creating a summary here for something defined within module level asm. To be conservatively correct these should be marked live.

lib/Bitcode/Reader/BitcodeReader.cpp
792 ↗(On Diff #82304)

Done.

lib/LTO/LTO.cpp
847 ↗(On Diff #82304)

Looking at the code in addSymbolToGlobalRes, IRName is set when the GV is marked as prevailing. So I guess "Res.second.VisibleToRegularObj && Res.second.IRName.empty()" would mean a symbol that is visible to a regular object but not the prevailing copy. In that case, it shouldn't need to be preserved.

603 ↗(On Diff #68868)

I realized when revisiting the patch that my logic was flawed. It can be exported and dead (in fact, that's one thing we are specifically trying to catch with this patch). Additionally, it can be live (not in DeadSymbols) and not be exported. So your original check was correct. We want to treat as exported when internalizing if it was a) original exported to another module and b) it isn't dead per the index based deadness analysis.

I just realized I haven't addressed this yet. Looks like it requires adding a new flag into the SymbolResolution, since we can't currently distinguish between a value referenced between different ThinLTO partitions and the case where we have a reference between a ThinLTO partition and the regular LTO partition - in both cases the resulting GlobalResolution Partition will be External. I'll probably have to add a VisibleToRegularLTOPartition flag onto the SymbolResolution and set it when the Partition == External and we have seen a reference from Partition 0. I'll do that and add a test.

Correction: the new flag will go into the GlobalResolution not the SymbolResolution!

tejohnson updated this revision to Diff 82352.Dec 22 2016, 10:52 AM

Address review comments, add handling/test for use in regular LTO partition.

mehdi_amini added inline comments.Dec 22 2016, 11:16 AM
lib/LTO/LTO.cpp
847 ↗(On Diff #82304)

The fact that you had to look at addSymbolToGlobalRes to provide the explanation is a good indication that a comment would be welcome here.

335 ↗(On Diff #82352)

Can you add a comment that explains the test (and both branches)

341 ↗(On Diff #82352)

It is not clear to me how the VisibleToRegularLTOPartition is computed here, it seems that for partition to be 0 (RegularLTO) the symbol has to be used or defined in LTO but not used or defined in ThinLTO?

tejohnson added inline comments.Dec 22 2016, 11:25 AM
lib/LTO/LTO.cpp
847 ↗(On Diff #82304)

Ok will do

335 ↗(On Diff #82352)

Will do

341 ↗(On Diff #82352)

This will be called for each partition it is referenced from. Partition 0 is always regular LTO, ThinLTO modules are partitions 1 and higher. So if this is ever called for Partition 0 then we have a reference in regular LTO. See the callsites in add*LTO

mehdi_amini added inline comments.Dec 22 2016, 11:29 AM
lib/LTO/LTO.cpp
341 ↗(On Diff #82352)

I misread the assignment as = instead of |=.

mehdi_amini added inline comments.Dec 22 2016, 11:31 AM
lib/LTO/LTO.cpp
341 ↗(On Diff #82352)

Also technically to be only "visible" to the RegularLTO partition, it should be referenced from LTO but not defined there with a prevailing resolution.

pcc added inline comments.Dec 22 2016, 12:28 PM
include/llvm/LTO/LTO.h
389 ↗(On Diff #82352)

Don't you just need one flag: VisibleOutsideThinLTO (or something).

mehdi_amini added inline comments.Dec 22 2016, 12:32 PM
include/llvm/LTO/LTO.h
389 ↗(On Diff #82352)

I think one flag should be enough, but shouldn't we have the same flag for LTO internalization?

pcc added inline comments.Dec 22 2016, 12:36 PM
include/llvm/LTO/LTO.h
389 ↗(On Diff #82352)

Wouldn't that be redundant with Partition == 0?

tejohnson added inline comments.Dec 22 2016, 1:05 PM
include/llvm/LTO/LTO.h
389 ↗(On Diff #82352)

You're right, we only need one flag (forgot that the VisibleToRegularObj flag was added to GlobalResolutions for this patch). I will combine them as suggested.

lib/LTO/LTO.cpp
341 ↗(On Diff #82352)

I'm using the term "visible" loosely here - if it is defined in the regular LTO partition it is essentially also visible there.

mehdi_amini added inline comments.Dec 22 2016, 1:06 PM
include/llvm/LTO/LTO.h
389 ↗(On Diff #82352)

You're right!
So do we need to record every ThinLTO module as an individual partition? When do we use this?
We could have enum partition { Unkown, Global, LTO, ThinLTO }?

tejohnson added inline comments.Dec 22 2016, 1:09 PM
include/llvm/LTO/LTO.h
389 ↗(On Diff #82352)

So do we need to record every ThinLTO module as an individual partition? When do we use this?

To detect when symbols are used by multiple ThinLTO modules (i.e. exported). E.g. it transitions from Unknown -> partition1 when first called with Partition=partition1, then from partition1 -> External if invoked again for the same symbol with a different partition2.

tejohnson updated this revision to Diff 82365.Dec 22 2016, 1:24 PM

Address comments

mehdi_amini added inline comments.Dec 22 2016, 1:35 PM
include/llvm/LTO/LTO.h
389 ↗(On Diff #82352)

To detect when symbols are used by multiple ThinLTO modules (i.e. exported). E.g. it transitions from Unknown -> partition1 when first called with Partition=partition1, then from partition1 -> External if invoked again for the same symbol with a different partition2.

The way partitions are handled and the flag VisibleOutsideThinLTO is still not clear to me, we know when symbols are exported from the index itself why do we need partitions for that?

tejohnson added inline comments.Dec 22 2016, 2:07 PM
include/llvm/LTO/LTO.h
389 ↗(On Diff #82352)

The way partitions are handled and the flag VisibleOutsideThinLTO is still not clear to me, we know when symbols are exported from the index itself why do we need partitions for that?

At some point we have to collate that information to be able to look up which guid are exported from their modules (for creating the ExportedGUIDs set for internalization). We could detect cross module references among ThinLTO modules by walking the index, but since we have the information here already, it seems more efficient to just keep track of it. Otherwise we need to walk the whole index, comparing module paths, etc.

Forgot to ask, did you collect any statistics?
I'd be interested in things like (for let say clang itself):

  • ThinLink time
  • Link time
  • Number of imported function
  • Number of imported module
  • Number of internalization
  • Whatever else you think about :)
mehdi_amini added inline comments.Dec 22 2016, 2:46 PM
include/llvm/LTO/LTO.h
389 ↗(On Diff #82352)

At some point we have to collate that information to be able to look up which guid are exported from their modules (for creating the ExportedGUIDs set for internalization). We could detect cross module references among ThinLTO modules by walking the index, but since we have the information here already, it seems more efficient to just keep track of it. Otherwise we need to walk the whole index, comparing module paths, etc.

Do we have an example that would cover the following case:

// main.cpp
void bar1();
void bar2();
void baz();
void foo1() {
  bar1();
}
void foo2() {
  bar2();
}
int main() {
  baz()
}
`

And

`
// bar.cpp
void bar1() {
}
void bar2() {
  foo2(); // The cycle shouldn't prevent dead-stripping.
}
void baz() {
   // Adding a call to bar() here should not prevent internalization.
}
`

With only main exported by the linker, I expect us to be able to:

  1. dead-strip from the index foo() and bar()
  2. "internalize" (drop) foo() and bar() eagerly from the IR

I think the current logic should be OK. But I still find the partition thing and the ExportedGUIDs confusing.

lib/LTO/LTO.cpp
879 ↗(On Diff #82365)

Could we *not* add DeadSymbols to ExportedGUIDs here and remove the DeadSymbols.count(GUID) from the test below?

Forgot to ask, did you collect any statistics?
I'd be interested in things like (for let say clang itself):

  • ThinLink time
  • Link time
  • Number of imported function
  • Number of imported module
  • Number of internalization
  • Whatever else you think about :)

No, I can collect some stats next week after the holiday. The compile time data may take a bit longer to get (have to run when my machine is otherwise mostly idle).

include/llvm/LTO/LTO.h
389 ↗(On Diff #82352)

Do we have an example that would cover the following case:

// main.cpp
void bar1();
void bar2();
void baz();
void foo1() {
  bar1();
}
void foo2() {
  bar2();
}
int main() {
  baz()
}
`

And

`
// bar.cpp
void bar1() {
}
void bar2() {
  foo2(); // The cycle shouldn't prevent dead-stripping.
}
void baz() {
   // Adding a call to bar() here should not prevent internalization.

bar1 or bar2? Adding a call to bar2 would prevent its dead stripping and prevent internalization. Adding a call to bar1 should not prevent its internalization.

}

`

With only main exported by the linker, I expect us to be able to:

1) dead-strip from the index foo() and bar()

I don't have a cycle case like foo2/bar2, will add. The existing test case covers the foo1/bar1 case. Although note if calls to either bar2 or bar1 are added to baz as you suggest above, it will prevent their dead stripping.

  1. "internalize" (drop) foo() and bar() eagerly from the IR

Yes the existing test case ensures we internalize where appropriate, even when there were cross-module references (but where both caller/callee were dead per new index-based analysis), and that we remove them as appropriate. Other than the cycle case, which I'll add, the above cases are covered by the new test case in the patch.

I think the current logic should be OK. But I still find the partition thing and the ExportedGUIDs confusing.

lib/LTO/LTO.cpp
879 ↗(On Diff #82365)

Good idea.

lib/LTO/ThinLTOCodeGenerator.cpp
727 ↗(On Diff #82365)

Actually, in ThinLTOCodeGenerator checks here and elsewhere it looks like this test can just be !DeadSymbols.count(GUID) rather than (!DeadSymbols.count(GUID) && GUIDPreservedSymbols.count(GUID)). Here the same GUIDPreservedSymbols set was passed computeDeadSymbols, so they would already be excluded from the DeadSymbol set. ThinLTOCodeGenerator does not seem to distinguish between references outside of ThinLTO vs cross-references between ThinLTO modules, which in fact means that the dead analysis will be conservative here...

mehdi_amini added inline comments.Dec 22 2016, 4:42 PM
include/llvm/LTO/LTO.h
389 ↗(On Diff #82352)

bar1 or bar2? Adding a call to bar2 would prevent its dead stripping and prevent internalization. Adding a call to bar1 should not prevent its internalization.

Sorry, bar1() (I wrote the comment before adding bar2(), and yes we should only internalize bar1() when called from baz().

lib/LTO/ThinLTOCodeGenerator.cpp
727 ↗(On Diff #82365)

I think the expectation is that GUIDPreservedSymbols is only supposed to contain symbols visible outside of the ThinLTO partition.

From the header:

/// Set of symbols that need to be preserved outside of the set of bitcode
/// files.
StringSet<> PreservedSymbols;

Even though in the current impl. file:

void ThinLTOCodeGenerator::crossReferenceSymbol(StringRef Name) {
  // FIXME: At the moment, we don't take advantage of this extra information,
  // we're conservatively considering cross-references as preserved.
  //  CrossReferencedSymbols.insert(Name);
  PreservedSymbols.insert(Name);
}
tejohnson added inline comments.Dec 22 2016, 4:44 PM
lib/LTO/ThinLTOCodeGenerator.cpp
727 ↗(On Diff #82365)

Even though in the current impl. file:

Right, that's what I was referring to. Any reason crossReferenceSymbol() doesn't add to a different set (the commented out CrossReferencedSymbols?) - although the isExported checks here and elsewhere in ThinLTOCodeGenerator would need to change if that happened. Maybe better to just switch this to the new LTO API?

mehdi_amini added inline comments.Dec 22 2016, 4:46 PM
lib/LTO/ThinLTOCodeGenerator.cpp
727 ↗(On Diff #82365)

Switching to the new LTO API is one reason I haven't fixed this yet.

No, I can collect some stats next week after the holiday. The compile time data may take a bit longer to get (have to run when my machine is otherwise mostly idle).

To clarify: I'm not saying it is blocking in order to land this patch. I just think it'll be interesting data that to have and it'd be nice if you can collect them at some point :)

Here are some stats for ThinLTO linking clang:

644 function-import              - Number of dead stripped symbols in index

155102 function-import - Number of live symbols in index

(the patch I will upload in a minute contains these new stats)

The number of functions imported dropped only a very small amount (from 62889 to 62815), and the number of modules imported from went from 24470 to 24449. The number of internalizations was unchanged.

I measured the full ThinLink+BE+link time for clang and there was no consistent change across 3 runs each (stripping enabled/disabled).

So at least for clang, this didn't find much opportunity. I can try for larger internal apps later when it is integrated.

include/llvm/LTO/LTO.h
389 ↗(On Diff #82352)

Added the circular dependence case to the new test.

lib/LTO/LTO.cpp
879 ↗(On Diff #82365)

Done.

lib/LTO/ThinLTOCodeGenerator.cpp
727 ↗(On Diff #82365)

Actually, in ThinLTOCodeGenerator checks here and elsewhere it looks like this test can just be !DeadSymbols.count(GUID) rather than (!DeadSymbols.count(GUID) && GUIDPreservedSymbols.count(GUID)). Here the same GUIDPreservedSymbols set was passed computeDeadSymbols, so they would already be excluded from the DeadSymbol set. ThinLTOCodeGenerator does not seem to distinguish between references outside of ThinLTO vs cross-references between ThinLTO modules, which in fact means that the dead analysis will be conservative here...

I had this backwards: because DeadSymbols was computed from GUIDPreservedSymbols, we only need to check GUIDPreservedSymbols here (anything in GUIDPreservedSymbols will necessarily not be in DeadSymbols). I've made that change here and elsewhere in this file (which basically puts us back to your original version of the patch for these checks...now I know why the isExported lambda checks didn't change here originally...).

tejohnson updated this revision to Diff 82680.Dec 29 2016, 10:30 AM

Address comments, add stats and an option.

I'm surprised by how little we find in clang, that seems fishy to me, I'll investigate on OSX when the patch lands.

Ping. Any more comments?

mehdi_amini accepted this revision.Jan 4 2017, 9:47 PM
mehdi_amini edited edge metadata.

LGTM.

This revision is now accepted and ready to land.Jan 4 2017, 9:47 PM
This revision was automatically updated to reflect the committed changes.