Page MenuHomePhabricator

[ThinLTO] Support for reference graph in per-module and combined summary.
ClosedPublic

Authored by tejohnson on Feb 12 2016, 1:21 PM.

Details

Summary

This patch adds support for including a full reference graph including
call graph edges and other GV references in the summary.

The reference graph edges can be used to make importing decisions
without materializing any source modules, can be used in the plugin
to make file staging decisions for distributed build systems, and is
expected to have other uses.

The call graph edges are recorded in each function summary in the
bitcode via a list of <CalleeValueIds, StaticCount> tuples when no PGO
data exists, or <CalleeValueId, StaticCount, ProfileCount> pairs when
there is PGO, where the ValueId can be mapped to the function GUID via
the ValueSymbolTable. In the function index in memory, the call graph
edges reference the target via the CalleeGUID instead of the
CalleeValueId.

The reference graph edges are recorded in each summary record with a
list of referenced value IDs, which can be mapped to value GUID via the
ValueSymbolTable.

Addtionally, a new summary record type is added to record references
from global variable initializers. A number of bitcode records and data
structures have been renamed to reflect the newly expanded scope of the
summary beyond functions. More cleanup will follow.

Measurements on 483.xalancbmk show the following increases in object and
combined index sizes:

  • Overall .o bitcode increase is 1.34% without PGO and 1.54% with PGO
  • The combined index increase is 116% without PGO and 120% with PGO
  • The aggregate size of the .o bitcode and combined index together increased 3.22% without PGO and 3.39% with PGO (the combined index size increase is large taken alone, but its size is still on the same order of magnitude as the larger .o files).

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
mehdi_amini added inline comments.Feb 25 2016, 12:45 AM
include/llvm/Bitcode/LLVMBitCodes.h
190–202 ↗(On Diff #48954)

Side note: we will need to nail the format in the coming weeks/months somehow, or have a plan for backward compatibility (auto upgrade...).

195 ↗(On Diff #48954)

It is a bit annoying to have to manually overload for every "optional" data. It may also lead to a combinatory explosion if there were many. Unfortunately the only alternative in the current bitcode format is to have function summary stored as block instead of record...

include/llvm/IR/FunctionInfo.h
97 ↗(On Diff #48980)

Without profile data, it might be useful to record the number of static callsites from a function to the callee. This can help compiler backend to make better global decisions later (e.g. better inlining and enable more function GC).

Can you elaborate how would you use this information and how it would help inlining and GC?

298 ↗(On Diff #48980)

I think we usually prefix takeXXX and not getXXX when ownership is transferred.

Responses to Mehdi's comments.

include/llvm/Bitcode/LLVMBitCodes.h
190–203 ↗(On Diff #48980)

Agreed, although I think we are still in the heavy churn tuning phase.

195 ↗(On Diff #48980)

Actually my first implementation of this used a single record id. What I did was add a bit to indicate whether there was any profile data, so that the reader knew how to interpret the array data at the end (as either a list of just callee ids, or <callee id, count> pairs). I think I used the size of the record to deduce whether there was any call information. I decided to switch to explicitly encoding this information in the record id because it seemed cleaner and more consistent, and also removed a bit from each record (the profile flag). This reduces the size for large .o bitcode files and for the combined index, although the size for small .o bitcode files is a bit larger due to the 2 extra abbrev ids.

I could probably go back to using the size to deduce the presence or absence of call information, and reduce this to 2 per summary type. WDYT?

include/llvm/IR/FunctionInfo.h
97 ↗(On Diff #48980)

I think what David was alluding to was similar to what I was describing on IRC yesterday - if you know you have a single static callsite via the summary, you can give the same inline cost benefit currently given for internal(ized) single-callsite functions. Then if it is inlined, hopefully the original out-of-line function can be eliminated via linker GC.

To do this, the combined index creation step would need to aggregate the # static callsites to each function and put that into the callee function's combined summary.

298 ↗(On Diff #48980)

Ok, will fix.

Response to eraman's comment.

include/llvm/ProfileData/ProfileCommon.h
98 ↗(On Diff #48980)

I looked at APInt briefly when I was thinking about how to do this, but was concerned that it was going to be inefficient in the common case. However, it looks like that isn't the case as it has a fast path. Will change this to use it.

Note I actually stole this code from RAGreedy::initializeCSRCost() (which could probably be refactored to use this new method once it goes in).

mehdi_amini added inline comments.Feb 25 2016, 8:49 AM
include/llvm/IR/FunctionInfo.h
291 ↗(On Diff #48980)

Why this?
If needed, replace with FunctionSummaryIOHelper() = default;

tejohnson added inline comments.Feb 25 2016, 9:31 AM
include/llvm/IR/FunctionInfo.h
291 ↗(On Diff #48980)

This was leftover from when I had a member variable that was being passed in and initialized in the constructor. Removing this now.

tejohnson updated this revision to Diff 49088.Feb 25 2016, 10:14 AM
  • Address comments by eraman and jokereph.
tejohnson updated this revision to Diff 49104.Feb 25 2016, 11:46 AM

Add static callsite count to summary callgraph edges.
For 483.xalancbmk this increases the size of the combined index
by another 3.5%.

mehdi_amini added inline comments.Feb 25 2016, 11:54 PM
include/llvm/IR/FunctionInfo.h
105 ↗(On Diff #48980)

Missing const variant.

tejohnson added inline comments.Feb 26 2016, 6:31 AM
include/llvm/IR/FunctionInfo.h
118 ↗(On Diff #49104)

Adding that now.

tejohnson updated this revision to Diff 49181.Feb 26 2016, 6:34 AM
  • Add const variant of FunctionSummary::edges

How could we integrate accesses to global variable as part of this?
It turns out that to be able to benefit from the linker information on what symbol is exported during the import, this is a must.

In D17212#362864, @joker.eph wrote:

How could we integrate accesses to global variable as part of this?
It turns out that to be able to benefit from the linker information on what symbol is exported during the import, this is a must.

Well without it you still can see which function symbols will be exported, just not the variables, so you are running with less info and I guess need to assume that all static variables will be exposed and promote them. To refine that behavior for variables, yes, we'd need additional info in the summary.

(For davidxl or anyone else who didn't see the IRC conversation, Mehdi is looking at doing pure summary-based importing decisions in the linker step, then giving this info to the ThinLTO backends to avoid promotion of local values that aren't exported. For a distributed build if we wanted to do it this way the importing decisions would all be made in the plugin step, then would need to be serialized out for the distributed backends to check.)

Two possibilities depending on the fidelity of the info you think you need:

  1. One possibility is to just put a flag in the function summary if it accesses *any* local variables, and adjust the importing threshold accordingly. Then in the ThinLTO backend for the exporting module you need to check which of your own functions are on the import list, and which local variables they access, and promote accordingly.
  1. If it will be really beneficial to note exactly which local variables are accessed by which function, we'll need to broaden the edges list to include accesses to variables (I assume you only care about local variables here). E.g. the per-module summary edge list for a function would need to include value ids for any local variables referenced by that function (not sure that the other parts of the triple, the static and profile counts, are needed for that). Then in the combined VST we need to include entries for GUIDs of things that don't have a function summary, but are referenced by these edges. When accessing a function summary edge list for a candidate function to import, you could then see the GUID of any local variables accessed. You wouldn't know them by name, but if for example you wanted a heuristic like "if >=N hot import candidate functions from module M access a local variable with GUID G, go ahead and import those and let G be promoted by the backend (which like in approach #1 needs to check which local variables are accessed by any functions on an import list)".

Obviously 1) is easier and cheaper space-wise. What are your thoughts?

Is there a need to mark it? Does the following work?

  1. using summary call graph to do importing decision
  2. for any function that is imported by another module, mark it as

'exported'

  1. When processing the exporting module, check any exported function and

promote any statics it references.

David

Is there a need to mark it? Does the following work?

  1. using summary call graph to do importing decision
  2. for any function that is imported by another module, mark it as 'exported'
  3. When processing the exporting module, check any exported function and promote any statics it references.

    David

That does work for correctness, but Mehdi was proposing that functions accessing local values need to meet a higher importing threshold so that we don't cause too many promotions due to over-eager importing.

In terms of serializing the info out for doing any of this with a distributed build, we could do it via an "exported" bit in the combined summary though.

mehdi_amini added inline comments.Feb 28 2016, 7:53 PM
lib/Bitcode/Reader/BitcodeReader.cpp
5528 ↗(On Diff #49181)

Shouldn't this be ValueIdToCallGraphGUIDMap[ValueID] = Function::getGUID(FunctionGlobalId); ?

tejohnson added inline comments.Feb 29 2016, 6:42 AM
lib/Bitcode/Reader/BitcodeReader.cpp
5528 ↗(On Diff #48980)

Good catch! Yes, it should be. The VST_CODE_ENTRY case does not need to invoke getGlobalIdentifier since those are externally-defined functions (and so not possibly local).

But looking at the other places here where we set up this map I realized that the VST_CODE_COMBINED_FNENTRY case was wrong - we already have the GUID and there is no ValueName in that record. I have a fix for both coming up shortly.

tejohnson updated this revision to Diff 49371.Feb 29 2016, 7:09 AM
  • Fix ValueIdToCallGraphGUIDMap entries.
tejohnson updated this revision to Diff 49994.Mar 7 2016, 2:04 PM

Update patch to include full reference graph.

Some notable changes from prior patch:

  • include non-call reference edges in summary records
  • introduce summary entries for global variable initializations
  • add the VSTOFFSET record to the combined index bitcode file, and use it when reading both the per-module and combined indices to parse the VST before the summary section (analogous to how it is used when parsing normal bitcode to parse the VST before the function blocks), which reduces the amount of bookkeeping required to associate the VST entry with the value ids used in the summary section.

Bitcode name changes:

  • add a new FS_*_GLOBALVAR_INIT_REFS records to hold ref graph edges from global variable inits to referenced global vars.
  • changed the VST_CODE_COMBINED_FNENTRY name to VST_CODE_COMBINED_GVDEFENTRY to reflect the broadened scope (can correspond to either a function or variable def and associated summary).
  • changed FUNCTION_SUMMARY_BLOCK_ID to GLOBALVAL_SUMMARY_BLOCK_ID to reflect additional scope

Data structure and related naming changes:

  • Added a new GlobalValueSummary type which holds summary info common to both the function and globalvar summaries.
  • Made the existing FunctionSummary record a derived class of the new GlobalValueSummary.
  • Added a new derived GlobalVarSummary class of GlobalValueSummary.
  • Renamed FunctionInfo* -> GlobalValueInfo*
  • Renamed FunctionSummary -> GlobalValueSummary or just Summary as appropriate
  • Misc other variable and method renamings to reflect above changes.

TODO:

  • Rename FunctionInfoIndex to something more broad (ModuleIndex?) (ditto for related classes like FunctionInfoIndexBitcodeReader).
  • Move getGUID and getGlobalIdentifier from Function to GlobalValue class.
  • Some lingering FunctionSummary references that need to be updated to GlobalValueSummary or just Summary.

I will work on the TODO items now but wanted to get this patch out for review,
testing by Mehdi, and feedback.

Note that some of these changes (many of the renames, can be committed first
as NFC changes).

LGTM overall. Still some comments/questions below.

Thanks for the good work!

include/llvm/IR/FunctionInfo.h
87 ↗(On Diff #49994)

Can't/shouldn't be protected?

108 ↗(On Diff #49994)

why to be written to...?

149 ↗(On Diff #49994)

I think the usual terminology in LLVM is "PGO" (I had to google FDO)
(same below)

154 ↗(On Diff #49994)

F?

Also again the "to be written to" seems off to me.

216 ↗(On Diff #49994)

Can't you have a single ctor?

GlobalValueInfo(uint64_t Offset = 0, std::unique_ptr<GlobalValueSummary> Summary = nullptr)

lib/Bitcode/Reader/BitcodeReader.cpp
416 ↗(On Diff #49994)

I guess more renaming could have been done here right?
(I really don't mind, but since you cared to rename many places below...)

445 ↗(On Diff #49994)

doc

5788 ↗(On Diff #49994)

Mmmmm....

5850 ↗(On Diff #49994)

Mmmmm (bis)....

lib/Bitcode/Writer/BitcodeWriter.cpp
852 ↗(On Diff #49994)

I'd write it:

if (M->getValueSymbolTable().empty())
  return 0;
return WriteValueSymbolTableForwardDecl(Stream);
2346 ↗(On Diff #49994)

Why is it no longer valid?

2473 ↗(On Diff #49994)

Mind adding a one line comment?

2486 ↗(On Diff #49994)

Guarantee on the recursion depth?

2560 ↗(On Diff #49994)

if(CS) could be hoisted out of the loop.

2572 ↗(On Diff #49994)

You depend on the order of the calls.
If an instruction first add a call to a function, but later on will reference it, you won't remove it from RefEdges.

This may be intended and you want to have an accurate count of ref + calls and here you're trying to filter calls out of refedges. However a call instruction could legitimately refer the function (a call passing a function pointer as an argument for instance)

2574 ↗(On Diff #49994)

?

2603 ↗(On Diff #49994)

?

2977 ↗(On Diff #49994)

Isn't the opposite that the code is doing?

Another question: you reported measurements in the description, is it up to date with the new patch?

In D17212#369588, @joker.eph wrote:

Another question: you reported measurements in the description, is it up to date with the new patch?

Not yet, need to update that. The size of the thinlto.bc combined index went up quite a bit percentage wise with the ref graph, especially once I corrected it to recursively look for references, and added in the global variable init refs. Here are the new stats, only have non-PGO so far, will collect PGO as well and update the summary (once I make a fix described below to add linkage types to variables):

For 483.xalancbmk (non-PGO):
Total .o (bitcode) size: +1.24%
Combined .thinlto.bc size: +120%
Aggregate .o + .thinlto.bc size: +3.16%

I gave the last number because the size of the combined index .thinlto.bc file is on the same order of magnitude as the largest individual .o bitcode files, so the increase should be taken in that context.

include/llvm/IR/FunctionInfo.h
87 ↗(On Diff #49994)

Good point, will make protected.

108 ↗(On Diff #49994)

Good catch, this was leftover from the prior patch when it was part of a bitcode writer helper, didn't updated it up after moving here. Will fix.

149 ↗(On Diff #49994)

Will fix...bummer that every compiler I have worked on seems to use a different term! PGO/FDO/PBO...bah.

154 ↗(On Diff #49994)

Same issue as above, stale comment, will update.

216 ↗(On Diff #49994)

Yes, will fix.

lib/Bitcode/Reader/BitcodeReader.cpp
416 ↗(On Diff #49994)

Yep, this was part of the TODO list in this patch update. Working on that now. Specifically, I'm changing FunctionIndex* and FunctionInfoIndex* to ModuleSummaryIndex*. I left this for a follow-on update because already the renaming was pretty extensive, and this particular change bleeds into some of the interfaces and variable/field names used in other parts of the compiler (clang etc).

445 ↗(On Diff #49994)

Will do.

5788 ↗(On Diff #49994)

Oops. In my haste to get the renaming done and send this out, forgot that I planned to add a linkage type to the global var init records. Will add now.

5850 ↗(On Diff #49994)

Ditto.

lib/Bitcode/Writer/BitcodeWriter.cpp
852 ↗(On Diff #49994)

Good idea, will fix.

2346 ↗(On Diff #49994)

Good catch. I think this got removed when I was using a different data structure here for awhile, and I missed it when restoring the old approach. Will add back.

2473 ↗(On Diff #49994)

Will do.

2486 ↗(On Diff #49994)

Beyond the Visited set check earlier? What would you like to see?

2560 ↗(On Diff #49994)

Yes, missed this when I cleaned things up from when I was initially looking for non-call references inside this loop.

2572 ↗(On Diff #49994)

I don't think the order of calls matters? The reference list is populated once outside of this loop.

On your second point, that is true that this will cause a function both called and referenced in another way to only be recorded as called by this function. Is it important to list both types of references? I was thinking that it was important for importing needs to distinguish the functions being called from the other non-call references, but that essentially the combination of the two are the full reference set of the function. If I should put it in both places, I will need to figure out how to distinguish the two types of function references in findRefEdges.

2574 ↗(On Diff #49994)

More cruft left from when I was looking for non-call refs in this loop, will remove

2603 ↗(On Diff #49994)

Incomplete cleanup of debugging output, will remove.

2977 ↗(On Diff #49994)

Yes, good catch. I think what I initially wanted was to record them this way, but the issue is that the alias doesn't have a separate summary, which is why we are grabbing the aliasee summary here. Will update the comment.

mehdi_amini added inline comments.Mar 8 2016, 11:42 AM
lib/Bitcode/Reader/BitcodeReader.cpp
416 ↗(On Diff #49994)

Fine with me.

lib/Bitcode/Writer/BitcodeWriter.cpp
852 ↗(On Diff #49994)

Re-reading it, the variable had the nice property of having a name, making it very clear what is returned.
If you adopt the above suggested change, I would add a one line comment along the line of // return the VSTOffsetPlaceholder if we have a VST

2486 ↗(On Diff #49994)

I was not worried on non-termination or corretness, but depth that would explode the stack.

It was late and I couldn't think clearly enough to find the answer myself. I just got a coffee so I should be able to answer myself now ;)

-> For every instruction you will pull transitively all the operands until you reach a global (or something already seen). i.e. this is a DFS search on the SSA graph.
If I'm correct, a worklist would probably be appropriate here.

2560 ↗(On Diff #49994)

Re-reading, I'm not even sure what this loop on the operand is doing at all!
Wouldn't the code do exactly the same if you just remove the loop and turn the test into if (CS)`

2572 ↗(On Diff #49994)

I pointed what I saw as an inconsistency, but I may misunderstand totally what you're trying to do here.

So I will assume in the following that:

  • RefEdges contains the global values referenced by the current function
  • CallGraphEdges contains the list of Functions called by the current function

Do we agree that we should have either of these properties:

  1. A function that is part of CallGraphEdges should *not* be present in RefEdges even if it is referenced in another way than a call
  2. A function that is part of CallGraphEdges should *also* be present in RefEdges *if it is referenced* in another way than a call

What I read from your code right now is:

  1. A function that is part of CallGraphEdges *may or may not* be present in RefEdges *if it is referenced* in another way than a call.
tejohnson added inline comments.Mar 8 2016, 1:21 PM
lib/Bitcode/Writer/BitcodeWriter.cpp
852 ↗(On Diff #49994)

Will do.

2486 ↗(On Diff #49994)

I wouldn't have thought that an instruction would be large enough to cause stack explosion. But I have no issue with changing this to a worklist iteration instead.

2560 ↗(On Diff #49994)

Right, this loop is bogus!

2572 ↗(On Diff #49994)

Yep, I was distracted by the bogus loop above, we are collecting this per instruction so there is an ordering issue, and we are getting #3 which is undesirable.

I think the way to fix this is to check for the callsite first, then pass in some info from the callsite to findRefEdges so that the callsite reference is itself ignored. I.e. pass in either the callee GV and skip it to get #1, or pass in the Use to get #2.

I saw David's follow-on that he thinks #2 is best. I can try that.

mehdi_amini added inline comments.Mar 8 2016, 4:13 PM
lib/Bitcode/Writer/BitcodeWriter.cpp
2486 ↗(On Diff #49994)

I may misunderstand, but it seems to me that the depth is not the width of a single instruction, but potentially on the order of the number of instructions in the Function.
If I'm wrong and you're bounded somehow by the number of operands, then recursion is fine with me.

tejohnson added inline comments.Mar 9 2016, 7:21 AM
lib/Bitcode/Writer/BitcodeWriter.cpp
2486 ↗(On Diff #49994)

If you are right then this needs to change beyond moving to a worklist, as I only intended to capture the references within a single instruction or variable def.

The two entry points to this routine are in WriteFunction, where we pass in an Instruction as the User, and in WriteModuleLevelReferences, where we pass in a GlobalVariable (and I had confirmed that the operands of a variable are its initializer). We recursively walk the given User's operands(). I wouldn't have thought that we could jump from a given Instruction to other instructions in the function this way?

tejohnson added inline comments.Mar 9 2016, 11:57 AM
lib/Bitcode/Writer/BitcodeWriter.cpp
2572 ↗(On Diff #49994)

Actually it turns out to be pretty simple, we only need to pass in the callee GV to findRefEdges to exclude it (because you can't have both a call and a non-call reference to a function in the same instruction, at least I couldn't find a way!).

For xalancbmk this fix actually reduced the combined index and .o sizes a bit. It turns out that with the old flow we were inadvertently including intrinsics (which were ignored in the call graph edge list) in the reference list.

In the new version of the patch I will upload shortly I've included a new test that checks for both issues (ensuring a function is in both the call list and ref list if it is accessed both ways by the function, and ignoring intrinsics).

tejohnson updated this revision to Diff 50173.Mar 9 2016, 11:58 AM
  • Address Mehdi's review comments.
tejohnson updated this revision to Diff 50287.Mar 10 2016, 9:14 AM

Fixes to findRefEdges discussed in review and IRC

Two main changes to findRefEdges:

  1. Use worklist iteration instead of recursion

The reason is that when we invoke with an Instruction, it will iterate
into other instructions in the function when it encounters a use of
a local variable defined in another instruction. Ultimately we are
analyzing all instructions in the function, and we prevent duplicate
work by sharing the Visited set across all instructions in the function.
However, using a worklist approach is more efficient for large functions
where with recursion we may end up recursing many times (using much
stack).

  1. More accurate detection and skipping of callsite references

The prior solution would miss the non-call reference to foo in the
following case:
foo((void *)foo);
since we were skipping all GVs that matched the callee GV.

Additionally, as noted in 1) we may traverse through local variables
defined on other instructions. Those other instructions may be calls
which return a value. We want to make sure we don't count calls in other
instructions encountered this way as non-call refs. Depending on the
order of traversal (e.g. a use on a phi encountered before the call
instruction), we can't count on the call being in the visited set.

Finally, doing the checking for callsites here avoids the need for
WriteFunction to special case call instructions, which has its own set
of issues. For example, one solution considered was for call
instructions to invoke findRefEdges on each of its data_ops(). This
means findRefEdges would need to distinguish between a GV user that was
passed in being either a reference (if it was a call instruction
operand) or a global variable for which we want to analyze initializers
(i.e. called from WriteModuleLevelReferences).

Finally, enhanced the new test quite a bit to check for the various
cases I encountered when fixing findRefEdges.

mehdi_amini accepted this revision.Mar 10 2016, 10:20 AM
mehdi_amini edited edge metadata.

LGTM.
Thanks! I think we reached the point of "good enough". I feel this review was productive :)

This revision is now accepted and ready to land.Mar 10 2016, 10:20 AM
In D17212#372070, @joker.eph wrote:

LGTM.
Thanks! I think we reached the point of "good enough". I feel this review was productive :)

Great, thanks for the reviews! Agreed, the final state is much more complete and accurate.

Will update the title and summary along with updated stats before I commit.

tejohnson retitled this revision from [ThinLTO] Support for call graph in per-module and combined summary. to [ThinLTO] Support for reference graph in per-module and combined summary..Mar 11 2016, 6:57 AM
tejohnson updated this object.
tejohnson edited edge metadata.
tejohnson updated this object.Mar 11 2016, 7:00 AM
This revision was automatically updated to reflect the committed changes.

This is the same as what was reviewed, except for 2 small changes that can be post-commit reviewed:

Mehdi: A bad interaction between my changes and your libLTO interfaces committed this week (which merges into the first index when creating the combined index, instead of creating a new combined index and merging everything into it). See removeEmptySummaryEntries() and its callsite for the detailed explanation of what I needed to do since I am now parsing the VST first and need to do some eager creation of index entries that was being cleaned up as we merged the indexes.

Easwaran: Updated getBlockProfileCount as per our conversation this morning to do the multiplication before the integer divide in order to avoid inaccuracies due to truncation.