This is an archive of the discontinued LLVM Phabricator instance.

[ThinLTO] Support for reference graph in per-module and combined summary.
ClosedPublic

Authored by tejohnson on Feb 12 2016, 1:21 PM.

Details

Summary

This patch adds support for including a full reference graph including
call graph edges and other GV references in the summary.

The reference graph edges can be used to make importing decisions
without materializing any source modules, can be used in the plugin
to make file staging decisions for distributed build systems, and is
expected to have other uses.

The call graph edges are recorded in each function summary in the
bitcode via a list of <CalleeValueIds, StaticCount> tuples when no PGO
data exists, or <CalleeValueId, StaticCount, ProfileCount> pairs when
there is PGO, where the ValueId can be mapped to the function GUID via
the ValueSymbolTable. In the function index in memory, the call graph
edges reference the target via the CalleeGUID instead of the
CalleeValueId.

The reference graph edges are recorded in each summary record with a
list of referenced value IDs, which can be mapped to value GUID via the
ValueSymbolTable.

Addtionally, a new summary record type is added to record references
from global variable initializers. A number of bitcode records and data
structures have been renamed to reflect the newly expanded scope of the
summary beyond functions. More cleanup will follow.

Measurements on 483.xalancbmk show the following increases in object and
combined index sizes:

  • Overall .o bitcode increase is 1.34% without PGO and 1.54% with PGO
  • The combined index increase is 116% without PGO and 120% with PGO
  • The aggregate size of the .o bitcode and combined index together increased 3.22% without PGO and 3.39% with PGO (the combined index size increase is large taken alone, but its size is still on the same order of magnitude as the larger .o files).

Diff Detail

Event Timeline

tejohnson updated this revision to Diff 47847.Feb 12 2016, 1:21 PM
tejohnson retitled this revision from to [ThinLTO] Support for call graph in per-module and combined summary..
tejohnson updated this object.
tejohnson added reviewers: mehdi_amini, davidxl.
tejohnson added a subscriber: llvm-commits.
davidxl added inline comments.Feb 16 2016, 1:03 PM
include/llvm/Bitcode/LLVMBitCodes.h
190–202

A side note -- it might be useful to reserve some bits for general function attributes such as 'address-taken' etc.

include/llvm/IR/FunctionInfo.h
97

Without profile data, it might be useful to record the number of static callsites from a function to the callee. This can help compiler backend to make better global decisions later (e.g. better inlining and enable more function GC).

286

Is this member really needed? The helper is not intended to own the summary.

lib/Bitcode/Reader/BitcodeReader.cpp
5708

Is it better to introduce a symbolic name for the start index instead of hardcoded 3?

lib/Bitcode/Writer/BitcodeWriter.cpp
2520

Should this be called 'getBlockProfileCount' ? This utiltiy method belongs to include/llvm/ProfileData/ProfileCommon.h.

2864–2868

Should the type be SmallVector<uint64_t, 64> ? GUID is 64 bit type.

davidxl edited edge metadata.Feb 16 2016, 1:03 PM
davidxl added a subscriber: eraman.

+eraman This patch defines scaledCount() method which should probably be in ProfileCommon.h. There is a similar method in your inliner patch which can be moved there too.

tejohnson added inline comments.Feb 23 2016, 2:23 PM
include/llvm/Bitcode/LLVMBitCodes.h
190–202

What is the advantage of saving bits now? The format is still highly in flux anyway, and once it is nailed down eventually we would need to use some kind of version id to specify the change regardless of whether it was using saved bits or not.

include/llvm/IR/FunctionInfo.h
97

I could do that by overloading the count field to hold the static number of callsites. Alternatively, for statically-generated profile information is it useful to record the block frequency sums or something like that?

286

The helper does own the function summary temporarily. We create it when the function summary section is read, then later transfer ownership to the index when the VST is read. The reason why I need to keep a non-owning pointer to the summary is that later on after all VST entries are read from the combined index I need to set up the call graph edges in the combined index, and for that I need to keep the association here between the function summary and the CallGraphEdgeValueIdList.

lib/Bitcode/Reader/BitcodeReader.cpp
5708

I can do that. There's a lot of hardcoded start indices just like this in the file already, but I agree a symbolic name would be clearer.

lib/Bitcode/Writer/BitcodeWriter.cpp
2520

Good idea. Will rename and move.

2864–2868

Good point, it should be uint64_t, but for a different reason. We aren't writing the GUID, but rather a value id which is unsigned. But we are writing the profile count when we have profile data, which is uint64_t. Fixing it here and when writing the combined summary.

davidxl added inline comments.Feb 23 2016, 5:23 PM
include/llvm/Bitcode/LLVMBitCodes.h
190–202

ok.

include/llvm/IR/FunctionInfo.h
97

Regarding static # of callsites -- another way is to build a multi-graph -- basically do not collapse all edges from one caller to the same callee. How much space do we save there?

If we want to save space, there is a better way to do this -- we can build a graph with edges from module to the callee. Such a module->function edge contains the following information :
{ Aggregate call count, max call count, static # of sites }

It is probably not useful to record static block frequency.

286

owns it temporarily in what sense? Does it have a chance to de-allocate it?

davidxl added inline comments.Feb 23 2016, 7:58 PM
include/llvm/IR/FunctionInfo.h
97

My suggestion regarding using module->function edge can not replace the aggregate function->function call edge here -- but it can be emitted as additional information to track static callsite info -- but that is independent of what this patch does.

tejohnson added inline comments.Feb 24 2016, 7:46 AM
include/llvm/IR/FunctionInfo.h
97

Right, we still want the function->function edges for better precision. Will consider module->function edges at a later point.

I will need to do some measurements to see what the overhead is of not collapsing the edges within each function. But if we simply add a static callsite count (possibly both with and without PGO), I think that may be sufficient.

286

It passes ownership to the function index in memory when the corresponding VST entry is created. After all the VST entries are created, it uses the saved (non-owned) pointer to create the call graph edges in the function index in memory (need to wait until all the VST entries are parsed as we need to know all the value id and GUIDs). See FunctionIndexBitcodeReader::parseValueSymbolTable (note the function return is in the EndBlock case in the switch statement, not at the bottom of the routine).

The other alternative would be to keep a mapping there from the FunctionSummaryIOHelper object to the corresponding FunctionSummary pointer (created when we parse the corresponding VST entry and transfer the ownership of the unique_ptr to the function index). I think that is probably cleaner and more straightforward. Will make that change.

As an aside, I just noticed that the ValueIdToCallGraphGUIDMap is only used in this routine, I will move it to function local instead of being on the FunctionIndexBitcodeReader class.

davidxl added inline comments.Feb 24 2016, 9:19 AM
include/llvm/IR/FunctionInfo.h
286

Right -- if the IO helper class only conditionally pass the ownership to functionIndex in some cases, the unique pointer is needed. However if it is intended to be always passed to another object, the helper just need to keep a naked pointer to the summary as it does not really own it.

tejohnson added inline comments.Feb 24 2016, 9:34 AM
include/llvm/IR/FunctionInfo.h
286

In the BitcodeWriter it stays owned by the helper. In the BitcodeReader ownership is always transferred to the in-memory function index. Even in the latter case, I think creating a unique_ptr immediately, rather than creating a naked pointer and only creating a unique_ptr when the VST is read, is cleaner as it makes the current ownership clear and avoids malloc.

tejohnson updated this revision to Diff 48954.Feb 24 2016, 9:36 AM
tejohnson edited edge metadata.

Address most of davidxl's comments (except keeping track of total static
count of call edges, that is still TODO).

mehdi_amini edited edge metadata.Feb 24 2016, 9:42 AM

I have a concerned about the "Only inter-module calls are recorded" part. This prevents to perform accurate pure summary-based importing decision (that I'm planning on doing).
Have you measured the impact of representing the full call-graph in the summary?

In D17212#360805, @joker.eph wrote:

I have a concerned about the "Only inter-module calls are recorded" part. This prevents to perform accurate pure summary-based importing decision (that I'm planning on doing).
Have you measured the impact of representing the full call-graph in the summary?

Good point, I forgot that would be needed if we want to make summary-only importing decisions. Let me add the intra-module edges and measure the added overhead.

In D17212#360805, @joker.eph wrote:

I have a concerned about the "Only inter-module calls are recorded" part. This prevents to perform accurate pure summary-based importing decision (that I'm planning on doing).
Have you measured the impact of representing the full call-graph in the summary?

Good point, I forgot that would be needed if we want to make summary-only importing decisions. Let me add the intra-module edges and measure the added overhead.

For 483.xalancbmk, adding in the intra-module calls adds ~4% to the combined index size (so 26-27% over no callgraph) with and without PGO. This seems reasonable. Will update the patch to include those.

tejohnson updated this revision to Diff 48980.Feb 24 2016, 2:21 PM
tejohnson edited edge metadata.
  • Include intra-module calls.
eraman added inline comments.Feb 24 2016, 2:32 PM
include/llvm/ProfileData/ProfileCommon.h
98

Why not just do
ScaledCount = EntryCount * BlockFreq / EntryFreq using APInt<128>, take the lower 64 bits if the result fitx within uint64_t and use UINT64_MAX otheerwise?

I skimmed through and it sounds good to me, please see a few comments below.

mehdi_amini added inline comments.Feb 25 2016, 12:45 AM
include/llvm/Bitcode/LLVMBitCodes.h
190–203

Side note: we will need to nail the format in the coming weeks/months somehow, or have a plan for backward compatibility (auto upgrade...).

195

It is a bit annoying to have to manually overload for every "optional" data. It may also lead to a combinatory explosion if there were many. Unfortunately the only alternative in the current bitcode format is to have function summary stored as block instead of record...

include/llvm/IR/FunctionInfo.h
97

Without profile data, it might be useful to record the number of static callsites from a function to the callee. This can help compiler backend to make better global decisions later (e.g. better inlining and enable more function GC).

Can you elaborate how would you use this information and how it would help inlining and GC?

298

I think we usually prefix takeXXX and not getXXX when ownership is transferred.

Responses to Mehdi's comments.

include/llvm/Bitcode/LLVMBitCodes.h
190–203

Agreed, although I think we are still in the heavy churn tuning phase.

195

Actually my first implementation of this used a single record id. What I did was add a bit to indicate whether there was any profile data, so that the reader knew how to interpret the array data at the end (as either a list of just callee ids, or <callee id, count> pairs). I think I used the size of the record to deduce whether there was any call information. I decided to switch to explicitly encoding this information in the record id because it seemed cleaner and more consistent, and also removed a bit from each record (the profile flag). This reduces the size for large .o bitcode files and for the combined index, although the size for small .o bitcode files is a bit larger due to the 2 extra abbrev ids.

I could probably go back to using the size to deduce the presence or absence of call information, and reduce this to 2 per summary type. WDYT?

include/llvm/IR/FunctionInfo.h
97

I think what David was alluding to was similar to what I was describing on IRC yesterday - if you know you have a single static callsite via the summary, you can give the same inline cost benefit currently given for internal(ized) single-callsite functions. Then if it is inlined, hopefully the original out-of-line function can be eliminated via linker GC.

To do this, the combined index creation step would need to aggregate the # static callsites to each function and put that into the callee function's combined summary.

298

Ok, will fix.

Response to eraman's comment.

include/llvm/ProfileData/ProfileCommon.h
98

I looked at APInt briefly when I was thinking about how to do this, but was concerned that it was going to be inefficient in the common case. However, it looks like that isn't the case as it has a fast path. Will change this to use it.

Note I actually stole this code from RAGreedy::initializeCSRCost() (which could probably be refactored to use this new method once it goes in).

mehdi_amini added inline comments.Feb 25 2016, 8:49 AM
include/llvm/IR/FunctionInfo.h
291

Why this?
If needed, replace with FunctionSummaryIOHelper() = default;

tejohnson added inline comments.Feb 25 2016, 9:31 AM
include/llvm/IR/FunctionInfo.h
291

This was leftover from when I had a member variable that was being passed in and initialized in the constructor. Removing this now.

tejohnson updated this revision to Diff 49088.Feb 25 2016, 10:14 AM
  • Address comments by eraman and jokereph.
tejohnson updated this revision to Diff 49104.Feb 25 2016, 11:46 AM

Add static callsite count to summary callgraph edges.
For 483.xalancbmk this increases the size of the combined index
by another 3.5%.

mehdi_amini added inline comments.Feb 25 2016, 11:54 PM
include/llvm/IR/FunctionInfo.h
105

Missing const variant.

tejohnson added inline comments.Feb 26 2016, 6:31 AM
include/llvm/IR/FunctionInfo.h
105

Adding that now.

tejohnson updated this revision to Diff 49181.Feb 26 2016, 6:34 AM
  • Add const variant of FunctionSummary::edges

How could we integrate accesses to global variable as part of this?
It turns out that to be able to benefit from the linker information on what symbol is exported during the import, this is a must.

In D17212#362864, @joker.eph wrote:

How could we integrate accesses to global variable as part of this?
It turns out that to be able to benefit from the linker information on what symbol is exported during the import, this is a must.

Well without it you still can see which function symbols will be exported, just not the variables, so you are running with less info and I guess need to assume that all static variables will be exposed and promote them. To refine that behavior for variables, yes, we'd need additional info in the summary.

(For davidxl or anyone else who didn't see the IRC conversation, Mehdi is looking at doing pure summary-based importing decisions in the linker step, then giving this info to the ThinLTO backends to avoid promotion of local values that aren't exported. For a distributed build if we wanted to do it this way the importing decisions would all be made in the plugin step, then would need to be serialized out for the distributed backends to check.)

Two possibilities depending on the fidelity of the info you think you need:

  1. One possibility is to just put a flag in the function summary if it accesses *any* local variables, and adjust the importing threshold accordingly. Then in the ThinLTO backend for the exporting module you need to check which of your own functions are on the import list, and which local variables they access, and promote accordingly.
  1. If it will be really beneficial to note exactly which local variables are accessed by which function, we'll need to broaden the edges list to include accesses to variables (I assume you only care about local variables here). E.g. the per-module summary edge list for a function would need to include value ids for any local variables referenced by that function (not sure that the other parts of the triple, the static and profile counts, are needed for that). Then in the combined VST we need to include entries for GUIDs of things that don't have a function summary, but are referenced by these edges. When accessing a function summary edge list for a candidate function to import, you could then see the GUID of any local variables accessed. You wouldn't know them by name, but if for example you wanted a heuristic like "if >=N hot import candidate functions from module M access a local variable with GUID G, go ahead and import those and let G be promoted by the backend (which like in approach #1 needs to check which local variables are accessed by any functions on an import list)".

Obviously 1) is easier and cheaper space-wise. What are your thoughts?

Is there a need to mark it? Does the following work?

  1. using summary call graph to do importing decision
  2. for any function that is imported by another module, mark it as

'exported'

  1. When processing the exporting module, check any exported function and

promote any statics it references.

David

Is there a need to mark it? Does the following work?

  1. using summary call graph to do importing decision
  2. for any function that is imported by another module, mark it as

'exported'

  1. When processing the exporting module, check any exported function and

promote any statics it references.

David

That does work for correctness, but Mehdi was proposing that functions accessing local values need to meet a higher importing threshold so that we don't cause too many promotions due to over-eager importing.

In terms of serializing the info out for doing any of this with a distributed build, we could do it via an "exported" bit in the combined summary though.

mehdi_amini added inline comments.Feb 28 2016, 7:53 PM
lib/Bitcode/Reader/BitcodeReader.cpp
5528

Shouldn't this be ValueIdToCallGraphGUIDMap[ValueID] = Function::getGUID(FunctionGlobalId); ?

tejohnson added inline comments.Feb 29 2016, 6:42 AM
lib/Bitcode/Reader/BitcodeReader.cpp
5528

Good catch! Yes, it should be. The VST_CODE_ENTRY case does not need to invoke getGlobalIdentifier since those are externally-defined functions (and so not possibly local).

But looking at the other places here where we set up this map I realized that the VST_CODE_COMBINED_FNENTRY case was wrong - we already have the GUID and there is no ValueName in that record. I have a fix for both coming up shortly.

tejohnson updated this revision to Diff 49371.Feb 29 2016, 7:09 AM
  • Fix ValueIdToCallGraphGUIDMap entries.
tejohnson updated this revision to Diff 49994.Mar 7 2016, 2:04 PM

Update patch to include full reference graph.

Some notable changes from prior patch:

  • include non-call reference edges in summary records
  • introduce summary entries for global variable initializations
  • add the VSTOFFSET record to the combined index bitcode file, and use it when reading both the per-module and combined indices to parse the VST before the summary section (analogous to how it is used when parsing normal bitcode to parse the VST before the function blocks), which reduces the amount of bookkeeping required to associate the VST entry with the value ids used in the summary section.

Bitcode name changes:

  • add a new FS_*_GLOBALVAR_INIT_REFS records to hold ref graph edges from global variable inits to referenced global vars.
  • changed the VST_CODE_COMBINED_FNENTRY name to VST_CODE_COMBINED_GVDEFENTRY to reflect the broadened scope (can correspond to either a function or variable def and associated summary).
  • changed FUNCTION_SUMMARY_BLOCK_ID to GLOBALVAL_SUMMARY_BLOCK_ID to reflect additional scope

Data structure and related naming changes:

  • Added a new GlobalValueSummary type which holds summary info common to both the function and globalvar summaries.
  • Made the existing FunctionSummary record a derived class of the new GlobalValueSummary.
  • Added a new derived GlobalVarSummary class of GlobalValueSummary.
  • Renamed FunctionInfo* -> GlobalValueInfo*
  • Renamed FunctionSummary -> GlobalValueSummary or just Summary as appropriate
  • Misc other variable and method renamings to reflect above changes.

TODO:

  • Rename FunctionInfoIndex to something more broad (ModuleIndex?) (ditto for related classes like FunctionInfoIndexBitcodeReader).
  • Move getGUID and getGlobalIdentifier from Function to GlobalValue class.
  • Some lingering FunctionSummary references that need to be updated to GlobalValueSummary or just Summary.

I will work on the TODO items now but wanted to get this patch out for review,
testing by Mehdi, and feedback.

Note that some of these changes (many of the renames, can be committed first
as NFC changes).

LGTM overall. Still some comments/questions below.

Thanks for the good work!

include/llvm/IR/FunctionInfo.h
73

Can't/shouldn't be protected?

91

why to be written to...?

99

I think the usual terminology in LLVM is "PGO" (I had to google FDO)
(same below)

104

F?

Also again the "to be written to" seems off to me.

146

Can't you have a single ctor?

GlobalValueInfo(uint64_t Offset = 0, std::unique_ptr<GlobalValueSummary> Summary = nullptr)

lib/Bitcode/Reader/BitcodeReader.cpp
416

I guess more renaming could have been done here right?
(I really don't mind, but since you cared to rename many places below...)

454

doc

5749

Mmmmm....

5759

Mmmmm (bis)....

lib/Bitcode/Writer/BitcodeWriter.cpp
920

I'd write it:

if (M->getValueSymbolTable().empty())
  return 0;
return WriteValueSymbolTableForwardDecl(Stream);
2346

Why is it no longer valid?

2517–2518

Mind adding a one line comment?

2530

Guarantee on the recursion depth?

2582

if(CS) could be hoisted out of the loop.

2594

You depend on the order of the calls.
If an instruction first add a call to a function, but later on will reference it, you won't remove it from RefEdges.

This may be intended and you want to have an accurate count of ref + calls and here you're trying to filter calls out of refedges. However a call instruction could legitimately refer the function (a call passing a function pointer as an argument for instance)

2596

?

2617

?

2957

Isn't the opposite that the code is doing?

Another question: you reported measurements in the description, is it up to date with the new patch?

In D17212#369588, @joker.eph wrote:

Another question: you reported measurements in the description, is it up to date with the new patch?

Not yet, need to update that. The size of the thinlto.bc combined index went up quite a bit percentage wise with the ref graph, especially once I corrected it to recursively look for references, and added in the global variable init refs. Here are the new stats, only have non-PGO so far, will collect PGO as well and update the summary (once I make a fix described below to add linkage types to variables):

For 483.xalancbmk (non-PGO):
Total .o (bitcode) size: +1.24%
Combined .thinlto.bc size: +120%
Aggregate .o + .thinlto.bc size: +3.16%

I gave the last number because the size of the combined index .thinlto.bc file is on the same order of magnitude as the largest individual .o bitcode files, so the increase should be taken in that context.

include/llvm/IR/FunctionInfo.h
73

Good point, will make protected.

91

Good catch, this was leftover from the prior patch when it was part of a bitcode writer helper, didn't updated it up after moving here. Will fix.

99

Will fix...bummer that every compiler I have worked on seems to use a different term! PGO/FDO/PBO...bah.

104

Same issue as above, stale comment, will update.

146

Yes, will fix.

lib/Bitcode/Reader/BitcodeReader.cpp
416

Yep, this was part of the TODO list in this patch update. Working on that now. Specifically, I'm changing FunctionIndex* and FunctionInfoIndex* to ModuleSummaryIndex*. I left this for a follow-on update because already the renaming was pretty extensive, and this particular change bleeds into some of the interfaces and variable/field names used in other parts of the compiler (clang etc).

454

Will do.

5749

Oops. In my haste to get the renaming done and send this out, forgot that I planned to add a linkage type to the global var init records. Will add now.

5759

Ditto.

lib/Bitcode/Writer/BitcodeWriter.cpp
920

Good idea, will fix.

2346

Good catch. I think this got removed when I was using a different data structure here for awhile, and I missed it when restoring the old approach. Will add back.

2517–2518

Will do.

2530

Beyond the Visited set check earlier? What would you like to see?

2582

Yes, missed this when I cleaned things up from when I was initially looking for non-call references inside this loop.

2594

I don't think the order of calls matters? The reference list is populated once outside of this loop.

On your second point, that is true that this will cause a function both called and referenced in another way to only be recorded as called by this function. Is it important to list both types of references? I was thinking that it was important for importing needs to distinguish the functions being called from the other non-call references, but that essentially the combination of the two are the full reference set of the function. If I should put it in both places, I will need to figure out how to distinguish the two types of function references in findRefEdges.

2596

More cruft left from when I was looking for non-call refs in this loop, will remove

2617

Incomplete cleanup of debugging output, will remove.

2957

Yes, good catch. I think what I initially wanted was to record them this way, but the issue is that the alias doesn't have a separate summary, which is why we are grabbing the aliasee summary here. Will update the comment.

mehdi_amini added inline comments.Mar 8 2016, 11:42 AM
lib/Bitcode/Reader/BitcodeReader.cpp
416

Fine with me.

lib/Bitcode/Writer/BitcodeWriter.cpp
920

Re-reading it, the variable had the nice property of having a name, making it very clear what is returned.
If you adopt the above suggested change, I would add a one line comment along the line of // return the VSTOffsetPlaceholder if we have a VST

2530

I was not worried on non-termination or corretness, but depth that would explode the stack.

It was late and I couldn't think clearly enough to find the answer myself. I just got a coffee so I should be able to answer myself now ;)

-> For every instruction you will pull transitively all the operands until you reach a global (or something already seen). i.e. this is a DFS search on the SSA graph.
If I'm correct, a worklist would probably be appropriate here.

2582

Re-reading, I'm not even sure what this loop on the operand is doing at all!
Wouldn't the code do exactly the same if you just remove the loop and turn the test into if (CS)`

2594

I pointed what I saw as an inconsistency, but I may misunderstand totally what you're trying to do here.

So I will assume in the following that:

  • RefEdges contains the global values referenced by the current function
  • CallGraphEdges contains the list of Functions called by the current function

Do we agree that we should have either of these properties:

  1. A function that is part of CallGraphEdges should *not* be present in RefEdges even if it is referenced in another way than a call
  2. A function that is part of CallGraphEdges should *also* be present in RefEdges *if it is referenced* in another way than a call

What I read from your code right now is:

  1. A function that is part of CallGraphEdges *may or may not* be present in RefEdges *if it is referenced* in another way than a call.
tejohnson added inline comments.Mar 8 2016, 1:21 PM
lib/Bitcode/Writer/BitcodeWriter.cpp
920

Will do.

2530

I wouldn't have thought that an instruction would be large enough to cause stack explosion. But I have no issue with changing this to a worklist iteration instead.

2582

Right, this loop is bogus!

2594

Yep, I was distracted by the bogus loop above, we are collecting this per instruction so there is an ordering issue, and we are getting #3 which is undesirable.

I think the way to fix this is to check for the callsite first, then pass in some info from the callsite to findRefEdges so that the callsite reference is itself ignored. I.e. pass in either the callee GV and skip it to get #1, or pass in the Use to get #2.

I saw David's follow-on that he thinks #2 is best. I can try that.

mehdi_amini added inline comments.Mar 8 2016, 4:13 PM
lib/Bitcode/Writer/BitcodeWriter.cpp
2530

I may misunderstand, but it seems to me that the depth is not the width of a single instruction, but potentially on the order of the number of instructions in the Function.
If I'm wrong and you're bounded somehow by the number of operands, then recursion is fine with me.

tejohnson added inline comments.Mar 9 2016, 7:21 AM
lib/Bitcode/Writer/BitcodeWriter.cpp
2530

If you are right then this needs to change beyond moving to a worklist, as I only intended to capture the references within a single instruction or variable def.

The two entry points to this routine are in WriteFunction, where we pass in an Instruction as the User, and in WriteModuleLevelReferences, where we pass in a GlobalVariable (and I had confirmed that the operands of a variable are its initializer). We recursively walk the given User's operands(). I wouldn't have thought that we could jump from a given Instruction to other instructions in the function this way?

tejohnson added inline comments.Mar 9 2016, 11:57 AM
lib/Bitcode/Writer/BitcodeWriter.cpp
2594

Actually it turns out to be pretty simple, we only need to pass in the callee GV to findRefEdges to exclude it (because you can't have both a call and a non-call reference to a function in the same instruction, at least I couldn't find a way!).

For xalancbmk this fix actually reduced the combined index and .o sizes a bit. It turns out that with the old flow we were inadvertently including intrinsics (which were ignored in the call graph edge list) in the reference list.

In the new version of the patch I will upload shortly I've included a new test that checks for both issues (ensuring a function is in both the call list and ref list if it is accessed both ways by the function, and ignoring intrinsics).

tejohnson updated this revision to Diff 50173.Mar 9 2016, 11:58 AM
  • Address Mehdi's review comments.
tejohnson updated this revision to Diff 50287.Mar 10 2016, 9:14 AM

Fixes to findRefEdges discussed in review and IRC

Two main changes to findRefEdges:

  1. Use worklist iteration instead of recursion

The reason is that when we invoke with an Instruction, it will iterate
into other instructions in the function when it encounters a use of
a local variable defined in another instruction. Ultimately we are
analyzing all instructions in the function, and we prevent duplicate
work by sharing the Visited set across all instructions in the function.
However, using a worklist approach is more efficient for large functions
where with recursion we may end up recursing many times (using much
stack).

  1. More accurate detection and skipping of callsite references

The prior solution would miss the non-call reference to foo in the
following case:
foo((void *)foo);
since we were skipping all GVs that matched the callee GV.

Additionally, as noted in 1) we may traverse through local variables
defined on other instructions. Those other instructions may be calls
which return a value. We want to make sure we don't count calls in other
instructions encountered this way as non-call refs. Depending on the
order of traversal (e.g. a use on a phi encountered before the call
instruction), we can't count on the call being in the visited set.

Finally, doing the checking for callsites here avoids the need for
WriteFunction to special case call instructions, which has its own set
of issues. For example, one solution considered was for call
instructions to invoke findRefEdges on each of its data_ops(). This
means findRefEdges would need to distinguish between a GV user that was
passed in being either a reference (if it was a call instruction
operand) or a global variable for which we want to analyze initializers
(i.e. called from WriteModuleLevelReferences).

Finally, enhanced the new test quite a bit to check for the various
cases I encountered when fixing findRefEdges.

mehdi_amini accepted this revision.Mar 10 2016, 10:20 AM
mehdi_amini edited edge metadata.

LGTM.
Thanks! I think we reached the point of "good enough". I feel this review was productive :)

This revision is now accepted and ready to land.Mar 10 2016, 10:20 AM
In D17212#372070, @joker.eph wrote:

LGTM.
Thanks! I think we reached the point of "good enough". I feel this review was productive :)

Great, thanks for the reviews! Agreed, the final state is much more complete and accurate.

Will update the title and summary along with updated stats before I commit.

tejohnson retitled this revision from [ThinLTO] Support for call graph in per-module and combined summary. to [ThinLTO] Support for reference graph in per-module and combined summary..Mar 11 2016, 6:57 AM
tejohnson updated this object.
tejohnson edited edge metadata.
tejohnson updated this object.Mar 11 2016, 7:00 AM
This revision was automatically updated to reflect the committed changes.

This is the same as what was reviewed, except for 2 small changes that can be post-commit reviewed:

Mehdi: A bad interaction between my changes and your libLTO interfaces committed this week (which merges into the first index when creating the combined index, instead of creating a new combined index and merging everything into it). See removeEmptySummaryEntries() and its callsite for the detailed explanation of what I needed to do since I am now parsing the VST first and need to do some eager creation of index entries that was being cleaned up as we merged the indexes.

Easwaran: Updated getBlockProfileCount as per our conversation this morning to do the multiplication before the integer divide in order to avoid inaccuracies due to truncation.