This is an archive of the discontinued LLVM Phabricator instance.

[MemProf] Context disambiguation cloning pass [patch 1a/3]

Authored by tejohnson on Jan 3 2023, 10:01 AM.



Support for building, printing, and displaying CallsiteContextGraph
which represents the MemProf metadata contexts. Uses CRTP to enable
support for both IR (regular LTO) and summary (ThinLTO). This patch
includes the support for building it in regular LTO mode (from
memprof and callsite metadata), and the next patch will add the
handling for building it from ThinLTO summaries.

Also includes support for dumping the graph to text and to dot files.

Follow-on patches will contain the support for cloning on the graph and
in the IR.

The graph represents the call contexts in all memprof metadata on
allocation calls, with nodes for the allocations themselves, as well as
for the calls in each context. The graph is initially built from the
allocation memprof metadata (or summary) MIBs. It is then updated to
match calls with callsite metadata onto the nodes, updating it to
reflect any inlining performed on those calls.

Each MIB (representing an allocation's call context with allocation
behavior) is assigned a unique context id during the graph build. The
edges and nodes in the graph are decorated with the context ids they
carry. This is used to correctly update the graph when cloning is
performed so that we can uniquify the context for a single (possibly
cloned) allocation.

Depends on D140786.

Diff Detail

Event Timeline

tejohnson created this revision.Jan 3 2023, 10:01 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 3 2023, 10:01 AM
tejohnson requested review of this revision.Jan 3 2023, 10:01 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 3 2023, 10:01 AM

Examples of the dot files (for new test llvm/test/Transforms/PGHOContextDisambiguation/basic.ll), also converted to png format, are attached:[.png] is the dot file generated after adding all allocation nodes, and the stack nodes corresponding to the memprof metadata on those allocations.[.png] is the dot file generated after matching callsite metadata from other calls to the stack nodes.

The color scheme on the nodes/edges: blue = cold, red = noncold, purple = both
In the dot file format there is a tooltip that shows the assigned context ids for each node and edge.

davidxl added inline comments.Jan 3 2023, 10:17 PM
60 ↗(On Diff #486018)

Keep the name of the interface 'set_intersect'. To allow overloading, make the return a reference parameter.

93 ↗(On Diff #486018)

nit: it feels better to keep the name of the interface set_subtract. For symmetry, make the 'removed' a reference parameter.


dumping utilities can be split out. Similarly for new interfaces like 'last', or set_subtract .


why is the option called '-lifetime-cold-'? should it be '-lifetime-short-'?

Enna1 added a subscriber: Enna1.Jan 8 2023, 6:59 PM

Update with a couple of fixes made when testing on a large target.

I will add tests for the fixes, but wanted to commit the code changes
right away for the review.

One of the fixes necessitated pulling the removed edges handling up from
the follow on patch and expanding it. There is another fix to avoid
infinite recursion and a couple other minor fixes.

Still reviewing, here are some initial comments.

60 ↗(On Diff #486018)

If we make the return a reference (eg. S1Ty&) this will break since we are returning a temporary. Returning const S1Ty& would work due to reference lifetime extension but constrains what we can do with the result.

131 ↗(On Diff #486018)

nit: call this back() to be consistent with iterator method names?




The behaviour of the const and mutable version of the callsites function differ. In callsites if its null it returns an empty array ref, for mutableCallsites it may assert (or segfault on an opt build). Should we avoid the assert by allocating storage instead?


So far we've used memprof to refer to all of the prior work, I would strongly prefer we continue using the same to make it easy to discover all the related code by just searching for a single term. This is also useful for listing all the options related to memprof which start with the "memprof-" prefix. Wdyt?


nit: Expand CCG to CallingContextGraph here and below.


Can we move this to MemoryProfileInfo.h and use it here as well as MemoryProfileInfo.cpp where it exists as a static function?


I guess inheriting from std::pair provides equality operators etc? Is there any other reason to derive from std::pair as compared to just defining the members inline.

nit: struct FuncInfo final (same for CallInfo below)


nit: IMO operator bool() is less cryptic than (bool)*this


This lambda is doing a lot of work, can me move this out to its own method? It's a little hard to read right now because of the size and doing so would make it a little easier to read.


auto& to avoid a copy?


Twine(CloneNo) should incur fewer conversions.
"pgho" suffix would change if you consider the comment about pgho/memprof above.


Can we split this patch to only implement the basic version (non-ThinLTO)?

I'll move on to the other patches in the stack and then come back to this one, thanks for your patience!


nit: s/reverse/descending?


I didn't understand how we can have this case with different MIB contexts here. Can you elaborate in the comment?


I think LastNode is redundant and CurNode can be used in L1016 and L1024.


Prefer incrementing this iterator outside the for loop. It was a little strange to see EI++ in the initialization and increment step.


It would be good to use llvm/Support/GraphWriter.h to export as dotGraph instead of re-implementing things here. You may need to use DOTGraphTraits.h too to implement some things such as tooltips which are used here.

This is probably a significant change but it would reduce the amount of code we have to maintain in this file.

snehasish added inline comments.Jan 13 2023, 11:29 AM

missing = in the comment, i.e. /*TowardsCallee=*/
There are a few other cases in this patch.


Can we use shared_ptr for the edges so that we don't have to manually track ownership?

(also mentioned in the other patch)

Thanks for the comments. I haven't had a chance to address them yet, but most of the suggestions make sense I think. I added just a few responses here.

60 ↗(On Diff #486018)

I think it should stay with a value return, not a reference, also to be consistent with other functions in this file that return the result set by value. Also, I do want the name to be different (also to be consistent with what is done elsewhere in this file - see e.g. set_subtract vs set_difference) - I think it is clearer to have a different name for the function that returns the result vs updating the first parameter.


I'd prefer not to allocate, based on where and how this is used. Since it is returning a reference, I can't have it return an empty vector like callsites().


ack - I was wondering the same thing myself, will do a rename.


Yes, it is for inheriting the operators, rather than redefining them. This same trick is used elsewhere in llvm.


Yes we could. I have some concern about the additional overhead and whether it is worth it given that tracking these didn't end up being too difficult. I'll collect some measurements, it might be small given the overhead of all the rest of the graph memory.


Will do for commit - as we discussed offline, at this point I may do so after the review which is already in progress here.


Thanks for that pointer, will do. I copied the approach used by ModuleSummaryIndex, which seems like it should eventually be migrated to using the GraphWriter as well!


This is the minimum lifetime required in order to mark the context at cold when we do profile matching.

tejohnson added inline comments.Feb 2 2023, 7:07 AM
131 ↗(On Diff #486018)

I split this out into D143184, and renamed last() to back() as suggested.


I have been playing around with using shared_ptr for the ContextEdges. For some reason, the memory is really blowing up. I expect some increase, but not what I am seeing. I confirmed I am using make_shared which should minimize the overhead, and I don't see where I would be creating extra copies that live somewhere longer than expected, but I feel like I must be doing something wrong. Still looking...

tejohnson updated this revision to Diff 500018.Feb 23 2023, 5:02 PM

Rebase (to pick up changes refactored out in D144220, D144318, and D144314).

tejohnson marked 3 inline comments as done.Feb 24 2023, 3:37 PM

I'm about to update the patch with the version that uses shared_ptr for the edges. I'm also using unique_ptr to simplify management of ContextNodes, where I found some leaks happening while debugging my shared_ptr usage. Also made a change to print the context ids in sorted order to avoid spurious test case changes going forward, but that is going to result in some test changes with this update.

Patch not yet ready for re-review, I am working on addressing the other comments while merging in some fixes I made. I also have not yet updated the follow on patches to reflect the shared_ptr changes, will do that later.


I found and fixed a bug causing this blow up. The shared_ptr memory seems reasonable.

tejohnson updated this revision to Diff 500320.Feb 24 2023, 3:38 PM

Use shared_ptr for edges, unique_ptr for nodes, and print context ids in order.

tejohnson updated this revision to Diff 503040.Mar 7 2023, 7:41 AM

Fix typo that prevented matching of inline sequences of more than one inline, and add new tests for the fix.

Updating again with another fix from my testing (see description below), along with a new test for it. I will be addressing the remaining comments/suggestions next.

tejohnson added inline comments.Mar 8 2023, 2:18 PM

I discovered when testing with a larger app and more graph validation enabled that this early return is not correct. We can have different subsets of duplicate context nodes being propagated along different edges to the same node, and so we were stopping the propagation early. This meant both an insane graph (context ids of node didn't match those of caller and callee edges) but also prevented some cloning. Removing this however leads to long compile time.

I redesigned the context duplication and the caller which is updateStackNodes to split this handling into 3:

  1. walk the calls and perform all necessary context id duplication, saving the info in a map (in updateStackNodes and a modified duplicateContextIds).
  2. propagate the context ids across the full graph in a single pass (in a new propagateDuplicateContextIds called from updateStackNodes).
  3. do the post order traversal to generate new nodes for inlined call chains, moving the context ids determined earlier (which might be duplicates) to the new node (in updateStackNodes).

This allowed handling quite a few more cold calls in my large application, with smaller compile time than the patch even without the fix to remove this early termination. I'll upload the new version of this handling in a few minutes.

tejohnson updated this revision to Diff 503519.Mar 8 2023, 2:19 PM

Fix and improve context id duplication, add test that failed node checking without fix.

tejohnson updated this revision to Diff 503874.Mar 9 2023, 12:03 PM
tejohnson marked 12 inline comments as done and an inline comment as not done.

Address most of the remaining comments.

I've now addressed all the feedback except the following 2 items that I will do next, as these will cause more global churn:

  • Use GraphWriter (I suspect this will cause some tests that check the graph output to need updating).
  • Rename PGHO* to MemProf*

Refactored out and now named assignStackNodesPostOrder


We could, but it is clearer IMO to use a name that corresponds to the meaning at those uses below. In any case, this code has been changed and we now assign LastNode more directly.

tejohnson marked an inline comment as done.Mar 10 2023, 12:05 PM

Use GraphWriter

tejohnson marked an inline comment as done.Mar 10 2023, 3:19 PM

About to upload a large but trivial change to replace PGHO/pgho with MemProf/memprof everywhere.

After this change I will make one more update after talking to @snehasish yesterday, by splitting the ThinLTO handling out into a separate patch that depends on this one. Then it is ready for re-review

tejohnson updated this revision to Diff 504294.Mar 10 2023, 3:20 PM

Replace PGHO/pgho with MemProf/memprof

tejohnson updated this revision to Diff 504318.Mar 10 2023, 6:06 PM

Remove the ThinLTO changes which are being split into another follow on patch.

tejohnson retitled this revision from [MemProf] Context disambiguation cloning pass [patch 1/3] to [MemProf] Context disambiguation cloning pass [patch 1a/3].Mar 10 2023, 6:06 PM
tejohnson edited the summary of this revision. (Show Details)

Removed the ThinLTO changes, and I will upload a patch with just those shortly. This patch is now ready for re-review.

Still working my way through updateStackNodes. Here are some of the comments I have so far --

For the IR based tests, I noticed some comments which describe how the code was compiled. I wonder if it makes sense to use synthetic IR (derived from the original source) to make the tests more readable and less brittle. I took the indirect.ll test and ran it through llvm-reduce. and found that the reduced IR was much easier to follow. I attached my script and the reduced IR for indirect.ll. To run the reducer, llvm-reduce ../path/to/indirect.ll. Let me know what you think.

29 ↗(On Diff #504318)

I think \p is for parameters but here the parameter name is just M?

17 ↗(On Diff #504318)

nit: I don't think we introduce the ThinLTO changes in this patch so maybe update this text in the follow on patches instead?

297 ↗(On Diff #504318)

Should the StackContext be a reference too? On the other hand, these are iterators so perhaps neither should be since they are lightweight.

312 ↗(On Diff #504318)

Iterating on the contents of this map (L935) is non-deterministic since we use a pointer as key. It would be nice to avoid non-determinism by using a MapVector even though I don't think there is any externally visible effect.

For NodeToCallingFunc below, we don't iterate on the contents so I think that's fine.

350 ↗(On Diff #504318)

Unclosed parenthesis?

398 ↗(On Diff #504318)

This method seems to be unused in this diff. If this is used in a subsequent diff then can we move it there? Also for moveEdgeToExistingCalleeClone which is only called by this method.

544 ↗(On Diff #504318)

Can we inline the {ContextId} list initializer here directly?

617 ↗(On Diff #504318)

Are we marking the first AllocNode we encounter as a clone since

  • LastContextId starts from 0
  • LastContextId is incremented in addStackNodesForMIB (and duplicateContextIds)
  • Comment at L183 indicates that 0 is used to identify a clone

A simple fix would be to start LastContextId from 1 in the constructor? I found the update to LastContextId hard to follow but I don't have a suggestion on how this could be made simpler.

Also if this is a bug, then maybe a unit test to check the consistency of internal data structures might be useful.

935 ↗(On Diff #504318)

We can use a structured binding here - for (auto& [Func, CallsWithMetadata] : FuncToCallsWithMetadata)

Also auto& Call below to avoid a copy of the pair.

965 ↗(On Diff #504318)

How about auto& Ids = std::get<1>(Calls[0]); since we don't need to unpack the other elements (in this if block)?

Same for functor passed to the std::sort function below.

1163 ↗(On Diff #504318)

Can we check if I isa<CallInst> (and invoke etc) before querying for metadata? I think this is simpler but a type check might speed things up by avoiding linear metadata attachment scans[1].


1167 ↗(On Diff #504318)

Should we assert I.getMetadata(LLVMContext::MD_callsite) is not null?

1177 ↗(On Diff #504318)

nit: we can move this bookkeeping to addAllocNode (similar to how we update AllocationCallToContextNodeMap in addAllocNode).

1190 ↗(On Diff #504318)

I think we should move the logic from L1190 to L1206 (end of the constructor) to the process method. It was a bit surprising for me to the process method not actually do any processing .. :)

Also this would bring all the dump and export calls in one place to follow sequentially.

1200 ↗(On Diff #504318)

It would be nice to move this cleanup to the end even though it does not affect handleCallsitesWithMultipleTargets.

1202 ↗(On Diff #504318)

auto& to avoid a copy?

1257 ↗(On Diff #504318)

nit: This can be auto* too like the one below.

1355 ↗(On Diff #504318)

Now that we have GraphTraits, can we use the graph iterators instead of our own DFS here?

1378 ↗(On Diff #504318)

I don't see CheckEdges set to true in this diff. I think we should remove it altogether or make it always check by default if we are calling verify. What do you think?

1429 ↗(On Diff #504318)

Now that we have GraphTraits, can we use the graph iterators instead of our own DFS here?

1443 ↗(On Diff #504318)

s/GetNode/getNode to conform to llvm style. I think it's only used in the nodes_* functions below so we should be good.

Same for GetCallee below.

1484 ↗(On Diff #504318)


1563 ↗(On Diff #504318)

We can omit the "else" part of "else if" after the return statements.

1574 ↗(On Diff #504318)

sstream and result variable names should be InitCaps?

396 ↗(On Diff #504318)

I don't think the Thin-LTO summaries are used in this patch. Not opposed to leaving it in for the next patch. Just curious whether this was intentional.

tejohnson marked 20 inline comments as done.Mar 20 2023, 1:07 PM

Address all comments other than the test reduction and graph traversal suggestions, which I will work on next.

Still working my way through updateStackNodes. Here are some of the comments I have so far --

For the IR based tests, I noticed some comments which describe how the code was compiled. I wonder if it makes sense to use synthetic IR (derived from the original source) to make the tests more readable and less brittle. I took the indirect.ll test and ran it through llvm-reduce. and found that the reduced IR was much easier to follow. I attached my script and the reduced IR for indirect.ll. To run the reducer, llvm-reduce ../path/to/indirect.ll. Let me know what you think.

This is a nice idea. Will do so for the tests but probably do that after addressing the other comments in case there is any more churn in the expected test output, and since it is more mechanical.

17 ↗(On Diff #504318)

Instead of removing the ThinLTO comment, I added "(eventually)" in front of it. Otherwise the CRTP design choice doesn't make sense.

312 ↗(On Diff #504318)

Actually, the only time we index as a map is while building, and we don't really need to do this since we process each function once. I can simply change this to a vector of pairs.

617 ↗(On Diff #504318)

The first AllocNode indeed will have 0 value here. However, that isn't a bug or in conflict with the comment, although I could see how it might be confusing. For alloc nodes we simply use LastContextId to give them unique ids that are only used in dot dumping (see the first sentence in the comment for OrigStackOrAllocId. For stack nodes this is the stack id corresponding to the node (i.e. from the metadata). The comment is just noting that clones will retain a 0 value for this field, but that it isn't an issue since we only use it during matching of callsite metadata (when the graph is being built). I rewrote the comment on L183, hopefully that clarifies things.

The reason why we increment LastContextId before using it in addStackNodesForMIB is that that is where we are using it for its main purpose - to assign a unique (non-zero) id to each allocation context from memprof metadata.

1190 ↗(On Diff #504318)

I would prefer not to do this, since updateStackNodes called below is part of the graph building and should be here in the graph constructor.

Note that process() is currently empty only because the graph transformations were split into patches 2 and 3.

1378 ↗(On Diff #504318)

It's used in follow on patches, when we are checking nodes in isolation, i.e. not checking the whole graph as we do here via check()/checkNodesRecursively() (the latter ensures we check each edge exactly once). I'll remove from this patch.

tejohnson updated this revision to Diff 506688.Mar 20 2023, 1:07 PM
tejohnson marked 3 inline comments as done.

Address all comments other than the test reduction and graph traversal suggestions, which I will work on next.

snehasish accepted this revision.Mar 20 2023, 2:31 PM

lgtm, overall. The remaining comments are minor and thanks for the explanations in the tests.

22 ↗(On Diff #506688)

This header seems unused in this diff?

17 ↗(On Diff #504318)

Yes, that sounds good to me.

617 ↗(On Diff #504318)

Thanks for the explanation. It makes sense to me with the updated comment.

1014 ↗(On Diff #504318)

"after the last" sounds confusing, perhaps callout we are iterating backwards?

1129 ↗(On Diff #504318)

nit: Use Twine for concat?

1190 ↗(On Diff #504318)

Got it, as I worked my way through updateStackNodes I realized that it's still just set up.

1494 ↗(On Diff #504318)

std::to_string(int) can be replaced with Twine(int) directly to avoid redundant conversions. Also in other locations apart from this one.

1513 ↗(On Diff #504318)

Maybe just declare AttributeString as Twine and change the return statement to return AttributeString.str(); ?

Similar pattern of usage below too if choose to update this one.

56 ↗(On Diff #504318)

I guess once we reduce the tests, the heapallocsite metadata will be dropped.

On a somewhat related note, should we update the code below (in a separate patch) to drop memprof metadata too?

9 ↗(On Diff #506688)

Should this have trailing underscores too? i.e __attribute__((noilnine))

This revision is now accepted and ready to land.Mar 20 2023, 2:31 PM
tejohnson marked 6 inline comments as done.Mar 21 2023, 9:02 AM

Address the remaining minor suggestions (still need to update the graph walks and reduce the test cases).

56 ↗(On Diff #504318)

I guess once we reduce the tests, the heapallocsite metadata will be dropped.

Yeah, I manually removed this metadata from most of the tests, looks like I missed this one. Won't update right now since this will get stripped when I reduce the test cases.

On a somewhat related note, should we update the code below (in a separate patch) to drop memprof metadata too?

Maybe? Unlike the heapallocsite metadata, the memprof metadata doesn't become invalid when the non-line table debug info is removed (it doesn't directly reference it). So it might depend on the intent of this code - do you know when it is invoked? I looked a bit but it wasn't clear to me.

tejohnson updated this revision to Diff 507004.Mar 21 2023, 9:02 AM

Address the remaining minor suggestions (still need to update the graph walks and reduce the test cases).

snehasish added inline comments.Mar 21 2023, 10:02 AM
56 ↗(On Diff #504318)

From the commit message for the pass [1] it looks like the primary usage is to trim bitcode before embedding. The other use case is for bugpoint related reduction which might be useful for us.

doesn't become invalid

Yes, leaving it in is benign and its very unlikely that a memprof annotated bitcode will end up in a place where this is a concern. Up to you if you want to take any action on this.


tejohnson added inline comments.Mar 21 2023, 10:40 AM
1513 ↗(On Diff #504318)

Actually this didn't work here, since Twine's operator= is deleted, and getNodeAttributes requires conditional updates to the string. I was able to modify getEdgeAttribute below to use a single Twine, however. I also had to change getContextIds back to using std::string for similar reasons, but updated it to use Twine instead of std:to_string.

Fix Twine usage

tejohnson marked 2 inline comments as done.Mar 21 2023, 12:06 PM
tejohnson added inline comments.
1355 ↗(On Diff #504318)

For both this and check() below, we don't need or want DFS. The problem with DFS is that it seems to expect a single root entry node, which we don't have. But neither of these traversals need to be in any specific order. So I have simply replaced the recursive walk with using the nodes() range iterator provided by GraphTraits.h. This is the same iteration done by the graph writer.

There are a few minor test changes.

tejohnson marked an inline comment as done.

Use GraphTraits to iterate over the graph when printing and checking.

tejohnson updated this revision to Diff 507115.Mar 21 2023, 2:04 PM

Reduced the test cases