Page MenuHomePhabricator

[LCG] Switch one of the update methods for the LazyCallGraph to support limited batch updates.
ClosedPublic

Authored by chandlerc on Aug 5 2017, 5:21 AM.

Details

Summary

Specifically, allow removing multiple reference edges starting from
a common source node. There are a few constraints that play into
supporting this form of batching:

  1. The way updates occur during the CGSCC walk, about the most we can functionally batch together are those with a common source node. This also makes the batching simpler to implement, so it seems a worthwhile restriction.
  2. The far and away hottest function for large C++ files I measured (generated code for protocol buffers) showed a huge amount of time was spent removing ref edges specifically, so it seems worth focusing there.
  3. The algorithm for removing ref edges is very amenable to this restricted batching. There are just both API and implementation special casing for the non-batch case that gets in the way. Once removed, supporting batches is nearly trivial.

This does modify the API in an interesting way -- now, we only preserve
the target RefSCC when the RefSCC structure is unchanged. In the face of
any splits, we create brand new RefSCC objects. However, all of the
users were OK with it that I could find. Only the unittest needed
interesting updates here.

How much does batching these updates help? I instrumented the compiler
when run over a very large generated source file for a protocol buffer
and found that the majority of updates are intrinsically updating one
function at a time. However, nearly 40% of the total ref edges removed
are removed as part of a batch of removals greater than one, so these
are the cases batching can help with.

When compiling the IR for this file with 'opt' and 'O3', this patch
reduces the total time by 8-9%.

I'm still working on adding a bit of specific unittest coverage for the batch
part of the API, but wanted to go ahead and send for review as that isn't very
interesting.

Diff Detail

Repository
rL LLVM

Event Timeline

chandlerc created this revision.Aug 5 2017, 5:21 AM
silvas added a subscriber: silvas.Aug 5 2017, 2:17 PM

When compiling the IR for this file with 'opt' and 'O3', this patch reduces the total time by 8-9%.

Just as a point of reference, do you have handy how much of the total compile time are we spending in the PM's CGSCC / LCG stuff? If we can speed up the overall time by 8-9% by improving it then it suggests that the total time is much larger, which I find somewhat surprising (presumably, we should be spending the vast majority of our time inside of the optimizations themselves).

Is this module you're working on just a really pathological case?

Or is this maybe a situation where the denormalized ref edge representation is causing a bunch of extra work?

When compiling the IR for this file with 'opt' and 'O3', this patch reduces the total time by 8-9%.

Just as a point of reference, do you have handy how much of the total compile time are we spending in the PM's CGSCC / LCG stuff?

Yep, that's how I got here. Before all of my changes to LCG, we spent just under 40% of the total optimizer time (and nearly that much of the total compile time) in this one method.

Removing the parent sets made this method obnoxiously faster (over 2x) and shaved just over 20% off the total optimizer time.

This patch makes this method another 2x faster, and shaves 8% off the total optimizer time.

After both of these, this method remains the only thing hot really, and it is now between 6-8% depending on the noise in the profile. I'm still looking for more things to make faster here. I think i have another 2-4 things that will make it faster.

If we can speed up the overall time by 8-9% by improving it then it suggests that the total time is much larger, which I find somewhat surprising (presumably, we should be spending the vast majority of our time inside of the optimizations themselves).

Me too. Turns out this method is really, really hot.

Is this module you're working on just a really pathological case?

It is entirely possible this module is pathological -- I've not seen this when profiling other compiles.

Unfortunately, it is a pathological case that is *generated code* that for me, happens to represent like 10% of all the files we compile and many of the slowest files we compile. So even if this module is really, really weird I still need it to compile very fast. =]

Or is this maybe a situation where the denormalized ref edge representation is causing a bunch of extra work?

I'm sure this is part of the problem. But fixing this is (much) harder than making this routine fast. And it doesn't *appear* to be the primary problem. I instrumented this routine expecting to see removal of 100,000+ edges all the time and think "oh, so its denormalized...". That wasn't what happened. In this (potentially pathological) case, there are only like 7k ref edges ever removed in the entire opt run! And most (some 3.5k) are removed due to a *single* ref edge becoming dead after the inliner runs. The fact that the optimizer is deleting one edge at a time is pretty strong evidence that the denormalization isn't as prominent in this problem as I would have expected. I'm pretty sure based on these numbers that the real reason this ends up slow in this module is the size of RefSCC that we're removing edges from.

Still, I suspect *this* patch to be mostly making the denormalized representation scale better (by batching their removal). But it seems much easier to do this than to switch representations, and this patch actually makes the code simpler, so it seemed like a fine intermediate step.

silvas added a comment.Aug 5 2017, 9:01 PM

When compiling the IR for this file with 'opt' and 'O3', this patch reduces the total time by 8-9%.

Just as a point of reference, do you have handy how much of the total compile time are we spending in the PM's CGSCC / LCG stuff?

Yep, that's how I got here. Before all of my changes to LCG, we spent just under 40% of the total optimizer time (and nearly that much of the total compile time) in this one method.

Removing the parent sets made this method obnoxiously faster (over 2x) and shaved just over 20% off the total optimizer time.

This patch makes this method another 2x faster, and shaves 8% off the total optimizer time.

After both of these, this method remains the only thing hot really, and it is now between 6-8% depending on the noise in the profile. I'm still looking for more things to make faster here. I think i have another 2-4 things that will make it faster.

If we can speed up the overall time by 8-9% by improving it then it suggests that the total time is much larger, which I find somewhat surprising (presumably, we should be spending the vast majority of our time inside of the optimizations themselves).

Me too. Turns out this method is really, really hot.

Is this module you're working on just a really pathological case?

It is entirely possible this module is pathological -- I've not seen this when profiling other compiles.

Unfortunately, it is a pathological case that is *generated code* that for me, happens to represent like 10% of all the files we compile and many of the slowest files we compile. So even if this module is really, really weird I still need it to compile very fast. =]

Or is this maybe a situation where the denormalized ref edge representation is causing a bunch of extra work?

I'm sure this is part of the problem. But fixing this is (much) harder than making this routine fast. And it doesn't *appear* to be the primary problem. I instrumented this routine expecting to see removal of 100,000+ edges all the time and think "oh, so its denormalized...". That wasn't what happened. In this (potentially pathological) case, there are only like 7k ref edges ever removed in the entire opt run! And most (some 3.5k) are removed due to a *single* ref edge becoming dead after the inliner runs. The fact that the optimizer is deleting one edge at a time is pretty strong evidence that the denormalization isn't as prominent in this problem as I would have expected. I'm pretty sure based on these numbers that the real reason this ends up slow in this module is the size of RefSCC that we're removing edges from.

In case you haven't dug in, I just dug in a bit in Mathematica out of curiousity and it seems that the large RefSCC's come from the lazy descriptor initialization.

This is the only RefSCC in the .pb.cc file I tried that wasn't just a single node. The .proto this was generated contains a message Foo containing a message Bar containing a message Message3 (.proto: https://reviews.llvm.org/F4745358, .pb.cc: https://reviews.llvm.org/F4745329, .pb.h: https://reviews.llvm.org/F4745572).


https://reviews.llvm.org/F4740125
(sorry, Mathematica doesn't have an easy way to jitter the nodes with this layout so that the labels don't overlap; bummer)

_Z31protobuf_AssignDesc_foo_2eprotov contains a call edge to _Z28protobuf_AddDesc_foo_2eprotov which contains ref edges to _ZN3FooC2Ev and the corresponding constructors for Baz and Message3. In turn, these constructors reference the vtable pointer which leads to the MergeFrom (the virtual one taking a ::google::protobuf::Message) and virtual MergePartialFromCodedStream which are big methods that recursively call into similar code of any contained messages.

In other words, it seems that there will always be a RefSCC that is of size approximately O(num messages defined in a given .proto file) which explains why you're seeing such massive RefSCC's.

(the dot file this came from is https://reviews.llvm.org/F4735714 which was generated with opt -passes=print-lcg-dot)

Still, I suspect *this* patch to be mostly making the denormalized representation scale better (by batching their removal). But it seems much easier to do this than to switch representations, and this patch actually makes the code simpler, so it seemed like a fine intermediate step.

Makes sense. Just looking at the graph above, it makes intuitive sense because since vtables don't contain pointers to other vtables, the big fanout doesn't end up being multiplied by the transitive closure of base classes or something like that. I.e. the ref fanout is limited by the size of a single vtable, rather than potentially the total size of all vtables in a module. So that's not terribly problematic.

silvas added a comment.Aug 5 2017, 9:18 PM

Sorry, to be clear, the entire RefSCC is something like:

for each message Foo:
_ZN3FooC2Ev --(ref through vtable)-> _ZNK3Foo11GetMetadataEv --> _ZN12_GLOBAL__N_130protobuf_AssignDescriptorsOnceEv --ref-> _Z31protobuf_AssignDesc_foo_2eprotov --> _Z28protobuf_AddDesc_foo_2eprotov --> _ZN3FooC2Ev (and all the other C2 constructors)

Additionally, to make the RefSCC even larger, when message Foo contains a Message Baz, then Foo::MergeFrom and Foo::MergePartialFromCodedStream end up referencing _ZN3FooC2Ev because they new a Baz object via the mutable_* methods (to make things even worse they also directly call Baz::{MergeFrom,MergePartialFromCodedStream}. So all the MergeFrom/MergePartialFromCodedStream end up in the RefSCC as well.

Really nice analysis. I hadn't gotten that far, so that is helpful. Sadly, the descriptor stuff being the source isn't too surprising to me.

Anyways, any thoughts about the patch itself? I've got at least one more fix prepared behind this.

craig.topper added inline comments.
include/llvm/Analysis/LazyCallGraph.h
807 ↗(On Diff #109863)

'intact' is one word

lib/Analysis/LazyCallGraph.cpp
1278 ↗(On Diff #109863)

'current'

silvas accepted this revision.Aug 7 2017, 12:25 PM

Really nice analysis. I hadn't gotten that far, so that is helpful. Sadly, the descriptor stuff being the source isn't too surprising to me.

Anyways, any thoughts about the patch itself? I've got at least one more fix prepared behind this.

Overall it looks fairly mechanical and slightly cleaner after the patch. LGTM.

Also, I looked a bit more into the graph I posted above. There is one ref edge which is what really causes the large RefSCC: _ZN12_GLOBAL__N_130protobuf_AssignDescriptorsOnceEv -> _Z31protobuf_AssignDesc_foo_2eprotov. That edge comes from the once initialization code

GOOGLE_PROTOBUF_DECLARE_ONCE(protobuf_AssignDescriptors_once_);                                                                                                                                             
inline void protobuf_AssignDescriptorsOnce() {
  ::google::protobuf::GoogleOnceInit(&protobuf_AssignDescriptors_once_,
                 &protobuf_AssignDesc_foo_2eproto);
}

(it's a ref edge, but if we were to fully inline through GoogleOnceInit it would become a call edge. The actual dispatch to protobuf_AssignDesc_foo_2eproto happens through an external function though so we're saved)

If that edge is deleted (which we can't do statically, but just for the sake of investigation), then these are the only remaining nontrivial RefSCC's.


https://reviews.llvm.org/F4829448

There is one nontrivial RefSCC of constant size per message (Foo, Bar, Message3) where all edges within the RefSCC are ref edges. These are just all the functions that statically reference the vtable (constructor/destructor), together with any virtual functions that call back into them. These are basically constant size independent of the number of messages.

Then there is one RefSCC that is O(number of messages in the .proto file). The protobuf_AddDesc_foo_2eproto RefSCC. Note that all edges here are call edges. Basically this routine just recursively default intializes all default instances, which themselves recursively initialize any instances they depend on. The thing that ties the SCC together is that default_instance() looks like this:

const Baz& Baz::default_instance() {
  if (default_instance_ == NULL) protobuf_AddDesc_foo_2eproto();
  return *default_instance_;
}

However, only the default instance stuff gets pulled into this SCC, so even though it is O(number of messages in the proto file) the constant is much smaller than when the edge _ZN12_GLOBAL__N_130protobuf_AssignDescriptorsOnceEv -> _Z31protobuf_AssignDesc_foo_2eprotov is present (which pulls in the MergeFrom and MergePartialFromCodedStream and a bunch of getters/setters and stuff into the RefSCC).

lib/Analysis/LazyCallGraph.cpp
1133–1142 ↗(On Diff #109863)

I found this a bit confusing. SourceC and TargetCs seem to be only used for this bailout here. So the section-delineating comment // Collect the SCCs for the source and targets. is a bit misleading (seems like it is collecting them for use later). Maybe just move the If all targets are in the same SCC as the source ... comment up to there and move the if a bit closer to make that clear.

In another patch you may want to consider putting the tarjan walk below into a separate function which I think would make this easier to understand. The main reason I think it seems pretty important to do that is to make it clear to the reader that we aren't doing anything fancy using our knowledge of the SourceC and TargetCs and are just walking everything in the RefSCC again, which makes the complexity obvious. (maybe beefing up the comments would help too)

(then again, I can see why you'd want to keep it all inline, because above you raw delete edges and so some of the invariants might be broken until we finish the update.)

This revision is now accepted and ready to land.Aug 7 2017, 12:25 PM
silvas added a comment.Aug 7 2017, 1:06 PM

(it's a ref edge, but if we were to fully inline through GoogleOnceInit it would become a call edge. The actual dispatch to protobuf_AssignDesc_foo_2eproto happens through an external function though so we're saved)

I said "saved" because I was mistakenly thinking that making this a call edge would have a large effect (and forgot to reword before posting), but it actually doesn't since the edges into this mainly come from Foo::GetMetadata (and Bar, Message3) which are only ever called virtually, so there is always a ref edge in the way to prevent the creation of a large CallSCC.

silvas added a comment.Aug 7 2017, 2:03 PM

One way to break up this large RefSCC is to notice that the fanout from the vtable ref edges in the ctors/dtors is contributing a lot to it being so large. If those edges are made more precise, then it can reduce the size. For example, by breaking the vtable ref edges we prevent the ref edges to virtual GetMetadata methods and so the huge RefSCC breaks up as I mentioned earlier.

This can basically be seen as increasing the precision of the reference visitation; right now, it just assumes that all transitively reachable function pointers are ref edges. This leads to a lot of false positives, as ref edges are just meant to be a conservative superset of all edges that we might eventually discover through static optimization as call edges. In this case, the C2 constructors for Foo, Baz, and Metadata3 can be seen to contain no indirect calls and all their callees are either external or contain no indirect calls. So we can avoid adding any ref edges at all. A simple context-insensitive bottom-up RPO tracking whether an indirect call instruction is present be enough for this case (it would be another cache-busting walk over all instructions in the module though, but that may be worth it to decrease the number of ref edges / ref edge operations).

I'm sure the tradeoffs and stuff for doing this kind of "may call through this function pointer" analysis are well studied in the Java literature, but I'm not really very familiar with it. The main difference for our use case here is that external functions count as "contributes no ref edges" instead of "could dynamically call potentially everything" (since we aren't concerned with runtime calls, but rather a conservative superset of statically discoverable direct calls).
But of course the point of this is to be fast and get rid of the most egregious false positive ref edges so something simple is probably enough.

chandlerc marked 3 inline comments as done.Aug 8 2017, 4:39 AM

Thanks for the review all! I've fixed the mentioned issues and am landing based on Sean's LGTM.

One way to break up this large RefSCC is to notice that the fanout from the vtable ref edges in the ctors/dtors is contributing a lot to it being so large. If those edges are made more precise, then it can reduce the size. For example, by breaking the vtable ref edges we prevent the ref edges to virtual GetMetadata methods and so the huge RefSCC breaks up as I mentioned earlier.

This can basically be seen as increasing the precision of the reference visitation; right now, it just assumes that all transitively reachable function pointers are ref edges. This leads to a lot of false positives, as ref edges are just meant to be a conservative superset of all edges that we might eventually discover through static optimization as call edges. In this case, the C2 constructors for Foo, Baz, and Metadata3 can be seen to contain no indirect calls and all their callees are either external or contain no indirect calls. So we can avoid adding any ref edges at all. A simple context-insensitive bottom-up RPO tracking whether an indirect call instruction is present be enough for this case (it would be another cache-busting walk over all instructions in the module though, but that may be worth it to decrease the number of ref edges / ref edge operations).

I'm sure the tradeoffs and stuff for doing this kind of "may call through this function pointer" analysis are well studied in the Java literature, but I'm not really very familiar with it. The main difference for our use case here is that external functions count as "contributes no ref edges" instead of "could dynamically call potentially everything" (since we aren't concerned with runtime calls, but rather a conservative superset of statically discoverable direct calls).
But of course the point of this is to be fast and get rid of the most egregious false positive ref edges so something simple is probably enough.

Sadly, I think this is a bit harder than it seems... We rely on reference edges to track *propagation* as well. So you can imagine a function F that is a leaf function and returns a function pointer. All callers might also have no indirect calls, but if we can inline F into them and inline some *other* function into them (or into their callers) we can still end up collapsing to a call.

Essentially, I think we'd end up looking at the entire reachable partition of the graph for indirect calls, and would almost always find one. =/

But honestly, with the fixes coming, even this fairly extreme case becomes very fast to handle, so I'm not too worried about this.

lib/Analysis/LazyCallGraph.cpp
1133–1142 ↗(On Diff #109863)

Yeah, this ended up not being relevant and being confusing. I've switched it to a much simpler and much more direct test for the early exit.

Regarding the longer-term issue of separating the Tarjan walk out, subsequent patches are likely to make this slightly less appealing, but I'm happy to revisit this in a subsequent patch if it makes sense.

This revision was automatically updated to reflect the committed changes.
chandlerc marked an inline comment as done.