This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombine] Limit the number of times for a store being considered for merging
ClosedPublic

Authored by wmi on Jul 23 2019, 4:58 PM.

Details

Summary

Recently we run into a case with compile time problem (27 mins in O3 but a few seconds in O0). In that case, we have a huge block containing many stores. Each time a store is visited, dagcombiner will find a candidate set all chained by the same root node for the store, then try to merge the nodes in the candidate set into a wider store. This is a quadratic algorithm, because for the same candidate set, everytime a store inside of the set is visited, the same set will be considered for merging. 

In this patch, we add a map to record how many times the pair of root node and the store node is seen before a store node is added into candidate set. If the count is above a threshold (set to 10 by default now), the store node won't be added to candidate set again. With the patch, the compile time drops to 20 seconds.

I verify that the patch won't miss store merging opportunities in most cases in the way that by building a large internal server application and bootstrapping clang w/wo the patch, I find no difference for the binaries.

Diff Detail

Repository
rL LLVM

Event Timeline

wmi created this revision.Jul 23 2019, 4:58 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 23 2019, 4:58 PM
Herald added a subscriber: hiraditya. · View Herald Transcript

This has come up a few times before (https://reviews.llvm.org/D60133 is the most recent instance). So far it's either been deemed not important enough or there was a way to limit the candidate search to limit the search to the correct size (e.g. checking the in context max valid store size and making sure there was a valid non-trivial store merge).

It would be good to understand why we're finding candidates that cannot be merged and see if the search can be tightened. I vaguely recall that the underlying cause in D60133 could be solved by replacing the canMergeStoresTo's method with something expressing maximum mergable size with this store in the current context and that that would also remove a large number of these fruitless merge searches. I suspect this may work for your case as well.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15477

nit: Hoist this and the equal check in the else clause out of the if.

wmi marked an inline comment as done.Jul 29 2019, 11:20 AM

Thanks for pointing me to D60133.

I check and the reason the candidates cannot be merged isn't because of the legal size. It is that checkMergeStoreCandidatesForDependencies always returns false so https://reviews.llvm.org/D60133 doesn't help.

For the case addressed in https://reviews.llvm.org/D60133, limiting the search to the correct size is definitely a good way to solve it. But I think it can cause compile time problem likely also because the same store and root pair appearing repeatedly, so the patch here may also help for the case in D60133.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15477

The RootNode is updated inside of then branch so it wasn't hoisted above.

I check and the reason the candidates cannot be merged isn't because of the legal size. It is that checkMergeStoreCandidatesForDependencies always returns false so https://reviews.llvm.org/D60133 doesn't help.

Ah! Interesting. I've only seen that dependence triggered from inlined memcpy.

What's the shape of the DAG here? There must be some non-standard data dependence between stores, which is not being checked. Maybe we can change the candidate search to avoid such candidates.

fhahn added a subscriber: fhahn.Jul 29 2019, 12:10 PM

Recently we run into a case with compile time problem (27 mins in O3 but a few seconds in O0). In that case, we have a huge block containing many stores. Each time a store is visited, dagcombiner will find a candidate set all chained by the same root node for the store, then try to merge the nodes in the candidate set into a wider store. This is a quadratic algorithm, because for the same candidate set, everytime a store inside of the set is visited, the same set will be considered for merging.

Could you share the test case? I looked into a similar issue and came up with some sort of caching for the lookup. I never shared the patch, because we addressed the issue with a limit. But it might be helpful for your case.

wmi added a comment.Jul 29 2019, 12:42 PM

I check and the reason the candidates cannot be merged isn't because of the legal size. It is that checkMergeStoreCandidatesForDependencies always returns false so https://reviews.llvm.org/D60133 doesn't help.

Ah! Interesting. I've only seen that dependence triggered from inlined memcpy.

What's the shape of the DAG here? There must be some non-standard data dependence between stores, which is not being checked. Maybe we can change the candidate search to avoid such candidates.

The DAG is huge and I havn't seen any speciality. About the dependence between stores, I look into it and checkMergeStoreCandidatesForDependencies always returns false because llvm::SDNode::hasPredecessorHelper visits too many nodes in the worklist which exceeds the MaxSteps and the function bails out. That is also the reason the compile time increases so much.

wmi added a comment.Jul 29 2019, 12:49 PM

Recently we run into a case with compile time problem (27 mins in O3 but a few seconds in O0). In that case, we have a huge block containing many stores. Each time a store is visited, dagcombiner will find a candidate set all chained by the same root node for the store, then try to merge the nodes in the candidate set into a wider store. This is a quadratic algorithm, because for the same candidate set, everytime a store inside of the set is visited, the same set will be considered for merging.

Could you share the test case? I looked into a similar issue and came up with some sort of caching for the lookup. I never shared the patch, because we addressed the issue with a limit. But it might be helpful for your case.

Good to know that. It is an internal source and I don't have tool to do obfuscation. Could you share your patch? I can try it out and see if it helps.

Ah. That's a tricky one. It sounds like this isn't a case of the candidate stores not being mergable, but rather us giving up before we completely verify that.

I don't suppose the values in the indirect value operands to the candidate stores have a relatively short dependence on a node not too far from the RootNode. Right now, we only prune at the root node (and components through the TokenFactors). but we could also just mark further predecessors of the RootNode as not needing to be searched.

wmi added a comment.Jul 29 2019, 2:47 PM

Ah. That's a tricky one. It sounds like this isn't a case of the candidate stores not being mergable, but rather us giving up before we completely verify that.

I don't suppose the values in the indirect value operands to the candidate stores have a relatively short dependence on a node not too far from the RootNode. Right now, we only prune at the root node (and components through the TokenFactors). but we could also just mark further predecessors of the RootNode as not needing to be searched.

It may not be easy for us to prune the searching space without missing some merging opportunity. I feel it is easier to achieve that goal by controlling how many times a {RootNode, StoreNode} pair appear in candidate set repeatedly. With the patch, no bit change in the binaries were found for clang and some large application. So the current patch is a safe change.

It may not be easy for us to prune the searching space without missing some merging opportunity. I feel it is easier to achieve that goal by controlling how many times a {RootNode, StoreNode} pair appear in candidate set repeatedly. With the patch, no bit change in the binaries were found for clang and some large application. So the current patch is a safe change.

True. But my concern is that as we prune the search space, we will introduce burrs in the which will be extremely hard to find and notice. I agree that its' not worth it to solve it the ideal way, but if a quick to write but safe pruning of the search space happens to work we should go with that. If not, it's likely it'll be too hard to resolve better than a heuristics. The scheme here is fine modulo the following concerns:

  1. If it's really an issue with the candidate dependence check, we should move the RootStoreCountMap update be modified in the dependency search, not the candidate check. This shouldn't dramatically change the run time
  1. The RootNode is effectively a function of the candidate store. We really should only have a single RootNode so it'd be better as a map from SDNode* to (SDNode*, int).
  1. The RootStoreCountMap ends up with stale information as we replace nodes, which may cause unexpected failures. Can you modify the node listeners to trim the map.
wmi added a comment.Jul 30 2019, 9:43 AM

It may not be easy for us to prune the searching space without missing some merging opportunity. I feel it is easier to achieve that goal by controlling how many times a {RootNode, StoreNode} pair appear in candidate set repeatedly. With the patch, no bit change in the binaries were found for clang and some large application. So the current patch is a safe change.

True. But my concern is that as we prune the search space, we will introduce burrs in the which will be extremely hard to find and notice. I agree that its' not worth it to solve it the ideal way, but if a quick to write but safe pruning of the search space happens to work we should go with that. If not, it's likely it'll be too hard to resolve better than a heuristics. The scheme here is fine modulo the following concerns:

  1. If it's really an issue with the candidate dependence check, we should move the RootStoreCountMap update be modified in the dependency search, not the candidate check. This shouldn't dramatically change the run time.
  1. The RootNode is effectively a function of the candidate store. We really should only have a single RootNode so it'd be better as a map from SDNode* to (SDNode*, int).

I havn't got the idea. Could you please elaborate what is the "SDNode *" in the key and what is the "SDNode *" in the value? And how it helps the early return in the dependence check?

  1. The RootStoreCountMap ends up with stale information as we replace nodes, which may cause unexpected failures. Can you modify the node listeners to trim the map.

Good point. I will update it.

I havn't got the idea. Could you please elaborate what is the "SDNode *" in the key and what is the "SDNode *" in the value? And how it helps the early return in the dependence check?

Not sure if you mean the first or second point, but here's both:

  1. The issue with having the values incremented in the candidate search is that we'll be much more aggressive about marking things non-mergable. There are a number of cases where we need to visit each store in the merge set once before it can be merged as part dag combining the last store. Also, by deferring the increment to when checkMergeStoreCandidatesForDependencies fails, store merge slowdowns unrelated to this will continue to appear as compile time issues not performance degradations which means are much more likely to consider them when they come up.
  1. Write the map as Candidate Store to (RootNode, count) where the root node acts as "a checksum". So the dependence check is lookup and match checksum and the update is lookup, increment on checksum match and reset to 1 on mismatch.
  1. The RootStoreCountMap ends up with stale information as we replace nodes, which may cause unexpected failures. Can you modify the node listeners to trim the map.

Good point. I will update it.

wmi added a comment.Jul 30 2019, 8:39 PM

Thanks Nirav. I update the patch. Please take a look.

wmi updated this revision to Diff 212498.Jul 30 2019, 8:42 PM

Address Nirav's comment.

niravd added inline comments.Jul 31 2019, 6:47 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
118

the command flag name is a bit unclear and should have the "combiner-" prefix. Maybe "combiner-store-merge-dependence-limit"?

15496

No sense in considering stores that won't merge. Filter candidates on StoreRootCountMap before we add to StoreNodes here.

15506

If we're going to fail from dependences for an N store merge, we encounter it for a merge starting from each of the N candidates. We should
mark each of the "NumStores" candidate stores we have when we fail. Also, you shouldn't bother updating unless the dependence check failed from timeout.

wmi marked 3 inline comments as done.Jul 31 2019, 10:34 AM

Thanks for the suggestions. I will update patch immediately. Please see if it matches what is in your mind.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
118

Done.

15496

Done.

15506

Good point. Done.

wmi updated this revision to Diff 212623.Jul 31 2019, 10:36 AM

Address Nirav's comment.

niravd added inline comments.Jul 31 2019, 10:57 AM
llvm/include/llvm/CodeGen/SelectionDAGNodes.h
869 ↗(On Diff #212623)

Visited.size() and maxsteps are in scope for any call of the help. Just do the comparision at the call point instead of adding an argument.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15446

Extraneous return false here.

15558

Shouldn't you be checking that RootNode matches before you increment the count?

wmi marked 3 inline comments as done.Jul 31 2019, 11:30 AM
wmi added inline comments.
llvm/include/llvm/CodeGen/SelectionDAGNodes.h
869 ↗(On Diff #212623)

Ah, I didn't notice that. Will change.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15446

Thanks for spotting it. I add it for experiment but forget to remove it when generating the patch.

15558

Yes, I should. Done.

wmi updated this revision to Diff 212631.Jul 31 2019, 11:31 AM

Address Nirav's comment.

niravd accepted this revision.Jul 31 2019, 11:51 AM

LGTM. Thanks!

This revision is now accepted and ready to land.Jul 31 2019, 11:51 AM
wmi added a comment.Jul 31 2019, 11:57 AM

Nirav, thanks for all the suggestions and quick responses!

This revision was automatically updated to reflect the committed changes.