This is an archive of the discontinued LLVM Phabricator instance.

[LVI] Switch from BFS to DFS exploration strategy
ClosedPublic

Authored by reames on Dec 30 2016, 7:34 PM.

Details

Summary

This patch changes the order in which LVI explores previously unexplored paths.

Previously, the code used an BFS strategy where each unexplored input was added to the search queue before any of them were explored. This has the effect of causing all inputs to be explored before returning to re-evaluate the merge point (non-local or phi node). This has the unfortunate property of doing redundant work if one of the inputs to the merge is found to be overdefined (i.e. unanalysable). If any input is overdefined, the result of the merge will be too; regardless of the values of other inputs.

The new code uses a DFS strategy where we re-evaluate the merge after evaluating each input. If we discover an overdefined input, we immediately return without exploring other inputs.

I don't believe this patch changes the observed results produced by LVI, but the interactions between CVP/JT, LVI, and the partial cache reset that happens on jump-threading are complicated enough that I can't state that confidently. I can state that on at least one example (pr10584), this variant is substantially faster (80% improvement in compile time).

I also can't find any clear reason why the original code uses DFS despite it being clearly intentional. Anyone know history here?

Diff Detail

Event Timeline

reames updated this revision to Diff 82754.Dec 30 2016, 7:34 PM
reames retitled this revision from to [LVI] Switch from BFS to DFS exploration strategy.
reames updated this object.
dberlin edited edge metadata.Feb 2 2017, 11:05 AM

DFS is clearly the optimal exploration order for this problem.
As mentioned on the bug, the optimal solution is to have this do dfs backwards, and callers to call it in RPO forwards.

That will ensure the minimal work per call.

Once you commit this, i will test and update callers like CVP, etc to iterate in the correct order.

dberlin accepted this revision.Feb 2 2017, 11:07 AM

It would be nice if you had a way to test that this changes nothing. If anything, it actually should improve results, since the optimal lattice values for loops and irreducible control flow may only be achievable with proper iteration order.

My usual plan is to save-temps llvm into bc files, and then generate results before and after.

Assuming you do *something* to verify we aren't getting worse answers, i think this can go in.

This revision is now accepted and ready to land.Feb 2 2017, 11:07 AM

(also,just to record for history, this produces a 4-10x speedup on value propagation pretty much everywhere i've tried it, which is not surprising, since the old exploration strategy is almost optimally bad :P)

joerg added a subscriber: joerg.Feb 2 2017, 1:54 PM

I have one case in pkgsrc (gmic) which currently hits the 2GB VA limit for build processes. Taking the problematic example and comparing before/after on Linux I get:
before: 1.9GB max RSS
after: 0.9GB max RSS

E.g. in this case, it effectively cuts memory use in half. That's quite significant and something I would like to see in 4.0.

FYI: i ran this on all LLVM files for correlatedvalueprop and jumpthreading.
As expected, it produces strictly better answers.
For CVP, it is significantly better at propagating non-nullness (but otherwise no code-differences).
For jump threading, no differences are found.

This revision was automatically updated to reflect the committed changes.