This is an archive of the discontinued LLVM Phabricator instance.

[CGP] Split some critical edges coming out of indirect branches
ClosedPublic

Authored by mkuper on Feb 13 2017, 3:25 PM.

Details

Summary

Split critical edges coming out of indirect branches, when it is easy to do: we match a "jump table" pattern where the jump table has local linkage, is used by exactly one indirectbr, and has no other uses.

Having a critical edge survive until MI means that when we go out of SSA, we have to define all the live-ins of a destination block within the origin block. Normally, MachineSink tries to split critical edges and sink these definitions into the destination block, but teaching it to split indirect edges on the MI level in a generic way looks problematic. So, instead, this tries to split these edges in IR.

This is motivated by the use of computed gotos in python 2.7 - PyEval_EvalFrame() ends up using an indirect branch with ~100 successors, and passing a constant to each of those. This causes us to emit about ~100 defs of registers containing constants, which we then fail to sink because the edge is critical (destination block has incoming edges from both the indirectbr and from a switch). So, at each goto, we "spill" about a hundred constants.
That end result is that a clang-compiled python interpreter is about ~2.5 (!) slower on a simple python reduction loop than a gcc-compiled interpreter.

Diff Detail

Repository
rL LLVM

Event Timeline

mkuper created this revision.Feb 13 2017, 3:25 PM
efriedma edited edge metadata.Feb 13 2017, 5:00 PM

Would it be possible to split the edges in some less fragile manner? I mean, in general you can "split" the edge from an indirectbr to a block with exactly one indirectbr predecessor by splitting the other predecessors.

Would it be possible to split the edges in some less fragile manner? I mean, in general you can "split" the edge from an indirectbr to a block with exactly one indirectbr predecessor by splitting the other predecessors.

I agree this is fragile in the sense that we need to hit a very specific pattern. But I don't have significantly better ideas.
I could try to generalize this, but I think we'd basically need to run both backward data-flow from the IBRs to make sure they're only fed by addresses from GVs (and at most one GV each), and forward data-flow from the GVs to make sure they aren't potentially written to. I'm not sure this is worth the trouble.

I don't really understand what you mean regarding splitting the other predecessors. I want to avoid having a phi on a block where one of the predecessor blocks on the phi is an indirectbr. I don't see how doing anything to the other incoming edges would help, you'll still end with this phi, just change the other predecessors. (The reason this matters only when the edge is critical is because if the indirectbr has only one successor, it doesn't matter where the defs end up living.)

Regardless, I'd like to also extend this to split cases where we have several indirectbr predecessors (of this form).

I don't really understand what you mean regarding splitting the other predecessors. I want to avoid having a phi on a block where one of the predecessor blocks on the phi is an indirectbr. I don't see how doing anything to the other incoming edges would help, you'll still end with this phi, just change the other predecessors.

Err, sorry, that was a terrible explanation; I started out thinking one thing, and writing something different.

Anyway, suppose you have a block B with one indirectbr predecessor and some other non-indirectbr predecessors. First, you split B before the first non-PHI instruction, making B2. Then you clone B, making B3. Then you change all the non-indirectbr predecessors of B point at B3. (Then you fix up the PHI nodes.) You now have exactly the same CFG as your code produces, without messing with the global variable.

Regardless, I'd like to also extend this to split cases where we have several indirectbr predecessors (of this form).

clang doesn't generate code like that, even if a function has multiple indirect goto statements. Have you seen this come up in practice?

I don't really understand what you mean regarding splitting the other predecessors. I want to avoid having a phi on a block where one of the predecessor blocks on the phi is an indirectbr. I don't see how doing anything to the other incoming edges would help, you'll still end with this phi, just change the other predecessors.

Err, sorry, that was a terrible explanation; I started out thinking one thing, and writing something different.

Anyway, suppose you have a block B with one indirectbr predecessor and some other non-indirectbr predecessors. First, you split B before the first non-PHI instruction, making B2. Then you clone B, making B3. Then you change all the non-indirectbr predecessors of B point at B3. (Then you fix up the PHI nodes.) You now have exactly the same CFG as your code produces, without messing with the global variable.

Oh, got it! That's really nice, thanks!

Regardless, I'd like to also extend this to split cases where we have several indirectbr predecessors (of this form).

clang doesn't generate code like that, even if a function has multiple indirect goto statements. Have you seen this come up in practice?

No, I haven't seen that, it was just the generalization step that seemed obvious to me. You're right, it's probably better to generalize in the direction you suggested above.
I'll try that and see if anything unexpected breaks.

mkuper updated this revision to Diff 88479.Feb 14 2017, 7:08 PM

Changed to work the way Eli suggested.

(I need to add a couple more tests, will do that if this really does look better.)

Yes, this approach looks good.

Can we add the Python interpreter loop in question to the testsuite so we don't regress this in the future?

Do we have any tests in the testsuite which use indirect goto which might be affected? If not, what else might be affected?

Yes, this approach looks good.

Can we add the Python interpreter loop in question to the testsuite so we don't regress this in the future?

I can try to craft a test that approximates the performance impact. I'm not sure it would make sense to add python itself to the test suite.

Do we have any tests in the testsuite which use indirect goto which might be affected? If not, what else might be affected?

I believe Eric saw this in the wild a while ago, I'm not entirely sure what the context is. But I think the main use-case of computed gotos is implementing interpreters. :-)

I haven't bothered with the test-suite, because I assumed this is rare enough not to appear. A cursory check shows we have one case that explicitly tests correctness of computed gotos (SingleSource/Regression/C/2004-03-15-IndirectGoto.c), but I doubt we have anything where this matters for performance. I'll check.

I've verified that 2004-03-15-IndirectGoto.c really is the only place in the test-suite where this fires.

(To be clear - that should have been "Eric *also* saw this in the wild", the python issue is very much not synthetic.)

mkuper updated this revision to Diff 88789.Feb 16 2017, 2:38 PM

Added two more tests.

echristo edited edge metadata.Feb 17 2017, 10:38 AM

I've verified that 2004-03-15-IndirectGoto.c really is the only place in the test-suite where this fires.

(To be clear - that should have been "Eric *also* saw this in the wild", the python issue is very much not synthetic.)

Yes. I saw this in the main interpreter loop in WebKit a number of years ago.

A gentle ping.

I can try to craft a test that approximates the performance impact. I'm not sure it would make sense to add python itself to the test suite.

The key is that we need some coverage for this fix; without a test, if indirectbr handling breaks somehow in the future, we won't know until someone complains about a huge performance regression a year later.

Patch looks fine, but I'd like to see perf numbers from the testsuite showing an improvement before we merge this.

I can try to craft a test that approximates the performance impact. I'm not sure it would make sense to add python itself to the test suite.

The key is that we need some coverage for this fix; without a test, if indirectbr handling breaks somehow in the future, we won't know until someone complains about a huge performance regression a year later.

I understand, I'm just not sure what you're suggesting. Would you like me to commit a synthetic test for this into the test-suite, to serve as a performance regression test?

Patch looks fine, but I'd like to see perf numbers from the testsuite showing an improvement before we merge this.

As I said above, this code-path only gets hit in exactly one source file in the testsuite, and that file doesn't belong to anything that can be used as a performance test: http://llvm.org/svn/llvm-project/test-suite/trunk/SingleSource/Regression/C/2004-03-15-IndirectGoto.c
So there are really no meaningful testsuite numbers to provide. I'm pretty sure computed gotos are really not common except when implementing interpreters, emulators, etc.

I can see whether it gets hit in SPEC, maybe perlbench...

Would you like me to commit a synthetic test for this into the test-suite, to serve as a performance regression test?

How big is Python, anyway? It might not be that ridiculous to put it into the testsuite.

Failing that, some artificial test which reflects the usage in Python would be okay, I guess.

Would you like me to commit a synthetic test for this into the test-suite, to serve as a performance regression test?

How big is Python, anyway? It might not be that ridiculous to put it into the testsuite.

It's not huge. On my machine, "make -j32" takes about 30 seconds wall time.
But it's still a fairly large yak (e.g. the build system is non-trivial, at lest to me - and it is not cmake-based. There's an external project that provides cmake build files, but I don't know how stable that is) . I would strongly prefer not to shave it at the moment.

A synthetic test would be based on something like this: https://gist.github.com/anonymous/652ff876293730601cccd1091f1ed446 modified to do something sane.

mkuper updated this revision to Diff 89455.Feb 22 2017, 6:07 PM

Tweaked patch a bit so that we always eliminate trivial phis.

If there isn't any existing application you can easily import, a toy interpreter loop is fine.

Test-suite test is in. Let me know if you have any further comments on the patch itself.

efriedma accepted this revision.Feb 23 2017, 4:13 PM

LGTM.

This revision is now accepted and ready to land.Feb 23 2017, 4:13 PM
This revision was automatically updated to reflect the committed changes.