
[SelectionDAG] Change the jump table size unit from entry to target
Needs Review · Public

Authored by evandro on Apr 4 2019, 5:08 PM.

Details

Summary

Modern processors predict the target(s) of an indirect branch regardless of the size of any jump table used to glean its target address. Rather, branch predictors typically use resources limited by the number of actual targets that occur at run time.

This patch changes the semantics of the options min-jump-table-entries and max-jump-table-size to use the number of different targets instead of the number of entries in a jump table. Thus, they are now renamed to min-jump-table-cases and max-jump-table-cases, respectively.

Before, when max-jump-table-size was specified, clustered jump tables could have targets used repeatedly, but typically had the same length. With this patch, tables may have different lengths, but typically the same number of unique targets.

Diff Detail

Event Timeline

evandro created this revision.Apr 4 2019, 5:08 PM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 4 2019, 5:08 PM

I'm still collecting data on other architectures, but I've observed improvements between 3 and 15% in some SPEC benchmarks, such as 253.perlbmk and 400.perlbench, and proprietary ones on AArch64.

hans added a comment.Apr 5 2019, 7:56 AM

This patch changes the semantics of the options min-jump-table-entries and max-jump-table-size to use the number of different targets instead of the number of entries in a jump table. Thus, they are now renamed to min-jump-table-cases and max-jump-table-cases, respectively.

Hmm, I'll have to think about this one.

I'm not sure it makes sense to change -min-jump-table-entries. I think that one is really supposed to decide when a jump table is better than separate branches, and I'm not sure it should change because of this.

-max-jump-table-size was introduced exactly for this purpose (limiting the load on the branch predictor), so making that based on the number of targets sounds reasonable (I'd suggest calling it max-jump-table-targets instead of -cases though).

I also don't see exactly how the semantics of max-jump-table-size are changed with this patch? What am I missing? Can you upload it again with more context maybe?

I also don't see exactly how the semantics of max-jump-table-size are changed with this patch? What am I missing? Can you upload it again with more context maybe?

Look at llvm/include/llvm/CodeGen/TargetLowering.h below.

dmgreen added a subscriber: dmgreen.Apr 8 2019, 3:28 AM

What does this do for codesize?

aheejin added inline comments.Apr 8 2019, 5:30 AM
llvm/test/CodeGen/WebAssembly/cfg-stackify.ll
1

Why does this test need this option?

hans added a comment.Apr 11 2019, 5:41 AM

I also don't see exactly how the semantics of max-jump-table-size are changed with this patch? What am I missing? Can you upload it again with more context maybe?

Look at llvm/include/llvm/CodeGen/TargetLowering.h below.

Thanks! I had forgotten how this works :-)

I still think looking at the number of cases isn't that much better than looking at the size of the range though. As you said, the point is to limit the load on the branch target predictor, and IIUC that's limited on the number of *different branch targets*, which is really orthogonal to the number of cases. I realize that we don't have that information as readily available, but do you agree that limiting the jump table to a certain number of different targets would be a better approach?

llvm/include/llvm/CodeGen/TargetLowering.h
962

What's the (NumCases < Range) part for? Based on the description, I'd expect this to just check "NumCases <= MaxJumpTableCases"

What does this do for codesize?

Since there aren't that many jump tables, the increase in code size is negligible.

evandro marked 2 inline comments as done.Apr 19 2019, 10:33 AM
evandro added inline comments.
llvm/test/CodeGen/WebAssembly/cfg-stackify.ll
1

Because before there was no jump table and now there is, since the number of targets is small enough. Otherwise, this test would have to be modified for a reason unrelated to the test.

evandro marked 3 inline comments as done.Apr 19 2019, 10:37 AM

I still think looking at the number of cases isn't that much better than looking at the size of the range though. As you said, the point is to limit the load on the branch target predictor, and IIUC that's limited on the number of *different branch targets*, which is really orthogonal to the number of cases. I realize that we don't have that information as readily available, but do you agree that limiting the jump table to a certain number of different targets would be a better approach?

For each case there is a target in the table that may or may not be reached at run time; limiting the number of cases thus bounds the number of potential targets and prevents stressing the predictor. So cases and targets are not orthogonal, but essentially the same.

llvm/include/llvm/CodeGen/TargetLowering.h
962

It's how I infer that there may be a default case. Or am I missing a better way to do so?

Since there aren't that many jump tables, the increase in code size is negligible.

For a point of reference on the codesize tests I ran, the increases from this patch are larger than the decreases from D59936 (which was the last decent codesize change I saw). Codesize changes might seem small at times, but they accumulate one small change at a time. Plus it depends on how you measure them.

Also, many CPUs don't really work the way you claim. Some don't even have branch predictors, or their execution time is not dominated by branch mispredicts. With those it's more about the difference between the jump table setup code and the equivalent series of branches.

hans added a comment.Apr 25 2019, 2:01 AM

I still think looking at the number of cases isn't that much better than looking at the size of the range though. As you said, the point is to limit the load on the branch target predictor, and IIUC that's limited on the number of *different branch targets*, which is really orthogonal to the number of cases. I realize that we don't have that information as readily available, but do you agree that limiting the jump table to a certain number of different targets would be a better approach?

For each case there is a target in the table that may or may not be reached at run time; limiting the number of cases thus bounds the number of potential targets and prevents stressing the predictor. So cases and targets are not orthogonal, but essentially the same.

But if that's so, shouldn't the "default" cases count too?

I don't know how CPUs do this internally, but in your change description you wrote: "Rather, branch predictors typically use resources limited by the number of actual targets that occur at run time."

I interpreted this as the number of actual target addresses, which is less than or equal to the number of cases. For example in

switch (x) {
  case 1: case 8: case 12: foo(); break;
  case 3: case 7: case 9: bar(); break;
  default: baz();
}

there are only 3 branch targets, but 6 cases, and an indirect jump through a jump table could take 12 different values.

The question is which of those dimensions the branch target predictor is limited by.

Does the CPU keep track of the branch target for each of the 12 possible in-range values? Or does it have some kind of reverse map that tracks which values map to which of the 3 targets? I don't have the expertise to answer this, but so far I'm not convinced that limiting the number of cases is better than limiting the size of the range.

evandro marked an inline comment as done.Apr 25 2019, 6:21 AM
evandro added a comment.EditedApr 25 2019, 6:25 AM

@hans, yes, the default, whether explicit or implicit, counts too, since the resources inside the processor only care about distinct addresses. In your example, a typical branch predictor will see up to 3 target addresses. Of course, branch predictors are trained by branches that are actually taken, so only target addresses that are executed count. This opens the window for future work involving FDO or just the static probabilities given by the default heuristics.

I'll study your example further.