Page MenuHomePhabricator

[CodeGen] Replace -max-jump-table-size with -max-jump-table-targets
AcceptedPublic

Authored by evandro on Apr 4 2019, 5:08 PM.

Details

Summary

Modern processors predict the targets of an indirect branch regardless of the size of any jump table used to glean its target address. Moreover, branch predictors typically use resources limited by the number of actual targets that occur at run time.

This patch changes the semantics of the option -max-jump-table-size to limit the number of different targets instead of the number of entries in a jump table. Thus, it is now renamed to -max-jump-table-targets.

Before, when -max-jump-table-size was specified, it could happen that cluster jump tables could have targets used repeatedly, but each one was counted and typically resulted in tables with the same number of entries. With this patch, when specifying -max-jump-table-targets, tables may have different lengths, since the number of unique targets is counted towards the limit, but the number of unique targets in tables is the same, but for the last one containing the balance of targets.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Herald added a project: Restricted Project. · View Herald TranscriptApr 4 2019, 5:08 PM

I'm still collecting data on other architectures, but I've observed improvements between 3 and 15% in some SPEC benchmarks, such as 253.perlbmk and 400.perlbench, and proprietary ones on AArch64.

hans added a comment.Apr 5 2019, 7:56 AM

This patch changes the semantics of the options min-jump-table-entries and max-jump-table-size to use the number of different targets instead of the number of entries in a jump table. Thus, the are now renamed to min-jump-table-cases and max-jump-table-cases, respectively.

Hmm, I'll have to think about this one.

I'm not sure it makes sense to change -min-jump-table-entries. I think that one is really supposed to decide when a jump table is better than separate branches, and I'm not sure it should change because of this.

-max-jump-table-size was introduced exactly for this purpose (limiting the load on the branch predictor), so making that based on the number of targets sounds reasonable (I'd suggest calling it max-jump-table-targets instead of -cases though).

I also don't see exactly how the semantics of max-jump-table-size are changed with this patch? What am I missing? Can you upload it again with more context maybe?

I also don't see exactly how the semantics of max-jump-table-size are changed with this patch? What am I missing? Can you upload it again with more context maybe?

Look at llvm/include/llvm/CodeGen/TargetLowering.h below.

dmgreen added a subscriber: dmgreen.Apr 8 2019, 3:28 AM

What does this do for codesize?

aheejin added inline comments.Apr 8 2019, 5:30 AM
llvm/test/CodeGen/WebAssembly/cfg-stackify.ll
1 ↗(On Diff #193810)

Why does this test need this option?

hans added a comment.Apr 11 2019, 5:41 AM

I also don't see exactly how the semantics of max-jump-table-size are changed with this patch? What am I missing? Can you upload it again with more context maybe?

Look at llvm/include/llvm/CodeGen/TargetLowering.h below.

Thanks! I had forgotten how this works :-)

I still think looking at the number of cases isn't that much better than looking at the size of the range though. As you said, the point is to limit the load on the branch target predictor, and IIUC that's limited on the number of *different branch targets*, which is really orthogonal to the number of cases. I realize that we don't have that information as readily available, but do you agree that limiting the jump table to a certain number of different targets would be a better approach?

llvm/include/llvm/CodeGen/TargetLowering.h
962

What's the (NumCases < Range) part for? Based on the description, I'd expect this to just check "NumCases <= MaxJumpTableCases"

What does this do for codesize?

Since there aren't that many jump tables, the increase in code size is negligible.

evandro marked 2 inline comments as done.Apr 19 2019, 10:33 AM
evandro added inline comments.
llvm/test/CodeGen/WebAssembly/cfg-stackify.ll
1 ↗(On Diff #193810)

Because before there was no jump table and now there is, since the number of targets is small enough. Otherwise, this test would have to be modified for a reason unrelated to the test.

evandro marked 3 inline comments as done.Apr 19 2019, 10:37 AM

I still think looking at the number of cases isn't that much better than looking at the size of the range though. As you said, the point is to limit the load on the branch target predictor, and IIUC that's limited on the number of *different branch targets*, which is really orthogonal to the number of cases. I realize that we don't have that information as readily available, but do you agree that limiting the jump table to a certain number of different targets would be a better approach?

For each case there is a target in the table, that may potentially be reached or not at run time, but would prevent stressing the predictor. So, cases and targets are not orthogonal, but the same.

llvm/include/llvm/CodeGen/TargetLowering.h
962

It's how I infer that there may be a default case. Or am I missing a better way to do so?

Since there aren't that many jump tables, the increase in code size is negligible.

For a point of reference on the codesize tests I ran, the increases from this patch are larger than the decreases from D59936 (which was the last decent codesize change I saw). Codesize changes might seem small at times, but they often come small change at a time. Plus it depends how you measure them.

Also, many cpus don't really work the way you claim. Some don't even have branch predictors, or the time would not be dominated by the branch mispredict. With those it's more about the difference between jump table setup code and the equivalent series of branches.

hans added a comment.Apr 25 2019, 2:01 AM

I still think looking at the number of cases isn't that much better than looking at the size of the range though. As you said, the point is to limit the load on the branch target predictor, and IIUC that's limited on the number of *different branch targets*, which is really orthogonal to the number of cases. I realize that we don't have that information as readily available, but do you agree that limiting the jump table to a certain number of different targets would be a better approach?

For each case there is a target in the table, that may potentially be reached or not at run time, but would prevent stressing the predictor. So, cases and targets are not orthogonal, but the same.

But if that's so, shouldn't the "default" cases count too?

I don't know how CPUs do this internally, but in your change description you wrote: "Rather, branch predictors typically use resources limited by the number of actual targets that occur at run time."

I interpreted this as the number of actual target addresses, which is less than or equal to the number of cases. For example in

switch (x) {
case 1: case 8: case 12: foo(); break;
case 3: case 7: case 9: bar(); break;
default: baz();
}

there are only 3 branch targets, but 6 cases, and an indirect jump through a jump table could take 12 different values.

The question is which of those dimensions that the branch target predictor is limited by.

Does the CPU keep track of the branch target for each of the 12 possible in-range values? Or does it have some kind of reverse map that tracks which values map to which of the 3 targets? I don't have the expertise to answer this, but so far I'm not convinced that limiting the number of cases is better than limiting the size of the range.

evandro marked an inline comment as done.Apr 25 2019, 6:21 AM
evandro added a comment.EditedApr 25 2019, 6:25 AM

@hans, yes, default, whether explicit or implicit, counts too, for the resources used inside the target only care about different addresses. In you example, a typical branch predictor will see up to 3 target addresses. Of course, branch predictors are trained by branches that are actually taken, so only target addresses that are executed count. Which opens the window for future work involving FDO or just the static probabilities given by the default heuristics.

I'll study your example further.

evandro updated this revision to Diff 206941.Jun 27 2019, 2:41 PM
evandro retitled this revision from [SelectionDAG] Change the jump table size unit from entry to target to [CodeGen] Change the jump table size unit from entry to target.
evandro edited the summary of this revision. (Show Details)
xbolva00 added a subscriber: xbolva00.

Please upload with the full context.

diff with context?

evandro updated this revision to Diff 209282.Jul 11 2019, 12:01 PM

🔔 ¡Ping! 🔔

hans added a comment.Jul 30 2019, 5:52 AM

Hi Evandro,

Very sorry for the slow reply here, but it needed some thinking.

From the change description:

[CodeGen] Change the jump table size unit from entry to target

It's not so clear what a "jump table size unit" is. I think maybe "Replace -max-jump-table-size with -max-jump-table-targets" or similar would be more clear.

llvm/include/llvm/CodeGen/SwitchLoweringUtils.h
224

Maye add "unique" in here to make it clearer.

llvm/lib/CodeGen/SwitchLoweringUtils.cpp
67

I would probably just have gone with "Targets.insert(Clusters[i].MBB)" since C isn't getting used anywhere else.

85

MaxTargets isn't a great variable name for something that's "Number of targets, including the default if it's reachable".

Maybe getJumpTableNumTargets() could take the default case into account so it doesn't need to be handled separately?

148

Hmm, this is problematic. getJumpTableNumTargets has linear time complexity in the number of clusters, so I think this is essentially adding another factor O(n) to the overall time complexity, which is unfortunate.

One way to solve this within the original O(n^2) complexity of the current code, would be to build an auxiliary data structure in the outer for-loop, e.g. TotalNumTargets[], where TotalNumTargets[x] is the total number of targets in clusters i..x (i being defined in the outer loop).

That would be built in O(n) time, and the inner loop would then query TotalNumTargets[j] (in constant time).

evandro marked 4 inline comments as done.Aug 6 2019, 5:21 PM
evandro added inline comments.
llvm/lib/CodeGen/SwitchLoweringUtils.cpp
148

If I understood you correctly, pre calculating the number of targets in clusters i..x, where x will be j, is the same as performing this calculation inside the inner loop.

evandro updated this revision to Diff 213770.Aug 6 2019, 6:07 PM
evandro retitled this revision from [CodeGen] Change the jump table size unit from entry to target to [CodeGen] Replace -max-jump-table-size with -max-jump-table-targets.
evandro edited the summary of this revision. (Show Details)Aug 6 2019, 6:37 PM

Ping❗ 🔔🔔🔔

hans added a comment.Aug 14 2019, 6:10 AM

Very sorry again for the slow reply. I think this patch still suffers from the O(n^3) time complexity.

llvm/lib/CodeGen/SwitchLoweringUtils.cpp
148

No, the current code does O(n) work in the inner loop, making the overall time complexity O(n^3). Pre-computing the number of targets in clusters i..x in the outer loop would preserve the overall time complexity of O(n^2).

evandro marked 2 inline comments as done.Aug 15 2019, 6:06 PM
evandro added inline comments.
llvm/lib/CodeGen/SwitchLoweringUtils.cpp
148

Let me see if I understood what you mean in a patch...

evandro updated this revision to Diff 215516.Aug 15 2019, 6:07 PM
evandro marked an inline comment as done.
evandro marked an inline comment as done.Aug 16 2019, 11:24 AM

🔔🔔 ‼️Ping‼️ 🔔🔔

hans added a comment.Aug 23 2019, 7:05 AM

🔔🔔 ‼️Ping‼️ 🔔🔔

Sorry, I've been busy with LLVM 9 release and other work. Hopefully I can get to this sometime next week; I haven't forgotten about it.

hans added a comment.Mon, Sep 9, 7:20 AM

I think this keeps the algorithm still withing O(n^2) so that's good.

But I am worried that this adds more complexity to something that is already very complex. Can you share numbers to show that this is worth it?

I left some comments for how to maybe make this cleaner, and I think that's the main improvement that's needed: it needs be easier to read the code, understand what's happening and see that it's correct.

I also worry that this makes the code slower, even though the new functionality is not used by most targets. Even though it's still O(n^2) time complexity, we're doing more work now. Can you measure and see how this affects compile time of a large switch? (Maybe try the program from https://bugs.llvm.org/show_bug.cgi?id=23490)

llvm/lib/CodeGen/SwitchLoweringUtils.cpp
129

Maybe PartitionStats is a better name, since it's really about numbers, not any other kinds of traits.

134

I guess you mean PartitionTrait, not PartitionTarget.
Also, the indexing of this seems very confusing, and that seems to be because elements are inserted with push_back().

Couldn't PartitionTrait[x] be the traits for Clusters[i..x], i.e. a straight-forward indexing?

Also, since it's a vector, I don't think the name should be in singular.

143

Using the same variable, defined on line 158, both for initializing the array, and then for lookups later is confusing.

I think it would be easier to read of the code in this block was more like:

PartitionTraits[j].Range = ...
PartitionTraits[j].Cases = ...
PartitionTraits[j].Targets = ...
147–148

Here the indexing gets confusing (also we probably don't want to copy the struct, but use a const-ref).

evandro marked 2 inline comments as done.Thu, Sep 12, 10:58 AM
evandro marked 2 inline comments as done.Thu, Sep 12, 12:53 PM
evandro updated this revision to Diff 219982.Thu, Sep 12, 12:55 PM

Please, stand by for numbers.

Running llc -O1 3 times on the byte code from a.i in the "reduced some more" archive:

  • Before: average of 38.383s ± 0.064s
  • After: average of 38.440s ± 0.049s

Or an increase of 0.15% in the run time.

hans accepted this revision.Tue, Sep 17, 6:38 AM

lgtm with comments

llvm/include/llvm/CodeGen/SwitchLoweringUtils.h
228

Since it has access to Clusters and First and Last, passing in Cases and Range seems redundant. It seems the function should be able to figure those things out itself.

llvm/lib/CodeGen/SwitchLoweringUtils.cpp
126

The comment needs updating for the PartitionTrait rename.

134

s/traits/stats/
And thanks for updating the indexing, this is easier to follow.

This revision is now accepted and ready to land.Tue, Sep 17, 6:38 AM
evandro marked 3 inline comments as done.Tue, Sep 17, 8:25 AM

Thank you.

llvm/include/llvm/CodeGen/SwitchLoweringUtils.h
228

That is true, but, whenever getJumpTableNumTargets() is called, those values are calculated and then used again. So, at least for this use case, it seems to be more efficient to let getJumpTableNumTargets() calculate and return them through references.

evandro marked an inline comment as done.Tue, Sep 17, 8:31 AM
evandro added inline comments.
llvm/include/llvm/CodeGen/SwitchLoweringUtils.h
228

I take it back. Cases needs the TotalCases array to be calculated. So, no, with the current arguments, getJumpTableNumCases() can only figure Range out.

evandro updated this revision to Diff 220526.Tue, Sep 17, 9:48 AM

Update the patch Including the suggested refactoring.

hans added inline comments.Wed, Sep 18, 12:18 AM
llvm/lib/CodeGen/SwitchLoweringUtils.cpp
62

I think it would be simpler if this just returned a PartitionStats object, instead of "returning" via a reference parameter.

67

The point of the AllCases vector was to avoid having to iterate from First to Last each time to compute the number of cases.

Now that we're iterating from First to Last anyway to count the number of targets, there's no point to this optimization really, and we might as well count the number of cases while counting the number of targets. That would make the code simpler.

evandro marked 4 inline comments as done.Wed, Sep 18, 8:14 AM
evandro added inline comments.
llvm/lib/CodeGen/SwitchLoweringUtils.cpp
67

Of course!

evandro updated this revision to Diff 220679.Wed, Sep 18, 8:23 AM
evandro marked an inline comment as done.
hans added inline comments.Fri, Sep 20, 12:06 AM
llvm/lib/CodeGen/SwitchLoweringUtils.cpp
68

There's still no need for the vector.

The number of cases from First to Last is the sum of (Clusters[i].High - Clusters[i].Low) for each i between First and Last. The sum can be computed directly since we're running the for-loop anyway. There's no need to use the vector.

evandro marked 2 inline comments as done.Fri, Sep 20, 11:16 AM
evandro added inline comments.
llvm/lib/CodeGen/SwitchLoweringUtils.cpp
68

I see, inlining the functions makes this clear.

evandro updated this revision to Diff 221092.Fri, Sep 20, 12:27 PM
evandro marked an inline comment as done.
evandro updated this revision to Diff 221095.Fri, Sep 20, 12:46 PM
hans added inline comments.Mon, Sep 23, 1:51 AM
llvm/lib/CodeGen/SwitchLoweringUtils.cpp
33

I'd suggest not declaring these until they're used.

36

I don't know what the "accumulated" refers to here. I don't think there's any need for a comment here actually.

39

There's nothing accumulated here. I'd suggest just dropping the comment.

41

It would be better to just declare the variables here:

APInt *Hi = ...
APInt *Lo = ...

49

And declare these variables here (no need to reuse the same variables as in the loop):

APInt *Hi = ...
APInt * Lo = ...