Page MenuHomePhabricator

[InlineCost] Increase the cost of Switch
ClosedPublic

Authored by junbuml on Mar 17 2017, 10:10 AM.

Details

Summary

The motivation example is like below which has 13 cases but only 2 distinct targets

lor.lhs.false2:                                   ; preds = %if.then
  switch i32 %Status, label %if.then27 [
    i32 -7012, label %if.end35
    i32 -10008, label %if.end35
    i32 -10016, label %if.end35
    i32 15000, label %if.end35
    i32 14013, label %if.end35
    i32 10114, label %if.end35
    i32 10107, label %if.end35
    i32 10105, label %if.end35
    i32 10013, label %if.end35
    i32 10011, label %if.end35
    i32 7008, label %if.end35
    i32 7007, label %if.end35
    i32 5002, label %if.end35
  ]

which is compiled into a balanced binary tree like this on AArch64 (similar on X86)

.LBB853_9:                              // %lor.lhs.false2
        mov     w8, #10012
        cmp             w19, w8
        b.gt    .LBB853_14
// BB#10:                               // %lor.lhs.false2
        mov     w8, #5001
        cmp             w19, w8
        b.gt    .LBB853_18
// BB#11:                               // %lor.lhs.false2
        mov     w8, #-10016
        cmp             w19, w8
        b.eq    .LBB853_23
// BB#12:                               // %lor.lhs.false2
        mov     w8, #-10008
        cmp             w19, w8
        b.eq    .LBB853_23
// BB#13:                               // %lor.lhs.false2
        mov     w8, #-7012
        cmp             w19, w8
        b.eq    .LBB853_23
        b       .LBB853_3
.LBB853_14:                             // %lor.lhs.false2
        mov     w8, #14012
        cmp             w19, w8
        b.gt    .LBB853_21
// BB#15:                               // %lor.lhs.false2
        mov     w8, #-10105
        add             w8, w19, w8
        cmp             w8, #9          // =9
        b.hi    .LBB853_17
// BB#16:                               // %lor.lhs.false2
        orr     w9, wzr, #0x1
        lsl     w8, w9, w8
        mov     w9, #517
        and             w8, w8, w9
        cbnz    w8, .LBB853_23
.LBB853_17:                             // %lor.lhs.false2
        mov     w8, #10013
        cmp             w19, w8
        b.eq    .LBB853_23
        b       .LBB853_3
.LBB853_18:                             // %lor.lhs.false2
        mov     w8, #-7007
        add             w8, w19, w8
        cmp             w8, #2          // =2
        b.lo    .LBB853_23
// BB#19:                               // %lor.lhs.false2
        mov     w8, #5002
        cmp             w19, w8
        b.eq    .LBB853_23
// BB#20:                               // %lor.lhs.false2
        mov     w8, #10011
        cmp             w19, w8
        b.eq    .LBB853_23
        b       .LBB853_3
.LBB853_21:                             // %lor.lhs.false2
        mov     w8, #14013
        cmp             w19, w8
        b.eq    .LBB853_23
// BB#22:                               // %lor.lhs.false2
        mov     w8, #15000
        cmp             w19, w8
        b.ne    .LBB853_3

However, the inline cost model estimates the cost to be linear with the number of distinct targets and the cost of the above switch is just 2 InstrCosts. The function containing this switch is then inlined about 900 times.

This change modifies the model to be linear with the size of the balanced binary tree.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Note that this change was originally written by @haicheng in D29870 and I updated his change by using a new TTI hook to get the number of case cluster based on D31080.

I can still see the +8% in performance and -7.63% in size in spec2000/vortex. No significant performance and size change in other benchmarks in spec2000/spec2006.

joerg added a subscriber: joerg.Mar 18 2017, 5:44 AM

Can you please also try this on top of the changes from D30333? That's changing one switch transformation to happen much later, making it more sensitive to inline cost estimates.

Can you please also try this on top of the changes from D30333? That's changing one switch transformation to happen much later, making it more sensitive to inline cost estimates.

Thanks I will try this !

While I'm working on the accurate version of case cluster calculation through the TTI hook in this change based on D31080, I also want to see if it make sense to apply different level of approximations depending on the number of case? So, for example, if the number of case is more than some threshold, we can use a very rough approximation (e.g, use the number case as number of cluster like D29870). Otherwise, we can use a somewhat accurate version of cluster calculation based on D31080.

The below comment by Hans is copied from D31080 :

Before we get any further, I also would like to ask if you have done any measurements of compile-time with this set of patches. As I said before, I think this be quite an expensive hook to call for the inline cost analysis, and it would be nice to see some numbers. If it turns out that it is expensive, perhaps we could come up with some better inline cost heuristic, perhaps something based on the density of the switch.

I think there are three difference cost heuristics :

  1. The cheapest yet inaccurate version is D29870 in which the number of case is simply used as number of cluster.
  2. The most accurate yet expensive version is to use a hook from D31080.
  3. I guess what Hans suggested in above comment (something based on the density of the switch) must be somewhere in between #1 and #2.

Both #1 and #2 consider forming a BTree, and don't require to update the cost heuristic for the changes in switch lowering. I'm not clear about #3 in terms of the accuracy, compile-time, and maintenance. Hans, can you give me little bit more detail?

For me, at least #1 is still better than the current heuristic which simply count the number of distinct successor blocks. If then, would it make sense to use #1 by default and add difference levels of cost heuristics with flags so that we can come up with the most reasonable heuristic and allow others to do experiments ?

hans edited edge metadata.Mar 30 2017, 5:15 AM

The below comment by Hans is copied from D31080 :

Before we get any further, I also would like to ask if you have done any measurements of compile-time with this set of patches. As I said before, I think this be quite an expensive hook to call for the inline cost analysis, and it would be nice to see some numbers. If it turns out that it is expensive, perhaps we could come up with some better inline cost heuristic, perhaps something based on the density of the switch.

I think there are three difference cost heuristics :

  1. The cheapest yet inaccurate version is D29870 in which the number of case is simply used as number of cluster.
  2. The most accurate yet expensive version is to use a hook from D31080.
  3. I guess what Hans suggested in above comment (something based on the density of the switch) must be somewhere in between #1 and #2.

    Both #1 and #2 consider forming a BTree, and don't require to update the cost heuristic for the changes in switch lowering. I'm not clear about #3 in terms of the accuracy, compile-time, and maintenance. Hans, can you give me little bit more detail?

    For me, at least #1 is still better than the current heuristic which simply count the number of distinct successor blocks. If then, would it make sense to use #1 by default and add difference levels of cost heuristics with flags so that we can come up with the most reasonable heuristic and allow others to do experiments ?

Yes, counting the number of successor blocks doesn't seem right.

I think a decent heuristic might be:

  1. Ask TTI about the native word width, and check if this switch is trivially lowered with a bit test (two successors, case range fits in a machine word); this is very cheap
  2. Ask TTI about jump table density conditions; if the whole switch is dense enough, assume it's a jump table and compute a cost based on that
  3. Otherwise assume it's a balanced tree and estimate the cost based on number of cases (it would be nice not to actually have to build the tree, but just compute an estimate cost using some formula)

This means the estimate will be off for switches that are lowered with a mix of jump tables, binary trees and bit-tests, but those aren't that common, and not being completely accurate is probably fine.

I think going with only #1 might not be so good because we'll overestimate the cost of many switches. But I think computing 1-3 would make for a good a cheap heuristic.

If we are not worry about the mix of cases, I think this is reasonable cheap to be used.

Chandler, do you agree with the heuristic Hans suggested above? Even though it do not cover switches that are lowered with a mix of jump table/bit test/BTree, I think this is reasonable compromise between accuracy and cost of the hook.

junbuml updated this revision to Diff 94559.Apr 7 2017, 1:10 PM

Based on Hans' suggestion, checked if switch is suitable for either bit test or jump table. If not suitable for both, use BTree. Please take a look and let me know any comment.

haicheng added inline comments.Apr 10 2017, 9:13 AM
include/llvm/CodeGen/SwitchCaseCluster.h
123 ↗(On Diff #94559)

Return

lib/Analysis/InlineCost.cpp
1006 ↗(On Diff #94559)

I think we can exit early if the number of cases is too large.

1015 ↗(On Diff #94559)

If the estimation chooses to use jumptable, I think we also need to add the cost of the table which is proportional to the range.

lib/CodeGen/SelectionDAG/SwitchCaseCluster.cpp
86 ↗(On Diff #94559)

We can start from begin()+1

test/Transforms/Inline/switch.ll
3 ↗(On Diff #94559)

We may need to add tests for jump table and bit test.

junbuml added inline comments.Apr 10 2017, 12:37 PM
lib/Analysis/InlineCost.cpp
1006 ↗(On Diff #94559)

In SwitchCaseClusterFinder::getEstimatedNumberOfCluster(), we have early exit for a large number of cases. But I guess you mean something else. Can you specify little bit more about the "too large".

1015 ↗(On Diff #94559)

I'm not sure if we really need to consider the size of table as a cost. I think just couple of instructions to look up the table and jump to actual blocks need to be considered as cost.

lib/CodeGen/SelectionDAG/SwitchCaseCluster.cpp
86 ↗(On Diff #94559)

Thanks. I will do that.

test/Transforms/Inline/switch.ll
3 ↗(On Diff #94559)

Yes. I will do that.

junbuml added inline comments.Apr 10 2017, 1:48 PM
lib/Analysis/InlineCost.cpp
1015 ↗(On Diff #94559)

You are right Haicheng. Looks like we need to consider the cost of the table as well.

haicheng added inline comments.Apr 10 2017, 2:18 PM
lib/Analysis/InlineCost.cpp
1006 ↗(On Diff #94559)

One case needs at least one instruction. So if cost + numcases * instrcost > threshold, we can exit early.

chandlerc edited edge metadata.Apr 10 2017, 3:13 PM

Chandler, do you agree with the heuristic Hans suggested above? Even though it do not cover switches that are lowered with a mix of jump table/bit test/BTree, I think this is reasonable compromise between accuracy and cost of the hook.

Yes, I like this model.

lib/CodeGen/SelectionDAG/SwitchCaseCluster.cpp
68–69 ↗(On Diff #94559)

inlining and loop unrolling. it's a generic cost model.

This really looks like it is going in the right direction. I'm going to work on reviewing some of teh code changes a bit more closely, but I wanted to mention one other thing.

This seems like a really good change to the inlining cost model, but it also seems likely to be a pretty big change. I think it is important to collect some benchmark data to make sure we're not going to uncover a significant regression by surprise. At the very least, I think running the LLVM test suite would be a good start and identifying:

  1. How many benchmarks change
  2. For the ones that change, what is the codesize impact
  3. For the ones that change, what is the runtime impact

For #2 and #3 you probably want at least '-O2', but maybe also '-O3' and '-Os'.

It may be useful to ask others to benchmark other applications and/or various architectures as well. To facilitate that, I might suggest putting the code for this in under a flag that is off and then soliciting benchmark data on llvm-dev with the flag, and based on that data, enable the flag. But if this doesn't fire too often in the test suite, or the results are particularly good, might be easy to just try it and see.

Thanks again for working on this!

junbuml updated this revision to Diff 94880.Apr 11 2017, 1:17 PM

Addressed Haicheng's comments and added a flag as Chandler asked. With this update, I kicked off performance tests for the llvm test suite, spec2000, spec2006 in aarch64, but I will be able to share organized data early next as I will be out of office rest of the week. Please let me know any comment.

hans added a comment.Apr 11 2017, 2:18 PM

I was hoping we wouldn't need the refactoring to SwitchCaseCluster.cpp, and that InlineCost.cpp could just work with TLI to check whether jump tables are allowed, density requirements etc.

My concern is that separating the case cluster code from SelectionDAGBuilder might be more trouble than it's worth.

I was hoping we wouldn't need the refactoring to SwitchCaseCluster.cpp, and that InlineCost.cpp could just work with TLI to check whether jump tables are allowed, density requirements etc.

Sure, I can make InlineCost work with just TLI, but I don't want to duplicate the same code for InlineCost from lowering. So I will refactor just a little bit on SelectionDAGBuilder to expose some util functions.

My concern is that separating the case cluster code from SelectionDAGBuilder might be more trouble than it's worth.

Can you give me little bit more details about your concern ?

hans added a comment.Apr 17 2017, 4:13 PM

I was hoping we wouldn't need the refactoring to SwitchCaseCluster.cpp, and that InlineCost.cpp could just work with TLI to check whether jump tables are allowed, density requirements etc.

Sure, I can make InlineCost work with just TLI, but I don't want to duplicate the same code for InlineCost from lowering. So I will refactor just a little bit on SelectionDAGBuilder to expose some util functions.

That sounds good to me.

My concern is that separating the case cluster code from SelectionDAGBuilder might be more trouble than it's worth.

Can you give me little bit more details about your concern ?

The concern is just that the switch lowering code is fairly tightly integrated with SelectionDAGBuilder, and pulling it out isn't worth the effort if all we need from TLI is some basic info, like if the whole switch is dense enough for a lookup table.

junbuml updated this revision to Diff 95777.Apr 19 2017, 10:36 AM

Addressed Hans' comments; now the switch cost heuristic just work with TLI.

Please see list of benchmarks changed in LLVM test suite and spec2000/2006. No significant code size regression was found in any config. Overall minor positive impact on code size in LLVM test suite, but in O3 with LTO, there was -7.9% reduce in code size in spec2000/vortex.

In AArch64, I didn't see any clear performance impact in LLVM test suite, but in O3 with LTO, I observed +17.82% performance improvement in spec2000/vertex.

  • O2 :
BenchmarksCode size (- is better)
MultiSource/Applications/siod-0.061%
MultiSource/Applications/hbd0.000%
MultiSource/Applications/JM/lencod/lencod-0.070%
MultiSource/Applications/JM/ldecod/ldecod0.000%
MultiSource/Applications/lua/lua0.001%
MultiSource/Applications/d/make_dparser-0.081%
MultiSource/Applications/sqlite3/sqlite3-0.217%
MultiSource/Benchmarks/Prolangs-C/bison/mybison0.000%
MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset0.000%
MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg-0.082%
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg0.000%
MultiSource/Benchmarks/MallocBench/gs/gs0.000%
spec2000/perlbmk-0.066%
spec2000/gcc-0.183%
spec2000/parser0.000%
spec2000/crafty-0.080%
spec2000/mesa-0.286%
spec2006/soplex0.000%
spec2006/xalancbmk0.000%
spec2006/hmmer-0.072%
spec2006/gcc-0.191%
spec2006/h264ref0.000%
spec2006/povray-0.057%
spec2006/perlbench-0.060%
  • Os :
BenchmarksCode size(- is better)
MultiSource/Applications/siod/siod-0.123%
MultiSource/Applications/JM/lencod/lencod0.000%
MultiSource/Applications/lua/lua0.000%
MultiSource/Applications/sqlite3/sqlite30.001%
MultiSource/Benchmarks/mafft/pairlocalalign0.000%
MultiSource/Benchmarks/7zip/7zip-benchmark0.001%
spec2000/perlbmk0.000%
spec2000/gap0.000%
spec2000/gcc0.000%
spec2000/mesa-0.071%
spec2000/vortex0.000%
spec2006/gobmk0.000%
spec2006/xalancbmk0.000%
spec2006/hmmer0.000%
spec2006/gcc-0.048%
spec2006/omnetpp0.000%
spec2006/h264ref0.000%
spec2006/perlbench-0.062%
  • O3 :
BenchmarksCode size(- is better)
MultiSource/Applications/kimwitu++/kc0.000%
MultiSource/Applications/siod/siod-0.061%
MultiSource/Applications/hbd/hbd0.000%
MultiSource/Applications/JM/lencod/lencod0.000%
MultiSource/Applications/lua/lua0.001%
MultiSource/Applications/SIBsim4/SIBsim40.001%
MultiSource/Applications/d/make_dparser0.000%
MultiSource/Applications/sqlite3/sqlite3-0.212%
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg0.000%
MultiSource/Benchmarks/MallocBench/gs/gs0.000%
MultiSource/Benchmarks/7zip/7zip-benchmark0.000%
spec2000/perlbmk-0.066%
spec2000/gcc-0.182%
spec2000/mesa-0.356%
spec2000/vortex0.000%
spec2006/soplex0.000%
spec2006/xalancbmk-0.019%
spec2006/bzip20.000%
spec2006/hmmer0.000%
spec2006/gcc-0.188%
spec2006/h264ref0.000%
spec2006/povray-0.056%
spec2006/perlbench-0.121%
  • Spec2000/2006 performance in O3 :
BenchmarksScore(+ is better)
spec2000/perlbmk+2.706%
spec2000/vortex+2.028%
spec2000/mesa-1.76%
spec2006/soplex-1.334%
spec2006/povray+1.509%
spec2006/perlbench+0.757%
  • Spec2000/2006 performance in O3 with LTO :
BenchmarksScore(+ is better)
spec2000/gzip+2.180%
spec2000/mesa-4.093%
spec2000/vortex+17.822%
hans added a comment.Apr 19 2017, 4:14 PM

It's starting to look much simpler, which is great.

include/llvm/Analysis/TargetTransformInfo.h
773 ↗(On Diff #95777)

The other function here return an estimated cost for lowering. I think that would be a better interface for this too.

include/llvm/CodeGen/BasicTTIImpl.h
175 ↗(On Diff #95777)

I wish this could be much simpler. Maybe most of the code could defer to TLI::isSuitableForBitTest / isSuitableForJumpTable which could also be used from the DAG code.

225 ↗(On Diff #95777)

If we have TLI::isSuitableForBitTests, maybe we should have isSuitableForJumpTable too, that way we don't have to duplicate as much logic, and that one could do what areJTsAllowed() as well.

lib/Analysis/InlineCost.cpp
1049 ↗(On Diff #95777)

If the whole switch is suitable for a jump table, can't we just return after this?

1051 ↗(On Diff #95777)

Why do we have to do this?

The code is basically constructing the tree and throwing it away. It should be possible to compute an estimate for the size of the tree with a closed-form mathematical expression.

haicheng added inline comments.Apr 20 2017, 6:37 AM
lib/Analysis/InlineCost.cpp
1051 ↗(On Diff #95777)

If n is the case number, f(n) is the mapping from the case number to the node number of BTree. The recursion is
f(n) = 1 + f(n/2) + f (n - n/2), when n > 3.
So, f(n) is between n + 2^(log2(n) - 1) - 1 and n + 2^(log2(n)) - 1. The lower bound is about 1.5n - 1 and the upper bound is about 2n - 1

haicheng added inline comments.Apr 20 2017, 8:48 AM
lib/Analysis/InlineCost.cpp
1051 ↗(On Diff #95777)

The exact equation is

f(n) = n, n <= 3
f(n) = n + 2^(log(n) - 1) - 1, n > 3 && 2^log(n) <= n <= 1.5*2^(log(n))
f(n) = 2n - 2^(log(n)) - 1, n > 3 && 1.5*2^(log(n)) < n < 2^(log(n)+1)

junbuml updated this revision to Diff 96252.Apr 21 2017, 3:31 PM
junbuml added inline comments.
include/llvm/Analysis/TargetTransformInfo.h
773 ↗(On Diff #95777)

I doubt if this is good place to get the inline cost because other functions handle user costs which is different from inline cost. Mixing the user cost and inline cost here might be a bad choice. I believe the inline cost should be decided in InlineCost.

include/llvm/CodeGen/BasicTTIImpl.h
175 ↗(On Diff #95777)

Tried to make it simpler, but we still need to find Min/MaxValue here without forming CaseClusters and it's also good to check IsJTAllowed early before doing the actual suitability check to avoid iterating the for loop to find Min/MaxValue in case not allowed. Please take a look and let me know if there is any part you want to move in either isSuitableForBitTest or isSuitableForJumpTable.

225 ↗(On Diff #95777)

Added isSuitableForJumpTable in TLI which used in here and DAG, but keep areJTsAllowed() as a separate function because areJTsAllowed() need to be checked only once in findJumpTable() in DAG, and we can also hit the early exit in this function when JT is not allowed.

lib/Analysis/InlineCost.cpp
1051 ↗(On Diff #95777)

Thanks Haicheng for this.

If n is a power of 2, the number of node should be n + 2^(log2(n) - 1) - 1. For non-power of 2 cases , the lower bound is n + 2^(floor(log2(n)) - 1) - 1 and the upper bound is n + 2^(ceiling(log2(n)) - 1) - 1. As a estimation, I think the use of upper bound is simple and conservative enough.

hans added inline comments.Apr 21 2017, 4:35 PM
include/llvm/Analysis/TargetTransformInfo.h
773 ↗(On Diff #95777)

I was thinking it doesn't need to be used as the inlining cost directly, but exposing some metric the inline cost model could then make use of - that's kind of what we're doing anyway.

What I don't like about this one is that it feels a little out of place compared to other functions which return an estimate of the cost of lowering an instruction, whereas this seems to return details about the lowering instead.

I'm not sure how to make this better though.

include/llvm/Target/TargetLowering.h
766 ↗(On Diff #96252)

NumCluster -> NumClusters please

lib/Analysis/InlineCost.cpp
1035 ↗(On Diff #96252)

I'm not sure it's a BTree, but rather a regular binary search tree.

1048 ↗(On Diff #96252)

Very nice! I think the code can be simplified though, and this will allow us to avoid the rounding from Log2_64_Ceil: (N = NumCaseCluster)

NumberOfNonLeaf = 2^(log2(N) - 1) - 1 = 2^log2(N) * 2^-1 - 1 = N/2 - 1
Adding NumCaseCluster to that yields N + N / 2 - 1 = N * 3 / 2 - 1.

I kind of wish this is all there was to this change and we didn't need all that other code: quick check if the switch is suitable for a bit test or jump table, otherwise compute a cost based on N * 3 / 2 - 1. I don't know if it's possible though.

1053 ↗(On Diff #96252)

Can we really run into INT_MAX here? Especially given the Threshold check above?

1055 ↗(On Diff #96252)

I would suggest handling the NumCaseCluster <= 3 first and returning early for that, since it's the less complicated case.

junbuml updated this revision to Diff 96464.Apr 24 2017, 2:04 PM
junbuml marked 3 inline comments as done.

Addressed Hans' comments.

include/llvm/Analysis/TargetTransformInfo.h
773 ↗(On Diff #95777)

I see what you meant and I can see functions which return the lowering cost. However, this file also hold functions which expose details about the machine directly to be used in IR-level. For me it seems that exposing this information from TLI to InlineCost still keep the original intention of this class. I will be happy to hear any better suggestion about the way of exposing this lowering information.

lib/Analysis/InlineCost.cpp
1048 ↗(On Diff #96252)

Thanks Hans for this. Updated it based your comments.

I think it's difficult to use a closed form which cover all lowering cases, especially in different targets. I believe we need to differentiate each lowering case as the costs varying depending on the lowering cases and targets.

1053 ↗(On Diff #96252)

In very extreme case, it could happen. For a very large Threshold (e.g., INT_MAX), even though SI.getNumCases() * InlineConstants::InstrCost is smaller than the Threshold. ExpectedNumberOfCompare * 2 * InlineConstants::InstrCost could hit INT_MAX.

hans added a comment.Apr 25 2017, 10:14 AM

Apologies for the reviews dragging out.

I think we've got the right idea with computing inline cost based on an estimate of the number of nodes in binary search tree, but the patch is still making a lot of changes all over the place making it hard to review.

The smaller you can make this change, the easier it will be to get this committed.

include/llvm/Target/TargetLowering.h
771 ↗(On Diff #96464)

This check should probably be done in isSuitableForJumpTable. There should be no need to pass NumClusters to areJTsAllowed since whether jump tables are allowed only depends on the target and function. The switch instruction also shouldn't be passed in, just the function.

779 ↗(On Diff #96464)

This is used in isSuitableForBitTests, but does it really need to be exposed in the TargetLowering interface?

792 ↗(On Diff #96464)

What's the difference between Range and JumpTableSize? As far as I can tell, those are always the same.

lib/Analysis/InlineCost.cpp
1057 ↗(On Diff #96464)

Aren't you adding Cost twice here? You're doing Cost += and also SwitchCost + Cost.

Oh, Cost is just a regular int; yeah then I see how it can overflow. But std::min is returnning an uint64_t here, so it seems you're still not handling the overflow?

hans added inline comments.Apr 25 2017, 11:00 AM
lib/Analysis/InlineCost.cpp
1057 ↗(On Diff #96464)

Oh never mind, you're capping the uint64_t at INT_MAX so it shuold work. Probably still don't want to do Cost += though.

junbuml updated this revision to Diff 96614.Apr 25 2017, 12:24 PM

I think we've got the right idea with computing inline cost based on an estimate of the number of nodes in binary search tree, but the patch is still making a lot of changes all over the place making it hard to review.
The smaller you can make this change, the easier it will be to get this committed.

For me, it seems that the code change in DAG and TLI only make sense when reviewed together with the changes in InlineCost. That's why I put them together. If you generally agree with the change in DAG side code, I can break it as a separate patch, and leave only inliner side change in here.

include/llvm/Target/TargetLowering.h
771 ↗(On Diff #96464)

Yes, I agree that areJTsAllowed() should see only function and target, so I just pass a Function to areJTsAllowed(). But I don't think isSuitableForJumpTable is good place for this check because this check is for the whole clusters of a switch, but isSuitableForJumpTable should see a set of clusters, not necessary the whole clusters, especially when we try to build jump tables for split clusters in DAG. We also need to do an early exit when this check hit. So I do this check in findJumpTables() in DAG and getEstimatedNumberOfCaseClusters() in TTI.

779 ↗(On Diff #96464)

This is also used in findBitTestClusters and buildBitTests(), and for me doing this check through TLI doesn't seem to be weird. Please let me know if you see any better place for this function.

792 ↗(On Diff #96464)

Yes, these are the same.

lib/Analysis/InlineCost.cpp
1057 ↗(On Diff #96464)

Thanks for this. Yes, it should be Cost =, instead of Cost +=.

hans added a comment.Apr 25 2017, 2:18 PM

I think we've got the right idea with computing inline cost based on an estimate of the number of nodes in binary search tree, but the patch is still making a lot of changes all over the place making it hard to review.
The smaller you can make this change, the easier it will be to get this committed.

For me, it seems that the code change in DAG and TLI only make sense when reviewed together with the changes in InlineCost. That's why I put them together. If you generally agree with the change in DAG side code, I can break it as a separate patch, and leave only inliner side change in here.

Yes, reviewing them together makes total sense. I just wish the change we're making were smaller.

Anyway, I think this is really close now. Each time I read through it, it looks better :-)

include/llvm/CodeGen/BasicTTIImpl.h
194 ↗(On Diff #96614)

Declaring CI and starting to increment it outside the for-loop is a little unusual. I realize this is to avoid repeating the first iteration (maybe I wrote this somewhere?), but I think it would be better if this were written as a straight-forward loop:

APInt MaxCaseVal = SI.case_begin()->getCaseValue()->getValue();
APInt MinCaseVal = SI.case_begin()->getCaseValue()->getValue();
for (auto CI : SI.cases()) {
  ...
}
include/llvm/Target/TargetLowering.h
771 ↗(On Diff #96464)

isSuitableForJumpTable() seems like a good place for this to me. It should be able to handle the whole switch range or a subset in the same way.

lib/Analysis/InlineCost.cpp
1057 ↗(On Diff #96464)

Cool, makes sense now. Are the (uint64_t) casts strictly necessary though? SwitchCost is already uint64_t so I'd imagine INT_MAX to get promoted and everything to work out?

1018 ↗(On Diff #96614)

Could this overflow?

junbuml added inline comments.Apr 26 2017, 10:34 AM
include/llvm/CodeGen/BasicTTIImpl.h
194 ↗(On Diff #96614)

Yes, I like this. I will do that.

include/llvm/Target/TargetLowering.h
771 ↗(On Diff #96464)

If we use range for this check in isSuitableForJumpTable(), it will change the behavior in lowering. Current lowering code use the total number of cluster (N) for this check only once at the beginning of findJumpTables(), which is different from both Range and NumCases. And also when partitioning clusters for jump tables, doing this check in isSuitableForJumpTable() may change the behavior of the loop finding partitions. If either Range or NumCases should be used for this check, I think it should be a separate patch.

hans added inline comments.Apr 26 2017, 11:15 AM
include/llvm/Target/TargetLowering.h
771 ↗(On Diff #96464)

Oh, I see what you mean.

It's annoying that we'll end up with a isSuitableForJumpTable() that doesn't take the minimum size into account though :-/

junbuml updated this revision to Diff 96799.Apr 26 2017, 11:55 AM
junbuml marked 3 inline comments as done.
junbuml added inline comments.
lib/Analysis/InlineCost.cpp
1057 ↗(On Diff #96464)

We need (uint64_t)INT_MAX, but we don't need (uint64_t) Cost as SwitchCost is uint64_t.

hans added inline comments.Apr 26 2017, 2:04 PM
include/llvm/Target/TargetLowering.h
787 ↗(On Diff #96799)

It still seems backward that this is checking density and maximum table size, but that the minimum size is expected to be checked elsewhere :-/ There must be some better way to factor this.

junbuml added inline comments.Apr 26 2017, 2:30 PM
include/llvm/Target/TargetLowering.h
787 ↗(On Diff #96799)

I agree that doing minSize check in isSuitableForJumpTable looks clear. However, I want to keep the changes in lowering side in this patch is NFC. That's why we cannot do this check in isSuitableForJumpTable().

If you want to add MinSize check in isSuitableForJumpTable(), I think we should also pass the cluster size (N) to isSuitableForJumpTable() in findJumpTable() because Range and NumCases could be different from the cluster size(N).

Only way I can think of doing this check in isSuitableForJumpTable() is :

  1. pass the cluster size (N) to isSuitableForJumpTable() for the first call of isSuitableForJumpTable() in findJumpTable() because this is for the whole range.
  2. Since the second calls of isSuitableForJumpTable() in findJumpTable for partitionized clusters should not check MinSize with sub clusters, we do not use isSuitableForJumpTable(), instead checking the max size and density separately just like current code.

However, I think this way is even worse than my current code. Please let me know your thought.

hans accepted this revision.Apr 26 2017, 2:48 PM

lgtm

include/llvm/Target/TargetLowering.h
787 ↗(On Diff #96799)

Let's put a FIXME in isSuitableForJumpTable that it would be nice if the min size check could somehow be combined with the other checks here.

This revision is now accepted and ready to land.Apr 26 2017, 2:48 PM
junbuml updated this revision to Diff 96832.Apr 26 2017, 3:17 PM

Added FIXME as Hans suggested.
I will commit it if there is any further comment from other reviewers by tomorrow. I will also post a follow-up patch to enable the flag (-inline-generic-switch-cost).

Thank you very much for the review.

This revision was automatically updated to reflect the committed changes.