This is an archive of the discontinued LLVM Phabricator instance.

[DAG] Extract switch lowering as a spearate object NFC
Needs ReviewPublic

Authored by junbuml on Mar 17 2017, 7:48 AM.

Details

Summary

This refactor the switch lowering to extract forming case clusters as a separate
class from SelectionDAGBuilder and decouple forming case cluster and building DAG.
Based on this refactoring, I will expose a TTI hook which allow inliner to use the same
logic when deciding inline cost for switch instructions.

Diff Detail

Event Timeline

junbuml created this revision.Mar 17 2017, 7:48 AM
junbuml retitled this revision from [DAG] Extract switch lowering as a spearate object NFC to [DAG] Extract switch lowering as a spearate object NFC.Mar 17 2017, 10:05 AM
junbuml added reviewers: chandlerc, mcrosier, hans.
junbuml added a subscriber: llvm-commits.
hans edited edge metadata.Mar 17 2017, 1:31 PM

Hi Jun,

I think extracing the logic for clustering cases into jump tables, bit tests, etc. into a separate class might be a good idea, but I don't think we should extract the actual lowering parts. If the purpose of this is to expose a hook to figure out how a switch table will be lowered, the actual lowering doesn't need to happen here. The class that deals with clustering cases should not need to know about SDAGBuilder.

I'm concerned that this hook might turn out to be very expensive though.

chandlerc edited edge metadata.Mar 17 2017, 1:38 PM
In D31080#704197, @hans wrote:

Hi Jun,

I think extracing the logic for clustering cases into jump tables, bit tests, etc. into a separate class might be a good idea, but I don't think we should extract the actual lowering parts. If the purpose of this is to expose a hook to figure out how a switch table will be lowered, the actual lowering doesn't need to happen here. The class that deals with clustering cases should not need to know about SDAGBuilder.

Agreed.

I'm concerned that this hook might turn out to be very expensive though.

What primarily concerns you about the cost?

If it is the cost of actually computing the clustering, the thing the inliner wants could be a rough approximation. Put differently, we don't need to actually *form* the clustering, just have an estimate of how many branches will end up being used (as opposed to jump tables).

hans added a comment.Mar 17 2017, 2:09 PM

I'm concerned that this hook might turn out to be very expensive though.

What primarily concerns you about the cost?

If it is the cost of actually computing the clustering, the thing the inliner wants could be a rough approximation. Put differently, we don't need to actually *form* the clustering, just have an estimate of how many branches will end up being used (as opposed to jump tables).

Yeah, it's the cost of computing the clustering. I thought most hooks were supposed to be roughly O(1), but this code is actually doing work -- O(n^2) in the worst case, alebeit with low overhead.

It might be possible to come up with some rough estimate though.

Another thought is that at some point it might be worth thinking about moving switch lowering to the IR level, to avoid the whole problem of computing the inlining cost, but that's a big project.

Thanks Hans and Chandler for the reviews. My original intention for this was to have a single logic for calculating case clusters so that we don't need to maintain several different versions depending on places it's used.

From that perspective, I extract the logic for case clustering abstractized in SwitchLoweringCaseClusterBuilder with two virtual methods (buildJumpTable and buildBitTests). Basic implementation of them in SwitchLoweringCaseClusterBuilder don't build anything related with DAGBuilder, which will be used in the TTI hook, but those in SwitchLoweringCaseClusterBuilderForDAG do extra stuffs for DAGbuilder as it will be invoked in SelectionDAGBuilder. However, I agree that SwitchLoweringCaseClusterBuilder should not need to know about SDAGBuilder.

I think having a single abstraction for calculating case clusters might be a good for maintenance and accuracy as long as the calculation cost is reasonable. So, would it make sense to define more virtual methods to cut out expensive calculations for non-DAG builder case ?

hans added a comment.Mar 20 2017, 4:30 PM

Thanks Hans and Chandler for the reviews. My original intention for this was to have a single logic for calculating case clusters so that we don't need to maintain several different versions depending on places it's used.

From that perspective, I extract the logic for case clustering abstractized in SwitchLoweringCaseClusterBuilder with two virtual methods (buildJumpTable and buildBitTests). Basic implementation of them in SwitchLoweringCaseClusterBuilder don't build anything related with DAGBuilder, which will be used in the TTI hook, but those in SwitchLoweringCaseClusterBuilderForDAG do extra stuffs for DAGbuilder as it will be invoked in SelectionDAGBuilder. However, I agree that SwitchLoweringCaseClusterBuilder should not need to know about SDAGBuilder.

I think having a single abstraction for calculating case clusters might be a good for maintenance and accuracy as long as the calculation cost is reasonable. So, would it make sense to define more virtual methods to cut out expensive calculations for non-DAG builder case ?

Rather than splitting those out as virtual calls, I think the nicest design would be if the case clustering logic could work as an analysis: you feed it an array of cases, from which it computes some kind of result which indicates what cases go in jump tables, which are bit tests, etc. The SDAGBuilder would then consume that result to actually build the jump tables etc.

Rather than splitting those out as virtual calls, I think the nicest design would be if the case clustering logic could work as an analysis: you feed it an array of cases, from which it computes some kind of result which indicates what cases go in jump tables, which are bit tests, etc. The SDAGBuilder would then consume that result to actually build the jump tables etc.

Could you give me little bit more details about what you mention because for me it seems almost same as what current implementation is doing. If I understand correctly, current implementation of switch lowering use CaseClusterVector (a vector of CaseCluster) storing what cases go to JT and what cases go to BTest. In visitSwitch(), we first build CaseClusterVector without actual lowering. After then SDAGBuilder use CaseClusterVector to actually lower to JT or BTest. Did you mean for us to use another array instead of CaseClusterVector ?

hans added a comment.Mar 21 2017, 9:34 AM

Rather than splitting those out as virtual calls, I think the nicest design would be if the case clustering logic could work as an analysis: you feed it an array of cases, from which it computes some kind of result which indicates what cases go in jump tables, which are bit tests, etc. The SDAGBuilder would then consume that result to actually build the jump tables etc.

Could you give me little bit more details about what you mention because for me it seems almost same as what current implementation is doing. If I understand correctly, current implementation of switch lowering use CaseClusterVector (a vector of CaseCluster) storing what cases go to JT and what cases go to BTest. In visitSwitch(), we first build CaseClusterVector without actual lowering. After then SDAGBuilder use CaseClusterVector to actually lower to JT or BTest. Did you mean for us to use another array instead of CaseClusterVector ?

The current implementation does try to split the analysis and lowering, but it doesn't succeed completely.

For example, findJumpTables() calls buildJumpTable() which creates a new MachineBasicBlock, creates a JumpTableHeader that it adds to the MachineFunction, and so on. In other words, it performs some lowering.

Your patch works around this by providing virtual methods for buildJumpTable(): one that doesn't actually build a jump table, and the other that does, which means it has to depend on SelectionDAGBuilder.

I'm saying it would be nice if the analysis and lowering could be separated further, so that the analysis does not have to know about SelectionDAGBuilder at all, and SelectionDAGBuilder would do the lowering entirely based on the results of the analysis.

The current implementation does try to split the analysis and lowering, but it doesn't succeed completely.

For example, findJumpTables() calls buildJumpTable() which creates a new MachineBasicBlock, creates a JumpTableHeader that it adds to the MachineFunction, and so on. In other words, it performs >some lowering.

Your patch works around this by providing virtual methods for buildJumpTable(): one that doesn't actually build a jump table, and the other that does, which means it has to depend on SelectionDAGBuilder.

Yes, I agree that the analysis shouldn't have to know about SelectionDAGBuilder.

I'm saying it would be nice if the analysis and lowering could be separated further, so that the analysis does not have to know about SelectionDAGBuilder at all, and SelectionDAGBuilder would do the lowering entirely based on the results of the analysis.

If the main concern is the visibility of SelectionDAGBuilder in the base class for the analysis (SwitchLoweringCaseClusterBuilder in my current patch), don't you think further refactoring to remove SelectionDAGBuilder in SwitchLoweringCaseClusterBuilder by moving SwitchLoweringCaseClusterBuilderForDAG into SelectionDAGBuilder could be reasonable design?

hans added a comment.Mar 21 2017, 1:30 PM

The current implementation does try to split the analysis and lowering, but it doesn't succeed completely.

For example, findJumpTables() calls buildJumpTable() which creates a new MachineBasicBlock, creates a JumpTableHeader that it adds to the MachineFunction, and so on. In other words, it performs >some lowering.

Your patch works around this by providing virtual methods for buildJumpTable(): one that doesn't actually build a jump table, and the other that does, which means it has to depend on SelectionDAGBuilder.

Yes, I agree that the analysis shouldn't have to know about SelectionDAGBuilder.

I'm saying it would be nice if the analysis and lowering could be separated further, so that the analysis does not have to know about SelectionDAGBuilder at all, and SelectionDAGBuilder would do the lowering entirely based on the results of the analysis.

If the main concern is the visibility of SelectionDAGBuilder in the base class for the analysis (SwitchLoweringCaseClusterBuilder in my current patch), don't you think further refactoring to remove SelectionDAGBuilder in SwitchLoweringCaseClusterBuilder by moving SwitchLoweringCaseClusterBuilderForDAG into SelectionDAGBuilder could be reasonable design?

Even if moving the subclass into SelectionDAGBuilder, it still doesn't seem like a very nice design; the analysis isn't just an analysis if it's calling back into lowering code. Also, SelectionDAGBuilders details are currently leaking into the CaseCluster struct in the JT/BTCasesIndex member.

I think it would be better if the analysis could work independently: taking as input a set of cases and returning a vector of clustered cases (each cluster might simply consist of a kind and iterators/indexes into the original set of cases).

junbuml updated this revision to Diff 93279.Mar 28 2017, 1:00 PM

From Hans' comment, decouple cluster calculation from lowering. Introduced a new structure CaseCluster which is only for cluster calculation. MachineCaseCluster will be filled from CaseCluster before lowering. Please take a look and let me know any comment.

junbuml updated this revision to Diff 93287.Mar 28 2017, 1:13 PM
hans added a comment.Mar 29 2017, 9:26 AM

From Hans' comment, decouple cluster calculation from lowering. Introduced a new structure CaseCluster which is only for cluster calculation.

Thanks! Things are getting better. I've made some comments, but haven't looked carefully at the changes to SelectionDAGBuilder.h/cpp yet.

MachineCaseCluster will be filled from CaseCluster before lowering.

This seems a little unfortunate, because it seems a MachineCaseCluster basically includes the same info as a CaseCluster with some more fields.

Would it be possible instead to do the lowering using just the CaseCluster objects, and maybe storing any auxiliary data on the side?

Before we get any further, I also would like to ask if you have done any measurements of compile-time with this set of patches. As I said before, I think this be quite an expensive hook to call for the inline cost analysis, and it would be nice to see some numbers. If it turns out that it is expensive, perhaps we could come up with some better inline cost heuristic, perhaps something based on the density of the switch.

include/llvm/CodeGen/SwitchCaseCluster.h
41 ↗(On Diff #93287)

This should probably be a SmallVector since it will often contain only one element. I'm also not sure if having a typedef for it is really helpful. There needs to be a comment explaining that the vector holds case indexes from the switch.

69 ↗(On Diff #93287)

I'm not sure this assert is worth it.

89 ↗(On Diff #93287)

But how is C.Cases set?

104 ↗(On Diff #93287)

Since TargetLowering.h is included, I guess this isn't needed.

131 ↗(On Diff #93287)

What does return a default block mean? The default basic block of the switch? Why return it?
Spelling, there's an i missing in initial.

142 ↗(On Diff #93287)

Spelling: missing a in unreachable.

150 ↗(On Diff #93287)

What is "it" here?
The comment on the next line looks like leftover from earlier code?
Same thing on the next method.

lib/CodeGen/SelectionDAG/CMakeLists.txt
26

No need to remove this blank line, I think.

Thanks Hans for the review.

Would it be possible instead to do the lowering using just the CaseCluster objects, and maybe storing any auxiliary data on the side?

My initial thought was to have only Kind and CaseVector in CaseCluster as we can extract Low and High from the CaseVector. However, this may increase runtime cost as we often access the Low and High, so I cached them in the struct itself.

MachineCaseCluster has two more fields 1) BranchProbability and 2) the union for the destination block or the case index in JT/BT. We can introduce an auxiliary structure to map clusters with the information, but I think this may increase code complexity and cost to access them through the auxiliary map. Instead of an auxiliary data for BranchProbability and the union, what about to hold a pointer to CaseCluster in MachineCaseCluster like :

struct MachineCaseCluster {
  CaseCluster *Cluster;
  union {
    MachineBasicBlock *MBB;
    unsigned JTCasesIndex;
    unsigned BTCasesIndex;
  };
  BranchProbability Prob;
}

Before we get any further, I also would like to ask if you have done any measurements of compile-time with this set of patches. As I said before, I think this be quite an expensive hook to call for the inline cost analysis, and it would be nice to see some numbers. If it turns out that it is expensive, perhaps we could come up with some better inline cost heuristic, perhaps something based on the density of the switch.

Sure, compile-time experiment should be reported with the patches. Regarding the cost for the hook, I want to discuss in D31085 in which the hook is introduced. I will copy your comment in D31085.

junbuml updated this revision to Diff 94557.Apr 7 2017, 1:03 PM
junbuml marked 7 inline comments as done.

Added the pointer of CaseCluster as a member in MachineCaseCluster. Little bit more refactoring to share the same code for inline cost heuristic (D31782).

mcrosier resigned from this revision.Jul 26 2017, 6:09 AM