This is an archive of the discontinued LLVM Phabricator instance.

[DON'T MERGE] Jump-threading for finite state automata
Needs ReviewPublic

Authored by eastig on Sep 25 2020, 7:54 AM.

Details

Summary

This is an implementation of jump-threading to optimise switch-based finite state automata.
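For illustration, here is a minimal, hypothetical C sketch of the switch-in-a-loop FSM pattern such a pass targets (the names and the toy FSM are invented; coremark's core_state_transition has the same shape). Every case assigns the next state from constants, so the successor state is known per CFG edge and the back edge through the switch can be threaded straight to the next case's code.

```c
#include <assert.h>

/* Toy state machine: scan a string and count the digits that follow
 * the first character of a number token. */
enum state { START, NUMBER, IDENT, DONE };

int scan(const char *p) {
    enum state s = START;
    int digits = 0;
    while (s != DONE && *p) {
        switch (s) {
        case START:
            /* The next state is a constant on each outgoing edge,
             * which is what makes the switch threadable. */
            s = (*p >= '0' && *p <= '9') ? NUMBER : IDENT;
            break;
        case NUMBER:
            if (*p >= '0' && *p <= '9') {
                digits++;
                s = NUMBER;
            } else {
                s = DONE;
            }
            break;
        case IDENT:
            s = DONE;
            break;
        default:
            break;
        }
        p++;
    }
    return digits;
}
```

After threading, each duplicated path branches from one case's body directly to the next case's body, bypassing the switch dispatch on every iteration.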

Diff Detail

Event Timeline

eastig created this revision.Sep 25 2020, 7:54 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 25 2020, 7:54 AM
eastig requested review of this revision.Sep 25 2020, 7:54 AM

This patch is not supposed to be used as is. It can be used for experiments and as a proof of concept. A version for reviewing will be submitted soon.

khchen added a subscriber: khchen.Sep 30 2020, 8:14 AM
amehsan added a subscriber: amehsan.EditedOct 3 2020, 7:14 PM

@alexey.zhikhar and I had a look at your patch. Some findings that might be interesting:

1- The IR from your patch is shorter than ours. We haven't checked carefully where the difference is, but there is a gap of about 10% (of the size of core_state_transition).
2- The code from our current implementation is about 2%-3% faster on our platform.
3- While the main idea is exactly the same, our implementation looks at all the control-flow paths in the loop first and performs the analysis; then we have a code-gen process. Your approach is closer to the existing jump threading in that you make one change at a time.
4- As I said, we are making some significant changes in our implementation to improve the code quality (among other things). So the new code that we will be able to share is not ready yet.

Thanks for sharing your code.

eastig added a comment.Oct 5 2020, 1:55 AM

> @alexey.zhikhar and I had a look at your patch. Some findings that might be interesting:
>
> 1- The IR from your patch is shorter than ours. We haven't checked carefully where the difference is, but there is a gap of about 10% (of the size of core_state_transition).
> 2- The code from our current implementation is about 2%-3% faster on our platform.
> 3- While the main idea is exactly the same, our implementation looks at all the control-flow paths in the loop first and performs the analysis; then we have a code-gen process. Your approach is closer to the existing jump threading in that you make one change at a time.
> 4- As I said, we are making some significant changes in our implementation to improve the code quality (among other things). So the new code that we will be able to share is not ready yet.
>
> Thanks for sharing your code.

Hi Ehsan,
Thank you for your feedback.

The code was written in early 2015, before LLVM got improvements in DominatorTree and SSAUpdater. The biggest part of the code restores SSA form, and it is not perfect: some sophisticated CFGs can cause issues.
The code has not been optimised for performance. I think it has opportunities for improvement.
The original idea was to support both SwitchInst and BranchInst. However, for BranchInst the implementation was very unstable. As the target was to optimise SwitchInst, support for BranchInst was dropped.

Did you consider implementing a more aggressive jump threading instead of a new pass for a specific use (FSMs)? Like the following solutions:

eastig added a comment.Oct 5 2020, 4:51 AM

> Did you consider implementing a more aggressive jump threading instead of a new pass for a specific use (FSMs)? Like the following solutions:

Do you mean to improve the existing LLVM Jump Threading pass?

> Did you consider implementing a more aggressive jump threading instead of a new pass for a specific use (FSMs)? Like the following solutions:
>
> Do you mean to improve the existing LLVM Jump Threading pass?

Yes, either improve the current one or write a new, more powerful JT pass.

> Do you mean to improve the existing LLVM Jump Threading pass?
>
> Yes, either improve the current one or write a new, more powerful JT pass.

IMHO, the current LLVM JT pass is a lightweight JT. It tries to balance the amount of duplicated basic blocks and to keep loops from becoming irreducible.
Making it more powerful could break design principles and contracts the pass is based on. Aggressive JT can easily duplicate a lot of basic blocks and make loops irreducible.
Also, AJT benefits more from the analyze-then-transform approach than from an iterative approach: I saw that no additional jumps were threaded after the first pass.
There are still unanswered questions regarding AJT: which applications benefit most from it (definitely FSMs), where to put it in the optimisation pipeline, what heuristics to use, etc.

All of this makes it questionable whether the current JT should be upgraded into AJT.

grandinj removed a subscriber: grandinj.Oct 5 2020, 10:54 AM

Yeah, I agree that it is quite questionable how useful the AJT is. But JT for FSMs is definitely useful and should be added.

@eastig I tried this experimental patch in our system, and it definitely brings a highly significant improvement to coremark. When do you expect to post a version for reviewing -- what will the differences be?

@amehsan If your jump-threading change can bring even more improvement, then I am definitely curious to see it and if it can be integrated with this in some way.

Thank you all for your contributions here!

> @eastig I tried this experimental patch in our system, and it definitely brings a highly significant improvement to coremark. When do you expect to post a version for reviewing -- what will the differences be?

Hi Alan,
Thank you for the feedback. I need to get an approval from my current employer to continue working on the patch. I'll keep you posted.

> @eastig I tried this experimental patch in our system, and it definitely brings a highly significant improvement to coremark. When do you expect to post a version for reviewing -- what will the differences be?
>
> Hi Alan,
> Thank you for the feedback. I need to get an approval from my current employer to continue working on the patch. I'll keep you posted.

Hi Alan,
Good news: I've got approval. I cannot say when I will post the updated version; I've just started a new job and am completely busy. I no longer work at Arm.
The differences will be:

  1. SSAUpdaterBulk will be used to remove the custom code restoring SSA. This will reduce the amount of code.
  2. The analyze-then-transform approach rather than the iterative approach. This allows removing some crazy heuristics.
  3. SelectInsts are unfolded only if needed.
  4. Some constants can be involved in cmp+br instructions and, as a result, never reach the switch. This causes copies of basic blocks which are not needed.
  5. Basic blocks common to several threading opportunities will not be copied if possible.
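To illustrate item 3, here is a hypothetical source-level view of select unfolding (state names invented). A ternary on the state variable compiles to an LLVM `select`, which hides the constant from the threader; unfolding it into a branch makes each arm assign a compile-time constant.

```c
#include <assert.h>

enum state { STATE_A = 1, STATE_B = 2 };

/* Select form: lowers to an LLVM `select`, so the new state is not a
 * constant on any single CFG edge and the switch cannot be threaded
 * through it. */
enum state next_select(int cond) {
    return cond ? STATE_A : STATE_B;
}

/* Unfolded form: after turning the select into a branch, each arm
 * assigns a compile-time constant state, exposing one threadable
 * edge per arm. */
enum state next_unfolded(int cond) {
    if (cond)
        return STATE_A;
    return STATE_B;
}
```

Unfolding only when a select actually feeds the switch condition (item 3) avoids growing the CFG where threading gains nothing.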

@alanphipps

I am not sure whether our approach can be integrated with this patch. The approaches are completely different. But I can write a summary of our algorithm, and we can discuss the algorithm first. We are working on rewriting the code and can post it when it is ready. Since @eastig has also switched to a similar approach (analyze-then-transform), it will be interesting to have a discussion about the different algorithms.

ychen added a subscriber: ychen.Oct 21 2020, 10:37 PM

> Hi Alan,
> Good news: I've got approval. I cannot say when I will post the updated version; I've just started a new job and am completely busy. I no longer work at Arm.

Hi! Just following up here to see where things stand. I appreciate you prioritizing this around your new job -- let me know what I can do to help to get this in. Thanks!

amehsan added a comment.EditedDec 16 2020, 4:17 AM

Sorry for the delay. I will describe the algorithm (that I mentioned in an earlier comment) in two steps. In step 1 (this comment) I describe the high-level structure and main steps of the algorithm. In the next step I will go into the details that are missing here. I'd be glad to get some feedback and know what people think about the approach.

The algorithm is developed with coremark in mind. But I think it is generalizable, and I will try to clarify generalization opportunities.

There are three main components in the algorithm.

1- Choosing an SSA variable in the loop that is interesting ("the variable" hereafter).
2- Proving that the branches that depend on the value of the variable have an opportunity for jump threading.
3- Deciding which blocks need to be replicated, how much code increase we will have, and which branches we can remove.

Parts (1) and (2) are easier as I describe below. Part (3) is the one that has some details that I will describe in a subsequent comment.

For (1) we can rely on some heuristics. One current heuristic is to look inside the loop for switch statements with certain properties. For example, if we focus on the CFG subgraph induced by the BBs inside the loop, the switch statement dominates or post-dominates every BB (note that we are focused on a subgraph of the CFG). There is probably a lot of room for experimenting with different heuristics here and generalizing the algorithm.

(2) is fairly straightforward (it also depends on (1)). The main idea is that we need to start from the definition of "the variable", follow use-def chains backward, and prove that in every iteration of the loop a constant value is chosen for this variable that will be used in the next iteration. Again, this condition works for coremark, but it should be generalizable.

I leave (3) for another comment, but the main idea is to enumerate all paths in the loop. Then for each path identify the CFG edge over which the value of “the variable” for the next iteration is determined. For each BB, depending on whether it appears before or after this edge in the path, we make different decision.
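As a toy illustration (not code from either patch), the path enumeration underlying (3) amounts to counting all simple paths from the switch header back to itself, i.e. one path per possible loop iteration. A minimal C sketch over a hypothetical four-block CFG:

```c
#include <assert.h>
#include <string.h>

/* Blocks are numbered 0..N-1; edge[u][v] marks a CFG edge.
 * Block 0 plays the role of the switch header. */
#define N 4
static int edge[N][N];
static int npaths;

/* Depth-first search that closes a path whenever it returns to the
 * header, and keeps paths simple by never revisiting a block. */
static void dfs(int u, int header, int visited[N]) {
    for (int v = 0; v < N; v++) {
        if (!edge[u][v])
            continue;
        if (v == header) {        /* completed one loop iteration */
            npaths++;
            continue;
        }
        if (visited[v])
            continue;
        visited[v] = 1;
        dfs(v, header, visited);
        visited[v] = 0;
    }
}

int count_loop_paths(void) {
    /* Diamond-shaped loop body: 0 -> {1,2}, 1 -> 3, 2 -> 3,
     * and a back edge 3 -> 0. */
    memset(edge, 0, sizeof edge);
    edge[0][1] = edge[0][2] = 1;
    edge[1][3] = edge[2][3] = 1;
    edge[3][0] = 1;
    npaths = 0;
    int visited[N] = {0};
    visited[0] = 1;
    dfs(0, 0, visited);
    return npaths; /* paths 0-1-3-0 and 0-2-3-0 */
}
```

Each enumerated path is then classified per (3-2)/(3-3) by where on it the new state value is defined.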

It will be great to get some feedback on what people think about this approach. I will provide details of (3) soon.

> I think it is generalizable, and I will try to clarify generalization opportunities.
>
> There are three main components in the algorithm.
>
> 1- Choosing an SSA variable in the loop that is interesting ("the variable" hereafter).
> 2- Proving that the branches that depend on the value of the variable have an opportunity for jump threading.
> 3- Deciding which blocks need to be replicated, how much code increase we will have, and which branches we can remove.

To be clear, there is a lot of work that needs to be done after step (3). That is the actual transformation that can be discussed. But my goal here was to focus on the analysis to find the opportunity and also deciding how to replicate the code, which branches/BBs will be impacted.

marksl added a subscriber: marksl.Dec 16 2020, 5:05 PM

This is the remainder of the algorithm that I mentioned above (step (3)). Please see the previous comment first. There should be enough detail here to make the algorithm clear; let me know if something is missing. Our work on revising the code has been suspended for a while; we will try to resume it soon. We are currently running some experiments to find similar code patterns so we can check the generalizability of the algorithm beyond coremark.

(3-1) Enumerate all paths in the loop. Each path will be a sequence of basic blocks.
(3-2) For each path, indicate the point at which a new value for the variable is defined. This is either a phi node or a select statement, but we can unfold select statements to make things easier. These statements should have been identified in step (2) of the algorithm in the previous comment.
(3-3) The crucial point here (which needs to be checked) is that there is only a single point in each path where a new value is assigned to "the variable". We partition the set of blocks in each path into two sets: the "AFTER SET" contains the block with the phi node of (3-2) and the blocks after it; every other block goes into the "BEFORE SET".

The required check for this step should be fairly straightforward by following the use-def chains. Note that once we focus on a path in the loop, there is a fixed constant value for “the variable” when we are in a block in the “BEFORE SET”. There is also a fixed constant value for “the variable” in the “AFTER SET”.

(3-4) With the information available at this point, it should be possible to decide which basic blocks need to be replicated. Consider a BB that appears in the "BEFORE SET" of some paths. Let X denote the number of distinct values that "the variable" takes in the "BEFORE SET" of all these paths. We will need X copies of this BB, one for each of the distinct values. A similar property holds for basic blocks that appear in the "AFTER SET". Note that one property we rely on is the following: no BB appears in the "BEFORE SET" of one path and the "AFTER SET" of another path. Edges between new BBs should be fairly straightforward.
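The copy-count rule in (3-4) can be sketched as a small helper (hypothetical, not from the patch): for one basic block, given which paths it lies on and the constant the variable holds on each path, the number of copies needed is the number of distinct values among those paths.

```c
#include <assert.h>

#define NPATHS 3

/* appears[i]: whether the block is in the BEFORE SET of path i.
 * values[i]:  the constant value of "the variable" on path i.
 * Returns X, the number of copies of the block needed. */
int copies_needed(const int values[NPATHS], const int appears[NPATHS]) {
    int seen[NPATHS];
    int nseen = 0;
    for (int i = 0; i < NPATHS; i++) {
        if (!appears[i])
            continue; /* the block is not on this path */
        int dup = 0;
        for (int j = 0; j < nseen; j++)
            if (seen[j] == values[i])
                dup = 1;
        if (!dup)
            seen[nseen++] = values[i]; /* record a new distinct value */
    }
    return nseen;
}
```

The same computation applies to blocks in the AFTER SET, using the value the variable takes after its defining point on each path.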

https://bugs.llvm.org/show_bug.cgi?id=42313

Please check if this pass can handle it.

> https://bugs.llvm.org/show_bug.cgi?id=42313
>
> Please check if this pass can handle it.

I don't know about the pass posted here. From a look at the testcase it should be OK for the algorithm that I posted. Once we resume the work on the code, we will check this. Thanks for sharing this.

Hey @eastig and @amehsan, thanks for working on this. Do you have any idea when this work is going to be resumed?

> Hey @eastig and @amehsan, thanks for working on this. Do you have any idea when this work is going to be resumed?

Hi @dnsampaio. We have already resumed the work. My colleague who works on this will post a more detailed update soon. To give you a quick update here: he looked into the std::merge example mentioned above and found that our algorithm needs a change. This is being fixed, but it raises some questions about how general the algorithm is after the modification. So he is currently looking into some of the related literature to compare algorithms.

Hi, I just wanted to give an update about the pass @amehsan mentioned before. We revised the implementation so that it handles more general cases, such as the std::merge example provided above. The core idea remains the same: use an analyze-then-transform approach to first find paths in an FSM that have the opportunity to be threaded, and then transform them to skip over the expensive switch operation when it is possible to do so.

The main difference with the previous algorithm is that we do not expect all the paths to be contained in a single loop. Instead we look at all paths that form a cycle through some switch block. This means that we can consider every switch block in the program as a starting point. If the condition is predictable along these paths (i.e. it can be traced back to a constant) then there is an opportunity for threading. At this point we can label the blocks that decide the next state and create duplicate paths that branch directly to the next case.

There are two challenges we encountered that complicate the transformation, but we have solutions for them:

  1. The state variable is sometimes assigned multiple times in one iteration, and we do not know until runtime which state will be chosen. This means that one duplicated path may have to jump to another duplicated path.
  2. The state variable is sometimes assigned in the switch block itself. We may be able to treat it as a special case and avoid code duplication.

Lastly, I reviewed the Codasip paper mentioned above, now available here: https://news.codasip.com/whitepaper-llvm-jump-threading.
Note that this paper is based on this master's thesis from KIT: https://pp.ipd.kit.edu/uploads/publikationen/priesner17masterarbeit.pdf.

I think that our approach should handle the same FSM cases as Codasip's. One limitation is that we jump-thread the switch entirely, whereas Codasip can jump-thread individual cases. It is possible to extend this algorithm in the future to jump-thread individual cases, if that proves to be useful.

I am currently working out some details, but I will upstream the analysis phase of this pass shortly. In the meantime, any feedback about this approach would be appreciated.

cynecx added a subscriber: cynecx.Mar 13 2021, 1:35 PM

Hi, I have posted the analysis mentioned above: https://reviews.llvm.org/D99205