This is an implementation of jump-threading to optimise switch-based finite state automata.
|220 ms||linux > LLVM.CodeGen/AMDGPU::opt-pipeline.ll|
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/llvm-project/build/bin/opt -O0 -mtriple=amdgcn--amdhsa -disable-output -disable-verify -debug-pass=Structure /mnt/disks/ssd0/agent/llvm-project/llvm/test/CodeGen/AMDGPU/opt-pipeline.ll 2>&1 | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck -check-prefix=GCN-O0 /mnt/disks/ssd0/agent/llvm-project/llvm/test/CodeGen/AMDGPU/opt-pipeline.ll
|120 ms||linux > LLVM.Other::new-pm-defaults.ll|
Script: -- : 'RUN: at line 10'; /mnt/disks/ssd0/agent/llvm-project/build/bin/opt -disable-verify -debug-pass-manager -passes='default<O1>' -S /mnt/disks/ssd0/agent/llvm-project/llvm/test/Other/new-pm-defaults.ll 2>&1 | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/llvm/test/Other/new-pm-defaults.ll --check-prefix=CHECK-O --check-prefix=CHECK-O1 --check-prefix=CHECK-NOEXT
|260 ms||linux > LLVM.Other::new-pm-thinlto-defaults.ll|
Script: -- : 'RUN: at line 11'; /mnt/disks/ssd0/agent/llvm-project/build/bin/opt -disable-verify -debug-pass-manager -passes='thinlto-pre-link<O1>,name-anon-globals' -S /mnt/disks/ssd0/agent/llvm-project/llvm/test/Other/new-pm-thinlto-defaults.ll 2>&1 | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/llvm/test/Other/new-pm-thinlto-defaults.ll --check-prefixes=CHECK-O,CHECK-O1,CHECK-PRELINK-O,CHECK-PRELINK-O-NODIS,CHECK-PRELINK-O1
|100 ms||linux > LLVM.Other::new-pm-thinlto-postlink-pgo-defaults.ll|
Script: -- : 'RUN: at line 4'; /mnt/disks/ssd0/agent/llvm-project/build/bin/opt -disable-verify -debug-pass-manager -passes='thinlto<O1>' -S /mnt/disks/ssd0/agent/llvm-project/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll 2>&1 | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll --check-prefixes=CHECK-O,CHECK-O1,CHECK-NOEXT --dump-input=fail
|120 ms||linux > LLVM.Other::new-pm-thinlto-postlink-samplepgo-defaults.ll|
Script: -- : 'RUN: at line 3'; /mnt/disks/ssd0/agent/llvm-project/build/bin/opt -disable-verify -debug-pass-manager -pgo-kind=pgo-sample-use-pipeline -profile-file='/mnt/disks/ssd0/agent/llvm-project/llvm/test/Other/Inputs/new-pm-thinlto-samplepgo-defaults.prof' -passes='thinlto<O1>' -S /mnt/disks/ssd0/agent/llvm-project/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll 2>&1 | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll --check-prefixes=CHECK-O,CHECK-O1,CHECK-NOEXT --dump-input=fail
|View Full Test Results (6 Failed)|
@alexey.zhikhar and I had a look at your patch. Some findings that might be interesting:
1- The IR from your patch is shorter than ours. We haven't checked carefully where the difference is, but there is a gap of about 10% (of the size of core_state_transition)
2- Our code from our current implementation is about 2%-3% faster on our platform.
3- While the main idea is exactly the same our implementation looks at all the control flow paths in the loop first and perform the analysis. Then we have a code gen process. Your approach is closer to the existing jump threading in that you make one change at a time.
4- As I said, we are making some significant changes in our implementation to improve the code quality (among other things), etc. So our new code that we will be able to share is not ready yet.
Thanks for sharing your code.
Thank you for your feedback.
The code was written in early 2015 before LLVM got improvements in DominatorTree and SSAUpdater. The biggest part of the code is to restore SSA form which is not perfect. Some sophisticated CFGs can cause issues.
The code has not been optimised for performance. I think it has opportunities for improvements.
The original idea I had was to support both SwitchInst and BranchInst. However, for BranchInst the implementation was very unstable. As the target was to optimize SwitchInst, support of BranchInst was dropped.
Did you consider to implement more Agressive Jump Threading instead of a new pass for specific use (FSMs)? Like following solutions:
- gcc/libfirm jt http://beza1e1.tuxen.de/articles/jump_threading.html
- codasip jump threading https://codasip.com/wp-content/uploads/2019/08/Codasip_Jump_Threading.pdf
Do you mean to improve the existing LLVM Jump Threading pass?
Yes, either improve current one or write new more powerful JT pass.
IMHO, the current LLVM JT pass is a lightweight JT. It tries to balance an amount of duplicated basic blocks and to preserve loops from becoming irreducible.
Making it powerful could break design principles and contracts the pass is based on. The Aggressive JT can easily duplicate a lot of basic blocks and make loops irreducible.
Also AJT benefits more from the analyze-then-transform approach rather than from iterative approach. I saw that no additional jumps were threaded after the first pass.
There are still unanswered questions regarding AJT: what applications benefit most from it (definitely FST), where to put it in the optimizations pipeline, what heuristics to use, etc.
All of this make questionable whether the current JT should be upgraded into AJT.
@eastig I tried this experimental patch in our system, and it definitely brings highly significant improvement to coremark. When do you expect to post a version for reviewing -- what will the differences be?
@amehsan If your jump-threading change can bring even more improvement, then I am definitely curious to see it and if it can be integrated with this in some way.
Thank you all for your contributions here!
Good news. I've got an approval. I cannot say when I post the updated version. I've just started a new job and are completely busy. I don't work anymore at Arm.
The differences will be:
- SSAUpdaterBulk will be used to remove custom code restoring SSA. This will reduce amount of code.
- The analyze-then-transform approach rather than the iterative approach. This allows to remove some crazy heuristics.
- SelectInstrs are unfolded only if needed.
- Some constants can be involved into cmp+br. As a result they cannot reach the switch. This causes copied basic blocks which are not needed.
- Basic blocks common for some threading opportunities will not be copied if possible.
I am not sure whether our approach can be integrated, with this patch. The approaches are completely different. But I can write a summary of our algorithm and we can discuss the algorithm first. We are working on rewriting the code and we can post it when it is ready. Since @eastig has also switched to a similar approach (analyze-then-transform), it will be interesting to have a discussion about different algorithms.