This is an archive of the discontinued LLVM Phabricator instance.

Add jump-threading optimization for deterministic finite automata
ClosedPublic

Authored by alexey.zhikhar on Mar 23 2021, 12:16 PM.

Details

Summary

The current JumpThreading pass does not jump thread loops since it can
result in irreducible control flow that harms other optimizations. This
prevents switch statements inside a loop from being optimized to use
unconditional branches.

This code pattern occurs in the core_state_transition function of
Coremark. The state machine can be implemented manually with goto
statements resulting in a large runtime improvement, and this transform
makes the switch implementation match the goto version in performance.
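For illustration, a hypothetical sketch of the kind of loop this targets (not the actual core_state_transition source; names are made up):

  enum State { S0, S1, S2, Done };

  int run(const char *Input) {
    State CurState = S0;
    int Count = 0;
    while (CurState != Done) {
      switch (CurState) {
      case S0:                  // next state depends on the input here
        CurState = *Input++ ? S1 : Done;
        break;
      case S1:                  // next state is a compile-time constant, so the
        ++Count;                // branch back through the switch can be replaced
        CurState = S2;          // by a direct jump to the block for case S2
        break;
      case S2:
        CurState = S0;          // likewise, a direct jump back to case S0
        break;
      case Done:
        break;
      }
    }
    return Count;
  }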

This patch specifically targets switch statements inside a loop that
have the opportunity to be threaded. Once it identifies an opportunity,
it creates new paths that branch directly to the correct code block.
For example, the left CFG could be transformed to the right CFG:

       sw.bb                        sw.bb
     /   |   \                    /   |   \
case1  case2  case3          case1  case2  case3
     \   |   /                /       |       \
     latch.bb             latch.2  latch.3  latch.1
      br sw.bb              /         |         \
                        sw.bb.2     sw.bb.3     sw.bb.1
                         br case2    br case3    br case1

Co-author: Justin Kreiner @jkreiner
Co-author: Ehsan Amiri @amehsan

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden.
efriedma added inline comments.May 4 2021, 1:35 PM
llvm/lib/CodeGen/DFAJumpThreading.cpp
643 ↗(On Diff #342832)

cast<PHINode>(Incoming)

If we're going to be cloning basic blocks, we need some sort of cost computation. Some blocks aren't legal to clone, and some are expensive simply due to size. (Illegal to clone like this: convergent, noduplicate, indirectbr predecessors.) See llvm/Analysis/CodeMetrics.h.

That is a good point, I agree we will need a cost model for the transformation. I have started working on one and gathering more performance stats. Also thanks for sharing CodeMetrics.h, I can add a check for those types of blocks that aren't legal to clone.
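As a rough sketch, a legality/size check based on CodeMetrics could look something like this (the function name, parameters, and threshold are illustrative, not the patch's actual cost model):

  #include "llvm/Analysis/CodeMetrics.h"

  // Reject threading paths whose blocks are illegal or too expensive to clone.
  static bool isSafeAndCheapToClone(ArrayRef<BasicBlock *> PathBlocks,
                                    Function &F, AssumptionCache &AC,
                                    const TargetTransformInfo &TTI,
                                    unsigned CostThreshold) {
    SmallPtrSet<const Value *, 32> EphValues;
    CodeMetrics::collectEphemeralValues(&F, &AC, EphValues);

    CodeMetrics Metrics;
    for (BasicBlock *BB : PathBlocks)
      Metrics.analyzeBasicBlock(BB, TTI, EphValues);

    if (Metrics.notDuplicatable || Metrics.convergent)
      return false;                            // illegal to clone these blocks
    return Metrics.NumInsts < CostThreshold;   // otherwise gate on overall size
  }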

If I'm understanding correctly, the goal here is to specifically handle cases where we can completely eliminate the switch. That's a bit narrow, but I guess it makes sense, at least as a first cut.

This is correct for how it has been implemented, but the transformation should also work for the general case where the switch can be partially threaded. The analysis part will need to be changed a bit to identify the partial threading opportunities, but it is possible in the future.

In case anyone is interested, there is some context about this patch over here: https://reviews.llvm.org/D88307

jkreiner updated this revision to Diff 345215.May 13 2021, 10:51 AM

Implemented a cost model to avoid cloning too much code, and addressed reviewer comments.

jkreiner marked 4 inline comments as done.May 13 2021, 11:54 AM

I removed the unnecessary conditions, and used the llvm::DenseMap instead.

llvm/lib/CodeGen/DFAJumpThreading.cpp
740 ↗(On Diff #345215)

Here is the cost calculation we implemented. It could be more accurate by accounting for instructions and blocks that get simplified, but for now it at least prevents code explosion.

453 ↗(On Diff #342832)

Yes, I agree the vector select check was unnecessary here.

efriedma added inline comments.May 17 2021, 3:45 PM
llvm/lib/CodeGen/DFAJumpThreading.cpp
13 ↗(On Diff #345215)

Could probably use a brief outline of the overall algorithm here.

109 ↗(On Diff #345215)

Don't use std::list.

I think you can use SmallVector and pop_back(), assuming the iteration order doesn't matter. If you need FIFO order, use std::deque.
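For example, a typical worklist pattern along these lines (StartBB and visit are placeholders):

  SmallVector<BasicBlock *, 16> Worklist;
  Worklist.push_back(StartBB);
  while (!Worklist.empty()) {
    // pop_back_val() gives stack (LIFO) order, which is fine when the
    // visitation order doesn't matter.
    BasicBlock *BB = Worklist.pop_back_val();
    visit(BB); // may push more blocks onto Worklist
  }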

182 ↗(On Diff #345215)

I don't think this applyUpdates() call does anything? The DomTree doesn't care about unreachable edges.

213 ↗(On Diff #345215)

Can you merge together the two triangle codepaths? They appear nearly identical. Probably just need a couple std::swap calls.

375 ↗(On Diff #345215)

Do you check somewhere that there's exactly one ConstantInt associated with a given successor? In general, there could be any number.

450 ↗(On Diff #345215)

isa<Instruction>() if you're not going to use the pointer returned by dyn_cast<>.
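i.e., roughly (V and the body are placeholders):

  if (isa<Instruction>(V))        // preferred when the pointer isn't used
    ++Count;
  // rather than:
  if (dyn_cast<Instruction>(V))   // the dyn_cast result is computed and discarded
    ++Count;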

690 ↗(On Diff #345215)

Probably want to ensure we're only calling collectEphemeralValues once per function.

740 ↗(On Diff #345215)

Okay, makes sense.

llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
461 ↗(On Diff #345215)

This is kind of late to be running this pass. That might be useful in some cases: we might be making the code harder to analyze if there's irreducible control flow. On the other hand, we're skipping interesting optimizations if the CFG becomes significantly simpler. My instinct is that the advantages of running earlier outweigh the disadvantages.

llvm/test/CodeGen/AArch64/dfa-constant-propagation.ll
1 ↗(On Diff #345215)

Please put the tests in llvm/test/Transforms/DFAJumpThreading.

Matt added a subscriber: Matt.May 18 2021, 7:10 AM
jkreiner marked 2 inline comments as done.May 18 2021, 7:38 AM
jkreiner added inline comments.
llvm/lib/CodeGen/DFAJumpThreading.cpp
109 ↗(On Diff #345215)

Okay, there are a few places where std::list was used, so I'll replace them all.

182 ↗(On Diff #345215)

Wouldn't it be needed since a new block is created? This edge is reached sometimes. I may be misunderstanding the usage of the DomTreeUpdater though.

213 ↗(On Diff #345215)

Sure I'll look into merging these two cases.

375 ↗(On Diff #345215)

Good point, this case isn't checked for so it would currently cause buggy behavior in this function.

Now that I think about it, this function can probably be removed since the return value stored as the EntryValue isn't used in the transformation.

llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
461 ↗(On Diff #345215)

I agree that it may be beneficial to run the pass earlier. I'm not aware yet of which optimizations it could interfere with. Do you have any suggestions of where to run it instead?

dnsampaio added inline comments.May 21 2021, 1:41 AM
llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
461 ↗(On Diff #345215)

I would suggest trying it just before LLVM's original jump threading. For the implementation proposed in D88307, it actually produced code that LLVM's original jump threading would further optimize, whereas the other way around, the original jump threading would simply not do anything. I'm running on llvm-12, so it may take me some time to test this patch.

xbolva00 added inline comments.May 21 2021, 1:48 AM
llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
461 ↗(On Diff #345215)

And part of the standard -O2+ pipeline? Why is it AArch64-specific?

jkreiner added inline comments.May 21 2021, 6:53 AM
llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
461 ↗(On Diff #345215)

Thank you for the suggestion, I'll try it out before the original jump-threading then.

I only had it there since I was testing it on AArch64, but it could be moved to the general opt pipeline.

jkreiner added inline comments.May 21 2021, 1:35 PM
llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
461 ↗(On Diff #345215)

This seems to be a better location for the pass. I've measured a performance gain 5% greater than before for Coremark by placing it before the original jump-threading.

jkreiner updated this revision to Diff 348040.May 26 2021, 11:51 AM

Addressed reviewer comments, and moved the pass earlier in the pipeline.

jkreiner marked 10 inline comments as done.May 26 2021, 11:55 AM
jkreiner updated this revision to Diff 348077.May 26 2021, 1:40 PM

Fixed a failing opt pipeline test.

I've finally got time to test this.
For our downstream arch we're seeing gains up to 27%.
Very good, thanks for working on this.

llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
364–367

In release mode I get a warning that this is unused (we build with -Werror):
llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp:364:21: error: unused function 'operator<<' [-Werror,-Wunused-function]

Perhaps wrap it in #ifndef NDEBUG?
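i.e., something along these lines (a sketch; the print helper is assumed):

  #ifndef NDEBUG
  inline raw_ostream &operator<<(raw_ostream &OS, const ThreadingPath &TPath) {
    TPath.print(OS); // assumed debug-print helper on ThreadingPath
    return OS;
  }
  #endif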

foad added a subscriber: foad.May 27 2021, 5:09 AM

Naive question: does your new pass "result in irreducible control flow that harms other optimizations"? From the ASCII art diagram in the commit message it looks like it does. Why is that OK?

llvm/include/llvm/Transforms/Scalar/DFAJumpThreading.h
2

Typo ".cpp" :)

For our downstream arch we're seeing gains up to 27%.

That's great to hear!

Naive question: does your new pass "result in irreducible control flow that harms other optimizations"? From the ASCII art diagram in the commit message it looks like it does. Why is that OK?

It could produce irreducible control flow like that, but the pass is late enough in the pipeline that it won't have a negative impact. I experimented with putting it early in the pipeline and the gains I measured weren't great, but it seems to be profitable where it is right now.

llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
364–367

I see, I'll add that #ifndef NDEBUG check.

jkreiner updated this revision to Diff 348343.May 27 2021, 11:54 AM
jkreiner marked 2 inline comments as done.
foad added a comment.May 28 2021, 1:15 AM

Naive question: does your new pass "result in irreducible control flow that harms other optimizations"? From the ASCII art diagram in the commit message it looks like it does. Why is that OK?

It could produce irreducible control flow like that, but the pass is late enough in the pipeline that it won't have a negative impact. I experimented with putting it early in the pipeline and the gains I measured weren't great, but it seems to be profitable where it is right now.

Naive follow-up question: why does this have to be a complete new implementation of jump threading? Would it be feasible to have a single implementation that takes a "don't create irreducible control flow" flag?

@jkreiner with both llvm 12 and head I get a compiler crash in this function for file F17082111. Just run opt --dfa-jump-threading. Sorry, I couldn't reduce it further, I just don't know how to get bugpoint to work with opt.
With an llvm-12 debug build I get:

opt --dfa-jump-threading -S < fail.ll
opt: /work1/dsampaio/csw/llvm-project/llvm/include/llvm/IR/Instructions.h:2767: llvm::Value *llvm::PHINode::getIncomingValueForBlock(const llvm::BasicBlock *) const: Assertion `Idx >= 0 && "Invalid basic block argument!"' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: opt --dfa-jump-threading -S
1.      Running pass 'Function Pass Manager' on module '<stdin>'.
2.      Running pass 'DFA Jump Threading' on function '@main'
 #0 0x00007f6a3415119a llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /work1/dsampaio/csw/llvm-project/llvm/lib/Support/Unix/Signals.inc:565:11
 #1 0x00007f6a3415136b PrintStackTraceSignalHandler(void*) /work1/dsampaio/csw/llvm-project/llvm/lib/Support/Unix/Signals.inc:632:1
 #2 0x00007f6a3414f95b llvm::sys::RunSignalHandlers() /work1/dsampaio/csw/llvm-project/llvm/lib/Support/Signals.cpp:70:5
 #3 0x00007f6a34151ae1 SignalHandler(int) /work1/dsampaio/csw/llvm-project/llvm/lib/Support/Unix/Signals.inc:407:1
 #4 0x00007f6a32d7a980 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x12980)
 #5 0x00007f6a32076fb7 raise /build/glibc-S9d2JN/glibc-2.27/signal/../sysdeps/unix/sysv/linux/raise.c:51:0
 #6 0x00007f6a32078921 abort /build/glibc-S9d2JN/glibc-2.27/stdlib/abort.c:81:0
 #7 0x00007f6a3206848a __assert_fail_base /build/glibc-S9d2JN/glibc-2.27/assert/assert.c:89:0
 #8 0x00007f6a32068502 (/lib/x86_64-linux-gnu/libc.so.6+0x30502)
 #9 0x00007f6a344eb474 llvm::PHINode::getIncomingValueForBlock(llvm::BasicBlock const*) const /work1/dsampaio/csw/llvm-project/llvm/include/llvm/IR/Instructions.h:2768:29
#10 0x00007f6a35a37265 (anonymous namespace)::AllSwitchPaths::run() /work1/dsampaio/csw/llvm-project/llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp:532:24
#11 0x00007f6a35a36826 (anonymous namespace)::DFAJumpThreading::run(llvm::Function&) /work1/dsampaio/csw/llvm-project/llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp:1149:23
#12 0x00007f6a35a36c1f (anonymous namespace)::DFAJumpThreadingLegacyPass::runOnFunction(llvm::Function&) /work1/dsampaio/csw/llvm-project/llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp:164:5
#13 0x00007f6a3440f84c llvm::FPPassManager::runOnFunction(llvm::Function&) /work1/dsampaio/csw/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1435:23
#14 0x00007f6a34414ac5 llvm::FPPassManager::runOnModule(llvm::Module&) /work1/dsampaio/csw/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1481:16
#15 0x00007f6a34410214 (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) /work1/dsampaio/csw/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1550:23
#16 0x00007f6a3440fd38 llvm::legacy::PassManagerImpl::run(llvm::Module&) /work1/dsampaio/csw/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:541:16
#17 0x00007f6a34414dd1 llvm::legacy::PassManager::run(llvm::Module&) /work1/dsampaio/csw/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1677:3
#18 0x0000000000472e17 main /work1/dsampaio/csw/llvm-project/llvm/tools/opt/opt.cpp:998:3
#19 0x00007f6a32059bf7 __libc_start_main /build/glibc-S9d2JN/glibc-2.27/csu/../csu/libc-start.c:344:0
#20 0x000000000043577a _start (/work1/dsampaio/csw/devimage/toolchain_default/toolroot/opt/kalray/accesscore/bin/opt+0x43577a)
Aborted (core dumped)

Naive question: does your new pass "result in irreducible control flow that harms other optimizations"? From the ASCII art diagram in the commit message it looks like it does. Why is that OK?

It could produce irreducible control flow like that, but the pass is late enough in the pipeline that it won't have a negative impact. I experimented with putting it early in the pipeline and the gains I measured weren't great, but it seems to be profitable where it is right now.

Naive follow-up question: why does this have to be a complete new implementation of jump threading? Would it be feasible to have a single implementation that takes a "don't create irreducible control flow" flag?

I just want to clarify that this is not a completely new implementation of jump threading; the cases that the two passes catch are mutually exclusive. There were some discussions about implementing a more aggressive jump threading versus a separate pass for specific uses (FSMs) that might answer your question over here: D88307

@jkreiner with both llvm 12 and head I get a compiler crash in this function for file F17082111. Just run opt --dfa-jump-threading. Sorry, I couldn't reduce it further, I just don't know how to get bugpoint to work with opt.

Thank you for providing the testcase, I reproduced it locally. It is the last day of my internship, so I won't be able to submit a fix for it. @alexey.zhikhar initially wrote some parts of the analysis, and will continue where I left off on the remaining work items for this patch.

Thanks for all of the effort! On our downstream Cortex-R5 compiler, I'm seeing a 20.4% speedup on Coremark with this patch, which is good, however, the older patch (https://reviews.llvm.org/D88307) gave me a 21.6% speedup. Any idea what could account for the difference there?

Thanks for all of the effort! On our downstream Cortex-R5 compiler, I'm seeing a 20.4% speedup on Coremark with this patch, which is good, however, the older patch (https://reviews.llvm.org/D88307) gave me a 21.6% speedup. Any idea what could account for the difference there?

It would be very hard for us to explain the difference without having access to both the downstream compiler and the hardware but something that comes to mind is:

(1) were the two experiments performed on the same baseline compiler?
(2) 1% could be within noise and might be due to trivial changes in the generated code (change of alignment, change of branch addresses and branch target addresses, etc.)
(3) were both passes in the same place in the pipeline in both experiments?

From the information provided, it is not possible to understand the root cause of the difference.

ormris removed a subscriber: ormris.Jun 7 2021, 4:08 PM

Thanks for all of the effort! On our downstream Cortex-R5 compiler, I'm seeing a 20.4% speedup on Coremark with this patch, which is good, however, the older patch (https://reviews.llvm.org/D88307) gave me a 21.6% speedup. Any idea what could account for the difference there?

It would be very hard for us to explain the difference without having access to both the downstream compiler and the hardware but something that comes to mind is:

(1) were the two experiments performed on the same baseline compiler?
(2) 1% could be within noise and might be due to trivial changes in the generated code (change of alignment, change of branch addresses and branch target addresses, etc.)
(3) were both passes in the same place in the pipeline in both experiments?

From the information provided, it is not possible to understand the root cause of the difference.

Thanks, and understood. I'll see if I can do some more digging here, but I suspect it may be noise. I wasn't sure if there was an obvious difference in the algorithms here that might account for a degradation. I'm pretty sure both passes are not in the same place in the pipeline (just taking the patch from the other review).

marksl added a subscriber: marksl.Jun 8 2021, 2:15 PM
alexey.zhikhar commandeered this revision.Jun 9 2021, 7:00 AM
alexey.zhikhar added a reviewer: jkreiner.

Fixed the bug submitted by @dnsampaio, test case added

Misc minor changes

Fix a clang-format warning

@efriedma Eli, I believe all your comments have been addressed, please take another look when you get a chance, thanks.

@efriedma ping.

We understand that you might be busy, so maybe you could recommend another reviewer that could make the final decision? Thank you.

amehsan added a subscriber: sebpop.EditedJul 8 2021, 9:20 AM

@sebpop Do you mind reviewing this patch? This is jump threading for finite state machines. You may remember we had a discussion about it at the 2018 and 2019 LLVM Dev Meetings. We didn't follow up as quickly as I had hoped, but recently we got the opportunity to improve the code and post it here. As you can see, some review has been done, but currently it is waiting for further review or a green light to merge.

Earlier discussion that resulted in this patch is also here: https://reviews.llvm.org/D88307

Did a first pass, see comments inline. Plan to look at this closer soon.

I think some tests are missing, mostly negative tests:

  • test for min code size,
  • test with blocks that cannot be cloned,
  • test with some unsupported instructions.
llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
104

Just a thought on "deploying" this. Perhaps it's easier to start with this off by default?
If problems are found after committing (correctness, or perhaps compile-time issues), you don't need to revert the whole pass; you can just toggle this flag and keep the pass in tree.
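For example, a guard along these lines (the flag name is illustrative):

  static cl::opt<bool> EnableDFAJumpThreading(
      "enable-dfa-jump-thread",
      cl::desc("Enable DFA jump threading optimization"), cl::init(false),
      cl::Hidden);

  // ... later, when populating the legacy pass pipeline:
  if (EnableDFAJumpThreading)
    MPM.add(createDFAJumpThreadingPass());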

llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
13

Nit: perhaps you can be more specific here:

Currently this does not happen in LLVM jump threading

and say this is JumpThreading.cpp?

25

Nitpicking some words here. What do you mean by "switch variable" and what do you mean by "decided"?

30

Nit: runs -> triggers?

33

Just curious, is this algorithm described in literature? Can we include a reference?

186

Can you define what unfolding means in this context?

191

Do you mean that this is code that could be shared? Should this be a TODO?

322
// This data structure keeps track of all blocks that have been cloned in the
// optimization.

->

// This data structure keeps track of all blocks that have been cloned.
740

Could you include here, or wherever is most appropriate, the definition of a threading path? Copied from a test case:

A threadable path includes a list of basic blocks, the exit
state, and the block that determines the next state.
< path of BBs that form a cycle > [ state, determinator ]
llvm/test/Transforms/DFAJumpThreading/dfa-jump-threading-transform.ll
20

Perhaps check the full IR for clarity.

I am making an AggressiveJumpThreading pass downstream to solve the state machine in Coremark.
And I ignored the problem that more aggressive jump threading would cause irreducible control flow. I heard that gcc has implemented a version which overcomes the problem; it may be beneficial for all of us to look into the details of the gcc implementation.
This patch looks like a big pattern match to me. It requires a switch statement whose condition is predictable, in other words, a DFA.
The first question is whether it is profitable to have a pass for such a specific pattern, since I feel the DFA pattern isn't common in real code. If the pass is turned off by default, I have no problem with it.

The second one is that this pass collects all available paths in AllSwitchPaths and then applies the cost model when performing the transformation.
One concern is that it may waste too much compile time collecting useless paths whose cost exceeds their benefit.

llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
73

IsViewCFGBefore => WouldViewCFGBeforeTransform or ViewCFGBeforeTransform.

Now the name looks a little bit confusing.

262–263

It may be better to use a range-based for loop here.

382

Is it necessary to mark ~MainSwitch as virtual? There is no class derived from MainSwitch.

723

What's the rationale for using the ceiling of log base 2? Maybe there is a mathematical justification behind it, but it would be better to state it.

1139

It may be better to comment why we don't use range-based for here.

I am making an AggressiveJumpThreading pass downstream to solve the state machine in Coremark.

We have 2 aggressive jump threading passes in upstream review: this work, and also D88307. If I understand this correctly, you're working on another downstream? If so, would it not be more efficient if we support this work? For me, this work has the most potential as it is under active development as opposed to D88307. Your remark about irreducible control flow is a valid one, which was made earlier too. Eli said this about that:

This is kind of late to be running this pass. That might be useful in some cases: we might be making the code harder to analyze if there's irreducible control flow. On the other hand, we're skipping interesting optimizations if the CFG becomes significantly simpler. My instinct is that the advantages of running earlier outweigh the disadvantages.

which is something that still needs to be figured out I think. I.e., it's run earlier now, but do we need more performance numbers to see if this doesn't regress other stuff? But @ChuanqiXu, if you're working on something similar and have ideas, please consider sharing them here.

And I ignored the problem that more aggressive jump threading would cause irreducible control flow. I heard that gcc has implemented a version which overcomes the problem; it may be beneficial for all of us to look into the details of the gcc implementation.
This patch looks like a big pattern match to me. It requires a switch statement whose condition is predictable, in other words, a DFA.
The first question is whether it is profitable to have a pass for such a specific pattern, since I feel the DFA pattern isn't common in real code. If the pass is turned off by default, I have no problem with it.

My colleague @jaykang10 found that this also triggers on Perlbench in SPEC.

I have seen a similar code pattern in perlbench from SPEC2017. In that case, there are cleanups for lifetime markers and the goto statements go through the cleanups. I discussed it on cfe-dev: https://lists.llvm.org/pipermail/cfe-dev/2021-July/068478.html It looks like this patch resolves the issue for perlbench as well as Coremark.

alexey.zhikhar added inline comments.Jul 9 2021, 9:32 AM
llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
33

This is the algorithm that we came up with so we cannot provide a reference. "Codasip" has a whitepaper with a more general algorithm for jump threading that covers DFA as well. Justin @jkreiner compared our approach with the one by Codasip, and found that we can cover all cases that Codasip can (for DFA) with one exception: currently our implementation is limited to cases in which the state variable of a switch is always predictable, but it can be extended to cover the case where the state variable is only sometimes predictable.

Thank you very much for your comments, I'm working on updating the diff, will submit it shortly.

alexey.zhikhar edited the summary of this revision.

Addressed the inline comments, added a co-author.

alexey.zhikhar marked 14 inline comments as done.Jul 9 2021, 1:53 PM
alexey.zhikhar added inline comments.
llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
104

Good idea, done.

llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
73

I usually name all my boolean variables IsSomething, but I agree that IsView... sounds awkward. Also, having a verb at the beginning makes it sound like a routine rather than a variable. I updated it to ClViewCFGBefore, Cl for command line.

382

Classes might derive from MainSwitch in the future; it's good practice to declare destructors virtual.

723

Added a comment.

740

Added a description to the definition of the ThreadingPath class.

1139

No good reason, fixed.

It's a lot of code and I am still reading it, but here's a bunch of mostly nits, if you don't mind... (I'll continue reading in the meantime).

llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
121

Nit: don't need the brackets.

134

Nit: don't need the brackets.

216

Create a helper function for this...

226

... so that we don't need to duplicate this here.

288

Nit: no need for all the curly brackets.

(sorry, I just find it a lot easier to read, and it's the coding style)

314

Perhaps a comment here that State just corresponds to the value of a case statement?

338

Do we need BBName?
Can we not do something like:

OS << BB->hasName() ? BB->getName() : "...";
357

I was confused about what exactly the determinator was, but I think it's the block that determines the next state. I think some comments about this would be beneficial too.

473

no curly brackets

480

Same

486

Same,

and actually in a lot more cases, so I will stop mentioning it from now on, but there are opportunities to get rid of a lot of curly brackets. :-)

1147
auto *SI = dyn_cast<SwitchInst>(BB.getTerminator());
if (!SI)
  continue;
alexey.zhikhar marked 5 inline comments as done.
alexey.zhikhar marked 11 inline comments as done.Jul 12 2021, 2:02 PM
alexey.zhikhar added inline comments.
llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
338

The idea here is to print the basic block itself if it has no name, e.g.:

1:
  %2 = ...
  %3 = ...

We can't say OS << (BB->hasName() ? BB->getName() : *BB); since both legs of a ternary operator must be of the same type.
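So the fallback ends up being spelled out explicitly, roughly:

  if (BB->hasName())
    OS << BB->getName();
  else
    OS << *BB; // print the unnamed block itself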

486

Thanks, I went through the whole file looking for unnecessary curly braces, so there should be none left now. Please feel free to let me know if I missed something.

I am making an AggressiveJumpThreading pass downstream to solve the state machine in Coremark.

We have 2 aggressive jump threading passes in upstream review: this work, and also D88307. If I understand this correctly, you're working on another downstream? If so, would it not be more efficient if we support this work? For me, this work has the most potential as it is under active development as opposed to D88307. Your remark about irreducible control flow is a valid one, which was made earlier too. Eli said this about that:

This is kind of late to be running this pass. That might be useful in some cases: we might be making the code harder to analyze if there's irreducible control flow. On the other hand, we're skipping interesting optimizations if the CFG becomes significantly simpler. My instinct is that the advantages of running earlier outweigh the disadvantages.

which is something that still needs to be figured out I think. I.e., it's run earlier now, but do we need more performance numbers to see if this doesn't regress other stuff? But @ChuanqiXu, if you're working on something similar and have ideas, please consider sharing them here.

Since I only started looking at Coremark recently and you have much more experience, I'm fine with this as long as no regression is found.

And I ignored the problem that more aggressive jump threading would cause irreducible control flow. I heard that gcc has implemented a version which overcomes the problem; it may be beneficial for all of us to look into the details of the gcc implementation.
This patch looks like a big pattern match to me. It requires a switch statement whose condition is predictable, in other words, a DFA.
The first question is whether it is profitable to have a pass for such a specific pattern, since I feel the DFA pattern isn't common in real code. If the pass is turned off by default, I have no problem with it.

My colleague @jaykang10 found that this also triggers on Perlbench in SPEC.

Cool, I'll try to measure it.

llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
382

Then it may be better to add the virtual ~MainSwitch in the future, when a derived class actually exists.

723

Is this assumption reliable? On the one hand, the motivating example in Coremark wouldn't be lowered into a binary search tree. On the other hand, the switch statement may be lowered to a jump table.

Thanks for the updates.
If I am not mistaken, I still think tests are missing for cases when jump threading should *not* trigger:

I think some tests are missing, mostly negative tests:

  • test for min code size,
  • test with blocks that cannot be cloned,
  • test with some unsupported instructions
alexey.zhikhar marked an inline comment as done.Jul 13 2021, 7:51 AM

Thanks for the updates.
If I am not mistaken, I still think tests are missing for cases when jump threading should *not* trigger:

I think some tests are missing, mostly negative tests:

  • test for min code size,
  • test with blocks that cannot be cloned,
  • test with some unsupported instructions

You're right that they're missing. I have not forgotten about the test cases; I'm working on them and will add them soon. Thanks

SjoerdMeijer added inline comments.Jul 14 2021, 1:43 AM
llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
33

Thanks for this, and for addressing my previous nitpicks about definitions and terminology.
I think it would be convenient to have most of them in one place, i.e. here, describing the high level ideas and terminology used.
Sketching a little bit, and copy-pasting from different places, I was thinking of something like this:

// Transform each threading path to effectively jump thread the FSM. For
// example the CFG below could be transformed as follows, where the cloned
// blocks unconditionally branch to the next correct case based on what is
// identified in the analysis.
//          sw.bb                        sw.bb
//        /   |   \                    /   |   \
//   case1  case2  case3          case1  case2  case3
//        \   |   /                 |      |      |
//       determinator            det.2   det.3  det.1
//        br sw.bb                /        |        \
//                          sw.bb.2     sw.bb.3     sw.bb.1
//                           br case2    br case3    br case1
//
// Definitions and Terminology:
//
// * Threadable path:
//   a list of basic blocks, the exit state, and the block that determines
//   the next state, for which the following notation will be used:
//   < path of BBs that form a cycle > [ state, determinator ]
//
// * Predictable switch:
//   The switch variable is always a known constant, so all conditional
//   jumps based on the switch variable can be converted to unconditional jumps.
//
// * Determinator:
//   The basic block that determines the next state of the DFA. 
//
// ETC

And I think you can keep the comments at the different places inlined in the code; I don't think that duplication will hurt.

79

I think this option is not tested.

84

And this one too, so we need some more tests for this.

405

I don't think we have tests for cases that are not predictable.

459

This could probably do with a more descriptive function name, to make more explicit what kind of Value we expect.

496

Don't need the brackets here

501

and here.

  • Add optimization remark emitter
  • Test for min code size
  • Test with blocks that cannot be cloned
  • Test with some unsupported instructions
  • Test unpredictable switch
  • Replace all mentions of FSM with DFA
  • Always use the term "Threading Path", instead of interchanging "threading" with "threadable"
alexey.zhikhar marked 7 inline comments as done.Jul 14 2021, 2:52 PM
alexey.zhikhar added inline comments.
llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
33

Thanks, Sjoerd, I added this to the file header. Generally I prefer to not duplicate documentation for the same reasons as duplicating code, so I removed the long comment from TransformDFA::createAllExitPaths() but I can put it back if you insist.

79

A test case for MaxPathDepth is not ready yet, but duly noted. Thanks

382

Let's see if Sjoerd @SjoerdMeijer has an opinion here.

405

Please see negative4

SjoerdMeijer added inline comments.Jul 15 2021, 5:29 AM
llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
382

No strong opinion. Looks fine as it is.

llvm/test/Transforms/DFAJumpThreading/negative.ll
3

Perhaps this works, but I thought this should be:

--check-prefix=REMARK
186

I don't think we want to jump thread a function that has a minsize attribute? For example:

define i32 @negative5( ) minsize {

Need a test for that?

alexey.zhikhar marked 2 inline comments as done.
  • Add a test for the minsize attribute
  • Test MaxPathDepth with a test case where only a subset of paths is threaded
  • Reduce the test case for the cost model by passing the threshold through CLI
  • Fix a bug in how the NumTransforms statistic is incremented
  • Expand the cost model to include the case for jump tables
alexey.zhikhar marked 3 inline comments as done.Jul 22 2021, 2:46 PM
alexey.zhikhar added inline comments.
llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
79

Please see max_path_length

723

That's a good point, I added a separate clause for when a jump table is expected instead of binary search (the same hook is used in inlining).
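A rough sketch of that kind of check, assuming the TargetTransformInfo hook that the inline cost model uses for switches (SI, TTI, PSI, and BFI are assumed to be in scope):

  unsigned JumpTableSize = 0;
  TTI.getEstimatedNumberOfCaseClusters(*SI, JumpTableSize, PSI, BFI);
  // If a jump table is expected, the switch is not lowered as a binary search
  // over the cases, so the duplication cost is estimated differently.
  bool ExpectJumpTable = JumpTableSize != 0;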

Thanks, looking good. But one last request, sorry, can you add tests for the optimisation remarks too?

alexey.zhikhar marked an inline comment as done.Jul 23 2021, 6:07 AM

Thanks, looking good. But one last request, sorry, can you add tests for the optimisation remarks too?

No worries, Sjoerd, I appreciate your comments.

There are optimization remark tests in negative.ll, do you mean something in addition to that?

Thanks, looking good. But one last request, sorry, can you add tests for the optimisation remarks too?

No worries, Sjoerd, I appreciate your comments.

There are optimization remark tests in negative.ll, do you mean something in addition to that?

Ah, sorry, I had missed that, so ignore that comment.

SjoerdMeijer accepted this revision.Jul 23 2021, 7:08 AM

Thanks for working on this. This LGTM as a first version that is off by default. Having this in tree makes testing and getting different (compile-time) numbers easier, which we need for the follow up to get this enabled by default.

This revision is now accepted and ready to land.Jul 23 2021, 7:08 AM

Rebase on top of the latest main

alexey.zhikhar edited the summary of this revision. Jul 27 2021, 7:38 AM
This revision was landed with ongoing or failed builds.Jul 27 2021, 11:36 AM
This revision was automatically updated to reflect the committed changes.

This appears to create a layering violation. llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp includes llvm/include/llvm/CodeGen/Passes.h and llvm/lib/CodeGen/HardwareLoops.cpp includes llvm/include/llvm/Transforms/Scalar.h creating a cycle between LLVMCodeGen and LLVMScalarOpts

@chandlerc as the owner of layering

Is the include of llvm/CodeGen/Passes.h required? My build appears to succeed without it

This appears to create a layering violation. llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp includes llvm/include/llvm/CodeGen/Passes.h and llvm/lib/CodeGen/HardwareLoops.cpp includes llvm/include/llvm/Transforms/Scalar.h creating a cycle between LLVMCodeGen and LLVMScalarOpts

@chandlerc as the owner of layering

Is the include of llvm/CodeGen/Passes.h required? My build appears to succeed without it

Ah looks like @bkramer just fixed it with https://github.com/llvm/llvm-project/commit/05815c9f638c2a62e1ce9b28b26d74c7bea81f2e, thanks!

Compile time statistics gathered on CTMark (no change, basically):

Compile Time
                                                              dfa TRUE   dfa FALSE
test-suite :: CTMark/7zip/7zip-benchmark.test                  83.2031     83.3182
test-suite :: CTMark/Bullet/bullet.test                        54.6959     54.7607
test-suite :: CTMark/ClamAV/clamscan.test                      32.4064     32.8996
test-suite :: CTMark/SPASS/SPASS.test                          30.1786     30.0229
test-suite :: CTMark/consumer-typeset/consumer-typeset.test    24.089      24.0834
test-suite :: CTMark/kimwitu++/kc.test                         33.7907     34.073
test-suite :: CTMark/lencod/lencod.test                        39.3401     39.3303
test-suite :: CTMark/mafft/pairlocalalign.test                 21.475      21.408
test-suite :: CTMark/sqlite3/sqlite3.test                      31.1729     31.1351
test-suite :: CTMark/tramp3d-v4/tramp3d-v4.test                57.5609     56.1245

Number of transformed switch statements:

CTMark/kimwitu++/kimwl.stats:   "dfa-jump-threading.NumTransforms": 1
CTMark/7zip/LzmaEnc.stats:      "dfa-jump-threading.NumTransforms": 1
CTMark/7zip/DeflateDecoder.stats:       "dfa-jump-threading.NumTransforms": 1
CTMark/7zip/GzHandler.stats:    "dfa-jump-threading.NumTransforms": 1
CTMark/7zip/CabHandler.stats:   "dfa-jump-threading.NumTransforms": 3
CTMark/7zip/ShrinkDecoder.stats:        "dfa-jump-threading.NumTransforms": 1
CTMark/7zip/Update.stats:       "dfa-jump-threading.NumTransforms": 1
CTMark/7zip/List.stats: "dfa-jump-threading.NumTransforms": 1
CTMark/7zip/Extract.stats:      "dfa-jump-threading.NumTransforms": 1
CTMark/7zip/ZipHandlerOut.stats:        "dfa-jump-threading.NumTransforms": 1
CTMark/7zip/TarHandler.stats:   "dfa-jump-threading.NumTransforms": 1
CTMark/7zip/7zUpdate.stats:     "dfa-jump-threading.NumTransforms": 2
CTMark/7zip/XzDec.stats:        "dfa-jump-threading.NumTransforms": 1
CTMark/7zip/OpenArchive.stats:  "dfa-jump-threading.NumTransforms": 1
CTMark/7zip/BwtSort.stats:      "dfa-jump-threading.NumTransforms": 1
CTMark/ClamAV/libclamav_untar.stats:    "dfa-jump-threading.NumTransforms": 1
CTMark/ClamAV/libclamav_message.stats:  "dfa-jump-threading.NumTransforms": 1
CTMark/sqlite3/sqlite3.stats:   "dfa-jump-threading.NumTransforms": 10
CTMark/SPASS/iascanner.stats:   "dfa-jump-threading.NumTransforms": 1
CTMark/SPASS/dfgscanner.stats:  "dfa-jump-threading.NumTransforms": 1
CTMark/consumer-typeset/z36.stats:      "dfa-jump-threading.NumTransforms": 1
CTMark/consumer-typeset/z49.stats:      "dfa-jump-threading.NumTransforms": 1
CTMark/consumer-typeset/z38.stats:      "dfa-jump-threading.NumTransforms": 1

Nice numbers, actually, more “hits” than expected. +1.

Can you coordinate with @nikic to get fresh compile-time data with the DFA pass enabled? Your numbers look fine, so it would be great to enable it by default.

Yes, that looks very promising. I think it is best to create a new patch for that, so that we can move this discussion there.

Thanks everybody.

Internally, we had had a few rounds of testing before posting this patch, but before enabling it by default, we'd like to have a few more to make sure that nothing unexpected will arise. I will submit a patch as soon as we're done with that.

Just a heads up. I have benchmarked this version against our downstream implementation of jump threading and see some regressions:

       Regression
core1  0.4%
core2  0.28%
core3  2.7%
core4  0.24%
core5  1.12%

Especially for the more capable cores "core3" and "core5" the difference is quite big, so we do leave some performance on the table.

I will probably add a reproducer as a regression test and raise a ticket for this, so that we can look into this.

cynecx removed a subscriber: cynecx.Aug 4 2021, 10:10 AM
amehsan added a comment.EditedAug 5 2021, 8:35 AM

A couple of comments/clarifications about irreducible CFG that came up in earlier comments: This pass does not generate an irreducible CFG for Coremark. It does generate an irreducible CFG for https://bugs.llvm.org/show_bug.cgi?id=42313. We also checked gcc, and it seems that gcc also generates an irreducible CFG for PR42313.

The following observation could be helpful: the CFG of the output of this pass (if we collapse some nodes of the CFG into one node) will have the same structure as the DFA graph represented by the code. I have not thought it through, but I suspect this generalizes to cases where we partially jump thread a switch statement. This observation can be used to detect an irreducible CFG in the analysis step and potentially disable the transformation, or to convert the code to a reducible CFG using known techniques (cost analysis will be needed).

Just a heads up. I have benchmarked this version against our downstream implementation of jump threading and see some regressions:

       Regression
core1  0.4%
core2  0.28%
core3  2.7%
core4  0.24%
core5  1.12%

Especially for the more capable cores "core3" and "core5" the difference is quite big, so we do leave some performance on the table.

I will probably add a reproducer as a regression test and raise a ticket for this, so that we can look into this.

Thanks Sjoerd for the careful review of this patch and the feedback provided. It will be interesting to see whether the root cause of the performance degradation is in the code generated by the pass or not. If it is, we can investigate whether changing the code generation in this pass is the best way forward, or whether some extra cleanup/optimization is needed somewhere else. We will look into it when more information is available. Thanks again for all the feedback.

No worries, and thanks for working on this!

I had a first look, and I think the output code is just different (*) for the two jump threading implementations, so it's difficult to say more at this point. I will look into this more, but that will be in September, after the holiday period.

(*) It is the switch.o3.ll test case from D88307 that I took, which I think is just a reproducer from coremark.

ychen added a subscriber: ychen.Nov 18 2021, 11:25 AM

I wrote a ticket about a crash with this pass that we stumbled upon in fuzz testing:
https://github.com/llvm/llvm-project/issues/64860

Herald added a project: Restricted Project. Aug 21 2023, 5:57 AM
Herald added subscribers: hoy, nlopes.
nlopes added inline comments.Aug 21 2023, 10:48 AM
llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
1143

please use a poison value as a placeholder instead of undef.
We are trying to remove undef.
Thank you!
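i.e., roughly:

  // Instead of creating an undef placeholder:
  //   Value *Placeholder = UndefValue::get(Ty);
  // use poison:
  Value *Placeholder = PoisonValue::get(Ty);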