
Add jump-threading optimization for deterministic finite automata
Needs ReviewPublic

Authored by alexey.zhikhar on Mar 23 2021, 12:16 PM.

Details

Summary

The current JumpThreading pass does not jump thread loops since it can
result in irreducible control flow that harms other optimizations. This
prevents switch statements inside a loop from being optimized to use
unconditional branches.

This code pattern occurs in the core_state_transition function of
Coremark. The state machine can be implemented manually with goto
statements, resulting in a large runtime improvement, and this transform
makes the switch implementation match the goto version in performance.

This patch specifically targets switch statements inside a loop that
have the opportunity to be threaded. Once it identifies an opportunity,
it creates new paths that branch directly to the correct code block.
For example, the left CFG could be transformed to the right CFG:

       sw.bb                        sw.bb
     /   |   \                    /   |   \
case1  case2  case3          case1  case2  case3
     \   |   /                /       |       \
     latch.bb             latch.2  latch.3  latch.1
      br sw.bb              /         |         \
                        sw.bb.2     sw.bb.3     sw.bb.1
                         br case2    br case3    br case1

Diff Detail

Event Timeline

cynecx added a subscriber: cynecx.Mar 23 2021, 3:29 PM
khchen added a subscriber: khchen.Mar 23 2021, 8:19 PM

Hi everyone, I wanted to give another progress update on this patch. We have gotten the transformation working on the Coremark opportunity, resulting in a speedup of over 20% on the benchmark. It also works on some smaller test cases such as std::merge. We are in the process of testing it more extensively and debugging issues that come up. There is also still some work left on cleaning up the codegen, refactoring, and writing test cases. An updated version will be submitted soon.

Hi everyone, I wanted to give another progress update on this patch. We have gotten the transformation working on the Coremark opportunity, resulting in a speedup of over 20% on the benchmark. It also works on some smaller test cases such as std::merge. We are in the process of testing it more extensively and debugging issues that come up. There is also still some work left on cleaning up the codegen, refactoring, and writing test cases. An updated version will be submitted soon.

Great news!

speedup of over 20%

Wonderful!

Fantastic! Consider adding them as new testcases.

jkreiner updated this revision to Diff 342832.May 4 2021, 12:47 PM
jkreiner retitled this revision from [WIP] Analysis for DFA jump-threading optimization to Add jump-threading optimization for deterministic finite automata.
jkreiner edited the summary of this revision. (Show Details)
jkreiner updated this revision to Diff 342841.May 4 2021, 1:15 PM

There were some formatting issues with the previous diff.

If we're going to be cloning basic blocks, we need some sort of cost computation. Some blocks aren't legal to clone, and some are expensive simply due to size. (Illegal to clone like this: convergent, noduplicate, indirectbr predecessors.) See llvm/Analysis/CodeMetrics.h.

If I'm understanding correctly, the goal here is to specifically handle cases where we can completely eliminate the switch. That's a bit narrow, but I guess it makes sense, at least as a first cut.

llvm/lib/CodeGen/DFAJumpThreading.cpp
373 ↗(On Diff #342832)

Checking FirstDef->getType()->isIntegerTy() isn't necessary, by the definition of SwitchInst.

453 ↗(On Diff #342832)

Don't think you can end up with a vector select here; you've already constrained the result to be an integer.

566 ↗(On Diff #342832)

Prefer llvm::DenseMap

643 ↗(On Diff #342832)

cast<PHINode>(Incoming)

If we're going to be cloning basic blocks, we need some sort of cost computation. Some blocks aren't legal to clone, and some are expensive simply due to size. (Illegal to clone like this: convergent, noduplicate, indirectbr predecessors.) See llvm/Analysis/CodeMetrics.h.

That is a good point, I agree we will need a cost model for the transformation. I have started working on one and gathering more performance stats. Also thanks for sharing CodeMetrics.h, I can add a check for those types of blocks that aren't legal to clone.

If I'm understanding correctly, the goal here is to specifically handle cases where we can completely eliminate the switch. That's a bit narrow, but I guess it makes sense, as least as a first cut.

This is correct for how it has been implemented, but the transformation should also work for the general case where the switch can be partially threaded. The analysis part will need to be changed a bit to identify the partial threading opportunities, but it is possible in the future.

In case anyone is interested, there is some context about this patch over here: https://reviews.llvm.org/D88307

jkreiner updated this revision to Diff 345215.May 13 2021, 10:51 AM

Implemented a cost model to avoid cloning too much code, and addressed reviewer comments.

jkreiner marked 4 inline comments as done.May 13 2021, 11:54 AM

I removed the unnecessary conditions, and used the llvm::DenseMap instead.

llvm/lib/CodeGen/DFAJumpThreading.cpp
453 ↗(On Diff #342832)

Yes, I agree the vector select check was unnecessary here.

740 ↗(On Diff #345215)

Here is the cost calculation we implemented. It could be more accurate by accounting for instructions and blocks that get simplified, but for now it at least prevents code explosion.

efriedma added inline comments.May 17 2021, 3:45 PM
llvm/lib/CodeGen/DFAJumpThreading.cpp
13 ↗(On Diff #345215)

Could probably use a brief outline of the overall algorithm here.

109 ↗(On Diff #345215)

Don't use std::list.

I think you can use SmallVector and pop_back(), assuming the iteration order doesn't matter. If you need FIFO order, use std::deque.

182 ↗(On Diff #345215)

I don't think this applyUpdates() call does anything? The DomTree doesn't care about unreachable edges.

213 ↗(On Diff #345215)

Can you merge together the two triangle codepaths? They appear nearly identical. Probably just need a couple std::swap calls.

375 ↗(On Diff #345215)

Do you check somewhere that there's exactly one ConstantInt associated with a given successor? In general, there could be any number.

450 ↗(On Diff #345215)

isa<Instruction>() if you're not going to use the pointer returned by dyn_cast<>.

690 ↗(On Diff #345215)

Probably want to ensure we're only calling collectEphemeralValues once per function.

740 ↗(On Diff #345215)

Okay, makes sense.

llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
461 ↗(On Diff #345215)

This is kind of late to be running this pass. That might be useful in some cases: we might be making the code harder to analyze if there's irreducible control flow. On the other hand, we're skipping interesting optimizations if the CFG becomes significantly simpler. My instinct is that the advantages of running earlier outweigh the disadvantages.

llvm/test/CodeGen/AArch64/dfa-constant-propagation.ll
1 ↗(On Diff #345215)

Please put the tests in llvm/test/Transforms/DFAJumpThreading.

Matt added a subscriber: Matt.May 18 2021, 7:10 AM
jkreiner marked 2 inline comments as done.May 18 2021, 7:38 AM
jkreiner added inline comments.
llvm/lib/CodeGen/DFAJumpThreading.cpp
109 ↗(On Diff #345215)

Okay, there are a few places std::list was used, so I'll replace them all.

182 ↗(On Diff #345215)

Wouldn't it be needed since a new block is created? This edge is reached sometimes. I may be misunderstanding the usage of the DomTreeUpdater though.

213 ↗(On Diff #345215)

Sure I'll look into merging these two cases.

375 ↗(On Diff #345215)

Good point, this case isn't checked for so it would currently cause buggy behavior in this function.

Now that I think about it, this function can probably be removed since the return value stored as the EntryValue isn't used in the transformation.

llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
461 ↗(On Diff #345215)

I agree that it may be beneficial to run the pass earlier. I'm not aware yet of which optimizations it could interfere with. Do you have any suggestions of where to run it instead?

dnsampaio added inline comments.Fri, May 21, 1:41 AM
llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
461 ↗(On Diff #345215)

I would suggest trying it just before LLVM's original jump threading. For the implementation proposed in D88307, it actually produced code that LLVM's original jump threading would further optimize, whereas the other way around, the original jump threading would just not do anything. I'm running on llvm-12, so it may take me some time to test this patch.

xbolva00 added inline comments.Fri, May 21, 1:48 AM
llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
461 ↗(On Diff #345215)

And should it be part of the standard -O2+ pipeline? Why is it AArch64-specific?

jkreiner added inline comments.Fri, May 21, 6:53 AM
llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
461 ↗(On Diff #345215)

Thank you for the suggestion, I'll try it out before the original jump-threading then.

I only had it there since I was testing it on AArch64, but it could be moved to the general opt pipeline.

jkreiner added inline comments.Fri, May 21, 1:35 PM
llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
461 ↗(On Diff #345215)

This seems to be a better location for the pass. I've measured a performance gain 5% greater than before for Coremark by placing it before the original jump-threading.

jkreiner updated this revision to Diff 348040.Wed, May 26, 11:51 AM

Addressed reviewer comments, and moved the pass earlier in the pipeline.

jkreiner marked 10 inline comments as done.Wed, May 26, 11:55 AM
jkreiner updated this revision to Diff 348077.Wed, May 26, 1:40 PM

Fixed a failing opt pipeline test.

I've finally got time to test this.
For our downstream arch we're seeing gains up to 27%.
Very good, thanks for working on this.

llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
364–367

In release mode I get warnings that this is not used (we use -Werror):
llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp:364:21: error: unused function 'operator<<' [-Werror,-Wunused-function]

Perhaps wrap it in #ifndef NDEBUG?

foad added a subscriber: foad.Thu, May 27, 5:09 AM

Naive question: does your new pass "result in irreducible control flow that harms other optimizations"? From the ASCII art diagram in the commit message it looks like it does. Why is that OK?

llvm/include/llvm/Transforms/Scalar/DFAJumpThreading.h
2

Typo ".cpp" :)

For our downstream arch we're seeing gains up to 27%.

That's great to hear!

Naive question: does your new pass "result in irreducible control flow that harms other optimizations"? From the ASCII art diagram in the commit message it looks like it does. Why is that OK?

It could produce irreducible control flow like that, but the pass is late enough in the pipeline that it won't have a negative impact. I experimented with putting it early in the pipeline and the gains I measured weren't great, but it seems to be profitable where it is right now.

llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp
364–367

I see, I'll add that #ifndef NDEBUG check.

jkreiner updated this revision to Diff 348343.Thu, May 27, 11:54 AM
jkreiner marked 2 inline comments as done.
foad added a comment.Fri, May 28, 1:15 AM

Naive question: does your new pass "result in irreducible control flow that harms other optimizations"? From the ASCII art diagram in the commit message it looks like it does. Why is that OK?

It could produce irreducible control flow like that, but the pass is late enough in the pipeline that it won't have a negative impact. I experimented with putting it early in the pipeline and the gains I measured weren't great, but it seems to be profitable where it is right now.

Naive follow-up question: why does this have to be a completely new implementation of jump threading? Would it be feasible to have a single implementation that takes a "don't create irreducible control flow" flag?

@jkreiner with both llvm 12 and head I get a compiler crash in this function for file F17082111. Just run opt --dfa-jump-threading. Sorry, I couldn't reduce it further, I just don't know how to get bugpoint to work with opt.
With an llvm-12 debug build I get:

opt --dfa-jump-threading -S < fail.ll
opt: /work1/dsampaio/csw/llvm-project/llvm/include/llvm/IR/Instructions.h:2767: llvm::Value *llvm::PHINode::getIncomingValueForBlock(const llvm::BasicBlock *) const: Assertion `Idx >= 0 && "Invalid basic block argument!"' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: opt --dfa-jump-threading -S
1.      Running pass 'Function Pass Manager' on module '<stdin>'.
2.      Running pass 'DFA Jump Threading' on function '@main'
 #0 0x00007f6a3415119a llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /work1/dsampaio/csw/llvm-project/llvm/lib/Support/Unix/Signals.inc:565:11
 #1 0x00007f6a3415136b PrintStackTraceSignalHandler(void*) /work1/dsampaio/csw/llvm-project/llvm/lib/Support/Unix/Signals.inc:632:1
 #2 0x00007f6a3414f95b llvm::sys::RunSignalHandlers() /work1/dsampaio/csw/llvm-project/llvm/lib/Support/Signals.cpp:70:5
 #3 0x00007f6a34151ae1 SignalHandler(int) /work1/dsampaio/csw/llvm-project/llvm/lib/Support/Unix/Signals.inc:407:1
 #4 0x00007f6a32d7a980 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x12980)
 #5 0x00007f6a32076fb7 raise /build/glibc-S9d2JN/glibc-2.27/signal/../sysdeps/unix/sysv/linux/raise.c:51:0
 #6 0x00007f6a32078921 abort /build/glibc-S9d2JN/glibc-2.27/stdlib/abort.c:81:0
 #7 0x00007f6a3206848a __assert_fail_base /build/glibc-S9d2JN/glibc-2.27/assert/assert.c:89:0
 #8 0x00007f6a32068502 (/lib/x86_64-linux-gnu/libc.so.6+0x30502)
 #9 0x00007f6a344eb474 llvm::PHINode::getIncomingValueForBlock(llvm::BasicBlock const*) const /work1/dsampaio/csw/llvm-project/llvm/include/llvm/IR/Instructions.h:2768:29
#10 0x00007f6a35a37265 (anonymous namespace)::AllSwitchPaths::run() /work1/dsampaio/csw/llvm-project/llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp:532:24
#11 0x00007f6a35a36826 (anonymous namespace)::DFAJumpThreading::run(llvm::Function&) /work1/dsampaio/csw/llvm-project/llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp:1149:23
#12 0x00007f6a35a36c1f (anonymous namespace)::DFAJumpThreadingLegacyPass::runOnFunction(llvm::Function&) /work1/dsampaio/csw/llvm-project/llvm/lib/Transforms/Scalar/DFAJumpThreading.cpp:164:5
#13 0x00007f6a3440f84c llvm::FPPassManager::runOnFunction(llvm::Function&) /work1/dsampaio/csw/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1435:23
#14 0x00007f6a34414ac5 llvm::FPPassManager::runOnModule(llvm::Module&) /work1/dsampaio/csw/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1481:16
#15 0x00007f6a34410214 (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) /work1/dsampaio/csw/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1550:23
#16 0x00007f6a3440fd38 llvm::legacy::PassManagerImpl::run(llvm::Module&) /work1/dsampaio/csw/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:541:16
#17 0x00007f6a34414dd1 llvm::legacy::PassManager::run(llvm::Module&) /work1/dsampaio/csw/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1677:3
#18 0x0000000000472e17 main /work1/dsampaio/csw/llvm-project/llvm/tools/opt/opt.cpp:998:3
#19 0x00007f6a32059bf7 __libc_start_main /build/glibc-S9d2JN/glibc-2.27/csu/../csu/libc-start.c:344:0
#20 0x000000000043577a _start (/work1/dsampaio/csw/devimage/toolchain_default/toolroot/opt/kalray/accesscore/bin/opt+0x43577a)
Aborted (core dumped)

Naive question: does your new pass "result in irreducible control flow that harms other optimizations"? From the ASCII art diagram in the commit message it looks like it does. Why is that OK?

It could produce irreducible control flow like that, but the pass is late enough in the pipeline that it won't have a negative impact. I experimented with putting it early in the pipeline and the gains I measured weren't great, but it seems to be profitable where it is right now.

Naive follow-up question: why does this have to be a completely new implementation of jump threading? Would it be feasible to have a single implementation that takes a "don't create irreducible control flow" flag?

I just want to clarify that this is not a completely new implementation of jump threading; the cases that the two passes catch are mutually exclusive. There were some discussions about implementing a more aggressive jump threading versus a separate pass for specific uses (FSMs) that might answer your question over here: D88307

@jkreiner with both llvm 12 and head I get a compiler crash in this function for file F17082111. Just run opt --dfa-jump-threading. Sorry, I couldn't reduce it further, I just don't know how to get bugpoint to work with opt.

Thank you for providing the testcase, I reproduced it locally. It is the last day of my internship, so I won't be able to submit a fix for it. @alexey.zhikhar initially wrote some parts of the analysis, and will continue where I left off on the remaining work items for this patch.

Thanks for all of the effort! On our downstream Cortex-R5 compiler, I'm seeing a 20.4% speedup on Coremark with this patch, which is good, however, the older patch (https://reviews.llvm.org/D88307) gave me a 21.6% speedup. Any idea what could account for the difference there?

Thanks for all of the effort! On our downstream Cortex-R5 compiler, I'm seeing a 20.4% speedup on Coremark with this patch, which is good, however, the older patch (https://reviews.llvm.org/D88307) gave me a 21.6% speedup. Any idea what could account for the difference there?

It would be very hard for us to explain the difference without having access to both the downstream compiler and the hardware, but some things that come to mind are:

(1) were the two experiments performed on the same baseline compiler?
(2) 1% could be within noise and might be due to trivial changes in the generated code (change of alignment, change of branch addresses and branch target addresses, etc.)
(3) were both passes in the same place in the pipeline in both experiments?

From the information provided, it is not possible to understand the root cause of the difference.

ormris removed a subscriber: ormris.Mon, Jun 7, 4:08 PM

Thanks for all of the effort! On our downstream Cortex-R5 compiler, I'm seeing a 20.4% speedup on Coremark with this patch, which is good, however, the older patch (https://reviews.llvm.org/D88307) gave me a 21.6% speedup. Any idea what could account for the difference there?

It would be very hard for us to explain the difference without having access to both the downstream compiler and the hardware, but some things that come to mind are:

(1) were the two experiments performed on the same baseline compiler?
(2) 1% could be within noise and might be due to trivial changes in the generated code (change of alignment, change of branch addresses and branch target addresses, etc.)
(3) were both passes in the same place in the pipeline in both experiments?

From the information provided, it is not possible to understand the root cause of the difference.

Thanks, and understood. I'll see if I can do some more digging here, but I suspect it may be noise. I wasn't sure if there was an obvious difference in the algorithms here that might account for a degradation. I'm pretty sure both passes are not in the same place in the pipeline (just taking the patch from the other review).

marksl added a subscriber: marksl.Tue, Jun 8, 2:15 PM
alexey.zhikhar commandeered this revision.Wed, Jun 9, 7:00 AM
alexey.zhikhar added a reviewer: jkreiner.

Fixed the bug submitted by @dnsampaio, test case added

Misc minor changes

Fix a clang-format warning

@efriedma Eli, I believe all your comments have been addressed, please take another look when you get a chance, thanks.