SSAUpdater is often a bottleneck in JumpThreading, and one of the reasons is
that it performs a lot of unnecessary computations (DT/IDF) over and over
again. This patch implements a classic algorithm for PHI-node placement and
uses JT-specific properties to optimize it (namely: we only have two blocks
with definitions, and these blocks are the same for all instructions we're
going to rewrite).
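
To illustrate the idea, here is a minimal sketch (not the actual patch;
placePHIsForAllVars and its parameters are made up for illustration) of how
the classic IDF-based PHI-placement algorithm can exploit the shared pair of
defining blocks: the iterated dominance frontier is computed once with LLVM's
ForwardIDFCalculator and then reused for every rewritten variable.

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/ADT/SmallPtrSet.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/Analysis/IteratedDominanceFrontier.h"
  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Dominators.h"
  #include "llvm/IR/Instructions.h"

  using namespace llvm;

  static void placePHIsForAllVars(DominatorTree &DT, BasicBlock *OldBB,
                                  BasicBlock *NewBB,
                                  ArrayRef<Instruction *> Vars) {
    // In JumpThreading every variable we rewrite is defined in the same two
    // blocks, so one IDF computation covers all of them.
    SmallPtrSet<BasicBlock *, 2> DefBlocks;
    DefBlocks.insert(OldBB);
    DefBlocks.insert(NewBB);

    ForwardIDFCalculator IDF(DT);
    IDF.setDefiningBlocks(DefBlocks);
    SmallVector<BasicBlock *, 32> PHIBlocks;
    IDF.calculate(PHIBlocks);

    // Reuse the single placement result: one PHI per variable in each IDF
    // block (incoming values would be filled in by the rewriting step).
    for (Instruction *I : Vars)
      for (BasicBlock *BB : PHIBlocks)
        PHINode::Create(I->getType(), /*NumReservedValues=*/2,
                        I->getName() + ".ssa", &BB->front());
  }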
With this patch the test from PR16756 speeds up by ~2x, while the time spent in
JumpThreading goes down by ~4x.
Before the patch:
  Total Execution Time: 26.6205 seconds (26.6232 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  13.3016 ( 50.6%)   0.0167 (  4.7%)  13.3183 ( 50.0%)  13.3190 ( 50.0%)  Jump Threading
   5.3226 ( 20.3%)   0.0170 (  4.8%)   5.3397 ( 20.1%)   5.3408 ( 20.1%)  Jump Threading
   1.7753 (  6.8%)   0.0020 (  0.6%)   1.7772 (  6.7%)   1.7772 (  6.7%)  SLP Vectorizer
   1.1617 (  4.4%)   0.1579 ( 44.1%)   1.3195 (  5.0%)   1.3199 (  5.0%)  X86 DAG->DAG Instruction Selection
With the patch:
  Total Execution Time: 12.8328 seconds (12.8335 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   4.5331 ( 36.3%)   0.0135 (  4.1%)   4.5466 ( 35.4%)   4.5471 ( 35.4%)  Jump Threading
   1.7649 ( 14.1%)   0.0015 (  0.5%)   1.7664 ( 13.8%)   1.7665 ( 13.8%)  SLP Vectorizer
   1.1914 (  9.5%)   0.1594 ( 47.9%)   1.3507 ( 10.5%)   1.3510 ( 10.5%)  X86 DAG->DAG Instruction Selection
   0.8300 (  6.6%)   0.0034 (  1.0%)   0.8333 (  6.5%)   0.8333 (  6.5%)  Jump Threading
Also, some archeology. Some time ago there was a patch for an alternative
SSAUpdater implementation (https://reviews.llvm.org/D28934). It was not
committed back then, but I decided to try it first. Initially, it also showed
huge speed-ups, but then I discovered a couple of bugs, fixes for which ate a
big chunk of the speedups (although the new implementation still was
significantly faster than the existing one). I can upload an updated version of
that patch if there is interest, but for JumpThreading I decided to pursue
another path: the D28934 implementation, while much faster than the existing
one, still does not scale very well. The algorithm used there is very efficient
for small incremental updates, but it doesn't scale well for bulk updates (e.g.
when we have to update a large number of instructions at once), as there is not
much to share across the algorithm's invocations. In contrast, the approach I
propose here aims to reuse as much computation as possible; see the sketch
below.
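
To make the bulk-vs-incremental contrast concrete, here is a rough sketch of
the two usage patterns. It assumes an interface along the lines of the
SSAUpdaterBulk class this patch introduces; Insts, Clones, and UsesToRename
are hypothetical stand-ins for JumpThreading's own data structures.

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/ADT/DenseMap.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/IR/Dominators.h"
  #include "llvm/Transforms/Utils/SSAUpdater.h"
  #include "llvm/Transforms/Utils/SSAUpdaterBulk.h"

  using namespace llvm;

  // Incremental style (existing SSAUpdater): every variable runs the full
  // rewriting machinery on its own; nothing is shared between iterations.
  static void rewriteIncremental(
      BasicBlock *OldBB, BasicBlock *NewBB, ArrayRef<Instruction *> Insts,
      DenseMap<Instruction *, Value *> &Clones,
      DenseMap<Instruction *, SmallVector<Use *, 8>> &UsesToRename) {
    for (Instruction *I : Insts) {
      SSAUpdater SSAUpdate;
      SSAUpdate.Initialize(I->getType(), I->getName());
      SSAUpdate.AddAvailableValue(OldBB, I);
      SSAUpdate.AddAvailableValue(NewBB, Clones[I]);
      for (Use *U : UsesToRename[I])
        SSAUpdate.RewriteUse(*U);
    }
  }

  // Bulk style (this patch): register all variables and uses first, then
  // rewrite everything in one pass, so the DT/IDF work is computed once.
  static void rewriteBulk(
      DominatorTree &DT, BasicBlock *OldBB, BasicBlock *NewBB,
      ArrayRef<Instruction *> Insts, DenseMap<Instruction *, Value *> &Clones,
      DenseMap<Instruction *, SmallVector<Use *, 8>> &UsesToRename) {
    SSAUpdaterBulk SSAUpdate;
    for (Instruction *I : Insts) {
      unsigned Var = SSAUpdate.AddVariable(I->getName(), I->getType());
      SSAUpdate.AddAvailableValue(Var, OldBB, I);
      SSAUpdate.AddAvailableValue(Var, NewBB, Clones[I]);
      for (Use *U : UsesToRename[I])
        SSAUpdate.AddUse(Var, U);
    }
    SSAUpdate.RewriteAllUses(&DT);
  }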
Compile time impact: no noticeable changes on CTMark, a big improvement on the
test from PR16756.