This is an archive of the discontinued LLVM Phabricator instance.

[MachineSink] Use the real post dominator tree
ClosedPublic

Authored by jingyue on Oct 6 2014, 3:14 PM.

Details

Summary

Fixes a FIXME in MachineSinking. Instead of using the simple heuristics in
isPostDominatedBy, use the real MachinePostDominatorTree and MachineLoopInfo.
The old heuristics caused instructions to sink unnecessarily, and might create
register pressure.

This is the second try of the fix. The first one (D4814) caused a performance
regression due to failing to sink instructions out of loops (PR21115). This
patch fixes PR21115 by sinking an instruction from a deeper loop to a shallower
one regardless of whether the target block post-dominates the source.

Thanks Alexey Volkov for reporting PR21115!

Diff Detail

Event Timeline

jingyue updated this revision to Diff 14478.Oct 6 2014, 3:14 PM
jingyue retitled this revision from to [MachineSink] Use the real post dominator tree.
jingyue updated this object.
jingyue edited the test plan for this revision. (Show Details)
jingyue added reviewers: Jiangning, resistor.
jingyue added subscribers: eliben, meheff, Unknown Object (MLST).

Compilation time from llvm-test-suite. Three columns are benchmark name, original compilation time (in seconds), and the slowdown/speedup caused by this change (negative means the new version is faster). I only picked the ones with compilation time longer than one second, because other data points are very noisy.

I don't see a significant regression in compilation time based on these numbers.

clamscan                                              9.6843             -0.7650%
ldecod                                                4.5576              0.6604%
lencod                                                9.8791              1.5637%
SPASS                                                 9.5813              2.0035%
make_dparser                                          2.0540              1.2723%
kc                                                   10.0371             -0.1296%
lua                                                   2.6484              2.4356%
oggenc                                                3.1777              1.0269%
siod                                                  1.4711              1.3638%
sqlite3                                              12.4904              0.4944%
treecc                                                1.3868             -1.7237%
7zip-benchmark                                       27.6821             -0.2909%
smg2000                                               4.7685              2.4668%
bullet                                               20.0863              0.6708%
espresso                                              3.6563              1.7755%
gs                                                    2.9518             -0.8027%
consumer-jpeg                                         3.4321             -0.4087%
consumer-lame                                         2.8166              2.2606%
consumer-typeset                                      7.7711              0.4109%
office-ispell                                         1.1021              0.0816%
paq8p                                                 1.3805              0.0290%
timberwolfmc                                          6.3654             -1.0692%
mybison                                               1.1789              1.8489%
ControlFlow-dbl                                       1.0986              0.1001%
ControlFlow-flt                                       1.0994             -1.3461%
ControlLoops-dbl                                      1.0301             -1.3881%
ControlLoops-flt                                      1.0259             -0.1170%
CrossingThresholds-dbl                                1.0041             -0.6896%
Equivalencing-flt                                     1.0074             -2.5841%
Expansion-dbl                                         1.0220             -0.4315%
Expansion-flt                                         1.0275              0.0097%
GlobalDataFlow-dbl                                    1.0226             -0.2546%
GlobalDataFlow-flt                                    1.0147             -0.2072%
InductionVariable-dbl                                 1.0076             -0.3878%
LinearDependence-dbl                                  1.0397             -0.0866%
LinearDependence-flt                                  1.0477             -1.8301%
LoopRestructuring-flt                                 1.0237             -0.9422%
Reductions-dbl                                        1.0482              0.1239%
Reductions-flt                                        1.0262              2.6349%
pairlocalalign                                        6.6571             -1.3261%
cjpeg                                                 3.3882              2.9228%
tramp3d-v4                                           15.5355              0.0097%
loop_unroll                                           2.6732              0.4218%
simple_types_constant_folding                         2.1412              0.5031%
simple_types_loop_invariant                           1.7099             -0.3104%
spirit                                                3.0955              0.3676%

Runtime performance per SPEC CINT2006. The four columns are benchmark name, reference time (not useful here), run time, and ratio vs reference (not useful here).

Original

400.perlbench    9770        320       30.5 S                                  
400.perlbench    9770        323       30.3 *                                  
400.perlbench    9770        324       30.2 S                                  
401.bzip2        9650        420       23.0 S                                  
401.bzip2        9650        451       21.4 S                                  
401.bzip2        9650        420       23.0 *                                  
403.gcc          8050        249       32.3 S                                  
403.gcc          8050        256       31.5 S                                  
403.gcc          8050        250       32.1 *                                  
429.mcf          9120        280       32.6 *                                  
429.mcf          9120        283       32.2 S                                  
429.mcf          9120        279       32.7 S                                  
445.gobmk       10490        407       25.8 *                                  
445.gobmk       10490        407       25.8 S                                  
445.gobmk       10490        407       25.8 S                                  
456.hmmer        9330        363       25.7 S                                  
456.hmmer        9330        363       25.7 *                                  
456.hmmer        9330        365       25.6 S                                  
458.sjeng       12100        492       24.6 S                                  
458.sjeng       12100        490       24.7 *                                  
458.sjeng       12100        489       24.7 S                                  
462.libquantum  20720        396       52.4 S                                  
462.libquantum  20720        403       51.4 *                                  
462.libquantum  20720        450       46.0 S                                  
464.h264ref     22130        460       48.1 *                                  
464.h264ref     22130        460       48.1 S                                  
464.h264ref     22130        461       48.0 S                                  
471.omnetpp      6250        230       27.2 *                                  
471.omnetpp      6250        231       27.1 S                                  
471.omnetpp      6250        229       27.3 S                                  
473.astar        7020        336       20.9 *                                  
473.astar        7020        338       20.8 S                                  
473.astar        7020        336       20.9 S                                  
483.xalancbmk    6900        186       37.1 *                                  
483.xalancbmk    6900        196       35.2 S                                  
483.xalancbmk    6900        185       37.2 S

With this change

400.perlbench    9770        307       31.9 S                                  
400.perlbench    9770        307       31.8 S                                  
400.perlbench    9770        307       31.9 *                                  
401.bzip2        9650        419       23.0 *                                  
401.bzip2        9650        419       23.0 S                                  
401.bzip2        9650        419       23.0 S                                  
403.gcc          8050        251       32.0 S                                  
403.gcc          8050        249       32.3 S                                  
403.gcc          8050        251       32.1 *                                  
429.mcf          9120        273       33.4 S                                  
429.mcf          9120        280       32.6 S                                  
429.mcf          9120        276       33.1 *                                  
445.gobmk       10490        406       25.9 S                                  
445.gobmk       10490        406       25.8 *                                  
445.gobmk       10490        410       25.6 S                                  
456.hmmer        9330        360       25.9 S                                  
456.hmmer        9330        361       25.9 *                                  
456.hmmer        9330        361       25.8 S                                  
458.sjeng       12100        483       25.0 S                                  
458.sjeng       12100        485       24.9 *                                  
458.sjeng       12100        486       24.9 S                                  
462.libquantum  20720        364       56.9 S                                  
462.libquantum  20720        376       55.2 *                                  
462.libquantum  20720        381       54.3 S                                  
464.h264ref     22130        457       48.4 S                                  
464.h264ref     22130        456       48.5 *                                  
464.h264ref     22130        455       48.6 S                                  
471.omnetpp      6250        228       27.4 S                                  
471.omnetpp      6250        229       27.3 S                                  
471.omnetpp      6250        228       27.4 *                                  
473.astar        7020        335       21.0 S                                  
473.astar        7020        333       21.1 S                                  
473.astar        7020        333       21.1 *                                  
483.xalancbmk    6900        186       37.2 *                                  
483.xalancbmk    6900        185       37.4 S                                  
483.xalancbmk    6900        186       37.1 S

Each benchmark was run three times and the results look stable across runs. We take the median and compute the slowdown/speedup.

benchmark	original	sink	slowdown
400.perlbench 	323	307	-4.953560372
401.bzip2     	420	419	-0.2380952381
403.gcc       	250	251	0.4
429.mcf       	280	276	-1.428571429
445.gobmk     	407	406	-0.2457002457
456.hmmer     	363	361	-0.5509641873
458.sjeng     	490	485	-1.020408163
462.libquantum	403	376	-6.699751861
464.h264ref   	460	456	-0.8695652174
471.omnetpp   	230	228	-0.8695652174
473.astar     	336	333	-0.8928571429
483.xalancbmk 	186	186	0

We see significant speedup on 400.perlbench and 462.libquantum without any significant regression.

bruno added a subscriber: bruno.Oct 6 2014, 3:50 PM

Hi, anyone willing to review this patch? I think the code change is fairly straight-forward, and that the 4% and 6% speedup on perlbench and libquantum is something the community likes :) Thanks!

hfinkel accepted this revision.Oct 14 2014, 6:06 AM
hfinkel added a reviewer: hfinkel.
hfinkel added a subscriber: hfinkel.

Given that MachineSink does not run at -O0, I think that the compile-time impact (~1.5%) looks reasonable given the advantages of doing this properly. LGTM.

This revision is now accepted and ready to land.Oct 14 2014, 6:06 AM
jingyue updated this object.Oct 14 2014, 8:37 PM
jingyue edited edge metadata.
jingyue closed this revision.Oct 14 2014, 8:37 PM

Thanks a lot, Hal! Committed in r219773.