This is an archive of the discontinued LLVM Phabricator instance.

Add a flag that enables more aggressive load PRE
Needs ReviewPublic

Authored by twoh on Apr 7 2017, 1:11 PM.

Details

Summary

Existing load PRE implementation is very conservative about code size increase and performs PRE only when there's no instruction count increase. This patch adds an option to allow more aggressive load PRE.

Event Timeline

twoh created this revision.Apr 7 2017, 1:11 PM
dberlin edited edge metadata.Apr 10 2017, 11:30 AM

Any performance numbers?

Note that this probably ends up equivalent to this algorithm: http://www1.cse.wustl.edu/~cytron/cs531/Resources/Papers/valnum.pdf

In which case, we should do it in rank order to minimize the number of iterations it will take to float the loads upwards as far as they go.

twoh added a comment.Apr 11 2017, 8:55 AM

No perf numbers with public benchmarks yet, but with our internal benchmark I confirmed that aggressive load PRE removes redundant instructions in a hot loop. I'll report spec2006 numbers later. This patch is for quick unlocking of the feature before making more fundamental improvement.

I would strongly prefer to either turn this on by default (IE no option) or not do it.
Which means we need numbers.

It's not a useful knob to tweak, IMHO, and definitely not one amenable to good heuristics (the age old register allocation vs PRE debate).
For code size, PRE is generally a lose compared to VBE based hoisting.

twoh added a comment.Apr 11 2017, 3:58 PM

@dberlin Thank you for your comments. I agree on you that it is not easy to invent a good heuristics around this knob, but still I find this useful for some cases that actually matters. Assuming that this doesn't make the implementation much complex, wouldn't it be better for compiler users to have more options?

I'm still collecting numbers. Will post it here once I have them all.

"Assuming that this doesn't make the implementation much complex, wouldn't it be better for compiler users to have more options?"

No, not if they have no idea why they would use it, etc :)

If we can make reasonable choices based on O*, we should.
Knobs like this basicallly go untested and break :)

twoh added a comment.EditedApr 12 2017, 10:52 PM

Below are the numbers with spec2006 (The numbers in the last row are overall for SPEC score and total for compile time and binary size):

-O2:

               |                                Base |                          Aggressive |                                Diff |
               | SPEC score compile time binary size | SPEC score compile time binary size | SPEC score compile time binary size |
 400.perlbench |       27.9           30   1,207,608 |       27.6           29   1,211,704 |     -1.08%       -3.33%       0.34% |
     401.bzip2 |       18.9            4      94,400 |       19.1            3      94,400 |      1.06%      -25.00%       0.00% |
       403.gcc |       26.4           79   3,771,952 |       26.4           81   3,784,280 |      0.00%        2.53%       0.33% |
       429.mcf |       24.1            1      23,200 |       24.1            2      23,200 |      0.00%      100.00%       0.00% |
     445.gobmk |       21.8           21   3,958,128 |       21.8           21   3,962,224 |      0.00%        0.00%       0.10% |
     456.hmmer |       21.0           10     344,288 |       21.0           10     344,288 |      0.00%        0.00%       0.00% |
     458.sjeng |       22.0            4     156,192 |       22.9            4     156,192 |      4.09%        0.00%       0.00% |
462.libquantum |       44.4            2      55,712 |       48.1            2      55,712 |      8.33%        0.00%       0.00% |
   464.h264ref |       38.5           16     694,768 |       38.7           17     694,768 |      0.52%        6.25%       0.00% |
   471.omnetpp |       17.7           30   1,553,888 |       18.4           30   1,553,888 |      3.95%        0.00%       0.00% |
     473.astar |       18.0            2     125,616 |       18.1            2     125,616 |      0.56%        0.00%       0.00% |
 483.xalancbmk |       29.5          199   6,903,592 |       29.6          203   6,907,688 |      0.34%        2.01%       0.06% |
               |       24.8          398  18,889,344 |       25.2          404  18,913,960 |      1.61%        1.51%       0.13% |

-O3:

               |                                Base |                          Aggressive |                                Diff |
               | SPEC score compile time binary size | SPEC score compile time binary size | SPEC score compile time binary size |
 400.perlbench |       27.2           30   3,570,840 |       27.0           35   3,581,096 |     -0.74%       16.67%       0.29% |
     401.bzip2 |       19.1            4     245,912 |       18.9            5     246,200 |     -1.05%       25.00%       0.12% |
       403.gcc |       26.0           79  10,517,504 |       26.8           93  10,546,032 |      3.08%       17.72%       0.27% |
       429.mcf |       23.7            1      62,152 |       24.2            1      62,224 |      2.11%        0.00%       0.12% |
     445.gobmk |       21.7           21   6,450,344 |       21.9           24   6,456,712 |      0.92%       14.29%       0.10% |
     456.hmmer |       21.0           10   1,034,864 |       21.1           12   1,036,112 |      0.48%       20.00%       0.12% |
     458.sjeng |       22.5            4     336,416 |       22.4            4     336,808 |     -0.44%        0.00%       0.12% |
462.libquantum |       47.3            2     138,016 |       44.6            2     137,768 |     -5.71%        0.00%      -0.18% |
   464.h264ref |       39.0           16   1,878,656 |       39.0           21   1,884,208 |      0.00%       31.25%       0.30% |
   471.omnetpp |       19.7           30   4,468,208 |       18.9           32   4,469,464 |     -4.06%        6.67%       0.03% |
     473.astar |       18.0            2     295,152 |       18.2            3     295,824 |      1.11%       50.00%       0.23% |
 483.xalancbmk |       30.0          199  39,545,312 |       29.2          221  39,533,360 |     -2.67%       11.06%      -0.03% |
               |       25.2          398  68,543,376 |       25.0          453  68,585,808 |     -0.79%       13.82%       0.06% |

For -O2 enabling aggressive load PRE brings somewhat expected results (better spec score, increase in compilation time and binary size). But for -O3 it doesn't even improve the spec score while increasing compile time significantly.

twoh added a comment.Apr 12 2017, 11:07 PM

With numbers, it doesn't seem worth to turn it on by default. @dberlin, just curious, do you have any rough estimation about how much compilation time can be saved by implementing rank-order based iteration optimization?

davide added a subscriber: davide.Apr 12 2017, 11:37 PM

I tried to do something similar many months ago and the compile time on some tests skyrocketed, FWIW. I don't have the tests handy, but you may want to keep that in mind.

mcrosier resigned from this revision.May 25 2017, 6:45 AM