The existing load PRE implementation is very conservative about code-size increase and performs PRE only when there is no increase in instruction count. This patch adds an option to allow more aggressive load PRE.
Any performance numbers?
Note that this probably ends up equivalent to this algorithm: http://www1.cse.wustl.edu/~cytron/cs531/Resources/Papers/valnum.pdf
In which case, we should do it in rank order to minimize the number of iterations it will take to float the loads upwards as far as they go.
No perf numbers with public benchmarks yet, but with our internal benchmark I confirmed that aggressive load PRE removes redundant instructions in a hot loop. I'll report SPEC 2006 numbers later. This patch is meant to unlock the feature quickly before making more fundamental improvements.
I would strongly prefer to either turn this on by default (IE no option) or not do it.
Which means we need numbers.
It's not a useful knob to tweak, IMHO, and definitely not one amenable to good heuristics (the age old register allocation vs PRE debate).
For code size, PRE is generally a loss compared to VBE-based hoisting.
@dberlin Thank you for your comments. I agree with you that it is not easy to invent a good heuristic around this knob, but I still find it useful for some cases that actually matter. Assuming this doesn't make the implementation much more complex, wouldn't it be better for compiler users to have more options?
I'm still collecting numbers. Will post it here once I have them all.
"Assuming this doesn't make the implementation much more complex, wouldn't it be better for compiler users to have more options?"
No, not if they have no idea why they would use it, etc :)
If we can make reasonable choices based on O*, we should.
Knobs like this basically go untested and break :)
Below are the numbers for SPEC 2006 (the last row gives the overall SPEC score and the totals for compile time and binary size):
-O2:
| Benchmark | Base score | Base time | Base size | Aggressive score | Aggressive time | Aggressive size | Diff score | Diff time | Diff size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 400.perlbench | 27.9 | 30 | 1,207,608 | 27.6 | 29 | 1,211,704 | -1.08% | -3.33% | 0.34% |
| 401.bzip2 | 18.9 | 4 | 94,400 | 19.1 | 3 | 94,400 | 1.06% | -25.00% | 0.00% |
| 403.gcc | 26.4 | 79 | 3,771,952 | 26.4 | 81 | 3,784,280 | 0.00% | 2.53% | 0.33% |
| 429.mcf | 24.1 | 1 | 23,200 | 24.1 | 2 | 23,200 | 0.00% | 100.00% | 0.00% |
| 445.gobmk | 21.8 | 21 | 3,958,128 | 21.8 | 21 | 3,962,224 | 0.00% | 0.00% | 0.10% |
| 456.hmmer | 21.0 | 10 | 344,288 | 21.0 | 10 | 344,288 | 0.00% | 0.00% | 0.00% |
| 458.sjeng | 22.0 | 4 | 156,192 | 22.9 | 4 | 156,192 | 4.09% | 0.00% | 0.00% |
| 462.libquantum | 44.4 | 2 | 55,712 | 48.1 | 2 | 55,712 | 8.33% | 0.00% | 0.00% |
| 464.h264ref | 38.5 | 16 | 694,768 | 38.7 | 17 | 694,768 | 0.52% | 6.25% | 0.00% |
| 471.omnetpp | 17.7 | 30 | 1,553,888 | 18.4 | 30 | 1,553,888 | 3.95% | 0.00% | 0.00% |
| 473.astar | 18.0 | 2 | 125,616 | 18.1 | 2 | 125,616 | 0.56% | 0.00% | 0.00% |
| 483.xalancbmk | 29.5 | 199 | 6,903,592 | 29.6 | 203 | 6,907,688 | 0.34% | 2.01% | 0.06% |
| Overall / Total | 24.8 | 398 | 18,889,344 | 25.2 | 404 | 18,913,960 | 1.61% | 1.51% | 0.13% |
-O3:
| Benchmark | Base score | Base time | Base size | Aggressive score | Aggressive time | Aggressive size | Diff score | Diff time | Diff size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 400.perlbench | 27.2 | 30 | 3,570,840 | 27.0 | 35 | 3,581,096 | -0.74% | 16.67% | 0.29% |
| 401.bzip2 | 19.1 | 4 | 245,912 | 18.9 | 5 | 246,200 | -1.05% | 25.00% | 0.12% |
| 403.gcc | 26.0 | 79 | 10,517,504 | 26.8 | 93 | 10,546,032 | 3.08% | 17.72% | 0.27% |
| 429.mcf | 23.7 | 1 | 62,152 | 24.2 | 1 | 62,224 | 2.11% | 0.00% | 0.12% |
| 445.gobmk | 21.7 | 21 | 6,450,344 | 21.9 | 24 | 6,456,712 | 0.92% | 14.29% | 0.10% |
| 456.hmmer | 21.0 | 10 | 1,034,864 | 21.1 | 12 | 1,036,112 | 0.48% | 20.00% | 0.12% |
| 458.sjeng | 22.5 | 4 | 336,416 | 22.4 | 4 | 336,808 | -0.44% | 0.00% | 0.12% |
| 462.libquantum | 47.3 | 2 | 138,016 | 44.6 | 2 | 137,768 | -5.71% | 0.00% | -0.18% |
| 464.h264ref | 39.0 | 16 | 1,878,656 | 39.0 | 21 | 1,884,208 | 0.00% | 31.25% | 0.30% |
| 471.omnetpp | 19.7 | 30 | 4,468,208 | 18.9 | 32 | 4,469,464 | -4.06% | 6.67% | 0.03% |
| 473.astar | 18.0 | 2 | 295,152 | 18.2 | 3 | 295,824 | 1.11% | 50.00% | 0.23% |
| 483.xalancbmk | 30.0 | 199 | 39,545,312 | 29.2 | 221 | 39,533,360 | -2.67% | 11.06% | -0.03% |
| Overall / Total | 25.2 | 398 | 68,543,376 | 25.0 | 453 | 68,585,808 | -0.79% | 13.82% | 0.06% |
For -O2, enabling aggressive load PRE brings roughly the expected results: a better SPEC score at the cost of increased compile time and binary size. For -O3, however, it does not improve the SPEC score at all while increasing compile time significantly.
Given these numbers, it doesn't seem worth turning it on by default. @dberlin, just curious, do you have a rough estimate of how much compile time could be saved by implementing the rank-order-based iteration optimization?
I tried to do something similar many months ago and the compile time on some tests skyrocketed, FWIW. I don't have the tests handy, but you may want to keep that in mind.