The existing load PRE implementation is very conservative about code-size increase and performs PRE only when there is no increase in instruction count. This patch adds an option to allow more aggressive load PRE.
Any performance numbers?
Note that this probably ends up equivalent to this algorithm: http://www1.cse.wustl.edu/~cytron/cs531/Resources/Papers/valnum.pdf
In which case, we should do it in rank order to minimize the number of iterations it will take to float the loads upwards as far as they go.
No perf numbers with public benchmarks yet, but with our internal benchmark I confirmed that aggressive load PRE removes redundant instructions in a hot loop. I'll report SPEC 2006 numbers later. This patch is meant to unlock the feature quickly before making more fundamental improvements.
I would strongly prefer to either turn this on by default (IE no option) or not do it.
Which means we need numbers.
It's not a useful knob to tweak, IMHO, and definitely not one amenable to good heuristics (the age old register allocation vs PRE debate).
For code size, PRE is generally a loss compared to VBE-based hoisting.
@dberlin Thank you for your comments. I agree with you that it is not easy to invent a good heuristic around this knob, but I still find it useful for some cases that actually matter. Assuming this doesn't make the implementation much more complex, wouldn't it be better for compiler users to have more options?
I'm still collecting numbers. Will post it here once I have them all.
"Assuming this doesn't make the implementation much more complex, wouldn't it be better for compiler users to have more options?"
No, not if they have no idea why they would use it, etc :)
If we can make reasonable choices based on O*, we should.
Knobs like this basically go untested and break :)
Below are the numbers for SPEC 2006 (the last row gives the overall SPEC score and the totals for compile time and binary size):
-O2:
| Benchmark | Base score | Base time | Base size | Aggressive score | Aggressive time | Aggressive size | Diff score | Diff time | Diff size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 400.perlbench | 27.9 | 30 | 1,207,608 | 27.6 | 29 | 1,211,704 | -1.08% | -3.33% | 0.34% |
| 401.bzip2 | 18.9 | 4 | 94,400 | 19.1 | 3 | 94,400 | 1.06% | -25.00% | 0.00% |
| 403.gcc | 26.4 | 79 | 3,771,952 | 26.4 | 81 | 3,784,280 | 0.00% | 2.53% | 0.33% |
| 429.mcf | 24.1 | 1 | 23,200 | 24.1 | 2 | 23,200 | 0.00% | 100.00% | 0.00% |
| 445.gobmk | 21.8 | 21 | 3,958,128 | 21.8 | 21 | 3,962,224 | 0.00% | 0.00% | 0.10% |
| 456.hmmer | 21.0 | 10 | 344,288 | 21.0 | 10 | 344,288 | 0.00% | 0.00% | 0.00% |
| 458.sjeng | 22.0 | 4 | 156,192 | 22.9 | 4 | 156,192 | 4.09% | 0.00% | 0.00% |
| 462.libquantum | 44.4 | 2 | 55,712 | 48.1 | 2 | 55,712 | 8.33% | 0.00% | 0.00% |
| 464.h264ref | 38.5 | 16 | 694,768 | 38.7 | 17 | 694,768 | 0.52% | 6.25% | 0.00% |
| 471.omnetpp | 17.7 | 30 | 1,553,888 | 18.4 | 30 | 1,553,888 | 3.95% | 0.00% | 0.00% |
| 473.astar | 18.0 | 2 | 125,616 | 18.1 | 2 | 125,616 | 0.56% | 0.00% | 0.00% |
| 483.xalancbmk | 29.5 | 199 | 6,903,592 | 29.6 | 203 | 6,907,688 | 0.34% | 2.01% | 0.06% |
| Overall / Total | 24.8 | 398 | 18,889,344 | 25.2 | 404 | 18,913,960 | 1.61% | 1.51% | 0.13% |
-O3:
| Benchmark | Base score | Base time | Base size | Aggressive score | Aggressive time | Aggressive size | Diff score | Diff time | Diff size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 400.perlbench | 27.2 | 30 | 3,570,840 | 27.0 | 35 | 3,581,096 | -0.74% | 16.67% | 0.29% |
| 401.bzip2 | 19.1 | 4 | 245,912 | 18.9 | 5 | 246,200 | -1.05% | 25.00% | 0.12% |
| 403.gcc | 26.0 | 79 | 10,517,504 | 26.8 | 93 | 10,546,032 | 3.08% | 17.72% | 0.27% |
| 429.mcf | 23.7 | 1 | 62,152 | 24.2 | 1 | 62,224 | 2.11% | 0.00% | 0.12% |
| 445.gobmk | 21.7 | 21 | 6,450,344 | 21.9 | 24 | 6,456,712 | 0.92% | 14.29% | 0.10% |
| 456.hmmer | 21.0 | 10 | 1,034,864 | 21.1 | 12 | 1,036,112 | 0.48% | 20.00% | 0.12% |
| 458.sjeng | 22.5 | 4 | 336,416 | 22.4 | 4 | 336,808 | -0.44% | 0.00% | 0.12% |
| 462.libquantum | 47.3 | 2 | 138,016 | 44.6 | 2 | 137,768 | -5.71% | 0.00% | -0.18% |
| 464.h264ref | 39.0 | 16 | 1,878,656 | 39.0 | 21 | 1,884,208 | 0.00% | 31.25% | 0.30% |
| 471.omnetpp | 19.7 | 30 | 4,468,208 | 18.9 | 32 | 4,469,464 | -4.06% | 6.67% | 0.03% |
| 473.astar | 18.0 | 2 | 295,152 | 18.2 | 3 | 295,824 | 1.11% | 50.00% | 0.23% |
| 483.xalancbmk | 30.0 | 199 | 39,545,312 | 29.2 | 221 | 39,533,360 | -2.67% | 11.06% | -0.03% |
| Overall / Total | 25.2 | 398 | 68,543,376 | 25.0 | 453 | 68,585,808 | -0.79% | 13.82% | 0.06% |
For -O2, enabling aggressive load PRE brings roughly the expected results: a better SPEC score at the cost of increased compile time and binary size. For -O3, however, it does not improve the SPEC score at all while increasing compile time significantly.
Given these numbers, it doesn't seem worth turning it on by default. @dberlin, just curious, do you have a rough estimate of how much compile time could be saved by implementing the rank-order-based iteration optimization?
I tried to do something similar many months ago and the compile time on some tests skyrocketed, FWIW. I don't have the tests handy, but you may want to keep that in mind.