Page MenuHomePhabricator

[SLP] Add support for throttling.
Needs ReviewPublic

Authored by dtemirbulatov on Feb 5 2019, 12:48 PM.

Details

Summary

Here is support for SLP throttling, when cost is high to vectorize the whole tree we can reduce the number of proposed vectorizable elements and partially vectorize the tree. https://www.youtube.com/watch?v=xxtA2XPmIug&t=5s

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
dtemirbulatov added inline comments.Feb 10 2019, 4:14 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2467

ok, WIP.

2469

well, those lines were incorrect.

4981

ok, Let see what we will get with the proposed change on the benchmarks side.

ABataev added inline comments.Feb 10 2019, 4:55 PM
test/Transforms/SLPVectorizer/X86/bswap.ll
25 ↗(On Diff #186169)

Still look like too optimistic.

dtemirbulatov marked 3 inline comments as done.Feb 10 2019, 5:49 PM
dtemirbulatov added inline comments.
test/Transforms/SLPVectorizer/X86/bswap.ll
25 ↗(On Diff #186169)

hmm, it might be missed spill-cost recalculation.

RKSimon added inline comments.Feb 11 2019, 7:36 AM
test/Transforms/SLPVectorizer/X86/sitofp.ll
5 ↗(On Diff #186169)

Why did you drop these tests?

Fix several issues with cost estimation, add support for tree hight reduction even for profitable trees.

dtemirbulatov marked 3 inline comments as done.Feb 18 2019, 11:20 PM
dtemirbulatov added inline comments.
test/Transforms/SLPVectorizer/X86/sitofp.ll
5 ↗(On Diff #186169)

no conflicts anymore.

dtemirbulatov planned changes to this revision.Feb 19 2019, 3:17 AM
dtemirbulatov marked an inline comment as done.

I found an incorrect way of summing costs at line 2660, it should be backward loop here.

dtemirbulatov requested review of this revision.Feb 19 2019, 4:16 AM

oh, Rechecked the algorithm at 2660 at it looks correct. sorry.

ABataev added inline comments.Feb 19 2019, 7:25 AM
test/Transforms/SLPVectorizer/X86/unreachable.ll
24 ↗(On Diff #187314)

Still the problem with the cost model! I did not look at your patch closely but seems to me you're throwing away the cost of leaf gather nodes from the vectorization tree. Also, seems to me, you don't take into account that for the reduced trees you also need to build the leaf gather nodes and take their cost into account.

dtemirbulatov marked an inline comment as done.Feb 19 2019, 10:17 AM
dtemirbulatov added inline comments.
test/Transforms/SLPVectorizer/X86/unreachable.ll
24 ↗(On Diff #187314)

yes, Thanks, I am thinking that too.

looks like, I addressed the issue of leaf gather nodes.

ABataev added inline comments.Feb 22 2019, 6:29 AM
test/Transforms/SLPVectorizer/X86/extractcost.ll
10 ↗(On Diff #187912)

Are you sure that your vectorization result is better than it was before?

test/Transforms/SLPVectorizer/X86/resched.ll
77 ↗(On Diff #187912)

Same question: are you sure it is better?

Fix cost calculation, fix the run-time error by adding external users for canceled elements in reduceTree(), add testcase.

xbolva00 added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
4398

Operations?

Since cost computing was changed, can you rerun spec benchmarks?

ABataev added inline comments.Feb 26 2019, 6:37 AM
test/Transforms/SLPVectorizer/X86/intrinsic.ll
150 ↗(On Diff #188348)

This does not look better

test/Transforms/SLPVectorizer/X86/long_chains.ll
12 ↗(On Diff #188348)

Again, are you sure this one is better than it was before?

test/Transforms/SLPVectorizer/X86/slp-throttle.ll
1

Do you really need such a big test? Could you simplify it?

Since cost computing was changed, can you rerun spec benchmarks?

sorry, I found my previous result not quite reliable, I have changed my methodology, I will rerun spec benchmarks.

dtemirbulatov updated this revision to Diff 188873.EditedMar 1 2019, 3:30 AM

Change the decision of reducing tree if only the cost of vectorizing is too high and reduce the tree to the highest element of the tree that might be minimally profitable.

dtemirbulatov marked 7 inline comments as done.Mar 1 2019, 3:33 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
4398

Fixed, Thanks.

test/Transforms/SLPVectorizer/X86/intrinsic.ll
150 ↗(On Diff #188348)

Done, Thanks.

test/Transforms/SLPVectorizer/X86/long_chains.ll
12 ↗(On Diff #188348)

Done, Thanks,

test/Transforms/SLPVectorizer/X86/slp-throttle.ll
1

Done, Thanks.

dtemirbulatov marked 3 inline comments as done.Mar 1 2019, 6:37 AM

Add cost calculation for in vectorizable tree operation to be extracted for the canceled tree elements use. Fix performance regression on Phoronix's c-ray-1.2.0 test-suite by avoiding to vectorize just one seed instruction. Spec 2006 numbers are unchanged with several instances of new vectorizations but not hot enough to change the numbers. Also, I noticed a good vectorizations example with Phoronix's encode-mp3-1.7.3 of two FFT kernels with ~8% faster for fft_short kernel and ~3% faster for fft_long on Intel(R) Core(TM) i7-6700HQ.

vporpo added inline comments.Mar 11 2019, 4:22 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
725

Same here, do we really need the parent?

726

It is not straight-forward what "expanding" refers to.
Do you mean "scalarize" everything above this node?
If so, a variable name like ScalarizeAtCost would be better?
(Also typo: hight->height)

Please rephrase this comment. It should be clear that it refers to the reduction of the tree with throttling.

730

I don't think we should add an extra Value *Parent field here.
There are many reasons for this:

  1. The TreeEntry represents a group of scalars, so the parent should also be a TreeEntry node, not a Value*.
  2. The "Parent" is actually the user of the TreeEntry, and we already have this information in UserTreeIndices. It is a vector of the indexes of the parent nodes (there may be more than one "parent" nodes, as the graph is a DAG, not a tree).
2741

I don't think a map is needed.
You are mapping an instruction to a vector of the TreeEntry indexes of its operands that need to gather, but I think there are better ways to get this data.
For example, you could replace the code of line 2749 with something along these lines:

for (Value *Op : V->operands())
   if (TreeEntry *OpTE = getTreeEntry(Op))
     GatherCost += OpTE->Cost;

If it is really needed here, it needs a better name, maybe GatheringOperandsMap ?

2743–2744

Shouldn't we guard this whole loop with if (ReduceTree) ?

I find the code in this loop very confusing and hard to follow.
Shouldn't you be doing a reverse-post-order traversal of the VectorizableTree accumulating the cost of the operands of each node and scalarizing all the nodes from that point on if the accumulated cost > threshold ?

2751

Avoid two lookups in GatherMap.find() and GatherMap[V], use an iterator.

Rebase the change, Fix Vasileios remarks.

dtemirbulatov marked 10 inline comments as done.Mar 13 2019, 9:43 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
725

Fixed

726

Done.

2741

Still, I think we need the map. BTW, getTreeEntry() finds only vectorizable elements.

2743–2744

It is the main tree cost calculation, I isolated ReduceTree code even further.

2751

Fixed.

dtemirbulatov marked 4 inline comments as done.Mar 13 2019, 9:51 AM

Delete unnecessary variables from getTreeCost().

vporpo added inline comments.Mar 18 2019, 8:39 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
536

Could you use a cl::opt flag instead of this ReduceTree function argument?

2456

I am nitpicking here, but "reduce" is kind of overloaded in the SLP pass. We also have tryToReduce(), which is for reduction.
Please try to look for a more unique name (maybe shrinkTree(), or cutTree(), or throttleTree() or something similar)

2477

Could you check this in a better way, without using an additional 'NonGather' vector? Maybe introduce a small helper function isJustTineyTree(I) that checks a few entries below 'I'.

2720

if (!ScalarsToVec.count(EU.Scalar))

2723

As far as I understand, you are calculating the additional cost for the EU.Scalars that are no longer vectorized (because of reduceTree()).
So you are getting all its in-tree uses and you are calculating the extract operation cost from for the data transfer from EU.Scalar->vector use.

  1. Isn't this an 'insert' operation and not an 'extract' ?
  2. Isn't this just the gather cost that has already been considered in line 2756 ?
  3. Even if you have multiple in-tree vectorized uses, we should only count the gather cost once, as the vector can then be reused for the rest of the uses with no extra cost.
2747

I think that introducing the 'ScalarsToVec' to hold the vectorizable scalars "below the cut" could be done better. In particular I don't like that it is local to getTreeCost() and gets carried around in many functions, and that it is some extra state that is not very visible.

Isn't it better to add a bool 'IsThrottled' flag in TreeEntry? This way:

  1. you don't need to carry anything around and,
  2. one can easily tell if an entry gets vectorized or not, at a given time, just by looking at the Entry

Also you can simply check if a V is vectorized by E = getTreeEntry(V); if (E && !E->isThrottled).

2756

If ReduceTree is false, is the cost calculation correct (no extract cost?) ?

In any case if ReduceTree == false, I think it is better if we compute the cost as in the original code, without having to compute extractCost() etc. for every single 'cut' of the tree.

Also consider adding some of the costs as member variables in TreeEntry: like ExtractCost, SpillCost, GatherCost etc.

3780

if (!RemovedOperations.empty())

4922

Isn't Chain containing only stores? Why do we need the if here ?
Also, why wasn't R.getVectorElementSize(Chain[0]) OK?

4923

From your comment of line 4920 I guess you are also checking for vectorized 'SI' instructions.
How can this be caused by reduceTree ?

If this is indeed required here (please let me know why) and try to use a function for it and reuse it for both 'Chain' and 'Operands' (line 4920) if possible.

4926

Why is this code needed? Can reduceTree remove all the instructions from the SLP tree? This does not sound right.

4983

Could you add a cl::opt flag to guard this (and the whole reduceTree optimization).

dtemirbulatov marked 3 inline comments as done.Mar 19 2019, 7:00 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
536

ok, true by default.

2723

Consider this example :
We considering to vectorize full tree height:
....
a0 = x*a;
a1 = x*b;
a2 = x+c;
a3 = x+d;
....
Then, later we decided to remove "a2", "a3" out of proposed to vectorized operations. And "x" variable is not available in a scalar form and in order to obtain "x" value we should extract "x" to be used in a scalar form. And while extracting "x" we should calculate the cost for such operation. It might be not profitable to vectorize such tree cut if we have to produce such an extract operation.
And for 2756 I have another example:
a0 = x*a;
a1 = x*b;
a2 = a0+c;
a3 = a1+d;
suppose that we would like to start vectorizaton from "a2", "a3" and "a0" and "a1" should be in scalar form we have calculated "c" and "d" as a gather operations, but since we are interrupting full-tree we have to consider calculating gather cost for "a0", "a1" because full intruppted full vectorization chain and "a0", "a1" are scalars now.

2756

If ReduceTree is false, is the cost calculation correct (no extract cost?) ?

yes, I think in this case we calculate the extract cost, for the uncut version with "I == (E - 1)" for full tree hight.

dtemirbulatov marked an inline comment as done.Mar 19 2019, 7:28 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
2723

in the first example, "x" is only available in a vectorized form only, internal vectorized tree use. So changing the example:
x = ....
a0 = x*a;
a1 = x*b;
a2 = x+c;
a3 = x+d;

dtemirbulatov marked an inline comment as done.Mar 19 2019, 9:23 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
2723

oh, I think, I incorrectly calculate gather cost for the second example at 2756, instead of calculating gather costs for a1, a0, I did the calculation for a2 and a3.

Fixed cost estimation for canceled elements to be inserted to interact with vectorized.
Replaced main throttling loop in reverse-post-order traversal as it was suggested in the previous remarks.
Making final decision on throttle inside getTreeCost() we don't need to hold cost estimations in TreeEntry.
Avoid vectorizing any flow instructions during partial vectorization.
Added cl::opt flag for Throtelling, it is now on by default to show the impact on tests, but during commit, I could make this option off by default to estimate the whole impact before enabling everywhere.
Here are the spec cup 2006 numbers before and after:

before:
400.perlbench 9770 304 32.2 *
401.bzip2 9650 473 20.4 *
403.gcc 8050 250 32.1 *
429.mcf 9120 326 28.0 *
445.gobmk 10490 465 22.5 *
456.hmmer 9330 344 27.1 *
458.sjeng 12100 462 26.2 *
462.libquantum 20720 255 81.2 *
464.h264ref 22130 504 43.9 *
471.omnetpp 6250 329 19.0 *
473.astar 7020 416 16.9 *
483.xalancbmk 6900 198 34.8 *
Est. SPECint(R)_base2006 29.1
Est. SPECint2006

after:
400.perlbench 9770 301 32.4 *
401.bzip2 9650 471 20.5 *
403.gcc 8050 241 33.4 *
429.mcf 9120 329 27.7 *
445.gobmk 10490 465 22.6 *
456.hmmer 9330 344 27.1 *
458.sjeng 12100 461 26.3 *
462.libquantum 20720 267 77.6 *
464.h264ref 22130 504 43.9 *
471.omnetpp 6250 323 19.4 *
473.astar 7020 416 16.9 *
483.xalancbmk 6900 197 35.0 *
Est. SPECint(R)_base2006 29.2
Est. SPECint2006

with flags applied "-march=broadwell -m64 -O3" on Intel(R) Core(TM) i7-6700HQ

ABataev added inline comments.Mar 21 2019, 12:42 PM
test/Transforms/SLPVectorizer/X86/slp-throttle.ll
1

Commit the test as NFC before to see the changes.

ABataev added inline comments.Mar 21 2019, 12:53 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2485

No need for E here

2756
  1. Unformatted.
  2. Use --I;
  3. Are you sure that the exit condition is correct? Maybe >=, since you start from size()-1?
4414

Replace auto here with the real type.

4732

Restore

4922

StoreInst *->auto *

dtemirbulatov marked an inline comment as done.Mar 22 2019, 6:47 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
2756

Are you sure that the exit condition is correct? Maybe >=, since you start from size()-1?

yes, We don't one to consider one entry vectorizon here.

Fixed remarks, removed Cost state from TreeEntry, turned cl::opt throttling flag false by default.

Did you check the 462.libquantum 's regression?

ABataev added inline comments.Mar 22 2019, 12:24 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
841

Not formatted

2458

Hmm, will it work with the graph? Seems to me, it will work absolutely correctly only with the list and might not be quite correct even for tree. VectorizableTree is actually a linearized graph and the index is not connected anyhow with the tree/graph depth/height.
You need to nalyze each particular subgraph/subtree independently, and thus, cannot rely on the single index.

Changed getTreeCost loop to condition "--I;".

Did you check the 462.libquantum 's regression?

hmm, 462.libquantum is not affected by this change. I increased the number to runs for the test and here is results:

before:

400.perlbench 9770 302 32.3 *
401.bzip2 9650 470 20.6 *
403.gcc 8050 244 33.0 *
429.mcf 9120 318 28.7 *
445.gobmk 10490 463 22.7 *
456.hmmer 9330 345 27.1 *
458.sjeng 12100 460 26.3 *
462.libquantum 20720 250 83.0 *
464.h264ref 22130 503 44.0 *
471.omnetpp 6250 317 19.7 *
473.astar 7020 413 17.0 *
483.xalancbmk 6900 191 36.2 *
Est. SPECint(R)_base2006 29.6
Est. SPECint2006

after:
400.perlbench 9770 301 32.4 *
401.bzip2 9650 469 20.6 *
403.gcc 8050 235 34.2 *
429.mcf 9120 315 29.0 *
445.gobmk 10490 463 22.7 *
456.hmmer 9330 344 27.1 *
458.sjeng 12100 460 26.3 *
462.libquantum 20720 252 82.3 *
464.h264ref 22130 502 44.1 *
471.omnetpp 6250 313 20.0 *
473.astar 7020 412 17.0 *
483.xalancbmk 6900 190 36.4 *
Est. SPECint(R)_base2006 29.8
Est. SPECint2006

Here are tests that were affected by this change: 400.perlbench, 403.gcc, h264ref, 483.xalancbmk.

dtemirbulatov marked an inline comment as done.Mar 25 2019, 6:43 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
2458

Looks like it works correct to me. It does not need any particular index, it just relies on "Entry->NeedToGather" variable at line 1313. And it works consequently, not at once in the end. We found chain of operations to vectorize, then we call getTreeCost() and inside at 2805 invoke ViewGraph and, at that time, NeedToGather is already set by cutTree for our chain members.

ABataev added inline comments.Mar 25 2019, 6:49 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2458

I don't think it works absolutely correctly. You could get better results if you would implement correct tree/graph throttling. YOu cannot rely on index of the TreeEntry in your case, you need follow the tree structure.