[SLP] Vectorize gather-like idioms ending at non-consecutive loads.
ClosedPublic

Authored by mssimpso on Nov 19 2015, 10:02 AM.

Details

Summary

This patch tries to vectorize gather-like expression trees ending at
non-consecutive loads, such as the one shown in the example below.

... = g[a[0] - b[0]] + g[a[1] - b[1]] + ... + g[a[n] - b[n]];

Here, the index calculations for the "g" accesses can be vectorized. The loads
of the "a" and "b" array elements and the subtractions can all be replaced by
their vector equivalents. Our bottom-up vectorizer currently misses cases like
this because the expression trees don't end in stores or reductions.
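Concretely, the idiom can be written as a self-contained function. The signature and element types below are hypothetical; only the shape of the expression comes from the summary above.

```c
/* Gather-like idiom: the loads of a[0..3] and b[0..3] and the four
 * subtractions form vectorizable index computations, while the g[...]
 * loads themselves are non-consecutive and remain scalar. */
int gather_sum4(const int *g, const int *a, const int *b) {
    return g[a[0] - b[0]] + g[a[1] - b[1]]
         + g[a[2] - b[2]] + g[a[3] - b[3]];
}
```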

It's possible to vectorize these cases in a top-down phase beginning at the
consecutive loads. However, I've chosen here to detect the specific pattern of
interest and proceed bottom-up as we do with other interesting cases. The
advantage of this approach is that it avoids the complexity, compile-time, and
phase ordering issues of a full-blown top-down pass. The disadvantage is that
it's probably not as general as it would be otherwise.

The primary changes included in the patch allow us to (1) vectorize the
gather-like pattern shown above and (2) set vector factors based on the width
of memory accesses in the expression trees. Your feedback is welcome.

Diff Detail

Repository
rL LLVM

Event Timeline

mssimpso updated this revision to Diff 40672.Nov 19 2015, 10:02 AM
mssimpso retitled this revision from to [SLP] Vectorize gather-like idioms ending at non-consecutive loads..
mssimpso updated this object.
mssimpso added reviewers: nadav, jmolloy, hfinkel.
mssimpso added subscribers: llvm-commits, mcrosier.
nadav edited edge metadata.Nov 19 2015, 10:21 AM
nadav added a subscriber: nadav.

Matthew, thanks for working on this.

In your patch you are vectorizing very short trees made up of loads, GEPs, and potentially stores. In the early days of the SLP vectorizer we disabled the vectorization of short trees for two reasons:

  1. It is very difficult to estimate the profitability of this kind of vectorization. The trees are so small that their scores are very low, often close to zero. Estimating the cost of gathers is even more difficult. I don't know if you've run performance measurements, but I'd be surprised if this kind of vectorization were profitable for general programs.
  2. The SelectionDAG ConsecutiveStore optimization already vectorizes short load-store trees.

I think that it is better to optimize this pattern by extending SelectionDAG’s ConsecutiveStore optimizations.

What do you think?

Thanks,
Nadav

Hi Nadav,

Thanks very much for the quick feedback! I'm happy to consider a different implementation. I have some questions for you first, though, if you don't mind.

I don't think the expression trees this patch affects are necessarily limited in size. The requirement is that they be seeded by GEPs (which are used by non-consecutive loads). As far as I know, the trees can be arbitrarily large. The trees in my example are small (gep, zext, sub, load), but this doesn't have to be the case in general. Are you suggesting that, in your experience, seeding SLP with these GEPs typically hits only short, unprofitable trees?
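As a hypothetical illustration of a deeper tree seeded by the same kind of GEPs (the arithmetic here is invented for the example, not taken from the patch):

```c
/* Each g[...] GEP is now fed by a multiply, a subtract, and a shift in
 * addition to the element loads; the tree rooted at the GEPs is
 * correspondingly larger but is seeded in exactly the same way. */
int deep_gather4(const int *g, const int *a, const int *b, const int *c) {
    return g[(a[0] * c[0] - b[0]) >> 1] + g[(a[1] * c[1] - b[1]) >> 1]
         + g[(a[2] * c[2] - b[2]) >> 1] + g[(a[3] * c[3] - b[3]) >> 1];
}
```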

I have measured performance on our workloads. The cost estimate for the test case I provided in the patch comes in well below zero and is profitable. Estimating the cost of gathers is difficult, but here we are only optimizing the index calculations. I will admit that the number of programs we care about is limited. It would be nice to have some additional data points. Is this something you could help with?

Finally, I'm not familiar with the portion of SelectionDAG you mention, so please excuse the rest of my questions. First, there aren't any stores in my example, so would the ConsecutiveStores optimizations you mention even apply? Second, wouldn't we still be subject to the same cost estimates (and the imprecision therein) if the implementation were moved elsewhere?

Thanks again!

hfinkel edited edge metadata.Nov 26 2015, 11:51 AM

I don't think the expression trees this patch affects are necessarily limited in size.

I agree; these patterns can occur with reasonable complexity in practice.

Have you run this on the test suite? Did you find any statistically significant compile-time impact?

lib/Transforms/Vectorize/SLPVectorizer.cpp
4294 ↗(On Diff #40672)

prpofitable (typo)

Nadav/Hal,

Thanks very much for the additional feedback. Just letting you know that I haven't forgotten about this review. I will provide some compile-time and performance results soon.

mssimpso updated this revision to Diff 41663.Dec 2 2015, 1:19 PM
mssimpso edited edge metadata.

Fixed typo in comment.

mssimpso marked an inline comment as done.Dec 2 2015, 1:20 PM

Nadav/Hal,

Below are the statistically significant compile-time differences observed for the test suite, spec2000, and spec2006 (there is only one). Results were computed from 10 samples each, using median aggregation, 95% confidence intervals, and a 0.05 significance level for the Mann–Whitney U test.

Program                                                             Base  Change      %
---------------------------------------------------------------------------------------
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael  0.73    0.89  18.88

No statistically significant performance differences were observed for spec2000 and spec2006 on a Cortex-A57-like CPU (but see the explanation below). Our infrastructure is currently unable to produce run-time data for the test suite. However, a binary diff shows that the following benchmarks were modified by the change.

Program
---------------------------------------------------------------------------------------
MultiSource/Applications/JM/lencod
MultiSource/Applications/minisat
MultiSource/Benchmarks/Bullet
MultiSource/Benchmarks/MallocBench/espresso
spec2006/h264ref
spec2006/povray

Regarding the lack of performance differences observed in spec2000 and spec2006, the current patch is somewhat limited, and I was planning a follow-on to address the issue. The indices of the GEPs that seed the expressions are forced to be i64. However, the expressions may not require that much precision, so we end up with unneeded extensions and/or narrower vectors than is optimal, and the cost model often prevents us from vectorizing. Please see the test cases in the patch for an example. We are not yet able to vectorize the second case because of this issue.
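A hypothetical variant of the example shows the precision issue; the element types here are illustrative, not taken from the actual test cases.

```c
/* The a[i] - b[i] results need far fewer than 64 bits, but because the
 * seeding GEP indices are built as i64, each result is sign-extended
 * before the g[...] access. Rewriting the index expressions in a
 * narrower type (type shrinking) would remove those extensions. */
int gather_sum4_short(const int *g, const short *a, const short *b) {
    return g[a[0] - b[0]] + g[a[1] - b[1]]
         + g[a[2] - b[2]] + g[a[3] - b[3]];
}
```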

The follow-on would essentially be to incorporate James's type-shrinking work from the loop vectorizer in order to rewrite the expressions in the narrower type if profitable. Work-in-progress has shown that type-shrinking with this patch can provide a significant performance improvement for spec2006/h264ref, at least.

Please let me know what you think of this plan and whether the optimization is better suited for SLP or SelectionDAG. Thanks again!

anemet added a subscriber: anemet.Dec 3 2015, 12:51 PM

I think SLP is a good place for this, given that the expression trees might be quite large. However, a nearly 20% compile-time slowdown (on an application which surely shows no speedup, given your list of binary diffs) is not acceptable. Can you please profile the compilation of that application and propose a patch that limits whatever bad behavior is asserting itself there?

Hal,

Thanks very much for the follow-up. Yes, I'm working on compile-time along with the type shrinking work I previously mentioned. I will post an updated revision soon.

@mssimpso, I don't think that this is the right approach. I think that this should be handled in SelectionDAG, and not in the SLP vectorizer. Your measurements show that there are no performance gains on Spec2000 (and the performance test suite?). Every new heuristic/scan that we add to the SLP vectorizer increases the compile times, and I don't think that adding code to scan loads and start a vec-tree at the load roots is a good heuristic. Many people are using LLVM as a JIT compiler and these people care deeply about compile times. I don't think that in the general case there are many opportunities for vectorization when you search from the address of loads, and I think that your measurements show that this is indeed the case. If you care about the specific pattern of load/store then you can optimize this in SelectionDAG.

@nadav, how would the SelectionDAG algorithm be different from what we do here (starting to combine from the load addresses bottom-up)? There is also loop-awareness and the ability to vectorize across multiple basic blocks that could be beneficial when doing this in LLVM IR.

Also, in order to optimize h264ref in SPEC2006int (there is a gain in SPEC!), additional type shrinking is needed, which is probably easier to do at this level.

lib/Transforms/Vectorize/SLPVectorizer.cpp
415 ↗(On Diff #41663)

I think that we typically use the term "vectorization factor"

3149–3152 ↗(On Diff #41663)

Hmm, that does not really match the comment for the function. Even if the tree ends in a store, we could have a wider load feeding it. Something is not precise enough here.

@anemet In SelectionDAG we can use the memory order chains, and don't have to scan the whole basic block for loads/stores. This is very efficient, and is already used by the load-store merger in SelectionDAG.

I don't think that with the proposed patch we are very likely to catch patterns that cross basic blocks/loops. Can you show examples of real code that we do catch?

But the SLP vectorizer already scans for stores in the entire function so piggybacking on that for loads seems to make sense.

Are you worried about scanning for loads or matching up the "related ones" into the initial bundle? I could see the latter being problematic but I am not sure how the memory order chain could help with that. At the bottom of these chains we have non-consecutive loads (a[b[0] - c[0]], a[b[1] - c[1]], ...). At the top, we have consecutive loads but those should be on independent chains.

jmolloy resigned from this revision.Dec 12 2015, 5:32 AM
jmolloy removed a reviewer: jmolloy.

Resigning from this; Hal and Nadav are reviewing it.

Discussed this more with Nadav in person. Assuming the compile-time issues can be worked out, it probably makes sense to have this in the SLP Vectorizer.

mssimpso updated this revision to Diff 43760.Dec 29 2015, 1:56 PM

Eliminated compile-time regression and addressed Adam's comments.

Hi All,

I've updated this patch to eliminate the previously observed compile-time regression in MiBench/security-rijndael. I fixed the issue by moving the contiguous access check (the most expensive check prior to vectorization) to the last step and by processing the accesses in chunks of 16. These two changes follow the existing flow we have for stores to minimize compile time.
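A hedged sketch of the slicing (the helper name and data model are hypothetical; the real code slices GEP candidate lists inside SLPVectorizer.cpp):

```c
#include <stddef.h>

enum { SLICE = 16 }; /* chunk size, mirroring the existing store path */

/* Split n candidate accesses into slices of at most SLICE elements,
 * recording each slice length in lens[]. The real code would run the
 * expensive contiguity check and the vectorization attempt once per
 * slice rather than once over all n candidates. Returns the slice
 * count. */
size_t chunk_candidates(size_t n, size_t lens[]) {
    size_t slices = 0;
    for (size_t i = 0; i < n; i += SLICE)
        lens[slices++] = (n - i < SLICE) ? n - i : SLICE;
    return slices;
}
```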

I benchmarked this change together with two others. The additional changes include the type-shrinking work I previously mentioned (D15815) and an additional cost model hook (D15816) to catch the sign extensions we introduce with the type-shrinking. Feel free to comment on those patches as well if interested. Together, the patch set causes no compile-time regressions and improves spec2006/h264ref by ~6% on our Cortex-A57-like architecture.

Thanks!

Ayal added a subscriber: Ayal.Jan 2 2016, 11:53 PM

It may be better to start from GEPs rather than loads, collecting single-varying-index GEPs, grouped by base pointer, that appear in the same basic block. That would work for scatters as well as gathers (and a mix ;-), and for other potential users.

If a pair of addresses is consecutive, it would indeed probably be futile to try to vectorize their index computation, because one can be computed from the other rather than independently. But (1) this holds more generally for any simple stride; and (2) it's enough to refrain from treating this pair; why abort all related candidates? Adding "sum += g[2*i] + g[2*i+1];" to your example throws it off needlessly.
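In code, the augmented example might look like the hypothetical function below; only the `sum += g[2*i] + g[2*i+1];` statement comes from the review comment, the rest mirrors the summary's idiom.

```c
/* g[2*i] and g[2*i+1] are a consecutive pair whose second index is
 * trivially derived from the first, so vectorizing that computation is
 * futile; the four independent gather indices below are still worth
 * vectorizing, and aborting on the pair would needlessly forgo them. */
int mixed_sum(const int *g, const int *a, const int *b, int i) {
    int sum = g[2*i] + g[2*i+1];              /* consecutive pair   */
    sum += g[a[0] - b[0]] + g[a[1] - b[1]]
         + g[a[2] - b[2]] + g[a[3] - b[3]];   /* independent gather */
    return sum;
}
```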

+some typos.

lib/Transforms/Vectorize/SLPVectorizer.cpp
416 ↗(On Diff #43760)

e[x]pression

3158–3159 ↗(On Diff #43760)

We want to base the vector element size on the width of memory operations where possible.

This deserves a comment at the outset, namely that the vector element size is derived from the largest size stored to or loaded from memory (following Adam's comment).

3363 ↗(On Diff #43760)

store >> load

3505 ↗(On Diff #43760)

typo

delena added a subscriber: delena.Jan 3 2016, 12:17 AM
mssimpso updated this revision to Diff 44007.Jan 5 2016, 8:11 AM

Addressed Ayal's comments

Thanks very much for the feedback, Ayal! Your suggestions sound reasonable to me. I've updated the patch to seed with GEPs instead of loads and to avoid aborting all candidates if some are consecutive. I also refactored other parts of the patch to match these changes.

I think that starting with GEPs is more straightforward. We were basically already doing this, but only considering the ones used by loads. And I originally chose to abort the candidates if some were consecutive to avoid potential compile-time increases. However, I've benchmarked the new patch with your suggestions and observed no regressions in the LLVM test suite or SPEC. We still bail out early if there are no viable candidates.

mssimpso marked 4 inline comments as done.Jan 5 2016, 8:12 AM
Ayal added inline comments.Jan 8 2016, 5:03 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4331 ↗(On Diff #44007)

How about simply checking if (isa<SCEVConstant>(Delta)) instead? As said, if one index is easily derived from the other, the two are poor candidates for parallel independent computation. This holds not only when they're exactly adjacent.

Would be good to add tests to make sure this works as intended.

Note that each such index may still be a good candidate, if grouped together with some other, independent indices. Discarding both from consideration may be reasonable, certainly compile-time-wise. The better guidance for such patterns is top-down, as you noted at the beginning.

4332–4334 ↗(On Diff #44007)

You could, alternatively, remove correlated GEPs from a set of all current candidates, initialized to all non-nullified GEPs, as long as this set has at least two remaining candidates.
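One way to sketch this set-pruning, with a hypothetical stand-in for SCEV: an index is modeled as a symbolic base plus a constant offset, so two indices have a constant SCEV delta exactly when they share a base. Everything below is illustrative, not the patch's actual code.

```c
#include <stddef.h>

/* Hypothetical stand-in for a GEP index: a symbolic base (an opaque
 * SCEV unknown) plus a constant offset. */
struct Candidate { int base; long offset; };

/* Mark-then-compact, so every correlation check sees the original set:
 * drop each candidate whose index differs from another's by a constant,
 * and return how many remain. The caller proceeds only if at least two
 * candidates survive. Assumes n <= 64 for the fixed-size mark array. */
size_t prune_correlated(struct Candidate *c, size_t n) {
    unsigned char correlated[64] = {0};
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = i + 1; j < n; ++j)
            if (c[i].base == c[j].base)
                correlated[i] = correlated[j] = 1;
    for (size_t i = 0; i < n; ++i)
        if (!correlated[i])
            c[kept++] = c[i];
    return kept;
}
```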

mssimpso updated this revision to Diff 44357.Jan 8 2016, 12:10 PM

Addressed Ayal's comments. Thanks!

mssimpso marked 2 inline comments as done.Jan 8 2016, 12:11 PM
Ayal added a comment.Jan 12 2016, 2:42 AM

Functionally this looks fine to me, thanks! I have no further comments. Nadav or Hal should formally LGTMize it though.

Together, the patch set causes no compile-time regressions and improves spec2006/h264ref by ~6% on our Cortex-A57-like architecture.

Cool. It would be good to record other known gains (povray?). This could be useful, e.g., if one attempts to recognize such patterns starting at consecutive loads going downwards, rather than arbitrary independent GEPs going upwards.

Thanks very much for the comments everyone! Ayal, in addition to the improvement in spec2006/h264ref (~6%), I observed small improvements in spec2006/povray (~1%) and spec2000/vortex (< 1%) with this patch and the two dependent ones.

mssimpso accepted this revision.Jan 14 2016, 8:00 AM
mssimpso added a reviewer: mssimpso.

From Nadav's LGTM.

This revision is now accepted and ready to land.Jan 14 2016, 8:00 AM
This revision was automatically updated to reflect the committed changes.