This is an archive of the discontinued LLVM Phabricator instance.

Improve LoopVectorizer's estimation of scalar loops by predicting LSR behaviour
Needs Review · Public

Authored by jonpa on Apr 18 2017, 7:38 AM.

Details

Summary

I have experimented with collecting a set of instructions that LSR is likely to eliminate in the scalar version of the loop, for example an add of a PHI and a constant.

At this point, the patch is showing a general improvement of at least 5% in the scalar estimates. collectLikelyLSRed() can likely be improved further.
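
For readers of the archive, here is a minimal sketch, under my own assumptions, of what a collectLikelyLSRed()-style walk matching "an add of a PHI and a constant" could look like in LLVM C++; it is illustrative only and not the contents of the actual patch.

// Sketch only: collect (add phi, constant) instructions in the loop as
// candidates that LSR would likely fold into addressing or rewrite away
// in the scalar loop. The real patch may match more or different patterns.
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/Casting.h"
using namespace llvm;

static void collectLikelyLSRed(Loop *L,
                               SmallPtrSetImpl<Instruction *> &Likely) {
  for (BasicBlock *BB : L->blocks())
    for (Instruction &I : *BB) {
      auto *BO = dyn_cast<BinaryOperator>(&I);
      if (!BO || BO->getOpcode() != Instruction::Add)
        continue;
      // Match (add phi, const) in either operand order, with the PHI
      // defined inside the loop (an induction-style update).
      Value *Op0 = BO->getOperand(0), *Op1 = BO->getOperand(1);
      auto *Phi = dyn_cast<PHINode>(Op0);
      Value *Other = Op1;
      if (!Phi) {
        Phi = dyn_cast<PHINode>(Op1);
        Other = Op0;
      }
      if (Phi && L->contains(Phi->getParent()) && isa<ConstantInt>(Other))
        Likely.insert(BO);
    }
}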

For now, I mainly wonder what the response to this is: is it a good or a bad idea?

How should we weigh a general improvement in the estimates against avoiding worst-case degradations?

Diff Detail

Event Timeline

jonpa created this revision. Apr 18 2017, 7:38 AM
rengolin edited edge metadata. May 2 2017, 12:36 PM

Hi Jonas,

As much as this helps, I think duplicating the LSR logic here, even if a very limited subset of it, is problematic.

If assumptions change with time, you may start having increasingly different results and not detect them until the more pathological cases start to explode.

In cases like this, in the past, we discussed splitting the analysis pass from the execution pass on the targeted transformation. IIRC, this is how the loop analysis passes were born.

Maybe, if you move the LSR analysis code to the generic loop pass side, and both LSR and LV use the same analysis, you can work around the issue without duplicating anything.

Makes sense?

cheers,
--renato

jonpa added a comment. May 3 2017, 11:16 PM

> Hi Jonas,
>
> As much as this helps, I think duplicating the LSR logic here, even if a very limited subset of it, is problematic.
>
> If assumptions change with time, you may start having increasingly different results and not detect them until the more pathological cases start to explode.
>
> In cases like this, in the past, we discussed splitting the analysis pass from the execution pass on the targeted transformation. IIRC, this is how the loop analysis passes were born.
>
> Maybe, if you move the LSR analysis code to the generic loop pass side, and both LSR and LV use the same analysis, you can work around the issue without duplicating anything.
>
> Makes sense?
>
> cheers,
> --renato

Hi Renato,

thanks for your opinion!

Ideally, LSR should handle vectorized addresses as well, and if it could, there would not be this difference against the scalar loops. It currently can't, however, and that's why this patch improves the vectorizer's decisions by giving the scalar loop a cheaper cost in these cases.

Is LSR going to do this anytime soon? Or is it an option to run LSR (also?) before the Loop vectorizer? I have seen many cases where the scalar loop is much smaller just because the vectorizer generates a vector add, vector shift and what not of the address, while LSR removes those instructions entirely in the scalar loop. In this sense, isn't the loop vectorizer run too early?

I am not sure it's worth the effort to factor out the LSR analysis just to learn how much better it does on scalar loops... It would be better to fix the vectorized loops instead. So if this patch is used, it should hopefully just be a temporary measure...

sanjoy added a subscriber: sanjoy. May 7 2017, 2:25 PM

> In this sense, isn't the loop vectorizer run too early?

That's a good question, one that may be answered by adding it just before the loop vectoriser and seeing how the compile-time increase compares to the run-time performance benefits on standard benchmarks.

I'm adding some more people to chime in, just in case they have seen this before and have a plan we're not aware of. :)

cheers,
--renato

jonpa added a comment. May 31 2017, 1:21 AM

>> In this sense, isn't the loop vectorizer run too early?
>
> That's a good question, one that may be answered by adding it just before the loop vectoriser and seeing how the compile-time increase compares to the run-time performance benefits on standard benchmarks.

I tried this - adding LSR just before loop vectorizer, and found that it did indeed affect a few benchmarks noticeably, with mixed results.
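
(A minimal sketch of what such an experiment could look like with the legacy pass manager in use at the time; this is an assumption for illustration, not the exact change that was tried.)

// Assumed context: inside PassManagerBuilder::populateModulePassManager()
// (legacy pass manager); createLoopStrengthReducePass() is declared in
// llvm/Transforms/Scalar.h and createLoopVectorizePass() in
// llvm/Transforms/Vectorize.h.
MPM.add(createLoopStrengthReducePass()); // experimental early LSR run
MPM.add(createLoopVectorizePass());      // existing vectorizer run follows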

> I tried this - adding LSR just before loop vectorizer, and found that it did indeed affect a few benchmarks noticeably, with mixed results.

For both compile time and run time performance?

jonpa added a comment. May 31 2017, 2:20 AM

>> I tried this - adding LSR just before loop vectorizer, and found that it did indeed affect a few benchmarks noticeably, with mixed results.
>
> For both compile time and run time performance?

I have only looked at run-time performance. It has been a while since I tried to evaluate the compile-time impact of a single pass. How would you do this?

> I have only looked at run-time performance. It has been a while since I tried to evaluate the compile-time impact of a single pass. How would you do this?

Well, if the whole compile time doesn't increase noticeably on a good number of programs and benchmarks (I'm expecting it won't), then it should be mostly fine.

One way to track compile time is to use LNT. Put up a server, run vanilla, submit the results, run with the change, submit the results, compare. (http://llvm.org/docs/lnt/quickstart.html)

If the results are mostly fine, then it'd just be a matter of understanding why it was where it was and making sure everyone agrees with the move (or addition).

I'm adding a few people who have recently worked on LSR, as well as some vectoriser people.

I have seen a few comments in the LV that expect LSR to run after it, so it might not be completely advisable to *move* it, but running it both before and after might have a noticeable impact.

cheers,
--renato

sanjoy resigned from this revision. Jan 29 2022, 5:34 PM