Page MenuHomePhabricator

[SCEV] rewriteLoopExitValues(): even if have hard uses, still rewrite if cheap (PR44668)
ClosedPublic

Authored by lebedev.ri on Jan 27 2020, 1:16 PM.

Details

Summary

Replacing uses of IV outside of the loop is likely generally useful,
but rewriteLoopExitValues() is cautious, and if it isn't told to always
perform the replacement, and there are hard uses of IV in loop,
it doesn't replace.

In PR44668,
that prevents -indvars from replacing uses of induction variable
after the loop, which might be one of the optimization failures
preventing that code from being vectorized.

Instead, now that the cost model is fixed, i believe we should be
a little bit more optimistic, and also perform replacement
if we believe it is within our budget.

Fixes PR44668.

Diff Detail

Event Timeline

lebedev.ri created this revision.Jan 27 2020, 1:16 PM
nikic added a subscriber: nikic.Jan 27 2020, 1:46 PM

The general structure of the patch looks entirely reasonable. I am concerned about the number of test diffs. Mostly because there might be regressions herein, and it would be really hard to spot. I'd like to suggest a strategy for making the test changes easier.

Split the patch into a couple of stages:

  1. Add all the TTI plumbing, and one simple use of the costing (e.g. trunc), but keep most of isHighCostExpansion as is. Intention is that very few tests change, and results are obviously correct.
  2. Add remaining use of TTI in isHighCost except for the min/max (since those were unconditional failures previously). Again, hopefully smallish number of diffs to review.
  3. Add the min/max logic. These are the most interesting diffs to study.
  4. Change the bailout logic in IndVarSimplify.

If you're okay with this approach, consider yourself to have a conditional LGTM for (1) provided that the test diffs are actually very limited.

If you want to make an alternate suggestion as to how to review the test diffs effectively, I'm open to it.

llvm/lib/Analysis/ScalarEvolutionExpander.cpp
2156 ↗(On Diff #240661)

Minor: We may already have a SCEVType to OpCode conversion helper. If we don't, we should probably create one and factor this out. I'm sure we have the same logic elsewhere as well.

2184 ↗(On Diff #240661)

The second check here is redundant as you've already checked that RHS is a constant.

2261 ↗(On Diff #240661)

Style wise, I'd suggest splitting AddRec into it's own case since it requires complexity the others don't.

llvm/lib/Transforms/Utils/LoopUtils.cpp
1370

Can I ask you to split this piece into it's own separate patch? The rest of NFC-ish, this really isn't.

Also, you can remove the not-constant and not-unknown cases as that's now handled by the high cost clause.

llvm/lib/Transforms/Utils/SimplifyIndVar.cpp
48 ↗(On Diff #240661)

Given we have the same magic number scatter around, I'd suggest moving this to a utility file and sharing the same constant param for all uses. We can split later if useful.

llvm/test/Transforms/IndVarSimplify/do-recompute-if-cheap.ll
39–40

This test change is against the intention of the file per the name. I think when you split as suggested above, this will need consolidated, renamed, or something.

Thank you for taking a look!

The general structure of the patch looks entirely reasonable.

Aha. That was the main question here.

I am concerned about the number of test diffs.
Mostly because there might be regressions herein, and it would be really hard to spot.

I agree that the test changes do look unreviewable.

The main question may be: what do we consider a regression?
"Previously we considered this SCEV high-cost and now we don't" won't work well,
because of the nature of the patch - we go from semi-arbitrary rules
to "consistent" application of "consistent" cost model.
Naturally, some patterns will no longer be high-cost, and some will become high-cost.

For example, i believe all of the min/max+loop latch changes fall into this category,
and i suspect that the majority of changes are of that nature.

"Cost-modelling is incorrect, SCEV expression should have cost N but we counted it as K"
This is the main problem IMO. This might need unittests for isHighCostExpansion() cost-modelling.

I'd like to suggest a strategy for making the test changes easier.
Split the patch into a couple of stages:

  1. Add all the TTI plumbing, and one simple use of the costing (e.g. trunc), but keep most of isHighCostExpansion as is. Intention is that very few tests change, and results are obviously correct.
  2. Add remaining use of TTI in isHighCost except for the min/max (since those were unconditional failures previously). Again, hopefully smallish number of diffs to review.
  3. Add the min/max logic. These are the most interesting diffs to study.
  4. Change the bailout logic in IndVarSimplify.

If you're okay with this approach, consider yourself to have a conditional LGTM for (1) provided that the test diffs are actually very limited.

If you want to make an alternate suggestion as to how to review the test diffs effectively, I'm open to it.

I'll explore ways to make diffs more reviewable.

llvm/lib/Analysis/ScalarEvolutionExpander.cpp
2156 ↗(On Diff #240661)

We don't necessarily have 1:1 mapping between SCEV type and an IR instruction.
(E.g. scAddRecExpr, and even min/max)
What should that helper be doing in that case?

2261 ↗(On Diff #240661)

Also, this overcharges scAddRecExpr if Op is actually a zero constant.

lebedev.ri retitled this revision from [SCEV] SCEVExpander::isHighCostExpansion(): perform actual cost-modelling (PR44668) to [SCEV] rewriteLoopExitValues(): even if have hard uses, still rewrite if cheap (PR44668).
lebedev.ri edited the summary of this revision. (Show Details)

Okay, here it goes, patch splitting complete :)

Ping @reames / @mkazantsev - the patches are all nicely splitted for review :)
Thanks

Ping @reames / @mkazantsev
Please do indicate if there is any way i can help move this process along.
Thanks

This revision is now accepted and ready to land.Feb 24 2020, 10:12 PM

LGTM

Thank you for the review!

Rebased, NFC.

This revision was automatically updated to reflect the committed changes.

Hi,
Did you get performance numbers for these patches? We track the performance of our (Arm) open source DSP library and the cost model fixes were generally a notable improvement, so many thanks for that! But the final patch for rewriting exit values has generally been bad, especially considering the gains from the modelling improvements. I need to look into it further, but on my current test case I'm seeing +30% increase in stack accesses with a similar decrease in performance. I'm just wondering if you observed any negative effects yourself?

Hi,

Hi. Thank you for bringing this up to my attention.

We track the performance of our (Arm) open source DSP library
and the cost model fixes were generally a notable improvement,
so many thanks for that!

Which i believe means that we stopped performing many of the expansions
we've previously done because they now no longer fit
into our arbitrarily-picked budget magic number..

But the final patch for rewriting exit values has generally been bad,
especially considering the gains from the modelling improvements.

The goal of this change in particular was to really try to get rid of
uses of IV outside of the loop, since that was one of the obstacles
preventing vectorization in one of the loops i looked at.
(read: in some cases this patch could lead to dramatic improvements due to vectorization)

In light of cost modelling issues you pointed out, this means two things:

  1. We need an undo transform :) It obviously would need to be quite late in pipeline though. https://bugs.llvm.org/show_bug.cgi?id=42965 / D12494 / D66450 (CC @reames @danilaml @srking - what would it take such a pass moving? :))
  2. After all, we might not be modelling the cost correctly, as pointed out in post-commit review notes @ D73744. Will take a look.

I need to look into it further, but on my current test case I'm seeing +30% increase
in stack accesses with a similar decrease in performance.

I'm not sure what "increase in stack accesses" mean here?
I'm reading that as: since we need some values after the loop (to re-compute the values),
we end up spilling them onto stack to free up registers for/in loop,
and then end up reloading them after the loop? That's not unexpected i guess,
although the perf drop (define "similar"?) is obviously alarming.

I'm interested whether @dmgreen, @bjope also observed something similar from these patches?

Did you get performance numbers for these patches?
I'm just wondering if you observed any negative effects yourself?

I just double-checked, and i'm not seeing anything like that.
But cost modelling being cost modelling i'm not very surprised.
Let's see what happens if i address post-commit-review comments..

On LLVM's LNT, the closest runs are http://lnt.llvm.org/db_default/v4/nts/132695?compare_to=132694
It says there are some improvements and one regression,
but from http://lnt.llvm.org/db_default/v4/nts/graph?plot.0=1364.1604571.3&highlight_run=132695
that regression looks like noise to me.

@samparker, maybe @dmgreen @bjope: posted D75908. if possible, please consider seeing if/how it alters benchmark results and maybe reply on that patch

@lebedev.ri I've been successfully using a pass based on https://reviews.llvm.org/D12494 in a downstream port (although I'm not sure if it's enough to fix all ARM's regressions). I was planning to submit it for the review after some touch ups (i.e. it only uses legacy pm) but haven't found the time for that yet. I also need to evaluate whether it is still needed after D75908 (that might take some time).

@lebedev.ri I've been successfully using a pass based on https://reviews.llvm.org/D12494 in a downstream port (although I'm not sure if it's enough to fix all ARM's regressions). I was planning to submit it for the review after some touch ups (i.e. it only uses legacy pm) but haven't found the time for that yet.

I also need to evaluate whether it is still needed after D75908 (that might take some time).

I don't see why it wouldn't be needed after that, i would even think it's needed more than ever now.

The goal of this change in particular was to really try to get rid of
uses of IV outside of the loop, since that was one of the obstacles
preventing vectorization in one of the loops i looked at.

Maybe we should teach the vectorizer to rewrite exit values? If we know a loop will be vectorized after rewriting exit values, there's a stronger incentive to rewrite them. Not sure how hard that would be? Off the top of my head, it should be straightforward to fit into the vectorizer code that classifies instructions.

The goal of this change in particular was to really try to get rid of
uses of IV outside of the loop, since that was one of the obstacles
preventing vectorization in one of the loops i looked at.

Maybe we should teach the vectorizer to rewrite exit values? If we know a loop will be vectorized after rewriting exit values, there's a stronger incentive to rewrite them. Not sure how hard that would be? Off the top of my head, it should be straightforward to fit into the vectorizer code that classifies instructions.

That might be worth exploring.