This is an archive of the discontinued LLVM Phabricator instance.

[LV] [ScalarEvolution] Fix PR34965 - Cache pointer stride information before LV code gen
AbandonedPublic

Authored by dcaballe on Oct 26 2017, 3:02 PM.

Details

Summary

This patch fixes PR34965. I have some concerns, though, about the current solution. I'm attaching the information I posted in Bugzilla. You can skip the Investigation section to have a quick understanding of what's going on, what the current patch does, my concerns and alternative solutions.

Thanks,
Diego

Symptom

Return value of Legal->isConsecutivePtr(%11) during LV code gen has changed from 1 to 0 as a side effect of VPlan patch (revision r311077). This value 0 (meaning non-consecutive pointer) during code gen is inconsistent with value 1 (meaning stride-1 pointer) returned during legal/cost model.
For that reason, store using %11 is considered a stride-1 vector store during legal/cost model but a scatter during code gen, resulting in an assert.

Investigation

createVectorizedLoopSkeleton in LV code gen generates the vector loop, turns original scalar loop into the scalar remainder, introduces the middle and scalar.ph basic blocks after the vector loop and generates new phi nodes that feed the scalar remainder (the IR after createVectorizedLoopSkeleton is attached (after_createVectorizedLoopSkeleton.ll). After this IR modification, Legal->isConsecutivePtr() no longer returns 1 for the problematic pointer, now in the scalar remainder.

Legal->isConsecutivePtr() indirectly uses ScalarEvolution (SE) to determine the pointer stride. For this particular test, Legal->isUniform() also triggers the invalidation of some SCEVs in SE cache, including the one for %11. This bug is exposed by VPlan patch because it changes the execution order of Legal->isUniform() and Legal->isConsecutivePtr() happening before LV code gen. This is the invocation order:

  • Before VPlan patch: Legal->isUniform(…) Invalidates SCEV for %11. Legal->isConsecutivePtr(%11) Computes new SCEV for %11. It returns 1. createVectorizedLoopSkeleton() Modifies IR (middle, scalar.ph and new phis) Legal->isConsecutivePtr(%11) Uses cached SCEV for %11. It returns 1.
  • After VPlan patch: Legal->isConsecutivePtr(%11) Uses cached SCEV for %11. It returns 1. Legal->isUniform(…) Invalidates SCEV for %11. createVectorizedLoopSkeleton() Modifies IR (middle, scalar.ph and new phis) Legal->isConsecutivePtr(%11) Computes new SCEV for %11 using modified IR. It returns 0!

For this test, recomputing Legal->isConsecutivePtr(%11) after createVectorizedLoopSkeleton yields a different output because SE is not able to detect a recurrence. These are the SCEVs computed for the problematic pointer before and after createVectorizedLoopSkeleton:

  • Before (%11 in original IR): ({(8 + (4 * (zext i32 %.ph2 to i64)) + undef),+,4}<nsw><%6>). // SCEVAddRec
  • After (%21 in modified IR): ((4 * (zext i32 (2 + %18) to i64))<nuw><nsw> + undef)<nsw> // No recurrence

In order to return a SCEVAddRec for the latter, SE would have to be able to represent %bc.resume.val and %bc.resume.val1 in terms of more basic components to conclude that %bc.resume.val1 is always %bc.resume.val + 1. Unfortunately, there is no way to represent a non-inductive phi operation in SCEV at this point.

Root Cause

Current LV code gen is based on the fragile assumption that analyzing the scalar remainder loop will yield the same information as analyzing the input scalar loop. Unfortunately, this test demonstrates that this is not always the case so we cannot blindly rely on that assumption.

Tentative Fix

Pointer strides are cached in Legal during the analysis of the input scalar loop (following the same approach as for uniform and scalar values). Legal->isConsecutivePtr now uses cached pointer strides instead of blindly recomputing the same information every time. Pointer strides are recomputed when runtime checks are added as a requirement for vectorization (PSE.addPredicate in LV).

Concern: LAI-> replaceSymbolicStrideSCEV may also add predicates to PSE. This routine is called when pointer strides are collected but is also called from other analysis-like routines in LAI (hasComputableBounds, isNoWrap, getPTrStride, etc.). If a new predicate that changes pointer strides is added after collecting them, those changes won’t be taken into account when using Legal->isConsecutivePtr. This scenario could be unlikely given that LAI-> replaceSymbolicStrideSCEV is also used when collecting pointer strides but we need to be sure. I would need some feedback from someone with experience in PSE/LAI.

Alternative/Complementary Solutions

  1. If Tentative Fix is not enough to deal with unexpected PSE->AddPredicate from LAI, we could revert changes in Legal->isConsecutivePtr() and collect Legal->isConsecutivePtr() output before going into LV code gen.
  1. Extend SE so that a SCEVAddRec can be generated for this test.

Diff Detail

Event Timeline

dcaballe created this revision.Oct 26 2017, 3:02 PM
dcaballe edited the summary of this revision. (Show Details)

In reply to Hal Finkel's comments in llvm-dev:

"P.S. I agree with your conclusions based on your description of what's going on: We need to cache the results before we start transforming things. I also agree that we should potentially be concerned with maintaining the cache in the face of new predicates being added. Maybe we should instantiate the cached values as a separate setup after all predicates have been added?"

That is my concern. It's not clear to me when "all predicates have been added". replaceSymbolicStrideSCEV can be called from multiple routines, as I explained in Tentative Fix. I guess we could assume that all predicates have been added just before going into LV code gen. If that is the case, we could call 'collectPtrStride' only once, for example, in line 7668 (just before createVectorizedLoopSkeleton) and do something like the following in Legal->isConsecutivePtr():

if !PtrStrides.empty()
    // Compute isConsecutivePtr using PtrStride information
else
    // Compute isConsecutivePtr from scratch.
dcaballe updated this revision to Diff 121535.Nov 3 2017, 1:17 PM
dcaballe added a reviewer: rengolin.

I implemented a new fix where the return value of Legal->isConsecutivePtr() is collected just before the original loop is modified. This approach seems simpler and less invasive than the one described in my previous comment.

Thanks!

rengolin edited edge metadata.

I implemented a new fix where the return value of Legal->isConsecutivePtr() is collected just before the original loop is modified. This approach seems simpler and less invasive than the one described in my previous comment.

I may not have understood the problem correctly, but I think we shouldn't moved cached analysis into new representations.

To me it seems that SCEV is not able to see something that has a dependency, which VPlan has just exposed. In this case, fixing/extending SCEV is the right thing to do.

I'm adding the VPlan folks to get a wider audience.

cheers,
--renato

hsaito edited edge metadata.Nov 7 2017, 11:47 AM

I implemented a new fix where the return value of Legal->isConsecutivePtr() is collected just before the original loop is modified. This approach seems simpler and less invasive than the one described in my previous comment.

I may not have understood the problem correctly, but I think we shouldn't moved cached analysis into new representations.

To me it seems that SCEV is not able to see something that has a dependency, which VPlan has just exposed. In this case, fixing/extending SCEV is the right thing to do.

I'm adding the VPlan folks to get a wider audience.

cheers,
--renato

There are two things. If we are expecting to see the same result coming back, we shouldn't re-analyze. If we are expecting to see better result, there is a good reason to re-analyze.
Memory reference kind information (unit-stride, strided, or gather) is utilized in multiple places in vectorizer, and failing to see the same/better result can lead to vectorizer's stability
problem. So, having such data handy (i.e., cache) is a good defensive programming. As we improve vectorizer's capability, we expect memory reference kind information becomes more
and more complex (e.g., unit-stride up to certain VF, piece-wise unit-stride beyond that). Re-analyzing that will become more and more undesirable. So, please consider the tentative fix
as a baby step towards that direction.

Fixing/extending SCEV is also a good thing to do ---- and if vectorizer is re-analyzing, we should add an assertion such that the resulting information hasn't degraded from the cached info.

Another thing that caught my attention was Legal->isUniform() invalidating SCEV for %11. I don't think it should. Having said that, Simon Moll (Saarland) and we (Intel team) are looking into bringing Divergence Analysis from RV project. As such, we currently do not have a good incentive to dig deeper into Legal->isUniform() to resolve this particular issue.

Thanks,
Hideki

Thanks for the feedback.

Before posting this review, I got some early feedback regarding fixing/extending SCEV for this case and it didn't look good. Maybe, we could special-case it but a generic solution, if I understood correctly, could require representing non-inductive phis in SCEV. This is not my area of expertise so, hopefully, we'll get more feedback about it. I'll ping llvm-dev if this doesn't happen in the short term.

In any case, I agree with Hideki and Hal. We should cache any analysis information before we start transforming the code. We shouldn't expect the analysis information from the original loop to be preserved after transforming the code. VPlan will address this issue in a better way but, in the meantime, this should be a reasonable fix.

Thanks,
Diego

rengolin accepted this revision.Nov 22 2017, 10:24 AM

Hi Diego,

Ok, I agree with Hideki that caching is good as well as re-evaluating and making sure that we don't degrade. However, as you both said, detecting degradation will require SCEV to go beyond its current design and new cases in the vectoriser we don't have yet.

From that point of view, I agree with Hideki that this can be a baby-step towards a better model, so the patch would be ok for now.

Can you please add some comments to collectIsConsecutivePtr to reflect that? Otherwise, LGTM. Thanks!

This revision is now accepted and ready to land.Nov 22 2017, 10:24 AM
Ayal edited edge metadata.Nov 28 2017, 3:27 AM

Diego,

Nice catch! Continuing to use SCEV and expect consistent answers in the midst of restructuring the IR is indeed wrong; its cache is not meant to cover for cases that can no longer be analyzed as before.

We’d like to add few points one can learn from this issue, as you are now to support and develop VPlan going forward.

  • LV already records consecutivity of pointers, or rather the decision to widen the corresponding loads and stores, using CM_Widen. The only thing that seems to be missing, which required calling SCEV again when generating vector code, is to figure out if the stride is +1 or -1. This can be fixed by keeping CM_Widen for stride +1 and using, say, CM_Widen_Reverse for stride -1.
  • So, what change did VPlan’s patch do that triggered this issue? It refactored collectUniformsAndScalars() and collectInstsToScalarize() away from selectVectorizationFactor(), in order to facilitate building VPlans first and deciding on the best VF/VPlan later. It is still possible btw to move selectVectorizationFactor() back and embed these collect methods within expectedCost() as they worked before, to see their effect on the reproducer.
  • How does this refactoring cause the observed issue? Collecting uniforms fills SCEV cache entries (specifically setCostBasedWideningDecision()). Computing the cost to eventually select the best VF may invalidate certain cache entries when called the first time for VF = 1 (specifically getBackedgeTakenInfo()). That’s why first collecting uniforms for all VF’s, and then computing costs for all VF’s may leave the cache invalid, compared to interleaving the two for each VF – when more than VF = 1 is involved.
  • How come expectedCost() which eventually calls isUniform() after collectUniformsAndScalars() was invoked, causes any trouble (let alone invalidate SCEV cache entries) – haven’t we recorded its result already? Well, getInstructionCost() does consult the collected isUniformAfterVectorization() on each instruction, but later calls Legal->isUniform(Op2) on an operand – note that Op2 may reside outside the vectorized loop and thus not collected. Indeed the Op2 checked in the reproducer is outside the vectorized loop, and Legal->isUniform(Op2) calls LAI->isUniform() which in turn triggers SCEV’s isLoopInvariant(). Computing Op2’s AddRec SCEV triggers getBackedgeTakenInfo() of the enclosing loop, which invalidates the 1st-order-recurring AddRec inside the vectorized loop. LAI->isUniform() could simply return true for values outside TheLoop, and/or getInstructionCost() could check if Op2 is outside the vectorized loop or in the per-VF collected isUniformAfterVectorization() instead of calling Legal->isUniform(Op2).
  • Despite retaining the original scalar loop intact as remainder/default loop, information concerning it may be lost. Namely, values inside it which SCEV could originally analyze may now require flow-sensitive analysis (which, btw, Polyhedral Value & Memory Analysis may handle – see Johannes’s talk). This is inherent to LV, irrespective of VPlan, and may affect plans to further optimize this loop such as vectorizing it. One tentative solution to the reported case is to subtract step from the current value of an IV, instead of rewiring phi’s to access its value from the previous iteration (1st-order recurring AddRec). Worth leaving a TODO note.

Bearing in mind that all VPlan patches so far deliberately aimed to minimize functional changes, such intriguing and esoteric issues hopefully emphasize the importance of changing LV gradually and carefully, and the benefit of LLVM’s vast user-base and profound QA (e.g., PR35122 ;-).

Good Luck!
Gil and Ayal.

Thanks, Renato/Gil/Ayal for your feedback.

I think that the CM_Widen approach is a good idea but it might go beyond this bug fix because it implies extending the current design. I'll have a look to see how invasive the changes would be. If it's too invasive, I would suggest that we proceed with the current tentative fix and address the CM_Widen design extension in a separate thread.

Thanks,
Diego

I had a look at the CM_Widen approach that Ayal and Gil suggested and the fix was indeed very simple and cleaner than the one in this differential.
I created D40742 with the new fix so that you can compare and contrast with this one. IMO, D40742 seems a better option for this bug.

Ayal/Gil, thanks a lot for the feedback!

Diego.

dcaballe abandoned this revision.Dec 12 2017, 11:25 AM

Abandoning this differential. D40742 seems a better fix for this bug and was approved.
Thank you all for the feedback!