Index: docs/VectorizationPlan.rst
===================================================================
--- /dev/null
+++ docs/VectorizationPlan.rst
@@ -0,0 +1,574 @@
++++++
+VPlan
++++++
+
+Goal of initial VPlan patch
++++++++++++++++++++++++++++
+The design and implementation of VPlan follow our RFC [10]_ and presentation
+[11]_. The initial patch is designed to:
+
+- be a *lightweight* NFC patch;
+- show key aspects of VPlan's Hierarchical CFG concept;
+- demonstrate how VPlan can
+
+  * capture *all* current vectorization decisions: which instructions are to
+
+    + be vectorized "on their own", or
+    + be part of an interleave group, or
+    + be scalarized, and optionally have scalar instances moved down to other
+      basic blocks and under a condition; and
+    + be packed or unpacked (at the definition rather than at its uses) to
+      provide both scalarized and vectorized forms; and
+
+  * represent all control-flow *within the loop body* of the vectorized code
+    version.
+
+- be a step towards
+
+  * aligning the Cost step with the Transformation step,
+  * representing the entire code being transformed,
+  * adding optimizations:
+
+    + optimizing conditional scalarization further,
+    + retaining uniform control-flow,
+    + vectorizing outer loops,
+    + and more.
+
+Out of scope for the initial patch:
+
+- changing how a loop is checked if it can be vectorized - "Legal";
+- changing how a loop is checked if it should be vectorized - "Cost".
+
+
+==================
+Vectorization Plan
+==================
+
+.. contents::
+   :local:
+
+Overview
+========
+The Vectorization Plan is an explicit recipe for describing a vectorization
+candidate. It serves both for estimating the cost reliably and for performing
+the translation, and facilitates dealing with multiple vectorization
+candidates.
+
+The overall structure consists of:
+
+1. One LoopVectorizationPlanner for each attempt to vectorize a loop or a loop
+   nest.
+
+2. A LoopVectorizationPlanner can construct, optimize and discard one or more
+   VPlans, providing different ways to vectorize the loop or the loop nest.
+
+3. Once the best VPlan is determined, including the best vectorization factor
+   and unroll factor, this VPlan drives the vector code generation using a
+   VPTransformState object.
+
+4. Each VPlan represents the loop or the loop nest using a hierarchical CFG.
+
+5. At the bottom level of the hierarchical CFG are VPBasicBlocks.
+
+6. Each VPBasicBlock consists of one or more VPRecipes to generate Instructions
+   for it.
+
+Motivation
+----------
+The vectorization transformation can be rather complicated, involving several
+potential alternatives, especially for outer loops [1]_ but also possibly for
+innermost loops. These alternatives may have significant performance impact,
+both positive and negative. A cost model is therefore employed to identify the
+best alternative, including the alternative of avoiding any transformation
+altogether.
+
+The process of vectorization traditionally involves three major steps: Legal,
+Cost, and Transform. This is the general case in LLVM's LoopVectorizer:
+
+1. Legal Step: check if the loop can be legally vectorized; encode constraints
+   and artifacts if so.
+2. Cost Step: compute the relative cost of vectorizing it along possible
+   vectorization and unroll factors (VF, UF).
+3. Transform Step: vectorize the loop according to the best VF and UF.
+
+This design, in which all steps work directly on the original LLVM-IR, has
+some implications:
+
+1. The Cost Step tries to predict what the vectorized loop will look like and
+   how much it will cost, independently of what the Transform Step will
+   eventually do. It is hard to keep the two in sync.
+2. The Cost Step essentially considers a single vectorization candidate. Any
+   alternatives are immediately evaluated and resolved.
+3. The Legal Step does more than check for vectorizability; e.g., it records
+   auxiliary artifacts such as collectLoopUniforms() and InterleaveInfo.
+4. The Transform Step first populates the single basic block of the vectorized
+   loop and later revisits scalarized instructions to predicate them one by
+   one, as needed.
+
+The Vectorization Plan is designed to explicitly model a vectorization
+candidate to overcome the above constraints, which is especially important for
+the vectorization of outer loops. This affects the overall process by
+essentially splitting the Transform Step into a Plan Step and a Code-Gen Step:
+
+1. Legal Step: check if the loop can be legally vectorized; encode constraints
+   and artifacts if so. Initiate the Vectorization Plan showing how the loop
+   can be vectorized only after passing Legal, to save redundant construction.
+2. Plan Step:
+
+   a. Build initial Vectorization Plans following the constraints and
+      decisions taken by Legal.
+   b. Explore ways to optimize the Vectorization Plan, complying with all
+      legal constraints, possibly constructing several plans following
+      tentative vectorization decisions.
+3. Cost Step: compute the relative cost of each plan. This step can be applied
+   repeatedly by Plan Step 2.b.
+4. Code-Gen Step: materialize the best plan. Note that only this step modifies
+   the IR, as in the current Loop Vectorizer.
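+
+To make the effect of this split concrete, here is a condensed sketch of the
+flow from the caller's perspective, assuming the interfaces shown in the
+integration section below; diagnostics, the interleave-only path and other
+details are omitted for brevity:
+
+.. code-block:: c++
+
+  // Condensed, illustrative flow only; see the integration with
+  // LoopVectorizePass::processLoop() below for the complete code.
+  LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, &CM);
+
+  // Plan Step + Cost Step: build and optimize VPlans for the candidate VF's,
+  // then pick the most profitable vectorization factor.
+  LoopVectorizationCostModel::VectorizationFactor VF =
+      LVP.plan(OptForSize, UserVF, MaxVF);
+
+  if (VF.Width > 1) {
+    // Code-Gen Step: only now is the IR modified, by executing the best plan.
+    InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC,
+                           &LVL, &CM);
+    LVP.executeBestPlan(LB);
+  }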
+
+The Cost Step can also be split into an Early-Pruning Step (or Steps) and a
+"Cost-Gen" Step, where the former applies quick yet inaccurate estimates to
+prune obviously-unpromising candidates, and the latter applies more accurate
+estimates based on a full Plan.
+
+One can compare with LLVM's existing SLP vectorizer, where TSLP [3]_ adds
+Step 2.b.
+
+As the scope of vectorization grows from innermost to outer loops, so do the
+uncertainty and complexity of each step. One way to mitigate the shortcomings
+of the Legal and Cost steps is to rely on programmers to indicate which loops
+can and/or should be vectorized. This is implicit for certain loops in
+data-parallel languages such as OpenCL [4]_, [5]_ and explicit in others such
+as OpenMP [6]_. This design to extend the Loop Vectorizer to outer loops
+supports and raises the importance of explicit vectorization beyond the
+current capabilities of Clang and LLVM: namely, moving from forcing the
+vectorization of innermost loops according to a prescribed width and/or
+interleaving count, to supporting OpenMP's "#pragma omp simd" construct and
+its associated clauses, including vectorizing across function boundaries [2]_.
+
+References
+----------
+.. [1] "Outer-loop vectorization: revisited for short SIMD architectures",
+   Dorit Nuzman and Ayal Zaks, PACT 2008.
+
+.. [2] "Proposal for function vectorization and loop vectorization with
+   function calls", Xinmin Tian, [`cfe-dev `_], March 2, 2016.
+   See also `review `_.
+
+.. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios
+   Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015.
+
+.. [4] "Intel OpenCL SDK Vectorizer", Nadav Rotem, LLVM Developers' Meeting
+   2011.
+
+.. [5] "Automatic SIMD Vectorization of SSA-based Control Flow Graphs", Ralf
+   Karrenberg, Springer 2015. 
See also "Improving Performance of OpenCL on + CPUs", LLVM Developers' Meeting 2012. + +.. [6] "Compiling C/C++ SIMD Extensions for Function and Loop Vectorization on + Multicore-SIMD Processors", Xinmin Tian and Hideki Saito et al., + IPDPSW 2012. + +.. [7] "Exploiting mixed SIMD parallelism by reducing data reorganization + overhead", Hao Zhou and Jingling Xue, CGO 2016. + +.. [8] "Register Allocation via Hierarchical Graph Coloring", David Callahan and + Brian Koblenz, PLDI 1991 + +.. [9] "Structural analysis: A new approach to flow analysis in optimizing + compilers", M. Sharir, Journal of Computer Languages, Jan. 1980 + +.. [10] "RFC: Extending LV to vectorize outerloops", [`llvm-dev + `_], + September 21, 2016. + +.. [11] "Extending LoopVectorizer towards supporting OpenMP4.5 SIMD and outer + loop auto-vectorization", Hideki Saito, `LLVM Developers' Meeting 2016 + `_, November 3, 2016. + +Examples +-------- +An example with a single predicated scalarized instruction - integer division: + +.. code-block:: c + + void foo(int* a, int b, int* c) { + #pragma simd + for (int i = 0; i < 10000; ++i) + if (a[i] > 777) + a[i] = b - (c[i] + a[i] / b); + } + + +IR Dump Before Loop Vectorization: + +.. code-block:: LLVM + :emphasize-lines: 6,11 + + for.body: ; preds = %for.inc, %entry + %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.inc ] + %arrayidx = getelementptr inbounds i32, i32* %a, i64 %indvars.iv + %0 = load i32, i32* %arrayidx, align 4, !tbaa !1 + %cmp1 = icmp sgt i32 %0, 777 + br i1 %cmp1, label %if.then, label %for.inc + + if.then: ; preds = %for.body + %arrayidx3 = getelementptr inbounds i32, i32* %c, i64 %indvars.iv + %1 = load i32, i32* %arrayidx3, align 4, !tbaa !1 + %div = sdiv i32 %0, %b + %add.neg = sub i32 %b, %1 + %sub = sub i32 %add.neg, %div + store i32 %sub, i32* %arrayidx, align 4, !tbaa !1 + br label %for.inc + + for.inc: ; preds = %for.body, %if.then + %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 + %exitcond = icmp eq i64 %indvars.iv.next, 10000 + br i1 %exitcond, label %for.cond.cleanup, label %for.body + +The VPlan that is built initially: + +.. image:: VPlanPrinter.png + +Design Guidelines +================= +1. Analysis-like: building and manipulating the Vectorization Plan must not + modify the IR. In particular, if a VPlan is discarded + compilation should proceed as if the VPlan had not been built. + +2. Support all current capabilities: the Vectorization Plan must be capable of + representing the exact functionality of LLVM's existing Loop Vectorizer. + In particular, the transition can start with an NFC patch. + In particular, VPlan must support efficient selection of VF and/or UF. + +3. Align Cost & CodeGen: the Vectorization Plan must serve both the cost + model and the code generation phases, where the cost estimation must + evaluate the to-be-generated code reliably. + +4. Support vectorizing additional constructs: + + a. vectorization of Outer-loops. + In particular, VPlan must be able to represent the control-flow of a + vectorized loop which may include multiple basic-blocks and nested loops. + b. SLP vectorization. + c. Combinations of the above, including nested vectorization: vectorizing + both an inner loop and an outerloop at the same time (each with its own + VF and UF), mixed vectorization: vectorizing a loop and SLP patterns + inside [7]_, (re)vectorizing vector code. + +5. Support multiple candidates efficiently: + In particular, similar candidates related to a range of possible VF's and + UF's must be represented efficiently. 
+   In particular, potential versionings must also be supported efficiently.
+
+6. Compact: the Vectorization Plan must be efficient and provide as compact a
+   representation as possible, in particular where the transformation is
+   straightforward and where the plan is to reuse existing IR (e.g., for
+   leftover iterations).
+
+VPlan Classes: Definitions
+==========================
+
+:VPlan:
+  A recipe for generating a vectorized version from a given IR code.
+  Takes a "scenario-based approach" to vectorization planning.
+  The given IR code is required to be SESE, mainly to simplify dominance
+  information. The vectorized version is represented using a Hierarchical CFG.
+
+:Hierarchical CFG:
+  A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's.
+  The Hierarchical CFG data structure we use is similar to the Tile Tree [8]_,
+  where cross-Tile edges are lifted to connect Tiles instead of the original
+  basic-blocks as in Sharir [9]_, promoting the Tile encapsulation. We use the
+  terms Region and Block rather than Tile [8]_ to avoid confusion with loop
+  tiling.
+
+:VPBasicBlock:
+  Serves as the leaf of the Hierarchical CFG. Represents a sequence of
+  instructions that will appear consecutively in a basic block of the
+  vectorized version; the instructions of such a basic block may originate
+  from one or more VPBasicBlocks.
+  The VPBasicBlock takes care of the control-flow relations with other
+  VPBasicBlock's and Regions, and holds a sequence of zero or more VPRecipe's
+  that take care of representing the instructions.
+  A VPBasicBlock that holds no VPRecipe's represents no instructions; this
+  may happen, e.g., to support disjoint Regions and to ensure Regions have a
+  single exit, possibly an empty one.
+
+:VPRecipeBase:
+  A base class describing one or more instructions that will appear
+  consecutively in the vectorized version, based on Instructions from the
+  given IR. These Instructions are referred to as the "Ingredients" of the
+  Recipe. A Recipe specifies how its Ingredients are to be vectorized: e.g.,
+  copy or reuse them as uniform, scalarize or vectorize them according to an
+  enclosing loop dimension, or vectorize them according to an internal SLP
+  dimension.
+
+  **Design principle:** in order to reason about how to vectorize an
+  Instruction or how much it would cost, one has to consult the VPRecipe
+  holding it.
+
+  **Design principle:** when a sequence of instructions conveys additional
+  information as a group, we use a VPRecipe to encapsulate them and attach
+  this information to the VPRecipe. For instance, a VPRecipe can model an
+  interleave group of loads or stores with additional information for
+  calculating their cost and performing code-gen, as a group.
+
+  **Design principle:** where possible, a VPRecipe should reuse the existing
+  container of its Ingredients. A new container should be opened on-demand,
+  e.g., to facilitate changing the order of Instructions between the original
+  and vectorized versions.
+
+:VPOneByOneRecipeBase:
+  Represents recipes which transform each Instruction in their Ingredients
+  independently, in order. The Ingredients are a sub-sequence of original
+  Instructions, which reside in the same IR BasicBlock and in the same order.
+  The Ingredients are accessed by a pointer to the first and last Instruction
+  in their original IR basic block. Serves as a base class for the concrete
+  sub-classes VPScalarizeOneByOneRecipe and VPVectorizeOneByOneRecipe.
+
+:VPScalarizeOneByOneRecipe:
+  A concrete VPRecipe which scalarizes each ingredient, generating either
+  instances of lane 0 for a uniform instruction, or instances for a range of
+  lanes otherwise.
+
+:VPVectorizeOneByOneRecipe:
+  A concrete VPRecipe which vectorizes each ingredient.
+
+:VPInterleaveRecipe:
+  A concrete VPRecipe which transforms an interleave group of loads or stores
+  into one wide load/store and shuffles.
+
+:VPConditionBitRecipeBase:
+  A base class for VPRecipes which provide the condition bit feeding a
+  conditional branch. Such cases correspond to scalarized or uniform branches.
+
+:VPExtractMaskBitRecipe:
+  A concrete VPRecipe which represents the extraction of a bit from a mask,
+  needed when scalarizing a conditional branch.
+  Such branches are needed to guard scalarized and predicated instructions.
+
+:VPMergeScalarizeBranchRecipe:
+  A concrete VPRecipe which represents the Phi's needed when control converges
+  back from a scalarized branch.
+  Such Phi's are needed to merge live-out values that are set under a
+  scalarized branch. They can be scalar or vector, depending on the user of
+  the live-out value.
+
+:VPWidenIntInductionRecipe:
+  A concrete VPRecipe which widens integer inductions, producing their vector
+  values and computing the necessary values for producing their scalar values.
+  The scalar values themselves are generated, possibly elsewhere, by the
+  complementing VPBuildScalarStepsRecipe.
+
+:VPBuildScalarStepsRecipe:
+  A concrete VPRecipe complementing the handling of integer induction
+  variables, responsible for generating the scalar values used by the IV's
+  scalar users.
+
+:VPRegionBlock:
+  A collection of VPBasicBlocks and VPRegionBlocks which form a
+  single-entry-single-exit subgraph of the CFG in the vectorized code.
+
+  **Design principle:** when some additional information relates to an SESE
+  set of VPBlocks, we use a VPRegionBlock to wrap them and attach the
+  information to it. For example, a VPRegionBlock can be used to indicate that
+  a scalarized SESE region is to be replicated. It is also designed to serve
+  predicating divergent branches while retaining uniform branches as much as
+  possible/desirable, and to represent inner loops.
+
+:VPBlockBase:
+  The building block of the Hierarchical CFG. A VPBlockBase can be either a
+  VPBasicBlock or a VPRegionBlock.
+  A VPBlockBase may indicate that its contents are to be replicated several
+  times. This is designed to support scalarizing VPBlockBases, which generate
+  VF replicas of their instructions that in turn remain scalar, and to do so
+  using a single VPlan for multiple candidate VF's.
+
+:VPTransformState:
+  Stores information used for code generation, passed from the Planner to its
+  selected VPlan for execution, and used to pass additional information down
+  from VPBlocks to the VPRecipes.
+
+:VPlanUtils:
+  Contains a collection of methods for the construction and modification of
+  abstract VPlans.
+
+:VPlanUtilsLoopVectorizer:
+  Derived from VPlanUtils, providing additional methods for the construction
+  and modification of VPlans specific to the Loop Vectorizer.
+
+:LoopVectorizationPlanner:
+  The object in charge of creating and manipulating VPlans for a given IR code.
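+
+To illustrate how these classes are intended to interact during code
+generation, the following is a simplified, illustrative sketch. Only
+VPRecipeBase::vectorize(VPTransformState &) is taken from the patch; the
+traversal helpers (blocksOf, recipesOf) and the overall shape are hypothetical
+and omit details such as visit order and replication counts:
+
+.. code-block:: c++
+
+  // Illustrative only: walk the Hierarchical CFG of the selected VPlan and
+  // let each recipe generate its Instructions into the vectorized loop.
+  void executeBlock(VPBlockBase *Block, VPTransformState &State) {
+    if (auto *Region = dyn_cast<VPRegionBlock>(Block)) {
+      // A Region may be marked for replication, e.g., to scalarize a
+      // predicated SESE sub-graph; its contents are visited accordingly.
+      for (VPBlockBase *Inner : blocksOf(Region)) // hypothetical traversal
+        executeBlock(Inner, State);
+      return;
+    }
+    // A VPBasicBlock holds zero or more recipes; each recipe knows how its
+    // Ingredients are scalarized, vectorized or widened.
+    for (VPRecipeBase &Recipe : recipesOf(cast<VPBasicBlock>(Block)))
+      Recipe.vectorize(State); // generates the vectorized Instructions
+  }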
+
+VPlan Classes: Diagram
+======================
+
+The classes of VPlan with main fields and methods; sub-classes of VPRecipeBase
+are shown in a separate figure:
+
+.. image:: VPlanUML.png
+
+
+The class hierarchy of VPlan's VPRecipeBase class:
+
+.. 
image:: VPlanRecipesUML.png + + +Integration with LoopVectorize.cpp/processLoop() +================================================ + +Here's the integration within LoopVectorize.cpp's existing flow, in +LoopVectorizePass::processLoop(Loop \*L): + +1. Plan only after passing all early bail-outs: + + a. including those that take place after Legal, which is kept intact; + b. including those that use the Cost Model - refactor it slightly to expose + its MaxVF upper bound and canVectorize() early exit: + +.. code-block:: c++ + + // Check if the target supports potentially unsafe FP vectorization. + // FIXME: Add a check for the type of safety issue (denormal, signaling) + // for the target we're vectorizing for, to make sure none of the + // additional fp-math flags can help. + if (Hints.isPotentiallyUnsafe() && + TTI->isFPVectorizationPotentiallyUnsafe()) { + DEBUG(dbgs() << "LV: Potentially unsafe FP op prevents vectorization.\n"); + ORE->emit( + createMissedAnalysis(Hints.vectorizeAnalysisPassName(), "UnsafeFP", L) + << "loop not vectorized due to unsafe FP support."); + emitMissedWarning(F, L, Hints, ORE); + return false; + } + + if (!CM.canVectorize(OptForSize)) + return false; + + // Early prune excessive VF's + unsigned MaxVF = CM.computeMaxVectorizationFactor(OptForSize); + + // If OptForSize, MaxVF is the only VF we consider. Abort if it needs a tail. + if (OptForSize && CM.requiresTail(MaxVF)) + return false; + +2. Plan: + + a. build VPlans for relevant VF's and optimize them, + b. compute best cost using Cost Model as before, + c. compute best interleave-count using Cost Model as before. Above two + steps are refactored into LVP.plan() (see below): + +.. code-block:: c++ + + // Use the planner. + LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, &CM); + + // Get user vectorization factor. + unsigned UserVF = Hints.getWidth(); + + // Select the vectorization factor. + LoopVectorizationCostModel::VectorizationFactor VF = + LVP.plan(OptForSize, UserVF, MaxVF); + bool VectorizeLoop = (VF.Width > 1); + + std::pair VecDiagMsg, IntDiagMsg; + + if (!UserVF && !VectorizeLoop) { + DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n"); + VecDiagMsg = std::make_pair( + "VectorizationNotBeneficial", + "the cost-model indicates that vectorization is not beneficial"); + } + + // Select the interleave count. + unsigned IC = CM.selectInterleaveCount(OptForSize, VF.Width, VF.Cost); + + // Get user interleave count. + unsigned UserIC = Hints.getInterleave(); + +3. Transform: + + a. invoke an Unroller to unroll the loop (as before), or + b. invoke LVP.executeBestPlan() to vectorize the loop: + +.. code-block:: c++ + + if (!VectorizeLoop) { + assert(IC > 1 && "interleave count should not be 1 or 0"); + // If we decided that it is not legal to vectorize the loop, then + // interleave it. + InnerLoopUnroller Unroller(L, PSE, LI, DT, TLI, TTI, AC, ORE, IC, &LVL, + &CM); + Unroller.vectorize(); + + ORE->emit(OptimizationRemark(LV_NAME, "Interleaved", L->getStartLoc(), + L->getHeader()) + << "interleaved loop (interleaved count: " + << NV("InterleaveCount", IC) << ")"); + } else { + + // If we decided that it is \* legal \* to vectorize the loop, then do it. + InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC, + &LVL, &CM); + + LVP.executeBestPlan(LB); + + ++LoopsVectorized; + + // Add metadata to disable runtime unrolling a scalar loop when there are + // no runtime checks about strides and memory. A scalar loop that is + // rarely used is not worth unrolling. 
+ if (!LB.areSafetyChecksAdded()) + AddRuntimeUnrollDisableMetaData(L); + + // Report the vectorization decision. + ORE->emit(OptimizationRemark(LV_NAME, "Vectorized", L->getStartLoc(), + L->getHeader()) + << "vectorized loop (vectorization width: " + << NV("VectorizationFactor", VF.Width) + << ", interleaved count: " << NV("InterleaveCount", IC) << ")"); + } + + // Mark the loop as already vectorized to avoid vectorizing again. + Hints.setAlreadyVectorized(); + +4. Plan, refactored into LVP.plan(): + + a. build VPlans for relevant VF's and optimize them, + b. compute best cost using Cost Model as before: + +.. code-block:: c++ + + LoopVectorizationCostModel::VectorizationFactor + LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF, + unsigned MaxVF) { + if (UserVF) { + DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n"); + if (UserVF == 1) + return {UserVF, 0}; + assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two"); + // Collect the instructions (and their associated costs) that will be more + // profitable to scalarize. + CM->collectInstsToScalarize(UserVF); + buildInitialVPlans(UserVF, UserVF); + DEBUG(printCurrentPlans("Initial VPlans", dbgs())); + optimizePredicatedInstructions(); + DEBUG(printCurrentPlans("After optimize predicated instructions",dbgs())); + return {UserVF, 0}; + } + if (MaxVF == 1) + return {1, 0}; + + assert(MaxVF > 1 && "MaxVF is zero."); + // Collect the instructions (and their associated costs) that will be more + // profitable to scalarize. + for (unsigned i = 2; i <= MaxVF; i = i+i) + CM->collectInstsToScalarize(i); + buildInitialVPlans(2, MaxVF); + DEBUG(printCurrentPlans("Initial VPlans", dbgs())); + optimizePredicatedInstructions(); + DEBUG(printCurrentPlans("After optimize predicated instructions", dbgs())); + // Select the optimal vectorization factor. + return CM->selectVectorizationFactor(OptForSize, MaxVF); + } Index: docs/Vectorizers.rst =================================================================== --- docs/Vectorizers.rst +++ docs/Vectorizers.rst @@ -380,6 +380,18 @@ .. image:: linpack-pc.png +Internals +--------- + +.. toctree:: + :hidden: + + VectorizationPlan + +:doc:`VectorizationPlan` + The loop vectorizer is based on an abstract representation called Vectorization Plan. + This document describes its philosophy and design. + .. _slp-vectorizer: The SLP Vectorizer Index: lib/Transforms/Vectorize/CMakeLists.txt =================================================================== --- lib/Transforms/Vectorize/CMakeLists.txt +++ lib/Transforms/Vectorize/CMakeLists.txt @@ -4,6 +4,7 @@ LoopVectorize.cpp SLPVectorizer.cpp Vectorize.cpp + VPlan.cpp ADDITIONAL_HEADER_DIRS ${LLVM_MAIN_INCLUDE_DIR}/llvm/Transforms Index: lib/Transforms/Vectorize/LoopVectorize.cpp =================================================================== --- lib/Transforms/Vectorize/LoopVectorize.cpp +++ lib/Transforms/Vectorize/LoopVectorize.cpp @@ -47,6 +47,7 @@ //===----------------------------------------------------------------------===// #include "llvm/Transforms/Vectorize/LoopVectorize.h" +#include "VPlan.h" #include "llvm/ADT/DenseMap.h" #include "llvm/ADT/Hashing.h" #include "llvm/ADT/MapVector.h" @@ -97,6 +98,7 @@ #include "llvm/Transforms/Utils/LoopVersioning.h" #include "llvm/Transforms/Vectorize.h" #include +#include #include #include @@ -399,6 +401,9 @@ /// LoopVectorizationLegality class to provide information about the induction /// and reduction variables that were found to a given vectorization factor. 
class InnerLoopVectorizer { + friend class LoopVectorizationPlanner; + friend class llvm::VPlan; + public: InnerLoopVectorizer(Loop *OrigLoop, PredicatedScalarEvolution &PSE, LoopInfo *LI, DominatorTree *DT, @@ -445,7 +450,8 @@ // When we if-convert we need to create edge masks. We have to cache values // so that we don't end up with exponential recursion/IR. typedef DenseMap, VectorParts> - EdgeMaskCache; + EdgeMaskCacheTy; + typedef DenseMap BlockMaskCacheTy; /// Create an empty loop, based on the loop ranges of the old loop. void createEmptyLoop(); @@ -461,43 +467,44 @@ /// Copy and widen the instructions from the old loop. virtual void vectorizeLoop(); + /// Handle all cross-iteration phis in the header. + void fixCrossIterationPHIs(); + /// Fix a first-order recurrence. This is the second phase of vectorizing /// this phi node. void fixFirstOrderRecurrence(PHINode *Phi); + /// Fix a reduction cross-iteration phi. This is the second phase of + /// vectorizing this phi node. + void fixReduction(PHINode *Phi); + /// \brief The Loop exit block may have single value PHI nodes where the /// incoming value is 'Undef'. While vectorizing we only handled real values /// that were defined inside the loop. Here we fix the 'undef case'. /// See PR14725. void fixLCSSAPHIs(); - /// Iteratively sink the scalarized operands of a predicated instruction into - /// the block that was created for it. - void sinkScalarOperands(Instruction *PredInst); - - /// Predicate conditional instructions that require predication on their - /// respective conditions. - void predicateInstructions(); - /// Collect the instructions from the original loop that would be trivially /// dead in the vectorized loop if generated. - void collectTriviallyDeadInstructions(); + static void collectTriviallyDeadInstructions( + Loop *OrigLoop, LoopVectorizationLegality *Legal, + SmallPtrSetImpl &DeadInstructions); /// Shrinks vector element sizes to the smallest bitwidth they can be legally /// represented as. void truncateToMinimalBitwidths(); +public: /// A helper function that computes the predicate of the block BB, assuming /// that the header block of the loop is set to True. It returns the *entry* /// mask for the block BB. VectorParts createBlockInMask(BasicBlock *BB); + +protected: /// A helper function that computes the predicate of the edge between SRC /// and DST. VectorParts createEdgeMask(BasicBlock *Src, BasicBlock *Dst); - /// A helper function to vectorize a single BB within the innermost loop. - void vectorizeBlockInLoop(BasicBlock *BB, PhiVector *PV); - /// Vectorize a single PHINode in a block. This method handles the induction /// variable canonicalization. It supports both VF = 1 for unrolled loops and /// arbitrary length vectors. @@ -508,13 +515,69 @@ /// and update the analysis passes. void updateAnalysis(); - /// This instruction is un-vectorizable. Implement it as a sequence - /// of scalars. If \p IfPredicateInstr is true we need to 'hide' each - /// scalarized instruction behind an if block predicated on the control - /// dependence of the instruction. - virtual void scalarizeInstruction(Instruction *Instr, - bool IfPredicateInstr = false); +public: + /// A helper function to vectorize a single Instruction in the innermost loop. + virtual void vectorizeInstruction(Instruction &I); + + /// A helper function to scalarize a single Instruction in the innermost loop. 
+ /// Generates a sequence of scalar instances for each lane between \p MinLane + /// and \p MaxLane, times each part between \p MinPart and \p MaxPart, + /// inclusive.. + void scalarizeInstruction(Instruction *Instr, unsigned MinPart, + unsigned MaxPart, unsigned MinLane, + unsigned MaxLane); + + /// Return a value in the new loop corresponding to \p V from the original + /// loop at unroll index \p Part and vector index \p Lane. If the value has + /// been vectorized but not scalarized, the necessary extractelement + /// instruction will be generated. + Value *getScalarValue(Value *V, unsigned Part, unsigned Lane); + + /// Set a value in the new loop corresponding to \p V from the original + /// loop at unroll index \p Part and vector index \p Lane. The scalar parts + /// for this value must already be initialized. + void setScalarValue(Value *V, unsigned Part, unsigned Lane, Value *Scalar) { + assert(VectorLoopValueMap.hasScalar(V) && + "Cannot set an uninitialized scalar value"); + VectorLoopValueMap.ScalarMapStorage[V][Part][Lane] = Scalar; + } + + /// Return a value in the new loop corresponding to \p V from the original + /// loop at unroll index \p Part. If there isn't one, return a null pointer. + /// Note that the value returned may also be a null pointer if that specific + /// part has not been generated yet. + Value *getVectorValue(Value *V, unsigned Part) { + if (!VectorLoopValueMap.hasVector(V)) + return nullptr; + return VectorLoopValueMap.VectorMapStorage[V][Part]; + } + + /// Set a value in the new loop corresponding to \p V from the original + /// loop at unroll index \p Part. The vector parts for this value must already + /// be initialized. + void setVectorValue(Value *V, unsigned Part, Value *Vector) { + assert(VectorLoopValueMap.hasVector(V) && + "Cannot set an uninitialized vector value"); + VectorLoopValueMap.VectorMapStorage[V][Part] = Vector; + } + + /// Construct the vector value of a scalarized value \p V one lane at a time. + /// This method is for predicated instructions where we'd like the + /// insert-element instructions to reside in the predicated block to have + /// them execute only if needed. + void constructVectorValue(Value *V, unsigned Part, unsigned Lane); + /// Return a constant reference to the VectorParts corresponding to \p V from + /// the original loop. If the value has already been vectorized, the + /// corresponding vector entry in VectorLoopValueMap is returned. If, + /// however, the value has a scalar entry in VectorLoopValueMap, we construct + /// new vector values on-demand by inserting the scalar values into vectors + /// with an insertelement sequence. If the value has been neither vectorized + /// nor scalarized, it must be loop invariant, so we simply broadcast the + /// value into vectors. + const VectorParts &getVectorValue(Value *V); + +protected: /// Vectorize Load and Store instructions, virtual void vectorizeMemoryInstruction(Instruction *Instr); @@ -532,13 +595,6 @@ Instruction::BinaryOps Opcode = Instruction::BinaryOpsEnd); - /// Compute scalar induction steps. \p ScalarIV is the scalar induction - /// variable on which to base the steps, \p Step is the size of the step, and - /// \p EntryVal is the value from the original loop that maps to the steps. - /// Note that \p EntryVal doesn't have to be an induction variable (e.g., it - /// can be a truncate instruction). - void buildScalarSteps(Value *ScalarIV, Value *Step, Value *EntryVal); - /// Create a vector induction phi node based on an existing scalar one. 
This /// currently only works for integer induction variables with a constant /// step. \p EntryVal is the value from the original loop that maps to the @@ -548,10 +604,6 @@ void createVectorIntInductionPHI(const InductionDescriptor &II, Instruction *EntryVal); - /// Widen an integer induction variable \p IV. If \p Trunc is provided, the - /// induction variable will first be truncated to the corresponding type. - void widenIntInduction(PHINode *IV, TruncInst *Trunc = nullptr); - /// Returns true if an instruction \p I should be scalarized instead of /// vectorized for the chosen vectorization factor. bool shouldScalarizeInstruction(Instruction *I) const; @@ -559,25 +611,25 @@ /// Returns true if we should generate a scalar version of \p IV. bool needsScalarInduction(Instruction *IV) const; - /// Return a constant reference to the VectorParts corresponding to \p V from - /// the original loop. If the value has already been vectorized, the - /// corresponding vector entry in VectorLoopValueMap is returned. If, - /// however, the value has a scalar entry in VectorLoopValueMap, we construct - /// new vector values on-demand by inserting the scalar values into vectors - /// with an insertelement sequence. If the value has been neither vectorized - /// nor scalarized, it must be loop invariant, so we simply broadcast the - /// value into vectors. - const VectorParts &getVectorValue(Value *V); - - /// Return a value in the new loop corresponding to \p V from the original - /// loop at unroll index \p Part and vector index \p Lane. If the value has - /// been vectorized but not scalarized, the necessary extractelement - /// instruction will be generated. - Value *getScalarValue(Value *V, unsigned Part, unsigned Lane); - +public: /// Try to vectorize the interleaved access group that \p Instr belongs to. void vectorizeInterleaveGroup(Instruction *Instr); + /// Widen an integer induction variable \p IV. If \p Trunc is provided, the + /// induction variable will first be truncated to the corresponding type. + std::pair widenIntInduction(bool NeedsScalarIV, PHINode *IV, + TruncInst *Trunc = nullptr); + + /// Compute scalar induction steps. \p ScalarIV is the scalar induction + /// variable on which to base the steps, \p Step is the size of the step, and + /// \p EntryVal is the value from the original loop that maps to the steps. + /// Note that \p EntryVal doesn't have to be an induction variable (e.g., it + /// can be a truncate instruction). + void buildScalarSteps(Value *ScalarIV, Value *Step, Value *EntryVal, + unsigned MinPart, unsigned MaxPart, unsigned MinLane, + unsigned MaxLane); + +protected: /// Generate a shuffle sequence that will reverse the vector Vec. virtual Value *reverseVector(Value *Vec); @@ -694,6 +746,16 @@ return ScalarMapStorage[Key]; } + ScalarParts &getOrCreateScalar(Value *Key, unsigned Lanes) { + if (!hasScalar(Key)) { + ScalarParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) + Entry[Part].resize(Lanes); + ScalarMapStorage[Key] = Entry; + } + return ScalarMapStorage[Key]; + } + /// \return A reference to the vector map entry corresponding to \p Key. /// The key should already be in the map. This function should only be used /// when it's necessary to update values that have already been vectorized. 
@@ -712,6 +774,15 @@ friend const VectorParts &InnerLoopVectorizer::getVectorValue(Value *V); friend Value *InnerLoopVectorizer::getScalarValue(Value *V, unsigned Part, unsigned Lane); + friend Value *InnerLoopVectorizer::getVectorValue(Value *V, unsigned Part); + friend void InnerLoopVectorizer::setScalarValue(Value *V, unsigned Part, + unsigned Lane, + Value *Scalar); + friend void InnerLoopVectorizer::setVectorValue(Value *V, unsigned Part, + Value *Vector); + friend void InnerLoopVectorizer::constructVectorValue(Value *V, + unsigned Part, + unsigned Lane); private: /// The unroll factor. Each entry in the vector map contains UF vector @@ -765,9 +836,11 @@ /// many different vector instructions. unsigned UF; +public: /// The builder that we use IRBuilder<> Builder; +protected: // --- Vectorization state --- /// The vector-loop preheader. @@ -796,10 +869,8 @@ /// vectorized and scalarized. ValueMap VectorLoopValueMap; - /// Store instructions that should be predicated, as a pair - /// - SmallVector, 4> PredicatedInstructions; - EdgeMaskCache MaskCache; + EdgeMaskCacheTy EdgeMaskCache; + BlockMaskCacheTy BlockMaskCache; /// Trip count of the original loop. Value *TripCount; /// Trip count of the widened loop (TripCount - TripCount % (VF*UF)) @@ -814,14 +885,6 @@ // Record whether runtime checks are added. bool AddedSafetyChecks; - // Holds instructions from the original loop whose counterparts in the - // vectorized loop would be trivially dead if generated. For example, - // original induction update instructions can become dead because we - // separately emit induction "steps" when generating code for the new loop. - // Similarly, we create a new latch condition when setting up the structure - // of the new loop, so the old one can become dead. - SmallPtrSet DeadInstructions; - // Holds the end values for each induction variable. We save the end values // so we can later fix-up the external users of the induction variables. DenseMap IVEndValues; @@ -840,14 +903,36 @@ UnrollFactor, LVL, CM) {} private: - void scalarizeInstruction(Instruction *Instr, - bool IfPredicateInstr = false) override; + void vectorizeInstruction(Instruction &I) override; + void scalarizeInstruction(Instruction *Instr, bool IfPredicateInstr = false); void vectorizeMemoryInstruction(Instruction *Instr) override; Value *getBroadcastInstrs(Value *V) override; Value *getStepVector(Value *Val, int StartIdx, Value *Step, Instruction::BinaryOps Opcode = Instruction::BinaryOpsEnd) override; Value *reverseVector(Value *Vec) override; + + void vectorizeLoop() override; + + /// Iteratively sink the scalarized operands of a predicated instruction into + /// the block that was created for it. + void sinkScalarOperands(Instruction *PredInst); + + /// Predicate conditional instructions that require predication on their + /// respective conditions. + void predicateInstructions(); + + /// Store instructions that should be predicated, as a pair + /// + SmallVector, 4> PredicatedInstructions; + + // Holds instructions from the original loop whose counterparts in the + // vectorized loop would be trivially dead if generated. For example, + // original induction update instructions can become dead because we + // separately emit induction "steps" when generating code for the new loop. + // Similarly, we create a new latch condition when setting up the structure + // of the new loop, so the old one can become dead. 
+ SmallPtrSet DeadInstructions; }; /// \brief Look for a meaningful debug location on the instruction or it's @@ -1873,11 +1958,20 @@ unsigned Width; // Vector width with best cost unsigned Cost; // Cost of the loop with that width }; + + bool canVectorize(bool OptForSize); + + bool requiresTail(unsigned MaxVectorSize); + + /// \return An upper bound for the vectorization factor. + unsigned computeMaxVectorizationFactor(bool OptForSize); + /// \return The most profitable vectorization factor and the cost of that VF. /// This method checks every power of two up to VF. If UserVF is not ZERO /// then this vectorization factor will be selected if vectorization is /// possible. - VectorizationFactor selectVectorizationFactor(bool OptForSize); + VectorizationFactor selectVectorizationFactor(bool OptForSize, + unsigned MaxVF); /// \return The size (in bits) of the smallest and widest types in the code /// that needs to be vectorized. We ignore values that remain scalar such as @@ -1928,6 +2022,9 @@ /// \returns True if it is more profitable to scalarize instruction \p I for /// vectorization factor \p VF. bool isProfitableToScalarize(Instruction *I, unsigned VF) const { + // Unroller also calls this method, but does not collectInstsToScalarize. + if (VF == 1) + return true; auto Scalars = InstsToScalarize.find(VF); assert(Scalars != InstsToScalarize.end() && "VF not yet analyzed for scalarization profitability"); @@ -2139,10 +2236,12 @@ int computePredInstDiscount(Instruction *PredInst, ScalarCostsTy &ScalarCosts, unsigned VF); +public: /// Collects the instructions to scalarize for each predicated instruction in /// the loop. void collectInstsToScalarize(unsigned VF); +private: /// Collect the instructions that are uniform after vectorization. An /// instruction is uniform if we represent it with a single scalar value in /// the vectorized loop corresponding to each vector iteration. Examples of @@ -2161,6 +2260,7 @@ /// iteration of the original scalar loop. void collectLoopScalars(unsigned VF); +public: /// Collect Uniform and Scalar values for the given \p VF. /// The sets depend on CM decision for Load/Store instructions /// that may be vectorized as interleave, gather-scatter or scalarized. @@ -2173,6 +2273,7 @@ collectLoopScalars(VF); } +private: /// Keeps cost model vectorization decision and cost for instructions. /// Right now it is used for memory instructions only. typedef DenseMap, @@ -2210,4538 +2311,4976 @@ SmallPtrSet VecValuesToIgnore; }; -/// \brief This holds vectorization requirements that must be verified late in -/// the process. The requirements are set by legalize and costmodel. Once -/// vectorization has been determined to be possible and profitable the -/// requirements can be verified by looking for metadata or compiler options. -/// For example, some loops require FP commutativity which is only allowed if -/// vectorization is explicitly specified or if the fast-math compiler option -/// has been provided. -/// Late evaluation of these requirements allows helpful diagnostics to be -/// composed that tells the user what need to be done to vectorize the loop. For -/// example, by specifying #pragma clang loop vectorize or -ffast-math. Late -/// evaluation should be used only when diagnostics can generated that can be -/// followed by a non-expert user. -class LoopVectorizationRequirements { +/// LoopVectorizationPlanner - builds and optimizes the Vectorization Plans +/// which record the decisions how to vectorize the given loop. 
+/// In particular, represent the control-flow of the vectorized version, +/// the replication of instructions that are to be scalarized, and interleave +/// access groups. +class LoopVectorizationPlanner { public: - LoopVectorizationRequirements(OptimizationRemarkEmitter &ORE) - : NumRuntimePointerChecks(0), UnsafeAlgebraInst(nullptr), ORE(ORE) {} + LoopVectorizationPlanner(Loop *L, LoopInfo *LI, const TargetLibraryInfo *TLI, + const TargetTransformInfo *TTI, + LoopVectorizationLegality *Legal, + LoopVectorizationCostModel *CM) + : TheLoop(L), LI(LI), TLI(TLI), TTI(TTI), Legal(Legal), CM(CM), + ILV(nullptr), BestVF(0), BestUF(0) {} - void addUnsafeAlgebraInst(Instruction *I) { - // First unsafe algebra instruction. - if (!UnsafeAlgebraInst) - UnsafeAlgebraInst = I; - } + ~LoopVectorizationPlanner() {} - void addRuntimePointerChecks(unsigned Num) { NumRuntimePointerChecks = Num; } + /// Plan how to best vectorize, return the best VF and its cost. + LoopVectorizationCostModel::VectorizationFactor + plan(bool OptForSize, unsigned UserVF, unsigned MaxVF); - bool doesNotMeet(Function *F, Loop *L, const LoopVectorizeHints &Hints) { - const char *PassName = Hints.vectorizeAnalysisPassName(); - bool Failed = false; - if (UnsafeAlgebraInst && !Hints.allowReordering()) { - ORE.emit( - OptimizationRemarkAnalysisFPCommute(PassName, "CantReorderFPOps", - UnsafeAlgebraInst->getDebugLoc(), - UnsafeAlgebraInst->getParent()) - << "loop not vectorized: cannot prove it is safe to reorder " - "floating-point operations"); - Failed = true; - } + /// Finalize the best decision and dispose of all other VPlans. + void setBestPlan(unsigned VF, unsigned UF); - // Test if runtime memcheck thresholds are exceeded. - bool PragmaThresholdReached = - NumRuntimePointerChecks > PragmaVectorizeMemoryCheckThreshold; - bool ThresholdReached = - NumRuntimePointerChecks > VectorizerParams::RuntimeMemoryCheckThreshold; - if ((ThresholdReached && !Hints.allowReordering()) || - PragmaThresholdReached) { - ORE.emit(OptimizationRemarkAnalysisAliasing(PassName, "CantReorderMemOps", - L->getStartLoc(), - L->getHeader()) - << "loop not vectorized: cannot prove it is safe to reorder " - "memory operations"); - DEBUG(dbgs() << "LV: Too many memory checks needed.\n"); - Failed = true; - } + /// Generate the IR code for the body of the vectorized loop according to the + /// best selected VPlan. + void executeBestPlan(InnerLoopVectorizer &LB); - return Failed; - } + VPlan *getVPlanForVF(unsigned VF) { return VPlans[VF].get(); } -private: - unsigned NumRuntimePointerChecks; - Instruction *UnsafeAlgebraInst; + void printCurrentPlans(const std::string &Title, raw_ostream &O); - /// Interface to emit optimization remarks. - OptimizationRemarkEmitter &ORE; -}; + /// Test a predicate on a range of VFs. + /// The returned value reflects the result for a prefix of the range, with \p + /// EndRangeVF modified accordingly. + bool testVFRange(const std::function &Predicate, + unsigned StartRangeVF, unsigned &EndRangeVF); -static void addAcyclicInnerLoop(Loop &L, SmallVectorImpl &V) { - if (L.empty()) { - if (!hasCyclesInLoopBody(L)) - V.push_back(&L); - return; - } - for (Loop *InnerL : L) - addAcyclicInnerLoop(*InnerL, V); -} +protected: + /// Build initial VPlans according to the information gathered by Legal + /// when it checked if it is legal to vectorize this loop. + /// Returns the number of VPlans built, zero if failed. 
+ unsigned buildInitialVPlans(unsigned MinVF, unsigned MaxVF); + + /// On VPlan construction, each instruction marked for predication by Legal + /// gets its own basic block guarded by an if-then. This initial planning + /// is legal, but is not optimal. This function attempts to leverage the + /// necessary conditional execution of the predicated instruction in favor + /// of other related instructions. The function applies these optimizations + /// to all VPlans. + void optimizePredicatedInstructions(); -/// The LoopVectorize Pass. -struct LoopVectorize : public FunctionPass { - /// Pass identification, replacement for typeid - static char ID; +private: + /// Build an initial VPlan according to the information gathered by Legal + /// when it checked if it is legal to vectorize this loop. \return a VPlan + /// that corresponds to vectorization factors starting from the given + /// \p StartRangeVF and up to \p EndRangeVF, exclusive, possibly decreasing + /// the given \p EndRangeVF. + std::shared_ptr buildInitialVPlan(unsigned StartRangeVF, + unsigned &EndRangeVF); + + std::pair + widenIntInduction(VPlan *Plan, unsigned StartRangeVF, unsigned &EndRangeVF, + PHINode *IV, TruncInst *Trunc = nullptr); + + /// Determine whether \p I will be scalarized in a given range of VFs. + /// The returned value reflects the result for a prefix of the range, with \p + /// EndRangeVF modified accordingly. + bool willBeScalarized(Instruction *I, unsigned StartRangeVF, + unsigned &EndRangeVF); - explicit LoopVectorize(bool NoUnrolling = false, bool AlwaysVectorize = true) - : FunctionPass(ID) { - Impl.DisableUnrolling = NoUnrolling; - Impl.AlwaysVectorize = AlwaysVectorize; - initializeLoopVectorizePass(*PassRegistry::getPassRegistry()); - } + /// Iteratively sink the scalarized operands of a predicated instruction into + /// the block that was created for it. + void sinkScalarOperands(Instruction *PredInst, VPlan *Plan); - LoopVectorizePass Impl; + /// Determine whether a newly-created recipe adds a second user to one of the + /// variants the values its ingredients use. This may cause the defining + /// recipe to generate that variant itself to serve all such users. + void assignScalarVectorConversions(Instruction *PredInst, VPlan *Plan); - bool runOnFunction(Function &F) override { - if (skipFunction(F)) - return false; + /// Returns true if an instruction \p I should be scalarized instead of + /// vectorized for the chosen vectorization factor. + bool shouldScalarizeInstruction(Instruction *I, unsigned VF) const; - auto *SE = &getAnalysis().getSE(); - auto *LI = &getAnalysis().getLoopInfo(); - auto *TTI = &getAnalysis().getTTI(F); - auto *DT = &getAnalysis().getDomTree(); - auto *BFI = &getAnalysis().getBFI(); - auto *TLIP = getAnalysisIfAvailable(); - auto *TLI = TLIP ? &TLIP->getTLI() : nullptr; - auto *AA = &getAnalysis().getAAResults(); - auto *AC = &getAnalysis().getAssumptionCache(F); - auto *LAA = &getAnalysis(); - auto *DB = &getAnalysis().getDemandedBits(); - auto *ORE = &getAnalysis().getORE(); +private: + /// The loop that we evaluate. + Loop *TheLoop; - std::function GetLAA = - [&](Loop &L) -> const LoopAccessInfo & { return LAA->getInfo(&L); }; + /// Loop Info analysis. + LoopInfo *LI; - return Impl.runImpl(F, *SE, *LI, *TTI, *DT, *BFI, TLI, *DB, *AA, *AC, - GetLAA, *ORE); - } + /// Target Library Info. 
+ const TargetLibraryInfo *TLI; - void getAnalysisUsage(AnalysisUsage &AU) const override { - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addPreserved(); - AU.addPreserved(); - AU.addPreserved(); - AU.addPreserved(); - } -}; + /// Target Transform Info. + const TargetTransformInfo *TTI; -} // end anonymous namespace + /// The legality analysis. + LoopVectorizationLegality *Legal; -//===----------------------------------------------------------------------===// -// Implementation of LoopVectorizationLegality, InnerLoopVectorizer and -// LoopVectorizationCostModel. -//===----------------------------------------------------------------------===// + /// The profitablity analysis. + LoopVectorizationCostModel *CM; -Value *InnerLoopVectorizer::getBroadcastInstrs(Value *V) { - // We need to place the broadcast of invariant variables outside the loop. - Instruction *Instr = dyn_cast(V); - bool NewInstr = (Instr && Instr->getParent() == LoopVectorBody); - bool Invariant = OrigLoop->isLoopInvariant(V) && !NewInstr; + InnerLoopVectorizer *ILV; - // Place the code for broadcasting invariant variables in the new preheader. - IRBuilder<>::InsertPointGuard Guard(Builder); - if (Invariant) - Builder.SetInsertPoint(LoopVectorPreHeader->getTerminator()); + // Holds instructions from the original loop that we predicated. Such + // instructions reside in their own conditioned VPBasicBlock and represent + // an optimization opportunity for sinking their scalarized operands thus + // reducing their cost by the predicate's probability. + SmallPtrSet PredicatedInstructions; - // Broadcast the scalar into all locations in the vector. - Value *Shuf = Builder.CreateVectorSplat(VF, V, "broadcast"); + /// VPlans are shared between VFs, use smart pointers. + DenseMap> VPlans; - return Shuf; -} + unsigned BestVF; -void InnerLoopVectorizer::createVectorIntInductionPHI( - const InductionDescriptor &II, Instruction *EntryVal) { - Value *Start = II.getStartValue(); - ConstantInt *Step = II.getConstIntStepValue(); - assert(Step && "Can not widen an IV with a non-constant step"); + unsigned BestUF; - // Construct the initial value of the vector IV in the vector loop preheader - auto CurrIP = Builder.saveIP(); - Builder.SetInsertPoint(LoopVectorPreHeader->getTerminator()); - if (isa(EntryVal)) { - auto *TruncType = cast(EntryVal->getType()); - Step = ConstantInt::getSigned(TruncType, Step->getSExtValue()); - Start = Builder.CreateCast(Instruction::Trunc, Start, TruncType); - } - Value *SplatStart = Builder.CreateVectorSplat(VF, Start); - Value *SteppedStart = getStepVector(SplatStart, 0, Step); - Builder.restoreIP(CurrIP); + // Holds instructions from the original loop whose counterparts in the + // vectorized loop would be trivially dead if generated. For example, + // original induction update instructions can become dead because we + // separately emit induction "steps" when generating code for the new loop. + // Similarly, we create a new latch condition when setting up the structure + // of the new loop, so the old one can become dead. + SmallPtrSet DeadInstructions; +}; - Value *SplatVF = - ConstantVector::getSplat(VF, ConstantInt::getSigned(Start->getType(), - VF * Step->getSExtValue())); - // We may need to add the step a number of times, depending on the unroll - // factor. The last of those goes into the PHI. 
- PHINode *VecInd = PHINode::Create(SteppedStart->getType(), 2, "vec.ind", - &*LoopVectorBody->getFirstInsertionPt()); - Instruction *LastInduction = VecInd; - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Entry[Part] = LastInduction; - LastInduction = cast( - Builder.CreateAdd(LastInduction, SplatVF, "step.add")); +class VPLaneRange { +private: + static const unsigned VF = INT_MAX; + unsigned MinLane = 0; + unsigned MaxLane = VF - 1; + void dumpLane(raw_ostream &O, unsigned Lane) const { + if (Lane == VF - 1) + O << "VF-1"; + else + O << Lane; } - VectorLoopValueMap.initVector(EntryVal, Entry); - if (isa(EntryVal)) - addMetadata(Entry, EntryVal); - // Move the last step to the end of the latch block. This ensures consistent - // placement of all induction updates. - auto *LoopVectorLatch = LI->getLoopFor(LoopVectorBody)->getLoopLatch(); - auto *Br = cast(LoopVectorLatch->getTerminator()); - auto *ICmp = cast(Br->getCondition()); - LastInduction->moveBefore(ICmp); - LastInduction->setName("vec.ind.next"); +public: + VPLaneRange() {} + VPLaneRange(unsigned Min) : MinLane(Min) {} + VPLaneRange(unsigned Min, unsigned Max) : MinLane(Min), MaxLane(Max) {} + unsigned getMinLane() const { return MinLane; } + unsigned getMaxLane() const { return MaxLane; } + bool isEmpty() const { return MinLane > MaxLane; } + bool isFull() const { return MinLane == 0 && MaxLane == VF - 1; } + void print(raw_ostream &O) const { + dumpLane(O, MinLane); + O << ".."; + dumpLane(O, MaxLane); + } + static VPLaneRange intersect(const VPLaneRange &One, const VPLaneRange &Two) { + return VPLaneRange(std::max(One.MinLane, Two.MinLane), + std::min(One.MaxLane, Two.MaxLane)); + } +}; - VecInd->addIncoming(SteppedStart, LoopVectorPreHeader); - VecInd->addIncoming(LastInduction, LoopVectorLatch); -} +/// VPScalarizeOneByOneRecipe is a VPOneByOneRecipeBase which scalarizes each +/// Instruction in its ingredients independently, in order. The scalarization +/// is performed in one of two methods: a) by generating a single uniform scalar +/// Instruction. b) by generating multiple Instructions, each one for a +/// respective lane. +class VPScalarizeOneByOneRecipe : public VPOneByOneRecipeBase { + friend class VPlanUtilsLoopVectorizer; -bool InnerLoopVectorizer::shouldScalarizeInstruction(Instruction *I) const { - return Cost->isScalarAfterVectorization(I, VF) || - Cost->isProfitableToScalarize(I, VF); -} +private: + /// Do the actual code generation for a single instruction. + void transformIRInstruction(Instruction *I, VPTransformState &State) override; -bool InnerLoopVectorizer::needsScalarInduction(Instruction *IV) const { - if (shouldScalarizeInstruction(IV)) - return true; - auto isScalarInst = [&](User *U) -> bool { - auto *I = cast(U); - return (OrigLoop->contains(I) && shouldScalarizeInstruction(I)); - }; - return any_of(IV->users(), isScalarInst); -} + VPLaneRange DesignatedLanes; -void InnerLoopVectorizer::widenIntInduction(PHINode *IV, TruncInst *Trunc) { +public: + VPScalarizeOneByOneRecipe(const BasicBlock::iterator B, + const BasicBlock::iterator E, VPlan *Plan) + : VPOneByOneRecipeBase(VPScalarizeOneByOneSC, B, E, Plan) {} - auto II = Legal->getInductionVars()->find(IV); - assert(II != Legal->getInductionVars()->end() && "IV is not an induction"); + ~VPScalarizeOneByOneRecipe() {} - auto ID = II->second; - assert(IV->getType() == ID.getStartValue()->getType() && "Types must match"); + /// Method to support type inquiry through isa, cast, and dyn_cast. 
+ static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPScalarizeOneByOneSC; + } - // The scalar value to broadcast. This will be derived from the canonical - // induction variable. - Value *ScalarIV = nullptr; + const VPLaneRange &getDesignatedLanes() const { return DesignatedLanes; } - // The step of the induction. - Value *Step = nullptr; + /// Print the recipe. + void print(raw_ostream &O) const override { + O << "Scalarize"; + if (!DesignatedLanes.isFull()) { + O << " "; + DesignatedLanes.print(O); + } + O << ":"; + for (auto It = Begin; It != End; ++It) { + O << '\n' << *It; + if (willAlsoPackOrUnpack(&*It)) + O << " (S->V)"; + } + } +}; - // The value from the original loop to which we are mapping the new induction - // variable. - Instruction *EntryVal = Trunc ? cast(Trunc) : IV; +/// VPVectorizeOneByOneRecipe is a VPOneByOneRecipeBase which transforms by +/// vectorizing each Instruction in itsingredients independently, in order. +/// This recipe covers most of the traditional vectorization cases where +/// each ingredient produces a vectorized version of itself. +class VPVectorizeOneByOneRecipe : public VPOneByOneRecipeBase { + friend class VPlanUtilsLoopVectorizer; - // True if we have vectorized the induction variable. - auto VectorizedIV = false; +private: + /// Do the actual code generation for a single instruction. + void transformIRInstruction(Instruction *I, VPTransformState &State) override; - // Determine if we want a scalar version of the induction variable. This is - // true if the induction variable itself is not widened, or if it has at - // least one user in the loop that is not widened. - auto NeedsScalarIV = VF > 1 && needsScalarInduction(EntryVal); +public: + VPVectorizeOneByOneRecipe(const BasicBlock::iterator B, + const BasicBlock::iterator E, VPlan *Plan) + : VPOneByOneRecipeBase(VPVectorizeOneByOneSC, B, E, Plan) {} - // If the induction variable has a constant integer step value, go ahead and - // get it now. - if (ID.getConstIntStepValue()) - Step = ID.getConstIntStepValue(); + ~VPVectorizeOneByOneRecipe() {} - // Try to create a new independent vector induction variable. If we can't - // create the phi node, we will splat the scalar induction variable in each - // loop iteration. - if (VF > 1 && Step && !shouldScalarizeInstruction(EntryVal)) { - createVectorIntInductionPHI(ID, EntryVal); - VectorizedIV = true; + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPVectorizeOneByOneSC; } - // If we haven't yet vectorized the induction variable, or if we will create - // a scalar one, we need to define the scalar induction variable and step - // values. If we were given a truncation type, truncate the canonical - // induction variable and constant step. Otherwise, derive these values from - // the induction descriptor. 
-  if (!VectorizedIV || NeedsScalarIV) {
-    if (Trunc) {
-      auto *TruncType = cast<IntegerType>(Trunc->getType());
-      assert(Step && "Truncation requires constant integer step");
-      auto StepInt = cast<ConstantInt>(Step)->getSExtValue();
-      ScalarIV = Builder.CreateCast(Instruction::Trunc, Induction, TruncType);
-      Step = ConstantInt::getSigned(TruncType, StepInt);
-    } else {
-      ScalarIV = Induction;
-      auto &DL = OrigLoop->getHeader()->getModule()->getDataLayout();
-      if (IV != OldInduction) {
-        ScalarIV = Builder.CreateSExtOrTrunc(ScalarIV, IV->getType());
-        ScalarIV = ID.transform(Builder, ScalarIV, PSE.getSE(), DL);
-        ScalarIV->setName("offset.idx");
-      }
-      if (!Step) {
-        SCEVExpander Exp(*PSE.getSE(), DL, "induction");
-        Step = Exp.expandCodeFor(ID.getStep(), ID.getStep()->getType(),
-                                 &*Builder.GetInsertPoint());
-      }
+  /// Print the recipe.
+  void print(raw_ostream &O) const override {
+    O << "Vectorize:";
+    for (auto It = Begin; It != End; ++It) {
+      O << '\n' << *It;
+      if (willAlsoPackOrUnpack(&*It))
+        O << " (S->V)";
     }
   }
+};

-  // If we haven't yet vectorized the induction variable, splat the scalar
-  // induction variable, and build the necessary step vectors.
-  if (!VectorizedIV) {
-    Value *Broadcasted = getBroadcastInstrs(ScalarIV);
-    VectorParts Entry(UF);
-    for (unsigned Part = 0; Part < UF; ++Part)
-      Entry[Part] = getStepVector(Broadcasted, VF * Part, Step);
-    VectorLoopValueMap.initVector(EntryVal, Entry);
-    if (Trunc)
-      addMetadata(Entry, Trunc);
-  }
+/// A recipe which widens integer inductions, producing their vector values
+/// and computing the necessary values for producing their scalar values.
+/// The scalar values themselves are generated by a complementing
+/// VPBuildScalarStepsRecipe.
+class VPWidenIntInductionRecipe : public VPRecipeBase {
+private:
+  bool NeedsScalarIV;
+  PHINode *IV;
+  TruncInst *Trunc;
+  Value *ScalarIV = nullptr;
+  Value *Step = nullptr;

-  // If an induction variable is only used for counting loop iterations or
-  // calculating addresses, it doesn't need to be widened. Create scalar steps
-  // that can be used by instructions we will later scalarize. Note that the
-  // addition of the scalar steps will not increase the number of instructions
-  // in the loop in the common case prior to InstCombine. We will be trading
-  // one vector extract for each scalar step.
-  if (NeedsScalarIV)
-    buildScalarSteps(ScalarIV, Step, EntryVal);
-}
+public:
+  VPWidenIntInductionRecipe(bool NeedsScalarIV, PHINode *IV,
+                            TruncInst *Trunc = nullptr)
+      : VPRecipeBase(VPWidenIntInductionSC), NeedsScalarIV(NeedsScalarIV),
+        IV(IV), Trunc(Trunc) {}

-Value *InnerLoopVectorizer::getStepVector(Value *Val, int StartIdx, Value *Step,
-                                          Instruction::BinaryOps BinOp) {
-  // Create and check the types.
-  assert(Val->getType()->isVectorTy() && "Must be a vector");
-  int VLen = Val->getType()->getVectorNumElements();
+  ~VPWidenIntInductionRecipe() {}

-  Type *STy = Val->getType()->getScalarType();
-  assert((STy->isIntegerTy() || STy->isFloatingPointTy()) &&
-         "Induction Step must be an integer or FP");
-  assert(Step->getType() == STy && "Step has wrong type");
+  /// Method to support type inquiry through isa, cast, and dyn_cast.
+  static inline bool classof(const VPRecipeBase *V) {
+    return V->getVPRecipeID() == VPRecipeBase::VPWidenIntInductionSC;
+  }

-  SmallVector<Constant *, 8> Indices;
+  /// The method which generates the wide and/or scalar induction values
+  /// corresponding to this VPWidenIntInductionRecipe in the vectorized
+  /// version, thereby "executing" the VPlan.
+  void vectorize(VPTransformState &State) override;

-  if (STy->isIntegerTy()) {
-    // Create a vector of consecutive numbers from zero to VF.
-    for (int i = 0; i < VLen; ++i)
-      Indices.push_back(ConstantInt::get(STy, StartIdx + i));
+  /// Print the recipe.
+  void print(raw_ostream &O) const override;

-    // Add the consecutive indices to the vector value.
-    Constant *Cv = ConstantVector::get(Indices);
-    assert(Cv->getType() == Val->getType() && "Invalid consecutive vec");
-    Step = Builder.CreateVectorSplat(VLen, Step);
-    assert(Step->getType() == Val->getType() && "Invalid step vec");
-    // FIXME: The newly created binary instructions should contain nsw/nuw flags,
-    // which can be found from the original scalar operations.
-    Step = Builder.CreateMul(Cv, Step);
-    return Builder.CreateAdd(Val, Step, "induction");
+  Value *getScalarIV() {
+    assert(ScalarIV && "ScalarIV does not exist yet");
+    return ScalarIV;
   }

-  // Floating point induction.
-  assert((BinOp == Instruction::FAdd || BinOp == Instruction::FSub) &&
-         "Binary Opcode should be specified for FP induction");
-  // Create a vector of consecutive numbers from zero to VF.
-  for (int i = 0; i < VLen; ++i)
-    Indices.push_back(ConstantFP::get(STy, (double)(StartIdx + i)));
+  Value *getStep() {
+    assert(Step && "Step does not exist yet");
+    return Step;
+  }
+};

-  // Add the consecutive indices to the vector value.
-  Constant *Cv = ConstantVector::get(Indices);
+/// This is a complementing recipe for handling integer induction variables,
+/// responsible for generating the scalar values used by the IV's scalar users.
+class VPBuildScalarStepsRecipe : public VPRecipeBase {
+  friend class VPlanUtilsLoopVectorizer;

-  Step = Builder.CreateVectorSplat(VLen, Step);
+private:
+  VPWidenIntInductionRecipe *WII;
+  Instruction *EntryVal;
+  VPLaneRange DesignatedLanes;

-  // Floating point operations had to be 'fast' to enable the induction.
-  FastMathFlags Flags;
-  Flags.setUnsafeAlgebra();
+public:
+  VPBuildScalarStepsRecipe(VPWidenIntInductionRecipe *WII,
+                           Instruction *EntryVal, VPlan *Plan)
+      : VPRecipeBase(VPBuildScalarStepsSC), WII(WII), EntryVal(EntryVal) {
+    Plan->setInst2Recipe(EntryVal, this);
+  }

-  Value *MulOp = Builder.CreateFMul(Cv, Step);
-  if (isa<Instruction>(MulOp))
-    // Have to check, MulOp may be a constant
-    cast<Instruction>(MulOp)->setFastMathFlags(Flags);
+  ~VPBuildScalarStepsRecipe() {}

-  Value *BOp = Builder.CreateBinOp(BinOp, Val, MulOp, "induction");
-  if (isa<Instruction>(BOp))
-    cast<Instruction>(BOp)->setFastMathFlags(Flags);
-  return BOp;
-}
+  const VPLaneRange &getDesignatedLanes() const { return DesignatedLanes; }

-void InnerLoopVectorizer::buildScalarSteps(Value *ScalarIV, Value *Step,
-                                           Value *EntryVal) {
+  /// Method to support type inquiry through isa, cast, and dyn_cast.
+  static inline bool classof(const VPRecipeBase *V) {
+    return V->getVPRecipeID() == VPRecipeBase::VPBuildScalarStepsSC;
+  }

-  // We shouldn't have to build scalar steps if we aren't vectorizing.
-  assert(VF > 1 && "VF should be greater than one");
+  /// The method which generates the scalar induction steps for the designated
+  /// lanes, corresponding to this VPBuildScalarStepsRecipe in the vectorized
+  /// version, thereby "executing" the VPlan.
+  void vectorize(VPTransformState &State) override;

-  // Get the value type and ensure it and the step have the same integer type.
-  Type *ScalarIVTy = ScalarIV->getType()->getScalarType();
-  assert(ScalarIVTy->isIntegerTy() && ScalarIVTy == Step->getType() &&
-         "Val and Step should have the same integer type");
+  /// Print the recipe.
+ void print(raw_ostream &O) const override; +}; - // Determine the number of scalars we need to generate for each unroll - // iteration. If EntryVal is uniform, we only need to generate the first - // lane. Otherwise, we generate all VF values. - unsigned Lanes = - Cost->isUniformAfterVectorization(cast(EntryVal), VF) ? 1 : VF; +/// A VPInterleaveRecipe is a VPRecipe which transforms an interleave group of +/// loads or stores into one wide load/store and shuffles. +class VPInterleaveRecipe : public VPRecipeBase { +private: + const InterleaveGroup *IG; - // Compute the scalar steps and save the results in VectorLoopValueMap. - ScalarParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Entry[Part].resize(VF); - for (unsigned Lane = 0; Lane < Lanes; ++Lane) { - auto *StartIdx = ConstantInt::get(ScalarIVTy, VF * Part + Lane); - auto *Mul = Builder.CreateMul(StartIdx, Step); - auto *Add = Builder.CreateAdd(ScalarIV, Mul); - Entry[Part][Lane] = Add; - } +public: + VPInterleaveRecipe(const InterleaveGroup *IG, VPlan *Plan) + : VPRecipeBase(VPInterleaveSC), IG(IG) { + for (unsigned I = 0, E = IG->getNumMembers(); I < E; ++I) + Plan->setInst2Recipe(IG->getMember(I), this); } - VectorLoopValueMap.initScalar(EntryVal, Entry); -} -int LoopVectorizationLegality::isConsecutivePtr(Value *Ptr) { + ~VPInterleaveRecipe() {} - const ValueToValueMap &Strides = getSymbolicStrides() ? *getSymbolicStrides() : - ValueToValueMap(); + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPInterleaveSC; + } - int Stride = getPtrStride(PSE, Ptr, TheLoop, Strides, true, false); - if (Stride == 1 || Stride == -1) - return Stride; - return 0; -} + /// The method which generates the wide load or store and shuffles that + /// correspond to this VPInterleaveRecipe in the vectorized version, thereby + /// "executing" the VPlan. + void vectorize(VPTransformState &State) override; -bool LoopVectorizationLegality::isUniform(Value *V) { - return LAI->isUniform(V); -} + /// Print the recipe. + void print(raw_ostream &O) const override; -const InnerLoopVectorizer::VectorParts & -InnerLoopVectorizer::getVectorValue(Value *V) { - assert(V != Induction && "The new induction variable should not be used."); - assert(!V->getType()->isVectorTy() && "Can't widen a vector"); - assert(!V->getType()->isVoidTy() && "Type does not produce a value"); + const InterleaveGroup *getInterleaveGroup() { return IG; } +}; - // If we have a stride that is replaced by one, do it here. - if (Legal->hasStride(V)) - V = ConstantInt::get(V->getType(), 1); - - // If we have this scalar in the map, return it. - if (VectorLoopValueMap.hasVector(V)) - return VectorLoopValueMap.VectorMapStorage[V]; - - // If the value has not been vectorized, check if it has been scalarized - // instead. If it has been scalarized, and we actually need the value in - // vector form, we will construct the vector values on demand. - if (VectorLoopValueMap.hasScalar(V)) { - - // Initialize a new vector map entry. - VectorParts Entry(UF); - - // If we've scalarized a value, that value should be an instruction. - auto *I = cast(V); +/// A VPExtractMaskBitRecipe is a VPConditionBitRecipe which supports a +/// scalarized conditional branch. Such branches are needed to guard scalarized +/// instructions with possible side-effects that are predicated under a +/// condition. 
This recipe is in charge of generating the instruction that +/// computes the condition for this branch in the vectorized version. +class VPExtractMaskBitRecipe : public VPConditionBitRecipeBase { +private: + /// The original IR basic block in which the scalarized and predicated + /// instruction(s) reside. Needed for generating the mask of the block + /// and from it the desired condition bit. + BasicBlock *MaskedBasicBlock; - // If we aren't vectorizing, we can just copy the scalar map values over to - // the vector map. - if (VF == 1) { - for (unsigned Part = 0; Part < UF; ++Part) - Entry[Part] = getScalarValue(V, Part, 0); - return VectorLoopValueMap.initVector(V, Entry); - } +public: + /// Construct a VPExtractMaskBitRecipe given the IR BasicBlock whose mask + /// should provide the desired bit. This recipe has no Instructions as + /// ingredients, hence does not call Plan->setInst2Recipe(). + VPExtractMaskBitRecipe(BasicBlock *BB) + : VPConditionBitRecipeBase(VPExtractMaskBitSC), MaskedBasicBlock(BB) {} - // Get the last scalar instruction we generated for V. If the value is - // known to be uniform after vectorization, this corresponds to lane zero - // of the last unroll iteration. Otherwise, the last instruction is the one - // we created for the last vector lane of the last unroll iteration. - unsigned LastLane = Cost->isUniformAfterVectorization(I, VF) ? 0 : VF - 1; - auto *LastInst = cast(getScalarValue(V, UF - 1, LastLane)); + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPExtractMaskBitSC; + } - // Set the insert point after the last scalarized instruction. This ensures - // the insertelement sequence will directly follow the scalar definitions. - auto OldIP = Builder.saveIP(); - auto NewIP = std::next(BasicBlock::iterator(LastInst)); - Builder.SetInsertPoint(&*NewIP); + /// The method which generates the comparison and related mask management + /// instructions leading to computing the desired condition bit, corresponding + /// to this VPExtractMaskBitRecipe in the vectorized version, thereby + /// "executing" the VPlan. + void vectorize(VPTransformState &State) override; - // However, if we are vectorizing, we need to construct the vector values. - // If the value is known to be uniform after vectorization, we can just - // broadcast the scalar value corresponding to lane zero for each unroll - // iteration. Otherwise, we construct the vector values using insertelement - // instructions. Since the resulting vectors are stored in - // VectorLoopValueMap, we will only generate the insertelements once. - for (unsigned Part = 0; Part < UF; ++Part) { - Value *VectorValue = nullptr; - if (Cost->isUniformAfterVectorization(I, VF)) { - VectorValue = getBroadcastInstrs(getScalarValue(V, Part, 0)); - } else { - VectorValue = UndefValue::get(VectorType::get(V->getType(), VF)); - for (unsigned Lane = 0; Lane < VF; ++Lane) - VectorValue = Builder.CreateInsertElement( - VectorValue, getScalarValue(V, Part, Lane), - Builder.getInt32(Lane)); - } - Entry[Part] = VectorValue; - } - Builder.restoreIP(OldIP); - return VectorLoopValueMap.initVector(V, Entry); + /// Print the recipe. + void print(raw_ostream &O) const override { + O << "Extract Mask Bit:\n" << MaskedBasicBlock->getName(); } - // If this scalar is unknown, assume that it is a constant or that it is - // loop invariant. Broadcast V and save the value for future uses. 
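// A reduced sketch (not from the patch) of the packing that getVectorValue()
// performs when a value was scalarized but is later needed in vector form.
// Builder, the Scalars array and packScalars are stand-ins for the surrounding
// InnerLoopVectorizer state; the insertelement-per-lane pattern is the point.
static Value *packScalars(IRBuilder<> &Builder, ArrayRef<Value *> Scalars,
                          unsigned VF) {
  Value *Vec = UndefValue::get(VectorType::get(Scalars[0]->getType(), VF));
  // Insert one lane at a time; each insertelement feeds the next one.
  for (unsigned Lane = 0; Lane < VF; ++Lane)
    Vec = Builder.CreateInsertElement(Vec, Scalars[Lane],
                                      Builder.getInt32(Lane));
  return Vec;
}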
- Value *B = getBroadcastInstrs(V); - return VectorLoopValueMap.initVector(V, VectorParts(UF, B)); -} - -Value *InnerLoopVectorizer::getScalarValue(Value *V, unsigned Part, - unsigned Lane) { + StringRef getName() const override { return MaskedBasicBlock->getName(); } +}; - // If the value is not an instruction contained in the loop, it should - // already be scalar. - if (OrigLoop->isLoopInvariant(V)) - return V; +/// A VPMergeScalarizeBranchRecipe is a VPRecipe which represents the Phi's +/// needed when control converges back from a scalarized branch. Such phi's are +/// needed to merge live-out values that are set under a scalarized conditional +/// branch. They can be scalar or vector, depending on the user of the +/// live-out value. This recipe works in concert with VPExtractMaskBitRecipe. +class VPMergeScalarizeBranchRecipe : public VPRecipeBase { +private: + Instruction *LiveOut; - assert(Lane > 0 ? - !Cost->isUniformAfterVectorization(cast(V), VF) - : true && "Uniform values only have lane zero"); +public: + // Construct a VPMergeScalarizeBranchRecipe given \LiveOut whose value needs + // a Phi after merging back from a scalarized branch. + // LiveOut is mapped to the recipe vectorizing it, instead of this recipe + // which provides it with PHIs; hence no call to Plan->setInst2Recipe() here. + VPMergeScalarizeBranchRecipe(Instruction *LiveOut) + : VPRecipeBase(VPMergeScalarizeBranchSC), LiveOut(LiveOut) {} - // If the value from the original loop has not been vectorized, it is - // represented by UF x VF scalar values in the new loop. Return the requested - // scalar value. - if (VectorLoopValueMap.hasScalar(V)) - return VectorLoopValueMap.ScalarMapStorage[V][Part][Lane]; + ~VPMergeScalarizeBranchRecipe() {} - // If the value has not been scalarized, get its entry in VectorLoopValueMap - // for the given unroll part. If this entry is not a vector type (i.e., the - // vectorization factor is one), there is no need to generate an - // extractelement instruction. - auto *U = getVectorValue(V)[Part]; - if (!U->getType()->isVectorTy()) { - assert(VF == 1 && "Value not scalarized has non-vector type"); - return U; + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPMergeScalarizeBranchSC; } - // Otherwise, the value from the original loop has been vectorized and is - // represented by UF vector values. Extract and return the requested scalar - // value from the appropriate vector lane. - return Builder.CreateExtractElement(U, Builder.getInt32(Lane)); -} + /// The method which generates Phi instructions for live-outs as needed to + /// retain SSA form, corresponding to this VPMergeScalarizeBranchRecipe in the + /// vectorized version, thereby "executing" the VPlan. + void vectorize(VPTransformState &State) override; -Value *InnerLoopVectorizer::reverseVector(Value *Vec) { - assert(Vec->getType()->isVectorTy() && "Invalid type"); - SmallVector ShuffleMask; - for (unsigned i = 0; i < VF; ++i) - ShuffleMask.push_back(Builder.getInt32(VF - i - 1)); + /// Print the recipe. 
+ void print(raw_ostream &O) const override { + O << "Merge Scalarize Branch:\n" << *LiveOut; + } +}; - return Builder.CreateShuffleVector(Vec, UndefValue::get(Vec->getType()), - ConstantVector::get(ShuffleMask), - "reverse"); -} +class VPlanUtilsLoopVectorizer : public VPlanUtils { +public: + VPlanUtilsLoopVectorizer(VPlan *Plan) : VPlanUtils(Plan) {} -// Try to vectorize the interleave group that \p Instr belongs to. -// -// E.g. Translate following interleaved load group (factor = 3): -// for (i = 0; i < N; i+=3) { -// R = Pic[i]; // Member of index 0 -// G = Pic[i+1]; // Member of index 1 -// B = Pic[i+2]; // Member of index 2 -// ... // do something to R, G, B -// } -// To: -// %wide.vec = load <12 x i32> ; Read 4 tuples of R,G,B -// %R.vec = shuffle %wide.vec, undef, <0, 3, 6, 9> ; R elements -// %G.vec = shuffle %wide.vec, undef, <1, 4, 7, 10> ; G elements -// %B.vec = shuffle %wide.vec, undef, <2, 5, 8, 11> ; B elements -// -// Or translate following interleaved store group (factor = 3): -// for (i = 0; i < N; i+=3) { -// ... do something to R, G, B -// Pic[i] = R; // Member of index 0 -// Pic[i+1] = G; // Member of index 1 -// Pic[i+2] = B; // Member of index 2 -// } -// To: -// %R_G.vec = shuffle %R.vec, %G.vec, <0, 1, 2, ..., 7> -// %B_U.vec = shuffle %B.vec, undef, <0, 1, 2, 3, u, u, u, u> -// %interleaved.vec = shuffle %R_G.vec, %B_U.vec, -// <0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11> ; Interleave R,G,B elements -// store <12 x i32> %interleaved.vec ; Write 4 tuples of R,G,B -void InnerLoopVectorizer::vectorizeInterleaveGroup(Instruction *Instr) { - const InterleaveGroup *Group = Legal->getInterleavedAccessGroup(Instr); - assert(Group && "Fail to get an interleaved access group."); + ~VPlanUtilsLoopVectorizer() {} - // Skip if current instruction is not the insert position. - if (Instr != Group->getInsertPos()) - return; + VPOneByOneRecipeBase *createOneByOneRecipe(const BasicBlock::iterator B, + const BasicBlock::iterator E, + VPlan *Plan, bool isScalarizing); - Value *Ptr = getPointerOperand(Instr); + bool appendInstruction(VPOneByOneRecipeBase *Recipe, Instruction *Instr); - // Prepare for the vector type of the interleaved load/store. - Type *ScalarTy = getMemInstValueType(Instr); - unsigned InterleaveFactor = Group->getFactor(); - Type *VecTy = VectorType::get(ScalarTy, InterleaveFactor * VF); - Type *PtrTy = VecTy->getPointerTo(getMemInstAddressSpace(Instr)); + VPOneByOneRecipeBase *splitRecipe(Instruction *Split); - // Prepare for the new pointers. - setDebugLocFromInst(Builder, Ptr); - SmallVector NewPtrs; - unsigned Index = Group->getIndex(Instr); + void insertBefore(Instruction *Inst, Instruction *Before, + unsigned MinLane = 0); - // If the group is reverse, adjust the index to refer to the last vector lane - // instead of the first. We adjust the index from the first vector lane, - // rather than directly getting the pointer for lane VF - 1, because the - // pointer operand of the interleaved access is supposed to be uniform. For - // uniform instructions, we're only required to generate a value for the - // first vector lane in each unroll iteration. - if (Group->isReverse()) - Index += (VF - 1) * Group->getFactor(); + void removeInstruction(Instruction *Inst, unsigned FromLane = 0); - for (unsigned Part = 0; Part < UF; Part++) { - Value *NewPtr = getScalarValue(Ptr, Part, 0); + void sinkInstruction(Instruction *Inst, VPBasicBlock *To, + unsigned MinLane = 0); - // Notice current instruction could be any index. Need to adjust the address - // to the member of index 0. 
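// Illustrative sketch (not from the patch): the classof() hooks defined by the
// recipe classes above enable the usual isa<>/dyn_cast<> dispatch over a
// VPRecipeBase pointer. inspectRecipe is a hypothetical helper; only classof,
// getInterleaveGroup, getName and print from the classes above are assumed.
static void inspectRecipe(VPRecipeBase *R, raw_ostream &OS) {
  if (auto *IR = dyn_cast<VPInterleaveRecipe>(R))
    OS << "Interleave group with " << IR->getInterleaveGroup()->getNumMembers()
       << " members\n";
  else if (auto *EMB = dyn_cast<VPExtractMaskBitRecipe>(R))
    OS << "Condition bit for block " << EMB->getName() << "\n";
  else
    R->print(OS);
}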
- // - // E.g. a = A[i+1]; // Member of index 1 (Current instruction) - // b = A[i]; // Member of index 0 - // Current pointer is pointed to A[i+1], adjust it to A[i]. - // - // E.g. A[i+1] = a; // Member of index 1 - // A[i] = b; // Member of index 0 - // A[i+2] = c; // Member of index 2 (Current instruction) - // Current pointer is pointed to A[i+2], adjust it to A[i]. - NewPtr = Builder.CreateGEP(NewPtr, Builder.getInt32(-Index)); + template void designateLaneZero(T &Recipe) { + Recipe->DesignatedLanes = VPLaneRange(0, 0); + } +}; - // Cast to the vector pointer type. - NewPtrs.push_back(Builder.CreateBitCast(NewPtr, PtrTy)); +/// \brief This holds vectorization requirements that must be verified late in +/// the process. The requirements are set by legalize and costmodel. Once +/// vectorization has been determined to be possible and profitable the +/// requirements can be verified by looking for metadata or compiler options. +/// For example, some loops require FP commutativity which is only allowed if +/// vectorization is explicitly specified or if the fast-math compiler option +/// has been provided. +/// Late evaluation of these requirements allows helpful diagnostics to be +/// composed that tells the user what need to be done to vectorize the loop. For +/// example, by specifying #pragma clang loop vectorize or -ffast-math. Late +/// evaluation should be used only when diagnostics can generated that can be +/// followed by a non-expert user. +class LoopVectorizationRequirements { +public: + LoopVectorizationRequirements(OptimizationRemarkEmitter &ORE) + : NumRuntimePointerChecks(0), UnsafeAlgebraInst(nullptr), ORE(ORE) {} + + void addUnsafeAlgebraInst(Instruction *I) { + // First unsafe algebra instruction. + if (!UnsafeAlgebraInst) + UnsafeAlgebraInst = I; } - setDebugLocFromInst(Builder, Instr); - Value *UndefVec = UndefValue::get(VecTy); + void addRuntimePointerChecks(unsigned Num) { NumRuntimePointerChecks = Num; } - // Vectorize the interleaved load group. - if (isa(Instr)) { + bool doesNotMeet(Function *F, Loop *L, const LoopVectorizeHints &Hints) { + const char *PassName = Hints.vectorizeAnalysisPassName(); + bool Failed = false; + if (UnsafeAlgebraInst && !Hints.allowReordering()) { + ORE.emit( + OptimizationRemarkAnalysisFPCommute(PassName, "CantReorderFPOps", + UnsafeAlgebraInst->getDebugLoc(), + UnsafeAlgebraInst->getParent()) + << "loop not vectorized: cannot prove it is safe to reorder " + "floating-point operations"); + Failed = true; + } - // For each unroll part, create a wide load for the group. - SmallVector NewLoads; - for (unsigned Part = 0; Part < UF; Part++) { - auto *NewLoad = Builder.CreateAlignedLoad( - NewPtrs[Part], Group->getAlignment(), "wide.vec"); - addMetadata(NewLoad, Instr); - NewLoads.push_back(NewLoad); + // Test if runtime memcheck thresholds are exceeded. + bool PragmaThresholdReached = + NumRuntimePointerChecks > PragmaVectorizeMemoryCheckThreshold; + bool ThresholdReached = + NumRuntimePointerChecks > VectorizerParams::RuntimeMemoryCheckThreshold; + if ((ThresholdReached && !Hints.allowReordering()) || + PragmaThresholdReached) { + ORE.emit(OptimizationRemarkAnalysisAliasing(PassName, "CantReorderMemOps", + L->getStartLoc(), + L->getHeader()) + << "loop not vectorized: cannot prove it is safe to reorder " + "memory operations"); + DEBUG(dbgs() << "LV: Too many memory checks needed.\n"); + Failed = true; } - // For each member in the group, shuffle out the appropriate data from the - // wide loads. 
- for (unsigned I = 0; I < InterleaveFactor; ++I) { - Instruction *Member = Group->getMember(I); + return Failed; + } - // Skip the gaps in the group. - if (!Member) - continue; +private: + unsigned NumRuntimePointerChecks; + Instruction *UnsafeAlgebraInst; - VectorParts Entry(UF); - Constant *StrideMask = createStrideMask(Builder, I, InterleaveFactor, VF); - for (unsigned Part = 0; Part < UF; Part++) { - Value *StridedVec = Builder.CreateShuffleVector( - NewLoads[Part], UndefVec, StrideMask, "strided.vec"); + /// Interface to emit optimization remarks. + OptimizationRemarkEmitter &ORE; +}; - // If this member has different type, cast the result type. - if (Member->getType() != ScalarTy) { - VectorType *OtherVTy = VectorType::get(Member->getType(), VF); - StridedVec = Builder.CreateBitOrPointerCast(StridedVec, OtherVTy); - } - - Entry[Part] = - Group->isReverse() ? reverseVector(StridedVec) : StridedVec; - } - VectorLoopValueMap.initVector(Member, Entry); - } +static void addAcyclicInnerLoop(Loop &L, SmallVectorImpl &V) { + if (L.empty()) { + if (!hasCyclesInLoopBody(L)) + V.push_back(&L); return; } + for (Loop *InnerL : L) + addAcyclicInnerLoop(*InnerL, V); +} - // The sub vector type for current instruction. - VectorType *SubVT = VectorType::get(ScalarTy, VF); +/// The LoopVectorize Pass. +struct LoopVectorize : public FunctionPass { + /// Pass identification, replacement for typeid + static char ID; - // Vectorize the interleaved store group. - for (unsigned Part = 0; Part < UF; Part++) { - // Collect the stored vector from each member. - SmallVector StoredVecs; - for (unsigned i = 0; i < InterleaveFactor; i++) { - // Interleaved store group doesn't allow a gap, so each index has a member - Instruction *Member = Group->getMember(i); - assert(Member && "Fail to get a member from an interleaved store group"); + explicit LoopVectorize(bool NoUnrolling = false, bool AlwaysVectorize = true) + : FunctionPass(ID) { + Impl.DisableUnrolling = NoUnrolling; + Impl.AlwaysVectorize = AlwaysVectorize; + initializeLoopVectorizePass(*PassRegistry::getPassRegistry()); + } - Value *StoredVec = - getVectorValue(cast(Member)->getValueOperand())[Part]; - if (Group->isReverse()) - StoredVec = reverseVector(StoredVec); + LoopVectorizePass Impl; - // If this member has different type, cast it to an unified type. - if (StoredVec->getType() != SubVT) - StoredVec = Builder.CreateBitOrPointerCast(StoredVec, SubVT); + bool runOnFunction(Function &F) override { + if (skipFunction(F)) + return false; - StoredVecs.push_back(StoredVec); - } + auto *SE = &getAnalysis().getSE(); + auto *LI = &getAnalysis().getLoopInfo(); + auto *TTI = &getAnalysis().getTTI(F); + auto *DT = &getAnalysis().getDomTree(); + auto *BFI = &getAnalysis().getBFI(); + auto *TLIP = getAnalysisIfAvailable(); + auto *TLI = TLIP ? &TLIP->getTLI() : nullptr; + auto *AA = &getAnalysis().getAAResults(); + auto *AC = &getAnalysis().getAssumptionCache(F); + auto *LAA = &getAnalysis(); + auto *DB = &getAnalysis().getDemandedBits(); + auto *ORE = &getAnalysis().getORE(); - // Concatenate all vectors into a wide vector. - Value *WideVec = concatenateVectors(Builder, StoredVecs); + std::function GetLAA = + [&](Loop &L) -> const LoopAccessInfo & { return LAA->getInfo(&L); }; - // Interleave the elements in the wide vector. 
- Constant *IMask = createInterleaveMask(Builder, VF, InterleaveFactor); - Value *IVec = Builder.CreateShuffleVector(WideVec, UndefVec, IMask, - "interleaved.vec"); + return Impl.runImpl(F, *SE, *LI, *TTI, *DT, *BFI, TLI, *DB, *AA, *AC, + GetLAA, *ORE); + } - Instruction *NewStoreInstr = - Builder.CreateAlignedStore(IVec, NewPtrs[Part], Group->getAlignment()); - addMetadata(NewStoreInstr, Instr); + void getAnalysisUsage(AnalysisUsage &AU) const override { + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addPreserved(); + AU.addPreserved(); + AU.addPreserved(); + AU.addPreserved(); } -} +}; -void InnerLoopVectorizer::vectorizeMemoryInstruction(Instruction *Instr) { - // Attempt to issue a wide load. - LoadInst *LI = dyn_cast(Instr); - StoreInst *SI = dyn_cast(Instr); +} // end anonymous namespace - assert((LI || SI) && "Invalid Load/Store instruction"); +//===----------------------------------------------------------------------===// +// Implementation of LoopVectorizationLegality, InnerLoopVectorizer, +// LoopVectorizationCostModel and LoopVectorizationPlanner. +//===----------------------------------------------------------------------===// - LoopVectorizationCostModel::InstWidening Decision = - Cost->getWideningDecision(Instr, VF); - assert(Decision != LoopVectorizationCostModel::CM_Unknown && - "CM decision should be taken at this point"); - if (Decision == LoopVectorizationCostModel::CM_Interleave) - return vectorizeInterleaveGroup(Instr); +Value *InnerLoopVectorizer::getBroadcastInstrs(Value *V) { + // We need to place the broadcast of invariant variables outside the loop. + Instruction *Instr = dyn_cast(V); + bool NewInstr = (Instr && Instr->getParent() == LoopVectorBody); + bool Invariant = OrigLoop->isLoopInvariant(V) && !NewInstr; - Type *ScalarDataTy = getMemInstValueType(Instr); - Type *DataTy = VectorType::get(ScalarDataTy, VF); - Value *Ptr = getPointerOperand(Instr); - unsigned Alignment = getMemInstAlignment(Instr); - // An alignment of 0 means target abi alignment. We need to use the scalar's - // target abi alignment in such a case. - const DataLayout &DL = Instr->getModule()->getDataLayout(); - if (!Alignment) - Alignment = DL.getABITypeAlignment(ScalarDataTy); - unsigned AddressSpace = getMemInstAddressSpace(Instr); + // Place the code for broadcasting invariant variables in the new preheader. + IRBuilder<>::InsertPointGuard Guard(Builder); + if (Invariant) + Builder.SetInsertPoint(LoopVectorPreHeader->getTerminator()); - // Scalarize the memory instruction if necessary. - if (Decision == LoopVectorizationCostModel::CM_Scalarize) - return scalarizeInstruction(Instr, Legal->isScalarWithPredication(Instr)); + // Broadcast the scalar into all locations in the vector. + Value *Shuf = Builder.CreateVectorSplat(VF, V, "broadcast"); - // Determine if the pointer operand of the access is either consecutive or - // reverse consecutive. 
- int ConsecutiveStride = Legal->isConsecutivePtr(Ptr); - bool Reverse = ConsecutiveStride < 0; - bool CreateGatherScatter = - (Decision == LoopVectorizationCostModel::CM_GatherScatter); + return Shuf; +} - VectorParts VectorGep; +void InnerLoopVectorizer::createVectorIntInductionPHI( + const InductionDescriptor &II, Instruction *EntryVal) { + Value *Start = II.getStartValue(); + ConstantInt *Step = II.getConstIntStepValue(); + assert(Step && "Can not widen an IV with a non-constant step"); - // Handle consecutive loads/stores. - GetElementPtrInst *Gep = getGEPInstruction(Ptr); - if (ConsecutiveStride) { - if (Gep) { - unsigned NumOperands = Gep->getNumOperands(); -#ifndef NDEBUG - // The original GEP that identified as a consecutive memory access - // should have only one loop-variant operand. - unsigned NumOfLoopVariantOps = 0; - for (unsigned i = 0; i < NumOperands; ++i) - if (!PSE.getSE()->isLoopInvariant(PSE.getSCEV(Gep->getOperand(i)), - OrigLoop)) - NumOfLoopVariantOps++; - assert(NumOfLoopVariantOps == 1 && - "Consecutive GEP should have only one loop-variant operand"); -#endif - GetElementPtrInst *Gep2 = cast(Gep->clone()); - Gep2->setName("gep.indvar"); + // Construct the initial value of the vector IV in the vector loop preheader + auto CurrIP = Builder.saveIP(); + Builder.SetInsertPoint(LoopVectorPreHeader->getTerminator()); + if (isa(EntryVal)) { + auto *TruncType = cast(EntryVal->getType()); + Step = ConstantInt::getSigned(TruncType, Step->getSExtValue()); + Start = Builder.CreateCast(Instruction::Trunc, Start, TruncType); + } + Value *SplatStart = Builder.CreateVectorSplat(VF, Start); + Value *SteppedStart = getStepVector(SplatStart, 0, Step); + Builder.restoreIP(CurrIP); - // A new GEP is created for a 0-lane value of the first unroll iteration. - // The GEPs for the rest of the unroll iterations are computed below as an - // offset from this GEP. - for (unsigned i = 0; i < NumOperands; ++i) - // We can apply getScalarValue() for all GEP indices. It returns an - // original value for loop-invariant operand and 0-lane for consecutive - // operand. - Gep2->setOperand(i, getScalarValue(Gep->getOperand(i), - 0, /* First unroll iteration */ - 0 /* 0-lane of the vector */ )); - setDebugLocFromInst(Builder, Gep); - Ptr = Builder.Insert(Gep2); + Value *SplatVF = + ConstantVector::getSplat(VF, ConstantInt::getSigned(Start->getType(), + VF * Step->getSExtValue())); + // We may need to add the step a number of times, depending on the unroll + // factor. The last of those goes into the PHI. + PHINode *VecInd = PHINode::Create(SteppedStart->getType(), 2, "vec.ind", + &*LoopVectorBody->getFirstInsertionPt()); + Instruction *LastInduction = VecInd; + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Entry[Part] = LastInduction; + LastInduction = cast( + Builder.CreateAdd(LastInduction, SplatVF, "step.add")); + } + VectorLoopValueMap.initVector(EntryVal, Entry); + if (isa(EntryVal)) + addMetadata(Entry, EntryVal); - } else { // No GEP - setDebugLocFromInst(Builder, Ptr); - Ptr = getScalarValue(Ptr, 0, 0); - } - } else { - // At this point we should vector version of GEP for Gather or Scatter - assert(CreateGatherScatter && "The instruction should be scalarized"); - if (Gep) { - // Vectorizing GEP, across UF parts. We want to get a vector value for base - // and each index that's defined inside the loop, even if it is - // loop-invariant but wasn't hoisted out. Otherwise we want to keep them - // scalar. 
- SmallVector OpsV; - for (Value *Op : Gep->operands()) { - Instruction *SrcInst = dyn_cast(Op); - if (SrcInst && OrigLoop->contains(SrcInst)) - OpsV.push_back(getVectorValue(Op)); - else - OpsV.push_back(VectorParts(UF, Op)); - } - for (unsigned Part = 0; Part < UF; ++Part) { - SmallVector Ops; - Value *GEPBasePtr = OpsV[0][Part]; - for (unsigned i = 1; i < Gep->getNumOperands(); i++) - Ops.push_back(OpsV[i][Part]); - Value *NewGep = Builder.CreateGEP(GEPBasePtr, Ops, "VectorGep"); - cast(NewGep)->setIsInBounds(Gep->isInBounds()); - assert(NewGep->getType()->isVectorTy() && "Expected vector GEP"); + // Move the last step to the end of the latch block. This ensures consistent + // placement of all induction updates. + auto *LoopVectorLatch = LI->getLoopFor(LoopVectorBody)->getLoopLatch(); + auto *Br = cast(LoopVectorLatch->getTerminator()); + auto *ICmp = cast(Br->getCondition()); + LastInduction->moveBefore(ICmp); + LastInduction->setName("vec.ind.next"); - NewGep = - Builder.CreateBitCast(NewGep, VectorType::get(Ptr->getType(), VF)); - VectorGep.push_back(NewGep); - } - } else - VectorGep = getVectorValue(Ptr); - } + VecInd->addIncoming(SteppedStart, LoopVectorPreHeader); + VecInd->addIncoming(LastInduction, LoopVectorLatch); +} - VectorParts Mask = createBlockInMask(Instr->getParent()); - // Handle Stores: - if (SI) { - assert(!Legal->isUniform(SI->getPointerOperand()) && - "We do not allow storing to uniform addresses"); - setDebugLocFromInst(Builder, SI); - // We don't want to update the value in the map as it might be used in - // another expression. So don't use a reference type for "StoredVal". - VectorParts StoredVal = getVectorValue(SI->getValueOperand()); +bool InnerLoopVectorizer::shouldScalarizeInstruction(Instruction *I) const { + return Cost->isScalarAfterVectorization(I, VF) || + Cost->isProfitableToScalarize(I, VF); +} - for (unsigned Part = 0; Part < UF; ++Part) { - Instruction *NewSI = nullptr; - if (CreateGatherScatter) { - Value *MaskPart = Legal->isMaskRequired(SI) ? Mask[Part] : nullptr; - NewSI = Builder.CreateMaskedScatter(StoredVal[Part], VectorGep[Part], - Alignment, MaskPart); - } else { - // Calculate the pointer for the specific unroll-part. - Value *PartPtr = - Builder.CreateGEP(nullptr, Ptr, Builder.getInt32(Part * VF)); +bool InnerLoopVectorizer::needsScalarInduction(Instruction *IV) const { + if (shouldScalarizeInstruction(IV)) + return true; + auto isScalarInst = [&](User *U) -> bool { + auto *I = cast(U); + return (OrigLoop->contains(I) && shouldScalarizeInstruction(I)); + }; + return any_of(IV->users(), isScalarInst); +} - if (Reverse) { - // If we store to reverse consecutive memory locations, then we need - // to reverse the order of elements in the stored value. - StoredVal[Part] = reverseVector(StoredVal[Part]); - // If the address is consecutive but reversed, then the - // wide store needs to start at the last vector element. 
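// Worked example (illustrative, not from the patch) of what
// createVectorIntInductionPHI() above emits for an i32 induction that starts
// at 0 with constant step 1, VF = 4 and UF = 2. Value names follow the ones
// set in the code; block names are the customary ones and may differ:
//
//   %vec.ind      = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, %vector.ph ],
//                                 [ %vec.ind.next, %latch ]
//   %step.add     = add <4 x i32> %vec.ind, <i32 4, i32 4, i32 4, i32 4>
//   ...
//   %vec.ind.next = add <4 x i32> %step.add, <i32 4, i32 4, i32 4, i32 4>
//
// %vec.ind serves unroll part 0 and %step.add serves part 1; the final add is
// renamed "vec.ind.next", moved in front of the latch compare, and fed back
// into the PHI.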
- PartPtr = - Builder.CreateGEP(nullptr, Ptr, Builder.getInt32(-Part * VF)); - PartPtr = - Builder.CreateGEP(nullptr, PartPtr, Builder.getInt32(1 - VF)); - Mask[Part] = reverseVector(Mask[Part]); - } +std::pair +InnerLoopVectorizer::widenIntInduction(bool NeedsScalarIV, PHINode *IV, + TruncInst *Trunc) { - Value *VecPtr = - Builder.CreateBitCast(PartPtr, DataTy->getPointerTo(AddressSpace)); + auto II = Legal->getInductionVars()->find(IV); + assert(II != Legal->getInductionVars()->end() && "IV is not an induction"); - if (Legal->isMaskRequired(SI)) - NewSI = Builder.CreateMaskedStore(StoredVal[Part], VecPtr, Alignment, - Mask[Part]); - else - NewSI = - Builder.CreateAlignedStore(StoredVal[Part], VecPtr, Alignment); - } - addMetadata(NewSI, SI); - } - return; + auto ID = II->second; + assert(IV->getType() == ID.getStartValue()->getType() && "Types must match"); + + // The scalar value to broadcast. This will be derived from the canonical + // induction variable. + Value *ScalarIV = nullptr; + + // The step of the induction. + Value *Step = nullptr; + + // The value from the original loop to which we are mapping the new induction + // variable. + Instruction *EntryVal = Trunc ? cast(Trunc) : IV; + + // True if we have vectorized the induction variable. + auto VectorizedIV = false; + + // If the induction variable has a constant integer step value, go ahead and + // get it now. + if (ID.getConstIntStepValue()) + Step = ID.getConstIntStepValue(); + + // Try to create a new independent vector induction variable. If we can't + // create the phi node, we will splat the scalar induction variable in each + // loop iteration. + if (VF > 1 && Step && !shouldScalarizeInstruction(EntryVal)) { + createVectorIntInductionPHI(ID, EntryVal); + VectorizedIV = true; } - // Handle loads. - assert(LI && "Must have a load instruction"); - setDebugLocFromInst(Builder, LI); - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Instruction *NewLI; - if (CreateGatherScatter) { - Value *MaskPart = Legal->isMaskRequired(LI) ? Mask[Part] : nullptr; - NewLI = Builder.CreateMaskedGather(VectorGep[Part], Alignment, MaskPart, - 0, "wide.masked.gather"); - Entry[Part] = NewLI; + // If we haven't yet vectorized the induction variable, or if we will create + // a scalar one, we need to define the scalar induction variable and step + // values. If we were given a truncation type, truncate the canonical + // induction variable and constant step. Otherwise, derive these values from + // the induction descriptor. + if (!VectorizedIV || NeedsScalarIV) { + if (Trunc) { + auto *TruncType = cast(Trunc->getType()); + assert(Step && "Truncation requires constant integer step"); + auto StepInt = cast(Step)->getSExtValue(); + ScalarIV = Builder.CreateCast(Instruction::Trunc, Induction, TruncType); + Step = ConstantInt::getSigned(TruncType, StepInt); } else { - // Calculate the pointer for the specific unroll-part. - Value *PartPtr = - Builder.CreateGEP(nullptr, Ptr, Builder.getInt32(Part * VF)); - - if (Reverse) { - // If the address is consecutive but reversed, then the - // wide load needs to start at the last vector element. 
- PartPtr = Builder.CreateGEP(nullptr, Ptr, Builder.getInt32(-Part * VF)); - PartPtr = Builder.CreateGEP(nullptr, PartPtr, Builder.getInt32(1 - VF)); - Mask[Part] = reverseVector(Mask[Part]); + ScalarIV = Induction; + auto &DL = OrigLoop->getHeader()->getModule()->getDataLayout(); + if (IV != OldInduction) { + ScalarIV = Builder.CreateSExtOrTrunc(ScalarIV, IV->getType()); + ScalarIV = ID.transform(Builder, ScalarIV, PSE.getSE(), DL); + ScalarIV->setName("offset.idx"); + } + if (!Step) { + SCEVExpander Exp(*PSE.getSE(), DL, "induction"); + Step = Exp.expandCodeFor(ID.getStep(), ID.getStep()->getType(), + &*Builder.GetInsertPoint()); } - - Value *VecPtr = - Builder.CreateBitCast(PartPtr, DataTy->getPointerTo(AddressSpace)); - if (Legal->isMaskRequired(LI)) - NewLI = Builder.CreateMaskedLoad(VecPtr, Alignment, Mask[Part], - UndefValue::get(DataTy), - "wide.masked.load"); - else - NewLI = Builder.CreateAlignedLoad(VecPtr, Alignment, "wide.load"); - Entry[Part] = Reverse ? reverseVector(NewLI) : NewLI; } - addMetadata(NewLI, LI); } - VectorLoopValueMap.initVector(Instr, Entry); + + // If we haven't yet vectorized the induction variable, splat the scalar + // induction variable, and build the necessary step vectors. + if (!VectorizedIV) { + Value *Broadcasted = getBroadcastInstrs(ScalarIV); + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) + Entry[Part] = getStepVector(Broadcasted, VF * Part, Step); + VectorLoopValueMap.initVector(EntryVal, Entry); + if (Trunc) + addMetadata(Entry, Trunc); + } + + // If an induction variable is only used for counting loop iterations or + // calculating addresses, it doesn't need to be widened. + + return std::make_pair(ScalarIV, Step); } -void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr, - bool IfPredicateInstr) { - assert(!Instr->getType()->isAggregateType() && "Can't handle vectors"); - DEBUG(dbgs() << "LV: Scalarizing" - << (IfPredicateInstr ? " and predicating:" : ":") << *Instr - << '\n'); - // Holds vector parameters or scalars, in case of uniform vals. - SmallVector Params; +Value *InnerLoopVectorizer::getStepVector(Value *Val, int StartIdx, Value *Step, + Instruction::BinaryOps BinOp) { + // Create and check the types. + assert(Val->getType()->isVectorTy() && "Must be a vector"); + int VLen = Val->getType()->getVectorNumElements(); - setDebugLocFromInst(Builder, Instr); + Type *STy = Val->getType()->getScalarType(); + assert((STy->isIntegerTy() || STy->isFloatingPointTy()) && + "Induction Step must be an integer or FP"); + assert(Step->getType() == STy && "Step has wrong type"); - // Does this instruction return a value ? - bool IsVoidRetTy = Instr->getType()->isVoidTy(); + SmallVector Indices; - // Initialize a new scalar map entry. - ScalarParts Entry(UF); + if (STy->isIntegerTy()) { + // Create a vector of consecutive numbers from zero to VF. + for (int i = 0; i < VLen; ++i) + Indices.push_back(ConstantInt::get(STy, StartIdx + i)); - VectorParts Cond; - if (IfPredicateInstr) - Cond = createBlockInMask(Instr->getParent()); + // Add the consecutive indices to the vector value. + Constant *Cv = ConstantVector::get(Indices); + assert(Cv->getType() == Val->getType() && "Invalid consecutive vec"); + Step = Builder.CreateVectorSplat(VLen, Step); + assert(Step->getType() == Val->getType() && "Invalid step vec"); + // FIXME: The newly created binary instructions should contain nsw/nuw flags, + // which can be found from the original scalar operations. 
+ Step = Builder.CreateMul(Cv, Step); + return Builder.CreateAdd(Val, Step, "induction"); + } - // Determine the number of scalars we need to generate for each unroll - // iteration. If the instruction is uniform, we only need to generate the - // first lane. Otherwise, we generate all VF values. - unsigned Lanes = Cost->isUniformAfterVectorization(Instr, VF) ? 1 : VF; + // Floating point induction. + assert((BinOp == Instruction::FAdd || BinOp == Instruction::FSub) && + "Binary Opcode should be specified for FP induction"); + // Create a vector of consecutive numbers from zero to VF. + for (int i = 0; i < VLen; ++i) + Indices.push_back(ConstantFP::get(STy, (double)(StartIdx + i))); - // For each vector unroll 'part': - for (unsigned Part = 0; Part < UF; ++Part) { - Entry[Part].resize(VF); - // For each scalar that we create: - for (unsigned Lane = 0; Lane < Lanes; ++Lane) { - - // Start if-block. - Value *Cmp = nullptr; - if (IfPredicateInstr) { - Cmp = Builder.CreateExtractElement(Cond[Part], Builder.getInt32(Lane)); - Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Cmp, - ConstantInt::get(Cmp->getType(), 1)); - } + // Add the consecutive indices to the vector value. + Constant *Cv = ConstantVector::get(Indices); - Instruction *Cloned = Instr->clone(); - if (!IsVoidRetTy) - Cloned->setName(Instr->getName() + ".cloned"); + Step = Builder.CreateVectorSplat(VLen, Step); - // Replace the operands of the cloned instructions with their scalar - // equivalents in the new loop. - for (unsigned op = 0, e = Instr->getNumOperands(); op != e; ++op) { - auto *NewOp = getScalarValue(Instr->getOperand(op), Part, Lane); - Cloned->setOperand(op, NewOp); - } - addNewMetadata(Cloned, Instr); + // Floating point operations had to be 'fast' to enable the induction. + FastMathFlags Flags; + Flags.setUnsafeAlgebra(); - // Place the cloned scalar in the new loop. - Builder.Insert(Cloned); + Value *MulOp = Builder.CreateFMul(Cv, Step); + if (isa(MulOp)) + // Have to check, MulOp may be a constant + cast(MulOp)->setFastMathFlags(Flags); - // Add the cloned scalar to the scalar map entry. - Entry[Part][Lane] = Cloned; + Value *BOp = Builder.CreateBinOp(BinOp, Val, MulOp, "induction"); + if (isa(BOp)) + cast(BOp)->setFastMathFlags(Flags); + return BOp; +} - // If we just cloned a new assumption, add it the assumption cache. - if (auto *II = dyn_cast(Cloned)) - if (II->getIntrinsicID() == Intrinsic::assume) - AC->registerAssumption(II); +void InnerLoopVectorizer::buildScalarSteps(Value *ScalarIV, Value *Step, + Value *EntryVal, unsigned MinPart, + unsigned MaxPart, unsigned MinLane, + unsigned MaxLane) { + + // We shouldn't have to build scalar steps if we aren't vectorizing. + assert(VF > 1 && "VF should be greater than one"); + + // Get the value type and ensure it and the step have the same integer type. + Type *ScalarIVTy = ScalarIV->getType()->getScalarType(); + assert(ScalarIVTy->isIntegerTy() && ScalarIVTy == Step->getType() && + "Val and Step should have the same integer type"); + + ScalarParts &Entry = VectorLoopValueMap.getOrCreateScalar(EntryVal, VF); - // End if-block. - if (IfPredicateInstr) - PredicatedInstructions.push_back(std::make_pair(Cloned, Cmp)); + // Compute the scalar steps and save the results in VectorLoopValueMap. 
+ for (unsigned Part = MinPart; Part <= MaxPart; ++Part) { + Entry[Part].resize(VF); + for (unsigned Lane = MinLane; Lane <= MaxLane; ++Lane) { + auto *StartIdx = ConstantInt::get(ScalarIVTy, VF * Part + Lane); + auto *Mul = Builder.CreateMul(StartIdx, Step); + auto *Add = Builder.CreateAdd(ScalarIV, Mul); + Entry[Part][Lane] = Add; } } - VectorLoopValueMap.initScalar(Instr, Entry); } -PHINode *InnerLoopVectorizer::createInductionVariable(Loop *L, Value *Start, - Value *End, Value *Step, - Instruction *DL) { - BasicBlock *Header = L->getHeader(); - BasicBlock *Latch = L->getLoopLatch(); - // As we're just creating this loop, it's possible no latch exists - // yet. If so, use the header as this will be a single block loop. - if (!Latch) - Latch = Header; +int LoopVectorizationLegality::isConsecutivePtr(Value *Ptr) { - IRBuilder<> Builder(&*Header->getFirstInsertionPt()); - Instruction *OldInst = getDebugLocFromInstOrOperands(OldInduction); - setDebugLocFromInst(Builder, OldInst); - auto *Induction = Builder.CreatePHI(Start->getType(), 2, "index"); - - Builder.SetInsertPoint(Latch->getTerminator()); - setDebugLocFromInst(Builder, OldInst); - - // Create i+1 and fill the PHINode. - Value *Next = Builder.CreateAdd(Induction, Step, "index.next"); - Induction->addIncoming(Start, L->getLoopPreheader()); - Induction->addIncoming(Next, Latch); - // Create the compare. - Value *ICmp = Builder.CreateICmpEQ(Next, End); - Builder.CreateCondBr(ICmp, L->getExitBlock(), Header); + const ValueToValueMap &Strides = getSymbolicStrides() ? *getSymbolicStrides() : + ValueToValueMap(); - // Now we have two terminators. Remove the old one from the block. - Latch->getTerminator()->eraseFromParent(); + int Stride = getPtrStride(PSE, Ptr, TheLoop, Strides, true, false); + if (Stride == 1 || Stride == -1) + return Stride; + return 0; +} - return Induction; +bool LoopVectorizationLegality::isUniform(Value *V) { + return LAI->isUniform(V); } -Value *InnerLoopVectorizer::getOrCreateTripCount(Loop *L) { - if (TripCount) - return TripCount; +void InnerLoopVectorizer::constructVectorValue(Value *V, unsigned Part, + unsigned Lane) { + assert(V != Induction && "The new induction variable should not be used."); + assert(!V->getType()->isVectorTy() && "Can't widen a vector"); + assert(!V->getType()->isVoidTy() && "Type does not produce a value"); - IRBuilder<> Builder(L->getLoopPreheader()->getTerminator()); - // Find the loop boundaries. - ScalarEvolution *SE = PSE.getSE(); - const SCEV *BackedgeTakenCount = PSE.getBackedgeTakenCount(); - assert(BackedgeTakenCount != SE->getCouldNotCompute() && - "Invalid loop count"); + if (!VectorLoopValueMap.hasVector(V)) { + VectorParts Entry(UF); + for (unsigned P = 0; P < UF; ++P) + Entry[P] = nullptr; + VectorLoopValueMap.initVector(V, Entry); + } - Type *IdxTy = Legal->getWidestInductionType(); + VectorParts &Parts = VectorLoopValueMap.VectorMapStorage[V]; - // The exit count might have the type of i64 while the phi is i32. This can - // happen if we have an induction variable that is sign extended before the - // compare. The only way that we get a backedge taken count is that the - // induction variable was signed and as such will not overflow. In such a case - // truncation is legal. 
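// Worked example (illustrative, not from the patch) for buildScalarSteps()
// above: each requested (Part, Lane) pair receives the scalar value
//   ScalarIV + (VF * Part + Lane) * Step.
// With VF = 4, UF = 2, Step = 1 and the full part/lane ranges, the map entry
// for EntryVal therefore holds
//   Part 0: ScalarIV + 0, ScalarIV + 1, ScalarIV + 2, ScalarIV + 3
//   Part 1: ScalarIV + 4, ScalarIV + 5, ScalarIV + 6, ScalarIV + 7
// whereas a recipe whose designated lanes were restricted to lane zero (e.g.
// via designateLaneZero() above) only generates the lane-0 values, as needed
// for uniform users.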
- if (BackedgeTakenCount->getType()->getPrimitiveSizeInBits() > - IdxTy->getPrimitiveSizeInBits()) - BackedgeTakenCount = SE->getTruncateOrNoop(BackedgeTakenCount, IdxTy); - BackedgeTakenCount = SE->getNoopOrZeroExtend(BackedgeTakenCount, IdxTy); + assert(VectorLoopValueMap.hasScalar(V) && "Expected scalar values to exist"); - // Get the total trip count from the count by adding 1. - const SCEV *ExitCount = SE->getAddExpr( - BackedgeTakenCount, SE->getOne(BackedgeTakenCount->getType())); + auto *ScalarInst = cast(getScalarValue(V, Part, Lane)); - const DataLayout &DL = L->getHeader()->getModule()->getDataLayout(); + Value *VectorValue = nullptr; - // Expand the trip count and place the new instructions in the preheader. - // Notice that the pre-header does not change, only the loop body. - SCEVExpander Exp(*SE, DL, "induction"); + // If we're constructing lane 0, start from undef; otherwise, start from the + // last value created. + if (Lane == 0) + VectorValue = UndefValue::get(VectorType::get(V->getType(), VF)); + else + VectorValue = Parts[Part]; - // Count holds the overall loop count (N). - TripCount = Exp.expandCodeFor(ExitCount, ExitCount->getType(), - L->getLoopPreheader()->getTerminator()); + VectorValue = Builder.CreateInsertElement(VectorValue, ScalarInst, + Builder.getInt32(Lane)); + Parts[Part] = VectorValue; +} - if (TripCount->getType()->isPointerTy()) - TripCount = - CastInst::CreatePointerCast(TripCount, IdxTy, "exitcount.ptrcnt.to.int", - L->getLoopPreheader()->getTerminator()); +const InnerLoopVectorizer::VectorParts & +InnerLoopVectorizer::getVectorValue(Value *V) { + assert(V != Induction && "The new induction variable should not be used."); + assert(!V->getType()->isVectorTy() && "Can't widen a vector"); + assert(!V->getType()->isVoidTy() && "Type does not produce a value"); - return TripCount; -} + // If we have a stride that is replaced by one, do it here. + if (Legal->hasStride(V)) + V = ConstantInt::get(V->getType(), 1); -Value *InnerLoopVectorizer::getOrCreateVectorTripCount(Loop *L) { - if (VectorTripCount) - return VectorTripCount; + // If we have this scalar in the map, return it. + if (VectorLoopValueMap.hasVector(V)) + return VectorLoopValueMap.VectorMapStorage[V]; - Value *TC = getOrCreateTripCount(L); - IRBuilder<> Builder(L->getLoopPreheader()->getTerminator()); + // If the value has not been vectorized, check if it has been scalarized + // instead. If it has been scalarized, and we actually need the value in + // vector form, we will construct the vector values on demand. + if (VectorLoopValueMap.hasScalar(V)) { - // Now we need to generate the expression for the part of the loop that the - // vectorized body will execute. This is equal to N - (N % Step) if scalar - // iterations are not required for correctness, or N - Step, otherwise. Step - // is equal to the vectorization factor (number of SIMD elements) times the - // unroll factor (number of SIMD instructions). - Constant *Step = ConstantInt::get(TC->getType(), VF * UF); - Value *R = Builder.CreateURem(TC, Step, "n.mod.vf"); + // Initialize a new vector map entry. + VectorParts Entry(UF); - // If there is a non-reversed interleaved group that may speculatively access - // memory out-of-bounds, we need to ensure that there will be at least one - // iteration of the scalar epilogue loop. Thus, if the step evenly divides - // the trip count, we set the remainder to be equal to the step. 
If the step - // does not evenly divide the trip count, no adjustment is necessary since - // there will already be scalar iterations. Note that the minimum iterations - // check ensures that N >= Step. - if (VF > 1 && Legal->requiresScalarEpilogue()) { - auto *IsZero = Builder.CreateICmpEQ(R, ConstantInt::get(R->getType(), 0)); - R = Builder.CreateSelect(IsZero, Step, R); - } + // If we've scalarized a value, that value should be an instruction. + auto *I = cast(V); - VectorTripCount = Builder.CreateSub(TC, R, "n.vec"); + // If we aren't vectorizing, we can just copy the scalar map values over to + // the vector map. + if (VF == 1) { + for (unsigned Part = 0; Part < UF; ++Part) + Entry[Part] = getScalarValue(V, Part, 0); + return VectorLoopValueMap.initVector(V, Entry); + } - return VectorTripCount; -} + // Get the last scalar instruction we generated for V. If the value is + // known to be uniform after vectorization, this corresponds to lane zero + // of the last unroll iteration. Otherwise, the last instruction is the one + // we created for the last vector lane of the last unroll iteration. + unsigned LastLane = Cost->isUniformAfterVectorization(I, VF) ? 0 : VF - 1; + auto *LastInst = cast(getScalarValue(V, UF - 1, LastLane)); -void InnerLoopVectorizer::emitMinimumIterationCountCheck(Loop *L, - BasicBlock *Bypass) { - Value *Count = getOrCreateTripCount(L); - BasicBlock *BB = L->getLoopPreheader(); - IRBuilder<> Builder(BB->getTerminator()); + // Set the insert point after the last scalarized instruction. This ensures + // the insertelement sequence will directly follow the scalar definitions. + auto OldIP = Builder.saveIP(); + auto NextInsertionPoint = std::next(BasicBlock::iterator(LastInst)); + if (NextInsertionPoint != LastInst->getParent()->end()) + Builder.SetInsertPoint(&*NextInsertionPoint); + else + Builder.SetInsertPoint(LastInst->getParent()); - // Generate code to check that the loop's trip count that we computed by - // adding one to the backedge-taken count will not overflow. - Value *CheckMinIters = Builder.CreateICmpULT( - Count, ConstantInt::get(Count->getType(), VF * UF), "min.iters.check"); + // However, if we are vectorizing, we need to construct the vector values. + // If the value is known to be uniform after vectorization, we can just + // broadcast the scalar value corresponding to lane zero for each unroll + // iteration. Otherwise, we construct the vector values using insertelement + // instructions. Since the resulting vectors are stored in + // VectorLoopValueMap, we will only generate the insertelements once. + for (unsigned Part = 0; Part < UF; ++Part) { + Value *VectorValue = nullptr; + if (Cost->isUniformAfterVectorization(I, VF)) { + VectorValue = getBroadcastInstrs(getScalarValue(V, Part, 0)); + } else { + VectorValue = UndefValue::get(VectorType::get(V->getType(), VF)); + for (unsigned Lane = 0; Lane < VF; ++Lane) + VectorValue = Builder.CreateInsertElement( + VectorValue, getScalarValue(V, Part, Lane), + Builder.getInt32(Lane)); + } + Entry[Part] = VectorValue; + } + Builder.restoreIP(OldIP); + return VectorLoopValueMap.initVector(V, Entry); + } - BasicBlock *NewBB = - BB->splitBasicBlock(BB->getTerminator(), "min.iters.checked"); - // Update dominator tree immediately if the generated block is a - // LoopBypassBlock because SCEV expansions to generate loop bypass - // checks may query it before the current function is finished. 
- DT->addNewBlock(NewBB, BB); - if (L->getParentLoop()) - L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI); - ReplaceInstWithInst(BB->getTerminator(), - BranchInst::Create(Bypass, NewBB, CheckMinIters)); - LoopBypassBlocks.push_back(BB); + // If this scalar is unknown, assume that it is a constant or that it is + // loop invariant. Broadcast V and save the value for future uses. + Value *B = getBroadcastInstrs(V); + return VectorLoopValueMap.initVector(V, VectorParts(UF, B)); } -void InnerLoopVectorizer::emitVectorLoopEnteredCheck(Loop *L, - BasicBlock *Bypass) { - Value *TC = getOrCreateVectorTripCount(L); - BasicBlock *BB = L->getLoopPreheader(); - IRBuilder<> Builder(BB->getTerminator()); - - // Now, compare the new count to zero. If it is zero skip the vector loop and - // jump to the scalar loop. - Value *Cmp = Builder.CreateICmpEQ(TC, Constant::getNullValue(TC->getType()), - "cmp.zero"); +Value *InnerLoopVectorizer::getScalarValue(Value *V, unsigned Part, + unsigned Lane) { - // Generate code to check that the loop's trip count that we computed by - // adding one to the backedge-taken count will not overflow. - BasicBlock *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph"); - // Update dominator tree immediately if the generated block is a - // LoopBypassBlock because SCEV expansions to generate loop bypass - // checks may query it before the current function is finished. - DT->addNewBlock(NewBB, BB); - if (L->getParentLoop()) - L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI); - ReplaceInstWithInst(BB->getTerminator(), - BranchInst::Create(Bypass, NewBB, Cmp)); - LoopBypassBlocks.push_back(BB); -} + // If the value is not an instruction contained in the loop, it should + // already be scalar. + if (OrigLoop->isLoopInvariant(V)) + return V; -void InnerLoopVectorizer::emitSCEVChecks(Loop *L, BasicBlock *Bypass) { - BasicBlock *BB = L->getLoopPreheader(); + assert(Lane > 0 ? + !Cost->isUniformAfterVectorization(cast(V), VF) + : true && "Uniform values only have lane zero"); - // Generate the code to check that the SCEV assumptions that we made. - // We want the new basic block to start at the first instruction in a - // sequence of instructions that form a check. - SCEVExpander Exp(*PSE.getSE(), Bypass->getModule()->getDataLayout(), - "scev.check"); - Value *SCEVCheck = - Exp.expandCodeForPredicate(&PSE.getUnionPredicate(), BB->getTerminator()); + // If the value from the original loop has not been vectorized, it is + // represented by UF x VF scalar values in the new loop. Return the requested + // scalar value. + if (VectorLoopValueMap.hasScalar(V)) + return VectorLoopValueMap.ScalarMapStorage[V][Part][Lane]; - if (auto *C = dyn_cast(SCEVCheck)) - if (C->isZero()) - return; + // If the value has not been scalarized, get its entry in VectorLoopValueMap + // for the given unroll part. If this entry is not a vector type (i.e., the + // vectorization factor is one), there is no need to generate an + // extractelement instruction. + auto *U = getVectorValue(V)[Part]; + if (!U->getType()->isVectorTy()) { + assert(VF == 1 && "Value not scalarized has non-vector type"); + return U; + } - // Create a new block containing the stride check. - BB->setName("vector.scevcheck"); - auto *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph"); - // Update dominator tree immediately if the generated block is a - // LoopBypassBlock because SCEV expansions to generate loop bypass - // checks may query it before the current function is finished. 
- DT->addNewBlock(NewBB, BB); - if (L->getParentLoop()) - L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI); - ReplaceInstWithInst(BB->getTerminator(), - BranchInst::Create(Bypass, NewBB, SCEVCheck)); - LoopBypassBlocks.push_back(BB); - AddedSafetyChecks = true; + // Otherwise, the value from the original loop has been vectorized and is + // represented by UF vector values. Extract and return the requested scalar + // value from the appropriate vector lane. + return Builder.CreateExtractElement(U, Builder.getInt32(Lane)); } -void InnerLoopVectorizer::emitMemRuntimeChecks(Loop *L, BasicBlock *Bypass) { - BasicBlock *BB = L->getLoopPreheader(); +Value *InnerLoopVectorizer::reverseVector(Value *Vec) { + assert(Vec->getType()->isVectorTy() && "Invalid type"); + SmallVector ShuffleMask; + for (unsigned i = 0; i < VF; ++i) + ShuffleMask.push_back(Builder.getInt32(VF - i - 1)); - // Generate the code that checks in runtime if arrays overlap. We put the - // checks into a separate block to make the more common case of few elements - // faster. - Instruction *FirstCheckInst; - Instruction *MemRuntimeCheck; - std::tie(FirstCheckInst, MemRuntimeCheck) = - Legal->getLAI()->addRuntimeChecks(BB->getTerminator()); - if (!MemRuntimeCheck) - return; + return Builder.CreateShuffleVector(Vec, UndefValue::get(Vec->getType()), + ConstantVector::get(ShuffleMask), + "reverse"); +} - // Create a new block containing the memory check. - BB->setName("vector.memcheck"); - auto *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph"); - // Update dominator tree immediately if the generated block is a - // LoopBypassBlock because SCEV expansions to generate loop bypass - // checks may query it before the current function is finished. - DT->addNewBlock(NewBB, BB); - if (L->getParentLoop()) - L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI); - ReplaceInstWithInst(BB->getTerminator(), - BranchInst::Create(Bypass, NewBB, MemRuntimeCheck)); - LoopBypassBlocks.push_back(BB); - AddedSafetyChecks = true; +// Try to vectorize the interleave group that \p Instr belongs to. +// +// E.g. Translate following interleaved load group (factor = 3): +// for (i = 0; i < N; i+=3) { +// R = Pic[i]; // Member of index 0 +// G = Pic[i+1]; // Member of index 1 +// B = Pic[i+2]; // Member of index 2 +// ... // do something to R, G, B +// } +// To: +// %wide.vec = load <12 x i32> ; Read 4 tuples of R,G,B +// %R.vec = shuffle %wide.vec, undef, <0, 3, 6, 9> ; R elements +// %G.vec = shuffle %wide.vec, undef, <1, 4, 7, 10> ; G elements +// %B.vec = shuffle %wide.vec, undef, <2, 5, 8, 11> ; B elements +// +// Or translate following interleaved store group (factor = 3): +// for (i = 0; i < N; i+=3) { +// ... do something to R, G, B +// Pic[i] = R; // Member of index 0 +// Pic[i+1] = G; // Member of index 1 +// Pic[i+2] = B; // Member of index 2 +// } +// To: +// %R_G.vec = shuffle %R.vec, %G.vec, <0, 1, 2, ..., 7> +// %B_U.vec = shuffle %B.vec, undef, <0, 1, 2, 3, u, u, u, u> +// %interleaved.vec = shuffle %R_G.vec, %B_U.vec, +// <0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11> ; Interleave R,G,B elements +// store <12 x i32> %interleaved.vec ; Write 4 tuples of R,G,B +void InnerLoopVectorizer::vectorizeInterleaveGroup(Instruction *Instr) { + const InterleaveGroup *Group = Legal->getInterleavedAccessGroup(Instr); + assert(Group && "Fail to get an interleaved access group."); - // We currently don't use LoopVersioning for the actual loop cloning but we - // still use it to add the noalias metadata. 
- LVer = llvm::make_unique(*Legal->getLAI(), OrigLoop, LI, DT, - PSE.getSE()); - LVer->prepareNoAliasMetadata(); -} + // Skip if current instruction is not the insert position. + if (Instr != Group->getInsertPos()) + return; -void InnerLoopVectorizer::createEmptyLoop() { - /* - In this function we generate a new loop. The new loop will contain - the vectorized instructions while the old loop will continue to run the - scalar remainder. + Value *Ptr = getPointerOperand(Instr); - [ ] <-- loop iteration number check. - / | - / v - | [ ] <-- vector loop bypass (may consist of multiple blocks). - | / | - | / v - || [ ] <-- vector pre header. - |/ | - | v - | [ ] \ - | [ ]_| <-- vector loop. - | | - | v - | -[ ] <--- middle-block. - | / | - | / v - -|- >[ ] <--- new preheader. - | | - | v - | [ ] \ - | [ ]_| <-- old scalar loop to handle remainder. - \ | - \ v - >[ ] <-- exit block. - ... - */ + // Prepare for the vector type of the interleaved load/store. + Type *ScalarTy = getMemInstValueType(Instr); + unsigned InterleaveFactor = Group->getFactor(); + Type *VecTy = VectorType::get(ScalarTy, InterleaveFactor * VF); + Type *PtrTy = VecTy->getPointerTo(getMemInstAddressSpace(Instr)); - BasicBlock *OldBasicBlock = OrigLoop->getHeader(); - BasicBlock *VectorPH = OrigLoop->getLoopPreheader(); - BasicBlock *ExitBlock = OrigLoop->getExitBlock(); - assert(VectorPH && "Invalid loop structure"); - assert(ExitBlock && "Must have an exit block"); + // Prepare for the new pointers. + setDebugLocFromInst(Builder, Ptr); + SmallVector NewPtrs; + unsigned Index = Group->getIndex(Instr); - // Some loops have a single integer induction variable, while other loops - // don't. One example is c++ iterators that often have multiple pointer - // induction variables. In the code below we also support a case where we - // don't have a single induction variable. - // - // We try to obtain an induction variable from the original loop as hard - // as possible. However if we don't find one that: - // - is an integer - // - counts from zero, stepping by one - // - is the size of the widest induction variable type - // then we create a new one. - OldInduction = Legal->getPrimaryInduction(); - Type *IdxTy = Legal->getWidestInductionType(); + // If the group is reverse, adjust the index to refer to the last vector lane + // instead of the first. We adjust the index from the first vector lane, + // rather than directly getting the pointer for lane VF - 1, because the + // pointer operand of the interleaved access is supposed to be uniform. For + // uniform instructions, we're only required to generate a value for the + // first vector lane in each unroll iteration. + if (Group->isReverse()) + Index += (VF - 1) * Group->getFactor(); - // Split the single block loop into the two loop structure described above. - BasicBlock *VecBody = - VectorPH->splitBasicBlock(VectorPH->getTerminator(), "vector.body"); - BasicBlock *MiddleBlock = - VecBody->splitBasicBlock(VecBody->getTerminator(), "middle.block"); - BasicBlock *ScalarPH = - MiddleBlock->splitBasicBlock(MiddleBlock->getTerminator(), "scalar.ph"); + for (unsigned Part = 0; Part < UF; Part++) { + Value *NewPtr = getScalarValue(Ptr, Part, 0); - // Create and register the new vector loop. - Loop *Lp = new Loop(); - Loop *ParentLoop = OrigLoop->getParentLoop(); + // Notice current instruction could be any index. Need to adjust the address + // to the member of index 0. + // + // E.g. 
a = A[i+1];   // Member of index 1 (Current instruction)
+  //       b = A[i];     // Member of index 0
+  // The current pointer points to A[i+1]; adjust it to A[i].
+  //
+  // E.g.  A[i+1] = a;   // Member of index 1
+  //       A[i]   = b;   // Member of index 0
+  //       A[i+2] = c;   // Member of index 2 (Current instruction)
+  // The current pointer points to A[i+2]; adjust it to A[i].
+    NewPtr = Builder.CreateGEP(NewPtr, Builder.getInt32(-Index));
-  // Insert the new loop into the loop nest and register the new basic blocks
-  // before calling any utilities such as SCEV that require valid LoopInfo.
-  if (ParentLoop) {
-    ParentLoop->addChildLoop(Lp);
-    ParentLoop->addBasicBlockToLoop(ScalarPH, *LI);
-    ParentLoop->addBasicBlockToLoop(MiddleBlock, *LI);
-  } else {
-    LI->addTopLevelLoop(Lp);
+    // Cast to the vector pointer type.
+    NewPtrs.push_back(Builder.CreateBitCast(NewPtr, PtrTy));
  }
-  Lp->addBasicBlockToLoop(VecBody, *LI);
-  // Find the loop boundaries.
-  Value *Count = getOrCreateTripCount(Lp);
-
-  Value *StartIdx = ConstantInt::get(IdxTy, 0);
+  setDebugLocFromInst(Builder, Instr);
+  Value *UndefVec = UndefValue::get(VecTy);
-  // We need to test whether the backedge-taken count is uint##_max. Adding one
-  // to it will cause overflow and an incorrect loop trip count in the vector
-  // body. In case of overflow we want to directly jump to the scalar remainder
-  // loop.
-  emitMinimumIterationCountCheck(Lp, ScalarPH);
-  // Now, compare the new count to zero. If it is zero skip the vector loop and
-  // jump to the scalar loop.
-  emitVectorLoopEnteredCheck(Lp, ScalarPH);
-  // Generate the code to check any assumptions that we've made for SCEV
-  // expressions.
-  emitSCEVChecks(Lp, ScalarPH);
+  // Vectorize the interleaved load group.
+  if (isa<LoadInst>(Instr)) {
-  // Generate the code that checks in runtime if arrays overlap. We put the
-  // checks into a separate block to make the more common case of few elements
-  // faster.
-  emitMemRuntimeChecks(Lp, ScalarPH);
+    // For each unroll part, create a wide load for the group.
+    SmallVector<Value *, 2> NewLoads;
+    for (unsigned Part = 0; Part < UF; Part++) {
+      auto *NewLoad = Builder.CreateAlignedLoad(
+          NewPtrs[Part], Group->getAlignment(), "wide.vec");
+      addMetadata(NewLoad, Instr);
+      NewLoads.push_back(NewLoad);
+    }
-  // Generate the induction variable.
-  // The loop step is equal to the vectorization factor (num of SIMD elements)
-  // times the unroll factor (num of SIMD instructions).
-  Value *CountRoundDown = getOrCreateVectorTripCount(Lp);
-  Constant *Step = ConstantInt::get(IdxTy, VF * UF);
-  Induction =
-      createInductionVariable(Lp, StartIdx, CountRoundDown, Step,
-                              getDebugLocFromInstOrOperands(OldInduction));
+    // For each member in the group, shuffle out the appropriate data from the
+    // wide loads.
+    for (unsigned I = 0; I < InterleaveFactor; ++I) {
+      Instruction *Member = Group->getMember(I);
-  // We are going to resume the execution of the scalar loop.
-  // Go over all of the induction variables that we found and fix the
-  // PHIs that are left in the scalar version of the loop.
-  // The starting values of PHI nodes depend on the counter of the last
-  // iteration in the vectorized loop.
-  // If we come from a bypass edge then we need to start from the original
-  // start value.
+      // Skip the gaps in the group.
+      if (!Member)
+        continue;
-  // This variable saves the new starting index for the scalar loop. It is used
-  // to test if there are any tail iterations left once the vector loop has
-  // completed.
- LoopVectorizationLegality::InductionList *List = Legal->getInductionVars(); - for (auto &InductionEntry : *List) { - PHINode *OrigPhi = InductionEntry.first; - InductionDescriptor II = InductionEntry.second; + VectorParts Entry(UF); + Constant *StrideMask = createStrideMask(Builder, I, InterleaveFactor, VF); + for (unsigned Part = 0; Part < UF; Part++) { + Value *StridedVec = Builder.CreateShuffleVector( + NewLoads[Part], UndefVec, StrideMask, "strided.vec"); - // Create phi nodes to merge from the backedge-taken check block. - PHINode *BCResumeVal = PHINode::Create( - OrigPhi->getType(), 3, "bc.resume.val", ScalarPH->getTerminator()); - Value *&EndValue = IVEndValues[OrigPhi]; - if (OrigPhi == OldInduction) { - // We know what the end value is. - EndValue = CountRoundDown; - } else { - IRBuilder<> B(LoopBypassBlocks.back()->getTerminator()); - Type *StepType = II.getStep()->getType(); - Instruction::CastOps CastOp = - CastInst::getCastOpcode(CountRoundDown, true, StepType, true); - Value *CRD = B.CreateCast(CastOp, CountRoundDown, StepType, "cast.crd"); - const DataLayout &DL = OrigLoop->getHeader()->getModule()->getDataLayout(); - EndValue = II.transform(B, CRD, PSE.getSE(), DL); - EndValue->setName("ind.end"); - } + // If this member has different type, cast the result type. + if (Member->getType() != ScalarTy) { + VectorType *OtherVTy = VectorType::get(Member->getType(), VF); + StridedVec = Builder.CreateBitOrPointerCast(StridedVec, OtherVTy); + } - // The new PHI merges the original incoming value, in case of a bypass, - // or the value at the end of the vectorized loop. - BCResumeVal->addIncoming(EndValue, MiddleBlock); - - // Fix the scalar body counter (PHI node). - unsigned BlockIdx = OrigPhi->getBasicBlockIndex(ScalarPH); - - // The old induction's phi node in the scalar body needs the truncated - // value. - for (BasicBlock *BB : LoopBypassBlocks) - BCResumeVal->addIncoming(II.getStartValue(), BB); - OrigPhi->setIncomingValue(BlockIdx, BCResumeVal); + Entry[Part] = + Group->isReverse() ? reverseVector(StridedVec) : StridedVec; + } + VectorLoopValueMap.initVector(Member, Entry); + } + return; } - // Add a check in the middle block to see if we have completed - // all of the iterations in the first vector loop. - // If (N - N%VF) == N, then we *don't* need to run the remainder. - Value *CmpN = - CmpInst::Create(Instruction::ICmp, CmpInst::ICMP_EQ, Count, - CountRoundDown, "cmp.n", MiddleBlock->getTerminator()); - ReplaceInstWithInst(MiddleBlock->getTerminator(), - BranchInst::Create(ExitBlock, ScalarPH, CmpN)); - - // Get ready to start creating new instructions into the vectorized body. - Builder.SetInsertPoint(&*VecBody->getFirstInsertionPt()); - - // Save the state. - LoopVectorPreHeader = Lp->getLoopPreheader(); - LoopScalarPreHeader = ScalarPH; - LoopMiddleBlock = MiddleBlock; - LoopExitBlock = ExitBlock; - LoopVectorBody = VecBody; - LoopScalarBody = OldBasicBlock; - - // Keep all loop hints from the original loop on the vector loop (we'll - // replace the vectorizer-specific hints below). - if (MDNode *LID = OrigLoop->getLoopID()) - Lp->setLoopID(LID); - - LoopVectorizeHints Hints(Lp, true, *ORE); - Hints.setAlreadyVectorized(); -} + // The sub vector type for current instruction. + VectorType *SubVT = VectorType::get(ScalarTy, VF); -// Fix up external users of the induction variable. At this point, we are -// in LCSSA form, with all external PHIs that use the IV having one input value, -// coming from the remainder loop. 
We need those PHIs to also have a correct -// value for the IV when arriving directly from the middle block. -void InnerLoopVectorizer::fixupIVUsers(PHINode *OrigPhi, - const InductionDescriptor &II, - Value *CountRoundDown, Value *EndValue, - BasicBlock *MiddleBlock) { - // There are two kinds of external IV usages - those that use the value - // computed in the last iteration (the PHI) and those that use the penultimate - // value (the value that feeds into the phi from the loop latch). - // We allow both, but they, obviously, have different values. + // Vectorize the interleaved store group. + for (unsigned Part = 0; Part < UF; Part++) { + // Collect the stored vector from each member. + SmallVector StoredVecs; + for (unsigned i = 0; i < InterleaveFactor; i++) { + // Interleaved store group doesn't allow a gap, so each index has a member + Instruction *Member = Group->getMember(i); + assert(Member && "Fail to get a member from an interleaved store group"); - assert(OrigLoop->getExitBlock() && "Expected a single exit block"); + Value *StoredVec = + getVectorValue(cast(Member)->getValueOperand())[Part]; + if (Group->isReverse()) + StoredVec = reverseVector(StoredVec); - DenseMap MissingVals; + // If this member has different type, cast it to an unified type. + if (StoredVec->getType() != SubVT) + StoredVec = Builder.CreateBitOrPointerCast(StoredVec, SubVT); - // An external user of the last iteration's value should see the value that - // the remainder loop uses to initialize its own IV. - Value *PostInc = OrigPhi->getIncomingValueForBlock(OrigLoop->getLoopLatch()); - for (User *U : PostInc->users()) { - Instruction *UI = cast(U); - if (!OrigLoop->contains(UI)) { - assert(isa(UI) && "Expected LCSSA form"); - MissingVals[UI] = EndValue; + StoredVecs.push_back(StoredVec); } - } - // An external user of the penultimate value need to see EndValue - Step. - // The simplest way to get this is to recompute it from the constituent SCEVs, - // that is Start + (Step * (CRD - 1)). - for (User *U : OrigPhi->users()) { - auto *UI = cast(U); - if (!OrigLoop->contains(UI)) { - const DataLayout &DL = - OrigLoop->getHeader()->getModule()->getDataLayout(); - assert(isa(UI) && "Expected LCSSA form"); + // Concatenate all vectors into a wide vector. + Value *WideVec = concatenateVectors(Builder, StoredVecs); - IRBuilder<> B(MiddleBlock->getTerminator()); - Value *CountMinusOne = B.CreateSub( - CountRoundDown, ConstantInt::get(CountRoundDown->getType(), 1)); - Value *CMO = B.CreateSExtOrTrunc(CountMinusOne, II.getStep()->getType(), - "cast.cmo"); - Value *Escape = II.transform(B, CMO, PSE.getSE(), DL); - Escape->setName("ind.escape"); - MissingVals[UI] = Escape; - } - } + // Interleave the elements in the wide vector. + Constant *IMask = createInterleaveMask(Builder, VF, InterleaveFactor); + Value *IVec = Builder.CreateShuffleVector(WideVec, UndefVec, IMask, + "interleaved.vec"); - for (auto &I : MissingVals) { - PHINode *PHI = cast(I.first); - // One corner case we have to handle is two IVs "chasing" each-other, - // that is %IV2 = phi [...], [ %IV1, %latch ] - // In this case, if IV1 has an external use, we need to avoid adding both - // "last value of IV1" and "penultimate value of IV2". So, verify that we - // don't already have an incoming value for the middle block. 
- if (PHI->getBasicBlockIndex(MiddleBlock) == -1) - PHI->addIncoming(I.second, MiddleBlock); + Instruction *NewStoreInstr = + Builder.CreateAlignedStore(IVec, NewPtrs[Part], Group->getAlignment()); + addMetadata(NewStoreInstr, Instr); } } -namespace { -struct CSEDenseMapInfo { - static bool canHandle(Instruction *I) { - return isa(I) || isa(I) || - isa(I) || isa(I); - } - static inline Instruction *getEmptyKey() { - return DenseMapInfo::getEmptyKey(); - } - static inline Instruction *getTombstoneKey() { - return DenseMapInfo::getTombstoneKey(); - } - static unsigned getHashValue(Instruction *I) { - assert(canHandle(I) && "Unknown instruction!"); - return hash_combine(I->getOpcode(), hash_combine_range(I->value_op_begin(), - I->value_op_end())); - } - static bool isEqual(Instruction *LHS, Instruction *RHS) { - if (LHS == getEmptyKey() || RHS == getEmptyKey() || - LHS == getTombstoneKey() || RHS == getTombstoneKey()) - return LHS == RHS; - return LHS->isIdenticalTo(RHS); - } -}; -} +void InnerLoopVectorizer::vectorizeMemoryInstruction(Instruction *Instr) { + // Attempt to issue a wide load. + LoadInst *LI = dyn_cast(Instr); + StoreInst *SI = dyn_cast(Instr); -///\brief Perform cse of induction variable instructions. -static void cse(BasicBlock *BB) { - // Perform simple cse. - SmallDenseMap CSEMap; - for (BasicBlock::iterator I = BB->begin(), E = BB->end(); I != E;) { - Instruction *In = &*I++; + assert((LI || SI) && "Invalid Load/Store instruction"); - if (!CSEDenseMapInfo::canHandle(In)) - continue; + LoopVectorizationCostModel::InstWidening Decision = + Cost->getWideningDecision(Instr, VF); + assert(Decision != LoopVectorizationCostModel::CM_Unknown && + "CM decision should be taken at this point"); + if (Decision == LoopVectorizationCostModel::CM_Interleave) + return vectorizeInterleaveGroup(Instr); - // Check if we can replace this instruction with any of the - // visited instructions. - if (Instruction *V = CSEMap.lookup(In)) { - In->replaceAllUsesWith(V); - In->eraseFromParent(); - continue; - } + Type *ScalarDataTy = getMemInstValueType(Instr); + Type *DataTy = VectorType::get(ScalarDataTy, VF); + Value *Ptr = getPointerOperand(Instr); + unsigned Alignment = getMemInstAlignment(Instr); + // An alignment of 0 means target abi alignment. We need to use the scalar's + // target abi alignment in such a case. + const DataLayout &DL = Instr->getModule()->getDataLayout(); + if (!Alignment) + Alignment = DL.getABITypeAlignment(ScalarDataTy); + unsigned AddressSpace = getMemInstAddressSpace(Instr); - CSEMap[In] = In; - } -} + // Determine if the pointer operand of the access is either consecutive or + // reverse consecutive. + int ConsecutiveStride = Legal->isConsecutivePtr(Ptr); + bool Reverse = ConsecutiveStride < 0; + bool CreateGatherScatter = + (Decision == LoopVectorizationCostModel::CM_GatherScatter); -/// \brief Adds a 'fast' flag to floating point operations. -static Value *addFastMathFlag(Value *V) { - if (isa(V)) { - FastMathFlags Flags; - Flags.setUnsafeAlgebra(); - cast(V)->setFastMathFlags(Flags); - } - return V; -} + VectorParts VectorGep; -/// \brief Estimate the overhead of scalarizing an instruction. This is a -/// convenience wrapper for the type-based getScalarizationOverhead API. -static unsigned getScalarizationOverhead(Instruction *I, unsigned VF, - const TargetTransformInfo &TTI) { - if (VF == 1) - return 0; + // Handle consecutive loads/stores. 
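+  //
+  // Editorial illustration (not part of the original change), assuming VF = 4
+  // and UF = 2 for a unit-stride access A[i]: the consecutive path below emits
+  // one wide access per unroll part, covering A[i..i+3] for part 0 and
+  // A[i+4..i+7] for part 1. For a reverse (stride -1) access, the per-part
+  // pointer is moved back so that the wide access ends at the current lane,
+  // and the value vector is reversed. Any other pointer takes the
+  // gather/scatter path, which computes one address per vector lane.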
+  GetElementPtrInst *Gep = getGEPInstruction(Ptr);
+  if (ConsecutiveStride) {
+    if (Gep) {
+      unsigned NumOperands = Gep->getNumOperands();
+#ifndef NDEBUG
+      // The original GEP that was identified as a consecutive memory access
+      // should have only one loop-variant operand.
+      unsigned NumOfLoopVariantOps = 0;
+      for (unsigned i = 0; i < NumOperands; ++i)
+        if (!PSE.getSE()->isLoopInvariant(PSE.getSCEV(Gep->getOperand(i)),
+                                          OrigLoop))
+          NumOfLoopVariantOps++;
+      assert(NumOfLoopVariantOps == 1 &&
+             "Consecutive GEP should have only one loop-variant operand");
+#endif
+      GetElementPtrInst *Gep2 = cast<GetElementPtrInst>(Gep->clone());
+      Gep2->setName("gep.indvar");
-  unsigned Cost = 0;
-  Type *RetTy = ToVectorTy(I->getType(), VF);
-  if (!RetTy->isVoidTy())
-    Cost += TTI.getScalarizationOverhead(RetTy, true, false);
+      // A new GEP is created for the lane-zero value of the first unroll
+      // iteration. The GEPs for the rest of the unroll iterations are computed
+      // below as an offset from this GEP.
+      for (unsigned i = 0; i < NumOperands; ++i)
+        // We can apply getScalarValue() to all GEP indices. It returns the
+        // original value for a loop-invariant operand and the lane-zero value
+        // for the consecutive operand.
+        Gep2->setOperand(i, getScalarValue(Gep->getOperand(i),
+                                           0, /* First unroll iteration */
+                                           0 /* 0-lane of the vector */ ));
+      setDebugLocFromInst(Builder, Gep);
+      Ptr = Builder.Insert(Gep2);
-  if (CallInst *CI = dyn_cast<CallInst>(I)) {
-    SmallVector<const Value *, 4> Operands(CI->arg_operands());
-    Cost += TTI.getOperandsScalarizationOverhead(Operands, VF);
+    } else { // No GEP
+      setDebugLocFromInst(Builder, Ptr);
+      Ptr = getScalarValue(Ptr, 0, 0);
+    }
  } else {
-    SmallVector<const Value *, 4> Operands(I->operand_values());
-    Cost += TTI.getOperandsScalarizationOverhead(Operands, VF);
-  }
-
-  return Cost;
-}
+    // At this point we should have the vector version of the GEP for a gather
+    // or scatter.
+    assert(CreateGatherScatter && "The instruction should be scalarized");
+    if (Gep) {
+      // Vectorize the GEP across UF parts. We want a vector value for the base
+      // and for each index that is defined inside the loop, even if it is
+      // loop-invariant but was not hoisted out. Otherwise we want to keep it
+      // scalar.
+      SmallVector<VectorParts, 4> OpsV;
+      for (Value *Op : Gep->operands()) {
+        Instruction *SrcInst = dyn_cast<Instruction>(Op);
+        if (SrcInst && OrigLoop->contains(SrcInst))
+          OpsV.push_back(getVectorValue(Op));
+        else
+          OpsV.push_back(VectorParts(UF, Op));
+      }
+      for (unsigned Part = 0; Part < UF; ++Part) {
+        SmallVector<Value *, 4> Ops;
+        Value *GEPBasePtr = OpsV[0][Part];
+        for (unsigned i = 1; i < Gep->getNumOperands(); i++)
+          Ops.push_back(OpsV[i][Part]);
+        Value *NewGep = Builder.CreateGEP(GEPBasePtr, Ops, "VectorGep");
+        cast<GetElementPtrInst>(NewGep)->setIsInBounds(Gep->isInBounds());
+        assert(NewGep->getType()->isVectorTy() && "Expected vector GEP");
-// Estimate cost of a call instruction CI if it were vectorized with factor VF.
-// Return the cost of the instruction, including scalarization overhead if it's
-// needed. The flag NeedToScalarize shows if the call needs to be scalarized -
-// i.e. either vector version isn't available, or is too expensive.
-static unsigned getVectorCallCost(CallInst *CI, unsigned VF, - const TargetTransformInfo &TTI, - const TargetLibraryInfo *TLI, - bool &NeedToScalarize) { - Function *F = CI->getCalledFunction(); - StringRef FnName = CI->getCalledFunction()->getName(); - Type *ScalarRetTy = CI->getType(); - SmallVector Tys, ScalarTys; - for (auto &ArgOp : CI->arg_operands()) - ScalarTys.push_back(ArgOp->getType()); + NewGep = + Builder.CreateBitCast(NewGep, VectorType::get(Ptr->getType(), VF)); + VectorGep.push_back(NewGep); + } + } else + VectorGep = getVectorValue(Ptr); + } - // Estimate cost of scalarized vector call. The source operands are assumed - // to be vectors, so we need to extract individual elements from there, - // execute VF scalar calls, and then gather the result into the vector return - // value. - unsigned ScalarCallCost = TTI.getCallInstrCost(F, ScalarRetTy, ScalarTys); - if (VF == 1) - return ScalarCallCost; + VectorParts Mask = createBlockInMask(Instr->getParent()); + // Handle Stores: + if (SI) { + assert(!Legal->isUniform(SI->getPointerOperand()) && + "We do not allow storing to uniform addresses"); + setDebugLocFromInst(Builder, SI); + // We don't want to update the value in the map as it might be used in + // another expression. So don't use a reference type for "StoredVal". + VectorParts StoredVal = getVectorValue(SI->getValueOperand()); - // Compute corresponding vector type for return value and arguments. - Type *RetTy = ToVectorTy(ScalarRetTy, VF); - for (Type *ScalarTy : ScalarTys) - Tys.push_back(ToVectorTy(ScalarTy, VF)); + for (unsigned Part = 0; Part < UF; ++Part) { + Instruction *NewSI = nullptr; + if (CreateGatherScatter) { + Value *MaskPart = Legal->isMaskRequired(SI) ? Mask[Part] : nullptr; + NewSI = Builder.CreateMaskedScatter(StoredVal[Part], VectorGep[Part], + Alignment, MaskPart); + } else { + // Calculate the pointer for the specific unroll-part. + Value *PartPtr = + Builder.CreateGEP(nullptr, Ptr, Builder.getInt32(Part * VF)); - // Compute costs of unpacking argument values for the scalar calls and - // packing the return values to a vector. - unsigned ScalarizationCost = getScalarizationOverhead(CI, VF, TTI); + if (Reverse) { + // If we store to reverse consecutive memory locations, then we need + // to reverse the order of elements in the stored value. + StoredVal[Part] = reverseVector(StoredVal[Part]); + // If the address is consecutive but reversed, then the + // wide store needs to start at the last vector element. + PartPtr = + Builder.CreateGEP(nullptr, Ptr, Builder.getInt32(-Part * VF)); + PartPtr = + Builder.CreateGEP(nullptr, PartPtr, Builder.getInt32(1 - VF)); + Mask[Part] = reverseVector(Mask[Part]); + } - unsigned Cost = ScalarCallCost * VF + ScalarizationCost; + Value *VecPtr = + Builder.CreateBitCast(PartPtr, DataTy->getPointerTo(AddressSpace)); - // If we can't emit a vector call for this function, then the currently found - // cost is the cost we need to return. - NeedToScalarize = true; - if (!TLI || !TLI->isFunctionVectorizable(FnName, VF) || CI->isNoBuiltin()) - return Cost; + if (Legal->isMaskRequired(SI)) + NewSI = Builder.CreateMaskedStore(StoredVal[Part], VecPtr, Alignment, + Mask[Part]); + else + NewSI = + Builder.CreateAlignedStore(StoredVal[Part], VecPtr, Alignment); + } + addMetadata(NewSI, SI); + } + return; + } - // If the corresponding vector cost is cheaper, return its cost. 
- unsigned VectorCallCost = TTI.getCallInstrCost(nullptr, RetTy, Tys); - if (VectorCallCost < Cost) { - NeedToScalarize = false; - return VectorCallCost; + // Handle loads. + assert(LI && "Must have a load instruction"); + setDebugLocFromInst(Builder, LI); + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Instruction *NewLI; + if (CreateGatherScatter) { + Value *MaskPart = Legal->isMaskRequired(LI) ? Mask[Part] : nullptr; + NewLI = Builder.CreateMaskedGather(VectorGep[Part], Alignment, MaskPart, + 0, "wide.masked.gather"); + Entry[Part] = NewLI; + } else { + // Calculate the pointer for the specific unroll-part. + Value *PartPtr = + Builder.CreateGEP(nullptr, Ptr, Builder.getInt32(Part * VF)); + + if (Reverse) { + // If the address is consecutive but reversed, then the + // wide load needs to start at the last vector element. + PartPtr = Builder.CreateGEP(nullptr, Ptr, Builder.getInt32(-Part * VF)); + PartPtr = Builder.CreateGEP(nullptr, PartPtr, Builder.getInt32(1 - VF)); + Mask[Part] = reverseVector(Mask[Part]); + } + + Value *VecPtr = + Builder.CreateBitCast(PartPtr, DataTy->getPointerTo(AddressSpace)); + if (Legal->isMaskRequired(LI)) + NewLI = Builder.CreateMaskedLoad(VecPtr, Alignment, Mask[Part], + UndefValue::get(DataTy), + "wide.masked.load"); + else + NewLI = Builder.CreateAlignedLoad(VecPtr, Alignment, "wide.load"); + Entry[Part] = Reverse ? reverseVector(NewLI) : NewLI; + } + addMetadata(NewLI, LI); } - return Cost; + VectorLoopValueMap.initVector(Instr, Entry); } -// Estimate cost of an intrinsic call instruction CI if it were vectorized with -// factor VF. Return the cost of the instruction, including scalarization -// overhead if it's needed. -static unsigned getVectorIntrinsicCost(CallInst *CI, unsigned VF, - const TargetTransformInfo &TTI, - const TargetLibraryInfo *TLI) { - Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); - assert(ID && "Expected intrinsic call!"); +void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr, + unsigned MinPart, + unsigned MaxPart, + unsigned MinLane, + unsigned MaxLane) { + assert(!Instr->getType()->isAggregateType() && "Can't handle vectors"); + // Holds vector parameters or scalars, in case of uniform vals. + SmallVector Params; - Type *RetTy = ToVectorTy(CI->getType(), VF); - SmallVector Tys; - for (Value *ArgOperand : CI->arg_operands()) - Tys.push_back(ToVectorTy(ArgOperand->getType(), VF)); + setDebugLocFromInst(Builder, Instr); - FastMathFlags FMF; - if (auto *FPMO = dyn_cast(CI)) - FMF = FPMO->getFastMathFlags(); + // Does this instruction return a value ? + bool IsVoidRetTy = Instr->getType()->isVoidTy(); - return TTI.getIntrinsicInstrCost(ID, RetTy, Tys, FMF); -} + // Initialize a new scalar map entry. + ScalarParts &Entry = VectorLoopValueMap.getOrCreateScalar(Instr, VF); -static Type *smallestIntegerVectorType(Type *T1, Type *T2) { - auto *I1 = cast(T1->getVectorElementType()); - auto *I2 = cast(T2->getVectorElementType()); - return I1->getBitWidth() < I2->getBitWidth() ? T1 : T2; -} -static Type *largestIntegerVectorType(Type *T1, Type *T2) { - auto *I1 = cast(T1->getVectorElementType()); - auto *I2 = cast(T2->getVectorElementType()); - return I1->getBitWidth() > I2->getBitWidth() ? 
T1 : T2; + // For each vector unroll 'part': + for (unsigned Part = MinPart; Part <= MaxPart; ++Part) { + // For each scalar that we create: + for (unsigned Lane = MinLane; Lane <= MaxLane; ++Lane) { + + Instruction *Cloned = Instr->clone(); + if (!IsVoidRetTy) + Cloned->setName(Instr->getName() + ".cloned"); + + // Replace the operands of the cloned instructions with their scalar + // equivalents in the new loop. + for (unsigned op = 0, e = Instr->getNumOperands(); op != e; ++op) { + auto *NewOp = getScalarValue(Instr->getOperand(op), Part, Lane); + Cloned->setOperand(op, NewOp); + } + addNewMetadata(Cloned, Instr); + + // Place the cloned scalar in the new loop. + Builder.Insert(Cloned); + + // Add the cloned scalar to the scalar map entry. + Entry[Part][Lane] = Cloned; + + // If we just cloned a new assumption, add it the assumption cache. + if (auto *II = dyn_cast(Cloned)) + if (II->getIntrinsicID() == Intrinsic::assume) + AC->registerAssumption(II); + } + } } -void InnerLoopVectorizer::truncateToMinimalBitwidths() { - // For every instruction `I` in MinBWs, truncate the operands, create a - // truncated version of `I` and reextend its result. InstCombine runs - // later and will remove any ext/trunc pairs. - // - SmallPtrSet Erased; - for (const auto &KV : Cost->getMinimalBitwidths()) { - // If the value wasn't vectorized, we must maintain the original scalar - // type. The absence of the value from VectorLoopValueMap indicates that it - // wasn't vectorized. - if (!VectorLoopValueMap.hasVector(KV.first)) - continue; - VectorParts &Parts = VectorLoopValueMap.getVector(KV.first); - for (Value *&I : Parts) { - if (Erased.count(I) || I->use_empty() || !isa(I)) - continue; - Type *OriginalTy = I->getType(); - Type *ScalarTruncatedTy = - IntegerType::get(OriginalTy->getContext(), KV.second); - Type *TruncatedTy = VectorType::get(ScalarTruncatedTy, - OriginalTy->getVectorNumElements()); - if (TruncatedTy == OriginalTy) - continue; +PHINode *InnerLoopVectorizer::createInductionVariable(Loop *L, Value *Start, + Value *End, Value *Step, + Instruction *DL) { + BasicBlock *Header = L->getHeader(); + BasicBlock *Latch = L->getLoopLatch(); + // As we're just creating this loop, it's possible no latch exists + // yet. If so, use the header as this will be a single block loop. + if (!Latch) + Latch = Header; - IRBuilder<> B(cast(I)); - auto ShrinkOperand = [&](Value *V) -> Value * { - if (auto *ZI = dyn_cast(V)) - if (ZI->getSrcTy() == TruncatedTy) - return ZI->getOperand(0); - return B.CreateZExtOrTrunc(V, TruncatedTy); - }; + IRBuilder<> Builder(&*Header->getFirstInsertionPt()); + Instruction *OldInst = getDebugLocFromInstOrOperands(OldInduction); + setDebugLocFromInst(Builder, OldInst); + auto *Induction = Builder.CreatePHI(Start->getType(), 2, "index"); - // The actual instruction modification depends on the instruction type, - // unfortunately. 
- Value *NewI = nullptr; - if (auto *BO = dyn_cast(I)) { - NewI = B.CreateBinOp(BO->getOpcode(), ShrinkOperand(BO->getOperand(0)), - ShrinkOperand(BO->getOperand(1))); - cast(NewI)->copyIRFlags(I); - } else if (auto *CI = dyn_cast(I)) { - NewI = - B.CreateICmp(CI->getPredicate(), ShrinkOperand(CI->getOperand(0)), - ShrinkOperand(CI->getOperand(1))); - } else if (auto *SI = dyn_cast(I)) { - NewI = B.CreateSelect(SI->getCondition(), - ShrinkOperand(SI->getTrueValue()), - ShrinkOperand(SI->getFalseValue())); - } else if (auto *CI = dyn_cast(I)) { - switch (CI->getOpcode()) { - default: - llvm_unreachable("Unhandled cast!"); - case Instruction::Trunc: - NewI = ShrinkOperand(CI->getOperand(0)); - break; - case Instruction::SExt: - NewI = B.CreateSExtOrTrunc( - CI->getOperand(0), - smallestIntegerVectorType(OriginalTy, TruncatedTy)); - break; - case Instruction::ZExt: - NewI = B.CreateZExtOrTrunc( - CI->getOperand(0), - smallestIntegerVectorType(OriginalTy, TruncatedTy)); - break; - } - } else if (auto *SI = dyn_cast(I)) { - auto Elements0 = SI->getOperand(0)->getType()->getVectorNumElements(); - auto *O0 = B.CreateZExtOrTrunc( - SI->getOperand(0), VectorType::get(ScalarTruncatedTy, Elements0)); - auto Elements1 = SI->getOperand(1)->getType()->getVectorNumElements(); - auto *O1 = B.CreateZExtOrTrunc( - SI->getOperand(1), VectorType::get(ScalarTruncatedTy, Elements1)); + Builder.SetInsertPoint(Latch->getTerminator()); + setDebugLocFromInst(Builder, OldInst); - NewI = B.CreateShuffleVector(O0, O1, SI->getMask()); - } else if (isa(I)) { - // Don't do anything with the operands, just extend the result. - continue; - } else if (auto *IE = dyn_cast(I)) { - auto Elements = IE->getOperand(0)->getType()->getVectorNumElements(); - auto *O0 = B.CreateZExtOrTrunc( - IE->getOperand(0), VectorType::get(ScalarTruncatedTy, Elements)); - auto *O1 = B.CreateZExtOrTrunc(IE->getOperand(1), ScalarTruncatedTy); - NewI = B.CreateInsertElement(O0, O1, IE->getOperand(2)); - } else if (auto *EE = dyn_cast(I)) { - auto Elements = EE->getOperand(0)->getType()->getVectorNumElements(); - auto *O0 = B.CreateZExtOrTrunc( - EE->getOperand(0), VectorType::get(ScalarTruncatedTy, Elements)); - NewI = B.CreateExtractElement(O0, EE->getOperand(2)); - } else { - llvm_unreachable("Unhandled instruction type!"); - } + // Create i+1 and fill the PHINode. + Value *Next = Builder.CreateAdd(Induction, Step, "index.next"); + Induction->addIncoming(Start, L->getLoopPreheader()); + Induction->addIncoming(Next, Latch); + // Create the compare. + Value *ICmp = Builder.CreateICmpEQ(Next, End); + Builder.CreateCondBr(ICmp, L->getExitBlock(), Header); - // Lastly, extend the result. - NewI->takeName(cast(I)); - Value *Res = B.CreateZExtOrTrunc(NewI, OriginalTy); - I->replaceAllUsesWith(Res); - cast(I)->eraseFromParent(); - Erased.insert(I); - I = Res; - } - } + // Now we have two terminators. Remove the old one from the block. + Latch->getTerminator()->eraseFromParent(); - // We'll have created a bunch of ZExts that are now parentless. Clean up. - for (const auto &KV : Cost->getMinimalBitwidths()) { - // If the value wasn't vectorized, we must maintain the original scalar - // type. The absence of the value from VectorLoopValueMap indicates that it - // wasn't vectorized. 
- if (!VectorLoopValueMap.hasVector(KV.first)) - continue; - VectorParts &Parts = VectorLoopValueMap.getVector(KV.first); - for (Value *&I : Parts) { - ZExtInst *Inst = dyn_cast(I); - if (Inst && Inst->use_empty()) { - Value *NewI = Inst->getOperand(0); - Inst->eraseFromParent(); - I = NewI; - } - } - } + return Induction; } -void InnerLoopVectorizer::vectorizeLoop() { - //===------------------------------------------------===// - // - // Notice: any optimization or new instruction that go - // into the code below should be also be implemented in - // the cost-model. - // - //===------------------------------------------------===// - Constant *Zero = Builder.getInt32(0); - - // In order to support recurrences we need to be able to vectorize Phi nodes. - // Phi nodes have cycles, so we need to vectorize them in two stages. First, - // we create a new vector PHI node with no incoming edges. We use this value - // when we vectorize all of the instructions that use the PHI. Next, after - // all of the instructions in the block are complete we add the new incoming - // edges to the PHI. At this point all of the instructions in the basic block - // are vectorized, so we can use them to construct the PHI. - PhiVector PHIsToFix; - - // Collect instructions from the original loop that will become trivially - // dead in the vectorized loop. We don't need to vectorize these - // instructions. - collectTriviallyDeadInstructions(); +Value *InnerLoopVectorizer::getOrCreateTripCount(Loop *L) { + if (TripCount) + return TripCount; - // Scan the loop in a topological order to ensure that defs are vectorized - // before users. - LoopBlocksDFS DFS(OrigLoop); - DFS.perform(LI); + IRBuilder<> Builder(L->getLoopPreheader()->getTerminator()); + // Find the loop boundaries. + ScalarEvolution *SE = PSE.getSE(); + const SCEV *BackedgeTakenCount = PSE.getBackedgeTakenCount(); + assert(BackedgeTakenCount != SE->getCouldNotCompute() && + "Invalid loop count"); - // Vectorize all of the blocks in the original loop. - for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) - vectorizeBlockInLoop(BB, &PHIsToFix); + Type *IdxTy = Legal->getWidestInductionType(); - // Insert truncates and extends for any truncated instructions as hints to - // InstCombine. - if (VF > 1) - truncateToMinimalBitwidths(); + // The exit count might have the type of i64 while the phi is i32. This can + // happen if we have an induction variable that is sign extended before the + // compare. The only way that we get a backedge taken count is that the + // induction variable was signed and as such will not overflow. In such a case + // truncation is legal. + if (BackedgeTakenCount->getType()->getPrimitiveSizeInBits() > + IdxTy->getPrimitiveSizeInBits()) + BackedgeTakenCount = SE->getTruncateOrNoop(BackedgeTakenCount, IdxTy); + BackedgeTakenCount = SE->getNoopOrZeroExtend(BackedgeTakenCount, IdxTy); - // At this point every instruction in the original loop is widened to a - // vector form. Now we need to fix the recurrences in PHIsToFix. These PHI - // nodes are currently empty because we did not want to introduce cycles. - // This is the second stage of vectorizing recurrences. - for (PHINode *Phi : PHIsToFix) { - assert(Phi && "Unable to recover vectorized PHI"); + // Get the total trip count from the count by adding 1. + const SCEV *ExitCount = SE->getAddExpr( + BackedgeTakenCount, SE->getOne(BackedgeTakenCount->getType())); - // Handle first-order recurrences that need to be fixed. 
- if (Legal->isFirstOrderRecurrence(Phi)) { - fixFirstOrderRecurrence(Phi); - continue; - } + const DataLayout &DL = L->getHeader()->getModule()->getDataLayout(); - // If the phi node is not a first-order recurrence, it must be a reduction. - // Get it's reduction variable descriptor. - assert(Legal->isReductionVariable(Phi) && - "Unable to find the reduction variable"); - RecurrenceDescriptor RdxDesc = (*Legal->getReductionVars())[Phi]; - - RecurrenceDescriptor::RecurrenceKind RK = RdxDesc.getRecurrenceKind(); - TrackingVH ReductionStartValue = RdxDesc.getRecurrenceStartValue(); - Instruction *LoopExitInst = RdxDesc.getLoopExitInstr(); - RecurrenceDescriptor::MinMaxRecurrenceKind MinMaxKind = - RdxDesc.getMinMaxRecurrenceKind(); - setDebugLocFromInst(Builder, ReductionStartValue); - - // We need to generate a reduction vector from the incoming scalar. - // To do so, we need to generate the 'identity' vector and override - // one of the elements with the incoming scalar reduction. We need - // to do it in the vector-loop preheader. - Builder.SetInsertPoint(LoopBypassBlocks[1]->getTerminator()); - - // This is the vector-clone of the value that leaves the loop. - const VectorParts &VectorExit = getVectorValue(LoopExitInst); - Type *VecTy = VectorExit[0]->getType(); - - // Find the reduction identity variable. Zero for addition, or, xor, - // one for multiplication, -1 for And. - Value *Identity; - Value *VectorStart; - if (RK == RecurrenceDescriptor::RK_IntegerMinMax || - RK == RecurrenceDescriptor::RK_FloatMinMax) { - // MinMax reduction have the start value as their identify. - if (VF == 1) { - VectorStart = Identity = ReductionStartValue; - } else { - VectorStart = Identity = - Builder.CreateVectorSplat(VF, ReductionStartValue, "minmax.ident"); - } - } else { - // Handle other reduction kinds: - Constant *Iden = RecurrenceDescriptor::getRecurrenceIdentity( - RK, VecTy->getScalarType()); - if (VF == 1) { - Identity = Iden; - // This vector is the Identity vector where the first element is the - // incoming scalar reduction. - VectorStart = ReductionStartValue; - } else { - Identity = ConstantVector::getSplat(VF, Iden); + // Expand the trip count and place the new instructions in the preheader. + // Notice that the pre-header does not change, only the loop body. + SCEVExpander Exp(*SE, DL, "induction"); - // This vector is the Identity vector where the first element is the - // incoming scalar reduction. - VectorStart = - Builder.CreateInsertElement(Identity, ReductionStartValue, Zero); - } - } + // Count holds the overall loop count (N). + TripCount = Exp.expandCodeFor(ExitCount, ExitCount->getType(), + L->getLoopPreheader()->getTerminator()); - // Fix the vector-loop phi. + if (TripCount->getType()->isPointerTy()) + TripCount = + CastInst::CreatePointerCast(TripCount, IdxTy, "exitcount.ptrcnt.to.int", + L->getLoopPreheader()->getTerminator()); - // Reductions do not have to start at zero. They can start with - // any loop invariant values. - const VectorParts &VecRdxPhi = getVectorValue(Phi); - BasicBlock *Latch = OrigLoop->getLoopLatch(); - Value *LoopVal = Phi->getIncomingValueForBlock(Latch); - const VectorParts &Val = getVectorValue(LoopVal); - for (unsigned part = 0; part < UF; ++part) { - // Make sure to add the reduction stat value only to the - // first unroll part. - Value *StartVal = (part == 0) ? 
VectorStart : Identity; - cast(VecRdxPhi[part]) - ->addIncoming(StartVal, LoopVectorPreHeader); - cast(VecRdxPhi[part]) - ->addIncoming(Val[part], LoopVectorBody); - } + return TripCount; +} - // Before each round, move the insertion point right between - // the PHIs and the values we are going to write. - // This allows us to write both PHINodes and the extractelement - // instructions. - Builder.SetInsertPoint(&*LoopMiddleBlock->getFirstInsertionPt()); +Value *InnerLoopVectorizer::getOrCreateVectorTripCount(Loop *L) { + if (VectorTripCount) + return VectorTripCount; - VectorParts &RdxParts = VectorLoopValueMap.getVector(LoopExitInst); - setDebugLocFromInst(Builder, LoopExitInst); + Value *TC = getOrCreateTripCount(L); + IRBuilder<> Builder(L->getLoopPreheader()->getTerminator()); - // If the vector reduction can be performed in a smaller type, we truncate - // then extend the loop exit value to enable InstCombine to evaluate the - // entire expression in the smaller type. - if (VF > 1 && Phi->getType() != RdxDesc.getRecurrenceType()) { - Type *RdxVecTy = VectorType::get(RdxDesc.getRecurrenceType(), VF); - Builder.SetInsertPoint(LoopVectorBody->getTerminator()); - for (unsigned part = 0; part < UF; ++part) { - Value *Trunc = Builder.CreateTrunc(RdxParts[part], RdxVecTy); - Value *Extnd = RdxDesc.isSigned() ? Builder.CreateSExt(Trunc, VecTy) - : Builder.CreateZExt(Trunc, VecTy); - for (Value::user_iterator UI = RdxParts[part]->user_begin(); - UI != RdxParts[part]->user_end();) - if (*UI != Trunc) { - (*UI++)->replaceUsesOfWith(RdxParts[part], Extnd); - RdxParts[part] = Extnd; - } else { - ++UI; - } - } - Builder.SetInsertPoint(&*LoopMiddleBlock->getFirstInsertionPt()); - for (unsigned part = 0; part < UF; ++part) - RdxParts[part] = Builder.CreateTrunc(RdxParts[part], RdxVecTy); - } + // Now we need to generate the expression for the part of the loop that the + // vectorized body will execute. This is equal to N - (N % Step) if scalar + // iterations are not required for correctness, or N - Step, otherwise. Step + // is equal to the vectorization factor (number of SIMD elements) times the + // unroll factor (number of SIMD instructions). + Constant *Step = ConstantInt::get(TC->getType(), VF * UF); + Value *R = Builder.CreateURem(TC, Step, "n.mod.vf"); - // Reduce all of the unrolled parts into a single vector. - Value *ReducedPartRdx = RdxParts[0]; - unsigned Op = RecurrenceDescriptor::getRecurrenceBinOp(RK); - setDebugLocFromInst(Builder, ReducedPartRdx); - for (unsigned part = 1; part < UF; ++part) { - if (Op != Instruction::ICmp && Op != Instruction::FCmp) - // Floating point operations had to be 'fast' to enable the reduction. - ReducedPartRdx = addFastMathFlag( - Builder.CreateBinOp((Instruction::BinaryOps)Op, RdxParts[part], - ReducedPartRdx, "bin.rdx")); - else - ReducedPartRdx = RecurrenceDescriptor::createMinMaxOp( - Builder, MinMaxKind, ReducedPartRdx, RdxParts[part]); - } + // If there is a non-reversed interleaved group that may speculatively access + // memory out-of-bounds, we need to ensure that there will be at least one + // iteration of the scalar epilogue loop. Thus, if the step evenly divides + // the trip count, we set the remainder to be equal to the step. If the step + // does not evenly divide the trip count, no adjustment is necessary since + // there will already be scalar iterations. Note that the minimum iterations + // check ensures that N >= Step. 
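+  //
+  // Editorial worked example (numbers added for illustration only): with
+  // VF = 4 and UF = 2 the step is 8. A trip count of N = 17 gives
+  // R = 17 urem 8 = 1 and a vector trip count of 16, leaving one scalar
+  // iteration. If a scalar epilogue is required and N = 16, R would be 0, so
+  // it is bumped up to 8 and the vector trip count becomes 8, leaving eight
+  // iterations for the scalar remainder loop.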
+ if (VF > 1 && Legal->requiresScalarEpilogue()) { + auto *IsZero = Builder.CreateICmpEQ(R, ConstantInt::get(R->getType(), 0)); + R = Builder.CreateSelect(IsZero, Step, R); + } - if (VF > 1) { - // VF is a power of 2 so we can emit the reduction using log2(VF) shuffles - // and vector ops, reducing the set of values being computed by half each - // round. - assert(isPowerOf2_32(VF) && - "Reduction emission only supported for pow2 vectors!"); - Value *TmpVec = ReducedPartRdx; - SmallVector ShuffleMask(VF, nullptr); - for (unsigned i = VF; i != 1; i >>= 1) { - // Move the upper half of the vector to the lower half. - for (unsigned j = 0; j != i / 2; ++j) - ShuffleMask[j] = Builder.getInt32(i / 2 + j); - - // Fill the rest of the mask with undef. - std::fill(&ShuffleMask[i / 2], ShuffleMask.end(), - UndefValue::get(Builder.getInt32Ty())); - - Value *Shuf = Builder.CreateShuffleVector( - TmpVec, UndefValue::get(TmpVec->getType()), - ConstantVector::get(ShuffleMask), "rdx.shuf"); - - if (Op != Instruction::ICmp && Op != Instruction::FCmp) - // Floating point operations had to be 'fast' to enable the reduction. - TmpVec = addFastMathFlag(Builder.CreateBinOp( - (Instruction::BinaryOps)Op, TmpVec, Shuf, "bin.rdx")); - else - TmpVec = RecurrenceDescriptor::createMinMaxOp(Builder, MinMaxKind, - TmpVec, Shuf); - } + VectorTripCount = Builder.CreateSub(TC, R, "n.vec"); - // The result is in the first element of the vector. - ReducedPartRdx = - Builder.CreateExtractElement(TmpVec, Builder.getInt32(0)); - - // If the reduction can be performed in a smaller type, we need to extend - // the reduction to the wider type before we branch to the original loop. - if (Phi->getType() != RdxDesc.getRecurrenceType()) - ReducedPartRdx = - RdxDesc.isSigned() - ? Builder.CreateSExt(ReducedPartRdx, Phi->getType()) - : Builder.CreateZExt(ReducedPartRdx, Phi->getType()); - } + return VectorTripCount; +} - // Create a phi node that merges control-flow from the backedge-taken check - // block and the middle block. - PHINode *BCBlockPhi = PHINode::Create(Phi->getType(), 2, "bc.merge.rdx", - LoopScalarPreHeader->getTerminator()); - for (unsigned I = 0, E = LoopBypassBlocks.size(); I != E; ++I) - BCBlockPhi->addIncoming(ReductionStartValue, LoopBypassBlocks[I]); - BCBlockPhi->addIncoming(ReducedPartRdx, LoopMiddleBlock); - - // Now, we need to fix the users of the reduction variable - // inside and outside of the scalar remainder loop. - // We know that the loop is in LCSSA form. We need to update the - // PHI nodes in the exit blocks. - for (BasicBlock::iterator LEI = LoopExitBlock->begin(), - LEE = LoopExitBlock->end(); - LEI != LEE; ++LEI) { - PHINode *LCSSAPhi = dyn_cast(LEI); - if (!LCSSAPhi) - break; +void InnerLoopVectorizer::emitMinimumIterationCountCheck(Loop *L, + BasicBlock *Bypass) { + Value *Count = getOrCreateTripCount(L); + BasicBlock *BB = L->getLoopPreheader(); + IRBuilder<> Builder(BB->getTerminator()); - // All PHINodes need to have a single entry edge, or two if - // we already fixed them. - assert(LCSSAPhi->getNumIncomingValues() < 3 && "Invalid LCSSA PHI"); + // Generate code to check that the loop's trip count that we computed by + // adding one to the backedge-taken count will not overflow. + Value *CheckMinIters = Builder.CreateICmpULT( + Count, ConstantInt::get(Count->getType(), VF * UF), "min.iters.check"); - // We found a reduction value exit-PHI. Update it with the - // incoming bypass edge. 
- if (LCSSAPhi->getIncomingValue(0) == LoopExitInst) - LCSSAPhi->addIncoming(ReducedPartRdx, LoopMiddleBlock); - } // end of the LCSSA phi scan. + BasicBlock *NewBB = + BB->splitBasicBlock(BB->getTerminator(), "min.iters.checked"); + // Update dominator tree immediately if the generated block is a + // LoopBypassBlock because SCEV expansions to generate loop bypass + // checks may query it before the current function is finished. + DT->addNewBlock(NewBB, BB); + if (L->getParentLoop()) + L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI); + ReplaceInstWithInst(BB->getTerminator(), + BranchInst::Create(Bypass, NewBB, CheckMinIters)); + LoopBypassBlocks.push_back(BB); +} - // Fix the scalar loop reduction variable with the incoming reduction sum - // from the vector body and from the backedge value. - int IncomingEdgeBlockIdx = - Phi->getBasicBlockIndex(OrigLoop->getLoopLatch()); - assert(IncomingEdgeBlockIdx >= 0 && "Invalid block index"); - // Pick the other block. - int SelfEdgeBlockIdx = (IncomingEdgeBlockIdx ? 0 : 1); - Phi->setIncomingValue(SelfEdgeBlockIdx, BCBlockPhi); - Phi->setIncomingValue(IncomingEdgeBlockIdx, LoopExitInst); - } // end of for each Phi in PHIsToFix. +void InnerLoopVectorizer::emitVectorLoopEnteredCheck(Loop *L, + BasicBlock *Bypass) { + Value *TC = getOrCreateVectorTripCount(L); + BasicBlock *BB = L->getLoopPreheader(); + IRBuilder<> Builder(BB->getTerminator()); - // Update the dominator tree. - // - // FIXME: After creating the structure of the new loop, the dominator tree is - // no longer up-to-date, and it remains that way until we update it - // here. An out-of-date dominator tree is problematic for SCEV, - // because SCEVExpander uses it to guide code generation. The - // vectorizer use SCEVExpanders in several places. Instead, we should - // keep the dominator tree up-to-date as we go. - updateAnalysis(); + // Now, compare the new count to zero. If it is zero skip the vector loop and + // jump to the scalar loop. + Value *Cmp = Builder.CreateICmpEQ(TC, Constant::getNullValue(TC->getType()), + "cmp.zero"); - // Fix-up external users of the induction variables. - for (auto &Entry : *Legal->getInductionVars()) - fixupIVUsers(Entry.first, Entry.second, - getOrCreateVectorTripCount(LI->getLoopFor(LoopVectorBody)), - IVEndValues[Entry.first], LoopMiddleBlock); - - fixLCSSAPHIs(); - predicateInstructions(); - - // Remove redundant induction instructions. - cse(LoopVectorBody); + // Generate code to check that the loop's trip count that we computed by + // adding one to the backedge-taken count will not overflow. + BasicBlock *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph"); + // Update dominator tree immediately if the generated block is a + // LoopBypassBlock because SCEV expansions to generate loop bypass + // checks may query it before the current function is finished. + DT->addNewBlock(NewBB, BB); + if (L->getParentLoop()) + L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI); + ReplaceInstWithInst(BB->getTerminator(), + BranchInst::Create(Bypass, NewBB, Cmp)); + LoopBypassBlocks.push_back(BB); } -void InnerLoopVectorizer::fixFirstOrderRecurrence(PHINode *Phi) { +void InnerLoopVectorizer::emitSCEVChecks(Loop *L, BasicBlock *Bypass) { + BasicBlock *BB = L->getLoopPreheader(); - // This is the second phase of vectorizing first-order recurrences. An - // overview of the transformation is described below. Suppose we have the - // following loop. 
- // - // for (int i = 0; i < n; ++i) - // b[i] = a[i] - a[i - 1]; - // - // There is a first-order recurrence on "a". For this loop, the shorthand - // scalar IR looks like: - // - // scalar.ph: - // s_init = a[-1] - // br scalar.body - // - // scalar.body: - // i = phi [0, scalar.ph], [i+1, scalar.body] - // s1 = phi [s_init, scalar.ph], [s2, scalar.body] - // s2 = a[i] - // b[i] = s2 - s1 - // br cond, scalar.body, ... - // - // In this example, s1 is a recurrence because it's value depends on the - // previous iteration. In the first phase of vectorization, we created a - // temporary value for s1. We now complete the vectorization and produce the - // shorthand vector IR shown below (for VF = 4, UF = 1). - // - // vector.ph: - // v_init = vector(..., ..., ..., a[-1]) - // br vector.body - // - // vector.body - // i = phi [0, vector.ph], [i+4, vector.body] - // v1 = phi [v_init, vector.ph], [v2, vector.body] - // v2 = a[i, i+1, i+2, i+3]; - // v3 = vector(v1(3), v2(0, 1, 2)) - // b[i, i+1, i+2, i+3] = v2 - v3 - // br cond, vector.body, middle.block - // - // middle.block: - // x = v2(3) - // br scalar.ph - // - // scalar.ph: - // s_init = phi [x, middle.block], [a[-1], otherwise] - // br scalar.body - // - // After execution completes the vector loop, we extract the next value of - // the recurrence (x) to use as the initial value in the scalar loop. + // Generate the code to check that the SCEV assumptions that we made. + // We want the new basic block to start at the first instruction in a + // sequence of instructions that form a check. + SCEVExpander Exp(*PSE.getSE(), Bypass->getModule()->getDataLayout(), + "scev.check"); + Value *SCEVCheck = + Exp.expandCodeForPredicate(&PSE.getUnionPredicate(), BB->getTerminator()); - // Get the original loop preheader and single loop latch. - auto *Preheader = OrigLoop->getLoopPreheader(); - auto *Latch = OrigLoop->getLoopLatch(); + if (auto *C = dyn_cast(SCEVCheck)) + if (C->isZero()) + return; - // Get the initial and previous values of the scalar recurrence. - auto *ScalarInit = Phi->getIncomingValueForBlock(Preheader); - auto *Previous = Phi->getIncomingValueForBlock(Latch); + // Create a new block containing the stride check. + BB->setName("vector.scevcheck"); + auto *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph"); + // Update dominator tree immediately if the generated block is a + // LoopBypassBlock because SCEV expansions to generate loop bypass + // checks may query it before the current function is finished. + DT->addNewBlock(NewBB, BB); + if (L->getParentLoop()) + L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI); + ReplaceInstWithInst(BB->getTerminator(), + BranchInst::Create(Bypass, NewBB, SCEVCheck)); + LoopBypassBlocks.push_back(BB); + AddedSafetyChecks = true; +} - // Create a vector from the initial value. - auto *VectorInit = ScalarInit; - if (VF > 1) { - Builder.SetInsertPoint(LoopVectorPreHeader->getTerminator()); - VectorInit = Builder.CreateInsertElement( - UndefValue::get(VectorType::get(VectorInit->getType(), VF)), VectorInit, - Builder.getInt32(VF - 1), "vector.recur.init"); - } +void InnerLoopVectorizer::emitMemRuntimeChecks(Loop *L, BasicBlock *Bypass) { + BasicBlock *BB = L->getLoopPreheader(); - // We constructed a temporary phi node in the first phase of vectorization. - // This phi node will eventually be deleted. - VectorParts &PhiParts = VectorLoopValueMap.getVector(Phi); - Builder.SetInsertPoint(cast(PhiParts[0])); + // Generate the code that checks in runtime if arrays overlap. 
We put the + // checks into a separate block to make the more common case of few elements + // faster. + Instruction *FirstCheckInst; + Instruction *MemRuntimeCheck; + std::tie(FirstCheckInst, MemRuntimeCheck) = + Legal->getLAI()->addRuntimeChecks(BB->getTerminator()); + if (!MemRuntimeCheck) + return; - // Create a phi node for the new recurrence. The current value will either be - // the initial value inserted into a vector or loop-varying vector value. - auto *VecPhi = Builder.CreatePHI(VectorInit->getType(), 2, "vector.recur"); - VecPhi->addIncoming(VectorInit, LoopVectorPreHeader); + // Create a new block containing the memory check. + BB->setName("vector.memcheck"); + auto *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph"); + // Update dominator tree immediately if the generated block is a + // LoopBypassBlock because SCEV expansions to generate loop bypass + // checks may query it before the current function is finished. + DT->addNewBlock(NewBB, BB); + if (L->getParentLoop()) + L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI); + ReplaceInstWithInst(BB->getTerminator(), + BranchInst::Create(Bypass, NewBB, MemRuntimeCheck)); + LoopBypassBlocks.push_back(BB); + AddedSafetyChecks = true; - // Get the vectorized previous value. We ensured the previous values was an - // instruction when detecting the recurrence. - auto &PreviousParts = getVectorValue(Previous); + // We currently don't use LoopVersioning for the actual loop cloning but we + // still use it to add the noalias metadata. + LVer = llvm::make_unique(*Legal->getLAI(), OrigLoop, LI, DT, + PSE.getSE()); + LVer->prepareNoAliasMetadata(); +} - // Set the insertion point to be after this instruction. We ensured the - // previous value dominated all uses of the phi when detecting the - // recurrence. - Builder.SetInsertPoint( - &*++BasicBlock::iterator(cast(PreviousParts[UF - 1]))); +void InnerLoopVectorizer::createEmptyLoop() { + /* + In this function we generate a new loop. The new loop will contain + the vectorized instructions while the old loop will continue to run the + scalar remainder. - // We will construct a vector for the recurrence by combining the values for - // the current and previous iterations. This is the required shuffle mask. - SmallVector ShuffleMask(VF); - ShuffleMask[0] = Builder.getInt32(VF - 1); - for (unsigned I = 1; I < VF; ++I) - ShuffleMask[I] = Builder.getInt32(I + VF - 1); + [ ] <-- loop iteration number check. + / | + / v + | [ ] <-- vector loop bypass (may consist of multiple blocks). + | / | + | / v + || [ ] <-- vector pre header. + |/ | + | v + | [ ] \ + | [ ]_| <-- vector loop. + | | + | v + | -[ ] <--- middle-block. + | / | + | / v + -|- >[ ] <--- new preheader. + | | + | v + | [ ] \ + | [ ]_| <-- old scalar loop to handle remainder. + \ | + \ v + >[ ] <-- exit block. + ... + */ - // The vector from which to take the initial value for the current iteration - // (actual or unrolled). Initially, this is the vector phi node. - Value *Incoming = VecPhi; + BasicBlock *OldBasicBlock = OrigLoop->getHeader(); + BasicBlock *VectorPH = OrigLoop->getLoopPreheader(); + BasicBlock *ExitBlock = OrigLoop->getExitBlock(); + assert(VectorPH && "Invalid loop structure"); + assert(ExitBlock && "Must have an exit block"); - // Shuffle the current and previous vector and update the vector parts. - for (unsigned Part = 0; Part < UF; ++Part) { - auto *Shuffle = - VF > 1 - ? 
Builder.CreateShuffleVector(Incoming, PreviousParts[Part], - ConstantVector::get(ShuffleMask)) - : Incoming; - PhiParts[Part]->replaceAllUsesWith(Shuffle); - cast(PhiParts[Part])->eraseFromParent(); - PhiParts[Part] = Shuffle; - Incoming = PreviousParts[Part]; - } + // Some loops have a single integer induction variable, while other loops + // don't. One example is c++ iterators that often have multiple pointer + // induction variables. In the code below we also support a case where we + // don't have a single induction variable. + // + // We try to obtain an induction variable from the original loop as hard + // as possible. However if we don't find one that: + // - is an integer + // - counts from zero, stepping by one + // - is the size of the widest induction variable type + // then we create a new one. + OldInduction = Legal->getPrimaryInduction(); + Type *IdxTy = Legal->getWidestInductionType(); - // Fix the latch value of the new recurrence in the vector loop. - VecPhi->addIncoming(Incoming, LI->getLoopFor(LoopVectorBody)->getLoopLatch()); + // Split the single block loop into the two loop structure described above. + BasicBlock *VecBody = + VectorPH->splitBasicBlock(VectorPH->getTerminator(), "vector.body"); + BasicBlock *MiddleBlock = + VecBody->splitBasicBlock(VecBody->getTerminator(), "middle.block"); + BasicBlock *ScalarPH = + MiddleBlock->splitBasicBlock(MiddleBlock->getTerminator(), "scalar.ph"); - // Extract the last vector element in the middle block. This will be the - // initial value for the recurrence when jumping to the scalar loop. - auto *Extract = Incoming; - if (VF > 1) { - Builder.SetInsertPoint(LoopMiddleBlock->getTerminator()); - Extract = Builder.CreateExtractElement(Extract, Builder.getInt32(VF - 1), - "vector.recur.extract"); + // Create and register the new vector loop. + Loop *Lp = new Loop(); + Loop *ParentLoop = OrigLoop->getParentLoop(); + + // Insert the new loop into the loop nest and register the new basic blocks + // before calling any utilities such as SCEV that require valid LoopInfo. + if (ParentLoop) { + ParentLoop->addChildLoop(Lp); + ParentLoop->addBasicBlockToLoop(ScalarPH, *LI); + ParentLoop->addBasicBlockToLoop(MiddleBlock, *LI); + } else { + LI->addTopLevelLoop(Lp); } + Lp->addBasicBlockToLoop(VecBody, *LI); - // Fix the initial value of the original recurrence in the scalar loop. - Builder.SetInsertPoint(&*LoopScalarPreHeader->begin()); - auto *Start = Builder.CreatePHI(Phi->getType(), 2, "scalar.recur.init"); - for (auto *BB : predecessors(LoopScalarPreHeader)) { - auto *Incoming = BB == LoopMiddleBlock ? Extract : ScalarInit; - Start->addIncoming(Incoming, BB); - } + // Find the loop boundaries. + Value *Count = getOrCreateTripCount(Lp); - Phi->setIncomingValue(Phi->getBasicBlockIndex(LoopScalarPreHeader), Start); - Phi->setName("scalar.recur"); + Value *StartIdx = ConstantInt::get(IdxTy, 0); - // Finally, fix users of the recurrence outside the loop. The users will need - // either the last value of the scalar recurrence or the last value of the - // vector recurrence we extracted in the middle block. Since the loop is in - // LCSSA form, we just need to find the phi node for the original scalar - // recurrence in the exit block, and then add an edge for the middle block. 
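// Illustrative aside: a standalone model of how the diagram above splits the
// iteration space between the vector body and the old scalar loop, once the
// guard blocks (min.iters.check, cmp.zero, SCEV and memory checks) emitted by
// the functions above are wired together just below in createEmptyLoop().
// splitIterations() and IterationSplit are invented names for this sketch,
// and it ignores the scalar-epilogue adjustment handled in
// getOrCreateVectorTripCount().
#include <cassert>
#include <cstdint>

struct IterationSplit {
  uint64_t VectorIterations; // executions of the vector body, VF * UF lanes each
  uint64_t ScalarIterations; // iterations left to the scalar remainder loop
};

static IterationSplit splitIterations(uint64_t TC, unsigned VF, unsigned UF,
                                      bool RuntimeChecksFailed) {
  uint64_t Step = uint64_t(VF) * UF;
  if (RuntimeChecksFailed || TC < Step)
    return {0, TC};                  // every bypass branch targets the scalar loop
  uint64_t VecTC = TC - TC % Step;   // "n.vec"
  return {VecTC / Step, TC - VecTC};
}

int main() {
  IterationSplit S = splitIterations(/*TC=*/100, /*VF=*/4, /*UF=*/2, false);
  assert(S.VectorIterations == 12 && S.ScalarIterations == 4);
  // "cmp.n" in the middle block: with no remainder, control goes straight to
  // the exit block instead of the scalar loop.
  assert(splitIterations(64, 4, 2, false).ScalarIterations == 0);
  return 0;
}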
- for (auto &I : *LoopExitBlock) { - auto *LCSSAPhi = dyn_cast(&I); - if (!LCSSAPhi) - break; - if (LCSSAPhi->getIncomingValue(0) == Phi) { - LCSSAPhi->addIncoming(Extract, LoopMiddleBlock); - break; - } - } -} + // We need to test whether the backedge-taken count is uint##_max. Adding one + // to it will cause overflow and an incorrect loop trip count in the vector + // body. In case of overflow we want to directly jump to the scalar remainder + // loop. + emitMinimumIterationCountCheck(Lp, ScalarPH); + // Now, compare the new count to zero. If it is zero skip the vector loop and + // jump to the scalar loop. + emitVectorLoopEnteredCheck(Lp, ScalarPH); + // Generate the code to check any assumptions that we've made for SCEV + // expressions. + emitSCEVChecks(Lp, ScalarPH); -void InnerLoopVectorizer::fixLCSSAPHIs() { - for (Instruction &LEI : *LoopExitBlock) { - auto *LCSSAPhi = dyn_cast(&LEI); - if (!LCSSAPhi) - break; - if (LCSSAPhi->getNumIncomingValues() == 1) - LCSSAPhi->addIncoming(UndefValue::get(LCSSAPhi->getType()), - LoopMiddleBlock); - } -} + // Generate the code that checks in runtime if arrays overlap. We put the + // checks into a separate block to make the more common case of few elements + // faster. + emitMemRuntimeChecks(Lp, ScalarPH); -void InnerLoopVectorizer::collectTriviallyDeadInstructions() { - BasicBlock *Latch = OrigLoop->getLoopLatch(); + // Generate the induction variable. + // The loop step is equal to the vectorization factor (num of SIMD elements) + // times the unroll factor (num of SIMD instructions). + Value *CountRoundDown = getOrCreateVectorTripCount(Lp); + Constant *Step = ConstantInt::get(IdxTy, VF * UF); + Induction = + createInductionVariable(Lp, StartIdx, CountRoundDown, Step, + getDebugLocFromInstOrOperands(OldInduction)); - // We create new control-flow for the vectorized loop, so the original - // condition will be dead after vectorization if it's only used by the - // branch. - auto *Cmp = dyn_cast(Latch->getTerminator()->getOperand(0)); - if (Cmp && Cmp->hasOneUse()) - DeadInstructions.insert(Cmp); + // We are going to resume the execution of the scalar loop. + // Go over all of the induction variables that we found and fix the + // PHIs that are left in the scalar version of the loop. + // The starting values of PHI nodes depend on the counter of the last + // iteration in the vectorized loop. + // If we come from a bypass edge then we need to start from the original + // start value. - // We create new "steps" for induction variable updates to which the original - // induction variables map. An original update instruction will be dead if - // all its users except the induction variable are dead. - for (auto &Induction : *Legal->getInductionVars()) { - PHINode *Ind = Induction.first; - auto *IndUpdate = cast(Ind->getIncomingValueForBlock(Latch)); - if (all_of(IndUpdate->users(), [&](User *U) -> bool { - return U == Ind || DeadInstructions.count(cast(U)); - })) - DeadInstructions.insert(IndUpdate); - } -} + // This variable saves the new starting index for the scalar loop. It is used + // to test if there are any tail iterations left once the vector loop has + // completed. + LoopVectorizationLegality::InductionList *List = Legal->getInductionVars(); + for (auto &InductionEntry : *List) { + PHINode *OrigPhi = InductionEntry.first; + InductionDescriptor II = InductionEntry.second; -void InnerLoopVectorizer::sinkScalarOperands(Instruction *PredInst) { + // Create phi nodes to merge from the backedge-taken check block. 
+ PHINode *BCResumeVal = PHINode::Create( + OrigPhi->getType(), 3, "bc.resume.val", ScalarPH->getTerminator()); + Value *&EndValue = IVEndValues[OrigPhi]; + if (OrigPhi == OldInduction) { + // We know what the end value is. + EndValue = CountRoundDown; + } else { + IRBuilder<> B(LoopBypassBlocks.back()->getTerminator()); + Type *StepType = II.getStep()->getType(); + Instruction::CastOps CastOp = + CastInst::getCastOpcode(CountRoundDown, true, StepType, true); + Value *CRD = B.CreateCast(CastOp, CountRoundDown, StepType, "cast.crd"); + const DataLayout &DL = OrigLoop->getHeader()->getModule()->getDataLayout(); + EndValue = II.transform(B, CRD, PSE.getSE(), DL); + EndValue->setName("ind.end"); + } - // The basic block and loop containing the predicated instruction. - auto *PredBB = PredInst->getParent(); - auto *VectorLoop = LI->getLoopFor(PredBB); + // The new PHI merges the original incoming value, in case of a bypass, + // or the value at the end of the vectorized loop. + BCResumeVal->addIncoming(EndValue, MiddleBlock); - // Initialize a worklist with the operands of the predicated instruction. - SetVector Worklist(PredInst->op_begin(), PredInst->op_end()); + // Fix the scalar body counter (PHI node). + unsigned BlockIdx = OrigPhi->getBasicBlockIndex(ScalarPH); - // Holds instructions that we need to analyze again. An instruction may be - // reanalyzed if we don't yet know if we can sink it or not. - SmallVector InstsToReanalyze; + // The old induction's phi node in the scalar body needs the truncated + // value. + for (BasicBlock *BB : LoopBypassBlocks) + BCResumeVal->addIncoming(II.getStartValue(), BB); + OrigPhi->setIncomingValue(BlockIdx, BCResumeVal); + } - // Returns true if a given use occurs in the predicated block. Phi nodes use - // their operands in their corresponding predecessor blocks. - auto isBlockOfUsePredicated = [&](Use &U) -> bool { - auto *I = cast(U.getUser()); - BasicBlock *BB = I->getParent(); - if (auto *Phi = dyn_cast(I)) - BB = Phi->getIncomingBlock( - PHINode::getIncomingValueNumForOperand(U.getOperandNo())); - return BB == PredBB; - }; + // Add a check in the middle block to see if we have completed + // all of the iterations in the first vector loop. + // If (N - N%VF) == N, then we *don't* need to run the remainder. + Value *CmpN = + CmpInst::Create(Instruction::ICmp, CmpInst::ICMP_EQ, Count, + CountRoundDown, "cmp.n", MiddleBlock->getTerminator()); + ReplaceInstWithInst(MiddleBlock->getTerminator(), + BranchInst::Create(ExitBlock, ScalarPH, CmpN)); - // Iteratively sink the scalarized operands of the predicated instruction - // into the block we created for it. When an instruction is sunk, it's - // operands are then added to the worklist. The algorithm ends after one pass - // through the worklist doesn't sink a single instruction. - bool Changed; - do { + // Get ready to start creating new instructions into the vectorized body. + Builder.SetInsertPoint(&*VecBody->getFirstInsertionPt()); - // Add the instructions that need to be reanalyzed to the worklist, and - // reset the changed indicator. - Worklist.insert(InstsToReanalyze.begin(), InstsToReanalyze.end()); - InstsToReanalyze.clear(); - Changed = false; + // Save the state. 
+ LoopVectorPreHeader = Lp->getLoopPreheader(); + LoopScalarPreHeader = ScalarPH; + LoopMiddleBlock = MiddleBlock; + LoopExitBlock = ExitBlock; + LoopVectorBody = VecBody; + LoopScalarBody = OldBasicBlock; - while (!Worklist.empty()) { - auto *I = dyn_cast(Worklist.pop_back_val()); + // Keep all loop hints from the original loop on the vector loop (we'll + // replace the vectorizer-specific hints below). + if (MDNode *LID = OrigLoop->getLoopID()) + Lp->setLoopID(LID); - // We can't sink an instruction if it is a phi node, is already in the - // predicated block, is not in the loop, or may have side effects. - if (!I || isa(I) || I->getParent() == PredBB || - !VectorLoop->contains(I) || I->mayHaveSideEffects()) - continue; + LoopVectorizeHints Hints(Lp, true, *ORE); + Hints.setAlreadyVectorized(); +} - // It's legal to sink the instruction if all its uses occur in the - // predicated block. Otherwise, there's nothing to do yet, and we may - // need to reanalyze the instruction. - if (!all_of(I->uses(), isBlockOfUsePredicated)) { - InstsToReanalyze.push_back(I); - continue; - } +// Fix up external users of the induction variable. At this point, we are +// in LCSSA form, with all external PHIs that use the IV having one input value, +// coming from the remainder loop. We need those PHIs to also have a correct +// value for the IV when arriving directly from the middle block. +void InnerLoopVectorizer::fixupIVUsers(PHINode *OrigPhi, + const InductionDescriptor &II, + Value *CountRoundDown, Value *EndValue, + BasicBlock *MiddleBlock) { + // There are two kinds of external IV usages - those that use the value + // computed in the last iteration (the PHI) and those that use the penultimate + // value (the value that feeds into the phi from the loop latch). + // We allow both, but they, obviously, have different values. - // Move the instruction to the beginning of the predicated block, and add - // it's operands to the worklist. - I->moveBefore(&*PredBB->getFirstInsertionPt()); - Worklist.insert(I->op_begin(), I->op_end()); + assert(OrigLoop->getExitBlock() && "Expected a single exit block"); - // The sinking may have enabled other instructions to be sunk, so we will - // need to iterate. - Changed = true; + DenseMap MissingVals; + + // An external user of the last iteration's value should see the value that + // the remainder loop uses to initialize its own IV. + Value *PostInc = OrigPhi->getIncomingValueForBlock(OrigLoop->getLoopLatch()); + for (User *U : PostInc->users()) { + Instruction *UI = cast(U); + if (!OrigLoop->contains(UI)) { + assert(isa(UI) && "Expected LCSSA form"); + MissingVals[UI] = EndValue; } - } while (Changed); -} + } -void InnerLoopVectorizer::predicateInstructions() { + // An external user of the penultimate value need to see EndValue - Step. + // The simplest way to get this is to recompute it from the constituent SCEVs, + // that is Start + (Step * (CRD - 1)). + for (User *U : OrigPhi->users()) { + auto *UI = cast(U); + if (!OrigLoop->contains(UI)) { + const DataLayout &DL = + OrigLoop->getHeader()->getModule()->getDataLayout(); + assert(isa(UI) && "Expected LCSSA form"); - // For each instruction I marked for predication on value C, split I into its - // own basic block to form an if-then construct over C. Since I may be fed by - // an extractelement instruction or other scalar operand, we try to - // iteratively sink its scalar operands into the predicated block. 
If I feeds - // an insertelement instruction, we try to move this instruction into the - // predicated block as well. For non-void types, a phi node will be created - // for the resulting value (either vector or scalar). - // - // So for some predicated instruction, e.g. the conditional sdiv in: - // - // for.body: - // ... - // %add = add nsw i32 %mul, %0 - // %cmp5 = icmp sgt i32 %2, 7 - // br i1 %cmp5, label %if.then, label %if.end - // - // if.then: - // %div = sdiv i32 %0, %1 - // br label %if.end - // - // if.end: - // %x.0 = phi i32 [ %div, %if.then ], [ %add, %for.body ] - // - // the sdiv at this point is scalarized and if-converted using a select. - // The inactive elements in the vector are not used, but the predicated - // instruction is still executed for all vector elements, essentially: - // - // vector.body: - // ... - // %17 = add nsw <2 x i32> %16, %wide.load - // %29 = extractelement <2 x i32> %wide.load, i32 0 - // %30 = extractelement <2 x i32> %wide.load51, i32 0 - // %31 = sdiv i32 %29, %30 - // %32 = insertelement <2 x i32> undef, i32 %31, i32 0 - // %35 = extractelement <2 x i32> %wide.load, i32 1 - // %36 = extractelement <2 x i32> %wide.load51, i32 1 - // %37 = sdiv i32 %35, %36 - // %38 = insertelement <2 x i32> %32, i32 %37, i32 1 - // %predphi = select <2 x i1> %26, <2 x i32> %38, <2 x i32> %17 - // - // Predication will now re-introduce the original control flow to avoid false - // side-effects by the sdiv instructions on the inactive elements, yielding - // (after cleanup): - // - // vector.body: - // ... - // %5 = add nsw <2 x i32> %4, %wide.load - // %8 = icmp sgt <2 x i32> %wide.load52, - // %9 = extractelement <2 x i1> %8, i32 0 - // br i1 %9, label %pred.sdiv.if, label %pred.sdiv.continue - // - // pred.sdiv.if: - // %10 = extractelement <2 x i32> %wide.load, i32 0 - // %11 = extractelement <2 x i32> %wide.load51, i32 0 - // %12 = sdiv i32 %10, %11 - // %13 = insertelement <2 x i32> undef, i32 %12, i32 0 - // br label %pred.sdiv.continue - // - // pred.sdiv.continue: - // %14 = phi <2 x i32> [ undef, %vector.body ], [ %13, %pred.sdiv.if ] - // %15 = extractelement <2 x i1> %8, i32 1 - // br i1 %15, label %pred.sdiv.if54, label %pred.sdiv.continue55 - // - // pred.sdiv.if54: - // %16 = extractelement <2 x i32> %wide.load, i32 1 - // %17 = extractelement <2 x i32> %wide.load51, i32 1 - // %18 = sdiv i32 %16, %17 - // %19 = insertelement <2 x i32> %14, i32 %18, i32 1 - // br label %pred.sdiv.continue55 - // - // pred.sdiv.continue55: - // %20 = phi <2 x i32> [ %14, %pred.sdiv.continue ], [ %19, %pred.sdiv.if54 ] - // %predphi = select <2 x i1> %8, <2 x i32> %20, <2 x i32> %5 + IRBuilder<> B(MiddleBlock->getTerminator()); + Value *CountMinusOne = B.CreateSub( + CountRoundDown, ConstantInt::get(CountRoundDown->getType(), 1)); + Value *CMO = B.CreateSExtOrTrunc(CountMinusOne, II.getStep()->getType(), + "cast.cmo"); + Value *Escape = II.transform(B, CMO, PSE.getSE(), DL); + Escape->setName("ind.escape"); + MissingVals[UI] = Escape; + } + } - for (auto KV : PredicatedInstructions) { - BasicBlock::iterator I(KV.first); - BasicBlock *Head = I->getParent(); - auto *BB = SplitBlock(Head, &*std::next(I), DT, LI); - auto *T = SplitBlockAndInsertIfThen(KV.second, &*I, /*Unreachable=*/false, - /*BranchWeights=*/nullptr, DT, LI); - I->moveBefore(T); - sinkScalarOperands(&*I); + for (auto &I : MissingVals) { + PHINode *PHI = cast(I.first); + // One corner case we have to handle is two IVs "chasing" each-other, + // that is %IV2 = phi [...], [ %IV1, %latch ] + // In this 
case, if IV1 has an external use, we need to avoid adding both + // "last value of IV1" and "penultimate value of IV2". So, verify that we + // don't already have an incoming value for the middle block. + if (PHI->getBasicBlockIndex(MiddleBlock) == -1) + PHI->addIncoming(I.second, MiddleBlock); + } +} - I->getParent()->setName(Twine("pred.") + I->getOpcodeName() + ".if"); - BB->setName(Twine("pred.") + I->getOpcodeName() + ".continue"); +namespace { +struct CSEDenseMapInfo { + static bool canHandle(Instruction *I) { + return isa(I) || isa(I) || + isa(I) || isa(I); + } + static inline Instruction *getEmptyKey() { + return DenseMapInfo::getEmptyKey(); + } + static inline Instruction *getTombstoneKey() { + return DenseMapInfo::getTombstoneKey(); + } + static unsigned getHashValue(Instruction *I) { + assert(canHandle(I) && "Unknown instruction!"); + return hash_combine(I->getOpcode(), hash_combine_range(I->value_op_begin(), + I->value_op_end())); + } + static bool isEqual(Instruction *LHS, Instruction *RHS) { + if (LHS == getEmptyKey() || RHS == getEmptyKey() || + LHS == getTombstoneKey() || RHS == getTombstoneKey()) + return LHS == RHS; + return LHS->isIdenticalTo(RHS); + } +}; +} - // If the instruction is non-void create a Phi node at reconvergence point. - if (!I->getType()->isVoidTy()) { - Value *IncomingTrue = nullptr; - Value *IncomingFalse = nullptr; +///\brief Perform cse of induction variable instructions. +static void cse(BasicBlock *BB) { + // Perform simple cse. + SmallDenseMap CSEMap; + for (BasicBlock::iterator I = BB->begin(), E = BB->end(); I != E;) { + Instruction *In = &*I++; - if (I->hasOneUse() && isa(*I->user_begin())) { - // If the predicated instruction is feeding an insert-element, move it - // into the Then block; Phi node will be created for the vector. - InsertElementInst *IEI = cast(*I->user_begin()); - IEI->moveBefore(T); - IncomingTrue = IEI; // the new vector with the inserted element. - IncomingFalse = IEI->getOperand(0); // the unmodified vector - } else { - // Phi node will be created for the scalar predicated instruction. - IncomingTrue = &*I; - IncomingFalse = UndefValue::get(I->getType()); - } + if (!CSEDenseMapInfo::canHandle(In)) + continue; - BasicBlock *PostDom = I->getParent()->getSingleSuccessor(); - assert(PostDom && "Then block has multiple successors"); - PHINode *Phi = - PHINode::Create(IncomingTrue->getType(), 2, "", &PostDom->front()); - IncomingTrue->replaceAllUsesWith(Phi); - Phi->addIncoming(IncomingFalse, Head); - Phi->addIncoming(IncomingTrue, I->getParent()); + // Check if we can replace this instruction with any of the + // visited instructions. + if (Instruction *V = CSEMap.lookup(In)) { + In->replaceAllUsesWith(V); + In->eraseFromParent(); + continue; } - } - DEBUG(DT->verifyDomTree()); + CSEMap[In] = In; + } } -InnerLoopVectorizer::VectorParts -InnerLoopVectorizer::createEdgeMask(BasicBlock *Src, BasicBlock *Dst) { - assert(is_contained(predecessors(Dst), Src) && "Invalid edge"); - - // Look for cached value. - std::pair Edge(Src, Dst); - EdgeMaskCache::iterator ECEntryIt = MaskCache.find(Edge); - if (ECEntryIt != MaskCache.end()) - return ECEntryIt->second; +/// \brief Adds a 'fast' flag to floating point operations. +static Value *addFastMathFlag(Value *V) { + if (isa(V)) { + FastMathFlags Flags; + Flags.setUnsafeAlgebra(); + cast(V)->setFastMathFlags(Flags); + } + return V; +} - VectorParts SrcMask = createBlockInMask(Src); +/// \brief Estimate the overhead of scalarizing an instruction. 
This is a +/// convenience wrapper for the type-based getScalarizationOverhead API. +static unsigned getScalarizationOverhead(Instruction *I, unsigned VF, + const TargetTransformInfo &TTI) { + if (VF == 1) + return 0; - // The terminator has to be a branch inst! - BranchInst *BI = dyn_cast(Src->getTerminator()); - assert(BI && "Unexpected terminator found"); + unsigned Cost = 0; + Type *RetTy = ToVectorTy(I->getType(), VF); + if (!RetTy->isVoidTy()) + Cost += TTI.getScalarizationOverhead(RetTy, true, false); - if (BI->isConditional()) { - VectorParts EdgeMask = getVectorValue(BI->getCondition()); + if (CallInst *CI = dyn_cast(I)) { + SmallVector Operands(CI->arg_operands()); + Cost += TTI.getOperandsScalarizationOverhead(Operands, VF); + } else { + SmallVector Operands(I->operand_values()); + Cost += TTI.getOperandsScalarizationOverhead(Operands, VF); + } - if (BI->getSuccessor(0) != Dst) - for (unsigned part = 0; part < UF; ++part) - EdgeMask[part] = Builder.CreateNot(EdgeMask[part]); + return Cost; +} - for (unsigned part = 0; part < UF; ++part) - EdgeMask[part] = Builder.CreateAnd(EdgeMask[part], SrcMask[part]); +// Estimate cost of a call instruction CI if it were vectorized with factor VF. +// Return the cost of the instruction, including scalarization overhead if it's +// needed. The flag NeedToScalarize shows if the call needs to be scalarized - +// i.e. either vector version isn't available, or is too expensive. +static unsigned getVectorCallCost(CallInst *CI, unsigned VF, + const TargetTransformInfo &TTI, + const TargetLibraryInfo *TLI, + bool &NeedToScalarize) { + Function *F = CI->getCalledFunction(); + StringRef FnName = CI->getCalledFunction()->getName(); + Type *ScalarRetTy = CI->getType(); + SmallVector Tys, ScalarTys; + for (auto &ArgOp : CI->arg_operands()) + ScalarTys.push_back(ArgOp->getType()); - MaskCache[Edge] = EdgeMask; - return EdgeMask; - } + // Estimate cost of scalarized vector call. The source operands are assumed + // to be vectors, so we need to extract individual elements from there, + // execute VF scalar calls, and then gather the result into the vector return + // value. + unsigned ScalarCallCost = TTI.getCallInstrCost(F, ScalarRetTy, ScalarTys); + if (VF == 1) + return ScalarCallCost; - MaskCache[Edge] = SrcMask; - return SrcMask; -} + // Compute corresponding vector type for return value and arguments. + Type *RetTy = ToVectorTy(ScalarRetTy, VF); + for (Type *ScalarTy : ScalarTys) + Tys.push_back(ToVectorTy(ScalarTy, VF)); -InnerLoopVectorizer::VectorParts -InnerLoopVectorizer::createBlockInMask(BasicBlock *BB) { - assert(OrigLoop->contains(BB) && "Block is not a part of a loop"); + // Compute costs of unpacking argument values for the scalar calls and + // packing the return values to a vector. + unsigned ScalarizationCost = getScalarizationOverhead(CI, VF, TTI); - // Loop incoming mask is all-one. - if (OrigLoop->getHeader() == BB) { - Value *C = ConstantInt::get(IntegerType::getInt1Ty(BB->getContext()), 1); - return getVectorValue(C); - } + unsigned Cost = ScalarCallCost * VF + ScalarizationCost; - // This is the block mask. We OR all incoming edges, and with zero. - Value *Zero = ConstantInt::get(IntegerType::getInt1Ty(BB->getContext()), 0); - VectorParts BlockMask = getVectorValue(Zero); + // If we can't emit a vector call for this function, then the currently found + // cost is the cost we need to return. 
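// Illustrative aside: the choice getVectorCallCost() is making here, restated
// as a standalone cost model. The TTI/TLI queries are replaced by plain
// parameters, and CallCostInputs / vectorCallCost() are invented names for
// this sketch.
#include <cassert>

struct CallCostInputs {
  unsigned ScalarCallCost;    // cost of one scalar call
  unsigned ScalarizationCost; // operand extracts plus result inserts
  bool HasVectorVariant;      // a vectorized library function exists for VF
  unsigned VectorCallCost;    // cost of that vectorized call
};

// Returns the cost charged for the call at width VF and whether it would have
// to be scalarized.
static unsigned vectorCallCost(const CallCostInputs &In, unsigned VF,
                               bool &NeedToScalarize) {
  if (VF == 1) {
    NeedToScalarize = false;
    return In.ScalarCallCost;
  }
  unsigned ScalarizedCost = In.ScalarCallCost * VF + In.ScalarizationCost;
  NeedToScalarize = true;
  if (In.HasVectorVariant && In.VectorCallCost < ScalarizedCost) {
    NeedToScalarize = false;
    return In.VectorCallCost;
  }
  return ScalarizedCost;
}

int main() {
  bool NeedToScalarize = false;
  CallCostInputs In = {/*ScalarCallCost=*/10, /*ScalarizationCost=*/8,
                       /*HasVectorVariant=*/true, /*VectorCallCost=*/24};
  assert(vectorCallCost(In, 4, NeedToScalarize) == 24 && !NeedToScalarize);
  In.HasVectorVariant = false;
  assert(vectorCallCost(In, 4, NeedToScalarize) == 48 && NeedToScalarize);
  return 0;
}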
+ NeedToScalarize = true; + if (!TLI || !TLI->isFunctionVectorizable(FnName, VF) || CI->isNoBuiltin()) + return Cost; - // For each pred: - for (pred_iterator it = pred_begin(BB), e = pred_end(BB); it != e; ++it) { - VectorParts EM = createEdgeMask(*it, BB); - for (unsigned part = 0; part < UF; ++part) - BlockMask[part] = Builder.CreateOr(BlockMask[part], EM[part]); + // If the corresponding vector cost is cheaper, return its cost. + unsigned VectorCallCost = TTI.getCallInstrCost(nullptr, RetTy, Tys); + if (VectorCallCost < Cost) { + NeedToScalarize = false; + return VectorCallCost; } - - return BlockMask; + return Cost; } -void InnerLoopVectorizer::widenPHIInstruction(Instruction *PN, unsigned UF, - unsigned VF, PhiVector *PV) { - PHINode *P = cast(PN); - // Handle recurrences. - if (Legal->isReductionVariable(P) || Legal->isFirstOrderRecurrence(P)) { - VectorParts Entry(UF); - for (unsigned part = 0; part < UF; ++part) { - // This is phase one of vectorizing PHIs. - Type *VecTy = - (VF == 1) ? PN->getType() : VectorType::get(PN->getType(), VF); - Entry[part] = PHINode::Create( - VecTy, 2, "vec.phi", &*LoopVectorBody->getFirstInsertionPt()); - } - VectorLoopValueMap.initVector(P, Entry); - PV->push_back(P); - return; - } +// Estimate cost of an intrinsic call instruction CI if it were vectorized with +// factor VF. Return the cost of the instruction, including scalarization +// overhead if it's needed. +static unsigned getVectorIntrinsicCost(CallInst *CI, unsigned VF, + const TargetTransformInfo &TTI, + const TargetLibraryInfo *TLI) { + Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); + assert(ID && "Expected intrinsic call!"); - setDebugLocFromInst(Builder, P); - // Check for PHI nodes that are lowered to vector selects. - if (P->getParent() != OrigLoop->getHeader()) { - // We know that all PHIs in non-header blocks are converted into - // selects, so we don't have to worry about the insertion order and we - // can just use the builder. - // At this point we generate the predication tree. There may be - // duplications since this is a simple recursive scan, but future - // optimizations will clean it up. + Type *RetTy = ToVectorTy(CI->getType(), VF); + SmallVector Tys; + for (Value *ArgOperand : CI->arg_operands()) + Tys.push_back(ToVectorTy(ArgOperand->getType(), VF)); - unsigned NumIncoming = P->getNumIncomingValues(); + FastMathFlags FMF; + if (auto *FPMO = dyn_cast(CI)) + FMF = FPMO->getFastMathFlags(); - // Generate a sequence of selects of the form: - // SELECT(Mask3, In3, - // SELECT(Mask2, In2, - // ( ...))) - VectorParts Entry(UF); - for (unsigned In = 0; In < NumIncoming; In++) { - VectorParts Cond = - createEdgeMask(P->getIncomingBlock(In), P->getParent()); - const VectorParts &In0 = getVectorValue(P->getIncomingValue(In)); + return TTI.getIntrinsicInstrCost(ID, RetTy, Tys, FMF); +} - for (unsigned part = 0; part < UF; ++part) { - // We might have single edge PHIs (blocks) - use an identity - // 'select' for the first PHI operand. - if (In == 0) - Entry[part] = Builder.CreateSelect(Cond[part], In0[part], In0[part]); - else - // Select between the current value and the previous incoming edge - // based on the incoming mask. - Entry[part] = Builder.CreateSelect(Cond[part], In0[part], Entry[part], - "predphi"); +static Type *smallestIntegerVectorType(Type *T1, Type *T2) { + auto *I1 = cast(T1->getVectorElementType()); + auto *I2 = cast(T2->getVectorElementType()); + return I1->getBitWidth() < I2->getBitWidth() ? 
T1 : T2; +} +static Type *largestIntegerVectorType(Type *T1, Type *T2) { + auto *I1 = cast(T1->getVectorElementType()); + auto *I2 = cast(T2->getVectorElementType()); + return I1->getBitWidth() > I2->getBitWidth() ? T1 : T2; +} + +void InnerLoopVectorizer::truncateToMinimalBitwidths() { + // For every instruction `I` in MinBWs, truncate the operands, create a + // truncated version of `I` and reextend its result. InstCombine runs + // later and will remove any ext/trunc pairs. + // + SmallPtrSet Erased; + for (const auto &KV : Cost->getMinimalBitwidths()) { + // If the value wasn't vectorized, we must maintain the original scalar + // type. The absence of the value from VectorLoopValueMap indicates that it + // wasn't vectorized. + if (!VectorLoopValueMap.hasVector(KV.first)) + continue; + VectorParts &Parts = VectorLoopValueMap.getVector(KV.first); + for (Value *&I : Parts) { + if (Erased.count(I) || I->use_empty() || !isa(I)) + continue; + Type *OriginalTy = I->getType(); + Type *ScalarTruncatedTy = + IntegerType::get(OriginalTy->getContext(), KV.second); + Type *TruncatedTy = VectorType::get(ScalarTruncatedTy, + OriginalTy->getVectorNumElements()); + if (TruncatedTy == OriginalTy) + continue; + + IRBuilder<> B(cast(I)); + auto ShrinkOperand = [&](Value *V) -> Value * { + if (auto *ZI = dyn_cast(V)) + if (ZI->getSrcTy() == TruncatedTy) + return ZI->getOperand(0); + return B.CreateZExtOrTrunc(V, TruncatedTy); + }; + + // The actual instruction modification depends on the instruction type, + // unfortunately. + Value *NewI = nullptr; + if (auto *BO = dyn_cast(I)) { + NewI = B.CreateBinOp(BO->getOpcode(), ShrinkOperand(BO->getOperand(0)), + ShrinkOperand(BO->getOperand(1))); + cast(NewI)->copyIRFlags(I); + } else if (auto *CI = dyn_cast(I)) { + NewI = + B.CreateICmp(CI->getPredicate(), ShrinkOperand(CI->getOperand(0)), + ShrinkOperand(CI->getOperand(1))); + } else if (auto *SI = dyn_cast(I)) { + NewI = B.CreateSelect(SI->getCondition(), + ShrinkOperand(SI->getTrueValue()), + ShrinkOperand(SI->getFalseValue())); + } else if (auto *CI = dyn_cast(I)) { + switch (CI->getOpcode()) { + default: + llvm_unreachable("Unhandled cast!"); + case Instruction::Trunc: + NewI = ShrinkOperand(CI->getOperand(0)); + break; + case Instruction::SExt: + NewI = B.CreateSExtOrTrunc( + CI->getOperand(0), + smallestIntegerVectorType(OriginalTy, TruncatedTy)); + break; + case Instruction::ZExt: + NewI = B.CreateZExtOrTrunc( + CI->getOperand(0), + smallestIntegerVectorType(OriginalTy, TruncatedTy)); + break; + } + } else if (auto *SI = dyn_cast(I)) { + auto Elements0 = SI->getOperand(0)->getType()->getVectorNumElements(); + auto *O0 = B.CreateZExtOrTrunc( + SI->getOperand(0), VectorType::get(ScalarTruncatedTy, Elements0)); + auto Elements1 = SI->getOperand(1)->getType()->getVectorNumElements(); + auto *O1 = B.CreateZExtOrTrunc( + SI->getOperand(1), VectorType::get(ScalarTruncatedTy, Elements1)); + + NewI = B.CreateShuffleVector(O0, O1, SI->getMask()); + } else if (isa(I)) { + // Don't do anything with the operands, just extend the result. 
+ continue; + } else if (auto *IE = dyn_cast(I)) { + auto Elements = IE->getOperand(0)->getType()->getVectorNumElements(); + auto *O0 = B.CreateZExtOrTrunc( + IE->getOperand(0), VectorType::get(ScalarTruncatedTy, Elements)); + auto *O1 = B.CreateZExtOrTrunc(IE->getOperand(1), ScalarTruncatedTy); + NewI = B.CreateInsertElement(O0, O1, IE->getOperand(2)); + } else if (auto *EE = dyn_cast(I)) { + auto Elements = EE->getOperand(0)->getType()->getVectorNumElements(); + auto *O0 = B.CreateZExtOrTrunc( + EE->getOperand(0), VectorType::get(ScalarTruncatedTy, Elements)); + NewI = B.CreateExtractElement(O0, EE->getOperand(2)); + } else { + llvm_unreachable("Unhandled instruction type!"); } + + // Lastly, extend the result. + NewI->takeName(cast(I)); + Value *Res = B.CreateZExtOrTrunc(NewI, OriginalTy); + I->replaceAllUsesWith(Res); + cast(I)->eraseFromParent(); + Erased.insert(I); + I = Res; } - VectorLoopValueMap.initVector(P, Entry); - return; } - // This PHINode must be an induction variable. - // Make sure that we know about it. - assert(Legal->getInductionVars()->count(P) && "Not an induction variable"); - - InductionDescriptor II = Legal->getInductionVars()->lookup(P); - const DataLayout &DL = OrigLoop->getHeader()->getModule()->getDataLayout(); - - // FIXME: The newly created binary instructions should contain nsw/nuw flags, - // which can be found from the original scalar operations. - switch (II.getKind()) { - case InductionDescriptor::IK_NoInduction: - llvm_unreachable("Unknown induction"); - case InductionDescriptor::IK_IntInduction: - return widenIntInduction(P); - case InductionDescriptor::IK_PtrInduction: { - // Handle the pointer induction variable case. - assert(P->getType()->isPointerTy() && "Unexpected type."); - // This is the normalized GEP that starts counting at zero. - Value *PtrInd = Induction; - PtrInd = Builder.CreateSExtOrTrunc(PtrInd, II.getStep()->getType()); - // Determine the number of scalars we need to generate for each unroll - // iteration. If the instruction is uniform, we only need to generate the - // first lane. Otherwise, we generate all VF values. - unsigned Lanes = Cost->isUniformAfterVectorization(P, VF) ? 1 : VF; - // These are the scalar results. Notice that we don't generate vector GEPs - // because scalar GEPs result in better code. - ScalarParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Entry[Part].resize(VF); - for (unsigned Lane = 0; Lane < Lanes; ++Lane) { - Constant *Idx = ConstantInt::get(PtrInd->getType(), Lane + Part * VF); - Value *GlobalIdx = Builder.CreateAdd(PtrInd, Idx); - Value *SclrGep = II.transform(Builder, GlobalIdx, PSE.getSE(), DL); - SclrGep->setName("next.gep"); - Entry[Part][Lane] = SclrGep; + // We'll have created a bunch of ZExts that are now parentless. Clean up. + for (const auto &KV : Cost->getMinimalBitwidths()) { + // If the value wasn't vectorized, we must maintain the original scalar + // type. The absence of the value from VectorLoopValueMap indicates that it + // wasn't vectorized. 
+ if (!VectorLoopValueMap.hasVector(KV.first)) + continue; + VectorParts &Parts = VectorLoopValueMap.getVector(KV.first); + for (Value *&I : Parts) { + ZExtInst *Inst = dyn_cast(I); + if (Inst && Inst->use_empty()) { + Value *NewI = Inst->getOperand(0); + Inst->eraseFromParent(); + I = NewI; } } - VectorLoopValueMap.initScalar(P, Entry); - return; } - case InductionDescriptor::IK_FpInduction: { - assert(P->getType() == II.getStartValue()->getType() && - "Types must match"); - // Handle other induction variables that are now based on the - // canonical one. - assert(P != OldInduction && "Primary induction can be integer only"); +} - Value *V = Builder.CreateCast(Instruction::SIToFP, Induction, P->getType()); - V = II.transform(Builder, V, PSE.getSE(), DL); - V->setName("fp.offset.idx"); +void InnerLoopVectorizer::vectorizeLoop() { - // Now we have scalar op: %fp.offset.idx = StartVal +/- Induction*StepVal + //===------------------------------------------------===// + // + // Notice: any optimization or new instruction that go + // into the code below should be also be implemented in + // the cost-model. + // + //===------------------------------------------------===// - Value *Broadcasted = getBroadcastInstrs(V); - // After broadcasting the induction variable we need to make the vector - // consecutive by adding StepVal*0, StepVal*1, StepVal*2, etc. - Value *StepVal = cast(II.getStep())->getValue(); - VectorParts Entry(UF); - for (unsigned part = 0; part < UF; ++part) - Entry[part] = getStepVector(Broadcasted, VF * part, StepVal, - II.getInductionOpcode()); - VectorLoopValueMap.initVector(P, Entry); - return; - } - } -} + // Insert truncates and extends for any truncated instructions as hints to + // InstCombine. + if (VF > 1) + truncateToMinimalBitwidths(); -/// A helper function for checking whether an integer division-related -/// instruction may divide by zero (in which case it must be predicated if -/// executed conditionally in the scalar code). -/// TODO: It may be worthwhile to generalize and check isKnownNonZero(). -/// Non-zero divisors that are non compile-time constants will not be -/// converted into multiplication, so we will still end up scalarizing -/// the division, but can do so w/o predication. -static bool mayDivideByZero(Instruction &I) { - assert((I.getOpcode() == Instruction::UDiv || - I.getOpcode() == Instruction::SDiv || - I.getOpcode() == Instruction::URem || - I.getOpcode() == Instruction::SRem) && - "Unexpected instruction"); - Value *Divisor = I.getOperand(1); - auto *CInt = dyn_cast(Divisor); - return !CInt || CInt->isZero(); -} - -void InnerLoopVectorizer::vectorizeBlockInLoop(BasicBlock *BB, PhiVector *PV) { - // For each instruction in the old loop. - for (Instruction &I : *BB) { + fixCrossIterationPHIs(); - // If the instruction will become trivially dead when vectorized, we don't - // need to generate it. - if (DeadInstructions.count(&I)) - continue; + // Update the dominator tree. + // + // FIXME: After creating the structure of the new loop, the dominator tree is + // no longer up-to-date, and it remains that way until we update it + // here. An out-of-date dominator tree is problematic for SCEV, + // because SCEVExpander uses it to guide code generation. The + // vectorizer use SCEVExpanders in several places. Instead, we should + // keep the dominator tree up-to-date as we go. + updateAnalysis(); - // Scalarize instructions that should remain scalar after vectorization. 
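// Illustrative aside: the property truncateToMinimalBitwidths() relies on,
// shown on scalars. When the cost model proves only the low KV.second bits of
// a result are demanded, the operation can be performed in the narrow type
// and widened once afterwards; the leftover ext/trunc pairs are later cleaned
// up by InstCombine. The two helpers below are invented for this sketch.
#include <cassert>
#include <cstdint>

static uint32_t wideAddThenMask(uint8_t A, uint8_t B) {
  return (uint32_t(A) + uint32_t(B)) & 0xFFu; // original: widen, add, demand 8 bits
}
static uint32_t narrowAddThenZext(uint8_t A, uint8_t B) {
  uint8_t Narrow = uint8_t(A + B);            // truncated operands, narrow add
  return uint32_t(Narrow);                    // single zext of the result
}

int main() {
  for (unsigned A = 0; A < 256; ++A)
    for (unsigned B = 0; B < 256; ++B)
      assert(wideAddThenMask(uint8_t(A), uint8_t(B)) ==
             narrowAddThenZext(uint8_t(A), uint8_t(B)));
  return 0;
}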
- if (VF > 1 && - !(isa(&I) || isa(&I) || - isa(&I)) && - shouldScalarizeInstruction(&I)) { - scalarizeInstruction(&I, Legal->isScalarWithPredication(&I)); - continue; - } + // Fix-up external users of the induction variables. + for (auto &Entry : *Legal->getInductionVars()) + fixupIVUsers(Entry.first, Entry.second, + getOrCreateVectorTripCount(LI->getLoopFor(LoopVectorBody)), + IVEndValues[Entry.first], LoopMiddleBlock); - switch (I.getOpcode()) { - case Instruction::Br: - // Nothing to do for PHIs and BR, since we already took care of the - // loop control flow instructions. - continue; - case Instruction::PHI: { - // Vectorize PHINodes. - widenPHIInstruction(&I, UF, VF, PV); - continue; - } // End of PHI. - - case Instruction::UDiv: - case Instruction::SDiv: - case Instruction::SRem: - case Instruction::URem: - // Scalarize with predication if this instruction may divide by zero and - // block execution is conditional, otherwise fallthrough. - if (Legal->isScalarWithPredication(&I)) { - scalarizeInstruction(&I, true); - continue; - } - case Instruction::Add: - case Instruction::FAdd: - case Instruction::Sub: - case Instruction::FSub: - case Instruction::Mul: - case Instruction::FMul: - case Instruction::FDiv: - case Instruction::FRem: - case Instruction::Shl: - case Instruction::LShr: - case Instruction::AShr: - case Instruction::And: - case Instruction::Or: - case Instruction::Xor: { - // Just widen binops. - auto *BinOp = cast(&I); - setDebugLocFromInst(Builder, BinOp); - const VectorParts &A = getVectorValue(BinOp->getOperand(0)); - const VectorParts &B = getVectorValue(BinOp->getOperand(1)); - - // Use this vector value for all users of the original instruction. - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Value *V = Builder.CreateBinOp(BinOp->getOpcode(), A[Part], B[Part]); + fixLCSSAPHIs(); - if (BinaryOperator *VecOp = dyn_cast(V)) - VecOp->copyIRFlags(BinOp); + // Remove redundant induction instructions. + cse(LoopVectorBody); +} - Entry[Part] = V; - } +void InnerLoopVectorizer::fixCrossIterationPHIs() { + // In order to support recurrences we need to be able to vectorize Phi nodes. + // Phi nodes have cycles, so we need to vectorize them in two stages. First, + // we create a new vector PHI node with no incoming edges. We use this value + // when we vectorize all of the instructions that use the PHI. Next, after + // all of the instructions in the block are complete we add the new incoming + // edges to the PHI. At this point all of the instructions in the basic block + // are vectorized, so we can use them to construct the PHI. - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, BinOp); + // At this point every instruction in the original loop is widened to a + // vector form. Now we need to fix the recurrences. These PHI nodes are + // currently empty because we did not want to introduce cycles. + // This is the second stage of vectorizing recurrences. + for (Instruction &I : *OrigLoop->getHeader()) { + PHINode *Phi = dyn_cast(&I); + if (!Phi) break; - } - case Instruction::Select: { - // Widen selects. - // If the selector is loop invariant we can create a select - // instruction with a scalar condition. Otherwise, use vector-select. - auto *SE = PSE.getSE(); - bool InvariantCond = - SE->isLoopInvariant(PSE.getSCEV(I.getOperand(0)), OrigLoop); - setDebugLocFromInst(Builder, &I); - - // The condition can be loop invariant but still defined inside the - // loop. This means that we can't just use the original 'cond' value. 
- // We have to take the 'vectorized' value and pick the first lane. - // Instcombine will make this a no-op. - const VectorParts &Cond = getVectorValue(I.getOperand(0)); - const VectorParts &Op0 = getVectorValue(I.getOperand(1)); - const VectorParts &Op1 = getVectorValue(I.getOperand(2)); - - auto *ScalarCond = getScalarValue(I.getOperand(0), 0, 0); + // Handle first-order recurrences and reductions that need to be fixed. + if (Legal->isFirstOrderRecurrence(Phi)) + fixFirstOrderRecurrence(Phi); + else if (Legal->isReductionVariable(Phi)) + fixReduction(Phi); + } +} - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Entry[Part] = Builder.CreateSelect( - InvariantCond ? ScalarCond : Cond[Part], Op0[Part], Op1[Part]); - } +void InnerLoopVectorizer::fixReduction(PHINode *Phi) { + Constant *Zero = Builder.getInt32(0); - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, &I); - break; + // Get the reduction variable descriptor. + RecurrenceDescriptor RdxDesc = (*Legal->getReductionVars())[Phi]; + + RecurrenceDescriptor::RecurrenceKind RK = RdxDesc.getRecurrenceKind(); + TrackingVH ReductionStartValue = RdxDesc.getRecurrenceStartValue(); + Instruction *LoopExitInst = RdxDesc.getLoopExitInstr(); + RecurrenceDescriptor::MinMaxRecurrenceKind MinMaxKind = + RdxDesc.getMinMaxRecurrenceKind(); + setDebugLocFromInst(Builder, ReductionStartValue); + + // We need to generate a reduction vector from the incoming scalar. + // To do so, we need to generate the 'identity' vector and override + // one of the elements with the incoming scalar reduction. We need + // to do it in the vector-loop preheader. + Builder.SetInsertPoint(LoopBypassBlocks[1]->getTerminator()); + + // This is the vector-clone of the value that leaves the loop. + const VectorParts &VectorExit = getVectorValue(LoopExitInst); + Type *VecTy = VectorExit[0]->getType(); + + // Find the reduction identity variable. Zero for addition, or, xor, + // one for multiplication, -1 for And. + Value *Identity; + Value *VectorStart; + if (RK == RecurrenceDescriptor::RK_IntegerMinMax || + RK == RecurrenceDescriptor::RK_FloatMinMax) { + // MinMax reduction have the start value as their identify. + if (VF == 1) { + VectorStart = Identity = ReductionStartValue; + } else { + VectorStart = Identity = + Builder.CreateVectorSplat(VF, ReductionStartValue, "minmax.ident"); } + } else { + // Handle other reduction kinds: + Constant *Iden = + RecurrenceDescriptor::getRecurrenceIdentity(RK, VecTy->getScalarType()); + if (VF == 1) { + Identity = Iden; + // This vector is the Identity vector where the first element is the + // incoming scalar reduction. + VectorStart = ReductionStartValue; + } else { + Identity = ConstantVector::getSplat(VF, Iden); - case Instruction::ICmp: - case Instruction::FCmp: { - // Widen compares. Generate vector compares. 
- bool FCmp = (I.getOpcode() == Instruction::FCmp); - auto *Cmp = dyn_cast(&I); - setDebugLocFromInst(Builder, Cmp); - const VectorParts &A = getVectorValue(Cmp->getOperand(0)); - const VectorParts &B = getVectorValue(Cmp->getOperand(1)); - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Value *C = nullptr; - if (FCmp) { - C = Builder.CreateFCmp(Cmp->getPredicate(), A[Part], B[Part]); - cast(C)->copyFastMathFlags(Cmp); - } else { - C = Builder.CreateICmp(Cmp->getPredicate(), A[Part], B[Part]); - } - Entry[Part] = C; - } - - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, &I); - break; + // This vector is the Identity vector where the first element is the + // incoming scalar reduction. + VectorStart = + Builder.CreateInsertElement(Identity, ReductionStartValue, Zero); } + } - case Instruction::Store: - case Instruction::Load: - vectorizeMemoryInstruction(&I); - break; - case Instruction::ZExt: - case Instruction::SExt: - case Instruction::FPToUI: - case Instruction::FPToSI: - case Instruction::FPExt: - case Instruction::PtrToInt: - case Instruction::IntToPtr: - case Instruction::SIToFP: - case Instruction::UIToFP: - case Instruction::Trunc: - case Instruction::FPTrunc: - case Instruction::BitCast: { - auto *CI = dyn_cast(&I); - setDebugLocFromInst(Builder, CI); - - // Optimize the special case where the source is a constant integer - // induction variable. Notice that we can only optimize the 'trunc' case - // because (a) FP conversions lose precision, (b) sext/zext may wrap, and - // (c) other casts depend on pointer size. - if (Cost->isOptimizableIVTruncate(CI, VF)) { - widenIntInduction(cast(CI->getOperand(0)), - cast(CI)); - break; - } - - /// Vectorize casts. - Type *DestTy = - (VF == 1) ? CI->getType() : VectorType::get(CI->getType(), VF); - - const VectorParts &A = getVectorValue(CI->getOperand(0)); - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) - Entry[Part] = Builder.CreateCast(CI->getOpcode(), A[Part], DestTy); - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, &I); - break; - } + // Fix the vector-loop phi. - case Instruction::Call: { - // Ignore dbg intrinsics. - if (isa(I)) - break; - setDebugLocFromInst(Builder, &I); - - Module *M = BB->getParent()->getParent(); - auto *CI = cast(&I); - - StringRef FnName = CI->getCalledFunction()->getName(); - Function *F = CI->getCalledFunction(); - Type *RetTy = ToVectorTy(CI->getType(), VF); - SmallVector Tys; - for (Value *ArgOperand : CI->arg_operands()) - Tys.push_back(ToVectorTy(ArgOperand->getType(), VF)); - - Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); - if (ID && (ID == Intrinsic::assume || ID == Intrinsic::lifetime_end || - ID == Intrinsic::lifetime_start)) { - scalarizeInstruction(&I); - break; - } - // The flag shows whether we use Intrinsic or a usual Call for vectorized - // version of the instruction. - // Is it beneficial to perform intrinsic call compared to lib call? - bool NeedToScalarize; - unsigned CallCost = getVectorCallCost(CI, VF, *TTI, TLI, NeedToScalarize); - bool UseVectorIntrinsic = - ID && getVectorIntrinsicCost(CI, VF, *TTI, TLI) <= CallCost; - if (!UseVectorIntrinsic && NeedToScalarize) { - scalarizeInstruction(&I); - break; - } + // Reductions do not have to start at zero. They can start with + // any loop invariant values. 
+ const VectorParts &VecRdxPhi = getVectorValue(Phi); + BasicBlock *Latch = OrigLoop->getLoopLatch(); + Value *LoopVal = Phi->getIncomingValueForBlock(Latch); + const VectorParts &Val = getVectorValue(LoopVal); + for (unsigned part = 0; part < UF; ++part) { + // Make sure to add the reduction stat value only to the + // first unroll part. + Value *StartVal = (part == 0) ? VectorStart : Identity; + cast(VecRdxPhi[part])->addIncoming(StartVal, LoopVectorPreHeader); + cast(VecRdxPhi[part]) + ->addIncoming(Val[part], + LI->getLoopFor(LoopVectorBody)->getLoopLatch()); + } + + // Before each round, move the insertion point right between + // the PHIs and the values we are going to write. + // This allows us to write both PHINodes and the extractelement + // instructions. + Builder.SetInsertPoint(&*LoopMiddleBlock->getFirstInsertionPt()); - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - SmallVector Args; - for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) { - Value *Arg = CI->getArgOperand(i); - // Some intrinsics have a scalar argument - don't replace it with a - // vector. - if (!UseVectorIntrinsic || !hasVectorInstrinsicScalarOpd(ID, i)) { - const VectorParts &VectorArg = getVectorValue(CI->getArgOperand(i)); - Arg = VectorArg[Part]; - } - Args.push_back(Arg); - } + VectorParts &RdxParts = VectorLoopValueMap.getVector(LoopExitInst); + setDebugLocFromInst(Builder, LoopExitInst); - Function *VectorF; - if (UseVectorIntrinsic) { - // Use vector version of the intrinsic. - Type *TysForDecl[] = {CI->getType()}; - if (VF > 1) - TysForDecl[0] = VectorType::get(CI->getType()->getScalarType(), VF); - VectorF = Intrinsic::getDeclaration(M, ID, TysForDecl); + // If the vector reduction can be performed in a smaller type, we truncate + // then extend the loop exit value to enable InstCombine to evaluate the + // entire expression in the smaller type. + if (VF > 1 && Phi->getType() != RdxDesc.getRecurrenceType()) { + Type *RdxVecTy = VectorType::get(RdxDesc.getRecurrenceType(), VF); + Builder.SetInsertPoint(LoopVectorBody->getTerminator()); + for (unsigned part = 0; part < UF; ++part) { + Value *Trunc = Builder.CreateTrunc(RdxParts[part], RdxVecTy); + Value *Extnd = RdxDesc.isSigned() ? Builder.CreateSExt(Trunc, VecTy) + : Builder.CreateZExt(Trunc, VecTy); + for (Value::user_iterator UI = RdxParts[part]->user_begin(); + UI != RdxParts[part]->user_end();) + if (*UI != Trunc) { + (*UI++)->replaceUsesOfWith(RdxParts[part], Extnd); + RdxParts[part] = Extnd; } else { - // Use vector version of the library call. - StringRef VFnName = TLI->getVectorizedFunction(FnName, VF); - assert(!VFnName.empty() && "Vector function name is empty."); - VectorF = M->getFunction(VFnName); - if (!VectorF) { - // Generate a declaration - FunctionType *FTy = FunctionType::get(RetTy, Tys, false); - VectorF = - Function::Create(FTy, Function::ExternalLinkage, VFnName, M); - VectorF->copyAttributesFrom(F); - } + ++UI; } - assert(VectorF && "Can't create vector function."); + } + Builder.SetInsertPoint(&*LoopMiddleBlock->getFirstInsertionPt()); + for (unsigned part = 0; part < UF; ++part) + RdxParts[part] = Builder.CreateTrunc(RdxParts[part], RdxVecTy); + } + + // Reduce all of the unrolled parts into a single vector. 
+ Value *ReducedPartRdx = RdxParts[0]; + unsigned Op = RecurrenceDescriptor::getRecurrenceBinOp(RK); + setDebugLocFromInst(Builder, ReducedPartRdx); + for (unsigned part = 1; part < UF; ++part) { + if (Op != Instruction::ICmp && Op != Instruction::FCmp) + // Floating point operations had to be 'fast' to enable the reduction. + ReducedPartRdx = addFastMathFlag( + Builder.CreateBinOp((Instruction::BinaryOps)Op, RdxParts[part], + ReducedPartRdx, "bin.rdx")); + else + ReducedPartRdx = RecurrenceDescriptor::createMinMaxOp( + Builder, MinMaxKind, ReducedPartRdx, RdxParts[part]); + } - SmallVector OpBundles; - CI->getOperandBundlesAsDefs(OpBundles); - CallInst *V = Builder.CreateCall(VectorF, Args, OpBundles); + if (VF > 1) { + // VF is a power of 2 so we can emit the reduction using log2(VF) shuffles + // and vector ops, reducing the set of values being computed by half each + // round. + assert(isPowerOf2_32(VF) && + "Reduction emission only supported for pow2 vectors!"); + Value *TmpVec = ReducedPartRdx; + SmallVector ShuffleMask(VF, nullptr); + for (unsigned i = VF; i != 1; i >>= 1) { + // Move the upper half of the vector to the lower half. + for (unsigned j = 0; j != i / 2; ++j) + ShuffleMask[j] = Builder.getInt32(i / 2 + j); + + // Fill the rest of the mask with undef. + std::fill(&ShuffleMask[i / 2], ShuffleMask.end(), + UndefValue::get(Builder.getInt32Ty())); + + Value *Shuf = Builder.CreateShuffleVector( + TmpVec, UndefValue::get(TmpVec->getType()), + ConstantVector::get(ShuffleMask), "rdx.shuf"); - if (isa(V)) - V->copyFastMathFlags(CI); + if (Op != Instruction::ICmp && Op != Instruction::FCmp) + // Floating point operations had to be 'fast' to enable the reduction. + TmpVec = addFastMathFlag(Builder.CreateBinOp((Instruction::BinaryOps)Op, + TmpVec, Shuf, "bin.rdx")); + else + TmpVec = RecurrenceDescriptor::createMinMaxOp(Builder, MinMaxKind, + TmpVec, Shuf); + } - Entry[Part] = V; - } + // The result is in the first element of the vector. + ReducedPartRdx = + Builder.CreateExtractElement(TmpVec, Builder.getInt32(0)); - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, &I); + // If the reduction can be performed in a smaller type, we need to extend + // the reduction to the wider type before we branch to the original loop. + if (Phi->getType() != RdxDesc.getRecurrenceType()) + ReducedPartRdx = + RdxDesc.isSigned() + ? Builder.CreateSExt(ReducedPartRdx, Phi->getType()) + : Builder.CreateZExt(ReducedPartRdx, Phi->getType()); + } + + // Create a phi node that merges control-flow from the backedge-taken check + // block and the middle block. + PHINode *BCBlockPhi = PHINode::Create(Phi->getType(), 2, "bc.merge.rdx", + LoopScalarPreHeader->getTerminator()); + for (unsigned I = 0, E = LoopBypassBlocks.size(); I != E; ++I) + BCBlockPhi->addIncoming(ReductionStartValue, LoopBypassBlocks[I]); + BCBlockPhi->addIncoming(ReducedPartRdx, LoopMiddleBlock); + + // Now, we need to fix the users of the reduction variable + // inside and outside of the scalar remainder loop. + // We know that the loop is in LCSSA form. We need to update the + // PHI nodes in the exit blocks. + for (BasicBlock::iterator LEI = LoopExitBlock->begin(), + LEE = LoopExitBlock->end(); + LEI != LEE; ++LEI) { + PHINode *LCSSAPhi = dyn_cast(LEI); + if (!LCSSAPhi) break; - } - default: - // All other instructions are unsupported. Scalarize them. - scalarizeInstruction(&I); - break; - } // end of switch. - } // end of for_each instr. 
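// ---------------------------------------------------------------------------
// Illustrative sketch (editor addition, not part of this patch or of the LLVM
// sources): the reduction epilogue above, modelled with plain std::array so it
// can be read and run on its own.  VF, UF, kIdentity and StartValue are made
// up for the example, and the loop body itself is elided.  It shows (a) why
// only unroll part 0 carries the scalar start value while the other parts
// start at the identity, and (b) the log2(VF) "fold the upper half onto the
// lower half" rounds that the shuffle/binop sequence implements.
// ---------------------------------------------------------------------------
#include <array>
#include <cassert>
#include <cstdio>

constexpr unsigned VF = 4;     // vectorization factor (power of two)
constexpr unsigned UF = 2;     // unroll factor
constexpr int kIdentity = 0;   // identity of the reduction operation (add)

using Vec = std::array<int, VF>;

int main() {
  const int StartValue = 10;   // loop-invariant reduction start value

  // One accumulator per unroll part.  Lane 0 of part 0 holds the incoming
  // scalar start value; every other lane and part holds the identity, so the
  // start value is counted exactly once.
  std::array<Vec, UF> RdxParts;
  for (Vec &Part : RdxParts)
    Part.fill(kIdentity);
  RdxParts[0][0] = StartValue;

  // (The vector loop would accumulate into RdxParts here.)

  // Reduce all of the unrolled parts into a single vector.
  Vec Rdx = RdxParts[0];
  for (unsigned Part = 1; Part < UF; ++Part)
    for (unsigned Lane = 0; Lane < VF; ++Lane)
      Rdx[Lane] += RdxParts[Part][Lane];

  // log2(VF) rounds: each round adds the upper half of the vector onto the
  // lower half, leaving the final value in lane 0.
  static_assert((VF & (VF - 1)) == 0, "emission assumes a power-of-two VF");
  for (unsigned Width = VF; Width != 1; Width >>= 1)
    for (unsigned Lane = 0; Lane < Width / 2; ++Lane)
      Rdx[Lane] += Rdx[Width / 2 + Lane];

  std::printf("reduced value = %d\n", Rdx[0]);
  assert(Rdx[0] == StartValue && "no loop iterations ran in this sketch");
  return 0;
}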
-} + // All PHINodes need to have a single entry edge, or two if + // we already fixed them. + assert(LCSSAPhi->getNumIncomingValues() < 3 && "Invalid LCSSA PHI"); -void InnerLoopVectorizer::updateAnalysis() { - // Forget the original basic block. - PSE.getSE()->forgetLoop(OrigLoop); + // We found a reduction value exit-PHI. Update it with the + // incoming bypass edge. + if (LCSSAPhi->getIncomingValue(0) == LoopExitInst) + LCSSAPhi->addIncoming(ReducedPartRdx, LoopMiddleBlock); + } // end of the LCSSA phi scan. - // Update the dominator tree information. - assert(DT->properlyDominates(LoopBypassBlocks.front(), LoopExitBlock) && - "Entry does not dominate exit."); + // Fix the scalar loop reduction variable with the incoming reduction sum + // from the vector body and from the backedge value. + int IncomingEdgeBlockIdx = + Phi->getBasicBlockIndex(OrigLoop->getLoopLatch()); + assert(IncomingEdgeBlockIdx >= 0 && "Invalid block index"); + // Pick the other block. + int SelfEdgeBlockIdx = (IncomingEdgeBlockIdx ? 0 : 1); + Phi->setIncomingValue(SelfEdgeBlockIdx, BCBlockPhi); + Phi->setIncomingValue(IncomingEdgeBlockIdx, LoopExitInst); +} - // We don't predicate stores by this point, so the vector body should be a - // single loop. - DT->addNewBlock(LoopVectorBody, LoopVectorPreHeader); +void InnerLoopVectorizer::fixFirstOrderRecurrence(PHINode *Phi) { - DT->addNewBlock(LoopMiddleBlock, LoopVectorBody); - DT->addNewBlock(LoopScalarPreHeader, LoopBypassBlocks[0]); - DT->changeImmediateDominator(LoopScalarBody, LoopScalarPreHeader); - DT->changeImmediateDominator(LoopExitBlock, LoopBypassBlocks[0]); + // This is the second phase of vectorizing first-order recurrences. An + // overview of the transformation is described below. Suppose we have the + // following loop. + // + // for (int i = 0; i < n; ++i) + // b[i] = a[i] - a[i - 1]; + // + // There is a first-order recurrence on "a". For this loop, the shorthand + // scalar IR looks like: + // + // scalar.ph: + // s_init = a[-1] + // br scalar.body + // + // scalar.body: + // i = phi [0, scalar.ph], [i+1, scalar.body] + // s1 = phi [s_init, scalar.ph], [s2, scalar.body] + // s2 = a[i] + // b[i] = s2 - s1 + // br cond, scalar.body, ... + // + // In this example, s1 is a recurrence because it's value depends on the + // previous iteration. In the first phase of vectorization, we created a + // temporary value for s1. We now complete the vectorization and produce the + // shorthand vector IR shown below (for VF = 4, UF = 1). + // + // vector.ph: + // v_init = vector(..., ..., ..., a[-1]) + // br vector.body + // + // vector.body + // i = phi [0, vector.ph], [i+4, vector.body] + // v1 = phi [v_init, vector.ph], [v2, vector.body] + // v2 = a[i, i+1, i+2, i+3]; + // v3 = vector(v1(3), v2(0, 1, 2)) + // b[i, i+1, i+2, i+3] = v2 - v3 + // br cond, vector.body, middle.block + // + // middle.block: + // x = v2(3) + // br scalar.ph + // + // scalar.ph: + // s_init = phi [x, middle.block], [a[-1], otherwise] + // br scalar.body + // + // After execution completes the vector loop, we extract the next value of + // the recurrence (x) to use as the initial value in the scalar loop. - DEBUG(DT->verifyDomTree()); -} + // Get the original loop preheader and single loop latch. + auto *Preheader = OrigLoop->getLoopPreheader(); + auto *Latch = OrigLoop->getLoopLatch(); -/// \brief Check whether it is safe to if-convert this phi node. -/// -/// Phi nodes with constant expressions that can trap are not safe to if -/// convert. 
-static bool canIfConvertPHINodes(BasicBlock *BB) { - for (Instruction &I : *BB) { - auto *Phi = dyn_cast(&I); - if (!Phi) - return true; - for (Value *V : Phi->incoming_values()) - if (auto *C = dyn_cast(V)) - if (C->canTrap()) - return false; - } - return true; -} + // Get the initial and previous values of the scalar recurrence. + auto *ScalarInit = Phi->getIncomingValueForBlock(Preheader); + auto *Previous = Phi->getIncomingValueForBlock(Latch); -bool LoopVectorizationLegality::canVectorizeWithIfConvert() { - if (!EnableIfConversion) { - ORE->emit(createMissedAnalysis("IfConversionDisabled") - << "if-conversion is disabled"); - return false; + // Create a vector from the initial value. + auto *VectorInit = ScalarInit; + if (VF > 1) { + Builder.SetInsertPoint(LoopVectorPreHeader->getTerminator()); + VectorInit = Builder.CreateInsertElement( + UndefValue::get(VectorType::get(VectorInit->getType(), VF)), VectorInit, + Builder.getInt32(VF - 1), "vector.recur.init"); } - assert(TheLoop->getNumBlocks() > 1 && "Single block loops are vectorizable"); + // We constructed a temporary phi node in the first phase of vectorization. + // This phi node will eventually be deleted. + VectorParts &PhiParts = VectorLoopValueMap.getVector(Phi); + Builder.SetInsertPoint(cast(PhiParts[0])); - // A list of pointers that we can safely read and write to. - SmallPtrSet SafePointes; + // Create a phi node for the new recurrence. The current value will either be + // the initial value inserted into a vector or loop-varying vector value. + auto *VecPhi = Builder.CreatePHI(VectorInit->getType(), 2, "vector.recur"); + VecPhi->addIncoming(VectorInit, LoopVectorPreHeader); - // Collect safe addresses. - for (BasicBlock *BB : TheLoop->blocks()) { - if (blockNeedsPredication(BB)) - continue; + // Get the vectorized previous value. We ensured the previous values was an + // instruction when detecting the recurrence. + auto &PreviousParts = getVectorValue(Previous); - for (Instruction &I : *BB) - if (auto *Ptr = getPointerOperand(&I)) - SafePointes.insert(Ptr); - } + // Set the insertion point to be after this instruction. We ensured the + // previous value dominated all uses of the phi when detecting the + // recurrence. + Builder.SetInsertPoint( + &*++BasicBlock::iterator(cast(PreviousParts[UF - 1]))); - // Collect the blocks that need predication. - BasicBlock *Header = TheLoop->getHeader(); - for (BasicBlock *BB : TheLoop->blocks()) { - // We don't support switch statements inside loops. - if (!isa(BB->getTerminator())) { - ORE->emit(createMissedAnalysis("LoopContainsSwitch", BB->getTerminator()) - << "loop contains a switch statement"); - return false; - } + // We will construct a vector for the recurrence by combining the values for + // the current and previous iterations. This is the required shuffle mask. + SmallVector ShuffleMask(VF); + ShuffleMask[0] = Builder.getInt32(VF - 1); + for (unsigned I = 1; I < VF; ++I) + ShuffleMask[I] = Builder.getInt32(I + VF - 1); - // We must be able to predicate all blocks that need to be predicated. 
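// ---------------------------------------------------------------------------
// Illustrative sketch (editor addition, not part of this patch or of the LLVM
// sources): the two-source shuffle built above for first-order recurrences,
// modelled in plain C++.  The mask {VF-1, VF, VF+1, ..., 2*VF-2} takes the
// last lane of the incoming vector (the value carried from the previous
// part/iteration) followed by the first VF-1 lanes of the current part, which
// is exactly the "previous iteration" operand needed for b[i] = a[i] - a[i-1].
// The array contents and ScalarInit are invented for the example.
// ---------------------------------------------------------------------------
#include <cassert>
#include <cstddef>
#include <vector>

// Two-source shuffle: mask indices 0..VF-1 select from Lo, VF..2*VF-1 from Hi.
static std::vector<int> shuffle(const std::vector<int> &Lo,
                                const std::vector<int> &Hi,
                                const std::vector<unsigned> &Mask) {
  std::vector<int> Out(Mask.size());
  for (std::size_t I = 0; I < Mask.size(); ++I)
    Out[I] = Mask[I] < Lo.size() ? Lo[Mask[I]] : Hi[Mask[I] - Lo.size()];
  return Out;
}

int main() {
  constexpr unsigned VF = 4;
  const std::vector<int> A = {3, 1, 4, 1, 5, 9, 2, 6}; // size divisible by VF
  const int ScalarInit = 7;                            // plays the role of a[-1]

  // The recurrence mask: last lane of the previous vector, then lanes
  // 0 .. VF-2 of the current one.
  std::vector<unsigned> Mask = {VF - 1};
  for (unsigned I = 1; I < VF; ++I)
    Mask.push_back(I + VF - 1);

  // vector.ph: v_init = (..., ..., ..., a[-1]); only its last lane is ever
  // read through the mask.
  std::vector<int> Incoming(VF, 0);
  Incoming[VF - 1] = ScalarInit;

  std::vector<int> B(A.size());
  for (std::size_t Base = 0; Base < A.size(); Base += VF) {
    std::vector<int> Cur(A.begin() + Base, A.begin() + Base + VF); // v2
    std::vector<int> Prev = shuffle(Incoming, Cur, Mask);          // v3
    for (unsigned Lane = 0; Lane < VF; ++Lane)
      B[Base + Lane] = Cur[Lane] - Prev[Lane];
    Incoming = Cur; // carry this block's values into the next iteration
  }

  // Check the blocked computation against the scalar recurrence.
  int S = ScalarInit;
  for (std::size_t I = 0; I < A.size(); ++I) {
    assert(B[I] == A[I] - S);
    S = A[I];
  }
  return 0;
}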
- if (blockNeedsPredication(BB)) { - if (!blockCanBePredicated(BB, SafePointes)) { - ORE->emit(createMissedAnalysis("NoCFGForSelect", BB->getTerminator()) - << "control flow cannot be substituted for a select"); - return false; - } - } else if (BB != Header && !canIfConvertPHINodes(BB)) { - ORE->emit(createMissedAnalysis("NoCFGForSelect", BB->getTerminator()) - << "control flow cannot be substituted for a select"); - return false; - } + // The vector from which to take the initial value for the current iteration + // (actual or unrolled). Initially, this is the vector phi node. + Value *Incoming = VecPhi; + + // Shuffle the current and previous vector and update the vector parts. + for (unsigned Part = 0; Part < UF; ++Part) { + auto *Shuffle = + VF > 1 + ? Builder.CreateShuffleVector(Incoming, PreviousParts[Part], + ConstantVector::get(ShuffleMask)) + : Incoming; + PhiParts[Part]->replaceAllUsesWith(Shuffle); + cast(PhiParts[Part])->eraseFromParent(); + PhiParts[Part] = Shuffle; + Incoming = PreviousParts[Part]; } - // We can if-convert this loop. - return true; -} + // Fix the latch value of the new recurrence in the vector loop. + VecPhi->addIncoming(Incoming, LI->getLoopFor(LoopVectorBody)->getLoopLatch()); -bool LoopVectorizationLegality::canVectorize() { - // We must have a loop in canonical form. Loops with indirectbr in them cannot - // be canonicalized. - if (!TheLoop->getLoopPreheader()) { - ORE->emit(createMissedAnalysis("CFGNotUnderstood") - << "loop control flow is not understood by vectorizer"); - return false; + // Extract the last vector element in the middle block. This will be the + // initial value for the recurrence when jumping to the scalar loop. + auto *Extract = Incoming; + if (VF > 1) { + Builder.SetInsertPoint(LoopMiddleBlock->getTerminator()); + Extract = Builder.CreateExtractElement(Extract, Builder.getInt32(VF - 1), + "vector.recur.extract"); } - // FIXME: The code is currently dead, since the loop gets sent to - // LoopVectorizationLegality is already an innermost loop. - // - // We can only vectorize innermost loops. - if (!TheLoop->empty()) { - ORE->emit(createMissedAnalysis("NotInnermostLoop") - << "loop is not the innermost loop"); - return false; + // Fix the initial value of the original recurrence in the scalar loop. + Builder.SetInsertPoint(&*LoopScalarPreHeader->begin()); + auto *Start = Builder.CreatePHI(Phi->getType(), 2, "scalar.recur.init"); + for (auto *BB : predecessors(LoopScalarPreHeader)) { + auto *Incoming = BB == LoopMiddleBlock ? Extract : ScalarInit; + Start->addIncoming(Incoming, BB); } - // We must have a single backedge. - if (TheLoop->getNumBackEdges() != 1) { - ORE->emit(createMissedAnalysis("CFGNotUnderstood") - << "loop control flow is not understood by vectorizer"); - return false; - } + Phi->setIncomingValue(Phi->getBasicBlockIndex(LoopScalarPreHeader), Start); + Phi->setName("scalar.recur"); - // We must have a single exiting block. - if (!TheLoop->getExitingBlock()) { - ORE->emit(createMissedAnalysis("CFGNotUnderstood") - << "loop control flow is not understood by vectorizer"); - return false; + // Finally, fix users of the recurrence outside the loop. The users will need + // either the last value of the scalar recurrence or the last value of the + // vector recurrence we extracted in the middle block. Since the loop is in + // LCSSA form, we just need to find the phi node for the original scalar + // recurrence in the exit block, and then add an edge for the middle block. 
+ for (auto &I : *LoopExitBlock) { + auto *LCSSAPhi = dyn_cast(&I); + if (!LCSSAPhi) + break; + if (LCSSAPhi->getIncomingValue(0) == Phi) { + LCSSAPhi->addIncoming(Extract, LoopMiddleBlock); + break; + } } +} - // We only handle bottom-tested loops, i.e. loop in which the condition is - // checked at the end of each iteration. With that we can assume that all - // instructions in the loop are executed the same number of times. - if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) { - ORE->emit(createMissedAnalysis("CFGNotUnderstood") - << "loop control flow is not understood by vectorizer"); - return false; +void InnerLoopVectorizer::fixLCSSAPHIs() { + for (Instruction &LEI : *LoopExitBlock) { + auto *LCSSAPhi = dyn_cast(&LEI); + if (!LCSSAPhi) + break; + if (LCSSAPhi->getNumIncomingValues() == 1) + LCSSAPhi->addIncoming(UndefValue::get(LCSSAPhi->getType()), + LoopMiddleBlock); } +} - // We need to have a loop header. - DEBUG(dbgs() << "LV: Found a loop: " << TheLoop->getHeader()->getName() - << '\n'); +void InnerLoopVectorizer::collectTriviallyDeadInstructions( + Loop *OrigLoop, LoopVectorizationLegality *Legal, + SmallPtrSetImpl &DeadInstructions) { + BasicBlock *Latch = OrigLoop->getLoopLatch(); - // Check if we can if-convert non-single-bb loops. - unsigned NumBlocks = TheLoop->getNumBlocks(); - if (NumBlocks != 1 && !canVectorizeWithIfConvert()) { - DEBUG(dbgs() << "LV: Can't if-convert the loop.\n"); - return false; - } + // We create new control-flow for the vectorized loop, so the original + // condition will be dead after vectorization if it's only used by the + // branch. + auto *Cmp = dyn_cast(Latch->getTerminator()->getOperand(0)); + if (Cmp && Cmp->hasOneUse()) + DeadInstructions.insert(Cmp); - // ScalarEvolution needs to be able to find the exit count. - const SCEV *ExitCount = PSE.getBackedgeTakenCount(); - if (ExitCount == PSE.getSE()->getCouldNotCompute()) { - ORE->emit(createMissedAnalysis("CantComputeNumberOfIterations") - << "could not determine number of loop iterations"); - DEBUG(dbgs() << "LV: SCEV could not compute the loop exit count.\n"); - return false; + // We create new "steps" for induction variable updates to which the original + // induction variables map. An original update instruction will be dead if + // all its users except the induction variable are dead. + for (auto &Induction : *Legal->getInductionVars()) { + PHINode *Ind = Induction.first; + auto *IndUpdate = cast(Ind->getIncomingValueForBlock(Latch)); + if (all_of(IndUpdate->users(), [&](User *U) -> bool { + return U == Ind || DeadInstructions.count(cast(U)); + })) + DeadInstructions.insert(IndUpdate); } +} - // Check if we can vectorize the instructions and CFG in this loop. - if (!canVectorizeInstrs()) { - DEBUG(dbgs() << "LV: Can't vectorize the instructions or CFG\n"); - return false; - } +void InnerLoopUnroller::sinkScalarOperands(Instruction *PredInst) { - // Go over each instruction and look at memory deps. - if (!canVectorizeMemory()) { - DEBUG(dbgs() << "LV: Can't vectorize due to memory conflicts\n"); - return false; - } + // The basic block and loop containing the predicated instruction. + auto *PredBB = PredInst->getParent(); + auto *VectorLoop = LI->getLoopFor(PredBB); - DEBUG(dbgs() << "LV: We can vectorize this loop" - << (LAI->getRuntimePointerChecking()->Need - ? " (with a runtime bound check)" - : "") - << "!\n"); + // Initialize a worklist with the operands of the predicated instruction. 
+ SetVector Worklist(PredInst->op_begin(), PredInst->op_end()); - bool UseInterleaved = TTI->enableInterleavedAccessVectorization(); + // Holds instructions that we need to analyze again. An instruction may be + // reanalyzed if we don't yet know if we can sink it or not. + SmallVector InstsToReanalyze; - // If an override option has been passed in for interleaved accesses, use it. - if (EnableInterleavedMemAccesses.getNumOccurrences() > 0) - UseInterleaved = EnableInterleavedMemAccesses; + // Returns true if a given use occurs in the predicated block. Phi nodes use + // their operands in their corresponding predecessor blocks. + auto isBlockOfUsePredicated = [&](Use &U) -> bool { + auto *I = cast(U.getUser()); + BasicBlock *BB = I->getParent(); + if (auto *Phi = dyn_cast(I)) + BB = Phi->getIncomingBlock( + PHINode::getIncomingValueNumForOperand(U.getOperandNo())); + return BB == PredBB; + }; - // Analyze interleaved memory accesses. - if (UseInterleaved) - InterleaveInfo.analyzeInterleaving(*getSymbolicStrides()); + // Iteratively sink the scalarized operands of the predicated instruction + // into the block we created for it. When an instruction is sunk, it's + // operands are then added to the worklist. The algorithm ends after one pass + // through the worklist doesn't sink a single instruction. + bool Changed; + do { - unsigned SCEVThreshold = VectorizeSCEVCheckThreshold; - if (Hints->getForce() == LoopVectorizeHints::FK_Enabled) - SCEVThreshold = PragmaVectorizeSCEVCheckThreshold; + // Add the instructions that need to be reanalyzed to the worklist, and + // reset the changed indicator. + Worklist.insert(InstsToReanalyze.begin(), InstsToReanalyze.end()); + InstsToReanalyze.clear(); + Changed = false; - if (PSE.getUnionPredicate().getComplexity() > SCEVThreshold) { - ORE->emit(createMissedAnalysis("TooManySCEVRunTimeChecks") - << "Too many SCEV assumptions need to be made and checked " - << "at runtime"); - DEBUG(dbgs() << "LV: Too many SCEV checks needed.\n"); - return false; - } + while (!Worklist.empty()) { + auto *I = dyn_cast(Worklist.pop_back_val()); - // Okay! We can vectorize. At this point we don't have any other mem analysis - // which may limit our maximum vectorization factor, so just return true with - // no restrictions. - return true; -} + // We can't sink an instruction if it is a phi node, is already in the + // predicated block, is not in the loop, or may have side effects. + if (!I || isa(I) || I->getParent() == PredBB || + !VectorLoop->contains(I) || I->mayHaveSideEffects()) + continue; -static Type *convertPointerToIntegerType(const DataLayout &DL, Type *Ty) { - if (Ty->isPointerTy()) - return DL.getIntPtrType(Ty); + // It's legal to sink the instruction if all its uses occur in the + // predicated block. Otherwise, there's nothing to do yet, and we may + // need to reanalyze the instruction. + if (!all_of(I->uses(), isBlockOfUsePredicated)) { + InstsToReanalyze.push_back(I); + continue; + } - // It is possible that char's or short's overflow when we ask for the loop's - // trip count, work around this by changing the type size. - if (Ty->getScalarSizeInBits() < 32) - return Type::getInt32Ty(Ty->getContext()); + // Move the instruction to the beginning of the predicated block, and add + // it's operands to the worklist. + I->moveBefore(&*PredBB->getFirstInsertionPt()); + Worklist.insert(I->op_begin(), I->op_end()); - return Ty; + // The sinking may have enabled other instructions to be sunk, so we will + // need to iterate. 
+ Changed = true; + } + } while (Changed); } -static Type *getWiderType(const DataLayout &DL, Type *Ty0, Type *Ty1) { - Ty0 = convertPointerToIntegerType(DL, Ty0); - Ty1 = convertPointerToIntegerType(DL, Ty1); - if (Ty0->getScalarSizeInBits() > Ty1->getScalarSizeInBits()) - return Ty0; - return Ty1; -} +void InnerLoopUnroller::vectorizeLoop() { -/// \brief Check that the instruction has outside loop users and is not an -/// identified reduction variable. -static bool hasOutsideLoopUser(const Loop *TheLoop, Instruction *Inst, - SmallPtrSetImpl &AllowedExit) { - // Reduction and Induction instructions are allowed to have exit users. All - // other instructions must not have external users. - if (!AllowedExit.count(Inst)) - // Check that all of the users of the loop are inside the BB. - for (User *U : Inst->users()) { - Instruction *UI = cast(U); - // This user may be a reduction exit value. - if (!TheLoop->contains(UI)) { - DEBUG(dbgs() << "LV: Found an outside user for : " << *UI << '\n'); - return true; - } - } - return false; -} + // Collect instructions from the original loop that will become trivially + // dead in the vectorized loop. We don't need to vectorize these + // instructions. + collectTriviallyDeadInstructions(OrigLoop, Legal, DeadInstructions); -void LoopVectorizationLegality::addInductionPhi( - PHINode *Phi, const InductionDescriptor &ID, - SmallPtrSetImpl &AllowedExit) { - Inductions[Phi] = ID; - Type *PhiTy = Phi->getType(); - const DataLayout &DL = Phi->getModule()->getDataLayout(); + // Scan the loop in a topological order to ensure that defs are vectorized + // before users. + LoopBlocksDFS DFS(OrigLoop); + DFS.perform(LI); - // Get the widest type. - if (!PhiTy->isFloatingPointTy()) { - if (!WidestIndTy) - WidestIndTy = convertPointerToIntegerType(DL, PhiTy); - else - WidestIndTy = getWiderType(DL, PhiTy, WidestIndTy); - } + // Vectorize all of the blocks in the original loop. + for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) + for (Instruction &I : *BB) { + if (!DeadInstructions.count(&I)) + vectorizeInstruction(I); + } - // Int inductions are special because we only allow one IV. - if (ID.getKind() == InductionDescriptor::IK_IntInduction && - ID.getConstIntStepValue() && - ID.getConstIntStepValue()->isOne() && - isa(ID.getStartValue()) && - cast(ID.getStartValue())->isNullValue()) { + fixCrossIterationPHIs(); - // Use the phi node with the widest type as induction. Use the last - // one if there are multiple (no good reason for doing this other - // than it is expedient). We've checked that it begins at zero and - // steps by one, so this is a canonical induction variable. - if (!PrimaryInduction || PhiTy == WidestIndTy) - PrimaryInduction = Phi; - } + // Update the dominator tree. + // + // FIXME: After creating the structure of the new loop, the dominator tree is + // no longer up-to-date, and it remains that way until we update it + // here. An out-of-date dominator tree is problematic for SCEV, + // because SCEVExpander uses it to guide code generation. The + // vectorizer use SCEVExpanders in several places. Instead, we should + // keep the dominator tree up-to-date as we go. + updateAnalysis(); - // Both the PHI node itself, and the "post-increment" value feeding - // back into the PHI node may have external users. - AllowedExit.insert(Phi); - AllowedExit.insert(Phi->getIncomingValueForBlock(TheLoop->getLoopLatch())); + // Fix-up external users of the induction variables. 
+ for (auto &Entry : *Legal->getInductionVars()) + fixupIVUsers(Entry.first, Entry.second, + getOrCreateVectorTripCount(LI->getLoopFor(LoopVectorBody)), + IVEndValues[Entry.first], LoopMiddleBlock); - DEBUG(dbgs() << "LV: Found an induction variable.\n"); - return; + fixLCSSAPHIs(); + predicateInstructions(); + + // Remove redundant induction instructions. + cse(LoopVectorBody); } -bool LoopVectorizationLegality::canVectorizeInstrs() { - BasicBlock *Header = TheLoop->getHeader(); +void InnerLoopUnroller::predicateInstructions() { - // Look for the attribute signaling the absence of NaNs. - Function &F = *Header->getParent(); - HasFunNoNaNAttr = - F.getFnAttribute("no-nans-fp-math").getValueAsString() == "true"; + // For each instruction I marked for predication on value C, split I into its + // own basic block to form an if-then construct over C. Since I may be fed by + // an extractelement instruction or other scalar operand, we try to + // iteratively sink its scalar operands into the predicated block. If I feeds + // an insertelement instruction, we try to move this instruction into the + // predicated block as well. For non-void types, a phi node will be created + // for the resulting value (either vector or scalar). + // + // So for some predicated instruction, e.g. the conditional sdiv in: + // + // for.body: + // ... + // %add = add nsw i32 %mul, %0 + // %cmp5 = icmp sgt i32 %2, 7 + // br i1 %cmp5, label %if.then, label %if.end + // + // if.then: + // %div = sdiv i32 %0, %1 + // br label %if.end + // + // if.end: + // %x.0 = phi i32 [ %div, %if.then ], [ %add, %for.body ] + // + // the sdiv at this point is scalarized and if-converted using a select. + // The inactive elements in the vector are not used, but the predicated + // instruction is still executed for all vector elements, essentially: + // + // vector.body: + // ... + // %17 = add nsw <2 x i32> %16, %wide.load + // %29 = extractelement <2 x i32> %wide.load, i32 0 + // %30 = extractelement <2 x i32> %wide.load51, i32 0 + // %31 = sdiv i32 %29, %30 + // %32 = insertelement <2 x i32> undef, i32 %31, i32 0 + // %35 = extractelement <2 x i32> %wide.load, i32 1 + // %36 = extractelement <2 x i32> %wide.load51, i32 1 + // %37 = sdiv i32 %35, %36 + // %38 = insertelement <2 x i32> %32, i32 %37, i32 1 + // %predphi = select <2 x i1> %26, <2 x i32> %38, <2 x i32> %17 + // + // Predication will now re-introduce the original control flow to avoid false + // side-effects by the sdiv instructions on the inactive elements, yielding + // (after cleanup): + // + // vector.body: + // ... 
+ // %5 = add nsw <2 x i32> %4, %wide.load + // %8 = icmp sgt <2 x i32> %wide.load52, + // %9 = extractelement <2 x i1> %8, i32 0 + // br i1 %9, label %pred.sdiv.if, label %pred.sdiv.continue + // + // pred.sdiv.if: + // %10 = extractelement <2 x i32> %wide.load, i32 0 + // %11 = extractelement <2 x i32> %wide.load51, i32 0 + // %12 = sdiv i32 %10, %11 + // %13 = insertelement <2 x i32> undef, i32 %12, i32 0 + // br label %pred.sdiv.continue + // + // pred.sdiv.continue: + // %14 = phi <2 x i32> [ undef, %vector.body ], [ %13, %pred.sdiv.if ] + // %15 = extractelement <2 x i1> %8, i32 1 + // br i1 %15, label %pred.sdiv.if54, label %pred.sdiv.continue55 + // + // pred.sdiv.if54: + // %16 = extractelement <2 x i32> %wide.load, i32 1 + // %17 = extractelement <2 x i32> %wide.load51, i32 1 + // %18 = sdiv i32 %16, %17 + // %19 = insertelement <2 x i32> %14, i32 %18, i32 1 + // br label %pred.sdiv.continue55 + // + // pred.sdiv.continue55: + // %20 = phi <2 x i32> [ %14, %pred.sdiv.continue ], [ %19, %pred.sdiv.if54 ] + // %predphi = select <2 x i1> %8, <2 x i32> %20, <2 x i32> %5 - // For each block in the loop. - for (BasicBlock *BB : TheLoop->blocks()) { - // Scan the instructions in the block and look for hazards. - for (Instruction &I : *BB) { - if (auto *Phi = dyn_cast(&I)) { - Type *PhiTy = Phi->getType(); - // Check that this PHI type is allowed. - if (!PhiTy->isIntegerTy() && !PhiTy->isFloatingPointTy() && - !PhiTy->isPointerTy()) { - ORE->emit(createMissedAnalysis("CFGNotUnderstood", Phi) - << "loop control flow is not understood by vectorizer"); - DEBUG(dbgs() << "LV: Found an non-int non-pointer PHI.\n"); - return false; - } + for (auto KV : PredicatedInstructions) { + BasicBlock::iterator I(KV.first); + BasicBlock *Head = I->getParent(); + auto *BB = SplitBlock(Head, &*std::next(I), DT, LI); + auto *T = SplitBlockAndInsertIfThen(KV.second, &*I, /*Unreachable=*/false, + /*BranchWeights=*/nullptr, DT, LI); + I->moveBefore(T); + sinkScalarOperands(&*I); - // If this PHINode is not in the header block, then we know that we - // can convert it to select during if-conversion. No need to check if - // the PHIs in this block are induction or reduction variables. - if (BB != Header) { - // Check that this instruction has no outside users or is an - // identified reduction value with an outside user. - if (!hasOutsideLoopUser(TheLoop, Phi, AllowedExit)) - continue; - ORE->emit(createMissedAnalysis("NeitherInductionNorReduction", Phi) - << "value could not be identified as " - "an induction or reduction variable"); - return false; - } + I->getParent()->setName(Twine("pred.") + I->getOpcodeName() + ".if"); + BB->setName(Twine("pred.") + I->getOpcodeName() + ".continue"); - // We only allow if-converted PHIs with exactly two incoming values. - if (Phi->getNumIncomingValues() != 2) { - ORE->emit(createMissedAnalysis("CFGNotUnderstood", Phi) - << "control flow not understood by vectorizer"); - DEBUG(dbgs() << "LV: Found an invalid PHI.\n"); - return false; - } + // If the instruction is non-void create a Phi node at reconvergence point. 
+ if (!I->getType()->isVoidTy()) { + Value *IncomingTrue = nullptr; + Value *IncomingFalse = nullptr; - RecurrenceDescriptor RedDes; - if (RecurrenceDescriptor::isReductionPHI(Phi, TheLoop, RedDes)) { - if (RedDes.hasUnsafeAlgebra()) - Requirements->addUnsafeAlgebraInst(RedDes.getUnsafeAlgebraInst()); - AllowedExit.insert(RedDes.getLoopExitInstr()); - Reductions[Phi] = RedDes; - continue; - } + if (I->hasOneUse() && isa(*I->user_begin())) { + // If the predicated instruction is feeding an insert-element, move it + // into the Then block; Phi node will be created for the vector. + InsertElementInst *IEI = cast(*I->user_begin()); + IEI->moveBefore(T); + IncomingTrue = IEI; // the new vector with the inserted element. + IncomingFalse = IEI->getOperand(0); // the unmodified vector + } else { + // Phi node will be created for the scalar predicated instruction. + IncomingTrue = &*I; + IncomingFalse = UndefValue::get(I->getType()); + } - InductionDescriptor ID; - if (InductionDescriptor::isInductionPHI(Phi, TheLoop, PSE, ID)) { - addInductionPhi(Phi, ID, AllowedExit); - if (ID.hasUnsafeAlgebra() && !HasFunNoNaNAttr) - Requirements->addUnsafeAlgebraInst(ID.getUnsafeAlgebraInst()); - continue; - } + BasicBlock *PostDom = I->getParent()->getSingleSuccessor(); + assert(PostDom && "Then block has multiple successors"); + PHINode *Phi = + PHINode::Create(IncomingTrue->getType(), 2, "", &PostDom->front()); + IncomingTrue->replaceAllUsesWith(Phi); + Phi->addIncoming(IncomingFalse, Head); + Phi->addIncoming(IncomingTrue, I->getParent()); + } + } - if (RecurrenceDescriptor::isFirstOrderRecurrence(Phi, TheLoop, DT)) { - FirstOrderRecurrences.insert(Phi); - continue; - } + DEBUG(DT->verifyDomTree()); +} - // As a last resort, coerce the PHI to a AddRec expression - // and re-try classifying it a an induction PHI. - if (InductionDescriptor::isInductionPHI(Phi, TheLoop, PSE, ID, true)) { - addInductionPhi(Phi, ID, AllowedExit); - continue; - } +InnerLoopVectorizer::VectorParts +InnerLoopVectorizer::createEdgeMask(BasicBlock *Src, BasicBlock *Dst) { + assert(is_contained(predecessors(Dst), Src) && "Invalid edge"); - ORE->emit(createMissedAnalysis("NonReductionValueUsedOutsideLoop", Phi) - << "value that could not be identified as " - "reduction is used outside the loop"); - DEBUG(dbgs() << "LV: Found an unidentified PHI." << *Phi << "\n"); - return false; - } // end of PHI handling - - // We handle calls that: - // * Are debug info intrinsics. - // * Have a mapping to an IR intrinsic. - // * Have a vector version available. - auto *CI = dyn_cast(&I); - if (CI && !getVectorIntrinsicIDForCall(CI, TLI) && - !isa(CI) && - !(CI->getCalledFunction() && TLI && - TLI->isFunctionVectorizable(CI->getCalledFunction()->getName()))) { - ORE->emit(createMissedAnalysis("CantVectorizeCall", CI) - << "call instruction cannot be vectorized"); - DEBUG(dbgs() << "LV: Found a non-intrinsic, non-libfunc callsite.\n"); - return false; - } + // Look for cached value. + std::pair Edge(Src, Dst); + EdgeMaskCacheTy::iterator ECEntryIt = EdgeMaskCache.find(Edge); + if (ECEntryIt != EdgeMaskCache.end()) + return ECEntryIt->second; - // Intrinsics such as powi,cttz and ctlz are legal to vectorize if the - // second argument is the same (i.e. 
loop invariant) - if (CI && hasVectorInstrinsicScalarOpd( - getVectorIntrinsicIDForCall(CI, TLI), 1)) { - auto *SE = PSE.getSE(); - if (!SE->isLoopInvariant(PSE.getSCEV(CI->getOperand(1)), TheLoop)) { - ORE->emit(createMissedAnalysis("CantVectorizeIntrinsic", CI) - << "intrinsic instruction cannot be vectorized"); - DEBUG(dbgs() << "LV: Found unvectorizable intrinsic " << *CI << "\n"); - return false; - } - } + VectorParts SrcMask = createBlockInMask(Src); - // Check that the instruction return type is vectorizable. - // Also, we can't vectorize extractelement instructions. - if ((!VectorType::isValidElementType(I.getType()) && - !I.getType()->isVoidTy()) || - isa(I)) { - ORE->emit(createMissedAnalysis("CantVectorizeInstructionReturnType", &I) - << "instruction return type cannot be vectorized"); - DEBUG(dbgs() << "LV: Found unvectorizable type.\n"); - return false; - } + // The terminator has to be a branch inst! + BranchInst *BI = dyn_cast(Src->getTerminator()); + assert(BI && "Unexpected terminator found"); - // Check that the stored type is vectorizable. - if (auto *ST = dyn_cast(&I)) { - Type *T = ST->getValueOperand()->getType(); - if (!VectorType::isValidElementType(T)) { - ORE->emit(createMissedAnalysis("CantVectorizeStore", ST) - << "store instruction cannot be vectorized"); - return false; - } + if (BI->isConditional()) { + VectorParts EdgeMask = getVectorValue(BI->getCondition()); - // FP instructions can allow unsafe algebra, thus vectorizable by - // non-IEEE-754 compliant SIMD units. - // This applies to floating-point math operations and calls, not memory - // operations, shuffles, or casts, as they don't change precision or - // semantics. - } else if (I.getType()->isFloatingPointTy() && (CI || I.isBinaryOp()) && - !I.hasUnsafeAlgebra()) { - DEBUG(dbgs() << "LV: Found FP op with unsafe algebra.\n"); - Hints->setPotentiallyUnsafe(); - } + if (BI->getSuccessor(0) != Dst) + for (unsigned part = 0; part < UF; ++part) + EdgeMask[part] = Builder.CreateNot(EdgeMask[part]); - // Reduction instructions are allowed to have exit users. - // All other instructions must not have external users. - if (hasOutsideLoopUser(TheLoop, &I, AllowedExit)) { - ORE->emit(createMissedAnalysis("ValueUsedOutsideLoop", &I) - << "value cannot be used outside the loop"); - return false; - } + for (unsigned part = 0; part < UF; ++part) + EdgeMask[part] = Builder.CreateAnd(EdgeMask[part], SrcMask[part]); - } // next instr. + EdgeMaskCache[Edge] = EdgeMask; + return EdgeMask; } - if (!PrimaryInduction) { - DEBUG(dbgs() << "LV: Did not find one integer induction var.\n"); - if (Inductions.empty()) { - ORE->emit(createMissedAnalysis("NoInductionVariable") - << "loop induction variable could not be identified"); - return false; - } - } + EdgeMaskCache[Edge] = SrcMask; + return SrcMask; +} - // Now we know the widest induction type, check if our found induction - // is the same size. If it's not, unset it here and InnerLoopVectorizer - // will create another. - if (PrimaryInduction && WidestIndTy != PrimaryInduction->getType()) - PrimaryInduction = nullptr; +InnerLoopVectorizer::VectorParts +InnerLoopVectorizer::createBlockInMask(BasicBlock *BB) { + assert(OrigLoop->contains(BB) && "Block is not a part of a loop"); - return true; -} + // Look for cached value. + BlockMaskCacheTy::iterator BCEntryIt = BlockMaskCache.find(BB); + if (BCEntryIt != BlockMaskCache.end()) + return BCEntryIt->second; -void LoopVectorizationCostModel::collectLoopScalars(unsigned VF) { + // Loop incoming mask is all-one. 
+ if (OrigLoop->getHeader() == BB) { + Value *C = ConstantInt::get(IntegerType::getInt1Ty(BB->getContext()), 1); + return getVectorValue(C); + } - // We should not collect Scalars more than once per VF. Right now, - // this function is called from collectUniformsAndScalars(), which - // already does this check. Collecting Scalars for VF=1 does not make any - // sense. + // This is the block mask. We OR all incoming edges, and with zero. + Value *Zero = ConstantInt::get(IntegerType::getInt1Ty(BB->getContext()), 0); + VectorParts BlockMask = getVectorValue(Zero); - assert(VF >= 2 && !Scalars.count(VF) && - "This function should not be visited twice for the same VF"); + // For each pred: + for (pred_iterator it = pred_begin(BB), e = pred_end(BB); it != e; ++it) { + VectorParts EM = createEdgeMask(*it, BB); + for (unsigned part = 0; part < UF; ++part) + BlockMask[part] = Builder.CreateOr(BlockMask[part], EM[part]); + } - // If an instruction is uniform after vectorization, it will remain scalar. - Scalars[VF].insert(Uniforms[VF].begin(), Uniforms[VF].end()); + BlockMaskCache[BB] = BlockMask; + return BlockMask; +} - // Collect the getelementptr instructions that will not be vectorized. A - // getelementptr instruction is only vectorized if it is used for a legal - // gather or scatter operation. - for (auto *BB : TheLoop->blocks()) - for (auto &I : *BB) { - if (auto *GEP = dyn_cast(&I)) { - Scalars[VF].insert(GEP); - continue; - } - auto *Ptr = getPointerOperand(&I); - if (!Ptr) - continue; - auto *GEP = getGEPInstruction(Ptr); - if (GEP && getWideningDecision(&I, VF) == CM_GatherScatter) - Scalars[VF].erase(GEP); +void InnerLoopVectorizer::widenPHIInstruction(Instruction *PN, unsigned UF, + unsigned VF, PhiVector *PV) { + PHINode *P = cast(PN); + // Handle recurrences. + if (Legal->isReductionVariable(P) || Legal->isFirstOrderRecurrence(P)) { + VectorParts Entry(UF); + for (unsigned part = 0; part < UF; ++part) { + // This is phase one of vectorizing PHIs. + Type *VecTy = + (VF == 1) ? PN->getType() : VectorType::get(PN->getType(), VF); + Entry[part] = PHINode::Create( + VecTy, 2, "vec.phi", &*LoopVectorBody->getFirstInsertionPt()); } + VectorLoopValueMap.initVector(P, Entry); + PV->push_back(P); + return; + } - // An induction variable will remain scalar if all users of the induction - // variable and induction variable update remain scalar. - auto *Latch = TheLoop->getLoopLatch(); - for (auto &Induction : *Legal->getInductionVars()) { - auto *Ind = Induction.first; - auto *IndUpdate = cast(Ind->getIncomingValueForBlock(Latch)); - - // Determine if all users of the induction variable are scalar after - // vectorization. - auto ScalarInd = all_of(Ind->users(), [&](User *U) -> bool { - auto *I = cast(U); - return I == IndUpdate || !TheLoop->contains(I) || Scalars[VF].count(I); - }); - if (!ScalarInd) - continue; + setDebugLocFromInst(Builder, P); + // Check for PHI nodes that are lowered to vector selects. + if (P->getParent() != OrigLoop->getHeader()) { + // We know that all PHIs in non-header blocks are converted into + // selects, so we don't have to worry about the insertion order and we + // can just use the builder. + // At this point we generate the predication tree. There may be + // duplications since this is a simple recursive scan, but future + // optimizations will clean it up. - // Determine if all users of the induction variable update instruction are - // scalar after vectorization. 
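// ---------------------------------------------------------------------------
// Illustrative sketch (editor addition, not part of this patch or of the LLVM
// sources): the mask rules used by createEdgeMask()/createBlockInMask() above,
// written over per-lane bool masks for a single if/then/else diamond.  The
// lane count and condition values are invented.  The rules being shown:
//   edge mask(Src -> Dst) = block-in mask(Src) AND branch condition,
//                           negated when Dst is the false successor;
//   block-in mask(Dst)    = OR over the masks of all incoming edges.
// ---------------------------------------------------------------------------
#include <array>
#include <cassert>

int main() {
  constexpr unsigned VF = 4;
  using Mask = std::array<bool, VF>;

  Mask HeaderMask;                              // loop header: all-one mask
  HeaderMask.fill(true);

  const Mask Cond = {true, false, true, false}; // vectorized branch condition

  Mask ThenMask, ElseMask, MergeMask;
  for (unsigned L = 0; L < VF; ++L) {
    ThenMask[L] = HeaderMask[L] && Cond[L];     // edge: header -> then
    ElseMask[L] = HeaderMask[L] && !Cond[L];    // edge: header -> else
    MergeMask[L] = ThenMask[L] || ElseMask[L];  // OR of incoming edge masks
  }

  // The then/else edges partition the header's lanes, so the merge block is
  // executed by exactly the lanes that executed the header.
  for (unsigned L = 0; L < VF; ++L)
    assert(MergeMask[L] == HeaderMask[L]);
  return 0;
}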
- auto ScalarIndUpdate = all_of(IndUpdate->users(), [&](User *U) -> bool { - auto *I = cast(U); - return I == Ind || !TheLoop->contains(I) || Scalars[VF].count(I); - }); - if (!ScalarIndUpdate) - continue; + unsigned NumIncoming = P->getNumIncomingValues(); - // The induction variable and its update instruction will remain scalar. - Scalars[VF].insert(Ind); - Scalars[VF].insert(IndUpdate); - } -} + // Generate a sequence of selects of the form: + // SELECT(Mask3, In3, + // SELECT(Mask2, In2, + // ( ...))) + VectorParts Entry(UF); + for (unsigned In = 0; In < NumIncoming; In++) { + VectorParts Cond = + createEdgeMask(P->getIncomingBlock(In), P->getParent()); + const VectorParts &In0 = getVectorValue(P->getIncomingValue(In)); -bool LoopVectorizationLegality::isScalarWithPredication(Instruction *I) { - if (!blockNeedsPredication(I->getParent())) - return false; - switch(I->getOpcode()) { - default: - break; - case Instruction::Store: - return !isMaskRequired(I); - case Instruction::UDiv: - case Instruction::SDiv: - case Instruction::SRem: - case Instruction::URem: - return mayDivideByZero(*I); + for (unsigned part = 0; part < UF; ++part) { + // We might have single edge PHIs (blocks) - use an identity + // 'select' for the first PHI operand. + if (In == 0) + Entry[part] = Builder.CreateSelect(Cond[part], In0[part], In0[part]); + else + // Select between the current value and the previous incoming edge + // based on the incoming mask. + Entry[part] = Builder.CreateSelect(Cond[part], In0[part], Entry[part], + "predphi"); + } + } + VectorLoopValueMap.initVector(P, Entry); + return; } - return false; -} - -bool LoopVectorizationLegality::memoryInstructionCanBeWidened(Instruction *I, - unsigned VF) { - // Get and ensure we have a valid memory instruction. - LoadInst *LI = dyn_cast(I); - StoreInst *SI = dyn_cast(I); - assert((LI || SI) && "Invalid memory instruction"); - auto *Ptr = getPointerOperand(I); + // This PHINode must be an induction variable. + // Make sure that we know about it. + assert(Legal->getInductionVars()->count(P) && "Not an induction variable"); - // In order to be widened, the pointer should be consecutive, first of all. - if (!isConsecutivePtr(Ptr)) - return false; + InductionDescriptor II = Legal->getInductionVars()->lookup(P); + const DataLayout &DL = OrigLoop->getHeader()->getModule()->getDataLayout(); - // If the instruction is a store located in a predicated block, it will be - // scalarized. - if (isScalarWithPredication(I)) - return false; + // FIXME: The newly created binary instructions should contain nsw/nuw flags, + // which can be found from the original scalar operations. + switch (II.getKind()) { + case InductionDescriptor::IK_NoInduction: + llvm_unreachable("Unknown induction"); + case InductionDescriptor::IK_IntInduction: + widenIntInduction(needsScalarInduction(P), P); // Used only by Unroller + return; + case InductionDescriptor::IK_PtrInduction: { + // Handle the pointer induction variable case. + assert(P->getType()->isPointerTy() && "Unexpected type."); + // This is the normalized GEP that starts counting at zero. + Value *PtrInd = Induction; + PtrInd = Builder.CreateSExtOrTrunc(PtrInd, II.getStep()->getType()); + // Determine the number of scalars we need to generate for each unroll + // iteration. If the instruction is uniform, we only need to generate the + // first lane. Otherwise, we generate all VF values. + unsigned Lanes = Cost->isUniformAfterVectorization(P, VF) ? 1 : VF; + // These are the scalar results. 
Notice that we don't generate vector GEPs + // because scalar GEPs result in better code. + ScalarParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Entry[Part].resize(VF); + for (unsigned Lane = 0; Lane < Lanes; ++Lane) { + Constant *Idx = ConstantInt::get(PtrInd->getType(), Lane + Part * VF); + Value *GlobalIdx = Builder.CreateAdd(PtrInd, Idx); + Value *SclrGep = II.transform(Builder, GlobalIdx, PSE.getSE(), DL); + SclrGep->setName("next.gep"); + Entry[Part][Lane] = SclrGep; + } + } + VectorLoopValueMap.initScalar(P, Entry); + return; + } + case InductionDescriptor::IK_FpInduction: { + assert(P->getType() == II.getStartValue()->getType() && + "Types must match"); + // Handle other induction variables that are now based on the + // canonical one. + assert(P != OldInduction && "Primary induction can be integer only"); - // If the instruction's allocated size doesn't equal it's type size, it - // requires padding and will be scalarized. - auto &DL = I->getModule()->getDataLayout(); - auto *ScalarTy = LI ? LI->getType() : SI->getValueOperand()->getType(); - if (hasIrregularType(ScalarTy, DL, VF)) - return false; + Value *V = Builder.CreateCast(Instruction::SIToFP, Induction, P->getType()); + V = II.transform(Builder, V, PSE.getSE(), DL); + V->setName("fp.offset.idx"); - return true; + // Now we have scalar op: %fp.offset.idx = StartVal +/- Induction*StepVal + + Value *Broadcasted = getBroadcastInstrs(V); + // After broadcasting the induction variable we need to make the vector + // consecutive by adding StepVal*0, StepVal*1, StepVal*2, etc. + Value *StepVal = cast(II.getStep())->getValue(); + VectorParts Entry(UF); + for (unsigned part = 0; part < UF; ++part) + Entry[part] = getStepVector(Broadcasted, VF * part, StepVal, + II.getInductionOpcode()); + VectorLoopValueMap.initVector(P, Entry); + return; + } + } } -void LoopVectorizationCostModel::collectLoopUniforms(unsigned VF) { +/// A helper function for checking whether an integer division-related +/// instruction may divide by zero (in which case it must be predicated if +/// executed conditionally in the scalar code). +/// TODO: It may be worthwhile to generalize and check isKnownNonZero(). +/// Non-zero divisors that are non compile-time constants will not be +/// converted into multiplication, so we will still end up scalarizing +/// the division, but can do so w/o predication. +static bool mayDivideByZero(Instruction &I) { + assert((I.getOpcode() == Instruction::UDiv || + I.getOpcode() == Instruction::SDiv || + I.getOpcode() == Instruction::URem || + I.getOpcode() == Instruction::SRem) && + "Unexpected instruction"); + Value *Divisor = I.getOperand(1); + auto *CInt = dyn_cast(Divisor); + return !CInt || CInt->isZero(); +} - // We should not collect Uniforms more than once per VF. Right now, - // this function is called from collectUniformsAndScalars(), which - // already does this check. Collecting Uniforms for VF=1 does not make any - // sense. +void InnerLoopVectorizer::vectorizeInstruction(Instruction &I) { + switch (I.getOpcode()) { + case Instruction::PHI: { + // Vectorize PHINodes. + PhiVector PV; // Records Reduction and FirstOrderRecurrence header Phis. + widenPHIInstruction(&I, UF, VF, &PV); + break; + } // End of PHI. 
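// ---------------------------------------------------------------------------
// Illustrative sketch (editor addition, not part of this patch or of the LLVM
// sources): the per-lane expansion performed in the IK_PtrInduction case
// above, assuming a unit-stride pointer induction.  Unroll part `Part`, lane
// `Lane` addresses element PtrInd + Part * VF + Lane, and only lane 0 is
// materialized when the pointer is uniform after vectorization.  The buffer
// and the VF/UF values are invented for the example.
// ---------------------------------------------------------------------------
#include <cassert>

int main() {
  constexpr unsigned VF = 4, UF = 2;
  int Data[VF * UF] = {0, 1, 2, 3, 4, 5, 6, 7};
  int *PtrInd = Data;                 // normalized pointer induction, step 1
  const bool UniformAfterVec = false; // if true, only lane 0 would be needed

  const unsigned Lanes = UniformAfterVec ? 1 : VF;
  int *Scalars[UF][VF] = {};
  for (unsigned Part = 0; Part < UF; ++Part)
    for (unsigned Lane = 0; Lane < Lanes; ++Lane)
      Scalars[Part][Lane] = PtrInd + (Part * VF + Lane); // one "next.gep" each

  // Each generated scalar pointer addresses the element its lane would have
  // touched in the original scalar loop.
  for (unsigned Part = 0; Part < UF; ++Part)
    for (unsigned Lane = 0; Lane < Lanes; ++Lane)
      assert(*Scalars[Part][Lane] == int(Part * VF + Lane));
  return 0;
}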
+ case Instruction::UDiv: + case Instruction::SDiv: + case Instruction::SRem: + case Instruction::URem: + case Instruction::Add: + case Instruction::FAdd: + case Instruction::Sub: + case Instruction::FSub: + case Instruction::Mul: + case Instruction::FMul: + case Instruction::FDiv: + case Instruction::FRem: + case Instruction::Shl: + case Instruction::LShr: + case Instruction::AShr: + case Instruction::And: + case Instruction::Or: + case Instruction::Xor: { + // Just widen binops. + auto *BinOp = cast(&I); + setDebugLocFromInst(Builder, BinOp); + const VectorParts &A = getVectorValue(BinOp->getOperand(0)); + const VectorParts &B = getVectorValue(BinOp->getOperand(1)); - assert(VF >= 2 && !Uniforms.count(VF) && - "This function should not be visited twice for the same VF"); + // Use this vector value for all users of the original instruction. + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Value *V = Builder.CreateBinOp(BinOp->getOpcode(), A[Part], B[Part]); - // Visit the list of Uniforms. If we'll not find any uniform value, we'll - // not analyze again. Uniforms.count(VF) will return 1. - Uniforms[VF].clear(); + if (BinaryOperator *VecOp = dyn_cast(V)) + VecOp->copyIRFlags(BinOp); - // We now know that the loop is vectorizable! - // Collect instructions inside the loop that will remain uniform after - // vectorization. + Entry[Part] = V; + } - // Global values, params and instructions outside of current loop are out of - // scope. - auto isOutOfScope = [&](Value *V) -> bool { - Instruction *I = dyn_cast(V); - return (!I || !TheLoop->contains(I)); - }; + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, BinOp); + break; + } + case Instruction::Select: { + // Widen selects. + // If the selector is loop invariant we can create a select + // instruction with a scalar condition. Otherwise, use vector-select. + auto *SE = PSE.getSE(); + bool InvariantCond = + SE->isLoopInvariant(PSE.getSCEV(I.getOperand(0)), OrigLoop); + setDebugLocFromInst(Builder, &I); + + // The condition can be loop invariant but still defined inside the + // loop. This means that we can't just use the original 'cond' value. + // We have to take the 'vectorized' value and pick the first lane. + // Instcombine will make this a no-op. + const VectorParts &Cond = getVectorValue(I.getOperand(0)); + const VectorParts &Op0 = getVectorValue(I.getOperand(1)); + const VectorParts &Op1 = getVectorValue(I.getOperand(2)); + + auto *ScalarCond = getScalarValue(I.getOperand(0), 0, 0); - SetVector Worklist; - BasicBlock *Latch = TheLoop->getLoopLatch(); + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Entry[Part] = Builder.CreateSelect( + InvariantCond ? ScalarCond : Cond[Part], Op0[Part], Op1[Part]); + } - // Start with the conditional branch. If the branch condition is an - // instruction contained in the loop that is only used by the branch, it is - // uniform. - auto *Cmp = dyn_cast(Latch->getTerminator()->getOperand(0)); - if (Cmp && TheLoop->contains(Cmp) && Cmp->hasOneUse()) { - Worklist.insert(Cmp); - DEBUG(dbgs() << "LV: Found uniform instruction: " << *Cmp << "\n"); + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, &I); + break; } - // Holds consecutive and consecutive-like pointers. Consecutive-like pointers - // are pointers that are treated like consecutive pointers during - // vectorization. The pointer operands of interleaved accesses are an - // example. 
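// ---------------------------------------------------------------------------
// Illustrative sketch (editor addition, not part of this patch or of the LLVM
// sources): the two ways a select is widened in the Select case above.  With a
// loop-invariant condition, every lane takes the same side, so a single scalar
// condition (lane 0 of its vectorized form) chooses between whole vectors;
// otherwise the select is applied lane by lane.  All values are invented.
// ---------------------------------------------------------------------------
#include <array>
#include <cassert>

int main() {
  constexpr unsigned VF = 4;
  using Vec = std::array<int, VF>;
  const Vec Op0 = {1, 2, 3, 4}, Op1 = {10, 20, 30, 40};
  Vec Out;

  // Case 1: invariant condition - effectively a broadcast, so lane 0 decides
  // for all lanes and a scalar select between whole vectors suffices.
  const std::array<bool, VF> InvCond = {true, true, true, true};
  const bool ScalarCond = InvCond[0];
  Out = ScalarCond ? Op0 : Op1;
  assert(Out == Op0);

  // Case 2: loop-varying condition - a per-lane vector select ("predphi").
  const std::array<bool, VF> Cond = {true, false, true, false};
  for (unsigned L = 0; L < VF; ++L)
    Out[L] = Cond[L] ? Op0[L] : Op1[L];
  assert((Out == Vec{1, 20, 3, 40}));
  return 0;
}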
- SmallSetVector ConsecutiveLikePtrs; - - // Holds pointer operands of instructions that are possibly non-uniform. - SmallPtrSet PossibleNonUniformPtrs; + case Instruction::ICmp: + case Instruction::FCmp: { + // Widen compares. Generate vector compares. + bool FCmp = (I.getOpcode() == Instruction::FCmp); + auto *Cmp = dyn_cast(&I); + setDebugLocFromInst(Builder, Cmp); + const VectorParts &A = getVectorValue(Cmp->getOperand(0)); + const VectorParts &B = getVectorValue(Cmp->getOperand(1)); + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Value *C = nullptr; + if (FCmp) { + C = Builder.CreateFCmp(Cmp->getPredicate(), A[Part], B[Part]); + cast(C)->copyFastMathFlags(Cmp); + } else { + C = Builder.CreateICmp(Cmp->getPredicate(), A[Part], B[Part]); + } + Entry[Part] = C; + } - auto isUniformDecision = [&](Instruction *I, unsigned VF) { - InstWidening WideningDecision = getWideningDecision(I, VF); - assert(WideningDecision != CM_Unknown && - "Widening decision should be ready at this moment"); + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, &I); + break; + } - return (WideningDecision == CM_Widen || - WideningDecision == CM_Interleave); - }; - // Iterate over the instructions in the loop, and collect all - // consecutive-like pointer operands in ConsecutiveLikePtrs. If it's possible - // that a consecutive-like pointer operand will be scalarized, we collect it - // in PossibleNonUniformPtrs instead. We use two sets here because a single - // getelementptr instruction can be used by both vectorized and scalarized - // memory instructions. For example, if a loop loads and stores from the same - // location, but the store is conditional, the store will be scalarized, and - // the getelementptr won't remain uniform. - for (auto *BB : TheLoop->blocks()) - for (auto &I : *BB) { + case Instruction::Store: + case Instruction::Load: + vectorizeMemoryInstruction(&I); + break; + case Instruction::ZExt: + case Instruction::SExt: + case Instruction::FPToUI: + case Instruction::FPToSI: + case Instruction::FPExt: + case Instruction::PtrToInt: + case Instruction::IntToPtr: + case Instruction::SIToFP: + case Instruction::UIToFP: + case Instruction::Trunc: + case Instruction::FPTrunc: + case Instruction::BitCast: { + auto *CI = dyn_cast(&I); + setDebugLocFromInst(Builder, CI); - // If there's no pointer operand, there's nothing to do. - auto *Ptr = dyn_cast_or_null(getPointerOperand(&I)); - if (!Ptr) - continue; + /// Vectorize casts. + Type *DestTy = + (VF == 1) ? CI->getType() : VectorType::get(CI->getType(), VF); - // True if all users of Ptr are memory accesses that have Ptr as their - // pointer operand. - auto UsersAreMemAccesses = all_of(Ptr->users(), [&](User *U) -> bool { - return getPointerOperand(U) == Ptr; - }); + const VectorParts &A = getVectorValue(CI->getOperand(0)); + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) + Entry[Part] = Builder.CreateCast(CI->getOpcode(), A[Part], DestTy); + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, &I); + break; + } - // Ensure the memory instruction will not be scalarized or used by - // gather/scatter, making its pointer operand non-uniform. If the pointer - // operand is used by any instruction other than a memory access, we - // conservatively assume the pointer operand may be non-uniform. - if (!UsersAreMemAccesses || !isUniformDecision(&I, VF)) - PossibleNonUniformPtrs.insert(Ptr); + case Instruction::Call: { + // Ignore dbg intrinsics. 
+ if (isa(I)) + break; + setDebugLocFromInst(Builder, &I); + + Module *M = I.getParent()->getParent()->getParent(); + auto *CI = cast(&I); + + StringRef FnName = CI->getCalledFunction()->getName(); + Function *F = CI->getCalledFunction(); + Type *RetTy = ToVectorTy(CI->getType(), VF); + SmallVector Tys; + for (Value *ArgOperand : CI->arg_operands()) + Tys.push_back(ToVectorTy(ArgOperand->getType(), VF)); + + Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); + bool NeedToScalarize; // Redundant, needed for UseVectorIntrinsic. + unsigned CallCost = getVectorCallCost(CI, VF, *TTI, TLI, NeedToScalarize); + bool UseVectorIntrinsic = + ID && getVectorIntrinsicCost(CI, VF, *TTI, TLI) <= CallCost; + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + SmallVector Args; + for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) { + Value *Arg = CI->getArgOperand(i); + // Some intrinsics have a scalar argument - don't replace it with a + // vector. + if (!UseVectorIntrinsic || !hasVectorInstrinsicScalarOpd(ID, i)) { + const VectorParts &VectorArg = getVectorValue(CI->getArgOperand(i)); + Arg = VectorArg[Part]; + } + Args.push_back(Arg); + } - // If the memory instruction will be vectorized and its pointer operand - // is consecutive-like, or interleaving - the pointer operand should - // remain uniform. - else - ConsecutiveLikePtrs.insert(Ptr); - } + Function *VectorF; + if (UseVectorIntrinsic) { + // Use vector version of the intrinsic. + Type *TysForDecl[] = {CI->getType()}; + if (VF > 1) + TysForDecl[0] = VectorType::get(CI->getType()->getScalarType(), VF); + VectorF = Intrinsic::getDeclaration(M, ID, TysForDecl); + } else { + // Use vector version of the library call. + StringRef VFnName = TLI->getVectorizedFunction(FnName, VF); + assert(!VFnName.empty() && "Vector function name is empty."); + VectorF = M->getFunction(VFnName); + if (!VectorF) { + // Generate a declaration + FunctionType *FTy = FunctionType::get(RetTy, Tys, false); + VectorF = + Function::Create(FTy, Function::ExternalLinkage, VFnName, M); + VectorF->copyAttributesFrom(F); + } + } + assert(VectorF && "Can't create vector function."); - // Add to the Worklist all consecutive and consecutive-like pointers that - // aren't also identified as possibly non-uniform. - for (auto *V : ConsecutiveLikePtrs) - if (!PossibleNonUniformPtrs.count(V)) { - DEBUG(dbgs() << "LV: Found uniform instruction: " << *V << "\n"); - Worklist.insert(V); - } + SmallVector OpBundles; + CI->getOperandBundlesAsDefs(OpBundles); + CallInst *V = Builder.CreateCall(VectorF, Args, OpBundles); - // Expand Worklist in topological order: whenever a new instruction - // is added , its users should be either already inside Worklist, or - // out of scope. It ensures a uniform instruction will only be used - // by uniform instructions or out of scope instructions. - unsigned idx = 0; - while (idx != Worklist.size()) { - Instruction *I = Worklist[idx++]; + if (isa(V)) + V->copyFastMathFlags(CI); - for (auto OV : I->operand_values()) { - if (isOutOfScope(OV)) - continue; - auto *OI = cast(OV); - if (all_of(OI->users(), [&](User *U) -> bool { - return isOutOfScope(U) || Worklist.count(cast(U)); - })) { - Worklist.insert(OI); - DEBUG(dbgs() << "LV: Found uniform instruction: " << *OI << "\n"); - } + Entry[Part] = V; } + + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, &I); + break; } - // Returns true if Ptr is the pointer operand of a memory access instruction - // I, and I is known to not require scalarization. 
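// ---------------------------------------------------------------------------
// Illustrative sketch (editor addition, not part of this patch or of the LLVM
// sources): the cost comparison that decides how the Call case above widens a
// call, reduced to a free-standing helper.  A vector *intrinsic* is used only
// when one exists for the call (ID != 0) and its estimated cost does not
// exceed that of a vectorized library call; otherwise the library call is
// emitted (the full vectorizer may also fall back to scalarization).  The cost
// numbers below are invented.
// ---------------------------------------------------------------------------
#include <cassert>

// Mirrors "UseVectorIntrinsic = ID && IntrinsicCost <= CallCost" above.
static bool useVectorIntrinsic(unsigned IntrinsicID, unsigned IntrinsicCost,
                               unsigned LibCallCost) {
  return IntrinsicID != 0 && IntrinsicCost <= LibCallCost;
}

int main() {
  assert(useVectorIntrinsic(/*ID=*/42, /*IntrinsicCost=*/2, /*LibCallCost=*/6));
  assert(!useVectorIntrinsic(/*ID=*/0, /*IntrinsicCost=*/2, /*LibCallCost=*/6));
  assert(!useVectorIntrinsic(/*ID=*/42, /*IntrinsicCost=*/8, /*LibCallCost=*/6));
  return 0;
}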
- auto isVectorizedMemAccessUse = [&](Instruction *I, Value *Ptr) -> bool { - return getPointerOperand(I) == Ptr && isUniformDecision(I, VF); - }; + default: + // All other instructions are scalarized. + DEBUG(dbgs() << "LV: Found an unhandled instruction: " << I); + llvm_unreachable("Unhandled instruction!"); + } // end of switch. +} - // For an instruction to be added into Worklist above, all its users inside - // the loop should also be in Worklist. However, this condition cannot be - // true for phi nodes that form a cyclic dependence. We must process phi - // nodes separately. An induction variable will remain uniform if all users - // of the induction variable and induction variable update remain uniform. - // The code below handles both pointer and non-pointer induction variables. - for (auto &Induction : *Legal->getInductionVars()) { - auto *Ind = Induction.first; - auto *IndUpdate = cast(Ind->getIncomingValueForBlock(Latch)); - - // Determine if all users of the induction variable are uniform after - // vectorization. - auto UniformInd = all_of(Ind->users(), [&](User *U) -> bool { - auto *I = cast(U); - return I == IndUpdate || !TheLoop->contains(I) || Worklist.count(I) || - isVectorizedMemAccessUse(I, Ind); - }); - if (!UniformInd) - continue; - - // Determine if all users of the induction variable update instruction are - // uniform after vectorization. - auto UniformIndUpdate = all_of(IndUpdate->users(), [&](User *U) -> bool { - auto *I = cast(U); - return I == Ind || !TheLoop->contains(I) || Worklist.count(I) || - isVectorizedMemAccessUse(I, IndUpdate); - }); - if (!UniformIndUpdate) - continue; +void InnerLoopVectorizer::updateAnalysis() { + // Forget the original basic block. + PSE.getSE()->forgetLoop(OrigLoop); - // The induction variable and its update instruction will remain uniform. - Worklist.insert(Ind); - Worklist.insert(IndUpdate); - DEBUG(dbgs() << "LV: Found uniform instruction: " << *Ind << "\n"); - DEBUG(dbgs() << "LV: Found uniform instruction: " << *IndUpdate << "\n"); - } + // Update the dominator tree information. + assert(DT->properlyDominates(LoopBypassBlocks.front(), LoopExitBlock) && + "Entry does not dominate exit."); - Uniforms[VF].insert(Worklist.begin(), Worklist.end()); + if (!DT->getNode(LoopVectorBody)) // For InnerLoopUnroller. + DT->addNewBlock(LoopVectorBody, LoopVectorPreHeader); + auto *LoopVectorLatch = LI->getLoopFor(LoopVectorBody)->getLoopLatch(); + DT->addNewBlock(LoopMiddleBlock, LoopVectorLatch); + DT->addNewBlock(LoopScalarPreHeader, LoopBypassBlocks[0]); + DT->changeImmediateDominator(LoopScalarBody, LoopScalarPreHeader); + DT->changeImmediateDominator(LoopExitBlock, LoopBypassBlocks[0]); + DEBUG(DT->verifyDomTree()); } -bool LoopVectorizationLegality::canVectorizeMemory() { - LAI = &(*GetLAA)(*TheLoop); - InterleaveInfo.setLAI(LAI); - const OptimizationRemarkAnalysis *LAR = LAI->getReport(); - if (LAR) { - OptimizationRemarkAnalysis VR(Hints->vectorizeAnalysisPassName(), - "loop not vectorized: ", *LAR); - ORE->emit(VR); - } - if (!LAI->canVectorizeMemory()) - return false; - - if (LAI->hasStoreToLoopInvariantAddress()) { - ORE->emit(createMissedAnalysis("CantVectorizeStoreToLoopInvariantAddress") - << "write to a loop invariant address could not be vectorized"); - DEBUG(dbgs() << "LV: We don't allow storing to uniform addresses\n"); - return false; +/// \brief Check whether it is safe to if-convert this phi node. +/// +/// Phi nodes with constant expressions that can trap are not safe to if +/// convert. 
+static bool canIfConvertPHINodes(BasicBlock *BB) { + for (Instruction &I : *BB) { + auto *Phi = dyn_cast(&I); + if (!Phi) + return true; + for (Value *V : Phi->incoming_values()) + if (auto *C = dyn_cast(V)) + if (C->canTrap()) + return false; } - - Requirements->addRuntimePointerChecks(LAI->getNumRuntimePointerChecks()); - PSE.addPredicate(LAI->getPSE().getUnionPredicate()); - return true; } -bool LoopVectorizationLegality::isInductionVariable(const Value *V) { - Value *In0 = const_cast(V); - PHINode *PN = dyn_cast_or_null(In0); - if (!PN) +bool LoopVectorizationLegality::canVectorizeWithIfConvert() { + if (!EnableIfConversion) { + ORE->emit(createMissedAnalysis("IfConversionDisabled") + << "if-conversion is disabled"); return false; + } - return Inductions.count(PN); -} + assert(TheLoop->getNumBlocks() > 1 && "Single block loops are vectorizable"); -bool LoopVectorizationLegality::isFirstOrderRecurrence(const PHINode *Phi) { - return FirstOrderRecurrences.count(Phi); -} + // A list of pointers that we can safely read and write to. + SmallPtrSet SafePointes; -bool LoopVectorizationLegality::blockNeedsPredication(BasicBlock *BB) { - return LoopAccessInfo::blockNeedsPredication(BB, TheLoop, DT); -} + // Collect safe addresses. + for (BasicBlock *BB : TheLoop->blocks()) { + if (blockNeedsPredication(BB)) + continue; -bool LoopVectorizationLegality::blockCanBePredicated( - BasicBlock *BB, SmallPtrSetImpl &SafePtrs) { - const bool IsAnnotatedParallel = TheLoop->isAnnotatedParallel(); + for (Instruction &I : *BB) + if (auto *Ptr = getPointerOperand(&I)) + SafePointes.insert(Ptr); + } - for (Instruction &I : *BB) { - // Check that we don't have a constant expression that can trap as operand. - for (Value *Operand : I.operands()) { - if (auto *C = dyn_cast(Operand)) - if (C->canTrap()) - return false; - } - // We might be able to hoist the load. - if (I.mayReadFromMemory()) { - auto *LI = dyn_cast(&I); - if (!LI) - return false; - if (!SafePtrs.count(LI->getPointerOperand())) { - if (isLegalMaskedLoad(LI->getType(), LI->getPointerOperand()) || - isLegalMaskedGather(LI->getType())) { - MaskedOp.insert(LI); - continue; - } - // !llvm.mem.parallel_loop_access implies if-conversion safety. - if (IsAnnotatedParallel) - continue; - return false; - } + // Collect the blocks that need predication. + BasicBlock *Header = TheLoop->getHeader(); + for (BasicBlock *BB : TheLoop->blocks()) { + // We don't support switch statements inside loops. + if (!isa(BB->getTerminator())) { + ORE->emit(createMissedAnalysis("LoopContainsSwitch", BB->getTerminator()) + << "loop contains a switch statement"); + return false; } - if (I.mayWriteToMemory()) { - auto *SI = dyn_cast(&I); - // We only support predication of stores in basic blocks with one - // predecessor. - if (!SI) + // We must be able to predicate all blocks that need to be predicated. + if (blockNeedsPredication(BB)) { + if (!blockCanBePredicated(BB, SafePointes)) { + ORE->emit(createMissedAnalysis("NoCFGForSelect", BB->getTerminator()) + << "control flow cannot be substituted for a select"); return false; - - // Build a masked store if it is legal for the target. 
- if (isLegalMaskedStore(SI->getValueOperand()->getType(), - SI->getPointerOperand()) || - isLegalMaskedScatter(SI->getValueOperand()->getType())) { - MaskedOp.insert(SI); - continue; } - - bool isSafePtr = (SafePtrs.count(SI->getPointerOperand()) != 0); - bool isSinglePredecessor = SI->getParent()->getSinglePredecessor(); - - if (++NumPredStores > NumberOfStoresToPredicate || !isSafePtr || - !isSinglePredecessor) - return false; - } - if (I.mayThrow()) + } else if (BB != Header && !canIfConvertPHINodes(BB)) { + ORE->emit(createMissedAnalysis("NoCFGForSelect", BB->getTerminator()) + << "control flow cannot be substituted for a select"); return false; + } } + // We can if-convert this loop. return true; } -void InterleavedAccessInfo::collectConstStrideAccesses( - MapVector &AccessStrideInfo, - const ValueToValueMap &Strides) { +bool LoopVectorizationLegality::canVectorize() { + // We must have a loop in canonical form. Loops with indirectbr in them cannot + // be canonicalized. + if (!TheLoop->getLoopPreheader()) { + ORE->emit(createMissedAnalysis("CFGNotUnderstood") + << "loop control flow is not understood by vectorizer"); + return false; + } - auto &DL = TheLoop->getHeader()->getModule()->getDataLayout(); + // FIXME: The code is currently dead, since the loop gets sent to + // LoopVectorizationLegality is already an innermost loop. + // + // We can only vectorize innermost loops. + if (!TheLoop->empty()) { + ORE->emit(createMissedAnalysis("NotInnermostLoop") + << "loop is not the innermost loop"); + return false; + } - // Since it's desired that the load/store instructions be maintained in - // "program order" for the interleaved access analysis, we have to visit the - // blocks in the loop in reverse postorder (i.e., in a topological order). - // Such an ordering will ensure that any load/store that may be executed - // before a second load/store will precede the second load/store in - // AccessStrideInfo. - LoopBlocksDFS DFS(TheLoop); - DFS.perform(LI); - for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) - for (auto &I : *BB) { - auto *LI = dyn_cast(&I); - auto *SI = dyn_cast(&I); - if (!LI && !SI) - continue; + // We must have a single backedge. + if (TheLoop->getNumBackEdges() != 1) { + ORE->emit(createMissedAnalysis("CFGNotUnderstood") + << "loop control flow is not understood by vectorizer"); + return false; + } - Value *Ptr = getPointerOperand(&I); - // We don't check wrapping here because we don't know yet if Ptr will be - // part of a full group or a group with gaps. Checking wrapping for all - // pointers (even those that end up in groups with no gaps) will be overly - // conservative. For full groups, wrapping should be ok since if we would - // wrap around the address space we would do a memory access at nullptr - // even without the transformation. The wrapping checks are therefore - // deferred until after we've formed the interleaved groups. - int64_t Stride = getPtrStride(PSE, Ptr, TheLoop, Strides, - /*Assume=*/true, /*ShouldCheckWrap=*/false); + // We must have a single exiting block. + if (!TheLoop->getExitingBlock()) { + ORE->emit(createMissedAnalysis("CFGNotUnderstood") + << "loop control flow is not understood by vectorizer"); + return false; + } - const SCEV *Scev = replaceSymbolicStrideSCEV(PSE, Strides, Ptr); - PointerType *PtrTy = dyn_cast(Ptr->getType()); - uint64_t Size = DL.getTypeAllocSize(PtrTy->getElementType()); + // We only handle bottom-tested loops, i.e. loop in which the condition is + // checked at the end of each iteration. 
With that we can assume that all + // instructions in the loop are executed the same number of times. + if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) { + ORE->emit(createMissedAnalysis("CFGNotUnderstood") + << "loop control flow is not understood by vectorizer"); + return false; + } - // An alignment of 0 means target ABI alignment. - unsigned Align = getMemInstAlignment(&I); - if (!Align) - Align = DL.getABITypeAlignment(PtrTy->getElementType()); + // We need to have a loop header. + DEBUG(dbgs() << "LV: Found a loop: " << TheLoop->getHeader()->getName() + << '\n'); - AccessStrideInfo[&I] = StrideDescriptor(Stride, Scev, Size, Align); - } -} + // Check if we can if-convert non-single-bb loops. + unsigned NumBlocks = TheLoop->getNumBlocks(); + if (NumBlocks != 1 && !canVectorizeWithIfConvert()) { + DEBUG(dbgs() << "LV: Can't if-convert the loop.\n"); + return false; + } -// Analyze interleaved accesses and collect them into interleaved load and -// store groups. -// -// When generating code for an interleaved load group, we effectively hoist all -// loads in the group to the location of the first load in program order. When -// generating code for an interleaved store group, we sink all stores to the -// location of the last store. This code motion can change the order of load -// and store instructions and may break dependences. -// -// The code generation strategy mentioned above ensures that we won't violate -// any write-after-read (WAR) dependences. -// -// E.g., for the WAR dependence: a = A[i]; // (1) -// A[i] = b; // (2) -// -// The store group of (2) is always inserted at or below (2), and the load -// group of (1) is always inserted at or above (1). Thus, the instructions will -// never be reordered. All other dependences are checked to ensure the -// correctness of the instruction reordering. -// -// The algorithm visits all memory accesses in the loop in bottom-up program -// order. Program order is established by traversing the blocks in the loop in -// reverse postorder when collecting the accesses. -// -// We visit the memory accesses in bottom-up order because it can simplify the -// construction of store groups in the presence of write-after-write (WAW) -// dependences. -// -// E.g., for the WAW dependence: A[i] = a; // (1) -// A[i] = b; // (2) -// A[i + 1] = c; // (3) -// -// We will first create a store group with (3) and (2). (1) can't be added to -// this group because it and (2) are dependent. However, (1) can be grouped -// with other accesses that may precede it in program order. Note that a -// bottom-up order does not imply that WAW dependences should not be checked. -void InterleavedAccessInfo::analyzeInterleaving( - const ValueToValueMap &Strides) { - DEBUG(dbgs() << "LV: Analyzing interleaved accesses...\n"); + // ScalarEvolution needs to be able to find the exit count. + const SCEV *ExitCount = PSE.getBackedgeTakenCount(); + if (ExitCount == PSE.getSE()->getCouldNotCompute()) { + ORE->emit(createMissedAnalysis("CantComputeNumberOfIterations") + << "could not determine number of loop iterations"); + DEBUG(dbgs() << "LV: SCEV could not compute the loop exit count.\n"); + return false; + } - // Holds all accesses with a constant stride. - MapVector AccessStrideInfo; - collectConstStrideAccesses(AccessStrideInfo, Strides); + // Check if we can vectorize the instructions and CFG in this loop. 
+ if (!canVectorizeInstrs()) { + DEBUG(dbgs() << "LV: Can't vectorize the instructions or CFG\n"); + return false; + } - if (AccessStrideInfo.empty()) - return; + // Go over each instruction and look at memory deps. + if (!canVectorizeMemory()) { + DEBUG(dbgs() << "LV: Can't vectorize due to memory conflicts\n"); + return false; + } - // Collect the dependences in the loop. - collectDependences(); + DEBUG(dbgs() << "LV: We can vectorize this loop" + << (LAI->getRuntimePointerChecking()->Need + ? " (with a runtime bound check)" + : "") + << "!\n"); - // Holds all interleaved store groups temporarily. - SmallSetVector StoreGroups; - // Holds all interleaved load groups temporarily. - SmallSetVector LoadGroups; + bool UseInterleaved = TTI->enableInterleavedAccessVectorization(); - // Search in bottom-up program order for pairs of accesses (A and B) that can - // form interleaved load or store groups. In the algorithm below, access A - // precedes access B in program order. We initialize a group for B in the - // outer loop of the algorithm, and then in the inner loop, we attempt to - // insert each A into B's group if: - // - // 1. A and B have the same stride, - // 2. A and B have the same memory object size, and - // 3. A belongs in B's group according to its distance from B. - // - // Special care is taken to ensure group formation will not break any - // dependences. - for (auto BI = AccessStrideInfo.rbegin(), E = AccessStrideInfo.rend(); - BI != E; ++BI) { - Instruction *B = BI->first; - StrideDescriptor DesB = BI->second; + // If an override option has been passed in for interleaved accesses, use it. + if (EnableInterleavedMemAccesses.getNumOccurrences() > 0) + UseInterleaved = EnableInterleavedMemAccesses; - // Initialize a group for B if it has an allowable stride. Even if we don't - // create a group for B, we continue with the bottom-up algorithm to ensure - // we don't break any of B's dependences. - InterleaveGroup *Group = nullptr; - if (isStrided(DesB.Stride)) { - Group = getInterleaveGroup(B); - if (!Group) { - DEBUG(dbgs() << "LV: Creating an interleave group with:" << *B << '\n'); - Group = createInterleaveGroup(B, DesB.Stride, DesB.Align); - } - if (B->mayWriteToMemory()) - StoreGroups.insert(Group); - else - LoadGroups.insert(Group); - } + // Analyze interleaved memory accesses. + if (UseInterleaved) + InterleaveInfo.analyzeInterleaving(*getSymbolicStrides()); - for (auto AI = std::next(BI); AI != E; ++AI) { - Instruction *A = AI->first; - StrideDescriptor DesA = AI->second; + unsigned SCEVThreshold = VectorizeSCEVCheckThreshold; + if (Hints->getForce() == LoopVectorizeHints::FK_Enabled) + SCEVThreshold = PragmaVectorizeSCEVCheckThreshold; - // Our code motion strategy implies that we can't have dependences - // between accesses in an interleaved group and other accesses located - // between the first and last member of the group. Note that this also - // means that a group can't have more than one member at a given offset. - // The accesses in a group can have dependences with other accesses, but - // we must ensure we don't extend the boundaries of the group such that - // we encompass those dependent accesses. - // - // For example, assume we have the sequence of accesses shown below in a - // stride-2 loop: - // - // (1, 2) is a group | A[i] = a; // (1) - // | A[i-1] = b; // (2) | - // A[i-3] = c; // (3) - // A[i] = d; // (4) | (2, 4) is not a group - // - // Because accesses (2) and (3) are dependent, we can group (2) with (1) - // but not with (4). 
If we did, the dependent access (3) would be within - // the boundaries of the (2, 4) group. - if (!canReorderMemAccessesForInterleavedGroups(&*AI, &*BI)) { + if (PSE.getUnionPredicate().getComplexity() > SCEVThreshold) { + ORE->emit(createMissedAnalysis("TooManySCEVRunTimeChecks") + << "Too many SCEV assumptions need to be made and checked " + << "at runtime"); + DEBUG(dbgs() << "LV: Too many SCEV checks needed.\n"); + return false; + } - // If a dependence exists and A is already in a group, we know that A - // must be a store since A precedes B and WAR dependences are allowed. - // Thus, A would be sunk below B. We release A's group to prevent this - // illegal code motion. A will then be free to form another group with - // instructions that precede it. - if (isInterleaved(A)) { - InterleaveGroup *StoreGroup = getInterleaveGroup(A); - StoreGroups.remove(StoreGroup); - releaseGroup(StoreGroup); - } + // Okay! We can vectorize. At this point we don't have any other mem analysis + // which may limit our maximum vectorization factor, so just return true with + // no restrictions. + return true; +} - // If a dependence exists and A is not already in a group (or it was - // and we just released it), B might be hoisted above A (if B is a - // load) or another store might be sunk below A (if B is a store). In - // either case, we can't add additional instructions to B's group. B - // will only form a group with instructions that it precedes. - break; - } +static Type *convertPointerToIntegerType(const DataLayout &DL, Type *Ty) { + if (Ty->isPointerTy()) + return DL.getIntPtrType(Ty); - // At this point, we've checked for illegal code motion. If either A or B - // isn't strided, there's nothing left to do. - if (!isStrided(DesA.Stride) || !isStrided(DesB.Stride)) - continue; + // It is possible that char's or short's overflow when we ask for the loop's + // trip count, work around this by changing the type size. + if (Ty->getScalarSizeInBits() < 32) + return Type::getInt32Ty(Ty->getContext()); - // Ignore A if it's already in a group or isn't the same kind of memory - // operation as B. - if (isInterleaved(A) || A->mayReadFromMemory() != B->mayReadFromMemory()) - continue; + return Ty; +} - // Check rules 1 and 2. Ignore A if its stride or size is different from - // that of B. - if (DesA.Stride != DesB.Stride || DesA.Size != DesB.Size) - continue; +static Type *getWiderType(const DataLayout &DL, Type *Ty0, Type *Ty1) { + Ty0 = convertPointerToIntegerType(DL, Ty0); + Ty1 = convertPointerToIntegerType(DL, Ty1); + if (Ty0->getScalarSizeInBits() > Ty1->getScalarSizeInBits()) + return Ty0; + return Ty1; +} - // Calculate the distance from A to B. - const SCEVConstant *DistToB = dyn_cast( - PSE.getSE()->getMinusSCEV(DesA.Scev, DesB.Scev)); - if (!DistToB) - continue; - int64_t DistanceToB = DistToB->getAPInt().getSExtValue(); +/// \brief Check that the instruction has outside loop users and is not an +/// identified reduction variable. +static bool hasOutsideLoopUser(const Loop *TheLoop, Instruction *Inst, + SmallPtrSetImpl &AllowedExit) { + // Reduction and Induction instructions are allowed to have exit users. All + // other instructions must not have external users. + if (!AllowedExit.count(Inst)) + // Check that all of the users of the loop are inside the BB. + for (User *U : Inst->users()) { + Instruction *UI = cast(U); + // This user may be a reduction exit value. 
+ if (!TheLoop->contains(UI)) { + DEBUG(dbgs() << "LV: Found an outside user for : " << *UI << '\n'); + return true; + } + } + return false; +} - // Check rule 3. Ignore A if its distance to B is not a multiple of the - // size. - if (DistanceToB % static_cast(DesB.Size)) - continue; +void LoopVectorizationLegality::addInductionPhi( + PHINode *Phi, const InductionDescriptor &ID, + SmallPtrSetImpl &AllowedExit) { + Inductions[Phi] = ID; + Type *PhiTy = Phi->getType(); + const DataLayout &DL = Phi->getModule()->getDataLayout(); - // Ignore A if either A or B is in a predicated block. Although we - // currently prevent group formation for predicated accesses, we may be - // able to relax this limitation in the future once we handle more - // complicated blocks. - if (isPredicated(A->getParent()) || isPredicated(B->getParent())) - continue; + // Get the widest type. + if (!PhiTy->isFloatingPointTy()) { + if (!WidestIndTy) + WidestIndTy = convertPointerToIntegerType(DL, PhiTy); + else + WidestIndTy = getWiderType(DL, PhiTy, WidestIndTy); + } - // The index of A is the index of B plus A's distance to B in multiples - // of the size. - int IndexA = - Group->getIndex(B) + DistanceToB / static_cast(DesB.Size); + // Int inductions are special because we only allow one IV. + if (ID.getKind() == InductionDescriptor::IK_IntInduction && + ID.getConstIntStepValue() && + ID.getConstIntStepValue()->isOne() && + isa(ID.getStartValue()) && + cast(ID.getStartValue())->isNullValue()) { - // Try to insert A into B's group. - if (Group->insertMember(A, IndexA, DesA.Align)) { - DEBUG(dbgs() << "LV: Inserted:" << *A << '\n' - << " into the interleave group with" << *B << '\n'); - InterleaveGroupMap[A] = Group; - - // Set the first load in program order as the insert position. - if (A->mayReadFromMemory()) - Group->setInsertPos(A); - } - } // Iteration over A accesses. - } // Iteration over B accesses. - - // Remove interleaved store groups with gaps. - for (InterleaveGroup *Group : StoreGroups) - if (Group->getNumMembers() != Group->getFactor()) - releaseGroup(Group); - - // Remove interleaved groups with gaps (currently only loads) whose memory - // accesses may wrap around. We have to revisit the getPtrStride analysis, - // this time with ShouldCheckWrap=true, since collectConstStrideAccesses does - // not check wrapping (see documentation there). - // FORNOW we use Assume=false; - // TODO: Change to Assume=true but making sure we don't exceed the threshold - // of runtime SCEV assumptions checks (thereby potentially failing to - // vectorize altogether). - // Additional optional optimizations: - // TODO: If we are peeling the loop and we know that the first pointer doesn't - // wrap then we can deduce that all pointers in the group don't wrap. - // This means that we can forcefully peel the loop in order to only have to - // check the first pointer for no-wrap. When we'll change to use Assume=true - // we'll only need at most one runtime check per interleaved group. - // - for (InterleaveGroup *Group : LoadGroups) { + // Use the phi node with the widest type as induction. Use the last + // one if there are multiple (no good reason for doing this other + // than it is expedient). We've checked that it begins at zero and + // steps by one, so this is a canonical induction variable. + if (!PrimaryInduction || PhiTy == WidestIndTy) + PrimaryInduction = Phi; + } - // Case 1: A full group. 
Can Skip the checks; For full groups, if the wide - // load would wrap around the address space we would do a memory access at - // nullptr even without the transformation. - if (Group->getNumMembers() == Group->getFactor()) - continue; + // Both the PHI node itself, and the "post-increment" value feeding + // back into the PHI node may have external users. + AllowedExit.insert(Phi); + AllowedExit.insert(Phi->getIncomingValueForBlock(TheLoop->getLoopLatch())); - // Case 2: If first and last members of the group don't wrap this implies - // that all the pointers in the group don't wrap. - // So we check only group member 0 (which is always guaranteed to exist), - // and group member Factor - 1; If the latter doesn't exist we rely on - // peeling (if it is a non-reveresed accsess -- see Case 3). - Value *FirstMemberPtr = getPointerOperand(Group->getMember(0)); - if (!getPtrStride(PSE, FirstMemberPtr, TheLoop, Strides, /*Assume=*/false, - /*ShouldCheckWrap=*/true)) { - DEBUG(dbgs() << "LV: Invalidate candidate interleaved group due to " - "first group member potentially pointer-wrapping.\n"); - releaseGroup(Group); - continue; - } - Instruction *LastMember = Group->getMember(Group->getFactor() - 1); - if (LastMember) { - Value *LastMemberPtr = getPointerOperand(LastMember); - if (!getPtrStride(PSE, LastMemberPtr, TheLoop, Strides, /*Assume=*/false, - /*ShouldCheckWrap=*/true)) { - DEBUG(dbgs() << "LV: Invalidate candidate interleaved group due to " - "last group member potentially pointer-wrapping.\n"); - releaseGroup(Group); - } - } else { - // Case 3: A non-reversed interleaved load group with gaps: We need - // to execute at least one scalar epilogue iteration. This will ensure - // we don't speculatively access memory out-of-bounds. We only need - // to look for a member at index factor - 1, since every group must have - // a member at index zero. - if (Group->isReverse()) { - releaseGroup(Group); - continue; - } - DEBUG(dbgs() << "LV: Interleaved group requires epilogue iteration.\n"); - RequiresScalarEpilogue = true; - } - } + DEBUG(dbgs() << "LV: Found an induction variable.\n"); + return; } -LoopVectorizationCostModel::VectorizationFactor -LoopVectorizationCostModel::selectVectorizationFactor(bool OptForSize) { - // Width 1 means no vectorize - VectorizationFactor Factor = {1U, 0U}; - if (OptForSize && Legal->getRuntimePointerChecking()->Need) { - ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize") - << "runtime pointer checks needed. Enable vectorization of this " - "loop with '#pragma clang loop vectorize(enable)' when " - "compiling with -Os/-Oz"); - DEBUG(dbgs() - << "LV: Aborting. Runtime ptr check is required with -Os/-Oz.\n"); - return Factor; - } +bool LoopVectorizationLegality::canVectorizeInstrs() { + BasicBlock *Header = TheLoop->getHeader(); - if (!EnableCondStoresVectorization && Legal->getNumPredStores()) { - ORE->emit(createMissedAnalysis("ConditionalStore") - << "store that is conditionally executed prevents vectorization"); - DEBUG(dbgs() << "LV: No vectorization. There are conditional stores.\n"); - return Factor; - } + // Look for the attribute signaling the absence of NaNs. 
+ Function &F = *Header->getParent(); + HasFunNoNaNAttr = + F.getFnAttribute("no-nans-fp-math").getValueAsString() == "true"; - MinBWs = computeMinimumValueSizes(TheLoop->getBlocks(), *DB, &TTI); - unsigned SmallestType, WidestType; - std::tie(SmallestType, WidestType) = getSmallestAndWidestTypes(); - unsigned WidestRegister = TTI.getRegisterBitWidth(true); - unsigned MaxSafeDepDist = -1U; + // For each block in the loop. + for (BasicBlock *BB : TheLoop->blocks()) { + // Scan the instructions in the block and look for hazards. + for (Instruction &I : *BB) { + if (auto *Phi = dyn_cast(&I)) { + Type *PhiTy = Phi->getType(); + // Check that this PHI type is allowed. + if (!PhiTy->isIntegerTy() && !PhiTy->isFloatingPointTy() && + !PhiTy->isPointerTy()) { + ORE->emit(createMissedAnalysis("CFGNotUnderstood", Phi) + << "loop control flow is not understood by vectorizer"); + DEBUG(dbgs() << "LV: Found an non-int non-pointer PHI.\n"); + return false; + } - // Get the maximum safe dependence distance in bits computed by LAA. If the - // loop contains any interleaved accesses, we divide the dependence distance - // by the maximum interleave factor of all interleaved groups. Note that - // although the division ensures correctness, this is a fairly conservative - // computation because the maximum distance computed by LAA may not involve - // any of the interleaved accesses. - if (Legal->getMaxSafeDepDistBytes() != -1U) - MaxSafeDepDist = - Legal->getMaxSafeDepDistBytes() * 8 / Legal->getMaxInterleaveFactor(); + // If this PHINode is not in the header block, then we know that we + // can convert it to select during if-conversion. No need to check if + // the PHIs in this block are induction or reduction variables. + if (BB != Header) { + // Check that this instruction has no outside users or is an + // identified reduction value with an outside user. + if (!hasOutsideLoopUser(TheLoop, Phi, AllowedExit)) + continue; + ORE->emit(createMissedAnalysis("NeitherInductionNorReduction", Phi) + << "value could not be identified as " + "an induction or reduction variable"); + return false; + } - WidestRegister = - ((WidestRegister < MaxSafeDepDist) ? WidestRegister : MaxSafeDepDist); - unsigned MaxVectorSize = WidestRegister / WidestType; + // We only allow if-converted PHIs with exactly two incoming values. 
+ if (Phi->getNumIncomingValues() != 2) { + ORE->emit(createMissedAnalysis("CFGNotUnderstood", Phi) + << "control flow not understood by vectorizer"); + DEBUG(dbgs() << "LV: Found an invalid PHI.\n"); + return false; + } - DEBUG(dbgs() << "LV: The Smallest and Widest types: " << SmallestType << " / " - << WidestType << " bits.\n"); - DEBUG(dbgs() << "LV: The Widest register is: " << WidestRegister - << " bits.\n"); + RecurrenceDescriptor RedDes; + if (RecurrenceDescriptor::isReductionPHI(Phi, TheLoop, RedDes)) { + if (RedDes.hasUnsafeAlgebra()) + Requirements->addUnsafeAlgebraInst(RedDes.getUnsafeAlgebraInst()); + AllowedExit.insert(RedDes.getLoopExitInstr()); + Reductions[Phi] = RedDes; + continue; + } - if (MaxVectorSize == 0) { - DEBUG(dbgs() << "LV: The target has no vector registers.\n"); - MaxVectorSize = 1; - } + InductionDescriptor ID; + if (InductionDescriptor::isInductionPHI(Phi, TheLoop, PSE, ID)) { + addInductionPhi(Phi, ID, AllowedExit); + if (ID.hasUnsafeAlgebra() && !HasFunNoNaNAttr) + Requirements->addUnsafeAlgebraInst(ID.getUnsafeAlgebraInst()); + continue; + } - assert(MaxVectorSize <= 64 && "Did not expect to pack so many elements" - " into one vector!"); + if (RecurrenceDescriptor::isFirstOrderRecurrence(Phi, TheLoop, DT)) { + FirstOrderRecurrences.insert(Phi); + continue; + } - unsigned VF = MaxVectorSize; - if (MaximizeBandwidth && !OptForSize) { - // Collect all viable vectorization factors. - SmallVector VFs; - unsigned NewMaxVectorSize = WidestRegister / SmallestType; - for (unsigned VS = MaxVectorSize; VS <= NewMaxVectorSize; VS *= 2) - VFs.push_back(VS); + // As a last resort, coerce the PHI to a AddRec expression + // and re-try classifying it a an induction PHI. + if (InductionDescriptor::isInductionPHI(Phi, TheLoop, PSE, ID, true)) { + addInductionPhi(Phi, ID, AllowedExit); + continue; + } - // For each VF calculate its register usage. - auto RUs = calculateRegisterUsage(VFs); + ORE->emit(createMissedAnalysis("NonReductionValueUsedOutsideLoop", Phi) + << "value that could not be identified as " + "reduction is used outside the loop"); + DEBUG(dbgs() << "LV: Found an unidentified PHI." << *Phi << "\n"); + return false; + } // end of PHI handling - // Select the largest VF which doesn't require more registers than existing - // ones. - unsigned TargetNumRegisters = TTI.getNumberOfRegisters(true); - for (int i = RUs.size() - 1; i >= 0; --i) { - if (RUs[i].MaxLocalUsers <= TargetNumRegisters) { - VF = VFs[i]; - break; + // We handle calls that: + // * Are debug info intrinsics. + // * Have a mapping to an IR intrinsic. + // * Have a vector version available. + auto *CI = dyn_cast(&I); + if (CI && !getVectorIntrinsicIDForCall(CI, TLI) && + !isa(CI) && + !(CI->getCalledFunction() && TLI && + TLI->isFunctionVectorizable(CI->getCalledFunction()->getName()))) { + ORE->emit(createMissedAnalysis("CantVectorizeCall", CI) + << "call instruction cannot be vectorized"); + DEBUG(dbgs() << "LV: Found a non-intrinsic, non-libfunc callsite.\n"); + return false; } - } - } - - // If we optimize the program for size, avoid creating the tail loop. - if (OptForSize) { - unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop); - DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n'); - - // If we don't know the precise trip count, don't try to vectorize. - if (TC < 2) { - ORE->emit( - createMissedAnalysis("UnknownLoopCountComplexCFG") - << "unable to calculate the loop count due to complex control flow"); - DEBUG(dbgs() << "LV: Aborting. 
A tail loop is required with -Os/-Oz.\n"); - return Factor; - } - // Find the maximum SIMD width that can fit within the trip count. - VF = TC % MaxVectorSize; + // Intrinsics such as powi,cttz and ctlz are legal to vectorize if the + // second argument is the same (i.e. loop invariant) + if (CI && hasVectorInstrinsicScalarOpd( + getVectorIntrinsicIDForCall(CI, TLI), 1)) { + auto *SE = PSE.getSE(); + if (!SE->isLoopInvariant(PSE.getSCEV(CI->getOperand(1)), TheLoop)) { + ORE->emit(createMissedAnalysis("CantVectorizeIntrinsic", CI) + << "intrinsic instruction cannot be vectorized"); + DEBUG(dbgs() << "LV: Found unvectorizable intrinsic " << *CI << "\n"); + return false; + } + } - if (VF == 0) - VF = MaxVectorSize; - else { - // If the trip count that we found modulo the vectorization factor is not - // zero then we require a tail. - ORE->emit(createMissedAnalysis("NoTailLoopWithOptForSize") - << "cannot optimize for size and vectorize at the " - "same time. Enable vectorization of this loop " - "with '#pragma clang loop vectorize(enable)' " - "when compiling with -Os/-Oz"); - DEBUG(dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n"); - return Factor; - } - } - - int UserVF = Hints->getWidth(); - if (UserVF != 0) { - assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two"); - DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n"); + // Check that the instruction return type is vectorizable. + // Also, we can't vectorize extractelement instructions. + if ((!VectorType::isValidElementType(I.getType()) && + !I.getType()->isVoidTy()) || + isa(I)) { + ORE->emit(createMissedAnalysis("CantVectorizeInstructionReturnType", &I) + << "instruction return type cannot be vectorized"); + DEBUG(dbgs() << "LV: Found unvectorizable type.\n"); + return false; + } - Factor.Width = UserVF; + // Check that the stored type is vectorizable. + if (auto *ST = dyn_cast(&I)) { + Type *T = ST->getValueOperand()->getType(); + if (!VectorType::isValidElementType(T)) { + ORE->emit(createMissedAnalysis("CantVectorizeStore", ST) + << "store instruction cannot be vectorized"); + return false; + } - collectUniformsAndScalars(UserVF); - collectInstsToScalarize(UserVF); - return Factor; - } + // FP instructions can allow unsafe algebra, thus vectorizable by + // non-IEEE-754 compliant SIMD units. + // This applies to floating-point math operations and calls, not memory + // operations, shuffles, or casts, as they don't change precision or + // semantics. + } else if (I.getType()->isFloatingPointTy() && (CI || I.isBinaryOp()) && + !I.hasUnsafeAlgebra()) { + DEBUG(dbgs() << "LV: Found FP op with unsafe algebra.\n"); + Hints->setPotentiallyUnsafe(); + } - float Cost = expectedCost(1).first; -#ifndef NDEBUG - const float ScalarCost = Cost; -#endif /* NDEBUG */ - unsigned Width = 1; - DEBUG(dbgs() << "LV: Scalar loop costs: " << (int)ScalarCost << ".\n"); + // Reduction instructions are allowed to have exit users. + // All other instructions must not have external users. + if (hasOutsideLoopUser(TheLoop, &I, AllowedExit)) { + ORE->emit(createMissedAnalysis("ValueUsedOutsideLoop", &I) + << "value cannot be used outside the loop"); + return false; + } - bool ForceVectorization = Hints->getForce() == LoopVectorizeHints::FK_Enabled; - // Ignore scalar width, because the user explicitly wants vectorization. - if (ForceVectorization && VF > 1) { - Width = 2; - Cost = expectedCost(Width).first / (float)Width; + } // next instr. 
} - for (unsigned i = 2; i <= VF; i *= 2) { - // Notice that the vector loop needs to be executed less times, so - // we need to divide the cost of the vector loops by the width of - // the vector elements. - VectorizationCostTy C = expectedCost(i); - float VectorCost = C.first / (float)i; - DEBUG(dbgs() << "LV: Vector loop of width " << i - << " costs: " << (int)VectorCost << ".\n"); - if (!C.second && !ForceVectorization) { - DEBUG( - dbgs() << "LV: Not considering vector loop of width " << i - << " because it will not generate any vector instructions.\n"); - continue; - } - if (VectorCost < Cost) { - Cost = VectorCost; - Width = i; + if (!PrimaryInduction) { + DEBUG(dbgs() << "LV: Did not find one integer induction var.\n"); + if (Inductions.empty()) { + ORE->emit(createMissedAnalysis("NoInductionVariable") + << "loop induction variable could not be identified"); + return false; } } - DEBUG(if (ForceVectorization && Width > 1 && Cost >= ScalarCost) dbgs() - << "LV: Vectorization seems to be not beneficial, " - << "but was forced by a user.\n"); - DEBUG(dbgs() << "LV: Selecting VF: " << Width << ".\n"); - Factor.Width = Width; - Factor.Cost = Width * Cost; - return Factor; + // Now we know the widest induction type, check if our found induction + // is the same size. If it's not, unset it here and InnerLoopVectorizer + // will create another. + if (PrimaryInduction && WidestIndTy != PrimaryInduction->getType()) + PrimaryInduction = nullptr; + + return true; } -std::pair -LoopVectorizationCostModel::getSmallestAndWidestTypes() { - unsigned MinWidth = -1U; - unsigned MaxWidth = 8; - const DataLayout &DL = TheFunction->getParent()->getDataLayout(); +void LoopVectorizationCostModel::collectLoopScalars(unsigned VF) { - // For each block. - for (BasicBlock *BB : TheLoop->blocks()) { - // For each instruction in the loop. - for (Instruction &I : *BB) { - Type *T = I.getType(); + // We should not collect Scalars more than once per VF. Right now, + // this function is called from collectUniformsAndScalars(), which + // already does this check. Collecting Scalars for VF=1 does not make any + // sense. - // Skip ignored values. - if (ValuesToIgnore.count(&I)) - continue; + assert(VF >= 2 && !Scalars.count(VF) && + "This function should not be visited twice for the same VF"); - // Only examine Loads, Stores and PHINodes. - if (!isa(I) && !isa(I) && !isa(I)) - continue; + // If an instruction is uniform after vectorization, it will remain scalar. + Scalars[VF].insert(Uniforms[VF].begin(), Uniforms[VF].end()); - // Examine PHI nodes that are reduction variables. Update the type to - // account for the recurrence type. - if (auto *PN = dyn_cast(&I)) { - if (!Legal->isReductionVariable(PN)) - continue; - RecurrenceDescriptor RdxDesc = (*Legal->getReductionVars())[PN]; - T = RdxDesc.getRecurrenceType(); + // Collect the getelementptr instructions that will not be vectorized. A + // getelementptr instruction is only vectorized if it is used for a legal + // gather or scatter operation. + for (auto *BB : TheLoop->blocks()) + for (auto &I : *BB) { + if (auto *GEP = dyn_cast(&I)) { + Scalars[VF].insert(GEP); + continue; } + auto *Ptr = getPointerOperand(&I); + if (!Ptr) + continue; + auto *GEP = getGEPInstruction(Ptr); + if (GEP && getWideningDecision(&I, VF) == CM_GatherScatter) + Scalars[VF].erase(GEP); + } - // Examine the stored values. 
- if (auto *ST = dyn_cast(&I)) - T = ST->getValueOperand()->getType(); + // An induction variable will remain scalar if all users of the induction + // variable and induction variable update remain scalar. + auto *Latch = TheLoop->getLoopLatch(); + for (auto &Induction : *Legal->getInductionVars()) { + auto *Ind = Induction.first; + auto *IndUpdate = cast(Ind->getIncomingValueForBlock(Latch)); - // Ignore loaded pointer types and stored pointer types that are not - // consecutive. However, we do want to take consecutive stores/loads of - // pointer vectors into account. - if (T->isPointerTy() && !isConsecutiveLoadOrStore(&I)) - continue; + // Determine if all users of the induction variable are scalar after + // vectorization. + auto ScalarInd = all_of(Ind->users(), [&](User *U) -> bool { + auto *I = cast(U); + return I == IndUpdate || !TheLoop->contains(I) || Scalars[VF].count(I); + }); + if (!ScalarInd) + continue; - MinWidth = std::min(MinWidth, - (unsigned)DL.getTypeSizeInBits(T->getScalarType())); - MaxWidth = std::max(MaxWidth, - (unsigned)DL.getTypeSizeInBits(T->getScalarType())); - } + // Determine if all users of the induction variable update instruction are + // scalar after vectorization. + auto ScalarIndUpdate = all_of(IndUpdate->users(), [&](User *U) -> bool { + auto *I = cast(U); + return I == Ind || !TheLoop->contains(I) || Scalars[VF].count(I); + }); + if (!ScalarIndUpdate) + continue; + + // The induction variable and its update instruction will remain scalar. + Scalars[VF].insert(Ind); + Scalars[VF].insert(IndUpdate); } +} - return {MinWidth, MaxWidth}; +bool LoopVectorizationLegality::isScalarWithPredication(Instruction *I) { + if (!blockNeedsPredication(I->getParent())) + return false; + switch(I->getOpcode()) { + default: + break; + case Instruction::Store: + return !isMaskRequired(I); + case Instruction::UDiv: + case Instruction::SDiv: + case Instruction::SRem: + case Instruction::URem: + return mayDivideByZero(*I); + } + return false; } -unsigned LoopVectorizationCostModel::selectInterleaveCount(bool OptForSize, - unsigned VF, - unsigned LoopCost) { +bool LoopVectorizationLegality::memoryInstructionCanBeWidened(Instruction *I, + unsigned VF) { + // Get and ensure we have a valid memory instruction. + LoadInst *LI = dyn_cast(I); + StoreInst *SI = dyn_cast(I); + assert((LI || SI) && "Invalid memory instruction"); - // -- The interleave heuristics -- - // We interleave the loop in order to expose ILP and reduce the loop overhead. - // There are many micro-architectural considerations that we can't predict - // at this level. For example, frontend pressure (on decode or fetch) due to - // code size, or the number and capabilities of the execution ports. - // - // We use the following heuristics to select the interleave count: - // 1. If the code has reductions, then we interleave to break the cross - // iteration dependency. - // 2. If the loop is really small, then we interleave to reduce the loop - // overhead. - // 3. We don't interleave if we think that we will spill registers to memory - // due to the increased register pressure. + auto *Ptr = getPointerOperand(I); - // When we optimize for size, we don't interleave. - if (OptForSize) - return 1; + // In order to be widened, the pointer should be consecutive, first of all. + if (!isConsecutivePtr(Ptr)) + return false; - // We used the distance for the interleave count. 
- if (Legal->getMaxSafeDepDistBytes() != -1U) - return 1; + // If the instruction is a store located in a predicated block, it will be + // scalarized. + if (isScalarWithPredication(I)) + return false; - // Do not interleave loops with a relatively small trip count. - unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop); - if (TC > 1 && TC < TinyTripCountInterleaveThreshold) - return 1; + // If the instruction's allocated size doesn't equal it's type size, it + // requires padding and will be scalarized. + auto &DL = I->getModule()->getDataLayout(); + auto *ScalarTy = LI ? LI->getType() : SI->getValueOperand()->getType(); + if (hasIrregularType(ScalarTy, DL, VF)) + return false; - unsigned TargetNumRegisters = TTI.getNumberOfRegisters(VF > 1); - DEBUG(dbgs() << "LV: The target has " << TargetNumRegisters - << " registers\n"); + return true; +} - if (VF == 1) { - if (ForceTargetNumScalarRegs.getNumOccurrences() > 0) - TargetNumRegisters = ForceTargetNumScalarRegs; - } else { - if (ForceTargetNumVectorRegs.getNumOccurrences() > 0) - TargetNumRegisters = ForceTargetNumVectorRegs; - } +void LoopVectorizationCostModel::collectLoopUniforms(unsigned VF) { - RegisterUsage R = calculateRegisterUsage({VF})[0]; - // We divide by these constants so assume that we have at least one - // instruction that uses at least one register. - R.MaxLocalUsers = std::max(R.MaxLocalUsers, 1U); - R.NumInstructions = std::max(R.NumInstructions, 1U); - - // We calculate the interleave count using the following formula. - // Subtract the number of loop invariants from the number of available - // registers. These registers are used by all of the interleaved instances. - // Next, divide the remaining registers by the number of registers that is - // required by the loop, in order to estimate how many parallel instances - // fit without causing spills. All of this is rounded down if necessary to be - // a power of two. We want power of two interleave count to simplify any - // addressing operations or alignment considerations. - unsigned IC = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs) / - R.MaxLocalUsers); + // We should not collect Uniforms more than once per VF. Right now, + // this function is called from collectUniformsAndScalars(), which + // already does this check. Collecting Uniforms for VF=1 does not make any + // sense. - // Don't count the induction variable as interleaved. - if (EnableIndVarRegisterHeur) - IC = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs - 1) / - std::max(1U, (R.MaxLocalUsers - 1))); + assert(VF >= 2 && !Uniforms.count(VF) && + "This function should not be visited twice for the same VF"); - // Clamp the interleave ranges to reasonable counts. - unsigned MaxInterleaveCount = TTI.getMaxInterleaveFactor(VF); + // Visit the list of Uniforms. If we'll not find any uniform value, we'll + // not analyze again. Uniforms.count(VF) will return 1. + Uniforms[VF].clear(); - // Check if the user has overridden the max. - if (VF == 1) { - if (ForceTargetMaxScalarInterleaveFactor.getNumOccurrences() > 0) - MaxInterleaveCount = ForceTargetMaxScalarInterleaveFactor; - } else { - if (ForceTargetMaxVectorInterleaveFactor.getNumOccurrences() > 0) - MaxInterleaveCount = ForceTargetMaxVectorInterleaveFactor; - } + // We now know that the loop is vectorizable! + // Collect instructions inside the loop that will remain uniform after + // vectorization. - // If we did not calculate the cost for VF (because the user selected the VF) - // then we calculate the cost of VF here. 
- if (LoopCost == 0) - LoopCost = expectedCost(VF).first; + // Global values, params and instructions outside of current loop are out of + // scope. + auto isOutOfScope = [&](Value *V) -> bool { + Instruction *I = dyn_cast(V); + return (!I || !TheLoop->contains(I)); + }; - // Clamp the calculated IC to be between the 1 and the max interleave count - // that the target allows. - if (IC > MaxInterleaveCount) - IC = MaxInterleaveCount; - else if (IC < 1) - IC = 1; + SetVector Worklist; + BasicBlock *Latch = TheLoop->getLoopLatch(); - // Interleave if we vectorized this loop and there is a reduction that could - // benefit from interleaving. - if (VF > 1 && Legal->getReductionVars()->size()) { - DEBUG(dbgs() << "LV: Interleaving because of reductions.\n"); - return IC; + // Start with the conditional branch. If the branch condition is an + // instruction contained in the loop that is only used by the branch, it is + // uniform. + auto *Cmp = dyn_cast(Latch->getTerminator()->getOperand(0)); + if (Cmp && TheLoop->contains(Cmp) && Cmp->hasOneUse()) { + Worklist.insert(Cmp); + DEBUG(dbgs() << "LV: Found uniform instruction: " << *Cmp << "\n"); } - // Note that if we've already vectorized the loop we will have done the - // runtime check and so interleaving won't require further checks. - bool InterleavingRequiresRuntimePointerCheck = - (VF == 1 && Legal->getRuntimePointerChecking()->Need); - - // We want to interleave small loops in order to reduce the loop overhead and - // potentially expose ILP opportunities. - DEBUG(dbgs() << "LV: Loop cost is " << LoopCost << '\n'); - if (!InterleavingRequiresRuntimePointerCheck && LoopCost < SmallLoopCost) { - // We assume that the cost overhead is 1 and we use the cost model - // to estimate the cost of the loop and interleave until the cost of the - // loop overhead is about 5% of the cost of the loop. - unsigned SmallIC = - std::min(IC, (unsigned)PowerOf2Floor(SmallLoopCost / LoopCost)); - - // Interleave until store/load ports (estimated by max interleave count) are - // saturated. - unsigned NumStores = Legal->getNumStores(); - unsigned NumLoads = Legal->getNumLoads(); - unsigned StoresIC = IC / (NumStores ? NumStores : 1); - unsigned LoadsIC = IC / (NumLoads ? NumLoads : 1); - - // If we have a scalar reduction (vector reductions are already dealt with - // by this point), we can increase the critical path length if the loop - // we're interleaving is inside another loop. Limit, by default to 2, so the - // critical path only gets increased by one reduction operation. - if (Legal->getReductionVars()->size() && TheLoop->getLoopDepth() > 1) { - unsigned F = static_cast(MaxNestedScalarReductionIC); - SmallIC = std::min(SmallIC, F); - StoresIC = std::min(StoresIC, F); - LoadsIC = std::min(LoadsIC, F); - } - - if (EnableLoadStoreRuntimeInterleave && - std::max(StoresIC, LoadsIC) > SmallIC) { - DEBUG(dbgs() << "LV: Interleaving to saturate store or load ports.\n"); - return std::max(StoresIC, LoadsIC); - } - - DEBUG(dbgs() << "LV: Interleaving to reduce branch cost.\n"); - return SmallIC; - } + // Holds consecutive and consecutive-like pointers. Consecutive-like pointers + // are pointers that are treated like consecutive pointers during + // vectorization. The pointer operands of interleaved accesses are an + // example. + SmallSetVector ConsecutiveLikePtrs; - // Interleave if this is a large loop (small loops are already dealt with by - // this point) that could benefit from interleaving. 
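// --- Editor's note (illustrative sketch, not part of the patch) ------------
// The interleave count computed earlier in selectInterleaveCount() boils down
// to PowerOf2Floor((TargetNumRegisters - LoopInvariantRegs) / MaxLocalUsers).
// The helper below only works that formula through with assumed numbers (16
// vector registers, 2 loop-invariant values, 5 simultaneously live in-loop
// values); none of these values come from TTI or the register-usage analysis.
namespace {
inline unsigned exampleInterleaveCount() {
  unsigned TargetNumRegisters = 16; // assumed number of vector registers
  unsigned LoopInvariantRegs = 2;   // assumed loop-invariant live values
  unsigned MaxLocalUsers = 5;       // assumed peak of in-loop live values
  unsigned IC = (TargetNumRegisters - LoopInvariantRegs) / MaxLocalUsers; // 2
  // Round down to a power of two, mirroring the PowerOf2Floor() call above.
  while (IC & (IC - 1))
    IC &= IC - 1;
  return IC; // 2: the loop would be interleaved twice, before the small-loop
             // and reduction heuristics below adjust it further.
}
} // end anonymous namespace
// ----------------------------------------------------------------------------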
- bool HasReductions = (Legal->getReductionVars()->size() > 0); - if (TTI.enableAggressiveInterleaving(HasReductions)) { - DEBUG(dbgs() << "LV: Interleaving to expose ILP.\n"); - return IC; - } + // Holds pointer operands of instructions that are possibly non-uniform. + SmallPtrSet PossibleNonUniformPtrs; - DEBUG(dbgs() << "LV: Not Interleaving.\n"); - return 1; -} + auto isUniformDecision = [&](Instruction *I, unsigned VF) { + InstWidening WideningDecision = getWideningDecision(I, VF); + assert(WideningDecision != CM_Unknown && + "Widening decision should be ready at this moment"); -SmallVector -LoopVectorizationCostModel::calculateRegisterUsage(ArrayRef VFs) { - // This function calculates the register usage by measuring the highest number - // of values that are alive at a single location. Obviously, this is a very - // rough estimation. We scan the loop in a topological order in order and - // assign a number to each instruction. We use RPO to ensure that defs are - // met before their users. We assume that each instruction that has in-loop - // users starts an interval. We record every time that an in-loop value is - // used, so we have a list of the first and last occurrences of each - // instruction. Next, we transpose this data structure into a multi map that - // holds the list of intervals that *end* at a specific location. This multi - // map allows us to perform a linear search. We scan the instructions linearly - // and record each time that a new interval starts, by placing it in a set. - // If we find this value in the multi-map then we remove it from the set. - // The max register usage is the maximum size of the set. - // We also search for instructions that are defined outside the loop, but are - // used inside the loop. We need this number separately from the max-interval - // usage number because when we unroll, loop-invariant values do not take - // more register. - LoopBlocksDFS DFS(TheLoop); - DFS.perform(LI); + return (WideningDecision == CM_Widen || + WideningDecision == CM_Interleave); + }; + // Iterate over the instructions in the loop, and collect all + // consecutive-like pointer operands in ConsecutiveLikePtrs. If it's possible + // that a consecutive-like pointer operand will be scalarized, we collect it + // in PossibleNonUniformPtrs instead. We use two sets here because a single + // getelementptr instruction can be used by both vectorized and scalarized + // memory instructions. For example, if a loop loads and stores from the same + // location, but the store is conditional, the store will be scalarized, and + // the getelementptr won't remain uniform. + for (auto *BB : TheLoop->blocks()) + for (auto &I : *BB) { - RegisterUsage RU; - RU.NumInstructions = 0; + // If there's no pointer operand, there's nothing to do. + auto *Ptr = dyn_cast_or_null(getPointerOperand(&I)); + if (!Ptr) + continue; - // Each 'key' in the map opens a new interval. The values - // of the map are the index of the 'last seen' usage of the - // instruction that is the key. - typedef DenseMap IntervalMap; - // Maps instruction to its index. - DenseMap IdxToInstr; - // Marks the end of each interval. - IntervalMap EndPoint; - // Saves the list of instruction indices that are used in the loop. - SmallSet Ends; - // Saves the list of values that are used in the loop but are - // defined outside the loop, such as arguments and constants. - SmallPtrSet LoopInvariants; + // True if all users of Ptr are memory accesses that have Ptr as their + // pointer operand. 
+ auto UsersAreMemAccesses = all_of(Ptr->users(), [&](User *U) -> bool { + return getPointerOperand(U) == Ptr; + }); - unsigned Index = 0; - for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) { - RU.NumInstructions += BB->size(); - for (Instruction &I : *BB) { - IdxToInstr[Index++] = &I; + // Ensure the memory instruction will not be scalarized or used by + // gather/scatter, making its pointer operand non-uniform. If the pointer + // operand is used by any instruction other than a memory access, we + // conservatively assume the pointer operand may be non-uniform. + if (!UsersAreMemAccesses || !isUniformDecision(&I, VF)) + PossibleNonUniformPtrs.insert(Ptr); - // Save the end location of each USE. - for (Value *U : I.operands()) { - auto *Instr = dyn_cast(U); + // If the memory instruction will be vectorized and its pointer operand + // is consecutive-like, or interleaving - the pointer operand should + // remain uniform. + else + ConsecutiveLikePtrs.insert(Ptr); + } - // Ignore non-instruction values such as arguments, constants, etc. - if (!Instr) - continue; + // Add to the Worklist all consecutive and consecutive-like pointers that + // aren't also identified as possibly non-uniform. + for (auto *V : ConsecutiveLikePtrs) + if (!PossibleNonUniformPtrs.count(V)) { + DEBUG(dbgs() << "LV: Found uniform instruction: " << *V << "\n"); + Worklist.insert(V); + } - // If this instruction is outside the loop then record it and continue. - if (!TheLoop->contains(Instr)) { - LoopInvariants.insert(Instr); - continue; - } + // Expand Worklist in topological order: whenever a new instruction + // is added , its users should be either already inside Worklist, or + // out of scope. It ensures a uniform instruction will only be used + // by uniform instructions or out of scope instructions. + unsigned idx = 0; + while (idx != Worklist.size()) { + Instruction *I = Worklist[idx++]; - // Overwrite previous end points. - EndPoint[Instr] = Index; - Ends.insert(Instr); + for (auto OV : I->operand_values()) { + if (isOutOfScope(OV)) + continue; + auto *OI = cast(OV); + if (all_of(OI->users(), [&](User *U) -> bool { + return isOutOfScope(U) || Worklist.count(cast(U)); + })) { + Worklist.insert(OI); + DEBUG(dbgs() << "LV: Found uniform instruction: " << *OI << "\n"); } } } - // Saves the list of intervals that end with the index in 'key'. - typedef SmallVector InstrList; - DenseMap TransposeEnds; - - // Transpose the EndPoints to a list of values that end at each index. - for (auto &Interval : EndPoint) - TransposeEnds[Interval.second].push_back(Interval.first); - - SmallSet OpenIntervals; + // Returns true if Ptr is the pointer operand of a memory access instruction + // I, and I is known to not require scalarization. + auto isVectorizedMemAccessUse = [&](Instruction *I, Value *Ptr) -> bool { + return getPointerOperand(I) == Ptr && isUniformDecision(I, VF); + }; - // Get the size of the widest register. - unsigned MaxSafeDepDist = -1U; - if (Legal->getMaxSafeDepDistBytes() != -1U) - MaxSafeDepDist = Legal->getMaxSafeDepDistBytes() * 8; - unsigned WidestRegister = - std::min(TTI.getRegisterBitWidth(true), MaxSafeDepDist); - const DataLayout &DL = TheFunction->getParent()->getDataLayout(); - - SmallVector RUs(VFs.size()); - SmallVector MaxUsages(VFs.size(), 0); - - DEBUG(dbgs() << "LV(REG): Calculating max register usage:\n"); - - // A lambda that gets the register usage for the given type and VF. 
- auto GetRegUsage = [&DL, WidestRegister](Type *Ty, unsigned VF) { - if (Ty->isTokenTy()) - return 0U; - unsigned TypeSize = DL.getTypeSizeInBits(Ty->getScalarType()); - return std::max(1, VF * TypeSize / WidestRegister); - }; - - for (unsigned int i = 0; i < Index; ++i) { - Instruction *I = IdxToInstr[i]; - - // Remove all of the instructions that end at this location. - InstrList &List = TransposeEnds[i]; - for (Instruction *ToRemove : List) - OpenIntervals.erase(ToRemove); + // For an instruction to be added into Worklist above, all its users inside + // the loop should also be in Worklist. However, this condition cannot be + // true for phi nodes that form a cyclic dependence. We must process phi + // nodes separately. An induction variable will remain uniform if all users + // of the induction variable and induction variable update remain uniform. + // The code below handles both pointer and non-pointer induction variables. + for (auto &Induction : *Legal->getInductionVars()) { + auto *Ind = Induction.first; + auto *IndUpdate = cast(Ind->getIncomingValueForBlock(Latch)); - // Ignore instructions that are never used within the loop. - if (!Ends.count(I)) + // Determine if all users of the induction variable are uniform after + // vectorization. + auto UniformInd = all_of(Ind->users(), [&](User *U) -> bool { + auto *I = cast(U); + return I == IndUpdate || !TheLoop->contains(I) || Worklist.count(I) || + isVectorizedMemAccessUse(I, Ind); + }); + if (!UniformInd) continue; - // Skip ignored values. - if (ValuesToIgnore.count(I)) + // Determine if all users of the induction variable update instruction are + // uniform after vectorization. + auto UniformIndUpdate = all_of(IndUpdate->users(), [&](User *U) -> bool { + auto *I = cast(U); + return I == Ind || !TheLoop->contains(I) || Worklist.count(I) || + isVectorizedMemAccessUse(I, IndUpdate); + }); + if (!UniformIndUpdate) continue; - // For each VF find the maximum usage of registers. - for (unsigned j = 0, e = VFs.size(); j < e; ++j) { - if (VFs[j] == 1) { - MaxUsages[j] = std::max(MaxUsages[j], OpenIntervals.size()); - continue; - } - collectUniformsAndScalars(VFs[j]); - // Count the number of live intervals. - unsigned RegUsage = 0; - for (auto Inst : OpenIntervals) { - // Skip ignored values for VF > 1. - if (VecValuesToIgnore.count(Inst) || - isScalarAfterVectorization(Inst, VFs[j])) - continue; - RegUsage += GetRegUsage(Inst->getType(), VFs[j]); - } - MaxUsages[j] = std::max(MaxUsages[j], RegUsage); - } - - DEBUG(dbgs() << "LV(REG): At #" << i << " Interval # " - << OpenIntervals.size() << '\n'); - - // Add the current instruction to the list of open intervals. - OpenIntervals.insert(I); + // The induction variable and its update instruction will remain uniform. 
+ Worklist.insert(Ind); + Worklist.insert(IndUpdate); + DEBUG(dbgs() << "LV: Found uniform instruction: " << *Ind << "\n"); + DEBUG(dbgs() << "LV: Found uniform instruction: " << *IndUpdate << "\n"); } - for (unsigned i = 0, e = VFs.size(); i < e; ++i) { - unsigned Invariant = 0; - if (VFs[i] == 1) - Invariant = LoopInvariants.size(); - else { - for (auto Inst : LoopInvariants) - Invariant += GetRegUsage(Inst->getType(), VFs[i]); - } + Uniforms[VF].insert(Worklist.begin(), Worklist.end()); +} - DEBUG(dbgs() << "LV(REG): VF = " << VFs[i] << '\n'); - DEBUG(dbgs() << "LV(REG): Found max usage: " << MaxUsages[i] << '\n'); - DEBUG(dbgs() << "LV(REG): Found invariant usage: " << Invariant << '\n'); - DEBUG(dbgs() << "LV(REG): LoopSize: " << RU.NumInstructions << '\n'); +bool LoopVectorizationLegality::canVectorizeMemory() { + LAI = &(*GetLAA)(*TheLoop); + InterleaveInfo.setLAI(LAI); + const OptimizationRemarkAnalysis *LAR = LAI->getReport(); + if (LAR) { + OptimizationRemarkAnalysis VR(Hints->vectorizeAnalysisPassName(), + "loop not vectorized: ", *LAR); + ORE->emit(VR); + } + if (!LAI->canVectorizeMemory()) + return false; - RU.LoopInvariantRegs = Invariant; - RU.MaxLocalUsers = MaxUsages[i]; - RUs[i] = RU; + if (LAI->hasStoreToLoopInvariantAddress()) { + ORE->emit(createMissedAnalysis("CantVectorizeStoreToLoopInvariantAddress") + << "write to a loop invariant address could not be vectorized"); + DEBUG(dbgs() << "LV: We don't allow storing to uniform addresses\n"); + return false; } - return RUs; + Requirements->addRuntimePointerChecks(LAI->getNumRuntimePointerChecks()); + PSE.addPredicate(LAI->getPSE().getUnionPredicate()); + + return true; } -void LoopVectorizationCostModel::collectInstsToScalarize(unsigned VF) { +bool LoopVectorizationLegality::isInductionVariable(const Value *V) { + Value *In0 = const_cast(V); + PHINode *PN = dyn_cast_or_null(In0); + if (!PN) + return false; - // If we aren't vectorizing the loop, or if we've already collected the - // instructions to scalarize, there's nothing to do. Collection may already - // have occurred if we have a user-selected VF and are now computing the - // expected cost for interleaving. - if (VF < 2 || InstsToScalarize.count(VF)) - return; + return Inductions.count(PN); +} - // Initialize a mapping for VF in InstsToScalalarize. If we find that it's - // not profitable to scalarize any instructions, the presence of VF in the - // map will indicate that we've analyzed it already. - ScalarCostsTy &ScalarCostsVF = InstsToScalarize[VF]; +bool LoopVectorizationLegality::isFirstOrderRecurrence(const PHINode *Phi) { + return FirstOrderRecurrences.count(Phi); +} - // Find all the instructions that are scalar with predication in the loop and - // determine if it would be better to not if-convert the blocks they are in. - // If so, we also record the instructions to scalarize. 
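// ---------------------------------------------------------------------------
// An illustrative sketch of the uniformity worklist expansion above: an
// instruction is marked uniform only once every in-loop user of it is already
// known to be uniform (or is out of scope). This standalone toy models that
// fixed-point step over a use graph with plain STL containers; the names
// NameGraph, propagateUniforms, OperandsOf and UsersOf are invented for this
// example and are not LLVM APIs.
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using NameGraph = std::unordered_map<std::string, std::vector<std::string>>;

// Worklist holds values already known to be uniform (e.g. consecutive-like
// pointers). A value outside the seed set is added only when all of its
// in-loop users are uniform or out of scope.
std::unordered_set<std::string>
propagateUniforms(const NameGraph &OperandsOf, const NameGraph &UsersOf,
                  const std::unordered_set<std::string> &InLoop,
                  std::vector<std::string> Worklist) {
  std::unordered_set<std::string> Uniform(Worklist.begin(), Worklist.end());
  for (std::size_t Idx = 0; Idx != Worklist.size(); ++Idx) {
    const std::string Cur = Worklist[Idx];
    auto OpsIt = OperandsOf.find(Cur);
    if (OpsIt == OperandsOf.end())
      continue;
    for (const std::string &Op : OpsIt->second) {
      if (!InLoop.count(Op) || Uniform.count(Op))
        continue; // Out of scope, or already known to be uniform.
      auto UseIt = UsersOf.find(Op);
      bool AllUsersUniform =
          UseIt == UsersOf.end() ||
          std::all_of(UseIt->second.begin(), UseIt->second.end(),
                      [&](const std::string &U) {
                        return !InLoop.count(U) || Uniform.count(U);
                      });
      if (AllUsersUniform) {
        Uniform.insert(Op);     // Op is used only by uniform instructions.
        Worklist.push_back(Op); // Keep expanding through its operands.
      }
    }
  }
  return Uniform;
}
// For example, if %gep is used only by a widened load and %offset is used
// only by %gep, seeding the worklist with {"%gep"} also marks "%offset"
// uniform after one more pass over the worklist.
// ---------------------------------------------------------------------------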
- for (BasicBlock *BB : TheLoop->blocks()) { - if (!Legal->blockNeedsPredication(BB)) - continue; - for (Instruction &I : *BB) - if (Legal->isScalarWithPredication(&I)) { - ScalarCostsTy ScalarCosts; - if (computePredInstDiscount(&I, ScalarCosts, VF) >= 0) - ScalarCostsVF.insert(ScalarCosts.begin(), ScalarCosts.end()); - } - } +bool LoopVectorizationLegality::blockNeedsPredication(BasicBlock *BB) { + return LoopAccessInfo::blockNeedsPredication(BB, TheLoop, DT); } -int LoopVectorizationCostModel::computePredInstDiscount( - Instruction *PredInst, DenseMap &ScalarCosts, - unsigned VF) { +bool LoopVectorizationLegality::blockCanBePredicated( + BasicBlock *BB, SmallPtrSetImpl &SafePtrs) { + const bool IsAnnotatedParallel = TheLoop->isAnnotatedParallel(); - assert(!isUniformAfterVectorization(PredInst, VF) && - "Instruction marked uniform-after-vectorization will be predicated"); + for (Instruction &I : *BB) { + // Check that we don't have a constant expression that can trap as operand. + for (Value *Operand : I.operands()) { + if (auto *C = dyn_cast(Operand)) + if (C->canTrap()) + return false; + } + // We might be able to hoist the load. + if (I.mayReadFromMemory()) { + auto *LI = dyn_cast(&I); + if (!LI) + return false; + if (!SafePtrs.count(LI->getPointerOperand())) { + if (isLegalMaskedLoad(LI->getType(), LI->getPointerOperand()) || + isLegalMaskedGather(LI->getType())) { + MaskedOp.insert(LI); + continue; + } + // !llvm.mem.parallel_loop_access implies if-conversion safety. + if (IsAnnotatedParallel) + continue; + return false; + } + } - // Initialize the discount to zero, meaning that the scalar version and the - // vector version cost the same. - int Discount = 0; + if (I.mayWriteToMemory()) { + auto *SI = dyn_cast(&I); + // We only support predication of stores in basic blocks with one + // predecessor. + if (!SI) + return false; - // Holds instructions to analyze. The instructions we visit are mapped in - // ScalarCosts. Those instructions are the ones that would be scalarized if - // we find that the scalar version costs less. - SmallVector Worklist; + // Build a masked store if it is legal for the target. + if (isLegalMaskedStore(SI->getValueOperand()->getType(), + SI->getPointerOperand()) || + isLegalMaskedScatter(SI->getValueOperand()->getType())) { + MaskedOp.insert(SI); + continue; + } - // Returns true if the given instruction can be scalarized. - auto canBeScalarized = [&](Instruction *I) -> bool { + bool isSafePtr = (SafePtrs.count(SI->getPointerOperand()) != 0); + bool isSinglePredecessor = SI->getParent()->getSinglePredecessor(); - // We only attempt to scalarize instructions forming a single-use chain - // from the original predicated block that would otherwise be vectorized. - // Although not strictly necessary, we give up on instructions we know will - // already be scalar to avoid traversing chains that are unlikely to be - // beneficial. - if (!I->hasOneUse() || PredInst->getParent() != I->getParent() || - isScalarAfterVectorization(I, VF)) + if (++NumPredStores > NumberOfStoresToPredicate || !isSafePtr || + !isSinglePredecessor) + return false; + } + if (I.mayThrow()) return false; + } - // If the instruction is scalar with predication, it will be analyzed - // separately. We ignore it within the context of PredInst. - if (Legal->isScalarWithPredication(I)) - return false; + return true; +} - // If any of the instruction's operands are uniform after vectorization, - // the instruction cannot be scalarized. 
This prevents, for example, a - // masked load from being scalarized. - // - // We assume we will only emit a value for lane zero of an instruction - // marked uniform after vectorization, rather than VF identical values. - // Thus, if we scalarize an instruction that uses a uniform, we would - // create uses of values corresponding to the lanes we aren't emitting code - // for. This behavior can be changed by allowing getScalarValue to clone - // the lane zero values for uniforms rather than asserting. - for (Use &U : I->operands()) - if (auto *J = dyn_cast(U.get())) - if (isUniformAfterVectorization(J, VF)) - return false; +void InterleavedAccessInfo::collectConstStrideAccesses( + MapVector &AccessStrideInfo, + const ValueToValueMap &Strides) { - // Otherwise, we can scalarize the instruction. - return true; - }; + auto &DL = TheLoop->getHeader()->getModule()->getDataLayout(); - // Returns true if an operand that cannot be scalarized must be extracted - // from a vector. We will account for this scalarization overhead below. Note + // Since it's desired that the load/store instructions be maintained in + // "program order" for the interleaved access analysis, we have to visit the + // blocks in the loop in reverse postorder (i.e., in a topological order). + // Such an ordering will ensure that any load/store that may be executed + // before a second load/store will precede the second load/store in + // AccessStrideInfo. + LoopBlocksDFS DFS(TheLoop); + DFS.perform(LI); + for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) + for (auto &I : *BB) { + auto *LI = dyn_cast(&I); + auto *SI = dyn_cast(&I); + if (!LI && !SI) + continue; + + Value *Ptr = getPointerOperand(&I); + // We don't check wrapping here because we don't know yet if Ptr will be + // part of a full group or a group with gaps. Checking wrapping for all + // pointers (even those that end up in groups with no gaps) will be overly + // conservative. For full groups, wrapping should be ok since if we would + // wrap around the address space we would do a memory access at nullptr + // even without the transformation. The wrapping checks are therefore + // deferred until after we've formed the interleaved groups. + int64_t Stride = getPtrStride(PSE, Ptr, TheLoop, Strides, + /*Assume=*/true, /*ShouldCheckWrap=*/false); + + const SCEV *Scev = replaceSymbolicStrideSCEV(PSE, Strides, Ptr); + PointerType *PtrTy = dyn_cast(Ptr->getType()); + uint64_t Size = DL.getTypeAllocSize(PtrTy->getElementType()); + + // An alignment of 0 means target ABI alignment. + unsigned Align = getMemInstAlignment(&I); + if (!Align) + Align = DL.getABITypeAlignment(PtrTy->getElementType()); + + AccessStrideInfo[&I] = StrideDescriptor(Stride, Scev, Size, Align); + } +} + +// Analyze interleaved accesses and collect them into interleaved load and +// store groups. +// +// When generating code for an interleaved load group, we effectively hoist all +// loads in the group to the location of the first load in program order. When +// generating code for an interleaved store group, we sink all stores to the +// location of the last store. This code motion can change the order of load +// and store instructions and may break dependences. +// +// The code generation strategy mentioned above ensures that we won't violate +// any write-after-read (WAR) dependences. 
+// +// E.g., for the WAR dependence: a = A[i]; // (1) +// A[i] = b; // (2) +// +// The store group of (2) is always inserted at or below (2), and the load +// group of (1) is always inserted at or above (1). Thus, the instructions will +// never be reordered. All other dependences are checked to ensure the +// correctness of the instruction reordering. +// +// The algorithm visits all memory accesses in the loop in bottom-up program +// order. Program order is established by traversing the blocks in the loop in +// reverse postorder when collecting the accesses. +// +// We visit the memory accesses in bottom-up order because it can simplify the +// construction of store groups in the presence of write-after-write (WAW) +// dependences. +// +// E.g., for the WAW dependence: A[i] = a; // (1) +// A[i] = b; // (2) +// A[i + 1] = c; // (3) +// +// We will first create a store group with (3) and (2). (1) can't be added to +// this group because it and (2) are dependent. However, (1) can be grouped +// with other accesses that may precede it in program order. Note that a +// bottom-up order does not imply that WAW dependences should not be checked. +void InterleavedAccessInfo::analyzeInterleaving( + const ValueToValueMap &Strides) { + DEBUG(dbgs() << "LV: Analyzing interleaved accesses...\n"); + + // Holds all accesses with a constant stride. + MapVector AccessStrideInfo; + collectConstStrideAccesses(AccessStrideInfo, Strides); + + if (AccessStrideInfo.empty()) + return; + + // Collect the dependences in the loop. + collectDependences(); + + // Holds all interleaved store groups temporarily. + SmallSetVector StoreGroups; + // Holds all interleaved load groups temporarily. + SmallSetVector LoadGroups; + + // Search in bottom-up program order for pairs of accesses (A and B) that can + // form interleaved load or store groups. In the algorithm below, access A + // precedes access B in program order. We initialize a group for B in the + // outer loop of the algorithm, and then in the inner loop, we attempt to + // insert each A into B's group if: + // + // 1. A and B have the same stride, + // 2. A and B have the same memory object size, and + // 3. A belongs in B's group according to its distance from B. + // + // Special care is taken to ensure group formation will not break any + // dependences. + for (auto BI = AccessStrideInfo.rbegin(), E = AccessStrideInfo.rend(); + BI != E; ++BI) { + Instruction *B = BI->first; + StrideDescriptor DesB = BI->second; + + // Initialize a group for B if it has an allowable stride. Even if we don't + // create a group for B, we continue with the bottom-up algorithm to ensure + // we don't break any of B's dependences. + InterleaveGroup *Group = nullptr; + if (isStrided(DesB.Stride)) { + Group = getInterleaveGroup(B); + if (!Group) { + DEBUG(dbgs() << "LV: Creating an interleave group with:" << *B << '\n'); + Group = createInterleaveGroup(B, DesB.Stride, DesB.Align); + } + if (B->mayWriteToMemory()) + StoreGroups.insert(Group); + else + LoadGroups.insert(Group); + } + + for (auto AI = std::next(BI); AI != E; ++AI) { + Instruction *A = AI->first; + StrideDescriptor DesA = AI->second; + + // Our code motion strategy implies that we can't have dependences + // between accesses in an interleaved group and other accesses located + // between the first and last member of the group. Note that this also + // means that a group can't have more than one member at a given offset. 
+ // The accesses in a group can have dependences with other accesses, but + // we must ensure we don't extend the boundaries of the group such that + // we encompass those dependent accesses. + // + // For example, assume we have the sequence of accesses shown below in a + // stride-2 loop: + // + // (1, 2) is a group | A[i] = a; // (1) + // | A[i-1] = b; // (2) | + // A[i-3] = c; // (3) + // A[i] = d; // (4) | (2, 4) is not a group + // + // Because accesses (2) and (3) are dependent, we can group (2) with (1) + // but not with (4). If we did, the dependent access (3) would be within + // the boundaries of the (2, 4) group. + if (!canReorderMemAccessesForInterleavedGroups(&*AI, &*BI)) { + + // If a dependence exists and A is already in a group, we know that A + // must be a store since A precedes B and WAR dependences are allowed. + // Thus, A would be sunk below B. We release A's group to prevent this + // illegal code motion. A will then be free to form another group with + // instructions that precede it. + if (isInterleaved(A)) { + InterleaveGroup *StoreGroup = getInterleaveGroup(A); + StoreGroups.remove(StoreGroup); + releaseGroup(StoreGroup); + } + + // If a dependence exists and A is not already in a group (or it was + // and we just released it), B might be hoisted above A (if B is a + // load) or another store might be sunk below A (if B is a store). In + // either case, we can't add additional instructions to B's group. B + // will only form a group with instructions that it precedes. + break; + } + + // At this point, we've checked for illegal code motion. If either A or B + // isn't strided, there's nothing left to do. + if (!isStrided(DesA.Stride) || !isStrided(DesB.Stride)) + continue; + + // Ignore A if it's already in a group or isn't the same kind of memory + // operation as B. + if (isInterleaved(A) || A->mayReadFromMemory() != B->mayReadFromMemory()) + continue; + + // Check rules 1 and 2. Ignore A if its stride or size is different from + // that of B. + if (DesA.Stride != DesB.Stride || DesA.Size != DesB.Size) + continue; + + // Calculate the distance from A to B. + const SCEVConstant *DistToB = dyn_cast( + PSE.getSE()->getMinusSCEV(DesA.Scev, DesB.Scev)); + if (!DistToB) + continue; + int64_t DistanceToB = DistToB->getAPInt().getSExtValue(); + + // Check rule 3. Ignore A if its distance to B is not a multiple of the + // size. + if (DistanceToB % static_cast(DesB.Size)) + continue; + + // Ignore A if either A or B is in a predicated block. Although we + // currently prevent group formation for predicated accesses, we may be + // able to relax this limitation in the future once we handle more + // complicated blocks. + if (isPredicated(A->getParent()) || isPredicated(B->getParent())) + continue; + + // The index of A is the index of B plus A's distance to B in multiples + // of the size. + int IndexA = + Group->getIndex(B) + DistanceToB / static_cast(DesB.Size); + + // Try to insert A into B's group. + if (Group->insertMember(A, IndexA, DesA.Align)) { + DEBUG(dbgs() << "LV: Inserted:" << *A << '\n' + << " into the interleave group with" << *B << '\n'); + InterleaveGroupMap[A] = Group; + + // Set the first load in program order as the insert position. + if (A->mayReadFromMemory()) + Group->setInsertPos(A); + } + } // Iteration over A accesses. + } // Iteration over B accesses. + + // Remove interleaved store groups with gaps. 
+  for (InterleaveGroup *Group : StoreGroups)
+    if (Group->getNumMembers() != Group->getFactor())
+      releaseGroup(Group);
+
+  // Remove interleaved groups with gaps (currently only loads) whose memory
+  // accesses may wrap around. We have to revisit the getPtrStride analysis,
+  // this time with ShouldCheckWrap=true, since collectConstStrideAccesses does
+  // not check wrapping (see documentation there).
+  // FORNOW we use Assume=false;
+  // TODO: Change to Assume=true, making sure we don't exceed the threshold
+  // of runtime SCEV assumption checks (thereby potentially failing to
+  // vectorize altogether).
+  // Additional optional optimizations:
+  // TODO: If we are peeling the loop and we know that the first pointer doesn't
+  // wrap then we can deduce that all pointers in the group don't wrap.
+  // This means that we can forcefully peel the loop in order to only have to
+  // check the first pointer for no-wrap. When we change to Assume=true we'll
+  // only need at most one runtime check per interleaved group.
+  //
+  for (InterleaveGroup *Group : LoadGroups) {
+
+    // Case 1: A full group. We can skip the checks; for full groups, if the
+    // wide load would wrap around the address space we would do a memory
+    // access at nullptr even without the transformation.
+    if (Group->getNumMembers() == Group->getFactor())
+      continue;
+
+    // Case 2: If the first and last members of the group don't wrap, this
+    // implies that all the pointers in the group don't wrap.
+    // So we check only group member 0 (which is always guaranteed to exist),
+    // and group member Factor - 1; if the latter doesn't exist we rely on
+    // peeling (if it is a non-reversed access -- see Case 3).
+    Value *FirstMemberPtr = getPointerOperand(Group->getMember(0));
+    if (!getPtrStride(PSE, FirstMemberPtr, TheLoop, Strides, /*Assume=*/false,
+                      /*ShouldCheckWrap=*/true)) {
+      DEBUG(dbgs() << "LV: Invalidate candidate interleaved group due to "
+                      "first group member potentially pointer-wrapping.\n");
+      releaseGroup(Group);
+      continue;
+    }
+    Instruction *LastMember = Group->getMember(Group->getFactor() - 1);
+    if (LastMember) {
+      Value *LastMemberPtr = getPointerOperand(LastMember);
+      if (!getPtrStride(PSE, LastMemberPtr, TheLoop, Strides, /*Assume=*/false,
+                        /*ShouldCheckWrap=*/true)) {
+        DEBUG(dbgs() << "LV: Invalidate candidate interleaved group due to "
+                      "last group member potentially pointer-wrapping.\n");
+        releaseGroup(Group);
+      }
+    } else {
+      // Case 3: A non-reversed interleaved load group with gaps: We need
+      // to execute at least one scalar epilogue iteration. This will ensure
+      // we don't speculatively access memory out-of-bounds. We only need
+      // to look for a member at index factor - 1, since every group must have
+      // a member at index zero.
+      if (Group->isReverse()) {
+        releaseGroup(Group);
+        continue;
+      }
+      DEBUG(dbgs() << "LV: Interleaved group requires epilogue iteration.\n");
+      RequiresScalarEpilogue = true;
+    }
+  }
+}
+
+bool LoopVectorizationCostModel::canVectorize(bool OptForSize) {
+  if (OptForSize && Legal->getRuntimePointerChecking()->Need) {
+    ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")
+              << "runtime pointer checks needed. Enable vectorization of this "
+                 "loop with '#pragma clang loop vectorize(enable)' when "
+                 "compiling with -Os/-Oz");
+    DEBUG(dbgs()
+          << "LV: Aborting. 
Runtime ptr check is required with -Os/-Oz.\n"); + return false; + } + + if (!EnableCondStoresVectorization && Legal->getNumPredStores()) { + ORE->emit(createMissedAnalysis("ConditionalStore") + << "store that is conditionally executed prevents vectorization"); + DEBUG(dbgs() << "LV: No vectorization. There are conditional stores.\n"); + return false; + } + + // If we optimize the program for size, avoid creating the tail loop. + if (OptForSize) { + unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop); + DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n'); + + // If we don't know the precise trip count, don't try to vectorize. + if (TC < 2) { + ORE->emit( + createMissedAnalysis("UnknownLoopCountComplexCFG") + << "unable to calculate the loop count due to complex control flow"); + DEBUG(dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n"); + return false; + } + } + return true; +} + +unsigned +LoopVectorizationCostModel::computeMaxVectorizationFactor(bool OptForSize) { + MinBWs = computeMinimumValueSizes(TheLoop->getBlocks(), *DB, &TTI); + unsigned SmallestType, WidestType; + std::tie(SmallestType, WidestType) = getSmallestAndWidestTypes(); + unsigned WidestRegister = TTI.getRegisterBitWidth(true); + unsigned MaxSafeDepDist = -1U; + + // Get the maximum safe dependence distance in bits computed by LAA. If the + // loop contains any interleaved accesses, we divide the dependence distance + // by the maximum interleave factor of all interleaved groups. Note that + // although the division ensures correctness, this is a fairly conservative + // computation because the maximum distance computed by LAA may not involve + // any of the interleaved accesses. + if (Legal->getMaxSafeDepDistBytes() != -1U) + MaxSafeDepDist = + Legal->getMaxSafeDepDistBytes() * 8 / Legal->getMaxInterleaveFactor(); + + WidestRegister = + ((WidestRegister < MaxSafeDepDist) ? WidestRegister : MaxSafeDepDist); + unsigned MaxVectorSize = WidestRegister / WidestType; + + DEBUG(dbgs() << "LV: The Smallest and Widest types: " << SmallestType << " / " + << WidestType << " bits.\n"); + DEBUG(dbgs() << "LV: The Widest register is: " << WidestRegister + << " bits.\n"); + + if (MaxVectorSize == 0) { + DEBUG(dbgs() << "LV: The target has no vector registers.\n"); + MaxVectorSize = 1; + } + + assert(MaxVectorSize <= 64 && "Did not expect to pack so many elements" + " into one vector!"); + + unsigned VF = MaxVectorSize; + + if (MaximizeBandwidth && !OptForSize) { + // Collect all viable vectorization factors. + SmallVector VFs; + unsigned NewMaxVectorSize = WidestRegister / SmallestType; + for (unsigned VS = MaxVectorSize; VS <= NewMaxVectorSize; VS *= 2) + VFs.push_back(VS); + + // For each VF calculate its register usage. + auto RUs = calculateRegisterUsage(VFs); + + // Select the largest VF which doesn't require more registers than existing + // ones. + unsigned TargetNumRegisters = TTI.getNumberOfRegisters(true); + for (int i = RUs.size() - 1; i >= 0; --i) { + if (RUs[i].MaxLocalUsers <= TargetNumRegisters) { + VF = VFs[i]; + break; + } + } + } + return VF; +} + +bool LoopVectorizationCostModel::requiresTail(unsigned MaxVectorSize) { + unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop); + DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n'); + + // Find the maximum SIMD width that can fit within the trip count. + unsigned VF = TC % MaxVectorSize; + + if (VF == 0) + return false; + + // If the trip count that we found modulo the vectorization factor is not + // zero then we require a tail. 
+ ORE->emit(createMissedAnalysis("NoTailLoopWithOptForSize") + << "cannot optimize for size and vectorize at the " + "same time. Enable vectorization of this loop " + "with '#pragma clang loop vectorize(enable)' " + "when compiling with -Os/-Oz"); + DEBUG(dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n"); + return true; +} + +LoopVectorizationCostModel::VectorizationFactor +LoopVectorizationCostModel::selectVectorizationFactor(bool OptForSize, + unsigned VF) { + // Width 1 means no vectorize + VectorizationFactor Factor = {1U, 0U}; + + float Cost = expectedCost(1).first; +#ifndef NDEBUG + const float ScalarCost = Cost; +#endif /* NDEBUG */ + unsigned Width = 1; + DEBUG(dbgs() << "LV: Scalar loop costs: " << (int)ScalarCost << ".\n"); + + bool ForceVectorization = Hints->getForce() == LoopVectorizeHints::FK_Enabled; + // Ignore scalar width, because the user explicitly wants vectorization. + if (ForceVectorization && VF > 1) { + Width = 2; + Cost = expectedCost(Width).first / (float)Width; + } + + for (unsigned i = 2; i <= VF; i *= 2) { + // Notice that the vector loop needs to be executed less times, so + // we need to divide the cost of the vector loops by the width of + // the vector elements. + VectorizationCostTy C = expectedCost(i); + float VectorCost = C.first / (float)i; + DEBUG(dbgs() << "LV: Vector loop of width " << i + << " costs: " << (int)VectorCost << ".\n"); + if (!C.second && !ForceVectorization) { + DEBUG( + dbgs() << "LV: Not considering vector loop of width " << i + << " because it will not generate any vector instructions.\n"); + continue; + } + if (VectorCost < Cost) { + Cost = VectorCost; + Width = i; + } + } + + DEBUG(if (ForceVectorization && Width > 1 && Cost >= ScalarCost) dbgs() + << "LV: Vectorization seems to be not beneficial, " + << "but was forced by a user.\n"); + DEBUG(dbgs() << "LV: Selecting VF: " << Width << ".\n"); + Factor.Width = Width; + Factor.Cost = Width * Cost; + return Factor; +} + +std::pair +LoopVectorizationCostModel::getSmallestAndWidestTypes() { + unsigned MinWidth = -1U; + unsigned MaxWidth = 8; + const DataLayout &DL = TheFunction->getParent()->getDataLayout(); + + // For each block. + for (BasicBlock *BB : TheLoop->blocks()) { + // For each instruction in the loop. + for (Instruction &I : *BB) { + Type *T = I.getType(); + + // Skip ignored values. + if (ValuesToIgnore.count(&I)) + continue; + + // Only examine Loads, Stores and PHINodes. + if (!isa(I) && !isa(I) && !isa(I)) + continue; + + // Examine PHI nodes that are reduction variables. Update the type to + // account for the recurrence type. + if (auto *PN = dyn_cast(&I)) { + if (!Legal->isReductionVariable(PN)) + continue; + RecurrenceDescriptor RdxDesc = (*Legal->getReductionVars())[PN]; + T = RdxDesc.getRecurrenceType(); + } + + // Examine the stored values. + if (auto *ST = dyn_cast(&I)) + T = ST->getValueOperand()->getType(); + + // Ignore loaded pointer types and stored pointer types that are not + // consecutive. However, we do want to take consecutive stores/loads of + // pointer vectors into account. 
+ if (T->isPointerTy() && !isConsecutiveLoadOrStore(&I)) + continue; + + MinWidth = std::min(MinWidth, + (unsigned)DL.getTypeSizeInBits(T->getScalarType())); + MaxWidth = std::max(MaxWidth, + (unsigned)DL.getTypeSizeInBits(T->getScalarType())); + } + } + + return {MinWidth, MaxWidth}; +} + +unsigned LoopVectorizationCostModel::selectInterleaveCount(bool OptForSize, + unsigned VF, + unsigned LoopCost) { + + // -- The interleave heuristics -- + // We interleave the loop in order to expose ILP and reduce the loop overhead. + // There are many micro-architectural considerations that we can't predict + // at this level. For example, frontend pressure (on decode or fetch) due to + // code size, or the number and capabilities of the execution ports. + // + // We use the following heuristics to select the interleave count: + // 1. If the code has reductions, then we interleave to break the cross + // iteration dependency. + // 2. If the loop is really small, then we interleave to reduce the loop + // overhead. + // 3. We don't interleave if we think that we will spill registers to memory + // due to the increased register pressure. + + // When we optimize for size, we don't interleave. + if (OptForSize) + return 1; + + // We used the distance for the interleave count. + if (Legal->getMaxSafeDepDistBytes() != -1U) + return 1; + + // Do not interleave loops with a relatively small trip count. + unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop); + if (TC > 1 && TC < TinyTripCountInterleaveThreshold) + return 1; + + unsigned TargetNumRegisters = TTI.getNumberOfRegisters(VF > 1); + DEBUG(dbgs() << "LV: The target has " << TargetNumRegisters + << " registers\n"); + + if (VF == 1) { + if (ForceTargetNumScalarRegs.getNumOccurrences() > 0) + TargetNumRegisters = ForceTargetNumScalarRegs; + } else { + if (ForceTargetNumVectorRegs.getNumOccurrences() > 0) + TargetNumRegisters = ForceTargetNumVectorRegs; + } + + RegisterUsage R = calculateRegisterUsage({VF})[0]; + // We divide by these constants so assume that we have at least one + // instruction that uses at least one register. + R.MaxLocalUsers = std::max(R.MaxLocalUsers, 1U); + R.NumInstructions = std::max(R.NumInstructions, 1U); + + // We calculate the interleave count using the following formula. + // Subtract the number of loop invariants from the number of available + // registers. These registers are used by all of the interleaved instances. + // Next, divide the remaining registers by the number of registers that is + // required by the loop, in order to estimate how many parallel instances + // fit without causing spills. All of this is rounded down if necessary to be + // a power of two. We want power of two interleave count to simplify any + // addressing operations or alignment considerations. + unsigned IC = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs) / + R.MaxLocalUsers); + + // Don't count the induction variable as interleaved. + if (EnableIndVarRegisterHeur) + IC = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs - 1) / + std::max(1U, (R.MaxLocalUsers - 1))); + + // Clamp the interleave ranges to reasonable counts. + unsigned MaxInterleaveCount = TTI.getMaxInterleaveFactor(VF); + + // Check if the user has overridden the max. 
+ if (VF == 1) { + if (ForceTargetMaxScalarInterleaveFactor.getNumOccurrences() > 0) + MaxInterleaveCount = ForceTargetMaxScalarInterleaveFactor; + } else { + if (ForceTargetMaxVectorInterleaveFactor.getNumOccurrences() > 0) + MaxInterleaveCount = ForceTargetMaxVectorInterleaveFactor; + } + + // If we did not calculate the cost for VF (because the user selected the VF) + // then we calculate the cost of VF here. + if (LoopCost == 0) + LoopCost = expectedCost(VF).first; + + // Clamp the calculated IC to be between the 1 and the max interleave count + // that the target allows. + if (IC > MaxInterleaveCount) + IC = MaxInterleaveCount; + else if (IC < 1) + IC = 1; + + // Interleave if we vectorized this loop and there is a reduction that could + // benefit from interleaving. + if (VF > 1 && Legal->getReductionVars()->size()) { + DEBUG(dbgs() << "LV: Interleaving because of reductions.\n"); + return IC; + } + + // Note that if we've already vectorized the loop we will have done the + // runtime check and so interleaving won't require further checks. + bool InterleavingRequiresRuntimePointerCheck = + (VF == 1 && Legal->getRuntimePointerChecking()->Need); + + // We want to interleave small loops in order to reduce the loop overhead and + // potentially expose ILP opportunities. + DEBUG(dbgs() << "LV: Loop cost is " << LoopCost << '\n'); + if (!InterleavingRequiresRuntimePointerCheck && LoopCost < SmallLoopCost) { + // We assume that the cost overhead is 1 and we use the cost model + // to estimate the cost of the loop and interleave until the cost of the + // loop overhead is about 5% of the cost of the loop. + unsigned SmallIC = + std::min(IC, (unsigned)PowerOf2Floor(SmallLoopCost / LoopCost)); + + // Interleave until store/load ports (estimated by max interleave count) are + // saturated. + unsigned NumStores = Legal->getNumStores(); + unsigned NumLoads = Legal->getNumLoads(); + unsigned StoresIC = IC / (NumStores ? NumStores : 1); + unsigned LoadsIC = IC / (NumLoads ? NumLoads : 1); + + // If we have a scalar reduction (vector reductions are already dealt with + // by this point), we can increase the critical path length if the loop + // we're interleaving is inside another loop. Limit, by default to 2, so the + // critical path only gets increased by one reduction operation. + if (Legal->getReductionVars()->size() && TheLoop->getLoopDepth() > 1) { + unsigned F = static_cast(MaxNestedScalarReductionIC); + SmallIC = std::min(SmallIC, F); + StoresIC = std::min(StoresIC, F); + LoadsIC = std::min(LoadsIC, F); + } + + if (EnableLoadStoreRuntimeInterleave && + std::max(StoresIC, LoadsIC) > SmallIC) { + DEBUG(dbgs() << "LV: Interleaving to saturate store or load ports.\n"); + return std::max(StoresIC, LoadsIC); + } + + DEBUG(dbgs() << "LV: Interleaving to reduce branch cost.\n"); + return SmallIC; + } + + // Interleave if this is a large loop (small loops are already dealt with by + // this point) that could benefit from interleaving. + bool HasReductions = (Legal->getReductionVars()->size() > 0); + if (TTI.enableAggressiveInterleaving(HasReductions)) { + DEBUG(dbgs() << "LV: Interleaving to expose ILP.\n"); + return IC; + } + + DEBUG(dbgs() << "LV: Not Interleaving.\n"); + return 1; +} + +SmallVector +LoopVectorizationCostModel::calculateRegisterUsage(ArrayRef VFs) { + // This function calculates the register usage by measuring the highest number + // of values that are alive at a single location. Obviously, this is a very + // rough estimation. 
We scan the loop in a topological order in order and + // assign a number to each instruction. We use RPO to ensure that defs are + // met before their users. We assume that each instruction that has in-loop + // users starts an interval. We record every time that an in-loop value is + // used, so we have a list of the first and last occurrences of each + // instruction. Next, we transpose this data structure into a multi map that + // holds the list of intervals that *end* at a specific location. This multi + // map allows us to perform a linear search. We scan the instructions linearly + // and record each time that a new interval starts, by placing it in a set. + // If we find this value in the multi-map then we remove it from the set. + // The max register usage is the maximum size of the set. + // We also search for instructions that are defined outside the loop, but are + // used inside the loop. We need this number separately from the max-interval + // usage number because when we unroll, loop-invariant values do not take + // more register. + LoopBlocksDFS DFS(TheLoop); + DFS.perform(LI); + + RegisterUsage RU; + RU.NumInstructions = 0; + + // Each 'key' in the map opens a new interval. The values + // of the map are the index of the 'last seen' usage of the + // instruction that is the key. + typedef DenseMap IntervalMap; + // Maps instruction to its index. + DenseMap IdxToInstr; + // Marks the end of each interval. + IntervalMap EndPoint; + // Saves the list of instruction indices that are used in the loop. + SmallSet Ends; + // Saves the list of values that are used in the loop but are + // defined outside the loop, such as arguments and constants. + SmallPtrSet LoopInvariants; + + unsigned Index = 0; + for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) { + RU.NumInstructions += BB->size(); + for (Instruction &I : *BB) { + IdxToInstr[Index++] = &I; + + // Save the end location of each USE. + for (Value *U : I.operands()) { + auto *Instr = dyn_cast(U); + + // Ignore non-instruction values such as arguments, constants, etc. + if (!Instr) + continue; + + // If this instruction is outside the loop then record it and continue. + if (!TheLoop->contains(Instr)) { + LoopInvariants.insert(Instr); + continue; + } + + // Overwrite previous end points. + EndPoint[Instr] = Index; + Ends.insert(Instr); + } + } + } + + // Saves the list of intervals that end with the index in 'key'. + typedef SmallVector InstrList; + DenseMap TransposeEnds; + + // Transpose the EndPoints to a list of values that end at each index. + for (auto &Interval : EndPoint) + TransposeEnds[Interval.second].push_back(Interval.first); + + SmallSet OpenIntervals; + + // Get the size of the widest register. + unsigned MaxSafeDepDist = -1U; + if (Legal->getMaxSafeDepDistBytes() != -1U) + MaxSafeDepDist = Legal->getMaxSafeDepDistBytes() * 8; + unsigned WidestRegister = + std::min(TTI.getRegisterBitWidth(true), MaxSafeDepDist); + const DataLayout &DL = TheFunction->getParent()->getDataLayout(); + + SmallVector RUs(VFs.size()); + SmallVector MaxUsages(VFs.size(), 0); + + DEBUG(dbgs() << "LV(REG): Calculating max register usage:\n"); + + // A lambda that gets the register usage for the given type and VF. 
+  auto GetRegUsage = [&DL, WidestRegister](Type *Ty, unsigned VF) {
+    if (Ty->isTokenTy())
+      return 0U;
+    unsigned TypeSize = DL.getTypeSizeInBits(Ty->getScalarType());
+    return std::max<unsigned>(1, VF * TypeSize / WidestRegister);
+  };
+
+  for (unsigned int i = 0; i < Index; ++i) {
+    Instruction *I = IdxToInstr[i];
+
+    // Remove all of the instructions that end at this location.
+    InstrList &List = TransposeEnds[i];
+    for (Instruction *ToRemove : List)
+      OpenIntervals.erase(ToRemove);
+
+    // Ignore instructions that are never used within the loop.
+    if (!Ends.count(I))
+      continue;
+
+    // Skip ignored values.
+    if (ValuesToIgnore.count(I))
+      continue;
+
+    // For each VF find the maximum usage of registers.
+    for (unsigned j = 0, e = VFs.size(); j < e; ++j) {
+      if (VFs[j] == 1) {
+        MaxUsages[j] = std::max(MaxUsages[j], OpenIntervals.size());
+        continue;
+      }
+      collectUniformsAndScalars(VFs[j]);
+      // Count the number of live intervals.
+      unsigned RegUsage = 0;
+      for (auto Inst : OpenIntervals) {
+        // Skip ignored values for VF > 1.
+        if (VecValuesToIgnore.count(Inst) ||
+            isScalarAfterVectorization(Inst, VFs[j]))
+          continue;
+        RegUsage += GetRegUsage(Inst->getType(), VFs[j]);
+      }
+      MaxUsages[j] = std::max(MaxUsages[j], RegUsage);
+    }
+
+    DEBUG(dbgs() << "LV(REG): At #" << i << " Interval # "
+                 << OpenIntervals.size() << '\n');
+
+    // Add the current instruction to the list of open intervals.
+    OpenIntervals.insert(I);
+  }
+
+  for (unsigned i = 0, e = VFs.size(); i < e; ++i) {
+    unsigned Invariant = 0;
+    if (VFs[i] == 1)
+      Invariant = LoopInvariants.size();
+    else {
+      for (auto Inst : LoopInvariants)
+        Invariant += GetRegUsage(Inst->getType(), VFs[i]);
+    }
+
+    DEBUG(dbgs() << "LV(REG): VF = " << VFs[i] << '\n');
+    DEBUG(dbgs() << "LV(REG): Found max usage: " << MaxUsages[i] << '\n');
+    DEBUG(dbgs() << "LV(REG): Found invariant usage: " << Invariant << '\n');
+    DEBUG(dbgs() << "LV(REG): LoopSize: " << RU.NumInstructions << '\n');
+
+    RU.LoopInvariantRegs = Invariant;
+    RU.MaxLocalUsers = MaxUsages[i];
+    RUs[i] = RU;
+  }
+
+  return RUs;
+}
+
+void LoopVectorizationCostModel::collectInstsToScalarize(unsigned VF) {
+
+  // Function should not be called for the scalar case.
+  assert(VF >= 2 && "Function called for the scalar loop");
+
+  // If we've already collected the instructions to scalarize, there's
+  // nothing to do. Collection may already have occurred if we have a
+  // user-selected VF and are now computing the expected cost for
+  // interleaving.
+  if (InstsToScalarize.count(VF))
+    return;
+
+  // Initialize a mapping for VF in InstsToScalarize. If we find that it's
+  // not profitable to scalarize any instructions, the presence of VF in the
+  // map will indicate that we've analyzed it already.
+  ScalarCostsTy &ScalarCostsVF = InstsToScalarize[VF];
+
+  // Find all the instructions that are scalar with predication in the loop and
+  // determine if it would be better to not if-convert the blocks they are in.
+  // If so, we also record the instructions to scalarize. 
+ for (BasicBlock *BB : TheLoop->blocks()) { + if (!Legal->blockNeedsPredication(BB)) + continue; + for (Instruction &I : *BB) + if (Legal->isScalarWithPredication(&I)) { + ScalarCostsTy ScalarCosts; + if (computePredInstDiscount(&I, ScalarCosts, VF) >= 0) + ScalarCostsVF.insert(ScalarCosts.begin(), ScalarCosts.end()); + } + } +} + +int LoopVectorizationCostModel::computePredInstDiscount( + Instruction *PredInst, DenseMap &ScalarCosts, + unsigned VF) { + + assert(!isUniformAfterVectorization(PredInst, VF) && + "Instruction marked uniform-after-vectorization will be predicated"); + + // Initialize the discount to zero, meaning that the scalar version and the + // vector version cost the same. + int Discount = 0; + + // Holds instructions to analyze. The instructions we visit are mapped in + // ScalarCosts. Those instructions are the ones that would be scalarized if + // we find that the scalar version costs less. + SmallVector Worklist; + + // Returns true if the given instruction can be scalarized. + auto canBeScalarized = [&](Instruction *I) -> bool { + + // We only attempt to scalarize instructions forming a single-use chain + // from the original predicated block that would otherwise be vectorized. + // Although not strictly necessary, we give up on instructions we know will + // already be scalar to avoid traversing chains that are unlikely to be + // beneficial. + if (!I->hasOneUse() || PredInst->getParent() != I->getParent() || + isScalarAfterVectorization(I, VF)) + return false; + + // If the instruction is scalar with predication, it will be analyzed + // separately. We ignore it within the context of PredInst. + if (Legal->isScalarWithPredication(I)) + return false; + + // If any of the instruction's operands are uniform after vectorization, + // the instruction cannot be scalarized. This prevents, for example, a + // masked load from being scalarized. + // + // We assume we will only emit a value for lane zero of an instruction + // marked uniform after vectorization, rather than VF identical values. + // Thus, if we scalarize an instruction that uses a uniform, we would + // create uses of values corresponding to the lanes we aren't emitting code + // for. This behavior can be changed by allowing getScalarValue to clone + // the lane zero values for uniforms rather than asserting. + for (Use &U : I->operands()) + if (auto *J = dyn_cast(U.get())) + if (isUniformAfterVectorization(J, VF)) + return false; + + // Otherwise, we can scalarize the instruction. + return true; + }; + + // Returns true if an operand that cannot be scalarized must be extracted + // from a vector. We will account for this scalarization overhead below. Note // that the non-void predicated instructions are placed in their own blocks, // and their return values are inserted into vectors. Thus, an extract would // still be required. @@ -6749,606 +7288,1721 @@ return TheLoop->contains(I) && !isScalarAfterVectorization(I, VF); }; - // Compute the expected cost discount from scalarizing the entire expression - // feeding the predicated instruction. We currently only consider expressions - // that are single-use instruction chains. - Worklist.push_back(PredInst); - while (!Worklist.empty()) { - Instruction *I = Worklist.pop_back_val(); + // Compute the expected cost discount from scalarizing the entire expression + // feeding the predicated instruction. We currently only consider expressions + // that are single-use instruction chains. 
+ Worklist.push_back(PredInst); + while (!Worklist.empty()) { + Instruction *I = Worklist.pop_back_val(); + + // If we've already analyzed the instruction, there's nothing to do. + if (ScalarCosts.count(I)) + continue; + + // Compute the cost of the vector instruction. Note that this cost already + // includes the scalarization overhead of the predicated instruction. + unsigned VectorCost = getInstructionCost(I, VF).first; + + // Compute the cost of the scalarized instruction. This cost is the cost of + // the instruction as if it wasn't if-converted and instead remained in the + // predicated block. We will scale this cost by block probability after + // computing the scalarization overhead. + unsigned ScalarCost = VF * getInstructionCost(I, 1).first; + + // Compute the scalarization overhead of needed insertelement instructions + // and phi nodes. + if (Legal->isScalarWithPredication(I) && !I->getType()->isVoidTy()) { + ScalarCost += TTI.getScalarizationOverhead(ToVectorTy(I->getType(), VF), + true, false); + ScalarCost += VF * TTI.getCFInstrCost(Instruction::PHI); + } + + // Compute the scalarization overhead of needed extractelement + // instructions. For each of the instruction's operands, if the operand can + // be scalarized, add it to the worklist; otherwise, account for the + // overhead. + for (Use &U : I->operands()) + if (auto *J = dyn_cast(U.get())) { + assert(VectorType::isValidElementType(J->getType()) && + "Instruction has non-scalar type"); + if (canBeScalarized(J)) + Worklist.push_back(J); + else if (needsExtract(J)) + ScalarCost += TTI.getScalarizationOverhead( + ToVectorTy(J->getType(),VF), false, true); + } + + // Scale the total scalar cost by block probability. + ScalarCost /= getReciprocalPredBlockProb(); + + // Compute the discount. A non-negative discount means the vector version + // of the instruction costs more, and scalarizing would be beneficial. + Discount += VectorCost - ScalarCost; + ScalarCosts[I] = ScalarCost; + } + + return Discount; +} + +LoopVectorizationCostModel::VectorizationCostTy +LoopVectorizationCostModel::expectedCost(unsigned VF) { + VectorizationCostTy Cost; + + // For each block. + for (BasicBlock *BB : TheLoop->blocks()) { + VectorizationCostTy BlockCost; + + // For each instruction in the old loop. + for (Instruction &I : *BB) { + // Skip dbg intrinsics. + if (isa(I)) + continue; + + // Skip ignored values. + if (ValuesToIgnore.count(&I)) + continue; + + VectorizationCostTy C = getInstructionCost(&I, VF); + + // Check if we should override the cost. + if (ForceTargetInstructionCost.getNumOccurrences() > 0) + C.first = ForceTargetInstructionCost; + + BlockCost.first += C.first; + BlockCost.second |= C.second; + DEBUG(dbgs() << "LV: Found an estimated cost of " << C.first << " for VF " + << VF << " For instruction: " << I << '\n'); + } + + // If we are vectorizing a predicated block, it will have been + // if-converted. This means that the block's instructions (aside from + // stores and instructions that may divide by zero) will now be + // unconditionally executed. For the scalar case, we may not always execute + // the predicated block. Thus, scale the block's cost by the probability of + // executing it. 
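// ---------------------------------------------------------------------------
// An illustrative sketch of the probability scaling described in the comment
// above, with made-up cost numbers. It assumes the reciprocal block
// probability is 2 (i.e. a predicated block is assumed to execute on roughly
// half of the iterations); scaledPredicatedBlockCost is a name invented for
// this example and is not an LLVM API.
#include <cassert>
#include <cstdio>

// Expected contribution of a predicated block to the scalar loop cost: the
// block's unpredicated cost divided by the reciprocal of its execution
// probability.
unsigned scaledPredicatedBlockCost(unsigned UnpredicatedBlockCost,
                                   unsigned ReciprocalBlockProb) {
  assert(ReciprocalBlockProb > 0 && "Reciprocal probability must be nonzero");
  return UnpredicatedBlockCost / ReciprocalBlockProb;
}

int main() {
  // A predicated block whose instructions cost 10 in the scalar loop and that
  // runs with probability 1/2 contributes an expected cost of 5. In the
  // vector loop the block is if-converted and executes unconditionally, so
  // its cost is not scaled.
  std::printf("expected scalar block cost = %u\n",
              scaledPredicatedBlockCost(10, 2));
  return 0;
}
// ---------------------------------------------------------------------------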
+ if (VF == 1 && Legal->blockNeedsPredication(BB)) + BlockCost.first /= getReciprocalPredBlockProb(); + + Cost.first += BlockCost.first; + Cost.second |= BlockCost.second; + } + + return Cost; +} + +/// \brief Gets Address Access SCEV after verifying that the access pattern +/// is loop invariant except the induction variable dependence. +/// +/// This SCEV can be sent to the Target in order to estimate the address +/// calculation cost. +static const SCEV *getAddressAccessSCEV( + Value *Ptr, + LoopVectorizationLegality *Legal, + ScalarEvolution *SE, + const Loop *TheLoop) { + auto *Gep = dyn_cast(Ptr); + if (!Gep) + return nullptr; + + // We are looking for a gep with all loop invariant indices except for one + // which should be an induction variable. + unsigned NumOperands = Gep->getNumOperands(); + for (unsigned i = 1; i < NumOperands; ++i) { + Value *Opd = Gep->getOperand(i); + if (!SE->isLoopInvariant(SE->getSCEV(Opd), TheLoop) && + !Legal->isInductionVariable(Opd)) + return nullptr; + } + + // Now we know we have a GEP ptr, %inv, %ind, %inv. return the Ptr SCEV. + return SE->getSCEV(Ptr); +} + +static bool isStrideMul(Instruction *I, LoopVectorizationLegality *Legal) { + return Legal->hasStride(I->getOperand(0)) || + Legal->hasStride(I->getOperand(1)); +} + +unsigned LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I, + unsigned VF) { + Type *ValTy = getMemInstValueType(I); + auto SE = PSE.getSE(); + + unsigned Alignment = getMemInstAlignment(I); + unsigned AS = getMemInstAddressSpace(I); + Value *Ptr = getPointerOperand(I); + Type *PtrTy = ToVectorTy(Ptr->getType(), VF); + + // Figure out whether the access is strided and get the stride value + // if it's known in compile time + const SCEV *PtrSCEV = getAddressAccessSCEV(Ptr, Legal, SE, TheLoop); + + // Get the cost of the scalar memory instruction and address computation. + unsigned Cost = VF * TTI.getAddressComputationCost(PtrTy, SE, PtrSCEV); + + Cost += VF * + TTI.getMemoryOpCost(I->getOpcode(), ValTy->getScalarType(), Alignment, + AS); + + // Get the overhead of the extractelement and insertelement instructions + // we might create due to scalarization. + Cost += getScalarizationOverhead(I, VF, TTI); + + // If we have a predicated store, it may not be executed for each vector + // lane. Scale the cost by the probability of executing the predicated + // block. 
+ if (Legal->isScalarWithPredication(I)) + Cost /= getReciprocalPredBlockProb(); + + return Cost; +} + +unsigned LoopVectorizationCostModel::getConsecutiveMemOpCost(Instruction *I, + unsigned VF) { + Type *ValTy = getMemInstValueType(I); + Type *VectorTy = ToVectorTy(ValTy, VF); + unsigned Alignment = getMemInstAlignment(I); + Value *Ptr = getPointerOperand(I); + unsigned AS = getMemInstAddressSpace(I); + int ConsecutiveStride = Legal->isConsecutivePtr(Ptr); + + assert((ConsecutiveStride == 1 || ConsecutiveStride == -1) && + "Stride should be 1 or -1 for consecutive memory access"); + unsigned Cost = 0; + if (Legal->isMaskRequired(I)) + Cost += TTI.getMaskedMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS); + else + Cost += TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS); + + bool Reverse = ConsecutiveStride < 0; + if (Reverse) + Cost += TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0); + return Cost; +} + +unsigned LoopVectorizationCostModel::getUniformMemOpCost(Instruction *I, + unsigned VF) { + LoadInst *LI = cast(I); + Type *ValTy = LI->getType(); + Type *VectorTy = ToVectorTy(ValTy, VF); + unsigned Alignment = LI->getAlignment(); + unsigned AS = LI->getPointerAddressSpace(); + + return TTI.getAddressComputationCost(ValTy) + + TTI.getMemoryOpCost(Instruction::Load, ValTy, Alignment, AS) + + TTI.getShuffleCost(TargetTransformInfo::SK_Broadcast, VectorTy); +} + +unsigned LoopVectorizationCostModel::getGatherScatterCost(Instruction *I, + unsigned VF) { + Type *ValTy = getMemInstValueType(I); + Type *VectorTy = ToVectorTy(ValTy, VF); + unsigned Alignment = getMemInstAlignment(I); + Value *Ptr = getPointerOperand(I); + + return TTI.getAddressComputationCost(VectorTy) + + TTI.getGatherScatterOpCost(I->getOpcode(), VectorTy, Ptr, + Legal->isMaskRequired(I), Alignment); +} + +unsigned LoopVectorizationCostModel::getInterleaveGroupCost(Instruction *I, + unsigned VF) { + Type *ValTy = getMemInstValueType(I); + Type *VectorTy = ToVectorTy(ValTy, VF); + unsigned AS = getMemInstAddressSpace(I); + + auto Group = Legal->getInterleavedAccessGroup(I); + assert(Group && "Fail to get an interleaved access group."); + + unsigned InterleaveFactor = Group->getFactor(); + Type *WideVecTy = VectorType::get(ValTy, VF * InterleaveFactor); + + // Holds the indices of existing members in an interleaved load group. + // An interleaved store group doesn't need this as it doesn't allow gaps. + SmallVector Indices; + if (isa(I)) { + for (unsigned i = 0; i < InterleaveFactor; i++) + if (Group->getMember(i)) + Indices.push_back(i); + } + + // Calculate the cost of the whole interleaved group. + unsigned Cost = TTI.getInterleavedMemoryOpCost(I->getOpcode(), WideVecTy, + Group->getFactor(), Indices, + Group->getAlignment(), AS); + + if (Group->isReverse()) + Cost += Group->getNumMembers() * + TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0); + return Cost; +} + +unsigned LoopVectorizationCostModel::getMemoryInstructionCost(Instruction *I, + unsigned VF) { + + // Calculate scalar cost only. Vectorization cost should be ready at this + // moment. 
+  if (VF == 1) {
+    Type *ValTy = getMemInstValueType(I);
+    unsigned Alignment = getMemInstAlignment(I);
+    unsigned AS = getMemInstAddressSpace(I);
+
+    return TTI.getAddressComputationCost(ValTy) +
+           TTI.getMemoryOpCost(I->getOpcode(), ValTy, Alignment, AS);
+  }
+  return getWideningCost(I, VF);
+}
+
+LoopVectorizationCostModel::VectorizationCostTy
+LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF) {
+  // If we know that this instruction will remain uniform, check the cost of
+  // the scalar version.
+  if (isUniformAfterVectorization(I, VF))
+    VF = 1;
+
+  if (VF > 1 && isProfitableToScalarize(I, VF))
+    return VectorizationCostTy(InstsToScalarize[VF][I], false);
+
+  Type *VectorTy;
+  unsigned C = getInstructionCost(I, VF, VectorTy);
+
+  bool TypeNotScalarized =
+      VF > 1 && !VectorTy->isVoidTy() && TTI.getNumberOfParts(VectorTy) < VF;
+  return VectorizationCostTy(C, TypeNotScalarized);
+}
+
+void LoopVectorizationCostModel::setCostBasedWideningDecision(unsigned VF) {
+  if (VF == 1)
+    return;
+  for (BasicBlock *BB : TheLoop->blocks()) {
+    // For each instruction in the old loop.
+    for (Instruction &I : *BB) {
+      Value *Ptr = getPointerOperand(&I);
+      if (!Ptr)
+        continue;
+
+      if (isa<LoadInst>(&I) && Legal->isUniform(Ptr)) {
+        // Scalar load + broadcast
+        unsigned Cost = getUniformMemOpCost(&I, VF);
+        setWideningDecision(&I, VF, CM_Scalarize, Cost);
+        continue;
+      }
+
+      // We assume that widening is the best solution when possible.
+      if (Legal->memoryInstructionCanBeWidened(&I, VF)) {
+        unsigned Cost = getConsecutiveMemOpCost(&I, VF);
+        setWideningDecision(&I, VF, CM_Widen, Cost);
+        continue;
+      }
+
+      // Choose between Interleaving, Gather/Scatter or Scalarization.
+      unsigned InterleaveCost = UINT_MAX;
+      unsigned NumAccesses = 1;
+      if (Legal->isAccessInterleaved(&I)) {
+        auto Group = Legal->getInterleavedAccessGroup(&I);
+        assert(Group && "Fail to get an interleaved access group.");
-    // If we've already analyzed the instruction, there's nothing to do.
-    if (ScalarCosts.count(I))
-      continue;
+        // Make one decision for the whole group.
+        if (getWideningDecision(&I, VF) != CM_Unknown)
+          continue;
-    // Compute the cost of the vector instruction. Note that this cost already
-    // includes the scalarization overhead of the predicated instruction.
-    unsigned VectorCost = getInstructionCost(I, VF).first;
+        NumAccesses = Group->getNumMembers();
+        InterleaveCost = getInterleaveGroupCost(&I, VF);
+      }
-    // Compute the cost of the scalarized instruction. This cost is the cost of
-    // the instruction as if it wasn't if-converted and instead remained in the
-    // predicated block. We will scale this cost by block probability after
-    // computing the scalarization overhead.
-    unsigned ScalarCost = VF * getInstructionCost(I, 1).first;
+      unsigned GatherScatterCost =
+          Legal->isLegalGatherOrScatter(&I)
+              ? getGatherScatterCost(&I, VF) * NumAccesses
+              : UINT_MAX;
-    // Compute the scalarization overhead of needed insertelement instructions
-    // and phi nodes.
-    if (Legal->isScalarWithPredication(I) && !I->getType()->isVoidTy()) {
-      ScalarCost += TTI.getScalarizationOverhead(ToVectorTy(I->getType(), VF),
-                                                 true, false);
-      ScalarCost += VF * TTI.getCFInstrCost(Instruction::PHI);
+      unsigned ScalarizationCost =
+          getMemInstScalarizationCost(&I, VF) * NumAccesses;
+
+      // Choose the better solution for the current VF,
+      // write down this decision and use it during vectorization. 
+ unsigned Cost; + InstWidening Decision; + if (InterleaveCost <= GatherScatterCost && + InterleaveCost < ScalarizationCost) { + Decision = CM_Interleave; + Cost = InterleaveCost; + } else if (GatherScatterCost < ScalarizationCost) { + Decision = CM_GatherScatter; + Cost = GatherScatterCost; + } else { + Decision = CM_Scalarize; + Cost = ScalarizationCost; + } + // If the instructions belongs to an interleave group, the whole group + // receives the same decision. The whole group receives the cost, but + // the cost will actually be assigned to one instruction. + if (auto Group = Legal->getInterleavedAccessGroup(&I)) + setWideningDecision(Group, VF, Decision, Cost); + else + setWideningDecision(&I, VF, Decision, Cost); } + } +} - // Compute the scalarization overhead of needed extractelement - // instructions. For each of the instruction's operands, if the operand can - // be scalarized, add it to the worklist; otherwise, account for the - // overhead. - for (Use &U : I->operands()) - if (auto *J = dyn_cast(U.get())) { - assert(VectorType::isValidElementType(J->getType()) && - "Instruction has non-scalar type"); - if (canBeScalarized(J)) - Worklist.push_back(J); - else if (needsExtract(J)) - ScalarCost += TTI.getScalarizationOverhead( - ToVectorTy(J->getType(),VF), false, true); +unsigned LoopVectorizationCostModel::getInstructionCost(Instruction *I, + unsigned VF, + Type *&VectorTy) { + Type *RetTy = I->getType(); + if (canTruncateToMinimalBitwidth(I, VF)) + RetTy = IntegerType::get(RetTy->getContext(), MinBWs[I]); + VectorTy = ToVectorTy(RetTy, VF); + auto SE = PSE.getSE(); + + // TODO: We need to estimate the cost of intrinsic calls. + switch (I->getOpcode()) { + case Instruction::GetElementPtr: + // We mark this instruction as zero-cost because the cost of GEPs in + // vectorized code depends on whether the corresponding memory instruction + // is scalarized or not. Therefore, we handle GEPs with the memory + // instruction cost. + return 0; + case Instruction::Br: { + return TTI.getCFInstrCost(I->getOpcode()); + } + case Instruction::PHI: { + auto *Phi = cast(I); + + // First-order recurrences are replaced by vector shuffles inside the loop. + if (VF > 1 && Legal->isFirstOrderRecurrence(Phi)) + return TTI.getShuffleCost(TargetTransformInfo::SK_ExtractSubvector, + VectorTy, VF - 1, VectorTy); + + // TODO: IF-converted IFs become selects. + return 0; + } + case Instruction::UDiv: + case Instruction::SDiv: + case Instruction::URem: + case Instruction::SRem: + // If we have a predicated instruction, it may not be executed for each + // vector lane. Get the scalarization cost and scale this amount by the + // probability of executing the predicated block. If the instruction is not + // predicated, we fall through to the next case. + if (VF > 1 && Legal->isScalarWithPredication(I)) { + unsigned Cost = 0; + + // These instructions have a non-void type, so account for the phi nodes + // that we will create. This cost is likely to be zero. The phi node + // cost, if any, should be scaled by the block probability because it + // models a copy at the end of each predicated block. + Cost += VF * TTI.getCFInstrCost(Instruction::PHI); + + // The cost of the non-predicated instruction. + Cost += VF * TTI.getArithmeticInstrCost(I->getOpcode(), RetTy); + + // The cost of insertelement and extractelement instructions needed for + // scalarization. + Cost += getScalarizationOverhead(I, VF, TTI); + + // Scale the cost by the probability of executing the predicated blocks. 
+ // This assumes the predicated block for each vector lane is equally + // likely. + return Cost / getReciprocalPredBlockProb(); + } + case Instruction::Add: + case Instruction::FAdd: + case Instruction::Sub: + case Instruction::FSub: + case Instruction::Mul: + case Instruction::FMul: + case Instruction::FDiv: + case Instruction::FRem: + case Instruction::Shl: + case Instruction::LShr: + case Instruction::AShr: + case Instruction::And: + case Instruction::Or: + case Instruction::Xor: { + // Since we will replace the stride by 1 the multiplication should go away. + if (I->getOpcode() == Instruction::Mul && isStrideMul(I, Legal)) + return 0; + // Certain instructions can be cheaper to vectorize if they have a constant + // second vector operand. One example of this are shifts on x86. + TargetTransformInfo::OperandValueKind Op1VK = + TargetTransformInfo::OK_AnyValue; + TargetTransformInfo::OperandValueKind Op2VK = + TargetTransformInfo::OK_AnyValue; + TargetTransformInfo::OperandValueProperties Op1VP = + TargetTransformInfo::OP_None; + TargetTransformInfo::OperandValueProperties Op2VP = + TargetTransformInfo::OP_None; + Value *Op2 = I->getOperand(1); + + // Check for a splat or for a non uniform vector of constants. + if (isa(Op2)) { + ConstantInt *CInt = cast(Op2); + if (CInt && CInt->getValue().isPowerOf2()) + Op2VP = TargetTransformInfo::OP_PowerOf2; + Op2VK = TargetTransformInfo::OK_UniformConstantValue; + } else if (isa(Op2) || isa(Op2)) { + Op2VK = TargetTransformInfo::OK_NonUniformConstantValue; + Constant *SplatValue = cast(Op2)->getSplatValue(); + if (SplatValue) { + ConstantInt *CInt = dyn_cast(SplatValue); + if (CInt && CInt->getValue().isPowerOf2()) + Op2VP = TargetTransformInfo::OP_PowerOf2; + Op2VK = TargetTransformInfo::OK_UniformConstantValue; } + } else if (Legal->isUniform(Op2)) { + Op2VK = TargetTransformInfo::OK_UniformValue; + } + SmallVector Operands(I->operand_values()); + return TTI.getArithmeticInstrCost(I->getOpcode(), VectorTy, Op1VK, + Op2VK, Op1VP, Op2VP, Operands); + } + case Instruction::Select: { + SelectInst *SI = cast(I); + const SCEV *CondSCEV = SE->getSCEV(SI->getCondition()); + bool ScalarCond = (SE->isLoopInvariant(CondSCEV, TheLoop)); + Type *CondTy = SI->getCondition()->getType(); + if (!ScalarCond) + CondTy = VectorType::get(CondTy, VF); - // Scale the total scalar cost by block probability. - ScalarCost /= getReciprocalPredBlockProb(); + return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, CondTy); + } + case Instruction::ICmp: + case Instruction::FCmp: { + Type *ValTy = I->getOperand(0)->getType(); + Instruction *Op0AsInstruction = dyn_cast(I->getOperand(0)); + if (canTruncateToMinimalBitwidth(Op0AsInstruction, VF)) + ValTy = IntegerType::get(ValTy->getContext(), MinBWs[Op0AsInstruction]); + VectorTy = ToVectorTy(ValTy, VF); + return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy); + } + case Instruction::Store: + case Instruction::Load: { + VectorTy = ToVectorTy(getMemInstValueType(I), VF); + return getMemoryInstructionCost(I, VF); + } + case Instruction::ZExt: + case Instruction::SExt: + case Instruction::FPToUI: + case Instruction::FPToSI: + case Instruction::FPExt: + case Instruction::PtrToInt: + case Instruction::IntToPtr: + case Instruction::SIToFP: + case Instruction::UIToFP: + case Instruction::Trunc: + case Instruction::FPTrunc: + case Instruction::BitCast: { + // We optimize the truncation of induction variables having constant + // integer steps. The cost of these truncations is the same as the scalar + // operation. 
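+    // Illustrative example: for an induction %iv = {0,+,1} of type i64,
+    // "%t = trunc i64 %iv to i32" is costed as a single scalar trunc, since
+    // the truncation is folded into the widened induction itself.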
+ if (isOptimizableIVTruncate(I, VF)) { + auto *Trunc = cast(I); + return TTI.getCastInstrCost(Instruction::Trunc, Trunc->getDestTy(), + Trunc->getSrcTy()); + } + + Type *SrcScalarTy = I->getOperand(0)->getType(); + Type *SrcVecTy = ToVectorTy(SrcScalarTy, VF); + if (canTruncateToMinimalBitwidth(I, VF)) { + // This cast is going to be shrunk. This may remove the cast or it might + // turn it into slightly different cast. For example, if MinBW == 16, + // "zext i8 %1 to i32" becomes "zext i8 %1 to i16". + // + // Calculate the modified src and dest types. + Type *MinVecTy = VectorTy; + if (I->getOpcode() == Instruction::Trunc) { + SrcVecTy = smallestIntegerVectorType(SrcVecTy, MinVecTy); + VectorTy = + largestIntegerVectorType(ToVectorTy(I->getType(), VF), MinVecTy); + } else if (I->getOpcode() == Instruction::ZExt || + I->getOpcode() == Instruction::SExt) { + SrcVecTy = largestIntegerVectorType(SrcVecTy, MinVecTy); + VectorTy = + smallestIntegerVectorType(ToVectorTy(I->getType(), VF), MinVecTy); + } + } - // Compute the discount. A non-negative discount means the vector version - // of the instruction costs more, and scalarizing would be beneficial. - Discount += VectorCost - ScalarCost; - ScalarCosts[I] = ScalarCost; + return TTI.getCastInstrCost(I->getOpcode(), VectorTy, SrcVecTy); } - - return Discount; + case Instruction::Call: { + bool NeedToScalarize; + CallInst *CI = cast(I); + unsigned CallCost = getVectorCallCost(CI, VF, TTI, TLI, NeedToScalarize); + if (getVectorIntrinsicIDForCall(CI, TLI)) + return std::min(CallCost, getVectorIntrinsicCost(CI, VF, TTI, TLI)); + return CallCost; + } + default: + // The cost of executing VF copies of the scalar instruction. This opcode + // is unknown. Assume that it is the same as 'mul'. + return VF * TTI.getArithmeticInstrCost(Instruction::Mul, VectorTy) + + getScalarizationOverhead(I, VF, TTI); + } // end of switch. } -LoopVectorizationCostModel::VectorizationCostTy -LoopVectorizationCostModel::expectedCost(unsigned VF) { - VectorizationCostTy Cost; - - // Collect Uniform and Scalar instructions after vectorization with VF. - collectUniformsAndScalars(VF); +char LoopVectorize::ID = 0; +static const char lv_name[] = "Loop Vectorization"; +INITIALIZE_PASS_BEGIN(LoopVectorize, LV_NAME, lv_name, false, false) +INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass) +INITIALIZE_PASS_DEPENDENCY(BasicAAWrapperPass) +INITIALIZE_PASS_DEPENDENCY(AAResultsWrapperPass) +INITIALIZE_PASS_DEPENDENCY(GlobalsAAWrapperPass) +INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker) +INITIALIZE_PASS_DEPENDENCY(BlockFrequencyInfoWrapperPass) +INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass) +INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass) +INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass) +INITIALIZE_PASS_DEPENDENCY(LoopAccessLegacyAnalysis) +INITIALIZE_PASS_DEPENDENCY(DemandedBitsWrapperPass) +INITIALIZE_PASS_DEPENDENCY(OptimizationRemarkEmitterWrapperPass) +INITIALIZE_PASS_END(LoopVectorize, LV_NAME, lv_name, false, false) - // Collect the instructions (and their associated costs) that will be more - // profitable to scalarize. - collectInstsToScalarize(VF); +namespace llvm { +Pass *createLoopVectorizePass(bool NoUnrolling, bool AlwaysVectorize) { + return new LoopVectorize(NoUnrolling, AlwaysVectorize); +} +} - // For each block. - for (BasicBlock *BB : TheLoop->blocks()) { - VectorizationCostTy BlockCost; +bool LoopVectorizationCostModel::isConsecutiveLoadOrStore(Instruction *Inst) { - // For each instruction in the old loop. 
- for (Instruction &I : *BB) { - // Skip dbg intrinsics. - if (isa(I)) - continue; + // Check if the pointer operand of a load or store instruction is + // consecutive. + if (auto *Ptr = getPointerOperand(Inst)) + return Legal->isConsecutivePtr(Ptr); + return false; +} - // Skip ignored values. - if (ValuesToIgnore.count(&I)) - continue; +void LoopVectorizationCostModel::collectValuesToIgnore() { + // Ignore ephemeral values. + CodeMetrics::collectEphemeralValues(TheLoop, AC, ValuesToIgnore); - VectorizationCostTy C = getInstructionCost(&I, VF); + // Ignore type-promoting instructions we identified during reduction + // detection. + for (auto &Reduction : *Legal->getReductionVars()) { + RecurrenceDescriptor &RedDes = Reduction.second; + SmallPtrSetImpl &Casts = RedDes.getCastInsts(); + VecValuesToIgnore.insert(Casts.begin(), Casts.end()); + } +} - // Check if we should override the cost. - if (ForceTargetInstructionCost.getNumOccurrences() > 0) - C.first = ForceTargetInstructionCost; +LoopVectorizationCostModel::VectorizationFactor +LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF, + unsigned MaxVF) { + if (UserVF) { + DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n"); + if (UserVF == 1) + return {UserVF, 0}; + assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two"); + // Collect Uniform and Scalar instructions after vectorization with VF. + CM->collectUniformsAndScalars(UserVF); + // Collect the instructions (and their associated costs) that will be more + // profitable to scalarize. + CM->collectInstsToScalarize(UserVF); + buildInitialVPlans(UserVF, UserVF); + DEBUG(printCurrentPlans("Initial VPlans", dbgs())); + optimizePredicatedInstructions(); + DEBUG(printCurrentPlans("After optimize predicated instructions", dbgs())); + return {UserVF, 0}; + } + if (MaxVF == 1) + return {1, 0}; + + assert(MaxVF > 1 && "MaxVF is zero."); + for (unsigned i = 2; i <= MaxVF; i *= 2) { + // Collect Uniform and Scalar instructions after vectorization with VF. + CM->collectUniformsAndScalars(i); + // Collect the instructions (and their associated costs) that will be more + // profitable to scalarize. + CM->collectInstsToScalarize(i); + } + buildInitialVPlans(2, MaxVF); + DEBUG(printCurrentPlans("Initial VPlans", dbgs())); + optimizePredicatedInstructions(); + DEBUG(printCurrentPlans("After optimize predicated instructions", dbgs())); + // Select the optimal vectorization factor. + return CM->selectVectorizationFactor(OptForSize, MaxVF); +} - BlockCost.first += C.first; - BlockCost.second |= C.second; - DEBUG(dbgs() << "LV: Found an estimated cost of " << C.first << " for VF " - << VF << " For instruction: " << I << '\n'); +void LoopVectorizationPlanner::printCurrentPlans(const std::string &Title, + raw_ostream &O) { + auto printPlan = [&](VPlan *Plan, const SmallVectorImpl &VFs, + const std::string &Prefix) { + std::string Title; + raw_string_ostream RSO(Title); + RSO << Prefix << " for VF="; + if (VFs.size() == 1) + RSO << VFs[0]; + else { + RSO << "{"; + bool First = true; + for (unsigned VF : VFs) { + if (!First) + RSO << ","; + RSO << VF; + First = false; + } + RSO << "}"; } + VPlanPrinter PlanPrinter(O, *Plan); + PlanPrinter.dump(RSO.str()); + }; - // If we are vectorizing a predicated block, it will have been - // if-converted. This means that the block's instructions (aside from - // stores and instructions that may divide by zero) will now be - // unconditionally executed. For the scalar case, we may not always execute - // the predicated block. 
Thus, scale the block's cost by the probability of - // executing it. - if (VF == 1 && Legal->blockNeedsPredication(BB)) - BlockCost.first /= getReciprocalPredBlockProb(); + if (VPlans.empty()) + return; - Cost.first += BlockCost.first; - Cost.second |= BlockCost.second; + VPlan *Current = VPlans.begin()->second.get(); + + SmallVector VFs; + for (auto &Entry : VPlans) { + VPlan *Plan = Entry.second.get(); + if (Plan != Current) { + // Hit another VPlan. Print the current VPlan for the VFs it served thus + // far and move on to the VPlan we just encountered. + printPlan(Current, VFs, Title); + Current = Plan; + VFs.clear(); + } + // Add VF to the list of VFs served by current VPlan. + VFs.push_back(Entry.first); } - - return Cost; + // Print the current VPlan. + printPlan(Current, VFs, Title); } -/// \brief Gets Address Access SCEV after verifying that the access pattern -/// is loop invariant except the induction variable dependence. -/// -/// This SCEV can be sent to the Target in order to estimate the address -/// calculation cost. -static const SCEV *getAddressAccessSCEV( - Value *Ptr, - LoopVectorizationLegality *Legal, - ScalarEvolution *SE, - const Loop *TheLoop) { - auto *Gep = dyn_cast(Ptr); - if (!Gep) - return nullptr; - - // We are looking for a gep with all loop invariant indices except for one - // which should be an induction variable. - unsigned NumOperands = Gep->getNumOperands(); - for (unsigned i = 1; i < NumOperands; ++i) { - Value *Opd = Gep->getOperand(i); - if (!SE->isLoopInvariant(SE->getSCEV(Opd), TheLoop) && - !Legal->isInductionVariable(Opd)) - return nullptr; +std::pair +LoopVectorizationPlanner::widenIntInduction(VPlan *Plan, unsigned StartRangeVF, + unsigned &EndRangeVF, PHINode *IV, + TruncInst *Trunc) { + // The value from the original loop to which we are mapping the new + // induction variable. + Instruction *EntryVal = Trunc ? cast(Trunc) : IV; + // Determine if we want a scalar version of the induction variable. This + // is true if the induction variable itself is not widened, or if it has + // at least one user in the loop that is not widened. + auto NeedsScalarInduction = [&](unsigned VF) -> bool { + if (shouldScalarizeInstruction(IV, VF)) + return true; + auto isScalarInst = [&](User *U) -> bool { + auto *I = cast(U); + return (TheLoop->contains(I) && shouldScalarizeInstruction(I, VF)); + }; + return any_of(IV->users(), isScalarInst); + }; + bool NeedsScalarIV = + testVFRange(NeedsScalarInduction, StartRangeVF, EndRangeVF); + // Generate the widening recipe. + auto *WIIRecipe = new VPWidenIntInductionRecipe(NeedsScalarIV, IV, Trunc); + if (!NeedsScalarIV) + return std::make_pair(WIIRecipe, nullptr); + + // Create scalar steps that can be used by instructions we will later + // scalarize. Note that the addition of the scalar steps will not + // increase the number of instructions in the loop in the common case + // prior to InstCombine. We will be trading one vector extract for + // each scalar step. + auto *BSSRecipe = new VPBuildScalarStepsRecipe(WIIRecipe, EntryVal, Plan); + // Determine the number of scalars we need to generate for each unroll + // iteration. If EntryVal is uniform, we only need to generate the + // first lane. Otherwise, we generate all VF values. 
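+  // A common uniform case (illustrative): an induction whose only users are
+  // the address computations of consecutive, widened memory accesses; only
+  // lane 0 of each unroll part is then needed.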
+ auto isUniformAfterVectorization = [&](unsigned VF) -> bool { + return CM->isUniformAfterVectorization(cast(EntryVal), VF); + }; + if (testVFRange(isUniformAfterVectorization, StartRangeVF, EndRangeVF)) { + VPlanUtilsLoopVectorizer PlanUtils(Plan); + PlanUtils.designateLaneZero(BSSRecipe); } - - // Now we know we have a GEP ptr, %inv, %ind, %inv. return the Ptr SCEV. - return SE->getSCEV(Ptr); -} - -static bool isStrideMul(Instruction *I, LoopVectorizationLegality *Legal) { - return Legal->hasStride(I->getOperand(0)) || - Legal->hasStride(I->getOperand(1)); + return std::make_pair(WIIRecipe, BSSRecipe); } -unsigned LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I, - unsigned VF) { - Type *ValTy = getMemInstValueType(I); - auto SE = PSE.getSE(); - - unsigned Alignment = getMemInstAlignment(I); - unsigned AS = getMemInstAddressSpace(I); - Value *Ptr = getPointerOperand(I); - Type *PtrTy = ToVectorTy(Ptr->getType(), VF); +// Determine if a given instruction will remain scalar after vectorization, +// for VF \p StartRangeVF. Reset \p EndRangeVF to the minimal VF where this +// decision does not hold, if it's less than the given \p EndRangeVF. +bool LoopVectorizationPlanner::willBeScalarized(Instruction *I, + unsigned StartRangeVF, + unsigned &EndRangeVF) { + if (!isa(I)) { + auto isScalarAfterVectorization = [&](unsigned VF) -> bool { + return CM->isScalarAfterVectorization(I, VF); + }; + if (testVFRange(isScalarAfterVectorization, StartRangeVF, EndRangeVF)) + return true; + } - // Figure out whether the access is strided and get the stride value - // if it's known in compile time - const SCEV *PtrSCEV = getAddressAccessSCEV(Ptr, Legal, SE, TheLoop); + if (isa(I)) { - // Get the cost of the scalar memory instruction and address computation. - unsigned Cost = VF * TTI.getAddressComputationCost(PtrTy, SE, PtrSCEV); + auto *CI = cast(I); + Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); + if (ID && (ID == Intrinsic::assume || ID == Intrinsic::lifetime_end || + ID == Intrinsic::lifetime_start)) + return true; - Cost += VF * - TTI.getMemoryOpCost(I->getOpcode(), ValTy->getScalarType(), Alignment, - AS); + // The following case may be scalarized depending on the VF. + // The flag shows whether we use Intrinsic or a usual Call for vectorized + // version of the instruction. + // Is it beneficial to perform intrinsic call compared to lib call? + auto WillBeScalarized = [&](unsigned VF) -> bool { + bool NeedToScalarize; + unsigned CallCost = getVectorCallCost(CI, VF, *TTI, TLI, NeedToScalarize); + bool UseVectorIntrinsic = + ID && getVectorIntrinsicCost(CI, VF, *TTI, TLI) <= CallCost; + return !UseVectorIntrinsic && NeedToScalarize; + }; + return testVFRange(WillBeScalarized, StartRangeVF, EndRangeVF); + } - // Get the overhead of the extractelement and insertelement instructions - // we might create due to scalarization. - Cost += getScalarizationOverhead(I, VF, TTI); + if (isa(I) || isa(I)) { - // If we have a predicated store, it may not be executed for each vector - // lane. Scale the cost by the probability of executing the predicated - // block. - if (Legal->isScalarWithPredication(I)) - Cost /= getReciprocalPredBlockProb(); + // TODO: refactor memoryInstructionMustBeScalarized() to invoke only the + // (last) part that depends on VF. 
+ auto WillBeScalarized = [&](unsigned VF) -> bool { + LoopVectorizationCostModel::InstWidening Decision = + CM->getWideningDecision(I, VF); + assert(Decision != LoopVectorizationCostModel::CM_Unknown && + "CM decision should be taken at this point"); + return Decision == LoopVectorizationCostModel::CM_Scalarize; + }; + return testVFRange(WillBeScalarized, StartRangeVF, EndRangeVF); + } + + static DenseSet VectorizableOpcodes = { + Instruction::Br, Instruction::PHI, Instruction::UDiv, + Instruction::SDiv, Instruction::SRem, Instruction::URem, + Instruction::Add, Instruction::FAdd, Instruction::Sub, + Instruction::FSub, Instruction::Mul, Instruction::FMul, + Instruction::FDiv, Instruction::FRem, Instruction::Shl, + Instruction::LShr, Instruction::AShr, Instruction::And, + Instruction::Or, Instruction::Xor, Instruction::Select, + Instruction::ICmp, Instruction::FCmp, Instruction::Store, + Instruction::Load, Instruction::ZExt, Instruction::SExt, + Instruction::FPToUI, Instruction::FPToSI, Instruction::FPExt, + Instruction::PtrToInt, Instruction::IntToPtr, Instruction::SIToFP, + Instruction::UIToFP, Instruction::Trunc, Instruction::FPTrunc, + Instruction::BitCast, Instruction::Call}; + + if (!VectorizableOpcodes.count(I->getOpcode())) + return true; - return Cost; + // Scalarize instructions found to be more profitable if scalarized. Limit + // EndRangeVF to the last VF this is continuously true for. + auto isProfitableToScalarize = [&](unsigned VF) -> bool { + return CM->isProfitableToScalarize(I, VF); + }; + return testVFRange(isProfitableToScalarize, StartRangeVF, EndRangeVF); } -unsigned LoopVectorizationCostModel::getConsecutiveMemOpCost(Instruction *I, - unsigned VF) { - Type *ValTy = getMemInstValueType(I); - Type *VectorTy = ToVectorTy(ValTy, VF); - unsigned Alignment = getMemInstAlignment(I); - Value *Ptr = getPointerOperand(I); - unsigned AS = getMemInstAddressSpace(I); - int ConsecutiveStride = Legal->isConsecutivePtr(Ptr); - - assert((ConsecutiveStride == 1 || ConsecutiveStride == -1) && - "Stride should be 1 or -1 for consecutive memory access"); - unsigned Cost = 0; - if (Legal->isMaskRequired(I)) - Cost += TTI.getMaskedMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS); - else - Cost += TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS); +unsigned LoopVectorizationPlanner::buildInitialVPlans(unsigned MinVF, + unsigned MaxVF) { + ILV->collectTriviallyDeadInstructions(TheLoop, Legal, DeadInstructions); - bool Reverse = ConsecutiveStride < 0; - if (Reverse) - Cost += TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0); - return Cost; -} + unsigned StartRangeVF = MinVF; + unsigned EndRangeVF = MaxVF + 1; -unsigned LoopVectorizationCostModel::getUniformMemOpCost(Instruction *I, - unsigned VF) { - LoadInst *LI = cast(I); - Type *ValTy = LI->getType(); - Type *VectorTy = ToVectorTy(ValTy, VF); - unsigned Alignment = LI->getAlignment(); - unsigned AS = LI->getPointerAddressSpace(); + unsigned i = 0; + for (; StartRangeVF < EndRangeVF; ++i) { + std::shared_ptr Plan = buildInitialVPlan(StartRangeVF, EndRangeVF); - return TTI.getAddressComputationCost(ValTy) + - TTI.getMemoryOpCost(Instruction::Load, ValTy, Alignment, AS) + - TTI.getShuffleCost(TargetTransformInfo::SK_Broadcast, VectorTy); -} + for (unsigned TmpVF = StartRangeVF; TmpVF < EndRangeVF; TmpVF *= 2) + VPlans[TmpVF] = Plan; -unsigned LoopVectorizationCostModel::getGatherScatterCost(Instruction *I, - unsigned VF) { - Type *ValTy = getMemInstValueType(I); - Type *VectorTy = ToVectorTy(ValTy, VF); - 
unsigned Alignment = getMemInstAlignment(I); - Value *Ptr = getPointerOperand(I); + StartRangeVF = EndRangeVF; + EndRangeVF = MaxVF + 1; + } - return TTI.getAddressComputationCost(VectorTy) + - TTI.getGatherScatterOpCost(I->getOpcode(), VectorTy, Ptr, - Legal->isMaskRequired(I), Alignment); + return i; } -unsigned LoopVectorizationCostModel::getInterleaveGroupCost(Instruction *I, - unsigned VF) { - Type *ValTy = getMemInstValueType(I); - Type *VectorTy = ToVectorTy(ValTy, VF); - unsigned AS = getMemInstAddressSpace(I); - - auto Group = Legal->getInterleavedAccessGroup(I); - assert(Group && "Fail to get an interleaved access group."); - - unsigned InterleaveFactor = Group->getFactor(); - Type *WideVecTy = VectorType::get(ValTy, VF * InterleaveFactor); +bool LoopVectorizationPlanner::testVFRange( + const std::function &Predicate, unsigned StartRangeVF, + unsigned &EndRangeVF) { + bool StartResult = Predicate(StartRangeVF); - // Holds the indices of existing members in an interleaved load group. - // An interleaved store group doesn't need this as it doesn't allow gaps. - SmallVector Indices; - if (isa(I)) { - for (unsigned i = 0; i < InterleaveFactor; i++) - if (Group->getMember(i)) - Indices.push_back(i); + for (unsigned TmpVF = StartRangeVF * 2; TmpVF < EndRangeVF; TmpVF *= 2) { + bool TmpResult = Predicate(TmpVF); + if (TmpResult != StartResult) { + EndRangeVF = TmpVF; + break; + } } - - // Calculate the cost of the whole interleaved group. - unsigned Cost = TTI.getInterleavedMemoryOpCost(I->getOpcode(), WideVecTy, - Group->getFactor(), Indices, - Group->getAlignment(), AS); - - if (Group->isReverse()) - Cost += Group->getNumMembers() * - TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0); - return Cost; + + return StartResult; } -unsigned LoopVectorizationCostModel::getMemoryInstructionCost(Instruction *I, - unsigned VF) { +std::shared_ptr +LoopVectorizationPlanner::buildInitialVPlan(unsigned StartRangeVF, + unsigned &EndRangeVF) { + + std::shared_ptr SharedPlan = std::make_shared(); + VPlan *Plan = SharedPlan.get(); + VPlanUtilsLoopVectorizer PlanUtils(Plan); + + // Create a dummy entry VPBasicBlock to start building the VPlan. + VPBlockBase *PreviousVPBlock = PlanUtils.createBasicBlock(); + VPBlockBase *PreEntry = PreviousVPBlock; + Plan->setEntry(PreEntry); // only to support printing during construction. + + // Return the interleave group a given instruction is part of in the context + // of a specific VF. + auto getInterleaveGroup = [&](Instruction *I, + unsigned VF) -> const InterleaveGroup * { + if (VF < 2) + return nullptr; // Query is illegal for VF == 1 + LoopVectorizationCostModel::InstWidening Decision = + CM->getWideningDecision(I, VF); + if (Decision != LoopVectorizationCostModel::CM_Interleave) + return nullptr; + const InterleaveGroup *IG = Legal->getInterleavedAccessGroup(I); + assert(IG && "Instruction to interleave not part of any group"); + return IG; + }; - // Calculate scalar cost only. Vectorization cost should be ready at this - // moment. - if (VF == 1) { - Type *ValTy = getMemInstValueType(I); - unsigned Alignment = getMemInstAlignment(I); - unsigned AS = getMemInstAlignment(I); + // Check if given Instruction should open an interleave group. 
+ auto isPrimaryIGMember = + [&](Instruction *I) -> std::function { + return [=](unsigned VF) -> bool { + const InterleaveGroup *IG = getInterleaveGroup(I, VF); + return IG && I == IG->getInsertPos(); + }; + }; - return TTI.getAddressComputationCost(ValTy) + - TTI.getMemoryOpCost(I->getOpcode(), ValTy, Alignment, AS); - } - return getWideningCost(I, VF); -} + // Check if given Instruction is handled as part of an interleave group. + auto isAdjunctIGMember = + [&](Instruction *I) -> std::function { + return [=](unsigned VF) -> bool { + const InterleaveGroup *IG = getInterleaveGroup(I, VF); + return IG && I != IG->getInsertPos(); + }; + }; -LoopVectorizationCostModel::VectorizationCostTy -LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF) { - // If we know that this instruction will remain uniform, check the cost of - // the scalar version. - if (isUniformAfterVectorization(I, VF)) - VF = 1; + /// Determine whether \p K is a truncation based on an induction variable that + /// can be optimized. + auto isOptimizableIVTruncate = + [&](Instruction *K) -> std::function { + return + [=](unsigned VF) -> bool { return CM->isOptimizableIVTruncate(K, VF); }; + }; - if (VF > 1 && isProfitableToScalarize(I, VF)) - return VectorizationCostTy(InstsToScalarize[VF][I], false); + // Scan the body of the loop in a topological order to visit each basic block + // after having visited its predecessor basic blocks. + LoopBlocksDFS DFS(TheLoop); + DFS.perform(LI); - Type *VectorTy; - unsigned C = getInstructionCost(I, VF, VectorTy); + for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) { + // Relevent instructions from basic block BB will be grouped into VPRecipe + // ingredients and fill a new VPBasicBlock. + VPBasicBlock *VPBB = nullptr; + VPOneByOneRecipeBase *LastOBORecipe = nullptr; + + auto appendRecipe = [&](VPRecipeBase *Recipe) -> void { + if (VPBB) + PlanUtils.appendRecipeToBasicBlock(Recipe, VPBB); + else { + VPBB = PlanUtils.createBasicBlock(Recipe); + PlanUtils.setSuccessor(PreviousVPBlock, VPBB); + PreviousVPBlock = VPBB; + } + LastOBORecipe = dyn_cast(Recipe); + }; - bool TypeNotScalarized = - VF > 1 && !VectorTy->isVoidTy() && TTI.getNumberOfParts(VectorTy) < VF; - return VectorizationCostTy(C, TypeNotScalarized); -} + for (auto I = BB->begin(), E = BB->end(); I != E; ++I) { + Instruction *Instr = &*I; -void LoopVectorizationCostModel::setCostBasedWideningDecision(unsigned VF) { - if (VF == 1) - return; - for (BasicBlock *BB : TheLoop->blocks()) { - // For each instruction in the old loop. - for (Instruction &I : *BB) { - Value *Ptr = getPointerOperand(&I); - if (!Ptr) + // Filter out irrelevant instructions. + if (DeadInstructions.count(Instr) || isa(Instr) || + isa(Instr)) continue; - if (isa(&I) && Legal->isUniform(Ptr)) { - // Scalar load + broadcast - unsigned Cost = getUniformMemOpCost(&I, VF); - setWideningDecision(&I, VF, CM_Scalarize, Cost); - continue; + if (isa(Instr) || isa(Instr)) { + // Ignore IG's adjunct members - will be handled by the interleave group + // recipe to be generated by the primary member of the interleave group + // which is the insertion point and bears the cost for the entire group. + if (testVFRange(isAdjunctIGMember(Instr), StartRangeVF, EndRangeVF)) + continue; + + if (testVFRange(isPrimaryIGMember(Instr), StartRangeVF, EndRangeVF)) { + // Instr points to the insert position of an interleave group: first + // load or last store. 
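+          // The single recipe created here covers every member of the group;
+          // e.g. (illustrative) a factor-2 load group {A[2i], A[2i+1]} is
+          // later emitted as one wide load plus shuffles for both members.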
+ const InterleaveGroup *IG = Legal->getInterleavedAccessGroup(Instr); + appendRecipe(new VPInterleaveRecipe(IG, Plan)); + continue; + } } - // We assume that widening is the best solution when possible. - if (Legal->memoryInstructionCanBeWidened(&I, VF)) { - unsigned Cost = getConsecutiveMemOpCost(&I, VF); - setWideningDecision(&I, VF, CM_Widen, Cost); + if (Legal->isScalarWithPredication(Instr)) { + // Instructions marked for predication are scalarized and placed under + // an if-then construct to prevent side-effects. + DEBUG(dbgs() << "LV: Scalarizing and predicating:" << *Instr << '\n'); + + // Build the triangular if-then region. Start with VPBB holding Instr. + BasicBlock::iterator J = I; + VPRecipeBase *Recipe = new VPScalarizeOneByOneRecipe(I, ++J, Plan); + VPBB = PlanUtils.createBasicBlock(Recipe); + + // Build the entry and exit VPBB's of the triangle. + VPRegionBlock *Region = PlanUtils.createRegion(true); + VPExtractMaskBitRecipe *R = new VPExtractMaskBitRecipe(&*BB); + VPBasicBlock *Entry = PlanUtils.createBasicBlock(R); + Recipe = new VPMergeScalarizeBranchRecipe(Instr); + VPBasicBlock *Exit = PlanUtils.createBasicBlock(Recipe); + // Note: first set Entry as region entry and then connect successors + // starting from it in order, to propagate the "parent" of each + // VPBasicBlock. + PlanUtils.setRegionEntry(Region, Entry); + PlanUtils.setRegionExit(Region, Exit); + PlanUtils.setTwoSuccessors(Entry, R, VPBB, Exit); + PlanUtils.setSuccessor(VPBB, Exit); + PlanUtils.setSuccessor(PreviousVPBlock, Region); + PreviousVPBlock = Region; + + // Next instructions should start forming a VPBasicBlock of their own. + VPBB = nullptr; + LastOBORecipe = nullptr; + + // Record predicated instructions for later optimizations. + PredicatedInstructions.insert(&*I); + continue; } - // Choose between Interleaving, Gather/Scatter or Scalarization. - unsigned InterleaveCost = UINT_MAX; - unsigned NumAccesses = 1; - if (Legal->isAccessInterleaved(&I)) { - auto Group = Legal->getInterleavedAccessGroup(&I); - assert(Group && "Fail to get an interleaved access group."); + // Check if this is an integer induction. If so, build the recipes that + // produce its scalar and vector values. - // Make one decision for the whole group. - if (getWideningDecision(&I, VF) != CM_Unknown) + if (PHINode *Phi = dyn_cast(Instr)) { + InductionDescriptor II = Legal->getInductionVars()->lookup(Phi); + if (II.getKind() == InductionDescriptor::IK_IntInduction) { + auto Recipes = widenIntInduction(Plan, StartRangeVF, EndRangeVF, Phi); + appendRecipe(Recipes.first); + if (Recipes.second) + appendRecipe(Recipes.second); continue; - - NumAccesses = Group->getNumMembers(); - InterleaveCost = getInterleaveGroupCost(&I, VF); + } } - unsigned GatherScatterCost = - Legal->isLegalGatherOrScatter(&I) - ? getGatherScatterCost(&I, VF) * NumAccesses - : UINT_MAX; - - unsigned ScalarizationCost = - getMemInstScalarizationCost(&I, VF) * NumAccesses; + // Optimize the special case where the source is a constant integer + // induction variable. Notice that we can only optimize the 'trunc' case + // because (a) FP conversions lose precision, (b) sext/zext may wrap, and + // (c) other casts depend on pointer size. 
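+      // In that case the trunc is absorbed by the widened induction recipe
+      // created below rather than being emitted as a separate cast.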
+ if (isa(Instr) && testVFRange(isOptimizableIVTruncate(Instr), + StartRangeVF, EndRangeVF)) { + auto *InductionPhi = cast(Instr->getOperand(0)); + auto Recipes = widenIntInduction(Plan, StartRangeVF, EndRangeVF, + InductionPhi, cast(Instr)); + appendRecipe(Recipes.first); + if (Recipes.second) + appendRecipe(Recipes.second); + continue; + } - // Choose better solution for the current VF, - // write down this decision and use it during vectorization. - unsigned Cost; - InstWidening Decision; - if (InterleaveCost <= GatherScatterCost && - InterleaveCost < ScalarizationCost) { - Decision = CM_Interleave; - Cost = InterleaveCost; - } else if (GatherScatterCost < ScalarizationCost) { - Decision = CM_GatherScatter; - Cost = GatherScatterCost; - } else { - Decision = CM_Scalarize; - Cost = ScalarizationCost; + // Check if instruction is to be replicated. + bool Scalarized = willBeScalarized(Instr, StartRangeVF, EndRangeVF); + DEBUG(if (Scalarized) dbgs() << "LV: Scalarizing:" << *Instr << "\n"); + + // Default: vectorize/scalarize this instruction using a one-by-one + // recipe. We optimize the common case where consecutive instructions + // can be represented by a single OBO recipe. + if (!LastOBORecipe || LastOBORecipe->isScalarizing() != Scalarized || + !PlanUtils.appendInstruction(LastOBORecipe, Instr)) { + auto J = I; + appendRecipe(PlanUtils.createOneByOneRecipe(I, ++J, Plan, Scalarized)); } - // If the instructions belongs to an interleave group, the whole group - // receives the same decision. The whole group receives the cost, but - // the cost will actually be assigned to one instruction. - if (auto Group = Legal->getInterleavedAccessGroup(&I)) - setWideningDecision(Group, VF, Decision, Cost); - else - setWideningDecision(&I, VF, Decision, Cost); } } + // PreviousVPBlock now holds the exit block of Plan. + // Set entry block of Plan to the successor of PreEntry, and discard PreEntry. + assert(PreEntry->getSuccessors().size() == 1 && "Plan has no single entry."); + VPBlockBase *Entry = PreEntry->getSuccessors().front(); + PlanUtils.disconnectBlocks(PreEntry, Entry); + Plan->setEntry(Entry); + delete PreEntry; + + // FOR STRESS TESTING, uncomment the following: + // EndRangeVF = StartRangeVF * 2; + + return SharedPlan; } -unsigned LoopVectorizationCostModel::getInstructionCost(Instruction *I, - unsigned VF, - Type *&VectorTy) { - Type *RetTy = I->getType(); - if (canTruncateToMinimalBitwidth(I, VF)) - RetTy = IntegerType::get(RetTy->getContext(), MinBWs[I]); - VectorTy = ToVectorTy(RetTy, VF); - auto SE = PSE.getSE(); +void LoopVectorizationPlanner::sinkScalarOperands(Instruction *PredInst, + VPlan *Plan) { + VPlanUtilsLoopVectorizer PlanUtils(Plan); - // TODO: We need to estimate the cost of intrinsic calls. - switch (I->getOpcode()) { - case Instruction::GetElementPtr: - // We mark this instruction as zero-cost because the cost of GEPs in - // vectorized code depends on whether the corresponding memory instruction - // is scalarized or not. Therefore, we handle GEPs with the memory - // instruction cost. - return 0; - case Instruction::Br: { - return TTI.getCFInstrCost(I->getOpcode()); - } - case Instruction::PHI: { - auto *Phi = cast(I); + // The recipe containing the predicated instruction. + VPBasicBlock *PredBB = Plan->getBasicBlock(PredInst); - // First-order recurrences are replaced by vector shuffles inside the loop. 
- if (VF > 1 && Legal->isFirstOrderRecurrence(Phi)) - return TTI.getShuffleCost(TargetTransformInfo::SK_ExtractSubvector, - VectorTy, VF - 1, VectorTy); + // Initialize a worklist with the operands of the predicated instruction. + SetVector Worklist(PredInst->op_begin(), PredInst->op_end()); - // TODO: IF-converted IFs become selects. - return 0; - } - case Instruction::UDiv: - case Instruction::SDiv: - case Instruction::URem: - case Instruction::SRem: - // If we have a predicated instruction, it may not be executed for each - // vector lane. Get the scalarization cost and scale this amount by the - // probability of executing the predicated block. If the instruction is not - // predicated, we fall through to the next case. - if (VF > 1 && Legal->isScalarWithPredication(I)) { - unsigned Cost = 0; + // Holds instructions that we need to analyze again. An instruction may be + // reanalyzed if we don't yet know if we can sink it or not. + SmallVector InstsToReanalyze; - // These instructions have a non-void type, so account for the phi nodes - // that we will create. This cost is likely to be zero. The phi node - // cost, if any, should be scaled by the block probability because it - // models a copy at the end of each predicated block. - Cost += VF * TTI.getCFInstrCost(Instruction::PHI); + // Iteratively sink the scalarized operands of the predicated instruction + // into the block we created for it. When an instruction is sunk, it's + // operands are then added to the worklist. The algorithm ends after one pass + // through the worklist doesn't sink a single instruction. + bool Changed; + do { - // The cost of the non-predicated instruction. - Cost += VF * TTI.getArithmeticInstrCost(I->getOpcode(), RetTy); + // Add the instructions that need to be reanalyzed to the worklist, and + // reset the changed indicator. + Worklist.insert(InstsToReanalyze.begin(), InstsToReanalyze.end()); + InstsToReanalyze.clear(); + Changed = false; - // The cost of insertelement and extractelement instructions needed for - // scalarization. - Cost += getScalarizationOverhead(I, VF, TTI); + while (!Worklist.empty()) { + auto *I = dyn_cast(Worklist.pop_back_val()); + if (!I) + continue; - // Scale the cost by the probability of executing the predicated blocks. - // This assumes the predicated block for each vector lane is equally - // likely. - return Cost / getReciprocalPredBlockProb(); - } - case Instruction::Add: - case Instruction::FAdd: - case Instruction::Sub: - case Instruction::FSub: - case Instruction::Mul: - case Instruction::FMul: - case Instruction::FDiv: - case Instruction::FRem: - case Instruction::Shl: - case Instruction::LShr: - case Instruction::AShr: - case Instruction::And: - case Instruction::Or: - case Instruction::Xor: { - // Since we will replace the stride by 1 the multiplication should go away. - if (I->getOpcode() == Instruction::Mul && isStrideMul(I, Legal)) - return 0; - // Certain instructions can be cheaper to vectorize if they have a constant - // second vector operand. One example of this are shifts on x86. - TargetTransformInfo::OperandValueKind Op1VK = - TargetTransformInfo::OK_AnyValue; - TargetTransformInfo::OperandValueKind Op2VK = - TargetTransformInfo::OK_AnyValue; - TargetTransformInfo::OperandValueProperties Op1VP = - TargetTransformInfo::OP_None; - TargetTransformInfo::OperandValueProperties Op2VP = - TargetTransformInfo::OP_None; - Value *Op2 = I->getOperand(1); + // We do not sink other predicated instructions. 
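+      // Such an instruction already sits in its own predicated region under
+      // its own mask; sinking it under this block's predicate as well is not
+      // supported here.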
+ if (Legal->isScalarWithPredication(I)) + continue; - // Check for a splat or for a non uniform vector of constants. - if (isa(Op2)) { - ConstantInt *CInt = cast(Op2); - if (CInt && CInt->getValue().isPowerOf2()) - Op2VP = TargetTransformInfo::OP_PowerOf2; - Op2VK = TargetTransformInfo::OK_UniformConstantValue; - } else if (isa(Op2) || isa(Op2)) { - Op2VK = TargetTransformInfo::OK_NonUniformConstantValue; - Constant *SplatValue = cast(Op2)->getSplatValue(); - if (SplatValue) { - ConstantInt *CInt = dyn_cast(SplatValue); - if (CInt && CInt->getValue().isPowerOf2()) - Op2VP = TargetTransformInfo::OP_PowerOf2; - Op2VK = TargetTransformInfo::OK_UniformConstantValue; + VPRecipeBase *Recipe = Plan->getRecipe(I); + + // We can't sink live-ins. + if (!Recipe) + continue; + VPBasicBlock *BasicBlock = Recipe->getParent(); + assert(BasicBlock && "Recipe not in any basic block"); + + // We can't sink an instruction that isn't being scalarized. + if (!isa(Recipe) && + !isa(Recipe)) + continue; + + // We can't sink an instruction if it is already in the predicated block, + // is not in the VPlan, or may have side effects. + if (BasicBlock == PredBB || I->mayHaveSideEffects()) + continue; + + // Handle phi nodes last to make sure that any user they may have has sunk + // by now. This is relevant for induction variables that feed uniform GEPs + // which may or may not sink. + if (isa(I)) { + auto IsNotAPhi = [&](Value *V) -> bool { return isa(V); }; + if (any_of(Worklist, IsNotAPhi) || + any_of(InstsToReanalyze, IsNotAPhi)) { + InstsToReanalyze.push_back(I); + continue; + } + } + + bool HasVectorizedUses = false; + bool AllScalarizedUsesInPredicatedBlock = true; + unsigned MinLaneToSink = 0; + for (auto &U : I->uses()) { + auto *UI = cast(U.getUser()); + VPRecipeBase *UserRecipe = Plan->getRecipe(UI); + // Generated scalarized instructions don't serve users outside of the + // VPlan, so we can safely ignore users that have no recipe. + if (!UserRecipe) + continue; + + // GEPs used as the uniform address of a wide memory operation must not + // sink lane zero. + if (isa(UserRecipe)) { + assert(isa(I) && + "Non-GEP used in interleave group"); + MinLaneToSink = std::max(MinLaneToSink, 1u); + continue; + } + + // Wide memory operations do not use any of the scalarized GEPs but + // generate their own GEPs. + if (isa(UserRecipe) && + isa(I) && + (isa(UI) || isa(UI)) && + Legal->isConsecutivePtr(I)) { + continue; + } + + if (!(isa(UserRecipe) || + isa(UserRecipe))) { + // All of I's lanes are used by an instruction we can't sink. + HasVectorizedUses = true; + break; + } + + // Induction variables feeding consecutive GEPs can be indirectly used + // by vectorized load/stores which generate their own GEP rather than + // reuse the scalarized one (unlike load/store in interleave groups). + // In such a case, we can sink all lanes but lane zero. Note that we + // can do this whether or not the GEP is used within the predicated + // block (i.e. whether it will sink its own lanes 1..VF-1). + if (isa(UI) && Legal->isConsecutivePtr(UI) && + isa(Recipe)) { + auto IsVectorizedMemoryOperation = [&](User *U) -> bool { + if (!(isa(U) || isa(U))) + return false; + VPRecipeBase *Recipe = Plan->getRecipe(cast(U)); + return Recipe && isa(Recipe); + }; + + if (any_of(UI->users(), IsVectorizedMemoryOperation)) { + MinLaneToSink = std::max(MinLaneToSink, 1u); + continue; + } + } + + if (UserRecipe->getParent() != PredBB) { + // Don't make a decision until all scalarized users have sunk. 
+ AllScalarizedUsesInPredicatedBlock = false; + continue; + } + + // Ok to sink w.r.t this use, but no more lanes than what the user + // itself has sunk. + VPLaneRange DesignatedLanes; + if (auto *BSS = dyn_cast(UserRecipe)) + DesignatedLanes = BSS->getDesignatedLanes(); + else + DesignatedLanes = + cast(UserRecipe)->getDesignatedLanes(); + VPLaneRange SinkableLanes = + VPLaneRange::intersect(VPLaneRange(MinLaneToSink), DesignatedLanes); + MinLaneToSink = SinkableLanes.getMinLane(); + } + + if (HasVectorizedUses) + continue; // This instruction cannot be sunk. + + // It's legal to sink the instruction if all its uses occur in the + // predicated block. Otherwise, there's nothing to do yet, and we may + // need to reanalyze the instruction. + if (!AllScalarizedUsesInPredicatedBlock) { + InstsToReanalyze.push_back(I); + continue; } - } else if (Legal->isUniform(Op2)) { - Op2VK = TargetTransformInfo::OK_UniformValue; + + // Move the instruction to the beginning of the predicated block, and add + // it's operands to the worklist (except for phi nodes). + PlanUtils.sinkInstruction(I, PredBB, MinLaneToSink); + if (!isa(I)) + Worklist.insert(I->op_begin(), I->op_end()); + + // The sinking may have enabled other instructions to be sunk, so we will + // need to iterate. + Changed = true; } - SmallVector Operands(I->operand_values()); - return TTI.getArithmeticInstrCost(I->getOpcode(), VectorTy, Op1VK, - Op2VK, Op1VP, Op2VP, Operands); - } - case Instruction::Select: { - SelectInst *SI = cast(I); - const SCEV *CondSCEV = SE->getSCEV(SI->getCondition()); - bool ScalarCond = (SE->isLoopInvariant(CondSCEV, TheLoop)); - Type *CondTy = SI->getCondition()->getType(); - if (!ScalarCond) - CondTy = VectorType::get(CondTy, VF); + } while (Changed); +} - return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, CondTy); +void LoopVectorizationPlanner::assignScalarVectorConversions( + Instruction *PredInst, VPlan *Plan) { + + // NFC: Let Def's recipe generate the vector version of Def, but only + // if all of Def's users are vectorized. This is the equivalent to the + // previous predicateInstructions by which an insert-element got hoisted + // into the matching predicated basic block if it is the only user of + // the predicated instruction. + + if (PredInst->use_empty()) + return; + + for (User *U : PredInst->users()) { + Instruction *UserInst = dyn_cast(U); + if (!UserInst) + continue; + + VPRecipeBase *UserRecipe = Plan->getRecipe(UserInst); + if (!UserRecipe) // User is not part of the plan. + return; + + if (dyn_cast(UserRecipe)) + continue; + + // Found a user that will not be using the vector form of the predicated + // instruction. The insert-element is not going to be the only user, so + // do not hoist it. 
+ return; } - case Instruction::ICmp: - case Instruction::FCmp: { - Type *ValTy = I->getOperand(0)->getType(); - Instruction *Op0AsInstruction = dyn_cast(I->getOperand(0)); - if (canTruncateToMinimalBitwidth(Op0AsInstruction, VF)) - ValTy = IntegerType::get(ValTy->getContext(), MinBWs[Op0AsInstruction]); - VectorTy = ToVectorTy(ValTy, VF); - return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy); + + Plan->getRecipe(PredInst)->addAlsoPackOrUnpack(PredInst); +} + +bool LoopVectorizationPlanner::shouldScalarizeInstruction(Instruction *I, + unsigned VF) const { + return CM->isScalarAfterVectorization(I, VF) || + CM->isProfitableToScalarize(I, VF); +} + +void LoopVectorizationPlanner::optimizePredicatedInstructions() { + VPlan *PrevPlan = nullptr; + for (auto &It : VPlans) { + VPlan *Plan = It.second.get(); + if (Plan == PrevPlan) + continue; + for (auto *PredInst : PredicatedInstructions) { + sinkScalarOperands(PredInst, Plan); + assignScalarVectorConversions(PredInst, Plan); + } + PrevPlan = Plan; } - case Instruction::Store: - case Instruction::Load: { - VectorTy = ToVectorTy(getMemInstValueType(I), VF); - return getMemoryInstructionCost(I, VF); +} + +void LoopVectorizationPlanner::setBestPlan(unsigned VF, unsigned UF) { + DEBUG(dbgs() << "Setting best plan to VF=" << VF << ", UF=" << UF << '\n'); + BestVF = VF; + BestUF = UF; + + assert(VPlans.count(VF) && "Best VF does not have a VPlan."); + // Delete all other VPlans. + for (auto &Entry : VPlans) { + if (Entry.first != VF) + VPlans.erase(Entry.first); } - case Instruction::ZExt: - case Instruction::SExt: - case Instruction::FPToUI: - case Instruction::FPToSI: - case Instruction::FPExt: - case Instruction::PtrToInt: - case Instruction::IntToPtr: - case Instruction::SIToFP: - case Instruction::UIToFP: - case Instruction::Trunc: - case Instruction::FPTrunc: - case Instruction::BitCast: { - // We optimize the truncation of induction variables having constant - // integer steps. The cost of these truncations is the same as the scalar - // operation. - if (isOptimizableIVTruncate(I, VF)) { - auto *Trunc = cast(I); - return TTI.getCastInstrCost(Instruction::Trunc, Trunc->getDestTy(), - Trunc->getSrcTy()); - } +} - Type *SrcScalarTy = I->getOperand(0)->getType(); - Type *SrcVecTy = ToVectorTy(SrcScalarTy, VF); - if (canTruncateToMinimalBitwidth(I, VF)) { - // This cast is going to be shrunk. This may remove the cast or it might - // turn it into slightly different cast. For example, if MinBW == 16, - // "zext i8 %1 to i32" becomes "zext i8 %1 to i16". - // - // Calculate the modified src and dest types. - Type *MinVecTy = VectorTy; - if (I->getOpcode() == Instruction::Trunc) { - SrcVecTy = smallestIntegerVectorType(SrcVecTy, MinVecTy); - VectorTy = - largestIntegerVectorType(ToVectorTy(I->getType(), VF), MinVecTy); - } else if (I->getOpcode() == Instruction::ZExt || - I->getOpcode() == Instruction::SExt) { - SrcVecTy = largestIntegerVectorType(SrcVecTy, MinVecTy); - VectorTy = - smallestIntegerVectorType(ToVectorTy(I->getType(), VF), MinVecTy); - } - } +void LoopVectorizationPlanner::executeBestPlan(InnerLoopVectorizer &LB) { + ILV = &LB; - return TTI.getCastInstrCost(I->getOpcode(), VectorTy, SrcVecTy); + // Perform the actual loop widening (vectorization). + // 1. Create a new empty loop. Unlink the old loop and connect the new one. + ILV->createEmptyLoop(); + + // 2. Widen each instruction in the old loop to a new one in the new loop. 
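+  //    Code generation is driven by the single VPlan retained by
+  //    setBestPlan(): VPTransformState carries the chosen VF and UF plus the
+  //    IR builder, and each recipe's vectorize() emits the corresponding
+  //    instructions.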
+ + VPTransformState State{BestVF, BestUF, LI, ILV->DT, + ILV->Builder, ILV, Legal, CM}; + State.CFG.PrevBB = ILV->LoopVectorPreHeader; + + VPlan *Plan = getVPlanForVF(BestVF); + + Plan->vectorize(&State); + + // 3. Take care of phi's to fix: reduction, 1st-order-recurrence, loop-closed. + ILV->vectorizeLoop(); +} + +void VPVectorizeOneByOneRecipe::transformIRInstruction( + Instruction *I, VPTransformState &State) { + assert(I && "No instruction to vectorize."); + State.ILV->vectorizeInstruction(*I); + if (willAlsoPackOrUnpack(I)) { // Unpack instruction + for (unsigned Part = 0; Part < State.UF; ++Part) + for (unsigned Lane = 0; Lane < State.VF; ++Lane) + State.ILV->getScalarValue(I, Part, Lane); } - case Instruction::Call: { - bool NeedToScalarize; - CallInst *CI = cast(I); - unsigned CallCost = getVectorCallCost(CI, VF, TTI, TLI, NeedToScalarize); - if (getVectorIntrinsicIDForCall(CI, TLI)) - return std::min(CallCost, getVectorIntrinsicCost(CI, VF, TTI, TLI)); - return CallCost; +} + +void VPScalarizeOneByOneRecipe::transformIRInstruction( + Instruction *I, VPTransformState &State) { + assert(I && "No instruction to vectorize."); + // By default generate scalar instances for all VF lanes of all UF parts. + // If the instruction is uniform, generate only the first lane for each + // of the UF parts. + bool IsUniform = State.Cost->isUniformAfterVectorization(I, State.VF); + unsigned MinLane = 0; + unsigned MaxLane = IsUniform ? 0 : State.VF - 1; + unsigned MinPart = 0; + unsigned MaxPart = State.UF - 1; + + if (State.Instance) { + // Asked to create an instance for a specific lane and a specific part. + assert(!IsUniform && + "Uniform instruction vectorized for a specific instance."); + MinLane = State.Instance->Lane; + MaxLane = MinLane; + MinPart = State.Instance->Part; + MaxPart = MinPart; + } + + // Intersect requested lanes with the designated lanes for this recipe. + VPLaneRange ActiveLanes(MinLane, MaxLane); + VPLaneRange EffectiveLanes = + VPLaneRange::intersect(ActiveLanes, DesignatedLanes); + if (EffectiveLanes.isEmpty()) + return; // None of the requested lanes is designated for this recipe. + + // Generate relevant lanes. + State.ILV->scalarizeInstruction(I, MinPart, MaxPart, + EffectiveLanes.getMinLane(), + EffectiveLanes.getMaxLane()); + if (willAlsoPackOrUnpack(I)) { + if (State.Instance) + // Insert scalar instance packing it into a vector. + State.ILV->constructVectorValue(I, MinPart, MinLane); + else + // Broadcast or group together all instances into a vector. + State.ILV->getVectorValue(I); } - default: - // The cost of executing VF copies of the scalar instruction. This opcode - // is unknown. Assume that it is the same as 'mul'. - return VF * TTI.getArithmeticInstrCost(Instruction::Mul, VectorTy) + - getScalarizationOverhead(I, VF, TTI); - } // end of switch. 
} -char LoopVectorize::ID = 0; -static const char lv_name[] = "Loop Vectorization"; -INITIALIZE_PASS_BEGIN(LoopVectorize, LV_NAME, lv_name, false, false) -INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass) -INITIALIZE_PASS_DEPENDENCY(BasicAAWrapperPass) -INITIALIZE_PASS_DEPENDENCY(AAResultsWrapperPass) -INITIALIZE_PASS_DEPENDENCY(GlobalsAAWrapperPass) -INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker) -INITIALIZE_PASS_DEPENDENCY(BlockFrequencyInfoWrapperPass) -INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass) -INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass) -INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass) -INITIALIZE_PASS_DEPENDENCY(LoopAccessLegacyAnalysis) -INITIALIZE_PASS_DEPENDENCY(DemandedBitsWrapperPass) -INITIALIZE_PASS_DEPENDENCY(OptimizationRemarkEmitterWrapperPass) -INITIALIZE_PASS_END(LoopVectorize, LV_NAME, lv_name, false, false) +void VPWidenIntInductionRecipe::vectorize(VPTransformState &State) { + assert(State.Instance == nullptr && "Int induction being replicated"); + auto BuildScalarInfo = State.ILV->widenIntInduction(NeedsScalarIV, IV, Trunc); + ScalarIV = BuildScalarInfo.first; + Step = BuildScalarInfo.second; +} -namespace llvm { -Pass *createLoopVectorizePass(bool NoUnrolling, bool AlwaysVectorize) { - return new LoopVectorize(NoUnrolling, AlwaysVectorize); +void VPWidenIntInductionRecipe::print(raw_ostream &O) const { + O << "Widen int induction"; + if (NeedsScalarIV) + O << " (needs scalars)"; + O << ":\n"; + O << *IV; + if (Trunc) + O << "\n" << *Trunc << ")"; } + +void VPBuildScalarStepsRecipe::vectorize(VPTransformState &State) { + // By default generate scalar instances for all VF lanes of all UF parts. + // If the instruction is uniform, generate only the first lane for each + // of the UF parts. + bool IsUniform = State.Cost->isUniformAfterVectorization(EntryVal, State.VF); + unsigned MinLane = 0; + unsigned MaxLane = IsUniform ? 0 : State.VF - 1; + unsigned MinPart = 0; + unsigned MaxPart = State.UF - 1; + + if (State.Instance) { + // Asked to create an instance for a specific lane and a specific part. + MinLane = State.Instance->Lane; + MaxLane = MinLane; + MinPart = State.Instance->Part; + MaxPart = MinPart; + } + + // Intersect requested lanes with the designated lanes for this recipe. + VPLaneRange ActiveLanes(MinLane, MaxLane); + VPLaneRange EffectiveLanes = + VPLaneRange::intersect(ActiveLanes, DesignatedLanes); + if (EffectiveLanes.isEmpty()) + return; // None of the requested lanes is designated for this recipe. + + // Generate relevant lanes. + State.ILV->buildScalarSteps(WII->getScalarIV(), WII->getStep(), EntryVal, + MinPart, MaxPart, EffectiveLanes.getMinLane(), + EffectiveLanes.getMaxLane()); } -bool LoopVectorizationCostModel::isConsecutiveLoadOrStore(Instruction *Inst) { +void VPBuildScalarStepsRecipe::print(raw_ostream &O) const { + O << "Build scalar steps"; + if (!DesignatedLanes.isFull()) { + O << " "; + DesignatedLanes.print(O); + } + O << ":\n" << *EntryVal; +} - // Check if the pointer operand of a load or store instruction is - // consecutive. - if (auto *Ptr = getPointerOperand(Inst)) - return Legal->isConsecutivePtr(Ptr); - return false; +void VPInterleaveRecipe::vectorize(VPTransformState &State) { + assert(State.Instance == nullptr && "Interleave group being replicated"); + State.ILV->vectorizeInterleaveGroup(IG->getInsertPos()); } -void LoopVectorizationCostModel::collectValuesToIgnore() { - // Ignore ephemeral values. 
- CodeMetrics::collectEphemeralValues(TheLoop, AC, ValuesToIgnore); +void VPInterleaveRecipe::print(raw_ostream &O) const { + O << "InterleaveGroup factor:" << IG->getFactor() << '\n'; + for (unsigned i = 0; i < IG->getFactor(); ++i) + if (Instruction *I = IG->getMember(i)) { + if (I == IG->getInsertPos()) + O << i << "=]" << *I; + else + O << i << " ]" << *I; + if (willAlsoPackOrUnpack(I)) + O << " (V->S)"; + } +} - // Ignore type-promoting instructions we identified during reduction - // detection. - for (auto &Reduction : *Legal->getReductionVars()) { - RecurrenceDescriptor &RedDes = Reduction.second; - SmallPtrSetImpl &Casts = RedDes.getCastInsts(); - VecValuesToIgnore.insert(Casts.begin(), Casts.end()); +void VPExtractMaskBitRecipe::vectorize(VPTransformState &State) { + assert(State.Instance && "Extract Mask Bit works only on single instance."); + + unsigned Part = State.Instance->Part; + unsigned Lane = State.Instance->Lane; + + typedef SmallVector VectorParts; + + VectorParts Cond = State.ILV->createBlockInMask(MaskedBasicBlock); + + ConditionBit = State.Builder.CreateExtractElement( + Cond[Part], State.ILV->Builder.getInt32(Lane)); + ConditionBit = + State.Builder.CreateICmp(ICmpInst::ICMP_EQ, ConditionBit, + ConstantInt::get(ConditionBit->getType(), 1)); + DEBUG(dbgs() << "\nLV: vectorizing ConditionBit recipe" + << MaskedBasicBlock->getName()); +} + +void VPMergeScalarizeBranchRecipe::vectorize(VPTransformState &State) { + assert(State.Instance && + "Merge Scalarize Branch works only on single instance."); + + Type *LiveOutType = LiveOut->getType(); + unsigned Part = State.Instance->Part; + unsigned Lane = State.Instance->Lane; + + // Rename the predicated and merged basic blocks for backwards compatibility. + Instruction *ScalarLiveOut = + cast(State.ILV->getScalarValue(LiveOut, Part, Lane)); + BasicBlock *PredicatedBB = ScalarLiveOut->getParent(); + BasicBlock *PredicatingBB = PredicatedBB->getSinglePredecessor(); + assert(PredicatingBB && "Predicated block has no single predecessor"); + PredicatedBB->setName(Twine("pred.") + LiveOut->getOpcodeName() + ".if"); + PredicatedBB->getSingleSuccessor()->setName( + Twine("pred.") + LiveOut->getOpcodeName() + ".continue"); + if (LiveOutType->isVoidTy()) + return; + + // Generate a phi node for the scalarized instruction. + PHINode *Phi = State.ILV->Builder.CreatePHI(LiveOutType, 2); + Phi->addIncoming(UndefValue::get(ScalarLiveOut->getType()), PredicatingBB); + Phi->addIncoming(ScalarLiveOut, PredicatedBB); + State.ILV->setScalarValue(LiveOut, Part, Lane, Phi); + + // If this instruction also generated the complementing form then we also need + // to create a phi for the vector value of this part & lane and update the + // vector values cache accordingly. + Value *VectorValue = State.ILV->getVectorValue(LiveOut, Part); + if (!VectorValue) + return; + + InsertElementInst *IEI = cast(VectorValue); + PHINode *VPhi = State.ILV->Builder.CreatePHI(IEI->getType(), 2); + VPhi->addIncoming(IEI->getOperand(0), PredicatingBB); // the unmodified vector + VPhi->addIncoming(IEI, PredicatedBB); // new vector with the inserted element + State.ILV->setVectorValue(LiveOut, Part, VPhi); +} + +/// Creates a new VPScalarizeOneByOneRecipe or VPVectorizeOneByOneRecipe based +/// on the isScalarizing parameter respectively. 
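The structure that VPMergeScalarizeBranchRecipe completes is easiest to see per lane: a predicated computation in a pred.*.if block whose live-out is merged in pred.*.continue by a phi whose other incoming value is undef. A standalone sketch of that merge (plain C++ stands in for the generated IR; the data and predicate are made up):

    #include <array>
    #include <iostream>
    #include <optional>
    #include <string>

    int main() {
      const unsigned VF = 4;
      std::array<int, VF> A = {10, 20, 30, 40};
      std::array<bool, VF> Predicate = {true, false, true, false};

      // Per-lane results of the predicated scalar op; std::nullopt plays the
      // role of the undef value flowing in from the predicating edge.
      std::array<std::optional<int>, VF> Merged;

      for (unsigned Lane = 0; Lane < VF; ++Lane) {
        std::optional<int> ScalarLiveOut;   // computed only in pred.<op>.if
        if (Predicate[Lane])
          ScalarLiveOut = A[Lane] / 2;      // the predicated scalar operation
        // pred.<op>.continue:
        //   phi [undef, predicating BB], [ScalarLiveOut, predicated BB]
        Merged[Lane] = ScalarLiveOut;
      }

      for (unsigned Lane = 0; Lane < VF; ++Lane)
        std::cout << "lane " << Lane << ": "
                  << (Merged[Lane] ? std::to_string(*Merged[Lane]) : "undef")
                  << "\n";
      return 0;
    }

The second phi created by the recipe does the same merge for the partially built vector value when the instruction also has a vectorized form.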
+VPOneByOneRecipeBase *VPlanUtilsLoopVectorizer::createOneByOneRecipe( + const BasicBlock::iterator B, const BasicBlock::iterator E, VPlan *Plan, + bool isScalarizing) { + if (isScalarizing) + return new VPScalarizeOneByOneRecipe(B, E, Plan); + return new VPVectorizeOneByOneRecipe(B, E, Plan); +} + +bool VPlanUtilsLoopVectorizer::appendInstruction(VPOneByOneRecipeBase *Recipe, + Instruction *Instr) { + if (Recipe->End != Instr->getIterator()) + return false; + + Recipe->End++; + Plan->setInst2Recipe(Instr, Recipe); + return true; +} + +/// Given a \p Split instruction assumed to reside in a VPOneByOneRecipeBase +/// -- where VPOneByOneRecipeBase is either VPScalarizeOneByOneRecipe or +/// VPVectorizeOneByOneRecipe -- update that recipe to start from \p Split +/// and move all preceeding instructions to a new VPOneByOneRecipeBase. +/// \return the newly created VPOneByOneRecipeBase, which is added to the +/// VPBasicBlock of the original recipe, right before it. +VPOneByOneRecipeBase * +VPlanUtilsLoopVectorizer::splitRecipe(Instruction *Split) { + VPOneByOneRecipeBase *Recipe = + cast(Plan->getRecipe(Split)); + auto SplitPos = Split->getIterator(); + + assert(SplitPos != Recipe->Begin && + "Nothing to split before first instruction."); + assert(SplitPos != Recipe->End && "Nothing to split after last instruction."); + + // Build a new recipe for all instructions up to the given Split. + VPBasicBlock *BasicBlock = Recipe->getParent(); + VPOneByOneRecipeBase *NewRecipe = createOneByOneRecipe( + Recipe->Begin, SplitPos, Plan, Recipe->isScalarizing()); + + // Insert the new recipe before the split point. + BasicBlock->addRecipe(NewRecipe, Recipe); + + // Update the old recipe to start from the given split point. + Recipe->Begin = SplitPos; + + return NewRecipe; +} + +/// Insert a given instruction \p Inst into a VPBasicBlock before another +/// given instruction \p Before. Assumes \p Inst does not belong to any +/// recipe, and that \p Before belongs to a VPOneByOneRecipeBase. +void VPlanUtilsLoopVectorizer::insertBefore(Instruction *Inst, + Instruction *Before, + unsigned MinLane) { + assert(!Plan->getRecipe(Inst) && "Instruction already in recipe."); + VPRecipeBase *Recipe = Plan->getRecipe(Before); + assert(Recipe && "Insertion point not in any recipe."); + VPOneByOneRecipeBase *OBORecipe = cast(Recipe); + bool PartialInsertion = MinLane > 0; + bool IndicesMatch = true; + + if (PartialInsertion) { + VPScalarizeOneByOneRecipe *SOBO = + dyn_cast(Recipe); + if (!SOBO || SOBO->DesignatedLanes.getMinLane() != MinLane) + IndicesMatch = false; + } + + // Can we insert \p Inst by augmemting the existing recipe of \p Before? + // Only if \p Inst is immediately followed by \p Before: + Instruction *NextInst = Inst; + if (++NextInst == Before && IndicesMatch) { + // This must imply that \p Before is the first ingredient in its recipe. + assert(Before == &*OBORecipe->Begin && + "Trying to insert but Before is not first in its recipe."); + // Yes, extend the range to include the previous instruction. + OBORecipe->Begin--; + Plan->setInst2Recipe(Inst, Recipe); + return; + } + // Note that it is not possible to augment the end of Recipe by having + // Inst == &*Recipe->End, because to do that Before would need to be + // Recipe->End, which means that Before does not belong to this Recipe. + + // No, the instruction needs to have its own recipe. 
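Note that splitRecipe only moves iterators: no ingredients are copied, the new recipe simply takes over the front interval while the original keeps the tail. A standalone sketch of the same bookkeeping over a plain std::list (this Recipe struct is hypothetical and far simpler than VPOneByOneRecipeBase):

    #include <iostream>
    #include <iterator>
    #include <list>
    #include <string>

    struct Recipe {
      std::list<std::string>::iterator Begin, End; // half-open [Begin, End)
    };

    // Split R at Split: R keeps [Split, End); the returned recipe owns the
    // preceding interval and is meant to be placed right before R.
    static Recipe splitRecipe(Recipe &R, std::list<std::string>::iterator Split) {
      Recipe Front{R.Begin, Split};
      R.Begin = Split;
      return Front;
    }

    int main() {
      std::list<std::string> BB = {"%a = add", "%b = mul", "%c = sub", "%d = or"};
      Recipe R{BB.begin(), BB.end()};

      auto Split = std::next(BB.begin(), 2); // split before "%c = sub"
      Recipe Front = splitRecipe(R, Split);

      std::cout << "front recipe:\n";
      for (auto It = Front.Begin; It != Front.End; ++It)
        std::cout << "  " << *It << "\n";
      std::cout << "back recipe:\n";
      for (auto It = R.Begin; It != R.End; ++It)
        std::cout << "  " << *It << "\n";
      return 0;
    }

insertBefore and removeInstruction below rely on the same trick of growing or shrinking the [Begin, End) interval before falling back to creating a new recipe.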
+ + // If we're not inserting right before the Recipe's first instruction, + // split the Recipe to allow placing the new recipe right before the + // given insertion point. This new recipe is also added to BasicBlock. + if (Before != &*OBORecipe->Begin) + splitRecipe(Before); + + // TODO: VPLanUtils::addOneByOneToBasicBlock() + auto InstBegin = Inst->getIterator(); + auto InstEnd = InstBegin; + VPBasicBlock *BasicBlock = Recipe->getParent(); + VPOneByOneRecipeBase *NewRecipe = nullptr; + if (PartialInsertion) { + NewRecipe = createOneByOneRecipe(InstBegin, ++InstEnd, Plan, true); + cast(NewRecipe)->DesignatedLanes = + VPLaneRange(MinLane); + } else + NewRecipe = createOneByOneRecipe(InstBegin, ++InstEnd, Plan, + OBORecipe->isScalarizing()); + Plan->setInst2Recipe(Inst, NewRecipe); + BasicBlock->addRecipe(NewRecipe, OBORecipe); +} + +/// Remove a given instruction \p Inst from its recipe, if exists. We only +/// support removal from VPOneByOneRecipeBase at this time. +void VPlanUtilsLoopVectorizer::removeInstruction(Instruction *Inst, + unsigned FromLane) { + VPRecipeBase *Recipe = Plan->getRecipe(Inst); + if (!Recipe) + return; // Nothing to do, no recipe to remove the instruction from. + VPOneByOneRecipeBase *OBORecipe = cast(Recipe); + // First check if OBORecipe can be shortened to exclude Inst. + bool InstructionWasLast = false; + if (&*OBORecipe->Begin == Inst) + OBORecipe->Begin++; + else if (&*OBORecipe->End == Inst) { + OBORecipe->End--; + InstructionWasLast = true; + } + // Otherwise split OBORecipe at Inst. + else { + splitRecipe(Inst); + OBORecipe->Begin++; + } + if (FromLane > 0) { + // This is a partial removal. Leave lanes 0..FromLane-1 in the original + // basic block in a new, unregistered recipe. + VPOneByOneRecipeBase *NewRecipe = createOneByOneRecipe( + Inst->getIterator(), ++(Inst->getIterator()), Plan, true); + cast(NewRecipe)->DesignatedLanes = + VPLaneRange(0, FromLane - 1); + Recipe->getParent()->addRecipe(NewRecipe, + InstructionWasLast ? nullptr : Recipe); + } + Plan->resetInst2Recipe(Inst); +} + +// Given an instruction \p Inst and a VPBasicBlock \p To, remove \p Inst from +// its current residence and add it as the first instruction of \p To. +// We currently support removal from and insertion to +// VPOneByOneRecipeBase's only. +// TODO: this is an over-simplistic implemetation that assumes we can make +// the new instruction the first instruction of the first recipe in the +// basic block. This is true for the sinkScalarOperands use-case, but for a +// general basic block a getFirstInsertionPt() logic is required. +void VPlanUtilsLoopVectorizer::sinkInstruction(Instruction *Inst, + VPBasicBlock *To, + unsigned MinLane) { + RecipeListTy *Recipes = getRecipes(To); + + VPRecipeBase *FromRecipe = Plan->getRecipe(Inst); + if (auto *FromBSSRecipe = dyn_cast(FromRecipe)) { + VPBuildScalarStepsRecipe *SunkRecipe = nullptr; + if (MinLane == 0) { + // Sink the entire recipe. 
+ VPBasicBlock *From = FromRecipe->getParent(); + assert(From && "Recipe to sink not assigned to any basic block"); + From->removeRecipe(FromBSSRecipe); + SunkRecipe = FromBSSRecipe; + } else { + // Partially sink lanes MinLane..VF-1 + SunkRecipe = new VPBuildScalarStepsRecipe(FromBSSRecipe->WII, + FromBSSRecipe->EntryVal, Plan); + SunkRecipe->DesignatedLanes = VPLaneRange(MinLane); + FromBSSRecipe->DesignatedLanes = VPLaneRange(0, MinLane - 1); + } + To->addRecipe(SunkRecipe, &*Recipes->begin()); + return; + } + + assert(Plan->getRecipe(Inst) && + isa(Plan->getRecipe(Inst)) && + "Unsupported recipe to sink instructions from"); + + // Remove instruction from its source recipe. + removeInstruction(Inst, MinLane); + + auto *ToRecipe = dyn_cast(&*Recipes->begin()); + if (ToRecipe) { + // Try to sink the instruction into an existing recipe, default to a new + // recipe. + assert(ToRecipe->isScalarizing() && + "Cannot sink into a non-scalarizing recipe."); + + // Add it before the first ingredient of To. + insertBefore(Inst, &*ToRecipe->Begin, MinLane); + } else { + // Instruction has to go into its own one-by-one recipe. + auto InstBegin = Inst->getIterator(); + auto InstEnd = InstBegin; + auto *NewRecipe = createOneByOneRecipe(InstBegin, ++InstEnd, Plan, true); + if (MinLane > 0) // Partial sink + cast(NewRecipe)->DesignatedLanes = + VPLaneRange(MinLane); + To->addRecipe(NewRecipe, &*Recipes->begin()); + } +} + +void InnerLoopUnroller::vectorizeInstruction(Instruction &I) { + switch (I.getOpcode()) { + case Instruction::Br: + // Nothing to do for branches since we already took care of the + // loop control flow instructions. + break; + + case Instruction::GetElementPtr: + scalarizeInstruction(&I, false); + break; + + case Instruction::UDiv: + case Instruction::SDiv: + case Instruction::SRem: + case Instruction::URem: + // Scalarize with predication if this instruction may divide by zero and + // block execution is conditional, otherwise fallthrough. + if (Legal->isScalarWithPredication(&I)) { + scalarizeInstruction(&I, true); + break; + } + + case Instruction::Trunc: { + auto *CI = dyn_cast(&I); + // Optimize the special case where the source is a constant integer + // induction variable. Notice that we can only optimize the 'trunc' case + // because (a) FP conversions lose precision, (b) sext/zext may wrap, and + // (c) other casts depend on pointer size. + if (Cost->isOptimizableIVTruncate(CI, VF)) { + setDebugLocFromInst(Builder, CI); + widenIntInduction(true, cast(CI->getOperand(0)), + cast(CI)); + break; + } + } + + default: + InnerLoopVectorizer::vectorizeInstruction(I); } } @@ -7595,9 +9249,35 @@ return false; } - // Select the optimal vectorization factor. - const LoopVectorizationCostModel::VectorizationFactor VF = - CM.selectVectorizationFactor(OptForSize); + if (!CM.canVectorize(OptForSize)) + return false; + + // Early prune excessive VF's + unsigned MaxVF = CM.computeMaxVectorizationFactor(OptForSize); + + // If OptForSize, MaxVF is the only VF we consider. Abort if it needs a tail. + if (OptForSize && CM.requiresTail(MaxVF)) + return false; + + // Use the planner. + LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, &CM); + + // Get user vectorization factor. + unsigned UserVF = Hints.getWidth(); + + // Select the vectorization factor. 
+ LoopVectorizationCostModel::VectorizationFactor VF = + LVP.plan(OptForSize, UserVF, MaxVF); + bool VectorizeLoop = (VF.Width > 1); + + std::pair VecDiagMsg, IntDiagMsg; + + if (!UserVF && !VectorizeLoop) { + DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n"); + VecDiagMsg = std::make_pair( + "VectorizationNotBeneficial", + "the cost-model indicates that vectorization is not beneficial"); + } // Select the interleave count. unsigned IC = CM.selectInterleaveCount(OptForSize, VF.Width, VF.Cost); @@ -7606,8 +9286,6 @@ unsigned UserIC = Hints.getInterleave(); // Identify the diagnostic messages that should be produced. - std::pair VecDiagMsg, IntDiagMsg; - bool VectorizeLoop = true, InterleaveLoop = true; if (Requirements.doesNotMeet(F, L, Hints)) { DEBUG(dbgs() << "LV: Not vectorizing: loop did not meet vectorization " "requirements.\n"); @@ -7615,13 +9293,7 @@ return false; } - if (VF.Width == 1) { - DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n"); - VecDiagMsg = std::make_pair( - "VectorizationNotBeneficial", - "the cost-model indicates that vectorization is not beneficial"); - VectorizeLoop = false; - } + bool InterleaveLoop = true; if (IC == 1 && UserIC <= 1) { // Tell the user interleaving is not beneficial. @@ -7637,8 +9309,8 @@ } } else if (IC > 1 && UserIC == 1) { // Tell the user interleaving is beneficial, but it explicitly disabled. - DEBUG(dbgs() - << "LV: Interleaving is beneficial but is explicitly disabled."); + DEBUG( + dbgs() << "LV: Interleaving is beneficial but is explicitly disabled."); IntDiagMsg = std::make_pair( "InterleavingBeneficialButDisabled", "the cost-model indicates that interleaving is beneficial " @@ -7649,6 +9321,9 @@ // Override IC if user provided an interleave count. IC = UserIC > 0 ? UserIC : IC; + if (VectorizeLoop) + LVP.setBestPlan(VF.Width, IC); + // Emit diagnostic messages, if any. const char *VAPassName = Hints.vectorizeAnalysisPassName(); if (!VectorizeLoop && !InterleaveLoop) { @@ -7691,10 +9366,13 @@ << "interleaved loop (interleaved count: " << NV("InterleaveCount", IC) << ")"); } else { + // If we decided that it is *legal* to vectorize the loop, then do it. InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC, &LVL, &CM); - LB.vectorize(); + + LVP.executeBestPlan(LB); + ++LoopsVectorized; // Add metadata to disable runtime unrolling a scalar loop when there are Index: lib/Transforms/Vectorize/VPlan.h =================================================================== --- /dev/null +++ lib/Transforms/Vectorize/VPlan.h @@ -0,0 +1,922 @@ +//===- VPlan.h - Represent A Vectorizer Plan ------------------------------===// +// +// The LLVM Compiler Infrastructure +// +// This file is distributed under the University of Illinois Open Source +// License. See LICENSE.TXT for details. +// +//===----------------------------------------------------------------------===// +// +// This file contains the declarations of the Vectorization Plan base classes: +// 1. VPBasicBlock and VPRegionBlock that inherit from a common pure virtual +// VPBlockBase, together implementing a Hierarchical CFG; +// 2. Specializations of GraphTraits that allow VPBlockBase graphs to be treated +// as proper graphs for generic algorithms; +// 3. Pure virtual VPRecipeBase and its pure virtual sub-classes +// VPConditionBitRecipeBase and VPOneByOneRecipeBase that +// represent base classes for recipes contained within VPBasicBlocks; +// 4. The VPlan class holding a candidate for vectorization; +// 5. 
The VPlanUtils class providing methods for building plans; +// 6. The VPlanPrinter class providing a way to print a plan in dot format. +// These are documented in docs/VectorizationPlan.rst. +// +//===----------------------------------------------------------------------===// + +#ifndef LLVM_TRANSFORMS_VECTORIZE_VPLAN_H +#define LLVM_TRANSFORMS_VECTORIZE_VPLAN_H + +#include "llvm/ADT/GraphTraits.h" +#include "llvm/ADT/ilist.h" +#include "llvm/ADT/ilist_node.h" +#include "llvm/IR/IRBuilder.h" +#include "llvm/Support/raw_ostream.h" + +// The (re)use of existing LoopVectorize classes is subject to future VPlan +// refactoring. +namespace { +class InnerLoopVectorizer; +class LoopVectorizationLegality; +class LoopVectorizationCostModel; +} + +namespace llvm { + +class VPBasicBlock; + +/// VPRecipeBase is a base class describing one or more instructions that will +/// appear consecutively in the vectorized version, based on Instructions from +/// the given IR. These Instructions are referred to as the "Ingredients" of +/// the Recipe. A Recipe specifies how its ingredients are to be vectorized: +/// e.g., copy or reuse them as uniform, scalarize or vectorize them according +/// to an enclosing loop dimension, vectorize them according to internal SLP +/// dimension. +/// +/// **Design principle:** in order to reason about how to vectorize an +/// Instruction or how much it would cost, one has to consult the VPRecipe +/// holding it. +/// +/// **Design principle:** when a sequence of instructions conveys additional +/// information as a group, we use a VPRecipe to encapsulate them and attach +/// this information to the VPRecipe. For instance a VPRecipe can model an +/// interleave group of loads or stores with additional information for +/// calculating their cost and for performing IR code generation, as a group. +/// +/// **Design principle:** a VPRecipe should reuse existing containers of its +/// ingredients, i.e., iterators of basic blocks, to be lightweight. A new +/// containter should be opened on-demand, e.g., to avoid excessive recipes +/// each holding an interval of ingredients. +class VPRecipeBase : public ilist_node_with_parent { + friend class VPlanUtils; + friend class VPBasicBlock; + +private: + const unsigned char VRID; /// Subclass identifier (for isa/dyn_cast) + + /// Each VPRecipe is contained in a single VPBasicBlock. + class VPBasicBlock *Parent; + + /// Record which Instructions would require generating their complementing + /// form as well, providing a vector-to-scalar or scalar-to-vector conversion. + SmallPtrSet AlsoPackOrUnpack; + +public: + /// An enumeration for keeping track of the concrete subclass of VPRecipeBase + /// that is actually instantiated. Values of this enumeration are kept in the + /// VPRecipe classes VRID field. They are used for concrete type + /// identification. + typedef enum { + VPVectorizeOneByOneSC, + VPScalarizeOneByOneSC, + VPWidenIntInductionSC, + VPBuildScalarStepsSC, + VPInterleaveSC, + VPExtractMaskBitSC, + VPMergeScalarizeBranchSC, + } VPRecipeTy; + + VPRecipeBase(const unsigned char SC) : VRID(SC), Parent(nullptr) {} + + virtual ~VPRecipeBase() {} + + /// \return an ID for the concrete type of this object. + /// This is used to implement the classof checks. This should not be used + /// for any other purpose, as the values may change as LLVM evolves. + unsigned getVPRecipeID() const { return VRID; } + + /// \return the VPBasicBlock which this VPRecipe belongs to. 
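The VRID field implements LLVM's usual hand-rolled RTTI: every concrete recipe stores a tag from the enum, and classof checks that tag so isa/dyn_cast can work without C++ RTTI. A minimal standalone sketch of the idiom (simplified names, no LLVM dependency, with a toy dyn_cast stand-in):

    #include <iostream>
    #include <memory>
    #include <vector>

    class RecipeBase {
    public:
      enum RecipeTy { ScalarizeSC, VectorizeSC, InterleaveSC };
      explicit RecipeBase(RecipeTy ID) : VRID(ID) {}
      virtual ~RecipeBase() = default;
      RecipeTy getID() const { return VRID; }

    private:
      const RecipeTy VRID; // subclass tag consulted instead of C++ RTTI
    };

    class ScalarizeRecipe : public RecipeBase {
    public:
      ScalarizeRecipe() : RecipeBase(ScalarizeSC) {}
      static bool classof(const RecipeBase *R) { return R->getID() == ScalarizeSC; }
    };

    class InterleaveRecipe : public RecipeBase {
    public:
      InterleaveRecipe() : RecipeBase(InterleaveSC) {}
      static bool classof(const RecipeBase *R) { return R->getID() == InterleaveSC; }
    };

    // Toy stand-in for llvm::dyn_cast, driven entirely by classof.
    template <typename To, typename From> To *dyn_cast_sketch(From *V) {
      return To::classof(V) ? static_cast<To *>(V) : nullptr;
    }

    int main() {
      std::vector<std::unique_ptr<RecipeBase>> Recipes;
      Recipes.push_back(std::make_unique<ScalarizeRecipe>());
      Recipes.push_back(std::make_unique<InterleaveRecipe>());

      for (const auto &R : Recipes)
        if (auto *IG = dyn_cast_sketch<InterleaveRecipe>(R.get()))
          std::cout << "interleave recipe found at " << IG << "\n";
      return 0;
    }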
+ class VPBasicBlock *getParent() { + return Parent; + } + + /// The method which generates the new IR instructions that correspond to + /// this VPRecipe in the vectorized version, thereby "executing" the VPlan. + virtual void vectorize(struct VPTransformState &State) = 0; + + /// Each recipe prints itself. + virtual void print(raw_ostream &O) const = 0; + + /// Add an instruction to the set of instructions for which a vector-to- + /// scalar or scalar-to-vector conversion is needed, in addition to + /// vectorizing or scalarizing the instruction itself, respectively. + void addAlsoPackOrUnpack(Instruction *I) { AlsoPackOrUnpack.insert(I); } + + /// Indicates if a given instruction requires vector-to-scalar or scalar-to- + /// vector conversion. + bool willAlsoPackOrUnpack(Instruction *I) const { + return AlsoPackOrUnpack.count(I); + } +}; + +/// A VPConditionBitRecipeBase is a pure virtual VPRecipe which supports a +/// conditional branch. Concrete sub-classes of this recipe are in charge of +/// generating the instructions that compute the condition for this branch in +/// the vectorized version. +class VPConditionBitRecipeBase : public VPRecipeBase { +protected: + /// The actual condition bit that was generated. Holds null until the + /// value/instuctions are generated by the vectorize() method. + Value *ConditionBit; + +public: + /// Construct a VPConditionBitRecipeBase, simply propating its concrete type. + VPConditionBitRecipeBase(const unsigned char SC) + : VPRecipeBase(SC), ConditionBit(nullptr) {} + + /// \return the actual bit that was generated, to be plugged into the IR + /// conditional branch, or null if the code computing the actual bit has not + /// been generated yet. + Value *getConditionBit() { return ConditionBit; } + + virtual StringRef getName() const = 0; +}; + +/// VPOneByOneRecipeBase is a VPRecipeBase which handles each Instruction in its +/// ingredients independently, in order. The ingredients are either all +/// vectorized, or all scalarized. +/// A VPOneByOneRecipeBase is a virtual base recipe which can be materialized +/// by one of two sub-classes, namely VPVectorizeOneByOneRecipe or +/// VPScalarizeOneByOneRecipe for Vectorizing or Scalarizing all ingredients, +/// respectively. +/// The ingredients are held as a sub-sequence of original Instructions, which +/// reside in the same IR BasicBlock and in the same order. The Ingredients are +/// accessed by a pointer to the first and last Instruction. +class VPOneByOneRecipeBase : public VPRecipeBase { + friend class VPlanUtilsLoopVectorizer; + +public: + /// Hold the ingredients by pointing to their original BasicBlock location. + BasicBlock::iterator Begin; + BasicBlock::iterator End; + +protected: + VPOneByOneRecipeBase() = delete; + + VPOneByOneRecipeBase(unsigned char SC, const BasicBlock::iterator B, + const BasicBlock::iterator E, class VPlan *Plan); + + /// Do the actual code generation for a single instruction. + /// This function is to be implemented and specialized by the respective + /// sub-class. + virtual void transformIRInstruction(Instruction *I, + struct VPTransformState &State) = 0; + +public: + ~VPOneByOneRecipeBase() {} + + /// Method to support type inquiry through isa, cast, and dyn_cast. 
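The AlsoPackOrUnpack set is what lets a recipe provide both the vector and the scalar form of a value at its definition, as used by VPVectorizeOneByOneRecipe::transformIRInstruction earlier in this patch. A standalone sketch of that bookkeeping (strings stand in for Instructions, std::unordered_set for SmallPtrSet):

    #include <iostream>
    #include <string>
    #include <unordered_set>
    #include <vector>

    int main() {
      // Ingredients this recipe will widen one by one.
      std::vector<std::string> Ingredients = {"%a = load", "%b = add %a, 1"};

      // Instructions whose scalar instances are also needed elsewhere (e.g.
      // %a feeds a scalarized address computation), recorded while planning.
      std::unordered_set<std::string> AlsoPackOrUnpack = {"%a = load"};

      const unsigned VF = 4, UF = 1;
      for (const std::string &I : Ingredients) {
        std::cout << "widen " << I << "\n";
        if (AlsoPackOrUnpack.count(I)) // unpack: also materialize scalar copies
          for (unsigned Part = 0; Part < UF; ++Part)
            for (unsigned Lane = 0; Lane < VF; ++Lane)
              std::cout << "  extract part " << Part << " lane " << Lane << "\n";
      }
      return 0;
    }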
+ static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPScalarizeOneByOneSC || + V->getVPRecipeID() == VPRecipeBase::VPVectorizeOneByOneSC; + } + + bool isScalarizing() const { + return getVPRecipeID() == VPRecipeBase::VPScalarizeOneByOneSC; + } + + /// The method which generates all new IR instructions that correspond to + /// this VPOneByOneRecipeBase in the vectorized version, thereby + /// "executing" the VPlan. + /// VPOneByOneRecipeBase may either scalarize or vectorize all Instructions. + void vectorize(struct VPTransformState &State) override { + for (auto It = Begin; It != End; ++It) + transformIRInstruction(&*It, State); + } + + const BasicBlock::iterator &begin() { return Begin; } + + const BasicBlock::iterator &end() { return End; } +}; + +/// Hold the indices of a specific scalar instruction. The VPIterationInstance +/// span the iterations of the original loop, that correspond to a single +/// iteration of the vectorized loop. +struct VPIterationInstance { + unsigned Part; + unsigned Lane; +}; + +// Forward declaration. +class BasicBlock; + +/// Hold additional information passed down when "executing" a VPlan, that is +/// needed for generating IR. Also facilitates reuse of existing LV +/// functionality. +struct VPTransformState { + + VPTransformState(unsigned VF, unsigned UF, class LoopInfo *LI, + class DominatorTree *DT, IRBuilder<> &Builder, + InnerLoopVectorizer *ILV, LoopVectorizationLegality *Legal, + LoopVectorizationCostModel *Cost) + : VF(VF), UF(UF), Instance(nullptr), LI(LI), DT(DT), Builder(Builder), + ILV(ILV), Legal(Legal), Cost(Cost) {} + + /// Record the selected vectorization and unroll factors of the single loop + /// being vectorized. + unsigned VF; + unsigned UF; + + /// Hold the indices to generate a specific scalar instruction. Null indicates + /// that all instances are to be generated, using either scalar or vector + /// instructions. + VPIterationInstance *Instance; + + /// Hold state information used when constructing the CFG of the vectorized + /// Loop, traversing the VPBasicBlocks and generating corresponding IR + /// BasicBlocks. + struct CFGState { + /// The previous VPBasicBlock visited. In the beginning set to null. + VPBasicBlock *PrevVPBB; + /// The previous IR BasicBlock created or reused. In the beginning set to + /// the new header BasicBlock. + BasicBlock *PrevBB; + /// The last IR BasicBlock of the loop body. Set to the new latch + /// BasicBlock, used for placing the newly created BasicBlocks. + BasicBlock *LastBB; + /// A mapping of each VPBasicBlock to the corresponding BasicBlock. In case + /// of replication, maps the BasicBlock of the last replica created. + SmallDenseMap VPBB2IRBB; + + CFGState() : PrevVPBB(nullptr), PrevBB(nullptr), LastBB(nullptr) {} + } CFG; + + /// Hold pointer to LoopInfo to register new basic blocks in the loop. + class LoopInfo *LI; + + /// Hold pointer to Dominator Tree to register new basic blocks in the loop. + class DominatorTree *DT; + + /// Hold a reference to the IRBuilder used to generate IR code. + IRBuilder<> &Builder; + + /// Hold a pointer to InnerLoopVectorizer to reuse its IR generation methods. + class InnerLoopVectorizer *ILV; + + /// Hold a pointer to LoopVectorizationLegality + class LoopVectorizationLegality *Legal; + + /// Hold a pointer to LoopVectorizationCostModel to access its + /// IsUniformAfterVectorization method. + LoopVectorizationCostModel *Cost; +}; + +/// VPBlockBase is the building block of the Hierarchical CFG. 
A VPBlockBase +/// can be either a VPBasicBlock or a VPRegionBlock. +/// +/// The Hierarchical CFG is a control-flow graph whose nodes are basic-blocks +/// or Hierarchical CFG's. The Hierarchical CFG data structure we use is similar +/// to the Tile Tree [1], where cross-Tile edges are lifted to connect Tiles +/// instead of the original basic-blocks as in Sharir [2], promoting the Tile +/// encapsulation. We use the terms Region and Block rather than Tile [1] to +/// avoid confusion with loop tiling. +/// +/// [1] "Register Allocation via Hierarchical Graph Coloring", David Callahan +/// and Brian Koblenz, PLDI 1991 +/// +/// [2] "Structural analysis: A new approach to flow analysis in optimizing +/// compilers", M. Sharir, Journal of Computer Languages, Jan. 1980 +/// +/// Note that in contrast to the IR BasicBlock, a VPBlockBase models its +/// control-flow edges with successor and predecessor VPBlockBase directly, +/// rather than through a Terminator branch or through predecessor branches that +/// Use the VPBlockBase. +class VPBlockBase { + friend class VPlanUtils; + +private: + const unsigned char VBID; /// Subclass identifier (for isa/dyn_cast). + + std::string Name; + + /// The immediate VPRegionBlock which this VPBlockBase belongs to, or null if + /// it is a topmost VPBlockBase. + class VPRegionBlock *Parent; + + /// List of predecessor blocks. + SmallVector Predecessors; + + /// List of successor blocks. + SmallVector Successors; + + /// Successor selector, null for zero or single successor blocks. + VPConditionBitRecipeBase *ConditionBitRecipe; + + /// Add \p Successor as the last successor to this block. + void appendSuccessor(VPBlockBase *Successor) { + assert(Successor && "Cannot add nullptr successor!"); + Successors.push_back(Successor); + } + + /// Add \p Predecessor as the last predecessor to this block. + void appendPredecessor(VPBlockBase *Predecessor) { + assert(Predecessor && "Cannot add nullptr predecessor!"); + Predecessors.push_back(Predecessor); + } + + /// Remove \p Predecessor from the predecessors of this block. + void removePredecessor(VPBlockBase *Predecessor) { + auto Pos = std::find(Predecessors.begin(), Predecessors.end(), Predecessor); + assert(Pos && "Predecessor does not exist"); + Predecessors.erase(Pos); + } + + /// Remove \p Successor from the successors of this block. + void removeSuccessor(VPBlockBase *Successor) { + auto Pos = std::find(Successors.begin(), Successors.end(), Successor); + assert(Pos && "Successor does not exist"); + Successors.erase(Pos); + } + +protected: + VPBlockBase(const unsigned char SC, const std::string &N) + : VBID(SC), Name(N), Parent(nullptr), ConditionBitRecipe(nullptr) {} + +public: + /// An enumeration for keeping track of the concrete subclass of VPBlockBase + /// that is actually instantiated. Values of this enumeration are kept in the + /// VPBlockBase classes VBID field. They are used for concrete type + /// identification. + typedef enum { VPBasicBlockSC, VPRegionBlockSC } VPBlockTy; + + virtual ~VPBlockBase() {} + + const std::string &getName() const { return Name; } + + /// \return an ID for the concrete type of this object. + /// This is used to implement the classof checks. This should not be used + /// for any other purpose, as the values may change as LLVM evolves. + unsigned getVPBlockID() const { return VBID; } + + const class VPRegionBlock *getParent() const { return Parent; } + + /// \return the VPBasicBlock that is the entry of this VPBlockBase, + /// recursively, if the latter is a VPRegionBlock. 
Otherwise, if this + /// VPBlockBase is a VPBasicBlock, it is returned. + const class VPBasicBlock *getEntryBasicBlock() const; + + /// \return the VPBasicBlock that is the exit of this VPBlockBase, + /// recursively, if the latter is a VPRegionBlock. Otherwise, if this + /// VPBlockBase is a VPBasicBlock, it is returned. + const class VPBasicBlock *getExitBasicBlock() const; + class VPBasicBlock *getExitBasicBlock(); + + const SmallVectorImpl &getSuccessors() const { + return Successors; + } + + const SmallVectorImpl &getPredecessors() const { + return Predecessors; + } + + SmallVectorImpl &getSuccessors() { return Successors; } + + SmallVectorImpl &getPredecessors() { return Predecessors; } + + /// \return the successor of this VPBlockBase if it has a single successor. + /// Otherwise return a null pointer. + VPBlockBase *getSingleSuccessor() const { + return (Successors.size() == 1 ? *Successors.begin() : nullptr); + } + + /// \return the predecessor of this VPBlockBase if it has a single + /// predecessor. Otherwise return a null pointer. + VPBlockBase *getSinglePredecessor() const { + return (Predecessors.size() == 1 ? *Predecessors.begin() : nullptr); + } + + /// Returns the closest ancestor starting from "this", which has successors. + /// Returns the root ancestor if all ancestors have no successors. + VPBlockBase *getAncestorWithSuccessors(); + + /// Returns the closest ancestor starting from "this", which has predecessors. + /// Returns the root ancestor if all ancestors have no predecessors. + VPBlockBase *getAncestorWithPredecessors(); + + /// \return the successors either attached directly to this VPBlockBase or, if + /// this VPBlockBase is the exit block of a VPRegionBlock and has no + /// successors of its own, search recursively for the first enclosing + /// VPRegionBlock that has successors and return them. If no such + /// VPRegionBlock exists, return the (empty) successors of the topmost + /// VPBlockBase reached. + const SmallVectorImpl &getHierarchicalSuccessors() { + return getAncestorWithSuccessors()->getSuccessors(); + } + + /// \return the hierarchical successor of this VPBlockBase if it has a single + /// hierarchical successor. Otherwise return a null pointer. + VPBlockBase *getSingleHierarchicalSuccessor() { + return getAncestorWithSuccessors()->getSingleSuccessor(); + } + + /// \return the predecessors either attached directly to this VPBlockBase or, + /// if this VPBlockBase is the entry block of a VPRegionBlock and has no + /// predecessors of its own, search recursively for the first enclosing + /// VPRegionBlock that has predecessors and return them. If no such + /// VPRegionBlock exists, return the (empty) predecessors of the topmost + /// VPBlockBase reached. + const SmallVectorImpl &getHierarchicalPredecessors() { + return getAncestorWithPredecessors()->getPredecessors(); + } + + /// \return the hierarchical predecessor of this VPBlockBase if it has a + /// single hierarchical predecessor. Otherwise return a null pointer. + VPBlockBase *getSingleHierarchicalPredecessor() { + return getAncestorWithPredecessors()->getSinglePredecessor(); + } + + /// If a VPBlockBase has two successors, this is the Recipe that will generate + /// the condition bit selecting the successor, and feeding the terminating + /// conditional branch. Otherwise this is null. 
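getHierarchicalSuccessors climbs out of enclosing regions until it reaches a block that actually has successors, so cross-region edges lifted to a VPRegionBlock remain reachable from its exit block. A standalone sketch of the ancestor walk (Block is a stripped-down stand-in for VPBlockBase):

    #include <iostream>
    #include <string>
    #include <vector>

    struct Block {
      std::string Name;
      Block *Parent = nullptr;          // enclosing region, if any
      std::vector<Block *> Successors;  // edges at this level
    };

    // Closest ancestor (starting from B itself) that has successors; the root
    // is returned if no ancestor has any.
    static Block *getAncestorWithSuccessors(Block *B) {
      if (!B->Successors.empty() || !B->Parent)
        return B;
      return getAncestorWithSuccessors(B->Parent);
    }

    int main() {
      Block Region{"region1"}, Exit{"exit"}, Next{"next"};
      Exit.Parent = &Region;              // Exit is the region's exit block
      Region.Successors.push_back(&Next); // cross-region edge lifted to Region

      // Exit has no successors of its own, so the walk surfaces the region's.
      Block *From = getAncestorWithSuccessors(&Exit);
      for (Block *S : From->Successors)
        std::cout << Exit.Name << " -> " << S->Name << " (via " << From->Name
                  << ")\n";
      return 0;
    }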
+ VPConditionBitRecipeBase *getConditionBitRecipe() { + return ConditionBitRecipe; + } + + const VPConditionBitRecipeBase *getConditionBitRecipe() const { + return ConditionBitRecipe; + } + + void setConditionBitRecipe(VPConditionBitRecipeBase *R) { + ConditionBitRecipe = R; + } + + /// The method which generates all new IR instructions that correspond to + /// this VPBlockBase in the vectorized version, thereby "executing" the VPlan. + virtual void vectorize(struct VPTransformState *State) = 0; + + /// Delete all blocks reachable from a given VPBlockBase, inclusive. + static void deleteCFG(VPBlockBase *Entry); +}; + +/// VPBasicBlock serves as the leaf of the Hierarchical CFG. It represents a +/// sequence of instructions that will appear consecutively in a basic block +/// of the vectorized version. The VPBasicBlock takes care of the control-flow +/// relations with other VPBasicBlock's and Regions. It holds a sequence of zero +/// or more VPRecipe's that take care of representing the instructions. +/// A VPBasicBlock that holds no VPRecipe's represents no instructions; this +/// may happen, e.g., to support disjoint Regions and to ensure Regions have a +/// single exit, possibly an empty one. +/// +/// Note that in contrast to the IR BasicBlock, a VPBasicBlock models its +/// control-flow edges with successor and predecessor VPBlockBase directly, +/// rather than through a Terminator branch or through predecessor branches that +/// "use" the VPBasicBlock. +class VPBasicBlock : public VPBlockBase { + friend class VPlanUtils; + +public: + typedef iplist RecipeListTy; + +private: + /// The list of VPRecipes, held in order of instructions to generate. + RecipeListTy Recipes; + +public: + /// Instruction iterators... + typedef RecipeListTy::iterator iterator; + typedef RecipeListTy::const_iterator const_iterator; + typedef RecipeListTy::reverse_iterator reverse_iterator; + typedef RecipeListTy::const_reverse_iterator const_reverse_iterator; + + //===--------------------------------------------------------------------===// + /// Recipe iterator methods + /// + inline iterator begin() { return Recipes.begin(); } + inline const_iterator begin() const { return Recipes.begin(); } + inline iterator end() { return Recipes.end(); } + inline const_iterator end() const { return Recipes.end(); } + + inline reverse_iterator rbegin() { return Recipes.rbegin(); } + inline const_reverse_iterator rbegin() const { return Recipes.rbegin(); } + inline reverse_iterator rend() { return Recipes.rend(); } + inline const_reverse_iterator rend() const { return Recipes.rend(); } + + inline size_t size() const { return Recipes.size(); } + inline bool empty() const { return Recipes.empty(); } + inline const VPRecipeBase &front() const { return Recipes.front(); } + inline VPRecipeBase &front() { return Recipes.front(); } + inline const VPRecipeBase &back() const { return Recipes.back(); } + inline VPRecipeBase &back() { return Recipes.back(); } + + /// Return the underlying instruction list container. + /// + /// Currently you need to access the underlying instruction list container + /// directly if you want to modify it. + const RecipeListTy &getInstList() const { return Recipes; } + RecipeListTy &getInstList() { return Recipes; } + + /// Returns a pointer to a member of the instruction list. 
+ static RecipeListTy VPBasicBlock::*getSublistAccess(VPRecipeBase *) { + return &VPBasicBlock::Recipes; + } + + VPBasicBlock(const std::string &Name) : VPBlockBase(VPBasicBlockSC, Name) {} + + ~VPBasicBlock() { Recipes.clear(); } + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPBlockBase *V) { + return V->getVPBlockID() == VPBlockBase::VPBasicBlockSC; + } + + /// Augment the existing recipes of a VPBasicBlock with an additional + /// \p Recipe at a position given by an existing recipe \p Before. If + /// \p Before is null, \p Recipe is appended as the last recipe. + void addRecipe(VPRecipeBase *Recipe, VPRecipeBase *Before = nullptr) { + Recipe->Parent = this; + if (!Before) { + Recipes.push_back(Recipe); + return; + } + assert(Before->Parent == this && + "Insertion before point not in this basic block."); + Recipes.insert(Before->getIterator(), Recipe); + } + + void removeRecipe(VPRecipeBase *Recipe) { + assert(Recipe->Parent == this && + "Recipe to remove not in this basic block."); + Recipes.remove(Recipe); + Recipe->Parent = nullptr; + } + + /// The method which generates all new IR instructions that correspond to + /// this VPBasicBlock in the vectorized version, thereby "executing" the + /// VPlan. + void vectorize(struct VPTransformState *State) override; + + /// Retrieve the list of VPRecipes that belong to this VPBasicBlock. + const RecipeListTy &getRecipes() const { return Recipes; } + +private: + /// Create an IR BasicBlock to hold the instructions vectorized from this + /// VPBasicBlock, and return it. Update the CFGState accordingly. + BasicBlock *createEmptyBasicBlock(VPTransformState::CFGState &CFG); +}; + +/// VPRegionBlock represents a collection of VPBasicBlocks and VPRegionBlocks +/// which form a single-entry-single-exit subgraph of the CFG in the vectorized +/// code. +/// +/// A VPRegionBlock may indicate that its contents are to be replicated several +/// times. This is designed to support predicated scalarization, in which a +/// scalar if-then code structure needs to be generated VF * UF times. Having +/// this replication indicator helps to keep a single VPlan for multiple +/// candidate VF's; the actual replication takes place only once the desired VF +/// and UF have been determined. +/// +/// **Design principle:** when some additional information relates to an SESE +/// set of VPBlockBase, we use a VPRegionBlock to wrap them and attach the +/// information to it. For example, a VPRegionBlock can be used to indicate that +/// a scalarized SESE region is to be replicated, and that a vectorized SESE +/// region can retain its internal control-flow, independent of the control-flow +/// external to the region. +class VPRegionBlock : public VPBlockBase { + friend class VPlanUtils; + +private: + /// Hold the Single Entry of the SESE region represented by the VPRegionBlock. + VPBlockBase *Entry; + + /// Hold the Single Exit of the SESE region represented by the VPRegionBlock. + VPBlockBase *Exit; + + /// A VPRegionBlock can represent either a single instance of its + /// VPBlockBases, or multiple (VF * UF) replicated instances. The latter is + /// used when the internal SESE region handles a single scalarized lane. 
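When the replication indicator is set, the region's blocks are emitted once per (part, lane) pair rather than once, mirroring VPRegionBlock::vectorize further below. A standalone sketch of that replication loop (the body is a placeholder for generating the region's blocks):

    #include <iostream>

    // Stand-in for VPIterationInstance: which scalar copy is being generated.
    struct IterationInstance {
      unsigned Part;
      unsigned Lane;
    };

    int main() {
      const unsigned VF = 4, UF = 2;

      // A replicating region is visited VF * UF times; each visit generates
      // the scalar if-then structure for one lane of one unrolled part.
      for (unsigned Part = 0; Part < UF; ++Part)
        for (unsigned Lane = 0; Lane < VF; ++Lane) {
          IterationInstance Instance{Part, Lane};
          std::cout << "replica for part " << Instance.Part << ", lane "
                    << Instance.Lane << "\n"; // placeholder for emitting blocks
        }
      return 0;
    }

Keeping replication as a region property, rather than cloning blocks up front, is what allows one VPlan to serve several candidate VFs.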
+ bool IsReplicator; + +public: + VPRegionBlock(const std::string &Name) + : VPBlockBase(VPRegionBlockSC, Name), Entry(nullptr), Exit(nullptr), + IsReplicator(false) {} + + ~VPRegionBlock() { + if (Entry) + deleteCFG(Entry); + } + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPBlockBase *V) { + return V->getVPBlockID() == VPBlockBase::VPRegionBlockSC; + } + + VPBlockBase *getEntry() { return Entry; } + + VPBlockBase *getExit() { return Exit; } + + const VPBlockBase *getEntry() const { return Entry; } + + const VPBlockBase *getExit() const { return Exit; } + + /// An indicator if the VPRegionBlock represents single or multiple instances. + bool isReplicator() const { return IsReplicator; } + + void setReplicator(bool ToReplicate) { IsReplicator = ToReplicate; } + + /// The method which generates the new IR instructions that correspond to + /// this VPRegionBlock in the vectorized version, thereby "executing" the + /// VPlan. + void vectorize(struct VPTransformState *State) override; +}; + +/// A VPlan represents a candidate for vectorization, encoding various decisions +/// taken to produce efficient vector code, including: which instructions are to +/// vectorized or scalarized, which branches are to appear in the vectorized +/// version. It models the control-flow of the candidate vectorized version +/// explicitly, and holds prescriptions for generating the code for this version +/// from a given IR code. +/// VPlan takes a "senario-based approach" to vectorization planning - different +/// scenarios, corresponding to making different decisions, can be modeled using +/// different VPlans. +/// The corresponding IR code is required to be SESE. +/// The vectorized version is represented using a Hierarchical CFG. +class VPlan { + friend class VPlanUtils; + friend class VPlanUtilsLoopVectorizer; + +private: + /// Hold the single entry to the Hierarchical CFG of the VPlan. + VPBlockBase *Entry; + + /// The IR instructions which are to be transformed to fill the vectorized + /// version are held as ingredients inside the VPRecipe's of the VPlan. Hold a + /// reverse mapping to locate the VPRecipe an IR instruction belongs to. This + /// serves optimizations that operate on the VPlan. + DenseMap Inst2Recipe; + +public: + VPlan() : Entry(nullptr) {} + + ~VPlan() { + if (Entry) + VPBlockBase::deleteCFG(Entry); + } + + /// Generate the IR code for this VPlan. + void vectorize(struct VPTransformState *State); + + VPBlockBase *getEntry() { return Entry; } + const VPBlockBase *getEntry() const { return Entry; } + + void setEntry(VPBlockBase *Block) { Entry = Block; } + + /// Retrieve the VPRecipe a given instruction \p Inst belongs to in the VPlan. + /// Returns null if it belongs to no VPRecipe. + VPRecipeBase *getRecipe(Instruction *Inst) { + auto It = Inst2Recipe.find(Inst); + if (It == Inst2Recipe.end()) + return nullptr; + return It->second; + } + + void setInst2Recipe(Instruction *I, VPRecipeBase *R) { Inst2Recipe[I] = R; } + + void resetInst2Recipe(Instruction *I) { Inst2Recipe.erase(I); } + + /// Retrieve the VPBasicBlock a given instruction \p Inst belongs to in the + /// VPlan. Returns null if it belongs to no VPRecipe. 
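The Inst2Recipe map gives plan-level transformations such as splitRecipe, insertBefore and sinkInstruction constant-time access from an IR instruction back to its recipe and, through the recipe, to its VPBasicBlock. A standalone sketch of the lookup (std::unordered_map and strings stand in for DenseMap and Instructions):

    #include <iostream>
    #include <string>
    #include <unordered_map>

    struct RecipeStub {
      std::string ParentBlock; // name of the VPBasicBlock holding this recipe
    };

    int main() {
      RecipeStub Load{"loop.body"}, Store{"pred.store.if"};

      // Reverse mapping from (names of) IR instructions to their recipes.
      std::unordered_map<std::string, RecipeStub *> Inst2Recipe = {
          {"%v = load", &Load}, {"store %v", &Store}};

      auto getRecipe = [&](const std::string &Inst) -> RecipeStub * {
        auto It = Inst2Recipe.find(Inst);
        return It == Inst2Recipe.end() ? nullptr : It->second;
      };

      if (RecipeStub *R = getRecipe("store %v"))
        std::cout << "'store %v' lives in VPBasicBlock " << R->ParentBlock << "\n";
      if (!getRecipe("%u = add"))
        std::cout << "'%u = add' belongs to no recipe\n";
      return 0;
    }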
+ VPBasicBlock *getBasicBlock(Instruction *Inst) { + VPRecipeBase *Recipe = getRecipe(Inst); + if (!Recipe) + return nullptr; + return Recipe->getParent(); + } + +private: + /// Add to the given dominator tree the header block and every new basic block + /// that was created between it and the latch block, inclusive. + void updateDominatorTree(class DominatorTree *DT, BasicBlock *LoopPreHeaderBB, + BasicBlock *LoopLatchBB); +}; + +/// The VPlanUtils class provides interfaces for the construction and +/// manipulation of a VPlan. +class VPlanUtils { +private: + /// Unique ID generator. + static unsigned NextOrdinal; + +protected: + VPlan *Plan; + + typedef iplist RecipeListTy; + RecipeListTy *getRecipes(VPBasicBlock *Block) { return &Block->Recipes; } + +public: + VPlanUtils(VPlan *Plan) : Plan(Plan) {} + + ~VPlanUtils() {} + + /// Create a unique name for a new VPlan entity such as a VPBasicBlock or + /// VPRegionBlock. + std::string createUniqueName(const char *Prefix) { + std::string S; + raw_string_ostream RSO(S); + RSO << Prefix << NextOrdinal++; + return RSO.str(); + } + + /// Add a given \p Recipe as the last recipe of a given VPBasicBlock. + void appendRecipeToBasicBlock(VPRecipeBase *Recipe, VPBasicBlock *ToVPBB) { + assert(Recipe && "No recipe to append."); + assert(!Recipe->Parent && "Recipe already in VPlan"); + ToVPBB->addRecipe(Recipe); + } + + /// Create a new empty VPBasicBlock and return it. + VPBasicBlock *createBasicBlock() { + VPBasicBlock *BasicBlock = new VPBasicBlock(createUniqueName("BB")); + return BasicBlock; + } + + /// Create a new VPBasicBlock with a single \p Recipe and return it. + VPBasicBlock *createBasicBlock(VPRecipeBase *Recipe) { + VPBasicBlock *BasicBlock = new VPBasicBlock(createUniqueName("BB")); + appendRecipeToBasicBlock(Recipe, BasicBlock); + return BasicBlock; + } + + /// Create a new, empty VPRegionBlock, with no blocks. + VPRegionBlock *createRegion(bool IsReplicator) { + VPRegionBlock *Region = new VPRegionBlock(createUniqueName("region")); + setReplicator(Region, IsReplicator); + return Region; + } + + /// Set the entry VPBlockBase of a given VPRegionBlock to a given \p Block. + /// Block is to have no predecessors. + void setRegionEntry(VPRegionBlock *Region, VPBlockBase *Block) { + assert(Block->Predecessors.empty() && + "Entry block cannot have predecessors."); + Region->Entry = Block; + Block->Parent = Region; + } + + /// Set the exit VPBlockBase of a given VPRegionBlock to a given \p Block. + /// Block is to have no successors. + void setRegionExit(VPRegionBlock *Region, VPBlockBase *Block) { + assert(Block->Successors.empty() && "Exit block cannot have successors."); + Region->Exit = Block; + Block->Parent = Region; + } + + void setReplicator(VPRegionBlock *Region, bool ToReplicate) { + Region->setReplicator(ToReplicate); + } + + /// Sets a given VPBlockBase \p Successor as the single successor of another + /// VPBlockBase \p Block. The parent of \p Block is copied to be the parent of + /// \p Successor. + void setSuccessor(VPBlockBase *Block, VPBlockBase *Successor) { + assert(Block->getSuccessors().empty() && "Block successors already set."); + Block->appendSuccessor(Successor); + Successor->appendPredecessor(Block); + Successor->Parent = Block->Parent; + } + + /// Sets two given VPBlockBases \p IfTrue and \p IfFalse to be the two + /// successors of another VPBlockBase \p Block. A given + /// VPConditionBitRecipeBase provides the control selector. The parent of + /// \p Block is copied to be the parent of \p IfTrue and \p IfFalse. 
+ void setTwoSuccessors(VPBlockBase *Block, VPConditionBitRecipeBase *R, + VPBlockBase *IfTrue, VPBlockBase *IfFalse) { + assert(Block->getSuccessors().empty() && "Block successors already set."); + Block->setConditionBitRecipe(R); + Block->appendSuccessor(IfTrue); + Block->appendSuccessor(IfFalse); + IfTrue->appendPredecessor(Block); + IfFalse->appendPredecessor(Block); + IfTrue->Parent = Block->Parent; + IfFalse->Parent = Block->Parent; + } + + /// Given two VPBlockBases \p From and \p To, disconnect them from each other. + void disconnectBlocks(VPBlockBase *From, VPBlockBase *To) { + From->removeSuccessor(To); + To->removePredecessor(From); + } +}; + +/// VPlanPrinter prints a given VPlan to a given output stream. The printing is +/// indented and follows the dot format. +class VPlanPrinter { +private: + raw_ostream &OS; + const VPlan &Plan; + unsigned Depth; + unsigned TabLength = 2; + std::string Indent; + + /// Handle indentation. + void buildIndent() { Indent = std::string(Depth * TabLength, ' '); } + void resetDepth() { + Depth = 1; + buildIndent(); + } + void increaseDepth() { + ++Depth; + buildIndent(); + } + void decreaseDepth() { + --Depth; + buildIndent(); + } + + /// Dump each element of VPlan. + void dumpBlock(const VPBlockBase *Block); + void dumpEdges(const VPBlockBase *Block); + void dumpBasicBlock(const VPBasicBlock *BasicBlock); + void dumpRegion(const VPRegionBlock *Region); + + const char *getNodePrefix(const VPBlockBase *Block); + const std::string &getReplicatorString(const VPRegionBlock *Region); + void drawEdge(const VPBlockBase *From, const VPBlockBase *To, bool Hidden, + const Twine &Label); + +public: + VPlanPrinter(raw_ostream &O, const VPlan &P) : OS(O), Plan(P) {} + void dump(const std::string &Title = ""); +}; + +//===--------------------------------------------------------------------===// +// GraphTraits specializations for VPlan/VPRegionBlock Control-Flow Graphs // +//===--------------------------------------------------------------------===// + +// Provide specializations of GraphTraits to be able to treat a VPRegionBlock +// as a graph of VPBlockBases... + +template <> struct GraphTraits { + typedef VPBlockBase *NodeRef; + typedef SmallVectorImpl::iterator ChildIteratorType; + + static NodeRef getEntryNode(NodeRef N) { return N; } + + static inline ChildIteratorType child_begin(NodeRef N) { + return N->getSuccessors().begin(); + } + + static inline ChildIteratorType child_end(NodeRef N) { + return N->getSuccessors().end(); + } +}; + +template <> struct GraphTraits { + typedef const VPBlockBase *NodeRef; + typedef SmallVectorImpl::const_iterator ChildIteratorType; + + static NodeRef getEntryNode(NodeRef N) { return N; } + + static inline ChildIteratorType child_begin(NodeRef N) { + return N->getSuccessors().begin(); + } + + static inline ChildIteratorType child_end(NodeRef N) { + return N->getSuccessors().end(); + } +}; + +// Provide specializations of GraphTraits to be able to treat a VPRegionBlock as +// a graph of VPBasicBlocks... and to walk it in inverse order. Inverse order +// for a VPRegionBlock is considered to be when traversing the predecessor edges +// of a VPBlockBase instead of the successor edges. 
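These GraphTraits specializations let generic graph algorithms walk a plan through its explicit successor and predecessor lists. A standalone sketch of blocks wired the way setSuccessor/setTwoSuccessors wire VPBlockBases, traversed in reverse post-order with a plain DFS (no LLVM dependency):

    #include <algorithm>
    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    struct Block {
      std::string Name;
      std::vector<Block *> Successors;
      std::vector<Block *> Predecessors;
    };

    // Mirrors VPlanUtils::setSuccessor: edges are stored explicitly on both
    // endpoints rather than through terminator instructions.
    static void setSuccessor(Block &From, Block &To) {
      From.Successors.push_back(&To);
      To.Predecessors.push_back(&From);
    }

    static void postOrder(Block *B, std::set<Block *> &Seen,
                          std::vector<Block *> &Out) {
      if (!Seen.insert(B).second)
        return;
      for (Block *S : B->Successors)
        postOrder(S, Seen, Out);
      Out.push_back(B);
    }

    int main() {
      Block Entry{"entry"}, Then{"if.then"}, Else{"if.else"}, Exit{"exit"};
      setSuccessor(Entry, Then); // first of the two conditional successors
      setSuccessor(Entry, Else); // second; the condition bit would pick one
      setSuccessor(Then, Exit);
      setSuccessor(Else, Exit);

      std::set<Block *> Seen;
      std::vector<Block *> PO;
      postOrder(&Entry, Seen, PO);
      std::reverse(PO.begin(), PO.end()); // reverse post-order, as in codegen

      for (Block *B : PO)
        std::cout << B->Name << "\n";
      return 0;
    }

Traversing the Predecessors vectors instead gives the inverse order that the Inverse specialization below exposes.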
+// + +template <> struct GraphTraits> { + typedef VPBlockBase *NodeRef; + typedef SmallVectorImpl::iterator ChildIteratorType; + + static Inverse getEntryNode(Inverse B) { + return B; + } + + static inline ChildIteratorType child_begin(NodeRef N) { + return N->getPredecessors().begin(); + } + + static inline ChildIteratorType child_end(NodeRef N) { + return N->getPredecessors().end(); + } +}; + +} // namespace llvm + +#endif // LLVM_TRANSFORMS_VECTORIZE_VPLAN_H Index: lib/Transforms/Vectorize/VPlan.cpp =================================================================== --- /dev/null +++ lib/Transforms/Vectorize/VPlan.cpp @@ -0,0 +1,400 @@ +//===- VPlan.cpp - Vectorizer Plan ----------------------------------------===// +// +// The LLVM Compiler Infrastructure +// +// This file is distributed under the University of Illinois Open Source +// License. See LICENSE.TXT for details. +// +//===----------------------------------------------------------------------===// +// +// This is the LLVM vectorization plan. It represents a candidate for +// vectorization, allowing to plan and optimize how to vectorize a given loop +// before generating LLVM-IR. +// The vectorizer uses vectorization plans to estimate the costs of potential +// candidates and if profitable to execute the desired plan, generating vector +// LLVM-IR code. +// +//===----------------------------------------------------------------------===// + +#include "VPlan.h" +#include "llvm/ADT/PostOrderIterator.h" +#include "llvm/Analysis/LoopInfo.h" +#include "llvm/IR/BasicBlock.h" +#include "llvm/IR/Dominators.h" +#include "llvm/Support/GraphWriter.h" +#include "llvm/Transforms/Utils/BasicBlockUtils.h" + +using namespace llvm; + +#define DEBUG_TYPE "vplan" + +unsigned VPlanUtils::NextOrdinal = 1; + +VPOneByOneRecipeBase::VPOneByOneRecipeBase(unsigned char SC, + const BasicBlock::iterator B, + const BasicBlock::iterator E, + class VPlan *Plan) + : VPRecipeBase(SC), Begin(B), End(E) { + for (auto It = B; It != E; ++It) + Plan->setInst2Recipe(&*It, this); +} + +/// \return the VPBasicBlock that is the entry of Block, possibly indirectly. +const VPBasicBlock *VPBlockBase::getEntryBasicBlock() const { + const VPBlockBase *Block = this; + while (const VPRegionBlock *Region = dyn_cast(Block)) + Block = Region->getEntry(); + return cast(Block); +} + +/// \return the VPBasicBlock that is the exit of Block, possibly indirectly. +const VPBasicBlock *VPBlockBase::getExitBasicBlock() const { + const VPBlockBase *Block = this; + while (const VPRegionBlock *Region = dyn_cast(Block)) + Block = Region->getExit(); + return cast(Block); +} + +VPBasicBlock *VPBlockBase::getExitBasicBlock() { + VPBlockBase *Block = this; + while (VPRegionBlock *Region = dyn_cast(Block)) + Block = Region->getExit(); + return cast(Block); +} + +/// Returns the closest ancestor, starting from "this", which has successors. +/// Returns the root ancestor if all ancestors have no successors. +VPBlockBase *VPBlockBase::getAncestorWithSuccessors() { + if (!Successors.empty() || !Parent) + return this; + assert(Parent->getExit() == this && + "Block w/o successors not the exit of its parent."); + return Parent->getAncestorWithSuccessors(); +} + +/// Returns the closest ancestor, starting from "this", which has predecessors. +/// Returns the root ancestor if all ancestors have no predecessors. 
+VPBlockBase *VPBlockBase::getAncestorWithPredecessors() { + if (!Predecessors.empty() || !Parent) + return this; + assert(Parent->getEntry() == this && + "Block w/o predecessors not the entry of its parent."); + return Parent->getAncestorWithPredecessors(); +} + +void VPBlockBase::deleteCFG(VPBlockBase *Entry) { + SmallVector Blocks; + for (VPBlockBase *Block : depth_first(Entry)) + Blocks.push_back(Block); + + for (VPBlockBase *Block : Blocks) + delete Block; +} + +BasicBlock * +VPBasicBlock::createEmptyBasicBlock(VPTransformState::CFGState &CFG) { + // BB stands for IR BasicBlocks. VPBB stands for VPlan VPBasicBlocks. + // Pred stands for Predessor. Prev stands for Previous, last visited/created. + BasicBlock *PrevBB = CFG.PrevBB; + BasicBlock *NewBB = BasicBlock::Create(PrevBB->getContext(), "VPlannedBB", + PrevBB->getParent(), CFG.LastBB); + DEBUG(dbgs() << "LV: created " << NewBB->getName() << '\n'); + + // Hook up the new basic block to its predecessors. + for (VPBlockBase *PredVPBlock : getHierarchicalPredecessors()) { + VPBasicBlock *PredVPBB = PredVPBlock->getExitBasicBlock(); + BasicBlock *PredBB = CFG.VPBB2IRBB[PredVPBB]; + DEBUG(dbgs() << "LV: draw edge from" << PredBB->getName() << '\n'); + if (isa(PredBB->getTerminator())) { + PredBB->getTerminator()->eraseFromParent(); + BranchInst::Create(NewBB, PredBB); + } else { + // Replace old unconditional branch with new conditional branch. + // Note: we rely on traversing the successors in order. + BasicBlock *FirstSuccBB = PredBB->getSingleSuccessor(); + PredBB->getTerminator()->eraseFromParent(); + Value *Bit = PredVPBlock->getConditionBitRecipe()->getConditionBit(); + assert(Bit && "Cannot create conditional branch with empty bit."); + BranchInst::Create(FirstSuccBB, NewBB, Bit, PredBB); + } + } + return NewBB; +} + +void VPBasicBlock::vectorize(VPTransformState *State) { + VPIterationInstance *I = State->Instance; + bool Replica = I && !(I->Part == 0 && I->Lane == 0); + VPBasicBlock *PrevVPBB = State->CFG.PrevVPBB; + VPBlockBase *SingleHPred = nullptr; + BasicBlock *NewBB = State->CFG.PrevBB; // Reuse it if possible. + + // 1. Create an IR basic block, or reuse the last one if possible. + // The last IR basic block is reused in three cases: + // A. the first VPBB reuses the header BB - when PrevVPBB is null; + // B. when the current VPBB has a single (hierarchical) predecessor which + // is PrevVPBB and the latter has a single (hierarchical) successor; and + // C. when the current VPBB is an entry of a region replica - where PrevVPBB + // is the exit of this region from a previous instance. + if (PrevVPBB && /* A */ + !((SingleHPred = getSingleHierarchicalPredecessor()) && + SingleHPred->getExitBasicBlock() == PrevVPBB && + PrevVPBB->getSingleHierarchicalSuccessor()) && /* B */ + !(Replica && getPredecessors().empty())) { /* C */ + + NewBB = createEmptyBasicBlock(State->CFG); + State->Builder.SetInsertPoint(NewBB); + // Temporarily terminate with unreachable until CFG is rewired. + UnreachableInst *Terminator = State->Builder.CreateUnreachable(); + State->Builder.SetInsertPoint(Terminator); + // Register NewBB in its loop. In innermost loops its the same for all BB's. + Loop *L = State->LI->getLoopFor(State->CFG.LastBB); + L->addBasicBlockToLoop(NewBB, *State->LI); + State->CFG.PrevBB = NewBB; + } + + // 2. Fill the IR basic block with IR instructions. 
+  DEBUG(dbgs() << "LV: vectorizing VPBB:" << getName()
+               << " in BB:" << NewBB->getName() << '\n');
+
+  State->CFG.VPBB2IRBB[this] = NewBB;
+  State->CFG.PrevVPBB = this;
+
+  for (VPRecipeBase &Recipe : Recipes)
+    Recipe.vectorize(*State);
+
+  DEBUG(dbgs() << "LV: filled BB:" << *NewBB);
+}
+
+void VPRegionBlock::vectorize(VPTransformState *State) {
+  ReversePostOrderTraversal<VPBlockBase *> RPOT(Entry);
+  typedef typename std::vector<VPBlockBase *>::reverse_iterator rpo_iterator;
+
+  if (!isReplicator()) {
+    // Visit the VPBlocks connected to \p this, starting from it.
+    for (rpo_iterator I = RPOT.begin(); I != RPOT.end(); ++I) {
+      DEBUG(dbgs() << "LV: VPBlock in RPO " << (*I)->getName() << '\n');
+      (*I)->vectorize(State);
+    }
+    return;
+  }
+
+  assert(!State->Instance &&
+         "Replicating a Region only in null context instance.");
+  VPIterationInstance I;
+  State->Instance = &I;
+
+  for (I.Part = 0; I.Part < State->UF; ++I.Part)
+    for (I.Lane = 0; I.Lane < State->VF; ++I.Lane)
+      // Visit the VPBlocks connected to \p this, starting from it.
+      for (rpo_iterator It = RPOT.begin(); It != RPOT.end(); ++It) {
+        DEBUG(dbgs() << "LV: VPBlock in RPO " << (*It)->getName() << '\n');
+        (*It)->vectorize(State);
+      }
+
+  State->Instance = nullptr;
+}
+
+/// Generate the code inside the body of the vectorized loop. Assumes a single
+/// LoopVectorBody basic block was created for this; introduces additional
+/// basic blocks as needed, and fills them all.
+void VPlan::vectorize(VPTransformState *State) {
+  BasicBlock *VectorPreHeaderBB = State->CFG.PrevBB;
+  BasicBlock *VectorHeaderBB = VectorPreHeaderBB->getSingleSuccessor();
+  assert(VectorHeaderBB && "Loop preheader does not have a single successor.");
+  BasicBlock *VectorLatchBB = VectorHeaderBB;
+  auto CurrIP = State->Builder.saveIP();
+
+  // 1. Make room to generate basic blocks inside loop body if needed.
+  VectorLatchBB = VectorHeaderBB->splitBasicBlock(
+      VectorHeaderBB->getFirstInsertionPt(), "vector.body.latch");
+  Loop *L = State->LI->getLoopFor(VectorHeaderBB);
+  L->addBasicBlockToLoop(VectorLatchBB, *State->LI);
+  // Remove the edge between Header and Latch to allow other connections.
+  // Temporarily terminate with unreachable until CFG is rewired.
+  // Note: this asserts the xform code's assumption that getFirstInsertionPt()
+  // can be dereferenced into an Instruction.
+  VectorHeaderBB->getTerminator()->eraseFromParent();
+  State->Builder.SetInsertPoint(VectorHeaderBB);
+  UnreachableInst *Terminator = State->Builder.CreateUnreachable();
+  State->Builder.SetInsertPoint(Terminator);
+
+  // 2. Generate code in loop body of vectorized version.
+  State->CFG.PrevVPBB = nullptr;
+  State->CFG.PrevBB = VectorHeaderBB;
+  State->CFG.LastBB = VectorLatchBB;
+
+  for (VPBlockBase *CurrentBlock = Entry; CurrentBlock != nullptr;
+       CurrentBlock = CurrentBlock->getSingleSuccessor()) {
+    assert(CurrentBlock->getSuccessors().size() <= 1 &&
+           "Multiple successors at top level.");
+    CurrentBlock->vectorize(State);
+  }
+
+  // 3. Merge the temporary latch created with the last basic block filled.
+  BasicBlock *LastBB = State->CFG.PrevBB;
+  // Connect LastBB to VectorLatchBB to facilitate their merge.
+  assert(isa<UnreachableInst>(LastBB->getTerminator()) &&
+         "Expected VPlan CFG to terminate with unreachable");
+  LastBB->getTerminator()->eraseFromParent();
+  BranchInst::Create(VectorLatchBB, LastBB);
+
+  // Merge LastBB with Latch.
+  bool Merged = MergeBlockIntoPredecessor(VectorLatchBB, nullptr, State->LI);
+  assert(Merged && "Could not merge last basic block with latch.");
+  VectorLatchBB = LastBB;
+
+  updateDominatorTree(State->DT, VectorPreHeaderBB, VectorLatchBB);
+  State->Builder.restoreIP(CurrIP);
+}
+
+void VPlan::updateDominatorTree(DominatorTree *DT, BasicBlock *LoopPreHeaderBB,
+                                BasicBlock *LoopLatchBB) {
+  BasicBlock *LoopHeaderBB = LoopPreHeaderBB->getSingleSuccessor();
+  assert(LoopHeaderBB && "Loop preheader does not have a single successor.");
+  DT->addNewBlock(LoopHeaderBB, LoopPreHeaderBB);
+  // The vector body may be more than a single basic block by this point.
+  // Update the dominator tree information inside the vector body by
+  // propagating it from header to latch, expecting only triangular
+  // control-flow, if any.
+  BasicBlock *PostDomSucc = nullptr;
+  for (auto *BB = LoopHeaderBB; BB != LoopLatchBB; BB = PostDomSucc) {
+    // Get the list of successors of this block.
+    std::vector<BasicBlock *> Succs(succ_begin(BB), succ_end(BB));
+    assert(Succs.size() <= 2 &&
+           "Basic block in vector loop has more than 2 successors.");
+    PostDomSucc = Succs[0];
+    if (Succs.size() == 1) {
+      assert(PostDomSucc->getSinglePredecessor() &&
+             "PostDom successor has more than one predecessor.");
+      DT->addNewBlock(PostDomSucc, BB);
+      continue;
+    }
+    BasicBlock *InterimSucc = Succs[1];
+    if (PostDomSucc->getSingleSuccessor() == InterimSucc) {
+      PostDomSucc = Succs[1];
+      InterimSucc = Succs[0];
+    }
+    assert(InterimSucc->getSingleSuccessor() == PostDomSucc &&
+           "One successor of a basic block does not lead to the other.");
+    assert(InterimSucc->getSinglePredecessor() &&
+           "Interim successor has more than one predecessor.");
+    assert(std::distance(pred_begin(PostDomSucc), pred_end(PostDomSucc)) == 2 &&
+           "PostDom successor has more than two predecessors.");
+    DT->addNewBlock(InterimSucc, BB);
+    DT->addNewBlock(PostDomSucc, BB);
+  }
+}
+
+const char *VPlanPrinter::getNodePrefix(const VPBlockBase *Block) {
+  if (isa<VPBasicBlock>(Block))
+    return "";
+  assert(isa<VPRegionBlock>(Block) && "Unsupported kind of VPBlock.");
+  return "cluster_";
+}
+
+const std::string &
+VPlanPrinter::getReplicatorString(const VPRegionBlock *Region) {
+  static std::string ReplicatorString(DOT::EscapeString("<xVFxUF>"));
+  static std::string NonReplicatorString(DOT::EscapeString("<x1>"));
+  return Region->isReplicator() ? ReplicatorString : NonReplicatorString;
+}
+
+void VPlanPrinter::dump(const std::string &Title) {
+  resetDepth();
+  OS << "digraph VPlan {\n";
+  OS << "graph [labelloc=t, fontsize=30; label=\"Vectorization Plan";
+  if (!Title.empty())
+    OS << "\\n" << DOT::EscapeString(Title);
+  OS << "\"]\n";
+  OS << "node [shape=record]\n";
+  OS << "compound=true\n";
+
+  for (const VPBlockBase *CurrentBlock = Plan.getEntry();
+       CurrentBlock != nullptr;
+       CurrentBlock = CurrentBlock->getSingleSuccessor())
+    dumpBlock(CurrentBlock);
+
+  OS << "}\n";
+}
+
+void VPlanPrinter::dumpBlock(const VPBlockBase *Block) {
+  if (const VPBasicBlock *BasicBlock = dyn_cast<VPBasicBlock>(Block))
+    dumpBasicBlock(BasicBlock);
+  else if (const VPRegionBlock *Region = dyn_cast<VPRegionBlock>(Block))
+    dumpRegion(Region);
+  else
+    llvm_unreachable("Unsupported kind of VPBlock.");
+}
+
+/// Print the information related to a CFG edge between two VPBlockBases.
+void VPlanPrinter::drawEdge(const VPBlockBase *From, const VPBlockBase *To,
+                            bool Hidden, const Twine &Label) {
+  // Due to "dot" we print an edge between two regions as an edge between the
+  // exit basic block and the entry basic block of the respective regions.
+  const VPBlockBase *Tail = From->getExitBasicBlock();
+  const VPBlockBase *Head = To->getEntryBasicBlock();
+  OS << Indent << getNodePrefix(Tail) << DOT::EscapeString(Tail->getName())
+     << " -> " << getNodePrefix(Head) << DOT::EscapeString(Head->getName());
+  OS << " [ label=\"" << Label << '\"';
+  if (Tail != From)
+    OS << " ltail=" << getNodePrefix(From)
+       << DOT::EscapeString(From->getName());
+  if (Head != To)
+    OS << " lhead=" << getNodePrefix(To) << DOT::EscapeString(To->getName());
+  if (Hidden)
+    OS << "; splines=none";
+  OS << "]\n";
+}
+
+/// Print the information related to the CFG edges going out of a given
+/// \p Block, followed by printing the successor blocks themselves.
+void VPlanPrinter::dumpEdges(const VPBlockBase *Block) {
+  std::string Cond = "";
+  if (auto *ConditionBitRecipe = Block->getConditionBitRecipe())
+    Cond = ConditionBitRecipe->getName().str();
+  unsigned SuccessorNumber = 1;
+  for (auto *Successor : Block->getSuccessors()) {
+    drawEdge(Block, Successor, false,
+             Twine() + (SuccessorNumber == 2 ? "!" : "") + Twine(Cond));
+    ++SuccessorNumber;
+  }
+}
+
+/// Print a VPBasicBlock, including its VPRecipes, followed by printing its
+/// successor blocks.
+void VPlanPrinter::dumpBasicBlock(const VPBasicBlock *BasicBlock) {
+  std::string Indent(Depth * TabLength, ' ');
+  OS << Indent << getNodePrefix(BasicBlock)
+     << DOT::EscapeString(BasicBlock->getName()) << " [label = \"{"
+     << DOT::EscapeString(BasicBlock->getName());
+
+  for (const VPRecipeBase &Recipe : BasicBlock->getRecipes()) {
+    OS << " | ";
+    std::string RecipeString;
+    raw_string_ostream RSO(RecipeString);
+    Recipe.print(RSO);
+    OS << DOT::EscapeString(RSO.str());
+  }
+
+  OS << "}\"]\n";
+  dumpEdges(BasicBlock);
+}
+
+/// Print a given \p Region of the VPlan.
+void VPlanPrinter::dumpRegion(const VPRegionBlock *Region) {
+  OS << Indent << "subgraph " << getNodePrefix(Region)
+     << DOT::EscapeString(Region->getName()) << " {\n";
+  increaseDepth();
+  OS << Indent;
+  OS << "label = \"" << getReplicatorString(Region) << " "
+     << DOT::EscapeString(Region->getName()) << "\"\n\n";
+
+  // Dump the blocks of the region.
+  assert(Region->getEntry() && "Region contains no inner blocks.");
+
+  for (const VPBlockBase *Block : depth_first(Region->getEntry()))
+    dumpBlock(Block);
+
+  decreaseDepth();
+  OS << Indent << "}\n";
+  dumpEdges(Region);
+}
Index: test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
===================================================================
--- test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
+++ test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
@@ -15,9 +15,9 @@
 ; CHECK: br i1 {{.*}}, label %[[IF0:.+]], label %[[CONT0:.+]]
 ; CHECK: [[IF0]]:
 ; CHECK: %[[T00:.+]] = extractelement <2 x i64> %wide.load, i32 0
-; CHECK: %[[T01:.+]] = extractelement <2 x i64> %wide.load, i32 0
-; CHECK: %[[T02:.+]] = add nsw i64 %[[T01]], %x
-; CHECK: %[[T03:.+]] = udiv i64 %[[T00]], %[[T02]]
+; CHECK: %[[T01:.+]] = add nsw i64 %[[T00]], %x
+; CHECK: %[[T02:.+]] = extractelement <2 x i64> %wide.load, i32 0
+; CHECK: %[[T03:.+]] = udiv i64 %[[T02]], %[[T01]]
 ; CHECK: %[[T04:.+]] = insertelement <2 x i64> undef, i64 %[[T03]], i32 0
 ; CHECK: br label %[[CONT0]]
 ; CHECK: [[CONT0]]:
@@ -25,9 +25,9 @@
 ; CHECK: br i1 {{.*}}, label %[[IF1:.+]], label %[[CONT1:.+]]
 ; CHECK: [[IF1]]:
 ; CHECK: %[[T06:.+]] = extractelement <2 x i64> %wide.load, i32 1
-; CHECK: %[[T07:.+]] = extractelement <2 x i64> %wide.load, i32 1
-; CHECK: %[[T08:.+]] = add nsw i64 %[[T07]], %x
-; CHECK: %[[T09:.+]] = udiv i64 %[[T06]], %[[T08]]
+; CHECK: %[[T07:.+]] = add nsw i64 %[[T06]], %x
+; CHECK: %[[T08:.+]] = extractelement <2 x i64> %wide.load, i32 1
+; CHECK: %[[T09:.+]] = udiv i64 %[[T08]], %[[T07]]
 ; CHECK: %[[T10:.+]] = insertelement <2 x i64> %[[T05]], i64 %[[T09]], i32 1
 ; CHECK: br label %[[CONT1]]
 ; CHECK: [[CONT1]]:
Index: test/Transforms/LoopVectorize/AArch64/predication_costs.ll
===================================================================
--- test/Transforms/LoopVectorize/AArch64/predication_costs.ll
+++ test/Transforms/LoopVectorize/AArch64/predication_costs.ll
@@ -18,8 +18,8 @@
 ; Cost of udiv:
 ; (udiv(2) + extractelement(6) + insertelement(3)) / 2 = 5
 ;
-; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
 ; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
+; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
 ;
 define i32 @predicated_udiv(i32* %a, i32* %b, i1 %c, i64 %n) {
 entry:
@@ -59,8 +59,8 @@
 ; Cost of store:
 ; (store(4) + extractelement(3)) / 2 = 3
 ;
-; CHECK: Found an estimated cost of 3 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
 ; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4
+; CHECK: Found an estimated cost of 3 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
 ;
 define void @predicated_store(i32* %a, i1 %c, i32 %x, i64 %n) {
 entry:
@@ -98,10 +98,10 @@
 ; Cost of udiv:
 ; (udiv(2) + extractelement(3) + insertelement(3)) / 2 = 4
 ;
-; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp3 = add nsw i32 %tmp2, %x
-; CHECK: Found an estimated cost of 4 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
 ; CHECK: Scalarizing: %tmp3 = add nsw i32 %tmp2, %x
 ; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
+; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp3 = add nsw i32 %tmp2, %x
+; CHECK: Found an estimated cost of 4 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
 ;
 define i32 @predicated_udiv_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) {
 entry:
@@ -143,10 +143,10 @@
 ; Cost of store:
 ; store(4) / 2 = 2
 ;
-; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp2 = add nsw i32 %tmp1, %x
-; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
 ; CHECK: Scalarizing: %tmp2 = add nsw i32 %tmp1, %x
 ; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4
+; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp2 = add nsw i32 %tmp1, %x
+; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
 ;
 define void @predicated_store_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) {
 entry:
@@ -192,16 +192,16 @@
 ; Cost of store:
 ; store(4) / 2 = 2
 ;
-; CHECK: Found an estimated cost of 1 for VF 2 For instruction: %tmp2 = add i32 %tmp1, %x
-; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp3 = sdiv i32 %tmp1, %tmp2
-; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp3, %tmp2
-; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp5 = sub i32 %tmp4, %x
-; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp5, i32* %tmp0, align 4
 ; CHECK-NOT: Scalarizing: %tmp2 = add i32 %tmp1, %x
 ; CHECK: Scalarizing and predicating: %tmp3 = sdiv i32 %tmp1, %tmp2
 ; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp3, %tmp2
 ; CHECK: Scalarizing: %tmp5 = sub i32 %tmp4, %x
 ; CHECK: Scalarizing and predicating: store i32 %tmp5, i32* %tmp0, align 4
+; CHECK: Found an estimated cost of 1 for VF 2 For instruction: %tmp2 = add i32 %tmp1, %x
+; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp3 = sdiv i32 %tmp1, %tmp2
+; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp3, %tmp2
+; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp5 = sub i32 %tmp4, %x
+; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp5, i32* %tmp0, align 4
 ;
 define void @predication_multi_context(i32* %a, i1 %c, i32 %x, i64 %n) {
 entry:
Index: test/Transforms/LoopVectorize/if-pred-non-void.ll
===================================================================
--- test/Transforms/LoopVectorize/if-pred-non-void.ll
+++ test/Transforms/LoopVectorize/if-pred-non-void.ll
@@ -219,9 +219,9 @@
 ; CHECK: br i1 {{.*}}, label %[[IF0:.+]], label %[[CONT0:.+]]
 ; CHECK: [[IF0]]:
 ; CHECK: %[[T00:.+]] = extractelement <2 x i32> %wide.load, i32 0
-; CHECK: %[[T01:.+]] = extractelement <2 x i32> %wide.load, i32 0
-; CHECK: %[[T02:.+]] = add nsw i32 %[[T01]], %x
-; CHECK: %[[T03:.+]] = udiv i32 %[[T00]], %[[T02]]
+; CHECK: %[[T01:.+]] = add nsw i32 %[[T00]], %x
+; CHECK: %[[T02:.+]] = extractelement <2 x i32> %wide.load, i32 0
+; CHECK: %[[T03:.+]] = udiv i32 %[[T02]], %[[T01]]
 ; CHECK: %[[T04:.+]] = insertelement <2 x i32> undef, i32 %[[T03]], i32 0
 ; CHECK: br label %[[CONT0]]
 ; CHECK: [[CONT0]]:
@@ -229,9 +229,9 @@
 ; CHECK: br i1 {{.*}}, label %[[IF1:.+]], label %[[CONT1:.+]]
 ; CHECK: [[IF1]]:
 ; CHECK: %[[T06:.+]] = extractelement <2 x i32> %wide.load, i32 1
-; CHECK: %[[T07:.+]] = extractelement <2 x i32> %wide.load, i32 1
-; CHECK: %[[T08:.+]] = add nsw i32 %[[T07]], %x
-; CHECK: %[[T09:.+]] = udiv i32 %[[T06]], %[[T08]]
+; CHECK: %[[T07:.+]] = add nsw i32 %[[T06]], %x
+; CHECK: %[[T08:.+]] = extractelement <2 x i32> %wide.load, i32 1
+; CHECK: %[[T09:.+]] = udiv i32 %[[T08]], %[[T07]]
 ; CHECK: %[[T10:.+]] = insertelement <2 x i32> %[[T05]], i32 %[[T09]], i32 1
 ; CHECK: br label %[[CONT1]]
 ; CHECK: [[CONT1]]:
Index: test/Transforms/LoopVectorize/induction.ll
===================================================================
--- test/Transforms/LoopVectorize/induction.ll
+++ test/Transforms/LoopVectorize/induction.ll
@@ -309,18 +309,18 @@
 ;
 ; CHECK-LABEL: @scalarize_induction_variable_05(
 ; CHECK: vector.body:
-; CHECK: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue2 ]
+; CHECK: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue4 ]
 ; CHECK: %[[I0:.+]] = add i32 %index, 0
 ; CHECK: getelementptr inbounds i32, i32* %a, i32 %[[I0]]
 ; CHECK: pred.udiv.if:
 ; CHECK: udiv i32 {{.*}}, %[[I0]]
-; CHECK: pred.udiv.if1:
+; CHECK: pred.udiv.if3:
 ; CHECK: %[[I1:.+]] = add i32 %index, 1
 ; CHECK: udiv i32 {{.*}}, %[[I1]]
 ;
 ; UNROLL-NO_IC-LABEL: @scalarize_induction_variable_05(
 ; UNROLL-NO-IC: vector.body:
-; UNROLL-NO-IC: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue11 ]
+; UNROLL-NO-IC: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue13 ]
 ; UNROLL-NO-IC: %[[I0:.+]] = add i32 %index, 0
 ; UNROLL-NO-IC: %[[I2:.+]] = add i32 %index, 2
 ; UNROLL-NO-IC: getelementptr inbounds i32, i32* %a, i32 %[[I0]]
@@ -330,26 +330,26 @@
 ; UNROLL-NO-IC: pred.udiv.if6:
 ; UNROLL-NO-IC: %[[I1:.+]] = add i32 %index, 1
 ; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I1]]
-; UNROLL-NO-IC: pred.udiv.if8:
+; UNROLL-NO-IC: pred.udiv.if9:
 ; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I2]]
-; UNROLL-NO-IC: pred.udiv.if10:
+; UNROLL-NO-IC: pred.udiv.if12:
 ; UNROLL-NO-IC: %[[I3:.+]] = add i32 %index, 3
 ; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I3]]
 ;
 ; IND-LABEL: @scalarize_induction_variable_05(
 ; IND: vector.body:
-; IND: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue2 ]
+; IND: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue4 ]
 ; IND: %[[E0:.+]] = sext i32 %index to i64
 ; IND: getelementptr inbounds i32, i32* %a, i64 %[[E0]]
 ; IND: pred.udiv.if:
 ; IND: udiv i32 {{.*}}, %index
-; IND: pred.udiv.if1:
+; IND: pred.udiv.if3:
 ; IND: %[[I1:.+]] = or i32 %index, 1
 ; IND: udiv i32 {{.*}}, %[[I1]]
 ;
 ; UNROLL-LABEL: @scalarize_induction_variable_05(
 ; UNROLL: vector.body:
-; UNROLL: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue11 ]
+; UNROLL: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue13 ]
 ; UNROLL: %[[I2:.+]] = or i32 %index, 2
 ; UNROLL: %[[E0:.+]] = sext i32 %index to i64
 ; UNROLL: %[[G0:.+]] = getelementptr inbounds i32, i32* %a, i64 %[[E0]]
@@ -359,9 +359,9 @@
 ; UNROLL: pred.udiv.if6:
 ; UNROLL: %[[I1:.+]] = or i32 %index, 1
 ; UNROLL: udiv i32 {{.*}}, %[[I1]]
-; UNROLL: pred.udiv.if8:
+; UNROLL: pred.udiv.if9:
 ; UNROLL: udiv i32 {{.*}}, %[[I2]]
-; UNROLL: pred.udiv.if10:
+; UNROLL: pred.udiv.if12:
 ; UNROLL: %[[I3:.+]] = or i32 %index, 3
 ; UNROLL: udiv i32 {{.*}}, %[[I3]]