Index: docs/VectorizationPlan.rst
===================================================================
--- /dev/null
+++ docs/VectorizationPlan.rst
@@ -0,0 +1,574 @@
++++++
+VPlan
++++++
+
+Goal of initial VPlan patch
++++++++++++++++++++++++++++
+The design and implementation of VPlan follow our RFC [10]_ and presentation
+[11]_. The initial patch is designed to:
+
+- be a *lightweight* NFC patch;
+- show key aspects of VPlan's Hierarchical CFG concept;
+- demonstrate how VPlan can
+
+  * capture *all* current vectorization decisions: which instructions are to
+
+    + be vectorized "on their own", or
+    + be part of an interleave group, or
+    + be scalarized, and optionally have scalar instances moved down to other
+      basic blocks and under a condition; and
+    + be packed or unpacked (at the definition rather than at its uses) to
+      provide both scalarized and vectorized forms; and
+
+  * represent all control-flow *within the loop body* of the vectorized code
+    version.
+
+- be a step towards
+
+  * aligning the Cost step with the Transformation step,
+  * representing the entire code being transformed,
+  * adding optimizations:
+
+    + optimizing conditional scalarization further,
+    + retaining uniform control-flow,
+    + vectorizing outer loops,
+    + and more.
+
+Out of scope for initial patch:
+
+- changing how a loop is checked for whether it can be vectorized - "Legal";
+- changing how a loop is checked for whether it should be vectorized - "Cost".
+
+
+==================
+Vectorization Plan
+==================
+
+.. contents::
+   :local:
+
+Overview
+========
+The Vectorization Plan is an explicit recipe for describing a vectorization
+candidate. It serves both for estimating the cost accurately and for
+performing the translation, and it facilitates dealing with multiple
+vectorization candidates.
+
+The overall structure consists of:
+
+1. One LoopVectorizationPlanner for each attempt to vectorize a loop or a loop
+   nest.
+
+2. A LoopVectorizationPlanner can construct, optimize and discard one or more
+   VPlans, providing different ways to vectorize the loop or the loop nest.
+
+3. Once the best VPlan is determined, including the best vectorization factor
+   and unroll factor, this VPlan drives the vector code generation using a
+   VPTransformState object.
+
+4. Each VPlan represents the loop or the loop nest using a hierarchical CFG.
+
+5. At the bottom level of the hierarchical CFG are VPBasicBlocks.
+
+6. Each VPBasicBlock consists of one or more VPRecipes that generate the
+   Instructions for it; a toy model of this containment is sketched below.
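+
+To picture this containment, here is a small self-contained model. It is an
+illustrative sketch only - the Toy* names are invented for this example and
+are not the patch's actual classes:
+
+.. code-block:: c++
+
+  #include <memory>
+  #include <string>
+  #include <vector>
+
+  // Toy model of the ownership structure: a planner owns candidate plans;
+  // each plan owns the (leaf) basic blocks of its hierarchical CFG; each
+  // basic block owns recipes that generate the vectorized Instructions.
+  struct ToyRecipe {
+    std::string Ingredients; // stands in for the original Instructions
+  };
+
+  struct ToyBasicBlock {
+    std::vector<ToyRecipe> Recipes;
+  };
+
+  struct ToyPlan {
+    unsigned VF, UF; // vectorization and unroll factors
+    std::vector<ToyBasicBlock> Blocks;
+  };
+
+  struct ToyPlanner {
+    std::vector<std::unique_ptr<ToyPlan>> Candidates; // one per alternative
+  };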
+
+Motivation
+----------
+The vectorization transformation can be rather complicated, involving several
+potential alternatives, especially for outer loops [1]_ but also possibly for
+innermost loops. These alternatives may have significant performance impact,
+both positive and negative. A cost model is therefore employed to identify the
+best alternative, including the alternative of avoiding any transformation
+altogether.
+
+The process of vectorization traditionally involves three major steps: Legal,
+Cost, and Transform. This is the general case in LLVM's Loop Vectorizer:
+
+1. Legal Step: check if the loop can be legally vectorized; encode constraints
+   and artifacts if so.
+2. Cost Step: compute the relative cost of vectorizing it along possible
+   vectorization and unroll factors (VF, UF).
+3. Transform Step: vectorize the loop according to the best VF and UF.
+
+This design, which works directly on the original LLVM-IR only, has some
+implications:
+
+1. Cost Step tries to predict what the vectorized loop will look like and how
+   much it will cost, independently of what the Transform Step will eventually
+   do. It is hard to keep the two in sync.
+2. Cost Step essentially considers a single vectorization candidate. Any
+   alternatives are immediately evaluated and resolved.
+3. Legal Step does more than check for vectorizability; e.g., it records
+   auxiliary artifacts such as collectLoopUniforms() and InterleaveInfo.
+4. Transform Step first populates the single basic block of the vectorized
+   loop and later revisits scalarized instructions to predicate them one by
+   one, as needed.
+
+The Vectorization Plan is designed to model a vectorization candidate
+explicitly and thereby overcome the above constraints, which is especially
+important for the vectorization of outer loops. This affects the overall
+process by essentially splitting the Transform Step into a Plan Step and a
+Code-Gen Step:
+
+1. Legal Step: check if the loop can be legally vectorized; encode constraints
+   and artifacts if so. The Vectorization Plan showing how the loop can be
+   vectorized is initiated only after passing Legal, to save redundant
+   construction.
+2. Plan Step:
+
+   a. Build initial Vectorization Plans following the constraints and
+      decisions taken by Legal.
+   b. Explore ways to optimize the vectorization plan, complying with
+      all legal constraints, possibly constructing several plans following
+      tentative vectorization decisions.
+3. Cost Step: compute the relative cost of each plan. This step can be applied
+   repeatedly by Plan Step 2.b.
+4. Code-Gen Step: materialize the best plan. Note that only this step modifies
+   the IR, as in the current Loop Vectorizer.
+
+The Cost Step can also be split into one or more Early-Pruning Steps and a
+"Cost-Gen" Step, where the former apply quick yet inaccurate estimates to
+prune obviously-unpromising candidates, and the latter applies accurate
+estimates based on a full Plan; a toy sketch of this selection appears at the
+end of this section.
+
+One can compare with LLVM's existing SLP vectorizer, where TSLP [3]_ adds
+Step 2.b.
+
+As the scope of vectorization grows from innermost to outer loops, so do the
+uncertainty and complexity of each step. One way to mitigate the shortcomings
+of the Legal and Cost steps is to rely on programmers to indicate which loops
+can and/or should be vectorized. This is implicit for certain loops in
+data-parallel languages such as OpenCL [4]_, [5]_ and explicit in others such
+as OpenMP [6]_. This design to extend the Loop Vectorizer to outer loops
+supports and raises the importance of explicit vectorization beyond the
+current capabilities of Clang and LLVM, namely from forcing the vectorization
+of innermost loops according to a prescribed width and/or interleaving count,
+to supporting OpenMP's "#pragma omp simd" construct and associated clauses,
+including vectorizing across function boundaries [2]_.
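+
+To make the split concrete, the following self-contained sketch models steps
+2-4 above: build one candidate per VF, prune the obviously-unpromising ones
+cheaply, cost each survivor accurately, and materialize only the winner. The
+Candidate structure and both cost functions are invented for this
+illustration:
+
+.. code-block:: c++
+
+  #include <limits>
+  #include <vector>
+
+  // Toy model of Plan / Early-Pruning / "Cost-Gen"; the costs are made up.
+  struct Candidate {
+    unsigned VF;
+    unsigned roughCost() const { return 64 / VF; }         // quick estimate
+    unsigned accurateCost() const { return 64 / VF + VF; } // full-plan cost
+  };
+
+  unsigned selectBestVF(unsigned MaxVF) {
+    std::vector<Candidate> Candidates;
+    for (unsigned VF = 2; VF <= MaxVF; VF *= 2) // Plan Step: one per VF
+      if (Candidate{VF}.roughCost() <= 32)      // Early-Pruning Step
+        Candidates.push_back({VF});
+    unsigned BestVF = 1;
+    unsigned BestCost = std::numeric_limits<unsigned>::max();
+    for (const Candidate &C : Candidates)       // "Cost-Gen" Step
+      if (C.accurateCost() < BestCost) {
+        BestCost = C.accurateCost();
+        BestVF = C.VF;
+      }
+    return BestVF; // Code-Gen Step would modify the IR only for this VF.
+  }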
+
+References
+----------
+.. [1] "Outer-loop vectorization: revisited for short SIMD architectures",
+   Dorit Nuzman and Ayal Zaks, PACT 2008.
+
+.. [2] "Proposal for function vectorization and loop vectorization with
+   function calls", Xinmin Tian, [`cfe-dev `_], March 2, 2016.
+   See also `review `_.
+
+.. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios
+   Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015.
+
+.. [4] "Intel OpenCL SDK Vectorizer", Nadav Rotem, LLVM Developers' Meeting
+   2011.
+
+.. [5] "Automatic SIMD Vectorization of SSA-based Control Flow Graphs", Ralf
+   Karrenberg, Springer 2015. See also "Improving Performance of OpenCL on
+   CPUs", LLVM Developers' Meeting 2012.
+
+.. [6] "Compiling C/C++ SIMD Extensions for Function and Loop Vectorization on
+   Multicore-SIMD Processors", Xinmin Tian and Hideki Saito et al.,
+   IPDPSW 2012.
+
+.. [7] "Exploiting mixed SIMD parallelism by reducing data reorganization
+   overhead", Hao Zhou and Jingling Xue, CGO 2016.
+
+.. [8] "Register Allocation via Hierarchical Graph Coloring", David Callahan
+   and Brian Koblenz, PLDI 1991.
+
+.. [9] "Structural analysis: A new approach to flow analysis in optimizing
+   compilers", M. Sharir, Journal of Computer Languages, Jan. 1980.
+
+.. [10] "RFC: Extending LV to vectorize outerloops", [`llvm-dev `_],
+   September 21, 2016.
+
+.. [11] "Extending LoopVectorizer towards supporting OpenMP4.5 SIMD and outer
+   loop auto-vectorization", Hideki Saito, `LLVM Developers' Meeting 2016 `_,
+   November 3, 2016.
+
+Examples
+--------
+An example with a single predicated scalarized instruction - integer division:
+
+.. code-block:: c
+
+  void foo(int* a, int b, int* c) {
+  #pragma simd
+    for (int i = 0; i < 10000; ++i)
+      if (a[i] > 777)
+        a[i] = b - (c[i] + a[i] / b);
+  }
+
+
+IR Dump Before Loop Vectorization:
+
+.. code-block:: LLVM
+   :emphasize-lines: 6,11
+
+  for.body:                            ; preds = %for.inc, %entry
+    %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.inc ]
+    %arrayidx = getelementptr inbounds i32, i32* %a, i64 %indvars.iv
+    %0 = load i32, i32* %arrayidx, align 4, !tbaa !1
+    %cmp1 = icmp sgt i32 %0, 777
+    br i1 %cmp1, label %if.then, label %for.inc
+
+  if.then:                             ; preds = %for.body
+    %arrayidx3 = getelementptr inbounds i32, i32* %c, i64 %indvars.iv
+    %1 = load i32, i32* %arrayidx3, align 4, !tbaa !1
+    %div = sdiv i32 %0, %b
+    %add.neg = sub i32 %b, %1
+    %sub = sub i32 %add.neg, %div
+    store i32 %sub, i32* %arrayidx, align 4, !tbaa !1
+    br label %for.inc
+
+  for.inc:                             ; preds = %for.body, %if.then
+    %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+    %exitcond = icmp eq i64 %indvars.iv.next, 10000
+    br i1 %exitcond, label %for.cond.cleanup, label %for.body
+
+The VPlan that is built initially:
+
+.. image:: VPlanPrinter.png
+
+Design Guidelines
+=================
+1. Analysis-like: building and manipulating the Vectorization Plan must not
+   modify the IR. In particular, if a VPlan is discarded, compilation should
+   proceed as if the VPlan had not been built.
+
+2. Support all current capabilities: the Vectorization Plan must be capable of
+   representing the exact functionality of LLVM's existing Loop Vectorizer.
+   In particular, the transition can start with an NFC patch.
+   In particular, VPlan must support efficient selection of VF and/or UF.
+
+3. Align Cost & CodeGen: the Vectorization Plan must serve both the cost
+   model and the code generation phases, where the cost estimation must
+   evaluate the to-be-generated code accurately.
+
+4. Support vectorizing additional constructs:
+
+   a. vectorization of outer loops.
+      In particular, VPlan must be able to represent the control-flow of a
+      vectorized loop which may include multiple basic-blocks and nested
+      loops.
+   b. SLP vectorization.
+   c. Combinations of the above, including nested vectorization: vectorizing
+      both an inner loop and an outer loop at the same time (each with its own
+      VF and UF), mixed vectorization: vectorizing a loop and SLP patterns
+      inside [7]_, and (re)vectorizing vector code.
+
+5. Support multiple candidates efficiently:
+   In particular, similar candidates related to a range of possible VF's and
+   UF's must be represented efficiently.
+   In particular, potential versionings must be supported efficiently.
+
+6. Compact: the Vectorization Plan must be efficient and provide as compact a
+   representation as possible, in particular where the transformation is
+   straightforward and where the plan is to reuse existing IR (e.g., for
+   leftover iterations).
+
+VPlan Classes: Definitions
+==========================
+
+:VPlan:
+  A recipe for generating a vectorized version from a given IR code.
+  Takes a "scenario-based approach" to vectorization planning.
+  The given IR code is required to be SESE, mainly to simplify dominance
+  information. The vectorized version is represented using a Hierarchical CFG.
+
+:Hierarchical CFG:
+  A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's.
+  The Hierarchical CFG data structure we use is similar to the Tile Tree [8]_,
+  where cross-Tile edges are lifted to connect Tiles instead of the original
+  basic-blocks as in Sharir [9]_, promoting the Tile encapsulation. We use the
+  terms Region and Block rather than Tile [8]_ to avoid confusion with loop
+  tiling.
+
+:VPBasicBlock:
+  Serves as the leaf of the Hierarchical CFG. Represents a sequence of
+  instructions that will appear consecutively in a basic block of the
+  vectorized version; such a basic block may combine instructions originating
+  from one or more VPBasicBlocks.
+  The VPBasicBlock takes care of the control-flow relations with other
+  VPBasicBlock's and Regions.
+  Holds a sequence of zero or more VPRecipe's that take care of representing
+  the instructions.
+  A VPBasicBlock that holds no VPRecipe's represents no instructions; this
+  may happen, e.g., to support disjoint Regions and to ensure Regions have a
+  single exit, possibly an empty one.
+
+:VPRecipeBase:
+  A base class describing one or more instructions that will appear
+  consecutively in the vectorized version, based on Instructions from the
+  given IR.
+  These Instructions are referred to as the "Ingredients" of the Recipe.
+  A Recipe specifies how its ingredients are to be vectorized: e.g.,
+  copy or reuse them as uniform, scalarize or vectorize them according to an
+  enclosing loop dimension, or vectorize them according to an internal SLP
+  dimension.
+
+  **Design principle:** in order to reason about how to vectorize an
+  Instruction or how much it would cost, one has to consult the VPRecipe
+  holding it.
+
+  **Design principle:** when a sequence of instructions conveys additional
+  information as a group, we use a VPRecipe to encapsulate them and attach
+  this information to the VPRecipe. For instance a VPRecipe can model an
+  interleave group of loads or stores with additional information for
+  calculating their cost and performing code-gen, as a group.
+
+  **Design principle:** where possible a VPRecipe should reuse the existing
+  container of its ingredients. A new container should be opened on-demand,
+  e.g., to facilitate changing the order of Instructions between the original
+  and vectorized versions.
+
+:VPOneByOneRecipeBase:
+  Represents recipes which transform each Instruction in their Ingredients
+  independently, in order.
+  The Ingredients are a sub-sequence of original Instructions, which reside in
+  the same IR BasicBlock and in the same order. The Ingredients are accessed
+  by a pointer to the first and last Instruction in their original IR basic
+  block. Serves as a base class for the concrete sub-classes
+  VPScalarizeOneByOneRecipe and VPVectorizeOneByOneRecipe; a sketch of this
+  one-by-one execution follows.
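+
+  For illustration, the execution of a one-by-one recipe conceptually reduces
+  to the loop below. Begin, End and transformIRInstruction() do appear in the
+  patch further down; packaging them in exactly this vectorize() body is our
+  own sketch, not code taken from the patch:
+
+  .. code-block:: c++
+
+    // Sketch: visit the contiguous sub-sequence of original Instructions in
+    // order, transforming each ingredient independently. Each concrete
+    // sub-class supplies transformIRInstruction() to scalarize or vectorize
+    // a single Instruction.
+    void VPOneByOneRecipeBase::vectorize(VPTransformState &State) {
+      for (BasicBlock::iterator It = Begin; It != End; ++It)
+        transformIRInstruction(&*It, State);
+    }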
+
+:VPScalarizeOneByOneRecipe:
+  A concrete VPRecipe which scalarizes each ingredient, generating either
+  instances of lane 0 for a uniform instruction, or instances for a range of
+  lanes otherwise; see the lane-range sketch after these definitions.
+
+:VPVectorizeOneByOneRecipe:
+  A concrete VPRecipe which vectorizes each ingredient.
+
+:VPInterleaveRecipe:
+  A concrete VPRecipe which transforms an interleave group of loads or stores
+  into one wide load/store and shuffles.
+
+:VPConditionBitRecipeBase:
+  A base class for VPRecipes which provide the condition bit feeding a
+  conditional branch. Such cases correspond to scalarized or uniform branches.
+
+:VPExtractMaskBitRecipe:
+  A concrete VPRecipe which represents the extraction of a bit from a mask,
+  needed when scalarizing a conditional branch.
+  Such branches are needed to guard scalarized and predicated instructions.
+
+:VPMergeScalarizeBranchRecipe:
+  A concrete VPRecipe which represents the Phi's needed when control converges
+  back from a scalarized branch.
+  Such Phi's are needed to merge live-out values that are set under a
+  scalarized branch. They can be scalar or vector, depending on the user of
+  the live-out value.
+
+:VPWidenIntInductionRecipe:
+  A concrete VPRecipe which widens integer inductions, producing their vector
+  values and computing the necessary values for producing their scalar values.
+  The scalar values themselves are generated, possibly elsewhere, by the
+  complementing VPBuildScalarStepsRecipe.
+
+:VPBuildScalarStepsRecipe:
+  A concrete VPRecipe complementing the handling of integer induction
+  variables, responsible for generating the scalar values used by the IV's
+  scalar users.
+
+:VPRegionBlock:
+  A collection of VPBasicBlocks and VPRegionBlocks which form a
+  single-entry-single-exit subgraph of the CFG in the vectorized code.
+
+  **Design principle:** when some additional information relates to an SESE
+  set of VPBlocks, we use a VPRegionBlock to wrap them and attach the
+  information to it. For example, a VPRegionBlock can be used to indicate that
+  a scalarized SESE region is to be replicated. It is also designed to serve
+  predicating divergent branches while retaining uniform branches as much as
+  possible / desirable, and to represent inner loops.
+
+:VPBlockBase:
+  The building block of the Hierarchical CFG. A VPBlockBase can be either a
+  VPBasicBlock or a VPRegionBlock.
+  A VPBlockBase may indicate that its contents are to be replicated several
+  times. This is designed to support scalarizing VPBlockBases which generate
+  VF replicas of their instructions, which in turn remain scalar, and to do
+  so using a single VPlan for multiple candidate VF's.
+
+:VPTransformState:
+  Stores information used for code generation, passed from the Planner to its
+  selected VPlan for execution, and used to pass additional information down
+  from VPBlocks to the VPRecipes.
+
+:VPlanUtils:
+  Contains a collection of methods for the construction and modification of
+  abstract VPlans.
+
+:VPlanUtilsLoopVectorizer:
+  Derived from VPlanUtils, providing additional methods for the construction
+  and modification of VPlans.
+
+:LoopVectorizationPlanner:
+  The object in charge of creating and manipulating VPlans for a given IR
+  code.
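+
+The lane-range sketch referred to above: several recipes restrict the lanes
+for which they generate instances, using the patch's VPLaneRange (defined in
+the code further down). The following self-contained model, whose Toy* names
+are invented here, shows the intersection semantics:
+
+.. code-block:: c++
+
+  #include <algorithm>
+  #include <cassert>
+
+  // Toy model of a lane range [MinLane, MaxLane]: intersecting two ranges
+  // keeps only the lanes common to both; an empty result means no instances
+  // need to be generated.
+  struct ToyLaneRange {
+    unsigned MinLane, MaxLane;
+    bool isEmpty() const { return MinLane > MaxLane; }
+  };
+
+  ToyLaneRange intersect(ToyLaneRange A, ToyLaneRange B) {
+    return {std::max(A.MinLane, B.MinLane), std::min(A.MaxLane, B.MaxLane)};
+  }
+
+  void example() {
+    ToyLaneRange Uniform{0, 0}; // lane 0 only, as for a uniform instruction
+    ToyLaneRange Full{0, 3};    // all lanes for VF = 4
+    ToyLaneRange R = intersect(Uniform, Full);
+    assert(!R.isEmpty() && R.MinLane == 0 && R.MaxLane == 0);
+    (void)R;
+  }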
+
+
+VPlan Classes: Diagram
+======================
+
+The classes of VPlan with main fields and methods; sub-classes of VPRecipeBase
+are shown in a separate figure:
+
+.. image:: VPlanUML.png
+
+
+The class hierarchy of VPlan's VPRecipeBase class:
+
+.. image:: VPlanRecipesUML.png
+
+
+Integration with LoopVectorize.cpp/processLoop()
+================================================
+
+Here's the integration within LoopVectorize.cpp's existing flow, in
+LoopVectorizePass::processLoop(Loop \*L):
+
+1. Plan only after passing all early bail-outs:
+
+   a. including those that take place after Legal, which is kept intact;
+   b. including those that use the Cost Model - refactor it slightly to expose
+      its MaxVF upper bound and canVectorize() early exit:
+
+.. code-block:: c++
+
+  // Check if the target supports potentially unsafe FP vectorization.
+  // FIXME: Add a check for the type of safety issue (denormal, signaling)
+  // for the target we're vectorizing for, to make sure none of the
+  // additional fp-math flags can help.
+  if (Hints.isPotentiallyUnsafe() &&
+      TTI->isFPVectorizationPotentiallyUnsafe()) {
+    DEBUG(dbgs() << "LV: Potentially unsafe FP op prevents vectorization.\n");
+    ORE->emit(
+        createMissedAnalysis(Hints.vectorizeAnalysisPassName(), "UnsafeFP", L)
+        << "loop not vectorized due to unsafe FP support.");
+    emitMissedWarning(F, L, Hints, ORE);
+    return false;
+  }
+
+  if (!CM.canVectorize(OptForSize))
+    return false;
+
+  // Early prune excessive VF's.
+  unsigned MaxVF = CM.computeMaxVectorizationFactor(OptForSize);
+
+  // If OptForSize, MaxVF is the only VF we consider. Abort if it needs a tail.
+  if (OptForSize && CM.requiresTail(MaxVF))
+    return false;
+
+2. Plan:
+
+   a. build VPlans for relevant VF's and optimize them,
+   b. compute the best cost using the Cost Model as before,
+   c. compute the best interleave-count using the Cost Model as before. The
+      first two steps are refactored into LVP.plan() (see below):
+
+.. code-block:: c++
+
+  // Use the planner.
+  LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, &CM);
+
+  // Get user vectorization factor.
+  unsigned UserVF = Hints.getWidth();
+
+  // Select the vectorization factor.
+  LoopVectorizationCostModel::VectorizationFactor VF =
+      LVP.plan(OptForSize, UserVF, MaxVF);
+  bool VectorizeLoop = (VF.Width > 1);
+
+  std::pair<StringRef, std::string> VecDiagMsg, IntDiagMsg;
+
+  if (!UserVF && !VectorizeLoop) {
+    DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n");
+    VecDiagMsg = std::make_pair(
+        "VectorizationNotBeneficial",
+        "the cost-model indicates that vectorization is not beneficial");
+  }
+
+  // Select the interleave count.
+  unsigned IC = CM.selectInterleaveCount(OptForSize, VF.Width, VF.Cost);
+
+  // Get user interleave count.
+  unsigned UserIC = Hints.getInterleave();
+
+3. Transform:
+
+   a. invoke an Unroller to unroll the loop (as before), or
+   b. invoke LVP.executeBestPlan() to vectorize the loop:
+
+.. code-block:: c++
+
+  if (!VectorizeLoop) {
+    assert(IC > 1 && "interleave count should not be 1 or 0");
+    // If we decided that it is not legal to vectorize the loop, then
+    // interleave it.
+    InnerLoopUnroller Unroller(L, PSE, LI, DT, TLI, TTI, AC, ORE, IC, &LVL,
+                               &CM);
+    Unroller.vectorize();
+
+    ORE->emit(OptimizationRemark(LV_NAME, "Interleaved", L->getStartLoc(),
+                                 L->getHeader())
+              << "interleaved loop (interleaved count: "
+              << NV("InterleaveCount", IC) << ")");
+  } else {
+
+    // If we decided that it is *legal* to vectorize the loop, then do it.
+    InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC,
+                           &LVL, &CM);
+
+    LVP.executeBestPlan(LB);
+
+    ++LoopsVectorized;
+
+    // Add metadata to disable runtime unrolling a scalar loop when there are
+    // no runtime checks about strides and memory. A scalar loop that is
+    // rarely used is not worth unrolling.
+    if (!LB.areSafetyChecksAdded())
+      AddRuntimeUnrollDisableMetaData(L);
+
+    // Report the vectorization decision.
+    ORE->emit(OptimizationRemark(LV_NAME, "Vectorized", L->getStartLoc(),
+                                 L->getHeader())
+              << "vectorized loop (vectorization width: "
+              << NV("VectorizationFactor", VF.Width)
+              << ", interleaved count: " << NV("InterleaveCount", IC) << ")");
+  }
+
+  // Mark the loop as already vectorized to avoid vectorizing again.
+  Hints.setAlreadyVectorized();
+
+4. Plan, refactored into LVP.plan():
+
+   a. build VPlans for relevant VF's and optimize them,
+   b. compute the best cost using the Cost Model as before:
+
+.. code-block:: c++
+
+  LoopVectorizationCostModel::VectorizationFactor
+  LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF,
+                                 unsigned MaxVF) {
+    if (UserVF) {
+      DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");
+      if (UserVF == 1)
+        return {UserVF, 0};
+      assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");
+      // Collect the instructions (and their associated costs) that will be
+      // more profitable to scalarize.
+      CM->collectInstsToScalarize(UserVF);
+      buildInitialVPlans(UserVF, UserVF);
+      DEBUG(printCurrentPlans("Initial VPlans", dbgs()));
+      optimizePredicatedInstructions();
+      DEBUG(printCurrentPlans("After optimize predicated instructions",
+                              dbgs()));
+      return {UserVF, 0};
+    }
+    if (MaxVF == 1)
+      return {1, 0};
+
+    assert(MaxVF > 1 && "MaxVF is zero.");
+    // Collect the instructions (and their associated costs) that will be more
+    // profitable to scalarize.
+    for (unsigned i = 2; i <= MaxVF; i = i + i)
+      CM->collectInstsToScalarize(i);
+    buildInitialVPlans(2, MaxVF);
+    DEBUG(printCurrentPlans("Initial VPlans", dbgs()));
+    optimizePredicatedInstructions();
+    DEBUG(printCurrentPlans("After optimize predicated instructions", dbgs()));
+    // Select the optimal vectorization factor.
+    return CM->selectVectorizationFactor(OptForSize, MaxVF);
+  }
Index: docs/Vectorizers.rst
===================================================================
--- docs/Vectorizers.rst
+++ docs/Vectorizers.rst
@@ -380,6 +380,18 @@

 .. image:: linpack-pc.png

+Internals
+---------
+
+.. toctree::
+   :hidden:
+
+   VectorizationPlan
+
+:doc:`VectorizationPlan`
+   The loop vectorizer is based on an abstract representation called
+   Vectorization Plan. This document describes its philosophy and design.
+
 .. _slp-vectorizer:

 The SLP Vectorizer
 ==================

Index: lib/Transforms/Vectorize/CMakeLists.txt
===================================================================
--- lib/Transforms/Vectorize/CMakeLists.txt
+++ lib/Transforms/Vectorize/CMakeLists.txt
@@ -4,6 +4,7 @@
   LoopVectorize.cpp
   SLPVectorizer.cpp
   Vectorize.cpp
+  VPlan.cpp

   ADDITIONAL_HEADER_DIRS
   ${LLVM_MAIN_INCLUDE_DIR}/llvm/Transforms
Index: lib/Transforms/Vectorize/LoopVectorize.cpp
===================================================================
--- lib/Transforms/Vectorize/LoopVectorize.cpp
+++ lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -47,6 +47,7 @@
 //===----------------------------------------------------------------------===//

 #include "llvm/Transforms/Vectorize/LoopVectorize.h"
+#include "VPlan.h"
 #include "llvm/ADT/DenseMap.h"
 #include "llvm/ADT/Hashing.h"
 #include "llvm/ADT/MapVector.h"
@@ -366,6 +367,9 @@
 /// LoopVectorizationLegality class to provide information about the induction
 /// and reduction variables that were found to a given vectorization factor.
 class InnerLoopVectorizer {
+  friend class LoopVectorizationPlanner;
+  friend class llvm::VPlan;
+
 public:
   InnerLoopVectorizer(Loop *OrigLoop, PredicatedScalarEvolution &PSE,
                       LoopInfo *LI, DominatorTree *DT,
@@ -412,7 +416,8 @@
   // When we if-convert we need to create edge masks. We have to cache values
   // so that we don't end up with exponential recursion/IR.
   typedef DenseMap<std::pair<BasicBlock *, BasicBlock *>, VectorParts>
-      EdgeMaskCache;
+      EdgeMaskCacheTy;
+  typedef DenseMap<BasicBlock *, VectorParts> BlockMaskCacheTy;

   /// Create an empty loop, based on the loop ranges of the old loop.
   void createEmptyLoop();
@@ -428,43 +433,44 @@
   /// Copy and widen the instructions from the old loop.
   virtual void vectorizeLoop();

+  /// Handle all cross-iteration phis in the header.
+  void fixCrossIterationPHIs();
+
   /// Fix a first-order recurrence. This is the second phase of vectorizing
   /// this phi node.
   void fixFirstOrderRecurrence(PHINode *Phi);

+  /// Fix a reduction cross-iteration phi. This is the second phase of
+  /// vectorizing this phi node.
+  void fixReduction(PHINode *Phi);
+
   /// \brief The Loop exit block may have single value PHI nodes where the
   /// incoming value is 'Undef'. While vectorizing we only handled real values
   /// that were defined inside the loop. Here we fix the 'undef case'.
   /// See PR14725.
   void fixLCSSAPHIs();

-  /// Iteratively sink the scalarized operands of a predicated instruction into
-  /// the block that was created for it.
-  void sinkScalarOperands(Instruction *PredInst);
-
-  /// Predicate conditional instructions that require predication on their
-  /// respective conditions.
-  void predicateInstructions();
-
   /// Collect the instructions from the original loop that would be trivially
   /// dead in the vectorized loop if generated.
-  void collectTriviallyDeadInstructions();
+  static void collectTriviallyDeadInstructions(
+      Loop *OrigLoop, LoopVectorizationLegality *Legal,
+      SmallPtrSetImpl<Instruction *> &DeadInstructions);

   /// Shrinks vector element sizes to the smallest bitwidth they can be legally
   /// represented as.
   void truncateToMinimalBitwidths();

+public:
   /// A helper function that computes the predicate of the block BB, assuming
   /// that the header block of the loop is set to True. It returns the *entry*
   /// mask for the block BB.
   VectorParts createBlockInMask(BasicBlock *BB);
+
+protected:
   /// A helper function that computes the predicate of the edge between SRC
   /// and DST.
   VectorParts createEdgeMask(BasicBlock *Src, BasicBlock *Dst);

-  /// A helper function to vectorize a single BB within the innermost loop.
- void vectorizeBlockInLoop(BasicBlock *BB, PhiVector *PV); - /// Vectorize a single PHINode in a block. This method handles the induction /// variable canonicalization. It supports both VF = 1 for unrolled loops and /// arbitrary length vectors. @@ -475,13 +481,69 @@ /// and update the analysis passes. void updateAnalysis(); - /// This instruction is un-vectorizable. Implement it as a sequence - /// of scalars. If \p IfPredicateInstr is true we need to 'hide' each - /// scalarized instruction behind an if block predicated on the control - /// dependence of the instruction. - virtual void scalarizeInstruction(Instruction *Instr, - bool IfPredicateInstr = false); +public: + /// A helper function to vectorize a single Instruction in the innermost loop. + virtual void vectorizeInstruction(Instruction &I); + + /// A helper function to scalarize a single Instruction in the innermost loop. + /// Generates a sequence of scalar instances for each lane between \p MinLane + /// and \p MaxLane, times each part between \p MinPart and \p MaxPart, + /// inclusive.. + void scalarizeInstruction(Instruction *Instr, unsigned MinPart, + unsigned MaxPart, unsigned MinLane, + unsigned MaxLane); + + /// Return a value in the new loop corresponding to \p V from the original + /// loop at unroll index \p Part and vector index \p Lane. If the value has + /// been vectorized but not scalarized, the necessary extractelement + /// instruction will be generated. + Value *getScalarValue(Value *V, unsigned Part, unsigned Lane); + + /// Set a value in the new loop corresponding to \p V from the original + /// loop at unroll index \p Part and vector index \p Lane. The scalar parts + /// for this value must already be initialized. + void setScalarValue(Value *V, unsigned Part, unsigned Lane, Value *Scalar) { + assert(VectorLoopValueMap.hasScalar(V) && + "Cannot set an uninitialized scalar value"); + VectorLoopValueMap.ScalarMapStorage[V][Part][Lane] = Scalar; + } + + /// Return a value in the new loop corresponding to \p V from the original + /// loop at unroll index \p Part. If there isn't one, return a null pointer. + /// Note that the value returned may also be a null pointer if that specific + /// part has not been generated yet. + Value *getVectorValue(Value *V, unsigned Part) { + if (!VectorLoopValueMap.hasVector(V)) + return nullptr; + return VectorLoopValueMap.VectorMapStorage[V][Part]; + } + + /// Set a value in the new loop corresponding to \p V from the original + /// loop at unroll index \p Part. The vector parts for this value must already + /// be initialized. + void setVectorValue(Value *V, unsigned Part, Value *Vector) { + assert(VectorLoopValueMap.hasVector(V) && + "Cannot set an uninitialized vector value"); + VectorLoopValueMap.VectorMapStorage[V][Part] = Vector; + } + + /// Construct the vector value of a scalarized value \p V one lane at a time. + /// This method is for predicated instructions where we'd like the + /// insert-element instructions to reside in the predicated block to have + /// them execute only if needed. + void constructVectorValue(Value *V, unsigned Part, unsigned Lane); + /// Return a constant reference to the VectorParts corresponding to \p V from + /// the original loop. If the value has already been vectorized, the + /// corresponding vector entry in VectorLoopValueMap is returned. If, + /// however, the value has a scalar entry in VectorLoopValueMap, we construct + /// new vector values on-demand by inserting the scalar values into vectors + /// with an insertelement sequence. 
If the value has been neither vectorized + /// nor scalarized, it must be loop invariant, so we simply broadcast the + /// value into vectors. + const VectorParts &getVectorValue(Value *V); + +protected: /// Vectorize Load and Store instructions, virtual void vectorizeMemoryInstruction(Instruction *Instr); @@ -499,13 +561,6 @@ Instruction::BinaryOps Opcode = Instruction::BinaryOpsEnd); - /// Compute scalar induction steps. \p ScalarIV is the scalar induction - /// variable on which to base the steps, \p Step is the size of the step, and - /// \p EntryVal is the value from the original loop that maps to the steps. - /// Note that \p EntryVal doesn't have to be an induction variable (e.g., it - /// can be a truncate instruction). - void buildScalarSteps(Value *ScalarIV, Value *Step, Value *EntryVal); - /// Create a vector induction phi node based on an existing scalar one. This /// currently only works for integer induction variables with a constant /// step. \p EntryVal is the value from the original loop that maps to the @@ -515,10 +570,6 @@ void createVectorIntInductionPHI(const InductionDescriptor &II, Instruction *EntryVal); - /// Widen an integer induction variable \p IV. If \p Trunc is provided, the - /// induction variable will first be truncated to the corresponding type. - void widenIntInduction(PHINode *IV, TruncInst *Trunc = nullptr); - /// Returns true if an instruction \p I should be scalarized instead of /// vectorized for the chosen vectorization factor. bool shouldScalarizeInstruction(Instruction *I) const; @@ -526,25 +577,25 @@ /// Returns true if we should generate a scalar version of \p IV. bool needsScalarInduction(Instruction *IV) const; - /// Return a constant reference to the VectorParts corresponding to \p V from - /// the original loop. If the value has already been vectorized, the - /// corresponding vector entry in VectorLoopValueMap is returned. If, - /// however, the value has a scalar entry in VectorLoopValueMap, we construct - /// new vector values on-demand by inserting the scalar values into vectors - /// with an insertelement sequence. If the value has been neither vectorized - /// nor scalarized, it must be loop invariant, so we simply broadcast the - /// value into vectors. - const VectorParts &getVectorValue(Value *V); - - /// Return a value in the new loop corresponding to \p V from the original - /// loop at unroll index \p Part and vector index \p Lane. If the value has - /// been vectorized but not scalarized, the necessary extractelement - /// instruction will be generated. - Value *getScalarValue(Value *V, unsigned Part, unsigned Lane); - +public: /// Try to vectorize the interleaved access group that \p Instr belongs to. void vectorizeInterleaveGroup(Instruction *Instr); + /// Widen an integer induction variable \p IV. If \p Trunc is provided, the + /// induction variable will first be truncated to the corresponding type. + std::pair widenIntInduction(bool NeedsScalarIV, PHINode *IV, + TruncInst *Trunc = nullptr); + + /// Compute scalar induction steps. \p ScalarIV is the scalar induction + /// variable on which to base the steps, \p Step is the size of the step, and + /// \p EntryVal is the value from the original loop that maps to the steps. + /// Note that \p EntryVal doesn't have to be an induction variable (e.g., it + /// can be a truncate instruction). 
+ void buildScalarSteps(Value *ScalarIV, Value *Step, Value *EntryVal, + unsigned MinPart, unsigned MaxPart, unsigned MinLane, + unsigned MaxLane); + +protected: /// Generate a shuffle sequence that will reverse the vector Vec. virtual Value *reverseVector(Value *Vec); @@ -657,6 +708,16 @@ return ScalarMapStorage[Key]; } + ScalarParts &getOrCreateScalar(Value *Key, unsigned Lanes) { + if (!hasScalar(Key)) { + ScalarParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) + Entry[Part].resize(Lanes); + ScalarMapStorage[Key] = Entry; + } + return ScalarMapStorage[Key]; + } + /// \return A reference to the vector map entry corresponding to \p Key. /// The key should already be in the map. This function should only be used /// when it's necessary to update values that have already been vectorized. @@ -675,6 +736,15 @@ friend const VectorParts &InnerLoopVectorizer::getVectorValue(Value *V); friend Value *InnerLoopVectorizer::getScalarValue(Value *V, unsigned Part, unsigned Lane); + friend Value *InnerLoopVectorizer::getVectorValue(Value *V, unsigned Part); + friend void InnerLoopVectorizer::setScalarValue(Value *V, unsigned Part, + unsigned Lane, + Value *Scalar); + friend void InnerLoopVectorizer::setVectorValue(Value *V, unsigned Part, + Value *Vector); + friend void InnerLoopVectorizer::constructVectorValue(Value *V, + unsigned Part, + unsigned Lane); private: /// The unroll factor. Each entry in the vector map contains UF vector @@ -728,9 +798,11 @@ /// many different vector instructions. unsigned UF; +public: /// The builder that we use IRBuilder<> Builder; +protected: // --- Vectorization state --- /// The vector-loop preheader. @@ -759,10 +831,8 @@ /// vectorized and scalarized. ValueMap VectorLoopValueMap; - /// Store instructions that should be predicated, as a pair - /// - SmallVector, 4> PredicatedInstructions; - EdgeMaskCache MaskCache; + EdgeMaskCacheTy EdgeMaskCache; + BlockMaskCacheTy BlockMaskCache; /// Trip count of the original loop. Value *TripCount; /// Trip count of the widened loop (TripCount - TripCount % (VF*UF)) @@ -777,14 +847,6 @@ // Record whether runtime checks are added. bool AddedSafetyChecks; - // Holds instructions from the original loop whose counterparts in the - // vectorized loop would be trivially dead if generated. For example, - // original induction update instructions can become dead because we - // separately emit induction "steps" when generating code for the new loop. - // Similarly, we create a new latch condition when setting up the structure - // of the new loop, so the old one can become dead. - SmallPtrSet DeadInstructions; - // Holds the end values for each induction variable. We save the end values // so we can later fix-up the external users of the induction variables. DenseMap IVEndValues; @@ -803,14 +865,36 @@ UnrollFactor, LVL, CM) {} private: - void scalarizeInstruction(Instruction *Instr, - bool IfPredicateInstr = false) override; + void vectorizeInstruction(Instruction &I) override; + void scalarizeInstruction(Instruction *Instr, bool IfPredicateInstr = false); void vectorizeMemoryInstruction(Instruction *Instr) override; Value *getBroadcastInstrs(Value *V) override; Value *getStepVector(Value *Val, int StartIdx, Value *Step, Instruction::BinaryOps Opcode = Instruction::BinaryOpsEnd) override; Value *reverseVector(Value *Vec) override; + + void vectorizeLoop() override; + + /// Iteratively sink the scalarized operands of a predicated instruction into + /// the block that was created for it. 
+ void sinkScalarOperands(Instruction *PredInst); + + /// Predicate conditional instructions that require predication on their + /// respective conditions. + void predicateInstructions(); + + /// Store instructions that should be predicated, as a pair + /// + SmallVector, 4> PredicatedInstructions; + + // Holds instructions from the original loop whose counterparts in the + // vectorized loop would be trivially dead if generated. For example, + // original induction update instructions can become dead because we + // separately emit induction "steps" when generating code for the new loop. + // Similarly, we create a new latch condition when setting up the structure + // of the new loop, so the old one can become dead. + SmallPtrSet DeadInstructions; }; /// \brief Look for a meaningful debug location on the instruction or it's @@ -1866,11 +1950,20 @@ unsigned Width; // Vector width with best cost unsigned Cost; // Cost of the loop with that width }; + + bool canVectorize(bool OptForSize); + + bool requiresTail(unsigned MaxVectorSize); + + /// \return An upper bound for the vectorization factor. + unsigned computeMaxVectorizationFactor(bool OptForSize); + /// \return The most profitable vectorization factor and the cost of that VF. /// This method checks every power of two up to VF. If UserVF is not ZERO /// then this vectorization factor will be selected if vectorization is /// possible. - VectorizationFactor selectVectorizationFactor(bool OptForSize); + VectorizationFactor selectVectorizationFactor(bool OptForSize, + unsigned MaxVF); /// \return The size (in bits) of the smallest and widest types in the code /// that needs to be vectorized. We ignore values that remain scalar such as @@ -1909,9 +2002,16 @@ return MinBWs; } + /// Collects the instructions to scalarize for each predicated instruction in + /// the loop. + void collectInstsToScalarize(unsigned VF); + /// \returns True if it is more profitable to scalarize instruction \p I for /// vectorization factor \p VF. bool isProfitableToScalarize(Instruction *I, unsigned VF) const { + // Unroller also calls this method, but does not collectInstsToScalarize. + if (VF == 1) + return true; auto Scalars = InstsToScalarize.find(VF); assert(Scalars != InstsToScalarize.end() && "VF not yet analyzed for scalarization profitability"); @@ -1986,10 +2086,6 @@ int computePredInstDiscount(Instruction *PredInst, ScalarCostsTy &ScalarCosts, unsigned VF); - /// Collects the instructions to scalarize for each predicated instruction in - /// the loop. - void collectInstsToScalarize(unsigned VF); - public: /// The loop that we evaluate. Loop *TheLoop; @@ -2019,132 +2115,574 @@ SmallPtrSet VecValuesToIgnore; }; -/// \brief This holds vectorization requirements that must be verified late in -/// the process. The requirements are set by legalize and costmodel. Once -/// vectorization has been determined to be possible and profitable the -/// requirements can be verified by looking for metadata or compiler options. -/// For example, some loops require FP commutativity which is only allowed if -/// vectorization is explicitly specified or if the fast-math compiler option -/// has been provided. -/// Late evaluation of these requirements allows helpful diagnostics to be -/// composed that tells the user what need to be done to vectorize the loop. For -/// example, by specifying #pragma clang loop vectorize or -ffast-math. Late -/// evaluation should be used only when diagnostics can generated that can be -/// followed by a non-expert user. 
-class LoopVectorizationRequirements { +/// LoopVectorizationPlanner - builds and optimizes the Vectorization Plans +/// which record the decisions how to vectorize the given loop. +/// In particular, represent the control-flow of the vectorized version, +/// the replication of instructions that are to be scalarized, and interleave +/// access groups. +class LoopVectorizationPlanner { public: - LoopVectorizationRequirements(OptimizationRemarkEmitter &ORE) - : NumRuntimePointerChecks(0), UnsafeAlgebraInst(nullptr), ORE(ORE) {} + LoopVectorizationPlanner(Loop *L, LoopInfo *LI, const TargetLibraryInfo *TLI, + const TargetTransformInfo *TTI, + LoopVectorizationLegality *Legal, + LoopVectorizationCostModel *CM) + : TheLoop(L), LI(LI), TLI(TLI), TTI(TTI), Legal(Legal), CM(CM), + ILV(nullptr), BestVF(0), BestUF(0) {} - void addUnsafeAlgebraInst(Instruction *I) { - // First unsafe algebra instruction. - if (!UnsafeAlgebraInst) - UnsafeAlgebraInst = I; - } + ~LoopVectorizationPlanner() {} - void addRuntimePointerChecks(unsigned Num) { NumRuntimePointerChecks = Num; } + /// Plan how to best vectorize, return the best VF and its cost. + LoopVectorizationCostModel::VectorizationFactor + plan(bool OptForSize, unsigned UserVF, unsigned MaxVF); - bool doesNotMeet(Function *F, Loop *L, const LoopVectorizeHints &Hints) { - const char *PassName = Hints.vectorizeAnalysisPassName(); - bool Failed = false; - if (UnsafeAlgebraInst && !Hints.allowReordering()) { - ORE.emit( - OptimizationRemarkAnalysisFPCommute(PassName, "CantReorderFPOps", - UnsafeAlgebraInst->getDebugLoc(), - UnsafeAlgebraInst->getParent()) - << "loop not vectorized: cannot prove it is safe to reorder " - "floating-point operations"); - Failed = true; - } + /// Finalize the best decision and dispose of all other VPlans. + void setBestPlan(unsigned VF, unsigned UF); - // Test if runtime memcheck thresholds are exceeded. - bool PragmaThresholdReached = - NumRuntimePointerChecks > PragmaVectorizeMemoryCheckThreshold; - bool ThresholdReached = - NumRuntimePointerChecks > VectorizerParams::RuntimeMemoryCheckThreshold; - if ((ThresholdReached && !Hints.allowReordering()) || - PragmaThresholdReached) { - ORE.emit(OptimizationRemarkAnalysisAliasing(PassName, "CantReorderMemOps", - L->getStartLoc(), - L->getHeader()) - << "loop not vectorized: cannot prove it is safe to reorder " - "memory operations"); - DEBUG(dbgs() << "LV: Too many memory checks needed.\n"); - Failed = true; - } + /// Generate the IR code for the body of the vectorized loop according to the + /// best selected VPlan. + void executeBestPlan(InnerLoopVectorizer &LB); - return Failed; - } + VPlan *getVPlanForVF(unsigned VF) { return VPlans[VF].get(); } + + void printCurrentPlans(const std::string &Title, raw_ostream &O); + +protected: + /// Build initial VPlans according to the information gathered by Legal + /// when it checked if it is legal to vectorize this loop. + /// Returns the number of VPlans built, zero if failed. + unsigned buildInitialVPlans(unsigned MinVF, unsigned MaxVF); + + /// On VPlan construction, each instruction marked for predication by Legal + /// gets its own basic block guarded by an if-then. This initial planning + /// is legal, but is not optimal. This function attempts to leverage the + /// necessary conditional execution of the predicated instruction in favor + /// of other related instructions. The function applies these optimizations + /// to all VPlans. 
+ void optimizePredicatedInstructions(); private: - unsigned NumRuntimePointerChecks; - Instruction *UnsafeAlgebraInst; + /// Build an initial VPlan according to the information gathered by Legal + /// when it checked if it is legal to vectorize this loop. \return a VPlan + /// that corresponds to vectorization factors starting from the given + /// \p StartRangeVF and up to \p EndRangeVF, exclusive, possibly decreasing + /// the given \p EndRangeVF. + std::shared_ptr buildInitialVPlan(unsigned StartRangeVF, + unsigned &EndRangeVF); + + /// Determine whether \p I will be scalarized in a given range of VFs. + /// The returned value reflects the result for a prefix of the range, with \p + /// EndRangeVF modified accordingly. + bool willBeScalarized(Instruction *I, unsigned StartRangeVF, + unsigned &EndRangeVF); - /// Interface to emit optimization remarks. - OptimizationRemarkEmitter &ORE; -}; + /// Iteratively sink the scalarized operands of a predicated instruction into + /// the block that was created for it. + void sinkScalarOperands(Instruction *PredInst, VPlan *Plan); -static void addAcyclicInnerLoop(Loop &L, SmallVectorImpl &V) { - if (L.empty()) { - if (!hasCyclesInLoopBody(L)) - V.push_back(&L); - return; - } - for (Loop *InnerL : L) - addAcyclicInnerLoop(*InnerL, V); -} + /// Determine whether a newly-created recipe adds a second user to one of the + /// variants the values its ingredients use. This may cause the defining + /// recipe to generate that variant itself to serve all such users. + void assignScalarVectorConversions(Instruction *PredInst, VPlan *Plan); -/// The LoopVectorize Pass. -struct LoopVectorize : public FunctionPass { - /// Pass identification, replacement for typeid - static char ID; + /// Returns true if an instruction \p I should be scalarized instead of + /// vectorized for the chosen vectorization factor. + bool shouldScalarizeInstruction(Instruction *I, unsigned VF) const; - explicit LoopVectorize(bool NoUnrolling = false, bool AlwaysVectorize = true) - : FunctionPass(ID) { - Impl.DisableUnrolling = NoUnrolling; - Impl.AlwaysVectorize = AlwaysVectorize; - initializeLoopVectorizePass(*PassRegistry::getPassRegistry()); - } + /// Returns true if we should generate a scalar version of \p IV. + bool needsScalarInduction(Instruction *IV, unsigned VF) const; - LoopVectorizePass Impl; + /// Returns true if we should generate a scalar version of \p IV for a range + /// of vectorization factors starting from the given \p StartRangeVF and up + /// to \p EndRangeVF, exclusive, possibly decreasing the given \p EndRangeVF. + bool needsScalarInduction(Instruction *IV, unsigned StartRangeVF, + unsigned &EndRangeVF) const; - bool runOnFunction(Function &F) override { - if (skipFunction(F)) - return false; +private: + /// The loop that we evaluate. + Loop *TheLoop; - auto *SE = &getAnalysis().getSE(); - auto *LI = &getAnalysis().getLoopInfo(); - auto *TTI = &getAnalysis().getTTI(F); - auto *DT = &getAnalysis().getDomTree(); - auto *BFI = &getAnalysis().getBFI(); - auto *TLIP = getAnalysisIfAvailable(); - auto *TLI = TLIP ? &TLIP->getTLI() : nullptr; - auto *AA = &getAnalysis().getAAResults(); - auto *AC = &getAnalysis().getAssumptionCache(F); - auto *LAA = &getAnalysis(); - auto *DB = &getAnalysis().getDemandedBits(); - auto *ORE = &getAnalysis().getORE(); + /// Loop Info analysis. + LoopInfo *LI; - std::function GetLAA = - [&](Loop &L) -> const LoopAccessInfo & { return LAA->getInfo(&L); }; + /// Target Library Info. 
+ const TargetLibraryInfo *TLI; - return Impl.runImpl(F, *SE, *LI, *TTI, *DT, *BFI, TLI, *DB, *AA, *AC, - GetLAA, *ORE); - } + /// Target Transform Info. + const TargetTransformInfo *TTI; - void getAnalysisUsage(AnalysisUsage &AU) const override { - AU.addRequired(); - AU.addRequiredID(LoopSimplifyID); - AU.addRequiredID(LCSSAID); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); + /// The legality analysis. + LoopVectorizationLegality *Legal; + + /// The profitablity analysis. + LoopVectorizationCostModel *CM; + + InnerLoopVectorizer *ILV; + + // Holds instructions from the original loop that we predicated. Such + // instructions reside in their own conditioned VPBasicBlock and represent + // an optimization opportunity for sinking their scalarized operands thus + // reducing their cost by the predicate's probability. + SmallPtrSet PredicatedInstructions; + + /// VPlans are shared between VFs, use smart pointers. + DenseMap> VPlans; + + unsigned BestVF; + + unsigned BestUF; + + // Holds instructions from the original loop whose counterparts in the + // vectorized loop would be trivially dead if generated. For example, + // original induction update instructions can become dead because we + // separately emit induction "steps" when generating code for the new loop. + // Similarly, we create a new latch condition when setting up the structure + // of the new loop, so the old one can become dead. + SmallPtrSet DeadInstructions; +}; + +class VPLaneRange { +private: + static const unsigned VF = INT_MAX; + unsigned MinLane = 0; + unsigned MaxLane = VF - 1; + void dumpLane(raw_ostream &O, unsigned Lane) const { + if (Lane == VF - 1) + O << "VF-1"; + else + O << Lane; + } + +public: + VPLaneRange() {} + VPLaneRange(unsigned Min) : MinLane(Min) {} + VPLaneRange(unsigned Min, unsigned Max) : MinLane(Min), MaxLane(Max) {} + unsigned getMinLane() const { return MinLane; } + unsigned getMaxLane() const { return MaxLane; } + bool isEmpty() const { return MinLane > MaxLane; } + bool isFull() const { return MinLane == 0 && MaxLane == VF - 1; } + void print(raw_ostream &O) const { + dumpLane(O, MinLane); + O << ".."; + dumpLane(O, MaxLane); + } + static VPLaneRange intersect(const VPLaneRange &One, const VPLaneRange &Two) { + return VPLaneRange(std::max(One.MinLane, Two.MinLane), + std::min(One.MaxLane, Two.MaxLane)); + } +}; + +/// VPScalarizeOneByOneRecipe is a VPOneByOneRecipeBase which scalarizes each +/// Instruction in its ingredients independently, in order. The scalarization +/// is performed in one of two methods: a) by generating a single uniform scalar +/// Instruction. b) by generating multiple Instructions, each one for a +/// respective lane. +class VPScalarizeOneByOneRecipe : public VPOneByOneRecipeBase { + friend class VPlanUtilsLoopVectorizer; + +private: + /// Do the actual code generation for a single instruction. + void transformIRInstruction(Instruction *I, VPTransformState &State) override; + + VPLaneRange DesignatedLanes; + +public: + VPScalarizeOneByOneRecipe(const BasicBlock::iterator B, + const BasicBlock::iterator E, VPlan *Plan) + : VPOneByOneRecipeBase(VPScalarizeOneByOneSC, B, E, Plan) {} + + ~VPScalarizeOneByOneRecipe() {} + + /// Method to support type inquiry through isa, cast, and dyn_cast. 
+  static inline bool classof(const VPRecipeBase *V) {
+    return V->getVPRecipeID() == VPRecipeBase::VPScalarizeOneByOneSC;
+  }
+
+  const VPLaneRange &getDesignatedLanes() const { return DesignatedLanes; }
+
+  /// Print the recipe.
+  void print(raw_ostream &O) const override {
+    O << "Scalarize";
+    if (!DesignatedLanes.isFull()) {
+      O << " ";
+      DesignatedLanes.print(O);
+    }
+    O << ":";
+    for (auto It = Begin; It != End; ++It) {
+      O << '\n' << *It;
+      if (willAlsoPackOrUnpack(&*It))
+        O << " (S->V)";
+    }
+  }
+};
+
+/// VPVectorizeOneByOneRecipe is a VPOneByOneRecipeBase which transforms by
+/// vectorizing each Instruction in its ingredients independently, in order.
+/// This recipe covers most of the traditional vectorization cases where
+/// each ingredient produces a vectorized version of itself.
+class VPVectorizeOneByOneRecipe : public VPOneByOneRecipeBase {
+  friend class VPlanUtilsLoopVectorizer;
+
+private:
+  /// Do the actual code generation for a single instruction.
+  void transformIRInstruction(Instruction *I, VPTransformState &State) override;
+
+public:
+  VPVectorizeOneByOneRecipe(const BasicBlock::iterator B,
+                            const BasicBlock::iterator E, VPlan *Plan)
+      : VPOneByOneRecipeBase(VPVectorizeOneByOneSC, B, E, Plan) {}
+
+  ~VPVectorizeOneByOneRecipe() {}
+
+  /// Method to support type inquiry through isa, cast, and dyn_cast.
+  static inline bool classof(const VPRecipeBase *V) {
+    return V->getVPRecipeID() == VPRecipeBase::VPVectorizeOneByOneSC;
+  }
+
+  /// Print the recipe.
+  void print(raw_ostream &O) const override {
+    O << "Vectorize:";
+    for (auto It = Begin; It != End; ++It) {
+      O << '\n' << *It;
+      if (willAlsoPackOrUnpack(&*It))
+        O << " (S->V)";
+    }
+  }
+};
+
+/// A recipe which widens integer inductions, producing their vector values
+/// and computing the necessary values for producing their scalar values.
+/// The scalar values themselves are generated by a complementing
+/// VPBuildScalarStepsRecipe.
+class VPWidenIntInductionRecipe : public VPRecipeBase {
+private:
+  bool NeedsScalarIV;
+  PHINode *IV;
+  TruncInst *Trunc;
+  Value *ScalarIV = nullptr;
+  Value *Step = nullptr;
+
+public:
+  VPWidenIntInductionRecipe(bool NeedsScalarIV, PHINode *IV,
+                            TruncInst *Trunc = nullptr)
+      : VPRecipeBase(VPWidenIntInductionSC), NeedsScalarIV(NeedsScalarIV),
+        IV(IV), Trunc(Trunc) {}
+
+  ~VPWidenIntInductionRecipe() {}
+
+  /// Method to support type inquiry through isa, cast, and dyn_cast.
+  static inline bool classof(const VPRecipeBase *V) {
+    return V->getVPRecipeID() == VPRecipeBase::VPWidenIntInductionSC;
+  }
+
+  /// The method which generates the widened vector induction and the values
+  /// needed for producing its scalar version, corresponding to this
+  /// VPWidenIntInductionRecipe in the vectorized version, thereby "executing"
+  /// the VPlan.
+  void vectorize(VPTransformState &State) override;
+
+  /// Print the recipe.
+  void print(raw_ostream &O) const override;
+
+  Value *getScalarIV() {
+    assert(ScalarIV && "ScalarIV does not exist yet");
+    return ScalarIV;
+  }
+
+  Value *getStep() {
+    assert(Step && "Step does not exist yet");
+    return Step;
+  }
+};
+
+/// This is a complementing recipe for handling integer induction variables,
+/// responsible for generating the scalar values used by the IV's scalar users.
+class VPBuildScalarStepsRecipe : public VPRecipeBase { + friend class VPlanUtilsLoopVectorizer; + +private: + VPWidenIntInductionRecipe *WII; + Instruction *EntryVal; + VPLaneRange DesignatedLanes; + +public: + VPBuildScalarStepsRecipe(VPWidenIntInductionRecipe *WII, + Instruction *EntryVal, VPlan *Plan) + : VPRecipeBase(VPBuildScalarStepsSC), WII(WII), EntryVal(EntryVal) { + Plan->setInst2Recipe(EntryVal, this); + } + + ~VPBuildScalarStepsRecipe() {} + + const VPLaneRange &getDesignatedLanes() const { return DesignatedLanes; } + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPBuildScalarStepsSC; + } + + /// The method which generates the wide load or store and shuffles that + /// correspond to this VPInterleaveRecipe in the vectorized version, thereby + /// "executing" the VPlan. + void vectorize(VPTransformState &State) override; + + /// Print the recipe. + void print(raw_ostream &O) const override; +}; + +/// A VPInterleaveRecipe is a VPRecipe which transforms an interleave group of +/// loads or stores into one wide load/store and shuffles. +class VPInterleaveRecipe : public VPRecipeBase { +private: + const InterleaveGroup *IG; + +public: + VPInterleaveRecipe(const InterleaveGroup *IG, VPlan *Plan) + : VPRecipeBase(VPInterleaveSC), IG(IG) { + for (unsigned I = 0, E = IG->getNumMembers(); I < E; ++I) + Plan->setInst2Recipe(IG->getMember(I), this); + } + + ~VPInterleaveRecipe() {} + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPInterleaveSC; + } + + /// The method which generates the wide load or store and shuffles that + /// correspond to this VPInterleaveRecipe in the vectorized version, thereby + /// "executing" the VPlan. + void vectorize(VPTransformState &State) override; + + /// Print the recipe. + void print(raw_ostream &O) const override; + + const InterleaveGroup *getInterleaveGroup() { return IG; } +}; + +/// A VPExtractMaskBitRecipe is a VPConditionBitRecipe which supports a +/// scalarized conditional branch. Such branches are needed to guard scalarized +/// instructions with possible side-effects that are predicated under a +/// condition. This recipe is in charge of generating the instruction that +/// computes the condition for this branch in the vectorized version. +class VPExtractMaskBitRecipe : public VPConditionBitRecipeBase { +private: + /// The original IR basic block in which the scalarized and predicated + /// instruction(s) reside. Needed for generating the mask of the block + /// and from it the desired condition bit. + BasicBlock *MaskedBasicBlock; + +public: + /// Construct a VPExtractMaskBitRecipe given the IR BasicBlock whose mask + /// should provide the desired bit. This recipe has no Instructions as + /// ingredients, hence does not call Plan->setInst2Recipe(). + VPExtractMaskBitRecipe(BasicBlock *BB) + : VPConditionBitRecipeBase(VPExtractMaskBitSC), MaskedBasicBlock(BB) {} + + /// Method to support type inquiry through isa, cast, and dyn_cast. 
+/// A VPExtractMaskBitRecipe is a VPConditionBitRecipe which supports a
+/// scalarized conditional branch. Such branches are needed to guard scalarized
+/// instructions with possible side-effects that are predicated under a
+/// condition. This recipe is in charge of generating the instruction that
+/// computes the condition for this branch in the vectorized version.
+class VPExtractMaskBitRecipe : public VPConditionBitRecipeBase {
+private:
+  /// The original IR basic block in which the scalarized and predicated
+  /// instruction(s) reside. Needed for generating the mask of the block
+  /// and from it the desired condition bit.
+  BasicBlock *MaskedBasicBlock;
+
+public:
+  /// Construct a VPExtractMaskBitRecipe given the IR BasicBlock whose mask
+  /// should provide the desired bit. This recipe has no Instructions as
+  /// ingredients, hence does not call Plan->setInst2Recipe().
+  VPExtractMaskBitRecipe(BasicBlock *BB)
+      : VPConditionBitRecipeBase(VPExtractMaskBitSC), MaskedBasicBlock(BB) {}
+
+  /// Method to support type inquiry through isa, cast, and dyn_cast.
+  static inline bool classof(const VPRecipeBase *V) {
+    return V->getVPRecipeID() == VPRecipeBase::VPExtractMaskBitSC;
+  }
+
+  /// The method which generates the comparison and related mask management
+  /// instructions leading to computing the desired condition bit,
+  /// corresponding to this VPExtractMaskBitRecipe in the vectorized version,
+  /// thereby "executing" the VPlan.
+  void vectorize(VPTransformState &State) override;
+
+  /// Print the recipe.
+  void print(raw_ostream &O) const override {
+    O << "Extract Mask Bit:\n" << MaskedBasicBlock->getName();
+  }
+
+  StringRef getName() const override { return MaskedBasicBlock->getName(); }
+};
+
+/// A VPMergeScalarizeBranchRecipe is a VPRecipe which represents the phis
+/// needed when control converges back from a scalarized branch. Such phis are
+/// needed to merge live-out values that are set under a scalarized conditional
+/// branch. They can be scalar or vector, depending on the user of the
+/// live-out value. This recipe works in concert with VPExtractMaskBitRecipe.
+class VPMergeScalarizeBranchRecipe : public VPRecipeBase {
+private:
+  Instruction *LiveOut;
+
+public:
+  // Construct a VPMergeScalarizeBranchRecipe given \p LiveOut whose value
+  // needs a Phi after merging back from a scalarized branch.
+  // LiveOut is mapped to the recipe vectorizing it, instead of this recipe
+  // which provides it with PHIs; hence no call to Plan->setInst2Recipe() here.
+  VPMergeScalarizeBranchRecipe(Instruction *LiveOut)
+      : VPRecipeBase(VPMergeScalarizeBranchSC), LiveOut(LiveOut) {}
+
+  ~VPMergeScalarizeBranchRecipe() {}
+
+  /// Method to support type inquiry through isa, cast, and dyn_cast.
+  static inline bool classof(const VPRecipeBase *V) {
+    return V->getVPRecipeID() == VPRecipeBase::VPMergeScalarizeBranchSC;
+  }
+
+  /// The method which generates Phi instructions for live-outs as needed to
+  /// retain SSA form, corresponding to this VPMergeScalarizeBranchRecipe in
+  /// the vectorized version, thereby "executing" the VPlan.
+  void vectorize(VPTransformState &State) override;
+
+  /// Print the recipe.
+  void print(raw_ostream &O) const override {
+    O << "Merge Scalarize Branch:\n" << *LiveOut;
+  }
+};
+
+class VPlanUtilsLoopVectorizer : public VPlanUtils {
+public:
+  VPlanUtilsLoopVectorizer(VPlan *Plan) : VPlanUtils(Plan) {}
+
+  ~VPlanUtilsLoopVectorizer() {}
+
+  VPOneByOneRecipeBase *createOneByOneRecipe(const BasicBlock::iterator B,
+                                             const BasicBlock::iterator E,
+                                             VPlan *Plan, bool isScalarizing);
+
+  VPOneByOneRecipeBase *splitRecipe(Instruction *Split);
+
+  void insertBefore(Instruction *Inst, Instruction *Before,
+                    unsigned MinLane = 0);
+
+  void removeInstruction(Instruction *Inst, unsigned FromLane = 0);
+
+  void sinkInstruction(Instruction *Inst, VPBasicBlock *To,
+                       unsigned MinLane = 0);
+
+  template <typename T> void designateLaneZero(T &Recipe) {
+    Recipe->DesignatedLanes = VPLaneRange(0, 0);
+  }
+};
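+// NOTE (editorial illustration): VPExtractMaskBitRecipe and
+// VPMergeScalarizeBranchRecipe cooperate to form a per-lane if-then construct
+// around a scalarized, predicated instruction. The sketch below outlines the
+// generated shape for a single lane, reusing the generic utility
+// SplitBlockAndInsertIfThen; BlockMask stands for the vector mask of the
+// instruction's original block. This is an assumption-laden sketch, not the
+// recipes' actual implementation.
+static void sketchPredicateLane(IRBuilder<> &Builder, Value *BlockMask,
+                                unsigned Lane, Instruction *ScalarInst) {
+  // Extract this lane's condition bit from the block mask; this is the job
+  // of VPExtractMaskBitRecipe.
+  Value *Bit =
+      Builder.CreateExtractElement(BlockMask, Builder.getInt32(Lane));
+  // Guard the scalarized instruction with a conditional branch on that bit.
+  // Live-out values then receive merge phis at the join point, which is the
+  // job of VPMergeScalarizeBranchRecipe.
+  TerminatorInst *ThenTerm =
+      SplitBlockAndInsertIfThen(Bit, ScalarInst, /*Unreachable=*/false);
+  ScalarInst->moveBefore(ThenTerm);
+}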
+/// \brief This holds vectorization requirements that must be verified late in
+/// the process. The requirements are set by legalize and costmodel. Once
+/// vectorization has been determined to be possible and profitable the
+/// requirements can be verified by looking for metadata or compiler options.
+/// For example, some loops require FP commutativity which is only allowed if
+/// vectorization is explicitly specified or if the fast-math compiler option
+/// has been provided.
+/// Late evaluation of these requirements allows helpful diagnostics to be
+/// composed that tell the user what needs to be done to vectorize the loop.
+/// For example, by specifying #pragma clang loop vectorize or -ffast-math.
+/// Late evaluation should be used only when diagnostics can be generated that
+/// can be followed by a non-expert user.
+class LoopVectorizationRequirements {
+public:
+  LoopVectorizationRequirements(OptimizationRemarkEmitter &ORE)
+      : NumRuntimePointerChecks(0), UnsafeAlgebraInst(nullptr), ORE(ORE) {}
+
+  void addUnsafeAlgebraInst(Instruction *I) {
+    // First unsafe algebra instruction.
+    if (!UnsafeAlgebraInst)
+      UnsafeAlgebraInst = I;
+  }
+
+  void addRuntimePointerChecks(unsigned Num) { NumRuntimePointerChecks = Num; }
+
+  bool doesNotMeet(Function *F, Loop *L, const LoopVectorizeHints &Hints) {
+    const char *PassName = Hints.vectorizeAnalysisPassName();
+    bool Failed = false;
+    if (UnsafeAlgebraInst && !Hints.allowReordering()) {
+      ORE.emit(
+          OptimizationRemarkAnalysisFPCommute(PassName, "CantReorderFPOps",
+                                              UnsafeAlgebraInst->getDebugLoc(),
+                                              UnsafeAlgebraInst->getParent())
+          << "loop not vectorized: cannot prove it is safe to reorder "
+             "floating-point operations");
+      Failed = true;
+    }
+
+    // Test if runtime memcheck thresholds are exceeded.
+    bool PragmaThresholdReached =
+        NumRuntimePointerChecks > PragmaVectorizeMemoryCheckThreshold;
+    bool ThresholdReached =
+        NumRuntimePointerChecks > VectorizerParams::RuntimeMemoryCheckThreshold;
+    if ((ThresholdReached && !Hints.allowReordering()) ||
+        PragmaThresholdReached) {
+      ORE.emit(OptimizationRemarkAnalysisAliasing(PassName, "CantReorderMemOps",
+                                                  L->getStartLoc(),
+                                                  L->getHeader())
+               << "loop not vectorized: cannot prove it is safe to reorder "
+                  "memory operations");
+      DEBUG(dbgs() << "LV: Too many memory checks needed.\n");
+      Failed = true;
+    }
+
+    return Failed;
+  }
+
+private:
+  unsigned NumRuntimePointerChecks;
+  Instruction *UnsafeAlgebraInst;
+
+  /// Interface to emit optimization remarks.
+  OptimizationRemarkEmitter &ORE;
+};
+
+static void addAcyclicInnerLoop(Loop &L, SmallVectorImpl<Loop *> &V) {
+  if (L.empty()) {
+    if (!hasCyclesInLoopBody(L))
+      V.push_back(&L);
+    return;
+  }
+  for (Loop *InnerL : L)
+    addAcyclicInnerLoop(*InnerL, V);
+}
+
+/// The LoopVectorize Pass.
+struct LoopVectorize : public FunctionPass {
+  /// Pass identification, replacement for typeid
+  static char ID;
+
+  explicit LoopVectorize(bool NoUnrolling = false, bool AlwaysVectorize = true)
+      : FunctionPass(ID) {
+    Impl.DisableUnrolling = NoUnrolling;
+    Impl.AlwaysVectorize = AlwaysVectorize;
+    initializeLoopVectorizePass(*PassRegistry::getPassRegistry());
+  }
+
+  LoopVectorizePass Impl;
+
+  bool runOnFunction(Function &F) override {
+    if (skipFunction(F))
+      return false;
+
+    auto *SE = &getAnalysis<ScalarEvolutionWrapperPass>().getSE();
+    auto *LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
+    auto *TTI = &getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
+    auto *DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();
+    auto *BFI = &getAnalysis<BlockFrequencyInfoWrapperPass>().getBFI();
+    auto *TLIP = getAnalysisIfAvailable<TargetLibraryInfoWrapperPass>();
+    auto *TLI = TLIP ?
&TLIP->getTLI() : nullptr; + auto *AA = &getAnalysis().getAAResults(); + auto *AC = &getAnalysis().getAssumptionCache(F); + auto *LAA = &getAnalysis(); + auto *DB = &getAnalysis().getDemandedBits(); + auto *ORE = &getAnalysis().getORE(); + + std::function GetLAA = + [&](Loop &L) -> const LoopAccessInfo & { return LAA->getInfo(&L); }; + + return Impl.runImpl(F, *SE, *LI, *TTI, *DT, *BFI, TLI, *DB, *AA, *AC, + GetLAA, *ORE); + } + + void getAnalysisUsage(AnalysisUsage &AU) const override { + AU.addRequired(); + AU.addRequiredID(LoopSimplifyID); + AU.addRequiredID(LCSSAID); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); AU.addPreserved(); AU.addPreserved(); AU.addPreserved(); @@ -2155,8 +2693,8 @@ } // end anonymous namespace //===----------------------------------------------------------------------===// -// Implementation of LoopVectorizationLegality, InnerLoopVectorizer and -// LoopVectorizationCostModel. +// Implementation of LoopVectorizationLegality, InnerLoopVectorizer, +// LoopVectorizationCostModel and LoopVectorizationPlanner. //===----------------------------------------------------------------------===// Value *InnerLoopVectorizer::getBroadcastInstrs(Value *V) { @@ -2239,7 +2777,9 @@ return any_of(IV->users(), isScalarInst); } -void InnerLoopVectorizer::widenIntInduction(PHINode *IV, TruncInst *Trunc) { +std::pair +InnerLoopVectorizer::widenIntInduction(bool NeedsScalarIV, PHINode *IV, + TruncInst *Trunc) { auto II = Legal->getInductionVars()->find(IV); assert(II != Legal->getInductionVars()->end() && "IV is not an induction"); @@ -2261,11 +2801,6 @@ // True if we have vectorized the induction variable. auto VectorizedIV = false; - // Determine if we want a scalar version of the induction variable. This is - // true if the induction variable itself is not widened, or if it has at - // least one user in the loop that is not widened. - auto NeedsScalarIV = VF > 1 && needsScalarInduction(EntryVal); - // If the induction variable has a constant integer step value, go ahead and // get it now. if (ID.getConstIntStepValue()) @@ -2321,13 +2856,9 @@ } // If an induction variable is only used for counting loop iterations or - // calculating addresses, it doesn't need to be widened. Create scalar steps - // that can be used by instructions we will later scalarize. Note that the - // addition of the scalar steps will not increase the number of instructions - // in the loop in the common case prior to InstCombine. We will be trading - // one vector extract for each scalar step. - if (NeedsScalarIV) - buildScalarSteps(ScalarIV, Step, EntryVal); + // calculating addresses, it doesn't need to be widened. + + return std::make_pair(ScalarIV, Step); } Value *InnerLoopVectorizer::getStepVector(Value *Val, int StartIdx, Value *Step, @@ -2387,7 +2918,9 @@ } void InnerLoopVectorizer::buildScalarSteps(Value *ScalarIV, Value *Step, - Value *EntryVal) { + Value *EntryVal, unsigned MinPart, + unsigned MaxPart, unsigned MinLane, + unsigned MaxLane) { // We shouldn't have to build scalar steps if we aren't vectorizing. assert(VF > 1 && "VF should be greater than one"); @@ -2397,24 +2930,18 @@ assert(ScalarIVTy->isIntegerTy() && ScalarIVTy == Step->getType() && "Val and Step should have the same integer type"); - // Determine the number of scalars we need to generate for each unroll - // iteration. If EntryVal is uniform, we only need to generate the first - // lane. 
Otherwise, we generate all VF values. - unsigned Lanes = - Legal->isUniformAfterVectorization(cast(EntryVal)) ? 1 : VF; + ScalarParts &Entry = VectorLoopValueMap.getOrCreateScalar(EntryVal, VF); // Compute the scalar steps and save the results in VectorLoopValueMap. - ScalarParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { + for (unsigned Part = MinPart; Part <= MaxPart; ++Part) { Entry[Part].resize(VF); - for (unsigned Lane = 0; Lane < Lanes; ++Lane) { + for (unsigned Lane = MinLane; Lane <= MaxLane; ++Lane) { auto *StartIdx = ConstantInt::get(ScalarIVTy, VF * Part + Lane); auto *Mul = Builder.CreateMul(StartIdx, Step); auto *Add = Builder.CreateAdd(ScalarIV, Mul); Entry[Part][Lane] = Add; } } - VectorLoopValueMap.initScalar(EntryVal, Entry); } int LoopVectorizationLegality::isConsecutivePtr(Value *Ptr) { @@ -2432,6 +2959,39 @@ return LAI->isUniform(V); } +void InnerLoopVectorizer::constructVectorValue(Value *V, unsigned Part, + unsigned Lane) { + assert(V != Induction && "The new induction variable should not be used."); + assert(!V->getType()->isVectorTy() && "Can't widen a vector"); + assert(!V->getType()->isVoidTy() && "Type does not produce a value"); + + if (!VectorLoopValueMap.hasVector(V)) { + VectorParts Entry(UF); + for (unsigned P = 0; P < UF; ++P) + Entry[P] = nullptr; + VectorLoopValueMap.initVector(V, Entry); + } + + VectorParts &Parts = VectorLoopValueMap.VectorMapStorage[V]; + + assert(VectorLoopValueMap.hasScalar(V) && "Expected scalar values to exist"); + + auto *ScalarInst = cast(getScalarValue(V, Part, Lane)); + + Value *VectorValue = nullptr; + + // If we're constructing lane 0, start from undef; otherwise, start from the + // last value created. + if (Lane == 0) + VectorValue = UndefValue::get(VectorType::get(V->getType(), VF)); + else + VectorValue = Parts[Part]; + + VectorValue = Builder.CreateInsertElement(VectorValue, ScalarInst, + Builder.getInt32(Lane)); + Parts[Part] = VectorValue; +} + const InnerLoopVectorizer::VectorParts & InnerLoopVectorizer::getVectorValue(Value *V) { assert(V != Induction && "The new induction variable should not be used."); @@ -2475,8 +3035,11 @@ // Set the insert point after the last scalarized instruction. This ensures // the insertelement sequence will directly follow the scalar definitions. auto OldIP = Builder.saveIP(); - auto NewIP = std::next(BasicBlock::iterator(LastInst)); - Builder.SetInsertPoint(&*NewIP); + auto NextInsertionPoint = std::next(BasicBlock::iterator(LastInst)); + if (NextInsertionPoint != LastInst->getParent()->end()) + Builder.SetInsertPoint(&*NextInsertionPoint); + else + Builder.SetInsertPoint(LastInst->getParent()); // However, if we are vectorizing, we need to construct the vector values. // If the value is known to be uniform after vectorization, we can just @@ -2831,10 +3394,6 @@ Alignment = DL.getABITypeAlignment(ScalarDataTy); unsigned AddressSpace = Ptr->getType()->getPointerAddressSpace(); - // Scalarize the memory instruction if necessary. - if (Legal->memoryInstructionMustBeScalarized(Instr, VF)) - return scalarizeInstruction(Instr, Legal->isScalarWithPredication(Instr)); - // Determine if the pointer operand of the access is either consecutive or // reverse consecutive. 
int ConsecutiveStride = Legal->isConsecutivePtr(Ptr); @@ -3004,11 +3563,11 @@ } void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr, - bool IfPredicateInstr) { + unsigned MinPart, + unsigned MaxPart, + unsigned MinLane, + unsigned MaxLane) { assert(!Instr->getType()->isAggregateType() && "Can't handle vectors"); - DEBUG(dbgs() << "LV: Scalarizing" - << (IfPredicateInstr ? " and predicating:" : ":") << *Instr - << '\n'); // Holds vector parameters or scalars, in case of uniform vals. SmallVector Params; @@ -3018,30 +3577,12 @@ bool IsVoidRetTy = Instr->getType()->isVoidTy(); // Initialize a new scalar map entry. - ScalarParts Entry(UF); - - VectorParts Cond; - if (IfPredicateInstr) - Cond = createBlockInMask(Instr->getParent()); - - // Determine the number of scalars we need to generate for each unroll - // iteration. If the instruction is uniform, we only need to generate the - // first lane. Otherwise, we generate all VF values. - unsigned Lanes = Legal->isUniformAfterVectorization(Instr) ? 1 : VF; + ScalarParts &Entry = VectorLoopValueMap.getOrCreateScalar(Instr, VF); // For each vector unroll 'part': - for (unsigned Part = 0; Part < UF; ++Part) { - Entry[Part].resize(VF); + for (unsigned Part = MinPart; Part <= MaxPart; ++Part) { // For each scalar that we create: - for (unsigned Lane = 0; Lane < Lanes; ++Lane) { - - // Start if-block. - Value *Cmp = nullptr; - if (IfPredicateInstr) { - Cmp = Builder.CreateExtractElement(Cond[Part], Builder.getInt32(Lane)); - Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Cmp, - ConstantInt::get(Cmp->getType(), 1)); - } + for (unsigned Lane = MinLane; Lane <= MaxLane; ++Lane) { Instruction *Cloned = Instr->clone(); if (!IsVoidRetTy) @@ -3065,13 +3606,8 @@ if (auto *II = dyn_cast(Cloned)) if (II->getIntrinsicID() == Intrinsic::assume) AC->registerAssumption(II); - - // End if-block. - if (IfPredicateInstr) - PredicatedInstructions.push_back(std::make_pair(Cloned, Cmp)); } } - VectorLoopValueMap.initScalar(Instr, Entry); } PHINode *InnerLoopVectorizer::createInductionVariable(Loop *L, Value *Start, @@ -3854,6 +4390,7 @@ } void InnerLoopVectorizer::vectorizeLoop() { + //===------------------------------------------------===// // // Notice: any optimization or new instruction that go @@ -3861,8 +4398,37 @@ // the cost-model. // //===------------------------------------------------===// - Constant *Zero = Builder.getInt32(0); + // Insert truncates and extends for any truncated instructions as hints to + // InstCombine. + if (VF > 1) + truncateToMinimalBitwidths(); + + fixCrossIterationPHIs(); + + // Update the dominator tree. + // + // FIXME: After creating the structure of the new loop, the dominator tree is + // no longer up-to-date, and it remains that way until we update it + // here. An out-of-date dominator tree is problematic for SCEV, + // because SCEVExpander uses it to guide code generation. The + // vectorizer use SCEVExpanders in several places. Instead, we should + // keep the dominator tree up-to-date as we go. + updateAnalysis(); + + // Fix-up external users of the induction variables. + for (auto &Entry : *Legal->getInductionVars()) + fixupIVUsers(Entry.first, Entry.second, + getOrCreateVectorTripCount(LI->getLoopFor(LoopVectorBody)), + IVEndValues[Entry.first], LoopMiddleBlock); + + fixLCSSAPHIs(); + + // Remove redundant induction instructions. + cse(LoopVectorBody); +} + +void InnerLoopVectorizer::fixCrossIterationPHIs() { // In order to support recurrences we need to be able to vectorize Phi nodes. 
// Phi nodes have cycles, so we need to vectorize them in two stages. First, // we create a new vector PHI node with no incoming edges. We use this value @@ -3870,268 +4436,226 @@ // all of the instructions in the block are complete we add the new incoming // edges to the PHI. At this point all of the instructions in the basic block // are vectorized, so we can use them to construct the PHI. - PhiVector PHIsToFix; - - // Collect instructions from the original loop that will become trivially - // dead in the vectorized loop. We don't need to vectorize these - // instructions. - collectTriviallyDeadInstructions(); - - // Scan the loop in a topological order to ensure that defs are vectorized - // before users. - LoopBlocksDFS DFS(OrigLoop); - DFS.perform(LI); - - // Vectorize all of the blocks in the original loop. - for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) - vectorizeBlockInLoop(BB, &PHIsToFix); - - // Insert truncates and extends for any truncated instructions as hints to - // InstCombine. - if (VF > 1) - truncateToMinimalBitwidths(); // At this point every instruction in the original loop is widened to a - // vector form. Now we need to fix the recurrences in PHIsToFix. These PHI - // nodes are currently empty because we did not want to introduce cycles. + // vector form. Now we need to fix the recurrences. These PHI nodes are + // currently empty because we did not want to introduce cycles. // This is the second stage of vectorizing recurrences. - for (PHINode *Phi : PHIsToFix) { - assert(Phi && "Unable to recover vectorized PHI"); - - // Handle first-order recurrences that need to be fixed. - if (Legal->isFirstOrderRecurrence(Phi)) { + for (Instruction &I : *OrigLoop->getHeader()) { + PHINode *Phi = dyn_cast(&I); + if (!Phi) + break; + // Handle first-order recurrences and reductions that need to be fixed. + if (Legal->isFirstOrderRecurrence(Phi)) fixFirstOrderRecurrence(Phi); - continue; - } + else if (Legal->isReductionVariable(Phi)) + fixReduction(Phi); + } +} - // If the phi node is not a first-order recurrence, it must be a reduction. - // Get it's reduction variable descriptor. - assert(Legal->isReductionVariable(Phi) && - "Unable to find the reduction variable"); - RecurrenceDescriptor RdxDesc = (*Legal->getReductionVars())[Phi]; - - RecurrenceDescriptor::RecurrenceKind RK = RdxDesc.getRecurrenceKind(); - TrackingVH ReductionStartValue = RdxDesc.getRecurrenceStartValue(); - Instruction *LoopExitInst = RdxDesc.getLoopExitInstr(); - RecurrenceDescriptor::MinMaxRecurrenceKind MinMaxKind = - RdxDesc.getMinMaxRecurrenceKind(); - setDebugLocFromInst(Builder, ReductionStartValue); - - // We need to generate a reduction vector from the incoming scalar. - // To do so, we need to generate the 'identity' vector and override - // one of the elements with the incoming scalar reduction. We need - // to do it in the vector-loop preheader. - Builder.SetInsertPoint(LoopBypassBlocks[1]->getTerminator()); - - // This is the vector-clone of the value that leaves the loop. - const VectorParts &VectorExit = getVectorValue(LoopExitInst); - Type *VecTy = VectorExit[0]->getType(); - - // Find the reduction identity variable. Zero for addition, or, xor, - // one for multiplication, -1 for And. - Value *Identity; - Value *VectorStart; - if (RK == RecurrenceDescriptor::RK_IntegerMinMax || - RK == RecurrenceDescriptor::RK_FloatMinMax) { - // MinMax reduction have the start value as their identify. 
- if (VF == 1) { - VectorStart = Identity = ReductionStartValue; - } else { - VectorStart = Identity = - Builder.CreateVectorSplat(VF, ReductionStartValue, "minmax.ident"); - } +void InnerLoopVectorizer::fixReduction(PHINode *Phi) { + Constant *Zero = Builder.getInt32(0); + + // Get the reduction variable descriptor. + RecurrenceDescriptor RdxDesc = (*Legal->getReductionVars())[Phi]; + + RecurrenceDescriptor::RecurrenceKind RK = RdxDesc.getRecurrenceKind(); + TrackingVH ReductionStartValue = RdxDesc.getRecurrenceStartValue(); + Instruction *LoopExitInst = RdxDesc.getLoopExitInstr(); + RecurrenceDescriptor::MinMaxRecurrenceKind MinMaxKind = + RdxDesc.getMinMaxRecurrenceKind(); + setDebugLocFromInst(Builder, ReductionStartValue); + + // We need to generate a reduction vector from the incoming scalar. + // To do so, we need to generate the 'identity' vector and override + // one of the elements with the incoming scalar reduction. We need + // to do it in the vector-loop preheader. + Builder.SetInsertPoint(LoopBypassBlocks[1]->getTerminator()); + + // This is the vector-clone of the value that leaves the loop. + const VectorParts &VectorExit = getVectorValue(LoopExitInst); + Type *VecTy = VectorExit[0]->getType(); + + // Find the reduction identity variable. Zero for addition, or, xor, + // one for multiplication, -1 for And. + Value *Identity; + Value *VectorStart; + if (RK == RecurrenceDescriptor::RK_IntegerMinMax || + RK == RecurrenceDescriptor::RK_FloatMinMax) { + // MinMax reduction have the start value as their identify. + if (VF == 1) { + VectorStart = Identity = ReductionStartValue; } else { - // Handle other reduction kinds: - Constant *Iden = RecurrenceDescriptor::getRecurrenceIdentity( - RK, VecTy->getScalarType()); - if (VF == 1) { - Identity = Iden; - // This vector is the Identity vector where the first element is the - // incoming scalar reduction. - VectorStart = ReductionStartValue; - } else { - Identity = ConstantVector::getSplat(VF, Iden); + VectorStart = Identity = + Builder.CreateVectorSplat(VF, ReductionStartValue, "minmax.ident"); + } + } else { + // Handle other reduction kinds: + Constant *Iden = + RecurrenceDescriptor::getRecurrenceIdentity(RK, VecTy->getScalarType()); + if (VF == 1) { + Identity = Iden; + // This vector is the Identity vector where the first element is the + // incoming scalar reduction. + VectorStart = ReductionStartValue; + } else { + Identity = ConstantVector::getSplat(VF, Iden); - // This vector is the Identity vector where the first element is the - // incoming scalar reduction. - VectorStart = - Builder.CreateInsertElement(Identity, ReductionStartValue, Zero); - } + // This vector is the Identity vector where the first element is the + // incoming scalar reduction. + VectorStart = + Builder.CreateInsertElement(Identity, ReductionStartValue, Zero); } + } + + // Fix the vector-loop phi. + + // Reductions do not have to start at zero. They can start with + // any loop invariant values. + const VectorParts &VecRdxPhi = getVectorValue(Phi); + BasicBlock *Latch = OrigLoop->getLoopLatch(); + Value *LoopVal = Phi->getIncomingValueForBlock(Latch); + const VectorParts &Val = getVectorValue(LoopVal); + for (unsigned part = 0; part < UF; ++part) { + // Make sure to add the reduction stat value only to the + // first unroll part. + Value *StartVal = (part == 0) ? 
VectorStart : Identity; + cast(VecRdxPhi[part])->addIncoming(StartVal, LoopVectorPreHeader); + cast(VecRdxPhi[part]) + ->addIncoming(Val[part], + LI->getLoopFor(LoopVectorBody)->getLoopLatch()); + } + + // Before each round, move the insertion point right between + // the PHIs and the values we are going to write. + // This allows us to write both PHINodes and the extractelement + // instructions. + Builder.SetInsertPoint(&*LoopMiddleBlock->getFirstInsertionPt()); - // Fix the vector-loop phi. + VectorParts &RdxParts = VectorLoopValueMap.getVector(LoopExitInst); + setDebugLocFromInst(Builder, LoopExitInst); - // Reductions do not have to start at zero. They can start with - // any loop invariant values. - const VectorParts &VecRdxPhi = getVectorValue(Phi); - BasicBlock *Latch = OrigLoop->getLoopLatch(); - Value *LoopVal = Phi->getIncomingValueForBlock(Latch); - const VectorParts &Val = getVectorValue(LoopVal); + // If the vector reduction can be performed in a smaller type, we truncate + // then extend the loop exit value to enable InstCombine to evaluate the + // entire expression in the smaller type. + if (VF > 1 && Phi->getType() != RdxDesc.getRecurrenceType()) { + Type *RdxVecTy = VectorType::get(RdxDesc.getRecurrenceType(), VF); + Builder.SetInsertPoint(LoopVectorBody->getTerminator()); for (unsigned part = 0; part < UF; ++part) { - // Make sure to add the reduction stat value only to the - // first unroll part. - Value *StartVal = (part == 0) ? VectorStart : Identity; - cast(VecRdxPhi[part]) - ->addIncoming(StartVal, LoopVectorPreHeader); - cast(VecRdxPhi[part]) - ->addIncoming(Val[part], LoopVectorBody); + Value *Trunc = Builder.CreateTrunc(RdxParts[part], RdxVecTy); + Value *Extnd = RdxDesc.isSigned() ? Builder.CreateSExt(Trunc, VecTy) + : Builder.CreateZExt(Trunc, VecTy); + for (Value::user_iterator UI = RdxParts[part]->user_begin(); + UI != RdxParts[part]->user_end();) + if (*UI != Trunc) { + (*UI++)->replaceUsesOfWith(RdxParts[part], Extnd); + RdxParts[part] = Extnd; + } else { + ++UI; + } } - - // Before each round, move the insertion point right between - // the PHIs and the values we are going to write. - // This allows us to write both PHINodes and the extractelement - // instructions. Builder.SetInsertPoint(&*LoopMiddleBlock->getFirstInsertionPt()); + for (unsigned part = 0; part < UF; ++part) + RdxParts[part] = Builder.CreateTrunc(RdxParts[part], RdxVecTy); + } + + // Reduce all of the unrolled parts into a single vector. + Value *ReducedPartRdx = RdxParts[0]; + unsigned Op = RecurrenceDescriptor::getRecurrenceBinOp(RK); + setDebugLocFromInst(Builder, ReducedPartRdx); + for (unsigned part = 1; part < UF; ++part) { + if (Op != Instruction::ICmp && Op != Instruction::FCmp) + // Floating point operations had to be 'fast' to enable the reduction. + ReducedPartRdx = addFastMathFlag( + Builder.CreateBinOp((Instruction::BinaryOps)Op, RdxParts[part], + ReducedPartRdx, "bin.rdx")); + else + ReducedPartRdx = RecurrenceDescriptor::createMinMaxOp( + Builder, MinMaxKind, ReducedPartRdx, RdxParts[part]); + } - VectorParts &RdxParts = VectorLoopValueMap.getVector(LoopExitInst); - setDebugLocFromInst(Builder, LoopExitInst); - - // If the vector reduction can be performed in a smaller type, we truncate - // then extend the loop exit value to enable InstCombine to evaluate the - // entire expression in the smaller type. 
- if (VF > 1 && Phi->getType() != RdxDesc.getRecurrenceType()) { - Type *RdxVecTy = VectorType::get(RdxDesc.getRecurrenceType(), VF); - Builder.SetInsertPoint(LoopVectorBody->getTerminator()); - for (unsigned part = 0; part < UF; ++part) { - Value *Trunc = Builder.CreateTrunc(RdxParts[part], RdxVecTy); - Value *Extnd = RdxDesc.isSigned() ? Builder.CreateSExt(Trunc, VecTy) - : Builder.CreateZExt(Trunc, VecTy); - for (Value::user_iterator UI = RdxParts[part]->user_begin(); - UI != RdxParts[part]->user_end();) - if (*UI != Trunc) { - (*UI++)->replaceUsesOfWith(RdxParts[part], Extnd); - RdxParts[part] = Extnd; - } else { - ++UI; - } - } - Builder.SetInsertPoint(&*LoopMiddleBlock->getFirstInsertionPt()); - for (unsigned part = 0; part < UF; ++part) - RdxParts[part] = Builder.CreateTrunc(RdxParts[part], RdxVecTy); - } + if (VF > 1) { + // VF is a power of 2 so we can emit the reduction using log2(VF) shuffles + // and vector ops, reducing the set of values being computed by half each + // round. + assert(isPowerOf2_32(VF) && + "Reduction emission only supported for pow2 vectors!"); + Value *TmpVec = ReducedPartRdx; + SmallVector ShuffleMask(VF, nullptr); + for (unsigned i = VF; i != 1; i >>= 1) { + // Move the upper half of the vector to the lower half. + for (unsigned j = 0; j != i / 2; ++j) + ShuffleMask[j] = Builder.getInt32(i / 2 + j); + + // Fill the rest of the mask with undef. + std::fill(&ShuffleMask[i / 2], ShuffleMask.end(), + UndefValue::get(Builder.getInt32Ty())); + + Value *Shuf = Builder.CreateShuffleVector( + TmpVec, UndefValue::get(TmpVec->getType()), + ConstantVector::get(ShuffleMask), "rdx.shuf"); - // Reduce all of the unrolled parts into a single vector. - Value *ReducedPartRdx = RdxParts[0]; - unsigned Op = RecurrenceDescriptor::getRecurrenceBinOp(RK); - setDebugLocFromInst(Builder, ReducedPartRdx); - for (unsigned part = 1; part < UF; ++part) { if (Op != Instruction::ICmp && Op != Instruction::FCmp) // Floating point operations had to be 'fast' to enable the reduction. - ReducedPartRdx = addFastMathFlag( - Builder.CreateBinOp((Instruction::BinaryOps)Op, RdxParts[part], - ReducedPartRdx, "bin.rdx")); + TmpVec = addFastMathFlag(Builder.CreateBinOp((Instruction::BinaryOps)Op, + TmpVec, Shuf, "bin.rdx")); else - ReducedPartRdx = RecurrenceDescriptor::createMinMaxOp( - Builder, MinMaxKind, ReducedPartRdx, RdxParts[part]); + TmpVec = RecurrenceDescriptor::createMinMaxOp(Builder, MinMaxKind, + TmpVec, Shuf); } - if (VF > 1) { - // VF is a power of 2 so we can emit the reduction using log2(VF) shuffles - // and vector ops, reducing the set of values being computed by half each - // round. - assert(isPowerOf2_32(VF) && - "Reduction emission only supported for pow2 vectors!"); - Value *TmpVec = ReducedPartRdx; - SmallVector ShuffleMask(VF, nullptr); - for (unsigned i = VF; i != 1; i >>= 1) { - // Move the upper half of the vector to the lower half. - for (unsigned j = 0; j != i / 2; ++j) - ShuffleMask[j] = Builder.getInt32(i / 2 + j); - - // Fill the rest of the mask with undef. - std::fill(&ShuffleMask[i / 2], ShuffleMask.end(), - UndefValue::get(Builder.getInt32Ty())); - - Value *Shuf = Builder.CreateShuffleVector( - TmpVec, UndefValue::get(TmpVec->getType()), - ConstantVector::get(ShuffleMask), "rdx.shuf"); - - if (Op != Instruction::ICmp && Op != Instruction::FCmp) - // Floating point operations had to be 'fast' to enable the reduction. 
- TmpVec = addFastMathFlag(Builder.CreateBinOp( - (Instruction::BinaryOps)Op, TmpVec, Shuf, "bin.rdx")); - else - TmpVec = RecurrenceDescriptor::createMinMaxOp(Builder, MinMaxKind, - TmpVec, Shuf); - } + // The result is in the first element of the vector. + ReducedPartRdx = Builder.CreateExtractElement(TmpVec, Builder.getInt32(0)); - // The result is in the first element of the vector. + // If the reduction can be performed in a smaller type, we need to extend + // the reduction to the wider type before we branch to the original loop. + if (Phi->getType() != RdxDesc.getRecurrenceType()) ReducedPartRdx = - Builder.CreateExtractElement(TmpVec, Builder.getInt32(0)); - - // If the reduction can be performed in a smaller type, we need to extend - // the reduction to the wider type before we branch to the original loop. - if (Phi->getType() != RdxDesc.getRecurrenceType()) - ReducedPartRdx = - RdxDesc.isSigned() - ? Builder.CreateSExt(ReducedPartRdx, Phi->getType()) - : Builder.CreateZExt(ReducedPartRdx, Phi->getType()); - } - - // Create a phi node that merges control-flow from the backedge-taken check - // block and the middle block. - PHINode *BCBlockPhi = PHINode::Create(Phi->getType(), 2, "bc.merge.rdx", - LoopScalarPreHeader->getTerminator()); - for (unsigned I = 0, E = LoopBypassBlocks.size(); I != E; ++I) - BCBlockPhi->addIncoming(ReductionStartValue, LoopBypassBlocks[I]); - BCBlockPhi->addIncoming(ReducedPartRdx, LoopMiddleBlock); - - // Now, we need to fix the users of the reduction variable - // inside and outside of the scalar remainder loop. - // We know that the loop is in LCSSA form. We need to update the - // PHI nodes in the exit blocks. - for (BasicBlock::iterator LEI = LoopExitBlock->begin(), - LEE = LoopExitBlock->end(); - LEI != LEE; ++LEI) { - PHINode *LCSSAPhi = dyn_cast(LEI); - if (!LCSSAPhi) - break; - - // All PHINodes need to have a single entry edge, or two if - // we already fixed them. - assert(LCSSAPhi->getNumIncomingValues() < 3 && "Invalid LCSSA PHI"); - - // We found our reduction value exit-PHI. Update it with the - // incoming bypass edge. - if (LCSSAPhi->getIncomingValue(0) == LoopExitInst) { - // Add an edge coming from the bypass. - LCSSAPhi->addIncoming(ReducedPartRdx, LoopMiddleBlock); - break; - } - } // end of the LCSSA phi scan. - - // Fix the scalar loop reduction variable with the incoming reduction sum - // from the vector body and from the backedge value. - int IncomingEdgeBlockIdx = - Phi->getBasicBlockIndex(OrigLoop->getLoopLatch()); - assert(IncomingEdgeBlockIdx >= 0 && "Invalid block index"); - // Pick the other block. - int SelfEdgeBlockIdx = (IncomingEdgeBlockIdx ? 0 : 1); - Phi->setIncomingValue(SelfEdgeBlockIdx, BCBlockPhi); - Phi->setIncomingValue(IncomingEdgeBlockIdx, LoopExitInst); - } // end of for each Phi in PHIsToFix. - - // Update the dominator tree. - // - // FIXME: After creating the structure of the new loop, the dominator tree is - // no longer up-to-date, and it remains that way until we update it - // here. An out-of-date dominator tree is problematic for SCEV, - // because SCEVExpander uses it to guide code generation. The - // vectorizer use SCEVExpanders in several places. Instead, we should - // keep the dominator tree up-to-date as we go. - updateAnalysis(); - - // Fix-up external users of the induction variables. 
- for (auto &Entry : *Legal->getInductionVars()) - fixupIVUsers(Entry.first, Entry.second, - getOrCreateVectorTripCount(LI->getLoopFor(LoopVectorBody)), - IVEndValues[Entry.first], LoopMiddleBlock); + RdxDesc.isSigned() + ? Builder.CreateSExt(ReducedPartRdx, Phi->getType()) + : Builder.CreateZExt(ReducedPartRdx, Phi->getType()); + } + + // Create a phi node that merges control-flow from the backedge-taken check + // block and the middle block. + PHINode *BCBlockPhi = PHINode::Create(Phi->getType(), 2, "bc.merge.rdx", + LoopScalarPreHeader->getTerminator()); + for (unsigned I = 0, E = LoopBypassBlocks.size(); I != E; ++I) + BCBlockPhi->addIncoming(ReductionStartValue, LoopBypassBlocks[I]); + BCBlockPhi->addIncoming(ReducedPartRdx, LoopMiddleBlock); + + // Now, we need to fix the users of the reduction variable + // inside and outside of the scalar remainder loop. + // We know that the loop is in LCSSA form. We need to update the + // PHI nodes in the exit blocks. + for (BasicBlock::iterator LEI = LoopExitBlock->begin(), + LEE = LoopExitBlock->end(); + LEI != LEE; ++LEI) { + PHINode *LCSSAPhi = dyn_cast(LEI); + if (!LCSSAPhi) + break; - fixLCSSAPHIs(); - predicateInstructions(); + // All PHINodes need to have a single entry edge, or two if + // we already fixed them. + assert(LCSSAPhi->getNumIncomingValues() < 3 && "Invalid LCSSA PHI"); - // Remove redundant induction instructions. - cse(LoopVectorBody); + // We found our reduction value exit-PHI. Update it with the + // incoming bypass edge. + if (LCSSAPhi->getIncomingValue(0) == LoopExitInst) { + // Add an edge coming from the bypass. + LCSSAPhi->addIncoming(ReducedPartRdx, LoopMiddleBlock); + break; + } + } // end of the LCSSA phi scan. + + // Fix the scalar loop reduction variable with the incoming reduction sum + // from the vector body and from the backedge value. + int IncomingEdgeBlockIdx = Phi->getBasicBlockIndex(OrigLoop->getLoopLatch()); + assert(IncomingEdgeBlockIdx >= 0 && "Invalid block index"); + // Pick the other block. + int SelfEdgeBlockIdx = (IncomingEdgeBlockIdx ? 0 : 1); + Phi->setIncomingValue(SelfEdgeBlockIdx, BCBlockPhi); + Phi->setIncomingValue(IncomingEdgeBlockIdx, LoopExitInst); } void InnerLoopVectorizer::fixFirstOrderRecurrence(PHINode *Phi) { @@ -4296,7 +4820,9 @@ } } -void InnerLoopVectorizer::collectTriviallyDeadInstructions() { +void InnerLoopVectorizer::collectTriviallyDeadInstructions( + Loop *OrigLoop, LoopVectorizationLegality *Legal, + SmallPtrSetImpl &DeadInstructions) { BasicBlock *Latch = OrigLoop->getLoopLatch(); // We create new control-flow for the vectorized loop, so the original @@ -4319,7 +4845,7 @@ } } -void InnerLoopVectorizer::sinkScalarOperands(Instruction *PredInst) { +void InnerLoopUnroller::sinkScalarOperands(Instruction *PredInst) { // The basic block and loop containing the predicated instruction. auto *PredBB = PredInst->getParent(); @@ -4385,7 +4911,51 @@ } while (Changed); } -void InnerLoopVectorizer::predicateInstructions() { +void InnerLoopUnroller::vectorizeLoop() { + + // Collect instructions from the original loop that will become trivially + // dead in the vectorized loop. We don't need to vectorize these + // instructions. + collectTriviallyDeadInstructions(OrigLoop, Legal, DeadInstructions); + + // Scan the loop in a topological order to ensure that defs are vectorized + // before users. + LoopBlocksDFS DFS(OrigLoop); + DFS.perform(LI); + + // Vectorize all of the blocks in the original loop. 
+ for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) + for (Instruction &I : *BB) { + if (!DeadInstructions.count(&I)) + vectorizeInstruction(I); + } + + fixCrossIterationPHIs(); + + // Update the dominator tree. + // + // FIXME: After creating the structure of the new loop, the dominator tree is + // no longer up-to-date, and it remains that way until we update it + // here. An out-of-date dominator tree is problematic for SCEV, + // because SCEVExpander uses it to guide code generation. The + // vectorizer use SCEVExpanders in several places. Instead, we should + // keep the dominator tree up-to-date as we go. + updateAnalysis(); + + // Fix-up external users of the induction variables. + for (auto &Entry : *Legal->getInductionVars()) + fixupIVUsers(Entry.first, Entry.second, + getOrCreateVectorTripCount(LI->getLoopFor(LoopVectorBody)), + IVEndValues[Entry.first], LoopMiddleBlock); + + fixLCSSAPHIs(); + predicateInstructions(); + + // Remove redundant induction instructions. + cse(LoopVectorBody); +} + +void InnerLoopUnroller::predicateInstructions() { // For each instruction I marked for predication on value C, split I into its // own basic block to form an if-then construct over C. Since I may be fed by @@ -4510,8 +5080,8 @@ // Look for cached value. std::pair Edge(Src, Dst); - EdgeMaskCache::iterator ECEntryIt = MaskCache.find(Edge); - if (ECEntryIt != MaskCache.end()) + EdgeMaskCacheTy::iterator ECEntryIt = EdgeMaskCache.find(Edge); + if (ECEntryIt != EdgeMaskCache.end()) return ECEntryIt->second; VectorParts SrcMask = createBlockInMask(Src); @@ -4530,11 +5100,11 @@ for (unsigned part = 0; part < UF; ++part) EdgeMask[part] = Builder.CreateAnd(EdgeMask[part], SrcMask[part]); - MaskCache[Edge] = EdgeMask; + EdgeMaskCache[Edge] = EdgeMask; return EdgeMask; } - MaskCache[Edge] = SrcMask; + EdgeMaskCache[Edge] = SrcMask; return SrcMask; } @@ -4542,6 +5112,11 @@ InnerLoopVectorizer::createBlockInMask(BasicBlock *BB) { assert(OrigLoop->contains(BB) && "Block is not a part of a loop"); + // Look for cached value. + BlockMaskCacheTy::iterator BCEntryIt = BlockMaskCache.find(BB); + if (BCEntryIt != BlockMaskCache.end()) + return BCEntryIt->second; + // Loop incoming mask is all-one. if (OrigLoop->getHeader() == BB) { Value *C = ConstantInt::get(IntegerType::getInt1Ty(BB->getContext()), 1); @@ -4559,6 +5134,7 @@ BlockMask[part] = Builder.CreateOr(BlockMask[part], EM[part]); } + BlockMaskCache[BB] = BlockMask; return BlockMask; } @@ -4631,7 +5207,8 @@ case InductionDescriptor::IK_NoInduction: llvm_unreachable("Unknown induction"); case InductionDescriptor::IK_IntInduction: - return widenIntInduction(P); + widenIntInduction(needsScalarInduction(P), P); // Used only by Unroller + return; case InductionDescriptor::IK_PtrInduction: { // Handle the pointer induction variable case. assert(P->getType()->isPointerTy() && "Unexpected type."); @@ -4703,269 +5280,217 @@ return !CInt || CInt->isZero(); } -void InnerLoopVectorizer::vectorizeBlockInLoop(BasicBlock *BB, PhiVector *PV) { - // For each instruction in the old loop. - for (Instruction &I : *BB) { - - // If the instruction will become trivially dead when vectorized, we don't - // need to generate it. - if (DeadInstructions.count(&I)) - continue; - - // Scalarize instructions that should remain scalar after vectorization. 
- if (VF > 1 && - !(isa(&I) || isa(&I) || - isa(&I)) && - shouldScalarizeInstruction(&I)) { - scalarizeInstruction(&I, Legal->isScalarWithPredication(&I)); - continue; - } - - switch (I.getOpcode()) { - case Instruction::Br: - // Nothing to do for PHIs and BR, since we already took care of the - // loop control flow instructions. - continue; - case Instruction::PHI: { - // Vectorize PHINodes. - widenPHIInstruction(&I, UF, VF, PV); - continue; - } // End of PHI. - - case Instruction::UDiv: - case Instruction::SDiv: - case Instruction::SRem: - case Instruction::URem: - // Scalarize with predication if this instruction may divide by zero and - // block execution is conditional, otherwise fallthrough. - if (Legal->isScalarWithPredication(&I)) { - scalarizeInstruction(&I, true); - continue; - } - case Instruction::Add: - case Instruction::FAdd: - case Instruction::Sub: - case Instruction::FSub: - case Instruction::Mul: - case Instruction::FMul: - case Instruction::FDiv: - case Instruction::FRem: - case Instruction::Shl: - case Instruction::LShr: - case Instruction::AShr: - case Instruction::And: - case Instruction::Or: - case Instruction::Xor: { - // Just widen binops. - auto *BinOp = cast(&I); - setDebugLocFromInst(Builder, BinOp); - const VectorParts &A = getVectorValue(BinOp->getOperand(0)); - const VectorParts &B = getVectorValue(BinOp->getOperand(1)); - - // Use this vector value for all users of the original instruction. - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Value *V = Builder.CreateBinOp(BinOp->getOpcode(), A[Part], B[Part]); +void InnerLoopVectorizer::vectorizeInstruction(Instruction &I) { + switch (I.getOpcode()) { + case Instruction::PHI: { + // Vectorize PHINodes. + PhiVector PV; // Records Reduction and FirstOrderRecurrence header Phis. + widenPHIInstruction(&I, UF, VF, &PV); + break; + } // End of PHI. + case Instruction::UDiv: + case Instruction::SDiv: + case Instruction::SRem: + case Instruction::URem: + case Instruction::Add: + case Instruction::FAdd: + case Instruction::Sub: + case Instruction::FSub: + case Instruction::Mul: + case Instruction::FMul: + case Instruction::FDiv: + case Instruction::FRem: + case Instruction::Shl: + case Instruction::LShr: + case Instruction::AShr: + case Instruction::And: + case Instruction::Or: + case Instruction::Xor: { + // Just widen binops. + auto *BinOp = cast(&I); + setDebugLocFromInst(Builder, BinOp); + const VectorParts &A = getVectorValue(BinOp->getOperand(0)); + const VectorParts &B = getVectorValue(BinOp->getOperand(1)); - if (BinaryOperator *VecOp = dyn_cast(V)) - VecOp->copyIRFlags(BinOp); + // Use this vector value for all users of the original instruction. + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Value *V = Builder.CreateBinOp(BinOp->getOpcode(), A[Part], B[Part]); - Entry[Part] = V; - } + if (BinaryOperator *VecOp = dyn_cast(V)) + VecOp->copyIRFlags(BinOp); - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, BinOp); - break; + Entry[Part] = V; } - case Instruction::Select: { - // Widen selects. - // If the selector is loop invariant we can create a select - // instruction with a scalar condition. Otherwise, use vector-select. - auto *SE = PSE.getSE(); - bool InvariantCond = - SE->isLoopInvariant(PSE.getSCEV(I.getOperand(0)), OrigLoop); - setDebugLocFromInst(Builder, &I); - - // The condition can be loop invariant but still defined inside the - // loop. This means that we can't just use the original 'cond' value. 
- // We have to take the 'vectorized' value and pick the first lane. - // Instcombine will make this a no-op. - const VectorParts &Cond = getVectorValue(I.getOperand(0)); - const VectorParts &Op0 = getVectorValue(I.getOperand(1)); - const VectorParts &Op1 = getVectorValue(I.getOperand(2)); - - auto *ScalarCond = getScalarValue(I.getOperand(0), 0, 0); - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Entry[Part] = Builder.CreateSelect( - InvariantCond ? ScalarCond : Cond[Part], Op0[Part], Op1[Part]); - } + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, BinOp); + break; + } + case Instruction::Select: { + // Widen selects. + // If the selector is loop invariant we can create a select + // instruction with a scalar condition. Otherwise, use vector-select. + auto *SE = PSE.getSE(); + bool InvariantCond = + SE->isLoopInvariant(PSE.getSCEV(I.getOperand(0)), OrigLoop); + setDebugLocFromInst(Builder, &I); + + // The condition can be loop invariant but still defined inside the + // loop. This means that we can't just use the original 'cond' value. + // We have to take the 'vectorized' value and pick the first lane. + // Instcombine will make this a no-op. + const VectorParts &Cond = getVectorValue(I.getOperand(0)); + const VectorParts &Op0 = getVectorValue(I.getOperand(1)); + const VectorParts &Op1 = getVectorValue(I.getOperand(2)); + + auto *ScalarCond = getScalarValue(I.getOperand(0), 0, 0); - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, &I); - break; + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Entry[Part] = Builder.CreateSelect( + InvariantCond ? ScalarCond : Cond[Part], Op0[Part], Op1[Part]); } - case Instruction::ICmp: - case Instruction::FCmp: { - // Widen compares. Generate vector compares. - bool FCmp = (I.getOpcode() == Instruction::FCmp); - auto *Cmp = dyn_cast(&I); - setDebugLocFromInst(Builder, Cmp); - const VectorParts &A = getVectorValue(Cmp->getOperand(0)); - const VectorParts &B = getVectorValue(Cmp->getOperand(1)); - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Value *C = nullptr; - if (FCmp) { - C = Builder.CreateFCmp(Cmp->getPredicate(), A[Part], B[Part]); - cast(C)->copyFastMathFlags(Cmp); - } else { - C = Builder.CreateICmp(Cmp->getPredicate(), A[Part], B[Part]); - } - Entry[Part] = C; - } + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, &I); + break; + } - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, &I); - break; + case Instruction::ICmp: + case Instruction::FCmp: { + // Widen compares. Generate vector compares. 
+ bool FCmp = (I.getOpcode() == Instruction::FCmp); + auto *Cmp = dyn_cast(&I); + setDebugLocFromInst(Builder, Cmp); + const VectorParts &A = getVectorValue(Cmp->getOperand(0)); + const VectorParts &B = getVectorValue(Cmp->getOperand(1)); + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Value *C = nullptr; + if (FCmp) { + C = Builder.CreateFCmp(Cmp->getPredicate(), A[Part], B[Part]); + cast(C)->copyFastMathFlags(Cmp); + } else { + C = Builder.CreateICmp(Cmp->getPredicate(), A[Part], B[Part]); + } + Entry[Part] = C; } - case Instruction::Store: - case Instruction::Load: - vectorizeMemoryInstruction(&I); - break; - case Instruction::ZExt: - case Instruction::SExt: - case Instruction::FPToUI: - case Instruction::FPToSI: - case Instruction::FPExt: - case Instruction::PtrToInt: - case Instruction::IntToPtr: - case Instruction::SIToFP: - case Instruction::UIToFP: - case Instruction::Trunc: - case Instruction::FPTrunc: - case Instruction::BitCast: { - auto *CI = dyn_cast(&I); - setDebugLocFromInst(Builder, CI); + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, &I); + break; + } - // Optimize the special case where the source is a constant integer - // induction variable. Notice that we can only optimize the 'trunc' case - // because (a) FP conversions lose precision, (b) sext/zext may wrap, and - // (c) other casts depend on pointer size. - auto ID = Legal->getInductionVars()->lookup(OldInduction); - if (isa(CI) && CI->getOperand(0) == OldInduction && - ID.getConstIntStepValue()) { - widenIntInduction(OldInduction, cast(CI)); - break; - } - - /// Vectorize casts. - Type *DestTy = - (VF == 1) ? CI->getType() : VectorType::get(CI->getType(), VF); + case Instruction::Store: + case Instruction::Load: + vectorizeMemoryInstruction(&I); + break; + case Instruction::ZExt: + case Instruction::SExt: + case Instruction::FPToUI: + case Instruction::FPToSI: + case Instruction::FPExt: + case Instruction::PtrToInt: + case Instruction::IntToPtr: + case Instruction::SIToFP: + case Instruction::UIToFP: + case Instruction::Trunc: + case Instruction::FPTrunc: + case Instruction::BitCast: { + auto *CI = dyn_cast(&I); + setDebugLocFromInst(Builder, CI); - const VectorParts &A = getVectorValue(CI->getOperand(0)); - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) - Entry[Part] = Builder.CreateCast(CI->getOpcode(), A[Part], DestTy); - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, &I); - break; - } + /// Vectorize casts. + Type *DestTy = + (VF == 1) ? CI->getType() : VectorType::get(CI->getType(), VF); - case Instruction::Call: { - // Ignore dbg intrinsics. - if (isa(I)) - break; - setDebugLocFromInst(Builder, &I); - - Module *M = BB->getParent()->getParent(); - auto *CI = cast(&I); - - StringRef FnName = CI->getCalledFunction()->getName(); - Function *F = CI->getCalledFunction(); - Type *RetTy = ToVectorTy(CI->getType(), VF); - SmallVector Tys; - for (Value *ArgOperand : CI->arg_operands()) - Tys.push_back(ToVectorTy(ArgOperand->getType(), VF)); - - Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); - if (ID && (ID == Intrinsic::assume || ID == Intrinsic::lifetime_end || - ID == Intrinsic::lifetime_start)) { - scalarizeInstruction(&I); - break; - } - // The flag shows whether we use Intrinsic or a usual Call for vectorized - // version of the instruction. - // Is it beneficial to perform intrinsic call compared to lib call? 
- bool NeedToScalarize; - unsigned CallCost = getVectorCallCost(CI, VF, *TTI, TLI, NeedToScalarize); - bool UseVectorIntrinsic = - ID && getVectorIntrinsicCost(CI, VF, *TTI, TLI) <= CallCost; - if (!UseVectorIntrinsic && NeedToScalarize) { - scalarizeInstruction(&I); - break; - } + const VectorParts &A = getVectorValue(CI->getOperand(0)); + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) + Entry[Part] = Builder.CreateCast(CI->getOpcode(), A[Part], DestTy); + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, &I); + break; + } - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - SmallVector Args; - for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) { - Value *Arg = CI->getArgOperand(i); - // Some intrinsics have a scalar argument - don't replace it with a - // vector. - if (!UseVectorIntrinsic || !hasVectorInstrinsicScalarOpd(ID, i)) { - const VectorParts &VectorArg = getVectorValue(CI->getArgOperand(i)); - Arg = VectorArg[Part]; - } - Args.push_back(Arg); + case Instruction::Call: { + // Ignore dbg intrinsics. + if (isa(I)) + break; + setDebugLocFromInst(Builder, &I); + + Module *M = I.getParent()->getParent()->getParent(); + auto *CI = cast(&I); + + StringRef FnName = CI->getCalledFunction()->getName(); + Function *F = CI->getCalledFunction(); + Type *RetTy = ToVectorTy(CI->getType(), VF); + SmallVector Tys; + for (Value *ArgOperand : CI->arg_operands()) + Tys.push_back(ToVectorTy(ArgOperand->getType(), VF)); + + Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); + bool NeedToScalarize; // Redundant, needed for UseVectorIntrinsic. + unsigned CallCost = getVectorCallCost(CI, VF, *TTI, TLI, NeedToScalarize); + bool UseVectorIntrinsic = + ID && getVectorIntrinsicCost(CI, VF, *TTI, TLI) <= CallCost; + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + SmallVector Args; + for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) { + Value *Arg = CI->getArgOperand(i); + // Some intrinsics have a scalar argument - don't replace it with a + // vector. + if (!UseVectorIntrinsic || !hasVectorInstrinsicScalarOpd(ID, i)) { + const VectorParts &VectorArg = getVectorValue(CI->getArgOperand(i)); + Arg = VectorArg[Part]; } + Args.push_back(Arg); + } - Function *VectorF; - if (UseVectorIntrinsic) { - // Use vector version of the intrinsic. - Type *TysForDecl[] = {CI->getType()}; - if (VF > 1) - TysForDecl[0] = VectorType::get(CI->getType()->getScalarType(), VF); - VectorF = Intrinsic::getDeclaration(M, ID, TysForDecl); - } else { - // Use vector version of the library call. - StringRef VFnName = TLI->getVectorizedFunction(FnName, VF); - assert(!VFnName.empty() && "Vector function name is empty."); - VectorF = M->getFunction(VFnName); - if (!VectorF) { - // Generate a declaration - FunctionType *FTy = FunctionType::get(RetTy, Tys, false); - VectorF = - Function::Create(FTy, Function::ExternalLinkage, VFnName, M); - VectorF->copyAttributesFrom(F); - } + Function *VectorF; + if (UseVectorIntrinsic) { + // Use vector version of the intrinsic. + Type *TysForDecl[] = {CI->getType()}; + if (VF > 1) + TysForDecl[0] = VectorType::get(CI->getType()->getScalarType(), VF); + VectorF = Intrinsic::getDeclaration(M, ID, TysForDecl); + } else { + // Use vector version of the library call. 
+ StringRef VFnName = TLI->getVectorizedFunction(FnName, VF); + assert(!VFnName.empty() && "Vector function name is empty."); + VectorF = M->getFunction(VFnName); + if (!VectorF) { + // Generate a declaration + FunctionType *FTy = FunctionType::get(RetTy, Tys, false); + VectorF = + Function::Create(FTy, Function::ExternalLinkage, VFnName, M); + VectorF->copyAttributesFrom(F); } - assert(VectorF && "Can't create vector function."); - - SmallVector OpBundles; - CI->getOperandBundlesAsDefs(OpBundles); - CallInst *V = Builder.CreateCall(VectorF, Args, OpBundles); + } + assert(VectorF && "Can't create vector function."); - if (isa(V)) - V->copyFastMathFlags(CI); + SmallVector OpBundles; + CI->getOperandBundlesAsDefs(OpBundles); + CallInst *V = Builder.CreateCall(VectorF, Args, OpBundles); - Entry[Part] = V; - } + if (isa(V)) + V->copyFastMathFlags(CI); - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, &I); - break; + Entry[Part] = V; } - default: - // All other instructions are unsupported. Scalarize them. - scalarizeInstruction(&I); - break; - } // end of switch. - } // end of for_each instr. + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, &I); + break; + } + + default: + // All other instructions are scalarized. + DEBUG(dbgs() << "LV: Found an unhandled instruction: " << I); + llvm_unreachable("Unhandled instruction!"); + } // end of switch. } void InnerLoopVectorizer::updateAnalysis() { @@ -4976,15 +5501,13 @@ assert(DT->properlyDominates(LoopBypassBlocks.front(), LoopExitBlock) && "Entry does not dominate exit."); - // We don't predicate stores by this point, so the vector body should be a - // single loop. - DT->addNewBlock(LoopVectorBody, LoopVectorPreHeader); - - DT->addNewBlock(LoopMiddleBlock, LoopVectorBody); + if (!DT->getNode(LoopVectorBody)) // For InnerLoopUnroller. + DT->addNewBlock(LoopVectorBody, LoopVectorPreHeader); + auto *LoopVectorLatch = LI->getLoopFor(LoopVectorBody)->getLoopLatch(); + DT->addNewBlock(LoopMiddleBlock, LoopVectorLatch); DT->addNewBlock(LoopScalarPreHeader, LoopBypassBlocks[0]); DT->changeImmediateDominator(LoopScalarBody, LoopScalarPreHeader); DT->changeImmediateDominator(LoopExitBlock, LoopBypassBlocks[0]); - DEBUG(DT->verifyDomTree()); } @@ -6075,10 +6598,7 @@ } } -LoopVectorizationCostModel::VectorizationFactor -LoopVectorizationCostModel::selectVectorizationFactor(bool OptForSize) { - // Width 1 means no vectorize - VectorizationFactor Factor = {1U, 0U}; +bool LoopVectorizationCostModel::canVectorize(bool OptForSize) { if (OptForSize && Legal->getRuntimePointerChecking()->Need) { ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize") << "runtime pointer checks needed. Enable vectorization of this " @@ -6086,16 +6606,35 @@ "compiling with -Os/-Oz"); DEBUG(dbgs() << "LV: Aborting. Runtime ptr check is required with -Os/-Oz.\n"); - return Factor; + return false; } if (!EnableCondStoresVectorization && Legal->getNumPredStores()) { ORE->emit(createMissedAnalysis("ConditionalStore") << "store that is conditionally executed prevents vectorization"); DEBUG(dbgs() << "LV: No vectorization. There are conditional stores.\n"); - return Factor; + return false; } + // If we optimize the program for size, avoid creating the tail loop. + if (OptForSize) { + unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop); + DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n'); + + // If we don't know the precise trip count, don't try to vectorize. 
+ if (TC < 2) { + ORE->emit( + createMissedAnalysis("UnknownLoopCountComplexCFG") + << "unable to calculate the loop count due to complex control flow"); + DEBUG(dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n"); + return false; + } + } + return true; +} + +unsigned +LoopVectorizationCostModel::computeMaxVectorizationFactor(bool OptForSize) { MinBWs = computeMinimumValueSizes(TheLoop->getBlocks(), *DB, &TTI); unsigned SmallestType, WidestType; std::tie(SmallestType, WidestType) = getSmallestAndWidestTypes(); @@ -6130,6 +6669,7 @@ " into one vector!"); unsigned VF = MaxVectorSize; + if (MaximizeBandwidth && !OptForSize) { // Collect all viable vectorization factors. SmallVector VFs; @@ -6150,48 +6690,35 @@ } } } + return VF; +} - // If we optimize the program for size, avoid creating the tail loop. - if (OptForSize) { - unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop); - DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n'); - - // If we don't know the precise trip count, don't try to vectorize. - if (TC < 2) { - ORE->emit( - createMissedAnalysis("UnknownLoopCountComplexCFG") - << "unable to calculate the loop count due to complex control flow"); - DEBUG(dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n"); - return Factor; - } +bool LoopVectorizationCostModel::requiresTail(unsigned MaxVectorSize) { + unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop); + DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n'); - // Find the maximum SIMD width that can fit within the trip count. - VF = TC % MaxVectorSize; + // Find the maximum SIMD width that can fit within the trip count. + unsigned VF = TC % MaxVectorSize; - if (VF == 0) - VF = MaxVectorSize; - else { - // If the trip count that we found modulo the vectorization factor is not - // zero then we require a tail. - ORE->emit(createMissedAnalysis("NoTailLoopWithOptForSize") - << "cannot optimize for size and vectorize at the " - "same time. Enable vectorization of this loop " - "with '#pragma clang loop vectorize(enable)' " - "when compiling with -Os/-Oz"); - DEBUG(dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n"); - return Factor; - } - } + if (VF == 0) + return false; - int UserVF = Hints->getWidth(); - if (UserVF != 0) { - assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two"); - DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n"); + // If the trip count that we found modulo the vectorization factor is not + // zero then we require a tail. + ORE->emit(createMissedAnalysis("NoTailLoopWithOptForSize") + << "cannot optimize for size and vectorize at the " + "same time. Enable vectorization of this loop " + "with '#pragma clang loop vectorize(enable)' " + "when compiling with -Os/-Oz"); + DEBUG(dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n"); + return true; +} - Factor.Width = UserVF; - collectInstsToScalarize(UserVF); - return Factor; - } +LoopVectorizationCostModel::VectorizationFactor +LoopVectorizationCostModel::selectVectorizationFactor(bool OptForSize, + unsigned VF) { + // Width 1 means no vectorize + VectorizationFactor Factor = {1U, 0U}; float Cost = expectedCost(1).first; #ifndef NDEBUG @@ -6598,11 +7125,14 @@ void LoopVectorizationCostModel::collectInstsToScalarize(unsigned VF) { - // If we aren't vectorizing the loop, or if we've already collected the + // Function should not be called for the scalar case. 
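+  // (Callers only query scalarization candidates for VF >= 2; see
+  // LoopVectorizationPlanner::plan below.)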
+  assert(VF >= 2 && "Function called for the scalar loop");
+
+  // If we've already collected the
   // instructions to scalarize, there's nothing to do. Collection may already
   // have occurred if we have a user-selected VF and are now computing the
   // expected cost for interleaving.
-  if (VF < 2 || InstsToScalarize.count(VF))
+  if (InstsToScalarize.count(VF))
     return;

   // Initialize a mapping for VF in InstsToScalarize. If we find that it's
@@ -6746,10 +7276,6 @@
 LoopVectorizationCostModel::expectedCost(unsigned VF) {
   VectorizationCostTy Cost;

-  // Collect the instructions (and their associated costs) that will be more
-  // profitable to scalarize.
-  collectInstsToScalarize(VF);
-
   // For each block.
   for (BasicBlock *BB : TheLoop->blocks()) {
     VectorizationCostTy BlockCost;
@@ -7000,218 +7526,1337 @@
         TTI.getShuffleCost(TargetTransformInfo::SK_Broadcast, ValTy);
     }

-    // For an interleaved access, calculate the total cost of the whole
-    // interleave group.
-    if (Legal->isAccessInterleaved(I)) {
-      auto Group = Legal->getInterleavedAccessGroup(I);
-      assert(Group && "Fail to get an interleaved access group.");
+    // For an interleaved access, calculate the total cost of the whole
+    // interleave group.
+    if (Legal->isAccessInterleaved(I)) {
+      auto Group = Legal->getInterleavedAccessGroup(I);
+      assert(Group && "Fail to get an interleaved access group.");
+
+      // Only calculate the cost once at the insert position.
+      if (Group->getInsertPos() != I)
+        return 0;
+
+      unsigned InterleaveFactor = Group->getFactor();
+      Type *WideVecTy =
+          VectorType::get(VectorTy->getVectorElementType(),
+                          VectorTy->getVectorNumElements() * InterleaveFactor);
+
+      // Holds the indices of existing members in an interleaved load group.
+      // An interleaved store group doesn't need this as it doesn't allow gaps.
+      SmallVector<unsigned, 4> Indices;
+      if (LI) {
+        for (unsigned i = 0; i < InterleaveFactor; i++)
+          if (Group->getMember(i))
+            Indices.push_back(i);
+      }
+
+      // Calculate the cost of the whole interleaved group.
+      unsigned Cost = TTI.getInterleavedMemoryOpCost(
+          I->getOpcode(), WideVecTy, Group->getFactor(), Indices,
+          Group->getAlignment(), AS);
+
+      if (Group->isReverse())
+        Cost +=
+            Group->getNumMembers() *
+            TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0);
+
+      // FIXME: The interleaved load group with a huge gap could be even more
+      // expensive than scalar operations. Then we could ignore such group and
+      // use scalar operations instead.
+      return Cost;
+    }
+
+    // Check if the memory instruction will be scalarized.
+    if (Legal->memoryInstructionMustBeScalarized(I, VF)) {
+      unsigned Cost = 0;
+      Type *PtrTy = ToVectorTy(Ptr->getType(), VF);
+
+      // Figure out whether the access is strided and get the stride value
+      // if it's known at compile time.
+      const SCEV *PtrSCEV = getAddressAccessSCEV(Ptr, Legal, SE, TheLoop);
+
+      // Get the cost of the scalar memory instruction and address computation.
+      Cost += VF * TTI.getAddressComputationCost(PtrTy, SE, PtrSCEV);
+      Cost += VF *
+              TTI.getMemoryOpCost(I->getOpcode(), ValTy->getScalarType(),
+                                  Alignment, AS);
+
+      // Get the overhead of the extractelement and insertelement instructions
+      // we might create due to scalarization.
+      Cost += getScalarizationOverhead(I, VF, TTI);
+
+      // If we have a predicated store, it may not be executed for each vector
+      // lane. Scale the cost by the probability of executing the predicated
+      // block.
+ if (Legal->isScalarWithPredication(I)) + Cost /= getReciprocalPredBlockProb(); + + return Cost; + } + + // Determine if the pointer operand of the access is either consecutive or + // reverse consecutive. + int ConsecutiveStride = Legal->isConsecutivePtr(Ptr); + bool Reverse = ConsecutiveStride < 0; + + // Determine if either a gather or scatter operation is legal. + bool UseGatherOrScatter = + !ConsecutiveStride && Legal->isLegalGatherOrScatter(I); + + unsigned Cost = TTI.getAddressComputationCost(VectorTy); + if (UseGatherOrScatter) { + assert(ConsecutiveStride == 0 && + "Gather/Scatter are not used for consecutive stride"); + return Cost + + TTI.getGatherScatterOpCost(I->getOpcode(), VectorTy, Ptr, + Legal->isMaskRequired(I), Alignment); + } + // Wide load/stores. + if (Legal->isMaskRequired(I)) + Cost += + TTI.getMaskedMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS); + else + Cost += TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS); + + if (Reverse) + Cost += TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0); + return Cost; + } + case Instruction::ZExt: + case Instruction::SExt: + case Instruction::FPToUI: + case Instruction::FPToSI: + case Instruction::FPExt: + case Instruction::PtrToInt: + case Instruction::IntToPtr: + case Instruction::SIToFP: + case Instruction::UIToFP: + case Instruction::Trunc: + case Instruction::FPTrunc: + case Instruction::BitCast: { + // We optimize the truncation of induction variable. + // The cost of these is the same as the scalar operation. + if (I->getOpcode() == Instruction::Trunc && + Legal->isInductionVariable(I->getOperand(0))) + return TTI.getCastInstrCost(I->getOpcode(), I->getType(), + I->getOperand(0)->getType()); + + Type *SrcScalarTy = I->getOperand(0)->getType(); + Type *SrcVecTy = ToVectorTy(SrcScalarTy, VF); + if (canTruncateToMinimalBitwidth(I, VF)) { + // This cast is going to be shrunk. This may remove the cast or it might + // turn it into slightly different cast. For example, if MinBW == 16, + // "zext i8 %1 to i32" becomes "zext i8 %1 to i16". + // + // Calculate the modified src and dest types. + Type *MinVecTy = VectorTy; + if (I->getOpcode() == Instruction::Trunc) { + SrcVecTy = smallestIntegerVectorType(SrcVecTy, MinVecTy); + VectorTy = + largestIntegerVectorType(ToVectorTy(I->getType(), VF), MinVecTy); + } else if (I->getOpcode() == Instruction::ZExt || + I->getOpcode() == Instruction::SExt) { + SrcVecTy = largestIntegerVectorType(SrcVecTy, MinVecTy); + VectorTy = + smallestIntegerVectorType(ToVectorTy(I->getType(), VF), MinVecTy); + } + } + + return TTI.getCastInstrCost(I->getOpcode(), VectorTy, SrcVecTy); + } + case Instruction::Call: { + bool NeedToScalarize; + CallInst *CI = cast(I); + unsigned CallCost = getVectorCallCost(CI, VF, TTI, TLI, NeedToScalarize); + if (getVectorIntrinsicIDForCall(CI, TLI)) + return std::min(CallCost, getVectorIntrinsicCost(CI, VF, TTI, TLI)); + return CallCost; + } + default: + // The cost of executing VF copies of the scalar instruction. This opcode + // is unknown. Assume that it is the same as 'mul'. + return VF * TTI.getArithmeticInstrCost(Instruction::Mul, VectorTy) + + getScalarizationOverhead(I, VF, TTI); + } // end of switch. 
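+  // (For example, with VF = 4 an unknown opcode is costed above as four scalar
+  // multiplies plus the extract/insert overhead of scalarizing its operands
+  // and result.)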
+} + +char LoopVectorize::ID = 0; +static const char lv_name[] = "Loop Vectorization"; +INITIALIZE_PASS_BEGIN(LoopVectorize, LV_NAME, lv_name, false, false) +INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass) +INITIALIZE_PASS_DEPENDENCY(BasicAAWrapperPass) +INITIALIZE_PASS_DEPENDENCY(AAResultsWrapperPass) +INITIALIZE_PASS_DEPENDENCY(GlobalsAAWrapperPass) +INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker) +INITIALIZE_PASS_DEPENDENCY(BlockFrequencyInfoWrapperPass) +INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass) +INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass) +INITIALIZE_PASS_DEPENDENCY(LCSSAWrapperPass) +INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass) +INITIALIZE_PASS_DEPENDENCY(LoopSimplify) +INITIALIZE_PASS_DEPENDENCY(LoopAccessLegacyAnalysis) +INITIALIZE_PASS_DEPENDENCY(DemandedBitsWrapperPass) +INITIALIZE_PASS_DEPENDENCY(OptimizationRemarkEmitterWrapperPass) +INITIALIZE_PASS_END(LoopVectorize, LV_NAME, lv_name, false, false) + +namespace llvm { +Pass *createLoopVectorizePass(bool NoUnrolling, bool AlwaysVectorize) { + return new LoopVectorize(NoUnrolling, AlwaysVectorize); +} +} + +bool LoopVectorizationCostModel::isConsecutiveLoadOrStore(Instruction *Inst) { + + // Check if the pointer operand of a load or store instruction is + // consecutive. + if (auto *Ptr = getPointerOperand(Inst)) + return Legal->isConsecutivePtr(Ptr); + return false; +} + +void LoopVectorizationCostModel::collectValuesToIgnore() { + // Ignore ephemeral values. + CodeMetrics::collectEphemeralValues(TheLoop, AC, ValuesToIgnore); + + // Ignore type-promoting instructions we identified during reduction + // detection. + for (auto &Reduction : *Legal->getReductionVars()) { + RecurrenceDescriptor &RedDes = Reduction.second; + SmallPtrSetImpl &Casts = RedDes.getCastInsts(); + VecValuesToIgnore.insert(Casts.begin(), Casts.end()); + } + + // Insert values known to be scalar into VecValuesToIgnore. This is a + // conservative estimation of the values that will later be scalarized. + // + // FIXME: Even though an instruction is not scalar-after-vectoriztion, it may + // still be scalarized. For example, we may find an instruction to be + // more profitable for a given vectorization factor if it were to be + // scalarized. But at this point, we haven't yet computed the + // vectorization factor. + for (auto *BB : TheLoop->getBlocks()) + for (auto &I : *BB) + if (Legal->isScalarAfterVectorization(&I)) + VecValuesToIgnore.insert(&I); +} + +LoopVectorizationCostModel::VectorizationFactor +LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF, + unsigned MaxVF) { + if (UserVF) { + DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n"); + if (UserVF == 1) + return {UserVF, 0}; + assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two"); + // Collect the instructions (and their associated costs) that will be more + // profitable to scalarize. + CM->collectInstsToScalarize(UserVF); + buildInitialVPlans(UserVF, UserVF); + DEBUG(printCurrentPlans("Initial VPlans", dbgs())); + optimizePredicatedInstructions(); + DEBUG(printCurrentPlans("After optimize predicated instructions", dbgs())); + return {UserVF, 0}; + } + if (MaxVF == 1) + return {1, 0}; + + assert(MaxVF > 1 && "MaxVF is zero."); + // Collect the instructions (and their associated costs) that will be more + // profitable to scalarize. 
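+  // Candidate VFs are the powers of two from 2 up to and including MaxVF.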
+  for (unsigned i = 2; i <= MaxVF; i *= 2)
+    CM->collectInstsToScalarize(i);
+  buildInitialVPlans(2, MaxVF);
+  DEBUG(printCurrentPlans("Initial VPlans", dbgs()));
+  optimizePredicatedInstructions();
+  DEBUG(printCurrentPlans("After optimize predicated instructions", dbgs()));
+  // Select the optimal vectorization factor.
+  return CM->selectVectorizationFactor(OptForSize, MaxVF);
+}
+
+void LoopVectorizationPlanner::printCurrentPlans(const std::string &Title,
+                                                 raw_ostream &O) {
+  auto printPlan = [&](VPlan *Plan, const SmallVectorImpl<unsigned> &VFs,
+                       const std::string &Prefix) {
+    std::string Title;
+    raw_string_ostream RSO(Title);
+    RSO << Prefix << " for VF=";
+    if (VFs.size() == 1)
+      RSO << VFs[0];
+    else {
+      RSO << "{";
+      bool First = true;
+      for (unsigned VF : VFs) {
+        if (!First)
+          RSO << ",";
+        RSO << VF;
+        First = false;
+      }
+      RSO << "}";
+    }
+    VPlanPrinter PlanPrinter(O, *Plan);
+    PlanPrinter.dump(RSO.str());
+  };
+
+  if (VPlans.empty())
+    return;
+
+  VPlan *Current = VPlans.begin()->second.get();
+
+  SmallVector<unsigned, 8> VFs;
+  for (auto &Entry : VPlans) {
+    VPlan *Plan = Entry.second.get();
+    if (Plan != Current) {
+      // Hit another VPlan. Print the current VPlan for the VFs it served thus
+      // far and move on to the VPlan we just encountered.
+      printPlan(Current, VFs, Title);
+      Current = Plan;
+      VFs.clear();
+    }
+    // Add VF to the list of VFs served by current VPlan.
+    VFs.push_back(Entry.first);
+  }
+  // Print the current VPlan.
+  printPlan(Current, VFs, Title);
+}
+
+// Determine if a given instruction will remain scalar after vectorization,
+// for VF \p StartRangeVF. Reset \p EndRangeVF to the minimal VF where this
+// decision does not hold, if it's less than the given \p EndRangeVF.
+bool LoopVectorizationPlanner::willBeScalarized(Instruction *I,
+                                                unsigned StartRangeVF,
+                                                unsigned &EndRangeVF) {
+  if (!isa<CallInst>(I) && Legal->isScalarAfterVectorization(I))
+    return true;
+
+  if (isa<CallInst>(I)) {
+
+    auto *CI = cast<CallInst>(I);
+    Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);
+    if (ID && (ID == Intrinsic::assume || ID == Intrinsic::lifetime_end ||
+               ID == Intrinsic::lifetime_start))
+      return true;
+
+    // Calls may or may not be scalarized, depending on the VF: the decision
+    // is between using a vector intrinsic, a vector library call, or
+    // scalarizing. Check whether the decision is uniform across the VF range.
+    bool NeedToScalarize;
+    unsigned CallCost =
+        getVectorCallCost(CI, StartRangeVF, *TTI, TLI, NeedToScalarize);
+    bool UseVectorIntrinsic =
+        ID && getVectorIntrinsicCost(CI, StartRangeVF, *TTI, TLI) <= CallCost;
+    bool StartWillBeScalarized = !UseVectorIntrinsic && NeedToScalarize;
+
+    for (unsigned TmpVF = StartRangeVF * 2; TmpVF < EndRangeVF; TmpVF *= 2) {
+      bool NeedToScalarize;
+      unsigned CallCost =
+          getVectorCallCost(CI, TmpVF, *TTI, TLI, NeedToScalarize);
+      bool UseVectorIntrinsic =
+          ID && getVectorIntrinsicCost(CI, TmpVF, *TTI, TLI) <= CallCost;
+      bool TmpWillBeScalarized = !UseVectorIntrinsic && NeedToScalarize;
+      if (TmpWillBeScalarized != StartWillBeScalarized) {
+        EndRangeVF = TmpVF;
+        break;
+      }
+    }
+
+    return StartWillBeScalarized;
+  }
+
+  if (isa<LoadInst>(I) || isa<StoreInst>(I)) {
+
+    // TODO: refactor memoryInstructionMustBeScalarized() to invoke only the
+    // (last) part that depends on VF.
+    bool StartWillBeScalarized =
+        Legal->memoryInstructionMustBeScalarized(I, StartRangeVF);
+
+    for (unsigned TmpVF = StartRangeVF * 2; TmpVF < EndRangeVF; TmpVF *= 2) {
+      bool TmpWillBeScalarized =
+          Legal->memoryInstructionMustBeScalarized(I, TmpVF);
+
+      if (TmpWillBeScalarized != StartWillBeScalarized) {
+        EndRangeVF = TmpVF;
+        break;
+      }
+    }
+
+    return StartWillBeScalarized;
+  }
+
+  static DenseSet<unsigned> VectorizableOpcodes = {
+      Instruction::Br,       Instruction::PHI,      Instruction::UDiv,
+      Instruction::SDiv,     Instruction::SRem,     Instruction::URem,
+      Instruction::Add,      Instruction::FAdd,     Instruction::Sub,
+      Instruction::FSub,     Instruction::Mul,      Instruction::FMul,
+      Instruction::FDiv,     Instruction::FRem,     Instruction::Shl,
+      Instruction::LShr,     Instruction::AShr,     Instruction::And,
+      Instruction::Or,       Instruction::Xor,      Instruction::Select,
+      Instruction::ICmp,     Instruction::FCmp,     Instruction::Store,
+      Instruction::Load,     Instruction::ZExt,     Instruction::SExt,
+      Instruction::FPToUI,   Instruction::FPToSI,   Instruction::FPExt,
+      Instruction::PtrToInt, Instruction::IntToPtr, Instruction::SIToFP,
+      Instruction::UIToFP,   Instruction::Trunc,    Instruction::FPTrunc,
+      Instruction::BitCast,  Instruction::Call};
+
+  if (!VectorizableOpcodes.count(I->getOpcode()))
+    return true;
+
+  // Scalarize instructions found to be more profitable if scalarized. Limit
+  // EndRangeVF to the last VF this is continuously true for.
+  bool StartWillBeScalarized = CM->isProfitableToScalarize(I, StartRangeVF);
+
+  for (unsigned TmpVF = StartRangeVF * 2; TmpVF < EndRangeVF; TmpVF *= 2) {
+    bool TmpWillBeScalarized = CM->isProfitableToScalarize(I, TmpVF);
+    if (StartWillBeScalarized != TmpWillBeScalarized) {
+      EndRangeVF = TmpVF;
+      break;
+    }
+  }
+
+  return StartWillBeScalarized;
+}
+
+unsigned LoopVectorizationPlanner::buildInitialVPlans(unsigned MinVF,
+                                                      unsigned MaxVF) {
+  ILV->collectTriviallyDeadInstructions(TheLoop, Legal, DeadInstructions);
+
+  unsigned StartRangeVF = MinVF;
+  unsigned EndRangeVF = MaxVF + 1;
+
+  unsigned i = 0;
+  for (; StartRangeVF < EndRangeVF; ++i) {
+    std::shared_ptr<VPlan> Plan = buildInitialVPlan(StartRangeVF, EndRangeVF);
+
+    for (unsigned TmpVF = StartRangeVF; TmpVF < EndRangeVF; TmpVF *= 2)
+      VPlans[TmpVF] = Plan;
+
+    StartRangeVF = EndRangeVF;
+    EndRangeVF = MaxVF + 1;
+  }
+
+  return i;
+}
+
+std::shared_ptr<VPlan>
+LoopVectorizationPlanner::buildInitialVPlan(unsigned StartRangeVF,
+                                            unsigned &EndRangeVF) {
+  auto isInstructionToIgnore = [&](Instruction *I) -> bool {
+    if (DeadInstructions.count(I) || isa<BranchInst>(I) ||
+        isa<DbgInfoIntrinsic>(I))
+      return true;
+
+    const InterleaveGroup *IG = Legal->getInterleavedAccessGroup(I);
+    if (IG && I != IG->getInsertPos())
+      return true;
+
+    return false;
+  };
+
+  std::shared_ptr<VPlan> SharedPlan = std::make_shared<VPlan>();
+  VPlan *Plan = SharedPlan.get();
+  VPlanUtilsLoopVectorizer PlanUtils(Plan);
+
+  // Scan the body of the loop in a topological order to visit each basic block
+  // after having visited its predecessor basic blocks.
+  LoopBlocksDFS DFS(TheLoop);
+  DFS.perform(LI);
+
+  // Create a dummy entry VPBasicBlock to start building the VPlan.
+  VPBlockBase *PreviousVPBlock = PlanUtils.createBasicBlock();
+  VPBlockBase *PreEntry = PreviousVPBlock;
+  Plan->setEntry(PreEntry); // only to support printing during construction.
+
+  for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) {
+    BasicBlock::iterator I = BB->begin();
+    BasicBlock::iterator E = BB->end();
+
+    // Relevant instructions from basic block BB will be grouped into VPRecipe
+    // ingredients and fill a new VPBasicBlock.
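+    // A new VPBasicBlock is opened lazily, when the first recipe for BB is
+    // created; until then VPBB remains null.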
+ VPBasicBlock *VPBB = nullptr; + while (I != E) { + // Search for first live Instruction to open VPBB. + for (; I != E && isInstructionToIgnore(&*I); ++I) + ; + + if (I == E) + break; + + Instruction *Instr = &*I; + + // Check if first Instruction should open an interleaved group VPBB. + if (const InterleaveGroup *IG = Legal->getInterleavedAccessGroup(Instr)) { + // I points to the insert position (first load or last store) of an + // interleave group. Bump it once to look for the next recipe. + auto Recipe = new VPInterleaveRecipe(IG, Plan); + if (VPBB) + PlanUtils.appendRecipeToBasicBlock(Recipe, VPBB); + else { + VPBB = PlanUtils.createBasicBlock(Recipe); + PlanUtils.setSuccessor(PreviousVPBlock, VPBB); + PreviousVPBlock = VPBB; + } + ++I; + continue; + } + + if (Legal->isScalarWithPredication(Instr)) { + // Instructions marked for predication are scalarized and placed under + // an if-then construct to prevent side-effects. + DEBUG(dbgs() << "LV: Scalarizing and predicating:" << *Instr << '\n'); + + // Build the triangular if-then region. Start with VPBB holding Instr. + BasicBlock::iterator J = I; + VPRecipeBase *Recipe = new VPScalarizeOneByOneRecipe(I, ++J, Plan); + VPBB = PlanUtils.createBasicBlock(Recipe); + + // Build the entry and exit VPBB's of the triangle. + VPRegionBlock *Region = PlanUtils.createRegion(true); + VPExtractMaskBitRecipe *R = new VPExtractMaskBitRecipe(&*BB); + VPBasicBlock *Entry = PlanUtils.createBasicBlock(R); + Recipe = new VPMergeScalarizeBranchRecipe(Instr); + VPBasicBlock *Exit = PlanUtils.createBasicBlock(Recipe); + // Note: first set Entry as region entry and then connect successors + // starting from it in order, to propagate the "parent" of each + // VPBasicBlock. + PlanUtils.setRegionEntry(Region, Entry); + PlanUtils.setRegionExit(Region, Exit); + PlanUtils.setTwoSuccessors(Entry, R, VPBB, Exit); + PlanUtils.setSuccessor(VPBB, Exit); + PlanUtils.setSuccessor(PreviousVPBlock, Region); + PreviousVPBlock = Region; + + // Next instructions should start forming a VPBasicBlock of their own. + VPBB = nullptr; + + // Record predicated instructions for the later optimizations. + PredicatedInstructions.insert(&*I); + + ++I; + continue; + } + + // Check if this is an integer induction. If so, build the recipes that + // produce its scalar and vector values. + + auto widenIntInduction = [&](PHINode *IV, + TruncInst *Trunc = nullptr) -> void { + // The value from the original loop to which we are mapping the new + // induction variable. + Instruction *EntryVal = Trunc ? cast(Trunc) : IV; + bool NeedsScalarIV = needsScalarInduction(IV, StartRangeVF, EndRangeVF); + auto *WIIRecipe = + new VPWidenIntInductionRecipe(NeedsScalarIV, IV, Trunc); + if (VPBB) + PlanUtils.appendRecipeToBasicBlock(WIIRecipe, VPBB); + else { + VPBB = PlanUtils.createBasicBlock(WIIRecipe); + PlanUtils.setSuccessor(PreviousVPBlock, VPBB); + PreviousVPBlock = VPBB; + } + // Determine if we want a scalar version of the induction variable. This + // is true if the induction variable itself is not widened, or if it has + // at least one user in the loop that is not widened. + if (NeedsScalarIV) { + // Create scalar steps that can be used by instructions we will later + // scalarize. Note that the addition of the scalar steps will not + // increase the number of instructions in the loop in the common case + // prior to InstCombine. We will be trading one vector extract for + // each scalar step. 
+          auto *BSSRecipe =
+              new VPBuildScalarStepsRecipe(WIIRecipe, EntryVal, Plan);
+          // Determine the number of scalars we need to generate for each
+          // unroll iteration. If EntryVal is uniform, we only need to generate
+          // the first lane. Otherwise, we generate all VF values.
+          if (Legal->isUniformAfterVectorization(EntryVal))
+            PlanUtils.designateLaneZero(BSSRecipe);
+          PlanUtils.appendRecipeToBasicBlock(BSSRecipe, VPBB);
+        }
+      };
+
+      // Handle the integer induction.
+      if (PHINode *Phi = dyn_cast<PHINode>(Instr)) {
+        InductionDescriptor II = Legal->getInductionVars()->lookup(Phi);
+        if (II.getKind() == InductionDescriptor::IK_IntInduction) {
+          widenIntInduction(Phi);
+          ++I;
+          continue;
+        }
+      }
+
+      auto isOptimizableTrunc = [&](Instruction *I) -> bool {
+        if (!isa<TruncInst>(I))
+          return false;
+        PHINode *OldInduction = Legal->getInduction();
+        if (I->getOperand(0) != OldInduction)
+          return false;
+        auto ID = Legal->getInductionVars()->lookup(OldInduction);
+        return ID.getConstIntStepValue();
+      };
+
+      // Optimize the special case where the source is a constant integer
+      // induction variable. Notice that we can only optimize the 'trunc' case
+      // because (a) FP conversions lose precision, (b) sext/zext may wrap, and
+      // (c) other casts depend on pointer size.
+      if (isOptimizableTrunc(Instr)) {
+        widenIntInduction(Legal->getInduction(), cast<TruncInst>(Instr));
+        ++I;
+        continue;
+      }
+
+      // Check if first instruction is to be replicated, and search for last
+      // similar instruction in sequence.
+      bool Scalarized = willBeScalarized(Instr, StartRangeVF, EndRangeVF);
+      DEBUG(if (Scalarized) dbgs() << "LV: Scalarizing:" << *Instr << "\n");
+
+      BasicBlock::iterator J = I;
+      for (++J; J != E; ++J) {
+        Instruction *Instr = &*J;
+        if (isInstructionToIgnore(Instr))
+          break; // Sequence of instructions not to ignore ended.
+        if (Legal->getInterleavedAccessGroup(Instr))
+          break; // Instr should open an interleaved group VPBB.
+        if (Legal->isScalarWithPredication(Instr))
+          break; // This should open a separate block (region).
+        if (isOptimizableTrunc(Instr))
+          break; // This will be handled with integer induction recipes.
+
+        bool AlsoScalarized = willBeScalarized(Instr, StartRangeVF, EndRangeVF);
+        DEBUG(if (Scalarized && AlsoScalarized) dbgs()
+              << "LV: Scalarizing:" << *Instr << "\n");
+        if (Scalarized != AlsoScalarized)
+          break;
+      }
+      VPRecipeBase *Recipe =
+          PlanUtils.createOneByOneRecipe(I, J, Plan, Scalarized);
+      if (VPBB)
+        PlanUtils.appendRecipeToBasicBlock(Recipe, VPBB);
+      else {
+        VPBB = PlanUtils.createBasicBlock(Recipe);
+        PlanUtils.setSuccessor(PreviousVPBlock, VPBB);
+        PreviousVPBlock = VPBB;
+      }
+      I = J;
+    }
+  }
+  // PreviousVPBlock now holds the exit block of Plan.
+  // Set entry block of Plan to the successor of PreEntry, and discard PreEntry.
+  assert(PreEntry->getSuccessors().size() == 1 && "Plan has no single entry.");
+  VPBlockBase *Entry = PreEntry->getSuccessors().front();
+  PlanUtils.disconnectBlocks(PreEntry, Entry);
+  Plan->setEntry(Entry);
+  delete PreEntry;
+
+  // FOR STRESS TESTING, uncomment the following:
+  // EndRangeVF = StartRangeVF * 2;
+
+  return SharedPlan;
+}
+
+void LoopVectorizationPlanner::sinkScalarOperands(Instruction *PredInst,
+                                                  VPlan *Plan) {
+  VPlanUtilsLoopVectorizer PlanUtils(Plan);
+
+  // The recipe containing the predicated instruction.
+  VPBasicBlock *PredBB = Plan->getBasicBlock(PredInst);
+
+  // Initialize a worklist with the operands of the predicated instruction.
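+  // (Operands defined outside the loop are live-ins with no recipe in the
+  // VPlan; the traversal below simply skips them.)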
+  SetVector<Value *> Worklist(PredInst->op_begin(), PredInst->op_end());
+
+  // Holds instructions that we need to analyze again. An instruction may be
+  // reanalyzed if we don't yet know if we can sink it or not.
+  SmallVector<Instruction *, 8> InstsToReanalyze;
+
+  // Iteratively sink the scalarized operands of the predicated instruction
+  // into the block we created for it. When an instruction is sunk, its
+  // operands are then added to the worklist. The algorithm ends when a full
+  // pass over the worklist sinks no instruction.
+  bool Changed;
+  do {
+
+    // Add the instructions that need to be reanalyzed to the worklist, and
+    // reset the changed indicator.
+    Worklist.insert(InstsToReanalyze.begin(), InstsToReanalyze.end());
+    InstsToReanalyze.clear();
+    Changed = false;
+
+    while (!Worklist.empty()) {
+      auto *I = dyn_cast<Instruction>(Worklist.pop_back_val());
+      if (!I)
+        continue;
+
+      // We do not sink other predicated instructions.
+      if (Legal->isScalarWithPredication(I))
+        continue;
+
+      VPRecipeBase *Recipe = Plan->getRecipe(I);
+
+      // We can't sink live-ins.
+      if (!Recipe)
+        continue;
+      VPBasicBlock *VPBB = Recipe->getParent();
+      assert(VPBB && "Recipe not in any basic block");
+
+      // We can't sink an instruction that isn't being scalarized.
+      if (!isa<VPScalarizeOneByOneRecipe>(Recipe) &&
+          !isa<VPBuildScalarStepsRecipe>(Recipe))
+        continue;
+
+      // We can't sink an instruction if it is already in the predicated block,
+      // is not in the VPlan, or may have side effects.
+      if (VPBB == PredBB || I->mayHaveSideEffects())
+        continue;
+
+      // Handle phi nodes last to make sure that any user they may have has
+      // sunk by now. This is relevant for induction variables that feed
+      // uniform GEPs which may or may not sink.
+      if (isa<PHINode>(I)) {
+        auto IsNotAPhi = [&](Value *V) -> bool { return !isa<PHINode>(V); };
+        if (any_of(Worklist, IsNotAPhi) ||
+            any_of(InstsToReanalyze, IsNotAPhi)) {
+          InstsToReanalyze.push_back(I);
+          continue;
+        }
+      }
+
+      bool HasVectorizedUses = false;
+      bool AllScalarizedUsesInPredicatedBlock = true;
+      unsigned MinLaneToSink = 0;
+      for (auto &U : I->uses()) {
+        auto *UI = cast<Instruction>(U.getUser());
+        VPRecipeBase *UserRecipe = Plan->getRecipe(UI);
+        // Generated scalarized instructions don't serve users outside of the
+        // VPlan, so we can safely ignore users that have no recipe.
+        if (!UserRecipe)
+          continue;
+
+        if (isa<VPVectorizeOneByOneRecipe>(UserRecipe)) {
+          if (isa<GetElementPtrInst>(I) &&
+              (isa<LoadInst>(UI) || isa<StoreInst>(UI)) &&
+              Legal->isConsecutivePtr(I)) {
+            // Wide memory operations generate their own GEPs, we can sink all
+            // the scalarized GEPs.
+            continue;
+          }
+          // All of I's lanes are used by an instruction we can't sink.
+          HasVectorizedUses = true;
+          break;
+        }
+
+        if (isa<VPInterleaveRecipe>(UserRecipe)) {
+          assert(isa<GetElementPtrInst>(I) &&
+                 "Non-GEP used in interleave group");
+          // GEP used as the uniform address of a wide memory operation, do
+          // not sink lane zero.
+          MinLaneToSink = std::max(MinLaneToSink, 1u);
+          continue;
+        }
+
+        assert((isa<VPScalarizeOneByOneRecipe>(UserRecipe) ||
+                isa<VPBuildScalarStepsRecipe>(UserRecipe)) &&
+               "Unexpected recipe while sinking scalar operands");
+
+        // Induction variables feeding consecutive GEPs can be indirectly used
+        // by vectorized load/stores which generate their own GEP rather than
+        // reuse the scalarized one (unlike load/store in interleave groups).
+        // In such a case, we can sink all lanes but lane zero. Note that we
+        // can do this whether or not the GEP is used within the predicated
+        // block (i.e. whether it will sink its own lanes 1..VF-1).
+        if (isa<GetElementPtrInst>(UI) && Legal->isConsecutivePtr(UI) &&
+            isa<VPBuildScalarStepsRecipe>(Recipe)) {
+          auto IsVectorizedMemoryOperation = [&](User *U) -> bool {
+            if (!(isa<LoadInst>(U) || isa<StoreInst>(U)))
+              return false;
+            VPRecipeBase *R = Plan->getRecipe(cast<Instruction>(U));
+            return R && isa<VPVectorizeOneByOneRecipe>(R);
+          };
+
+          if (any_of(UI->users(), IsVectorizedMemoryOperation)) {
+            MinLaneToSink = std::max(MinLaneToSink, 1u);
+            continue;
+          }
+        }
+
+        if (UserRecipe->getParent() != PredBB) {
+          // Don't make a decision until all scalarized users have sunk.
+          AllScalarizedUsesInPredicatedBlock = false;
+          continue;
+        }
+
+        // Ok to sink w.r.t. this use, but no more lanes than what the user
+        // itself has sunk.
+        VPLaneRange DesignatedLanes;
+        if (auto *BSS = dyn_cast<VPBuildScalarStepsRecipe>(UserRecipe))
+          DesignatedLanes = BSS->getDesignatedLanes();
+        else
+          DesignatedLanes =
+              cast<VPScalarizeOneByOneRecipe>(UserRecipe)->getDesignatedLanes();
+        VPLaneRange SinkableLanes =
+            VPLaneRange::intersect(VPLaneRange(MinLaneToSink), DesignatedLanes);
+        MinLaneToSink = SinkableLanes.getMinLane();
+      }
+
+      if (HasVectorizedUses)
+        continue; // This instruction cannot be sunk.
+
+      // It's legal to sink the instruction if all its uses occur in the
+      // predicated block. Otherwise, there's nothing to do yet, and we may
+      // need to reanalyze the instruction.
+      if (!AllScalarizedUsesInPredicatedBlock) {
+        InstsToReanalyze.push_back(I);
+        continue;
+      }
+
+      // Move the instruction to the beginning of the predicated block, and add
+      // its operands to the worklist (except for phi nodes).
+      PlanUtils.sinkInstruction(I, PredBB, MinLaneToSink);
+      if (!isa<PHINode>(I))
+        Worklist.insert(I->op_begin(), I->op_end());
+
+      // The sinking may have enabled other instructions to be sunk, so we will
+      // need to iterate.
+      Changed = true;
+    }
+  } while (Changed);
+}
+
+void LoopVectorizationPlanner::assignScalarVectorConversions(
+    Instruction *PredInst, VPlan *Plan) {
+
+  // NFC: Let PredInst's recipe generate the vector version of PredInst, but
+  // only if all of PredInst's users are vectorized. This is equivalent to the
+  // previous predicateInstructions by which an insert-element got hoisted
+  // into the matching predicated basic block if it is the only user of
+  // the predicated instruction.
+
+  if (PredInst->use_empty())
+    return;
+
+  for (User *U : PredInst->users()) {
+    Instruction *UserInst = dyn_cast<Instruction>(U);
+    if (!UserInst)
+      continue;
+
+    VPRecipeBase *UserRecipe = Plan->getRecipe(UserInst);
+    if (!UserRecipe) // User is not part of the plan.
+      return;
+
+    if (isa<VPVectorizeOneByOneRecipe>(UserRecipe))
+      continue;
+
+    // Found a user that will not be using the vector form of the predicated
+    // instruction. The insert-element is not going to be the only user, so
+    // do not hoist it.
+    return;
+  }
+
+  Plan->getRecipe(PredInst)->addAlsoPackOrUnpack(PredInst);
+}
+
+bool LoopVectorizationPlanner::shouldScalarizeInstruction(Instruction *I,
+                                                          unsigned VF) const {
+  return Legal->isScalarAfterVectorization(I) ||
+         CM->isProfitableToScalarize(I, VF);
+}
+
+bool LoopVectorizationPlanner::needsScalarInduction(Instruction *IV,
+                                                    unsigned VF) const {
+  if (shouldScalarizeInstruction(IV, VF))
+    return true;
+
+  auto isScalarInst = [&](User *U) -> bool {
+    auto *I = cast<Instruction>(U);
+    return (TheLoop->contains(I) && shouldScalarizeInstruction(I, VF));
+  };
+
+  return any_of(IV->users(), isScalarInst);
+}
+
+bool LoopVectorizationPlanner::needsScalarInduction(
+    Instruction *IV, unsigned StartRangeVF, unsigned &EndRangeVF) const {
+  bool StartNeedsScalarInduction = needsScalarInduction(IV, StartRangeVF);
+
+  for (unsigned TmpVF = StartRangeVF * 2; TmpVF < EndRangeVF; TmpVF *= 2) {
+    bool TmpNeedsScalarInduction = needsScalarInduction(IV, TmpVF);
+    if (StartNeedsScalarInduction != TmpNeedsScalarInduction) {
+      EndRangeVF = TmpVF;
+      break;
+    }
+  }
+
+  return StartNeedsScalarInduction;
+}
+
+void LoopVectorizationPlanner::optimizePredicatedInstructions() {
+  VPlan *PrevPlan = nullptr;
+  for (auto &It : VPlans) {
+    VPlan *Plan = It.second.get();
+    if (Plan == PrevPlan)
+      continue;
+    for (auto *PredInst : PredicatedInstructions) {
+      sinkScalarOperands(PredInst, Plan);
+      assignScalarVectorConversions(PredInst, Plan);
+    }
+    PrevPlan = Plan;
+  }
+}
+
+void LoopVectorizationPlanner::setBestPlan(unsigned VF, unsigned UF) {
+  DEBUG(dbgs() << "Setting best plan to VF=" << VF << ", UF=" << UF << '\n');
+  BestVF = VF;
+  BestUF = UF;
+
+  assert(VPlans.count(VF) && "Best VF does not have a VPlan.");
+  // Delete all other VPlans; erase via the returned iterator so the loop's
+  // iterator stays valid.
+  for (auto It = VPlans.begin(); It != VPlans.end();) {
+    if (It->first != VF)
+      It = VPlans.erase(It);
+    else
+      ++It;
+  }
+}
+
+void LoopVectorizationPlanner::executeBestPlan(InnerLoopVectorizer &LB) {
+  ILV = &LB;
+
+  // Perform the actual loop widening (vectorization).
+  // 1. Create a new empty loop. Unlink the old loop and connect the new one.
+  ILV->createEmptyLoop();
+
+  // 2. Widen each instruction in the old loop to a new one in the new loop.
+
+  VPTransformState State{BestVF, BestUF, LI, ILV->DT, ILV->Builder, ILV, Legal};
+  State.CFG.PrevBB = ILV->LoopVectorPreHeader;
+
+  VPlan *Plan = getVPlanForVF(BestVF);
+
+  Plan->vectorize(&State);
+
+  // 3. Take care of the phis to fix: reduction, first-order-recurrence,
+  // loop-closed.
+  ILV->vectorizeLoop();
+}
+
+void VPVectorizeOneByOneRecipe::transformIRInstruction(
+    Instruction *I, VPTransformState &State) {
+  assert(I && "No instruction to vectorize.");
+  State.ILV->vectorizeInstruction(*I);
+  if (willAlsoPackOrUnpack(I)) { // Unpack instruction
+    for (unsigned Part = 0; Part < State.UF; ++Part)
+      for (unsigned Lane = 0; Lane < State.VF; ++Lane)
+        State.ILV->getScalarValue(I, Part, Lane);
+  }
+}
+
+void VPScalarizeOneByOneRecipe::transformIRInstruction(
+    Instruction *I, VPTransformState &State) {
+  assert(I && "No instruction to vectorize.");
+  // By default generate scalar instances for all VF lanes of all UF parts.
+  // If the instruction is uniform, generate only the first lane for each
+  // of the UF parts.
+  bool IsUniform = State.Legal->isUniformAfterVectorization(I);
+  unsigned MinLane = 0;
+  unsigned MaxLane = IsUniform ? 0 : State.VF - 1;
+  unsigned MinPart = 0;
+  unsigned MaxPart = State.UF - 1;
+
+  if (State.Instance) {
+    // Asked to create an instance for a specific lane and a specific part.
+ assert(!IsUniform && + "Uniform instruction vectorized for a specific instance."); + MinLane = State.Instance->Lane; + MaxLane = MinLane; + MinPart = State.Instance->Part; + MaxPart = MinPart; + } + + // Intersect requested lanes with the designated lanes for this recipe. + VPLaneRange ActiveLanes(MinLane, MaxLane); + VPLaneRange EffectiveLanes = + VPLaneRange::intersect(ActiveLanes, DesignatedLanes); + if (EffectiveLanes.isEmpty()) + return; // None of the requested lanes is designated for this recipe. + + // Generate relevant lanes. + State.ILV->scalarizeInstruction(I, MinPart, MaxPart, + EffectiveLanes.getMinLane(), + EffectiveLanes.getMaxLane()); + if (willAlsoPackOrUnpack(I)) { + if (State.Instance) + // Insert scalar instance packing it into a vector. + State.ILV->constructVectorValue(I, MinPart, MinLane); + else + // Broadcast or group together all instances into a vector. + State.ILV->getVectorValue(I); + } +} - // Only calculate the cost once at the insert position. - if (Group->getInsertPos() != I) - return 0; +void VPWidenIntInductionRecipe::vectorize(VPTransformState &State) { + assert(State.Instance == nullptr && "Int induction being replicated"); + auto BuildScalarInfo = State.ILV->widenIntInduction(NeedsScalarIV, IV, Trunc); + ScalarIV = BuildScalarInfo.first; + Step = BuildScalarInfo.second; +} - unsigned InterleaveFactor = Group->getFactor(); - Type *WideVecTy = - VectorType::get(VectorTy->getVectorElementType(), - VectorTy->getVectorNumElements() * InterleaveFactor); +void VPWidenIntInductionRecipe::print(raw_ostream &O) const { + O << "Widen int induction"; + if (NeedsScalarIV) + O << " (needs scalars)"; + O << ":\n"; + O << *IV; + if (Trunc) + O << "\n" << *Trunc << ")"; +} - // Holds the indices of existing members in an interleaved load group. - // An interleaved store group doesn't need this as it doesn't allow gaps. - SmallVector Indices; - if (LI) { - for (unsigned i = 0; i < InterleaveFactor; i++) - if (Group->getMember(i)) - Indices.push_back(i); - } +void VPBuildScalarStepsRecipe::vectorize(VPTransformState &State) { + unsigned MinLane = 0; + unsigned MaxLane = State.VF - 1; + unsigned MinPart = 0; + unsigned MaxPart = State.UF - 1; + + if (State.Instance) { + // Asked to create an instance for a specific lane and a specific part. + MinLane = State.Instance->Lane; + MaxLane = MinLane; + MinPart = State.Instance->Part; + MaxPart = MinPart; + } + + // Intersect requested lanes with the designated lanes for this recipe. + VPLaneRange ActiveLanes(MinLane, MaxLane); + VPLaneRange EffectiveLanes = + VPLaneRange::intersect(ActiveLanes, DesignatedLanes); + if (EffectiveLanes.isEmpty()) + return; // None of the requested lanes is designated for this recipe. + + // Generate relevant lanes. + State.ILV->buildScalarSteps(WII->getScalarIV(), WII->getStep(), EntryVal, + MinPart, MaxPart, EffectiveLanes.getMinLane(), + EffectiveLanes.getMaxLane()); +} - // Calculate the cost of the whole interleaved group. 
- unsigned Cost = TTI.getInterleavedMemoryOpCost( - I->getOpcode(), WideVecTy, Group->getFactor(), Indices, - Group->getAlignment(), AS); +void VPBuildScalarStepsRecipe::print(raw_ostream &O) const { + O << "Build scalar steps"; + if (!DesignatedLanes.isFull()) { + O << " "; + DesignatedLanes.print(O); + } + O << ":\n" << *EntryVal; +} - if (Group->isReverse()) - Cost += - Group->getNumMembers() * - TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0); +void VPInterleaveRecipe::vectorize(VPTransformState &State) { + assert(State.Instance == nullptr && "Interleave group being replicated"); + State.ILV->vectorizeInterleaveGroup(IG->getInsertPos()); +} - // FIXME: The interleaved load group with a huge gap could be even more - // expensive than scalar operations. Then we could ignore such group and - // use scalar operations instead. - return Cost; +void VPInterleaveRecipe::print(raw_ostream &O) const { + O << "InterleaveGroup factor:" << IG->getFactor() << '\n'; + for (unsigned i = 0; i < IG->getFactor(); ++i) + if (Instruction *I = IG->getMember(i)) { + if (I == IG->getInsertPos()) + O << i << "=]" << *I; + else + O << i << " ]" << *I; + if (willAlsoPackOrUnpack(I)) + O << " (V->S)"; } +} - // Check if the memory instruction will be scalarized. - if (Legal->memoryInstructionMustBeScalarized(I, VF)) { - unsigned Cost = 0; - Type *PtrTy = ToVectorTy(Ptr->getType(), VF); - - // Figure out whether the access is strided and get the stride value - // if it's known in compile time - const SCEV *PtrSCEV = getAddressAccessSCEV(Ptr, Legal, SE, TheLoop); +void VPExtractMaskBitRecipe::vectorize(VPTransformState &State) { + assert(State.Instance && "Extract Mask Bit works only on single instance."); - // Get the cost of the scalar memory instruction and address computation. - Cost += VF * TTI.getAddressComputationCost(PtrTy, SE, PtrSCEV); - Cost += VF * - TTI.getMemoryOpCost(I->getOpcode(), ValTy->getScalarType(), - Alignment, AS); + unsigned Part = State.Instance->Part; + unsigned Lane = State.Instance->Lane; - // Get the overhead of the extractelement and insertelement instructions - // we might create due to scalarization. - Cost += getScalarizationOverhead(I, VF, TTI); + typedef SmallVector VectorParts; - // If we have a predicated store, it may not be executed for each vector - // lane. Scale the cost by the probability of executing the predicated - // block. - if (Legal->isScalarWithPredication(I)) - Cost /= getReciprocalPredBlockProb(); + VectorParts Cond = State.ILV->createBlockInMask(MaskedBasicBlock); - return Cost; - } + ConditionBit = State.Builder.CreateExtractElement( + Cond[Part], State.ILV->Builder.getInt32(Lane)); + ConditionBit = + State.Builder.CreateICmp(ICmpInst::ICMP_EQ, ConditionBit, + ConstantInt::get(ConditionBit->getType(), 1)); + DEBUG(dbgs() << "\nLV: vectorizing ConditionBit recipe" + << MaskedBasicBlock->getName()); +} - // Determine if the pointer operand of the access is either consecutive or - // reverse consecutive. - int ConsecutiveStride = Legal->isConsecutivePtr(Ptr); - bool Reverse = ConsecutiveStride < 0; +void VPMergeScalarizeBranchRecipe::vectorize(VPTransformState &State) { + assert(State.Instance && + "Merge Scalarize Branch works only on single instance."); + + Type *LiveOutType = LiveOut->getType(); + unsigned Part = State.Instance->Part; + unsigned Lane = State.Instance->Lane; + + // Rename the predicated and merged basic blocks for backwards compatibility. 
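+  // (These are the pred.<opcode>.if / pred.<opcode>.continue names the
+  // previous predication implementation gave these blocks.)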
+ Instruction *ScalarLiveOut = + cast(State.ILV->getScalarValue(LiveOut, Part, Lane)); + BasicBlock *PredicatedBB = ScalarLiveOut->getParent(); + BasicBlock *PredicatingBB = PredicatedBB->getSinglePredecessor(); + assert(PredicatingBB && "Predicated block has no single predecessor"); + PredicatedBB->setName(Twine("pred.") + LiveOut->getOpcodeName() + ".if"); + PredicatedBB->getSingleSuccessor()->setName( + Twine("pred.") + LiveOut->getOpcodeName() + ".continue"); + if (LiveOutType->isVoidTy()) + return; - // Determine if either a gather or scatter operation is legal. - bool UseGatherOrScatter = - !ConsecutiveStride && Legal->isLegalGatherOrScatter(I); + // Generate a phi node for the scalarized instruction. + PHINode *Phi = State.ILV->Builder.CreatePHI(LiveOutType, 2); + Phi->addIncoming(UndefValue::get(ScalarLiveOut->getType()), PredicatingBB); + Phi->addIncoming(ScalarLiveOut, PredicatedBB); + State.ILV->setScalarValue(LiveOut, Part, Lane, Phi); + + // If this instruction also generated the complementing form then we also need + // to create a phi for the vector value of this part & lane and update the + // vector values cache accordingly. + Value *VectorValue = State.ILV->getVectorValue(LiveOut, Part); + if (!VectorValue) + return; - unsigned Cost = TTI.getAddressComputationCost(VectorTy); - if (UseGatherOrScatter) { - assert(ConsecutiveStride == 0 && - "Gather/Scatter are not used for consecutive stride"); - return Cost + - TTI.getGatherScatterOpCost(I->getOpcode(), VectorTy, Ptr, - Legal->isMaskRequired(I), Alignment); - } - // Wide load/stores. - if (Legal->isMaskRequired(I)) - Cost += - TTI.getMaskedMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS); - else - Cost += TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS); + InsertElementInst *IEI = cast(VectorValue); + PHINode *VPhi = State.ILV->Builder.CreatePHI(IEI->getType(), 2); + VPhi->addIncoming(IEI->getOperand(0), PredicatingBB); // the unmodified vector + VPhi->addIncoming(IEI, PredicatedBB); // new vector with the inserted element + State.ILV->setVectorValue(LiveOut, Part, VPhi); +} - if (Reverse) - Cost += TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0); - return Cost; - } - case Instruction::ZExt: - case Instruction::SExt: - case Instruction::FPToUI: - case Instruction::FPToSI: - case Instruction::FPExt: - case Instruction::PtrToInt: - case Instruction::IntToPtr: - case Instruction::SIToFP: - case Instruction::UIToFP: - case Instruction::Trunc: - case Instruction::FPTrunc: - case Instruction::BitCast: { - // We optimize the truncation of induction variable. - // The cost of these is the same as the scalar operation. - if (I->getOpcode() == Instruction::Trunc && - Legal->isInductionVariable(I->getOperand(0))) - return TTI.getCastInstrCost(I->getOpcode(), I->getType(), - I->getOperand(0)->getType()); +/// Creates a new VPScalarizeOneByOneRecipe or VPVectorizeOneByOneRecipe based +/// on the isScalarizing parameter respectively. +VPOneByOneRecipeBase *VPlanUtilsLoopVectorizer::createOneByOneRecipe( + const BasicBlock::iterator B, const BasicBlock::iterator E, VPlan *Plan, + bool isScalarizing) { + if (isScalarizing) + return new VPScalarizeOneByOneRecipe(B, E, Plan); + return new VPVectorizeOneByOneRecipe(B, E, Plan); +} - Type *SrcScalarTy = I->getOperand(0)->getType(); - Type *SrcVecTy = ToVectorTy(SrcScalarTy, VF); - if (canTruncateToMinimalBitwidth(I, VF)) { - // This cast is going to be shrunk. This may remove the cast or it might - // turn it into slightly different cast. 
For example, if MinBW == 16,
-      // "zext i8 %1 to i32" becomes "zext i8 %1 to i16".
-      //
-      // Calculate the modified src and dest types.
-      Type *MinVecTy = VectorTy;
-      if (I->getOpcode() == Instruction::Trunc) {
-        SrcVecTy = smallestIntegerVectorType(SrcVecTy, MinVecTy);
-        VectorTy =
-            largestIntegerVectorType(ToVectorTy(I->getType(), VF), MinVecTy);
-      } else if (I->getOpcode() == Instruction::ZExt ||
-                 I->getOpcode() == Instruction::SExt) {
-        SrcVecTy = largestIntegerVectorType(SrcVecTy, MinVecTy);
-        VectorTy =
-            smallestIntegerVectorType(ToVectorTy(I->getType(), VF), MinVecTy);
-      }
-    }

+/// Given a \p Split instruction assumed to reside in a VPOneByOneRecipeBase
+/// -- where VPOneByOneRecipeBase is either VPScalarizeOneByOneRecipe or
+/// VPVectorizeOneByOneRecipe -- update that recipe to start from \p Split
+/// and move all preceding instructions to a new VPOneByOneRecipeBase.
+/// \return the newly created VPOneByOneRecipeBase, which is added to the
+/// VPBasicBlock of the original recipe, right before it.
+VPOneByOneRecipeBase *
+VPlanUtilsLoopVectorizer::splitRecipe(Instruction *Split) {
+  VPOneByOneRecipeBase *Recipe =
+      cast<VPOneByOneRecipeBase>(Plan->getRecipe(Split));
+  auto SplitPos = Split->getIterator();
+
+  assert(SplitPos != Recipe->Begin &&
+         "Nothing to split before first instruction.");
+  assert(SplitPos != Recipe->End && "Nothing to split after last instruction.");
+
+  // Build a new recipe for all instructions up to the given Split.
+  VPBasicBlock *VPBB = Recipe->getParent();
+  VPOneByOneRecipeBase *NewRecipe = createOneByOneRecipe(
+      Recipe->Begin, SplitPos, Plan, Recipe->isScalarizing());
+
+  // Insert the new recipe before the split point.
+  VPBB->addRecipe(NewRecipe, Recipe);
+
+  // Update the old recipe to start from the given split point.
+  Recipe->Begin = SplitPos;
+
+  return NewRecipe;
+}
-    return TTI.getCastInstrCost(I->getOpcode(), VectorTy, SrcVecTy);

+/// Insert a given instruction \p Inst into a VPBasicBlock before another
+/// given instruction \p Before. Assumes \p Inst does not belong to any
+/// recipe, and that \p Before belongs to a VPOneByOneRecipeBase.
+void VPlanUtilsLoopVectorizer::insertBefore(Instruction *Inst,
+                                            Instruction *Before,
+                                            unsigned MinLane) {
+  assert(!Plan->getRecipe(Inst) && "Instruction already in recipe.");
+  VPRecipeBase *Recipe = Plan->getRecipe(Before);
+  assert(Recipe && "Insertion point not in any recipe.");
+  VPOneByOneRecipeBase *OBORecipe = cast<VPOneByOneRecipeBase>(Recipe);
+  bool PartialInsertion = MinLane > 0;
+  bool IndicesMatch = true;
+
+  if (PartialInsertion) {
+    VPScalarizeOneByOneRecipe *SOBO =
+        dyn_cast<VPScalarizeOneByOneRecipe>(Recipe);
+    if (!SOBO || SOBO->DesignatedLanes.getMinLane() != MinLane)
+      IndicesMatch = false;
+  }
+
+  // Can we insert \p Inst by augmenting the existing recipe of \p Before?
+  // Only if \p Inst is immediately followed by \p Before:
+  if (Inst->getNextNode() == Before && IndicesMatch) {
+    // This must imply that \p Before is the first ingredient in its recipe.
+    assert(Before == &*OBORecipe->Begin &&
+           "Trying to insert but Before is not first in its recipe.");
+    // Yes, extend the range to include the previous instruction.
+ OBORecipe->Begin--; + Plan->setInst2Recipe(Inst, Recipe); + return; } - case Instruction::Call: { - bool NeedToScalarize; - CallInst *CI = cast(I); - unsigned CallCost = getVectorCallCost(CI, VF, TTI, TLI, NeedToScalarize); - if (getVectorIntrinsicIDForCall(CI, TLI)) - return std::min(CallCost, getVectorIntrinsicCost(CI, VF, TTI, TLI)); - return CallCost; + // Note that it is not possible to augment the end of Recipe by having + // Inst == &*Recipe->End, because to do that Before would need to be + // Recipe->End, which means that Before does not belong to this Recipe. + + // No, the instruction needs to have its own recipe. + + // If we're not inserting right before the Recipe's first instruction, + // split the Recipe to allow placing the new recipe right before the + // given insertion point. This new recipe is also added to BasicBlock. + if (Before != &*OBORecipe->Begin) + splitRecipe(Before); + + // TODO: VPLanUtils::addOneByOneToBasicBlock() + auto InstBegin = Inst->getIterator(); + auto InstEnd = InstBegin; + VPBasicBlock *BasicBlock = Recipe->getParent(); + VPOneByOneRecipeBase *NewRecipe = nullptr; + if (PartialInsertion) { + NewRecipe = createOneByOneRecipe(InstBegin, ++InstEnd, Plan, true); + cast(NewRecipe)->DesignatedLanes = + VPLaneRange(MinLane); + } else + NewRecipe = createOneByOneRecipe(InstBegin, ++InstEnd, Plan, + OBORecipe->isScalarizing()); + Plan->setInst2Recipe(Inst, NewRecipe); + BasicBlock->addRecipe(NewRecipe, OBORecipe); +} + +/// Remove a given instruction \p Inst from its recipe, if exists. We only +/// support removal from VPOneByOneRecipeBase at this time. +void VPlanUtilsLoopVectorizer::removeInstruction(Instruction *Inst, + unsigned FromLane) { + VPRecipeBase *Recipe = Plan->getRecipe(Inst); + if (!Recipe) + return; // Nothing to do, no recipe to remove the instruction from. + VPOneByOneRecipeBase *OBORecipe = cast(Recipe); + // First check if OBORecipe can be shortened to exclude Inst. + bool InstructionWasLast = false; + if (&*OBORecipe->Begin == Inst) + OBORecipe->Begin++; + else if (&*OBORecipe->End == Inst) { + OBORecipe->End--; + InstructionWasLast = true; + } + // Otherwise split OBORecipe at Inst. + else { + splitRecipe(Inst); + OBORecipe->Begin++; + } + if (FromLane > 0) { + // This is a partial removal. Leave lanes 0..FromLane-1 in the original + // basic block in a new, unregistered recipe. + VPOneByOneRecipeBase *NewRecipe = createOneByOneRecipe( + Inst->getIterator(), ++(Inst->getIterator()), Plan, true); + cast(NewRecipe)->DesignatedLanes = + VPLaneRange(0, FromLane - 1); + RecipeListTy *Recipes = getRecipes(Recipe->getParent()); + if (InstructionWasLast) + Recipes->push_back(NewRecipe); + else + Recipes->insert(Recipe->getIterator(), NewRecipe); } - default: - // The cost of executing VF copies of the scalar instruction. This opcode - // is unknown. Assume that it is the same as 'mul'. - return VF * TTI.getArithmeticInstrCost(Instruction::Mul, VectorTy) + - getScalarizationOverhead(I, VF, TTI); - } // end of switch. 
+  Plan->resetInst2Recipe(Inst);
 }

-char LoopVectorize::ID = 0;
-static const char lv_name[] = "Loop Vectorization";
-INITIALIZE_PASS_BEGIN(LoopVectorize, LV_NAME, lv_name, false, false)
-INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
-INITIALIZE_PASS_DEPENDENCY(BasicAAWrapperPass)
-INITIALIZE_PASS_DEPENDENCY(AAResultsWrapperPass)
-INITIALIZE_PASS_DEPENDENCY(GlobalsAAWrapperPass)
-INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)
-INITIALIZE_PASS_DEPENDENCY(BlockFrequencyInfoWrapperPass)
-INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
-INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass)
-INITIALIZE_PASS_DEPENDENCY(LCSSAWrapperPass)
-INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
-INITIALIZE_PASS_DEPENDENCY(LoopSimplify)
-INITIALIZE_PASS_DEPENDENCY(LoopAccessLegacyAnalysis)
-INITIALIZE_PASS_DEPENDENCY(DemandedBitsWrapperPass)
-INITIALIZE_PASS_DEPENDENCY(OptimizationRemarkEmitterWrapperPass)
-INITIALIZE_PASS_END(LoopVectorize, LV_NAME, lv_name, false, false)
+// Given an instruction \p Inst and a VPBasicBlock \p To, remove \p Inst from
+// its current residence and add it as the first instruction of \p To.
+// We currently support removal from and insertion to VPOneByOneRecipeBase
+// recipes only.
+// TODO: this is an oversimplified implementation that assumes we can make
+// the new instruction the first instruction of the first recipe in the
+// basic block. This is true for the sinkScalarOperands use-case, but for a
+// general basic block a getFirstInsertionPt() logic is required.
+void VPlanUtilsLoopVectorizer::sinkInstruction(Instruction *Inst,
+                                               VPBasicBlock *To,
+                                               unsigned MinLane) {
+  RecipeListTy *Recipes = getRecipes(To);
+
+  VPRecipeBase *FromRecipe = Plan->getRecipe(Inst);
+  if (auto *FromBSSRecipe = dyn_cast<VPBuildScalarStepsRecipe>(FromRecipe)) {
+    VPBuildScalarStepsRecipe *SunkRecipe = nullptr;
+    if (MinLane == 0) {
+      // Sink the entire recipe.
+      VPBasicBlock *From = FromRecipe->getParent();
+      assert(From && "Recipe to sink not assigned to any basic block");
+      RecipeListTy *FromRecipes = getRecipes(From);
+      FromRecipes->erase(FromRecipe);
+      SunkRecipe = FromBSSRecipe;
+    } else {
+      // Partially sink lanes MinLane..VF-1.
+      SunkRecipe = new VPBuildScalarStepsRecipe(FromBSSRecipe->WII,
+                                                FromBSSRecipe->EntryVal, Plan);
+      SunkRecipe->DesignatedLanes = VPLaneRange(MinLane);
+      FromBSSRecipe->DesignatedLanes = VPLaneRange(0, MinLane - 1);
+    }
+    To->addRecipe(SunkRecipe, &*Recipes->begin());
+    return;
+  }

-namespace llvm {
-Pass *createLoopVectorizePass(bool NoUnrolling, bool AlwaysVectorize) {
-  return new LoopVectorize(NoUnrolling, AlwaysVectorize);
-}
-}
+  assert(Plan->getRecipe(Inst) &&
+         isa<VPScalarizeOneByOneRecipe>(Plan->getRecipe(Inst)) &&
+         "Unsupported recipe to sink instructions from");

-bool LoopVectorizationCostModel::isConsecutiveLoadOrStore(Instruction *Inst) {
+  // Remove instruction from its source recipe.
+  removeInstruction(Inst, MinLane);

-  // Check if the pointer operand of a load or store instruction is
-  // consecutive.
-  if (auto *Ptr = getPointerOperand(Inst))
-    return Legal->isConsecutivePtr(Ptr);
-  return false;
+  auto *ToRecipe = dyn_cast<VPOneByOneRecipeBase>(&*Recipes->begin());
+  if (ToRecipe) {
+    // Try to sink the instruction into an existing recipe, default to a new
+    // recipe.
+    assert(ToRecipe->isScalarizing() &&
+           "Cannot sink into a non-scalarizing recipe.");
+
+    // Add it before the first ingredient of To.
+    insertBefore(Inst, &*ToRecipe->Begin, MinLane);
+  } else {
+    // Instruction has to go into its own one-by-one recipe.
+    auto InstBegin = Inst->getIterator();
+    auto InstEnd = InstBegin;
+    auto *NewRecipe = createOneByOneRecipe(InstBegin, ++InstEnd, Plan, true);
+    if (MinLane > 0) // Partial sink.
+      cast<VPScalarizeOneByOneRecipe>(NewRecipe)->DesignatedLanes =
+          VPLaneRange(MinLane);
+    To->addRecipe(NewRecipe, &*Recipes->begin());
+  }
 }

-void LoopVectorizationCostModel::collectValuesToIgnore() {
-  // Ignore ephemeral values.
-  CodeMetrics::collectEphemeralValues(TheLoop, AC, ValuesToIgnore);
+void InnerLoopUnroller::vectorizeInstruction(Instruction &I) {
+  switch (I.getOpcode()) {
+  case Instruction::Br:
+    // Nothing to do for branches since we already took care of the
+    // loop control flow instructions.
+    break;

-  // Ignore type-promoting instructions we identified during reduction
-  // detection.
-  for (auto &Reduction : *Legal->getReductionVars()) {
-    RecurrenceDescriptor &RedDes = Reduction.second;
-    SmallPtrSetImpl<Instruction *> &Casts = RedDes.getCastInsts();
-    VecValuesToIgnore.insert(Casts.begin(), Casts.end());
+  case Instruction::GetElementPtr:
+    scalarizeInstruction(&I, false);
+    break;
+
+  case Instruction::UDiv:
+  case Instruction::SDiv:
+  case Instruction::SRem:
+  case Instruction::URem:
+    // Scalarize with predication if this instruction may divide by zero and
+    // block execution is conditional, otherwise fall through.
+    if (Legal->isScalarWithPredication(&I)) {
+      scalarizeInstruction(&I, true);
+      break;
+    }
+
+  case Instruction::Trunc: {
+    auto *CI = dyn_cast<TruncInst>(&I);
+    // Optimize the special case where the source is a constant integer
+    // induction variable. Notice that we can only optimize the 'trunc' case
+    // because (a) FP conversions lose precision, (b) sext/zext may wrap, and
+    // (c) other casts depend on pointer size.
+    auto ID = Legal->getInductionVars()->lookup(OldInduction);
+    if (CI && CI->getOperand(0) == OldInduction && ID.getConstIntStepValue()) {
+      setDebugLocFromInst(Builder, CI);
+      widenIntInduction(needsScalarInduction(OldInduction), OldInduction, CI);
+      break;
+    }
   }

-  // Insert values known to be scalar into VecValuesToIgnore. This is a
-  // conservative estimation of the values that will later be scalarized.
-  //
-  // FIXME: Even though an instruction is not scalar-after-vectoriztion, it may
-  //        still be scalarized. For example, we may find an instruction to be
-  //        more profitable for a given vectorization factor if it were to be
-  //        scalarized. But at this point, we haven't yet computed the
-  //        vectorization factor.
-  for (auto *BB : TheLoop->getBlocks())
-    for (auto &I : *BB)
-      if (Legal->isScalarAfterVectorization(&I))
-        VecValuesToIgnore.insert(&I);
+  default:
+    InnerLoopVectorizer::vectorizeInstruction(I);
+  }
 }

 void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,
@@ -7457,9 +9102,35 @@
     return false;
   }

-  // Select the optimal vectorization factor.
-  const LoopVectorizationCostModel::VectorizationFactor VF =
-      CM.selectVectorizationFactor(OptForSize);
+  if (!CM.canVectorize(OptForSize))
+    return false;
+
+  // Early prune excessive VFs.
+  unsigned MaxVF = CM.computeMaxVectorizationFactor(OptForSize);
+
+  // If OptForSize, MaxVF is the only VF we consider. Abort if it needs a tail.
+  if (OptForSize && CM.requiresTail(MaxVF))
+    return false;
+
+  // Use the planner.
+  LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, &CM);
+
+  // Get user vectorization factor.
+  unsigned UserVF = Hints.getWidth();
+
+  // Select the vectorization factor.
+  LoopVectorizationCostModel::VectorizationFactor VF =
+      LVP.plan(OptForSize, UserVF, MaxVF);
+  bool VectorizeLoop = (VF.Width > 1);
+
+  std::pair<StringRef, std::string> VecDiagMsg, IntDiagMsg;
+
+  if (!UserVF && !VectorizeLoop) {
+    DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n");
+    VecDiagMsg = std::make_pair(
+        "VectorizationNotBeneficial",
+        "the cost-model indicates that vectorization is not beneficial");
+  }

   // Select the interleave count.
   unsigned IC = CM.selectInterleaveCount(OptForSize, VF.Width, VF.Cost);
@@ -7468,8 +9139,6 @@
   unsigned UserIC = Hints.getInterleave();

   // Identify the diagnostic messages that should be produced.
-  std::pair<StringRef, std::string> VecDiagMsg, IntDiagMsg;
-  bool VectorizeLoop = true, InterleaveLoop = true;
   if (Requirements.doesNotMeet(F, L, Hints)) {
     DEBUG(dbgs() << "LV: Not vectorizing: loop did not meet vectorization "
                     "requirements.\n");
@@ -7477,13 +9146,7 @@
     return false;
   }

-  if (VF.Width == 1) {
-    DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n");
-    VecDiagMsg = std::make_pair(
-        "VectorizationNotBeneficial",
-        "the cost-model indicates that vectorization is not beneficial");
-    VectorizeLoop = false;
-  }
+  bool InterleaveLoop = true;

   if (IC == 1 && UserIC <= 1) {
     // Tell the user interleaving is not beneficial.
@@ -7499,8 +9162,8 @@
     }
   } else if (IC > 1 && UserIC == 1) {
     // Tell the user interleaving is beneficial, but it explicitly disabled.
-    DEBUG(dbgs()
-          << "LV: Interleaving is beneficial but is explicitly disabled.");
+    DEBUG(
+        dbgs() << "LV: Interleaving is beneficial but is explicitly disabled.");
     IntDiagMsg = std::make_pair(
         "InterleavingBeneficialButDisabled",
         "the cost-model indicates that interleaving is beneficial "
@@ -7511,6 +9174,9 @@
   // Override IC if user provided an interleave count.
   IC = UserIC > 0 ? UserIC : IC;

+  if (VectorizeLoop)
+    LVP.setBestPlan(VF.Width, IC);
+
   // Emit diagnostic messages, if any.
   const char *VAPassName = Hints.vectorizeAnalysisPassName();
   if (!VectorizeLoop && !InterleaveLoop) {
@@ -7553,10 +9219,13 @@
            << "interleaved loop (interleaved count: "
            << NV("InterleaveCount", IC) << ")");
   } else {
+    // If we decided that it is *legal* to vectorize the loop, then do it.
     InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC,
                            &LVL, &CM);
-    LB.vectorize();
+
+    LVP.executeBestPlan(LB);
+
     ++LoopsVectorized;

     // Add metadata to disable runtime unrolling a scalar loop when there are
Index: lib/Transforms/Vectorize/VPlan.h
===================================================================
--- /dev/null
+++ lib/Transforms/Vectorize/VPlan.h
@@ -0,0 +1,914 @@
+//===- VPlan.h - Represent A Vectorizer Plan ------------------------------===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+//
+// This file contains the declarations of the Vectorization Plan base classes:
+// 1. VPBasicBlock and VPRegionBlock that inherit from a common pure virtual
+//    VPBlockBase, together implementing a Hierarchical CFG;
+// 2. Specializations of GraphTraits that allow VPBlockBase graphs to be
+//    treated as proper graphs for generic algorithms;
+// 3. Pure virtual VPRecipeBase and its pure virtual sub-classes
+//    VPConditionBitRecipeBase and VPOneByOneRecipeBase that
+//    represent base classes for recipes contained within VPBasicBlocks;
+// 4. The VPlan class holding a candidate for vectorization;
+// 5. The VPlanUtils class providing methods for building plans;
+// 6. The VPlanPrinter class providing a way to print a plan in dot format.
+// These are documented in docs/VectorizationPlan.rst.
+//
+//===----------------------------------------------------------------------===//

+#ifndef LLVM_TRANSFORMS_VECTORIZE_VPLAN_H
+#define LLVM_TRANSFORMS_VECTORIZE_VPLAN_H
+
+#include "llvm/ADT/GraphTraits.h"
+#include "llvm/ADT/ilist.h"
+#include "llvm/ADT/ilist_node.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/Support/raw_ostream.h"
+
+// The (re)use of existing LoopVectorize classes is subject to future VPlan
+// refactoring.
+namespace {
+class InnerLoopVectorizer;
+class LoopVectorizationLegality;
+}
+
+namespace llvm {
+
+class VPBasicBlock;
+
+/// VPRecipeBase is a base class describing one or more instructions that will
+/// appear consecutively in the vectorized version, based on Instructions from
+/// the given IR. These Instructions are referred to as the "Ingredients" of
+/// the Recipe. A Recipe specifies how its ingredients are to be vectorized:
+/// e.g., copy or reuse them as uniform, scalarize or vectorize them according
+/// to an enclosing loop dimension, vectorize them according to internal SLP
+/// dimension.
+///
+/// **Design principle:** in order to reason about how to vectorize an
+/// Instruction or how much it would cost, one has to consult the VPRecipe
+/// holding it.
+///
+/// **Design principle:** when a sequence of instructions conveys additional
+/// information as a group, we use a VPRecipe to encapsulate them and attach
+/// this information to the VPRecipe. For instance a VPRecipe can model an
+/// interleave group of loads or stores with additional information for
+/// calculating their cost and for performing IR code generation, as a group.
+///
+/// **Design principle:** a VPRecipe should reuse existing containers of its
+/// ingredients, i.e., iterators of basic blocks, to be lightweight. A new
+/// container should be opened on-demand, e.g., to avoid excessive recipes
+/// each holding an interval of ingredients.
+class VPRecipeBase
+    : public ilist_node_with_parent<VPRecipeBase, VPBasicBlock> {
+  friend class VPlanUtils;
+  friend class VPBasicBlock;
+
+private:
+  const unsigned char VRID; // Subclass identifier (for isa/dyn_cast).
+
+  /// Each VPRecipe is contained in a single VPBasicBlock.
+  class VPBasicBlock *Parent;
+
+  /// Record which Instructions would require generating their complementing
+  /// form as well, providing a vector-to-scalar or scalar-to-vector
+  /// conversion.
+  SmallPtrSet<Instruction *, 2> AlsoPackOrUnpack;
+
+public:
+  /// An enumeration for keeping track of the concrete subclass of VPRecipeBase
+  /// that is actually instantiated. Values of this enumeration are kept in the
+  /// VPRecipe class's VRID field. They are used for concrete type
+  /// identification.
+  typedef enum {
+    VPVectorizeOneByOneSC,
+    VPScalarizeOneByOneSC,
+    VPWidenIntInductionSC,
+    VPBuildScalarStepsSC,
+    VPInterleaveSC,
+    VPExtractMaskBitSC,
+    VPMergeScalarizeBranchSC,
+  } VPRecipeTy;
+
+  VPRecipeBase(const unsigned char SC) : VRID(SC), Parent(nullptr) {}
+
+  virtual ~VPRecipeBase() {}
+
+  /// \return an ID for the concrete type of this object.
+  /// This is used to implement the classof checks. This should not be used
+  /// for any other purpose, as the values may change as LLVM evolves.
+  unsigned getVPRecipeID() const { return VRID; }
+
+  /// \return the VPBasicBlock which this VPRecipe belongs to.
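+  /// For example (illustrative), an optimization can locate the block of an
+  /// ingredient via its recipe:
+  /// \code
+  ///   if (VPRecipeBase *Recipe = Plan->getRecipe(Inst))
+  ///     VPBasicBlock *VPBB = Recipe->getParent();
+  /// \endcode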
+  class VPBasicBlock *getParent() {
+    return Parent;
+  }
+
+  /// The method which generates the new IR instructions that correspond to
+  /// this VPRecipe in the vectorized version, thereby "executing" the VPlan.
+  virtual void vectorize(struct VPTransformState &State) = 0;
+
+  /// Each recipe prints itself.
+  virtual void print(raw_ostream &O) const = 0;
+
+  /// Add an instruction to the set of instructions for which a vector-to-
+  /// scalar or scalar-to-vector conversion is needed, in addition to
+  /// vectorizing or scalarizing the instruction itself, respectively.
+  void addAlsoPackOrUnpack(Instruction *I) { AlsoPackOrUnpack.insert(I); }
+
+  /// Indicates if a given instruction requires vector-to-scalar or scalar-to-
+  /// vector conversion.
+  bool willAlsoPackOrUnpack(Instruction *I) const {
+    return AlsoPackOrUnpack.count(I);
+  }
+};
+
+/// A VPConditionBitRecipeBase is a pure virtual VPRecipe which supports a
+/// conditional branch. Concrete sub-classes of this recipe are in charge of
+/// generating the instructions that compute the condition for this branch in
+/// the vectorized version.
+class VPConditionBitRecipeBase : public VPRecipeBase {
+protected:
+  /// The actual condition bit that was generated. Holds null until the
+  /// value/instructions are generated by the vectorize() method.
+  Value *ConditionBit;
+
+public:
+  /// Construct a VPConditionBitRecipeBase, simply propagating its concrete
+  /// type.
+  VPConditionBitRecipeBase(const unsigned char SC)
+      : VPRecipeBase(SC), ConditionBit(nullptr) {}
+
+  /// \return the actual bit that was generated, to be plugged into the IR
+  /// conditional branch, or null if the code computing the actual bit has not
+  /// been generated yet.
+  Value *getConditionBit() { return ConditionBit; }
+
+  virtual StringRef getName() const = 0;
+};
+
+/// VPOneByOneRecipeBase is a VPRecipeBase which handles each Instruction in
+/// its ingredients independently, in order. The ingredients are either all
+/// vectorized, or all scalarized.
+/// A VPOneByOneRecipeBase is a virtual base recipe which can be materialized
+/// by one of two sub-classes, namely VPVectorizeOneByOneRecipe or
+/// VPScalarizeOneByOneRecipe for Vectorizing or Scalarizing all ingredients,
+/// respectively.
+/// The ingredients are held as a sub-sequence of original Instructions, which
+/// reside in the same IR BasicBlock and in the same order. The ingredients
+/// are accessed via iterators to the first Instruction and to one past the
+/// last.
+class VPOneByOneRecipeBase : public VPRecipeBase {
+  friend class VPlanUtilsLoopVectorizer;
+
+public:
+  /// Hold the ingredients by pointing to their original BasicBlock location.
+  BasicBlock::iterator Begin;
+  BasicBlock::iterator End;
+
+protected:
+  VPOneByOneRecipeBase() = delete;
+
+  VPOneByOneRecipeBase(unsigned char SC, const BasicBlock::iterator B,
+                       const BasicBlock::iterator E, class VPlan *Plan);
+
+  /// Do the actual code generation for a single instruction.
+  /// This function is to be implemented and specialized by the respective
+  /// sub-class.
+  virtual void transformIRInstruction(Instruction *I,
+                                      struct VPTransformState &State) = 0;
+
+public:
+  ~VPOneByOneRecipeBase() {}
+
+  /// Method to support type inquiry through isa, cast, and dyn_cast.
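+  /// For example (illustrative):
+  /// \code
+  ///   if (auto *OneByOne = dyn_cast<VPOneByOneRecipeBase>(Recipe))
+  ///     bool Scalarizes = OneByOne->isScalarizing();
+  /// \endcode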
+  static inline bool classof(const VPRecipeBase *V) {
+    return V->getVPRecipeID() == VPRecipeBase::VPScalarizeOneByOneSC ||
+           V->getVPRecipeID() == VPRecipeBase::VPVectorizeOneByOneSC;
+  }
+
+  bool isScalarizing() {
+    return getVPRecipeID() == VPRecipeBase::VPScalarizeOneByOneSC;
+  }
+
+  /// The method which generates all new IR instructions that correspond to
+  /// this VPOneByOneRecipeBase in the vectorized version, thereby
+  /// "executing" the VPlan.
+  /// VPOneByOneRecipeBase may either scalarize or vectorize all Instructions.
+  void vectorize(struct VPTransformState &State) override {
+    for (auto It = Begin; It != End; ++It)
+      transformIRInstruction(&*It, State);
+  }
+
+  const BasicBlock::iterator &begin() { return Begin; }
+
+  const BasicBlock::iterator &end() { return End; }
+};
+
+/// Hold the indices of a specific scalar instruction. The VPIterationInstance
+/// spans the iterations of the original loop that correspond to a single
+/// iteration of the vectorized loop.
+struct VPIterationInstance {
+  unsigned Part;
+  unsigned Lane;
+};
+
+// Forward declaration.
+class BasicBlock;
+
+/// Hold additional information passed down when "executing" a VPlan, that is
+/// needed for generating IR. Also facilitates reuse of existing LV
+/// functionality.
+struct VPTransformState {
+
+  VPTransformState(unsigned VF, unsigned UF, class LoopInfo *LI,
+                   class DominatorTree *DT, IRBuilder<> &Builder,
+                   InnerLoopVectorizer *ILV, LoopVectorizationLegality *Legal)
+      : VF(VF), UF(UF), Instance(nullptr), LI(LI), DT(DT), Builder(Builder),
+        ILV(ILV), Legal(Legal) {}
+
+  /// Record the selected vectorization and unroll factors of the single loop
+  /// being vectorized.
+  unsigned VF;
+  unsigned UF;
+
+  /// Hold the indices to generate a specific scalar instruction. Null
+  /// indicates that all instances are to be generated, using either scalar
+  /// or vector instructions.
+  VPIterationInstance *Instance;
+
+  /// Hold state information used when constructing the CFG of the vectorized
+  /// Loop, traversing the VPBasicBlocks and generating corresponding IR
+  /// BasicBlocks.
+  struct CFGState {
+    // The previous VPBasicBlock visited. In the beginning set to null.
+    VPBasicBlock *PrevVPBB;
+    // The previous IR BasicBlock created or reused. In the beginning set to
+    // the new header BasicBlock.
+    BasicBlock *PrevBB;
+    // The last IR BasicBlock of the loop body. Set to the new latch
+    // BasicBlock, used for placing the newly created BasicBlocks.
+    BasicBlock *LastBB;
+    // A mapping of each VPBasicBlock to the corresponding BasicBlock. In case
+    // of replication, maps the BasicBlock of the last replica created.
+    SmallDenseMap<VPBasicBlock *, BasicBlock *> VPBB2IRBB;
+
+    CFGState() : PrevVPBB(nullptr), PrevBB(nullptr), LastBB(nullptr) {}
+  } CFG;
+
+  /// Hold pointer to LoopInfo to register new basic blocks in the loop.
+  class LoopInfo *LI;
+
+  /// Hold pointer to Dominator Tree to register new basic blocks in the loop.
+  class DominatorTree *DT;
+
+  /// Hold a reference to the IRBuilder used to generate IR code.
+  IRBuilder<> &Builder;
+
+  /// Hold a pointer to InnerLoopVectorizer to reuse its IR generation methods.
+  class InnerLoopVectorizer *ILV;
+
+  /// Hold a pointer to LoopVectorizationLegality to access its
+  /// IsUniformAfterVectorization method.
+  class LoopVectorizationLegality *Legal;
+};
+
+/// VPBlockBase is the building block of the Hierarchical CFG. A VPBlockBase
+/// can be either a VPBasicBlock or a VPRegionBlock.
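+///
+/// For example (illustrative): an if-then construct inside a loop body can be
+/// modeled as a VPRegionBlock whose entry VPBasicBlock branches to a "then"
+/// VPBasicBlock, with both rejoining at the region's exit VPBasicBlock.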
+///
+/// The Hierarchical CFG is a control-flow graph whose nodes are basic-blocks
+/// or Hierarchical CFG's. The Hierarchical CFG data structure we use is
+/// similar to the Tile Tree [1], where cross-Tile edges are lifted to connect
+/// Tiles instead of the original basic-blocks as in Sharir [2], promoting the
+/// Tile encapsulation. We use the terms Region and Block rather than Tile [1]
+/// to avoid confusion with loop tiling.
+///
+/// [1] "Register Allocation via Hierarchical Graph Coloring", David Callahan
+/// and Brian Koblenz, PLDI 1991
+///
+/// [2] "Structural analysis: A new approach to flow analysis in optimizing
+/// compilers", M. Sharir, Journal of Computer Languages, Jan. 1980
+///
+/// Note that in contrast to the IR BasicBlock, a VPBlockBase models its
+/// control-flow edges with successor and predecessor VPBlockBase directly,
+/// rather than through a Terminator branch or through predecessor branches
+/// that Use the VPBlockBase.
+class VPBlockBase {
+  friend class VPlanUtils;
+
+private:
+  const unsigned char VBID; // Subclass identifier (for isa/dyn_cast).
+
+  std::string Name;
+
+  /// The immediate VPRegionBlock which this VPBlockBase belongs to, or null if
+  /// it is a topmost VPBlockBase.
+  class VPRegionBlock *Parent;
+
+  /// List of predecessor blocks.
+  SmallVector<VPBlockBase *, 1> Predecessors;
+
+  /// List of successor blocks.
+  SmallVector<VPBlockBase *, 1> Successors;
+
+  /// \brief Successor selector, null for zero or single successor blocks.
+  VPConditionBitRecipeBase *ConditionBitRecipe;
+
+  /// \brief Add \p Successor as the last successor to this block.
+  void appendSuccessor(VPBlockBase *Successor) {
+    assert(Successor && "Cannot add nullptr successor!");
+    Successors.push_back(Successor);
+  }
+
+  /// \brief Add \p Predecessor as the last predecessor to this block.
+  void appendPredecessor(VPBlockBase *Predecessor) {
+    assert(Predecessor && "Cannot add nullptr predecessor!");
+    Predecessors.push_back(Predecessor);
+  }
+
+  /// \brief Remove \p Predecessor from the predecessors of this block.
+  void removePredecessor(VPBlockBase *Predecessor) {
+    auto Pos = std::find(Predecessors.begin(), Predecessors.end(),
+                         Predecessor);
+    assert(Pos != Predecessors.end() && "Predecessor does not exist");
+    Predecessors.erase(Pos);
+  }
+
+  /// \brief Remove \p Successor from the successors of this block.
+  void removeSuccessor(VPBlockBase *Successor) {
+    auto Pos = std::find(Successors.begin(), Successors.end(), Successor);
+    assert(Pos != Successors.end() && "Successor does not exist");
+    Successors.erase(Pos);
+  }
+
+protected:
+  VPBlockBase(const unsigned char SC, const std::string &N)
+      : VBID(SC), Name(N), Parent(nullptr), ConditionBitRecipe(nullptr) {}
+
+public:
+  /// An enumeration for keeping track of the concrete subclass of VPBlockBase
+  /// that is actually instantiated. Values of this enumeration are kept in the
+  /// VPBlockBase class's VBID field. They are used for concrete type
+  /// identification.
+  typedef enum { VPBasicBlockSC, VPRegionBlockSC } VPBlockTy;
+
+  virtual ~VPBlockBase() {}
+
+  const std::string &getName() const { return Name; }
+
+  /// \return an ID for the concrete type of this object.
+  /// This is used to implement the classof checks. This should not be used
+  /// for any other purpose, as the values may change as LLVM evolves.
+  unsigned getVPBlockID() const { return VBID; }
+
+  const class VPRegionBlock *getParent() const { return Parent; }
+
+  /// \return the VPBasicBlock that is the entry of this VPBlockBase,
+  /// recursively, if the latter is a VPRegionBlock.
+  /// Otherwise, if this VPBlockBase is a VPBasicBlock, it is returned.
+  const class VPBasicBlock *getEntryBasicBlock() const;
+
+  /// \return the VPBasicBlock that is the exit of this VPBlockBase,
+  /// recursively, if the latter is a VPRegionBlock. Otherwise, if this
+  /// VPBlockBase is a VPBasicBlock, it is returned.
+  const class VPBasicBlock *getExitBasicBlock() const;
+  class VPBasicBlock *getExitBasicBlock();
+
+  const SmallVectorImpl<VPBlockBase *> &getSuccessors() const {
+    return Successors;
+  }
+
+  const SmallVectorImpl<VPBlockBase *> &getPredecessors() const {
+    return Predecessors;
+  }
+
+  SmallVectorImpl<VPBlockBase *> &getSuccessors() { return Successors; }
+
+  SmallVectorImpl<VPBlockBase *> &getPredecessors() { return Predecessors; }
+
+  /// \return the successor of this VPBlockBase if it has a single successor.
+  /// Otherwise return a null pointer.
+  VPBlockBase *getSingleSuccessor() const {
+    return (Successors.size() == 1 ? *Successors.begin() : nullptr);
+  }
+
+  /// \return the predecessor of this VPBlockBase if it has a single
+  /// predecessor. Otherwise return a null pointer.
+  VPBlockBase *getSinglePredecessor() const {
+    return (Predecessors.size() == 1 ? *Predecessors.begin() : nullptr);
+  }
+
+  /// Returns the closest ancestor starting from "this", which has successors.
+  /// Returns the root ancestor if all ancestors have no successors.
+  VPBlockBase *getAncestorWithSuccessors();
+
+  /// Returns the closest ancestor starting from "this", which has
+  /// predecessors. Returns the root ancestor if all ancestors have no
+  /// predecessors.
+  VPBlockBase *getAncestorWithPredecessors();
+
+  /// \return the successors either attached directly to this VPBlockBase or,
+  /// if this VPBlockBase is the exit block of a VPRegionBlock and has no
+  /// successors of its own, search recursively for the first enclosing
+  /// VPRegionBlock that has successors and return them. If no such
+  /// VPRegionBlock exists, return the (empty) successors of the topmost
+  /// VPBlockBase reached.
+  const SmallVectorImpl<VPBlockBase *> &getHierarchicalSuccessors() {
+    return getAncestorWithSuccessors()->getSuccessors();
+  }
+
+  /// \return the hierarchical successor of this VPBlockBase if it has a single
+  /// hierarchical successor. Otherwise return a null pointer.
+  VPBlockBase *getSingleHierarchicalSuccessor() {
+    return getAncestorWithSuccessors()->getSingleSuccessor();
+  }
+
+  /// \return the predecessors either attached directly to this VPBlockBase or,
+  /// if this VPBlockBase is the entry block of a VPRegionBlock and has no
+  /// predecessors of its own, search recursively for the first enclosing
+  /// VPRegionBlock that has predecessors and return them. If no such
+  /// VPRegionBlock exists, return the (empty) predecessors of the topmost
+  /// VPBlockBase reached.
+  const SmallVectorImpl<VPBlockBase *> &getHierarchicalPredecessors() {
+    return getAncestorWithPredecessors()->getPredecessors();
+  }
+
+  /// \return the hierarchical predecessor of this VPBlockBase if it has a
+  /// single hierarchical predecessor. Otherwise return a null pointer.
+  VPBlockBase *getSingleHierarchicalPredecessor() {
+    return getAncestorWithPredecessors()->getSinglePredecessor();
+  }
+
+  /// If a VPBlockBase has two successors, this is the Recipe that will
+  /// generate the condition bit selecting the successor, and feeding the
+  /// terminating conditional branch. Otherwise this is null.
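+  /// For example (illustrative), code generation consults this recipe to
+  /// obtain the IR value feeding the block's conditional branch:
+  /// \code
+  ///   Value *Bit = Block->getConditionBitRecipe()->getConditionBit();
+  /// \endcode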
+  VPConditionBitRecipeBase *getConditionBitRecipe() {
+    return ConditionBitRecipe;
+  }
+
+  const VPConditionBitRecipeBase *getConditionBitRecipe() const {
+    return ConditionBitRecipe;
+  }
+
+  void setConditionBitRecipe(VPConditionBitRecipeBase *R) {
+    ConditionBitRecipe = R;
+  }
+
+  /// The method which generates all new IR instructions that correspond to
+  /// this VPBlockBase in the vectorized version, thereby "executing" the
+  /// VPlan.
+  virtual void vectorize(struct VPTransformState *State) = 0;
+
+  // Delete all blocks reachable from a given VPBlockBase, inclusive.
+  static void deleteCFG(VPBlockBase *Entry);
+};
+
+/// VPBasicBlock serves as the leaf of the Hierarchical CFG. It represents a
+/// sequence of instructions that will appear consecutively in a basic block
+/// of the vectorized version. The VPBasicBlock takes care of the control-flow
+/// relations with other VPBasicBlock's and Regions. It holds a sequence of
+/// zero or more VPRecipe's that take care of representing the instructions.
+/// A VPBasicBlock that holds no VPRecipe's represents no instructions; this
+/// may happen, e.g., to support disjoint Regions and to ensure Regions have a
+/// single exit, possibly an empty one.
+///
+/// Note that in contrast to the IR BasicBlock, a VPBasicBlock models its
+/// control-flow edges with successor and predecessor VPBlockBase directly,
+/// rather than through a Terminator branch or through predecessor branches
+/// that "use" the VPBasicBlock.
+class VPBasicBlock : public VPBlockBase {
+  friend class VPlanUtils;
+
+public:
+  typedef iplist<VPRecipeBase> RecipeListTy;
+
+private:
+  /// The list of VPRecipes, held in order of instructions to generate.
+  RecipeListTy Recipes;
+
+public:
+  /// Recipe iterators...
+  typedef RecipeListTy::iterator iterator;
+  typedef RecipeListTy::const_iterator const_iterator;
+  typedef RecipeListTy::reverse_iterator reverse_iterator;
+  typedef RecipeListTy::const_reverse_iterator const_reverse_iterator;
+
+  //===--------------------------------------------------------------------===//
+  /// Recipe iterator methods
+  ///
+  inline iterator begin() { return Recipes.begin(); }
+  inline const_iterator begin() const { return Recipes.begin(); }
+  inline iterator end() { return Recipes.end(); }
+  inline const_iterator end() const { return Recipes.end(); }
+
+  inline reverse_iterator rbegin() { return Recipes.rbegin(); }
+  inline const_reverse_iterator rbegin() const { return Recipes.rbegin(); }
+  inline reverse_iterator rend() { return Recipes.rend(); }
+  inline const_reverse_iterator rend() const { return Recipes.rend(); }
+
+  inline size_t size() const { return Recipes.size(); }
+  inline bool empty() const { return Recipes.empty(); }
+  inline const VPRecipeBase &front() const { return Recipes.front(); }
+  inline VPRecipeBase &front() { return Recipes.front(); }
+  inline const VPRecipeBase &back() const { return Recipes.back(); }
+  inline VPRecipeBase &back() { return Recipes.back(); }
+
+  /// \brief Return the underlying recipe list container.
+  ///
+  /// Currently you need to access the underlying recipe list container
+  /// directly if you want to modify it.
+  const RecipeListTy &getInstList() const { return Recipes; }
+  RecipeListTy &getInstList() { return Recipes; }
+
+  /// \brief Returns a pointer to a member of the recipe list.
+  static RecipeListTy VPBasicBlock::*getSublistAccess(VPRecipeBase *) {
+    return &VPBasicBlock::Recipes;
+  }
+
+  VPBasicBlock(const std::string &Name) : VPBlockBase(VPBasicBlockSC, Name) {}
+
+  ~VPBasicBlock() { Recipes.clear(); }
+
+  /// Method to support type inquiry through isa, cast, and dyn_cast.
+  static inline bool classof(const VPBlockBase *V) {
+    return V->getVPBlockID() == VPBlockBase::VPBasicBlockSC;
+  }
+
+  /// Augment the existing recipes of a VPBasicBlock with an additional
+  /// \p Recipe at a position given by an existing recipe \p Before. If
+  /// \p Before is null, \p Recipe is appended as the last recipe.
+  void addRecipe(VPRecipeBase *Recipe, VPRecipeBase *Before = nullptr) {
+    Recipe->Parent = this;
+    if (!Before) {
+      Recipes.push_back(Recipe);
+      return;
+    }
+    assert(Before->Parent == this &&
+           "Insertion before point not in this basic block.");
+    Recipes.insert(Before->getIterator(), Recipe);
+  }
+
+  /// The method which generates all new IR instructions that correspond to
+  /// this VPBasicBlock in the vectorized version, thereby "executing" the
+  /// VPlan.
+  void vectorize(struct VPTransformState *State) override;
+
+  /// Retrieve the list of VPRecipes that belong to this VPBasicBlock.
+  const RecipeListTy &getRecipes() const { return Recipes; }
+
+private:
+  /// Create an IR BasicBlock to hold the instructions vectorized from this
+  /// VPBasicBlock, and return it. Update the CFGState accordingly.
+  BasicBlock *createEmptyBasicBlock(VPTransformState::CFGState &CFG);
+};
+
+/// VPRegionBlock represents a collection of VPBasicBlocks and VPRegionBlocks
+/// which form a single-entry-single-exit subgraph of the CFG in the vectorized
+/// code.
+///
+/// A VPRegionBlock may indicate that its contents are to be replicated several
+/// times. This is designed to support predicated scalarization, in which a
+/// scalar if-then code structure needs to be generated VF * UF times. Having
+/// this replication indicator helps to keep a single VPlan for multiple
+/// candidate VF's; the actual replication takes place only once the desired VF
+/// and UF have been determined.
+///
+/// **Design principle:** when some additional information relates to an SESE
+/// set of VPBlockBase, we use a VPRegionBlock to wrap them and attach the
+/// information to it. For example, a VPRegionBlock can be used to indicate
+/// that a scalarized SESE region is to be replicated, and that a vectorized
+/// SESE region can retain its internal control-flow, independent of the
+/// control-flow external to the region.
+class VPRegionBlock : public VPBlockBase {
+  friend class VPlanUtils;
+
+private:
+  /// Hold the Single Entry of the SESE region represented by the
+  /// VPRegionBlock.
+  VPBlockBase *Entry;
+
+  /// Hold the Single Exit of the SESE region represented by the VPRegionBlock.
+  VPBlockBase *Exit;
+
+  /// A VPRegionBlock can represent either a single instance of its
+  /// VPBlockBases, or multiple (VF * UF) replicated instances. The latter is
+  /// used when the internal SESE region handles a single scalarized lane.
+  bool IsReplicator;
+
+public:
+  VPRegionBlock(const std::string &Name)
+      : VPBlockBase(VPRegionBlockSC, Name), Entry(nullptr), Exit(nullptr),
+        IsReplicator(false) {}
+
+  ~VPRegionBlock() {
+    if (Entry)
+      deleteCFG(Entry);
+  }
+
+  /// Method to support type inquiry through isa, cast, and dyn_cast.
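+  /// For example (illustrative):
+  /// \code
+  ///   if (auto *Region = dyn_cast<VPRegionBlock>(Block))
+  ///     if (Region->isReplicator())
+  ///       ; // Region will be code-generated VF * UF times.
+  /// \endcode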
+  static inline bool classof(const VPBlockBase *V) {
+    return V->getVPBlockID() == VPBlockBase::VPRegionBlockSC;
+  }
+
+  VPBlockBase *getEntry() { return Entry; }
+
+  VPBlockBase *getExit() { return Exit; }
+
+  const VPBlockBase *getEntry() const { return Entry; }
+
+  const VPBlockBase *getExit() const { return Exit; }
+
+  /// An indicator of whether the VPRegionBlock represents single or multiple
+  /// instances.
+  bool isReplicator() const { return IsReplicator; }
+
+  void setReplicator(bool ToReplicate) { IsReplicator = ToReplicate; }
+
+  /// The method which generates the new IR instructions that correspond to
+  /// this VPRegionBlock in the vectorized version, thereby "executing" the
+  /// VPlan.
+  void vectorize(struct VPTransformState *State) override;
+};
+
+/// A VPlan represents a candidate for vectorization, encoding various
+/// decisions taken to produce efficient vector code, including: which
+/// instructions are to be vectorized or scalarized, and which branches are to
+/// appear in the vectorized version. It models the control-flow of the
+/// candidate vectorized version explicitly, and holds prescriptions for
+/// generating the code for this version from a given IR code.
+/// VPlan takes a "scenario-based approach" to vectorization planning -
+/// different scenarios, corresponding to making different decisions, can be
+/// modeled using different VPlans.
+/// The corresponding IR code is required to be SESE.
+/// The vectorized version is represented using a Hierarchical CFG.
+class VPlan {
+  friend class VPlanUtils;
+  friend class VPlanUtilsLoopVectorizer;
+
+private:
+  /// Hold the single entry to the Hierarchical CFG of the VPlan.
+  VPBlockBase *Entry;
+
+  /// The IR instructions which are to be transformed to fill the vectorized
+  /// version are held as ingredients inside the VPRecipe's of the VPlan. Hold
+  /// a reverse mapping to locate the VPRecipe an IR instruction belongs to.
+  /// This serves optimizations that operate on the VPlan.
+  DenseMap<Instruction *, VPRecipeBase *> Inst2Recipe;
+
+public:
+  VPlan() : Entry(nullptr) {}
+
+  ~VPlan() {
+    if (Entry)
+      VPBlockBase::deleteCFG(Entry);
+  }
+
+  /// Generate the IR code for this VPlan.
+  void vectorize(struct VPTransformState *State);
+
+  VPBlockBase *getEntry() { return Entry; }
+  const VPBlockBase *getEntry() const { return Entry; }
+
+  void setEntry(VPBlockBase *Block) { Entry = Block; }
+
+  /// Retrieve the VPRecipe a given instruction \p Inst belongs to in the
+  /// VPlan. Returns null if it belongs to no VPRecipe.
+  VPRecipeBase *getRecipe(Instruction *Inst) {
+    auto It = Inst2Recipe.find(Inst);
+    if (It == Inst2Recipe.end())
+      return nullptr;
+    return It->second;
+  }
+
+  void setInst2Recipe(Instruction *I, VPRecipeBase *R) { Inst2Recipe[I] = R; }
+
+  void resetInst2Recipe(Instruction *I) { Inst2Recipe.erase(I); }
+
+  /// Retrieve the VPBasicBlock a given instruction \p Inst belongs to in the
+  /// VPlan. Returns null if it belongs to no VPRecipe.
+  VPBasicBlock *getBasicBlock(Instruction *Inst) {
+    VPRecipeBase *Recipe = getRecipe(Inst);
+    if (!Recipe)
+      return nullptr;
+    return Recipe->getParent();
+  }
+
+private:
+  /// \return true if the given VPBlockBase and its successors will produce a
+  /// single basic block when vectorized, recursively. \returns false
+  /// otherwise.
+  bool willProduceSingleBasicBlock(const VPBlockBase *Block) const;
+
+  /// Add to the given dominator tree the header block and every new basic
+  /// block that was created between it and the latch block, inclusive.
+  void updateDominatorTree(class DominatorTree *DT,
+                           BasicBlock *LoopPreHeaderBB,
+                           BasicBlock *LoopLatchBB);
+};
+
+/// The VPlanUtils class provides interfaces for the construction and
+/// manipulation of a VPlan.
+class VPlanUtils {
+private:
+  /// Unique ID generator.
+  static unsigned NextOrdinal;
+
+protected:
+  VPlan *Plan;
+
+  typedef iplist<VPRecipeBase> RecipeListTy;
+  RecipeListTy *getRecipes(VPBasicBlock *Block) { return &Block->Recipes; }
+
+public:
+  VPlanUtils(VPlan *Plan) : Plan(Plan) {}
+
+  ~VPlanUtils() {}
+
+  /// Create a unique name for a new VPlan entity such as a VPBasicBlock or
+  /// VPRegionBlock.
+  std::string createUniqueName(const char *Prefix) {
+    std::string S;
+    raw_string_ostream RSO(S);
+    RSO << Prefix << NextOrdinal++;
+    return RSO.str();
+  }
+
+  /// Add a given \p Recipe as the last recipe of a given VPBasicBlock.
+  void appendRecipeToBasicBlock(VPRecipeBase *Recipe, VPBasicBlock *ToVPBB) {
+    assert(Recipe && "No recipe to append.");
+    assert(!Recipe->Parent && "Recipe already in VPlan");
+    ToVPBB->addRecipe(Recipe);
+  }
+
+  /// Create a new empty VPBasicBlock and return it.
+  VPBasicBlock *createBasicBlock() {
+    VPBasicBlock *BasicBlock = new VPBasicBlock(createUniqueName("BB"));
+    return BasicBlock;
+  }
+
+  /// Create a new VPBasicBlock with a single \p Recipe and return it.
+  VPBasicBlock *createBasicBlock(VPRecipeBase *Recipe) {
+    VPBasicBlock *BasicBlock = new VPBasicBlock(createUniqueName("BB"));
+    appendRecipeToBasicBlock(Recipe, BasicBlock);
+    return BasicBlock;
+  }
+
+  /// Create a new, empty VPRegionBlock, with no blocks.
+  VPRegionBlock *createRegion(bool IsReplicator) {
+    VPRegionBlock *Region = new VPRegionBlock(createUniqueName("region"));
+    setReplicator(Region, IsReplicator);
+    return Region;
+  }
+
+  /// Set the entry VPBlockBase of a given VPRegionBlock to a given \p Block.
+  /// \p Block is to have no predecessors.
+  void setRegionEntry(VPRegionBlock *Region, VPBlockBase *Block) {
+    assert(Block->Predecessors.empty() &&
+           "Entry block cannot have predecessors.");
+    Region->Entry = Block;
+    Block->Parent = Region;
+  }
+
+  /// Set the exit VPBlockBase of a given VPRegionBlock to a given \p Block.
+  /// \p Block is to have no successors.
+  void setRegionExit(VPRegionBlock *Region, VPBlockBase *Block) {
+    assert(Block->Successors.empty() && "Exit block cannot have successors.");
+    Region->Exit = Block;
+    Block->Parent = Region;
+  }
+
+  void setReplicator(VPRegionBlock *Region, bool ToReplicate) {
+    Region->setReplicator(ToReplicate);
+  }
+
+  /// Sets a given VPBlockBase \p Successor as the single successor of another
+  /// VPBlockBase \p Block. The parent of \p Block is copied to be the parent
+  /// of \p Successor.
+  void setSuccessor(VPBlockBase *Block, VPBlockBase *Successor) {
+    assert(Block->getSuccessors().empty() && "Block successors already set.");
+    Block->appendSuccessor(Successor);
+    Successor->appendPredecessor(Block);
+    Successor->Parent = Block->Parent;
+  }
+
+  /// Sets two given VPBlockBases \p IfTrue and \p IfFalse to be the two
+  /// successors of another VPBlockBase \p Block. A given
+  /// VPConditionBitRecipeBase provides the control selector. The parent of
+  /// \p Block is copied to be the parent of \p IfTrue and \p IfFalse.
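+  /// For example (illustrative; Utils is assumed to be a VPlanUtils instance
+  /// and CBR a concrete VPConditionBitRecipeBase), building an if-then
+  /// construct:
+  /// \code
+  ///   VPBasicBlock *Head = Utils.createBasicBlock();
+  ///   VPBasicBlock *Then = Utils.createBasicBlock();
+  ///   VPBasicBlock *Tail = Utils.createBasicBlock();
+  ///   Utils.setTwoSuccessors(Head, CBR, Then, Tail);
+  ///   Utils.setSuccessor(Then, Tail);
+  /// \endcode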
+  void setTwoSuccessors(VPBlockBase *Block, VPConditionBitRecipeBase *R,
+                        VPBlockBase *IfTrue, VPBlockBase *IfFalse) {
+    assert(Block->getSuccessors().empty() && "Block successors already set.");
+    Block->setConditionBitRecipe(R);
+    Block->appendSuccessor(IfTrue);
+    Block->appendSuccessor(IfFalse);
+    IfTrue->appendPredecessor(Block);
+    IfFalse->appendPredecessor(Block);
+    IfTrue->Parent = Block->Parent;
+    IfFalse->Parent = Block->Parent;
+  }
+
+  /// Given two VPBlockBases \p From and \p To, disconnect them from each
+  /// other.
+  void disconnectBlocks(VPBlockBase *From, VPBlockBase *To) {
+    From->removeSuccessor(To);
+    To->removePredecessor(From);
+  }
+};
+
+/// VPlanPrinter prints a given VPlan to a given output stream. The printing is
+/// indented and follows the dot format.
+class VPlanPrinter {
+private:
+  raw_ostream &OS;
+  const VPlan &Plan;
+  unsigned Depth;
+  unsigned TabLength = 2;
+  std::string Indent;
+
+  /// Handle indentation.
+  void buildIndent() { Indent = std::string(Depth * TabLength, ' '); }
+  void resetDepth() {
+    Depth = 1;
+    buildIndent();
+  }
+  void increaseDepth() {
+    ++Depth;
+    buildIndent();
+  }
+  void decreaseDepth() {
+    --Depth;
+    buildIndent();
+  }
+
+  /// Dump each element of VPlan.
+  void dumpBlock(const VPBlockBase *Block);
+  void dumpEdges(const VPBlockBase *Block);
+  void dumpBasicBlock(const VPBasicBlock *BasicBlock);
+  void dumpRegion(const VPRegionBlock *Region);
+
+  const char *getNodePrefix(const VPBlockBase *Block);
+  const std::string &getReplicatorString(const VPRegionBlock *Region);
+  void drawEdge(const VPBlockBase *From, const VPBlockBase *To, bool Hidden,
+                const Twine &Label);
+
+public:
+  VPlanPrinter(raw_ostream &O, const VPlan &P) : OS(O), Plan(P) {}
+  void dump(const std::string &Title = "");
+};
+
+//===--------------------------------------------------------------------===//
+// GraphTraits specializations for VPlan/VPRegionBlock Control-Flow Graphs  //
+//===--------------------------------------------------------------------===//
+
+// Provide specializations of GraphTraits to be able to treat a VPRegionBlock
+// as a graph of VPBlockBases...
+
+template <> struct GraphTraits<VPBlockBase *> {
+  typedef VPBlockBase *NodeRef;
+  typedef SmallVectorImpl<VPBlockBase *>::iterator ChildIteratorType;
+
+  static NodeRef getEntryNode(NodeRef N) { return N; }
+
+  static inline ChildIteratorType child_begin(NodeRef N) {
+    return N->getSuccessors().begin();
+  }
+
+  static inline ChildIteratorType child_end(NodeRef N) {
+    return N->getSuccessors().end();
+  }
+};
+
+template <> struct GraphTraits<const VPBlockBase *> {
+  typedef const VPBlockBase *NodeRef;
+  typedef SmallVectorImpl<VPBlockBase *>::const_iterator ChildIteratorType;
+
+  static NodeRef getEntryNode(NodeRef N) { return N; }
+
+  static inline ChildIteratorType child_begin(NodeRef N) {
+    return N->getSuccessors().begin();
+  }
+
+  static inline ChildIteratorType child_end(NodeRef N) {
+    return N->getSuccessors().end();
+  }
+};
+
+// Provide specializations of GraphTraits to be able to treat a VPRegionBlock
+// as a graph of VPBasicBlocks... and to walk it in inverse order. Inverse
+// order for a VPRegionBlock is considered to be when traversing the
+// predecessor edges of a VPBlockBase instead of the successor edges.
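+// For example (illustrative), the Inverse specialization below is what lets
+// generic graph algorithms walk the CFG backwards, e.g.:
+//   for (VPBlockBase *Pred : inverse_depth_first(Region->getExit()))
+//     dbgs() << Pred->getName() << '\n';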
+//

+template <> struct GraphTraits<Inverse<VPBlockBase *>> {
+  typedef VPBlockBase *NodeRef;
+  typedef SmallVectorImpl<VPBlockBase *>::iterator ChildIteratorType;
+
+  static Inverse<VPBlockBase *> getEntryNode(Inverse<VPBlockBase *> B) {
+    return B;
+  }
+
+  static inline ChildIteratorType child_begin(NodeRef N) {
+    return N->getPredecessors().begin();
+  }
+
+  static inline ChildIteratorType child_end(NodeRef N) {
+    return N->getPredecessors().end();
+  }
+};
+
+} // namespace llvm
+
+#endif // LLVM_TRANSFORMS_VECTORIZE_VPLAN_H
Index: lib/Transforms/Vectorize/VPlan.cpp
===================================================================
--- /dev/null
+++ lib/Transforms/Vectorize/VPlan.cpp
@@ -0,0 +1,417 @@
+//===- VPlan.cpp - Vectorizer Plan ----------------------------------------===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+//
+// This is the LLVM vectorization plan. It represents a candidate for
+// vectorization, making it possible to plan and optimize how to vectorize a
+// given loop before generating LLVM-IR.
+// The vectorizer uses vectorization plans to estimate the costs of potential
+// candidates, and if profitable, to execute the desired plan, generating
+// vector LLVM-IR code.
+//
+//===----------------------------------------------------------------------===//
+
+#include "VPlan.h"
+#include "llvm/ADT/PostOrderIterator.h"
+#include "llvm/Analysis/LoopInfo.h"
+#include "llvm/IR/BasicBlock.h"
+#include "llvm/IR/Dominators.h"
+#include "llvm/Support/GraphWriter.h"
+#include "llvm/Transforms/Utils/BasicBlockUtils.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "vplan"
+
+unsigned VPlanUtils::NextOrdinal = 1;
+
+VPOneByOneRecipeBase::VPOneByOneRecipeBase(unsigned char SC,
+                                           const BasicBlock::iterator B,
+                                           const BasicBlock::iterator E,
+                                           class VPlan *Plan)
+    : VPRecipeBase(SC), Begin(B), End(E) {
+  for (auto It = B; It != E; ++It)
+    Plan->setInst2Recipe(&*It, this);
+}
+
+/// \return the VPBasicBlock that is the entry of Block, possibly indirectly.
+const VPBasicBlock *VPBlockBase::getEntryBasicBlock() const {
+  const VPBlockBase *Block = this;
+  while (const VPRegionBlock *Region = dyn_cast<VPRegionBlock>(Block))
+    Block = Region->getEntry();
+  return cast<VPBasicBlock>(Block);
+}
+
+/// \return the VPBasicBlock that is the exit of Block, possibly indirectly.
+const VPBasicBlock *VPBlockBase::getExitBasicBlock() const {
+  const VPBlockBase *Block = this;
+  while (const VPRegionBlock *Region = dyn_cast<VPRegionBlock>(Block))
+    Block = Region->getExit();
+  return cast<VPBasicBlock>(Block);
+}
+
+VPBasicBlock *VPBlockBase::getExitBasicBlock() {
+  VPBlockBase *Block = this;
+  while (VPRegionBlock *Region = dyn_cast<VPRegionBlock>(Block))
+    Block = Region->getExit();
+  return cast<VPBasicBlock>(Block);
+}
+
+/// Returns the closest ancestor, starting from "this", which has successors.
+/// Returns the root ancestor if all ancestors have no successors.
+VPBlockBase *VPBlockBase::getAncestorWithSuccessors() {
+  if (!Successors.empty() || !Parent)
+    return this;
+  assert(Parent->getExit() == this &&
+         "Block w/o successors not the exit of its parent.");
+  return Parent->getAncestorWithSuccessors();
+}
+
+/// Returns the closest ancestor, starting from "this", which has predecessors.
+/// Returns the root ancestor if all ancestors have no predecessors.
+VPBlockBase *VPBlockBase::getAncestorWithPredecessors() {
+  if (!Predecessors.empty() || !Parent)
+    return this;
+  assert(Parent->getEntry() == this &&
+         "Block w/o predecessors not the entry of its parent.");
+  return Parent->getAncestorWithPredecessors();
+}
+
+void VPBlockBase::deleteCFG(VPBlockBase *Entry) {
+  SmallVector<VPBlockBase *, 8> Blocks;
+  for (VPBlockBase *Block : depth_first(Entry))
+    Blocks.push_back(Block);
+
+  for (VPBlockBase *Block : Blocks)
+    delete Block;
+}
+
+BasicBlock *
+VPBasicBlock::createEmptyBasicBlock(VPTransformState::CFGState &CFG) {
+  // BB stands for IR BasicBlocks. VPBB stands for VPlan VPBasicBlocks.
+  // Pred stands for Predecessor. Prev stands for Previous, last
+  // visited/created.
+  BasicBlock *PrevBB = CFG.PrevBB;
+  BasicBlock *NewBB = BasicBlock::Create(PrevBB->getContext(), "VPlannedBB",
+                                         PrevBB->getParent(), CFG.LastBB);
+  DEBUG(dbgs() << "LV: created " << NewBB->getName() << '\n');
+
+  // Hook up the new basic block to its predecessors.
+  for (VPBlockBase *PredVPBlock : getHierarchicalPredecessors()) {
+    VPBasicBlock *PredVPBB = PredVPBlock->getExitBasicBlock();
+    BasicBlock *PredBB = CFG.VPBB2IRBB[PredVPBB];
+    DEBUG(dbgs() << "LV: draw edge from " << PredBB->getName() << '\n');
+    if (isa<UnreachableInst>(PredBB->getTerminator())) {
+      PredBB->getTerminator()->eraseFromParent();
+      BranchInst::Create(NewBB, PredBB);
+    } else {
+      // Replace old unconditional branch with new conditional branch.
+      // Note: we rely on traversing the successors in order.
+      BasicBlock *FirstSuccBB = PredBB->getSingleSuccessor();
+      PredBB->getTerminator()->eraseFromParent();
+      Value *Bit = PredVPBlock->getConditionBitRecipe()->getConditionBit();
+      assert(Bit && "Cannot create conditional branch with empty bit.");
+      BranchInst::Create(FirstSuccBB, NewBB, Bit, PredBB);
+    }
+  }
+  return NewBB;
+}
+
+void VPBasicBlock::vectorize(VPTransformState *State) {
+  VPIterationInstance *I = State->Instance;
+  bool Replica = I && !(I->Part == 0 && I->Lane == 0);
+  VPBasicBlock *PrevVPBB = State->CFG.PrevVPBB;
+  VPBlockBase *SingleHPred = nullptr;
+  BasicBlock *NewBB = State->CFG.PrevBB; // Reuse it if possible.
+
+  // 1. Create an IR basic block, or reuse the last one if possible.
+  // The last IR basic block is reused in three cases:
+  // A. the first VPBB reuses the header BB - when PrevVPBB is null;
+  // B. when the current VPBB has a single (hierarchical) predecessor which
+  //    is PrevVPBB and the latter has a single (hierarchical) successor; and
+  // C. when the current VPBB is an entry of a region replica - where PrevVPBB
+  //    is the exit of this region from a previous instance.
+  if (PrevVPBB && /* A */
+      !((SingleHPred = getSingleHierarchicalPredecessor()) &&
+        SingleHPred->getExitBasicBlock() == PrevVPBB &&
+        PrevVPBB->getSingleHierarchicalSuccessor()) && /* B */
+      !(Replica && getPredecessors().empty())) {       /* C */
+
+    NewBB = createEmptyBasicBlock(State->CFG);
+    State->Builder.SetInsertPoint(NewBB);
+    // Temporarily terminate with unreachable until CFG is rewired.
+    UnreachableInst *Terminator = State->Builder.CreateUnreachable();
+    State->Builder.SetInsertPoint(Terminator);
+    // Register NewBB in its loop. In innermost loops it's the same for all
+    // BB's.
+    Loop *L = State->LI->getLoopFor(State->CFG.LastBB);
+    L->addBasicBlockToLoop(NewBB, *State->LI);
+    State->CFG.PrevBB = NewBB;
+  }
+
+  // 2. Fill the IR basic block with IR instructions.
+ DEBUG(dbgs() << "LV: vectorizing VPBB:" << getName() + << " in BB:" << NewBB->getName() << '\n'); + + State->CFG.VPBB2IRBB[this] = NewBB; + State->CFG.PrevVPBB = this; + + for (VPRecipeBase &Recipe : Recipes) + Recipe.vectorize(*State); + + DEBUG(dbgs() << "LV: filled BB:" << *NewBB); +} + +void VPRegionBlock::vectorize(VPTransformState *State) { + ReversePostOrderTraversal RPOT(Entry); + typedef typename std::vector::reverse_iterator rpo_iterator; + + if (!isReplicator()) { + // Visit the VPBlocks connected to \p this, starting from it. + for (rpo_iterator I = RPOT.begin(); I != RPOT.end(); ++I) { + DEBUG(dbgs() << "LV: VPBlock in RPO " << (*I)->getName() << '\n'); + (*I)->vectorize(State); + } + return; + } + + assert(!State->Instance && + "Replicating a Region only in null context instance."); + VPIterationInstance I; + State->Instance = &I; + + for (I.Part = 0; I.Part < State->UF; ++I.Part) + for (I.Lane = 0; I.Lane < State->VF; ++I.Lane) + // Visit the VPBlocks connected to \p this, starting from it. + for (rpo_iterator I = RPOT.begin(); I != RPOT.end(); ++I) { + DEBUG(dbgs() << "LV: VPBlock in RPO " << (*I)->getName() << '\n'); + (*I)->vectorize(State); + } + + State->Instance = nullptr; +} + +/// Generate the code inside the body of the vectorized loop. Assumes a single +/// LoopVectorBody basic block was created for this; introduces additional +/// basic blocks as needed, and fills them all. +void VPlan::vectorize(VPTransformState *State) { + BasicBlock *VectorPreHeaderBB = State->CFG.PrevBB; + BasicBlock *VectorHeaderBB = VectorPreHeaderBB->getSingleSuccessor(); + assert(VectorHeaderBB && "Loop preheader does not have a single successor."); + BasicBlock *VectorLatchBB = VectorHeaderBB; + auto CurrIP = State->Builder.saveIP(); + + // 1. Make room to generate basic blocks inside loop body if needed. + bool SingleBB = willProduceSingleBasicBlock(Entry); + if (!SingleBB) { + VectorLatchBB = VectorHeaderBB->splitBasicBlock( + VectorHeaderBB->getFirstInsertionPt(), "vector.body.latch"); + Loop *L = State->LI->getLoopFor(VectorHeaderBB); + L->addBasicBlockToLoop(VectorLatchBB, *State->LI); + // Remove the edge between Header and Latch to allow other connections. + // Temporarily terminate with unreachable until CFG is rewired. + // Note: this asserts xform code's assumption that getFirstInsertionPt() + // can be dereferenced into an Instruction. + VectorHeaderBB->getTerminator()->eraseFromParent(); + State->Builder.SetInsertPoint(VectorHeaderBB); + UnreachableInst *Terminator = State->Builder.CreateUnreachable(); + State->Builder.SetInsertPoint(Terminator); + } + + // 2. Generate code in loop body of vectorized version. + State->CFG.PrevVPBB = nullptr; + State->CFG.PrevBB = VectorHeaderBB; + State->CFG.LastBB = VectorLatchBB; + + for (VPBlockBase *CurrentBlock = Entry; CurrentBlock != nullptr; + CurrentBlock = CurrentBlock->getSingleSuccessor()) { + assert(CurrentBlock->getSuccessors().size() <= 1 && + "Multiple successors at top level."); + CurrentBlock->vectorize(State); + } + + // 3. If a temporary latch was created merge it with last basic block created. + if (!SingleBB) { + BasicBlock *LastBB = State->CFG.PrevBB; + // Connect LastBB to VectorLatchBB to facilitate their merge. + assert(isa(LastBB->getTerminator()) && + "Expected VPlan CFG to terminate with unreachable"); + LastBB->getTerminator()->eraseFromParent(); + BranchInst::Create(VectorLatchBB, LastBB); + + // Merge LastBB with Latch. 
+    bool merged = MergeBlockIntoPredecessor(VectorLatchBB, nullptr, State->LI);
+    (void)merged;
+    assert(merged && "Could not merge last basic block with latch.");
+    VectorLatchBB = LastBB;
+  }
+
+  updateDominatorTree(State->DT, VectorPreHeaderBB, VectorLatchBB);
+  State->Builder.restoreIP(CurrIP);
+}
+
+bool VPlan::willProduceSingleBasicBlock(const VPBlockBase *Block) const {
+  if (Block->getSuccessors().size() > 1)
+    return false;
+  if (const VPRegionBlock *Region = dyn_cast<VPRegionBlock>(Block))
+    if (!willProduceSingleBasicBlock(Region->getEntry()))
+      return false;
+  if (const VPBlockBase *Next = Block->getSingleSuccessor())
+    if (!willProduceSingleBasicBlock(Next))
+      return false;
+  return true;
+}
+
+void VPlan::updateDominatorTree(DominatorTree *DT, BasicBlock *LoopPreHeaderBB,
+                                BasicBlock *LoopLatchBB) {
+  BasicBlock *LoopHeaderBB = LoopPreHeaderBB->getSingleSuccessor();
+  assert(LoopHeaderBB && "Loop preheader does not have a single successor.");
+  DT->addNewBlock(LoopHeaderBB, LoopPreHeaderBB);
+  // The vector body may be more than a single basic block by this point.
+  // Update the dominator tree information inside the vector body by
+  // propagating it from header to latch, expecting only triangular
+  // control-flow, if any.
+  BasicBlock *PostDomSucc = nullptr;
+  for (auto *BB = LoopHeaderBB; BB != LoopLatchBB; BB = PostDomSucc) {
+    // Get the list of successors of this block.
+    std::vector<BasicBlock *> Succs(succ_begin(BB), succ_end(BB));
+    assert(Succs.size() <= 2 &&
+           "Basic block in vector loop has more than 2 successors.");
+    PostDomSucc = Succs[0];
+    if (Succs.size() == 1) {
+      assert(PostDomSucc->getSinglePredecessor() &&
+             "PostDom successor has more than one predecessor.");
+      DT->addNewBlock(PostDomSucc, BB);
+      continue;
+    }
+    BasicBlock *InterimSucc = Succs[1];
+    if (PostDomSucc->getSingleSuccessor() == InterimSucc) {
+      PostDomSucc = Succs[1];
+      InterimSucc = Succs[0];
+    }
+    assert(InterimSucc->getSingleSuccessor() == PostDomSucc &&
+           "One successor of a basic block does not lead to the other.");
+    assert(InterimSucc->getSinglePredecessor() &&
+           "Interim successor has more than one predecessor.");
+    assert(std::distance(pred_begin(PostDomSucc), pred_end(PostDomSucc)) == 2 &&
+           "PostDom successor has more than two predecessors.");
+    DT->addNewBlock(InterimSucc, BB);
+    DT->addNewBlock(PostDomSucc, BB);
+  }
+}
+
+const char *VPlanPrinter::getNodePrefix(const VPBlockBase *Block) {
+  if (isa<VPBasicBlock>(Block))
+    return "";
+  assert(isa<VPRegionBlock>(Block) && "Unsupported kind of VPBlock.");
+  return "cluster_";
+}
+
+const std::string &
+VPlanPrinter::getReplicatorString(const VPRegionBlock *Region) {
+  static std::string ReplicatorString(DOT::EscapeString("<xVFxUF>"));
+  static std::string NonReplicatorString(DOT::EscapeString("<x1>"));
+  return Region->isReplicator() ? ReplicatorString : NonReplicatorString;
+}
+
+void VPlanPrinter::dump(const std::string &Title) {
+  resetDepth();
+  OS << "digraph VPlan {\n";
+  OS << "graph [labelloc=t, fontsize=30; label=\"Vectorization Plan";
+  if (!Title.empty())
+    OS << "\\n" << DOT::EscapeString(Title);
+  OS << "\"]\n";
+  OS << "node [shape=record]\n";
+  OS << "compound=true\n";
+
+  for (const VPBlockBase *CurrentBlock = Plan.getEntry();
+       CurrentBlock != nullptr;
+       CurrentBlock = CurrentBlock->getSingleSuccessor())
+    dumpBlock(CurrentBlock);
+
+  OS << "}\n";
+}
+
+void VPlanPrinter::dumpBlock(const VPBlockBase *Block) {
+  if (const VPBasicBlock *BasicBlock = dyn_cast<VPBasicBlock>(Block))
+    dumpBasicBlock(BasicBlock);
+  else if (const VPRegionBlock *Region = dyn_cast<VPRegionBlock>(Block))
+    dumpRegion(Region);
+  else
+    llvm_unreachable("Unsupported kind of VPBlock.");
+}
+
+/// Print the information related to a CFG edge between two VPBlockBases.
+void VPlanPrinter::drawEdge(const VPBlockBase *From, const VPBlockBase *To,
+                            bool Hidden, const Twine &Label) {
+  // Due to "dot" we print an edge between two regions as an edge between the
+  // exit basic block and the entry basic block of the respective regions.
+  const VPBlockBase *Tail = From->getExitBasicBlock();
+  const VPBlockBase *Head = To->getEntryBasicBlock();
+  OS << Indent << getNodePrefix(Tail) << DOT::EscapeString(Tail->getName())
+     << " -> " << getNodePrefix(Head) << DOT::EscapeString(Head->getName());
+  OS << " [ label=\"" << Label << '\"';
+  if (Tail != From)
+    OS << " ltail=" << getNodePrefix(From)
+       << DOT::EscapeString(From->getName());
+  if (Head != To)
+    OS << " lhead=" << getNodePrefix(To) << DOT::EscapeString(To->getName());
+  if (Hidden)
+    OS << "; splines=none";
+  OS << "]\n";
+}
+
+/// Print the information related to the CFG edges going out of a given
+/// \p Block, followed by printing the successor blocks themselves.
+void VPlanPrinter::dumpEdges(const VPBlockBase *Block) {
+  std::string Cond = "";
+  if (auto *ConditionBitRecipe = Block->getConditionBitRecipe())
+    Cond = ConditionBitRecipe->getName().str();
+  unsigned SuccessorNumber = 1;
+  for (auto *Successor : Block->getSuccessors()) {
+    drawEdge(Block, Successor, false,
+             Twine() + (SuccessorNumber == 2 ? "!" : "") + Twine(Cond));
+    ++SuccessorNumber;
+  }
+}
+
+/// Print a VPBasicBlock, including its VPRecipes, followed by printing its
+/// successor blocks.
+void VPlanPrinter::dumpBasicBlock(const VPBasicBlock *BasicBlock) {
+  OS << Indent << getNodePrefix(BasicBlock)
+     << DOT::EscapeString(BasicBlock->getName()) << " [label = \"{"
+     << DOT::EscapeString(BasicBlock->getName());
+
+  for (const VPRecipeBase &Recipe : BasicBlock->getRecipes()) {
+    OS << " | ";
+    std::string RecipeString;
+    raw_string_ostream RSO(RecipeString);
+    Recipe.print(RSO);
+    OS << DOT::EscapeString(RSO.str());
+  }
+
+  OS << "}\"]\n";
+  dumpEdges(BasicBlock);
+}
+
+/// Print a given \p Region of the VPlan.
+void VPlanPrinter::dumpRegion(const VPRegionBlock *Region) {
+  OS << Indent << "subgraph " << getNodePrefix(Region)
+     << DOT::EscapeString(Region->getName()) << " {\n";
+  increaseDepth();
+  OS << Indent;
+  OS << "label = \"" << getReplicatorString(Region) << " "
+     << DOT::EscapeString(Region->getName()) << "\"\n\n";
+
+  // Dump the blocks of the region.
+  assert(Region->getEntry() && "Region contains no inner blocks.");
+
+  for (const VPBlockBase *Block : depth_first(Region->getEntry()))
+    dumpBlock(Block);
+
+  decreaseDepth();
+  OS << Indent << "}\n";
+  dumpEdges(Region);
+}
Index: test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
===================================================================
--- test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
+++ test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
@@ -15,9 +15,9 @@
 ; CHECK: br i1 {{.*}}, label %[[IF0:.+]], label %[[CONT0:.+]]
 ; CHECK: [[IF0]]:
 ; CHECK: %[[T00:.+]] = extractelement <2 x i64> %wide.load, i32 0
-; CHECK: %[[T01:.+]] = extractelement <2 x i64> %wide.load, i32 0
-; CHECK: %[[T02:.+]] = add nsw i64 %[[T01]], %x
-; CHECK: %[[T03:.+]] = udiv i64 %[[T00]], %[[T02]]
+; CHECK: %[[T01:.+]] = add nsw i64 %[[T00]], %x
+; CHECK: %[[T02:.+]] = extractelement <2 x i64> %wide.load, i32 0
+; CHECK: %[[T03:.+]] = udiv i64 %[[T02]], %[[T01]]
 ; CHECK: %[[T04:.+]] = insertelement <2 x i64> undef, i64 %[[T03]], i32 0
 ; CHECK: br label %[[CONT0]]
 ; CHECK: [[CONT0]]:
@@ -25,9 +25,9 @@
 ; CHECK: br i1 {{.*}}, label %[[IF1:.+]], label %[[CONT1:.+]]
 ; CHECK: [[IF1]]:
 ; CHECK: %[[T06:.+]] = extractelement <2 x i64> %wide.load, i32 1
-; CHECK: %[[T07:.+]] = extractelement <2 x i64> %wide.load, i32 1
-; CHECK: %[[T08:.+]] = add nsw i64 %[[T07]], %x
-; CHECK: %[[T09:.+]] = udiv i64 %[[T06]], %[[T08]]
+; CHECK: %[[T07:.+]] = add nsw i64 %[[T06]], %x
+; CHECK: %[[T08:.+]] = extractelement <2 x i64> %wide.load, i32 1
+; CHECK: %[[T09:.+]] = udiv i64 %[[T08]], %[[T07]]
 ; CHECK: %[[T10:.+]] = insertelement <2 x i64> %[[T05]], i64 %[[T09]], i32 1
 ; CHECK: br label %[[CONT1]]
 ; CHECK: [[CONT1]]:
Index: test/Transforms/LoopVectorize/AArch64/predication_costs.ll
===================================================================
--- test/Transforms/LoopVectorize/AArch64/predication_costs.ll
+++ test/Transforms/LoopVectorize/AArch64/predication_costs.ll
@@ -18,8 +18,8 @@
 ; Cost of udiv:
 ; (udiv(2) + extractelement(6) + insertelement(3)) / 2 = 5
 ;
-; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
 ; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
+; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
 ;
 define i32 @predicated_udiv(i32* %a, i32* %b, i1 %c, i64 %n) {
 entry:
@@ -59,8 +59,8 @@
 ; Cost of store:
 ; (store(4) + extractelement(3)) / 2 = 3
 ;
-; CHECK: Found an estimated cost of 3 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
 ; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4
+; CHECK: Found an estimated cost of 3 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
 ;
 define void @predicated_store(i32* %a, i1 %c, i32 %x, i64 %n) {
 entry:
@@ -98,10 +98,10 @@
 ; Cost of udiv:
 ; (udiv(2) + extractelement(3) + insertelement(3)) / 2 = 4
 ;
-; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp3 = add nsw i32 %tmp2, %x
-; CHECK: Found an estimated cost of 4 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
 ; CHECK: Scalarizing: %tmp3 = add nsw i32 %tmp2, %x
 ; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
+; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp3 = add nsw i32 %tmp2, %x
+; CHECK: Found an estimated cost of 4 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
 ;
 define i32 @predicated_udiv_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) {
 entry:
 ; Cost of store:
 ;   store(4) / 2 = 2
 ;
-; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp2 = add nsw i32 %tmp1, %x
-; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
 ; CHECK: Scalarizing: %tmp2 = add nsw i32 %tmp1, %x
 ; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4
+; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp2 = add nsw i32 %tmp1, %x
+; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
 ;
 define void @predicated_store_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) {
 entry:
@@ -192,16 +192,16 @@
 ; Cost of store:
 ;   store(4) / 2 = 2
 ;
-; CHECK: Found an estimated cost of 1 for VF 2 For instruction: %tmp2 = add i32 %tmp1, %x
-; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp3 = sdiv i32 %tmp1, %tmp2
-; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp3, %tmp2
-; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp5 = sub i32 %tmp4, %x
-; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp5, i32* %tmp0, align 4
 ; CHECK-NOT: Scalarizing: %tmp2 = add i32 %tmp1, %x
 ; CHECK: Scalarizing and predicating: %tmp3 = sdiv i32 %tmp1, %tmp2
 ; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp3, %tmp2
 ; CHECK: Scalarizing: %tmp5 = sub i32 %tmp4, %x
 ; CHECK: Scalarizing and predicating: store i32 %tmp5, i32* %tmp0, align 4
+; CHECK: Found an estimated cost of 1 for VF 2 For instruction: %tmp2 = add i32 %tmp1, %x
+; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp3 = sdiv i32 %tmp1, %tmp2
+; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp3, %tmp2
+; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp5 = sub i32 %tmp4, %x
+; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp5, i32* %tmp0, align 4
 ;
 define void @predication_multi_context(i32* %a, i1 %c, i32 %x, i64 %n) {
 entry:
Index: test/Transforms/LoopVectorize/if-pred-non-void.ll
===================================================================
--- test/Transforms/LoopVectorize/if-pred-non-void.ll
+++ test/Transforms/LoopVectorize/if-pred-non-void.ll
@@ -219,9 +219,9 @@
 ; CHECK: br i1 {{.*}}, label %[[IF0:.+]], label %[[CONT0:.+]]
 ; CHECK: [[IF0]]:
 ; CHECK: %[[T00:.+]] = extractelement <2 x i32> %wide.load, i32 0
-; CHECK: %[[T01:.+]] = extractelement <2 x i32> %wide.load, i32 0
-; CHECK: %[[T02:.+]] = add nsw i32 %[[T01]], %x
-; CHECK: %[[T03:.+]] = udiv i32 %[[T00]], %[[T02]]
+; CHECK: %[[T01:.+]] = add nsw i32 %[[T00]], %x
+; CHECK: %[[T02:.+]] = extractelement <2 x i32> %wide.load, i32 0
+; CHECK: %[[T03:.+]] = udiv i32 %[[T02]], %[[T01]]
 ; CHECK: %[[T04:.+]] = insertelement <2 x i32> undef, i32 %[[T03]], i32 0
 ; CHECK: br label %[[CONT0]]
 ; CHECK: [[CONT0]]:
@@ -229,9 +229,9 @@
 ; CHECK: br i1 {{.*}}, label %[[IF1:.+]], label %[[CONT1:.+]]
 ; CHECK: [[IF1]]:
 ; CHECK: %[[T06:.+]] = extractelement <2 x i32> %wide.load, i32 1
-; CHECK: %[[T07:.+]] = extractelement <2 x i32> %wide.load, i32 1
-; CHECK: %[[T08:.+]] = add nsw i32 %[[T07]], %x
-; CHECK: %[[T09:.+]] = udiv i32 %[[T06]], %[[T08]]
+; CHECK: %[[T07:.+]] = add nsw i32 %[[T06]], %x
+; CHECK: %[[T08:.+]] = extractelement <2 x i32> %wide.load, i32 1
+; CHECK: %[[T09:.+]] = udiv i32 %[[T08]], %[[T07]]
 ; CHECK: %[[T10:.+]] = insertelement <2 x i32> %[[T05]], i32 %[[T09]], i32 1
 ; CHECK: br label %[[CONT1]]
 ; CHECK: [[CONT1]]:
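The reordered checks in predication_costs.ll all follow the same formula
stated in that test's comments: the scalarized operation's cost plus any
extractelement/insertelement overhead, divided by 2. As a quick sanity check
of that arithmetic, here is a small self-contained C++ sketch; the per-operation
costs (udiv = 2, store = 4, and the extract/insert overheads) are copied from
the test comments, the division by 2 simply mirrors their "(...) / 2" form
rather than the cost model's actual code, and predicatedCost is a hypothetical
helper, not a LoopVectorizer API::

  #include <cassert>

  // Reproduces the "(op + extract + insert) / 2" arithmetic quoted in the
  // predication_costs.ll comments. Illustrative sketch only; the helper
  // name and signature are assumptions made for this example.
  static unsigned predicatedCost(unsigned OpCost, unsigned ExtractCost,
                                 unsigned InsertCost) {
    return (OpCost + ExtractCost + InsertCost) / 2;
  }

  int main() {
    assert(predicatedCost(2, 6, 3) == 5); // predicated udiv
    assert(predicatedCost(4, 3, 0) == 3); // predicated store
    assert(predicatedCost(2, 3, 3) == 4); // udiv with a scalarized operand
    assert(predicatedCost(4, 0, 0) == 2); // store of a scalarized operand
    return 0;
  }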
Index: test/Transforms/LoopVectorize/induction.ll
===================================================================
--- test/Transforms/LoopVectorize/induction.ll
+++ test/Transforms/LoopVectorize/induction.ll
@@ -301,18 +301,18 @@
 ;
 ; CHECK-LABEL: @scalarize_induction_variable_05(
 ; CHECK: vector.body:
-; CHECK: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue2 ]
+; CHECK: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue4 ]
 ; CHECK: %[[I0:.+]] = add i32 %index, 0
 ; CHECK: getelementptr inbounds i32, i32* %a, i32 %[[I0]]
 ; CHECK: pred.udiv.if:
 ; CHECK: udiv i32 {{.*}}, %[[I0]]
-; CHECK: pred.udiv.if1:
+; CHECK: pred.udiv.if3:
 ; CHECK: %[[I1:.+]] = add i32 %index, 1
 ; CHECK: udiv i32 {{.*}}, %[[I1]]
 ;
 ; UNROLL-NO-IC-LABEL: @scalarize_induction_variable_05(
 ; UNROLL-NO-IC: vector.body:
-; UNROLL-NO-IC: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue11 ]
+; UNROLL-NO-IC: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue13 ]
 ; UNROLL-NO-IC: %[[I0:.+]] = add i32 %index, 0
 ; UNROLL-NO-IC: %[[I2:.+]] = add i32 %index, 2
 ; UNROLL-NO-IC: getelementptr inbounds i32, i32* %a, i32 %[[I0]]
@@ -322,26 +322,26 @@
 ; UNROLL-NO-IC: pred.udiv.if6:
 ; UNROLL-NO-IC: %[[I1:.+]] = add i32 %index, 1
 ; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I1]]
-; UNROLL-NO-IC: pred.udiv.if8:
+; UNROLL-NO-IC: pred.udiv.if9:
 ; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I2]]
-; UNROLL-NO-IC: pred.udiv.if10:
+; UNROLL-NO-IC: pred.udiv.if12:
 ; UNROLL-NO-IC: %[[I3:.+]] = add i32 %index, 3
 ; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I3]]
 ;
 ; IND-LABEL: @scalarize_induction_variable_05(
 ; IND: vector.body:
-; IND: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue2 ]
+; IND: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue4 ]
 ; IND: %[[E0:.+]] = sext i32 %index to i64
 ; IND: getelementptr inbounds i32, i32* %a, i64 %[[E0]]
 ; IND: pred.udiv.if:
 ; IND: udiv i32 {{.*}}, %index
-; IND: pred.udiv.if1:
+; IND: pred.udiv.if3:
 ; IND: %[[I1:.+]] = or i32 %index, 1
 ; IND: udiv i32 {{.*}}, %[[I1]]
 ;
 ; UNROLL-LABEL: @scalarize_induction_variable_05(
 ; UNROLL: vector.body:
-; UNROLL: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue11 ]
+; UNROLL: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue13 ]
 ; UNROLL: %[[I2:.+]] = or i32 %index, 2
 ; UNROLL: %[[E0:.+]] = sext i32 %index to i64
 ; UNROLL: %[[G0:.+]] = getelementptr inbounds i32, i32* %a, i64 %[[E0]]
@@ -351,9 +351,9 @@
 ; UNROLL: pred.udiv.if6:
 ; UNROLL: %[[I1:.+]] = or i32 %index, 1
 ; UNROLL: udiv i32 {{.*}}, %[[I1]]
-; UNROLL: pred.udiv.if8:
+; UNROLL: pred.udiv.if9:
 ; UNROLL: udiv i32 {{.*}}, %[[I2]]
-; UNROLL: pred.udiv.if10:
+; UNROLL: pred.udiv.if12:
 ; UNROLL: %[[I3:.+]] = or i32 %index, 3
 ; UNROLL: udiv i32 {{.*}}, %[[I3]]
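The induction.ll changes above only renumber the expected pred.udiv.if* and
pred.udiv.continue* block names: the uniquing suffixes shift because blocks
and values are now materialized in a different order, while the structure
being verified is unchanged, namely one guarded if/continue diamond per lane
that computes the scalarized induction value and performs the scalar udiv.
The following scalar C++ model (an illustrative sketch under assumed names
and types, not the generated IR) shows that per-lane shape for VF = 2::

  #include <array>
  #include <cassert>
  #include <cstdint>

  // Per-lane predicated udiv for a 2-lane vector: each lane has its own
  // mask-guarded "if" region (pred.udiv.if / pred.udiv.if3 in the CHECK
  // lines) that derives the lane's induction value and divides, then falls
  // through to the matching "continue" block. Masked-off lanes keep their
  // initial value, standing in for the undef lanes of the real IR.
  static std::array<uint32_t, 2>
  predicatedUDiv(std::array<uint32_t, 2> WideLoad, std::array<bool, 2> Mask,
                 uint32_t Index) {
    std::array<uint32_t, 2> Result{};
    for (uint32_t Lane = 0; Lane < 2; ++Lane) {
      if (Mask[Lane]) {                    // pred.udiv.if(N)
        uint32_t I = Index + Lane;         // scalarized induction step
        Result[Lane] = WideLoad[Lane] / I; // scalar udiv for this lane
      }                                    // pred.udiv.continue(N)
    }
    return Result;
  }

  int main() {
    auto R = predicatedUDiv({10, 9}, {true, false}, 1);
    assert(R[0] == 10 && R[1] == 0); // lane 1 is masked off
    return 0;
  }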