Index: docs/VectorizationPlan.rst
===================================================================
--- /dev/null
+++ docs/VectorizationPlan.rst
@@ -0,0 +1,574 @@
+++++
VPlan
+++++

Goal of initial VPlan patch
+++++++++++++++++++++++++++
The design and implementation of VPlan follow our RFC [10]_ and presentation
[11]_. The initial patch is designed to:

- be a *lightweight* NFC patch;
- show key aspects of VPlan's Hierarchical CFG concept;
- demonstrate how VPlan can

  * capture *all* current vectorization decisions: which instructions are to

    + be vectorized "on their own", or
    + be part of an interleave group, or
    + be scalarized, and optionally have scalar instances moved down to other
      basic blocks and under a condition, or
    + be packed or unpacked (at the definition rather than at its uses) to
      provide both scalarized and vectorized forms; and

  * represent all control-flow *within the loop body* of the vectorized code
    version;

- be a step towards

  * aligning the Cost step with the Transformation step,
  * representing the entire code being transformed,
  * adding optimizations:

    + optimizing conditional scalarization further,
    + retaining uniform control-flow,
    + vectorizing outer loops,
    + and more.

Out of scope for the initial patch:

- changing how a loop is checked if it *can* be vectorized - "Legal";
- changing how a loop is checked if it *should* be vectorized - "Cost".


==================
Vectorization Plan
==================

.. contents::
   :local:

Overview
========
The Vectorization Plan is an explicit recipe for describing a vectorization
candidate. It serves both for estimating the cost reliably and for performing
the translation, and it facilitates dealing with multiple vectorization
candidates.

The overall structure consists of the following (a minimal code sketch of how
these pieces relate follows the list):

1. One LoopVectorizationPlanner for each attempt to vectorize a loop or a loop
   nest.

2. A LoopVectorizationPlanner can construct, optimize and discard one or more
   VPlans, providing different ways to vectorize the loop or the loop nest.

3. Once the best VPlan is determined, including the best vectorization factor
   and unroll factor, this VPlan drives the vector code generation using a
   VPTransformState object.

4. Each VPlan represents the loop or the loop nest using a Hierarchical CFG.

5. At the bottom level of the Hierarchical CFG are VPBasicBlocks.

6. Each VPBasicBlock consists of one or more VPRecipes that generate the
   Instructions for it.
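The following is a minimal sketch of how these pieces relate. The plan
container and the ``setBestPlan``/``executeBestPlan`` entry points are taken
from this patch; the surrounding scaffolding is abridged for illustration and
is not the planner's full interface:

.. code-block:: c++

   #include "llvm/ADT/DenseMap.h"
   #include <memory>

   class VPlan;               // A recipe for one vectorized version (item 4).
   class InnerLoopVectorizer; // Existing code-generation helper.

   class LoopVectorizationPlanner {
     // One plan per candidate way to vectorize; plans are shared between
     // VFs, hence the smart pointers (item 2).
     llvm::DenseMap<unsigned, std::shared_ptr<VPlan>> VPlans;

   public:
     // Construct and optimize candidate plans (item 2).
     unsigned buildInitialVPlans(unsigned MinVF, unsigned MaxVF);

     // Keep only the plan selected by the cost model, with its
     // vectorization and unroll factors (item 3)...
     void setBestPlan(unsigned VF, unsigned UF);

     // ...and drive code generation through it: the plan passes a
     // VPTransformState down to its VPBasicBlocks and their VPRecipes
     // (items 3-6).
     void executeBestPlan(InnerLoopVectorizer &ILV);
   };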
Motivation
----------
The vectorization transformation can be rather complicated, involving several
potential alternatives, especially for outer loops [1]_ but also possibly for
innermost loops. These alternatives may have significant performance impact,
both positive and negative. A cost model is therefore employed to identify the
best alternative, including the alternative of avoiding any transformation
altogether.

The process of vectorization traditionally involves three major steps: Legal,
Cost, and Transform. This is the general case in LLVM's Loop Vectorizer:

1. Legal Step: check if the loop can be legally vectorized; encode constraints
   and artifacts if so.
2. Cost Step: compute the relative cost of vectorizing it along possible
   vectorization and unroll factors (VF, UF).
3. Transform Step: vectorize the loop according to the best VF and UF.

This design, which works directly on the original LLVM-IR only, has some
implications:

1. The Cost Step tries to predict what the vectorized loop will look like and
   how much it will cost, independently of what the Transform Step will
   eventually do. It is hard to keep the two in sync.
2. The Cost Step essentially considers a single vectorization candidate. Any
   alternatives are immediately evaluated and resolved.
3. The Legal Step does more than check for vectorizability; e.g., it records
   auxiliary artifacts such as collectLoopUniforms() and InterleaveInfo.
4. The Transform Step first populates the single basic block of the vectorized
   loop and later revisits scalarized instructions to predicate them one by
   one, as needed.

The Vectorization Plan is designed to explicitly model a vectorization
candidate in order to overcome the above constraints, which is especially
important for the vectorization of outer loops. This affects the overall
process by essentially splitting the Transform Step into a Plan Step and a
Code-Gen Step:

1. Legal Step: check if the loop can be legally vectorized; encode constraints
   and artifacts if so. Initiate the Vectorization Plan showing how the loop
   can be vectorized only after passing Legal, to save redundant construction.
2. Plan Step:

   a. Build initial Vectorization Plans following the constraints and
      decisions taken by Legal.
   b. Explore ways to optimize the Vectorization Plan, complying with all
      legal constraints, possibly constructing several plans following
      tentative vectorization decisions.

3. Cost Step: compute the relative cost of each plan. This step can be applied
   repeatedly by Plan Step 2.b.
4. Code-Gen Step: materialize the best plan. Note that only this step modifies
   the IR, as in the current Loop Vectorizer.

The Cost Step can also be split into one or more Early-Pruning Steps and a
"Cost-Gen" Step, where the former apply quick yet inaccurate estimates to
prune obviously unpromising candidates, and the latter applies more accurate
estimates based on a full plan; a condensed code sketch of the restructured
flow appears at the end of this section.

One can compare with LLVM's existing SLP vectorizer, where TSLP [3]_ adds
Step 2.b.

As the scope of vectorization grows from innermost to outer loops, so do the
uncertainty and complexity of each step. One way to mitigate the shortcomings
of the Legal and Cost steps is to rely on programmers to indicate which loops
can and/or should be vectorized. This is implicit for certain loops in
data-parallel languages such as OpenCL [4]_, [5]_ and explicit in others such
as OpenMP [6]_. This design to extend the Loop Vectorizer to outer loops
supports and raises the importance of explicit vectorization beyond the
current capabilities of Clang and LLVM, namely, from forcing the vectorization
of innermost loops according to a prescribed width and/or interleaving count
to supporting OpenMP's "#pragma omp simd" construct and associated clauses,
including vectorizing across function boundaries [2]_.
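A condensed sketch of the restructured flow follows. The planner and
cost-model entry points match those introduced later in this patch; the
wrapper function itself is hypothetical and all diagnostics are omitted (the
final section of this document shows the actual processLoop() integration):

.. code-block:: c++

   // Hypothetical driver, for illustration only.
   LoopVectorizationCostModel::VectorizationFactor
   planAndSelect(LoopVectorizationPlanner &LVP,
                 LoopVectorizationCostModel &CM, bool OptForSize,
                 unsigned MinVF, unsigned MaxVF) {
     // Plan Step 2.a: build initial VPlans for all candidate VFs; this runs
     // only after the Legal Step has passed.
     LVP.buildInitialVPlans(MinVF, MaxVF);
     // Plan Step 2.b: optimize the plans; the IR itself is never modified.
     LVP.optimizePredicatedInstructions();
     // Cost Step: pick the most profitable candidate. Only the Code-Gen
     // Step, which runs later on the selected plan, touches the IR.
     return CM.selectVectorizationFactor(OptForSize, MaxVF);
   }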
See also "Improving Performance of OpenCL on + CPUs", LLVM Developers' Meeting 2012. + +.. [6] "Compiling C/C++ SIMD Extensions for Function and Loop Vectorization on + Multicore-SIMD Processors", Xinmin Tian and Hideki Saito et al., + IPDPSW 2012. + +.. [7] "Exploiting mixed SIMD parallelism by reducing data reorganization + overhead", Hao Zhou and Jingling Xue, CGO 2016. + +.. [8] "Register Allocation via Hierarchical Graph Coloring", David Callahan and + Brian Koblenz, PLDI 1991 + +.. [9] "Structural analysis: A new approach to flow analysis in optimizing + compilers", M. Sharir, Journal of Computer Languages, Jan. 1980 + +.. [10] "RFC: Extending LV to vectorize outerloops", [`llvm-dev + `_], + September 21, 2016. + +.. [11] "Extending LoopVectorizer towards supporting OpenMP4.5 SIMD and outer + loop auto-vectorization", Hideki Saito, `LLVM Developers' Meeting 2016 + `_, November 3, 2016. + +Examples +-------- +An example with a single predicated scalarized instruction - integer division: + +.. code-block:: c + + void foo(int* a, int b, int* c) { + #pragma simd + for (int i = 0; i < 10000; ++i) + if (a[i] > 777) + a[i] = b - (c[i] + a[i] / b); + } + + +IR Dump Before Loop Vectorization: + +.. code-block:: LLVM + :emphasize-lines: 6,11 + + for.body: ; preds = %for.inc, %entry + %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.inc ] + %arrayidx = getelementptr inbounds i32, i32* %a, i64 %indvars.iv + %0 = load i32, i32* %arrayidx, align 4, !tbaa !1 + %cmp1 = icmp sgt i32 %0, 777 + br i1 %cmp1, label %if.then, label %for.inc + + if.then: ; preds = %for.body + %arrayidx3 = getelementptr inbounds i32, i32* %c, i64 %indvars.iv + %1 = load i32, i32* %arrayidx3, align 4, !tbaa !1 + %div = sdiv i32 %0, %b + %add.neg = sub i32 %b, %1 + %sub = sub i32 %add.neg, %div + store i32 %sub, i32* %arrayidx, align 4, !tbaa !1 + br label %for.inc + + for.inc: ; preds = %for.body, %if.then + %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 + %exitcond = icmp eq i64 %indvars.iv.next, 10000 + br i1 %exitcond, label %for.cond.cleanup, label %for.body + +The VPlan that is built initially: + +.. image:: VPlanPrinter.png + +Design Guidelines +================= +1. Analysis-like: building and manipulating the Vectorization Plan must not + modify the IR. In particular, if a VPlan is discarded + compilation should proceed as if the VPlan had not been built. + +2. Support all current capabilities: the Vectorization Plan must be capable of + representing the exact functionality of LLVM's existing Loop Vectorizer. + In particular, the transition can start with an NFC patch. + In particular, VPlan must support efficient selection of VF and/or UF. + +3. Align Cost & CodeGen: the Vectorization Plan must serve both the cost + model and the code generation phases, where the cost estimation must + evaluate the to-be-generated code reliably. + +4. Support vectorizing additional constructs: + + a. vectorization of Outer-loops. + In particular, VPlan must be able to represent the control-flow of a + vectorized loop which may include multiple basic-blocks and nested loops. + b. SLP vectorization. + c. Combinations of the above, including nested vectorization: vectorizing + both an inner loop and an outerloop at the same time (each with its own + VF and UF), mixed vectorization: vectorizing a loop and SLP patterns + inside [7]_, (re)vectorizing vector code. + +5. Support multiple candidates efficiently: + In particular, similar candidates related to a range of possible VF's and + UF's must be represented efficiently. 
Design Guidelines
=================
1. Analysis-like: building and manipulating the Vectorization Plan must not
   modify the IR. In particular, if a VPlan is discarded, compilation should
   proceed as if the VPlan had not been built.

2. Support all current capabilities: the Vectorization Plan must be capable of
   representing the exact functionality of LLVM's existing Loop Vectorizer.
   In particular, the transition can start with an NFC patch; VPlan must also
   support efficient selection of VF and/or UF.

3. Align Cost & CodeGen: the Vectorization Plan must serve both the cost model
   and the code generation phases, where the cost estimation must evaluate the
   to-be-generated code reliably.

4. Support vectorizing additional constructs:

   a. vectorization of outer loops. In particular, VPlan must be able to
      represent the control-flow of a vectorized loop, which may include
      multiple basic blocks and nested loops;
   b. SLP vectorization;
   c. combinations of the above, including nested vectorization (vectorizing
      both an inner loop and an outer loop at the same time, each with its own
      VF and UF), mixed vectorization (vectorizing a loop together with SLP
      patterns inside [7]_), and (re)vectorizing vector code.

5. Support multiple candidates efficiently: in particular, similar candidates
   related to a range of possible VFs and UFs must be represented efficiently,
   as must potential versionings (see the sketch after this list).

6. Compact: the Vectorization Plan must be efficient and provide as compact a
   representation as possible, in particular where the transformation is
   straightforward and where the plan is to reuse existing IR (e.g., for
   leftover iterations).
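One way the patch serves guideline 5 is by letting a single VPlan represent a
whole range of VFs: a decision is tested across the range, and the range is
shrunk at the first VF for which the answer changes. The signature below is
that of the ``testVFRange`` helper this patch declares; the body here is a
sketch of the semantics stated in its documentation, not the patch's actual
implementation:

.. code-block:: c++

   #include <functional>

   /// Test a predicate on VFs from StartRangeVF up to EndRangeVF, exclusive,
   /// doubling the VF each time. The returned value reflects the result for
   /// a prefix of the range: EndRangeVF is clipped at the first VF whose
   /// answer differs, so one VPlan can serve all VFs left in [Start, End).
   bool testVFRange(const std::function<bool(unsigned)> &Predicate,
                    unsigned StartRangeVF, unsigned &EndRangeVF) {
     bool PredicateAtRangeStart = Predicate(StartRangeVF);
     for (unsigned VF = StartRangeVF * 2; VF < EndRangeVF; VF *= 2)
       if (Predicate(VF) != PredicateAtRangeStart) {
         EndRangeVF = VF; // A later plan will cover [VF, EndRangeVF).
         break;
       }
     return PredicateAtRangeStart;
   }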
VPlan Classes: Definitions
==========================

:VPlan:
  A recipe for generating a vectorized version from a given IR code.
  Takes a "scenario-based approach" to vectorization planning.
  The given IR code is required to be SESE, mainly to simplify dominance
  information. The vectorized version is represented using a Hierarchical CFG.

:Hierarchical CFG:
  A control-flow graph whose nodes are basic blocks or Hierarchical CFGs.
  The Hierarchical CFG data structure we use is similar to the Tile Tree [8]_,
  where cross-Tile edges are lifted to connect Tiles instead of the original
  basic blocks as in Sharir [9]_, promoting the Tile encapsulation. We use the
  terms Region and Block rather than Tile [8]_ to avoid confusion with loop
  tiling.

:VPBasicBlock:
  Serves as the leaf of the Hierarchical CFG. Represents a sequence of
  instructions that will appear consecutively in a basic block of the
  vectorized version. The instructions of such a basic block may originate
  from one or more VPBasicBlocks. The VPBasicBlock takes care of the
  control-flow relations with other VPBasicBlocks and Regions, and holds a
  sequence of zero or more VPRecipes that take care of representing the
  instructions. A VPBasicBlock that holds no VPRecipes represents no
  instructions; this may happen, e.g., to support disjoint Regions and to
  ensure Regions have a single exit, possibly an empty one.

:VPRecipeBase:
  A base class describing one or more instructions that will appear
  consecutively in the vectorized version, based on Instructions from the
  given IR. These Instructions are referred to as the "Ingredients" of the
  Recipe. A Recipe specifies how its Ingredients are to be vectorized: e.g.,
  copy or reuse them as uniform, scalarize or vectorize them according to an
  enclosing loop dimension, or vectorize them according to an internal SLP
  dimension.

  **Design principle:** in order to reason about how to vectorize an
  Instruction or how much it would cost, one has to consult the VPRecipe
  holding it.

  **Design principle:** when a sequence of instructions conveys additional
  information as a group, we use a VPRecipe to encapsulate them and attach
  this information to the VPRecipe. For instance, a VPRecipe can model an
  interleave group of loads or stores with additional information for
  calculating their cost and performing code-gen, as a group.

  **Design principle:** where possible, a VPRecipe should reuse the existing
  container of its Ingredients. A new container should be opened on demand,
  e.g., to facilitate changing the order of Instructions between the original
  and vectorized versions.

:VPOneByOneRecipeBase:
  Represents recipes which transform each Instruction in their Ingredients
  independently, in order. The Ingredients are a sub-sequence of original
  Instructions, which reside in the same IR BasicBlock and in the same order.
  The Ingredients are accessed by a pointer to the first and last Instruction
  in their original IR basic block. Serves as a base class for the concrete
  sub-classes VPScalarizeOneByOneRecipe and VPVectorizeOneByOneRecipe.

:VPScalarizeOneByOneRecipe:
  A concrete VPRecipe which scalarizes each Ingredient, generating either an
  instance for lane 0 only, for a uniform instruction, or instances for a
  range of lanes otherwise.

:VPVectorizeOneByOneRecipe:
  A concrete VPRecipe which vectorizes each Ingredient.

:VPInterleaveRecipe:
  A concrete VPRecipe which transforms an interleave group of loads or stores
  into one wide load/store and shuffles.

:VPConditionBitRecipeBase:
  A base class for VPRecipes which provide the condition bit feeding a
  conditional branch. Such cases correspond to scalarized or uniform branches.

:VPExtractMaskBitRecipe:
  A concrete VPRecipe which represents the extraction of a bit from a mask,
  needed when scalarizing a conditional branch. Such branches are needed to
  guard scalarized and predicated instructions.

:VPMergeScalarizeBranchRecipe:
  A concrete VPRecipe which represents the phi's needed when control converges
  back from a scalarized branch. Such phi's are needed to merge live-out
  values that are set under a scalarized branch. They can be scalar or vector,
  depending on the user of the live-out value.

:VPWidenIntInductionRecipe:
  A concrete VPRecipe which widens integer inductions, producing their vector
  values and computing the values needed for producing their scalar values.
  The scalar values themselves are generated, possibly elsewhere, by the
  complementing VPBuildScalarStepsRecipe.

:VPBuildScalarStepsRecipe:
  A concrete VPRecipe complementing the handling of integer induction
  variables, responsible for generating the scalar values used by the IV's
  scalar users.

:VPRegionBlock:
  A collection of VPBasicBlocks and VPRegionBlocks which form a
  single-entry-single-exit subgraph of the CFG in the vectorized code.

  **Design principle:** when some additional information relates to an SESE
  set of VPBlocks, we use a VPRegionBlock to wrap them and attach the
  information to it. For example, a VPRegionBlock can be used to indicate that
  a scalarized SESE region is to be replicated. It is also designed to serve
  predicating divergent branches while retaining uniform branches as much as
  possible or desirable, and to represent inner loops.

:VPBlockBase:
  The building block of the Hierarchical CFG. A VPBlockBase can be either a
  VPBasicBlock or a VPRegionBlock. A VPBlockBase may indicate that its
  contents are to be replicated several times. This is designed to support
  scalarizing VPBlockBases, which generate VF replicas of their instructions
  that in turn remain scalar, and to do so using a single VPlan for multiple
  candidate VFs.

:VPTransformState:
  Stores information used for code generation, passed from the Planner to its
  selected VPlan for execution, and used to pass additional information down
  from VPBlocks to VPRecipes.

:VPlanUtils:
  Contains a collection of methods for the construction and modification of
  abstract VPlans.

:VPlanUtilsLoopVectorizer:
  Derived from VPlanUtils, providing additional methods for the construction
  and modification of VPlans that are specific to the Loop Vectorizer.

:LoopVectorizationPlanner:
  The object in charge of creating and manipulating VPlans for a given IR
  code.
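Before the diagrams, a skeletal view of these class relationships may help
orientation. Fields shown are illustrative and heavily elided; the UML
figures below show the real interfaces:

.. code-block:: c++

   #include "llvm/ADT/SmallVector.h"

   class VPRecipeBase;
   class VPRegionBlock;

   // Building block of the Hierarchical CFG: either a leaf or a region.
   class VPBlockBase {
     VPRegionBlock *Parent; // Enclosing region, if any.
     llvm::SmallVector<VPBlockBase *, 1> Predecessors;
     llvm::SmallVector<VPBlockBase *, 1> Successors;
     bool IsReplicated; // Replicate contents per lane when scalarizing.
   };

   // Leaf: a sequence of recipes that emit consecutive instructions.
   class VPBasicBlock : public VPBlockBase {
     llvm::SmallVector<VPRecipeBase *, 4> Recipes;
   };

   // Single-entry single-exit subgraph of VPBlocks.
   class VPRegionBlock : public VPBlockBase {
     VPBlockBase *Entry;
     VPBlockBase *Exit;
   };

   // Describes how a group of Ingredients is vectorized or scalarized.
   class VPRecipeBase {
     VPBasicBlock *Parent; // The VPBasicBlock holding this recipe.
   };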
VPlan Classes: Diagram
======================

The classes of VPlan with their main fields and methods; the sub-classes of
VPRecipeBase are shown in a separate figure:

.. image:: VPlanUML.png


The class hierarchy of VPlan's VPRecipeBase class:

.. image:: VPlanRecipesUML.png


Integration with LoopVectorize.cpp/processLoop()
================================================

Here is the integration within LoopVectorize.cpp's existing flow, in
LoopVectorizePass::processLoop(Loop \*L):

1. Plan only after passing all early bail-outs:

   a. including those that take place after Legal, which is kept intact;
   b. including those that use the Cost Model - refactor it slightly to expose
      its MaxVF upper bound and canVectorize() early exit:

.. code-block:: c++

  // Check if the target supports potentially unsafe FP vectorization.
  // FIXME: Add a check for the type of safety issue (denormal, signaling)
  // for the target we're vectorizing for, to make sure none of the
  // additional fp-math flags can help.
  if (Hints.isPotentiallyUnsafe() &&
      TTI->isFPVectorizationPotentiallyUnsafe()) {
    DEBUG(dbgs() << "LV: Potentially unsafe FP op prevents vectorization.\n");
    ORE->emit(
        createMissedAnalysis(Hints.vectorizeAnalysisPassName(), "UnsafeFP", L)
        << "loop not vectorized due to unsafe FP support.");
    emitMissedWarning(F, L, Hints, ORE);
    return false;
  }

  if (!CM.canVectorize(OptForSize))
    return false;

  // Early prune excessive VF's
  unsigned MaxVF = CM.computeMaxVectorizationFactor(OptForSize);

  // If OptForSize, MaxVF is the only VF we consider. Abort if it needs a tail.
  if (OptForSize && CM.requiresTail(MaxVF))
    return false;

2. Plan:

   a. build VPlans for the relevant VFs and optimize them,
   b. compute the best cost using the Cost Model as before,
   c. compute the best interleave count using the Cost Model as before.
      Steps (a) and (b) are refactored into LVP.plan() (see item 4 below):

.. code-block:: c++

  // Use the planner.
  LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, &CM);

  // Get user vectorization factor.
  unsigned UserVF = Hints.getWidth();

  // Select the vectorization factor.
  LoopVectorizationCostModel::VectorizationFactor VF =
      LVP.plan(OptForSize, UserVF, MaxVF);
  bool VectorizeLoop = (VF.Width > 1);

  std::pair<StringRef, std::string> VecDiagMsg, IntDiagMsg;

  if (!UserVF && !VectorizeLoop) {
    DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n");
    VecDiagMsg = std::make_pair(
        "VectorizationNotBeneficial",
        "the cost-model indicates that vectorization is not beneficial");
  }

  // Select the interleave count.
  unsigned IC = CM.selectInterleaveCount(OptForSize, VF.Width, VF.Cost);

  // Get user interleave count.
  unsigned UserIC = Hints.getInterleave();

3. Transform:

   a. invoke an Unroller to unroll the loop (as before), or
   b. invoke LVP.executeBestPlan() to vectorize the loop:

.. code-block:: c++

  if (!VectorizeLoop) {
    assert(IC > 1 && "interleave count should not be 1 or 0");
    // If we decided that it is not legal to vectorize the loop, then
    // interleave it.
    InnerLoopUnroller Unroller(L, PSE, LI, DT, TLI, TTI, AC, ORE, IC, &LVL,
                               &CM);
    Unroller.vectorize();

    ORE->emit(OptimizationRemark(LV_NAME, "Interleaved", L->getStartLoc(),
                                 L->getHeader())
              << "interleaved loop (interleaved count: "
              << NV("InterleaveCount", IC) << ")");
  } else {

    // If we decided that it is *legal* to vectorize the loop, then do it.
    InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC,
                           &LVL, &CM);

    LVP.executeBestPlan(LB);

    ++LoopsVectorized;

    // Add metadata to disable runtime unrolling a scalar loop when there are
    // no runtime checks about strides and memory. A scalar loop that is
    // rarely used is not worth unrolling.
+ if (!LB.areSafetyChecksAdded()) + AddRuntimeUnrollDisableMetaData(L); + + // Report the vectorization decision. + ORE->emit(OptimizationRemark(LV_NAME, "Vectorized", L->getStartLoc(), + L->getHeader()) + << "vectorized loop (vectorization width: " + << NV("VectorizationFactor", VF.Width) + << ", interleaved count: " << NV("InterleaveCount", IC) << ")"); + } + + // Mark the loop as already vectorized to avoid vectorizing again. + Hints.setAlreadyVectorized(); + +4. Plan, refactored into LVP.plan(): + + a. build VPlans for relevant VF's and optimize them, + b. compute best cost using Cost Model as before: + +.. code-block:: c++ + + LoopVectorizationCostModel::VectorizationFactor + LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF, + unsigned MaxVF) { + if (UserVF) { + DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n"); + if (UserVF == 1) + return {UserVF, 0}; + assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two"); + // Collect the instructions (and their associated costs) that will be more + // profitable to scalarize. + CM->collectInstsToScalarize(UserVF); + buildInitialVPlans(UserVF, UserVF); + DEBUG(printCurrentPlans("Initial VPlans", dbgs())); + optimizePredicatedInstructions(); + DEBUG(printCurrentPlans("After optimize predicated instructions",dbgs())); + return {UserVF, 0}; + } + if (MaxVF == 1) + return {1, 0}; + + assert(MaxVF > 1 && "MaxVF is zero."); + // Collect the instructions (and their associated costs) that will be more + // profitable to scalarize. + for (unsigned i = 2; i <= MaxVF; i = i+i) + CM->collectInstsToScalarize(i); + buildInitialVPlans(2, MaxVF); + DEBUG(printCurrentPlans("Initial VPlans", dbgs())); + optimizePredicatedInstructions(); + DEBUG(printCurrentPlans("After optimize predicated instructions", dbgs())); + // Select the optimal vectorization factor. + return CM->selectVectorizationFactor(OptForSize, MaxVF); + } Index: docs/Vectorizers.rst =================================================================== --- docs/Vectorizers.rst +++ docs/Vectorizers.rst @@ -380,6 +380,18 @@ .. image:: linpack-pc.png +Internals +--------- + +.. toctree:: + :hidden: + + VectorizationPlan + +:doc:`VectorizationPlan` + The loop vectorizer is based on an abstract representation called Vectorization Plan. + This document describes its philosophy and design. + .. _slp-vectorizer: The SLP Vectorizer Index: lib/Transforms/Vectorize/CMakeLists.txt =================================================================== --- lib/Transforms/Vectorize/CMakeLists.txt +++ lib/Transforms/Vectorize/CMakeLists.txt @@ -4,6 +4,7 @@ LoopVectorize.cpp SLPVectorizer.cpp Vectorize.cpp + VPlan.cpp ADDITIONAL_HEADER_DIRS ${LLVM_MAIN_INCLUDE_DIR}/llvm/Transforms Index: lib/Transforms/Vectorize/LoopVectorize.cpp =================================================================== --- lib/Transforms/Vectorize/LoopVectorize.cpp +++ lib/Transforms/Vectorize/LoopVectorize.cpp @@ -47,6 +47,7 @@ //===----------------------------------------------------------------------===// #include "llvm/Transforms/Vectorize/LoopVectorize.h" +#include "VPlan.h" #include "llvm/ADT/DenseMap.h" #include "llvm/ADT/Hashing.h" #include "llvm/ADT/MapVector.h" @@ -98,6 +99,7 @@ #include "llvm/Transforms/Utils/LoopVersioning.h" #include "llvm/Transforms/Vectorize.h" #include +#include #include #include @@ -376,6 +378,9 @@ /// LoopVectorizationLegality class to provide information about the induction /// and reduction variables that were found to a given vectorization factor. 
class InnerLoopVectorizer {
+  friend class LoopVectorizationPlanner;
+  friend class llvm::VPlan;
+
 public:
   InnerLoopVectorizer(Loop *OrigLoop, PredicatedScalarEvolution &PSE,
                       LoopInfo *LI, DominatorTree *DT,
@@ -422,7 +427,8 @@
   // When we if-convert we need to create edge masks. We have to cache values
   // so that we don't end up with exponential recursion/IR.
   typedef DenseMap<std::pair<BasicBlock *, BasicBlock *>, VectorParts>
-      EdgeMaskCache;
+      EdgeMaskCacheTy;
+  typedef DenseMap<BasicBlock *, VectorParts> BlockMaskCacheTy;

   /// Create an empty loop, based on the loop ranges of the old loop.
   void createEmptyLoop();
@@ -455,49 +461,101 @@
   /// See PR14725.
   void fixLCSSAPHIs();

-  /// Iteratively sink the scalarized operands of a predicated instruction into
-  /// the block that was created for it.
-  void sinkScalarOperands(Instruction *PredInst);
-
-  /// Predicate conditional instructions that require predication on their
-  /// respective conditions.
-  void predicateInstructions();
-
   /// Collect the instructions from the original loop that would be trivially
   /// dead in the vectorized loop if generated.
-  void collectTriviallyDeadInstructions();
+  static void collectTriviallyDeadInstructions(
+      Loop *OrigLoop, LoopVectorizationLegality *Legal,
+      SmallPtrSetImpl<Instruction *> &DeadInstructions);

   /// Shrinks vector element sizes to the smallest bitwidth they can be legally
   /// represented as.
   void truncateToMinimalBitwidths();

+public:
   /// A helper function that computes the predicate of the block BB, assuming
   /// that the header block of the loop is set to True. It returns the *entry*
   /// mask for the block BB.
   VectorParts createBlockInMask(BasicBlock *BB);
+
+protected:
   /// A helper function that computes the predicate of the edge between SRC
   /// and DST.
   VectorParts createEdgeMask(BasicBlock *Src, BasicBlock *Dst);

-  /// A helper function to vectorize a single BB within the innermost loop.
-  void vectorizeBlockInLoop(BasicBlock *BB);
-
+public:
   /// Vectorize a single PHINode in a block. This method handles the induction
   /// variable canonicalization. It supports both VF = 1 for unrolled loops and
   /// arbitrary length vectors.
   void widenPHIInstruction(Instruction *PN, unsigned UF, unsigned VF);

+protected:
   /// Insert the new loop to the loop hierarchy and pass manager
   /// and update the analysis passes.
   void updateAnalysis();

-  /// This instruction is un-vectorizable. Implement it as a sequence
-  /// of scalars. If \p IfPredicateInstr is true we need to 'hide' each
-  /// scalarized instruction behind an if block predicated on the control
-  /// dependence of the instruction.
-  virtual void scalarizeInstruction(Instruction *Instr,
-                                    bool IfPredicateInstr = false);
+public:
+  /// A helper function to vectorize a single Instruction in the innermost loop.
+  virtual void vectorizeInstruction(Instruction &I);
+
+  /// A helper function to scalarize a single Instruction in the innermost loop.
+  /// Generates a sequence of scalar instances for each lane between \p MinLane
+  /// and \p MaxLane, times each part between \p MinPart and \p MaxPart,
+  /// inclusive.
+  void scalarizeInstruction(Instruction *Instr, unsigned MinPart,
+                            unsigned MaxPart, unsigned MinLane,
+                            unsigned MaxLane);
+
+  /// Return a value in the new loop corresponding to \p V from the original
+  /// loop at unroll index \p Part and vector index \p Lane. If the value has
+  /// been vectorized but not scalarized, the necessary extractelement
+  /// instruction will be generated.
+ Value *getScalarValue(Value *V, unsigned Part, unsigned Lane); + + /// Set a value in the new loop corresponding to \p V from the original + /// loop at unroll index \p Part and vector index \p Lane. The scalar parts + /// for this value must already be initialized. + void setScalarValue(Value *V, unsigned Part, unsigned Lane, Value *Scalar) { + assert(VectorLoopValueMap.hasScalar(V) && + "Cannot set an uninitialized scalar value"); + VectorLoopValueMap.ScalarMapStorage[V][Part][Lane] = Scalar; + } + + /// Return a value in the new loop corresponding to \p V from the original + /// loop at unroll index \p Part. If there isn't one, return a null pointer. + /// Note that the value returned may also be a null pointer if that specific + /// part has not been generated yet. + Value *getVectorValue(Value *V, unsigned Part) { + if (!VectorLoopValueMap.hasVector(V)) + return nullptr; + return VectorLoopValueMap.VectorMapStorage[V][Part]; + } + + /// Set a value in the new loop corresponding to \p V from the original + /// loop at unroll index \p Part. The vector parts for this value must already + /// be initialized. + void setVectorValue(Value *V, unsigned Part, Value *Vector) { + assert(VectorLoopValueMap.hasVector(V) && + "Cannot set an uninitialized vector value"); + VectorLoopValueMap.VectorMapStorage[V][Part] = Vector; + } + + /// Construct the vector value of a scalarized value \p V one lane at a time. + /// This method is for predicated instructions where we'd like the + /// insert-element instructions to reside in the predicated block to have + /// them execute only if needed. + void constructVectorValue(Value *V, unsigned Part, unsigned Lane); + + /// Return a constant reference to the VectorParts corresponding to \p V from + /// the original loop. If the value has already been vectorized, the + /// corresponding vector entry in VectorLoopValueMap is returned. If, + /// however, the value has a scalar entry in VectorLoopValueMap, we construct + /// new vector values on-demand by inserting the scalar values into vectors + /// with an insertelement sequence. If the value has been neither vectorized + /// nor scalarized, it must be loop invariant, so we simply broadcast the + /// value into vectors. + const VectorParts &getVectorValue(Value *V); +protected: /// Vectorize Load and Store instructions, virtual void vectorizeMemoryInstruction(Instruction *Instr); @@ -515,14 +573,6 @@ Instruction::BinaryOps Opcode = Instruction::BinaryOpsEnd); - /// Compute scalar induction steps. \p ScalarIV is the scalar induction - /// variable on which to base the steps, \p Step is the size of the step, and - /// \p EntryVal is the value from the original loop that maps to the steps. - /// Note that \p EntryVal doesn't have to be an induction variable (e.g., it - /// can be a truncate instruction). - void buildScalarSteps(Value *ScalarIV, Value *Step, Value *EntryVal, - const InductionDescriptor &ID); - /// Create a vector induction phi node based on an existing scalar one. \p /// EntryVal is the value from the original loop that maps to the vector phi /// node, and \p Step is the loop-invariant step. If \p EntryVal is a @@ -531,37 +581,31 @@ void createVectorIntOrFpInductionPHI(const InductionDescriptor &II, Value *Step, Instruction *EntryVal); - /// Widen an integer or floating-point induction variable \p IV. If \p Trunc - /// is provided, the integer induction variable will first be truncated to - /// the corresponding type. 
-  void widenIntOrFpInduction(PHINode *IV, TruncInst *Trunc = nullptr);
-
   /// Returns true if an instruction \p I should be scalarized instead of
   /// vectorized for the chosen vectorization factor.
   bool shouldScalarizeInstruction(Instruction *I) const;

-  /// Returns true if we should generate a scalar version of \p IV.
-  bool needsScalarInduction(Instruction *IV) const;
-
-  /// Return a constant reference to the VectorParts corresponding to \p V from
-  /// the original loop. If the value has already been vectorized, the
-  /// corresponding vector entry in VectorLoopValueMap is returned. If,
-  /// however, the value has a scalar entry in VectorLoopValueMap, we construct
-  /// new vector values on-demand by inserting the scalar values into vectors
-  /// with an insertelement sequence. If the value has been neither vectorized
-  /// nor scalarized, it must be loop invariant, so we simply broadcast the
-  /// value into vectors.
-  const VectorParts &getVectorValue(Value *V);
-
-  /// Return a value in the new loop corresponding to \p V from the original
-  /// loop at unroll index \p Part and vector index \p Lane. If the value has
-  /// been vectorized but not scalarized, the necessary extractelement
-  /// instruction will be generated.
-  Value *getScalarValue(Value *V, unsigned Part, unsigned Lane);
-
+public:
   /// Try to vectorize the interleaved access group that \p Instr belongs to.
   void vectorizeInterleaveGroup(Instruction *Instr);

+  /// Widen an integer or floating-point induction variable \p IV. If \p Trunc
+  /// is provided, the integer induction variable will first be truncated to
+  /// the corresponding type.
+  std::pair<Value *, Value *>
+  widenIntOrFpInduction(const InductionDescriptor &ID, bool NeedsScalarIV,
+                        PHINode *IV, TruncInst *Trunc = nullptr);
+
+  /// Compute scalar induction steps. \p ScalarIV is the scalar induction
+  /// variable on which to base the steps, \p Step is the size of the step, and
+  /// \p EntryVal is the value from the original loop that maps to the steps.
+  /// Note that \p EntryVal doesn't have to be an induction variable (e.g., it
+  /// can be a truncate instruction).
+  void buildScalarSteps(Value *ScalarIV, Value *Step, Value *EntryVal,
+                        const InductionDescriptor &ID, unsigned MinPart,
+                        unsigned MaxPart, unsigned MinLane, unsigned MaxLane);
+
+protected:
   /// Generate a shuffle sequence that will reverse the vector Vec.
   virtual Value *reverseVector(Value *Vec);
@@ -678,6 +722,16 @@
     return ScalarMapStorage[Key];
   }

+  ScalarParts &getOrCreateScalar(Value *Key, unsigned Lanes) {
+    if (!hasScalar(Key)) {
+      ScalarParts Entry(UF);
+      for (unsigned Part = 0; Part < UF; ++Part)
+        Entry[Part].resize(Lanes);
+      ScalarMapStorage[Key] = Entry;
+    }
+    return ScalarMapStorage[Key];
+  }
+
   /// \return A reference to the vector map entry corresponding to \p Key.
   /// The key should already be in the map. This function should only be used
   /// when it's necessary to update values that have already been vectorized.
@@ -696,6 +750,15 @@
   friend const VectorParts &InnerLoopVectorizer::getVectorValue(Value *V);
   friend Value *InnerLoopVectorizer::getScalarValue(Value *V, unsigned Part,
                                                     unsigned Lane);
+  friend Value *InnerLoopVectorizer::getVectorValue(Value *V, unsigned Part);
+  friend void InnerLoopVectorizer::setScalarValue(Value *V, unsigned Part,
+                                                  unsigned Lane,
+                                                  Value *Scalar);
+  friend void InnerLoopVectorizer::setVectorValue(Value *V, unsigned Part,
+                                                  Value *Vector);
+  friend void InnerLoopVectorizer::constructVectorValue(Value *V,
+                                                        unsigned Part,
+                                                        unsigned Lane);

 private:
   /// The unroll factor. Each entry in the vector map contains UF vector
@@ -749,9 +812,11 @@
   /// many different vector instructions.
   unsigned UF;

+public:
   /// The builder that we use
   IRBuilder<> Builder;

+protected:
   // --- Vectorization state ---

   /// The vector-loop preheader.
@@ -780,10 +845,8 @@
   /// vectorized and scalarized.
   ValueMap VectorLoopValueMap;

-  /// Store instructions that should be predicated, as a pair
-  ///   <StoreInst, Predicate>
-  SmallVector<std::pair<Instruction *, Value *>, 4> PredicatedInstructions;
-  EdgeMaskCache MaskCache;
+  EdgeMaskCacheTy EdgeMaskCache;
+  BlockMaskCacheTy BlockMaskCache;

   /// Trip count of the original loop.
   Value *TripCount;
   /// Trip count of the widened loop (TripCount - TripCount % (VF*UF))
@@ -798,14 +861,6 @@
   // Record whether runtime checks are added.
   bool AddedSafetyChecks;

-  // Holds instructions from the original loop whose counterparts in the
-  // vectorized loop would be trivially dead if generated. For example,
-  // original induction update instructions can become dead because we
-  // separately emit induction "steps" when generating code for the new loop.
-  // Similarly, we create a new latch condition when setting up the structure
-  // of the new loop, so the old one can become dead.
-  SmallPtrSet<Instruction *, 4> DeadInstructions;
-
   // Holds the end values for each induction variable. We save the end values
   // so we can later fix-up the external users of the induction variables.
   DenseMap<PHINode *, Value *> IVEndValues;
@@ -824,7 +879,6 @@
                             UnrollFactor, LVL, CM) {}

 private:
-  void vectorizeMemoryInstruction(Instruction *Instr) override;
   Value *getBroadcastInstrs(Value *V) override;
   Value *getStepVector(Value *Val, int StartIdx, Value *Step,
                        Instruction::BinaryOps Opcode =
@@ -1843,6 +1897,7 @@
     unsigned Width; // Vector width with best cost
     unsigned Cost;  // Cost of the loop with that width
   };
+
   /// \return The most profitable vectorization factor and the cost of that VF.
   /// This method checks every power of two up to MaxVF. If UserVF is not ZERO
   /// then this vectorization factor will be selected if vectorization is
@@ -1904,6 +1959,9 @@
   /// \returns True if it is more profitable to scalarize instruction \p I for
   /// vectorization factor \p VF.
   bool isProfitableToScalarize(Instruction *I, unsigned VF) const {
+    // Unroller also calls this method, but does not collectInstsToScalarize.
+    if (VF == 1)
+      return true;
     auto Scalars = InstsToScalarize.find(VF);
     assert(Scalars != InstsToScalarize.end() &&
            "VF not yet analyzed for scalarization profitability");
@@ -2110,10 +2168,12 @@
   int computePredInstDiscount(Instruction *PredInst, ScalarCostsTy &ScalarCosts,
                               unsigned VF);

+public:
   /// Collects the instructions to scalarize for each predicated instruction in
   /// the loop.
   void collectInstsToScalarize(unsigned VF);

+private:
   /// Collect the instructions that are uniform after vectorization. An
   /// instruction is uniform if we represent it with a single scalar value in
   /// the vectorized loop corresponding to each vector iteration.
Examples of
@@ -2132,6 +2192,7 @@
   /// iteration of the original scalar loop.
   void collectLoopScalars(unsigned VF);

+public:
   /// Collect Uniform and Scalar values for the given \p VF.
   /// The sets depend on CM decision for Load/Store instructions
   /// that may be vectorized as interleave, gather-scatter or scalarized.
@@ -2144,6 +2205,7 @@
     collectLoopScalars(VF);
   }

+private:
   /// Keeps cost model vectorization decision and cost for instructions.
   /// Right now it is used for memory instructions only.
   typedef DenseMap<std::pair<Instruction *, unsigned>,
@@ -2183,9 +2245,18 @@
 /// LoopVectorizationPlanner - drives the vectorization process after having
 /// passed Legality checks.
+/// The planner builds and optimizes the Vectorization Plans, which record the
+/// decisions on how to vectorize the given loop. In particular, they represent
+/// the control-flow of the vectorized version, the replication of instructions
+/// that are to be scalarized, and the interleave access groups.
 class LoopVectorizationPlanner {
 public:
-  LoopVectorizationPlanner(LoopVectorizationCostModel &CM) : CM(CM) {}
+  LoopVectorizationPlanner(Loop *L, LoopInfo *LI, const TargetLibraryInfo *TLI,
+                           const TargetTransformInfo *TTI,
+                           LoopVectorizationLegality *Legal,
+                           LoopVectorizationCostModel &CM)
+      : TheLoop(L), LI(LI), TLI(TLI), TTI(TTI), Legal(Legal), CM(CM),
+        ILV(nullptr), BestVF(0), BestUF(0) {}

   ~LoopVectorizationPlanner() {}

@@ -2193,9 +2264,111 @@
   LoopVectorizationCostModel::VectorizationFactor plan(bool OptForSize,
                                                        unsigned UserVF);

+  /// Finalize the best decision and dispose of all other VPlans.
+  void setBestPlan(unsigned VF, unsigned UF);
+
+  /// Generate the IR code for the body of the vectorized loop according to the
+  /// best selected VPlan.
+  void executeBestPlan(InnerLoopVectorizer &LB);
+
+  VPlan *getVPlanForVF(unsigned VF) { return VPlans[VF].get(); }
+
+  void printCurrentPlans(const std::string &Title, raw_ostream &O);
+
+  /// Test a predicate on a range of VFs.
+  /// The returned value reflects the result for a prefix of the range, with \p
+  /// EndRangeVF modified accordingly.
+  bool testVFRange(const std::function<bool(unsigned)> &Predicate,
+                   unsigned StartRangeVF, unsigned &EndRangeVF);
+
+protected:
+  /// Build initial VPlans according to the information gathered by Legal
+  /// when it checked if it is legal to vectorize this loop.
+  /// Returns the number of VPlans built, zero if failed.
+  unsigned buildInitialVPlans(unsigned MinVF, unsigned MaxVF);
+
+  /// On VPlan construction, each instruction marked for predication by Legal
+  /// gets its own basic block guarded by an if-then. This initial planning
+  /// is legal, but is not optimal. This function attempts to leverage the
+  /// necessary conditional execution of the predicated instruction in favor
+  /// of other related instructions. The function applies these optimizations
+  /// to all VPlans.
+  void optimizePredicatedInstructions();
+
+private:
+  /// Build an initial VPlan according to the information gathered by Legal
+  /// when it checked if it is legal to vectorize this loop. \return a VPlan
+  /// that corresponds to vectorization factors starting from the given
+  /// \p StartRangeVF and up to \p EndRangeVF, exclusive, possibly decreasing
+  /// the given \p EndRangeVF.
+  std::shared_ptr<VPlan> buildInitialVPlan(unsigned StartRangeVF,
+                                           unsigned &EndRangeVF);
+
+  std::pair<Value *, Value *>
+  widenIntOrFpInduction(VPlan *Plan, unsigned StartRangeVF,
+                        unsigned &EndRangeVF, PHINode *IV,
+                        TruncInst *Trunc = nullptr);
+
+  /// Determine whether \p I will be scalarized in a given range of VFs.
+  /// The returned value reflects the result for a prefix of the range, with \p
+  /// EndRangeVF modified accordingly.
+  bool willBeScalarized(Instruction *I, unsigned StartRangeVF,
+                        unsigned &EndRangeVF);
+
+  /// Iteratively sink the scalarized operands of a predicated instruction into
+  /// the block that was created for it.
+  void sinkScalarOperands(Instruction *PredInst, VPlan *Plan);
+
+  /// Determine whether a newly-created recipe adds a second user to one of the
+  /// variants of the values its ingredients use. This may cause the defining
+  /// recipe to generate that variant itself to serve all such users.
+  void assignScalarVectorConversions(Instruction *PredInst, VPlan *Plan);
+
+  /// Returns true if an instruction \p I should be scalarized instead of
+  /// vectorized for the chosen vectorization factor.
+  bool shouldScalarizeInstruction(Instruction *I, unsigned VF) const;
+
+private:
+  /// The loop that we evaluate.
+  Loop *TheLoop;
+
+  /// Loop Info analysis.
+  LoopInfo *LI;
+
+  /// Target Library Info.
+  const TargetLibraryInfo *TLI;
+
+  /// Target Transform Info.
+  const TargetTransformInfo *TTI;
+
+  /// The legality analysis.
+  LoopVectorizationLegality *Legal;

   /// The profitability analysis.
   LoopVectorizationCostModel &CM;
+
+  InnerLoopVectorizer *ILV;
+
+  // Holds instructions from the original loop that we predicated. Such
+  // instructions reside in their own conditioned VPBasicBlock and represent
+  // an optimization opportunity for sinking their scalarized operands, thus
+  // reducing their cost by the predicate's probability.
+  SmallPtrSet<Instruction *, 4> PredicatedInstructions;
+
+  /// VPlans are shared between VFs, use smart pointers.
+  DenseMap<unsigned, std::shared_ptr<VPlan>> VPlans;
+
+  unsigned BestVF;
+
+  unsigned BestUF;
+
+  // Holds instructions from the original loop whose counterparts in the
+  // vectorized loop would be trivially dead if generated. For example,
+  // original induction update instructions can become dead because we
+  // separately emit induction "steps" when generating code for the new loop.
+  // Similarly, we create a new latch condition when setting up the structure
+  // of the new loop, so the old one can become dead.
+  SmallPtrSet<Instruction *, 4> DeadInstructions;
 };

 /// \brief This holds vectorization requirements that must be verified late in
@@ -2430,27 +2603,14 @@
          Cost->isProfitableToScalarize(I, VF);
 }

-bool InnerLoopVectorizer::needsScalarInduction(Instruction *IV) const {
-  if (shouldScalarizeInstruction(IV))
-    return true;
-  auto isScalarInst = [&](User *U) -> bool {
-    auto *I = cast<Instruction>(U);
-    return (OrigLoop->contains(I) && shouldScalarizeInstruction(I));
-  };
-  return any_of(IV->users(), isScalarInst);
-}
-
-void InnerLoopVectorizer::widenIntOrFpInduction(PHINode *IV, TruncInst *Trunc) {
+std::pair<Value *, Value *>
+InnerLoopVectorizer::widenIntOrFpInduction(const InductionDescriptor &ID,
+                                           bool NeedsScalarIV, PHINode *IV,
+                                           TruncInst *Trunc) {
   assert((IV->getType()->isIntegerTy() || IV != OldInduction) &&
          "Primary induction variable must have an integer type");

-  auto II = Legal->getInductionVars()->find(IV);
-  assert(II != Legal->getInductionVars()->end() && "IV is not an induction");
-
-  auto ID = II->second;
-
   assert(IV->getType() == ID.getStartValue()->getType() && "Types must match");

   // The scalar value to broadcast. This will be derived from the canonical
   // induction variable.
   Value *ScalarIV = nullptr;
@@ -2462,11 +2622,6 @@
   // True if we have vectorized the induction variable.
   auto VectorizedIV = false;

-  // Determine if we want a scalar version of the induction variable. This is
This is - // true if the induction variable itself is not widened, or if it has at - // least one user in the loop that is not widened. - auto NeedsScalarIV = VF > 1 && needsScalarInduction(EntryVal); - // Generate code for the induction step. Note that induction steps are // required to be loop-invariant assert(PSE.getSE()->isLoopInvariant(ID.getStep(), OrigLoop) && @@ -2527,13 +2682,9 @@ } // If an induction variable is only used for counting loop iterations or - // calculating addresses, it doesn't need to be widened. Create scalar steps - // that can be used by instructions we will later scalarize. Note that the - // addition of the scalar steps will not increase the number of instructions - // in the loop in the common case prior to InstCombine. We will be trading - // one vector extract for each scalar step. - if (NeedsScalarIV) - buildScalarSteps(ScalarIV, Step, EntryVal, ID); + // calculating addresses, it doesn't need to be widened. + + return std::make_pair(ScalarIV, Step); } Value *InnerLoopVectorizer::getStepVector(Value *Val, int StartIdx, Value *Step, @@ -2594,7 +2745,9 @@ void InnerLoopVectorizer::buildScalarSteps(Value *ScalarIV, Value *Step, Value *EntryVal, - const InductionDescriptor &ID) { + const InductionDescriptor &ID, + unsigned MinPart, unsigned MaxPart, + unsigned MinLane, unsigned MaxLane) { // We shouldn't have to build scalar steps if we aren't vectorizing. assert(VF > 1 && "VF should be greater than one"); @@ -2616,24 +2769,18 @@ MulOp = Instruction::FMul; } - // Determine the number of scalars we need to generate for each unroll - // iteration. If EntryVal is uniform, we only need to generate the first - // lane. Otherwise, we generate all VF values. - unsigned Lanes = - Cost->isUniformAfterVectorization(cast(EntryVal), VF) ? 1 : VF; + ScalarParts &Entry = VectorLoopValueMap.getOrCreateScalar(EntryVal, VF); // Compute the scalar steps and save the results in VectorLoopValueMap. - ScalarParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { + for (unsigned Part = MinPart; Part <= MaxPart; ++Part) { Entry[Part].resize(VF); - for (unsigned Lane = 0; Lane < Lanes; ++Lane) { + for (unsigned Lane = MinLane; Lane <= MaxLane; ++Lane) { auto *StartIdx = getSignedIntOrFpConstant(ScalarIVTy, VF * Part + Lane); auto *Mul = addFastMathFlag(Builder.CreateBinOp(MulOp, StartIdx, Step)); auto *Add = addFastMathFlag(Builder.CreateBinOp(AddOp, ScalarIV, Mul)); Entry[Part][Lane] = Add; } } - VectorLoopValueMap.initScalar(EntryVal, Entry); } int LoopVectorizationLegality::isConsecutivePtr(Value *Ptr) { @@ -2651,6 +2798,39 @@ return LAI->isUniform(V); } +void InnerLoopVectorizer::constructVectorValue(Value *V, unsigned Part, + unsigned Lane) { + assert(V != Induction && "The new induction variable should not be used."); + assert(!V->getType()->isVectorTy() && "Can't widen a vector"); + assert(!V->getType()->isVoidTy() && "Type does not produce a value"); + + if (!VectorLoopValueMap.hasVector(V)) { + VectorParts Entry(UF); + for (unsigned P = 0; P < UF; ++P) + Entry[P] = nullptr; + VectorLoopValueMap.initVector(V, Entry); + } + + VectorParts &Parts = VectorLoopValueMap.VectorMapStorage[V]; + + assert(VectorLoopValueMap.hasScalar(V) && "Expected scalar values to exist"); + + auto *ScalarInst = cast(getScalarValue(V, Part, Lane)); + + Value *VectorValue = nullptr; + + // If we're constructing lane 0, start from undef; otherwise, start from the + // last value created. 
+  if (Lane == 0)
+    VectorValue = UndefValue::get(VectorType::get(V->getType(), VF));
+  else
+    VectorValue = Parts[Part];
+
+  VectorValue = Builder.CreateInsertElement(VectorValue, ScalarInst,
+                                            Builder.getInt32(Lane));
+  Parts[Part] = VectorValue;
+}
+
 const InnerLoopVectorizer::VectorParts &
 InnerLoopVectorizer::getVectorValue(Value *V) {
   assert(V != Induction && "The new induction variable should not be used.");
@@ -2956,10 +3136,6 @@
     Alignment = DL.getABITypeAlignment(ScalarDataTy);
   unsigned AddressSpace = getMemInstAddressSpace(Instr);

-  // Scalarize the memory instruction if necessary.
-  if (Decision == LoopVectorizationCostModel::CM_Scalarize)
-    return scalarizeInstruction(Instr, Legal->isScalarWithPredication(Instr));
-
   // Determine if the pointer operand of the access is either consecutive or
   // reverse consecutive.
   int ConsecutiveStride = Legal->isConsecutivePtr(Ptr);
@@ -3067,11 +3243,11 @@
 }

 void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr,
-                                               bool IfPredicateInstr) {
+                                               unsigned MinPart,
+                                               unsigned MaxPart,
+                                               unsigned MinLane,
+                                               unsigned MaxLane) {
   assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");
-  DEBUG(dbgs() << "LV: Scalarizing"
-               << (IfPredicateInstr ? " and predicating:" : ":") << *Instr
-               << '\n');

   // Holds vector parameters or scalars, in case of uniform vals.
   SmallVector<VectorParts, 4> Params;
@@ -3081,32 +3257,12 @@
   bool IsVoidRetTy = Instr->getType()->isVoidTy();

   // Initialize a new scalar map entry.
-  ScalarParts Entry(UF);
-
-  VectorParts Cond;
-  if (IfPredicateInstr)
-    Cond = createBlockInMask(Instr->getParent());
-
-  // Determine the number of scalars we need to generate for each unroll
-  // iteration. If the instruction is uniform, we only need to generate the
-  // first lane. Otherwise, we generate all VF values.
-  unsigned Lanes = Cost->isUniformAfterVectorization(Instr, VF) ? 1 : VF;
+  ScalarParts &Entry = VectorLoopValueMap.getOrCreateScalar(Instr, VF);

   // For each vector unroll 'part':
-  for (unsigned Part = 0; Part < UF; ++Part) {
-    Entry[Part].resize(VF);
+  for (unsigned Part = MinPart; Part <= MaxPart; ++Part) {
     // For each scalar that we create:
-    for (unsigned Lane = 0; Lane < Lanes; ++Lane) {
-
-      // Start if-block.
-      Value *Cmp = nullptr;
-      if (IfPredicateInstr) {
-        Cmp = Cond[Part];
-        if (Cmp->getType()->isVectorTy())
-          Cmp = Builder.CreateExtractElement(Cmp, Builder.getInt32(Lane));
-        Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Cmp,
-                                 ConstantInt::get(Cmp->getType(), 1));
-      }
+    for (unsigned Lane = MinLane; Lane <= MaxLane; ++Lane) {

       Instruction *Cloned = Instr->clone();
       if (!IsVoidRetTy)
@@ -3130,13 +3286,8 @@
       if (auto *II = dyn_cast<IntrinsicInst>(Cloned))
         if (II->getIntrinsicID() == Intrinsic::assume)
           AC->registerAssumption(II);
-
-      // End if-block.
-      if (IfPredicateInstr)
-        PredicatedInstructions.push_back(std::make_pair(Cloned, Cmp));
     }
   }
-  VectorLoopValueMap.initScalar(Instr, Entry);
 }

 PHINode *InnerLoopVectorizer::createInductionVariable(Loop *L, Value *Start,
@@ -3876,6 +4027,7 @@
 }

 void InnerLoopVectorizer::vectorizeLoop() {
+
   //===------------------------------------------------===//
   //
   // Notice: any optimization or new instruction that go
@@ -3884,20 +4036,6 @@
   //
   //===------------------------------------------------===//

-  // Collect instructions from the original loop that will become trivially
-  // dead in the vectorized loop. We don't need to vectorize these
-  // instructions.
-  collectTriviallyDeadInstructions();
-
-  // Scan the loop in a topological order to ensure that defs are vectorized
-  // before users.
-  LoopBlocksDFS DFS(OrigLoop);
-  DFS.perform(LI);
-
-  // Vectorize all of the blocks in the original loop.
-  for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO()))
-    vectorizeBlockInLoop(BB);
-
   // Insert truncates and extends for any truncated instructions as hints to
   // InstCombine.
   if (VF > 1)
@@ -3926,7 +4064,6 @@
                                 IVEndValues[Entry.first], LoopMiddleBlock);

   fixLCSSAPHIs();
-  predicateInstructions();

   // Remove redundant induction instructions.
   cse(LoopVectorBody);
@@ -4181,7 +4318,7 @@
     cast<PHINode>(VecRdxPhi[part])
         ->addIncoming(StartVal, LoopVectorPreHeader);
     cast<PHINode>(VecRdxPhi[part])
-        ->addIncoming(Val[part], LoopVectorBody);
+        ->addIncoming(Val[part], LI->getLoopFor(LoopVectorBody)->getLoopLatch());
   }

   // Before each round, move the insertion point right between
@@ -4326,7 +4463,9 @@
   }
 }

-void InnerLoopVectorizer::collectTriviallyDeadInstructions() {
+void InnerLoopVectorizer::collectTriviallyDeadInstructions(
+    Loop *OrigLoop, LoopVectorizationLegality *Legal,
+    SmallPtrSetImpl<Instruction *> &DeadInstructions) {
   BasicBlock *Latch = OrigLoop->getLoopLatch();

   // We create new control-flow for the vectorized loop, so the original
@@ -4349,199 +4488,14 @@
   }
 }

-void InnerLoopVectorizer::sinkScalarOperands(Instruction *PredInst) {
-
-  // The basic block and loop containing the predicated instruction.
-  auto *PredBB = PredInst->getParent();
-  auto *VectorLoop = LI->getLoopFor(PredBB);
-
-  // Initialize a worklist with the operands of the predicated instruction.
-  SetVector<Value *> Worklist(PredInst->op_begin(), PredInst->op_end());
-
-  // Holds instructions that we need to analyze again. An instruction may be
-  // reanalyzed if we don't yet know if we can sink it or not.
-  SmallVector<Value *, 8> InstsToReanalyze;
-
-  // Returns true if a given use occurs in the predicated block. Phi nodes use
-  // their operands in their corresponding predecessor blocks.
-  auto isBlockOfUsePredicated = [&](Use &U) -> bool {
-    auto *I = cast<Instruction>(U.getUser());
-    BasicBlock *BB = I->getParent();
-    if (auto *Phi = dyn_cast<PHINode>(I))
-      BB = Phi->getIncomingBlock(
-          PHINode::getIncomingValueNumForOperand(U.getOperandNo()));
-    return BB == PredBB;
-  };
-
-  // Iteratively sink the scalarized operands of the predicated instruction
-  // into the block we created for it. When an instruction is sunk, it's
-  // operands are then added to the worklist. The algorithm ends after one pass
-  // through the worklist doesn't sink a single instruction.
-  bool Changed;
-  do {
-
-    // Add the instructions that need to be reanalyzed to the worklist, and
-    // reset the changed indicator.
-    Worklist.insert(InstsToReanalyze.begin(), InstsToReanalyze.end());
-    InstsToReanalyze.clear();
-    Changed = false;
-
-    while (!Worklist.empty()) {
-      auto *I = dyn_cast<Instruction>(Worklist.pop_back_val());
-
-      // We can't sink an instruction if it is a phi node, is already in the
-      // predicated block, is not in the loop, or may have side effects.
-      if (!I || isa<PHINode>(I) || I->getParent() == PredBB ||
-          !VectorLoop->contains(I) || I->mayHaveSideEffects())
-        continue;
-
-      // It's legal to sink the instruction if all its uses occur in the
-      // predicated block. Otherwise, there's nothing to do yet, and we may
-      // need to reanalyze the instruction.
-      if (!all_of(I->uses(), isBlockOfUsePredicated)) {
-        InstsToReanalyze.push_back(I);
-        continue;
-      }
-
-      // Move the instruction to the beginning of the predicated block, and add
-      // it's operands to the worklist.
- I->moveBefore(&*PredBB->getFirstInsertionPt()); - Worklist.insert(I->op_begin(), I->op_end()); - - // The sinking may have enabled other instructions to be sunk, so we will - // need to iterate. - Changed = true; - } - } while (Changed); -} - -void InnerLoopVectorizer::predicateInstructions() { - - // For each instruction I marked for predication on value C, split I into its - // own basic block to form an if-then construct over C. Since I may be fed by - // an extractelement instruction or other scalar operand, we try to - // iteratively sink its scalar operands into the predicated block. If I feeds - // an insertelement instruction, we try to move this instruction into the - // predicated block as well. For non-void types, a phi node will be created - // for the resulting value (either vector or scalar). - // - // So for some predicated instruction, e.g. the conditional sdiv in: - // - // for.body: - // ... - // %add = add nsw i32 %mul, %0 - // %cmp5 = icmp sgt i32 %2, 7 - // br i1 %cmp5, label %if.then, label %if.end - // - // if.then: - // %div = sdiv i32 %0, %1 - // br label %if.end - // - // if.end: - // %x.0 = phi i32 [ %div, %if.then ], [ %add, %for.body ] - // - // the sdiv at this point is scalarized and if-converted using a select. - // The inactive elements in the vector are not used, but the predicated - // instruction is still executed for all vector elements, essentially: - // - // vector.body: - // ... - // %17 = add nsw <2 x i32> %16, %wide.load - // %29 = extractelement <2 x i32> %wide.load, i32 0 - // %30 = extractelement <2 x i32> %wide.load51, i32 0 - // %31 = sdiv i32 %29, %30 - // %32 = insertelement <2 x i32> undef, i32 %31, i32 0 - // %35 = extractelement <2 x i32> %wide.load, i32 1 - // %36 = extractelement <2 x i32> %wide.load51, i32 1 - // %37 = sdiv i32 %35, %36 - // %38 = insertelement <2 x i32> %32, i32 %37, i32 1 - // %predphi = select <2 x i1> %26, <2 x i32> %38, <2 x i32> %17 - // - // Predication will now re-introduce the original control flow to avoid false - // side-effects by the sdiv instructions on the inactive elements, yielding - // (after cleanup): - // - // vector.body: - // ... 
- // %5 = add nsw <2 x i32> %4, %wide.load - // %8 = icmp sgt <2 x i32> %wide.load52, - // %9 = extractelement <2 x i1> %8, i32 0 - // br i1 %9, label %pred.sdiv.if, label %pred.sdiv.continue - // - // pred.sdiv.if: - // %10 = extractelement <2 x i32> %wide.load, i32 0 - // %11 = extractelement <2 x i32> %wide.load51, i32 0 - // %12 = sdiv i32 %10, %11 - // %13 = insertelement <2 x i32> undef, i32 %12, i32 0 - // br label %pred.sdiv.continue - // - // pred.sdiv.continue: - // %14 = phi <2 x i32> [ undef, %vector.body ], [ %13, %pred.sdiv.if ] - // %15 = extractelement <2 x i1> %8, i32 1 - // br i1 %15, label %pred.sdiv.if54, label %pred.sdiv.continue55 - // - // pred.sdiv.if54: - // %16 = extractelement <2 x i32> %wide.load, i32 1 - // %17 = extractelement <2 x i32> %wide.load51, i32 1 - // %18 = sdiv i32 %16, %17 - // %19 = insertelement <2 x i32> %14, i32 %18, i32 1 - // br label %pred.sdiv.continue55 - // - // pred.sdiv.continue55: - // %20 = phi <2 x i32> [ %14, %pred.sdiv.continue ], [ %19, %pred.sdiv.if54 ] - // %predphi = select <2 x i1> %8, <2 x i32> %20, <2 x i32> %5 - - for (auto KV : PredicatedInstructions) { - BasicBlock::iterator I(KV.first); - BasicBlock *Head = I->getParent(); - auto *BB = SplitBlock(Head, &*std::next(I), DT, LI); - auto *T = SplitBlockAndInsertIfThen(KV.second, &*I, /*Unreachable=*/false, - /*BranchWeights=*/nullptr, DT, LI); - I->moveBefore(T); - sinkScalarOperands(&*I); - - I->getParent()->setName(Twine("pred.") + I->getOpcodeName() + ".if"); - BB->setName(Twine("pred.") + I->getOpcodeName() + ".continue"); - - // If the instruction is non-void create a Phi node at reconvergence point. - if (!I->getType()->isVoidTy()) { - Value *IncomingTrue = nullptr; - Value *IncomingFalse = nullptr; - - if (I->hasOneUse() && isa(*I->user_begin())) { - // If the predicated instruction is feeding an insert-element, move it - // into the Then block; Phi node will be created for the vector. - InsertElementInst *IEI = cast(*I->user_begin()); - IEI->moveBefore(T); - IncomingTrue = IEI; // the new vector with the inserted element. - IncomingFalse = IEI->getOperand(0); // the unmodified vector - } else { - // Phi node will be created for the scalar predicated instruction. - IncomingTrue = &*I; - IncomingFalse = UndefValue::get(I->getType()); - } - - BasicBlock *PostDom = I->getParent()->getSingleSuccessor(); - assert(PostDom && "Then block has multiple successors"); - PHINode *Phi = - PHINode::Create(IncomingTrue->getType(), 2, "", &PostDom->front()); - IncomingTrue->replaceAllUsesWith(Phi); - Phi->addIncoming(IncomingFalse, Head); - Phi->addIncoming(IncomingTrue, I->getParent()); - } - } - - DEBUG(DT->verifyDomTree()); -} - InnerLoopVectorizer::VectorParts InnerLoopVectorizer::createEdgeMask(BasicBlock *Src, BasicBlock *Dst) { assert(is_contained(predecessors(Dst), Src) && "Invalid edge"); // Look for cached value. 
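  // EdgeMaskCache memoizes the mask computed for each (Src, Dst) edge; the
  // new BlockMaskCache (see createBlockInMask below) does the same for whole
  // blocks, so each mask is computed at most once.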
std::pair Edge(Src, Dst); - EdgeMaskCache::iterator ECEntryIt = MaskCache.find(Edge); - if (ECEntryIt != MaskCache.end()) + EdgeMaskCacheTy::iterator ECEntryIt = EdgeMaskCache.find(Edge); + if (ECEntryIt != EdgeMaskCache.end()) return ECEntryIt->second; VectorParts SrcMask = createBlockInMask(Src); @@ -4560,11 +4514,11 @@ for (unsigned part = 0; part < UF; ++part) EdgeMask[part] = Builder.CreateAnd(EdgeMask[part], SrcMask[part]); - MaskCache[Edge] = EdgeMask; + EdgeMaskCache[Edge] = EdgeMask; return EdgeMask; } - MaskCache[Edge] = SrcMask; + EdgeMaskCache[Edge] = SrcMask; return SrcMask; } @@ -4572,6 +4526,11 @@ InnerLoopVectorizer::createBlockInMask(BasicBlock *BB) { assert(OrigLoop->contains(BB) && "Block is not a part of a loop"); + // Look for cached value. + BlockMaskCacheTy::iterator BCEntryIt = BlockMaskCache.find(BB); + if (BCEntryIt != BlockMaskCache.end()) + return BCEntryIt->second; + // Loop incoming mask is all-one. if (OrigLoop->getHeader() == BB) { Value *C = ConstantInt::get(IntegerType::getInt1Ty(BB->getContext()), 1); @@ -4589,6 +4548,7 @@ BlockMask[part] = Builder.CreateOr(BlockMask[part], EM[part]); } + BlockMaskCache[BB] = BlockMask; return BlockMask; } @@ -4664,7 +4624,7 @@ llvm_unreachable("Unknown induction"); case InductionDescriptor::IK_IntInduction: case InductionDescriptor::IK_FpInduction: - return widenIntOrFpInduction(P); + llvm_unreachable("Integer/fp induction handled elsewhere"); case InductionDescriptor::IK_PtrInduction: { // Handle the pointer induction variable case. assert(P->getType()->isPointerTy() && "Unexpected type."); @@ -4712,333 +4672,280 @@ return !CInt || CInt->isZero(); } -void InnerLoopVectorizer::vectorizeBlockInLoop(BasicBlock *BB) { - // For each instruction in the old loop. - for (Instruction &I : *BB) { - - // If the instruction will become trivially dead when vectorized, we don't - // need to generate it. - if (DeadInstructions.count(&I)) - continue; - - // Scalarize instructions that should remain scalar after vectorization. - if (VF > 1 && - !(isa(&I) || isa(&I) || - isa(&I)) && - shouldScalarizeInstruction(&I)) { - scalarizeInstruction(&I, Legal->isScalarWithPredication(&I)); - continue; - } - - switch (I.getOpcode()) { - case Instruction::Br: - // Nothing to do for PHIs and BR, since we already took care of the - // loop control flow instructions. - continue; - case Instruction::PHI: { - // Vectorize PHINodes. - widenPHIInstruction(&I, UF, VF); - continue; - } // End of PHI. - case Instruction::GetElementPtr: { - // Construct a vector GEP by widening the operands of the scalar GEP as - // necessary. We mark the vector GEP 'inbounds' if appropriate. A GEP - // results in a vector of pointers when at least one operand of the GEP - // is vector-typed. Thus, to keep the representation compact, we only use - // vector-typed operands for loop-varying values. - auto *GEP = cast(&I); - VectorParts Entry(UF); +void InnerLoopVectorizer::vectorizeInstruction(Instruction &I) { + switch (I.getOpcode()) { + case Instruction::PHI: { + llvm_unreachable("Phi nodes have their own recipe"); + } // End of PHI. + case Instruction::GetElementPtr: { + // Construct a vector GEP by widening the operands of the scalar GEP as + // necessary. We mark the vector GEP 'inbounds' if appropriate. A GEP + // results in a vector of pointers when at least one operand of the GEP + // is vector-typed. Thus, to keep the representation compact, we only use + // vector-typed operands for loop-varying values. 
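+    // For example, at VF == 4 a scalar GEP whose index %i is loop-varying,
+    //   %gep = getelementptr inbounds i32, i32* %base, i64 %i
+    // is widened into a single GEP yielding a vector of pointers,
+    //   %vgep = getelementptr inbounds i32, i32* %base, <4 x i64> %vec.i
+    // (illustrative IR; the value names here are hypothetical).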
+ auto *GEP = cast(&I); + VectorParts Entry(UF); - if (VF > 1 && OrigLoop->hasLoopInvariantOperands(GEP)) { - // If we are vectorizing, but the GEP has only loop-invariant operands, - // the GEP we build (by only using vector-typed operands for - // loop-varying values) would be a scalar pointer. Thus, to ensure we - // produce a vector of pointers, we need to either arbitrarily pick an - // operand to broadcast, or broadcast a clone of the original GEP. - // Here, we broadcast a clone of the original. - // - // TODO: If at some point we decide to scalarize instructions having - // loop-invariant operands, this special case will no longer be - // required. We would add the scalarization decision to - // collectLoopScalars() and teach getVectorValue() to broadcast - // the lane-zero scalar value. - auto *Clone = Builder.Insert(GEP->clone()); - for (unsigned Part = 0; Part < UF; ++Part) - Entry[Part] = Builder.CreateVectorSplat(VF, Clone); - } else { - // If the GEP has at least one loop-varying operand, we are sure to - // produce a vector of pointers. But if we are only unrolling, we want - // to produce a scalar GEP for each unroll part. Thus, the GEP we - // produce with the code below will be scalar (if VF == 1) or vector - // (otherwise). Note that for the unroll-only case, we still maintain - // values in the vector mapping with initVector, as we do for other - // instructions. - for (unsigned Part = 0; Part < UF; ++Part) { - - // The pointer operand of the new GEP. If it's loop-invariant, we - // won't broadcast it. - auto *Ptr = OrigLoop->isLoopInvariant(GEP->getPointerOperand()) - ? GEP->getPointerOperand() - : getVectorValue(GEP->getPointerOperand())[Part]; - - // Collect all the indices for the new GEP. If any index is - // loop-invariant, we won't broadcast it. - SmallVector Indices; - for (auto &U : make_range(GEP->idx_begin(), GEP->idx_end())) { - if (OrigLoop->isLoopInvariant(U.get())) - Indices.push_back(U.get()); - else - Indices.push_back(getVectorValue(U.get())[Part]); - } + if (VF > 1 && OrigLoop->hasLoopInvariantOperands(GEP)) { + // If we are vectorizing, but the GEP has only loop-invariant operands, + // the GEP we build (by only using vector-typed operands for + // loop-varying values) would be a scalar pointer. Thus, to ensure we + // produce a vector of pointers, we need to either arbitrarily pick an + // operand to broadcast, or broadcast a clone of the original GEP. + // Here, we broadcast a clone of the original. + // + // TODO: If at some point we decide to scalarize instructions having + // loop-invariant operands, this special case will no longer be + // required. We would add the scalarization decision to + // collectLoopScalars() and teach getVectorValue() to broadcast + // the lane-zero scalar value. + auto *Clone = Builder.Insert(GEP->clone()); + for (unsigned Part = 0; Part < UF; ++Part) + Entry[Part] = Builder.CreateVectorSplat(VF, Clone); + } else { + // If the GEP has at least one loop-varying operand, we are sure to + // produce a vector of pointers. But if we are only unrolling, we want + // to produce a scalar GEP for each unroll part. Thus, the GEP we + // produce with the code below will be scalar (if VF == 1) or vector + // (otherwise). Note that for the unroll-only case, we still maintain + // values in the vector mapping with initVector, as we do for other + // instructions. + for (unsigned Part = 0; Part < UF; ++Part) { - // Create the new GEP. Note that this GEP may be a scalar if VF == 1, - // but it should be a vector, otherwise. 
- auto *NewGEP = GEP->isInBounds() - ? Builder.CreateInBoundsGEP(Ptr, Indices) - : Builder.CreateGEP(Ptr, Indices); - assert((VF == 1 || NewGEP->getType()->isVectorTy()) && - "NewGEP is not a pointer vector"); - Entry[Part] = NewGEP; + // The pointer operand of the new GEP. If it's loop-invariant, we + // won't broadcast it. + auto *Ptr = OrigLoop->isLoopInvariant(GEP->getPointerOperand()) + ? GEP->getPointerOperand() + : getVectorValue(GEP->getPointerOperand())[Part]; + + // Collect all the indices for the new GEP. If any index is + // loop-invariant, we won't broadcast it. + SmallVector Indices; + for (auto &U : make_range(GEP->idx_begin(), GEP->idx_end())) { + if (OrigLoop->isLoopInvariant(U.get())) + Indices.push_back(U.get()); + else + Indices.push_back(getVectorValue(U.get())[Part]); } - } - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, GEP); - break; - } - case Instruction::UDiv: - case Instruction::SDiv: - case Instruction::SRem: - case Instruction::URem: - // Scalarize with predication if this instruction may divide by zero and - // block execution is conditional, otherwise fallthrough. - if (Legal->isScalarWithPredication(&I)) { - scalarizeInstruction(&I, true); - continue; + // Create the new GEP. Note that this GEP may be a scalar if VF == 1, + // but it should be a vector, otherwise. + auto *NewGEP = GEP->isInBounds() + ? Builder.CreateInBoundsGEP(Ptr, Indices) + : Builder.CreateGEP(Ptr, Indices); + assert((VF == 1 || NewGEP->getType()->isVectorTy()) && + "NewGEP is not a pointer vector"); + Entry[Part] = NewGEP; } - case Instruction::Add: - case Instruction::FAdd: - case Instruction::Sub: - case Instruction::FSub: - case Instruction::Mul: - case Instruction::FMul: - case Instruction::FDiv: - case Instruction::FRem: - case Instruction::Shl: - case Instruction::LShr: - case Instruction::AShr: - case Instruction::And: - case Instruction::Or: - case Instruction::Xor: { - // Just widen binops. - auto *BinOp = cast(&I); - setDebugLocFromInst(Builder, BinOp); - const VectorParts &A = getVectorValue(BinOp->getOperand(0)); - const VectorParts &B = getVectorValue(BinOp->getOperand(1)); - - // Use this vector value for all users of the original instruction. - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Value *V = Builder.CreateBinOp(BinOp->getOpcode(), A[Part], B[Part]); - - if (BinaryOperator *VecOp = dyn_cast(V)) - VecOp->copyIRFlags(BinOp); + } - Entry[Part] = V; - } + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, GEP); + break; + } + case Instruction::UDiv: + case Instruction::SDiv: + case Instruction::SRem: + case Instruction::URem: + case Instruction::Add: + case Instruction::FAdd: + case Instruction::Sub: + case Instruction::FSub: + case Instruction::Mul: + case Instruction::FMul: + case Instruction::FDiv: + case Instruction::FRem: + case Instruction::Shl: + case Instruction::LShr: + case Instruction::AShr: + case Instruction::And: + case Instruction::Or: + case Instruction::Xor: { + // Just widen binops. + auto *BinOp = cast(&I); + setDebugLocFromInst(Builder, BinOp); + const VectorParts &A = getVectorValue(BinOp->getOperand(0)); + const VectorParts &B = getVectorValue(BinOp->getOperand(1)); - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, BinOp); - break; - } - case Instruction::Select: { - // Widen selects. - // If the selector is loop invariant we can create a select - // instruction with a scalar condition. Otherwise, use vector-select. 
- auto *SE = PSE.getSE(); - bool InvariantCond = - SE->isLoopInvariant(PSE.getSCEV(I.getOperand(0)), OrigLoop); - setDebugLocFromInst(Builder, &I); - - // The condition can be loop invariant but still defined inside the - // loop. This means that we can't just use the original 'cond' value. - // We have to take the 'vectorized' value and pick the first lane. - // Instcombine will make this a no-op. - const VectorParts &Cond = getVectorValue(I.getOperand(0)); - const VectorParts &Op0 = getVectorValue(I.getOperand(1)); - const VectorParts &Op1 = getVectorValue(I.getOperand(2)); - - auto *ScalarCond = getScalarValue(I.getOperand(0), 0, 0); + // Use this vector value for all users of the original instruction. + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Value *V = Builder.CreateBinOp(BinOp->getOpcode(), A[Part], B[Part]); - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Entry[Part] = Builder.CreateSelect( - InvariantCond ? ScalarCond : Cond[Part], Op0[Part], Op1[Part]); - } + if (BinaryOperator *VecOp = dyn_cast(V)) + VecOp->copyIRFlags(BinOp); - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, &I); - break; + Entry[Part] = V; } - case Instruction::ICmp: - case Instruction::FCmp: { - // Widen compares. Generate vector compares. - bool FCmp = (I.getOpcode() == Instruction::FCmp); - auto *Cmp = dyn_cast(&I); - setDebugLocFromInst(Builder, Cmp); - const VectorParts &A = getVectorValue(Cmp->getOperand(0)); - const VectorParts &B = getVectorValue(Cmp->getOperand(1)); - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Value *C = nullptr; - if (FCmp) { - C = Builder.CreateFCmp(Cmp->getPredicate(), A[Part], B[Part]); - cast(C)->copyFastMathFlags(Cmp); - } else { - C = Builder.CreateICmp(Cmp->getPredicate(), A[Part], B[Part]); - } - Entry[Part] = C; - } + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, BinOp); + break; + } + case Instruction::Select: { + // Widen selects. + // If the selector is loop invariant we can create a select + // instruction with a scalar condition. Otherwise, use vector-select. + auto *SE = PSE.getSE(); + bool InvariantCond = + SE->isLoopInvariant(PSE.getSCEV(I.getOperand(0)), OrigLoop); + setDebugLocFromInst(Builder, &I); + + // The condition can be loop invariant but still defined inside the + // loop. This means that we can't just use the original 'cond' value. + // We have to take the 'vectorized' value and pick the first lane. + // Instcombine will make this a no-op. + const VectorParts &Cond = getVectorValue(I.getOperand(0)); + const VectorParts &Op0 = getVectorValue(I.getOperand(1)); + const VectorParts &Op1 = getVectorValue(I.getOperand(2)); + + auto *ScalarCond = getScalarValue(I.getOperand(0), 0, 0); - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, &I); - break; + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Entry[Part] = Builder.CreateSelect( + InvariantCond ? 
ScalarCond : Cond[Part], Op0[Part], Op1[Part]); } - case Instruction::Store: - case Instruction::Load: - vectorizeMemoryInstruction(&I); - break; - case Instruction::ZExt: - case Instruction::SExt: - case Instruction::FPToUI: - case Instruction::FPToSI: - case Instruction::FPExt: - case Instruction::PtrToInt: - case Instruction::IntToPtr: - case Instruction::SIToFP: - case Instruction::UIToFP: - case Instruction::Trunc: - case Instruction::FPTrunc: - case Instruction::BitCast: { - auto *CI = dyn_cast(&I); - setDebugLocFromInst(Builder, CI); + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, &I); + break; + } - // Optimize the special case where the source is a constant integer - // induction variable. Notice that we can only optimize the 'trunc' case - // because (a) FP conversions lose precision, (b) sext/zext may wrap, and - // (c) other casts depend on pointer size. - if (Cost->isOptimizableIVTruncate(CI, VF)) { - widenIntOrFpInduction(cast(CI->getOperand(0)), - cast(CI)); - break; + case Instruction::ICmp: + case Instruction::FCmp: { + // Widen compares. Generate vector compares. + bool FCmp = (I.getOpcode() == Instruction::FCmp); + auto *Cmp = dyn_cast(&I); + setDebugLocFromInst(Builder, Cmp); + const VectorParts &A = getVectorValue(Cmp->getOperand(0)); + const VectorParts &B = getVectorValue(Cmp->getOperand(1)); + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Value *C = nullptr; + if (FCmp) { + C = Builder.CreateFCmp(Cmp->getPredicate(), A[Part], B[Part]); + cast(C)->copyFastMathFlags(Cmp); + } else { + C = Builder.CreateICmp(Cmp->getPredicate(), A[Part], B[Part]); } + Entry[Part] = C; + } - /// Vectorize casts. - Type *DestTy = - (VF == 1) ? CI->getType() : VectorType::get(CI->getType(), VF); + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, &I); + break; + } - const VectorParts &A = getVectorValue(CI->getOperand(0)); - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) - Entry[Part] = Builder.CreateCast(CI->getOpcode(), A[Part], DestTy); - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, &I); - break; - } + case Instruction::Store: + case Instruction::Load: + vectorizeMemoryInstruction(&I); + break; + case Instruction::ZExt: + case Instruction::SExt: + case Instruction::FPToUI: + case Instruction::FPToSI: + case Instruction::FPExt: + case Instruction::PtrToInt: + case Instruction::IntToPtr: + case Instruction::SIToFP: + case Instruction::UIToFP: + case Instruction::Trunc: + case Instruction::FPTrunc: + case Instruction::BitCast: { + auto *CI = dyn_cast(&I); + setDebugLocFromInst(Builder, CI); - case Instruction::Call: { - // Ignore dbg intrinsics. - if (isa(I)) - break; - setDebugLocFromInst(Builder, &I); - - Module *M = BB->getParent()->getParent(); - auto *CI = cast(&I); - - StringRef FnName = CI->getCalledFunction()->getName(); - Function *F = CI->getCalledFunction(); - Type *RetTy = ToVectorTy(CI->getType(), VF); - SmallVector Tys; - for (Value *ArgOperand : CI->arg_operands()) - Tys.push_back(ToVectorTy(ArgOperand->getType(), VF)); - - Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); - if (ID && (ID == Intrinsic::assume || ID == Intrinsic::lifetime_end || - ID == Intrinsic::lifetime_start)) { - scalarizeInstruction(&I); - break; - } - // The flag shows whether we use Intrinsic or a usual Call for vectorized - // version of the instruction. - // Is it beneficial to perform intrinsic call compared to lib call? 
- bool NeedToScalarize; - unsigned CallCost = getVectorCallCost(CI, VF, *TTI, TLI, NeedToScalarize); - bool UseVectorIntrinsic = - ID && getVectorIntrinsicCost(CI, VF, *TTI, TLI) <= CallCost; - if (!UseVectorIntrinsic && NeedToScalarize) { - scalarizeInstruction(&I); - break; - } + /// Vectorize casts. + Type *DestTy = + (VF == 1) ? CI->getType() : VectorType::get(CI->getType(), VF); - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - SmallVector Args; - for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) { - Value *Arg = CI->getArgOperand(i); - // Some intrinsics have a scalar argument - don't replace it with a - // vector. - if (!UseVectorIntrinsic || !hasVectorInstrinsicScalarOpd(ID, i)) { - const VectorParts &VectorArg = getVectorValue(CI->getArgOperand(i)); - Arg = VectorArg[Part]; - } - Args.push_back(Arg); - } + const VectorParts &A = getVectorValue(CI->getOperand(0)); + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) + Entry[Part] = Builder.CreateCast(CI->getOpcode(), A[Part], DestTy); + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, &I); + break; + } - Function *VectorF; - if (UseVectorIntrinsic) { - // Use vector version of the intrinsic. - Type *TysForDecl[] = {CI->getType()}; - if (VF > 1) - TysForDecl[0] = VectorType::get(CI->getType()->getScalarType(), VF); - VectorF = Intrinsic::getDeclaration(M, ID, TysForDecl); - } else { - // Use vector version of the library call. - StringRef VFnName = TLI->getVectorizedFunction(FnName, VF); - assert(!VFnName.empty() && "Vector function name is empty."); - VectorF = M->getFunction(VFnName); - if (!VectorF) { - // Generate a declaration - FunctionType *FTy = FunctionType::get(RetTy, Tys, false); - VectorF = - Function::Create(FTy, Function::ExternalLinkage, VFnName, M); - VectorF->copyAttributesFrom(F); - } + case Instruction::Call: { + // Ignore dbg intrinsics. + if (isa(I)) + break; + setDebugLocFromInst(Builder, &I); + + Module *M = I.getParent()->getParent()->getParent(); + auto *CI = cast(&I); + + StringRef FnName = CI->getCalledFunction()->getName(); + Function *F = CI->getCalledFunction(); + Type *RetTy = ToVectorTy(CI->getType(), VF); + SmallVector Tys; + for (Value *ArgOperand : CI->arg_operands()) + Tys.push_back(ToVectorTy(ArgOperand->getType(), VF)); + + Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); + bool NeedToScalarize; // Redundant, needed for UseVectorIntrinsic. + unsigned CallCost = getVectorCallCost(CI, VF, *TTI, TLI, NeedToScalarize); + bool UseVectorIntrinsic = + ID && getVectorIntrinsicCost(CI, VF, *TTI, TLI) <= CallCost; + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + SmallVector Args; + for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) { + Value *Arg = CI->getArgOperand(i); + // Some intrinsics have a scalar argument - don't replace it with a + // vector. + if (!UseVectorIntrinsic || !hasVectorInstrinsicScalarOpd(ID, i)) { + const VectorParts &VectorArg = getVectorValue(CI->getArgOperand(i)); + Arg = VectorArg[Part]; } - assert(VectorF && "Can't create vector function."); + Args.push_back(Arg); + } - SmallVector OpBundles; - CI->getOperandBundlesAsDefs(OpBundles); - CallInst *V = Builder.CreateCall(VectorF, Args, OpBundles); + Function *VectorF; + if (UseVectorIntrinsic) { + // Use vector version of the intrinsic. 
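+        // The declaration is overloaded on the widened type, so e.g. a call
+        // to llvm.sqrt.f64 at VF == 4 resolves to llvm.sqrt.v4f64
+        // (illustrative; the mangling is done by Intrinsic::getDeclaration).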
+ Type *TysForDecl[] = {CI->getType()}; + if (VF > 1) + TysForDecl[0] = VectorType::get(CI->getType()->getScalarType(), VF); + VectorF = Intrinsic::getDeclaration(M, ID, TysForDecl); + } else { + // Use vector version of the library call. + StringRef VFnName = TLI->getVectorizedFunction(FnName, VF); + assert(!VFnName.empty() && "Vector function name is empty."); + VectorF = M->getFunction(VFnName); + if (!VectorF) { + // Generate a declaration + FunctionType *FTy = FunctionType::get(RetTy, Tys, false); + VectorF = + Function::Create(FTy, Function::ExternalLinkage, VFnName, M); + VectorF->copyAttributesFrom(F); + } + } + assert(VectorF && "Can't create vector function."); - if (isa(V)) - V->copyFastMathFlags(CI); + SmallVector OpBundles; + CI->getOperandBundlesAsDefs(OpBundles); + CallInst *V = Builder.CreateCall(VectorF, Args, OpBundles); - Entry[Part] = V; - } + if (isa(V)) + V->copyFastMathFlags(CI); - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, &I); - break; + Entry[Part] = V; } - default: - // All other instructions are unsupported. Scalarize them. - scalarizeInstruction(&I); - break; - } // end of switch. - } // end of for_each instr. + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, &I); + break; + } + + default: + // All other instructions are scalarized. + DEBUG(dbgs() << "LV: Found an unhandled instruction: " << I); + llvm_unreachable("Unhandled instruction!"); + } // end of switch. } void InnerLoopVectorizer::updateAnalysis() { @@ -5049,15 +4956,11 @@ assert(DT->properlyDominates(LoopBypassBlocks.front(), LoopExitBlock) && "Entry does not dominate exit."); - // We don't predicate stores by this point, so the vector body should be a - // single loop. - DT->addNewBlock(LoopVectorBody, LoopVectorPreHeader); - - DT->addNewBlock(LoopMiddleBlock, LoopVectorBody); + auto *LoopVectorLatch = LI->getLoopFor(LoopVectorBody)->getLoopLatch(); + DT->addNewBlock(LoopMiddleBlock, LoopVectorLatch); DT->addNewBlock(LoopScalarPreHeader, LoopBypassBlocks[0]); DT->changeImmediateDominator(LoopScalarBody, LoopScalarPreHeader); DT->changeImmediateDominator(LoopExitBlock, LoopBypassBlocks[0]); - DEBUG(DT->verifyDomTree()); } @@ -6947,13 +6850,6 @@ LoopVectorizationCostModel::expectedCost(unsigned VF) { VectorizationCostTy Cost; - // Collect Uniform and Scalar instructions after vectorization with VF. - collectUniformsAndScalars(VF); - - // Collect the instructions (and their associated costs) that will be more - // profitable to scalarize. - collectInstsToScalarize(VF); - // For each block. for (BasicBlock *BB : TheLoop->blocks()) { VectorizationCostTy BlockCost; @@ -7492,6 +7388,7 @@ LoopVectorizationCostModel::VectorizationFactor LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF) { + ILV->collectTriviallyDeadInstructions(TheLoop, Legal, DeadInstructions); // Width 1 means no vectorize, cost 0 means uncomputed cost. const LoopVectorizationCostModel::VectorizationFactor NoVectorization = {1U, @@ -7506,23 +7403,232 @@ // Collect the instructions (and their associated costs) that will be more // profitable to scalarize. 
CM.selectUserVectorizationFactor(UserVF); + buildInitialVPlans(UserVF, UserVF); + DEBUG(printCurrentPlans("Initial VPlans", dbgs())); + optimizePredicatedInstructions(); + DEBUG(printCurrentPlans("After optimize predicated instructions", dbgs())); return {UserVF, 0}; } unsigned MaxVF = MaybeMaxVF.getValue(); assert(MaxVF != 0 && "MaxVF is zero."); + + for (unsigned VF = 1; VF <= MaxVF; VF *= 2) { + // Collect Uniform and Scalar instructions after vectorization with VF. + CM.collectUniformsAndScalars(VF); + + // Collect the instructions (and their associated costs) that will be more + // profitable to scalarize. + if (VF > 1) + CM.collectInstsToScalarize(VF); + } + + buildInitialVPlans(1, MaxVF); + DEBUG(printCurrentPlans("Initial VPlans", dbgs())); + optimizePredicatedInstructions(); + DEBUG(printCurrentPlans("After optimize predicated instructions", dbgs())); if (MaxVF == 1) return NoVectorization; - // Select the optimal vectorization factor. return CM.selectVectorizationFactor(MaxVF); } -void InnerLoopUnroller::vectorizeMemoryInstruction(Instruction *Instr) { - auto *SI = dyn_cast(Instr); - bool IfPredicateInstr = (SI && Legal->blockNeedsPredication(SI->getParent())); +void LoopVectorizationPlanner::printCurrentPlans(const std::string &Title, + raw_ostream &O) { + auto printPlan = [&](VPlan *Plan, const SmallVectorImpl &VFs, + const std::string &Prefix) { + std::string Title; + raw_string_ostream RSO(Title); + RSO << Prefix << " for VF="; + if (VFs.size() == 1) + RSO << VFs[0]; + else { + RSO << "{"; + bool First = true; + for (unsigned VF : VFs) { + if (!First) + RSO << ","; + RSO << VF; + First = false; + } + RSO << "}"; + } + VPlanPrinter PlanPrinter(O, *Plan); + PlanPrinter.dump(RSO.str()); + }; + + DenseMap> VPlansAndVFs; + for (auto &Entry : VPlans) + VPlansAndVFs[Entry.second.get()].push_back(Entry.first); + for (auto &Entry : VPlansAndVFs) { + std::sort(Entry.second.begin(), Entry.second.end()); + printPlan(Entry.first, Entry.second, Title); + } +} + +// Determine if a given instruction will remain scalar after vectorization, +// for VF \p StartRangeVF. Reset \p EndRangeVF to the minimal VF where this +// decision does not hold, if it's less than the given \p EndRangeVF. +bool LoopVectorizationPlanner::willBeScalarized(Instruction *I, + unsigned StartRangeVF, + unsigned &EndRangeVF) { + if (!isa(I)) { + auto isScalarAfterVectorization = [&](unsigned VF) -> bool { + return CM.isScalarAfterVectorization(I, VF); + }; + if (testVFRange(isScalarAfterVectorization, StartRangeVF, EndRangeVF)) + return true; + } + + if (isa(I)) { + + auto *CI = cast(I); + Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); + if (ID && (ID == Intrinsic::assume || ID == Intrinsic::lifetime_end || + ID == Intrinsic::lifetime_start)) + return true; + + // The following case may be scalarized depending on the VF. + // The flag shows whether we use Intrinsic or a usual Call for vectorized + // version of the instruction. + // Is it beneficial to perform intrinsic call compared to lib call? + auto WillBeScalarized = [&](unsigned VF) -> bool { + bool NeedToScalarize; + unsigned CallCost = getVectorCallCost(CI, VF, *TTI, TLI, NeedToScalarize); + bool UseVectorIntrinsic = + ID && getVectorIntrinsicCost(CI, VF, *TTI, TLI) <= CallCost; + return !UseVectorIntrinsic && NeedToScalarize; + }; + return testVFRange(WillBeScalarized, StartRangeVF, EndRangeVF); + } + + if (isa(I) || isa(I)) { + + // TODO: refactor memoryInstructionMustBeScalarized() to invoke only the + // (last) part that depends on VF. 
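+  // For example, if the cost model decides to scalarize a load for VF 2 and
+  // VF 4 but widens it at VF 8, testVFRange() returns true for
+  // StartRangeVF == 2 and clips EndRangeVF to 8, so VFs 2 and 4 share one
+  // VPlan while VF 8 starts a new one (an illustrative scenario of the
+  // range splitting performed below).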
+ auto WillBeScalarized = [&](unsigned VF) -> bool { + LoopVectorizationCostModel::InstWidening Decision = + CM.getWideningDecision(I, VF); + assert(Decision != LoopVectorizationCostModel::CM_Unknown && + "CM decision should be taken at this point"); + return Decision == LoopVectorizationCostModel::CM_Scalarize; + }; + return testVFRange(WillBeScalarized, StartRangeVF, EndRangeVF); + } + + static DenseSet VectorizableOpcodes = { + Instruction::Br, Instruction::PHI, Instruction::GetElementPtr, + Instruction::UDiv, Instruction::SDiv, Instruction::SRem, + Instruction::URem, Instruction::Add, Instruction::FAdd, + Instruction::Sub, Instruction::FSub, Instruction::Mul, + Instruction::FMul, Instruction::FDiv, Instruction::FRem, + Instruction::Shl, Instruction::LShr, Instruction::AShr, + Instruction::And, Instruction::Or, Instruction::Xor, + Instruction::Select, Instruction::ICmp, Instruction::FCmp, + Instruction::Store, Instruction::Load, Instruction::ZExt, + Instruction::SExt, Instruction::FPToUI, Instruction::FPToSI, + Instruction::FPExt, Instruction::PtrToInt, Instruction::IntToPtr, + Instruction::SIToFP, Instruction::UIToFP, Instruction::Trunc, + Instruction::FPTrunc, Instruction::BitCast, Instruction::Call}; + + if (!VectorizableOpcodes.count(I->getOpcode())) + return true; + + // Scalarize instructions found to be more profitable if scalarized. Limit + // EndRangeVF to the last VF this is continuously true for. + auto isProfitableToScalarize = [&](unsigned VF) -> bool { + return CM.isProfitableToScalarize(I, VF); + }; + return testVFRange(isProfitableToScalarize, StartRangeVF, EndRangeVF); +} + +unsigned LoopVectorizationPlanner::buildInitialVPlans(unsigned MinVF, + unsigned MaxVF) { + unsigned StartRangeVF = MinVF; + unsigned EndRangeVF = MaxVF + 1; + + unsigned i = 0; + for (; StartRangeVF < EndRangeVF; ++i) { + std::shared_ptr Plan = buildInitialVPlan(StartRangeVF, EndRangeVF); + + for (unsigned TmpVF = StartRangeVF; TmpVF < EndRangeVF; TmpVF *= 2) + VPlans[TmpVF] = Plan; + + StartRangeVF = EndRangeVF; + EndRangeVF = MaxVF + 1; + } + + return i; +} + +bool LoopVectorizationPlanner::testVFRange( + const std::function &Predicate, unsigned StartRangeVF, + unsigned &EndRangeVF) { + bool StartResult = Predicate(StartRangeVF); + + for (unsigned TmpVF = StartRangeVF * 2; TmpVF < EndRangeVF; TmpVF *= 2) { + bool TmpResult = Predicate(TmpVF); + if (TmpResult != StartResult) { + EndRangeVF = TmpVF; + break; + } + } + + return StartResult; +} + +bool LoopVectorizationPlanner::shouldScalarizeInstruction(Instruction *I, + unsigned VF) const { + return CM.isScalarAfterVectorization(I, VF) || + CM.isProfitableToScalarize(I, VF); +} + +void LoopVectorizationPlanner::optimizePredicatedInstructions() { + VPlan *PrevPlan = nullptr; + for (auto &It : VPlans) { + VPlan *Plan = It.second.get(); + if (Plan == PrevPlan) + continue; + for (auto *PredInst : PredicatedInstructions) { + sinkScalarOperands(PredInst, Plan); + assignScalarVectorConversions(PredInst, Plan); + } + PrevPlan = Plan; + } +} + +void LoopVectorizationPlanner::setBestPlan(unsigned VF, unsigned UF) { + DEBUG(dbgs() << "Setting best plan to VF=" << VF << ", UF=" << UF << '\n'); + BestVF = VF; + BestUF = UF; + + assert(VPlans.count(VF) && "Best VF does not have a VPlan."); + // Delete all other VPlans. + for (auto &Entry : VPlans) { + if (Entry.first != VF) + VPlans.erase(Entry.first); + } +} + +void LoopVectorizationPlanner::executeBestPlan(InnerLoopVectorizer &LB) { + ILV = &LB; + + // Perform the actual loop widening (vectorization). + // 1. 
Create a new empty loop. Unlink the old loop and connect the new one. + ILV->createEmptyLoop(); + + // 2. Widen each instruction in the old loop to a new one in the new loop. + + VPTransformState State{BestVF, BestUF, LI, ILV->DT, + ILV->Builder, ILV, Legal, &CM}; + State.CFG.PrevBB = ILV->LoopVectorPreHeader; + + VPlan *Plan = getVPlanForVF(BestVF); + + Plan->vectorize(&State); - return scalarizeInstruction(Instr, IfPredicateInstr); + // 3. Take care of phi's to fix: reduction, 1st-order-recurrence, loop-closed. + ILV->vectorizeLoop(); } Value *InnerLoopUnroller::reverseVector(Value *Vec) { return Vec; } @@ -7580,51 +7686,1272 @@ } } -bool LoopVectorizePass::processLoop(Loop *L) { - assert(L->empty() && "Only process inner loops."); +class VPLaneRange { +private: + static const unsigned VF = INT_MAX; + unsigned MinLane = 0; + unsigned MaxLane = VF - 1; + void dumpLane(raw_ostream &O, unsigned Lane) const { + if (Lane == VF - 1) + O << "VF-1"; + else + O << Lane; + } -#ifndef NDEBUG - const std::string DebugLocStr = getDebugLocString(L); -#endif /* NDEBUG */ +public: + VPLaneRange() {} + VPLaneRange(unsigned Min) : MinLane(Min) {} + VPLaneRange(unsigned Min, unsigned Max) : MinLane(Min), MaxLane(Max) {} + unsigned getMinLane() const { return MinLane; } + unsigned getMaxLane() const { return MaxLane; } + bool isEmpty() const { return MinLane > MaxLane; } + bool isFull() const { return MinLane == 0 && MaxLane == VF - 1; } + void print(raw_ostream &O) const { + dumpLane(O, MinLane); + O << ".."; + dumpLane(O, MaxLane); + } + static VPLaneRange intersect(const VPLaneRange &One, const VPLaneRange &Two) { + return VPLaneRange(std::max(One.MinLane, Two.MinLane), + std::min(One.MaxLane, Two.MaxLane)); + } +}; - DEBUG(dbgs() << "\nLV: Checking a loop in \"" - << L->getHeader()->getParent()->getName() << "\" from " - << DebugLocStr << "\n"); +/// VPScalarizeOneByOneRecipe is a VPOneByOneRecipeBase which scalarizes each +/// Instruction in its ingredients independently, in order. The scalarization +/// is performed in one of two methods: a) by generating a single uniform scalar +/// Instruction. b) by generating multiple Instructions, each one for a +/// respective lane. +class VPScalarizeOneByOneRecipe : public VPOneByOneRecipeBase { + friend class VPlanUtilsLoopVectorizer; - LoopVectorizeHints Hints(L, DisableUnrolling, *ORE); +private: + /// Do the actual code generation for a single instruction. + void transformIRInstruction(Instruction *I, VPTransformState &State) override; - DEBUG(dbgs() << "LV: Loop hints:" - << " force=" - << (Hints.getForce() == LoopVectorizeHints::FK_Disabled - ? "disabled" - : (Hints.getForce() == LoopVectorizeHints::FK_Enabled - ? "enabled" - : "?")) - << " width=" << Hints.getWidth() - << " unroll=" << Hints.getInterleave() << "\n"); + VPLaneRange DesignatedLanes; - // Function containing loop - Function *F = L->getHeader()->getParent(); +public: + VPScalarizeOneByOneRecipe(const BasicBlock::iterator B, + const BasicBlock::iterator E, VPlan *Plan) + : VPOneByOneRecipeBase(VPScalarizeOneByOneSC, B, E, Plan) {} - // Looking at the diagnostic output is the only way to determine if a loop - // was vectorized (other than looking at the IR or machine code), so it - // is important to generate an optimization remark for each loop. Most of - // these messages are generated as OptimizationRemarkAnalysis. 
Remarks - // generated as OptimizationRemark and OptimizationRemarkMissed are - // less verbose reporting vectorized loops and unvectorized loops that may - // benefit from vectorization, respectively. + ~VPScalarizeOneByOneRecipe() {} - if (!Hints.allowVectorization(F, L, AlwaysVectorize)) { - DEBUG(dbgs() << "LV: Loop hints prevent vectorization.\n"); - return false; + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPScalarizeOneByOneSC; } - // Check the loop for a trip count threshold: - // do not vectorize loops with a tiny trip count. - const unsigned MaxTC = SE->getSmallConstantMaxTripCount(L); - if (MaxTC > 0u && MaxTC < TinyTripCountVectorThreshold) { - DEBUG(dbgs() << "LV: Found a loop with a very small trip count. " - << "This loop is not worth vectorizing."); + const VPLaneRange &getDesignatedLanes() const { return DesignatedLanes; } + + /// Print the recipe. + void print(raw_ostream &O) const override { + O << "SCALARIZE"; + if (!DesignatedLanes.isFull()) { + O << " "; + DesignatedLanes.print(O); + } + O << ":"; + for (auto It = Begin; It != End; ++It) { + O << "\\l "; + VPlanPrinter::printAsIngredient(O, &*It); + if (willAlsoPackOrUnpack(&*It)) + O << " (S->V)"; + } + } +}; + +/// VPVectorizeOneByOneRecipe is a VPOneByOneRecipeBase which transforms by +/// vectorizing each Instruction in itsingredients independently, in order. +/// This recipe covers most of the traditional vectorization cases where +/// each ingredient produces a vectorized version of itself. +class VPVectorizeOneByOneRecipe : public VPOneByOneRecipeBase { + friend class VPlanUtilsLoopVectorizer; + +private: + /// Do the actual code generation for a single instruction. + void transformIRInstruction(Instruction *I, VPTransformState &State) override; + +public: + VPVectorizeOneByOneRecipe(const BasicBlock::iterator B, + const BasicBlock::iterator E, VPlan *Plan) + : VPOneByOneRecipeBase(VPVectorizeOneByOneSC, B, E, Plan) {} + + ~VPVectorizeOneByOneRecipe() {} + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPVectorizeOneByOneSC; + } + + /// Print the recipe. + void print(raw_ostream &O) const override { + O << "VECTORIZE:"; + for (auto It = Begin; It != End; ++It) { + O << "\\l "; + VPlanPrinter::printAsIngredient(O, &*It); + if (willAlsoPackOrUnpack(&*It)) + O << " (S->V)"; + } + } +}; + +/// A recipe which handles all phi nodes except integer inductions. +class VPWidenPHIRecipe : public VPRecipeBase { + PHINode *Phi; + +public: + VPWidenPHIRecipe(PHINode *Phi, VPlan *Plan) : VPRecipeBase(VPWidenPHISC), + Phi(Phi) { + Plan->setInst2Recipe(Phi, this); + } + + ~VPWidenPHIRecipe() {} + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPWidenIntOrFpInductionSC; + } + + /// The method which generates the phi/select nodes, thereby "executing" the + /// VPlan. + void vectorize(VPTransformState &State) override; + + /// Print the recipe. + void print(raw_ostream &O) const override; +}; + +/// A recipe which widens integer reductions, producing their vector values +/// and computing the necessary values for producing their scalar values. +/// The scalar values themselves are generated by a complementing +/// VPBuildScalarStepsRecipe. 
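+/// The recipe exposes the scalar IV and step it materializes through
+/// getScalarIV() and getStep(), which VPBuildScalarStepsRecipe consumes when
+/// NeedsScalarIV is set.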
+class VPWidenIntOrFpInductionRecipe : public VPRecipeBase {
+private:
+  bool NeedsScalarIV;
+  PHINode *IV;
+  TruncInst *Trunc;
+  Value *ScalarIV = nullptr;
+  Value *Step = nullptr;
+
+public:
+  VPWidenIntOrFpInductionRecipe(bool NeedsScalarIV, PHINode *IV,
+                                TruncInst *Trunc = nullptr)
+      : VPRecipeBase(VPWidenIntOrFpInductionSC), NeedsScalarIV(NeedsScalarIV),
+        IV(IV), Trunc(Trunc) {}
+
+  ~VPWidenIntOrFpInductionRecipe() {}
+
+  /// Method to support type inquiry through isa, cast, and dyn_cast.
+  static inline bool classof(const VPRecipeBase *V) {
+    return V->getVPRecipeID() == VPRecipeBase::VPWidenIntOrFpInductionSC;
+  }
+
+  /// The method which generates the vector values of the induction, along
+  /// with the scalar IV and step needed by its scalar users, thereby
+  /// "executing" the VPlan.
+  void vectorize(VPTransformState &State) override;
+
+  /// Print the recipe.
+  void print(raw_ostream &O) const override;
+
+  const InductionDescriptor &
+  getInductionDescriptor(LoopVectorizationLegality *Legal) {
+    auto II = Legal->getInductionVars()->find(IV);
+    assert(II != Legal->getInductionVars()->end() && "IV is not an induction");
+    auto &ID = II->second;
+    assert(IV->getType() == ID.getStartValue()->getType() &&
+           "Types must match");
+    return ID;
+  }
+
+  Value *getScalarIV() {
+    assert(ScalarIV && "ScalarIV does not exist yet");
+    return ScalarIV;
+  }
+
+  Value *getStep() {
+    assert(Step && "Step does not exist yet");
+    return Step;
+  }
+};
+
+/// This is a complementing recipe for handling integer induction variables,
+/// responsible for generating the scalar values used by the IV's scalar users.
+class VPBuildScalarStepsRecipe : public VPRecipeBase {
+  friend class VPlanUtilsLoopVectorizer;
+
+private:
+  VPWidenIntOrFpInductionRecipe *WIFI;
+  Instruction *EntryVal;
+  VPLaneRange DesignatedLanes;
+
+public:
+  VPBuildScalarStepsRecipe(VPWidenIntOrFpInductionRecipe *WIFI,
+                           Instruction *EntryVal, VPlan *Plan)
+      : VPRecipeBase(VPBuildScalarStepsSC), WIFI(WIFI), EntryVal(EntryVal) {
+    Plan->setInst2Recipe(EntryVal, this);
+  }
+
+  ~VPBuildScalarStepsRecipe() {}
+
+  const VPLaneRange &getDesignatedLanes() const { return DesignatedLanes; }
+
+  /// Method to support type inquiry through isa, cast, and dyn_cast.
+  static inline bool classof(const VPRecipeBase *V) {
+    return V->getVPRecipeID() == VPRecipeBase::VPBuildScalarStepsSC;
+  }
+
+  /// The method which generates the scalar induction steps for the designated
+  /// lanes in the vectorized version, thereby "executing" the VPlan.
+  void vectorize(VPTransformState &State) override;
+
+  /// Print the recipe.
+  void print(raw_ostream &O) const override;
+};
+
+/// A VPInterleaveRecipe is a VPRecipe which transforms an interleave group of
+/// loads or stores into one wide load/store and shuffles.
+class VPInterleaveRecipe : public VPRecipeBase {
+private:
+  const InterleaveGroup *IG;
+
+public:
+  VPInterleaveRecipe(const InterleaveGroup *IG, VPlan *Plan)
+      : VPRecipeBase(VPInterleaveSC), IG(IG) {
+    for (unsigned I = 0, E = IG->getNumMembers(); I < E; ++I)
+      Plan->setInst2Recipe(IG->getMember(I), this);
+  }
+
+  ~VPInterleaveRecipe() {}
+
+  /// Method to support type inquiry through isa, cast, and dyn_cast.
+  static inline bool classof(const VPRecipeBase *V) {
+    return V->getVPRecipeID() == VPRecipeBase::VPInterleaveSC;
+  }
+
+  /// The method which generates the wide load or store and shuffles that
+  /// correspond to this VPInterleaveRecipe in the vectorized version, thereby
+  /// "executing" the VPlan.
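+  /// For example, a factor-2 group of loads from A[2*i] and A[2*i+1] becomes
+  /// one wide load of 2*VF elements followed by two shufflevectors that
+  /// de-interleave the even and odd lanes (an illustrative sketch; the exact
+  /// shuffle masks depend on VF and on any gaps in the group).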
+ void vectorize(VPTransformState &State) override; + + /// Print the recipe. + void print(raw_ostream &O) const override; + + const InterleaveGroup *getInterleaveGroup() { return IG; } +}; + +/// A VPExtractMaskBitRecipe is a VPConditionBitRecipe which supports a +/// scalarized conditional branch. Such branches are needed to guard scalarized +/// instructions with possible side-effects that are predicated under a +/// condition. This recipe is in charge of generating the instruction that +/// computes the condition for this branch in the vectorized version. +class VPExtractMaskBitRecipe : public VPConditionBitRecipeBase { +private: + /// The original IR basic block in which the scalarized and predicated + /// instruction(s) reside. Needed for generating the mask of the block + /// and from it the desired condition bit. + BasicBlock *MaskedBasicBlock; + +public: + /// Construct a VPExtractMaskBitRecipe given the IR BasicBlock whose mask + /// should provide the desired bit. This recipe has no Instructions as + /// ingredients, hence does not call Plan->setInst2Recipe(). + VPExtractMaskBitRecipe(BasicBlock *BB) + : VPConditionBitRecipeBase(VPExtractMaskBitSC), MaskedBasicBlock(BB) {} + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPExtractMaskBitSC; + } + + /// The method which generates the comparison and related mask management + /// instructions leading to computing the desired condition bit, corresponding + /// to this VPExtractMaskBitRecipe in the vectorized version, thereby + /// "executing" the VPlan. + void vectorize(VPTransformState &State) override; + + /// Print the recipe. + void print(raw_ostream &O) const override { + O << "EXTRACT MASK BIT:\\l " << MaskedBasicBlock->getName(); + } + + StringRef getName() const override { return MaskedBasicBlock->getName(); } +}; + +/// A VPMergeScalarizeBranchRecipe is a VPRecipe which represents the Phi's +/// needed when control converges back from a scalarized branch. Such phi's are +/// needed to merge live-out values that are set under a scalarized conditional +/// branch. They can be scalar or vector, depending on the user of the +/// live-out value. This recipe works in concert with VPExtractMaskBitRecipe. +class VPMergeScalarizeBranchRecipe : public VPRecipeBase { +private: + Instruction *LiveOut; + +public: + // Construct a VPMergeScalarizeBranchRecipe given \LiveOut whose value needs + // a Phi after merging back from a scalarized branch. + // LiveOut is mapped to the recipe vectorizing it, instead of this recipe + // which provides it with PHIs; hence no call to Plan->setInst2Recipe() here. + VPMergeScalarizeBranchRecipe(Instruction *LiveOut) + : VPRecipeBase(VPMergeScalarizeBranchSC), LiveOut(LiveOut) {} + + ~VPMergeScalarizeBranchRecipe() {} + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPMergeScalarizeBranchSC; + } + + /// The method which generates Phi instructions for live-outs as needed to + /// retain SSA form, corresponding to this VPMergeScalarizeBranchRecipe in the + /// vectorized version, thereby "executing" the VPlan. + void vectorize(VPTransformState &State) override; + + /// Print the recipe. 
+ void print(raw_ostream &O) const override { + O << "MERGE SCALARIZE BRANCH:\\l "; + VPlanPrinter::printAsIngredient(O, LiveOut); + } +}; + +class VPlanUtilsLoopVectorizer : public VPlanUtils { +public: + VPlanUtilsLoopVectorizer(VPlan *Plan) : VPlanUtils(Plan) {} + + ~VPlanUtilsLoopVectorizer() {} + + VPOneByOneRecipeBase *createOneByOneRecipe(const BasicBlock::iterator B, + const BasicBlock::iterator E, + VPlan *Plan, bool isScalarizing); + + bool appendInstruction(VPOneByOneRecipeBase *Recipe, Instruction *Instr); + + VPOneByOneRecipeBase *splitRecipe(Instruction *Split); + + void insertBefore(Instruction *Inst, Instruction *Before, + unsigned MinLane = 0); + + void removeInstruction(Instruction *Inst, unsigned FromLane = 0); + + void sinkInstruction(Instruction *Inst, VPBasicBlock *To, + unsigned MinLane = 0); + + template void designateLaneZero(T &Recipe) { + Recipe->DesignatedLanes = VPLaneRange(0, 0); + } +}; + +std::pair +LoopVectorizationPlanner::widenIntOrFpInduction(VPlan *Plan, + unsigned StartRangeVF, + unsigned &EndRangeVF, + PHINode *IV, TruncInst *Trunc) { + // The value from the original loop to which we are mapping the new + // induction variable. + Instruction *EntryVal = Trunc ? cast(Trunc) : IV; + // Determine if we want a scalar version of the induction variable. This + // is true if the induction variable itself is not widened, or if it has + // at least one user in the loop that is not widened. + auto NeedsScalarInduction = [&](unsigned VF) -> bool { + if (VF == 1) + return false; + if (shouldScalarizeInstruction(EntryVal, VF)) + return true; + auto isScalarInst = [&](User *U) -> bool { + auto *I = cast(U); + return (TheLoop->contains(I) && shouldScalarizeInstruction(I, VF)); + }; + return any_of(EntryVal->users(), isScalarInst); + }; + bool NeedsScalarIV = + testVFRange(NeedsScalarInduction, StartRangeVF, EndRangeVF); + // Generate the widening recipe. + auto *WIFIRecipe = new VPWidenIntOrFpInductionRecipe(NeedsScalarIV, IV, + Trunc); + if (!NeedsScalarIV) + return std::make_pair(WIFIRecipe, nullptr); + + // Create scalar steps that can be used by instructions we will later + // scalarize. Note that the addition of the scalar steps will not + // increase the number of instructions in the loop in the common case + // prior to InstCombine. We will be trading one vector extract for + // each scalar step. + auto *BSSRecipe = new VPBuildScalarStepsRecipe(WIFIRecipe, EntryVal, Plan); + // Determine the number of scalars we need to generate for each unroll + // iteration. If EntryVal is uniform, we only need to generate the + // first lane. Otherwise, we generate all VF values. + auto isUniformAfterVectorization = [&](unsigned VF) -> bool { + return CM.isUniformAfterVectorization(cast(EntryVal), VF); + }; + if (testVFRange(isUniformAfterVectorization, StartRangeVF, EndRangeVF)) { + VPlanUtilsLoopVectorizer PlanUtils(Plan); + PlanUtils.designateLaneZero(BSSRecipe); + } + return std::make_pair(WIFIRecipe, BSSRecipe); +} + +std::shared_ptr +LoopVectorizationPlanner::buildInitialVPlan(unsigned StartRangeVF, + unsigned &EndRangeVF) { + + std::shared_ptr SharedPlan = std::make_shared(); + VPlan *Plan = SharedPlan.get(); + VPlanUtilsLoopVectorizer PlanUtils(Plan); + + // Create a dummy entry VPBasicBlock to start building the VPlan. + VPBlockBase *PreviousVPBlock = PlanUtils.createBasicBlock(); + VPBlockBase *PreEntry = PreviousVPBlock; + Plan->setEntry(PreEntry); // only to support printing during construction. 
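+  // PreEntry is a placeholder: once the loop body has been scanned and the
+  // real entry VPBasicBlock is known, it is disconnected and deleted at the
+  // end of this function.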
+
+  // Return the interleave group a given instruction is part of in the context
+  // of a specific VF.
+  auto getInterleaveGroup = [&](Instruction *I,
+                                unsigned VF) -> const InterleaveGroup * {
+    if (VF < 2)
+      return nullptr; // Query is illegal for VF == 1
+    LoopVectorizationCostModel::InstWidening Decision =
+        CM.getWideningDecision(I, VF);
+    if (Decision != LoopVectorizationCostModel::CM_Interleave)
+      return nullptr;
+    const InterleaveGroup *IG = Legal->getInterleavedAccessGroup(I);
+    assert(IG && "Instruction to interleave not part of any group");
+    return IG;
+  };
+
+  // Check if given Instruction should open an interleave group.
+  auto isPrimaryIGMember =
+      [&](Instruction *I) -> std::function<bool(unsigned)> {
+    return [=](unsigned VF) -> bool {
+      const InterleaveGroup *IG = getInterleaveGroup(I, VF);
+      return IG && I == IG->getInsertPos();
+    };
+  };
+
+  // Check if given Instruction is handled as part of an interleave group.
+  auto isAdjunctIGMember =
+      [&](Instruction *I) -> std::function<bool(unsigned)> {
+    return [=](unsigned VF) -> bool {
+      const InterleaveGroup *IG = getInterleaveGroup(I, VF);
+      return IG && I != IG->getInsertPos();
+    };
+  };
+
+  /// Determine whether \p K is a truncation based on an induction variable
+  /// that can be optimized.
+  auto isOptimizableIVTruncate =
+      [&](Instruction *K) -> std::function<bool(unsigned)> {
+    return [=](unsigned VF) -> bool {
+      return CM.isOptimizableIVTruncate(K, VF);
+    };
+  };
+
+  // Scan the body of the loop in a topological order to visit each basic block
+  // after having visited its predecessor basic blocks.
+  LoopBlocksDFS DFS(TheLoop);
+  DFS.perform(LI);
+
+  for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) {
+    // Relevant instructions from basic block BB will be grouped into VPRecipe
+    // ingredients and fill a new VPBasicBlock.
+    VPBasicBlock *VPBB = nullptr;
+    VPOneByOneRecipeBase *LastOBORecipe = nullptr;
+
+    auto appendRecipe = [&](VPRecipeBase *Recipe) -> void {
+      if (VPBB)
+        PlanUtils.appendRecipeToBasicBlock(Recipe, VPBB);
+      else {
+        VPBB = PlanUtils.createBasicBlock(Recipe);
+        PlanUtils.setSuccessor(PreviousVPBlock, VPBB);
+        PreviousVPBlock = VPBB;
+      }
+      LastOBORecipe = dyn_cast<VPOneByOneRecipeBase>(Recipe);
+    };
+
+    for (auto I = BB->begin(), E = BB->end(); I != E; ++I) {
+      Instruction *Instr = &*I;
+
+      // Filter out irrelevant instructions.
+      if (DeadInstructions.count(Instr) || isa<BranchInst>(Instr) ||
+          isa<DbgInfoIntrinsic>(Instr))
+        continue;
+
+      if (isa<LoadInst>(Instr) || isa<StoreInst>(Instr)) {
+        // Ignore the group's adjunct members; they will be handled by the
+        // interleave group recipe generated for the primary member of the
+        // interleave group, which is the insertion point and bears the cost
+        // for the entire group.
+        if (testVFRange(isAdjunctIGMember(Instr), StartRangeVF, EndRangeVF))
+          continue;
+
+        if (testVFRange(isPrimaryIGMember(Instr), StartRangeVF, EndRangeVF)) {
+          // Instr points to the insert position of an interleave group: first
+          // load or last store.
+          const InterleaveGroup *IG = Legal->getInterleavedAccessGroup(Instr);
+          appendRecipe(new VPInterleaveRecipe(IG, Plan));
+          continue;
+        }
+      }
+
+      if (Legal->isScalarWithPredication(Instr)) {
+        // Instructions marked for predication are scalarized and placed under
+        // an if-then construct to prevent side-effects.
+        DEBUG(dbgs() << "LV: Scalarizing and predicating:" << *Instr << '\n');
+
+        // Build the triangular if-then region. Start with VPBB holding Instr.
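+        //
+        // The resulting region has a triangular shape:
+        //
+        //        Entry (EXTRACT-MASK-BIT)
+        //         |   \
+        //         |    VPBB (SCALARIZE Instr)
+        //         |   /
+        //        Exit (MERGE-SCALARIZE-BRANCH)
+        //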
+ BasicBlock::iterator J = I; + VPRecipeBase *Recipe = new VPScalarizeOneByOneRecipe(I, ++J, Plan); + VPBB = PlanUtils.createBasicBlock(Recipe); + + // Build the entry and exit VPBB's of the triangle. + VPRegionBlock *Region = PlanUtils.createRegion(true); + VPExtractMaskBitRecipe *R = new VPExtractMaskBitRecipe(&*BB); + VPBasicBlock *Entry = PlanUtils.createBasicBlock(R); + Recipe = new VPMergeScalarizeBranchRecipe(Instr); + VPBasicBlock *Exit = PlanUtils.createBasicBlock(Recipe); + // Note: first set Entry as region entry and then connect successors + // starting from it in order, to propagate the "parent" of each + // VPBasicBlock. + PlanUtils.setRegionEntry(Region, Entry); + PlanUtils.setRegionExit(Region, Exit); + PlanUtils.setTwoSuccessors(Entry, R, VPBB, Exit); + PlanUtils.setSuccessor(VPBB, Exit); + PlanUtils.setSuccessor(PreviousVPBlock, Region); + PreviousVPBlock = Region; + + // Next instructions should start forming a VPBasicBlock of their own. + VPBB = nullptr; + LastOBORecipe = nullptr; + + // Record predicated instructions for later optimizations. + PredicatedInstructions.insert(&*I); + + continue; + } + + if (PHINode *Phi = dyn_cast(Instr)) { + // Check if this is an integer induction. If so, build the recipes that + // produce its scalar and vector values. + InductionDescriptor II = Legal->getInductionVars()->lookup(Phi); + if (II.getKind() == InductionDescriptor::IK_IntInduction || + II.getKind() == InductionDescriptor::IK_FpInduction) { + auto Recipes = widenIntOrFpInduction(Plan, StartRangeVF, EndRangeVF, + Phi); + appendRecipe(Recipes.first); + if (Recipes.second) + appendRecipe(Recipes.second); + continue; + } + // Assign the default recipe for phi nodes. + appendRecipe(new VPWidenPHIRecipe(Phi, Plan)); + continue; + } + + // Optimize the special case where the source is a constant integer + // induction variable. Notice that we can only optimize the 'trunc' case + // because (a) FP conversions lose precision, (b) sext/zext may wrap, and + // (c) other casts depend on pointer size. + if (isa(Instr) && testVFRange(isOptimizableIVTruncate(Instr), + StartRangeVF, EndRangeVF)) { + auto *InductionPhi = cast(Instr->getOperand(0)); + auto Recipes = widenIntOrFpInduction(Plan, StartRangeVF, EndRangeVF, + InductionPhi, + cast(Instr)); + appendRecipe(Recipes.first); + if (Recipes.second) + appendRecipe(Recipes.second); + continue; + } + + // Check if instruction is to be replicated. + bool Scalarized = willBeScalarized(Instr, StartRangeVF, EndRangeVF); + DEBUG(if (Scalarized) dbgs() << "LV: Scalarizing:" << *Instr << "\n"); + + // Default: vectorize/scalarize this instruction using a one-by-one + // recipe. We optimize the common case where consecutive instructions + // can be represented by a single OBO recipe. + if (!LastOBORecipe || LastOBORecipe->isScalarizing() != Scalarized || + !PlanUtils.appendInstruction(LastOBORecipe, Instr)) { + auto J = I; + appendRecipe(PlanUtils.createOneByOneRecipe(I, ++J, Plan, Scalarized)); + } + } + } + // PreviousVPBlock now holds the exit block of Plan. + // Set entry block of Plan to the successor of PreEntry, and discard PreEntry. 
+  assert(PreEntry->getSuccessors().size() == 1 && "Plan has no single entry.");
+  VPBlockBase *Entry = PreEntry->getSuccessors().front();
+  PlanUtils.disconnectBlocks(PreEntry, Entry);
+  Plan->setEntry(Entry);
+  delete PreEntry;
+
+  // FOR STRESS TESTING, uncomment the following:
+  // EndRangeVF = StartRangeVF * 2;
+
+  return SharedPlan;
+}
+
+void LoopVectorizationPlanner::sinkScalarOperands(Instruction *PredInst,
+                                                  VPlan *Plan) {
+  VPlanUtilsLoopVectorizer PlanUtils(Plan);
+
+  // The basic block containing the predicated instruction's recipe.
+  VPBasicBlock *PredBB = Plan->getBasicBlock(PredInst);
+
+  // Initialize a worklist with the operands of the predicated instruction.
+  SetVector<Value *> Worklist(PredInst->op_begin(), PredInst->op_end());
+
+  // Holds instructions that we need to analyze again. An instruction may be
+  // reanalyzed if we don't yet know if we can sink it or not.
+  SmallVector<Instruction *, 8> InstsToReanalyze;
+
+  // Iteratively sink the scalarized operands of the predicated instruction
+  // into the block we created for it. When an instruction is sunk, its
+  // operands are then added to the worklist. The algorithm ends after one pass
+  // through the worklist doesn't sink a single instruction.
+  bool Changed;
+  do {
+
+    // Add the instructions that need to be reanalyzed to the worklist, and
+    // reset the changed indicator.
+    Worklist.insert(InstsToReanalyze.begin(), InstsToReanalyze.end());
+    InstsToReanalyze.clear();
+    Changed = false;
+
+    while (!Worklist.empty()) {
+      auto *I = dyn_cast<Instruction>(Worklist.pop_back_val());
+      if (!I)
+        continue;
+
+      // We do not sink other predicated instructions.
+      if (Legal->isScalarWithPredication(I))
+        continue;
+
+      VPRecipeBase *Recipe = Plan->getRecipe(I);
+
+      // We can't sink live-ins.
+      if (!Recipe)
+        continue;
+      VPBasicBlock *BasicBlock = Recipe->getParent();
+      assert(BasicBlock && "Recipe not in any basic block");
+
+      // We can't sink an instruction that isn't being scalarized.
+      if (!isa<VPScalarizeOneByOneRecipe>(Recipe) &&
+          !isa<VPBuildScalarStepsRecipe>(Recipe))
+        continue;
+
+      // We can't sink an instruction if it is already in the predicated block
+      // or may have side effects.
+      if (BasicBlock == PredBB || I->mayHaveSideEffects())
+        continue;
+
+      // Handle phi nodes last to make sure that any user they may have has
+      // sunk by now. This is relevant for induction variables that feed
+      // uniform GEPs which may or may not sink.
+      if (isa<PHINode>(I)) {
+        auto IsNotAPhi = [&](Value *V) -> bool { return !isa<PHINode>(V); };
+        if (any_of(Worklist, IsNotAPhi) ||
+            any_of(InstsToReanalyze, IsNotAPhi)) {
+          InstsToReanalyze.push_back(I);
+          Changed = true; // Make sure this phi gets a second chance.
+          continue;
+        }
+      }
+
+      bool HasVectorizedUses = false;
+      bool AllScalarizedUsesInPredicatedBlock = true;
+      unsigned MinLaneToSink = 0;
+      for (auto &U : I->uses()) {
+        auto *UI = cast<Instruction>(U.getUser());
+        VPRecipeBase *UserRecipe = Plan->getRecipe(UI);
+        // Generated scalarized instructions don't serve users outside of the
+        // VPlan, so we can safely ignore users that have no recipe.
+        if (!UserRecipe)
+          continue;
+
+        // GEPs used as the uniform address of a wide memory operation must not
+        // sink lane zero.
+        if (isa<GetElementPtrInst>(I) &&
+            (isa<VPInterleaveRecipe>(UserRecipe) ||
+             (isa<VPVectorizeOneByOneRecipe>(UserRecipe) &&
+              (isa<LoadInst>(UI) || isa<StoreInst>(UI)) &&
+              Legal->isConsecutivePtr(I)))) {
+          MinLaneToSink = std::max(MinLaneToSink, 1u);
+          continue;
+        }
+
+        bool UserIsScalarized = isa<VPScalarizeOneByOneRecipe>(UserRecipe) ||
+                                isa<VPBuildScalarStepsRecipe>(UserRecipe);
+
+        // Scalarizing recipes can have non-scalarizing users. Note that
+        // build-scalar-steps cannot - vector users of induction variables use
+        // the instructions generated by widen-phi recipes.
+        if (isa<VPBuildScalarStepsRecipe>(Recipe) && !UserIsScalarized) {
+          // All of I's lanes are used by an instruction we can't sink.
+          HasVectorizedUses = true;
+          break;
+        }
+
+        // Induction variables feeding consecutive GEPs can be indirectly used
+        // by vectorized load/stores which generate their own GEP rather than
+        // reuse the scalarized one (unlike load/store in interleave groups).
+        // In such a case, we can sink all lanes but lane zero. Note that we
+        // can do this whether or not the GEP is used within the predicated
+        // block (i.e. whether it will sink its own lanes 1..VF-1).
+        if (isa<GetElementPtrInst>(UI) && Legal->isConsecutivePtr(UI) &&
+            isa<VPBuildScalarStepsRecipe>(Recipe)) {
+          auto IsVectorizedMemoryOperation = [&](User *U) -> bool {
+            if (!(isa<LoadInst>(U) || isa<StoreInst>(U)))
+              return false;
+            VPRecipeBase *Recipe = Plan->getRecipe(cast<Instruction>(U));
+            return Recipe && isa<VPVectorizeOneByOneRecipe>(Recipe);
+          };
+
+          if (any_of(UI->users(), IsVectorizedMemoryOperation)) {
+            MinLaneToSink = std::max(MinLaneToSink, 1u);
+            continue;
+          }
+        }
+
+        if (UserRecipe->getParent() != PredBB && UserIsScalarized) {
+          // Don't make a decision until all scalarized users have sunk. Note
+          // that if the user is not scalarized then it has either prevented us
+          // from reaching this point or is irrelevant.
+          AllScalarizedUsesInPredicatedBlock = false;
+          continue;
+        }
+
+        // Ok to sink w.r.t. this use, but no more lanes than what the user
+        // itself has sunk.
+        VPLaneRange DesignatedLanes;
+        if (auto *BSS = dyn_cast<VPBuildScalarStepsRecipe>(UserRecipe))
+          DesignatedLanes = BSS->getDesignatedLanes();
+        else if (auto *SOBO = dyn_cast<VPScalarizeOneByOneRecipe>(UserRecipe))
+          DesignatedLanes = SOBO->getDesignatedLanes();
+        else
+          ; // This is an irrelevant non-scalarized user.
+        VPLaneRange SinkableLanes =
+            VPLaneRange::intersect(VPLaneRange(MinLaneToSink), DesignatedLanes);
+        MinLaneToSink = SinkableLanes.getMinLane();
+      }
+
+      if (HasVectorizedUses)
+        continue; // This instruction cannot be sunk.
+
+      // It's legal to sink the instruction if all its uses occur in the
+      // predicated block. Otherwise, there's nothing to do yet, and we may
+      // need to reanalyze the instruction.
+      if (!AllScalarizedUsesInPredicatedBlock) {
+        InstsToReanalyze.push_back(I);
+        continue;
+      }
+
+      // Move the instruction to the beginning of the predicated block, and add
+      // its operands to the worklist (except for phi nodes).
+      PlanUtils.sinkInstruction(I, PredBB, MinLaneToSink);
+      if (!isa<PHINode>(I))
+        Worklist.insert(I->op_begin(), I->op_end());
+
+      // The sinking may have enabled other instructions to be sunk, so we will
+      // need to iterate.
+      Changed = true;
+    }
+  } while (Changed);
+}
+
+void LoopVectorizationPlanner::assignScalarVectorConversions(
+    Instruction *PredInst, VPlan *Plan) {
+
+  // NFC: Let Def's recipe generate the vector version of Def, but only
+  // if all of Def's users are vectorized. This is the equivalent of the
+  // previous predicateInstructions, by which an insert-element got hoisted
+  // into the matching predicated basic block if it was the only user of
+  // the predicated instruction.
+
+  if (PredInst->use_empty())
+    return;
+
+  for (User *U : PredInst->users()) {
+    Instruction *UserInst = dyn_cast<Instruction>(U);
+    if (!UserInst)
+      continue;
+
+    VPRecipeBase *UserRecipe = Plan->getRecipe(UserInst);
+    if (!UserRecipe) // User is not part of the plan.
+      return;
+
+    if (isa<VPScalarizeOneByOneRecipe>(UserRecipe)) {
+      // Found a user that will not be using the vector form of the predicated
+      // instruction.
The insert-element is not going to be the only user, so + // do not hoist it. + return; + } + } + + Plan->getRecipe(PredInst)->addAlsoPackOrUnpack(PredInst); +} + +void VPVectorizeOneByOneRecipe::transformIRInstruction( + Instruction *I, VPTransformState &State) { + assert(I && "No instruction to vectorize."); + State.ILV->vectorizeInstruction(*I); + if (willAlsoPackOrUnpack(I)) { // Unpack instruction + for (unsigned Part = 0; Part < State.UF; ++Part) + for (unsigned Lane = 0; Lane < State.VF; ++Lane) + State.ILV->getScalarValue(I, Part, Lane); + } +} + +void VPScalarizeOneByOneRecipe::transformIRInstruction( + Instruction *I, VPTransformState &State) { + assert(I && "No instruction to vectorize."); + // By default generate scalar instances for all VF lanes of all UF parts. + // If the instruction is uniform, generate only the first lane for each + // of the UF parts. + bool IsUniform = State.Cost->isUniformAfterVectorization(I, State.VF); + unsigned MinLane = 0; + unsigned MaxLane = IsUniform ? 0 : State.VF - 1; + unsigned MinPart = 0; + unsigned MaxPart = State.UF - 1; + + if (State.Instance) { + // Asked to create an instance for a specific lane and a specific part. + assert((State.VF == 1 || !IsUniform) && + "Uniform instruction vectorized for a specific instance."); + MinLane = State.Instance->Lane; + MaxLane = MinLane; + MinPart = State.Instance->Part; + MaxPart = MinPart; + } + + // Intersect requested lanes with the designated lanes for this recipe. + VPLaneRange ActiveLanes(MinLane, MaxLane); + VPLaneRange EffectiveLanes = + VPLaneRange::intersect(ActiveLanes, DesignatedLanes); + if (EffectiveLanes.isEmpty()) + return; // None of the requested lanes is designated for this recipe. + + // Generate relevant lanes. + State.ILV->scalarizeInstruction(I, MinPart, MaxPart, + EffectiveLanes.getMinLane(), + EffectiveLanes.getMaxLane()); + if (willAlsoPackOrUnpack(I)) { + if (State.Instance) + // Insert scalar instance packing it into a vector. + State.ILV->constructVectorValue(I, MinPart, MinLane); + else + // Broadcast or group together all instances into a vector. + State.ILV->getVectorValue(I); + } +} + +void VPWidenPHIRecipe::vectorize(VPTransformState &State) { + State.ILV->widenPHIInstruction(Phi, State.UF, State.VF); +} + +void VPWidenPHIRecipe::print(raw_ostream &O) const { + O << "WIDEN PHI"; + O << ":\\l "; + VPlanPrinter::printAsIngredient(O, Phi); +} + +void VPWidenIntOrFpInductionRecipe::vectorize(VPTransformState &State) { + assert(State.Instance == nullptr && "Int induction being replicated"); + auto BuildScalarInfo = + State.ILV->widenIntOrFpInduction(getInductionDescriptor(State.Legal), + NeedsScalarIV, IV, Trunc); + ScalarIV = BuildScalarInfo.first; + Step = BuildScalarInfo.second; +} + +void VPWidenIntOrFpInductionRecipe::print(raw_ostream &O) const { + O << "WIDEN INT/FP INDUCTION"; + if (NeedsScalarIV) + O << " (needs scalars)"; + O << ":"; + O << "\\l "; + VPlanPrinter::printAsIngredient(O, IV); + if (Trunc) { + O << "\\l "; + VPlanPrinter::printAsIngredient(O, Trunc); + } +} + +void VPBuildScalarStepsRecipe::vectorize(VPTransformState &State) { + // By default generate scalar instances for all VF lanes of all UF parts. + // If the instruction is uniform, generate only the first lane for each + // of the UF parts. + bool IsUniform = State.Cost->isUniformAfterVectorization(EntryVal, State.VF); + unsigned MinLane = 0; + unsigned MaxLane = IsUniform ? 
0 : State.VF - 1; + unsigned MinPart = 0; + unsigned MaxPart = State.UF - 1; + + if (State.Instance) { + // Asked to create an instance for a specific lane and a specific part. + MinLane = State.Instance->Lane; + MaxLane = MinLane; + MinPart = State.Instance->Part; + MaxPart = MinPart; + } + + // Intersect requested lanes with the designated lanes for this recipe. + VPLaneRange ActiveLanes(MinLane, MaxLane); + VPLaneRange EffectiveLanes = + VPLaneRange::intersect(ActiveLanes, DesignatedLanes); + if (EffectiveLanes.isEmpty()) + return; // None of the requested lanes is designated for this recipe. + + // Generate relevant lanes. + State.ILV->buildScalarSteps(WIFI->getScalarIV(), WIFI->getStep(), EntryVal, + WIFI->getInductionDescriptor(State.Legal), + MinPart, MaxPart, EffectiveLanes.getMinLane(), + EffectiveLanes.getMaxLane()); +} + +void VPBuildScalarStepsRecipe::print(raw_ostream &O) const { + O << "BUILD SCALAR STEPS"; + if (!DesignatedLanes.isFull()) { + O << " "; + DesignatedLanes.print(O); + } + O << ":\\l "; + VPlanPrinter::printAsIngredient(O, EntryVal); +} + +void VPInterleaveRecipe::vectorize(VPTransformState &State) { + assert(State.Instance == nullptr && "Interleave group being replicated"); + State.ILV->vectorizeInterleaveGroup(IG->getInsertPos()); +} + +void VPInterleaveRecipe::print(raw_ostream &O) const { + O << "INTERLEAVE GROUP with factor " << IG->getFactor() + << " at "; IG->getInsertPos()->printAsOperand(O, false); + for (unsigned i = 0; i < IG->getFactor(); ++i) + if (Instruction *I = IG->getMember(i)) { + O << "\\l "; + VPlanPrinter::printAsIngredient(O, I); + O << " " << i; + if (willAlsoPackOrUnpack(I)) + O << " (V->S)"; + } +} + +void VPExtractMaskBitRecipe::vectorize(VPTransformState &State) { + assert(State.Instance && "Extract Mask Bit works only on single instance."); + + unsigned Part = State.Instance->Part; + unsigned Lane = State.Instance->Lane; + + typedef SmallVector VectorParts; + + VectorParts Cond = State.ILV->createBlockInMask(MaskedBasicBlock); + + ConditionBit = Cond[Part]; + if (ConditionBit->getType()->isVectorTy()) + ConditionBit = + State.Builder.CreateExtractElement(ConditionBit, + State.ILV->Builder.getInt32(Lane)); + ConditionBit = + State.Builder.CreateICmp(ICmpInst::ICMP_EQ, ConditionBit, + ConstantInt::get(ConditionBit->getType(), 1)); + DEBUG(dbgs() << "\nLV: vectorizing ConditionBit recipe" + << MaskedBasicBlock->getName()); +} + +void VPMergeScalarizeBranchRecipe::vectorize(VPTransformState &State) { + assert(State.Instance && + "Merge Scalarize Branch works only on single instance."); + + Type *LiveOutType = LiveOut->getType(); + unsigned Part = State.Instance->Part; + unsigned Lane = State.Instance->Lane; + + // Rename the predicated and merged basic blocks for backwards compatibility. + Instruction *ScalarLiveOut = + cast(State.ILV->getScalarValue(LiveOut, Part, Lane)); + BasicBlock *PredicatedBB = ScalarLiveOut->getParent(); + BasicBlock *PredicatingBB = PredicatedBB->getSinglePredecessor(); + assert(PredicatingBB && "Predicated block has no single predecessor"); + PredicatedBB->setName(Twine("pred.") + LiveOut->getOpcodeName() + ".if"); + PredicatedBB->getSingleSuccessor()->setName( + Twine("pred.") + LiveOut->getOpcodeName() + ".continue"); + if (LiveOutType->isVoidTy()) + return; + + // Generate a phi node for the scalarized instruction. 
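+  // For example (an illustrative sketch; block and value names hypothetical):
+  // for a predicated udiv %d generated in block %pred.udiv.if, reached from
+  // predecessor %entry, this emits:
+  //   pred.udiv.continue:
+  //     %r = phi i32 [ undef, %entry ], [ %d, %pred.udiv.if ]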
+  PHINode *Phi = State.ILV->Builder.CreatePHI(LiveOutType, 2);
+  Phi->addIncoming(UndefValue::get(ScalarLiveOut->getType()), PredicatingBB);
+  Phi->addIncoming(ScalarLiveOut, PredicatedBB);
+  State.ILV->setScalarValue(LiveOut, Part, Lane, Phi);
+
+  // If this instruction also generated the complementing form then we also
+  // need to create a phi for the vector value of this part & lane and update
+  // the vector values cache accordingly.
+  Value *VectorValue = State.ILV->getVectorValue(LiveOut, Part);
+  if (!VectorValue)
+    return;
+
+  InsertElementInst *IEI = cast<InsertElementInst>(VectorValue);
+  PHINode *VPhi = State.ILV->Builder.CreatePHI(IEI->getType(), 2);
+  VPhi->addIncoming(IEI->getOperand(0), PredicatingBB); // the unmodified vector
+  VPhi->addIncoming(IEI, PredicatedBB); // new vector with the inserted element
+  State.ILV->setVectorValue(LiveOut, Part, VPhi);
+}
+
+/// Creates a new VPScalarizeOneByOneRecipe or VPVectorizeOneByOneRecipe,
+/// depending on the \p isScalarizing parameter.
+VPOneByOneRecipeBase *VPlanUtilsLoopVectorizer::createOneByOneRecipe(
+    const BasicBlock::iterator B, const BasicBlock::iterator E, VPlan *Plan,
+    bool isScalarizing) {
+  if (isScalarizing)
+    return new VPScalarizeOneByOneRecipe(B, E, Plan);
+  return new VPVectorizeOneByOneRecipe(B, E, Plan);
+}
+
+bool VPlanUtilsLoopVectorizer::appendInstruction(VPOneByOneRecipeBase *Recipe,
+                                                 Instruction *Instr) {
+  if (Recipe->End != Instr->getIterator())
+    return false;
+
+  Recipe->End++;
+  Plan->setInst2Recipe(Instr, Recipe);
+  return true;
+}
+
+/// Given a \p Split instruction assumed to reside in a VPOneByOneRecipeBase
+/// -- where VPOneByOneRecipeBase is either VPScalarizeOneByOneRecipe or
+/// VPVectorizeOneByOneRecipe -- update that recipe to start from \p Split
+/// and move all preceding instructions to a new VPOneByOneRecipeBase.
+/// \return the newly created VPOneByOneRecipeBase, which is added to the
+/// VPBasicBlock of the original recipe, right before it.
+VPOneByOneRecipeBase *
+VPlanUtilsLoopVectorizer::splitRecipe(Instruction *Split) {
+  VPOneByOneRecipeBase *Recipe =
+      cast<VPOneByOneRecipeBase>(Plan->getRecipe(Split));
+  auto SplitPos = Split->getIterator();
+
+  assert(SplitPos != Recipe->Begin &&
+         "Nothing to split before first instruction.");
+  assert(SplitPos != Recipe->End && "Nothing to split after last instruction.");
+
+  // Build a new recipe for all instructions up to the given Split.
+  VPBasicBlock *BasicBlock = Recipe->getParent();
+  VPOneByOneRecipeBase *NewRecipe = createOneByOneRecipe(
+      Recipe->Begin, SplitPos, Plan, Recipe->isScalarizing());
+
+  // Insert the new recipe before the split point.
+  BasicBlock->addRecipe(NewRecipe, Recipe);
+
+  // Update the old recipe to start from the given split point.
+  Recipe->Begin = SplitPos;
+
+  return NewRecipe;
+}
+
+/// Insert a given instruction \p Inst into a VPBasicBlock before another
+/// given instruction \p Before. Assumes \p Inst does not belong to any
+/// recipe, and that \p Before belongs to a VPOneByOneRecipeBase.
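+/// A non-zero \p MinLane restricts the new recipe to lanes MinLane..VF-1,
+/// in support of partially sinking an instruction (see sinkInstruction).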
+void VPlanUtilsLoopVectorizer::insertBefore(Instruction *Inst,
+                                            Instruction *Before,
+                                            unsigned MinLane) {
+  assert(!Plan->getRecipe(Inst) && "Instruction already in recipe.");
+  VPRecipeBase *Recipe = Plan->getRecipe(Before);
+  assert(Recipe && "Insertion point not in any recipe.");
+  VPOneByOneRecipeBase *OBORecipe = cast<VPOneByOneRecipeBase>(Recipe);
+  bool PartialInsertion = MinLane > 0;
+  bool IndicesMatch = true;
+
+  if (PartialInsertion) {
+    VPScalarizeOneByOneRecipe *SOBO =
+        dyn_cast<VPScalarizeOneByOneRecipe>(Recipe);
+    if (!SOBO || SOBO->DesignatedLanes.getMinLane() != MinLane)
+      IndicesMatch = false;
+  }
+
+  // Can we insert \p Inst by augmenting the existing recipe of \p Before?
+  // Only if \p Inst is immediately followed by \p Before:
+  BasicBlock::iterator NextInst = Inst->getIterator();
+  if (&*++NextInst == Before && IndicesMatch) {
+    // This must imply that \p Before is the first ingredient in its recipe.
+    assert(Before == &*OBORecipe->Begin &&
+           "Trying to insert but Before is not first in its recipe.");
+    // Yes, extend the range to include the previous instruction.
+    OBORecipe->Begin--;
+    Plan->setInst2Recipe(Inst, Recipe);
+    return;
+  }
+  // Note that it is not possible to augment the end of Recipe by having
+  // Inst == &*Recipe->End, because to do that Before would need to be
+  // Recipe->End, which means that Before does not belong to this Recipe.
+
+  // No, the instruction needs to have its own recipe.
+
+  // If we're not inserting right before the Recipe's first instruction,
+  // split the Recipe to allow placing the new recipe right before the
+  // given insertion point. This new recipe is also added to BasicBlock.
+  if (Before != &*OBORecipe->Begin)
+    splitRecipe(Before);
+
+  // TODO: VPlanUtils::addOneByOneToBasicBlock()
+  auto InstBegin = Inst->getIterator();
+  auto InstEnd = InstBegin;
+  VPBasicBlock *BasicBlock = Recipe->getParent();
+  VPOneByOneRecipeBase *NewRecipe = nullptr;
+  if (PartialInsertion) {
+    NewRecipe = createOneByOneRecipe(InstBegin, ++InstEnd, Plan, true);
+    cast<VPScalarizeOneByOneRecipe>(NewRecipe)->DesignatedLanes =
+        VPLaneRange(MinLane);
+  } else
+    NewRecipe = createOneByOneRecipe(InstBegin, ++InstEnd, Plan,
+                                     OBORecipe->isScalarizing());
+  Plan->setInst2Recipe(Inst, NewRecipe);
+  BasicBlock->addRecipe(NewRecipe, OBORecipe);
+}
+
+/// Remove a given instruction \p Inst from its recipe, if one exists. We only
+/// support removal from VPOneByOneRecipeBase at this time.
+void VPlanUtilsLoopVectorizer::removeInstruction(Instruction *Inst,
+                                                 unsigned FromLane) {
+  VPRecipeBase *Recipe = Plan->getRecipe(Inst);
+  if (!Recipe)
+    return; // Nothing to do, no recipe to remove the instruction from.
+  VPOneByOneRecipeBase *OBORecipe = cast<VPOneByOneRecipeBase>(Recipe);
+  // First check if OBORecipe can be shortened to exclude Inst. Note that End
+  // points one past the last ingredient.
+  bool InstructionWasLast = false;
+  if (&*OBORecipe->Begin == Inst)
+    OBORecipe->Begin++;
+  else if (&*std::prev(OBORecipe->End) == Inst) {
+    OBORecipe->End--;
+    InstructionWasLast = true;
+  }
+  // Otherwise split OBORecipe at Inst.
+  else {
+    splitRecipe(Inst);
+    OBORecipe->Begin++;
+  }
+  if (FromLane > 0) {
+    // This is a partial removal. Leave lanes 0..FromLane-1 in the original
+    // basic block in a new, unregistered recipe.
+    VPOneByOneRecipeBase *NewRecipe = createOneByOneRecipe(
+        Inst->getIterator(), ++(Inst->getIterator()), Plan, true);
+    cast<VPScalarizeOneByOneRecipe>(NewRecipe)->DesignatedLanes =
+        VPLaneRange(0, FromLane - 1);
+    Recipe->getParent()->addRecipe(NewRecipe,
+                                   InstructionWasLast ? nullptr : Recipe);
+  }
+  Plan->resetInst2Recipe(Inst);
+  // If the source recipe is now empty, remove it.
+  if (OBORecipe && OBORecipe->Begin == OBORecipe->End) {
+    OBORecipe->getParent()->removeRecipe(OBORecipe);
+    delete OBORecipe;
+  }
+}
+
+// Given an instruction \p Inst and a VPBasicBlock \p To, remove \p Inst from
+// its current residence and add it as the first instruction of \p To.
+// We currently support removal from and insertion to
+// VPOneByOneRecipeBase's only.
+// TODO: this is an over-simplistic implementation that assumes we can make
+// the new instruction the first instruction of the first recipe in the
+// basic block. This is true for the sinkScalarOperands use-case, but for a
+// general basic block a getFirstInsertionPt() logic is required.
+void VPlanUtilsLoopVectorizer::sinkInstruction(Instruction *Inst,
+                                               VPBasicBlock *To,
+                                               unsigned MinLane) {
+  RecipeListTy *Recipes = getRecipes(To);
+
+  VPRecipeBase *FromRecipe = Plan->getRecipe(Inst);
+  if (auto *FromBSSRecipe = dyn_cast<VPBuildScalarStepsRecipe>(FromRecipe)) {
+    VPBuildScalarStepsRecipe *SunkRecipe = nullptr;
+    if (MinLane == 0) {
+      // Sink the entire recipe.
+      VPBasicBlock *From = FromRecipe->getParent();
+      assert(From && "Recipe to sink not assigned to any basic block");
+      From->removeRecipe(FromBSSRecipe);
+      SunkRecipe = FromBSSRecipe;
+    } else {
+      // Partially sink lanes MinLane..VF-1.
+      SunkRecipe = new VPBuildScalarStepsRecipe(FromBSSRecipe->WIFI,
+                                                FromBSSRecipe->EntryVal, Plan);
+      SunkRecipe->DesignatedLanes = VPLaneRange(MinLane);
+      FromBSSRecipe->DesignatedLanes = VPLaneRange(0, MinLane - 1);
+    }
+    To->addRecipe(SunkRecipe, &*Recipes->begin());
+    return;
+  }
+
+  assert(Plan->getRecipe(Inst) &&
+         isa<VPOneByOneRecipeBase>(Plan->getRecipe(Inst)) &&
+         "Unsupported recipe to sink instructions from");
+
+  // Remove instruction from its source recipe.
+  removeInstruction(Inst, MinLane);
+
+  auto *ToRecipe = dyn_cast<VPOneByOneRecipeBase>(&*Recipes->begin());
+  if (ToRecipe) {
+    // Try to sink the instruction into an existing recipe, default to a new
+    // recipe.
+    assert(ToRecipe->isScalarizing() &&
+           "Cannot sink into a non-scalarizing recipe.");
+
+    // Add it before the first ingredient of To.
+    insertBefore(Inst, &*ToRecipe->Begin, MinLane);
+  } else {
+    // Instruction has to go into its own one-by-one recipe.
+    auto InstBegin = Inst->getIterator();
+    auto InstEnd = InstBegin;
+    auto *NewRecipe = createOneByOneRecipe(InstBegin, ++InstEnd, Plan, true);
+    if (MinLane > 0) // Partial sink
+      cast<VPScalarizeOneByOneRecipe>(NewRecipe)->DesignatedLanes =
+          VPLaneRange(MinLane);
+    To->addRecipe(NewRecipe, &*Recipes->begin());
+  }
+}
+
+bool LoopVectorizePass::processLoop(Loop *L) {
+  assert(L->empty() && "Only process inner loops.");
+
+#ifndef NDEBUG
+  const std::string DebugLocStr = getDebugLocString(L);
+#endif /* NDEBUG */
+
+  DEBUG(dbgs() << "\nLV: Checking a loop in \""
+               << L->getHeader()->getParent()->getName() << "\" from "
+               << DebugLocStr << "\n");
+
+  LoopVectorizeHints Hints(L, DisableUnrolling, *ORE);
+
+  DEBUG(dbgs() << "LV: Loop hints:"
+               << " force="
+               << (Hints.getForce() == LoopVectorizeHints::FK_Disabled
+                       ? "disabled"
+                       : (Hints.getForce() == LoopVectorizeHints::FK_Enabled
+                              ? "enabled"
+                              : "?"))
+               << " width=" << Hints.getWidth()
+               << " unroll=" << Hints.getInterleave() << "\n");
+
+  // Function containing the loop.
+  Function *F = L->getHeader()->getParent();
+
+  // Looking at the diagnostic output is the only way to determine if a loop
+  // was vectorized (other than looking at the IR or machine code), so it
+  // is important to generate an optimization remark for each loop. Most of
+  // these messages are generated as OptimizationRemarkAnalysis. Remarks
+  // generated as OptimizationRemark and OptimizationRemarkMissed are
+  // less verbose, reporting vectorized loops and unvectorized loops that may
+  // benefit from vectorization, respectively.
+
+  if (!Hints.allowVectorization(F, L, AlwaysVectorize)) {
+    DEBUG(dbgs() << "LV: Loop hints prevent vectorization.\n");
+    return false;
+  }
+
+  // Check the loop for a trip count threshold:
+  // do not vectorize loops with a tiny trip count.
+  const unsigned MaxTC = SE->getSmallConstantMaxTripCount(L);
+  if (MaxTC > 0u && MaxTC < TinyTripCountVectorThreshold) {
+    DEBUG(dbgs() << "LV: Found a loop with a very small trip count. "
+                 << "This loop is not worth vectorizing.");
+    if (Hints.getForce() == LoopVectorizeHints::FK_Enabled)
+      DEBUG(dbgs() << " But vectorizing was explicitly forced.\n");
+    else {
@@ -7700,7 +9027,7 @@
   CM.collectValuesToIgnore();
 
   // Use the planner for vectorization.
-  LoopVectorizationPlanner LVP(CM);
+  LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, CM);
 
   // Get user vectorization factor.
   unsigned UserVF = Hints.getWidth();
@@ -7787,6 +9114,8 @@
     DEBUG(dbgs() << "LV: Interleave Count is " << IC << '\n');
   }
 
+  LVP.setBestPlan(VF.Width, IC);
+
   using namespace ore;
   if (!VectorizeLoop) {
     assert(IC > 1 && "interleave count should not be 1 or 0");
@@ -7794,7 +9123,7 @@
     // interleave it.
     InnerLoopUnroller Unroller(L, PSE, LI, DT, TLI, TTI, AC, ORE, IC, &LVL,
                               &CM);
-    Unroller.vectorize();
+    LVP.executeBestPlan(Unroller);
 
     ORE->emit(OptimizationRemark(LV_NAME, "Interleaved", L->getStartLoc(),
                                  L->getHeader())
@@ -7804,7 +9133,7 @@
   // If we decided that it is *legal* to vectorize the loop, then do it.
   InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC,
                          &LVL, &CM);
-  LB.vectorize();
+  LVP.executeBestPlan(LB);
   ++LoopsVectorized;
 
   // Add metadata to disable runtime unrolling a scalar loop when there are
Index: lib/Transforms/Vectorize/VPlan.h
===================================================================
--- /dev/null
+++ lib/Transforms/Vectorize/VPlan.h
@@ -0,0 +1,941 @@
+//===- VPlan.h - Represent A Vectorizer Plan ------------------------------===//
+//
+// The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+//
+// This file contains the declarations of the Vectorization Plan base classes:
+// 1. VPBasicBlock and VPRegionBlock that inherit from a common pure virtual
+//    VPBlockBase, together implementing a Hierarchical CFG;
+// 2. Specializations of GraphTraits that allow VPBlockBase graphs to be
+//    treated as proper graphs for generic algorithms;
+// 3. Pure virtual VPRecipeBase and its pure virtual sub-classes
+//    VPConditionBitRecipeBase and VPOneByOneRecipeBase that represent base
+//    classes for recipes contained within VPBasicBlocks;
+// 4. The VPlan class holding a candidate for vectorization;
+// 5. The VPlanUtils class providing methods for building plans;
+// 6. The VPlanPrinter class providing a way to print a plan in dot format.
+// These are documented in docs/VectorizationPlan.rst.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_TRANSFORMS_VECTORIZE_VPLAN_H
+#define LLVM_TRANSFORMS_VECTORIZE_VPLAN_H
+
+#include "llvm/ADT/GraphTraits.h"
+#include "llvm/ADT/ilist.h"
+#include "llvm/ADT/ilist_node.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/Support/raw_ostream.h"
+
+// The (re)use of existing LoopVectorize classes is subject to future VPlan
+// refactoring.
+namespace {
+class InnerLoopVectorizer;
+class LoopVectorizationLegality;
+class LoopVectorizationCostModel;
+}
+
+namespace llvm {
+
+class VPBasicBlock;
+
+/// VPRecipeBase is a base class describing one or more instructions that will
+/// appear consecutively in the vectorized version, based on Instructions from
+/// the given IR. These Instructions are referred to as the "Ingredients" of
+/// the Recipe. A Recipe specifies how its ingredients are to be vectorized:
+/// e.g., copy or reuse them as uniform, scalarize or vectorize them according
+/// to an enclosing loop dimension, or vectorize them according to an internal
+/// SLP dimension.
+///
+/// **Design principle:** in order to reason about how to vectorize an
+/// Instruction or how much it would cost, one has to consult the VPRecipe
+/// holding it.
+///
+/// **Design principle:** when a sequence of instructions conveys additional
+/// information as a group, we use a VPRecipe to encapsulate them and attach
+/// this information to the VPRecipe. For instance a VPRecipe can model an
+/// interleave group of loads or stores with additional information for
+/// calculating their cost and for performing IR code generation, as a group.
+///
+/// **Design principle:** a VPRecipe should reuse existing containers of its
+/// ingredients, i.e., iterators of basic blocks, to be lightweight. A new
+/// container should be opened on-demand, e.g., to avoid excessive recipes
+/// each holding an interval of ingredients.
+class VPRecipeBase
+    : public ilist_node_with_parent<VPRecipeBase, VPBasicBlock> {
+  friend class VPlanUtils;
+  friend class VPBasicBlock;
+
+private:
+  const unsigned char VRID; /// Subclass identifier (for isa/dyn_cast).
+
+  /// Each VPRecipe is contained in a single VPBasicBlock.
+  class VPBasicBlock *Parent;
+
+  /// Record which Instructions would require generating their complementing
+  /// form as well, providing a vector-to-scalar or scalar-to-vector
+  /// conversion.
+  SmallPtrSet<Instruction *, 1> AlsoPackOrUnpack;
+
+public:
+  /// An enumeration for keeping track of the concrete subclass of VPRecipeBase
+  /// that is actually instantiated. Values of this enumeration are kept in the
+  /// VPRecipe classes' VRID field. They are used for concrete type
+  /// identification.
+  typedef enum {
+    VPVectorizeOneByOneSC,
+    VPScalarizeOneByOneSC,
+    VPWidenPHISC,
+    VPWidenIntOrFpInductionSC,
+    VPBuildScalarStepsSC,
+    VPInterleaveSC,
+    VPExtractMaskBitSC,
+    VPMergeScalarizeBranchSC,
+  } VPRecipeTy;
+
+  VPRecipeBase(const unsigned char SC) : VRID(SC), Parent(nullptr) {}
+
+  virtual ~VPRecipeBase() {}
+
+  /// \return an ID for the concrete type of this object.
+  /// This is used to implement the classof checks. This should not be used
+  /// for any other purpose, as the values may change as LLVM evolves.
+  unsigned getVPRecipeID() const { return VRID; }
+
+  /// \return the VPBasicBlock which this VPRecipe belongs to.
+  class VPBasicBlock *getParent() {
+    return Parent;
+  }
+
+  /// The method which generates the new IR instructions that correspond to
+  /// this VPRecipe in the vectorized version, thereby "executing" the VPlan.
+  virtual void vectorize(struct VPTransformState &State) = 0;
+
+  /// Each recipe prints itself.
+  virtual void print(raw_ostream &O) const = 0;
+
+  /// Add an instruction to the set of instructions for which a vector-to-
+  /// scalar or scalar-to-vector conversion is needed, in addition to
+  /// vectorizing or scalarizing the instruction itself, respectively.
+  void addAlsoPackOrUnpack(Instruction *I) { AlsoPackOrUnpack.insert(I); }
+
+  /// Indicates if a given instruction requires vector-to-scalar or scalar-to-
+  /// vector conversion.
+  bool willAlsoPackOrUnpack(Instruction *I) const {
+    return AlsoPackOrUnpack.count(I);
+  }
+};
+
+/// A VPConditionBitRecipeBase is a pure virtual VPRecipe which supports a
+/// conditional branch. Concrete sub-classes of this recipe are in charge of
+/// generating the instructions that compute the condition for this branch in
+/// the vectorized version.
+class VPConditionBitRecipeBase : public VPRecipeBase {
+protected:
+  /// The actual condition bit that was generated. Holds null until the
+  /// value/instructions are generated by the vectorize() method.
+  Value *ConditionBit;
+
+public:
+  /// Construct a VPConditionBitRecipeBase, simply propagating its concrete
+  /// type.
+  VPConditionBitRecipeBase(const unsigned char SC)
+      : VPRecipeBase(SC), ConditionBit(nullptr) {}
+
+  /// \return the actual bit that was generated, to be plugged into the IR
+  /// conditional branch, or null if the code computing the actual bit has not
+  /// been generated yet.
+  Value *getConditionBit() { return ConditionBit; }
+
+  virtual StringRef getName() const = 0;
+};
+
+/// VPOneByOneRecipeBase is a VPRecipeBase which handles each Instruction in
+/// its ingredients independently, in order. The ingredients are either all
+/// vectorized, or all scalarized.
+/// A VPOneByOneRecipeBase is a virtual base recipe which can be materialized
+/// by one of two sub-classes, namely VPVectorizeOneByOneRecipe or
+/// VPScalarizeOneByOneRecipe, for vectorizing or scalarizing all ingredients,
+/// respectively.
+/// The ingredients are held as a sub-sequence of original Instructions, which
+/// reside in the same IR BasicBlock and in the same order. The ingredients are
+/// accessed by iterators to the first and one past the last Instruction.
+class VPOneByOneRecipeBase : public VPRecipeBase {
+  friend class VPlanUtilsLoopVectorizer;
+
+public:
+  /// Hold the ingredients by pointing to their original BasicBlock location.
+  BasicBlock::iterator Begin;
+  BasicBlock::iterator End;
+
+protected:
+  VPOneByOneRecipeBase() = delete;
+
+  VPOneByOneRecipeBase(unsigned char SC, const BasicBlock::iterator B,
+                       const BasicBlock::iterator E, class VPlan *Plan);
+
+  /// Do the actual code generation for a single instruction.
+  /// This function is to be implemented and specialized by the respective
+  /// sub-class.
+  virtual void transformIRInstruction(Instruction *I,
+                                      struct VPTransformState &State) = 0;
+
+public:
+  ~VPOneByOneRecipeBase() {}
+
+  /// Method to support type inquiry through isa, cast, and dyn_cast.
+  static inline bool classof(const VPRecipeBase *V) {
+    return V->getVPRecipeID() == VPRecipeBase::VPScalarizeOneByOneSC ||
+           V->getVPRecipeID() == VPRecipeBase::VPVectorizeOneByOneSC;
+  }
+
+  bool isScalarizing() const {
+    return getVPRecipeID() == VPRecipeBase::VPScalarizeOneByOneSC;
+  }
+
+  /// The method which generates all new IR instructions that correspond to
+  /// this VPOneByOneRecipeBase in the vectorized version, thereby
+  /// "executing" the VPlan.
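+  /// (For example, a recipe whose half-open ingredient range [Begin, End)
+  /// covers {%mul, %add} first transforms %mul and then %add, one at a time.)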
+ /// VPOneByOneRecipeBase may either scalarize or vectorize all Instructions. + void vectorize(struct VPTransformState &State) override { + for (auto It = Begin; It != End; ++It) + transformIRInstruction(&*It, State); + } + + const BasicBlock::iterator &begin() { return Begin; } + + const BasicBlock::iterator &end() { return End; } +}; + +/// Hold the indices of a specific scalar instruction. The VPIterationInstance +/// span the iterations of the original loop, that correspond to a single +/// iteration of the vectorized loop. +struct VPIterationInstance { + unsigned Part; + unsigned Lane; +}; + +// Forward declaration. +class BasicBlock; + +/// Hold additional information passed down when "executing" a VPlan, that is +/// needed for generating IR. Also facilitates reuse of existing LV +/// functionality. +struct VPTransformState { + + VPTransformState(unsigned VF, unsigned UF, class LoopInfo *LI, + class DominatorTree *DT, IRBuilder<> &Builder, + InnerLoopVectorizer *ILV, LoopVectorizationLegality *Legal, + LoopVectorizationCostModel *Cost) + : VF(VF), UF(UF), Instance(nullptr), LI(LI), DT(DT), Builder(Builder), + ILV(ILV), Legal(Legal), Cost(Cost) {} + + /// Record the selected vectorization and unroll factors of the single loop + /// being vectorized. + unsigned VF; + unsigned UF; + + /// Hold the indices to generate a specific scalar instruction. Null indicates + /// that all instances are to be generated, using either scalar or vector + /// instructions. + VPIterationInstance *Instance; + + /// Hold state information used when constructing the CFG of the vectorized + /// Loop, traversing the VPBasicBlocks and generating corresponding IR + /// BasicBlocks. + struct CFGState { + /// The previous VPBasicBlock visited. In the beginning set to null. + VPBasicBlock *PrevVPBB; + /// The previous IR BasicBlock created or reused. In the beginning set to + /// the new header BasicBlock. + BasicBlock *PrevBB; + /// The last IR BasicBlock of the loop body. Set to the new latch + /// BasicBlock, used for placing the newly created BasicBlocks. + BasicBlock *LastBB; + /// A mapping of each VPBasicBlock to the corresponding BasicBlock. In case + /// of replication, maps the BasicBlock of the last replica created. + SmallDenseMap VPBB2IRBB; + + CFGState() : PrevVPBB(nullptr), PrevBB(nullptr), LastBB(nullptr) {} + } CFG; + + /// Hold pointer to LoopInfo to register new basic blocks in the loop. + class LoopInfo *LI; + + /// Hold pointer to Dominator Tree to register new basic blocks in the loop. + class DominatorTree *DT; + + /// Hold a reference to the IRBuilder used to generate IR code. + IRBuilder<> &Builder; + + /// Hold a pointer to InnerLoopVectorizer to reuse its IR generation methods. + class InnerLoopVectorizer *ILV; + + /// Hold a pointer to LoopVectorizationLegality + class LoopVectorizationLegality *Legal; + + /// Hold a pointer to LoopVectorizationCostModel to access its + /// IsUniformAfterVectorization method. + LoopVectorizationCostModel *Cost; +}; + +/// VPBlockBase is the building block of the Hierarchical CFG. A VPBlockBase +/// can be either a VPBasicBlock or a VPRegionBlock. +/// +/// The Hierarchical CFG is a control-flow graph whose nodes are basic-blocks +/// or Hierarchical CFG's. The Hierarchical CFG data structure we use is similar +/// to the Tile Tree [1], where cross-Tile edges are lifted to connect Tiles +/// instead of the original basic-blocks as in Sharir [2], promoting the Tile +/// encapsulation. 
+/// We use the terms Region and Block rather than Tile [1] to avoid confusion
+/// with loop tiling.
+///
+/// [1] "Register Allocation via Hierarchical Graph Coloring", David Callahan
+/// and Brian Koblenz, PLDI 1991
+///
+/// [2] "Structural analysis: A new approach to flow analysis in optimizing
+/// compilers", M. Sharir, Journal of Computer Languages, Jan. 1980
+///
+/// Note that in contrast to the IR BasicBlock, a VPBlockBase models its
+/// control-flow edges with successor and predecessor VPBlockBase directly,
+/// rather than through a Terminator branch or through predecessor branches
+/// that Use the VPBlockBase.
+class VPBlockBase {
+  friend class VPlanUtils;
+
+private:
+  const unsigned char VBID; /// Subclass identifier (for isa/dyn_cast).
+
+  std::string Name;
+
+  /// The immediate VPRegionBlock which this VPBlockBase belongs to, or null if
+  /// it is a topmost VPBlockBase.
+  class VPRegionBlock *Parent;
+
+  /// List of predecessor blocks.
+  SmallVector<VPBlockBase *, 1> Predecessors;
+
+  /// List of successor blocks.
+  SmallVector<VPBlockBase *, 1> Successors;
+
+  /// Successor selector, null for zero or single successor blocks.
+  VPConditionBitRecipeBase *ConditionBitRecipe;
+
+  /// Add \p Successor as the last successor to this block.
+  void appendSuccessor(VPBlockBase *Successor) {
+    assert(Successor && "Cannot add nullptr successor!");
+    Successors.push_back(Successor);
+  }
+
+  /// Add \p Predecessor as the last predecessor to this block.
+  void appendPredecessor(VPBlockBase *Predecessor) {
+    assert(Predecessor && "Cannot add nullptr predecessor!");
+    Predecessors.push_back(Predecessor);
+  }
+
+  /// Remove \p Predecessor from the predecessors of this block.
+  void removePredecessor(VPBlockBase *Predecessor) {
+    auto Pos = std::find(Predecessors.begin(), Predecessors.end(),
+                         Predecessor);
+    assert(Pos != Predecessors.end() && "Predecessor does not exist");
+    Predecessors.erase(Pos);
+  }
+
+  /// Remove \p Successor from the successors of this block.
+  void removeSuccessor(VPBlockBase *Successor) {
+    auto Pos = std::find(Successors.begin(), Successors.end(), Successor);
+    assert(Pos != Successors.end() && "Successor does not exist");
+    Successors.erase(Pos);
+  }
+
+protected:
+  VPBlockBase(const unsigned char SC, const std::string &N)
+      : VBID(SC), Name(N), Parent(nullptr), ConditionBitRecipe(nullptr) {}
+
+public:
+  /// An enumeration for keeping track of the concrete subclass of VPBlockBase
+  /// that is actually instantiated. Values of this enumeration are kept in the
+  /// VPBlockBase classes' VBID field. They are used for concrete type
+  /// identification.
+  typedef enum { VPBasicBlockSC, VPRegionBlockSC } VPBlockTy;
+
+  virtual ~VPBlockBase() {}
+
+  const std::string &getName() const { return Name; }
+
+  /// \return an ID for the concrete type of this object.
+  /// This is used to implement the classof checks. This should not be used
+  /// for any other purpose, as the values may change as LLVM evolves.
+  unsigned getVPBlockID() const { return VBID; }
+
+  const class VPRegionBlock *getParent() const { return Parent; }
+
+  /// \return the VPBasicBlock that is the entry of this VPBlockBase,
+  /// recursively, if the latter is a VPRegionBlock. Otherwise, if this
+  /// VPBlockBase is a VPBasicBlock, it is returned.
+  const class VPBasicBlock *getEntryBasicBlock() const;
+
+  /// \return the VPBasicBlock that is the exit of this VPBlockBase,
+  /// recursively, if the latter is a VPRegionBlock. Otherwise, if this
+  /// VPBlockBase is a VPBasicBlock, it is returned.
+ const class VPBasicBlock *getExitBasicBlock() const; + class VPBasicBlock *getExitBasicBlock(); + + const SmallVectorImpl &getSuccessors() const { + return Successors; + } + + const SmallVectorImpl &getPredecessors() const { + return Predecessors; + } + + SmallVectorImpl &getSuccessors() { return Successors; } + + SmallVectorImpl &getPredecessors() { return Predecessors; } + + /// \return the successor of this VPBlockBase if it has a single successor. + /// Otherwise return a null pointer. + VPBlockBase *getSingleSuccessor() const { + return (Successors.size() == 1 ? *Successors.begin() : nullptr); + } + + /// \return the predecessor of this VPBlockBase if it has a single + /// predecessor. Otherwise return a null pointer. + VPBlockBase *getSinglePredecessor() const { + return (Predecessors.size() == 1 ? *Predecessors.begin() : nullptr); + } + + /// Returns the closest ancestor starting from "this", which has successors. + /// Returns the root ancestor if all ancestors have no successors. + VPBlockBase *getAncestorWithSuccessors(); + + /// Returns the closest ancestor starting from "this", which has predecessors. + /// Returns the root ancestor if all ancestors have no predecessors. + VPBlockBase *getAncestorWithPredecessors(); + + /// \return the successors either attached directly to this VPBlockBase or, if + /// this VPBlockBase is the exit block of a VPRegionBlock and has no + /// successors of its own, search recursively for the first enclosing + /// VPRegionBlock that has successors and return them. If no such + /// VPRegionBlock exists, return the (empty) successors of the topmost + /// VPBlockBase reached. + const SmallVectorImpl &getHierarchicalSuccessors() { + return getAncestorWithSuccessors()->getSuccessors(); + } + + /// \return the hierarchical successor of this VPBlockBase if it has a single + /// hierarchical successor. Otherwise return a null pointer. + VPBlockBase *getSingleHierarchicalSuccessor() { + return getAncestorWithSuccessors()->getSingleSuccessor(); + } + + /// \return the predecessors either attached directly to this VPBlockBase or, + /// if this VPBlockBase is the entry block of a VPRegionBlock and has no + /// predecessors of its own, search recursively for the first enclosing + /// VPRegionBlock that has predecessors and return them. If no such + /// VPRegionBlock exists, return the (empty) predecessors of the topmost + /// VPBlockBase reached. + const SmallVectorImpl &getHierarchicalPredecessors() { + return getAncestorWithPredecessors()->getPredecessors(); + } + + /// \return the hierarchical predecessor of this VPBlockBase if it has a + /// single hierarchical predecessor. Otherwise return a null pointer. + VPBlockBase *getSingleHierarchicalPredecessor() { + return getAncestorWithPredecessors()->getSinglePredecessor(); + } + + /// If a VPBlockBase has two successors, this is the Recipe that will generate + /// the condition bit selecting the successor, and feeding the terminating + /// conditional branch. Otherwise this is null. + VPConditionBitRecipeBase *getConditionBitRecipe() { + return ConditionBitRecipe; + } + + const VPConditionBitRecipeBase *getConditionBitRecipe() const { + return ConditionBitRecipe; + } + + void setConditionBitRecipe(VPConditionBitRecipeBase *R) { + ConditionBitRecipe = R; + } + + /// The method which generates all new IR instructions that correspond to + /// this VPBlockBase in the vectorized version, thereby "executing" the VPlan. 
+ virtual void vectorize(struct VPTransformState *State) = 0; + + /// Delete all blocks reachable from a given VPBlockBase, inclusive. + static void deleteCFG(VPBlockBase *Entry); +}; + +/// VPBasicBlock serves as the leaf of the Hierarchical CFG. It represents a +/// sequence of instructions that will appear consecutively in a basic block +/// of the vectorized version. The VPBasicBlock takes care of the control-flow +/// relations with other VPBasicBlock's and Regions. It holds a sequence of zero +/// or more VPRecipe's that take care of representing the instructions. +/// A VPBasicBlock that holds no VPRecipe's represents no instructions; this +/// may happen, e.g., to support disjoint Regions and to ensure Regions have a +/// single exit, possibly an empty one. +/// +/// Note that in contrast to the IR BasicBlock, a VPBasicBlock models its +/// control-flow edges with successor and predecessor VPBlockBase directly, +/// rather than through a Terminator branch or through predecessor branches that +/// "use" the VPBasicBlock. +class VPBasicBlock : public VPBlockBase { + friend class VPlanUtils; + +public: + typedef iplist RecipeListTy; + +private: + /// The list of VPRecipes, held in order of instructions to generate. + RecipeListTy Recipes; + +public: + /// Instruction iterators... + typedef RecipeListTy::iterator iterator; + typedef RecipeListTy::const_iterator const_iterator; + typedef RecipeListTy::reverse_iterator reverse_iterator; + typedef RecipeListTy::const_reverse_iterator const_reverse_iterator; + + //===--------------------------------------------------------------------===// + /// Recipe iterator methods + /// + inline iterator begin() { return Recipes.begin(); } + inline const_iterator begin() const { return Recipes.begin(); } + inline iterator end() { return Recipes.end(); } + inline const_iterator end() const { return Recipes.end(); } + + inline reverse_iterator rbegin() { return Recipes.rbegin(); } + inline const_reverse_iterator rbegin() const { return Recipes.rbegin(); } + inline reverse_iterator rend() { return Recipes.rend(); } + inline const_reverse_iterator rend() const { return Recipes.rend(); } + + inline size_t size() const { return Recipes.size(); } + inline bool empty() const { return Recipes.empty(); } + inline const VPRecipeBase &front() const { return Recipes.front(); } + inline VPRecipeBase &front() { return Recipes.front(); } + inline const VPRecipeBase &back() const { return Recipes.back(); } + inline VPRecipeBase &back() { return Recipes.back(); } + + /// Return the underlying instruction list container. + /// + /// Currently you need to access the underlying instruction list container + /// directly if you want to modify it. + const RecipeListTy &getInstList() const { return Recipes; } + RecipeListTy &getInstList() { return Recipes; } + + /// Returns a pointer to a member of the instruction list. + static RecipeListTy VPBasicBlock::*getSublistAccess(VPRecipeBase *) { + return &VPBasicBlock::Recipes; + } + + VPBasicBlock(const std::string &Name) : VPBlockBase(VPBasicBlockSC, Name) {} + + ~VPBasicBlock() { Recipes.clear(); } + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPBlockBase *V) { + return V->getVPBlockID() == VPBlockBase::VPBasicBlockSC; + } + + /// Augment the existing recipes of a VPBasicBlock with an additional + /// \p Recipe at a position given by an existing recipe \p Before. If + /// \p Before is null, \p Recipe is appended as the last recipe. 
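+  /// For example (illustrative), addRecipe(R) appends R at the end, while
+  /// addRecipe(R, &front()) makes R the first recipe of this VPBasicBlock.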
+ void addRecipe(VPRecipeBase *Recipe, VPRecipeBase *Before = nullptr) { + Recipe->Parent = this; + if (!Before) { + Recipes.push_back(Recipe); + return; + } + assert(Before->Parent == this && + "Insertion before point not in this basic block."); + Recipes.insert(Before->getIterator(), Recipe); + } + + void removeRecipe(VPRecipeBase *Recipe) { + assert(Recipe->Parent == this && + "Recipe to remove not in this basic block."); + Recipes.remove(Recipe); + Recipe->Parent = nullptr; + } + + /// The method which generates all new IR instructions that correspond to + /// this VPBasicBlock in the vectorized version, thereby "executing" the + /// VPlan. + void vectorize(struct VPTransformState *State) override; + + /// Retrieve the list of VPRecipes that belong to this VPBasicBlock. + const RecipeListTy &getRecipes() const { return Recipes; } + +private: + /// Create an IR BasicBlock to hold the instructions vectorized from this + /// VPBasicBlock, and return it. Update the CFGState accordingly. + BasicBlock *createEmptyBasicBlock(VPTransformState::CFGState &CFG); +}; + +/// VPRegionBlock represents a collection of VPBasicBlocks and VPRegionBlocks +/// which form a single-entry-single-exit subgraph of the CFG in the vectorized +/// code. +/// +/// A VPRegionBlock may indicate that its contents are to be replicated several +/// times. This is designed to support predicated scalarization, in which a +/// scalar if-then code structure needs to be generated VF * UF times. Having +/// this replication indicator helps to keep a single VPlan for multiple +/// candidate VF's; the actual replication takes place only once the desired VF +/// and UF have been determined. +/// +/// **Design principle:** when some additional information relates to an SESE +/// set of VPBlockBase, we use a VPRegionBlock to wrap them and attach the +/// information to it. For example, a VPRegionBlock can be used to indicate that +/// a scalarized SESE region is to be replicated, and that a vectorized SESE +/// region can retain its internal control-flow, independent of the control-flow +/// external to the region. +class VPRegionBlock : public VPBlockBase { + friend class VPlanUtils; + +private: + /// Hold the Single Entry of the SESE region represented by the VPRegionBlock. + VPBlockBase *Entry; + + /// Hold the Single Exit of the SESE region represented by the VPRegionBlock. + VPBlockBase *Exit; + + /// A VPRegionBlock can represent either a single instance of its + /// VPBlockBases, or multiple (VF * UF) replicated instances. The latter is + /// used when the internal SESE region handles a single scalarized lane. + bool IsReplicator; + +public: + VPRegionBlock(const std::string &Name) + : VPBlockBase(VPRegionBlockSC, Name), Entry(nullptr), Exit(nullptr), + IsReplicator(false) {} + + ~VPRegionBlock() { + if (Entry) + deleteCFG(Entry); + } + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPBlockBase *V) { + return V->getVPBlockID() == VPBlockBase::VPRegionBlockSC; + } + + VPBlockBase *getEntry() { return Entry; } + + VPBlockBase *getExit() { return Exit; } + + const VPBlockBase *getEntry() const { return Entry; } + + const VPBlockBase *getExit() const { return Exit; } + + /// An indicator if the VPRegionBlock represents single or multiple instances. 
+  bool isReplicator() const { return IsReplicator; }
+
+  void setReplicator(bool ToReplicate) { IsReplicator = ToReplicate; }
+
+  /// The method which generates the new IR instructions that correspond to
+  /// this VPRegionBlock in the vectorized version, thereby "executing" the
+  /// VPlan.
+  void vectorize(struct VPTransformState *State) override;
+};
+
+/// A VPlan represents a candidate for vectorization, encoding various
+/// decisions taken to produce efficient vector code, including: which
+/// instructions are to be vectorized or scalarized, and which branches are to
+/// appear in the vectorized version. It models the control-flow of the
+/// candidate vectorized version explicitly, and holds prescriptions for
+/// generating the code for this version from a given IR code.
+/// VPlan takes a "scenario-based approach" to vectorization planning -
+/// different scenarios, corresponding to making different decisions, can be
+/// modeled using different VPlans.
+/// The corresponding IR code is required to be SESE.
+/// The vectorized version is represented using a Hierarchical CFG.
+class VPlan {
+  friend class VPlanUtils;
+  friend class VPlanUtilsLoopVectorizer;
+
+private:
+  /// Hold the single entry to the Hierarchical CFG of the VPlan.
+  VPBlockBase *Entry;
+
+  /// The IR instructions which are to be transformed to fill the vectorized
+  /// version are held as ingredients inside the VPRecipe's of the VPlan. Hold
+  /// a reverse mapping to locate the VPRecipe an IR instruction belongs to.
+  /// This serves optimizations that operate on the VPlan.
+  DenseMap<Instruction *, VPRecipeBase *> Inst2Recipe;
+
+public:
+  VPlan() : Entry(nullptr) {}
+
+  ~VPlan() {
+    if (Entry)
+      VPBlockBase::deleteCFG(Entry);
+  }
+
+  /// Generate the IR code for this VPlan.
+  void vectorize(struct VPTransformState *State);
+
+  VPBlockBase *getEntry() { return Entry; }
+  const VPBlockBase *getEntry() const { return Entry; }
+
+  void setEntry(VPBlockBase *Block) { Entry = Block; }
+
+  /// Retrieve the VPRecipe a given instruction \p Inst belongs to in the
+  /// VPlan. Returns null if it belongs to no VPRecipe.
+  VPRecipeBase *getRecipe(Instruction *Inst) {
+    auto It = Inst2Recipe.find(Inst);
+    if (It == Inst2Recipe.end())
+      return nullptr;
+    return It->second;
+  }
+
+  void setInst2Recipe(Instruction *I, VPRecipeBase *R) { Inst2Recipe[I] = R; }
+
+  void resetInst2Recipe(Instruction *I) { Inst2Recipe.erase(I); }
+
+  /// Retrieve the VPBasicBlock a given instruction \p Inst belongs to in the
+  /// VPlan. Returns null if it belongs to no VPRecipe.
+  VPBasicBlock *getBasicBlock(Instruction *Inst) {
+    VPRecipeBase *Recipe = getRecipe(Inst);
+    if (!Recipe)
+      return nullptr;
+    return Recipe->getParent();
+  }
+
+private:
+  /// Add to the given dominator tree the header block and every new basic
+  /// block that was created between it and the latch block, inclusive.
+  void updateDominatorTree(class DominatorTree *DT,
+                           BasicBlock *LoopPreHeaderBB,
+                           BasicBlock *LoopLatchBB);
+};
+
+/// The VPlanUtils class provides interfaces for the construction and
+/// manipulation of a VPlan.
+class VPlanUtils {
+private:
+  /// Unique ID generator.
+  static unsigned NextOrdinal;
+
+protected:
+  VPlan *Plan;
+
+  typedef iplist<VPRecipeBase> RecipeListTy;
+  RecipeListTy *getRecipes(VPBasicBlock *Block) { return &Block->Recipes; }
+
+public:
+  VPlanUtils(VPlan *Plan) : Plan(Plan) {}
+
+  ~VPlanUtils() {}
+
+  /// Create a unique name for a new VPlan entity such as a VPBasicBlock or
+  /// VPRegionBlock.
+ std::string createUniqueName(const char *Prefix) { + std::string S; + raw_string_ostream RSO(S); + RSO << Prefix << NextOrdinal++; + return RSO.str(); + } + + /// Add a given \p Recipe as the last recipe of a given VPBasicBlock. + void appendRecipeToBasicBlock(VPRecipeBase *Recipe, VPBasicBlock *ToVPBB) { + assert(Recipe && "No recipe to append."); + assert(!Recipe->Parent && "Recipe already in VPlan"); + ToVPBB->addRecipe(Recipe); + } + + /// Create a new empty VPBasicBlock and return it. + VPBasicBlock *createBasicBlock() { + VPBasicBlock *BasicBlock = new VPBasicBlock(createUniqueName("BB")); + return BasicBlock; + } + + /// Create a new VPBasicBlock with a single \p Recipe and return it. + VPBasicBlock *createBasicBlock(VPRecipeBase *Recipe) { + VPBasicBlock *BasicBlock = new VPBasicBlock(createUniqueName("BB")); + appendRecipeToBasicBlock(Recipe, BasicBlock); + return BasicBlock; + } + + /// Create a new, empty VPRegionBlock, with no blocks. + VPRegionBlock *createRegion(bool IsReplicator) { + VPRegionBlock *Region = new VPRegionBlock(createUniqueName("region")); + setReplicator(Region, IsReplicator); + return Region; + } + + /// Set the entry VPBlockBase of a given VPRegionBlock to a given \p Block. + /// Block is to have no predecessors. + void setRegionEntry(VPRegionBlock *Region, VPBlockBase *Block) { + assert(Block->Predecessors.empty() && + "Entry block cannot have predecessors."); + Region->Entry = Block; + Block->Parent = Region; + } + + /// Set the exit VPBlockBase of a given VPRegionBlock to a given \p Block. + /// Block is to have no successors. + void setRegionExit(VPRegionBlock *Region, VPBlockBase *Block) { + assert(Block->Successors.empty() && "Exit block cannot have successors."); + Region->Exit = Block; + Block->Parent = Region; + } + + void setReplicator(VPRegionBlock *Region, bool ToReplicate) { + Region->setReplicator(ToReplicate); + } + + /// Sets a given VPBlockBase \p Successor as the single successor of another + /// VPBlockBase \p Block. The parent of \p Block is copied to be the parent of + /// \p Successor. + void setSuccessor(VPBlockBase *Block, VPBlockBase *Successor) { + assert(Block->getSuccessors().empty() && "Block successors already set."); + Block->appendSuccessor(Successor); + Successor->appendPredecessor(Block); + Successor->Parent = Block->Parent; + } + + /// Sets two given VPBlockBases \p IfTrue and \p IfFalse to be the two + /// successors of another VPBlockBase \p Block. A given + /// VPConditionBitRecipeBase provides the control selector. The parent of + /// \p Block is copied to be the parent of \p IfTrue and \p IfFalse. + void setTwoSuccessors(VPBlockBase *Block, VPConditionBitRecipeBase *R, + VPBlockBase *IfTrue, VPBlockBase *IfFalse) { + assert(Block->getSuccessors().empty() && "Block successors already set."); + Block->setConditionBitRecipe(R); + Block->appendSuccessor(IfTrue); + Block->appendSuccessor(IfFalse); + IfTrue->appendPredecessor(Block); + IfFalse->appendPredecessor(Block); + IfTrue->Parent = Block->Parent; + IfFalse->Parent = Block->Parent; + } + + /// Given two VPBlockBases \p From and \p To, disconnect them from each other. + void disconnectBlocks(VPBlockBase *From, VPBlockBase *To) { + From->removeSuccessor(To); + To->removePredecessor(From); + } +}; + +/// VPlanPrinter prints a given VPlan to a given output stream. The printing is +/// indented and follows the dot format. 
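+/// A typical use (illustrative):
+///   VPlanPrinter Printer(dbgs(), *Plan);
+///   Printer.dump("After initial construction");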
+class VPlanPrinter {
+private:
+  raw_ostream &OS;
+  const VPlan &Plan;
+  unsigned Depth;
+  unsigned TabLength = 2;
+  std::string Indent;
+
+  /// Handle indentation.
+  void buildIndent() { Indent = std::string(Depth * TabLength, ' '); }
+  void resetDepth() {
+    Depth = 1;
+    buildIndent();
+  }
+  void increaseDepth() {
+    ++Depth;
+    buildIndent();
+  }
+  void decreaseDepth() {
+    --Depth;
+    buildIndent();
+  }
+
+  /// Dump each element of VPlan.
+  void dumpBlock(const VPBlockBase *Block);
+  void dumpEdges(const VPBlockBase *Block);
+  void dumpBasicBlock(const VPBasicBlock *BasicBlock);
+  void dumpRegion(const VPRegionBlock *Region);
+
+  const char *getNodePrefix(const VPBlockBase *Block);
+  const std::string &getReplicatorString(const VPRegionBlock *Region);
+  void drawEdge(const VPBlockBase *From, const VPBlockBase *To, bool Hidden,
+                const Twine &Label);
+
+public:
+  VPlanPrinter(raw_ostream &O, const VPlan &P) : OS(O), Plan(P) {}
+  void dump(const std::string &Title = "");
+
+  static void printAsIngredient(raw_ostream &O, Value *V) {
+    auto *Inst = dyn_cast<Instruction>(V);
+    if (!Inst) {
+      V->printAsOperand(O, false);
+      return;
+    }
+    if (!Inst->getType()->isVoidTy()) {
+      Inst->printAsOperand(O, false);
+      O << " = ";
+    }
+    O << Inst->getOpcodeName() << " ";
+    Inst->getOperand(0)->printAsOperand(O, false);
+    for (unsigned I = 1, E = Inst->getNumOperands(); I < E; ++I) {
+      O << ", ";
+      Inst->getOperand(I)->printAsOperand(O, false);
+    }
+  }
+};
+
+//===--------------------------------------------------------------------===//
+// GraphTraits specializations for VPlan/VPRegionBlock Control-Flow Graphs  //
+//===--------------------------------------------------------------------===//
+
+// Provide specializations of GraphTraits to be able to treat a VPRegionBlock
+// as a graph of VPBlockBases...
+
+template <> struct GraphTraits<VPBlockBase *> {
+  typedef VPBlockBase *NodeRef;
+  typedef SmallVectorImpl<VPBlockBase *>::iterator ChildIteratorType;
+
+  static NodeRef getEntryNode(NodeRef N) { return N; }
+
+  static inline ChildIteratorType child_begin(NodeRef N) {
+    return N->getSuccessors().begin();
+  }
+
+  static inline ChildIteratorType child_end(NodeRef N) {
+    return N->getSuccessors().end();
+  }
+};
+
+template <> struct GraphTraits<const VPBlockBase *> {
+  typedef const VPBlockBase *NodeRef;
+  typedef SmallVectorImpl<VPBlockBase *>::const_iterator ChildIteratorType;
+
+  static NodeRef getEntryNode(NodeRef N) { return N; }
+
+  static inline ChildIteratorType child_begin(NodeRef N) {
+    return N->getSuccessors().begin();
+  }
+
+  static inline ChildIteratorType child_end(NodeRef N) {
+    return N->getSuccessors().end();
+  }
+};
+
+// Provide specializations of GraphTraits to be able to treat a VPRegionBlock
+// as a graph of VPBlockBases... and to walk it in inverse order. Inverse order
+// for a VPRegionBlock is considered to be when traversing the predecessor
+// edges of a VPBlockBase instead of the successor edges.
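+// This lets generic graph algorithms, such as depth_first() and
+// ReversePostOrderTraversal (both used below during code generation and
+// printing), operate directly on the VPlan CFG.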
+//
+
+template <> struct GraphTraits<Inverse<VPBlockBase *>> {
+  typedef VPBlockBase *NodeRef;
+  typedef SmallVectorImpl<VPBlockBase *>::iterator ChildIteratorType;
+
+  static NodeRef getEntryNode(Inverse<VPBlockBase *> B) { return B.Graph; }
+
+  static inline ChildIteratorType child_begin(NodeRef N) {
+    return N->getPredecessors().begin();
+  }
+
+  static inline ChildIteratorType child_end(NodeRef N) {
+    return N->getPredecessors().end();
+  }
+};
+
+} // namespace llvm
+
+#endif // LLVM_TRANSFORMS_VECTORIZE_VPLAN_H
Index: lib/Transforms/Vectorize/VPlan.cpp
===================================================================
--- /dev/null
+++ lib/Transforms/Vectorize/VPlan.cpp
@@ -0,0 +1,402 @@
+//===- VPlan.cpp - Vectorizer Plan ----------------------------------------===//
+//
+//                     The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+//
+// This is the LLVM vectorization plan. It represents a candidate for
+// vectorization, allowing the vectorizer to plan and optimize how to vectorize
+// a given loop before generating LLVM-IR.
+// The vectorizer uses vectorization plans to estimate the costs of potential
+// candidates and, if profitable, to execute the desired plan, generating
+// vector LLVM-IR code.
+//
+//===----------------------------------------------------------------------===//
+
+#include "VPlan.h"
+#include "llvm/ADT/PostOrderIterator.h"
+#include "llvm/Analysis/LoopInfo.h"
+#include "llvm/IR/BasicBlock.h"
+#include "llvm/IR/Dominators.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Support/GraphWriter.h"
+#include "llvm/Transforms/Utils/BasicBlockUtils.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "vplan"
+
+unsigned VPlanUtils::NextOrdinal = 1;
+
+VPOneByOneRecipeBase::VPOneByOneRecipeBase(unsigned char SC,
+                                           const BasicBlock::iterator B,
+                                           const BasicBlock::iterator E,
+                                           class VPlan *Plan)
+    : VPRecipeBase(SC), Begin(B), End(E) {
+  for (auto It = B; It != E; ++It)
+    Plan->setInst2Recipe(&*It, this);
+}
+
+/// \return the VPBasicBlock that is the entry of Block, possibly indirectly.
+const VPBasicBlock *VPBlockBase::getEntryBasicBlock() const {
+  const VPBlockBase *Block = this;
+  while (const VPRegionBlock *Region = dyn_cast<VPRegionBlock>(Block))
+    Block = Region->getEntry();
+  return cast<VPBasicBlock>(Block);
+}
+
+/// \return the VPBasicBlock that is the exit of Block, possibly indirectly.
+const VPBasicBlock *VPBlockBase::getExitBasicBlock() const {
+  const VPBlockBase *Block = this;
+  while (const VPRegionBlock *Region = dyn_cast<VPRegionBlock>(Block))
+    Block = Region->getExit();
+  return cast<VPBasicBlock>(Block);
+}
+
+VPBasicBlock *VPBlockBase::getExitBasicBlock() {
+  VPBlockBase *Block = this;
+  while (VPRegionBlock *Region = dyn_cast<VPRegionBlock>(Block))
+    Block = Region->getExit();
+  return cast<VPBasicBlock>(Block);
+}
+
+/// Returns the closest ancestor, starting from "this", which has successors.
+/// Returns the root ancestor if all ancestors have no successors.
+VPBlockBase *VPBlockBase::getAncestorWithSuccessors() {
+  if (!Successors.empty() || !Parent)
+    return this;
+  assert(Parent->getExit() == this &&
+         "Block w/o successors not the exit of its parent.");
+  return Parent->getAncestorWithSuccessors();
+}
+
+/// Returns the closest ancestor, starting from "this", which has predecessors.
+/// Returns the root ancestor if all ancestors have no predecessors.
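+/// Together with getAncestorWithSuccessors() above, this supports the
+/// "hierarchical" lookup of predecessors and successors across region
+/// boundaries that code generation relies on.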
+VPBlockBase *VPBlockBase::getAncestorWithPredecessors() {
+  if (!Predecessors.empty() || !Parent)
+    return this;
+  assert(Parent->getEntry() == this &&
+         "Block w/o predecessors not the entry of its parent.");
+  return Parent->getAncestorWithPredecessors();
+}
+
+void VPBlockBase::deleteCFG(VPBlockBase *Entry) {
+  SmallVector<VPBlockBase *, 8> Blocks;
+  for (VPBlockBase *Block : depth_first(Entry))
+    Blocks.push_back(Block);
+
+  for (VPBlockBase *Block : Blocks)
+    delete Block;
+}
+
+BasicBlock *
+VPBasicBlock::createEmptyBasicBlock(VPTransformState::CFGState &CFG) {
+  // BB stands for IR BasicBlocks. VPBB stands for VPlan VPBasicBlocks.
+  // Pred stands for Predecessor. Prev stands for Previous, last
+  // visited/created.
+  BasicBlock *PrevBB = CFG.PrevBB;
+  BasicBlock *NewBB = BasicBlock::Create(PrevBB->getContext(), "VPlannedBB",
+                                         PrevBB->getParent(), CFG.LastBB);
+  DEBUG(dbgs() << "LV: created " << NewBB->getName() << '\n');
+
+  // Hook up the new basic block to its predecessors.
+  for (VPBlockBase *PredVPBlock : getHierarchicalPredecessors()) {
+    VPBasicBlock *PredVPBB = PredVPBlock->getExitBasicBlock();
+    BasicBlock *PredBB = CFG.VPBB2IRBB[PredVPBB];
+    DEBUG(dbgs() << "LV: draw edge from " << PredBB->getName() << '\n');
+    if (isa<UnreachableInst>(PredBB->getTerminator())) {
+      PredBB->getTerminator()->eraseFromParent();
+      BranchInst::Create(NewBB, PredBB);
+    } else {
+      // Replace old unconditional branch with new conditional branch.
+      // Note: we rely on traversing the successors in order.
+      BasicBlock *FirstSuccBB = PredBB->getSingleSuccessor();
+      PredBB->getTerminator()->eraseFromParent();
+      Value *Bit = PredVPBlock->getConditionBitRecipe()->getConditionBit();
+      assert(Bit && "Cannot create conditional branch with empty bit.");
+      BranchInst::Create(FirstSuccBB, NewBB, Bit, PredBB);
+    }
+  }
+  return NewBB;
+}
+
+void VPBasicBlock::vectorize(VPTransformState *State) {
+  VPIterationInstance *I = State->Instance;
+  bool Replica = I && !(I->Part == 0 && I->Lane == 0);
+  VPBasicBlock *PrevVPBB = State->CFG.PrevVPBB;
+  VPBlockBase *SingleHPred = nullptr;
+  BasicBlock *NewBB = State->CFG.PrevBB; // Reuse it if possible.
+
+  // 1. Create an IR basic block, or reuse the last one if possible.
+  // The last IR basic block is reused in three cases:
+  // A. the first VPBB reuses the header BB - when PrevVPBB is null;
+  // B. when the current VPBB has a single (hierarchical) predecessor which
+  //    is PrevVPBB and the latter has a single (hierarchical) successor; and
+  // C. when the current VPBB is an entry of a region replica - where PrevVPBB
+  //    is the exit of this region from a previous instance.
+  if (PrevVPBB && /* A */
+      !((SingleHPred = getSingleHierarchicalPredecessor()) &&
+        SingleHPred->getExitBasicBlock() == PrevVPBB &&
+        PrevVPBB->getSingleHierarchicalSuccessor()) && /* B */
+      !(Replica && getPredecessors().empty())) {       /* C */
+
+    NewBB = createEmptyBasicBlock(State->CFG);
+    State->Builder.SetInsertPoint(NewBB);
+    // Temporarily terminate with unreachable until CFG is rewired.
+    UnreachableInst *Terminator = State->Builder.CreateUnreachable();
+    State->Builder.SetInsertPoint(Terminator);
+    // Register NewBB in its loop. In innermost loops it's the same for all
+    // BBs.
+    Loop *L = State->LI->getLoopFor(State->CFG.LastBB);
+    L->addBasicBlockToLoop(NewBB, *State->LI);
+    State->CFG.PrevBB = NewBB;
+  }
+
+  // 2. Fill the IR basic block with IR instructions.
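+  // Each recipe of this VPBasicBlock, in order, generates the IR instructions
+  // it prescribes at the current insertion point.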
+  DEBUG(dbgs() << "LV: vectorizing VPBB:" << getName()
+               << " in BB:" << NewBB->getName() << '\n');
+
+  State->CFG.VPBB2IRBB[this] = NewBB;
+  State->CFG.PrevVPBB = this;
+
+  for (VPRecipeBase &Recipe : Recipes)
+    Recipe.vectorize(*State);
+
+  DEBUG(dbgs() << "LV: filled BB:" << *NewBB);
+}
+
+void VPRegionBlock::vectorize(VPTransformState *State) {
+  ReversePostOrderTraversal<VPBlockBase *> RPOT(Entry);
+  typedef std::vector<VPBlockBase *>::reverse_iterator rpo_iterator;
+
+  if (!isReplicator()) {
+    // Visit the VPBlocks connected to \p this, starting from it.
+    for (rpo_iterator It = RPOT.begin(); It != RPOT.end(); ++It) {
+      DEBUG(dbgs() << "LV: VPBlock in RPO " << (*It)->getName() << '\n');
+      (*It)->vectorize(State);
+    }
+    return;
+  }
+
+  assert(!State->Instance &&
+         "Replicating a Region expects a null instance context.");
+  VPIterationInstance I;
+  State->Instance = &I;
+
+  for (I.Part = 0; I.Part < State->UF; ++I.Part)
+    for (I.Lane = 0; I.Lane < State->VF; ++I.Lane)
+      // Visit the VPBlocks connected to \p this, starting from it.
+      for (rpo_iterator It = RPOT.begin(); It != RPOT.end(); ++It) {
+        DEBUG(dbgs() << "LV: VPBlock in RPO " << (*It)->getName() << '\n');
+        (*It)->vectorize(State);
+      }
+
+  State->Instance = nullptr;
+}
+
+/// Generate the code inside the body of the vectorized loop. Assumes a single
+/// LoopVectorBody basic block was created for this; introduces additional
+/// basic blocks as needed, and fills them all.
+void VPlan::vectorize(VPTransformState *State) {
+  BasicBlock *VectorPreHeaderBB = State->CFG.PrevBB;
+  BasicBlock *VectorHeaderBB = VectorPreHeaderBB->getSingleSuccessor();
+  assert(VectorHeaderBB && "Loop preheader does not have a single successor.");
+  BasicBlock *VectorLatchBB = VectorHeaderBB;
+  auto CurrIP = State->Builder.saveIP();
+
+  // 1. Make room to generate basic blocks inside loop body if needed.
+  VectorLatchBB = VectorHeaderBB->splitBasicBlock(
+      VectorHeaderBB->getFirstInsertionPt(), "vector.body.latch");
+  Loop *L = State->LI->getLoopFor(VectorHeaderBB);
+  L->addBasicBlockToLoop(VectorLatchBB, *State->LI);
+  // Remove the edge between Header and Latch to allow other connections.
+  // Temporarily terminate with unreachable until CFG is rewired.
+  // Note: this asserts xform code's assumption that getFirstInsertionPt()
+  // can be dereferenced into an Instruction.
+  VectorHeaderBB->getTerminator()->eraseFromParent();
+  State->Builder.SetInsertPoint(VectorHeaderBB);
+  UnreachableInst *Terminator = State->Builder.CreateUnreachable();
+  State->Builder.SetInsertPoint(Terminator);
+
+  // 2. Generate code in loop body of vectorized version.
+  State->CFG.PrevVPBB = nullptr;
+  State->CFG.PrevBB = VectorHeaderBB;
+  State->CFG.LastBB = VectorLatchBB;
+
+  for (VPBlockBase *CurrentBlock = Entry; CurrentBlock != nullptr;
+       CurrentBlock = CurrentBlock->getSingleSuccessor()) {
+    assert(CurrentBlock->getSuccessors().size() <= 1 &&
+           "Multiple successors at top level.");
+    CurrentBlock->vectorize(State);
+  }
+
+  // 3. Merge the temporary latch created with the last basic block filled.
+  BasicBlock *LastBB = State->CFG.PrevBB;
+  // Connect LastBB to VectorLatchBB to facilitate their merge.
+  assert(isa<UnreachableInst>(LastBB->getTerminator()) &&
+         "Expected VPlan CFG to terminate with unreachable.");
+  LastBB->getTerminator()->eraseFromParent();
+  BranchInst::Create(VectorLatchBB, LastBB);
+
+  // Merge LastBB with Latch.
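+  // VectorLatchBB has LastBB as its single predecessor at this point, so the
+  // merge folds it into LastBB, which takes over the role of the latch.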
+  bool Merged = MergeBlockIntoPredecessor(VectorLatchBB, nullptr, State->LI);
+  assert(Merged && "Could not merge last basic block with latch.");
+  (void)Merged;
+  VectorLatchBB = LastBB;
+
+  updateDominatorTree(State->DT, VectorPreHeaderBB, VectorLatchBB);
+  State->Builder.restoreIP(CurrIP);
+}
+
+void VPlan::updateDominatorTree(DominatorTree *DT, BasicBlock *LoopPreHeaderBB,
+                                BasicBlock *LoopLatchBB) {
+  BasicBlock *LoopHeaderBB = LoopPreHeaderBB->getSingleSuccessor();
+  assert(LoopHeaderBB && "Loop preheader does not have a single successor.");
+  DT->addNewBlock(LoopHeaderBB, LoopPreHeaderBB);
+  // The vector body may be more than a single basic block by this point.
+  // Update the dominator tree information inside the vector body by
+  // propagating it from header to latch, expecting only triangular
+  // control-flow, if any.
+  BasicBlock *PostDomSucc = nullptr;
+  for (auto *BB = LoopHeaderBB; BB != LoopLatchBB; BB = PostDomSucc) {
+    // Get the list of successors of this block.
+    std::vector<BasicBlock *> Succs(succ_begin(BB), succ_end(BB));
+    assert(Succs.size() <= 2 &&
+           "Basic block in vector loop has more than 2 successors.");
+    PostDomSucc = Succs[0];
+    if (Succs.size() == 1) {
+      assert(PostDomSucc->getSinglePredecessor() &&
+             "PostDom successor has more than one predecessor.");
+      DT->addNewBlock(PostDomSucc, BB);
+      continue;
+    }
+    BasicBlock *InterimSucc = Succs[1];
+    if (PostDomSucc->getSingleSuccessor() == InterimSucc) {
+      PostDomSucc = Succs[1];
+      InterimSucc = Succs[0];
+    }
+    assert(InterimSucc->getSingleSuccessor() == PostDomSucc &&
+           "One successor of a basic block does not lead to the other.");
+    assert(InterimSucc->getSinglePredecessor() &&
+           "Interim successor has more than one predecessor.");
+    assert(std::distance(pred_begin(PostDomSucc), pred_end(PostDomSucc)) == 2 &&
+           "PostDom successor has more than two predecessors.");
+    DT->addNewBlock(InterimSucc, BB);
+    DT->addNewBlock(PostDomSucc, BB);
+  }
+}
+
+const char *VPlanPrinter::getNodePrefix(const VPBlockBase *Block) {
+  if (isa<VPBasicBlock>(Block))
+    return "";
+  assert(isa<VPRegionBlock>(Block) && "Unsupported kind of VPBlock.");
+  return "cluster_";
+}
+
+const std::string &
+VPlanPrinter::getReplicatorString(const VPRegionBlock *Region) {
+  static std::string ReplicatorString(DOT::EscapeString("<xVFxUF>"));
+  static std::string NonReplicatorString(DOT::EscapeString("<x1>"));
+  return Region->isReplicator() ? ReplicatorString : NonReplicatorString;
+}
+
+void VPlanPrinter::dump(const std::string &Title) {
+  resetDepth();
+  OS << "digraph VPlan {\n";
+  OS << "graph [labelloc=t, fontsize=30; label=\"Vectorization Plan";
+  if (!Title.empty())
+    OS << "\\n" << DOT::EscapeString(Title);
+  OS << "\"]\n";
+  OS << "node [shape=rect, fontname=Courier, fontsize=30]\n";
+  OS << "edge [fontname=Courier, fontsize=30]\n";
+  OS << "compound=true\n";
+
+  for (const VPBlockBase *CurrentBlock = Plan.getEntry();
+       CurrentBlock != nullptr;
+       CurrentBlock = CurrentBlock->getSingleSuccessor())
+    dumpBlock(CurrentBlock);
+
+  OS << "}\n";
+}
+
+void VPlanPrinter::dumpBlock(const VPBlockBase *Block) {
+  if (const VPBasicBlock *BasicBlock = dyn_cast<VPBasicBlock>(Block))
+    dumpBasicBlock(BasicBlock);
+  else if (const VPRegionBlock *Region = dyn_cast<VPRegionBlock>(Block))
+    dumpRegion(Region);
+  else
+    llvm_unreachable("Unsupported kind of VPBlock.");
+}
+
+/// Print the information related to a CFG edge between two VPBlockBases.
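+/// Edges between regions are clipped at the region (cluster) boundaries by
+/// emitting dot's ltail/lhead attributes, which take effect together with the
+/// compound=true setting emitted in dump() above.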
+void VPlanPrinter::drawEdge(const VPBlockBase *From, const VPBlockBase *To,
+                            bool Hidden, const Twine &Label) {
+  // Due to "dot" we print an edge between two regions as an edge between the
+  // exit basic block and the entry basic block of the respective regions.
+  const VPBlockBase *Tail = From->getExitBasicBlock();
+  const VPBlockBase *Head = To->getEntryBasicBlock();
+  OS << Indent << getNodePrefix(Tail) << DOT::EscapeString(Tail->getName())
+     << " -> " << getNodePrefix(Head) << DOT::EscapeString(Head->getName());
+  OS << " [ label=\"" << Label << '\"';
+  if (Tail != From)
+    OS << " ltail=" << getNodePrefix(From)
+       << DOT::EscapeString(From->getName());
+  if (Head != To)
+    OS << " lhead=" << getNodePrefix(To) << DOT::EscapeString(To->getName());
+  if (Hidden)
+    OS << "; splines=none";
+  OS << "]\n";
+}
+
+/// Print the information related to the CFG edges going out of a given
+/// \p Block.
+void VPlanPrinter::dumpEdges(const VPBlockBase *Block) {
+  std::string Cond = "";
+  if (auto *ConditionBitRecipe = Block->getConditionBitRecipe())
+    Cond = ConditionBitRecipe->getName().str();
+  unsigned SuccessorNumber = 1;
+  for (auto *Successor : Block->getSuccessors()) {
+    drawEdge(Block, Successor, false,
+             Twine() + (SuccessorNumber == 2 ? "!" : "") + Twine(Cond));
+    ++SuccessorNumber;
+  }
+}
+
+/// Print a VPBasicBlock, including its VPRecipes, followed by printing the
+/// edges going out of it.
+void VPlanPrinter::dumpBasicBlock(const VPBasicBlock *BasicBlock) {
+  OS << Indent << getNodePrefix(BasicBlock)
+     << DOT::EscapeString(BasicBlock->getName()) << " [label = \""
+     << DOT::EscapeString(BasicBlock->getName()) << "\n";
+
+  for (const VPRecipeBase &Recipe : BasicBlock->getRecipes()) {
+    std::string RecipeString;
+    raw_string_ostream RSO(RecipeString);
+    Recipe.print(RSO);
+    OS << DOT::EscapeString(RSO.str()) << "\\l";
+  }
+
+  OS << "\"]\n";
+  dumpEdges(BasicBlock);
+}
+
+/// Print a given \p Region of the VPlan.
+void VPlanPrinter::dumpRegion(const VPRegionBlock *Region) {
+  OS << Indent << "subgraph " << getNodePrefix(Region)
+     << DOT::EscapeString(Region->getName()) << " {\n";
+  increaseDepth();
+  OS << Indent << "fontname=Courier\n";
+  OS << Indent << "label=\"" << getReplicatorString(Region) << " "
+     << DOT::EscapeString(Region->getName()) << "\"\n\n";
+
+  // Dump the blocks of the region.
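+  // A depth-first traversal from the entry visits every block of the region;
+  // nested regions are dumped recursively through dumpBlock().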
+ assert(Region->getEntry() && "Region contains no inner blocks."); + + for (const VPBlockBase *Block : depth_first(Region->getEntry())) + dumpBlock(Block); + + decreaseDepth(); + OS << Indent << "}\n"; + dumpEdges(Region); +} Index: test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll =================================================================== --- test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll +++ test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll @@ -26,9 +26,9 @@ ; CHECK-NEXT: br i1 [[TMP3]], label %[[PRED_UDIV_IF:.*]], label %[[PRED_UDIV_CONTINUE:.*]] ; CHECK: [[PRED_UDIV_IF]]: ; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x i64> [[WIDE_LOAD]], i32 0 -; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x i64> [[WIDE_LOAD]], i32 0 -; CHECK-NEXT: [[TMP6:%.*]] = add nsw i64 [[TMP5]], %x -; CHECK-NEXT: [[TMP7:%.*]] = udiv i64 [[TMP4]], [[TMP6]] +; CHECK-NEXT: [[TMP5:%.*]] = add nsw i64 [[TMP4]], %x +; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x i64> [[WIDE_LOAD]], i32 0 +; CHECK-NEXT: [[TMP7:%.*]] = udiv i64 [[TMP6]], [[TMP5]] ; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x i64> undef, i64 [[TMP7]], i32 0 ; CHECK-NEXT: br label %[[PRED_UDIV_CONTINUE]] ; CHECK: [[PRED_UDIV_CONTINUE]]: @@ -37,9 +37,9 @@ ; CHECK-NEXT: br i1 [[TMP10]], label %[[PRED_UDIV_IF1:.*]], label %[[PRED_UDIV_CONTINUE2]] ; CHECK: [[PRED_UDIV_IF1]]: ; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i64> [[WIDE_LOAD]], i32 1 -; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x i64> [[WIDE_LOAD]], i32 1 -; CHECK-NEXT: [[TMP13:%.*]] = add nsw i64 [[TMP12]], %x -; CHECK-NEXT: [[TMP14:%.*]] = udiv i64 [[TMP11]], [[TMP13]] +; CHECK-NEXT: [[TMP12:%.*]] = add nsw i64 [[TMP11]], %x +; CHECK-NEXT: [[TMP13:%.*]] = extractelement <2 x i64> [[WIDE_LOAD]], i32 1 +; CHECK-NEXT: [[TMP14:%.*]] = udiv i64 [[TMP13]], [[TMP12]] ; CHECK-NEXT: [[TMP15:%.*]] = insertelement <2 x i64> [[TMP9]], i64 [[TMP14]], i32 1 ; CHECK-NEXT: br label %[[PRED_UDIV_CONTINUE2]] ; CHECK: [[PRED_UDIV_CONTINUE2]]: Index: test/Transforms/LoopVectorize/AArch64/predication_costs.ll =================================================================== --- test/Transforms/LoopVectorize/AArch64/predication_costs.ll +++ test/Transforms/LoopVectorize/AArch64/predication_costs.ll @@ -18,8 +18,8 @@ ; Cost of udiv: ; (udiv(2) + extractelement(6) + insertelement(3)) / 2 = 5 ; -; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3 ; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3 +; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3 ; define i32 @predicated_udiv(i32* %a, i32* %b, i1 %c, i64 %n) { entry: @@ -59,8 +59,8 @@ ; Cost of store: ; (store(4) + extractelement(3)) / 2 = 3 ; -; CHECK: Found an estimated cost of 3 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4 ; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4 +; CHECK: Found an estimated cost of 3 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4 ; define void @predicated_store(i32* %a, i1 %c, i32 %x, i64 %n) { entry: @@ -98,10 +98,10 @@ ; Cost of udiv: ; (udiv(2) + extractelement(3) + insertelement(3)) / 2 = 4 ; -; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp3 = add nsw i32 %tmp2, %x -; CHECK: Found an estimated cost of 4 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3 ; CHECK: Scalarizing: %tmp3 = add nsw i32 %tmp2, %x ; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3 +; CHECK: Found an 
estimated cost of 2 for VF 2 For instruction: %tmp3 = add nsw i32 %tmp2, %x +; CHECK: Found an estimated cost of 4 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3 ; define i32 @predicated_udiv_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) { entry: @@ -143,10 +143,10 @@ ; Cost of store: ; store(4) / 2 = 2 ; -; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp2 = add nsw i32 %tmp1, %x -; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4 ; CHECK: Scalarizing: %tmp2 = add nsw i32 %tmp1, %x ; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4 +; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp2 = add nsw i32 %tmp1, %x +; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4 ; define void @predicated_store_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) { entry: @@ -192,16 +192,16 @@ ; Cost of store: ; store(4) / 2 = 2 ; -; CHECK: Found an estimated cost of 1 for VF 2 For instruction: %tmp2 = add i32 %tmp1, %x -; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp3 = sdiv i32 %tmp1, %tmp2 -; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp3, %tmp2 -; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp5 = sub i32 %tmp4, %x -; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp5, i32* %tmp0, align 4 ; CHECK-NOT: Scalarizing: %tmp2 = add i32 %tmp1, %x ; CHECK: Scalarizing and predicating: %tmp3 = sdiv i32 %tmp1, %tmp2 ; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp3, %tmp2 ; CHECK: Scalarizing: %tmp5 = sub i32 %tmp4, %x ; CHECK: Scalarizing and predicating: store i32 %tmp5, i32* %tmp0, align 4 +; CHECK: Found an estimated cost of 1 for VF 2 For instruction: %tmp2 = add i32 %tmp1, %x +; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp3 = sdiv i32 %tmp1, %tmp2 +; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp3, %tmp2 +; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp5 = sub i32 %tmp4, %x +; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp5, i32* %tmp0, align 4 ; define void @predication_multi_context(i32* %a, i1 %c, i32 %x, i64 %n) { entry: Index: test/Transforms/LoopVectorize/float-induction.ll =================================================================== --- test/Transforms/LoopVectorize/float-induction.ll +++ test/Transforms/LoopVectorize/float-induction.ll @@ -304,8 +304,8 @@ ; VEC2_INTERL1_PRED_STORE-NEXT: [[TMP8:%.*]] = extractelement <2 x i1> [[TMP4]], i32 1 ; VEC2_INTERL1_PRED_STORE-NEXT: br i1 [[TMP8]], label %[[PRED_STORE_IF6:.*]], label %[[PRED_STORE_CONTINUE7]] ; VEC2_INTERL1_PRED_STORE: [[PRED_STORE_IF6]]: -; VEC2_INTERL1_PRED_STORE-NEXT: [[TMP9:%.*]] = fadd fast float [[TMP1]], 1.000000e+00 ; VEC2_INTERL1_PRED_STORE-NEXT: [[TMP10:%.*]] = or i64 [[INDEX]], 1 +; VEC2_INTERL1_PRED_STORE-NEXT: [[TMP9:%.*]] = fadd fast float [[TMP1]], 1.000000e+00 ; VEC2_INTERL1_PRED_STORE-NEXT: [[TMP11:%.*]] = getelementptr inbounds float, float* %A, i64 [[TMP10]] ; VEC2_INTERL1_PRED_STORE-NEXT: store float [[TMP9]], float* [[TMP11]], align 4 ; VEC2_INTERL1_PRED_STORE-NEXT: br label %[[PRED_STORE_CONTINUE7]] Index: test/Transforms/LoopVectorize/if-pred-non-void.ll =================================================================== --- test/Transforms/LoopVectorize/if-pred-non-void.ll +++ test/Transforms/LoopVectorize/if-pred-non-void.ll 
@@ -219,9 +219,9 @@ ; CHECK: br i1 {{.*}}, label %[[IF0:.+]], label %[[CONT0:.+]] ; CHECK: [[IF0]]: ; CHECK: %[[T00:.+]] = extractelement <2 x i32> %wide.load, i32 0 -; CHECK: %[[T01:.+]] = extractelement <2 x i32> %wide.load, i32 0 -; CHECK: %[[T02:.+]] = add nsw i32 %[[T01]], %x -; CHECK: %[[T03:.+]] = udiv i32 %[[T00]], %[[T02]] +; CHECK: %[[T01:.+]] = add nsw i32 %[[T00]], %x +; CHECK: %[[T02:.+]] = extractelement <2 x i32> %wide.load, i32 0 +; CHECK: %[[T03:.+]] = udiv i32 %[[T02]], %[[T01]] ; CHECK: %[[T04:.+]] = insertelement <2 x i32> undef, i32 %[[T03]], i32 0 ; CHECK: br label %[[CONT0]] ; CHECK: [[CONT0]]: @@ -229,9 +229,9 @@ ; CHECK: br i1 {{.*}}, label %[[IF1:.+]], label %[[CONT1:.+]] ; CHECK: [[IF1]]: ; CHECK: %[[T06:.+]] = extractelement <2 x i32> %wide.load, i32 1 -; CHECK: %[[T07:.+]] = extractelement <2 x i32> %wide.load, i32 1 -; CHECK: %[[T08:.+]] = add nsw i32 %[[T07]], %x -; CHECK: %[[T09:.+]] = udiv i32 %[[T06]], %[[T08]] +; CHECK: %[[T07:.+]] = add nsw i32 %[[T06]], %x +; CHECK: %[[T08:.+]] = extractelement <2 x i32> %wide.load, i32 1 +; CHECK: %[[T09:.+]] = udiv i32 %[[T08]], %[[T07]] ; CHECK: %[[T10:.+]] = insertelement <2 x i32> %[[T05]], i32 %[[T09]], i32 1 ; CHECK: br label %[[CONT1]] ; CHECK: [[CONT1]]: Index: test/Transforms/LoopVectorize/if-pred-stores.ll =================================================================== --- test/Transforms/LoopVectorize/if-pred-stores.ll +++ test/Transforms/LoopVectorize/if-pred-stores.ll @@ -31,9 +31,9 @@ ; VEC: br i1 %[[v16]], label %[[cond2:.+]], label %[[else2:.+]] ; ; VEC: [[cond2]]: +; VEC: %[[v1:.+]] = add i64 %index, 1 ; VEC: %[[v17:.+]] = extractelement <2 x i32> %wide.load, i32 1 ; VEC: %[[v9b:.+]] = add nsw i32 %[[v17]], 20 -; VEC: %[[v1:.+]] = add i64 %index, 1 ; VEC: %[[v4:.+]] = getelementptr inbounds i32, i32* %f, i64 %[[v1]] ; VEC: store i32 %[[v9b]], i32* %[[v4]], align 4 ; VEC: br label %[[else2:.+]] Index: test/Transforms/LoopVectorize/induction.ll =================================================================== --- test/Transforms/LoopVectorize/induction.ll +++ test/Transforms/LoopVectorize/induction.ll @@ -309,18 +309,18 @@ ; ; CHECK-LABEL: @scalarize_induction_variable_05( ; CHECK: vector.body: -; CHECK: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue2 ] +; CHECK: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue4 ] ; CHECK: %[[I0:.+]] = add i32 %index, 0 ; CHECK: getelementptr inbounds i32, i32* %a, i32 %[[I0]] ; CHECK: pred.udiv.if: ; CHECK: udiv i32 {{.*}}, %[[I0]] -; CHECK: pred.udiv.if1: +; CHECK: pred.udiv.if3: ; CHECK: %[[I1:.+]] = add i32 %index, 1 ; CHECK: udiv i32 {{.*}}, %[[I1]] ; ; UNROLL-NO_IC-LABEL: @scalarize_induction_variable_05( ; UNROLL-NO-IC: vector.body: -; UNROLL-NO-IC: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue11 ] +; UNROLL-NO-IC: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue13 ] ; UNROLL-NO-IC: %[[I0:.+]] = add i32 %index, 0 ; UNROLL-NO-IC: %[[I2:.+]] = add i32 %index, 2 ; UNROLL-NO-IC: getelementptr inbounds i32, i32* %a, i32 %[[I0]] @@ -330,26 +330,26 @@ ; UNROLL-NO-IC: pred.udiv.if6: ; UNROLL-NO-IC: %[[I1:.+]] = add i32 %index, 1 ; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I1]] -; UNROLL-NO-IC: pred.udiv.if8: +; UNROLL-NO-IC: pred.udiv.if9: ; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I2]] -; UNROLL-NO-IC: pred.udiv.if10: +; UNROLL-NO-IC: pred.udiv.if12: ; UNROLL-NO-IC: %[[I3:.+]] = add i32 %index, 3 ; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I3]] ; ; IND-LABEL: @scalarize_induction_variable_05( ; 
IND: vector.body: -; IND: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue2 ] +; IND: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue4 ] ; IND: %[[E0:.+]] = sext i32 %index to i64 ; IND: getelementptr inbounds i32, i32* %a, i64 %[[E0]] ; IND: pred.udiv.if: ; IND: udiv i32 {{.*}}, %index -; IND: pred.udiv.if1: +; IND: pred.udiv.if3: ; IND: %[[I1:.+]] = or i32 %index, 1 ; IND: udiv i32 {{.*}}, %[[I1]] ; ; UNROLL-LABEL: @scalarize_induction_variable_05( ; UNROLL: vector.body: -; UNROLL: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue11 ] +; UNROLL: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue13 ] ; UNROLL: %[[I2:.+]] = or i32 %index, 2 ; UNROLL: %[[E0:.+]] = sext i32 %index to i64 ; UNROLL: %[[G0:.+]] = getelementptr inbounds i32, i32* %a, i64 %[[E0]] @@ -359,9 +359,9 @@ ; UNROLL: pred.udiv.if6: ; UNROLL: %[[I1:.+]] = or i32 %index, 1 ; UNROLL: udiv i32 {{.*}}, %[[I1]] -; UNROLL: pred.udiv.if8: +; UNROLL: pred.udiv.if9: ; UNROLL: udiv i32 {{.*}}, %[[I2]] -; UNROLL: pred.udiv.if10: +; UNROLL: pred.udiv.if12: ; UNROLL: %[[I3:.+]] = or i32 %index, 3 ; UNROLL: udiv i32 {{.*}}, %[[I3]]