Index: docs/VectorizationPlan.rst =================================================================== --- /dev/null +++ docs/VectorizationPlan.rst @@ -0,0 +1,185 @@ +================== +Vectorization Plan +================== + +.. contents:: + :local: + +Abstract +======== +The vectorization transformation can be rather complicated, involving several +potential alternatives, especially for outer-loops [1]_ but also possibly for +innermost loops. These alternatives may have significant performance impact, +both positive and negative. A cost model is therefore employed to identify the +best alternative, including the alternative of avoiding any transformation +altogether. + +The Vectorization Plan is an explicit model for describing vectorization +candidates. It serves for both optimizing candidates including estimating their +cost reliably, and for performing their final translation into IR. This +facilitates dealing with multiple vectorization candidates. + +High-level Design +================= + +Vectorization Workflow +---------------------- +VPlan-based vectorization involves three major steps, taking a "scenario-based +approach" to vectorization planning: + +1. Legal Step: check if a loop can be legally vectorized; encode contraints and + artifacts if so. +2. Plan Step: + + a. Build initial VPlans following the constraints and decisions taken by + Legal Step 1, and compute their cost. + b. Apply optimizations to the VPlans, possibly forking additional VPlans. + Prune sub-optimal VPlans having relatively high cost. +3. Execute Step: materialize the best VPlan. Note that this is the only step + that modifies the IR. + +Design Guidelines +----------------- +In what follows, the term "input IR" refers to code that is fed into the +vectorizer whereas the term "output IR" refers to code that is generated by the +vectorizer. The output IR contains code that has been vectorized or "widened" +according to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed +according to an Unroll Factor (UF). +The design of VPlan follows several high-level guidelines: + +1. Analysis-like: building and manipulating VPlans must not modify the input IR. + In particular, if the best option is not to vectorize at all, the + vectorization process terminates before reaching Step 3, and compilation + should proceed as if VPlans had not been built. + +2. Align Cost & Execute: each VPlan must support both estimating the cost and + generating the output IR code, such that the cost estimation evaluates the + to-be-generated code reliably. + +3. Support vectorizing additional constructs: + + a. Outer-loop vectorization. In particular, VPlan must be able to model the + control-flow of the output IR which may include multiple basic-blocks and + nested loops. + b. SLP vectorization. + c. Combinations of the above, including nested vectorization: vectorizing + both an inner loop and an outer-loop at the same time (each with its own + VF and UF), mixed vectorization: vectorizing a loop with SLP patterns + inside [4]_, (re)vectorizing input IR containing vector code. + d. Function vectorization [2]_. + +4. Support multiple candidates efficiently. In particular, similar candidates + related to a range of possible VF's and UF's must be represented efficiently. + Potential versioning needs to be supported efficiently. + +5. Support vectorizing idioms, such as interleaved groups of strided loads or + stores. This is achieved by modeling a sequence of output instructions using + a "Recipe", which is responsible for computing its cost and generating its + code. + +6. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization + such regions may need to be, for example, predicated and linearized, or + replicated VF*UF times to handle scalarized and predicated instructions. + Innerloops are also modelled as SESE regions. + +Low-level Design +================ +The low-level design of VPlan comprises of the following classes. + +:LoopVectorizationPlanner: + A LoopVectorizationPlanner is designed to handle the vectorization of a loop + or a loop nest. It can construct, optimize and discard one or more VPlans, + each VPlan modelling a distinct way to vectorize the loop or the loop nest. + Once the best VPlan is determined, including the best VF and UF, this VPlan + drives the generation of output IR. + +:VPlan: + A model of a vectorized candidate for a given input IR loop or loop nest. This + candidate is represented using a Hierarchical CFG. VPlan supports estimating + the cost and driving the generation of the output IR code it represents. + +:Hierarchical CFG: + A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The + Hierarchical CFG data structure is similar to the Tile Tree [5]_, where + cross-Tile edges are lifted to connect Tiles instead of the original + basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms + Region and Block are used rather than Tile [5]_ to avoid confusion with loop + tiling. + +:VPBlockBase: + The building block of the Hierarchical CFG. A pure-virtual base-class of + VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical + control-flow relations with other VPBlocks. Note that in contrast to the IR + BasicBlock, a VPBlockBase models its control-flow successors and predecessors + directly, rather than through a Terminator branch or through predecessor + branches that "use" the VPBlockBase. + +:VPBasicBlock: + VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the + Hierarchical CFG. It represents a sequence of output IR instructions that will + appear consecutively in an output IR basic-block. The instructions of this + basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a + sequence of zero or more VPRecipes that model the cost and generation of the + output IR instructions. + +:VPRegionBlock: + VPRegionBlock is a subclass of VPBlockBase. It models a collection of + VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR + CFG. A VPRegionBlock may indicate that its contents are to be replicated a + constant number of times when output IR is generated, effectively representing + a loop with constant trip-count that will be completely unrolled. This is used + to support scalarized and predicated instructions with a single model for + multiple candidate VF's and UF's. + +:VPRecipeBase: + A pure-virtual base class modeling a sequence of one or more output IR + instructions, possibly based on one or more input IR instructions. These + input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe + may specify how its ingredients are to be transformed to produce the output IR + instructions; e.g., cloned once, replicated multiple times or widened + according to selected VF. + +:VPTransformState: + Stores information used for generating output IR, passed from + LoopVectorizationPlanner to its selected VPlan for execution, and used to pass + additional information down to VPBlocks and VPRecipes. + +:VPlanUtils: + A collection of methods for the construction and modification of VPlans. + +Related LLVM components +----------------------- +1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP + tree, where TSLP [3]_ adds Plan Step 2.b. + +2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by + Polly [7]_. + +References +---------- +.. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit + Nuzman and Ayal Zaks, PACT 2008. + +.. [2] "Proposal for function vectorization and loop vectorization with function + calls", Xinmin Tian, [`cfe-dev + `_]., + March 2, 2016. + See also `review `_. + +.. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios + Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015. + +.. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization + overhead", Hao Zhou and Jingling Xue, CGO 2016. + +.. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and + Brian Koblenz, PLDI 1991 + +.. [6] "Structural analysis: A new approach to flow analysis in optimizing + compilers", M. Sharir, Journal of Computer Languages, Jan. 1980 + +.. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma + thesis, 2011. + +.. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks, + European LLVM Developers' Meeting 2017. Index: docs/Vectorizers.rst =================================================================== --- docs/Vectorizers.rst +++ docs/Vectorizers.rst @@ -380,6 +380,18 @@ .. image:: linpack-pc.png +Internals +--------- + +.. toctree:: + :hidden: + + VectorizationPlan + +:doc:`VectorizationPlan` + The loop vectorizer is based on an abstract representation called Vectorization Plan. + This document describes its philosophy and design. + .. _slp-vectorizer: The SLP Vectorizer Index: lib/Transforms/Vectorize/CMakeLists.txt =================================================================== --- lib/Transforms/Vectorize/CMakeLists.txt +++ lib/Transforms/Vectorize/CMakeLists.txt @@ -4,6 +4,7 @@ LoopVectorize.cpp SLPVectorizer.cpp Vectorize.cpp + VPlan.cpp ADDITIONAL_HEADER_DIRS ${LLVM_MAIN_INCLUDE_DIR}/llvm/Transforms Index: lib/Transforms/Vectorize/LoopVectorize.cpp =================================================================== --- lib/Transforms/Vectorize/LoopVectorize.cpp +++ lib/Transforms/Vectorize/LoopVectorize.cpp @@ -47,6 +47,7 @@ //===----------------------------------------------------------------------===// #include "llvm/Transforms/Vectorize/LoopVectorize.h" +#include "VPlan.h" #include "llvm/ADT/DenseMap.h" #include "llvm/ADT/Hashing.h" #include "llvm/ADT/MapVector.h" @@ -98,6 +99,7 @@ #include "llvm/Transforms/Utils/LoopVersioning.h" #include "llvm/Transforms/Vectorize.h" #include +#include #include #include @@ -249,6 +251,10 @@ class LoopVectorizationLegality; class LoopVectorizationCostModel; class LoopVectorizationRequirements; +class VPInterleaveRecipe; +class VPReplicateRecipe; +class VPWidenIntOrFpInductionRecipe; +class VPWidenRecipe; /// Returns true if the given loop body has a cycle, excluding the loop /// itself. @@ -391,13 +397,15 @@ TripCount(nullptr), VectorTripCount(nullptr), Legal(LVL), Cost(CM), AddedSafetyChecks(false) {} - // Perform the actual loop widening (vectorization). - void vectorize() { - // Create a new empty loop. Unlink the old loop and connect the new one. - createEmptyLoop(); - // Widen each instruction in the old loop to a new one in the new loop. - vectorizeLoop(); - } + /// Create a new empty loop. Unlink the old loop and connect the new one. + /// Return the pre-header block of the new loop. + BasicBlock *createVectorizedLoopSkeleton(); + + /// Vectorize a single instruction within the innermost loop. + void vectorizeInstruction(Instruction &I); + + /// Fix the vectorized code, taking care of header phi's, live-outs, and more. + void fixVectorizedLoop(); // Return true if any runtime check is added. bool areSafetyChecksAdded() { return AddedSafetyChecks; } @@ -425,9 +433,6 @@ EdgeMaskCacheTy; typedef DenseMap BlockMaskCacheTy; - /// Create an empty loop, based on the loop ranges of the old loop. - void createEmptyLoop(); - /// Set up the values of the IVs correctly when exiting the vector loop. void fixupIVUsers(PHINode *OrigPhi, const InductionDescriptor &II, Value *CountRoundDown, Value *EndValue, @@ -436,9 +441,6 @@ /// Create a new induction variable inside L. PHINode *createInductionVariable(Loop *L, Value *Start, Value *End, Value *Step, Instruction *DL); - /// Copy and widen the instructions from the old loop. - virtual void vectorizeLoop(); - /// Handle all cross-iteration phis in the header. void fixCrossIterationPHIs(); @@ -460,47 +462,41 @@ /// the block that was created for it. void sinkScalarOperands(Instruction *PredInst); - /// Predicate conditional instructions that require predication on their - /// respective conditions. - void predicateInstructions(); - - /// Collect the instructions from the original loop that would be trivially - /// dead in the vectorized loop if generated. - void collectTriviallyDeadInstructions( - SmallPtrSetImpl &DeadInstructions); - /// Shrinks vector element sizes to the smallest bitwidth they can be legally /// represented as. void truncateToMinimalBitwidths(); +public: /// A helper function that computes the predicate of the block BB, assuming /// that the header block of the loop is set to True. It returns the *entry* /// mask for the block BB. VectorParts createBlockInMask(BasicBlock *BB); + +protected: /// A helper function that computes the predicate of the edge between SRC /// and DST. VectorParts createEdgeMask(BasicBlock *Src, BasicBlock *Dst); - /// A helper function to vectorize a single instruction within the innermost - /// loop. - void vectorizeInstruction(Instruction &I); - +public: /// Vectorize a single PHINode in a block. This method handles the induction /// variable canonicalization. It supports both VF = 1 for unrolled loops and /// arbitrary length vectors. void widenPHIInstruction(Instruction *PN, unsigned UF, unsigned VF); +protected: /// Insert the new loop to the loop hierarchy and pass manager /// and update the analysis passes. void updateAnalysis(); - /// This instruction is un-vectorizable. Implement it as a sequence - /// of scalars. If \p IfPredicateInstr is true we need to 'hide' each - /// scalarized instruction behind an if block predicated on the control - /// dependence of the instruction. - virtual void scalarizeInstruction(Instruction *Instr, - bool IfPredicateInstr = false); +public: + /// A helper function to scalarize a single Instruction in the innermost loop. + /// Generates a sequence of scalar instances for each lane between \p MinLane + /// and \p MaxLane, times each part between \p MinPart and \p MaxPart, + /// inclusive.. + void scalarizeInstruction(Instruction *Instr, const VPIteration &Instance, + bool IfPredicateInstr); +protected: /// Vectorize Load and Store instructions, virtual void vectorizeMemoryInstruction(Instruction *Instr); @@ -534,11 +530,13 @@ void createVectorIntOrFpInductionPHI(const InductionDescriptor &II, Value *Step, Instruction *EntryVal); +public: /// Widen an integer or floating-point induction variable \p IV. If \p Trunc /// is provided, the integer induction variable will first be truncated to /// the corresponding type. void widenIntOrFpInduction(PHINode *IV, TruncInst *Trunc = nullptr); +protected: /// Returns true if an instruction \p I should be scalarized instead of /// vectorized for the chosen vectorization factor. bool shouldScalarizeInstruction(Instruction *I) const; @@ -546,6 +544,7 @@ /// Returns true if we should generate a scalar version of \p IV. bool needsScalarInduction(Instruction *IV) const; +public: /// Return a constant reference to the VectorParts corresponding to \p V from /// the original loop. If the value has already been vectorized, the /// corresponding vector entry in VectorLoopValueMap is returned. If, @@ -557,14 +556,50 @@ const VectorParts &getVectorValue(Value *V); /// Return a value in the new loop corresponding to \p V from the original - /// loop at unroll index \p Part and vector index \p Lane. If the value has - /// been vectorized but not scalarized, the necessary extractelement + /// loop at unroll index Part and vector index Lane of \p Instance. If the + /// value has been vectorized but not scalarized, the necessary extractelement /// instruction will be generated. - Value *getScalarValue(Value *V, unsigned Part, unsigned Lane); + Value *getScalarValue(Value *V, const VPIteration &Instance); + + /// Set a value in the new loop corresponding to \p V from the original loop + /// at unroll index Part and vector index Lane of \p Instance. The scalar + /// parts for this value must already be initialized. + void setScalarValue(Value *V, const VPIteration &Instance, Value *Scalar) { + assert(VectorLoopValueMap.hasScalar(V) && + "Cannot set an uninitialized scalar value"); + VectorLoopValueMap.ScalarMapStorage[V][Instance.Part][Instance.Lane] = + Scalar; + } + + /// Return a value in the new loop corresponding to \p V from the original + /// loop at unroll index \p Part. If there isn't one, return a null pointer. + /// Note that the value returned may also be a null pointer if that specific + /// part has not been generated yet. + Value *getVectorValue(Value *V, unsigned Part) { + if (!VectorLoopValueMap.hasVector(V)) + return nullptr; + return VectorLoopValueMap.VectorMapStorage[V][Part]; + } + + /// Set a value in the new loop corresponding to \p V from the original + /// loop at unroll index \p Part. The vector parts for this value must already + /// be initialized. + void setVectorValue(Value *V, unsigned Part, Value *Vector) { + assert(VectorLoopValueMap.hasVector(V) && + "Cannot set an uninitialized vector value"); + VectorLoopValueMap.VectorMapStorage[V][Part] = Vector; + } + + /// Construct the vector value of a scalarized value \p V one lane at a time. + /// This method is for predicated instructions where we'd like the + /// insert-element instructions to reside in the predicated block to have + /// them execute only if needed. + void constructVectorValue(Value *V, const VPIteration &Instance); /// Try to vectorize the interleaved access group that \p Instr belongs to. void vectorizeInterleaveGroup(Instruction *Instr); +protected: /// Generate a shuffle sequence that will reverse the vector Vec. virtual Value *reverseVector(Value *Vec); @@ -681,6 +716,16 @@ return ScalarMapStorage[Key]; } + ScalarParts &getOrCreateScalar(Value *Key, unsigned Lanes) { + if (!hasScalar(Key)) { + ScalarParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) + Entry[Part].resize(Lanes); + ScalarMapStorage[Key] = Entry; + } + return ScalarMapStorage[Key]; + } + /// \return A reference to the vector map entry corresponding to \p Key. /// The key should already be in the map. This function should only be used /// when it's necessary to update values that have already been vectorized. @@ -697,8 +742,16 @@ /// getScalarValue from InnerLoopVectorizer. Until those functions can be /// moved inside ValueMap, we have to declare them as friends. friend const VectorParts &InnerLoopVectorizer::getVectorValue(Value *V); - friend Value *InnerLoopVectorizer::getScalarValue(Value *V, unsigned Part, - unsigned Lane); + friend Value *InnerLoopVectorizer::getScalarValue(Value *V, + const VPIteration &I); + friend Value *InnerLoopVectorizer::getVectorValue(Value *V, unsigned Part); + friend void InnerLoopVectorizer::setScalarValue(Value *V, + const VPIteration &I, + Value *Scalar); + friend void InnerLoopVectorizer::setVectorValue(Value *V, unsigned Part, + Value *Vector); + friend void InnerLoopVectorizer::constructVectorValue(Value *V, + const VPIteration &I); private: /// The unroll factor. Each entry in the vector map contains UF vector @@ -752,9 +805,11 @@ /// many different vector instructions. unsigned UF; +public: /// The builder that we use IRBuilder<> Builder; +protected: // --- Vectorization state --- /// The vector-loop preheader. @@ -783,9 +838,8 @@ /// vectorized and scalarized. ValueMap VectorLoopValueMap; - /// Store instructions that should be predicated, as a pair - /// - SmallVector, 4> PredicatedInstructions; + /// Store instructions that were predicated. + SmallVector PredicatedInstructions; EdgeMaskCacheTy EdgeMaskCache; BlockMaskCacheTy BlockMaskCache; /// Trip count of the original loop. @@ -820,7 +874,6 @@ UnrollFactor, LVL, CM) {} private: - void vectorizeMemoryInstruction(Instruction *Instr) override; Value *getBroadcastInstrs(Value *V) override; Value *getStepVector(Value *Val, int StartIdx, Value *Step, Instruction::BinaryOps Opcode = @@ -1839,6 +1892,7 @@ unsigned Width; // Vector width with best cost unsigned Cost; // Cost of the loop with that width }; + /// \return The most profitable vectorization factor and the cost of that VF. /// This method checks every power of two up to MaxVF. If UserVF is not ZERO /// then this vectorization factor will be selected if vectorization is @@ -1900,6 +1954,9 @@ /// \returns True if it is more profitable to scalarize instruction \p I for /// vectorization factor \p VF. bool isProfitableToScalarize(Instruction *I, unsigned VF) const { + // Unroller also calls this method, but does not collectInstsToScalarize. + if (VF == 1) + return true; auto Scalars = InstsToScalarize.find(VF); assert(Scalars != InstsToScalarize.end() && "VF not yet analyzed for scalarization profitability"); @@ -2110,10 +2167,12 @@ int computePredInstDiscount(Instruction *PredInst, ScalarCostsTy &ScalarCosts, unsigned VF); +public: /// Collects the instructions to scalarize for each predicated instruction in /// the loop. void collectInstsToScalarize(unsigned VF); +private: /// Collect the instructions that are uniform after vectorization. An /// instruction is uniform if we represent it with a single scalar value in /// the vectorized loop corresponding to each vector iteration. Examples of @@ -2132,6 +2191,7 @@ /// iteration of the original scalar loop. void collectLoopScalars(unsigned VF); +public: /// Collect Uniform and Scalar values for the given \p VF. /// The sets depend on CM decision for Load/Store instructions /// that may be vectorized as interleave, gather-scatter or scalarized. @@ -2144,6 +2204,7 @@ collectLoopScalars(VF); } +private: /// Keeps cost model vectorization decision and cost for instructions. /// Right now it is used for memory instructions only. typedef DenseMap, @@ -2183,19 +2244,139 @@ /// LoopVectorizationPlanner - drives the vectorization process after having /// passed Legality checks. +/// The planner builds and optimizes the Vectorization Plans which record the +/// decisions how to vectorize the given loop. In particular, represent the +/// control-flow of the vectorized version, the replication of instructions that +/// are to be scalarized, and interleave access groups. class LoopVectorizationPlanner { public: - LoopVectorizationPlanner(LoopVectorizationCostModel &CM) : CM(CM) {} - - ~LoopVectorizationPlanner() {} + LoopVectorizationPlanner(Loop *L, LoopInfo *LI, const TargetLibraryInfo *TLI, + const TargetTransformInfo *TTI, + LoopVectorizationLegality *Legal, + LoopVectorizationCostModel &CM) + : OrigLoop(L), LI(LI), TLI(TLI), TTI(TTI), Legal(Legal), CM(CM), + BestVF(0), BestUF(0) {} + + ~LoopVectorizationPlanner() { + while (!VPlans.empty()) { + VPlan *Plan = VPlans.back(); + VPlans.pop_back(); + delete Plan; + } + } /// Plan how to best vectorize, return the best VF and its cost. LoopVectorizationCostModel::VectorizationFactor plan(bool OptForSize, unsigned UserVF); + /// Finalize the best decision and dispose of all other VPlans. + void setBestPlan(unsigned VF, unsigned UF); + + /// Generate the IR code for the body of the vectorized loop according to the + /// best selected VPlan. + void executePlan(InnerLoopVectorizer &LB, DominatorTree *DT); + + void printPlans(raw_ostream &O) { + for (VPlan *Plan : VPlans) + O << *Plan; + } + +protected: + /// Collect the instructions from the original loop that would be trivially + /// dead in the vectorized loop if generated. + void collectTriviallyDeadInstructions( + SmallPtrSetImpl &DeadInstructions); + + /// A range of powers-of-2 vectorization factors with fixed start and + /// adjustable end. The range includes start and excludes end, e.g.,: + /// [1, 9) = {1, 2, 4, 8} + struct VFRange { + unsigned Start; // A power of 2. + unsigned &End; // Need not be a power of 2. If End <= Start range is empty. + }; + + /// Test a \p Predicate on a \p Range of VF's. Return the value of applying + /// \p Predicate on Range.Start, possibly decreasing Range.End such that the + /// returned value holds for the entire \p Range. + bool testVFRange(const std::function &Predicate, + VFRange Range); + + /// Build VPlans for power-of-2 VF's between \p MinVF and \p MaxVF inclusive, + /// according to the information gathered by Legal when it checked if it is + /// legal to vectorize the loop. + void buildVPlans(unsigned MinVF, unsigned MaxVF); + +private: + /// Check if \I belongs to an Interleave Group within the given VF \p Range, + /// \return true in the first returned value if so and false otherwise. + /// Build a new VPInterleaveGroup Recipe if \I is the primary member of an IG + /// for \p Range.Start, and provide it as the second returned value. + /// Note that if \I is an adjunct member of an IG for \p Range.Start, the + /// \return value is , as it is handled by another recipe. + /// \p Range.End may be decreased to ensure same decision from \p Range.Start + /// to \p Range.End. + std::pair tryToInterleaveMemory(Instruction *I, + VFRange Range); + + /// Check if an induction recipe should be constructed for \I within the given + /// VF \p Range. If so build and return it. If not, return null. \p Range.End + /// may be decreased to ensure same decision from \p Range.Start to + /// \p Range.End. + VPWidenIntOrFpInductionRecipe *tryToOptimizeInduction(Instruction *I, + VFRange Range); + + /// Check if \I can be widened within the given VF \p Range. If \I can be + /// widened for Range.Start, extend \p LastWidenRecipe to include \p I if + /// possible or else build a new VPWidenRecipe for it, and return the + /// VPWidenRecipe that includes \p I. If \p I cannot be widened for + /// Range.Start \return null. Range.End may be decreased to ensure same + /// decision from \p Range.Start to \p Range.End. + VPWidenRecipe *tryToWiden(Instruction *I, VPWidenRecipe *LastWidenRecipe, + VFRange Range); + + /// Build a VPReplicationRecipe for \p I and enclose it within a Region if it + /// is predicated. \return \p VPBB augmented with this new recipe if \p I is + /// not predicated, otherwise \return a new VPBasicBlock that succeeds the new + /// Region. Update the packing decision of predicated instructions if they + /// feed \p I. Range.End may be decreased to ensure same recipe behavior from + /// \p Range.Start to \p Range.End. + VPBasicBlock *handleReplication( + Instruction *I, VFRange Range, VPBasicBlock *VPBB, + DenseMap &PredInst2Recipe); + + /// Create a replicating region for instruction \p I that requires + /// predication. \p PredRecipe is a VPReplicateRecipe holding \p I. + VPRegionBlock *createReplicateRegion(Instruction *I, + VPRecipeBase *PredRecipe); + + /// Build a VPlan according to the information gathered by Legal. \return a + /// VPlan for vectorization factors \p Range.Start and up to \p Range.End + /// exclusive, possibly decreasing \p Range.End. + VPlan *buildVPlan(VFRange Range); + private: + /// The loop that we evaluate. + Loop *OrigLoop; + + /// Loop Info analysis. + LoopInfo *LI; + + /// Target Library Info. + const TargetLibraryInfo *TLI; + + /// Target Transform Info. + const TargetTransformInfo *TTI; + + /// The legality analysis. + LoopVectorizationLegality *Legal; + /// The profitablity analysis. LoopVectorizationCostModel &CM; + + SmallVector VPlans; + + unsigned BestVF; + unsigned BestUF; }; /// \brief This holds vectorization requirements that must be verified late in @@ -2651,6 +2832,41 @@ return LAI->isUniform(V); } +void InnerLoopVectorizer::constructVectorValue(Value *V, + const VPIteration &Instance) { + assert(V != Induction && "The new induction variable should not be used."); + assert(!V->getType()->isVectorTy() && "Can't widen a vector"); + assert(!V->getType()->isVoidTy() && "Type does not produce a value"); + + if (!VectorLoopValueMap.hasVector(V)) { + VectorParts Entry(UF); + for (unsigned P = 0; P < UF; ++P) + Entry[P] = nullptr; + VectorLoopValueMap.initVector(V, Entry); + } + + VectorParts &Parts = VectorLoopValueMap.VectorMapStorage[V]; + + assert(VectorLoopValueMap.hasScalar(V) && "Expected scalar values to exist"); + + auto *ScalarInst = cast(getScalarValue(V, Instance)); + + Value *VectorValue = nullptr; + + // If we're constructing lane 0, start from undef; otherwise, start from the + // last value created. + unsigned Part = Instance.Part; + unsigned Lane = Instance.Lane; + if (Lane == 0) + VectorValue = UndefValue::get(VectorType::get(V->getType(), VF)); + else + VectorValue = Parts[Part]; + + VectorValue = Builder.CreateInsertElement(VectorValue, ScalarInst, + Builder.getInt32(Lane)); + Parts[Part] = VectorValue; +} + const InnerLoopVectorizer::VectorParts & InnerLoopVectorizer::getVectorValue(Value *V) { assert(V != Induction && "The new induction variable should not be used."); @@ -2680,7 +2896,7 @@ // the vector map. if (VF == 1) { for (unsigned Part = 0; Part < UF; ++Part) - Entry[Part] = getScalarValue(V, Part, 0); + Entry[Part] = getScalarValue(V, {Part, 0}); return VectorLoopValueMap.initVector(V, Entry); } @@ -2689,7 +2905,7 @@ // of the last unroll iteration. Otherwise, the last instruction is the one // we created for the last vector lane of the last unroll iteration. unsigned LastLane = Cost->isUniformAfterVectorization(I, VF) ? 0 : VF - 1; - auto *LastInst = cast(getScalarValue(V, UF - 1, LastLane)); + auto *LastInst = cast(getScalarValue(V, {UF - 1, LastLane})); // Set the insert point after the last scalarized instruction. This ensures // the insertelement sequence will directly follow the scalar definitions. @@ -2706,12 +2922,12 @@ for (unsigned Part = 0; Part < UF; ++Part) { Value *VectorValue = nullptr; if (Cost->isUniformAfterVectorization(I, VF)) { - VectorValue = getBroadcastInstrs(getScalarValue(V, Part, 0)); + VectorValue = getBroadcastInstrs(getScalarValue(V, {Part, 0})); } else { VectorValue = UndefValue::get(VectorType::get(V->getType(), VF)); for (unsigned Lane = 0; Lane < VF; ++Lane) VectorValue = Builder.CreateInsertElement( - VectorValue, getScalarValue(V, Part, Lane), + VectorValue, getScalarValue(V, {Part, Lane}), Builder.getInt32(Lane)); } Entry[Part] = VectorValue; @@ -2726,14 +2942,15 @@ return VectorLoopValueMap.initVector(V, VectorParts(UF, B)); } -Value *InnerLoopVectorizer::getScalarValue(Value *V, unsigned Part, - unsigned Lane) { - +Value *InnerLoopVectorizer::getScalarValue(Value *V, + const VPIteration &Instance) { // If the value is not an instruction contained in the loop, it should // already be scalar. if (OrigLoop->isLoopInvariant(V)) return V; + unsigned Part = Instance.Part; + unsigned Lane = Instance.Lane; assert(Lane > 0 ? !Cost->isUniformAfterVectorization(cast(V), VF) : true && "Uniform values only have lane zero"); @@ -2830,7 +3047,7 @@ Index += (VF - 1) * Group->getFactor(); for (unsigned Part = 0; Part < UF; Part++) { - Value *NewPtr = getScalarValue(Ptr, Part, 0); + Value *NewPtr = getScalarValue(Ptr, {Part, 0}); // Notice current instruction could be any index. Need to adjust the address // to the member of index 0. @@ -2956,10 +3173,6 @@ Alignment = DL.getABITypeAlignment(ScalarDataTy); unsigned AddressSpace = getMemInstAddressSpace(Instr); - // Scalarize the memory instruction if necessary. - if (Decision == LoopVectorizationCostModel::CM_Scalarize) - return scalarizeInstruction(Instr, Legal->isScalarWithPredication(Instr)); - // Determine if the pointer operand of the access is either consecutive or // reverse consecutive. int ConsecutiveStride = Legal->isConsecutivePtr(Ptr); @@ -2971,7 +3184,7 @@ // Handle consecutive loads/stores. if (ConsecutiveStride) { - Ptr = getScalarValue(Ptr, 0, 0); + Ptr = getScalarValue(Ptr, {0, 0}); } else { // At this point we should vector version of GEP for Gather or Scatter assert(CreateGatherScatter && "The instruction should be scalarized"); @@ -3067,13 +3280,9 @@ } void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr, + const VPIteration &Instance, bool IfPredicateInstr) { assert(!Instr->getType()->isAggregateType() && "Can't handle vectors"); - DEBUG(dbgs() << "LV: Scalarizing" - << (IfPredicateInstr ? " and predicating:" : ":") << *Instr - << '\n'); - // Holds vector parameters or scalars, in case of uniform vals. - SmallVector Params; setDebugLocFromInst(Builder, Instr); @@ -3081,62 +3290,34 @@ bool IsVoidRetTy = Instr->getType()->isVoidTy(); // Initialize a new scalar map entry. - ScalarParts Entry(UF); + ScalarParts &Entry = VectorLoopValueMap.getOrCreateScalar(Instr, VF); - VectorParts Cond; - if (IfPredicateInstr) - Cond = createBlockInMask(Instr->getParent()); - - // Determine the number of scalars we need to generate for each unroll - // iteration. If the instruction is uniform, we only need to generate the - // first lane. Otherwise, we generate all VF values. - unsigned Lanes = Cost->isUniformAfterVectorization(Instr, VF) ? 1 : VF; - - // For each vector unroll 'part': - for (unsigned Part = 0; Part < UF; ++Part) { - Entry[Part].resize(VF); - // For each scalar that we create: - for (unsigned Lane = 0; Lane < Lanes; ++Lane) { + Instruction *Cloned = Instr->clone(); + if (!IsVoidRetTy) + Cloned->setName(Instr->getName() + ".cloned"); - // Start if-block. - Value *Cmp = nullptr; - if (IfPredicateInstr) { - Cmp = Cond[Part]; - if (Cmp->getType()->isVectorTy()) - Cmp = Builder.CreateExtractElement(Cmp, Builder.getInt32(Lane)); - Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Cmp, - ConstantInt::get(Cmp->getType(), 1)); - } - - Instruction *Cloned = Instr->clone(); - if (!IsVoidRetTy) - Cloned->setName(Instr->getName() + ".cloned"); - - // Replace the operands of the cloned instructions with their scalar - // equivalents in the new loop. - for (unsigned op = 0, e = Instr->getNumOperands(); op != e; ++op) { - auto *NewOp = getScalarValue(Instr->getOperand(op), Part, Lane); - Cloned->setOperand(op, NewOp); - } - addNewMetadata(Cloned, Instr); + // Replace the operands of the cloned instructions with their scalar + // equivalents in the new loop. + for (unsigned op = 0, e = Instr->getNumOperands(); op != e; ++op) { + auto *NewOp = getScalarValue(Instr->getOperand(op), Instance); + Cloned->setOperand(op, NewOp); + } + addNewMetadata(Cloned, Instr); - // Place the cloned scalar in the new loop. - Builder.Insert(Cloned); + // Place the cloned scalar in the new loop. + Builder.Insert(Cloned); - // Add the cloned scalar to the scalar map entry. - Entry[Part][Lane] = Cloned; + // Add the cloned scalar to the scalar map entry. + Entry[Instance.Part][Instance.Lane] = Cloned; - // If we just cloned a new assumption, add it the assumption cache. - if (auto *II = dyn_cast(Cloned)) - if (II->getIntrinsicID() == Intrinsic::assume) - AC->registerAssumption(II); + // If we just cloned a new assumption, add it the assumption cache. + if (auto *II = dyn_cast(Cloned)) + if (II->getIntrinsicID() == Intrinsic::assume) + AC->registerAssumption(II); - // End if-block. - if (IfPredicateInstr) - PredicatedInstructions.push_back(std::make_pair(Cloned, Cmp)); - } - } - VectorLoopValueMap.initScalar(Instr, Entry); + // End if-block. + if (IfPredicateInstr) + PredicatedInstructions.push_back(Cloned); } PHINode *InnerLoopVectorizer::createInductionVariable(Loop *L, Value *Start, @@ -3361,7 +3542,7 @@ LVer->prepareNoAliasMetadata(); } -void InnerLoopVectorizer::createEmptyLoop() { +BasicBlock *InnerLoopVectorizer::createVectorizedLoopSkeleton() { /* In this function we generate a new loop. The new loop will contain the vectorized instructions while the old loop will continue to run the @@ -3543,6 +3724,8 @@ LoopVectorizeHints Hints(Lp, true, *ORE); Hints.setAlreadyVectorized(); + + return LoopVectorPreHeader; } // Fix up external users of the induction variable. At this point, we are @@ -3879,7 +4062,8 @@ } } -void InnerLoopVectorizer::vectorizeLoop() { +void InnerLoopVectorizer::fixVectorizedLoop() { + //===------------------------------------------------===// // // Notice: any optimization or new instruction that go @@ -3888,27 +4072,6 @@ // //===------------------------------------------------===// - // Collect instructions from the original loop that will become trivially dead - // in the vectorized loop. We don't need to vectorize these instructions. For - // example, original induction update instructions can become dead because we - // separately emit induction "steps" when generating code for the new loop. - // Similarly, we create a new latch condition when setting up the structure - // of the new loop, so the old one can become dead. - SmallPtrSet DeadInstructions; - collectTriviallyDeadInstructions(DeadInstructions); - - // Scan the loop in a topological order to ensure that defs are vectorized - // before users. - LoopBlocksDFS DFS(OrigLoop); - DFS.perform(LI); - - // Vectorize all instructions in the original loop that will not become - // trivially dead when vectorized. - for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) - for (Instruction &I : *BB) - if (!DeadInstructions.count(&I)) - vectorizeInstruction(I); - // Insert truncates and extends for any truncated instructions as hints to // InstCombine. if (VF > 1) @@ -3937,7 +4100,8 @@ IVEndValues[Entry.first], LoopMiddleBlock); fixLCSSAPHIs(); - predicateInstructions(); + for (Instruction *PI : PredicatedInstructions) + sinkScalarOperands(&*PI); // Remove redundant induction instructions. cse(LoopVectorBody); @@ -4347,30 +4511,6 @@ } } -void InnerLoopVectorizer::collectTriviallyDeadInstructions( - SmallPtrSetImpl &DeadInstructions) { - BasicBlock *Latch = OrigLoop->getLoopLatch(); - - // We create new control-flow for the vectorized loop, so the original - // condition will be dead after vectorization if it's only used by the - // branch. - auto *Cmp = dyn_cast(Latch->getTerminator()->getOperand(0)); - if (Cmp && Cmp->hasOneUse()) - DeadInstructions.insert(Cmp); - - // We create new "steps" for induction variable updates to which the original - // induction variables map. An original update instruction will be dead if - // all its users except the induction variable are dead. - for (auto &Induction : *Legal->getInductionVars()) { - PHINode *Ind = Induction.first; - auto *IndUpdate = cast(Ind->getIncomingValueForBlock(Latch)); - if (all_of(IndUpdate->users(), [&](User *U) -> bool { - return U == Ind || DeadInstructions.count(cast(U)); - })) - DeadInstructions.insert(IndUpdate); - } -} - void InnerLoopVectorizer::sinkScalarOperands(Instruction *PredInst) { // The basic block and loop containing the predicated instruction. @@ -4437,126 +4577,6 @@ } while (Changed); } -void InnerLoopVectorizer::predicateInstructions() { - - // For each instruction I marked for predication on value C, split I into its - // own basic block to form an if-then construct over C. Since I may be fed by - // an extractelement instruction or other scalar operand, we try to - // iteratively sink its scalar operands into the predicated block. If I feeds - // an insertelement instruction, we try to move this instruction into the - // predicated block as well. For non-void types, a phi node will be created - // for the resulting value (either vector or scalar). - // - // So for some predicated instruction, e.g. the conditional sdiv in: - // - // for.body: - // ... - // %add = add nsw i32 %mul, %0 - // %cmp5 = icmp sgt i32 %2, 7 - // br i1 %cmp5, label %if.then, label %if.end - // - // if.then: - // %div = sdiv i32 %0, %1 - // br label %if.end - // - // if.end: - // %x.0 = phi i32 [ %div, %if.then ], [ %add, %for.body ] - // - // the sdiv at this point is scalarized and if-converted using a select. - // The inactive elements in the vector are not used, but the predicated - // instruction is still executed for all vector elements, essentially: - // - // vector.body: - // ... - // %17 = add nsw <2 x i32> %16, %wide.load - // %29 = extractelement <2 x i32> %wide.load, i32 0 - // %30 = extractelement <2 x i32> %wide.load51, i32 0 - // %31 = sdiv i32 %29, %30 - // %32 = insertelement <2 x i32> undef, i32 %31, i32 0 - // %35 = extractelement <2 x i32> %wide.load, i32 1 - // %36 = extractelement <2 x i32> %wide.load51, i32 1 - // %37 = sdiv i32 %35, %36 - // %38 = insertelement <2 x i32> %32, i32 %37, i32 1 - // %predphi = select <2 x i1> %26, <2 x i32> %38, <2 x i32> %17 - // - // Predication will now re-introduce the original control flow to avoid false - // side-effects by the sdiv instructions on the inactive elements, yielding - // (after cleanup): - // - // vector.body: - // ... - // %5 = add nsw <2 x i32> %4, %wide.load - // %8 = icmp sgt <2 x i32> %wide.load52, - // %9 = extractelement <2 x i1> %8, i32 0 - // br i1 %9, label %pred.sdiv.if, label %pred.sdiv.continue - // - // pred.sdiv.if: - // %10 = extractelement <2 x i32> %wide.load, i32 0 - // %11 = extractelement <2 x i32> %wide.load51, i32 0 - // %12 = sdiv i32 %10, %11 - // %13 = insertelement <2 x i32> undef, i32 %12, i32 0 - // br label %pred.sdiv.continue - // - // pred.sdiv.continue: - // %14 = phi <2 x i32> [ undef, %vector.body ], [ %13, %pred.sdiv.if ] - // %15 = extractelement <2 x i1> %8, i32 1 - // br i1 %15, label %pred.sdiv.if54, label %pred.sdiv.continue55 - // - // pred.sdiv.if54: - // %16 = extractelement <2 x i32> %wide.load, i32 1 - // %17 = extractelement <2 x i32> %wide.load51, i32 1 - // %18 = sdiv i32 %16, %17 - // %19 = insertelement <2 x i32> %14, i32 %18, i32 1 - // br label %pred.sdiv.continue55 - // - // pred.sdiv.continue55: - // %20 = phi <2 x i32> [ %14, %pred.sdiv.continue ], [ %19, %pred.sdiv.if54 ] - // %predphi = select <2 x i1> %8, <2 x i32> %20, <2 x i32> %5 - - for (auto KV : PredicatedInstructions) { - BasicBlock::iterator I(KV.first); - BasicBlock *Head = I->getParent(); - auto *T = SplitBlockAndInsertIfThen(KV.second, &*I, /*Unreachable=*/false, - /*BranchWeights=*/nullptr, DT, LI); - I->moveBefore(T); - sinkScalarOperands(&*I); - - BasicBlock *PredicatedBlock = I->getParent(); - Twine BBNamePrefix = Twine("pred.") + I->getOpcodeName(); - PredicatedBlock->setName(BBNamePrefix + ".if"); - PredicatedBlock->getSingleSuccessor()->setName(BBNamePrefix + ".continue"); - - // If the instruction is non-void create a Phi node at reconvergence point. - if (!I->getType()->isVoidTy()) { - Value *IncomingTrue = nullptr; - Value *IncomingFalse = nullptr; - - if (I->hasOneUse() && isa(*I->user_begin())) { - // If the predicated instruction is feeding an insert-element, move it - // into the Then block; Phi node will be created for the vector. - InsertElementInst *IEI = cast(*I->user_begin()); - IEI->moveBefore(T); - IncomingTrue = IEI; // the new vector with the inserted element. - IncomingFalse = IEI->getOperand(0); // the unmodified vector - } else { - // Phi node will be created for the scalar predicated instruction. - IncomingTrue = &*I; - IncomingFalse = UndefValue::get(I->getType()); - } - - BasicBlock *PostDom = I->getParent()->getSingleSuccessor(); - assert(PostDom && "Then block has multiple successors"); - PHINode *Phi = - PHINode::Create(IncomingTrue->getType(), 2, "", &PostDom->front()); - IncomingTrue->replaceAllUsesWith(Phi); - Phi->addIncoming(IncomingFalse, Head); - Phi->addIncoming(IncomingTrue, I->getParent()); - } - } - - DEBUG(DT->verifyDomTree()); -} - InnerLoopVectorizer::VectorParts InnerLoopVectorizer::createEdgeMask(BasicBlock *Src, BasicBlock *Dst) { assert(is_contained(predecessors(Dst), Src) && "Invalid edge"); @@ -4695,7 +4715,7 @@ llvm_unreachable("Unknown induction"); case InductionDescriptor::IK_IntInduction: case InductionDescriptor::IK_FpInduction: - return widenIntOrFpInduction(P); + llvm_unreachable("Integer/fp induction is handled elsewhere."); case InductionDescriptor::IK_PtrInduction: { // Handle the pointer induction variable case. assert(P->getType()->isPointerTy() && "Unexpected type."); @@ -4744,24 +4764,10 @@ } void InnerLoopVectorizer::vectorizeInstruction(Instruction &I) { - // Scalarize instructions that should remain scalar after vectorization. - if (VF > 1 && - !(isa(&I) || isa(&I) || isa(&I)) && - shouldScalarizeInstruction(&I)) { - scalarizeInstruction(&I, Legal->isScalarWithPredication(&I)); - return; - } - switch (I.getOpcode()) { case Instruction::Br: - // Nothing to do for PHIs and BR, since we already took care of the - // loop control flow instructions. - break; - case Instruction::PHI: { - // Vectorize PHINodes. - widenPHIInstruction(&I, UF, VF); - break; - } // End of PHI. + case Instruction::PHI: + llvm_unreachable("This instruction is handled by a different recipe."); case Instruction::GetElementPtr: { // Construct a vector GEP by widening the operands of the scalar GEP as // necessary. We mark the vector GEP 'inbounds' if appropriate. A GEP @@ -4832,12 +4838,6 @@ case Instruction::SDiv: case Instruction::SRem: case Instruction::URem: - // Scalarize with predication if this instruction may divide by zero and - // block execution is conditional, otherwise fallthrough. - if (Legal->isScalarWithPredication(&I)) { - scalarizeInstruction(&I, true); - break; - } case Instruction::Add: case Instruction::FAdd: case Instruction::Sub: @@ -4890,7 +4890,7 @@ const VectorParts &Op0 = getVectorValue(I.getOperand(1)); const VectorParts &Op1 = getVectorValue(I.getOperand(2)); - auto *ScalarCond = getScalarValue(I.getOperand(0), 0, 0); + auto *ScalarCond = getScalarValue(I.getOperand(0), {0, 0}); VectorParts Entry(UF); for (unsigned Part = 0; Part < UF; ++Part) { @@ -4947,16 +4947,6 @@ auto *CI = dyn_cast(&I); setDebugLocFromInst(Builder, CI); - // Optimize the special case where the source is a constant integer - // induction variable. Notice that we can only optimize the 'trunc' case - // because (a) FP conversions lose precision, (b) sext/zext may wrap, and - // (c) other casts depend on pointer size. - if (Cost->isOptimizableIVTruncate(CI, VF)) { - widenIntOrFpInduction(cast(CI->getOperand(0)), - cast(CI)); - break; - } - /// Vectorize casts. Type *DestTy = (VF == 1) ? CI->getType() : VectorType::get(CI->getType(), VF); @@ -4987,23 +4977,10 @@ Tys.push_back(ToVectorTy(ArgOperand->getType(), VF)); Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); - if (ID && (ID == Intrinsic::assume || ID == Intrinsic::lifetime_end || - ID == Intrinsic::lifetime_start)) { - scalarizeInstruction(&I); - break; - } - // The flag shows whether we use Intrinsic or a usual Call for vectorized - // version of the instruction. - // Is it beneficial to perform intrinsic call compared to lib call? bool NeedToScalarize; unsigned CallCost = getVectorCallCost(CI, VF, *TTI, TLI, NeedToScalarize); bool UseVectorIntrinsic = ID && getVectorIntrinsicCost(CI, VF, *TTI, TLI) <= CallCost; - if (!UseVectorIntrinsic && NeedToScalarize) { - scalarizeInstruction(&I); - break; - } - VectorParts Entry(UF); for (unsigned Part = 0; Part < UF; ++Part) { SmallVector Args; @@ -5056,9 +5033,9 @@ } default: - // All other instructions are unsupported. Scalarize them. - scalarizeInstruction(&I); - break; + // All other instructions are scalarized. + DEBUG(dbgs() << "LV: Found an unhandled instruction: " << I); + llvm_unreachable("Unhandled instruction!"); } // end of switch. } @@ -5070,14 +5047,11 @@ assert(DT->properlyDominates(LoopBypassBlocks.front(), LoopExitBlock) && "Entry does not dominate exit."); - DT->addNewBlock(LI->getLoopFor(LoopVectorBody)->getHeader(), - LoopVectorPreHeader); DT->addNewBlock(LoopMiddleBlock, LI->getLoopFor(LoopVectorBody)->getLoopLatch()); DT->addNewBlock(LoopScalarPreHeader, LoopBypassBlocks[0]); DT->changeImmediateDominator(LoopScalarBody, LoopScalarPreHeader); DT->changeImmediateDominator(LoopExitBlock, LoopBypassBlocks[0]); - DEBUG(DT->verifyDomTree()); } @@ -6970,13 +6944,6 @@ LoopVectorizationCostModel::expectedCost(unsigned VF) { VectorizationCostTy Cost; - // Collect Uniform and Scalar instructions after vectorization with VF. - collectUniformsAndScalars(VF); - - // Collect the instructions (and their associated costs) that will be more - // profitable to scalarize. - collectInstsToScalarize(VF); - // For each block. for (BasicBlock *BB : TheLoop->blocks()) { VectorizationCostTy BlockCost; @@ -7545,41 +7512,6 @@ } } -LoopVectorizationCostModel::VectorizationFactor -LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF) { - - // Width 1 means no vectorize, cost 0 means uncomputed cost. - const LoopVectorizationCostModel::VectorizationFactor NoVectorization = {1U, - 0U}; - Optional MaybeMaxVF = CM.computeMaxVF(OptForSize); - if (!MaybeMaxVF.hasValue()) // Cases considered too costly to vectorize. - return NoVectorization; - - if (UserVF) { - DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n"); - assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two"); - // Collect the instructions (and their associated costs) that will be more - // profitable to scalarize. - CM.selectUserVectorizationFactor(UserVF); - return {UserVF, 0}; - } - - unsigned MaxVF = MaybeMaxVF.getValue(); - assert(MaxVF != 0 && "MaxVF is zero."); - if (MaxVF == 1) - return NoVectorization; - - // Select the optimal vectorization factor. - return CM.selectVectorizationFactor(MaxVF); -} - -void InnerLoopUnroller::vectorizeMemoryInstruction(Instruction *Instr) { - auto *SI = dyn_cast(Instr); - bool IfPredicateInstr = (SI && Legal->blockNeedsPredication(SI->getParent())); - - return scalarizeInstruction(Instr, IfPredicateInstr); -} - Value *InnerLoopUnroller::reverseVector(Value *Vec) { return Vec; } Value *InnerLoopUnroller::getBroadcastInstrs(Value *V) { return V; } @@ -7635,6 +7567,772 @@ } } +namespace { +/// VPWidenRecipe is a recipe for producing a copy of vector type for each +/// Instruction in its ingredients independently, in order. This recipe covers +/// most of the traditional vectorization cases where each ingredient transforms +/// into a vectorized version of itself. +class VPWidenRecipe : public VPRecipeBase { +private: + /// Hold the ingredients by pointing to their original BasicBlock location. + BasicBlock::iterator Begin; + BasicBlock::iterator End; + +public: + VPWidenRecipe(Instruction *I) : VPRecipeBase(VPWidenSC) { + End = I->getIterator(); + Begin = End++; + } + + ~VPWidenRecipe() {} + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPWidenSC; + } + + /// Produce widened copies of all Ingredients. + void execute(VPTransformState &State) override { + for (auto &Instr : make_range(Begin, End)) + State.ILV->vectorizeInstruction(Instr); + } + + /// Augment the recipe to include Instr, if it lies at its End. + bool appendInstruction(Instruction *Instr) { + if (End != Instr->getIterator()) + return false; + End++; + return true; + } + + /// Print the recipe. + void print(raw_ostream &O, const Twine &Indent) const override { + O << " +\n" << Indent << "\"WIDEN\\l\""; + for (auto &Instr : make_range(Begin, End)) + O << " +\n" << Indent << "\" " << VPlanIngredient(&Instr) << "\\l\""; + } +}; + +/// A recipe for handling phi nodes of integer and floating-point inductions, +/// producing their vector and scalar values. +class VPWidenIntOrFpInductionRecipe : public VPRecipeBase { +private: + PHINode *IV; + TruncInst *Trunc; + +public: + VPWidenIntOrFpInductionRecipe(PHINode *IV, TruncInst *Trunc = nullptr) + : VPRecipeBase(VPWidenIntOrFpInductionSC), IV(IV), Trunc(Trunc) {} + + ~VPWidenIntOrFpInductionRecipe() {} + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPWidenIntOrFpInductionSC; + } + + /// Generate the vectorized and scalarized versions of the phi node as + /// needed by their users. + void execute(VPTransformState &State) override { + assert(!State.Instance && "Int or FP induction being replicated."); + State.ILV->widenIntOrFpInduction(IV, Trunc); + } + + /// Print the recipe. + void print(raw_ostream &O, const Twine &Indent) const override { + O << " +\n" << Indent << "\"WIDEN-INDUCTION"; + if (Trunc) { + O << "\\l\""; + O << " +\n" << Indent << "\" " << VPlanIngredient(IV) << "\\l\""; + O << " +\n" << Indent << "\" " << VPlanIngredient(Trunc) << "\\l\""; + } else + O << " " << VPlanIngredient(IV) << "\\l\""; + } +}; + +/// A recipe for handling all phi nodes except for integer and FP inductions. +class VPWidenPHIRecipe : public VPRecipeBase { +private: + PHINode *Phi; + +public: + VPWidenPHIRecipe(PHINode *Phi) : VPRecipeBase(VPWidenPHISC), Phi(Phi) {} + + ~VPWidenPHIRecipe() {} + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPWidenPHISC; + } + + /// Generate the phi/select nodes. + void execute(VPTransformState &State) override { + State.ILV->widenPHIInstruction(Phi, State.UF, State.VF); + } + + /// Print the recipe. + void print(raw_ostream &O, const Twine &Indent) const override { + O << " +\n" << Indent << "\"WIDEN-PHI " << VPlanIngredient(Phi) << "\\l\""; + } +}; + +/// VPInterleaveRecipe is a recipe for transforming an interleave group of load +/// or stores into one wide load/store and shuffles. +class VPInterleaveRecipe : public VPRecipeBase { +private: + const InterleaveGroup *IG; + +public: + VPInterleaveRecipe(const InterleaveGroup *IG) + : VPRecipeBase(VPInterleaveSC), IG(IG) {} + + ~VPInterleaveRecipe() {} + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPInterleaveSC; + } + + /// Generate the wide load or store, and shuffles. + void execute(VPTransformState &State) override { + assert(!State.Instance && "Interleave group being replicated."); + State.ILV->vectorizeInterleaveGroup(IG->getInsertPos()); + } + + /// Print the recipe. + void print(raw_ostream &O, const Twine &Indent) const override; + + const InterleaveGroup *getInterleaveGroup() { return IG; } +}; + +/// VPReplicateRecipe replicates a given instruction producing multiple scalar +/// copies of the original scalar type, one per lane, instead of producing a +/// single copy of widened type for all lanes. If the instruction is known to be +/// uniform only one copy, per lane zero, will be generated. +class VPReplicateRecipe : public VPRecipeBase { +private: + /// The instruction being replicated. + Instruction *Ingredient; + + /// Indicator if only a single replica per lane is needed. + bool IsUniform; + + /// Indicator if the replicas are also predicated. + bool IsPredicated; + + /// Indicator if the scalar values should also be packed into a vector. + bool AlsoPack; + +public: + VPReplicateRecipe(Instruction *I, bool IsUniform, bool IsPredicated = false) + : VPRecipeBase(VPReplicateSC), Ingredient(I), IsUniform(IsUniform), + IsPredicated(IsPredicated) { + // Retain the previous behavior of predicateInstructions(), where an + // insert-element of a predicated instruction got hoisted into the + // predicated basic block iff it was its only user. This is achieved by + // having predicated instructions also pack their values into a vector by + // default unless they have a replicated user which uses their scalar value. + AlsoPack = IsPredicated && !I->use_empty(); + } + + ~VPReplicateRecipe() {} + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPReplicateSC; + } + + /// Generate replicas of the desired Ingredient. Replicas will be generated + /// for all parts and lanes unless a specific part and lane are specified in + /// the \p State. + void execute(VPTransformState &State) override; + + void setAlsoPack(bool Pack) { AlsoPack = Pack; } + + /// Print the recipe. + void print(raw_ostream &O, const Twine &Indent) const override { + O << " +\n" << Indent << "\"" << (IsUniform ? "CLONE " : "REPLICATE ") + << VPlanIngredient(Ingredient); + if (AlsoPack) + O << " (S->V)"; + O << "\\l\""; + } +}; + +/// A recipe for generating conditional branches on the bits of a mask. +class VPBranchOnMaskRecipe : public VPRecipeBase { +private: + /// The input IR basic block used to obtain the mask providing the condition + /// bits for the branch. + BasicBlock *MaskedBasicBlock; + +public: + VPBranchOnMaskRecipe(BasicBlock *BB) + : VPRecipeBase(VPBranchOnMaskSC), MaskedBasicBlock(BB) {} + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPBranchOnMaskSC; + } + + /// Generate the extraction of the appropriate bit from the block mask and the + /// conditional branch. + void execute(VPTransformState &State) override; + + /// Print the recipe. + void print(raw_ostream &O, const Twine &Indent) const override { + O << " +\n" << Indent << "\"BRANCH-ON-MASK-OF " + << MaskedBasicBlock->getName() << "\\l\""; + } +}; + +/// VPPredInstPHIRecipe is a recipe for generating the phi nodes needed when +/// control converges back from a Branch-on-Mask. The phi nodes are needed in +/// order to merge values that are set under such a branch and feed their uses. +/// The phi nodes can be scalar or vector depending on the users of the value. +/// This recipe works in concert with VPBranchOnMaskRecipe. +class VPPredInstPHIRecipe : public VPRecipeBase { +private: + Instruction *PredInst; + +public: + /// Construct a VPPredInstPHIRecipe given \p PredInst whose value needs a phi + /// nodes after merging back from a Branch-on-Mask. + VPPredInstPHIRecipe(Instruction *PredInst) + : VPRecipeBase(VPPredInstPHISC), PredInst(PredInst) {} + + ~VPPredInstPHIRecipe() {} + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPPredInstPHISC; + } + + /// Generates phi nodes for live-outs as needed to retain SSA form. + void execute(VPTransformState &State) override; + + /// Print the recipe. + void print(raw_ostream &O, const Twine &Indent) const override { + O << " +\n" << Indent << "\"PHI-PREDICATED-INSTRUCTION " + << VPlanIngredient(PredInst) << "\\l\""; + } +}; +} // end anonymous namespace + +LoopVectorizationCostModel::VectorizationFactor +LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF) { + + // Width 1 means no vectorize, cost 0 means uncomputed cost. + const LoopVectorizationCostModel::VectorizationFactor NoVectorization = {1U, + 0U}; + Optional MaybeMaxVF = CM.computeMaxVF(OptForSize); + if (!MaybeMaxVF.hasValue()) // Cases considered too costly to vectorize. + return NoVectorization; + + if (UserVF) { + DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n"); + assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two"); + // Collect the instructions (and their associated costs) that will be more + // profitable to scalarize. + CM.selectUserVectorizationFactor(UserVF); + buildVPlans(UserVF, UserVF); + DEBUG(printPlans(dbgs())); + return {UserVF, 0}; + } + + unsigned MaxVF = MaybeMaxVF.getValue(); + assert(MaxVF != 0 && "MaxVF is zero."); + + for (unsigned VF = 1; VF <= MaxVF; VF *= 2) { + // Collect Uniform and Scalar instructions after vectorization with VF. + CM.collectUniformsAndScalars(VF); + + // Collect the instructions (and their associated costs) that will be more + // profitable to scalarize. + if (VF > 1) + CM.collectInstsToScalarize(VF); + } + + buildVPlans(1, MaxVF); + DEBUG(printPlans(dbgs())); + if (MaxVF == 1) + return NoVectorization; + + // Select the optimal vectorization factor. + return CM.selectVectorizationFactor(MaxVF); +} + +void LoopVectorizationPlanner::setBestPlan(unsigned VF, unsigned UF) { + DEBUG(dbgs() << "Setting best plan to VF=" << VF << ", UF=" << UF << '\n'); + BestVF = VF; + BestUF = UF; + + for (auto *VPlanIter = VPlans.begin(); VPlanIter != VPlans.end();) { + VPlan *Plan = *VPlanIter; + if (Plan->getVFs().count(VF)) + ++VPlanIter; + else { + VPlanIter = VPlans.erase(VPlanIter); + delete Plan; + } + } + assert(VPlans.size() == 1 && "Best VF has not a single VPlan."); +} + +void LoopVectorizationPlanner::executePlan(InnerLoopVectorizer &ILV, + DominatorTree *DT) { + // Perform the actual loop transformation. + + // 1. Create a new empty loop. Unlink the old loop and connect the new one. + VPTransformState State{BestVF, BestUF, LI, DT, ILV.Builder, &ILV}; + State.CFG.PrevBB = ILV.createVectorizedLoopSkeleton(); + + // 2. Copy and widen instructions from the old loop into the new loop. + assert(VPlans.size() == 1 && "Not a single VPlan to execute."); + VPlan *Plan = *VPlans.begin(); + Plan->execute(&State); + + // 3. Fix the vectorized code: take care of header phi's, live-outs, + // predication, updating analyses. + ILV.fixVectorizedLoop(); +} + +void LoopVectorizationPlanner::collectTriviallyDeadInstructions( + SmallPtrSetImpl &DeadInstructions) { + BasicBlock *Latch = OrigLoop->getLoopLatch(); + + // We create new control-flow for the vectorized loop, so the original + // condition will be dead after vectorization if it's only used by the + // branch. + auto *Cmp = dyn_cast(Latch->getTerminator()->getOperand(0)); + if (Cmp && Cmp->hasOneUse()) + DeadInstructions.insert(Cmp); + + // We create new "steps" for induction variable updates to which the original + // induction variables map. An original update instruction will be dead if + // all its users except the induction variable are dead. + for (auto &Induction : *Legal->getInductionVars()) { + PHINode *Ind = Induction.first; + auto *IndUpdate = cast(Ind->getIncomingValueForBlock(Latch)); + if (all_of(IndUpdate->users(), [&](User *U) -> bool { + return U == Ind || DeadInstructions.count(cast(U)); + })) + DeadInstructions.insert(IndUpdate); + } +} + +bool LoopVectorizationPlanner::testVFRange( + const std::function &Predicate, VFRange Range) { + assert(Range.End > Range.Start && "Trying to test an empty VF range."); + bool PredicateAtRangeStart = Predicate(Range.Start); + + for (unsigned TmpVF = Range.Start * 2; TmpVF < Range.End; TmpVF *= 2) + if (Predicate(TmpVF) != PredicateAtRangeStart) { + Range.End = TmpVF; + break; + } + + return PredicateAtRangeStart; +} + +void LoopVectorizationPlanner::buildVPlans(unsigned MinVF, unsigned MaxVF) { + unsigned RangeEnd = MaxVF + 1; + VFRange Range = {MinVF, RangeEnd}; + while (Range.Start < Range.End) { + VPlan *Plan = buildVPlan(Range); + VPlans.push_back(Plan); + std::string PlanName; + raw_string_ostream RSO(PlanName); + unsigned VF = Range.Start; + Plan->addVF(VF); + RSO << "Initial VPlan for VF={" << VF; + for (VF *= 2; VF < Range.End; VF *= 2) { + Plan->addVF(VF); + RSO << "," << VF; + } + RSO << "},UF>=1"; + RSO.flush(); + Plan->setName(PlanName); + + Range.Start = Range.End; + Range.End = MaxVF + 1; + } +} + +std::pair +LoopVectorizationPlanner::tryToInterleaveMemory(Instruction *I, VFRange Range) { + const InterleaveGroup *IG = Legal->getInterleavedAccessGroup(I); + if (!IG) + return std::make_pair(false, nullptr); + + // Now check if IG is relevant for VF's in the given range. + auto isIGMember = [&](Instruction *I) -> std::function { + return [=](unsigned VF) -> bool { + return (VF >= 2 && // Query is illegal for VF == 1 + CM.getWideningDecision(I, VF) == + LoopVectorizationCostModel::CM_Interleave); + }; + }; + if (!testVFRange(isIGMember(I), Range)) + return std::make_pair(false, nullptr); + + // I is a member of an InterleaveGroup for VF's in the (possibly trimmed) + // range. If it's the primary member of the IG construct a VPInterleaveRecipe. + // Otherwise, it's an adjunct member of the IG, do not construct any Recipe. + if (I != IG->getInsertPos()) + return std::make_pair(true, nullptr); + + return std::make_pair(true, + new VPInterleaveRecipe(IG)); +} + +VPWidenIntOrFpInductionRecipe * +LoopVectorizationPlanner::tryToOptimizeInduction(Instruction *I, + VFRange Range) { + if (PHINode *Phi = dyn_cast(I)) { + // Check if this is an integer or fp induction. If so, build the recipe that + // produces its scalar and vector values. + InductionDescriptor II = Legal->getInductionVars()->lookup(Phi); + if (II.getKind() == InductionDescriptor::IK_IntInduction || + II.getKind() == InductionDescriptor::IK_FpInduction) + return new VPWidenIntOrFpInductionRecipe(Phi); + + return nullptr; + } + + // Optimize the special case where the source is a constant integer + // induction variable. Notice that we can only optimize the 'trunc' case + // because (a) FP conversions lose precision, (b) sext/zext may wrap, and + // (c) other casts depend on pointer size. + + // Determine whether \p K is a truncation based on an induction variable that + // can be optimized. + auto isOptimizableIVTruncate = + [&](Instruction *K) -> std::function { + return + [=](unsigned VF) -> bool { return CM.isOptimizableIVTruncate(K, VF); }; + }; + + if (isa(I) && testVFRange(isOptimizableIVTruncate(I), Range)) + return new VPWidenIntOrFpInductionRecipe(cast(I->getOperand(0)), + cast(I)); + return nullptr; +} + +VPWidenRecipe *LoopVectorizationPlanner::tryToWiden( + Instruction *I, VPWidenRecipe *LastWidenRecipe, VFRange Range) { + + if (Legal->isScalarWithPredication(I)) + return nullptr; + + static DenseSet VectorizableOpcodes = { + Instruction::Br, Instruction::PHI, Instruction::GetElementPtr, + Instruction::UDiv, Instruction::SDiv, Instruction::SRem, + Instruction::URem, Instruction::Add, Instruction::FAdd, + Instruction::Sub, Instruction::FSub, Instruction::Mul, + Instruction::FMul, Instruction::FDiv, Instruction::FRem, + Instruction::Shl, Instruction::LShr, Instruction::AShr, + Instruction::And, Instruction::Or, Instruction::Xor, + Instruction::Select, Instruction::ICmp, Instruction::FCmp, + Instruction::Store, Instruction::Load, Instruction::ZExt, + Instruction::SExt, Instruction::FPToUI, Instruction::FPToSI, + Instruction::FPExt, Instruction::PtrToInt, Instruction::IntToPtr, + Instruction::SIToFP, Instruction::UIToFP, Instruction::Trunc, + Instruction::FPTrunc, Instruction::BitCast, Instruction::Call}; + + if (!VectorizableOpcodes.count(I->getOpcode())) + return nullptr; + + if (CallInst *CI = dyn_cast(I)) { + Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); + if (ID && (ID == Intrinsic::assume || ID == Intrinsic::lifetime_end || + ID == Intrinsic::lifetime_start)) + return nullptr; + } + + auto willWiden = [&](unsigned VF) -> bool { + if (!isa(I) && (CM.isScalarAfterVectorization(I, VF) || + CM.isProfitableToScalarize(I, VF))) + return false; + if (CallInst *CI = dyn_cast(I)) { + Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); + // The following case may be scalarized depending on the VF. + // The flag shows whether we use Intrinsic or a usual Call for vectorized + // version of the instruction. + // Is it beneficial to perform intrinsic call compared to lib call? + bool NeedToScalarize; + unsigned CallCost = getVectorCallCost(CI, VF, *TTI, TLI, NeedToScalarize); + bool UseVectorIntrinsic = + ID && getVectorIntrinsicCost(CI, VF, *TTI, TLI) <= CallCost; + return UseVectorIntrinsic || !NeedToScalarize; + } + if (isa(I) || isa(I)) { + LoopVectorizationCostModel::InstWidening Decision = + CM.getWideningDecision(I, VF); + assert(Decision != LoopVectorizationCostModel::CM_Unknown && + "CM decision should be taken at this point."); + assert(Decision != LoopVectorizationCostModel::CM_Interleave && + "Interleave memory opportunity should be caught earlier."); + return Decision != LoopVectorizationCostModel::CM_Scalarize; + } + return true; + }; + + if (!testVFRange(willWiden, Range)) + return nullptr; + + // Success: widen this instruction. We optimize the common case where + // consecutive instructions can be represented by a single recipe. + if (LastWidenRecipe && LastWidenRecipe->appendInstruction(I)) + return LastWidenRecipe; + return new VPWidenRecipe(I); +} + +VPBasicBlock *LoopVectorizationPlanner::handleReplication( + Instruction *I, VFRange Range, VPBasicBlock *VPBB, + DenseMap &PredInst2Recipe) { + + bool IsUniform = testVFRange( + [&](unsigned VF) { return CM.isUniformAfterVectorization(I, VF); }, + Range); + + bool IsPredicated = Legal->isScalarWithPredication(I); + auto *Recipe = new VPReplicateRecipe(I, IsUniform, IsPredicated); + + // Find if I uses a predicated instruction. If so, it will use its scalar + // value. Avoid hoisting the insert-element which packs the scalar value into + // a vector value, as that happens iff all users use the vector value. + for (auto &Op : I->operands()) + if (auto *PredInst = dyn_cast(Op)) + if (PredInst2Recipe.find(PredInst) != PredInst2Recipe.end()) + PredInst2Recipe[PredInst]->setAlsoPack(false); + + // Finalize the recipe for Instr, first if it is not predicated. + if (!IsPredicated) { + DEBUG(dbgs() << "LV: Scalarizing:" << *I << "\n"); + VPBB->appendRecipe(Recipe); + return VPBB; + } + DEBUG(dbgs() << "LV: Scalarizing and predicating:" << *I << "\n"); + assert(VPBB->getSuccessors().empty() && + "VPBB has successors when handling predicated replication."); + // Record predicated instructions for above packing optimizations. + PredInst2Recipe[I] = Recipe; + VPBlockBase *Region = VPBB->setOneSuccessor(createReplicateRegion(I, Recipe)); + return cast(Region->setOneSuccessor(new VPBasicBlock())); +} + +VPRegionBlock * +LoopVectorizationPlanner::createReplicateRegion(Instruction *Instr, + VPRecipeBase *PredRecipe) { + // Instructions marked for predication are replicated and placed under an + // if-then construct to prevent side-effects. + + // Build the triangular if-then region. + std::string RegionName = (Twine("pred.") + Instr->getOpcodeName()).str(); + assert(Instr->getParent() && "Predicated instruction not in any basic block"); + auto *BOMRecipe = new VPBranchOnMaskRecipe(Instr->getParent()); + auto *Entry = new VPBasicBlock(Twine(RegionName) + ".entry", BOMRecipe); + auto *PHIRecipe = + Instr->getType()->isVoidTy() ? nullptr : new VPPredInstPHIRecipe(Instr); + auto *Exit = new VPBasicBlock(Twine(RegionName) + ".continue", PHIRecipe); + auto *Pred = new VPBasicBlock(Twine(RegionName) + ".if", PredRecipe); + VPRegionBlock *Region = new VPRegionBlock(Entry, Exit, RegionName, true); + + // Note: first set Entry as region entry and then connect successors starting + // from it in order, to propagate the "parent" of each VPBasicBlock. + Entry->setTwoSuccessors(Pred, Exit); + Pred->setOneSuccessor(Exit); + + return Region; +} + +VPlan *LoopVectorizationPlanner::buildVPlan(VFRange Range) { + + // Collect instructions from the original loop that will become trivially dead + // in the vectorized loop. We don't need to vectorize these instructions. For + // example, original induction update instructions can become dead because we + // separately emit induction "steps" when generating code for the new loop. + // Similarly, we create a new latch condition when setting up the structure + // of the new loop, so the old one can become dead. + SmallPtrSet DeadInstructions; + collectTriviallyDeadInstructions(DeadInstructions); + + // Hold a mapping from predicated instructions to their recipes, in order to + // fix their AlsoPack behavior if a user is determined to replicate and use a + // scalar instead of vector value. + DenseMap PredInst2Recipe; + + // Create a dummy pre-entry VPBasicBlock to start building the VPlan. + VPBasicBlock *VPBB = new VPBasicBlock("Pre-Entry"); + VPlan *Plan = new VPlan(VPBB); + + // Scan the body of the loop in a topological order to visit each basic block + // after having visited its predecessor basic blocks. + LoopBlocksDFS DFS(OrigLoop); + DFS.perform(LI); + + for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) { + // Relevent instructions from basic block BB will be grouped into VPRecipe + // ingredients and fill a new VPBasicBlock. + unsigned VPBBsForBB = 0; + auto *FirstVPBBForBB = new VPBasicBlock(BB->getName()); + VPBB->setOneSuccessor(FirstVPBBForBB); + VPBB = FirstVPBBForBB; + VPWidenRecipe *LastWidenRecipe = nullptr; + + for (Instruction &I : *BB) { + Instruction *Instr = &I; + VPRecipeBase *Recipe = nullptr; + + // Filter out irrelevant instructions. + if (isa(Instr) || isa(Instr) || + DeadInstructions.count(Instr)) + continue; + + // Check if Instr should belong to an interleave memory recipe, or already + // does. + auto InterleaveResult = tryToInterleaveMemory(Instr, Range); + if (InterleaveResult.first) { + if (InterleaveResult.second) + VPBB->appendRecipe(InterleaveResult.second); + continue; + } + + // Check if Instr should form some PHI recipe. + if ((Recipe = tryToOptimizeInduction(Instr, Range))) { + VPBB->appendRecipe(Recipe); + continue; + } + if (PHINode *Phi = dyn_cast(Instr)) { + VPBB->appendRecipe(new VPWidenPHIRecipe(Phi)); + continue; + } + + // Check if Instr is to be widened. + if ((Recipe = tryToWiden(Instr, LastWidenRecipe, Range))) { + if (Recipe != LastWidenRecipe) + VPBB->appendRecipe(Recipe); + LastWidenRecipe = cast(Recipe); + continue; + } + + // Instruction is to be replicated. This may create a successor for VPBB. + VPBasicBlock *NextVPBB = + handleReplication(Instr, Range, VPBB, PredInst2Recipe); + if (NextVPBB != VPBB) { + VPBB = NextVPBB; + VPBB->setName(BB->hasName() ? BB->getName() + "." + Twine(VPBBsForBB++) + : ""); + } + } + } + + // Discard empty dummy pre-entry VPBasicBlock. Note that other VPBasicBlocks + // may also be empty, such as the last one VPBB, reflecting original + // basic-blocks with no recipes. + VPBasicBlock *PreEntry = cast(Plan->getEntry()); + assert(PreEntry->empty() && "Expecting empty pre-entry block."); + VPBlockBase *Entry = Plan->setEntry(PreEntry->getSingleSuccessor()); + PreEntry->disconnectSuccessor(Entry); + delete PreEntry; + + return Plan; +} + +void VPInterleaveRecipe::print(raw_ostream &O, const Twine &Indent) const { + O << " +\n" << Indent << "\"INTERLEAVE-GROUP with factor " << IG->getFactor() + << " at "; + IG->getInsertPos()->printAsOperand(O, false); + O << "\\l\""; + for (unsigned i = 0; i < IG->getFactor(); ++i) + if (Instruction *I = IG->getMember(i)) + O << " +\n" << Indent << "\" " << VPlanIngredient(I) << " " << i + << "\\l\""; +} + +void VPReplicateRecipe::execute(VPTransformState &State) { + + if (State.Instance) { // Generate a single instance. + State.ILV->scalarizeInstruction(Ingredient, *State.Instance, IsPredicated); + if (AlsoPack) // Insert scalar instance packing it into a vector. + State.ILV->constructVectorValue(Ingredient, *State.Instance); + return; + } + + // Generate scalar instances for all VF lanes of all UF parts, unless the + // instruction is uniform inwhich case generate only the first lane for each + // of the UF parts. + unsigned EndLane = IsUniform ? 1 : State.VF; + for (unsigned Part = 0; Part < State.UF; ++Part) + for (unsigned Lane = 0; Lane < EndLane; ++Lane) + State.ILV->scalarizeInstruction(Ingredient, {Part, Lane}, IsPredicated); + + // Broadcast or group together all instances into a vector. + if (AlsoPack) + State.ILV->getVectorValue(Ingredient); +} + +void VPBranchOnMaskRecipe::execute(VPTransformState &State) { + assert(State.Instance && "Branch on Mask works only on single instance."); + + unsigned Part = State.Instance->Part; + unsigned Lane = State.Instance->Lane; + + auto Cond = State.ILV->createBlockInMask(MaskedBasicBlock); + + Value *ConditionBit = Cond[Part]; + if (ConditionBit->getType()->isVectorTy()) + ConditionBit = State.Builder.CreateExtractElement( + ConditionBit, State.ILV->Builder.getInt32(Lane)); + // TODO: Checking if a bit equals 1 is redundant, but LITs do look for it. + ConditionBit = + State.Builder.CreateICmp(ICmpInst::ICMP_EQ, ConditionBit, + ConstantInt::get(ConditionBit->getType(), 1)); + + // Replace the temporary unreachable terminator with a new conditional branch, + // whose two destinations will be set later when they are created. + auto *CurrentTerminator = State.CFG.PrevBB->getTerminator(); + assert(isa(CurrentTerminator) && + "Expected to replace unreachable terminator with conditional branch."); + auto *CondBr = BranchInst::Create(State.CFG.PrevBB, nullptr, ConditionBit); + CondBr->setSuccessor(0, nullptr); + ReplaceInstWithInst(CurrentTerminator, CondBr); + + DEBUG(dbgs() << "\nLV: vectorizing BranchOnMask recipe " + << MaskedBasicBlock->getName()); +} + +void VPPredInstPHIRecipe::execute(VPTransformState &State) { + assert(State.Instance && "Predicated instruction PHI works per instance."); + Instruction *ScalarPredInst = + cast(State.ILV->getScalarValue(PredInst, *State.Instance)); + BasicBlock *PredicatedBB = ScalarPredInst->getParent(); + BasicBlock *PredicatingBB = PredicatedBB->getSinglePredecessor(); + assert(PredicatingBB && "Predicated block has no single predecessor."); + + // By current pack/unpack logic we need to generate only a single phi node: if + // a vector value for the predicated instruction exists at this point it means + // the instruction has vector users only, and a phi for the vector value is + // needed. In this case the recipe of the predicated instruction is marked to + // also do that packing, thereby "hoisting" the insert-element sequence. + // Otherwise, a phi node for the scalar value is needed. + unsigned Part = State.Instance->Part; + if (Value *VectorValue = State.ILV->getVectorValue(PredInst, Part)) { + InsertElementInst *IEI = cast(VectorValue); + PHINode *VPhi = State.ILV->Builder.CreatePHI(IEI->getType(), 2); + VPhi->addIncoming(IEI->getOperand(0), PredicatingBB); // Unmodified vector. + VPhi->addIncoming(IEI, PredicatedBB); // New vector with inserted element. + State.ILV->setVectorValue(PredInst, Part, VPhi); // Update cache. + } else { + Type *PredInstType = PredInst->getType(); + PHINode *Phi = State.ILV->Builder.CreatePHI(PredInstType, 2); + Phi->addIncoming(UndefValue::get(ScalarPredInst->getType()), PredicatingBB); + Phi->addIncoming(ScalarPredInst, PredicatedBB); + State.ILV->setScalarValue(PredInst, *State.Instance, Phi); // Update cache. + } +} + bool LoopVectorizePass::processLoop(Loop *L) { assert(L->empty() && "Only process inner loops."); @@ -7755,7 +8453,7 @@ CM.collectValuesToIgnore(); // Use the planner for vectorization. - LoopVectorizationPlanner LVP(CM); + LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, CM); // Get user vectorization factor. unsigned UserVF = Hints.getWidth(); @@ -7842,6 +8540,8 @@ DEBUG(dbgs() << "LV: Interleave Count is " << IC << '\n'); } + LVP.setBestPlan(VF.Width, IC); + using namespace ore; if (!VectorizeLoop) { assert(IC > 1 && "interleave count should not be 1 or 0"); @@ -7849,7 +8549,7 @@ // interleave it. InnerLoopUnroller Unroller(L, PSE, LI, DT, TLI, TTI, AC, ORE, IC, &LVL, &CM); - Unroller.vectorize(); + LVP.executePlan(Unroller, DT); ORE->emit(OptimizationRemark(LV_NAME, "Interleaved", L->getStartLoc(), L->getHeader()) @@ -7859,7 +8559,7 @@ // If we decided that it is *legal* to vectorize the loop, then do it. InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC, &LVL, &CM); - LB.vectorize(); + LVP.executePlan(LB, DT); ++LoopsVectorized; // Add metadata to disable runtime unrolling a scalar loop when there are Index: lib/Transforms/Vectorize/VPlan.h =================================================================== --- /dev/null +++ lib/Transforms/Vectorize/VPlan.h @@ -0,0 +1,674 @@ +//===- VPlan.h - Represent A Vectorizer Plan ------------------------------===// +// +// The LLVM Compiler Infrastructure +// +// This file is distributed under the University of Illinois Open Source +// License. See LICENSE.TXT for details. +// +//===----------------------------------------------------------------------===// +/// +/// \file +/// This file contains the declarations of the Vectorization Plan base classes: +/// 1. VPBasicBlock and VPRegionBlock that inherit from a common pure virtual +/// VPBlockBase, together implementing a Hierarchical CFG; +/// 2. Specializations of GraphTraits that allow VPBlockBase graphs to be +/// treated as proper graphs for generic algorithms; +/// 3. Pure virtual VPRecipeBase serving as the base class for recipes contained +/// within VPBasicBlocks; +/// 4. The VPlan class holding a candidate for vectorization; +/// 5. The VPlanPrinter class providing a way to print a plan in dot format. +/// These are documented in docs/VectorizationPlan.rst. +/// +//===----------------------------------------------------------------------===// + +#ifndef LLVM_TRANSFORMS_VECTORIZE_VPLAN_H +#define LLVM_TRANSFORMS_VECTORIZE_VPLAN_H + +#include "llvm/ADT/GraphTraits.h" +#include "llvm/ADT/SmallSet.h" +#include "llvm/ADT/ilist.h" +#include "llvm/ADT/ilist_node.h" +#include "llvm/IR/IRBuilder.h" +#include "llvm/Support/raw_ostream.h" + +// The (re)use of existing LoopVectorize classes is subject to future VPlan +// refactoring. +namespace { +// Forward declarations. +class InnerLoopVectorizer; +class LoopVectorizationLegality; +class LoopVectorizationCostModel; +} // namespace + +namespace llvm { + +// Forward declarations. +class BasicBlock; +class VPBasicBlock; + +/// In what follows, the term "input IR" refers to code that is fed into the +/// vectorizer whereas the term "output IR" refers to code that is generated by +/// the vectorizer. + +/// VPIteration represents a single point in the iteration space of the output +/// (vectorized and/or unrolled) IR loop. +struct VPIteration { + unsigned Part; ///< in [0..UF) + unsigned Lane; ///< in [0..VF) +}; + +/// VPTransformState holds information passed down when "executing" a VPlan, +/// needed for generating the output IR. +struct VPTransformState { + + VPTransformState(unsigned VF, unsigned UF, class LoopInfo *LI, + class DominatorTree *DT, IRBuilder<> &Builder, + InnerLoopVectorizer *ILV) + : VF(VF), UF(UF), Instance(nullptr), LI(LI), DT(DT), Builder(Builder), + ILV(ILV) {} + + /// The chosen Vectorization and Unroll Factors of the loop being vectorized. + unsigned VF; + unsigned UF; + + /// Hold the indices to generate specific scalar instructions. Null indicates + /// that all instances are to be generated, using either scalar or vector + /// instructions. + VPIteration *Instance; + + /// Hold state information used when constructing the CFG of the output IR, + /// traversing the VPBasicBlocks and generating corresponding IR BasicBlocks. + struct CFGState { + /// The previous VPBasicBlock visited. Initially set to null. + VPBasicBlock *PrevVPBB; + /// The previous IR BasicBlock created or used. Initially set to the new + /// header BasicBlock. + BasicBlock *PrevBB; + /// The last IR BasicBlock in the output IR. Set to the new latch + /// BasicBlock, used for placing the newly created BasicBlocks. + BasicBlock *LastBB; + /// A mapping of each VPBasicBlock to the corresponding BasicBlock. In case + /// of replication, maps the BasicBlock of the last replica created. + SmallDenseMap VPBB2IRBB; + + CFGState() : PrevVPBB(nullptr), PrevBB(nullptr), LastBB(nullptr) {} + } CFG; + + /// Hold a pointer to LoopInfo to register new basic blocks in the loop. + class LoopInfo *LI; + + /// Hold a pointer to Dominator Tree to register new basic blocks in the loop. + class DominatorTree *DT; + + /// Hold a reference to the IRBuilder used to generate output IR code. + IRBuilder<> &Builder; + + /// Hold a pointer to InnerLoopVectorizer to reuse its IR generation methods. + class InnerLoopVectorizer *ILV; +}; + +/// VPBlockBase is the building block of the Hierarchical Control-Flow Graph. +/// A VPBlockBase can be either a VPBasicBlock or a VPRegionBlock. +class VPBlockBase { +private: + const unsigned char VBID; ///< Subclass identifier (for isa/dyn_cast). + + static unsigned NextOrdinal; ///< Unique ID generator. + + unsigned int UID; + + /// An optional name for the block. + std::string Name; + + /// The immediate VPRegionBlock which this VPBlockBase belongs to, or null if + /// it is a topmost VPBlockBase. + class VPRegionBlock *Parent; + + /// List of predecessor blocks. + SmallVector Predecessors; + + /// List of successor blocks. + SmallVector Successors; + + /// Add \p Successor as the last successor to this block. + void appendSuccessor(VPBlockBase *Successor) { + assert(Successor && "Cannot add nullptr successor!"); + Successors.push_back(Successor); + } + + /// Add \p Predecessor as the last predecessor to this block. + void appendPredecessor(VPBlockBase *Predecessor) { + assert(Predecessor && "Cannot add nullptr predecessor!"); + Predecessors.push_back(Predecessor); + } + + /// Remove \p Predecessor from the predecessors of this block. + void removePredecessor(VPBlockBase *Predecessor) { + auto Pos = std::find(Predecessors.begin(), Predecessors.end(), Predecessor); + assert(Pos && "Predecessor does not exist"); + Predecessors.erase(Pos); + } + + /// Remove \p Successor from the successors of this block. + void removeSuccessor(VPBlockBase *Successor) { + auto Pos = std::find(Successors.begin(), Successors.end(), Successor); + assert(Pos && "Successor does not exist"); + Successors.erase(Pos); + } + +protected: + VPBlockBase(const unsigned char SC, const std::string &N) + : VBID(SC), UID(NextOrdinal++), Parent(nullptr) { + Name = N.empty() ? (Twine("VPB") + Twine(UID)).str() : N; + } + +public: + /// An enumeration for keeping track of the concrete subclass of VPBlockBase + /// that are actually instantiated. Values of this enumeration are kept in the + /// VBID field of the VPBlockBase objects. They are used for concrete type + /// identification. + typedef enum { VPBasicBlockSC, VPRegionBlockSC } VPBlockTy; + + typedef SmallVectorImpl VPBlocksTy; + typedef const SmallVectorImpl const_VPBlocksTy; + + virtual ~VPBlockBase() {} + + unsigned int getUID() const { return UID; } + + const std::string &getName() const { return Name; } + + void setName(const Twine &newName) { Name = newName.str(); } + + /// \return an ID for the concrete type of this object. + /// This is used to implement the classof checks. This should not be used + /// for any other purpose, as the values may change as LLVM evolves. + unsigned getVPBlockID() const { return VBID; } + + const class VPRegionBlock *getParent() const { return Parent; } + + void setParent(VPRegionBlock *P) { Parent = P; } + + /// \return the VPBasicBlock that is the entry of this VPBlockBase, + /// recursively, if the latter is a VPRegionBlock. Otherwise, if this + /// VPBlockBase is a VPBasicBlock, it is returned. + const class VPBasicBlock *getEntryBasicBlock() const; + class VPBasicBlock *getEntryBasicBlock(); + + /// \return the VPBasicBlock that is the exit of this VPBlockBase, + /// recursively, if the latter is a VPRegionBlock. Otherwise, if this + /// VPBlockBase is a VPBasicBlock, it is returned. + const class VPBasicBlock *getExitBasicBlock() const; + class VPBasicBlock *getExitBasicBlock(); + + const_VPBlocksTy &getSuccessors() const { return Successors; } + VPBlocksTy &getSuccessors() { return Successors; } + + const_VPBlocksTy &getPredecessors() const { return Predecessors; } + VPBlocksTy &getPredecessors() { return Predecessors; } + + /// \return the successor of this VPBlockBase if it has a single successor. + /// Otherwise return a null pointer. + VPBlockBase *getSingleSuccessor() const { + return (Successors.size() == 1 ? *Successors.begin() : nullptr); + } + + /// \return the predecessor of this VPBlockBase if it has a single + /// predecessor. Otherwise return a null pointer. + VPBlockBase *getSinglePredecessor() const { + return (Predecessors.size() == 1 ? *Predecessors.begin() : nullptr); + } + + /// \return the closest ancestor starting from "this", which has successors. + /// Returns the root ancestor if all ancestors have no successors. + VPBlockBase *getAncestorWithSuccessors(); + + /// \return the closest ancestor starting from "this", which has predecessors. + /// Returns the root ancestor if all ancestors have no predecessors. + VPBlockBase *getAncestorWithPredecessors(); + + /// \return the successors either attached directly to this VPBlockBase or, if + /// this VPBlockBase is the exit block of a VPRegionBlock and has no + /// successors of its own, search recursively for the first enclosing + /// VPRegionBlock that has successors and return them. If no such + /// VPRegionBlock exists, return the (empty) successors of the topmost + /// VPBlockBase reached. + const_VPBlocksTy &getHierarchicalSuccessors() { + return getAncestorWithSuccessors()->getSuccessors(); + } + + /// \return the hierarchical successor of this VPBlockBase if it has a single + /// hierarchical successor. Otherwise return a null pointer. + VPBlockBase *getSingleHierarchicalSuccessor() { + return getAncestorWithSuccessors()->getSingleSuccessor(); + } + + /// \return the predecessors either attached directly to this VPBlockBase or, + /// if this VPBlockBase is the entry block of a VPRegionBlock and has no + /// predecessors of its own, search recursively for the first enclosing + /// VPRegionBlock that has predecessors and return them. If no such + /// VPRegionBlock exists, return the (empty) predecessors of the topmost + /// VPBlockBase reached. + const_VPBlocksTy &getHierarchicalPredecessors() { + return getAncestorWithPredecessors()->getPredecessors(); + } + + /// \return the hierarchical predecessor of this VPBlockBase if it has a + /// single hierarchical predecessor. Otherwise return a null pointer. + VPBlockBase *getSingleHierarchicalPredecessor() { + return getAncestorWithPredecessors()->getSinglePredecessor(); + } + + /// Sets a given VPBlockBase \p Successor as the single successor and \return + /// \p Successor. The parent of this Block is copied to be the parent of + /// \p Successor. + VPBlockBase *setOneSuccessor(VPBlockBase *Successor) { + assert(Successors.empty() && "Setting one successor when others exist."); + appendSuccessor(Successor); + Successor->appendPredecessor(this); + Successor->Parent = Parent; + return Successor; + } + + /// Sets two given VPBlockBases \p IfTrue and \p IfFalse to be the two + /// successors. The parent of this Block is copied to be the parent of both + /// \p IfTrue and \p IfFalse. + void setTwoSuccessors(VPBlockBase *IfTrue, VPBlockBase *IfFalse) { + assert(Successors.empty() && "Setting two successors when others exist."); + appendSuccessor(IfTrue); + appendSuccessor(IfFalse); + IfTrue->appendPredecessor(this); + IfFalse->appendPredecessor(this); + IfTrue->Parent = Parent; + IfFalse->Parent = Parent; + } + + void disconnectSuccessor(VPBlockBase *Successor) { + assert(Successor && "Successor to disconnect is null."); + removeSuccessor(Successor); + Successor->removePredecessor(this); + } + + /// The method which generates the output IR that correspond to this + /// VPBlockBase, thereby "executing" the VPlan. + virtual void execute(struct VPTransformState *State) = 0; + + /// Delete all blocks reachable from a given VPBlockBase, inclusive. + static void deleteCFG(VPBlockBase *Entry); +}; + +/// VPRecipeBase is a base class modeling a sequence of one or more output IR +/// instructions. +class VPRecipeBase : public ilist_node_with_parent { + friend class VPBasicBlock; + +private: + const unsigned char VRID; ///< Subclass identifier (for isa/dyn_cast). + + /// Each VPRecipe belongs to a single VPBasicBlock. + class VPBasicBlock *Parent; + +public: + /// An enumeration for keeping track of the concrete subclass of VPRecipeBase + /// that are actually instantiated. Values of this enumeration are kept in the + /// VRID field of the VPRecipeBase objects. They are used for concrete type + /// identification. + typedef enum { + VPBranchOnMaskSC, + VPInterleaveSC, + VPPredInstPHISC, + VPReplicateSC, + VPWidenIntOrFpInductionSC, + VPWidenPHISC, + VPWidenSC, + } VPRecipeTy; + + VPRecipeBase(const unsigned char SC) : VRID(SC), Parent(nullptr) {} + + virtual ~VPRecipeBase() {} + + /// \return an ID for the concrete type of this object. + /// This is used to implement the classof checks. This should not be used + /// for any other purpose, as the values may change as LLVM evolves. + unsigned getVPRecipeID() const { return VRID; } + + /// \return the VPBasicBlock which this VPRecipe belongs to. + VPBasicBlock *getParent() { return Parent; } + const VPBasicBlock *getParent() const { return Parent; } + + /// The method which generates the output IR instructions that correspond to + /// this VPRecipe, thereby "executing" the VPlan. + virtual void execute(struct VPTransformState &State) = 0; + + /// Each recipe prints itself. + virtual void print(raw_ostream &O, const Twine &Indent) const = 0; +}; + +/// VPBasicBlock serves as the leaf of the Hierarchical Control-Flow Graph. It +/// holds a sequence of zero or more VPRecipe's each representing a sequence of +/// output IR instructions. +class VPBasicBlock : public VPBlockBase { +public: + typedef iplist RecipeListTy; + +private: + /// The VPRecipes held in the order of output instructions to generate. + RecipeListTy Recipes; + +public: + /// Instruction iterators... + typedef RecipeListTy::iterator iterator; + typedef RecipeListTy::const_iterator const_iterator; + typedef RecipeListTy::reverse_iterator reverse_iterator; + typedef RecipeListTy::const_reverse_iterator const_reverse_iterator; + + //===--------------------------------------------------------------------===// + /// Recipe iterator methods + /// + inline iterator begin() { return Recipes.begin(); } + inline const_iterator begin() const { return Recipes.begin(); } + inline iterator end() { return Recipes.end(); } + inline const_iterator end() const { return Recipes.end(); } + + inline reverse_iterator rbegin() { return Recipes.rbegin(); } + inline const_reverse_iterator rbegin() const { return Recipes.rbegin(); } + inline reverse_iterator rend() { return Recipes.rend(); } + inline const_reverse_iterator rend() const { return Recipes.rend(); } + + inline size_t size() const { return Recipes.size(); } + inline bool empty() const { return Recipes.empty(); } + inline const VPRecipeBase &front() const { return Recipes.front(); } + inline VPRecipeBase &front() { return Recipes.front(); } + inline const VPRecipeBase &back() const { return Recipes.back(); } + inline VPRecipeBase &back() { return Recipes.back(); } + + /// \brief Return the underlying recipe list container. + /// + /// Currently you need to access the underlying recipe list container directly + /// if you want to modify it. + const RecipeListTy &getRecipeList() const { return Recipes; } + RecipeListTy &getRecipeList() { return Recipes; } + + /// \brief Returns a pointer to a member of the instruction list. + static RecipeListTy VPBasicBlock::*getSublistAccess(VPRecipeBase *) { + return &VPBasicBlock::Recipes; + } + + VPBasicBlock(const Twine &Name = "", VPRecipeBase *Recipe = nullptr) + : VPBlockBase(VPBasicBlockSC, Name.str()) { + if (Recipe) + appendRecipe(Recipe); + } + + ~VPBasicBlock() { Recipes.clear(); } + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPBlockBase *V) { + return V->getVPBlockID() == VPBlockBase::VPBasicBlockSC; + } + + /// Augment the existing recipes of a VPBasicBlock with an additional + /// \p Recipe as the last recipe. + void appendRecipe(VPRecipeBase *Recipe) { + assert(Recipe && "No recipe to append."); + assert(!Recipe->Parent && "Recipe already in VPlan"); + Recipe->Parent = this; + return Recipes.push_back(Recipe); + } + + /// Augment the existing recipes of a VPBasicBlock with an additional + /// \p Recipe at a position given by an existing recipe \p Before. + void insertRecipe(VPRecipeBase *Recipe, VPRecipeBase *Before) { + assert(Recipe && "No recipe to insert."); + assert(!Recipe->Parent && "Recipe already in VPlan"); + Recipe->Parent = this; + assert(Before && "No recipe to insert before."); + assert(Before->Parent == this && "Inserting not in same basic block."); + Recipes.insert(Before->getIterator(), Recipe); + } + + void removeRecipe(VPRecipeBase *Recipe) { + assert(Recipe->Parent == this && "Removing not from same basic block."); + Recipes.remove(Recipe); + Recipe->Parent = nullptr; + } + + /// The method which generates the output IR instructions that correspond to + /// this VPBasicBlock, thereby "executing" the VPlan. + void execute(struct VPTransformState *State) override; + +private: + /// Create an IR BasicBlock to hold the output instructions generated by this + /// VPBasicBlock, and return it. Update the CFGState accordingly. + BasicBlock *createEmptyBasicBlock(VPTransformState::CFGState &CFG); +}; + +/// VPRegionBlock represents a collection of VPBasicBlocks and VPRegionBlocks +/// which form a Single-Entry-Single-Exit subgraph of the output IR CFG. +/// A VPRegionBlock may indicate that its contents are to be replicated several +/// times. This is designed to support predicated scalarization, in which a +/// scalar if-then code structure needs to be generated VF * UF times. Having +/// this replication indicator helps to keep a single model for multiple +/// candidate VF's. The actual replication takes place only once the desired VF +/// and UF have been determined. +class VPRegionBlock : public VPBlockBase { +private: + /// Hold the Single Entry of the SESE region modelled by the VPRegionBlock. + VPBlockBase *Entry; + + /// Hold the Single Exit of the SESE region modelled by the VPRegionBlock. + VPBlockBase *Exit; + + /// An indicator whether this region is to generate multiple replicated + /// instances of output IR corresponding to its VPBlockBases. + bool IsReplicator; + +public: + VPRegionBlock(VPBlockBase *Entry, VPBlockBase *Exit, + const std::string &Name = "", bool IsReplicator = false) + : VPBlockBase(VPRegionBlockSC, Name), Entry(Entry), Exit(Exit), + IsReplicator(IsReplicator) { + assert(Entry->getPredecessors().empty() && "Entry block has predecessors."); + assert(Exit->getSuccessors().empty() && "Exit block has successors."); + Entry->setParent(this); + Exit->setParent(this); + } + + ~VPRegionBlock() { + if (Entry) + deleteCFG(Entry); + } + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPBlockBase *V) { + return V->getVPBlockID() == VPBlockBase::VPRegionBlockSC; + } + + const VPBlockBase *getEntry() const { return Entry; } + VPBlockBase *getEntry() { return Entry; } + + const VPBlockBase *getExit() const { return Exit; } + VPBlockBase *getExit() { return Exit; } + + /// An indicator whether this region is to generate multiple replicated + /// instances of output IR corresponding to its VPBlockBases. + bool isReplicator() const { return IsReplicator; } + + /// The method which generates the output IR instructions that correspond to + /// this VPRegionBlock, thereby "executing" the VPlan. + void execute(struct VPTransformState *State) override; +}; + +/// VPlan models a candidate for vectorization, encoding various decisions take +/// to produce efficient output IR, including which branches, basic-blocks and +/// output IR instructions to generate, and their cost. VPlan holds a +/// Hierarchical-CFG of VPBasicBlocks and VPRegionBlocks rooted at an Entry +/// VPBlock. +class VPlan { +private: + /// Hold the single entry to the Hierarchical CFG of the VPlan. + VPBlockBase *Entry; + + /// Holds the VFs applicable to this VPlan. + std::set VFs; + + /// Holds the name of the VPlan, for printing. + std::string Name; + +public: + VPlan(VPBlockBase *Entry = nullptr) : Entry(Entry) {} + + ~VPlan() { + if (Entry) + VPBlockBase::deleteCFG(Entry); + } + + /// Generate the IR code for this VPlan. + void execute(struct VPTransformState *State); + + VPBlockBase *getEntry() { return Entry; } + const VPBlockBase *getEntry() const { return Entry; } + + VPBlockBase *setEntry(VPBlockBase *Block) { return Entry = Block; } + + void addVF(unsigned VF) { VFs.insert(VF); } + + void removeVF(unsigned VF) { VFs.erase(VF); } + + std::set &getVFs() { return VFs; } + const std::set &getVFs() const { return VFs; } + + const std::string &getName() const { return Name; } + + void setName(const Twine &newName) { Name = newName.str(); } + +private: + /// Add to the given dominator tree the header block and every new basic block + /// that was created between it and the latch block, inclusive. + void updateDominatorTree(class DominatorTree *DT, BasicBlock *LoopPreHeaderBB, + BasicBlock *LoopLatchBB); +}; + +/// VPlanPrinter prints a given VPlan to a given output stream. The printing is +/// indented and follows the dot format. +class VPlanPrinter { + friend inline raw_ostream &operator<<(raw_ostream &OS, VPlan &Plan); + friend inline raw_ostream &operator<<(raw_ostream &OS, + const struct VPlanIngredient &I); + +private: + raw_ostream &OS; + VPlan &Plan; + unsigned Depth; + unsigned TabWidth = 2; + std::string Indent; + + /// Handle indentation. + void bumpIndent(int b) { Indent = std::string((Depth += b) * TabWidth, ' '); } + + /// Print a given \p Block of the Plan. + void dumpBlock(const VPBlockBase *Block); + + /// Print the information related to the CFG edges going out of a given + /// \p Block, followed by printing the successor blocks themselves. + void dumpEdges(const VPBlockBase *Block); + + /// Print a given \p BasicBlock, including its VPRecipes, followed by printing + /// its successor blocks. + void dumpBasicBlock(const VPBasicBlock *BasicBlock); + + /// Print a given \p Region of the Plan. + void dumpRegion(const VPRegionBlock *Region); + + const Twine getUID(const VPBlockBase *Block); + + /// Print the information related to a CFG edge between two VPBlockBases. + void drawEdge(const VPBlockBase *From, const VPBlockBase *To, bool Hidden, + const Twine &Label); + + VPlanPrinter(raw_ostream &O, VPlan &P) : OS(O), Plan(P) {} + + void dump(); + + static void printAsIngredient(raw_ostream &O, Value *V); +}; + +struct VPlanIngredient { + Value *V; + VPlanIngredient(Value *V) : V(V) {} +}; + +inline raw_ostream &operator<<(raw_ostream &OS, const VPlanIngredient &I) { + VPlanPrinter::printAsIngredient(OS, I.V); + return OS; +} + +inline raw_ostream &operator<<(raw_ostream &OS, VPlan &Plan) { + VPlanPrinter Printer(OS, Plan); + Printer.dump(); + return OS; +} + +//===--------------------------------------------------------------------===// +// GraphTraits specializations for VPlan/VPRegionBlock Control-Flow Graphs // +//===--------------------------------------------------------------------===// + +// Provide specializations of GraphTraits to be able to treat a VPBlockBase as a +// graph of VPBlockBase nodes... + +template <> struct GraphTraits { + typedef VPBlockBase *NodeRef; + typedef SmallVectorImpl::iterator ChildIteratorType; + + static NodeRef getEntryNode(NodeRef N) { return N; } + + static inline ChildIteratorType child_begin(NodeRef N) { + return N->getSuccessors().begin(); + } + + static inline ChildIteratorType child_end(NodeRef N) { + return N->getSuccessors().end(); + } +}; + +template <> struct GraphTraits { + typedef const VPBlockBase *NodeRef; + typedef SmallVectorImpl::const_iterator ChildIteratorType; + + static NodeRef getEntryNode(NodeRef N) { return N; } + + static inline ChildIteratorType child_begin(NodeRef N) { + return N->getSuccessors().begin(); + } + + static inline ChildIteratorType child_end(NodeRef N) { + return N->getSuccessors().end(); + } +}; + +// Provide specializations of GraphTraits to be able to treat a VPBlockBase as a +// graph of VPBlockBase nodes... and to walk it in inverse order. Inverse order +// for a VPBlockBase is considered to be when traversing the predecessors of a +// VPBlockBase instead of its successors. +// + +template <> struct GraphTraits> { + typedef VPBlockBase *NodeRef; + typedef SmallVectorImpl::iterator ChildIteratorType; + + static Inverse getEntryNode(Inverse B) { + return B; + } + + static inline ChildIteratorType child_begin(NodeRef N) { + return N->getPredecessors().begin(); + } + + static inline ChildIteratorType child_end(NodeRef N) { + return N->getPredecessors().end(); + } +}; + +} // namespace llvm + +#endif // LLVM_TRANSFORMS_VECTORIZE_VPLAN_H Index: lib/Transforms/Vectorize/VPlan.cpp =================================================================== --- /dev/null +++ lib/Transforms/Vectorize/VPlan.cpp @@ -0,0 +1,397 @@ +//===- VPlan.cpp - Vectorizer Plan ----------------------------------------===// +// +// The LLVM Compiler Infrastructure +// +// This file is distributed under the University of Illinois Open Source +// License. See LICENSE.TXT for details. +// +//===----------------------------------------------------------------------===// +/// +/// \file +/// This is the LLVM vectorization plan. It represents a candidate for +/// vectorization, allowing to plan and optimize how to vectorize a given loop +/// before generating LLVM-IR. +/// The vectorizer uses vectorization plans to estimate the costs of potential +/// candidates and if profitable to execute the desired plan, generating vector +/// LLVM-IR code. +/// +//===----------------------------------------------------------------------===// + +#include "VPlan.h" +#include "llvm/ADT/PostOrderIterator.h" +#include "llvm/Analysis/LoopInfo.h" +#include "llvm/IR/BasicBlock.h" +#include "llvm/IR/Dominators.h" +#include "llvm/Support/GraphWriter.h" +#include "llvm/Transforms/Utils/BasicBlockUtils.h" + +using namespace llvm; + +#define DEBUG_TYPE "vplan" + +unsigned VPBlockBase::NextOrdinal = 1; + +/// \return the VPBasicBlock that is the entry of Block, possibly indirectly. +const VPBasicBlock *VPBlockBase::getEntryBasicBlock() const { + const VPBlockBase *Block = this; + while (const VPRegionBlock *Region = dyn_cast(Block)) + Block = Region->getEntry(); + return cast(Block); +} + +VPBasicBlock *VPBlockBase::getEntryBasicBlock() { + VPBlockBase *Block = this; + while (VPRegionBlock *Region = dyn_cast(Block)) + Block = Region->getEntry(); + return cast(Block); +} + +/// \return the VPBasicBlock that is the exit of Block, possibly indirectly. +const VPBasicBlock *VPBlockBase::getExitBasicBlock() const { + const VPBlockBase *Block = this; + while (const VPRegionBlock *Region = dyn_cast(Block)) + Block = Region->getExit(); + return cast(Block); +} + +VPBasicBlock *VPBlockBase::getExitBasicBlock() { + VPBlockBase *Block = this; + while (VPRegionBlock *Region = dyn_cast(Block)) + Block = Region->getExit(); + return cast(Block); +} + +/// Returns the closest ancestor, starting from "this", which has successors. +/// Returns the root ancestor if all ancestors have no successors. +VPBlockBase *VPBlockBase::getAncestorWithSuccessors() { + if (!Successors.empty() || !Parent) + return this; + assert(Parent->getExit() == this && + "Block w/o successors not the exit of its parent."); + return Parent->getAncestorWithSuccessors(); +} + +/// Returns the closest ancestor, starting from "this", which has predecessors. +/// Returns the root ancestor if all ancestors have no predecessors. +VPBlockBase *VPBlockBase::getAncestorWithPredecessors() { + if (!Predecessors.empty() || !Parent) + return this; + assert(Parent->getEntry() == this && + "Block w/o predecessors not the entry of its parent."); + return Parent->getAncestorWithPredecessors(); +} + +void VPBlockBase::deleteCFG(VPBlockBase *Entry) { + SmallVector Blocks; + for (VPBlockBase *Block : depth_first(Entry)) + Blocks.push_back(Block); + + for (VPBlockBase *Block : Blocks) + delete Block; +} + +BasicBlock * +VPBasicBlock::createEmptyBasicBlock(VPTransformState::CFGState &CFG) { + // BB stands for IR BasicBlocks. VPBB stands for VPlan VPBasicBlocks. + // Pred stands for Predessor. Prev stands for Previous - last visited/created. + BasicBlock *PrevBB = CFG.PrevBB; + BasicBlock *NewBB = BasicBlock::Create(PrevBB->getContext(), getName(), + PrevBB->getParent(), CFG.LastBB); + DEBUG(dbgs() << "LV: created " << NewBB->getName() << '\n'); + + // Hook up the new basic block to its predecessors. + for (VPBlockBase *PredVPBlock : getHierarchicalPredecessors()) { + VPBasicBlock *PredVPBB = PredVPBlock->getExitBasicBlock(); + auto &PredVPSuccessors = PredVPBB->getSuccessors(); + BasicBlock *PredBB = CFG.VPBB2IRBB[PredVPBB]; + assert(PredBB && "Predecessor basic-block not found building successor."); + auto *PredBBTerminator = PredBB->getTerminator(); + DEBUG(dbgs() << "LV: draw edge from" << PredBB->getName() << '\n'); + if (isa(PredBBTerminator)) { + assert(PredVPSuccessors.size() == 1 && + "Predecessor ending w/o branch must have single successor."); + PredBBTerminator->eraseFromParent(); + BranchInst::Create(NewBB, PredBB); + } else { + assert(PredVPSuccessors.size() == 2 && + "Predecessor ending with branch must have two successors."); + unsigned idx = PredVPSuccessors.front() == this ? 0 : 1; + assert(!PredBBTerminator->getSuccessor(idx) && + "Trying to reset an existing successor block."); + PredBBTerminator->setSuccessor(idx, NewBB); + } + } + return NewBB; +} + +void VPBasicBlock::execute(VPTransformState *State) { + VPIteration *I = State->Instance; + bool Replica = I && !(I->Part == 0 && I->Lane == 0); + VPBasicBlock *PrevVPBB = State->CFG.PrevVPBB; + VPBlockBase *SingleHPred = nullptr; + BasicBlock *NewBB = State->CFG.PrevBB; // Reuse it if possible. + + // 1. Create an IR basic block, or reuse the last one if possible. + // The last IR basic block is reused, as an optimization, in three cases: + // A. the first VPBB reuses the loop header BB - when PrevVPBB is null; + // B. when the current VPBB has a single (hierarchical) predecessor which + // is PrevVPBB and the latter has a single (hierarchical) successor; and + // C. when the current VPBB is an entry of a region replica - where PrevVPBB + // is the exit of this region from a previous instance, or the predecessor + // of this region. + if (PrevVPBB && /* A */ + !((SingleHPred = getSingleHierarchicalPredecessor()) && + SingleHPred->getExitBasicBlock() == PrevVPBB && + PrevVPBB->getSingleHierarchicalSuccessor()) && /* B */ + !(Replica && getPredecessors().empty())) { /* C */ + + NewBB = createEmptyBasicBlock(State->CFG); + State->Builder.SetInsertPoint(NewBB); + // Temporarily terminate with unreachable until CFG is rewired. + UnreachableInst *Terminator = State->Builder.CreateUnreachable(); + State->Builder.SetInsertPoint(Terminator); + // Register NewBB in its loop. In innermost loops its the same for all BB's. + Loop *L = State->LI->getLoopFor(State->CFG.LastBB); + L->addBasicBlockToLoop(NewBB, *State->LI); + State->CFG.PrevBB = NewBB; + } + + // 2. Fill the IR basic block with IR instructions. + DEBUG(dbgs() << "LV: vectorizing VPBB:" << getName() + << " in BB:" << NewBB->getName() << '\n'); + + State->CFG.VPBB2IRBB[this] = NewBB; + State->CFG.PrevVPBB = this; + + for (VPRecipeBase &Recipe : Recipes) + Recipe.execute(*State); + + DEBUG(dbgs() << "LV: filled BB:" << *NewBB); +} + +void VPRegionBlock::execute(VPTransformState *State) { + ReversePostOrderTraversal RPOT(Entry); + + if (!isReplicator()) { + // Visit the VPBlocks connected to "this", starting from it. + for (VPBlockBase *Block : RPOT) { + DEBUG(dbgs() << "LV: VPBlock in RPO " << Block->getName() << '\n'); + Block->execute(State); + } + return; + } + + assert(!State->Instance && "Replicating a Region with non-null instance."); + VPIteration Instance; + State->Instance = &Instance; + + for (unsigned Part = 0, UF = State->UF; Part < UF; ++Part) + for (unsigned Lane = 0, VF = State->VF; Lane < VF; ++Lane) { + Instance = {Part, Lane}; + // Visit the VPBlocks connected to \p this, starting from it. + for (VPBlockBase *Block : RPOT) { + DEBUG(dbgs() << "LV: VPBlock in RPO " << Block->getName() << '\n'); + Block->execute(State); + } + } + + State->Instance = nullptr; +} + +/// Generate the code inside the body of the vectorized loop. Assumes a single +/// LoopVectorBody basic-block was created for this. Introduce additional +/// basic-blocks as needed, and fill them all. +void VPlan::execute(VPTransformState *State) { + BasicBlock *VectorPreHeaderBB = State->CFG.PrevBB; + BasicBlock *VectorHeaderBB = VectorPreHeaderBB->getSingleSuccessor(); + assert(VectorHeaderBB && "Loop preheader does not have a single successor."); + BasicBlock *VectorLatchBB = VectorHeaderBB; + auto CurrIP = State->Builder.saveIP(); + + // 1. Make room to generate basic-blocks inside loop body if needed. + VectorLatchBB = VectorHeaderBB->splitBasicBlock( + VectorHeaderBB->getFirstInsertionPt(), "vector.body.latch"); + Loop *L = State->LI->getLoopFor(VectorHeaderBB); + L->addBasicBlockToLoop(VectorLatchBB, *State->LI); + // Remove the edge between Header and Latch to allow other connections. + // Temporarily terminate with unreachable until CFG is rewired. + // Note: this asserts the generated code's assumption that + // getFirstInsertionPt() can be dereferenced into an Instruction. + VectorHeaderBB->getTerminator()->eraseFromParent(); + State->Builder.SetInsertPoint(VectorHeaderBB); + UnreachableInst *Terminator = State->Builder.CreateUnreachable(); + State->Builder.SetInsertPoint(Terminator); + + // 2. Generate code in loop body. + State->CFG.PrevVPBB = nullptr; + State->CFG.PrevBB = VectorHeaderBB; + State->CFG.LastBB = VectorLatchBB; + + for (VPBlockBase *Block : depth_first(Entry)) + Block->execute(State); + + // 3. Merge the temporary latch created with the last basic-block filled. + BasicBlock *LastBB = State->CFG.PrevBB; + // Connect LastBB to VectorLatchBB to facilitate their merge. + assert(isa(LastBB->getTerminator()) && + "Expected VPlan CFG to terminate with unreachable"); + LastBB->getTerminator()->eraseFromParent(); + BranchInst::Create(VectorLatchBB, LastBB); + + // Merge LastBB with Latch. + bool merged = MergeBlockIntoPredecessor(VectorLatchBB, nullptr, State->LI); + assert(merged && "Could not merge last basic block with latch."); + VectorLatchBB = LastBB; + + updateDominatorTree(State->DT, VectorPreHeaderBB, VectorLatchBB); + State->Builder.restoreIP(CurrIP); +} + +void VPlan::updateDominatorTree(DominatorTree *DT, BasicBlock *LoopPreHeaderBB, + BasicBlock *LoopLatchBB) { + BasicBlock *LoopHeaderBB = LoopPreHeaderBB->getSingleSuccessor(); + assert(LoopHeaderBB && "Loop preheader does not have a single successor."); + DT->addNewBlock(LoopHeaderBB, LoopPreHeaderBB); + // The vector body may be more than a single basic-block by this point. + // Update the dominator tree information inside the vector body by propagating + // it from header to latch, expecting only triangular control-flow, if any. + BasicBlock *PostDomSucc = nullptr; + for (auto *BB = LoopHeaderBB; BB != LoopLatchBB; BB = PostDomSucc) { + // Get the list of successors of this block. + std::vector Succs(succ_begin(BB), succ_end(BB)); + assert(Succs.size() <= 2 && + "Basic block in vector loop has more than 2 successors."); + PostDomSucc = Succs[0]; + if (Succs.size() == 1) { + assert(PostDomSucc->getSinglePredecessor() && + "PostDom successor has more than one predecessor."); + DT->addNewBlock(PostDomSucc, BB); + continue; + } + BasicBlock *InterimSucc = Succs[1]; + if (PostDomSucc->getSingleSuccessor() == InterimSucc) { + PostDomSucc = Succs[1]; + InterimSucc = Succs[0]; + } + assert(InterimSucc->getSingleSuccessor() == PostDomSucc && + "One successor of a basic block does not lead to the other."); + assert(InterimSucc->getSinglePredecessor() && + "Interim successor has more than one predecessor."); + assert(std::distance(pred_begin(PostDomSucc), pred_end(PostDomSucc)) == 2 && + "PostDom successor has more than two predecessors."); + DT->addNewBlock(InterimSucc, BB); + DT->addNewBlock(PostDomSucc, BB); + } +} + +const Twine VPlanPrinter::getUID(const VPBlockBase *Block) { + return (isa(Block) ? "cluster_N" : "N") + + Twine(Block->getUID()); +} + +void VPlanPrinter::dump() { + Depth = 1; + bumpIndent(0); + OS << "digraph VPlan {\n"; + OS << "graph [labelloc=t, fontsize=30; label=\"Vectorization Plan"; + if (!Plan.getName().empty()) + OS << "\\n" << DOT::EscapeString(Plan.getName()); + OS << "\"]\n"; + OS << "node [shape=rect, fontname=Courier, fontsize=30]\n"; + OS << "edge [fontname=Courier, fontsize=30]\n"; + OS << "compound=true\n"; + + for (VPBlockBase *Block : depth_first(Plan.getEntry())) + dumpBlock(Block); + + OS << "}\n"; +} + +void VPlanPrinter::dumpBlock(const VPBlockBase *Block) { + if (const VPBasicBlock *BasicBlock = dyn_cast(Block)) + dumpBasicBlock(BasicBlock); + else if (const VPRegionBlock *Region = dyn_cast(Block)) + dumpRegion(Region); + else + llvm_unreachable("Unsupported kind of VPBlock."); +} + +void VPlanPrinter::drawEdge(const VPBlockBase *From, const VPBlockBase *To, + bool Hidden, const Twine &Label) { + // Due to "dot" we print an edge between two regions as an edge between the + // exit basic block and the entry basic of the respective regions. + const VPBlockBase *Tail = From->getExitBasicBlock(); + const VPBlockBase *Head = To->getEntryBasicBlock(); + OS << Indent << getUID(Tail) << " -> " << getUID(Head); + OS << " [ label=\"" << Label << '\"'; + if (Tail != From) + OS << " ltail=" << getUID(From); + if (Head != To) + OS << " lhead=" << getUID(To); + if (Hidden) + OS << "; splines=none"; + OS << "]\n"; +} + +void VPlanPrinter::dumpEdges(const VPBlockBase *Block) { + auto &Successors = Block->getSuccessors(); + if (Successors.size() == 1) + drawEdge(Block, Successors.front(), false, ""); + else if (Successors.size() == 2) { + drawEdge(Block, Successors.front(), false, "T"); + drawEdge(Block, Successors.back(), false, "F"); + } else { + unsigned SuccessorNumber = 0; + for (auto *Successor : Successors) + drawEdge(Block, Successor, false, Twine(SuccessorNumber++)); + } +} + +void VPlanPrinter::dumpBasicBlock(const VPBasicBlock *BasicBlock) { + OS << Indent << getUID(BasicBlock) << " [label =\n"; + bumpIndent(1); + OS << Indent << "\"" << DOT::EscapeString(BasicBlock->getName()) << ":\\n\""; + bumpIndent(1); + for (const VPRecipeBase &Recipe : *BasicBlock) + Recipe.print(OS, Indent); + bumpIndent(-2); + OS << "\n" << Indent << "]\n"; + dumpEdges(BasicBlock); +} + +void VPlanPrinter::dumpRegion(const VPRegionBlock *Region) { + OS << Indent << "subgraph " << getUID(Region) << " {\n"; + bumpIndent(1); + OS << Indent << "fontname=Courier\n" + << Indent << "label=\"" + << DOT::EscapeString(Region->isReplicator() ? " " : " ") + << DOT::EscapeString(Region->getName()) << "\"\n"; + // Dump the blocks of the region. + assert(Region->getEntry() && "Region contains no inner blocks."); + for (const VPBlockBase *Block : depth_first(Region->getEntry())) + dumpBlock(Block); + bumpIndent(-1); + OS << Indent << "}\n"; + dumpEdges(Region); +} + +void VPlanPrinter::printAsIngredient(raw_ostream &O, Value *V) { + std::string IngredientString; + raw_string_ostream RSO(IngredientString); + if (auto *Inst = dyn_cast(V)) { + if (!Inst->getType()->isVoidTy()) { + Inst->printAsOperand(RSO, false); + RSO << " = "; + } + RSO << Inst->getOpcodeName() << " "; + unsigned E = Inst->getNumOperands(); + if (E > 0) { + Inst->getOperand(0)->printAsOperand(RSO, false); + for (unsigned I = 1; I < E; ++I) + Inst->getOperand(I)->printAsOperand(RSO << ", ", false); + } + } else // !Inst + V->printAsOperand(RSO, false); + RSO.flush(); + O << DOT::EscapeString(IngredientString); +} Index: test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll =================================================================== --- test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll +++ test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll @@ -26,9 +26,9 @@ ; CHECK-NEXT: br i1 [[TMP3]], label %[[PRED_UDIV_IF:.*]], label %[[PRED_UDIV_CONTINUE:.*]] ; CHECK: [[PRED_UDIV_IF]]: ; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x i64> [[WIDE_LOAD]], i32 0 -; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x i64> [[WIDE_LOAD]], i32 0 -; CHECK-NEXT: [[TMP6:%.*]] = add nsw i64 [[TMP5]], %x -; CHECK-NEXT: [[TMP7:%.*]] = udiv i64 [[TMP4]], [[TMP6]] +; CHECK-NEXT: [[TMP5:%.*]] = add nsw i64 [[TMP4]], %x +; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x i64> [[WIDE_LOAD]], i32 0 +; CHECK-NEXT: [[TMP7:%.*]] = udiv i64 [[TMP6]], [[TMP5]] ; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x i64> undef, i64 [[TMP7]], i32 0 ; CHECK-NEXT: br label %[[PRED_UDIV_CONTINUE]] ; CHECK: [[PRED_UDIV_CONTINUE]]: @@ -37,9 +37,9 @@ ; CHECK-NEXT: br i1 [[TMP10]], label %[[PRED_UDIV_IF1:.*]], label %[[PRED_UDIV_CONTINUE2]] ; CHECK: [[PRED_UDIV_IF1]]: ; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i64> [[WIDE_LOAD]], i32 1 -; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x i64> [[WIDE_LOAD]], i32 1 -; CHECK-NEXT: [[TMP13:%.*]] = add nsw i64 [[TMP12]], %x -; CHECK-NEXT: [[TMP14:%.*]] = udiv i64 [[TMP11]], [[TMP13]] +; CHECK-NEXT: [[TMP12:%.*]] = add nsw i64 [[TMP11]], %x +; CHECK-NEXT: [[TMP13:%.*]] = extractelement <2 x i64> [[WIDE_LOAD]], i32 1 +; CHECK-NEXT: [[TMP14:%.*]] = udiv i64 [[TMP13]], [[TMP12]] ; CHECK-NEXT: [[TMP15:%.*]] = insertelement <2 x i64> [[TMP9]], i64 [[TMP14]], i32 1 ; CHECK-NEXT: br label %[[PRED_UDIV_CONTINUE2]] ; CHECK: [[PRED_UDIV_CONTINUE2]]: Index: test/Transforms/LoopVectorize/AArch64/predication_costs.ll =================================================================== --- test/Transforms/LoopVectorize/AArch64/predication_costs.ll +++ test/Transforms/LoopVectorize/AArch64/predication_costs.ll @@ -18,8 +18,8 @@ ; Cost of udiv: ; (udiv(2) + extractelement(6) + insertelement(3)) / 2 = 5 ; -; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3 ; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3 +; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3 ; define i32 @predicated_udiv(i32* %a, i32* %b, i1 %c, i64 %n) { entry: @@ -59,8 +59,8 @@ ; Cost of store: ; (store(4) + extractelement(3)) / 2 = 3 ; -; CHECK: Found an estimated cost of 3 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4 ; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4 +; CHECK: Found an estimated cost of 3 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4 ; define void @predicated_store(i32* %a, i1 %c, i32 %x, i64 %n) { entry: @@ -98,10 +98,10 @@ ; Cost of udiv: ; (udiv(2) + extractelement(3) + insertelement(3)) / 2 = 4 ; -; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp3 = add nsw i32 %tmp2, %x -; CHECK: Found an estimated cost of 4 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3 ; CHECK: Scalarizing: %tmp3 = add nsw i32 %tmp2, %x ; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3 +; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp3 = add nsw i32 %tmp2, %x +; CHECK: Found an estimated cost of 4 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3 ; define i32 @predicated_udiv_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) { entry: @@ -143,10 +143,10 @@ ; Cost of store: ; store(4) / 2 = 2 ; -; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp2 = add nsw i32 %tmp1, %x -; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4 ; CHECK: Scalarizing: %tmp2 = add nsw i32 %tmp1, %x ; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4 +; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp2 = add nsw i32 %tmp1, %x +; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4 ; define void @predicated_store_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) { entry: @@ -192,16 +192,16 @@ ; Cost of store: ; store(4) / 2 = 2 ; -; CHECK: Found an estimated cost of 1 for VF 2 For instruction: %tmp2 = add i32 %tmp1, %x -; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp3 = sdiv i32 %tmp1, %tmp2 -; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp3, %tmp2 -; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp5 = sub i32 %tmp4, %x -; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp5, i32* %tmp0, align 4 ; CHECK-NOT: Scalarizing: %tmp2 = add i32 %tmp1, %x ; CHECK: Scalarizing and predicating: %tmp3 = sdiv i32 %tmp1, %tmp2 ; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp3, %tmp2 ; CHECK: Scalarizing: %tmp5 = sub i32 %tmp4, %x ; CHECK: Scalarizing and predicating: store i32 %tmp5, i32* %tmp0, align 4 +; CHECK: Found an estimated cost of 1 for VF 2 For instruction: %tmp2 = add i32 %tmp1, %x +; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp3 = sdiv i32 %tmp1, %tmp2 +; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp3, %tmp2 +; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp5 = sub i32 %tmp4, %x +; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp5, i32* %tmp0, align 4 ; define void @predication_multi_context(i32* %a, i1 %c, i32 %x, i64 %n) { entry: Index: test/Transforms/LoopVectorize/SystemZ/load-store-scalarization-cost.ll =================================================================== --- test/Transforms/LoopVectorize/SystemZ/load-store-scalarization-cost.ll +++ test/Transforms/LoopVectorize/SystemZ/load-store-scalarization-cost.ll @@ -24,10 +24,10 @@ for.end: ret void -; CHECK: LV: Found an estimated cost of 4 for VF 4 For instruction: %tmp1 = load i32, i32* %tmp0, align 4 -; CHECK: LV: Found an estimated cost of 4 for VF 4 For instruction: store i32 %tmp2, i32* %tmp0, align 4 - ; CHECK: LV: Scalarizing: %tmp1 = load i32, i32* %tmp0, align 4 ; CHECK: LV: Scalarizing: store i32 %tmp2, i32* %tmp0, align 4 + +; CHECK: LV: Found an estimated cost of 4 for VF 4 For instruction: %tmp1 = load i32, i32* %tmp0, align 4 +; CHECK: LV: Found an estimated cost of 4 for VF 4 For instruction: store i32 %tmp2, i32* %tmp0, align 4 } Index: test/Transforms/LoopVectorize/if-pred-non-void.ll =================================================================== --- test/Transforms/LoopVectorize/if-pred-non-void.ll +++ test/Transforms/LoopVectorize/if-pred-non-void.ll @@ -219,9 +219,9 @@ ; CHECK: br i1 {{.*}}, label %[[IF0:.+]], label %[[CONT0:.+]] ; CHECK: [[IF0]]: ; CHECK: %[[T00:.+]] = extractelement <2 x i32> %wide.load, i32 0 -; CHECK: %[[T01:.+]] = extractelement <2 x i32> %wide.load, i32 0 -; CHECK: %[[T02:.+]] = add nsw i32 %[[T01]], %x -; CHECK: %[[T03:.+]] = udiv i32 %[[T00]], %[[T02]] +; CHECK: %[[T01:.+]] = add nsw i32 %[[T00]], %x +; CHECK: %[[T02:.+]] = extractelement <2 x i32> %wide.load, i32 0 +; CHECK: %[[T03:.+]] = udiv i32 %[[T02]], %[[T01]] ; CHECK: %[[T04:.+]] = insertelement <2 x i32> undef, i32 %[[T03]], i32 0 ; CHECK: br label %[[CONT0]] ; CHECK: [[CONT0]]: @@ -229,9 +229,9 @@ ; CHECK: br i1 {{.*}}, label %[[IF1:.+]], label %[[CONT1:.+]] ; CHECK: [[IF1]]: ; CHECK: %[[T06:.+]] = extractelement <2 x i32> %wide.load, i32 1 -; CHECK: %[[T07:.+]] = extractelement <2 x i32> %wide.load, i32 1 -; CHECK: %[[T08:.+]] = add nsw i32 %[[T07]], %x -; CHECK: %[[T09:.+]] = udiv i32 %[[T06]], %[[T08]] +; CHECK: %[[T07:.+]] = add nsw i32 %[[T06]], %x +; CHECK: %[[T08:.+]] = extractelement <2 x i32> %wide.load, i32 1 +; CHECK: %[[T09:.+]] = udiv i32 %[[T08]], %[[T07]] ; CHECK: %[[T10:.+]] = insertelement <2 x i32> %[[T05]], i32 %[[T09]], i32 1 ; CHECK: br label %[[CONT1]] ; CHECK: [[CONT1]]: Index: test/Transforms/LoopVectorize/induction.ll =================================================================== --- test/Transforms/LoopVectorize/induction.ll +++ test/Transforms/LoopVectorize/induction.ll @@ -309,59 +309,59 @@ ; ; CHECK-LABEL: @scalarize_induction_variable_05( ; CHECK: vector.body: -; CHECK: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue2 ] +; CHECK: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue{{[0-9]+}} ] ; CHECK: %[[I0:.+]] = add i32 %index, 0 ; CHECK: getelementptr inbounds i32, i32* %a, i32 %[[I0]] ; CHECK: pred.udiv.if: ; CHECK: udiv i32 {{.*}}, %[[I0]] -; CHECK: pred.udiv.if1: +; CHECK: pred.udiv.if{{[0-9]+}}: ; CHECK: %[[I1:.+]] = add i32 %index, 1 ; CHECK: udiv i32 {{.*}}, %[[I1]] ; ; UNROLL-NO_IC-LABEL: @scalarize_induction_variable_05( ; UNROLL-NO-IC: vector.body: -; UNROLL-NO-IC: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue11 ] +; UNROLL-NO-IC: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue{{[0-9]+}} ] ; UNROLL-NO-IC: %[[I0:.+]] = add i32 %index, 0 ; UNROLL-NO-IC: %[[I2:.+]] = add i32 %index, 2 ; UNROLL-NO-IC: getelementptr inbounds i32, i32* %a, i32 %[[I0]] ; UNROLL-NO-IC: getelementptr inbounds i32, i32* %a, i32 %[[I2]] ; UNROLL-NO-IC: pred.udiv.if: ; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I0]] -; UNROLL-NO-IC: pred.udiv.if6: +; UNROLL-NO-IC: pred.udiv.if{{[0-9]+}}: ; UNROLL-NO-IC: %[[I1:.+]] = add i32 %index, 1 ; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I1]] -; UNROLL-NO-IC: pred.udiv.if8: +; UNROLL-NO-IC: pred.udiv.if{{[0-9]+}}: ; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I2]] -; UNROLL-NO-IC: pred.udiv.if10: +; UNROLL-NO-IC: pred.udiv.if{{[0-9]+}}: ; UNROLL-NO-IC: %[[I3:.+]] = add i32 %index, 3 ; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I3]] ; ; IND-LABEL: @scalarize_induction_variable_05( ; IND: vector.body: -; IND: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue2 ] +; IND: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue{{[0-9]+}} ] ; IND: %[[E0:.+]] = sext i32 %index to i64 ; IND: getelementptr inbounds i32, i32* %a, i64 %[[E0]] ; IND: pred.udiv.if: ; IND: udiv i32 {{.*}}, %index -; IND: pred.udiv.if1: +; IND: pred.udiv.if{{[0-9]+}}: ; IND: %[[I1:.+]] = or i32 %index, 1 ; IND: udiv i32 {{.*}}, %[[I1]] ; ; UNROLL-LABEL: @scalarize_induction_variable_05( ; UNROLL: vector.body: -; UNROLL: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue11 ] +; UNROLL: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue{{[0-9]+}} ] ; UNROLL: %[[I2:.+]] = or i32 %index, 2 ; UNROLL: %[[E0:.+]] = sext i32 %index to i64 ; UNROLL: %[[G0:.+]] = getelementptr inbounds i32, i32* %a, i64 %[[E0]] ; UNROLL: getelementptr i32, i32* %[[G0]], i64 2 ; UNROLL: pred.udiv.if: ; UNROLL: udiv i32 {{.*}}, %index -; UNROLL: pred.udiv.if6: +; UNROLL: pred.udiv.if{{[0-9]+}}: ; UNROLL: %[[I1:.+]] = or i32 %index, 1 ; UNROLL: udiv i32 {{.*}}, %[[I1]] -; UNROLL: pred.udiv.if8: +; UNROLL: pred.udiv.if{{[0-9]+}}: ; UNROLL: udiv i32 {{.*}}, %[[I2]] -; UNROLL: pred.udiv.if10: +; UNROLL: pred.udiv.if{{[0-9]+}}: ; UNROLL: %[[I3:.+]] = or i32 %index, 3 ; UNROLL: udiv i32 {{.*}}, %[[I3]]