Index: docs/VectorizationPlan.rst
===================================================================
--- /dev/null
+++ docs/VectorizationPlan.rst
@@ -0,0 +1,574 @@
++++++
+VPlan
++++++
+
+Goal of initial VPlan patch
++++++++++++++++++++++++++++
+The design and implementation of VPlan follow our RFC [10]_ and presentation
+[11]_. The initial patch is designed to:
+
+- be a *lightweight* NFC patch;
+- show key aspects of VPlan's Hierarchical CFG concept;
+- demonstrate how VPlan can
+
+  * capture *all* current vectorization decisions: which instructions are to
+
+    + be vectorized "on their own", or
+    + be part of an interleave group, or
+    + be scalarized, and optionally have scalar instances moved down to other
+      basic blocks and under a condition; and
+    + be packed or unpacked (at the definition rather than at its uses) to
+      provide both scalarized and vectorized forms; and
+
+  * represent all control-flow *within the loop body* of the vectorized code
+    version.
+
+- be a step towards
+
+  * aligning the Cost step with the Transformation step,
+  * representing the entire code being transformed,
+  * adding optimizations:
+
+    + optimizing conditional scalarization further,
+    + retaining uniform control-flow,
+    + vectorizing outer loops,
+    + and more.
+
+Out of scope for the initial patch:
+
+- changing how a loop is checked if it can be vectorized - "Legal";
+- changing how a loop is checked if it should be vectorized - "Cost".
+
+
+==================
+Vectorization Plan
+==================
+
+.. contents::
+   :local:
+
+Overview
+========
+The Vectorization Plan is an explicit recipe for describing a vectorization
+candidate. It serves both for estimating the cost reliably and for performing
+the translation, and facilitates dealing with multiple vectorization
+candidates.
+
+The overall structure consists of:
+
+1. One LoopVectorizationPlanner for each attempt to vectorize a loop or a loop
+   nest.
+
+2. A LoopVectorizationPlanner can construct, optimize and discard one or more
+   VPlans, providing different ways to vectorize the loop or the loop nest.
+
+3. Once the best VPlan is determined, including the best vectorization factor
+   and unroll factor, this VPlan drives the vector code generation using a
+   VPTransformState object.
+
+4. Each VPlan represents the loop or the loop nest using a hierarchical CFG.
+
+5. At the bottom level of the hierarchical CFG are VPBasicBlocks.
+
+6. Each VPBasicBlock consists of one or more VPRecipes to generate Instructions
+   for it.
+
+Motivation
+----------
+The vectorization transformation can be rather complicated, involving several
+potential alternatives, especially for outer loops [1]_ but also possibly for
+innermost loops. These alternatives may have significant performance impact,
+both positive and negative. A cost model is therefore employed to identify the
+best alternative, including the alternative of avoiding any transformation
+altogether.
+
+The process of vectorization traditionally involves three major steps: Legal,
+Cost, and Transform. This is the general case in LLVM's LoopVectorizer:
+
+1. Legal Step: check if the loop can be legally vectorized; encode constraints
+   and artifacts if so.
+2. Cost Step: compute the relative cost of vectorizing it along possible
+   vectorization and unroll factors (VF, UF).
+3. Transform Step: vectorize the loop according to the best VF and UF.
+
+This design, in which all steps work directly on the original LLVM-IR, has
+some implications:
+
+1. The Cost Step tries to predict what the vectorized loop will look like and
+   how much it will cost, independently of what the Transform Step will
+   eventually do. It is hard to keep the two in sync.
+2. The Cost Step essentially considers a single vectorization candidate. Any
+   alternatives are immediately evaluated and resolved.
+3. The Legal Step does more than check for vectorizability; e.g., it records
+   auxiliary artifacts such as collectLoopUniforms() and InterleaveInfo.
+4. The Transform Step first populates the single basic block of the vectorized
+   loop and later revisits scalarized instructions to predicate them one by
+   one, as needed.
+
+The Vectorization Plan is designed to explicitly model a vectorization
+candidate to overcome the above constraints, which is especially important for
+the vectorization of outer loops. This affects the overall process by
+essentially splitting the Transform Step into a Plan Step and a Code-Gen Step:
+
+1. Legal Step: check if the loop can be legally vectorized; encode constraints
+   and artifacts if so. Initiate the Vectorization Plan showing how the loop
+   can be vectorized only after passing Legal, to save redundant construction.
+2. Plan Step:
+
+   a. Build initial Vectorization Plans following the constraints and
+      decisions taken by Legal.
+   b. Explore ways to optimize the Vectorization Plan, complying with all
+      legal constraints, possibly constructing several plans following
+      tentative vectorization decisions.
+3. Cost Step: compute the relative cost of each plan. This step can be applied
+   repeatedly by Plan Step 2.b.
+4. Code-Gen Step: materialize the best plan. Note that only this step modifies
+   the IR, as in the current Loop Vectorizer.
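+
+To make the effect of this split concrete, here is a condensed sketch of the
+flow from the caller's perspective, assuming the interfaces shown in the
+integration section below; diagnostics, the interleave-only path and other
+details are omitted for brevity:
+
+.. code-block:: c++
+
+  // Condensed, illustrative flow only; see the integration with
+  // LoopVectorizePass::processLoop() below for the complete code.
+  LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, &CM);
+
+  // Plan Step + Cost Step: build and optimize VPlans for the candidate VF's,
+  // then pick the most profitable vectorization factor.
+  LoopVectorizationCostModel::VectorizationFactor VF =
+      LVP.plan(OptForSize, UserVF, MaxVF);
+
+  if (VF.Width > 1) {
+    // Code-Gen Step: only now is the IR modified, by executing the best plan.
+    InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC,
+                           &LVL, &CM);
+    LVP.executeBestPlan(LB);
+  }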
+
+The Cost Step can also be split into an Early-Pruning Step (or Steps) and a
+"Cost-Gen" Step, where the former applies quick yet inaccurate estimates to
+prune obviously-unpromising candidates, and the latter applies more accurate
+estimates based on a full Plan.
+
+One can compare with LLVM's existing SLP vectorizer, where TSLP [3]_ adds
+Step 2.b.
+
+As the scope of vectorization grows from innermost to outer loops, so do the
+uncertainty and complexity of each step. One way to mitigate the shortcomings
+of the Legal and Cost steps is to rely on programmers to indicate which loops
+can and/or should be vectorized. This is implicit for certain loops in
+data-parallel languages such as OpenCL [4]_, [5]_ and explicit in others such
+as OpenMP [6]_. This design to extend the Loop Vectorizer to outer loops
+supports and raises the importance of explicit vectorization beyond the
+current capabilities of Clang and LLVM: namely, moving from forcing the
+vectorization of innermost loops according to a prescribed width and/or
+interleaving count, to supporting OpenMP's "#pragma omp simd" construct and
+its associated clauses, including vectorizing across function boundaries [2]_.
+
+References
+----------
+.. [1] "Outer-loop vectorization: revisited for short SIMD architectures",
+   Dorit Nuzman and Ayal Zaks, PACT 2008.
+
+.. [2] "Proposal for function vectorization and loop vectorization with
+   function calls", Xinmin Tian, [`cfe-dev `_], March 2, 2016.
+   See also `review `_.
+
+.. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios
+   Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015.
+
+.. [4] "Intel OpenCL SDK Vectorizer", Nadav Rotem, LLVM Developers' Meeting
+   2011.
+
+.. [5] "Automatic SIMD Vectorization of SSA-based Control Flow Graphs", Ralf
+   Karrenberg, Springer 2015. 
See also "Improving Performance of OpenCL on + CPUs", LLVM Developers' Meeting 2012. + +.. [6] "Compiling C/C++ SIMD Extensions for Function and Loop Vectorization on + Multicore-SIMD Processors", Xinmin Tian and Hideki Saito et al., + IPDPSW 2012. + +.. [7] "Exploiting mixed SIMD parallelism by reducing data reorganization + overhead", Hao Zhou and Jingling Xue, CGO 2016. + +.. [8] "Register Allocation via Hierarchical Graph Coloring", David Callahan and + Brian Koblenz, PLDI 1991 + +.. [9] "Structural analysis: A new approach to flow analysis in optimizing + compilers", M. Sharir, Journal of Computer Languages, Jan. 1980 + +.. [10] "RFC: Extending LV to vectorize outerloops", [`llvm-dev + `_], + September 21, 2016. + +.. [11] "Extending LoopVectorizer towards supporting OpenMP4.5 SIMD and outer + loop auto-vectorization", Hideki Saito, `LLVM Developers' Meeting 2016 + `_, November 3, 2016. + +Examples +-------- +An example with a single predicated scalarized instruction - integer division: + +.. code-block:: c + + void foo(int* a, int b, int* c) { + #pragma simd + for (int i = 0; i < 10000; ++i) + if (a[i] > 777) + a[i] = b - (c[i] + a[i] / b); + } + + +IR Dump Before Loop Vectorization: + +.. code-block:: LLVM + :emphasize-lines: 6,11 + + for.body: ; preds = %for.inc, %entry + %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.inc ] + %arrayidx = getelementptr inbounds i32, i32* %a, i64 %indvars.iv + %0 = load i32, i32* %arrayidx, align 4, !tbaa !1 + %cmp1 = icmp sgt i32 %0, 777 + br i1 %cmp1, label %if.then, label %for.inc + + if.then: ; preds = %for.body + %arrayidx3 = getelementptr inbounds i32, i32* %c, i64 %indvars.iv + %1 = load i32, i32* %arrayidx3, align 4, !tbaa !1 + %div = sdiv i32 %0, %b + %add.neg = sub i32 %b, %1 + %sub = sub i32 %add.neg, %div + store i32 %sub, i32* %arrayidx, align 4, !tbaa !1 + br label %for.inc + + for.inc: ; preds = %for.body, %if.then + %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 + %exitcond = icmp eq i64 %indvars.iv.next, 10000 + br i1 %exitcond, label %for.cond.cleanup, label %for.body + +The VPlan that is built initially: + +.. image:: VPlanPrinter.png + +Design Guidelines +================= +1. Analysis-like: building and manipulating the Vectorization Plan must not + modify the IR. In particular, if a VPlan is discarded + compilation should proceed as if the VPlan had not been built. + +2. Support all current capabilities: the Vectorization Plan must be capable of + representing the exact functionality of LLVM's existing Loop Vectorizer. + In particular, the transition can start with an NFC patch. + In particular, VPlan must support efficient selection of VF and/or UF. + +3. Align Cost & CodeGen: the Vectorization Plan must serve both the cost + model and the code generation phases, where the cost estimation must + evaluate the to-be-generated code reliably. + +4. Support vectorizing additional constructs: + + a. vectorization of Outer-loops. + In particular, VPlan must be able to represent the control-flow of a + vectorized loop which may include multiple basic-blocks and nested loops. + b. SLP vectorization. + c. Combinations of the above, including nested vectorization: vectorizing + both an inner loop and an outerloop at the same time (each with its own + VF and UF), mixed vectorization: vectorizing a loop and SLP patterns + inside [7]_, (re)vectorizing vector code. + +5. Support multiple candidates efficiently: + In particular, similar candidates related to a range of possible VF's and + UF's must be represented efficiently. 
+   In particular, potential versionings must also be supported efficiently.
+
+6. Compact: the Vectorization Plan must be efficient and provide as compact a
+   representation as possible, in particular where the transformation is
+   straightforward and where the plan is to reuse existing IR (e.g., for
+   leftover iterations).
+
+VPlan Classes: Definitions
+==========================
+
+:VPlan:
+  A recipe for generating a vectorized version from a given IR code.
+  Takes a "scenario-based approach" to vectorization planning.
+  The given IR code is required to be SESE, mainly to simplify dominance
+  information. The vectorized version is represented using a Hierarchical CFG.
+
+:Hierarchical CFG:
+  A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's.
+  The Hierarchical CFG data structure we use is similar to the Tile Tree [8]_,
+  where cross-Tile edges are lifted to connect Tiles instead of the original
+  basic-blocks as in Sharir [9]_, promoting the Tile encapsulation. We use the
+  terms Region and Block rather than Tile [8]_ to avoid confusion with loop
+  tiling.
+
+:VPBasicBlock:
+  Serves as the leaf of the Hierarchical CFG. Represents a sequence of
+  instructions that will appear consecutively in a basic block of the
+  vectorized version; the instructions of such a basic block may originate
+  from one or more VPBasicBlocks.
+  The VPBasicBlock takes care of the control-flow relations with other
+  VPBasicBlock's and Regions, and holds a sequence of zero or more VPRecipe's
+  that take care of representing the instructions.
+  A VPBasicBlock that holds no VPRecipe's represents no instructions; this
+  may happen, e.g., to support disjoint Regions and to ensure Regions have a
+  single exit, possibly an empty one.
+
+:VPRecipeBase:
+  A base class describing one or more instructions that will appear
+  consecutively in the vectorized version, based on Instructions from the
+  given IR. These Instructions are referred to as the "Ingredients" of the
+  Recipe. A Recipe specifies how its Ingredients are to be vectorized: e.g.,
+  copy or reuse them as uniform, scalarize or vectorize them according to an
+  enclosing loop dimension, or vectorize them according to an internal SLP
+  dimension.
+
+  **Design principle:** in order to reason about how to vectorize an
+  Instruction or how much it would cost, one has to consult the VPRecipe
+  holding it.
+
+  **Design principle:** when a sequence of instructions conveys additional
+  information as a group, we use a VPRecipe to encapsulate them and attach
+  this information to the VPRecipe. For instance, a VPRecipe can model an
+  interleave group of loads or stores with additional information for
+  calculating their cost and performing code-gen, as a group.
+
+  **Design principle:** where possible, a VPRecipe should reuse the existing
+  container of its Ingredients. A new container should be opened on-demand,
+  e.g., to facilitate changing the order of Instructions between the original
+  and vectorized versions.
+
+:VPOneByOneRecipeBase:
+  Represents recipes which transform each Instruction in their Ingredients
+  independently, in order. The Ingredients are a sub-sequence of original
+  Instructions, which reside in the same IR BasicBlock and in the same order.
+  The Ingredients are accessed by a pointer to the first and last Instruction
+  in their original IR basic block. Serves as a base class for the concrete
+  sub-classes VPScalarizeOneByOneRecipe and VPVectorizeOneByOneRecipe.
+
+:VPScalarizeOneByOneRecipe:
+  A concrete VPRecipe which scalarizes each ingredient, generating either
+  instances of lane 0 for a uniform instruction, or instances for a range of
+  lanes otherwise.
+
+:VPVectorizeOneByOneRecipe:
+  A concrete VPRecipe which vectorizes each ingredient.
+
+:VPInterleaveRecipe:
+  A concrete VPRecipe which transforms an interleave group of loads or stores
+  into one wide load/store and shuffles.
+
+:VPConditionBitRecipeBase:
+  A base class for VPRecipes which provide the condition bit feeding a
+  conditional branch. Such cases correspond to scalarized or uniform branches.
+
+:VPExtractMaskBitRecipe:
+  A concrete VPRecipe which represents the extraction of a bit from a mask,
+  needed when scalarizing a conditional branch.
+  Such branches are needed to guard scalarized and predicated instructions.
+
+:VPMergeScalarizeBranchRecipe:
+  A concrete VPRecipe which represents the Phi's needed when control converges
+  back from a scalarized branch.
+  Such Phi's are needed to merge live-out values that are set under a
+  scalarized branch. They can be scalar or vector, depending on the user of
+  the live-out value.
+
+:VPWidenIntInductionRecipe:
+  A concrete VPRecipe which widens integer inductions, producing their vector
+  values and computing the necessary values for producing their scalar values.
+  The scalar values themselves are generated, possibly elsewhere, by the
+  complementing VPBuildScalarStepsRecipe.
+
+:VPBuildScalarStepsRecipe:
+  A concrete VPRecipe complementing the handling of integer induction
+  variables, responsible for generating the scalar values used by the IV's
+  scalar users.
+
+:VPRegionBlock:
+  A collection of VPBasicBlocks and VPRegionBlocks which form a
+  single-entry-single-exit subgraph of the CFG in the vectorized code.
+
+  **Design principle:** when some additional information relates to an SESE
+  set of VPBlocks, we use a VPRegionBlock to wrap them and attach the
+  information to it. For example, a VPRegionBlock can be used to indicate that
+  a scalarized SESE region is to be replicated. It is also designed to serve
+  predicating divergent branches while retaining uniform branches as much as
+  possible/desirable, and to represent inner loops.
+
+:VPBlockBase:
+  The building block of the Hierarchical CFG. A VPBlockBase can be either a
+  VPBasicBlock or a VPRegionBlock.
+  A VPBlockBase may indicate that its contents are to be replicated several
+  times. This is designed to support scalarizing VPBlockBases, which generate
+  VF replicas of their instructions that in turn remain scalar, and to do so
+  using a single VPlan for multiple candidate VF's.
+
+:VPTransformState:
+  Stores information used for code generation, passed from the Planner to its
+  selected VPlan for execution, and used to pass additional information down
+  from VPBlocks to the VPRecipes.
+
+:VPlanUtils:
+  Contains a collection of methods for the construction and modification of
+  abstract VPlans.
+
+:VPlanUtilsLoopVectorizer:
+  Derived from VPlanUtils, providing additional methods for the construction
+  and modification of VPlans specific to the Loop Vectorizer.
+
+:LoopVectorizationPlanner:
+  The object in charge of creating and manipulating VPlans for a given IR code.
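+
+To illustrate how these classes are intended to interact during code
+generation, the following is a simplified, illustrative sketch. Only
+VPRecipeBase::vectorize(VPTransformState &) is taken from the patch; the
+traversal helpers (blocksOf, recipesOf) and the overall shape are hypothetical
+and omit details such as visit order and replication counts:
+
+.. code-block:: c++
+
+  // Illustrative only: walk the Hierarchical CFG of the selected VPlan and
+  // let each recipe generate its Instructions into the vectorized loop.
+  void executeBlock(VPBlockBase *Block, VPTransformState &State) {
+    if (auto *Region = dyn_cast<VPRegionBlock>(Block)) {
+      // A Region may be marked for replication, e.g., to scalarize a
+      // predicated SESE sub-graph; its contents are visited accordingly.
+      for (VPBlockBase *Inner : blocksOf(Region)) // hypothetical traversal
+        executeBlock(Inner, State);
+      return;
+    }
+    // A VPBasicBlock holds zero or more recipes; each recipe knows how its
+    // Ingredients are scalarized, vectorized or widened.
+    for (VPRecipeBase &Recipe : recipesOf(cast<VPBasicBlock>(Block)))
+      Recipe.vectorize(State); // generates the vectorized Instructions
+  }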
+
+VPlan Classes: Diagram
+======================
+
+The classes of VPlan with main fields and methods; sub-classes of VPRecipeBase
+are shown in a separate figure:
+
+.. image:: VPlanUML.png
+
+
+The class hierarchy of VPlan's VPRecipeBase class:
+
+.. 
image:: VPlanRecipesUML.png + + +Integration with LoopVectorize.cpp/processLoop() +================================================ + +Here's the integration within LoopVectorize.cpp's existing flow, in +LoopVectorizePass::processLoop(Loop \*L): + +1. Plan only after passing all early bail-outs: + + a. including those that take place after Legal, which is kept intact; + b. including those that use the Cost Model - refactor it slightly to expose + its MaxVF upper bound and canVectorize() early exit: + +.. code-block:: c++ + + // Check if the target supports potentially unsafe FP vectorization. + // FIXME: Add a check for the type of safety issue (denormal, signaling) + // for the target we're vectorizing for, to make sure none of the + // additional fp-math flags can help. + if (Hints.isPotentiallyUnsafe() && + TTI->isFPVectorizationPotentiallyUnsafe()) { + DEBUG(dbgs() << "LV: Potentially unsafe FP op prevents vectorization.\n"); + ORE->emit( + createMissedAnalysis(Hints.vectorizeAnalysisPassName(), "UnsafeFP", L) + << "loop not vectorized due to unsafe FP support."); + emitMissedWarning(F, L, Hints, ORE); + return false; + } + + if (!CM.canVectorize(OptForSize)) + return false; + + // Early prune excessive VF's + unsigned MaxVF = CM.computeMaxVectorizationFactor(OptForSize); + + // If OptForSize, MaxVF is the only VF we consider. Abort if it needs a tail. + if (OptForSize && CM.requiresTail(MaxVF)) + return false; + +2. Plan: + + a. build VPlans for relevant VF's and optimize them, + b. compute best cost using Cost Model as before, + c. compute best interleave-count using Cost Model as before. Above two + steps are refactored into LVP.plan() (see below): + +.. code-block:: c++ + + // Use the planner. + LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, &CM); + + // Get user vectorization factor. + unsigned UserVF = Hints.getWidth(); + + // Select the vectorization factor. + LoopVectorizationCostModel::VectorizationFactor VF = + LVP.plan(OptForSize, UserVF, MaxVF); + bool VectorizeLoop = (VF.Width > 1); + + std::pair VecDiagMsg, IntDiagMsg; + + if (!UserVF && !VectorizeLoop) { + DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n"); + VecDiagMsg = std::make_pair( + "VectorizationNotBeneficial", + "the cost-model indicates that vectorization is not beneficial"); + } + + // Select the interleave count. + unsigned IC = CM.selectInterleaveCount(OptForSize, VF.Width, VF.Cost); + + // Get user interleave count. + unsigned UserIC = Hints.getInterleave(); + +3. Transform: + + a. invoke an Unroller to unroll the loop (as before), or + b. invoke LVP.executeBestPlan() to vectorize the loop: + +.. code-block:: c++ + + if (!VectorizeLoop) { + assert(IC > 1 && "interleave count should not be 1 or 0"); + // If we decided that it is not legal to vectorize the loop, then + // interleave it. + InnerLoopUnroller Unroller(L, PSE, LI, DT, TLI, TTI, AC, ORE, IC, &LVL, + &CM); + Unroller.vectorize(); + + ORE->emit(OptimizationRemark(LV_NAME, "Interleaved", L->getStartLoc(), + L->getHeader()) + << "interleaved loop (interleaved count: " + << NV("InterleaveCount", IC) << ")"); + } else { + + // If we decided that it is \* legal \* to vectorize the loop, then do it. + InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC, + &LVL, &CM); + + LVP.executeBestPlan(LB); + + ++LoopsVectorized; + + // Add metadata to disable runtime unrolling a scalar loop when there are + // no runtime checks about strides and memory. A scalar loop that is + // rarely used is not worth unrolling. 
+ if (!LB.areSafetyChecksAdded()) + AddRuntimeUnrollDisableMetaData(L); + + // Report the vectorization decision. + ORE->emit(OptimizationRemark(LV_NAME, "Vectorized", L->getStartLoc(), + L->getHeader()) + << "vectorized loop (vectorization width: " + << NV("VectorizationFactor", VF.Width) + << ", interleaved count: " << NV("InterleaveCount", IC) << ")"); + } + + // Mark the loop as already vectorized to avoid vectorizing again. + Hints.setAlreadyVectorized(); + +4. Plan, refactored into LVP.plan(): + + a. build VPlans for relevant VF's and optimize them, + b. compute best cost using Cost Model as before: + +.. code-block:: c++ + + LoopVectorizationCostModel::VectorizationFactor + LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF, + unsigned MaxVF) { + if (UserVF) { + DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n"); + if (UserVF == 1) + return {UserVF, 0}; + assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two"); + // Collect the instructions (and their associated costs) that will be more + // profitable to scalarize. + CM->collectInstsToScalarize(UserVF); + buildInitialVPlans(UserVF, UserVF); + DEBUG(printCurrentPlans("Initial VPlans", dbgs())); + optimizePredicatedInstructions(); + DEBUG(printCurrentPlans("After optimize predicated instructions",dbgs())); + return {UserVF, 0}; + } + if (MaxVF == 1) + return {1, 0}; + + assert(MaxVF > 1 && "MaxVF is zero."); + // Collect the instructions (and their associated costs) that will be more + // profitable to scalarize. + for (unsigned i = 2; i <= MaxVF; i = i+i) + CM->collectInstsToScalarize(i); + buildInitialVPlans(2, MaxVF); + DEBUG(printCurrentPlans("Initial VPlans", dbgs())); + optimizePredicatedInstructions(); + DEBUG(printCurrentPlans("After optimize predicated instructions", dbgs())); + // Select the optimal vectorization factor. + return CM->selectVectorizationFactor(OptForSize, MaxVF); + } Index: docs/Vectorizers.rst =================================================================== --- docs/Vectorizers.rst +++ docs/Vectorizers.rst @@ -380,6 +380,18 @@ .. image:: linpack-pc.png +Internals +--------- + +.. toctree:: + :hidden: + + VectorizationPlan + +:doc:`VectorizationPlan` + The loop vectorizer is based on an abstract representation called Vectorization Plan. + This document describes its philosophy and design. + .. _slp-vectorizer: The SLP Vectorizer Index: lib/Transforms/Vectorize/CMakeLists.txt =================================================================== --- lib/Transforms/Vectorize/CMakeLists.txt +++ lib/Transforms/Vectorize/CMakeLists.txt @@ -4,6 +4,7 @@ LoopVectorize.cpp SLPVectorizer.cpp Vectorize.cpp + VPlan.cpp ADDITIONAL_HEADER_DIRS ${LLVM_MAIN_INCLUDE_DIR}/llvm/Transforms Index: lib/Transforms/Vectorize/LoopVectorize.cpp =================================================================== --- lib/Transforms/Vectorize/LoopVectorize.cpp +++ lib/Transforms/Vectorize/LoopVectorize.cpp @@ -47,6 +47,7 @@ //===----------------------------------------------------------------------===// #include "llvm/Transforms/Vectorize/LoopVectorize.h" +#include "VPlan.h" #include "llvm/ADT/DenseMap.h" #include "llvm/ADT/Hashing.h" #include "llvm/ADT/MapVector.h" @@ -97,6 +98,7 @@ #include "llvm/Transforms/Utils/LoopVersioning.h" #include "llvm/Transforms/Vectorize.h" #include +#include #include #include @@ -399,6 +401,9 @@ /// LoopVectorizationLegality class to provide information about the induction /// and reduction variables that were found to a given vectorization factor. 
class InnerLoopVectorizer { + friend class LoopVectorizationPlanner; + friend class llvm::VPlan; + public: InnerLoopVectorizer(Loop *OrigLoop, PredicatedScalarEvolution &PSE, LoopInfo *LI, DominatorTree *DT, @@ -445,7 +450,8 @@ // When we if-convert we need to create edge masks. We have to cache values // so that we don't end up with exponential recursion/IR. typedef DenseMap, VectorParts> - EdgeMaskCache; + EdgeMaskCacheTy; + typedef DenseMap BlockMaskCacheTy; /// Create an empty loop, based on the loop ranges of the old loop. void createEmptyLoop(); @@ -461,43 +467,44 @@ /// Copy and widen the instructions from the old loop. virtual void vectorizeLoop(); + /// Handle all cross-iteration phis in the header. + void fixCrossIterationPHIs(); + /// Fix a first-order recurrence. This is the second phase of vectorizing /// this phi node. void fixFirstOrderRecurrence(PHINode *Phi); + /// Fix a reduction cross-iteration phi. This is the second phase of + /// vectorizing this phi node. + void fixReduction(PHINode *Phi); + /// \brief The Loop exit block may have single value PHI nodes where the /// incoming value is 'Undef'. While vectorizing we only handled real values /// that were defined inside the loop. Here we fix the 'undef case'. /// See PR14725. void fixLCSSAPHIs(); - /// Iteratively sink the scalarized operands of a predicated instruction into - /// the block that was created for it. - void sinkScalarOperands(Instruction *PredInst); - - /// Predicate conditional instructions that require predication on their - /// respective conditions. - void predicateInstructions(); - /// Collect the instructions from the original loop that would be trivially /// dead in the vectorized loop if generated. - void collectTriviallyDeadInstructions(); + static void collectTriviallyDeadInstructions( + Loop *OrigLoop, LoopVectorizationLegality *Legal, + SmallPtrSetImpl &DeadInstructions); /// Shrinks vector element sizes to the smallest bitwidth they can be legally /// represented as. void truncateToMinimalBitwidths(); +public: /// A helper function that computes the predicate of the block BB, assuming /// that the header block of the loop is set to True. It returns the *entry* /// mask for the block BB. VectorParts createBlockInMask(BasicBlock *BB); + +protected: /// A helper function that computes the predicate of the edge between SRC /// and DST. VectorParts createEdgeMask(BasicBlock *Src, BasicBlock *Dst); - /// A helper function to vectorize a single BB within the innermost loop. - void vectorizeBlockInLoop(BasicBlock *BB, PhiVector *PV); - /// Vectorize a single PHINode in a block. This method handles the induction /// variable canonicalization. It supports both VF = 1 for unrolled loops and /// arbitrary length vectors. @@ -508,13 +515,69 @@ /// and update the analysis passes. void updateAnalysis(); - /// This instruction is un-vectorizable. Implement it as a sequence - /// of scalars. If \p IfPredicateInstr is true we need to 'hide' each - /// scalarized instruction behind an if block predicated on the control - /// dependence of the instruction. - virtual void scalarizeInstruction(Instruction *Instr, - bool IfPredicateInstr = false); +public: + /// A helper function to vectorize a single Instruction in the innermost loop. + virtual void vectorizeInstruction(Instruction &I); + + /// A helper function to scalarize a single Instruction in the innermost loop. 
+ /// Generates a sequence of scalar instances for each lane between \p MinLane + /// and \p MaxLane, times each part between \p MinPart and \p MaxPart, + /// inclusive.. + void scalarizeInstruction(Instruction *Instr, unsigned MinPart, + unsigned MaxPart, unsigned MinLane, + unsigned MaxLane); + + /// Return a value in the new loop corresponding to \p V from the original + /// loop at unroll index \p Part and vector index \p Lane. If the value has + /// been vectorized but not scalarized, the necessary extractelement + /// instruction will be generated. + Value *getScalarValue(Value *V, unsigned Part, unsigned Lane); + + /// Set a value in the new loop corresponding to \p V from the original + /// loop at unroll index \p Part and vector index \p Lane. The scalar parts + /// for this value must already be initialized. + void setScalarValue(Value *V, unsigned Part, unsigned Lane, Value *Scalar) { + assert(VectorLoopValueMap.hasScalar(V) && + "Cannot set an uninitialized scalar value"); + VectorLoopValueMap.ScalarMapStorage[V][Part][Lane] = Scalar; + } + + /// Return a value in the new loop corresponding to \p V from the original + /// loop at unroll index \p Part. If there isn't one, return a null pointer. + /// Note that the value returned may also be a null pointer if that specific + /// part has not been generated yet. + Value *getVectorValue(Value *V, unsigned Part) { + if (!VectorLoopValueMap.hasVector(V)) + return nullptr; + return VectorLoopValueMap.VectorMapStorage[V][Part]; + } + + /// Set a value in the new loop corresponding to \p V from the original + /// loop at unroll index \p Part. The vector parts for this value must already + /// be initialized. + void setVectorValue(Value *V, unsigned Part, Value *Vector) { + assert(VectorLoopValueMap.hasVector(V) && + "Cannot set an uninitialized vector value"); + VectorLoopValueMap.VectorMapStorage[V][Part] = Vector; + } + + /// Construct the vector value of a scalarized value \p V one lane at a time. + /// This method is for predicated instructions where we'd like the + /// insert-element instructions to reside in the predicated block to have + /// them execute only if needed. + void constructVectorValue(Value *V, unsigned Part, unsigned Lane); + /// Return a constant reference to the VectorParts corresponding to \p V from + /// the original loop. If the value has already been vectorized, the + /// corresponding vector entry in VectorLoopValueMap is returned. If, + /// however, the value has a scalar entry in VectorLoopValueMap, we construct + /// new vector values on-demand by inserting the scalar values into vectors + /// with an insertelement sequence. If the value has been neither vectorized + /// nor scalarized, it must be loop invariant, so we simply broadcast the + /// value into vectors. + const VectorParts &getVectorValue(Value *V); + +protected: /// Vectorize Load and Store instructions, virtual void vectorizeMemoryInstruction(Instruction *Instr); @@ -532,13 +595,6 @@ Instruction::BinaryOps Opcode = Instruction::BinaryOpsEnd); - /// Compute scalar induction steps. \p ScalarIV is the scalar induction - /// variable on which to base the steps, \p Step is the size of the step, and - /// \p EntryVal is the value from the original loop that maps to the steps. - /// Note that \p EntryVal doesn't have to be an induction variable (e.g., it - /// can be a truncate instruction). - void buildScalarSteps(Value *ScalarIV, Value *Step, Value *EntryVal); - /// Create a vector induction phi node based on an existing scalar one. 
This /// currently only works for integer induction variables with a constant /// step. \p EntryVal is the value from the original loop that maps to the @@ -548,10 +604,6 @@ void createVectorIntInductionPHI(const InductionDescriptor &II, Instruction *EntryVal); - /// Widen an integer induction variable \p IV. If \p Trunc is provided, the - /// induction variable will first be truncated to the corresponding type. - void widenIntInduction(PHINode *IV, TruncInst *Trunc = nullptr); - /// Returns true if an instruction \p I should be scalarized instead of /// vectorized for the chosen vectorization factor. bool shouldScalarizeInstruction(Instruction *I) const; @@ -559,25 +611,25 @@ /// Returns true if we should generate a scalar version of \p IV. bool needsScalarInduction(Instruction *IV) const; - /// Return a constant reference to the VectorParts corresponding to \p V from - /// the original loop. If the value has already been vectorized, the - /// corresponding vector entry in VectorLoopValueMap is returned. If, - /// however, the value has a scalar entry in VectorLoopValueMap, we construct - /// new vector values on-demand by inserting the scalar values into vectors - /// with an insertelement sequence. If the value has been neither vectorized - /// nor scalarized, it must be loop invariant, so we simply broadcast the - /// value into vectors. - const VectorParts &getVectorValue(Value *V); - - /// Return a value in the new loop corresponding to \p V from the original - /// loop at unroll index \p Part and vector index \p Lane. If the value has - /// been vectorized but not scalarized, the necessary extractelement - /// instruction will be generated. - Value *getScalarValue(Value *V, unsigned Part, unsigned Lane); - +public: /// Try to vectorize the interleaved access group that \p Instr belongs to. void vectorizeInterleaveGroup(Instruction *Instr); + /// Widen an integer induction variable \p IV. If \p Trunc is provided, the + /// induction variable will first be truncated to the corresponding type. + std::pair widenIntInduction(bool NeedsScalarIV, PHINode *IV, + TruncInst *Trunc = nullptr); + + /// Compute scalar induction steps. \p ScalarIV is the scalar induction + /// variable on which to base the steps, \p Step is the size of the step, and + /// \p EntryVal is the value from the original loop that maps to the steps. + /// Note that \p EntryVal doesn't have to be an induction variable (e.g., it + /// can be a truncate instruction). + void buildScalarSteps(Value *ScalarIV, Value *Step, Value *EntryVal, + unsigned MinPart, unsigned MaxPart, unsigned MinLane, + unsigned MaxLane); + +protected: /// Generate a shuffle sequence that will reverse the vector Vec. virtual Value *reverseVector(Value *Vec); @@ -694,6 +746,16 @@ return ScalarMapStorage[Key]; } + ScalarParts &getOrCreateScalar(Value *Key, unsigned Lanes) { + if (!hasScalar(Key)) { + ScalarParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) + Entry[Part].resize(Lanes); + ScalarMapStorage[Key] = Entry; + } + return ScalarMapStorage[Key]; + } + /// \return A reference to the vector map entry corresponding to \p Key. /// The key should already be in the map. This function should only be used /// when it's necessary to update values that have already been vectorized. 
@@ -712,6 +774,15 @@ friend const VectorParts &InnerLoopVectorizer::getVectorValue(Value *V); friend Value *InnerLoopVectorizer::getScalarValue(Value *V, unsigned Part, unsigned Lane); + friend Value *InnerLoopVectorizer::getVectorValue(Value *V, unsigned Part); + friend void InnerLoopVectorizer::setScalarValue(Value *V, unsigned Part, + unsigned Lane, + Value *Scalar); + friend void InnerLoopVectorizer::setVectorValue(Value *V, unsigned Part, + Value *Vector); + friend void InnerLoopVectorizer::constructVectorValue(Value *V, + unsigned Part, + unsigned Lane); private: /// The unroll factor. Each entry in the vector map contains UF vector @@ -765,9 +836,11 @@ /// many different vector instructions. unsigned UF; +public: /// The builder that we use IRBuilder<> Builder; +protected: // --- Vectorization state --- /// The vector-loop preheader. @@ -796,10 +869,8 @@ /// vectorized and scalarized. ValueMap VectorLoopValueMap; - /// Store instructions that should be predicated, as a pair - /// - SmallVector, 4> PredicatedInstructions; - EdgeMaskCache MaskCache; + EdgeMaskCacheTy EdgeMaskCache; + BlockMaskCacheTy BlockMaskCache; /// Trip count of the original loop. Value *TripCount; /// Trip count of the widened loop (TripCount - TripCount % (VF*UF)) @@ -814,14 +885,6 @@ // Record whether runtime checks are added. bool AddedSafetyChecks; - // Holds instructions from the original loop whose counterparts in the - // vectorized loop would be trivially dead if generated. For example, - // original induction update instructions can become dead because we - // separately emit induction "steps" when generating code for the new loop. - // Similarly, we create a new latch condition when setting up the structure - // of the new loop, so the old one can become dead. - SmallPtrSet DeadInstructions; - // Holds the end values for each induction variable. We save the end values // so we can later fix-up the external users of the induction variables. DenseMap IVEndValues; @@ -840,14 +903,36 @@ UnrollFactor, LVL, CM) {} private: - void scalarizeInstruction(Instruction *Instr, - bool IfPredicateInstr = false) override; + void vectorizeInstruction(Instruction &I) override; + void scalarizeInstruction(Instruction *Instr, bool IfPredicateInstr = false); void vectorizeMemoryInstruction(Instruction *Instr) override; Value *getBroadcastInstrs(Value *V) override; Value *getStepVector(Value *Val, int StartIdx, Value *Step, Instruction::BinaryOps Opcode = Instruction::BinaryOpsEnd) override; Value *reverseVector(Value *Vec) override; + + void vectorizeLoop() override; + + /// Iteratively sink the scalarized operands of a predicated instruction into + /// the block that was created for it. + void sinkScalarOperands(Instruction *PredInst); + + /// Predicate conditional instructions that require predication on their + /// respective conditions. + void predicateInstructions(); + + /// Store instructions that should be predicated, as a pair + /// + SmallVector, 4> PredicatedInstructions; + + // Holds instructions from the original loop whose counterparts in the + // vectorized loop would be trivially dead if generated. For example, + // original induction update instructions can become dead because we + // separately emit induction "steps" when generating code for the new loop. + // Similarly, we create a new latch condition when setting up the structure + // of the new loop, so the old one can become dead. 
+ SmallPtrSet DeadInstructions; }; /// \brief Look for a meaningful debug location on the instruction or it's @@ -1873,11 +1958,20 @@ unsigned Width; // Vector width with best cost unsigned Cost; // Cost of the loop with that width }; + + bool canVectorize(bool OptForSize); + + bool requiresTail(unsigned MaxVectorSize); + + /// \return An upper bound for the vectorization factor. + unsigned computeMaxVectorizationFactor(bool OptForSize); + /// \return The most profitable vectorization factor and the cost of that VF. /// This method checks every power of two up to VF. If UserVF is not ZERO /// then this vectorization factor will be selected if vectorization is /// possible. - VectorizationFactor selectVectorizationFactor(bool OptForSize); + VectorizationFactor selectVectorizationFactor(bool OptForSize, + unsigned MaxVF); /// \return The size (in bits) of the smallest and widest types in the code /// that needs to be vectorized. We ignore values that remain scalar such as @@ -1928,6 +2022,9 @@ /// \returns True if it is more profitable to scalarize instruction \p I for /// vectorization factor \p VF. bool isProfitableToScalarize(Instruction *I, unsigned VF) const { + // Unroller also calls this method, but does not collectInstsToScalarize. + if (VF == 1) + return true; auto Scalars = InstsToScalarize.find(VF); assert(Scalars != InstsToScalarize.end() && "VF not yet analyzed for scalarization profitability"); @@ -2139,10 +2236,12 @@ int computePredInstDiscount(Instruction *PredInst, ScalarCostsTy &ScalarCosts, unsigned VF); +public: /// Collects the instructions to scalarize for each predicated instruction in /// the loop. void collectInstsToScalarize(unsigned VF); +private: /// Collect the instructions that are uniform after vectorization. An /// instruction is uniform if we represent it with a single scalar value in /// the vectorized loop corresponding to each vector iteration. Examples of @@ -2161,6 +2260,7 @@ /// iteration of the original scalar loop. void collectLoopScalars(unsigned VF); +public: /// Collect Uniform and Scalar values for the given \p VF. /// The sets depend on CM decision for Load/Store instructions /// that may be vectorized as interleave, gather-scatter or scalarized. @@ -2173,6 +2273,7 @@ collectLoopScalars(VF); } +private: /// Keeps cost model vectorization decision and cost for instructions. /// Right now it is used for memory instructions only. typedef DenseMap, @@ -2210,4538 +2311,4976 @@ SmallPtrSet VecValuesToIgnore; }; -/// \brief This holds vectorization requirements that must be verified late in -/// the process. The requirements are set by legalize and costmodel. Once -/// vectorization has been determined to be possible and profitable the -/// requirements can be verified by looking for metadata or compiler options. -/// For example, some loops require FP commutativity which is only allowed if -/// vectorization is explicitly specified or if the fast-math compiler option -/// has been provided. -/// Late evaluation of these requirements allows helpful diagnostics to be -/// composed that tells the user what need to be done to vectorize the loop. For -/// example, by specifying #pragma clang loop vectorize or -ffast-math. Late -/// evaluation should be used only when diagnostics can generated that can be -/// followed by a non-expert user. -class LoopVectorizationRequirements { +/// LoopVectorizationPlanner - builds and optimizes the Vectorization Plans +/// which record the decisions how to vectorize the given loop. 
+/// In particular, represent the control-flow of the vectorized version, +/// the replication of instructions that are to be scalarized, and interleave +/// access groups. +class LoopVectorizationPlanner { public: - LoopVectorizationRequirements(OptimizationRemarkEmitter &ORE) - : NumRuntimePointerChecks(0), UnsafeAlgebraInst(nullptr), ORE(ORE) {} + LoopVectorizationPlanner(Loop *L, LoopInfo *LI, const TargetLibraryInfo *TLI, + const TargetTransformInfo *TTI, + LoopVectorizationLegality *Legal, + LoopVectorizationCostModel *CM) + : TheLoop(L), LI(LI), TLI(TLI), TTI(TTI), Legal(Legal), CM(CM), + ILV(nullptr), BestVF(0), BestUF(0) {} - void addUnsafeAlgebraInst(Instruction *I) { - // First unsafe algebra instruction. - if (!UnsafeAlgebraInst) - UnsafeAlgebraInst = I; - } + ~LoopVectorizationPlanner() {} - void addRuntimePointerChecks(unsigned Num) { NumRuntimePointerChecks = Num; } + /// Plan how to best vectorize, return the best VF and its cost. + LoopVectorizationCostModel::VectorizationFactor + plan(bool OptForSize, unsigned UserVF, unsigned MaxVF); - bool doesNotMeet(Function *F, Loop *L, const LoopVectorizeHints &Hints) { - const char *PassName = Hints.vectorizeAnalysisPassName(); - bool Failed = false; - if (UnsafeAlgebraInst && !Hints.allowReordering()) { - ORE.emit( - OptimizationRemarkAnalysisFPCommute(PassName, "CantReorderFPOps", - UnsafeAlgebraInst->getDebugLoc(), - UnsafeAlgebraInst->getParent()) - << "loop not vectorized: cannot prove it is safe to reorder " - "floating-point operations"); - Failed = true; - } + /// Finalize the best decision and dispose of all other VPlans. + void setBestPlan(unsigned VF, unsigned UF); - // Test if runtime memcheck thresholds are exceeded. - bool PragmaThresholdReached = - NumRuntimePointerChecks > PragmaVectorizeMemoryCheckThreshold; - bool ThresholdReached = - NumRuntimePointerChecks > VectorizerParams::RuntimeMemoryCheckThreshold; - if ((ThresholdReached && !Hints.allowReordering()) || - PragmaThresholdReached) { - ORE.emit(OptimizationRemarkAnalysisAliasing(PassName, "CantReorderMemOps", - L->getStartLoc(), - L->getHeader()) - << "loop not vectorized: cannot prove it is safe to reorder " - "memory operations"); - DEBUG(dbgs() << "LV: Too many memory checks needed.\n"); - Failed = true; - } + /// Generate the IR code for the body of the vectorized loop according to the + /// best selected VPlan. + void executeBestPlan(InnerLoopVectorizer &LB); - return Failed; - } + VPlan *getVPlanForVF(unsigned VF) { return VPlans[VF].get(); } -private: - unsigned NumRuntimePointerChecks; - Instruction *UnsafeAlgebraInst; + void printCurrentPlans(const std::string &Title, raw_ostream &O); - /// Interface to emit optimization remarks. - OptimizationRemarkEmitter &ORE; -}; + /// Test a predicate on a range of VFs. + /// The returned value reflects the result for a prefix of the range, with \p + /// EndRangeVF modified accordingly. + bool testVFRange(const std::function &Predicate, + unsigned StartRangeVF, unsigned &EndRangeVF); -static void addAcyclicInnerLoop(Loop &L, SmallVectorImpl &V) { - if (L.empty()) { - if (!hasCyclesInLoopBody(L)) - V.push_back(&L); - return; - } - for (Loop *InnerL : L) - addAcyclicInnerLoop(*InnerL, V); -} +protected: + /// Build initial VPlans according to the information gathered by Legal + /// when it checked if it is legal to vectorize this loop. + /// Returns the number of VPlans built, zero if failed. 
+ unsigned buildInitialVPlans(unsigned MinVF, unsigned MaxVF); + + /// On VPlan construction, each instruction marked for predication by Legal + /// gets its own basic block guarded by an if-then. This initial planning + /// is legal, but is not optimal. This function attempts to leverage the + /// necessary conditional execution of the predicated instruction in favor + /// of other related instructions. The function applies these optimizations + /// to all VPlans. + void optimizePredicatedInstructions(); -/// The LoopVectorize Pass. -struct LoopVectorize : public FunctionPass { - /// Pass identification, replacement for typeid - static char ID; +private: + /// Build an initial VPlan according to the information gathered by Legal + /// when it checked if it is legal to vectorize this loop. \return a VPlan + /// that corresponds to vectorization factors starting from the given + /// \p StartRangeVF and up to \p EndRangeVF, exclusive, possibly decreasing + /// the given \p EndRangeVF. + std::shared_ptr buildInitialVPlan(unsigned StartRangeVF, + unsigned &EndRangeVF); + + std::pair + widenIntInduction(VPlan *Plan, unsigned StartRangeVF, unsigned &EndRangeVF, + PHINode *IV, TruncInst *Trunc = nullptr); + + /// Determine whether \p I will be scalarized in a given range of VFs. + /// The returned value reflects the result for a prefix of the range, with \p + /// EndRangeVF modified accordingly. + bool willBeScalarized(Instruction *I, unsigned StartRangeVF, + unsigned &EndRangeVF); - explicit LoopVectorize(bool NoUnrolling = false, bool AlwaysVectorize = true) - : FunctionPass(ID) { - Impl.DisableUnrolling = NoUnrolling; - Impl.AlwaysVectorize = AlwaysVectorize; - initializeLoopVectorizePass(*PassRegistry::getPassRegistry()); - } + /// Iteratively sink the scalarized operands of a predicated instruction into + /// the block that was created for it. + void sinkScalarOperands(Instruction *PredInst, VPlan *Plan); - LoopVectorizePass Impl; + /// Determine whether a newly-created recipe adds a second user to one of the + /// variants the values its ingredients use. This may cause the defining + /// recipe to generate that variant itself to serve all such users. + void assignScalarVectorConversions(Instruction *PredInst, VPlan *Plan); - bool runOnFunction(Function &F) override { - if (skipFunction(F)) - return false; + /// Returns true if an instruction \p I should be scalarized instead of + /// vectorized for the chosen vectorization factor. + bool shouldScalarizeInstruction(Instruction *I, unsigned VF) const; - auto *SE = &getAnalysis().getSE(); - auto *LI = &getAnalysis().getLoopInfo(); - auto *TTI = &getAnalysis().getTTI(F); - auto *DT = &getAnalysis().getDomTree(); - auto *BFI = &getAnalysis().getBFI(); - auto *TLIP = getAnalysisIfAvailable(); - auto *TLI = TLIP ? &TLIP->getTLI() : nullptr; - auto *AA = &getAnalysis().getAAResults(); - auto *AC = &getAnalysis().getAssumptionCache(F); - auto *LAA = &getAnalysis(); - auto *DB = &getAnalysis().getDemandedBits(); - auto *ORE = &getAnalysis().getORE(); +private: + /// The loop that we evaluate. + Loop *TheLoop; - std::function GetLAA = - [&](Loop &L) -> const LoopAccessInfo & { return LAA->getInfo(&L); }; + /// Loop Info analysis. + LoopInfo *LI; - return Impl.runImpl(F, *SE, *LI, *TTI, *DT, *BFI, TLI, *DB, *AA, *AC, - GetLAA, *ORE); - } + /// Target Library Info. 
+ const TargetLibraryInfo *TLI; - void getAnalysisUsage(AnalysisUsage &AU) const override { - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addRequired(); - AU.addPreserved(); - AU.addPreserved(); - AU.addPreserved(); - AU.addPreserved(); - } -}; + /// Target Transform Info. + const TargetTransformInfo *TTI; -} // end anonymous namespace + /// The legality analysis. + LoopVectorizationLegality *Legal; -//===----------------------------------------------------------------------===// -// Implementation of LoopVectorizationLegality, InnerLoopVectorizer and -// LoopVectorizationCostModel. -//===----------------------------------------------------------------------===// + /// The profitablity analysis. + LoopVectorizationCostModel *CM; -Value *InnerLoopVectorizer::getBroadcastInstrs(Value *V) { - // We need to place the broadcast of invariant variables outside the loop. - Instruction *Instr = dyn_cast(V); - bool NewInstr = (Instr && Instr->getParent() == LoopVectorBody); - bool Invariant = OrigLoop->isLoopInvariant(V) && !NewInstr; + InnerLoopVectorizer *ILV; - // Place the code for broadcasting invariant variables in the new preheader. - IRBuilder<>::InsertPointGuard Guard(Builder); - if (Invariant) - Builder.SetInsertPoint(LoopVectorPreHeader->getTerminator()); + // Holds instructions from the original loop that we predicated. Such + // instructions reside in their own conditioned VPBasicBlock and represent + // an optimization opportunity for sinking their scalarized operands thus + // reducing their cost by the predicate's probability. + SmallPtrSet PredicatedInstructions; - // Broadcast the scalar into all locations in the vector. - Value *Shuf = Builder.CreateVectorSplat(VF, V, "broadcast"); + /// VPlans are shared between VFs, use smart pointers. + DenseMap> VPlans; - return Shuf; -} + unsigned BestVF; -void InnerLoopVectorizer::createVectorIntInductionPHI( - const InductionDescriptor &II, Instruction *EntryVal) { - Value *Start = II.getStartValue(); - ConstantInt *Step = II.getConstIntStepValue(); - assert(Step && "Can not widen an IV with a non-constant step"); + unsigned BestUF; - // Construct the initial value of the vector IV in the vector loop preheader - auto CurrIP = Builder.saveIP(); - Builder.SetInsertPoint(LoopVectorPreHeader->getTerminator()); - if (isa(EntryVal)) { - auto *TruncType = cast(EntryVal->getType()); - Step = ConstantInt::getSigned(TruncType, Step->getSExtValue()); - Start = Builder.CreateCast(Instruction::Trunc, Start, TruncType); - } - Value *SplatStart = Builder.CreateVectorSplat(VF, Start); - Value *SteppedStart = getStepVector(SplatStart, 0, Step); - Builder.restoreIP(CurrIP); + // Holds instructions from the original loop whose counterparts in the + // vectorized loop would be trivially dead if generated. For example, + // original induction update instructions can become dead because we + // separately emit induction "steps" when generating code for the new loop. + // Similarly, we create a new latch condition when setting up the structure + // of the new loop, so the old one can become dead. + SmallPtrSet DeadInstructions; +}; - Value *SplatVF = - ConstantVector::getSplat(VF, ConstantInt::getSigned(Start->getType(), - VF * Step->getSExtValue())); - // We may need to add the step a number of times, depending on the unroll - // factor. The last of those goes into the PHI. 
- PHINode *VecInd = PHINode::Create(SteppedStart->getType(), 2, "vec.ind", - &*LoopVectorBody->getFirstInsertionPt()); - Instruction *LastInduction = VecInd; - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Entry[Part] = LastInduction; - LastInduction = cast( - Builder.CreateAdd(LastInduction, SplatVF, "step.add")); +class VPLaneRange { +private: + static const unsigned VF = INT_MAX; + unsigned MinLane = 0; + unsigned MaxLane = VF - 1; + void dumpLane(raw_ostream &O, unsigned Lane) const { + if (Lane == VF - 1) + O << "VF-1"; + else + O << Lane; } - VectorLoopValueMap.initVector(EntryVal, Entry); - if (isa(EntryVal)) - addMetadata(Entry, EntryVal); - // Move the last step to the end of the latch block. This ensures consistent - // placement of all induction updates. - auto *LoopVectorLatch = LI->getLoopFor(LoopVectorBody)->getLoopLatch(); - auto *Br = cast(LoopVectorLatch->getTerminator()); - auto *ICmp = cast(Br->getCondition()); - LastInduction->moveBefore(ICmp); - LastInduction->setName("vec.ind.next"); +public: + VPLaneRange() {} + VPLaneRange(unsigned Min) : MinLane(Min) {} + VPLaneRange(unsigned Min, unsigned Max) : MinLane(Min), MaxLane(Max) {} + unsigned getMinLane() const { return MinLane; } + unsigned getMaxLane() const { return MaxLane; } + bool isEmpty() const { return MinLane > MaxLane; } + bool isFull() const { return MinLane == 0 && MaxLane == VF - 1; } + void print(raw_ostream &O) const { + dumpLane(O, MinLane); + O << ".."; + dumpLane(O, MaxLane); + } + static VPLaneRange intersect(const VPLaneRange &One, const VPLaneRange &Two) { + return VPLaneRange(std::max(One.MinLane, Two.MinLane), + std::min(One.MaxLane, Two.MaxLane)); + } +}; - VecInd->addIncoming(SteppedStart, LoopVectorPreHeader); - VecInd->addIncoming(LastInduction, LoopVectorLatch); -} +/// VPScalarizeOneByOneRecipe is a VPOneByOneRecipeBase which scalarizes each +/// Instruction in its ingredients independently, in order. The scalarization +/// is performed in one of two methods: a) by generating a single uniform scalar +/// Instruction. b) by generating multiple Instructions, each one for a +/// respective lane. +class VPScalarizeOneByOneRecipe : public VPOneByOneRecipeBase { + friend class VPlanUtilsLoopVectorizer; -bool InnerLoopVectorizer::shouldScalarizeInstruction(Instruction *I) const { - return Cost->isScalarAfterVectorization(I, VF) || - Cost->isProfitableToScalarize(I, VF); -} +private: + /// Do the actual code generation for a single instruction. + void transformIRInstruction(Instruction *I, VPTransformState &State) override; -bool InnerLoopVectorizer::needsScalarInduction(Instruction *IV) const { - if (shouldScalarizeInstruction(IV)) - return true; - auto isScalarInst = [&](User *U) -> bool { - auto *I = cast(U); - return (OrigLoop->contains(I) && shouldScalarizeInstruction(I)); - }; - return any_of(IV->users(), isScalarInst); -} + VPLaneRange DesignatedLanes; -void InnerLoopVectorizer::widenIntInduction(PHINode *IV, TruncInst *Trunc) { +public: + VPScalarizeOneByOneRecipe(const BasicBlock::iterator B, + const BasicBlock::iterator E, VPlan *Plan) + : VPOneByOneRecipeBase(VPScalarizeOneByOneSC, B, E, Plan) {} - auto II = Legal->getInductionVars()->find(IV); - assert(II != Legal->getInductionVars()->end() && "IV is not an induction"); + ~VPScalarizeOneByOneRecipe() {} - auto ID = II->second; - assert(IV->getType() == ID.getStartValue()->getType() && "Types must match"); + /// Method to support type inquiry through isa, cast, and dyn_cast. 
+ static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPScalarizeOneByOneSC; + } - // The scalar value to broadcast. This will be derived from the canonical - // induction variable. - Value *ScalarIV = nullptr; + const VPLaneRange &getDesignatedLanes() const { return DesignatedLanes; } - // The step of the induction. - Value *Step = nullptr; + /// Print the recipe. + void print(raw_ostream &O) const override { + O << "Scalarize"; + if (!DesignatedLanes.isFull()) { + O << " "; + DesignatedLanes.print(O); + } + O << ":"; + for (auto It = Begin; It != End; ++It) { + O << '\n' << *It; + if (willAlsoPackOrUnpack(&*It)) + O << " (S->V)"; + } + } +}; - // The value from the original loop to which we are mapping the new induction - // variable. - Instruction *EntryVal = Trunc ? cast(Trunc) : IV; +/// VPVectorizeOneByOneRecipe is a VPOneByOneRecipeBase which transforms by +/// vectorizing each Instruction in itsingredients independently, in order. +/// This recipe covers most of the traditional vectorization cases where +/// each ingredient produces a vectorized version of itself. +class VPVectorizeOneByOneRecipe : public VPOneByOneRecipeBase { + friend class VPlanUtilsLoopVectorizer; - // True if we have vectorized the induction variable. - auto VectorizedIV = false; +private: + /// Do the actual code generation for a single instruction. + void transformIRInstruction(Instruction *I, VPTransformState &State) override; - // Determine if we want a scalar version of the induction variable. This is - // true if the induction variable itself is not widened, or if it has at - // least one user in the loop that is not widened. - auto NeedsScalarIV = VF > 1 && needsScalarInduction(EntryVal); +public: + VPVectorizeOneByOneRecipe(const BasicBlock::iterator B, + const BasicBlock::iterator E, VPlan *Plan) + : VPOneByOneRecipeBase(VPVectorizeOneByOneSC, B, E, Plan) {} - // If the induction variable has a constant integer step value, go ahead and - // get it now. - if (ID.getConstIntStepValue()) - Step = ID.getConstIntStepValue(); + ~VPVectorizeOneByOneRecipe() {} - // Try to create a new independent vector induction variable. If we can't - // create the phi node, we will splat the scalar induction variable in each - // loop iteration. - if (VF > 1 && Step && !shouldScalarizeInstruction(EntryVal)) { - createVectorIntInductionPHI(ID, EntryVal); - VectorizedIV = true; + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPVectorizeOneByOneSC; } - // If we haven't yet vectorized the induction variable, or if we will create - // a scalar one, we need to define the scalar induction variable and step - // values. If we were given a truncation type, truncate the canonical - // induction variable and constant step. Otherwise, derive these values from - // the induction descriptor. 
-  if (!VectorizedIV || NeedsScalarIV) {
-    if (Trunc) {
-      auto *TruncType = cast<IntegerType>(Trunc->getType());
-      assert(Step && "Truncation requires constant integer step");
-      auto StepInt = cast<ConstantInt>(Step)->getSExtValue();
-      ScalarIV = Builder.CreateCast(Instruction::Trunc, Induction, TruncType);
-      Step = ConstantInt::getSigned(TruncType, StepInt);
-    } else {
-      ScalarIV = Induction;
-      auto &DL = OrigLoop->getHeader()->getModule()->getDataLayout();
-      if (IV != OldInduction) {
-        ScalarIV = Builder.CreateSExtOrTrunc(ScalarIV, IV->getType());
-        ScalarIV = ID.transform(Builder, ScalarIV, PSE.getSE(), DL);
-        ScalarIV->setName("offset.idx");
-      }
-      if (!Step) {
-        SCEVExpander Exp(*PSE.getSE(), DL, "induction");
-        Step = Exp.expandCodeFor(ID.getStep(), ID.getStep()->getType(),
-                                 &*Builder.GetInsertPoint());
-      }
+  /// Print the recipe.
+  void print(raw_ostream &O) const override {
+    O << "Vectorize:";
+    for (auto It = Begin; It != End; ++It) {
+      O << '\n' << *It;
+      if (willAlsoPackOrUnpack(&*It))
+        O << " (S->V)";
     }
   }
+};

-  // If we haven't yet vectorized the induction variable, splat the scalar
-  // induction variable, and build the necessary step vectors.
-  if (!VectorizedIV) {
-    Value *Broadcasted = getBroadcastInstrs(ScalarIV);
-    VectorParts Entry(UF);
-    for (unsigned Part = 0; Part < UF; ++Part)
-      Entry[Part] = getStepVector(Broadcasted, VF * Part, Step);
-    VectorLoopValueMap.initVector(EntryVal, Entry);
-    if (Trunc)
-      addMetadata(Entry, Trunc);
-  }
+/// A recipe which widens integer inductions, producing their vector values
+/// and computing the necessary values for producing their scalar values.
+/// The scalar values themselves are generated by a complementing
+/// VPBuildScalarStepsRecipe.
+class VPWidenIntInductionRecipe : public VPRecipeBase {
+private:
+  bool NeedsScalarIV;
+  PHINode *IV;
+  TruncInst *Trunc;
+  Value *ScalarIV = nullptr;
+  Value *Step = nullptr;

-  // If an induction variable is only used for counting loop iterations or
-  // calculating addresses, it doesn't need to be widened. Create scalar steps
-  // that can be used by instructions we will later scalarize. Note that the
-  // addition of the scalar steps will not increase the number of instructions
-  // in the loop in the common case prior to InstCombine. We will be trading
-  // one vector extract for each scalar step.
-  if (NeedsScalarIV)
-    buildScalarSteps(ScalarIV, Step, EntryVal);
-}
+public:
+  VPWidenIntInductionRecipe(bool NeedsScalarIV, PHINode *IV,
+                            TruncInst *Trunc = nullptr)
+      : VPRecipeBase(VPWidenIntInductionSC), NeedsScalarIV(NeedsScalarIV),
+        IV(IV), Trunc(Trunc) {}

-Value *InnerLoopVectorizer::getStepVector(Value *Val, int StartIdx, Value *Step,
-                                          Instruction::BinaryOps BinOp) {
-  // Create and check the types.
-  assert(Val->getType()->isVectorTy() && "Must be a vector");
-  int VLen = Val->getType()->getVectorNumElements();
+  ~VPWidenIntInductionRecipe() {}

-  Type *STy = Val->getType()->getScalarType();
-  assert((STy->isIntegerTy() || STy->isFloatingPointTy()) &&
-         "Induction Step must be an integer or FP");
-  assert(Step->getType() == STy && "Step has wrong type");
+  /// Method to support type inquiry through isa, cast, and dyn_cast.
+  static inline bool classof(const VPRecipeBase *V) {
+    return V->getVPRecipeID() == VPRecipeBase::VPWidenIntInductionSC;
+  }

-  SmallVector<Constant *, 8> Indices;
+  /// The method which generates the wide and/or scalar induction values
+  /// corresponding to this VPWidenIntInductionRecipe in the vectorized
+  /// version, thereby "executing" the VPlan.
+  void vectorize(VPTransformState &State) override;

-  if (STy->isIntegerTy()) {
-    // Create a vector of consecutive numbers from zero to VF.
-    for (int i = 0; i < VLen; ++i)
-      Indices.push_back(ConstantInt::get(STy, StartIdx + i));
+  /// Print the recipe.
+  void print(raw_ostream &O) const override;

-    // Add the consecutive indices to the vector value.
-    Constant *Cv = ConstantVector::get(Indices);
-    assert(Cv->getType() == Val->getType() && "Invalid consecutive vec");
-    Step = Builder.CreateVectorSplat(VLen, Step);
-    assert(Step->getType() == Val->getType() && "Invalid step vec");
-    // FIXME: The newly created binary instructions should contain nsw/nuw flags,
-    // which can be found from the original scalar operations.
-    Step = Builder.CreateMul(Cv, Step);
-    return Builder.CreateAdd(Val, Step, "induction");
+  Value *getScalarIV() {
+    assert(ScalarIV && "ScalarIV does not exist yet");
+    return ScalarIV;
   }

-  // Floating point induction.
-  assert((BinOp == Instruction::FAdd || BinOp == Instruction::FSub) &&
-         "Binary Opcode should be specified for FP induction");
-  // Create a vector of consecutive numbers from zero to VF.
-  for (int i = 0; i < VLen; ++i)
-    Indices.push_back(ConstantFP::get(STy, (double)(StartIdx + i)));
+  Value *getStep() {
+    assert(Step && "Step does not exist yet");
+    return Step;
+  }
+};

-  // Add the consecutive indices to the vector value.
-  Constant *Cv = ConstantVector::get(Indices);
+/// This is a complementing recipe for handling integer induction variables,
+/// responsible for generating the scalar values used by the IV's scalar users.
+class VPBuildScalarStepsRecipe : public VPRecipeBase {
+  friend class VPlanUtilsLoopVectorizer;

-  Step = Builder.CreateVectorSplat(VLen, Step);
+private:
+  VPWidenIntInductionRecipe *WII;
+  Instruction *EntryVal;
+  VPLaneRange DesignatedLanes;

-  // Floating point operations had to be 'fast' to enable the induction.
-  FastMathFlags Flags;
-  Flags.setUnsafeAlgebra();
+public:
+  VPBuildScalarStepsRecipe(VPWidenIntInductionRecipe *WII,
+                           Instruction *EntryVal, VPlan *Plan)
+      : VPRecipeBase(VPBuildScalarStepsSC), WII(WII), EntryVal(EntryVal) {
+    Plan->setInst2Recipe(EntryVal, this);
+  }

-  Value *MulOp = Builder.CreateFMul(Cv, Step);
-  if (isa<Instruction>(MulOp))
-    // Have to check, MulOp may be a constant
-    cast<Instruction>(MulOp)->setFastMathFlags(Flags);
+  ~VPBuildScalarStepsRecipe() {}

-  Value *BOp = Builder.CreateBinOp(BinOp, Val, MulOp, "induction");
-  if (isa<Instruction>(BOp))
-    cast<Instruction>(BOp)->setFastMathFlags(Flags);
-  return BOp;
-}
+  const VPLaneRange &getDesignatedLanes() const { return DesignatedLanes; }

-void InnerLoopVectorizer::buildScalarSteps(Value *ScalarIV, Value *Step,
-                                           Value *EntryVal) {
+  /// Method to support type inquiry through isa, cast, and dyn_cast.
+  static inline bool classof(const VPRecipeBase *V) {
+    return V->getVPRecipeID() == VPRecipeBase::VPBuildScalarStepsSC;
+  }

-  // We shouldn't have to build scalar steps if we aren't vectorizing.
-  assert(VF > 1 && "VF should be greater than one");
+  /// The method which generates the scalar induction steps for the designated
+  /// lanes, corresponding to this VPBuildScalarStepsRecipe in the vectorized
+  /// version, thereby "executing" the VPlan.
+  void vectorize(VPTransformState &State) override;

-  // Get the value type and ensure it and the step have the same integer type.
-  Type *ScalarIVTy = ScalarIV->getType()->getScalarType();
-  assert(ScalarIVTy->isIntegerTy() && ScalarIVTy == Step->getType() &&
-         "Val and Step should have the same integer type");
+  /// Print the recipe.
+ void print(raw_ostream &O) const override; +}; - // Determine the number of scalars we need to generate for each unroll - // iteration. If EntryVal is uniform, we only need to generate the first - // lane. Otherwise, we generate all VF values. - unsigned Lanes = - Cost->isUniformAfterVectorization(cast(EntryVal), VF) ? 1 : VF; +/// A VPInterleaveRecipe is a VPRecipe which transforms an interleave group of +/// loads or stores into one wide load/store and shuffles. +class VPInterleaveRecipe : public VPRecipeBase { +private: + const InterleaveGroup *IG; - // Compute the scalar steps and save the results in VectorLoopValueMap. - ScalarParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Entry[Part].resize(VF); - for (unsigned Lane = 0; Lane < Lanes; ++Lane) { - auto *StartIdx = ConstantInt::get(ScalarIVTy, VF * Part + Lane); - auto *Mul = Builder.CreateMul(StartIdx, Step); - auto *Add = Builder.CreateAdd(ScalarIV, Mul); - Entry[Part][Lane] = Add; - } +public: + VPInterleaveRecipe(const InterleaveGroup *IG, VPlan *Plan) + : VPRecipeBase(VPInterleaveSC), IG(IG) { + for (unsigned I = 0, E = IG->getNumMembers(); I < E; ++I) + Plan->setInst2Recipe(IG->getMember(I), this); } - VectorLoopValueMap.initScalar(EntryVal, Entry); -} -int LoopVectorizationLegality::isConsecutivePtr(Value *Ptr) { + ~VPInterleaveRecipe() {} - const ValueToValueMap &Strides = getSymbolicStrides() ? *getSymbolicStrides() : - ValueToValueMap(); + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPInterleaveSC; + } - int Stride = getPtrStride(PSE, Ptr, TheLoop, Strides, true, false); - if (Stride == 1 || Stride == -1) - return Stride; - return 0; -} + /// The method which generates the wide load or store and shuffles that + /// correspond to this VPInterleaveRecipe in the vectorized version, thereby + /// "executing" the VPlan. + void vectorize(VPTransformState &State) override; -bool LoopVectorizationLegality::isUniform(Value *V) { - return LAI->isUniform(V); -} + /// Print the recipe. + void print(raw_ostream &O) const override; -const InnerLoopVectorizer::VectorParts & -InnerLoopVectorizer::getVectorValue(Value *V) { - assert(V != Induction && "The new induction variable should not be used."); - assert(!V->getType()->isVectorTy() && "Can't widen a vector"); - assert(!V->getType()->isVoidTy() && "Type does not produce a value"); + const InterleaveGroup *getInterleaveGroup() { return IG; } +}; - // If we have a stride that is replaced by one, do it here. - if (Legal->hasStride(V)) - V = ConstantInt::get(V->getType(), 1); - - // If we have this scalar in the map, return it. - if (VectorLoopValueMap.hasVector(V)) - return VectorLoopValueMap.VectorMapStorage[V]; - - // If the value has not been vectorized, check if it has been scalarized - // instead. If it has been scalarized, and we actually need the value in - // vector form, we will construct the vector values on demand. - if (VectorLoopValueMap.hasScalar(V)) { - - // Initialize a new vector map entry. - VectorParts Entry(UF); - - // If we've scalarized a value, that value should be an instruction. - auto *I = cast(V); +/// A VPExtractMaskBitRecipe is a VPConditionBitRecipe which supports a +/// scalarized conditional branch. Such branches are needed to guard scalarized +/// instructions with possible side-effects that are predicated under a +/// condition. 
This recipe is in charge of generating the instruction that +/// computes the condition for this branch in the vectorized version. +class VPExtractMaskBitRecipe : public VPConditionBitRecipeBase { +private: + /// The original IR basic block in which the scalarized and predicated + /// instruction(s) reside. Needed for generating the mask of the block + /// and from it the desired condition bit. + BasicBlock *MaskedBasicBlock; - // If we aren't vectorizing, we can just copy the scalar map values over to - // the vector map. - if (VF == 1) { - for (unsigned Part = 0; Part < UF; ++Part) - Entry[Part] = getScalarValue(V, Part, 0); - return VectorLoopValueMap.initVector(V, Entry); - } +public: + /// Construct a VPExtractMaskBitRecipe given the IR BasicBlock whose mask + /// should provide the desired bit. This recipe has no Instructions as + /// ingredients, hence does not call Plan->setInst2Recipe(). + VPExtractMaskBitRecipe(BasicBlock *BB) + : VPConditionBitRecipeBase(VPExtractMaskBitSC), MaskedBasicBlock(BB) {} - // Get the last scalar instruction we generated for V. If the value is - // known to be uniform after vectorization, this corresponds to lane zero - // of the last unroll iteration. Otherwise, the last instruction is the one - // we created for the last vector lane of the last unroll iteration. - unsigned LastLane = Cost->isUniformAfterVectorization(I, VF) ? 0 : VF - 1; - auto *LastInst = cast(getScalarValue(V, UF - 1, LastLane)); + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPExtractMaskBitSC; + } - // Set the insert point after the last scalarized instruction. This ensures - // the insertelement sequence will directly follow the scalar definitions. - auto OldIP = Builder.saveIP(); - auto NewIP = std::next(BasicBlock::iterator(LastInst)); - Builder.SetInsertPoint(&*NewIP); + /// The method which generates the comparison and related mask management + /// instructions leading to computing the desired condition bit, corresponding + /// to this VPExtractMaskBitRecipe in the vectorized version, thereby + /// "executing" the VPlan. + void vectorize(VPTransformState &State) override; - // However, if we are vectorizing, we need to construct the vector values. - // If the value is known to be uniform after vectorization, we can just - // broadcast the scalar value corresponding to lane zero for each unroll - // iteration. Otherwise, we construct the vector values using insertelement - // instructions. Since the resulting vectors are stored in - // VectorLoopValueMap, we will only generate the insertelements once. - for (unsigned Part = 0; Part < UF; ++Part) { - Value *VectorValue = nullptr; - if (Cost->isUniformAfterVectorization(I, VF)) { - VectorValue = getBroadcastInstrs(getScalarValue(V, Part, 0)); - } else { - VectorValue = UndefValue::get(VectorType::get(V->getType(), VF)); - for (unsigned Lane = 0; Lane < VF; ++Lane) - VectorValue = Builder.CreateInsertElement( - VectorValue, getScalarValue(V, Part, Lane), - Builder.getInt32(Lane)); - } - Entry[Part] = VectorValue; - } - Builder.restoreIP(OldIP); - return VectorLoopValueMap.initVector(V, Entry); + /// Print the recipe. + void print(raw_ostream &O) const override { + O << "Extract Mask Bit:\n" << MaskedBasicBlock->getName(); } - // If this scalar is unknown, assume that it is a constant or that it is - // loop invariant. Broadcast V and save the value for future uses. 
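// A reduced sketch (not from the patch) of the packing that getVectorValue()
// performs when a value was scalarized but is later needed in vector form.
// Builder, the Scalars array and packScalars are stand-ins for the surrounding
// InnerLoopVectorizer state; the insertelement-per-lane pattern is the point.
static Value *packScalars(IRBuilder<> &Builder, ArrayRef<Value *> Scalars,
                          unsigned VF) {
  Value *Vec = UndefValue::get(VectorType::get(Scalars[0]->getType(), VF));
  // Insert one lane at a time; each insertelement feeds the next one.
  for (unsigned Lane = 0; Lane < VF; ++Lane)
    Vec = Builder.CreateInsertElement(Vec, Scalars[Lane],
                                      Builder.getInt32(Lane));
  return Vec;
}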
- Value *B = getBroadcastInstrs(V); - return VectorLoopValueMap.initVector(V, VectorParts(UF, B)); -} - -Value *InnerLoopVectorizer::getScalarValue(Value *V, unsigned Part, - unsigned Lane) { + StringRef getName() const override { return MaskedBasicBlock->getName(); } +}; - // If the value is not an instruction contained in the loop, it should - // already be scalar. - if (OrigLoop->isLoopInvariant(V)) - return V; +/// A VPMergeScalarizeBranchRecipe is a VPRecipe which represents the Phi's +/// needed when control converges back from a scalarized branch. Such phi's are +/// needed to merge live-out values that are set under a scalarized conditional +/// branch. They can be scalar or vector, depending on the user of the +/// live-out value. This recipe works in concert with VPExtractMaskBitRecipe. +class VPMergeScalarizeBranchRecipe : public VPRecipeBase { +private: + Instruction *LiveOut; - assert(Lane > 0 ? - !Cost->isUniformAfterVectorization(cast(V), VF) - : true && "Uniform values only have lane zero"); +public: + // Construct a VPMergeScalarizeBranchRecipe given \LiveOut whose value needs + // a Phi after merging back from a scalarized branch. + // LiveOut is mapped to the recipe vectorizing it, instead of this recipe + // which provides it with PHIs; hence no call to Plan->setInst2Recipe() here. + VPMergeScalarizeBranchRecipe(Instruction *LiveOut) + : VPRecipeBase(VPMergeScalarizeBranchSC), LiveOut(LiveOut) {} - // If the value from the original loop has not been vectorized, it is - // represented by UF x VF scalar values in the new loop. Return the requested - // scalar value. - if (VectorLoopValueMap.hasScalar(V)) - return VectorLoopValueMap.ScalarMapStorage[V][Part][Lane]; + ~VPMergeScalarizeBranchRecipe() {} - // If the value has not been scalarized, get its entry in VectorLoopValueMap - // for the given unroll part. If this entry is not a vector type (i.e., the - // vectorization factor is one), there is no need to generate an - // extractelement instruction. - auto *U = getVectorValue(V)[Part]; - if (!U->getType()->isVectorTy()) { - assert(VF == 1 && "Value not scalarized has non-vector type"); - return U; + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPMergeScalarizeBranchSC; } - // Otherwise, the value from the original loop has been vectorized and is - // represented by UF vector values. Extract and return the requested scalar - // value from the appropriate vector lane. - return Builder.CreateExtractElement(U, Builder.getInt32(Lane)); -} + /// The method which generates Phi instructions for live-outs as needed to + /// retain SSA form, corresponding to this VPMergeScalarizeBranchRecipe in the + /// vectorized version, thereby "executing" the VPlan. + void vectorize(VPTransformState &State) override; -Value *InnerLoopVectorizer::reverseVector(Value *Vec) { - assert(Vec->getType()->isVectorTy() && "Invalid type"); - SmallVector ShuffleMask; - for (unsigned i = 0; i < VF; ++i) - ShuffleMask.push_back(Builder.getInt32(VF - i - 1)); + /// Print the recipe. 
+ void print(raw_ostream &O) const override { + O << "Merge Scalarize Branch:\n" << *LiveOut; + } +}; - return Builder.CreateShuffleVector(Vec, UndefValue::get(Vec->getType()), - ConstantVector::get(ShuffleMask), - "reverse"); -} +class VPlanUtilsLoopVectorizer : public VPlanUtils { +public: + VPlanUtilsLoopVectorizer(VPlan *Plan) : VPlanUtils(Plan) {} -// Try to vectorize the interleave group that \p Instr belongs to. -// -// E.g. Translate following interleaved load group (factor = 3): -// for (i = 0; i < N; i+=3) { -// R = Pic[i]; // Member of index 0 -// G = Pic[i+1]; // Member of index 1 -// B = Pic[i+2]; // Member of index 2 -// ... // do something to R, G, B -// } -// To: -// %wide.vec = load <12 x i32> ; Read 4 tuples of R,G,B -// %R.vec = shuffle %wide.vec, undef, <0, 3, 6, 9> ; R elements -// %G.vec = shuffle %wide.vec, undef, <1, 4, 7, 10> ; G elements -// %B.vec = shuffle %wide.vec, undef, <2, 5, 8, 11> ; B elements -// -// Or translate following interleaved store group (factor = 3): -// for (i = 0; i < N; i+=3) { -// ... do something to R, G, B -// Pic[i] = R; // Member of index 0 -// Pic[i+1] = G; // Member of index 1 -// Pic[i+2] = B; // Member of index 2 -// } -// To: -// %R_G.vec = shuffle %R.vec, %G.vec, <0, 1, 2, ..., 7> -// %B_U.vec = shuffle %B.vec, undef, <0, 1, 2, 3, u, u, u, u> -// %interleaved.vec = shuffle %R_G.vec, %B_U.vec, -// <0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11> ; Interleave R,G,B elements -// store <12 x i32> %interleaved.vec ; Write 4 tuples of R,G,B -void InnerLoopVectorizer::vectorizeInterleaveGroup(Instruction *Instr) { - const InterleaveGroup *Group = Legal->getInterleavedAccessGroup(Instr); - assert(Group && "Fail to get an interleaved access group."); + ~VPlanUtilsLoopVectorizer() {} - // Skip if current instruction is not the insert position. - if (Instr != Group->getInsertPos()) - return; + VPOneByOneRecipeBase *createOneByOneRecipe(const BasicBlock::iterator B, + const BasicBlock::iterator E, + VPlan *Plan, bool isScalarizing); - Value *Ptr = getPointerOperand(Instr); + bool appendInstruction(VPOneByOneRecipeBase *Recipe, Instruction *Instr); - // Prepare for the vector type of the interleaved load/store. - Type *ScalarTy = getMemInstValueType(Instr); - unsigned InterleaveFactor = Group->getFactor(); - Type *VecTy = VectorType::get(ScalarTy, InterleaveFactor * VF); - Type *PtrTy = VecTy->getPointerTo(getMemInstAddressSpace(Instr)); + VPOneByOneRecipeBase *splitRecipe(Instruction *Split); - // Prepare for the new pointers. - setDebugLocFromInst(Builder, Ptr); - SmallVector NewPtrs; - unsigned Index = Group->getIndex(Instr); + void insertBefore(Instruction *Inst, Instruction *Before, + unsigned MinLane = 0); - // If the group is reverse, adjust the index to refer to the last vector lane - // instead of the first. We adjust the index from the first vector lane, - // rather than directly getting the pointer for lane VF - 1, because the - // pointer operand of the interleaved access is supposed to be uniform. For - // uniform instructions, we're only required to generate a value for the - // first vector lane in each unroll iteration. - if (Group->isReverse()) - Index += (VF - 1) * Group->getFactor(); + void removeInstruction(Instruction *Inst, unsigned FromLane = 0); - for (unsigned Part = 0; Part < UF; Part++) { - Value *NewPtr = getScalarValue(Ptr, Part, 0); + void sinkInstruction(Instruction *Inst, VPBasicBlock *To, + unsigned MinLane = 0); - // Notice current instruction could be any index. Need to adjust the address - // to the member of index 0. 
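// Illustrative sketch (not from the patch): the classof() hooks defined by the
// recipe classes above enable the usual isa<>/dyn_cast<> dispatch over a
// VPRecipeBase pointer. inspectRecipe is a hypothetical helper; only classof,
// getInterleaveGroup, getName and print from the classes above are assumed.
static void inspectRecipe(VPRecipeBase *R, raw_ostream &OS) {
  if (auto *IR = dyn_cast<VPInterleaveRecipe>(R))
    OS << "Interleave group with " << IR->getInterleaveGroup()->getNumMembers()
       << " members\n";
  else if (auto *EMB = dyn_cast<VPExtractMaskBitRecipe>(R))
    OS << "Condition bit for block " << EMB->getName() << "\n";
  else
    R->print(OS);
}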
- // - // E.g. a = A[i+1]; // Member of index 1 (Current instruction) - // b = A[i]; // Member of index 0 - // Current pointer is pointed to A[i+1], adjust it to A[i]. - // - // E.g. A[i+1] = a; // Member of index 1 - // A[i] = b; // Member of index 0 - // A[i+2] = c; // Member of index 2 (Current instruction) - // Current pointer is pointed to A[i+2], adjust it to A[i]. - NewPtr = Builder.CreateGEP(NewPtr, Builder.getInt32(-Index)); + template void designateLaneZero(T &Recipe) { + Recipe->DesignatedLanes = VPLaneRange(0, 0); + } +}; - // Cast to the vector pointer type. - NewPtrs.push_back(Builder.CreateBitCast(NewPtr, PtrTy)); +/// \brief This holds vectorization requirements that must be verified late in +/// the process. The requirements are set by legalize and costmodel. Once +/// vectorization has been determined to be possible and profitable the +/// requirements can be verified by looking for metadata or compiler options. +/// For example, some loops require FP commutativity which is only allowed if +/// vectorization is explicitly specified or if the fast-math compiler option +/// has been provided. +/// Late evaluation of these requirements allows helpful diagnostics to be +/// composed that tells the user what need to be done to vectorize the loop. For +/// example, by specifying #pragma clang loop vectorize or -ffast-math. Late +/// evaluation should be used only when diagnostics can generated that can be +/// followed by a non-expert user. +class LoopVectorizationRequirements { +public: + LoopVectorizationRequirements(OptimizationRemarkEmitter &ORE) + : NumRuntimePointerChecks(0), UnsafeAlgebraInst(nullptr), ORE(ORE) {} + + void addUnsafeAlgebraInst(Instruction *I) { + // First unsafe algebra instruction. + if (!UnsafeAlgebraInst) + UnsafeAlgebraInst = I; } - setDebugLocFromInst(Builder, Instr); - Value *UndefVec = UndefValue::get(VecTy); + void addRuntimePointerChecks(unsigned Num) { NumRuntimePointerChecks = Num; } - // Vectorize the interleaved load group. - if (isa(Instr)) { + bool doesNotMeet(Function *F, Loop *L, const LoopVectorizeHints &Hints) { + const char *PassName = Hints.vectorizeAnalysisPassName(); + bool Failed = false; + if (UnsafeAlgebraInst && !Hints.allowReordering()) { + ORE.emit( + OptimizationRemarkAnalysisFPCommute(PassName, "CantReorderFPOps", + UnsafeAlgebraInst->getDebugLoc(), + UnsafeAlgebraInst->getParent()) + << "loop not vectorized: cannot prove it is safe to reorder " + "floating-point operations"); + Failed = true; + } - // For each unroll part, create a wide load for the group. - SmallVector NewLoads; - for (unsigned Part = 0; Part < UF; Part++) { - auto *NewLoad = Builder.CreateAlignedLoad( - NewPtrs[Part], Group->getAlignment(), "wide.vec"); - addMetadata(NewLoad, Instr); - NewLoads.push_back(NewLoad); + // Test if runtime memcheck thresholds are exceeded. + bool PragmaThresholdReached = + NumRuntimePointerChecks > PragmaVectorizeMemoryCheckThreshold; + bool ThresholdReached = + NumRuntimePointerChecks > VectorizerParams::RuntimeMemoryCheckThreshold; + if ((ThresholdReached && !Hints.allowReordering()) || + PragmaThresholdReached) { + ORE.emit(OptimizationRemarkAnalysisAliasing(PassName, "CantReorderMemOps", + L->getStartLoc(), + L->getHeader()) + << "loop not vectorized: cannot prove it is safe to reorder " + "memory operations"); + DEBUG(dbgs() << "LV: Too many memory checks needed.\n"); + Failed = true; } - // For each member in the group, shuffle out the appropriate data from the - // wide loads. 
- for (unsigned I = 0; I < InterleaveFactor; ++I) { - Instruction *Member = Group->getMember(I); + return Failed; + } - // Skip the gaps in the group. - if (!Member) - continue; +private: + unsigned NumRuntimePointerChecks; + Instruction *UnsafeAlgebraInst; - VectorParts Entry(UF); - Constant *StrideMask = createStrideMask(Builder, I, InterleaveFactor, VF); - for (unsigned Part = 0; Part < UF; Part++) { - Value *StridedVec = Builder.CreateShuffleVector( - NewLoads[Part], UndefVec, StrideMask, "strided.vec"); + /// Interface to emit optimization remarks. + OptimizationRemarkEmitter &ORE; +}; - // If this member has different type, cast the result type. - if (Member->getType() != ScalarTy) { - VectorType *OtherVTy = VectorType::get(Member->getType(), VF); - StridedVec = Builder.CreateBitOrPointerCast(StridedVec, OtherVTy); - } - - Entry[Part] = - Group->isReverse() ? reverseVector(StridedVec) : StridedVec; - } - VectorLoopValueMap.initVector(Member, Entry); - } +static void addAcyclicInnerLoop(Loop &L, SmallVectorImpl &V) { + if (L.empty()) { + if (!hasCyclesInLoopBody(L)) + V.push_back(&L); return; } + for (Loop *InnerL : L) + addAcyclicInnerLoop(*InnerL, V); +} - // The sub vector type for current instruction. - VectorType *SubVT = VectorType::get(ScalarTy, VF); +/// The LoopVectorize Pass. +struct LoopVectorize : public FunctionPass { + /// Pass identification, replacement for typeid + static char ID; - // Vectorize the interleaved store group. - for (unsigned Part = 0; Part < UF; Part++) { - // Collect the stored vector from each member. - SmallVector StoredVecs; - for (unsigned i = 0; i < InterleaveFactor; i++) { - // Interleaved store group doesn't allow a gap, so each index has a member - Instruction *Member = Group->getMember(i); - assert(Member && "Fail to get a member from an interleaved store group"); + explicit LoopVectorize(bool NoUnrolling = false, bool AlwaysVectorize = true) + : FunctionPass(ID) { + Impl.DisableUnrolling = NoUnrolling; + Impl.AlwaysVectorize = AlwaysVectorize; + initializeLoopVectorizePass(*PassRegistry::getPassRegistry()); + } - Value *StoredVec = - getVectorValue(cast(Member)->getValueOperand())[Part]; - if (Group->isReverse()) - StoredVec = reverseVector(StoredVec); + LoopVectorizePass Impl; - // If this member has different type, cast it to an unified type. - if (StoredVec->getType() != SubVT) - StoredVec = Builder.CreateBitOrPointerCast(StoredVec, SubVT); + bool runOnFunction(Function &F) override { + if (skipFunction(F)) + return false; - StoredVecs.push_back(StoredVec); - } + auto *SE = &getAnalysis().getSE(); + auto *LI = &getAnalysis().getLoopInfo(); + auto *TTI = &getAnalysis().getTTI(F); + auto *DT = &getAnalysis().getDomTree(); + auto *BFI = &getAnalysis().getBFI(); + auto *TLIP = getAnalysisIfAvailable(); + auto *TLI = TLIP ? &TLIP->getTLI() : nullptr; + auto *AA = &getAnalysis().getAAResults(); + auto *AC = &getAnalysis().getAssumptionCache(F); + auto *LAA = &getAnalysis(); + auto *DB = &getAnalysis().getDemandedBits(); + auto *ORE = &getAnalysis().getORE(); - // Concatenate all vectors into a wide vector. - Value *WideVec = concatenateVectors(Builder, StoredVecs); + std::function GetLAA = + [&](Loop &L) -> const LoopAccessInfo & { return LAA->getInfo(&L); }; - // Interleave the elements in the wide vector. 
- Constant *IMask = createInterleaveMask(Builder, VF, InterleaveFactor); - Value *IVec = Builder.CreateShuffleVector(WideVec, UndefVec, IMask, - "interleaved.vec"); + return Impl.runImpl(F, *SE, *LI, *TTI, *DT, *BFI, TLI, *DB, *AA, *AC, + GetLAA, *ORE); + } - Instruction *NewStoreInstr = - Builder.CreateAlignedStore(IVec, NewPtrs[Part], Group->getAlignment()); - addMetadata(NewStoreInstr, Instr); + void getAnalysisUsage(AnalysisUsage &AU) const override { + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addRequired(); + AU.addPreserved(); + AU.addPreserved(); + AU.addPreserved(); + AU.addPreserved(); } -} +}; -void InnerLoopVectorizer::vectorizeMemoryInstruction(Instruction *Instr) { - // Attempt to issue a wide load. - LoadInst *LI = dyn_cast(Instr); - StoreInst *SI = dyn_cast(Instr); +} // end anonymous namespace - assert((LI || SI) && "Invalid Load/Store instruction"); +//===----------------------------------------------------------------------===// +// Implementation of LoopVectorizationLegality, InnerLoopVectorizer, +// LoopVectorizationCostModel and LoopVectorizationPlanner. +//===----------------------------------------------------------------------===// - LoopVectorizationCostModel::InstWidening Decision = - Cost->getWideningDecision(Instr, VF); - assert(Decision != LoopVectorizationCostModel::CM_Unknown && - "CM decision should be taken at this point"); - if (Decision == LoopVectorizationCostModel::CM_Interleave) - return vectorizeInterleaveGroup(Instr); +Value *InnerLoopVectorizer::getBroadcastInstrs(Value *V) { + // We need to place the broadcast of invariant variables outside the loop. + Instruction *Instr = dyn_cast(V); + bool NewInstr = (Instr && Instr->getParent() == LoopVectorBody); + bool Invariant = OrigLoop->isLoopInvariant(V) && !NewInstr; - Type *ScalarDataTy = getMemInstValueType(Instr); - Type *DataTy = VectorType::get(ScalarDataTy, VF); - Value *Ptr = getPointerOperand(Instr); - unsigned Alignment = getMemInstAlignment(Instr); - // An alignment of 0 means target abi alignment. We need to use the scalar's - // target abi alignment in such a case. - const DataLayout &DL = Instr->getModule()->getDataLayout(); - if (!Alignment) - Alignment = DL.getABITypeAlignment(ScalarDataTy); - unsigned AddressSpace = getMemInstAddressSpace(Instr); + // Place the code for broadcasting invariant variables in the new preheader. + IRBuilder<>::InsertPointGuard Guard(Builder); + if (Invariant) + Builder.SetInsertPoint(LoopVectorPreHeader->getTerminator()); - // Scalarize the memory instruction if necessary. - if (Decision == LoopVectorizationCostModel::CM_Scalarize) - return scalarizeInstruction(Instr, Legal->isScalarWithPredication(Instr)); + // Broadcast the scalar into all locations in the vector. + Value *Shuf = Builder.CreateVectorSplat(VF, V, "broadcast"); - // Determine if the pointer operand of the access is either consecutive or - // reverse consecutive. 
- int ConsecutiveStride = Legal->isConsecutivePtr(Ptr); - bool Reverse = ConsecutiveStride < 0; - bool CreateGatherScatter = - (Decision == LoopVectorizationCostModel::CM_GatherScatter); + return Shuf; +} - VectorParts VectorGep; +void InnerLoopVectorizer::createVectorIntInductionPHI( + const InductionDescriptor &II, Instruction *EntryVal) { + Value *Start = II.getStartValue(); + ConstantInt *Step = II.getConstIntStepValue(); + assert(Step && "Can not widen an IV with a non-constant step"); - // Handle consecutive loads/stores. - GetElementPtrInst *Gep = getGEPInstruction(Ptr); - if (ConsecutiveStride) { - if (Gep) { - unsigned NumOperands = Gep->getNumOperands(); -#ifndef NDEBUG - // The original GEP that identified as a consecutive memory access - // should have only one loop-variant operand. - unsigned NumOfLoopVariantOps = 0; - for (unsigned i = 0; i < NumOperands; ++i) - if (!PSE.getSE()->isLoopInvariant(PSE.getSCEV(Gep->getOperand(i)), - OrigLoop)) - NumOfLoopVariantOps++; - assert(NumOfLoopVariantOps == 1 && - "Consecutive GEP should have only one loop-variant operand"); -#endif - GetElementPtrInst *Gep2 = cast(Gep->clone()); - Gep2->setName("gep.indvar"); + // Construct the initial value of the vector IV in the vector loop preheader + auto CurrIP = Builder.saveIP(); + Builder.SetInsertPoint(LoopVectorPreHeader->getTerminator()); + if (isa(EntryVal)) { + auto *TruncType = cast(EntryVal->getType()); + Step = ConstantInt::getSigned(TruncType, Step->getSExtValue()); + Start = Builder.CreateCast(Instruction::Trunc, Start, TruncType); + } + Value *SplatStart = Builder.CreateVectorSplat(VF, Start); + Value *SteppedStart = getStepVector(SplatStart, 0, Step); + Builder.restoreIP(CurrIP); - // A new GEP is created for a 0-lane value of the first unroll iteration. - // The GEPs for the rest of the unroll iterations are computed below as an - // offset from this GEP. - for (unsigned i = 0; i < NumOperands; ++i) - // We can apply getScalarValue() for all GEP indices. It returns an - // original value for loop-invariant operand and 0-lane for consecutive - // operand. - Gep2->setOperand(i, getScalarValue(Gep->getOperand(i), - 0, /* First unroll iteration */ - 0 /* 0-lane of the vector */ )); - setDebugLocFromInst(Builder, Gep); - Ptr = Builder.Insert(Gep2); + Value *SplatVF = + ConstantVector::getSplat(VF, ConstantInt::getSigned(Start->getType(), + VF * Step->getSExtValue())); + // We may need to add the step a number of times, depending on the unroll + // factor. The last of those goes into the PHI. + PHINode *VecInd = PHINode::Create(SteppedStart->getType(), 2, "vec.ind", + &*LoopVectorBody->getFirstInsertionPt()); + Instruction *LastInduction = VecInd; + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Entry[Part] = LastInduction; + LastInduction = cast( + Builder.CreateAdd(LastInduction, SplatVF, "step.add")); + } + VectorLoopValueMap.initVector(EntryVal, Entry); + if (isa(EntryVal)) + addMetadata(Entry, EntryVal); - } else { // No GEP - setDebugLocFromInst(Builder, Ptr); - Ptr = getScalarValue(Ptr, 0, 0); - } - } else { - // At this point we should vector version of GEP for Gather or Scatter - assert(CreateGatherScatter && "The instruction should be scalarized"); - if (Gep) { - // Vectorizing GEP, across UF parts. We want to get a vector value for base - // and each index that's defined inside the loop, even if it is - // loop-invariant but wasn't hoisted out. Otherwise we want to keep them - // scalar. 
- SmallVector OpsV; - for (Value *Op : Gep->operands()) { - Instruction *SrcInst = dyn_cast(Op); - if (SrcInst && OrigLoop->contains(SrcInst)) - OpsV.push_back(getVectorValue(Op)); - else - OpsV.push_back(VectorParts(UF, Op)); - } - for (unsigned Part = 0; Part < UF; ++Part) { - SmallVector Ops; - Value *GEPBasePtr = OpsV[0][Part]; - for (unsigned i = 1; i < Gep->getNumOperands(); i++) - Ops.push_back(OpsV[i][Part]); - Value *NewGep = Builder.CreateGEP(GEPBasePtr, Ops, "VectorGep"); - cast(NewGep)->setIsInBounds(Gep->isInBounds()); - assert(NewGep->getType()->isVectorTy() && "Expected vector GEP"); + // Move the last step to the end of the latch block. This ensures consistent + // placement of all induction updates. + auto *LoopVectorLatch = LI->getLoopFor(LoopVectorBody)->getLoopLatch(); + auto *Br = cast(LoopVectorLatch->getTerminator()); + auto *ICmp = cast(Br->getCondition()); + LastInduction->moveBefore(ICmp); + LastInduction->setName("vec.ind.next"); - NewGep = - Builder.CreateBitCast(NewGep, VectorType::get(Ptr->getType(), VF)); - VectorGep.push_back(NewGep); - } - } else - VectorGep = getVectorValue(Ptr); - } + VecInd->addIncoming(SteppedStart, LoopVectorPreHeader); + VecInd->addIncoming(LastInduction, LoopVectorLatch); +} - VectorParts Mask = createBlockInMask(Instr->getParent()); - // Handle Stores: - if (SI) { - assert(!Legal->isUniform(SI->getPointerOperand()) && - "We do not allow storing to uniform addresses"); - setDebugLocFromInst(Builder, SI); - // We don't want to update the value in the map as it might be used in - // another expression. So don't use a reference type for "StoredVal". - VectorParts StoredVal = getVectorValue(SI->getValueOperand()); +bool InnerLoopVectorizer::shouldScalarizeInstruction(Instruction *I) const { + return Cost->isScalarAfterVectorization(I, VF) || + Cost->isProfitableToScalarize(I, VF); +} - for (unsigned Part = 0; Part < UF; ++Part) { - Instruction *NewSI = nullptr; - if (CreateGatherScatter) { - Value *MaskPart = Legal->isMaskRequired(SI) ? Mask[Part] : nullptr; - NewSI = Builder.CreateMaskedScatter(StoredVal[Part], VectorGep[Part], - Alignment, MaskPart); - } else { - // Calculate the pointer for the specific unroll-part. - Value *PartPtr = - Builder.CreateGEP(nullptr, Ptr, Builder.getInt32(Part * VF)); +bool InnerLoopVectorizer::needsScalarInduction(Instruction *IV) const { + if (shouldScalarizeInstruction(IV)) + return true; + auto isScalarInst = [&](User *U) -> bool { + auto *I = cast(U); + return (OrigLoop->contains(I) && shouldScalarizeInstruction(I)); + }; + return any_of(IV->users(), isScalarInst); +} - if (Reverse) { - // If we store to reverse consecutive memory locations, then we need - // to reverse the order of elements in the stored value. - StoredVal[Part] = reverseVector(StoredVal[Part]); - // If the address is consecutive but reversed, then the - // wide store needs to start at the last vector element. 
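// Worked example (illustrative, not from the patch) of what
// createVectorIntInductionPHI() above emits for an i32 induction that starts
// at 0 with constant step 1, VF = 4 and UF = 2. Value names follow the ones
// set in the code; block names are the customary ones and may differ:
//
//   %vec.ind      = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, %vector.ph ],
//                                 [ %vec.ind.next, %latch ]
//   %step.add     = add <4 x i32> %vec.ind, <i32 4, i32 4, i32 4, i32 4>
//   ...
//   %vec.ind.next = add <4 x i32> %step.add, <i32 4, i32 4, i32 4, i32 4>
//
// %vec.ind serves unroll part 0 and %step.add serves part 1; the final add is
// renamed "vec.ind.next", moved in front of the latch compare, and fed back
// into the PHI.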
- PartPtr = - Builder.CreateGEP(nullptr, Ptr, Builder.getInt32(-Part * VF)); - PartPtr = - Builder.CreateGEP(nullptr, PartPtr, Builder.getInt32(1 - VF)); - Mask[Part] = reverseVector(Mask[Part]); - } +std::pair +InnerLoopVectorizer::widenIntInduction(bool NeedsScalarIV, PHINode *IV, + TruncInst *Trunc) { - Value *VecPtr = - Builder.CreateBitCast(PartPtr, DataTy->getPointerTo(AddressSpace)); + auto II = Legal->getInductionVars()->find(IV); + assert(II != Legal->getInductionVars()->end() && "IV is not an induction"); - if (Legal->isMaskRequired(SI)) - NewSI = Builder.CreateMaskedStore(StoredVal[Part], VecPtr, Alignment, - Mask[Part]); - else - NewSI = - Builder.CreateAlignedStore(StoredVal[Part], VecPtr, Alignment); - } - addMetadata(NewSI, SI); - } - return; + auto ID = II->second; + assert(IV->getType() == ID.getStartValue()->getType() && "Types must match"); + + // The scalar value to broadcast. This will be derived from the canonical + // induction variable. + Value *ScalarIV = nullptr; + + // The step of the induction. + Value *Step = nullptr; + + // The value from the original loop to which we are mapping the new induction + // variable. + Instruction *EntryVal = Trunc ? cast(Trunc) : IV; + + // True if we have vectorized the induction variable. + auto VectorizedIV = false; + + // If the induction variable has a constant integer step value, go ahead and + // get it now. + if (ID.getConstIntStepValue()) + Step = ID.getConstIntStepValue(); + + // Try to create a new independent vector induction variable. If we can't + // create the phi node, we will splat the scalar induction variable in each + // loop iteration. + if (VF > 1 && Step && !shouldScalarizeInstruction(EntryVal)) { + createVectorIntInductionPHI(ID, EntryVal); + VectorizedIV = true; } - // Handle loads. - assert(LI && "Must have a load instruction"); - setDebugLocFromInst(Builder, LI); - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Instruction *NewLI; - if (CreateGatherScatter) { - Value *MaskPart = Legal->isMaskRequired(LI) ? Mask[Part] : nullptr; - NewLI = Builder.CreateMaskedGather(VectorGep[Part], Alignment, MaskPart, - 0, "wide.masked.gather"); - Entry[Part] = NewLI; + // If we haven't yet vectorized the induction variable, or if we will create + // a scalar one, we need to define the scalar induction variable and step + // values. If we were given a truncation type, truncate the canonical + // induction variable and constant step. Otherwise, derive these values from + // the induction descriptor. + if (!VectorizedIV || NeedsScalarIV) { + if (Trunc) { + auto *TruncType = cast(Trunc->getType()); + assert(Step && "Truncation requires constant integer step"); + auto StepInt = cast(Step)->getSExtValue(); + ScalarIV = Builder.CreateCast(Instruction::Trunc, Induction, TruncType); + Step = ConstantInt::getSigned(TruncType, StepInt); } else { - // Calculate the pointer for the specific unroll-part. - Value *PartPtr = - Builder.CreateGEP(nullptr, Ptr, Builder.getInt32(Part * VF)); - - if (Reverse) { - // If the address is consecutive but reversed, then the - // wide load needs to start at the last vector element. 
- PartPtr = Builder.CreateGEP(nullptr, Ptr, Builder.getInt32(-Part * VF)); - PartPtr = Builder.CreateGEP(nullptr, PartPtr, Builder.getInt32(1 - VF)); - Mask[Part] = reverseVector(Mask[Part]); + ScalarIV = Induction; + auto &DL = OrigLoop->getHeader()->getModule()->getDataLayout(); + if (IV != OldInduction) { + ScalarIV = Builder.CreateSExtOrTrunc(ScalarIV, IV->getType()); + ScalarIV = ID.transform(Builder, ScalarIV, PSE.getSE(), DL); + ScalarIV->setName("offset.idx"); + } + if (!Step) { + SCEVExpander Exp(*PSE.getSE(), DL, "induction"); + Step = Exp.expandCodeFor(ID.getStep(), ID.getStep()->getType(), + &*Builder.GetInsertPoint()); } - - Value *VecPtr = - Builder.CreateBitCast(PartPtr, DataTy->getPointerTo(AddressSpace)); - if (Legal->isMaskRequired(LI)) - NewLI = Builder.CreateMaskedLoad(VecPtr, Alignment, Mask[Part], - UndefValue::get(DataTy), - "wide.masked.load"); - else - NewLI = Builder.CreateAlignedLoad(VecPtr, Alignment, "wide.load"); - Entry[Part] = Reverse ? reverseVector(NewLI) : NewLI; } - addMetadata(NewLI, LI); } - VectorLoopValueMap.initVector(Instr, Entry); + + // If we haven't yet vectorized the induction variable, splat the scalar + // induction variable, and build the necessary step vectors. + if (!VectorizedIV) { + Value *Broadcasted = getBroadcastInstrs(ScalarIV); + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) + Entry[Part] = getStepVector(Broadcasted, VF * Part, Step); + VectorLoopValueMap.initVector(EntryVal, Entry); + if (Trunc) + addMetadata(Entry, Trunc); + } + + // If an induction variable is only used for counting loop iterations or + // calculating addresses, it doesn't need to be widened. + + return std::make_pair(ScalarIV, Step); } -void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr, - bool IfPredicateInstr) { - assert(!Instr->getType()->isAggregateType() && "Can't handle vectors"); - DEBUG(dbgs() << "LV: Scalarizing" - << (IfPredicateInstr ? " and predicating:" : ":") << *Instr - << '\n'); - // Holds vector parameters or scalars, in case of uniform vals. - SmallVector Params; +Value *InnerLoopVectorizer::getStepVector(Value *Val, int StartIdx, Value *Step, + Instruction::BinaryOps BinOp) { + // Create and check the types. + assert(Val->getType()->isVectorTy() && "Must be a vector"); + int VLen = Val->getType()->getVectorNumElements(); - setDebugLocFromInst(Builder, Instr); + Type *STy = Val->getType()->getScalarType(); + assert((STy->isIntegerTy() || STy->isFloatingPointTy()) && + "Induction Step must be an integer or FP"); + assert(Step->getType() == STy && "Step has wrong type"); - // Does this instruction return a value ? - bool IsVoidRetTy = Instr->getType()->isVoidTy(); + SmallVector Indices; - // Initialize a new scalar map entry. - ScalarParts Entry(UF); + if (STy->isIntegerTy()) { + // Create a vector of consecutive numbers from zero to VF. + for (int i = 0; i < VLen; ++i) + Indices.push_back(ConstantInt::get(STy, StartIdx + i)); - VectorParts Cond; - if (IfPredicateInstr) - Cond = createBlockInMask(Instr->getParent()); + // Add the consecutive indices to the vector value. + Constant *Cv = ConstantVector::get(Indices); + assert(Cv->getType() == Val->getType() && "Invalid consecutive vec"); + Step = Builder.CreateVectorSplat(VLen, Step); + assert(Step->getType() == Val->getType() && "Invalid step vec"); + // FIXME: The newly created binary instructions should contain nsw/nuw flags, + // which can be found from the original scalar operations. 
+ Step = Builder.CreateMul(Cv, Step); + return Builder.CreateAdd(Val, Step, "induction"); + } - // Determine the number of scalars we need to generate for each unroll - // iteration. If the instruction is uniform, we only need to generate the - // first lane. Otherwise, we generate all VF values. - unsigned Lanes = Cost->isUniformAfterVectorization(Instr, VF) ? 1 : VF; + // Floating point induction. + assert((BinOp == Instruction::FAdd || BinOp == Instruction::FSub) && + "Binary Opcode should be specified for FP induction"); + // Create a vector of consecutive numbers from zero to VF. + for (int i = 0; i < VLen; ++i) + Indices.push_back(ConstantFP::get(STy, (double)(StartIdx + i))); - // For each vector unroll 'part': - for (unsigned Part = 0; Part < UF; ++Part) { - Entry[Part].resize(VF); - // For each scalar that we create: - for (unsigned Lane = 0; Lane < Lanes; ++Lane) { - - // Start if-block. - Value *Cmp = nullptr; - if (IfPredicateInstr) { - Cmp = Builder.CreateExtractElement(Cond[Part], Builder.getInt32(Lane)); - Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Cmp, - ConstantInt::get(Cmp->getType(), 1)); - } + // Add the consecutive indices to the vector value. + Constant *Cv = ConstantVector::get(Indices); - Instruction *Cloned = Instr->clone(); - if (!IsVoidRetTy) - Cloned->setName(Instr->getName() + ".cloned"); + Step = Builder.CreateVectorSplat(VLen, Step); - // Replace the operands of the cloned instructions with their scalar - // equivalents in the new loop. - for (unsigned op = 0, e = Instr->getNumOperands(); op != e; ++op) { - auto *NewOp = getScalarValue(Instr->getOperand(op), Part, Lane); - Cloned->setOperand(op, NewOp); - } - addNewMetadata(Cloned, Instr); + // Floating point operations had to be 'fast' to enable the induction. + FastMathFlags Flags; + Flags.setUnsafeAlgebra(); - // Place the cloned scalar in the new loop. - Builder.Insert(Cloned); + Value *MulOp = Builder.CreateFMul(Cv, Step); + if (isa(MulOp)) + // Have to check, MulOp may be a constant + cast(MulOp)->setFastMathFlags(Flags); - // Add the cloned scalar to the scalar map entry. - Entry[Part][Lane] = Cloned; + Value *BOp = Builder.CreateBinOp(BinOp, Val, MulOp, "induction"); + if (isa(BOp)) + cast(BOp)->setFastMathFlags(Flags); + return BOp; +} - // If we just cloned a new assumption, add it the assumption cache. - if (auto *II = dyn_cast(Cloned)) - if (II->getIntrinsicID() == Intrinsic::assume) - AC->registerAssumption(II); +void InnerLoopVectorizer::buildScalarSteps(Value *ScalarIV, Value *Step, + Value *EntryVal, unsigned MinPart, + unsigned MaxPart, unsigned MinLane, + unsigned MaxLane) { + + // We shouldn't have to build scalar steps if we aren't vectorizing. + assert(VF > 1 && "VF should be greater than one"); + + // Get the value type and ensure it and the step have the same integer type. + Type *ScalarIVTy = ScalarIV->getType()->getScalarType(); + assert(ScalarIVTy->isIntegerTy() && ScalarIVTy == Step->getType() && + "Val and Step should have the same integer type"); + + ScalarParts &Entry = VectorLoopValueMap.getOrCreateScalar(EntryVal, VF); - // End if-block. - if (IfPredicateInstr) - PredicatedInstructions.push_back(std::make_pair(Cloned, Cmp)); + // Compute the scalar steps and save the results in VectorLoopValueMap. 
+ for (unsigned Part = MinPart; Part <= MaxPart; ++Part) { + Entry[Part].resize(VF); + for (unsigned Lane = MinLane; Lane <= MaxLane; ++Lane) { + auto *StartIdx = ConstantInt::get(ScalarIVTy, VF * Part + Lane); + auto *Mul = Builder.CreateMul(StartIdx, Step); + auto *Add = Builder.CreateAdd(ScalarIV, Mul); + Entry[Part][Lane] = Add; } } - VectorLoopValueMap.initScalar(Instr, Entry); } -PHINode *InnerLoopVectorizer::createInductionVariable(Loop *L, Value *Start, - Value *End, Value *Step, - Instruction *DL) { - BasicBlock *Header = L->getHeader(); - BasicBlock *Latch = L->getLoopLatch(); - // As we're just creating this loop, it's possible no latch exists - // yet. If so, use the header as this will be a single block loop. - if (!Latch) - Latch = Header; +int LoopVectorizationLegality::isConsecutivePtr(Value *Ptr) { - IRBuilder<> Builder(&*Header->getFirstInsertionPt()); - Instruction *OldInst = getDebugLocFromInstOrOperands(OldInduction); - setDebugLocFromInst(Builder, OldInst); - auto *Induction = Builder.CreatePHI(Start->getType(), 2, "index"); - - Builder.SetInsertPoint(Latch->getTerminator()); - setDebugLocFromInst(Builder, OldInst); - - // Create i+1 and fill the PHINode. - Value *Next = Builder.CreateAdd(Induction, Step, "index.next"); - Induction->addIncoming(Start, L->getLoopPreheader()); - Induction->addIncoming(Next, Latch); - // Create the compare. - Value *ICmp = Builder.CreateICmpEQ(Next, End); - Builder.CreateCondBr(ICmp, L->getExitBlock(), Header); + const ValueToValueMap &Strides = getSymbolicStrides() ? *getSymbolicStrides() : + ValueToValueMap(); - // Now we have two terminators. Remove the old one from the block. - Latch->getTerminator()->eraseFromParent(); + int Stride = getPtrStride(PSE, Ptr, TheLoop, Strides, true, false); + if (Stride == 1 || Stride == -1) + return Stride; + return 0; +} - return Induction; +bool LoopVectorizationLegality::isUniform(Value *V) { + return LAI->isUniform(V); } -Value *InnerLoopVectorizer::getOrCreateTripCount(Loop *L) { - if (TripCount) - return TripCount; +void InnerLoopVectorizer::constructVectorValue(Value *V, unsigned Part, + unsigned Lane) { + assert(V != Induction && "The new induction variable should not be used."); + assert(!V->getType()->isVectorTy() && "Can't widen a vector"); + assert(!V->getType()->isVoidTy() && "Type does not produce a value"); - IRBuilder<> Builder(L->getLoopPreheader()->getTerminator()); - // Find the loop boundaries. - ScalarEvolution *SE = PSE.getSE(); - const SCEV *BackedgeTakenCount = PSE.getBackedgeTakenCount(); - assert(BackedgeTakenCount != SE->getCouldNotCompute() && - "Invalid loop count"); + if (!VectorLoopValueMap.hasVector(V)) { + VectorParts Entry(UF); + for (unsigned P = 0; P < UF; ++P) + Entry[P] = nullptr; + VectorLoopValueMap.initVector(V, Entry); + } - Type *IdxTy = Legal->getWidestInductionType(); + VectorParts &Parts = VectorLoopValueMap.VectorMapStorage[V]; - // The exit count might have the type of i64 while the phi is i32. This can - // happen if we have an induction variable that is sign extended before the - // compare. The only way that we get a backedge taken count is that the - // induction variable was signed and as such will not overflow. In such a case - // truncation is legal. 
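// Worked example (illustrative, not from the patch) for buildScalarSteps()
// above: each requested (Part, Lane) pair receives the scalar value
//   ScalarIV + (VF * Part + Lane) * Step.
// With VF = 4, UF = 2, Step = 1 and the full part/lane ranges, the map entry
// for EntryVal therefore holds
//   Part 0: ScalarIV + 0, ScalarIV + 1, ScalarIV + 2, ScalarIV + 3
//   Part 1: ScalarIV + 4, ScalarIV + 5, ScalarIV + 6, ScalarIV + 7
// whereas a recipe whose designated lanes were restricted to lane zero (e.g.
// via designateLaneZero() above) only generates the lane-0 values, as needed
// for uniform users.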
- if (BackedgeTakenCount->getType()->getPrimitiveSizeInBits() > - IdxTy->getPrimitiveSizeInBits()) - BackedgeTakenCount = SE->getTruncateOrNoop(BackedgeTakenCount, IdxTy); - BackedgeTakenCount = SE->getNoopOrZeroExtend(BackedgeTakenCount, IdxTy); + assert(VectorLoopValueMap.hasScalar(V) && "Expected scalar values to exist"); - // Get the total trip count from the count by adding 1. - const SCEV *ExitCount = SE->getAddExpr( - BackedgeTakenCount, SE->getOne(BackedgeTakenCount->getType())); + auto *ScalarInst = cast(getScalarValue(V, Part, Lane)); - const DataLayout &DL = L->getHeader()->getModule()->getDataLayout(); + Value *VectorValue = nullptr; - // Expand the trip count and place the new instructions in the preheader. - // Notice that the pre-header does not change, only the loop body. - SCEVExpander Exp(*SE, DL, "induction"); + // If we're constructing lane 0, start from undef; otherwise, start from the + // last value created. + if (Lane == 0) + VectorValue = UndefValue::get(VectorType::get(V->getType(), VF)); + else + VectorValue = Parts[Part]; - // Count holds the overall loop count (N). - TripCount = Exp.expandCodeFor(ExitCount, ExitCount->getType(), - L->getLoopPreheader()->getTerminator()); + VectorValue = Builder.CreateInsertElement(VectorValue, ScalarInst, + Builder.getInt32(Lane)); + Parts[Part] = VectorValue; +} - if (TripCount->getType()->isPointerTy()) - TripCount = - CastInst::CreatePointerCast(TripCount, IdxTy, "exitcount.ptrcnt.to.int", - L->getLoopPreheader()->getTerminator()); +const InnerLoopVectorizer::VectorParts & +InnerLoopVectorizer::getVectorValue(Value *V) { + assert(V != Induction && "The new induction variable should not be used."); + assert(!V->getType()->isVectorTy() && "Can't widen a vector"); + assert(!V->getType()->isVoidTy() && "Type does not produce a value"); - return TripCount; -} + // If we have a stride that is replaced by one, do it here. + if (Legal->hasStride(V)) + V = ConstantInt::get(V->getType(), 1); -Value *InnerLoopVectorizer::getOrCreateVectorTripCount(Loop *L) { - if (VectorTripCount) - return VectorTripCount; + // If we have this scalar in the map, return it. + if (VectorLoopValueMap.hasVector(V)) + return VectorLoopValueMap.VectorMapStorage[V]; - Value *TC = getOrCreateTripCount(L); - IRBuilder<> Builder(L->getLoopPreheader()->getTerminator()); + // If the value has not been vectorized, check if it has been scalarized + // instead. If it has been scalarized, and we actually need the value in + // vector form, we will construct the vector values on demand. + if (VectorLoopValueMap.hasScalar(V)) { - // Now we need to generate the expression for the part of the loop that the - // vectorized body will execute. This is equal to N - (N % Step) if scalar - // iterations are not required for correctness, or N - Step, otherwise. Step - // is equal to the vectorization factor (number of SIMD elements) times the - // unroll factor (number of SIMD instructions). - Constant *Step = ConstantInt::get(TC->getType(), VF * UF); - Value *R = Builder.CreateURem(TC, Step, "n.mod.vf"); + // Initialize a new vector map entry. + VectorParts Entry(UF); - // If there is a non-reversed interleaved group that may speculatively access - // memory out-of-bounds, we need to ensure that there will be at least one - // iteration of the scalar epilogue loop. Thus, if the step evenly divides - // the trip count, we set the remainder to be equal to the step. 
If the step - // does not evenly divide the trip count, no adjustment is necessary since - // there will already be scalar iterations. Note that the minimum iterations - // check ensures that N >= Step. - if (VF > 1 && Legal->requiresScalarEpilogue()) { - auto *IsZero = Builder.CreateICmpEQ(R, ConstantInt::get(R->getType(), 0)); - R = Builder.CreateSelect(IsZero, Step, R); - } + // If we've scalarized a value, that value should be an instruction. + auto *I = cast(V); - VectorTripCount = Builder.CreateSub(TC, R, "n.vec"); + // If we aren't vectorizing, we can just copy the scalar map values over to + // the vector map. + if (VF == 1) { + for (unsigned Part = 0; Part < UF; ++Part) + Entry[Part] = getScalarValue(V, Part, 0); + return VectorLoopValueMap.initVector(V, Entry); + } - return VectorTripCount; -} + // Get the last scalar instruction we generated for V. If the value is + // known to be uniform after vectorization, this corresponds to lane zero + // of the last unroll iteration. Otherwise, the last instruction is the one + // we created for the last vector lane of the last unroll iteration. + unsigned LastLane = Cost->isUniformAfterVectorization(I, VF) ? 0 : VF - 1; + auto *LastInst = cast(getScalarValue(V, UF - 1, LastLane)); -void InnerLoopVectorizer::emitMinimumIterationCountCheck(Loop *L, - BasicBlock *Bypass) { - Value *Count = getOrCreateTripCount(L); - BasicBlock *BB = L->getLoopPreheader(); - IRBuilder<> Builder(BB->getTerminator()); + // Set the insert point after the last scalarized instruction. This ensures + // the insertelement sequence will directly follow the scalar definitions. + auto OldIP = Builder.saveIP(); + auto NextInsertionPoint = std::next(BasicBlock::iterator(LastInst)); + if (NextInsertionPoint != LastInst->getParent()->end()) + Builder.SetInsertPoint(&*NextInsertionPoint); + else + Builder.SetInsertPoint(LastInst->getParent()); - // Generate code to check that the loop's trip count that we computed by - // adding one to the backedge-taken count will not overflow. - Value *CheckMinIters = Builder.CreateICmpULT( - Count, ConstantInt::get(Count->getType(), VF * UF), "min.iters.check"); + // However, if we are vectorizing, we need to construct the vector values. + // If the value is known to be uniform after vectorization, we can just + // broadcast the scalar value corresponding to lane zero for each unroll + // iteration. Otherwise, we construct the vector values using insertelement + // instructions. Since the resulting vectors are stored in + // VectorLoopValueMap, we will only generate the insertelements once. + for (unsigned Part = 0; Part < UF; ++Part) { + Value *VectorValue = nullptr; + if (Cost->isUniformAfterVectorization(I, VF)) { + VectorValue = getBroadcastInstrs(getScalarValue(V, Part, 0)); + } else { + VectorValue = UndefValue::get(VectorType::get(V->getType(), VF)); + for (unsigned Lane = 0; Lane < VF; ++Lane) + VectorValue = Builder.CreateInsertElement( + VectorValue, getScalarValue(V, Part, Lane), + Builder.getInt32(Lane)); + } + Entry[Part] = VectorValue; + } + Builder.restoreIP(OldIP); + return VectorLoopValueMap.initVector(V, Entry); + } - BasicBlock *NewBB = - BB->splitBasicBlock(BB->getTerminator(), "min.iters.checked"); - // Update dominator tree immediately if the generated block is a - // LoopBypassBlock because SCEV expansions to generate loop bypass - // checks may query it before the current function is finished. 
- DT->addNewBlock(NewBB, BB); - if (L->getParentLoop()) - L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI); - ReplaceInstWithInst(BB->getTerminator(), - BranchInst::Create(Bypass, NewBB, CheckMinIters)); - LoopBypassBlocks.push_back(BB); + // If this scalar is unknown, assume that it is a constant or that it is + // loop invariant. Broadcast V and save the value for future uses. + Value *B = getBroadcastInstrs(V); + return VectorLoopValueMap.initVector(V, VectorParts(UF, B)); } -void InnerLoopVectorizer::emitVectorLoopEnteredCheck(Loop *L, - BasicBlock *Bypass) { - Value *TC = getOrCreateVectorTripCount(L); - BasicBlock *BB = L->getLoopPreheader(); - IRBuilder<> Builder(BB->getTerminator()); - - // Now, compare the new count to zero. If it is zero skip the vector loop and - // jump to the scalar loop. - Value *Cmp = Builder.CreateICmpEQ(TC, Constant::getNullValue(TC->getType()), - "cmp.zero"); +Value *InnerLoopVectorizer::getScalarValue(Value *V, unsigned Part, + unsigned Lane) { - // Generate code to check that the loop's trip count that we computed by - // adding one to the backedge-taken count will not overflow. - BasicBlock *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph"); - // Update dominator tree immediately if the generated block is a - // LoopBypassBlock because SCEV expansions to generate loop bypass - // checks may query it before the current function is finished. - DT->addNewBlock(NewBB, BB); - if (L->getParentLoop()) - L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI); - ReplaceInstWithInst(BB->getTerminator(), - BranchInst::Create(Bypass, NewBB, Cmp)); - LoopBypassBlocks.push_back(BB); -} + // If the value is not an instruction contained in the loop, it should + // already be scalar. + if (OrigLoop->isLoopInvariant(V)) + return V; -void InnerLoopVectorizer::emitSCEVChecks(Loop *L, BasicBlock *Bypass) { - BasicBlock *BB = L->getLoopPreheader(); + assert(Lane > 0 ? + !Cost->isUniformAfterVectorization(cast(V), VF) + : true && "Uniform values only have lane zero"); - // Generate the code to check that the SCEV assumptions that we made. - // We want the new basic block to start at the first instruction in a - // sequence of instructions that form a check. - SCEVExpander Exp(*PSE.getSE(), Bypass->getModule()->getDataLayout(), - "scev.check"); - Value *SCEVCheck = - Exp.expandCodeForPredicate(&PSE.getUnionPredicate(), BB->getTerminator()); + // If the value from the original loop has not been vectorized, it is + // represented by UF x VF scalar values in the new loop. Return the requested + // scalar value. + if (VectorLoopValueMap.hasScalar(V)) + return VectorLoopValueMap.ScalarMapStorage[V][Part][Lane]; - if (auto *C = dyn_cast(SCEVCheck)) - if (C->isZero()) - return; + // If the value has not been scalarized, get its entry in VectorLoopValueMap + // for the given unroll part. If this entry is not a vector type (i.e., the + // vectorization factor is one), there is no need to generate an + // extractelement instruction. + auto *U = getVectorValue(V)[Part]; + if (!U->getType()->isVectorTy()) { + assert(VF == 1 && "Value not scalarized has non-vector type"); + return U; + } - // Create a new block containing the stride check. - BB->setName("vector.scevcheck"); - auto *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph"); - // Update dominator tree immediately if the generated block is a - // LoopBypassBlock because SCEV expansions to generate loop bypass - // checks may query it before the current function is finished. 
- DT->addNewBlock(NewBB, BB); - if (L->getParentLoop()) - L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI); - ReplaceInstWithInst(BB->getTerminator(), - BranchInst::Create(Bypass, NewBB, SCEVCheck)); - LoopBypassBlocks.push_back(BB); - AddedSafetyChecks = true; + // Otherwise, the value from the original loop has been vectorized and is + // represented by UF vector values. Extract and return the requested scalar + // value from the appropriate vector lane. + return Builder.CreateExtractElement(U, Builder.getInt32(Lane)); } -void InnerLoopVectorizer::emitMemRuntimeChecks(Loop *L, BasicBlock *Bypass) { - BasicBlock *BB = L->getLoopPreheader(); +Value *InnerLoopVectorizer::reverseVector(Value *Vec) { + assert(Vec->getType()->isVectorTy() && "Invalid type"); + SmallVector ShuffleMask; + for (unsigned i = 0; i < VF; ++i) + ShuffleMask.push_back(Builder.getInt32(VF - i - 1)); - // Generate the code that checks in runtime if arrays overlap. We put the - // checks into a separate block to make the more common case of few elements - // faster. - Instruction *FirstCheckInst; - Instruction *MemRuntimeCheck; - std::tie(FirstCheckInst, MemRuntimeCheck) = - Legal->getLAI()->addRuntimeChecks(BB->getTerminator()); - if (!MemRuntimeCheck) - return; + return Builder.CreateShuffleVector(Vec, UndefValue::get(Vec->getType()), + ConstantVector::get(ShuffleMask), + "reverse"); +} - // Create a new block containing the memory check. - BB->setName("vector.memcheck"); - auto *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph"); - // Update dominator tree immediately if the generated block is a - // LoopBypassBlock because SCEV expansions to generate loop bypass - // checks may query it before the current function is finished. - DT->addNewBlock(NewBB, BB); - if (L->getParentLoop()) - L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI); - ReplaceInstWithInst(BB->getTerminator(), - BranchInst::Create(Bypass, NewBB, MemRuntimeCheck)); - LoopBypassBlocks.push_back(BB); - AddedSafetyChecks = true; +// Try to vectorize the interleave group that \p Instr belongs to. +// +// E.g. Translate following interleaved load group (factor = 3): +// for (i = 0; i < N; i+=3) { +// R = Pic[i]; // Member of index 0 +// G = Pic[i+1]; // Member of index 1 +// B = Pic[i+2]; // Member of index 2 +// ... // do something to R, G, B +// } +// To: +// %wide.vec = load <12 x i32> ; Read 4 tuples of R,G,B +// %R.vec = shuffle %wide.vec, undef, <0, 3, 6, 9> ; R elements +// %G.vec = shuffle %wide.vec, undef, <1, 4, 7, 10> ; G elements +// %B.vec = shuffle %wide.vec, undef, <2, 5, 8, 11> ; B elements +// +// Or translate following interleaved store group (factor = 3): +// for (i = 0; i < N; i+=3) { +// ... do something to R, G, B +// Pic[i] = R; // Member of index 0 +// Pic[i+1] = G; // Member of index 1 +// Pic[i+2] = B; // Member of index 2 +// } +// To: +// %R_G.vec = shuffle %R.vec, %G.vec, <0, 1, 2, ..., 7> +// %B_U.vec = shuffle %B.vec, undef, <0, 1, 2, 3, u, u, u, u> +// %interleaved.vec = shuffle %R_G.vec, %B_U.vec, +// <0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11> ; Interleave R,G,B elements +// store <12 x i32> %interleaved.vec ; Write 4 tuples of R,G,B +void InnerLoopVectorizer::vectorizeInterleaveGroup(Instruction *Instr) { + const InterleaveGroup *Group = Legal->getInterleavedAccessGroup(Instr); + assert(Group && "Fail to get an interleaved access group."); - // We currently don't use LoopVersioning for the actual loop cloning but we - // still use it to add the noalias metadata. 
- LVer = llvm::make_unique(*Legal->getLAI(), OrigLoop, LI, DT, - PSE.getSE()); - LVer->prepareNoAliasMetadata(); -} + // Skip if current instruction is not the insert position. + if (Instr != Group->getInsertPos()) + return; -void InnerLoopVectorizer::createEmptyLoop() { - /* - In this function we generate a new loop. The new loop will contain - the vectorized instructions while the old loop will continue to run the - scalar remainder. + Value *Ptr = getPointerOperand(Instr); - [ ] <-- loop iteration number check. - / | - / v - | [ ] <-- vector loop bypass (may consist of multiple blocks). - | / | - | / v - || [ ] <-- vector pre header. - |/ | - | v - | [ ] \ - | [ ]_| <-- vector loop. - | | - | v - | -[ ] <--- middle-block. - | / | - | / v - -|- >[ ] <--- new preheader. - | | - | v - | [ ] \ - | [ ]_| <-- old scalar loop to handle remainder. - \ | - \ v - >[ ] <-- exit block. - ... - */ + // Prepare for the vector type of the interleaved load/store. + Type *ScalarTy = getMemInstValueType(Instr); + unsigned InterleaveFactor = Group->getFactor(); + Type *VecTy = VectorType::get(ScalarTy, InterleaveFactor * VF); + Type *PtrTy = VecTy->getPointerTo(getMemInstAddressSpace(Instr)); - BasicBlock *OldBasicBlock = OrigLoop->getHeader(); - BasicBlock *VectorPH = OrigLoop->getLoopPreheader(); - BasicBlock *ExitBlock = OrigLoop->getExitBlock(); - assert(VectorPH && "Invalid loop structure"); - assert(ExitBlock && "Must have an exit block"); + // Prepare for the new pointers. + setDebugLocFromInst(Builder, Ptr); + SmallVector NewPtrs; + unsigned Index = Group->getIndex(Instr); - // Some loops have a single integer induction variable, while other loops - // don't. One example is c++ iterators that often have multiple pointer - // induction variables. In the code below we also support a case where we - // don't have a single induction variable. - // - // We try to obtain an induction variable from the original loop as hard - // as possible. However if we don't find one that: - // - is an integer - // - counts from zero, stepping by one - // - is the size of the widest induction variable type - // then we create a new one. - OldInduction = Legal->getPrimaryInduction(); - Type *IdxTy = Legal->getWidestInductionType(); + // If the group is reverse, adjust the index to refer to the last vector lane + // instead of the first. We adjust the index from the first vector lane, + // rather than directly getting the pointer for lane VF - 1, because the + // pointer operand of the interleaved access is supposed to be uniform. For + // uniform instructions, we're only required to generate a value for the + // first vector lane in each unroll iteration. + if (Group->isReverse()) + Index += (VF - 1) * Group->getFactor(); - // Split the single block loop into the two loop structure described above. - BasicBlock *VecBody = - VectorPH->splitBasicBlock(VectorPH->getTerminator(), "vector.body"); - BasicBlock *MiddleBlock = - VecBody->splitBasicBlock(VecBody->getTerminator(), "middle.block"); - BasicBlock *ScalarPH = - MiddleBlock->splitBasicBlock(MiddleBlock->getTerminator(), "scalar.ph"); + for (unsigned Part = 0; Part < UF; Part++) { + Value *NewPtr = getScalarValue(Ptr, Part, 0); - // Create and register the new vector loop. - Loop *Lp = new Loop(); - Loop *ParentLoop = OrigLoop->getParentLoop(); + // Notice current instruction could be any index. Need to adjust the address + // to the member of index 0. + // + // E.g. 
a = A[i+1];   // Member of index 1 (Current instruction)
+  //       b = A[i];     // Member of index 0
+  // The current pointer points to A[i+1]; adjust it to A[i].
+  //
+  // E.g.  A[i+1] = a;   // Member of index 1
+  //       A[i]   = b;   // Member of index 0
+  //       A[i+2] = c;   // Member of index 2 (Current instruction)
+  // The current pointer points to A[i+2]; adjust it to A[i].
+    NewPtr = Builder.CreateGEP(NewPtr, Builder.getInt32(-Index));
-  // Insert the new loop into the loop nest and register the new basic blocks
-  // before calling any utilities such as SCEV that require valid LoopInfo.
-  if (ParentLoop) {
-    ParentLoop->addChildLoop(Lp);
-    ParentLoop->addBasicBlockToLoop(ScalarPH, *LI);
-    ParentLoop->addBasicBlockToLoop(MiddleBlock, *LI);
-  } else {
-    LI->addTopLevelLoop(Lp);
+    // Cast to the vector pointer type.
+    NewPtrs.push_back(Builder.CreateBitCast(NewPtr, PtrTy));
  }
-  Lp->addBasicBlockToLoop(VecBody, *LI);
-  // Find the loop boundaries.
-  Value *Count = getOrCreateTripCount(Lp);
-
-  Value *StartIdx = ConstantInt::get(IdxTy, 0);
+  setDebugLocFromInst(Builder, Instr);
+  Value *UndefVec = UndefValue::get(VecTy);
-  // We need to test whether the backedge-taken count is uint##_max. Adding one
-  // to it will cause overflow and an incorrect loop trip count in the vector
-  // body. In case of overflow we want to directly jump to the scalar remainder
-  // loop.
-  emitMinimumIterationCountCheck(Lp, ScalarPH);
-  // Now, compare the new count to zero. If it is zero skip the vector loop and
-  // jump to the scalar loop.
-  emitVectorLoopEnteredCheck(Lp, ScalarPH);
-  // Generate the code to check any assumptions that we've made for SCEV
-  // expressions.
-  emitSCEVChecks(Lp, ScalarPH);
+  // Vectorize the interleaved load group.
+  if (isa<LoadInst>(Instr)) {
-  // Generate the code that checks in runtime if arrays overlap. We put the
-  // checks into a separate block to make the more common case of few elements
-  // faster.
-  emitMemRuntimeChecks(Lp, ScalarPH);
+    // For each unroll part, create a wide load for the group.
+    SmallVector<Value *, 2> NewLoads;
+    for (unsigned Part = 0; Part < UF; Part++) {
+      auto *NewLoad = Builder.CreateAlignedLoad(
+          NewPtrs[Part], Group->getAlignment(), "wide.vec");
+      addMetadata(NewLoad, Instr);
+      NewLoads.push_back(NewLoad);
+    }
-  // Generate the induction variable.
-  // The loop step is equal to the vectorization factor (num of SIMD elements)
-  // times the unroll factor (num of SIMD instructions).
-  Value *CountRoundDown = getOrCreateVectorTripCount(Lp);
-  Constant *Step = ConstantInt::get(IdxTy, VF * UF);
-  Induction =
-      createInductionVariable(Lp, StartIdx, CountRoundDown, Step,
-                              getDebugLocFromInstOrOperands(OldInduction));
+    // For each member in the group, shuffle out the appropriate data from the
+    // wide loads.
+    for (unsigned I = 0; I < InterleaveFactor; ++I) {
+      Instruction *Member = Group->getMember(I);
-  // We are going to resume the execution of the scalar loop.
-  // Go over all of the induction variables that we found and fix the
-  // PHIs that are left in the scalar version of the loop.
-  // The starting values of PHI nodes depend on the counter of the last
-  // iteration in the vectorized loop.
-  // If we come from a bypass edge then we need to start from the original
-  // start value.
+      // Skip the gaps in the group.
+      if (!Member)
+        continue;
-  // This variable saves the new starting index for the scalar loop. It is used
-  // to test if there are any tail iterations left once the vector loop has
-  // completed.
- LoopVectorizationLegality::InductionList *List = Legal->getInductionVars(); - for (auto &InductionEntry : *List) { - PHINode *OrigPhi = InductionEntry.first; - InductionDescriptor II = InductionEntry.second; + VectorParts Entry(UF); + Constant *StrideMask = createStrideMask(Builder, I, InterleaveFactor, VF); + for (unsigned Part = 0; Part < UF; Part++) { + Value *StridedVec = Builder.CreateShuffleVector( + NewLoads[Part], UndefVec, StrideMask, "strided.vec"); - // Create phi nodes to merge from the backedge-taken check block. - PHINode *BCResumeVal = PHINode::Create( - OrigPhi->getType(), 3, "bc.resume.val", ScalarPH->getTerminator()); - Value *&EndValue = IVEndValues[OrigPhi]; - if (OrigPhi == OldInduction) { - // We know what the end value is. - EndValue = CountRoundDown; - } else { - IRBuilder<> B(LoopBypassBlocks.back()->getTerminator()); - Type *StepType = II.getStep()->getType(); - Instruction::CastOps CastOp = - CastInst::getCastOpcode(CountRoundDown, true, StepType, true); - Value *CRD = B.CreateCast(CastOp, CountRoundDown, StepType, "cast.crd"); - const DataLayout &DL = OrigLoop->getHeader()->getModule()->getDataLayout(); - EndValue = II.transform(B, CRD, PSE.getSE(), DL); - EndValue->setName("ind.end"); - } + // If this member has different type, cast the result type. + if (Member->getType() != ScalarTy) { + VectorType *OtherVTy = VectorType::get(Member->getType(), VF); + StridedVec = Builder.CreateBitOrPointerCast(StridedVec, OtherVTy); + } - // The new PHI merges the original incoming value, in case of a bypass, - // or the value at the end of the vectorized loop. - BCResumeVal->addIncoming(EndValue, MiddleBlock); - - // Fix the scalar body counter (PHI node). - unsigned BlockIdx = OrigPhi->getBasicBlockIndex(ScalarPH); - - // The old induction's phi node in the scalar body needs the truncated - // value. - for (BasicBlock *BB : LoopBypassBlocks) - BCResumeVal->addIncoming(II.getStartValue(), BB); - OrigPhi->setIncomingValue(BlockIdx, BCResumeVal); + Entry[Part] = + Group->isReverse() ? reverseVector(StridedVec) : StridedVec; + } + VectorLoopValueMap.initVector(Member, Entry); + } + return; } - // Add a check in the middle block to see if we have completed - // all of the iterations in the first vector loop. - // If (N - N%VF) == N, then we *don't* need to run the remainder. - Value *CmpN = - CmpInst::Create(Instruction::ICmp, CmpInst::ICMP_EQ, Count, - CountRoundDown, "cmp.n", MiddleBlock->getTerminator()); - ReplaceInstWithInst(MiddleBlock->getTerminator(), - BranchInst::Create(ExitBlock, ScalarPH, CmpN)); - - // Get ready to start creating new instructions into the vectorized body. - Builder.SetInsertPoint(&*VecBody->getFirstInsertionPt()); - - // Save the state. - LoopVectorPreHeader = Lp->getLoopPreheader(); - LoopScalarPreHeader = ScalarPH; - LoopMiddleBlock = MiddleBlock; - LoopExitBlock = ExitBlock; - LoopVectorBody = VecBody; - LoopScalarBody = OldBasicBlock; - - // Keep all loop hints from the original loop on the vector loop (we'll - // replace the vectorizer-specific hints below). - if (MDNode *LID = OrigLoop->getLoopID()) - Lp->setLoopID(LID); - - LoopVectorizeHints Hints(Lp, true, *ORE); - Hints.setAlreadyVectorized(); -} + // The sub vector type for current instruction. + VectorType *SubVT = VectorType::get(ScalarTy, VF); -// Fix up external users of the induction variable. At this point, we are -// in LCSSA form, with all external PHIs that use the IV having one input value, -// coming from the remainder loop. 
We need those PHIs to also have a correct -// value for the IV when arriving directly from the middle block. -void InnerLoopVectorizer::fixupIVUsers(PHINode *OrigPhi, - const InductionDescriptor &II, - Value *CountRoundDown, Value *EndValue, - BasicBlock *MiddleBlock) { - // There are two kinds of external IV usages - those that use the value - // computed in the last iteration (the PHI) and those that use the penultimate - // value (the value that feeds into the phi from the loop latch). - // We allow both, but they, obviously, have different values. + // Vectorize the interleaved store group. + for (unsigned Part = 0; Part < UF; Part++) { + // Collect the stored vector from each member. + SmallVector StoredVecs; + for (unsigned i = 0; i < InterleaveFactor; i++) { + // Interleaved store group doesn't allow a gap, so each index has a member + Instruction *Member = Group->getMember(i); + assert(Member && "Fail to get a member from an interleaved store group"); - assert(OrigLoop->getExitBlock() && "Expected a single exit block"); + Value *StoredVec = + getVectorValue(cast(Member)->getValueOperand())[Part]; + if (Group->isReverse()) + StoredVec = reverseVector(StoredVec); - DenseMap MissingVals; + // If this member has different type, cast it to an unified type. + if (StoredVec->getType() != SubVT) + StoredVec = Builder.CreateBitOrPointerCast(StoredVec, SubVT); - // An external user of the last iteration's value should see the value that - // the remainder loop uses to initialize its own IV. - Value *PostInc = OrigPhi->getIncomingValueForBlock(OrigLoop->getLoopLatch()); - for (User *U : PostInc->users()) { - Instruction *UI = cast(U); - if (!OrigLoop->contains(UI)) { - assert(isa(UI) && "Expected LCSSA form"); - MissingVals[UI] = EndValue; + StoredVecs.push_back(StoredVec); } - } - // An external user of the penultimate value need to see EndValue - Step. - // The simplest way to get this is to recompute it from the constituent SCEVs, - // that is Start + (Step * (CRD - 1)). - for (User *U : OrigPhi->users()) { - auto *UI = cast(U); - if (!OrigLoop->contains(UI)) { - const DataLayout &DL = - OrigLoop->getHeader()->getModule()->getDataLayout(); - assert(isa(UI) && "Expected LCSSA form"); + // Concatenate all vectors into a wide vector. + Value *WideVec = concatenateVectors(Builder, StoredVecs); - IRBuilder<> B(MiddleBlock->getTerminator()); - Value *CountMinusOne = B.CreateSub( - CountRoundDown, ConstantInt::get(CountRoundDown->getType(), 1)); - Value *CMO = B.CreateSExtOrTrunc(CountMinusOne, II.getStep()->getType(), - "cast.cmo"); - Value *Escape = II.transform(B, CMO, PSE.getSE(), DL); - Escape->setName("ind.escape"); - MissingVals[UI] = Escape; - } - } + // Interleave the elements in the wide vector. + Constant *IMask = createInterleaveMask(Builder, VF, InterleaveFactor); + Value *IVec = Builder.CreateShuffleVector(WideVec, UndefVec, IMask, + "interleaved.vec"); - for (auto &I : MissingVals) { - PHINode *PHI = cast(I.first); - // One corner case we have to handle is two IVs "chasing" each-other, - // that is %IV2 = phi [...], [ %IV1, %latch ] - // In this case, if IV1 has an external use, we need to avoid adding both - // "last value of IV1" and "penultimate value of IV2". So, verify that we - // don't already have an incoming value for the middle block. 
- if (PHI->getBasicBlockIndex(MiddleBlock) == -1) - PHI->addIncoming(I.second, MiddleBlock); + Instruction *NewStoreInstr = + Builder.CreateAlignedStore(IVec, NewPtrs[Part], Group->getAlignment()); + addMetadata(NewStoreInstr, Instr); } } -namespace { -struct CSEDenseMapInfo { - static bool canHandle(Instruction *I) { - return isa(I) || isa(I) || - isa(I) || isa(I); - } - static inline Instruction *getEmptyKey() { - return DenseMapInfo::getEmptyKey(); - } - static inline Instruction *getTombstoneKey() { - return DenseMapInfo::getTombstoneKey(); - } - static unsigned getHashValue(Instruction *I) { - assert(canHandle(I) && "Unknown instruction!"); - return hash_combine(I->getOpcode(), hash_combine_range(I->value_op_begin(), - I->value_op_end())); - } - static bool isEqual(Instruction *LHS, Instruction *RHS) { - if (LHS == getEmptyKey() || RHS == getEmptyKey() || - LHS == getTombstoneKey() || RHS == getTombstoneKey()) - return LHS == RHS; - return LHS->isIdenticalTo(RHS); - } -}; -} +void InnerLoopVectorizer::vectorizeMemoryInstruction(Instruction *Instr) { + // Attempt to issue a wide load. + LoadInst *LI = dyn_cast(Instr); + StoreInst *SI = dyn_cast(Instr); -///\brief Perform cse of induction variable instructions. -static void cse(BasicBlock *BB) { - // Perform simple cse. - SmallDenseMap CSEMap; - for (BasicBlock::iterator I = BB->begin(), E = BB->end(); I != E;) { - Instruction *In = &*I++; + assert((LI || SI) && "Invalid Load/Store instruction"); - if (!CSEDenseMapInfo::canHandle(In)) - continue; + LoopVectorizationCostModel::InstWidening Decision = + Cost->getWideningDecision(Instr, VF); + assert(Decision != LoopVectorizationCostModel::CM_Unknown && + "CM decision should be taken at this point"); + if (Decision == LoopVectorizationCostModel::CM_Interleave) + return vectorizeInterleaveGroup(Instr); - // Check if we can replace this instruction with any of the - // visited instructions. - if (Instruction *V = CSEMap.lookup(In)) { - In->replaceAllUsesWith(V); - In->eraseFromParent(); - continue; - } + Type *ScalarDataTy = getMemInstValueType(Instr); + Type *DataTy = VectorType::get(ScalarDataTy, VF); + Value *Ptr = getPointerOperand(Instr); + unsigned Alignment = getMemInstAlignment(Instr); + // An alignment of 0 means target abi alignment. We need to use the scalar's + // target abi alignment in such a case. + const DataLayout &DL = Instr->getModule()->getDataLayout(); + if (!Alignment) + Alignment = DL.getABITypeAlignment(ScalarDataTy); + unsigned AddressSpace = getMemInstAddressSpace(Instr); - CSEMap[In] = In; - } -} + // Determine if the pointer operand of the access is either consecutive or + // reverse consecutive. + int ConsecutiveStride = Legal->isConsecutivePtr(Ptr); + bool Reverse = ConsecutiveStride < 0; + bool CreateGatherScatter = + (Decision == LoopVectorizationCostModel::CM_GatherScatter); -/// \brief Adds a 'fast' flag to floating point operations. -static Value *addFastMathFlag(Value *V) { - if (isa(V)) { - FastMathFlags Flags; - Flags.setUnsafeAlgebra(); - cast(V)->setFastMathFlags(Flags); - } - return V; -} + VectorParts VectorGep; -/// \brief Estimate the overhead of scalarizing an instruction. This is a -/// convenience wrapper for the type-based getScalarizationOverhead API. -static unsigned getScalarizationOverhead(Instruction *I, unsigned VF, - const TargetTransformInfo &TTI) { - if (VF == 1) - return 0; + // Handle consecutive loads/stores. 
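+  //
+  // Editorial illustration (not part of the original change), assuming VF = 4
+  // and UF = 2 for a unit-stride access A[i]: the consecutive path below emits
+  // one wide access per unroll part, covering A[i..i+3] for part 0 and
+  // A[i+4..i+7] for part 1. For a reverse (stride -1) access, the per-part
+  // pointer is moved back so that the wide access ends at the current lane,
+  // and the value vector is reversed. Any other pointer takes the
+  // gather/scatter path, which computes one address per vector lane.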
+  GetElementPtrInst *Gep = getGEPInstruction(Ptr);
+  if (ConsecutiveStride) {
+    if (Gep) {
+      unsigned NumOperands = Gep->getNumOperands();
+#ifndef NDEBUG
+      // The original GEP that was identified as a consecutive memory access
+      // should have only one loop-variant operand.
+      unsigned NumOfLoopVariantOps = 0;
+      for (unsigned i = 0; i < NumOperands; ++i)
+        if (!PSE.getSE()->isLoopInvariant(PSE.getSCEV(Gep->getOperand(i)),
+                                          OrigLoop))
+          NumOfLoopVariantOps++;
+      assert(NumOfLoopVariantOps == 1 &&
+             "Consecutive GEP should have only one loop-variant operand");
+#endif
+      GetElementPtrInst *Gep2 = cast<GetElementPtrInst>(Gep->clone());
+      Gep2->setName("gep.indvar");
-  unsigned Cost = 0;
-  Type *RetTy = ToVectorTy(I->getType(), VF);
-  if (!RetTy->isVoidTy())
-    Cost += TTI.getScalarizationOverhead(RetTy, true, false);
+      // A new GEP is created for the lane-zero value of the first unroll
+      // iteration. The GEPs for the rest of the unroll iterations are computed
+      // below as an offset from this GEP.
+      for (unsigned i = 0; i < NumOperands; ++i)
+        // We can apply getScalarValue() to all GEP indices. It returns the
+        // original value for a loop-invariant operand and the lane-zero value
+        // for the consecutive operand.
+        Gep2->setOperand(i, getScalarValue(Gep->getOperand(i),
+                                           0, /* First unroll iteration */
+                                           0 /* 0-lane of the vector */ ));
+      setDebugLocFromInst(Builder, Gep);
+      Ptr = Builder.Insert(Gep2);
-  if (CallInst *CI = dyn_cast<CallInst>(I)) {
-    SmallVector<const Value *, 4> Operands(CI->arg_operands());
-    Cost += TTI.getOperandsScalarizationOverhead(Operands, VF);
+    } else { // No GEP
+      setDebugLocFromInst(Builder, Ptr);
+      Ptr = getScalarValue(Ptr, 0, 0);
+    }
  } else {
-    SmallVector<const Value *, 4> Operands(I->operand_values());
-    Cost += TTI.getOperandsScalarizationOverhead(Operands, VF);
-  }
-
-  return Cost;
-}
+    // At this point we should have the vector version of the GEP for a gather
+    // or scatter.
+    assert(CreateGatherScatter && "The instruction should be scalarized");
+    if (Gep) {
+      // Vectorize the GEP across UF parts. We want a vector value for the base
+      // and for each index that is defined inside the loop, even if it is
+      // loop-invariant but was not hoisted out. Otherwise we want to keep it
+      // scalar.
+      SmallVector<VectorParts, 4> OpsV;
+      for (Value *Op : Gep->operands()) {
+        Instruction *SrcInst = dyn_cast<Instruction>(Op);
+        if (SrcInst && OrigLoop->contains(SrcInst))
+          OpsV.push_back(getVectorValue(Op));
+        else
+          OpsV.push_back(VectorParts(UF, Op));
+      }
+      for (unsigned Part = 0; Part < UF; ++Part) {
+        SmallVector<Value *, 4> Ops;
+        Value *GEPBasePtr = OpsV[0][Part];
+        for (unsigned i = 1; i < Gep->getNumOperands(); i++)
+          Ops.push_back(OpsV[i][Part]);
+        Value *NewGep = Builder.CreateGEP(GEPBasePtr, Ops, "VectorGep");
+        cast<GetElementPtrInst>(NewGep)->setIsInBounds(Gep->isInBounds());
+        assert(NewGep->getType()->isVectorTy() && "Expected vector GEP");
-// Estimate cost of a call instruction CI if it were vectorized with factor VF.
-// Return the cost of the instruction, including scalarization overhead if it's
-// needed. The flag NeedToScalarize shows if the call needs to be scalarized -
-// i.e. either vector version isn't available, or is too expensive.
-static unsigned getVectorCallCost(CallInst *CI, unsigned VF, - const TargetTransformInfo &TTI, - const TargetLibraryInfo *TLI, - bool &NeedToScalarize) { - Function *F = CI->getCalledFunction(); - StringRef FnName = CI->getCalledFunction()->getName(); - Type *ScalarRetTy = CI->getType(); - SmallVector Tys, ScalarTys; - for (auto &ArgOp : CI->arg_operands()) - ScalarTys.push_back(ArgOp->getType()); + NewGep = + Builder.CreateBitCast(NewGep, VectorType::get(Ptr->getType(), VF)); + VectorGep.push_back(NewGep); + } + } else + VectorGep = getVectorValue(Ptr); + } - // Estimate cost of scalarized vector call. The source operands are assumed - // to be vectors, so we need to extract individual elements from there, - // execute VF scalar calls, and then gather the result into the vector return - // value. - unsigned ScalarCallCost = TTI.getCallInstrCost(F, ScalarRetTy, ScalarTys); - if (VF == 1) - return ScalarCallCost; + VectorParts Mask = createBlockInMask(Instr->getParent()); + // Handle Stores: + if (SI) { + assert(!Legal->isUniform(SI->getPointerOperand()) && + "We do not allow storing to uniform addresses"); + setDebugLocFromInst(Builder, SI); + // We don't want to update the value in the map as it might be used in + // another expression. So don't use a reference type for "StoredVal". + VectorParts StoredVal = getVectorValue(SI->getValueOperand()); - // Compute corresponding vector type for return value and arguments. - Type *RetTy = ToVectorTy(ScalarRetTy, VF); - for (Type *ScalarTy : ScalarTys) - Tys.push_back(ToVectorTy(ScalarTy, VF)); + for (unsigned Part = 0; Part < UF; ++Part) { + Instruction *NewSI = nullptr; + if (CreateGatherScatter) { + Value *MaskPart = Legal->isMaskRequired(SI) ? Mask[Part] : nullptr; + NewSI = Builder.CreateMaskedScatter(StoredVal[Part], VectorGep[Part], + Alignment, MaskPart); + } else { + // Calculate the pointer for the specific unroll-part. + Value *PartPtr = + Builder.CreateGEP(nullptr, Ptr, Builder.getInt32(Part * VF)); - // Compute costs of unpacking argument values for the scalar calls and - // packing the return values to a vector. - unsigned ScalarizationCost = getScalarizationOverhead(CI, VF, TTI); + if (Reverse) { + // If we store to reverse consecutive memory locations, then we need + // to reverse the order of elements in the stored value. + StoredVal[Part] = reverseVector(StoredVal[Part]); + // If the address is consecutive but reversed, then the + // wide store needs to start at the last vector element. + PartPtr = + Builder.CreateGEP(nullptr, Ptr, Builder.getInt32(-Part * VF)); + PartPtr = + Builder.CreateGEP(nullptr, PartPtr, Builder.getInt32(1 - VF)); + Mask[Part] = reverseVector(Mask[Part]); + } - unsigned Cost = ScalarCallCost * VF + ScalarizationCost; + Value *VecPtr = + Builder.CreateBitCast(PartPtr, DataTy->getPointerTo(AddressSpace)); - // If we can't emit a vector call for this function, then the currently found - // cost is the cost we need to return. - NeedToScalarize = true; - if (!TLI || !TLI->isFunctionVectorizable(FnName, VF) || CI->isNoBuiltin()) - return Cost; + if (Legal->isMaskRequired(SI)) + NewSI = Builder.CreateMaskedStore(StoredVal[Part], VecPtr, Alignment, + Mask[Part]); + else + NewSI = + Builder.CreateAlignedStore(StoredVal[Part], VecPtr, Alignment); + } + addMetadata(NewSI, SI); + } + return; + } - // If the corresponding vector cost is cheaper, return its cost. 
- unsigned VectorCallCost = TTI.getCallInstrCost(nullptr, RetTy, Tys); - if (VectorCallCost < Cost) { - NeedToScalarize = false; - return VectorCallCost; + // Handle loads. + assert(LI && "Must have a load instruction"); + setDebugLocFromInst(Builder, LI); + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Instruction *NewLI; + if (CreateGatherScatter) { + Value *MaskPart = Legal->isMaskRequired(LI) ? Mask[Part] : nullptr; + NewLI = Builder.CreateMaskedGather(VectorGep[Part], Alignment, MaskPart, + 0, "wide.masked.gather"); + Entry[Part] = NewLI; + } else { + // Calculate the pointer for the specific unroll-part. + Value *PartPtr = + Builder.CreateGEP(nullptr, Ptr, Builder.getInt32(Part * VF)); + + if (Reverse) { + // If the address is consecutive but reversed, then the + // wide load needs to start at the last vector element. + PartPtr = Builder.CreateGEP(nullptr, Ptr, Builder.getInt32(-Part * VF)); + PartPtr = Builder.CreateGEP(nullptr, PartPtr, Builder.getInt32(1 - VF)); + Mask[Part] = reverseVector(Mask[Part]); + } + + Value *VecPtr = + Builder.CreateBitCast(PartPtr, DataTy->getPointerTo(AddressSpace)); + if (Legal->isMaskRequired(LI)) + NewLI = Builder.CreateMaskedLoad(VecPtr, Alignment, Mask[Part], + UndefValue::get(DataTy), + "wide.masked.load"); + else + NewLI = Builder.CreateAlignedLoad(VecPtr, Alignment, "wide.load"); + Entry[Part] = Reverse ? reverseVector(NewLI) : NewLI; + } + addMetadata(NewLI, LI); } - return Cost; + VectorLoopValueMap.initVector(Instr, Entry); } -// Estimate cost of an intrinsic call instruction CI if it were vectorized with -// factor VF. Return the cost of the instruction, including scalarization -// overhead if it's needed. -static unsigned getVectorIntrinsicCost(CallInst *CI, unsigned VF, - const TargetTransformInfo &TTI, - const TargetLibraryInfo *TLI) { - Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); - assert(ID && "Expected intrinsic call!"); +void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr, + unsigned MinPart, + unsigned MaxPart, + unsigned MinLane, + unsigned MaxLane) { + assert(!Instr->getType()->isAggregateType() && "Can't handle vectors"); + // Holds vector parameters or scalars, in case of uniform vals. + SmallVector Params; - Type *RetTy = ToVectorTy(CI->getType(), VF); - SmallVector Tys; - for (Value *ArgOperand : CI->arg_operands()) - Tys.push_back(ToVectorTy(ArgOperand->getType(), VF)); + setDebugLocFromInst(Builder, Instr); - FastMathFlags FMF; - if (auto *FPMO = dyn_cast(CI)) - FMF = FPMO->getFastMathFlags(); + // Does this instruction return a value ? + bool IsVoidRetTy = Instr->getType()->isVoidTy(); - return TTI.getIntrinsicInstrCost(ID, RetTy, Tys, FMF); -} + // Initialize a new scalar map entry. + ScalarParts &Entry = VectorLoopValueMap.getOrCreateScalar(Instr, VF); -static Type *smallestIntegerVectorType(Type *T1, Type *T2) { - auto *I1 = cast(T1->getVectorElementType()); - auto *I2 = cast(T2->getVectorElementType()); - return I1->getBitWidth() < I2->getBitWidth() ? T1 : T2; -} -static Type *largestIntegerVectorType(Type *T1, Type *T2) { - auto *I1 = cast(T1->getVectorElementType()); - auto *I2 = cast(T2->getVectorElementType()); - return I1->getBitWidth() > I2->getBitWidth() ? 
T1 : T2; + // For each vector unroll 'part': + for (unsigned Part = MinPart; Part <= MaxPart; ++Part) { + // For each scalar that we create: + for (unsigned Lane = MinLane; Lane <= MaxLane; ++Lane) { + + Instruction *Cloned = Instr->clone(); + if (!IsVoidRetTy) + Cloned->setName(Instr->getName() + ".cloned"); + + // Replace the operands of the cloned instructions with their scalar + // equivalents in the new loop. + for (unsigned op = 0, e = Instr->getNumOperands(); op != e; ++op) { + auto *NewOp = getScalarValue(Instr->getOperand(op), Part, Lane); + Cloned->setOperand(op, NewOp); + } + addNewMetadata(Cloned, Instr); + + // Place the cloned scalar in the new loop. + Builder.Insert(Cloned); + + // Add the cloned scalar to the scalar map entry. + Entry[Part][Lane] = Cloned; + + // If we just cloned a new assumption, add it the assumption cache. + if (auto *II = dyn_cast(Cloned)) + if (II->getIntrinsicID() == Intrinsic::assume) + AC->registerAssumption(II); + } + } } -void InnerLoopVectorizer::truncateToMinimalBitwidths() { - // For every instruction `I` in MinBWs, truncate the operands, create a - // truncated version of `I` and reextend its result. InstCombine runs - // later and will remove any ext/trunc pairs. - // - SmallPtrSet Erased; - for (const auto &KV : Cost->getMinimalBitwidths()) { - // If the value wasn't vectorized, we must maintain the original scalar - // type. The absence of the value from VectorLoopValueMap indicates that it - // wasn't vectorized. - if (!VectorLoopValueMap.hasVector(KV.first)) - continue; - VectorParts &Parts = VectorLoopValueMap.getVector(KV.first); - for (Value *&I : Parts) { - if (Erased.count(I) || I->use_empty() || !isa(I)) - continue; - Type *OriginalTy = I->getType(); - Type *ScalarTruncatedTy = - IntegerType::get(OriginalTy->getContext(), KV.second); - Type *TruncatedTy = VectorType::get(ScalarTruncatedTy, - OriginalTy->getVectorNumElements()); - if (TruncatedTy == OriginalTy) - continue; +PHINode *InnerLoopVectorizer::createInductionVariable(Loop *L, Value *Start, + Value *End, Value *Step, + Instruction *DL) { + BasicBlock *Header = L->getHeader(); + BasicBlock *Latch = L->getLoopLatch(); + // As we're just creating this loop, it's possible no latch exists + // yet. If so, use the header as this will be a single block loop. + if (!Latch) + Latch = Header; - IRBuilder<> B(cast(I)); - auto ShrinkOperand = [&](Value *V) -> Value * { - if (auto *ZI = dyn_cast(V)) - if (ZI->getSrcTy() == TruncatedTy) - return ZI->getOperand(0); - return B.CreateZExtOrTrunc(V, TruncatedTy); - }; + IRBuilder<> Builder(&*Header->getFirstInsertionPt()); + Instruction *OldInst = getDebugLocFromInstOrOperands(OldInduction); + setDebugLocFromInst(Builder, OldInst); + auto *Induction = Builder.CreatePHI(Start->getType(), 2, "index"); - // The actual instruction modification depends on the instruction type, - // unfortunately. 
- Value *NewI = nullptr; - if (auto *BO = dyn_cast(I)) { - NewI = B.CreateBinOp(BO->getOpcode(), ShrinkOperand(BO->getOperand(0)), - ShrinkOperand(BO->getOperand(1))); - cast(NewI)->copyIRFlags(I); - } else if (auto *CI = dyn_cast(I)) { - NewI = - B.CreateICmp(CI->getPredicate(), ShrinkOperand(CI->getOperand(0)), - ShrinkOperand(CI->getOperand(1))); - } else if (auto *SI = dyn_cast(I)) { - NewI = B.CreateSelect(SI->getCondition(), - ShrinkOperand(SI->getTrueValue()), - ShrinkOperand(SI->getFalseValue())); - } else if (auto *CI = dyn_cast(I)) { - switch (CI->getOpcode()) { - default: - llvm_unreachable("Unhandled cast!"); - case Instruction::Trunc: - NewI = ShrinkOperand(CI->getOperand(0)); - break; - case Instruction::SExt: - NewI = B.CreateSExtOrTrunc( - CI->getOperand(0), - smallestIntegerVectorType(OriginalTy, TruncatedTy)); - break; - case Instruction::ZExt: - NewI = B.CreateZExtOrTrunc( - CI->getOperand(0), - smallestIntegerVectorType(OriginalTy, TruncatedTy)); - break; - } - } else if (auto *SI = dyn_cast(I)) { - auto Elements0 = SI->getOperand(0)->getType()->getVectorNumElements(); - auto *O0 = B.CreateZExtOrTrunc( - SI->getOperand(0), VectorType::get(ScalarTruncatedTy, Elements0)); - auto Elements1 = SI->getOperand(1)->getType()->getVectorNumElements(); - auto *O1 = B.CreateZExtOrTrunc( - SI->getOperand(1), VectorType::get(ScalarTruncatedTy, Elements1)); + Builder.SetInsertPoint(Latch->getTerminator()); + setDebugLocFromInst(Builder, OldInst); - NewI = B.CreateShuffleVector(O0, O1, SI->getMask()); - } else if (isa(I)) { - // Don't do anything with the operands, just extend the result. - continue; - } else if (auto *IE = dyn_cast(I)) { - auto Elements = IE->getOperand(0)->getType()->getVectorNumElements(); - auto *O0 = B.CreateZExtOrTrunc( - IE->getOperand(0), VectorType::get(ScalarTruncatedTy, Elements)); - auto *O1 = B.CreateZExtOrTrunc(IE->getOperand(1), ScalarTruncatedTy); - NewI = B.CreateInsertElement(O0, O1, IE->getOperand(2)); - } else if (auto *EE = dyn_cast(I)) { - auto Elements = EE->getOperand(0)->getType()->getVectorNumElements(); - auto *O0 = B.CreateZExtOrTrunc( - EE->getOperand(0), VectorType::get(ScalarTruncatedTy, Elements)); - NewI = B.CreateExtractElement(O0, EE->getOperand(2)); - } else { - llvm_unreachable("Unhandled instruction type!"); - } + // Create i+1 and fill the PHINode. + Value *Next = Builder.CreateAdd(Induction, Step, "index.next"); + Induction->addIncoming(Start, L->getLoopPreheader()); + Induction->addIncoming(Next, Latch); + // Create the compare. + Value *ICmp = Builder.CreateICmpEQ(Next, End); + Builder.CreateCondBr(ICmp, L->getExitBlock(), Header); - // Lastly, extend the result. - NewI->takeName(cast(I)); - Value *Res = B.CreateZExtOrTrunc(NewI, OriginalTy); - I->replaceAllUsesWith(Res); - cast(I)->eraseFromParent(); - Erased.insert(I); - I = Res; - } - } + // Now we have two terminators. Remove the old one from the block. + Latch->getTerminator()->eraseFromParent(); - // We'll have created a bunch of ZExts that are now parentless. Clean up. - for (const auto &KV : Cost->getMinimalBitwidths()) { - // If the value wasn't vectorized, we must maintain the original scalar - // type. The absence of the value from VectorLoopValueMap indicates that it - // wasn't vectorized. 
- if (!VectorLoopValueMap.hasVector(KV.first)) - continue; - VectorParts &Parts = VectorLoopValueMap.getVector(KV.first); - for (Value *&I : Parts) { - ZExtInst *Inst = dyn_cast(I); - if (Inst && Inst->use_empty()) { - Value *NewI = Inst->getOperand(0); - Inst->eraseFromParent(); - I = NewI; - } - } - } + return Induction; } -void InnerLoopVectorizer::vectorizeLoop() { - //===------------------------------------------------===// - // - // Notice: any optimization or new instruction that go - // into the code below should be also be implemented in - // the cost-model. - // - //===------------------------------------------------===// - Constant *Zero = Builder.getInt32(0); - - // In order to support recurrences we need to be able to vectorize Phi nodes. - // Phi nodes have cycles, so we need to vectorize them in two stages. First, - // we create a new vector PHI node with no incoming edges. We use this value - // when we vectorize all of the instructions that use the PHI. Next, after - // all of the instructions in the block are complete we add the new incoming - // edges to the PHI. At this point all of the instructions in the basic block - // are vectorized, so we can use them to construct the PHI. - PhiVector PHIsToFix; - - // Collect instructions from the original loop that will become trivially - // dead in the vectorized loop. We don't need to vectorize these - // instructions. - collectTriviallyDeadInstructions(); +Value *InnerLoopVectorizer::getOrCreateTripCount(Loop *L) { + if (TripCount) + return TripCount; - // Scan the loop in a topological order to ensure that defs are vectorized - // before users. - LoopBlocksDFS DFS(OrigLoop); - DFS.perform(LI); + IRBuilder<> Builder(L->getLoopPreheader()->getTerminator()); + // Find the loop boundaries. + ScalarEvolution *SE = PSE.getSE(); + const SCEV *BackedgeTakenCount = PSE.getBackedgeTakenCount(); + assert(BackedgeTakenCount != SE->getCouldNotCompute() && + "Invalid loop count"); - // Vectorize all of the blocks in the original loop. - for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) - vectorizeBlockInLoop(BB, &PHIsToFix); + Type *IdxTy = Legal->getWidestInductionType(); - // Insert truncates and extends for any truncated instructions as hints to - // InstCombine. - if (VF > 1) - truncateToMinimalBitwidths(); + // The exit count might have the type of i64 while the phi is i32. This can + // happen if we have an induction variable that is sign extended before the + // compare. The only way that we get a backedge taken count is that the + // induction variable was signed and as such will not overflow. In such a case + // truncation is legal. + if (BackedgeTakenCount->getType()->getPrimitiveSizeInBits() > + IdxTy->getPrimitiveSizeInBits()) + BackedgeTakenCount = SE->getTruncateOrNoop(BackedgeTakenCount, IdxTy); + BackedgeTakenCount = SE->getNoopOrZeroExtend(BackedgeTakenCount, IdxTy); - // At this point every instruction in the original loop is widened to a - // vector form. Now we need to fix the recurrences in PHIsToFix. These PHI - // nodes are currently empty because we did not want to introduce cycles. - // This is the second stage of vectorizing recurrences. - for (PHINode *Phi : PHIsToFix) { - assert(Phi && "Unable to recover vectorized PHI"); + // Get the total trip count from the count by adding 1. + const SCEV *ExitCount = SE->getAddExpr( + BackedgeTakenCount, SE->getOne(BackedgeTakenCount->getType())); - // Handle first-order recurrences that need to be fixed. 
- if (Legal->isFirstOrderRecurrence(Phi)) { - fixFirstOrderRecurrence(Phi); - continue; - } + const DataLayout &DL = L->getHeader()->getModule()->getDataLayout(); - // If the phi node is not a first-order recurrence, it must be a reduction. - // Get it's reduction variable descriptor. - assert(Legal->isReductionVariable(Phi) && - "Unable to find the reduction variable"); - RecurrenceDescriptor RdxDesc = (*Legal->getReductionVars())[Phi]; - - RecurrenceDescriptor::RecurrenceKind RK = RdxDesc.getRecurrenceKind(); - TrackingVH ReductionStartValue = RdxDesc.getRecurrenceStartValue(); - Instruction *LoopExitInst = RdxDesc.getLoopExitInstr(); - RecurrenceDescriptor::MinMaxRecurrenceKind MinMaxKind = - RdxDesc.getMinMaxRecurrenceKind(); - setDebugLocFromInst(Builder, ReductionStartValue); - - // We need to generate a reduction vector from the incoming scalar. - // To do so, we need to generate the 'identity' vector and override - // one of the elements with the incoming scalar reduction. We need - // to do it in the vector-loop preheader. - Builder.SetInsertPoint(LoopBypassBlocks[1]->getTerminator()); - - // This is the vector-clone of the value that leaves the loop. - const VectorParts &VectorExit = getVectorValue(LoopExitInst); - Type *VecTy = VectorExit[0]->getType(); - - // Find the reduction identity variable. Zero for addition, or, xor, - // one for multiplication, -1 for And. - Value *Identity; - Value *VectorStart; - if (RK == RecurrenceDescriptor::RK_IntegerMinMax || - RK == RecurrenceDescriptor::RK_FloatMinMax) { - // MinMax reduction have the start value as their identify. - if (VF == 1) { - VectorStart = Identity = ReductionStartValue; - } else { - VectorStart = Identity = - Builder.CreateVectorSplat(VF, ReductionStartValue, "minmax.ident"); - } - } else { - // Handle other reduction kinds: - Constant *Iden = RecurrenceDescriptor::getRecurrenceIdentity( - RK, VecTy->getScalarType()); - if (VF == 1) { - Identity = Iden; - // This vector is the Identity vector where the first element is the - // incoming scalar reduction. - VectorStart = ReductionStartValue; - } else { - Identity = ConstantVector::getSplat(VF, Iden); + // Expand the trip count and place the new instructions in the preheader. + // Notice that the pre-header does not change, only the loop body. + SCEVExpander Exp(*SE, DL, "induction"); - // This vector is the Identity vector where the first element is the - // incoming scalar reduction. - VectorStart = - Builder.CreateInsertElement(Identity, ReductionStartValue, Zero); - } - } + // Count holds the overall loop count (N). + TripCount = Exp.expandCodeFor(ExitCount, ExitCount->getType(), + L->getLoopPreheader()->getTerminator()); - // Fix the vector-loop phi. + if (TripCount->getType()->isPointerTy()) + TripCount = + CastInst::CreatePointerCast(TripCount, IdxTy, "exitcount.ptrcnt.to.int", + L->getLoopPreheader()->getTerminator()); - // Reductions do not have to start at zero. They can start with - // any loop invariant values. - const VectorParts &VecRdxPhi = getVectorValue(Phi); - BasicBlock *Latch = OrigLoop->getLoopLatch(); - Value *LoopVal = Phi->getIncomingValueForBlock(Latch); - const VectorParts &Val = getVectorValue(LoopVal); - for (unsigned part = 0; part < UF; ++part) { - // Make sure to add the reduction stat value only to the - // first unroll part. - Value *StartVal = (part == 0) ? 
VectorStart : Identity; - cast(VecRdxPhi[part]) - ->addIncoming(StartVal, LoopVectorPreHeader); - cast(VecRdxPhi[part]) - ->addIncoming(Val[part], LoopVectorBody); - } + return TripCount; +} - // Before each round, move the insertion point right between - // the PHIs and the values we are going to write. - // This allows us to write both PHINodes and the extractelement - // instructions. - Builder.SetInsertPoint(&*LoopMiddleBlock->getFirstInsertionPt()); +Value *InnerLoopVectorizer::getOrCreateVectorTripCount(Loop *L) { + if (VectorTripCount) + return VectorTripCount; - VectorParts &RdxParts = VectorLoopValueMap.getVector(LoopExitInst); - setDebugLocFromInst(Builder, LoopExitInst); + Value *TC = getOrCreateTripCount(L); + IRBuilder<> Builder(L->getLoopPreheader()->getTerminator()); - // If the vector reduction can be performed in a smaller type, we truncate - // then extend the loop exit value to enable InstCombine to evaluate the - // entire expression in the smaller type. - if (VF > 1 && Phi->getType() != RdxDesc.getRecurrenceType()) { - Type *RdxVecTy = VectorType::get(RdxDesc.getRecurrenceType(), VF); - Builder.SetInsertPoint(LoopVectorBody->getTerminator()); - for (unsigned part = 0; part < UF; ++part) { - Value *Trunc = Builder.CreateTrunc(RdxParts[part], RdxVecTy); - Value *Extnd = RdxDesc.isSigned() ? Builder.CreateSExt(Trunc, VecTy) - : Builder.CreateZExt(Trunc, VecTy); - for (Value::user_iterator UI = RdxParts[part]->user_begin(); - UI != RdxParts[part]->user_end();) - if (*UI != Trunc) { - (*UI++)->replaceUsesOfWith(RdxParts[part], Extnd); - RdxParts[part] = Extnd; - } else { - ++UI; - } - } - Builder.SetInsertPoint(&*LoopMiddleBlock->getFirstInsertionPt()); - for (unsigned part = 0; part < UF; ++part) - RdxParts[part] = Builder.CreateTrunc(RdxParts[part], RdxVecTy); - } + // Now we need to generate the expression for the part of the loop that the + // vectorized body will execute. This is equal to N - (N % Step) if scalar + // iterations are not required for correctness, or N - Step, otherwise. Step + // is equal to the vectorization factor (number of SIMD elements) times the + // unroll factor (number of SIMD instructions). + Constant *Step = ConstantInt::get(TC->getType(), VF * UF); + Value *R = Builder.CreateURem(TC, Step, "n.mod.vf"); - // Reduce all of the unrolled parts into a single vector. - Value *ReducedPartRdx = RdxParts[0]; - unsigned Op = RecurrenceDescriptor::getRecurrenceBinOp(RK); - setDebugLocFromInst(Builder, ReducedPartRdx); - for (unsigned part = 1; part < UF; ++part) { - if (Op != Instruction::ICmp && Op != Instruction::FCmp) - // Floating point operations had to be 'fast' to enable the reduction. - ReducedPartRdx = addFastMathFlag( - Builder.CreateBinOp((Instruction::BinaryOps)Op, RdxParts[part], - ReducedPartRdx, "bin.rdx")); - else - ReducedPartRdx = RecurrenceDescriptor::createMinMaxOp( - Builder, MinMaxKind, ReducedPartRdx, RdxParts[part]); - } + // If there is a non-reversed interleaved group that may speculatively access + // memory out-of-bounds, we need to ensure that there will be at least one + // iteration of the scalar epilogue loop. Thus, if the step evenly divides + // the trip count, we set the remainder to be equal to the step. If the step + // does not evenly divide the trip count, no adjustment is necessary since + // there will already be scalar iterations. Note that the minimum iterations + // check ensures that N >= Step. 
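+  //
+  // Editorial worked example (numbers added for illustration only): with
+  // VF = 4 and UF = 2 the step is 8. A trip count of N = 17 gives
+  // R = 17 urem 8 = 1 and a vector trip count of 16, leaving one scalar
+  // iteration. If a scalar epilogue is required and N = 16, R would be 0, so
+  // it is bumped up to 8 and the vector trip count becomes 8, leaving eight
+  // iterations for the scalar remainder loop.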
+ if (VF > 1 && Legal->requiresScalarEpilogue()) { + auto *IsZero = Builder.CreateICmpEQ(R, ConstantInt::get(R->getType(), 0)); + R = Builder.CreateSelect(IsZero, Step, R); + } - if (VF > 1) { - // VF is a power of 2 so we can emit the reduction using log2(VF) shuffles - // and vector ops, reducing the set of values being computed by half each - // round. - assert(isPowerOf2_32(VF) && - "Reduction emission only supported for pow2 vectors!"); - Value *TmpVec = ReducedPartRdx; - SmallVector ShuffleMask(VF, nullptr); - for (unsigned i = VF; i != 1; i >>= 1) { - // Move the upper half of the vector to the lower half. - for (unsigned j = 0; j != i / 2; ++j) - ShuffleMask[j] = Builder.getInt32(i / 2 + j); - - // Fill the rest of the mask with undef. - std::fill(&ShuffleMask[i / 2], ShuffleMask.end(), - UndefValue::get(Builder.getInt32Ty())); - - Value *Shuf = Builder.CreateShuffleVector( - TmpVec, UndefValue::get(TmpVec->getType()), - ConstantVector::get(ShuffleMask), "rdx.shuf"); - - if (Op != Instruction::ICmp && Op != Instruction::FCmp) - // Floating point operations had to be 'fast' to enable the reduction. - TmpVec = addFastMathFlag(Builder.CreateBinOp( - (Instruction::BinaryOps)Op, TmpVec, Shuf, "bin.rdx")); - else - TmpVec = RecurrenceDescriptor::createMinMaxOp(Builder, MinMaxKind, - TmpVec, Shuf); - } + VectorTripCount = Builder.CreateSub(TC, R, "n.vec"); - // The result is in the first element of the vector. - ReducedPartRdx = - Builder.CreateExtractElement(TmpVec, Builder.getInt32(0)); - - // If the reduction can be performed in a smaller type, we need to extend - // the reduction to the wider type before we branch to the original loop. - if (Phi->getType() != RdxDesc.getRecurrenceType()) - ReducedPartRdx = - RdxDesc.isSigned() - ? Builder.CreateSExt(ReducedPartRdx, Phi->getType()) - : Builder.CreateZExt(ReducedPartRdx, Phi->getType()); - } + return VectorTripCount; +} - // Create a phi node that merges control-flow from the backedge-taken check - // block and the middle block. - PHINode *BCBlockPhi = PHINode::Create(Phi->getType(), 2, "bc.merge.rdx", - LoopScalarPreHeader->getTerminator()); - for (unsigned I = 0, E = LoopBypassBlocks.size(); I != E; ++I) - BCBlockPhi->addIncoming(ReductionStartValue, LoopBypassBlocks[I]); - BCBlockPhi->addIncoming(ReducedPartRdx, LoopMiddleBlock); - - // Now, we need to fix the users of the reduction variable - // inside and outside of the scalar remainder loop. - // We know that the loop is in LCSSA form. We need to update the - // PHI nodes in the exit blocks. - for (BasicBlock::iterator LEI = LoopExitBlock->begin(), - LEE = LoopExitBlock->end(); - LEI != LEE; ++LEI) { - PHINode *LCSSAPhi = dyn_cast(LEI); - if (!LCSSAPhi) - break; +void InnerLoopVectorizer::emitMinimumIterationCountCheck(Loop *L, + BasicBlock *Bypass) { + Value *Count = getOrCreateTripCount(L); + BasicBlock *BB = L->getLoopPreheader(); + IRBuilder<> Builder(BB->getTerminator()); - // All PHINodes need to have a single entry edge, or two if - // we already fixed them. - assert(LCSSAPhi->getNumIncomingValues() < 3 && "Invalid LCSSA PHI"); + // Generate code to check that the loop's trip count that we computed by + // adding one to the backedge-taken count will not overflow. + Value *CheckMinIters = Builder.CreateICmpULT( + Count, ConstantInt::get(Count->getType(), VF * UF), "min.iters.check"); - // We found a reduction value exit-PHI. Update it with the - // incoming bypass edge. 
- if (LCSSAPhi->getIncomingValue(0) == LoopExitInst) - LCSSAPhi->addIncoming(ReducedPartRdx, LoopMiddleBlock); - } // end of the LCSSA phi scan. + BasicBlock *NewBB = + BB->splitBasicBlock(BB->getTerminator(), "min.iters.checked"); + // Update dominator tree immediately if the generated block is a + // LoopBypassBlock because SCEV expansions to generate loop bypass + // checks may query it before the current function is finished. + DT->addNewBlock(NewBB, BB); + if (L->getParentLoop()) + L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI); + ReplaceInstWithInst(BB->getTerminator(), + BranchInst::Create(Bypass, NewBB, CheckMinIters)); + LoopBypassBlocks.push_back(BB); +} - // Fix the scalar loop reduction variable with the incoming reduction sum - // from the vector body and from the backedge value. - int IncomingEdgeBlockIdx = - Phi->getBasicBlockIndex(OrigLoop->getLoopLatch()); - assert(IncomingEdgeBlockIdx >= 0 && "Invalid block index"); - // Pick the other block. - int SelfEdgeBlockIdx = (IncomingEdgeBlockIdx ? 0 : 1); - Phi->setIncomingValue(SelfEdgeBlockIdx, BCBlockPhi); - Phi->setIncomingValue(IncomingEdgeBlockIdx, LoopExitInst); - } // end of for each Phi in PHIsToFix. +void InnerLoopVectorizer::emitVectorLoopEnteredCheck(Loop *L, + BasicBlock *Bypass) { + Value *TC = getOrCreateVectorTripCount(L); + BasicBlock *BB = L->getLoopPreheader(); + IRBuilder<> Builder(BB->getTerminator()); - // Update the dominator tree. - // - // FIXME: After creating the structure of the new loop, the dominator tree is - // no longer up-to-date, and it remains that way until we update it - // here. An out-of-date dominator tree is problematic for SCEV, - // because SCEVExpander uses it to guide code generation. The - // vectorizer use SCEVExpanders in several places. Instead, we should - // keep the dominator tree up-to-date as we go. - updateAnalysis(); + // Now, compare the new count to zero. If it is zero skip the vector loop and + // jump to the scalar loop. + Value *Cmp = Builder.CreateICmpEQ(TC, Constant::getNullValue(TC->getType()), + "cmp.zero"); - // Fix-up external users of the induction variables. - for (auto &Entry : *Legal->getInductionVars()) - fixupIVUsers(Entry.first, Entry.second, - getOrCreateVectorTripCount(LI->getLoopFor(LoopVectorBody)), - IVEndValues[Entry.first], LoopMiddleBlock); - - fixLCSSAPHIs(); - predicateInstructions(); - - // Remove redundant induction instructions. - cse(LoopVectorBody); + // Generate code to check that the loop's trip count that we computed by + // adding one to the backedge-taken count will not overflow. + BasicBlock *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph"); + // Update dominator tree immediately if the generated block is a + // LoopBypassBlock because SCEV expansions to generate loop bypass + // checks may query it before the current function is finished. + DT->addNewBlock(NewBB, BB); + if (L->getParentLoop()) + L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI); + ReplaceInstWithInst(BB->getTerminator(), + BranchInst::Create(Bypass, NewBB, Cmp)); + LoopBypassBlocks.push_back(BB); } -void InnerLoopVectorizer::fixFirstOrderRecurrence(PHINode *Phi) { +void InnerLoopVectorizer::emitSCEVChecks(Loop *L, BasicBlock *Bypass) { + BasicBlock *BB = L->getLoopPreheader(); - // This is the second phase of vectorizing first-order recurrences. An - // overview of the transformation is described below. Suppose we have the - // following loop. 
- // - // for (int i = 0; i < n; ++i) - // b[i] = a[i] - a[i - 1]; - // - // There is a first-order recurrence on "a". For this loop, the shorthand - // scalar IR looks like: - // - // scalar.ph: - // s_init = a[-1] - // br scalar.body - // - // scalar.body: - // i = phi [0, scalar.ph], [i+1, scalar.body] - // s1 = phi [s_init, scalar.ph], [s2, scalar.body] - // s2 = a[i] - // b[i] = s2 - s1 - // br cond, scalar.body, ... - // - // In this example, s1 is a recurrence because it's value depends on the - // previous iteration. In the first phase of vectorization, we created a - // temporary value for s1. We now complete the vectorization and produce the - // shorthand vector IR shown below (for VF = 4, UF = 1). - // - // vector.ph: - // v_init = vector(..., ..., ..., a[-1]) - // br vector.body - // - // vector.body - // i = phi [0, vector.ph], [i+4, vector.body] - // v1 = phi [v_init, vector.ph], [v2, vector.body] - // v2 = a[i, i+1, i+2, i+3]; - // v3 = vector(v1(3), v2(0, 1, 2)) - // b[i, i+1, i+2, i+3] = v2 - v3 - // br cond, vector.body, middle.block - // - // middle.block: - // x = v2(3) - // br scalar.ph - // - // scalar.ph: - // s_init = phi [x, middle.block], [a[-1], otherwise] - // br scalar.body - // - // After execution completes the vector loop, we extract the next value of - // the recurrence (x) to use as the initial value in the scalar loop. + // Generate the code to check that the SCEV assumptions that we made. + // We want the new basic block to start at the first instruction in a + // sequence of instructions that form a check. + SCEVExpander Exp(*PSE.getSE(), Bypass->getModule()->getDataLayout(), + "scev.check"); + Value *SCEVCheck = + Exp.expandCodeForPredicate(&PSE.getUnionPredicate(), BB->getTerminator()); - // Get the original loop preheader and single loop latch. - auto *Preheader = OrigLoop->getLoopPreheader(); - auto *Latch = OrigLoop->getLoopLatch(); + if (auto *C = dyn_cast(SCEVCheck)) + if (C->isZero()) + return; - // Get the initial and previous values of the scalar recurrence. - auto *ScalarInit = Phi->getIncomingValueForBlock(Preheader); - auto *Previous = Phi->getIncomingValueForBlock(Latch); + // Create a new block containing the stride check. + BB->setName("vector.scevcheck"); + auto *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph"); + // Update dominator tree immediately if the generated block is a + // LoopBypassBlock because SCEV expansions to generate loop bypass + // checks may query it before the current function is finished. + DT->addNewBlock(NewBB, BB); + if (L->getParentLoop()) + L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI); + ReplaceInstWithInst(BB->getTerminator(), + BranchInst::Create(Bypass, NewBB, SCEVCheck)); + LoopBypassBlocks.push_back(BB); + AddedSafetyChecks = true; +} - // Create a vector from the initial value. - auto *VectorInit = ScalarInit; - if (VF > 1) { - Builder.SetInsertPoint(LoopVectorPreHeader->getTerminator()); - VectorInit = Builder.CreateInsertElement( - UndefValue::get(VectorType::get(VectorInit->getType(), VF)), VectorInit, - Builder.getInt32(VF - 1), "vector.recur.init"); - } +void InnerLoopVectorizer::emitMemRuntimeChecks(Loop *L, BasicBlock *Bypass) { + BasicBlock *BB = L->getLoopPreheader(); - // We constructed a temporary phi node in the first phase of vectorization. - // This phi node will eventually be deleted. - VectorParts &PhiParts = VectorLoopValueMap.getVector(Phi); - Builder.SetInsertPoint(cast(PhiParts[0])); + // Generate the code that checks in runtime if arrays overlap. 
We put the + // checks into a separate block to make the more common case of few elements + // faster. + Instruction *FirstCheckInst; + Instruction *MemRuntimeCheck; + std::tie(FirstCheckInst, MemRuntimeCheck) = + Legal->getLAI()->addRuntimeChecks(BB->getTerminator()); + if (!MemRuntimeCheck) + return; - // Create a phi node for the new recurrence. The current value will either be - // the initial value inserted into a vector or loop-varying vector value. - auto *VecPhi = Builder.CreatePHI(VectorInit->getType(), 2, "vector.recur"); - VecPhi->addIncoming(VectorInit, LoopVectorPreHeader); + // Create a new block containing the memory check. + BB->setName("vector.memcheck"); + auto *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph"); + // Update dominator tree immediately if the generated block is a + // LoopBypassBlock because SCEV expansions to generate loop bypass + // checks may query it before the current function is finished. + DT->addNewBlock(NewBB, BB); + if (L->getParentLoop()) + L->getParentLoop()->addBasicBlockToLoop(NewBB, *LI); + ReplaceInstWithInst(BB->getTerminator(), + BranchInst::Create(Bypass, NewBB, MemRuntimeCheck)); + LoopBypassBlocks.push_back(BB); + AddedSafetyChecks = true; - // Get the vectorized previous value. We ensured the previous values was an - // instruction when detecting the recurrence. - auto &PreviousParts = getVectorValue(Previous); + // We currently don't use LoopVersioning for the actual loop cloning but we + // still use it to add the noalias metadata. + LVer = llvm::make_unique(*Legal->getLAI(), OrigLoop, LI, DT, + PSE.getSE()); + LVer->prepareNoAliasMetadata(); +} - // Set the insertion point to be after this instruction. We ensured the - // previous value dominated all uses of the phi when detecting the - // recurrence. - Builder.SetInsertPoint( - &*++BasicBlock::iterator(cast(PreviousParts[UF - 1]))); +void InnerLoopVectorizer::createEmptyLoop() { + /* + In this function we generate a new loop. The new loop will contain + the vectorized instructions while the old loop will continue to run the + scalar remainder. - // We will construct a vector for the recurrence by combining the values for - // the current and previous iterations. This is the required shuffle mask. - SmallVector ShuffleMask(VF); - ShuffleMask[0] = Builder.getInt32(VF - 1); - for (unsigned I = 1; I < VF; ++I) - ShuffleMask[I] = Builder.getInt32(I + VF - 1); + [ ] <-- loop iteration number check. + / | + / v + | [ ] <-- vector loop bypass (may consist of multiple blocks). + | / | + | / v + || [ ] <-- vector pre header. + |/ | + | v + | [ ] \ + | [ ]_| <-- vector loop. + | | + | v + | -[ ] <--- middle-block. + | / | + | / v + -|- >[ ] <--- new preheader. + | | + | v + | [ ] \ + | [ ]_| <-- old scalar loop to handle remainder. + \ | + \ v + >[ ] <-- exit block. + ... + */ - // The vector from which to take the initial value for the current iteration - // (actual or unrolled). Initially, this is the vector phi node. - Value *Incoming = VecPhi; + BasicBlock *OldBasicBlock = OrigLoop->getHeader(); + BasicBlock *VectorPH = OrigLoop->getLoopPreheader(); + BasicBlock *ExitBlock = OrigLoop->getExitBlock(); + assert(VectorPH && "Invalid loop structure"); + assert(ExitBlock && "Must have an exit block"); - // Shuffle the current and previous vector and update the vector parts. - for (unsigned Part = 0; Part < UF; ++Part) { - auto *Shuffle = - VF > 1 - ? 
Builder.CreateShuffleVector(Incoming, PreviousParts[Part], - ConstantVector::get(ShuffleMask)) - : Incoming; - PhiParts[Part]->replaceAllUsesWith(Shuffle); - cast(PhiParts[Part])->eraseFromParent(); - PhiParts[Part] = Shuffle; - Incoming = PreviousParts[Part]; - } + // Some loops have a single integer induction variable, while other loops + // don't. One example is c++ iterators that often have multiple pointer + // induction variables. In the code below we also support a case where we + // don't have a single induction variable. + // + // We try to obtain an induction variable from the original loop as hard + // as possible. However if we don't find one that: + // - is an integer + // - counts from zero, stepping by one + // - is the size of the widest induction variable type + // then we create a new one. + OldInduction = Legal->getPrimaryInduction(); + Type *IdxTy = Legal->getWidestInductionType(); - // Fix the latch value of the new recurrence in the vector loop. - VecPhi->addIncoming(Incoming, LI->getLoopFor(LoopVectorBody)->getLoopLatch()); + // Split the single block loop into the two loop structure described above. + BasicBlock *VecBody = + VectorPH->splitBasicBlock(VectorPH->getTerminator(), "vector.body"); + BasicBlock *MiddleBlock = + VecBody->splitBasicBlock(VecBody->getTerminator(), "middle.block"); + BasicBlock *ScalarPH = + MiddleBlock->splitBasicBlock(MiddleBlock->getTerminator(), "scalar.ph"); - // Extract the last vector element in the middle block. This will be the - // initial value for the recurrence when jumping to the scalar loop. - auto *Extract = Incoming; - if (VF > 1) { - Builder.SetInsertPoint(LoopMiddleBlock->getTerminator()); - Extract = Builder.CreateExtractElement(Extract, Builder.getInt32(VF - 1), - "vector.recur.extract"); + // Create and register the new vector loop. + Loop *Lp = new Loop(); + Loop *ParentLoop = OrigLoop->getParentLoop(); + + // Insert the new loop into the loop nest and register the new basic blocks + // before calling any utilities such as SCEV that require valid LoopInfo. + if (ParentLoop) { + ParentLoop->addChildLoop(Lp); + ParentLoop->addBasicBlockToLoop(ScalarPH, *LI); + ParentLoop->addBasicBlockToLoop(MiddleBlock, *LI); + } else { + LI->addTopLevelLoop(Lp); } + Lp->addBasicBlockToLoop(VecBody, *LI); - // Fix the initial value of the original recurrence in the scalar loop. - Builder.SetInsertPoint(&*LoopScalarPreHeader->begin()); - auto *Start = Builder.CreatePHI(Phi->getType(), 2, "scalar.recur.init"); - for (auto *BB : predecessors(LoopScalarPreHeader)) { - auto *Incoming = BB == LoopMiddleBlock ? Extract : ScalarInit; - Start->addIncoming(Incoming, BB); - } + // Find the loop boundaries. + Value *Count = getOrCreateTripCount(Lp); - Phi->setIncomingValue(Phi->getBasicBlockIndex(LoopScalarPreHeader), Start); - Phi->setName("scalar.recur"); + Value *StartIdx = ConstantInt::get(IdxTy, 0); - // Finally, fix users of the recurrence outside the loop. The users will need - // either the last value of the scalar recurrence or the last value of the - // vector recurrence we extracted in the middle block. Since the loop is in - // LCSSA form, we just need to find the phi node for the original scalar - // recurrence in the exit block, and then add an edge for the middle block. 
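// Illustrative aside: a standalone model of how the diagram above splits the
// iteration space between the vector body and the old scalar loop, once the
// guard blocks (min.iters.check, cmp.zero, SCEV and memory checks) emitted by
// the functions above are wired together just below in createEmptyLoop().
// splitIterations() and IterationSplit are invented names for this sketch,
// and it ignores the scalar-epilogue adjustment handled in
// getOrCreateVectorTripCount().
#include <cassert>
#include <cstdint>

struct IterationSplit {
  uint64_t VectorIterations; // executions of the vector body, VF * UF lanes each
  uint64_t ScalarIterations; // iterations left to the scalar remainder loop
};

static IterationSplit splitIterations(uint64_t TC, unsigned VF, unsigned UF,
                                      bool RuntimeChecksFailed) {
  uint64_t Step = uint64_t(VF) * UF;
  if (RuntimeChecksFailed || TC < Step)
    return {0, TC};                  // every bypass branch targets the scalar loop
  uint64_t VecTC = TC - TC % Step;   // "n.vec"
  return {VecTC / Step, TC - VecTC};
}

int main() {
  IterationSplit S = splitIterations(/*TC=*/100, /*VF=*/4, /*UF=*/2, false);
  assert(S.VectorIterations == 12 && S.ScalarIterations == 4);
  // "cmp.n" in the middle block: with no remainder, control goes straight to
  // the exit block instead of the scalar loop.
  assert(splitIterations(64, 4, 2, false).ScalarIterations == 0);
  return 0;
}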
- for (auto &I : *LoopExitBlock) { - auto *LCSSAPhi = dyn_cast(&I); - if (!LCSSAPhi) - break; - if (LCSSAPhi->getIncomingValue(0) == Phi) { - LCSSAPhi->addIncoming(Extract, LoopMiddleBlock); - break; - } - } -} + // We need to test whether the backedge-taken count is uint##_max. Adding one + // to it will cause overflow and an incorrect loop trip count in the vector + // body. In case of overflow we want to directly jump to the scalar remainder + // loop. + emitMinimumIterationCountCheck(Lp, ScalarPH); + // Now, compare the new count to zero. If it is zero skip the vector loop and + // jump to the scalar loop. + emitVectorLoopEnteredCheck(Lp, ScalarPH); + // Generate the code to check any assumptions that we've made for SCEV + // expressions. + emitSCEVChecks(Lp, ScalarPH); -void InnerLoopVectorizer::fixLCSSAPHIs() { - for (Instruction &LEI : *LoopExitBlock) { - auto *LCSSAPhi = dyn_cast(&LEI); - if (!LCSSAPhi) - break; - if (LCSSAPhi->getNumIncomingValues() == 1) - LCSSAPhi->addIncoming(UndefValue::get(LCSSAPhi->getType()), - LoopMiddleBlock); - } -} + // Generate the code that checks in runtime if arrays overlap. We put the + // checks into a separate block to make the more common case of few elements + // faster. + emitMemRuntimeChecks(Lp, ScalarPH); -void InnerLoopVectorizer::collectTriviallyDeadInstructions() { - BasicBlock *Latch = OrigLoop->getLoopLatch(); + // Generate the induction variable. + // The loop step is equal to the vectorization factor (num of SIMD elements) + // times the unroll factor (num of SIMD instructions). + Value *CountRoundDown = getOrCreateVectorTripCount(Lp); + Constant *Step = ConstantInt::get(IdxTy, VF * UF); + Induction = + createInductionVariable(Lp, StartIdx, CountRoundDown, Step, + getDebugLocFromInstOrOperands(OldInduction)); - // We create new control-flow for the vectorized loop, so the original - // condition will be dead after vectorization if it's only used by the - // branch. - auto *Cmp = dyn_cast(Latch->getTerminator()->getOperand(0)); - if (Cmp && Cmp->hasOneUse()) - DeadInstructions.insert(Cmp); + // We are going to resume the execution of the scalar loop. + // Go over all of the induction variables that we found and fix the + // PHIs that are left in the scalar version of the loop. + // The starting values of PHI nodes depend on the counter of the last + // iteration in the vectorized loop. + // If we come from a bypass edge then we need to start from the original + // start value. - // We create new "steps" for induction variable updates to which the original - // induction variables map. An original update instruction will be dead if - // all its users except the induction variable are dead. - for (auto &Induction : *Legal->getInductionVars()) { - PHINode *Ind = Induction.first; - auto *IndUpdate = cast(Ind->getIncomingValueForBlock(Latch)); - if (all_of(IndUpdate->users(), [&](User *U) -> bool { - return U == Ind || DeadInstructions.count(cast(U)); - })) - DeadInstructions.insert(IndUpdate); - } -} + // This variable saves the new starting index for the scalar loop. It is used + // to test if there are any tail iterations left once the vector loop has + // completed. + LoopVectorizationLegality::InductionList *List = Legal->getInductionVars(); + for (auto &InductionEntry : *List) { + PHINode *OrigPhi = InductionEntry.first; + InductionDescriptor II = InductionEntry.second; -void InnerLoopVectorizer::sinkScalarOperands(Instruction *PredInst) { + // Create phi nodes to merge from the backedge-taken check block. 
+ PHINode *BCResumeVal = PHINode::Create( + OrigPhi->getType(), 3, "bc.resume.val", ScalarPH->getTerminator()); + Value *&EndValue = IVEndValues[OrigPhi]; + if (OrigPhi == OldInduction) { + // We know what the end value is. + EndValue = CountRoundDown; + } else { + IRBuilder<> B(LoopBypassBlocks.back()->getTerminator()); + Type *StepType = II.getStep()->getType(); + Instruction::CastOps CastOp = + CastInst::getCastOpcode(CountRoundDown, true, StepType, true); + Value *CRD = B.CreateCast(CastOp, CountRoundDown, StepType, "cast.crd"); + const DataLayout &DL = OrigLoop->getHeader()->getModule()->getDataLayout(); + EndValue = II.transform(B, CRD, PSE.getSE(), DL); + EndValue->setName("ind.end"); + } - // The basic block and loop containing the predicated instruction. - auto *PredBB = PredInst->getParent(); - auto *VectorLoop = LI->getLoopFor(PredBB); + // The new PHI merges the original incoming value, in case of a bypass, + // or the value at the end of the vectorized loop. + BCResumeVal->addIncoming(EndValue, MiddleBlock); - // Initialize a worklist with the operands of the predicated instruction. - SetVector Worklist(PredInst->op_begin(), PredInst->op_end()); + // Fix the scalar body counter (PHI node). + unsigned BlockIdx = OrigPhi->getBasicBlockIndex(ScalarPH); - // Holds instructions that we need to analyze again. An instruction may be - // reanalyzed if we don't yet know if we can sink it or not. - SmallVector InstsToReanalyze; + // The old induction's phi node in the scalar body needs the truncated + // value. + for (BasicBlock *BB : LoopBypassBlocks) + BCResumeVal->addIncoming(II.getStartValue(), BB); + OrigPhi->setIncomingValue(BlockIdx, BCResumeVal); + } - // Returns true if a given use occurs in the predicated block. Phi nodes use - // their operands in their corresponding predecessor blocks. - auto isBlockOfUsePredicated = [&](Use &U) -> bool { - auto *I = cast(U.getUser()); - BasicBlock *BB = I->getParent(); - if (auto *Phi = dyn_cast(I)) - BB = Phi->getIncomingBlock( - PHINode::getIncomingValueNumForOperand(U.getOperandNo())); - return BB == PredBB; - }; + // Add a check in the middle block to see if we have completed + // all of the iterations in the first vector loop. + // If (N - N%VF) == N, then we *don't* need to run the remainder. + Value *CmpN = + CmpInst::Create(Instruction::ICmp, CmpInst::ICMP_EQ, Count, + CountRoundDown, "cmp.n", MiddleBlock->getTerminator()); + ReplaceInstWithInst(MiddleBlock->getTerminator(), + BranchInst::Create(ExitBlock, ScalarPH, CmpN)); - // Iteratively sink the scalarized operands of the predicated instruction - // into the block we created for it. When an instruction is sunk, it's - // operands are then added to the worklist. The algorithm ends after one pass - // through the worklist doesn't sink a single instruction. - bool Changed; - do { + // Get ready to start creating new instructions into the vectorized body. + Builder.SetInsertPoint(&*VecBody->getFirstInsertionPt()); - // Add the instructions that need to be reanalyzed to the worklist, and - // reset the changed indicator. - Worklist.insert(InstsToReanalyze.begin(), InstsToReanalyze.end()); - InstsToReanalyze.clear(); - Changed = false; + // Save the state. 
+ LoopVectorPreHeader = Lp->getLoopPreheader(); + LoopScalarPreHeader = ScalarPH; + LoopMiddleBlock = MiddleBlock; + LoopExitBlock = ExitBlock; + LoopVectorBody = VecBody; + LoopScalarBody = OldBasicBlock; - while (!Worklist.empty()) { - auto *I = dyn_cast(Worklist.pop_back_val()); + // Keep all loop hints from the original loop on the vector loop (we'll + // replace the vectorizer-specific hints below). + if (MDNode *LID = OrigLoop->getLoopID()) + Lp->setLoopID(LID); - // We can't sink an instruction if it is a phi node, is already in the - // predicated block, is not in the loop, or may have side effects. - if (!I || isa(I) || I->getParent() == PredBB || - !VectorLoop->contains(I) || I->mayHaveSideEffects()) - continue; + LoopVectorizeHints Hints(Lp, true, *ORE); + Hints.setAlreadyVectorized(); +} - // It's legal to sink the instruction if all its uses occur in the - // predicated block. Otherwise, there's nothing to do yet, and we may - // need to reanalyze the instruction. - if (!all_of(I->uses(), isBlockOfUsePredicated)) { - InstsToReanalyze.push_back(I); - continue; - } +// Fix up external users of the induction variable. At this point, we are +// in LCSSA form, with all external PHIs that use the IV having one input value, +// coming from the remainder loop. We need those PHIs to also have a correct +// value for the IV when arriving directly from the middle block. +void InnerLoopVectorizer::fixupIVUsers(PHINode *OrigPhi, + const InductionDescriptor &II, + Value *CountRoundDown, Value *EndValue, + BasicBlock *MiddleBlock) { + // There are two kinds of external IV usages - those that use the value + // computed in the last iteration (the PHI) and those that use the penultimate + // value (the value that feeds into the phi from the loop latch). + // We allow both, but they, obviously, have different values. - // Move the instruction to the beginning of the predicated block, and add - // it's operands to the worklist. - I->moveBefore(&*PredBB->getFirstInsertionPt()); - Worklist.insert(I->op_begin(), I->op_end()); + assert(OrigLoop->getExitBlock() && "Expected a single exit block"); - // The sinking may have enabled other instructions to be sunk, so we will - // need to iterate. - Changed = true; + DenseMap MissingVals; + + // An external user of the last iteration's value should see the value that + // the remainder loop uses to initialize its own IV. + Value *PostInc = OrigPhi->getIncomingValueForBlock(OrigLoop->getLoopLatch()); + for (User *U : PostInc->users()) { + Instruction *UI = cast(U); + if (!OrigLoop->contains(UI)) { + assert(isa(UI) && "Expected LCSSA form"); + MissingVals[UI] = EndValue; } - } while (Changed); -} + } -void InnerLoopVectorizer::predicateInstructions() { + // An external user of the penultimate value need to see EndValue - Step. + // The simplest way to get this is to recompute it from the constituent SCEVs, + // that is Start + (Step * (CRD - 1)). + for (User *U : OrigPhi->users()) { + auto *UI = cast(U); + if (!OrigLoop->contains(UI)) { + const DataLayout &DL = + OrigLoop->getHeader()->getModule()->getDataLayout(); + assert(isa(UI) && "Expected LCSSA form"); - // For each instruction I marked for predication on value C, split I into its - // own basic block to form an if-then construct over C. Since I may be fed by - // an extractelement instruction or other scalar operand, we try to - // iteratively sink its scalar operands into the predicated block. 
If I feeds - // an insertelement instruction, we try to move this instruction into the - // predicated block as well. For non-void types, a phi node will be created - // for the resulting value (either vector or scalar). - // - // So for some predicated instruction, e.g. the conditional sdiv in: - // - // for.body: - // ... - // %add = add nsw i32 %mul, %0 - // %cmp5 = icmp sgt i32 %2, 7 - // br i1 %cmp5, label %if.then, label %if.end - // - // if.then: - // %div = sdiv i32 %0, %1 - // br label %if.end - // - // if.end: - // %x.0 = phi i32 [ %div, %if.then ], [ %add, %for.body ] - // - // the sdiv at this point is scalarized and if-converted using a select. - // The inactive elements in the vector are not used, but the predicated - // instruction is still executed for all vector elements, essentially: - // - // vector.body: - // ... - // %17 = add nsw <2 x i32> %16, %wide.load - // %29 = extractelement <2 x i32> %wide.load, i32 0 - // %30 = extractelement <2 x i32> %wide.load51, i32 0 - // %31 = sdiv i32 %29, %30 - // %32 = insertelement <2 x i32> undef, i32 %31, i32 0 - // %35 = extractelement <2 x i32> %wide.load, i32 1 - // %36 = extractelement <2 x i32> %wide.load51, i32 1 - // %37 = sdiv i32 %35, %36 - // %38 = insertelement <2 x i32> %32, i32 %37, i32 1 - // %predphi = select <2 x i1> %26, <2 x i32> %38, <2 x i32> %17 - // - // Predication will now re-introduce the original control flow to avoid false - // side-effects by the sdiv instructions on the inactive elements, yielding - // (after cleanup): - // - // vector.body: - // ... - // %5 = add nsw <2 x i32> %4, %wide.load - // %8 = icmp sgt <2 x i32> %wide.load52, - // %9 = extractelement <2 x i1> %8, i32 0 - // br i1 %9, label %pred.sdiv.if, label %pred.sdiv.continue - // - // pred.sdiv.if: - // %10 = extractelement <2 x i32> %wide.load, i32 0 - // %11 = extractelement <2 x i32> %wide.load51, i32 0 - // %12 = sdiv i32 %10, %11 - // %13 = insertelement <2 x i32> undef, i32 %12, i32 0 - // br label %pred.sdiv.continue - // - // pred.sdiv.continue: - // %14 = phi <2 x i32> [ undef, %vector.body ], [ %13, %pred.sdiv.if ] - // %15 = extractelement <2 x i1> %8, i32 1 - // br i1 %15, label %pred.sdiv.if54, label %pred.sdiv.continue55 - // - // pred.sdiv.if54: - // %16 = extractelement <2 x i32> %wide.load, i32 1 - // %17 = extractelement <2 x i32> %wide.load51, i32 1 - // %18 = sdiv i32 %16, %17 - // %19 = insertelement <2 x i32> %14, i32 %18, i32 1 - // br label %pred.sdiv.continue55 - // - // pred.sdiv.continue55: - // %20 = phi <2 x i32> [ %14, %pred.sdiv.continue ], [ %19, %pred.sdiv.if54 ] - // %predphi = select <2 x i1> %8, <2 x i32> %20, <2 x i32> %5 + IRBuilder<> B(MiddleBlock->getTerminator()); + Value *CountMinusOne = B.CreateSub( + CountRoundDown, ConstantInt::get(CountRoundDown->getType(), 1)); + Value *CMO = B.CreateSExtOrTrunc(CountMinusOne, II.getStep()->getType(), + "cast.cmo"); + Value *Escape = II.transform(B, CMO, PSE.getSE(), DL); + Escape->setName("ind.escape"); + MissingVals[UI] = Escape; + } + } - for (auto KV : PredicatedInstructions) { - BasicBlock::iterator I(KV.first); - BasicBlock *Head = I->getParent(); - auto *BB = SplitBlock(Head, &*std::next(I), DT, LI); - auto *T = SplitBlockAndInsertIfThen(KV.second, &*I, /*Unreachable=*/false, - /*BranchWeights=*/nullptr, DT, LI); - I->moveBefore(T); - sinkScalarOperands(&*I); + for (auto &I : MissingVals) { + PHINode *PHI = cast(I.first); + // One corner case we have to handle is two IVs "chasing" each-other, + // that is %IV2 = phi [...], [ %IV1, %latch ] + // In this 
case, if IV1 has an external use, we need to avoid adding both + // "last value of IV1" and "penultimate value of IV2". So, verify that we + // don't already have an incoming value for the middle block. + if (PHI->getBasicBlockIndex(MiddleBlock) == -1) + PHI->addIncoming(I.second, MiddleBlock); + } +} - I->getParent()->setName(Twine("pred.") + I->getOpcodeName() + ".if"); - BB->setName(Twine("pred.") + I->getOpcodeName() + ".continue"); +namespace { +struct CSEDenseMapInfo { + static bool canHandle(Instruction *I) { + return isa(I) || isa(I) || + isa(I) || isa(I); + } + static inline Instruction *getEmptyKey() { + return DenseMapInfo::getEmptyKey(); + } + static inline Instruction *getTombstoneKey() { + return DenseMapInfo::getTombstoneKey(); + } + static unsigned getHashValue(Instruction *I) { + assert(canHandle(I) && "Unknown instruction!"); + return hash_combine(I->getOpcode(), hash_combine_range(I->value_op_begin(), + I->value_op_end())); + } + static bool isEqual(Instruction *LHS, Instruction *RHS) { + if (LHS == getEmptyKey() || RHS == getEmptyKey() || + LHS == getTombstoneKey() || RHS == getTombstoneKey()) + return LHS == RHS; + return LHS->isIdenticalTo(RHS); + } +}; +} - // If the instruction is non-void create a Phi node at reconvergence point. - if (!I->getType()->isVoidTy()) { - Value *IncomingTrue = nullptr; - Value *IncomingFalse = nullptr; +///\brief Perform cse of induction variable instructions. +static void cse(BasicBlock *BB) { + // Perform simple cse. + SmallDenseMap CSEMap; + for (BasicBlock::iterator I = BB->begin(), E = BB->end(); I != E;) { + Instruction *In = &*I++; - if (I->hasOneUse() && isa(*I->user_begin())) { - // If the predicated instruction is feeding an insert-element, move it - // into the Then block; Phi node will be created for the vector. - InsertElementInst *IEI = cast(*I->user_begin()); - IEI->moveBefore(T); - IncomingTrue = IEI; // the new vector with the inserted element. - IncomingFalse = IEI->getOperand(0); // the unmodified vector - } else { - // Phi node will be created for the scalar predicated instruction. - IncomingTrue = &*I; - IncomingFalse = UndefValue::get(I->getType()); - } + if (!CSEDenseMapInfo::canHandle(In)) + continue; - BasicBlock *PostDom = I->getParent()->getSingleSuccessor(); - assert(PostDom && "Then block has multiple successors"); - PHINode *Phi = - PHINode::Create(IncomingTrue->getType(), 2, "", &PostDom->front()); - IncomingTrue->replaceAllUsesWith(Phi); - Phi->addIncoming(IncomingFalse, Head); - Phi->addIncoming(IncomingTrue, I->getParent()); + // Check if we can replace this instruction with any of the + // visited instructions. + if (Instruction *V = CSEMap.lookup(In)) { + In->replaceAllUsesWith(V); + In->eraseFromParent(); + continue; } - } - DEBUG(DT->verifyDomTree()); + CSEMap[In] = In; + } } -InnerLoopVectorizer::VectorParts -InnerLoopVectorizer::createEdgeMask(BasicBlock *Src, BasicBlock *Dst) { - assert(is_contained(predecessors(Dst), Src) && "Invalid edge"); - - // Look for cached value. - std::pair Edge(Src, Dst); - EdgeMaskCache::iterator ECEntryIt = MaskCache.find(Edge); - if (ECEntryIt != MaskCache.end()) - return ECEntryIt->second; +/// \brief Adds a 'fast' flag to floating point operations. +static Value *addFastMathFlag(Value *V) { + if (isa(V)) { + FastMathFlags Flags; + Flags.setUnsafeAlgebra(); + cast(V)->setFastMathFlags(Flags); + } + return V; +} - VectorParts SrcMask = createBlockInMask(Src); +/// \brief Estimate the overhead of scalarizing an instruction. 
This is a +/// convenience wrapper for the type-based getScalarizationOverhead API. +static unsigned getScalarizationOverhead(Instruction *I, unsigned VF, + const TargetTransformInfo &TTI) { + if (VF == 1) + return 0; - // The terminator has to be a branch inst! - BranchInst *BI = dyn_cast(Src->getTerminator()); - assert(BI && "Unexpected terminator found"); + unsigned Cost = 0; + Type *RetTy = ToVectorTy(I->getType(), VF); + if (!RetTy->isVoidTy()) + Cost += TTI.getScalarizationOverhead(RetTy, true, false); - if (BI->isConditional()) { - VectorParts EdgeMask = getVectorValue(BI->getCondition()); + if (CallInst *CI = dyn_cast(I)) { + SmallVector Operands(CI->arg_operands()); + Cost += TTI.getOperandsScalarizationOverhead(Operands, VF); + } else { + SmallVector Operands(I->operand_values()); + Cost += TTI.getOperandsScalarizationOverhead(Operands, VF); + } - if (BI->getSuccessor(0) != Dst) - for (unsigned part = 0; part < UF; ++part) - EdgeMask[part] = Builder.CreateNot(EdgeMask[part]); + return Cost; +} - for (unsigned part = 0; part < UF; ++part) - EdgeMask[part] = Builder.CreateAnd(EdgeMask[part], SrcMask[part]); +// Estimate cost of a call instruction CI if it were vectorized with factor VF. +// Return the cost of the instruction, including scalarization overhead if it's +// needed. The flag NeedToScalarize shows if the call needs to be scalarized - +// i.e. either vector version isn't available, or is too expensive. +static unsigned getVectorCallCost(CallInst *CI, unsigned VF, + const TargetTransformInfo &TTI, + const TargetLibraryInfo *TLI, + bool &NeedToScalarize) { + Function *F = CI->getCalledFunction(); + StringRef FnName = CI->getCalledFunction()->getName(); + Type *ScalarRetTy = CI->getType(); + SmallVector Tys, ScalarTys; + for (auto &ArgOp : CI->arg_operands()) + ScalarTys.push_back(ArgOp->getType()); - MaskCache[Edge] = EdgeMask; - return EdgeMask; - } + // Estimate cost of scalarized vector call. The source operands are assumed + // to be vectors, so we need to extract individual elements from there, + // execute VF scalar calls, and then gather the result into the vector return + // value. + unsigned ScalarCallCost = TTI.getCallInstrCost(F, ScalarRetTy, ScalarTys); + if (VF == 1) + return ScalarCallCost; - MaskCache[Edge] = SrcMask; - return SrcMask; -} + // Compute corresponding vector type for return value and arguments. + Type *RetTy = ToVectorTy(ScalarRetTy, VF); + for (Type *ScalarTy : ScalarTys) + Tys.push_back(ToVectorTy(ScalarTy, VF)); -InnerLoopVectorizer::VectorParts -InnerLoopVectorizer::createBlockInMask(BasicBlock *BB) { - assert(OrigLoop->contains(BB) && "Block is not a part of a loop"); + // Compute costs of unpacking argument values for the scalar calls and + // packing the return values to a vector. + unsigned ScalarizationCost = getScalarizationOverhead(CI, VF, TTI); - // Loop incoming mask is all-one. - if (OrigLoop->getHeader() == BB) { - Value *C = ConstantInt::get(IntegerType::getInt1Ty(BB->getContext()), 1); - return getVectorValue(C); - } + unsigned Cost = ScalarCallCost * VF + ScalarizationCost; - // This is the block mask. We OR all incoming edges, and with zero. - Value *Zero = ConstantInt::get(IntegerType::getInt1Ty(BB->getContext()), 0); - VectorParts BlockMask = getVectorValue(Zero); + // If we can't emit a vector call for this function, then the currently found + // cost is the cost we need to return. 
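// Illustrative aside: the choice getVectorCallCost() is making here, restated
// as a standalone cost model. The TTI/TLI queries are replaced by plain
// parameters, and CallCostInputs / vectorCallCost() are invented names for
// this sketch.
#include <cassert>

struct CallCostInputs {
  unsigned ScalarCallCost;    // cost of one scalar call
  unsigned ScalarizationCost; // operand extracts plus result inserts
  bool HasVectorVariant;      // a vectorized library function exists for VF
  unsigned VectorCallCost;    // cost of that vectorized call
};

// Returns the cost charged for the call at width VF and whether it would have
// to be scalarized.
static unsigned vectorCallCost(const CallCostInputs &In, unsigned VF,
                               bool &NeedToScalarize) {
  if (VF == 1) {
    NeedToScalarize = false;
    return In.ScalarCallCost;
  }
  unsigned ScalarizedCost = In.ScalarCallCost * VF + In.ScalarizationCost;
  NeedToScalarize = true;
  if (In.HasVectorVariant && In.VectorCallCost < ScalarizedCost) {
    NeedToScalarize = false;
    return In.VectorCallCost;
  }
  return ScalarizedCost;
}

int main() {
  bool NeedToScalarize = false;
  CallCostInputs In = {/*ScalarCallCost=*/10, /*ScalarizationCost=*/8,
                       /*HasVectorVariant=*/true, /*VectorCallCost=*/24};
  assert(vectorCallCost(In, 4, NeedToScalarize) == 24 && !NeedToScalarize);
  In.HasVectorVariant = false;
  assert(vectorCallCost(In, 4, NeedToScalarize) == 48 && NeedToScalarize);
  return 0;
}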
+ NeedToScalarize = true; + if (!TLI || !TLI->isFunctionVectorizable(FnName, VF) || CI->isNoBuiltin()) + return Cost; - // For each pred: - for (pred_iterator it = pred_begin(BB), e = pred_end(BB); it != e; ++it) { - VectorParts EM = createEdgeMask(*it, BB); - for (unsigned part = 0; part < UF; ++part) - BlockMask[part] = Builder.CreateOr(BlockMask[part], EM[part]); + // If the corresponding vector cost is cheaper, return its cost. + unsigned VectorCallCost = TTI.getCallInstrCost(nullptr, RetTy, Tys); + if (VectorCallCost < Cost) { + NeedToScalarize = false; + return VectorCallCost; } - - return BlockMask; + return Cost; } -void InnerLoopVectorizer::widenPHIInstruction(Instruction *PN, unsigned UF, - unsigned VF, PhiVector *PV) { - PHINode *P = cast(PN); - // Handle recurrences. - if (Legal->isReductionVariable(P) || Legal->isFirstOrderRecurrence(P)) { - VectorParts Entry(UF); - for (unsigned part = 0; part < UF; ++part) { - // This is phase one of vectorizing PHIs. - Type *VecTy = - (VF == 1) ? PN->getType() : VectorType::get(PN->getType(), VF); - Entry[part] = PHINode::Create( - VecTy, 2, "vec.phi", &*LoopVectorBody->getFirstInsertionPt()); - } - VectorLoopValueMap.initVector(P, Entry); - PV->push_back(P); - return; - } +// Estimate cost of an intrinsic call instruction CI if it were vectorized with +// factor VF. Return the cost of the instruction, including scalarization +// overhead if it's needed. +static unsigned getVectorIntrinsicCost(CallInst *CI, unsigned VF, + const TargetTransformInfo &TTI, + const TargetLibraryInfo *TLI) { + Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); + assert(ID && "Expected intrinsic call!"); - setDebugLocFromInst(Builder, P); - // Check for PHI nodes that are lowered to vector selects. - if (P->getParent() != OrigLoop->getHeader()) { - // We know that all PHIs in non-header blocks are converted into - // selects, so we don't have to worry about the insertion order and we - // can just use the builder. - // At this point we generate the predication tree. There may be - // duplications since this is a simple recursive scan, but future - // optimizations will clean it up. + Type *RetTy = ToVectorTy(CI->getType(), VF); + SmallVector Tys; + for (Value *ArgOperand : CI->arg_operands()) + Tys.push_back(ToVectorTy(ArgOperand->getType(), VF)); - unsigned NumIncoming = P->getNumIncomingValues(); + FastMathFlags FMF; + if (auto *FPMO = dyn_cast(CI)) + FMF = FPMO->getFastMathFlags(); - // Generate a sequence of selects of the form: - // SELECT(Mask3, In3, - // SELECT(Mask2, In2, - // ( ...))) - VectorParts Entry(UF); - for (unsigned In = 0; In < NumIncoming; In++) { - VectorParts Cond = - createEdgeMask(P->getIncomingBlock(In), P->getParent()); - const VectorParts &In0 = getVectorValue(P->getIncomingValue(In)); + return TTI.getIntrinsicInstrCost(ID, RetTy, Tys, FMF); +} - for (unsigned part = 0; part < UF; ++part) { - // We might have single edge PHIs (blocks) - use an identity - // 'select' for the first PHI operand. - if (In == 0) - Entry[part] = Builder.CreateSelect(Cond[part], In0[part], In0[part]); - else - // Select between the current value and the previous incoming edge - // based on the incoming mask. - Entry[part] = Builder.CreateSelect(Cond[part], In0[part], Entry[part], - "predphi"); +static Type *smallestIntegerVectorType(Type *T1, Type *T2) { + auto *I1 = cast(T1->getVectorElementType()); + auto *I2 = cast(T2->getVectorElementType()); + return I1->getBitWidth() < I2->getBitWidth() ? 
T1 : T2; +} +static Type *largestIntegerVectorType(Type *T1, Type *T2) { + auto *I1 = cast(T1->getVectorElementType()); + auto *I2 = cast(T2->getVectorElementType()); + return I1->getBitWidth() > I2->getBitWidth() ? T1 : T2; +} + +void InnerLoopVectorizer::truncateToMinimalBitwidths() { + // For every instruction `I` in MinBWs, truncate the operands, create a + // truncated version of `I` and reextend its result. InstCombine runs + // later and will remove any ext/trunc pairs. + // + SmallPtrSet Erased; + for (const auto &KV : Cost->getMinimalBitwidths()) { + // If the value wasn't vectorized, we must maintain the original scalar + // type. The absence of the value from VectorLoopValueMap indicates that it + // wasn't vectorized. + if (!VectorLoopValueMap.hasVector(KV.first)) + continue; + VectorParts &Parts = VectorLoopValueMap.getVector(KV.first); + for (Value *&I : Parts) { + if (Erased.count(I) || I->use_empty() || !isa(I)) + continue; + Type *OriginalTy = I->getType(); + Type *ScalarTruncatedTy = + IntegerType::get(OriginalTy->getContext(), KV.second); + Type *TruncatedTy = VectorType::get(ScalarTruncatedTy, + OriginalTy->getVectorNumElements()); + if (TruncatedTy == OriginalTy) + continue; + + IRBuilder<> B(cast(I)); + auto ShrinkOperand = [&](Value *V) -> Value * { + if (auto *ZI = dyn_cast(V)) + if (ZI->getSrcTy() == TruncatedTy) + return ZI->getOperand(0); + return B.CreateZExtOrTrunc(V, TruncatedTy); + }; + + // The actual instruction modification depends on the instruction type, + // unfortunately. + Value *NewI = nullptr; + if (auto *BO = dyn_cast(I)) { + NewI = B.CreateBinOp(BO->getOpcode(), ShrinkOperand(BO->getOperand(0)), + ShrinkOperand(BO->getOperand(1))); + cast(NewI)->copyIRFlags(I); + } else if (auto *CI = dyn_cast(I)) { + NewI = + B.CreateICmp(CI->getPredicate(), ShrinkOperand(CI->getOperand(0)), + ShrinkOperand(CI->getOperand(1))); + } else if (auto *SI = dyn_cast(I)) { + NewI = B.CreateSelect(SI->getCondition(), + ShrinkOperand(SI->getTrueValue()), + ShrinkOperand(SI->getFalseValue())); + } else if (auto *CI = dyn_cast(I)) { + switch (CI->getOpcode()) { + default: + llvm_unreachable("Unhandled cast!"); + case Instruction::Trunc: + NewI = ShrinkOperand(CI->getOperand(0)); + break; + case Instruction::SExt: + NewI = B.CreateSExtOrTrunc( + CI->getOperand(0), + smallestIntegerVectorType(OriginalTy, TruncatedTy)); + break; + case Instruction::ZExt: + NewI = B.CreateZExtOrTrunc( + CI->getOperand(0), + smallestIntegerVectorType(OriginalTy, TruncatedTy)); + break; + } + } else if (auto *SI = dyn_cast(I)) { + auto Elements0 = SI->getOperand(0)->getType()->getVectorNumElements(); + auto *O0 = B.CreateZExtOrTrunc( + SI->getOperand(0), VectorType::get(ScalarTruncatedTy, Elements0)); + auto Elements1 = SI->getOperand(1)->getType()->getVectorNumElements(); + auto *O1 = B.CreateZExtOrTrunc( + SI->getOperand(1), VectorType::get(ScalarTruncatedTy, Elements1)); + + NewI = B.CreateShuffleVector(O0, O1, SI->getMask()); + } else if (isa(I)) { + // Don't do anything with the operands, just extend the result. 
+ continue; + } else if (auto *IE = dyn_cast(I)) { + auto Elements = IE->getOperand(0)->getType()->getVectorNumElements(); + auto *O0 = B.CreateZExtOrTrunc( + IE->getOperand(0), VectorType::get(ScalarTruncatedTy, Elements)); + auto *O1 = B.CreateZExtOrTrunc(IE->getOperand(1), ScalarTruncatedTy); + NewI = B.CreateInsertElement(O0, O1, IE->getOperand(2)); + } else if (auto *EE = dyn_cast(I)) { + auto Elements = EE->getOperand(0)->getType()->getVectorNumElements(); + auto *O0 = B.CreateZExtOrTrunc( + EE->getOperand(0), VectorType::get(ScalarTruncatedTy, Elements)); + NewI = B.CreateExtractElement(O0, EE->getOperand(2)); + } else { + llvm_unreachable("Unhandled instruction type!"); } + + // Lastly, extend the result. + NewI->takeName(cast(I)); + Value *Res = B.CreateZExtOrTrunc(NewI, OriginalTy); + I->replaceAllUsesWith(Res); + cast(I)->eraseFromParent(); + Erased.insert(I); + I = Res; } - VectorLoopValueMap.initVector(P, Entry); - return; } - // This PHINode must be an induction variable. - // Make sure that we know about it. - assert(Legal->getInductionVars()->count(P) && "Not an induction variable"); - - InductionDescriptor II = Legal->getInductionVars()->lookup(P); - const DataLayout &DL = OrigLoop->getHeader()->getModule()->getDataLayout(); - - // FIXME: The newly created binary instructions should contain nsw/nuw flags, - // which can be found from the original scalar operations. - switch (II.getKind()) { - case InductionDescriptor::IK_NoInduction: - llvm_unreachable("Unknown induction"); - case InductionDescriptor::IK_IntInduction: - return widenIntInduction(P); - case InductionDescriptor::IK_PtrInduction: { - // Handle the pointer induction variable case. - assert(P->getType()->isPointerTy() && "Unexpected type."); - // This is the normalized GEP that starts counting at zero. - Value *PtrInd = Induction; - PtrInd = Builder.CreateSExtOrTrunc(PtrInd, II.getStep()->getType()); - // Determine the number of scalars we need to generate for each unroll - // iteration. If the instruction is uniform, we only need to generate the - // first lane. Otherwise, we generate all VF values. - unsigned Lanes = Cost->isUniformAfterVectorization(P, VF) ? 1 : VF; - // These are the scalar results. Notice that we don't generate vector GEPs - // because scalar GEPs result in better code. - ScalarParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Entry[Part].resize(VF); - for (unsigned Lane = 0; Lane < Lanes; ++Lane) { - Constant *Idx = ConstantInt::get(PtrInd->getType(), Lane + Part * VF); - Value *GlobalIdx = Builder.CreateAdd(PtrInd, Idx); - Value *SclrGep = II.transform(Builder, GlobalIdx, PSE.getSE(), DL); - SclrGep->setName("next.gep"); - Entry[Part][Lane] = SclrGep; + // We'll have created a bunch of ZExts that are now parentless. Clean up. + for (const auto &KV : Cost->getMinimalBitwidths()) { + // If the value wasn't vectorized, we must maintain the original scalar + // type. The absence of the value from VectorLoopValueMap indicates that it + // wasn't vectorized. 
+ if (!VectorLoopValueMap.hasVector(KV.first)) + continue; + VectorParts &Parts = VectorLoopValueMap.getVector(KV.first); + for (Value *&I : Parts) { + ZExtInst *Inst = dyn_cast(I); + if (Inst && Inst->use_empty()) { + Value *NewI = Inst->getOperand(0); + Inst->eraseFromParent(); + I = NewI; } } - VectorLoopValueMap.initScalar(P, Entry); - return; } - case InductionDescriptor::IK_FpInduction: { - assert(P->getType() == II.getStartValue()->getType() && - "Types must match"); - // Handle other induction variables that are now based on the - // canonical one. - assert(P != OldInduction && "Primary induction can be integer only"); +} - Value *V = Builder.CreateCast(Instruction::SIToFP, Induction, P->getType()); - V = II.transform(Builder, V, PSE.getSE(), DL); - V->setName("fp.offset.idx"); +void InnerLoopVectorizer::vectorizeLoop() { - // Now we have scalar op: %fp.offset.idx = StartVal +/- Induction*StepVal + //===------------------------------------------------===// + // + // Notice: any optimization or new instruction that go + // into the code below should be also be implemented in + // the cost-model. + // + //===------------------------------------------------===// - Value *Broadcasted = getBroadcastInstrs(V); - // After broadcasting the induction variable we need to make the vector - // consecutive by adding StepVal*0, StepVal*1, StepVal*2, etc. - Value *StepVal = cast(II.getStep())->getValue(); - VectorParts Entry(UF); - for (unsigned part = 0; part < UF; ++part) - Entry[part] = getStepVector(Broadcasted, VF * part, StepVal, - II.getInductionOpcode()); - VectorLoopValueMap.initVector(P, Entry); - return; - } - } -} + // Insert truncates and extends for any truncated instructions as hints to + // InstCombine. + if (VF > 1) + truncateToMinimalBitwidths(); -/// A helper function for checking whether an integer division-related -/// instruction may divide by zero (in which case it must be predicated if -/// executed conditionally in the scalar code). -/// TODO: It may be worthwhile to generalize and check isKnownNonZero(). -/// Non-zero divisors that are non compile-time constants will not be -/// converted into multiplication, so we will still end up scalarizing -/// the division, but can do so w/o predication. -static bool mayDivideByZero(Instruction &I) { - assert((I.getOpcode() == Instruction::UDiv || - I.getOpcode() == Instruction::SDiv || - I.getOpcode() == Instruction::URem || - I.getOpcode() == Instruction::SRem) && - "Unexpected instruction"); - Value *Divisor = I.getOperand(1); - auto *CInt = dyn_cast(Divisor); - return !CInt || CInt->isZero(); -} - -void InnerLoopVectorizer::vectorizeBlockInLoop(BasicBlock *BB, PhiVector *PV) { - // For each instruction in the old loop. - for (Instruction &I : *BB) { + fixCrossIterationPHIs(); - // If the instruction will become trivially dead when vectorized, we don't - // need to generate it. - if (DeadInstructions.count(&I)) - continue; + // Update the dominator tree. + // + // FIXME: After creating the structure of the new loop, the dominator tree is + // no longer up-to-date, and it remains that way until we update it + // here. An out-of-date dominator tree is problematic for SCEV, + // because SCEVExpander uses it to guide code generation. The + // vectorizer use SCEVExpanders in several places. Instead, we should + // keep the dominator tree up-to-date as we go. + updateAnalysis(); - // Scalarize instructions that should remain scalar after vectorization. 
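// Illustrative aside: the property truncateToMinimalBitwidths() relies on,
// shown on scalars. When the cost model proves only the low KV.second bits of
// a result are demanded, the operation can be performed in the narrow type
// and widened once afterwards; the leftover ext/trunc pairs are later cleaned
// up by InstCombine. The two helpers below are invented for this sketch.
#include <cassert>
#include <cstdint>

static uint32_t wideAddThenMask(uint8_t A, uint8_t B) {
  return (uint32_t(A) + uint32_t(B)) & 0xFFu; // original: widen, add, demand 8 bits
}
static uint32_t narrowAddThenZext(uint8_t A, uint8_t B) {
  uint8_t Narrow = uint8_t(A + B);            // truncated operands, narrow add
  return uint32_t(Narrow);                    // single zext of the result
}

int main() {
  for (unsigned A = 0; A < 256; ++A)
    for (unsigned B = 0; B < 256; ++B)
      assert(wideAddThenMask(uint8_t(A), uint8_t(B)) ==
             narrowAddThenZext(uint8_t(A), uint8_t(B)));
  return 0;
}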
- if (VF > 1 && - !(isa(&I) || isa(&I) || - isa(&I)) && - shouldScalarizeInstruction(&I)) { - scalarizeInstruction(&I, Legal->isScalarWithPredication(&I)); - continue; - } + // Fix-up external users of the induction variables. + for (auto &Entry : *Legal->getInductionVars()) + fixupIVUsers(Entry.first, Entry.second, + getOrCreateVectorTripCount(LI->getLoopFor(LoopVectorBody)), + IVEndValues[Entry.first], LoopMiddleBlock); - switch (I.getOpcode()) { - case Instruction::Br: - // Nothing to do for PHIs and BR, since we already took care of the - // loop control flow instructions. - continue; - case Instruction::PHI: { - // Vectorize PHINodes. - widenPHIInstruction(&I, UF, VF, PV); - continue; - } // End of PHI. - - case Instruction::UDiv: - case Instruction::SDiv: - case Instruction::SRem: - case Instruction::URem: - // Scalarize with predication if this instruction may divide by zero and - // block execution is conditional, otherwise fallthrough. - if (Legal->isScalarWithPredication(&I)) { - scalarizeInstruction(&I, true); - continue; - } - case Instruction::Add: - case Instruction::FAdd: - case Instruction::Sub: - case Instruction::FSub: - case Instruction::Mul: - case Instruction::FMul: - case Instruction::FDiv: - case Instruction::FRem: - case Instruction::Shl: - case Instruction::LShr: - case Instruction::AShr: - case Instruction::And: - case Instruction::Or: - case Instruction::Xor: { - // Just widen binops. - auto *BinOp = cast(&I); - setDebugLocFromInst(Builder, BinOp); - const VectorParts &A = getVectorValue(BinOp->getOperand(0)); - const VectorParts &B = getVectorValue(BinOp->getOperand(1)); - - // Use this vector value for all users of the original instruction. - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Value *V = Builder.CreateBinOp(BinOp->getOpcode(), A[Part], B[Part]); + fixLCSSAPHIs(); - if (BinaryOperator *VecOp = dyn_cast(V)) - VecOp->copyIRFlags(BinOp); + // Remove redundant induction instructions. + cse(LoopVectorBody); +} - Entry[Part] = V; - } +void InnerLoopVectorizer::fixCrossIterationPHIs() { + // In order to support recurrences we need to be able to vectorize Phi nodes. + // Phi nodes have cycles, so we need to vectorize them in two stages. First, + // we create a new vector PHI node with no incoming edges. We use this value + // when we vectorize all of the instructions that use the PHI. Next, after + // all of the instructions in the block are complete we add the new incoming + // edges to the PHI. At this point all of the instructions in the basic block + // are vectorized, so we can use them to construct the PHI. - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, BinOp); + // At this point every instruction in the original loop is widened to a + // vector form. Now we need to fix the recurrences. These PHI nodes are + // currently empty because we did not want to introduce cycles. + // This is the second stage of vectorizing recurrences. + for (Instruction &I : *OrigLoop->getHeader()) { + PHINode *Phi = dyn_cast(&I); + if (!Phi) break; - } - case Instruction::Select: { - // Widen selects. - // If the selector is loop invariant we can create a select - // instruction with a scalar condition. Otherwise, use vector-select. - auto *SE = PSE.getSE(); - bool InvariantCond = - SE->isLoopInvariant(PSE.getSCEV(I.getOperand(0)), OrigLoop); - setDebugLocFromInst(Builder, &I); - - // The condition can be loop invariant but still defined inside the - // loop. This means that we can't just use the original 'cond' value. 
- // We have to take the 'vectorized' value and pick the first lane. - // Instcombine will make this a no-op. - const VectorParts &Cond = getVectorValue(I.getOperand(0)); - const VectorParts &Op0 = getVectorValue(I.getOperand(1)); - const VectorParts &Op1 = getVectorValue(I.getOperand(2)); - - auto *ScalarCond = getScalarValue(I.getOperand(0), 0, 0); + // Handle first-order recurrences and reductions that need to be fixed. + if (Legal->isFirstOrderRecurrence(Phi)) + fixFirstOrderRecurrence(Phi); + else if (Legal->isReductionVariable(Phi)) + fixReduction(Phi); + } +} - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Entry[Part] = Builder.CreateSelect( - InvariantCond ? ScalarCond : Cond[Part], Op0[Part], Op1[Part]); - } +void InnerLoopVectorizer::fixReduction(PHINode *Phi) { + Constant *Zero = Builder.getInt32(0); - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, &I); - break; + // Get the reduction variable descriptor. + RecurrenceDescriptor RdxDesc = (*Legal->getReductionVars())[Phi]; + + RecurrenceDescriptor::RecurrenceKind RK = RdxDesc.getRecurrenceKind(); + TrackingVH ReductionStartValue = RdxDesc.getRecurrenceStartValue(); + Instruction *LoopExitInst = RdxDesc.getLoopExitInstr(); + RecurrenceDescriptor::MinMaxRecurrenceKind MinMaxKind = + RdxDesc.getMinMaxRecurrenceKind(); + setDebugLocFromInst(Builder, ReductionStartValue); + + // We need to generate a reduction vector from the incoming scalar. + // To do so, we need to generate the 'identity' vector and override + // one of the elements with the incoming scalar reduction. We need + // to do it in the vector-loop preheader. + Builder.SetInsertPoint(LoopBypassBlocks[1]->getTerminator()); + + // This is the vector-clone of the value that leaves the loop. + const VectorParts &VectorExit = getVectorValue(LoopExitInst); + Type *VecTy = VectorExit[0]->getType(); + + // Find the reduction identity variable. Zero for addition, or, xor, + // one for multiplication, -1 for And. + Value *Identity; + Value *VectorStart; + if (RK == RecurrenceDescriptor::RK_IntegerMinMax || + RK == RecurrenceDescriptor::RK_FloatMinMax) { + // MinMax reduction have the start value as their identify. + if (VF == 1) { + VectorStart = Identity = ReductionStartValue; + } else { + VectorStart = Identity = + Builder.CreateVectorSplat(VF, ReductionStartValue, "minmax.ident"); } + } else { + // Handle other reduction kinds: + Constant *Iden = + RecurrenceDescriptor::getRecurrenceIdentity(RK, VecTy->getScalarType()); + if (VF == 1) { + Identity = Iden; + // This vector is the Identity vector where the first element is the + // incoming scalar reduction. + VectorStart = ReductionStartValue; + } else { + Identity = ConstantVector::getSplat(VF, Iden); - case Instruction::ICmp: - case Instruction::FCmp: { - // Widen compares. Generate vector compares. 
- bool FCmp = (I.getOpcode() == Instruction::FCmp); - auto *Cmp = dyn_cast(&I); - setDebugLocFromInst(Builder, Cmp); - const VectorParts &A = getVectorValue(Cmp->getOperand(0)); - const VectorParts &B = getVectorValue(Cmp->getOperand(1)); - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - Value *C = nullptr; - if (FCmp) { - C = Builder.CreateFCmp(Cmp->getPredicate(), A[Part], B[Part]); - cast(C)->copyFastMathFlags(Cmp); - } else { - C = Builder.CreateICmp(Cmp->getPredicate(), A[Part], B[Part]); - } - Entry[Part] = C; - } - - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, &I); - break; + // This vector is the Identity vector where the first element is the + // incoming scalar reduction. + VectorStart = + Builder.CreateInsertElement(Identity, ReductionStartValue, Zero); } + } - case Instruction::Store: - case Instruction::Load: - vectorizeMemoryInstruction(&I); - break; - case Instruction::ZExt: - case Instruction::SExt: - case Instruction::FPToUI: - case Instruction::FPToSI: - case Instruction::FPExt: - case Instruction::PtrToInt: - case Instruction::IntToPtr: - case Instruction::SIToFP: - case Instruction::UIToFP: - case Instruction::Trunc: - case Instruction::FPTrunc: - case Instruction::BitCast: { - auto *CI = dyn_cast(&I); - setDebugLocFromInst(Builder, CI); - - // Optimize the special case where the source is a constant integer - // induction variable. Notice that we can only optimize the 'trunc' case - // because (a) FP conversions lose precision, (b) sext/zext may wrap, and - // (c) other casts depend on pointer size. - if (Cost->isOptimizableIVTruncate(CI, VF)) { - widenIntInduction(cast(CI->getOperand(0)), - cast(CI)); - break; - } - - /// Vectorize casts. - Type *DestTy = - (VF == 1) ? CI->getType() : VectorType::get(CI->getType(), VF); - - const VectorParts &A = getVectorValue(CI->getOperand(0)); - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) - Entry[Part] = Builder.CreateCast(CI->getOpcode(), A[Part], DestTy); - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, &I); - break; - } + // Fix the vector-loop phi. - case Instruction::Call: { - // Ignore dbg intrinsics. - if (isa(I)) - break; - setDebugLocFromInst(Builder, &I); - - Module *M = BB->getParent()->getParent(); - auto *CI = cast(&I); - - StringRef FnName = CI->getCalledFunction()->getName(); - Function *F = CI->getCalledFunction(); - Type *RetTy = ToVectorTy(CI->getType(), VF); - SmallVector Tys; - for (Value *ArgOperand : CI->arg_operands()) - Tys.push_back(ToVectorTy(ArgOperand->getType(), VF)); - - Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); - if (ID && (ID == Intrinsic::assume || ID == Intrinsic::lifetime_end || - ID == Intrinsic::lifetime_start)) { - scalarizeInstruction(&I); - break; - } - // The flag shows whether we use Intrinsic or a usual Call for vectorized - // version of the instruction. - // Is it beneficial to perform intrinsic call compared to lib call? - bool NeedToScalarize; - unsigned CallCost = getVectorCallCost(CI, VF, *TTI, TLI, NeedToScalarize); - bool UseVectorIntrinsic = - ID && getVectorIntrinsicCost(CI, VF, *TTI, TLI) <= CallCost; - if (!UseVectorIntrinsic && NeedToScalarize) { - scalarizeInstruction(&I); - break; - } + // Reductions do not have to start at zero. They can start with + // any loop invariant values. 
+ const VectorParts &VecRdxPhi = getVectorValue(Phi); + BasicBlock *Latch = OrigLoop->getLoopLatch(); + Value *LoopVal = Phi->getIncomingValueForBlock(Latch); + const VectorParts &Val = getVectorValue(LoopVal); + for (unsigned part = 0; part < UF; ++part) { + // Make sure to add the reduction stat value only to the + // first unroll part. + Value *StartVal = (part == 0) ? VectorStart : Identity; + cast(VecRdxPhi[part])->addIncoming(StartVal, LoopVectorPreHeader); + cast(VecRdxPhi[part]) + ->addIncoming(Val[part], + LI->getLoopFor(LoopVectorBody)->getLoopLatch()); + } + + // Before each round, move the insertion point right between + // the PHIs and the values we are going to write. + // This allows us to write both PHINodes and the extractelement + // instructions. + Builder.SetInsertPoint(&*LoopMiddleBlock->getFirstInsertionPt()); - VectorParts Entry(UF); - for (unsigned Part = 0; Part < UF; ++Part) { - SmallVector Args; - for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) { - Value *Arg = CI->getArgOperand(i); - // Some intrinsics have a scalar argument - don't replace it with a - // vector. - if (!UseVectorIntrinsic || !hasVectorInstrinsicScalarOpd(ID, i)) { - const VectorParts &VectorArg = getVectorValue(CI->getArgOperand(i)); - Arg = VectorArg[Part]; - } - Args.push_back(Arg); - } + VectorParts &RdxParts = VectorLoopValueMap.getVector(LoopExitInst); + setDebugLocFromInst(Builder, LoopExitInst); - Function *VectorF; - if (UseVectorIntrinsic) { - // Use vector version of the intrinsic. - Type *TysForDecl[] = {CI->getType()}; - if (VF > 1) - TysForDecl[0] = VectorType::get(CI->getType()->getScalarType(), VF); - VectorF = Intrinsic::getDeclaration(M, ID, TysForDecl); + // If the vector reduction can be performed in a smaller type, we truncate + // then extend the loop exit value to enable InstCombine to evaluate the + // entire expression in the smaller type. + if (VF > 1 && Phi->getType() != RdxDesc.getRecurrenceType()) { + Type *RdxVecTy = VectorType::get(RdxDesc.getRecurrenceType(), VF); + Builder.SetInsertPoint(LoopVectorBody->getTerminator()); + for (unsigned part = 0; part < UF; ++part) { + Value *Trunc = Builder.CreateTrunc(RdxParts[part], RdxVecTy); + Value *Extnd = RdxDesc.isSigned() ? Builder.CreateSExt(Trunc, VecTy) + : Builder.CreateZExt(Trunc, VecTy); + for (Value::user_iterator UI = RdxParts[part]->user_begin(); + UI != RdxParts[part]->user_end();) + if (*UI != Trunc) { + (*UI++)->replaceUsesOfWith(RdxParts[part], Extnd); + RdxParts[part] = Extnd; } else { - // Use vector version of the library call. - StringRef VFnName = TLI->getVectorizedFunction(FnName, VF); - assert(!VFnName.empty() && "Vector function name is empty."); - VectorF = M->getFunction(VFnName); - if (!VectorF) { - // Generate a declaration - FunctionType *FTy = FunctionType::get(RetTy, Tys, false); - VectorF = - Function::Create(FTy, Function::ExternalLinkage, VFnName, M); - VectorF->copyAttributesFrom(F); - } + ++UI; } - assert(VectorF && "Can't create vector function."); + } + Builder.SetInsertPoint(&*LoopMiddleBlock->getFirstInsertionPt()); + for (unsigned part = 0; part < UF; ++part) + RdxParts[part] = Builder.CreateTrunc(RdxParts[part], RdxVecTy); + } + + // Reduce all of the unrolled parts into a single vector. 
+ Value *ReducedPartRdx = RdxParts[0]; + unsigned Op = RecurrenceDescriptor::getRecurrenceBinOp(RK); + setDebugLocFromInst(Builder, ReducedPartRdx); + for (unsigned part = 1; part < UF; ++part) { + if (Op != Instruction::ICmp && Op != Instruction::FCmp) + // Floating point operations had to be 'fast' to enable the reduction. + ReducedPartRdx = addFastMathFlag( + Builder.CreateBinOp((Instruction::BinaryOps)Op, RdxParts[part], + ReducedPartRdx, "bin.rdx")); + else + ReducedPartRdx = RecurrenceDescriptor::createMinMaxOp( + Builder, MinMaxKind, ReducedPartRdx, RdxParts[part]); + } - SmallVector OpBundles; - CI->getOperandBundlesAsDefs(OpBundles); - CallInst *V = Builder.CreateCall(VectorF, Args, OpBundles); + if (VF > 1) { + // VF is a power of 2 so we can emit the reduction using log2(VF) shuffles + // and vector ops, reducing the set of values being computed by half each + // round. + assert(isPowerOf2_32(VF) && + "Reduction emission only supported for pow2 vectors!"); + Value *TmpVec = ReducedPartRdx; + SmallVector ShuffleMask(VF, nullptr); + for (unsigned i = VF; i != 1; i >>= 1) { + // Move the upper half of the vector to the lower half. + for (unsigned j = 0; j != i / 2; ++j) + ShuffleMask[j] = Builder.getInt32(i / 2 + j); + + // Fill the rest of the mask with undef. + std::fill(&ShuffleMask[i / 2], ShuffleMask.end(), + UndefValue::get(Builder.getInt32Ty())); + + Value *Shuf = Builder.CreateShuffleVector( + TmpVec, UndefValue::get(TmpVec->getType()), + ConstantVector::get(ShuffleMask), "rdx.shuf"); - if (isa(V)) - V->copyFastMathFlags(CI); + if (Op != Instruction::ICmp && Op != Instruction::FCmp) + // Floating point operations had to be 'fast' to enable the reduction. + TmpVec = addFastMathFlag(Builder.CreateBinOp((Instruction::BinaryOps)Op, + TmpVec, Shuf, "bin.rdx")); + else + TmpVec = RecurrenceDescriptor::createMinMaxOp(Builder, MinMaxKind, + TmpVec, Shuf); + } - Entry[Part] = V; - } + // The result is in the first element of the vector. + ReducedPartRdx = + Builder.CreateExtractElement(TmpVec, Builder.getInt32(0)); - VectorLoopValueMap.initVector(&I, Entry); - addMetadata(Entry, &I); + // If the reduction can be performed in a smaller type, we need to extend + // the reduction to the wider type before we branch to the original loop. + if (Phi->getType() != RdxDesc.getRecurrenceType()) + ReducedPartRdx = + RdxDesc.isSigned() + ? Builder.CreateSExt(ReducedPartRdx, Phi->getType()) + : Builder.CreateZExt(ReducedPartRdx, Phi->getType()); + } + + // Create a phi node that merges control-flow from the backedge-taken check + // block and the middle block. + PHINode *BCBlockPhi = PHINode::Create(Phi->getType(), 2, "bc.merge.rdx", + LoopScalarPreHeader->getTerminator()); + for (unsigned I = 0, E = LoopBypassBlocks.size(); I != E; ++I) + BCBlockPhi->addIncoming(ReductionStartValue, LoopBypassBlocks[I]); + BCBlockPhi->addIncoming(ReducedPartRdx, LoopMiddleBlock); + + // Now, we need to fix the users of the reduction variable + // inside and outside of the scalar remainder loop. + // We know that the loop is in LCSSA form. We need to update the + // PHI nodes in the exit blocks. + for (BasicBlock::iterator LEI = LoopExitBlock->begin(), + LEE = LoopExitBlock->end(); + LEI != LEE; ++LEI) { + PHINode *LCSSAPhi = dyn_cast(LEI); + if (!LCSSAPhi) break; - } - default: - // All other instructions are unsupported. Scalarize them. - scalarizeInstruction(&I); - break; - } // end of switch. - } // end of for_each instr. 
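// ---------------------------------------------------------------------------
// Illustrative sketch (editor addition, not part of this patch or of the LLVM
// sources): the reduction epilogue above, modelled with plain std::array so it
// can be read and run on its own.  VF, UF, kIdentity and StartValue are made
// up for the example, and the loop body itself is elided.  It shows (a) why
// only unroll part 0 carries the scalar start value while the other parts
// start at the identity, and (b) the log2(VF) "fold the upper half onto the
// lower half" rounds that the shuffle/binop sequence implements.
// ---------------------------------------------------------------------------
#include <array>
#include <cassert>
#include <cstdio>

constexpr unsigned VF = 4;     // vectorization factor (power of two)
constexpr unsigned UF = 2;     // unroll factor
constexpr int kIdentity = 0;   // identity of the reduction operation (add)

using Vec = std::array<int, VF>;

int main() {
  const int StartValue = 10;   // loop-invariant reduction start value

  // One accumulator per unroll part.  Lane 0 of part 0 holds the incoming
  // scalar start value; every other lane and part holds the identity, so the
  // start value is counted exactly once.
  std::array<Vec, UF> RdxParts;
  for (Vec &Part : RdxParts)
    Part.fill(kIdentity);
  RdxParts[0][0] = StartValue;

  // (The vector loop would accumulate into RdxParts here.)

  // Reduce all of the unrolled parts into a single vector.
  Vec Rdx = RdxParts[0];
  for (unsigned Part = 1; Part < UF; ++Part)
    for (unsigned Lane = 0; Lane < VF; ++Lane)
      Rdx[Lane] += RdxParts[Part][Lane];

  // log2(VF) rounds: each round adds the upper half of the vector onto the
  // lower half, leaving the final value in lane 0.
  static_assert((VF & (VF - 1)) == 0, "emission assumes a power-of-two VF");
  for (unsigned Width = VF; Width != 1; Width >>= 1)
    for (unsigned Lane = 0; Lane < Width / 2; ++Lane)
      Rdx[Lane] += Rdx[Width / 2 + Lane];

  std::printf("reduced value = %d\n", Rdx[0]);
  assert(Rdx[0] == StartValue && "no loop iterations ran in this sketch");
  return 0;
}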
-} + // All PHINodes need to have a single entry edge, or two if + // we already fixed them. + assert(LCSSAPhi->getNumIncomingValues() < 3 && "Invalid LCSSA PHI"); -void InnerLoopVectorizer::updateAnalysis() { - // Forget the original basic block. - PSE.getSE()->forgetLoop(OrigLoop); + // We found a reduction value exit-PHI. Update it with the + // incoming bypass edge. + if (LCSSAPhi->getIncomingValue(0) == LoopExitInst) + LCSSAPhi->addIncoming(ReducedPartRdx, LoopMiddleBlock); + } // end of the LCSSA phi scan. - // Update the dominator tree information. - assert(DT->properlyDominates(LoopBypassBlocks.front(), LoopExitBlock) && - "Entry does not dominate exit."); + // Fix the scalar loop reduction variable with the incoming reduction sum + // from the vector body and from the backedge value. + int IncomingEdgeBlockIdx = + Phi->getBasicBlockIndex(OrigLoop->getLoopLatch()); + assert(IncomingEdgeBlockIdx >= 0 && "Invalid block index"); + // Pick the other block. + int SelfEdgeBlockIdx = (IncomingEdgeBlockIdx ? 0 : 1); + Phi->setIncomingValue(SelfEdgeBlockIdx, BCBlockPhi); + Phi->setIncomingValue(IncomingEdgeBlockIdx, LoopExitInst); +} - // We don't predicate stores by this point, so the vector body should be a - // single loop. - DT->addNewBlock(LoopVectorBody, LoopVectorPreHeader); +void InnerLoopVectorizer::fixFirstOrderRecurrence(PHINode *Phi) { - DT->addNewBlock(LoopMiddleBlock, LoopVectorBody); - DT->addNewBlock(LoopScalarPreHeader, LoopBypassBlocks[0]); - DT->changeImmediateDominator(LoopScalarBody, LoopScalarPreHeader); - DT->changeImmediateDominator(LoopExitBlock, LoopBypassBlocks[0]); + // This is the second phase of vectorizing first-order recurrences. An + // overview of the transformation is described below. Suppose we have the + // following loop. + // + // for (int i = 0; i < n; ++i) + // b[i] = a[i] - a[i - 1]; + // + // There is a first-order recurrence on "a". For this loop, the shorthand + // scalar IR looks like: + // + // scalar.ph: + // s_init = a[-1] + // br scalar.body + // + // scalar.body: + // i = phi [0, scalar.ph], [i+1, scalar.body] + // s1 = phi [s_init, scalar.ph], [s2, scalar.body] + // s2 = a[i] + // b[i] = s2 - s1 + // br cond, scalar.body, ... + // + // In this example, s1 is a recurrence because it's value depends on the + // previous iteration. In the first phase of vectorization, we created a + // temporary value for s1. We now complete the vectorization and produce the + // shorthand vector IR shown below (for VF = 4, UF = 1). + // + // vector.ph: + // v_init = vector(..., ..., ..., a[-1]) + // br vector.body + // + // vector.body + // i = phi [0, vector.ph], [i+4, vector.body] + // v1 = phi [v_init, vector.ph], [v2, vector.body] + // v2 = a[i, i+1, i+2, i+3]; + // v3 = vector(v1(3), v2(0, 1, 2)) + // b[i, i+1, i+2, i+3] = v2 - v3 + // br cond, vector.body, middle.block + // + // middle.block: + // x = v2(3) + // br scalar.ph + // + // scalar.ph: + // s_init = phi [x, middle.block], [a[-1], otherwise] + // br scalar.body + // + // After execution completes the vector loop, we extract the next value of + // the recurrence (x) to use as the initial value in the scalar loop. - DEBUG(DT->verifyDomTree()); -} + // Get the original loop preheader and single loop latch. + auto *Preheader = OrigLoop->getLoopPreheader(); + auto *Latch = OrigLoop->getLoopLatch(); -/// \brief Check whether it is safe to if-convert this phi node. -/// -/// Phi nodes with constant expressions that can trap are not safe to if -/// convert. 
-static bool canIfConvertPHINodes(BasicBlock *BB) { - for (Instruction &I : *BB) { - auto *Phi = dyn_cast(&I); - if (!Phi) - return true; - for (Value *V : Phi->incoming_values()) - if (auto *C = dyn_cast(V)) - if (C->canTrap()) - return false; - } - return true; -} + // Get the initial and previous values of the scalar recurrence. + auto *ScalarInit = Phi->getIncomingValueForBlock(Preheader); + auto *Previous = Phi->getIncomingValueForBlock(Latch); -bool LoopVectorizationLegality::canVectorizeWithIfConvert() { - if (!EnableIfConversion) { - ORE->emit(createMissedAnalysis("IfConversionDisabled") - << "if-conversion is disabled"); - return false; + // Create a vector from the initial value. + auto *VectorInit = ScalarInit; + if (VF > 1) { + Builder.SetInsertPoint(LoopVectorPreHeader->getTerminator()); + VectorInit = Builder.CreateInsertElement( + UndefValue::get(VectorType::get(VectorInit->getType(), VF)), VectorInit, + Builder.getInt32(VF - 1), "vector.recur.init"); } - assert(TheLoop->getNumBlocks() > 1 && "Single block loops are vectorizable"); + // We constructed a temporary phi node in the first phase of vectorization. + // This phi node will eventually be deleted. + VectorParts &PhiParts = VectorLoopValueMap.getVector(Phi); + Builder.SetInsertPoint(cast(PhiParts[0])); - // A list of pointers that we can safely read and write to. - SmallPtrSet SafePointes; + // Create a phi node for the new recurrence. The current value will either be + // the initial value inserted into a vector or loop-varying vector value. + auto *VecPhi = Builder.CreatePHI(VectorInit->getType(), 2, "vector.recur"); + VecPhi->addIncoming(VectorInit, LoopVectorPreHeader); - // Collect safe addresses. - for (BasicBlock *BB : TheLoop->blocks()) { - if (blockNeedsPredication(BB)) - continue; + // Get the vectorized previous value. We ensured the previous values was an + // instruction when detecting the recurrence. + auto &PreviousParts = getVectorValue(Previous); - for (Instruction &I : *BB) - if (auto *Ptr = getPointerOperand(&I)) - SafePointes.insert(Ptr); - } + // Set the insertion point to be after this instruction. We ensured the + // previous value dominated all uses of the phi when detecting the + // recurrence. + Builder.SetInsertPoint( + &*++BasicBlock::iterator(cast(PreviousParts[UF - 1]))); - // Collect the blocks that need predication. - BasicBlock *Header = TheLoop->getHeader(); - for (BasicBlock *BB : TheLoop->blocks()) { - // We don't support switch statements inside loops. - if (!isa(BB->getTerminator())) { - ORE->emit(createMissedAnalysis("LoopContainsSwitch", BB->getTerminator()) - << "loop contains a switch statement"); - return false; - } + // We will construct a vector for the recurrence by combining the values for + // the current and previous iterations. This is the required shuffle mask. + SmallVector ShuffleMask(VF); + ShuffleMask[0] = Builder.getInt32(VF - 1); + for (unsigned I = 1; I < VF; ++I) + ShuffleMask[I] = Builder.getInt32(I + VF - 1); - // We must be able to predicate all blocks that need to be predicated. 
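// ---------------------------------------------------------------------------
// Illustrative sketch (editor addition, not part of this patch or of the LLVM
// sources): the two-source shuffle built above for first-order recurrences,
// modelled in plain C++.  The mask {VF-1, VF, VF+1, ..., 2*VF-2} takes the
// last lane of the incoming vector (the value carried from the previous
// part/iteration) followed by the first VF-1 lanes of the current part, which
// is exactly the "previous iteration" operand needed for b[i] = a[i] - a[i-1].
// The array contents and ScalarInit are invented for the example.
// ---------------------------------------------------------------------------
#include <cassert>
#include <cstddef>
#include <vector>

// Two-source shuffle: mask indices 0..VF-1 select from Lo, VF..2*VF-1 from Hi.
static std::vector<int> shuffle(const std::vector<int> &Lo,
                                const std::vector<int> &Hi,
                                const std::vector<unsigned> &Mask) {
  std::vector<int> Out(Mask.size());
  for (std::size_t I = 0; I < Mask.size(); ++I)
    Out[I] = Mask[I] < Lo.size() ? Lo[Mask[I]] : Hi[Mask[I] - Lo.size()];
  return Out;
}

int main() {
  constexpr unsigned VF = 4;
  const std::vector<int> A = {3, 1, 4, 1, 5, 9, 2, 6}; // size divisible by VF
  const int ScalarInit = 7;                            // plays the role of a[-1]

  // The recurrence mask: last lane of the previous vector, then lanes
  // 0 .. VF-2 of the current one.
  std::vector<unsigned> Mask = {VF - 1};
  for (unsigned I = 1; I < VF; ++I)
    Mask.push_back(I + VF - 1);

  // vector.ph: v_init = (..., ..., ..., a[-1]); only its last lane is ever
  // read through the mask.
  std::vector<int> Incoming(VF, 0);
  Incoming[VF - 1] = ScalarInit;

  std::vector<int> B(A.size());
  for (std::size_t Base = 0; Base < A.size(); Base += VF) {
    std::vector<int> Cur(A.begin() + Base, A.begin() + Base + VF); // v2
    std::vector<int> Prev = shuffle(Incoming, Cur, Mask);          // v3
    for (unsigned Lane = 0; Lane < VF; ++Lane)
      B[Base + Lane] = Cur[Lane] - Prev[Lane];
    Incoming = Cur; // carry this block's values into the next iteration
  }

  // Check the blocked computation against the scalar recurrence.
  int S = ScalarInit;
  for (std::size_t I = 0; I < A.size(); ++I) {
    assert(B[I] == A[I] - S);
    S = A[I];
  }
  return 0;
}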
- if (blockNeedsPredication(BB)) { - if (!blockCanBePredicated(BB, SafePointes)) { - ORE->emit(createMissedAnalysis("NoCFGForSelect", BB->getTerminator()) - << "control flow cannot be substituted for a select"); - return false; - } - } else if (BB != Header && !canIfConvertPHINodes(BB)) { - ORE->emit(createMissedAnalysis("NoCFGForSelect", BB->getTerminator()) - << "control flow cannot be substituted for a select"); - return false; - } + // The vector from which to take the initial value for the current iteration + // (actual or unrolled). Initially, this is the vector phi node. + Value *Incoming = VecPhi; + + // Shuffle the current and previous vector and update the vector parts. + for (unsigned Part = 0; Part < UF; ++Part) { + auto *Shuffle = + VF > 1 + ? Builder.CreateShuffleVector(Incoming, PreviousParts[Part], + ConstantVector::get(ShuffleMask)) + : Incoming; + PhiParts[Part]->replaceAllUsesWith(Shuffle); + cast(PhiParts[Part])->eraseFromParent(); + PhiParts[Part] = Shuffle; + Incoming = PreviousParts[Part]; } - // We can if-convert this loop. - return true; -} + // Fix the latch value of the new recurrence in the vector loop. + VecPhi->addIncoming(Incoming, LI->getLoopFor(LoopVectorBody)->getLoopLatch()); -bool LoopVectorizationLegality::canVectorize() { - // We must have a loop in canonical form. Loops with indirectbr in them cannot - // be canonicalized. - if (!TheLoop->getLoopPreheader()) { - ORE->emit(createMissedAnalysis("CFGNotUnderstood") - << "loop control flow is not understood by vectorizer"); - return false; + // Extract the last vector element in the middle block. This will be the + // initial value for the recurrence when jumping to the scalar loop. + auto *Extract = Incoming; + if (VF > 1) { + Builder.SetInsertPoint(LoopMiddleBlock->getTerminator()); + Extract = Builder.CreateExtractElement(Extract, Builder.getInt32(VF - 1), + "vector.recur.extract"); } - // FIXME: The code is currently dead, since the loop gets sent to - // LoopVectorizationLegality is already an innermost loop. - // - // We can only vectorize innermost loops. - if (!TheLoop->empty()) { - ORE->emit(createMissedAnalysis("NotInnermostLoop") - << "loop is not the innermost loop"); - return false; + // Fix the initial value of the original recurrence in the scalar loop. + Builder.SetInsertPoint(&*LoopScalarPreHeader->begin()); + auto *Start = Builder.CreatePHI(Phi->getType(), 2, "scalar.recur.init"); + for (auto *BB : predecessors(LoopScalarPreHeader)) { + auto *Incoming = BB == LoopMiddleBlock ? Extract : ScalarInit; + Start->addIncoming(Incoming, BB); } - // We must have a single backedge. - if (TheLoop->getNumBackEdges() != 1) { - ORE->emit(createMissedAnalysis("CFGNotUnderstood") - << "loop control flow is not understood by vectorizer"); - return false; - } + Phi->setIncomingValue(Phi->getBasicBlockIndex(LoopScalarPreHeader), Start); + Phi->setName("scalar.recur"); - // We must have a single exiting block. - if (!TheLoop->getExitingBlock()) { - ORE->emit(createMissedAnalysis("CFGNotUnderstood") - << "loop control flow is not understood by vectorizer"); - return false; + // Finally, fix users of the recurrence outside the loop. The users will need + // either the last value of the scalar recurrence or the last value of the + // vector recurrence we extracted in the middle block. Since the loop is in + // LCSSA form, we just need to find the phi node for the original scalar + // recurrence in the exit block, and then add an edge for the middle block. 
+ for (auto &I : *LoopExitBlock) { + auto *LCSSAPhi = dyn_cast(&I); + if (!LCSSAPhi) + break; + if (LCSSAPhi->getIncomingValue(0) == Phi) { + LCSSAPhi->addIncoming(Extract, LoopMiddleBlock); + break; + } } +} - // We only handle bottom-tested loops, i.e. loop in which the condition is - // checked at the end of each iteration. With that we can assume that all - // instructions in the loop are executed the same number of times. - if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) { - ORE->emit(createMissedAnalysis("CFGNotUnderstood") - << "loop control flow is not understood by vectorizer"); - return false; +void InnerLoopVectorizer::fixLCSSAPHIs() { + for (Instruction &LEI : *LoopExitBlock) { + auto *LCSSAPhi = dyn_cast(&LEI); + if (!LCSSAPhi) + break; + if (LCSSAPhi->getNumIncomingValues() == 1) + LCSSAPhi->addIncoming(UndefValue::get(LCSSAPhi->getType()), + LoopMiddleBlock); } +} - // We need to have a loop header. - DEBUG(dbgs() << "LV: Found a loop: " << TheLoop->getHeader()->getName() - << '\n'); +void InnerLoopVectorizer::collectTriviallyDeadInstructions( + Loop *OrigLoop, LoopVectorizationLegality *Legal, + SmallPtrSetImpl &DeadInstructions) { + BasicBlock *Latch = OrigLoop->getLoopLatch(); - // Check if we can if-convert non-single-bb loops. - unsigned NumBlocks = TheLoop->getNumBlocks(); - if (NumBlocks != 1 && !canVectorizeWithIfConvert()) { - DEBUG(dbgs() << "LV: Can't if-convert the loop.\n"); - return false; - } + // We create new control-flow for the vectorized loop, so the original + // condition will be dead after vectorization if it's only used by the + // branch. + auto *Cmp = dyn_cast(Latch->getTerminator()->getOperand(0)); + if (Cmp && Cmp->hasOneUse()) + DeadInstructions.insert(Cmp); - // ScalarEvolution needs to be able to find the exit count. - const SCEV *ExitCount = PSE.getBackedgeTakenCount(); - if (ExitCount == PSE.getSE()->getCouldNotCompute()) { - ORE->emit(createMissedAnalysis("CantComputeNumberOfIterations") - << "could not determine number of loop iterations"); - DEBUG(dbgs() << "LV: SCEV could not compute the loop exit count.\n"); - return false; + // We create new "steps" for induction variable updates to which the original + // induction variables map. An original update instruction will be dead if + // all its users except the induction variable are dead. + for (auto &Induction : *Legal->getInductionVars()) { + PHINode *Ind = Induction.first; + auto *IndUpdate = cast(Ind->getIncomingValueForBlock(Latch)); + if (all_of(IndUpdate->users(), [&](User *U) -> bool { + return U == Ind || DeadInstructions.count(cast(U)); + })) + DeadInstructions.insert(IndUpdate); } +} - // Check if we can vectorize the instructions and CFG in this loop. - if (!canVectorizeInstrs()) { - DEBUG(dbgs() << "LV: Can't vectorize the instructions or CFG\n"); - return false; - } +void InnerLoopUnroller::sinkScalarOperands(Instruction *PredInst) { - // Go over each instruction and look at memory deps. - if (!canVectorizeMemory()) { - DEBUG(dbgs() << "LV: Can't vectorize due to memory conflicts\n"); - return false; - } + // The basic block and loop containing the predicated instruction. + auto *PredBB = PredInst->getParent(); + auto *VectorLoop = LI->getLoopFor(PredBB); - DEBUG(dbgs() << "LV: We can vectorize this loop" - << (LAI->getRuntimePointerChecking()->Need - ? " (with a runtime bound check)" - : "") - << "!\n"); + // Initialize a worklist with the operands of the predicated instruction. 
+ SetVector Worklist(PredInst->op_begin(), PredInst->op_end()); - bool UseInterleaved = TTI->enableInterleavedAccessVectorization(); + // Holds instructions that we need to analyze again. An instruction may be + // reanalyzed if we don't yet know if we can sink it or not. + SmallVector InstsToReanalyze; - // If an override option has been passed in for interleaved accesses, use it. - if (EnableInterleavedMemAccesses.getNumOccurrences() > 0) - UseInterleaved = EnableInterleavedMemAccesses; + // Returns true if a given use occurs in the predicated block. Phi nodes use + // their operands in their corresponding predecessor blocks. + auto isBlockOfUsePredicated = [&](Use &U) -> bool { + auto *I = cast(U.getUser()); + BasicBlock *BB = I->getParent(); + if (auto *Phi = dyn_cast(I)) + BB = Phi->getIncomingBlock( + PHINode::getIncomingValueNumForOperand(U.getOperandNo())); + return BB == PredBB; + }; - // Analyze interleaved memory accesses. - if (UseInterleaved) - InterleaveInfo.analyzeInterleaving(*getSymbolicStrides()); + // Iteratively sink the scalarized operands of the predicated instruction + // into the block we created for it. When an instruction is sunk, it's + // operands are then added to the worklist. The algorithm ends after one pass + // through the worklist doesn't sink a single instruction. + bool Changed; + do { - unsigned SCEVThreshold = VectorizeSCEVCheckThreshold; - if (Hints->getForce() == LoopVectorizeHints::FK_Enabled) - SCEVThreshold = PragmaVectorizeSCEVCheckThreshold; + // Add the instructions that need to be reanalyzed to the worklist, and + // reset the changed indicator. + Worklist.insert(InstsToReanalyze.begin(), InstsToReanalyze.end()); + InstsToReanalyze.clear(); + Changed = false; - if (PSE.getUnionPredicate().getComplexity() > SCEVThreshold) { - ORE->emit(createMissedAnalysis("TooManySCEVRunTimeChecks") - << "Too many SCEV assumptions need to be made and checked " - << "at runtime"); - DEBUG(dbgs() << "LV: Too many SCEV checks needed.\n"); - return false; - } + while (!Worklist.empty()) { + auto *I = dyn_cast(Worklist.pop_back_val()); - // Okay! We can vectorize. At this point we don't have any other mem analysis - // which may limit our maximum vectorization factor, so just return true with - // no restrictions. - return true; -} + // We can't sink an instruction if it is a phi node, is already in the + // predicated block, is not in the loop, or may have side effects. + if (!I || isa(I) || I->getParent() == PredBB || + !VectorLoop->contains(I) || I->mayHaveSideEffects()) + continue; -static Type *convertPointerToIntegerType(const DataLayout &DL, Type *Ty) { - if (Ty->isPointerTy()) - return DL.getIntPtrType(Ty); + // It's legal to sink the instruction if all its uses occur in the + // predicated block. Otherwise, there's nothing to do yet, and we may + // need to reanalyze the instruction. + if (!all_of(I->uses(), isBlockOfUsePredicated)) { + InstsToReanalyze.push_back(I); + continue; + } - // It is possible that char's or short's overflow when we ask for the loop's - // trip count, work around this by changing the type size. - if (Ty->getScalarSizeInBits() < 32) - return Type::getInt32Ty(Ty->getContext()); + // Move the instruction to the beginning of the predicated block, and add + // it's operands to the worklist. + I->moveBefore(&*PredBB->getFirstInsertionPt()); + Worklist.insert(I->op_begin(), I->op_end()); - return Ty; + // The sinking may have enabled other instructions to be sunk, so we will + // need to iterate. 
+ Changed = true; + } + } while (Changed); } -static Type *getWiderType(const DataLayout &DL, Type *Ty0, Type *Ty1) { - Ty0 = convertPointerToIntegerType(DL, Ty0); - Ty1 = convertPointerToIntegerType(DL, Ty1); - if (Ty0->getScalarSizeInBits() > Ty1->getScalarSizeInBits()) - return Ty0; - return Ty1; -} +void InnerLoopUnroller::vectorizeLoop() { -/// \brief Check that the instruction has outside loop users and is not an -/// identified reduction variable. -static bool hasOutsideLoopUser(const Loop *TheLoop, Instruction *Inst, - SmallPtrSetImpl &AllowedExit) { - // Reduction and Induction instructions are allowed to have exit users. All - // other instructions must not have external users. - if (!AllowedExit.count(Inst)) - // Check that all of the users of the loop are inside the BB. - for (User *U : Inst->users()) { - Instruction *UI = cast(U); - // This user may be a reduction exit value. - if (!TheLoop->contains(UI)) { - DEBUG(dbgs() << "LV: Found an outside user for : " << *UI << '\n'); - return true; - } - } - return false; -} + // Collect instructions from the original loop that will become trivially + // dead in the vectorized loop. We don't need to vectorize these + // instructions. + collectTriviallyDeadInstructions(OrigLoop, Legal, DeadInstructions); -void LoopVectorizationLegality::addInductionPhi( - PHINode *Phi, const InductionDescriptor &ID, - SmallPtrSetImpl &AllowedExit) { - Inductions[Phi] = ID; - Type *PhiTy = Phi->getType(); - const DataLayout &DL = Phi->getModule()->getDataLayout(); + // Scan the loop in a topological order to ensure that defs are vectorized + // before users. + LoopBlocksDFS DFS(OrigLoop); + DFS.perform(LI); - // Get the widest type. - if (!PhiTy->isFloatingPointTy()) { - if (!WidestIndTy) - WidestIndTy = convertPointerToIntegerType(DL, PhiTy); - else - WidestIndTy = getWiderType(DL, PhiTy, WidestIndTy); - } + // Vectorize all of the blocks in the original loop. + for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) + for (Instruction &I : *BB) { + if (!DeadInstructions.count(&I)) + vectorizeInstruction(I); + } - // Int inductions are special because we only allow one IV. - if (ID.getKind() == InductionDescriptor::IK_IntInduction && - ID.getConstIntStepValue() && - ID.getConstIntStepValue()->isOne() && - isa(ID.getStartValue()) && - cast(ID.getStartValue())->isNullValue()) { + fixCrossIterationPHIs(); - // Use the phi node with the widest type as induction. Use the last - // one if there are multiple (no good reason for doing this other - // than it is expedient). We've checked that it begins at zero and - // steps by one, so this is a canonical induction variable. - if (!PrimaryInduction || PhiTy == WidestIndTy) - PrimaryInduction = Phi; - } + // Update the dominator tree. + // + // FIXME: After creating the structure of the new loop, the dominator tree is + // no longer up-to-date, and it remains that way until we update it + // here. An out-of-date dominator tree is problematic for SCEV, + // because SCEVExpander uses it to guide code generation. The + // vectorizer use SCEVExpanders in several places. Instead, we should + // keep the dominator tree up-to-date as we go. + updateAnalysis(); - // Both the PHI node itself, and the "post-increment" value feeding - // back into the PHI node may have external users. - AllowedExit.insert(Phi); - AllowedExit.insert(Phi->getIncomingValueForBlock(TheLoop->getLoopLatch())); + // Fix-up external users of the induction variables. 
+ for (auto &Entry : *Legal->getInductionVars()) + fixupIVUsers(Entry.first, Entry.second, + getOrCreateVectorTripCount(LI->getLoopFor(LoopVectorBody)), + IVEndValues[Entry.first], LoopMiddleBlock); - DEBUG(dbgs() << "LV: Found an induction variable.\n"); - return; + fixLCSSAPHIs(); + predicateInstructions(); + + // Remove redundant induction instructions. + cse(LoopVectorBody); } -bool LoopVectorizationLegality::canVectorizeInstrs() { - BasicBlock *Header = TheLoop->getHeader(); +void InnerLoopUnroller::predicateInstructions() { - // Look for the attribute signaling the absence of NaNs. - Function &F = *Header->getParent(); - HasFunNoNaNAttr = - F.getFnAttribute("no-nans-fp-math").getValueAsString() == "true"; + // For each instruction I marked for predication on value C, split I into its + // own basic block to form an if-then construct over C. Since I may be fed by + // an extractelement instruction or other scalar operand, we try to + // iteratively sink its scalar operands into the predicated block. If I feeds + // an insertelement instruction, we try to move this instruction into the + // predicated block as well. For non-void types, a phi node will be created + // for the resulting value (either vector or scalar). + // + // So for some predicated instruction, e.g. the conditional sdiv in: + // + // for.body: + // ... + // %add = add nsw i32 %mul, %0 + // %cmp5 = icmp sgt i32 %2, 7 + // br i1 %cmp5, label %if.then, label %if.end + // + // if.then: + // %div = sdiv i32 %0, %1 + // br label %if.end + // + // if.end: + // %x.0 = phi i32 [ %div, %if.then ], [ %add, %for.body ] + // + // the sdiv at this point is scalarized and if-converted using a select. + // The inactive elements in the vector are not used, but the predicated + // instruction is still executed for all vector elements, essentially: + // + // vector.body: + // ... + // %17 = add nsw <2 x i32> %16, %wide.load + // %29 = extractelement <2 x i32> %wide.load, i32 0 + // %30 = extractelement <2 x i32> %wide.load51, i32 0 + // %31 = sdiv i32 %29, %30 + // %32 = insertelement <2 x i32> undef, i32 %31, i32 0 + // %35 = extractelement <2 x i32> %wide.load, i32 1 + // %36 = extractelement <2 x i32> %wide.load51, i32 1 + // %37 = sdiv i32 %35, %36 + // %38 = insertelement <2 x i32> %32, i32 %37, i32 1 + // %predphi = select <2 x i1> %26, <2 x i32> %38, <2 x i32> %17 + // + // Predication will now re-introduce the original control flow to avoid false + // side-effects by the sdiv instructions on the inactive elements, yielding + // (after cleanup): + // + // vector.body: + // ... 
+ // %5 = add nsw <2 x i32> %4, %wide.load + // %8 = icmp sgt <2 x i32> %wide.load52, + // %9 = extractelement <2 x i1> %8, i32 0 + // br i1 %9, label %pred.sdiv.if, label %pred.sdiv.continue + // + // pred.sdiv.if: + // %10 = extractelement <2 x i32> %wide.load, i32 0 + // %11 = extractelement <2 x i32> %wide.load51, i32 0 + // %12 = sdiv i32 %10, %11 + // %13 = insertelement <2 x i32> undef, i32 %12, i32 0 + // br label %pred.sdiv.continue + // + // pred.sdiv.continue: + // %14 = phi <2 x i32> [ undef, %vector.body ], [ %13, %pred.sdiv.if ] + // %15 = extractelement <2 x i1> %8, i32 1 + // br i1 %15, label %pred.sdiv.if54, label %pred.sdiv.continue55 + // + // pred.sdiv.if54: + // %16 = extractelement <2 x i32> %wide.load, i32 1 + // %17 = extractelement <2 x i32> %wide.load51, i32 1 + // %18 = sdiv i32 %16, %17 + // %19 = insertelement <2 x i32> %14, i32 %18, i32 1 + // br label %pred.sdiv.continue55 + // + // pred.sdiv.continue55: + // %20 = phi <2 x i32> [ %14, %pred.sdiv.continue ], [ %19, %pred.sdiv.if54 ] + // %predphi = select <2 x i1> %8, <2 x i32> %20, <2 x i32> %5 - // For each block in the loop. - for (BasicBlock *BB : TheLoop->blocks()) { - // Scan the instructions in the block and look for hazards. - for (Instruction &I : *BB) { - if (auto *Phi = dyn_cast(&I)) { - Type *PhiTy = Phi->getType(); - // Check that this PHI type is allowed. - if (!PhiTy->isIntegerTy() && !PhiTy->isFloatingPointTy() && - !PhiTy->isPointerTy()) { - ORE->emit(createMissedAnalysis("CFGNotUnderstood", Phi) - << "loop control flow is not understood by vectorizer"); - DEBUG(dbgs() << "LV: Found an non-int non-pointer PHI.\n"); - return false; - } + for (auto KV : PredicatedInstructions) { + BasicBlock::iterator I(KV.first); + BasicBlock *Head = I->getParent(); + auto *BB = SplitBlock(Head, &*std::next(I), DT, LI); + auto *T = SplitBlockAndInsertIfThen(KV.second, &*I, /*Unreachable=*/false, + /*BranchWeights=*/nullptr, DT, LI); + I->moveBefore(T); + sinkScalarOperands(&*I); - // If this PHINode is not in the header block, then we know that we - // can convert it to select during if-conversion. No need to check if - // the PHIs in this block are induction or reduction variables. - if (BB != Header) { - // Check that this instruction has no outside users or is an - // identified reduction value with an outside user. - if (!hasOutsideLoopUser(TheLoop, Phi, AllowedExit)) - continue; - ORE->emit(createMissedAnalysis("NeitherInductionNorReduction", Phi) - << "value could not be identified as " - "an induction or reduction variable"); - return false; - } + I->getParent()->setName(Twine("pred.") + I->getOpcodeName() + ".if"); + BB->setName(Twine("pred.") + I->getOpcodeName() + ".continue"); - // We only allow if-converted PHIs with exactly two incoming values. - if (Phi->getNumIncomingValues() != 2) { - ORE->emit(createMissedAnalysis("CFGNotUnderstood", Phi) - << "control flow not understood by vectorizer"); - DEBUG(dbgs() << "LV: Found an invalid PHI.\n"); - return false; - } + // If the instruction is non-void create a Phi node at reconvergence point. 
+ if (!I->getType()->isVoidTy()) { + Value *IncomingTrue = nullptr; + Value *IncomingFalse = nullptr; - RecurrenceDescriptor RedDes; - if (RecurrenceDescriptor::isReductionPHI(Phi, TheLoop, RedDes)) { - if (RedDes.hasUnsafeAlgebra()) - Requirements->addUnsafeAlgebraInst(RedDes.getUnsafeAlgebraInst()); - AllowedExit.insert(RedDes.getLoopExitInstr()); - Reductions[Phi] = RedDes; - continue; - } + if (I->hasOneUse() && isa(*I->user_begin())) { + // If the predicated instruction is feeding an insert-element, move it + // into the Then block; Phi node will be created for the vector. + InsertElementInst *IEI = cast(*I->user_begin()); + IEI->moveBefore(T); + IncomingTrue = IEI; // the new vector with the inserted element. + IncomingFalse = IEI->getOperand(0); // the unmodified vector + } else { + // Phi node will be created for the scalar predicated instruction. + IncomingTrue = &*I; + IncomingFalse = UndefValue::get(I->getType()); + } - InductionDescriptor ID; - if (InductionDescriptor::isInductionPHI(Phi, TheLoop, PSE, ID)) { - addInductionPhi(Phi, ID, AllowedExit); - if (ID.hasUnsafeAlgebra() && !HasFunNoNaNAttr) - Requirements->addUnsafeAlgebraInst(ID.getUnsafeAlgebraInst()); - continue; - } + BasicBlock *PostDom = I->getParent()->getSingleSuccessor(); + assert(PostDom && "Then block has multiple successors"); + PHINode *Phi = + PHINode::Create(IncomingTrue->getType(), 2, "", &PostDom->front()); + IncomingTrue->replaceAllUsesWith(Phi); + Phi->addIncoming(IncomingFalse, Head); + Phi->addIncoming(IncomingTrue, I->getParent()); + } + } - if (RecurrenceDescriptor::isFirstOrderRecurrence(Phi, TheLoop, DT)) { - FirstOrderRecurrences.insert(Phi); - continue; - } + DEBUG(DT->verifyDomTree()); +} - // As a last resort, coerce the PHI to a AddRec expression - // and re-try classifying it a an induction PHI. - if (InductionDescriptor::isInductionPHI(Phi, TheLoop, PSE, ID, true)) { - addInductionPhi(Phi, ID, AllowedExit); - continue; - } +InnerLoopVectorizer::VectorParts +InnerLoopVectorizer::createEdgeMask(BasicBlock *Src, BasicBlock *Dst) { + assert(is_contained(predecessors(Dst), Src) && "Invalid edge"); - ORE->emit(createMissedAnalysis("NonReductionValueUsedOutsideLoop", Phi) - << "value that could not be identified as " - "reduction is used outside the loop"); - DEBUG(dbgs() << "LV: Found an unidentified PHI." << *Phi << "\n"); - return false; - } // end of PHI handling - - // We handle calls that: - // * Are debug info intrinsics. - // * Have a mapping to an IR intrinsic. - // * Have a vector version available. - auto *CI = dyn_cast(&I); - if (CI && !getVectorIntrinsicIDForCall(CI, TLI) && - !isa(CI) && - !(CI->getCalledFunction() && TLI && - TLI->isFunctionVectorizable(CI->getCalledFunction()->getName()))) { - ORE->emit(createMissedAnalysis("CantVectorizeCall", CI) - << "call instruction cannot be vectorized"); - DEBUG(dbgs() << "LV: Found a non-intrinsic, non-libfunc callsite.\n"); - return false; - } + // Look for cached value. + std::pair Edge(Src, Dst); + EdgeMaskCacheTy::iterator ECEntryIt = EdgeMaskCache.find(Edge); + if (ECEntryIt != EdgeMaskCache.end()) + return ECEntryIt->second; - // Intrinsics such as powi,cttz and ctlz are legal to vectorize if the - // second argument is the same (i.e. 
loop invariant) - if (CI && hasVectorInstrinsicScalarOpd( - getVectorIntrinsicIDForCall(CI, TLI), 1)) { - auto *SE = PSE.getSE(); - if (!SE->isLoopInvariant(PSE.getSCEV(CI->getOperand(1)), TheLoop)) { - ORE->emit(createMissedAnalysis("CantVectorizeIntrinsic", CI) - << "intrinsic instruction cannot be vectorized"); - DEBUG(dbgs() << "LV: Found unvectorizable intrinsic " << *CI << "\n"); - return false; - } - } + VectorParts SrcMask = createBlockInMask(Src); - // Check that the instruction return type is vectorizable. - // Also, we can't vectorize extractelement instructions. - if ((!VectorType::isValidElementType(I.getType()) && - !I.getType()->isVoidTy()) || - isa(I)) { - ORE->emit(createMissedAnalysis("CantVectorizeInstructionReturnType", &I) - << "instruction return type cannot be vectorized"); - DEBUG(dbgs() << "LV: Found unvectorizable type.\n"); - return false; - } + // The terminator has to be a branch inst! + BranchInst *BI = dyn_cast(Src->getTerminator()); + assert(BI && "Unexpected terminator found"); - // Check that the stored type is vectorizable. - if (auto *ST = dyn_cast(&I)) { - Type *T = ST->getValueOperand()->getType(); - if (!VectorType::isValidElementType(T)) { - ORE->emit(createMissedAnalysis("CantVectorizeStore", ST) - << "store instruction cannot be vectorized"); - return false; - } + if (BI->isConditional()) { + VectorParts EdgeMask = getVectorValue(BI->getCondition()); - // FP instructions can allow unsafe algebra, thus vectorizable by - // non-IEEE-754 compliant SIMD units. - // This applies to floating-point math operations and calls, not memory - // operations, shuffles, or casts, as they don't change precision or - // semantics. - } else if (I.getType()->isFloatingPointTy() && (CI || I.isBinaryOp()) && - !I.hasUnsafeAlgebra()) { - DEBUG(dbgs() << "LV: Found FP op with unsafe algebra.\n"); - Hints->setPotentiallyUnsafe(); - } + if (BI->getSuccessor(0) != Dst) + for (unsigned part = 0; part < UF; ++part) + EdgeMask[part] = Builder.CreateNot(EdgeMask[part]); - // Reduction instructions are allowed to have exit users. - // All other instructions must not have external users. - if (hasOutsideLoopUser(TheLoop, &I, AllowedExit)) { - ORE->emit(createMissedAnalysis("ValueUsedOutsideLoop", &I) - << "value cannot be used outside the loop"); - return false; - } + for (unsigned part = 0; part < UF; ++part) + EdgeMask[part] = Builder.CreateAnd(EdgeMask[part], SrcMask[part]); - } // next instr. + EdgeMaskCache[Edge] = EdgeMask; + return EdgeMask; } - if (!PrimaryInduction) { - DEBUG(dbgs() << "LV: Did not find one integer induction var.\n"); - if (Inductions.empty()) { - ORE->emit(createMissedAnalysis("NoInductionVariable") - << "loop induction variable could not be identified"); - return false; - } - } + EdgeMaskCache[Edge] = SrcMask; + return SrcMask; +} - // Now we know the widest induction type, check if our found induction - // is the same size. If it's not, unset it here and InnerLoopVectorizer - // will create another. - if (PrimaryInduction && WidestIndTy != PrimaryInduction->getType()) - PrimaryInduction = nullptr; +InnerLoopVectorizer::VectorParts +InnerLoopVectorizer::createBlockInMask(BasicBlock *BB) { + assert(OrigLoop->contains(BB) && "Block is not a part of a loop"); - return true; -} + // Look for cached value. + BlockMaskCacheTy::iterator BCEntryIt = BlockMaskCache.find(BB); + if (BCEntryIt != BlockMaskCache.end()) + return BCEntryIt->second; -void LoopVectorizationCostModel::collectLoopScalars(unsigned VF) { + // Loop incoming mask is all-one. 
+ if (OrigLoop->getHeader() == BB) { + Value *C = ConstantInt::get(IntegerType::getInt1Ty(BB->getContext()), 1); + return getVectorValue(C); + } - // We should not collect Scalars more than once per VF. Right now, - // this function is called from collectUniformsAndScalars(), which - // already does this check. Collecting Scalars for VF=1 does not make any - // sense. + // This is the block mask. We OR all incoming edges, and with zero. + Value *Zero = ConstantInt::get(IntegerType::getInt1Ty(BB->getContext()), 0); + VectorParts BlockMask = getVectorValue(Zero); - assert(VF >= 2 && !Scalars.count(VF) && - "This function should not be visited twice for the same VF"); + // For each pred: + for (pred_iterator it = pred_begin(BB), e = pred_end(BB); it != e; ++it) { + VectorParts EM = createEdgeMask(*it, BB); + for (unsigned part = 0; part < UF; ++part) + BlockMask[part] = Builder.CreateOr(BlockMask[part], EM[part]); + } - // If an instruction is uniform after vectorization, it will remain scalar. - Scalars[VF].insert(Uniforms[VF].begin(), Uniforms[VF].end()); + BlockMaskCache[BB] = BlockMask; + return BlockMask; +} - // Collect the getelementptr instructions that will not be vectorized. A - // getelementptr instruction is only vectorized if it is used for a legal - // gather or scatter operation. - for (auto *BB : TheLoop->blocks()) - for (auto &I : *BB) { - if (auto *GEP = dyn_cast(&I)) { - Scalars[VF].insert(GEP); - continue; - } - auto *Ptr = getPointerOperand(&I); - if (!Ptr) - continue; - auto *GEP = getGEPInstruction(Ptr); - if (GEP && getWideningDecision(&I, VF) == CM_GatherScatter) - Scalars[VF].erase(GEP); +void InnerLoopVectorizer::widenPHIInstruction(Instruction *PN, unsigned UF, + unsigned VF, PhiVector *PV) { + PHINode *P = cast(PN); + // Handle recurrences. + if (Legal->isReductionVariable(P) || Legal->isFirstOrderRecurrence(P)) { + VectorParts Entry(UF); + for (unsigned part = 0; part < UF; ++part) { + // This is phase one of vectorizing PHIs. + Type *VecTy = + (VF == 1) ? PN->getType() : VectorType::get(PN->getType(), VF); + Entry[part] = PHINode::Create( + VecTy, 2, "vec.phi", &*LoopVectorBody->getFirstInsertionPt()); } + VectorLoopValueMap.initVector(P, Entry); + PV->push_back(P); + return; + } - // An induction variable will remain scalar if all users of the induction - // variable and induction variable update remain scalar. - auto *Latch = TheLoop->getLoopLatch(); - for (auto &Induction : *Legal->getInductionVars()) { - auto *Ind = Induction.first; - auto *IndUpdate = cast(Ind->getIncomingValueForBlock(Latch)); - - // Determine if all users of the induction variable are scalar after - // vectorization. - auto ScalarInd = all_of(Ind->users(), [&](User *U) -> bool { - auto *I = cast(U); - return I == IndUpdate || !TheLoop->contains(I) || Scalars[VF].count(I); - }); - if (!ScalarInd) - continue; + setDebugLocFromInst(Builder, P); + // Check for PHI nodes that are lowered to vector selects. + if (P->getParent() != OrigLoop->getHeader()) { + // We know that all PHIs in non-header blocks are converted into + // selects, so we don't have to worry about the insertion order and we + // can just use the builder. + // At this point we generate the predication tree. There may be + // duplications since this is a simple recursive scan, but future + // optimizations will clean it up. - // Determine if all users of the induction variable update instruction are - // scalar after vectorization. 
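// ---------------------------------------------------------------------------
// Illustrative sketch (editor addition, not part of this patch or of the LLVM
// sources): the mask rules used by createEdgeMask()/createBlockInMask() above,
// written over per-lane bool masks for a single if/then/else diamond.  The
// lane count and condition values are invented.  The rules being shown:
//   edge mask(Src -> Dst) = block-in mask(Src) AND branch condition,
//                           negated when Dst is the false successor;
//   block-in mask(Dst)    = OR over the masks of all incoming edges.
// ---------------------------------------------------------------------------
#include <array>
#include <cassert>

int main() {
  constexpr unsigned VF = 4;
  using Mask = std::array<bool, VF>;

  Mask HeaderMask;                              // loop header: all-one mask
  HeaderMask.fill(true);

  const Mask Cond = {true, false, true, false}; // vectorized branch condition

  Mask ThenMask, ElseMask, MergeMask;
  for (unsigned L = 0; L < VF; ++L) {
    ThenMask[L] = HeaderMask[L] && Cond[L];     // edge: header -> then
    ElseMask[L] = HeaderMask[L] && !Cond[L];    // edge: header -> else
    MergeMask[L] = ThenMask[L] || ElseMask[L];  // OR of incoming edge masks
  }

  // The then/else edges partition the header's lanes, so the merge block is
  // executed by exactly the lanes that executed the header.
  for (unsigned L = 0; L < VF; ++L)
    assert(MergeMask[L] == HeaderMask[L]);
  return 0;
}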
- auto ScalarIndUpdate = all_of(IndUpdate->users(), [&](User *U) -> bool { - auto *I = cast(U); - return I == Ind || !TheLoop->contains(I) || Scalars[VF].count(I); - }); - if (!ScalarIndUpdate) - continue; + unsigned NumIncoming = P->getNumIncomingValues(); - // The induction variable and its update instruction will remain scalar. - Scalars[VF].insert(Ind); - Scalars[VF].insert(IndUpdate); - } -} + // Generate a sequence of selects of the form: + // SELECT(Mask3, In3, + // SELECT(Mask2, In2, + // ( ...))) + VectorParts Entry(UF); + for (unsigned In = 0; In < NumIncoming; In++) { + VectorParts Cond = + createEdgeMask(P->getIncomingBlock(In), P->getParent()); + const VectorParts &In0 = getVectorValue(P->getIncomingValue(In)); -bool LoopVectorizationLegality::isScalarWithPredication(Instruction *I) { - if (!blockNeedsPredication(I->getParent())) - return false; - switch(I->getOpcode()) { - default: - break; - case Instruction::Store: - return !isMaskRequired(I); - case Instruction::UDiv: - case Instruction::SDiv: - case Instruction::SRem: - case Instruction::URem: - return mayDivideByZero(*I); + for (unsigned part = 0; part < UF; ++part) { + // We might have single edge PHIs (blocks) - use an identity + // 'select' for the first PHI operand. + if (In == 0) + Entry[part] = Builder.CreateSelect(Cond[part], In0[part], In0[part]); + else + // Select between the current value and the previous incoming edge + // based on the incoming mask. + Entry[part] = Builder.CreateSelect(Cond[part], In0[part], Entry[part], + "predphi"); + } + } + VectorLoopValueMap.initVector(P, Entry); + return; } - return false; -} - -bool LoopVectorizationLegality::memoryInstructionCanBeWidened(Instruction *I, - unsigned VF) { - // Get and ensure we have a valid memory instruction. - LoadInst *LI = dyn_cast(I); - StoreInst *SI = dyn_cast(I); - assert((LI || SI) && "Invalid memory instruction"); - auto *Ptr = getPointerOperand(I); + // This PHINode must be an induction variable. + // Make sure that we know about it. + assert(Legal->getInductionVars()->count(P) && "Not an induction variable"); - // In order to be widened, the pointer should be consecutive, first of all. - if (!isConsecutivePtr(Ptr)) - return false; + InductionDescriptor II = Legal->getInductionVars()->lookup(P); + const DataLayout &DL = OrigLoop->getHeader()->getModule()->getDataLayout(); - // If the instruction is a store located in a predicated block, it will be - // scalarized. - if (isScalarWithPredication(I)) - return false; + // FIXME: The newly created binary instructions should contain nsw/nuw flags, + // which can be found from the original scalar operations. + switch (II.getKind()) { + case InductionDescriptor::IK_NoInduction: + llvm_unreachable("Unknown induction"); + case InductionDescriptor::IK_IntInduction: + widenIntInduction(needsScalarInduction(P), P); // Used only by Unroller + return; + case InductionDescriptor::IK_PtrInduction: { + // Handle the pointer induction variable case. + assert(P->getType()->isPointerTy() && "Unexpected type."); + // This is the normalized GEP that starts counting at zero. + Value *PtrInd = Induction; + PtrInd = Builder.CreateSExtOrTrunc(PtrInd, II.getStep()->getType()); + // Determine the number of scalars we need to generate for each unroll + // iteration. If the instruction is uniform, we only need to generate the + // first lane. Otherwise, we generate all VF values. + unsigned Lanes = Cost->isUniformAfterVectorization(P, VF) ? 1 : VF; + // These are the scalar results. 
Notice that we don't generate vector GEPs + // because scalar GEPs result in better code. + ScalarParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Entry[Part].resize(VF); + for (unsigned Lane = 0; Lane < Lanes; ++Lane) { + Constant *Idx = ConstantInt::get(PtrInd->getType(), Lane + Part * VF); + Value *GlobalIdx = Builder.CreateAdd(PtrInd, Idx); + Value *SclrGep = II.transform(Builder, GlobalIdx, PSE.getSE(), DL); + SclrGep->setName("next.gep"); + Entry[Part][Lane] = SclrGep; + } + } + VectorLoopValueMap.initScalar(P, Entry); + return; + } + case InductionDescriptor::IK_FpInduction: { + assert(P->getType() == II.getStartValue()->getType() && + "Types must match"); + // Handle other induction variables that are now based on the + // canonical one. + assert(P != OldInduction && "Primary induction can be integer only"); - // If the instruction's allocated size doesn't equal it's type size, it - // requires padding and will be scalarized. - auto &DL = I->getModule()->getDataLayout(); - auto *ScalarTy = LI ? LI->getType() : SI->getValueOperand()->getType(); - if (hasIrregularType(ScalarTy, DL, VF)) - return false; + Value *V = Builder.CreateCast(Instruction::SIToFP, Induction, P->getType()); + V = II.transform(Builder, V, PSE.getSE(), DL); + V->setName("fp.offset.idx"); - return true; + // Now we have scalar op: %fp.offset.idx = StartVal +/- Induction*StepVal + + Value *Broadcasted = getBroadcastInstrs(V); + // After broadcasting the induction variable we need to make the vector + // consecutive by adding StepVal*0, StepVal*1, StepVal*2, etc. + Value *StepVal = cast(II.getStep())->getValue(); + VectorParts Entry(UF); + for (unsigned part = 0; part < UF; ++part) + Entry[part] = getStepVector(Broadcasted, VF * part, StepVal, + II.getInductionOpcode()); + VectorLoopValueMap.initVector(P, Entry); + return; + } + } } -void LoopVectorizationCostModel::collectLoopUniforms(unsigned VF) { +/// A helper function for checking whether an integer division-related +/// instruction may divide by zero (in which case it must be predicated if +/// executed conditionally in the scalar code). +/// TODO: It may be worthwhile to generalize and check isKnownNonZero(). +/// Non-zero divisors that are non compile-time constants will not be +/// converted into multiplication, so we will still end up scalarizing +/// the division, but can do so w/o predication. +static bool mayDivideByZero(Instruction &I) { + assert((I.getOpcode() == Instruction::UDiv || + I.getOpcode() == Instruction::SDiv || + I.getOpcode() == Instruction::URem || + I.getOpcode() == Instruction::SRem) && + "Unexpected instruction"); + Value *Divisor = I.getOperand(1); + auto *CInt = dyn_cast(Divisor); + return !CInt || CInt->isZero(); +} - // We should not collect Uniforms more than once per VF. Right now, - // this function is called from collectUniformsAndScalars(), which - // already does this check. Collecting Uniforms for VF=1 does not make any - // sense. +void InnerLoopVectorizer::vectorizeInstruction(Instruction &I) { + switch (I.getOpcode()) { + case Instruction::PHI: { + // Vectorize PHINodes. + PhiVector PV; // Records Reduction and FirstOrderRecurrence header Phis. + widenPHIInstruction(&I, UF, VF, &PV); + break; + } // End of PHI. 
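// ---------------------------------------------------------------------------
// Illustrative sketch (editor addition, not part of this patch or of the LLVM
// sources): the per-lane expansion performed in the IK_PtrInduction case
// above, assuming a unit-stride pointer induction.  Unroll part `Part`, lane
// `Lane` addresses element PtrInd + Part * VF + Lane, and only lane 0 is
// materialized when the pointer is uniform after vectorization.  The buffer
// and the VF/UF values are invented for the example.
// ---------------------------------------------------------------------------
#include <cassert>

int main() {
  constexpr unsigned VF = 4, UF = 2;
  int Data[VF * UF] = {0, 1, 2, 3, 4, 5, 6, 7};
  int *PtrInd = Data;                 // normalized pointer induction, step 1
  const bool UniformAfterVec = false; // if true, only lane 0 would be needed

  const unsigned Lanes = UniformAfterVec ? 1 : VF;
  int *Scalars[UF][VF] = {};
  for (unsigned Part = 0; Part < UF; ++Part)
    for (unsigned Lane = 0; Lane < Lanes; ++Lane)
      Scalars[Part][Lane] = PtrInd + (Part * VF + Lane); // one "next.gep" each

  // Each generated scalar pointer addresses the element its lane would have
  // touched in the original scalar loop.
  for (unsigned Part = 0; Part < UF; ++Part)
    for (unsigned Lane = 0; Lane < Lanes; ++Lane)
      assert(*Scalars[Part][Lane] == int(Part * VF + Lane));
  return 0;
}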
+ case Instruction::UDiv: + case Instruction::SDiv: + case Instruction::SRem: + case Instruction::URem: + case Instruction::Add: + case Instruction::FAdd: + case Instruction::Sub: + case Instruction::FSub: + case Instruction::Mul: + case Instruction::FMul: + case Instruction::FDiv: + case Instruction::FRem: + case Instruction::Shl: + case Instruction::LShr: + case Instruction::AShr: + case Instruction::And: + case Instruction::Or: + case Instruction::Xor: { + // Just widen binops. + auto *BinOp = cast(&I); + setDebugLocFromInst(Builder, BinOp); + const VectorParts &A = getVectorValue(BinOp->getOperand(0)); + const VectorParts &B = getVectorValue(BinOp->getOperand(1)); - assert(VF >= 2 && !Uniforms.count(VF) && - "This function should not be visited twice for the same VF"); + // Use this vector value for all users of the original instruction. + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Value *V = Builder.CreateBinOp(BinOp->getOpcode(), A[Part], B[Part]); - // Visit the list of Uniforms. If we'll not find any uniform value, we'll - // not analyze again. Uniforms.count(VF) will return 1. - Uniforms[VF].clear(); + if (BinaryOperator *VecOp = dyn_cast(V)) + VecOp->copyIRFlags(BinOp); - // We now know that the loop is vectorizable! - // Collect instructions inside the loop that will remain uniform after - // vectorization. + Entry[Part] = V; + } - // Global values, params and instructions outside of current loop are out of - // scope. - auto isOutOfScope = [&](Value *V) -> bool { - Instruction *I = dyn_cast(V); - return (!I || !TheLoop->contains(I)); - }; + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, BinOp); + break; + } + case Instruction::Select: { + // Widen selects. + // If the selector is loop invariant we can create a select + // instruction with a scalar condition. Otherwise, use vector-select. + auto *SE = PSE.getSE(); + bool InvariantCond = + SE->isLoopInvariant(PSE.getSCEV(I.getOperand(0)), OrigLoop); + setDebugLocFromInst(Builder, &I); + + // The condition can be loop invariant but still defined inside the + // loop. This means that we can't just use the original 'cond' value. + // We have to take the 'vectorized' value and pick the first lane. + // Instcombine will make this a no-op. + const VectorParts &Cond = getVectorValue(I.getOperand(0)); + const VectorParts &Op0 = getVectorValue(I.getOperand(1)); + const VectorParts &Op1 = getVectorValue(I.getOperand(2)); + + auto *ScalarCond = getScalarValue(I.getOperand(0), 0, 0); - SetVector Worklist; - BasicBlock *Latch = TheLoop->getLoopLatch(); + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Entry[Part] = Builder.CreateSelect( + InvariantCond ? ScalarCond : Cond[Part], Op0[Part], Op1[Part]); + } - // Start with the conditional branch. If the branch condition is an - // instruction contained in the loop that is only used by the branch, it is - // uniform. - auto *Cmp = dyn_cast(Latch->getTerminator()->getOperand(0)); - if (Cmp && TheLoop->contains(Cmp) && Cmp->hasOneUse()) { - Worklist.insert(Cmp); - DEBUG(dbgs() << "LV: Found uniform instruction: " << *Cmp << "\n"); + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, &I); + break; } - // Holds consecutive and consecutive-like pointers. Consecutive-like pointers - // are pointers that are treated like consecutive pointers during - // vectorization. The pointer operands of interleaved accesses are an - // example. 
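// ---------------------------------------------------------------------------
// Illustrative sketch (editor addition, not part of this patch or of the LLVM
// sources): the two ways a select is widened in the Select case above.  With a
// loop-invariant condition, every lane takes the same side, so a single scalar
// condition (lane 0 of its vectorized form) chooses between whole vectors;
// otherwise the select is applied lane by lane.  All values are invented.
// ---------------------------------------------------------------------------
#include <array>
#include <cassert>

int main() {
  constexpr unsigned VF = 4;
  using Vec = std::array<int, VF>;
  const Vec Op0 = {1, 2, 3, 4}, Op1 = {10, 20, 30, 40};
  Vec Out;

  // Case 1: invariant condition - effectively a broadcast, so lane 0 decides
  // for all lanes and a scalar select between whole vectors suffices.
  const std::array<bool, VF> InvCond = {true, true, true, true};
  const bool ScalarCond = InvCond[0];
  Out = ScalarCond ? Op0 : Op1;
  assert(Out == Op0);

  // Case 2: loop-varying condition - a per-lane vector select ("predphi").
  const std::array<bool, VF> Cond = {true, false, true, false};
  for (unsigned L = 0; L < VF; ++L)
    Out[L] = Cond[L] ? Op0[L] : Op1[L];
  assert((Out == Vec{1, 20, 3, 40}));
  return 0;
}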
- SmallSetVector ConsecutiveLikePtrs; - - // Holds pointer operands of instructions that are possibly non-uniform. - SmallPtrSet PossibleNonUniformPtrs; + case Instruction::ICmp: + case Instruction::FCmp: { + // Widen compares. Generate vector compares. + bool FCmp = (I.getOpcode() == Instruction::FCmp); + auto *Cmp = dyn_cast(&I); + setDebugLocFromInst(Builder, Cmp); + const VectorParts &A = getVectorValue(Cmp->getOperand(0)); + const VectorParts &B = getVectorValue(Cmp->getOperand(1)); + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + Value *C = nullptr; + if (FCmp) { + C = Builder.CreateFCmp(Cmp->getPredicate(), A[Part], B[Part]); + cast(C)->copyFastMathFlags(Cmp); + } else { + C = Builder.CreateICmp(Cmp->getPredicate(), A[Part], B[Part]); + } + Entry[Part] = C; + } - auto isUniformDecision = [&](Instruction *I, unsigned VF) { - InstWidening WideningDecision = getWideningDecision(I, VF); - assert(WideningDecision != CM_Unknown && - "Widening decision should be ready at this moment"); + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, &I); + break; + } - return (WideningDecision == CM_Widen || - WideningDecision == CM_Interleave); - }; - // Iterate over the instructions in the loop, and collect all - // consecutive-like pointer operands in ConsecutiveLikePtrs. If it's possible - // that a consecutive-like pointer operand will be scalarized, we collect it - // in PossibleNonUniformPtrs instead. We use two sets here because a single - // getelementptr instruction can be used by both vectorized and scalarized - // memory instructions. For example, if a loop loads and stores from the same - // location, but the store is conditional, the store will be scalarized, and - // the getelementptr won't remain uniform. - for (auto *BB : TheLoop->blocks()) - for (auto &I : *BB) { + case Instruction::Store: + case Instruction::Load: + vectorizeMemoryInstruction(&I); + break; + case Instruction::ZExt: + case Instruction::SExt: + case Instruction::FPToUI: + case Instruction::FPToSI: + case Instruction::FPExt: + case Instruction::PtrToInt: + case Instruction::IntToPtr: + case Instruction::SIToFP: + case Instruction::UIToFP: + case Instruction::Trunc: + case Instruction::FPTrunc: + case Instruction::BitCast: { + auto *CI = dyn_cast(&I); + setDebugLocFromInst(Builder, CI); - // If there's no pointer operand, there's nothing to do. - auto *Ptr = dyn_cast_or_null(getPointerOperand(&I)); - if (!Ptr) - continue; + /// Vectorize casts. + Type *DestTy = + (VF == 1) ? CI->getType() : VectorType::get(CI->getType(), VF); - // True if all users of Ptr are memory accesses that have Ptr as their - // pointer operand. - auto UsersAreMemAccesses = all_of(Ptr->users(), [&](User *U) -> bool { - return getPointerOperand(U) == Ptr; - }); + const VectorParts &A = getVectorValue(CI->getOperand(0)); + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) + Entry[Part] = Builder.CreateCast(CI->getOpcode(), A[Part], DestTy); + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, &I); + break; + } - // Ensure the memory instruction will not be scalarized or used by - // gather/scatter, making its pointer operand non-uniform. If the pointer - // operand is used by any instruction other than a memory access, we - // conservatively assume the pointer operand may be non-uniform. - if (!UsersAreMemAccesses || !isUniformDecision(&I, VF)) - PossibleNonUniformPtrs.insert(Ptr); + case Instruction::Call: { + // Ignore dbg intrinsics. 
+ if (isa(I)) + break; + setDebugLocFromInst(Builder, &I); + + Module *M = I.getParent()->getParent()->getParent(); + auto *CI = cast(&I); + + StringRef FnName = CI->getCalledFunction()->getName(); + Function *F = CI->getCalledFunction(); + Type *RetTy = ToVectorTy(CI->getType(), VF); + SmallVector Tys; + for (Value *ArgOperand : CI->arg_operands()) + Tys.push_back(ToVectorTy(ArgOperand->getType(), VF)); + + Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); + bool NeedToScalarize; // Redundant, needed for UseVectorIntrinsic. + unsigned CallCost = getVectorCallCost(CI, VF, *TTI, TLI, NeedToScalarize); + bool UseVectorIntrinsic = + ID && getVectorIntrinsicCost(CI, VF, *TTI, TLI) <= CallCost; + VectorParts Entry(UF); + for (unsigned Part = 0; Part < UF; ++Part) { + SmallVector Args; + for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) { + Value *Arg = CI->getArgOperand(i); + // Some intrinsics have a scalar argument - don't replace it with a + // vector. + if (!UseVectorIntrinsic || !hasVectorInstrinsicScalarOpd(ID, i)) { + const VectorParts &VectorArg = getVectorValue(CI->getArgOperand(i)); + Arg = VectorArg[Part]; + } + Args.push_back(Arg); + } - // If the memory instruction will be vectorized and its pointer operand - // is consecutive-like, or interleaving - the pointer operand should - // remain uniform. - else - ConsecutiveLikePtrs.insert(Ptr); - } + Function *VectorF; + if (UseVectorIntrinsic) { + // Use vector version of the intrinsic. + Type *TysForDecl[] = {CI->getType()}; + if (VF > 1) + TysForDecl[0] = VectorType::get(CI->getType()->getScalarType(), VF); + VectorF = Intrinsic::getDeclaration(M, ID, TysForDecl); + } else { + // Use vector version of the library call. + StringRef VFnName = TLI->getVectorizedFunction(FnName, VF); + assert(!VFnName.empty() && "Vector function name is empty."); + VectorF = M->getFunction(VFnName); + if (!VectorF) { + // Generate a declaration + FunctionType *FTy = FunctionType::get(RetTy, Tys, false); + VectorF = + Function::Create(FTy, Function::ExternalLinkage, VFnName, M); + VectorF->copyAttributesFrom(F); + } + } + assert(VectorF && "Can't create vector function."); - // Add to the Worklist all consecutive and consecutive-like pointers that - // aren't also identified as possibly non-uniform. - for (auto *V : ConsecutiveLikePtrs) - if (!PossibleNonUniformPtrs.count(V)) { - DEBUG(dbgs() << "LV: Found uniform instruction: " << *V << "\n"); - Worklist.insert(V); - } + SmallVector OpBundles; + CI->getOperandBundlesAsDefs(OpBundles); + CallInst *V = Builder.CreateCall(VectorF, Args, OpBundles); - // Expand Worklist in topological order: whenever a new instruction - // is added , its users should be either already inside Worklist, or - // out of scope. It ensures a uniform instruction will only be used - // by uniform instructions or out of scope instructions. - unsigned idx = 0; - while (idx != Worklist.size()) { - Instruction *I = Worklist[idx++]; + if (isa(V)) + V->copyFastMathFlags(CI); - for (auto OV : I->operand_values()) { - if (isOutOfScope(OV)) - continue; - auto *OI = cast(OV); - if (all_of(OI->users(), [&](User *U) -> bool { - return isOutOfScope(U) || Worklist.count(cast(U)); - })) { - Worklist.insert(OI); - DEBUG(dbgs() << "LV: Found uniform instruction: " << *OI << "\n"); - } + Entry[Part] = V; } + + VectorLoopValueMap.initVector(&I, Entry); + addMetadata(Entry, &I); + break; } - // Returns true if Ptr is the pointer operand of a memory access instruction - // I, and I is known to not require scalarization. 
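// ---------------------------------------------------------------------------
// Illustrative sketch (editor addition, not part of this patch or of the LLVM
// sources): the cost comparison that decides how the Call case above widens a
// call, reduced to a free-standing helper.  A vector *intrinsic* is used only
// when one exists for the call (ID != 0) and its estimated cost does not
// exceed that of a vectorized library call; otherwise the library call is
// emitted (the full vectorizer may also fall back to scalarization).  The cost
// numbers below are invented.
// ---------------------------------------------------------------------------
#include <cassert>

// Mirrors "UseVectorIntrinsic = ID && IntrinsicCost <= CallCost" above.
static bool useVectorIntrinsic(unsigned IntrinsicID, unsigned IntrinsicCost,
                               unsigned LibCallCost) {
  return IntrinsicID != 0 && IntrinsicCost <= LibCallCost;
}

int main() {
  assert(useVectorIntrinsic(/*ID=*/42, /*IntrinsicCost=*/2, /*LibCallCost=*/6));
  assert(!useVectorIntrinsic(/*ID=*/0, /*IntrinsicCost=*/2, /*LibCallCost=*/6));
  assert(!useVectorIntrinsic(/*ID=*/42, /*IntrinsicCost=*/8, /*LibCallCost=*/6));
  return 0;
}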
- auto isVectorizedMemAccessUse = [&](Instruction *I, Value *Ptr) -> bool { - return getPointerOperand(I) == Ptr && isUniformDecision(I, VF); - }; + default: + // All other instructions are scalarized. + DEBUG(dbgs() << "LV: Found an unhandled instruction: " << I); + llvm_unreachable("Unhandled instruction!"); + } // end of switch. +} - // For an instruction to be added into Worklist above, all its users inside - // the loop should also be in Worklist. However, this condition cannot be - // true for phi nodes that form a cyclic dependence. We must process phi - // nodes separately. An induction variable will remain uniform if all users - // of the induction variable and induction variable update remain uniform. - // The code below handles both pointer and non-pointer induction variables. - for (auto &Induction : *Legal->getInductionVars()) { - auto *Ind = Induction.first; - auto *IndUpdate = cast(Ind->getIncomingValueForBlock(Latch)); - - // Determine if all users of the induction variable are uniform after - // vectorization. - auto UniformInd = all_of(Ind->users(), [&](User *U) -> bool { - auto *I = cast(U); - return I == IndUpdate || !TheLoop->contains(I) || Worklist.count(I) || - isVectorizedMemAccessUse(I, Ind); - }); - if (!UniformInd) - continue; - - // Determine if all users of the induction variable update instruction are - // uniform after vectorization. - auto UniformIndUpdate = all_of(IndUpdate->users(), [&](User *U) -> bool { - auto *I = cast(U); - return I == Ind || !TheLoop->contains(I) || Worklist.count(I) || - isVectorizedMemAccessUse(I, IndUpdate); - }); - if (!UniformIndUpdate) - continue; +void InnerLoopVectorizer::updateAnalysis() { + // Forget the original basic block. + PSE.getSE()->forgetLoop(OrigLoop); - // The induction variable and its update instruction will remain uniform. - Worklist.insert(Ind); - Worklist.insert(IndUpdate); - DEBUG(dbgs() << "LV: Found uniform instruction: " << *Ind << "\n"); - DEBUG(dbgs() << "LV: Found uniform instruction: " << *IndUpdate << "\n"); - } + // Update the dominator tree information. + assert(DT->properlyDominates(LoopBypassBlocks.front(), LoopExitBlock) && + "Entry does not dominate exit."); - Uniforms[VF].insert(Worklist.begin(), Worklist.end()); + if (!DT->getNode(LoopVectorBody)) // For InnerLoopUnroller. + DT->addNewBlock(LoopVectorBody, LoopVectorPreHeader); + auto *LoopVectorLatch = LI->getLoopFor(LoopVectorBody)->getLoopLatch(); + DT->addNewBlock(LoopMiddleBlock, LoopVectorLatch); + DT->addNewBlock(LoopScalarPreHeader, LoopBypassBlocks[0]); + DT->changeImmediateDominator(LoopScalarBody, LoopScalarPreHeader); + DT->changeImmediateDominator(LoopExitBlock, LoopBypassBlocks[0]); + DEBUG(DT->verifyDomTree()); } -bool LoopVectorizationLegality::canVectorizeMemory() { - LAI = &(*GetLAA)(*TheLoop); - InterleaveInfo.setLAI(LAI); - const OptimizationRemarkAnalysis *LAR = LAI->getReport(); - if (LAR) { - OptimizationRemarkAnalysis VR(Hints->vectorizeAnalysisPassName(), - "loop not vectorized: ", *LAR); - ORE->emit(VR); - } - if (!LAI->canVectorizeMemory()) - return false; - - if (LAI->hasStoreToLoopInvariantAddress()) { - ORE->emit(createMissedAnalysis("CantVectorizeStoreToLoopInvariantAddress") - << "write to a loop invariant address could not be vectorized"); - DEBUG(dbgs() << "LV: We don't allow storing to uniform addresses\n"); - return false; +/// \brief Check whether it is safe to if-convert this phi node. +/// +/// Phi nodes with constant expressions that can trap are not safe to if +/// convert. 
+static bool canIfConvertPHINodes(BasicBlock *BB) { + for (Instruction &I : *BB) { + auto *Phi = dyn_cast(&I); + if (!Phi) + return true; + for (Value *V : Phi->incoming_values()) + if (auto *C = dyn_cast(V)) + if (C->canTrap()) + return false; } - - Requirements->addRuntimePointerChecks(LAI->getNumRuntimePointerChecks()); - PSE.addPredicate(LAI->getPSE().getUnionPredicate()); - return true; } -bool LoopVectorizationLegality::isInductionVariable(const Value *V) { - Value *In0 = const_cast(V); - PHINode *PN = dyn_cast_or_null(In0); - if (!PN) +bool LoopVectorizationLegality::canVectorizeWithIfConvert() { + if (!EnableIfConversion) { + ORE->emit(createMissedAnalysis("IfConversionDisabled") + << "if-conversion is disabled"); return false; + } - return Inductions.count(PN); -} + assert(TheLoop->getNumBlocks() > 1 && "Single block loops are vectorizable"); -bool LoopVectorizationLegality::isFirstOrderRecurrence(const PHINode *Phi) { - return FirstOrderRecurrences.count(Phi); -} + // A list of pointers that we can safely read and write to. + SmallPtrSet SafePointes; -bool LoopVectorizationLegality::blockNeedsPredication(BasicBlock *BB) { - return LoopAccessInfo::blockNeedsPredication(BB, TheLoop, DT); -} + // Collect safe addresses. + for (BasicBlock *BB : TheLoop->blocks()) { + if (blockNeedsPredication(BB)) + continue; -bool LoopVectorizationLegality::blockCanBePredicated( - BasicBlock *BB, SmallPtrSetImpl &SafePtrs) { - const bool IsAnnotatedParallel = TheLoop->isAnnotatedParallel(); + for (Instruction &I : *BB) + if (auto *Ptr = getPointerOperand(&I)) + SafePointes.insert(Ptr); + } - for (Instruction &I : *BB) { - // Check that we don't have a constant expression that can trap as operand. - for (Value *Operand : I.operands()) { - if (auto *C = dyn_cast(Operand)) - if (C->canTrap()) - return false; - } - // We might be able to hoist the load. - if (I.mayReadFromMemory()) { - auto *LI = dyn_cast(&I); - if (!LI) - return false; - if (!SafePtrs.count(LI->getPointerOperand())) { - if (isLegalMaskedLoad(LI->getType(), LI->getPointerOperand()) || - isLegalMaskedGather(LI->getType())) { - MaskedOp.insert(LI); - continue; - } - // !llvm.mem.parallel_loop_access implies if-conversion safety. - if (IsAnnotatedParallel) - continue; - return false; - } + // Collect the blocks that need predication. + BasicBlock *Header = TheLoop->getHeader(); + for (BasicBlock *BB : TheLoop->blocks()) { + // We don't support switch statements inside loops. + if (!isa(BB->getTerminator())) { + ORE->emit(createMissedAnalysis("LoopContainsSwitch", BB->getTerminator()) + << "loop contains a switch statement"); + return false; } - if (I.mayWriteToMemory()) { - auto *SI = dyn_cast(&I); - // We only support predication of stores in basic blocks with one - // predecessor. - if (!SI) + // We must be able to predicate all blocks that need to be predicated. + if (blockNeedsPredication(BB)) { + if (!blockCanBePredicated(BB, SafePointes)) { + ORE->emit(createMissedAnalysis("NoCFGForSelect", BB->getTerminator()) + << "control flow cannot be substituted for a select"); return false; - - // Build a masked store if it is legal for the target. 
- if (isLegalMaskedStore(SI->getValueOperand()->getType(), - SI->getPointerOperand()) || - isLegalMaskedScatter(SI->getValueOperand()->getType())) { - MaskedOp.insert(SI); - continue; } - - bool isSafePtr = (SafePtrs.count(SI->getPointerOperand()) != 0); - bool isSinglePredecessor = SI->getParent()->getSinglePredecessor(); - - if (++NumPredStores > NumberOfStoresToPredicate || !isSafePtr || - !isSinglePredecessor) - return false; - } - if (I.mayThrow()) + } else if (BB != Header && !canIfConvertPHINodes(BB)) { + ORE->emit(createMissedAnalysis("NoCFGForSelect", BB->getTerminator()) + << "control flow cannot be substituted for a select"); return false; + } } + // We can if-convert this loop. return true; } -void InterleavedAccessInfo::collectConstStrideAccesses( - MapVector &AccessStrideInfo, - const ValueToValueMap &Strides) { +bool LoopVectorizationLegality::canVectorize() { + // We must have a loop in canonical form. Loops with indirectbr in them cannot + // be canonicalized. + if (!TheLoop->getLoopPreheader()) { + ORE->emit(createMissedAnalysis("CFGNotUnderstood") + << "loop control flow is not understood by vectorizer"); + return false; + } - auto &DL = TheLoop->getHeader()->getModule()->getDataLayout(); + // FIXME: The code is currently dead, since the loop gets sent to + // LoopVectorizationLegality is already an innermost loop. + // + // We can only vectorize innermost loops. + if (!TheLoop->empty()) { + ORE->emit(createMissedAnalysis("NotInnermostLoop") + << "loop is not the innermost loop"); + return false; + } - // Since it's desired that the load/store instructions be maintained in - // "program order" for the interleaved access analysis, we have to visit the - // blocks in the loop in reverse postorder (i.e., in a topological order). - // Such an ordering will ensure that any load/store that may be executed - // before a second load/store will precede the second load/store in - // AccessStrideInfo. - LoopBlocksDFS DFS(TheLoop); - DFS.perform(LI); - for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) - for (auto &I : *BB) { - auto *LI = dyn_cast(&I); - auto *SI = dyn_cast(&I); - if (!LI && !SI) - continue; + // We must have a single backedge. + if (TheLoop->getNumBackEdges() != 1) { + ORE->emit(createMissedAnalysis("CFGNotUnderstood") + << "loop control flow is not understood by vectorizer"); + return false; + } - Value *Ptr = getPointerOperand(&I); - // We don't check wrapping here because we don't know yet if Ptr will be - // part of a full group or a group with gaps. Checking wrapping for all - // pointers (even those that end up in groups with no gaps) will be overly - // conservative. For full groups, wrapping should be ok since if we would - // wrap around the address space we would do a memory access at nullptr - // even without the transformation. The wrapping checks are therefore - // deferred until after we've formed the interleaved groups. - int64_t Stride = getPtrStride(PSE, Ptr, TheLoop, Strides, - /*Assume=*/true, /*ShouldCheckWrap=*/false); + // We must have a single exiting block. + if (!TheLoop->getExitingBlock()) { + ORE->emit(createMissedAnalysis("CFGNotUnderstood") + << "loop control flow is not understood by vectorizer"); + return false; + } - const SCEV *Scev = replaceSymbolicStrideSCEV(PSE, Strides, Ptr); - PointerType *PtrTy = dyn_cast(Ptr->getType()); - uint64_t Size = DL.getTypeAllocSize(PtrTy->getElementType()); + // We only handle bottom-tested loops, i.e. loop in which the condition is + // checked at the end of each iteration. 
With that we can assume that all + // instructions in the loop are executed the same number of times. + if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) { + ORE->emit(createMissedAnalysis("CFGNotUnderstood") + << "loop control flow is not understood by vectorizer"); + return false; + } - // An alignment of 0 means target ABI alignment. - unsigned Align = getMemInstAlignment(&I); - if (!Align) - Align = DL.getABITypeAlignment(PtrTy->getElementType()); + // We need to have a loop header. + DEBUG(dbgs() << "LV: Found a loop: " << TheLoop->getHeader()->getName() + << '\n'); - AccessStrideInfo[&I] = StrideDescriptor(Stride, Scev, Size, Align); - } -} + // Check if we can if-convert non-single-bb loops. + unsigned NumBlocks = TheLoop->getNumBlocks(); + if (NumBlocks != 1 && !canVectorizeWithIfConvert()) { + DEBUG(dbgs() << "LV: Can't if-convert the loop.\n"); + return false; + } -// Analyze interleaved accesses and collect them into interleaved load and -// store groups. -// -// When generating code for an interleaved load group, we effectively hoist all -// loads in the group to the location of the first load in program order. When -// generating code for an interleaved store group, we sink all stores to the -// location of the last store. This code motion can change the order of load -// and store instructions and may break dependences. -// -// The code generation strategy mentioned above ensures that we won't violate -// any write-after-read (WAR) dependences. -// -// E.g., for the WAR dependence: a = A[i]; // (1) -// A[i] = b; // (2) -// -// The store group of (2) is always inserted at or below (2), and the load -// group of (1) is always inserted at or above (1). Thus, the instructions will -// never be reordered. All other dependences are checked to ensure the -// correctness of the instruction reordering. -// -// The algorithm visits all memory accesses in the loop in bottom-up program -// order. Program order is established by traversing the blocks in the loop in -// reverse postorder when collecting the accesses. -// -// We visit the memory accesses in bottom-up order because it can simplify the -// construction of store groups in the presence of write-after-write (WAW) -// dependences. -// -// E.g., for the WAW dependence: A[i] = a; // (1) -// A[i] = b; // (2) -// A[i + 1] = c; // (3) -// -// We will first create a store group with (3) and (2). (1) can't be added to -// this group because it and (2) are dependent. However, (1) can be grouped -// with other accesses that may precede it in program order. Note that a -// bottom-up order does not imply that WAW dependences should not be checked. -void InterleavedAccessInfo::analyzeInterleaving( - const ValueToValueMap &Strides) { - DEBUG(dbgs() << "LV: Analyzing interleaved accesses...\n"); + // ScalarEvolution needs to be able to find the exit count. + const SCEV *ExitCount = PSE.getBackedgeTakenCount(); + if (ExitCount == PSE.getSE()->getCouldNotCompute()) { + ORE->emit(createMissedAnalysis("CantComputeNumberOfIterations") + << "could not determine number of loop iterations"); + DEBUG(dbgs() << "LV: SCEV could not compute the loop exit count.\n"); + return false; + } - // Holds all accesses with a constant stride. - MapVector AccessStrideInfo; - collectConstStrideAccesses(AccessStrideInfo, Strides); + // Check if we can vectorize the instructions and CFG in this loop. 
+ if (!canVectorizeInstrs()) { + DEBUG(dbgs() << "LV: Can't vectorize the instructions or CFG\n"); + return false; + } - if (AccessStrideInfo.empty()) - return; + // Go over each instruction and look at memory deps. + if (!canVectorizeMemory()) { + DEBUG(dbgs() << "LV: Can't vectorize due to memory conflicts\n"); + return false; + } - // Collect the dependences in the loop. - collectDependences(); + DEBUG(dbgs() << "LV: We can vectorize this loop" + << (LAI->getRuntimePointerChecking()->Need + ? " (with a runtime bound check)" + : "") + << "!\n"); - // Holds all interleaved store groups temporarily. - SmallSetVector StoreGroups; - // Holds all interleaved load groups temporarily. - SmallSetVector LoadGroups; + bool UseInterleaved = TTI->enableInterleavedAccessVectorization(); - // Search in bottom-up program order for pairs of accesses (A and B) that can - // form interleaved load or store groups. In the algorithm below, access A - // precedes access B in program order. We initialize a group for B in the - // outer loop of the algorithm, and then in the inner loop, we attempt to - // insert each A into B's group if: - // - // 1. A and B have the same stride, - // 2. A and B have the same memory object size, and - // 3. A belongs in B's group according to its distance from B. - // - // Special care is taken to ensure group formation will not break any - // dependences. - for (auto BI = AccessStrideInfo.rbegin(), E = AccessStrideInfo.rend(); - BI != E; ++BI) { - Instruction *B = BI->first; - StrideDescriptor DesB = BI->second; + // If an override option has been passed in for interleaved accesses, use it. + if (EnableInterleavedMemAccesses.getNumOccurrences() > 0) + UseInterleaved = EnableInterleavedMemAccesses; - // Initialize a group for B if it has an allowable stride. Even if we don't - // create a group for B, we continue with the bottom-up algorithm to ensure - // we don't break any of B's dependences. - InterleaveGroup *Group = nullptr; - if (isStrided(DesB.Stride)) { - Group = getInterleaveGroup(B); - if (!Group) { - DEBUG(dbgs() << "LV: Creating an interleave group with:" << *B << '\n'); - Group = createInterleaveGroup(B, DesB.Stride, DesB.Align); - } - if (B->mayWriteToMemory()) - StoreGroups.insert(Group); - else - LoadGroups.insert(Group); - } + // Analyze interleaved memory accesses. + if (UseInterleaved) + InterleaveInfo.analyzeInterleaving(*getSymbolicStrides()); - for (auto AI = std::next(BI); AI != E; ++AI) { - Instruction *A = AI->first; - StrideDescriptor DesA = AI->second; + unsigned SCEVThreshold = VectorizeSCEVCheckThreshold; + if (Hints->getForce() == LoopVectorizeHints::FK_Enabled) + SCEVThreshold = PragmaVectorizeSCEVCheckThreshold; - // Our code motion strategy implies that we can't have dependences - // between accesses in an interleaved group and other accesses located - // between the first and last member of the group. Note that this also - // means that a group can't have more than one member at a given offset. - // The accesses in a group can have dependences with other accesses, but - // we must ensure we don't extend the boundaries of the group such that - // we encompass those dependent accesses. - // - // For example, assume we have the sequence of accesses shown below in a - // stride-2 loop: - // - // (1, 2) is a group | A[i] = a; // (1) - // | A[i-1] = b; // (2) | - // A[i-3] = c; // (3) - // A[i] = d; // (4) | (2, 4) is not a group - // - // Because accesses (2) and (3) are dependent, we can group (2) with (1) - // but not with (4). 
If we did, the dependent access (3) would be within - // the boundaries of the (2, 4) group. - if (!canReorderMemAccessesForInterleavedGroups(&*AI, &*BI)) { + if (PSE.getUnionPredicate().getComplexity() > SCEVThreshold) { + ORE->emit(createMissedAnalysis("TooManySCEVRunTimeChecks") + << "Too many SCEV assumptions need to be made and checked " + << "at runtime"); + DEBUG(dbgs() << "LV: Too many SCEV checks needed.\n"); + return false; + } - // If a dependence exists and A is already in a group, we know that A - // must be a store since A precedes B and WAR dependences are allowed. - // Thus, A would be sunk below B. We release A's group to prevent this - // illegal code motion. A will then be free to form another group with - // instructions that precede it. - if (isInterleaved(A)) { - InterleaveGroup *StoreGroup = getInterleaveGroup(A); - StoreGroups.remove(StoreGroup); - releaseGroup(StoreGroup); - } + // Okay! We can vectorize. At this point we don't have any other mem analysis + // which may limit our maximum vectorization factor, so just return true with + // no restrictions. + return true; +} - // If a dependence exists and A is not already in a group (or it was - // and we just released it), B might be hoisted above A (if B is a - // load) or another store might be sunk below A (if B is a store). In - // either case, we can't add additional instructions to B's group. B - // will only form a group with instructions that it precedes. - break; - } +static Type *convertPointerToIntegerType(const DataLayout &DL, Type *Ty) { + if (Ty->isPointerTy()) + return DL.getIntPtrType(Ty); - // At this point, we've checked for illegal code motion. If either A or B - // isn't strided, there's nothing left to do. - if (!isStrided(DesA.Stride) || !isStrided(DesB.Stride)) - continue; + // It is possible that char's or short's overflow when we ask for the loop's + // trip count, work around this by changing the type size. + if (Ty->getScalarSizeInBits() < 32) + return Type::getInt32Ty(Ty->getContext()); - // Ignore A if it's already in a group or isn't the same kind of memory - // operation as B. - if (isInterleaved(A) || A->mayReadFromMemory() != B->mayReadFromMemory()) - continue; + return Ty; +} - // Check rules 1 and 2. Ignore A if its stride or size is different from - // that of B. - if (DesA.Stride != DesB.Stride || DesA.Size != DesB.Size) - continue; +static Type *getWiderType(const DataLayout &DL, Type *Ty0, Type *Ty1) { + Ty0 = convertPointerToIntegerType(DL, Ty0); + Ty1 = convertPointerToIntegerType(DL, Ty1); + if (Ty0->getScalarSizeInBits() > Ty1->getScalarSizeInBits()) + return Ty0; + return Ty1; +} - // Calculate the distance from A to B. - const SCEVConstant *DistToB = dyn_cast( - PSE.getSE()->getMinusSCEV(DesA.Scev, DesB.Scev)); - if (!DistToB) - continue; - int64_t DistanceToB = DistToB->getAPInt().getSExtValue(); +/// \brief Check that the instruction has outside loop users and is not an +/// identified reduction variable. +static bool hasOutsideLoopUser(const Loop *TheLoop, Instruction *Inst, + SmallPtrSetImpl &AllowedExit) { + // Reduction and Induction instructions are allowed to have exit users. All + // other instructions must not have external users. + if (!AllowedExit.count(Inst)) + // Check that all of the users of the loop are inside the BB. + for (User *U : Inst->users()) { + Instruction *UI = cast(U); + // This user may be a reduction exit value. 
+ if (!TheLoop->contains(UI)) { + DEBUG(dbgs() << "LV: Found an outside user for : " << *UI << '\n'); + return true; + } + } + return false; +} - // Check rule 3. Ignore A if its distance to B is not a multiple of the - // size. - if (DistanceToB % static_cast(DesB.Size)) - continue; +void LoopVectorizationLegality::addInductionPhi( + PHINode *Phi, const InductionDescriptor &ID, + SmallPtrSetImpl &AllowedExit) { + Inductions[Phi] = ID; + Type *PhiTy = Phi->getType(); + const DataLayout &DL = Phi->getModule()->getDataLayout(); - // Ignore A if either A or B is in a predicated block. Although we - // currently prevent group formation for predicated accesses, we may be - // able to relax this limitation in the future once we handle more - // complicated blocks. - if (isPredicated(A->getParent()) || isPredicated(B->getParent())) - continue; + // Get the widest type. + if (!PhiTy->isFloatingPointTy()) { + if (!WidestIndTy) + WidestIndTy = convertPointerToIntegerType(DL, PhiTy); + else + WidestIndTy = getWiderType(DL, PhiTy, WidestIndTy); + } - // The index of A is the index of B plus A's distance to B in multiples - // of the size. - int IndexA = - Group->getIndex(B) + DistanceToB / static_cast(DesB.Size); + // Int inductions are special because we only allow one IV. + if (ID.getKind() == InductionDescriptor::IK_IntInduction && + ID.getConstIntStepValue() && + ID.getConstIntStepValue()->isOne() && + isa(ID.getStartValue()) && + cast(ID.getStartValue())->isNullValue()) { - // Try to insert A into B's group. - if (Group->insertMember(A, IndexA, DesA.Align)) { - DEBUG(dbgs() << "LV: Inserted:" << *A << '\n' - << " into the interleave group with" << *B << '\n'); - InterleaveGroupMap[A] = Group; - - // Set the first load in program order as the insert position. - if (A->mayReadFromMemory()) - Group->setInsertPos(A); - } - } // Iteration over A accesses. - } // Iteration over B accesses. - - // Remove interleaved store groups with gaps. - for (InterleaveGroup *Group : StoreGroups) - if (Group->getNumMembers() != Group->getFactor()) - releaseGroup(Group); - - // Remove interleaved groups with gaps (currently only loads) whose memory - // accesses may wrap around. We have to revisit the getPtrStride analysis, - // this time with ShouldCheckWrap=true, since collectConstStrideAccesses does - // not check wrapping (see documentation there). - // FORNOW we use Assume=false; - // TODO: Change to Assume=true but making sure we don't exceed the threshold - // of runtime SCEV assumptions checks (thereby potentially failing to - // vectorize altogether). - // Additional optional optimizations: - // TODO: If we are peeling the loop and we know that the first pointer doesn't - // wrap then we can deduce that all pointers in the group don't wrap. - // This means that we can forcefully peel the loop in order to only have to - // check the first pointer for no-wrap. When we'll change to use Assume=true - // we'll only need at most one runtime check per interleaved group. - // - for (InterleaveGroup *Group : LoadGroups) { + // Use the phi node with the widest type as induction. Use the last + // one if there are multiple (no good reason for doing this other + // than it is expedient). We've checked that it begins at zero and + // steps by one, so this is a canonical induction variable. + if (!PrimaryInduction || PhiTy == WidestIndTy) + PrimaryInduction = Phi; + } - // Case 1: A full group. 
Can Skip the checks; For full groups, if the wide - // load would wrap around the address space we would do a memory access at - // nullptr even without the transformation. - if (Group->getNumMembers() == Group->getFactor()) - continue; + // Both the PHI node itself, and the "post-increment" value feeding + // back into the PHI node may have external users. + AllowedExit.insert(Phi); + AllowedExit.insert(Phi->getIncomingValueForBlock(TheLoop->getLoopLatch())); - // Case 2: If first and last members of the group don't wrap this implies - // that all the pointers in the group don't wrap. - // So we check only group member 0 (which is always guaranteed to exist), - // and group member Factor - 1; If the latter doesn't exist we rely on - // peeling (if it is a non-reveresed accsess -- see Case 3). - Value *FirstMemberPtr = getPointerOperand(Group->getMember(0)); - if (!getPtrStride(PSE, FirstMemberPtr, TheLoop, Strides, /*Assume=*/false, - /*ShouldCheckWrap=*/true)) { - DEBUG(dbgs() << "LV: Invalidate candidate interleaved group due to " - "first group member potentially pointer-wrapping.\n"); - releaseGroup(Group); - continue; - } - Instruction *LastMember = Group->getMember(Group->getFactor() - 1); - if (LastMember) { - Value *LastMemberPtr = getPointerOperand(LastMember); - if (!getPtrStride(PSE, LastMemberPtr, TheLoop, Strides, /*Assume=*/false, - /*ShouldCheckWrap=*/true)) { - DEBUG(dbgs() << "LV: Invalidate candidate interleaved group due to " - "last group member potentially pointer-wrapping.\n"); - releaseGroup(Group); - } - } else { - // Case 3: A non-reversed interleaved load group with gaps: We need - // to execute at least one scalar epilogue iteration. This will ensure - // we don't speculatively access memory out-of-bounds. We only need - // to look for a member at index factor - 1, since every group must have - // a member at index zero. - if (Group->isReverse()) { - releaseGroup(Group); - continue; - } - DEBUG(dbgs() << "LV: Interleaved group requires epilogue iteration.\n"); - RequiresScalarEpilogue = true; - } - } + DEBUG(dbgs() << "LV: Found an induction variable.\n"); + return; } -LoopVectorizationCostModel::VectorizationFactor -LoopVectorizationCostModel::selectVectorizationFactor(bool OptForSize) { - // Width 1 means no vectorize - VectorizationFactor Factor = {1U, 0U}; - if (OptForSize && Legal->getRuntimePointerChecking()->Need) { - ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize") - << "runtime pointer checks needed. Enable vectorization of this " - "loop with '#pragma clang loop vectorize(enable)' when " - "compiling with -Os/-Oz"); - DEBUG(dbgs() - << "LV: Aborting. Runtime ptr check is required with -Os/-Oz.\n"); - return Factor; - } +bool LoopVectorizationLegality::canVectorizeInstrs() { + BasicBlock *Header = TheLoop->getHeader(); - if (!EnableCondStoresVectorization && Legal->getNumPredStores()) { - ORE->emit(createMissedAnalysis("ConditionalStore") - << "store that is conditionally executed prevents vectorization"); - DEBUG(dbgs() << "LV: No vectorization. There are conditional stores.\n"); - return Factor; - } + // Look for the attribute signaling the absence of NaNs. 
+ Function &F = *Header->getParent(); + HasFunNoNaNAttr = + F.getFnAttribute("no-nans-fp-math").getValueAsString() == "true"; - MinBWs = computeMinimumValueSizes(TheLoop->getBlocks(), *DB, &TTI); - unsigned SmallestType, WidestType; - std::tie(SmallestType, WidestType) = getSmallestAndWidestTypes(); - unsigned WidestRegister = TTI.getRegisterBitWidth(true); - unsigned MaxSafeDepDist = -1U; + // For each block in the loop. + for (BasicBlock *BB : TheLoop->blocks()) { + // Scan the instructions in the block and look for hazards. + for (Instruction &I : *BB) { + if (auto *Phi = dyn_cast(&I)) { + Type *PhiTy = Phi->getType(); + // Check that this PHI type is allowed. + if (!PhiTy->isIntegerTy() && !PhiTy->isFloatingPointTy() && + !PhiTy->isPointerTy()) { + ORE->emit(createMissedAnalysis("CFGNotUnderstood", Phi) + << "loop control flow is not understood by vectorizer"); + DEBUG(dbgs() << "LV: Found an non-int non-pointer PHI.\n"); + return false; + } - // Get the maximum safe dependence distance in bits computed by LAA. If the - // loop contains any interleaved accesses, we divide the dependence distance - // by the maximum interleave factor of all interleaved groups. Note that - // although the division ensures correctness, this is a fairly conservative - // computation because the maximum distance computed by LAA may not involve - // any of the interleaved accesses. - if (Legal->getMaxSafeDepDistBytes() != -1U) - MaxSafeDepDist = - Legal->getMaxSafeDepDistBytes() * 8 / Legal->getMaxInterleaveFactor(); + // If this PHINode is not in the header block, then we know that we + // can convert it to select during if-conversion. No need to check if + // the PHIs in this block are induction or reduction variables. + if (BB != Header) { + // Check that this instruction has no outside users or is an + // identified reduction value with an outside user. + if (!hasOutsideLoopUser(TheLoop, Phi, AllowedExit)) + continue; + ORE->emit(createMissedAnalysis("NeitherInductionNorReduction", Phi) + << "value could not be identified as " + "an induction or reduction variable"); + return false; + } - WidestRegister = - ((WidestRegister < MaxSafeDepDist) ? WidestRegister : MaxSafeDepDist); - unsigned MaxVectorSize = WidestRegister / WidestType; + // We only allow if-converted PHIs with exactly two incoming values. 
+ if (Phi->getNumIncomingValues() != 2) { + ORE->emit(createMissedAnalysis("CFGNotUnderstood", Phi) + << "control flow not understood by vectorizer"); + DEBUG(dbgs() << "LV: Found an invalid PHI.\n"); + return false; + } - DEBUG(dbgs() << "LV: The Smallest and Widest types: " << SmallestType << " / " - << WidestType << " bits.\n"); - DEBUG(dbgs() << "LV: The Widest register is: " << WidestRegister - << " bits.\n"); + RecurrenceDescriptor RedDes; + if (RecurrenceDescriptor::isReductionPHI(Phi, TheLoop, RedDes)) { + if (RedDes.hasUnsafeAlgebra()) + Requirements->addUnsafeAlgebraInst(RedDes.getUnsafeAlgebraInst()); + AllowedExit.insert(RedDes.getLoopExitInstr()); + Reductions[Phi] = RedDes; + continue; + } - if (MaxVectorSize == 0) { - DEBUG(dbgs() << "LV: The target has no vector registers.\n"); - MaxVectorSize = 1; - } + InductionDescriptor ID; + if (InductionDescriptor::isInductionPHI(Phi, TheLoop, PSE, ID)) { + addInductionPhi(Phi, ID, AllowedExit); + if (ID.hasUnsafeAlgebra() && !HasFunNoNaNAttr) + Requirements->addUnsafeAlgebraInst(ID.getUnsafeAlgebraInst()); + continue; + } - assert(MaxVectorSize <= 64 && "Did not expect to pack so many elements" - " into one vector!"); + if (RecurrenceDescriptor::isFirstOrderRecurrence(Phi, TheLoop, DT)) { + FirstOrderRecurrences.insert(Phi); + continue; + } - unsigned VF = MaxVectorSize; - if (MaximizeBandwidth && !OptForSize) { - // Collect all viable vectorization factors. - SmallVector VFs; - unsigned NewMaxVectorSize = WidestRegister / SmallestType; - for (unsigned VS = MaxVectorSize; VS <= NewMaxVectorSize; VS *= 2) - VFs.push_back(VS); + // As a last resort, coerce the PHI to a AddRec expression + // and re-try classifying it a an induction PHI. + if (InductionDescriptor::isInductionPHI(Phi, TheLoop, PSE, ID, true)) { + addInductionPhi(Phi, ID, AllowedExit); + continue; + } - // For each VF calculate its register usage. - auto RUs = calculateRegisterUsage(VFs); + ORE->emit(createMissedAnalysis("NonReductionValueUsedOutsideLoop", Phi) + << "value that could not be identified as " + "reduction is used outside the loop"); + DEBUG(dbgs() << "LV: Found an unidentified PHI." << *Phi << "\n"); + return false; + } // end of PHI handling - // Select the largest VF which doesn't require more registers than existing - // ones. - unsigned TargetNumRegisters = TTI.getNumberOfRegisters(true); - for (int i = RUs.size() - 1; i >= 0; --i) { - if (RUs[i].MaxLocalUsers <= TargetNumRegisters) { - VF = VFs[i]; - break; + // We handle calls that: + // * Are debug info intrinsics. + // * Have a mapping to an IR intrinsic. + // * Have a vector version available. + auto *CI = dyn_cast(&I); + if (CI && !getVectorIntrinsicIDForCall(CI, TLI) && + !isa(CI) && + !(CI->getCalledFunction() && TLI && + TLI->isFunctionVectorizable(CI->getCalledFunction()->getName()))) { + ORE->emit(createMissedAnalysis("CantVectorizeCall", CI) + << "call instruction cannot be vectorized"); + DEBUG(dbgs() << "LV: Found a non-intrinsic, non-libfunc callsite.\n"); + return false; } - } - } - - // If we optimize the program for size, avoid creating the tail loop. - if (OptForSize) { - unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop); - DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n'); - - // If we don't know the precise trip count, don't try to vectorize. - if (TC < 2) { - ORE->emit( - createMissedAnalysis("UnknownLoopCountComplexCFG") - << "unable to calculate the loop count due to complex control flow"); - DEBUG(dbgs() << "LV: Aborting. 
A tail loop is required with -Os/-Oz.\n"); - return Factor; - } - // Find the maximum SIMD width that can fit within the trip count. - VF = TC % MaxVectorSize; + // Intrinsics such as powi,cttz and ctlz are legal to vectorize if the + // second argument is the same (i.e. loop invariant) + if (CI && hasVectorInstrinsicScalarOpd( + getVectorIntrinsicIDForCall(CI, TLI), 1)) { + auto *SE = PSE.getSE(); + if (!SE->isLoopInvariant(PSE.getSCEV(CI->getOperand(1)), TheLoop)) { + ORE->emit(createMissedAnalysis("CantVectorizeIntrinsic", CI) + << "intrinsic instruction cannot be vectorized"); + DEBUG(dbgs() << "LV: Found unvectorizable intrinsic " << *CI << "\n"); + return false; + } + } - if (VF == 0) - VF = MaxVectorSize; - else { - // If the trip count that we found modulo the vectorization factor is not - // zero then we require a tail. - ORE->emit(createMissedAnalysis("NoTailLoopWithOptForSize") - << "cannot optimize for size and vectorize at the " - "same time. Enable vectorization of this loop " - "with '#pragma clang loop vectorize(enable)' " - "when compiling with -Os/-Oz"); - DEBUG(dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n"); - return Factor; - } - } - - int UserVF = Hints->getWidth(); - if (UserVF != 0) { - assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two"); - DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n"); + // Check that the instruction return type is vectorizable. + // Also, we can't vectorize extractelement instructions. + if ((!VectorType::isValidElementType(I.getType()) && + !I.getType()->isVoidTy()) || + isa(I)) { + ORE->emit(createMissedAnalysis("CantVectorizeInstructionReturnType", &I) + << "instruction return type cannot be vectorized"); + DEBUG(dbgs() << "LV: Found unvectorizable type.\n"); + return false; + } - Factor.Width = UserVF; + // Check that the stored type is vectorizable. + if (auto *ST = dyn_cast(&I)) { + Type *T = ST->getValueOperand()->getType(); + if (!VectorType::isValidElementType(T)) { + ORE->emit(createMissedAnalysis("CantVectorizeStore", ST) + << "store instruction cannot be vectorized"); + return false; + } - collectUniformsAndScalars(UserVF); - collectInstsToScalarize(UserVF); - return Factor; - } + // FP instructions can allow unsafe algebra, thus vectorizable by + // non-IEEE-754 compliant SIMD units. + // This applies to floating-point math operations and calls, not memory + // operations, shuffles, or casts, as they don't change precision or + // semantics. + } else if (I.getType()->isFloatingPointTy() && (CI || I.isBinaryOp()) && + !I.hasUnsafeAlgebra()) { + DEBUG(dbgs() << "LV: Found FP op with unsafe algebra.\n"); + Hints->setPotentiallyUnsafe(); + } - float Cost = expectedCost(1).first; -#ifndef NDEBUG - const float ScalarCost = Cost; -#endif /* NDEBUG */ - unsigned Width = 1; - DEBUG(dbgs() << "LV: Scalar loop costs: " << (int)ScalarCost << ".\n"); + // Reduction instructions are allowed to have exit users. + // All other instructions must not have external users. + if (hasOutsideLoopUser(TheLoop, &I, AllowedExit)) { + ORE->emit(createMissedAnalysis("ValueUsedOutsideLoop", &I) + << "value cannot be used outside the loop"); + return false; + } - bool ForceVectorization = Hints->getForce() == LoopVectorizeHints::FK_Enabled; - // Ignore scalar width, because the user explicitly wants vectorization. - if (ForceVectorization && VF > 1) { - Width = 2; - Cost = expectedCost(Width).first / (float)Width; + } // next instr. 
} - for (unsigned i = 2; i <= VF; i *= 2) { - // Notice that the vector loop needs to be executed less times, so - // we need to divide the cost of the vector loops by the width of - // the vector elements. - VectorizationCostTy C = expectedCost(i); - float VectorCost = C.first / (float)i; - DEBUG(dbgs() << "LV: Vector loop of width " << i - << " costs: " << (int)VectorCost << ".\n"); - if (!C.second && !ForceVectorization) { - DEBUG( - dbgs() << "LV: Not considering vector loop of width " << i - << " because it will not generate any vector instructions.\n"); - continue; - } - if (VectorCost < Cost) { - Cost = VectorCost; - Width = i; + if (!PrimaryInduction) { + DEBUG(dbgs() << "LV: Did not find one integer induction var.\n"); + if (Inductions.empty()) { + ORE->emit(createMissedAnalysis("NoInductionVariable") + << "loop induction variable could not be identified"); + return false; } } - DEBUG(if (ForceVectorization && Width > 1 && Cost >= ScalarCost) dbgs() - << "LV: Vectorization seems to be not beneficial, " - << "but was forced by a user.\n"); - DEBUG(dbgs() << "LV: Selecting VF: " << Width << ".\n"); - Factor.Width = Width; - Factor.Cost = Width * Cost; - return Factor; + // Now we know the widest induction type, check if our found induction + // is the same size. If it's not, unset it here and InnerLoopVectorizer + // will create another. + if (PrimaryInduction && WidestIndTy != PrimaryInduction->getType()) + PrimaryInduction = nullptr; + + return true; } -std::pair -LoopVectorizationCostModel::getSmallestAndWidestTypes() { - unsigned MinWidth = -1U; - unsigned MaxWidth = 8; - const DataLayout &DL = TheFunction->getParent()->getDataLayout(); +void LoopVectorizationCostModel::collectLoopScalars(unsigned VF) { - // For each block. - for (BasicBlock *BB : TheLoop->blocks()) { - // For each instruction in the loop. - for (Instruction &I : *BB) { - Type *T = I.getType(); + // We should not collect Scalars more than once per VF. Right now, + // this function is called from collectUniformsAndScalars(), which + // already does this check. Collecting Scalars for VF=1 does not make any + // sense. - // Skip ignored values. - if (ValuesToIgnore.count(&I)) - continue; + assert(VF >= 2 && !Scalars.count(VF) && + "This function should not be visited twice for the same VF"); - // Only examine Loads, Stores and PHINodes. - if (!isa(I) && !isa(I) && !isa(I)) - continue; + // If an instruction is uniform after vectorization, it will remain scalar. + Scalars[VF].insert(Uniforms[VF].begin(), Uniforms[VF].end()); - // Examine PHI nodes that are reduction variables. Update the type to - // account for the recurrence type. - if (auto *PN = dyn_cast(&I)) { - if (!Legal->isReductionVariable(PN)) - continue; - RecurrenceDescriptor RdxDesc = (*Legal->getReductionVars())[PN]; - T = RdxDesc.getRecurrenceType(); + // Collect the getelementptr instructions that will not be vectorized. A + // getelementptr instruction is only vectorized if it is used for a legal + // gather or scatter operation. + for (auto *BB : TheLoop->blocks()) + for (auto &I : *BB) { + if (auto *GEP = dyn_cast(&I)) { + Scalars[VF].insert(GEP); + continue; } + auto *Ptr = getPointerOperand(&I); + if (!Ptr) + continue; + auto *GEP = getGEPInstruction(Ptr); + if (GEP && getWideningDecision(&I, VF) == CM_GatherScatter) + Scalars[VF].erase(GEP); + } - // Examine the stored values. 
- if (auto *ST = dyn_cast(&I)) - T = ST->getValueOperand()->getType(); + // An induction variable will remain scalar if all users of the induction + // variable and induction variable update remain scalar. + auto *Latch = TheLoop->getLoopLatch(); + for (auto &Induction : *Legal->getInductionVars()) { + auto *Ind = Induction.first; + auto *IndUpdate = cast(Ind->getIncomingValueForBlock(Latch)); - // Ignore loaded pointer types and stored pointer types that are not - // consecutive. However, we do want to take consecutive stores/loads of - // pointer vectors into account. - if (T->isPointerTy() && !isConsecutiveLoadOrStore(&I)) - continue; + // Determine if all users of the induction variable are scalar after + // vectorization. + auto ScalarInd = all_of(Ind->users(), [&](User *U) -> bool { + auto *I = cast(U); + return I == IndUpdate || !TheLoop->contains(I) || Scalars[VF].count(I); + }); + if (!ScalarInd) + continue; - MinWidth = std::min(MinWidth, - (unsigned)DL.getTypeSizeInBits(T->getScalarType())); - MaxWidth = std::max(MaxWidth, - (unsigned)DL.getTypeSizeInBits(T->getScalarType())); - } + // Determine if all users of the induction variable update instruction are + // scalar after vectorization. + auto ScalarIndUpdate = all_of(IndUpdate->users(), [&](User *U) -> bool { + auto *I = cast(U); + return I == Ind || !TheLoop->contains(I) || Scalars[VF].count(I); + }); + if (!ScalarIndUpdate) + continue; + + // The induction variable and its update instruction will remain scalar. + Scalars[VF].insert(Ind); + Scalars[VF].insert(IndUpdate); } +} - return {MinWidth, MaxWidth}; +bool LoopVectorizationLegality::isScalarWithPredication(Instruction *I) { + if (!blockNeedsPredication(I->getParent())) + return false; + switch(I->getOpcode()) { + default: + break; + case Instruction::Store: + return !isMaskRequired(I); + case Instruction::UDiv: + case Instruction::SDiv: + case Instruction::SRem: + case Instruction::URem: + return mayDivideByZero(*I); + } + return false; } -unsigned LoopVectorizationCostModel::selectInterleaveCount(bool OptForSize, - unsigned VF, - unsigned LoopCost) { +bool LoopVectorizationLegality::memoryInstructionCanBeWidened(Instruction *I, + unsigned VF) { + // Get and ensure we have a valid memory instruction. + LoadInst *LI = dyn_cast(I); + StoreInst *SI = dyn_cast(I); + assert((LI || SI) && "Invalid memory instruction"); - // -- The interleave heuristics -- - // We interleave the loop in order to expose ILP and reduce the loop overhead. - // There are many micro-architectural considerations that we can't predict - // at this level. For example, frontend pressure (on decode or fetch) due to - // code size, or the number and capabilities of the execution ports. - // - // We use the following heuristics to select the interleave count: - // 1. If the code has reductions, then we interleave to break the cross - // iteration dependency. - // 2. If the loop is really small, then we interleave to reduce the loop - // overhead. - // 3. We don't interleave if we think that we will spill registers to memory - // due to the increased register pressure. + auto *Ptr = getPointerOperand(I); - // When we optimize for size, we don't interleave. - if (OptForSize) - return 1; + // In order to be widened, the pointer should be consecutive, first of all. + if (!isConsecutivePtr(Ptr)) + return false; - // We used the distance for the interleave count. 
- if (Legal->getMaxSafeDepDistBytes() != -1U) - return 1; + // If the instruction is a store located in a predicated block, it will be + // scalarized. + if (isScalarWithPredication(I)) + return false; - // Do not interleave loops with a relatively small trip count. - unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop); - if (TC > 1 && TC < TinyTripCountInterleaveThreshold) - return 1; + // If the instruction's allocated size doesn't equal it's type size, it + // requires padding and will be scalarized. + auto &DL = I->getModule()->getDataLayout(); + auto *ScalarTy = LI ? LI->getType() : SI->getValueOperand()->getType(); + if (hasIrregularType(ScalarTy, DL, VF)) + return false; - unsigned TargetNumRegisters = TTI.getNumberOfRegisters(VF > 1); - DEBUG(dbgs() << "LV: The target has " << TargetNumRegisters - << " registers\n"); + return true; +} - if (VF == 1) { - if (ForceTargetNumScalarRegs.getNumOccurrences() > 0) - TargetNumRegisters = ForceTargetNumScalarRegs; - } else { - if (ForceTargetNumVectorRegs.getNumOccurrences() > 0) - TargetNumRegisters = ForceTargetNumVectorRegs; - } +void LoopVectorizationCostModel::collectLoopUniforms(unsigned VF) { - RegisterUsage R = calculateRegisterUsage({VF})[0]; - // We divide by these constants so assume that we have at least one - // instruction that uses at least one register. - R.MaxLocalUsers = std::max(R.MaxLocalUsers, 1U); - R.NumInstructions = std::max(R.NumInstructions, 1U); - - // We calculate the interleave count using the following formula. - // Subtract the number of loop invariants from the number of available - // registers. These registers are used by all of the interleaved instances. - // Next, divide the remaining registers by the number of registers that is - // required by the loop, in order to estimate how many parallel instances - // fit without causing spills. All of this is rounded down if necessary to be - // a power of two. We want power of two interleave count to simplify any - // addressing operations or alignment considerations. - unsigned IC = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs) / - R.MaxLocalUsers); + // We should not collect Uniforms more than once per VF. Right now, + // this function is called from collectUniformsAndScalars(), which + // already does this check. Collecting Uniforms for VF=1 does not make any + // sense. - // Don't count the induction variable as interleaved. - if (EnableIndVarRegisterHeur) - IC = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs - 1) / - std::max(1U, (R.MaxLocalUsers - 1))); + assert(VF >= 2 && !Uniforms.count(VF) && + "This function should not be visited twice for the same VF"); - // Clamp the interleave ranges to reasonable counts. - unsigned MaxInterleaveCount = TTI.getMaxInterleaveFactor(VF); + // Visit the list of Uniforms. If we'll not find any uniform value, we'll + // not analyze again. Uniforms.count(VF) will return 1. + Uniforms[VF].clear(); - // Check if the user has overridden the max. - if (VF == 1) { - if (ForceTargetMaxScalarInterleaveFactor.getNumOccurrences() > 0) - MaxInterleaveCount = ForceTargetMaxScalarInterleaveFactor; - } else { - if (ForceTargetMaxVectorInterleaveFactor.getNumOccurrences() > 0) - MaxInterleaveCount = ForceTargetMaxVectorInterleaveFactor; - } + // We now know that the loop is vectorizable! + // Collect instructions inside the loop that will remain uniform after + // vectorization. - // If we did not calculate the cost for VF (because the user selected the VF) - // then we calculate the cost of VF here. 
- if (LoopCost == 0) - LoopCost = expectedCost(VF).first; + // Global values, params and instructions outside of current loop are out of + // scope. + auto isOutOfScope = [&](Value *V) -> bool { + Instruction *I = dyn_cast(V); + return (!I || !TheLoop->contains(I)); + }; - // Clamp the calculated IC to be between the 1 and the max interleave count - // that the target allows. - if (IC > MaxInterleaveCount) - IC = MaxInterleaveCount; - else if (IC < 1) - IC = 1; + SetVector Worklist; + BasicBlock *Latch = TheLoop->getLoopLatch(); - // Interleave if we vectorized this loop and there is a reduction that could - // benefit from interleaving. - if (VF > 1 && Legal->getReductionVars()->size()) { - DEBUG(dbgs() << "LV: Interleaving because of reductions.\n"); - return IC; + // Start with the conditional branch. If the branch condition is an + // instruction contained in the loop that is only used by the branch, it is + // uniform. + auto *Cmp = dyn_cast(Latch->getTerminator()->getOperand(0)); + if (Cmp && TheLoop->contains(Cmp) && Cmp->hasOneUse()) { + Worklist.insert(Cmp); + DEBUG(dbgs() << "LV: Found uniform instruction: " << *Cmp << "\n"); } - // Note that if we've already vectorized the loop we will have done the - // runtime check and so interleaving won't require further checks. - bool InterleavingRequiresRuntimePointerCheck = - (VF == 1 && Legal->getRuntimePointerChecking()->Need); - - // We want to interleave small loops in order to reduce the loop overhead and - // potentially expose ILP opportunities. - DEBUG(dbgs() << "LV: Loop cost is " << LoopCost << '\n'); - if (!InterleavingRequiresRuntimePointerCheck && LoopCost < SmallLoopCost) { - // We assume that the cost overhead is 1 and we use the cost model - // to estimate the cost of the loop and interleave until the cost of the - // loop overhead is about 5% of the cost of the loop. - unsigned SmallIC = - std::min(IC, (unsigned)PowerOf2Floor(SmallLoopCost / LoopCost)); - - // Interleave until store/load ports (estimated by max interleave count) are - // saturated. - unsigned NumStores = Legal->getNumStores(); - unsigned NumLoads = Legal->getNumLoads(); - unsigned StoresIC = IC / (NumStores ? NumStores : 1); - unsigned LoadsIC = IC / (NumLoads ? NumLoads : 1); - - // If we have a scalar reduction (vector reductions are already dealt with - // by this point), we can increase the critical path length if the loop - // we're interleaving is inside another loop. Limit, by default to 2, so the - // critical path only gets increased by one reduction operation. - if (Legal->getReductionVars()->size() && TheLoop->getLoopDepth() > 1) { - unsigned F = static_cast(MaxNestedScalarReductionIC); - SmallIC = std::min(SmallIC, F); - StoresIC = std::min(StoresIC, F); - LoadsIC = std::min(LoadsIC, F); - } - - if (EnableLoadStoreRuntimeInterleave && - std::max(StoresIC, LoadsIC) > SmallIC) { - DEBUG(dbgs() << "LV: Interleaving to saturate store or load ports.\n"); - return std::max(StoresIC, LoadsIC); - } - - DEBUG(dbgs() << "LV: Interleaving to reduce branch cost.\n"); - return SmallIC; - } + // Holds consecutive and consecutive-like pointers. Consecutive-like pointers + // are pointers that are treated like consecutive pointers during + // vectorization. The pointer operands of interleaved accesses are an + // example. + SmallSetVector ConsecutiveLikePtrs; - // Interleave if this is a large loop (small loops are already dealt with by - // this point) that could benefit from interleaving. 
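// --- Editor's note (illustrative sketch, not part of the patch) ------------
// The interleave count computed earlier in selectInterleaveCount() boils down
// to PowerOf2Floor((TargetNumRegisters - LoopInvariantRegs) / MaxLocalUsers).
// The helper below only works that formula through with assumed numbers (16
// vector registers, 2 loop-invariant values, 5 simultaneously live in-loop
// values); none of these values come from TTI or the register-usage analysis.
namespace {
inline unsigned exampleInterleaveCount() {
  unsigned TargetNumRegisters = 16; // assumed number of vector registers
  unsigned LoopInvariantRegs = 2;   // assumed loop-invariant live values
  unsigned MaxLocalUsers = 5;       // assumed peak of in-loop live values
  unsigned IC = (TargetNumRegisters - LoopInvariantRegs) / MaxLocalUsers; // 2
  // Round down to a power of two, mirroring the PowerOf2Floor() call above.
  while (IC & (IC - 1))
    IC &= IC - 1;
  return IC; // 2: the loop would be interleaved twice, before the small-loop
             // and reduction heuristics below adjust it further.
}
} // end anonymous namespace
// ----------------------------------------------------------------------------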
- bool HasReductions = (Legal->getReductionVars()->size() > 0); - if (TTI.enableAggressiveInterleaving(HasReductions)) { - DEBUG(dbgs() << "LV: Interleaving to expose ILP.\n"); - return IC; - } + // Holds pointer operands of instructions that are possibly non-uniform. + SmallPtrSet PossibleNonUniformPtrs; - DEBUG(dbgs() << "LV: Not Interleaving.\n"); - return 1; -} + auto isUniformDecision = [&](Instruction *I, unsigned VF) { + InstWidening WideningDecision = getWideningDecision(I, VF); + assert(WideningDecision != CM_Unknown && + "Widening decision should be ready at this moment"); -SmallVector -LoopVectorizationCostModel::calculateRegisterUsage(ArrayRef VFs) { - // This function calculates the register usage by measuring the highest number - // of values that are alive at a single location. Obviously, this is a very - // rough estimation. We scan the loop in a topological order in order and - // assign a number to each instruction. We use RPO to ensure that defs are - // met before their users. We assume that each instruction that has in-loop - // users starts an interval. We record every time that an in-loop value is - // used, so we have a list of the first and last occurrences of each - // instruction. Next, we transpose this data structure into a multi map that - // holds the list of intervals that *end* at a specific location. This multi - // map allows us to perform a linear search. We scan the instructions linearly - // and record each time that a new interval starts, by placing it in a set. - // If we find this value in the multi-map then we remove it from the set. - // The max register usage is the maximum size of the set. - // We also search for instructions that are defined outside the loop, but are - // used inside the loop. We need this number separately from the max-interval - // usage number because when we unroll, loop-invariant values do not take - // more register. - LoopBlocksDFS DFS(TheLoop); - DFS.perform(LI); + return (WideningDecision == CM_Widen || + WideningDecision == CM_Interleave); + }; + // Iterate over the instructions in the loop, and collect all + // consecutive-like pointer operands in ConsecutiveLikePtrs. If it's possible + // that a consecutive-like pointer operand will be scalarized, we collect it + // in PossibleNonUniformPtrs instead. We use two sets here because a single + // getelementptr instruction can be used by both vectorized and scalarized + // memory instructions. For example, if a loop loads and stores from the same + // location, but the store is conditional, the store will be scalarized, and + // the getelementptr won't remain uniform. + for (auto *BB : TheLoop->blocks()) + for (auto &I : *BB) { - RegisterUsage RU; - RU.NumInstructions = 0; + // If there's no pointer operand, there's nothing to do. + auto *Ptr = dyn_cast_or_null(getPointerOperand(&I)); + if (!Ptr) + continue; - // Each 'key' in the map opens a new interval. The values - // of the map are the index of the 'last seen' usage of the - // instruction that is the key. - typedef DenseMap IntervalMap; - // Maps instruction to its index. - DenseMap IdxToInstr; - // Marks the end of each interval. - IntervalMap EndPoint; - // Saves the list of instruction indices that are used in the loop. - SmallSet Ends; - // Saves the list of values that are used in the loop but are - // defined outside the loop, such as arguments and constants. - SmallPtrSet LoopInvariants; + // True if all users of Ptr are memory accesses that have Ptr as their + // pointer operand. 
+ auto UsersAreMemAccesses = all_of(Ptr->users(), [&](User *U) -> bool { + return getPointerOperand(U) == Ptr; + }); - unsigned Index = 0; - for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) { - RU.NumInstructions += BB->size(); - for (Instruction &I : *BB) { - IdxToInstr[Index++] = &I; + // Ensure the memory instruction will not be scalarized or used by + // gather/scatter, making its pointer operand non-uniform. If the pointer + // operand is used by any instruction other than a memory access, we + // conservatively assume the pointer operand may be non-uniform. + if (!UsersAreMemAccesses || !isUniformDecision(&I, VF)) + PossibleNonUniformPtrs.insert(Ptr); - // Save the end location of each USE. - for (Value *U : I.operands()) { - auto *Instr = dyn_cast(U); + // If the memory instruction will be vectorized and its pointer operand + // is consecutive-like, or interleaving - the pointer operand should + // remain uniform. + else + ConsecutiveLikePtrs.insert(Ptr); + } - // Ignore non-instruction values such as arguments, constants, etc. - if (!Instr) - continue; + // Add to the Worklist all consecutive and consecutive-like pointers that + // aren't also identified as possibly non-uniform. + for (auto *V : ConsecutiveLikePtrs) + if (!PossibleNonUniformPtrs.count(V)) { + DEBUG(dbgs() << "LV: Found uniform instruction: " << *V << "\n"); + Worklist.insert(V); + } - // If this instruction is outside the loop then record it and continue. - if (!TheLoop->contains(Instr)) { - LoopInvariants.insert(Instr); - continue; - } + // Expand Worklist in topological order: whenever a new instruction + // is added , its users should be either already inside Worklist, or + // out of scope. It ensures a uniform instruction will only be used + // by uniform instructions or out of scope instructions. + unsigned idx = 0; + while (idx != Worklist.size()) { + Instruction *I = Worklist[idx++]; - // Overwrite previous end points. - EndPoint[Instr] = Index; - Ends.insert(Instr); + for (auto OV : I->operand_values()) { + if (isOutOfScope(OV)) + continue; + auto *OI = cast(OV); + if (all_of(OI->users(), [&](User *U) -> bool { + return isOutOfScope(U) || Worklist.count(cast(U)); + })) { + Worklist.insert(OI); + DEBUG(dbgs() << "LV: Found uniform instruction: " << *OI << "\n"); } } } - // Saves the list of intervals that end with the index in 'key'. - typedef SmallVector InstrList; - DenseMap TransposeEnds; - - // Transpose the EndPoints to a list of values that end at each index. - for (auto &Interval : EndPoint) - TransposeEnds[Interval.second].push_back(Interval.first); - - SmallSet OpenIntervals; + // Returns true if Ptr is the pointer operand of a memory access instruction + // I, and I is known to not require scalarization. + auto isVectorizedMemAccessUse = [&](Instruction *I, Value *Ptr) -> bool { + return getPointerOperand(I) == Ptr && isUniformDecision(I, VF); + }; - // Get the size of the widest register. - unsigned MaxSafeDepDist = -1U; - if (Legal->getMaxSafeDepDistBytes() != -1U) - MaxSafeDepDist = Legal->getMaxSafeDepDistBytes() * 8; - unsigned WidestRegister = - std::min(TTI.getRegisterBitWidth(true), MaxSafeDepDist); - const DataLayout &DL = TheFunction->getParent()->getDataLayout(); - - SmallVector RUs(VFs.size()); - SmallVector MaxUsages(VFs.size(), 0); - - DEBUG(dbgs() << "LV(REG): Calculating max register usage:\n"); - - // A lambda that gets the register usage for the given type and VF. 
- auto GetRegUsage = [&DL, WidestRegister](Type *Ty, unsigned VF) { - if (Ty->isTokenTy()) - return 0U; - unsigned TypeSize = DL.getTypeSizeInBits(Ty->getScalarType()); - return std::max(1, VF * TypeSize / WidestRegister); - }; - - for (unsigned int i = 0; i < Index; ++i) { - Instruction *I = IdxToInstr[i]; - - // Remove all of the instructions that end at this location. - InstrList &List = TransposeEnds[i]; - for (Instruction *ToRemove : List) - OpenIntervals.erase(ToRemove); + // For an instruction to be added into Worklist above, all its users inside + // the loop should also be in Worklist. However, this condition cannot be + // true for phi nodes that form a cyclic dependence. We must process phi + // nodes separately. An induction variable will remain uniform if all users + // of the induction variable and induction variable update remain uniform. + // The code below handles both pointer and non-pointer induction variables. + for (auto &Induction : *Legal->getInductionVars()) { + auto *Ind = Induction.first; + auto *IndUpdate = cast(Ind->getIncomingValueForBlock(Latch)); - // Ignore instructions that are never used within the loop. - if (!Ends.count(I)) + // Determine if all users of the induction variable are uniform after + // vectorization. + auto UniformInd = all_of(Ind->users(), [&](User *U) -> bool { + auto *I = cast(U); + return I == IndUpdate || !TheLoop->contains(I) || Worklist.count(I) || + isVectorizedMemAccessUse(I, Ind); + }); + if (!UniformInd) continue; - // Skip ignored values. - if (ValuesToIgnore.count(I)) + // Determine if all users of the induction variable update instruction are + // uniform after vectorization. + auto UniformIndUpdate = all_of(IndUpdate->users(), [&](User *U) -> bool { + auto *I = cast(U); + return I == Ind || !TheLoop->contains(I) || Worklist.count(I) || + isVectorizedMemAccessUse(I, IndUpdate); + }); + if (!UniformIndUpdate) continue; - // For each VF find the maximum usage of registers. - for (unsigned j = 0, e = VFs.size(); j < e; ++j) { - if (VFs[j] == 1) { - MaxUsages[j] = std::max(MaxUsages[j], OpenIntervals.size()); - continue; - } - collectUniformsAndScalars(VFs[j]); - // Count the number of live intervals. - unsigned RegUsage = 0; - for (auto Inst : OpenIntervals) { - // Skip ignored values for VF > 1. - if (VecValuesToIgnore.count(Inst) || - isScalarAfterVectorization(Inst, VFs[j])) - continue; - RegUsage += GetRegUsage(Inst->getType(), VFs[j]); - } - MaxUsages[j] = std::max(MaxUsages[j], RegUsage); - } - - DEBUG(dbgs() << "LV(REG): At #" << i << " Interval # " - << OpenIntervals.size() << '\n'); - - // Add the current instruction to the list of open intervals. - OpenIntervals.insert(I); + // The induction variable and its update instruction will remain uniform. 
+ Worklist.insert(Ind); + Worklist.insert(IndUpdate); + DEBUG(dbgs() << "LV: Found uniform instruction: " << *Ind << "\n"); + DEBUG(dbgs() << "LV: Found uniform instruction: " << *IndUpdate << "\n"); } - for (unsigned i = 0, e = VFs.size(); i < e; ++i) { - unsigned Invariant = 0; - if (VFs[i] == 1) - Invariant = LoopInvariants.size(); - else { - for (auto Inst : LoopInvariants) - Invariant += GetRegUsage(Inst->getType(), VFs[i]); - } + Uniforms[VF].insert(Worklist.begin(), Worklist.end()); +} - DEBUG(dbgs() << "LV(REG): VF = " << VFs[i] << '\n'); - DEBUG(dbgs() << "LV(REG): Found max usage: " << MaxUsages[i] << '\n'); - DEBUG(dbgs() << "LV(REG): Found invariant usage: " << Invariant << '\n'); - DEBUG(dbgs() << "LV(REG): LoopSize: " << RU.NumInstructions << '\n'); +bool LoopVectorizationLegality::canVectorizeMemory() { + LAI = &(*GetLAA)(*TheLoop); + InterleaveInfo.setLAI(LAI); + const OptimizationRemarkAnalysis *LAR = LAI->getReport(); + if (LAR) { + OptimizationRemarkAnalysis VR(Hints->vectorizeAnalysisPassName(), + "loop not vectorized: ", *LAR); + ORE->emit(VR); + } + if (!LAI->canVectorizeMemory()) + return false; - RU.LoopInvariantRegs = Invariant; - RU.MaxLocalUsers = MaxUsages[i]; - RUs[i] = RU; + if (LAI->hasStoreToLoopInvariantAddress()) { + ORE->emit(createMissedAnalysis("CantVectorizeStoreToLoopInvariantAddress") + << "write to a loop invariant address could not be vectorized"); + DEBUG(dbgs() << "LV: We don't allow storing to uniform addresses\n"); + return false; } - return RUs; + Requirements->addRuntimePointerChecks(LAI->getNumRuntimePointerChecks()); + PSE.addPredicate(LAI->getPSE().getUnionPredicate()); + + return true; } -void LoopVectorizationCostModel::collectInstsToScalarize(unsigned VF) { +bool LoopVectorizationLegality::isInductionVariable(const Value *V) { + Value *In0 = const_cast(V); + PHINode *PN = dyn_cast_or_null(In0); + if (!PN) + return false; - // If we aren't vectorizing the loop, or if we've already collected the - // instructions to scalarize, there's nothing to do. Collection may already - // have occurred if we have a user-selected VF and are now computing the - // expected cost for interleaving. - if (VF < 2 || InstsToScalarize.count(VF)) - return; + return Inductions.count(PN); +} - // Initialize a mapping for VF in InstsToScalalarize. If we find that it's - // not profitable to scalarize any instructions, the presence of VF in the - // map will indicate that we've analyzed it already. - ScalarCostsTy &ScalarCostsVF = InstsToScalarize[VF]; +bool LoopVectorizationLegality::isFirstOrderRecurrence(const PHINode *Phi) { + return FirstOrderRecurrences.count(Phi); +} - // Find all the instructions that are scalar with predication in the loop and - // determine if it would be better to not if-convert the blocks they are in. - // If so, we also record the instructions to scalarize. 
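// ---------------------------------------------------------------------------
// An illustrative sketch of the uniformity worklist expansion above: an
// instruction is marked uniform only once every in-loop user of it is already
// known to be uniform (or is out of scope). This standalone toy models that
// fixed-point step over a use graph with plain STL containers; the names
// NameGraph, propagateUniforms, OperandsOf and UsersOf are invented for this
// example and are not LLVM APIs.
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using NameGraph = std::unordered_map<std::string, std::vector<std::string>>;

// Worklist holds values already known to be uniform (e.g. consecutive-like
// pointers). A value outside the seed set is added only when all of its
// in-loop users are uniform or out of scope.
std::unordered_set<std::string>
propagateUniforms(const NameGraph &OperandsOf, const NameGraph &UsersOf,
                  const std::unordered_set<std::string> &InLoop,
                  std::vector<std::string> Worklist) {
  std::unordered_set<std::string> Uniform(Worklist.begin(), Worklist.end());
  for (std::size_t Idx = 0; Idx != Worklist.size(); ++Idx) {
    const std::string Cur = Worklist[Idx];
    auto OpsIt = OperandsOf.find(Cur);
    if (OpsIt == OperandsOf.end())
      continue;
    for (const std::string &Op : OpsIt->second) {
      if (!InLoop.count(Op) || Uniform.count(Op))
        continue; // Out of scope, or already known to be uniform.
      auto UseIt = UsersOf.find(Op);
      bool AllUsersUniform =
          UseIt == UsersOf.end() ||
          std::all_of(UseIt->second.begin(), UseIt->second.end(),
                      [&](const std::string &U) {
                        return !InLoop.count(U) || Uniform.count(U);
                      });
      if (AllUsersUniform) {
        Uniform.insert(Op);     // Op is used only by uniform instructions.
        Worklist.push_back(Op); // Keep expanding through its operands.
      }
    }
  }
  return Uniform;
}
// For example, if %gep is used only by a widened load and %offset is used
// only by %gep, seeding the worklist with {"%gep"} also marks "%offset"
// uniform after one more pass over the worklist.
// ---------------------------------------------------------------------------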
- for (BasicBlock *BB : TheLoop->blocks()) { - if (!Legal->blockNeedsPredication(BB)) - continue; - for (Instruction &I : *BB) - if (Legal->isScalarWithPredication(&I)) { - ScalarCostsTy ScalarCosts; - if (computePredInstDiscount(&I, ScalarCosts, VF) >= 0) - ScalarCostsVF.insert(ScalarCosts.begin(), ScalarCosts.end()); - } - } +bool LoopVectorizationLegality::blockNeedsPredication(BasicBlock *BB) { + return LoopAccessInfo::blockNeedsPredication(BB, TheLoop, DT); } -int LoopVectorizationCostModel::computePredInstDiscount( - Instruction *PredInst, DenseMap &ScalarCosts, - unsigned VF) { +bool LoopVectorizationLegality::blockCanBePredicated( + BasicBlock *BB, SmallPtrSetImpl &SafePtrs) { + const bool IsAnnotatedParallel = TheLoop->isAnnotatedParallel(); - assert(!isUniformAfterVectorization(PredInst, VF) && - "Instruction marked uniform-after-vectorization will be predicated"); + for (Instruction &I : *BB) { + // Check that we don't have a constant expression that can trap as operand. + for (Value *Operand : I.operands()) { + if (auto *C = dyn_cast(Operand)) + if (C->canTrap()) + return false; + } + // We might be able to hoist the load. + if (I.mayReadFromMemory()) { + auto *LI = dyn_cast(&I); + if (!LI) + return false; + if (!SafePtrs.count(LI->getPointerOperand())) { + if (isLegalMaskedLoad(LI->getType(), LI->getPointerOperand()) || + isLegalMaskedGather(LI->getType())) { + MaskedOp.insert(LI); + continue; + } + // !llvm.mem.parallel_loop_access implies if-conversion safety. + if (IsAnnotatedParallel) + continue; + return false; + } + } - // Initialize the discount to zero, meaning that the scalar version and the - // vector version cost the same. - int Discount = 0; + if (I.mayWriteToMemory()) { + auto *SI = dyn_cast(&I); + // We only support predication of stores in basic blocks with one + // predecessor. + if (!SI) + return false; - // Holds instructions to analyze. The instructions we visit are mapped in - // ScalarCosts. Those instructions are the ones that would be scalarized if - // we find that the scalar version costs less. - SmallVector Worklist; + // Build a masked store if it is legal for the target. + if (isLegalMaskedStore(SI->getValueOperand()->getType(), + SI->getPointerOperand()) || + isLegalMaskedScatter(SI->getValueOperand()->getType())) { + MaskedOp.insert(SI); + continue; + } - // Returns true if the given instruction can be scalarized. - auto canBeScalarized = [&](Instruction *I) -> bool { + bool isSafePtr = (SafePtrs.count(SI->getPointerOperand()) != 0); + bool isSinglePredecessor = SI->getParent()->getSinglePredecessor(); - // We only attempt to scalarize instructions forming a single-use chain - // from the original predicated block that would otherwise be vectorized. - // Although not strictly necessary, we give up on instructions we know will - // already be scalar to avoid traversing chains that are unlikely to be - // beneficial. - if (!I->hasOneUse() || PredInst->getParent() != I->getParent() || - isScalarAfterVectorization(I, VF)) + if (++NumPredStores > NumberOfStoresToPredicate || !isSafePtr || + !isSinglePredecessor) + return false; + } + if (I.mayThrow()) return false; + } - // If the instruction is scalar with predication, it will be analyzed - // separately. We ignore it within the context of PredInst. - if (Legal->isScalarWithPredication(I)) - return false; + return true; +} - // If any of the instruction's operands are uniform after vectorization, - // the instruction cannot be scalarized. 
This prevents, for example, a - // masked load from being scalarized. - // - // We assume we will only emit a value for lane zero of an instruction - // marked uniform after vectorization, rather than VF identical values. - // Thus, if we scalarize an instruction that uses a uniform, we would - // create uses of values corresponding to the lanes we aren't emitting code - // for. This behavior can be changed by allowing getScalarValue to clone - // the lane zero values for uniforms rather than asserting. - for (Use &U : I->operands()) - if (auto *J = dyn_cast(U.get())) - if (isUniformAfterVectorization(J, VF)) - return false; +void InterleavedAccessInfo::collectConstStrideAccesses( + MapVector &AccessStrideInfo, + const ValueToValueMap &Strides) { - // Otherwise, we can scalarize the instruction. - return true; - }; + auto &DL = TheLoop->getHeader()->getModule()->getDataLayout(); - // Returns true if an operand that cannot be scalarized must be extracted - // from a vector. We will account for this scalarization overhead below. Note + // Since it's desired that the load/store instructions be maintained in + // "program order" for the interleaved access analysis, we have to visit the + // blocks in the loop in reverse postorder (i.e., in a topological order). + // Such an ordering will ensure that any load/store that may be executed + // before a second load/store will precede the second load/store in + // AccessStrideInfo. + LoopBlocksDFS DFS(TheLoop); + DFS.perform(LI); + for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) + for (auto &I : *BB) { + auto *LI = dyn_cast(&I); + auto *SI = dyn_cast(&I); + if (!LI && !SI) + continue; + + Value *Ptr = getPointerOperand(&I); + // We don't check wrapping here because we don't know yet if Ptr will be + // part of a full group or a group with gaps. Checking wrapping for all + // pointers (even those that end up in groups with no gaps) will be overly + // conservative. For full groups, wrapping should be ok since if we would + // wrap around the address space we would do a memory access at nullptr + // even without the transformation. The wrapping checks are therefore + // deferred until after we've formed the interleaved groups. + int64_t Stride = getPtrStride(PSE, Ptr, TheLoop, Strides, + /*Assume=*/true, /*ShouldCheckWrap=*/false); + + const SCEV *Scev = replaceSymbolicStrideSCEV(PSE, Strides, Ptr); + PointerType *PtrTy = dyn_cast(Ptr->getType()); + uint64_t Size = DL.getTypeAllocSize(PtrTy->getElementType()); + + // An alignment of 0 means target ABI alignment. + unsigned Align = getMemInstAlignment(&I); + if (!Align) + Align = DL.getABITypeAlignment(PtrTy->getElementType()); + + AccessStrideInfo[&I] = StrideDescriptor(Stride, Scev, Size, Align); + } +} + +// Analyze interleaved accesses and collect them into interleaved load and +// store groups. +// +// When generating code for an interleaved load group, we effectively hoist all +// loads in the group to the location of the first load in program order. When +// generating code for an interleaved store group, we sink all stores to the +// location of the last store. This code motion can change the order of load +// and store instructions and may break dependences. +// +// The code generation strategy mentioned above ensures that we won't violate +// any write-after-read (WAR) dependences. 
+// +// E.g., for the WAR dependence: a = A[i]; // (1) +// A[i] = b; // (2) +// +// The store group of (2) is always inserted at or below (2), and the load +// group of (1) is always inserted at or above (1). Thus, the instructions will +// never be reordered. All other dependences are checked to ensure the +// correctness of the instruction reordering. +// +// The algorithm visits all memory accesses in the loop in bottom-up program +// order. Program order is established by traversing the blocks in the loop in +// reverse postorder when collecting the accesses. +// +// We visit the memory accesses in bottom-up order because it can simplify the +// construction of store groups in the presence of write-after-write (WAW) +// dependences. +// +// E.g., for the WAW dependence: A[i] = a; // (1) +// A[i] = b; // (2) +// A[i + 1] = c; // (3) +// +// We will first create a store group with (3) and (2). (1) can't be added to +// this group because it and (2) are dependent. However, (1) can be grouped +// with other accesses that may precede it in program order. Note that a +// bottom-up order does not imply that WAW dependences should not be checked. +void InterleavedAccessInfo::analyzeInterleaving( + const ValueToValueMap &Strides) { + DEBUG(dbgs() << "LV: Analyzing interleaved accesses...\n"); + + // Holds all accesses with a constant stride. + MapVector AccessStrideInfo; + collectConstStrideAccesses(AccessStrideInfo, Strides); + + if (AccessStrideInfo.empty()) + return; + + // Collect the dependences in the loop. + collectDependences(); + + // Holds all interleaved store groups temporarily. + SmallSetVector StoreGroups; + // Holds all interleaved load groups temporarily. + SmallSetVector LoadGroups; + + // Search in bottom-up program order for pairs of accesses (A and B) that can + // form interleaved load or store groups. In the algorithm below, access A + // precedes access B in program order. We initialize a group for B in the + // outer loop of the algorithm, and then in the inner loop, we attempt to + // insert each A into B's group if: + // + // 1. A and B have the same stride, + // 2. A and B have the same memory object size, and + // 3. A belongs in B's group according to its distance from B. + // + // Special care is taken to ensure group formation will not break any + // dependences. + for (auto BI = AccessStrideInfo.rbegin(), E = AccessStrideInfo.rend(); + BI != E; ++BI) { + Instruction *B = BI->first; + StrideDescriptor DesB = BI->second; + + // Initialize a group for B if it has an allowable stride. Even if we don't + // create a group for B, we continue with the bottom-up algorithm to ensure + // we don't break any of B's dependences. + InterleaveGroup *Group = nullptr; + if (isStrided(DesB.Stride)) { + Group = getInterleaveGroup(B); + if (!Group) { + DEBUG(dbgs() << "LV: Creating an interleave group with:" << *B << '\n'); + Group = createInterleaveGroup(B, DesB.Stride, DesB.Align); + } + if (B->mayWriteToMemory()) + StoreGroups.insert(Group); + else + LoadGroups.insert(Group); + } + + for (auto AI = std::next(BI); AI != E; ++AI) { + Instruction *A = AI->first; + StrideDescriptor DesA = AI->second; + + // Our code motion strategy implies that we can't have dependences + // between accesses in an interleaved group and other accesses located + // between the first and last member of the group. Note that this also + // means that a group can't have more than one member at a given offset. 
+ // The accesses in a group can have dependences with other accesses, but + // we must ensure we don't extend the boundaries of the group such that + // we encompass those dependent accesses. + // + // For example, assume we have the sequence of accesses shown below in a + // stride-2 loop: + // + // (1, 2) is a group | A[i] = a; // (1) + // | A[i-1] = b; // (2) | + // A[i-3] = c; // (3) + // A[i] = d; // (4) | (2, 4) is not a group + // + // Because accesses (2) and (3) are dependent, we can group (2) with (1) + // but not with (4). If we did, the dependent access (3) would be within + // the boundaries of the (2, 4) group. + if (!canReorderMemAccessesForInterleavedGroups(&*AI, &*BI)) { + + // If a dependence exists and A is already in a group, we know that A + // must be a store since A precedes B and WAR dependences are allowed. + // Thus, A would be sunk below B. We release A's group to prevent this + // illegal code motion. A will then be free to form another group with + // instructions that precede it. + if (isInterleaved(A)) { + InterleaveGroup *StoreGroup = getInterleaveGroup(A); + StoreGroups.remove(StoreGroup); + releaseGroup(StoreGroup); + } + + // If a dependence exists and A is not already in a group (or it was + // and we just released it), B might be hoisted above A (if B is a + // load) or another store might be sunk below A (if B is a store). In + // either case, we can't add additional instructions to B's group. B + // will only form a group with instructions that it precedes. + break; + } + + // At this point, we've checked for illegal code motion. If either A or B + // isn't strided, there's nothing left to do. + if (!isStrided(DesA.Stride) || !isStrided(DesB.Stride)) + continue; + + // Ignore A if it's already in a group or isn't the same kind of memory + // operation as B. + if (isInterleaved(A) || A->mayReadFromMemory() != B->mayReadFromMemory()) + continue; + + // Check rules 1 and 2. Ignore A if its stride or size is different from + // that of B. + if (DesA.Stride != DesB.Stride || DesA.Size != DesB.Size) + continue; + + // Calculate the distance from A to B. + const SCEVConstant *DistToB = dyn_cast( + PSE.getSE()->getMinusSCEV(DesA.Scev, DesB.Scev)); + if (!DistToB) + continue; + int64_t DistanceToB = DistToB->getAPInt().getSExtValue(); + + // Check rule 3. Ignore A if its distance to B is not a multiple of the + // size. + if (DistanceToB % static_cast(DesB.Size)) + continue; + + // Ignore A if either A or B is in a predicated block. Although we + // currently prevent group formation for predicated accesses, we may be + // able to relax this limitation in the future once we handle more + // complicated blocks. + if (isPredicated(A->getParent()) || isPredicated(B->getParent())) + continue; + + // The index of A is the index of B plus A's distance to B in multiples + // of the size. + int IndexA = + Group->getIndex(B) + DistanceToB / static_cast(DesB.Size); + + // Try to insert A into B's group. + if (Group->insertMember(A, IndexA, DesA.Align)) { + DEBUG(dbgs() << "LV: Inserted:" << *A << '\n' + << " into the interleave group with" << *B << '\n'); + InterleaveGroupMap[A] = Group; + + // Set the first load in program order as the insert position. + if (A->mayReadFromMemory()) + Group->setInsertPos(A); + } + } // Iteration over A accesses. + } // Iteration over B accesses. + + // Remove interleaved store groups with gaps. 
+  for (InterleaveGroup *Group : StoreGroups)
+    if (Group->getNumMembers() != Group->getFactor())
+      releaseGroup(Group);
+
+  // Remove interleaved groups with gaps (currently only loads) whose memory
+  // accesses may wrap around. We have to revisit the getPtrStride analysis,
+  // this time with ShouldCheckWrap=true, since collectConstStrideAccesses does
+  // not check wrapping (see documentation there).
+  // FORNOW we use Assume=false;
+  // TODO: Change to Assume=true, making sure we don't exceed the threshold
+  // of runtime SCEV assumption checks (thereby potentially failing to
+  // vectorize altogether).
+  // Additional optional optimizations:
+  // TODO: If we are peeling the loop and we know that the first pointer doesn't
+  // wrap then we can deduce that all pointers in the group don't wrap.
+  // This means that we can forcefully peel the loop in order to only have to
+  // check the first pointer for no-wrap. When we change to Assume=true we'll
+  // only need at most one runtime check per interleaved group.
+  //
+  for (InterleaveGroup *Group : LoadGroups) {
+
+    // Case 1: A full group. We can skip the checks; for full groups, if the
+    // wide load would wrap around the address space we would do a memory
+    // access at nullptr even without the transformation.
+    if (Group->getNumMembers() == Group->getFactor())
+      continue;
+
+    // Case 2: If the first and last members of the group don't wrap, this
+    // implies that all the pointers in the group don't wrap.
+    // So we check only group member 0 (which is always guaranteed to exist),
+    // and group member Factor - 1; if the latter doesn't exist we rely on
+    // peeling (if it is a non-reversed access -- see Case 3).
+    Value *FirstMemberPtr = getPointerOperand(Group->getMember(0));
+    if (!getPtrStride(PSE, FirstMemberPtr, TheLoop, Strides, /*Assume=*/false,
+                      /*ShouldCheckWrap=*/true)) {
+      DEBUG(dbgs() << "LV: Invalidate candidate interleaved group due to "
+                      "first group member potentially pointer-wrapping.\n");
+      releaseGroup(Group);
+      continue;
+    }
+    Instruction *LastMember = Group->getMember(Group->getFactor() - 1);
+    if (LastMember) {
+      Value *LastMemberPtr = getPointerOperand(LastMember);
+      if (!getPtrStride(PSE, LastMemberPtr, TheLoop, Strides, /*Assume=*/false,
+                        /*ShouldCheckWrap=*/true)) {
+        DEBUG(dbgs() << "LV: Invalidate candidate interleaved group due to "
+                      "last group member potentially pointer-wrapping.\n");
+        releaseGroup(Group);
+      }
+    } else {
+      // Case 3: A non-reversed interleaved load group with gaps: We need
+      // to execute at least one scalar epilogue iteration. This will ensure
+      // we don't speculatively access memory out-of-bounds. We only need
+      // to look for a member at index factor - 1, since every group must have
+      // a member at index zero.
+      if (Group->isReverse()) {
+        releaseGroup(Group);
+        continue;
+      }
+      DEBUG(dbgs() << "LV: Interleaved group requires epilogue iteration.\n");
+      RequiresScalarEpilogue = true;
+    }
+  }
+}
+
+bool LoopVectorizationCostModel::canVectorize(bool OptForSize) {
+  if (OptForSize && Legal->getRuntimePointerChecking()->Need) {
+    ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")
+              << "runtime pointer checks needed. Enable vectorization of this "
+                 "loop with '#pragma clang loop vectorize(enable)' when "
+                 "compiling with -Os/-Oz");
+    DEBUG(dbgs()
+          << "LV: Aborting. 
Runtime ptr check is required with -Os/-Oz.\n"); + return false; + } + + if (!EnableCondStoresVectorization && Legal->getNumPredStores()) { + ORE->emit(createMissedAnalysis("ConditionalStore") + << "store that is conditionally executed prevents vectorization"); + DEBUG(dbgs() << "LV: No vectorization. There are conditional stores.\n"); + return false; + } + + // If we optimize the program for size, avoid creating the tail loop. + if (OptForSize) { + unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop); + DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n'); + + // If we don't know the precise trip count, don't try to vectorize. + if (TC < 2) { + ORE->emit( + createMissedAnalysis("UnknownLoopCountComplexCFG") + << "unable to calculate the loop count due to complex control flow"); + DEBUG(dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n"); + return false; + } + } + return true; +} + +unsigned +LoopVectorizationCostModel::computeMaxVectorizationFactor(bool OptForSize) { + MinBWs = computeMinimumValueSizes(TheLoop->getBlocks(), *DB, &TTI); + unsigned SmallestType, WidestType; + std::tie(SmallestType, WidestType) = getSmallestAndWidestTypes(); + unsigned WidestRegister = TTI.getRegisterBitWidth(true); + unsigned MaxSafeDepDist = -1U; + + // Get the maximum safe dependence distance in bits computed by LAA. If the + // loop contains any interleaved accesses, we divide the dependence distance + // by the maximum interleave factor of all interleaved groups. Note that + // although the division ensures correctness, this is a fairly conservative + // computation because the maximum distance computed by LAA may not involve + // any of the interleaved accesses. + if (Legal->getMaxSafeDepDistBytes() != -1U) + MaxSafeDepDist = + Legal->getMaxSafeDepDistBytes() * 8 / Legal->getMaxInterleaveFactor(); + + WidestRegister = + ((WidestRegister < MaxSafeDepDist) ? WidestRegister : MaxSafeDepDist); + unsigned MaxVectorSize = WidestRegister / WidestType; + + DEBUG(dbgs() << "LV: The Smallest and Widest types: " << SmallestType << " / " + << WidestType << " bits.\n"); + DEBUG(dbgs() << "LV: The Widest register is: " << WidestRegister + << " bits.\n"); + + if (MaxVectorSize == 0) { + DEBUG(dbgs() << "LV: The target has no vector registers.\n"); + MaxVectorSize = 1; + } + + assert(MaxVectorSize <= 64 && "Did not expect to pack so many elements" + " into one vector!"); + + unsigned VF = MaxVectorSize; + + if (MaximizeBandwidth && !OptForSize) { + // Collect all viable vectorization factors. + SmallVector VFs; + unsigned NewMaxVectorSize = WidestRegister / SmallestType; + for (unsigned VS = MaxVectorSize; VS <= NewMaxVectorSize; VS *= 2) + VFs.push_back(VS); + + // For each VF calculate its register usage. + auto RUs = calculateRegisterUsage(VFs); + + // Select the largest VF which doesn't require more registers than existing + // ones. + unsigned TargetNumRegisters = TTI.getNumberOfRegisters(true); + for (int i = RUs.size() - 1; i >= 0; --i) { + if (RUs[i].MaxLocalUsers <= TargetNumRegisters) { + VF = VFs[i]; + break; + } + } + } + return VF; +} + +bool LoopVectorizationCostModel::requiresTail(unsigned MaxVectorSize) { + unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop); + DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n'); + + // Find the maximum SIMD width that can fit within the trip count. + unsigned VF = TC % MaxVectorSize; + + if (VF == 0) + return false; + + // If the trip count that we found modulo the vectorization factor is not + // zero then we require a tail. 
+ ORE->emit(createMissedAnalysis("NoTailLoopWithOptForSize") + << "cannot optimize for size and vectorize at the " + "same time. Enable vectorization of this loop " + "with '#pragma clang loop vectorize(enable)' " + "when compiling with -Os/-Oz"); + DEBUG(dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n"); + return true; +} + +LoopVectorizationCostModel::VectorizationFactor +LoopVectorizationCostModel::selectVectorizationFactor(bool OptForSize, + unsigned VF) { + // Width 1 means no vectorize + VectorizationFactor Factor = {1U, 0U}; + + float Cost = expectedCost(1).first; +#ifndef NDEBUG + const float ScalarCost = Cost; +#endif /* NDEBUG */ + unsigned Width = 1; + DEBUG(dbgs() << "LV: Scalar loop costs: " << (int)ScalarCost << ".\n"); + + bool ForceVectorization = Hints->getForce() == LoopVectorizeHints::FK_Enabled; + // Ignore scalar width, because the user explicitly wants vectorization. + if (ForceVectorization && VF > 1) { + Width = 2; + Cost = expectedCost(Width).first / (float)Width; + } + + for (unsigned i = 2; i <= VF; i *= 2) { + // Notice that the vector loop needs to be executed less times, so + // we need to divide the cost of the vector loops by the width of + // the vector elements. + VectorizationCostTy C = expectedCost(i); + float VectorCost = C.first / (float)i; + DEBUG(dbgs() << "LV: Vector loop of width " << i + << " costs: " << (int)VectorCost << ".\n"); + if (!C.second && !ForceVectorization) { + DEBUG( + dbgs() << "LV: Not considering vector loop of width " << i + << " because it will not generate any vector instructions.\n"); + continue; + } + if (VectorCost < Cost) { + Cost = VectorCost; + Width = i; + } + } + + DEBUG(if (ForceVectorization && Width > 1 && Cost >= ScalarCost) dbgs() + << "LV: Vectorization seems to be not beneficial, " + << "but was forced by a user.\n"); + DEBUG(dbgs() << "LV: Selecting VF: " << Width << ".\n"); + Factor.Width = Width; + Factor.Cost = Width * Cost; + return Factor; +} + +std::pair +LoopVectorizationCostModel::getSmallestAndWidestTypes() { + unsigned MinWidth = -1U; + unsigned MaxWidth = 8; + const DataLayout &DL = TheFunction->getParent()->getDataLayout(); + + // For each block. + for (BasicBlock *BB : TheLoop->blocks()) { + // For each instruction in the loop. + for (Instruction &I : *BB) { + Type *T = I.getType(); + + // Skip ignored values. + if (ValuesToIgnore.count(&I)) + continue; + + // Only examine Loads, Stores and PHINodes. + if (!isa(I) && !isa(I) && !isa(I)) + continue; + + // Examine PHI nodes that are reduction variables. Update the type to + // account for the recurrence type. + if (auto *PN = dyn_cast(&I)) { + if (!Legal->isReductionVariable(PN)) + continue; + RecurrenceDescriptor RdxDesc = (*Legal->getReductionVars())[PN]; + T = RdxDesc.getRecurrenceType(); + } + + // Examine the stored values. + if (auto *ST = dyn_cast(&I)) + T = ST->getValueOperand()->getType(); + + // Ignore loaded pointer types and stored pointer types that are not + // consecutive. However, we do want to take consecutive stores/loads of + // pointer vectors into account. 
+ if (T->isPointerTy() && !isConsecutiveLoadOrStore(&I)) + continue; + + MinWidth = std::min(MinWidth, + (unsigned)DL.getTypeSizeInBits(T->getScalarType())); + MaxWidth = std::max(MaxWidth, + (unsigned)DL.getTypeSizeInBits(T->getScalarType())); + } + } + + return {MinWidth, MaxWidth}; +} + +unsigned LoopVectorizationCostModel::selectInterleaveCount(bool OptForSize, + unsigned VF, + unsigned LoopCost) { + + // -- The interleave heuristics -- + // We interleave the loop in order to expose ILP and reduce the loop overhead. + // There are many micro-architectural considerations that we can't predict + // at this level. For example, frontend pressure (on decode or fetch) due to + // code size, or the number and capabilities of the execution ports. + // + // We use the following heuristics to select the interleave count: + // 1. If the code has reductions, then we interleave to break the cross + // iteration dependency. + // 2. If the loop is really small, then we interleave to reduce the loop + // overhead. + // 3. We don't interleave if we think that we will spill registers to memory + // due to the increased register pressure. + + // When we optimize for size, we don't interleave. + if (OptForSize) + return 1; + + // We used the distance for the interleave count. + if (Legal->getMaxSafeDepDistBytes() != -1U) + return 1; + + // Do not interleave loops with a relatively small trip count. + unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop); + if (TC > 1 && TC < TinyTripCountInterleaveThreshold) + return 1; + + unsigned TargetNumRegisters = TTI.getNumberOfRegisters(VF > 1); + DEBUG(dbgs() << "LV: The target has " << TargetNumRegisters + << " registers\n"); + + if (VF == 1) { + if (ForceTargetNumScalarRegs.getNumOccurrences() > 0) + TargetNumRegisters = ForceTargetNumScalarRegs; + } else { + if (ForceTargetNumVectorRegs.getNumOccurrences() > 0) + TargetNumRegisters = ForceTargetNumVectorRegs; + } + + RegisterUsage R = calculateRegisterUsage({VF})[0]; + // We divide by these constants so assume that we have at least one + // instruction that uses at least one register. + R.MaxLocalUsers = std::max(R.MaxLocalUsers, 1U); + R.NumInstructions = std::max(R.NumInstructions, 1U); + + // We calculate the interleave count using the following formula. + // Subtract the number of loop invariants from the number of available + // registers. These registers are used by all of the interleaved instances. + // Next, divide the remaining registers by the number of registers that is + // required by the loop, in order to estimate how many parallel instances + // fit without causing spills. All of this is rounded down if necessary to be + // a power of two. We want power of two interleave count to simplify any + // addressing operations or alignment considerations. + unsigned IC = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs) / + R.MaxLocalUsers); + + // Don't count the induction variable as interleaved. + if (EnableIndVarRegisterHeur) + IC = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs - 1) / + std::max(1U, (R.MaxLocalUsers - 1))); + + // Clamp the interleave ranges to reasonable counts. + unsigned MaxInterleaveCount = TTI.getMaxInterleaveFactor(VF); + + // Check if the user has overridden the max. 
+ if (VF == 1) { + if (ForceTargetMaxScalarInterleaveFactor.getNumOccurrences() > 0) + MaxInterleaveCount = ForceTargetMaxScalarInterleaveFactor; + } else { + if (ForceTargetMaxVectorInterleaveFactor.getNumOccurrences() > 0) + MaxInterleaveCount = ForceTargetMaxVectorInterleaveFactor; + } + + // If we did not calculate the cost for VF (because the user selected the VF) + // then we calculate the cost of VF here. + if (LoopCost == 0) + LoopCost = expectedCost(VF).first; + + // Clamp the calculated IC to be between the 1 and the max interleave count + // that the target allows. + if (IC > MaxInterleaveCount) + IC = MaxInterleaveCount; + else if (IC < 1) + IC = 1; + + // Interleave if we vectorized this loop and there is a reduction that could + // benefit from interleaving. + if (VF > 1 && Legal->getReductionVars()->size()) { + DEBUG(dbgs() << "LV: Interleaving because of reductions.\n"); + return IC; + } + + // Note that if we've already vectorized the loop we will have done the + // runtime check and so interleaving won't require further checks. + bool InterleavingRequiresRuntimePointerCheck = + (VF == 1 && Legal->getRuntimePointerChecking()->Need); + + // We want to interleave small loops in order to reduce the loop overhead and + // potentially expose ILP opportunities. + DEBUG(dbgs() << "LV: Loop cost is " << LoopCost << '\n'); + if (!InterleavingRequiresRuntimePointerCheck && LoopCost < SmallLoopCost) { + // We assume that the cost overhead is 1 and we use the cost model + // to estimate the cost of the loop and interleave until the cost of the + // loop overhead is about 5% of the cost of the loop. + unsigned SmallIC = + std::min(IC, (unsigned)PowerOf2Floor(SmallLoopCost / LoopCost)); + + // Interleave until store/load ports (estimated by max interleave count) are + // saturated. + unsigned NumStores = Legal->getNumStores(); + unsigned NumLoads = Legal->getNumLoads(); + unsigned StoresIC = IC / (NumStores ? NumStores : 1); + unsigned LoadsIC = IC / (NumLoads ? NumLoads : 1); + + // If we have a scalar reduction (vector reductions are already dealt with + // by this point), we can increase the critical path length if the loop + // we're interleaving is inside another loop. Limit, by default to 2, so the + // critical path only gets increased by one reduction operation. + if (Legal->getReductionVars()->size() && TheLoop->getLoopDepth() > 1) { + unsigned F = static_cast(MaxNestedScalarReductionIC); + SmallIC = std::min(SmallIC, F); + StoresIC = std::min(StoresIC, F); + LoadsIC = std::min(LoadsIC, F); + } + + if (EnableLoadStoreRuntimeInterleave && + std::max(StoresIC, LoadsIC) > SmallIC) { + DEBUG(dbgs() << "LV: Interleaving to saturate store or load ports.\n"); + return std::max(StoresIC, LoadsIC); + } + + DEBUG(dbgs() << "LV: Interleaving to reduce branch cost.\n"); + return SmallIC; + } + + // Interleave if this is a large loop (small loops are already dealt with by + // this point) that could benefit from interleaving. + bool HasReductions = (Legal->getReductionVars()->size() > 0); + if (TTI.enableAggressiveInterleaving(HasReductions)) { + DEBUG(dbgs() << "LV: Interleaving to expose ILP.\n"); + return IC; + } + + DEBUG(dbgs() << "LV: Not Interleaving.\n"); + return 1; +} + +SmallVector +LoopVectorizationCostModel::calculateRegisterUsage(ArrayRef VFs) { + // This function calculates the register usage by measuring the highest number + // of values that are alive at a single location. Obviously, this is a very + // rough estimation. 
We scan the loop in a topological order in order and + // assign a number to each instruction. We use RPO to ensure that defs are + // met before their users. We assume that each instruction that has in-loop + // users starts an interval. We record every time that an in-loop value is + // used, so we have a list of the first and last occurrences of each + // instruction. Next, we transpose this data structure into a multi map that + // holds the list of intervals that *end* at a specific location. This multi + // map allows us to perform a linear search. We scan the instructions linearly + // and record each time that a new interval starts, by placing it in a set. + // If we find this value in the multi-map then we remove it from the set. + // The max register usage is the maximum size of the set. + // We also search for instructions that are defined outside the loop, but are + // used inside the loop. We need this number separately from the max-interval + // usage number because when we unroll, loop-invariant values do not take + // more register. + LoopBlocksDFS DFS(TheLoop); + DFS.perform(LI); + + RegisterUsage RU; + RU.NumInstructions = 0; + + // Each 'key' in the map opens a new interval. The values + // of the map are the index of the 'last seen' usage of the + // instruction that is the key. + typedef DenseMap IntervalMap; + // Maps instruction to its index. + DenseMap IdxToInstr; + // Marks the end of each interval. + IntervalMap EndPoint; + // Saves the list of instruction indices that are used in the loop. + SmallSet Ends; + // Saves the list of values that are used in the loop but are + // defined outside the loop, such as arguments and constants. + SmallPtrSet LoopInvariants; + + unsigned Index = 0; + for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) { + RU.NumInstructions += BB->size(); + for (Instruction &I : *BB) { + IdxToInstr[Index++] = &I; + + // Save the end location of each USE. + for (Value *U : I.operands()) { + auto *Instr = dyn_cast(U); + + // Ignore non-instruction values such as arguments, constants, etc. + if (!Instr) + continue; + + // If this instruction is outside the loop then record it and continue. + if (!TheLoop->contains(Instr)) { + LoopInvariants.insert(Instr); + continue; + } + + // Overwrite previous end points. + EndPoint[Instr] = Index; + Ends.insert(Instr); + } + } + } + + // Saves the list of intervals that end with the index in 'key'. + typedef SmallVector InstrList; + DenseMap TransposeEnds; + + // Transpose the EndPoints to a list of values that end at each index. + for (auto &Interval : EndPoint) + TransposeEnds[Interval.second].push_back(Interval.first); + + SmallSet OpenIntervals; + + // Get the size of the widest register. + unsigned MaxSafeDepDist = -1U; + if (Legal->getMaxSafeDepDistBytes() != -1U) + MaxSafeDepDist = Legal->getMaxSafeDepDistBytes() * 8; + unsigned WidestRegister = + std::min(TTI.getRegisterBitWidth(true), MaxSafeDepDist); + const DataLayout &DL = TheFunction->getParent()->getDataLayout(); + + SmallVector RUs(VFs.size()); + SmallVector MaxUsages(VFs.size(), 0); + + DEBUG(dbgs() << "LV(REG): Calculating max register usage:\n"); + + // A lambda that gets the register usage for the given type and VF. 
+  auto GetRegUsage = [&DL, WidestRegister](Type *Ty, unsigned VF) {
+    if (Ty->isTokenTy())
+      return 0U;
+    unsigned TypeSize = DL.getTypeSizeInBits(Ty->getScalarType());
+    return std::max<unsigned>(1, VF * TypeSize / WidestRegister);
+  };
+
+  for (unsigned int i = 0; i < Index; ++i) {
+    Instruction *I = IdxToInstr[i];
+
+    // Remove all of the instructions that end at this location.
+    InstrList &List = TransposeEnds[i];
+    for (Instruction *ToRemove : List)
+      OpenIntervals.erase(ToRemove);
+
+    // Ignore instructions that are never used within the loop.
+    if (!Ends.count(I))
+      continue;
+
+    // Skip ignored values.
+    if (ValuesToIgnore.count(I))
+      continue;
+
+    // For each VF find the maximum usage of registers.
+    for (unsigned j = 0, e = VFs.size(); j < e; ++j) {
+      if (VFs[j] == 1) {
+        MaxUsages[j] = std::max(MaxUsages[j], OpenIntervals.size());
+        continue;
+      }
+      collectUniformsAndScalars(VFs[j]);
+      // Count the number of live intervals.
+      unsigned RegUsage = 0;
+      for (auto Inst : OpenIntervals) {
+        // Skip ignored values for VF > 1.
+        if (VecValuesToIgnore.count(Inst) ||
+            isScalarAfterVectorization(Inst, VFs[j]))
+          continue;
+        RegUsage += GetRegUsage(Inst->getType(), VFs[j]);
+      }
+      MaxUsages[j] = std::max(MaxUsages[j], RegUsage);
+    }
+
+    DEBUG(dbgs() << "LV(REG): At #" << i << " Interval # "
+                 << OpenIntervals.size() << '\n');
+
+    // Add the current instruction to the list of open intervals.
+    OpenIntervals.insert(I);
+  }
+
+  for (unsigned i = 0, e = VFs.size(); i < e; ++i) {
+    unsigned Invariant = 0;
+    if (VFs[i] == 1)
+      Invariant = LoopInvariants.size();
+    else {
+      for (auto Inst : LoopInvariants)
+        Invariant += GetRegUsage(Inst->getType(), VFs[i]);
+    }
+
+    DEBUG(dbgs() << "LV(REG): VF = " << VFs[i] << '\n');
+    DEBUG(dbgs() << "LV(REG): Found max usage: " << MaxUsages[i] << '\n');
+    DEBUG(dbgs() << "LV(REG): Found invariant usage: " << Invariant << '\n');
+    DEBUG(dbgs() << "LV(REG): LoopSize: " << RU.NumInstructions << '\n');
+
+    RU.LoopInvariantRegs = Invariant;
+    RU.MaxLocalUsers = MaxUsages[i];
+    RUs[i] = RU;
+  }
+
+  return RUs;
+}
+
+void LoopVectorizationCostModel::collectInstsToScalarize(unsigned VF) {
+
+  // Function should not be called for the scalar case.
+  assert(VF >= 2 && "Function called for the scalar loop");
+
+  // If we've already collected the instructions to scalarize, there's
+  // nothing to do. Collection may already have occurred if we have a
+  // user-selected VF and are now computing the expected cost for
+  // interleaving.
+  if (InstsToScalarize.count(VF))
+    return;
+
+  // Initialize a mapping for VF in InstsToScalarize. If we find that it's
+  // not profitable to scalarize any instructions, the presence of VF in the
+  // map will indicate that we've analyzed it already.
+  ScalarCostsTy &ScalarCostsVF = InstsToScalarize[VF];
+
+  // Find all the instructions that are scalar with predication in the loop and
+  // determine if it would be better to not if-convert the blocks they are in.
+  // If so, we also record the instructions to scalarize. 
+ for (BasicBlock *BB : TheLoop->blocks()) { + if (!Legal->blockNeedsPredication(BB)) + continue; + for (Instruction &I : *BB) + if (Legal->isScalarWithPredication(&I)) { + ScalarCostsTy ScalarCosts; + if (computePredInstDiscount(&I, ScalarCosts, VF) >= 0) + ScalarCostsVF.insert(ScalarCosts.begin(), ScalarCosts.end()); + } + } +} + +int LoopVectorizationCostModel::computePredInstDiscount( + Instruction *PredInst, DenseMap &ScalarCosts, + unsigned VF) { + + assert(!isUniformAfterVectorization(PredInst, VF) && + "Instruction marked uniform-after-vectorization will be predicated"); + + // Initialize the discount to zero, meaning that the scalar version and the + // vector version cost the same. + int Discount = 0; + + // Holds instructions to analyze. The instructions we visit are mapped in + // ScalarCosts. Those instructions are the ones that would be scalarized if + // we find that the scalar version costs less. + SmallVector Worklist; + + // Returns true if the given instruction can be scalarized. + auto canBeScalarized = [&](Instruction *I) -> bool { + + // We only attempt to scalarize instructions forming a single-use chain + // from the original predicated block that would otherwise be vectorized. + // Although not strictly necessary, we give up on instructions we know will + // already be scalar to avoid traversing chains that are unlikely to be + // beneficial. + if (!I->hasOneUse() || PredInst->getParent() != I->getParent() || + isScalarAfterVectorization(I, VF)) + return false; + + // If the instruction is scalar with predication, it will be analyzed + // separately. We ignore it within the context of PredInst. + if (Legal->isScalarWithPredication(I)) + return false; + + // If any of the instruction's operands are uniform after vectorization, + // the instruction cannot be scalarized. This prevents, for example, a + // masked load from being scalarized. + // + // We assume we will only emit a value for lane zero of an instruction + // marked uniform after vectorization, rather than VF identical values. + // Thus, if we scalarize an instruction that uses a uniform, we would + // create uses of values corresponding to the lanes we aren't emitting code + // for. This behavior can be changed by allowing getScalarValue to clone + // the lane zero values for uniforms rather than asserting. + for (Use &U : I->operands()) + if (auto *J = dyn_cast(U.get())) + if (isUniformAfterVectorization(J, VF)) + return false; + + // Otherwise, we can scalarize the instruction. + return true; + }; + + // Returns true if an operand that cannot be scalarized must be extracted + // from a vector. We will account for this scalarization overhead below. Note // that the non-void predicated instructions are placed in their own blocks, // and their return values are inserted into vectors. Thus, an extract would // still be required. @@ -6749,606 +7288,1721 @@ return TheLoop->contains(I) && !isScalarAfterVectorization(I, VF); }; - // Compute the expected cost discount from scalarizing the entire expression - // feeding the predicated instruction. We currently only consider expressions - // that are single-use instruction chains. - Worklist.push_back(PredInst); - while (!Worklist.empty()) { - Instruction *I = Worklist.pop_back_val(); + // Compute the expected cost discount from scalarizing the entire expression + // feeding the predicated instruction. We currently only consider expressions + // that are single-use instruction chains. 
+ Worklist.push_back(PredInst); + while (!Worklist.empty()) { + Instruction *I = Worklist.pop_back_val(); + + // If we've already analyzed the instruction, there's nothing to do. + if (ScalarCosts.count(I)) + continue; + + // Compute the cost of the vector instruction. Note that this cost already + // includes the scalarization overhead of the predicated instruction. + unsigned VectorCost = getInstructionCost(I, VF).first; + + // Compute the cost of the scalarized instruction. This cost is the cost of + // the instruction as if it wasn't if-converted and instead remained in the + // predicated block. We will scale this cost by block probability after + // computing the scalarization overhead. + unsigned ScalarCost = VF * getInstructionCost(I, 1).first; + + // Compute the scalarization overhead of needed insertelement instructions + // and phi nodes. + if (Legal->isScalarWithPredication(I) && !I->getType()->isVoidTy()) { + ScalarCost += TTI.getScalarizationOverhead(ToVectorTy(I->getType(), VF), + true, false); + ScalarCost += VF * TTI.getCFInstrCost(Instruction::PHI); + } + + // Compute the scalarization overhead of needed extractelement + // instructions. For each of the instruction's operands, if the operand can + // be scalarized, add it to the worklist; otherwise, account for the + // overhead. + for (Use &U : I->operands()) + if (auto *J = dyn_cast(U.get())) { + assert(VectorType::isValidElementType(J->getType()) && + "Instruction has non-scalar type"); + if (canBeScalarized(J)) + Worklist.push_back(J); + else if (needsExtract(J)) + ScalarCost += TTI.getScalarizationOverhead( + ToVectorTy(J->getType(),VF), false, true); + } + + // Scale the total scalar cost by block probability. + ScalarCost /= getReciprocalPredBlockProb(); + + // Compute the discount. A non-negative discount means the vector version + // of the instruction costs more, and scalarizing would be beneficial. + Discount += VectorCost - ScalarCost; + ScalarCosts[I] = ScalarCost; + } + + return Discount; +} + +LoopVectorizationCostModel::VectorizationCostTy +LoopVectorizationCostModel::expectedCost(unsigned VF) { + VectorizationCostTy Cost; + + // For each block. + for (BasicBlock *BB : TheLoop->blocks()) { + VectorizationCostTy BlockCost; + + // For each instruction in the old loop. + for (Instruction &I : *BB) { + // Skip dbg intrinsics. + if (isa(I)) + continue; + + // Skip ignored values. + if (ValuesToIgnore.count(&I)) + continue; + + VectorizationCostTy C = getInstructionCost(&I, VF); + + // Check if we should override the cost. + if (ForceTargetInstructionCost.getNumOccurrences() > 0) + C.first = ForceTargetInstructionCost; + + BlockCost.first += C.first; + BlockCost.second |= C.second; + DEBUG(dbgs() << "LV: Found an estimated cost of " << C.first << " for VF " + << VF << " For instruction: " << I << '\n'); + } + + // If we are vectorizing a predicated block, it will have been + // if-converted. This means that the block's instructions (aside from + // stores and instructions that may divide by zero) will now be + // unconditionally executed. For the scalar case, we may not always execute + // the predicated block. Thus, scale the block's cost by the probability of + // executing it. 
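// ---------------------------------------------------------------------------
// An illustrative sketch of the probability scaling described in the comment
// above, with made-up cost numbers. It assumes the reciprocal block
// probability is 2 (i.e. a predicated block is assumed to execute on roughly
// half of the iterations); scaledPredicatedBlockCost is a name invented for
// this example and is not an LLVM API.
#include <cassert>
#include <cstdio>

// Expected contribution of a predicated block to the scalar loop cost: the
// block's unpredicated cost divided by the reciprocal of its execution
// probability.
unsigned scaledPredicatedBlockCost(unsigned UnpredicatedBlockCost,
                                   unsigned ReciprocalBlockProb) {
  assert(ReciprocalBlockProb > 0 && "Reciprocal probability must be nonzero");
  return UnpredicatedBlockCost / ReciprocalBlockProb;
}

int main() {
  // A predicated block whose instructions cost 10 in the scalar loop and that
  // runs with probability 1/2 contributes an expected cost of 5. In the
  // vector loop the block is if-converted and executes unconditionally, so
  // its cost is not scaled.
  std::printf("expected scalar block cost = %u\n",
              scaledPredicatedBlockCost(10, 2));
  return 0;
}
// ---------------------------------------------------------------------------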
+ if (VF == 1 && Legal->blockNeedsPredication(BB)) + BlockCost.first /= getReciprocalPredBlockProb(); + + Cost.first += BlockCost.first; + Cost.second |= BlockCost.second; + } + + return Cost; +} + +/// \brief Gets Address Access SCEV after verifying that the access pattern +/// is loop invariant except the induction variable dependence. +/// +/// This SCEV can be sent to the Target in order to estimate the address +/// calculation cost. +static const SCEV *getAddressAccessSCEV( + Value *Ptr, + LoopVectorizationLegality *Legal, + ScalarEvolution *SE, + const Loop *TheLoop) { + auto *Gep = dyn_cast(Ptr); + if (!Gep) + return nullptr; + + // We are looking for a gep with all loop invariant indices except for one + // which should be an induction variable. + unsigned NumOperands = Gep->getNumOperands(); + for (unsigned i = 1; i < NumOperands; ++i) { + Value *Opd = Gep->getOperand(i); + if (!SE->isLoopInvariant(SE->getSCEV(Opd), TheLoop) && + !Legal->isInductionVariable(Opd)) + return nullptr; + } + + // Now we know we have a GEP ptr, %inv, %ind, %inv. return the Ptr SCEV. + return SE->getSCEV(Ptr); +} + +static bool isStrideMul(Instruction *I, LoopVectorizationLegality *Legal) { + return Legal->hasStride(I->getOperand(0)) || + Legal->hasStride(I->getOperand(1)); +} + +unsigned LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I, + unsigned VF) { + Type *ValTy = getMemInstValueType(I); + auto SE = PSE.getSE(); + + unsigned Alignment = getMemInstAlignment(I); + unsigned AS = getMemInstAddressSpace(I); + Value *Ptr = getPointerOperand(I); + Type *PtrTy = ToVectorTy(Ptr->getType(), VF); + + // Figure out whether the access is strided and get the stride value + // if it's known in compile time + const SCEV *PtrSCEV = getAddressAccessSCEV(Ptr, Legal, SE, TheLoop); + + // Get the cost of the scalar memory instruction and address computation. + unsigned Cost = VF * TTI.getAddressComputationCost(PtrTy, SE, PtrSCEV); + + Cost += VF * + TTI.getMemoryOpCost(I->getOpcode(), ValTy->getScalarType(), Alignment, + AS); + + // Get the overhead of the extractelement and insertelement instructions + // we might create due to scalarization. + Cost += getScalarizationOverhead(I, VF, TTI); + + // If we have a predicated store, it may not be executed for each vector + // lane. Scale the cost by the probability of executing the predicated + // block. 
+ if (Legal->isScalarWithPredication(I)) + Cost /= getReciprocalPredBlockProb(); + + return Cost; +} + +unsigned LoopVectorizationCostModel::getConsecutiveMemOpCost(Instruction *I, + unsigned VF) { + Type *ValTy = getMemInstValueType(I); + Type *VectorTy = ToVectorTy(ValTy, VF); + unsigned Alignment = getMemInstAlignment(I); + Value *Ptr = getPointerOperand(I); + unsigned AS = getMemInstAddressSpace(I); + int ConsecutiveStride = Legal->isConsecutivePtr(Ptr); + + assert((ConsecutiveStride == 1 || ConsecutiveStride == -1) && + "Stride should be 1 or -1 for consecutive memory access"); + unsigned Cost = 0; + if (Legal->isMaskRequired(I)) + Cost += TTI.getMaskedMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS); + else + Cost += TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS); + + bool Reverse = ConsecutiveStride < 0; + if (Reverse) + Cost += TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0); + return Cost; +} + +unsigned LoopVectorizationCostModel::getUniformMemOpCost(Instruction *I, + unsigned VF) { + LoadInst *LI = cast(I); + Type *ValTy = LI->getType(); + Type *VectorTy = ToVectorTy(ValTy, VF); + unsigned Alignment = LI->getAlignment(); + unsigned AS = LI->getPointerAddressSpace(); + + return TTI.getAddressComputationCost(ValTy) + + TTI.getMemoryOpCost(Instruction::Load, ValTy, Alignment, AS) + + TTI.getShuffleCost(TargetTransformInfo::SK_Broadcast, VectorTy); +} + +unsigned LoopVectorizationCostModel::getGatherScatterCost(Instruction *I, + unsigned VF) { + Type *ValTy = getMemInstValueType(I); + Type *VectorTy = ToVectorTy(ValTy, VF); + unsigned Alignment = getMemInstAlignment(I); + Value *Ptr = getPointerOperand(I); + + return TTI.getAddressComputationCost(VectorTy) + + TTI.getGatherScatterOpCost(I->getOpcode(), VectorTy, Ptr, + Legal->isMaskRequired(I), Alignment); +} + +unsigned LoopVectorizationCostModel::getInterleaveGroupCost(Instruction *I, + unsigned VF) { + Type *ValTy = getMemInstValueType(I); + Type *VectorTy = ToVectorTy(ValTy, VF); + unsigned AS = getMemInstAddressSpace(I); + + auto Group = Legal->getInterleavedAccessGroup(I); + assert(Group && "Fail to get an interleaved access group."); + + unsigned InterleaveFactor = Group->getFactor(); + Type *WideVecTy = VectorType::get(ValTy, VF * InterleaveFactor); + + // Holds the indices of existing members in an interleaved load group. + // An interleaved store group doesn't need this as it doesn't allow gaps. + SmallVector Indices; + if (isa(I)) { + for (unsigned i = 0; i < InterleaveFactor; i++) + if (Group->getMember(i)) + Indices.push_back(i); + } + + // Calculate the cost of the whole interleaved group. + unsigned Cost = TTI.getInterleavedMemoryOpCost(I->getOpcode(), WideVecTy, + Group->getFactor(), Indices, + Group->getAlignment(), AS); + + if (Group->isReverse()) + Cost += Group->getNumMembers() * + TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0); + return Cost; +} + +unsigned LoopVectorizationCostModel::getMemoryInstructionCost(Instruction *I, + unsigned VF) { + + // Calculate scalar cost only. Vectorization cost should be ready at this + // moment. 
+  if (VF == 1) {
+    Type *ValTy = getMemInstValueType(I);
+    unsigned Alignment = getMemInstAlignment(I);
+    unsigned AS = getMemInstAddressSpace(I);
+
+    return TTI.getAddressComputationCost(ValTy) +
+           TTI.getMemoryOpCost(I->getOpcode(), ValTy, Alignment, AS);
+  }
+  return getWideningCost(I, VF);
+}
+
+LoopVectorizationCostModel::VectorizationCostTy
+LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF) {
+  // If we know that this instruction will remain uniform, check the cost of
+  // the scalar version.
+  if (isUniformAfterVectorization(I, VF))
+    VF = 1;
+
+  if (VF > 1 && isProfitableToScalarize(I, VF))
+    return VectorizationCostTy(InstsToScalarize[VF][I], false);
+
+  Type *VectorTy;
+  unsigned C = getInstructionCost(I, VF, VectorTy);
+
+  bool TypeNotScalarized =
+      VF > 1 && !VectorTy->isVoidTy() && TTI.getNumberOfParts(VectorTy) < VF;
+  return VectorizationCostTy(C, TypeNotScalarized);
+}
+
+void LoopVectorizationCostModel::setCostBasedWideningDecision(unsigned VF) {
+  if (VF == 1)
+    return;
+  for (BasicBlock *BB : TheLoop->blocks()) {
+    // For each instruction in the old loop.
+    for (Instruction &I : *BB) {
+      Value *Ptr = getPointerOperand(&I);
+      if (!Ptr)
+        continue;
+
+      if (isa<LoadInst>(&I) && Legal->isUniform(Ptr)) {
+        // Scalar load + broadcast
+        unsigned Cost = getUniformMemOpCost(&I, VF);
+        setWideningDecision(&I, VF, CM_Scalarize, Cost);
+        continue;
+      }
+
+      // We assume that widening is the best solution when possible.
+      if (Legal->memoryInstructionCanBeWidened(&I, VF)) {
+        unsigned Cost = getConsecutiveMemOpCost(&I, VF);
+        setWideningDecision(&I, VF, CM_Widen, Cost);
+        continue;
+      }
+
+      // Choose between Interleaving, Gather/Scatter or Scalarization.
+      unsigned InterleaveCost = UINT_MAX;
+      unsigned NumAccesses = 1;
+      if (Legal->isAccessInterleaved(&I)) {
+        auto Group = Legal->getInterleavedAccessGroup(&I);
+        assert(Group && "Fail to get an interleaved access group.");
-    // If we've already analyzed the instruction, there's nothing to do.
-    if (ScalarCosts.count(I))
-      continue;
+        // Make one decision for the whole group.
+        if (getWideningDecision(&I, VF) != CM_Unknown)
+          continue;
-    // Compute the cost of the vector instruction. Note that this cost already
-    // includes the scalarization overhead of the predicated instruction.
-    unsigned VectorCost = getInstructionCost(I, VF).first;
+        NumAccesses = Group->getNumMembers();
+        InterleaveCost = getInterleaveGroupCost(&I, VF);
+      }
-    // Compute the cost of the scalarized instruction. This cost is the cost of
-    // the instruction as if it wasn't if-converted and instead remained in the
-    // predicated block. We will scale this cost by block probability after
-    // computing the scalarization overhead.
-    unsigned ScalarCost = VF * getInstructionCost(I, 1).first;
+      unsigned GatherScatterCost =
+          Legal->isLegalGatherOrScatter(&I)
+              ? getGatherScatterCost(&I, VF) * NumAccesses
+              : UINT_MAX;
-    // Compute the scalarization overhead of needed insertelement instructions
-    // and phi nodes.
-    if (Legal->isScalarWithPredication(I) && !I->getType()->isVoidTy()) {
-      ScalarCost += TTI.getScalarizationOverhead(ToVectorTy(I->getType(), VF),
-                                                 true, false);
-      ScalarCost += VF * TTI.getCFInstrCost(Instruction::PHI);
+      unsigned ScalarizationCost =
+          getMemInstScalarizationCost(&I, VF) * NumAccesses;
+
+      // Choose the better solution for the current VF,
+      // write down this decision and use it during vectorization. 
+ unsigned Cost; + InstWidening Decision; + if (InterleaveCost <= GatherScatterCost && + InterleaveCost < ScalarizationCost) { + Decision = CM_Interleave; + Cost = InterleaveCost; + } else if (GatherScatterCost < ScalarizationCost) { + Decision = CM_GatherScatter; + Cost = GatherScatterCost; + } else { + Decision = CM_Scalarize; + Cost = ScalarizationCost; + } + // If the instructions belongs to an interleave group, the whole group + // receives the same decision. The whole group receives the cost, but + // the cost will actually be assigned to one instruction. + if (auto Group = Legal->getInterleavedAccessGroup(&I)) + setWideningDecision(Group, VF, Decision, Cost); + else + setWideningDecision(&I, VF, Decision, Cost); } + } +} - // Compute the scalarization overhead of needed extractelement - // instructions. For each of the instruction's operands, if the operand can - // be scalarized, add it to the worklist; otherwise, account for the - // overhead. - for (Use &U : I->operands()) - if (auto *J = dyn_cast(U.get())) { - assert(VectorType::isValidElementType(J->getType()) && - "Instruction has non-scalar type"); - if (canBeScalarized(J)) - Worklist.push_back(J); - else if (needsExtract(J)) - ScalarCost += TTI.getScalarizationOverhead( - ToVectorTy(J->getType(),VF), false, true); +unsigned LoopVectorizationCostModel::getInstructionCost(Instruction *I, + unsigned VF, + Type *&VectorTy) { + Type *RetTy = I->getType(); + if (canTruncateToMinimalBitwidth(I, VF)) + RetTy = IntegerType::get(RetTy->getContext(), MinBWs[I]); + VectorTy = ToVectorTy(RetTy, VF); + auto SE = PSE.getSE(); + + // TODO: We need to estimate the cost of intrinsic calls. + switch (I->getOpcode()) { + case Instruction::GetElementPtr: + // We mark this instruction as zero-cost because the cost of GEPs in + // vectorized code depends on whether the corresponding memory instruction + // is scalarized or not. Therefore, we handle GEPs with the memory + // instruction cost. + return 0; + case Instruction::Br: { + return TTI.getCFInstrCost(I->getOpcode()); + } + case Instruction::PHI: { + auto *Phi = cast(I); + + // First-order recurrences are replaced by vector shuffles inside the loop. + if (VF > 1 && Legal->isFirstOrderRecurrence(Phi)) + return TTI.getShuffleCost(TargetTransformInfo::SK_ExtractSubvector, + VectorTy, VF - 1, VectorTy); + + // TODO: IF-converted IFs become selects. + return 0; + } + case Instruction::UDiv: + case Instruction::SDiv: + case Instruction::URem: + case Instruction::SRem: + // If we have a predicated instruction, it may not be executed for each + // vector lane. Get the scalarization cost and scale this amount by the + // probability of executing the predicated block. If the instruction is not + // predicated, we fall through to the next case. + if (VF > 1 && Legal->isScalarWithPredication(I)) { + unsigned Cost = 0; + + // These instructions have a non-void type, so account for the phi nodes + // that we will create. This cost is likely to be zero. The phi node + // cost, if any, should be scaled by the block probability because it + // models a copy at the end of each predicated block. + Cost += VF * TTI.getCFInstrCost(Instruction::PHI); + + // The cost of the non-predicated instruction. + Cost += VF * TTI.getArithmeticInstrCost(I->getOpcode(), RetTy); + + // The cost of insertelement and extractelement instructions needed for + // scalarization. + Cost += getScalarizationOverhead(I, VF, TTI); + + // Scale the cost by the probability of executing the predicated blocks. 
+ // This assumes the predicated block for each vector lane is equally + // likely. + return Cost / getReciprocalPredBlockProb(); + } + case Instruction::Add: + case Instruction::FAdd: + case Instruction::Sub: + case Instruction::FSub: + case Instruction::Mul: + case Instruction::FMul: + case Instruction::FDiv: + case Instruction::FRem: + case Instruction::Shl: + case Instruction::LShr: + case Instruction::AShr: + case Instruction::And: + case Instruction::Or: + case Instruction::Xor: { + // Since we will replace the stride by 1 the multiplication should go away. + if (I->getOpcode() == Instruction::Mul && isStrideMul(I, Legal)) + return 0; + // Certain instructions can be cheaper to vectorize if they have a constant + // second vector operand. One example of this are shifts on x86. + TargetTransformInfo::OperandValueKind Op1VK = + TargetTransformInfo::OK_AnyValue; + TargetTransformInfo::OperandValueKind Op2VK = + TargetTransformInfo::OK_AnyValue; + TargetTransformInfo::OperandValueProperties Op1VP = + TargetTransformInfo::OP_None; + TargetTransformInfo::OperandValueProperties Op2VP = + TargetTransformInfo::OP_None; + Value *Op2 = I->getOperand(1); + + // Check for a splat or for a non uniform vector of constants. + if (isa(Op2)) { + ConstantInt *CInt = cast(Op2); + if (CInt && CInt->getValue().isPowerOf2()) + Op2VP = TargetTransformInfo::OP_PowerOf2; + Op2VK = TargetTransformInfo::OK_UniformConstantValue; + } else if (isa(Op2) || isa(Op2)) { + Op2VK = TargetTransformInfo::OK_NonUniformConstantValue; + Constant *SplatValue = cast(Op2)->getSplatValue(); + if (SplatValue) { + ConstantInt *CInt = dyn_cast(SplatValue); + if (CInt && CInt->getValue().isPowerOf2()) + Op2VP = TargetTransformInfo::OP_PowerOf2; + Op2VK = TargetTransformInfo::OK_UniformConstantValue; } + } else if (Legal->isUniform(Op2)) { + Op2VK = TargetTransformInfo::OK_UniformValue; + } + SmallVector Operands(I->operand_values()); + return TTI.getArithmeticInstrCost(I->getOpcode(), VectorTy, Op1VK, + Op2VK, Op1VP, Op2VP, Operands); + } + case Instruction::Select: { + SelectInst *SI = cast(I); + const SCEV *CondSCEV = SE->getSCEV(SI->getCondition()); + bool ScalarCond = (SE->isLoopInvariant(CondSCEV, TheLoop)); + Type *CondTy = SI->getCondition()->getType(); + if (!ScalarCond) + CondTy = VectorType::get(CondTy, VF); - // Scale the total scalar cost by block probability. - ScalarCost /= getReciprocalPredBlockProb(); + return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, CondTy); + } + case Instruction::ICmp: + case Instruction::FCmp: { + Type *ValTy = I->getOperand(0)->getType(); + Instruction *Op0AsInstruction = dyn_cast(I->getOperand(0)); + if (canTruncateToMinimalBitwidth(Op0AsInstruction, VF)) + ValTy = IntegerType::get(ValTy->getContext(), MinBWs[Op0AsInstruction]); + VectorTy = ToVectorTy(ValTy, VF); + return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy); + } + case Instruction::Store: + case Instruction::Load: { + VectorTy = ToVectorTy(getMemInstValueType(I), VF); + return getMemoryInstructionCost(I, VF); + } + case Instruction::ZExt: + case Instruction::SExt: + case Instruction::FPToUI: + case Instruction::FPToSI: + case Instruction::FPExt: + case Instruction::PtrToInt: + case Instruction::IntToPtr: + case Instruction::SIToFP: + case Instruction::UIToFP: + case Instruction::Trunc: + case Instruction::FPTrunc: + case Instruction::BitCast: { + // We optimize the truncation of induction variables having constant + // integer steps. The cost of these truncations is the same as the scalar + // operation. 
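+    // Illustrative example: for an induction %iv = {0,+,1} of type i64,
+    // "%t = trunc i64 %iv to i32" is costed as a single scalar trunc, since
+    // the truncation is folded into the widened induction itself.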
+ if (isOptimizableIVTruncate(I, VF)) { + auto *Trunc = cast(I); + return TTI.getCastInstrCost(Instruction::Trunc, Trunc->getDestTy(), + Trunc->getSrcTy()); + } + + Type *SrcScalarTy = I->getOperand(0)->getType(); + Type *SrcVecTy = ToVectorTy(SrcScalarTy, VF); + if (canTruncateToMinimalBitwidth(I, VF)) { + // This cast is going to be shrunk. This may remove the cast or it might + // turn it into slightly different cast. For example, if MinBW == 16, + // "zext i8 %1 to i32" becomes "zext i8 %1 to i16". + // + // Calculate the modified src and dest types. + Type *MinVecTy = VectorTy; + if (I->getOpcode() == Instruction::Trunc) { + SrcVecTy = smallestIntegerVectorType(SrcVecTy, MinVecTy); + VectorTy = + largestIntegerVectorType(ToVectorTy(I->getType(), VF), MinVecTy); + } else if (I->getOpcode() == Instruction::ZExt || + I->getOpcode() == Instruction::SExt) { + SrcVecTy = largestIntegerVectorType(SrcVecTy, MinVecTy); + VectorTy = + smallestIntegerVectorType(ToVectorTy(I->getType(), VF), MinVecTy); + } + } - // Compute the discount. A non-negative discount means the vector version - // of the instruction costs more, and scalarizing would be beneficial. - Discount += VectorCost - ScalarCost; - ScalarCosts[I] = ScalarCost; + return TTI.getCastInstrCost(I->getOpcode(), VectorTy, SrcVecTy); } - - return Discount; + case Instruction::Call: { + bool NeedToScalarize; + CallInst *CI = cast(I); + unsigned CallCost = getVectorCallCost(CI, VF, TTI, TLI, NeedToScalarize); + if (getVectorIntrinsicIDForCall(CI, TLI)) + return std::min(CallCost, getVectorIntrinsicCost(CI, VF, TTI, TLI)); + return CallCost; + } + default: + // The cost of executing VF copies of the scalar instruction. This opcode + // is unknown. Assume that it is the same as 'mul'. + return VF * TTI.getArithmeticInstrCost(Instruction::Mul, VectorTy) + + getScalarizationOverhead(I, VF, TTI); + } // end of switch. } -LoopVectorizationCostModel::VectorizationCostTy -LoopVectorizationCostModel::expectedCost(unsigned VF) { - VectorizationCostTy Cost; - - // Collect Uniform and Scalar instructions after vectorization with VF. - collectUniformsAndScalars(VF); +char LoopVectorize::ID = 0; +static const char lv_name[] = "Loop Vectorization"; +INITIALIZE_PASS_BEGIN(LoopVectorize, LV_NAME, lv_name, false, false) +INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass) +INITIALIZE_PASS_DEPENDENCY(BasicAAWrapperPass) +INITIALIZE_PASS_DEPENDENCY(AAResultsWrapperPass) +INITIALIZE_PASS_DEPENDENCY(GlobalsAAWrapperPass) +INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker) +INITIALIZE_PASS_DEPENDENCY(BlockFrequencyInfoWrapperPass) +INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass) +INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass) +INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass) +INITIALIZE_PASS_DEPENDENCY(LoopAccessLegacyAnalysis) +INITIALIZE_PASS_DEPENDENCY(DemandedBitsWrapperPass) +INITIALIZE_PASS_DEPENDENCY(OptimizationRemarkEmitterWrapperPass) +INITIALIZE_PASS_END(LoopVectorize, LV_NAME, lv_name, false, false) - // Collect the instructions (and their associated costs) that will be more - // profitable to scalarize. - collectInstsToScalarize(VF); +namespace llvm { +Pass *createLoopVectorizePass(bool NoUnrolling, bool AlwaysVectorize) { + return new LoopVectorize(NoUnrolling, AlwaysVectorize); +} +} - // For each block. - for (BasicBlock *BB : TheLoop->blocks()) { - VectorizationCostTy BlockCost; +bool LoopVectorizationCostModel::isConsecutiveLoadOrStore(Instruction *Inst) { - // For each instruction in the old loop. 
- for (Instruction &I : *BB) { - // Skip dbg intrinsics. - if (isa(I)) - continue; + // Check if the pointer operand of a load or store instruction is + // consecutive. + if (auto *Ptr = getPointerOperand(Inst)) + return Legal->isConsecutivePtr(Ptr); + return false; +} - // Skip ignored values. - if (ValuesToIgnore.count(&I)) - continue; +void LoopVectorizationCostModel::collectValuesToIgnore() { + // Ignore ephemeral values. + CodeMetrics::collectEphemeralValues(TheLoop, AC, ValuesToIgnore); - VectorizationCostTy C = getInstructionCost(&I, VF); + // Ignore type-promoting instructions we identified during reduction + // detection. + for (auto &Reduction : *Legal->getReductionVars()) { + RecurrenceDescriptor &RedDes = Reduction.second; + SmallPtrSetImpl &Casts = RedDes.getCastInsts(); + VecValuesToIgnore.insert(Casts.begin(), Casts.end()); + } +} - // Check if we should override the cost. - if (ForceTargetInstructionCost.getNumOccurrences() > 0) - C.first = ForceTargetInstructionCost; +LoopVectorizationCostModel::VectorizationFactor +LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF, + unsigned MaxVF) { + if (UserVF) { + DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n"); + if (UserVF == 1) + return {UserVF, 0}; + assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two"); + // Collect Uniform and Scalar instructions after vectorization with VF. + CM->collectUniformsAndScalars(UserVF); + // Collect the instructions (and their associated costs) that will be more + // profitable to scalarize. + CM->collectInstsToScalarize(UserVF); + buildInitialVPlans(UserVF, UserVF); + DEBUG(printCurrentPlans("Initial VPlans", dbgs())); + optimizePredicatedInstructions(); + DEBUG(printCurrentPlans("After optimize predicated instructions", dbgs())); + return {UserVF, 0}; + } + if (MaxVF == 1) + return {1, 0}; + + assert(MaxVF > 1 && "MaxVF is zero."); + for (unsigned i = 2; i <= MaxVF; i *= 2) { + // Collect Uniform and Scalar instructions after vectorization with VF. + CM->collectUniformsAndScalars(i); + // Collect the instructions (and their associated costs) that will be more + // profitable to scalarize. + CM->collectInstsToScalarize(i); + } + buildInitialVPlans(2, MaxVF); + DEBUG(printCurrentPlans("Initial VPlans", dbgs())); + optimizePredicatedInstructions(); + DEBUG(printCurrentPlans("After optimize predicated instructions", dbgs())); + // Select the optimal vectorization factor. + return CM->selectVectorizationFactor(OptForSize, MaxVF); +} - BlockCost.first += C.first; - BlockCost.second |= C.second; - DEBUG(dbgs() << "LV: Found an estimated cost of " << C.first << " for VF " - << VF << " For instruction: " << I << '\n'); +void LoopVectorizationPlanner::printCurrentPlans(const std::string &Title, + raw_ostream &O) { + auto printPlan = [&](VPlan *Plan, const SmallVectorImpl &VFs, + const std::string &Prefix) { + std::string Title; + raw_string_ostream RSO(Title); + RSO << Prefix << " for VF="; + if (VFs.size() == 1) + RSO << VFs[0]; + else { + RSO << "{"; + bool First = true; + for (unsigned VF : VFs) { + if (!First) + RSO << ","; + RSO << VF; + First = false; + } + RSO << "}"; } + VPlanPrinter PlanPrinter(O, *Plan); + PlanPrinter.dump(RSO.str()); + }; - // If we are vectorizing a predicated block, it will have been - // if-converted. This means that the block's instructions (aside from - // stores and instructions that may divide by zero) will now be - // unconditionally executed. For the scalar case, we may not always execute - // the predicated block. 
Thus, scale the block's cost by the probability of - // executing it. - if (VF == 1 && Legal->blockNeedsPredication(BB)) - BlockCost.first /= getReciprocalPredBlockProb(); + if (VPlans.empty()) + return; - Cost.first += BlockCost.first; - Cost.second |= BlockCost.second; + VPlan *Current = VPlans.begin()->second.get(); + + SmallVector VFs; + for (auto &Entry : VPlans) { + VPlan *Plan = Entry.second.get(); + if (Plan != Current) { + // Hit another VPlan. Print the current VPlan for the VFs it served thus + // far and move on to the VPlan we just encountered. + printPlan(Current, VFs, Title); + Current = Plan; + VFs.clear(); + } + // Add VF to the list of VFs served by current VPlan. + VFs.push_back(Entry.first); } - - return Cost; + // Print the current VPlan. + printPlan(Current, VFs, Title); } -/// \brief Gets Address Access SCEV after verifying that the access pattern -/// is loop invariant except the induction variable dependence. -/// -/// This SCEV can be sent to the Target in order to estimate the address -/// calculation cost. -static const SCEV *getAddressAccessSCEV( - Value *Ptr, - LoopVectorizationLegality *Legal, - ScalarEvolution *SE, - const Loop *TheLoop) { - auto *Gep = dyn_cast(Ptr); - if (!Gep) - return nullptr; - - // We are looking for a gep with all loop invariant indices except for one - // which should be an induction variable. - unsigned NumOperands = Gep->getNumOperands(); - for (unsigned i = 1; i < NumOperands; ++i) { - Value *Opd = Gep->getOperand(i); - if (!SE->isLoopInvariant(SE->getSCEV(Opd), TheLoop) && - !Legal->isInductionVariable(Opd)) - return nullptr; +std::pair +LoopVectorizationPlanner::widenIntInduction(VPlan *Plan, unsigned StartRangeVF, + unsigned &EndRangeVF, PHINode *IV, + TruncInst *Trunc) { + // The value from the original loop to which we are mapping the new + // induction variable. + Instruction *EntryVal = Trunc ? cast(Trunc) : IV; + // Determine if we want a scalar version of the induction variable. This + // is true if the induction variable itself is not widened, or if it has + // at least one user in the loop that is not widened. + auto NeedsScalarInduction = [&](unsigned VF) -> bool { + if (shouldScalarizeInstruction(IV, VF)) + return true; + auto isScalarInst = [&](User *U) -> bool { + auto *I = cast(U); + return (TheLoop->contains(I) && shouldScalarizeInstruction(I, VF)); + }; + return any_of(IV->users(), isScalarInst); + }; + bool NeedsScalarIV = + testVFRange(NeedsScalarInduction, StartRangeVF, EndRangeVF); + // Generate the widening recipe. + auto *WIIRecipe = new VPWidenIntInductionRecipe(NeedsScalarIV, IV, Trunc); + if (!NeedsScalarIV) + return std::make_pair(WIIRecipe, nullptr); + + // Create scalar steps that can be used by instructions we will later + // scalarize. Note that the addition of the scalar steps will not + // increase the number of instructions in the loop in the common case + // prior to InstCombine. We will be trading one vector extract for + // each scalar step. + auto *BSSRecipe = new VPBuildScalarStepsRecipe(WIIRecipe, EntryVal, Plan); + // Determine the number of scalars we need to generate for each unroll + // iteration. If EntryVal is uniform, we only need to generate the + // first lane. Otherwise, we generate all VF values. 
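+  // A common uniform case (illustrative): an induction whose only users are
+  // the address computations of consecutive, widened memory accesses; only
+  // lane 0 of each unroll part is then needed.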
+ auto isUniformAfterVectorization = [&](unsigned VF) -> bool { + return CM->isUniformAfterVectorization(cast(EntryVal), VF); + }; + if (testVFRange(isUniformAfterVectorization, StartRangeVF, EndRangeVF)) { + VPlanUtilsLoopVectorizer PlanUtils(Plan); + PlanUtils.designateLaneZero(BSSRecipe); } - - // Now we know we have a GEP ptr, %inv, %ind, %inv. return the Ptr SCEV. - return SE->getSCEV(Ptr); -} - -static bool isStrideMul(Instruction *I, LoopVectorizationLegality *Legal) { - return Legal->hasStride(I->getOperand(0)) || - Legal->hasStride(I->getOperand(1)); + return std::make_pair(WIIRecipe, BSSRecipe); } -unsigned LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I, - unsigned VF) { - Type *ValTy = getMemInstValueType(I); - auto SE = PSE.getSE(); - - unsigned Alignment = getMemInstAlignment(I); - unsigned AS = getMemInstAddressSpace(I); - Value *Ptr = getPointerOperand(I); - Type *PtrTy = ToVectorTy(Ptr->getType(), VF); +// Determine if a given instruction will remain scalar after vectorization, +// for VF \p StartRangeVF. Reset \p EndRangeVF to the minimal VF where this +// decision does not hold, if it's less than the given \p EndRangeVF. +bool LoopVectorizationPlanner::willBeScalarized(Instruction *I, + unsigned StartRangeVF, + unsigned &EndRangeVF) { + if (!isa(I)) { + auto isScalarAfterVectorization = [&](unsigned VF) -> bool { + return CM->isScalarAfterVectorization(I, VF); + }; + if (testVFRange(isScalarAfterVectorization, StartRangeVF, EndRangeVF)) + return true; + } - // Figure out whether the access is strided and get the stride value - // if it's known in compile time - const SCEV *PtrSCEV = getAddressAccessSCEV(Ptr, Legal, SE, TheLoop); + if (isa(I)) { - // Get the cost of the scalar memory instruction and address computation. - unsigned Cost = VF * TTI.getAddressComputationCost(PtrTy, SE, PtrSCEV); + auto *CI = cast(I); + Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI); + if (ID && (ID == Intrinsic::assume || ID == Intrinsic::lifetime_end || + ID == Intrinsic::lifetime_start)) + return true; - Cost += VF * - TTI.getMemoryOpCost(I->getOpcode(), ValTy->getScalarType(), Alignment, - AS); + // The following case may be scalarized depending on the VF. + // The flag shows whether we use Intrinsic or a usual Call for vectorized + // version of the instruction. + // Is it beneficial to perform intrinsic call compared to lib call? + auto WillBeScalarized = [&](unsigned VF) -> bool { + bool NeedToScalarize; + unsigned CallCost = getVectorCallCost(CI, VF, *TTI, TLI, NeedToScalarize); + bool UseVectorIntrinsic = + ID && getVectorIntrinsicCost(CI, VF, *TTI, TLI) <= CallCost; + return !UseVectorIntrinsic && NeedToScalarize; + }; + return testVFRange(WillBeScalarized, StartRangeVF, EndRangeVF); + } - // Get the overhead of the extractelement and insertelement instructions - // we might create due to scalarization. - Cost += getScalarizationOverhead(I, VF, TTI); + if (isa(I) || isa(I)) { - // If we have a predicated store, it may not be executed for each vector - // lane. Scale the cost by the probability of executing the predicated - // block. - if (Legal->isScalarWithPredication(I)) - Cost /= getReciprocalPredBlockProb(); + // TODO: refactor memoryInstructionMustBeScalarized() to invoke only the + // (last) part that depends on VF. 
+ auto WillBeScalarized = [&](unsigned VF) -> bool { + LoopVectorizationCostModel::InstWidening Decision = + CM->getWideningDecision(I, VF); + assert(Decision != LoopVectorizationCostModel::CM_Unknown && + "CM decision should be taken at this point"); + return Decision == LoopVectorizationCostModel::CM_Scalarize; + }; + return testVFRange(WillBeScalarized, StartRangeVF, EndRangeVF); + } + + static DenseSet VectorizableOpcodes = { + Instruction::Br, Instruction::PHI, Instruction::UDiv, + Instruction::SDiv, Instruction::SRem, Instruction::URem, + Instruction::Add, Instruction::FAdd, Instruction::Sub, + Instruction::FSub, Instruction::Mul, Instruction::FMul, + Instruction::FDiv, Instruction::FRem, Instruction::Shl, + Instruction::LShr, Instruction::AShr, Instruction::And, + Instruction::Or, Instruction::Xor, Instruction::Select, + Instruction::ICmp, Instruction::FCmp, Instruction::Store, + Instruction::Load, Instruction::ZExt, Instruction::SExt, + Instruction::FPToUI, Instruction::FPToSI, Instruction::FPExt, + Instruction::PtrToInt, Instruction::IntToPtr, Instruction::SIToFP, + Instruction::UIToFP, Instruction::Trunc, Instruction::FPTrunc, + Instruction::BitCast, Instruction::Call}; + + if (!VectorizableOpcodes.count(I->getOpcode())) + return true; - return Cost; + // Scalarize instructions found to be more profitable if scalarized. Limit + // EndRangeVF to the last VF this is continuously true for. + auto isProfitableToScalarize = [&](unsigned VF) -> bool { + return CM->isProfitableToScalarize(I, VF); + }; + return testVFRange(isProfitableToScalarize, StartRangeVF, EndRangeVF); } -unsigned LoopVectorizationCostModel::getConsecutiveMemOpCost(Instruction *I, - unsigned VF) { - Type *ValTy = getMemInstValueType(I); - Type *VectorTy = ToVectorTy(ValTy, VF); - unsigned Alignment = getMemInstAlignment(I); - Value *Ptr = getPointerOperand(I); - unsigned AS = getMemInstAddressSpace(I); - int ConsecutiveStride = Legal->isConsecutivePtr(Ptr); - - assert((ConsecutiveStride == 1 || ConsecutiveStride == -1) && - "Stride should be 1 or -1 for consecutive memory access"); - unsigned Cost = 0; - if (Legal->isMaskRequired(I)) - Cost += TTI.getMaskedMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS); - else - Cost += TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS); +unsigned LoopVectorizationPlanner::buildInitialVPlans(unsigned MinVF, + unsigned MaxVF) { + ILV->collectTriviallyDeadInstructions(TheLoop, Legal, DeadInstructions); - bool Reverse = ConsecutiveStride < 0; - if (Reverse) - Cost += TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0); - return Cost; -} + unsigned StartRangeVF = MinVF; + unsigned EndRangeVF = MaxVF + 1; -unsigned LoopVectorizationCostModel::getUniformMemOpCost(Instruction *I, - unsigned VF) { - LoadInst *LI = cast(I); - Type *ValTy = LI->getType(); - Type *VectorTy = ToVectorTy(ValTy, VF); - unsigned Alignment = LI->getAlignment(); - unsigned AS = LI->getPointerAddressSpace(); + unsigned i = 0; + for (; StartRangeVF < EndRangeVF; ++i) { + std::shared_ptr Plan = buildInitialVPlan(StartRangeVF, EndRangeVF); - return TTI.getAddressComputationCost(ValTy) + - TTI.getMemoryOpCost(Instruction::Load, ValTy, Alignment, AS) + - TTI.getShuffleCost(TargetTransformInfo::SK_Broadcast, VectorTy); -} + for (unsigned TmpVF = StartRangeVF; TmpVF < EndRangeVF; TmpVF *= 2) + VPlans[TmpVF] = Plan; -unsigned LoopVectorizationCostModel::getGatherScatterCost(Instruction *I, - unsigned VF) { - Type *ValTy = getMemInstValueType(I); - Type *VectorTy = ToVectorTy(ValTy, VF); - 
unsigned Alignment = getMemInstAlignment(I); - Value *Ptr = getPointerOperand(I); + StartRangeVF = EndRangeVF; + EndRangeVF = MaxVF + 1; + } - return TTI.getAddressComputationCost(VectorTy) + - TTI.getGatherScatterOpCost(I->getOpcode(), VectorTy, Ptr, - Legal->isMaskRequired(I), Alignment); + return i; } -unsigned LoopVectorizationCostModel::getInterleaveGroupCost(Instruction *I, - unsigned VF) { - Type *ValTy = getMemInstValueType(I); - Type *VectorTy = ToVectorTy(ValTy, VF); - unsigned AS = getMemInstAddressSpace(I); - - auto Group = Legal->getInterleavedAccessGroup(I); - assert(Group && "Fail to get an interleaved access group."); - - unsigned InterleaveFactor = Group->getFactor(); - Type *WideVecTy = VectorType::get(ValTy, VF * InterleaveFactor); +bool LoopVectorizationPlanner::testVFRange( + const std::function &Predicate, unsigned StartRangeVF, + unsigned &EndRangeVF) { + bool StartResult = Predicate(StartRangeVF); - // Holds the indices of existing members in an interleaved load group. - // An interleaved store group doesn't need this as it doesn't allow gaps. - SmallVector Indices; - if (isa(I)) { - for (unsigned i = 0; i < InterleaveFactor; i++) - if (Group->getMember(i)) - Indices.push_back(i); + for (unsigned TmpVF = StartRangeVF * 2; TmpVF < EndRangeVF; TmpVF *= 2) { + bool TmpResult = Predicate(TmpVF); + if (TmpResult != StartResult) { + EndRangeVF = TmpVF; + break; + } } - - // Calculate the cost of the whole interleaved group. - unsigned Cost = TTI.getInterleavedMemoryOpCost(I->getOpcode(), WideVecTy, - Group->getFactor(), Indices, - Group->getAlignment(), AS); - - if (Group->isReverse()) - Cost += Group->getNumMembers() * - TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0); - return Cost; + + return StartResult; } -unsigned LoopVectorizationCostModel::getMemoryInstructionCost(Instruction *I, - unsigned VF) { +std::shared_ptr +LoopVectorizationPlanner::buildInitialVPlan(unsigned StartRangeVF, + unsigned &EndRangeVF) { + + std::shared_ptr SharedPlan = std::make_shared(); + VPlan *Plan = SharedPlan.get(); + VPlanUtilsLoopVectorizer PlanUtils(Plan); + + // Create a dummy entry VPBasicBlock to start building the VPlan. + VPBlockBase *PreviousVPBlock = PlanUtils.createBasicBlock(); + VPBlockBase *PreEntry = PreviousVPBlock; + Plan->setEntry(PreEntry); // only to support printing during construction. + + // Return the interleave group a given instruction is part of in the context + // of a specific VF. + auto getInterleaveGroup = [&](Instruction *I, + unsigned VF) -> const InterleaveGroup * { + if (VF < 2) + return nullptr; // Query is illegal for VF == 1 + LoopVectorizationCostModel::InstWidening Decision = + CM->getWideningDecision(I, VF); + if (Decision != LoopVectorizationCostModel::CM_Interleave) + return nullptr; + const InterleaveGroup *IG = Legal->getInterleavedAccessGroup(I); + assert(IG && "Instruction to interleave not part of any group"); + return IG; + }; - // Calculate scalar cost only. Vectorization cost should be ready at this - // moment. - if (VF == 1) { - Type *ValTy = getMemInstValueType(I); - unsigned Alignment = getMemInstAlignment(I); - unsigned AS = getMemInstAlignment(I); + // Check if given Instruction should open an interleave group. 
+ auto isPrimaryIGMember = + [&](Instruction *I) -> std::function { + return [=](unsigned VF) -> bool { + const InterleaveGroup *IG = getInterleaveGroup(I, VF); + return IG && I == IG->getInsertPos(); + }; + }; - return TTI.getAddressComputationCost(ValTy) + - TTI.getMemoryOpCost(I->getOpcode(), ValTy, Alignment, AS); - } - return getWideningCost(I, VF); -} + // Check if given Instruction is handled as part of an interleave group. + auto isAdjunctIGMember = + [&](Instruction *I) -> std::function { + return [=](unsigned VF) -> bool { + const InterleaveGroup *IG = getInterleaveGroup(I, VF); + return IG && I != IG->getInsertPos(); + }; + }; -LoopVectorizationCostModel::VectorizationCostTy -LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF) { - // If we know that this instruction will remain uniform, check the cost of - // the scalar version. - if (isUniformAfterVectorization(I, VF)) - VF = 1; + /// Determine whether \p K is a truncation based on an induction variable that + /// can be optimized. + auto isOptimizableIVTruncate = + [&](Instruction *K) -> std::function { + return + [=](unsigned VF) -> bool { return CM->isOptimizableIVTruncate(K, VF); }; + }; - if (VF > 1 && isProfitableToScalarize(I, VF)) - return VectorizationCostTy(InstsToScalarize[VF][I], false); + // Scan the body of the loop in a topological order to visit each basic block + // after having visited its predecessor basic blocks. + LoopBlocksDFS DFS(TheLoop); + DFS.perform(LI); - Type *VectorTy; - unsigned C = getInstructionCost(I, VF, VectorTy); + for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) { + // Relevent instructions from basic block BB will be grouped into VPRecipe + // ingredients and fill a new VPBasicBlock. + VPBasicBlock *VPBB = nullptr; + VPOneByOneRecipeBase *LastOBORecipe = nullptr; + + auto appendRecipe = [&](VPRecipeBase *Recipe) -> void { + if (VPBB) + PlanUtils.appendRecipeToBasicBlock(Recipe, VPBB); + else { + VPBB = PlanUtils.createBasicBlock(Recipe); + PlanUtils.setSuccessor(PreviousVPBlock, VPBB); + PreviousVPBlock = VPBB; + } + LastOBORecipe = dyn_cast(Recipe); + }; - bool TypeNotScalarized = - VF > 1 && !VectorTy->isVoidTy() && TTI.getNumberOfParts(VectorTy) < VF; - return VectorizationCostTy(C, TypeNotScalarized); -} + for (auto I = BB->begin(), E = BB->end(); I != E; ++I) { + Instruction *Instr = &*I; -void LoopVectorizationCostModel::setCostBasedWideningDecision(unsigned VF) { - if (VF == 1) - return; - for (BasicBlock *BB : TheLoop->blocks()) { - // For each instruction in the old loop. - for (Instruction &I : *BB) { - Value *Ptr = getPointerOperand(&I); - if (!Ptr) + // Filter out irrelevant instructions. + if (DeadInstructions.count(Instr) || isa(Instr) || + isa(Instr)) continue; - if (isa(&I) && Legal->isUniform(Ptr)) { - // Scalar load + broadcast - unsigned Cost = getUniformMemOpCost(&I, VF); - setWideningDecision(&I, VF, CM_Scalarize, Cost); - continue; + if (isa(Instr) || isa(Instr)) { + // Ignore IG's adjunct members - will be handled by the interleave group + // recipe to be generated by the primary member of the interleave group + // which is the insertion point and bears the cost for the entire group. + if (testVFRange(isAdjunctIGMember(Instr), StartRangeVF, EndRangeVF)) + continue; + + if (testVFRange(isPrimaryIGMember(Instr), StartRangeVF, EndRangeVF)) { + // Instr points to the insert position of an interleave group: first + // load or last store. 
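+          // The single recipe created here covers every member of the group;
+          // e.g. (illustrative) a factor-2 load group {A[2i], A[2i+1]} is
+          // later emitted as one wide load plus shuffles for both members.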
+ const InterleaveGroup *IG = Legal->getInterleavedAccessGroup(Instr); + appendRecipe(new VPInterleaveRecipe(IG, Plan)); + continue; + } } - // We assume that widening is the best solution when possible. - if (Legal->memoryInstructionCanBeWidened(&I, VF)) { - unsigned Cost = getConsecutiveMemOpCost(&I, VF); - setWideningDecision(&I, VF, CM_Widen, Cost); + if (Legal->isScalarWithPredication(Instr)) { + // Instructions marked for predication are scalarized and placed under + // an if-then construct to prevent side-effects. + DEBUG(dbgs() << "LV: Scalarizing and predicating:" << *Instr << '\n'); + + // Build the triangular if-then region. Start with VPBB holding Instr. + BasicBlock::iterator J = I; + VPRecipeBase *Recipe = new VPScalarizeOneByOneRecipe(I, ++J, Plan); + VPBB = PlanUtils.createBasicBlock(Recipe); + + // Build the entry and exit VPBB's of the triangle. + VPRegionBlock *Region = PlanUtils.createRegion(true); + VPExtractMaskBitRecipe *R = new VPExtractMaskBitRecipe(&*BB); + VPBasicBlock *Entry = PlanUtils.createBasicBlock(R); + Recipe = new VPMergeScalarizeBranchRecipe(Instr); + VPBasicBlock *Exit = PlanUtils.createBasicBlock(Recipe); + // Note: first set Entry as region entry and then connect successors + // starting from it in order, to propagate the "parent" of each + // VPBasicBlock. + PlanUtils.setRegionEntry(Region, Entry); + PlanUtils.setRegionExit(Region, Exit); + PlanUtils.setTwoSuccessors(Entry, R, VPBB, Exit); + PlanUtils.setSuccessor(VPBB, Exit); + PlanUtils.setSuccessor(PreviousVPBlock, Region); + PreviousVPBlock = Region; + + // Next instructions should start forming a VPBasicBlock of their own. + VPBB = nullptr; + LastOBORecipe = nullptr; + + // Record predicated instructions for later optimizations. + PredicatedInstructions.insert(&*I); + continue; } - // Choose between Interleaving, Gather/Scatter or Scalarization. - unsigned InterleaveCost = UINT_MAX; - unsigned NumAccesses = 1; - if (Legal->isAccessInterleaved(&I)) { - auto Group = Legal->getInterleavedAccessGroup(&I); - assert(Group && "Fail to get an interleaved access group."); + // Check if this is an integer induction. If so, build the recipes that + // produce its scalar and vector values. - // Make one decision for the whole group. - if (getWideningDecision(&I, VF) != CM_Unknown) + if (PHINode *Phi = dyn_cast(Instr)) { + InductionDescriptor II = Legal->getInductionVars()->lookup(Phi); + if (II.getKind() == InductionDescriptor::IK_IntInduction) { + auto Recipes = widenIntInduction(Plan, StartRangeVF, EndRangeVF, Phi); + appendRecipe(Recipes.first); + if (Recipes.second) + appendRecipe(Recipes.second); continue; - - NumAccesses = Group->getNumMembers(); - InterleaveCost = getInterleaveGroupCost(&I, VF); + } } - unsigned GatherScatterCost = - Legal->isLegalGatherOrScatter(&I) - ? getGatherScatterCost(&I, VF) * NumAccesses - : UINT_MAX; - - unsigned ScalarizationCost = - getMemInstScalarizationCost(&I, VF) * NumAccesses; + // Optimize the special case where the source is a constant integer + // induction variable. Notice that we can only optimize the 'trunc' case + // because (a) FP conversions lose precision, (b) sext/zext may wrap, and + // (c) other casts depend on pointer size. 
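+      // In that case the trunc is absorbed by the widened induction recipe
+      // created below rather than being emitted as a separate cast.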
+ if (isa(Instr) && testVFRange(isOptimizableIVTruncate(Instr), + StartRangeVF, EndRangeVF)) { + auto *InductionPhi = cast(Instr->getOperand(0)); + auto Recipes = widenIntInduction(Plan, StartRangeVF, EndRangeVF, + InductionPhi, cast(Instr)); + appendRecipe(Recipes.first); + if (Recipes.second) + appendRecipe(Recipes.second); + continue; + } - // Choose better solution for the current VF, - // write down this decision and use it during vectorization. - unsigned Cost; - InstWidening Decision; - if (InterleaveCost <= GatherScatterCost && - InterleaveCost < ScalarizationCost) { - Decision = CM_Interleave; - Cost = InterleaveCost; - } else if (GatherScatterCost < ScalarizationCost) { - Decision = CM_GatherScatter; - Cost = GatherScatterCost; - } else { - Decision = CM_Scalarize; - Cost = ScalarizationCost; + // Check if instruction is to be replicated. + bool Scalarized = willBeScalarized(Instr, StartRangeVF, EndRangeVF); + DEBUG(if (Scalarized) dbgs() << "LV: Scalarizing:" << *Instr << "\n"); + + // Default: vectorize/scalarize this instruction using a one-by-one + // recipe. We optimize the common case where consecutive instructions + // can be represented by a single OBO recipe. + if (!LastOBORecipe || LastOBORecipe->isScalarizing() != Scalarized || + !PlanUtils.appendInstruction(LastOBORecipe, Instr)) { + auto J = I; + appendRecipe(PlanUtils.createOneByOneRecipe(I, ++J, Plan, Scalarized)); } - // If the instructions belongs to an interleave group, the whole group - // receives the same decision. The whole group receives the cost, but - // the cost will actually be assigned to one instruction. - if (auto Group = Legal->getInterleavedAccessGroup(&I)) - setWideningDecision(Group, VF, Decision, Cost); - else - setWideningDecision(&I, VF, Decision, Cost); } } + // PreviousVPBlock now holds the exit block of Plan. + // Set entry block of Plan to the successor of PreEntry, and discard PreEntry. + assert(PreEntry->getSuccessors().size() == 1 && "Plan has no single entry."); + VPBlockBase *Entry = PreEntry->getSuccessors().front(); + PlanUtils.disconnectBlocks(PreEntry, Entry); + Plan->setEntry(Entry); + delete PreEntry; + + // FOR STRESS TESTING, uncomment the following: + // EndRangeVF = StartRangeVF * 2; + + return SharedPlan; } -unsigned LoopVectorizationCostModel::getInstructionCost(Instruction *I, - unsigned VF, - Type *&VectorTy) { - Type *RetTy = I->getType(); - if (canTruncateToMinimalBitwidth(I, VF)) - RetTy = IntegerType::get(RetTy->getContext(), MinBWs[I]); - VectorTy = ToVectorTy(RetTy, VF); - auto SE = PSE.getSE(); +void LoopVectorizationPlanner::sinkScalarOperands(Instruction *PredInst, + VPlan *Plan) { + VPlanUtilsLoopVectorizer PlanUtils(Plan); - // TODO: We need to estimate the cost of intrinsic calls. - switch (I->getOpcode()) { - case Instruction::GetElementPtr: - // We mark this instruction as zero-cost because the cost of GEPs in - // vectorized code depends on whether the corresponding memory instruction - // is scalarized or not. Therefore, we handle GEPs with the memory - // instruction cost. - return 0; - case Instruction::Br: { - return TTI.getCFInstrCost(I->getOpcode()); - } - case Instruction::PHI: { - auto *Phi = cast(I); + // The recipe containing the predicated instruction. + VPBasicBlock *PredBB = Plan->getBasicBlock(PredInst); - // First-order recurrences are replaced by vector shuffles inside the loop. 
- if (VF > 1 && Legal->isFirstOrderRecurrence(Phi)) - return TTI.getShuffleCost(TargetTransformInfo::SK_ExtractSubvector, - VectorTy, VF - 1, VectorTy); + // Initialize a worklist with the operands of the predicated instruction. + SetVector Worklist(PredInst->op_begin(), PredInst->op_end()); - // TODO: IF-converted IFs become selects. - return 0; - } - case Instruction::UDiv: - case Instruction::SDiv: - case Instruction::URem: - case Instruction::SRem: - // If we have a predicated instruction, it may not be executed for each - // vector lane. Get the scalarization cost and scale this amount by the - // probability of executing the predicated block. If the instruction is not - // predicated, we fall through to the next case. - if (VF > 1 && Legal->isScalarWithPredication(I)) { - unsigned Cost = 0; + // Holds instructions that we need to analyze again. An instruction may be + // reanalyzed if we don't yet know if we can sink it or not. + SmallVector InstsToReanalyze; - // These instructions have a non-void type, so account for the phi nodes - // that we will create. This cost is likely to be zero. The phi node - // cost, if any, should be scaled by the block probability because it - // models a copy at the end of each predicated block. - Cost += VF * TTI.getCFInstrCost(Instruction::PHI); + // Iteratively sink the scalarized operands of the predicated instruction + // into the block we created for it. When an instruction is sunk, it's + // operands are then added to the worklist. The algorithm ends after one pass + // through the worklist doesn't sink a single instruction. + bool Changed; + do { - // The cost of the non-predicated instruction. - Cost += VF * TTI.getArithmeticInstrCost(I->getOpcode(), RetTy); + // Add the instructions that need to be reanalyzed to the worklist, and + // reset the changed indicator. + Worklist.insert(InstsToReanalyze.begin(), InstsToReanalyze.end()); + InstsToReanalyze.clear(); + Changed = false; - // The cost of insertelement and extractelement instructions needed for - // scalarization. - Cost += getScalarizationOverhead(I, VF, TTI); + while (!Worklist.empty()) { + auto *I = dyn_cast(Worklist.pop_back_val()); + if (!I) + continue; - // Scale the cost by the probability of executing the predicated blocks. - // This assumes the predicated block for each vector lane is equally - // likely. - return Cost / getReciprocalPredBlockProb(); - } - case Instruction::Add: - case Instruction::FAdd: - case Instruction::Sub: - case Instruction::FSub: - case Instruction::Mul: - case Instruction::FMul: - case Instruction::FDiv: - case Instruction::FRem: - case Instruction::Shl: - case Instruction::LShr: - case Instruction::AShr: - case Instruction::And: - case Instruction::Or: - case Instruction::Xor: { - // Since we will replace the stride by 1 the multiplication should go away. - if (I->getOpcode() == Instruction::Mul && isStrideMul(I, Legal)) - return 0; - // Certain instructions can be cheaper to vectorize if they have a constant - // second vector operand. One example of this are shifts on x86. - TargetTransformInfo::OperandValueKind Op1VK = - TargetTransformInfo::OK_AnyValue; - TargetTransformInfo::OperandValueKind Op2VK = - TargetTransformInfo::OK_AnyValue; - TargetTransformInfo::OperandValueProperties Op1VP = - TargetTransformInfo::OP_None; - TargetTransformInfo::OperandValueProperties Op2VP = - TargetTransformInfo::OP_None; - Value *Op2 = I->getOperand(1); + // We do not sink other predicated instructions. 
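+      // Such an instruction already sits in its own predicated region under
+      // its own mask; sinking it under this block's predicate as well is not
+      // supported here.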
+ if (Legal->isScalarWithPredication(I)) + continue; - // Check for a splat or for a non uniform vector of constants. - if (isa(Op2)) { - ConstantInt *CInt = cast(Op2); - if (CInt && CInt->getValue().isPowerOf2()) - Op2VP = TargetTransformInfo::OP_PowerOf2; - Op2VK = TargetTransformInfo::OK_UniformConstantValue; - } else if (isa(Op2) || isa(Op2)) { - Op2VK = TargetTransformInfo::OK_NonUniformConstantValue; - Constant *SplatValue = cast(Op2)->getSplatValue(); - if (SplatValue) { - ConstantInt *CInt = dyn_cast(SplatValue); - if (CInt && CInt->getValue().isPowerOf2()) - Op2VP = TargetTransformInfo::OP_PowerOf2; - Op2VK = TargetTransformInfo::OK_UniformConstantValue; + VPRecipeBase *Recipe = Plan->getRecipe(I); + + // We can't sink live-ins. + if (!Recipe) + continue; + VPBasicBlock *BasicBlock = Recipe->getParent(); + assert(BasicBlock && "Recipe not in any basic block"); + + // We can't sink an instruction that isn't being scalarized. + if (!isa(Recipe) && + !isa(Recipe)) + continue; + + // We can't sink an instruction if it is already in the predicated block, + // is not in the VPlan, or may have side effects. + if (BasicBlock == PredBB || I->mayHaveSideEffects()) + continue; + + // Handle phi nodes last to make sure that any user they may have has sunk + // by now. This is relevant for induction variables that feed uniform GEPs + // which may or may not sink. + if (isa(I)) { + auto IsNotAPhi = [&](Value *V) -> bool { return isa(V); }; + if (any_of(Worklist, IsNotAPhi) || + any_of(InstsToReanalyze, IsNotAPhi)) { + InstsToReanalyze.push_back(I); + continue; + } + } + + bool HasVectorizedUses = false; + bool AllScalarizedUsesInPredicatedBlock = true; + unsigned MinLaneToSink = 0; + for (auto &U : I->uses()) { + auto *UI = cast(U.getUser()); + VPRecipeBase *UserRecipe = Plan->getRecipe(UI); + // Generated scalarized instructions don't serve users outside of the + // VPlan, so we can safely ignore users that have no recipe. + if (!UserRecipe) + continue; + + // GEPs used as the uniform address of a wide memory operation must not + // sink lane zero. + if (isa(UserRecipe)) { + assert(isa(I) && + "Non-GEP used in interleave group"); + MinLaneToSink = std::max(MinLaneToSink, 1u); + continue; + } + + // Wide memory operations do not use any of the scalarized GEPs but + // generate their own GEPs. + if (isa(UserRecipe) && + isa(I) && + (isa(UI) || isa(UI)) && + Legal->isConsecutivePtr(I)) { + continue; + } + + if (!(isa(UserRecipe) || + isa(UserRecipe))) { + // All of I's lanes are used by an instruction we can't sink. + HasVectorizedUses = true; + break; + } + + // Induction variables feeding consecutive GEPs can be indirectly used + // by vectorized load/stores which generate their own GEP rather than + // reuse the scalarized one (unlike load/store in interleave groups). + // In such a case, we can sink all lanes but lane zero. Note that we + // can do this whether or not the GEP is used within the predicated + // block (i.e. whether it will sink its own lanes 1..VF-1). + if (isa(UI) && Legal->isConsecutivePtr(UI) && + isa(Recipe)) { + auto IsVectorizedMemoryOperation = [&](User *U) -> bool { + if (!(isa(U) || isa(U))) + return false; + VPRecipeBase *Recipe = Plan->getRecipe(cast(U)); + return Recipe && isa(Recipe); + }; + + if (any_of(UI->users(), IsVectorizedMemoryOperation)) { + MinLaneToSink = std::max(MinLaneToSink, 1u); + continue; + } + } + + if (UserRecipe->getParent() != PredBB) { + // Don't make a decision until all scalarized users have sunk. 
+ AllScalarizedUsesInPredicatedBlock = false; + continue; + } + + // Ok to sink w.r.t this use, but no more lanes than what the user + // itself has sunk. + VPLaneRange DesignatedLanes; + if (auto *BSS = dyn_cast(UserRecipe)) + DesignatedLanes = BSS->getDesignatedLanes(); + else + DesignatedLanes = + cast(UserRecipe)->getDesignatedLanes(); + VPLaneRange SinkableLanes = + VPLaneRange::intersect(VPLaneRange(MinLaneToSink), DesignatedLanes); + MinLaneToSink = SinkableLanes.getMinLane(); + } + + if (HasVectorizedUses) + continue; // This instruction cannot be sunk. + + // It's legal to sink the instruction if all its uses occur in the + // predicated block. Otherwise, there's nothing to do yet, and we may + // need to reanalyze the instruction. + if (!AllScalarizedUsesInPredicatedBlock) { + InstsToReanalyze.push_back(I); + continue; } - } else if (Legal->isUniform(Op2)) { - Op2VK = TargetTransformInfo::OK_UniformValue; + + // Move the instruction to the beginning of the predicated block, and add + // it's operands to the worklist (except for phi nodes). + PlanUtils.sinkInstruction(I, PredBB, MinLaneToSink); + if (!isa(I)) + Worklist.insert(I->op_begin(), I->op_end()); + + // The sinking may have enabled other instructions to be sunk, so we will + // need to iterate. + Changed = true; } - SmallVector Operands(I->operand_values()); - return TTI.getArithmeticInstrCost(I->getOpcode(), VectorTy, Op1VK, - Op2VK, Op1VP, Op2VP, Operands); - } - case Instruction::Select: { - SelectInst *SI = cast(I); - const SCEV *CondSCEV = SE->getSCEV(SI->getCondition()); - bool ScalarCond = (SE->isLoopInvariant(CondSCEV, TheLoop)); - Type *CondTy = SI->getCondition()->getType(); - if (!ScalarCond) - CondTy = VectorType::get(CondTy, VF); + } while (Changed); +} - return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, CondTy); +void LoopVectorizationPlanner::assignScalarVectorConversions( + Instruction *PredInst, VPlan *Plan) { + + // NFC: Let Def's recipe generate the vector version of Def, but only + // if all of Def's users are vectorized. This is the equivalent to the + // previous predicateInstructions by which an insert-element got hoisted + // into the matching predicated basic block if it is the only user of + // the predicated instruction. + + if (PredInst->use_empty()) + return; + + for (User *U : PredInst->users()) { + Instruction *UserInst = dyn_cast(U); + if (!UserInst) + continue; + + VPRecipeBase *UserRecipe = Plan->getRecipe(UserInst); + if (!UserRecipe) // User is not part of the plan. + return; + + if (dyn_cast(UserRecipe)) + continue; + + // Found a user that will not be using the vector form of the predicated + // instruction. The insert-element is not going to be the only user, so + // do not hoist it. 
+ return; } - case Instruction::ICmp: - case Instruction::FCmp: { - Type *ValTy = I->getOperand(0)->getType(); - Instruction *Op0AsInstruction = dyn_cast(I->getOperand(0)); - if (canTruncateToMinimalBitwidth(Op0AsInstruction, VF)) - ValTy = IntegerType::get(ValTy->getContext(), MinBWs[Op0AsInstruction]); - VectorTy = ToVectorTy(ValTy, VF); - return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy); + + Plan->getRecipe(PredInst)->addAlsoPackOrUnpack(PredInst); +} + +bool LoopVectorizationPlanner::shouldScalarizeInstruction(Instruction *I, + unsigned VF) const { + return CM->isScalarAfterVectorization(I, VF) || + CM->isProfitableToScalarize(I, VF); +} + +void LoopVectorizationPlanner::optimizePredicatedInstructions() { + VPlan *PrevPlan = nullptr; + for (auto &It : VPlans) { + VPlan *Plan = It.second.get(); + if (Plan == PrevPlan) + continue; + for (auto *PredInst : PredicatedInstructions) { + sinkScalarOperands(PredInst, Plan); + assignScalarVectorConversions(PredInst, Plan); + } + PrevPlan = Plan; } - case Instruction::Store: - case Instruction::Load: { - VectorTy = ToVectorTy(getMemInstValueType(I), VF); - return getMemoryInstructionCost(I, VF); +} + +void LoopVectorizationPlanner::setBestPlan(unsigned VF, unsigned UF) { + DEBUG(dbgs() << "Setting best plan to VF=" << VF << ", UF=" << UF << '\n'); + BestVF = VF; + BestUF = UF; + + assert(VPlans.count(VF) && "Best VF does not have a VPlan."); + // Delete all other VPlans. + for (auto &Entry : VPlans) { + if (Entry.first != VF) + VPlans.erase(Entry.first); } - case Instruction::ZExt: - case Instruction::SExt: - case Instruction::FPToUI: - case Instruction::FPToSI: - case Instruction::FPExt: - case Instruction::PtrToInt: - case Instruction::IntToPtr: - case Instruction::SIToFP: - case Instruction::UIToFP: - case Instruction::Trunc: - case Instruction::FPTrunc: - case Instruction::BitCast: { - // We optimize the truncation of induction variables having constant - // integer steps. The cost of these truncations is the same as the scalar - // operation. - if (isOptimizableIVTruncate(I, VF)) { - auto *Trunc = cast(I); - return TTI.getCastInstrCost(Instruction::Trunc, Trunc->getDestTy(), - Trunc->getSrcTy()); - } +} - Type *SrcScalarTy = I->getOperand(0)->getType(); - Type *SrcVecTy = ToVectorTy(SrcScalarTy, VF); - if (canTruncateToMinimalBitwidth(I, VF)) { - // This cast is going to be shrunk. This may remove the cast or it might - // turn it into slightly different cast. For example, if MinBW == 16, - // "zext i8 %1 to i32" becomes "zext i8 %1 to i16". - // - // Calculate the modified src and dest types. - Type *MinVecTy = VectorTy; - if (I->getOpcode() == Instruction::Trunc) { - SrcVecTy = smallestIntegerVectorType(SrcVecTy, MinVecTy); - VectorTy = - largestIntegerVectorType(ToVectorTy(I->getType(), VF), MinVecTy); - } else if (I->getOpcode() == Instruction::ZExt || - I->getOpcode() == Instruction::SExt) { - SrcVecTy = largestIntegerVectorType(SrcVecTy, MinVecTy); - VectorTy = - smallestIntegerVectorType(ToVectorTy(I->getType(), VF), MinVecTy); - } - } +void LoopVectorizationPlanner::executeBestPlan(InnerLoopVectorizer &LB) { + ILV = &LB; - return TTI.getCastInstrCost(I->getOpcode(), VectorTy, SrcVecTy); + // Perform the actual loop widening (vectorization). + // 1. Create a new empty loop. Unlink the old loop and connect the new one. + ILV->createEmptyLoop(); + + // 2. Widen each instruction in the old loop to a new one in the new loop. 
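+  //    Code generation is driven by the single VPlan retained by
+  //    setBestPlan(): VPTransformState carries the chosen VF and UF plus the
+  //    IR builder, and each recipe's vectorize() emits the corresponding
+  //    instructions.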
+ + VPTransformState State{BestVF, BestUF, LI, ILV->DT, + ILV->Builder, ILV, Legal, CM}; + State.CFG.PrevBB = ILV->LoopVectorPreHeader; + + VPlan *Plan = getVPlanForVF(BestVF); + + Plan->vectorize(&State); + + // 3. Take care of phi's to fix: reduction, 1st-order-recurrence, loop-closed. + ILV->vectorizeLoop(); +} + +void VPVectorizeOneByOneRecipe::transformIRInstruction( + Instruction *I, VPTransformState &State) { + assert(I && "No instruction to vectorize."); + State.ILV->vectorizeInstruction(*I); + if (willAlsoPackOrUnpack(I)) { // Unpack instruction + for (unsigned Part = 0; Part < State.UF; ++Part) + for (unsigned Lane = 0; Lane < State.VF; ++Lane) + State.ILV->getScalarValue(I, Part, Lane); } - case Instruction::Call: { - bool NeedToScalarize; - CallInst *CI = cast(I); - unsigned CallCost = getVectorCallCost(CI, VF, TTI, TLI, NeedToScalarize); - if (getVectorIntrinsicIDForCall(CI, TLI)) - return std::min(CallCost, getVectorIntrinsicCost(CI, VF, TTI, TLI)); - return CallCost; +} + +void VPScalarizeOneByOneRecipe::transformIRInstruction( + Instruction *I, VPTransformState &State) { + assert(I && "No instruction to vectorize."); + // By default generate scalar instances for all VF lanes of all UF parts. + // If the instruction is uniform, generate only the first lane for each + // of the UF parts. + bool IsUniform = State.Cost->isUniformAfterVectorization(I, State.VF); + unsigned MinLane = 0; + unsigned MaxLane = IsUniform ? 0 : State.VF - 1; + unsigned MinPart = 0; + unsigned MaxPart = State.UF - 1; + + if (State.Instance) { + // Asked to create an instance for a specific lane and a specific part. + assert(!IsUniform && + "Uniform instruction vectorized for a specific instance."); + MinLane = State.Instance->Lane; + MaxLane = MinLane; + MinPart = State.Instance->Part; + MaxPart = MinPart; + } + + // Intersect requested lanes with the designated lanes for this recipe. + VPLaneRange ActiveLanes(MinLane, MaxLane); + VPLaneRange EffectiveLanes = + VPLaneRange::intersect(ActiveLanes, DesignatedLanes); + if (EffectiveLanes.isEmpty()) + return; // None of the requested lanes is designated for this recipe. + + // Generate relevant lanes. + State.ILV->scalarizeInstruction(I, MinPart, MaxPart, + EffectiveLanes.getMinLane(), + EffectiveLanes.getMaxLane()); + if (willAlsoPackOrUnpack(I)) { + if (State.Instance) + // Insert scalar instance packing it into a vector. + State.ILV->constructVectorValue(I, MinPart, MinLane); + else + // Broadcast or group together all instances into a vector. + State.ILV->getVectorValue(I); } - default: - // The cost of executing VF copies of the scalar instruction. This opcode - // is unknown. Assume that it is the same as 'mul'. - return VF * TTI.getArithmeticInstrCost(Instruction::Mul, VectorTy) + - getScalarizationOverhead(I, VF, TTI); - } // end of switch. 
} -char LoopVectorize::ID = 0; -static const char lv_name[] = "Loop Vectorization"; -INITIALIZE_PASS_BEGIN(LoopVectorize, LV_NAME, lv_name, false, false) -INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass) -INITIALIZE_PASS_DEPENDENCY(BasicAAWrapperPass) -INITIALIZE_PASS_DEPENDENCY(AAResultsWrapperPass) -INITIALIZE_PASS_DEPENDENCY(GlobalsAAWrapperPass) -INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker) -INITIALIZE_PASS_DEPENDENCY(BlockFrequencyInfoWrapperPass) -INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass) -INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass) -INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass) -INITIALIZE_PASS_DEPENDENCY(LoopAccessLegacyAnalysis) -INITIALIZE_PASS_DEPENDENCY(DemandedBitsWrapperPass) -INITIALIZE_PASS_DEPENDENCY(OptimizationRemarkEmitterWrapperPass) -INITIALIZE_PASS_END(LoopVectorize, LV_NAME, lv_name, false, false) +void VPWidenIntInductionRecipe::vectorize(VPTransformState &State) { + assert(State.Instance == nullptr && "Int induction being replicated"); + auto BuildScalarInfo = State.ILV->widenIntInduction(NeedsScalarIV, IV, Trunc); + ScalarIV = BuildScalarInfo.first; + Step = BuildScalarInfo.second; +} -namespace llvm { -Pass *createLoopVectorizePass(bool NoUnrolling, bool AlwaysVectorize) { - return new LoopVectorize(NoUnrolling, AlwaysVectorize); +void VPWidenIntInductionRecipe::print(raw_ostream &O) const { + O << "Widen int induction"; + if (NeedsScalarIV) + O << " (needs scalars)"; + O << ":\n"; + O << *IV; + if (Trunc) + O << "\n" << *Trunc << ")"; } + +void VPBuildScalarStepsRecipe::vectorize(VPTransformState &State) { + // By default generate scalar instances for all VF lanes of all UF parts. + // If the instruction is uniform, generate only the first lane for each + // of the UF parts. + bool IsUniform = State.Cost->isUniformAfterVectorization(EntryVal, State.VF); + unsigned MinLane = 0; + unsigned MaxLane = IsUniform ? 0 : State.VF - 1; + unsigned MinPart = 0; + unsigned MaxPart = State.UF - 1; + + if (State.Instance) { + // Asked to create an instance for a specific lane and a specific part. + MinLane = State.Instance->Lane; + MaxLane = MinLane; + MinPart = State.Instance->Part; + MaxPart = MinPart; + } + + // Intersect requested lanes with the designated lanes for this recipe. + VPLaneRange ActiveLanes(MinLane, MaxLane); + VPLaneRange EffectiveLanes = + VPLaneRange::intersect(ActiveLanes, DesignatedLanes); + if (EffectiveLanes.isEmpty()) + return; // None of the requested lanes is designated for this recipe. + + // Generate relevant lanes. + State.ILV->buildScalarSteps(WII->getScalarIV(), WII->getStep(), EntryVal, + MinPart, MaxPart, EffectiveLanes.getMinLane(), + EffectiveLanes.getMaxLane()); } -bool LoopVectorizationCostModel::isConsecutiveLoadOrStore(Instruction *Inst) { +void VPBuildScalarStepsRecipe::print(raw_ostream &O) const { + O << "Build scalar steps"; + if (!DesignatedLanes.isFull()) { + O << " "; + DesignatedLanes.print(O); + } + O << ":\n" << *EntryVal; +} - // Check if the pointer operand of a load or store instruction is - // consecutive. - if (auto *Ptr = getPointerOperand(Inst)) - return Legal->isConsecutivePtr(Ptr); - return false; +void VPInterleaveRecipe::vectorize(VPTransformState &State) { + assert(State.Instance == nullptr && "Interleave group being replicated"); + State.ILV->vectorizeInterleaveGroup(IG->getInsertPos()); } -void LoopVectorizationCostModel::collectValuesToIgnore() { - // Ignore ephemeral values. 
- CodeMetrics::collectEphemeralValues(TheLoop, AC, ValuesToIgnore); +void VPInterleaveRecipe::print(raw_ostream &O) const { + O << "InterleaveGroup factor:" << IG->getFactor() << '\n'; + for (unsigned i = 0; i < IG->getFactor(); ++i) + if (Instruction *I = IG->getMember(i)) { + if (I == IG->getInsertPos()) + O << i << "=]" << *I; + else + O << i << " ]" << *I; + if (willAlsoPackOrUnpack(I)) + O << " (V->S)"; + } +} - // Ignore type-promoting instructions we identified during reduction - // detection. - for (auto &Reduction : *Legal->getReductionVars()) { - RecurrenceDescriptor &RedDes = Reduction.second; - SmallPtrSetImpl &Casts = RedDes.getCastInsts(); - VecValuesToIgnore.insert(Casts.begin(), Casts.end()); +void VPExtractMaskBitRecipe::vectorize(VPTransformState &State) { + assert(State.Instance && "Extract Mask Bit works only on single instance."); + + unsigned Part = State.Instance->Part; + unsigned Lane = State.Instance->Lane; + + typedef SmallVector VectorParts; + + VectorParts Cond = State.ILV->createBlockInMask(MaskedBasicBlock); + + ConditionBit = State.Builder.CreateExtractElement( + Cond[Part], State.ILV->Builder.getInt32(Lane)); + ConditionBit = + State.Builder.CreateICmp(ICmpInst::ICMP_EQ, ConditionBit, + ConstantInt::get(ConditionBit->getType(), 1)); + DEBUG(dbgs() << "\nLV: vectorizing ConditionBit recipe" + << MaskedBasicBlock->getName()); +} + +void VPMergeScalarizeBranchRecipe::vectorize(VPTransformState &State) { + assert(State.Instance && + "Merge Scalarize Branch works only on single instance."); + + Type *LiveOutType = LiveOut->getType(); + unsigned Part = State.Instance->Part; + unsigned Lane = State.Instance->Lane; + + // Rename the predicated and merged basic blocks for backwards compatibility. + Instruction *ScalarLiveOut = + cast(State.ILV->getScalarValue(LiveOut, Part, Lane)); + BasicBlock *PredicatedBB = ScalarLiveOut->getParent(); + BasicBlock *PredicatingBB = PredicatedBB->getSinglePredecessor(); + assert(PredicatingBB && "Predicated block has no single predecessor"); + PredicatedBB->setName(Twine("pred.") + LiveOut->getOpcodeName() + ".if"); + PredicatedBB->getSingleSuccessor()->setName( + Twine("pred.") + LiveOut->getOpcodeName() + ".continue"); + if (LiveOutType->isVoidTy()) + return; + + // Generate a phi node for the scalarized instruction. + PHINode *Phi = State.ILV->Builder.CreatePHI(LiveOutType, 2); + Phi->addIncoming(UndefValue::get(ScalarLiveOut->getType()), PredicatingBB); + Phi->addIncoming(ScalarLiveOut, PredicatedBB); + State.ILV->setScalarValue(LiveOut, Part, Lane, Phi); + + // If this instruction also generated the complementing form then we also need + // to create a phi for the vector value of this part & lane and update the + // vector values cache accordingly. + Value *VectorValue = State.ILV->getVectorValue(LiveOut, Part); + if (!VectorValue) + return; + + InsertElementInst *IEI = cast(VectorValue); + PHINode *VPhi = State.ILV->Builder.CreatePHI(IEI->getType(), 2); + VPhi->addIncoming(IEI->getOperand(0), PredicatingBB); // the unmodified vector + VPhi->addIncoming(IEI, PredicatedBB); // new vector with the inserted element + State.ILV->setVectorValue(LiveOut, Part, VPhi); +} + +/// Creates a new VPScalarizeOneByOneRecipe or VPVectorizeOneByOneRecipe based +/// on the isScalarizing parameter respectively. 
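The structure that VPMergeScalarizeBranchRecipe completes is easiest to see per lane: a predicated computation in a pred.*.if block whose live-out is merged in pred.*.continue by a phi whose other incoming value is undef. A standalone sketch of that merge (plain C++ stands in for the generated IR; the data and predicate are made up):

    #include <array>
    #include <iostream>
    #include <optional>
    #include <string>

    int main() {
      const unsigned VF = 4;
      std::array<int, VF> A = {10, 20, 30, 40};
      std::array<bool, VF> Predicate = {true, false, true, false};

      // Per-lane results of the predicated scalar op; std::nullopt plays the
      // role of the undef value flowing in from the predicating edge.
      std::array<std::optional<int>, VF> Merged;

      for (unsigned Lane = 0; Lane < VF; ++Lane) {
        std::optional<int> ScalarLiveOut;   // computed only in pred.<op>.if
        if (Predicate[Lane])
          ScalarLiveOut = A[Lane] / 2;      // the predicated scalar operation
        // pred.<op>.continue:
        //   phi [undef, predicating BB], [ScalarLiveOut, predicated BB]
        Merged[Lane] = ScalarLiveOut;
      }

      for (unsigned Lane = 0; Lane < VF; ++Lane)
        std::cout << "lane " << Lane << ": "
                  << (Merged[Lane] ? std::to_string(*Merged[Lane]) : "undef")
                  << "\n";
      return 0;
    }

The second phi created by the recipe does the same merge for the partially built vector value when the instruction also has a vectorized form.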
+VPOneByOneRecipeBase *VPlanUtilsLoopVectorizer::createOneByOneRecipe( + const BasicBlock::iterator B, const BasicBlock::iterator E, VPlan *Plan, + bool isScalarizing) { + if (isScalarizing) + return new VPScalarizeOneByOneRecipe(B, E, Plan); + return new VPVectorizeOneByOneRecipe(B, E, Plan); +} + +bool VPlanUtilsLoopVectorizer::appendInstruction(VPOneByOneRecipeBase *Recipe, + Instruction *Instr) { + if (Recipe->End != Instr->getIterator()) + return false; + + Recipe->End++; + Plan->setInst2Recipe(Instr, Recipe); + return true; +} + +/// Given a \p Split instruction assumed to reside in a VPOneByOneRecipeBase +/// -- where VPOneByOneRecipeBase is either VPScalarizeOneByOneRecipe or +/// VPVectorizeOneByOneRecipe -- update that recipe to start from \p Split +/// and move all preceeding instructions to a new VPOneByOneRecipeBase. +/// \return the newly created VPOneByOneRecipeBase, which is added to the +/// VPBasicBlock of the original recipe, right before it. +VPOneByOneRecipeBase * +VPlanUtilsLoopVectorizer::splitRecipe(Instruction *Split) { + VPOneByOneRecipeBase *Recipe = + cast(Plan->getRecipe(Split)); + auto SplitPos = Split->getIterator(); + + assert(SplitPos != Recipe->Begin && + "Nothing to split before first instruction."); + assert(SplitPos != Recipe->End && "Nothing to split after last instruction."); + + // Build a new recipe for all instructions up to the given Split. + VPBasicBlock *BasicBlock = Recipe->getParent(); + VPOneByOneRecipeBase *NewRecipe = createOneByOneRecipe( + Recipe->Begin, SplitPos, Plan, Recipe->isScalarizing()); + + // Insert the new recipe before the split point. + BasicBlock->addRecipe(NewRecipe, Recipe); + + // Update the old recipe to start from the given split point. + Recipe->Begin = SplitPos; + + return NewRecipe; +} + +/// Insert a given instruction \p Inst into a VPBasicBlock before another +/// given instruction \p Before. Assumes \p Inst does not belong to any +/// recipe, and that \p Before belongs to a VPOneByOneRecipeBase. +void VPlanUtilsLoopVectorizer::insertBefore(Instruction *Inst, + Instruction *Before, + unsigned MinLane) { + assert(!Plan->getRecipe(Inst) && "Instruction already in recipe."); + VPRecipeBase *Recipe = Plan->getRecipe(Before); + assert(Recipe && "Insertion point not in any recipe."); + VPOneByOneRecipeBase *OBORecipe = cast(Recipe); + bool PartialInsertion = MinLane > 0; + bool IndicesMatch = true; + + if (PartialInsertion) { + VPScalarizeOneByOneRecipe *SOBO = + dyn_cast(Recipe); + if (!SOBO || SOBO->DesignatedLanes.getMinLane() != MinLane) + IndicesMatch = false; + } + + // Can we insert \p Inst by augmemting the existing recipe of \p Before? + // Only if \p Inst is immediately followed by \p Before: + Instruction *NextInst = Inst; + if (++NextInst == Before && IndicesMatch) { + // This must imply that \p Before is the first ingredient in its recipe. + assert(Before == &*OBORecipe->Begin && + "Trying to insert but Before is not first in its recipe."); + // Yes, extend the range to include the previous instruction. + OBORecipe->Begin--; + Plan->setInst2Recipe(Inst, Recipe); + return; + } + // Note that it is not possible to augment the end of Recipe by having + // Inst == &*Recipe->End, because to do that Before would need to be + // Recipe->End, which means that Before does not belong to this Recipe. + + // No, the instruction needs to have its own recipe. 
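Note that splitRecipe only moves iterators: no ingredients are copied, the new recipe simply takes over the front interval while the original keeps the tail. A standalone sketch of the same bookkeeping over a plain std::list (this Recipe struct is hypothetical and far simpler than VPOneByOneRecipeBase):

    #include <iostream>
    #include <iterator>
    #include <list>
    #include <string>

    struct Recipe {
      std::list<std::string>::iterator Begin, End; // half-open [Begin, End)
    };

    // Split R at Split: R keeps [Split, End); the returned recipe owns the
    // preceding interval and is meant to be placed right before R.
    static Recipe splitRecipe(Recipe &R, std::list<std::string>::iterator Split) {
      Recipe Front{R.Begin, Split};
      R.Begin = Split;
      return Front;
    }

    int main() {
      std::list<std::string> BB = {"%a = add", "%b = mul", "%c = sub", "%d = or"};
      Recipe R{BB.begin(), BB.end()};

      auto Split = std::next(BB.begin(), 2); // split before "%c = sub"
      Recipe Front = splitRecipe(R, Split);

      std::cout << "front recipe:\n";
      for (auto It = Front.Begin; It != Front.End; ++It)
        std::cout << "  " << *It << "\n";
      std::cout << "back recipe:\n";
      for (auto It = R.Begin; It != R.End; ++It)
        std::cout << "  " << *It << "\n";
      return 0;
    }

insertBefore and removeInstruction below rely on the same trick of growing or shrinking the [Begin, End) interval before falling back to creating a new recipe.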
+ + // If we're not inserting right before the Recipe's first instruction, + // split the Recipe to allow placing the new recipe right before the + // given insertion point. This new recipe is also added to BasicBlock. + if (Before != &*OBORecipe->Begin) + splitRecipe(Before); + + // TODO: VPLanUtils::addOneByOneToBasicBlock() + auto InstBegin = Inst->getIterator(); + auto InstEnd = InstBegin; + VPBasicBlock *BasicBlock = Recipe->getParent(); + VPOneByOneRecipeBase *NewRecipe = nullptr; + if (PartialInsertion) { + NewRecipe = createOneByOneRecipe(InstBegin, ++InstEnd, Plan, true); + cast(NewRecipe)->DesignatedLanes = + VPLaneRange(MinLane); + } else + NewRecipe = createOneByOneRecipe(InstBegin, ++InstEnd, Plan, + OBORecipe->isScalarizing()); + Plan->setInst2Recipe(Inst, NewRecipe); + BasicBlock->addRecipe(NewRecipe, OBORecipe); +} + +/// Remove a given instruction \p Inst from its recipe, if exists. We only +/// support removal from VPOneByOneRecipeBase at this time. +void VPlanUtilsLoopVectorizer::removeInstruction(Instruction *Inst, + unsigned FromLane) { + VPRecipeBase *Recipe = Plan->getRecipe(Inst); + if (!Recipe) + return; // Nothing to do, no recipe to remove the instruction from. + VPOneByOneRecipeBase *OBORecipe = cast(Recipe); + // First check if OBORecipe can be shortened to exclude Inst. + bool InstructionWasLast = false; + if (&*OBORecipe->Begin == Inst) + OBORecipe->Begin++; + else if (&*OBORecipe->End == Inst) { + OBORecipe->End--; + InstructionWasLast = true; + } + // Otherwise split OBORecipe at Inst. + else { + splitRecipe(Inst); + OBORecipe->Begin++; + } + if (FromLane > 0) { + // This is a partial removal. Leave lanes 0..FromLane-1 in the original + // basic block in a new, unregistered recipe. + VPOneByOneRecipeBase *NewRecipe = createOneByOneRecipe( + Inst->getIterator(), ++(Inst->getIterator()), Plan, true); + cast(NewRecipe)->DesignatedLanes = + VPLaneRange(0, FromLane - 1); + Recipe->getParent()->addRecipe(NewRecipe, + InstructionWasLast ? nullptr : Recipe); + } + Plan->resetInst2Recipe(Inst); +} + +// Given an instruction \p Inst and a VPBasicBlock \p To, remove \p Inst from +// its current residence and add it as the first instruction of \p To. +// We currently support removal from and insertion to +// VPOneByOneRecipeBase's only. +// TODO: this is an over-simplistic implemetation that assumes we can make +// the new instruction the first instruction of the first recipe in the +// basic block. This is true for the sinkScalarOperands use-case, but for a +// general basic block a getFirstInsertionPt() logic is required. +void VPlanUtilsLoopVectorizer::sinkInstruction(Instruction *Inst, + VPBasicBlock *To, + unsigned MinLane) { + RecipeListTy *Recipes = getRecipes(To); + + VPRecipeBase *FromRecipe = Plan->getRecipe(Inst); + if (auto *FromBSSRecipe = dyn_cast(FromRecipe)) { + VPBuildScalarStepsRecipe *SunkRecipe = nullptr; + if (MinLane == 0) { + // Sink the entire recipe. 
+ VPBasicBlock *From = FromRecipe->getParent(); + assert(From && "Recipe to sink not assigned to any basic block"); + From->removeRecipe(FromBSSRecipe); + SunkRecipe = FromBSSRecipe; + } else { + // Partially sink lanes MinLane..VF-1 + SunkRecipe = new VPBuildScalarStepsRecipe(FromBSSRecipe->WII, + FromBSSRecipe->EntryVal, Plan); + SunkRecipe->DesignatedLanes = VPLaneRange(MinLane); + FromBSSRecipe->DesignatedLanes = VPLaneRange(0, MinLane - 1); + } + To->addRecipe(SunkRecipe, &*Recipes->begin()); + return; + } + + assert(Plan->getRecipe(Inst) && + isa(Plan->getRecipe(Inst)) && + "Unsupported recipe to sink instructions from"); + + // Remove instruction from its source recipe. + removeInstruction(Inst, MinLane); + + auto *ToRecipe = dyn_cast(&*Recipes->begin()); + if (ToRecipe) { + // Try to sink the instruction into an existing recipe, default to a new + // recipe. + assert(ToRecipe->isScalarizing() && + "Cannot sink into a non-scalarizing recipe."); + + // Add it before the first ingredient of To. + insertBefore(Inst, &*ToRecipe->Begin, MinLane); + } else { + // Instruction has to go into its own one-by-one recipe. + auto InstBegin = Inst->getIterator(); + auto InstEnd = InstBegin; + auto *NewRecipe = createOneByOneRecipe(InstBegin, ++InstEnd, Plan, true); + if (MinLane > 0) // Partial sink + cast(NewRecipe)->DesignatedLanes = + VPLaneRange(MinLane); + To->addRecipe(NewRecipe, &*Recipes->begin()); + } +} + +void InnerLoopUnroller::vectorizeInstruction(Instruction &I) { + switch (I.getOpcode()) { + case Instruction::Br: + // Nothing to do for branches since we already took care of the + // loop control flow instructions. + break; + + case Instruction::GetElementPtr: + scalarizeInstruction(&I, false); + break; + + case Instruction::UDiv: + case Instruction::SDiv: + case Instruction::SRem: + case Instruction::URem: + // Scalarize with predication if this instruction may divide by zero and + // block execution is conditional, otherwise fallthrough. + if (Legal->isScalarWithPredication(&I)) { + scalarizeInstruction(&I, true); + break; + } + + case Instruction::Trunc: { + auto *CI = dyn_cast(&I); + // Optimize the special case where the source is a constant integer + // induction variable. Notice that we can only optimize the 'trunc' case + // because (a) FP conversions lose precision, (b) sext/zext may wrap, and + // (c) other casts depend on pointer size. + if (Cost->isOptimizableIVTruncate(CI, VF)) { + setDebugLocFromInst(Builder, CI); + widenIntInduction(true, cast(CI->getOperand(0)), + cast(CI)); + break; + } + } + + default: + InnerLoopVectorizer::vectorizeInstruction(I); } } @@ -7595,9 +9249,35 @@ return false; } - // Select the optimal vectorization factor. - const LoopVectorizationCostModel::VectorizationFactor VF = - CM.selectVectorizationFactor(OptForSize); + if (!CM.canVectorize(OptForSize)) + return false; + + // Early prune excessive VF's + unsigned MaxVF = CM.computeMaxVectorizationFactor(OptForSize); + + // If OptForSize, MaxVF is the only VF we consider. Abort if it needs a tail. + if (OptForSize && CM.requiresTail(MaxVF)) + return false; + + // Use the planner. + LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, &CM); + + // Get user vectorization factor. + unsigned UserVF = Hints.getWidth(); + + // Select the vectorization factor. 
+ LoopVectorizationCostModel::VectorizationFactor VF = + LVP.plan(OptForSize, UserVF, MaxVF); + bool VectorizeLoop = (VF.Width > 1); + + std::pair VecDiagMsg, IntDiagMsg; + + if (!UserVF && !VectorizeLoop) { + DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n"); + VecDiagMsg = std::make_pair( + "VectorizationNotBeneficial", + "the cost-model indicates that vectorization is not beneficial"); + } // Select the interleave count. unsigned IC = CM.selectInterleaveCount(OptForSize, VF.Width, VF.Cost); @@ -7606,8 +9286,6 @@ unsigned UserIC = Hints.getInterleave(); // Identify the diagnostic messages that should be produced. - std::pair VecDiagMsg, IntDiagMsg; - bool VectorizeLoop = true, InterleaveLoop = true; if (Requirements.doesNotMeet(F, L, Hints)) { DEBUG(dbgs() << "LV: Not vectorizing: loop did not meet vectorization " "requirements.\n"); @@ -7615,13 +9293,7 @@ return false; } - if (VF.Width == 1) { - DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n"); - VecDiagMsg = std::make_pair( - "VectorizationNotBeneficial", - "the cost-model indicates that vectorization is not beneficial"); - VectorizeLoop = false; - } + bool InterleaveLoop = true; if (IC == 1 && UserIC <= 1) { // Tell the user interleaving is not beneficial. @@ -7637,8 +9309,8 @@ } } else if (IC > 1 && UserIC == 1) { // Tell the user interleaving is beneficial, but it explicitly disabled. - DEBUG(dbgs() - << "LV: Interleaving is beneficial but is explicitly disabled."); + DEBUG( + dbgs() << "LV: Interleaving is beneficial but is explicitly disabled."); IntDiagMsg = std::make_pair( "InterleavingBeneficialButDisabled", "the cost-model indicates that interleaving is beneficial " @@ -7649,6 +9321,9 @@ // Override IC if user provided an interleave count. IC = UserIC > 0 ? UserIC : IC; + if (VectorizeLoop) + LVP.setBestPlan(VF.Width, IC); + // Emit diagnostic messages, if any. const char *VAPassName = Hints.vectorizeAnalysisPassName(); if (!VectorizeLoop && !InterleaveLoop) { @@ -7691,10 +9366,13 @@ << "interleaved loop (interleaved count: " << NV("InterleaveCount", IC) << ")"); } else { + // If we decided that it is *legal* to vectorize the loop, then do it. InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC, &LVL, &CM); - LB.vectorize(); + + LVP.executeBestPlan(LB); + ++LoopsVectorized; // Add metadata to disable runtime unrolling a scalar loop when there are Index: lib/Transforms/Vectorize/VPlan.h =================================================================== --- /dev/null +++ lib/Transforms/Vectorize/VPlan.h @@ -0,0 +1,922 @@ +//===- VPlan.h - Represent A Vectorizer Plan ------------------------------===// +// +// The LLVM Compiler Infrastructure +// +// This file is distributed under the University of Illinois Open Source +// License. See LICENSE.TXT for details. +// +//===----------------------------------------------------------------------===// +// +// This file contains the declarations of the Vectorization Plan base classes: +// 1. VPBasicBlock and VPRegionBlock that inherit from a common pure virtual +// VPBlockBase, together implementing a Hierarchical CFG; +// 2. Specializations of GraphTraits that allow VPBlockBase graphs to be treated +// as proper graphs for generic algorithms; +// 3. Pure virtual VPRecipeBase and its pure virtual sub-classes +// VPConditionBitRecipeBase and VPOneByOneRecipeBase that +// represent base classes for recipes contained within VPBasicBlocks; +// 4. The VPlan class holding a candidate for vectorization; +// 5. 
The VPlanUtils class providing methods for building plans; +// 6. The VPlanPrinter class providing a way to print a plan in dot format. +// These are documented in docs/VectorizationPlan.rst. +// +//===----------------------------------------------------------------------===// + +#ifndef LLVM_TRANSFORMS_VECTORIZE_VPLAN_H +#define LLVM_TRANSFORMS_VECTORIZE_VPLAN_H + +#include "llvm/ADT/GraphTraits.h" +#include "llvm/ADT/ilist.h" +#include "llvm/ADT/ilist_node.h" +#include "llvm/IR/IRBuilder.h" +#include "llvm/Support/raw_ostream.h" + +// The (re)use of existing LoopVectorize classes is subject to future VPlan +// refactoring. +namespace { +class InnerLoopVectorizer; +class LoopVectorizationLegality; +class LoopVectorizationCostModel; +} + +namespace llvm { + +class VPBasicBlock; + +/// VPRecipeBase is a base class describing one or more instructions that will +/// appear consecutively in the vectorized version, based on Instructions from +/// the given IR. These Instructions are referred to as the "Ingredients" of +/// the Recipe. A Recipe specifies how its ingredients are to be vectorized: +/// e.g., copy or reuse them as uniform, scalarize or vectorize them according +/// to an enclosing loop dimension, vectorize them according to internal SLP +/// dimension. +/// +/// **Design principle:** in order to reason about how to vectorize an +/// Instruction or how much it would cost, one has to consult the VPRecipe +/// holding it. +/// +/// **Design principle:** when a sequence of instructions conveys additional +/// information as a group, we use a VPRecipe to encapsulate them and attach +/// this information to the VPRecipe. For instance a VPRecipe can model an +/// interleave group of loads or stores with additional information for +/// calculating their cost and for performing IR code generation, as a group. +/// +/// **Design principle:** a VPRecipe should reuse existing containers of its +/// ingredients, i.e., iterators of basic blocks, to be lightweight. A new +/// containter should be opened on-demand, e.g., to avoid excessive recipes +/// each holding an interval of ingredients. +class VPRecipeBase : public ilist_node_with_parent { + friend class VPlanUtils; + friend class VPBasicBlock; + +private: + const unsigned char VRID; /// Subclass identifier (for isa/dyn_cast) + + /// Each VPRecipe is contained in a single VPBasicBlock. + class VPBasicBlock *Parent; + + /// Record which Instructions would require generating their complementing + /// form as well, providing a vector-to-scalar or scalar-to-vector conversion. + SmallPtrSet AlsoPackOrUnpack; + +public: + /// An enumeration for keeping track of the concrete subclass of VPRecipeBase + /// that is actually instantiated. Values of this enumeration are kept in the + /// VPRecipe classes VRID field. They are used for concrete type + /// identification. + typedef enum { + VPVectorizeOneByOneSC, + VPScalarizeOneByOneSC, + VPWidenIntInductionSC, + VPBuildScalarStepsSC, + VPInterleaveSC, + VPExtractMaskBitSC, + VPMergeScalarizeBranchSC, + } VPRecipeTy; + + VPRecipeBase(const unsigned char SC) : VRID(SC), Parent(nullptr) {} + + virtual ~VPRecipeBase() {} + + /// \return an ID for the concrete type of this object. + /// This is used to implement the classof checks. This should not be used + /// for any other purpose, as the values may change as LLVM evolves. + unsigned getVPRecipeID() const { return VRID; } + + /// \return the VPBasicBlock which this VPRecipe belongs to. 
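The VRID field implements LLVM's usual hand-rolled RTTI: every concrete recipe stores a tag from the enum, and classof checks that tag so isa/dyn_cast can work without C++ RTTI. A minimal standalone sketch of the idiom (simplified names, no LLVM dependency, with a toy dyn_cast stand-in):

    #include <iostream>
    #include <memory>
    #include <vector>

    class RecipeBase {
    public:
      enum RecipeTy { ScalarizeSC, VectorizeSC, InterleaveSC };
      explicit RecipeBase(RecipeTy ID) : VRID(ID) {}
      virtual ~RecipeBase() = default;
      RecipeTy getID() const { return VRID; }

    private:
      const RecipeTy VRID; // subclass tag consulted instead of C++ RTTI
    };

    class ScalarizeRecipe : public RecipeBase {
    public:
      ScalarizeRecipe() : RecipeBase(ScalarizeSC) {}
      static bool classof(const RecipeBase *R) { return R->getID() == ScalarizeSC; }
    };

    class InterleaveRecipe : public RecipeBase {
    public:
      InterleaveRecipe() : RecipeBase(InterleaveSC) {}
      static bool classof(const RecipeBase *R) { return R->getID() == InterleaveSC; }
    };

    // Toy stand-in for llvm::dyn_cast, driven entirely by classof.
    template <typename To, typename From> To *dyn_cast_sketch(From *V) {
      return To::classof(V) ? static_cast<To *>(V) : nullptr;
    }

    int main() {
      std::vector<std::unique_ptr<RecipeBase>> Recipes;
      Recipes.push_back(std::make_unique<ScalarizeRecipe>());
      Recipes.push_back(std::make_unique<InterleaveRecipe>());

      for (const auto &R : Recipes)
        if (auto *IG = dyn_cast_sketch<InterleaveRecipe>(R.get()))
          std::cout << "interleave recipe found at " << IG << "\n";
      return 0;
    }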
+ class VPBasicBlock *getParent() { + return Parent; + } + + /// The method which generates the new IR instructions that correspond to + /// this VPRecipe in the vectorized version, thereby "executing" the VPlan. + virtual void vectorize(struct VPTransformState &State) = 0; + + /// Each recipe prints itself. + virtual void print(raw_ostream &O) const = 0; + + /// Add an instruction to the set of instructions for which a vector-to- + /// scalar or scalar-to-vector conversion is needed, in addition to + /// vectorizing or scalarizing the instruction itself, respectively. + void addAlsoPackOrUnpack(Instruction *I) { AlsoPackOrUnpack.insert(I); } + + /// Indicates if a given instruction requires vector-to-scalar or scalar-to- + /// vector conversion. + bool willAlsoPackOrUnpack(Instruction *I) const { + return AlsoPackOrUnpack.count(I); + } +}; + +/// A VPConditionBitRecipeBase is a pure virtual VPRecipe which supports a +/// conditional branch. Concrete sub-classes of this recipe are in charge of +/// generating the instructions that compute the condition for this branch in +/// the vectorized version. +class VPConditionBitRecipeBase : public VPRecipeBase { +protected: + /// The actual condition bit that was generated. Holds null until the + /// value/instuctions are generated by the vectorize() method. + Value *ConditionBit; + +public: + /// Construct a VPConditionBitRecipeBase, simply propating its concrete type. + VPConditionBitRecipeBase(const unsigned char SC) + : VPRecipeBase(SC), ConditionBit(nullptr) {} + + /// \return the actual bit that was generated, to be plugged into the IR + /// conditional branch, or null if the code computing the actual bit has not + /// been generated yet. + Value *getConditionBit() { return ConditionBit; } + + virtual StringRef getName() const = 0; +}; + +/// VPOneByOneRecipeBase is a VPRecipeBase which handles each Instruction in its +/// ingredients independently, in order. The ingredients are either all +/// vectorized, or all scalarized. +/// A VPOneByOneRecipeBase is a virtual base recipe which can be materialized +/// by one of two sub-classes, namely VPVectorizeOneByOneRecipe or +/// VPScalarizeOneByOneRecipe for Vectorizing or Scalarizing all ingredients, +/// respectively. +/// The ingredients are held as a sub-sequence of original Instructions, which +/// reside in the same IR BasicBlock and in the same order. The Ingredients are +/// accessed by a pointer to the first and last Instruction. +class VPOneByOneRecipeBase : public VPRecipeBase { + friend class VPlanUtilsLoopVectorizer; + +public: + /// Hold the ingredients by pointing to their original BasicBlock location. + BasicBlock::iterator Begin; + BasicBlock::iterator End; + +protected: + VPOneByOneRecipeBase() = delete; + + VPOneByOneRecipeBase(unsigned char SC, const BasicBlock::iterator B, + const BasicBlock::iterator E, class VPlan *Plan); + + /// Do the actual code generation for a single instruction. + /// This function is to be implemented and specialized by the respective + /// sub-class. + virtual void transformIRInstruction(Instruction *I, + struct VPTransformState &State) = 0; + +public: + ~VPOneByOneRecipeBase() {} + + /// Method to support type inquiry through isa, cast, and dyn_cast. 
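The AlsoPackOrUnpack set is what lets a recipe provide both the vector and the scalar form of a value at its definition, as used by VPVectorizeOneByOneRecipe::transformIRInstruction earlier in this patch. A standalone sketch of that bookkeeping (strings stand in for Instructions, std::unordered_set for SmallPtrSet):

    #include <iostream>
    #include <string>
    #include <unordered_set>
    #include <vector>

    int main() {
      // Ingredients this recipe will widen one by one.
      std::vector<std::string> Ingredients = {"%a = load", "%b = add %a, 1"};

      // Instructions whose scalar instances are also needed elsewhere (e.g.
      // %a feeds a scalarized address computation), recorded while planning.
      std::unordered_set<std::string> AlsoPackOrUnpack = {"%a = load"};

      const unsigned VF = 4, UF = 1;
      for (const std::string &I : Ingredients) {
        std::cout << "widen " << I << "\n";
        if (AlsoPackOrUnpack.count(I)) // unpack: also materialize scalar copies
          for (unsigned Part = 0; Part < UF; ++Part)
            for (unsigned Lane = 0; Lane < VF; ++Lane)
              std::cout << "  extract part " << Part << " lane " << Lane << "\n";
      }
      return 0;
    }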
+ static inline bool classof(const VPRecipeBase *V) { + return V->getVPRecipeID() == VPRecipeBase::VPScalarizeOneByOneSC || + V->getVPRecipeID() == VPRecipeBase::VPVectorizeOneByOneSC; + } + + bool isScalarizing() const { + return getVPRecipeID() == VPRecipeBase::VPScalarizeOneByOneSC; + } + + /// The method which generates all new IR instructions that correspond to + /// this VPOneByOneRecipeBase in the vectorized version, thereby + /// "executing" the VPlan. + /// VPOneByOneRecipeBase may either scalarize or vectorize all Instructions. + void vectorize(struct VPTransformState &State) override { + for (auto It = Begin; It != End; ++It) + transformIRInstruction(&*It, State); + } + + const BasicBlock::iterator &begin() { return Begin; } + + const BasicBlock::iterator &end() { return End; } +}; + +/// Hold the indices of a specific scalar instruction. The VPIterationInstance +/// span the iterations of the original loop, that correspond to a single +/// iteration of the vectorized loop. +struct VPIterationInstance { + unsigned Part; + unsigned Lane; +}; + +// Forward declaration. +class BasicBlock; + +/// Hold additional information passed down when "executing" a VPlan, that is +/// needed for generating IR. Also facilitates reuse of existing LV +/// functionality. +struct VPTransformState { + + VPTransformState(unsigned VF, unsigned UF, class LoopInfo *LI, + class DominatorTree *DT, IRBuilder<> &Builder, + InnerLoopVectorizer *ILV, LoopVectorizationLegality *Legal, + LoopVectorizationCostModel *Cost) + : VF(VF), UF(UF), Instance(nullptr), LI(LI), DT(DT), Builder(Builder), + ILV(ILV), Legal(Legal), Cost(Cost) {} + + /// Record the selected vectorization and unroll factors of the single loop + /// being vectorized. + unsigned VF; + unsigned UF; + + /// Hold the indices to generate a specific scalar instruction. Null indicates + /// that all instances are to be generated, using either scalar or vector + /// instructions. + VPIterationInstance *Instance; + + /// Hold state information used when constructing the CFG of the vectorized + /// Loop, traversing the VPBasicBlocks and generating corresponding IR + /// BasicBlocks. + struct CFGState { + /// The previous VPBasicBlock visited. In the beginning set to null. + VPBasicBlock *PrevVPBB; + /// The previous IR BasicBlock created or reused. In the beginning set to + /// the new header BasicBlock. + BasicBlock *PrevBB; + /// The last IR BasicBlock of the loop body. Set to the new latch + /// BasicBlock, used for placing the newly created BasicBlocks. + BasicBlock *LastBB; + /// A mapping of each VPBasicBlock to the corresponding BasicBlock. In case + /// of replication, maps the BasicBlock of the last replica created. + SmallDenseMap VPBB2IRBB; + + CFGState() : PrevVPBB(nullptr), PrevBB(nullptr), LastBB(nullptr) {} + } CFG; + + /// Hold pointer to LoopInfo to register new basic blocks in the loop. + class LoopInfo *LI; + + /// Hold pointer to Dominator Tree to register new basic blocks in the loop. + class DominatorTree *DT; + + /// Hold a reference to the IRBuilder used to generate IR code. + IRBuilder<> &Builder; + + /// Hold a pointer to InnerLoopVectorizer to reuse its IR generation methods. + class InnerLoopVectorizer *ILV; + + /// Hold a pointer to LoopVectorizationLegality + class LoopVectorizationLegality *Legal; + + /// Hold a pointer to LoopVectorizationCostModel to access its + /// IsUniformAfterVectorization method. + LoopVectorizationCostModel *Cost; +}; + +/// VPBlockBase is the building block of the Hierarchical CFG. 
A VPBlockBase +/// can be either a VPBasicBlock or a VPRegionBlock. +/// +/// The Hierarchical CFG is a control-flow graph whose nodes are basic-blocks +/// or Hierarchical CFG's. The Hierarchical CFG data structure we use is similar +/// to the Tile Tree [1], where cross-Tile edges are lifted to connect Tiles +/// instead of the original basic-blocks as in Sharir [2], promoting the Tile +/// encapsulation. We use the terms Region and Block rather than Tile [1] to +/// avoid confusion with loop tiling. +/// +/// [1] "Register Allocation via Hierarchical Graph Coloring", David Callahan +/// and Brian Koblenz, PLDI 1991 +/// +/// [2] "Structural analysis: A new approach to flow analysis in optimizing +/// compilers", M. Sharir, Journal of Computer Languages, Jan. 1980 +/// +/// Note that in contrast to the IR BasicBlock, a VPBlockBase models its +/// control-flow edges with successor and predecessor VPBlockBase directly, +/// rather than through a Terminator branch or through predecessor branches that +/// Use the VPBlockBase. +class VPBlockBase { + friend class VPlanUtils; + +private: + const unsigned char VBID; /// Subclass identifier (for isa/dyn_cast). + + std::string Name; + + /// The immediate VPRegionBlock which this VPBlockBase belongs to, or null if + /// it is a topmost VPBlockBase. + class VPRegionBlock *Parent; + + /// List of predecessor blocks. + SmallVector Predecessors; + + /// List of successor blocks. + SmallVector Successors; + + /// Successor selector, null for zero or single successor blocks. + VPConditionBitRecipeBase *ConditionBitRecipe; + + /// Add \p Successor as the last successor to this block. + void appendSuccessor(VPBlockBase *Successor) { + assert(Successor && "Cannot add nullptr successor!"); + Successors.push_back(Successor); + } + + /// Add \p Predecessor as the last predecessor to this block. + void appendPredecessor(VPBlockBase *Predecessor) { + assert(Predecessor && "Cannot add nullptr predecessor!"); + Predecessors.push_back(Predecessor); + } + + /// Remove \p Predecessor from the predecessors of this block. + void removePredecessor(VPBlockBase *Predecessor) { + auto Pos = std::find(Predecessors.begin(), Predecessors.end(), Predecessor); + assert(Pos && "Predecessor does not exist"); + Predecessors.erase(Pos); + } + + /// Remove \p Successor from the successors of this block. + void removeSuccessor(VPBlockBase *Successor) { + auto Pos = std::find(Successors.begin(), Successors.end(), Successor); + assert(Pos && "Successor does not exist"); + Successors.erase(Pos); + } + +protected: + VPBlockBase(const unsigned char SC, const std::string &N) + : VBID(SC), Name(N), Parent(nullptr), ConditionBitRecipe(nullptr) {} + +public: + /// An enumeration for keeping track of the concrete subclass of VPBlockBase + /// that is actually instantiated. Values of this enumeration are kept in the + /// VPBlockBase classes VBID field. They are used for concrete type + /// identification. + typedef enum { VPBasicBlockSC, VPRegionBlockSC } VPBlockTy; + + virtual ~VPBlockBase() {} + + const std::string &getName() const { return Name; } + + /// \return an ID for the concrete type of this object. + /// This is used to implement the classof checks. This should not be used + /// for any other purpose, as the values may change as LLVM evolves. + unsigned getVPBlockID() const { return VBID; } + + const class VPRegionBlock *getParent() const { return Parent; } + + /// \return the VPBasicBlock that is the entry of this VPBlockBase, + /// recursively, if the latter is a VPRegionBlock. 
Otherwise, if this + /// VPBlockBase is a VPBasicBlock, it is returned. + const class VPBasicBlock *getEntryBasicBlock() const; + + /// \return the VPBasicBlock that is the exit of this VPBlockBase, + /// recursively, if the latter is a VPRegionBlock. Otherwise, if this + /// VPBlockBase is a VPBasicBlock, it is returned. + const class VPBasicBlock *getExitBasicBlock() const; + class VPBasicBlock *getExitBasicBlock(); + + const SmallVectorImpl &getSuccessors() const { + return Successors; + } + + const SmallVectorImpl &getPredecessors() const { + return Predecessors; + } + + SmallVectorImpl &getSuccessors() { return Successors; } + + SmallVectorImpl &getPredecessors() { return Predecessors; } + + /// \return the successor of this VPBlockBase if it has a single successor. + /// Otherwise return a null pointer. + VPBlockBase *getSingleSuccessor() const { + return (Successors.size() == 1 ? *Successors.begin() : nullptr); + } + + /// \return the predecessor of this VPBlockBase if it has a single + /// predecessor. Otherwise return a null pointer. + VPBlockBase *getSinglePredecessor() const { + return (Predecessors.size() == 1 ? *Predecessors.begin() : nullptr); + } + + /// Returns the closest ancestor starting from "this", which has successors. + /// Returns the root ancestor if all ancestors have no successors. + VPBlockBase *getAncestorWithSuccessors(); + + /// Returns the closest ancestor starting from "this", which has predecessors. + /// Returns the root ancestor if all ancestors have no predecessors. + VPBlockBase *getAncestorWithPredecessors(); + + /// \return the successors either attached directly to this VPBlockBase or, if + /// this VPBlockBase is the exit block of a VPRegionBlock and has no + /// successors of its own, search recursively for the first enclosing + /// VPRegionBlock that has successors and return them. If no such + /// VPRegionBlock exists, return the (empty) successors of the topmost + /// VPBlockBase reached. + const SmallVectorImpl &getHierarchicalSuccessors() { + return getAncestorWithSuccessors()->getSuccessors(); + } + + /// \return the hierarchical successor of this VPBlockBase if it has a single + /// hierarchical successor. Otherwise return a null pointer. + VPBlockBase *getSingleHierarchicalSuccessor() { + return getAncestorWithSuccessors()->getSingleSuccessor(); + } + + /// \return the predecessors either attached directly to this VPBlockBase or, + /// if this VPBlockBase is the entry block of a VPRegionBlock and has no + /// predecessors of its own, search recursively for the first enclosing + /// VPRegionBlock that has predecessors and return them. If no such + /// VPRegionBlock exists, return the (empty) predecessors of the topmost + /// VPBlockBase reached. + const SmallVectorImpl &getHierarchicalPredecessors() { + return getAncestorWithPredecessors()->getPredecessors(); + } + + /// \return the hierarchical predecessor of this VPBlockBase if it has a + /// single hierarchical predecessor. Otherwise return a null pointer. + VPBlockBase *getSingleHierarchicalPredecessor() { + return getAncestorWithPredecessors()->getSinglePredecessor(); + } + + /// If a VPBlockBase has two successors, this is the Recipe that will generate + /// the condition bit selecting the successor, and feeding the terminating + /// conditional branch. Otherwise this is null. 
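getHierarchicalSuccessors climbs out of enclosing regions until it reaches a block that actually has successors, so cross-region edges lifted to a VPRegionBlock remain reachable from its exit block. A standalone sketch of the ancestor walk (Block is a stripped-down stand-in for VPBlockBase):

    #include <iostream>
    #include <string>
    #include <vector>

    struct Block {
      std::string Name;
      Block *Parent = nullptr;          // enclosing region, if any
      std::vector<Block *> Successors;  // edges at this level
    };

    // Closest ancestor (starting from B itself) that has successors; the root
    // is returned if no ancestor has any.
    static Block *getAncestorWithSuccessors(Block *B) {
      if (!B->Successors.empty() || !B->Parent)
        return B;
      return getAncestorWithSuccessors(B->Parent);
    }

    int main() {
      Block Region{"region1"}, Exit{"exit"}, Next{"next"};
      Exit.Parent = &Region;              // Exit is the region's exit block
      Region.Successors.push_back(&Next); // cross-region edge lifted to Region

      // Exit has no successors of its own, so the walk surfaces the region's.
      Block *From = getAncestorWithSuccessors(&Exit);
      for (Block *S : From->Successors)
        std::cout << Exit.Name << " -> " << S->Name << " (via " << From->Name
                  << ")\n";
      return 0;
    }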
+ VPConditionBitRecipeBase *getConditionBitRecipe() { + return ConditionBitRecipe; + } + + const VPConditionBitRecipeBase *getConditionBitRecipe() const { + return ConditionBitRecipe; + } + + void setConditionBitRecipe(VPConditionBitRecipeBase *R) { + ConditionBitRecipe = R; + } + + /// The method which generates all new IR instructions that correspond to + /// this VPBlockBase in the vectorized version, thereby "executing" the VPlan. + virtual void vectorize(struct VPTransformState *State) = 0; + + /// Delete all blocks reachable from a given VPBlockBase, inclusive. + static void deleteCFG(VPBlockBase *Entry); +}; + +/// VPBasicBlock serves as the leaf of the Hierarchical CFG. It represents a +/// sequence of instructions that will appear consecutively in a basic block +/// of the vectorized version. The VPBasicBlock takes care of the control-flow +/// relations with other VPBasicBlock's and Regions. It holds a sequence of zero +/// or more VPRecipe's that take care of representing the instructions. +/// A VPBasicBlock that holds no VPRecipe's represents no instructions; this +/// may happen, e.g., to support disjoint Regions and to ensure Regions have a +/// single exit, possibly an empty one. +/// +/// Note that in contrast to the IR BasicBlock, a VPBasicBlock models its +/// control-flow edges with successor and predecessor VPBlockBase directly, +/// rather than through a Terminator branch or through predecessor branches that +/// "use" the VPBasicBlock. +class VPBasicBlock : public VPBlockBase { + friend class VPlanUtils; + +public: + typedef iplist RecipeListTy; + +private: + /// The list of VPRecipes, held in order of instructions to generate. + RecipeListTy Recipes; + +public: + /// Instruction iterators... + typedef RecipeListTy::iterator iterator; + typedef RecipeListTy::const_iterator const_iterator; + typedef RecipeListTy::reverse_iterator reverse_iterator; + typedef RecipeListTy::const_reverse_iterator const_reverse_iterator; + + //===--------------------------------------------------------------------===// + /// Recipe iterator methods + /// + inline iterator begin() { return Recipes.begin(); } + inline const_iterator begin() const { return Recipes.begin(); } + inline iterator end() { return Recipes.end(); } + inline const_iterator end() const { return Recipes.end(); } + + inline reverse_iterator rbegin() { return Recipes.rbegin(); } + inline const_reverse_iterator rbegin() const { return Recipes.rbegin(); } + inline reverse_iterator rend() { return Recipes.rend(); } + inline const_reverse_iterator rend() const { return Recipes.rend(); } + + inline size_t size() const { return Recipes.size(); } + inline bool empty() const { return Recipes.empty(); } + inline const VPRecipeBase &front() const { return Recipes.front(); } + inline VPRecipeBase &front() { return Recipes.front(); } + inline const VPRecipeBase &back() const { return Recipes.back(); } + inline VPRecipeBase &back() { return Recipes.back(); } + + /// Return the underlying instruction list container. + /// + /// Currently you need to access the underlying instruction list container + /// directly if you want to modify it. + const RecipeListTy &getInstList() const { return Recipes; } + RecipeListTy &getInstList() { return Recipes; } + + /// Returns a pointer to a member of the instruction list. 
+ static RecipeListTy VPBasicBlock::*getSublistAccess(VPRecipeBase *) { + return &VPBasicBlock::Recipes; + } + + VPBasicBlock(const std::string &Name) : VPBlockBase(VPBasicBlockSC, Name) {} + + ~VPBasicBlock() { Recipes.clear(); } + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPBlockBase *V) { + return V->getVPBlockID() == VPBlockBase::VPBasicBlockSC; + } + + /// Augment the existing recipes of a VPBasicBlock with an additional + /// \p Recipe at a position given by an existing recipe \p Before. If + /// \p Before is null, \p Recipe is appended as the last recipe. + void addRecipe(VPRecipeBase *Recipe, VPRecipeBase *Before = nullptr) { + Recipe->Parent = this; + if (!Before) { + Recipes.push_back(Recipe); + return; + } + assert(Before->Parent == this && + "Insertion before point not in this basic block."); + Recipes.insert(Before->getIterator(), Recipe); + } + + void removeRecipe(VPRecipeBase *Recipe) { + assert(Recipe->Parent == this && + "Recipe to remove not in this basic block."); + Recipes.remove(Recipe); + Recipe->Parent = nullptr; + } + + /// The method which generates all new IR instructions that correspond to + /// this VPBasicBlock in the vectorized version, thereby "executing" the + /// VPlan. + void vectorize(struct VPTransformState *State) override; + + /// Retrieve the list of VPRecipes that belong to this VPBasicBlock. + const RecipeListTy &getRecipes() const { return Recipes; } + +private: + /// Create an IR BasicBlock to hold the instructions vectorized from this + /// VPBasicBlock, and return it. Update the CFGState accordingly. + BasicBlock *createEmptyBasicBlock(VPTransformState::CFGState &CFG); +}; + +/// VPRegionBlock represents a collection of VPBasicBlocks and VPRegionBlocks +/// which form a single-entry-single-exit subgraph of the CFG in the vectorized +/// code. +/// +/// A VPRegionBlock may indicate that its contents are to be replicated several +/// times. This is designed to support predicated scalarization, in which a +/// scalar if-then code structure needs to be generated VF * UF times. Having +/// this replication indicator helps to keep a single VPlan for multiple +/// candidate VF's; the actual replication takes place only once the desired VF +/// and UF have been determined. +/// +/// **Design principle:** when some additional information relates to an SESE +/// set of VPBlockBase, we use a VPRegionBlock to wrap them and attach the +/// information to it. For example, a VPRegionBlock can be used to indicate that +/// a scalarized SESE region is to be replicated, and that a vectorized SESE +/// region can retain its internal control-flow, independent of the control-flow +/// external to the region. +class VPRegionBlock : public VPBlockBase { + friend class VPlanUtils; + +private: + /// Hold the Single Entry of the SESE region represented by the VPRegionBlock. + VPBlockBase *Entry; + + /// Hold the Single Exit of the SESE region represented by the VPRegionBlock. + VPBlockBase *Exit; + + /// A VPRegionBlock can represent either a single instance of its + /// VPBlockBases, or multiple (VF * UF) replicated instances. The latter is + /// used when the internal SESE region handles a single scalarized lane. 
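When the replication indicator is set, the region's blocks are emitted once per (part, lane) pair rather than once, mirroring VPRegionBlock::vectorize further below. A standalone sketch of that replication loop (the body is a placeholder for generating the region's blocks):

    #include <iostream>

    // Stand-in for VPIterationInstance: which scalar copy is being generated.
    struct IterationInstance {
      unsigned Part;
      unsigned Lane;
    };

    int main() {
      const unsigned VF = 4, UF = 2;

      // A replicating region is visited VF * UF times; each visit generates
      // the scalar if-then structure for one lane of one unrolled part.
      for (unsigned Part = 0; Part < UF; ++Part)
        for (unsigned Lane = 0; Lane < VF; ++Lane) {
          IterationInstance Instance{Part, Lane};
          std::cout << "replica for part " << Instance.Part << ", lane "
                    << Instance.Lane << "\n"; // placeholder for emitting blocks
        }
      return 0;
    }

Keeping replication as a region property, rather than cloning blocks up front, is what allows one VPlan to serve several candidate VFs.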
+ bool IsReplicator; + +public: + VPRegionBlock(const std::string &Name) + : VPBlockBase(VPRegionBlockSC, Name), Entry(nullptr), Exit(nullptr), + IsReplicator(false) {} + + ~VPRegionBlock() { + if (Entry) + deleteCFG(Entry); + } + + /// Method to support type inquiry through isa, cast, and dyn_cast. + static inline bool classof(const VPBlockBase *V) { + return V->getVPBlockID() == VPBlockBase::VPRegionBlockSC; + } + + VPBlockBase *getEntry() { return Entry; } + + VPBlockBase *getExit() { return Exit; } + + const VPBlockBase *getEntry() const { return Entry; } + + const VPBlockBase *getExit() const { return Exit; } + + /// An indicator if the VPRegionBlock represents single or multiple instances. + bool isReplicator() const { return IsReplicator; } + + void setReplicator(bool ToReplicate) { IsReplicator = ToReplicate; } + + /// The method which generates the new IR instructions that correspond to + /// this VPRegionBlock in the vectorized version, thereby "executing" the + /// VPlan. + void vectorize(struct VPTransformState *State) override; +}; + +/// A VPlan represents a candidate for vectorization, encoding various decisions +/// taken to produce efficient vector code, including: which instructions are to +/// vectorized or scalarized, which branches are to appear in the vectorized +/// version. It models the control-flow of the candidate vectorized version +/// explicitly, and holds prescriptions for generating the code for this version +/// from a given IR code. +/// VPlan takes a "senario-based approach" to vectorization planning - different +/// scenarios, corresponding to making different decisions, can be modeled using +/// different VPlans. +/// The corresponding IR code is required to be SESE. +/// The vectorized version is represented using a Hierarchical CFG. +class VPlan { + friend class VPlanUtils; + friend class VPlanUtilsLoopVectorizer; + +private: + /// Hold the single entry to the Hierarchical CFG of the VPlan. + VPBlockBase *Entry; + + /// The IR instructions which are to be transformed to fill the vectorized + /// version are held as ingredients inside the VPRecipe's of the VPlan. Hold a + /// reverse mapping to locate the VPRecipe an IR instruction belongs to. This + /// serves optimizations that operate on the VPlan. + DenseMap Inst2Recipe; + +public: + VPlan() : Entry(nullptr) {} + + ~VPlan() { + if (Entry) + VPBlockBase::deleteCFG(Entry); + } + + /// Generate the IR code for this VPlan. + void vectorize(struct VPTransformState *State); + + VPBlockBase *getEntry() { return Entry; } + const VPBlockBase *getEntry() const { return Entry; } + + void setEntry(VPBlockBase *Block) { Entry = Block; } + + /// Retrieve the VPRecipe a given instruction \p Inst belongs to in the VPlan. + /// Returns null if it belongs to no VPRecipe. + VPRecipeBase *getRecipe(Instruction *Inst) { + auto It = Inst2Recipe.find(Inst); + if (It == Inst2Recipe.end()) + return nullptr; + return It->second; + } + + void setInst2Recipe(Instruction *I, VPRecipeBase *R) { Inst2Recipe[I] = R; } + + void resetInst2Recipe(Instruction *I) { Inst2Recipe.erase(I); } + + /// Retrieve the VPBasicBlock a given instruction \p Inst belongs to in the + /// VPlan. Returns null if it belongs to no VPRecipe. 
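The Inst2Recipe map gives plan-level transformations such as splitRecipe, insertBefore and sinkInstruction constant-time access from an IR instruction back to its recipe and, through the recipe, to its VPBasicBlock. A standalone sketch of the lookup (std::unordered_map and strings stand in for DenseMap and Instructions):

    #include <iostream>
    #include <string>
    #include <unordered_map>

    struct RecipeStub {
      std::string ParentBlock; // name of the VPBasicBlock holding this recipe
    };

    int main() {
      RecipeStub Load{"loop.body"}, Store{"pred.store.if"};

      // Reverse mapping from (names of) IR instructions to their recipes.
      std::unordered_map<std::string, RecipeStub *> Inst2Recipe = {
          {"%v = load", &Load}, {"store %v", &Store}};

      auto getRecipe = [&](const std::string &Inst) -> RecipeStub * {
        auto It = Inst2Recipe.find(Inst);
        return It == Inst2Recipe.end() ? nullptr : It->second;
      };

      if (RecipeStub *R = getRecipe("store %v"))
        std::cout << "'store %v' lives in VPBasicBlock " << R->ParentBlock << "\n";
      if (!getRecipe("%u = add"))
        std::cout << "'%u = add' belongs to no recipe\n";
      return 0;
    }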
+ VPBasicBlock *getBasicBlock(Instruction *Inst) { + VPRecipeBase *Recipe = getRecipe(Inst); + if (!Recipe) + return nullptr; + return Recipe->getParent(); + } + +private: + /// Add to the given dominator tree the header block and every new basic block + /// that was created between it and the latch block, inclusive. + void updateDominatorTree(class DominatorTree *DT, BasicBlock *LoopPreHeaderBB, + BasicBlock *LoopLatchBB); +}; + +/// The VPlanUtils class provides interfaces for the construction and +/// manipulation of a VPlan. +class VPlanUtils { +private: + /// Unique ID generator. + static unsigned NextOrdinal; + +protected: + VPlan *Plan; + + typedef iplist RecipeListTy; + RecipeListTy *getRecipes(VPBasicBlock *Block) { return &Block->Recipes; } + +public: + VPlanUtils(VPlan *Plan) : Plan(Plan) {} + + ~VPlanUtils() {} + + /// Create a unique name for a new VPlan entity such as a VPBasicBlock or + /// VPRegionBlock. + std::string createUniqueName(const char *Prefix) { + std::string S; + raw_string_ostream RSO(S); + RSO << Prefix << NextOrdinal++; + return RSO.str(); + } + + /// Add a given \p Recipe as the last recipe of a given VPBasicBlock. + void appendRecipeToBasicBlock(VPRecipeBase *Recipe, VPBasicBlock *ToVPBB) { + assert(Recipe && "No recipe to append."); + assert(!Recipe->Parent && "Recipe already in VPlan"); + ToVPBB->addRecipe(Recipe); + } + + /// Create a new empty VPBasicBlock and return it. + VPBasicBlock *createBasicBlock() { + VPBasicBlock *BasicBlock = new VPBasicBlock(createUniqueName("BB")); + return BasicBlock; + } + + /// Create a new VPBasicBlock with a single \p Recipe and return it. + VPBasicBlock *createBasicBlock(VPRecipeBase *Recipe) { + VPBasicBlock *BasicBlock = new VPBasicBlock(createUniqueName("BB")); + appendRecipeToBasicBlock(Recipe, BasicBlock); + return BasicBlock; + } + + /// Create a new, empty VPRegionBlock, with no blocks. + VPRegionBlock *createRegion(bool IsReplicator) { + VPRegionBlock *Region = new VPRegionBlock(createUniqueName("region")); + setReplicator(Region, IsReplicator); + return Region; + } + + /// Set the entry VPBlockBase of a given VPRegionBlock to a given \p Block. + /// Block is to have no predecessors. + void setRegionEntry(VPRegionBlock *Region, VPBlockBase *Block) { + assert(Block->Predecessors.empty() && + "Entry block cannot have predecessors."); + Region->Entry = Block; + Block->Parent = Region; + } + + /// Set the exit VPBlockBase of a given VPRegionBlock to a given \p Block. + /// Block is to have no successors. + void setRegionExit(VPRegionBlock *Region, VPBlockBase *Block) { + assert(Block->Successors.empty() && "Exit block cannot have successors."); + Region->Exit = Block; + Block->Parent = Region; + } + + void setReplicator(VPRegionBlock *Region, bool ToReplicate) { + Region->setReplicator(ToReplicate); + } + + /// Sets a given VPBlockBase \p Successor as the single successor of another + /// VPBlockBase \p Block. The parent of \p Block is copied to be the parent of + /// \p Successor. + void setSuccessor(VPBlockBase *Block, VPBlockBase *Successor) { + assert(Block->getSuccessors().empty() && "Block successors already set."); + Block->appendSuccessor(Successor); + Successor->appendPredecessor(Block); + Successor->Parent = Block->Parent; + } + + /// Sets two given VPBlockBases \p IfTrue and \p IfFalse to be the two + /// successors of another VPBlockBase \p Block. A given + /// VPConditionBitRecipeBase provides the control selector. The parent of + /// \p Block is copied to be the parent of \p IfTrue and \p IfFalse. 
+ void setTwoSuccessors(VPBlockBase *Block, VPConditionBitRecipeBase *R, + VPBlockBase *IfTrue, VPBlockBase *IfFalse) { + assert(Block->getSuccessors().empty() && "Block successors already set."); + Block->setConditionBitRecipe(R); + Block->appendSuccessor(IfTrue); + Block->appendSuccessor(IfFalse); + IfTrue->appendPredecessor(Block); + IfFalse->appendPredecessor(Block); + IfTrue->Parent = Block->Parent; + IfFalse->Parent = Block->Parent; + } + + /// Given two VPBlockBases \p From and \p To, disconnect them from each other. + void disconnectBlocks(VPBlockBase *From, VPBlockBase *To) { + From->removeSuccessor(To); + To->removePredecessor(From); + } +}; + +/// VPlanPrinter prints a given VPlan to a given output stream. The printing is +/// indented and follows the dot format. +class VPlanPrinter { +private: + raw_ostream &OS; + const VPlan &Plan; + unsigned Depth; + unsigned TabLength = 2; + std::string Indent; + + /// Handle indentation. + void buildIndent() { Indent = std::string(Depth * TabLength, ' '); } + void resetDepth() { + Depth = 1; + buildIndent(); + } + void increaseDepth() { + ++Depth; + buildIndent(); + } + void decreaseDepth() { + --Depth; + buildIndent(); + } + + /// Dump each element of VPlan. + void dumpBlock(const VPBlockBase *Block); + void dumpEdges(const VPBlockBase *Block); + void dumpBasicBlock(const VPBasicBlock *BasicBlock); + void dumpRegion(const VPRegionBlock *Region); + + const char *getNodePrefix(const VPBlockBase *Block); + const std::string &getReplicatorString(const VPRegionBlock *Region); + void drawEdge(const VPBlockBase *From, const VPBlockBase *To, bool Hidden, + const Twine &Label); + +public: + VPlanPrinter(raw_ostream &O, const VPlan &P) : OS(O), Plan(P) {} + void dump(const std::string &Title = ""); +}; + +//===--------------------------------------------------------------------===// +// GraphTraits specializations for VPlan/VPRegionBlock Control-Flow Graphs // +//===--------------------------------------------------------------------===// + +// Provide specializations of GraphTraits to be able to treat a VPRegionBlock +// as a graph of VPBlockBases... + +template <> struct GraphTraits { + typedef VPBlockBase *NodeRef; + typedef SmallVectorImpl::iterator ChildIteratorType; + + static NodeRef getEntryNode(NodeRef N) { return N; } + + static inline ChildIteratorType child_begin(NodeRef N) { + return N->getSuccessors().begin(); + } + + static inline ChildIteratorType child_end(NodeRef N) { + return N->getSuccessors().end(); + } +}; + +template <> struct GraphTraits { + typedef const VPBlockBase *NodeRef; + typedef SmallVectorImpl::const_iterator ChildIteratorType; + + static NodeRef getEntryNode(NodeRef N) { return N; } + + static inline ChildIteratorType child_begin(NodeRef N) { + return N->getSuccessors().begin(); + } + + static inline ChildIteratorType child_end(NodeRef N) { + return N->getSuccessors().end(); + } +}; + +// Provide specializations of GraphTraits to be able to treat a VPRegionBlock as +// a graph of VPBasicBlocks... and to walk it in inverse order. Inverse order +// for a VPRegionBlock is considered to be when traversing the predecessor edges +// of a VPBlockBase instead of the successor edges. 
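These GraphTraits specializations let generic graph algorithms walk a plan through its explicit successor and predecessor lists. A standalone sketch of blocks wired the way setSuccessor/setTwoSuccessors wire VPBlockBases, traversed in reverse post-order with a plain DFS (no LLVM dependency):

    #include <algorithm>
    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    struct Block {
      std::string Name;
      std::vector<Block *> Successors;
      std::vector<Block *> Predecessors;
    };

    // Mirrors VPlanUtils::setSuccessor: edges are stored explicitly on both
    // endpoints rather than through terminator instructions.
    static void setSuccessor(Block &From, Block &To) {
      From.Successors.push_back(&To);
      To.Predecessors.push_back(&From);
    }

    static void postOrder(Block *B, std::set<Block *> &Seen,
                          std::vector<Block *> &Out) {
      if (!Seen.insert(B).second)
        return;
      for (Block *S : B->Successors)
        postOrder(S, Seen, Out);
      Out.push_back(B);
    }

    int main() {
      Block Entry{"entry"}, Then{"if.then"}, Else{"if.else"}, Exit{"exit"};
      setSuccessor(Entry, Then); // first of the two conditional successors
      setSuccessor(Entry, Else); // second; the condition bit would pick one
      setSuccessor(Then, Exit);
      setSuccessor(Else, Exit);

      std::set<Block *> Seen;
      std::vector<Block *> PO;
      postOrder(&Entry, Seen, PO);
      std::reverse(PO.begin(), PO.end()); // reverse post-order, as in codegen

      for (Block *B : PO)
        std::cout << B->Name << "\n";
      return 0;
    }

Traversing the Predecessors vectors instead gives the inverse order that the Inverse specialization below exposes.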
+// + +template <> struct GraphTraits> { + typedef VPBlockBase *NodeRef; + typedef SmallVectorImpl::iterator ChildIteratorType; + + static Inverse getEntryNode(Inverse B) { + return B; + } + + static inline ChildIteratorType child_begin(NodeRef N) { + return N->getPredecessors().begin(); + } + + static inline ChildIteratorType child_end(NodeRef N) { + return N->getPredecessors().end(); + } +}; + +} // namespace llvm + +#endif // LLVM_TRANSFORMS_VECTORIZE_VPLAN_H Index: lib/Transforms/Vectorize/VPlan.cpp =================================================================== --- /dev/null +++ lib/Transforms/Vectorize/VPlan.cpp @@ -0,0 +1,400 @@ +//===- VPlan.cpp - Vectorizer Plan ----------------------------------------===// +// +// The LLVM Compiler Infrastructure +// +// This file is distributed under the University of Illinois Open Source +// License. See LICENSE.TXT for details. +// +//===----------------------------------------------------------------------===// +// +// This is the LLVM vectorization plan. It represents a candidate for +// vectorization, allowing to plan and optimize how to vectorize a given loop +// before generating LLVM-IR. +// The vectorizer uses vectorization plans to estimate the costs of potential +// candidates and if profitable to execute the desired plan, generating vector +// LLVM-IR code. +// +//===----------------------------------------------------------------------===// + +#include "VPlan.h" +#include "llvm/ADT/PostOrderIterator.h" +#include "llvm/Analysis/LoopInfo.h" +#include "llvm/IR/BasicBlock.h" +#include "llvm/IR/Dominators.h" +#include "llvm/Support/GraphWriter.h" +#include "llvm/Transforms/Utils/BasicBlockUtils.h" + +using namespace llvm; + +#define DEBUG_TYPE "vplan" + +unsigned VPlanUtils::NextOrdinal = 1; + +VPOneByOneRecipeBase::VPOneByOneRecipeBase(unsigned char SC, + const BasicBlock::iterator B, + const BasicBlock::iterator E, + class VPlan *Plan) + : VPRecipeBase(SC), Begin(B), End(E) { + for (auto It = B; It != E; ++It) + Plan->setInst2Recipe(&*It, this); +} + +/// \return the VPBasicBlock that is the entry of Block, possibly indirectly. +const VPBasicBlock *VPBlockBase::getEntryBasicBlock() const { + const VPBlockBase *Block = this; + while (const VPRegionBlock *Region = dyn_cast(Block)) + Block = Region->getEntry(); + return cast(Block); +} + +/// \return the VPBasicBlock that is the exit of Block, possibly indirectly. +const VPBasicBlock *VPBlockBase::getExitBasicBlock() const { + const VPBlockBase *Block = this; + while (const VPRegionBlock *Region = dyn_cast(Block)) + Block = Region->getExit(); + return cast(Block); +} + +VPBasicBlock *VPBlockBase::getExitBasicBlock() { + VPBlockBase *Block = this; + while (VPRegionBlock *Region = dyn_cast(Block)) + Block = Region->getExit(); + return cast(Block); +} + +/// Returns the closest ancestor, starting from "this", which has successors. +/// Returns the root ancestor if all ancestors have no successors. +VPBlockBase *VPBlockBase::getAncestorWithSuccessors() { + if (!Successors.empty() || !Parent) + return this; + assert(Parent->getExit() == this && + "Block w/o successors not the exit of its parent."); + return Parent->getAncestorWithSuccessors(); +} + +/// Returns the closest ancestor, starting from "this", which has predecessors. +/// Returns the root ancestor if all ancestors have no predecessors. 
+VPBlockBase *VPBlockBase::getAncestorWithPredecessors() { + if (!Predecessors.empty() || !Parent) + return this; + assert(Parent->getEntry() == this && + "Block w/o predecessors not the entry of its parent."); + return Parent->getAncestorWithPredecessors(); +} + +void VPBlockBase::deleteCFG(VPBlockBase *Entry) { + SmallVector Blocks; + for (VPBlockBase *Block : depth_first(Entry)) + Blocks.push_back(Block); + + for (VPBlockBase *Block : Blocks) + delete Block; +} + +BasicBlock * +VPBasicBlock::createEmptyBasicBlock(VPTransformState::CFGState &CFG) { + // BB stands for IR BasicBlocks. VPBB stands for VPlan VPBasicBlocks. + // Pred stands for Predessor. Prev stands for Previous, last visited/created. + BasicBlock *PrevBB = CFG.PrevBB; + BasicBlock *NewBB = BasicBlock::Create(PrevBB->getContext(), "VPlannedBB", + PrevBB->getParent(), CFG.LastBB); + DEBUG(dbgs() << "LV: created " << NewBB->getName() << '\n'); + + // Hook up the new basic block to its predecessors. + for (VPBlockBase *PredVPBlock : getHierarchicalPredecessors()) { + VPBasicBlock *PredVPBB = PredVPBlock->getExitBasicBlock(); + BasicBlock *PredBB = CFG.VPBB2IRBB[PredVPBB]; + DEBUG(dbgs() << "LV: draw edge from" << PredBB->getName() << '\n'); + if (isa(PredBB->getTerminator())) { + PredBB->getTerminator()->eraseFromParent(); + BranchInst::Create(NewBB, PredBB); + } else { + // Replace old unconditional branch with new conditional branch. + // Note: we rely on traversing the successors in order. + BasicBlock *FirstSuccBB = PredBB->getSingleSuccessor(); + PredBB->getTerminator()->eraseFromParent(); + Value *Bit = PredVPBlock->getConditionBitRecipe()->getConditionBit(); + assert(Bit && "Cannot create conditional branch with empty bit."); + BranchInst::Create(FirstSuccBB, NewBB, Bit, PredBB); + } + } + return NewBB; +} + +void VPBasicBlock::vectorize(VPTransformState *State) { + VPIterationInstance *I = State->Instance; + bool Replica = I && !(I->Part == 0 && I->Lane == 0); + VPBasicBlock *PrevVPBB = State->CFG.PrevVPBB; + VPBlockBase *SingleHPred = nullptr; + BasicBlock *NewBB = State->CFG.PrevBB; // Reuse it if possible. + + // 1. Create an IR basic block, or reuse the last one if possible. + // The last IR basic block is reused in three cases: + // A. the first VPBB reuses the header BB - when PrevVPBB is null; + // B. when the current VPBB has a single (hierarchical) predecessor which + // is PrevVPBB and the latter has a single (hierarchical) successor; and + // C. when the current VPBB is an entry of a region replica - where PrevVPBB + // is the exit of this region from a previous instance. + if (PrevVPBB && /* A */ + !((SingleHPred = getSingleHierarchicalPredecessor()) && + SingleHPred->getExitBasicBlock() == PrevVPBB && + PrevVPBB->getSingleHierarchicalSuccessor()) && /* B */ + !(Replica && getPredecessors().empty())) { /* C */ + + NewBB = createEmptyBasicBlock(State->CFG); + State->Builder.SetInsertPoint(NewBB); + // Temporarily terminate with unreachable until CFG is rewired. + UnreachableInst *Terminator = State->Builder.CreateUnreachable(); + State->Builder.SetInsertPoint(Terminator); + // Register NewBB in its loop. In innermost loops its the same for all BB's. + Loop *L = State->LI->getLoopFor(State->CFG.LastBB); + L->addBasicBlockToLoop(NewBB, *State->LI); + State->CFG.PrevBB = NewBB; + } + + // 2. Fill the IR basic block with IR instructions. 
+  DEBUG(dbgs() << "LV: vectorizing VPBB:" << getName()
+               << " in BB:" << NewBB->getName() << '\n');
+
+  State->CFG.VPBB2IRBB[this] = NewBB;
+  State->CFG.PrevVPBB = this;
+
+  for (VPRecipeBase &Recipe : Recipes)
+    Recipe.vectorize(*State);
+
+  DEBUG(dbgs() << "LV: filled BB:" << *NewBB);
+}
+
+void VPRegionBlock::vectorize(VPTransformState *State) {
+  ReversePostOrderTraversal<VPBlockBase *> RPOT(Entry);
+  typedef typename std::vector<VPBlockBase *>::reverse_iterator rpo_iterator;
+
+  if (!isReplicator()) {
+    // Visit the VPBlocks connected to \p this, starting from it.
+    for (rpo_iterator I = RPOT.begin(); I != RPOT.end(); ++I) {
+      DEBUG(dbgs() << "LV: VPBlock in RPO " << (*I)->getName() << '\n');
+      (*I)->vectorize(State);
+    }
+    return;
+  }
+
+  assert(!State->Instance &&
+         "Replicating a Region only in null context instance.");
+  VPIterationInstance I;
+  State->Instance = &I;
+
+  for (I.Part = 0; I.Part < State->UF; ++I.Part)
+    for (I.Lane = 0; I.Lane < State->VF; ++I.Lane)
+      // Visit the VPBlocks connected to \p this, starting from it.
+      for (rpo_iterator It = RPOT.begin(); It != RPOT.end(); ++It) {
+        DEBUG(dbgs() << "LV: VPBlock in RPO " << (*It)->getName() << '\n');
+        (*It)->vectorize(State);
+      }
+
+  State->Instance = nullptr;
+}
+
+/// Generate the code inside the body of the vectorized loop. Assumes a single
+/// LoopVectorBody basic block was created for this; introduces additional
+/// basic blocks as needed, and fills them all.
+void VPlan::vectorize(VPTransformState *State) {
+  BasicBlock *VectorPreHeaderBB = State->CFG.PrevBB;
+  BasicBlock *VectorHeaderBB = VectorPreHeaderBB->getSingleSuccessor();
+  assert(VectorHeaderBB && "Loop preheader does not have a single successor.");
+  BasicBlock *VectorLatchBB = VectorHeaderBB;
+  auto CurrIP = State->Builder.saveIP();
+
+  // 1. Make room to generate basic blocks inside loop body if needed.
+  VectorLatchBB = VectorHeaderBB->splitBasicBlock(
+      VectorHeaderBB->getFirstInsertionPt(), "vector.body.latch");
+  Loop *L = State->LI->getLoopFor(VectorHeaderBB);
+  L->addBasicBlockToLoop(VectorLatchBB, *State->LI);
+  // Remove the edge between Header and Latch to allow other connections.
+  // Temporarily terminate with unreachable until CFG is rewired.
+  // Note: this asserts the xform code's assumption that getFirstInsertionPt()
+  // can be dereferenced into an Instruction.
+  VectorHeaderBB->getTerminator()->eraseFromParent();
+  State->Builder.SetInsertPoint(VectorHeaderBB);
+  UnreachableInst *Terminator = State->Builder.CreateUnreachable();
+  State->Builder.SetInsertPoint(Terminator);
+
+  // 2. Generate code in loop body of vectorized version.
+  State->CFG.PrevVPBB = nullptr;
+  State->CFG.PrevBB = VectorHeaderBB;
+  State->CFG.LastBB = VectorLatchBB;
+
+  for (VPBlockBase *CurrentBlock = Entry; CurrentBlock != nullptr;
+       CurrentBlock = CurrentBlock->getSingleSuccessor()) {
+    assert(CurrentBlock->getSuccessors().size() <= 1 &&
+           "Multiple successors at top level.");
+    CurrentBlock->vectorize(State);
+  }
+
+  // 3. Merge the temporary latch created with the last basic block filled.
+  BasicBlock *LastBB = State->CFG.PrevBB;
+  // Connect LastBB to VectorLatchBB to facilitate their merge.
+  assert(isa<UnreachableInst>(LastBB->getTerminator()) &&
+         "Expected VPlan CFG to terminate with unreachable");
+  LastBB->getTerminator()->eraseFromParent();
+  BranchInst::Create(VectorLatchBB, LastBB);
+
+  // Merge LastBB with Latch.
+  bool Merged = MergeBlockIntoPredecessor(VectorLatchBB, nullptr, State->LI);
+  assert(Merged && "Could not merge last basic block with latch.");
+  VectorLatchBB = LastBB;
+
+  updateDominatorTree(State->DT, VectorPreHeaderBB, VectorLatchBB);
+  State->Builder.restoreIP(CurrIP);
+}
+
+void VPlan::updateDominatorTree(DominatorTree *DT, BasicBlock *LoopPreHeaderBB,
+                                BasicBlock *LoopLatchBB) {
+  BasicBlock *LoopHeaderBB = LoopPreHeaderBB->getSingleSuccessor();
+  assert(LoopHeaderBB && "Loop preheader does not have a single successor.");
+  DT->addNewBlock(LoopHeaderBB, LoopPreHeaderBB);
+  // The vector body may be more than a single basic block by this point.
+  // Update the dominator tree information inside the vector body by
+  // propagating it from header to latch, expecting only triangular
+  // control-flow, if any.
+  BasicBlock *PostDomSucc = nullptr;
+  for (auto *BB = LoopHeaderBB; BB != LoopLatchBB; BB = PostDomSucc) {
+    // Get the list of successors of this block.
+    std::vector<BasicBlock *> Succs(succ_begin(BB), succ_end(BB));
+    assert(Succs.size() <= 2 &&
+           "Basic block in vector loop has more than 2 successors.");
+    PostDomSucc = Succs[0];
+    if (Succs.size() == 1) {
+      assert(PostDomSucc->getSinglePredecessor() &&
+             "PostDom successor has more than one predecessor.");
+      DT->addNewBlock(PostDomSucc, BB);
+      continue;
+    }
+    BasicBlock *InterimSucc = Succs[1];
+    if (PostDomSucc->getSingleSuccessor() == InterimSucc) {
+      PostDomSucc = Succs[1];
+      InterimSucc = Succs[0];
+    }
+    assert(InterimSucc->getSingleSuccessor() == PostDomSucc &&
+           "One successor of a basic block does not lead to the other.");
+    assert(InterimSucc->getSinglePredecessor() &&
+           "Interim successor has more than one predecessor.");
+    assert(std::distance(pred_begin(PostDomSucc), pred_end(PostDomSucc)) == 2 &&
+           "PostDom successor has more than two predecessors.");
+    DT->addNewBlock(InterimSucc, BB);
+    DT->addNewBlock(PostDomSucc, BB);
+  }
+}
+
+const char *VPlanPrinter::getNodePrefix(const VPBlockBase *Block) {
+  if (isa<VPBasicBlock>(Block))
+    return "";
+  assert(isa<VPRegionBlock>(Block) && "Unsupported kind of VPBlock.");
+  return "cluster_";
+}
+
+const std::string &
+VPlanPrinter::getReplicatorString(const VPRegionBlock *Region) {
+  static std::string ReplicatorString(DOT::EscapeString("<xVFxUF>"));
+  static std::string NonReplicatorString(DOT::EscapeString("<x1>"));
+  return Region->isReplicator() ? ReplicatorString : NonReplicatorString;
+}
+
+void VPlanPrinter::dump(const std::string &Title) {
+  resetDepth();
+  OS << "digraph VPlan {\n";
+  OS << "graph [labelloc=t, fontsize=30; label=\"Vectorization Plan";
+  if (!Title.empty())
+    OS << "\\n" << DOT::EscapeString(Title);
+  OS << "\"]\n";
+  OS << "node [shape=record]\n";
+  OS << "compound=true\n";
+
+  for (const VPBlockBase *CurrentBlock = Plan.getEntry();
+       CurrentBlock != nullptr;
+       CurrentBlock = CurrentBlock->getSingleSuccessor())
+    dumpBlock(CurrentBlock);
+
+  OS << "}\n";
+}
+
+void VPlanPrinter::dumpBlock(const VPBlockBase *Block) {
+  if (const VPBasicBlock *BasicBlock = dyn_cast<VPBasicBlock>(Block))
+    dumpBasicBlock(BasicBlock);
+  else if (const VPRegionBlock *Region = dyn_cast<VPRegionBlock>(Block))
+    dumpRegion(Region);
+  else
+    llvm_unreachable("Unsupported kind of VPBlock.");
+}
+
+/// Print the information related to a CFG edge between two VPBlockBases.
+void VPlanPrinter::drawEdge(const VPBlockBase *From, const VPBlockBase *To,
+                            bool Hidden, const Twine &Label) {
+  // Due to "dot" we print an edge between two regions as an edge between the
+  // exit basic block and the entry basic block of the respective regions.
+  const VPBlockBase *Tail = From->getExitBasicBlock();
+  const VPBlockBase *Head = To->getEntryBasicBlock();
+  OS << Indent << getNodePrefix(Tail) << DOT::EscapeString(Tail->getName())
+     << " -> " << getNodePrefix(Head) << DOT::EscapeString(Head->getName());
+  OS << " [ label=\"" << Label << '\"';
+  if (Tail != From)
+    OS << " ltail=" << getNodePrefix(From)
+       << DOT::EscapeString(From->getName());
+  if (Head != To)
+    OS << " lhead=" << getNodePrefix(To) << DOT::EscapeString(To->getName());
+  if (Hidden)
+    OS << "; splines=none";
+  OS << "]\n";
+}
+
+/// Print the information related to the CFG edges going out of a given
+/// \p Block, followed by printing the successor blocks themselves.
+void VPlanPrinter::dumpEdges(const VPBlockBase *Block) {
+  std::string Cond = "";
+  if (auto *ConditionBitRecipe = Block->getConditionBitRecipe())
+    Cond = ConditionBitRecipe->getName().str();
+  unsigned SuccessorNumber = 1;
+  for (auto *Successor : Block->getSuccessors()) {
+    drawEdge(Block, Successor, false,
+             Twine() + (SuccessorNumber == 2 ? "!" : "") + Twine(Cond));
+    ++SuccessorNumber;
+  }
+}
+
+/// Print a VPBasicBlock, including its VPRecipes, followed by printing its
+/// successor blocks.
+void VPlanPrinter::dumpBasicBlock(const VPBasicBlock *BasicBlock) {
+  std::string Indent(Depth * TabLength, ' ');
+  OS << Indent << getNodePrefix(BasicBlock)
+     << DOT::EscapeString(BasicBlock->getName()) << " [label = \"{"
+     << DOT::EscapeString(BasicBlock->getName());
+
+  for (const VPRecipeBase &Recipe : BasicBlock->getRecipes()) {
+    OS << " | ";
+    std::string RecipeString;
+    raw_string_ostream RSO(RecipeString);
+    Recipe.print(RSO);
+    OS << DOT::EscapeString(RSO.str());
+  }
+
+  OS << "}\"]\n";
+  dumpEdges(BasicBlock);
+}
+
+/// Print a given \p Region of the VPlan.
+void VPlanPrinter::dumpRegion(const VPRegionBlock *Region) {
+  OS << Indent << "subgraph " << getNodePrefix(Region)
+     << DOT::EscapeString(Region->getName()) << " {\n";
+  increaseDepth();
+  OS << Indent;
+  OS << "label = \"" << getReplicatorString(Region) << " "
+     << DOT::EscapeString(Region->getName()) << "\"\n\n";
+
+  // Dump the blocks of the region.
+  assert(Region->getEntry() && "Region contains no inner blocks.");
+
+  for (const VPBlockBase *Block : depth_first(Region->getEntry()))
+    dumpBlock(Block);
+
+  decreaseDepth();
+  OS << Indent << "}\n";
+  dumpEdges(Region);
+}
Index: test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
===================================================================
--- test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
+++ test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
@@ -15,9 +15,9 @@
 ; CHECK: br i1 {{.*}}, label %[[IF0:.+]], label %[[CONT0:.+]]
 ; CHECK: [[IF0]]:
 ; CHECK: %[[T00:.+]] = extractelement <2 x i64> %wide.load, i32 0
-; CHECK: %[[T01:.+]] = extractelement <2 x i64> %wide.load, i32 0
-; CHECK: %[[T02:.+]] = add nsw i64 %[[T01]], %x
-; CHECK: %[[T03:.+]] = udiv i64 %[[T00]], %[[T02]]
+; CHECK: %[[T01:.+]] = add nsw i64 %[[T00]], %x
+; CHECK: %[[T02:.+]] = extractelement <2 x i64> %wide.load, i32 0
+; CHECK: %[[T03:.+]] = udiv i64 %[[T02]], %[[T01]]
 ; CHECK: %[[T04:.+]] = insertelement <2 x i64> undef, i64 %[[T03]], i32 0
 ; CHECK: br label %[[CONT0]]
 ; CHECK: [[CONT0]]:
@@ -25,9 +25,9 @@
 ; CHECK: br i1 {{.*}}, label %[[IF1:.+]], label %[[CONT1:.+]]
 ; CHECK: [[IF1]]:
 ; CHECK: %[[T06:.+]] = extractelement <2 x i64> %wide.load, i32 1
-; CHECK: %[[T07:.+]] = extractelement <2 x i64> %wide.load, i32 1
-; CHECK: %[[T08:.+]] = add nsw i64 %[[T07]], %x
-; CHECK: %[[T09:.+]] = udiv i64 %[[T06]], %[[T08]]
+; CHECK: %[[T07:.+]] = add nsw i64 %[[T06]], %x
+; CHECK: %[[T08:.+]] = extractelement <2 x i64> %wide.load, i32 1
+; CHECK: %[[T09:.+]] = udiv i64 %[[T08]], %[[T07]]
 ; CHECK: %[[T10:.+]] = insertelement <2 x i64> %[[T05]], i64 %[[T09]], i32 1
 ; CHECK: br label %[[CONT1]]
 ; CHECK: [[CONT1]]:
Index: test/Transforms/LoopVectorize/AArch64/predication_costs.ll
===================================================================
--- test/Transforms/LoopVectorize/AArch64/predication_costs.ll
+++ test/Transforms/LoopVectorize/AArch64/predication_costs.ll
@@ -18,8 +18,8 @@
 ; Cost of udiv:
 ; (udiv(2) + extractelement(6) + insertelement(3)) / 2 = 5
 ;
-; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
 ; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
+; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
 ;
 define i32 @predicated_udiv(i32* %a, i32* %b, i1 %c, i64 %n) {
 entry:
@@ -59,8 +59,8 @@
 ; Cost of store:
 ; (store(4) + extractelement(3)) / 2 = 3
 ;
-; CHECK: Found an estimated cost of 3 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
 ; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4
+; CHECK: Found an estimated cost of 3 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
 ;
 define void @predicated_store(i32* %a, i1 %c, i32 %x, i64 %n) {
 entry:
@@ -98,10 +98,10 @@
 ; Cost of udiv:
 ; (udiv(2) + extractelement(3) + insertelement(3)) / 2 = 4
 ;
-; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp3 = add nsw i32 %tmp2, %x
-; CHECK: Found an estimated cost of 4 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
 ; CHECK: Scalarizing: %tmp3 = add nsw i32 %tmp2, %x
 ; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
+; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp3 = add nsw i32 %tmp2, %x
+; CHECK: Found an estimated cost of 4 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
 ;
 define i32 @predicated_udiv_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) {
 entry:
@@ -143,10 +143,10 @@
 ; Cost of store:
 ; store(4) / 2 = 2
 ;
-; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp2 = add nsw i32 %tmp1, %x
-; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
 ; CHECK: Scalarizing: %tmp2 = add nsw i32 %tmp1, %x
 ; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4
+; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp2 = add nsw i32 %tmp1, %x
+; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
 ;
 define void @predicated_store_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) {
 entry:
@@ -192,16 +192,16 @@
 ; Cost of store:
 ; store(4) / 2 = 2
 ;
-; CHECK: Found an estimated cost of 1 for VF 2 For instruction: %tmp2 = add i32 %tmp1, %x
-; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp3 = sdiv i32 %tmp1, %tmp2
-; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp3, %tmp2
-; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp5 = sub i32 %tmp4, %x
-; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp5, i32* %tmp0, align 4
 ; CHECK-NOT: Scalarizing: %tmp2 = add i32 %tmp1, %x
 ; CHECK: Scalarizing and predicating: %tmp3 = sdiv i32 %tmp1, %tmp2
 ; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp3, %tmp2
 ; CHECK: Scalarizing: %tmp5 = sub i32 %tmp4, %x
 ; CHECK: Scalarizing and predicating: store i32 %tmp5, i32* %tmp0, align 4
+; CHECK: Found an estimated cost of 1 for VF 2 For instruction: %tmp2 = add i32 %tmp1, %x
+; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp3 = sdiv i32 %tmp1, %tmp2
+; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp3, %tmp2
+; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp5 = sub i32 %tmp4, %x
+; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp5, i32* %tmp0, align 4
 ;
 define void @predication_multi_context(i32* %a, i1 %c, i32 %x, i64 %n) {
 entry:
Index: test/Transforms/LoopVectorize/if-pred-non-void.ll
===================================================================
--- test/Transforms/LoopVectorize/if-pred-non-void.ll
+++ test/Transforms/LoopVectorize/if-pred-non-void.ll
@@ -219,9 +219,9 @@
 ; CHECK: br i1 {{.*}}, label %[[IF0:.+]], label %[[CONT0:.+]]
 ; CHECK: [[IF0]]:
 ; CHECK: %[[T00:.+]] = extractelement <2 x i32> %wide.load, i32 0
-; CHECK: %[[T01:.+]] = extractelement <2 x i32> %wide.load, i32 0
-; CHECK: %[[T02:.+]] = add nsw i32 %[[T01]], %x
-; CHECK: %[[T03:.+]] = udiv i32 %[[T00]], %[[T02]]
+; CHECK: %[[T01:.+]] = add nsw i32 %[[T00]], %x
+; CHECK: %[[T02:.+]] = extractelement <2 x i32> %wide.load, i32 0
+; CHECK: %[[T03:.+]] = udiv i32 %[[T02]], %[[T01]]
 ; CHECK: %[[T04:.+]] = insertelement <2 x i32> undef, i32 %[[T03]], i32 0
 ; CHECK: br label %[[CONT0]]
 ; CHECK: [[CONT0]]:
@@ -229,9 +229,9 @@
 ; CHECK: br i1 {{.*}}, label %[[IF1:.+]], label %[[CONT1:.+]]
 ; CHECK: [[IF1]]:
 ; CHECK: %[[T06:.+]] = extractelement <2 x i32> %wide.load, i32 1
-; CHECK: %[[T07:.+]] = extractelement <2 x i32> %wide.load, i32 1
-; CHECK: %[[T08:.+]] = add nsw i32 %[[T07]], %x
-; CHECK: %[[T09:.+]] = udiv i32 %[[T06]], %[[T08]]
+; CHECK: %[[T07:.+]] = add nsw i32 %[[T06]], %x
+; CHECK: %[[T08:.+]] = extractelement <2 x i32> %wide.load, i32 1
+; CHECK: %[[T09:.+]] = udiv i32 %[[T08]], %[[T07]]
 ; CHECK: %[[T10:.+]] = insertelement <2 x i32> %[[T05]], i32 %[[T09]], i32 1
 ; CHECK: br label %[[CONT1]]
 ; CHECK: [[CONT1]]:
Index: test/Transforms/LoopVectorize/induction.ll
===================================================================
--- test/Transforms/LoopVectorize/induction.ll
+++ test/Transforms/LoopVectorize/induction.ll
@@ -309,18 +309,18 @@
 ;
 ; CHECK-LABEL: @scalarize_induction_variable_05(
 ; CHECK: vector.body:
-; CHECK: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue2 ]
+; CHECK: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue4 ]
 ; CHECK: %[[I0:.+]] = add i32 %index, 0
 ; CHECK: getelementptr inbounds i32, i32* %a, i32 %[[I0]]
 ; CHECK: pred.udiv.if:
 ; CHECK: udiv i32 {{.*}}, %[[I0]]
-; CHECK: pred.udiv.if1:
+; CHECK: pred.udiv.if3:
 ; CHECK: %[[I1:.+]] = add i32 %index, 1
 ; CHECK: udiv i32 {{.*}}, %[[I1]]
 ;
 ; UNROLL-NO_IC-LABEL: @scalarize_induction_variable_05(
 ; UNROLL-NO-IC: vector.body:
-; UNROLL-NO-IC: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue11 ]
+; UNROLL-NO-IC: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue13 ]
 ; UNROLL-NO-IC: %[[I0:.+]] = add i32 %index, 0
 ; UNROLL-NO-IC: %[[I2:.+]] = add i32 %index, 2
 ; UNROLL-NO-IC: getelementptr inbounds i32, i32* %a, i32 %[[I0]]
@@ -330,26 +330,26 @@
 ; UNROLL-NO-IC: pred.udiv.if6:
 ; UNROLL-NO-IC: %[[I1:.+]] = add i32 %index, 1
 ; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I1]]
-; UNROLL-NO-IC: pred.udiv.if8:
+; UNROLL-NO-IC: pred.udiv.if9:
 ; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I2]]
-; UNROLL-NO-IC: pred.udiv.if10:
+; UNROLL-NO-IC: pred.udiv.if12:
 ; UNROLL-NO-IC: %[[I3:.+]] = add i32 %index, 3
 ; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I3]]
 ;
 ; IND-LABEL: @scalarize_induction_variable_05(
 ; IND: vector.body:
-; IND: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue2 ]
+; IND: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue4 ]
 ; IND: %[[E0:.+]] = sext i32 %index to i64
 ; IND: getelementptr inbounds i32, i32* %a, i64 %[[E0]]
 ; IND: pred.udiv.if:
 ; IND: udiv i32 {{.*}}, %index
-; IND: pred.udiv.if1:
+; IND: pred.udiv.if3:
 ; IND: %[[I1:.+]] = or i32 %index, 1
 ; IND: udiv i32 {{.*}}, %[[I1]]
 ;
 ; UNROLL-LABEL: @scalarize_induction_variable_05(
 ; UNROLL: vector.body:
-; UNROLL: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue11 ]
+; UNROLL: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue13 ]
 ; UNROLL: %[[I2:.+]] = or i32 %index, 2
 ; UNROLL: %[[E0:.+]] = sext i32 %index to i64
 ; UNROLL: %[[G0:.+]] = getelementptr inbounds i32, i32* %a, i64 %[[E0]]
@@ -359,9 +359,9 @@
 ; UNROLL: pred.udiv.if6:
 ; UNROLL: %[[I1:.+]] = or i32 %index, 1
 ; UNROLL: udiv i32 {{.*}}, %[[I1]]
-; UNROLL: pred.udiv.if8:
+; UNROLL: pred.udiv.if9:
 ; UNROLL: udiv i32 {{.*}}, %[[I2]]
-; UNROLL: pred.udiv.if10:
+; UNROLL: pred.udiv.if12:
 ; UNROLL: %[[I3:.+]] = or i32 %index, 3
 ; UNROLL: udiv i32 {{.*}}, %[[I3]]