|
| 1 | +================== |
| 2 | +Vectorization Plan |
| 3 | +================== |
| 4 | + |
| 5 | +.. contents:: |
| 6 | + :local: |
| 7 | + |
| 8 | +Abstract |
| 9 | +======== |
| 10 | +The vectorization transformation can be rather complicated, involving several |
| 11 | +potential alternatives, especially for outer-loops [1]_ but also possibly for |
| 12 | +innermost loops. These alternatives may have significant performance impact, |
| 13 | +both positive and negative. A cost model is therefore employed to identify the |
| 14 | +best alternative, including the alternative of avoiding any transformation |
| 15 | +altogether. |
| 16 | + |
| 17 | +The Vectorization Plan is an explicit model for describing vectorization |
| 18 | +candidates. It serves for both optimizing candidates including estimating their |
| 19 | +cost reliably, and for performing their final translation into IR. This |
| 20 | +facilitates dealing with multiple vectorization candidates. |
| 21 | + |
| 22 | +High-level Design |
| 23 | +================= |
| 24 | + |
| 25 | +Vectorization Workflow |
| 26 | +---------------------- |
| 27 | +VPlan-based vectorization involves three major steps, taking a "scenario-based |
| 28 | +approach" to vectorization planning: |
| 29 | + |
| 30 | +1. Legal Step: check if a loop can be legally vectorized; encode contraints and |
| 31 | + artifacts if so. |
| 32 | +2. Plan Step: |
| 33 | + |
| 34 | + a. Build initial VPlans following the constraints and decisions taken by |
| 35 | + Legal Step 1, and compute their cost. |
| 36 | + b. Apply optimizations to the VPlans, possibly forking additional VPlans. |
| 37 | + Prune sub-optimal VPlans having relatively high cost. |
| 38 | +3. Execute Step: materialize the best VPlan. Note that this is the only step |
| 39 | + that modifies the IR. |
| 40 | + |
| 41 | +Design Guidelines |
| 42 | +----------------- |
| 43 | +In what follows, the term "input IR" refers to code that is fed into the |
| 44 | +vectorizer whereas the term "output IR" refers to code that is generated by the |
| 45 | +vectorizer. The output IR contains code that has been vectorized or "widened" |
| 46 | +according to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed |
| 47 | +according to an Unroll Factor (UF). |
| 48 | +The design of VPlan follows several high-level guidelines: |
| 49 | + |
| 50 | +1. Analysis-like: building and manipulating VPlans must not modify the input IR. |
| 51 | + In particular, if the best option is not to vectorize at all, the |
| 52 | + vectorization process terminates before reaching Step 3, and compilation |
| 53 | + should proceed as if VPlans had not been built. |
| 54 | + |
| 55 | +2. Align Cost & Execute: each VPlan must support both estimating the cost and |
| 56 | + generating the output IR code, such that the cost estimation evaluates the |
| 57 | + to-be-generated code reliably. |
| 58 | + |
| 59 | +3. Support vectorizing additional constructs: |
| 60 | + |
| 61 | + a. Outer-loop vectorization. In particular, VPlan must be able to model the |
| 62 | + control-flow of the output IR which may include multiple basic-blocks and |
| 63 | + nested loops. |
| 64 | + b. SLP vectorization. |
| 65 | + c. Combinations of the above, including nested vectorization: vectorizing |
| 66 | + both an inner loop and an outer-loop at the same time (each with its own |
| 67 | + VF and UF), mixed vectorization: vectorizing a loop with SLP patterns |
| 68 | + inside [4]_, (re)vectorizing input IR containing vector code. |
| 69 | + d. Function vectorization [2]_. |
| 70 | + |
| 71 | +4. Support multiple candidates efficiently. In particular, similar candidates |
| 72 | + related to a range of possible VF's and UF's must be represented efficiently. |
| 73 | + Potential versioning needs to be supported efficiently. |
| 74 | + |
| 75 | +5. Support vectorizing idioms, such as interleaved groups of strided loads or |
| 76 | + stores. This is achieved by modeling a sequence of output instructions using |
| 77 | + a "Recipe", which is responsible for computing its cost and generating its |
| 78 | + code. |
| 79 | + |
| 80 | +6. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization |
| 81 | + such regions may need to be, for example, predicated and linearized, or |
| 82 | + replicated VF*UF times to handle scalarized and predicated instructions. |
| 83 | + Innerloops are also modelled as SESE regions. |
| 84 | + |
| 85 | +Low-level Design |
| 86 | +================ |
| 87 | +The low-level design of VPlan comprises of the following classes. |
| 88 | + |
| 89 | +:LoopVectorizationPlanner: |
| 90 | + A LoopVectorizationPlanner is designed to handle the vectorization of a loop |
| 91 | + or a loop nest. It can construct, optimize and discard one or more VPlans, |
| 92 | + each VPlan modelling a distinct way to vectorize the loop or the loop nest. |
| 93 | + Once the best VPlan is determined, including the best VF and UF, this VPlan |
| 94 | + drives the generation of output IR. |
| 95 | + |
| 96 | +:VPlan: |
| 97 | + A model of a vectorized candidate for a given input IR loop or loop nest. This |
| 98 | + candidate is represented using a Hierarchical CFG. VPlan supports estimating |
| 99 | + the cost and driving the generation of the output IR code it represents. |
| 100 | + |
| 101 | +:Hierarchical CFG: |
| 102 | + A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The |
| 103 | + Hierarchical CFG data structure is similar to the Tile Tree [5]_, where |
| 104 | + cross-Tile edges are lifted to connect Tiles instead of the original |
| 105 | + basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms |
| 106 | + Region and Block are used rather than Tile [5]_ to avoid confusion with loop |
| 107 | + tiling. |
| 108 | + |
| 109 | +:VPBlockBase: |
| 110 | + The building block of the Hierarchical CFG. A pure-virtual base-class of |
| 111 | + VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical |
| 112 | + control-flow relations with other VPBlocks. Note that in contrast to the IR |
| 113 | + BasicBlock, a VPBlockBase models its control-flow successors and predecessors |
| 114 | + directly, rather than through a Terminator branch or through predecessor |
| 115 | + branches that "use" the VPBlockBase. |
| 116 | + |
| 117 | +:VPBasicBlock: |
| 118 | + VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the |
| 119 | + Hierarchical CFG. It represents a sequence of output IR instructions that will |
| 120 | + appear consecutively in an output IR basic-block. The instructions of this |
| 121 | + basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a |
| 122 | + sequence of zero or more VPRecipes that model the cost and generation of the |
| 123 | + output IR instructions. |
| 124 | + |
| 125 | +:VPRegionBlock: |
| 126 | + VPRegionBlock is a subclass of VPBlockBase. It models a collection of |
| 127 | + VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR |
| 128 | + CFG. A VPRegionBlock may indicate that its contents are to be replicated a |
| 129 | + constant number of times when output IR is generated, effectively representing |
| 130 | + a loop with constant trip-count that will be completely unrolled. This is used |
| 131 | + to support scalarized and predicated instructions with a single model for |
| 132 | + multiple candidate VF's and UF's. |
| 133 | + |
| 134 | +:VPRecipeBase: |
| 135 | + A pure-virtual base class modeling a sequence of one or more output IR |
| 136 | + instructions, possibly based on one or more input IR instructions. These |
| 137 | + input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe |
| 138 | + may specify how its ingredients are to be transformed to produce the output IR |
| 139 | + instructions; e.g., cloned once, replicated multiple times or widened |
| 140 | + according to selected VF. |
| 141 | + |
| 142 | +:VPTransformState: |
| 143 | + Stores information used for generating output IR, passed from |
| 144 | + LoopVectorizationPlanner to its selected VPlan for execution, and used to pass |
| 145 | + additional information down to VPBlocks and VPRecipes. |
| 146 | + |
| 147 | +Related LLVM components |
| 148 | +----------------------- |
| 149 | +1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP |
| 150 | + tree, where TSLP [3]_ adds Plan Step 2.b. |
| 151 | + |
| 152 | +2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by |
| 153 | + Polly [7]_. |
| 154 | + |
| 155 | +References |
| 156 | +---------- |
| 157 | +.. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit |
| 158 | + Nuzman and Ayal Zaks, PACT 2008. |
| 159 | +
|
| 160 | +.. [2] "Proposal for function vectorization and loop vectorization with function |
| 161 | + calls", Xinmin Tian, [`cfe-dev |
| 162 | + <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_]., |
| 163 | + March 2, 2016. |
| 164 | + See also `review <https://reviews.llvm.org/D22792>`_. |
| 165 | +
|
| 166 | +.. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios |
| 167 | + Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015. |
| 168 | +
|
| 169 | +.. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization |
| 170 | + overhead", Hao Zhou and Jingling Xue, CGO 2016. |
| 171 | +
|
| 172 | +.. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and |
| 173 | + Brian Koblenz, PLDI 1991 |
| 174 | +
|
| 175 | +.. [6] "Structural analysis: A new approach to flow analysis in optimizing |
| 176 | + compilers", M. Sharir, Journal of Computer Languages, Jan. 1980 |
| 177 | +
|
| 178 | +.. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma |
| 179 | + thesis, 2011. |
| 180 | +
|
| 181 | +.. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks, |
| 182 | + European LLVM Developers' Meeting 2017. |
0 commit comments