# MLIR Passes

This document describes the available MLIR passes and their contracts.

[TOC]

## Affine dialect lowering (`-lower-affine`)

Convert operations from the affine dialect into operations from the loop and
standard dialects.

`affine.for` operations are converted to `loop.for` operations that are free of
certain structural restrictions (on their bounds and step). `affine.if` is
similarly converted to the `loop.if` operation. `affine.apply` operations are
converted into sequences of primitive arithmetic operations from the standard
dialect that have the same effect, using operands of the `index` type.
Consequently, named maps and sets that are no longer in use may be removed from
the module.

For example, `%r = affine.apply affine_map<(d0, d1)[s0] -> (d0 + 2*d1 +
s0)>(%d0, %d1)[%s0]` can be converted into:

```mlir
%d0 = <...>
%d1 = <...>
%s0 = <...>
%0 = constant 2 : index
%1 = muli %0, %d1
%2 = addi %d0, %1
%r = addi %2, %s0
```
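
Similarly, an `affine.if` is rewritten so that its integer-set condition is
evaluated with standard arithmetic and the resulting `i1` value is fed to a
`loop.if`. A minimal sketch for a single-constraint set (hypothetical input;
the exact output of the pass may differ):

```mlir
// Input (hypothetical).
affine.if affine_set<(d0) : (d0 - 10 >= 0)>(%i) {
  "foo"() : () -> ()
}

// Possible output: the constraint d0 - 10 >= 0 is evaluated explicitly.
%c10 = constant 10 : index
%0 = subi %i, %c10 : index
%c0 = constant 0 : index
%1 = cmpi "sge", %0, %c0 : index
loop.if %1 {
  "foo"() : () -> ()
}
```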

### Input invariant

- no `Tensor` types;

These restrictions may be lifted in the future.

### Output IR

Functions with `affine.for` and `affine.if` operations eliminated. These
functions may contain operations from the Standard dialect in addition to those
already present before the pass.

### Invariants

- Functions without a body are not modified.
- The semantics of the other functions is preserved.
- Individual operations other than those mentioned above are not modified if
  they do not depend on the loop iterator value or on the result of
  `affine.apply`.

## Conversion from Standard to LLVM IR dialect (`-convert-std-to-llvm`)

Convert standard operations into the LLVM IR dialect operations.

### Input invariant

- operations including: arithmetic on integers and floats, constants, direct
  calls, returns and branches;
- no `tensor` types;
- all `vector`s are one-dimensional;
- all blocks are reachable by following the successors of the first basic
  block;

If other operations are present and their results are required by the LLVM IR
dialect operations, the pass will fail. Any LLVM IR operations or types already
present in the IR will be kept as is.

### Output IR

Functions converted to LLVM IR. Function argument types are converted
one-to-one. Function results are converted one-to-one and, in case more than
one value is returned, packed into an LLVM IR struct type. Function calls and
returns are updated accordingly. Block argument types are updated to use LLVM
IR types.
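
For instance, a function returning two values might be rewritten along these
lines (a sketch only; the exact LLVM dialect type syntax varies across MLIR
versions):

```mlir
// Input (hypothetical).
func @pair(%a: i64, %b: i64) -> (i64, i64) {
  return %a, %b : i64, i64
}

// Possible output: both results are packed into a single LLVM struct.
llvm.func @pair(%arg0: !llvm.i64, %arg1: !llvm.i64) -> !llvm<"{ i64, i64 }"> {
  %0 = llvm.mlir.undef : !llvm<"{ i64, i64 }">
  %1 = llvm.insertvalue %arg0, %0[0] : !llvm<"{ i64, i64 }">
  %2 = llvm.insertvalue %arg1, %1[1] : !llvm<"{ i64, i64 }">
  llvm.return %2 : !llvm<"{ i64, i64 }">
}
```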

## Data Copy DMA generation (`-affine-data-copy-generate`)

Replaces all loads and stores on memrefs living in `slowMemorySpace` by
introducing DMA operations (strided DMA if necessary) to transfer data to/from
`fastMemorySpace`, and rewrites the original loads/stores to instead load/store
from the allocated fast memory buffers. Additional options specify the
identifier corresponding to the fast memory space and the amount of fast memory
space available. The pass traverses the nesting structure, recursing to inner
levels if necessary to determine at what depth DMA transfers need to be placed
so that the allocated buffers fit within the memory capacity provided. If this
is not possible (for example, when the elemental type itself is of size larger
than the DMA capacity), an error with location information is emitted. The DMA
transfers are also hoisted up past all loops with respect to which the
transfers are invariant.

Input

```mlir
func @loop_nest_tiled() -> memref<256x1024xf32> {
  %0 = alloc() : memref<256x1024xf32>
  affine.for %i0 = 0 to 256 step 32 {
    affine.for %i1 = 0 to 1024 step 32 {
      affine.for %i2 = (d0) -> (d0)(%i0) to (d0) -> (d0 + 32)(%i0) {
        affine.for %i3 = (d0) -> (d0)(%i1) to (d0) -> (d0 + 32)(%i1) {
          %1 = affine.load %0[%i2, %i3] : memref<256x1024xf32>
        }
      }
    }
  }
  return %0 : memref<256x1024xf32>
}
```

Output (with flags: `-affine-data-copy-generate -affine-data-copy-generate-fast-mem-space=2`)

```mlir
module {
  func @loop_nest_tiled() -> memref<256x1024xf32> {
    %c262144 = constant 262144 : index
    %c0 = constant 0 : index
    %0 = alloc() : memref<256x1024xf32>
    %1 = alloc() : memref<256x1024xf32, 2>
    %2 = alloc() : memref<1xi32>
    affine.dma_start %0[%c0, %c0], %1[%c0, %c0], %2[%c0], %c262144 : memref<256x1024xf32>, memref<256x1024xf32, 2>, memref<1xi32>
    affine.dma_wait %2[%c0], %c262144 : memref<1xi32>
    affine.for %arg0 = 0 to 256 step 32 {
      affine.for %arg1 = 0 to 1024 step 32 {
        affine.for %arg2 = #map1(%arg0) to #map2(%arg0) {
          affine.for %arg3 = #map1(%arg1) to #map2(%arg1) {
            %3 = affine.load %1[%arg2, %arg3] : memref<256x1024xf32, 2>
          }
        }
      }
    }
    dealloc %2 : memref<1xi32>
    dealloc %1 : memref<256x1024xf32, 2>
    return %0 : memref<256x1024xf32>
  }
}
```

## Loop tiling (`-affine-loop-tile`)

Performs tiling or blocking of loop nests. It currently works on perfect loop
nests.
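
An illustrative sketch (hypothetical input; the actual output may use named
affine maps for the intra-tile bounds): tiling a 2-d nest by 32 introduces
outer inter-tile loops stepping by the tile size and inner intra-tile loops
bounded by them.

```mlir
// Input (hypothetical).
affine.for %i = 0 to 256 {
  affine.for %j = 0 to 512 {
    %v = affine.load %A[%i, %j] : memref<256x512xf32>
  }
}

// Possible output with tile sizes of 32.
affine.for %i0 = 0 to 256 step 32 {
  affine.for %j0 = 0 to 512 step 32 {
    affine.for %i1 = (d0) -> (d0)(%i0) to (d0) -> (d0 + 32)(%i0) {
      affine.for %j1 = (d0) -> (d0)(%j0) to (d0) -> (d0 + 32)(%j0) {
        %v = affine.load %A[%i1, %j1] : memref<256x512xf32>
      }
    }
  }
}
```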

## Loop unroll (`-affine-loop-unroll`)

This pass implements loop unrolling. It is able to unroll loops with arbitrary
bounds, and generate a cleanup loop when necessary.
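
For example, unrolling a loop with trip count 10 by a factor of four yields a
main unrolled loop plus a cleanup loop covering the two remaining iterations
(an illustrative sketch; `"foo"` is a placeholder operation):

```mlir
// Input (hypothetical).
affine.for %i = 0 to 10 {
  "foo"(%i) : (index) -> ()
}

// Possible output after unrolling by 4.
affine.for %i = 0 to 8 step 4 {
  "foo"(%i) : (index) -> ()
  %1 = affine.apply affine_map<(d0) -> (d0 + 1)>(%i)
  "foo"(%1) : (index) -> ()
  %2 = affine.apply affine_map<(d0) -> (d0 + 2)>(%i)
  "foo"(%2) : (index) -> ()
  %3 = affine.apply affine_map<(d0) -> (d0 + 3)>(%i)
  "foo"(%3) : (index) -> ()
}
// Cleanup loop for the remaining iterations.
affine.for %i = 8 to 10 {
  "foo"(%i) : (index) -> ()
}
```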

## Loop unroll and jam (`-affine-loop-unroll-jam`)

This pass implements unroll and jam for loops. It works on both perfect and
imperfect loop nests.
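
Unlike plain unrolling, unroll-and-jam duplicates the body of the inner loop
when the outer loop is unrolled, keeping a single inner loop. A sketch
(hypothetical input; `"foo"` is a placeholder operation):

```mlir
// Input (hypothetical).
affine.for %i = 0 to 4 {
  affine.for %j = 0 to 8 {
    "foo"(%i, %j) : (index, index) -> ()
  }
}

// Possible output after unroll-and-jam by 2 on %i.
affine.for %i = 0 to 4 step 2 {
  affine.for %j = 0 to 8 {
    "foo"(%i, %j) : (index, index) -> ()
    %1 = affine.apply affine_map<(d0) -> (d0 + 1)>(%i)
    "foo"(%1, %j) : (index, index) -> ()
  }
}
```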

## Loop fusion (`-affine-loop-fusion`)

Performs fusion of loop nests using a slicing-based approach. The fused loop
nests, when possible, are rewritten to access significantly smaller local
buffers instead of the original memrefs, and the latter are often either
completely optimized away or contracted. This transformation leads to enhanced
locality and a lower memory footprint through the elimination or contraction
of temporaries / intermediate memrefs. These benefits are sometimes achieved
at the expense of redundant computation; a cost model evaluates the available
choices, such as the depth at which a source slice should be materialized in
the destination slice.
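
A minimal sketch of producer-consumer fusion with buffer contraction
(hypothetical input; whether fusion applies, and at what depth, depends on the
cost model):

Input (hypothetical)

```mlir
%m = alloc() : memref<10xf32>
%cf7 = constant 7.0 : f32
// Producer loop writes all of %m.
affine.for %i0 = 0 to 10 {
  affine.store %cf7, %m[%i0] : memref<10xf32>
}
// Consumer loop reads %m back.
affine.for %i1 = 0 to 10 {
  %v = affine.load %m[%i1] : memref<10xf32>
  %s = addf %v, %v : f32
}
```

Possible output (illustrative)

```mlir
// The loops are fused and the temporary is contracted to one element.
%m = alloc() : memref<1xf32>
%cf7 = constant 7.0 : f32
affine.for %i0 = 0 to 10 {
  affine.store %cf7, %m[0] : memref<1xf32>
  %v = affine.load %m[0] : memref<1xf32>
  %s = addf %v, %v : f32
}
```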

## Memref bound checking (`-memref-bound-check`)

Checks all loads and stores on memrefs for out-of-bound accesses, and reports
any out-of-bound accesses (both overrun and underrun) with location
information.

```mlir
test/Transforms/memref-bound-check.mlir:19:13: error: 'load' op memref out of upper bound access along dimension #2
%x = load %A[%idx0, %idx1] : memref<9 x 9 x i32>
     ^
test/Transforms/memref-bound-check.mlir:19:13: error: 'load' op memref out of lower bound access along dimension #2
%x = load %A[%idx0, %idx1] : memref<9 x 9 x i32>
     ^
```
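
A hypothetical input that would trigger both diagnostics (a sketch, not the
actual test file): the second index ranges over [-1, 9], which exceeds the
bounds of dimension #2 on both ends.

```mlir
%A = alloc() : memref<9x9xi32>
affine.for %i = 0 to 11 {
  %idx0 = affine.apply affine_map<(d0) -> (d0 mod 9)>(%i)
  // %idx1 ranges over [-1, 9]: both an underrun and an overrun.
  %idx1 = affine.apply affine_map<(d0) -> (d0 - 1)>(%i)
  %x = load %A[%idx0, %idx1] : memref<9x9xi32>
}
```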

## Memref dataflow optimization (`-memref-dataflow-opt`)

This pass performs store-to-load forwarding for memrefs to eliminate memory
accesses and, potentially, the entire memref if all its accesses are
forwarded.

Input

```mlir
func @store_load_affine_apply() -> memref<10x10xf32> {
  %cf7 = constant 7.0 : f32
  %m = alloc() : memref<10x10xf32>
  affine.for %i0 = 0 to 10 {
    affine.for %i1 = 0 to 10 {
      affine.store %cf7, %m[%i0, %i1] : memref<10x10xf32>
      %v0 = affine.load %m[%i0, %i1] : memref<10x10xf32>
      %v1 = addf %v0, %v0 : f32
    }
  }
  return %m : memref<10x10xf32>
}
```

Output

```mlir
module {
  func @store_load_affine_apply() -> memref<10x10xf32> {
    %cst = constant 7.000000e+00 : f32
    %0 = alloc() : memref<10x10xf32>
    affine.for %arg0 = 0 to 10 {
      affine.for %arg1 = 0 to 10 {
        affine.store %cst, %0[%arg0, %arg1] : memref<10x10xf32>
        %1 = addf %cst, %cst : f32
      }
    }
    return %0 : memref<10x10xf32>
  }
}
```

## Memref dependence analysis (`-memref-dependence-check`)

This pass performs dependence analysis to determine dependences between pairs
of memory operations (loads and stores) on memrefs. Dependence analysis
exploits the polyhedral information available (affine maps, expressions, and
affine.apply operations) to precisely represent dependences using affine
constraints, while also computing dependence vectors from them, where each
component of the dependence vector provides a lower and an upper bound on the
dependence distance along the corresponding dimension.

```mlir
test/Transforms/memref-dataflow-opt.mlir:232:7: note: dependence from 2 to 1 at depth 1 = ([1, 1], [-inf, +inf])
store %cf9, %m[%idx] : memref<10xf32>
```
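
For instance, in the hypothetical snippet below, iteration %i stores to %m[%i]
and the same element is read back in iteration %i + 1, so the store-to-load
dependence distance along the loop is exactly 1, i.e. a dependence component
of [1, 1]:

```mlir
%m = alloc() : memref<10xf32>
%cf7 = constant 7.0 : f32
affine.for %i = 1 to 10 {
  affine.store %cf7, %m[%i] : memref<10xf32>
  // Reads the value stored by the previous iteration: distance 1.
  %v = affine.load %m[%i - 1] : memref<10xf32>
}
```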

## Pipeline data transfer (`-affine-pipeline-data-transfer`)

This pass performs a transformation to overlap non-blocking DMA operations in
a loop with computations through double buffering. This is achieved by
advancing `dma_start` operations with respect to other operations.

Input

```mlir
func @pipelinedatatransfer() {
  %0 = alloc() : memref<256xf32>
  %1 = alloc() : memref<32xf32, 1>
  %2 = alloc() : memref<1xf32>
  %c0 = constant 0 : index
  %c128 = constant 128 : index
  affine.for %i0 = 0 to 8 {
    affine.dma_start %0[%i0], %1[%i0], %2[%c0], %c128 : memref<256xf32>, memref<32xf32, 1>, memref<1xf32>
    affine.dma_wait %2[%c0], %c128 : memref<1xf32>
    %3 = affine.load %1[%i0] : memref<32xf32, 1>
    %4 = "compute"(%3) : (f32) -> f32
    affine.store %4, %1[%i0] : memref<32xf32, 1>
  }
  return
}
```

Output

```mlir
module {
  func @pipelinedatatransfer() {
    %c8 = constant 8 : index
    %c0 = constant 0 : index
    %0 = alloc() : memref<256xf32>
    %c0_0 = constant 0 : index
    %c128 = constant 128 : index
    %1 = alloc() : memref<2x32xf32, 1>
    %2 = alloc() : memref<2x1xf32>
    affine.dma_start %0[%c0], %1[%c0 mod 2, %c0], %2[%c0 mod 2, symbol(%c0_0)], %c128 : memref<256xf32>, memref<2x32xf32, 1>, memref<2x1xf32>
    affine.for %arg0 = 1 to 8 {
      affine.dma_start %0[%arg0], %1[%arg0 mod 2, %arg0], %2[%arg0 mod 2, symbol(%c0_0)], %c128 : memref<256xf32>, memref<2x32xf32, 1>, memref<2x1xf32>
      %8 = affine.apply #map3(%arg0)
      %9 = affine.apply #map4(%8)
      %10 = affine.apply #map4(%8)
      affine.dma_wait %2[%8 mod 2, symbol(%c0_0)], %c128 : memref<2x1xf32>
      %11 = affine.load %1[%8 mod 2, %8] : memref<2x32xf32, 1>
      %12 = "compute"(%11) : (f32) -> f32
      affine.store %12, %1[%8 mod 2, %8] : memref<2x32xf32, 1>
    }
    %3 = affine.apply #map3(%c8)
    %4 = affine.apply #map4(%3)
    %5 = affine.apply #map4(%3)
    affine.dma_wait %2[%3 mod 2, symbol(%c0_0)], %c128 : memref<2x1xf32>
    %6 = affine.load %1[%3 mod 2, %3] : memref<2x32xf32, 1>
    %7 = "compute"(%6) : (f32) -> f32
    affine.store %7, %1[%3 mod 2, %3] : memref<2x32xf32, 1>
    dealloc %2 : memref<2x1xf32>
    dealloc %1 : memref<2x32xf32, 1>
    return
  }
}
```