This is an archive of the discontinued LLVM Phabricator instance.

Add a pass that specializes parallel loops for easier unrolling and vectorization
Closed · Public

Authored by bkramer on Feb 27 2020, 4:33 AM.

Details

Reviewers

herhut
nicolasvasilache
ftynse

Commits

rG5abf128d647d: Add a pass that specializes parallel loops for easier unrolling and…

Summary

This matches loops with an affine.min upper bound, limiting the trip
count to a constant, and rewrites them into two loops, one with constant
upper bound and one with variable upper bound. The assumption is that
the constant upper bound loop will be unrolled and vectorized, which is
preferable if this is the hot path.
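
The rewrite described in the summary can be illustrated outside MLIR with an ordinary C++ loop. This is only a minimal sketch of the idea, not code from the patch: `kTile` and `addTile` are hypothetical names, and `std::min` stands in for the affine.min that bounds the trip count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical tile size; plays the role of the constant in the affine.min.
constexpr std::size_t kTile = 1024;

// Adds `b` and `c` element-wise into `out` for one tile. `remaining` is the
// number of elements left, so the trip count is min(kTile, remaining).
// Returns true if the constant-bound version of the loop was taken.
inline bool addTile(const float *b, const float *c, float *out,
                    std::size_t remaining) {
  assert(b && c && out && "buffers must be non-null");
  const std::size_t bound = std::min(kTile, remaining);
  if (bound == kTile) {
    // Hot path: the trip count is a compile-time constant, so a compiler is
    // free to unroll and vectorize this loop.
    for (std::size_t i = 0; i < kTile; ++i)
      out[i] = b[i] + c[i];
    return true;
  }
  // Cold path: same body, variable upper bound.
  for (std::size_t i = 0; i < bound; ++i)
    out[i] = b[i] + c[i];
  return false;
}
```

Calling `addTile(b, c, out, 4096)` would take the constant-bound branch, while `addTile(b, c, out, 5)` would fall back to the variable-bound loop. Both branches compute the same values; only the shape of the loop differs.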

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

bkramer created this revision. Feb 27 2020, 4:33 AM

Herald added a reviewer: nicolasvasilache. Feb 27 2020, 4:34 AM

Herald added a project: Restricted Project.

Herald added subscribers: llvm-commits, Joonsoo, liufengdb and 12 others.

Harbormaster failed remote builds in B47402: Diff 246912! Feb 27 2020, 6:18 AM

ftynse added a subscriber: ftynse. Feb 27 2020, 6:25 AM

ftynse added inline comments.

mlir/lib/Dialect/LoopOps/Transforms/ParallelLoopPreparationForVectorization.cpp
1	(On Diff #246912)	(bikeshed) could we find a shorter name? ParallelLoopSpecialization?
36	(On Diff #246912)	This assumes the first operand of the map appears in the results of the map unmodified. While the "if" introduced below will make sure nothing bad happens if it does not, there may be cases where the generated code is dead. For example
	%0 = affine.min affine_map<(d0)->(-d0,-1024)>(%c10)
	loop.parallel (%i) = (%c0) to (%0)
will never have the upper bound of 10. I'd at least add a check that the operand indeed appears unmodified.

Another thing is canonicalization. If you canonicalize before this pass, constants should be folded into the affine map so you'll never have a constant index operand to affine apply. You could look into normalizing affine maps with operands and then just check if some _results_ are constant affine expressions. This is fine for a follow-up.
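
The suggestion to look only at constant _results_ of the map can be sketched without any MLIR dependency. The model below is an assumption made for illustration: `MapResult` and `smallestConstantResult` are hypothetical names, and each map result is reduced to "a known constant" or "unknown until run time".

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for one result of an affine map: either a known
// constant or an expression whose value is only known at run time.
using MapResult = std::optional<std::int64_t>;

// Returns the smallest constant among the map results, or std::nullopt if
// none of the results is a constant. A loop bounded by the minimum of these
// results can only ever reach a constant bound equal to this value.
inline std::optional<std::int64_t>
smallestConstantResult(const std::vector<MapResult> &results) {
  std::optional<std::int64_t> best;
  for (const MapResult &r : results)
    if (r)
      best = best ? std::min(*best, *r) : *r;
  return best;
}
```

Under this model, the map from the comment, `(d0) -> (-d0, -1024)`, has one constant result, so the candidate bound is -1024 regardless of the operand value 10. A map with no constant result yields `std::nullopt`, in which case there is nothing to specialize on.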

rriddle added inline comments. Feb 27 2020, 8:38 AM

mlir/lib/Dialect/LoopOps/Transforms/ParallelLoopPreparationForVectorization.cpp
28	(On Diff #246912)	Please only use anonymous namespace for classes. Functions should be in the global scope and marked as static.
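
The convention requested here can be shown in a few lines. This is a generic sketch with hypothetical names (`clampToTile`, `TileClamp`), not code from the patch: the helper function is a `static` free function, and only the class sits in an anonymous namespace.

```cpp
#include <cstdint>

// File-local helper: a free function marked `static` instead of being
// wrapped in an anonymous namespace.
static std::int64_t clampToTile(std::int64_t value, std::int64_t tile) {
  return value < tile ? value : tile;
}

namespace {
// File-local class: classes cannot be marked `static`, so the anonymous
// namespace is what keeps them private to this translation unit.
struct TileClamp {
  std::int64_t tile;
  std::int64_t operator()(std::int64_t value) const {
    return clampToTile(value, tile);
  }
};
} // namespace
```

The committed ParallelLoopSpecialization.cpp follows the same split: `specializeLoopForUnrolling` is a `static` function, and the pass struct lives in `namespace { ... }`.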

Renamed the thing ParallelLoopSpecialization
Match the canonicalized AffineMap form, not the one produced by ParallelLoopTiling

Harbormaster completed remote builds in B47571: Diff 247224. Feb 28 2020, 5:53 AM

ftynse accepted this revision. Feb 28 2020, 8:22 AM

This revision is now accepted and ready to land. Feb 28 2020, 8:22 AM

Closed by commit rG5abf128d647d: Add a pass that specializes parallel loops for easier unrolling and… (authored by bkramer). Feb 28 2020, 10:52 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
mlir/include/mlir/Dialect/LoopOps/Passes.h	4 lines
mlir/include/mlir/InitAllPasses.h	1 line
mlir/lib/Dialect/LoopOps/Transforms/CMakeLists.txt	1 line
mlir/lib/Dialect/LoopOps/Transforms/ParallelLoopSpecialization.cpp	76 lines
mlir/test/Dialect/Loops/parallel-loop-specialization.mlir	49 lines

Diff 247224

mlir/include/mlir/Dialect/LoopOps/Passes.h

 namespace mlir {

 class Pass;

 /// Creates a loop fusion pass which fuses parallel loops.
 std::unique_ptr<Pass> createParallelLoopFusionPass();

+/// Creates a pass that specializes parallel loop for unrolling and
+/// vectorization.
+std::unique_ptr<Pass> createParallelLoopSpecializationPass();
+
 /// Creates a pass which tiles innermost parallel loops.
 std::unique_ptr<Pass>
 createParallelLoopTilingPass(llvm::ArrayRef<int64_t> tileSize = {});

 } // namespace mlir

 #endif // MLIR_DIALECT_LOOPOPS_PASSES_H_

mlir/include/mlir/InitAllPasses.h

 #endif
 createLinalgPromotionPass(0);
 createConvertLinalgToLoopsPass();
 createConvertLinalgToParallelLoopsPass();
 createConvertLinalgToAffineLoopsPass();
 createConvertLinalgToLLVMPass();

 // LoopOps
 createParallelLoopFusionPass();
+createParallelLoopSpecializationPass();
 createParallelLoopTilingPass();

 // QuantOps
 quant::createConvertSimulatedQuantPass();
 quant::createConvertConstPass();
 quantizer::createAddDefaultStatsPass();
 quantizer::createRemoveInstrumentationPass();
 quantizer::registerInferQuantizedTypesPass();

mlir/lib/Dialect/LoopOps/Transforms/CMakeLists.txt

 add_llvm_library(MLIRLoopOpsTransforms
 ParallelLoopFusion.cpp
+ParallelLoopSpecialization.cpp
 ParallelLoopTiling.cpp

 ADDITIONAL_HEADER_DIRS
 ${MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/LoopOps
 )

 target_link_libraries(MLIRLoopOpsTransforms
 MLIRPass
 MLIRLoopOps
 )

mlir/lib/Dialect/LoopOps/Transforms/ParallelLoopSpecialization.cpp

This file was added.

				//===- ParallelLoopSpecialization.cpp - loop.parallel specialization ------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// Specializes parallel loops for easier unrolling and vectorization.
				//
				//===----------------------------------------------------------------------===//

				#include "mlir/Dialect/AffineOps/AffineOps.h"
				#include "mlir/Dialect/LoopOps/LoopOps.h"
				#include "mlir/Dialect/LoopOps/Passes.h"
				#include "mlir/Dialect/StandardOps/IR/Ops.h"
				#include "mlir/IR/AffineExpr.h"
				#include "mlir/IR/BlockAndValueMapping.h"
				#include "mlir/Pass/Pass.h"

				using namespace mlir;
				using loop::ParallelOp;

				/// Rewrite a loop with bounds defined by an affine.min with a constant into 2
				/// loops after checking if the bounds are equal to that constant. This is
				/// beneficial if the loop will almost always have the constant bound and that
				/// version can be fully unrolled and vectorized.
				static void specializeLoopForUnrolling(ParallelOp op) {
				  SmallVector<int64_t, 2> constantIndices;
				  constantIndices.reserve(op.upperBound().size());
				  for (auto bound : op.upperBound()) {
				    auto minOp = dyn_cast_or_null<AffineMinOp>(bound.getDefiningOp());
				    if (!minOp)
				      return;
				    int64_t minConstant = std::numeric_limits<int64_t>::max();
				    for (auto expr : minOp.map().getResults()) {
				      if (auto constantIndex = expr.dyn_cast<AffineConstantExpr>())
				        minConstant = std::min(minConstant, constantIndex.getValue());
				    }
				    if (minConstant == std::numeric_limits<int64_t>::max())
				      return;
				    constantIndices.push_back(minConstant);
				  }

				  OpBuilder b(op);
				  BlockAndValueMapping map;
				  Value cond;
				  for (auto bound : llvm::zip(op.upperBound(), constantIndices)) {
				    Value constant = b.create<ConstantIndexOp>(op.getLoc(), std::get<1>(bound));
				    Value cmp = b.create<CmpIOp>(op.getLoc(), CmpIPredicate::eq,
				                                 std::get<0>(bound), constant);
				    cond = cond ? b.create<AndOp>(op.getLoc(), cond, cmp) : cmp;
				    map.map(std::get<0>(bound), constant);
				  }
				  auto ifOp = b.create<loop::IfOp>(op.getLoc(), cond, /*withElseRegion=*/true);
				  ifOp.getThenBodyBuilder().clone(*op.getOperation(), map);
				  ifOp.getElseBodyBuilder().clone(*op.getOperation());
				  op.erase();
				}

				namespace {
				struct ParallelLoopSpecialization
				    : public FunctionPass<ParallelLoopSpecialization> {
				  void runOnFunction() override {
				    getFunction().walk([](ParallelOp op) { specializeLoopForUnrolling(op); });
				  }
				};
				} // namespace

				std::unique_ptr<Pass> mlir::createParallelLoopSpecializationPass() {
				  return std::make_unique<ParallelLoopSpecialization>();
				}

				static PassRegistration<ParallelLoopSpecialization>
				    pass("parallel-loop-specialization",
				         "Specialize parallel loops for vectorization.");

mlir/test/Dialect/Loops/parallel-loop-specialization.mlir

This file was added.

				// RUN: mlir-opt %s -parallel-loop-specialization -split-input-file | FileCheck %s --dump-input-on-failure

				#map0 = affine_map<()[s0, s1] -> (1024, s0 - s1)>
				#map1 = affine_map<()[s0, s1] -> (64, s0 - s1)>

				func @parallel_loop(%outer_i0: index, %outer_i1: index, %A: memref<?x?xf32>, %B: memref<?x?xf32>,
				                    %C: memref<?x?xf32>, %result: memref<?x?xf32>) {
				  %c0 = constant 0 : index
				  %c1 = constant 1 : index
				  %d0 = dim %A, 0 : memref<?x?xf32>
				  %d1 = dim %A, 1 : memref<?x?xf32>
				  %b0 = affine.min #map0()[%d0, %outer_i0]
				  %b1 = affine.min #map1()[%d1, %outer_i1]
				  loop.parallel (%i0, %i1) = (%c0, %c0) to (%b0, %b1) step (%c1, %c1) {
				    %B_elem = load %B[%i0, %i1] : memref<?x?xf32>
				    %C_elem = load %C[%i0, %i1] : memref<?x?xf32>
				    %sum_elem = addf %B_elem, %C_elem : f32
				    store %sum_elem, %result[%i0, %i1] : memref<?x?xf32>
				  }
				  return
				}

				// CHECK-LABEL: func @parallel_loop(
				// CHECK-SAME: [[VAL_0:%.*]]: index, [[VAL_1:%.*]]: index, [[VAL_2:%.*]]: memref<?x?xf32>, [[VAL_3:%.*]]: memref<?x?xf32>, [[VAL_4:%.*]]: memref<?x?xf32>, [[VAL_5:%.*]]: memref<?x?xf32>) {
				// CHECK: [[VAL_6:%.*]] = constant 0 : index
				// CHECK: [[VAL_7:%.*]] = constant 1 : index
				// CHECK: [[VAL_8:%.*]] = dim [[VAL_2]], 0 : memref<?x?xf32>
				// CHECK: [[VAL_9:%.*]] = dim [[VAL_2]], 1 : memref<?x?xf32>
				// CHECK: [[VAL_10:%.*]] = affine.min #map0(){{\[}}[[VAL_8]], [[VAL_0]]]
				// CHECK: [[VAL_11:%.*]] = affine.min #map1(){{\[}}[[VAL_9]], [[VAL_1]]]
				// CHECK: [[VAL_12:%.*]] = constant 1024 : index
				// CHECK: [[VAL_13:%.*]] = cmpi "eq", [[VAL_10]], [[VAL_12]] : index
				// CHECK: [[VAL_14:%.*]] = constant 64 : index
				// CHECK: [[VAL_15:%.*]] = cmpi "eq", [[VAL_11]], [[VAL_14]] : index
				// CHECK: [[VAL_16:%.*]] = and [[VAL_13]], [[VAL_15]] : i1
				// CHECK: loop.if [[VAL_16]] {
				// CHECK: loop.parallel ([[VAL_17:%.*]], [[VAL_18:%.*]]) = ([[VAL_6]], [[VAL_6]]) to ([[VAL_12]], [[VAL_14]]) step ([[VAL_7]], [[VAL_7]]) {
				// CHECK: store
				// CHECK: }
				// CHECK: } else {
				// CHECK: loop.parallel ([[VAL_22:%.*]], [[VAL_23:%.*]]) = ([[VAL_6]], [[VAL_6]]) to ([[VAL_10]], [[VAL_11]]) step ([[VAL_7]], [[VAL_7]]) {
				// CHECK: store
				// CHECK: }
				// CHECK: }
				// CHECK: return
				// CHECK: }
