This is an archive of the discontinued LLVM Phabricator instance.

mlir/lib/Transforms/ParallelLoopFusion.cpp
21 ↗	(On Diff #244389)	using namespace mlir;
26 ↗	(On Diff #244389)	Use /// for top-level comments.
27 ↗	(On Diff #244389)	Only classes should be in anonymous namespaces, functions should be in the global namespace and marked static. https://llvm.org/docs/CodingStandards.html#anonymous-namespaces
29 ↗	(On Diff #244389)	nit: The `-> WalkResult` trailing return type is unnecessary.
40 ↗	(On Diff #244389)	nit: Can you just use std::equal for now?
56 ↗	(On Diff #244389)	nit: Use LogicalResult instead of boolean.
75 ↗	(On Diff #244389)	Drop trivial braces.
85 ↗	(On Diff #244389)	nit: Remove this temporary value and just return the result directly. It doesn't really help readability at all.
117 ↗	(On Diff #244389)	nit: `block.getOps<ParallelOp>()`
123 ↗	(On Diff #244389)	nit: Drop all of these trivial braces.
136 ↗	(On Diff #244389)	Drop the mlir:: on each of these.

This revision now requires changes to proceed.Feb 13 2020, 11:23 AM

mehdi_amini added inline comments.Feb 13 2020, 10:46 PM

mlir/lib/Transforms/ParallelLoopFusion.cpp
137 ↗	(On Diff #244389)	Can this pass be located under the loop dialect?

Addressed the comments.

pifon2a added inline comments.Feb 13 2020, 11:19 PM

mlir/lib/Transforms/ParallelLoopFusion.cpp
40 ↗	(On Diff #244389)	fair enough :)
85 ↗	(On Diff #244389)	thanks. It was an artefact from logging that I used to have here when debugging.

pifon2a marked an inline comment as done.Feb 13 2020, 11:20 PM

pifon2a marked an inline comment as done.Feb 13 2020, 11:26 PM

pifon2a added inline comments.

mlir/lib/Transforms/ParallelLoopFusion.cpp
137 ↗	(On Diff #244389)	@mehdi_amini Do you mean moving this code to `mlir/lib/Transforms/LoopFusion.cpp` where fusion on affine loops is implemented? That file is quite big already. Or do you mean having it in `Dialect/LoopOps/Transforms` similarly to `Dialect/Linalg/Transforms`? If it is the latter, then there might be several other passes in `lib/Transforms` that have to be moved to respective directories.

Harbormaster failed remote builds in B46476: Diff 244576!Feb 13 2020, 11:29 PM

mehdi_amini added inline comments.Feb 13 2020, 11:31 PM

mlir/lib/Transforms/ParallelLoopFusion.cpp
137 ↗	(On Diff #244389)	yes there are several passes we need to move under their respective dialects. Most of the affine stuff is not under the affine dialect, because that's how MLIR started at the very beginning, before the dialect folder was even created.

Fixed CMake build.

Herald added a subscriber: mgorny. · Feb 14 2020, 3:29 AM

Harbormaster failed remote builds in B46490: Diff 244609!Feb 14 2020, 4:00 AM

Cool, great start!

mlir/include/mlir/Dialect/LoopOps/Passes.h
2	This is lacking the LLVM header comment.
mlir/lib/Dialect/LoopOps/Transforms/ParallelLoopFusion.cpp
40	Avoid named TODOs in the LLVM code base. I am an offender as well, but I was told to avoid this in the future.
49	You also need to ensure that `ploop2` does not write to buffers that `ploop1` reads. Both loops are fully independent if WRITES(ploop1) ∩ READS(ploop2) is empty and READS(ploop1) ∩ WRITES(ploop2) is empty. As we have sequential execution order within one iteration, we can ignore reads and writes between the two loops that are on the same index in the context of fusion.
50	Can you add a comment describing what the `map` contains? This is a map from `ploop1` indices to `ploop2` indices, I assume?
66	This is sufficient for memrefs that are defined outside of the loops. However, memrefs that are created within the loop, e.g. by a subview, will not be handled by this. As a starter, it would be ok to bail out in these cases.
71	Does something like `llvm::all_of(llvm::zip(storeIndices, loadIndices), std::equal)` work? Just curious, no need to go there.
111	What happens if there is a read or write in between the ploops? As a start, you could consider immediately adjacent loops or loops where we have no side-effecting ops in between.
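The independence condition from the comment on line 49 amounts to pairwise disjointness of the write set of one loop and the read set of the other. A minimal standalone sketch in plain C++ follows. It does not use MLIR types, and `LoopAccesses`, `disjoint`, and `fullyIndependent` are hypothetical names chosen for illustration, not part of the patch. It also ignores the same-index exception that the pass relies on.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical summary of the buffers that one parallel loop touches.
// Buffers are identified by name here; the real pass works on SSA values.
struct LoopAccesses {
  std::set<std::string> reads;
  std::set<std::string> writes;
};

// Returns true if the two sets have no element in common.
static bool disjoint(const std::set<std::string> &a,
                     const std::set<std::string> &b) {
  for (const std::string &buffer : a)
    if (b.count(buffer))
      return false;
  return true;
}

// Conservative check: the loops are fully independent if neither loop
// writes a buffer that the other loop reads.
static bool fullyIndependent(const LoopAccesses &first,
                             const LoopAccesses &second) {
  return disjoint(first.writes, second.reads) &&
         disjoint(first.reads, second.writes);
}
```

Under this strict rule the two loops of the `fuse_two` test would not count as independent, because the first loop writes `%sum` and the second reads it. That is why the patch additionally allows accesses at the same index.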

This revision now requires changes to proceed.Feb 14 2020, 4:16 AM

Addressed the comments.

pifon2a added inline comments.Feb 18 2020, 1:48 AM

mlir/lib/Dialect/LoopOps/Transforms/ParallelLoopFusion.cpp
71	should work with a custom comparator
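As a rough illustration of the "custom comparator" idea, the index check can be written with the two-range overload of `std::equal` (C++14) and a predicate that remaps each store index before comparing. This is a sketch under assumptions: `ValueId`, `lookupOrDefault`, and `indicesMatch` are hypothetical stand-ins written in plain C++, with an `std::unordered_map` playing the role of `BlockAndValueMapping`.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for an SSA value; the real pass compares mlir::Value handles.
using ValueId = std::string;
using IndexMap = std::unordered_map<ValueId, ValueId>;

// Mimics the lookup-or-default behaviour: returns the mapped value if the
// key is present, otherwise the key itself.
static ValueId lookupOrDefault(const IndexMap &mapping, const ValueId &key) {
  auto it = mapping.find(key);
  return it == mapping.end() ? key : it->second;
}

// Returns true if both index lists have the same length and every store
// index, after remapping, equals the load index at the same position.
static bool indicesMatch(const std::vector<ValueId> &storeIndices,
                         const std::vector<ValueId> &loadIndices,
                         const IndexMap &firstToSecond) {
  return std::equal(storeIndices.begin(), storeIndices.end(),
                    loadIndices.begin(), loadIndices.end(),
                    [&](const ValueId &store, const ValueId &load) {
                      return lookupOrDefault(firstToSecond, store) == load;
                    });
}
```

The four-iterator overload also covers the size comparison, so a separate length check is not needed in this formulation.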

Harbormaster failed remote builds in B46689: Diff 245106!Feb 18 2020, 2:05 AM

herhut requested changes to this revision.Feb 18 2020, 3:09 AM

herhut added inline comments.

mlir/lib/Dialect/LoopOps/Transforms/ParallelLoopFusion.cpp
56	Storing to a locally defined buffer also should bail.
91	This should be the dual to the above function with the same comments regarding aliasing. Without alias information, we have to bail on the case where we load or store to a locally defined buffer.
mlir/test/Dialect/Loops/parallel-loop-fusion.mlir
33	Can you drop the `{temp = true}` from the tests? They are just noise here.
277	This would already fail due to not-matching indices. Maybe load from `A` instead?
280	Why is this case illegal? If they store to the same buf but at the same index it should be fine.
292	There are 4 cases for this. Read/write to local allocation in loop1/loop2. Essentially any effect on a memref we do not understand.

This revision now requires changes to proceed.Feb 18 2020, 3:09 AM

Addressed the comments again.

pifon2a marked 2 inline comments as done.Feb 18 2020, 4:57 AM

AAAAAAAAAAAAA

Harbormaster failed remote builds in B46696: Diff 245129!Feb 18 2020, 5:54 AM

Harbormaster failed remote builds in B46697: Diff 245131!Feb 18 2020, 6:25 AM

Harbormaster failed remote builds in B46699: Diff 245139!Feb 18 2020, 6:52 AM

herhut added inline comments.Feb 19 2020, 1:36 AM

mlir/lib/Dialect/LoopOps/Transforms/ParallelLoopFusion.cpp
58	Maybe `haveNoReadsAfterWriteExceptSameIndex`?
101	Would it not suffice to call `haveNoReadsAfterWriteExceptSameIndex` with arguments reversed?

Unify deps check.

Harbormaster failed remote builds in B46791: Diff 245354!Feb 19 2020, 2:32 AM

herhut added inline comments.Feb 19 2020, 3:34 AM

mlir/lib/Dialect/LoopOps/Transforms/ParallelLoopFusion.cpp
49	the the -> the
143	Can you extend this so it applies to any operation and traverses its regions? That way it can also be used, e.g., on a ParallelLoop to fuse loops in its body.
173	This could be a pass that runs on any Operation then, not necessarily a function.
184	This is over-committing a bit. It does not fuse loop nests.
mlir/test/Dialect/Loops/parallel-loop-fusion.mlir
262	How is this different from above?
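The pass only attempts to fuse parallel loops that sit in the same block with no side-effecting operation between them. The grouping step can be sketched in plain C++ as below. The `Op` struct and `collectChains` are hypothetical simplifications for illustration; the real code iterates over MLIR operations and queries their side effects.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified view of one operation in a block.
struct Op {
  int id;
  bool isParallelLoop;
  bool hasSideEffects; // only consulted for non-loop operations
};

// Groups the ids of parallel loops into chains. A new chain is started
// whenever a side-effecting non-loop operation was seen since the last loop.
static std::vector<std::vector<int>>
collectChains(const std::vector<Op> &block) {
  std::vector<std::vector<int>> chains(1);
  bool noSideEffects = true;
  for (const Op &op : block) {
    if (op.isParallelLoop) {
      if (!noSideEffects) {
        chains.emplace_back();
        noSideEffects = true;
      }
      chains.back().push_back(op.id);
      continue;
    }
    noSideEffects = noSideEffects && !op.hasSideEffects;
  }
  return chains;
}
```

Fusion is then tried only on neighbouring pairs within one chain, so a store or call between two loops keeps them apart.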

pifon2a marked 9 inline comments as done.Feb 19 2020, 4:41 AM

pifon2a added inline comments.

mlir/lib/Dialect/LoopOps/Transforms/ParallelLoopFusion.cpp
101	No, because we are using BlockAndValueMapping that has only a map from IVs of ploop1 to IVs of ploop2. We can have the inverse map though.
mlir/test/Dialect/Loops/parallel-loop-fusion.mlir
262	renamed to `common_buf` to make it more visible.
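The "inverse map" mentioned in the reply on line 101 can be pictured with a small standalone sketch. It uses an `std::unordered_map` as a stand-in for the induction-variable mapping, and `ValueId`, `IndexMap`, and `invert` are hypothetical names. The final diff does not invert an existing map; it builds a second mapping directly from the second loop's block arguments to the first loop's.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Stand-in for an SSA value such as a loop induction variable.
using ValueId = std::string;
using IndexMap = std::unordered_map<ValueId, ValueId>;

// Builds the inverse of a one-to-one mapping, e.g. turning
// "IVs of ploop1 -> IVs of ploop2" into "IVs of ploop2 -> IVs of ploop1".
static IndexMap invert(const IndexMap &forward) {
  IndexMap inverse;
  for (const auto &entry : forward) {
    bool inserted = inverse.emplace(entry.second, entry.first).second;
    assert(inserted && "mapping must be one-to-one to be invertible");
    (void)inserted; // silence unused-variable warnings when asserts are off
  }
  return inverse;
}
```

With both directions available, the same read-after-write check can be run twice, once per loop order, each time with the mapping that matches its argument order.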

Blabla

herhut accepted this revision.Feb 19 2020, 5:26 AM

Harbormaster failed remote builds in B46795: Diff 245367!Feb 19 2020, 5:43 AM

Harbormaster failed remote builds in B46801: Diff 245378!Feb 19 2020, 6:32 AM

rriddle added inline comments.Feb 19 2020, 12:20 PM

mlir/lib/Dialect/LoopOps/Transforms/ParallelLoopFusion.cpp
55	nit: Please use /// for top-level comments.
161	Please remove the debugging here.

Thanks!

This revision is now accepted and ready to land.Feb 19 2020, 12:20 PM

Are we sure we want another parallel SW stack to do fusion here?
I'd love to understand the use cases that are fundamentally different from what we'd expect to do with Linalg + Affine (with or without Intel's multi-for).

Marking as blocker to be sure my question is addressed, sorry for being late to the party as I was away last week.

This revision now requires changes to proceed.Feb 23 2020, 7:34 PM

In D74544#1888566, @nicolasvasilache wrote:

Are we sure we want another parallel SW stack to do fusion here?
I'd love to understand the use cases that are fundamentally different from what we'd expect to do with Linalg + Affine (with or without Intel's multi-for).

I do not see a parallel SW stack. This is pretty much structural fusion on parallel loop nests, which are the bottom layer before we go to GPU. The reason this exists is that we do not have all code generation come through LinAlg and would like to combine loops at the lowest level. I do not expect for this to grow sophisticated dependency analysis and get complex. I agree that we will need a common analysis engine that can be reused for different loop-like structures. That is probably easier designed once we have more approaches implemented in full.

Regarding affine, I'd be happy to use loop fusion on affine.parallel loops, as well. It would need to support reduction, though.

Until then, I'd like to land this structural fusion to serve a current code-generation need we have.

I had a similar comment on the ploop tiling diff, @nicolasvasilache... Linalg fusion is also structural in that sense, but we need to decouple the transformation implementation from any of the dialect-specific concerns. I don't see how many abstraction layers and templates we'd have to add to reuse code across dialects and whether it will be worth the effort. I suppose somebody with deep experience in loop stuff (that is you or me) should look into refactoring this.

pifon2a abandoned this revision.Feb 28 2020, 8:45 AM

Revision Contents

Path	Size
mlir/include/mlir/Dialect/LoopOps/LoopOps.td	1 line
mlir/include/mlir/Dialect/LoopOps/Passes.h	27 lines
mlir/include/mlir/InitAllPasses.h	4 lines
mlir/lib/Dialect/LoopOps/CMakeLists.txt	2 lines
mlir/lib/Dialect/LoopOps/Transforms/CMakeLists.txt	11 lines
mlir/lib/Dialect/LoopOps/Transforms/ParallelLoopFusion.cpp	182 lines
mlir/test/Dialect/Loops/parallel-loop-fusion.mlir	309 lines
mlir/tools/mlir-opt/CMakeLists.txt	1 line

Diff 245378

mlir/include/mlir/Dialect/LoopOps/LoopOps.td

[184 lines hidden] OpBuilder<"Builder *builder, OperationState &result, "
                "ValueRange steps">
    ];

    let extraClassDeclaration = [{
      Block *getBody() { return &region().front(); }
      iterator_range<Block::args_iterator> getInductionVars() {
        return {getBody()->args_begin(), getBody()->args_end()};
      }
+     unsigned getNumLoops() { return step().size(); }
    }];
  }

  def ReduceOp : Loop_Op<"reduce", [HasParent<"ParallelOp">]> {
    let summary = "reduce operation for parallel for";
    let description = [{
      "loop.reduce" is an operation occurring inside "loop.parallel" operations.
      It consists of one block with two arguments which have the same type as the
[76 lines hidden]

mlir/include/mlir/Dialect/LoopOps/Passes.h

This file was added.

				//===- Passes.h - Pass Entrypoints ------------------------------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This header file defines prototypes that expose pass constructors.
				//
				//===----------------------------------------------------------------------===//

				#ifndef MLIR_DIALECT_LOOPOPS_PASSES_H_
				#define MLIR_DIALECT_LOOPOPS_PASSES_H_

				#include <memory>

				namespace mlir {

				class Pass;

				/// Creates a loop fusion pass which fuses parallel loops.
				std::unique_ptr<Pass> createParallelLoopFusionPass();

				} // namespace mlir

				#endif // MLIR_DIALECT_LOOPOPS_PASSES_H_

mlir/include/mlir/InitAllPasses.h

[21 lines hidden]
  #include "mlir/Conversion/GPUToVulkan/ConvertGPUToVulkanPass.h"
  #include "mlir/Conversion/LinalgToLLVM/LinalgToLLVM.h"
  #include "mlir/Conversion/LinalgToSPIRV/LinalgToSPIRVPass.h"
  #include "mlir/Conversion/LoopsToGPU/LoopsToGPUPass.h"
  #include "mlir/Conversion/StandardToSPIRV/ConvertStandardToSPIRVPass.h"
  #include "mlir/Dialect/FxpMathOps/Passes.h"
  #include "mlir/Dialect/GPU/Passes.h"
  #include "mlir/Dialect/Linalg/Passes.h"
+ #include "mlir/Dialect/LoopOps/Passes.h"
  #include "mlir/Dialect/QuantOps/Passes.h"
  #include "mlir/Dialect/SPIRV/Passes.h"
  #include "mlir/Quantizer/Transforms/Passes.h"
  #include "mlir/Transforms/LocationSnapshot.h"
  #include "mlir/Transforms/Passes.h"

  #include <cstdlib>

[61 lines hidden] inline void registerAllPasses() {
    createLinalgTilingPass();
    createLinalgTilingToParallelLoopsPass();
    createLinalgPromotionPass(0);
    createConvertLinalgToLoopsPass();
    createConvertLinalgToParallelLoopsPass();
    createConvertLinalgToAffineLoopsPass();
    createConvertLinalgToLLVMPass();

+   // LoopOps
+   createParallelLoopFusionPass();

    // QuantOps
    quant::createConvertSimulatedQuantPass();
    quant::createConvertConstPass();
    quantizer::createAddDefaultStatsPass();
    quantizer::createRemoveInstrumentationPass();
    quantizer::registerInferQuantizedTypesPass();

    // SPIR-V
[14 lines hidden]

mlir/lib/Dialect/LoopOps/CMakeLists.txt

[15 lines hidden] add_dependencies(MLIRLoopOps
  )
  target_link_libraries(MLIRLoopOps

    MLIREDSC
    MLIRIR
    MLIRStandardOps
    LLVMSupport
  )

+ add_subdirectory(Transforms)

mlir/lib/Dialect/LoopOps/Transforms/CMakeLists.txt

This file was added.

				add_llvm_library(MLIRLoopOpsTransforms
				ParallelLoopFusion.cpp

				ADDITIONAL_HEADER_DIRS
				${MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/LoopOps
				)

				target_link_libraries(MLIRLoopOpsTransforms
				MLIRPass
				MLIRLoopOps
				)

mlir/lib/Dialect/LoopOps/Transforms/ParallelLoopFusion.cpp

This file was added.

				//===- ParallelLoopFusion.cpp - Code to perform loop fusion ---------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This file implements loop fusion on parallel loops.
				//
				//===----------------------------------------------------------------------===//

				#include "mlir/Dialect/LoopOps/LoopOps.h"
				#include "mlir/Dialect/LoopOps/Passes.h"
				#include "mlir/Dialect/StandardOps/Ops.h"
				#include "mlir/IR/BlockAndValueMapping.h"
				#include "mlir/IR/Builders.h"
				#include "mlir/IR/OpDefinition.h"
				#include "mlir/Pass/Pass.h"
				#include "mlir/Transforms/Passes.h"

				using namespace mlir;
				using loop::ParallelOp;

				/// Verify there are no nested ParallelOps.
				static bool hasNestedParallelOp(ParallelOp ploop) {
				auto walkResult =
				ploop.getBody()->walk([](ParallelOp) { return WalkResult::interrupt(); });
				return walkResult.wasInterrupted();
				}

				/// Verify equal iteration spaces.
				static bool equalIterationSpaces(ParallelOp firstPloop,
				ParallelOp secondPloop) {
				if (firstPloop.getNumLoops() != secondPloop.getNumLoops())
				return false;

				auto matchOperands = [&](const OperandRange &lhs,
				const OperandRange &rhs) -> bool {
				// TODO: Extend this to support aliases and equal constants.
				return std::equal(lhs.begin(), lhs.end(), rhs.begin());
				};
				return matchOperands(firstPloop.lowerBound(), secondPloop.lowerBound()) &&
				matchOperands(firstPloop.upperBound(), secondPloop.upperBound()) &&
				matchOperands(firstPloop.step(), secondPloop.step());
				}

				/// Returns true if the defining operation for the memref is inside the body
				/// of parallel loop.
				bool isDefinedInPloopBody(Value memref, ParallelOp ploop) {
				auto *memrefDef = memref.getDefiningOp();
				return memrefDef && ploop.getOperation()->isAncestor(memrefDef);
				}

				// Checks if the parallel loops have mixed access to the same buffers. Returns
				// `true` if the first parallel loop writes to the same indices that the second
				// loop reads.
				static bool haveNoReadsAfterWriteExceptSameIndex(
				ParallelOp firstPloop, ParallelOp secondPloop,
				const BlockAndValueMapping &firstToSecondPloopIndices) {
				DenseMap<Value, SmallVector<ValueRange, 1>> bufferStores;
				firstPloop.getBody()->walk([&](StoreOp store) {
				bufferStores[store.getMemRef()].push_back(store.indices());
				});
				auto walkResult = secondPloop.getBody()->walk([&](LoadOp load) {
				// Stop if the memref is defined in secondPloop body. Careful alias analysis
				// is needed.
				auto *memrefDef = load.getMemRef().getDefiningOp();
				if (memrefDef && memrefDef->getBlock() == load.getOperation()->getBlock())
				return WalkResult::interrupt();

				auto write = bufferStores.find(load.getMemRef());
				if (write == bufferStores.end())
				return WalkResult::advance();

				// Allow only single write access per buffer.
				if (write->second.size() != 1)
				return WalkResult::interrupt();

				// Check that the load indices of secondPloop coincide with store indices of
				// firstPloop for the same memrefs.
				auto storeIndices = write->second.front();
				auto loadIndices = load.indices();
				if (storeIndices.size() != loadIndices.size())
				return WalkResult::interrupt();
				for (int i = 0, e = storeIndices.size(); i < e; ++i) {
				if (firstToSecondPloopIndices.lookupOrDefault(storeIndices[i]) !=
				loadIndices[i])
				return WalkResult::interrupt();
				}
				return WalkResult::advance();
				});
				return !walkResult.wasInterrupted();
				}

				/// Analyzes dependencies in the most primitive way by checking simple read and
				/// write patterns.
				static LogicalResult
				verifyDependencies(ParallelOp firstPloop, ParallelOp secondPloop,
				const BlockAndValueMapping &firstToSecondPloopIndices) {
				if (!haveNoReadsAfterWriteExceptSameIndex(firstPloop, secondPloop,
				firstToSecondPloopIndices))
				return failure();

				BlockAndValueMapping secondToFirstPloopIndices;
				secondToFirstPloopIndices.map(secondPloop.getBody()->getArguments(),
				firstPloop.getBody()->getArguments());
				return success(haveNoReadsAfterWriteExceptSameIndex(
				secondPloop, firstPloop, secondToFirstPloopIndices));
				}

				static bool
				isFusionLegal(ParallelOp firstPloop, ParallelOp secondPloop,
				const BlockAndValueMapping &firstToSecondPloopIndices) {
				return !hasNestedParallelOp(firstPloop) &&
				!hasNestedParallelOp(secondPloop) &&
				equalIterationSpaces(firstPloop, secondPloop) &&
				succeeded(verifyDependencies(firstPloop, secondPloop,
				firstToSecondPloopIndices));
				}

				/// Prepends operations of firstPloop's body into secondPloop's body.
				static void fuseIfLegal(ParallelOp firstPloop, ParallelOp secondPloop,
				OpBuilder b) {
				BlockAndValueMapping firstToSecondPloopIndices;
				firstToSecondPloopIndices.map(firstPloop.getBody()->getArguments(),
				secondPloop.getBody()->getArguments());

				if (!isFusionLegal(firstPloop, secondPloop, firstToSecondPloopIndices))
				return;

				b.setInsertionPointToStart(secondPloop.getBody());
				for (auto &op : firstPloop.getBody()->without_terminator())
				b.clone(op, firstToSecondPloopIndices);
				firstPloop.erase();
				}

				static void naivelyFuseParallelOps(Operation *op) {
				OpBuilder b(op);
				// Consider every single block and attempt to fuse adjacent loops.
				for (auto &region : op->getRegions()) {
				for (auto &block : region.getBlocks()) {
				SmallVector<SmallVector<ParallelOp, 8>, 1> ploop_chains{{}};
				// Not using `walk()` to traverse only top-level parallel loops and also
				// make sure that there are no side-effecting ops between the parallel
				// loops.
				bool noSideEffects = true;
				for (auto &op : block.getOperations()) {
				if (auto ploop = dyn_cast<ParallelOp>(op)) {
				if (noSideEffects) {
				ploop_chains.back().push_back(ploop);
				} else {
				ploop_chains.push_back({ploop});
				noSideEffects = true;
				}
				continue;
				}
				noSideEffects &= op.hasNoSideEffect();
				}
				for (ArrayRef<ParallelOp> ploops : ploop_chains) {
				llvm::errs() << "poo size = " << ploops.size() << '\n';
				for (int i = 0, e = ploops.size(); i + 1 < e; ++i)
				fuseIfLegal(ploops[i], ploops[i + 1], b);
				}
				}
				}
				}

				namespace {

				struct ParallelLoopFusion : public OperationPass<ParallelLoopFusion> {
				void runOnOperation() override { naivelyFuseParallelOps(getOperation()); }
				};

				} // namespace

				std::unique_ptr<Pass> mlir::createParallelLoopFusionPass() {
				return std::make_unique<ParallelLoopFusion>();
				}

				static PassRegistration<ParallelLoopFusion>
				pass("parallel-loop-fusion", "Fuse adjacent parallel loops.");

mlir/test/Dialect/Loops/parallel-loop-fusion.mlir

This file was added.

				// RUN: mlir-opt %s -pass-pipeline='func(parallel-loop-fusion)' -split-input-file | FileCheck %s --dump-input-on-failure

				func @fuse_empty_loops() {
				%c2 = constant 2 : index
				%c0 = constant 0 : index
				%c1 = constant 1 : index
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				"loop.terminator"() : () -> ()
				}
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				"loop.terminator"() : () -> ()
				}
				"xla_lhlo.terminator"() : () -> ()
				}
				// CHECK-LABEL: func @fuse_empty_loops
				// CHECK: [[C2:%.*]] = constant 2 : index
				// CHECK: [[C0:%.*]] = constant 0 : index
				// CHECK: [[C1:%.*]] = constant 1 : index
				// CHECK: loop.parallel ([[I:%.*]], [[J:%.*]]) = ([[C0]], [[C0]])
				// CHECK-SAME: to ([[C2]], [[C2]]) step ([[C1]], [[C1]]) {
				// CHECK: "loop.terminator"() : () -> ()
				// CHECK: }
				// CHECK-NOT: loop.parallel

				// -----

				func @fuse_two(%A: memref<2x2xf32>, %B: memref<2x2xf32>,
				%C: memref<2x2xf32>, %result: memref<2x2xf32>) {
				%c2 = constant 2 : index
				%c0 = constant 0 : index
				%c1 = constant 1 : index
				%sum = alloc() : memref<2x2xf32>
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				%B_elem = load %B[%i, %j] : memref<2x2xf32>
				%C_elem = load %C[%i, %j] : memref<2x2xf32>
				%sum_elem = addf %B_elem, %C_elem : f32
				store %sum_elem, %sum[%i, %j] : memref<2x2xf32>
				"loop.terminator"() : () -> ()
				}
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				%sum_elem = load %sum[%i, %j] : memref<2x2xf32>
				%A_elem = load %A[%i, %j] : memref<2x2xf32>
				%product_elem = mulf %sum_elem, %A_elem : f32
				store %product_elem, %result[%i, %j] : memref<2x2xf32>
				"loop.terminator"() : () -> ()
				}
				dealloc %sum : memref<2x2xf32>
				return
				}
				// CHECK-LABEL: func @fuse_two
				// CHECK-SAME: ([[A:%.*]]: {{.*}}, [[B:%.*]]: {{.*}}, [[C:%.*]]: {{.*}},
				// CHECK-SAME: [[RESULT:%.*]]: {{.*}}) {
				// CHECK: [[C2:%.*]] = constant 2 : index
				// CHECK: [[C0:%.*]] = constant 0 : index
				// CHECK: [[C1:%.*]] = constant 1 : index
				// CHECK: [[SUM:%.*]] = alloc()
				// CHECK: loop.parallel ([[I:%.*]], [[J:%.*]]) = ([[C0]], [[C0]])
				// CHECK-SAME: to ([[C2]], [[C2]]) step ([[C1]], [[C1]]) {
				// CHECK: [[B_ELEM:%.*]] = load [[B]]{{\[}}[[I]], [[J]]]
				// CHECK: [[C_ELEM:%.*]] = load [[C]]{{\[}}[[I]], [[J]]]
				// CHECK: [[SUM_ELEM:%.*]] = addf [[B_ELEM]], [[C_ELEM]]
				// CHECK: store [[SUM_ELEM]], [[SUM]]{{\[}}[[I]], [[J]]]
				// CHECK: [[SUM_ELEM_:%.*]] = load [[SUM]]{{\[}}[[I]], [[J]]]
				// CHECK: [[A_ELEM:%.*]] = load [[A]]{{\[}}[[I]], [[J]]]
				// CHECK: [[PRODUCT_ELEM:%.*]] = mulf [[SUM_ELEM_]], [[A_ELEM]]
				// CHECK: store [[PRODUCT_ELEM]], [[RESULT]]{{\[}}[[I]], [[J]]]
				// CHECK: "loop.terminator"() : () -> ()
				// CHECK: }
				// CHECK: dealloc [[SUM]]

				// -----

				func @fuse_three(%lhs: memref<100x10xf32>, %rhs: memref<100xf32>,
				%result: memref<100x10xf32>) {
				%c100 = constant 100 : index
				%c10 = constant 10 : index
				%c0 = constant 0 : index
				%c1 = constant 1 : index
				%broadcast_rhs = alloc() : memref<100x10xf32>
				%diff = alloc() : memref<100x10xf32>
				loop.parallel (%i, %j) = (%c0, %c0) to (%c100, %c10) step (%c1, %c1) {
				%rhs_elem = load %rhs[%i] : memref<100xf32>
				store %rhs_elem, %broadcast_rhs[%i, %j] : memref<100x10xf32>
				"loop.terminator"() : () -> ()
				}
				loop.parallel (%i, %j) = (%c0, %c0) to (%c100, %c10) step (%c1, %c1) {
				%lhs_elem = load %lhs[%i, %j] : memref<100x10xf32>
				%broadcast_rhs_elem = load %broadcast_rhs[%i, %j] : memref<100x10xf32>
				%diff_elem = subf %lhs_elem, %broadcast_rhs_elem : f32
				store %diff_elem, %diff[%i, %j] : memref<100x10xf32>
				"loop.terminator"() : () -> ()
				}
				loop.parallel (%i, %j) = (%c0, %c0) to (%c100, %c10) step (%c1, %c1) {
				%diff_elem = load %diff[%i, %j] : memref<100x10xf32>
				%exp_elem = exp %diff_elem : f32
				store %exp_elem, %result[%i, %j] : memref<100x10xf32>
				"loop.terminator"() : () -> ()
				}
				dealloc %broadcast_rhs : memref<100x10xf32>
				dealloc %diff : memref<100x10xf32>
				return
				}
				// CHECK-LABEL: func @fuse_three
				// CHECK-SAME: ([[LHS:%.*]]: memref<100x10xf32>, [[RHS:%.*]]: memref<100xf32>,
				// CHECK-SAME: [[RESULT:%.*]]: memref<100x10xf32>) {
				// CHECK: [[C100:%.*]] = constant 100 : index
				// CHECK: [[C10:%.*]] = constant 10 : index
				// CHECK: [[C0:%.*]] = constant 0 : index
				// CHECK: [[C1:%.*]] = constant 1 : index
				// CHECK: [[BROADCAST_RHS:%.*]] = alloc()
				// CHECK: [[DIFF:%.*]] = alloc()
				// CHECK: loop.parallel ([[I:%.*]], [[J:%.*]]) = ([[C0]], [[C0]])
				// CHECK-SAME: to ([[C100]], [[C10]]) step ([[C1]], [[C1]]) {
				// CHECK: [[RHS_ELEM:%.*]] = load [[RHS]]{{\[}}[[I]]]
				// CHECK: store [[RHS_ELEM]], [[BROADCAST_RHS]]{{\[}}[[I]], [[J]]]
				// CHECK: [[LHS_ELEM:%.*]] = load [[LHS]]{{\[}}[[I]], [[J]]]
				// CHECK: [[BROADCAST_RHS_ELEM:%.*]] = load [[BROADCAST_RHS]]
				// CHECK: [[DIFF_ELEM:%.*]] = subf [[LHS_ELEM]], [[BROADCAST_RHS_ELEM]]
				// CHECK: store [[DIFF_ELEM]], [[DIFF]]{{\[}}[[I]], [[J]]]
				// CHECK: [[DIFF_ELEM_:%.*]] = load [[DIFF]]{{\[}}[[I]], [[J]]]
				// CHECK: [[EXP_ELEM:%.*]] = exp [[DIFF_ELEM_]]
				// CHECK: store [[EXP_ELEM]], [[RESULT]]{{\[}}[[I]], [[J]]]
				// CHECK: "loop.terminator"() : () -> ()
				// CHECK: }
				// CHECK: dealloc [[BROADCAST_RHS]]
				// CHECK: dealloc [[DIFF]]

				// -----

				func @do_not_fuse_nested_ploop1() {
				%c2 = constant 2 : index
				%c0 = constant 0 : index
				%c1 = constant 1 : index
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				loop.parallel (%k, %l) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				"loop.terminator"() : () -> ()
				}
				"loop.terminator"() : () -> ()
				}
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				"loop.terminator"() : () -> ()
				}
				"xla_lhlo.terminator"() : () -> ()
				}
				// CHECK-LABEL: func @do_not_fuse_nested_ploop1
				// CHECK: loop.parallel
				// CHECK: loop.parallel
				// CHECK: loop.parallel

				// -----

				func @do_not_fuse_nested_ploop2() {
				%c2 = constant 2 : index
				%c0 = constant 0 : index
				%c1 = constant 1 : index
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				"loop.terminator"() : () -> ()
				}
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				loop.parallel (%k, %l) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				"loop.terminator"() : () -> ()
				}
				"loop.terminator"() : () -> ()
				}
				"xla_lhlo.terminator"() : () -> ()
				}
				// CHECK-LABEL: func @do_not_fuse_nested_ploop2
				// CHECK: loop.parallel
				// CHECK: loop.parallel
				// CHECK: loop.parallel

				// -----

				func @do_not_fuse_loops_unmatching_num_loops() {
				%c2 = constant 2 : index
				%c0 = constant 0 : index
				%c1 = constant 1 : index
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				"loop.terminator"() : () -> ()
				}
				loop.parallel (%i) = (%c0) to (%c2) step (%c1) {
				"loop.terminator"() : () -> ()
				}
				"xla_lhlo.terminator"() : () -> ()
				}
				// CHECK-LABEL: func @do_not_fuse_loops_unmatching_num_loops
				// CHECK: loop.parallel
				// CHECK: loop.parallel

				// -----

				func @do_not_fuse_loops_with_side_effecting_ops_in_between() {
				%c2 = constant 2 : index
				%c0 = constant 0 : index
				%c1 = constant 1 : index
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				"loop.terminator"() : () -> ()
				}
				%buffer = alloc() : memref<2x2xf32>
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				"loop.terminator"() : () -> ()
				}
				"xla_lhlo.terminator"() : () -> ()
				}
				// CHECK-LABEL: func @do_not_fuse_loops_with_side_effecting_ops_in_between
				// CHECK: loop.parallel
				// CHECK: loop.parallel

				// -----

				func @do_not_fuse_loops_unmatching_iteration_space() {
				%c0 = constant 0 : index
				%c1 = constant 1 : index
				%c2 = constant 2 : index
				%c4 = constant 4 : index
				loop.parallel (%i, %j) = (%c0, %c0) to (%c4, %c4) step (%c2, %c2) {
				"loop.terminator"() : () -> ()
				}
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				"loop.terminator"() : () -> ()
				}
				"xla_lhlo.terminator"() : () -> ()
				}
				// CHECK-LABEL: func @do_not_fuse_loops_unmatching_iteration_space
				// CHECK: loop.parallel
				// CHECK: loop.parallel

				// -----

				func @do_not_fuse_unmatching_write_read_patterns(
				%A: memref<2x2xf32>, %B: memref<2x2xf32>,
				%C: memref<2x2xf32>, %result: memref<2x2xf32>) {
				%c2 = constant 2 : index
				%c0 = constant 0 : index
				%c1 = constant 1 : index
				%common_buf = alloc() : memref<2x2xf32>
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				%B_elem = load %B[%i, %j] : memref<2x2xf32>
				%C_elem = load %C[%i, %j] : memref<2x2xf32>
				%sum_elem = addf %B_elem, %C_elem : f32
				store %sum_elem, %common_buf[%i, %j] : memref<2x2xf32>
				"loop.terminator"() : () -> ()
				}
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				%k = addi %i, %c1 : index
				%sum_elem = load %common_buf[%k, %j] : memref<2x2xf32>
				%A_elem = load %A[%i, %j] : memref<2x2xf32>
				%product_elem = mulf %sum_elem, %A_elem : f32
				store %product_elem, %result[%i, %j] : memref<2x2xf32>
				"loop.terminator"() : () -> ()
				}
				dealloc %common_buf : memref<2x2xf32>
				return
				}
				// CHECK-LABEL: func @do_not_fuse_unmatching_write_read_patterns
				// CHECK: loop.parallel
				// CHECK: loop.parallel

				// -----

				func @do_not_fuse_unmatching_read_write_patterns(
				%A: memref<2x2xf32>, %B: memref<2x2xf32>, %common_buf: memref<2x2xf32>) {
				herhut: How is this different from above?
				pifon2a (author): Renamed to `common_buf` to make it more visible.
				%c2 = constant 2 : index
				%c0 = constant 0 : index
				%c1 = constant 1 : index
				%sum = alloc() : memref<2x2xf32>
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				%B_elem = load %B[%i, %j] : memref<2x2xf32>
				%C_elem = load %common_buf[%i, %j] : memref<2x2xf32>
				%sum_elem = addf %B_elem, %C_elem : f32
				store %sum_elem, %sum[%i, %j] : memref<2x2xf32>
				"loop.terminator"() : () -> ()
				}
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				%k = addi %i, %c1 : index
				%sum_elem = load %sum[%k, %j] : memref<2x2xf32>
				%A_elem = load %A[%i, %j] : memref<2x2xf32>
				herhut: This would already fail due to not-matching indices. Maybe load from `A` instead?
				%product_elem = mulf %sum_elem, %A_elem : f32
				store %product_elem, %common_buf[%j, %i] : memref<2x2xf32>
				"loop.terminator"() : () -> ()
				herhut: Why is this case illegal? If they store to the same buf but at the same index it should be fine.
				}
				dealloc %sum : memref<2x2xf32>
				return
				}
				// CHECK-LABEL: func @do_not_fuse_unmatching_read_write_patterns
				// CHECK: loop.parallel
				// CHECK: loop.parallel

				// -----

				func @do_not_fuse_loops_with_memref_defined_in_loop_bodies() {
				%c2 = constant 2 : index
				herhut: There are 4 cases for this. Read/write to local allocation in loop1/loop2. Essentially any effect on a memref we do not understand.
				%c0 = constant 0 : index
				%c1 = constant 1 : index
				%buffer = alloc() : memref<2x2xf32>
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				"loop.terminator"() : () -> ()
				}
				loop.parallel (%i, %j) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
				%A = subview %buffer[%c0, %c0][%c2, %c2][%c1, %c1]
				: memref<2x2xf32> to memref<?x?xf32, offset: ?, strides:[?, ?]>
				%A_elem = load %A[%i, %j] : memref<?x?xf32, offset: ?, strides:[?, ?]>
				"loop.terminator"() : () -> ()
				}
				"xla_lhlo.terminator"() : () -> ()
				}
				// CHECK-LABEL: func @do_not_fuse_loops_with_memref_defined_in_loop_bodies
				// CHECK: loop.parallel
				// CHECK: loop.parallel

mlir/tools/mlir-opt/CMakeLists.txt

				add_llvm_library(MLIRMlirOptMain
				mlir-opt.cpp
				)
				target_link_libraries(MLIRMlirOptMain
				${LIB_LIBS}
				)

				set(LIBS
				+ MLIRLoopOpsTransforms
				MLIRLoopAnalysis
				MLIRAnalysis
				MLIRAffineOps
				MLIRAffineToStandard
				MLIRDialect
				MLIRLoopsToGPU
				MLIRLinalgToLLVM


[MLIR] Add naive fusion of parallel loops. (Abandoned, Public)

Revision Contents

Diff 245378

mlir/include/mlir/Dialect/LoopOps/LoopOps.td

mlir/include/mlir/Dialect/LoopOps/Passes.h

mlir/include/mlir/InitAllPasses.h

mlir/lib/Dialect/LoopOps/CMakeLists.txt

mlir/lib/Dialect/LoopOps/Transforms/CMakeLists.txt

mlir/lib/Dialect/LoopOps/Transforms/ParallelLoopFusion.cpp

mlir/test/Dialect/Loops/parallel-loop-fusion.mlir

mlir/tools/mlir-opt/CMakeLists.txt
