This is an archive of the discontinued LLVM Phabricator instance.

Differential D108426

[MLIR][OMP] Ensure nested scf.parallel execute all iterations
Closed · Public

Authored by wsmoses on Aug 19 2021, 4:55 PM.

Details

Reviewers

jdoerfert
ftynse
rriddle
nicolasvasilache
mehdi_amini

Commits

rG973cb2c326be: [MLIR][OMP] Ensure nested scf.parallel execute all iterations

Summary

Presently, lowering nested scf.parallel loops to OpenMP creates a single omp.parallel region containing two nested OpenMP worksharing loops. When lowered to LLVM and executed, this produces incorrect results, for the following reason:

An OpenMP parallel region runs its code on however many threads are available to OpenMP. Within a parallel region, a worksharing (ws) loop divides the requested iterations among the available threads and distributes them accordingly. For a single ws loop in a parallel region, this works as intended.

Now consider nested ws loops as follows:

omp.parallel {

A: omp.ws %i = 0...10 {
   B: omp.ws %j = 0...10 {
       code(%i, %j)
   }
}

}

Suppose we run this on two threads. The first worksharing loop would assign iterations 0, 1, 2, 3, 4 to thread 0 and iterations 5, 6, 7, 8, 9 to thread 1. The second worksharing loop would make the same split for its own iterations. Thread 0 would therefore execute i in [0, 5) and j in [0, 5), and thread 1 would execute i in [5, 10) and j in [5, 10). The iterations with i in [5, 10), j in [0, 5) and those with i in [0, 5), j in [5, 10) never get executed, which is clearly wrong.
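
The lost iterations can be counted with a small sequential model of the scenario described above. This is only a sketch of the summary's reasoning, not of the OpenMP runtime: it assumes a static block schedule, and the names Range, staticChunk, countSharedTeam and countNestedTeams are hypothetical helpers introduced for illustration.

```cpp
#include <algorithm>
#include <vector>

// Half-open range [begin, end) of iterations assigned to one thread.
struct Range {
  int begin;
  int end;
};

// Assumed static block schedule: split [0, n) into contiguous chunks,
// one per thread, and return the chunk owned by thread `tid`.
Range staticChunk(int n, int numThreads, int tid) {
  int chunk = (n + numThreads - 1) / numThreads;
  int begin = std::min(n, tid * chunk);
  int end = std::min(n, begin + chunk);
  return {begin, end};
}

// Both loops divide their iterations across the SAME team, so thread `tid`
// only visits its own chunk of i combined with its own chunk of j.
int countSharedTeam(int n, int numThreads) {
  std::vector<char> seen(n * n, 0);
  for (int tid = 0; tid < numThreads; ++tid) {
    Range outer = staticChunk(n, numThreads, tid);
    Range inner = staticChunk(n, numThreads, tid);
    for (int i = outer.begin; i < outer.end; ++i)
      for (int j = inner.begin; j < inner.end; ++j)
        seen[i * n + j] = 1;
  }
  return static_cast<int>(std::count(seen.begin(), seen.end(), 1));
}

// The inner loop gets its own team for every outer iteration, so the inner
// team as a whole covers all of [0, n) for each i.
int countNestedTeams(int n, int outerThreads, int innerThreads) {
  std::vector<char> seen(n * n, 0);
  for (int outerTid = 0; outerTid < outerThreads; ++outerTid) {
    Range outer = staticChunk(n, outerThreads, outerTid);
    for (int i = outer.begin; i < outer.end; ++i)
      for (int innerTid = 0; innerTid < innerThreads; ++innerTid) {
        Range inner = staticChunk(n, innerThreads, innerTid);
        for (int j = inner.begin; j < inner.end; ++j)
          seen[i * n + j] = 1;
      }
  }
  return static_cast<int>(std::count(seen.begin(), seen.end(), 1));
}
```

Under these assumptions, countSharedTeam(10, 2) covers 50 of the 100 (i, j) pairs, while countNestedTeams(10, 2, 2) covers all 100.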

This leaves two options for a remedy:

1. Change the semantics of omp.wsloop to be distinct from those of the OpenMP runtime call or, equivalently, #pragma omp for. This could then allow some lowering transformation to remedy the aforementioned issue. I don't think this is desirable from an abstraction standpoint.
2. When lowering an scf.parallel, always surround the wsloop with a new parallel region (thereby causing the innermost wsloop to use only the threads available to it).

This PR implements the latter change.
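
In C++ OpenMP terms, the structure this lowering produces corresponds roughly to giving each loop its own parallel region. The following is a minimal sketch under that reading; the function name runNestedParallelFor is hypothetical, and the code is an analogy to the generated omp.parallel / omp.wsloop nesting rather than output of the pass.

```cpp
// Each loop level is wrapped in its own parallel region, so the inner
// worksharing loop binds to the inner team instead of the outer one.
// Built without OpenMP support, the pragmas are ignored and the loops
// simply run sequentially.
int runNestedParallelFor(int n) {
  int executed = 0; // shared counter of executed (i, j) iterations
#pragma omp parallel
  {
#pragma omp for
    for (int i = 0; i < n; ++i) {
#pragma omp parallel
      {
#pragma omp for
        for (int j = 0; j < n; ++j) {
#pragma omp atomic
          executed++;
        }
      }
    }
  }
  return executed;
}
```

Because the inner worksharing loop divides its iterations only among the threads of the inner region, every outer iteration runs the full inner range, and runNestedParallelFor(10) counts all 100 iterations regardless of how many threads each team has.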

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

wsmoses created this revision. Aug 19 2021, 4:55 PM

Herald added subscribers: wrengr, Chia-hungDuan, dcaballe and 16 others. Aug 19 2021, 4:55 PM

wsmoses requested review of this revision. Aug 19 2021, 4:55 PM

Herald added a project: Restricted Project. Aug 19 2021, 4:55 PM

Herald added subscribers: sstefan1, stephenneuendorffer.

Harbormaster completed remote builds in B120459: Diff 367661. Aug 19 2021, 5:31 PM

LG, a scf.parallel loop is, in OpenMP speak, an omp parallel for, and that also holds for nested loops.

This revision is now accepted and ready to land. Aug 20 2021, 8:09 AM

Closed by commit rG973cb2c326be: [MLIR][OMP] Ensure nested scf.parallel execute all iterations (authored by wsmoses). Aug 20 2021, 4:06 PM

This revision was automatically updated to reflect the committed changes.

wsmoses added a commit: rG973cb2c326be: [MLIR][OMP] Ensure nested scf.parallel execute all iterations.

Revision Contents

Path                                                   Size
mlir/lib/Conversion/SCFToOpenMP/SCFToOpenMP.cpp        34 lines
mlir/test/Conversion/SCFToOpenMP/scf-to-openmp.mlir    2 lines

Diff 367661

mlir/lib/Conversion/SCFToOpenMP/SCFToOpenMP.cpp

[38 unchanged lines not shown; context: "// Replace SCF yield with OpenMP yield."]
       rewriter.setInsertionPointToEnd(parallelOp.getBody());
       assert(llvm::hasSingleElement(parallelOp.region()) &&
              "expected scf.parallel to have one block");
       rewriter.replaceOpWithNewOp<omp::YieldOp>(
           parallelOp.getBody()->getTerminator(), ValueRange());
     }

     // Replace the loop.
+    auto omp = rewriter.create<omp::ParallelOp>(parallelOp.getLoc());
+    Block *block = rewriter.createBlock(&omp.getRegion());
+    rewriter.setInsertionPointToStart(block);
     auto loop = rewriter.create<omp::WsLoopOp>(
         parallelOp.getLoc(), parallelOp.lowerBound(), parallelOp.upperBound(),
         parallelOp.step());
     rewriter.inlineRegionBefore(parallelOp.region(), loop.region(),
                                 loop.region().begin());
+    rewriter.create<omp::TerminatorOp>(parallelOp.getLoc());

     rewriter.eraseOp(parallelOp);
     return success();
   }
 };

-/// Inserts OpenMP "parallel" operations around top-level SCF "parallel"
-/// operations in the given function. This is implemented as a direct IR
-/// modification rather than as a conversion pattern because it does not
-/// modify the top-level operation it matches, which is a requirement for
-/// rewrite patterns.
-//
-// TODO: consider creating nested parallel operations when necessary.
-static void insertOpenMPParallel(FuncOp func) {
-  // Collect top-level SCF "parallel" ops.
-  SmallVector<scf::ParallelOp, 4> topLevelParallelOps;
-  func.walk([&topLevelParallelOps](scf::ParallelOp parallelOp) {
-    // Ignore ops that are already within OpenMP parallel construct.
-    if (!parallelOp->getParentOfType<scf::ParallelOp>())
-      topLevelParallelOps.push_back(parallelOp);
-  });
-
-  // Wrap SCF ops into OpenMP "parallel" ops.
-  for (scf::ParallelOp parallelOp : topLevelParallelOps) {
-    OpBuilder builder(parallelOp);
-    auto omp = builder.create<omp::ParallelOp>(parallelOp.getLoc());
-    Block *block = builder.createBlock(&omp.getRegion());
-    builder.create<omp::TerminatorOp>(parallelOp.getLoc());
-    block->getOperations().splice(block->begin(),
-                                  parallelOp->getBlock()->getOperations(),
-                                  parallelOp.getOperation());
-  }
-}
-
 /// Applies the conversion patterns in the given function.
 static LogicalResult applyPatterns(FuncOp func) {
   ConversionTarget target(*func.getContext());
   target.addIllegalOp<scf::ParallelOp>();
   target.addDynamicallyLegalOp<scf::YieldOp>(
       [](scf::YieldOp op) { return !isa<scf::ParallelOp>(op->getParentOp()); });
   target.addLegalDialect<omp::OpenMPDialect>();

   RewritePatternSet patterns(func.getContext());
   patterns.add<ParallelOpLowering>(func.getContext());
   FrozenRewritePatternSet frozen(std::move(patterns));
   return applyPartialConversion(func, target, frozen);
 }

 /// A pass converting SCF operations to OpenMP operations.
 struct SCFToOpenMPPass : public ConvertSCFToOpenMPBase<SCFToOpenMPPass> {
   /// Pass entry point.
   void runOnFunction() override {
-    insertOpenMPParallel(getFunction());
     if (failed(applyPatterns(getFunction())))
       signalPassFailure();
   }
 };

 } // end namespace

 std::unique_ptr<OperationPass<FuncOp>> mlir::createConvertSCFToOpenMPPass() {
   return std::make_unique<SCFToOpenMPPass>();
 }

mlir/test/Conversion/SCFToOpenMP/scf-to-openmp.mlir

[15 unchanged lines not shown; context: "func @parallel(%arg0: index, %arg1: index, %arg2: index,"]
   return
 }

 // CHECK-LABEL: @nested_loops
 func @nested_loops(%arg0: index, %arg1: index, %arg2: index,
                    %arg3: index, %arg4: index, %arg5: index) {
   // CHECK: omp.parallel {
   // CHECK: omp.wsloop (%[[LVAR_OUT1:.*]]) : index = (%arg0) to (%arg2) step (%arg4) {
-  // CHECK-NOT: omp.parallel
   scf.parallel (%i) = (%arg0) to (%arg2) step (%arg4) {
+    // CHECK: omp.parallel
     // CHECK: omp.wsloop (%[[LVAR_IN1:.*]]) : index = (%arg1) to (%arg3) step (%arg5) {
     scf.parallel (%j) = (%arg1) to (%arg3) step (%arg5) {
       // CHECK: "test.payload"(%[[LVAR_OUT1]], %[[LVAR_IN1]]) : (index, index) -> ()
       "test.payload"(%i, %j) : (index, index) -> ()
       // CHECK: omp.yield
       // CHECK: }
     }
     // CHECK: omp.yield
[32 unchanged lines not shown]