This is an archive of the discontinued LLVM Phabricator instance.

[mlir][Inliner] Use llvm::parallelForEach instead of llvm::parallelTransformReduce
ClosedPublic

Authored by rriddle on Feb 4 2021, 3:46 PM.

Details

Reviewers

jpienaar
mehdi_amini
bollu
Jing

Commits

rGabd3c6f24c82: [mlir][Inliner] Use llvm::parallelForEach instead of llvm…

Summary

llvm::parallelTransformReduce does not schedule work on the caller thread, which becomes very costly for
the inliner where a majority of SCCs are small, often ~1 element. The switch to llvm::parallelForEach solves this,
and also aligns the implementation with the PassManager (which realistically should share the same implementation).

This change reduced compile time on an internal benchmark by ~1 second (25%).
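
To illustrate the scheduling point made in the summary, here is a minimal sketch that uses only the C++ standard library. It is not LLVM's implementation; the helper name `runWithCaller` and its parameters are hypothetical. It caps the worker count at the number of items and runs one worker on the calling thread, so a one-item batch spawns no extra threads at all.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not an LLVM API): invokes `work(slot)` once per worker
// slot and uses the calling thread as worker 0.
// Returns the number of extra threads that were spawned.
inline std::size_t runWithCaller(std::size_t numItems, std::size_t maxThreads,
                                 const std::function<void(std::size_t)> &work) {
  assert(maxThreads > 0 && "need at least one worker");

  // Never use more workers than there are items to process.
  std::size_t numWorkers = std::min(maxThreads, numItems);
  if (numWorkers == 0)
    return 0;

  // Spawn threads only for the slots beyond the first one.
  std::vector<std::thread> extra;
  extra.reserve(numWorkers - 1);
  for (std::size_t slot = 1; slot < numWorkers; ++slot)
    extra.emplace_back(work, slot);

  // The calling thread handles slot 0 instead of idling.
  work(0);

  for (std::thread &t : extra)
    t.join();
  return extra.size();
}
```

With `numItems == 1`, the call degenerates to a plain function call on the current thread, which is the case the summary describes as dominant for the inliner.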

Depends On D96085

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

rriddle created this revision. Feb 4 2021, 3:46 PM

Herald added a reviewer: bollu. Feb 4 2021, 3:46 PM

Herald added subscribers: teijeong, rdzhabarov, tatianashp and 14 others.

rriddle requested review of this revision. Feb 4 2021, 3:46 PM

Herald added a project: Restricted Project. Feb 4 2021, 3:46 PM

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

Harbormaster completed remote builds in B87997: Diff 321588. Feb 4 2021, 5:13 PM

Nice improvement!

This revision is now accepted and ready to land. Feb 23 2021, 10:47 AM

Closed by commit rGabd3c6f24c82: [mlir][Inliner] Use llvm::parallelForEach instead of llvm… (authored by rriddle). Feb 23 2021, 2:37 PM

This revision was automatically updated to reflect the committed changes.

rriddle added a commit: rGabd3c6f24c82: [mlir][Inliner] Use llvm::parallelForEach instead of llvm….

Revision Contents

Path: mlir/lib/Transforms/Inliner.cpp
Size: 29 lines

Diff 325900

mlir/lib/Transforms/Inliner.cpp

@@ LogicalResult InlinerPass::optimizeSCC(CallGraph &cg, CGUseList &useList, @@
   return success();
 }

 LogicalResult
 InlinerPass::optimizeSCCAsync(MutableArrayRef<CallGraphNode *> nodesToVisit,
                               MLIRContext *context) {
   // Ensure that there are enough pipeline maps for the optimizer to run in
   // parallel.
-  size_t numThreads = llvm::hardware_concurrency().compute_thread_count();
-  if (opPipelines.size() != numThreads) {
+  size_t numThreads =
+      std::min((size_t)llvm::hardware_concurrency().compute_thread_count(),
+               nodesToVisit.size());
+  if (opPipelines.size() < numThreads) {
     // Reserve before resizing so that we can use a reference to the first
     // element.
     opPipelines.reserve(numThreads);
     opPipelines.resize(numThreads, opPipelines.front());
   }

   // Ensure an analysis manager has been constructed for each of the nodes.
   // This prevents thread races when running the nested pipelines.
   for (CallGraphNode *node : nodesToVisit)
     getAnalysisManager().nest(node->getCallableRegion()->getParentOp());

   // An index for the current node to optimize.
   std::atomic<unsigned> nodeIt(0);

   // Optimize the nodes of the SCC in parallel.
   ParallelDiagnosticHandler optimizerHandler(context);
-  return llvm::parallelTransformReduce(
-      llvm::seq<size_t>(0, numThreads), success(),
-      [](LogicalResult lhs, LogicalResult rhs) {
-        return success(succeeded(lhs) && succeeded(rhs));
-      },
-      [&](size_t index) {
-        LogicalResult result = success();
-        for (auto e = nodesToVisit.size(); nodeIt < e && succeeded(result);) {
+  std::atomic<bool> passFailed(false);
+  llvm::parallelForEach(
+      opPipelines.begin(), std::next(opPipelines.begin(), numThreads),
+      [&](llvm::StringMap<OpPassManager> &pipelines) {
+        for (auto e = nodesToVisit.size(); !passFailed && nodeIt < e;) {
           // Get the next available operation index.
           unsigned nextID = nodeIt++;
           if (nextID >= e)
             break;

           // Set the order for this thread so that diagnostics will be
           // properly ordered, and reset after optimization has finished.
           optimizerHandler.setOrderIDForThread(nextID);
-          result = optimizeCallable(nodesToVisit[nextID], opPipelines[index]);
+          LogicalResult pipelineResult =
+              optimizeCallable(nodesToVisit[nextID], pipelines);
           optimizerHandler.eraseOrderIDForThread();
+
+          if (failed(pipelineResult)) {
+            passFailed = true;
+            break;
+          }
         }
-        return result;
       });
+  return failure(passFailed);
 }

 LogicalResult
 InlinerPass::optimizeCallable(CallGraphNode *node,
                               llvm::StringMap<OpPassManager> &pipelines) {
   Operation *callable = node->getCallableRegion()->getParentOp();
   StringRef opName = callable->getName().getStringRef();
   auto pipelineIt = pipelines.find(opName);
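
The work-claiming loop in the new `optimizeSCCAsync` (a shared atomic index plus a shared failure flag) can be sketched outside of MLIR as follows. This is an illustrative standard-library version under stated assumptions: `processAll` and its parameters are hypothetical names, plain `std::thread` stands in for `llvm::parallelForEach`, and a `bool` return stands in for `LogicalResult`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the loop in optimizeSCCAsync. `process` returns
// true on success. Returns true only if no processed item reported failure.
inline bool processAll(std::size_t numItems, std::size_t numWorkers,
                       const std::function<bool(std::size_t)> &process) {
  assert(numWorkers > 0 && "need at least one worker");

  std::atomic<std::size_t> next(0);  // index of the next unclaimed item
  std::atomic<bool> failed(false);   // set once by whichever worker fails

  auto worker = [&] {
    while (!failed) {
      // Claim the next index; the atomic increment hands out each index once.
      std::size_t id = next++;
      if (id >= numItems)
        break;
      if (!process(id)) {
        // Record the failure so other workers stop claiming new items.
        failed = true;
        break;
      }
    }
  };

  // Extra workers run on new threads; the caller runs one worker itself.
  std::vector<std::thread> threads;
  for (std::size_t i = 1; i < numWorkers; ++i)
    threads.emplace_back(worker);
  worker();
  for (std::thread &t : threads)
    t.join();

  return !failed;
}
```

Because the failure flag is only checked before claiming a new item, items already in flight on other workers still run to completion; the flag stops further claims rather than cancelling running work.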