This is an archive of the discontinued LLVM Phabricator instance.

ThinLTO: sort inputs and schedule by decreasing size
Needs Review · Public

Authored by mehdi_amini on Aug 27 2016, 10:42 PM.

Details

Reviewers

Summary

This is a compile-time optimization: leaving a large file to be processed
at the end hurts parallelism.
The heuristic used right now is the input buffer size; however, we
may also want to consider the number of functions to import or the
number of different files to load for importing.
(Ported from ThinLTOCodeGenerator.cpp.)
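
To illustrate why a late large input hurts, here is a minimal sketch (not part of the patch) that simulates a fixed pool of workers picking up jobs in list order. The function names and the abstract cost units are assumptions made for illustration; the patch itself uses the input buffer size as its proxy for cost.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Simulates `Threads` workers that each take the next job from `Costs`
// (in list order) as soon as they become free. Returns the time at which
// the last job finishes. Costs are hypothetical work units.
uint64_t simulateMakespan(const std::vector<uint64_t> &Costs,
                          unsigned Threads) {
  Threads = std::max(1u, Threads); // avoid an empty pool

  // Min-heap holding the time at which each worker becomes free.
  std::priority_queue<uint64_t, std::vector<uint64_t>,
                      std::greater<uint64_t>>
      FreeAt;
  for (unsigned I = 0; I < Threads; ++I)
    FreeAt.push(0);

  uint64_t Makespan = 0;
  for (uint64_t Cost : Costs) {
    uint64_t Start = FreeAt.top(); // earliest available worker
    FreeAt.pop();
    uint64_t End = Start + Cost;
    FreeAt.push(End);
    Makespan = std::max(Makespan, End);
  }
  return Makespan;
}

// Returns a copy of `Costs` sorted by decreasing cost.
std::vector<uint64_t> largestFirst(std::vector<uint64_t> Costs) {
  std::sort(Costs.begin(), Costs.end(), std::greater<uint64_t>());
  return Costs;
}
```

With eight unit-cost inputs followed by one input of cost 8 and four workers, the in-order schedule finishes at time 10, while the largest-first schedule finishes at time 8.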

Diff Detail

Event Timeline

mehdi_amini updated this revision to Diff 69501. Aug 27 2016, 10:42 PM

mehdi_amini retitled this revision to ThinLTO: sort inputs and schedule by decreasing size.

mehdi_amini updated this object.

mehdi_amini added a reviewer: tejohnson.

mehdi_amini added a subscriber: llvm-commits.

Herald added a subscriber: mehdi_amini. · Aug 27 2016, 10:42 PM

Code changes look fine. But what is the impact on memory? I wonder if this will bloat the peak memory because more of the large modules will get run in parallel.

In D23966#527396, @tejohnson wrote:

Code changes look fine. But what is the impact on memory? I wonder if this will bloat the peak memory because more of the large modules will get run in parallel.

The peak memory when linking clang with 4 threads goes up from 1.65GB to 2.22GB with this patch.

This makes me wonder if we could have a smarter scheduler that would balance the inputs by interleaving the large ones with as many small ones as needed.
Another concern could be scheduling for "locality" (if A imports from B and B imports from A, schedule them back-to-back, as the files are more likely to already be mapped in memory). But this seems quite secondary.

This patch was motivated by a use case (llvm-tblgen maybe?) where a single large file was taking almost as long as *all* the others to process and was scheduled late. The link-time improvement was substantial in this case.
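
One simple reading of the interleaving idea can be sketched as follows. This is an illustration only, not something proposed in the review: it alternates strictly between the largest and smallest remaining inputs, whereas "as many small ones as needed" would call for a real cost model. The function name and the `Sizes` vector are hypothetical, and the effect on peak memory or link time has not been measured.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Returns module indices ordered as: largest, smallest, second largest,
// second smallest, and so on. `Sizes[I]` is a hypothetical size for module I.
std::vector<size_t> interleavedOrder(const std::vector<uint64_t> &Sizes) {
  std::vector<size_t> BySize(Sizes.size());
  std::iota(BySize.begin(), BySize.end(), size_t{0});
  // stable_sort keeps the input order for modules of equal size.
  std::stable_sort(BySize.begin(), BySize.end(),
                   [&](size_t L, size_t R) { return Sizes[L] > Sizes[R]; });

  std::vector<size_t> Order;
  Order.reserve(BySize.size());
  size_t Front = 0, Back = BySize.size();
  while (Front < Back) {
    Order.push_back(BySize[Front++]); // next largest
    if (Front < Back)
      Order.push_back(BySize[--Back]); // next smallest
  }
  return Order;
}
```

Starting a large module next to a small one, rather than next to another large one, is meant to lower the chance that several large modules are resident at once, at the cost of starting some large modules later than a pure decreasing-size order would.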

Revision Contents

Path                    Size
llvm/lib/LTO/LTO.cpp    32 lines

Diff 69501

llvm/lib/LTO/LTO.cpp

@@ 30 lines not shown @@
 #include "llvm/Support/ThreadPool.h"
 #include "llvm/Support/raw_ostream.h"
 #include "llvm/Target/TargetMachine.h"
 #include "llvm/Target/TargetOptions.h"
 #include "llvm/Transforms/IPO.h"
 #include "llvm/Transforms/IPO/PassManagerBuilder.h"
 #include "llvm/Transforms/Utils/SplitModule.h"

+#include <numeric>
 #include <set>

 using namespace llvm;
 using namespace lto;
 using namespace object;

 #define DEBUG_TYPE "lto"

@@ 691 lines not shown @@ Error LTO::runThinLTO(AddOutputFn AddOutput) {
   std::unique_ptr<ThinBackendProc> BackendProc = ThinLTO.Backend(
       Conf, ThinLTO.CombinedIndex, ModuleToDefinedGVSummaries, AddOutput);

   // Partition numbers for ThinLTO jobs start at 1 (see comments for
   // GlobalResolution in LTO.h). Task numbers, however, start at
   // ParallelCodeGenParallelismLevel if an LTO module is present, as tasks 0
   // through ParallelCodeGenParallelismLevel-1 are reserved for parallel code
   // generation partitions.
-  unsigned Task = RegularLTO.CombinedModule
+  unsigned FirstTask = RegularLTO.CombinedModule
                       ? RegularLTO.ParallelCodeGenParallelismLevel
                       : 0;
   unsigned Partition = 1;

-  for (auto &Mod : ThinLTO.ModuleMap) {
-    if (Error E = BackendProc->start(Task, Mod.second, ImportLists[Mod.first],
-                                     ExportLists[Mod.first],
-                                     ResolvedODR[Mod.first], ThinLTO.ModuleMap))
+  // Compute the ordering we will process the inputs: the rough heuristic here
+  // is to sort them per size so that the largest module get schedule as soon as
+  // possible. This is purely a compile-time optimization.
+  std::vector<unsigned> ModulesOrdering;
+  ModulesOrdering.resize(ThinLTO.ModuleMap.size());
+  std::iota(ModulesOrdering.begin(), ModulesOrdering.end(), FirstTask);
+  std::sort(
+      ModulesOrdering.begin(), ModulesOrdering.end(),
+      [&](int LeftIndex, int RightIndex) {
+        auto LSize =
+            (ThinLTO.ModuleMap.begin() + LeftIndex)->second.getBufferSize();
+        auto RSize =
+            (ThinLTO.ModuleMap.begin() + RightIndex)->second.getBufferSize();
+        return LSize > RSize;
+      });
+
+  for (auto &Task : ModulesOrdering) {
+    auto Mod = ThinLTO.ModuleMap.begin() + Task;
+    if (Error E = BackendProc->start(
+            Task, Mod->second, ImportLists[Mod->first], ExportLists[Mod->first],
+            ResolvedODR[Mod->first], ThinLTO.ModuleMap))
       return E;

     ++Task;
     ++Partition;
   }

   return BackendProc->wait();
 }
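
A note on the indexing in this diff: ModulesOrdering is filled starting at FirstTask, and the same values are then used as offsets into ThinLTO.ModuleMap, so the two appear to coincide only when FirstTask is 0. The standalone sketch below (plain standard C++, no LLVM types) shows one way to keep the module index and the task number separate. The function name, the `Sizes` vector, and the choice to derive the task number from the original module position are all assumptions for illustration, not what the patch does.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Returns (ModuleIndex, Task) pairs in the order jobs should be started,
// largest module first. Module indices start at 0; task numbers start at
// `FirstTask` and follow the original module order (an assumption).
std::vector<std::pair<size_t, unsigned>>
scheduleBySize(const std::vector<uint64_t> &Sizes, unsigned FirstTask) {
  // Sort 0-based module indices by decreasing size.
  std::vector<size_t> Order(Sizes.size());
  std::iota(Order.begin(), Order.end(), size_t{0});
  std::stable_sort(Order.begin(), Order.end(),
                   [&](size_t L, size_t R) { return Sizes[L] > Sizes[R]; });

  // Attach the task number only after sorting, so the index used to look
  // up the module is never offset by FirstTask.
  std::vector<std::pair<size_t, unsigned>> Jobs;
  Jobs.reserve(Order.size());
  for (size_t Index : Order)
    Jobs.emplace_back(Index, FirstTask + static_cast<unsigned>(Index));
  return Jobs;
}
```

For sizes {100, 300, 200} and FirstTask = 4, this starts module 1 as task 5, then module 2 as task 6, then module 0 as task 4.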