This is an archive of the discontinued LLVM Phabricator instance.

[NFCI][llvm-exegesis] Benchmark: parallelize codegen (5x ... 8x less wallclock)
Needs Review · Public

Authored by lebedev.ri on Dec 18 2022, 9:39 AM.

Details

Summary

(I'd be fine just committing this, but decided to post it anyway, for greater visibility if nothing else.)

We *might* not want to perform codegen for all Configurations × Repetitions
beforehand, since the produced Runnable Configurations may have significant
file sizes (up to 1 MB?), and there are many Runnable Configurations (30k?).
But doing it in batches is fine memory-wise, and is a win, as expected.

We really don't want to smudge the measurements, so those are performed
standalone, without running *anything* else in parallel;
but when not measuring, the codegen can be done in parallel.

Special care is taken to produce snippets in a deterministic order;
the snippets themselves are still randomized, but non-determinism
in their ordering would not be useful on top of that.
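
To make the batching scheme concrete, here is a minimal self-contained sketch
of the idea. This is *not* the patch itself: generate() and measure() are
hypothetical stand-ins for the real per-configuration codegen and measurement
steps, and a real implementation would use a thread pool rather than one
thread per task.

  #include <algorithm>
  #include <cstddef>
  #include <future>
  #include <string>
  #include <vector>

  std::string generate(int Config);      // expensive; safe to parallelize
  void measure(const std::string &Obj);  // timing-sensitive; must run alone

  void run(const std::vector<int> &Configs, std::size_t BatchSize) {
    for (std::size_t Begin = 0; Begin < Configs.size(); Begin += BatchSize) {
      std::size_t End = std::min(Begin + BatchSize, Configs.size());
      // Perform codegen for the whole batch in parallel, collecting the
      // results in submission order, so the output stays deterministic no
      // matter which thread finishes first.
      std::vector<std::string> Objects;
      Objects.reserve(End - Begin);
      {
        std::vector<std::future<std::string>> Futures;
        for (std::size_t I = Begin; I != End; ++I)
          Futures.push_back(
              std::async(std::launch::async, generate, Configs[I]));
        // Join *all* codegen tasks before measuring anything.
        for (std::future<std::string> &F : Futures)
          Objects.push_back(F.get());
      }
      // Measure standalone, with nothing else running in parallel, so the
      // codegen threads cannot smudge the measurements.
      for (const std::string &Obj : Objects)
        measure(Obj);
    }
  }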

And so it becomes almost real-time:

time ./bin/llvm-exegesis --opcode-index=-1 --mode=latency --repetition-mode=duplicate --dump-object-to-disk=0 --benchmarks-file=/tmp/res-new.yaml --measurements-print-progress --max-configs-per-opcode=8192

old:

real    1m33.500s
user    1m29.644s
sys     0m1.762s

new:

real    0m18.191s
user    3m8.253s
sys     0m3.999s

(5.1x)

time ./bin/llvm-exegesis --opcode-index=-1 --mode=uops --repetition-mode=duplicate --dump-object-to-disk=0 --benchmarks-file=/tmp/res-new.yaml --measurements-print-progress

old:

real    1m52.256s
user    1m48.518s
sys     0m1.479s

new:

real    0m13.273s
user    4m14.228s
sys     0m4.903s

(8.5x)

time ./bin/llvm-exegesis --opcode-index=-1 --mode=inverse_throughput --repetition-mode=duplicate --dump-object-to-disk=0 --benchmarks-file=/tmp/res-new.yaml --measurements-print-progress

old:

real    1m58.765s
user    1m53.259s
sys     0m2.937s

new:

real    0m19.586s
user    4m19.133s
sys     0m6.314s

(6x)

See also: https://discourse.llvm.org/t/does-anyone-use-llvm-exegesis-feedback-wanted/67729/

Diff Detail

Event Timeline

lebedev.ri created this revision. Dec 18 2022, 9:39 AM
Herald added a project: Restricted Project. Dec 18 2022, 9:39 AM
Herald added a subscriber: mstojanovic.
lebedev.ri requested review of this revision. Dec 18 2022, 9:39 AM
lebedev.ri edited reviewers, added: gchatelet; removed: gchakrabarti.

Note: my main question here is whether we want any knobs for this.

Does anyone have any thoughts here?
It's fine if nobody is comfortable reviewing the parallelism;
just a general review of the user-facing side is fine.

Add the flags.

To answer the question I know will come up: yes, thread-batch-size is somewhat useful.
The problem is that the final snippet to be measured can vary greatly in size,
depending on the unroll factor. And when you are trying to measure many instructions,
e.g. all 20k of them, and end up with ~40k snippets, if each one takes 1 MB
(worst-case scenario), you already need 40 GB of RAM.
We don't really know beforehand how many bytes any particular snippet will take,
so putting a limit on the amount of memory to be used seems less feasible.
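
To spell out the arithmetic (batch size purely illustrative): with batching, peak memory is instead bounded by roughly

  peak memory ≈ thread-batch-size × worst-case snippet size

so e.g. a batch of 1024 snippets at 1 MB each needs about 1 GB at a time, rather than ~40 GB for materializing all ~40k snippets up front.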

Do we really need the speed boost? My worry is that we are making the code more complex for a negligible advantage. I personally have never felt that waiting for a couple of minutes was really an issue, but you might have other use cases in mind?

TBH I'm not sure it's worth making the code more complex to save a minute on a tool that is supposed to run only once in a while.
That said, I may not fully understand your use case. What's your typical use? Why is it important to make it faster?

...
Are you seriously saying that 10x wallclock improvement is negligible?

The reason is always the same: making iterative development cycles faster.
I would just like to point out that the analysis used to be 50x slower originally.
It was borderline unusable. But sure, one rarely needs it, so perhaps that was fine.
I would like to note that it is still slow[er than needed]. And the linear algebra stuff,
which I'm hoping to revisit, will make it even slower.

This is exactly the same story here. We are spending massive amounts of time
and not doing anything useful with it, when we could be doing something useful,
like, dunno, making more measurements. Why do we want it to be slow?
For me that long runtime has historically been the pain point.

If that was a joke, it wasn't a funny one, I'm afraid.

Additionally, I would like to note a well-known fact: exegesis makes *very* liberal use of randomness.
Personally, I've found that a single measurement run (a single llvm-exegesis invocation) is not quite useful,
and I practically always do several runs and concatenate the YAMLs. That e.g. filters out some of
the "oh, but why does this instruction's schedule suddenly not match the measurements?" So there goes some of that
performance headroom. But sure, everyone has 10 hours to wait while sleep(100); a+=1; sleep(100) finishes...
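
For illustration, that workflow is roughly the following (run count and paths arbitrary; the flags are the same ones used in the summary above):

  ./bin/llvm-exegesis --opcode-index=-1 --mode=latency --benchmarks-file=/tmp/run-1.yaml
  ./bin/llvm-exegesis --opcode-index=-1 --mode=latency --benchmarks-file=/tmp/run-2.yaml
  ./bin/llvm-exegesis --opcode-index=-1 --mode=latency --benchmarks-file=/tmp/run-3.yaml
  cat /tmp/run-*.yaml > /tmp/combined.yaml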

Ping.
Is there some concrete concern? If there isn't, it can't be addressed.

All of this is pretty much entirely idiomatic boilerplate.
I understand that parallelism may be hard to understand the first few times,
but there isn't even any synchronization going on here.

Are you seriously saying that 10x wallclock improvement is negligible?

10x speed improvement is not negligible. I'm simply questioning whether speed matters in this case.

If the speed improvement came at no cost, that would be a no-brainer. But there is a speed/readability tradeoff, which we need to evaluate.

In my personal experience, I did not feel that benchmarking speed was ever an issue, but I do feel that this code is more complex to understand. Therefore to me that tradeoff is negative. If other people feel otherwise I'm happy to reconsider.

Has anyone got a recent profile of llvm-exegesis up to --benchmark-phase=assemble-measured-code?

lebedev.ri added a comment (edited). Jan 16 2023, 7:13 AM

Are you seriously saying that 10x wallclock improvement is negligible?

10x speed improvement is not negligible. I'm simply questioning whether speed matters in this case.

If the speed improvement came at no cost, that would be a no-brainer. But there is a speed/readability tradeoff, which we need to evaluate.

All the changes here are to a single function
that isn't really going to change further anyway.
It's not like this requires changes in many places.

In my personal experience, I did not feel that benchmarking speed was ever an issue, but I do feel that this code is more complex to understand. Therefore to me that tradeoff is negative. If other people feel otherwise I'm happy to reconsider.

I would not bother with this if I didn't find the existing speed to be problematic.
I would not have added the progress meter either, nor fixed the analysis speed.
I'm doing these things because I found them to be sub-par during my usage.

Has anyone got a recent profile of llvm-exegesis up to --benchmark-phase=assemble-measured-code?

For an all-opcode pass,

  • --benchmark-phase=prepare-snippet is instantaneous, taking less than a second
  • --benchmark-phase=prepare-and-assemble-snippet takes maybe 2-5 seconds
  • --benchmark-phase=assemble-measured-code takes minutes.
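
(For reference, each phase can be timed in isolation with an invocation along these lines, varying --benchmark-phase; the remaining flags match the commands from the summary:)

  time ./bin/llvm-exegesis --opcode-index=-1 --mode=latency --repetition-mode=duplicate --dump-object-to-disk=0 --benchmark-phase=prepare-snippet --benchmarks-file=/tmp/res.yaml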

Are you seriously saying that 10x wallclock improvement is negligible?

10x speed improvement is not negligible. I'm simply questioning whether speed matters in this case.

If the speed improvement came at no cost, that would be a no-brainer. But there is a speed/readability tradeoff, which we need to evaluate.

All the changes here are to a single function
that isn't really going to change further anyway.
It's not like this requires changes in many places.

I don't think the number of touched functions is an agreed-upon measure of complexity :) I would even argue that splitting into more functions might make the code less complex.

But again, my point is more about usefulness to users, and right now it looks like the change leaves both reviewers (who happen to be users) questioning the complexity/usefulness ratio. But both these users might also tend to use the tool the same way, so maybe we're not seeing the whole picture - maybe ask for more opinions?

Has anyone got a recent profile of llvm-exegesis up to --benchmark-phase=assemble-measured-code?

For an all-opcode pass,

  • --benchmark-phase=prepare-snippet is instantaneous, taking less than a second
  • --benchmark-phase=prepare-and-assemble-snippet takes maybe 2-5 seconds
  • --benchmark-phase=assemble-measured-code takes minutes.

I meant an actual profile - not just a timing - to see where the cycles are going. My hope is that there will be a series of minor changes we can make instead of going down the multi-threading path.

Has anyone got a recent profile of llvm-exegesis up to --benchmark-phase=assemble-measured-code?

For an all-opcode pass,

  • --benchmark-phase=prepare-snippet is instantaneous, taking less than a second
  • --benchmark-phase=prepare-and-assemble-snippet takes maybe 2-5 seconds
  • --benchmark-phase=assemble-measured-code takes minutes.

I meant an actual profile - not just a timing - to see where the cycles are going. My hope is that there will be a series of minor changes we can make instead of going down the multi-threading path.

The actual codegen is known to be a compile-time hog.
There is practically no way this result can be replicated otherwise.

Are you seriously saying that 10x wallclock improvement is negligible?

10x speed improvement is not negligible. I'm simply questioning whether speed matters in this case.

If the speed improvement came at no cost, that would be a no-brainer. But there is a speed/readability tradeoff, which we need to evaluate.

All the changes here are to a single function
that isn't really going to change further anyway.
It's not like this requires changes in many places.

I don't think the number of touched functions is an agreed-upon measure of complexity :) I would even argue that splitting into more functions might make the code less complex.

But again, my point is more about usefulness to users, and right now it looks like the change leaves both reviewers (who happen to be users) questioning the complexity/usefulness ratio. But both these users might also tend to use the tool the same way, so maybe we're not seeing the whole picture - maybe ask for more opinions?

Right. Well, as a user, I do think the existing performance is not great.
More seriously, I think you would be in a better position to ask users, because I don't know any (other than the ones already here).

Has anyone got a recent profile of llvm-exegesis up to --benchmark-phase=assemble-measured-code?

For an all-opcode pass,

  • --benchmark-phase=prepare-snippet is instantaneous, taking less than a second
  • --benchmark-phase=prepare-and-assemble-snippet takes maybe 2-5 seconds
  • --benchmark-phase=assemble-measured-code takes minutes.

I meant an actual profile - not just a timing - to see where the cycles are going. My hope is that there will be a series of minor changes we can make instead of going down the multi-threading path.

See https://discourse.llvm.org/t/does-anyone-use-llvm-exegesis-feedback-wanted/67729/5?u=lebedevri

gchatelet added inline comments. Jan 23 2023, 2:03 AM
llvm/tools/llvm-exegesis/llvm-exegesis.cpp
249–250

Does this bring anything to the user?
Maybe "The number of threads to use for parallel operations (default = 0 (autodetect))"

256–259

"The batch size for parallel operations as it is not efficient to run one task per thread (default = 0 (autodetect))" ?

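A minimal sketch of how these two flags could be declared via llvm::cl with the suggested wording; "thread-batch-size" is the name mentioned in this review, while "num-threads" is a hypothetical placeholder:

  static cl::opt<unsigned> NumThreads(
      "num-threads",
      cl::desc("The number of threads to use for parallel operations "
               "(default = 0 (autodetect))"),
      cl::init(0));

  static cl::opt<unsigned> ThreadBatchSize(
      "thread-batch-size",
      cl::desc("The batch size for parallel operations, as it is not "
               "efficient to run one task per thread "
               "(default = 0 (autodetect))"),
      cl::init(0));
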
524

I think we can drop the exegesis namespace qualifier here.

532–537

It is not clear to me what you're trying to achieve here.
I would introduce a function and assign the variable once instead of mutating it in place. The function's name may also help make the intent clear.

536

Can we introduce functions to cut down on the nesting level?

Also, I believe functions will largely improve readability.

while (!Configurations.empty()) {
  // setup PerConfigRCs
  computeBatch(PerConfigRCs);
  runBatch(PerConfigRCs);
  PerConfigRCs.clear();
}
lebedev.ri marked 5 inline comments as done.
lebedev.ri edited the summary of this revision. (Show Details)

@gchatelet thank you for taking a look!
Is this better?

llvm/tools/llvm-exegesis/llvm-exegesis.cpp
524

Right. This kind of thing is probably happening in a few other places.

Simplify loops in computeBatch().

While there, extract runOneConfiguration() out of runBatch().

Much better, thx.
I've added a few more comments.

llvm/tools/llvm-exegesis/llvm-exegesis.cpp
408–411

Can you try with structured binding here (and below)?

419

typo

419

typo

419

no whitespace since we're talking about the type

440

remove

444

ditto

464

Can we reverse this condition and return early to save on the nesting level?