This is an archive of the discontinued LLVM Phabricator instance.

Differential D113857

[llvm-reduce] Add parallel chunk processing.
ClosedPublic

Authored by fhahn on Nov 14 2021, 11:17 AM.

Details

Reviewers

aeubanks
dblaikie
lebedev.ri
Meinersbur

Commits

rG8ef460fc5137: [llvm-reduce] Add parallel chunk processing.

Summary

This patch adds parallel processing of chunks. When reducing very large
inputs, e.g. functions with 500k basic blocks, processing chunks in
parallel can significantly speed up the reduction.

To allow modifying clones of the original module in parallel, each clone
needs their own LLVMContext object. To achieve this, each job parses the
input module with their own LLVMContext. In case a job successfully
reduced the input, it serializes the result module as bitcode into a
result array.

To ensure parallel reduction produces the same results as serial
reduction, only the first successfully reduced result is used, and
results of other successful jobs are dropped. Processing resumes after
the chunk that was successfully reduced.

The number of threads to use can be configured using the -max-chunk-threads
option. It defaults to 1, which means serial processing.
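
The ordering rule described above can be illustrated with a small standalone sketch. It is not the patch's code: it does not use LLVM's `ThreadPool` or `CheckChunk`, `std::async` stands in for the thread pool, and `firstReducedInOrder` and `TryReduceFn` are hypothetical names. Each task returns a serialized result (empty on failure), and results are inspected strictly in chunk order, so the outcome does not depend on which task finishes first.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one chunk check: returns the serialized reduced
// module on success, or an empty string if the chunk is still interesting.
using TryReduceFn = std::function<std::string(int ChunkIndex)>;

// Launch one task per chunk, then inspect the results strictly in chunk order.
// Returns {index, payload} of the first chunk (in order) that reduced, or
// {-1, ""} if none did. Later successful results are dropped.
std::pair<int, std::string>
firstReducedInOrder(int NumChunks, const TryReduceFn &TryReduce) {
  assert(NumChunks >= 0 && "chunk count must be non-negative");
  std::vector<std::future<std::string>> Tasks;
  for (int I = 0; I < NumChunks; ++I)
    Tasks.push_back(std::async(std::launch::async, TryReduce, I));

  std::pair<int, std::string> Winner(-1, std::string());
  for (int I = 0; I < NumChunks; ++I) {
    std::string Result = Tasks[I].get(); // blocks until task I has finished
    if (Winner.first < 0 && !Result.empty())
      Winner = std::make_pair(I, std::move(Result));
  }
  return Winner;
}
```

In this sketch every chunk is checked even after a winner is known; the review discussion further down is largely about avoiding that wasted work.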

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fhahn requested review of this revision. Nov 14 2021, 11:17 AM

fhahn created this revision.

Herald added a project: Restricted Project. Nov 14 2021, 11:17 AM

fhahn added a parent revision: D113856: [llvm-reduce] Move code to check chunk to function, to enable reuse (NFC). Nov 14 2021, 11:17 AM

Harbormaster completed remote builds in B134160: Diff 387118. Nov 14 2021, 11:17 AM

One issue is that the text output of processing each Chunk in a job batch will get mixed & jumbled together.

Meinersbur retitled this revision from "[[llvm-reduce] Add parallel chunk processing." to "[llvm-reduce] Add parallel chunk processing.". Nov 15 2021, 10:05 AM

Meinersbur edited the summary of this revision. (Show Details)

Fixes spelling in title and summary.

In D113857#3130113, @fhahn wrote:

One issue is that the text output of processing each Chunk in a job batch will get mixed & jumbled together.

[suggestion] Task printouts could be redirected to a buffer and printed in-order.
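
A minimal sketch of that suggestion, assuming each task can be handed its own output stream: every task writes to a private buffer, and the buffers are concatenated in task order after all tasks have finished. The names `runWithOrderedOutput` and `LoggingTask` are hypothetical, and plain `std::thread` is used here instead of LLVM's `ThreadPool`.

```cpp
#include <cassert>
#include <functional>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical task signature: the task writes its log output to the stream
// it is given instead of writing to a shared stream such as stderr.
using LoggingTask = std::function<void(int TaskIndex, std::ostream &Log)>;

// Run NumTasks tasks concurrently, each with a private buffer, and return the
// concatenated log text in task order once all tasks have finished.
std::string runWithOrderedOutput(int NumTasks, const LoggingTask &Task) {
  assert(NumTasks >= 0 && "task count must be non-negative");
  std::vector<std::ostringstream> Buffers(NumTasks);
  std::vector<std::thread> Threads;
  for (int I = 0; I < NumTasks; ++I)
    Threads.emplace_back([&Buffers, &Task, I] { Task(I, Buffers[I]); });
  for (std::thread &T : Threads)
    T.join();

  // Each thread touched only its own buffer, so no locking was needed.
  std::string Ordered;
  for (const std::ostringstream &B : Buffers)
    Ordered += B.str();
  return Ordered;
}
```

The caller can then print the returned string in one piece, so output from different chunks is not interleaved.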

llvm/tools/llvm-reduce/deltas/Delta.cpp
39–40	Why stop me from using 128 jobs on a 128 core machine? Could warn about diminishing returns because of reduced results being discarded, i.e. exploiting SMT is not that useful. [suggestion] Use `-j` (for "jobs") shortcut for consistency with build tools such as ninja and make.
279	Why require at least 5 parallel jobs? Synchronization overhead?
280	`NumTasks` is originally assigned `MaxChunkThreads * 2` but here it is overwritten again with a different value?
288	Why not leave the number of tasks up to `ThreadPoolStrategy::apply_thread_strategy`? Even if `--max-chunk-threads` is larger, the current approach is limited to what `ThreadPoolStrategy::compute_thread_count()` chooses. [suggestion] Use `ThreadPool::async` `MaxChunkThreads` times and wait for the first one in the queue. If that one finishes, add another job using `ThreadPool::async` until either one successfully reduces or we are out of chunks still considered interesting.
301	We have `Result` and `Res` in the same scope. Better names?
330	There are two places where `I` is incremented and far apart. Suggestion:

    for (auto I = ChunksStillConsideredInteresting.rbegin(),
              E = ChunksStillConsideredInteresting.rend();
         I != E;) {
      bool UseThreadPool = MaxChunkThreads > 1 && WorkLeft > 4;
      int WorkItemsInThisJob = UseThreadPool ? WorkLeft : 1;
      ...
      I += WorkItemsInThisJob;

Adjust task handling: instead of queuing 2 * NumJobs tasks up front and waiting for all of them to complete, first queue a number closer to the available jobs (NumJobs + 2 at the moment, so there's a bit of extra work in case a few jobs finish early).

The patch now uses a queue to keep track of all futures for the scheduled tasks. After scheduling the initial set of tasks, we wait for the first task in the queue to complete. After it completes, we schedule another task (if we can and there are no other tasks that successfully reduced a chunk). We continue waiting for tasks in the queue, until we reach a job that reduced a chunk or we are out of jobs.
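
The scheduling scheme described above can be sketched in standalone C++ as follows. This is an illustration under assumptions, not the code from the patch: `std::async` replaces LLVM's `ThreadPool`, a non-empty string stands for a successful reduction, and `reduceWithTaskQueue` and `TryReduceFn` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <functional>
#include <future>
#include <string>

// Hypothetical stand-in for one chunk check: a non-empty string means the
// chunk was reduced successfully.
using TryReduceFn = std::function<std::string(int ChunkIndex)>;

// Keep at most NumJobs tasks in flight. Always wait on the oldest task; if it
// reduced, stop and return its chunk index. Otherwise refill the queue, unless
// some task has already reported a reduction. Returns -1 if nothing reduced.
int reduceWithTaskQueue(int NumChunks, unsigned NumJobs,
                        const TryReduceFn &TryReduce) {
  assert(NumJobs > 0 && "need at least one job");
  std::atomic<bool> AnyReduced(false);
  // Declared after AnyReduced so pending tasks finish before it is destroyed
  // (futures returned by std::async block in their destructor).
  std::deque<std::future<std::string>> TaskQueue;
  int NextChunk = 0;  // next chunk to schedule
  int FrontChunk = 0; // chunk handled by the task at the front of the queue

  auto Schedule = [&] {
    int Chunk = NextChunk++;
    TaskQueue.push_back(std::async(std::launch::async, [&, Chunk] {
      std::string Result = TryReduce(Chunk);
      if (!Result.empty())
        AnyReduced = true; // tell the scheduler to stop adding work
      return Result;
    }));
  };

  while (NextChunk < NumChunks && TaskQueue.size() < NumJobs)
    Schedule();

  int Reduced = -1;
  while (!TaskQueue.empty()) {
    std::string Result = TaskQueue.front().get();
    TaskQueue.pop_front();
    if (!Result.empty()) {
      Reduced = FrontChunk; // first success in queue order wins
      break;
    }
    ++FrontChunk;
    while (!AnyReduced && NextChunk < NumChunks && TaskQueue.size() < NumJobs)
      Schedule();
  }
  return Reduced;
}
```

Because tasks are scheduled and consumed in the same order, the chunk that is reported is the one a serial scan would have found, even if a later task finishes earlier.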

Herald added a subscriber: mgorny. Nov 17 2021, 4:16 AM

fhahn marked an inline comment as done. Nov 17 2021, 4:26 AM

fhahn added inline comments.

llvm/tools/llvm-reduce/deltas/Delta.cpp
39–40	Why stop me from using 128 jobs on a 128 core machine? My original intention was to avoid wasting resources in cases where we run a lot of parallel tasks, but only the first job already reduced the chunk. I adjusted the task management now to schedule a number of initial tasks closer to the number of threads and then queue new jobs as you suggested. So maybe it is less of an issue now and the restriction can be dropped. [suggestion] Use -j (for "jobs") shortcut for consistency with build tools such as ninja and make. Done!
279	Yes, but a smaller limit may be good as well. In the current version it is just 2 tasks.
280	this is now gone
288	Why not leave the number of tasks up to ThreadPoolStrategy::apply_thread_strategy? Do you know how to do this? It seems like the thread pool constructor expects a strategy to be passed in. [suggestion] Use ThreadPool::async MaxChunkThreads times and wait for the first one in the queue. If that one finishes, add another job using ThreadPool::async until either one successfully reduces or we are out of chunks still considered interesting. That sounds like a good idea, thanks! I updated the code to work along those lines, unless I missed something. I also added a shared variable to indicate whether any task already reduced something, to avoid adding new jobs in that case. This approach is slightly faster than the original patch for reducing `llvm/test/tools/llvm-reduce/operands-skip.ll` in parallel.
301	Thanks, I updated it to `ChunkResult`.
330	Thanks, I added a new `NumChunksProcessed`, which is set to 1 initially and then in the loop where we process the parallel results.

Harbormaster completed remote builds in B134710: Diff 387897. Nov 17 2021, 5:04 AM

Meinersbur added inline comments. Nov 17 2021, 10:55 AM

llvm/tools/llvm-reduce/deltas/Delta.cpp
39–40	I still think it's the user's right to shoot themselves into the foot if they want to (it would be different if the number of jobs is determined by a heuristic). A maximum could be suggested in the description/documentation also explaining why. A hard limit blocks legitimate use cases.
42–46	Remove the option unless LLVM is compiled with LLVM_ENABLE_THREADS? We will not get any parallelism otherwise.
270	[serious] `Results` is never resized to cover the last `NumChunksProcessed`, i.e. we get a buffer overflow. Instead, we could use a container that never invalidates its elements such as `std::deque` or use a circular buffer. Alternatively, you can change `TaskQueue` to `std::deque<std::pair<std::shared_future<void>, ResultTy>>` to keep the task and its result in the same queue.
288	That sounds like a good idea, thanks! I updated the code to work along those lines, unless I missed something. Nice! Thanks! Why not leave the number of tasks up to ThreadPoolStrategy::apply_thread_strategy? Do you know how to do this? It seems like the thread pool constructor expects a strategy to be passed in. I thought then we could override `apply_thread_strategy` to set either `NumThreads` or, if no number of threads is specified, call the inherited methods. But it's non-virtual and your approach `hardware_concurrency(NumJobs)` does about the same thing. To get the number of effective jobs, we could call `compute_thread_count` of `ThreadPoolStrategy`. This still applies if explicit with `hardware_concurrency(NumJobs)` since `compute_thread_count` caps it by the number of hardware threads. Currently `-j<n>` requires the user to specify a number of jobs, but if we want to have a heuristic, we could set `UseHyperThreads = false` without setting `ThreadsRequested`. This heuristically-determined number of jobs is what we might cap at 32. Since I am not sure this is a good heuristic, we may keep requiring the user to specify the number of jobs. I'd remove the `+2` for `NumInitialTasks` as by `hardware_concurrency(NumJobs)`, there will not be more threads created for those two additional jobs.
310	This waits until the task queue is empty. Could we leave earlier as soon as a reduced version is available, and just forget the jobs not yet finished while scheduling the next ones?
321	For just a bool, maybe use `std::atomic<bool>` instead?
330	The updated patch still increments `I` in the for-statement parens as well as in the body. As a reader of the code, I find this very unexpected. Since with the task queue, the `while` loop will always process all work items (or break on finding a reduction), there is no advantage in having them in the same loop anymore. Updated suggestion:

    if (NumJobs > 1 && ChunksStillConsideredInteresting.size() > 1) {
      auto I = ChunksStillConsideredInteresting.begin();
      bool AnyReduced = false;
      do {
        if (!TaskQueue.empty()) {
          auto TaskAndResult = TaskQueue.pop_front_val();
          TaskAndResult.first.wait();
          if (TaskAndResult.second)
            return TaskAndResult.second; // Found a result
        }
        while (!AnyReduced && TaskQueue.size() < NumJobs &&
               I != ChunksStillConsideredInteresting.end()) {
          TaskQueue.push_back(ChunkThreadPool.async(...));
          ++I;
        }
      } while (!TaskQueue.empty());
    } else {
      for (Chunk &ChunkToCheckForUninterestingness :
           reverse(ChunksStillConsideredInteresting))
        std::unique_ptr<ReducerWorkItem> Result = CheckChunk(...);
    }

This also uses the same code for adding the initial jobs and re-filling the queue.

fhahn marked an inline comment as done. Nov 17 2021, 12:39 PM

fhahn added inline comments.

llvm/tools/llvm-reduce/deltas/Delta.cpp
270	I think the code should only add new jobs if `NumChunkProcessed < Results.size()` in one iteration of the main loop, but there might be an error. Adding it to the deque sounds like a great idea though to make things a bit simpler. Alternatively we could also try to communicate the result through the `Future` directly, although that would require some changes to `ThreadPool`, because currently it just supports returning `shared_future<void>`.
310	The current approach waits for each task, until it finds the first one that leads to a reduction. It intentionally does not try to stop once any job reduces a chunk, because if we pick any reduced chunk we might get different results than during serial reduction.

I am writing lots of suggestions for something you may have intended to be a simple addition that could be improved later just as well. If this applies, please tell me.

llvm/tools/llvm-reduce/deltas/Delta.cpp
270	Sorry, I did not see the `NumChunkProcessed < Results.size()` condition. IIUC, this will require the task queue to be cleared every `NumJobs * 3` jobs. I did not see that; it could be avoided using the solutions that I mentioned.
310	Assume the current content of TaskQueue:

    0: job finished, no valid reduction
    1: job still running
    2: job finished, valid reduction result
    3: job still running

I understood that the intention of AnyReduced is to not queue additional jobs (assuming `NumJobs > 4`) since we know that when hitting 2, we will have a reduced result. We still need to wait for job 1 which may also compute a valid reduction. I think this is clever. I had `std::future<ResultTy>` in mind as you proposed yourself, until I discovered that `llvm::ThreadPool` only supports a void return type :-(. However, to avoid submitting more jobs after we know that there will be a result available, I think an `AnyReduced` flag would still be useful. My comment was regarding index 3 in the queue. Assuming job 1 finishes and job 2 becomes the front of the queue, do we still need to wait for job 3 before submitting jobs based on the new intermediate result? It may cause issues with who becomes responsible to free resources, so I am not sure it's feasible.

In D113857#3138944, @Meinersbur wrote:

I am writing lots of suggestions for something you may have intended to be a simple addition that could be improved later just as well. If this applies, please tell me.

Thanks for taking a close look, it is very much appreciated!

The latest change updates the patch to communicate results via shared_future; this requires the ThreadPool refactoring from D114183.

Drop +2 increment for initial number of tasks.

llvm/tools/llvm-reduce/deltas/Delta.cpp
39–40	Sounds good, I dropped the limit.
42–46	Updated! If `LLVM_ENABLE_THREADS` is not set `NumJobs` is defined as `unsigned` with a value of 1.
288	Since I am not sure this is a good heuristic, we may keep requiring the user to specify the number of jobs. I think for now requiring a user specified value is the easiest. I'd remove the +2 for NumInitialTasks as by hardware_concurrency(NumJobs), there will not be more threads created for those two additional jobs. The main intention was to have a few more jobs to fill threads when jobs exit earlier, but I think it could do more harm than good. I removed it.
310	I updated the code to communicate the result via the `shared_future`. This depends on D114183 now. My comment was regarding index 3 in the queue. Assuming job 1 finishes and job 2 becomes the front of the queue, do we still need to wait for job 3 before submitting jobs based on the new intermediate result? It may cause issues with who becomes responsible to free resources, so I am not sure it's feasible. Oh I see, you are thinking about submitting jobs already with the updated state? At the moment I am not really sure what kind of issues that would bring and if it is feasible. I think it would be good to try to keep the initial version as simple as possible.
321	Done, thanks!
330	This also uses the same code for adding the initial jobs and re-filling the queue. Ah, thanks for elaborating! I have not updated the code so far, because I am not yet sure how to best include the processing of the found result. When splitting it into 2 separate loops, I am not sure how to best continue processing the next chunk after the reduced one. I could merge the 2 loops in the current implementation though if you think it's easier to read (the one that schedules the initial jobs and then processes the queue)

Harbormaster completed remote builds in B134984: Diff 388315. Nov 18 2021, 2:22 PM

Meinersbur added inline comments. Nov 18 2021, 3:45 PM

llvm/tools/llvm-reduce/deltas/Delta.cpp
183	[Not a change request] To avoid global variables, did you consider making `AnyReduced` inside `runDeltaPassInt` and `ProcessChunkFromSerializedBitcode` a lambda?
186	[Not a change request] Why prefer `SmallString<0>` over `std::string`?
288	Agreed (to both)
310	Agreed. A TODO in the code could mention this

fhahn added a parent revision: D114183: [ThreadPool] Support returning futures with results. Nov 19 2021, 2:26 AM

Make AnyReduced a local variable, pass to function & lambda.

Also updated to use ThreadPoolWithResult after changes to D114183.

Harbormaster completed remote builds in B135082: Diff 388442. Nov 19 2021, 2:49 AM

fhahn mentioned this in D114183: [ThreadPool] Support returning futures with results. Nov 19 2021, 5:14 AM

rebased on top of latest changes to D114183

Harbormaster completed remote builds in B135271: Diff 388710. Nov 20 2021, 10:41 AM

remove unnecessary std::move from ... = std::move(Future.get());

fhahn mentioned this in D114363: [ThreadPool] Do not return shared futures. Nov 22 2021, 5:15 AM

Harbormaster completed remote builds in B135399: Diff 388877. Nov 22 2021, 7:22 AM

LGTM

llvm/tools/llvm-reduce/deltas/Delta.cpp
283	Trailing underscore usually not used in the LLVM code base.

This revision is now accepted and ready to land. Nov 23 2021, 12:59 AM

fhahn mentioned this in rGa5fff58781f3: [ThreadPool] Do not return shared futures. Nov 23 2021, 2:06 AM

Rebased on top of a5fff58781f3.

Harbormaster completed remote builds in B135586: Diff 389141. Nov 23 2021, 6:26 AM

Closed by commit rG8ef460fc5137: [llvm-reduce] Add parallel chunk processing. (authored by fhahn). Nov 24 2021, 1:24 AM

This revision was automatically updated to reflect the committed changes.

fhahn added a commit: rG8ef460fc5137: [llvm-reduce] Add parallel chunk processing..

fhahn marked an inline comment as done. Nov 24 2021, 1:25 AM

fhahn added inline comments.

llvm/tools/llvm-reduce/deltas/Delta.cpp
283	Thanks, I adjusted the name in the committed version (ChunkThreadPool and ChunkThreadPoolPtr)

Revision Contents

Path	Size
llvm/test/tools/llvm-reduce/operands-skip.ll	7 lines
llvm/tools/llvm-reduce/CMakeLists.txt	1 line
llvm/tools/llvm-reduce/deltas/Delta.cpp	151 lines

Diff 389424

llvm/test/tools/llvm-reduce/operands-skip.ll

  ; RUN: llvm-reduce %s -o %t --delta-passes=operands-skip --test FileCheck --test-arg %s --test-arg --match-full-lines --test-arg --check-prefix=INTERESTING --test-arg --input-file
  ; RUN: FileCheck %s --input-file %t --check-prefixes=REDUCED

+ ; RUN: llvm-reduce -j 2 %s -o %t.1 --delta-passes=operands-skip --test FileCheck --test-arg %s --test-arg --match-full-lines --test-arg --check-prefix=INTERESTING --test-arg --input-file
+ ; RUN: FileCheck %s --input-file %t.1 --check-prefixes=REDUCED
+
+ ; RUN: llvm-reduce -j 4 %s -o %t.2 --delta-passes=operands-skip --test FileCheck --test-arg %s --test-arg --match-full-lines --test-arg --check-prefix=INTERESTING --test-arg --input-file
+ ; RUN: FileCheck %s --input-file %t.2 --check-prefixes=REDUCED
+
+
  ; INTERESTING: store i32 43, i32* {{(%imm\|%indirect)}}, align 4
  ; REDUCED: store i32 43, i32* %imm, align 4

  ; INTERESTING: store i32 44, i32* {{(%imm\|%indirect\|%phi)}}, align 4
  ; REDUCED: store i32 44, i32* %phi, align 4

  ; INTERESTING: store i32 45, i32* {{(%imm\|%indirect\|%phi\|%val)}}, align 4
  ; REDUCED: store i32 45, i32* %val, align 4
[48 lines not shown]

llvm/tools/llvm-reduce/CMakeLists.txt

  set(LLVM_LINK_COMPONENTS
    AllTargetsAsmParsers
    AllTargetsCodeGens
    AllTargetsDescs
    AllTargetsInfos
+   BitReader
    BitWriter
    CodeGen
    Core
    IRReader
    MC
    MIRParser
    Support
    Target
[32 lines not shown]

llvm/tools/llvm-reduce/deltas/Delta.cpp

[9 lines not shown]
  // it splits a given set of Targets (i.e. Functions, Instructions, BBs, etc.)
  // into chunks and tries to reduce the number chunks that are interesting.
  //
  //===----------------------------------------------------------------------===//

  #include "Delta.h"
  #include "ReducerWorkItem.h"
  #include "llvm/ADT/STLExtras.h"
+ #include "llvm/Bitcode/BitcodeReader.h"
  #include "llvm/Bitcode/BitcodeWriter.h"
  #include "llvm/IR/Verifier.h"
  #include "llvm/Support/CommandLine.h"
+ #include "llvm/Support/ThreadPool.h"
  #include "llvm/Support/ToolOutputFile.h"
  #include <fstream>
  #include <set>

  using namespace llvm;

  static cl::opt<bool> AbortOnInvalidReduction(
      "abort-on-invalid-reduction",
      cl::desc("Abort if any reduction results in invalid IR"));

  static cl::opt<unsigned int> StartingGranularityLevel(
      "starting-granularity-level",
      cl::desc("Number of times to divide chunks prior to first test"));

  static cl::opt<bool> TmpFilesAsBitcode(
      "write-tmp-files-as-bitcode",
      cl::desc("Write temporary files as bitcode, instead of textual IR"),
      cl::init(false));

+ #ifdef LLVM_ENABLE_THREADS
+ static cl::opt<unsigned> NumJobs(
+     "j",
+     cl::desc("Maximum number of threads to use to process chunks. Set to 1 to "
+              "disables parallelism."),
+     cl::init(1));
+ #else
+ unsigned NumJobs = 1;
+ #endif

  void writeOutput(ReducerWorkItem &M, llvm::StringRef Message);

  bool isReduced(ReducerWorkItem &M, TestRunner &Test,
                 SmallString<128> &CurrentFilepath) {
    // Write ReducerWorkItem to tmp file
    int FD;
    std::error_code EC = sys::fs::createTemporaryFile(
        "llvm-reduce", M.isMIR() ? "mir" : (TmpFilesAsBitcode ? "bc" : "ll"), FD,
[67 lines not shown; context: if (SplitOne) {]
    }
    return SplitOne;
  }

  // Check if \p ChunkToCheckForUninterestingness is interesting. Returns the
  // modified module if the chunk resulted in a reduction.
  template <typename T>
  static std::unique_ptr<ReducerWorkItem>
- CheckChunk(Chunk &ChunkToCheckForUninterestingness, TestRunner &Test,
+ CheckChunk(Chunk &ChunkToCheckForUninterestingness,
+            std::unique_ptr<ReducerWorkItem> Clone, TestRunner &Test,
             function_ref<void(Oracle &, T &)> ExtractChunksFromModule,
             std::set<Chunk> &UninterestingChunks,
             std::vector<Chunk> &ChunksStillConsideredInteresting) {
    // Take all of ChunksStillConsideredInteresting chunks, except those we've
    // already deemed uninteresting (UninterestingChunks) but didn't remove
    // from ChunksStillConsideredInteresting yet, and additionally ignore
    // ChunkToCheckForUninterestingness chunk.
    std::vector<Chunk> CurrentChunks;
    CurrentChunks.reserve(ChunksStillConsideredInteresting.size() -
                          UninterestingChunks.size() - 1);
    copy_if(ChunksStillConsideredInteresting, std::back_inserter(CurrentChunks),
            [&](const Chunk &C) {
              return !UninterestingChunks.count(C) &&
                     C != ChunkToCheckForUninterestingness;
            });
-   // Clone module before hacking it up..
-   std::unique_ptr<ReducerWorkItem> Clone =
-       cloneReducerWorkItem(Test.getProgram());
    // Generate Module with only Targets inside Current Chunks
    Oracle O(CurrentChunks);
    ExtractChunksFromModule(O, *Clone);
    // Some reductions may result in invalid IR. Skip such reductions.
    if (verifyReducerWorkItem(*Clone, &errs())) {
      if (AbortOnInvalidReduction) {
        errs() << "Invalid reduction\n";
[13 lines not shown; context: CheckChunk(Chunk &ChunkToCheckForUninterestingness,]
    if (!isReduced(*Clone, Test, CurrentFilepath)) {
      // Program became non-reduced, so this chunk appears to be interesting.
      errs() << "\n";
      return nullptr;
    }
    return Clone;
  }

+ template <typename T>
+ SmallString<0> ProcessChunkFromSerializedBitcode(
+     Chunk &ChunkToCheckForUninterestingness, TestRunner &Test,
+     function_ref<void(Oracle &, T &)> ExtractChunksFromModule,
+     std::set<Chunk> &UninterestingChunks,
+     std::vector<Chunk> &ChunksStillConsideredInteresting,
+     SmallString<0> &OriginalBC, std::atomic<bool> &AnyReduced) {
+   LLVMContext Ctx;
+   Expected<std::unique_ptr<Module>> MOrErr = parseBitcodeFile(
+       MemoryBufferRef(StringRef(OriginalBC.data(), OriginalBC.size()),
+                       "<llvm-reduce tmp module>"),
+       Ctx);
+   if (!MOrErr)
+     report_fatal_error("Failed to read bitcode");
+   auto CloneMMM = std::make_unique<ReducerWorkItem>();
+   CloneMMM->M = std::move(MOrErr.get());
+   SmallString<0> Result;
+   if (std::unique_ptr<ReducerWorkItem> ChunkResult =
+           CheckChunk(ChunkToCheckForUninterestingness, std::move(CloneMMM),
+                      Test, ExtractChunksFromModule, UninterestingChunks,
+                      ChunksStillConsideredInteresting)) {
+     raw_svector_ostream BCOS(Result);
+     WriteBitcodeToFile(*ChunkResult->M, BCOS);
+     // Communicate that the task reduced a chunk.
+     AnyReduced = true;
+   }
+   return Result;
+ }

  /// Runs the Delta Debugging algorithm, splits the code into chunks and
  /// reduces the amount of chunks that are considered interesting by the
  /// given test.
  template <typename T>
  void runDeltaPassInt(
      TestRunner &Test,
      function_ref<void(Oracle &, T &)> ExtractChunksFromModule) {
    int Targets;
[22 lines not shown; context: void runDeltaPassInt(]
    std::vector<Chunk> ChunksStillConsideredInteresting = {{0, Targets - 1}};
    std::unique_ptr<ReducerWorkItem> ReducedProgram;
    for (unsigned int Level = 0; Level < StartingGranularityLevel; Level++) {
      increaseGranularity(ChunksStillConsideredInteresting);
    }
+   std::atomic<bool> AnyReduced;
+   std::unique_ptr<ThreadPool> ChunkThreadPoolPtr;
+   if (NumJobs > 1)
+     ChunkThreadPoolPtr =
+         std::make_unique<ThreadPool>(hardware_concurrency(NumJobs));
    bool FoundAtLeastOneNewUninterestingChunkWithCurrentGranularity;
    do {
      FoundAtLeastOneNewUninterestingChunkWithCurrentGranularity = false;
      std::set<Chunk> UninterestingChunks;
-     for (Chunk &ChunkToCheckForUninterestingness :
-          reverse(ChunksStillConsideredInteresting)) {
-       std::unique_ptr<ReducerWorkItem> Result = CheckChunk(
-           ChunkToCheckForUninterestingness, Test, ExtractChunksFromModule,
-           UninterestingChunks, ChunksStillConsideredInteresting);
+     // When running with more than one thread, serialize the original bitcode
+     // to OriginalBC.
+     SmallString<0> OriginalBC;
+     if (NumJobs > 1) {
+       raw_svector_ostream BCOS(OriginalBC);
+       WriteBitcodeToFile(*Test.getProgram().M, BCOS);
+     }
+     std::deque<std::future<SmallString<0>>> TaskQueue;

MeinersburUnsubmitted

Done

[serious] Results is never resized to cover the last NumChunksProcessed, i.e. we get a buffer overflow. Instead, we could use a container that never invalidates its elements, such as std::deque, or use a circular buffer.

Alternatively, you can change TaskQueue to std::deque<std::pair<std::shared_future<void>, ResultTy>> to keep the task and its result in the same queue.


fhahnAuthorUnsubmitted

Done

I think the code should only add new jobs if NumChunkProcessed < Results.size() in one iteration of the main loop, but there might be an error. Adding it to the deque sounds like a great idea though, to make things a bit simpler.

Alternatively we could also try to communicate the result through the Future directly, although that would require some changes to ThreadPool, because currently it just supports returning shared_future<void>.


MeinersburUnsubmitted

Done

Sorry, I did not see the NumChunkProcessed < Results.size() condition. IIUC, this will require the task queue to be cleared every NumJobs * 3 jobs. I did not see that, and it could be avoided using the solutions that I mentioned.

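The suggestion to keep each task and its result in one std::deque entry can be sketched with the standard library alone. This is a minimal illustration, not the code from the patch: `ResultTy`, `TaskSlot`, and `runAll` are hypothetical names, and `std::async` stands in for `llvm::ThreadPool`. It relies on the fact that inserting at the end of a `std::deque` does not invalidate references to existing elements, so a running task can safely write into its own slot while new slots are appended.

```cpp
#include <deque>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type; the patch under review stores serialized bitcode.
using ResultTy = std::string;
// One queue entry holds both the task handle and the slot the task writes to.
using TaskSlot = std::pair<std::shared_future<void>, ResultTy>;

// Runs one task per input and returns the results in submission order.
std::vector<ResultTy> runAll(const std::vector<int> &Inputs) {
  std::deque<TaskSlot> TaskQueue;
  for (int Input : Inputs) {
    // Appending to a deque keeps references to older slots valid, so tasks
    // that are already running can keep writing to their own slot.
    TaskQueue.emplace_back();
    TaskSlot &Slot = TaskQueue.back();
    Slot.first = std::async(std::launch::async, [&Slot, Input] {
                   Slot.second = "chunk-" + std::to_string(Input);
                 }).share();
  }

  std::vector<ResultTy> Results;
  while (!TaskQueue.empty()) {
    // wait() synchronizes with the task, so reading the result is safe.
    TaskQueue.front().first.wait();
    Results.push_back(std::move(TaskQueue.front().second));
    TaskQueue.pop_front();
  }
  return Results;
}
```

Because results live next to their futures, no separately sized results array is needed, which avoids the indexing problem raised in this thread.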

for (auto I = ChunksStillConsideredInteresting.rbegin(),

E = ChunksStillConsideredInteresting.rend();

I != E; ++I) {

std::unique_ptr<ReducerWorkItem> Result = nullptr;

unsigned WorkLeft = std::distance(I, E);

// Run in parallel mode, if the user requested more than one thread and

// there are at least a few chunks to process.

if (NumJobs > 1 && WorkLeft > 1) {

MeinersburUnsubmitted

Done

Why require at least 5 parallel jobs? Synchronization overhead?


fhahnAuthorUnsubmitted

Done

Yes, but a smaller limit may be good as well. In the current version it is just 2 tasks.


unsigned NumInitialTasks = std::min(WorkLeft, unsigned(NumJobs));

MeinersburUnsubmitted

Done

NumTasks is originally assigned MaxChunkThreads * 2 but here it is overwritten again with a different value?


fhahnAuthorUnsubmitted

Done

this is now gone


unsigned NumChunksProcessed = 0;

ThreadPool &ChunkThreadPool = *ChunkThreadPoolPtr;

MeinersburUnsubmitted

Not Done

Trailing underscore usually not used in the LLVM code base.


fhahnAuthorUnsubmitted

Done

Thanks, I adjusted the name in the committed version (ChunkThreadPool and ChunkThreadPoolPtr)


TaskQueue.clear();

AnyReduced = false;

// Queue jobs to process NumInitialTasks chunks in parallel using

// ChunkThreadPool. When the tasks are added to the pool, parse the

MeinersburUnsubmitted

Done

Why not leave the number of tasks up to ThreadPoolStrategy::apply_thread_strategy?

Even if --max-chunk-threads is larger, the current approach is limited to what ThreadPoolStrategy::compute_thread_count() chooses.

[suggestion] Use ThreadPool::async MaxChunkThreads times and wait for the first one in the queue. If that one finishes, add another job using ThreadPool::async until either one successfully reduces or we are out of chunks still considered interesting.


fhahnAuthorUnsubmitted

Done

> Why not leave the number of tasks up to ThreadPoolStrategy::apply_thread_strategy?

Do you know how to do this? It seems like the thread pool constructor expects a strategy to be passed in.

> [suggestion] Use ThreadPool::async MaxChunkThreads times and wait for the first one in the queue. If that one finishes, add another job using ThreadPool::async until either one successfully reduces or we are out of chunks still considered interesting.

That sounds like a good idea, thanks! I updated the code to work along those lines, unless I missed something. I also added a shared variable to indicate whether *any* task already reduced something, to avoid adding new jobs in that case.

This approach is slightly faster than the original patch for reducing llvm/test/tools/llvm-reduce/operands-skip.ll in parallel.


MeinersburUnsubmitted

Done

> That sounds like a good idea, thanks! I updated the code to work along those lines, unless I missed something.

Nice! Thanks!

> > Why not leave the number of tasks up to ThreadPoolStrategy::apply_thread_strategy?

> Do you know how to do this? It seems like the thread pool constructor expects a strategy to be passed in.

I thought then we could override apply_thread_strategy to set either NumThreads or, if no number of threads is specified, call the inherited method. But it's non-virtual, and your approach hardware_concurrency(NumJobs) does about the same thing.

To get the number of effective jobs, we could call compute_thread_count of ThreadPoolStrategy. This still applies when the count is set explicitly with hardware_concurrency(NumJobs), since compute_thread_count caps it by the number of hardware threads.

Currently -j<n> requires the user to specify a number of jobs, but if we want to have a heuristic, we could set UseHyperThreads = false without setting ThreadsRequested. This heuristically determined number of jobs is what we might cap at 32. Since I am not sure this is a good heuristic, we may keep requiring the user to specify the number of jobs.

I'd remove the +2 for NumInitialTasks as by hardware_concurrency(NumJobs), there will not be more threads created for those two additional jobs.


fhahnAuthorUnsubmitted

Done

> Since I am not sure this is a good heuristic, we may keep requiring the user to specify the number of jobs.

I think for now requiring a user-specified value is the easiest.

> I'd remove the +2 for NumInitialTasks as by hardware_concurrency(NumJobs), there will not be more threads created for those two additional jobs.

The main intention was to have a few more jobs to fill threads when jobs exit earlier, but I think it could do more harm than good. I removed it.


MeinersburUnsubmitted

Not Done

Agreed (to both)

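The capping behaviour discussed in this thread can be illustrated with a small standard-library analog. This is only a sketch under stated assumptions: `effectiveJobs` and `initialTasks` are hypothetical helpers, `std::thread::hardware_concurrency()` stands in for LLVM's thread strategy, and treating a request of 0 as "not specified" is an assumption of this example.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: cap a user-requested job count by the number of
// hardware threads. A request of 0 is treated as "not specified" (assumption).
unsigned effectiveJobs(unsigned Requested,
                       unsigned HardwareThreads =
                           std::thread::hardware_concurrency()) {
  // hardware_concurrency() may return 0 when the value is not computable.
  if (HardwareThreads == 0)
    HardwareThreads = 1;
  if (Requested == 0)
    return HardwareThreads;
  return std::min(Requested, HardwareThreads);
}

// Hypothetical helper mirroring NumInitialTasks in the diff: never queue more
// initial tasks than there are chunks left or jobs available.
unsigned initialTasks(unsigned WorkLeft, unsigned NumJobs) {
  return std::min(WorkLeft, NumJobs);
}
```

With such a cap in place, queueing extra initial tasks beyond the job count would not create more threads, which matches the reasoning for dropping the +2.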

// original module from OriginalBC with a fresh LLVMContext object. This

// ensures that the cloned module of each task uses an independent

// LLVMContext object. If a task reduces the input, serialize the result

// back in the corresponding Result element.

for (unsigned J = 0; J < NumInitialTasks; ++J) {

TaskQueue.emplace_back(ChunkThreadPool.async(

[J, I, &Test, &ExtractChunksFromModule, &UninterestingChunks,

&ChunksStillConsideredInteresting, &OriginalBC, &AnyReduced]() {

return ProcessChunkFromSerializedBitcode(

*(I + J), Test, ExtractChunksFromModule,

UninterestingChunks, ChunksStillConsideredInteresting,

OriginalBC, AnyReduced);

}));

MeinersburUnsubmitted

Done

CloneMMM->M = std::move(MOrErr.get());

- auto Res = CheckChunk(*(I + J), std::move(CloneMMM));

+ std::unique_ptr<ReducerWorkItem> Res = CheckChunk(*(I + J), std::move(CloneMMM));

if (Res) {

We have Result and Res in the same scope. Better names?


fhahnAuthorUnsubmitted

Done

Thanks, I updated it to ChunkResult.


}

// Start processing results of the queued tasks. We wait for the first

// task in the queue to finish. If it reduced a chunk, we parse the

// result and exit the loop.

// Otherwise we will try to schedule a new task, if

// * no other pending job reduced a chunk and

// * we have not reached the end of the chunk.

while (!TaskQueue.empty()) {

MeinersburUnsubmitted

Done

This waits until the task queue is empty. Could we leave earlier as soon as a reduced version is available, and just forget the jobs not yet finished while scheduling the next ones?


fhahnAuthorUnsubmitted

Done

The current approach waits for each task, until it finds the first one that leads to a reduction.

It intentionally does not try to stop once any job reduces a chunk, because if we pick any reduced chunk we might get different results than during serial reduction.


MeinersburUnsubmitted

Done

Assume the current content of TaskQueue

0: job finished, no valid reduction
1: job still running
2: job finished, valid reduction result
3: job still running

I understood that the intention of AnyReduced is to not queue additional jobs (assuming NumJobs > 4) since we know that when hitting 2, we will have a reduced result. We still need to wait for job 1 which may also compute a valid reduction.
I think this is clever. I had std::future<ResultTy> in mind as you proposed yourself, until I discovered that llvm::ThreadPool only supports a void return type :-(. However, to avoid submitting more jobs after we know that there will be a result available, I think AnyReduced would still be a useful flag.

My comment was regarding index 3 in the queue. Assuming job 1 finishes and job 2 becomes the front of the queue, do we still need to wait for job 3 before submitting jobs based on the new intermediate result? It may cause issues with who becomes responsible for freeing resources, so I am not sure it's feasible.


fhahnAuthorUnsubmitted

Done

I updated the code to communicate the result via the shared_future. This depends on D114183 now.

> My comment was regarding index 3 in the queue. Assuming job 1 finishes and job 2 becomes the front of the queue, do we still need to wait for job 3 before submitting jobs based on the new intermediate result? It may cause issues with who becomes responsible to free resources, so I am not sure it's feasible.

Oh I see, you are thinking about submitting jobs already with the updated state? At the moment I am not really sure what kind of issues that would bring and if it is feasible. I think it would be good to try to keep the initial version as simple as possible.


MeinersburUnsubmitted

Not Done

Agreed. A TODO in the code could mention this

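The scheduling scheme discussed in this thread (wait for tasks in submission order, take the first success, and stop refilling the queue once any task has succeeded) can be sketched as follows. This is a simplified model, not the committed code: `firstReducible` and `IsReducible` are hypothetical names, `std::async` replaces `llvm::ThreadPool`, chunks are plain integers, and C++17 is assumed for `std::optional`.

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <functional>
#include <future>
#include <optional>
#include <vector>

// Returns the index of the first chunk, in submission order, for which
// IsReducible succeeds, matching what a serial scan would pick.
// IsReducible is called from several threads and must be thread-safe.
std::optional<std::size_t>
firstReducible(const std::vector<int> &Chunks,
               const std::function<bool(int)> &IsReducible,
               std::size_t NumJobs) {
  if (NumJobs == 0)
    NumJobs = 1;
  std::atomic<bool> AnyReduced{false};
  std::deque<std::future<bool>> TaskQueue;
  std::size_t NextToSchedule = 0;
  std::size_t NumChunksProcessed = 0;

  auto Schedule = [&] {
    int Chunk = Chunks[NextToSchedule++];
    TaskQueue.push_back(
        std::async(std::launch::async, [&AnyReduced, &IsReducible, Chunk] {
          bool Reduced = IsReducible(Chunk);
          if (Reduced)
            AnyReduced = true; // tells the scheduler to stop adding tasks
          return Reduced;
        }));
  };

  // Queue the initial batch of tasks.
  while (NextToSchedule < Chunks.size() && TaskQueue.size() < NumJobs)
    Schedule();

  while (!TaskQueue.empty()) {
    // Always wait for the oldest task so results are consumed in order.
    bool Reduced = TaskQueue.front().get();
    TaskQueue.pop_front();
    std::size_t Index = NumChunksProcessed++;
    if (Reduced)
      return Index;
    // Refill only if no pending task has already found a reduction.
    if (!AnyReduced && NextToSchedule < Chunks.size())
      Schedule();
  }
  return std::nullopt;
}
```

Tasks still pending when the function returns are joined by the destructors of their `std::async` futures and their results are dropped. Submitting new work based on the reduced state before those tasks finish is the open question from this thread and is not attempted here.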

auto &Future = TaskQueue.front();

Future.wait();

NumChunksProcessed++;

SmallString<0> Res = Future.get();

TaskQueue.pop_front();

if (Res.empty()) {

unsigned NumScheduledTasks = NumChunksProcessed + TaskQueue.size();

if (!AnyReduced && I + NumScheduledTasks != E) {

Chunk &ChunkToCheck = *(I + NumScheduledTasks);

TaskQueue.emplace_back(ChunkThreadPool.async(

MeinersburUnsubmitted

Done

For just a bool, maybe use std::atomic<bool> instead?


fhahnAuthorUnsubmitted

Done

Done, thanks!


[&Test, &ExtractChunksFromModule, &UninterestingChunks,

&ChunksStillConsideredInteresting, &OriginalBC,

&ChunkToCheck, &AnyReduced]() {

return ProcessChunkFromSerializedBitcode(

ChunkToCheck, Test, ExtractChunksFromModule,

UninterestingChunks, ChunksStillConsideredInteresting,

OriginalBC, AnyReduced);

}));

}

MeinersburUnsubmitted

Done

There are two places where I is incremented and far apart. Suggestion:

for (auto I = ChunksStillConsideredInteresting.rbegin(),
              E = ChunksStillConsideredInteresting.rend();
         I != E; ) {
  bool UseThreadPool = MaxChunkThreads > 1 && WorkLeft > 4;
  int WorkItemsInThisJob = UseThreadPool ? WorkLeft : 1;
  ...
  I += WorkItemsInThisJob;


fhahnAuthorUnsubmitted

Done

Thanks, I added a new NumChunksProcessed, which is set to 1 initially and then updated in the loop where we process the parallel results.


MeinersburUnsubmitted

Not Done

The updated patch still increments I in the for-statement parens as well as in the body. As a reader of the code, I find this very unexpected. Since with the task queue the while loop will always process all work items (or break on finding a reduction), there is no advantage in having them in the same loop anymore. Updated suggestion:

if (NumJobs > 1 && ChunksStillConsideredInteresting.size() > 1) {
  auto I = ChunksStillConsideredInteresting.begin();
  bool AnyReduced = false;
  do {
    if (!TaskQueue.empty()) {
      auto TaskAndResult = TaskQueue.pop_front_val();
      TaskAndResult.first.wait();
      if (TaskAndResult.second)
        return TaskAndResult.second; // Found a result
    }
    while (!AnyReduced && TaskQueue.size() < NumJobs && I != ChunksStillConsideredInteresting.end()) {
      TaskQueue.push_back(ChunkThreadPool.async(...));
       ++I;
    }
  } while(!TaskQueue.empty());
} else {
  for (Chunk &ChunkToCheckForUninterestingness : reverse(ChunksStillConsideredInteresting))
      std::unique_ptr<ReducerWorkItem> Result = CheckChunk(...);
}

This also uses the same code for adding the initial jobs and re-filling the queue.


fhahnAuthorUnsubmitted

Done

> This also uses the same code for adding the initial jobs and re-filling the queue.

Ah, thanks for elaborating! I have not updated the code so far, because I am not yet sure how to best include the processing of the found result. When splitting it into 2 separate loops, I am not sure how to best continue processing the next chunk after the reduced one.

I could merge the 2 loops in the current implementation though if you think it's easier to read (the one that schedules the initial jobs and then processes the queue)

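The reviewer's first suggestion, advancing the iterator in a single place by the number of items handled in that step, can be sketched in isolation. This is an illustrative example only: `visitInBatches` and `BatchSize` are hypothetical, and the "work" is just recording the visiting order of a reverse traversal.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Walks Items from back to front in batches of up to BatchSize elements and
// records the order in which elements are visited.
std::vector<int> visitInBatches(const std::vector<int> &Items,
                                std::ptrdiff_t BatchSize) {
  if (BatchSize < 1)
    BatchSize = 1;
  std::vector<int> Order;
  // No increment in the for-statement: I advances only at the end of the body.
  for (auto I = Items.rbegin(), E = Items.rend(); I != E;) {
    std::ptrdiff_t WorkLeft = std::distance(I, E);
    std::ptrdiff_t WorkItemsInThisStep = std::min(WorkLeft, BatchSize);
    for (std::ptrdiff_t J = 0; J < WorkItemsInThisStep; ++J)
      Order.push_back(*(I + J));
    I += WorkItemsInThisStep; // the single place where I is advanced
  }
  return Order;
}
```

Keeping one increment site means the batch size can differ between a parallel and a serial step without the reader having to reconcile two separate updates of `I`.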

continue;

}

Expected<std::unique_ptr<Module>> MOrErr = parseBitcodeFile(

MemoryBufferRef(StringRef(Res.data(), Res.size()),

"<llvm-reduce tmp module>"),

Test.getProgram().M->getContext());

if (!MOrErr)

report_fatal_error("Failed to read bitcode");

Result = std::make_unique<ReducerWorkItem>();

Result->M = std::move(MOrErr.get());

break;

}

// Forward I to the last chunk processed in parallel.

I += NumChunksProcessed - 1;

} else {

Result = CheckChunk(*I, cloneReducerWorkItem(Test.getProgram()), Test,

ExtractChunksFromModule, UninterestingChunks,

ChunksStillConsideredInteresting);

}

if (!Result)

continue;

Chunk &ChunkToCheckForUninterestingness = *I;

FoundAtLeastOneNewUninterestingChunkWithCurrentGranularity = true;

UninterestingChunks.insert(ChunkToCheckForUninterestingness);

ReducedProgram = std::move(Result);

errs() << " **** SUCCESS | lines: " << getLines(CurrentFilepath) << "\n";

writeOutput(*ReducedProgram, "Saved new best reduction to ");

}

// Delete uninteresting chunks

erase_if(ChunksStillConsideredInteresting,

Show All 24 Lines


Revision Contents

Diff 389424
