Page MenuHomePhabricator

[llvm-reduce] Add parallel chunk processing.
ClosedPublic

Authored by fhahn on Sun, Nov 14, 11:17 AM.

Details

Summary

This patch adds parallel processing of chunks. When reducing very large
inputs, e.g. functions with 500k basic blocks, processing chunks in
parallel can significantly speed up the reduction.

To allow modifying clones of the original module in parallel, each clone
needs their own LLVMContext object. To achieve this, each job parses the
input module with their own LLVMContext. In case a job successfully
reduced the input, it serializes the result module as bitcode into a
result array.

To ensure parallel reduction produces the same results as serial
reduction, only the first successfully reduced result is used, and
results of other successful jobs are dropped. Processing resumes after
the chunk that was successfully reduced.

The number of threads to use can be configured using the -max-chunk-threads
option. It defaults to 1, which means serial processing.

Diff Detail

Event Timeline

fhahn requested review of this revision.Sun, Nov 14, 11:17 AM
fhahn created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptSun, Nov 14, 11:17 AM

One issue is that the text output of processing each Chunk in a job batch will get mixed & jumbled together.

Meinersbur retitled this revision from [[llvm-reduce] Add parallel chunk processing. to [llvm-reduce] Add parallel chunk processing..Mon, Nov 15, 10:05 AM
Meinersbur edited the summary of this revision. (Show Details)

Fixes spelling in title and summary.

One issue is that the text output of processing each Chunk in a job batch will get mixed & jumbled together.

[suggestion] Task printouts could be redirected to a buffer and printed in-order.

llvm/tools/llvm-reduce/deltas/Delta.cpp
39–40

Why stop me from using 128 jobs on a 128 core machine?

Could warn about diminishing returns because or reduced results being discarded, i.e. explointing SMT not that useful.

[suggestion] Use -j (for "jobs") shortcut for consistency with build tools such as ninja and make.

279

Why require at least 5 parallel jobs? Synchronization overhead?

280

NumTasks is originally assigned MaxChunkThreads * 2 but here it is overwritten again with a different value?

288

Why not leave the number of tasks up to ThreadPoolStrategy::apply_thread_strategy?

Even if --max-chunk-threads is larger, the current approach is limited to what ThreadPoolStrategy::compute_thread_count() chooses.

[suggestion] Use ThreadPool::async MaxChunkThreads times and wait for the first one in the queue. If that one finishes, add another job using ThreadPool::async until either one successfully reduces or we are out of chunks still considered interesting.

301

We have Result and Res in the same scope. Better names?

330

There are two places where I is incremented and far apart. Suggestion:

for (auto I = ChunksStillConsideredInteresting.rbegin(),
              E = ChunksStillConsideredInteresting.rend();
         I != E; ) {
  bool UseThreadPool = MaxChunkThreads > 1 && WorkLeft > 4;
  int WorkItemsInThisJob = UseThreadPool ? WorkLeft : 1;
  ...
  I += WorkItemsInThisJob;
fhahn updated this revision to Diff 387897.Wed, Nov 17, 4:16 AM
fhahn marked 2 inline comments as done.

Adjust task handling: instead of queuing 2 * NumJobs task up front and waiting for all of them to complete, first queue a number closer to the available jobs (NumJobs + 2 at the moment, so there's a bit of extra work in case a few jobs finish early).

The patch now uses a queue to keep track of all futures for the scheduled tasks. After scheduling the initial set of tasks, we wait for the first task in the queue to complete. After it completes, we schedule anohter tasks (if we can and there's no other tasks that successfully reduced a chunk). We continue waiting for tasks in the queue, until we reach a job that reduced a chunk or we are out of jobs.

fhahn marked an inline comment as done.Wed, Nov 17, 4:26 AM
fhahn added inline comments.
llvm/tools/llvm-reduce/deltas/Delta.cpp
39–40

Why stop me from using 128 jobs on a 128 core machine?

My original intention was to avoid wasting resources in cases where we run a lot of parallel tasks, but only the first job already reduced the chunk.

I adjusted the task management now to schedule a number of initial tasks closer to the number of threads and then queue new jobs as you suggested. So maybe it is less of an issue now and the restriction can be dropped.

[suggestion] Use -j (for "jobs") shortcut for consistency with build tools such as ninja and make.

Done!

279

Yes, but a smaller limit may be good as well. In the current version it is just 2 tasks.

280

this is now gone

288

Why not leave the number of tasks up to ThreadPoolStrategy::apply_thread_strategy?

Do you know how to do this? It seems like the thread pool constructor expects a strategy to be passed in.

[suggestion] Use ThreadPool::async MaxChunkThreads times and wait for the first one in the queue. If that one finishes, add another job using ThreadPool::async until either one successfully reduces or we are out of chunks still considered interesting.

That sounds like a good idea, thanks! I updated the code to work along those lines, unless I missed something. I also added a shared variable to indicate whether *any* task already reduced something, to avoid adding new jobs in that case.

This approach is slightly faster than the original patch for reducing llvm/test/tools/llvm-reduce/operands-skip.ll in parallel.

301

Thanks, I updated it to ChunkResult.

330

Thanks, I added a new NumChunksProcessed, which is set to 1 initially and then in the loop where we process the parallel results.

Meinersbur added inline comments.Wed, Nov 17, 10:55 AM
llvm/tools/llvm-reduce/deltas/Delta.cpp
39–40

I still think it's the user's right to shoot themselves into the foot if they want to (it would be different if the number of jobs is determined by a heuristic). A maximum could be suggested in the description/documentation also explaining why. A hard limit blocks legitimate use cases.

42–46

Remove the option unless LLVM is compiled with LLVM_ENABLE_THREADS? We will not get any parallelism otherwise.

270

[serious] Results is never resized to cover the last NumChunksProcessed, i.e. we get a buffer overflow. Instead, we could use a container that never invalidates its elements such as std:deque or use a circular buffer.

Alternatively, you can change TaskQueue to std::dequeue<std::pair<std::shared_future<void>, ResultTy>> to the task and its result in the same queue.

288

That sounds like a good idea, thanks! I updated the code to work along those lines, unless I missed something.

Nice! Thanks!

Why not leave the number of tasks up to ThreadPoolStrategy::apply_thread_strategy?

Do you know how to do this? It seems like the thread pool constructor expects a strategy to be passed in.

I thought then we could override apply_thread_strategy to set either NumThreads or, if not number of threads is specified, call the inherited methods. But it's not-virtual and your approach hardware_concurrency(NumJobs) does about the same thing.

To get the number of effective jobs, we could call compute_thread_count of ThreadPoolStrategy. This still applies if explicit with hardware_concurrency(NumJobs) since compute_thread_count caps it by the number of hardware threads.

Currently -j<n> requires the user to specify a number of jobs, but if we want have a heuristic, we could set UseHyperThreads = false without setting ThreadsRequested. This heuristically-determined number of jobs is what we might cap at 32. Since I am not sure this is a good heuristic, we may keep requiring the user to specify the number of jobs.

I'd remove the +2 for NumInitialTasks as by hardware_concurrency(NumJobs), there will not be more threads created for those two additional jobs.

310

This waits until the task queue is empty. Could we leave earlier as soon as a reduced version is available, and just forget the jobs not yet finished while scheduling the next ones?

321

For just a bool, maybe use std::atomic<bool> instead?

330

The updated patch still increments I in the for-statement parens as will as in the body. As reader of the code, I find this very unexpected. Since with the task queue, the while loop will always process all work items (or break on finding a reduction), there is not advantage in having them in the same loop anymore. Updated suggestion:

if (NumJobs > 1 && ChunksStillConsideredInteresting.size() > 1) {
  auto I = ChunksStillConsideredInteresting.begin();
  bool AnyReduced = false;
  do {
    if (!TaskQueue.empty()) {
      auto TaskAndResult = TaskQueue.pop_front_val();
      TaskAndResult.first.wait();
      if (TaskAndResult.second)
        return TaskAndResult.second; // Found a result
    }
    while (!AnyReduced && TaskQueue.size() < NumJobs && I != ChunksStillConsideredInteresting.end()) {
      TaskQueue.push_back(ChunkThreadPool.async(...));
       ++I;
    }
  } while(!TaskQueue.empty());
} else {
  for (Chunk &ChunkToCheckForUninterestingness : reverse(ChunksStillConsideredInteresting))
      std::unique_ptr<ReducerWorkItem> Result = CheckChunk(...);
}

This also uses the same code for adding the initial jobs and re-filling the queue.

fhahn marked an inline comment as done.Wed, Nov 17, 12:39 PM
fhahn added inline comments.
llvm/tools/llvm-reduce/deltas/Delta.cpp
270

I think the code should only add new jobs if NumChunkProcessed < Results.size() in one iteration of the main loop, but there might be a an error. Adding it the to dequeue sounds like a great idea though to make things a bit simpler.

Alternatively we could also try to communicate the result through the Future directly, although that would require some changes to ThreadPool, because currently it just supports returning shared_future<void>.

310

The current approach waits for each task, until it finds the first one that leads to a reduction.

It intentionally does not try to stop once any job reduces a chunk, because if we pick any reduced chunk we might get different results than during serial reduction.

I am writing lots of suggestion for something you may have intended to be a simple addition that could be improved later just as well. If this applies, please tell me.

llvm/tools/llvm-reduce/deltas/Delta.cpp
270

Sorry, I did not see the NumChunkProcessed < Results.size() condition. IIUC, this will require the task queue to be cleared every NumJobs * 3 jobs. I did not see that and could be avoided using the solutions that I mentioned.

310

Assume the current content of TaskQueue

0: job finished, no valid reduction
1: job still running
2: job finished, valid reduction result
3: job still running

I understood that the intention of AnyReduced is to not queue additional jobs (assuming NumJobs > 4) since we know that when hitting 2, we will have a reduced result. We still need to wait for job 1 which may also compute a valid reduction.
I think this is clever. I had std::future<ResultTy> in mind as you proposed yourself, until I discovered that llvm::ThreadPool only supports a void return type :-(. However, to avoid submitting more jobs after we know that there will a result available, I think a AnyReduced would still be useful flag.

My comment was regarding index 3 in the queue. Assuming job 1 finishes and job 2 becomes the front of the queue, do we still need to wait for job 3 before submitting jobs based on the new intermediate result? It may cause issues with who becomes responsible to free resources, so I am not sure its feasible.

fhahn updated this revision to Diff 388306.Thu, Nov 18, 1:26 PM
fhahn marked 5 inline comments as done.

I am writing lots of suggestion for something you may have intended to be a simple addition that could be improved later just as well. If this applies, please tell me.

Thanks for taking a close look, it is very much appreciated!

The latest change updates the patch to communicate results via shared_future, requires ThreadPool refactoring from D114183.

fhahn updated this revision to Diff 388315.Thu, Nov 18, 1:49 PM
fhahn marked 6 inline comments as done.

Drop +2 increment for initial number of tasks.

llvm/tools/llvm-reduce/deltas/Delta.cpp
39–40

Sounds good, I dropped the limit.

42–46

Updated! If LLVM_ENABLE_THREADS is not set NumJobs is defined as unsigned with a value of 1.

288

Since I am not sure this is a good heuristic, we may keep requiring the user to specify the number of jobs.

I think for now requiring a user specified value is the easiest.

I'd remove the +2 for NumInitialTasks as by hardware_concurrency(NumJobs), there will not be more threads created for those two additional jobs.

The main intention was to have. a few more jobs to fill threads when jobs exit earlier, but I think it could do more bad than good. I removed it

310

I updated the code to communicate the result via the shared_future. This depends on D114183 now.

My comment was regarding index 3 in the queue. Assuming job 1 finishes and job 2 becomes the front of the queue, do we still need to wait for job 3 before submitting jobs based on the new intermediate result? It may cause issues with who becomes responsible to free resources, so I am not sure its feasible.

Oh I see, you are thinking about submitting jobs already with the updated state? At the moment I am not really sure what kind of issues that would bring and if it is feasible. I think it would be good to try to keep the initial version as simple as possible.

321

Done, thanks!

330
This also uses the same code for adding the initial jobs and re-filling the queue.

Ah, thanks for elaborating! I have not updated the code so far, because I am not yet sure how to best include the processing of the found result. When splitting it into 2 separate loops, I am not sure how to best continue processing the next chunk after the reduced one.

I could merge the 2 loops in the current implementation though if you think it's easier to read (the one that schedules the initial jobs and then processes the queue)

Meinersbur added inline comments.Thu, Nov 18, 3:45 PM
llvm/tools/llvm-reduce/deltas/Delta.cpp
183

[Not a change request] To avoid global variables, did you consider making AnyReduced inside runDeltaPassInt and ProcessChunkFromSerializedBitcode a lambda?

186

[Not a change request] Why prefer SmallString<0> over std::string?

288

Agreed (to both)

310

Agreed. A TODO in the code could mention this

fhahn updated this revision to Diff 388442.Fri, Nov 19, 2:34 AM

Make AnyReduced a local variable, pass to function & lambda.

Also updated to use ThreadPoolWithResult after changes to D114183.

fhahn updated this revision to Diff 388710.Sat, Nov 20, 10:07 AM

rebased on top of latest changes to D114183

fhahn updated this revision to Diff 388877.Mon, Nov 22, 4:23 AM

remove unnecessary std::move from ... = std::move(Future.get());

Meinersbur accepted this revision.Tue, Nov 23, 12:59 AM

LGTM

llvm/tools/llvm-reduce/deltas/Delta.cpp
283

Trailing underscore usually not used in the LLVM code base.

This revision is now accepted and ready to land.Tue, Nov 23, 12:59 AM
fhahn updated this revision to Diff 389141.Tue, Nov 23, 2:53 AM

Rebased on top of a5fff58781f3.

This revision was automatically updated to reflect the committed changes.
fhahn marked an inline comment as done.Wed, Nov 24, 1:25 AM
fhahn added inline comments.
llvm/tools/llvm-reduce/deltas/Delta.cpp
283

Thanks, I adjusted the name in the committed version (ChunkThreadPool and ChunkThreadPoolPtr)