This patch adds parallel processing of chunks. When reducing very large
inputs, e.g. functions with 500k basic blocks, processing chunks in
parallel can significantly speed up the reduction.
To allow modifying clones of the original module in parallel, each clone
needs its own LLVMContext object. To achieve this, each job parses the
input module with its own LLVMContext. If a job successfully reduces the
input, it serializes the reduced module as bitcode into a result array.
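
A minimal sketch of that per-job flow, assuming the LLVM bitcode reader/writer APIs; `tryReduceChunk` and `runJob` are illustrative placeholders, not the functions introduced by this patch:

```cpp
#include "llvm/ADT/SmallString.h"
#include "llvm/Bitcode/BitcodeReader.h"
#include "llvm/Bitcode/BitcodeWriter.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// Placeholder for the delta pass applied to this job's chunk; returns true
// if the module was successfully reduced.
bool tryReduceChunk(Module &M);

// Each job parses the shared input bitcode into its own LLVMContext, so the
// clone can be modified without touching other jobs' modules. On success,
// the reduced module is serialized back to bitcode, which can safely outlive
// the job-local context.
static SmallString<0> runJob(MemoryBufferRef OriginalBC, bool &Reduced) {
  LLVMContext Ctx; // owned exclusively by this job
  SmallString<0> Result;

  Expected<std::unique_ptr<Module>> MOrErr = parseBitcodeFile(OriginalBC, Ctx);
  if (!MOrErr) {
    consumeError(MOrErr.takeError());
    Reduced = false;
    return Result;
  }

  Reduced = tryReduceChunk(**MOrErr);
  if (Reduced) {
    raw_svector_ostream OS(Result);
    WriteBitcodeToFile(**MOrErr, OS); // store the result as bitcode
  }
  return Result;
}
```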
To ensure parallel reduction produces the same results as serial
reduction, only the result of the first successfully reduced chunk is
used; results of later successful jobs are dropped. Processing then
resumes after the chunk that was successfully reduced.
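
A simplified sketch of that driver logic (not the patch's actual implementation), using plain std::async; `reduceChunk`, `JobResult`, and `reduceBatch` are hypothetical names:

```cpp
#include <algorithm>
#include <future>
#include <optional>
#include <utility>
#include <vector>

struct JobResult {
  bool Reduced = false;
  std::vector<char> Bitcode; // reduced module, serialized as bitcode
};

// Hypothetical helper that reduces a single chunk in isolation.
JobResult reduceChunk(unsigned ChunkIdx);

// Launch up to NumThreads jobs for consecutive chunks, then scan the results
// in chunk order. Taking the first successful chunk and dropping later
// successes keeps the outcome identical to serial reduction; the caller
// resumes processing at the chunk after the returned index.
std::optional<std::pair<unsigned, JobResult>>
reduceBatch(unsigned FirstChunk, unsigned NumChunks, unsigned NumThreads) {
  unsigned Count = std::min(NumThreads, NumChunks);
  std::vector<std::future<JobResult>> Jobs;
  for (unsigned I = 0; I != Count; ++I)
    Jobs.push_back(std::async(std::launch::async, reduceChunk, FirstChunk + I));

  for (unsigned I = 0; I != Count; ++I) {
    JobResult R = Jobs[I].get();
    if (R.Reduced)
      // Remaining futures are discarded; their destructors wait for the
      // corresponding jobs to finish.
      return std::make_pair(FirstChunk + I, std::move(R));
  }
  return std::nullopt; // no chunk in this batch could be reduced
}
```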
The number of threads to use can be configured using the -max-chunk-threads
option. It defaults to 1, which means serial processing.
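
A sketch of how such an option is typically declared with LLVM's CommandLine library; the description string here is illustrative:

```cpp
#include "llvm/Support/CommandLine.h"

static llvm::cl::opt<unsigned> MaxChunkThreads(
    "max-chunk-threads",
    llvm::cl::desc("Maximum number of threads to use to process chunks in "
                   "parallel (default = 1, serial processing)"),
    llvm::cl::init(1));
```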
Why stop me from using 128 jobs on a 128-core machine?
Could warn about diminishing returns because of reduced results being discarded, i.e. exploiting SMT is not that useful.
[suggestion] Use a -j (for "jobs") shortcut for consistency with build tools such as ninja and make.
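
An illustrative way to add that shortcut with cl::alias, assuming the MaxChunkThreads option sketched above; not part of the patch as posted:

```cpp
static llvm::cl::alias MaxChunkThreadsJ(
    "j", llvm::cl::desc("Alias for -max-chunk-threads"),
    llvm::cl::aliasopt(MaxChunkThreads));
```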