This is an archive of the discontinued LLVM Phabricator instance.

DWP multithreading
Needs ReviewPublic

Authored by zhuna8616 on Jun 5 2023, 7:07 AM.

Details

Reviewers
ayermolo
dblaikie
Summary

This patch is intended to add multithreading into dwp packaging. Most of the multithreading happens when input files are parsed and str & str_offset sections are merged. Command line options are introduced to specify the number of thread in those 2 stages.

Diff Detail

Event Timeline

zhuna8616 created this revision.Jun 5 2023, 7:07 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 5 2023, 7:07 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
zhuna8616 requested review of this revision.Jun 5 2023, 7:07 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 5 2023, 7:07 AM
zhuna8616 retitled this revision from This patch is intended to add multithreading into dwp packaging. Most of the multithreading happens when input files are parsed and str & str_offset sections are merged. Command line options are introduced to specify the number of thread in those... to DWP multithreading.Jun 5 2023, 7:10 AM
zhuna8616 edited the summary of this revision. (Show Details)
zhuna8616 added reviewers: ayermolo, dblaikie.

Do you have the detail data for this patch to speed up the dwp ? And I think, we need to reuse LLVM code as much as possible instead of add new one.(i.e. ConsumerQueue should be easily implement with llvm ADT).

@dblaikie We are trying to speed up the dwp with multiple threading as it takes too much time. Do you think this is the right direction we want to go ?

Enna1 added a subscriber: Enna1.Jun 5 2023, 11:39 PM

Do you have a profile for this? I think there's probably more low-hanging fruit that'd be good to address before parallelism's the tool I'd reach for. Like currently using MC to write the output adds a lot of overhead of copying all the to-be-written bytes into buffers before writing them out, and the inputs are probably decompressed and cached...

Comparing a profile between llvm-dwp and gold's dwp might give some hints at places to improve/where time's being wasted.

Though I'm not averse to adding multithreading.

If there were ways to share some of lld's multithreading & more efficient reading/writing that'd be great too, though I think last time that was mentioned @MaskRay suggested it'd be better to reimplement things - maybe at least inspired by lld's implementation.

+1 for achieving low-hanging fruits and switching to MCStreamer before adding parallelism. (@dblaikie seems to have a plan)
After these optimizations, the time proportion of different phases may be different enough that warrants a different design of parallelism.
I think this direction is going to have smaller engineering efforts after both MCStreamer migration and parallelism work is done.

I have some reservation that adding parallelism now may increase the long-term engineering efforts.

We investigated the performance of both LLVM & gold 's dwp, and found gold's performance is not as good, specifically 0.6 of llvm-dwp when packing clang's dwp. They share a bottleneck in merging the debug_str.dwo section, which for llvm-dwp is the function writeStringsAndOffsets. This is because maintaining a hash set of strings is slow, and each string requires one lookup at least. Gold uses std::unordered_map, while LLVM uses DenseMap, which is faster than std::unordered_map, thus causes Gold to be slower than LLVM.

We changed LLVM's DenseMap to StringMap, and achieved 5% at the very least, and 17%~20% when packing for clang. This is a low-hanging fruit as @dblaikie put it. I believe we can at least make this improvement.

With all of above being said, adding multithreading for the merging of the debug_str.dwo section at least would give the improvement of 17%~22% on average for our projects, with 16 threads. This is not as messy as the patch presently since this patch also added multithreading in other places. If only the code concerning debug_str.dwo is modified, I believe it would be friendly to the long-term engineering efforts.

Furthermore, if we do not deduplicate the string table produced by each worker thread, after they are done with their assigned files, the improvement in performance can be 59%~190% as observed on our projects. The size of the produced DWP file increases less than 9% with respect to the file produced by the original implementation of LLVM.

So we propose 3 options of changes:

  • Change DenseMap to StringMap.
  • Add multithreading for the merging of debug_str.dwo.
  • Add multithreading for the merging of debug_str.dwo and a command line option controlling whether the threads deduplicate the string table.

We investigated the performance of both LLVM & gold 's dwp, and found gold's performance is not as good, specifically 0.6 of llvm-dwp when packing clang's dwp. They share a bottleneck in merging the debug_str.dwo section, which for llvm-dwp is the function writeStringsAndOffsets. This is because maintaining a hash set of strings is slow, and each string requires one lookup at least. Gold uses std::unordered_map, while LLVM uses DenseMap, which is faster than std::unordered_map, thus causes Gold to be slower than LLVM.

Huh, fascinating. How's the memory usage? (I'd expect the memory usage would be way higher, and the extra copying into/out of buffers would add up to cost more than the savings in string map lookups - but I haven't done detailed profiles, admittedly)

We changed LLVM's DenseMap to StringMap, and achieved 5% at the very least, and 17%~20% when packing for clang. This is a low-hanging fruit as @dblaikie put it. I believe we can at least make this improvement.

Yep, that sounds like a freebie - please send that as a separate review?

With all of above being said, adding multithreading for the merging of the debug_str.dwo section at least would give the improvement of 17%~22% on average for our projects, with 16 threads. This is not as messy as the patch presently since this patch also added multithreading in other places. If only the code concerning debug_str.dwo is modified, I believe it would be friendly to the long-term engineering efforts.

Furthermore, if we do not deduplicate the string table produced by each worker thread, after they are done with their assigned files, the improvement in performance can be 59%~190% as observed on our projects. The size of the produced DWP file increases less than 9% with respect to the file produced by the original implementation of LLVM.

So we propose 3 options of changes:

  • Change DenseMap to StringMap.
  • Add multithreading for the merging of debug_str.dwo.
  • Add multithreading for the merging of debug_str.dwo and a command line option controlling whether the threads deduplicate the string table.

My concern with multithreaded string merging is determinism, I think? It's important that we produce the same output bits given the same inputs - are the multithreaded string merging approaches you have in mind still deterministic?

It'd be great if we could reuse some of lld's string merging support since they've already thought about these sort of issues & I believe figured out ways to do it deterministically and fast/multithreaded.

I posted a new review for using StringMap. https://reviews.llvm.org/D154341

We use a priority queue to maintain the same order in multithreading as in original implementation. I will look into lld presently.

MTC added a subscriber: MTC.Jul 31 2023, 10:55 PM