Many parallel tasks just want to iterate over all the integers from 0 to N-1. Rather than enqueue N work items, just "map" the function across the requested integer space.
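A minimal sketch of what such a map could look like, assuming a shared atomic counter feeding a capped set of worker threads (the name TaskMapOverInt comes up later in this review; the batching strategy here is an assumption, not the patch's exact implementation):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Workers pull indices from a shared atomic counter, so at most
// hardware_concurrency() threads are created instead of enqueuing
// one work item per index.
void TaskMapOverInt(size_t begin, size_t end,
                    const std::function<void(size_t)> &fn) {
  size_t hw_threads = std::max<size_t>(1, std::thread::hardware_concurrency());
  size_t num_workers = std::min(hw_threads, end - begin);
  std::atomic<size_t> next(begin);
  std::vector<std::thread> workers;
  for (size_t i = 0; i < num_workers; ++i)
    workers.emplace_back([&next, &fn, end] {
      for (size_t idx = next++; idx < end; idx = next++)
        fn(idx);
    });
  for (std::thread &worker : workers)
    worker.join();
}
```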
Event Timeline
IMO, this is a simpler interface than TaskRunner. Also, after this, there are no users of TaskRunner left. Should I just delete it?
I made this change to help parallel symbol demangling (to come in a separate patch). Rather than have the symbol demangler deal with batching, etc., I thought it should be a higher-level function.
I think you might not have understood TaskRunner's real benefit. It is smart in the way it runs things: it lets you run N tasks and get each result as soon as it completes. TaskMap will serialize everything. So if you have 100 items to do, and item 0 takes 200 seconds to complete while items 1-99 take 1ms each, you will need to wait for item 0 to complete before you can start parsing the data. This will slow down the DWARF parser if you switch it over to this. TaskRunner should not be deleted, as it has a very specific purpose. Your TaskMap works fine for one case, but not the other.
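For reference, a hedged sketch of the consume-as-completed pattern being described here; the `WaitForNextCompletedTask()` call is quoted later in this review, but the task and consumer names, and the invalid-future termination condition, are illustrative assumptions about the API:

```cpp
// Enqueue N extraction tasks, then process results in completion
// order rather than submission order.
TaskRunner<uint32_t> task_runner;
for (uint32_t cu_idx = 0; cu_idx < num_compile_units; ++cu_idx)
  task_runner.AddTask(ExtractCompileUnit, cu_idx); // assumed task fn

while (true) {
  std::future<uint32_t> f = task_runner.WaitForNextCompletedTask();
  if (!f.valid())
    break; // assumed termination: all tasks consumed
  // Expensive per-item work starts the moment any task finishes,
  // even while the slow 200-second item is still running.
  ProcessExtractedUnit(f.get()); // assumed consumer
}
```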
> I made this change to help parallel symbol demangling (to come in a separate patch). Rather than have the symbol demangler deal with batching, etc., I thought it should be a higher-level function.
Make sure this is a win before switching demangling over to using threads. I tried to improve demangling performance before, and we got worse performance on macOS when we tried it because of all the malloc contention and threading overhead.
source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp

| Line | Comment |
|---|---|
| 1988 | This replacement is ok, since no expensive work was being done in the while loop that called `task_runner_extract.WaitForNextCompletedTask()`. |
| 1994 | Note that we lost performance here. The old code would run `while (true) { std::future<uint32_t> f = task_runner.WaitForNextCompletedTask(); ... }` and do expensive work with any random task as soon as it completed. Your current code says "wait until everything completes, then do all of the expensive work". |
source/Utility/TaskPool.cpp

| Line | Comment |
|---|---|
| 77 | Is this named correctly? Maybe SerializedTaskMap? Or BatchedTaskMap? |
| 100 | TaskRunner is smart in the way it runs things: it lets you run N tasks and get each result as it completes. This implementation, if I read it correctly, will serialize everything. So if you have 100 items to do, and item 0 takes 200 seconds to complete while items 1-99 take 1ms each, you will need to wait for item 0 to complete before you can start parsing the data. This will slow down the DWARF parser if you switch over to this. |
And note that in the DWARF parser you can't put the expensive code that aggregates all of the data into the SymbolFileDWARF inside "parser_fn", because only one thread can be putting data into SymbolFileDWARF at a time.
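To illustrate that constraint with stand-in types (not LLDB's real API): a shared aggregation target admits only one writer at a time, so doing the merge inside the per-task function serializes the workers on its lock.

```cpp
#include <mutex>
#include <vector>

// Stand-in for the shared SymbolFileDWARF-side aggregation.
struct SharedIndex {
  std::mutex mutex;
  std::vector<int> entries;
  void Merge(const std::vector<int> &local_results) {
    std::lock_guard<std::mutex> guard(mutex); // one thread at a time
    entries.insert(entries.end(), local_results.begin(),
                   local_results.end());
  }
};
```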
That may provide a benefit on machines with a few cores, but doesn't really help when you have 40+ cores. Granted, the average laptop has 4 cores/8 hyperthreads.
>> I made this change to help parallel symbol demangling (to come in a separate patch). Rather than have the symbol demangler deal with batching, etc., I thought it should be a higher-level function.

> Make sure this is a win before switching demangling over to using threads. I tried to improve demangling performance before, and we got worse performance on macOS when we tried it because of all the malloc contention and threading overhead.
It is on my machine, but maybe not on other machines. That would be unfortunate. I'm guessing I shouldn't go ahead and add jemalloc/tcmalloc to work around a poor macOS malloc? haha.
I would suggest putting this in LLVM, as something like this:
```cpp
namespace llvm {
template <typename Range, typename Func>
void parallel_apply(Range &&R, Func &&F) {
  // Enqueue items here.
  // Wait for all tasks to complete.
}
} // namespace llvm
```
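For instance, the demangling use case mentioned earlier might then look like this (a hypothetical call site: CollectMangledNames and Demangle are assumed helpers, and parallel_apply is assumed to invoke F on each element of R and return once all items complete):

```cpp
std::vector<std::string> mangled_names = CollectMangledNames(); // assumed helper
llvm::parallel_apply(mangled_names, [](std::string &name) {
  name = Demangle(name); // assumed demangler wrapper
});
```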
It is no different on any machine with any number of cores. TaskRunner will be faster in all cases for the second DWARF loop. It is also nice to be able to consume the information as it comes in, so TaskRunner is needed. I do like the TaskMap idea of making things batchable, so I think they both add value.
>>> I made this change to help parallel symbol demangling (to come in a separate patch). Rather than have the symbol demangler deal with batching, etc., I thought it should be a higher-level function.

>> Make sure this is a win before switching demangling over to using threads. I tried to improve demangling performance before, and we got worse performance on macOS when we tried it because of all the malloc contention and threading overhead.

> It is on my machine, but maybe not on other machines. That would be unfortunate. I'm guessing I shouldn't go ahead and add jemalloc/tcmalloc to work around a poor macOS malloc? haha.
lol. If it does improve things and integrates nicely with all of the malloc tools on macOS, I wouldn't be opposed. I don't know much about jemalloc/tcmalloc, but if there are perf wins to be had that don't hose the malloc zone/block iteration, I am all for it!
I can make more measurements on this.
source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp

| Line | Comment |
|---|---|
| 1994 | The following loop is not expensive; it's simple vector concatenation of fairly simple types. The actual slow work is Finalize(), which calls Sort(). m_function_basename_index is of type NameToDIE, which has a UniqueCStringMap member, which declares its collection as a std::vector. Though arguably the Append should be pushed down into the RunTasks below. |

source/Utility/TaskPool.cpp

| Line | Comment |
|---|---|
| 100 | If items 1-99 complete quickly, there isn't much advantage to handling their output before handling the output of item 0, since item 0 is likely to produce more output than items 1-99 combined. |
IMO the vector append issue doesn't matter, because the very next thing we do is sort. Sorting is more expensive than appending, so the append is noise.
source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp

| Line | Comment |
|---|---|
| 1994 | This takes <0.25s (out of a total of 15 seconds of runtime) when timing lldb starting lldb (RelWithDebInfo build). That's for an aggregate of 14M+ names being added to the vectors. IMO that should not block this change. I also moved the Append into RunTasks, just because we already have those subtasks. |

source/Utility/TaskPool.cpp

| Line | Comment |
|---|---|
| 77 | TaskMapOverInt? |
Not to sound like a broken record, but please try to put this in LLVM instead of LLDB. I suggested a convenient function signature earlier.
I hear you, but IMO it's not ready for that yet.
- It would depend on ThreadPool, but
- LLDB hasn't switched to ThreadPool yet, because
- I want to figure out how to incorporate tasks enqueuing tasks first.
I don't want to commit a monolithic patch with all my changes (and I haven't developed them all yet), so instead I submit incremental improvements.
So I don't see where you moved all of the .Append calls. And if you did your timings with them not in there, then your timings are off.
finalize_fn calls Append on each source first, then calls Finalize. It might be hard to see because it's now a lambda that takes the src vector and dst NameToDIE as parameters.
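Roughly this shape, as a sketch (the exact parameter types in the patch may differ):

```cpp
// Cheap Appends first, then the expensive Finalize (which sorts).
auto finalize_fn = [](std::vector<NameToDIE> &sources, NameToDIE &dst) {
  for (NameToDIE &src : sources)
    dst.Append(src); // simple vector concatenation
  dst.Finalize();    // the slow part: sorts the underlying map
};
```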
I know no one is using TaskRunner, but I would rather not remove it just yet. It has the possibility of being useful: not in this case, but in general. I'd be interested in hearing what Pavel and Tamas think. None of this affects LLDB on Mac, because all Darwin platforms (macOS, iOS, tvOS, watchOS) have Apple accelerator tables emitted by default. If you really want to speed things up, at least when compiling with a new clang, we could change the compiler to emit the Apple accelerator tables on all systems...
Personally, I never really liked TaskRunner (even though I was the one who implemented it) because I think it adds a lot of extra complexity for fairly little gain, and thinking about it a bit more, I think in most use cases a more targeted solution at the call site will probably give better results. Also, if we need it in the future it can always be restored from git/svn history. Based on that, I am happy to delete it if we have no use case for it at the moment.
Regarding Apple accelerator tables, giving it a try could be interesting (pass '-glldb' to a sufficiently new clang), but as far as I know they are not compatible with split DWARF, which can be a show stopper for very large applications (e.g. linking chromium on linux with "no-limit-debug-info" and without split DWARF requires an unreasonably large amount of memory).
-glldb doesn't emit the Apple accelerator tables, IIRC. We don't need to change the DWARF; we just need to modify clang to emit the Apple accelerator tables. They will be just fine with DWO, as each DWO file would get its own set of them. Another way to check whether this works is to modify "llvm-dsymutil --update" to work on ELF files: "llvm-dsymutil --update" regenerates the accelerator tables and puts them into the DWARF of an existing file with debug info. It was made in case we were missing something in the accelerator tables that we added later, so we could update old DWARF files obtained from build servers.
That being said, if no one is going to miss TaskRunner then we can let it go.
@zturner ok to commit? TaskMapOverInt(x, y, fn) maps directly to parallel_for(0, x, fn). Rather than rebundle the change you have for lldb, only for it to be deleted once you get it into llvm, can we just commit this as a stopgap?
It is a step in the right direction as it removes TaskRunner and puts us on an API more likely to end up in LLVM.
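In other words, once such a function lands in LLVM, the lldb side could collapse to a thin forwarder. A hypothetical sketch, assuming a future llvm parallel_for with the signature mentioned above:

```cpp
void TaskMapOverInt(size_t begin, size_t end,
                    const std::function<void(size_t)> &fn) {
  llvm::parallel_for(begin, end, fn); // assumed future LLVM API
}
```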
- Change the API to more closely match parallel_for: (begin, end, fn) instead of (end, batch, fn); see the sketch after this list.
- Fix unit test. I didn't realize I had to run check-lldb-unit separately from dotest (oops).
- Fix formatting via git-clang-format.
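The signature change from the first bullet, side by side (a sketch; parameter names are assumptions):

```cpp
// Before: caller supplies an explicit batch size.
void TaskMapOverInt(size_t end, size_t batch_size,
                    const std::function<void(size_t)> &fn);

// After: matches the parallel_for-style (begin, end, fn).
void TaskMapOverInt(size_t begin, size_t end,
                    const std::function<void(size_t)> &fn);
```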
The last changes are cosmetic (though they look big because I captured a different amount of context), so hopefully this doesn't need a re-review. Can someone push this for me? Thank you!
I can do the pushing. :)
Thanks for the patch.
include/lldb/Utility/TaskPool.h

| Line | Comment |
|---|---|
| 89 | Making this a template would enable you to get rid of the std::function wrapper overhead. I have no idea whether it would make a difference in practice. |
> Making this a template would enable you to get rid of the std::function wrapper overhead. I have no idea whether it would make a difference in practice.
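A minimal sketch of that suggestion, assuming the current declaration takes a std::function:

```cpp
#include <cstddef>
#include <functional>

// Current shape: type-erased, one indirect call per index.
void TaskMapOverInt(size_t begin, size_t end,
                    const std::function<void(size_t)> &func);

// Suggested shape: templating on the callable lets the compiler
// inline the per-index call and avoids std::function's overhead.
template <typename Fn>
void TaskMapOverInt(size_t begin, size_t end, Fn &&func);
```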