Remove TODO and implement parallel_for_each.
Please send code reviews to ruiu instead of rui314 (I wrote about that before).
Did you run any benchmark with this patch?
I tried self-hosting lld and self-hosting clang; measuring the link time over 10 runs, the patch saves about half a second. To make this happen, though, I changed minParallelSize to 512 for the parallel_for_each function.
I tried this patch as published here and didn't notice any benefit. It does, however, enable further effective optimisation, so I'm in favour of seeing it get in. I haven't reviewed it carefully yet, but I plan to. See the llvm-devel LLD performance thread for details.
I agree with David's comment; it does not show any considerable improvement.
I happened to get an improved number (a 0.5 second improvement) only once:

```
19.55user 0.78system 0:11.83elapsed 171%CPU (0avgtext+0avgdata 7037040maxresident)k
```

Without this patch I usually get times like this:

```
20.06user 0.74system 0:12.55elapsed 165%CPU (0avgtext+0avgdata 7042880maxresident)k
```
This patch gives us the benefit of having a version of parallel_for_each on Unix platforms without degrading performance.
Updating the code sent for review with my latest changes: there is now a separate constant for the parallel_for_each functionality.
When we are optimizing something, we generally don't care about one particular result among many. Picking the best number hardly describes the change you made, and it is not a scientific thing to do, because observations always include noise (you could even observe an "improvement" without making any change if you pick the best number among multiple runs). Please keep that in mind when you do optimization.
As to the patch, you split an array into small pieces of 512 elements each. If we have, say, 1 million items, we create about 2000 tasks. We of course cannot execute 2000 tasks in parallel -- the maximum parallelism is equal to the number of cores, which is std::thread::hardware_concurrency(). What if you divide the work by hardware_concurrency instead? Each task becomes larger (the size of the array divided by the number of cores), so it needs less synchronization, which is good. But if the task load is not evenly distributed, we have to wait for slow workers, which might have a negative impact. This may be worth testing.
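To make the trade-off concrete, here is a minimal, self-contained sketch of the chunking scheme under discussion. The helper name for_each_chunked and the use of raw std::thread (instead of lld's TaskGroup) are illustrative assumptions, not code from this patch:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <thread>
#include <vector>

// Sketch: run func over [begin, end) in chunks of chunkSize elements.
// With chunkSize = 512, a 1M-element array yields ~2000 tasks; with
// chunkSize = len / hardware_concurrency(), it yields one task per core.
template <class Iterator, class Func>
void for_each_chunked(Iterator begin, Iterator end, std::ptrdiff_t chunkSize,
                      Func func) {
  std::vector<std::thread> workers;
  while (std::distance(begin, end) > chunkSize) {
    Iterator mid = begin + chunkSize; // requires random-access iterators
    workers.emplace_back([=] { std::for_each(begin, mid, func); });
    begin = mid;
  }
  std::for_each(begin, end, func); // tail runs on the calling thread
  for (auto &t : workers)
    t.join();
}

// The per-core strategy described above would then be:
//   std::ptrdiff_t len = std::distance(begin, end);
//   unsigned cores = std::max(1u, std::thread::hardware_concurrency());
//   for_each_chunked(begin, end,
//                    std::max<std::ptrdiff_t>(1, len / cores), func);
```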
I agree that dividing the tasks by the hardware concurrency could get optimal performance. If you look at parallel_sort (a few lines above), that code tries to divide tasks by 1024. This file needs to be cleaned up based on your suggestion, but I feel we could do that later, so that others can start trying to optimize various loops based on this patch. I am fine either way, but I think holding out for an optimal solution would make others wait.
Updated the code with Rui's comments.
Below are my notes from this patch:
Your suggestion worked a lot better for simple applications (links that lld previously did not parallelize now run in parallel).
```
[with lld]$ time clang -B`pwd` main.c
warning: ignoring unknown argument for -z: relro

real    0m0.068s
user    0m0.036s
sys     0m0.036s
```

```
[with binutils]$ time clang main.c

real    0m0.425s
user    0m0.036s
sys     0m0.020s
```
FYI, the above is with lld built in debug mode; I can't imagine release mode :)
But for self-hosting lld/clang, performance degraded: it consistently takes 14 seconds to self-host instead of 12. I will post the patch anyway for review and further discussion.
include/lld/Core/Parallel.h:303
This doesn't make sense. This condition is always true for any len > 0 if concurrency > 1. As a result, you created one task for each array element.
The results are very similar to the previous patch that was posted. It regresses performance on bigger links on my box.
No need to use recursion. Using a for-loop like this is much simpler:

```cpp
void parallel_for_each(Iterator begin, Iterator end, Func func) {
  TaskGroup tg;
  int taskSize = std::distance(begin, end) / concurrency;
  while (taskSize <= std::distance(begin, end)) {
    tg.spawn([=, &func, &tg] { std::for_each(begin, begin + taskSize, func); });
    begin += taskSize;
  }
  std::for_each(begin, end, func);
}
```
I was thinking of doing it non-recursively too, but I thought it's easier to experiment with, or change, the way tasks get spawned (by changing a few numbers) with a recursive version, no?
The other problem is that this approach is not helping and makes the link times longer when building clang/self-hosting. Should we move back to the previous implementation?
include/lld/Core/Parallel.h:322
Does this compile without std::?

include/lld/Core/Parallel.h:325
This expression doesn't make sense either. You divide the length of the array by minSize, which gives the number of minSize chunks in the array; that is not a task size. Let's keep it simple as a starter: just set taskSize to 1024 unconditionally. That's probably large enough. You may still have to run a test with that number, though.

include/lld/Core/Parallel.h:326
This doesn't seem to compile with std::.
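For illustration, the "set 1024 unconditionally" suggestion amounts to calling the hypothetical for_each_chunked sketch above with a fixed chunk size (again an assumption for self-containedness, not the patch's actual code):

```cpp
// Per the review suggestion: a fixed task size, no division by minSize.
// Elements past the last full chunk run inline via the trailing for_each.
const std::ptrdiff_t taskSize = 1024;
for_each_chunked(begin, end, taskSize, func);
```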
include/lld/Core/Parallel.h:325
Yes we can. The last std::for_each outside the while loop is for that.
include/lld/Core/Parallel.h:324
Just set 1024 unconditionally.
include/lld/Core/Parallel.h:324
Thanks for being so patient. Sorry it took so much time.
LGTM with this fix. Thank you for your patience. Let's give it a shot to see how it works.
include/lld/Core/Parallel.h:321
I think int is enough, but if you choose another type, it should be size_t and not int64_t.

include/lld/Core/Parallel.h:321
Correction: I think int is enough, but if you choose another type, it should be ptrdiff_t instead of int64_t. (But, well, int should suffice.)
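For context on the type question, a short illustration (assuming taskSize is derived from std::distance, as in the sketches above):

```cpp
// std::distance returns the iterator's difference_type, a signed type
// (std::ptrdiff_t for pointers), so ptrdiff_t matches it without any
// conversion; size_t would mix signed and unsigned arithmetic, and
// int64_t is wider than the platform needs. Plain int also suffices
// in practice for realistic task counts.
std::ptrdiff_t taskSize = std::distance(begin, end) / concurrency;
```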
include/lld/Core/Parallel.h:321
Sure, changed and will commit soon.