⚙ D8348 [lld][Core] Implement parallel_for

shankarke updated this revision to Diff 21997.Mar 15 2015, 11:00 AM

shankarke retitled this revision from to [lld][Core] Implement parallel_for_each.

shankarke updated this object.

shankarke edited the test plan for this revision. (Show Details)

shankarke added reviewers: davide, Bigcheese, rui314.

shankarke added a project: lld.

shankarke added a subscriber: Unknown Object (MLST).

Please send code reviews to ruiu instead of rui314 (I wrote about that before).

Did you run any benchmark with this patch?

I havent run benchmarks as yet with this patch as yet.

I tried to self host lld and self host clang and the time taken to link one in 10 runs, saves the link time by 1/2 a second. I changed the minParallelSize to be 512 for the function parallel_for_each to make this happen though.

I tried this patch as published here and didn't notice any benefits. It, though, enabled further effective optimisation, so I'm in favour of seeing this getting in. I didn't review this carefully yet, but I plan to. See llvm-devel LLD performance thread for details.

I agree with David's comment, it doesnot show up with any considerable improvement.

I happen to get an improved number (0.5 second improvement) only once :-

19.55user 0.78system 0:11.83elapsed 171%CPU (0avgtext+0avgdata 7037040maxresident)k

I usually get time taken as below without this patch :-

20.06user 0.74system 0:12.55elapsed 165%CPU (0avgtext+0avgdata 7042880maxresident)k

This patch gives us a benefit of having a version of parallel_for_each on unix platforms without degrading performance.

Updating the code sent for review, with my latest changes. Have a seperate constant for parallel_for_each functionality.

When we are optimizing something, we generally don't care about one particular result in many results. Picking up the best number hardly describe a change you made and also not scientific thing to do because observation always include noise (you could even observe an "improvement" without making any change if you pick up the best number among multiple runs). Please keep that in mind when you do optimization.

As to the patch, you split an array into small pieces whose size is 512 each. If we have, say, 1 million items, we create 2000 tasks. We of course cannot execute 2000 tasks in parallel -- the maximum parallelism is equal to the number of cores, which is std::thread::hardware_concurrency(). What if you divide tasks by hardware_concurrency? Each task becomes larger (size of an array divided by number of cores), so it needs less synchronization, which is good. But if task load is not evenly distributed, we need to wait for slow workers, which might have negative impact. This may worth testing.

I agree dividing the tasks by the hardware concurrency could get optimal performance. If you see parallel_sort(few lines above) the code tries to divide tasks by 1024. This file needs to be cleaned up based on your suggestion and I feel we could implement that later, just for allowing others to try various loops to be optimized based on this patch. I am fine anyways but I think creating an optimal solution would make others wait.

Update code with comments from Rui,

Below are my notes from this patch :-

Your suggestion worked a lot better for simple applications (previously where lld was not parallelizing would become parallel now).

[with lld] clang -B`pwd` main.c
warning: ignoring unknown argument for -z: relro

real 0m0.068s
user 0m0.036s
sys 0m0.036s

[with binutils]$ time clang main.c

real 0m0.425s
user 0m0.036s
sys 0m0.020s

FYI, the above is in DEBUG mode of lld, just cant imagine release mode :)

But for self hosting lld/clang, the performance got degraded, takes 14 seconds to self host instead of 12(happens constantly). I will post the patch anyways for review for further discussion.

ruiu added inline comments.Mar 15 2015, 8:58 PM

include/lld/Core/Parallel.h
303 ↗	(On Diff #22009)	This doesn't make sense. This condition is always true for any len > 0 if concurrency > 1. As a result, you created one task for each array element.

.

Ah that was bad. Thanks for catching that.

The results are very similar to the previous patch that was posted. Regresses on performance with bigger links on my box.

No need to use recursion. Using for-loop like this is much simpler.

void parallel_for_each(Iterator begin, iterator end, Func func) {
  TaskGroup tg;
  int taskSize = distance(begin, end) / concurrency;
  while (taskSize <= distance(begin, end)) {
    tg.spawn([=, &func, &tg] { std::for_each(begin, begin + taskSize, func); });
    begin += taskSize;
  }
  std::for_each(begin, end, func);
}

I was thinking of doing it non recursive too, but I thought its easier to experiment or change the way tasks get spawned(by changing few numbers) with a recursive version, no ??

The other problem is this approach is not helping and makes the link times longer for building clang/self hosting. Should we move to using the previous implementation ?

I was thinking of doing it non recursive too, but I thought its easier to experiment or change the way tasks get spawned(by changing few numbers) with a recursive version, no ??

The other problem is this approach is not helping and makes the link times longer for building clang/self hosting. Should we move to using the previous implementation ?

Implement a non-recursive version as per davide/ruiu.

ruiu added inline comments.Mar 16 2015, 12:36 PM

include/lld/Core/Parallel.h
300 ↗	(On Diff #22039)	Does this compile without std::?
303 ↗	(On Diff #22039)	This expression doesn't make sense again. You divide length of the array with minSize, which is equal to how many minSize chunks in the array. It's not a task size. Let's make this simple as a starter. Just set 1024 to taskSize unconditionally. That's probably large enough. You may still have to run a test with that number, though.
304 ↗	(On Diff #22039)	This doesn't seem to compile with std::.

shankarke added inline comments.Mar 16 2015, 12:47 PM

include/lld/Core/Parallel.h
300 ↗	(On Diff #22039)	compiles/runs fine for me.
303 ↗	(On Diff #22039)	will change. we cannot set taskSize to 1024 unconditionally if the length < 1024.

shankarke updated this revision to Diff 22043.Mar 16 2015, 12:50 PM

ruiu added inline comments.Mar 16 2015, 12:50 PM

include/lld/Core/Parallel.h
303 ↗	(On Diff #22039)	Yes we can. The last std::for_each outside the while loop is for that.

ruiu added inline comments.Mar 16 2015, 1:02 PM

include/lld/Core/Parallel.h
302 ↗	(On Diff #22043)	Just set 1024 unconditionally.

shankarke added inline comments.Mar 16 2015, 1:04 PM

include/lld/Core/Parallel.h
302 ↗	(On Diff #22043)	Thanks for being so patient. Sorry it took so much time.

ruiu added inline comments.Mar 16 2015, 1:05 PM

include/lld/Core/Parallel.h
300 ↗	(On Diff #22043)	Remove length.
304 ↗	(On Diff #22043)	Remove &tg.

shankarke updated this revision to Diff 22047.Mar 16 2015, 1:05 PM

hopefully this is the final version :)

LGTM with this fix. Thank you for your patience. Let's give it a shot to see how it works.

include/lld/Core/Parallel.h
299 ↗	(On Diff #22050)	I think int is enough, but if you choose other type, it should be ptrdiff_t instead of int64_t. (But, well, int should suffice.)
299 ↗	(On Diff #22047)	I think int is enough, but if you choose other type, it should be size_t and not int64_t.

shankarke added inline comments.Mar 16 2015, 2:12 PM

include/lld/Core/Parallel.h
299 ↗	(On Diff #22050)	Sure, changed and will commit soon.

shankarke updated this revision to Diff 22051.Mar 16 2015, 2:13 PM

Closed by commit rL232419: [Core] Add parallel_for_each (authored by shankare). · Explain WhyMar 16 2015, 2:30 PM

This revision was automatically updated to reflect the committed changes.

This is an archive of the discontinued LLVM Phabricator instance.

[lld][Core] Implement parallel_for_each
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 22054

lld/trunk/include/lld/Core/Parallel.h