The current reduction logic tries to reproduce what a serial reduction
would produce, and just takes the first one that is still
interesting. We still have to wait for all others to complete though,
which at that point is just a waste.
This helps speed things up with long running reducers, which I
frequently have. e.g. for the added sleep test on my system, it took
about 8 seconds before this change and about 4 after.
is this include need here? Seems like nothing from <atomic> is used here.