This makes the profiler easier to use; we no longer need to
remember to initialize it every time we fork. It also means that we
initialize the profiler at most once per thread instead of once per
task.
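
To illustrate the once-per-thread versus once-per-task difference, here is a minimal sketch. It does not use the real ThreadPool or profiler code: `TinyPool`, `init_profiler_for_this_thread`, `g_profiler_inits`, and `count_inits` are all hypothetical names made up for this example, and the pool is assumed to accept an init callback that each worker runs once at startup.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the profiler's per-thread setup; not a real API.
std::atomic<int> g_profiler_inits{0};
void init_profiler_for_this_thread() { ++g_profiler_inits; }

// Hypothetical minimal pool: each worker runs `thread_init` once, then tasks.
class TinyPool {
 public:
  TinyPool(std::size_t n, std::function<void()> thread_init) {
    for (std::size_t i = 0; i < n; ++i) {
      workers_.emplace_back([this, thread_init] {
        if (thread_init) thread_init();  // once per worker thread
        loop();
      });
    }
  }

  ~TinyPool() {
    {
      std::lock_guard<std::mutex> lk(m_);
      done_ = true;
    }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }

  void run(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lk(m_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void loop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // shutting down and queue is drained
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // no per-task profiler setup needed here
    }
  }

  std::mutex m_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool done_ = false;
  std::vector<std::thread> workers_;
};

// Runs `tasks` trivial tasks on `threads` workers; returns the init count.
int count_inits(std::size_t threads, int tasks) {
  g_profiler_inits = 0;
  std::atomic<int> ran{0};
  {
    TinyPool pool(threads, init_profiler_for_this_thread);
    for (int i = 0; i < tasks; ++i) pool.run([&ran] { ++ran; });
  }  // destructor drains the queue and joins the workers
  assert(ran == tasks);
  return g_profiler_inits;
}
```

In this sketch, `count_inits(4, 100)` returns 4: the init hook runs once for each of the four workers, not once for each of the 100 tasks.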
TODO: we should probably implement this for Parallel.h as well, but
perhaps only once we have a use case to test it with.
It seems that we'll always use the instance initialized in the thread that calls "grow". Also, this instance has to be set up before the call to grow, and the thread can't reinitialize it for the lifetime of the ThreadPool, if I understand correctly.
I'm not sure this makes sense in the full generality of the ThreadPool API?