This makes the profiler easier to use: we no longer need to remember
to initialize it every time we fork. It also means that the profiler
is initialized at most once per thread instead of once per task.
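
For illustration, a minimal sketch of the once-per-thread pattern,
assuming a thread_local guard. All names here (g_profiler_inits,
init_profiler_for_this_thread, ensure_profiler_initialized, run_tasks)
are hypothetical and are not taken from the actual change:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real profiler setup; it only counts calls.
std::atomic<int> g_profiler_inits{0};

void init_profiler_for_this_thread() {
    g_profiler_inits.fetch_add(1, std::memory_order_relaxed);
}

// Called at the start of every task. The thread_local flag ensures the
// setup runs only the first time a given thread reaches this point.
void ensure_profiler_initialized() {
    thread_local bool initialized = false;
    if (!initialized) {
        init_profiler_for_this_thread();
        initialized = true;
    }
}

// Runs tasks_per_thread tasks on each of num_threads worker threads and
// returns how many times the profiler setup ran in total.
int run_tasks(int num_threads, int tasks_per_thread) {
    assert(num_threads > 0 && tasks_per_thread > 0);
    g_profiler_inits = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([tasks_per_thread] {
            for (int i = 0; i < tasks_per_thread; ++i) {
                ensure_profiler_initialized();
                // ... task body would go here ...
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return g_profiler_inits.load();
}
```

In this sketch the number of initializations tracks the number of
threads, not the number of tasks, and callers never initialize the
profiler explicitly.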

TODO: we should probably implement the same automatic initialization
for Parallel.h as well, but perhaps only once we have a use case to
test it with.