This is a reimplementation of D60495 but with Teresa's suggestion applied: https://reviews.llvm.org/D60495#1562871
I've tested a 3-stage compilation. The graphs below show the link of clang.exe with -flto=thin, /opt:lldltojobs=all, no LTO cache, and -DLLVM_INTEGRATED_CRT_ALLOC=d:\git\rpmalloc on stages 1 & 2, to work around the Windows Heap's scaling issues on many-core machines. The test ran on a 36-core Xeon 6140.
Before (total run is 100 sec):
After patch (total run is 85 sec):
The remaining issue after the falloff in the graph is PassBuilder.cpp, which takes a long time to opt+codegen. If that file were split into several .cpp files, I suppose the link could complete in 70 sec.
Nit, it isn't actually producing a reordered container. How about something like "Produces a container ordering for optimal multi-threaded processing. Returns ordered indices to elements in the input array."
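To make the distinction in that comment concrete, here is a minimal standalone sketch of a function that returns ordered indices and leaves the input untouched. It is not the patch's code: the name `orderedIndicesBySize`, the plain `std::vector` input, and the largest-first heuristic are all assumptions for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper, not taken from the patch.
// Returns indices into `Sizes`, ordered so that the largest elements come
// first (assumed heuristic: start the most expensive work items early).
// `Sizes` itself is never reordered.
std::vector<std::size_t>
orderedIndicesBySize(const std::vector<std::size_t> &Sizes) {
  std::vector<std::size_t> Order(Sizes.size());
  // Start from the identity ordering 0, 1, 2, ...
  std::iota(Order.begin(), Order.end(), std::size_t{0});
  // Sort the indices, not the elements. stable_sort keeps the input order
  // for elements of equal size, so the result is deterministic.
  std::stable_sort(Order.begin(), Order.end(),
                   [&](std::size_t L, std::size_t R) {
                     return Sizes[L] > Sizes[R];
                   });
  return Order;
}
```

A caller would then walk the returned indices and schedule `Input[Order[I]]` in that sequence, which is why "returns ordered indices" describes the behaviour more accurately than "reorders the container".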