I've tested a 3-stage compilation. The graphs below show the linking of clang.exe with -flto=thin, /opt:lldltojobs=all, no LTO cache, and -DLLVM_INTEGRATED_CRT_ALLOC=d:\git\rpmalloc on stages 1 and 2, to work around the Windows Heap's scaling issues on many-core machines. The test ran on a 36-core Xeon 6140.
Before (total run is 100 sec):
After patch (total run is 85 sec):
The remaining issue, after the falloff in the graph, is PassBuilder.cpp, which takes a long time to opt+codegen. If that file were split into several .cpp files, I suppose the linking could complete in 70 sec.
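
To put the numbers above in relative terms, here is a small sketch of the arithmetic. The helper names are made up for illustration, and the 70 sec figure is only the guess stated above, not a measurement.

```python
# Link times in seconds, taken from the comment above.
BEFORE_S = 100.0          # measured, before the patch
AFTER_PATCH_S = 85.0      # measured, after the patch
ESTIMATED_SPLIT_S = 70.0  # guess if PassBuilder.cpp were split; NOT measured


def percent_saved(before_s: float, after_s: float) -> float:
    """Share of the original wall-clock time that was removed, in percent."""
    return 100.0 * (before_s - after_s) / before_s


def speedup(before_s: float, after_s: float) -> float:
    """Classic speedup ratio: old time divided by new time."""
    return before_s / after_s


if __name__ == "__main__":
    cases = [
        ("after patch", AFTER_PATCH_S),
        ("after patch + split (estimate)", ESTIMATED_SPLIT_S),
    ]
    for label, seconds in cases:
        print(
            f"{label}: {seconds:.0f} s, "
            f"{percent_saved(BEFORE_S, seconds):.0f}% less time, "
            f"{speedup(BEFORE_S, seconds):.2f}x speedup"
        )
```

So the patch alone removes 15% of the link time, and the hypothetical split would bring the total reduction to roughly 30% if the estimate holds.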