This patch optionally replaces the CRT allocator (i.e., malloc and free) with rpmalloc (public domain licence), snmalloc (MIT licence) or mimalloc (MIT licence).
This changes the memory stack from:
new/delete -> MS VC++ CRT malloc/free -> HeapAlloc -> VirtualAlloc
to
new/delete -> {rpmalloc|snmalloc|mimalloc} -> VirtualAlloc
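As an illustration, here's a minimal sketch of what the shortened stack looks like. The actual patch replaces malloc/free inside the statically linked CRT rather than overriding operator new, but the effect on the call stack is the same:

```
#include <new>
#include <rpmalloc.h>

// Route the C++ allocation operators straight to rpmalloc, which maps
// its memory from the OS with VirtualAlloc; the CRT layer disappears.
void *operator new(std::size_t Size) {
  if (void *Ptr = rpmalloc(Size))
    return Ptr;
  throw std::bad_alloc();
}

void operator delete(void *Ptr) noexcept { rpfree(Ptr); }

// The array, nothrow and aligned variants would need the same treatment,
// and rpmalloc_initialize() must run before the first allocation.
```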
Problem
The Windows heap is thread-safe by default, and the ThinLTO codegen does a lot of allocations in each thread:
In addition, on many-core systems, this effectively blocks other threads to a point where only a very small fraction of the CPU time is used:
Before patch (here on Windows 10, build 1709):
We can see that a whopping 80% of the CPU time is spent waiting (blocked) on other threads (780 sec / 72 cores ≈ 10.8 sec total) (graph with D71775 applied):
Threads are blocked waiting on the heap lock:
The thread above is woken up by the heap lock being released in another thread:
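The contention is easy to reproduce with a toy program (a sketch for illustration, not part of the patch) that simply hammers the default heap from every hardware thread, mimicking ThinLTO's parallel backends:

```
#include <cstdlib>
#include <thread>
#include <vector>

int main() {
  // One allocation-heavy worker per hardware thread. With a
  // lock-protected heap, most of the CPU time goes to waiting on the
  // heap lock rather than doing useful work.
  std::vector<std::thread> Workers;
  for (unsigned I = 0; I < std::thread::hardware_concurrency(); ++I)
    Workers.emplace_back([] {
      for (int N = 0; N < 1000000; ++N)
        std::free(std::malloc(64)); // each pair takes/releases the heap lock
    });
  for (std::thread &W : Workers)
    W.join();
}
```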
Solution
We simply use rpmalloc instead of the VC++ CRT allocator, which uses the Windows heap under the hood.
"rpmalloc - Rampant Pixels Memory Allocator
This library provides a public domain cross platform lock free thread caching 16-byte aligned memory allocator implemented in C"
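For reference, here is a minimal usage sketch of rpmalloc's first-class C API (names from rpmalloc.h as of the version used here); the per-thread calls set up the thread caches mentioned in the quote:

```
#include <rpmalloc.h>
#include <thread>

int main() {
  rpmalloc_initialize();          // once per process
  std::thread T([] {
    rpmalloc_thread_initialize(); // set up this thread's local cache
    void *Ptr = rpmalloc(256);    // lock-free, 16-byte aligned
    rpfree(Ptr);
    rpmalloc_thread_finalize();   // release the cache before thread exit
  });
  T.join();
  rpmalloc_finalize();
}
```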
The feature can be enabled with the cmake flags -DLLVM_INTEGRATED_CRT_ALLOC=D:/git/rpmalloc -DLLVM_USE_CRT_RELEASE=MT. It is currently available only for Windows, but rpmalloc already supports Darwin, FreeBSD and Linux, so it would be easy to enable it for Unix as well. It currently uses /MT because it is easier that way, and I'm not sure /MD can be overloaded without code patching at runtime (I could investigate that later, but the DLL thunks slow things down).
In addition rpmalloc supports large memory pages, which can be enabled at run-time through the environment variable LLVM_RPMALLOC_PAGESIZE=[4k|2M|1G].
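As a sketch of how such a knob can be wired up (illustrative only, not the patch's actual parsing code), the environment variable maps onto rpmalloc's rpmalloc_config_t:

```
#include <cstdlib>
#include <cstring>
#include <rpmalloc.h>

static void InitAllocatorFromEnv() {
  rpmalloc_config_t Config = {}; // zeroed fields keep rpmalloc's defaults
  if (const char *PageSize = std::getenv("LLVM_RPMALLOC_PAGESIZE")) {
    if (!std::strcmp(PageSize, "2M")) {
      Config.page_size = 2u << 20;
      Config.enable_huge_pages = 1; // needs SeLockMemoryPrivilege on Windows
    } else if (!std::strcmp(PageSize, "1G")) {
      Config.page_size = 1u << 30;
      Config.enable_huge_pages = 1;
    } // "4k" (or unset) keeps the system default page size
  }
  rpmalloc_initialize_config(&Config);
}
```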
After patch:
After patch, with D71775 applied, all CPU sockets are used:
(the dark blue part of the graph represents the time spent in the kernel; see below for why)
In addition to the heap lock, there is a kernel bug in some versions of Windows 10, where accessing newly allocated virtual pages triggers the page zero-out mechanism, which is itself protected by a global lock, further stalling memory allocations.
If we dig deeper, we effectively see ExpWaitForSpinLockExclusiveAndAcquire taking way too much time (the blue vertical lines show where it is called on the timeline):
(patch is applied)
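For illustration, a minimal first-touch loop (not from the patch) is enough to exercise that zero-out path:

```
#include <windows.h>

int main() {
  // VirtualAlloc only commits address space; the kernel wires in a
  // zeroed physical page when each page is first touched. On the
  // affected Windows 10 builds, that zero-out path serializes on a
  // global spin lock.
  const SIZE_T Size = 16 * 1024 * 1024;
  char *Ptr = (char *)VirtualAlloc(nullptr, Size, MEM_RESERVE | MEM_COMMIT,
                                   PAGE_READWRITE);
  if (!Ptr)
    return 1;
  for (SIZE_T I = 0; I < Size; I += 4096)
    Ptr[I] = 1; // each first touch page-faults into the zeroing path
  VirtualFree(Ptr, 0, MEM_RELEASE);
}
```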
When using the latest Windows build 1909, with this patch applied and D71775, LLD now reaches over 90% CPU usage:
ThinLTO timings
Globally, this patch along with D71775 gives significantly better link times with ThinLTO, on Windows at least. The link times below are for Ubisoft's Rainbow 6: Siege PC Final ThinLTO build. Timings are only for the full link (no ThinLTO cache).
LLD 10 was produced by a two-stage build.
In case [1] the second stage uses -DLLVM_USE_CRT_RELEASE=MT.
In case [2] the second stage uses -DLLVM_INTEGRATED_CRT_ALLOC=D:/git/rpmalloc -DLLVM_USE_CRT_RELEASE=MT.
In case [3] the second stage uses -DLLVM_INTEGRATED_CRT_ALLOC=D:/git/rpmalloc -DLLVM_USE_CRT_RELEASE=MT -DLLVM_ENABLE_LTO=Thin, and both -DCMAKE_C_FLAGS and -DCMAKE_CXX_FLAGS are set to "/GS- -Xclang -O3 -Xclang -fwhole-program-vtables -fstrict-aliasing -march=skylake-avx512".
The LLVM tests link with ThinLTO, while the MSVC tests necessarily run full LTO, since MSVC does not support ThinLTO.
The malloc interception will certainly interfere with sanitizers. I think it's worth preemptively defending against that by erroring out if LLVM_USE_SANITIZER is non-empty and the integrated CRT allocator is in use.
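One possible shape for that guard, sketched here at the preprocessor level rather than as the suggested CMake-time error (the LLVM_INTEGRATED_CRT_ALLOC macro standing in for the build option is an assumption):

```
// Refuse to build if the CRT allocator replacement is combined with a
// sanitizer, since both intercept malloc/free.
#if defined(LLVM_INTEGRATED_CRT_ALLOC)
#if defined(__SANITIZE_ADDRESS__)
#error "The CRT allocator replacement is incompatible with sanitizers"
#endif
#if defined(__has_feature)
#if __has_feature(address_sanitizer) || __has_feature(thread_sanitizer) ||    \
    __has_feature(memory_sanitizer)
#error "The CRT allocator replacement is incompatible with sanitizers"
#endif
#endif
#endif
```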