** For information only. No review needed **
Following our work detailed in D86694 to port the scudo sanitizer library to Windows, we have also investigated producing a Windows port of the stand-alone version of scudo.
The main impetus for this work was to see whether stand-alone scudo (scudo-sa) would scale better than the standard Microsoft heap library (MSHeap) on machines with a high number of cores. We have seen that ThinLTO builds on such machines do not fully utilize the available compute resources.
This makes use of the CRT allocation hooks from D71786.
The new files 'WindowsMMap.h/c' are based on the versions from compiler-rt/lib/profile.
To build with scudo on Windows, use: -DLLVM_INTEGRATED_CRT_ALLOC=<path to hostlib build> -DLLVM_USE_CRT_RELEASE=MT
Limitations:
• A bug in the MSVC STL implementation means that Debug builds will not work correctly with scudo-sa. The STL hard-codes calls to the MSVC debug heap routines, which then attempt to ‘free’ heap memory that was allocated by scudo-sa.
• The llvm-rtdyld code will not reliably work with scudo-sa in 64-bit builds (x64). It assumes that all heap blocks lie within a 32-bit address range of one another, because it uses 32-bit relocations between the blocks. Scudo-sa deliberately randomizes the addresses returned by heap allocations and frequently returns blocks more than 4 GB apart. (https://bugs.llvm.org/show_bug.cgi?id=24978)
• Some scudo-sa tests assume the presence of ‘pthread’. These have not been ported to Windows.
• The scudo-sa CMake build is not yet integrated into the parent LLVM build.
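The llvm-rtdyld limitation above comes down to a range check: a 32-bit relocation can only encode a displacement that fits in 32 bits. The following is a minimal sketch of that check, not code from llvm-rtdyld. The helper name `fitsInRel32` and the addresses are hypothetical, and it assumes a signed displacement (about ±2 GiB); an unsigned field would extend the reach to 4 GiB, but widely separated blocks overflow either way.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not taken from llvm-rtdyld): returns true when the
// signed displacement from `from` to `to` fits in a 32-bit relocation field.
bool fitsInRel32(std::uint64_t from, std::uint64_t to) {
  // Unsigned subtraction wraps modulo 2^64; reinterpreting the result as a
  // signed 64-bit value yields the displacement on two's-complement targets.
  const auto delta = static_cast<std::int64_t>(to - from);
  return delta >= std::numeric_limits<std::int32_t>::min() &&
         delta <= std::numeric_limits<std::int32_t>::max();
}

// Usage sketch with made-up addresses.
inline void demoRel32() {
  const std::uint64_t base = 0x20000000000ULL;      // arbitrary 64-bit address
  assert(fitsInRel32(base, base + 0x1000));         // nearby block: encodable
  assert(fitsInRel32(base + 0x1000, base));         // small negative displacement
  assert(!fitsInRel32(base, base + (5ULL << 30)));  // 5 GiB apart: overflows
}
```

An allocator that keeps blocks close together tends to satisfy this check by accident, whereas one that randomizes placement across the 64-bit address space, as scudo-sa does, will not.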
To compare the performance of MSHeap and scudo-sa, we performed a ThinLTO build of clang.exe on a 16-core, 32-hardware-thread AMD Ryzen machine. Running the build with 32 threads showed that MSHeap stopped scaling at around 16 threads (CPU usage peaked, oscillating between 15 and 17 threads' worth). Scudo-sa performed the same build using all 32 threads at 100% usage.
Results:
hyperfine.exe -w 1 -m 2 ".\stage1s\lld-link.exe @response.txt /threads:32" ".\stage1\lld-link.exe @response.txt /threads:32"
Benchmark #1: .\stage1s\lld-link.exe @response.txt /threads:32
Time (mean ± σ): 211.857 s ± 3.079 s [User: 7.4 ms, System: 13.2 ms]
Range (min … max): 209.680 s … 214.034 s 2 runs
Benchmark #2: .\stage1\lld-link.exe @response.txt /threads:32
Time (mean ± σ): 233.617 s ± 8.573 s [User: 0.0 ms, System: 13.2 ms]
Range (min … max): 227.555 s … 239.679 s 2 runs
Summary
'.\stage1s\lld-link.exe @response.txt /threads:32' ran
1.10 ± 0.04 times faster than '.\stage1\lld-link.exe @response.txt /threads:32'
(stage1s is the scudo-sa build; stage1 is the MSHeap build.)
This showed a roughly 10% build speed improvement when using scudo-sa. This is disappointing but understandable, as the other 16 threads that scudo-sa keeps fully busy are virtual hyper-threaded (SMT) cores, not true CPU cores.
For builds on smaller machines with fewer than 16 cores, both MSHeap and scudo-sa utilize 100% of available threads.
hyperfine.exe -w 1 -m 2 ".\stage1s\lld-link.exe @response.txt /threads:12" ".\stage1\lld-link.exe @response.txt /threads:12"
Benchmark #1: .\stage1s\lld-link.exe @response.txt /threads:12
Time (mean ± σ): 259.826 s ± 0.055 s [User: 7.5 ms, System: 7.0 ms]
Range (min … max): 259.787 s … 259.865 s 2 runs
Benchmark #2: .\stage1\lld-link.exe @response.txt /threads:12
Time (mean ± σ): 245.082 s ± 0.386 s [User: 7.5 ms, System: 0.0 ms]
Range (min … max): 244.809 s … 245.354 s 2 runs
Summary
'.\stage1\lld-link.exe @response.txt /threads:12' ran
1.06 ± 0.00 times faster than '.\stage1s\lld-link.exe @response.txt /threads:12'
In this case MSHeap is slightly quicker than scudo-sa, by about 6%.
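The "times faster" figures in the hyperfine summaries are ratios of the mean wall-clock times. As a small sketch for checking them by hand (the function names `speedup` and `printSpeedups` are ours, not part of hyperfine):

```cpp
#include <cassert>
#include <cstdio>

// Ratio of mean wall-clock times: how many times faster the quicker run was.
double speedup(double slowerMeanSec, double fasterMeanSec) {
  assert(fasterMeanSec > 0.0 && slowerMeanSec >= fasterMeanSec);
  return slowerMeanSec / fasterMeanSec;
}

// Prints the ratios for the two hyperfine runs quoted above.
void printSpeedups() {
  // 32 threads: MSHeap 233.617 s vs scudo-sa 211.857 s.
  std::printf("32 threads: scudo-sa %.2fx faster\n", speedup(233.617, 211.857));
  // 12 threads: scudo-sa 259.826 s vs MSHeap 245.082 s.
  std::printf("12 threads: MSHeap %.2fx faster\n", speedup(259.826, 245.082));
}
```

Note that a 1.10x ratio corresponds to roughly 9% less wall-clock time (1 - 211.857/233.617), which is the sense in which the 32-thread result is "roughly 10%".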
Given the results we have seen, we will not be spending any further time investigating scudo-sa as a replacement for the default Microsoft heap.