This is an archive of the discontinued LLVM Phabricator instance.

Investigation of porting scudo standalone to Windows for use in LLVM tools.
Needs Review · Public

Authored by FlameTop on Oct 29 2021, 4:48 AM.

Details

Summary
  • For information only. No review needed.

Following our work detailed in D86694 to port the scudo sanitizer library to Windows, we have also investigated producing a Windows port of the stand-alone version of scudo.

The main impetus for this work was to see whether stand-alone scudo (scudo-sa) would scale better than the standard Microsoft heap library (MSHeap) on machines with a high number of cores. We have seen that ThinLTO builds on such machines do not fully utilize the available compute resources.
This makes use of the CRT allocation hooks from D71786.
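As a rough illustration of the idea behind such hooks (a minimal sketch only, with placeholder names; this is not the code from D71786 or from this patch), every CRT-level allocation call is funnelled through one set of replaceable entry points, which can then be pointed at scudo instead of the default Microsoft heap. Here the "replacement" allocator is just the standard one plus a counter, to keep the sketch self-contained:

  // Illustrative sketch only: placeholder names, not the D71786 hook interface.
  #include <atomic>
  #include <cstdio>
  #include <cstdlib>

  static std::atomic<unsigned long long> NumAllocs{0};

  // Stand-ins for the replacement allocator's entry points (assumed names).
  static void *ReplacementMalloc(std::size_t Size) {
    ++NumAllocs;
    return std::malloc(Size);
  }
  static void ReplacementFree(void *Ptr) { std::free(Ptr); }

  // Hooked CRT-style entry points: a real hooking mechanism would install these
  // in place of the CRT's malloc/free so that every allocation reaches the
  // replacement allocator.
  extern "C" void *HookedMalloc(std::size_t Size) { return ReplacementMalloc(Size); }
  extern "C" void HookedFree(void *Ptr) { ReplacementFree(Ptr); }

  int main() {
    void *P = HookedMalloc(128);
    HookedFree(P);
    std::printf("allocations routed through the hook: %llu\n", NumAllocs.load());
    return 0;
  }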
The new files 'WindowsMMap.h/c' are based on the versions from compiler-rt/lib/profile.
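For reference, here is a minimal sketch of the kind of shim such a compatibility layer provides (illustrative only, with assumed names; not the actual contents of WindowsMMap.h/c): an anonymous mmap()/munmap() pair mapped onto VirtualAlloc/VirtualFree so that POSIX-style allocator code can run on Windows.

  // Illustrative sketch only; file-backed mappings would need
  // CreateFileMapping/MapViewOfFile instead.
  #include <windows.h>
  #include <cstddef>

  #define SHIM_MAP_FAILED ((void *)-1)

  // Anonymous, private, read-write mapping.
  static void *ShimMmapAnon(std::size_t Length) {
    void *Ptr = VirtualAlloc(nullptr, Length, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    return Ptr ? Ptr : SHIM_MAP_FAILED;
  }

  // VirtualFree with MEM_RELEASE must be passed the base address and size 0.
  static int ShimMunmap(void *Addr, std::size_t /*Length*/) {
    return VirtualFree(Addr, 0, MEM_RELEASE) ? 0 : -1;
  }

  int main() {
    void *P = ShimMmapAnon(1 << 20); // reserve and commit 1 MiB
    if (P == SHIM_MAP_FAILED)
      return 1;
    return ShimMunmap(P, 1 << 20) == 0 ? 0 : 1;
  }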
To build LLVM with scudo on Windows, configure with: -DLLVM_INTEGRATED_CRT_ALLOC=<path to hostlib build> -DLLVM_USE_CRT_RELEASE=MT.
Limitations:
• A bug in the MSVC STL implementation means that Debug builds will not work correctly with scudo-sa (the STL hard-codes calls to the MSVC debug heap routines, so it attempts to free memory that was allocated by scudo-sa using the debug heap).
• The llvm-rtdyld code will not reliably work with scudo-sa when it uses 32-bit relocations in a 64-bit build. It assumes that any heap blocks will lie within a 32-bit address range of each other (as it uses 32-bit relocations between the blocks), but scudo-sa deliberately randomizes the addresses returned by heap allocations and frequently returns blocks more than 4 GB apart (see the small illustration after this list). (https://bugs.llvm.org/show_bug.cgi?id=24978)
• Some scudo-sa tests assume the presence of ‘pthread’. These have not been ported to Windows.
• The scudo-sa CMake build is not yet integrated into the parent LLVM build.
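To make the llvm-rtdyld limitation above concrete, the assumption that breaks is sketched below (illustrative only, not part of the patch): a rel32-style relocation between two heap blocks can only be encoded if the distance between them fits in a signed 32-bit offset, and scudo-sa's address randomization does not guarantee that.

  #include <cstdint>
  #include <cstdio>
  #include <cstdlib>

  int main() {
    void *A = std::malloc(64);
    void *B = std::malloc(64);
    // Distance between two unrelated heap blocks, as an integer delta.
    std::int64_t Delta = static_cast<std::int64_t>(reinterpret_cast<std::intptr_t>(B)) -
                         static_cast<std::int64_t>(reinterpret_cast<std::intptr_t>(A));
    // A rel32 relocation can only encode offsets that fit in a signed 32-bit value.
    bool FitsRel32 = Delta >= INT32_MIN && Delta <= INT32_MAX;
    std::printf("delta = %lld bytes, representable as rel32: %s\n",
                static_cast<long long>(Delta), FitsRel32 ? "yes" : "no");
    std::free(A);
    std::free(B);
    return 0;
  }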
To evaluate the comparative performance of MSHeap and scudo-sa we performed a ThinLTO build of clang.exe on a 16-core, 32-thread (SMT) AMD Ryzen machine. Running the build with 32 threads showed that MSHeap stopped using all available CPU resources at around 16 threads (CPU usage peaked oscillating between the equivalent of 15 and 17 busy threads). Scudo-sa performed the same build using all 32 threads at 100% usage.
Results:
hyperfine.exe -w 1 -m 2 ".\stage1s\lld-link.exe @response.txt /threads:32" ".\stage1\lld-link.exe @response.txt /threads:32"
Benchmark #1: .\stage1s\lld-link.exe @response.txt /threads:32
Time (mean ± σ): 211.857 s ± 3.079 s [User: 7.4 ms, System: 13.2 ms]
Range (min … max): 209.680 s … 214.034 s 2 runs

Benchmark #2: .\stage1\lld-link.exe @response.txt /threads:32
Time (mean ± σ): 233.617 s ± 8.573 s [User: 0.0 ms, System: 13.2 ms]
Range (min … max): 227.555 s … 239.679 s 2 runs

Summary
'.\stage1s\lld-link.exe @response.txt /threads:32' ran
1.10 ± 0.04 times faster than '.\stage1\lld-link.exe @response.txt /threads:32'

(stage1s is the scudo-sa build, stage1 the MSHeap build)

This showed a roughly 10% build-speed improvement when using scudo-sa. This is disappointing but understandable, as the extra 16 threads that scudo-sa keeps busy run on SMT logical cores, not additional physical cores.
For builds on smaller machines with fewer than 16 cores, both MSHeap and scudo-sa utilize 100% of the available threads.

hyperfine.exe -w 1 -m 2 ".\stage1s\lld-link.exe @response.txt /threads:12" ".\stage1\lld-link.exe @response.txt /threads:12"
Benchmark #1: .\stage1s\lld-link.exe @response.txt /threads:12
Time (mean ± σ): 259.826 s ± 0.055 s [User: 7.5 ms, System: 7.0 ms]
Range (min … max): 259.787 s … 259.865 s 2 runs

Benchmark #2: .\stage1\lld-link.exe @response.txt /threads:12
Time (mean ± σ): 245.082 s ± 0.386 s [User: 7.5 ms, System: 0.0 ms]
Range (min … max): 244.809 s … 245.354 s 2 runs

Summary
'.\stage1\lld-link.exe @response.txt /threads:12' ran
1.06 ± 0.00 times faster than '.\stage1s\lld-link.exe @response.txt /threads:12'

In this case MSHeap is slightly quicker than scudo-sa.
Given the results we have seen, we will not be spending any further time investigating scudo-sa as a replacement for the default Microsoft heap.

Diff Detail

Event Timeline

FlameTop requested review of this revision. Oct 29 2021, 4:48 AM
FlameTop created this revision.
Herald added a project: Restricted Project. Oct 29 2021, 4:48 AM
Herald added a subscriber: Restricted Project.
FlameTop updated this revision to Diff 383312. Oct 29 2021, 4:57 AM

Fixed missing commit in diff file.

FlameTop edited the summary of this revision. Oct 29 2021, 6:55 AM

Thanks a lot for working on this!

I'm a bit surprised by the results. I expected much better performance with Scudo when compared with the Windows Heap. Even on a 6-core machine, performance is at least 2x better with rpmalloc than with the Windows Heap, and the gap grows with the number of cores. @cryptoad was saying that some tweaking might be needed to make Scudo optimal.

Generally I still think it would be valuable to go ahead with a Windows port of scudo-sa. @cryptoad seemed confident that the performance of Scudo could match that of other lock-free allocators.

@FlameTop @russell.gallop What would otherwise be the alternative? Should we go ahead with a patch that adds a copy of rpmalloc to the trunk? This was suggested in offline discussions.

One of the main problems we found in this investigation was getting reliable performance metrics. Scudo was at least very consistent. The Windows Heap appears to be wildly influenced by the allocation/deallocation profile: in some circumstances WinHeap was seen to perform and scale very well across multiple threads, but in others it performed remarkably badly. One suspected culprit is the interaction between the 'low-fragmentation' heap used for smaller allocations and the 'big heap' used for larger ones.
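As an aside, a quick way to see which mode a given heap is in (illustrative snippet, not part of the patch) is to query HeapCompatibilityInformation; a value of 2 means the low-fragmentation heap is active for that heap:

  #include <windows.h>
  #include <cstdio>

  int main() {
    // 0 = standard heap, 1 = look-aside lists (older systems), 2 = low-fragmentation heap.
    ULONG HeapInfo = 0;
    SIZE_T ReturnLength = 0;
    if (HeapQueryInformation(GetProcessHeap(), HeapCompatibilityInformation,
                             &HeapInfo, sizeof(HeapInfo), &ReturnLength))
      std::printf("process heap compatibility mode: %lu\n", HeapInfo);
    else
      std::printf("HeapQueryInformation failed: %lu\n", GetLastError());
    return 0;
  }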