
[scudo] Allow -fsanitize=scudo on Linux and Windows (WIP, don't land as is)
Needs ReviewPublic

Authored by russell.gallop on Aug 27 2020, 3:05 AM.

Details

Summary

This is basically https://reviews.llvm.org/D42519 updated for the monorepo, with a few changes to allow -fsanitize=scudo in clang-cl and to permit LLVM_USE_SANITIZER=Scudo.

This is to allow evaluation of Scudo as a replacement memory allocator on Windows, for comparison with https://reviews.llvm.org/D71786.

Diff Detail

Event Timeline

russell.gallop created this revision.Aug 27 2020, 3:05 AM
Herald added projects: Restricted Project, Restricted Project, Restricted Project. · View Herald TranscriptAug 27 2020, 3:05 AM
Herald added subscribers: llvm-commits, Restricted Project, cfe-commits, mgorny. · View Herald Transcript
russell.gallop requested review of this revision.Aug 27 2020, 3:05 AM

That's awesome! Is it meant to eventually be committed or only be used for comparison purposes?

That's awesome! Is it meant to eventually be committed or only be used for comparison purposes?

I'd like it to be committed, but can't claim I know the code from https://reviews.llvm.org/D42519 well enough. The good news is that I can build LLVM on Windows with this. Is there a good sanity check that it is actually using Scudo rather than silently using the standard alloc?

You marked D42519 as WIP, can you remember what was still TBD?

Also, it might make sense to separate out the "use" of sanitize=Scudo in LLVM, from providing Windows support. I put them together here for evaluation purposes.

You marked D42519 as WIP, can you remember what was still TBD?

Maybe more tests could be enabled...

That's awesome! Is it meant to eventually be committed or only be used for comparison purposes?

I'd like it to be committed, but can't claim I know the code from https://reviews.llvm.org/D42519 well enough. The good news is that I can build LLVM on Windows with this. Is there a good sanity check that it is actually using Scudo rather than silently using the standard alloc?

Nothing except the tests. Compiling a sizeable application with Scudo as well.

You marked D42519 as WIP, can you remember what was still TBD?

Mostly because I didn't have the time and resources to make sure this would work over the long term.
I used my home windows box to do this as a proof of concept that it could be done.

Also, it might make sense to separate out the "use" of sanitize=Scudo in LLVM, from providing Windows support. I put them together here for evaluation purposes.

Yeah this makes sense.

The other point that is worth mentioning is that we moved all dev efforts to the "standalone" version of Scudo (e.g. the one not depending on sanitizer_common, in the standalone/ subdirectory).
There are enough differences that there could be some significant performance/memory footprint changes between the two.

Also depending on what you want to compare, disabling the Quarantine and other optional security features will make things faster and use less memory.

russell.gallop edited the summary of this revision. (Show Details)Aug 28 2020, 10:03 AM
russell.gallop updated this revision to Diff 291235.EditedSep 11 2020, 9:15 AM
russell.gallop edited the summary of this revision. (Show Details)

Fixup scudo (sanitizer based) to work on Windows.

This makes use of the CRT alloc hooks from D71786.

To build with scudo on Windows use: -DLLVM_INTEGRATED_CRT_ALLOC=<llvm-project>/stage1/lib/clang/12.0.0/lib/windows/clang_rt.scudo-x86_64.lib -DLLVM_USE_CRT_RELEASE=MT

-DLLVM_USE_SANITIZER=Scudo is supported in this patch, but isn't required. @cryptoad: on Linux, does this do anything other than add libraries when using clang to drive the linker?

Limitations:
  • This is not using hardware CRC32.
  • This just hooks in the C scudo library, not the cxx library.

I evaluated this on a 3 stage LLVM build on Windows 10 2004 (in vs2019 16.7.3 environment) on a 6-core i7-8700k.

  • stage1 builds the scudo sanitizer on Windows: requires -DLLVM_ENABLE_PROJECTS=clang;lld;compiler-rt
  • stage2: built with -DCMAKE_C_COMPILER="<stage1>/bin/clang-cl.exe" -DCMAKE_CXX_COMPILER="<stage1>/bin/clang-cl.exe" -DCMAKE_LINKER="<stage1>/bin/lld-link.exe" -DLLVM_USE_CRT_RELEASE=MT -DCMAKE_BUILD_TYPE=Release
  • stage2_scudo: as stage2 plus -DLLVM_INTEGRATED_CRT_ALLOC=<stage1>/lib/clang/12.0.0/lib/windows/clang_rt.scudo-x86_64.lib

Then evaluated linking clang with ThinLTO:

  • stage3: -DCMAKE_C_COMPILER="<stage2>/bin/clang-cl.exe" -DCMAKE_CXX_COMPILER="<stage2>/bin/clang-cl.exe" -DCMAKE_LINKER="<stage2>/bin/lld-link.exe" -DLLVM_USE_CRT_RELEASE=MT -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_LTO=Thin
  • stage3_scudo: -DCMAKE_C_COMPILER="<stage2_scudo>/bin/clang-cl.exe" -DCMAKE_CXX_COMPILER="<stage2_scudo>/bin/clang-cl.exe" -DCMAKE_LINKER="<stage2_scudo>/bin/lld-link.exe" -DLLVM_USE_CRT_RELEASE=MT -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_LTO=Thin

I set /threads:12 and removed /lldltocache.

Without SCUDO_OPTIONS scudo seems to be about 25% slower.

>"hyperfine.exe" -m 3 -w 1 "cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt" "cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt"
Benchmark #1: cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt
  Time (mean ± σ):     268.209 s ±  4.966 s    [User: 18.1 ms, System: 6.6 ms]
  Range (min … max):   263.223 s … 273.155 s    3 runs

Benchmark #2: cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt
  Time (mean ± σ):     334.312 s ±  4.002 s    [User: 2.4 ms, System: 14.6 ms]
  Range (min … max):   329.889 s … 337.683 s    3 runs

Summary
  'cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt' ran
    1.25 ± 0.03 times faster than 'cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt'

I set scudo options to disable quarantine and mismatch checking and it seems to be about 8% slower.

>set SCUDO_OPTIONS=allocator_release_to_os_interval_ms=-1:QuarantineSizeKb=0:ThreadLocalQuarantineSizeKb=0:DeleteSizeMismatch=0:DeallocationTypeMismatch=0
>"hyperfine.exe" -m 3 -w 1 "cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt" "cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt"
Benchmark #1: cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt
  Time (mean ± σ):     273.772 s ±  3.624 s    [User: 1.3 ms, System: 8.6 ms]
  Range (min … max):   269.806 s … 276.909 s    3 runs

Benchmark #2: cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt
  Time (mean ± σ):     296.593 s ±  2.362 s    [User: 1.3 ms, System: 13.9 ms]
  Range (min … max):   293.917 s … 298.391 s    3 runs

Summary
  'cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt' ran
    1.08 ± 0.02 times faster than 'cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt'

It's worth noting that the run without scudo was not using all CPU, so it is still hitting some locking issues, but CPU usage was still over 90%. With both scudo runs, the CPU was pegged at 100% until near the end of the link. From this it appears that Scudo's locking behaviour is better than the default allocator's, and it would win out on a wide enough processor. The loss in straight-line performance is potentially due to CRC calculations not using hardware support.

russell.gallop added a comment.EditedSep 11 2020, 9:20 AM

Is there a good sanity check that it is actually using Scudo rather than silently using the standard alloc?

It turns out it wasn't actually using Scudo; it is now!

Nothing except the tests. Compiling a sizeable application with Scudo as well.

Compiling clang and lld with Scudo and ThinLTO linking clang seems to work.

The other point that is worth mentioning is that we moved all dev efforts to the "standalone" version of Scudo (eg: the one not depending on sanitizer_common in the standalone/ subdirectory).
There is enough differences that there could be some significant performance/mem footprint changes between the 2.

Thanks. I'm looking at porting the standalone variant, drawing on D42519 and Windows support from sanitizer_common. Does that sound like a reasonable approach?

Also depending on what you want to compare, disabling the Quarantine and other optional security features will make things faster and use less memory.

Would these be good flags to use:

set SCUDO_OPTIONS=allocator_release_to_os_interval_ms=-1:QuarantineSizeKb=0:ThreadLocalQuarantineSizeKb=0:DeleteSizeMismatch=0:DeallocationTypeMismatch=0

Thanks. I'm looking at porting the standalone variant, drawing on D42519 and Windows support from sanitizer_common. Does that sound like a reasonable approach?

Yup, the structure is a tad different, and other platforms (Fuchsia, Linux cpp) have some implementation of the platform specific functions.

Would these be good flags to use:

set SCUDO_OPTIONS=allocator_release_to_os_interval_ms=-1:QuarantineSizeKb=0:ThreadLocalQuarantineSizeKb=0:DeleteSizeMismatch=0:DeallocationTypeMismatch=0

Yup, no release, no quarantine is great. Size class map compile-time options could also make a difference.

aganea set the repository for this revision to rG LLVM Github Monorepo.Sep 11 2020, 3:37 PM
aganea added a comment.EditedSep 11 2020, 3:39 PM

@russell.gallop Do you think you can upload the patch again by setting the Repository to "Monorepo"?

NVM I managed to do it. It's such a nightmare to apply patches when the above is not set upon uploading :-(

aganea added a comment.EditedSep 13 2020, 6:20 PM

Thanks for working on this @russell.gallop!

I've reproduced your tests, please see below. The only difference is that I've used a ThinLTO build for stage2:
-DCMAKE_CXX_FLAGS="/GS- -Xclang -O3 -fstrict-aliasing -march=skylake-avx512 -flto=thin -fwhole-program-vtables".
Running with /opt:lldltojobs=all and no /lldltocache.
Results on a 36-core (dual-socket) Xeon Gold 6140, Windows 10 version 2004.

(WinHeap vs. Scudo+options)

D:\llvm-project>hyperfine -m 3 -w 1 "cd d:\llvm-project\buildninjaRelWinHeap3 && d:\llvm-project\buildninjaRelWinHeap2\bin\lld-link.exe @CMakeFiles\clang.rsp" "cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp"
Benchmark #1: cd d:\llvm-project\buildninjaRelWinHeap3 && d:\llvm-project\buildninjaRelWinHeap2\bin\lld-link.exe @CMakeFiles\clang.rsp
  Time (mean ± σ):     664.086 s ± 18.740 s    [User: 0.0 ms, System: 0.0 ms]                                         
  Range (min … max):   647.070 s … 684.172 s    3 runs

Benchmark #2: cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp
  Time (mean ± σ):     145.619 s ±  0.140 s    [User: 0.0 ms, System: 8.1 ms]                                         
  Range (min … max):   145.522 s … 145.779 s    3 runs

Summary
  'cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp' ran
    4.56 ± 0.13 times faster than 'cd d:\llvm-project\buildninjaRelWinHeap3 && d:\llvm-project\buildninjaRelWinHeap2\bin\lld-link.exe @CMakeFiles\clang.rsp'
(Scudo+options vs. Rpmalloc)

D:\llvm-project>hyperfine -m 3 -w 1 "cd d:\llvm-project\buildninjaRelRpmalloc3 && d:\llvm-project\buildninjaRelRpMalloc2\bin\lld-link.exe @CMakeFiles\clang.rsp" "cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp"
Benchmark #1: cd d:\llvm-project\buildninjaRelRpmalloc3 && d:\llvm-project\buildninjaRelRpMalloc2\bin\lld-link.exe @CMakeFiles\clang.rsp
  Time (mean ± σ):     95.423 s ±  0.830 s    [User: 0.0 ms, System: 9.0 ms]                                          
  Range (min … max):   94.886 s … 96.380 s    3 runs

Benchmark #2: cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp
  Time (mean ± σ):     145.266 s ±  0.387 s    [User: 4.9 ms, System: 7.6 ms]                                         
  Range (min … max):   144.894 s … 145.666 s    3 runs

Summary
  'cd d:\llvm-project\buildninjaRelRpmalloc3 && d:\llvm-project\buildninjaRelRpMalloc2\bin\lld-link.exe @CMakeFiles\clang.rsp' ran
    1.52 ± 0.01 times faster than 'cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp'
(Scudo vs. Rpmalloc)

D:\llvm-project>hyperfine -m 3 -w 1 "cd d:\llvm-project\buildninjaRelRpmalloc3 && d:\llvm-project\buildninjaRelRpMalloc2\bin\lld-link.exe @CMakeFiles\clang.rsp" "cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp"
Benchmark #1: cd d:\llvm-project\buildninjaRelRpmalloc3 && d:\llvm-project\buildninjaRelRpMalloc2\bin\lld-link.exe @CMakeFiles\clang.rsp
  Time (mean ± σ):     95.435 s ±  0.059 s    [User: 0.0 ms, System: 8.0 ms]                                          
  Range (min … max):   95.385 s … 95.499 s    3 runs

Benchmark #2: cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp
  Time (mean ± σ):     270.967 s ±  1.366 s    [User: 4.8 ms, System: 0.0 ms]                                        
  Range (min … max):   269.397 s … 271.887 s    3 runs

Summary
  'cd d:\llvm-project\buildninjaRelRpmalloc3 && d:\llvm-project\buildninjaRelRpMalloc2\bin\lld-link.exe @CMakeFiles\clang.rsp' ran
    2.84 ± 0.01 times faster than 'cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp'

Summary:

  Allocator      Time                  Factor
  WinHeap        664.086 s ± 18.740 s  1.0
  Scudo          270.967 s ±  1.366 s  2.45
  Scudo+options  145.619 s ±  0.140 s  4.56
  Rpmalloc        95.423 s ±  0.830 s  6.95

CPU usage:
Rpmalloc - 3,944 cumulated seconds (all threads)


Scudo+options - 6,337 cumulated seconds (all threads)

Time spent in the allocator itself (note the different vertical scale in the graph)
(a hardware CRC or AES implementation will certainly help for Scudo)
Rpmalloc - 191 cumulated seconds


Scudo+options - 1,171 cumulated seconds

Memory usage:
Rpmalloc - Peaks at 11 GB commit (19 GB mapped)


Scudo+options - Peaks at 5 GB commit (although 4.4 TB of mapped pages!!!)

Re-upload patch with the repository set to rG LLVM Github Monorepo.

Thanks for working on this @russell.gallop!

I've reproduced your tests, please see below. The only difference is that I've used a ThinLTO build for stage2:

Thanks. It's good to know they're reproducible, and that it scales better than the default allocator.

(a hardware CRC or AES implementation will certainly help for Scudo)

Actually this wasn't too hard to try out. I added "-msse4.2" to CMAKE_C_FLAGS and CMAKE_CXX_FLAGS (as suggested in scudo_allocator.cpp). This helps, but scudo is still a bit behind the default allocator for me on 6 cores:

> set SCUDO_OPTIONS=allocator_release_to_os_interval_ms=-1:QuarantineSizeKb=0:ThreadLocalQuarantineSizeKb=0:DeleteSizeMismatch=0:DeallocationTypeMismatch=0
> hyperfine.exe -m 5 -w 1 "cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt" "cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt"
Benchmark #1: cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt
  Time (mean ± σ):     269.922 s ±  2.706 s    [User: 0.0 ms, System: 18.6 ms]
  Range (min … max):   265.969 s … 272.188 s    5 runs

Benchmark #2: cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt
  Time (mean ± σ):     285.522 s ±  0.440 s    [User: 12.3 ms, System: 9.9 ms]
  Range (min … max):   284.907 s … 286.056 s    5 runs

Summary
  'cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt' ran
    1.06 ± 0.01 times faster than 'cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt'

@russell.gallop I see a lots of failing tests when running ninja check-all on a Scudo-enabled build (stage 2). Do you see the same thing on your end?

==89136==ERROR: Scudo failed to allocate 0x10000 (65536) bytes of memory at address 0x4c0090e00 (error code: 1455)
==89136==Dumping process modules:
ERROR: Failed to mmap

If 4.4 TB of virtual pages are mapped in each process (this happens on startup), then we quickly exhaust the 48-bit (256 TB) addressable space with 72+ programs running (on a 36-core). Any idea where this 4.4 TB mapping comes from?

Also, the infamous DynamicLibrary.Shutdown unit test fails with Scudo, like with SnMalloc. That test covers a tricky situation at shutdown.

If 4.4 TB of virtual pages are mapped in each process (this happens on startup), then we quickly exhaust the 48-bit (256 TB) addressable space with 72+ programs running (on a 36-core). Any idea where this 4.4 TB mapping comes from?

The size of the Primary is defined in https://github.com/llvm/llvm-project/blob/master/compiler-rt/lib/scudo/scudo_platform.h#L75
Scudo reserves that size but doesn't commit it, then it incrementally commits when memory is needed within the reserved region.

@russell.gallop I see a lots of failing tests when running ninja check-all on a Scudo-enabled build (stage 2). Do you see the same thing on your end?

Not a lot but a few, including DynamicLibrary.Shutdown:

Failed Tests (4):
  LLVM-Unit :: Support/DynamicLibrary/./DynamicLibraryTests.exe/DynamicLibrary.Shutdown
  LLVM :: ExecutionEngine/MCJIT/multi-module-sm-pic-a.ll
  LLVM :: ExecutionEngine/OrcMCJIT/multi-module-sm-pic-a.ll
  LLVM :: ExecutionEngine/OrcMCJIT/stubs-sm-pic.ll

And the MCJIT ones seem intermittent (different ones fail each time).

If 4.4 TB of virtual pages are mapped in each process (this happens on startup), then we quickly exhaust the 48-bit (256 TB) addressable space with 72+ programs running (on a 36-core). Any idea where this 4.4 TB mapping comes from?

I guess that's just a problem with a larger number of cores then.

I'm going to concentrate on the standalone port as I think that's the way forward. If I get that working then can work through the other issues.

If 4.4 TB of virtual pages are mapped in each process (this happens on startup), then we quickly exhaust the 48-bit (256 TB) addressable space with 72+ programs running (on a 36-core). Any idea where this 4.4 TB mapping comes from?

The size of the Primary is defined in https://github.com/llvm/llvm-project/blob/master/compiler-rt/lib/scudo/scudo_platform.h#L75
Scudo reserves that size but doesn't commit it, then it incrementally commits when memory is needed within the reserved region.

I guess that, using Scudo as a general-purpose allocator, could set a limit on the number of cores that can be used at once, as @aganea found. Would there be any problem with making this very small (e.g. a couple of GB)?

Thanks

I guess that, using Scudo as a general-purpose allocator, could set a limit on the number of cores that can be used at once, as @aganea found. Would there be any problem with making this very small (e.g. a couple of GB)?

You can reduce the size of the Primary for Windows with an additional define in the platform file. You probably want to make sure there are at least a few gigs per region (e.g. the total size could be 256 GB).

Once again the memory is reserved but not committed, and this is on a per-process basis. There shouldn't be more than one Primary, and as such we shouldn't run out of VA space. We could possibly run out of memory if we allocate past the amount of RAM (+swap), but this is the committed amount.

aganea added a subscriber: maniccoder.EditedSep 16 2020, 11:28 AM

I guess that, using Scudo as a general-purpose allocator, could set a limit on the number of cores that can be used at once, as @aganea found. Would there be any problem with making this very small (e.g. a couple of GB)?

You can reduce the size of the Primary for Windows with an additional define in the platform file. You probably want to make sure there are at least a few gigs per region (e.g. the total size could be 256 GB).

Once again the memory is reserved but not committed, and this is on a per-process basis. There shouldn't be more than one Primary, and as such we shouldn't run out of VA space. We could possibly run out of memory if we allocate past the amount of RAM (+swap), but this is the committed amount.

I think reserving 4 TB hits a pathological case in the Windows NT kernel, where, for some reason, the application's VAD (Virtual Address Descriptor) tree is not cleared right away, but deferred to the system zero thread. Since that thread is low-priority, the VAD trees accumulate in the "Active List" until it hits the physical memory limit, then it goes to swap, and at some point the VirtualAlloc calls in the application fail. This looks like an edge case that wouldn't happen normally (lots of applications that quickly start, reserve several TB of vRAM, then shut down).
+ @maniccoder

As soon as I pause the ninja check-llvm tests (by blocking the console output), the zero thread now has more time and the free memory goes down again.

@cryptoad What happens if the primary was much smaller? Or if pages were reserved in much smaller ranges?

(a hardware CRC or AES implementation will certainly help for Scudo)

Actually this wasn't too hard to try out. I added "-msse4.2" to CMAKE_C_FLAGS and CMAKE_CXX_FLAGS (as suggested in scudo_allocator.cpp). This helps, but scudo is still a bit behind the default allocator for me on 6 cores

This improves things a bit, wall clock 145 sec before -> 136 sec after.

Scudo+options+hardware-crc32 - 5,997 cumulated seconds (all threads) - before it was 6,337 seconds

Time spent in the allocator itself:

Scudo+options+hardware-crc32 - 761 cumulated seconds - before it was 1,171 seconds

There's one more thing. I think the difference in performance, when compared with competing allocators, is that Scudo still does some level of locking. Like I mentioned previously, this doesn't scale on many-core machines for allocation-intensive applications like LLVM. Every waiting lock gives up its time slice to the OS scheduler, and that's bad for performance. Also, the more cores there are, the more we wait for the lock to become free.


Here you can see a Scudo-enabled LLD doing 10x more context switches than the Rpmalloc-enabled LLD:

One thing that is in favor of Scudo, though, is that it commits memory in much smaller blocks than Rpmalloc (peak LLD commit is 5 GB with Scudo, 11 GB with Rpmalloc). Mimalloc employs the same kind of strategy, with similar benefits.

@cryptoad Does SCUDO standalone differ in any of these aspects from this version?

@cryptoad What happens if the primary was much smaller? Or if pages were reserved in much smaller ranges?

The Primary can be made smaller, but this works better with the Standalone version, as some code was added to fall back to larger class sizes if a region is full (Android uses 256 MB per region).

@cryptoad Does SCUDO standalone differ in any of these aspects from this version?

So this requires a bit of background.
There are two models for the Thread Specific Data that holds the cached pointers: Shared (a pool of N caches is shared between all threads) and Exclusive (1 exclusive cache per thread).
For my initial port to Windows, I used the Shared model, with a pool of 32 caches max (it's a define in the platform header). If there is more than 32 cores, this can be increased.
I didn't try to make the Exclusive version work, mostly because I was using the Windows TLS API and the Shared fit right in with those, but it would get rid of a lot of the contention.

Overall with regard to the Standalone, it should be better on all accounts: faster (as we got rid of some of the quirks of sanitizer_common), lesser memory footprint, better reclaiming, more configurable.

I didn't try to make the Exclusive version work, mostly because I was using the Windows TLS API and the Shared fit right in with those, but it would get rid of a lot of the contention.

That sounds good. Right now, the delta between rpmalloc/mimalloc/snmalloc is less than 5 sec (for a run of 2 min 15 sec); I would assume that with the Exclusive mode, Scudo would fall in the same ballpark. Every other allocator that I've tried that does some level of locking (Hoard, ptmalloc, tcmalloc2) does not scale (is much slower) on many-core machines.

I'm going to concentrate on the standalone port as I think that's the way forward. If I get that working then can work through the other issues.

I'm considering a slight change of plan. @cryptoad, in the name of incremental development, would you be happy for me to put the scudo sanitizer version up (with working tests), or is this deprecated now? It may not be the best for performance long term but it's a smaller step than the standalone port and seems to work.

I'm considering a slight change of plan. @cryptoad, in the name of incremental development, would you be happy for me to put the scudo sanitizer version up (with working tests), or is this deprecated now? It may not be the best for performance long term but it's a smaller step than the standalone port and seems to work.

Sounds good to me.

aganea added a comment.EditedSep 22 2020, 1:27 PM

I'm also in favor; I think this is a good direction ahead. It'd be nice if the following issues were fixed -- in subsequent patches if you wish:

  • Stage1 ninja check-scudo fails many tests for me, see .
  • Stage2 ninja check-llvm fails after a while on high-core machines (the 4 TB issue mentioned in the comments above). Lowering AllocatorSize to 256 GB would fix the issue in the short term.
  • Fix & test the "exclusive" mode (or just skip to scudo-standalone if it's too complicated).
compiler-rt/lib/scudo/scudo_platform.h
67

&& SANITIZER_WINDOWS ?

compiler-rt/test/scudo/lit.cfg.py
19

Remove this line (duplicate).

llvm/lib/Support/CMakeLists.txt
76

Please also mention scudo in corresponding LLVM_INTEGRATED_CRT_ALLOC section in llvm/docs/CMake.rst.

I'm also in favor; I think this is a good direction ahead. It'd be nice if the following issues were fixed -- in subsequent patches if you wish:

  • Stage1 ninja check-scudo fails many tests for me, see .

I intend to fix these (or mark them as unsupported) for this patch.

  • Stage2 ninja check-llvm fails after a while on high-core machines (the 4 TB issue mentioned in the comments above). Lowering AllocatorSize to 256 GB would fix the issue in the short term.

I'll add that.

  • Fix & test the "exclusive" mode (or just skip to scudo-standalone if it's too complicated).

I don't intend to do that for this patch, as shared mode works. I will re-evaluate after the above.

I also intend to remove the -fsanitize related changes. I believe that they're not required on Windows.

Thanks
Russ

compiler-rt/lib/scudo/scudo_platform.h
67

I think that should be:

# if (defined(__aarch64__) && SANITIZER_ANDROID) || SANITIZER_WINDOWS

I'll either do that or add a new elif.