This is an archive of the discontinued LLVM Phabricator instance.

[scudo] Allow -fsanitize=scudo on Linux and Windows (WIP, don't land as is)
Needs Review · Public

Authored by russell.gallop on Aug 27 2020, 3:05 AM.

Details

Summary

This is basically https://reviews.llvm.org/D42519 updated for monorepo and support for scudo on -DLLVM_INTEGRATED_CRT_ALLOC.

This is to allow evaluation of Scudo as a replacement memory allocator on Windows for comparison with https://reviews.llvm.org/D71786

Diff Detail

Event Timeline

russell.gallop created this revision. Aug 27 2020, 3:05 AM
Herald added projects: Restricted Project, Restricted Project, Restricted Project. Aug 27 2020, 3:05 AM
Herald added subscribers: llvm-commits, Restricted Project, cfe-commits, mgorny.
russell.gallop requested review of this revision. Aug 27 2020, 3:05 AM

That's awesome! Is it meant to eventually be committed or only be used for comparison purposes?

That's awesome! Is it meant to eventually be committed or only be used for comparison purposes?

I'd like it to be committed, but can't claim I know the code from https://reviews.llvm.org/D42519 well enough. The good news is that I can build LLVM on Windows with this. Is there a good sanity check that it is actually using Scudo rather than silently using the standard alloc?

You marked D42519 as WIP, can you remember what was still TBD?

Also, it might make sense to separate out the "use" of sanitize=Scudo in LLVM from providing Windows support. I put them together here for evaluation purposes.

You marked D42519 as WIP, can you remember what was still TBD?

Maybe more tests could be enabled...

That's awesome! Is it meant to eventually be committed or only be used for comparison purposes?

I'd like it to be committed, but can't claim I know the code from https://reviews.llvm.org/D42519 well enough. The good news is that I can build LLVM on Windows with this. Is there a good sanity check that it is actually using Scudo rather than silently using the standard alloc?

Nothing except the tests. Compiling a sizeable application with Scudo as well.

You marked D42519 as WIP, can you remember what was still TBD?

Mostly because I didn't have the time and resources to make sure this would work over the long term.
I used my home Windows box to do this as a proof of concept that it could be done.

Also, it might make sense to separate out the "use" of sanitize=Scudo in LLVM from providing Windows support. I put them together here for evaluation purposes.

Yeah this makes sense.

The other point that is worth mentioning is that we moved all dev efforts to the "standalone" version of Scudo (eg: the one not depending on sanitizer_common in the standalone/ subdirectory).
There are enough differences that there could be some significant performance/memory footprint changes between the two.

Also depending on what you want to compare, disabling the Quarantine and other optional security features will make things faster and use less memory.

russell.gallop edited the summary of this revision. Aug 28 2020, 10:03 AM
russell.gallop updated this revision to Diff 291235. Edited Sep 11 2020, 9:15 AM
russell.gallop edited the summary of this revision.

Fixup scudo (sanitizer based) to work on Windows.

This makes use of the CRT alloc hooks from D71786.

To build with scudo on Windows use: -DLLVM_INTEGRATED_CRT_ALLOC=<llvm-project>/stage1/lib/clang/12.0.0/lib/windows/clang_rt.scudo-x86_64.lib -DLLVM_USE_CRT_RELEASE=MT
-DLLVM_USE_SANITIZER=Scudo is supported in this patch, but isn't required. @cryptoad, on Linux does this do anything other than add libraries when using clang to drive the linker?

Limitations:
  • Note that this is not using hardware CRC32.
  • This just hooks in the C scudo library, not the cxx library.

I evaluated this on a 3 stage LLVM build on Windows 10 2004 (in vs2019 16.7.3 environment) on a 6-core i7-8700k.

  • stage1 builds the scudo sanitizer on Windows: requires -DLLVM_ENABLE_PROJECTS=clang;lld;compiler-rt
  • stage2: built with -DCMAKE_C_COMPILER="<stage1>/bin/clang-cl.exe" -DCMAKE_CXX_COMPILER="<stage1>/bin/clang-cl.exe" -DCMAKE_LINKER="<stage1>/bin/lld-link.exe" -DLLVM_USE_CRT_RELEASE=MT -DCMAKE_BUILD_TYPE=Release
  • stage2_scudo: as stage2 plus -DLLVM_INTEGRATED_CRT_ALLOC=<stage1>/lib/clang/12.0.0/lib/windows/clang_rt.scudo-x86_64.lib

Then evaluated linking clang with ThinLTO:

  • stage3: -DCMAKE_C_COMPILER="<stage2>/bin/clang-cl.exe" -DCMAKE_CXX_COMPILER="<stage2>/bin/clang-cl.exe" -DCMAKE_LINKER="<stage2>/bin/lld-link.exe" -DLLVM_USE_CRT_RELEASE=MT -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_LTO=Thin
  • stage3_scudo: -DCMAKE_C_COMPILER="<stage2_scudo>/bin/clang-cl.exe" -DCMAKE_CXX_COMPILER="<stage2_scudo>/bin/clang-cl.exe" -DCMAKE_LINKER="<stage2_scudo>/bin/lld-link.exe" -DLLVM_USE_CRT_RELEASE=MT -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_LTO=Thin

I set /threads:12 and removed /lldltocache.

Without SCUDO_OPTIONS scudo seems to be about 25% slower.

>"hyperfine.exe" -m 3 -w 1 "cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt" "cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt"
Benchmark #1: cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt
  Time (mean ± σ):     268.209 s ±  4.966 s    [User: 18.1 ms, System: 6.6 ms]
  Range (min … max):   263.223 s … 273.155 s    3 runs

Benchmark #2: cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt
  Time (mean ± σ):     334.312 s ±  4.002 s    [User: 2.4 ms, System: 14.6 ms]
  Range (min … max):   329.889 s … 337.683 s    3 runs

Summary
  'cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt' ran
    1.25 ± 0.03 times faster than 'cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt'

I set scudo options to disable quarantine and mismatch checking and it seems to be about 8% slower.

>set SCUDO_OPTIONS=allocator_release_to_os_interval_ms=-1:QuarantineSizeKb=0:ThreadLocalQuarantineSizeKb=0:DeleteSizeMismatch=0:DeallocationTypeMismatch=0
>"hyperfine.exe" -m 3 -w 1 "cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt" "cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt"
Benchmark #1: cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt
  Time (mean ± σ):     273.772 s ±  3.624 s    [User: 1.3 ms, System: 8.6 ms]
  Range (min … max):   269.806 s … 276.909 s    3 runs

Benchmark #2: cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt
  Time (mean ± σ):     296.593 s ±  2.362 s    [User: 1.3 ms, System: 13.9 ms]
  Range (min … max):   293.917 s … 298.391 s    3 runs

Summary
  'cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt' ran
    1.08 ± 0.02 times faster than 'cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt'

It's worth noting that the run without scudo was not using all CPU (though it was still over 90%), so it is still hitting some locking issues. With both scudo runs, CPU was pegged at 100% until near the end of the link. From this it appears that scudo's locking behaviour is better than the default allocator's, and it would win out on a wide enough processor. The loss in straight-line performance is potentially due to CRC calculations not using hardware support.

russell.gallop added a comment. Edited Sep 11 2020, 9:20 AM

Is there a good sanity check that it is actually using Scudo rather than silently using the standard alloc?

It turns out it wasn't actually using scudo; it is now!
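
A cheap sanity check (a suggestion, not part of the patch): Scudo validates its chunk headers on deallocation, so a deliberate double-free run under the Scudo-linked binary should die with a Scudo error report rather than the stock CRT's behaviour. A minimal probe, assuming default Scudo options:

  #include <cstdlib>

  int main() {
    void *P = malloc(16);
    free(P);
    // Under Scudo this should abort with a Scudo-prefixed report about an
    // invalid chunk state; the default Windows CRT in a release build will
    // not print anything Scudo-like here.
    free(P);
    return 0;
  }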

Nothing except the tests. Compiling a sizeable application with Scudo as well.

Compiling clang and lld with Scudo and ThinLTO linking clang seems to work.

The other point that is worth mentioning is that we moved all dev efforts to the "standalone" version of Scudo (eg: the one not depending on sanitizer_common in the standalone/ subdirectory).
There are enough differences that there could be some significant performance/memory footprint changes between the two.

Thanks. I'm looking at porting the standalone variant, drawing on D42519 and Windows support from sanitizer_common. Does that sound like a reasonable approach?

Also depending on what you want to compare, disabling the Quarantine and other optional security features will make things faster and use less memory.

Would these be good flags to use:

set SCUDO_OPTIONS=allocator_release_to_os_interval_ms=-1:QuarantineSizeKb=0:ThreadLocalQuarantineSizeKb=0:DeleteSizeMismatch=0:DeallocationTypeMismatch=0

Thanks. I'm looking at porting the standalone variant, drawing on D42519 and Windows support from sanitizer_common. Does that sound like a reasonable approach?

Yup, the structure is a tad different, and other platforms (Fuchsia, Linux cpp) have some implementation of the platform specific functions.

Would these be good flags to use:

set SCUDO_OPTIONS=allocator_release_to_os_interval_ms=-1:QuarantineSizeKb=0:ThreadLocalQuarantineSizeKb=0:DeleteSizeMismatch=0:DeallocationTypeMismatch=0

Yup, no release and no quarantine is great. Size class map compile-time options could also make a difference.

aganea set the repository for this revision to rG LLVM Github Monorepo. Sep 11 2020, 3:37 PM
aganea added a comment. Edited Sep 11 2020, 3:39 PM

@russell.gallop Do you think you can upload the patch again by setting the Repository to "Monorepo"?

NVM I managed to do it. It's such a nightmare to apply patches when the above is not set upon uploading :-(

aganea added a comment. Edited Sep 13 2020, 6:20 PM

Thanks for working on this @russell.gallop!

I've reproduced your tests, please see below. The only difference is that I've used a ThinLTO build for stage2:
-DCMAKE_CXX_FLAGS="/GS- -Xclang -O3 -fstrict-aliasing -march=skylake-avx512 -flto=thin -fwhole-program-vtables".
Running with /opt:lldltojobs=all and no /lldltocache.
Results on a 36-core (dual socket) Xeon Gold 6140, Windows 10 version 2004.

(WinHeap vs. Scudo+options)

D:\llvm-project>hyperfine -m 3 -w 1 "cd d:\llvm-project\buildninjaRelWinHeap3 && d:\llvm-project\buildninjaRelWinHeap2\bin\lld-link.exe @CMakeFiles\clang.rsp" "cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp"
Benchmark #1: cd d:\llvm-project\buildninjaRelWinHeap3 && d:\llvm-project\buildninjaRelWinHeap2\bin\lld-link.exe @CMakeFiles\clang.rsp
  Time (mean ± σ):     664.086 s ± 18.740 s    [User: 0.0 ms, System: 0.0 ms]                                         
  Range (min … max):   647.070 s … 684.172 s    3 runs

Benchmark #2: cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp
  Time (mean ± σ):     145.619 s ±  0.140 s    [User: 0.0 ms, System: 8.1 ms]                                         
  Range (min … max):   145.522 s … 145.779 s    3 runs

Summary
  'cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp' ran
    4.56 ± 0.13 times faster than 'cd d:\llvm-project\buildninjaRelWinHeap3 && d:\llvm-project\buildninjaRelWinHeap2\bin\lld-link.exe @CMakeFiles\clang.rsp'
(Scudo+options vs. Rpmalloc)

D:\llvm-project>hyperfine -m 3 -w 1 "cd d:\llvm-project\buildninjaRelRpmalloc3 && d:\llvm-project\buildninjaRelRpMalloc2\bin\lld-link.exe @CMakeFiles\clang.rsp" "cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp"
Benchmark #1: cd d:\llvm-project\buildninjaRelRpmalloc3 && d:\llvm-project\buildninjaRelRpMalloc2\bin\lld-link.exe @CMakeFiles\clang.rsp
  Time (mean ± σ):     95.423 s ±  0.830 s    [User: 0.0 ms, System: 9.0 ms]                                          
  Range (min … max):   94.886 s … 96.380 s    3 runs

Benchmark #2: cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp
  Time (mean ± σ):     145.266 s ±  0.387 s    [User: 4.9 ms, System: 7.6 ms]                                         
  Range (min … max):   144.894 s … 145.666 s    3 runs

Summary
  'cd d:\llvm-project\buildninjaRelRpmalloc3 && d:\llvm-project\buildninjaRelRpMalloc2\bin\lld-link.exe @CMakeFiles\clang.rsp' ran
    1.52 ± 0.01 times faster than 'cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp'
(Scudo vs. Rpmalloc)

D:\llvm-project>hyperfine -m 3 -w 1 "cd d:\llvm-project\buildninjaRelRpmalloc3 && d:\llvm-project\buildninjaRelRpMalloc2\bin\lld-link.exe @CMakeFiles\clang.rsp" "cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp"
Benchmark #1: cd d:\llvm-project\buildninjaRelRpmalloc3 && d:\llvm-project\buildninjaRelRpMalloc2\bin\lld-link.exe @CMakeFiles\clang.rsp
  Time (mean ± σ):     95.435 s ±  0.059 s    [User: 0.0 ms, System: 8.0 ms]                                          
  Range (min … max):   95.385 s … 95.499 s    3 runs

Benchmark #2: cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp
  Time (mean ± σ):     270.967 s ±  1.366 s    [User: 4.8 ms, System: 0.0 ms]                                        
  Range (min … max):   269.397 s … 271.887 s    3 runs

Summary
  'cd d:\llvm-project\buildninjaRelRpmalloc3 && d:\llvm-project\buildninjaRelRpMalloc2\bin\lld-link.exe @CMakeFiles\clang.rsp' ran
    2.84 ± 0.01 times faster than 'cd d:\llvm-project\buildninjaRelScudo3 && d:\llvm-project\buildninjaRelScudo2\bin\lld-link.exe @CMakeFiles\clang.rsp'

Summary:

                Time                    Factor
WinHeap         664.086 s ± 18.740 s    1.0
Scudo           270.967 s ±  1.366 s    2.45
Scudo+options   145.619 s ±  0.140 s    4.56
Rpmalloc         95.423 s ±  0.830 s    6.95

CPU usage:
Rpmalloc - 3,944 cumulated seconds (all threads)


Scudo+options - 6,337 cumulated seconds (all threads)

Time spent in the allocator itself (note the different vertical scale in the graph)
(a hardware CRC or AES implementation will certainly help for Scudo)
Rpmalloc - 191 cumulated seconds


Scudo+options - 1,171 cumulated seconds

Memory usage:
Rpmalloc - Peaks at 11 GB commit (19 GB mapped)


Scudo+options - Peaks at 5 GB commit (although 4.4 TB of mapped pages!!!)

Re-upload patch with rG LLVM Github Monorepo set.

Thanks for working on this @russell.gallop!

I've reproduced your tests, please see below. The only difference is that I've used a ThinLTO build for stage2:

Thanks. It's good to know they're reproducible, and that it scales better than the default allocator.

(a hardware CRC or AES implemention will certainly help for Scudo)

Actually this wasn't too hard to try out. I added "-msse4.2" to CMAKE_C_FLAGS and CMAKE_CXX_FLAGS (as suggested in scudo_allocator.cpp). This helps, but scudo is still a bit behind the default allocator for me on 6 cores:

> set SCUDO_OPTIONS=allocator_release_to_os_interval_ms=-1:QuarantineSizeKb=0:ThreadLocalQuarantineSizeKb=0:DeleteSizeMismatch=0:DeallocationTypeMismatch=0
> hyperfine.exe -m 5 -w 1 "cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt" "cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt"
Benchmark #1: cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt
  Time (mean ± σ):     269.922 s ±  2.706 s    [User: 0.0 ms, System: 18.6 ms]
  Range (min … max):   265.969 s … 272.188 s    5 runs

Benchmark #2: cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt
  Time (mean ± σ):     285.522 s ±  0.440 s    [User: 12.3 ms, System: 9.9 ms]
  Range (min … max):   284.907 s … 286.056 s    5 runs

Summary
  'cd stage3\repro && f:\git\llvm-project\stage2\bin\lld-link @response.txt' ran
    1.06 ± 0.01 times faster than 'cd stage3_scudo\repro && f:\git\llvm-project\stage2_scudo\bin\lld-link @response.txt'

@russell.gallop I see a lot of failing tests when running ninja check-all on a Scudo-enabled build (stage 2). Do you see the same thing on your end?

==89136==ERROR: Scudo failed to allocate 0x10000 (65536) bytes of memory at address 0x4c0090e00 (error code: 1455)
==89136==Dumping process modules:
ERROR: Failed to mmap

If 4.4 TB of virtual pages are mapped in each process (this happens on startup), then we quickly exhaust the 48-bit (256 TB) addressable space with 72+ programs running (on a 36-core). Any idea where this 4.4 TB mapping comes from?

Also, the infamous DynamicLibrary.Shutdown unittest fails with Scudo, like with SnMalloc. That test covers a tricky situation at shutdown.

If 4.4 TB of virtual pages are mapped in each process (this happens on startup), then we quickly exhaust the 48-bit (256 TB) addressable space with 72+ programs running (on a 36-core). Any idea where this 4.4 TB mapping comes from?

The size of the Primary is defined in https://github.com/llvm/llvm-project/blob/master/compiler-rt/lib/scudo/scudo_platform.h#L75
Scudo reserves that size but doesn't commit it, then it incrementally commits when memory is needed within the reserved region.
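
A minimal sketch (not Scudo's actual code) of that reserve-then-commit pattern using the plain Win32 API: the whole Primary is reserved up front, charging no physical memory, and pages are committed on demand inside the reservation. The 4 GB size here is illustrative only.

  #include <windows.h>
  #include <cstdio>

  int main() {
    const SIZE_T PrimarySize = 1ULL << 32; // illustrative; Scudo's default is far larger
    // Reserve address space only: nothing is charged against RAM or the pagefile.
    void *Base = VirtualAlloc(nullptr, PrimarySize, MEM_RESERVE, PAGE_NOACCESS);
    if (!Base)
      return 1;
    // Commit a small piece inside the reservation once memory is actually needed.
    void *Chunk = VirtualAlloc(Base, 1 << 20, MEM_COMMIT, PAGE_READWRITE);
    printf("reserved %zu bytes at %p, committed 1 MB at %p\n", PrimarySize, Base, Chunk);
    return 0;
  }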

@russell.gallop I see a lots of failing tests when running ninja check-all on a Scudo-enabled build (stage 2). Do you see the same thing on your end?

Not a lot but a few, including DynamicLibrary.Shutdown:

Failed Tests (4):
  LLVM-Unit :: Support/DynamicLibrary/./DynamicLibraryTests.exe/DynamicLibrary.Shutdown
  LLVM :: ExecutionEngine/MCJIT/multi-module-sm-pic-a.ll
  LLVM :: ExecutionEngine/OrcMCJIT/multi-module-sm-pic-a.ll
  LLVM :: ExecutionEngine/OrcMCJIT/stubs-sm-pic.ll

And the MCJIT ones seem intermittent (different ones fail each time).

If 4.4 TB of virtual pages are mapped in each process (this happens on startup), then we quickly exhaust the 48-bit (256 TB) addressable space with 72+ programs running (on a 36-core). Any idea where this 4.4 TB mapping comes from?

I guess that's just a problem with a larger number of cores then.

I'm going to concentrate on the standalone port as I think that's the way forward. If I get that working then can work through the other issues.

If 4.4 TB of virtual pages are mapped in each process (this happens on startup), then we quickly exhaust the 48-bit (256 TB) addressable space with 72+ programs running (on a 36-core). Any idea where this 4.4 TB mapping comes from?

The size of the Primary is defined in https://github.com/llvm/llvm-project/blob/master/compiler-rt/lib/scudo/scudo_platform.h#L75
Scudo reserves that size but doesn't commit it, then it incrementally commits when memory is needed within the reserved region.

I guess that using scudo as a general purpose allocator could set a limit on the number of cores that can be used at once, as @aganea found. Would there be any problem with making this very small (e.g. a couple of GB)?

Thanks

I guess that using scudo as a general purpose allocator could set a limit on the number of cores that can be used at once, as @aganea found. Would there be any problem with making this very small (e.g. a couple of GB)?

You can reduce the size of the Primary for Windows with an additional define in the platform file. You probably want to make sure there are at least a few gigs per region (eg: the total size could be 256G).

Once again the memory is reserved but not committed, and this is on a per-process basis. There shouldn't be more than one Primary, and as such we shouldn't run out of VA space. We could possibly run out of memory if we allocate past the amount of RAM (+swap), but this is the committed amount.

aganea added a subscriber: maniccoder. Edited Sep 16 2020, 11:28 AM

I guess that using scudo as a general purpose allocator could set a limit on the number of cores that can be used at once, as @aganea found. Would there be any problem with making this very small (e.g. a couple of GB)?

You can reduce the size of the Primary for Windows with an additional define in the platform file. You probably want to make sure there are at least a few gigs per region (eg: the total size could be 256G).

Once again the memory is reserved but not committed, and this is on a per-process basis. There shouldn't be more than one Primary, and as such we shouldn't run out of VA space. We could possibly run out of memory if we allocate past the amount of RAM (+swap), but this is the committed amount.

I think reserving 4 TB hits a pathological case in the Windows NT kernel, where for some reason the application's VAD (Virtual Address Descriptor) tree is not cleared right away, but deferred to the system zero thread. Since that thread is low-priority, the VAD trees accumulate in the "Active List" until they hit the physical memory limit, then it goes to swap, and at some point the VirtualAlloc calls in the application fail. This looks like an edge case that wouldn't happen normally (lots of applications that quickly start, reserve several TB of virtual memory, then shut down).
+ @maniccoder

As soon as I pause the ninja check-llvm tests (by blocking the console output), the zero thread now has more time and the free memory goes down again.

@cryptoad What happens if the primary was much smaller? Or if pages were reserved in much smaller ranges?

(a hardware CRC or AES implementation will certainly help for Scudo)

Actually this wasn't too hard to try out. I added "-msse4.2" to CMAKE_C_FLAGS and CMAKE_CXX_FLAGS (as suggested in scudo_allocator.cpp). This helps, but scudo is still a bit behind the default allocator for me on 6 cores

This improves things a bit: wall clock 145 sec before -> 136 sec after.

Scudo+options+hardware-crc32 - 5,997 cumulated seconds (all threads) - before it was 6,337 seconds

Time spent in the allocator itself:

Scudo+options+hardware-crc32 - 761 cumulated seconds - before it was 1,171 seconds

There's one more thing. I think the difference in performance when compared with competing allocators is that SCUDO still does some level of locking. As I mentioned previously, this doesn't scale on many-core machines for allocation-intensive applications like LLVM. Every waiting lock gives up its time slice to the OS scheduler, and that's bad for performance. Also, the more cores there are, the more we wait for the lock to become free.


Here you can see a Scudo-enabled LLD doing 10x more context switches than the Rpmalloc-enabled LLD:

One thing that is in favor of Scudo though, is that it commits memory in much smaller blocks than Rpmalloc (peak LLD commit is 5 GB for Scudo, 11 GB for Rpmalloc). Mimalloc employs the same kind of strategy, with similar benefits.

@cryptoad Does SCUDO standalone differ in any of these aspects from this version?

@cryptoad What happens if the primary was much smaller? Or if pages were reserved in much smaller ranges?

The Primary can be made smaller, but this works better with the Standalone version as some code was added to fall back to larger class sizes if a region is full (Android uses 256 MB per region).

@cryptoad Does SCUDO standalone differ in any of these aspects from this version?

So this requires a bit of background.
There are two models for the Thread Specific Data that holds the cached pointers: Shared (a pool of N caches is shared between all threads) and Exclusive (1 exclusive cache per thread).
For my initial port to Windows, I used the Shared model, with a pool of 32 caches max (it's a define in the platform header). If there are more than 32 cores, this can be increased.
I didn't try to make the Exclusive version work, mostly because I was using the Windows TLS API and the Shared fit right in with those, but it would get rid of a lot of the contention.

Overall with regard to the Standalone, it should be better on all accounts: faster (as we got rid of some of the quirks of sanitizer_common), smaller memory footprint, better reclaiming, more configurable.
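
For illustration, here is a minimal sketch of the Shared model described above (a fixed pool of mutex-protected caches with a try-lock scan); names and structure are illustrative, not Scudo's real API:

  #include <array>
  #include <atomic>
  #include <mutex>

  struct ThreadCache { /* per-size-class free lists would live here */ };

  constexpr unsigned kPoolSize = 32; // cf. the pool-of-32 define mentioned above

  struct SharedTSDPool {
    std::array<std::mutex, kPoolSize> Locks;
    std::array<ThreadCache, kPoolSize> Caches;
    std::atomic<unsigned> Next{0};

    // Scan for an uncontended cache first; the blocking fallback is where
    // the time-slice-yielding contention discussed in this thread shows up.
    ThreadCache &acquire(unsigned &Index) {
      const unsigned Start = Next.fetch_add(1) % kPoolSize;
      for (unsigned I = 0; I < kPoolSize; ++I) {
        const unsigned C = (Start + I) % kPoolSize;
        if (Locks[C].try_lock()) {
          Index = C;
          return Caches[C];
        }
      }
      Locks[Start].lock(); // all caches busy: block on one
      Index = Start;
      return Caches[Start];
    }
    void release(unsigned Index) { Locks[Index].unlock(); }
  };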

I didn't try to make the Exclusive version work, mostly because I was using the Windows TLS API and the Shared fit right in with those, but it would get rid of a lot of the contention.

That sounds good. Right now, the delta between rpmalloc/mimalloc/snmalloc is less than 5 sec (for a run of 2 min 15 sec); I would assume that with the Exclusive mode, SCUDO would fall in the same ballpark. Any other allocator that I've tried that does some level of locking (Hoard, ptmalloc, tcmalloc2) does not scale (is much slower) on many-core machines.

I'm going to concentrate on the standalone port as I think that's the way forward. If I get that working then can work through the other issues.

I'm considering a slight change of plan. @cryptoad, in the name of incremental development, would you be happy for me to put the scudo sanitizer version up (with working tests), or is this deprecated now? It may not be the best for performance long term but it's a smaller step than the standalone port and seems to work.

I'm considering a slight change of plan. @cryptoad, in the name of incremental development, would you be happy for me to put the scudo sanitizer version up (with working tests), or is this deprecated now? It may not be the best for performance long term but it's a smaller step than the standalone port and seems to work.

Sounds good to me.

aganea added a comment. Edited Sep 22 2020, 1:27 PM

I'm also in favor; I think this is a good direction. It'd be nice if the following issues were fixed -- in subsequent patches if you wish:

  • Stage1 ninja check-scudo fails many tests for me.
  • Stage2 ninja check-llvm fails after a while on high-core machines (the 4 TB issue mentioned in comments above). Lowering AllocatorSize to 256 GB would fix the issue in the short term.
  • Fix & test the "exclusive" mode (or just skip to scudo-standalone if it's too complicated).
compiler-rt/lib/scudo/scudo_platform.h
67

&& SANITIZER_WINDOWS ?

compiler-rt/test/scudo/lit.cfg.py
19–20

Remove this line (duplicate).

llvm/lib/Support/CMakeLists.txt
61

Please also mention scudo in corresponding LLVM_INTEGRATED_CRT_ALLOC section in llvm/docs/CMake.rst.

I'm also in favor; I think this is a good direction. It'd be nice if the following issues were fixed -- in subsequent patches if you wish:

  • Stage1 ninja check-scudo fails many tests for me.

I intend to fix these (or mark as unsupported) for this patch.

  • Stage2 ninja check-llvm fails after a while on high-core machines (the 4 TB issue mentioned in comments above). Lowering AllocatorSize to 256 GB would fix the issue in the short term.

I'll add that.

  • Fix & test the "exclusive" mode (or just skip to scudo-standalone if it's too complicated).

I don't intend to do that for this patch, as shared mode works. I will re-evaluate after the above.

I also intend to remove the -fsanitize related changes. I believe that they're not required on Windows.

Thanks
Russ

compiler-rt/lib/scudo/scudo_platform.h
67

I think that should be:

# if (defined(__aarch64__) && SANITIZER_ANDROID) || SANITIZER_WINDOWS

I'll either do that or add a new elif.
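
For context, a sketch of how the surrounding block in scudo_platform.h could end up looking with that change (the non-Windows values follow the existing file; the 256G Windows figure is the short-term suggestion from above, not a landed change):

  # if SANITIZER_CAN_USE_ALLOCATOR64
  #  if (defined(__aarch64__) && SANITIZER_ANDROID) || SANITIZER_WINDOWS
  const uptr AllocatorSize = 0x4000000000ULL;   // 256G.
  #  elif defined(__aarch64__)
  const uptr AllocatorSize = 0x10000000000ULL;  // 1T.
  #  else
  const uptr AllocatorSize = 0x40000000000ULL;  // 4T.
  #  endif
  # endif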

Apologies for the delay, I've had a few other things on.

I've worked through the tests. Several I've marked as unsupported on Windows as they use unixy things like nm, LD_PRELOAD or fork(). I added a C++ threads version of threads.c to test that on Windows (cxx_threads.cpp).

There are 5 remaining failures:

Failed Tests (5):
  Scudo-x86_64 :: aligned-new.cpp
  Scudo-x86_64 :: mismatch.cpp
  Scudo-x86_64 :: options.cpp
  Scudo-x86_64 :: sized-delete.cpp
  Scudo-x86_64 :: sizes.cpp

Testing Time: 11.61s
  Unsupported:  8
  Passed     : 13
  Failed     :  5

These are all down to new/delete not being replaced by the scudo versions, so scudoAllocate and scudoDeallocate don't get the appropriate size and Type parameters. The cxx library should be linked by SanitizerArgs::addArgs so I'm not sure why this is. @cryptoad, it looks like you did some work in scudo_new_delete.cpp, did you have this working?

Does new/delete need to work for this, or should I mark these tests as unsupported on Windows for the purposes of a step towards getting the standalone scudo working?

Stage 2 builds with -DLLVM_INTEGRATED_CRT_ALLOC=<llvm-project>/stage1/lib/clang/12.0.0/lib/windows/clang_rt.scudo-x86_64.lib -DLLVM_USE_CRT_RELEASE=MT.

Thank you for all your work!
IIRC everything was working except, as you point out, some of the very Unixy tests.
Is your build creating a clang_rt.scudo_cxx library that also needs to be linked? The C++ specific interceptors usually end up in a separate library that is not linked into straight C programs, which is the most likely culprit for new/delete not being intercepted.
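
A self-contained sketch of what those C++ interceptors look like, with hypothetical stand-ins for scudo's internal entry points (the real scudoAllocate/scudoDeallocate take the size, alignment, and allocation type, which is what lets the mismatch checks fire; exact signatures may differ):

  #include <cstddef>
  #include <cstdlib>
  #include <new>

  enum AllocType { FromMalloc, FromNew, FromNewArray }; // illustrative
  // Hypothetical stubs so the sketch compiles standalone; the real entry
  // points live inside the scudo runtime and record size/alignment/type.
  static void *scudoAllocate(std::size_t Size, std::size_t Align, AllocType Type) {
    (void)Align; (void)Type;
    return std::malloc(Size);
  }
  static void scudoDeallocate(void *Ptr, std::size_t Size, std::size_t Align,
                              AllocType Type) {
    (void)Size; (void)Align; (void)Type;
    std::free(Ptr);
  }

  // These definitions live in the separate scudo_cxx library; if it is not
  // linked, the CRT's own operator new/delete stay in effect, Scudo never
  // sees the size or type, and the sized-delete/mismatch checks cannot fire.
  void *operator new(std::size_t Size) {
    if (void *P = scudoAllocate(Size, 0, FromNew))
      return P;
    throw std::bad_alloc(); // keep the standard operator new contract
  }
  void operator delete(void *Ptr) noexcept { scudoDeallocate(Ptr, 0, 0, FromNew); }
  void operator delete(void *Ptr, std::size_t Size) noexcept {
    scudoDeallocate(Ptr, Size, 0, FromNew);
  }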

These are all down to new/delete not being replaced by the scudo versions, so scudoAllocate and scudoDeallocate don't get the appropriate size and Type parameters. The cxx library should be linked by SanitizerArgs::addArgs so I'm not sure why this is. @cryptoad, it looks like you did some work in scudo_new_delete.cpp, did you have this working?

Does new/delete need to work for this, or should I mark these tests as unsupported on Windows for the purposes of a step towards getting the standalone scudo working?

Stage 2 builds with -DLLVM_INTEGRATED_CRT_ALLOC=<llvm-project>/stage1/lib/clang/12.0.0/lib/windows/clang_rt.scudo-x86_64.lib -DLLVM_USE_CRT_RELEASE=MT.

russell.gallop edited the summary of this revision.

Apologies for the delay, I've had other things taking my time.

Latest version uploaded. This fixes stage1 lit tests (on Windows and Linux) and adds scudo_cxx to linking.

Usage has changed slightly: -DLLVM_INTEGRATED_CRT_ALLOC should now point to the directory with the scudo built libraries (e.g. c:/git/llvm-project/stage1/lib/clang/12.0.0/lib/windows) rather than the library itself. -fsanitize=scudo works for the lit tests (as clang adds the libraries when driving the linker) but doesn't work for building LLVM, as cmake invokes the linker directly, so this uses LLVM_INTEGRATED_CRT_ALLOC.

There are two issues I'm aware of now:

  • Linking c-index-test.exe with scudo fails due to operator new being redefined:
lld-link: error: void * __cdecl operator new(unsigned __int64) was replaced

Or from link.exe

clang_rt.scudo_cxx-x86_64.lib(scudo_new_delete.cpp.obj) : error LNK2005: "void * __cdecl operator new(unsigned __int64)" (??2@YAPEAX_K@Z) already defined in libclang.lib(libclang.dll)
...
  • With scudo all lit tests (generally) pass, but there are some intermittent failures (Access Violation) on the following tests:

ExecutionEngine/MCJIT/test-global-init-nonzero-sm-pic.ll
ExecutionEngine/MCJIT/test-ptr-reloc-sm-pic.ll
ExecutionEngine/MCJIT/multi-module-sm-pic-a.ll

This happens with VS2019 or clang-cl as the toolchain. I've run multiple times without scudo and don't see these. Unfortunately I haven't been able to get this to fail in a debugger.

aganea added a comment. Jan 8 2021, 8:29 AM
  • With scudo all lit tests (generally) pass, but there are some intermittent failures (Access Violation) on the following tests:

ExecutionEngine/MCJIT/test-global-init-nonzero-sm-pic.ll
ExecutionEngine/MCJIT/test-ptr-reloc-sm-pic.ll
ExecutionEngine/MCJIT/multi-module-sm-pic-a.ll

This happens with VS2019 or clang-cl as the toolchain. I've run multiple times without scudo and don't see these. Unfortunately I haven't been able to get this to fail in a debugger.

Could you possibly comment out sys::PrintStackTraceOnErrorSignal in InitLLVM.cpp, run specifically those tests several times through LIT until they crash, and attach when the popup shows up?

Remove -fsanitize=scudo support for Windows in LLVM cmake, using LLVM_INTEGRATED_CRT_ALLOC instead.
Remove scudo_cxx from LLVM_INTEGRATED_CRT_ALLOC.

I managed to get this to fail in the debugger (for the cross-module-sm-pic-a.ll test):

 	000001655e9d001e()	Unknown
 	000001655e9d0019()	Unknown
 	0000017e5eb6b410()	Unknown
 	0000017c5eb63810()	Unknown
 	0000017e5eb6b410()	Unknown
 	0000017e5eb6b410()	Unknown
 	0000016f5eb65b58()	Unknown
>	lli.exe!llvm::MCJIT::runFunction(llvm::Function * F, llvm::ArrayRef<llvm::GenericValue> ArgValues) Line 578	C++
 	lli.exe!llvm::ExecutionEngine::runFunctionAsMain(llvm::Function * Fn, const std::vector<std::string,std::allocator<std::string>> & argv, const char * const * envp) Line 467	C++
 	lli.exe!main(int argc, char * * argv, char * const * envp) Line 646	C++
 	[Inline Frame] lli.exe!invoke_main() Line 78	C++
 	lli.exe!__scrt_common_main_seh() Line 288	C++
 	kernel32.dll!00007ffd145a7034()	Unknown
 	ntdll.dll!00007ffd1657d0d1()	Unknown

It appears to be failing within the interpreted code. Maybe this is a problem between the memory allocator and the Memory Manager.

I tried running these tests repeatedly with scudo (sanitizer) on Linux but didn't see it fail. I also tried repeating these tests with rpmalloc on Windows and didn't see the failures so it does appear to be specific to scudo on Windows. I'll keep investigating.

I've focused on the test test-global-init-nonzero-sm-pic.ll, which fails when writing to an address which (I believe) should be in the .data section but isn't.

With some breakpoints in SectionMemoryManager.cpp, it appears that this fails when the top 32 bits of the allocated .text address and the allocated .data address are different:

Allocating Data section ".data"
Returning aligned address 0x000001ed 30f60000 of size 0x0000000000000004
Allocating code section ".text"
Returning aligned address 0x0000022d 31050000 of size 0x000000000000003a

And it works when they happen to be the same. When this fails, the address causing the access violation is the top 32 bits of the .text address combined with the bottom 32 bits of the .data address, e.g. 0x0000022d 30f60000. This doesn't fail without scudo as the top 32 bits of the addresses always seem to be the same.
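
A worked check of why that overflows (using the section bases quoted above): with the small code model, RuntimeDyldELF must encode the .text-to-.data distance as a signed 32-bit PC-relative offset (the isInt<32> assertion shown below), and these allocations are roughly 256 GiB apart.

  #include <cstdint>
  #include <cstdio>

  int main() {
    const uint64_t Text = 0x0000022d31050000ULL; // .text base from the log above
    const uint64_t Data = 0x000001ed30f60000ULL; // .data base from the log above
    const int64_t Offset = (int64_t)(Data - Text);
    // Mirrors the isInt<32>(RealOffset) check in RuntimeDyldELF.cpp.
    const bool Fits = Offset >= INT32_MIN && Offset <= INT32_MAX;
    printf("offset = %lld bytes, fits in int32: %s\n", (long long)Offset,
           Fits ? "yes" : "no"); // prints "no": the offset is about -256 GiB
    return 0;
  }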

With assertions enabled this is causing an assert:

f:\git\llvm-project\stage1_scudo>"f:\git\llvm-project\stage1_scudo\bin\lli.exe" "-mtriple=x86_64-pc-windows-msvc-elf" "-relocation-model=pic" "-code-model=small" "f:\git\llvm-project\llvm\test\ExecutionEngine\MCJIT\test-global-init-nonzero-sm-pic.ll"
Assertion failed: isInt<32>(RealOffset), file F:\git\llvm-project\llvm\lib\ExecutionEngine\RuntimeDyld\RuntimeDyldELF.cpp, line 300
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: f:\\git\\llvm-project\\stage1_scudo\\bin\\lli.exe -mtriple=x86_64-pc-windows-msvc-elf -relocation-model=pic -code-model=small f:\\git\\llvm-project\\llvm\\test\\ExecutionEngine\\MCJIT\\test-global-init-nonzero-sm-pic.ll
 #0 0x00007ff63eb423c5 HandleAbort F:\git\llvm-project\llvm\lib\Support\Windows\Signals.inc:408:0
 #1 0x00007ff63f89d951 raise minkernel\crts\ucrt\src\appcrt\misc\signal.cpp:547:0
 #2 0x00007ff63f8617cc abort minkernel\crts\ucrt\src\appcrt\startup\abort.cpp:71:0
 #3 0x00007ff63f8825b8 common_assert_to_stderr<wchar_t> minkernel\crts\ucrt\src\appcrt\startup\assert.cpp:175:0
 #4 0x00007ff63f882d42 _wassert minkernel\crts\ucrt\src\appcrt\startup\assert.cpp:443:0
 #5 0x00007ff63e749e96 llvm::RuntimeDyldELF::resolveX86_64Relocation(class llvm::SectionEntry const &, unsigned __int64, unsigned __int64, unsigned int, __int64, unsigned __int64) F:\git\llvm-project\llvm\lib\ExecutionEngine\RuntimeDyld\RuntimeDyldELF.cpp:302:0
 #6 0x00007ff63e7496be llvm::RuntimeDyldELF::resolveRelocation(class llvm::RelocationEntry const &, unsigned __int64) F:\git\llvm-project\llvm\lib\ExecutionEngine\RuntimeDyld\RuntimeDyldELF.cpp:932:0
 #7 0x00007ff63e72e656 llvm::RuntimeDyldImpl::resolveRelocationList(class llvm::SmallVector<class llvm::RelocationEntry, 64> const &, unsigned __int64) F:\git\llvm-project\llvm\lib\ExecutionEngine\RuntimeDyld\RuntimeDyld.cpp:1077:0
 #8 0x00007ff63e72e52c ::operator++ C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.28.29333\include\list:166:0
 #9 0x00007ff63e72e52c ::operator++ C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.28.29333\include\list:247:0
#10 0x00007ff63e72e52c llvm::RuntimeDyldImpl::resolveLocalRelocations(void) F:\git\llvm-project\llvm\lib\ExecutionEngine\RuntimeDyld\RuntimeDyld.cpp:145:0
#11 0x00007ff63e72e872 llvm::RuntimeDyldImpl::resolveRelocations(void) F:\git\llvm-project\llvm\lib\ExecutionEngine\RuntimeDyld\RuntimeDyld.cpp:139:0
#12 0x00007ff63e532601 llvm::MCJIT::finalizeLoadedModules(void) F:\git\llvm-project\llvm\lib\ExecutionEngine\MCJIT\MCJIT.cpp:245:0
#13 0x00007ff63e532bfd ::{dtor} F:\git\llvm-project\llvm\include\llvm\ADT\SmallVector.h:1045:0
#14 0x00007ff63e532bfd llvm::MCJIT::finalizeObject(void) F:\git\llvm-project\llvm\lib\ExecutionEngine\MCJIT\MCJIT.cpp:271:0
#15 0x00007ff63dcfa06c main F:\git\llvm-project\llvm\tools\lli\lli.cpp:633:0
#16 0x00007ff63f81527c invoke_main d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:78:0
#17 0x00007ff63f81527c __scrt_common_main_seh d:\agent\_work\63\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:288:0
#18 0x00007ffd145a7034 (C:\WINDOWS\System32\KERNEL32.DLL+0x17034)
#19 0x00007ffd1657d0d1 (C:\WINDOWS\SYSTEM32\ntdll.dll+0x4d0d1)

I believe that the MCJIT failures are due to bug https://bugs.llvm.org/show_bug.cgi?id=24978 rather than a problem in the Scudo port, so I have added details of how I hit it to that bugzilla and opened two reviews to get this landed:
https://reviews.llvm.org/D96120 - [scudo] Port scudo sanitizer to Windows
https://reviews.llvm.org/D96133 - Allow building with scudo memory allocator on Windows

Please can you take a look?

I have not added the LLVM_USE_SANITIZER support for scudo from this review. -fsanitize adds the scudo libraries when the linker is driven from clang, but the LLVM cmake build on Windows uses link.exe directly, so this doesn't help. And since the scudo sanitizer is planned to go away, I didn't think it made sense to add that support on Linux either.