
RFC: [Support] On Windows, add optional support for rpmalloc
Needs Review · Public

Authored by aganea on Dec 20 2019, 2:12 PM.

Details

Summary

This patch optionally replaces the CRT allocator (i.e., malloc and free) with rpmalloc (public domain licence).

This changes the memory stack from:

new/delete -> MS VC++ CRT malloc/free -> HeapAlloc -> VirtualAlloc

to

new/delete -> rpmalloc -> VirtualAlloc

Problem

The Windows heap is thread-safe by default, and the ThinLTO codegen does a lot of allocations in each thread:

In addition, on many-core systems, this effectively blocks other threads to a point where only a very small fraction of the CPU time is used:

Before patch (here on Windows 10, build 1709):

We can see that a whopping 80% of the CPU time is spent waiting (blocked) on other threads (780 sec / 72 cores = 10 sec total) (graph with D71775 applied):

Threads are blocked waiting on the heap lock:

The thread above is awakened by the heap lock being released in another thread:

Solution

We simply use rpmalloc instead of VC++'s CRT allocator, which uses the Windows heap under the hood.

"rpmalloc - Rampant Pixels Memory Allocator
This library provides a public domain cross platform lock free thread caching 16-byte aligned memory allocator implemented in C"
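
For context on how the replacement works at link time: with the static CRT (/MT), LLVMSupport simply provides its own definitions of the C allocation entry points, which forward to rpmalloc. A stripped-down sketch of the idea only; the actual file shipped with rpmalloc (malloc.c) also covers aligned allocations, other CRT entry points, and per-thread initialization via TLS callbacks on Windows:

/* Illustrative sketch; see llvm/lib/Support/Windows/rpmalloc/malloc.c for the real override. */
#include <stddef.h>
#include "rpmalloc.h"

void *malloc(size_t size) { return rpmalloc(size); }
void *calloc(size_t count, size_t size) { return rpcalloc(count, size); }
void *realloc(void *ptr, size_t size) { return rprealloc(ptr, size); }
void free(void *ptr) { rpfree(ptr); }

Because these definitions live in LLVMSupport.lib, the linker must consider them before the libucrt.lib versions, which is why the link order and /INCLUDE:malloc questions come up later in this review.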

The feature can be enabled with the cmake flags -DLLVM_ENABLE_RPMALLOC=ON -DLLVM_USE_CRT_RELEASE=MT. It is currently available only for Windows, but rpmalloc already supports Darwin, FreeBSD, and Linux, so it would be easy to enable it for Unix as well. It currently uses /MT because it is easier that way, and I'm not sure /MD can be overloaded without code patching at runtime (I could investigate that later, but the DLL thunks slow things down).

In addition, rpmalloc supports large memory pages, which can be enabled at run-time through the environment variable LLVM_RPMALLOC_PAGESIZE=[4k|2M|1G].
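
For illustration, the page-size override boils down to reading that environment variable and handing a page size to rpmalloc before the first allocation. A rough sketch, assuming rpmalloc's rpmalloc_initialize_config/rpmalloc_config_t API is used; the real wiring lives in llvm/lib/Support/Windows/Memory.inc and may differ in detail, and the helper name below is hypothetical:

/* Hypothetical helper, for illustration only. */
#include <stdlib.h>
#include <string.h>
#include "rpmalloc.h"

static void initialize_rpmalloc_from_env(void) {
  rpmalloc_config_t config;
  const char *env = getenv("LLVM_RPMALLOC_PAGESIZE");
  memset(&config, 0, sizeof(config));
  if (env && strcmp(env, "2M") == 0)
    config.page_size = 2u * 1024 * 1024;    /* 2 MiB large pages */
  else if (env && strcmp(env, "1G") == 0)
    config.page_size = 1024u * 1024 * 1024; /* 1 GiB huge pages */
  /* "4k" or unset: leave page_size at 0 so rpmalloc queries the system page size. */
  rpmalloc_initialize_config(&config);
}

Note that large pages additionally require SeLockMemoryPrivilege on the process token; see the discussion of AdjustTokenPrivileges further down in this review.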

After patch:

After patch, with D71775 applied, all CPU sockets are used:
(the dark blue part of the graph represents the time spent in the kernel; see below for why)

In addition to the heap lock, there's a kernel bug in some versions of Windows 10, where accessing newly allocated virtual pages triggers the page zero-out mechanism, which is itself protected by a global lock, further blocking memory allocations.

If we dig deeper, we effectively see ExpWaitForSpinLockExclusiveAndAcquire taking way too much time (the blue vertical lines show where it is called on the timeline):
(patch is applied)

When using the latest Windows build 1909, with this patch applied and D71775, LLD now reaches over 90% CPU usage:

ThinLTO timings

Overall, this patch along with D71775 gives significantly better link times with ThinLTO, at least on Windows. The link times below are for Ubisoft's Rainbow 6: Siege PC Final ThinLTO build. Timings are only for the full link (no ThinLTO cache).

LLD 10 was produced by a two-stage build.
In case [1] the second stage uses -DLLVM_USE_CRT_RELEASE=MT.
In case [2] the second stage uses -DLLVM_ENABLE_RPMALLOC=ON -DLLVM_USE_CRT_RELEASE=MT.
In case [3] the second stage uses -DLLVM_ENABLE_RPMALLOC=ON -DLLVM_USE_CRT_RELEASE=MT -DLLVM_ENABLE_LTO=Thin, and both -DCMAKE_C_FLAGS -DCMAKE_CXX_FLAGS are set to "/GS- -Xclang -O3 -Xclang -fwhole-program-vtables -fstrict-aliasing -march=skylake-avx512".
The LLVM tests link with ThinLTO, while the MSVC tests evidently run full LTO.

Diff Detail

Event Timeline

aganea created this revision.Dec 20 2019, 2:12 PM
Herald added a project: Restricted Project.
ychen added a subscriber: ychen.Dec 20 2019, 2:25 PM

Just a nit on the patch description: This change replaces the CRT allocator (i.e., malloc and free). The "default Windows heap allocator" could be misunderstood to mean replacing the OS-level heap manager (i.e., https://docs.microsoft.com/en-us/windows/win32/memory/heap-functions).

The code comments are clear. It's just the beginning of the patch description that could throw someone off.

This changes the memory stack from:

new/delete -> MS VC++ CRT malloc/free -> HeapAlloc -> VirtualAlloc

to

new/delete -> rpmalloc -> VirtualAlloc

aganea edited the summary of this revision. (Show Details)Dec 21 2019, 7:51 AM
aganea added a subscriber: ruiu.

Thanks @amccarth, fixed!

Note in the bar charts how the 28-core system is faster than the 36-core one. The 28-core does not suffer from the "processor group" issue fixed in D71775: its 28 cores/56 threads all fit in a single "processor group", so they are all available to the application, while the 36 cores/72 threads are exposed to the application as two "processor groups", of which only one is used by default. After D71775, the 36-core is evidently faster.

It currently uses /MT because it is easier that way,

Based on my experience with Windows ASan, yes, let's limit this to /MT[d]. Malloc interception gets very hard otherwise.

+ thinlto users: @gbiv @inglorion

llvm/CMakeLists.txt
337

The malloc interception will 100% interfere with sanitizers. I think it's worth preemptively defending against that by erroring out if LLVM_USE_SANITIZER is non-empty and we are using rpmalloc.

llvm/include/llvm/Config/config.h.cmake
350–351

I think this should be LLVM_ENABLE_RPMALLOC, even though it is an internal option (not in llvm-config.h.cmake).

llvm/lib/Support/Windows/Memory.inc
220

I guess it should just be ::getenv instead of o::getenv.

llvm/lib/Support/Windows/rpmalloc/malloc.c
1 ↗(On Diff #234949)

Since this code is portable, I wonder if it would be better to move it to lib/Support/rpmalloc. I would expect it to work essentially out of the box on Linux. I have heard that teams at Google found that tcmalloc greatly speeds up clang, even in a non-threaded context. We can let folks experiment with that use case.

llvm/unittests/Support/DynamicLibrary/CMakeLists.txt
42–45

What an exciting failure mode. :)

Thanks for the patch. I applied it (to 8188c998ffa4d20253444b257402907d2aa74dc2), specified LLVM_ENABLE_RPMALLOC=ON and -DLLVM_USE_CRT_RELEASE=MT and tried to build with MS cl.exe/link.exe. It fails creating a library:

   Creating library lib\LLVM-C.lib and object lib\LLVM-C.exp
LLVMSupport.lib(rpmalloc.c.obj) : error LNK2005: calloc already defined in libucrt.lib(calloc.obj)
LLVMSupport.lib(rpmalloc.c.obj) : error LNK2005: free already defined in libucrt.lib(free.obj)
LLVMSupport.lib(rpmalloc.c.obj) : error LNK2005: malloc already defined in libucrt.lib(malloc.obj)
bin\LLVM-C.dll : fatal error LNK1169: one or more multiply defined symbols found

Is there any reason this couldn't be built with these host tools? If not, please could you get this working? Thanks.

aganea added a comment.EditedJan 2 2020, 8:52 PM

It fails creating a library

This use case is supported (with both Visual Studio 2017 and 2019). The error you're seeing most likely occurs because of the OBJ/lib link order. I would assume a symbol gets pulled in, and somehow link.exe considers libucrt.lib first, before LLVMSupport.lib (where our malloc override is).

Could you please run ninja -v, and ensure LLVMSupport.lib appears first on the linker cmd-line? (before other libs).
Now add /INCLUDE:malloc to the link cmd-line, re-run it, and see if that fixes the issue. That will force the malloc symbol to be resolved first, and consequently make the linker consider LLVMSupport.lib before any #pragma comment(linker, "/defaultlib:libucrt.lib") directive is encountered.

If nothing works, could you please post the exact cmake cmd-line you're running, along with the versions of Visual Studio and the Windows Kit? It'd also be interesting to have the output of that linker cmd-line when running again with /VERBOSE, so I can see how malloc is pulled in from libucrt.lib.

It fails creating a library

Could you please run ninja -v, and ensure LLVMSupport.lib appears first on the linker cmd-line? (before other libs).

The link.exe command line is:

>C:\PROGRA~2\MIB055~1\2019\PROFES~1\VC\Tools\MSVC\1424~1.283\bin\Hostx64\x64\link.exe /nologo tools\remarks-shlib\CMakeFiles\Remarks.dir\libremarks.cpp.obj tools\remarks-shlib\CMakeFiles\Remarks.dir\__\__\resources\windows_version_resource.rc.res /out:bin\Remarks.dll /implib:lib\Remarks.lib /pdb:bin\Remarks.pdb /dll /version:0.0 /machine:x64 /INCREMENTAL:NO /DEF:<path>/llvm-project/build_vs/tools/remarks-shlib/Remarks.def lib\LLVMRemarks.lib lib\LLVMBitstreamReader.lib lib\LLVMSupport.lib psapi.lib shell32.lib ole32.lib uuid.lib advapi32.lib delayimp.lib -delayload:shell32.dll -delayload:ole32.dll lib\LLVMDemangle.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST /MANIFESTFILE:bin\Remarks.dll.manifest

Now add /INCLUDE:malloc to the link cmd-line, re-run it, and see if that fixes the issue. That will force the malloc symbol to be resolved first, and consequently make the linker consider LLVMSupport.lib before any #pragma comment(linker, "/defaultlib:libucrt.lib") directive is encountered.

With /INCLUDE:malloc the above link line completes successfully.

russell.gallop added a comment.EditedJan 3 2020, 7:03 AM

I have done some performance evaluations here and LLVM_ENABLE_RPMALLOC does seem to improve ThinLTO link times and CPU utilisation during ThinLTO.

There are two other things to note:
1). This also seems to improve compile time by around 5% for me. Presumably this is due to some other benefit of rpmalloc as (AFAIK) clang.exe does not use multiple threads.
2). This can significantly increase memory usage. Part of this can be attributed to the increased parallelism but I don't think that entirely explains it. I think that rpmalloc is using more memory. Have you experimented with settings such as here: https://github.com/mjansson/rpmalloc/blob/develop/CACHE.md?

Edit: To be clear, I would be happy with the additional memory usage for the gain in time, but I think that it should be noted as it may not be a good trade off for everyone.

1). This also seems to improve compile time by around 5% for me. Presumably this is due to some other benefit of rpmalloc as (AFAIK) clang.exe does not use multiple threads.

Yes, generally rpmalloc improves single-threaded allocation performance over the default runtime, especially for small block allocations (below 1KiB), enough to warrant using it only for this use case.

2). This can significantly increase memory usage. Part of this can be attributed to the increased parallelism but I don't think that entirely explains it. I think that rpmalloc is using more memory. Have you experimented with settings such as here: https://github.com/mjansson/rpmalloc/blob/develop/CACHE.md?

Like most thread-caching allocators, rpmalloc will trade memory usage for performance. You are certainly right that increased parallelism also contributes, but like you, I think that most of the increase comes from the rpmalloc internals. However, I'm looking into some LLVM statistics provided by @aganea to see if the overhead can be reduced without sacrificing too much of the performance gains.

(disclaimer: I'm the author of rpmalloc and will naturally be biased in any discussions regarding its performance)

1). This also seems to improve compile time by around 5% for me. Presumably this is due to some other benefit of rpmalloc as (AFAIK) clang.exe does not use multiple threads.

Yes, generally rpmalloc improves single-threaded allocation performance over the default runtime, especially for small block allocations (below 1KiB), enough to warrant using it only for this use case.

Thanks for the info.

2). This can significantly increase memory usage. Part of this can be attributed to the increased parallelism but I don't think that entirely explains it. I think that rpmalloc is using more memory. Have you experimented with settings such as here: https://github.com/mjansson/rpmalloc/blob/develop/CACHE.md?

I'm looking into some LLVM statistics provided by @aganea to see if the overhead can be reduced without sacrificing too much of the performance gains.

Thanks. In your experience, is rpmalloc performance sensitive to different host machines (i.e. memory hierarchy), or is it mainly dependent on the application (things like allocation sizes)?

(disclaimer: I'm the author of rpmalloc and will naturally be biased in any discussions regarding its performance)

Okay. Thanks for writing it!

I would say it's mostly down to application usage patterns. The worst case is probably a gc-like usage where one thread does all allocation and another all deallocation, as this will cause all blocks to cross the thread cache via an atomic pointer CAS and eventually cause larger spans of blocks to traverse the thread caches via the global cache. However, even this scenario will probably be significantly faster than the standard runtime allocator.

I have not had the opportunity to test this on other hardware than single-socket consumer hardware, game consoles and handheld devices, mostly due to me working in that spectrum and being the sole developer of it. It does seem that @aganea has tried it on more service facing hardware, perhaps he can give additional input.

I would say it's mostly down to application usage patterns. The worst case is probably a gc-like usage where one thread does all allocation and another all deallocation, as this will cause all blocks to cross the thread cache via an atomic pointer CAS and eventually cause larger spans of blocks to traverse the thread caches via the global cache. However, even this scenario will probably be significantly faster than the standard runtime allocator.

I have not had the opportunity to test this on other hardware than single-socket consumer hardware, game consoles and handheld devices, mostly due to me working in that spectrum and being the sole developer of it. It does seem that @aganea has tried it on more service facing hardware, perhaps he can give additional input.

Okay, thanks.

aganea updated this revision to Diff 236639.Jan 7 2020, 11:21 AM
aganea marked 5 inline comments as done.
  • Changes as suggested by @rnk
  • Fixed linking remarks-shlib as reported by @russell.gallop
  • Reverted to the default rpmalloc settings (no more unlimited mode), i.e., ENABLE_UNLIMITED_CACHE is now 0.
  • Support rpmalloc's ENABLE_STATISTICS - when enabled, allocation statistics will be printed on process exit.

It does seem that @aganea has tried it on more service facing hardware, perhaps he can give additional input.

All I can say is that we've been using it in production for the past year (on editor/production builds) and we're thrilled; the gains are absolutely massive. All multi-threaded workloads are largely improved. Some figures from early last year:

But there's an extra memory cost: building AC Odyssey's Greece world data requires 220 GB of RAM with rpmalloc instead of 180 GB with dlmalloc.

2). This can significantly increase memory usage. Part of this can be attributed to the increased parallelism but I don't think that entirely explains it. I think that rpmalloc is using more memory.

At first, I configured rpmalloc in the "unlimited" mode because I wanted maximum performance. I now reverted to default rpmalloc flags, and the memory usage has decreased (albeit still higher than the Windows Heap), with no impact on link time (meaning that timings are still the same as in the Summary). Below are "Max" values when linking clang.exe with ThinLTO:

riccibruno added a comment.EditedJan 8 2020, 4:30 AM

I have done some performance evaluations here and LLVM_ENABLE_RPMALLOC does seem to improve ThinLTO link times and CPU utilisation during ThinLTO.

There are two other things to note:
1). This also seems to improve compile time by around 5% for me. Presumably this is due to some other benefit of rpmalloc as (AFAIK) clang.exe does not use multiple threads.

Is rpmalloc using large pages by default when possible? I have been toying with using large pages for some of the bump-pointer allocators in clang, namely the allocator of ASTContext and the allocator of Sema, and I am getting a ~5% improvement in parsing speed.

2). This can significantly increase memory usage. Part of this can be attributed to the increased parallelism but I don't think that entirely explains it. I think that rpmalloc is using more memory. Have you experimented with settings such as here: https://github.com/mjansson/rpmalloc/blob/develop/CACHE.md?

Edit: To be clear, I would be happy with the additional memory usage for the gain in time, but I think that it should be noted as it may not be a good trade off for everyone.

russell.gallop added inline comments.Jan 8 2020, 8:41 AM
llvm/tools/remarks-shlib/CMakeLists.txt
15

I think this should be:

set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -INCLUDE:malloc")

as it needs adding onto the shared library link flags.

I also had to add something similar to llvm/tools/llvm-shlib/CMakeLists.txt:

--- a/llvm/tools/llvm-shlib/CMakeLists.txt
+++ b/llvm/tools/llvm-shlib/CMakeLists.txt
@@ -172,4 +172,9 @@ if(MSVC)
   # Finally link the target.
   add_llvm_library(LLVM-C SHARED INSTALL_WITH_TOOLCHAIN ${SOURCES} DEPENDS intrinsics_gen)

+  if (LLVM_ENABLE_RPMALLOC AND MSVC)
+    # Make sure we search LLVMSupport first, before the CRT libs
+    set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -INCLUDE:malloc")
+  endif()
+
 endif()

With those, it builds for me with VS2017.

Is rpmalloc using large pages by default when possible? I have been toying with using large pages for some of the bump-pointer allocators in clang, namely the allocator of ASTContext and the allocator of Sema, and I am getting a ~5% improvement in parsing speed.

Not by default, but you can use set LLVM_RPMALLOC_PAGESIZE=2M or set LLVM_RPMALLOC_PAGESIZE=1G to enable large pages. However, I realize that is currently broken (it crashes); I'll investigate and get back.

aganea updated this revision to Diff 237082.EditedJan 9 2020, 7:48 AM
  • Integrated changes as suggested by @russell.gallop - could you please re-apply the patch on a clean git checkout and see if this works?
  • Fixed usage of LLVM_RPMALLOC_PAGESIZE -- it seems __attribute__((used)) doesn't always prevent the symbol from being evicted by LLD -- maybe using TLS callbacks is an edge case. I extended the usage of /INCLUDE to Clang, like for MSVC.
  • Fixed enabling large pages in rpmalloc, when they are provided in the 'config' struct. Previously, AdjustTokenPrivileges wasn't called in this case, and that is required for VirtualAlloc(MEM_LARGE_PAGES, ...) to work (see the sketch below).
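
For reference, the privilege setup mentioned in the last bullet is the standard Win32 sequence; a self-contained sketch for illustration only, not the patch's exact code:

#include <windows.h>

/* Enable SeLockMemoryPrivilege so that VirtualAlloc(..., MEM_LARGE_PAGES, ...) can succeed. */
static int EnableLockMemoryPrivilege(void) {
  HANDLE Token;
  TOKEN_PRIVILEGES TP;
  BOOL Ok;
  if (!OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &Token))
    return 0;
  TP.PrivilegeCount = 1;
  TP.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
  if (!LookupPrivilegeValue(NULL, SE_LOCK_MEMORY_NAME, &TP.Privileges[0].Luid)) {
    CloseHandle(Token);
    return 0;
  }
  Ok = AdjustTokenPrivileges(Token, FALSE, &TP, 0, NULL, NULL);
  CloseHandle(Token);
  /* AdjustTokenPrivileges can return TRUE without granting the privilege; check GetLastError. */
  return Ok && GetLastError() == ERROR_SUCCESS;
}

/* A large-page allocation then rounds the size up to GetLargePageMinimum() and passes
 * MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES to VirtualAlloc. */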

Along the way, I discovered what looks like a new NT kernel issue in Windows 10 build 1909. Still need to confirm this with Microsoft, but from looking at the profile traces, code in the NT kernel (specific to large pages) seems to take almost 100% CPU time when allocating large pages in many threads.


The problem is not there when running on 1903 (same exact hardware, dual Xeon 6140, but different machine -- using only one CPU socket in these tests):

As for the link timings with LLD, I see no difference when using large pages (I've tried 2MB and 1GB pages). I would imagine the cache misses are dominant over DTLB misses, but I would need to dig a bit deeper with VTune. Please let me know if you see a difference on your end.

At first, I configured rpmalloc in the "unlimited" mode because I wanted maximum performance. I now reverted to default rpmalloc flags, and the memory usage has decreased (albeit still higher than the Windows Heap), with no impact on link time (meaning that timings are still the same as in the Summary). Below are "Max" values when linking clang.exe with ThinLTO:

I have tested the patch (https://reviews.llvm.org/differential/diff/236639/ from Tuesday) and get broadly similar speed performance to before, with less memory usage, so this seems like a better trade-off.

I haven't looked through the code in detail but I'd support this patch. In terms of getting it in, I think that there are a few questions:
1). Is the licensing of this okay for the LLVM project? I don't feel qualified to say; maybe someone from the LLVM Foundation can answer.
2). How will LLVM_ENABLE_RPMALLOC be tested (it could be added to an existing bot such as the PS4 windows bot)?

Am I the only one who is a little bit uncomfortable with including a replacement to the system allocator in LLVM? It seems to me that this is something that is clearly under the responsibility of the C library. Now this might still be worth doing given the large performance improvement, but yuck...

aganea added a comment.EditedJan 9 2020, 9:26 AM

Am I the only one who is a little bit uncomfortable with including a replacement to the system allocator in LLVM? It seems to me that this is something that is clearly under the responsibility of the C library. Now this might still be worth doing given the large performance improvement, but yuck...

Well, I think that should give consistent performance across systems. We could (maybe) do something like LD_PRELOAD with Detours, but that's less plug-and-play; not sure what @maniccoder thinks. I'm open to other alternatives.
The other thing to keep in mind is that we (professional game developers) are using LLVM toolchains provided by first parties like Sony or Nintendo, and we have to ship our game(s) with their toolchain. The rationale for pushing all these optimizations upstream is so that they eventually get them, and we can possibly use them easily at some point.

Regarding the license: if you believe in public domain, then rpmalloc is in the public domain. If not, it's released under MIT, which should be compatible with most projects. If you still have issues with this, let me know and I can probably accommodate any needs you might have.

I would say using things like Detours is overkill and opens a whole can of worms for something as simple as linking in a new allocator. I consider the memory allocator to be like any other part of the standard runtime - if it doesn't suit your needs or performs badly, use or write something that does. It's not like the runtime can be the optimal solution to every use case; it is simply a solution provided out of the box. Memory allocation is not some magic thing that only the runtime can do. But then again, I might be biased :)

russell.gallop added a comment.EditedJan 10 2020, 3:56 AM

Am I the only one who is a little bit uncomfortable with including a replacement to the system allocator in LLVM?

No, you're not. I don't think it's ideal but I'm not sure what the alternative is to remove this bottleneck for ThinLTO on Windows.

It might be possible to use the Win32 API (https://docs.microsoft.com/en-gb/windows/win32/api/heapapi/) HeapCreate() to create a heap for each thread and HeapAlloc() with HEAP_NO_SERIALIZE, but HeapCreate() heaps are finite, so you'd have to handle exhausting them, and by then you're practically writing a new memory allocator...
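
For the record, a rough illustration of that alternative (per-thread, non-serialized heaps); this is not part of the patch, and it mostly shows why it gets awkward, since a block freed on a different thread than the one that allocated it would not be safe with HEAP_NO_SERIALIZE:

#include <windows.h>

static __declspec(thread) HANDLE ThreadHeap; /* one growable heap per thread */

static void *thread_local_alloc(SIZE_T Size) {
  if (!ThreadHeap)
    ThreadHeap = HeapCreate(HEAP_NO_SERIALIZE, /*dwInitialSize=*/0, /*dwMaximumSize=*/0);
  return HeapAlloc(ThreadHeap, 0, Size);
}

static void thread_local_free(void *Ptr) {
  /* Only valid if Ptr was allocated by this same thread. */
  HeapFree(ThreadHeap, 0, Ptr);
}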

It does feel strange that we have to replace the default allocator to get efficient memory allocation on multiple threads, which doesn't sound like an uncommon requirement to me.

Including the entire rpmalloc source does add quite a bit of source code in one go. Would it be better to keep the rpmalloc source outside LLVM and just include hooks?

With all respect to @maniccoder, there are other replacement memory allocators that could be considered as alternatives (e.g.):
tcmalloc from https://github.com/gperftools/gperftools (as @rnk mentioned above)
https://github.com/microsoft/mimalloc

It seems to me that this is something that is clearly under the responsibility of the C library. Now this might still be worth doing given the large performance improvement, but yuck...

I wouldn't be happy with unconditionally replacing the C library malloc (without a cmake option). LLVM should certainly continue to be tested, and work, with the default allocator as we wouldn't want to unintentionally develop a dependence on an alternative.

I guess we can hope that the standard implementation is improved as multi-threaded code becomes more prevalent. glibc for instance added a thread cache to malloc in 2017: https://sourceware.org/ml/libc-alpha/2017-08/msg00010.html

Am I the only one who is a little bit uncomfortable with including a replacement to the system allocator in LLVM?

No, you're not. I don't think it's ideal but I'm not sure what the alternative is to remove this bottleneck for ThinLTO on Windows.

It might be possible to use the Win32 API (https://docs.microsoft.com/en-gb/windows/win32/api/heapapi/) HeapCreate() to create a heap for each thread and HeapAlloc() with HEAP_NO_SERIALIZE, but HeapCreate() heaps are finite, so you'd have to handle exhausting them, and by then you're practically writing a new memory allocator...

Are they? I only looked quickly at the reference for HeapCreate, but I see: If dwMaximumSize is 0, the heap can grow in size. The heap's size is limited only by the available memory. [...] However this does not help if the allocations are all over the place since you have to explicitly use HeapAlloc.

It does feel strange that we have to replace the default allocator to get efficient memory allocation on multiple threads, which doesn't sound like an uncommon requirement to me.

Including the entire rpmalloc source does add quite a bit of source code in one go. Would it be better to keep the rpmalloc source outside LLVM and just include hooks?

With all respect to @maniccoder, there are other replacement memory allocators that could be considered as alternatives (e.g.):
tcmalloc from https://github.com/gperftools/gperftools (as @rnk mentioned above)
https://github.com/microsoft/mimalloc

It seems to me that this is something that is clearly under the responsibility of the C library. Now this might still be worth doing given the large performance improvement, but yuck...

I wouldn't be happy with unconditionally replacing the C library malloc (without a cmake option). LLVM should certainly continue to be tested, and work, with the default allocator as we wouldn't want to unintentionally develop a dependence on an alternative.

I guess we can hope that the standard implementation improves with increased use of multi-threaded code. glibc for instance added a thread cache to malloc in 2017: https://sourceware.org/ml/libc-alpha/2017-08/msg00010.html

Yes, it really does not seem like an uncommon usage pattern. In the meantime, I don't have a better idea here and I don't want to block this if this is what is needed.

I think you will find that the Heap API has subpar performance since it too uses locks to ensure thread safety (unless you can guarantee that memory blocks never traverse thread boundaries).

Also, even though glibc now has thread caches, I suspect you will find that rpmalloc/mimalloc/tcmalloc provide a decent performance increase, maybe not an order of magnitude but still noticeable.

I think the main benefit of rpmalloc is that it is small in code size, integrating and maintaining tcmalloc for example would most likely be more work.

aganea added a comment.EditedFeb 3 2020, 8:52 AM

In D71786#1813866, @russell.gallop wrote:
With all respect to @maniccoder, there are other replacement memory allocators that could be considered as alternatives (e.g.):
tcmalloc from https://github.com/gperftools/gperftools (as @rnk mentioned above)
https://github.com/microsoft/mimalloc

I've tried mimalloc -- it's a tad faster than rpmalloc. Maybe this has to do with mimalloc mapping fewer memory pages than rpmalloc.
I used mimalloc as a .lib, not compiled as part of LLVM, because of mismatches between Clang headers and Windows headers (see here).

Figures below are for linking clang.exe with the cmake flags [3] mentioned in the Summary. LTO cache is disabled; the Windows cache is warm, figures taken after several runs.
Using a two-stage lld-link.exe with the same cmake flags and mimalloc/rpmalloc used in both stages.
D71775 is also applied.
The 36-thread test uses ThinLTO with one thread per core, but using all CPU sockets. The 72-thread test uses all hardware threads, on all cores, on all CPU sockets.

Memory usage:

                         Working Set   Private Working Set   Commit    Virtual Size
rpmalloc - 36 threads    25.1 GB       16.5 GB               19.9 GB   37.4 GB
mimalloc - 36 threads    25.6 GB       16.3 GB               18.3 GB   33.3 GB
rpmalloc - 72 threads    33.6 GB       25.1 GB               28.5 GB   46 GB
mimalloc - 72 threads    30.5 GB       21.2 GB               23.4 GB   38.4 GB

It's quite neat to see lld-link remaining at 99% CPU almost all the time during link :-D

I've also tried tcmalloc, but it performs significantly worse than both mimalloc and rpmalloc, mainly because of the SpinLock implementation, which downgrades to a kernel Sleep after a few spins. This is an absolute no-go for many-core systems when a lot of allocations are involved, like ThinLTO.
I also quickly tested ptmalloc3, but out of the box it doesn't pass the LLVM tests. However, I don't expect any miracles there; quickly browsing through the code shows the same spinlock behavior as tcmalloc, and so does Hoard. Possibly glibc 2.26+ would perform better if it has a per-thread cache, but the mimalloc bench uses glibc 2.7 and shows worse performance in general.

Please let me know if you'd like more testing on any of the above, or any other allocator.

I think both mimalloc & rpmalloc are already on the asymptote in terms of performance. They both use about 3.5% of CPU time on the 36-core, which includes calls to malloc & free and the kernel calls to VirtualAlloc.
At this point, any further improvements will come from reducing allocations in parts of LLVM that allocate a lot:

Including the entire rpmalloc source does add quite a bit of source code in one go. Would it be better to keep the rpmalloc source outside LLVM and just include hooks?

I think integrating it by default into LLVM would ease things and reduce the points of failure for enabling it (from a toolchain provider POV).

I wouldn't be happy with unconditionally replacing the C library malloc (without a cmake option). LLVM should certainly continue to be tested, and work, with the default allocator as we wouldn't want to unintentionally develop a dependence on an alternative.

Agreed. What about compiling it in by default, but enabling it at runtime with an environment variable or a cmd-line flag? That would be the best option, and it wouldn't require you (Sony) or any other LLVM toolchain provider (Nintendo) to ship a different set of binaries. Customers could selectively enable the alternate allocator, at the expense of higher memory usage. We (Ubisoft) would very much like it if first-party LLVM toolchain providers could include this (or an alternate allocator) as part of their toolchains for Windows. Our games need to ship built with those toolchains; we can't use our own downstream forks (at least not at the moment).

As for code size,

D:\Utils>cloc-1.80.exe d:\llvm-project\llvm\lib\Support\rpmalloc\
       4 text files.
       4 unique files.
       2 files ignored.

github.com/AlDanial/cloc v 1.80  T=0.50 s (6.0 files/s, 6472.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                                2            280            341           2352
C/C++ Header                     1             30            110            123
-------------------------------------------------------------------------------
SUM:                             3            310            451           2475
-------------------------------------------------------------------------------

2475 LoC in 2 files seems like a pretty reasonable addition to me.

And:

D:\Utils>cloc-1.80.exe d:\git\mimalloc\src
      15 text files.
      15 unique files.
       1 file ignored.

github.com/AlDanial/cloc v 1.80  T=0.50 s (30.0 files/s, 13760.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                               15            889           1064           4927
-------------------------------------------------------------------------------
SUM:                            15            889           1064           4927
-------------------------------------------------------------------------------

D:\Utils>cloc-1.80.exe d:\git\mimalloc\include
       6 text files.
       6 unique files.
       1 file ignored.

github.com/AlDanial/cloc v 1.80  T=0.50 s (12.0 files/s, 3194.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C/C++ Header                     6            261            308           1028
-------------------------------------------------------------------------------
SUM:                             6            261            308           1028
-------------------------------------------------------------------------------

mimalloc comes with about 2x larger code size, and more .c files to potentially integrate into LLVM.


At this point, if rpmalloc could be improved to the same level of memory usage as mimalloc, I'd be willing to wait a few weeks/months before going forward. @maniccoder WDYT? I can provide a full step-by-step repro.
Otherwise, I'd take mimalloc as an external .lib, but this is not my preferred solution.

Please advise:

  1. rpmalloc or mimalloc?
  2. .lib or in-tree?
  3. Compiled by default in LLVM, but disabled at runtime / optionally enabled by the user?

I'm more than happy to take a look at this use case and see what can be done to reduce the peak commit when using rpmalloc. Perhaps if you file an issue on the rpmalloc GitHub project, we can take it from there.

rnk added a comment.Feb 11 2020, 5:45 PM

I haven't found time to review this yet, but this is still a very interesting change, and I hope to get to it soon.

Some other things I wanted to do in this space are:

  • Histogram most commonly allocated AST node types
  • Try raising BumpPtrAllocator "slab" size above 4096, that seems too small. Most AST nodes are allocated with BumpPtrAllocator.

These changes would likely only affect parsing speed, not middle/backend speed which dominates thinlto time.

In D71786#1871211, @rnk wrote:

I haven't found time to review this yet, but this is still a very interesting change, and I hope to get to it soon.

Some other things I wanted to do in this space are:

  • Histogram most commonly allocated AST node types
  • Try raising BumpPtrAllocator "slab" size above 4096, that seems too small. Most AST nodes are allocated with BumpPtrAllocator. These changes would likely only affect parsing speed, not middle/backend speed which dominates thinlto time.

This already exists, right? This is -print-stats.

I agree that 4K is too small. Virtual memory is cheap, so we should just allocate large slabs. That said, in my experiments increasing the slab size had no measurable impact by itself (at least for clang with -fsyntax-only). What did have a meaningful impact is using large pages (5%-6% in -fsyntax-only on all of Boost, obviously hardware dependent). Additionally I suspect, but have no data, that the improvement will be comparatively larger in a normal compilation because the DTLB and STLB are typically shared between the logical threads on each core.

aganea added a comment.EditedFeb 15 2020, 5:59 AM
In D71786#1871211, @rnk wrote:
  • Try raising BumpPtrAllocator "slab" size above 4096, that seems too small. Most AST nodes are allocated with BumpPtrAllocator.
  • I agree that 4K is too small. Virtual memory is cheap, so we should just allocate large slabs. That said, in my experiments increasing the slab size had no measurable impact by itself (at least for clang with -fsyntax-only). What did have a meaningful impact is using large pages (5%-6% in -fsyntax-only on all of Boost, obviously hardware dependent). Additionally I suspect, but have no data, that the improvement will be comparatively larger in a normal compilation because the DTLB and STLB are typically shared between the logical threads on each core.

We could add a PageAllocator instead of the current MallocAllocator, and make the default slab size 64 KB. The beauty of this is that we could either: 1. use large pages by just flipping a flag; or 2. allocate large contiguous slabs of virtual memory and commit memory pages on the go (see the sketch below). This is a technique we use in games to avoid reallocating/moving around allocated blocks. It only works with 4 KB pages, not with large pages (which are committed by default).
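
A minimal sketch of the reserve-then-commit idea from the previous paragraph; this is illustrative only (the PageAllocator name and helpers here are hypothetical), and it assumes the reserved size is a multiple of the 64 KB commit step:

#include <windows.h>

typedef struct {
  char *Base;        /* start of the reserved address range */
  size_t Committed;  /* bytes committed so far */
  size_t Used;       /* bytes handed out so far */
  size_t Reserved;   /* total reserved bytes */
} PageAllocator;

static int PageAllocatorInit(PageAllocator *PA, size_t ReserveBytes) {
  PA->Base = (char *)VirtualAlloc(NULL, ReserveBytes, MEM_RESERVE, PAGE_NOACCESS);
  PA->Committed = PA->Used = 0;
  PA->Reserved = ReserveBytes;
  return PA->Base != NULL;
}

static void *PageAllocatorAlloc(PageAllocator *PA, size_t Size) {
  void *Result;
  if (PA->Used + Size > PA->Reserved)
    return NULL; /* reserved range exhausted */
  while (PA->Used + Size > PA->Committed) {
    /* Commit the next 64 KB chunk of the already-reserved range, on demand. */
    if (!VirtualAlloc(PA->Base + PA->Committed, 64 * 1024, MEM_COMMIT, PAGE_READWRITE))
      return NULL;
    PA->Committed += 64 * 1024;
  }
  Result = PA->Base + PA->Used;
  PA->Used += Size;
  return Result;
}
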
But further down the line, it'd be nice to collect statistics about how the BumpPtrAllocator grows/is used. And the same for stack-based structures like the "Small" containers, so that we can set default container sizes that actually have the most impact. Nicolas Fleury has a talk here about some of the techniques we use at Ubisoft (check at 21:26).

Adding more to the discussion: the LLVM compilation times on Linux shown in the article below seem to exhibit the same compile-time variability as seen between Windows versions.
While I haven't had time to test on these systems, part of the variability could come from the default heap allocator.
See https://www.phoronix.com/scan.php?page=article&item=3990x-freebsd-bsd&num=3