This is an archive of the discontinued LLVM Phabricator instance.

[Support] On Windows, add optional support for {rpmalloc|snmalloc|mimalloc}
Closed, Public

Authored by aganea on Dec 20 2019, 2:12 PM.
Tokens
"Like" token, awarded by evgeny777.

Details

Summary

This patch optionally replaces the CRT allocator (i.e., malloc and free) with rpmalloc (public domain licence), snmalloc (MIT licence), or mimalloc.

This changes the memory stack from:

new/delete -> MS VC++ CRT malloc/free -> HeapAlloc -> VirtualAlloc

to

new/delete -> {rpmalloc|snmalloc|mimalloc} -> VirtualAlloc

Problem

The Windows heap is thread-safe by default, and the ThinLTO codegen does a lot of allocations in each thread:

In addition, on many-core systems, this effectively blocks other threads to a point where only a very small fraction of the CPU time is used:

Before patch (here on Windows 10, build 1709):

We can see that a whopping 80% of the CPU time is spent waiting (blocked) on other threads (780 sec / 72 cores ≈ 10 sec total) (graph with D71775 applied):

Threads are blocked waiting on the heap lock:

The thread above is awakened by the heap lock being released in another thread:

Solution

We simply use rpmalloc instead of VC++'s CRT allocator, which uses the Windows heap under the hood.

"rpmalloc - Rampant Pixels Memory Allocator
This library provides a public domain cross platform lock free thread caching 16-byte aligned memory allocator implemented in C"

The feature can be enabled with the cmake flags -DLLVM_INTEGRATED_CRT_ALLOC=D:/git/rpmalloc -DLLVM_USE_CRT_RELEASE=MT. It is currently available only for Windows, but rpmalloc already supports Darwin, FreeBSD, and Linux, so it would be easy to enable it for Unix as well. It currently uses /MT because it is easier that way, and I'm not sure /MD can be overloaded without code patching at runtime (I could investigate that later, but the DLL thunks slow things down).
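
For illustration, the /MT override essentially works by letting LLVMSupport define the CRT allocation entry points itself, so the linker resolves them there before libucrt. A minimal sketch only, not the actual patch code; the rpmalloc API names below are taken from its public header and assumed correct:

  // Sketch: forwarding the CRT entry points to rpmalloc. The real override
  // lives in rpmalloc's malloc.c, which also handles new/delete, the aligned
  // variants, and early initialization (rpmalloc_initialize must run before
  // the first allocation, e.g. from a TLS callback or a CRT init hook).
  #include <stddef.h>
  #include "rpmalloc.h"

  void *malloc(size_t size)               { return rpmalloc(size); }
  void *calloc(size_t count, size_t size) { return rpcalloc(count, size); }
  void *realloc(void *ptr, size_t size)   { return rprealloc(ptr, size); }
  void  free(void *ptr)                   { rpfree(ptr); }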

In addition rpmalloc supports large memory pages, which can be enabled at run-time through the environment variable LLVM_RPMALLOC_PAGESIZE=[4k|2M|1G].

After patch:

After patch, with D71775 applied, all CPU sockets are used:
(the dark blue part of the graph represents the time spent in the kernel, see below why)

In addition to the heap lock, there's a kernel bug in some versions of Windows 10, where accessing newly allocated virtual pages triggers the page zero-out mechanism, which is itself protected by a global lock, which further blocks memory allocations.

If we dig deeper, we effectively see ExpWaitForSpinLockExclusiveAndAcquire taking way too much time (the blue vertical lines show where it is called on the timeline):
(patch is applied)

When using the latest Windows build 1909, with this patch applied and D71775, LLD now reaches over 90% CPU usage:

ThinLTO timings

Globally, this patch along with D71775 gives significantly better link times with ThinLTO, on Windows at least. The link times below are for Ubisoft's Rainbow 6: Siege PC Final ThinLTO build. Timings are only for the full link (no ThinLTO cache).

LLD 10 was crafted out of a two-stage build.
In case [1] the second stage uses -DLLVM_USE_CRT_RELEASE=MT.
In case [2] the second stage uses -DLLVM_INTEGRATED_CRT_ALLOC=D:/git/rpmalloc -DLLVM_USE_CRT_RELEASE=MT.
In case [3] the second stage uses -DLLVM_INTEGRATED_CRT_ALLOC=D:/git/rpmalloc -DLLVM_USE_CRT_RELEASE=MT -DLLVM_ENABLE_LTO=Thin, and both -DCMAKE_C_FLAGS and -DCMAKE_CXX_FLAGS are set to "/GS- -Xclang -O3 -Xclang -fwhole-program-vtables -fstrict-aliasing -march=skylake-avx512".
The LLVM tests link with ThinLTO, while the MSVC tests evidently run full LTO.

Diff Detail

Event Timeline

aganea created this revision.Dec 20 2019, 2:12 PM
Herald added a project: Restricted Project.
ychen added a subscriber: ychen.Dec 20 2019, 2:25 PM

Just a nit on the patch description: This change replaces the CRT allocator (i.e., malloc and free). The "default Windows heap allocator" could be misunderstood to mean replacing the OS-level heap manager (i.e., https://docs.microsoft.com/en-us/windows/win32/memory/heap-functions).

The code comments are clear. It's just the beginning of the patch description that could throw someone off.

This changes the memory stack from:

new/delete -> MS VC++ CRT malloc/free -> HeapAlloc -> VirtualAlloc

to

new/delete -> rpmalloc -> VirtualAlloc

aganea edited the summary of this revision. (Show Details)Dec 21 2019, 7:51 AM
aganea added a subscriber: ruiu.

Thanks @amccarth, fixed!

Note in the bar charts how the 28-core system is faster when compared to the 36-core. The 28-core does not suffer from the "processor group" issue fixed in D71775: its 28 cores/56 threads all fit in a single "processor group", so they are all available to the application, while the 36 cores/72 threads are presented to the application as two "processor groups", of which only one is used by default. After D71775, the 36-core is evidently faster.

It currently uses /MT because it is easier that way,

Based on my experience with Windows ASan, yes, let's limit this to /MT[d]. Malloc interception gets very hard otherwise.

+ thinlto users: @gbiv @inglorion

llvm/CMakeLists.txt
346

The malloc interception will 100% interfere with sanitizers. I think it's worth preemptively defending against that by erroring out if LLVM_USE_SANITIZER is non-empty and we are using rpmalloc.

llvm/include/llvm/Config/config.h.cmake
356–357 ↗(On Diff #234949)

I think this should be LLVM_ENABLE_RPMALLOC, even though it is an internal option (not in llvm-config.h.cmake).

llvm/lib/Support/Windows/Memory.inc
219 ↗(On Diff #234949)

I guess it should just be ::getenv instead of o::getenv.

llvm/lib/Support/Windows/rpmalloc/malloc.c
1 ↗(On Diff #234949)

Since this code is portable, I wonder if it would be better to move it to lib/Support/rpmalloc. I would expect it to essentially work out of the box on Linux. I have heard that teams at Google found that tcmalloc greatly speeds up clang, even in a non-threaded context. We can let folks experiment with that use case.

llvm/unittests/Support/DynamicLibrary/CMakeLists.txt
42–45

What an exciting failure mode. :)

Thanks for the patch. I applied it (to 8188c998ffa4d20253444b257402907d2aa74dc2), specified LLVM_ENABLE_RPMALLOC=ON and -DLLVM_USE_CRT_RELEASE=MT, and tried to build with MS cl.exe/link.exe. It fails when creating a library:

   Creating library lib\LLVM-C.lib and object lib\LLVM-C.exp
LLVMSupport.lib(rpmalloc.c.obj) : error LNK2005: calloc already defined in libucrt.lib(calloc.obj)
LLVMSupport.lib(rpmalloc.c.obj) : error LNK2005: free already defined in libucrt.lib(free.obj)
LLVMSupport.lib(rpmalloc.c.obj) : error LNK2005: malloc already defined in libucrt.lib(malloc.obj)
bin\LLVM-C.dll : fatal error LNK1169: one or more multiply defined symbols found

Is there any reason this couldn't be built with these host tools? If not, then please could you get this working? Thanks.

aganea added a comment.EditedJan 2 2020, 8:52 PM

It fails creating a library

This use case is supported (using both Visual Studio 2017 and 2019). The error you're seeing most likely occurs because of the object/library link order. I would assume a symbol gets pulled in, and somehow link.exe considers libucrt.lib first, before LLVMSupport.lib (where our malloc override is).

Could you please run ninja -v, and ensure LLVMSupport.lib appears first on the linker cmd-line? (before other libs).
Now add /INCLUDE:malloc to the link cmd-line and re-run it to see if that fixes the issue. That will force the malloc symbol to be processed first, and consequently LLVMSupport.lib to be considered before any #pragma comment(linker, "/defaultlib:libucrt.lib") directive is encountered.

If nothing works, could you please post the exact cmake cmd-line you're running, along with the version for Visual Studio and the Windows Kit? It'd be interesting also to have an output of that linker cmd-line by running again with /VERBOSE, so I can see how malloc is pulled from libucrt.lib.

It fails creating a library

Could you please run ninja -v, and ensure LLVMSupport.lib appears first on the linker cmd-line? (before other libs).

The link.exe command line is:

>C:\PROGRA~2\MIB055~1\2019\PROFES~1\VC\Tools\MSVC\1424~1.283\bin\Hostx64\x64\link.exe /nologo tools\remarks-shlib\CMakeFiles\Remarks.dir\libremarks.cpp.obj tools\remarks-shlib\CMakeFiles\Remarks.dir\__\__\resources\windows_version_resource.rc.res /out:bin\Remarks.dll /implib:lib\Remarks.lib /pdb:bin\Remarks.pdb /dll /version:0.0 /machine:x64 /INCREMENTAL:NO /DEF:<path>/llvm-project/build_vs/tools/remarks-shlib/Remarks.def lib\LLVMRemarks.lib lib\LLVMBitstreamReader.lib lib\LLVMSupport.lib psapi.lib shell32.lib ole32.lib uuid.lib advapi32.lib delayimp.lib -delayload:shell32.dll -delayload:ole32.dll lib\LLVMDemangle.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST /MANIFESTFILE:bin\Remarks.dll.manifest

Now add /INCLUDE:malloc to the link cmd-line and re-run it to see if that fixes the issue. That will force the malloc symbol to be processed first, and consequently LLVMSupport.lib to be considered before any #pragma comment(linker, "/defaultlib:libucrt.lib") directive is encountered.

With /INCLUDE:malloc the above link line completes successfully.

russell.gallop added a comment.EditedJan 3 2020, 7:03 AM

I have done some performance evaluations here and LLVM_ENABLE_RPMALLOC does seem to improve ThinLTO link times and CPU utilisation during ThinLTO.

There are two other things to note:
1). This also seems to improve compile time by around 5% for me. Presumably this is due to some other benefit of rpmalloc as (AFAIK) clang.exe does not use multiple threads.
2). This can significantly increase memory usage. Part of this can be attributed to the increased parallelism but I don't think that entirely explains it. I think that rpmalloc is using more memory. Have you experimented with settings such as here: https://github.com/mjansson/rpmalloc/blob/develop/CACHE.md?

Edit: To be clear, I would be happy with the additional memory usage for the gain in time, but I think that it should be noted as it may not be a good trade off for everyone.

1). This also seems to improve compile time by around 5% for me. Presumably this is due to some other benefit of rpmalloc as (AFAIK) clang.exe does not use multiple threads.

Yes, generally rpmalloc improves single-threaded allocation performance over the default runtime, especially for small block allocations (below 1KiB), enough to warrant using it only for this use case.

2). This can significantly increase memory usage. Part of this can be attributed to the increased parallelism but I don't think that entirely explains it. I think that rpmalloc is using more memory. Have you experimented with settings such as here: https://github.com/mjansson/rpmalloc/blob/develop/CACHE.md?

Like most thread-caching allocators, rpmalloc will trade memory usage for performance. You are certainly right that increased parallelism also contributes, but like you, I think that most of the increase comes from the rpmalloc internals. However, I'm looking into some LLVM statistics provided by @aganea to see if the overhead can be reduced without sacrificing too much of the performance gains.

(disclaimer: I'm the author of rpmalloc and will naturally be biased in any discussions regarding its performance)

1). This also seems to improve compile time by around 5% for me. Presumably this is due to some other benefit of rpmalloc as (AFAIK) clang.exe does not use multiple threads.

Yes, generally rpmalloc improves single-threaded allocation performance over the default runtime, especially for small block allocations (below 1KiB), enough to warrant using it only for this use case.

Thanks for the info.

2). This can significantly increase memory usage. Part of this can be attributed to the increased parallelism but I don't think that entirely explains it. I think that rpmalloc is using more memory. Have you experimented with settings such as here: https://github.com/mjansson/rpmalloc/blob/develop/CACHE.md?

I'm looking into some LLVM statistics provided by @aganea to see if the overhead can be reduced without sacrificing too much of the performance gains.

Thanks. In your experience, is rpmalloc performance sensitive to different host machines (i.e. memory hierarchy), or is it mainly dependent on the application (things like allocation sizes)?

(disclaimer: I'm the author of rpmalloc and will naturally be biased in any discussions regarding its performance)

Okay. Thanks for writing it!

I would say it's mostly down to application usage patterns. The worst case is probably a gc-like usage where one thread does all allocation and another all deallocation, as this will cause all blocks to cross the thread cache via an atomic pointer CAS and eventually cause larger spans of blocks to traverse the thread caches via the global cache. However, even this scenario will probably be significantly faster than the standard runtime allocator.

I have not had the opportunity to test this on other hardware than single-socket consumer hardware, game consoles and handheld devices, mostly due to me working in that spectrum and being the sole developer of it. It does seem that @aganea has tried it on more service facing hardware, perhaps he can give additional input.

I would say it's mostly down to application usage patterns. The worst case is probably a gc-like usage where one thread does all allocation and another all deallocation, as this will cause all blocks to cross the thread cache via an atomic pointer CAS and eventually cause larger spans of blocks to traverse the thread caches via the global cache. However, even this scenario will probably be significantly faster than the standard runtime allocator.

I have not had the opportunity to test this on other hardware than single-socket consumer hardware, game consoles and handheld devices, mostly due to me working in that spectrum and being the sole developer of it. It does seem that @aganea has tried it on more service facing hardware, perhaps he can give additional input.

Okay, thanks.

aganea updated this revision to Diff 236639.Jan 7 2020, 11:21 AM
aganea marked 5 inline comments as done.
  • Changes as suggested by @rnk
  • Fixed linking remarks-shlib as reported by @russell.gallop
  • Reverted back to default rpmalloc settings (no more unlimited mode), i.e. ENABLE_UNLIMITED_CACHE is now 0.
  • Support rpmalloc's ENABLE_STATISTICS - when enabled, allocation statistics will be printed on process exit.

It does seem that @aganea has tried it on more service facing hardware, perhaps he can give additional input.

All I can say is that we've been using it in production for the past year (on editor/production builds) and we're thrilled; the gains are absolutely massive. All multi-threaded workloads are largely improved. Some figures from early last year:

But there's an extra memory cost. When building AC Odyssey's Greece world data, it requires 220 GB of RAM instead of 180 GB with dlmalloc.

2). This can significantly increase memory usage. Part of this can be attributed to the increased parallelism but I don't think that entirely explains it. I think that rpmalloc is using more memory.

At first, I configured rpmalloc in the "unlimited" mode because I wanted maximum performance. I now reverted to default rpmalloc flags, and the memory usage has decreased (albeit still higher than the Windows Heap), with no impact on link time (meaning that timings are still the same as in the Summary). Below are "Max" values when linking clang.exe with ThinLTO:

riccibruno added a comment.EditedJan 8 2020, 4:30 AM

I have done some performance evaluations here and LLVM_ENABLE_RPMALLOC does seem to improve ThinLTO link times and CPU utilisation during ThinLTO.

There are two other things to note:
1). This also seems to improve compile time by around 5% for me. Presumably this is due to some other benefit of rpmalloc as (AFAIK) clang.exe does not use multiple threads.

Is rpmalloc using large pages by default if possible? I have been toying with using large pages for some of the bump-pointer allocators in clang, namely the allocator of ASTContext and the allocator of Sema, and I am getting a ~5% improvement in parsing speed.

2). This can significantly increase memory usage. Part of this can be attributed to the increased parallelism but I don't think that entirely explains it. I think that rpmalloc is using more memory. Have you experimented with settings such as here: https://github.com/mjansson/rpmalloc/blob/develop/CACHE.md?

Edit: To be clear, I would be happy with the additional memory usage for the gain in time, but I think that it should be noted as it may not be a good trade off for everyone.

russell.gallop added inline comments.Jan 8 2020, 8:41 AM
llvm/tools/remarks-shlib/CMakeLists.txt
15

I think this should be:

set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -INCLUDE:malloc")

as it needs adding onto the shared library link flags.

I also had to add something similar to llvm/tools/llvm-shlib/CMakeLists.txt:

--- a/llvm/tools/llvm-shlib/CMakeLists.txt
+++ b/llvm/tools/llvm-shlib/CMakeLists.txt
@@ -172,4 +172,9 @@ if(MSVC)
   # Finally link the target.
   add_llvm_library(LLVM-C SHARED INSTALL_WITH_TOOLCHAIN ${SOURCES} DEPENDS intrinsics_gen)

+  if (LLVM_ENABLE_RPMALLOC AND MSVC)
+    # Make sure we search LLVMSupport first, before the CRT libs
+    set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -INCLUDE:malloc")
+  endif()
+
 endif()

With those, it builds for me with VS2017.

Is rpmalloc using large pages by default if possible? I have been toying with using large pages for some of the bump-pointer allocators in clang, namely the allocator of ASTContext and the allocator of Sema, and I am getting a ~5% improvement in parsing speed.

Not by default, but you can use set LLVM_RPMALLOC_PAGESIZE=2M or set LLVM_RPMALLOC_PAGESIZE=1G to enable large pages. However, I realize that it is currently broken (crashes); I'll investigate and get back.

aganea updated this revision to Diff 237082.EditedJan 9 2020, 7:48 AM
  • Integrated changes as suggested by @russell.gallop - could you please re-apply the patch on a clean git checkout and see if this works?
  • Fixed usage of LLVM_RPMALLOC_PAGESIZE -- it seems __attribute__((used)) doesn't always prevent the symbol from being evicted by LLD -- maybe using TLS callbacks is an edge case. I extended the usage of /INCLUDE to Clang, like for MSVC.
  • Fixed enabling large pages in rpmalloc when they are provided in the 'config' struct. Previously, AdjustTokenPrivileges wasn't called in this case, and that is required for VirtualAlloc(MEM_LARGE_PAGES, ...) to work.

Along the way, I discovered what looks like a new NT kernel issue in Windows 10 build 1909. Still need to confirm this with Microsoft, but from looking at the profile traces, code in the NT kernel (specific to large pages) seems to take almost 100% CPU time when allocating large pages in many threads.


The problem is not there when running on 1903 (same exact hardware, dual Xeon 6140, but different machine -- using only one CPU socket in these tests):

As for the link timings with LLD, I see no difference when using large pages (I've tried 2MB and 1GB pages). I would imagine the cache misses are dominant over DTLB misses, but I would need to dig a bit deeper with VTune. Please let me know if you see a difference on your end.

At first, I configured rpmalloc in the "unlimited" mode because I wanted maximum performance. I now reverted to default rpmalloc flags, and the memory usage has decreased (albeit still higher than the Windows Heap), with no impact on link time (meaning that timings are still the same as in the Summary). Below are "Max" values when linking clang.exe with ThinLTO:

I have tested the patch (https://reviews.llvm.org/differential/diff/236639/ from Tuesday) and get broadly similar performance to before, with less memory usage, so this seems like a better trade-off.

I haven't looked through the code in detail but I'd support this patch. In terms of getting it in, I think that there are a few questions:
1). Is the licensing of this okay for the LLVM project? I don't feel qualified to say, maybe someone from the LLVM foundation can answer.
2). How will LLVM_ENABLE_RPMALLOC be tested (it could be added to an existing bot such as the PS4 windows bot)?

Am I the only one who is a little bit uncomfortable with including a replacement to the system allocator in LLVM? It seems to me that this is something that is clearly under the responsibility of the C library. Now this might still be worth doing given the large performance improvement, but yuck...

aganea added a comment.EditedJan 9 2020, 9:26 AM

Am I the only one who is a little bit uncomfortable with including a replacement to the system allocator in LLVM? It seems to me that this is something that is clearly under the responsibility of the C library. Now this might still be worth doing given the large performance improvement, but yuck...

Well, I think that should give consistent performance across systems. We could (maybe) do something like LD_PRELOAD with Detours, but that's less plug-and-play; I'm not sure what @maniccoder thinks. But I'm open to other alternatives.
The other thing to keep in mind is that we (professional game developers) are using LLVM toolchains provided by first parties like Sony or Nintendo, and we have to ship our game(s) with their toolchain. The rationale for pushing all these optimizations upstream is that those vendors eventually pick them up, so we can possibly use them easily at some point.

Regarding the license: if you believe in public domain, then rpmalloc is in the public domain. If not, it's released under MIT, which should be compatible with most projects. If you still have issues with this, let me know and I can probably accommodate any needs you might have.

I would say using things like Detours is overkill and opens a whole can of worms for something as simple as linking in a new allocator. I consider the memory allocator to be like any other part of the standard runtime - if it doesn't suit your needs or performs badly, use or write something that does. It's not like the runtime can be the optimal solution for every use case; it is simply a solution provided out of the box. Memory allocation is not some magic thing that only the runtime can do. But then again, I might be biased :)

russell.gallop added a comment.EditedJan 10 2020, 3:56 AM

Am I the only one who is a little bit uncomfortable with including a replacement to the system allocator in LLVM?

No, you're not. I don't think it's ideal but I'm not sure what the alternative is to remove this bottleneck for ThinLTO on Windows.

It might be possible to use the Win32 heap API (https://docs.microsoft.com/en-gb/windows/win32/api/heapapi/): HeapCreate() to create a heap for each thread, and HeapAlloc() with HEAP_NO_SERIALIZE. But HeapCreate() heaps are finite, so you'd have to handle exhausting them, and by then you're practically writing a new memory allocator...
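
For illustration only, a rough sketch of that per-thread-heap idea (hypothetical, not something this patch does; error handling omitted):

  #include <stddef.h>
  #include <windows.h>

  // One growable, non-serialized heap per thread (dwMaximumSize == 0 makes it growable).
  static __declspec(thread) HANDLE ThreadHeap;

  static void *ThreadAlloc(size_t Size) {
    if (!ThreadHeap)
      ThreadHeap = HeapCreate(HEAP_NO_SERIALIZE, /*dwInitialSize=*/0, /*dwMaximumSize=*/0);
    return HeapAlloc(ThreadHeap, 0, Size);
  }

  // Caveat: HEAP_NO_SERIALIZE removes locking, so a block must be freed (via
  // HeapFree on the owning heap) by the thread that allocated it -- which is
  // exactly the cross-thread-free problem a real allocator has to solve.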

It does feel strange that we have to replace the default allocator to get efficient memory allocation on multiple threads, which doesn't sound like an uncommon requirement to me.

Including the entire rpmalloc source does add quite a bit of source code in one go. Would it be better to keep the rpmalloc source outside LLVM and just include hooks?

With all respect to @maniccoder, there are other replacement memory allocators that could be considered as alternatives (e.g.):
tcmalloc from https://github.com/gperftools/gperftools (as @rnk mentioned above)
https://github.com/microsoft/mimalloc

It seems to me that this is something that is clearly under the responsibility of the C library. Now this might still be worth doing given the large performance improvement, but yuck...

I wouldn't be happy with unconditionally replacing the C library malloc (without a cmake option). LLVM should certainly continue to be tested, and work, with the default allocator as we wouldn't want to unintentionally develop a dependence on an alternative.

I guess we can hope that the standard implementation is improved as multi-threaded code becomes more prevalent. glibc for instance added a thread cache to malloc in 2017: https://sourceware.org/ml/libc-alpha/2017-08/msg00010.html

Am I the only one who is a little bit uncomfortable with including a replacement to the system allocator in LLVM?

No, you're not. I don't think it's ideal but I'm not sure what the alternative is to remove this bottleneck for ThinLTO on Windows.

It might be possible to use the Win32 heap API (https://docs.microsoft.com/en-gb/windows/win32/api/heapapi/): HeapCreate() to create a heap for each thread, and HeapAlloc() with HEAP_NO_SERIALIZE. But HeapCreate() heaps are finite, so you'd have to handle exhausting them, and by then you're practically writing a new memory allocator...

Are they? I only looked quickly at the reference for HeapCreate, but I see: "If dwMaximumSize is 0, the heap can grow in size. The heap's size is limited only by the available memory." [...] However, this does not help if the allocations are all over the place, since you have to explicitly use HeapAlloc.

It does feel strange that we have to replace the default allocator to get efficient memory allocation on multiple threads, which doesn't sound like an uncommon requirement to me.

Including the entire rpmalloc source does add quite a bit of source code in one go. Would it be better to keep the rpmalloc source outside LLVM and just include hooks?

With all respect to @maniccoder, there are other replacement memory allocators that could be considered as alternatives (e.g.):
tcmalloc from https://github.com/gperftools/gperftools (as @rnk mentioned above)
https://github.com/microsoft/mimalloc

It seems to me that this is something that is clearly under the responsibility of the C library. Now this might still be worth doing given the large performance improvement, but yuck...

I wouldn't be happy with unconditionally replacing the C library malloc (without a cmake option). LLVM should certainly continue to be tested, and work, with the default allocator as we wouldn't want to unintentionally develop a dependence on an alternative.

I guess we can hope that the standard implementation improves with increased use of multi-threaded code. glibc for instance added a thread cache to malloc in 2017: https://sourceware.org/ml/libc-alpha/2017-08/msg00010.html

Yes, really it does not seem like an uncommon usage pattern. In the meantime I don't have a better idea here, and I don't want to block this if this is what is needed.

I think you will find that the Heap API has subpar performance, since it too uses locks to ensure thread safety (unless you can guarantee that memory blocks never cross thread boundaries).

Also, even though glibc now has thread caches, I suspect that you will find rpmalloc/mimalloc/tcmalloc provide a decent performance increase, maybe not an order of magnitude but still noticeable.

I think the main benefit of rpmalloc is that it is small in code size, integrating and maintaining tcmalloc for example would most likely be more work.

aganea added a comment.EditedFeb 3 2020, 8:52 AM

In D71786#1813866, @russell.gallop wrote:
With all respect to @maniccoder, there are other replacement memory allocators that could be considered as alternatives (e.g.):
tcmalloc from https://github.com/gperftools/gperftools (as @rnk mentioned above)
https://github.com/microsoft/mimalloc

I've tried mimalloc -- it's a tad faster than rpmalloc. Maybe this has to do with mimalloc mapping fewer memory pages than rpmalloc.
I used mimalloc as a .lib, not compiled as part of LLVM, because of mismatches between Clang headers and Windows headers (see here).

Figures below are for linking clang.exe with the cmake flags [3] mentioned in the Summary. LTO cache is disabled; the Windows cache is warm, figures taken after several runs.
Using a two-stage lld-link.exe with the same cmake flags and mimalloc/rpmalloc used in both stages.
D71775 is also applied.
36-threads test uses ThinLTO with one thread per core, but using all CPU sockets. 72-threads test uses all hardware threads, on all cores, on all CPU sockets.

Memory usage:

                        Working Set   Private Working Set   Commit    Virtual Size
rpmalloc - 36-threads   25.1 GB       16.5 GB               19.9 GB   37.4 GB
mimalloc - 36-threads   25.6 GB       16.3 GB               18.3 GB   33.3 GB
rpmalloc - 72-threads   33.6 GB       25.1 GB               28.5 GB   46 GB
mimalloc - 72-threads   30.5 GB       21.2 GB               23.4 GB   38.4 GB

It's quite neat to see lld-link remaining at 99% CPU almost all the time during link :-D

I've also tried tcmalloc, but it performs significantly worse than both mimalloc and rpmalloc, mainly because of the SpinLock implementation, which downgrades to a kernel Sleep after a few spins. This is an absolute no-go for many-core systems when a lot of allocations are involved, like ThinLTO.
Also quickly tested ptmalloc3, but out of the box it doesn't pass the LLVM tests. However, I don't expect any miracle there; quickly browsing through the code shows the same spinlock behavior as in tcmalloc. And so does Hoard. Possibly glibc 2.26+ would perform better since it has a per-thread cache, but the mimalloc bench uses glibc 2.7, and shows worse performance in general.

Please let me know if you'd like more testing on any of the above, or any other allocator.

I think both mimalloc & rpmalloc are already on the asymptote in terms of performance. They both use about 3.5% of CPU time on the 36-core, which includes calls to malloc & free and the kernel calls to VirtualAlloc.
At this point, any further improvements will come from reducing allocations in parts of LLVM that allocate a lot:

Including the entire rpmalloc source does add quite a bit of source code in one go. Would it be better to keep the rpmalloc source outside LLVM and just include hooks?

I think integrating it by default into LLVM would ease things and reduce the points of failure for enabling it (from a toolchain provider POV).

I wouldn't be happy with unconditionally replacing the C library malloc (without a cmake option). LLVM should certainly continue to be tested, and work, with the default allocator as we wouldn't want to unintentionally develop a dependence on an alternative.

Agreed. What about compiling it by default, but enabling it at runtime with an environment variable or a cmd-line flag? That would be the best option, and it wouldn't require you (Sony) or any other LLVM toolchain provider (Nintendo) to ship a different set of binaries. Customers could selectively enable the alternate allocator, at the expense of higher memory usage. We (Ubisoft) would very much like it if first-party LLVM toolchain providers could include this (or an alternate allocator) as part of their toolchains for Windows. Our games need to ship built with those toolchains; we can't use our own downstream forks (at least not at the moment).

As for code size,

D:\Utils>cloc-1.80.exe d:\llvm-project\llvm\lib\Support\rpmalloc\
       4 text files.
       4 unique files.
       2 files ignored.

github.com/AlDanial/cloc v 1.80  T=0.50 s (6.0 files/s, 6472.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                                2            280            341           2352
C/C++ Header                     1             30            110            123
-------------------------------------------------------------------------------
SUM:                             3            310            451           2475
-------------------------------------------------------------------------------

2475 LoC in 2 files seems like a pretty reasonable addition to me.

And:

D:\Utils>cloc-1.80.exe d:\git\mimalloc\src
      15 text files.
      15 unique files.
       1 file ignored.

github.com/AlDanial/cloc v 1.80  T=0.50 s (30.0 files/s, 13760.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                               15            889           1064           4927
-------------------------------------------------------------------------------
SUM:                            15            889           1064           4927
-------------------------------------------------------------------------------

D:\Utils>cloc-1.80.exe d:\git\mimalloc\include
       6 text files.
       6 unique files.
       1 file ignored.

github.com/AlDanial/cloc v 1.80  T=0.50 s (12.0 files/s, 3194.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C/C++ Header                     6            261            308           1028
-------------------------------------------------------------------------------
SUM:                             6            261            308           1028
-------------------------------------------------------------------------------

mimalloc comes with about 2x larger code size, and more .c files to potentially integrate into LLVM.


At this point, if rpmalloc could be improved to the same level of memory usage as mimalloc, I'd be willing to wait a few weeks/months before going forward. @maniccoder WDYT? I can provide a full step-by-step repro.
Otherwise, I'd take mimalloc as an external .lib, but this is not my preferred solution.

Please advise:

  1. rpmalloc or mimalloc?
  2. .lib or in-tree?
  3. Compiled by default in LLVM, but disabled at runtime / optionally enabled by the user?

I'm more than happy to take a look at this use case and see what can be done to reduce the peak commit when using rpmalloc. Perhaps if you file an issue on the rpmalloc GitHub project, we can take it from there.

rnk added a comment.Feb 11 2020, 5:45 PM

I haven't found time to review this yet, but this is still a very interesting change, and I hope to get to it soon.

Some other things I wanted to do in this space are:

  • Histogram most commonly allocated AST node types
  • Try raising BumpPtrAllocator "slab" size above 4096, that seems too small. Most AST nodes are allocated with BumpPtrAllocator.

These changes would likely only affect parsing speed, not middle/backend speed which dominates thinlto time.

In D71786#1871211, @rnk wrote:

I haven't found time to review this yet, but this is still a very interesting change, and I hope to get to it soon.

Some other things I wanted to do in this space are:

  • Histogram most commonly allocated AST node types
  • Try raising BumpPtrAllocator "slab" size above 4096, that seems too small. Most AST nodes are allocated with BumpPtrAllocator.

These changes would likely only affect parsing speed, not middle/backend speed which dominates thinlto time.

  • This already exists, right? This is -print-stats.
  • I agree that 4K is too small. Virtual memory is cheap, so we should just allocate large slabs. That said, in my experiments increasing the slab size had no measurable impact in itself (at least for clang with -fsyntax-only). What did have a meaningful impact was using large pages (5%-6% in -fsyntax-only on all of Boost, obviously hardware dependent). Additionally, I suspect, but have no data, that the improvement will be comparatively larger in a normal compilation, because typically the DTLB and STLB are shared between the logical threads on each core.
aganea added a comment.EditedFeb 15 2020, 5:59 AM
In D71786#1871211, @rnk wrote:
  • Try raising BumpPtrAllocator "slab" size above 4096, that seems too small. Most AST nodes are allocated with BumpPtrAllocator.
  • I agree that 4K is too small. Virtual memory is cheap so we should just allocate large slabs. That said in my experiments increasing the slab size had no measurable impact in itself (at least for clang with -fsyntax-only). What did have a meaningful impact is using large pages (5%-6% in -fsyntax-only on all of Boost, obviously hardware dependent). Additionally I suspect, but have no data, that the improvement will be comparatively larger in a normal compilation because typically the DTLB and STLB are shared between the logical threads on each core.

We could add a PageAllocator instead of the current MallocAllocator, and make the default slab size 64 KB. The beauty of this is that we could either: 1. use large pages by just flipping a flag; or 2. allocate large contiguous slabs of virtual memory and commit memory pages on the go, as sketched below. This is a technique we use in games to avoid reallocating/moving around allocated blocks. It only works with 4 KB pages, not with large pages (which must be committed up front).
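
A rough sketch of idea 2, for illustration only (hypothetical PageAllocator, Windows-only, 4 KB pages, no error handling or thread safety):

  #include <stddef.h>
  #include <windows.h>

  // Reserve a large contiguous range once, then commit 64 KB at a time as the
  // bump pointer reaches the end of the committed region. Blocks never move.
  typedef struct {
    char *Base;      // start of the reserved range
    char *Committed; // end of the committed part
    char *Cur;       // bump pointer
  } PageAllocator;

  static void PageAllocatorInit(PageAllocator *A, size_t ReserveBytes) {
    A->Base = (char *)VirtualAlloc(NULL, ReserveBytes, MEM_RESERVE, PAGE_READWRITE);
    A->Committed = A->Cur = A->Base;
  }

  static void *PageAllocatorAlloc(PageAllocator *A, size_t Size) {
    if (A->Cur + Size > A->Committed) {
      size_t Grow = (Size + 0xFFFF) & ~(size_t)0xFFFF; // round up to 64 KB
      VirtualAlloc(A->Committed, Grow, MEM_COMMIT, PAGE_READWRITE);
      A->Committed += Grow;
    }
    void *P = A->Cur;
    A->Cur += Size;
    return P;
  }
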
But further down the line, it'd be nice to collect statistics about how the BumpPtrAllocator grows/is used, and the same for stack-based structures like the "Small" containers, so that we can set default container sizes that actually have the most impact. Nicolas Fleury has a talk here about some of the techniques we use at Ubisoft (check at 21:26).

Adding more to the discussion: LLVM compilation times on Linux shown in the article below seem to indicate the same compile-time variability as seen between Windows versions.
While I haven't had time to test on these systems, part of the variability could come from the default heap allocator.
See https://www.phoronix.com/scan.php?page=article&item=3990x-freebsd-bsd&num=3

aganea added a comment.EditedApr 14 2020, 8:30 AM

Ping!

Would reviewers be ok if I changed this patch to always compile rpmalloc.c as part of LLVM (no cmake flag), but only make it active on demand? (with something like -fintegrated-crt-alloc in both Clang and LLD)
Additionally, I could provide a cmake hook for linking in an external CRT alloc .lib, such as mimalloc or tcmalloc3. In which case, -fintegrated-crt-alloc would use that lib instead.

If we go that route, does Linux support some kind of early entry point (before (m)allocations can occur), like TLS callbacks on Windows? It is true Linux users could use LD_PRELOAD to achieve the goal of this patch; I am simply wondering if we should have a uniform (integrated-crt-alloc) behavior across platforms.

dalias added a subscriber: dalias.Apr 15 2020, 11:18 AM

Coming from the unixy side (musl), there is no reasonable way short of self-exec with LD_PRELOAD (which has reasonableness problems itself since it'd be inherited by child processes) to make a -fintegrated-crt-alloc where replacement of malloc is runtime-conditional. And the only way to (non-statically) initialize a replacement is init-on-first-use (call_once style; this can be done without barriers if needed using a not-widely-known algorithm you can dig up). The better way is just making the statically initialized state a valid initial state.

The right way to do this is just to link the replacement into the binaries clang, lld, etc. - NOT the LLVM shared libraries - if the user selects a malloc replacement at build time, and optionally default to doing so if that's what you decide is the right default.
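
For illustration, the init-on-first-use shape described above might look like this (hypothetical sketch; my_alloc_init/my_alloc are stand-ins for the replacement's internals, and it assumes call_once itself never allocates):

  #include <stddef.h>
  #include <threads.h>

  /* Hypothetical allocator internals, provided elsewhere by the replacement. */
  void my_alloc_init(void);
  void *my_alloc(size_t n);

  static once_flag alloc_once = ONCE_FLAG_INIT;

  /* Interposed definition: the first call initializes the replacement exactly once. */
  void *malloc(size_t n) {
    call_once(&alloc_once, my_alloc_init);
    return my_alloc(n);
  }

  /* The preferred alternative mentioned above avoids even this: make the
     allocator's statically initialized state already valid, so no init call
     is needed at all. */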

(If we want to statically link some runtime libraries into executables, we can add an option along the lines of -static-libgcc -static-openmp -static-libstdc++. (I just recalled I wanted to add -static= https://reviews.llvm.org/D53238))

rnk added a comment.Apr 15 2020, 5:21 PM

I think this is better as an off-by-default build configuration option. It also makes it easy for users with license concerns to avoid this dependency.

I basically think this looks good as is, but I want to get more guidance on what our policies on third party code are.

llvm/lib/Support/rpmalloc/LICENSE
24 ↗(On Diff #237082)

My understanding is that lawyers do not like one-off licenses. However, I think there is a pretty clear case that rpmalloc is not critical functionality to LLVM, it is an implementation of a well-defined interface: malloc, and if we discover license complications down the road, we can always remove it without removing critical functionality. And if anyone finds this worrying today, they do not need to enable the rpmalloc build mode.

So, that is my non-authoritative, "not legal advice" rationalization that this should be OK. And, you did put all the code in the rpmalloc directory. So, I think this is fine.

evgeny777 added a subscriber: evgeny777.

This speeds up thin LTO link more than 10 times on Ryzen-9 3950X / Windows 10: from 30 mins to less than 3 mins

mjp41 added a subscriber: mjp41.May 6 2020, 3:27 PM
dmajor added a subscriber: dmajor.Jun 11 2020, 12:23 PM

Very nice! This reduces the link time of Firefox's xul.dll from 13 minutes to 7 minutes.

aganea added a comment.EditedJun 15 2020, 9:18 AM

Very nice! This reduces the link time of Firefox's xul.dll from 13 minutes to 7 minutes.

Nice! What is the config? Have you tried with lld-link ... /opt:lldltojobs=all? The default is to only use half of the hardware threads.

That was on our c5.4xlarge AWS builders, which have 16 vCPUs according to the docs. I'm not sure offhand whether there is HT involved, but anyway I was testing with this patch applied to LLVM 10 which doesn't have the lldltojobs=all patch.

I notice that the improvements to rpmalloc for LLVM are still under discussion (https://github.com/mjansson/rpmalloc/issues/150) and targeted at rpmalloc v1.4.1.

Is the version in this review based on rpmalloc v1.4.0?

Is the plan to land this version as it fixes the concurrency issue then update to v1.4.1 to improve memory usage?

I wonder if it's possible to get this in for LLVM 11. Is it just wanting an okay on the licensing?

llvm/lib/Support/rpmalloc/LICENSE
24 ↗(On Diff #237082)

I believe that rpmalloc can be licensed under the MIT license if the lawyers would be happier with this: https://github.com/mjansson/rpmalloc#license

I notice that the improvements to rpmalloc for LLVM are still under discussion (https://github.com/mjansson/rpmalloc/issues/150) and targeted at rpmalloc v1.4.1.

Is the version in this review based on rpmalloc v1.4.0?

I believe it was based on this commit: https://github.com/mjansson/rpmalloc/commit/5aa754c14645303f12cd147c78da07af95886af6 (29 Nov 2019)

Is the plan to land this version as it fixes the concurrency issue then update to v1.4.1 to improve memory usage?

I wonder if it's possible to get this in for LLVM 11. Is it just wanting an okay on the licensing?

I modified the patch to support rpmalloc, snmalloc & mimalloc from an external folder (outside of the LLVM tree).
In essence, one would need to do:

F:\git> git clone https://github.com/mjansson/rpmalloc/tree/mjansson/llvm-opt
F:\git> cd \llvm-project && mkdir build && cd build
F:\llvm-project\build> cmake -G"Ninja" ../llvm (...) -DLLVM_INTEGRATED_CRT_ALLOC=f:/git/rpmalloc
-or-
F:\llvm-project\build> cmake -G"Ninja" ../llvm (...) -DLLVM_INTEGRATED_CRT_ALLOC=f:/git/snmalloc

And that would solve the licencing issue and leave room for better allocator updates along the way. What do you think?

MaskRay added a comment.EditedJun 30 2020, 1:54 PM

I notice that the improvements to rpmalloc for LLVM are still under discussion (https://github.com/mjansson/rpmalloc/issues/150) and targeted at rpmalloc v1.4.1.

Is the version in this review based on rpmalloc v1.4.0?

Is the plan to land this version as it fixes the concurrency issue then update to v1.4.1 to improve memory usage?

I wonder if it's possible to get this in for LLVM 11. Is it just wanting an okay on the licensing?

First, there are of course license concerns. You'll need to find someone comfortable and authoritative approving this.

Second, adding a malloc implementation does not seem suitable to me. I think this particular comment should be addressed https://reviews.llvm.org/D71786#1984285
Is replacing the malloc implementation so difficult on Windows that the compiler has to ship a copy?

(BTW, have folks tried https://github.com/richfelker/mallocng-draft ? I hope it is not too difficult porting it to Windows...)

Second, adding a malloc implementation does not seem suitable to me. I think this particular comment should be addressed https://reviews.llvm.org/D71786#1984285

Sorry, which part of that comment should be addressed Fangrui?

Is replacing the malloc implementation so difficult on Windows that the compiler has to ship a copy?

Windows doesn't have a native LD_PRELOAD mechanism. The only other way would be to use Detours or do runtime patching of the EXE, however @maniccoder seems to advise against it: https://reviews.llvm.org/D71786#1812576 - I've never used Detours personally, maybe others can relate. That also means using /MD (dynamic CRT) instead of /MT (static CRT), which makes the executable slower de facto, because of the DLL thunking.

However, that leaves the official LLVM releases & git checkouts using the default Windows heap, which doesn't scale. The execution time of any heavily-allocating threaded workload like ThinLTO degrades as the number of cores grows. The more people have 32- or 64-core CPUs, the more this will become apparent. Microsoft's own compiler has used tbbmalloc since VS2019.

We are only suggesting this improvement for Windows. Other LLVM platforms could remain as they are. Perhaps this needs an RFC on the mailing list, but the gains are pretty clear; we're talking about at least an order of magnitude in ThinLTO execution time on many-core machines.

Second, adding a malloc implementation does not seem suitable to me. I think this particular comment should be addressed https://reviews.llvm.org/D71786#1984285

Sorry, which part of that comment should be addressed Fangrui?

Hi, I think you answered in the next paragraph.

Is replacing the malloc implementation so difficult on Windows that the compiler has to ship a copy?

Windows doesn't have a native LD_PRELOAD mechanism. The only other way would be to use Detours or do runtime patching of the EXE, however @maniccoder seems to advise against it: https://reviews.llvm.org/D71786#1812576 - I've never used Detours personally, maybe others can relate. That also means using /MD (dynamic CRT) instead of /MT (static CRT), which makes the executable slower de facto, because of the DLL thunking.

However, that leaves the official LLVM releases & git checkouts using the default Windows heap, which doesn't scale. The execution time of any heavily-allocating threaded workload like ThinLTO degrades as the number of cores grows. The more people have 32- or 64-core CPUs, the more this will become apparent. Microsoft's own compiler has used tbbmalloc since VS2019.

Thanks for the explanation. Lacking an out-of-the-box LD_PRELOAD sounds really unfortunate ;-)

We are only suggesting this improvement for Windows. Other LLVM platforms could remain as they are. Perhaps this needs an RFC on the mailing list, but the gains are pretty clear; we're talking about at least an order of magnitude in ThinLTO execution time on many-core machines.

I can see it makes a lot of improvement for Windows. This patch pulls in a third party project with more than 3000 lines of code. I guess an RFC is probably deserved.

Finally, let me end with a hopefully constructive argument :) If the malloc interface is well defined on Windows, can -fintegrated-crt-alloc be generalized to take a value - the path to a precompiled malloc implementation - so that we don't need to check in a specific implementation and users can easily use whatever implementation they want?

hans added a comment.Jul 1 2020, 2:24 AM

I also don't know what exactly the policy around third_party code is, so discussing this on llvm-dev is probably a good idea.

But perhaps one way of side-stepping the whole issue would be to not check in rpmalloc into LLVM, but make the user point at it if they want to use it through a cmake variable, e.g. -DLLVM_USE_RPMALLOC=c:\path\to\rpmalloc (if such a thing is technically practical, I don't know)

I also don't know what exactly the policy around third_party code is, so discussing this on llvm-dev is probably a good idea.

But perhaps one way of side-stepping the whole issue would be to not check in rpmalloc into LLVM, but make the user point at it if they want to use it through a cmake variable, e.g. -DLLVM_USE_RPMALLOC=c:\path\to\rpmalloc (if such a thing is technically practical, I don't know)

That sounds like @aganea's suggestion from https://reviews.llvm.org/D71786#2123378.

Maybe it would be worth putting that approach up in a separate review (as it's quite different).

I also don't know what exactly the policy around third_party code is, so discussing this on llvm-dev is probably a good idea.

But perhaps one way of side-stepping the whole issue would be to not check in rpmalloc into LLVM, but make the user point at it if they want to use it through a cmake variable, e.g. -DLLVM_USE_RPMALLOC=c:\path\to\rpmalloc (if such a thing is technically practical, I don't know)

That diverges from what we have done in other cases (ConvertUTF, gtest, gmock, google benchmark, isl, ...), adds to the effort to build LLVM and adds a dimension to the cmake configuration space.

llvm/lib/Support/rpmalloc/LICENSE
24 ↗(On Diff #237082)

Taking the occasion to point to another licence issue we already have: D66390

aganea updated this revision to Diff 274989.Jul 1 2020, 8:40 PM
aganea retitled this revision from RFC: [Support] On Windows, add optional support for rpmalloc to RFC: [Support] On Windows, add optional support for {rpmalloc|snmalloc|mimalloc}.
aganea edited the summary of this revision. (Show Details)

Obviously this needs a bit more polishing, but this is just a quick WIP before going further into the discussion.

  • Removed the rpmalloc source code.
  • Added support for mimalloc and snmalloc.
  • The path to the replacement allocator can be set with -DLLVM_INTEGRATED_CRT_ALLOC.

This patch should work on Linux as well, probably with minor tweaks.

After the previous steps, compile LLVM with something like: cmake ... -DLLVM_USE_CRT_RELEASE=MT -DLLVM_INTEGRATED_CRT_ALLOC=D:/git/mimalloc

Latest timings - LLD linking clang.exe with ThinLTO on 36-core CPU:

                                       Wall clock      Page ranges committed/decommitted   Total touched pages   Peak Mem
Windows 10 version 2004 default heap   38 min 47 sec   -                                   -                     14.9 GB
mimalloc                               2 min 22 sec    1,449,501                           174.3 GB              19.8 GB
rpmalloc                               2 min 15 sec    270,796                             45.9 GB               31.9 GB
snmalloc                               2 min 19 sec    102,839                             47.0 GB               42.0 GB

In D71786#2125487, @Meinersbur wrote:
That diverges from what we have done in other cases (ConvertUTF, gtest, gmock, google benchmark, isl, ...), adds to the effort to build LLVM and adds a dimension to the cmake configuration space.

I would imagine a downstream fork of LLVM could simply embed the allocator's checkout into the LLVM tree. Not ideal, but that leaves latitude to choose the allocator and/or one with a suitable licence.

In D71786#2124330, @MaskRay wrote:
Finally, let me end with a hopefully constructive argument:) If the malloc interface is well defined on Windows, can -fintegrated-crt-alloc be generalized to take a value - path to the precompiled malloc implementation, so that we don't need to check in a specific implementation & users can easily use whatever implementations they want?

Replacing the allocator dynamically at startup could be possible, but that's a lot of trouble. If that's what we want, I could check if we can load a DLL from a TLS callback, because we need the allocator very early, before the CRT entry point is called.

But since the allocator is more of an LLVM-developer-facing feature, and not really a Clang/LLD-user feature, I would tend to prefer cmake -DLLVM_INTEGRATED_CRT_ALLOC=D:/git/rpmalloc and perhaps -fno-integrated-crt-alloc at runtime to disable it. But if you disagree and/or have arguments for a dynamic loader, let me know.

Replacing the allocator dynamically at startup could be possible, but that's a lot of trouble. If that's what we want, I could check if we can load a DLL from a TLS callback, because we need the allocator very early, before the CRT entry point is called.

I think a TLS callback would fall into the "unspecified order" category of global constructors, and that on Windows you need to use the CRT data segments ".CRT$XIx" to ensure it is called before anything might allocate memory, as I wrote about earlier: https://gist.github.com/mjansson/8a573974473e1d69a2299444cd41bea9

Anyway, not sure you can load a DLL without it calling in to the allocator though :)
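
For reference, a minimal sketch of the .CRT$XI technique mentioned above (MSVC-specific; the section name and init call here are illustrative assumptions, the gist has the complete version):

  // The CRT walks the function-pointer arrays placed in the .CRT$XIA...XIZ
  // sections before C++ constructors and before main(), i.e. before most
  // allocations can happen.
  static int __cdecl init_crt_alloc_early(void) {
    // e.g. rpmalloc_initialize();
    return 0; // 0 means success for a C initializer
  }

  #pragma section(".CRT$XIB", read)
  __declspec(allocate(".CRT$XIB"))
  static int (__cdecl *p_init_crt_alloc_early)(void) = init_crt_alloc_early;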

Thanks for your work on this. I've just tried out this version with rpmalloc (1.4.0) checked out at the top of llvm-project.

> cmake -G Ninja -DLLVM_ENABLE_PROJECTS=clang;lld -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_INTEGRATED_CRT_ALLOC=<path>/llvm-project/rpmalloc -DLLVM_USE_CRT_RELEASE=MT ..\llvm
-- The C compiler identification is MSVC 19.26.28806.0
-- The CXX compiler identification is MSVC 19.26.28806.0
...
> ninja all

And hit a compile issue in rpmalloc:

C:\PROGRA~2\MIB055~1\2019\PROFES~1\VC\Tools\MSVC\1426~1.288\bin\Hostx64\x64\cl.exe  /nologo -DGTEST_HAS_RTTI=0 -DUNICODE -D_CRT_NONSTDC_NO_DEPRECATE -D_CRT_NONSTDC_NO_WARNINGS -D_CRT_SECURE_NO_DEPRECATE -D_CRT_SECURE_NO_WARNINGS -D_HAS_EXCEPTIONS=0 -D_SCL_SECURE_NO_DEPRECATE -D_SCL_SECURE_NO_WARNINGS -D_UNICODE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -Ilib\Support -I<path>\llvm-project\llvm\lib\Support -Iinclude -I<path>\llvm-project\llvm\include /DWIN32 /D_WINDOWS /Zc:inline /Zc:strictStrings /Oi /W4 -wd4141 -wd4146 -wd4244 -wd4267 -wd4291 -wd4351 -wd4456 -wd4457 -wd4458 -wd4459 -wd4503 -wd4624 -wd4722 -wd4100 -wd4127 -wd4512 -wd4505 -wd4610 -wd4510 -wd4702 -wd4245 -wd4706 -wd4310 -wd4701 -wd4703 -wd4389 -wd4611 -wd4805 -wd4204 -wd4577 -wd4091 -wd4592 -wd4319 -wd4709 -wd4324 -w14062 -we4238 /MDd /Zi /Ob0 /Od /RTC1 /showIncludes /Folib\Support\CMakeFiles\LLVMSupport.dir\F_\git\llvm-project\rpmalloc\rpmalloc\malloc.c.obj /Fdlib\Support\CMakeFiles\LLVMSupport.dir\LLVMSupport.pdb /FS -c <path>\llvm-project\rpmalloc\rpmalloc\malloc.c
<path>\llvm-project\rpmalloc\rpmalloc\malloc.c(20): error C2143: syntax error: missing ')' before 'sizeof'
<path>\llvm-project\rpmalloc\rpmalloc\malloc.c(20): error C2143: syntax error: missing '{' before 'sizeof'
<path>\llvm-project\rpmalloc\rpmalloc\malloc.c(20): error C2059: syntax error: 'sizeof'
<path>\llvm-project\rpmalloc\rpmalloc\malloc.c(21): error C2143: syntax error: missing ')' before 'sizeof'
<path>\llvm-project\rpmalloc\rpmalloc\malloc.c(21): error C2143: syntax error: missing '{' before 'sizeof'
<path>\llvm-project\rpmalloc\rpmalloc\malloc.c(21): error C2059: syntax error: 'sizeof'

On the whole I preferred the version of this patch with rpmalloc committed as:

  • It lets us test it on a buildbot without adding other repositories. That makes it easier to keep it working across all the supported LLVM configurations (as mentioned by @Meinersbur above)
  • It could be used in the official LLVM binary releases (https://releases.llvm.org/download.html) (if @hans is happy with that), helping users of that
  • All downstream users will use the same alternative memory allocator, so there is less chance of strange downstream issues.
  • I can't see enough of an argument for each downstream user having a choice. Any of {rp,sn,mi}malloc will fix the multi-threaded locking issue and boost single-threaded perf. Any difference between them is relatively small (particularly if the memory usage of rpmalloc is improved, as in https://github.com/mjansson/rpmalloc/issues/150)
  • If rpmalloc is accepted in upstream LLVM then it actually simplifies licensing for downstream toolchain providers
  • The current version (supporting {rpmalloc|snmalloc|mimalloc}) is coupled to those external projects (in llvm/lib/Support/CMakeLists.txt) and I'm not sure how we'd test this integration.

IANAL, but I believe that rpmalloc probably *would* be acceptable to the LLVM project under the MIT license. I think that it is up to the LLVM board to decide on licensing issues. If you're happy to go with the rpmalloc route then you could contact them directly (contact details here: http://llvm.org/foundation/), or I would be happy to do so in support of this if that would help.

Thanks
Russ

hans added a comment.Aug 4 2020, 5:28 AM

From my point of view, I think it would be fine to only support rpmalloc in this patch. One aspect that Russell touched on is that if we support a bunch of options, it's not clear how well that support will be tested. Supporting only rpmalloc would also simplify the patch a little, and small incremental change is always good I think.

As for the "why rpmalloc and not my favorite allocator" question - well you're doing the work, so you get to choose the allocator :-) And, if there really is strong demand for supporting a different allocator, that can be handled later.

It would be nice if rpmalloc could live as third_party code in the llvm tree eventually, so that this could just work out of the box, but until then I think the current patch is a good solution.

llvm/lib/Support/CMakeLists.txt
66

I thought rpmalloc/malloc.c isn't supposed to be compiled on its own, but gets included into rpmalloc.c when ENABLE_OVERRIDE is defined. I'm guessing that could explain Russell's compiler error above.

That is correct: malloc.c is included from the rpmalloc.c file when ENABLE_OVERRIDE is set to 1, in order to correctly alias to the internal functions. It should not be compiled on its own.

maniccoder added inline comments.Aug 4 2020, 10:31 PM
llvm/lib/Support/CMakeLists.txt
66

Also, it seems the ENABLE_OVERRIDE define is missing; it must be defined to 1 to enable the malloc/free and new/delete replacement in rpmalloc.
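
To make the two comments above concrete, here is a minimal, hypothetical CMake sketch (not the code from this patch; the variable names and layout are assumptions) of how rpmalloc could be wired in: only rpmalloc.c is compiled, and ENABLE_OVERRIDE is defined to 1 so that it pulls in malloc.c via #include:

  if(LLVM_INTEGRATED_CRT_ALLOC MATCHES "rpmalloc$")
    # Compile only rpmalloc.c; it #includes malloc.c itself when ENABLE_OVERRIDE
    # is 1, so malloc.c must not be built on its own.
    set(ALLOCATOR_FILES "${LLVM_INTEGRATED_CRT_ALLOC}/rpmalloc/rpmalloc.c")
    set_source_files_properties(${ALLOCATOR_FILES} PROPERTIES
      COMPILE_DEFINITIONS "ENABLE_OVERRIDE=1")
    # ${ALLOCATOR_FILES} would then be appended to the LLVMSupport source list.
  endif()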

aganea updated this revision to Diff 288117.Aug 26 2020, 2:34 PM
aganea marked 3 inline comments as done.

Address build issues.
Document the new flag.
Add more validations in cmake.

aganea retitled this revision from RFC: [Support] On Windows, add optional support for {rpmalloc|snmalloc|mimalloc} to [Support] On Windows, add optional support for {rpmalloc|snmalloc|mimalloc}.Aug 26 2020, 2:35 PM
aganea added a comment.EditedAug 26 2020, 2:37 PM

From my point of view, I think it would be fine to only support rpmalloc in this patch. One aspect that Russell touched on is that if we support a bunch of options, it's not clear how well that support will be tested. Supporting only rpmalloc would also simplify the patch a little, and small incremental change is always good I think.

Some users reported that snmalloc performs better on a cloud VM. The performance heavily depends on the configuration: whether it's an older Windows 10 build (which has the virtual pages kernel bug), or whether large pages are active. I'd also like to see how the currently supported allocators compare vs. SCUDO and vs. tcmalloc3, if someone is willing to port those to Windows.

The changes to support the allocators are quite small. @hans, I see this as temporary until we gather more feedback from users; would you be willing to keep all three allocators for now?

IANAL, but I believe that rpmalloc probably *would* be acceptable to the LLVM project under the MIT license. I think that it is up to the LLVM board to decide on licensing issues. If you're happy to go with the rpmalloc route then you could contact them directly (contact details here: http://llvm.org/foundation/), or I would be happy to do so in support of this if that would help.

Just for the record, this has been discussed offline with the LLVM board. Currently, rpmalloc cannot be included as-is into the monorepo because it has a different licence. However, @maniccoder has proposed to donate the code, so we could include it if the community feels it should be part of (and maintained within) LLVM. This should first be discussed on llvm-dev. In the meantime, we could just go ahead with this patch as a first step.

hans accepted this revision.Aug 27 2020, 2:50 AM

Thanks for the update! I'm excited to get this in.

From my point of view, I think it would be fine to only support rpmalloc in this patch. One aspect that Russell touched on is that if we support a bunch of options, it's not clear how well that support will be tested. Supporting only rpmalloc would also simplify the patch a little, and small incremental change is always good I think.

Some users reported that snmalloc performs better on a cloud VM. The performance heavily depends on the configuration: whether it's an older Windows 10 build (which has the virtual pages kernel bug), or whether large pages are active. I'd also like to see how the currently supported allocators compare vs. SCUDO and vs. tcmalloc3, if someone is willing to port those to Windows.

The changes to support the allocators are quite small. @hans, I see this as temporary until we gather more feedback from users; would you be willing to keep all three allocators for now?

Sure, I'm not against it.

llvm/docs/CMake.rst
476

nit: Shouldn't that be git clone?

llvm/lib/Support/CMakeLists.txt
69

Should this also handle trailing backslash?

This revision is now accepted and ready to land.Aug 27 2020, 2:50 AM

I'd also like to see how the currently supported allocators compare vs. SCUDO and vs. tcmalloc3, if someone is willing to port those to Windows.

I've been looking at Scudo here: https://reviews.llvm.org/D86694. It builds LLVM on Windows using -DLLVM_USE_SANITIZER=Scudo.

This revision was automatically updated to reflect the committed changes.
aganea marked 2 inline comments as done.
aganea added inline comments.Aug 27 2020, 8:10 AM
llvm/lib/Support/CMakeLists.txt
69

Done & tested. Although with cmake I've learned to always use forward slashes, even on Windows; otherwise it always fails in some way or another :-)
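
For the trailing-backslash question above, a hypothetical sketch of one way the path could be normalized in CMake (an illustration only, not necessarily what was committed):

  if(LLVM_INTEGRATED_CRT_ALLOC)
    # Convert backslashes to forward slashes and drop any trailing slash, so
    # D:\git\rpmalloc\ and D:/git/rpmalloc are treated identically.
    file(TO_CMAKE_PATH "${LLVM_INTEGRATED_CRT_ALLOC}" LLVM_INTEGRATED_CRT_ALLOC)
    string(REGEX REPLACE "/+$" "" LLVM_INTEGRATED_CRT_ALLOC "${LLVM_INTEGRATED_CRT_ALLOC}")
    if(NOT EXISTS "${LLVM_INTEGRATED_CRT_ALLOC}")
      message(FATAL_ERROR "Cannot find LLVM_INTEGRATED_CRT_ALLOC: ${LLVM_INTEGRATED_CRT_ALLOC}")
    endif()
  endif()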

hans added a comment.Aug 28 2020, 5:26 AM

This seems to have broken 32-bit builds for me. In an x86 VS 2019 prompt:

cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS="clang" ../llvm
ninja unittests\Support\DynamicLibrary\SecondLib.dll
LINK : error LNK2001: unresolved external symbol malloc
unittests\Support\DynamicLibrary\SecondLib.dll : fatal error LNK1120: 1 unresolved externals

This diff appears to fix it:

diff --git a/llvm/unittests/Support/DynamicLibrary/CMakeLists.txt b/llvm/unittests/Support/DynamicLibrary/CMakeLists.txt
index 00d20510ffd..b9bcc2d9680 100644
--- a/llvm/unittests/Support/DynamicLibrary/CMakeLists.txt
+++ b/llvm/unittests/Support/DynamicLibrary/CMakeLists.txt
@@ -45,7 +45,7 @@ function(dynlib_add_module NAME)
   # allocator (see llvm/lib/Support/Windows/Memory.inc).
   # /INCLUDE:malloc is there to force searching into LLVMSupport before libucrt
   llvm_map_components_to_libnames(llvm_libs Support)
-  target_link_libraries(${NAME} ${llvm_libs} "-INCLUDE:malloc")
+  target_link_libraries(${NAME} ${llvm_libs} "-INCLUDE:_malloc")

 endfunction(dynlib_add_module)

@hans Fixed in rGb9b954b8bbf0feed1aecde78cb6976134e460e91. Since the x86 versions of the LLVM tools are limited in terms of memory, I think it's not worth supporting LLVM_INTEGRATED_CRT_ALLOC on x86.
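
For context on the fix above: on 32-bit x86, __cdecl C symbols are decorated with a leading underscore, which is why /INCLUDE:malloc fails to resolve there. If x86 support were ever wanted, a hypothetical sketch for dynlib_add_module (not what was committed; x86 support was dropped instead) could pick the decorated name per architecture:

  # Hypothetical per-architecture symbol selection for the /INCLUDE trick.
  if(CMAKE_SIZEOF_VOID_P EQUAL 4)
    set(FORCED_MALLOC_SYMBOL "_malloc")  # 32-bit x86: decorated __cdecl name
  else()
    set(FORCED_MALLOC_SYMBOL "malloc")   # x64/ARM64: undecorated
  endif()
  target_link_libraries(${NAME} ${llvm_libs} "-INCLUDE:${FORCED_MALLOC_SYMBOL}")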

xgupta added a subscriber: xgupta.Aug 30 2020, 8:39 AM
xgupta added inline comments.
llvm/docs/CMake.rst
477

The Sphinx documentation build produces a warning, treated as an error, with ninja docs/docs-llvm-html:

FAILED: docs/CMakeFiles/docs-llvm-html
cd /home/user/llvm-project-master/build/docs && /usr/bin/sphinx-build -b html -d /home/user/llvm-project-master/build/docs/_doctrees-llvm-html -q -t builder-html -W /home/user/llvm-project-master/llvm/docs /home/user/llvm-project-master/build/docs/html
/home/user/.local/lib/python3.8/site-packages/recommonmark/parser.py:75: UserWarning: Container node skipped: type=document

warn("Container node skipped: type={0}".format(mdnode.t))

Warning, treated as error:
/home/user/llvm-project-master/llvm/docs/CMake.rst:476:Unexpected indentation.
ninja: build stopped: subcommand failed.

I think you should write these two lines as:

.. code-block:: console

  $ D:\git> git clone https://github.com/mjansson/rpmalloc
  $ D:\llvm-project> cmake ... -DLLVM_INTEGRATED_CRT_ALLOC=D:\git\rpmalloc
hans added a comment.Sep 10 2020, 8:10 AM

I've been trying to use this, but not with much luck so far.

Using rpmalloc (I used version 1.4.1), llvm-lld hangs during clang bootstrap in different stages of the build - sometimes already during the cmake invocation. I'm not sure what's going on there - it's not using a lot of memory or anything - but would be curious to find out.

Trying to use snmalloc, it doesn't build because the /std:c++17 flag doesn't actually seem to get applied to the snmalloc files:

C:\src\chromium\src\third_party\depot_tools\win_toolchain\vs_files\a687d8e2e4114d9015eb550e1b156af21381faac\VC\Tools\MSVC\14.26.28801\bin\Hostx64\x64\cl.exe  /nologo /TP -DGTEST_HAS_RTTI=0 -DUNICODE -D_CRT_NONSTDC_NO_DEPRECATE -D_CRT_NONSTDC_NO_WARNINGS -D_CRT_SECURE_NO_DEPRECATE -D_CRT_SECURE_NO_WARNINGS -D_HAS_EXCEPTIONS=0 -D_SCL_SECURE_NO_DEPRECATE -D_SCL_SECURE_NO_WARNINGS -D_UNICODE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -Ilib\Support -IC:\src\chromium\src\third_party\llvm\llvm\lib\Support -Iinclude -IC:\src\chromium\src\third_party\llvm\llvm\include -IC:\src\chromium\src\third_party\llvm-build-tools\zlib-1.2.11 -IC:\src\chromium\src\third_party\llvm-build-tools\zlib-1.2.11 /Zc:inline /Zc:__cplusplus /Zc:strictStrings /Oi /Zc:rvalueCast /bigobj /W4 -wd4141 -wd4146 -wd4244 -wd4267 -wd4291 -wd4351 -wd4456 -wd4457 -wd4458 -wd4459 -wd4503 -wd4624 -wd4722 -wd4100 -wd4127 -wd4512 -wd4505 -wd4610 -wd4510 -wd4702 -wd4245 -wd4706 -wd4310 -wd4701 -wd4703 -wd4389 -wd4611 -wd4805 -wd4204 -wd4577 -wd4091 -wd4592 -wd4319 -wd4709 -wd4324 -w14062 -we4238 /Gw /MT /O2 /Ob2   -UNDEBUG -std:c++14 /showIncludes /Folib\Support\CMakeFiles\LLVMSupport.dir\C_\src\chromium\src\third_party\llvm-build-tools\snmalloc\src\override\malloc.cc.obj /Fdlib\Support\CMakeFiles\LLVMSupport.dir\LLVMSupport.pdb /FS -c C:\src\chromium\src\third_party\llvm-build-tools\snmalloc\src\override\malloc.cc
C:\src\chromium\src\third_party\llvm-build-tools\snmalloc\src\ds\../aal/aal.h(178): error C2429: language feature 'terse static assert' requires compiler flag '/std:c++17'

but the flag does get applied to other files (because of PARENT_SCOPE?), although the compiler warns that it gets overridden by a later -std:c++14 flag:

[127/3104] Building CXX object lib\TableGen\CMakeFiles\LLVMTableGen.dir\StringMatcher.cpp.obj
 cl : Command line warning D9025 : overriding '/std:c++17' with '/std:c++14'

Maybe it would be better to try to override CXX_STANDARD instead of setting the flag?
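
A hypothetical sketch of that suggestion, assuming the snmalloc override sources are built as part of the LLVMSupport target (this is not the code from this review):

  if(LLVM_INTEGRATED_CRT_ALLOC MATCHES "snmalloc$")
    # Request C++17 through the abstract property so CMake emits one consistent
    # standard flag, instead of appending /std:c++17 to the flags and having it
    # overridden by a later -std:c++14.
    set_target_properties(LLVMSupport PROPERTIES
      CXX_STANDARD 17
      CXX_STANDARD_REQUIRED ON)
  endif()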

Using mimalloc doesn't look appealing right now since the cmake file expects to be pointed at a .lib file built with a patch applied. But hopefully they'll fix https://github.com/microsoft/mimalloc/issues/268 at some point :-)

I'll keep tinkering, just wanted to report my findings so far.

https://reviews.llvm.org/D87471 makes building with snmalloc work for me, both with msvc and clang-cl.

For rpmalloc, make sure to use the latest develop branch: the use of a hard _exit in lld, coupled with the use of fiber-local storage, sometimes made the allocator end up in a deadlock. It should be fixed on the develop branch.

hans added a comment.Sep 10 2020, 10:44 AM

For rpmalloc, make sure to use the latest develop branch: the use of a hard _exit in lld, coupled with the use of fiber-local storage, sometimes made the allocator end up in a deadlock. It should be fixed on the develop branch.

Ah, here I was trying to play it safe by using one of the release tags :) I'll try the develop branch tomorrow. Thanks!

russell.gallop added inline comments.Sep 14 2020, 2:14 AM
llvm/CMakeLists.txt
570

"option" in cmake appears to create this as a BOOL option (https://cmake.org/cmake/help/latest/command/option.html), which contradicts it being a PATH (as described in CMake.rst). I haven't seen this cause problems but it might.