User Details
- User Since
- Mar 2 2017, 6:37 AM (317 w, 2 d)
Yesterday
Mar 2 2023
I assumed that any patch would also need to pass all the testing but perhaps my Windows libc++ configuration is not quite "right".
Any patch does indeed need to pass all the testing, but that's primarily defined by the CI configurations; those must keep passing. Ideally, the tests should pass in all possible build configurations, but reality isn't quite there.
Feb 27 2023
I tested my patch locally in all 6 combinations, i.e. including release and debug runtimes. However, it's already clear that the configuration that I've been using does not match CI. Now that there are CI failures, I should hopefully be able to find out the differences in configuration.
Hi @mstorsjo, thanks for your comments.
Feb 24 2023
Feb 9 2023
Feb 8 2023
Updated to add a comment regarding the strange-looking use of 0xE0000000.
Feb 3 2023
Feb 2 2023
Updated for review suggestions.
Jan 31 2023
Jan 30 2023
Ping.
Jan 27 2023
While the change itself looks good to me, this test is for Linux (REQUIRES: x86_64-linux). Is there some motivation for this change?
Jan 26 2023
@avl, could you please get this fix cherry-picked to the 16.x release branch? The instructions for doing so are here: Backporting Fixes to the Release Branches, and the GitHub milestone is here: LLVM 16.0.0 Release. Thanks.
Jan 25 2023
I'm not sure of the latest procedure regarding release branches, but this patch should be cherry-picked to the latest 16.x branch.
Jan 24 2023
I think you need to be a bit careful with terminology because, IIUC, it's not actually a "thread-safe" allocator but a "per-thread" allocator. I think this could be useful in some situations, but perhaps a better overall approach would be to use a low-level thread-aware memory allocator. Some toolchains/runtimes already have such an allocator, and there are other options such as rpmalloc and mimalloc. This way, the benefit is more widespread and doesn't require any extra effort (although some allocator libraries have limitations on certain platforms).
Aside from the limitations on certain platforms, AFAICT these libraries do not have a notion of memory pools, i.e. allocating data in "one big chunk" (to avoid memory fragmentation and to be able to free resources quickly). Thus, if someone needs a thread-aware memory pool, using rpmalloc or mimalloc would not be enough.
I also have a connected question, not directly related to this patch. What do you think about the allocator from D142318 and the following design for such an allocator:
class ThreadPoolAllocator {
  void *Allocate(size_t size, size_t alignment);
};

class ThreadPool {
  std::unique_ptr<ThreadPoolAllocator> getAllocator();
};

class ThreadPoolExecutor {
  std::unique_ptr<ThreadPoolAllocator> getAllocator();
};

The idea is to have a thread-safe allocator connected with its execution context.
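To make the proposed interface concrete, here is a minimal sketch of how it might be used from a task. Everything below is hypothetical: the stub bodies exist only so the example is self-contained, and the real D142318 interface may well differ.

#include <cstddef>
#include <memory>
#include <new>

// Stub of the proposed pool-backed allocator; a real implementation would
// hand out memory from per-thread chunks owned by the pool.
class ThreadPoolAllocator {
public:
  void *Allocate(size_t size, size_t alignment) {
    // Placeholder body so the sketch links; a real pool would bump-allocate.
    return ::operator new(size, std::align_val_t(alignment));
  }
};

// Stub of the proposed accessor on the thread pool.
class ThreadPool {
public:
  std::unique_ptr<ThreadPoolAllocator> getAllocator() {
    return std::make_unique<ThreadPoolAllocator>();
  }
};

// A task obtains the allocator tied to its execution context and allocates
// from the pool instead of the global heap.
void exampleTask(ThreadPool &pool) {
  std::unique_ptr<ThreadPoolAllocator> alloc = pool.getAllocator();
  void *buf = alloc->Allocate(/*size=*/256, /*alignment=*/16);
  (void)buf; // in the real design this would be released in bulk with the
             // pool's memory (the stub above simply leaks it)
}

int main() {
  ThreadPool pool;
  exampleTask(pool);
}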
Jan 23 2023
I did not find a direct specification of that behavior (that thread_local variables are zero-initialized), so I created a patch which initializes it.
Perhaps I'm missing something here, but I thought that thread_local variables are zero-initialized. Therefore, the threadIndex for the main thread should already be 0. However, what is definitely of concern is that parallelFor could create contention between the main thread and thread pool thread 0, because they share index 0. I definitely missed this in the original review that introduced this change. I spotted the potential for the issue within LLD itself but completely forgot that the parallel support code also makes use of the main thread!
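To illustrate the sharing concern, a minimal standalone sketch (not the actual llvm/Support/Parallel.cpp code; the thread_local here just stands in for the real per-thread index):

#include <cstdio>
#include <thread>

// A zero-initialized thread_local standing in for the per-thread index.
// Each thread's copy starts at 0 unless something explicitly assigns it.
thread_local unsigned threadIndex; // zero-initialized in every thread

int main() {
  // The main thread never assigns threadIndex, so it reads 0 ...
  std::printf("main thread index: %u\n", threadIndex);

  // ... which is the same value a worker that *is* assigned index 0 uses,
  // so any per-index storage keyed on this value is shared (and potentially
  // contended) between the two threads.
  std::thread worker([] {
    threadIndex = 0; // worker explicitly given slot 0
    std::printf("worker thread index: %u\n", threadIndex);
  });
  worker.join();
  return 0;
}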
Jan 20 2023
Jan 10 2023
Dec 6 2022
However, the libc++ static test configuration does appear to be a mix of static and dynamic but that's another issue.
It's not necessarily a case where _every_ library is linked statically - it's a case where libc++ itself is linked statically. Whether each of libc++, msvcp, vcruntime or UCRT is linked statically or dynamically is an orthogonal distribution choice.
Dec 5 2022
These changes are not required; ensuring that libc++.lib appears before msvcprt.lib solves the problem. Therefore, relying on the #pragma comment(lib, "libc++.lib") in the libc++ config doesn't really work for this Windows configuration. The libc++ testing also explicitly lists the link libraries with libc++.lib before msvcprt.lib, which is why no problems are seen. However, the libc++ static test configuration does appear to be a mix of static and dynamic, but that's another issue.
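For illustration only (this is not libc++'s actual config header), the mechanism in question looks like the snippet below. As I understand it, the pragma embeds a default-library directive in the object file, and default libraries are considered after those listed explicitly on the link line, which is why the explicit libc++.lib-before-msvcprt.lib ordering is what actually decides symbol resolution here.

// example.cpp - illustration only; libc++'s real configuration differs.
// The pragma records a /DEFAULTLIB:libc++.lib directive in the object file;
// a command such as
//   lld-link example.obj libc++.lib msvcprt.lib ...
// still resolves duplicated symbols from libc++.lib first because the
// explicitly listed libraries are searched before default libraries.
#if defined(_MSC_VER)
#pragma comment(lib, "libc++.lib")
#endif

#include <iostream>

int main() {
  std::cout << "hello from a libc++-on-Windows test program\n";
  return 0;
}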
Nov 22 2022
Also, one reason why this slept for so long is that the libc++ review group is not on this review. Did you follow the steps in https://libcxx.llvm.org/Contributing.html#contributing-to-libc to upload the patch (in particular arc diff)?
Nov 8 2022
LGTM.
Nov 7 2022
Ping.
Nov 4 2022
Ping or any suggestions for other reviewers?
Oct 31 2022
Ping.
Oct 24 2022
Oct 18 2022
Update for review comments.
Oct 17 2022
Update for review comment.
Sep 21 2022
LGTM, and the performance benefit is good on Windows too.
Sep 13 2022
LGTM, although might want to wait to see what @MaskRay thinks.
A couple of suggestions in response to the comment on D133003 regarding no TLS support.
I think this looks fine to unblock the MinGW dylib build, but I wonder if we could reduce the number of calls to getThreadIndex() by storing the index in the RelocationScanner, although this would require quite a few changes to the existing interfaces.
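Something along these lines is what I have in mind (purely a hypothetical sketch: the stub getThreadIndex() and the per-thread storage are placeholders, and the real RelocationScanner interfaces would need reworking):

#include <cstddef>
#include <vector>

// Placeholder standing in for the real per-thread index query mentioned
// above; defined trivially here only so the sketch is self-contained.
static unsigned getThreadIndex() { return 0; }

struct Relocation {}; // placeholder

// Sketch: query the thread index once when the scanner is created and reuse
// the cached value, instead of calling getThreadIndex() for every relocation.
class RelocationScanner {
  unsigned threadIdx;                 // cached at construction
  std::vector<Relocation> *perThread; // hypothetical per-thread buffers

public:
  explicit RelocationScanner(std::vector<Relocation> *perThreadBuffers)
      : threadIdx(getThreadIndex()), perThread(perThreadBuffers) {}

  void scanOne() {
    // Reuse the cached index rather than re-reading TLS on every call.
    perThread[threadIdx].push_back(Relocation{});
  }
};

int main() {
  std::vector<Relocation> buffers[1]; // one buffer per "thread" in this toy
  RelocationScanner scanner(buffers);
  scanner.scanOne();
  return 0;
}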
Sep 12 2022
This LGTM now, but I think it would be good to get another opinion too.
Sep 11 2022
Sep 10 2022
Sep 9 2022
Performance on Windows looks good! Every test case I've tried has shown an improvement.
Far better approach now in D133003 that removes the lock contention.
If the NEEDS_* change looks good, I'll pre-commit it (without using std::atomic) to reduce diff for future updates.
Sep 7 2022
I've created D133431, which is the result of my experimentation thus far. In my testing, it slightly improves performance in the test cases that regressed, while in the other test cases the performance increase is around the same or slightly lower. However, I can't help but feel there should be a "better" solution, although I guess you always have to balance that with complexity/maintainability. The hard-coded concurrency limit of 8 tasks in D133431 also doesn't feel great.
Sep 6 2022
I don't yet know the reason for the slowdown, but I suspect it is related to the "size" of the tasks being spawned in parallel.
Sep 5 2022
Sorry, I've been busy, so I have only just had some time to look at this patch. It looks promising, but unfortunately there are performance regressions on Windows for both chrome (~3%) and mozilla (~5%) from lld-speed-test.tar.xz. I don't yet know the reason for the slowdown, but I suspect it is related to the "size" of the tasks being spawned in parallel.
Aug 24 2022
Approach adopted in D131247.
Thanks. Do you think a smaller task size like 1MiB / 2MiB would work fine?
@MaskRay, I did briefly experiment with smaller task sizes and both 1MiB and 2MiB showed similar results on my two Windows test PCs. The performance-related issue on Windows occurred when the "task size" was too large, and 4MiB appeared to be the "sweet spot" in my testing. If 1MiB or 2MiB can be shown to be beneficial for other platforms, then it should be fine for Windows too (at least in my testing).
Aug 23 2022
LGTM but I think it would be good to get some other reviewers to take a look (and perhaps test) before approval.
Aug 22 2022
I've tested this latest patch applied to our downstream port on Windows and it improves linking performance in all our test cases that I've managed to test so far.
Resign as reviewer as I also work with Carlos and helped to author this patch.
Aug 17 2022
Instead of approximating the taskSize based on a 4MB size limit, I've prototyped a patch in https://reviews.llvm.org/D132025 that does it a bit better. This prototype patch has only been briefly tested in the same configurations as I've already tested this patch. However, I have also been able to reproduce the performance degradation on a different Windows PC.
Aug 15 2022
I've looked a bit more at the Windows performance degradation and have come up with the following code for taskSize and asyncParallelFor:
// Aim for roughly 4 MiB of data per task (taking "size" to be the total
// byte count across "sections").
size_t tasks = size / (4 * 1024 * 1024);
// Number of sections handled by each task; with less than 4 MiB in total,
// tasks is 0 and everything falls back to a single task covering all sections.
size_t taskSize = tasks ? sections.size() / tasks : sections.size();
asyncParallelFor(tg, std::max<size_t>(1, taskSize), 0, sections.size(), fn);
The 4MB figure is somewhat arbitrary, but this appears to work OK on my Windows PC and mostly eliminates the performance degradation that I've seen so far. In fact, there's a ~3% improvement for mozilla from lld-speed-test.tar.xz.
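To put rough numbers on it (illustrative only): with 64 MiB of total section data and 1,000 input sections, tasks = 64 MiB / 4 MiB = 16 and taskSize = 1000 / 16 = 62, i.e. roughly 62 sections per spawned task. Below 4 MiB of data, tasks is 0 and taskSize falls back to sections.size(), so everything runs as a single task; the std::max guards against a zero task size when tasks exceeds the number of sections.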
Aug 12 2022
Some parallel* calls in SyntheticSections.cpp are now serial due to a limitation of llvm/Support/Parallel.h, e.g. GdbIndexSection::writeTo and MergeNoTailSection::writeTo. Sometimes (in all workloads I have tested) overlapping their writes with other output sections seems better than spending all threads on parallelizing them and then writing the output sections serially.
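Roughly the idea, sketched with std::async standing in for the task-group machinery in llvm/Support/Parallel.h (the function names and the split of work are hypothetical):

#include <cstdio>
#include <future>
#include <vector>

// Stand-ins for output-section writes; names are hypothetical.
void writeMergeSectionSerial() { std::puts("serial merge-section write"); }
void writeOtherSection(int i) { std::printf("other output section %d\n", i); }

int main() {
  // Instead of blocking on the (now serial) merge-section write and only
  // then writing the remaining sections, kick it off asynchronously so it
  // overlaps with the other output-section writes.
  std::future<void> merge =
      std::async(std::launch::async, writeMergeSectionSerial);

  std::vector<std::future<void>> others;
  for (int i = 0; i < 4; ++i)
    others.push_back(std::async(std::launch::async, writeOtherSection, i));

  for (auto &f : others)
    f.wait();
  merge.wait(); // everything is written before the file is committed
  return 0;
}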
Aug 11 2022
I've managed to get some time to test this change on Windows and the results do not look good. Testing chrome from lld-speed-test.tar.xz, I get a ~1% improvement, so that's the good news. However, testing mozilla, I get a ~23% increase in link time, and testing scylla, I get a ~140% increase. Testing an Unreal Engine 4-based link gives a ~21% increase in link time. This is running on a Windows 10 PC with an AMD Ryzen 3900X (12C/24T), 64GB RAM and Samsung 970 EVO NVMe SSDs. I haven't had a chance to dig deeper into why this is having such a negative impact.
Aug 10 2022
Aug 9 2022
I assume that the performance figures are based on running on Linux. Do you have any idea what the performance impact is on Windows?
Jul 8 2022
Jul 7 2022
Addressed the review comment and removed the unit test.
Jul 1 2022
Jun 24 2022
Jun 8 2022
Jun 1 2022
I've added support for x86_32/i686.