We currently process one OutputSection at a time and, for each one, write its contained input sections in parallel. This strategy does not leverage multi-threading well. Instead, parallelize writes of different OutputSections.

The default TaskSize for parallelFor often leads to inferior sharding, so we prepare the tasks in the caller instead.
- Move llvm::parallel::detail::TaskGroup to llvm::parallel::TaskGroup.
- Add llvm::parallel::TaskGroup::execute.
- Change writeSections to declare a TaskGroup and pass it to writeTo (see the sketch below).
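
Roughly, the new structure looks like the following sketch. This is not the actual lld code: FakeInputSection, FakeOutputSection, the fixed shard count, and the memcpy layout are stand-ins for illustration; only llvm::parallel::TaskGroup and its new execute member come from this patch.

```cpp
#include "llvm/Support/Parallel.h"

#include <cstring>
#include <vector>

struct FakeInputSection {
  std::vector<char> data;
};

struct FakeOutputSection {
  size_t offset = 0;
  std::vector<FakeInputSection> members;

  // Analogue of OutputSection::writeTo: enqueue explicitly sized shards onto
  // the caller's TaskGroup instead of relying on parallelFor's default
  // TaskSize, so writes from different output sections can interleave.
  void writeTo(char *buf, llvm::parallel::TaskGroup &tg) {
    const size_t numShards = 4; // illustrative; lld sizes shards from the input
    for (size_t shard = 0; shard != numShards; ++shard)
      tg.execute([this, buf, shard, numShards] {
        size_t pos = 0;
        for (size_t i = 0; i != members.size(); ++i) {
          if (i % numShards == shard)
            std::memcpy(buf + pos, members[i].data.data(),
                        members[i].data.size());
          pos += members[i].data.size();
        }
      });
  }
};

// Analogue of writeSections: declare the TaskGroup here and pass it down so
// that different output sections are also written in parallel.
void writeSections(char *outBuf, std::vector<FakeOutputSection> &sections) {
  llvm::parallel::TaskGroup tg;
  for (FakeOutputSection &osec : sections)
    osec.writeTo(outBuf + osec.offset, tg);
  // The TaskGroup waits for all outstanding tasks when it is destroyed, so
  // the buffer is fully written by the time this function returns.
}
```

Sharing one TaskGroup across all sections keeps every shard in a single pool of tasks, which is what lets writes of different OutputSections overlap instead of being serialized per section.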
Speed-up with --threads=8:
- clang -DCMAKE_BUILD_TYPE=Release: 1.11x as fast
- clang -DCMAKE_BUILD_TYPE=Debug: 1.10x as fast
- chrome -DCMAKE_BUILD_TYPE=Release: 1.04x as fast
- scylladb build/release: 1.09x as fast
On M1, many benchmarks are a small fraction of a percent faster. Mozilla showed the largest difference, with the patch being about 1.03x as fast.
Given there is already a "live" TaskGroup, I don't think this will actually run in parallel, IIUC. That said, this is a limitation of the current parallel implementation rather than of this patch.
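
For reference, a hypothetical illustration of that limitation, assuming the nesting rule that only the first live TaskGroup dispatches work to the thread pool while any TaskGroup constructed while another is alive runs its tasks inline:

```cpp
// Hypothetical illustration only; the rule that a second live TaskGroup runs
// its tasks inline is an assumption about the current parallel implementation.
#include "llvm/Support/Parallel.h"

void demo() {
  llvm::parallel::TaskGroup outer; // first live group: may use the thread pool
  llvm::parallel::TaskGroup inner; // created while outer is alive
  inner.spawn([] { /* assumed to run inline on the calling thread */ });
  outer.spawn([] { /* assumed eligible to run on the thread pool */ });
}
```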