This is an archive of the discontinued LLVM Phabricator instance.

Sort symbols in .bss by size.
Needs Revision · Public

Authored by ruiu on Jan 4 2019, 11:10 AM.

Details

Summary

Android Go team (a project to make Android work better on memory- and
CPU-constrained devices) found that this heuristic achieves better
memory write locality, and thus saves memory because it reduces the
number of dirty pages at runtime.

IIRC, we used to sort common symbols by size, but that code is now gone.
This patch in some sense resurrects it and extends it to all symbols in .bss.
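The heuristic under review is simple to state: lay out .bss symbols in ascending size order so that many small symbols share a page. A minimal sketch of the idea (illustrative only, not the actual lld code; the function names are hypothetical):

```python
def align_up(addr, align):
    """Round addr up to the next multiple of align."""
    return (addr + align - 1) // align * align

def layout(symbols):
    """Assign offsets to (name, size, alignment) tuples in order."""
    addr, offsets = 0, {}
    for name, size, align in symbols:
        addr = align_up(addr, align)
        offsets[name] = addr
        addr += size
    return offsets

def sort_bss(symbols):
    # Sort ascending by size: many small symbols pack into the first
    # pages, so a program that touches only a few of them dirties
    # fewer pages at runtime.
    return sorted(symbols, key=lambda s: s[1])
```
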

Event Timeline

ruiu created this revision. Jan 4 2019, 11:10 AM
ruiu updated this revision to Diff 180287. Jan 4 2019, 11:12 AM
  • fix comment
pcc added a subscriber: pcc. Jan 4 2019, 11:20 AM
pcc added inline comments.
lld/ELF/Writer.cpp
1249

I wonder whether we should do this for all SHF_WRITE sections. I can see similar logic applying to .data for example.

ruiu edited the summary of this revision. Jan 4 2019, 11:22 AM
victoryang added inline comments.
lld/ELF/Writer.cpp
1249

For .data section specifically, the logic does apply, and I saw more memory saving when I sorted .bss+.data vs only .bss on Android. Applying this to all SHF_WRITE sections should be fine. It is possible that sorting results in no memory save at all in the worst case, but that shouldn't be a problem.

ruiu added a comment. Jan 4 2019, 12:59 PM

For .data section specifically, the logic does apply, and I saw more memory saving when I sorted .bss+.data vs only .bss on Android. Applying this to all SHF_WRITE sections should be fine. It is possible that sorting results in no memory save at all in the worst case, but that shouldn't be a problem.

Do you have a number for how much memory you could save on Android by sorting the .data section? I'm pretty interested in that number.

For .data section specifically, the logic does apply, and I saw more memory saving when I sorted .bss+.data vs only .bss on Android. Applying this to all SHF_WRITE sections should be fine. It is possible that sorting results in no memory save at all in the worst case, but that shouldn't be a problem.

Do you have a number for how much memory you could save on Android by sorting the .data section? I'm pretty interested in that number.

On the build I was testing, sorting .bss alone saved ~1.7MB across the system. Sorting .bss+.data saved ~2.6MB. The numbers of course vary depending on the build, but this shows the saving from sorting .data is non-trivial.

ruiu added a comment. Jan 4 2019, 1:09 PM

What is the entire size of your program?

I'm tempted to sort all SHF_WRITE sections by size by default. Unlike some other sections such as .init_array, there shouldn't really be any program that depends on how .data sections are laid out, though I'm not sure whether that would surprise users. But maybe we should do it?

What is the entire size of your program?

The numbers are measured across all processes running, so there are multiple programs. If you are looking for numbers from a single binary, I saw private dirty from .bss in libc went down from 36KB to 16KB. Unfortunately I didn't record the saving from sorting .bss+.data in libc. I can do the test again if that's necessary.

I'm tempted to sort all SHF_WRITE sections by size by default. Unlike some other sections such as .init_array, there shouldn't really be any program that depends on how .data sections are laid out, though I'm not sure whether that would surprise users. But maybe we should do it?

Unless there's a certain logic in how things in SHF_WRITE sections are ordered right now, I don't think anyone can realistically depend on the ordering, so this should be safe.

joerg added a subscriber: joerg. Jan 4 2019, 1:27 PM

I think this should factor in any larger-than-normal alignment; otherwise it could easily result in a significant increase in size.

ruiu added a comment. Jan 4 2019, 1:32 PM

I think this should factor in any larger-than-normal alignment; otherwise it could easily result in a significant increase in size.

I'm not worried too much about it, as I don't think sorting by size could make the situation worse.

joerg added a comment. Jan 4 2019, 2:28 PM

It can certainly make it worse, e.g. if you have one input section of half a page in size with page alignment, followed by another input section of half a page in size. Depending on the precise sizes and order, they will fit into one page or two. It's a bit contrived, but it gives the general idea.
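joerg's scenario can be made concrete with a little arithmetic (a toy model assuming 4 KiB pages, not lld's actual layout code):

```python
PAGE = 4096

def pages_used(sections):
    """Total pages spanned by (size, alignment) sections laid out in order."""
    addr = 0
    for size, align in sections:
        addr = (addr + align - 1) // align * align  # align up
        addr += size
    return (addr + PAGE - 1) // PAGE

half = PAGE // 2
# Default-aligned half-page section first: the page-aligned section
# must then skip ahead to the next page boundary -> two pages total.
worse = pages_used([(half, 8), (half, PAGE)])
# Page-aligned section first: the second packs right behind it -> one page.
better = pages_used([(half, PAGE), (half, 8)])
```

Since both orderings have the same total symbol size, only alignment padding explains the difference, which is joerg's point.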

ruiu added a comment. Jan 4 2019, 2:50 PM

But on average I don't think it makes the situation worse. Sorting the contents can change the total size, but it shouldn't consistently make it worse.

MaskRay added a comment (edited). Jan 7 2019, 6:05 PM

@victoryang Have you measured the savings from sorting in descending order? What about other metrics such as alignment?

I didn't test descending order, but assuming there's no special alignment requirement, you'd likely end up with a similar result, because you are essentially shifting all page boundaries and flipping individual symbols at the same time.

Re alignment: do we currently have any logic to order symbols based on alignment? And in terms of testing, what metrics do you want to see?

I am interested in benchmarks for things that are not specific to low-end Android: something more general, other targets, other platforms. I do not think that tweaking LLD only for Android is OK.

grimar requested changes to this revision. Jan 10 2019, 1:16 AM

"Android Go team (a project to make Android work better on memory- and
CPU-constrained devices) found that this heuristic achieves better
memory write locality,"

Please attach the numbers to the patch description.

This revision now requires changes to proceed. Jan 10 2019, 1:16 AM
ruiu added a comment. Jan 10 2019, 11:06 AM

I don't think we need a detailed benchmark for other targets, as how programs use .bss (and in general other parts of data sections) doesn't depend too much on targets.

I don't think we need a detailed benchmark for other targets, as how programs use .bss (and in general other parts of data sections) doesn't depend too much on targets.

Hard to say. I remember we had a patch that fixed an inconsistency in the handling of .bss symbols (an application wanted its symbols in the same order it would get if they were created in command-line order; it hit a relocation overflow because we now create the .bss symbols first). That patch was declined (I can find it if you want). So if this patch only solves a local issue of low-end Android handsets, I really see no reason why we should change the behavior of LLD. It is almost the same case as that patch; the difference is that now it is needed for Android and not for an unknown application. Until we have benchmarks showing a benefit for other platforms, I am not convinced we should change the linker. It is up to you anyway; I am just giving my opinion on the change.

ruiu added a comment. Jan 11 2019, 4:50 PM

I don't think we need a detailed benchmark for other targets, as how programs use .bss (and in general other parts of data sections) doesn't depend too much on targets.

Hard to say. I remember we had a patch that fixed an inconsistency in the handling of .bss symbols (an application wanted its symbols in the same order it would get if they were created in command-line order; it hit a relocation overflow because we now create the .bss symbols first). That patch was declined (I can find it if you want). So if this patch only solves a local issue of low-end Android handsets, I really see no reason why we should change the behavior of LLD. It is almost the same case as that patch; the difference is that now it is needed for Android and not for an unknown application. Until we have benchmarks showing a benefit for other platforms, I am not convinced we should change the linker. It is up to you anyway; I am just giving my opinion on the change.

For the record, the reason I'm in favor of this patch is not that I work for Google, and it is not correct to say that this patch fixes a local issue. I've been trying to treat all contributors equally. As to this patch, I think it improves not only Android but every target. The Android Go team is working hard to reduce memory usage, and that's how they found this heuristic; I mentioned them in the comment because a concrete example is more reader-friendly than saying the same thing abstractly. But there is nothing special about Android Go.

That said, I can ask Vic to build other regular Linux applications with/without this patch to see if the same improvement (or at least no regression) can be observed. Vic, do you mind if I ask you to do that?

I don't think we need a detailed benchmark for other targets, as how programs use .bss (and in general other parts of data sections) doesn't depend too much on targets.

Hard to say. I remember we had a patch that fixed an inconsistency in the handling of .bss symbols (an application wanted its symbols in the same order it would get if they were created in command-line order; it hit a relocation overflow because we now create the .bss symbols first). That patch was declined (I can find it if you want). So if this patch only solves a local issue of low-end Android handsets, I really see no reason why we should change the behavior of LLD. It is almost the same case as that patch; the difference is that now it is needed for Android and not for an unknown application. Until we have benchmarks showing a benefit for other platforms, I am not convinced we should change the linker. It is up to you anyway; I am just giving my opinion on the change.

For the record, the reason I'm in favor of this patch is not that I work for Google, and it is not correct to say that this patch fixes a local issue. I've been trying to treat all contributors equally. As to this patch, I think it improves not only Android but every target. The Android Go team is working hard to reduce memory usage, and that's how they found this heuristic; I mentioned them in the comment because a concrete example is more reader-friendly than saying the same thing abstractly. But there is nothing special about Android Go.

That said, I can ask Vic to build other regular Linux applications with/without this patch to see if the same improvement (or at least no regression) can be observed. Vic, do you mind if I ask you to do that?

I don't mind doing the test. I spent some time trying to find an open-source project for it, but I'm having a hard time finding one that 1) uses clang, 2) has a large enough .bss section for this test, and 3) doesn't take a crazy amount of time to set up the build. Before I spend more time on this, I thought I'd ask if you have any suggestions on what programs to test this on.

ruiu added a comment. Jan 14 2019, 10:28 AM

I don't think your benchmark program has to be built with clang. Or maybe you don't really have to build anything new, because Android *is* Linux. Could you gather statistics for some large applications such as Chrome on Android? We probably need just a few applications that are not very Android-ish (so Java applications may not be appropriate).

I was able to build and test Chrome on Android. Without this, the dirty pages from all libraries included in Chrome APK sum up to 364KB. With the symbols in those libraries sorted, this comes down to 360KB.

ruiu added a comment. Jan 16 2019, 4:01 PM

4KB is just one page. Honestly I think it is hard to draw a conclusion from only that data as the difference is too small. Could you give us more data points so that we are convinced that the patch actually makes a difference?

pcc added a comment. Jan 16 2019, 4:18 PM

I was able to build and test Chrome on Android. Without this, the dirty pages from all libraries included in Chrome APK sum up to 364KB. With the symbols in those libraries sorted, this comes down to 360KB.

How exactly did you get those figures? For measuring anything memory related in Chrome I'd recommend using the system_health.memory_mobile benchmarks (I'd be happy to help you run them).

4KB is just one page. Honestly I think it is hard to draw a conclusion from only that data as the difference is too small. Could you give us more data points so that we are convinced that the patch actually makes a difference?

I think in a lot of cases, you actually just want to see that this doesn't cause a regression. If most symbols in a .bss section are used, there is really no gain to be had, and that is probably the majority of cases. Where this change helps is when you have a sparsely dirtied .bss section. That is more common for versatile libraries, like libc, where you usually use only a part of the library. So far, I've tested:

  • libc: 36KB -> 16KB (x86_64), 32KB -> 12KB (arm32)
  • Chrome on Android: 364KB -> 360KB (arm32)
  • tcpdump: No change (x86_64)
  • strace: No change (x86_64)

I don't think we should expect this to work wonders for every program. What we are more likely to see is that some programs show a pretty good gain, while others show no change or close to none. Based on the numbers I have so far, it certainly looks like this change is beneficial on average. This was also shown when I applied it to all (non-prebuilt) binaries in an x86_64 Android build.

In D56325#1360770, @pcc wrote:

I was able to build and test Chrome on Android. Without this, the dirty pages from all libraries included in Chrome APK sum up to 364KB. With the symbols in those libraries sorted, this comes down to 360KB.

How exactly did you get those figures? For measuring anything memory related in Chrome I'd recommend using the system_health.memory_mobile benchmarks (I'd be happy to help you run them).

I have a modified dynamic linker that tags each .bss VM area with the library it corresponds to, and I only count the areas from libraries included in the APK (that is, libraries I applied this change to). I didn't want to measure memory usage as a whole because I worried the variation might be larger than the change this causes.

Another data point: Sorting the data section for Chrome on Android, data section dirty pages went down from 6308KB to 6280KB (arm32).

Another data point: Sorting the data section for Chrome on Android, data section dirty pages went down from 6308KB to 6280KB (arm32).

Or 0.5%...

Another data point: Sorting the data section for Chrome on Android, data section dirty pages went down from 6308KB to 6280KB (arm32).

Or 0.5%...

0.5% may not sound like a lot, but that's already a helpful amount on some systems, where you constantly have to make tradeoffs that sacrifice functionality to free up memory.

Another data point: Sorting the data section for Chrome on Android, data section dirty pages went down from 6308KB to 6280KB (arm32).

Or 0.5%...

0.5% may not sound like a lot, but that's already a helpful amount on some systems, where you constantly have to make tradeoffs that sacrifice functionality to free up memory.

Given all the results you provided, this patch now looks like a micro-optimization for a particular set of code. In other words, based on the results, it seems useless for non-micro apps to me. At the same time, it adds logic to a linker that is supposed to serve all kinds of applications, and that code is a small but real additional complication that we will have to support.

Please understand my position correctly. I am an Android user and I am happy with it; it's a good OS IMO. But this patch is going to change apps all over the world, and the best result for them shown in this thread seems to be "there is no regression". Having no regression is certainly great, but doing micro-optimizations for particular apps and adding more code to the linker for no benefit to 99% of users (I guess) probably is not.

I would be happy to hear other opinions; am I missing something? I do not want to be the person who blocks a good change to the linker, but is this really a good one?

ruiu added a comment. Jan 22 2019, 5:47 PM

George,

0.5% is not a micro-optimization if it works for many applications. If you find a heuristic that can generally reduce memory usage by 0.5%, that's not negligible. Imagine, for example, a 0.5% memory usage reduction in a huge cloud application; you might be able to save millions of dollars.

Vic,

That said, the numbers you've shown so far honestly don't seem very convincing. There are too few data points for me to tell whether it works in general. I believe I can test your patch myself and see how it works for Linux applications by parsing /proc/self/smaps (which contains information about the number of dirty pages). Let me try that for a few applications, such as clang.

0.5% is not a micro-optimization if it works for many applications.
...
There are too few data points for me to tell whether it works in general

That is what I tried to say. For now, it has not been shown that this works for anything larger than a few small apps.
If you can show that it really works, or can work, for other or larger apps, that would be great.

George,

0.5% is not a micro-optimization if it works for many applications. If you find a heuristic that can generally reduce memory usage by 0.5%, that's not negligible. Imagine, for example, a 0.5% memory usage reduction in a huge cloud application; you might be able to save millions of dollars.

Vic,

That said, the numbers you've shown so far honestly don't seem very convincing. There are too few data points for me to tell whether it works in general. I believe I can test your patch myself and see how it works for Linux applications by parsing /proc/self/smaps (which contains information about the number of dirty pages). Let me try that for a few applications, such as clang.

Rui,

Did you get a chance to try this out?