This is an archive of the discontinued LLVM Phabricator instance.

Sort symbols in .bss by size.
Needs Revision · Public

Authored by ruiu on Jan 4 2019, 11:10 AM.

Details

Summary

Android Go team (a project to make Android work better on memory- and
CPU-constrained devices) found that this heuristic achieves better
memory write locality, and thus saves memory because it reduces the
number of dirty pages at runtime.

IIRC, we used to sort common symbols by size, but that code is now gone.
This patch in some sense resurrects it and extends it to all symbols in .bss.
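The heuristic under review is simple to state: lay out .bss symbols in ascending size order so that many small symbols share a page. A minimal sketch of the idea (illustrative only, not the actual lld code; the function names are hypothetical):

```python
def align_up(addr, align):
    """Round addr up to the next multiple of align."""
    return (addr + align - 1) // align * align

def layout(symbols):
    """Assign offsets to (name, size, alignment) tuples in order."""
    addr, offsets = 0, {}
    for name, size, align in symbols:
        addr = align_up(addr, align)
        offsets[name] = addr
        addr += size
    return offsets

def sort_bss(symbols):
    # Sort ascending by size: many small symbols pack into the first
    # pages, so a program that touches only a few of them dirties
    # fewer pages at runtime.
    return sorted(symbols, key=lambda s: s[1])
```
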

Event Timeline

ruiu created this revision. Jan 4 2019, 11:10 AM
ruiu updated this revision to Diff 180287. Jan 4 2019, 11:12 AM
  • fix comment
pcc added a subscriber: pcc. Jan 4 2019, 11:20 AM
pcc added inline comments.
lld/ELF/Writer.cpp
1249

I wonder whether we should do this for all SHF_WRITE sections. I can see similar logic applying to .data for example.

ruiu edited the summary of this revision. Jan 4 2019, 11:22 AM
victoryang added inline comments.
lld/ELF/Writer.cpp
1249

For .data section specifically, the logic does apply, and I saw more memory saving when I sorted .bss+.data vs only .bss on Android. Applying this to all SHF_WRITE sections should be fine. It is possible that sorting results in no memory save at all in the worst case, but that shouldn't be a problem.

ruiu added a comment. Jan 4 2019, 12:59 PM

For .data section specifically, the logic does apply, and I saw more memory saving when I sorted .bss+.data vs only .bss on Android. Applying this to all SHF_WRITE sections should be fine. It is possible that sorting results in no memory save at all in the worst case, but that shouldn't be a problem.

Do you have a number for how much memory you could save on Android by sorting the .data section? I'm pretty interested in that number.

For .data section specifically, the logic does apply, and I saw more memory saving when I sorted .bss+.data vs only .bss on Android. Applying this to all SHF_WRITE sections should be fine. It is possible that sorting results in no memory save at all in the worst case, but that shouldn't be a problem.

Do you have a number for how much memory you could save on Android by sorting the .data section? I'm pretty interested in that number.

On the build I was testing, sorting .bss alone saved ~1.7MB across the system. Sorting .bss+.data saved ~2.6MB. The numbers of course vary depending on the build, but this shows the saving from sorting .data is non-trivial.

ruiu added a comment. Jan 4 2019, 1:09 PM

What is the entire size of your program?

I'm tempted to sort all SHF_WRITE sections by size by default. Unlike some other sections such as .init_array, there shouldn't really be any program that depends on how .data sections are laid out, though I'm not sure whether that would surprise users. But maybe we should do it?

What is the entire size of your program?

The numbers are measured across all processes running, so there are multiple programs. If you are looking for numbers from a single binary, I saw private dirty from .bss in libc went down from 36KB to 16KB. Unfortunately I didn't record the saving from sorting .bss+.data in libc. I can do the test again if that's necessary.

I'm tempted to sort all SHF_WRITE sections by size by default. Unlike some other sections such as .init_array, there shouldn't really be any program that depends on how .data sections are laid out, though I'm not sure whether that would surprise users. But maybe we should do it?

Unless there's a certain logic in how things in SHF_WRITE sections are ordered right now, I don't think anyone can realistically depend on the ordering, so this should be safe.

joerg added a subscriber: joerg. Jan 4 2019, 1:27 PM

I think this should factor in any larger-than-normal alignment; otherwise it could easily result in a significant increase in size.

ruiu added a comment. Jan 4 2019, 1:32 PM

I think this should factor in any larger-than-normal alignment; otherwise it could easily result in a significant increase in size.

I'm not worried too much about it, as I don't think sorting by size could make the situation worse.

joerg added a comment. Jan 4 2019, 2:28 PM

It can certainly make it worse, e.g. if you have one input section of half a page in size with page alignment, followed by another input section of half a page in size. Depending on the precise sizes and order, they will fit into one page or two. It's a bit contrived, but it gives the general idea.
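joerg's scenario can be made concrete with a little arithmetic (a toy model assuming 4 KiB pages, not lld's actual layout code):

```python
PAGE = 4096

def pages_used(sections):
    """Total pages spanned by (size, alignment) sections laid out in order."""
    addr = 0
    for size, align in sections:
        addr = (addr + align - 1) // align * align  # align up
        addr += size
    return (addr + PAGE - 1) // PAGE

half = PAGE // 2
# Default-aligned half-page section first: the page-aligned section
# must then skip ahead to the next page boundary -> two pages total.
worse = pages_used([(half, 8), (half, PAGE)])
# Page-aligned section first: the second packs right behind it -> one page.
better = pages_used([(half, PAGE), (half, 8)])
```

Since both orderings have the same total symbol size, only alignment padding explains the difference, which is joerg's point.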

ruiu added a comment. Jan 4 2019, 2:50 PM

But on average I don't think it makes the situation worse. Sorting the contents can change the total size, but it shouldn't consistently make it worse.

MaskRay added a comment (edited). Jan 7 2019, 6:05 PM

@victoryang Have you measured the savings from sorting in descending order? What about other metrics such as alignment?

I didn't test descending order, but assuming there's no special alignment requirement, you'd likely end up with a similar result, because you are essentially shifting all page boundaries and flipping individual symbols at the same time.

Re alignment: do we currently have any logic to order symbols based on alignment? And in terms of testing, what metrics do you want to see?

I am interested in benchmarks for things that are not specific to low-end Android: something more general, other targets, other platforms. I do not think that tweaking LLD only for Android is OK.

grimar requested changes to this revision. Jan 10 2019, 1:16 AM

"Android Go team (a project to make Android work better on memory- and
CPU-constrained devices) found that this heuristic achieves better
memory write locality,"

Please attach the numbers to the patch description.

This revision now requires changes to proceed. Jan 10 2019, 1:16 AM
ruiu added a comment. Jan 10 2019, 11:06 AM

I don't think we need a detailed benchmark for other targets, as how programs use .bss (and in general other parts of data sections) doesn't depend too much on targets.

I don't think we need a detailed benchmark for other targets, as how programs use .bss (and in general other parts of data sections) doesn't depend too much on targets.

Hard to say. I remember we had a patch that fixed an inconsistency in the handling of .bss symbols (an application wanted its symbols in the same order it would get if they were created in command-line order; it hit a relocation overflow because we now create the .bss symbols first). That patch was declined (I can find it if you want). So if this patch only solves a local issue of low-end Android handsets, I really see no reason why we should change the behavior of LLD. It is almost the same case as that patch; the difference is that now it is needed for Android and not for an unknown application. Until we have benchmarks showing a benefit for other platforms, I am not convinced we should change the linker. It is up to you anyway; I am just giving my opinion on the change.

ruiu added a comment. Jan 11 2019, 4:50 PM

I don't think we need a detailed benchmark for other targets, as how programs use .bss (and in general other parts of data sections) doesn't depend too much on targets.

Hard to say. I remember we had a patch that fixed an inconsistency in the handling of .bss symbols (an application wanted its symbols in the same order it would get if they were created in command-line order; it hit a relocation overflow because we now create the .bss symbols first). That patch was declined (I can find it if you want). So if this patch only solves a local issue of low-end Android handsets, I really see no reason why we should change the behavior of LLD. It is almost the same case as that patch; the difference is that now it is needed for Android and not for an unknown application. Until we have benchmarks showing a benefit for other platforms, I am not convinced we should change the linker. It is up to you anyway; I am just giving my opinion on the change.

For the record, the reason I'm in favor of this patch is not that I work for Google, and it is not correct to say that this patch fixes a local issue. I've been trying to treat all contributors equally. As to this patch, I think it improves not only Android but every target. The Android Go team is working hard to reduce memory usage, and that's how they found this heuristic; I mentioned them in the comment because a concrete example is more reader-friendly than saying the same thing abstractly. But there is nothing special about Android Go.

That said, I can ask Vic to build other regular Linux applications with/without this patch to see if the same improvement (or at least no regression) can be observed. Vic, do you mind if I ask you to do that?

I don't think we need a detailed benchmark for other targets, as how programs use .bss (and in general other parts of data sections) doesn't depend too much on targets.

Hard to say. I remember we had a patch that fixed an inconsistency in the handling of .bss symbols (an application wanted its symbols in the same order it would get if they were created in command-line order; it hit a relocation overflow because we now create the .bss symbols first). That patch was declined (I can find it if you want). So if this patch only solves a local issue of low-end Android handsets, I really see no reason why we should change the behavior of LLD. It is almost the same case as that patch; the difference is that now it is needed for Android and not for an unknown application. Until we have benchmarks showing a benefit for other platforms, I am not convinced we should change the linker. It is up to you anyway; I am just giving my opinion on the change.

For the record, the reason I'm in favor of this patch is not that I work for Google, and it is not correct to say that this patch fixes a local issue. I've been trying to treat all contributors equally. As to this patch, I think it improves not only Android but every target. The Android Go team is working hard to reduce memory usage, and that's how they found this heuristic; I mentioned them in the comment because a concrete example is more reader-friendly than saying the same thing abstractly. But there is nothing special about Android Go.

That said, I can ask Vic to build other regular Linux applications with/without this patch to see if the same improvement (or at least no regression) can be observed. Vic, do you mind if I ask you to do that?

I don't mind doing the test. I spent some time trying to find an open-source project for it, but I'm having a hard time finding one that 1) uses clang, 2) has a large enough .bss section for this test, and 3) doesn't take a crazy amount of time to set up the build. Before I spend more time on this, I thought I'd ask if you have any suggestions on what programs to test this on.

ruiu added a comment. Jan 14 2019, 10:28 AM

I don't think your benchmark program has to be built with clang. Or maybe you don't really have to build anything new, because Android *is* Linux. Could you gather statistics for some large applications such as Chrome on Android? We probably need just a few applications that are not very Android-ish (so Java applications may not be appropriate).

I was able to build and test Chrome on Android. Without this, the dirty pages from all libraries included in Chrome APK sum up to 364KB. With the symbols in those libraries sorted, this comes down to 360KB.

ruiu added a comment. Jan 16 2019, 4:01 PM

4KB is just one page. Honestly I think it is hard to draw a conclusion from only that data as the difference is too small. Could you give us more data points so that we are convinced that the patch actually makes a difference?

pcc added a comment. Jan 16 2019, 4:18 PM

I was able to build and test Chrome on Android. Without this, the dirty pages from all libraries included in Chrome APK sum up to 364KB. With the symbols in those libraries sorted, this comes down to 360KB.

How exactly did you get those figures? For measuring anything memory related in Chrome I'd recommend using the system_health.memory_mobile benchmarks (I'd be happy to help you run them).

4KB is just one page. Honestly I think it is hard to draw a conclusion from only that data as the difference is too small. Could you give us more data points so that we are convinced that the patch actually makes a difference?

I think in a lot of cases, you actually just want to see that this doesn't cause a regression. If most symbols in a .bss section are used, there is really no gain to be had, and that is probably the majority of cases. Where this change helps is when you have a sparsely dirtied .bss section. That is more common for versatile libraries, like libc, where you usually use only a part of the library. So far, I've tested:

  • libc: 36KB -> 16KB (x86_64), 32KB -> 12KB (arm32)
  • Chrome on Android: 364KB -> 360KB (arm32)
  • tcpdump: No change (x86_64)
  • strace: No change (x86_64)

I don't think we should expect this to work wonders for every program. What we are more likely to see is that some programs show a pretty good gain, while others show no change or close to none. Based on the numbers I have so far, it certainly looks like this change is beneficial on average. This was also shown when I applied it to all (non-prebuilt) binaries in an x86_64 Android build.

In D56325#1360770, @pcc wrote:

I was able to build and test Chrome on Android. Without this, the dirty pages from all libraries included in Chrome APK sum up to 364KB. With the symbols in those libraries sorted, this comes down to 360KB.

How exactly did you get those figures? For measuring anything memory related in Chrome I'd recommend using the system_health.memory_mobile benchmarks (I'd be happy to help you run them).

I have a modified dynamic linker that tags each .bss VM area with the library it corresponds to, and I only count the areas from libraries included in the APK (that is, libraries I applied this change to). I didn't want to measure memory usage as a whole because I worried the variation might be larger than the change this causes.

Another data point: Sorting the data section for Chrome on Android, data section dirty pages went down from 6308KB to 6280KB (arm32).

Another data point: Sorting the data section for Chrome on Android, data section dirty pages went down from 6308KB to 6280KB (arm32).

Or 0.5%...

Another data point: Sorting the data section for Chrome on Android, data section dirty pages went down from 6308KB to 6280KB (arm32).

Or 0.5%...

0.5% may not sound like a lot, but that's already a helpful amount on some systems, where you constantly have to make tradeoffs that sacrifice functionality to free up memory.

Another data point: Sorting the data section for Chrome on Android, data section dirty pages went down from 6308KB to 6280KB (arm32).

Or 0.5%...

0.5% may not sound like a lot, but that's already a helpful amount on some systems, where you constantly have to make tradeoffs that sacrifice functionality to free up memory.

Given all the results you provided, this patch now looks like a micro-optimization for a particular set of code. In other words, based on the results, it seems useless for non-micro apps to me. At the same time, it adds logic to a linker that is supposed to serve all kinds of applications, and that code is a small but real additional complication that we will have to support.

Please understand my position correctly. I am an Android user and I am happy with it; it's a good OS IMO. But this patch is going to change apps all over the world, and the best result for them shown in this thread seems to be "there is no regression". Having no regression is certainly great, but doing micro-optimizations for particular apps and adding more code to the linker for no benefit to 99% of users (I guess) probably is not.

I would be happy to hear other opinions; am I missing something? I do not want to be the person who blocks a good change to the linker, but is this really a good one?

ruiu added a comment. Jan 22 2019, 5:47 PM

George,

0.5% is not a micro-optimization if it works for many applications. If you find a heuristic that can generally reduce memory usage by 0.5%, that's not negligible. Imagine, for example, a 0.5% memory usage reduction in a huge cloud application; you might be able to save millions of dollars.

Vic,

That said, the numbers you've shown so far honestly don't seem very convincing. There are too few data points for me to tell whether it works in general. I believe I can test your patch myself and see how it works for Linux applications by parsing /proc/self/smaps (which contains information about the number of dirty pages). Let me try that for a few applications, such as clang.

0.5% is not a micro-optimization if it works for many applications.
...
There are too few data points for me to tell whether it works in general

That is what I tried to say. For now, it has not been shown that this works for anything larger than a few small apps.
If you can show that it really works, or can work, for other or larger apps, that would be great.

George,

0.5% is not a micro-optimization if it works for many applications. If you find a heuristic that can generally reduce memory usage by 0.5%, that's not negligible. Imagine, for example, a 0.5% memory usage reduction in a huge cloud application; you might be able to save millions of dollars.

Vic,

That said, the numbers you've shown so far honestly don't seem very convincing. There are too few data points for me to tell whether it works in general. I believe I can test your patch myself and see how it works for Linux applications by parsing /proc/self/smaps (which contains information about the number of dirty pages). Let me try that for a few applications, such as clang.

Rui,

Did you get a chance to try this out?