
Add IR support, ELF section and user documentation for partitioning feature.
ClosedPublic

Authored by pcc on Apr 3 2019, 5:34 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

pcc created this revision.Apr 3 2019, 5:34 PM
Herald added a project: Restricted Project.
ruiu added inline comments.Apr 4 2019, 12:47 AM
lld/docs/Partitions.rst
15 ↗(On Diff #193636)

share a virtual address space at link time?

27–31 ↗(On Diff #193636)

Each partition's entry point address is fixed at link time, so I wonder why you chose to get the address of the entry point by name. If you refer to the entry point by address, you could possibly remove the symbol table, no?

I've put some suggestions for the documentation. Feel free to ignore them if you prefer the original. I can't see anything wrong with the LLVM changes at a cursory look.

lld/docs/Partitions.rst
10 ↗(On Diff #193636)

Program is an interesting choice of word here. From the initial RFC I think this is largely aimed at executables. I don't think that it would be impossible to make this work with shared libraries; however, I'd expect the constraints would make it difficult to use effectively. I'm thinking of an executable calling a function in the shared library that happens to be in a partition that is not loaded.

Maybe replace program with executable if that is the only option.

16 ↗(On Diff #193636)

I tend to think of an address as an absolute address; I think "fixed load address relative to the main partition" is meant to be an offset. A suggestion:
"and each loadable partition is assigned a fixed offset from the main partition."

19 ↗(On Diff #193636)

I think that you could remove the "by address" if offset were used earlier.

25 ↗(On Diff #193636)

Just a suggestion, to be used as the "entry points" for each partition.

65 ↗(On Diff #193636)

If I've understood it correctly, you have to know ahead of time which source files are in which partition; the linker won't automatically create these for you. It may be worth putting a shell comment before the main.c and before the feature.c emphasising that these are likely to be two separate build steps. I'm thinking that a common tool flow would be to build each partition into a separate static archive.
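A hedged sketch of what such a two-step flow might look like. The file names main.c/feature.c follow the example under discussion; the -fsymbol-partition flag spelling and the llvm-objcopy --extract-* options are assumed from the feature's design and may differ from the final interface:

```shell
# Step 1: compile the main program (its sections go into the main partition).
clang -ffunction-sections -fdata-sections -c main.c -o main.o

# Step 2: compile the loadable feature, tagging its entry points for the
# libfeature.so partition (hypothetical flag spelling).
clang -ffunction-sections -fdata-sections -fsymbol-partition=libfeature.so \
      -c feature.c -o feature.o

# Link once; --gc-sections lets the linker assign sections to partitions.
clang -fuse-ld=lld -shared -Wl,--gc-sections main.o feature.o -o app.combined

# Split the combined output into one file per partition.
llvm-objcopy --extract-main-partition app.combined libmain.so
llvm-objcopy --extract-partition=libfeature.so app.combined libfeature.so
```

These are toolchain command sketches rather than a runnable script; a real build would drive them from separate build rules per partition, as the comment above suggests.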

67 ↗(On Diff #193636)

Just a thought: is --gc-sections compulsory? I don't think that it is, but if you are hanging this off --gc-sections it might be easier to automatically enable it if LLD encounters a partition. Or maybe have a --combined-object option that forces --gc-sections, and if not present drops all the partition information and produces a regular output?

70 ↗(On Diff #193636)

This is an llvm-objcopy thought: perhaps an --extract-partition=all option would be useful. In this case the partition name would be used as the output filename, as in your example above.

83 ↗(On Diff #193636)

Could offset be used instead of relptr?

pcc updated this revision to Diff 193793.Apr 4 2019, 3:37 PM
pcc marked 11 inline comments as done.
  • Address review comments
lld/docs/Partitions.rst
10 ↗(On Diff #193636)

It can be used with both executables and shared libraries. Indeed, the native parts of Android apps are shared libraries, so the shared library case is important to me. I'll clarify this in the doc.

In your executable calling a shared library scenario, the executable is linked separately from the shared library, and depends on it in the usual way (either DT_NEEDED or dlopen), correct? In this case, I don't think it would be possible for such a call to happen. To support the intended usage model of looking up partition entry points by name, the dynamic symbol table of each partition (including the main partition) defines just the symbols for that partition.
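The lookup-by-name usage model described here can be sketched in C. The library name libfeature.so and the symbol feature_entry are hypothetical stand-ins for a loadable partition and its entry point, so this is an illustration of the intended pattern rather than a runnable program:

```c
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
  /* Load the partition explicitly; until this succeeds, nothing in the
   * partition (including its entry point) may be referenced. */
  void *handle = dlopen("libfeature.so", RTLD_NOW | RTLD_LOCAL);
  if (!handle) {
    fprintf(stderr, "dlopen: %s\n", dlerror());
    return 1;
  }

  /* The entry point is found by name in the partition's dynamic symbol
   * table; each partition's table defines only that partition's symbols. */
  void (*entry)(void) = (void (*)(void))dlsym(handle, "feature_entry");
  if (!entry) {
    fprintf(stderr, "dlsym: %s\n", dlerror());
    return 1;
  }
  entry();
  return 0;
}
```

Because the main partition's dynamic symbol table does not define the loadable partitions' symbols, a stray reference that bypasses this dlopen/dlsym sequence fails to resolve rather than silently landing in unmapped code.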

One thing that could cause problems is if the executable had a DT_NEEDED dependency on a loadable partition, as this may result in the main partition being loaded at the wrong relative address. In this case, I think it should be possible for a dynamic loader to make this work by using something like the dynamic tag that I mentioned in my initial proposal.

15 ↗(On Diff #193636)

Yes, that might be the best way to phrase it. The partitions share a virtual address space in the p_vaddr sense, but that could probably be confused with the runtime virtual address space which is shared by everything.

27–31 ↗(On Diff #193636)

A couple of reasons:

  • This forces users to explicitly load the partition in order to access the entry point. If the entry point addresses were available statically, it would be easier to accidentally refer to a partition's entry point before it is loaded.
  • In some cases, it's desirable for the entry points to be findable by name in the dynamic symbol table. For example, this is how JNI finds implementations of native methods (see https://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/design.html#wp615), which could be very useful for Android apps.
65 ↗(On Diff #193636)

Done. Note that it might not work for them to be archives, because there is no guarantee that the archive members will be added to the link, especially if there are no direct references to them. (The same thing applies to anything else that might need to be exported from a shared object.) In Chromium we're going to use a source_set [1] for the partition entry points.

[1] https://gn.googlesource.com/gn/+/refs/heads/master/docs/reference.md#func_source_set

67 ↗(On Diff #193636)

Just a thought, is --gc-sections compulsory?

It technically isn't, but the behaviour without it isn't all that useful. With my current implementation, what I think will happen is that the linker will put almost everything in the main partition, and the loadable partitions will just contain symbol tables (whose symbols will point into the main partition). So it might indeed be a good idea to enable --gc-sections if partitions are encountered.

Or maybe have a --combined-object option that forces --gc-sections, and if not present drops all the partition information and produces a regular output?

We might want an option for controlling whether to create a combined output file, but I don't think we should have that to start with, and I don't think we should be dropping the partition specifications by default. Similar to my point in http://lists.llvm.org/pipermail/llvm-dev/2019-March/131015.html, leaving them in would allow downstream tools to detect the case where partition specifications were present but not acted upon.

70 ↗(On Diff #193636)

It might occasionally be useful, though I'd probably name it --extract-all-partitions to avoid the (admittedly corner-case) ambiguity between extracting all partitions and extracting a partition named "all". But in most cases the call to llvm-objcopy will be integrated into a build system that knows exactly which partitions are being created (and should be able to schedule the llvm-objcopy commands in parallel), and the feature can be implemented relatively easily in a shell script, so it doesn't seem that important to me.
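The shell-script implementation alluded to here could be as simple as the following sketch. The combined file name, the partition names, and the idea that the partition name doubles as the output filename are illustrative assumptions; only the --extract-main-partition and --extract-partition options come from the discussion above:

```shell
#!/bin/sh
# Split a combined output file into its main partition and each named
# loadable partition. The partition list would come from the build system,
# which knows exactly which partitions it created.
combined=app.combined

llvm-objcopy --extract-main-partition "$combined" libmain.so

for part in libfeature1.so libfeature2.so; do
  # The partition name doubles as the output filename.
  llvm-objcopy --extract-partition="$part" "$combined" "$part"
done
```

A build system could run each loop iteration as an independent rule, which is the parallel-scheduling point made above.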

83 ↗(On Diff #193636)

Yes, that works too.

Thanks for the updates and clarifications, I don't have any more comments.

Hi Peter,

I really like the concept. I had some nebulous concerns when I read the RFC, but I didn't comment as I couldn't think of any alternatives to what you were proposing. What I am writing here is still not fully formed, so feel free to ignore it if it is unhelpful.

My main concern is that this is supporting a rare use case, and I don't really like the idea of adding complexity to the core tools for niche cases.

I feel like it would be better if this could be implemented via post-processing in some way, outside of the core tools.

Some aspects of this, like the linker being able to report all of the functions reachable from a given set of entry points, are of general utility, though.

I wonder if there is any scope for implementing something different along the lines of:

  1. Do an initial link.
  2. Analyse the resulting ELF and report all of the functions in each partition.
  3. Do another link but this time provide an ordering file so that all of the functions for each partition are placed contiguously.
  4. Extract the functions for each partition, apart from the first, into a partition file which also contains metadata specifying the address range.
  5. Write zeros over all of the functions apart from those in the main partition and then compress the executable (saving the file space).
  6. A loader extension patches in the missing functions for each partition when needed.
pcc added a comment.Apr 5 2019, 11:19 AM


Hi Ben, thanks for the feedback. I shared a similar concern around complexity, but I don't see any reasonable alternatives to implementing this directly in the linker.

Your idea of using orderfiles is interesting, but I see at least two showstopper problems with it. The first is that it is incompatible with range extension thunks. Imagine that you have one main partition and two loadable partitions. An orderfile is used to sort the main partition's sections before partition 1's sections, and partition 1's sections before partition 2's. We're going to split partition 1 and partition 2 into separate files, but the linker doesn't know that, so it may place a range extension thunk in the middle of partition 1's sections and have partition 2 call it. Now at runtime we will crash if partition 2 is loaded without partition 1 and tries to call one of the range extension thunks in partition 1. We might be able to work around this issue by telling the linker where the partition boundaries are, but at that point, since the linker needs to know how you're going to split the file up anyway, it might as well actually split it up for you.

The second is that replacing parts of the file with zeros and compressing it isn't always useful from a size perspective; in many cases it's the uncompressed size that matters, because that's what takes up storage space on the device. One might consider decompressing the code at program startup, but that leads to other problems: it slows down startup and prevents the kernel from paging out the pages containing the decompressed code.
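The compressed-vs-uncompressed distinction can be illustrated with a small Python sketch (zlib stands in for whatever compressor a build would use, and the partition sizes are made up): zeroing out the non-main partitions makes the file compress very well, but the decompressed image that must exist on disk or in memory is just as large as before.

```python
import os
import zlib

# Hypothetical layout: a 1 MiB image in which only the first 256 KiB is the
# main partition; the remaining 768 KiB stands in for the zeroed-out
# loadable partitions. Random bytes model incompressible code.
image = os.urandom(256 * 1024) + bytes(768 * 1024)

compressed = zlib.compress(image, level=9)
print(len(image))                     # uncompressed size is unchanged: 1048576
print(len(compressed) < len(image))   # the zero runs compress away: True

# But to run the code you must decompress it, and the full size comes back.
print(len(zlib.decompress(compressed)) == len(image))  # True
```

This is the trade-off described above: the transfer/compressed size shrinks, while the storage and memory cost of the decompressed image does not.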

I also see other problems that aren't showstoppers but are still significant:

  • We're leaving some of the binary size gains on the table, because we can't move exception-handling sections, dynamically relocatable sections (e.g. vtables), or symbol tables.
  • Since we aren't splitting the symbol table, the solution is more error-prone: there's nothing stopping someone from dlsym'ing a symbol in an unloaded partition.
  • This requires linking multiple times, which will slow down linking and hurt developer productivity; the problem is made worse when (full) LTO is used, since we'll need to do LTO multiple times.
pcc updated this revision to Diff 193995.Apr 5 2019, 6:17 PM
  • Add missing pieces to MC; add test
In D60242#1456509, @pcc wrote:


Thanks for considering this carefully. IMO most of your concerns could be accepted as reasonable limitations to keep the feature out of the linker. We could probably deal with the points that you raised about the compression scheme somehow (e.g. some horrible scheme that adjusts the existing program headers and adds an additional bss one, similar to your original proposal but done by post-processing). However, I think your point about thunks is a hard showstopper. We would need a much more expressive means of controlling the linker's placement of things (than the current ordering files, linker scripts, --section-start, etc.) in order to create a layout that would be amenable to post-processing; and, as you hinted at, this would be just as much work (or more) as getting the linker to do the splitting for you.

Given the above, I am happy with the implementation.

Adding some more people who may be able to review the LLVM changes here.

pcc updated this revision to Diff 198335.May 6 2019, 1:49 PM
  • Add a note about the limit on the number of partitions
pcc updated this revision to Diff 200851.May 22 2019, 5:47 PM
  • Rebase
echristo accepted this revision.Tue, May 28, 6:41 PM

Some inline comment requests, and if you wouldn't mind changing the type allocator separately, that would be great.

Otherwise LGTM.

Thanks!

llvm/include/llvm/IR/GlobalValue.h
111 ↗(On Diff #200851)

Comment please.

llvm/lib/AsmParser/LLParser.cpp
963 ↗(On Diff #200851)

I know nothing else does, but commenting this would be great with invariants, conditions, etc.

llvm/lib/Bitcode/Reader/BitcodeReader.cpp
2986 ↗(On Diff #200851)

Comment what the conditional is checking please?

3078 ↗(On Diff #200851)

Comment what the conditional is checking please?

3158 ↗(On Diff #200851)

Comment here too please.

This revision is now accepted and ready to land.Tue, May 28, 6:41 PM
pcc marked 7 inline comments as done.Tue, May 28, 8:26 PM

Thanks for the review. Committed rL361922 (rename type allocator) and rL361923 (this change).

llvm/lib/Bitcode/Reader/BitcodeReader.cpp
2986 ↗(On Diff #200851)

I added some comments to this file, but I'm not sure how useful they are. Maybe I'm biased from working on this code before, but (aside from the additional comment I added in parseFunctionRecord) I think the code speaks for itself here, and I'm not sure my comments really add any useful information. If we added similar comments elsewhere in these functions for consistency, I think that would result in less information per line of code, which can hurt readability. Can you see a better comment to add here?

This revision was automatically updated to reflect the committed changes.
pcc marked an inline comment as done.