This is an archive of the discontinued LLVM Phabricator instance.

[llvm][llvm-objcopy] Added support for outputting to binary in llvm-objcopy
ClosedPublic

Authored by jakehehrlich on Jun 21 2017, 3:05 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

jakehehrlich created this revision.Jun 21 2017, 3:05 PM
tpimh added a subscriber: tpimh.Jun 23 2017, 4:33 AM
jakehehrlich edited the summary of this revision. (Show Details)
jakehehrlich removed a subscriber: phosek.

Updated this to solve Mac build issues and MSan issues in D33964

jhenderson edited edge metadata.Jul 24 2017, 1:44 AM

It looks like you've accidentally lost program-headers.test again and regained hello-world.test.

I'm not familiar with the binary output format, and the objcopy documentation isn't exactly clear on the matter. Would you mind giving a quick overview, please?

tools/llvm-objcopy/Object.cpp
362 ↗(On Diff #107707)

I can't help but notice that this method and totalSize above are basically the same. Is there any way we can sensibly factor out the duplication?

tools/llvm-objcopy/Object.h
151 ↗(On Diff #107707)

"= default" maybe?

154 ↗(On Diff #107707)

I feel like "ELFObject" (and "BinaryObject") are more natural names. Could we use those instead, please?

159–161 ↗(On Diff #107707)

Are these needed?

It looks like you've accidentally lost program-headers.test again and regained hello-world.test.

I'm not familiar with the binary output format, and the objcopy documentation isn't exactly clear on the matter. Would you mind giving a quick overview, please?

Sure, I had to play with objtool to figure out what was going on because it isn't at all clear otherwise. Loosely the "binary" format should stack the memory image of each segment into a single file. The loader is responsible for knowing where these segments need to go and where they lie in the file. There is an exception however for NOBITS sections that occur at the end of a segment. Those are not included in the file. Additionally alignment needs to be respected so you have to bump things to alignment boundaries. The goal is that if you know the offset, filesize, and virtual address of each segment in the output then you could use nothing more than that information to load each segment into memory. All extraneous information is excluded.

tools/llvm-objcopy/Object.h
159–161 ↗(On Diff #107707)

All 3 are used in ELFObject for file layout

Thinking a little bit more about the design. Do we anticipate in the future supporting non-ELF inputs? Seeing inheritance used to control the output, made me consider an alternative design based more on templates: rather than have one base Object class which allows sharing the input code, and then derive from it to specify the output code, how about a single class that is templated on two arguments - an input reader class, and an output reader class. This would in theory allow for easy mixing and matching the different input and output types, without having to provide a different class definition for every single combination. A sketch outline is below:

Input Types:
Elf64LEInput --+   Object <Input, Output>   +-- Elf64LEOutput
Elf32LEInput --+-------------^       ^------+-- Elf32LEOutput
NonElfInput  --+                            +-- BinaryOutput

and the high-level program would look something like:

Object<Elf64LEInput, BinaryOutput> Obj;
Obj.Input.read();
Obj.dostuff();
Obj.Output.write();

I think that this design might also allow for converting between endianness and ELF size, although I don't know if there's any value in that. Overall, it might not be worth it, but it's something to consider.

Sure, I had to play with objtool to figure out what was going on because it isn't at all clear otherwise. Loosely the "binary" format should stack the memory image of each segment into a single file. The loader is responsible for knowing where these segments need to go and where they lie in the file. There is an exception however for NOBITS sections that occur at the end of a segment. Those are not included in the file. Additionally alignment needs to be respected so you have to bump things to alignment boundaries. The goal is that if you know the offset, filesize, and virtual address of each segment in the output then you could use nothing more than that information to load each segment into memory. All extraneous information is excluded.

Thanks, that makes sense. I take it that we can ignore nested segments here, because you shouldn't nest PT_LOAD segments?

tools/llvm-objcopy/Object.cpp
108 ↗(On Diff #107968)

I'm slightly nervous that this might turn into a dangling reference at some point in the future - at the moment, it works fine, because ElfFile remains open until after writing happens, but in theory, I don't see any reason why ElfFile needs to stay open once reading has happened (it might help reduce the memory footprint, for example). However, closing it would make this and any other references directly to the data invalid.

363 ↗(On Diff #107968)

What about PT_GNU_RELRO segments? Does the GNU tool treat these the same as PT_LOAD, since they are still loadable segments?

362 ↗(On Diff #107707)

Yeah, thinking about it a bit more, this could be turned into a common function called "forEachLoadableSegment" or similar, with a lambda or similar abstracting the uncommon bit away.

tools/llvm-objcopy/Object.h
185–187 ↗(On Diff #107968)

"virtual" is unnecessary here.

159–161 ↗(On Diff #107707)

Oops, I put this comment on the wrong instance - they're repeated in BinaryObject below, where I think they are unnecessary.

tools/llvm-objcopy/llvm-objcopy.cpp
56 ↗(On Diff #107968)

Maybe worth documenting the valid options in the help text?

63 ↗(On Diff #107968)

Is it worth emitting an error here if the OutputFormat isn't an understood type? At the moment, I could do "-O Elf32LE" or "-O coff" or similar expecting to get 32-bit ELF/COFF out, but actually I get 64-bit ELF.

69 ↗(On Diff #107968)

Unused variable. Should the error message report it?

Thinking a little bit more about the design. Do we anticipate in the future supporting non-ELF inputs? Seeing inheritance used to control the output, made me consider an alternative design based more on templates: rather than have one base Object class which allows sharing the input code, and then derive from it to specify the output code, how about a single class that is templated on two arguments - an input reader class, and an output reader class. This would in theory allow for easy mixing and matching the different input and output types, without having to provide a different class definition for every single combination. A sketch outline is below:

I do anticipate multiple output formats long term. There are a few different low level output formats like "binary" and intel HEX that might be needed by somebody. Additionally there are a number of operations that have been discussed by various people that change how output works enough that I would consider subclassing ELFObject rather than try and incorporate the behavior into ELFObject. For instance one idea is to "complete strip" a binary so that it has no section headers or other extraneous information. It would be just like "-O binary" except that the ELF header and program headers would be retained. Another format (that was discussed but I don't thin will be included) is a tool for taking shared objects and stripping them down so much that a linker can link against them but there is no actual content to the output. So I expect multiple things to possibly fit into an "output format" like thing even if the actual format output is still ELF.

The output format depends heavily on the input format. For instance converting between ELF and MachO is, as I understand it, not a great idea. Converting between 32-bit and 64-bit isn't possible in general so I don't think I want to support that. I've also been advised by many people that ELF, COFF, and MachO support should be totally separate and that it isn't easy to create a uniform interface to them. However somethings like converting between endianness' makes sense I think. I think as long as this stays ELF specific and everything maintains the same 32-bit/64-bit then I like the idea. This would allow for converting between endinesses and perhaps provide a better separation between how things are read in and how things are written out. I'll play around with design today and see if I can't make something better than this.

Thanks, that makes sense. I take it that we can ignore nested segments here, because you shouldn't nest PT_LOAD segments?

Every time I make an assumption like "this will never happen" I generally find a case where I'm wrong. So I won't say "shouldn't" but I certainly don't see why it makes sense to have overlapping PT_LOAD segments. It would just be redundant from my perspective. PT_GNU_RELRO might come up at some point as a "loadable" segment that overlaps but I don't think that should have any baring on the use cases for binary output since most of those use cases are for kernels and embedded systems.

tools/llvm-objcopy/Object.cpp
362 ↗(On Diff #107707)

One thought I had was to put this loop in finalize, set the Offset of each segment as I went, and then use the final Offset to set a "TotalSize" field of BinaryObject. This would deduplicate the code but I didn't like using Segment::Offset in that way. My complaint is that Segment::Offset is supposed to represent "p_offset" and "p_offset" is just not relevant in the binary output format. Additionally it requires adding a field. I'm less concerned with that however. I'd be willing to dedup the code this way if you provided an argument for why Segment::Offset should be allowed to be used this way. E.g. the argument should claim that it is ok to divorce Segment::Offset from p_offset. Also currently there is a bit of a blemish in the code in that most "write" interfaces take a FileOutputBuffer but Segment::writeSegment takes a uint8_t pointer. Using Offset in this way would solve that issue. Still I'm concerned that that just isn't the right use of Segment::Offset. I could be convinced that it doesn't matter though.

Perhaps there is a general pattern here that goes something like "loop over all loadable segments and calculate their offsets" but that seems very specific. Certainly I could write a method that took a function and did this but it seems too specific. It would be nice if this pattern kept coming up but it only seems to come up twice and I'd bet that it will only come up twice.

  1. Changed how error handling works
  2. Deduplicated looping over loadable segments while calculating their offset

I do anticipate multiple output formats long term. There are a few different low level output formats like "binary" and intel HEX that might be needed by somebody. Additionally there are a number of operations that have been discussed by various people that change how output works enough that I would consider subclassing ELFObject rather than try and incorporate the behavior into ELFObject. For instance one idea is to "complete strip" a binary so that it has no section headers or other extraneous information. It would be just like "-O binary" except that the ELF header and program headers would be retained. Another format (that was discussed but I don't thin will be included) is a tool for taking shared objects and stripping them down so much that a linker can link against them but there is no actual content to the output. So I expect multiple things to possibly fit into an "output format" like thing even if the actual format output is still ELF.

The output format depends heavily on the input format. For instance converting between ELF and MachO is, as I understand it, not a great idea. Converting between 32-bit and 64-bit isn't possible in general so I don't think I want to support that. I've also been advised by many people that ELF, COFF, and MachO support should be totally separate and that it isn't easy to create a uniform interface to them. However somethings like converting between endianness' makes sense I think. I think as long as this stays ELF specific and everything maintains the same 32-bit/64-bit then I like the idea. This would allow for converting between endinesses and perhaps provide a better separation between how things are read in and how things are written out. I'll play around with design today and see if I can't make something better than this.

Ok, sounds good to me. I can certainly understand that switching between formats might not be the easiest (I have managed it in the past between COFF and ELF to some extent, but it's far from trivial and doesn't seem like a real use case). I'm not sure I understand why going between 32 and 64 bits isn't possible in general personally, but regardless, if there's some value in switching endianness at least, it's certainly worth investigating as a design option.

Thanks, that makes sense. I take it that we can ignore nested segments here, because you shouldn't nest PT_LOAD segments?

Every time I make an assumption like "this will never happen" I generally find a case where I'm wrong. So I won't say "shouldn't" but I certainly don't see why it makes sense to have overlapping PT_LOAD segments. It would just be redundant from my perspective. PT_GNU_RELRO might come up at some point as a "loadable" segment that overlaps but I don't think that should have any baring on the use cases for binary output since most of those use cases are for kernels and embedded systems.

Very true! I think ultimately we can ignore nested segments entirely, since the parent segment will result in them being written. The only possible case we need to be wary of in the future is if segments can overlap - I'm not aware of any use cases for this, but if there were, and some of it involved PT_LOAD (or other loadable segment type), we'd need to do something about it. Not sure that's worth worrying about at this point though.

tools/llvm-objcopy/Object.cpp
382 ↗(On Diff #108130)

CompareSegments, not Sections.

363 ↗(On Diff #107968)

Oops, for some reason I was under the impression that PT_GNU_RELRO was a separate non-nested segment. Never mind!

362 ↗(On Diff #107707)

Although p_offset itself doesn't make a huge amount of sense in a BinaryObject, I don't think it's unreasonable to treat Offset as the location of the data in the file. After all, what else does it represent? The p_offset field in a regular ELF represents exactly that - the offset of the segment in the file.

Also, now looking at your loop changes, I'm beginning to think that it doesn't look particularly nice after all, so if you can remove the looping twice by moving it into finalize, all the better, but I'm happy with whichever you prefer.

If you're concerned about the extra field, perhaps you could simply look at the last Segment in the vector (assuming it's sorted by offset, and that there are no nested segments), and add it's size to the offset to get the total size? Alternatively, maybe return the total size from finalize in all cases?

  1. Removed function to loop though loadable segments while computing offset. I did this using the change that James and I discussed
  2. Fixed a naming issue

Ok, sounds good to me. I can certainly understand that switching between formats might not be the easiest (I have managed it in the past between COFF and ELF to some extent, but it's far from trivial and doesn't seem like a real use case). I'm not sure I understand why going between 32 and 64 bits isn't possible in general personally, but regardless, if there's some value in switching endianness at least, it's certainly worth investigating as a design option.

The only reason I say switching between 32 and 64 bit isn't possible in general is because going from 64 to 32 might demand copying an address that doesn't fit in 32-bits. Also 64-bit ELF files will tend to have 64-bit code which might not be usable on a 32-bit system. Also does the endianness switch make sense? I'm not aware of an architecture that is sometimes little endiann and sometimes big endiann...then again it most likely exists given the oddity of these things. I'm not sure I want to make a big architectural shift for that reason alone however. Still I think there might be other benefits to your separation of input and output so I'm still looking at it.

tools/llvm-objcopy/Object.cpp
108 ↗(On Diff #107968)

I was slightly worried about this as well because I believe that all of the StringRefs for section names and symbol names have this same issue. Additionally the Section type has the same issue as well. So I'd have to make sure that all of those strings were copied as well. I think LLVM has a string saver for this reason. I'm fine demanding that the ELFFile stay open but I understand the concern. Perhaps I should pass a unique pointer containing the ELFFile to the Object and have it take ownership of the ELFFile? I think that would avoid using defensive copies and using a string saver as well as keep the memory footprint lower while still ensuring that the needed data stays open.

363 ↗(On Diff #107968)

It ignores them from my testing. Also I don't think PT_GNU_RELRO will be used in the applications of "-O binary" since PT_GNU_RELRO requires virtual memory support that most applications won't have when they're loaded (e.g. Kernels and embedded systems)

The only reason I say switching between 32 and 64 bit isn't possible in general is because going from 64 to 32 might demand copying an address that doesn't fit in 32-bits. Also 64-bit ELF files will tend to have 64-bit code which might not be usable on a 32-bit system. Also does the endianness switch make sense? I'm not aware of an architecture that is sometimes little endiann and sometimes big endiann...then again it most likely exists given the oddity of these things. I'm not sure I want to make a big architectural shift for that reason alone however. Still I think there might be other benefits to your separation of input and output so I'm still looking at it.

Ok, sounds reasonable to me.

Did you mean to rename program-headers.test?

tools/llvm-objcopy/Object.cpp
108 ↗(On Diff #107968)

Perhaps I should pass a unique pointer containing the ELFFile to the Object and have it take ownership of the ELFFile? I think that would avoid using defensive copies and using a string saver as well as keep the memory footprint lower while still ensuring that the needed data stays open.

I like this idea, but it should probably be done as part of a different change. It reduces any risk of future (and somewhat subtle) mistakes, without being overly defensive in coding.

jakehehrlich marked 20 inline comments as done.Jul 27 2017, 2:22 PM
  1. fixed file renaming issue (I did not mean to do that)
jhenderson accepted this revision.Jul 28 2017, 1:17 AM

LGTM, with some fixes for nits in the comments.

tools/llvm-objcopy/Object.cpp
47 ↗(On Diff #108565)

Either "... maintain segments' interstitial data..." (with apostrophe) or "... maintain interstitial data and contents of segments..."

48 ↗(On Diff #108565)

"this" -> "This"

366 ↗(On Diff #108565)

Full stop.

This revision is now accepted and ready to land.Jul 28 2017, 1:17 AM
  1. Fixed nits
  2. Brought this into line with 33964 changes that were made primarily for windows compatibility.
This revision was automatically updated to reflect the committed changes.

On release builds Object<...>::Object(ELFFile<...>) was giving an undefined error. This appears to be because in the explicit instantiation of ELFObject and BianryObject in Object.cpp there was no need to instantiate that method from Object. I added the explicit instantiations for Object to insure that that constructor would be defined in Object.cpp.

On 32-bit builds size_t changes and causes a casting warning when converting a 64-bit p_filesz field to a size_t.

basic-align-copy.test was failing on big endian machines because of the byte ordering needed in the check for the .data contents. I fixed this by giving the .data section 2 of the same bytes so no matter how they're ordered the message will be the same.