This is an archive of the discontinued LLVM Phabricator instance.

[llvm-objcopy] Add support for -I binary -B <arch>.
ClosedPublic

Authored by rupprecht on Aug 6 2018, 11:34 AM.

Details

Summary

The -I (--input-target) and -B (--binary-architecture) flags exist but are currently silently ignored. This adds support for -I binary for architectures i386, x86-64 (and alias i386:x86-64), arm, and aarch64. This is largely based on D41687.

This is done by implementing an additional subclass of Reader, BinaryReader, which works by interpreting the input file as contents for .data field, sets up a synthetic header, and adds additional sections/symbols (e.g. _binary__tmp_data_txt_start). Additionally, change the symbol table to own symbols, by changing StringRef to std::string, as the synthetic symbols added are otherwise unowned.

Diff Detail

Event Timeline

rupprecht created this revision.Aug 6 2018, 11:34 AM

Ah so I think I'm against BinaryReader being templated against ELFT. That is required to implement getElfType() however. getElfType drives which writer we create. That shouldn't be how things work however. I'll go and add my related comments to this to the previous patch. Sorry if I didn't see what was going on in the other patch and lead you astray. This patch makes things a bit more clear.

tools/llvm-objcopy/Object.cpp
617

Correct me if I'm wrong but I don't think GNU objcopy produces any program headers when it does this. The resulting file should be relocatable and thus program headers are pointless. Even if GNU objcopy does this, we probably shouldn't.

650

The MemBuf should outlive the section in which case you can just use Section

654

nit: Can you use a Twine? You'll have to use an std::string for the buffer identifier part unfortunately but other than that you should be able to use a twine.

674

I think you're ok to initialize the symbol table in whatever order you want here since there are no relocation. That's a current thorn that you even have to initialize things in a different order. I need to carve out some time to refactor llvm-objcopy to work differently to get there however.

rupprecht updated this revision to Diff 159407.Aug 6 2018, 3:18 PM
rupprecht marked 3 inline comments as done.
  • Refactor Object based on code review comments

Ah so I think I'm against BinaryReader being templated against ELFT. That is required to implement getElfType() however. getElfType drives which writer we create. That shouldn't be how things work however. I'll go and add my related comments to this to the previous patch. Sorry if I didn't see what was going on in the other patch and lead you astray. This patch makes things a bit more clear.

Yeah, it's not my first choice either, but it did make the implementation easier and seemed somewhat consistent with the pattern elsewhere in this file.

I wasn't sure what plans you had for refactoring things here (still haven't seen whatever thread was referenced earlier). My hunch is that "the right way" would be to not have any extra ELF type here and just always directly use what's in Object/ELFTypes.h, and to do things with composition instead of templated-inheritance, and I'm happy to make that change (or whatever alternative change makes sense), but that seems like an orthogonal discussion that isn't related to the features implemented here.

tools/llvm-objcopy/Object.cpp
617

I just verified this; commenting out this section causes an invalid object file (readelf/llvm-readobj can't read it at all).
This is ElfHdrSegment. I think you are confusing this with ProgramHdrSegment?

However, speaking of program headers, line 1290 (i.e. OrderedSegments.push_back(&Obj.ProgramHdrSegment)) seems to do nothing -- all the tests pass -- and also gets rid of the error readelf: Warning: possibly corrupt ELF header - it has a non-zero program header offset, but no program headers when running readelf -a on object files that llvm-objcopy produces. I'd like to take a look at that after this patch; it seems to be a preexisting issue.

rupprecht updated this revision to Diff 159772.Aug 8 2018, 1:06 PM
  • Pass MachineInfo into BinaryReader to push ELFType template down to only BinaryELFBuilder<>
  • Move getElfType out of Reader (and also ELFReader), it can be implemented purely in the driver class. This make it clear that "Reader" is not tied to ELF types, even if that's all that's supported currently.

Ah so I think I'm against BinaryReader being templated against ELFT. That is required to implement getElfType() however. getElfType drives which writer we create. That shouldn't be how things work however. I'll go and add my related comments to this to the previous patch. Sorry if I didn't see what was going on in the other patch and lead you astray. This patch makes things a bit more clear.

Yeah, it's not my first choice either, but it did make the implementation easier and seemed somewhat consistent with the pattern elsewhere in this file.

OK, I removed ELF templates from BinaryReader; having them remain on BinaryELFBuilder still seems appropriate though.

jakehehrlich added inline comments.Aug 8 2018, 1:16 PM
tools/llvm-objcopy/Object.cpp
590

So I this function is the real problem child. We don't actually use the Ident anywhere.This is also 100% comprised of information that the ELFWriter knows but nothing else knows. We should make the ELFWriter construct the Ident and just not store it in Object.

617

You're right. I confused it with ProgramHdrSegment. I now remember that someone had the clever idea to make the ELF header an implicit segment so that the same layout algorithm that had all the bugs worked out could already be used. As for the warning I guess I knew about that issue. The OrderedSegments.push_back(&Obj.ProgramHdrSegment) should accomplish adding that program headers to layout when you have a PT_PHDR segment in a consistent way. The warning is a separate issue on line 1078 (in this patch) not checking if the program headers are empty before blindly setting phoff to the offset of that segment. This is in full compliance with the ELF standard because the number of program headers is still zero. It's just that no tool normally has a reason to produce an ELF that has phoff != 0 and phnum = 0 so it's normally a sign of corruption.

rupprecht marked an inline comment as done.Aug 8 2018, 2:38 PM
rupprecht added inline comments.
tools/llvm-objcopy/Object.cpp
590

Nice, I was able to completely remove Ident from Object.

617

I tried adding that check, and several tests failed -- I might not have been checking the right thing, or maybe the tests are bad. I added a TODO on that line to investigate further.

rupprecht updated this revision to Diff 159801.Aug 8 2018, 2:39 PM
rupprecht marked an inline comment as done.
  • Remove Ident from object
jakehehrlich added inline comments.Aug 8 2018, 2:58 PM
tools/llvm-objcopy/Object.cpp
617

How did the tests fail? Some of the tests do a literal check against the header. e.g. they may be checking to see that 0x40 is used even when (to comply with the warning) 0x0 should be used instead.

rupprecht updated this revision to Diff 159809.Aug 8 2018, 3:01 PM
  • Add sparc/ppc to arch map
rupprecht added inline comments.Aug 8 2018, 3:10 PM
tools/llvm-objcopy/Object.cpp
617

With this change:

// Obj.ProgramHdrSegment.firstSection() == nullptr implies
// Obj.ProgramHdrSegment.Sections is empty
Ehdr.e_phoff = Obj.ProgramHdrSegment.firstSection() == nullptr
                   ? 0
                   : Obj.ProgramHdrSegment.Offset;

The failures usually looked like:

Command Output (stderr):
--
<src>/llvm/test/tools/llvm-objcopy/triple-overlap.test:72:14: error: CHECK-NEXT: expected string not found in input
#CHECK-NEXT: Type: PT_LOAD (0x1)
             ^
<stdin>:9:2: note: scanning from here
 Type: (0x464C457F)
 ^
<stdin>:21:2: note: possible intended match here
 Type: (0x400004)
 ^

Actually, it looks like the object is corrupted with that change, e.g. the first program header for that test has an alignment of 15762873573703680, vs 4096 for all of them on the base side.

jakehehrlich added inline comments.Aug 8 2018, 3:12 PM
tools/llvm-objcopy/Object.cpp
617

Yeah that's not the right check. Use Obj.ProgramHeaders.size() == 0

rupprecht added inline comments.Aug 8 2018, 3:35 PM
tools/llvm-objcopy/Object.cpp
617

I can't seem to get that to work:

  • Obj doesn't have a field called ProgramHeaders
  • You probably mean Obj.ProgramHdrSegment, but Segment doesn't have a method size()
  • If I implement size() as returning either Contents.size() or Sections.size() and then try:
Ehdr.e_phoff =
    Obj.ProgramHdrSegment.size() == 0 ? 0 : Obj.ProgramHdrSegment.Offset;

Then I get the same failure as before

jakehehrlich added inline comments.Aug 8 2018, 10:39 PM
tools/llvm-objcopy/Object.cpp
605

I think ELFWriter should assign these. I think I need to do some refactoring to make Object not so ELF specific (with respect to not containing fields that are not known until write time). This change is exposing a lot of points where I failed to properly separate those concerns. In the mean time if you could move this code into ELFWriter and call it form assignOffset that would be ideal.

617

hmm...that's bothersome; I must not be understanding something. I'll see about looking into that. Thanks for letting me know about this!

rupprecht updated this revision to Diff 159939.Aug 9 2018, 9:20 AM
  • Extract out ehdr initialization into initEhdr()
rupprecht marked an inline comment as done.Aug 9 2018, 9:22 AM
rupprecht added inline comments.
tools/llvm-objcopy/Object.cpp
605

I didn't find a way to refactor out setting ElfHdr.Index, since ELFBuilder<ELFT>::readProgramHeaders() is setting it to ElfHdr.Index = Index++ in the middle of some other sections, but the rest was simple to refactor into ELFWriter.

rupprecht updated this revision to Diff 159941.Aug 9 2018, 9:30 AM
rupprecht marked an inline comment as done.
  • Consistently use ELFT typename aliases in headers

Awesome! I'd like to have someone else look over this but this pretty well LGTM.

tools/llvm-objcopy/Object.cpp
614

And this is the last thing where we can't easily factor out dependence on ElfType. This is going to require that Size be handled differently from how it is now. I'm not going to block this change on that since the fix is just not simple. Under the current design of llvm-objcopy that would require a new visitor to calculate the size. I'm not sure I want to further things along that path.

Can you add a TODO here for me?

rupprecht updated this revision to Diff 159961.Aug 9 2018, 11:03 AM
  • Add Elf_Sym TODO
rupprecht marked an inline comment as done.Aug 9 2018, 11:05 AM

Thanks for the review!

Alex/James/anyone else who's lurking, mind taking a look for a second set of eyes?

Thanks for the review!

Alex/James/anyone else who's lurking, mind taking a look for a second set of eyes?

Yup, just going to look now. Sorry, it's been a busy couple of days, otherwise I'd have done so earlier.

jhenderson added inline comments.Aug 10 2018, 3:16 AM
test/tools/llvm-objcopy/binary-input-aarch64.test
1 ↗(On Diff #159961)

Use echo, not printf. I'm not sure if printf is implemented in the lit test system etc, but echo is the more common approach anyway.

3 ↗(On Diff #159961)

You dump the symbols here, but don't check any of them... you should probably do so!

8 ↗(On Diff #159961)

Does this line add anything? If not, can we remove it?

test/tools/llvm-objcopy/binary-input-arm.test
1 ↗(On Diff #159961)

Could you fold all of the regular tests into a single test file? There isn't much difference between them. Just run llvm-objcopy N times in a single file with different -B options, and use check prefixes to match the different sections (e.g. COMMON, ARM, AARCH64... etc).

test/tools/llvm-objcopy/binary-input-error.test
2

printf -> echo

test/tools/llvm-objcopy/binary-input-reconstitute.test
1 ↗(On Diff #159961)

I'm not sure I understand what "reconstitute" means here. Perhaps rename the test? I assume the aim is for binary input and binary output? So maybe a better name is "binary-input-and-output.test".

10 ↗(On Diff #159961)

Technically, this only shows that the contents include "abcd", not that they are explicitly and solely "abcd". So this would also match a file containing other junk.

A better way might be to just compare the output files against the original input files using diff. Also, a common technique in our tests is to copy the input file to a separate file before running llvm-objcopy and then diff the input against the copy to show that the input hasn't been modified.

test/tools/llvm-objcopy/binary-input.test
3

I'm not sure how this test is different from the others (apart from the symbols). Is i386:x86-64 just a synonym of x86-64? If so, I think this test can be folded into the others, like mentioned above.

tools/llvm-objcopy/Object.cpp
65–66

Could you do these little unrelated tidy-ups in a separate NFC commit, please. No need for a review. Essentially, I'd like this review to just be of the things required for your change.

621

AR -> Data. "AR" is meaningless.

It saddens me that different parts of LLVM can't agree on whether to use chars or uint8_t.

624

Rather than abbreviate this to something that is unclear, just call it "DataSection"

633

I think it's considered bad form to have explicit local variable Twines that aren't just function parameters, if I remember, based on comments in a previous review (I might be mistaken though, so am happy to be proven wrong).

Please also name it something more descriptive like "Prefix".

642

Should the end symbol have a size? That seems weird that it does. I might expect the start symbol to, but not the end symbol. What does GNU objcopy do?

1255

Could you rename this initEhdrSegment, since it doesn't actually initialise the ELF header itself.

1267

I'm not convinced that the ELF header segment should be initialised inside something called "assignOffsets", since it's doing a lot more than assigning it an offset (which actually is always 0, so could be initialised at the same time as the rest of the header). It should probably be called outside this function.

tools/llvm-objcopy/Object.h
403

Side note: I think a recent change for --prefix-symbols makes this part of the diff identical. I recommend rebasing when you next update the diff.

458

Don't abbreviate variable names: Sz -> Size.

tools/llvm-objcopy/llvm-objcopy.cpp
135–136

It looks to me like we are grouping things in CopyConfig by type, so this should probably be moved to before (or after) the StringRef block.

313

I feel like this might read better if everything is constructed inline:

static const StringMap<MachineInfo> ArchMap{
    // Name,     EM value,   64bit, LittleEndian
    {"aarch64", {EM_AARCH64, true,  true}},
    {"arm",     {EM_ARM,     false, true}},
    /* More entries here */
};

I'd order the entries either alphabetically by name or numerically by their EM value. You could also put "headers" as shown to avoid needing to duplicate the 64/Endianness comment in each part.

rupprecht marked 30 inline comments as done.
  • Fold arch-specific file header tests into a single templated test
  • Replace printf w/ echo -n
  • Other code review comments
rupprecht added inline comments.Aug 10 2018, 11:36 AM
test/tools/llvm-objcopy/binary-input-aarch64.test
1 ↗(On Diff #159961)

Done -- I think printf was chosen to avoid the \n, but that can be done with "echo -n".

Looks like lit does support it, as it's used in test/tools/llvm-objcopy/add-gnu-debuglink.test, as well as some non-objcopy tests. I'll resist the temptation to fix those in this patch :)

3 ↗(On Diff #159961)

Actually, the point of these arch-specific tests is just to check the file header. I removed -sections and -symbols from this test; those are checked by binary-input.test, which shouldn't depend much on the specific arch.

test/tools/llvm-objcopy/binary-input-reconstitute.test
1 ↗(On Diff #159961)

Yes, essentially it's an integration test that some payload doesn't change when it's converted back and forth (i.e. payload -> object file w/ payload shoved into .data -> payload). I'm not sure why I thought "reconstitute" would be a good word for that, so I'll take your name suggestion.

test/tools/llvm-objcopy/binary-input.test
3

This test is focused on the generic stuff, namely section/symbols. I dropped the arch-specific stuff since that's covered elsewhere.

tools/llvm-objcopy/Object.cpp
633

I originally had string, but changed to Twine based on Jake's suggestion here: https://reviews.llvm.org/D50343?id=159343#inline-442972

I'm trying to understand why it would be a bad idea to write this. Looking at http://llvm.org/docs/ProgrammersManual.html#dss-twine, the discouraged pattern is:

void foo(const Twine &T);
...
StringRef X = ...
unsigned i = ...
const Twine &Tmp = X + "." + Twine(i);
foo(Tmp);

Which is bad because Tmp is a ref, whereas the Twine here is a regular stack variable that persists past all the calls to addSymbol (which calls .str() on the input twine to save/own the symbol).

I renamed to "Prefix", but kept this as a Twine here, since I think this is safe. Otherwise, I think I'd have to do:

SymTab->addSymbol(Twine("_binary_") + SanitizedFilename + "_start", ...);
SymTab->addSymbol(Twine("_binary_") + SanitizedFilename + "_end", ...);
SymTab->addSymbol(Twine("_binary_") + SanitizedFilename + "_size", ...);
642

Nope, this is Value, not Size, although it's confusing, especially because addSymbol has so many parameters. I commented the param names here to make this clear.
The sizes/values here match gnu objcopy.

tools/llvm-objcopy/Object.h
458

Done, although this is existing code -- it's only showing up as a diff because clang-format put it on a new line after I changed StringRef->Twine...

tools/llvm-objcopy/llvm-objcopy.cpp
135–136

Done -- I added some comments to try to logically explain how this config is laid out, but I'm not very familiar with all of them... I'm open to suggestions here.

jhenderson added inline comments.Aug 13 2018, 3:58 AM
test/tools/llvm-objcopy/binary-input-and-output.test
2

Is the no new line important to this test? I think it isn't any more.

tools/llvm-objcopy/Object.cpp
642

Right, I get it now. Thanks for the comments.

tools/llvm-objcopy/Object.h
458

Ah, yes of course. But the renaming is good anyway.

701

If this is a clang-format added difference, please do it in another patch.

tools/llvm-objcopy/llvm-objcopy.cpp
135–136

I'd like @jakehehrlich to comment on these comments, if that's okay, as he may have a specific desire as to how this class is laid out.

rupprecht marked 8 inline comments as done.
  • echo -n -> echo where it doesn't matter
jakehehrlich added inline comments.Aug 13 2018, 11:41 AM
tools/llvm-objcopy/llvm-objcopy.cpp
134–136

Do we use these anywhere? I think I added them thinking about how I was going to use them to do exactly what this change does but embedding MachineInfo seems much nicer and convey's the same information. Maybe we should just do that?

135–136

To date I have had no reason or rhyme to how I've added these...in fact I'm not sure if I have even added the majority at this point. A pattern of grouping by type does seem to have formed organically however. Let's stick with it.

Long term ideals on how this should be laid out:

  1. The names should have a consistent method for naming them derived from the option name where possible. I don't think I've done a very good job of this and it isn't clear how to best solve this issue.
  2. Grouping first by type and then by alphabetization is probably a good idea. I'm pretty trash at finding things in alphabetized lists but I always know the basic type of thing I'm looking for. AddSection is the only thing I know of which currently violates this. (we should make it work the way Paul made SectionsToRename work).
rupprecht added inline comments.Aug 13 2018, 3:05 PM
tools/llvm-objcopy/llvm-objcopy.cpp
134–136

We do. Essentially, we use this to decide whether we're going to need a BinaryReader vs ELFReader for InputFormat, and, separately, whether we're going to need a BinaryWriter vs ELFWriter for OutputFormat. BinaryArch is extra information that only applies when using BinaryReader.

rupprecht updated this revision to Diff 160461.Aug 13 2018, 3:05 PM
  • Rebase/fix other method names to be lower cased.
jhenderson requested changes to this revision.Aug 14 2018, 2:55 AM

No null symbol is a blocker, so I've marked this as requesting changes. Sorry!

test/tools/llvm-objcopy/binary-input-and-output.test
6

I'm not sure if this is important or not in this case, but the usual pattern in llvm-objcopy tests is to copy the input file to a backup to verify that it hasn't been modified before doing the diff.

test/tools/llvm-objcopy/binary-input-arch.test
2

I don't think -n is important here? Indeed, you could do away with this file entirely and just feed in %s into llvm-objcopy, if you wanted, although this is probably less confusing.

test/tools/llvm-objcopy/binary-input-error.test
2

The -n isn't important here either.

test/tools/llvm-objcopy/binary-input.test
74

This is an illegal symbol table: it has no null symbol.

tools/llvm-objcopy/Object.cpp
613

Doesn't this assignment rely on knowing that the symbol table is added immediately after the string table? That seems like poor design to me. Better would be to pass in the index or string table section.

I might well be forgetting how llvm-objcopy is designed in this area, but don't we need to explicitly add a null symbol as the first symbol in the symbol table?

639–640

Is there any point in adding this local section symbol? It can't be referenced, so I think it's superfluous.

This revision now requires changes to proceed.Aug 14 2018, 2:55 AM
rupprecht marked 6 inline comments as done.
  • Fix tests, add null symbol to symtab, make binary objection creation less fragile, and automatically assign section indices
tools/llvm-objcopy/Object.cpp
613

Yes, this is very fragile, I think I understand the issue I was having before a little better.

The Link needs to be set to the StrTab Index, as you mention. The Object::addSection() helper was not assigning any index, so this would implicitly be zero (which is SHN_UNDEF), and that was throwing errors when initializing it. Using size(Obj->sections()) - 1 was more of a reverse-engineered way of getting things working.

Changing Object::addSection() to automatically assign the index lets us use the index from StrTab directly. I think this might allow us to stop assigning manually indices elsewhere in this file, but I'll save that for another change.

Also -- fixed the symbol table to include a null symbol.

639–640

GNU objcopy adds it, but it doesn't seem to be necessary. I'll add it back if it turns out to be needed.

jakehehrlich added inline comments.Aug 14 2018, 12:10 PM
tools/llvm-objcopy/Object.cpp
613

Yeah that probably should have functioned that way the whole time. Thanks for fixing that!

jhenderson accepted this revision.Aug 15 2018, 2:22 AM

LGTM, but there is one thing still outstanding, I think, which @jakehehrlich mentioned in a comment regarding the CopyConfig options:

A pattern of grouping by type does seem to have formed organically however. Let's stick with it.

I don't have a strong preference either way, but you should get @jakehehrlich to confirm he is happy before committing this change.

This revision is now accepted and ready to land.Aug 15 2018, 2:22 AM

LGTM, but there is one thing still outstanding, I think, which @jakehehrlich mentioned in a comment regarding the CopyConfig options:

A pattern of grouping by type does seem to have formed organically however. Let's stick with it.

I don't have a strong preference either way, but you should get @jakehehrlich to confirm he is happy before committing this change.

Ok, I split out each section by type + alphabetized (as suggested after that comment). I also don't have any strong preference though; I'm happy to apply whatever organization/comments/etc. others want here.

There was also a suggestion in that comment to rename config names to more closely match flag names which is also a good idea, but I'd rather submit that as an NFC after this to avoid putting myself in merge conflict hell :)

  • Reorganize CopyConfig sections

rebase/fix stale filename in a test case

This revision was automatically updated to reflect the committed changes.