[llvm-objcopy] Add support for -I binary -B <arch>.
ClosedPublic

Authored by rupprecht on Aug 6 2018, 11:34 AM.

Details

Summary

The -I (--input-target) and -B (--binary-architecture) flags exist but are currently silently ignored. This adds support for -I binary for architectures i386, x86-64 (and alias i386:x86-64), arm, and aarch64. This is largely based on D41687.

This is done by implementing an additional subclass of Reader, BinaryReader, which works by interpreting the input file as contents for .data field, sets up a synthetic header, and adds additional sections/symbols (e.g. _binary__tmp_data_txt_start). Additionally, change the symbol table to own symbols, by changing StringRef to std::string, as the synthetic symbols added are otherwise unowned.

Diff Detail

Repository
rL LLVM
There are a very large number of changes, so older changes are hidden. Show Older Changes
rupprecht added inline comments.Aug 8 2018, 3:10 PM
tools/llvm-objcopy/Object.cpp
619 ↗(On Diff #159343)

With this change:

// Obj.ProgramHdrSegment.firstSection() == nullptr implies
// Obj.ProgramHdrSegment.Sections is empty
Ehdr.e_phoff = Obj.ProgramHdrSegment.firstSection() == nullptr
                   ? 0
                   : Obj.ProgramHdrSegment.Offset;

The failures usually looked like:

Command Output (stderr):
--
<src>/llvm/test/tools/llvm-objcopy/triple-overlap.test:72:14: error: CHECK-NEXT: expected string not found in input
#CHECK-NEXT: Type: PT_LOAD (0x1)
             ^
<stdin>:9:2: note: scanning from here
 Type: (0x464C457F)
 ^
<stdin>:21:2: note: possible intended match here
 Type: (0x400004)
 ^

Actually, it looks like the object is corrupted with that change, e.g. the first program header for that test has an alignment of 15762873573703680, vs 4096 for all of them on the base side.

jakehehrlich added inline comments.Aug 8 2018, 3:12 PM
tools/llvm-objcopy/Object.cpp
619 ↗(On Diff #159343)

Yeah that's not the right check. Use Obj.ProgramHeaders.size() == 0

rupprecht added inline comments.Aug 8 2018, 3:35 PM
tools/llvm-objcopy/Object.cpp
619 ↗(On Diff #159343)

I can't seem to get that to work:

  • Obj doesn't have a field called ProgramHeaders
  • You probably mean Obj.ProgramHdrSegment, but Segment doesn't have a method size()
  • If I implement size() as returning either Contents.size() or Sections.size() and then try:
Ehdr.e_phoff =
    Obj.ProgramHdrSegment.size() == 0 ? 0 : Obj.ProgramHdrSegment.Offset;

Then I get the same failure as before

jakehehrlich added inline comments.Aug 8 2018, 10:39 PM
tools/llvm-objcopy/Object.cpp
619 ↗(On Diff #159343)

hmm...that's bothersome; I must not be understanding something. I'll see about looking into that. Thanks for letting me know about this!

607 ↗(On Diff #159809)

I think ELFWriter should assign these. I think I need to do some refactoring to make Object not so ELF specific (with respect to not containing fields that are not known until write time). This change is exposing a lot of points where I failed to properly separate those concerns. In the mean time if you could move this code into ELFWriter and call it form assignOffset that would be ideal.

rupprecht updated this revision to Diff 159939.Aug 9 2018, 9:20 AM
  • Extract out ehdr initialization into initEhdr()
rupprecht marked an inline comment as done.Aug 9 2018, 9:22 AM
rupprecht added inline comments.
tools/llvm-objcopy/Object.cpp
607 ↗(On Diff #159809)

I didn't find a way to refactor out setting ElfHdr.Index, since ELFBuilder<ELFT>::readProgramHeaders() is setting it to ElfHdr.Index = Index++ in the middle of some other sections, but the rest was simple to refactor into ELFWriter.

rupprecht updated this revision to Diff 159941.Aug 9 2018, 9:30 AM
rupprecht marked an inline comment as done.
  • Consistently use ELFT typename aliases in headers

Awesome! I'd like to have someone else look over this but this pretty well LGTM.

tools/llvm-objcopy/Object.cpp
614 ↗(On Diff #159941)

And this is the last thing where we can't easily factor out dependence on ElfType. This is going to require that Size be handled differently from how it is now. I'm not going to block this change on that since the fix is just not simple. Under the current design of llvm-objcopy that would require a new visitor to calculate the size. I'm not sure I want to further things along that path.

Can you add a TODO here for me?

rupprecht updated this revision to Diff 159961.Aug 9 2018, 11:03 AM
  • Add Elf_Sym TODO
rupprecht marked an inline comment as done.Aug 9 2018, 11:05 AM

Thanks for the review!

Alex/James/anyone else who's lurking, mind taking a look for a second set of eyes?

Thanks for the review!

Alex/James/anyone else who's lurking, mind taking a look for a second set of eyes?

Yup, just going to look now. Sorry, it's been a busy couple of days, otherwise I'd have done so earlier.

jhenderson added inline comments.Aug 10 2018, 3:16 AM
test/tools/llvm-objcopy/binary-input-aarch64.test
1 ↗(On Diff #159961)

Use echo, not printf. I'm not sure if printf is implemented in the lit test system etc, but echo is the more common approach anyway.

3 ↗(On Diff #159961)

You dump the symbols here, but don't check any of them... you should probably do so!

8 ↗(On Diff #159961)

Does this line add anything? If not, can we remove it?

test/tools/llvm-objcopy/binary-input-arm.test
1 ↗(On Diff #159961)

Could you fold all of the regular tests into a single test file? There isn't much difference between them. Just run llvm-objcopy N times in a single file with different -B options, and use check prefixes to match the different sections (e.g. COMMON, ARM, AARCH64... etc).

test/tools/llvm-objcopy/binary-input-error.test
1 ↗(On Diff #159961)

printf -> echo

test/tools/llvm-objcopy/binary-input-reconstitute.test
1 ↗(On Diff #159961)

I'm not sure I understand what "reconstitute" means here. Perhaps rename the test? I assume the aim is for binary input and binary output? So maybe a better name is "binary-input-and-output.test".

10 ↗(On Diff #159961)

Technically, this only shows that the contents include "abcd", not that they are explicitly and solely "abcd". So this would also match a file containing other junk.

A better way might be to just compare the output files against the original input files using diff. Also, a common technique in our tests is to copy the input file to a separate file before running llvm-objcopy and then diff the input against the copy to show that the input hasn't been modified.

test/tools/llvm-objcopy/binary-input.test
2 ↗(On Diff #159961)

I'm not sure how this test is different from the others (apart from the symbols). Is i386:x86-64 just a synonym of x86-64? If so, I think this test can be folded into the others, like mentioned above.

tools/llvm-objcopy/Object.cpp
66 ↗(On Diff #159961)

Could you do these little unrelated tidy-ups in a separate NFC commit, please. No need for a review. Essentially, I'd like this review to just be of the things required for your change.

621 ↗(On Diff #159961)

AR -> Data. "AR" is meaningless.

It saddens me that different parts of LLVM can't agree on whether to use chars or uint8_t.

624 ↗(On Diff #159961)

Rather than abbreviate this to something that is unclear, just call it "DataSection"

633 ↗(On Diff #159961)

I think it's considered bad form to have explicit local variable Twines that aren't just function parameters, if I remember, based on comments in a previous review (I might be mistaken though, so am happy to be proven wrong).

Please also name it something more descriptive like "Prefix".

642 ↗(On Diff #159961)

Should the end symbol have a size? That seems weird that it does. I might expect the start symbol to, but not the end symbol. What does GNU objcopy do?

1249 ↗(On Diff #159961)

Could you rename this initEhdrSegment, since it doesn't actually initialise the ELF header itself.

1261 ↗(On Diff #159961)

I'm not convinced that the ELF header segment should be initialised inside something called "assignOffsets", since it's doing a lot more than assigning it an offset (which actually is always 0, so could be initialised at the same time as the rest of the header). It should probably be called outside this function.

tools/llvm-objcopy/Object.h
402 ↗(On Diff #159961)

Side note: I think a recent change for --prefix-symbols makes this part of the diff identical. I recommend rebasing when you next update the diff.

457 ↗(On Diff #159961)

Don't abbreviate variable names: Sz -> Size.

tools/llvm-objcopy/llvm-objcopy.cpp
133 ↗(On Diff #159961)

It looks to me like we are grouping things in CopyConfig by type, so this should probably be moved to before (or after) the StringRef block.

302 ↗(On Diff #159961)

I feel like this might read better if everything is constructed inline:

static const StringMap<MachineInfo> ArchMap{
    // Name,     EM value,   64bit, LittleEndian
    {"aarch64", {EM_AARCH64, true,  true}},
    {"arm",     {EM_ARM,     false, true}},
    /* More entries here */
};

I'd order the entries either alphabetically by name or numerically by their EM value. You could also put "headers" as shown to avoid needing to duplicate the 64/Endianness comment in each part.

rupprecht marked 30 inline comments as done.
  • Fold arch-specific file header tests into a single templated test
  • Replace printf w/ echo -n
  • Other code review comments
rupprecht added inline comments.Aug 10 2018, 11:36 AM
test/tools/llvm-objcopy/binary-input-aarch64.test
1 ↗(On Diff #159961)

Done -- I think printf was chosen to avoid the \n, but that can be done with "echo -n".

Looks like lit does support it, as it's used in test/tools/llvm-objcopy/add-gnu-debuglink.test, as well as some non-objcopy tests. I'll resist the temptation to fix those in this patch :)

3 ↗(On Diff #159961)

Actually, the point of these arch-specific tests is just to check the file header. I removed -sections and -symbols from this test; those are checked by binary-input.test, which shouldn't depend much on the specific arch.

test/tools/llvm-objcopy/binary-input-reconstitute.test
1 ↗(On Diff #159961)

Yes, essentially it's an integration test that some payload doesn't change when it's converted back and forth (i.e. payload -> object file w/ payload shoved into .data -> payload). I'm not sure why I thought "reconstitute" would be a good word for that, so I'll take your name suggestion.

test/tools/llvm-objcopy/binary-input.test
2 ↗(On Diff #159961)

This test is focused on the generic stuff, namely section/symbols. I dropped the arch-specific stuff since that's covered elsewhere.

tools/llvm-objcopy/Object.cpp
633 ↗(On Diff #159961)

I originally had string, but changed to Twine based on Jake's suggestion here: https://reviews.llvm.org/D50343?id=159343#inline-442972

I'm trying to understand why it would be a bad idea to write this. Looking at http://llvm.org/docs/ProgrammersManual.html#dss-twine, the discouraged pattern is:

void foo(const Twine &T);
...
StringRef X = ...
unsigned i = ...
const Twine &Tmp = X + "." + Twine(i);
foo(Tmp);

Which is bad because Tmp is a ref, whereas the Twine here is a regular stack variable that persists past all the calls to addSymbol (which calls .str() on the input twine to save/own the symbol).

I renamed to "Prefix", but kept this as a Twine here, since I think this is safe. Otherwise, I think I'd have to do:

SymTab->addSymbol(Twine("_binary_") + SanitizedFilename + "_start", ...);
SymTab->addSymbol(Twine("_binary_") + SanitizedFilename + "_end", ...);
SymTab->addSymbol(Twine("_binary_") + SanitizedFilename + "_size", ...);
642 ↗(On Diff #159961)

Nope, this is Value, not Size, although it's confusing, especially because addSymbol has so many parameters. I commented the param names here to make this clear.
The sizes/values here match gnu objcopy.

tools/llvm-objcopy/Object.h
457 ↗(On Diff #159961)

Done, although this is existing code -- it's only showing up as a diff because clang-format put it on a new line after I changed StringRef->Twine...

tools/llvm-objcopy/llvm-objcopy.cpp
133 ↗(On Diff #159961)

Done -- I added some comments to try to logically explain how this config is laid out, but I'm not very familiar with all of them... I'm open to suggestions here.

jhenderson added inline comments.Aug 13 2018, 3:58 AM
test/tools/llvm-objcopy/binary-input-and-output.test
1 ↗(On Diff #160163)

Is the no new line important to this test? I think it isn't any more.

tools/llvm-objcopy/Object.cpp
642 ↗(On Diff #159961)

Right, I get it now. Thanks for the comments.

tools/llvm-objcopy/Object.h
457 ↗(On Diff #159961)

Ah, yes of course. But the renaming is good anyway.

703 ↗(On Diff #160163)

If this is a clang-format added difference, please do it in another patch.

tools/llvm-objcopy/llvm-objcopy.cpp
133 ↗(On Diff #159961)

I'd like @jakehehrlich to comment on these comments, if that's okay, as he may have a specific desire as to how this class is laid out.

rupprecht marked 8 inline comments as done.
  • echo -n -> echo where it doesn't matter
jakehehrlich added inline comments.Aug 13 2018, 11:41 AM
tools/llvm-objcopy/llvm-objcopy.cpp
134–135 ↗(On Diff #160409)

Do we use these anywhere? I think I added them thinking about how I was going to use them to do exactly what this change does but embedding MachineInfo seems much nicer and convey's the same information. Maybe we should just do that?

133 ↗(On Diff #159961)

To date I have had no reason or rhyme to how I've added these...in fact I'm not sure if I have even added the majority at this point. A pattern of grouping by type does seem to have formed organically however. Let's stick with it.

Long term ideals on how this should be laid out:

  1. The names should have a consistent method for naming them derived from the option name where possible. I don't think I've done a very good job of this and it isn't clear how to best solve this issue.
  2. Grouping first by type and then by alphabetization is probably a good idea. I'm pretty trash at finding things in alphabetized lists but I always know the basic type of thing I'm looking for. AddSection is the only thing I know of which currently violates this. (we should make it work the way Paul made SectionsToRename work).
rupprecht added inline comments.Aug 13 2018, 3:05 PM
tools/llvm-objcopy/llvm-objcopy.cpp
134–135 ↗(On Diff #160409)

We do. Essentially, we use this to decide whether we're going to need a BinaryReader vs ELFReader for InputFormat, and, separately, whether we're going to need a BinaryWriter vs ELFWriter for OutputFormat. BinaryArch is extra information that only applies when using BinaryReader.

rupprecht updated this revision to Diff 160461.Aug 13 2018, 3:05 PM
  • Rebase/fix other method names to be lower cased.
jhenderson requested changes to this revision.Aug 14 2018, 2:55 AM

No null symbol is a blocker, so I've marked this as requesting changes. Sorry!

test/tools/llvm-objcopy/binary-input-and-output.test
5 ↗(On Diff #160461)

I'm not sure if this is important or not in this case, but the usual pattern in llvm-objcopy tests is to copy the input file to a backup to verify that it hasn't been modified before doing the diff.

test/tools/llvm-objcopy/binary-input-arch.test
1 ↗(On Diff #160461)

I don't think -n is important here? Indeed, you could do away with this file entirely and just feed in %s into llvm-objcopy, if you wanted, although this is probably less confusing.

test/tools/llvm-objcopy/binary-input-error.test
1 ↗(On Diff #160461)

The -n isn't important here either.

test/tools/llvm-objcopy/binary-input.test
73 ↗(On Diff #160461)

This is an illegal symbol table: it has no null symbol.

tools/llvm-objcopy/Object.cpp
613 ↗(On Diff #160461)

Doesn't this assignment rely on knowing that the symbol table is added immediately after the string table? That seems like poor design to me. Better would be to pass in the index or string table section.

I might well be forgetting how llvm-objcopy is designed in this area, but don't we need to explicitly add a null symbol as the first symbol in the symbol table?

639–640 ↗(On Diff #160461)

Is there any point in adding this local section symbol? It can't be referenced, so I think it's superfluous.

This revision now requires changes to proceed.Aug 14 2018, 2:55 AM
rupprecht marked 6 inline comments as done.
  • Fix tests, add null symbol to symtab, make binary objection creation less fragile, and automatically assign section indices
tools/llvm-objcopy/Object.cpp
613 ↗(On Diff #160461)

Yes, this is very fragile, I think I understand the issue I was having before a little better.

The Link needs to be set to the StrTab Index, as you mention. The Object::addSection() helper was not assigning any index, so this would implicitly be zero (which is SHN_UNDEF), and that was throwing errors when initializing it. Using size(Obj->sections()) - 1 was more of a reverse-engineered way of getting things working.

Changing Object::addSection() to automatically assign the index lets us use the index from StrTab directly. I think this might allow us to stop assigning manually indices elsewhere in this file, but I'll save that for another change.

Also -- fixed the symbol table to include a null symbol.

639–640 ↗(On Diff #160461)

GNU objcopy adds it, but it doesn't seem to be necessary. I'll add it back if it turns out to be needed.

jakehehrlich added inline comments.Aug 14 2018, 12:10 PM
tools/llvm-objcopy/Object.cpp
613 ↗(On Diff #160461)

Yeah that probably should have functioned that way the whole time. Thanks for fixing that!

jhenderson accepted this revision.Aug 15 2018, 2:22 AM

LGTM, but there is one thing still outstanding, I think, which @jakehehrlich mentioned in a comment regarding the CopyConfig options:

A pattern of grouping by type does seem to have formed organically however. Let's stick with it.

I don't have a strong preference either way, but you should get @jakehehrlich to confirm he is happy before committing this change.

This revision is now accepted and ready to land.Aug 15 2018, 2:22 AM

LGTM, but there is one thing still outstanding, I think, which @jakehehrlich mentioned in a comment regarding the CopyConfig options:

A pattern of grouping by type does seem to have formed organically however. Let's stick with it.

I don't have a strong preference either way, but you should get @jakehehrlich to confirm he is happy before committing this change.

Ok, I split out each section by type + alphabetized (as suggested after that comment). I also don't have any strong preference though; I'm happy to apply whatever organization/comments/etc. others want here.

There was also a suggestion in that comment to rename config names to more closely match flag names which is also a good idea, but I'd rather submit that as an NFC after this to avoid putting myself in merge conflict hell :)

  • Reorganize CopyConfig sections

rebase/fix stale filename in a test case

This revision was automatically updated to reflect the committed changes.