This is an archive of the discontinued LLVM Phabricator instance.

[llvm-objcopy] Initial COFF support
ClosedPublic

Authored by mstorsjo on Nov 27 2018, 2:08 AM.

Details

Summary

This is an initial implementation of no-op passthrough copying of COFF with objcopy.

I've tried to base the design on the ELF backend, although some of the extra classes (such as Reader vs COFFReader) might not make much sense here.

Diff Detail

Event Timeline

mstorsjo created this revision.Nov 27 2018, 2:08 AM

Please do split this into a separate pass-through change and option-hook up. It will keep the scale of the review down by quite a bit. Hooking up the options one or two at a time after that would then make reviewing each of those easier too.

I assume you've seen D54384, where @alexshap is doing the same for MachO? It would make a lot of sense for the two approaches to mirror each other.

I noticed you're planning on adding pre-canned binaries for the tests. Is it possible to use yaml2obj or similar instead?

Please do split this into a separate pass-through change and option-hook up. It will keep the scale of the review down by quite a bit. Hooking up the options one or two at a time after that would then make reviewing each of those easier too.

Ok, will split that out. As for adding other options one at a time, I probably can't do it one option at a time, but I'll try to split it into smaller sets of options with the same mechanics at least.

I assume you've seen D54384, where @alexshap is doing the same for MachO? It would make a lot of sense for the two approaches to mirror each other.

I've seen it and looked at it a little for inspiration with this one as well. The general structure of the ELF/COFF/MachO subdirs are the same, but the actual objects and functions used for rebuilding objects obviously differ depending on the formats' structure etc.

I noticed you're planning on adding pre-canned binaries for the tests. Is it possible to use yaml2obj or similar instead?

For most practical functional tests, yes. I have one test for the plain passthrough copying which does a byte-by-byte comparison between the input and output, and tools that generate object files/executables have some freedom in exactly how they are laid out. The input files I use right now have been generated with llvm-mc/lld/msvc, but I can try to see how many of them end up identical if passed via yaml.

mstorsjo updated this revision to Diff 175462.Nov 27 2018, 5:35 AM
mstorsjo edited the summary of this revision. (Show Details)

Removed the changes that do actual transformations on the binary. Still using hardcoded input files, as it will take a while to sort out what exact differences are produced if going via yaml (I ran into one yaml2obj bug so far), will look more into that later.

Removed the changes that do actual transformations on the binary. Still using hardcoded input files, as it will take a while to sort out what exact differences are produced if going via yaml (I ran into one yaml2obj bug so far), will look more into that later.

FWIW, I can easily change the simple object file inputs into yaml. Executables can also be made into yaml, when the serialization format of VirtualAddress for sections is clarified. (Right now it's broken if you convert to yaml and back.) For executables, I currently pad code segments with 0xcc, to match what LLD does, but yaml2obj doesn't do that, so I'd have to stop doing that if I want a test that matches byte for byte the input data.

For bigobj inputs, yaml2obj doesn't produce that unless the bigobj format strictly is needed. So I can't easily test that with yaml unless I create an object file with over 64k sections. So at least for that, a binary input files more or less is wanted.

i will get to this diff early next week, sorry about the delay. A small side note - probably the simple changes to include/llvm/Object/COFF.h can be factored out from this diff and reviewed separately (and probably much faster).

In D54939#1314029, @alexshap wrote:

i will get to this diff early next week, sorry about the delay. A small side note - probably the simple changes to include/llvm/Object/COFF.h can be factored out from this diff and reviewed separately (and probably much faster).

Ok - splitting them out isn't hard, but I dunno what to make as test for them, if they are going in as such.

In D54939#1314029, @alexshap wrote:

i will get to this diff early next week, sorry about the delay. A small side note - probably the simple changes to include/llvm/Object/COFF.h can be factored out from this diff and reviewed separately (and probably much faster).

Ok - splitting them out isn't hard, but I dunno what to make as test for them, if they are going in as such.

It looks to me like they should be unit-tested. There is already a set of unit tests for LLVMObject, so it shouldn't be too hard to add them, I'd think.

I'm not really a COFF expert, so I haven't yet looked at the details of this change. It broadly looks like the right direction though.

If I get time, I'll try to start reviewing the details, but it would be better if somebody with more familiarity looked at them.

include/llvm/Object/COFF.h
974

We seem to be mixing our styles of defining things in headers and in .cpp files for no particular reason. Would you mind defining these in COFF.cpp, please since there are two that already are? (I'd support moving getDosHeader into the COFF.cpp too in that case).

I could be persuaded instead that the two PE header functions should be moved here.

test/tools/llvm-objcopy/COFF/basic-copy.test
25

nit: remove the extra "of"

tools/llvm-objcopy/COFF/Object.cpp
47

This is quite a big function. Would it be feasible to split it up into a few helpers?

144

Again, this is quite a big function. Would it be feasible to split it up into a few helpers?

240

Again, this is quite a big function. Would it be feasible to split it up into a few helpers?

tools/llvm-objcopy/COFF/Object.h
30

To avoid confusion, maybe we should rename the ELF Object to ELFObject, and then this Object to COFFObject, etc?

mstorsjo marked 4 inline comments as done.Dec 3 2018, 4:46 AM
mstorsjo added inline comments.
include/llvm/Object/COFF.h
974

Sure, I don't mind moving them there. It's indeed quite a bit inconsistent right now, although I don't know if there's some rationale for which ones have been made inline so far.

test/tools/llvm-objcopy/COFF/basic-copy.test
25

Thanks, amended the patch locally with that fixed.

tools/llvm-objcopy/COFF/Object.cpp
47

I guess it should be feasible, I'll give it a try.

tools/llvm-objcopy/COFF/Object.h
30

Sure, I don't mind doing that.

The same goes for D54674 and MachO as well.

Although that also moves the names closer to the class names used by llvm/Object, COFFObjectFile and ELFObjectFile, but I don't think that's an issue.

mstorsjo added inline comments.Dec 3 2018, 5:11 AM
tools/llvm-objcopy/COFF/Object.cpp
275

Open questions to reviewers: Here I pad executable sections with 0xcc, in the same way as LLD does.

It was requested to have tests that use yaml input instead of committing binaries, and the tests so far check that objcopy produces bytewise identical files to its input. As yaml2obj doesn't do the same 0xcc padding, we'd either need to make yaml2obj do the same padding as well, or stop doing it here. Which one do you prefer?

(Also, currently, LLD/COFF doesn't currently use different paddings for ARM/AArch64, which it probably should, and the same would go here as well.)

jhenderson added inline comments.Dec 3 2018, 5:31 AM
tools/llvm-objcopy/COFF/Object.cpp
275

Not too sure, but it might be nice to find a way to specify the padding for any given file. For example, padding with 0xcc doesn't make much sense for x86 output in the data segment, but makes a lot of sense in the text segment.

mstorsjo marked 4 inline comments as not done.Dec 3 2018, 5:39 AM

(I have no idea why my inline comments were marked done from the start...)

mstorsjo added inline comments.Dec 3 2018, 5:41 AM
tools/llvm-objcopy/COFF/Object.cpp
275

Well currently, the padding is only done for executable sections (if ((S.Header.Characteristics & IMAGE_SCN_CNT_CODE) above), other sections are padded with zeros.

tools/llvm-objcopy/COFF/Object.h
30

khm, i don't have a strong opinion, but they are already kinda named this way if we take into account the namespace.

mstorsjo updated this revision to Diff 176482.Dec 3 2018, 2:40 PM

Split the large methods in Object.cpp into a bunch of smaller ones, and added a bit more comments. Moved the COFF header getter implementations into the cpp file.

Suggested changes not (yet) done:

  • Didn't split the COFF header getter methods to a separate patch as I don't know a sensible standalone test for it, other than using it here.
  • Kept the class named Object instead of COFFObject, as there weren't clear unanimosity about changing them yet
  • Didn't change to use yaml2obj for the tests yet.

Currently, the llvm-objcopy output matches bytewise what lld outputs for executables (except for string tables), and what the LLVM codegen outputs for normal object files. If testing with files synthesized from yaml, yaml2obj needs 3 changes; padding executable sections, skipping 4 byte section content alignment as yaml2obj does but MC doesn't, and using the MC StringTableBuilder. The former two are trivial to change, but the latter requires adding an MC dependency to yaml2obj, which I'm not sure is wanted.

mstorsjo marked 8 inline comments as done.Dec 3 2018, 2:41 PM

On the whole ELFObject vs Object thing. I'm in favor of just using Object but I don't really care too much. Having both the namespace and the name prefix is kind of verbose. A while back we had two kinds of Objects instead of just one but this was determined overtime to be a mistake for the ELF backend. If you want to do conversions between formats, even if the format is "binary" you'll probably want to avoid having two kinds of Object or if you do make sure you provide a sufficiently rich base interface to them to allow most code to be written in terms of that.

tools/llvm-objcopy/COFF/Object.h
83

It might be useful to declare this as the OriginalHeader and have a second Header. We discovered overtime that this would have been a good structure in the ELF case but currently we have a horrible hodgepodge of fields prefixed with "Original" and some without. Also, the name "Contents" doesn't necessarily need to change but it should be clear that such a thing is the original contents and nothing more.

Also don't focus on byte for byte accuracy, test semantically so that we can make layout changes independent of content changes. It's clear to me now that we shouldn't even bother attempting exact binary matches even in the first patch.

Also don't focus on byte for byte accuracy, test semantically so that we can make layout changes independent of content changes. It's clear to me now that we shouldn't even bother attempting exact binary matches even in the first patch.

Sure, in general it's not necessary to have byte for byte accuracy, but as we write a new file, I've tried to mimic the layout of MC and LLD as those should be sane and actually used in the wild.

The test itself doesn't need to check for such accuracy of course though. I guess I could check e.g. the input/output with e.g. llvm-readobj -file-headers or something like that, to check that the output generally looks sane.

tools/llvm-objcopy/COFF/Object.h
83

So far, I don't keep a copy of the original header anywhere but I just patch things in this one copy (marginally when just doing a plain copy, patching more when actually changing things) before finally writing it to the output with one memcpy, so there's no distinction between original and current.

If we need to keep the original header separately later, couldn't we add the separate OriginalHeader field at that point?

jakehehrlich added inline comments.Dec 3 2018, 3:11 PM
tools/llvm-objcopy/COFF/Object.h
83

I'd say if you ever have to keep a single "Original" field then you should keep the whole original header.

mstorsjo added inline comments.Dec 4 2018, 12:22 AM
tools/llvm-objcopy/COFF/Object.h
83

Right - well I don't keep any "original" fields in the sense that it is the original unmodified value from the input file, but it's the actual header as it will be written to the destination file in the same struct form, with values updated and filled in along the objcopy process.

So I don't break out all the individual fields but keep them as they are on disk. The only exception is the Name field that I break out into a separate StringRef, as the Name field of the coff_section header only makes sense in the context of a full file.

As for the field Contents, why would I need to make a distinction that it is the original contents? In the follow-up parts where I synthesize a .gnu_debuglink section, Contents won't be any original contents but the newly created. (In that case I add a separate field to actually own the allocation as the ArrayRef either points to the original input file contents or newly synthesized contents.)

@rnk Can you have a look at this one wrt the COFF specifics, and how I'm storing/reassembling it?

mstorsjo updated this revision to Diff 177481.Dec 10 2018, 4:05 AM
mstorsjo retitled this revision from [RFC] [llvm-objcopy] Initial COFF support to [llvm-objcopy] Initial COFF support.

Removed the RFC tag as it has been through a few rounds of discussion already, and I'd like to see actual progress towards getting it merged.

I changed the tests to use yaml files as input as requested. In order to have sensible tests that actually check that the file as a whole is copied correctly, without tediously writing lots of individual CHECK lines for llvm-readobj+FileCheck, I'm using obj2yaml on both inputs and outputs, and a plain cmp on the output from obj2yaml. Later patches that actually do some transformation can of course use more specific and targeted checks with llvm-readobj+FileCheck.

Ping @rnk, can you give this a review from a generic COFF perspective?

Ping @alexshap, do you have time to have a look?

Ping @jakehehrlich, can you follow-up on the discussion where you suggested renaming fields? Can you, if possible, have a look on the patch as a whole, wrt to how I read structs from the input file, keeping them mostly packed in the same kinds of structs as in the binary file.

sorry about the delay, I've also pinged some other people who know coff better than I do (at the moment).
What do think about keeping the implementation of Reader and Writer in their own files, and (in the future) put into Object.cpp only the logic for
modifying the "intermediate representation" (mutations of COFF) ?

In D54939#1330427, @alexshap wrote:

What do think about keeping the implementation of Reader and Writer in their own files, and (in the future) put into Object.cpp only the logic for
modifying the "intermediate representation" (mutations of COFF) ?

I guess that could be doable.

For the actual stripping operations that I've already implemented (but holding off posting until this is merged), most of the code that does modifications actually is outside of the Object class (so far), so the Object class itself is mostly a dumb container. This, because modifications don't (so far) easily map down to simple individual operations that would fit into individual methods. E.g. removing a section requires updating symbols as well - so so far I've kept it as one series of modifications in COFFObjcopy.cpp. But that's

Do you have any opinion on the extra Reader/Writer abstract base classes (vs COFFReader/COFFWriter)? I copied that design from ELF, but I'm not sure if it makes sense here as we have much fewer variants of everything.

that's fine, my point was to "unload" some code from Object.cpp, in particular, the serialization/deserialization into separate files. (probably it won't change that much in the future, but the code for modifying Object will evolve as more features will be added).

rnk accepted this revision.Dec 13 2018, 3:32 PM

I don't have time to go over this with a fine tooth comb, but I took a look, and all the COFF parts make sense. The string table and the use of "/%d" for long section names is all correct.

This revision is now accepted and ready to land.Dec 13 2018, 3:32 PM
mstorsjo updated this revision to Diff 178193.Dec 14 2018, 12:20 AM

Split Object.cpp/h into separate files for Reader and Writer as suggested by @alexshap.

mstorsjo updated this revision to Diff 178198.Dec 14 2018, 12:35 AM

A few minor cleanups and tweaks to the includes and file headers for the split Reader/Writer.

I've looked at the code, there are few minor nits, but to me this code looks like a pretty good start, many thanks for working on this ))

tools/llvm-objcopy/COFF/Object.h
62

nit: i'd probably replace the names A and B with smth that reflects they are the symbol types

tools/llvm-objcopy/COFF/Reader.cpp
63 ↗(On Diff #178198)

I would add a comment saying that it's on purpose (that the indices start with 1)

tools/llvm-objcopy/COFF/Reader.h
32 ↗(On Diff #178198)

nit: don't need private here

40 ↗(On Diff #178198)

nit: for consistency with the code above I would swap these two lines
(first - ctor, second - create)

@jakehehrlich , @jhenderson - are there any concerns with moving this forward ?

mstorsjo marked 6 inline comments as done.Dec 14 2018, 11:47 AM

Thanks for the review!

tools/llvm-objcopy/COFF/Object.h
62

Sure, will replace with Symbol1Ty/Symbol2Ty.

tools/llvm-objcopy/COFF/Reader.cpp
63 ↗(On Diff #178198)

Good point, will do.

tools/llvm-objcopy/COFF/Reader.h
32 ↗(On Diff #178198)

Thanks, leaving it out - doing the same for COFFWriter as well.

mstorsjo updated this revision to Diff 178262.Dec 14 2018, 11:48 AM
mstorsjo marked 2 inline comments as done.

Applied the changes suggested by @alexshap.

tools/llvm-objcopy/COFF/Object.h
49

btw - I don't have a strong opinion on this yet,
but wanted to explain the motivation why we were trying not to keep this type of information (Is64 etc) inside the Object. Basically, the idea was to have all encoding-related logic consolidated inside the Readers/Writers and avoid accidental usage / leaking of this information into the model itself. I don't know if it makes sense for COFF (although yes, to some extent it means that this information needs to be passed around) - what do you think ?

mstorsjo added inline comments.Dec 14 2018, 1:26 PM
tools/llvm-objcopy/COFF/Object.h
49

Oh, ok, I see. We could probably quite easily get rid of IsBigObj and create a big object whenever it's necessary. (The current design just copies the input struct as such, bringing in the exact values of fields such as Sig1/Sig2/Version/UUID, but if we'd decouple them, we'd probably synthesize a new bigobj header instead.)

Wrt pe32_header vs pe32plus_header, the latter is almost a superset of the former - it has got a few fields lengthened to 64 bit, but it instead lacks the BaseOfData field. With a pe32plus_header struct plus a BaseOfData field, we could store whichever input data we read, but that on the other hand requires hardcoding which architectures are 32 bit and which are 64 bit, which we right now just take from the input file. (Hardcoding the machine type vs bitness isn't much of a practical issue as there only are 4 architectures in common use these days - but nothing else in the COFF objcopy needs to know about the architecture at all.)

mstorsjo updated this revision to Diff 178376.Dec 15 2018, 2:28 PM

Removed the IsBigObj field, and only storing one single coff file header and one pe32 header struct, and converting to/from the intermediate form in the reader/writer, as suggested/hinted by @alexshap.

In D54939#1331487, @alexshap wrote:

@jakehehrlich , @jhenderson - are there any concerns with moving this forward ?

My last day in work until the New Year is on Wednesday, and unfortunately I don't think I'll get a chance to look further at it until then. I'm happy to defer to other people's judgement on whether this is good, and can review it post-commit if needed.

Ping @alexshap - Any reply to your last discussion point and my update of the patch relating to it?

Ping @jakehehrlich - Any objections to this, or can I commit with @alexshap's and @rnk's approval?

on my side i think this is fine, I don't want to block this diff.
I would wait for Jake since he is the code owner for llvm-objcopy, but to me this looks good enough (to start the ball rolling).

jakehehrlich accepted this revision.Dec 18 2018, 3:53 PM

High level structure/interface looks good to me. I don't see in obvious code level issues. I'll trust @rnk's assessment of the COFF specific parts of this. I haven't reviewed the testing inputs but it looks like you're able to copy some substantial object files which looks like a solid start. I'll trust that James and Alex have looked at enough of this to fill in the gaps that I might have missed. We can start iterating now if there are any issues. Land it.

tools/llvm-objcopy/COFF/Writer.cpp
18 ↗(On Diff #178376)

Why do you need this?

tools/llvm-objcopy/COFF/Writer.h
15 ↗(On Diff #178376)

Why do you need this?

mstorsjo marked 3 inline comments as done.Dec 18 2018, 11:20 PM
mstorsjo added inline comments.
tools/llvm-objcopy/COFF/Writer.cpp
18 ↗(On Diff #178376)

On a second look, I don't - I had somehow conflated it with the header for the llvm-objcopy specific Buffer class. Will remove before committing.

This revision was automatically updated to reflect the committed changes.
mstorsjo marked an inline comment as done.