This is an archive of the discontinued LLVM Phabricator instance.

[llvm-objcopy] [COFF] Fix handling of aux symbols for big objects
ClosedPublic

Authored by mstorsjo on Jan 21 2019, 2:18 AM.

Details

Summary

@rnk - I want your input on this one.

Other llvm-objcopy reviewers: I'd like to add a custom hidden option for testing, for triggering using the big object format. Without that, a test would have to create over 32k sections to trigger that.

Currently, the aux symbols are stored in an opaque std::vector<uint8_t>, with contents interpreted according to the rest of the symbol. This allows passing through all the aux symbols we don't need to touch or care about.

If the input was a bigobj but the output isn't, or vice versa, this makes the aux data desync the whole symbol table.

All aux symbol types that use a struct fit in 18 bytes (sizeof(coff_symbol16)), and if written to a bigobj, two extra padding bytes are written after each (as sizeof(coff_symbol32) is 20).

This patch implements the following fix: In the llvm-objcopy storage agnostic intermediate representation, store the aux symbols as a series of coff_symbol16 sized opaque blobs within the same std::vector<uint8_t>. (In practice, all such struct based aux symbols only consist of one aux symbol, so this is more flexible than what reality needs.)

The special case is the file aux symbols, which are written in potentially more than one aux symbol slot, without any padding, as one single long string. This can't be stored in the same opaque vector of fixed sized aux symbol entries. The file aux symbols will
occupy a different number of aux symbol slots depending on the type of output object file. As nothing in the intermediate process needs to have accurate raw symbol indices, updating that is moved into the writer class.

Instead of updating the symbol raw indices at the end when the final format is known, one could alternatively choose to waste a bit more space and always allocate indices based on a normal object file. For a bigobj, we could potentially end up with a whole aux entry slot of padding for the filename. As this is rather uncommon (in practice max one per file), the total wasted space would be 20 bytes per file, unless really long file names are stored.

An alternative to the opaque AuxData vector would be to add a set of Optional<coff_aux_section_definition>, Optional <coff_aux_weak_external>. The upside is that this makes the intermediate format much clearer and neater, but the downside is that we need to explicitly know and care about all sorts of aux symbols (5 types, plus the file names) that we'd otherwise just pass through without touching and even knowing the specifics about.

Diff Detail

Event Timeline

mstorsjo created this revision.Jan 21 2019, 2:18 AM

Other llvm-objcopy reviewers: I'd like to add a custom hidden option for testing, for triggering using the big object format. Without that, a test would have to create over 32k sections to trigger that.

I'm not a fan of adding hidden options purely for testing when there are alternatives. In the ELF tests, we use a pre-built zip file containing an object with many sections (see ELF/many-sections.test). I think you could probably do the same thing here.

Regarding the code, I honestly don't really understand it, so I don't feel like I'm qualified to review this COFF-ism.

tools/llvm-objcopy/COFF/Reader.cpp
113

You can get rid of the braces here.

tools/llvm-objcopy/COFF/Writer.cpp
321–332

Strictly speaking, you don't need the braces in this if/else.

327–329

Are coff_symbol16 and SymbolTy supposed to be the same size? Or are you deliberately writing less (or more) into the buffer than you iterate over?

mstorsjo marked an inline comment as done.Jan 22 2019, 3:32 AM

Other llvm-objcopy reviewers: I'd like to add a custom hidden option for testing, for triggering using the big object format. Without that, a test would have to create over 32k sections to trigger that.

I'm not a fan of adding hidden options purely for testing when there are alternatives. In the ELF tests, we use a pre-built zip file containing an object with many sections (see ELF/many-sections.test). I think you could probably do the same thing here.

Hmm, ok. Now with --add-gnu-debuglink it's possible to add a section, so with that I guess it should be possible to achieve a test which removes a section to make the big object small, and then add another one to make it big again. Less elegant than a small and neat yaml test input IMO, but probably tolerable.

tools/llvm-objcopy/COFF/Writer.cpp
327–329

I'm deliberately writing more or less, yes.

The COFF-ism is that the symbol table can consist of entries of either coff_symbol16 or coff_symbol32, of 18 or 20 bytes each. A symbol can be followed by a number of aux symbols, which can be one of a number of different structs, all 18 bytes each. If the table consists of coff_symbol32 entries, each one of the aux symbols (opaque aux structs) will have 2 bytes of padding at the end.

So here I'm writing chunks of 18 bytes at a time out of the stored AuxData (where they are packed tightly), spaced 18 or 20 bytes apart in the output symbol table (depending on the entry size of that symbol table).

rnk added inline comments.Jan 22 2019, 9:56 AM
tools/llvm-objcopy/COFF/Object.h
85–86

Well, they aren't always coff_symbol16 sized are they? For an input bigobj, it'll be coff_symbol32, or we should make this a vector of coff_symbol16 directly.

I don't know much about objcopy, but I think it might be more in the spirit of it to widen into coff_symbol32 as is done for the main symbol field above, instead of keeping this as an opaque binary blob.

tools/llvm-objcopy/COFF/Reader.cpp
116

I think this can just be .rtrim('\0'), there is a single character overload of rtrim.

121

Should this second sizeof(coff_symbol16) be SymSize?

Maybe an easier way to express it would be:

ArrayRef<uint8_t> Chunk = AuxData.take_front(SymSize);
Sym.AuxData.insert(Sym.AuxData.end(), Chunk.begin(), Chunk.end());
AuxData = AuxData.drop_front(SymSize);

It mutates a local variable, but takes less math.

mstorsjo marked 2 inline comments as done.Jan 22 2019, 10:43 AM
mstorsjo added inline comments.
tools/llvm-objcopy/COFF/Object.h
85–86

Even for bigobj inputs, the aux symbols (except for .file) only have coff_symbol16 worth of payload. There's no wide version of any of the coff_aux_.. structs, so they can't be widened into the intermediate storage.

Making it a vector of coff_symbol16 would make things clearer, but as the data actually isn't that struct, maybe struct { uint8_t Opaque[sizeof(coff_symbol16)]; } would be more correct?

Or alternatively Optional<coff_aux_*> for each of the known types - but I prefer being able to passthrough unknown data untouched.

tools/llvm-objcopy/COFF/Reader.cpp
121

No, this is intentionally (within the current patch design) copying 18 bytes from a source which has got either 18 or 20 bytes stride.

mstorsjo updated this revision to Diff 182981.Jan 22 2019, 2:11 PM

Using a vector of AuxSymbol, which are an opaque struct of coff_symbol16 size, making the code slightly clearer. Didn't change the test to use a large object file to actually trigger generating a big object yet.

rnk added inline comments.Jan 22 2019, 4:36 PM
tools/llvm-objcopy/COFF/Object.h
85–86

I see.

tools/llvm-objcopy/COFF/Reader.cpp
121

Of course now I read you already clarified this. I think there should be a comment about how this is normalizing from coff_symbol32-sized entries to AuxSymbol sized entries, and discarding the padding bytes that are present in a bigobj.

rnk accepted this revision.Jan 22 2019, 4:47 PM

lgtm I like the new code, so feel free to commit after adding a comment about the thing both reviewers were confused by. :)

This revision is now accepted and ready to land.Jan 22 2019, 4:47 PM

Other llvm-objcopy reviewers: I'd like to add a custom hidden option for testing, for triggering using the big object format. Without that, a test would have to create over 32k sections to trigger that.

I'm not a fan of adding hidden options purely for testing when there are alternatives. In the ELF tests, we use a pre-built zip file containing an object with many sections (see ELF/many-sections.test). I think you could probably do the same thing here.

I'm experimenting with crafting such an input file now. The uncompressed object file weighs in at around 5 MB, and after gzip (as is used for that ELF test) it currently ends up at around 725 KB. Do you think that's acceptable or too large?

Other llvm-objcopy reviewers: I'd like to add a custom hidden option for testing, for triggering using the big object format. Without that, a test would have to create over 32k sections to trigger that.

I'm not a fan of adding hidden options purely for testing when there are alternatives. In the ELF tests, we use a pre-built zip file containing an object with many sections (see ELF/many-sections.test). I think you could probably do the same thing here.

I'm experimenting with crafting such an input file now. The uncompressed object file weighs in at around 5 MB, and after gzip (as is used for that ELF test) it currently ends up at around 725 KB. Do you think that's acceptable or too large?

It's not ideal, if I'm honest, but it might be a quirk of the file format, and therefore unavoidable. The equivalent file for ELF is only 152 KB. I don't know if it was somehow more aggressively compressed though. If you can't get it any smaller, I think it's probably acceptable.

By the way, there's a gunzip.py script in the ELF/Inputs directory, which you should probably move to a shared area and use for decompressing.

I'm experimenting with crafting such an input file now. The uncompressed object file weighs in at around 5 MB, and after gzip (as is used for that ELF test) it currently ends up at around 725 KB. Do you think that's acceptable or too large?

It's not ideal, if I'm honest, but it might be a quirk of the file format, and therefore unavoidable. The equivalent file for ELF is only 152 KB. I don't know if it was somehow more aggressively compressed though. If you can't get it any smaller, I think it's probably acceptable.

Ok then. At least it makes for a better testcase.

By the way, there's a gunzip.py script in the ELF/Inputs directory, which you should probably move to a shared area and use for decompressing.

Hmm, what place would that be, where it's findable by python within the lit tests? I could just move it up into test/tools/llvm-objcopy/Inputs and refer to it with %p/../Inputs/ungzip.py in the ELF/COFF subdirs - not ideal or elegant or anything, but at least shared between these two directories.

By the way, there's a gunzip.py script in the ELF/Inputs directory, which you should probably move to a shared area and use for decompressing.

Hmm, what place would that be, where it's findable by python within the lit tests? I could just move it up into test/tools/llvm-objcopy/Inputs and refer to it with %p/../Inputs/ungzip.py in the ELF/COFF subdirs - not ideal or elegant or anything, but at least shared between these two directories.

That's where I'd put it. No point in duplicating it after all.

mstorsjo updated this revision to Diff 183065.Jan 23 2019, 2:16 AM

Removed the option for forcing emission of a big object, made a test that operates on a bundled large object file instead.

jhenderson added inline comments.Jan 23 2019, 3:03 AM
test/tools/llvm-objcopy/COFF/bigobj.test
6–9

I think it probably is easier to associate comments with the corresponding test case without the blank link between them, but I'm not too bothered, so if you prefer it this way, that's fine.

tools/llvm-objcopy/COFF/Reader.cpp
113

You can still get rid of these braces ;)

I think a few code comments around here explaining what you are doing and why would make it much more understandable. Perhaps a brief explanation of the difference in the BigObj format?

tools/llvm-objcopy/COFF/Writer.cpp
147–148

This is another place requiring a code comment, I think, just explaining the "why".

321–333

Again, a few comments around here would be good.

mstorsjo marked 4 inline comments as done.Jan 23 2019, 3:34 AM

Other llvm-objcopy reviewers: I'd like to add a custom hidden option for testing, for triggering using the big object format. Without that, a test would have to create over 32k sections to trigger that.

I'm not a fan of adding hidden options purely for testing when there are alternatives. In the ELF tests, we use a pre-built zip file containing an object with many sections (see ELF/many-sections.test). I think you could probably do the same thing here.

I'm experimenting with crafting such an input file now. The uncompressed object file weighs in at around 5 MB, and after gzip (as is used for that ELF test) it currently ends up at around 725 KB. Do you think that's acceptable or too large?

It's not ideal, if I'm honest, but it might be a quirk of the file format, and therefore unavoidable. The equivalent file for ELF is only 152 KB. I don't know if it was somehow more aggressively compressed though. If you can't get it any smaller, I think it's probably acceptable.

I realized I could make the testcase less interesting and remove some aspects that aren't strictly needed for this test. That reduced the uncompressed object from 5 MB to 2.5, and the compressed one from 725 KB to 7 KB. That's probably small enough :-)

tools/llvm-objcopy/COFF/Reader.cpp
113

Oh, right, I forgot about the other comments when focusing on the test data.

mstorsjo updated this revision to Diff 183077.Jan 23 2019, 3:39 AM
mstorsjo marked an inline comment as done.

Removed the extra braces and added more comments, adjusted the testcase for the smaller test data.

This revision was automatically updated to reflect the committed changes.