Page MenuHomePhabricator

[llvm-objcopy] [COFF] Add support for removing symbols
ClosedPublic

Authored by mstorsjo on Dec 19 2018, 1:26 AM.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I realized a few shortcomings in my approach so far (that had gone unnoticed while I've kept my patchset in use for a couple weeks but that just now cropped up) - the fact that I can't use symbol names as unique keys

I'm updating the patch to work around this now, by adding a different unique key to each symbol, to avoid having to use the non-unique name for that. Incidentally I noticed that neither obj2yaml, llvm-readobj or llvm-objdump were capable of differentiating between two symbols with the same name when dumping relocations; I'm adding support for that in D56140 for llvm-readobj, and will amend the tests here a bit once that lands, to properly verify that aspect as well.

tools/llvm-objcopy/COFF/Object.cpp
35 ↗(On Diff #178846)

Actually, associative comdats are a case where one section (indirectly, via a section definition symbol) references another section. Since this patch only handles symbol removal so far, I'm not touching that quite yet. Once I do, I'll probably first just let the removal do what it does and then error out at the end if there are dangling references - which should save quite a bit of complexity and be just fine for the basic stripping functionality (and shouldn't exclude implementing it differently later if desired).

mstorsjo updated this revision to Diff 179739.Dec 30 2018, 12:19 PM

Did a few suggested changes, renaming setReferencedSymbols to markSymbols, and using a unique key instead of the non-unique symbol name. Will amend the tests for that case once D56140 lands.

jhenderson added inline comments.Jan 2 2019, 3:38 AM
test/tools/llvm-objcopy/COFF/X86/remove-local-symbols.s
20–21 ↗(On Diff #179739)

This will produce a false positive, if otherfunc is present, because you've got nothing checking the total number of symbols. I don't know if COFF has a convenient way of testing this, but I would recommend doing so. Otherwise, you should put a -NEXT suffix on this line (and similarly in the above block) - you should also show that there are no symbols before this list if doing things this way, since llvm-objcopy might have reordered the symbols to put otherfunc first instead of removing it.

Could we also test that two different local symbols are removed? That shows that we remove all locals not just the first.

1 ↗(On Diff #179052)

Yes, to my knowledge, we haven't agreed anything formal re. YAML versus assembly, but I personally would prefer YAML wherever possible. From experience, at least for ELF, assembly often doesn't provide enough low-level control which we need to produce objects that have the properties we want in the input (e.g. what kinds of relocations get emitted etc). The comment about calling otherfunc not resulting in a relocation is a perfect example of this. It also removes the need for the "PRE" checks.

Also, I might be wrong, but wouldn't making this use YAML allow us to drop the X86-requirement here?

test/tools/llvm-objcopy/COFF/X86/remove-symbol1.s
22 ↗(On Diff #179739)

My comments apply from the remove-local-symbols test apply equally well here. We need to more tightly show that there are no symbols in other places. Similarly, it might be nice to make this YAML based if possible.

test/tools/llvm-objcopy/COFF/X86/remove-symbol2.s
6–10 ↗(On Diff #179739)

I don't think this check really gives us anything in this case.

12 ↗(On Diff #179739)

It would be nice if this test checked that the file name appears in the error. The symbol name should probably also be quoted in some fashion. At least on ELF, it is possible to have symbol names with spaces in, so the error can be a bit ambiguous!

tools/llvm-objcopy/COFF/COFFObjcopy.cpp
38 ↗(On Diff #179739)

There seems to be an awful lot of code in this function that I don't think is tested. Perhaps it would make more sense to implement it in more patches (e.g. basic explicit symbol removal, then keep, then strip all or whatever etc), and test each part individually.

52 ↗(On Diff #179739)

This isn't a parsing failure. Use a more appropriate error value, like invalid_argument.

tools/llvm-objcopy/COFF/Object.cpp
38–39 ↗(On Diff #179739)

Referenced is initialised to false. Do you need to reset it here as well? I don't envisage a situation where we should need to do this more than once on a single object.

tools/llvm-objcopy/COFF/Object.h
27–28 ↗(On Diff #179739)

Are these constructors really necessary? Won't they be generated by default?

46–47 ↗(On Diff #179739)

Don't abbreviate unnecessarily (Idx -> Index). Do we really need both separately? If so, perhaps better names would be OriginalIndex and Index or similar, and then use the input symbol index as the key.

70 ↗(On Diff #179739)

This comment seems to indicate an encapsulation issue to me. If this function has to be called any time the vector is updated, perhaps the vector should be private and accessed only through member functions?

mstorsjo marked 9 inline comments as done.Jan 2 2019, 1:09 PM
mstorsjo added inline comments.
test/tools/llvm-objcopy/COFF/X86/remove-local-symbols.s
20–21 ↗(On Diff #179739)

Indeed, that's true. I somehow thought I tested this stricter than this, but apparently I misremembered.

Yes, some sort of stricter check would indeed be good. I think the current order actually is produced by llvm-nm and not dependent on the source file itself, but avoiding such assumptions is of course better.

Using -NEXT would indeed help, but is there anything that would force a match on the first line? llvm-nm doesn't print anything at all before the symbols. llvm-readobj -symbols is way too verbose for a neat test here (it prints 5+ lines per symbol), but llvm-objdump -t turns out to be very compact and detailed, so that's probably even better than this. That output also has got a nice start header that makes the use of -NEXT exact.

1 ↗(On Diff #179052)

Yes, using YAML would remove the dependency on X86 being enabled, and I agree that it makes some things less ambiguous. As for writing (and reading) the tests I prefer assembly, but if there's strong opposition I can make them YAML instead and just use assembly as source locally when crafting the tests.

There is, however, at least one case in a test (that depends on D56140) which is impossible to make using YAML as input at the moment; the COFF YAML format doesn't distinguish between multiple symbols with the same name for relocations. To fix that, the COFF YAML format would need to be changed to use symbol indices instead of names for relocations (mirroring what actually is stored in the files).

test/tools/llvm-objcopy/COFF/X86/remove-symbol2.s
6–10 ↗(On Diff #179739)

Indeed, this part of the test is pretty pointless.

12 ↗(On Diff #179739)

Ok, will make the code quote it and have the test look for a filename as well.

tools/llvm-objcopy/COFF/COFFObjcopy.cpp
38 ↗(On Diff #179739)

Fair enough, I can do that.

52 ↗(On Diff #179739)

Ok, will use a different error code.

tools/llvm-objcopy/COFF/Object.cpp
38–39 ↗(On Diff #179739)

I'd rather have the method be self-contained enough to clear it here, in case the caller would call this method multiple times. If it doesn't, the method leaks the limitation that it can only be called once, which in itself also is rather encapsulation.

To make things safer perhaps the field should be initialized to true, to indicate that all symbols are referenced unless we've called this and actually know for sure?

tools/llvm-objcopy/COFF/Object.h
46–47 ↗(On Diff #179739)

Ok, I'll expand the unnecessarily terse name.

As for Key vs OriginalIndex - yes, input symbol index would work just as well for key (and that's pretty much what I use right now anyway). But as @jakehehrlich wants to have the Object API usable even for constructing object files from scratch, naming it OriginalIndex feels a bit limiting.

70 ↗(On Diff #179739)

Yes, that'd be a nicer abstraction. That leaves things a bit open wrt how to use it from the Reader though; I could have an add method which adds one symbol at a time and updates the map after each one. That ends up with O(n^2) for the map updates. Alternatively the Reader produces a vector locally and there'd be an addSymbols(vector<Symbol>) method here, which adds all at once and does one update of the map at the end.

What do you think?

mstorsjo updated this revision to Diff 179944.Jan 2 2019, 2:44 PM

Updating with most of the clear issues fixed, removed code for all other stripping modes than removing explicitly named symbols. Using llvm-objdump -t instead of llvm-nm for printing symbol lists, using -NEXT to avoid the risk of any missed symbols. Using llvm::errc::invalid_argument in one place, renamed RawIdx to RawIndex, added quoting around the symbol name for the erorr message, extended the error test to check for this and the file name.

Other issues that still are under discussion are not yet changed.

mstorsjo updated this revision to Diff 180003.Jan 3 2019, 12:42 AM

Updated and extended the testcase after llvm-readobj was improved. The testcase in remove-symbols1.s no longer works with yaml input, with current yaml2obj/obj2yaml tools/formats.

llvm-readobj -symbols is way too verbose for a neat test here (it prints 5+ lines per symbol)

Not that it's really relevant here, I think, but llvm-readelf (aka GNU output style for llvm-readobj) produces single line output for each symbol. I've not looked, but I'm guessing that llvm-readobj doesn't have an equivalent for COFF. Maybe that's worth taking a look at at some point? We actually tend to use the verbose output style for the llvm-objcopy ELF tests though.

Yes, using YAML would remove the dependency on X86 being enabled, and I agree that it makes some things less ambiguous. As for writing (and reading) the tests I prefer assembly, but if there's strong opposition I can make them YAML instead and just use assembly as source locally when crafting the tests.

I don't feel strongly about it, but I would prefer YAML, because of the extra control granted.

There is, however, at least one case in a test (that depends on D56140) which is impossible to make using YAML as input at the moment; the COFF YAML format doesn't distinguish between multiple symbols with the same name for relocations. To fix that, the COFF YAML format would need to be changed to use symbol indices instead of names for relocations (mirroring what actually is stored in the files).

Hmm... that's an interesting case. I think it accepting arbitrary indexes would be nice. IIRC, the ELF YAML format allows for some fields specifying either a name or index, and if an index, it doesn't do any checking, which allows for crafting invalid inputs to test error handling. Having a symmetry here would make a lot of sense.

test/tools/llvm-objcopy/COFF/X86/remove-symbol1.s
11–12 ↗(On Diff #180003)

Unless there's a strong need, I'd just drop the (16)/(15) bit at the end, as it allows folding these two lines together, and doesn't really give us anything, since there's only one "ptr" symbol.

47 ↗(On Diff #180003)

Could you largely fold this and the PRE block together, to reduce risk of copy/paste issues, and make clear the difference?

You could do this by sharing the majority of the checks, but having a single difference for the removed symbol:

...
# SYMBOLS-NEXT: foo
# SYMBOLS-PRE-NEXT: otherfunc
# SYMBOLS-NEXT: otherfunc2
...
tools/llvm-objcopy/COFF/COFFObjcopy.cpp
32 ↗(On Diff #180003)

You might want to simplify this if in this first version to only check the SymbolsToRemove.

40–42 ↗(On Diff #180003)

This comment doesn't really make sense in the patch's current form.

tools/llvm-objcopy/COFF/Object.cpp
38–39 ↗(On Diff #179739)

Okay, I see your point, so it's fine what you have done. I have issues both with it being set to true and to false, because each would result in technically incorrect state for a period of time until the function is called. Perhaps it would be best to simply not initialise Referenced at all, until markSymbols is called, since the value can't be relied on to be safe without that call.

tools/llvm-objcopy/COFF/Object.h
46–47 ↗(On Diff #179739)

Okay, no problem. I'd marginally prefer UniqueID to Key, I think, but I'm okay if you prefer it this way around.

70 ↗(On Diff #179739)

I think an addSymbols function as you describe would make a lot of sense. A future change might also want a single addSymbol function too, but that's just a minor thing that can be added when required.

mstorsjo marked 14 inline comments as done.Jan 3 2019, 2:14 PM

There is, however, at least one case in a test (that depends on D56140) which is impossible to make using YAML as input at the moment; the COFF YAML format doesn't distinguish between multiple symbols with the same name for relocations. To fix that, the COFF YAML format would need to be changed to use symbol indices instead of names for relocations (mirroring what actually is stored in the files).

Hmm... that's an interesting case. I think it accepting arbitrary indexes would be nice. IIRC, the ELF YAML format allows for some fields specifying either a name or index, and if an index, it doesn't do any checking, which allows for crafting invalid inputs to test error handling. Having a symmetry here would make a lot of sense.

Ok, I'll try to have a look at that and whether it's possible to extend the COFF YAML format similarly.

test/tools/llvm-objcopy/COFF/X86/remove-symbol1.s
11–12 ↗(On Diff #180003)

The point I tried to make in keeping it explicit, is that llvm-objcopy updated the symbol index in this relocation. Granted, if it wouldn't have, the symbol name would have changed as well (or it would have ended up as a broken relocation), but I tried to show it clearly that this reloc actually does refer to a symbol that changed index.

47 ↗(On Diff #180003)

Oh, right, yes - will do.

tools/llvm-objcopy/COFF/COFFObjcopy.cpp
32 ↗(On Diff #180003)

Right, yes, will fix.

40–42 ↗(On Diff #180003)

Oops, will fix.

tools/llvm-objcopy/COFF/Object.cpp
38–39 ↗(On Diff #179739)

Ok, I can leave it uninitialized, and then sanitizers and checkers can point out if it's ever used uninitialized when it shouldn't.

tools/llvm-objcopy/COFF/Object.h
27–28 ↗(On Diff #179739)

Sorry, I forgot to reply on this one yesterday. They're indeed unnecessary if using the right syntax for initializing, I didn't think of the new C++11 syntax for doing that without explicit constructors.

46–47 ↗(On Diff #179739)

I can change it to UniqueId, that's what I originally called it, but later changed it into Key because I thought it was better - I'll change it back.

mstorsjo updated this revision to Diff 180140.Jan 3 2019, 2:16 PM
mstorsjo marked 6 inline comments as done.

Updated the details discussed with @jhenderson, except for the still-discussed overly verbose relocation symbol index in the test, and still using assembly instead of yasm for the test, pending extending obj2yaml/yaml2obj.

mstorsjo marked an inline comment as done.Jan 3 2019, 2:19 PM
mstorsjo added inline comments.
tools/llvm-objcopy/COFF/Object.h
72 ↗(On Diff #180140)

Oops, I forgot to clang-format this one - done locally but won't repost a new patch just for that.

jhenderson added inline comments.Jan 4 2019, 2:16 AM
test/tools/llvm-objcopy/COFF/X86/remove-symbol1.s
11–12 ↗(On Diff #180003)

Okay, that seems reasonable. Might be worth a small comment to show that's what you're doing (i.e. showing that the index has been changed).

tools/llvm-objcopy/COFF/Object.cpp
20–23 ↗(On Diff #180140)

Not sure it's actually any simpler, but you could use std::transform, to do this. Approximate untested code:

std::transform(NewSymbols.begin(),
  NewSymbols.end(),
  std::back_inserter(Symbols),
  [&NextSymbolUniqueID](Symbol&S){ S.UniqueId = NextSymbolUniqueId++;});

Okay, now that I type it out, this is probably less readable too, so feel free to completely ignore it!

tools/llvm-objcopy/COFF/Object.h
64 ↗(On Diff #180140)

Could this return an ArrayRef?

67 ↗(On Diff #180140)

This can take an ArrayRef.

72 ↗(On Diff #180140)

I'd move this method to be alongside getSymbols.

27–28 ↗(On Diff #179739)

I'm happy that you changed this, but I now realise that I misread the original, and probably wouldn't have bothered commenting if I'd realised it was a conversion constructor, not a generic copy constructor!

mstorsjo marked 8 inline comments as done.Jan 4 2019, 1:42 PM
mstorsjo added inline comments.
test/tools/llvm-objcopy/COFF/X86/remove-symbol1.s
11–12 ↗(On Diff #180003)

Ok, adding a comment here.

tools/llvm-objcopy/COFF/Object.h
64 ↗(On Diff #180140)

Yup, will change.

67 ↗(On Diff #180140)

Done.

72 ↗(On Diff #180140)

Ok, will do.

mstorsjo updated this revision to Diff 180315.Jan 4 2019, 1:44 PM
mstorsjo marked 4 inline comments as done.

Changed a few cases of std::vector to ArrayRef as suggested, moved the getMutableSymbols method next to getSymbols as suggested, added a comment about the verbose tests, and changed the test input to yaml (depending on D56294).

jhenderson added inline comments.Jan 7 2019, 3:03 AM
test/tools/llvm-objcopy/COFF/Inputs/remove-symbols.yaml
38 ↗(On Diff #180315)

Similar to my comments in D56294, how much of this YAML is actually necessary to test symbol removal? We really only want the very minimum needed for this particular case (and we can add more for other tests or too this test as we add more cases).

test/tools/llvm-objcopy/COFF/remove-symbol1.test
1 ↗(On Diff #180315)

Please rename this to strip-symbol.test, to match the long form of the switch name. Also, please add a case using the long form. This could be a second run of llvm-objcopy on the same input, and show that you get the same output. Finally, the general pattern in most ELF tests is to make the edits out-of-line (i.e. specify a different output file to the input file), and then to show that the input file was not itself modified by the operation. This last point is probably only important because of the second point re. the long form switch name etc.

test/tools/llvm-objcopy/COFF/remove-symbol2.test
1 ↗(On Diff #180315)

Rename this test to something meaningful, e.g. strip-referenced-symbol.test

In fact, I wonder if we should mirror the equivalent ELF test (strip-reloc-symbol.test) and error message as much as possible? What do you think?

mstorsjo updated this revision to Diff 180466.Jan 7 2019, 5:03 AM

Stripped down the yaml input, renamed tests as suggested (matching the ELF tests), adjusted the error message to match the one used in the ELF case.

jhenderson accepted this revision.Jan 7 2019, 5:28 AM

LGTM, but please wait for the others to confirm they're happy.

This revision is now accepted and ready to land.Jan 7 2019, 5:28 AM

@jakehehrlich - Any further comments, or can this go in?

alexshap added inline comments.Jan 7 2019, 2:29 PM
tools/llvm-objcopy/COFF/Writer.h
25 ↗(On Diff #180466)

btw, i forgot to ask it earlier - why do we need this base class + protected fields ?
maybe just roll it into COFFWriter ? (the hierarchy of Readers (in ELF and MachO) basically facilitates handling different kinds of input (object file vs binary input (not a regular object file)), but why is this base class necessary ?

alexshap added inline comments.Jan 7 2019, 2:33 PM
tools/llvm-objcopy/COFF/Object.h
84 ↗(On Diff #180466)

i was thinking about these 3 guys for some time,
i don't have a strong opinion - however, for what it's worth, they all should be in some consistent state, maybe there should be an extra abstraction which would encapsulate them ? (and Object would own that abstraction (for example)). There is some interference with relocations, but we can expose some methods to enable that "communication".

mstorsjo marked 2 inline comments as done.Jan 7 2019, 2:44 PM
mstorsjo added inline comments.
tools/llvm-objcopy/COFF/Object.h
84 ↗(On Diff #180466)

Well they're private at least, to me that's at least some sort of encapsulation. I'd rather not overengineer that part at this point, in case later patches turn out to need things wrt these I haven't thought of yet.

tools/llvm-objcopy/COFF/Writer.h
25 ↗(On Diff #180466)

They're not really necessary - I brought them along while copying the general structure from ELF, and I tried to ask for opinions during review whether I should keep or remove them, but I don't think anybody commented on it. I can do an NFC patch to remove them.

alexshap accepted this revision.Jan 7 2019, 3:14 PM
alexshap added inline comments.
tools/llvm-objcopy/COFF/Writer.h
25 ↗(On Diff #180466)

yeah, sounds good, thanks

rupprecht accepted this revision.Jan 10 2019, 12:57 PM

Just some nits, but I think this is good to commit.

tools/llvm-objcopy/COFF/Object.cpp
20 ↗(On Diff #180466)

Symbol& S?

31 ↗(On Diff #180466)

Given the number of symbols may be large and you know the exact size you want to add, you might want to avoid reallocation by changing SymbolMap.clear() with SymbolMap = DenseMap<xxx>(Symbols.size()).

tools/llvm-objcopy/COFF/Reader.cpp
87 ↗(On Diff #180466)

Similarly you can set an initial size of COFFObj.getRawNumberOfSymbols() here

mstorsjo marked 3 inline comments as done.Jan 10 2019, 1:09 PM

Just some nits, but I think this is good to commit.

Ok - should I still wait for @jakehehrlich, or do you happen to know about his whereabouts?

tools/llvm-objcopy/COFF/Object.cpp
20 ↗(On Diff #180466)

No, intentionally not a reference here. The point is that we have an ArrayRef<Symbol> from the caller, which essentially is read-only, and we mutate the Symbol objects as we insert them into the Object class' internal vector.

31 ↗(On Diff #180466)

Ok, I'll try that.

tools/llvm-objcopy/COFF/Reader.cpp
87 ↗(On Diff #180466)

Ok, will do. That'd be Symbols.reserve(COFFObj.getRawNumberOfSymbols()) right, as it can't be done via the constructor directly?

Just some nits, but I think this is good to commit.

Ok - should I still wait for @jakehehrlich, or do you happen to know about his whereabouts?

IMO you can go ahead and submit now -- Jake (or anyone else) can always do a post-commit review if there are issues with this patch.

tools/llvm-objcopy/COFF/Reader.cpp
87 ↗(On Diff #180466)

Sounds right to me

This revision was automatically updated to reflect the committed changes.