This is an archive of the discontinued LLVM Phabricator instance.

[llvm] Add symbol table support to llvm-objcopy
ClosedPublic

Authored by jakehehrlich on Jun 13 2017, 1:37 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

jakehehrlich created this revision.Jun 13 2017, 1:37 PM

Forgot one thing

tpimh added a subscriber: tpimh.Jun 14 2017, 1:47 PM

I updated this to match the latest code in D33964

jakehehrlich edited the summary of this revision. (Show Details)
jakehehrlich removed reviewers: phosek, Bigcheese.

I swapped the order of these diffs around so that binary output comes first now.

jakehehrlich edited the summary of this revision. (Show Details)
jakehehrlich added reviewers: phosek, Bigcheese.

Updated this yet again to match revisions made to other diffs. This change only depends on a diff that already has an LGTM so it should be ready for review.

jhenderson edited edge metadata.Aug 21 2017, 4:04 AM

Nit: full stops at the end of all comments, please.

In addition to the inline comments, I'm concerned about the behaviour for section symbols, which do not have a name directly associated with them (they could be considered to have the section name, but this is not usually recorded in the string table). As things stand, it looks to me like only one section symbol will be added, because the others will all have the same (empty) Name, and so will clash in the StringMap.

I'm also not a fan of the use of Symbols and FinalSymbols as it means that there are two copies of the symbol table, which could become quite a large additional memory overhead (also such similar names don't sit well with me). Not thought about this too much, but I could think of a couple of other options: 1) Turn the FinalSymbols vector into a vector of pointers into the Symbols map, ordered in the output order; 2) Perhaps combined with my suggested alternative algorithm for finalizing the symbol table order above, simply record the new index in the symbols in the map, and then iterate over the map when writing, jumping to the appropriate location as needed.

test/tools/llvm-objcopy/symbol-copy.test
91

For completeness, please check that there are no more symbols by checking the end of the symbol table has been reached.

tools/llvm-objcopy/Object.cpp
121

I don't think weak and global symbols have to be ordered in any specific order with respect to each other, but this will sort all STB_WEAK symbols to be after the STB_GLOBAL symbols.

134

Could you comment why you do the double sorting, please?

I assume the aim is to maintain the symbol order as much as possible, but moving symbols that are now local/global to the correct place in the symbol table, but it would be good to explicitly say something like that.

Alternatively, isn't it possible to combine the two comparators and then run a single sort?

Perhaps an even better alternative is to use a different algorithm entirely, namely to insert the symbols in the vector in the correct order already - simply record where the last STB_LOCAL was and insert it there if local, or at the end if not. This recorded location can then also be use to set the Info field appropriately, without needing to do a search.

143

Could this become addSymbolNames, please, as that's more descriptive.

221

This could cause a crash if the the sh_link field is corrupt and is higher than the number of sections, or is unset.

224

Could this error message be improved to include the symbol and supposed string table index, please?

232

What is Sym.st_shndx is not a valid section?

475

You appear to have lost a comment here.

tools/llvm-objcopy/Object.h
125

Is this comment correct? The problem isn't that we're changing from ELFT to ELFT, it's because we are changing from ELFT to some internal representation and then back again.

139

Do we need this function for now?

  1. I fixed a lot of little things that James mentioned
  2. I removed a the "removeSymbol" method because, as James noted, it isn't needed right now. It will be needed for strip. Consequently we don't have to reassign indexes to symbols which addresses a whole slew of problems that were raised. This combined with James' idea on how to write the symbol table simplified the code and removed many of the concerns around FinalSymbols (like sorting, extra memory usage, what to name that thing, etc...). This however side-steps what we will need to do in the future so I'd like to recommend two possible plans for adding symbols at a later date in place of adding code to handle reassigning indexes.

Plan 1) Keep the same writeSection algorithm in this latest diff (the algorithm proposed by James) and then locally create an std::vector<Symbol*> in finalize that can be sorted and then iterated though to assign new indexes. The sorting algorithm would use a comparision that would make any STB_LOCAL symbol come before but would ignore STB_WEAK vs STB_GLOBAL in order to better maintain the original order (addressing a particular concern raised by James). The comparison would compare by original index when binding was not an issue.

Pros) Only adds 8-bytes to memory usage for each symbol and that memory is only allocated for the duration that symbols are being finalized.
Cons) Writes happen out of order and there will cause some less than ideal behavior write caching behavior. Sorting increases finalization time for symbol tables also.

Plan 2) Make Symbol an intrusive list node type. When adding a symbol it must then be added to the Symbols map but also to the intrusively constructed list. A pointer to the head of this list would be maintained. When adding and removing symbols the last local symbol is kept track of. This lets us quickly add local symbols to the right place so that we never have to worry about sorting them in any way. As local symbols are removed we can update the pointer to the last local symbol.

Pros) The symbol table is written in order, no sorting is needed so finalization becomes linear. This also means we only need to do 2 passes over the symbols for the whole thing. The last local symbol is kept track of by the SymbolTableSection so finalizing the Info field is easy (It's the index of the local symbol whose's next symbol is not local).
Cons) It's a bit more complicated. It uses like 16-bytes (or more?) of space per symbol and that space is is used for the whole duration of llvm-objcopy's life. Each constant time operation will likely be more expensive.

My proposal is that some veriation of one of these 2 methods be used to add removeSymbol to SymbolTableSection just before symbol stripping is needed.

In addition to the inline comments, I'm concerned about the behaviour for section symbols, which do not have a name directly associated with them (they could be considered to have the section name, but this is not usually recorded in the string table). As things stand, it looks to me like only one section symbol will be added, because the others will all have the same (empty) Name, and so will clash in the StringMap.

What are section symbols? It doesn't sound like I have supported them and I am not aware of what they are. I wasn't able to find them online. These are some kind of symbol for which distinct symbols can have the same name? That seems to spell problems for my overall design here.

Plan 1) Keep the same writeSection algorithm in this latest diff (the algorithm proposed by James) and then locally create an std::vector<Symbol*> in finalize that can be sorted and then iterated though to assign new indexes. The sorting algorithm would use a comparision that would make any STB_LOCAL symbol come before but would ignore STB_WEAK vs STB_GLOBAL in order to better maintain the original order (addressing a particular concern raised by James). The comparison would compare by original index when binding was not an issue.

Pros) Only adds 8-bytes to memory usage for each symbol and that memory is only allocated for the duration that symbols are being finalized.
Cons) Writes happen out of order and there will cause some less than ideal behavior write caching behavior. Sorting increases finalization time for symbol tables also.

Which writes are we talking about here? If we've sorted things, then we'll be writing contiguously into the buffer.

Plan 2) Make Symbol an intrusive list node type. When adding a symbol it must then be added to the Symbols map but also to the intrusively constructed list. A pointer to the head of this list would be maintained. When adding and removing symbols the last local symbol is kept track of. This lets us quickly add local symbols to the right place so that we never have to worry about sorting them in any way. As local symbols are removed we can update the pointer to the last local symbol.

Pros) The symbol table is written in order, no sorting is needed so finalization becomes linear. This also means we only need to do 2 passes over the symbols for the whole thing. The last local symbol is kept track of by the SymbolTableSection so finalizing the Info field is easy (It's the index of the local symbol whose's next symbol is not local).
Cons) It's a bit more complicated. It uses like 16-bytes (or more?) of space per symbol and that space is is used for the whole duration of llvm-objcopy's life. Each constant time operation will likely be more expensive.

My biggest concern with this is that we have to be careful not to change the symbol binding of a symbol in place, or if we did, we would need to move it within the list. If that gets forgotten, then the order will be wrong.

My proposal is that some veriation of one of these 2 methods be used to add removeSymbol to SymbolTableSection just before symbol stripping is needed.

My initial instinct is that we should take the simpler approach (option 1), as it is less likely to go wrong as more functionality is added to llvm-objcopy. That being said, I suspect that in the general case, people aren't going to make many changes to symbol bindings, so I'd say that the order of existing symbols is unlikely to change much, making a general sort probably rather inefficient. This then suggests option 2 to be better!

Do we need a map at all? Is removing individual symbols by name something that needs to be supported, and is it up to objcopy to ensure that there are no two symbols with the same name? Could we actually maintain a single vector, where the index in the vector is the symbol's index (with something recording the last local). Then, when adding symbols, insert in the corresponding place. Changing a symbol's binding from weak/global to local or vice versa becomes a move within the vector. I think this means that there would only be a memory overhead of a std::vector, plus one size_t value.

One thing that occurred to me with regards to modifying the symbol table ordering that might be somewhat related to this - we have to be wary that there might be relocations (and possibly other section contents), that refer to symbol indexes, so we need to track those if we anticipate the symbol table contents changing.

What are section symbols? It doesn't sound like I have supported them and I am not aware of what they are. I wasn't able to find them online. These are some kind of symbol for which distinct symbols can have the same name? That seems to spell problems for my overall design here.

Section symbols are local symbols with type STT_SECTION. They do not have a name as such (their st_name field is 0, so points to the empty string), but most tools that I've encountered use the name of their corresponding section (identified by st_shndx) when displaying them. They are used by relocations primarily, to allow section relative relocations, i.e. references to some offset into a section that doesn't necessarily have its own symbol. The only thing that you have to be aware of for them in this area is that they don't have unique names. This suggests to me that we shouldn't be using the StringMap (see my above comment).

tools/llvm-objcopy/Object.cpp
239

invalid section index of -> invalid section with index

I think that reads better.

tools/llvm-objcopy/Object.h
125

The first line now doesn't read well at all - I'm not sure what you're trying to say any more with it.

Which writes are we talking about here? If we've sorted things, then we'll be writing contiguously into the buffer.

I was considering having "FinalSymbols" be a local variable. If it isn't local then we can write contiguously. If it is local then all it helps us do is assign the indexes and we would have to do things noncontiguously. But I think removing the StringMap all together as you recommended is likely a better solution.

My biggest concern with this is that we have to be careful not to change the symbol binding of a symbol in place, or if we did, we would need to move it within the list. If that gets forgotten, then the order will be wrong.

Eh, yeah. That is a fair concern. I'd rather allow for as many modifications to occur as possible. I think I prefer sorting now, especially since this latest diff doesn't have a StringMap

My initial instinct is that we should take the simpler approach (option 1), as it is less likely to go wrong as more functionality is added to llvm-objcopy. That being said, I suspect that in the general case, people aren't going to make many changes to symbol bindings, so I'd say that the order of existing symbols is unlikely to change much, making a general sort probably rather inefficient. This then suggests option 2 to be better!

I think we can get the best of both worlds actually. You pointed that people won't make changes to symbol bindings or symbol order. When these things do change they likely won't change very much. Well lots of sorting algorithms are linear under these conditions. So we just need to use a sorting algorithm that takes advantage of this. Unfortunately there don't seem to be any ready implementations of such algorithms in the standard library or in llvm.

This diff:

  1. Should have fixed a couple nits by James
  2. I reworked the algorithm taking the "do we need the StringMap" idea seriously. All operations should happen in ideal running time, the implementation is simple, and caching behavior should be better than previous considered options. The only real downside is that now symbols have to be allocated individually.

The primary reason for the StringMap before was so that when handling relocations we could look up symbol indexes (see here: https://reviews.llvm.org/D36554). Strings seemed like a better option because I thought symbol names were unique. Symbol names, unlike symbol indexes, will not change on strip operations so strings seemed like better identifiers than indexes. James pointed out however that symbol names are not necessarily unique so we can't do that. So we need another way to look up indexes of the finalized symbols. If we agree that before modification of the symbol table that symbol indexes agree with original symbol indexes then we can look up pointers to symbols very quickly. For relocation sections rather than storing a symbol name we can store a pointer to the symbol.

When removing symbols we will not remove symbols one by one but instead remove a whole bunch of symbols all at once. When we do this we can reassign new Index values to each symbol so that the internal symbol index (not the original index) stays consistent with the vector the symbols are being stored in. This means that if Sym is returned by the getSymbolByIndex method that Sym->Index will agree with the index passed to getSymbolByIndex. This reassignment of indexes won't affect relocations since relocations will grab pointers to the symbols during construction before any modification has occured.

To clarify, because I may have missed it, but are we deferring sorting for now, to a later change when it becomes important? I don't see where it's happening currently!

tools/llvm-objcopy/Object.cpp
105

Do we need an explicit this here now that this class isn't templated?

To clarify, because I may have missed it, but are we deferring sorting for now, to a later change when it becomes important? I don't see where it's happening currently!

It's not currently happening. I figured once it's clear how symbols are being altered we can add the sorting algorithm. I'll add it now if you'd like, it just seemed like it's unnecessary right now. Your call.

jakehehrlich marked 13 inline comments as done.Aug 24 2017, 2:31 PM
jakehehrlich added inline comments.
tools/llvm-objcopy/Object.cpp
105

Nope! I thought it was odd that I needed this-> in other places. I didn't realize what the rule was for when you need this-> and when you don't. Thanks!

jhenderson accepted this revision.Aug 25 2017, 1:38 AM

To clarify, because I may have missed it, but are we deferring sorting for now, to a later change when it becomes important? I don't see where it's happening currently!

It's not currently happening. I figured once it's clear how symbols are being altered we can add the sorting algorithm. I'll add it now if you'd like, it just seemed like it's unnecessary right now. Your call.

Nope, I'm happy with it being added later. No point in adding unnecessary code at this point for the sake of it.

I realised that there's another case that you need to add support for either now or later, namely symbols with section indexes of SHN_COMMON or SHN_ABS, as these don't refer to actual sections. I'm happy for this to be a later change though. Assuming you do this in a subsequent change, this gets a LGTM from me.

tools/llvm-objcopy/Object.cpp
105
This revision is now accepted and ready to land.Aug 25 2017, 1:38 AM
This revision was automatically updated to reflect the committed changes.
jakehehrlich marked an inline comment as done.

Added explicit move when returning SymTab for gnu builds