This change adds support for the -I and -B flags from GNU objcopy. The only currently supported input type is "binary" and only 4 architectures (by a total of 5 names) are currently supported. This change adds a constructor for Object and its subtypes that includes the basic sections and contents that almost any relocatable ELF will have. This needs to know the ELFT and EMachine code so some architecture information stuff was needed (oddly I wasn't able to figure out any good way to piggy back off of llvm) After that another method adds the parts that are specific to the binary input type.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
I'm not sure I'm a big fan of the big stack of functions that gradually pick off an argument, and then choose a specific function to call next. It's effectively a giant nested if statement, whereas what we had before was much more linear. I haven't got a good alternative solution as of yet, but will have a think and try to come up with an alternative.
Otherwise, most of this looks fine. There should be tests for each of the different possible -B values though.
tools/llvm-objcopy/Object.cpp | ||
---|---|---|
604 | The contents of this function feel a bit out-of-order. It starts with setting stuff to do with sections, then adds a symbol, then sets some properties, based on the sections, then goes off and initialises the elf header, before coming back and updating some more properties to do with the sections. Could you maybe reorder things as follows:
This should make the function a little easier to follow. | |
tools/llvm-objcopy/llvm-objcopy.cpp | ||
354–356 | I wonder if this is an indication that Symbols should own their own names. It would mean a bit of copying in the ELF input case, but could prevent easy-to-make errors if we want to create or rename symbols. If you prefer keeping it as is, I'd make a separate function called "MakeBinarySymbolName(StringRef BaseName, StringRef Suffix)", so that the warts of the name ownership can be kept separate from the adding of symbols, and it can be reused in other places too. Did you consider making InputBinaryFormat a subclass of Object? That would allow you to have a slightly nicer name ownership resolution, apart from anything else. | |
421 | This function should be static (along with the rest of the new functions). | |
438 | I'm not sure this function is well named, given that this function is effectively the driver of the rest of objcopy - it's not really handling the input. It's doing all the work. I'd even go so far as to say that it should be inlined into main. If you do want to keep the function, I'd rename Arch to InArch, and change the argument order to Input, InFmt, InArch, OutFmt (i.e. keeping all input-related arguments together). |
I just did some experimenting with GNU objcopy with this, and I'm getting a binary copy of the input, when using -I binary, even with -B i386:x86-64. I can workaround this using -O elf64-x86-64, i.e. "objcopy text.txt text.o -B i386:x86-64 -I binary -O elf64-x86-64". If I drop the -B, I get an EM_NONE machine type, and if I drop the -O, I get a text file. Not sure how you want to handle this!
Well for command line compatibility we're going to have to accept "-O elf64-x86-64" which we don't do right now. In fact for the specific use case that I'm trying to solve here I'll need that to work (whoops). I'm fine copying the behavior of GNU objcopy here but I might add it in two more changes. One change will add extra output formats (like elf64-x86-64) and the other will make the default output type the corresponding input type. Those two changes together with this change should produce command line compatibility. Sound good?
I haven't figured out a solution to the "giant nested if-statment" problem yet. There's a sense in which it's largely intractable because we have to use dynamic information to dispatch to 1 of 4 statically known functions of different types. I'm still thinking about it.
- Added test for each -B option (as future architectures/names are added then that test can be extended)
- Made any non-template function static (I think it was just the one)
- Reordered Object(uint16_t) as mentioned
- Made symbols own their names (amazingly this was as simple just replacing StringRef with std::string in one place)
- Made addSymbol take a Twine instead of a StringRef for the symbol name (strictly more general and suits the symbol generation case used in InputBianryFormat)
- Eh...I may be forgetting something but I only changed things that were requested. Hopefully I changed everything that was requested.
Context missing.
Yes, sounds good.
test/tools/llvm-objcopy/binary-input-archs.test | ||
---|---|---|
2 | You don't need to change this test in this commit, but it will need updating once you make the -O format changes. | |
test/tools/llvm-objcopy/binary-input.test | ||
3 | Could you modify this test slightly, please, to exercise the non-alnum replacement code. The easiest thing to do would be to add a file extension to the input file (e.g. make it %t.bin rather than %t). | |
tools/llvm-objcopy/llvm-objcopy.cpp | ||
354–356 | The comment needs updating now that symbols do own their own names, and I think the StringSaver header probably can be deleted too. |
- Removed comment and include for StringSaver stuff
- Tweaked test to show that stemming and removing of alpha numeric characters exists
Adding ".bin" isn't really ideal. It means that ".tmp" winds up in the stemmed file name but that extension probably isn't something we should relay on. It turns out (via a means we definitively shouldn't rely on) that the replacement of alpha numeric characters was already being tested because binary-input.test was the stem. I made this explicit in the test so that even if the test file's name changes the test will still test the appropriate things. I add ".bin" as the extension instead of ".tmp". This also shows more explicitly that stemming works correctly.
On the topic of removing the implicit "giant if" I'm not sure there's a great way to solve this. I've considered solutions involving macros (which are simple but should work) to some rather overly complicated dynamic solutions that check a condition for each ELF type for you and then dispatch to the correct template correspondingly but a) I wasn't able to get them to work properly and b) they were super complicated. In general you're going to have at least 8 branches because a) There are 4 templates and you need to dispatch to the correct one and b) There are two fundamentally different sources of information that need to be considered to decide which of those 4 should be used. Removing the templates we can from Object dosn't help either because how we read in the object and how we decide to write out the object still require that we have the same number of branches. Using macros to ensure uniformity of branches is one idea that a) works and b) is simple but I think I'd rather know explicitly what's going on.
The macro I have in mind looks like this
FOREACH_ELFT(auto *o = dyn_cast<ELFObjectFile<ELFT>>(&Binary), {
HandleELF(*o, OutFmt); return;
})
ELFT is then typedefed in each scope that's copied 4 times for each type. ELFT is then available in both the block and the condition. The code block is only executed in the case that the condition is true. For the case where MInfo is used we can do something that looks like this:
FOREACH_ELFT(elftMatchesMachine<ELFT>(MInfo), {
HandleBinary<ELFT>(Input, OutFmt, MInfo);
})
Using those two changes we can ensure uniformity of these dispatches. Coupling that with a function that returns a unique pointer to an Object and uses the output format to decide the which of ELFObject or BianryObject will be used beings our apparent branches down to a more manageable amount. I'm slightly against both of these solutions but I could be convinced otherwise with minimal effort. The output format decider in particular seems fine to me.
tools/llvm-objcopy/llvm-objcopy.cpp | ||
---|---|---|
354–356 | I think switching to Symbols owning their names is a good idea. Symbols and relocations are likely to be the sticking point for optimization at some point in the future but I'd rather use the conceptually simplest option now and optimize later when we have an issue. I think the biggest optimization for symbol tables will come from lazy loading and not from optimizing copying of small strings like that. As for making InputBinaryFormat a subclass of Object I'm not sure. I didn't consider but considering it now I was intending for those sub classes to be the output formats. This change does raise the question of how input formats should be handled however. For instance why is the binary input format handled here but the elf input format is handled inside of Object, that seems kind of off to me. I remember you mentioned an idea a while back about having read and writer types that map in and out of a common representation. Maybe we should refactor the Object code to expose enough of an interface that code outside of Object can reconstruct the ELF Object the way this code does so for the binary input case. |
For instance why is the binary input format handled here but the elf input format is handled inside of Object, that seems kind of off to me. I remember you mentioned an idea a while back about having read and writer types that map in and out of a common representation. Maybe we should refactor the Object code to expose enough of an interface that code outside of Object can reconstruct the ELF Object the way this code does so for the binary input case.
I think this is exactly the issue with the big "if" statement, and also the naming of the functions. At the moment, we have a path which goes something like: HandleBinaryELFT -> HandleBinary -> HandleArgs (binary object), so we're converting from a binary input into an ELF object, and then emitting it as a binary object. The whole conversion to ELF seems wrong here. The advantage of having separate classes for readers and writers, with a common intermediate representation, is that the process looks a lot clearer.
Your code could look something like:
InputReader Reader(InputFile, InputFormat); Object *Obj = Reader.createObject(); // Creates an intermediate object based on the input type. // There would be 5 different paths (one for each ELF format, and one for binary). Obj->handleArgs(Args); // Removes/add sections/symbols etc. Obj->write(OutputFormat); // Performs section and segment layout and writes out the file, according to the output format.
Note how the "handleArgs" call is hoisted up out of the depths of the nested call, and how the writing is handled separately. It would allow easy adding of new input and output formats, by simply extending the corresponding createObject/write implementation.
Does this sound plausible to you?
test/tools/llvm-objcopy/binary-input-test.bin | ||
---|---|---|
1 | I don't think you meant to add this file? | |
test/tools/llvm-objcopy/binary-input.test | ||
1 | I think you probably want %T, not %p, since that's the test output directory, not the test source directory. | |
112 | I just did some experimenting, and these symbols should include the file extension (so in the current case, this one should be "_binary_binary_input_test_bin_start"), according to GNU objcopy. I think you want to be using filename() instead of stem(). |
I plan on adding a refactoring to use an InputReader and possibly a Writer object and detemplating Object to make the refactoring work. I think that's a) quite doable and b) a very good idea. I'm going to post the refactoring in another change and put this one on hold for a bit and then rebase the refactor into this change so there isn't one giant change all at once.
test/tools/llvm-objcopy/binary-input-test.bin | ||
---|---|---|
1 | uh crap my %p solution was wrong. Now I see that you commend on that. Whoops. |
You don't need to change this test in this commit, but it will need updating once you make the -O format changes.