This is an archive of the discontinued LLVM Phabricator instance.

llvm-objdump Allow disassembly of ARM and thumb code mix in ELF object file.
Needs RevisionPublic

Authored by khemant on Sep 15 2016, 3:08 PM.

Details

Summary

This is a patch that tries to enable something that is not usually possible in present design. If the patch looks a little weird bear with me. I want ideas on this issue and any workable solutions if any.

llvm-objdump with no triple or mcpu tries to infer the architecture based on

the details present in file. This works on most architectures and file formats
except ET_ARM type object type. The presence of thumb if any is not indicated
anywhere except in .ARM.attributes and using mapping symbols $t.* Also not all
architectures have all instructions available. This makes the disassembly of ARM
ELF full linked image that has thumb and ARM mix  virtually useless.

The objdump in its present state  has only one disassembler and hence cannot
make use of the above information. This commit will crack open the ELF binary if
it is ET_ARM type and try to see CPU name and if it is thumb only processor.

The tool will create a second disassembler for ARM ELF and try using that if the
first one fails. The change also makes use of mapping symbols $a, $t. The
mapping symbols will determine which disassembler should be used. This makes
usage more user friendly as the CPU no longer needs to be explicit if the
attributes are present in the file. If the attributes are not present, the usage
remains the old way (using -mcpu and -triple options).

Diff Detail

Repository
rL LLVM

Event Timeline

khemant updated this revision to Diff 71565.Sep 15 2016, 3:08 PM
khemant retitled this revision from to llvm-objdump Allow disassembly of ARM and thumb code mix in ELF object file..
khemant updated this object.
khemant added reviewers: echristo, compnerd, rafael.
khemant set the repository for this revision to rL LLVM.
khemant added inline comments.Sep 15 2016, 3:10 PM
test/tools/llvm-objdump/ARM/disassemble-arm-thumb-elf-mix.test
1

Will be adding source and the way source was compiled and linked.

rafael removed a subscriber: rafael.
rengolin requested changes to this revision.Sep 16 2016, 5:41 AM
rengolin added reviewers: rengolin, zatrazz, peter.smith.

This is an interesting functionality, but I'm not well versed in objdump to know if the code is what's expected.

The first thing I would say is that it really needs a clang-format pass as well as a lot more tests, since the change in functionality is not small.

From a quick look, it seems that you went for the "whatever works" approach, which makes the whole thing extremely complicated, and harder to maintain. I'm not sure all other architectures will be happy with the added context, complexity and conditional blocks for the sake of one arch (ARM).

I honestly think this needs a better design and it would be better to get that discussion to the llvm-dev list first, as an RFC.

cheers,
--renato

tools/llvm-objdump/llvm-objdump.cpp
1092

You're converting a pointer to an array ref to another pointer, this is really confusing. Please, pick one style and stick with it.

even better, isn't there a way to decode the attributes into a nice structure?

@compnerd, you spent more time in the attributes code than I did, any ideas?

1102

StringRef.startsWith("aeabi")?

1119

there has to be a better way to deal with attributes...

1362

AArch64 code doesn't mix with AArch32, so this is conceptually wrong.

If this is just to reduce code duplication in this file, I understand, but there could be an error. It'd be clearer, though, if they were in separate blocks, one for ARM and one for AArch64.

This revision now requires changes to proceed.Sep 16 2016, 5:41 AM

The first thing I would say is that it really needs a clang-format pass as well as a lot more tests, since the change in functionality is not small.

This was clang-formatted, may be my clang format python script is old.

From a quick look, it seems that you went for the "whatever works" approach, which makes the whole thing extremely complicated, and harder to maintain. I'm not sure all other architectures will be happy with the added context, complexity and conditional blocks for the sake of one arch (ARM).

I honestly think this needs a better design and it would be better to get that discussion to the llvm-dev list first, as an RFC.

This is a problem with existing design and is not easily compatible with ARM architecture. Take a look at MachO dumper, it employs similar method to try using both disassembler. But as suggested, I will send an RFC for this functionality on dev list and find out.

tools/llvm-objdump/llvm-objdump.cpp
1092

Got it.

The only way to decode the attributes is to at the size a parse them (they are sorted in ULSB format).

1102

Got it.

1119

I am open for ideas.

AFAIK there is no way to go past an attribute till you read the size, and since each can vary based on attribute, you have to go over them till you find the one you need or finish reading all of them.

1362

This is not used for mix of ISA. The $d mapping can exist within a region of code that is treated as a single STT_FUNC symbol. The mapping from $x to $d and then back to $x is used to make sure you disassembler is not thrown off balance by trying to dissemble data. This change existed before this change. I made sure the text is separated into arm and thumb and 64 bit arm instructions. Line 1550 will give you an idea of what I am saying.

peter.smith edited edge metadata.Sep 19 2016, 2:23 AM

I've not got a lot to add over what Renato and other comments on llvm-dev have made. I understand that this is probably close to the best incremental change that could be made to llvm-objdump, but I think that this is probably too intrusive a change just for ARM as it is and I think we need to be clearer about the requirements in some areas such as using attributes.

Some comments that might be useful in the future:
It would be good to write down what the scope of disassembly you are aiming for. For example: Is it architectural disassembly where instructions not supported on the target architecture come out as undefined (i.e. disassembly of an ARMv7a object for an ARMv5 target) or a universal if it is a legal ARM or Thumb instruction in any architecture disassemble it regardless of architecture.

Are you intending to try and support stripped binaries with no mapping symbols or static symbol table? There are overlaps between ARM and Thumb bit patterns, and of course literal data so I fear that even trying to do this may cause more problems than it solves.

I think that putting attribute reading code directly into llvm-objdump, potentially duplicating code in llvm-readobj isn't the right thing to do. We should have an attribute reading/writing library that tools can use. This would be really useful for lld for example.

Disassembly is an architectural property and not a property of the CPU name, it is true that a CPU name infers a default set of attributes, but there are ways to alter these defaults and have different properties in the object file. If we are going to read the attributes I think we should be reading the architecture and the various supporting attributes to work out what the target and subtarget features are. In an ideal world we shouldn't be making any disassembly decisions based on the CPU name alone.

When mapping symbols aren't available but the static symbol table still persists, it is possible to use the state of the last STT_FUNC symbol definition (bit 0 == 1 for Thumb) and (bit 0 == 0) for ARM to determine ARM or Thumb. This won't work if there is a state change or literal without another STT_FUNC or STT_OBJ symbol, but it is a reasonable heuristic.

I think that any llvm-objdump is going to end up with an ARM disassembler and a Thumb disassembler, however I think that there may be neater ways to switch between them in a refactored llvm-objdump. For example the mapping symbols identify a non-overlapping range of addresses that are either ARM, Thumb or literal data and there is one disassembler for each of those ranges, a design that in effect does DisassembleRange(Start, End, Disassembler) would work reasonably well.