This is an archive of the discontinued LLVM Phabricator instance.

llvm-objdump: ELF: Handle code and data mix in all scenarios
ClosedPublic

Authored by khemant on Aug 17 2016, 12:31 PM.

Details

Summary

The patch handls behavior of the disassembler when code and data are mixed in a text section.
When code an data are mixed in a section, the GNU disassembler acts in an interesting manner.
If told to disassemble text only (-d), it will interpret mixed data as data
$cat 1.c
int myInt = 1;
char myChar = 'b';
float myFloat = 1.2;
double myComm;

int main () {

return myInt;

}

$cat 1.t
SECTIONS {
.text : { *(.text)
*(.data.my* )
*(.bss*) }
}

$clang -c 1.c -fdata-sections
$ld 1.o -T 1.t
$objdump -d a.out

a.out: file format elf64-x86-64

Disassembly of section .text:

0000000000000040 <main>:

40:   55                      push   %rbp
41:   48 89 e5                mov    %rsp,%rbp
44:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
4b:   8b 04 25 54 00 00 00    mov    0x54,%eax
52:   5d                      pop    %rbp
53:   c3                      retq

0000000000000054 <myInt>:

54:   01 00 00 00                                         ....

0000000000000058 <myChar>:

58:   62 0f 1f 00                                         b...

000000000000005c <myFloat>:

5c:   9a 99 99 3f                                         ...?

ARM variant has one more twist to it. When told to disassemble text only, it will interpret data if data is a standalone symbol, if it is part of a function, it marks it as a word/short etc.
$arm-none-eabi-gcc -c 1.c -fdata-sections
$arm-none-eabi-ld 1.o -t 1.t

$arm-none-eabi-objdump -d a.out

a.out: file format elf32-littlearm

Disassembly of section .text:

00008000 <main>:

8000:       e52db004        push    {fp}            ; (str fp, [sp, #-4]!)
8004:       e28db000        add     fp, sp, #0
8008:       e59f3010        ldr     r3, [pc, #16]   ; 8020 <main+0x20>
800c:       e5933000        ldr     r3, [r3]
8010:       e1a00003        mov     r0, r3
8014:       e24bd000        sub     sp, fp, #0
8018:       e49db004        pop     {fp}            ; (ldr fp, [sp], #4)
801c:       e12fff1e        bx      lr
8020:       00008024        .word   0x00008024  <=====marked as data

00008024 <myInt>:

8024:       00000001                                ....    <====interpreted as data

00008028 <myChar>:

8028:       00000062                                b...

0000802c <myFloat>:

802c:       3f99999a                                ...?

If told to disassemble all - all objdump variants really disassemble the data as instructions:
$arm-none-eabi-objdump -D a.out

a.out: file format elf32-littlearm

Disassembly of section .text:

00008000 <main>:

8000:       e52db004        push    {fp}            ; (str fp, [sp, #-4]!)
8004:       e28db000        add     fp, sp, #0
8008:       e59f3010        ldr     r3, [pc, #16]   ; 8020 <main+0x20>
800c:       e5933000        ldr     r3, [r3]
8010:       e1a00003        mov     r0, r3
8014:       e24bd000        sub     sp, fp, #0
8018:       e49db004        pop     {fp}            ; (ldr fp, [sp], #4)
801c:       e12fff1e        bx      lr
8020:       00008024        andeq   r8, r0, r4, lsr #32

00008024 <myInt>:

8024:       00000001        andeq   r0, r0, r1

00008028 <myChar>:

8028:       00000062        andeq   r0, r0, r2, rrx

0000802c <myFloat>:

802c:       3f99999a        svccc   0x0099999a

Disassembly of section .bss:

00010030 <__bss_start>:

...

Diff Detail

Repository
rL LLVM

Event Timeline

khemant updated this revision to Diff 68398.Aug 17 2016, 12:31 PM
khemant retitled this revision from to llvm-objdump: ELF: Handle code and data mix in all scenarios.
khemant updated this object.
khemant added reviewers: compnerd, echristo, davide.
khemant set the repository for this revision to rL LLVM.
khemant added subscribers: davide, echristo, compnerd and 2 others.
compnerd added inline comments.Aug 17 2016, 8:36 PM
test/tools/llvm-objdump/X86/disassemble-code-data-mix.test
15 ↗(On Diff #68398)

Please uniformly choose a style (space or no space between the comma and type specifier).

18 ↗(On Diff #68398)

Are the sizes really needed?

19 ↗(On Diff #68398)

Whitespace is ignored, so it makes it easier to read if you align the address labels.

tools/llvm-objdump/llvm-objdump.cpp
1127

I wonder how much more expensive it is to keep a std::vector<SymbolRef> instead and query the Address/Name/Type from the symbol.

1128

Join this and the previous line?

khemant added inline comments.Aug 18 2016, 9:23 AM
test/tools/llvm-objdump/X86/disassemble-code-data-mix.test
15 ↗(On Diff #68398)

Good catch, will fix this.

18 ↗(On Diff #68398)

In case of X86 test not really. In case of ARM test (I will upload it with some changes you recommend) it is needed so objdump can figure out that the data is "inside" a function or just a standalone data symbol in a text section.

tools/llvm-objdump/llvm-objdump.cpp
1127

Good question for which I do not have any answer. The patch only made the pair into a tuple.
I see however that types, address and name all return Expected, that would mean every time we query we have to check for all of this and then decide what should be done in case of error(s).

Perhaps someone else may have an answer for this.

1128

Yes.

khemant updated this revision to Diff 68559.Aug 18 2016, 10:15 AM

Incorporated suggestions from previous diff.
Added ARM test.

khemant marked 3 inline comments as done.Aug 18 2016, 10:15 AM
compnerd added inline comments.Aug 22 2016, 7:48 AM
tools/llvm-objdump/llvm-objdump.cpp
1127

IIRC the Expected wrappers are not as intrusive as the Error wrappers. So, it may end up being reasonable still. I would agree that this would go beyond what is needed here, and is possibly a good follow up cleanup.

However, would you mind using a typedef (or the C++11 equivalent with using) for the tuple type? Even better would be to have a private enumeration so we can semi-symbolically expand the tuple.

1215

And that just leaves jezelle ($z). (No, Im not recommending that you add that, LLVM can't even generate that I believe, just excited for the additional coverage here.)

1251

Really? clang-format did this?

1340

I wonder if we can use llvm::ulittle32_t to translate the value. It would require a specific cast going through format, but would be easier to understand.

1353

Similar via llvm::ulittle16_t.

1390

Cant this if be hoisted outside of the for loop? Index should be Start at that point, and we can do this unconditionally as we know that NumBytes is zero prior to the for loop.

1396

Id probably collapse this into a ternary operation.

1411

reinterpret_cast<char *> please.

khemant added inline comments.Aug 23 2016, 1:48 PM
tools/llvm-objdump/llvm-objdump.cpp
1127

Looks like this is a moot point now. Please check this commit: https://reviews.llvm.org/rL278921

1340

Good point. Will do this.

khemant added inline comments.Aug 23 2016, 2:47 PM
tools/llvm-objdump/llvm-objdump.cpp
1390

We cannot hoist this outside, because the byte stream is broken into 8 character byte sequences and then printer with printable characters. Then the carriage must return and then show the next offset and then start. Example:

0:        00 7e 00 00 01 40 00 00         .~...@..
 8:        89 42 00 00 cd 45 00 00         .B...E..
10:        89 42 00 00 89 42 00 00         .B...B..
18:        89 42 00 00 00 00 00 00         .B......

Hoisting it outside will only print first address, it will make finding a byte say 0x12 difficult.

khemant updated this revision to Diff 69043.Aug 23 2016, 3:14 PM

Updated diff with latest tip, made few changes as recommended.

compnerd accepted this revision.Aug 23 2016, 4:07 PM
compnerd edited edge metadata.

LGTM with at least the cast on the use of format to avoid cross-endian issues.

tools/llvm-objdump/llvm-objdump.cpp
1315

Is there a real reason to limit this to ARM ELF? Shouldn't all ELF have this behavior?

1332

auto would be nicer since the cast spells out the type.

1342

Add an explicit cast to unsigned before passing this through format, otherwise you will be breaking cross-endianness.

1349

auto would be nicer since the cast spells out the large type.

This revision is now accepted and ready to land.Aug 23 2016, 4:07 PM
khemant added inline comments.Aug 24 2016, 10:21 AM
tools/llvm-objdump/llvm-objdump.cpp
1315

ARM is the only one that has these mapping symbols that mix inside the function (from beginning of a@function type and within the size of that symbol). All other architectures there needs to be an explicit data symbol (local or global) in which case the disassembler uses the second kind - the general way of interpreting that bytes are data.

1332

Yep.

1342

Data is a unsigned int (uint16_t). I think that is enough for format.

1349

Got it.

This revision was automatically updated to reflect the committed changes.