This is an archive of the discontinued LLVM Phabricator instance.

[MC,llvm-objdump,ARM] Target-dependent disassembly resync policy.
ClosedPublic

Authored by simon_tatham on Jul 22 2022, 7:17 AM.

Details

Summary

Currently, when llvm-objdump is disassembling a code section and
encounters a point where no instruction can be decoded, it uses the
same policy on all targets: consume one byte of the section, emit it
as "<unknown>", and try disassembling from the next byte position.

On an architecture where instructions are always 4 bytes long and
4-byte aligned, this makes no sense at all. If a 4-byte word cannot be
decoded as an instruction, then the next place that a valid
instruction could possibly be found is 4 bytes further on.
Disassembling from a misaligned address can't possibly produce
anything that the code generator intended, or that the CPU would even
attempt to execute.

This patch introduces a new MCDisassembler virtual method called
suggestBytesToSkip, which allows each target to choose its own
resynchronization policy. For Arm (as opposed to Thumb) and AArch64,
I've filled in the new method to return a fixed width of 4.

Thumb is a more interesting case, because the criterion for
identifying 2-byte and 4-byte instruction encodings is very simple,
and doesn't require the particular instruction to be recognized. So
suggestBytesToSkip is also passed an ArrayRef of the bytes in
question, so that it can take that into account. The new test case
shows Thumb disassembly skipping over two unrecognized instructions,
and identifying one as 2-byte and one as 4-byte.

For targets other than Arm and AArch64, this is NFC: the base class
implementation of suggestBytesToSkip still returns 1, so that the
existing behavior is unchanged. Other targets can fill in their own
implementations as they see fit; I haven't attempted to choose a new
behavior for each one myself.

I've updated all the call sites of MCDisassembler::getInstruction in
llvm-objdump, and also one in sancov, which was the only other place I
spotted the same idiom of if (Size == 0) Size = 1 after a call to
getInstruction.

Diff Detail

Event Timeline

simon_tatham created this revision.Jul 22 2022, 7:17 AM
simon_tatham requested review of this revision.Jul 22 2022, 7:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 22 2022, 7:17 AM
DavidSpickett added inline comments.Jul 25 2022, 3:27 AM
llvm/include/llvm/MC/MCDisassembler/MCDisassembler.h
181

"memory space of the region" perhaps?

182

What is "the symbol" here? Maybe from context it will be a symbol but it seems to me like you'd be in the middle of some function maybe. So symbol is "main" but I'm N bytes away from where it starts by this point.

Do I have the wrong end of the stick?

183

I'd explain why the parameter exists. You'll probably end up repeating the thumb justification but hey why not.

Also is the size of this known to be in some range or does the target also decide that. E.g. if thumb needed 4 bytes of context to decide, the Arm backend would pass this 4 bytes. Some other target could choose less or more or none.

186

const ArrayRef<uint8_t>?

llvm/lib/Target/AArch64/Disassembler/AArch64Disassembler.cpp
353

We're assuming that Address is already aligned, I assume that's safe given that we're either:

  • disassembling from the start of a fn, which would be aligned
  • have just disassembled an instruction, which would have been aligned, and 4 bytes in size.

Correct?

I guess someone could give MC a misaligned function start but garbage in garbage out in that case? Or would you want to align up to the nearest 4 bytes.

llvm/lib/Target/ARM/Disassembler/ARMDisassembler.cpp
758

Is the in memory layout of thumb instructions the same between endians?

771

I might have written this early return style, but that's my lldb bias talking.

if (!is_thumb)
   return 4;
if (bytes.size < 2)
   return 2;
Insn16 ....
return Insn16 < 0xE800 ? 2 : 4;

The logic is clear either way but if you want to have one less indent.

llvm/test/tools/llvm-objdump/ELF/ARM/unknown-instr-resync.test
2

I think this one is the thumb side of the testing, can you add that in a comment in the file if so. Judging by the skip 4 then skip 2 later.

(if this is the thumb test do you need another one for Arm only?)

13

Is it worth adding a single byte on the end to cover the thumb path if bytes < 2? Proves you don't out of bounds the bytes you're given, at least in an asserts build (I hope).

simon_tatham marked 2 inline comments as done.

Addressed review comments (I hope).

simon_tatham added inline comments.Jul 25 2022, 6:59 AM
llvm/include/llvm/MC/MCDisassembler/MCDisassembler.h
181

Oh yes, I copy-pasted the parameter comments from one of the existing API functions just above it without noticing that they didn't all say quite the same things or make sense in all cases.

182

Another copy-paste goof.

183

Where I've called it, I just passed all the data available. If that's not enough to make the best choice, then the caller can't magic more data out of the air anyway, so the callee will just have to do the best it can (which in Thumb I decided was advancing to the next multiple of 2 bytes – still better than the previous 1!)

186

No, because firstly, that would only make the tiny ArrayRef structure itself const and not the pointed-to data, and secondly, ArrayRef is implicitly a pointer-to-const anyway.

(This too is copied from the previous API functions, so if it had been a bug, they'd have this bug as well.)

llvm/lib/Target/AArch64/Disassembler/AArch64Disassembler.cpp
353

Mmmm. I did wonder about specifying suggestBytesToSkip so that you could give it a totally arbitrary alignment and it would do something sensible. But I didn't like the idea that every single implementation (other than the default 'skip 1 byte because anything's a valid start position') would end up having to contain very similar boilerplate. That struck me as a sign of having put the API boundary in the wrong place.

In this patch there are only two implementations of suggestBytesToSkip (or two and a half if you count the Arm/Thumb branches of the AArch32 one). But I expect that most non-x86 targets will end up wanting to do something sensible in here – surely a lot of targets are fixed-alignment RISC style, and even m68k has an alignment constraint if I remember my Amiga-owning days correctly. So there'd end up being a lot of copies of that code!

llvm/lib/Target/ARM/Disassembler/ARMDisassembler.cpp
758

In the modern (BE8-style) Arm architecture, yes: instructions are stored little-endian regardless of the data endianness. In A32 that means little-endian 32-bit words, and in T32 it means little-endian 16-bit halfwords.

In older versions of the architecture that was different, if I remember rightly. But the rest of LLVM doesn't support those versions either, because this code is taken directly from the code in the same file that extracts the halfword for actual disassembly.

llvm/test/tools/llvm-objdump/ELF/ARM/unknown-instr-resync.test
2

It's both – the first three instructions are Arm, including an unrecognised Arm instruction, and the next six are Thumb, including a 16- and a 32-bit unrecognised Thumb instruction.

simon_tatham added inline comments.Jul 25 2022, 7:29 AM
llvm/lib/Target/AArch64/Disassembler/AArch64Disassembler.cpp
353

Thinking about this a bit more ... another problem with trying to use suggestBytesToSkip to handle existing alignment problems is that it's called in the wrong place in the disassembly loop. Currently, the expected usage is that you have some starting address, and you call getInstruction to see if you can disassemble an instruction starting there. If you can, you advance by its width; otherwise, you call this new suggestBytesToSkip function and advance by that many instead.

But if the user tries to start disassembly at an address that's invalid due to misalignment, then suggestBytesToSkip can't rescue them anyway, because by the time it's first called, it's too late to prevent an initial nonsense instruction from having been decoded at the misaligned starting location. So then you'd get a resynchronization between that first instruction and the next one, which seems even more nonsensical to me!

So I think it is right that suggestBytesToSkip restricts itself to not creating new alignment problems, and doesn't check whether there's an existing one. The latter (if we think it needs doing at all) would be the job of some other API function.

DavidSpickett accepted this revision.Jul 25 2022, 7:29 AM

LGTM

llvm/include/llvm/MC/MCDisassembler/MCDisassembler.h
186

Right, I was reading the header as if the non const methods modified the array it's referencing, not the extent of the reference.

This revision is now accepted and ready to land.Jul 25 2022, 7:29 AM
DavidSpickett added inline comments.Jul 25 2022, 7:41 AM
llvm/lib/Target/AArch64/Disassembler/AArch64Disassembler.cpp
353

Makes sense.

If your starting point was misaligned to begin with it's either a mistake of some other tool, or you're doing something unusual where you expect to have to handle things yourself (some kind of "is this random data possibly code" tool for example).

simon_tatham added inline comments.Jul 25 2022, 8:44 AM
llvm/lib/Target/ARM/Disassembler/ARMDisassembler.cpp
758

Hmmm, actually, now I go back and check more carefully ...

I'd forgotten the wrinkle that in the AArch32 ABI, ELF objects are supposed to have their instructions stored in endianness matching the ELF header. But ELF images have them stored in BE8, and the linker is supposed to byte-swap the right parts of the code sections based on the mapping symbols. So actually, you're right: the places in this patch and in D130358 where I've added always-little-endian accesses into the code section won't work everywhere.

On the other hand, (a) LLD doesn't support that byte-reversal, and (b) the MC disassembler also reads instructions as little-endian unconditionally, so even before this patch, llvm-objdump would mis-disassemble an object file of that kind.

So I'm not introducing any more big-endian incompatibility than was already there. And if we make major changes to fix that at a later date, then I think these extra little-endian accesses won't be forgotten about, because the llvm-objdump tests touched by these patches will fail and remind whoever is doing it!

This revision was landed with ongoing or failed builds.Jul 26 2022, 1:35 AM
This revision was automatically updated to reflect the committed changes.
scott.linder added inline comments.
llvm/tools/llvm-objdump/llvm-objdump.cpp
1073–1074

@simon_tatham was this change intended? There is a fix at https://reviews.llvm.org/D135430 and I wanted to ping you in case we missed something.

simon_tatham added inline comments.Oct 12 2022, 1:21 AM
llvm/tools/llvm-objdump/llvm-objdump.cpp
1073–1074

I see what you mean – it surely can't be sensible to call Bytes.slice(Index - SectionAddr) when Index is iterating from 0 up to Bytes.size().

This code looks suspiciously like the code in the next hunk up, around line 1027 in collectLocalBranchTargets. It's correct there, because the bounds on Index are set differently. So I suspect I pasted the same code in both places without noticing the difference. Sorry!