This is an archive of the discontinued LLVM Phabricator instance.

[llvm-objcopy] Introduce 'ihex-flat' output format.
Needs Review · Public

Authored by simon_tatham on Aug 24 2022, 2:45 AM.

Details

Summary

Currently, if you use llvm-objcopy to translate an ELF image into
'ihex' format, the input ELF file's entry point address will be
converted into x86-16 segment:offset style if it's less than 1MB, and
the same happens to addresses of data records. The separate record
types for a flat 32-bit address space are not used unless an address
is too big for the segment:offset style.
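
To make the policy above concrete, here is a rough Python sketch of how a producer might pick record types. It is only an illustration of the behavior as described in this summary, not the llvm-objcopy code: the function names (ihex_record, start_record, address_prefix_record) and the flat parameter are hypothetical. The record type numbers are the standard Intel HEX ones (02/03 for segment:offset, 04/05 for flat 32-bit).

```python
def ihex_record(rec_type, offset, data):
    """Format one Intel HEX record: ':' count, offset, type, data, checksum."""
    body = bytes([len(data), (offset >> 8) & 0xFF, offset & 0xFF, rec_type]) + data
    checksum = (-sum(body)) & 0xFF  # two's complement of the byte sum
    return ":" + (body + bytes([checksum])).hex().upper()


def start_record(entry, flat):
    """Entry point record. 'flat' is a hypothetical stand-in for ihex-flat."""
    if not flat and entry < 0x100000:
        # Assumed fixed policy: segment is a multiple of 0x1000,
        # offset holds the low 16 bits of the linear address.
        cs = (entry >> 4) & 0xF000
        ip = entry & 0xFFFF
        return ihex_record(0x03, 0, cs.to_bytes(2, "big") + ip.to_bytes(2, "big"))
    return ihex_record(0x05, 0, entry.to_bytes(4, "big"))


def address_prefix_record(addr, flat):
    """Record that sets the base address for the data records that follow."""
    if not flat and addr < 0x100000:
        segment = (addr >> 4) & 0xF000
        return ihex_record(0x02, 0, segment.to_bytes(2, "big"))
    return ihex_record(0x04, 0, (addr >> 16).to_bytes(2, "big"))
```

Under these assumptions, start_record(0x12345, flat=False) gives ":040000031000234581" (type 03, CS:IP = 1000:2345), while start_record(0x12345, flat=True) gives ":04000005000123458E" (type 05, the plain 32-bit address).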

This is awkward for users consuming the file, who may find they need
to understand both formats of start address and both formats of data
address record. And it doesn't have any relevance to any of LLVM's
target architectures, since we support x86-32 but not x86-16. So users
wanting to write the simplest possible ihex consumer would prefer the
producer to emit the 32-bit record types unconditionally.
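
As an illustration of what "the simplest possible ihex consumer" could look like, here is a minimal Python sketch that accepts only the flat record types (00 data, 01 end of file, 04 extended linear address, 05 start linear address) and rejects everything else. The function name parse_flat_ihex is hypothetical, and the sketch assumes the producer never emits types 02 or 03.

```python
def parse_flat_ihex(text):
    """Parse Intel HEX that uses only the flat 32-bit record types.

    Returns (memory, entry): a dict mapping absolute address to byte value,
    and the entry point (None if no type 05 record was seen).
    """
    memory = {}
    upper = 0      # upper 16 bits of the address, set by type 04 records
    entry = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if not line.startswith(":"):
            raise ValueError("record does not start with ':'")
        raw = bytes.fromhex(line[1:])
        # Summing every byte, checksum included, must give 0 modulo 256.
        if len(raw) < 5 or sum(raw) & 0xFF != 0:
            raise ValueError("bad record length or checksum")
        count = raw[0]
        offset = int.from_bytes(raw[1:3], "big")
        rec_type = raw[3]
        data = raw[4:-1]
        if len(data) != count:
            raise ValueError("byte count mismatch")
        if rec_type == 0x00:      # data
            for i, byte in enumerate(data):
                memory[(upper + offset + i) & 0xFFFFFFFF] = byte
        elif rec_type == 0x01:    # end of file
            break
        elif rec_type == 0x04:    # extended linear address
            upper = int.from_bytes(data, "big") << 16
        elif rec_type == 0x05:    # start linear address
            entry = int.from_bytes(data, "big")
        else:
            # Types 02/03 (segment:offset) are deliberately not handled.
            raise ValueError("unsupported record type %02X" % rec_type)
    return memory, entry
```

A consumer that also had to cope with the existing "ihex" output would need two further branches, one for type 02 and one for type 03, plus the segment arithmetic that goes with them.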

I haven't changed the existing format in this commit, on the
assumption that exactly matching the behavior of GNU objcopy is a
useful property. Instead, I've added a new output-only format name
"ihex-flat" alongside the existing "ihex", and made the format changes
conditional on that.

Event Timeline

simon_tatham created this revision. Aug 24 2022, 2:45 AM
simon_tatham requested review of this revision. Aug 24 2022, 2:45 AM
Herald added a project: Restricted Project. Aug 24 2022, 2:45 AM

Supplementary discussion:

This is a conservative patch which changes no existing behavior. But I'd be happy to go further if people want to, by making ihex-flat the default, and relegating the old behavior to a different name, or perhaps even removing it entirely.

Rationale: this segment:offset representation of addresses was only ever important to 16-bit x86 as far as I know, and in all other situations, its sole effect is to complicate the file format unnecessarily for everyone else. And since LLVM doesn't even support x86-16 as a target architecture, 'everyone else' is quite likely to be all users!

Also, even if someone is targeting x86-16, I can't see how a hex file written in this way would be useful. Surely you would need the ability to control the precise CS:IP representation of the entry point address, so you could set it to match the expectations of the code that will be executing there? And that will not in all cases match the fixed policy here of making the segment address a multiple of 0x1000 and putting all 16 low bits of the linear address into the offset.
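
A small Python sketch of the aliasing issue, assuming the fixed policy is exactly as described above (segment a multiple of 0x1000, all 16 low bits in the offset). The function names are hypothetical and this is not the llvm-objcopy code.

```python
def fixed_policy_split(linear):
    """Split a sub-1MB linear address the way the fixed policy would."""
    if not 0 <= linear < 0x100000:
        raise ValueError("address does not fit in segment:offset form")
    segment = (linear >> 4) & 0xF000   # always a multiple of 0x1000
    offset = linear & 0xFFFF           # all 16 low bits of the address
    return segment, offset


def real_mode_linear(segment, offset):
    """Real-mode address arithmetic: segment * 16 + offset."""
    return (segment << 4) + offset
```

For linear address 0x12345 the fixed policy yields 1000:2345, but 1234:0005 names the same byte, since real_mode_linear(0x1234, 0x0005) is also 0x12345. Code that expects to start with CS = 0x1234 cannot get that from the hex file under this policy.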

But the current code was put in on purpose, and as far as I can tell from the comments in D60270, the purpose was to match GNU objcopy, so for the moment I'm presuming that's useful in its own right.

I'll look at this in the coming days, but I'd appreciate @evgeny777's comments: they originally implemented this, so they might have some specific rationale beyond matching GNU. One thought I did have, though, is that although LLVM as a whole may not target x86-16, there's no particular reason why its binary manipulation tools like llvm-objcopy shouldn't work with x86-16 binaries, if there's a use case.

Absolutely, I agree. Off the top of my head, the most obvious continuing use case for x86-16 is first-stage bootloaders that the PC BIOS runs in real mode. I've no idea what tools people typically use for those these days, but I wouldn't have a hard time at all believing that it might turn out to be a hodgepodge of bits and pieces from all over the place.

And I suppose that in that use case, the problem I mentioned with the entry point representation is moot anyway, because you don't get a choice about the entry point of an MBR boot sector – it's fixed at 0000:7C00 (or maybe 07C0:0000, I shamefully forget which). So you'd never need to retrieve it from the ihex file to pass on to something else.

llvm-objcopy CommandGuide docs (llvm/docs/CommandGuide) will need updating.

@MaskRay, any thoughts on this? I think it's a reasonable change, but am unsure whether it should be the "default" ihex output, or under the new format name. If the former, I'd be tempted to keep the old format around for the clients who need it, under a different name. The change should then definitely be mentioned in the release notes too.

llvm/lib/ObjCopy/ELF/ELFObject.h, line 275

I suggest MayUseSegmentOffset since, if I understand it correctly, you may need to use the other format even if this is true.

llvm/tools/llvm-objcopy/ObjcopyOptions.cpp, line 644

To keep the code simpler, I think we should omit this ihex-flat as an input format: as far as I'm aware, there's no need to support it. Users can just use ihex for the input option. In my opinion, the symmetry isn't important: note that for binary format the input and output formats are essentially unrelated, so the "symmetry" there is, if anything, confusing, but a necessary evil of being compatible with GNU.

simon_tatham edited the summary of this revision.

Sorry to have been so long getting back to this! It was relegated to my back burner for a while, but I've now found time to address the review comments so far.

simon_tatham marked 2 inline comments as done. Nov 17 2022, 5:33 AM

Sorry for not getting to this yet: it is low priority for me, and I have a pile of other stuff I need to finish in the next couple of weeks before taking a six-week vacation, so I doubt I'll get around to this patch until sometime in late January or February. Hopefully somebody else will be able to take over reviewing.