This is an archive of the discontinued LLVM Phabricator instance.

[llvm-objcopy] Add support for Intel HEX output format

Authored by evgeny777 on Apr 4 2019, 8:37 AM.

Diff Detail


Event Timeline

evgeny777 created this revision.Apr 4 2019, 8:37 AM

Thanks for sending this! I assigned PR39841 to you since this should fix that bug.

I ran this on some internal ihex files we have, and it seems to be dropping one of the sections, so I'll have to poke around to see what the bug is. But some ihex support is better than none, so it's not necessarily blocking :)

2–4 ↗(On Diff #193725)

The only ihex reading test (besides the error cases) is just consuming ihex that llvm-objcopy produces with -O ihex. Could you add a .hex test file for a more stable test? And then it would be easier to verify, e.g., llvm-objcopy -I ihex -O binary <test> is identical to objcopy -I ihex -O binary <test>

638 ↗(On Diff #193725)

A more succinct way (and same in executeObjcopyOnRawBinary):

const ElfType OutputElfType = getOutputElfType(
159–160 ↗(On Diff #193725)

Fail is unused in release builds, so you need to add a (void)Fail; to silence the error/warning in release builds.

1050 ↗(On Diff #193725)

RecAddr should be defined in the loop, where it is used

1450–1460 ↗(On Diff #193725)

WDYT about just using llvm::Regex here instead of this method? It may be easier to read code if it just attempts to match ":[0-9A-F]+". It would produce less precise error messages, though.

1451 ↗(On Diff #193725)

I think this will crash (or be UB) on an empty line?

1469–1470 ↗(On Diff #193725)

I think there should be validation (somewhere) that there are no more records after this

1510 ↗(On Diff #193725)

as a tiny optimization, call Records.reserve(Lines.size()) once you know how many lines there are.

1525–1532 ↗(On Diff #193725)

Once we've validated it, can we convert the whole hex string to separate ArrayRef<uint8/16_t> fields for each record, so we don't have to worry about it being valid everywhere (i.e. using checkedGetHex)?

1534–1537 ↗(On Diff #193725)

How about creating a static method to convert a line into an Expected<IHexRecord>, so we can return an error if it's invalid instead of making the user call getChecksum/checkRecord?

195–198 ↗(On Diff #193725)

Why not just uint16_t fields?

evgeny777 marked 4 inline comments as done.Apr 5 2019, 10:36 AM
evgeny777 added inline comments.
1450–1460 ↗(On Diff #193725)

I think that precise error message is more important. It might be hard in some cases to identify wrong character, e.g: I instead of 1, O instead of 0, russian A instead of english A and so on.

1451 ↗(On Diff #193725)

Line is checked for minimal valid length earlier in the code. Though, it makes sense to assert here on !Line.empty()

1469–1470 ↗(On Diff #193725)

I think that EndOfFile record should unconditionally cancel further processing. . This allows moving EOF record within a file to temporarily prevent part of records from loading. This can be useful for testing. Also it seems GNU objcopy behaves this way.

1525–1532 ↗(On Diff #193725)

It's possible, but I don't see straight way to do this w/o dynamic memory allocation. As we're checking string with checkChars we shouldn't really step on conversion error, unless something really weird happens.

evgeny777 updated this revision to Diff 193910.Apr 5 2019, 10:39 AM

Line parser moved to IHexRecord::parse
Better test case for reader
Addressed some of review comments


FYI, most of the other llvm-objcopy developers are still away following Euro LLVM, so it may not be until next week that you get an comments from @rupprecht. I don't currently have time to read up on ihex, I'm afraid, but if you don't get any feedback next week, I'll try to look into it if I have time.

FYI, most of the other llvm-objcopy developers are still away following Euro LLVM

Ah, I see. Ok, there is no rush.

FYI, most of the other llvm-objcopy developers are still away following Euro LLVM

Ah, I see. Ok, there is no rush.

Sorry, I should have mentioned earlier that I was going to be busy last week. (In advance: I'm here this week, but I'll be out next week).
Hopefully I'll get to this one today, or if not, then tomorrow.

btw, some people at euro llvm also requested srec supprt, which seems extremely similar to ihex -- so it might be good to think about how generic this handling can be, e.g. maybe most of it should just be a "record" parser which is shared with ihex and srec. I don't think premature specialization should be done to make it more general than it should be, but just don't do anything that would be hostile towards refactoring it :)

btw, some people at euro llvm also requested srec supprt, which seems extremely similar to ihex

For me it doesn't look extremely similar to IHEX, except both formats use hexadecimal byte representation.
There are no such things as segment and extended addresses in SREC and even checksum calculation is different.

I think if we implement SREC then part of section builder functionality from IHexELFBuilder::addDataSections can be moved to a common base class,
also it seems SREC would have similar record structure (Type, Address, Data).

Still I expect writer and parser to be completely separate.

There's a lot of code to review here. I'll keep reviewing it everyday but this is going to take a while to review. Any help on splitting this up and making into smaller chunks would be helpful. Splitting reading and writing up into two separate patches would be helpful and removing features that we can add later would be helpful.

1180 ↗(On Diff #193910)

size_t here and below is kind of confusing, can we use uint32_t?

157 ↗(On Diff #193910)

Unless an check that generates an error always proceeds this I think its best to return an error in this case, not assert fail. It would be better to roll this into an Expected function in that case I think anyway.

197 ↗(On Diff #193910)

Maybe a raw_ostream would be useful here. We've generally avoided them but this format seems to lend itself to streams where as my opinion was the opposite before. You wouldn't need utohexstr since those formatting options are already supplied by the library I believe.

209 ↗(On Diff #193910)

This is a very generic name with no comment. In general your comments have been awesome. I'd like to have an idea what this function does without reading the contents.

298 ↗(On Diff #193910)

Does this ever make sense if there is no segment?

310 ↗(On Diff #193910)

Masking that like this seems redundant, in general the number of places we're converting from 64 to 32 in an unchecked way is really shocking. I'd feel a lot more comfortable if we encapsulated these checks more and made them more clear.

313 ↗(On Diff #193910)

Maybe we could split support for extended records out into a sperate patch and error out here for now?

266 ↗(On Diff #193910)

What's the point in splitting this into two classes?

Also does inheriting from BinarySectionWriter make sense? The same visitors will need to be implemented I would suppose the offsets and everything would be very different.

296 ↗(On Diff #193910)

In general I think it might be worth considering weather there is a need to use sections at all. Originally with the binary writer we only used program headers. It turns out that people did a lot of stuff in a really odd way with GNU objcopy when using -O binary that required that we use sections as the primary basis for output. I would imagine that ihex users would not be doing the same sorts of odd tricks and that you could write the output strait from program headers. This would simplify the implementation greatly I think and harden the implementation against all sorts of odd corner cases.

498–507 ↗(On Diff #193910)

Why do we need this to output a new format?

932 ↗(On Diff #193910)

We can probably split this into two changes to make things smaller, one for reading, and one for writing yeah?

evgeny777 marked 9 inline comments as done.Apr 24 2019, 10:40 AM

There's a lot of code to review here.

I've responded to some of the comments, meanwhile I'm splitting patch into writer (will go first) and reader (will go next). Will update the review soon

157 ↗(On Diff #193910)

This function can't actually return error, because string has been previously validated (see checkChars for example). IMO, it's bad practice to implement runtime checks for one's own logical errors.

197 ↗(On Diff #193910)

This function was optimized to not using any dynamic allocation (IHexLineData is actually SmallVector), because each line contains only 16 bytes of section data, so it's possible to have really huge number of lines. What are the benefits of using raw_ostream?

298 ↗(On Diff #193910)

It's a helper function which returns section VA if there is no segment. Any suggestion for better name?

310 ↗(On Diff #193910)

There is a checkSections method which check all sections to detect if any of them has 64-bit address. Bear in mind that implementation also supports sign extended 32-bit addresses, i.e 0xFFFFFFFF80000000 is a valid address, but 0x100000000 is not

313 ↗(On Diff #193910)

I suggest splitting reader and writer. To me it looks like a more logical split compared to removal of certain record types.

266 ↗(On Diff #193910)

I inherited IHexSectionWriter from BinarySectionWriter in order to reuse visitors for RelocationSection, GnuDebugLinkSection, e.t.c which will never go to IHEX nor to binary output. Are you suggesting duplication?

296 ↗(On Diff #193910)

AFAIU one can't do this with IHEX, because unlike binary IHEX is not a contiguous blob, e.g you can have a gap between sections which won't go to output file.

498–507 ↗(On Diff #193910)

This function takes a portion of hexadecimal data from IHexRecord and appends it in binary form to internal vector of OwnedDataSection

932 ↗(On Diff #193910)

I think so.

Splitted IHEX patch into reader and writer. Diff now contains the "writer" part.

Any comments on this?

Looked mostly at the test for now, going to take a pass over the code today.

20 ↗(On Diff #197265)

I think this should be .data1?

Command Output (stderr):
error: Unknown section referenced: '.data1' by program header.

(I think this is a recent validation added by rL359663)

1 ↗(On Diff #197265)

All these hex outputs have diffs when compared to what GNU objcopy produces... is that expected? I haven't yet debugged exactly why.

3 ↗(On Diff #197265)

cat X | FileCheck should be replaced with FileCheck --input-file=X everywhere

20 ↗(On Diff #197265)

When I run GNU objcopy on this test case, I get an error: address 0xffffffff80001000 out of range for Intel Hex file. Maybe we shouldn't be supporting it? Are we able to handle it correctly somehow even though GNU objcopy can't?

155 ↗(On Diff #197265)

Isn't relying on Addr + 0x80000000 to loop around UB? Could this just directly check Addr & 0xffffffff80000000 == 0xffffffff80000000 instead?

Some insight into the differences:

10 ↗(On Diff #197265)

It looks like the addresses in this file don't match up, but I don't have a specific suggestion yet

60 ↗(On Diff #197265)
205 ↗(On Diff #197265)

It looks like ihex uses \r\n line endings 😦;a=blob;f=bfd/ihex.c;h=101e0a76155fc48f95312c08307739cf9c1ee5eb;hb=HEAD#l752

It seems weird for me to request this, but I think we should write \r\n, as this seems like a strange detail that people might need when consuming these files. I don't actually have any examples of this, however.

MaskRay added inline comments.May 22 2019, 6:37 PM
9 ↗(On Diff #197265)

The two RUN lines can be written as:
llvm-objcopy -O ihex %t-segs - | FileCheck --check-prefix=SEGMENTS %s if the output %t2-segs.hex isn't used elsewhere.

evgeny777 updated this revision to Diff 201192.May 24 2019, 4:18 AM
evgeny777 marked 7 inline comments as done.
  • Fixed error in section name in one of the tests
  • Removed cat | FileCheck from test case
  • Zero start address is not longer written to IHEX
  • Switched to windows line endings.
1 ↗(On Diff #197265)

After I stopped emitting '03' record for zero start address output from ihex-elf-sections.yaml is identical to one of GNU objcopy.
However if input ELF file contains segments situation is different - GNU objcopy seems to ignore segments completely and always uses section virtual address. This doesn't seem logical to me and also doesn't look consistent with the way we're currently generating binary output in llvm-objcopy.

20 ↗(On Diff #197265)

Probably the problem is in the version of objcopy you're using. On my machine 2.30 fails, but works fine

155 ↗(On Diff #197265)

As far as I know unsigned overflows (unlike signed) are not UB. Addr is uint64_t.

205 ↗(On Diff #197265)

Yes, I've seen this also. Nothing is said in IHEX spec about the line endings. Wikipedia tells that:

Programs that create HEX records typically use line termination characters that conform to the conventions of their operating systems

Probably the easiest thing to do is to stick to GNU behavior. I'll update the patch

seiya added a subscriber: seiya.May 27 2019, 1:42 AM
rupprecht accepted this revision.May 28 2019, 2:19 PM
rupprecht marked an inline comment as done.

I ran a few internal tests and this produces identical ihex output for every file I checked! \ o /

20 ↗(On Diff #197265)

Yep, that was it, I'm no longer seeing it with GNU objcopy from trunk.

This revision is now accepted and ready to land.May 28 2019, 2:19 PM
evgeny777 retitled this revision from [llvm-objcopy] Add support for Intel HEX input/output format to [llvm-objcopy] Add support for Intel HEX output format.May 29 2019, 3:11 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptMay 29 2019, 4:37 AM
Herald added a subscriber: kristina. · View Herald Transcript