This is an archive of the discontinued LLVM Phabricator instance.

[obj2yaml] Save offset for segments and size for PHDR
Needs ReviewPublic

Authored by treapster on Feb 14 2023, 6:33 AM.

Details

Summary

With current implementation obj2yaml produces files which on conversion back to obj have invalid address-offset value for the first segment and have half the sections not belonging to any segment(especially after some sections are stripped from YAML). This quick fix improves situation a lot.

Diff Detail

Unit TestsFailed

Event Timeline

treapster created this revision.Feb 14 2023, 6:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 14 2023, 6:33 AM
treapster requested review of this revision.Feb 14 2023, 6:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 14 2023, 6:33 AM
treapster edited the summary of this revision. (Show Details)Feb 14 2023, 6:40 AM
treapster updated this revision to Diff 497341.Feb 14 2023, 8:30 AM

Only save offset for the first segment for now to fix tests. But changing tests and saving all offsets is probably better?

Only save offset for the first segment for now to fix tests. But changing tests and saving all offsets is probably better?

Just show what the complete fix is, then update all the tests (usually for obj2yaml there are not many to update).

If you feel that the change may need several iterations and don't want to modify tests back and forth, you can say that.

treapster updated this revision to Diff 497607.EditedFeb 15 2023, 2:47 AM

Change tests. However, i have no idea why asan-related tests failed on timeout, will investigate if it happens again.

Could you clarify why you only emit the sizes for the PHDR type? It's perfectly legitimate for other segment types to have a size that covers other areas too that aren't parts of sections (and therefore the size can't be derived purely from the section size), but IIRC, yaml2obj will assume the sizes match unless told otherwise. I suppose it might be necessary to emit Fill "sections" in the YAML to cover the gap in some cases to ensure section layout works correctly.

llvm/test/tools/obj2yaml/ELF/program-headers.yaml
52

Nit: here and below, all the other field values within a segment are aligned vertically with each other, but in most cases, the new Offset field isn't. Please fix.

llvm/tools/obj2yaml/elf2yaml.cpp
488–493

I think we should be emitting these values in the output only if they can't be derived automatically from the existing layout, otherwise the YAML output will end up unnecessarily verbose. It's been a while since I've looked at the obj2yaml code though, so it's possible this happens elsewhere.

Could you clarify why you only emit the sizes for the PHDR type? It's perfectly legitimate for other segment types to have a size that covers other areas too that aren't parts of sections (and therefore the size can't be derived purely from the section size), but IIRC, yaml2obj will assume the sizes match unless told otherwise. I suppose it might be necessary to emit Fill "sections" in the YAML to cover the gap in some cases to ensure section layout works correctly.

I don't know of any practical examples of segments aside from PHDR and first load which cover out-of-section space, and with these two cases, the size will be properly adjusted by yaml2obj to cover everything from the provided offset to the last included section, or we'll provide it for PHDR in absence of sections.

treapster added inline comments.Feb 16 2023, 12:44 AM
llvm/tools/obj2yaml/elf2yaml.cpp
488–493

I don't think one line per segment is so verbose that we should build separate logic to determine whether to include it. What makes yaml verbose are symbols, relocations, dynamic, note sections and etc. TBH i think we should also include section offsets unconditionally because people strip sections all the time which breaks all this smart offset derivation and rebuilt binaries have absolutely nonsensical offset/address values, which makes it difficult to make some tools(e.g BOLT) more strict to the input binaries, since all the generated tests will fail.
But saving section offsets in all cases requires changing about 40 tests(that much failed when i just started always including section offsets), so unfortunately i won't do it anytime soon.

Align offset fields in tests

treapster marked an inline comment as done.Feb 16 2023, 12:56 AM

Seems like some tests randomly don't finish in one hour and thus fail. Will restart pipeline.

Could you clarify why you only emit the sizes for the PHDR type? It's perfectly legitimate for other segment types to have a size that covers other areas too that aren't parts of sections (and therefore the size can't be derived purely from the section size), but IIRC, yaml2obj will assume the sizes match unless told otherwise. I suppose it might be necessary to emit Fill "sections" in the YAML to cover the gap in some cases to ensure section layout works correctly.

I don't know of any practical examples of segments aside from PHDR and first load which cover out-of-section space, and with these two cases, the size will be properly adjusted by yaml2obj to cover everything from the provided offset to the last included section, or we'll provide it for PHDR in absence of sections.

FWIW, I know of situations within an internal toolchain where this can happen, although we don't use the LLVM tools in that situation, I believe. Also, if explicitly instructed to remove a section that is within a segment, llvm-objcopy will preserve the gap in the segment, to avoid invalidating addresses, so it could arise from that situation. Finally, since yaml2obj supports the concept of "Fill" sections (sections that aren't actually sections in the section header table, i.e. they just cover blocks of empty space), it's possible to create an object with gaps in that way, without explicitly specifying the offset/size of program headers. obj2yaml doesn't currently support emitting Fill sections, but it could.

When we did some work a few years ago to yaml2obj to allow it to create "Fill" sections, we discussed creating Chunks for the ELF Header and Program Header Tables (like we do already with the Sections and SectionHeaderTable), with the idea that you could list them in the FirstSec/LastSec listing of a segment. I feel like this would be a better way of ensuring the size of a segment (especially a PT_PHDR segment) is correct, rather than hard-coding the size. Hard-coding the size means it will fall down if somebody decides to manually add an additional program header to the YAML produced by obj2yaml, for example.

llvm/tools/obj2yaml/elf2yaml.cpp
488–493

I don't think one line per segment is so verbose that we should build separate logic to determine whether to include it.

My reasoning here is that a significant use-case I'm aware of is for people creating test cases that use yaml2obj to generate the test input at runtime. Essentially, they get an object they want to test using the compiler/assembler/etc and then use obj2yaml to generate the YAML for use in the test. If the fields are emitted unnecessarily this causes two problems: 1) the YAML becomes more verbose than necessary, obfuscating what is actually important in the test and 2) it makes it harder to manipulate the YAML to add/remove sections from it (because the offsets etc all become invalid and need updating by hand). Thinking about it, I'm more concerned by the second of these.

people strip sections all the time which breaks all this smart offset derivation and rebuilt binaries have absolutely nonsensical offset/address values

Sorry, I'm not sure what you're saying here? Are you saying people use tools like llvm-strip/llvm-objcopy to manipulate sections? If so, the offsets and addresess of any remaining sections and segments should remain valid afterwards (and if they don't, that's a bug in llvm-strip/llvm-objcopy etc). Alternatively, are you saying people hand-edit the YAML after obj2yaml emitted it? If it's the latter, emitting more offsets can cause other issues.

treapster added inline comments.Feb 17 2023, 3:27 AM
llvm/tools/obj2yaml/elf2yaml.cpp
488–493

Sorry, I'm not sure what you're saying here?

I'm referring to the second case where we remove sections from the generated yaml, and thus other sections move forward and have invalidated addresses. So i think we should preserve as much info in yaml as possible, because it will likely be edited.

If the fields are emitted unnecessarily this causes two problems: 1) the YAML becomes more verbose than necessary, obfuscating what is actually important in the test.

As i said before, the bulk of yaml comes from other things, one line per segment or even per section will not make the file significantly bigger or harder to understand.

  1. it makes it harder to manipulate the YAML to add/remove sections from it (because the offsets etc all become invalid and need updating by hand).

That is exactly what happens without emitting offsets: you remove a section, and the next slides into it's place, and now has invalid base address. Or, for example, you remove a first section in the segment, and the segment offset changes if not emitted, but since the VAddr is the same, we have invalid base address again.

If we emit all segments and sections offsets, we make it easier to remove some sections without invalidating others. As for adding anything, you'll have to think more about a place in a file and in address space, which is a good thing because valid mappings are a good thing. But from what i see we usually remove sections from yaml, not add them, so you won't think about it very often.

But anyway, this diff is about segment offsets, and emitting them alone will help because obj->yaml->obj becomes more accurate.

! In D144009#4131236, @jhenderson wrote:
Hard-coding the size means it will fall down if somebody decides to manually add an additional program header to the YAML produced by obj2yaml, for example.

I think you should not just manually add a segment bacause it will move all the other parts of the file and, again, invalidate all the mappings, possibly causing some sections to come out of address space. But if you remove a segment and have PHDR size hardcoded, you didn't break anything, since the new space will just become NULL segment but all the mappings are still valid.

It seems to me like we have a fundamental difference in how obj2yaml is used (or rather, there are multiple valid use-cases, but they pull in different directions). Whether it's segment offsets or section offsets, the fundamental issue is the same, I believe (just one has a bigger impact than the other). Let me summarize my understanding of the situation and the pros/cons of the approaches:

  1. Use-case 1: As a way of quickly generating YAML blocks for testing purposes. In this case, you don't want more than the minimum needed to represent the object, as anything extra hides what's actually important. Since yaml2obj can happily derive section/segment offsets automatically, it isn't necessary to include the offsets in this case for most cases. Furthermore, it is often not necessary to have rigid offsets/addresses (many tests don't care about these). Typically these test cases will have minimal symbols, relocations, or sections, so the signal to noise impact of unnecessary offsets (for segments or sections) is relatively high.
  2. Use-case 2 (I'm uncertain if this is exactly correct, but hopefully I've got the general point correct): As a way of creating something that is easier to manipulate than a raw binary. Often, this might be a linked executable. Removing sections etc is desirable, but any addresses need to remain correct (due to things like already-applied relocations referencing them). Removing sections will cause segments/sections to potentially move, unless offsets are included explicitly in the YAML, or something else is used to replace the section.

I don't see a clear way forward that meets both sets of requirements. About the best idea I have is to change obj2yaml to emit hard-coded offsets only when needed to enforce the layout (i.e. correct implicit values should be preferred), but then users should not simply remove whole sections, and instead use Fill sections to replace them in the YAML (this is actually fairly easy to do, but users need to be aware of these Fill sections).

@MaskRay, do you have any thoughts?

It seems to me like we have a fundamental difference in how obj2yaml is used (or rather, there are multiple valid use-cases, but they pull in different directions). Whether it's segment offsets or section offsets, the fundamental issue is the same, I believe (just one has a bigger impact than the other). Let me summarize my understanding of the situation and the pros/cons of the approaches:

  1. Use-case 1: As a way of quickly generating YAML blocks for testing purposes. In this case, you don't want more than the minimum needed to represent the object, as anything extra hides what's actually important. Since yaml2obj can happily derive section/segment offsets automatically, it isn't necessary to include the offsets in this case for most cases. Furthermore, it is often not necessary to have rigid offsets/addresses (many tests don't care about these). Typically these test cases will have minimal symbols, relocations, or sections, so the signal to noise impact of unnecessary offsets (for segments or sections) is relatively high.
  2. Use-case 2 (I'm uncertain if this is exactly correct, but hopefully I've got the general point correct): As a way of creating something that is easier to manipulate than a raw binary. Often, this might be a linked executable. Removing sections etc is desirable, but any addresses need to remain correct (due to things like already-applied relocations referencing them). Removing sections will cause segments/sections to potentially move, unless offsets are included explicitly in the YAML, or something else is used to replace the section.

I don't see a clear way forward that meets both sets of requirements. About the best idea I have is to change obj2yaml to emit hard-coded offsets only when needed to enforce the layout (i.e. correct implicit values should be preferred), but then users should not simply remove whole sections, and instead use Fill sections to replace them in the YAML (this is actually fairly easy to do, but users need to be aware of these Fill sections).

You're right, and the second case of generating a somewhat valid executable is currently not handled very well, so adding an option to emit section offsets may be useful. But anyway, the context of this diff is purely segment offsets, and i don't see fundamental contradictions with the first case here.

treapster added a comment.EditedFeb 20 2023, 1:52 AM

I don't see a clear way forward that meets both sets of requirements. About the best idea I have is to change obj2yaml to emit hard-coded offsets only when needed to enforce the layout (i.e. correct implicit values should be preferred), but then users should not simply remove whole sections, and instead use Fill sections to replace them in the YAML

If you have hardcoded sections' offsets and addresses as well as segments', there's no need to use fill sections and you can simply remove what you don't need, so i don't get why use fill in that case.

It seems to me like we have a fundamental difference in how obj2yaml is used (or rather, there are multiple valid use-cases, but they pull in different directions). Whether it's segment offsets or section offsets, the fundamental issue is the same, I believe (just one has a bigger impact than the other). Let me summarize my understanding of the situation and the pros/cons of the approaches:

  1. Use-case 1: As a way of quickly generating YAML blocks for testing purposes. In this case, you don't want more than the minimum needed to represent the object, as anything extra hides what's actually important. Since yaml2obj can happily derive section/segment offsets automatically, it isn't necessary to include the offsets in this case for most cases. Furthermore, it is often not necessary to have rigid offsets/addresses (many tests don't care about these). Typically these test cases will have minimal symbols, relocations, or sections, so the signal to noise impact of unnecessary offsets (for segments or sections) is relatively high.
  2. Use-case 2 (I'm uncertain if this is exactly correct, but hopefully I've got the general point correct): As a way of creating something that is easier to manipulate than a raw binary. Often, this might be a linked executable. Removing sections etc is desirable, but any addresses need to remain correct (due to things like already-applied relocations referencing them). Removing sections will cause segments/sections to potentially move, unless offsets are included explicitly in the YAML, or something else is used to replace the section.

I don't see a clear way forward that meets both sets of requirements. About the best idea I have is to change obj2yaml to emit hard-coded offsets only when needed to enforce the layout (i.e. correct implicit values should be preferred), but then users should not simply remove whole sections, and instead use Fill sections to replace them in the YAML (this is actually fairly easy to do, but users need to be aware of these Fill sections).

You're right, and the second case of generating a somewhat valid executable is currently not handled very well, so adding an option to emit section offsets may be useful. But anyway, the context of this diff is purely segment offsets, and i don't see fundamental contradictions with the first case here.

I think for the Program Headers, one fundamental issue is that there's more than one way to define the positioning of a segment. In my opinion, the "correct" way to place segments in YAML is to use FirstSec/LastSec, since that causes offsets and sizes to be correct automatically. In the event that there's space at the start or end of a segment, it is better to use a Fill section that hard-code the size/offset, (again in my opinion), because then there is only one source of information for the segment layout (i.e. the section list). Adding Size/Offset fields to the segment as well as specifying the FirstSec/LastSec fields means that users have to be aware of more than one thing when they adjust the YAML (i.e. if they want to move a segment, they have to update the segment offset as well as the sections). There may be valid reasons to do this move. This is of course on the top of the fact that for many test purposes, the size and offset are completely redundant, and it would be nice not to have to remove them if starting from a real object file (indeed, many people may forget to remove them, leading to noisy, or potentially even buggy, tests).

I don't see a clear way forward that meets both sets of requirements. About the best idea I have is to change obj2yaml to emit hard-coded offsets only when needed to enforce the layout (i.e. correct implicit values should be preferred), but then users should not simply remove whole sections, and instead use Fill sections to replace them in the YAML

If you have hardcoded sections' offsets and addresses as well as segments', there's no need to use fill sections and you can simply remove what you don't need, so i don't get why use fill in that case.

To avoid obj2yaml from having to emit section offsets and sizes for the reasons outlined elsewhere.