With current implementation obj2yaml produces files which on conversion back to obj have invalid address-offset value for the first segment and have half the sections not belonging to any segment(especially after some sections are stripped from YAML). This quick fix improves situation a lot.
Details
Diff Detail
Event Timeline
Only save offset for the first segment for now to fix tests. But changing tests and saving all offsets is probably better?
Just show what the complete fix is, then update all the tests (usually for obj2yaml there are not many to update).
If you feel that the change may need several iterations and don't want to modify tests back and forth, you can say that.
Change tests. However, i have no idea why asan-related tests failed on timeout, will investigate if it happens again.
Could you clarify why you only emit the sizes for the PHDR type? It's perfectly legitimate for other segment types to have a size that covers other areas too that aren't parts of sections (and therefore the size can't be derived purely from the section size), but IIRC, yaml2obj will assume the sizes match unless told otherwise. I suppose it might be necessary to emit Fill "sections" in the YAML to cover the gap in some cases to ensure section layout works correctly.
llvm/test/tools/obj2yaml/ELF/program-headers.yaml | ||
---|---|---|
52 | Nit: here and below, all the other field values within a segment are aligned vertically with each other, but in most cases, the new Offset field isn't. Please fix. | |
llvm/tools/obj2yaml/elf2yaml.cpp | ||
488–493 | I think we should be emitting these values in the output only if they can't be derived automatically from the existing layout, otherwise the YAML output will end up unnecessarily verbose. It's been a while since I've looked at the obj2yaml code though, so it's possible this happens elsewhere. |
I don't know of any practical examples of segments aside from PHDR and first load which cover out-of-section space, and with these two cases, the size will be properly adjusted by yaml2obj to cover everything from the provided offset to the last included section, or we'll provide it for PHDR in absence of sections.
llvm/tools/obj2yaml/elf2yaml.cpp | ||
---|---|---|
488–493 | I don't think one line per segment is so verbose that we should build separate logic to determine whether to include it. What makes yaml verbose are symbols, relocations, dynamic, note sections and etc. TBH i think we should also include section offsets unconditionally because people strip sections all the time which breaks all this smart offset derivation and rebuilt binaries have absolutely nonsensical offset/address values, which makes it difficult to make some tools(e.g BOLT) more strict to the input binaries, since all the generated tests will fail. |
Seems like some tests randomly don't finish in one hour and thus fail. Will restart pipeline.
FWIW, I know of situations within an internal toolchain where this can happen, although we don't use the LLVM tools in that situation, I believe. Also, if explicitly instructed to remove a section that is within a segment, llvm-objcopy will preserve the gap in the segment, to avoid invalidating addresses, so it could arise from that situation. Finally, since yaml2obj supports the concept of "Fill" sections (sections that aren't actually sections in the section header table, i.e. they just cover blocks of empty space), it's possible to create an object with gaps in that way, without explicitly specifying the offset/size of program headers. obj2yaml doesn't currently support emitting Fill sections, but it could.
When we did some work a few years ago to yaml2obj to allow it to create "Fill" sections, we discussed creating Chunks for the ELF Header and Program Header Tables (like we do already with the Sections and SectionHeaderTable), with the idea that you could list them in the FirstSec/LastSec listing of a segment. I feel like this would be a better way of ensuring the size of a segment (especially a PT_PHDR segment) is correct, rather than hard-coding the size. Hard-coding the size means it will fall down if somebody decides to manually add an additional program header to the YAML produced by obj2yaml, for example.
llvm/tools/obj2yaml/elf2yaml.cpp | ||
---|---|---|
488–493 |
My reasoning here is that a significant use-case I'm aware of is for people creating test cases that use yaml2obj to generate the test input at runtime. Essentially, they get an object they want to test using the compiler/assembler/etc and then use obj2yaml to generate the YAML for use in the test. If the fields are emitted unnecessarily this causes two problems: 1) the YAML becomes more verbose than necessary, obfuscating what is actually important in the test and 2) it makes it harder to manipulate the YAML to add/remove sections from it (because the offsets etc all become invalid and need updating by hand). Thinking about it, I'm more concerned by the second of these.
Sorry, I'm not sure what you're saying here? Are you saying people use tools like llvm-strip/llvm-objcopy to manipulate sections? If so, the offsets and addresess of any remaining sections and segments should remain valid afterwards (and if they don't, that's a bug in llvm-strip/llvm-objcopy etc). Alternatively, are you saying people hand-edit the YAML after obj2yaml emitted it? If it's the latter, emitting more offsets can cause other issues. |
llvm/tools/obj2yaml/elf2yaml.cpp | ||
---|---|---|
488–493 |
I'm referring to the second case where we remove sections from the generated yaml, and thus other sections move forward and have invalidated addresses. So i think we should preserve as much info in yaml as possible, because it will likely be edited.
As i said before, the bulk of yaml comes from other things, one line per segment or even per section will not make the file significantly bigger or harder to understand.
That is exactly what happens without emitting offsets: you remove a section, and the next slides into it's place, and now has invalid base address. Or, for example, you remove a first section in the segment, and the segment offset changes if not emitted, but since the VAddr is the same, we have invalid base address again. If we emit all segments and sections offsets, we make it easier to remove some sections without invalidating others. As for adding anything, you'll have to think more about a place in a file and in address space, which is a good thing because valid mappings are a good thing. But from what i see we usually remove sections from yaml, not add them, so you won't think about it very often. But anyway, this diff is about segment offsets, and emitting them alone will help because obj->yaml->obj becomes more accurate. |
! In D144009#4131236, @jhenderson wrote:
Hard-coding the size means it will fall down if somebody decides to manually add an additional program header to the YAML produced by obj2yaml, for example.
I think you should not just manually add a segment bacause it will move all the other parts of the file and, again, invalidate all the mappings, possibly causing some sections to come out of address space. But if you remove a segment and have PHDR size hardcoded, you didn't break anything, since the new space will just become NULL segment but all the mappings are still valid.
It seems to me like we have a fundamental difference in how obj2yaml is used (or rather, there are multiple valid use-cases, but they pull in different directions). Whether it's segment offsets or section offsets, the fundamental issue is the same, I believe (just one has a bigger impact than the other). Let me summarize my understanding of the situation and the pros/cons of the approaches:
- Use-case 1: As a way of quickly generating YAML blocks for testing purposes. In this case, you don't want more than the minimum needed to represent the object, as anything extra hides what's actually important. Since yaml2obj can happily derive section/segment offsets automatically, it isn't necessary to include the offsets in this case for most cases. Furthermore, it is often not necessary to have rigid offsets/addresses (many tests don't care about these). Typically these test cases will have minimal symbols, relocations, or sections, so the signal to noise impact of unnecessary offsets (for segments or sections) is relatively high.
- Use-case 2 (I'm uncertain if this is exactly correct, but hopefully I've got the general point correct): As a way of creating something that is easier to manipulate than a raw binary. Often, this might be a linked executable. Removing sections etc is desirable, but any addresses need to remain correct (due to things like already-applied relocations referencing them). Removing sections will cause segments/sections to potentially move, unless offsets are included explicitly in the YAML, or something else is used to replace the section.
I don't see a clear way forward that meets both sets of requirements. About the best idea I have is to change obj2yaml to emit hard-coded offsets only when needed to enforce the layout (i.e. correct implicit values should be preferred), but then users should not simply remove whole sections, and instead use Fill sections to replace them in the YAML (this is actually fairly easy to do, but users need to be aware of these Fill sections).
@MaskRay, do you have any thoughts?
You're right, and the second case of generating a somewhat valid executable is currently not handled very well, so adding an option to emit section offsets may be useful. But anyway, the context of this diff is purely segment offsets, and i don't see fundamental contradictions with the first case here.
I don't see a clear way forward that meets both sets of requirements. About the best idea I have is to change obj2yaml to emit hard-coded offsets only when needed to enforce the layout (i.e. correct implicit values should be preferred), but then users should not simply remove whole sections, and instead use Fill sections to replace them in the YAML
If you have hardcoded sections' offsets and addresses as well as segments', there's no need to use fill sections and you can simply remove what you don't need, so i don't get why use fill in that case.
I think for the Program Headers, one fundamental issue is that there's more than one way to define the positioning of a segment. In my opinion, the "correct" way to place segments in YAML is to use FirstSec/LastSec, since that causes offsets and sizes to be correct automatically. In the event that there's space at the start or end of a segment, it is better to use a Fill section that hard-code the size/offset, (again in my opinion), because then there is only one source of information for the segment layout (i.e. the section list). Adding Size/Offset fields to the segment as well as specifying the FirstSec/LastSec fields means that users have to be aware of more than one thing when they adjust the YAML (i.e. if they want to move a segment, they have to update the segment offset as well as the sections). There may be valid reasons to do this move. This is of course on the top of the fact that for many test purposes, the size and offset are completely redundant, and it would be nice not to have to remove them if starting from a real object file (indeed, many people may forget to remove them, leading to noisy, or potentially even buggy, tests).
To avoid obj2yaml from having to emit section offsets and sizes for the reasons outlined elsewhere.
Nit: here and below, all the other field values within a segment are aligned vertically with each other, but in most cases, the new Offset field isn't. Please fix.