This is an archive of the discontinued LLVM Phabricator instance.

[yaml2obj][ELF] Add support for program headers
ClosedPublic

Authored by jakehehrlich on Jul 11 2017, 2:03 PM.

Details

Summary

This change adds basic support for program headers.

I need to do some testing which requires generating program headers but I can't use ld.lld or clang to produce programs that have headers. I'd also like to test some strange things that those programs may never produce.

Diff Detail

Repository
rL LLVM

Event Timeline

jakehehrlich created this revision.Jul 11 2017, 2:03 PM
silvas requested changes to this revision.Jul 11 2017, 6:55 PM

It is too fragile to manually specify file offsets. The actual layout that yaml2obj does right now should be considered opaque, such that any file offsets you could manually choose are "wrong" (even if they are incidentally correct with the current layout of sections that yaml2obj chooses).

So while this is probably an interesting incremental step, we don't want this in-tree yet as-is (if I understand correctly; it might be useful to add in a basic test to make things concrete). We definitely want to have the layout portion so that we don't choose brittle offsets. I don't think it should be too difficult to add a field on the program header yaml object that lists which section names should be covered by that program header.

This revision now requires changes to proceed.Jul 11 2017, 6:55 PM
jakehehrlich edited edge metadata.
jakehehrlich edited the summary of this revision. (Show Details)

This now sets offsets to be the minimum offset of a collection of sections but does not set file size or memory size. If no sections are specified the Offset is set to zero.

atanasyan requested changes to this revision.Jul 12 2017, 3:55 PM

And you definitely need to add some tests cases.

This revision now requires changes to proceed.Jul 12 2017, 3:55 PM
jakehehrlich edited edge metadata.

Added a test for program headers

This now sets offsets to be the minimum offset of a collection of sections but does not set file size or memory size. If no sections are specified the Offset is set to zero.

Do you need the ability to manually set file size to a specific numeric value? Since that allows creating weird/broken files, I would prefer not to have it in an initial commit. Or in other words, I would consider it a separate feature specifically for creating broken ELF files (and this means it would require a separate test too).

The overall design philosophy for ELF yaml2obj is to by default create fairly sane object files (specifically, having it be impossible by default to have wild offsets / sizes). We are willing to sacrifice a bit of generality to keep this (otherwise juggling raw sizes/offsets is way too error-prone). For a couple fields we've added the ability to specify raw numeric values either to create files broken in specific ways or to create a slightly more general subset of ELF files, but those are generally added after the fact specifically for that purpose.

So according to that philosophy, I think it is worth fleshing out this initial commit to actually set the program header file size to cover the listed sections (and set mem size in a default way based on that).

test/tools/yaml2obj/program-header.yaml
29 ↗(On Diff #106353)

The spec requires VAddr and PAddr to be equal to the offset in the file modulo the page size / alignment. Even though we are treating offsets as opaque, they are all aligned to the respective boundaries respect alignment. So that means that the VAddr and PAddr must be addresses aligned at least to the alignment of the phdr.

You might want to do something like Sec->Align = max(Sec->Align, PHdr->Align) for all sections Sec covered by phdr Phdr to ensure that the section layout respects the PHdr's alignment.

This comment was removed by jakehehrlich.

Yeah, I think I'll probably need to add a special case that lets me take over how layout occurs here. After we work this diff out I'll put together a list of use cases that I'll need and I'll post something on llvm-dev so that we can decide what would fill those use cases best.

This now sets offsets to be the minimum offset of a collection of sections but does not set file size or memory size. If no sections are specified the Offset is set to zero.

Do you need the ability to manually set file size to a specific numeric value? Since that allows creating weird/broken files, I would prefer not to have it in an initial commit. Or in other words, I would consider it a separate feature specifically for creating broken ELF files (and this means it would require a separate test too).

The overall design philosophy for ELF yaml2obj is to by default create fairly sane object files (specifically, having it be impossible by default to have wild offsets / sizes). We are willing to sacrifice a bit of generality to keep this (otherwise juggling raw sizes/offsets is way too error-prone). For a couple fields we've added the ability to specify raw numeric values either to create files broken in specific ways or to create a slightly more general subset of ELF files, but those are generally added after the fact specifically for that purpose.

So according to that philosophy, I think it is worth fleshing out this initial commit to actually set the program header file size to cover the listed sections (and set mem size in a default way based on that).

There's some tricky cases that need to be accounted for then.

  1. If there is a PT_LOAD segment covered by 3 sections such that the middle section is a NOBITS section then that middle section should take up space in the file image. This means that section layout changes depends on the presence of a PT_LOAD segment. This is a pretty non-trivial change to make to layout. Do you want me to go ahead and add this change to this diff?
  2. I think accounting PT_PHDR could be a special case that we can handle in a later diff.
  3. There's some strange behavior in dealing with TLS segments and sections. Namely there's a case which contradicts case 1) above when a NOBITS section is at the end of a TLS segment but not at the end of a larger PT_LOAD segment. In this case, sometimes the NOBITS section is not given space in the file and dynamic loaders have handle it differently. I'd like to be able to test for this case. I think it's fair for something like this to be an extension.

Cases 2 and 3 are cases where the sensible thing to do isn't what will be done by just specifying which sections should be covered. I think there might be a few more cases like these lurking in the corners. Should these cases be handled by extensions as they come up? Or should we be trying to account for these in this change?
Case 1 is a little complicated to add but the correct behavior is specifiable by just listing sections that should cover the segment.

In the mean time I've got a diff and a couple tests prepared that I think handle the common normal cases sensibly. But this change fails on issues 1-3 above.

  1. Added a test for no-bits sections
  2. Modified test from previous diff
  3. Memory size and file size are now computed from the sections

This does not properly handle the case where a NOBITS section is in the middle of two PROGBITS sections. There is also no way to create certain valid program headers but these are corner cases.

I think you should be able to deal with most of those issues by sorting the list of sections with an appropriate sorting predicate that coalesces sections that need to be adjacent, which should be fairly unintrusive.

I don't know if you really need to do that though. Again, we're mostly just trying to avoid accidentally creating really blatantly wrong files like wild offsets. Things that are semantically wrong at a higher level are not critical; for example, the contents of the .text section don't need to be real instructions. One way to look at this perhaps is to think that if you are modifying a yaml file, you aren't going to need to have a feedback loop where you manually inspect the output object file to then e.g. write in a magical offset back into the yaml file -- that's brittle, error prone, and annoying.

You want the yaml file to contains stuff that matters for testing what you care about and avoid anything that doesn't matter as much as possible.

Sometimes having yaml2obj do some work automatically for you to prevent higher-level semantic invalid stuff (like we do for relocations, which would otherwise need to be manually encoded in the section content) is a decent tradeoff, sometimes just detecting something is wrong and giving an error is a good tradeoff, sometimes letting something semantically wrong at a higher level silently go through is all that is practical. Combined with the fact that sometimes we want to create invalid stuff because yaml2obj is a testing tool, means that there's no a priori right answer.

Overall, there is a tradeoff between generality of yaml2obj (e.g. how broken the files it produces can be) and how error-prone it is to use. Conversely, how convenient it is to use is a tradeoff vs how many different testing scenarios it can be used for. So my main advice is to think about the use cases you want to use it for and strike a balance; it's meant to make life easier for writing tests. If it isn't doing that then you don't need to use it. Some use cases might be outliers for which you are really just better off checking in a binary file for the test.

Also, for the record you should be able to do a lot of this by passing a linker script to ld.lld e.g. using the PHDRS command and appropriate literal data commands. See e.g. lld/test/ELF/linkerscript/phdrs.s and lld/test/ELF/linkerscript/data-commands.s.
I'm not sure if that is enough (especially w.r.t. creating broken files) for what you want to do, but the basics of creating an output file with certain phdrs that cover certain sections with some dummy content can be done like that.

test/tools/yaml2obj/program-header.yaml
34 ↗(On Diff #106564)

You might want to consider defaulting the alignment to be the max of the alignment of all included sections for now.

I imagine that in most tests you don't actually care about the alignment, so being able to omit it and get some sane default would be good.

Also, I think it will require a bit of extra code to ensure that manual alignment is respected (e.g. if the PT_LOAD has an alignment exceeding that of the sections it covers, it will need to be rounded down to the alignment) . So removing Align from the yaml input format altogether might be the right choice in the first patch.

tools/yaml2obj/yaml2elf.cpp
354

Capitalize.

  1. Fixed capitalization typo
  2. Removed the align field
  3. The alignment of a segment is now the maximum alignment of all sections that have the same offset. If there are no sections, the alignment is 1.

If the alignment was set to be the maximum of all sections in the segment then an invalid case could occur where a section in the middle has a larger alignment than the first section and the segment would start at the offset of the first section while having an alignment that didn't agree with that offset.

I think you should be able to deal with most of those issues by sorting the list of sections with an appropriate sorting predicate that coalesces sections that need to be adjacent, which should be fairly unintrusive.

I don't know if you really need to do that though. Again, we're mostly just trying to avoid accidentally creating really blatantly wrong files like wild offsets. Things that are semantically wrong at a higher level are not critical; for example, the contents of the .text section don't need to be real instructions. One way to look at this perhaps is to think that if you are modifying a yaml file, you aren't going to need to have a feedback loop where you manually inspect the output object file to then e.g. write in a magical offset back into the yaml file -- that's brittle, error prone, and annoying.

I'll try and keep that in mind as I propose extensions to this for my testing. This two step process was what I initially had in mind but I couldn't really think of another way of going about it. Being able to upload binaries as test cases alleviates a lot of my concerns.

Overall, there is a tradeoff between generality of yaml2obj (e.g. how broken the files it produces can be) and how error-prone it is to use. Conversely, how convenient it is to use is a tradeoff vs how many different testing scenarios it can be used for. So my main advice is to think about the use cases you want to use it for and strike a balance; it's meant to make life easier for writing tests. If it isn't doing that then you don't need to use it. Some use cases might be outliers for which you are really just better off checking in a binary file for the test.

I didn't realize it was acceptable to upload binaries! I'll probably be using that trick in the near future to test these trickier cases. I think this change will suffice to handle the less tricky cases. This all makes more sense now. I thought yaml2obj was the only way I could test some of this stuff so I really wanted full control over the output. Now that I know there's another way I agree strongly that yaml2obj should make producing sensible binaries easy.

Also, for the record you should be able to do a lot of this by passing a linker script to ld.lld e.g. using the PHDRS command and appropriate literal data commands. See e.g. lld/test/ELF/linkerscript/phdrs.s and lld/test/ELF/linkerscript/data-commands.s.
I'm not sure if that is enough (especially w.r.t. creating broken files) for what you want to do, but the basics of creating an output file with certain phdrs that cover certain sections with some dummy content can be done like that.

I think that should be good actually. And if it isn't then for the few cases where that doesn't work I can resort to some hex editing or some such other trickery.

If the alignment was set to be the maximum of all sections in the segment then an invalid case could occur where a section in the middle has a larger alignment than the first section and the segment would start at the offset of the first section while having an alignment that didn't agree with that offset.

Technically the way you've handled this here is incorrect I think. For example, suppose you have two sections, one is strings (or whatever) at an alignment of 1 and another is some other data with alignment of 4. A PT_LOAD covering both must have an alignment of 4 so that the data is loaded in a way consistent with its alignment. IIUC, depending on the order of the sections, I believe that the current patch will give the PT_LOAD an alignment of 4.

You should be able to handle this by setting the alignment of the PT_LOAD to the max of all sections is must contain. Then rounding down p_offset to that alignment (and adjusting the size too).

I didn't realize it was acceptable to upload binaries! I'll probably be using that trick in the near future to test these trickier cases. I think this change will suffice to handle the less tricky cases. This all makes more sense now. I thought yaml2obj was the only way I could test some of this stuff so I really wanted full control over the output. Now that I know there's another way I agree strongly that yaml2obj should make producing sensible binaries easy.

We try to avoid checking in binaries, but sometimes it is the most sensible way. The binaries should usually be fairly small (e.g. produced by a reduced test case, perhaps with some hex editing of the binary to corrupt some field).

Also, for the record you should be able to do a lot of this by passing a linker script to ld.lld e.g. using the PHDRS command and appropriate literal data commands. See e.g. lld/test/ELF/linkerscript/phdrs.s and lld/test/ELF/linkerscript/data-commands.s.
I'm not sure if that is enough (especially w.r.t. creating broken files) for what you want to do, but the basics of creating an output file with certain phdrs that cover certain sections with some dummy content can be done like that.

I think that should be good actually. And if it isn't then for the few cases where that doesn't work I can resort to some hex editing or some such other trickery.

One logistical detail is that using ld.lld in the tests would require the tool living in LLD's repo. This isn't impossible, but somewhat annoying if the only reason to do it is for testing.
Overall for LLD/ELF we mostly rely on inputs generated by ordinary tools unless we're testing something specifically strange or that can't be produced by ordinary tools.
So using ld.lld / llvm-mc to create your objcopy inputs is probably the thing most consistent with that. Ultimately, almost all user inputs to objcopy are going to be produced by ordinary tools so it makes sense that most test cases (except error/corrupt cases) should be generatable with such tools.

So you can look at lld/test/ELF/ and compare:

  • binaries: .o files
  • YAML tests: .test files (or files containing !ELF to be precise)
  • tool-produced inputs: .s / .ll

The vast majority of test inputs are tool-produced.

Technically the way you've handled this here is incorrect I think. For example, suppose you have two sections, one is strings (or whatever) at an alignment of 1 and another is some other data with alignment of 4. A PT_LOAD covering both must have an alignment of 4 so that the data is loaded in a way consistent with its alignment. IIUC, depending on the order of the sections, I believe that the current patch will give the PT_LOAD an alignment of 4.

I'm not sure I follow. The current patch, as I intended it and as I understand it, does the following:

  1. Sections are laid out appropriately according to their alignments and other constraints (this is not changed in this patch, this is just existing working code doing it's job)
  2. Segments are expanded to fit the needed sections. Most notably p_offset becomes the sh_offset of the first section.
  3. Segment alignment is set to the alignment of the first section (which is also the same section that p_offset got it's value from). Because the first segment's alignment and offset are valid so too should the alignment and offset of the segment be valid.

Are you concerned about alignment in memory after loading? Since the virtual address is chosen by the user I don't think that matters. If there is ever a case where the loader chooses the virtual address based on alignment then I'd agree, we should tweak p_offset to account for maximum alignment. Otherwise I think having the first section's offset agree with the segment offset is more sensible because the user will be picking the virtual address and there's nothing we can do about that.

tpimh added a subscriber: tpimh.Jul 18 2017, 12:37 PM
silvas accepted this revision.Jul 18 2017, 10:09 PM

Are you concerned about alignment in memory after loading? Since the virtual address is chosen by the user I don't think that matters. If there is ever a case where the loader chooses the virtual address based on alignment then I'd agree, we should tweak p_offset to account for maximum alignment. Otherwise I think having the first section's offset agree with the segment offset is more sensible because the user will be picking the virtual address and there's nothing we can do about that.

Yes that is my concern. If you have a section that requires 2M alignment, then with your current patch a PT_LOAD covering it can end up with 4k alignment and thus the section might be loaded at an address that isn't 2M aligned. That's a bit of a technicality though (probably only matters for position independent code where the loader is choosing where to load it). Overall this LGTM.

In practice, section alignment is much smaller (~ size of a data type) compared to segment alignment (~ size of a page). A case where a section has a massive alignment is theoretically possible though.

This revision was automatically updated to reflect the committed changes.