This is an archive of the discontinued LLVM Phabricator instance.

[obj2yaml,yaml2obj] - Read and dump the "Content" key of the RawContentSection section as array.
AbandonedPublic

Authored by grimar on Feb 25 2020, 8:33 AM.

Details

Summary

For all of custom sections we have we use Optional<yaml::BinaryRef> for
describing Content. A BinaryRef is a specialized YAMLIO scalar type for
representing a binary blob. Because of this we dump a content as a single string.

This string is dumped in a one line and there is no way to split it and
add proper comments for each part.
I was asked today about possibility of writing a custom
number of bytes and commenting the each group of them. It is useful for
crafting new types of sections.

I was surprised that we unable to do something like:

## Comment 0.
Content: [ 0xfe, 0xfe, 0xfe, 0xfe,
## Comment 1.
0xfe, 0xfe,
## Comment 2.
0xfe, 0xfe ]

In this patch I did a change for RawContentSection for start,
now it accepts and prints the vector of bytes for a Content which
is now Optional<std::vector<uint8_t>>.
How does the idea of such change sound?

Diff Detail

Event Timeline

grimar created this revision.Feb 25 2020, 8:33 AM
Herald added a project: Restricted Project. · View Herald Transcript
grimar edited the summary of this revision. (Show Details)Feb 25 2020, 8:34 AM
grimar retitled this revision from [obj2yaml,yaml2obj] - Dump Content of the RawContentSection section as array. to [obj2yaml,yaml2obj] - Read and dump the "Content" key of the RawContentSection section as array..Feb 25 2020, 8:40 AM

Just wanted to flag up a concern: "feff" -> [ 0xFE, 0xFF ] can make an opaque section content longer.

## Comment 1.
0xfe, 0xfe,
## Comment 2.
0xfe, 0xfe ]

Sometimes there may be a need to describe a little-endian 32-bit value, for example. Is the proposal to use 0x01, 0x00, 0x00, 0x00 for a little-endian 1?

~180 tests in test/tools/llvm-{objcopy,xray,objdump,elfabi,ar,dwarfdump} test/Object test/DebugInfo lld/test/ELF need an update, too. If we support both "0001" and [ 0x00, 0x01 ], we don't have to update those tests.

Just wanted to flag up a concern: "feff" -> [ 0xFE, 0xFF ] can make an opaque section content longer.

## Comment 1.
0xfe, 0xfe,
## Comment 2.
0xfe, 0xfe ]

Sometimes there may be a need to describe a little-endian 32-bit value, for example. Is the proposal to use 0x01, 0x00, 0x00, 0x00 for a little-endian 1?

Currently you can do:

---- !ELF
FileHeader:
  Class:   ELFCLASS32
  Data:    ELFDATA2MSB
  Type:    ET_REL
  Machine: EM_386
Sections:
  - Name: .foo
    Type: SHT_PROGBITS
    Content: "01000000"

It does not matter if you use ELFDATA2MSB or ELFDATA2LSB. The output will be the same:

SectionData (
  0000: 01000000 
)

And the same happens when this patch is used with Content: [ 0x01, 0x00, 0x00, 0x00 ]:
we have the same result with LSB/MSB, so nothing really changed here.

~180 tests in test/tools/llvm-{objcopy,xray,objdump,elfabi,ar,dwarfdump} test/Object test/DebugInfo lld/test/ELF need an update, too.

Hmm that's right, thanks for spotting. Seems I was too focused in yaml2obj/obj2yaml tests and forgot to run others. I am OK to update them though.

If we support both "0001" and [ 0x00, 0x01 ], we don't have to update those tests.

I see next problems here:

  1. I am not sure how to make yaml2obj and obj2yaml to support both ways. Perhaps it is possible, I need to investigate how much hard to do this. But anyways it should be more complicated than the current approach, which is trivial. I.e. I'd like to be sure we want to try to go this way before I actually try.
  2. What should be the default output for obj2yaml? Should it print [ 0x00, 0x01 ]? Then people will probably often use the output from obj2yaml as a sample and will ignore "0011" form probably. Also "0011" form does not allow to use large peice of a content nicely as it prints it in a single line and makes almost improssible to comment those bytes properly.
  3. Using large chunk of raw bytes is a bad practice. Should we care much about keeping a shorter way to describe it?
grimar planned changes to this revision.Feb 26 2020, 2:00 AM

This need an update for the rest of test cases + I'd like to wait for the result of the discussion in this thread before doing anything.

  1. Using large chunk of raw bytes is a bad practice. Should we care much about keeping a shorter way to describe it?

I think this is the key question to me. We shouldn't be encouraging large binary blobs, and if we get a situation where one is needed, perhaps the test needs simplifying/converting to assembly, or obj2yaml/yaml2obj needs extending. In other words, I don't think this change is particularly useful, as blobs should be kept to very short amounts (circa 32 bytes I reckon at most).

How about the following plan?

  1. This patch: supporting both [0,1] and "3031" (hex pairs) in yaml2obj. Support [0x00, 0x01] in obj2yaml. Only obj2yaml tests need to be updated. There are only a few.
  2. A future change: change "3031" (hex pairs) to "01" (strings) in yaml2obj. ~180 tests need to be migrated. This can probably be done mechanically with a dedicated tool.

How about the following plan?

  1. This patch: supporting both [0,1] and "3031" (hex pairs) in yaml2obj. Support [0x00, 0x01] in obj2yaml. Only obj2yaml tests need to be updated. There are only a few.
  2. A future change: change "3031" (hex pairs) to "01" (strings) in yaml2obj. ~180 tests need to be migrated. This can probably be done mechanically with a dedicated tool.

I did not understand it I think.
Currectly we support hex pairs, like "00112233". This patch suggests "[0x00, 0x01]" instead. Not sure I understand what do you mean by "strings" in (2)? I.e. what is the difference with "pairs" we have?

How about the following plan?

  1. This patch: supporting both [0,1] and "3031" (hex pairs) in yaml2obj. Support [0x00, 0x01] in obj2yaml. Only obj2yaml tests need to be updated. There are only a few.
  2. A future change: change "3031" (hex pairs) to "01" (strings) in yaml2obj. ~180 tests need to be migrated. This can probably be done mechanically with a dedicated tool.

I did not understand it I think.
Currectly we support hex pairs, like "00112233". This patch suggests "[0x00, 0x01]" instead. Not sure I understand what do you mean by "strings" in (2)? I.e. what is the difference with "pairs" we have?

I might be completely misreading this, but i think @MaskRay is suggesting to get something like the string "foo" in a string table, for example, writing literally that instead of 666f6f.

grimar added a comment.EditedFeb 28 2020, 2:31 AM

How about the following plan?

  1. This patch: supporting both [0,1] and "3031" (hex pairs) in yaml2obj. Support [0x00, 0x01] in obj2yaml. Only obj2yaml tests need to be updated. There are only a few.
  2. A future change: change "3031" (hex pairs) to "01" (strings) in yaml2obj. ~180 tests need to be migrated. This can probably be done mechanically with a dedicated tool.

I did not understand it I think.
Currectly we support hex pairs, like "00112233". This patch suggests "[0x00, 0x01]" instead. Not sure I understand what do you mean by "strings" in (2)? I.e. what is the difference with "pairs" we have?

I might be completely misreading this, but i think @MaskRay is suggesting to get something like the string "foo" in a string table, for example, writing literally that instead of 666f6f.

Looks true. I did not get that 0x30 is 1 and 0x31 is 2 from ASCII.
(And it is very far from what this patch initially suggested.)

I am not sure why we might want strings. Usually we describe a binary context of
sections isn't? e,g. custom symbol tables, custom text sections (with broken instructions) for broken test cases etc.

grimar added a comment.EditedFeb 28 2020, 5:00 AM
  1. Using large chunk of raw bytes is a bad practice. Should we care much about keeping a shorter way to describe it?

I think this is the key question to me. We shouldn't be encouraging large binary blobs, and if we get a situation where one is needed, perhaps the test needs simplifying/converting to assembly, or obj2yaml/yaml2obj needs extending. In other words, I don't think this change is particularly useful, as blobs should be kept to very short amounts (circa 32 bytes I reckon at most).

There are cases where using yaml2obj is preferable over llvm-mc, but you may have blobs that are not little. Imagine you prototype a new section and want to write a test
for a dumper tool for example. It is not always reasonable to teach yaml2obj about a new section type. So you still might want to use Content, but want to comment bytes well.
It is currently impossible to split Content data to leave multi-line comments.

An idea:
What if we introduce a one more YAML key: "ContentBytes"? I.e. we keep Optional<yaml::BinaryRef> Content as is,
but also add Optional<std::vector<uint8_t>> ContentBytes for those who would like to comment group of bytes
and use [ 0x1, 0x2 ] style for whatever reason is.

There are cases where using yaml2obj is preferable over llvm-mc, but you may have blobs that are not little. Imagine you prototype a new section and want to write a test
for a dumper tool for example. It is not always reasonable to teach yaml2obj about a new section type. So you still might want to use Content, but want to comment bytes well.
It is currently impossible to split Content data to leave multi-line comments.

FWIW, I actually disagree with the statement that it is not always reasonable to teach yaml2obj about a new section type. I think if a binary blob is large enough that it needs commenting, it should be added to yaml2obj instead.

That being said, I could imagine a limited set of cases involving malformed inputs where even if the section is implemented in yaml2obj, you might need to use raw Content.

An idea:
What if we introduce a one more YAML key: "ContentBytes"? I.e. we keep Optional<yaml::BinaryRef> Content as is,
but also add Optional<std::vector<uint8_t>> ContentBytes for those who would like to comment group of bytes
and use [ 0x1, 0x2 ] style for whatever reason is.

This could possibly work, but I suspect people might be confused as to when to use one instead of the other. We probably need to update the yaml2obj/obj2yaml documentation. I might also suggest ContentArray as a name instead perhaps?

MaskRay added a comment.EditedMar 2 2020, 9:34 AM

I am not sure why we might want strings. Usually we describe a binary context of
sections isn't? e,g. custom symbol tables, custom text sections (with broken instructions) for broken test cases etc.

I just feel that the current "3031" (hex pairs) representation is not conventional. C string literals may meet users' intuition.

There are cases where using yaml2obj is preferable over llvm-mc, but you may have blobs that are not little. Imagine you prototype a new section and want to write a test
for a dumper tool for example. It is not always reasonable to teach yaml2obj about a new section type. So you still might want to use Content, but want to comment bytes well.
It is currently impossible to split Content data to leave multi-line comments.

FWIW, I actually disagree with the statement that it is not always reasonable to teach yaml2obj about a new section type. I think if a binary blob is large enough that it needs commenting, it should be added to yaml2obj instead.

That being said, I could imagine a limited set of cases involving malformed inputs where even if the section is implemented in yaml2obj, you might need to use raw Content.

An idea:
What if we introduce a one more YAML key: "ContentBytes"? I.e. we keep Optional<yaml::BinaryRef> Content as is,
but also add Optional<std::vector<uint8_t>> ContentBytes for those who would like to comment group of bytes
and use [ 0x1, 0x2 ] style for whatever reason is.

This could possibly work, but I suspect people might be confused as to when to use one instead of the other. We probably need to update the yaml2obj/obj2yaml documentation. I might also suggest ContentArray as a name instead perhaps?

I don't have a strong opinion. ContentArray looks fine. obj2yaml does not know the nature of the content (strings or blob), should it continue dumping to Content by default? (I'll vote for yes; no strong opinion) If yes, ContentArray will be a key only known by yaml2obj.

This could possibly work, but I suspect people might be confused as to when to use one instead of the other. We probably need to update the yaml2obj/obj2yaml documentation. I might also suggest ContentArray as a name instead perhaps?

I don't have a strong opinion. ContentArray looks fine. obj2yaml does not know the nature of the content (strings or blob), should it continue dumping to Content by default? (I'll vote for yes; no strong opinion) If yes, ContentArray will be a key only known by yaml2obj.

Yeah, I'd leave it only known by yaml2obj and continue dumping to Content.