This is an archive of the discontinued LLVM Phabricator instance.

[llvm-objdump] Detect note section for ELF objects
Needs Review · Public

Authored by rochauha on Jun 30 2020, 3:50 AM.

Details

Summary

This is the first step toward handling notes in llvm-objdump.

Event Timeline

rochauha created this revision. Jun 30 2020, 3:50 AM
Herald added a reviewer: MaskRay.
Herald added a project: Restricted Project.

Could you clarify what the motivation is here? Is this working towards some missing GNU compatibility or similar? If so, could you paste an example of what GNU objdump produces, please?

Note that .note sections can be dumped using llvm-readelf in GNU style, just not as part of the disassembly.

llvm/tools/llvm-objdump/llvm-objdump.cpp
1312

I'm not sure it's advisable to skip printing the bytes. The current behaviour will cause the section data to be printed with --disassemble-all; this patch prevents that, meaning that until you implement the support, the section contents won't be dumped, which is a regression.

Could you clarify what the motivation is here? Is this working towards some missing GNU compatibility or similar? If so, could you paste an example of what GNU objdump produces, please?

Note that .note sections can be dumped using llvm-readelf in GNU style, just not as part of the disassembly.

GNU compatibility is not a goal here. The idea is to support disassembling everything that appears in a binary. Right now almost every entity is disassembled as instructions (even notes). This is not very helpful to the user.

Let us consider the example of notes. A user has to use llvm-readobj to look at notes and then use llvm-objdump to look at instructions. There is also some overlap between the functionality of llvm-readobj and llvm-objdump. For example, both support printing only the symbols. I don't think this divide between the two tools is necessary.

llvm-objdump is more 'feature rich' in the sense that it also allows looking at target-specific parts of the code. Right now the plan is to add some of the missing functionality to llvm-objdump. The final goal would be to support re-assemblable disassembly once we are able to disassemble all contents of a binary.

At the time of writing, no mainstream tool supports re-assemblable disassembly. The ones which do are mostly proprietary.

rochauha marked an inline comment as done. Jun 30 2020, 4:44 AM
rochauha added inline comments.
llvm/tools/llvm-objdump/llvm-objdump.cpp
1312

Would it be a good idea to just print ".note section: pending support" for now and fall back to the normal flow?

If that's the motivation, I don't think I can support it for various reasons:

  1. Having multiple tools to do the same job is not a good idea - each requires its own maintenance, the behaviour can diverge, bugs might require fixing in two places/support for new things etc etc etc. In an ideal world, we'd merge all the binary tools (GNU and LLVM) into a single tool, or redistribute functionality somehow, so that we don't have duplicate functionality like we already do. This takes us further away from that ideal.
  2. Decoding the .note section in a special manner takes us further away from GNU compatibility. It's not clear to me that GNU would want to add this functionality themselves.
  3. I'm not convinced people actually find dumping all sections in an interpreted form at once useful. Do you actually have any users for that? I think most people are interested in the disassembly of their code, but are unlikely to want this information in the same output as note information.
  4. I don't quite follow whether you're saying that one motivation is to make things re-assembleable, but if it is, the .note section is not the place to start - there are other sections where this would be more useful (e.g. data sections).
  5. I'm not sure I follow your "feature rich" comment. llvm-readelf for example has just as much access to the object as a whole and so is able to use target-specific information where appropriate (for example, it knows about various target-specific flags, can parse the target-specific attributes section etc).

If that's the motivation, I don't think I can support it for various reasons:

  1. Having multiple tools to do the same job is not a good idea - each requires its own maintenance, the behaviour can diverge, bugs might require fixing in two places/support for new things etc etc etc. In an ideal world, we'd merge all the binary tools (GNU and LLVM) into a single tool, or redistribute functionality somehow, so that we don't have duplicate functionality like we already do. This takes us further away from that ideal.

+1, I think the long term idea is that we'll have a single llvm tool that tells us useful object info in a variety of ways, and objdump/readelf will just be GNU compatibility frontends to the tool. Something closer to that goal would be to teach llvm-readobj how to disassemble if you want both things in one tool short term.

  1. Decoding the .note section in a special manner takes us further away from GNU compatibility. It's not clear to me that GNU would want to add this functionality themselves.

FWIW I think it's worth at least asking GNU binutils maintainers about this feature. I do imagine the answer would probably be no, but it doesn't hurt to see what they think. (I'm partly scared that there's someone out there relying on how objdump prints note sections).

rochauha added a comment. Edited Jun 30 2020, 11:21 PM
  1. Having multiple tools to do the same job is not a good idea - each requires its own maintenance, the behaviour can diverge, bugs might require fixing in two places/support for new things etc etc etc. In an ideal world, we'd merge all the binary tools (GNU and LLVM) into a single tool, or redistribute functionality somehow, so that we don't have duplicate functionality like we already do. This takes us further away from that ideal.

I agree that having a single tool is the direction we must aim for. But to do so, one tool needs to be improved to the point that it is 'feature complete'. llvm-objdump already disassembles all contents of the binary. It's just that everything is disassembled as instructions. Even notes are disassembled as instructions today. I am not 'adding' anything new; just trying to 'correct' the existing output. Targets will still need to implement things on their side (if needed) to take advantage of the infrastructure changes.
The initial plan would be to have note record handling in the MC layer. llvm-objdump will just iterate over the notes section. For each note record it will query the registered targets. The owning target will appropriately disassemble the bytes. A note record must be disassembled using the .byte directive if no target owns the note / printing for a particular kind of note is not implemented.
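
The flow described above (walk the note records, offer each one to whoever owns it, fall back to `.byte`) can be sketched roughly as follows. This is only an illustration in Python, not the proposed MC-layer code: the function names and the handler interface are hypothetical, and it assumes the common note layout of three 32-bit words (namesz, descsz, type) followed by a name and descriptor that are each padded to 4 bytes.

```python
import struct


def iter_notes(section, little_endian=True):
    """Yield (owner, type, desc, raw_record) for each record in a note section.

    Assumed layout: namesz, descsz, type as 32-bit words, then the name and
    the descriptor, each padded to a 4-byte boundary.
    """
    header = "<III" if little_endian else ">III"
    offset = 0
    while offset + 12 <= len(section):
        namesz, descsz, note_type = struct.unpack_from(header, section, offset)
        name_start = offset + 12
        desc_start = name_start + ((namesz + 3) & ~3)
        end = desc_start + ((descsz + 3) & ~3)
        name = section[name_start:name_start + namesz]
        owner = name.rstrip(b"\0").decode("ascii", "replace")
        desc = section[desc_start:desc_start + descsz]
        yield owner, note_type, desc, section[offset:end]
        offset = end


def byte_directives(raw, per_line=8):
    """Fallback: render raw bytes as .byte directives, a few per line."""
    return [
        "\t.byte " + ", ".join(f"0x{b:02x}" for b in raw[i:i + per_line])
        for i in range(0, len(raw), per_line)
    ]


def disassemble_notes(section, handlers):
    """Offer each record to the (hypothetical) handlers in order.

    A handler takes (owner, type, desc) and returns a list of output lines,
    or None if it does not own the note. Unowned notes become .byte lines.
    """
    lines = []
    for owner, note_type, desc, raw in iter_notes(section):
        for handler in handlers:
            printed = handler(owner, note_type, desc)
            if printed is not None:
                lines.extend(printed)
                break
        else:
            lines.extend(byte_directives(raw))
    return lines
```

With an empty handler list every record is emitted as `.byte` lines, which matches the "no target owns the note" case in the plan.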

  1. I'm not convinced people actually find dumping all sections in an interpreted form at once useful. Do you actually have any users for that? I think most people are interested in the disassembly of their code, but are unlikely to want this information in the same output as note information.

Yes, it is super helpful even if the user can just take a look at an entire binary and make sense of it. He/she shouldn't need two tools to do that. Many times note records contain useful information. Looking at the entire binary in an understandable text form can even help when working on bugs.

  1. I don't quite follow whether you're saying that one motivation is to make things re-assembleable, but if it is, the .note section is not the place to start - there are other sections where this would be more useful (e.g. data sections).

Before even thinking of re-assembly, we first need to make sure that all entities are disassembled in a proper way. For example, AMDGPU kernel descriptors should be disassembled as assembler directives rather than as instructions. The same goes for notes. It may be considered low-hanging fruit, but it definitely needs to be done. All entities need to be disassembled appropriately to make the final text relevant to the assembler.

  1. I'm not sure I follow your "feature rich" comment. llvm-readelf for example has just as much access to the object as a whole and so is able to use target-specific information where appropriate (for example, it knows about various target-specific flags, can parse the target-specific attributes section etc).

I meant that llvm-objdump can also disassemble, and thus allows handling bytes in a target-specific way.

I'd also like to point out another fact. There is already some feature overlap between llvm-readobj and llvm-objdump - like printing symbol information using --symbols and --syms respectively. Another such example is using the flag --section-headers in both tools to print section headers.

Could we bring this discussion up on llvm-dev please? It seems like it needs wider attention to me. Essentially, there are two related questions. "What should the disassembler do?" is the first - should it be for interpreting bytes as instructions (this is largely the current approach), or should it be for interpreting all sections (what you're essentially proposing). The second is essentially "Should llvm-objdump/llvm-read[obj/elf]/<possibly also other tools like llvm-nm> be combined into a single tool? If so, what is the best way to do so?" The latter will of course still need to provide GNU-compatible front-ends.

I'd also like to point out another fact. There is already some feature overlap between llvm-readobj and llvm-objdump - like printing symbol information using --symbols and --syms respectively. Another such example is using the flag --section-headers in both tools to print section headers.

This is an unfortunate consequence of people wanting to have GNU compatibility. In fact, in our downstream version of llvm-objdump we don't officially support the features that are also available in llvm-readobj so that we don't have to maintain two tools doing the same thing. I am happy to extend things for GNU compatibility, but anything beyond that should not be duplicated in two places.

  1. Having multiple tools to do the same job is not a good idea - each requires its own maintenance, the behaviour can diverge, bugs might require fixing in two places/support for new things etc etc etc. In an ideal world, we'd merge all the binary tools (GNU and LLVM) into a single tool, or redistribute functionality somehow, so that we don't have duplicate functionality like we already do. This takes us further away from that ideal.

I agree that having a single tool is the direction we must aim for. But to do so, one tool needs to be improved to the point that it is 'feature complete'. llvm-objdump already disassembles all contents of the binary. It's just that everything is disassembled as instructions. Even notes are disassembled as instructions today. I am not 'adding' anything new; just trying to 'correct' the existing output. Targets will still need to implement things on their side (if needed) to take advantage of the infrastructure changes.
The initial plan would be to have note record handling in the MC layer. llvm-objdump will just iterate over the notes section. For each note record it will query the registered targets. The owning target will appropriately disassemble the bytes. A note record must be disassembled using the .byte directive if no target owns the note / printing for a particular kind of note is not implemented.

Your definition of "correct" does not match mine. I'd interpret --disassemble-all to mean "disassemble all sections as instructions". I acknowledge that in most cases this probably isn't useful, but honestly I don't know what the purpose of the feature was in the first place. Relatedly, not even text sections necessarily consist entirely of executable instructions - if I'm not mistaken jump tables and other embedded data can exist in them too.

Note parsing already exists in llvm-readobj, and possibly in the Object library (without looking I don't remember where the bulk of the work is done). We don't want to add note parsing to the MC library as well. Duplicated functionality is bad, as already outlined. If this feature is to be implemented in disassembly, it should reuse the same functionality as llvm-readobj. At most, the only difference should be how to print the output. Also, I don't think SHT_NOTE sections are intended to be target specific: they are supposed to be vendor specific (where vendor is defined by the note's content, as opposed to the EM_* field of the ELF header). For example, there are GNU notes, and I think LLVM notes too. I know we have downstream notes with another vendor name too.

  1. I'm not convinced people actually find dumping all sections in an interpreted form at once useful. Do you actually have any users for that? I think most people are interested in the disassembly of their code, but are unlikely to want this information in the same output as note information.

Yes, it is super helpful even if the user can just take a look at an entire binary and make sense of it. He/she shouldn't need two tools to do that. Many times note records contain useful information. Looking at the entire binary in an understandable text form can even help when working on bugs.

So you actually use note contents in your day-to-day development to immediately tell you something about the executable instructions? If not, then I don't see how it can be helpful. In all the years I've spent helping develop and maintain our downstream binutils, I can honestly say that I've never found a "dump everything" approach useful, even though it is supported by our downstream dumping tool, except when having to diff a before and after output, but even in that case, I'm not using pieces of information from one part to inform my understanding of another part.

  1. I don't quite follow whether you're saying that one motivation is to make things re-assembleable, but if it is, the .note section is not the place to start - there are other sections where this would be more useful (e.g. data sections).

Before even thinking of re-assembly, we first need to make sure that all entities are disassembled in a proper way. For example, AMDGPU kernel descriptors should be disassembled as assembler directives rather than as instructions. The same goes for notes. It may be considered low-hanging fruit, but it definitely needs to be done. All entities need to be disassembled appropriately to make the final text relevant to the assembler.

But is re-assembly a long-term goal? If it isn't, then there's no point in doing things to work towards it. There are cleaner ways of dumping information than trying to squeeze it all into the disassembly output, and in most cases those ways already exist. Even if it were, the chances are that your note output would have to be not human readable (i.e. in the form of .byte etc. directives), since there are no assembler instructions/directives that correspond to them. For example, how do you create a "build-id" assembler input?
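
To make the readability concern concrete, here is a rough sketch of what a re-assemblable rendering of a single note might look like if only generic data directives are available. The helper name `note_to_directives` is hypothetical. It assumes the usual 4-byte padding of the name and descriptor, and an owner name that is plain ASCII with no quotes or backslashes. It uses only the GNU-style data directives `.long`, `.asciz`, `.p2align` and `.byte`.

```python
def note_to_directives(owner, note_type, desc):
    """Render one note record as generic assembler data directives.

    Hypothetical helper: the output reproduces the bytes of the record,
    but carries no more meaning for a reader than a hex dump would.
    """
    namesz = len(owner.encode("ascii")) + 1  # .asciz appends the NUL byte
    lines = [
        f"\t.long {namesz}",
        f"\t.long {len(desc)}",
        f"\t.long {note_type}",
        f'\t.asciz "{owner}"',
        "\t.p2align 2",  # pad the name to a 4-byte boundary
    ]
    if desc:
        lines.append("\t.byte " + ", ".join(f"0x{b:02x}" for b in desc))
    lines.append("\t.p2align 2")  # pad the descriptor to a 4-byte boundary
    return lines


if __name__ == "__main__":
    # A made-up 4-byte descriptor standing in for a real build ID.
    print("\n".join(note_to_directives("GNU", 3, bytes.fromhex("deadbeef"))))
```

Such output round-trips through an assembler, but a reader still cannot tell that the payload is a build ID without decoding the header words by hand.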

In an ideal world, we'd merge all the binary tools (GNU and LLVM) into a single tool, or redistribute functionality somehow, so that we don't have duplicate functionality like we already do. This takes us further away from that ideal.

I'm confused by this statement in particular. If the goal is to just have one tool, why did LLVM start re-implementing these tools to begin with? Wasn't the first commit of "llvm-objdump"/"llvm-readobj" a massive step away from the ideal?

I'm afraid I can't answer that question. I joined LLVM development quite some time after both llvm-readobj and llvm-objdump were initially created. My suspicion is that llvm-readobj was created to provide a generic testing facility, llvm-readelf (i.e. GNU output style for llvm-readobj) was later added for GNU compatibility, and llvm-objdump was created for disassembly, with GNU compatibility features such as section header printing added later on. However, I haven't attempted to research any of this in depth, so I could easily see this being wrong.