This is an archive of the discontinued LLVM Phabricator instance.

Add JSON output skeleton to llvm-readelf
Needs ReviewPublic

Authored by Jaysonyan on Oct 12 2021, 9:33 AM.

Details

Summary

This change adds JSON output to llvm-readelf. It is enabled through the --elf-output-style
option which now supports JSON in addition to the existing LLVM and GNU. This patch
just lays out the needed infrastructure and skeleton code but doesn't include any
implementation.

The major change was the need for a json printer that can be instantiated at the top level
of llvm-readobj. Since llvm-readobj uses ScopedPrinter this change adds a JSONScopedPrinter
as well as a ScopedPrinterBase which both ScopedPrinter and JSONScopedPrinter both
inherit from.

Corresponding RFC: https://lists.llvm.org/pipermail/llvm-dev/2021-September/152994.html

Diff Detail

Event Timeline

Jaysonyan created this revision.Oct 12 2021, 9:33 AM
Jaysonyan requested review of this revision.Oct 12 2021, 9:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 12 2021, 9:33 AM
Jaysonyan retitled this revision from [WIP] Add JSON output option to llvm-readelf to Add JSON output skeleton to llvm-readelf.Oct 12 2021, 1:43 PM

Some high-level comments:

  1. Please include a link to the RFC email thread where this was discussed, in your commit/patch description, for ease of reference.
  2. Please make sure to create your diff with full context, using -U999999 when you make the diff.
  3. Please make sure to clang-format all your changes.
  4. I think it makes sense to split this down even further than you have done. I'd start with literally just adding the option and skeleton dumping structure, without any implementation of the dump at all. Your later patches can then flesh out each individual dump option. This also means that you can have lots of child patches that base off this one, which are entirely independent, so it doesn't matter what order they get implemented or committed in.
  5. You should update the llvm-readobj and llvm-readelf .rst files in llvm/docs/CommandGuide to discuss the new output mode, perhaps adding something like "Not all options are currently supported by this mode" or something to that effect.

Also, if you wish to add other commits in a series, that are built on top of this patch, you can use the "Edit Related Revisions" option. These will then appear in a new tab in the revision summary panel called "Stack", so that it's possible to get a quick overview of where things are.

Regarding the overall output appearance, it seems to me like you should have the JSON for a file formatted as a single JSON object, with each output variety in its own child object, labelled according to the option name or similar. Since llvm-readobj and llvm-readelf also both support multiple input files, you need a strategy for how these might be dumped. My instinct says that you actually have the dump output for a file added as child members to a top-level object, labelled by their file names. Finally, you should give some thought to how archive members are printed. I think you could either have them printed as individual objects within a JSON object representing the whole archive, or just as regular files, but with the names disambiguated. I've gone with the latter approach below:

{
    "object.o" : {
        "file-header" : { <output for file header> },
        "section-headers" : { <section header table> }
    },
    "archive.a(member1.o)": {
        "file-header" : { <output for file header> },
        "section-headers" : { <section header table> }
    },
    "archive.a(member2.o)": {
        "file-header" : { <output for file header> },
        "section-headers" : { <section header table> }
    }
}

I hope the reasoning for this is clear, but if you've got other ideas, I'm happy to discuss them.

Thanks for all the feedback! I was working on updating the patch according to your comments and I've been thinking about what you've mentioned about handling multiple object files and archives.

Finally, you should give some thought to how archive members are printed. I think you could either have them printed as individual objects within a JSON object representing the whole archive, or just as regular files, but with the names disambiguated. I've gone with the latter approach below:

Regarding the structure of outputting archives, I am leaning towards your first suggestion which is outputting each archive member as a child object within an overall JSON object. I think this would allow the output to be consumed easier since then there's no need to parse the names of archive member to determine what archive it falls under. Given your example I would like to produce this:

{
    "object.o" : {
        "file-header" : { <output for file header> },
        "section-headers" : { <section header table> }
    },
    "archive.a" : {
        "member1.o": {
            "file-header" : { <output for file header> },
            "section-headers" : { <section header table> }
        },
        "member2.o": {
            "file-header" : { <output for file header> },
            "section-headers" : { <section header table> }
        },
    }
}

In terms of outputting as a single JSON object I agree that this seems like the right approach and it also pokes some holes into my current implementation. Since we would need to print things like the outer {} or object file names we need to output JSON at a higher level than just the JSONELFDumper. Since the json::OStream being used does validation that you are outputting valid json we would need to use the same json::OStream for any time we print since it will need to have the correct context (ex: ensuring you are printing attributes inside a parent object). To avoid needing to pass a json::OStream to every method that needs printing I'm considering creating a subclass of ScopedPrinter called JSONScopedPrinter which holds on to a json::OStream rather than a raw_ostream that way we can just create a JSONScopedPrinter and pass it as a ScopedPrinter to all necessary functions since they already accept a ScopedPrinter. This would allow me to still use the same json::OStream for all printing while also not heavily impacting the existing functions.

I'm hoping this sounds reasonable and I'm interested if you have any alternative ideas.

Thinking about it further, I think the JSON output stream is conceptually a higher-level output style than just for ELF files. In particular, you could imagine the future support being added for e.g. COFF and Mach-O objects (they'd have a different internal layout, but they'd still be objects to be printed using the JSON printer). Given that, it makes sense for this printer to be owned higher up the stack, i.e. as you're suggesting, where the ScopedPrinter currently lives. You'd still want a JSON object dumper implementation. Thus, you'd have two levels: your "ScopedPrinter" replacement would be responsible for printing the outer braces and object names (and handle archives probably too, although from a pure design aesthetic, an ArchiveDumper class might be nicer to handle this), with then individual ObjDumper instances handling printing the contenst of an object (including its corresponding outer braces). This is then nice, because the ObjDumper could be used to print a JSON object of the relevant type in other situations, not just within the outer ScopedPrinter instance. Ultimately, this sounds more-or-less like what you're proposing, which is great.

One other thought I had: I think we may need to be careful about duplicate objects in the input list: at the moment, they are dumped once per instance, i.e. llvm-readobj input.o input.o prints input.o twice. We probably should replicate this behaviour, which means we can't use object names as keys directly. I think this means we actually want a top-level array, with a name field in each object, which slightly breaks my design idea above. Alternatively, you could have something like:

[
  {
    "object.o" : {
      ... # Contents of object.o
    },
  }
...
]

There's essentially the same issue with archives: it's potentially possible for an archive to have two members with the same name.

Jaysonyan updated this revision to Diff 384231.Nov 2 2021, 2:55 PM
Jaysonyan edited the summary of this revision. (Show Details)

Add a base class ScopedPrinterBase and concrete class JSONScopedPrinter.
Update llvm-readobj to utilize the base class and concrete classes JSONScopedPrinter
and ScopedPrinter.

jhenderson added inline comments.Nov 3 2021, 1:59 AM
llvm/docs/CommandGuide/llvm-readelf.rst
74

I'd be inclined to remove the "not all options are..." bit, as that's already the case for GNU versus LLVM style, and it will avoid the comment rotting as functionality is added.

llvm/docs/CommandGuide/llvm-readobj.rst
189–190

Same comments.

llvm/include/llvm/Support/JSONScopedPrinter.h
24

Please add a blank line between the constructor and this function.

74

I'm not sure you need this particular function. const std::string & should implicitly convert to StringRef. (I see it was there before, but I'd see whether you can delete this overload entirely).

78

Ditto. const char * implicitly converts to StringRef.

113

Nit: delete blank line at end of class.

115

Nit: delete this blank line too.

llvm/include/llvm/Support/ScopedPrinter.h
82

Do you need this and the classof functions? As far as I can tell, they aren't used anywhere.

99–100

Up to you, but I don't think there's a particular need to make this constructor protected, since the class is an abstract class, so can't be directly instantiated.

103

Slight naming quibble: at the moment, it's not clear why ScopedPrinter is not the base class, and why JSONScopedPrinter is a sibling class (shares same base class) and not a derived class of ScopedPrinter. I'd make ScopedPrinter the base class, and call this derived class something else. Perhaps UnstructuredScopedPrinter or PlainScopedPrinter (other ideas welcome)?

llvm/tools/llvm-readobj/COFFDumper.cpp
250

Do you need this case purely because you're avoiding changing COFFDumper to own a ScopedPrinterBase rather than a ScopedPrinter, or does the COFFDumper make use of functionality not in the base class? If the former, just change to using the base class all the way. If the latter, that suggests there's something not quite right with our design, as ideally the dumpers shouldn't need to know which variety of dumper they're using - what functionality is in the subclass?

624

It looks to me like you've clang-formatted the whole file. Don't do this as part of this patch - only format the bits of the file you've changed (you may be able to get away with a separate patch to reformat the whole file though).

llvm/tools/llvm-readobj/ELFDumper.cpp
805

In a similar manner to the COFF dumper earlier, it would be ideal if you could avoid the cast, and just pass a ScopedPrinterBase instance into the ELFDumper constructors. This would allow you to avoid many of the changes you've made where you pass in a ScopedPrinter now, and decouple the printer style from the actual scoped printer (in some cases, this may be all you need to get some functions to work for JSON style).

7424

Don't bother with these sorts of comments: they don't add any long-term benefit (and will rot very quickly).

llvm/tools/llvm-readobj/ObjDumper.cpp
84

Perhaps do this slightly differently, so that we can leave a clear TODO here for JSON output:

if (auto *W = dyn_cast<JSONScopedPrinter>(WriterBase))
  continue; // TODO: implement for JSON printer.
ScopedPrinter *W = cast<ScopedPrinter>(WriterBase);

Ideally of course, we'd find a way to avoid needing the casting at all. This might be achievable by splitting the iteration away from the printing: keep the iteration in this function, but have a printStringListEntry function that is called per iteration, and is a virtual function in the ScopedPrinterBase class, implemented in the subclasses.

136

For this and the similar printSectionsAsHex, I wonder if it would make sense to make this a virtual function that the JSONELFDumper implements differently, again to avoid the casting. Thoughts?

llvm/tools/llvm-readobj/llvm-readobj.cpp
336–346

How about the following:

  1. Add a printFileHeader() function to the ScopedPrinterBase interface. For the JSONScopedPrinter, it does the objectBegin calls etc. For the ScopedPrinter, it does the LLVM/filename count checks (it might even do the later Format/Arch pritning too in this case).
  2. Add a endFile() function which is a noop for ScopedPrinter but for the JSONScopedPrinter does the objectEnd calls etc.

This will avoid needing additional checks for opts::JSON, and helps hide some of the details a little further - ideally once you've created a ScopedPrinterBase instance, you should be able to avoid ever referencing the subclasses.

348

I don't think you need this cast here?

632

I think it's probably fine to just leave this called Writer - the fact that it's a base class shouldn't be important to the later code.

634

In a similar manner to my earlier comment about files, how about startOutput and endOutput functions?

640

You should be able to avoid some of the changes with pointer versus reference, by doing *WriterBase.get();

Jaysonyan added inline comments.Nov 3 2021, 8:17 PM
llvm/include/llvm/Support/ScopedPrinter.h
82

I believe this is needed to perform dyn_cast. I was following the guide here: https://llvm.org/docs/HowToSetUpLLVMStyleRTTI.html

103

I agree, the naming feels a bit weird right now. I held off on renaming this class and making any changes to the interface since the ScopedPrinter is used outside of llvm-readobj and I didn't want this change to affect any other tool. Although if this is something that you think is fine to do, I'd be happy to change it. I think PlainScopedPrinter seems like an appropriate name.

llvm/tools/llvm-readobj/ObjDumper.cpp
136

I was hesitant to change too much of the ScopedPrinter interface to cater too closely to the needs of llvm-readobj since its used elsewhere as well.

llvm/tools/llvm-readobj/llvm-readobj.cpp
336–346

Similar to my feeling about adding a printSectionAsString and printSectionAsHex to the base class, I'm hesitant about adding methods to meet specific needs of llvm-readobj.

I assumed it would be nice to keep ScopedPrinter somewhat primitive but I'm interested if you have any thoughts on this?

To collate the common topics into one place, regarding additional virtual functions etc. I didn't register that ScopedPrinter is used outside the llvm-readobj code. Given that it is, I agree that modifying the base class to provide additional virtual functions would be undesirable. However, I did have another idea. Given that we ideally want a ScopedPrinter that has an interface suitable for llvm-readobj, how about we do the following:

  1. Leave the original ScopedPrinter largely untouched, except making the common methods like printNumber virtual, adding a virtual destructor etc.
  2. Within the llvm-readobj tree, create a new ReadobjScopedPrinter which is just an extension of the former, but with the additional methods added needed by llvm-readobj. This could be pure virtual, or provide default implementations of the extension functions.
  3. Change all usages of ScopedPrinter to ReadobjScopedPrinter within the llvm-readobj code.
  4. Change JSONScopedPrinter to derive from ReadobjScopedPrinter and implement the additional functions (and move it into the llvm-readobj code).

I'm not 100% happy with this last point. I feel like there's something in the Design Patterns space that would allow us to keep the JSON scoped printer generic, and provide a specialisation of it within llvm-readobj, but I haven't figured out how you'd get the llvm-readobj specific bit of the interface to work in this case yet.

llvm/include/llvm/Support/ScopedPrinter.h
82

Yes, you're right: I meant to come back and delete this comment after seeing the casting later on.

Jaysonyan added a comment.EditedNov 4 2021, 9:55 PM
  1. Change JSONScopedPrinter to derive from ReadobjScopedPrinter and implement the additional functions (and move it into the llvm-readobj code).

One thing that makes this implementation plan difficult is that JSONScopedPrinter needs to output valid json. There are some methods offered by ScopedPrinter that would make it difficult to ensure a JSONScopedPrinter prints valid json while still offering these methods. The most notable is startLine() which provides direct access to the underlying raw_ostream. If the JSONScopedPrinter inherits from ScopedPrinter I don't believe there is a way around not offering this method.

All implementations of ObjDumper utilize methods such as startLine() or perform other operations that may be undesirable/invalid to in json such as not enclosing each method in its own child object or having duplicate object keys (ex COFFDumper::printCOFFExports). So it may be preferable to keep the current structure and have ObjDumper implementations rely on a specific printer-type. Since passing a JSONScopedPrinter to any dumper that isn't the JSONELFDumper would likely produce invalid json.

Jaysonyan updated this revision to Diff 385129.Nov 5 2021, 10:54 AM

Update readelf and readobj docs wording
Misc formatting and renaming changes

Jaysonyan marked 13 inline comments as done.Nov 5 2021, 10:56 AM

@jhenderson I was wondering if you had any more thoughts on the design on ScopedPrinter/ ScopedPrinterBase/JSONScopedPrinter? As I see it right now, it's not very feasible to treat ScopedPrinter and JSONScopedPrinter uniformly due to the usage of methods like startLine(). So a parent-child structure between ScopedPrinter and JSONScopedPrinter might not be suitable.

@jhenderson I was wondering if you had any more thoughts on the design on ScopedPrinter/ ScopedPrinterBase/JSONScopedPrinter? As I see it right now, it's not very feasible to treat ScopedPrinter and JSONScopedPrinter uniformly due to the usage of methods like startLine(). So a parent-child structure between ScopedPrinter and JSONScopedPrinter might not be suitable.

Sorry, I'm a little bit busy, and have a number of other reviews that are calling for my attention too. I haven't had a chance to give this too much further thought yet. One initial thought I had was that things like printLine might just be a no-op for JSON output, although I think it might be nicer if we could wrap these sorts of calls behind another interface suitable for generic llvm-readobj usage. Unfortunately, I don't have the time right now to take a look at all usages, to see if there's a way we could represent such functions in a cleaner manner. I'll try to come back to this later this week, or next week, if I can.

Sorry, I'm a little bit busy, and have a number of other reviews that are calling for my attention too. I haven't had a chance to give this too much further thought yet. One initial thought I had was that things like printLine might just be a no-op for JSON output, although I think it might be nicer if we could wrap these sorts of calls behind another interface suitable for generic llvm-readobj usage. Unfortunately, I don't have the time right now to take a look at all usages, to see if there's a way we could represent such functions in a cleaner manner. I'll try to come back to this later this week, or next week, if I can.

That's totally understandable and I appreciate the time you're taking to review this change. For the sake of saving time, I'll provide all of my thoughts on the design as well as my reasoning behind them in the hopes that it could save some back and forth.

I think that having JSONScopedPrinter as a subclass of ScopedPrinter would have more downsides than upsides without major refactoring to how we interact with ScopedPrinter in llvm-readobj. I believe the biggest driving force to having JSONScopedPrinter as a subclass to ScopedPrinter is with the hope that each implementation of ObjDumper can be agnostic to the type of ScopedPrinter it is using (json or regular). Although I don't believe that this is possible for a few reasons:

  1. ScopedPrinter has a few template methods. I know template and virtual methods don't mix well. I'm not sure if there is a way around this that LLVM has used but I thought it'd be valuable to bring up if this is an issue.
  2. The startLine() method as mentioned previously, provides access to the underlying raw_ostream. While it is possible to provide this functionality in the JSONScopedPrinter (through calling json::OStream::rawValueBegin), it would be non-trivial to ensure that the usage of this results in valid json (we would also need to find a way to call json::OStream::rawValueEnd afterwards). Alternatively, making startline() in JSONScopedPrinter an no-op may result in necessary information being omitted.
  3. The structure of json can trip up certain usage of ScopedPrinter. One example is attributes inside of lists, here is an example of code inside LLVMElfDumper:
template <class ELFT> void LLVMELFDumper<ELFT>::printSectionHeaders() {
  ...
  if (opts::SectionSymbols) {
    ListScope D(W, "Symbols");
    ...
    for (const Elf_Sym &Sym : Symbols) {
      ...
      printSymbol(...);
    }
  ...
}

void LLVMELFDumper<ELFT>::printSymbol(...) {
  ...
  DictScope Group(W, "Symbol");
  W.printNumber("Name", FullSymbolName, Symbol.st_name);
  W.printHex("Value", Symbol.st_value);
  ...
}

So with a ScopedPrinter this could produce something like:

Symbols [
  Symbol {
    Name: foo (1)
    Value: 0x0
  }
  Symbol {
    Name: bar (2)
    Value: 0x0
  }
]

Although the equivalent json would not be valid due to attributes inside of a list:

"Symbols": [
  "Symbol": {
    "Name": "foo",
    "Value": "0x0"
  },
  "Symbol": {
    "Name": "bar",
    "Value": "0x0"
  }
]

So for these reasons I don't think we'd be able to have ScopedPrinter-agnostic ObjDumper implementation without significant refactors to how we interact with ScopedPrinter to ensure we're printing valid json while maintaining the current LLVM format remains unchanged. And if we don't receive this benefit I believe the downsides of having JSONScopedPrinter inherit from ScopedPrinter (JSONScopedPrinter needing offer all methods ScopedPrinter offers, many of which would be unused) outweigh the benefits.

Ultimately I think it would be ideal to reach a point where an ObjDumper can be printer-agnostic but it would require non-trivial refactors in most ObjDumper implementations to reach that point. And since currently our only need is for llvm-readelf to output json, I believe the current implementation is sufficient for addressing our needs.

I think that having JSONScopedPrinter as a subclass of ScopedPrinter would have more downsides than upsides without major refactoring to how we interact with ScopedPrinter in llvm-readobj. I believe the biggest driving force to having JSONScopedPrinter as a subclass to ScopedPrinter is with the hope that each implementation of ObjDumper can be agnostic to the type of ScopedPrinter it is using (json or regular). Although I don't believe that this is possible for a few reasons:

Since making this comment I've been thinking of possible solutions to these problems. I've uploaded D114052: Add JSONScopedPrinter as a subclass to ScopedPrinter with history stack and D114053: Add JSONScopedPrinter as a subclass to ScopedPrinter with array output which are two approaches that attempt to solve the 3 issues laid out in this comment. Hopefully either of those 2 PRs (or this one) will be a sufficient solution to meet our needs.

I think that having JSONScopedPrinter as a subclass of ScopedPrinter would have more downsides than upsides without major refactoring to how we interact with ScopedPrinter in llvm-readobj. I believe the biggest driving force to having JSONScopedPrinter as a subclass to ScopedPrinter is with the hope that each implementation of ObjDumper can be agnostic to the type of ScopedPrinter it is using (json or regular). Although I don't believe that this is possible for a few reasons:

Since making this comment I've been thinking of possible solutions to these problems. I've uploaded D114052: Add JSONScopedPrinter as a subclass to ScopedPrinter with history stack and D114053: Add JSONScopedPrinter as a subclass to ScopedPrinter with array output which are two approaches that attempt to solve the 3 issues laid out in this comment. Hopefully either of those 2 PRs (or this one) will be a sufficient solution to meet our needs.

Thanks for thinking about this, and sorry for not coming back sooner (I've been dealing with deadlines, but am now free for a bit, so hopefully will have a bit more time). I quite like the "history stack" idea of the two new approaches, and I think the solutions to the other cases are viable. Note however that your two new diffs appear to be identical? Either that or I'm missing where the problem 3 approach differs in the implementation.

Do you have a preference for an approach of the three?

Thanks for thinking about this, and sorry for not coming back sooner (I've been dealing with deadlines, but am now free for a bit, so hopefully will have a bit more time). I quite like the "history stack" idea of the two new approaches, and I think the solutions to the other cases are viable. Note however that your two new diffs appear to be identical? Either that or I'm missing where the problem 3 approach differs in the implementation.

Do you have a preference for an approach of the three?

The interface for ScopedPrinter can broadly be split into 2 categories. The first is methods that creates scopes, the second is printing values inside those scopes. The first categories consists of objectBegin(), arrayBegin(), objectBegin(attr), arrayBegin(attr) (and their respective end methods). The second category are methods such as printString(Label, Value) which print actual values, usually as a key-value pair. The goal for both the history
stack and array output approach is that when printing key-value pairs its inside the context of an object and when printing single values (ex: arrays, object, number, string), its inside the context of arrays.

The history stack approach utilizes saved context to conditionally output surrounding {} inside objectBegin(attr)/arrayBegin(attr) when inside the context of an array.
The array output approach doesn't hold on to any context but will rather always output arrays and enclosing {} for both objectBegin()/arrayBegin() and for all key-value pairs it will always output enclosing {}. This ensures all calls to ScopedPrinter output single values and will always be in the context of an array.

Personally I prefer the history stack approach as well, but recognize the duplicated work of keeping a history stack inside both JSONScopedPrinter and json::OStream. The benefit of the array output is that it doesn't need to do any of this work but it comes at the expense of a somewhat unintuitive output.

Thanks for thinking about this, and sorry for not coming back sooner (I've been dealing with deadlines, but am now free for a bit, so hopefully will have a bit more time). I quite like the "history stack" idea of the two new approaches, and I think the solutions to the other cases are viable. Note however that your two new diffs appear to be identical? Either that or I'm missing where the problem 3 approach differs in the implementation.

Do you have a preference for an approach of the three?

The interface for ScopedPrinter can broadly be split into 2 categories. The first is methods that creates scopes, the second is printing values inside those scopes. The first categories consists of objectBegin(), arrayBegin(), objectBegin(attr), arrayBegin(attr) (and their respective end methods). The second category are methods such as printString(Label, Value) which print actual values, usually as a key-value pair. The goal for both the history
stack and array output approach is that when printing key-value pairs its inside the context of an object and when printing single values (ex: arrays, object, number, string), its inside the context of arrays.

The history stack approach utilizes saved context to conditionally output surrounding {} inside objectBegin(attr)/arrayBegin(attr) when inside the context of an array.
The array output approach doesn't hold on to any context but will rather always output arrays and enclosing {} for both objectBegin()/arrayBegin() and for all key-value pairs it will always output enclosing {}. This ensures all calls to ScopedPrinter output single values and will always be in the context of an array.

Personally I prefer the history stack approach as well, but recognize the duplicated work of keeping a history stack inside both JSONScopedPrinter and json::OStream. The benefit of the array output is that it doesn't need to do any of this work but it comes at the expense of a somewhat unintuitive output.

Thanks. I'm not too fussed by this duplication, if there's no practical way of avoiding it, and I agree that the output is more logical in this form. I'm going to start reviewing the "history stack" version now.