This is an archive of the discontinued LLVM Phabricator instance.

Add support for YAML output style to llvm-symbolizer
Needs ReviewPublic

Authored by aorlov on Feb 8 2021, 1:20 PM.

Details

Summary

This patch adds YAML output style to llvm-symbolizer to better support CLI automation by providing a machine readable output.

Diff Detail

Event Timeline

aorlov created this revision.Feb 8 2021, 1:20 PM
aorlov requested review of this revision.Feb 8 2021, 1:20 PM

Generally doesn't seem too bad to me, but I hope to hear others weigh in as well.

Is this all of it? or are there more patches you're expecting to need for this to meet your needs?

Please remember to upload your diffs with full context, as per this article: https://llvm.org/docs/Phabricator.html.

You'll need to document the new output style in the llvm-symbolizer and llvm-addr2line documentation located in llvm\docs\CommandGuide, preferably with one or more examples.

How does this new output style interact with existing options (I'm thinking -p, -a, -i and any other option that affects the output)? What about DATA etc output? You need testing which shows this interaction.

You also need testing for addresses not found.

If we're going to add support for this, I don't think in YAML output mode that it's a good idea to print lines of output that aren't part of the YAML (in this case the "some text" lines which correspond to invalid input addresses).

llvm/lib/DebugInfo/Symbolize/DIPrinter.cpp
42

What's this for? clang-tidy is complaining it is unused.

166

i -> I to match LLVM style.

llvm/test/tools/llvm-symbolizer/output-style-yaml.test
2–26

It's somewhat traditional for tests I've seen to have comment markers for all lit and FileCheck lines, plus genuine comments, even if they aren't really needed. This makes it easier to maintain them going forwards too.

In this case, I'd add ## for the real comment line, and # before each RUN and CHECK line (with the space between # and the start of the rest of the line in all cases).

4

Add a matching line for llvm-addr2line --output-style=YAML too, to show the different default can still be overridden.

Thanks for the comments, James! I’ll post the updated patch shortly.

David, yes, the change seems relatively small and looks clean with all the YAML support we already have in LLVM.

I expect a couple more patches. One will add a few additional data fields requested by our users, and then, I think, something will come up upon real world usage.

aorlov updated this revision to Diff 322788.Feb 10 2021, 12:35 PM

I have updated the patch. It

  • addresses the review comments,
  • adds more tests for YAML output style similar to the existing tests we have for other output styles. In particular output-style-yaml-data.test is based on untag-addresses.test, output-style-yaml-frame.test is based on frame-types.s.
  • updates llvm-symbolizer documentation.

Unless I’m missing something, addr2line documentation is not affected, as it just provides a reference to the llvm-symbolizer doc, which is changed by this patch.

By the way, while looking at the addr2line documentation, I have noticed that the documentation does not match the code. In particular, -f flag is ignored for llvm-adr2line. This issue is out of the scope of this patch, though.

I've still got to review the testing further, but here are some comments for now.

As a tip, once you have addressed an inline comment, mark it as done, by clicking the checkbox on the comment, before uploading your latest patch version, as it will make it easier to see what you've attempted to address already.

llvm/lib/DebugInfo/Symbolize/DIPrinter.cpp
43

Please reformat using clang-format, as per the linter comment.

Why add the llvm:: namespace qualifier? You're already in the llvm namespace and use DILineInfo in the function signature without it.

Same goes elsewhere below.

116

Re. the comment - I wonder if that's a bug in the << implementation? Does it need a const adding to its signature, or does that break other things?

llvm/test/tools/llvm-symbolizer/output-style-yaml-data.test
2
8

This could be simplified slightly, I believe, as shown in the inline edit.

It would also be a good idea to add --strict-whitespace, --match-full-lines and --implicit-check-not={{.}} to ensure that the output tested exactly matches what you expect.

--strict-whitespace ensures strings of multiple whitespace characters are not converted to a single space (in both input and check patterns), --match-full-lines ensures each check line matches the entirety of an input line, and --implicit-check-not in this case ensures there's no output apart from what is being explicitly checked for.

10–13

A slight snag with the above suggestion is that the space between the ':' and the start of the check needs to go. You can make it look a little nicer by indenting the first CHECK line to line up the colons as shown in the inline edit.

llvm/test/tools/llvm-symbolizer/output-style-yaml-frame.test
2
8

Same comments as the DATA test apply here.

The line based output can be straightforwardly parsed, e.g. https://github.com/google/pprof/blob/master/internal/binutils/addr2liner_llvm.go#L113

There is a question on https://github.com/google/pprof/issues/606 : why is JSON picked over YAML? "JSON is a more common choice for machine-readable output. YAML is used more as a configuration language - i.e. human-read and -written."

aorlov marked 8 inline comments as done.Feb 14 2021, 9:32 PM
aorlov added inline comments.
llvm/lib/DebugInfo/Symbolize/DIPrinter.cpp
43

I added redundant llvm:: to see if that would make a pre-commit Debian bot happy. It didn’t, so I removed.

116

You are right, this lookd strange, however changing this is out of the scope of this patch.
My solution is based on the following code
https://github.com/llvm/llvm-project/blob/main/clang-tools-extra/clangd/index/YAMLSerialization.cpp#L493

aorlov updated this revision to Diff 323663.Feb 14 2021, 9:38 PM
aorlov marked an inline comment as done.

The line based output can be straightforwardly parsed

And this is fine. I do not see how this patch could prevent your users from parsing the line based output. It is quite the opposite, they are even better protected, as any change in YAML output style would not change what they depend on.

The line based output can be straightforwardly parsed

And this is fine. I do not see how this patch could prevent your users from parsing the line based output. It is quite the opposite, they are even better protected, as any change in YAML output style would not change what they depend on.

It is not that this patch could prevent my users. It is my question about whether the YAML output adds new value.

The two cover the spectrum I can think of. If really we need an interchange format (I have some doubts, after asking others, so it is not my own opinion), JSON seems a better choice.

The line based output can be straightforwardly parsed

I don't really have any particular stake in this either way. I know our users in the past have wanted GNU addr2line-compatible output, which implies that they parse the output directly, and already have scripts to do so. Indeed the symbolizer is possibly the most commonly used tool from the smaller tools used by our end users. On the one hand, a JSON or YAML output would make new users' jobs simpler because it's easier to work with such output than having to manually write a parser. On the other hand, as you've pointed out, writing the parser for it isn't that complicated. Is it really worth us carrying the maintenance burden for this additional output style? I'm not opposed if there have been end user requests for it.

  • Users want a lot of flexibility - they should use the DebugInfo/Symbolize APIs.

This only works if a user is writing their parsing in C++, right? That being said, I don't know if there are any real users for this situation who wouldn't be building using the LLVM libraries.

The two cover the spectrum I can think of. If really we need an interchange format (I have some doubts, after asking others, so it is not my own opinion), JSON seems a better choice.

Having given it more thought, +1 to using JSON over YAML. JSON has the advantage that it can be consumed directly by native python, without needing additional modules, and python is a regular choice for people writing scripts to parse output like this.

The two cover the spectrum I can think of. If really we need an interchange format (I have some doubts, after asking others, so it is not my own opinion), JSON seems a better choice.

Having given it more thought, +1 to using JSON over YAML. JSON has the advantage that it can be consumed directly by native python, without needing additional modules, and python is a regular choice for people writing scripts to parse output like this.

Given we already have YAML APIs in LLVM, there's some convenience/reduced cost (for perhaps an already marginal use case, I'd rather keep the code complexity lower - and this seems pretty low at the moment) to sticking with that, I think? Or is there some equivalently tidy way to emit JSON?

The two cover the spectrum I can think of. If really we need an interchange format (I have some doubts, after asking others, so it is not my own opinion), JSON seems a better choice.

Having given it more thought, +1 to using JSON over YAML. JSON has the advantage that it can be consumed directly by native python, without needing additional modules, and python is a regular choice for people writing scripts to parse output like this.

Given we already have YAML APIs in LLVM, there's some convenience/reduced cost (for perhaps an already marginal use case, I'd rather keep the code complexity lower - and this seems pretty low at the moment) to sticking with that, I think? Or is there some equivalently tidy way to emit JSON?

I've not looked too much in detail, but there's JSON.h in the Support library. A casual glance suggests this will have most of what is needed, so using JSON shouldn't be significantly different in terms of additional complexity, compared to using YAML.

Here is the patch for supporting JSON - https://reviews.llvm.org/D96883
Now we can look at the changes side by side and decide which one is the way to go.

Thanks for reviewing!

The two cover the spectrum I can think of. If really we need an interchange format (I have some doubts, after asking others, so it is not my own opinion), JSON seems a better choice.

Having given it more thought, +1 to using JSON over YAML. JSON has the advantage that it can be consumed directly by native python, without needing additional modules, and python is a regular choice for people writing scripts to parse output like this.

Given we already have YAML APIs in LLVM, there's some convenience/reduced cost (for perhaps an already marginal use case, I'd rather keep the code complexity lower - and this seems pretty low at the moment) to sticking with that, I think? Or is there some equivalently tidy way to emit JSON?

I've not looked too much in detail, but there's JSON.h in the Support library. A casual glance suggests this will have most of what is needed, so using JSON shouldn't be significantly different in terms of additional complexity, compared to using YAML.

Oh, fair enough - I don't mind either way, then.

ormris removed a subscriber: ormris.Jun 3 2021, 10:55 AM