This is an archive of the discontinued LLVM Phabricator instance.

[Symbolizer] Add flag to dump process context JSON from markup
Needs ReviewPublic

Authored by mysterymath on Mar 24 2023, 4:42 PM.

Details

Summary

This creates a simple JSON format for representing the process contexts
encountered by the markup filter and a flag, --dump-process-context, for
exporting it. Using this flag suppresses the usual symbolized log output
and instead emits this JSON. This provides a machine-readable
representation of the layouts of the processes that emitted the
marked-up logs.

Diff Detail

Event Timeline

mysterymath created this revision.Mar 24 2023, 4:42 PM
Herald added a project: Restricted Project. · View Herald Transcript
Herald added a subscriber: hiraditya. · View Herald Transcript
mysterymath requested review of this revision.Mar 24 2023, 4:42 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 24 2023, 4:42 PM

Minimize diff.

Add check for flag validity.

A few minor points from me that I spotted whilst skimming over this due to my herald rule pinging me. I am not familiar enough with the markup stuff to be able to review that.

llvm/test/DebugInfo/symbolize-filter-markup-dump-context.test
5–6 ↗(On Diff #508252)

You should check that the error message is the expected error message here. Otherwise, these could be failing for an unrelated reason and you wouldn't notice.

9 ↗(On Diff #508252)

CHECK-NEXT (or CHECK-SAME) here and below?

llvm/tools/llvm-symbolizer/Opts.td
34

Nit: for consistency with other options.

mysterymath marked 2 inline comments as done.

Address review comments.

Thanks. No more comments from me.

"context" as the name for this confuses me a bit - perhaps you can describe it more (maybe I'm jus tthinking about it wrong, or maybe it could benefit from a rename)

When I think of "dumping context" - I think of dumping the text surrounding the symbolizer markup, which it isn't doing. It looks like it's something like "dump markup as json", is that accurate? (& it drops/doesn't print out any of the non-markup content present in the input?)

phosek added inline comments.Mar 28 2023, 9:02 AM
llvm/lib/DebugInfo/Symbolize/MarkupFilter.cpp
451

For attribute naming, in JSON generated from LLVM tools we typically use either PascalCase or snake_case, camelCase is a bit unusual.

"context" as the name for this confuses me a bit - perhaps you can describe it more (maybe I'm jus tthinking about it wrong, or maybe it could benefit from a rename)

When I think of "dumping context" - I think of dumping the text surrounding the symbolizer markup, which it isn't doing. It looks like it's something like "dump markup as json", is that accurate? (& it drops/doesn't print out any of the non-markup content present in the input?)

The term "context" comes from the symbolizer markup format: https://llvm.org/docs/SymbolizerMarkupFormat.html#contextual-elements
In this case it's not actually dumping the presentation requests from the markup, it's dumping a completed representation of the process layout contained in the "contextual elements" of the markup. These contextualize virtual addresses to allow symbolizing them later; and it's that contextual information that it'd be useful to export and import into other tools.

mysterymath added inline comments.Mar 28 2023, 2:54 PM
llvm/lib/DebugInfo/Symbolize/MarkupFilter.cpp
451

I did a quick search around the codebase, and it doesn't look like there's a consistent convention; there's kebab case, pascal case, and camelCase all represented. Didn't really take the time to get accurate counts, but it does seem that various style guides suggest lower camel case:

https://google.github.io/styleguide/jsoncstyleguide.xml
https://jsonapi.org/recommendations/

Dump name field as well.

"context" as the name for this confuses me a bit - perhaps you can describe it more (maybe I'm jus tthinking about it wrong, or maybe it could benefit from a rename)

When I think of "dumping context" - I think of dumping the text surrounding the symbolizer markup, which it isn't doing. It looks like it's something like "dump markup as json", is that accurate? (& it drops/doesn't print out any of the non-markup content present in the input?)

The term "context" comes from the symbolizer markup format: https://llvm.org/docs/SymbolizerMarkupFormat.html#contextual-elements
In this case it's not actually dumping the presentation requests from the markup, it's dumping a completed representation of the process layout contained in the "contextual elements" of the markup. These contextualize virtual addresses to allow symbolizing them later; and it's that contextual information that it'd be useful to export and import into other tools.

I have the same confusion as well. If there are most significant contextual elements, can their names be mentioned in the help message of --dump-context? Just mentioning the element name should still make the message brief.

llvm/docs/CommandGuide/llvm-symbolizer.rst
247

Defined markup contexts briefly in the flag help message and more thoroughly in
the rst documentation.

mysterymath marked an inline comment as done.

Make --filter-markup a doc reference.

Rename "[markup] context" to "process context".

mysterymath retitled this revision from [Symbolizer] Add flag to dump symbolizer markup context JSON. to [Symbolizer] Add flag to dump process context JSON from markup.Apr 24 2023, 3:44 PM
mysterymath edited the summary of this revision. (Show Details)
mcgrathr added inline comments.Apr 28 2023, 10:24 PM
llvm/docs/CommandGuide/llvm-symbolizer.rst
248

I think these should be independent options: that is, you can produce filter output, or JSON output, or both.
To me, it seems reasonable to have the JSON switch take a separate explicit output file and have a separate flag that says whether or not to filtered markup output, e.g. --parse-markup that is implied by --filter-markup but can be specified separately to consume that input (and which is superfluous with --filter-markup and useless without either --filter-markup or --dump-context).

As I mentioned pedantically in another change, I think "process" is actually overly and unnecessarily specific here. Perhaps in the context (pun intention unspecified) of llvm-symbolizer, "context" alone is sufficient, but "address context" also seems pretty clear, specific, and precisely applicable to what matters about it to symbolization.

In the long run, I think it makes sense for this not to even imply --parse-markup because we can and should eventually have other sources for the same kind of information, such as switches like elfutils tools such as eu-unstrip have, for /proc/PID/maps (Linux format) text files, live process IDs (that on systems like Linux means read /proc/PID/maps), ELF core files that have NT_GNU_BUILD_ID notes themselves and/or PT_LOAD memory images with enough ELF data to reconstruct runtime layouts and build IDs post mortem, Linux kernel module layouts, and more not covered in elfutils, like minidump format or whatever else. Of course, there could be separate tools that produce this same JSON schema for each of those things, but for at least some of them building direct command-line switch support that can be used in llvm-symbolizer and other tools will make sense.

I think --parse-markup also has a nice parity with the coming switch to parse this JSON format as the input source.

Furthermore, --parse-markup alone is useful for traditional llvm-symbolizer uses without --filter-markup: I can use --parse-markup=logfile when feeding the symbolizer addresses manually on the command line (or via the stdin protocol if markup input and symbolization-request input formats can come from different explicit inputs since only one or the other can be stdin).

252

When we implement support for the trigger elements (dumpfile), that will be a separate reason to distinguish different context snapshot states. (In between reset elements, trigger elements need to be associated with the context snapshot formed from only the module+mmap elements that were emitted before the trigger element, not later additions/changes.)

So I think it makes sense to preemptively describe it here, and specify the JSON schema, as being an array of context snapshots. In future each one might contain descriptors for additional non-context elements that are to be interpreted in the context of the snapshot they appear in. In fact, it could be an option in the JSON dumper feature even now to include the bt et al that were mentioned as if they were trigger elements. That's really tantamount to implementing trigger elements and then adding another switch that says "make all the presentation elements act as trigger elements" too, so it might as well come after trigger elements. But in how we prepare the schema for future optional expansion, we should think about those cases now even if we take a while to implement them all.

llvm/include/llvm/DebugInfo/Symbolize/MarkupFilter.h
35

It seems like it might be a nicer structure to keep this just responsible for the parsing and not the dumping per se.
That is, instead of just a built-in flag here, this could be an optional callback to make with ProcessContext whenever a new one is ready, paired with a flag about whether to produce the transformed version of the input text (that could just be a nullable OS pointer here).

Then a separate ProcessContext->JSON layer can be trivially plumbed to build a callback to pass here.

When you add the JSON input option, then plumbing that to do JSON->ProcessContext->JSON and markup->ProcessContext->JSON and various such combinations as unit tests becomes very attractive.