This is an archive of the discontinued LLVM Phabricator instance.

[Symbolize] Parser for log symbolizer markup.
ClosedPublic

Authored by mysterymath on Apr 29 2022, 10:28 AM.

Details

Summary

This adds a parser for the log symbolizer markup format discussed in
https://discourse.llvm.org/t/rfc-log-symbolizer/61282. The parser
operates in a line-by-line fashion with minimal memory requirements.

This doesn't yet include support for multi-line tags or specific parsing
for ANSI X3.64 SGR control sequences, but it can be extended to do so.
The latter can also be relatively easily handled by examining the
resulting text elements.

Diff Detail

Event Timeline

mysterymath created this revision.Apr 29 2022, 10:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 29 2022, 10:28 AM

Fixup before code review.

Add tests for field parsing.

mysterymath published this revision for review.Apr 29 2022, 12:36 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 29 2022, 12:36 PM

Fix typos in comments; expand a bit.

Use SmallVector for Buffer; allow main loop to run malloc-free.

CurEntry => NextIdx

Fix parsing for empty field.

My apologies for the delay in responding. I've only taken a look at this patch alone. Not looked forward to D124798 and D126980. In general looks good to me. I've a few small suggestions, mostly in naming and comments.

llvm/include/llvm/DebugInfo/Symbolize/Markup.h
12

Although not for this patch, I note that in D126980 the docs for the SymbolizerFormat are added, it would be good to put a link as that specification is needed to understand the code. Perhaps worth adding a TODO for this patch.

34

I got a bit confused with the name MarkupElement. Looking at the spec in https://fuchsia.dev/fuchsia-src/reference/kernel/symbolizer_markup and the An element of symbolizer markup I was expecting MarkupElement to be just a MarkupElement of the form {{{tag:fields}}} but it looks like this is a general case of (IIUC):

  • TextElement: a string containing no Markup Elements or SGR codes.
  • SGRElement: a special case of Text Element that contains a single SGR control code.
  • MarkupElement: Also has Tag and Fields.

Maybe worth finding a different word to element to distinguish? Something like MarkupPiece or MarkupComponent? If we do change it will be worth updating the comment. I see that you've used just element in other comments. Perhaps just Element.

61

An alternative behaviour could be that calling parseLine() without iterating through nextLine() would be to reset NextIdx and clear the Buffer. You'd then not leave the Buffer and NextIdx in an inconsistent state.

Not a strong opinion as I expect the number of users of the parser to be low.

llvm/lib/DebugInfo/Symbolize/Markup.cpp
38

Similar to the comment I made in the header. If the Buffer isn't empty I think you could do the equivalent of:

NextIdx.reset();
Buffer.clear();
56

IIUC this is looking specifically for a MarkupElement of the form {{{tag}}} or {{{tag:fields}}} perhaps worth renaming to parseMarkupElement(). Could be worth putting in the comment about the expected syntax of the Markup.

76

I expect a common user error could be an upper case tag. Is it worth a warning. This may not be the right place for it though.

96

Sentence in comment looks unfinished. Perhaps so these are added as individual elements.

mysterymath marked 6 inline comments as done.

Address review comments.

llvm/include/llvm/DebugInfo/Symbolize/Markup.h
34

I was also split about this naming convention; markup languages like to describe themselves as composed of markup elements, but they're not, they're also composed of all the stuff between the elements.
I'd considered just "Element", but that would give this the name "llvm::symbolize::Element", which is a bit too broadly named for what this does.

I did a very quick survey of what other SAX-ish HTML parsers do, it looks like they've settled on calling the abstract type a Node, while an actual element is an Element or Tag, which subclasses Node.
The usage of Node here would be akin to linked list nodes: an ordered collection of units forming a stream.

61

Ah, that's cleaner; then you can stop handling nodes as soon as you decide you don't care about the rest. I can actually make use of that in the filter later.

llvm/lib/DebugInfo/Symbolize/Markup.cpp
56

Renaming MarkupElement to MarkupNode should make this clearer; with that, Element exclusively means a markup element.

76

I thought a bit more about this, and it seems like this is a distinction that doesn't belong in the determination of whether or not the given text is an element, but whether the element is well-formed.
This is more akin to the handling of things like integer parsing errors, which occur later, where we can give good error messages.
Accordingly, I've pulled this out of this change, to be added into the third of the series.

peter.smith accepted this revision.Jun 17 2022, 12:58 AM

LGTM thanks for the updates. I spotted what looked like one comment update for the change to Node, but that can easily be fixed up if needed.

llvm/include/llvm/DebugInfo/Symbolize/Markup.h
35

I think and following elements. could be and following nodes.

This revision is now accepted and ready to land.Jun 17 2022, 12:58 AM

LGTM thanks for the updates. I spotted what looked like one comment update for the change to Node, but that can easily be fixed up if needed.

Nice catch, thanks!

Fix missed element->node comment update.

This revision was landed with ongoing or failed builds.Jun 17 2022, 10:26 AM
This revision was automatically updated to reflect the committed changes.

Is it already planned to add this markup to LLVM's own crash dumps when they aren't already symbolized? Might be a handy use-case for the feature, providing some built-in-to-llvm exercisizing/advertising/experience with the feature? (if the markup is meant to be compatible with human readers who might not be able to symbolize the data later - which seems like a nice feature too)

Is it already planned to add this markup to LLVM's own crash dumps when they aren't already symbolized? Might be a handy use-case for the feature, providing some built-in-to-llvm exercisizing/advertising/experience with the feature? (if the markup is meant to be compatible with human readers who might not be able to symbolize the data later - which seems like a nice feature too)

We've had a few discussions about this; I think broadly yes, although we haven't hashed out the full details yet. One of the options would be to always emit symbolizer markup in LLVM, then pass it to a forked llvm-symbolizer instance if one can be found. That way, you'd get online symbolization if possible, but graceful degradation to markup if not.

Is it already planned to add this markup to LLVM's own crash dumps when they aren't already symbolized? Might be a handy use-case for the feature, providing some built-in-to-llvm exercisizing/advertising/experience with the feature? (if the markup is meant to be compatible with human readers who might not be able to symbolize the data later - which seems like a nice feature too)

We've had a few discussions about this; I think broadly yes, although we haven't hashed out the full details yet. One of the options would be to always emit symbolizer markup in LLVM, then pass it to a forked llvm-symbolizer instance if one can be found. That way, you'd get online symbolization if possible, but graceful degradation to markup if not.

Ah, that makes sense - yeah, I was wondering/figuring the markup probably wasn't human readable, so always passing it through the processor, which can either add the symbolizing info, or at least remove the markup if there's no debug info/other symbolizing to do, etc.