This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] ExplodedGraph: printDiff(): ProgramState implementation
AbandonedPublic

Authored by Charusso on May 8 2019, 6:25 AM.

Details

Summary

The ExplodedGraph has too much repetitive information. This patch
introduces a new type of print() called printDiff() which should be
implemented in most of the graph-printing methods.

printDiff() takes the current and a previous state of the given object
and only prints out the additional differences. It is useful to print out
each node of the graph and only show the new information compared to the
previous node.

This new behaviour is optional, the flag is diff-egraph and could be
used as trim-egraph.

Diff Detail

Event Timeline

Charusso created this revision.May 8 2019, 6:25 AM
Charusso marked an inline comment as done.May 8 2019, 6:46 AM
Charusso added inline comments.
clang/lib/StaticAnalyzer/Core/ExprEngine.cpp
3040

It has been created a separator before the Store, but if we decide to do not write out empty maps / no differences the separation should be driven by each printDiff().

NoQ added a comment.May 8 2019, 12:15 PM

Aha, ook, so you're going for diffing within clang. That'd require duplicating the diff work in every checker (i.e., in every implementation of Checker::printState()), which is probably fine. I suspect that the solution of dumping the graph to, say, JSON or YAML and then converting it to DOT with a python script (where we can easily implement diffing in a generic way, regardless of what sort of data we're diffing) would have taken less code (and would have additionally helped us with serializing the program state for the (crazy) purposes of summary-based cross-translation-unit analysis), so i would have slightly preferred this direction, but it doesn't sound too heavy anyway, so i don't mind^^

We actually already have some tests for our exploded graph dumps, eg. test/Analysis/dump_egraph.c, i think you can test your patches in a similar manner. The trick is to use -dump-egraph instead of -analyzer-viz-egraph-graphviz because it accepts a target path instead of coming up with a random path.

clang/lib/StaticAnalyzer/Core/ExprEngine.cpp
3058

The choice of the first predecessor here is fairly arbitrary. Ideally this would have been a merge-diff (like a diff for merge commits in git), but we might as well print the full state when there are multiple predecessors.

I'm generally in favor of printing the full state even in diff mode for:

  • Merges,
  • Leafs,
  • Non-fatal error nodes,
  • The root (it's probably empty anyway).

Like, generally, in order to keep the graph readable, it should be possible to be able to recover the full state fairly easily. Having a full state in merge nodes would most likely help a lot. There are also not that many of them, so maybe we should also print the full state before splits. Another potential solution would be to use GraphViz labels/tooltips to show the full state of every node, which means that the graph dump would still be huge, but hopefully it'll be much easier to render.

Charusso planned changes to this revision.EditedMay 8 2019, 2:03 PM
Charusso marked an inline comment as done.

Thanks for the ideas! I am looking forward for that JSON formatting as it sounds good.

In D61677#1495563, @NoQ wrote:

Aha, ook, so you're going for diffing within clang. That'd require duplicating the diff work in every checker (i.e., in every implementation of Checker::printState()), which is probably fine. I suspect that the solution of dumping the graph to, say, JSON or YAML and then converting it to DOT with a python script

At first my idea was that to compare strings, but I think it is not flexible enough (even it sounds the most flexible way) because you lost tons of information from Clang. My idea here is to use the raw exploded graph to print the given full size of every node as an empty skeleton, and only render the diff-mode notes. Put a checkbox to every node to switch between the raw and the diff mode easily. Also the checker part will not be differentiated because there are too many of them and the heavier is the non-checker part for rendering and navigating, so I think it is not a problem.

My main goal is to convert DOT to HTML, and I do not know too much about CTU-stuff here. My job is to create a more understandable graph, and the DOT to SVG generates the necessary dimensions and offsets of nodes. I am not really into this string-parsing-world, so I decided to create the smallest change and use what we already have.

The problem with JSON to DOT is DOT to SVG generates tons of duplicated code but that is the graph-draw utility, so we have to involve it. If I remember right the HTML really supports JSON and could be used to inject data into my empty full-sized raw graph idea. So may we could generate the current DOT what is my plan with the SVG graph-skeleton and also the JSON what is your request and the Python script mixes the SVG and the JSON into a flexible HTML with tons of JavaScript and craziness involved in.


My research so far with SVG to HTML with Python:

HTML output (check out the source): https://mathieu.fenniak.net/assets/idempotent/idempotent_prototype_4.html
Script used for generating that: https://gist.github.com/mfenniak/5532463

PS: I do not like XML and nor YAML. I think the JSON format is the most reliable representation for scripting with my HTML knowledge. I am not really sure which tool will be used to generate the HTML output but I have asked a Python guy and that ElementTree of XML representation looks really promising as the SVG is an XML:
https://docs.python.org/2/library/xml.etree.elementtree.html
(PPS: I have not written any industrial-level Python yet, and I am lost into that string-parsing-world.)


We actually already have some tests for our exploded graph dumps

I have seen that as I have broken it once. I will introduce a test case with that patch.

clang/lib/StaticAnalyzer/Core/ExprEngine.cpp
3058

That is why I do not recommend comparing just two strings. We could modify the states easily inside the core. I have not seen yet a merge-diff, but because the representation is not just two string it should be easy to implement. I will implement that after printDiff() patches.

Charusso marked an inline comment as done.May 8 2019, 2:38 PM
Charusso added inline comments.
clang/lib/StaticAnalyzer/Core/ExprEngine.cpp
3058

Oh I missed the labeling idea as it is too crazy but I really like that. May checkboxes and huge empty nodes would looks like too silly and difficult to load.

NoQ added a comment.May 9 2019, 12:42 PM

Ok, here's a plan:

  • Completely screw current program point and program state dumps and transform them into JSON dumps entirely while keeping it relatively human-readable when it comes to whitespace. This isn't much less readable than the existing dumps but it's also fully machine-readable, and we don't have to maintain two separate dumping facilities. Eg.:
{
  "program_points" : [
    { "kind" : "BinaryOperator", "file" : "test.c", "line" : 13, "col" : 7, "pretty" : "a + b" },
  ],
  "program_state" : [
    { "trait": "Environment", "items": [
      { "location_context" : 3, "calling" : "foo() on line 25", "items" : [
        { "pretty" : "a", "value" : "reg_$1<b>" },
        { "pretty" : "b", "value" : "reg_$0<a>" },
        { "pretty" : "a + b", "value" : "reg_$0<a> + reg_$1<b>" },
      ]},
      { "location_context" : 1, "calling" : "bar()", "items": [
        { "pretty" : "foo", "value" : "&code<foo>" },
      ]}
    ]},
    { "trait" : "Store", "items" : [
      { "cluster" : "SymRegion{reg_$2<x>}", "items" : [
        { "kind" : "Direct", "offset" : 0, "value" : "reg_$0<a> + reg_$1<b>" },
        { "kind" : "Direct", "offset" : 32, "value" : "reg_$1<b> - reg_$0<a>" },
      ]},
    ]},
  ]
}
  • Separately, make a DOT-to-DOT converter in python that'll prettify the dumps. It wouldn't involve much raw string parsing, just basic DOT and JSON parsing. I can help with that because i already have some code.
  • Eventually teach this converter to produce fancy dynamic HTML interfaces.
In D61677#1497130, @NoQ wrote:
  • Completely screw current program point and program state dumps and transform them into JSON dumps entirely while keeping it relatively human-readable when it comes to whitespace. This isn't much less readable than the existing dumps but it's also fully machine-readable, and we don't have to maintain two separate dumping facilities.

Would you introduce printJSON() instead of printDiff() in every graph-printer method then just dumping it out into a file? I do not see any generic way.

  • Separately, make a DOT-to-DOT converter in python that'll prettify the dumps. It wouldn't involve much raw string parsing, just basic DOT and JSON parsing. I can help with that because i already have some code.
  • Eventually teach this converter to produce fancy dynamic HTML interfaces.

Could you explain a little bit more about DOT-to-DOT? So when we have a JSON-dump for both CTU purposes, I only could imagine that:

We could have a Python script which takes the raw current SVG and the new JSON representation, copies the necessary dimensions from SVG and all the text from JSON, then mixes it into a dynamic HTML with all these magic differentiating and coloring. I hate that DOT format so that is why printDiff() introduced without the knowledge about possible JSON dumping.

In addition: If I am right you would use the CTU-level more complex JSON format with tons of additional stuff compared to the current raw ugly DOT dump. So forget about the current DOT as it useless compared to the JSON and convert the JSON into DOT so to SVG and HTML?

NoQ added a comment.May 9 2019, 2:22 PM

I propose neither implementing ProgramState::printJSON() nor ProgramState::printDiff() but instead replacing the current implementation of ProgramState::print() with a brand-new implementation that prints JSON which is both human-readable and machine-readable.

The rest remains the same: these prints are still used both in manual debug dumps and in exploded graph dumps.

Don't worry about CTU for now; that's way too far from being real. I just imagine that we'll want some sort of serialization for it.

Then i propose making a tool that replaces JSON curly braces with GraphViz's HTML tags for prettier looks.

Charusso abandoned this revision.May 9 2019, 2:43 PM
In D61677#1497292, @NoQ wrote:

I propose neither implementing ProgramState::printJSON() nor ProgramState::printDiff() but instead replacing the current implementation of ProgramState::print()
Then i propose making a tool that replaces JSON curly braces with GraphViz's HTML tags for prettier looks.

I see, now I get it. Thanks for the idea! Sadly Graphviz has a very limited HTML usage so this tool has to do a lot more. I like that idea to have machine code inside the graph so I can parse it and modify it whatever I like without any extra overhead of very complex LLVM file-handling/command execution. I will mention the source of these ideas in the actual patch.

In D61677#1497130, @NoQ wrote:
{ "trait" : "Store", "items" : [
  { "cluster" : "SymRegion{reg_$2<x>}", "items" : [
    { "kind" : "Direct", "offset" : 0, "value" : "reg_$0<a> + reg_$1<b>" },

What is a cluster here? Do we need the ID of the store? What about more new-lines? E.g.

"region": "HeapSymRegion{conj_$2{void *, LC21, S1655, #1}}",
"kind": "Default",
"value": "Undefined"

PS: I mean we do not write it out and I am not sure whether it is the first/last of the ClusterBindings or what.

NoQ added a comment.May 13 2019, 2:44 PM

What is a cluster here?

It's a group of bindings within a single base region. Cf. RegionStore.cpp:

class BindingKey {
  const MemRegion *getBaseRegion() const { ... }
  bool isDirect() const { ... }
  uint64_t getOffset() const { ... }
  ...
};
typedef llvm::ImmutableMap<BindingKey,        SVal>            ClusterBindings;
typedef llvm::ImmutableMap<const MemRegion *, ClusterBindings> RegionBindings;

Do we need the ID of the store?

We could keep it, but i don't think i ever used it.

What about more new-lines?

I think we should keep one item (eg., one binding) per line in order to keep it human-readable.

Generally, when in doubt, let's produce something as similar to the old output as possible.