Page MenuHomePhabricator

[WIP] Dumping the AST to JSON
ClosedPublic

Authored by aaron.ballman on Apr 19 2019, 11:52 AM.

Details

Summary

This is a work in progress patch that adds the ability to specify an AST dump format on the command line. By default, we continue to dump the AST with its usual tree view, but users can now optionally pass -ast-dump=json to dump to a machine-readable JSON format that makes it easier for third parties to consume the Clang AST in some fashion.

The patch can currently handle dumping a fair amount of declaration information, some statements, and very few expressions. I got it to the point where it was showing useful output in roughly the correct format, but I wanted to get community feedback for continuing the implementation. Once the current approach gains consensus, my plan is to commit the WIP and then do subsequent commits with post-commit review to finish the implementation (unless the changes are somehow interesting enough to warrant pre-commit review, of course).

The hybrid approach of using some LLVM JSON functionality and some streaming functionality is purposeful for performance reasons (collecting the entire AST into memory in a second form means ~2x the memory usage for the AST, which can be prohibitive for large compilation units). Testing this functionality with FileCheck is quite verbose, so if someone has suggestions for a better way to test the JSON output, I'd be happy to consider it.

Diff Detail

Event Timeline

aaron.ballman created this revision.Apr 19 2019, 11:52 AM

A few comments/questions:

  1. How stable is the format going to be, and how much state is going to be exposed ?
  2. I am wondering if it would be possible to separate the logic about how to visit a given AST node, from the logic about what to do with the node's state. For example it seems to me that JSONNodeDumper::VisitVarDecl and TextNodeDumper::VisitVarDecl are somewhat similar. You could instead imagine having a generic way to visit each node, which would then inform JSONNodeDumper or TextNodeDumper about the various bits of state in the node. With the proposed implementation essentially all of the logic for what is in each node has to be duplicated. This is not an objection about the current patch, just something I am wondering about.
include/clang/AST/JSONNodeDumper.h
56

Perhaps you should perfect-forward DoAddChild ?

165

@lebedev.ri might want to comment on the use of llvm::json. I think that there has been issues in the recent past where it was unexpectedly slow (although it might not matter for this use case).

lib/AST/ASTDumper.cpp
231

(Ah, Yoda conditionals ! (I am not saying to change it))

Do you plan to do the same modification to the others dump methods ?

aaron.ballman marked 2 inline comments as done.Apr 20 2019, 9:06 AM

A few comments/questions:

  1. How stable is the format going to be, and how much state is going to be exposed ?

I don't expect it to be particularly stable (I'm not suggesting we conform to some particular schema) and for my own needs, I'm happy enough if new Clang releases break us due to new, deleted, or renamed information. I need to expose as much state as required for our research team to import the data and reason about the AST; I'd guess it's going to be roughly the same as the textual dumper, but different in spots.

  1. I am wondering if it would be possible to separate the logic about how to visit a given AST node, from the logic about what to do with the node's state. For example it seems to me that JSONNodeDumper::VisitVarDecl and TextNodeDumper::VisitVarDecl are somewhat similar. You could instead imagine having a generic way to visit each node, which would then inform JSONNodeDumper or TextNodeDumper about the various bits of state in the node. With the proposed implementation essentially all of the logic for what is in each node has to be duplicated. This is not an objection about the current patch, just something I am wondering about.

This was something I had been pondering when we were doing the great AST dump refactoring earlier this year, but I didn't think we could get it to be sufficiently general to suit various purposes. On the one hand, there's definitely a ton of overlap between any two "dump the AST in this format" approaches, but on the other hand, different approaches have different requirements that may change what data is displayed. For instance, the textual dump will likely write out more information than the JSON dump because the textual dump is purely for human readability (so having extra information near the node may be critical) whereas the JSON dump is for machine readability (so having extra information may or may not be prohibitive, but hooking up disparate data within the file may be trivial).

include/clang/AST/JSONNodeDumper.h
165

I'm currently only using the JSON functionality for wholly-contained "leaf" nodes, so I don't think there will be a strong performance concern here from using it. However, there's no strong requirement that I continue to use it (I just thought it was a bit cleaner), so I could also replace it with the streaming functionality I use elsewhere.

lib/AST/ASTDumper.cpp
231

Yes, I do (though probably not all of them, like dumpColor())

A few comments/questions:

  1. How stable is the format going to be, and how much state is going to be exposed ?

I don't expect it to be particularly stable (I'm not suggesting we conform to some particular schema) and for my own needs, I'm happy enough if new Clang releases break us due to new, deleted, or renamed information. I need to expose as much state as required for our research team to import the data and reason about the AST; I'd guess it's going to be roughly the same as the textual dumper, but different in spots.

I am fine with that.

  1. I am wondering if it would be possible to separate the logic about how to visit a given AST node, from the logic about what to do with the node's state. For example it seems to me that JSONNodeDumper::VisitVarDecl and TextNodeDumper::VisitVarDecl are somewhat similar. You could instead imagine having a generic way to visit each node, which would then inform JSONNodeDumper or TextNodeDumper about the various bits of state in the node. With the proposed implementation essentially all of the logic for what is in each node has to be duplicated. This is not an objection about the current patch, just something I am wondering about.

This was something I had been pondering when we were doing the great AST dump refactoring earlier this year, but I didn't think we could get it to be sufficiently general to suit various purposes. On the one hand, there's definitely a ton of overlap between any two "dump the AST in this format" approaches, but on the other hand, different approaches have different requirements that may change what data is displayed. For instance, the textual dump will likely write out more information than the JSON dump because the textual dump is purely for human readability (so having extra information near the node may be critical) whereas the JSON dump is for machine readability (so having extra information may or may not be prohibitive, but hooking up disparate data within the file may be trivial).

I see, thanks for the explanation.

Someone who is actually going to use the JSON output should probably comment on this patch, but as far as I am concerned I have no objection or additional comment. This is a self-contained feature which is not going to impact anyone else.

Rebased to master + small amount of additional functionality.

Since it seems this is non-controversial, I will probably commit after next week unless reviewers bring up concerns in the meantime.

Looks good, but just a couple of questions.

include/clang/AST/JSONNodeDumper.h
74

Just curious, since you clear Indentation on the previous line, do you need it here?

include/clang/Frontend/FrontendOptions.h
311

Why can't you use the enum defined in clang/AST/ASTDumperUtils.h? Seems to make the dump() call below unnecessarily ugly.

aaron.ballman marked 3 inline comments as done.Apr 25 2019, 10:02 AM
aaron.ballman added inline comments.
include/clang/AST/JSONNodeDumper.h
74

Definitely not required -- I had it there for consistency with other streaming output, but it's spurious. I'll remove.

include/clang/Frontend/FrontendOptions.h
311

I thought there would be a layering violation if AST and Frontend colluded on that, but I see now that Frontend already links against AST, so perhaps there isn't a layering violation after all?

I'll investigate to see if this can be cleaned up.

hintonda added inline comments.Apr 25 2019, 10:05 AM
include/clang/Frontend/FrontendOptions.h
311

I think that Frontend uses AST, e.g., ASTContext, but not the other way around.

Thanks for doing this! I'm glad you were able to do it without needing to change the traverser class. That's a good indicator.

I think it would be great if there were somewhere to document the non-stability of the format and content here.

include/clang/AST/DeclBase.h
1139

I think we've talked about this before, but I don't think growing interfaces like this is the best way forward. An enum is a less-good replacement for an object (ie making the user of the API responsible for creating the dumper they want to use).

I think that could be made more convenient in the future. What do you think?

include/clang/AST/JSONNodeDumper.h
54

You have

!Label.empty() ? Label : "inner";

below. This looks needlessly non-DRY?

131

I'm no expert in JSON, so I'm not certain if this has implications.

Do entries in a JSON object (ie not an Array-type) have order? Are you 'giving' them order here which is not part of how JSON is usually parsed (eg by third party tools)?

Can you create ordered JSON (arrays) where it is needed? I see that inner is already an array, so I'm probably missing the part that is 'modelled as an unordered container'. Can you point that out to me?

162

Is it possible this 'textual' JSON generation could produce invalid JSON? One of the things a library solution ensures is balanced braces and valid syntax etc. I'm not familiar enough with JSON to know if it has dark corners.

aaron.ballman marked 14 inline comments as done.

Updating based on review feedback.

include/clang/AST/DeclBase.h
1139

I think that may be a good future improvement, yes.

One thing to take on balance when considering this for the future is that dump() can be called from within a debugger and that's a bit harder to accomplish with an object interface. I'm not certain that will be a big issue for my use case, but it may matter to some folks.

include/clang/AST/JSONNodeDumper.h
54

There are calls to AddChild() from the AST traverser that pass in an empty string, which is how I got to the state I'm in (see ASTNodeTraverser::Visit(const Stmt *, StringRef)). I can pull the explicit label from this call and just pass an empty string.

56

Hmm, if I was doing that, I'd probably prefer to use std::invoke() to call the function, but we can't use that because it's C++17. Do you have concerns if I don't make that change just yet (we don't do it from TextNodeDumper either).

131

Oh, good catch! I need to update those comments because they got stale.

Can you create ordered JSON (arrays) where it is needed? I see that inner is already an array, so I'm probably missing the part that is 'modelled as an unordered container'. Can you point that out to me?

It's the key/value pair bits that are unordered, and that turns out to not matter here. I wrote that comment during a very early pass when I was passing around JSON objects more liberally and I had run into a situation where I was using kv pairs but order mattered (I don't even recall what the case was off the top of my head now). With the new design, these comments are stale -- I've corrected that.

162

I don't believe this will produce invalid JSON, short of aborting mid-stream.

riccibruno marked an inline comment as done.May 8 2019, 5:09 AM
riccibruno added inline comments.
include/clang/AST/JSONNodeDumper.h
56

Sounds good!

rsmith accepted this revision.May 8 2019, 1:50 PM

If you're happy with these two conditions, then I have no concerns with this moving forward:

  • There is no implied stability for the content or format of the dump between major releases, other than that it be valid JSON; this should be stated explicitly in the documentation. (Compatibility between patch releases seems like something we can work out with the release manager, but I'm inclined to say we should make a best-effort attempt to preserve it.) If people want to build tools on this rather than on one of our stable APIs, they should expect to be broken in some way on every major release.
  • There is no requirement for people maintaining the AST (changing or adding AST nodes) to update the dump output for modified AST nodes to show any new information -- unlike the existing -ast-dump, this is not just for debugging, but we should be able to treat it as if it were. Perhaps a better way to put it: there is no requirement that the information in this dump is complete, but the information that is dumped should be correct.

If you want stronger guarantees than that, then we should have a broader discussion to establish some community buy-in.

This revision is now accepted and ready to land.May 8 2019, 1:50 PM
steveire added inline comments.May 12 2019, 1:46 PM
include/clang/AST/DeclBase.h
1139

I don't think "it may matter to some folks." is a good guideline to use when designing an API. I have a patch which moves ASTDumper to its own header which I have just uploaded here: https://reviews.llvm.org/D61835

If additional args to dump() are mainly for debugging, then adding new args which are not for debugging doesn't seem right.

aaron.ballman marked 2 inline comments as done.

Switched to using the JSON streaming interface rather than a custom solution
Updated the test cases for the new formatting

If you're happy with these two conditions, then I have no concerns with this moving forward:

  • There is no implied stability for the content or format of the dump between major releases, other than that it be valid JSON; this should be stated explicitly in the documentation. (Compatibility between patch releases seems like something we can work out with the release manager, but I'm inclined to say we should make a best-effort attempt to preserve it.) If people want to build tools on this rather than on one of our stable APIs, they should expect to be broken in some way on every major release.
  • There is no requirement for people maintaining the AST (changing or adding AST nodes) to update the dump output for modified AST nodes to show any new information -- unlike the existing -ast-dump, this is not just for debugging, but we should be able to treat it as if it were. Perhaps a better way to put it: there is no requirement that the information in this dump is complete, but the information that is dumped should be correct.

    If you want stronger guarantees than that, then we should have a broader discussion to establish some community buy-in.

Both of these conditions make a lot of sense and we're happy with them. I'll update the documentation accordingly as I land the initial patch. Thanks!

include/clang/AST/DeclBase.h
1139

The status quo is that we want dump() to be a function; changing that to use a different interface is certainly worth discussing. However, I'm opposed to trying to do that refactoring at the same time as I'm trying to land this functionality. I would recommend we land this initial patch first, then if we decide to refactor the dump() interface, do that in subsequent reviews.

aaron.ballman closed this revision.May 13 2019, 2:43 PM

Commit in r360622.