
[TableGen] Add a general-purpose JSON backend.

Authored by simon_tatham on Apr 25 2018, 5:28 AM.



The aim of this backend is to output everything TableGen knows about
the record set, similarly to the default -print-records backend. But
where -print-records produces output in TableGen's input syntax
(convenient for humans to read), this backend produces it as
structured JSON data, which is convenient for loading into standard
scripting languages such as Python, in order to extract information
from the data set in an automated way.
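As a sketch of the kind of automated consumption this enables, here is a Python snippet working on a hand-written stand-in for the dump (the actual output format is discussed later in this review; the `!`-prefixed metadata key and the record contents here are invented for illustration):

```python
import json

# Hand-written stand-in for `llvm-tblgen -dump-json` output: each
# top-level key that doesn't start with '!' is a def, and keys
# starting with '!' hold metadata (an assumption for this sketch).
sample = '''
{
  "!instanceof": {"Insn": ["ADD", "SUB"]},
  "ADD": {"Opcode": 1},
  "SUB": {"Opcode": 2}
}
'''
records = json.loads(sample)

# Separate the defs from the metadata entries.
defs = {k: v for k, v in records.items() if not k.startswith("!")}
```

A consumer script can then iterate over `defs` exactly as a C++ backend would iterate over the record keeper.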

The output data contains a JSON representation of the variable
definitions in output 'def' records, and a few pieces of metadata such
as which of those definitions are tagged with the 'field' prefix and
which defs are derived from which classes. It doesn't dump out
absolutely every piece of knowledge it _could_ produce, such as type
information and complicated arithmetic operator nodes in abstract
superclasses; the main aim is to allow consumers of this JSON dump to
essentially act as new backends, and backends don't generally need to
depend on that kind of data.

The new backend is implemented as an EmitJSON() function similar to
all of llvm-tblgen's other EmitFoo functions, except that it lives in
lib/TableGen instead of utils/TableGen on the basis that I'm expecting
to add it to clang-tblgen too in a future patch.

To test it, I've written a Python script that loads the JSON output
and tests properties of it based on comments in the .td source - more
or less like FileCheck, except that the CHECK: lines have Python
expressions after them instead of textual pattern matches.
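For illustration, a minimal sketch of such a checker (this is not the actual test script from the patch; the exact CHECK-line syntax and the name `data` bound to the loaded JSON are assumptions):

```python
import re

# Evaluate each "CHECK: <python expression>" found in the .td source
# against the loaded JSON dump, bound to the name 'data'. Intended
# only for trusted test inputs, since it uses eval().
def run_checks(td_text, data):
    ok = True
    for m in re.finditer(r"CHECK:\s*(.+)", td_text):
        expr = m.group(1).strip()
        if not eval(expr, {"data": data}):
            print("FAILED:", expr)
            ok = False
    return ok
```

This mirrors FileCheck's workflow while letting each check express a structural property of the JSON rather than a textual pattern.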

Event Timeline

simon_tatham created this revision. Apr 25 2018, 5:28 AM
labath added a subscriber: labath. Apr 25 2018, 8:28 AM
labath added inline comments.

You should take a look at D45753, which is about to add a JSON library to LLVM. It would be a shame to add two of them in the same week. :)

simon_tatham added inline comments. Apr 25 2018, 8:35 AM

Ha! You're right, I hadn't noticed that. I'll replace my ad-hockery with calls to that code with great pleasure as soon as it lands – even without looking I'm sure it will be better than the 'only just enough' code I have here. Thanks for pointing it out!

@nhaehnle: while this change is blocked on that one, do you have any opinions on my design questions? Particularly the one about whether I ought to move most of the code into some sort of getAsJSONObject method in the various classes in Record.h / Record.cpp, because if I'm going to do that then I should do it before making any other detailed changes :-)

Consider what to do about integer values that don't fit exactly into a 'double'. This code will simply emit them as decimal integer literals, which JSON parsers are within their rights to round to the nearest double precision float, losing data. Some JSON readers (e.g. Python json.load) will deliver accurate integer values anyway, but it might be better not to rely on that, and instead output very large integers in some other form, such as a JSON object containing an identifying type field and two doubles whose sum is the desired integer, or a string representation of the integer, or both.

Well. The one thing I do have a strong opinion on is that the representation of very large integers should not be different from the representation of small integers, because consumers would almost certainly get that wrong. So that leaves two options: string representations, or sending JavaScript to the hell it deserves and going with integers.

Consider adding the new -dump-json option to clang-tblgen as well as llvm-tblgen. (As I understand it they wouldn't do anything differently, but it seems asymmetric not to have both of them support it. They both have -print-records, after all.)

Sure, that seems like a good idea.

Consider providing a cut-down version, enabled by another option such as '-dump-simple-json', in which all the complicated parametric expression nodes like !add and !foreach and !foldl are simply not emitted, and replaced by some kind of small object indicating that a complex expression was elided. The motivation is that I expect a lot of uses for this system would only be interested in the output fields that consist of final well-defined values of primitive type, so constructing the complicated parts is a waste of both TableGen's time and the consumer's. But I'm not sure where the line should be drawn - DAG arguments might well still need to be output in full, for example, and type information might be omittable. There may be no one good answer.

I believe there is a good answer :)

Take a look at TGParser.cpp, checkConcrete. All valid final and needed records should pass that check. There are unfortunately still some exceptions, although it'd be nice if we could get rid of those actually.

So I would argue that you should only bother emitting what fits this definition of "concrete", i.e. don't have a "complex" JSON option at all. If you do encounter something that doesn't fit the definition of concrete, just output a JSON object { "kind": "complex", "code": <getAsString()> } in its place.
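For illustration, a consumer following this convention could guard against elided expressions like so (the helper name and the choice of exception are invented for this sketch):

```python
# Any value that is a dict with kind == "complex" is an elided
# non-concrete expression carrying only a printable representation;
# a consumer that needs a real value should fail loudly on it.
def resolve(value):
    if isinstance(value, dict) and value.get("kind") == "complex":
        raise ValueError("unresolved expression: " + value["code"])
    return value
```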

On a related note, I'm not convinced you really want to print out class definitions. The benefit of printing class definitions with -print-records is that it can help you understand what's going on while you're writing class definitions, but the existing TableGen backends really don't care about class definitions. Since the idea here is to basically allow writing TableGen backends in a scripting language like Python, providing the class definitions is unlikely to be useful. It doesn't hurt, but I don't think it's a good motivation for building all this infrastructure for printing ! operations, for example.

Decide where all this code should live. It might be better to move a lot of it into Record.cpp in the form of getAsJSONObject() methods or something like that. That would remove the risk of forgetting to update the JSON back end if a new node type is introduced - anyone forgetting to implement that method in any new subclass of Init or RecTy would be reminded by a compile error.

I'd rather keep it separate for orthogonality. Adding a new fundamental concrete data type is a rare enough occurrence, and the worst that would happen with my suggestion above is that you get an unexpected "complex" object, which is not too difficult to track down.

Well. The one thing I do have a strong opinion on is that the representation of very large integers should not be different from the representation of small integers, because consumers would almost certainly get that wrong.

A good point – now you put it that way, I suddenly agree strongly!

So that leaves two options: string representations, or sending JavaScript to the hell it deserves and going with integers.

Well, since my personal use cases all involve Python, and Python copes fine with arbitrary integers, I'm happy with the latter if you are :-) And I'd definitely prefer not to have to put a decode-from-string operation in every single consumer of this JSON output.

I suppose a cl::opt<bool> to switch to a more cumbersome representation of integers could always be added later, if anyone turns out to really need one.
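For instance, Python's standard json module round-trips integers of arbitrary size exactly, which is what makes plain JSON integer literals safe for Python consumers (a JavaScript consumer would lose precision past 2**53):

```python
import json

# Python ints are arbitrary-precision, and json.dumps/json.loads
# preserve them exactly rather than going through a double.
big = 2**64 + 1
assert json.loads(json.dumps(big)) == big
```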

Take a look at TGParser.cpp, checkConcrete. All valid final and needed records should pass that check. There are unfortunately still some exceptions, although it'd be nice if we could get rid of those actually.

Ah! Yes, that seems nice. And if I'm not dumping the class definitions, then perhaps it's not worth dumping the details of all the types either, for the same reason (a 'back end' consuming this format will already know what type to expect from any field it cares about), in which case I can simplify the output representation a great deal by removing the extra level of dereference where you have to suffix ['value'] all the time.

I agree that none of my own use cases will care about any of the things that this redesign throws away – and it makes the output JSON a great deal smaller and simpler. Thanks for the suggestions!

simon_tatham updated this revision to Diff 144564. Edited Apr 30 2018, 8:20 AM

OK, here's my second draft. Changes since last time:

  • thrown out the ad-hoc JSON emitter in favour of the new JSON library in D45753 (also requires the integer-handling followup patch D46209)
  • moved the new source file into lib/TableGen where clang-tblgen will be able to get at it more easily (but I haven't actually added it to clang-tblgen yet)
  • removed all the type and abstract class information, leaving only the concrete records and a couple of pieces of metadata that I know backends do actually want (list of field keywords, list of superclasses, list of instances of each class). Exotic subclasses of Init are now rendered as kind="complex" with only a printable representation.
  • flattened the JSON structure by several layers to make it more convenient to consume
  • added documentation of the format.

I think from my perspective this is no longer an unfinished draft; I'd be happy to commit it in this state, subject to code review approval and its dependencies landing.

simon_tatham edited the summary of this revision. Apr 30 2018, 8:21 AM

Thanks, this already looks very good. I do have some suggestions though.

516–518 ↗(On Diff #144564)

This is a minor point, but I suspect it would be slightly more convenient for consumers if this were instead represented as [[arg, name], [arg, name], ...]. So the example below would have args be [[22, null], ["hello", "foo"]]. What do you think?

46–49 ↗(On Diff #144564)

C-style comments are rather uncommon in LLVM; best to be consistent and use C++-style comments here (and below).

93 ↗(On Diff #144564)

I think this should be "var" instead of "variable", for consistency.

172–179 ↗(On Diff #144564)

I think it would be slightly nicer to merge this into the loop above, to avoid iterating over the same data twice.


Cool, I didn't know this was possible.

simon_tatham added inline comments. May 3 2018, 8:33 AM
516–518 ↗(On Diff #144564)

That's actually more like how I had it in the first version of the patch, and I had second thoughts and changed it to this :-) so I'm already on the fence and could easily be persuaded to change it back again.

My thought was that some use cases wouldn't care about the name at all (e.g. an entire backend might use dag-typed data for some purpose that never gives a name to any argument), and those users would find it more convenient if they could get the actual arguments by just saying node['args'][n] instead of having to say node['args'][n][0] (in your version) or node['args'][n]['value'] (as I originally had it). So I moved the names out into a separate array that you'd only have to look at if you cared about names at all.

On the other hand, I can certainly see the counterargument – if you do care about names, it's nicer to have a single array to iterate over. I'm happy to change it to your style.
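To make the trade-off concrete, here is how the merged [[arg, name], ...] representation reads from Python (the values are taken from the example above; this is an illustration, not the patch's final format):

```python
# Merged representation of a dag's arguments: each entry pairs the
# argument value with its name, or None when the argument is unnamed.
merged = [[22, None], ["hello", "foo"]]

# Name-agnostic consumers can strip the names in one pass...
args = [a for a, _ in merged]

# ...while name-aware consumers iterate the pairs directly.
named = {name: a for a, name in merged if name is not None}
```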

93 ↗(On Diff #144564)

Another thing I was trying to do (but forgot to actually mention anywhere) was to arrange for all the field names that depend on "kind" not to be the same as each other, as a means of error detection – it would stop a user accidentally mistaking a var for a varbit, by absentmindedly retrieving node['var'] and forgetting there was another field to look at too. But that's quite a marginal consideration in the first place, and also I admit this particular choice of two different names was terrible :-) so yes, I'm happy to change over to being consistent.


You mean passing a string to sys.exit? That was new to me as well quite recently. It's one of those functions where as a C programmer I automatically assumed I already knew how its API would work, so it took me years to find a reason to read its docs!

Updated with all those review comments.

simon_tatham marked 4 inline comments as done. May 3 2018, 9:04 AM
nhaehnle accepted this revision. May 3 2018, 9:27 AM

Great, LGTM!

516–518 ↗(On Diff #144564)

Yeah, I admit it was really a minor thing. Thanks for changing it though!

This revision is now accepted and ready to land. May 3 2018, 9:27 AM

Thanks for the review!

Of course, having got to this point, I can't actually commit it until D45753 is ready. And I'm about to be on holiday for several weeks, so if that happens soon then I probably won't notice for a while. But I won't actually forget about this patch, I promise :-)

@nhaehnle, following up discussion about NAME in D47430 (and hopefully posted in the right place this time):

I was going to change this patch so that it uses !name instead of NAME for the key inside each JSON def object that gives the def's own name. (Mostly the reason I think it's useful to have such a key at all is so that client code that's consuming the JSON can pass those dictionaries to its own subfunctions without having to pass the name alongside it, so it makes sense to me to use a key that indicates that it's a JSON-specific convenience.)

But it's just struck me that there might also be a use case for knowing whether the record is anonymous or not, in the sense of whether its name is something that was deliberately specified in the TableGen input or whether it was some anonymous_123 value made up by tblgen itself.

I was thinking of adding a boolean field !anonymous, or alternatively perhaps having !key and !name (where !key is always the key under which this record is stored in the JSON root object, and !name is either the same as !key or null). Any particular preference, before I make the changes? Or is the entire idea not worth bothering with?

(Also, I noticed in passing that the IsAnonymous field isn't set reliably: the records defined by an anonymous defm have it false rather than true. That looks easy to fix, and I could fold that into this change or make it a separate one.)

There are arguments both for !key + !name and for !name + !anonymous, although thinking about it for a minute or two I weakly prefer !name + !anonymous because it matches the representation in C++. It makes it easier for people to move between JSON and C++.

P.S.: No worries about the confusion with the other review :)

And by the way, I do agree with your rationale for why !name is very useful to have in JSON. The C++ backends can (and do) use Record::getName() for the same functionality.

simon_tatham edited the summary of this revision.

Renamed the NAME attribute to !name in line with D47430.

Added the !anonymous attribute, a set of tests for it, and a fix in TGParser.cpp to set it correctly for anonymous defms.

OK, !name + !anonymous it is. (That was how I'd drafted it too, so I think we had the same mild preference.)
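For illustration, here is how a Python consumer might use the two fields together (the sample records are invented; the !name/!anonymous keys are the scheme settled on above):

```python
# With !name + !anonymous, a def's dict can be passed around on its
# own and still report its name, and compiler-generated anonymous
# records are easy to filter out.
records = {
    "ADD":           {"!name": "ADD", "!anonymous": False},
    "anonymous_123": {"!name": "anonymous_123", "!anonymous": True},
}
named_defs = [r["!name"] for r in records.values() if not r["!anonymous"]]
```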

I'm afraid this patch now has a very tiny conflict with D47430, in that we've both added braces to the same if statement in TGParser::ParseDefm.

Rebase to current trunk, and revert change to 'anonymous' handling.

In line with the discussion in D47431, I've removed my previous tweak
in TGParser.cpp that sets the !anonymous flag if any part of a def's
final name was derived from an anonymous def or defm. Now the
!anonymous field in the JSON output is consistent with the existing
behavior of the isAnonymous() query function.

simon_tatham edited the summary of this revision. Jul 9 2018, 7:05 AM

@nhaehnle, are you still happy for me to commit this, now that its dependencies have landed and I've tweaked its handling of !anonymous?

Nearly forgot to mention the new option in the tblgen man page!

Actually, on second thoughts, I'm going to assume it was overcautious to ask for a re-approval, since this version of the patch introduces no new controversy and in fact removes the only previous tweak in the TableGen core (in that I'm not trying to change the semantics of anonymous any more). So I'll commit this as is, based on the previous review approval.

This revision was automatically updated to reflect the committed changes.

The test fails on Windows when there's a space in the path to python: the interpreter should be specified in the test file as "%python" (with quotes) rather than just %python. I can submit a fix for review if you're not able. Thanks!

Oops, sorry about that :-(

(I admit I didn't consider the possibility of spacey filenames at all, but now I have, it's a mild surprise to me that %python doesn't expand to an already correctly-quoted string.)

The fix sounds easy, but if you have a setup where you can actually check it works, there might be less chance of a typo if you do it rather than me...