This is an archive of the discontinued LLVM Phabricator instance.

Lift JSON library from clang-tools-extra/clangd to llvm/Support.
ClosedPublic

Authored by sammccall on Apr 17 2018, 9:56 PM.

Details

Summary

This consists of four main parts:

  • an type json::Expr representing JSON values of dynamic kind, which can be composed, inspected, and modified
  • a JSON parser from string -> json::Expr
  • a JSON printer from json::Expr -> string, with optional pretty-printing
  • a convention for mapping json::Expr <=> native types (fromJSON/toJSON) Mapping functions are provided for primitives (e.g. int, vector) and the ObjectMapper helper helps implement fromJSON for struct/object types.

Based on clangd's usage, a couple of places I'd appreciate review attention:

  • fromJSON returns only bool. A richer error-signaling mechanism may be useful to provide useful messages, or let recursive fromJSONs (containers/structs) do careful error recovery.
  • should json::obj be always explicitly written (like json::ary)
  • there's no streaming parse API. I suspect there are some simple wins like a callback API where the document is a long array, and each element is small. But this can probably be bolted on easily when we see the need.

Diff Detail

Repository
rL LLVM

Event Timeline

sammccall created this revision.Apr 17 2018, 9:56 PM

I'm not sure who should be the main person reviewing this, but I think the implementation looks pretty good, and would be a great replacement for the one we have in lldb. The main thing I noticed is that you seem to be rolling your own utf8 parser -- I would hope we can reuse the existing unicode utilities here.

fromJSON returns only bool. A richer error-signaling mechanism may be useful to provide useful messages, or let recursive fromJSONs (containers/structs) do careful error recovery.

For our (lldb) use case, we don't need a fancy error mechanism. It seems it should be possible to make these return llvm::Error if it turns out to be necessary later.

should json::obj be always explicitly written (like json::ary)

I don't have an opinion on that, though I do think that json::ary should be renamed.

there's no streaming parse API. I suspect there are some simple wins like a callback API where the document is a long array, and each element is small. But this can probably be bolted on easily when we see the need.

All the messages we are parsing are reasonably small (<= 1k), so we don't have a need for a streaming parser right now. The parser is essentially streaming already. If there was something like llvm::raw_istream, it could be trivially converted, but it looks like input streams aren't a thing for llvm in general, so I wouldn't be worried about it here.

include/llvm/Support/JSON.h
58 ↗(On Diff #142880)

It would be good to emphasize that the returned object's lifetime is independent of the string being parsed (i.e. the contained strings are not references to the strings in the original text).

449–450 ↗(On Diff #142880)

I find it a bit inconsistent when I see a json::Expr (uppercase) for the base type and then json::obj (lowercase) for the derived ones. The lowercase names don't really follow the naming convention and ary is also fairly un-obvious. Could we just call these Object and Array ?

lib/Support/JSON.cpp
283–364 ↗(On Diff #142880)

Is there any reason Support/ConvertUTF.h cannot be used here? Is sounds like you just need the "lenient" conversion mode here.

394 ↗(On Diff #142880)

llvm coding standards say we use static for functions instead of anonymous namespaces. Also, the llvm and json namespaces are opened and closed twice, so this may be a good opportunity to merge them.

sammccall marked 4 inline comments as done.Apr 20 2018, 10:05 AM

I'm not sure who should be the main person reviewing this,

I'm also not sure, but I really appreciate your feedback, and wanted to give you as a likely user a chance to shape the API.
Thanks for the great advice, don't feel any pressure to accept or examine everything if you don't feel like the right person.

but I think the implementation looks pretty good, and would be a great replacement for the one we have in lldb. The main thing I noticed is that you seem to be rolling your own utf8 parser -- I would hope we can reuse the existing unicode utilities here.

Thanks for pointing that out, I hadn't seen them, and you're right. (At the same time, ick!)

fromJSON returns only bool. A richer error-signaling mechanism may be useful to provide useful messages, or let recursive fromJSONs (containers/structs) do careful error recovery.

For our (lldb) use case, we don't need a fancy error mechanism. It seems it should be possible to make these return llvm::Error if it turns out to be necessary later.

should json::obj be always explicitly written (like json::ary)

I don't have an opinion on that, though I do think that json::ary should be renamed.

Agree, these were a wart. ary -> Array, obj -> Object, Expr -> Value.

there's no streaming parse API. I suspect there are some simple wins like a callback API where the document is a long array, and each element is small. But this can probably be bolted on easily when we see the need.

All the messages we are parsing are reasonably small (<= 1k), so we don't have a need for a streaming parser right now. The parser is essentially streaming already. If there was something like llvm::raw_istream, it could be trivially converted, but it looks like input streams aren't a thing for llvm in general, so I wouldn't be worried about it here.

Yeah. Rather than the input data, I was more thinking of passing the output objects to a callback without waiting to pass the whole containing array/object into memory. But agree it's a thing for later.

include/llvm/Support/JSON.h
58 ↗(On Diff #142880)

Done (on the function doc for `parse()'

449–450 ↗(On Diff #142880)

Done. These are now defined as json::Object and json::Array. Since Value already had enumerator with these names, I un-nested the classes which required reordering some things and defining functions out-of-line.
I put them in the cpp file for readability reasons, they could move to the bottom of the header if we care about inlining.

While renaming, also changed json::Expr -> json::Value which seems more accurate.

lib/Support/JSON.cpp
283–364 ↗(On Diff #142880)

Done. That's really slow code with a really inconvenient API, but it shouldn't matter.

394 ↗(On Diff #142880)

Switched to static and removed the namespace (enum doesn't need to be in it).

The format_provider specialization does need to be outside the namespace and there is some order dependency around there, but I'vo avoided opening/closing a lot by qualifying definition names.

sammccall marked 4 inline comments as done.

Address review comments.

This looks good to me, but I do feel someone else should comment on the appropriateness of including this library in llvm/Support. Chandler is listed as the code owner of Support, and he used to have opinions on json parsers in the past, so maybe he would be a good candidate (?)

include/llvm/Support/JSON.h
93–96 ↗(On Diff #143338)

Leftover references to obj and ary

457 ↗(On Diff #143338)

Could we use ObjectKey as the property type here?

sammccall marked 2 inline comments as done.Apr 23 2018, 9:59 AM

This looks good to me, but I do feel someone else should comment on the appropriateness of including this library in llvm/Support. Chandler is listed as the code owner of Support, and he used to have opinions on json parsers in the past, so maybe he would be a good candidate (?)

Good suggestion, I'll try to get this on his radar. Thanks!

include/llvm/Support/JSON.h
457 ↗(On Diff #143338)

Better, StringRef. (ObjectKey is just a maybe-owning stringref, and non-owning is fine here)

lib/Support/JSON.cpp
283–364 ↗(On Diff #142880)

Hmm, I guess I only *thought* I ran the tests :-(

It turns out DecodeUTF16ToUTF8 doesn't do the right thing - at least it doesn't do anything particularly compatible with JSON. The problematic cases are when UTF-16 surrogate code units appear without being properly paired.

  • a JSON parser has to accept these, as they conform to the grammar
  • the best behavior per Unicode is to replace them with U+FFFD
  • ConvertUTF in lenient mode simply drops them in most cases (permitted, but not recommended). When encountering a lone leading surrogate at the end of text, it returns an error even in lenient mode.

References:
http://seriot.ch/parsing_json.php
https://www.rfc-editor.org/errata_search.php?rfc=7159&eid=3984
http://unicode.org/review/pr-121.html

I've restored the previous code and added more comments, including reasons not to use ConvertUTF. This implements unicode's preferred handling of invalid UTF-16 ("Replace each maximal subpart of the ill-formed subsequence by a single U+FFFD").

sammccall marked an inline comment as done.

Address review comments, add a few more docs.
Go back to custom UTF transcoding because of failing test cases :-(

@chandlerc: any interest in reviewing this JSON library (either at a high level for suitability in llvm/Support, or for detailed review)?

It has been used in clangd for a while, and seems suitable for lldb's purposes too. llvm/Support seems like their common ancestor, but I'm also happy if there's somewhere better.

@chandlerc: any interest in reviewing this JSON library (either at a high level for suitability in llvm/Support, or for detailed review)?

I'm happy to tackle both of these. I'll want to do a bit of homework to make sure that the high level concerns that came up previously are adequately addressed (or there is some plan to address them).

It has been used in clangd for a while, and seems suitable for lldb's purposes too. llvm/Support seems like their common ancestor, but I'm also happy if there's somewhere better.

This seems at least sufficient for us to invest some time sorting this all out one way or another.

chandlerc added inline comments.Apr 24 2018, 3:52 AM
include/llvm/Support/JSON.h
8 ↗(On Diff #143588)

One thing that would help me even as I start to dig into this would be an overview comment at the top of the file...

  • What is the intended / expected usage pattern?
  • Mention alternatives given that it seems unlikely we'll eliminate YAMLIO immediately. Why use one vs. the other?

I'd also like to see your initial thoughts on why this library is better as a separate API/library rather than a separate interface that is part and parcel of the YAML library we already have. I don't have any real opinion (yet) about what does or doesn't make sense, and having your perspective on this would help me form an opinion I suspect.

26–27 ↗(On Diff #143588)

Feel free to defer this until higher level stuff is addressed, but here and throughout this entire file, you have excellent comments but don't use a doxygen comment prefix. I think all of these should be converted to be actual doxygen comments at whatever point this is moving forward.

Meinersbur added inline comments.Apr 24 2018, 8:50 AM
include/llvm/Support/JSON.h
313–316 ↗(On Diff #143588)

Is this mutable here also required for "cheating"?

355 ↗(On Diff #143588)

Can you elaborate on why this is needed? AFAIK std::initializer_lists are not meant to be moved from.

lib/Support/JSON.cpp
327 ↗(On Diff #143588)

The error message we get seem to be:
If the token starts with an 'e' or 'E', the error we get is "Invalid number".
For the letters 'n', 't', or 'f' we get "Invalid bareword".
For any other first letter we get "Expected JSON value".

I'd hope for more consistent error messages.

614 ↗(On Diff #143588)

Did you consider using llvm_unreachable?

Thanks for the work, I would like to replace Polly's jsoncpp with this one once it is done.

sammccall marked 3 inline comments as done.Apr 24 2018, 7:56 PM

@chandlerc: any interest in reviewing this JSON library (either at a high level for suitability in llvm/Support, or for detailed review)?

I'm happy to tackle both of these. I'll want to do a bit of homework to make sure that the high level concerns that came up previously are adequately addressed (or there is some plan to address them).

Thank you! Other than "how does this relate to YAML", I don't think I'm familiar with the concerns so any pointers you have would be appreciated.

include/llvm/Support/JSON.h
8 ↗(On Diff #143588)

Oops, the big comment on Value was meant to serve this purpose, but I can't find a way to get it to be first in the file.
Added a real file overview - right level of detail?

There's a few reasons to have this separate from the YAML library, and I'm not sure which are important or even good.

  • a YAML based dom/parser will have a poor API for people who want to deal with JSON - it's too complicated and the names are incorrect. The thing I like most about this API is that the usages are simple.
  • neither the YAMLParser nor YAML I/O design is useful for the use cases I've run into so far (e.g. parsing LSP correctly really requires a DOM, YAMLParser is too low-level and YAML I/O is too restrictive). There'd be a lot of work in filling out the format x API matrix to make it coherent. Maybe it's worth it, so far it could be YAGNI.
  • I'm trying to avoid becoming a YAML expert myself, it's complicated and IMO a dead-end.
  • I simply don't have good ideas about how to combine the pieces I want to add into a single coherent API that I'd want to use. How should YAML anchor data be represented in a DOM? What happens if we try to write JSON-incompatible data to a JSON stream? How can we make YAML I/O more flexible without making it yet more complicated? I'm sure these are solvable, but they might need to be solved by someone who's more in tune with the goals of the YAML library.
26–27 ↗(On Diff #143588)

Ack, I'll do this conversion shortly.

313–316 ↗(On Diff #143588)

Yes. This is documented at moveFrom, but added another comment.

355 ↗(On Diff #143588)

I was able to mitigate this, eliminating the const-rvalue-reference constructors. (This optimization isn't important for ObjectKey, and for Object I friended the relevant classes instead.)

The reasoning here is roughly:

  • we need some syntax to support many KV pairs for an object, or elements in an array
  • function syntax fails because it doesn't format well in long-list cases. Clang-format does a better job if lists are braced lists, in particular it offers you the ability to force one-per-line with a trailing comma in the list.
  • a variadic constructor fails because this must be templated on the arg type, which means args can't be braced list expressions themselves, as those do not have a deducible type. This would hurt map-like object-literal syntax...
  • so we're left with the std::initializer_list constructor as the way to pass variable numbers of arguments, in a way that formats nicely, and allows them to be coerced to a chosen type. However initializer_list acts like a container of const<T>, which would mean naively json::Value{{{{{1}}}}} would result in a deep copy at every level of nesting. Fortunately the standard spells out enough of how the contents of init-lists are constructed that moving the data out of them seems well-defined.

I'm not sure how much of this stuff belongs in the comments here - it's more design doc than user guide.

sammccall updated this revision to Diff 143858.Apr 24 2018, 7:57 PM
sammccall marked 2 inline comments as done.

Address some review comments (but no doxygen yet)

Just my 5 cents, I feel that including this library would be useful for LLVM and Polly would be a happy user.

sammccall updated this revision to Diff 144305.Apr 27 2018, 2:49 AM
sammccall marked 3 inline comments as done.

Doxygen, comment and error message tweaks.

Looks like I forgot to clang-format before sending this :-/

To avoid adding spurious diffs during review, I'll do that before landing (if applicable!).

include/llvm/Support/JSON.h
26–27 ↗(On Diff #143588)

Used doxygen comment prefix and wrapped a long example in \code...\endcode.

Many comments apply to blocks of related functions so are not doxygenated.
For the most part no actual doxygen annotations seemed appropriate, and I think rewriting comments to make more use of doxygen often hurts the inline readability of the comments.

So I think this is done, but LMK if you disagree.

lib/Support/JSON.cpp
327 ↗(On Diff #143588)

There's a tension here between precise errors, consistent/useful errors, and parser complexity.
The "invalid number" is a false positive for elephant but a true positive for 123,00.

I've renamed these messages to be consistent but with a hint: "Invalid JSON value (number?)", "Invalid JSON value (null?)" etc, and "Invalid JSON value" respectively - WDYT?

simon_tatham added inline comments.Apr 27 2018, 5:23 AM
include/llvm/Support/JSON.h
315 ↗(On Diff #144305)

When I built this locally, I had a strange build failure involving this function with g++ 5.4.0 (i.e. the default compiler on Ubuntu 16.04). It reported, at the definition of this function in JSON.cpp, this error:

error: ‘llvm::raw_ostream& llvm::json::operator<<(llvm::raw_ostream&, const llvm::json::Value&)’ should have been declared inside ‘llvm::json’

for which, apparently, the fix was to add a repeated declaration of this function without the 'friend' keyword and outside the definition of class Value.

clang was happy with it, on the other hand. I've no idea :-)

lib/Support/JSON.cpp
87 ↗(On Diff #144305)

When I tried to build with this change locally, g++ pointed out a spurious semicolon here.

560 ↗(On Diff #144305)

Could this number formatting be changed? The default %g loses precision – you don't even get enough information to exactly reconstruct the same double you started with.

Also, over in D46054 I'm working on a JSON back end for TableGen, for which I'd find it useful to be able to pass an arbitrary 64-bit integer through this system and still have the full 64 bits of integer value visible in the JSON output file, for the benefit of JSON consumers (e.g. Python json.load) that go above the call of duty in returning it as an integer without rounding it to the nearest representable double.

So, would it be possible to have some method of constructing a json::Value that formats as a 64-bit integer literal?

sammccall updated this revision to Diff 144327.Apr 27 2018, 6:30 AM
sammccall marked 2 inline comments as done.

Fix GCC warnings/errors.
Add note about integer representation.

include/llvm/Support/JSON.h
315 ↗(On Diff #144305)

Indeed, thanks for catching this.
I think GCC is technically correct here, and most other compilers prefer to be helpful instead :-)
Added the namespace-scope declaration.

lib/Support/JSON.cpp
560 ↗(On Diff #144305)

Could this number formatting be changed? The default %g loses precision – you don't even get enough information to exactly reconstruct the same double you started with.

Yes, though is there an existing aware of round-trip safe double formatter in llvm?
I suspect this only actually matters when the values are integers, so we should consider your second suggestion first :-)

I'd find it useful to be able to pass an arbitrary 64-bit integer through this system and still have the full 64 bits of integer value visible in the JSON output file, for the benefit of JSON consumers (e.g. Python json.load) that go above the call of duty in returning it as an integer without rounding it to the nearest representable double.

What about this design:

  • Internally, a numeric value can be an integer or a double. i.e we split the internal ValueType T_Number into two, T_Integer and T_Double. public Kind remains unchanged.
  • when constructing, you get one or the other depending on the static type
  • when parsing, you get integer unless it has a nonzero decimal part or is out-of-range.
  • asDouble() always succeeds
  • asInteger() succeeds if the underlying value is integer or if it's a double that can be exactly represented as int64_t (same as now)
  • when serializing, you get %g for double and the usual representation for integers

Open questions:

  • is 1.2e3 a double or an integer? I kind of want the former, which complicates our heuristic.
  • int64_t leaves anyone who wants uint64_t out in the cold. But adding more options for types is going to lead to madness. Can we live with this limitation?

If this sounds good I can start on the changes, but I'd like to defer adding new features to another patch if that's OK. This one is largely moving mostly-battleworn code from clangd, and new features need closer review of the implementation.

simon_tatham added inline comments.Apr 27 2018, 7:20 AM
lib/Support/JSON.cpp
560 ↗(On Diff #144305)

That design certainly sounds as if it would do what I need, and a great deal more besides.

Another option that would be fine for me personally would be to have a means of constructing a ValueType called, say, T_Custom, which internally holds a string value, and serializes as exactly that string, unquoted. I could imagine that being used for other unusual purposes as well, such as controlling which of \uXXXX and UTF-8 was used to represent a non-ASCII character in a string literal.

(And that possibility is simple enough that I could add it myself as part of my patch.)

Oh yes, nearly forgot – you might be interested to know that when I ported my TableGen JSON back end to use this library in place of my previous serialization code, it reduced the running time to 60% of what it previously was and the client code ended up more legible. So in spite of posting nitpicks, I like this patch :-)

Meinersbur added inline comments.Apr 27 2018, 8:31 AM
include/llvm/Support/JSON.h
355 ↗(On Diff #143588)

std::initializer_list acting like a container of const elements is probably for a reason. I'd prefer no such hacks, but also see that endless copying of elements might justify cheating.

cppreference.com mentions that its elements must be copy-initialized, but only since C++17.

does anyone else have an opinion on this?

sammccall added inline comments.Apr 27 2018, 2:13 PM
include/llvm/Support/JSON.h
355 ↗(On Diff #143588)

I would also prefer no such hacks, but I can't see a way to get good syntax without the recursive copy, and without the hacks :(
Any ideas? Also interested in hearing more opinions on whether this is too hairy to rely on.

That said, I believe this is valid, per all relevant versions of the standard.

cppreference.com mentions that its elements must be copy-initialized, but only since C++17.

I think I'm missing your point here - can you explain why this is good or bad, and what the implication is? It also seems to be a cppreference mistake, the copy-initialized requirement is older.

C++11 says

as if the implementation allocated an array of N elements of type E [...] Each element of that array is copy-initialized

i.e. it creates a T[N]. We could probably even const_cast!

C++14 says

as if the implementation allocated a temporary array of N elements of type const E [...]. Each element of that array is copy-initialized [...] The implementation is free to allocate the array in read-only memory if an explicit array with the same initializer could be so allocated

(emphasis mine). So now the type changes to const T[N] (so const_cast would be invalid) but mutating mutable members of const objects is allowed, so read-only memory can't be used.

The C++17 language is more obscure

as if the implementation generated and materialized (7.4) a prvalue of type “array of N const E” [...] Each element of that array is copy-initialized

but seems to be the same for these purposes.

lib/Support/JSON.cpp
560 ↗(On Diff #144305)

D46209 adds int64 support, and fixes use of %g to retain full precision.

T_Custom is a cool idea and I definitely don't want to rule it out, but large integers in particular seems like something common that should "just work" if possible.
(It's also unclear how a T_Custom could solve the problem on the *parse* side, which is nice to have)

@chandlerc This is ready for you again whenever you have cycles. To summarize:

  • I think everything above was resolved except for the question of whether the mutable/initializer_list tricks are too gross to use.
  • a few more potential users seem pleased about this idea, and @simon_tatham has verified that it works for tablegen in D46054.
  • I've had a couple of suggestions which have been moved into D46209 (numeric precision) and D46274 (unicode validation), I don't think you personally need to review those.
Meinersbur added inline comments.May 2 2018, 9:46 AM
include/llvm/Support/JSON.h
355 ↗(On Diff #143588)

Thanks for looking into the standard (note that cppreference mentions C++14, not C++17 as I claimed).

I understood the standard the following way: If the element is copy-initialized, it should call the copy-constructor. In your implementation, it calls copyFrom, i.e. it still makes a recursive copy on each level, meaning the problem is not solved (But you save one by not recursively copying again when constructing from initializer_list).

However, I looked into the implementation of initializer_list in libc++ and msvc. It is a list of pointers-to-elements, rather than a flat array of objects. That is, no copy-initialization is not happening, it just points to the already existing elements. Not sure whether this is mandated by the standard.

sammccall added inline comments.May 2 2018, 11:52 PM
include/llvm/Support/JSON.h
355 ↗(On Diff #143588)

I understood the standard the following way: If the element is copy-initialized, it should call the copy-constructor

Ah, this is just the standard being confusing. copy-initialization is basically unrelated to copy-constructors. Certain syntaxes trigger copy-initialization, and others trigger direct-initialization, which behaves slightly differently (copy-initialization won't call explicit constructors).

http://en.cppreference.com/w/cpp/language/copy_initialization gives this example, which is relevant here:

std::string s2 = std::move(s); // this copy-initialization performs a move

@chandlerc Ping - I think you're the best person to decide whether a JSON-specific library should be added.
It'd be great to make some progress on that, even if actual review would take longer or be left for others.
(I think simon_tatham has work blocked on this, and there are some clangd changes I'm holding off to avoid diverging)
Are the arguments above compelling? Any more info I can provide?

(I'm going to be sporadically available over the next two weeks, but will keep an eye on this)

Sorry for delays circling back to this.

I think the primary concerns about putting this into the Support library is that we already have one API that is quite similar there: YAMLIO. However, that API is clearly not serving all of the needs of users given that there are mutliple JSON-parsing code paths in the wider project that are already using other libraries (this one, but also Polly). So I think adding this makes lots of sense at this point.

However, I'd like to take this opportunity to try to iterate a bit on the API since it is going into a new, more widely visible home and will almost certainly grow new users as soon as it lands. I'm fairly uncomfortable with the inheritance approach, and I think some of the additional APIs would benefit from (hopefully minor) adjustment. None of this is intended to change the fundamental design in any way though.

include/llvm/Support/JSON.h
62 ↗(On Diff #144327)

I understand the pragmatic reason for this, but I am pretty uncomfortable deriving from standard types like this. I'm worried it will hurt portability due to subtle variations in standard libraries, or subtle changes in standard libraries as we switch to C++N+1 or whatever.

I would be much more comfortable using something internal and owning the API you export.

Unrelatedly (and not blocking anything), I struggle to believe that std::map is the correct tool for the job... Is it just that DenseMap is a pain to use currently? Is it that these are typically small and not worth a DenseMap? I'm sorry to ask this as I suspect you've already explained this in the original review, but I must admit I'm curious.

69–71 ↗(On Diff #144327)

This does more than what it says. It hides std::map's operator[] overloads, no matter what they are. Is that what you intended?

This is perhaps a good example of why I find inheritance challenging here...

100 ↗(On Diff #144327)

Do you expect users to primarily use these typed accessors? Or the underlying vector?

Should they take iterators instead of indices?

These seem to make the API somewhat hostile to range based for loops. I'm surprised this isn't more a facility provided by the iterator...

I find the name somewhat confusing too -- see my comment below for some clarity of why.

But what's the advantage here? Why not just arr[i].asNull()?

266 ↗(On Diff #144327)

A more conventional name would be getAsFoo matching getAs<T>.

Also, is it really worth having all of these? getAs<bool> and getAs<double> seem just as nice as the non-templated versions to me... But I guess getAsString and getAsInteger do interesting validation and such.

Thanks for the thoughts, I'll take another pass and try to incorporate
them. But I'm out until Monday, so some answers/questions while this is in
your cache...

(Looks like Phabricator swallowed most of my email reply, please excuse some repetition)

Sorry for delays circling back to this.

I think the primary concerns about putting this into the Support library is that we already have one API that is quite similar there: YAMLIO. However, that API is clearly not serving all of the needs of users given that there are mutliple JSON-parsing code paths in the wider project that are already using other libraries (this one, but also Polly). So I think adding this makes lots of sense at this point.

However, I'd like to take this opportunity to try to iterate a bit on the API since it is going into a new, more widely visible home and will almost certainly grow new users as soon as it lands. I'm fairly uncomfortable with the inheritance approach, and I think some of the additional APIs would benefit from (hopefully minor) adjustment. None of this is intended to change the fundamental design in any way though.

That sounds good. Any pushback below is to clarify the reasons for decisions, not to resist changes - happy to change whatever you don't find convincing.

include/llvm/Support/JSON.h
62 ↗(On Diff #144327)

Are we worried about inheriting the interface of *llvm* types, or just std types?

I've switched to inheriting from DenseMap here, but I can wrap it if you prefer.
(std::map's ordered-ness made operator== and print() simple, so there's a few more lines now).

Array is more complicated: even SmallVector<Value, 0> needs Value, which sets up a cycle that's hard to break. So I've resorted to wrapping std::vector and exposing a hopefully-sane subset of the API.
(Aside: I suspect we *do* want to track e.g. the C++17 changes to vector, but that's a burden that shouldn't fall on whoever does the upgrade)

69–71 ↗(On Diff #144327)

There are already no such overloads available, as Value is not default-constructible. I reworded the comment a bit.

Agree that inheritance makes this a little subtle to reason about (though obviously if wrapping, this particular operator would look just the same).

100 ↗(On Diff #144327)

I expect the most common access patterns to be:

  1. mapping to a vector<Something> using fromJSON
  2. iterating over Values using ranged-for (followed by asFoo() on each element)
  3. these getFoo(I) accessors

They're not nearly as useful/important as the typed accessors on Object. The benefit is:

  • symmetry with Object makes the API easier to follow (thus size_t rather than iterator)
  • convenient when particular indices have semantics (like executeCommand parameter list in LSP), rather than being a homogeneous collection
  • more readable than operator[] when calling on a pointer, which is common due to the if (Array* A = V.asArray()) idiom - (*A)[i].asNull() vs A->getNull(i).

These seem to make the API somewhat hostile to range based for loops. I'm surprised this isn't more a facility provided by the iterator...

Can you elaborate on what kind of loop you'd like to write? I tend to find clever iterator APIs fairly undiscoverable/obscure, but I'm not sure what I'm missing.

266 ↗(On Diff #144327)

A more conventional name would be getAsFoo matching getAs<T>.

Sure, in Java ;-)
Adding "get" because a function must have a verb seems like cargo culting to me - we're trying to signal the effects, but there aren't any! Maybe it's my exposure, but I don't see any ambiguity as to what asBoolean() might do.

I'd like to propose a style guide change to be more like https://swift.org/documentation/api-design-guidelines/#strive-for-fluent-usage here. Meanwhile, I can change this to match the style guide as it stands if you think this is important. I do think it hurts the signal/noise.

Also, is it really worth having all of these? getAs<bool> and getAs<double> seem just as nice as the non-templated versions to me...

I do prefer the non-templated version:

  • getAs<bool> etc isn't fewer things to understand, and it's not shorter, it's just fewer distinct tokens. I'm not sure what we're trying to conserve here.
  • the non-templated functions are more discoverable: easier to read in the header, and work better with code completion, and easier to search for
  • getAsNumber is a better name than getAs<double>, as the relevant concept here is JSON's "Number" not the types used to represent them in C++. (this is "interesting validation and such" I think)
  • I don't want people to think they can write getAs<int>(), and if they do I want the error message to be easy to understand. Templates make both of these harder.
sammccall updated this revision to Diff 146855.May 15 2018, 9:27 AM

Array wraps vector instead of inheriting it.
Object inherits DenseMap instead of std::map
Change some Object accessors to use StringRef instead of ObjectKey.
clang-format, because it was getting out of control.

sammccall added inline comments.May 15 2018, 9:32 AM
include/llvm/Support/JSON.h
62 ↗(On Diff #144327)

(And I forgot to mention a new dirty trick here: the explicit use of DenseMapInfo<StringRef> works since ObjectKey can already be implicitly converted back and forth to StringRef. This saves a bunch of unreadable boilerplate which has to go at the *top* of the file, but is admittedly a bit fragile - WDYT?)

@chandlerc: Back-from-vacation ping.
(A short response like "yes, do change X" is fine, keeps me busy :-)

Thanks for the ping!

include/llvm/Support/JSON.h
62 ↗(On Diff #144327)

I'm somewhat concerned with inheritance generally. I feel like wrapping is just a better pattern and more easily understood and debugged.

The additional code doesn't worry me much.

100 ↗(On Diff #144327)

Definitely not advocating for more clever iterator API. Mostly pointing out that index-oriented APIs are particularly hard to use with range based for loops. Whereas, using foo[i].asNull() is fine because the indexing is orthogonal to the query and can be replaced with something more range-based or iterator based if useful.

If the issue is just that operator[] is annoying with pointers, how about just a method for indexing so that you can use -> to call it? A->at(i).asNull() or A->get(i).asNull() seem fine?

266 ↗(On Diff #144327)

I find the arguments against templates compelling FWIW.

That said, while I actually support not using get superfluously in many cases, I don't actually like it here. There is an interesting and potentially complex conversion happening, and I personally much prefer getAsFoo() for that. The only time I'm really quite happy omitting the get is when it is truly just an accessor and not providing any "interesting" logic (a higher bar than the Swift convention you cite in the llvm-dev thread).

Anyways, for now, I suggest the getAsNumber pattern.

sammccall marked an inline comment as done.May 25 2018, 10:32 AM
sammccall added inline comments.
include/llvm/Support/JSON.h
62 ↗(On Diff #144327)

Object now wraps DenseMap instead of inheriting it.

I've removed some less-frequently used methods (resize, reserve, etc).
I'm reluctant to simplify commonly used interfaces in ways that might be surprising (like fewer insert overloads where it may add more copies).

100 ↗(On Diff #144327)

The Object case is different and more important, and worth resolving first.

Object::getNumber(StringRef) returns a number if the property exists and is a number. Dropping this accessor and writing if (auto N = O->get("foo").getAsNumber()) ... isn't viable as it crashes if the property is missing. Instead you'd have to write if (auto MN = O->get("foo")) if (auto N = MN->getAsNumber()) .... This is the overwhelmingly common pattern for parsing objects, and I do think we should support this in one expression and preferably one call.

If you disagree with that, let's talk about that first :-)

If the issue is just that operator[] is annoying with pointers, how about just a method for indexing so that you can use -> to call it? A->at(i).asNull() or A->get(i).asNull() seem fine?

So assuming we're going to have these methods for object, I think the biggest issue is consistency with object. I do also think A->getNumber(I) is a lot nicer than A->get(I).getAsNumber(). But neither of these are hard objections and I'm happy to just drop all of the Array::get*.

266 ↗(On Diff #144327)

OK. I'd really like to avoid three-word names here so I'll have a think about this over the weekend.

I do think get is always superfluous - if you want to signal that something tricky is happening, get doesn't do that.

This is logically just accessing a member of a discriminated union, which I don't think is anything tricky at all, so I'm not sure what the verb should be. But I'll find something.

sammccall marked an inline comment as done.

Wrap DenseMap instead of inheriting, drop some container member functions.

sammccall updated this revision to Diff 148916.May 29 2018, 8:17 AM

Rename Value accessors asNumber() -> getAsNumber().
Remove typed element accessors on Array.

OK, i'm done being recalcitrant, I think everything is addressed now :-)

include/llvm/Support/JSON.h
100 ↗(On Diff #144327)

I've dropped the typed Array::get*, but kept Object::get* as explained above.

We now have this glorious code in the test: (*(*A)[4].getAsArray())[1].getAsInteger() but that's not terribly representative of likely real-world use.

266 ↗(On Diff #144327)

I tried a few out with a small survey (N=4). From most to least preferred

  • asNumber() - this was most people's favorite
  • getAsNumber() - everyone could live with this; one person liked it as much as asNumber
  • getNumber() - an awkward compromise
  • toNumber() - probably sounds too much like a conversion
  • number() - I really like this, but nobody else does

I've changed them to getAsNumber, but I'm still a bit conflicted here.

@chandlerc Ping. Hope you had fun in committee :-)

It feels a bit strange to be pinging a code review that's not even my own :-), but is this still being reviewed, or is it now entirely stalled?

If the latter, I can rework my JSON tablegen patch in D46054 to reintroduce the (much more trivial) JSON output code from its first draft. But I'd prefer to use this library if it still looks like landing.

@chandlerc Ping again :-)
I'm going on a month's leave fairly soon (a week-ish exact date TBD), it'd be great to get this wrapped up before then. Or a realistic estimate - a few things depending on this now.

@simon_tatham The plan is still for this to land. Chandler's schedule has been busy and also disrupted, so I'm hoping to catch him at a weak moment :-)

chandlerc accepted this revision.Jul 6 2018, 2:21 PM

Ok, I generally think this is looking good to go in... As mentioned previously, I think there is a good rationale around supporting this despite the existing other/similar libraries.

The code quality is crazy high. Awesome job there.

I have a bunch of pretty minor questions or suggestions below. But I think they're mostly optional stuff. Even if you decide to make the changes, I'm even fine with those being follow-up patches and such as they seem unlikely to wildly change the API surface here, and none of them seem to represent serious bugs or anything.

So don't block on any of this to land the patch and move to a more iterative model. The patch as a whole is LGTM, and I'd just suggest folding the suggestions below that make sense to you to fold into the initial one you land, and follow-up on anything else.

include/llvm/Support/JSON.h
292 ↗(On Diff #148916)

Does some compiler reject this as just std::enable_if<!std::is_same<T, bool>::value>::type? I'd guess MSVC might complain about the use of ! but that would make me sad.

353 ↗(On Diff #148916)

I would write this as std::modf(D, &D) == 0.0. The return is a double, and this way it looks less like you're checking for the absence of an error code and more like actually checking the fractional component is zero.

354–355 ↗(On Diff #148916)

I would explicitly convert the RHS of these two to a double to make it clear that this comparison takes place is double precision floating point, not in an integral type. I think that would help the reader out, because otherwise it could look a bit like a tautology.

396 ↗(On Diff #148916)

Sadly, I think this is invalid by the pedantic wording of the spec.

Because as<T>() binds a reference to the pointer, I think that the pointer has to point to a valid object. And it doesn't yet...

I would just inline the reinterpret_cast here despite that repetition because I think all other users of as<T>() are correct.

However, since as<T>() is a private implementation detail, you could alternatively just return the pointer. Really up to you.

458–460 ↗(On Diff #148916)

Sink this to be a non-friend operator like operator == and !=?

Or hoist those to be friend operators?

I don't have a strong opinion about one pattern or the other, mostly just advocating for consistency whichever way you want to go.

463–464 ↗(On Diff #148916)

Are there likely to be a lot of these? Might make sense to leave a note for the future that this could be optimized a lot by having a custom StringRef like implementation that encodes whether the data is owned w/o spending an extra pointer on it. Clearly not needed in this patch, just a thought for the future if it comes up.

474–477 ↗(On Diff #148916)

Consider inlining this above? I actually think it makes the std::initializer_list<KV> used in a public API much easier to read if this trivial wrapper is immediately visible.

Oh, I guess this is kept out-of-line in order to define ObjectKey after Object, Array, and Value?

Not sure this ordering is buying that much in terms of readability. Given that you can define ObjectKey first, I might just do that and avoid the need to make this out-of-line... Anyways, this is just an optional suggestion, I'm fine with whichever way you end up.

490–491 ↗(On Diff #148916)

Given that these have JSON in the name, does it make sense to move them out of the json namespace?

554–565 ↗(On Diff #148916)

This is really nice btw. =D

596–599 ↗(On Diff #148916)

Would it be a readability improvement to have this be llvm::parseJSON rather than llvm::json::parse?

This revision is now accepted and ready to land.Jul 6 2018, 2:21 PM
sammccall marked 8 inline comments as done.Jul 9 2018, 3:09 AM

Thanks a lot for the review!
Made most of the changes. The one I'm least certain about: I didn't move json::parse --> parseJSON.
As you suggest, going to land this patch and happy to make any followup changes, I don't expect much API churn (even parse doesn't tend to have a lot of callsites).

include/llvm/Support/JSON.h
292 ↗(On Diff #148916)

That works fine. I just didn't look critically enough at this once it compiled :-)

474–477 ↗(On Diff #148916)

I wish this was just readability :-(

The problem is a circular dependency: defining KV needs Value to be complete, Value requires Object to be complete (because of Union). So the Object -> Value -> KV ordering is necessary, I think.

ObjectKey could indeed be hoisted higher, but that alone doesn't buy anything, I think.

490–491 ↗(On Diff #148916)

I'm not sure. The intent is these are part of a family of functions found via ADL on the second argument.
When used in that way, they are always available (because of ADL on the first argument), and they don't particularly "belong "in namespace llvm (they overload for builtin and std types).

Most of these particular functions are for generic code (like ObjectMapper, or fromJSON(Value, std::vector) itself) . e.g. calling Value.getAsNumber() is generally nicer than fromJSON(Value, double). So moving these particular overloads into namespace llvm seems to give them more prominence compared to the rest of the API than seems warranted.

c.f. llvm::json::parse vs llvm::parseJSON, which *is* a main entrypoint.

554–565 ↗(On Diff #148916)

:-) We had plenty of this code in clangd to test the API on.

The one thing that's pretty awful is that when parsing fails there's no detailed representation of the error. But worse seems to be better for us so far.

596–599 ↗(On Diff #148916)

Hmm, this seems like a wash to me. parseJSON is slightly shorter than json::parse, while the latter hints (accurately) at a library boundary.

Given the *other* entrypoints (Value, ObjectMapper etc) are in namespace json, I lean towards keeping this there too - among other things, it makes the spelling/capitalization of json::parse, json::Value etc easier to remember I think.

Certainly happy to change this if you have a strong opinion here though.

This revision was automatically updated to reflect the committed changes.
sammccall marked 3 inline comments as done.