This is an archive of the discontinued LLVM Phabricator instance.

Generate Config {Fragment structure, json schema, docs, YAML parser} from schema spec
Needs ReviewPublic

Authored by sammccall on Dec 28 2022, 3:53 PM.

Details

Reviewers
kadircet
nridge
Summary

The generated files are checked in, and updated manually (tested).
This means we can consume them downstream from clangd-www & schemastore
without needing to run the whole LLVM build system.

Diff Detail

Event Timeline

sammccall created this revision.Dec 28 2022, 3:53 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 28 2022, 3:53 PM
nridge added a subscriber: nridge.Dec 28 2022, 5:11 PM
sammccall updated this revision to Diff 485562.Dec 28 2022, 6:27 PM

Restructure schema so $ref can use JSON pointers, fixing compat with yaml-language-server.

Add some VSCode extensions supported by yaml-language-server.

sammccall updated this revision to Diff 485564.Dec 28 2022, 6:51 PM

fix external: none

sammccall updated this revision to Diff 485568.Dec 28 2022, 7:10 PM

fix schema version

sammccall published this revision for review.Jan 2 2023, 5:26 AM

This is definitely not ready to ship, but I'd like some high-level feedback before adding tests/polish.

Most important thing to look at is schema.yaml which would be the source of truth, followed by maybe Generate.cpp and the new structure of ConfigYAML (but no need to go through the code details at this point).

Particular questions I'd like opinions on:

worthwhile overall?

This isn't trivial: does it solve a big enough problem to be worth the complexity? (Benefit would be improved maintenance of existing config duplication, and being able to accept JSON-schema which IMO we shouldn't without automation).

better off parsing schema.json rather than defining a new format?

I don't think so (json-schema is too powerful a language, and it's IMO pretty verbose and painful to edit).
But the bias towards creating new things needs to be double-checked!

does the YAML structure seem OK?

I'm reasonably happy with the compromises I landed on e.g.

  • distinction between lowercase for meta-properties and UpperCase for schema children
  • ways to handle special casing around If.UnknownKey and Index.External = None

but maybe they only make sense in my own head

build-time generation vs checking in generated files

Generating at cmake-time is the obvious choice for *.inc, but the consumers of schema.json and doc.md aren't part of our build.

  1. we could generate *.inc at build time, and say that it's up to some other repo(s) to run the llvm build system and produce+publish schema.json and doc.md
  2. we could generate *.inc at build time, and check in schema.json and doc.md (need to update by rerunning the tool, we could have a test)
  3. we could generate all files by hand and check them in

My feeling is that 1 is too burdensome to get these extra files.
2 is a bit more complicated than 3 (two ways to do things, and building the tool in host config adds some cmake complexity).
And I'm not sure it has any advantages: regenerating all the files isn't more work than regenerating half.

So I lean towards 3 but I don't have much confidence in what will be least annoying overall.

meta-schema.json

Once I got the tooling set up, having a schema+diagnostics for schema.yaml is pretty nice.
It provides a place to document the structure of the file, and gives a crisp answer to "is this change broken" (albeit one we won't actually have a test for).
OTOH it's not strictly necessary, and yet more stuff in the tree.

Enums in ConfigFragment.h

Previously these were strings in ConfigFragment.h, and the enums were defined in Config.h, which I think is where they belong in terms of dep structure.

However once schema.yaml knows about enums, it seems natural to generate the enum definitions and the code to parse them.

So options all have downsides:

  1. put them in ConfigFragment.h and parse them in ConfigYAML, with the other generated code. Now Config needs to depend on ConfigFragment
  2. generate them into Config.h and generate parsing code in ConfigCompile as today. This means we now need to inject generated code into 4 places instead of 2.
  3. don't generate the enums, define & parse them by hand as today (and need to keep that code in sync)

I went with 1 in the code, but I really don't know what's best here.

Dir structure

With so many extra files, I created config/ and I think moving existing Config* sources to config/* seems like a clear improvement, but want to get an ack on this first as renaming is annoying.

Herald added a project: Restricted Project. · View Herald TranscriptJan 2 2023, 5:26 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

worthwhile overall?

This isn't trivial: does it solve a big enough problem to be worth the complexity? (Benefit would be improved maintenance of existing config duplication, and being able to accept JSON-schema which IMO we shouldn't without automation).

I find the solution here valuable, especially considering all the times that we've failed to wire/test things properly when we introduced new config options and the lack of documentation in clangd-www. Also this will make introducing new options straightforward (no need to write fragments or parse logic).
Last time we discussed the concerns I had (at least the ones I can remember) were losing the readability we have in a C++ header file (when performing the fragment->config step) and difficulty about changing "parse" behavior every now and then, but considering all the bugs/outages we had ever since I think these concerns are OK to give up at this point (I also like the solution around only writing custom parse logic when needed).

better off parsing schema.json rather than defining a new format?

I don't think so (json-schema is too powerful a language, and it's IMO pretty verbose and painful to edit).
But the bias towards creating new things needs to be double-checked!

I also feel like editing the YAML format is easier than the JSON one, this might be an LLVM bias but considering the project is part of LLVM, I don't think that'll be a bad thing.

does the YAML structure seem OK?

I'm reasonably happy with the compromises I landed on e.g.

  • distinction between lowercase for meta-properties and UpperCase for schema children
  • ways to handle special casing around If.UnknownKey and Index.External = None

but maybe they only make sense in my own head

I am still a little unhappy with losing the readability through the new YAML format. but I also don't have any better alternatives and we still get to keep the Config.h so at least when writing features we still have a single C++ header to read.

build-time generation vs checking in generated files

Generating at cmake-time is the obvious choice for *.inc, but the consumers of schema.json and doc.md aren't part of our build.

  1. we could generate *.inc at build time, and say that it's up to some other repo(s) to run the llvm build system and produce+publish schema.json and doc.md
  2. we could generate *.inc at build time, and check in schema.json and doc.md (need to update by rerunning the tool, we could have a test)
  3. we could generate all files by hand and check them in

My feeling is that 1 is too burdensome to get these extra files.
2 is a bit more complicated than 3 (two ways to do things, and building the tool in host config adds some cmake complexity).
And I'm not sure it has any advantages: regenerating all the files isn't more work than regenerating half.

So I lean towards 3 but I don't have much confidence in what will be least annoying overall.

I am actually more inclined towards option 1). Having *.inc generation as part of the build step and not caring about them ever sounds like the better trade-off. I don't know how troublesome it's in the CMake sense (but we already do that with proto-generated files in clangd for example).
If you think that's too much work I am also fine with option 3 as you mentioned (this also has the nice side effect of checking the delta to real code after a change to the schema).
As for doc.md, it's unfortunate that we'll need to publish it in a different repository anyways, but I don't think that's a huge burden and we can probably get away with doing that only at branch cuts. Having doc.md and schema.json always available in the repo as source of truth (to aid debugging when things go wrong) sounds like a good plus to me. So I am in favor of having them in.

meta-schema.json

Once I got the tooling set up, having a schema+diagnostics for schema.yaml is pretty nice.
It provides a place to document the structure of the file, and gives a crisp answer to "is this change broken" (albeit one we won't actually have a test for).
OTOH it's not strictly necessary, and yet more stuff in the tree.

I am in favor of having it. this is unlikely to change so not a huge burden for maintenance. I'd still like to have some sort of schema description in the YAML file though (as a comment, both for a place to read without context switches and to serve as a template when introducing a new option).

Enums in ConfigFragment.h

Previously these were strings in ConfigFragment.h, and the enums were defined in Config.h, which I think is where they belong in terms of dep structure.

However once schema.yaml knows about enums, it seems natural to generate the enum definitions and the code to parse them.

So options all have downsides:

  1. put them in ConfigFragment.h and parse them in ConfigYAML, with the other generated code. Now Config needs to depend on ConfigFragment
  2. generate them into Config.h and generate parsing code in ConfigCompile as today. This means we now need to inject generated code into 4 places instead of 2.
  3. don't generate the enums, define & parse them by hand as today (and need to keep that code in sync)

I went with 1 in the code, but I really don't know what's best here.

For me, the biggest downside of the current approach is the need to spell enum names in generated code inside Config.h. but I don't think managing the parser code by hand can be justified just to refer to familiar names in Config.h.
So I think all good here.

Dir structure

With so many extra files, I created config/ and I think moving existing Config* sources to config/* seems like a clear improvement, but want to get an ack on this first as renaming is annoying.

agreed, as discussed in the past, I think we should slowly try to define more fine grained targets for clangd (so that we can better check for layering violations, enable code reuse, enable more caching etc.)

sammccall updated this revision to Diff 495817.Feb 8 2023, 5:58 AM
sammccall retitled this revision from WIP: generate Config {Fragment structure, json schema, docs, YAML parser} from schema spec to Generate Config {Fragment structure, json schema, docs, YAML parser} from schema spec.
sammccall edited the summary of this revision. (Show Details)

Add comments to schema.yaml.
Add tests that generated files are up-to-date, and script to regenerate
Rename doc.md -> documentation.md

Based on offline discussion, doubling down on checking in generated files to make these easy/possible to consume where they're needed.

rebased and addressed high level comments (I think!)

ping on this one for when you have time
I haven't rebased against latest config yet, it's a bit of a moving target :-)

ping on this one for when you have time

(Just wanted to double-check, is this ping directed to me or Kadir (or both)? I haven't looked at this because Kadir has done a round of review and it seemed like duplication of work for me to look at it as well; however, if you'd like me to do a round of review, I'm happy to!)

ping on this one for when you have time

(Just wanted to double-check, is this ping directed to me or Kadir (or both)? I haven't looked at this because Kadir has done a round of review and it seemed like duplication of work for me to look at it as well; however, if you'd like me to do a round of review, I'm happy to!)

Oops, indeed it was directed at Kadir, we'd also discussed it offline at length a few months ago and I think this is mostly a question of cleaning up the details now.

Always happy to have more feedback or hear concerns, but no need unless you're interested (and this patch isn't that interesting :-D)