This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Add schema for `.clangd` config
AbandonedPublic

Authored by hyperupcall on Dec 20 2022, 11:49 PM.

Details

Reviewers
nridge
Summary

This adds a JSON Schema for the .clangd configuration file.

Fixes https://github.com/clangd/clangd/issues/1376

Diff Detail

Event Timeline

hyperupcall created this revision.Dec 20 2022, 11:49 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 20 2022, 11:49 PM
hyperupcall requested review of this revision.Dec 20 2022, 11:49 PM
hyperupcall edited the summary of this revision. (Show Details)Dec 20 2022, 11:52 PM
hyperupcall added a reviewer: nridge.

Thanks, this is pretty neat!

cc @sammccall as a heads up, just to make sure you're cool with having this in-tree

clang-tools-extra/clangd/schema/config.json
49

Technically string OR enum [ "Ancestors", "None" ] would be more accurate

Not sure if it's worth specifying that... from a validation point of view, string is a superset of the other one, but maybe some people will look at the schema for documentation, in which it could be useful to have those "keywords" listed?

73

I don't think this is quite right:

  • If File is specified, Server and MountPoint should not be
  • If Server is specified, MountPoint can optionally be specified but File should not be

Thanks, this is pretty neat!

cc @sammccall as a heads up, just to make sure you're cool with having this in-tree

This has been discussed before, unfortunately while I *thought* it was on a bug, it was actually in a PR: https://github.com/clangd/vscode-clangd/pull/140

I don't think it's a good idea to add a third (arguably, fourth/fifth) place that the config structure and documentation must be maintained by hand, especially one that constrains the structure in non-obvious ways. This needs to be automatically generated.

This implies moving the source of truth to something we can generate {fragment struct, website, parsing code, YAML schema} from.)

I made an attempt at this in https://reviews.llvm.org/D115425 using TableGen, which is the closest LLVM has to a standard way to do this, but the language is pretty bad (for this purpose).
YAML itself is the best alternative we came up with to express the schema in. Using the JSON-schema itself as the source of truth is another interesting option, though it might be simpler to have a format we control.
My recollection was that I thought this was valuable to solve in one way or other, and @kadircet (whose call it is now, really) didn't see it as a high priority.

This has been discussed before, unfortunately while I *thought* it was on a bug, it was actually in a PR: https://github.com/clangd/vscode-clangd/pull/140

I don't think it's a good idea to add a third (arguably, fourth/fifth) place that the config structure and documentation must be maintained by hand, especially one that constrains the structure in non-obvious ways. This needs to be automatically generated.

This implies moving the source of truth to something we can generate {fragment struct, website, parsing code, YAML schema} from.)

I made an attempt at this in https://reviews.llvm.org/D115425 [...]

Thanks for reminding me about the previous work on this! I definitely appreciate the value in generating these various files / pieces of code from a single source of truth.

That said, getting there requires what seems like a non-trivial refactoring that doesn't seem to be in anyone's immediate plans. Meanwhile, it's clear that there is demand in the user community for a schema, and this schema is already out there, today, at https://json.schemastore.org/clangd.json, automatically in effect for anyone who uses VSCode's YAML extension (or comparable functionality in another editor).

The practical implication of allowing the schema to be stored in the clangd repo seems to be:

(1) there is yet another place where the config structure is recorded, but at least that place is in the clangd repo so we're likely to remember to update it at the same time we add new config features

vs.

(2) there is yet another place where the config structure is recorded, but it's on a third-party site and it's probably going to get out of date and only updated when someone periodically remembers to update it, and that person then has to identify and document a batch of changes to the config file format that others have made in the intervening time

Given this choice, I think there's a clear benefit to having the file in the repo.

If/when the described refactoring is implemented, the schema file can start benefiting from also being generated from a single source of truth.

I think there's a clear benefit, but my feeling is it's outweighed by the costs, which are fairly high given lack of any automation/tests.
If clangd contributors should maintain this, how should contributors/reviewers know whether changes are correct?
If they needn't maintain it, then having it on a third-party site seems appropriate. Maybe a /contrib/ directory is a useful compromise?

This is definitely informed by my biases: I have to make/review changes that touch config, don't have a good understanding of JSON-schema, and I don't use VSCode.


One practical question: AIUI the main point of having this is so it can be provided to the VSCode YAML extension so it understands .clangd files. How does this work mechanically? Does it need to be part of the vscode-clangd repo instead?

given lack of any automation/tests

That's a fair point, it would definitely help validate the correctness of the schema if we had tests in the form of example config files that pass or fail validation.

I would be supportive of adding such tests, but I'm also hesitant to require them as a condition of accepting the addition of the schema file since it does add a fair amount of effort and complexity.

If clangd contributors should maintain this, how should contributors/reviewers know whether changes are correct?

Is "look at existing sections of our schema file as a guide, or barring that, the JSON-schema specification" too unsatisfactory an answer?

My feeling is that, at the end of the day, the schema format is relatively straightforward compared to other things that a clangd reviewer needs to know (like C++, or the Clang AST API), and the stakes of letting a mistake slip in are quite low (i.e. maybe a user will fail to get code completion for a config file key, as opposed to say clangd crashing on your code).

If they needn't maintain it, then having it on a third-party site seems appropriate. Maybe a /contrib/ directory is a useful compromise?

My original thinking was that reviewers should ask for patches that change config to make correspoding changes to the schema, which I guess would fall under "clangd contributors should maintain this".

But I'm definitely happy to compromise on this -- perhaps we could keep it in a /contrib/ directory like you say, and encourage but not require that config changes keep it up to date? And perhaps, if someone contributes automated tests for the schema in the future, we could consider upgrading its status to "maintained" at that time?

I don't use VSCode

(Note, there's nothing tying this to VSCode in particular. For example, lsp-mode seems to have support for consuming schemas from SchemaStore based on https://emacs-lsp.github.io/lsp-mode/page/lsp-yaml/#lsp-yaml-schema-store-uri.)

One practical question: AIUI the main point of having this is so it can be provided to the VSCode YAML extension so it understands .clangd files. How does this work mechanically?

My understanding, based on the discussion in the issue and my experimentation is:

  • If the YAML extension's "yaml.schemaStore.enable" setting is enabled (which is the default), the extension downloads a catalog of schemas from schemastore.org
  • Clangd's entry in that catalog, which you can see if you load https://www.schemastore.org/api/json/catalog.json and filter for "clangd" [aside: I recommend Firefox's built-in JSON viewer for such tasks, the builtin JSON visualization and filtering is really neat; if you have any pull with the Chrome team, maybe put in a good word for adding a similar JSON viewer], declares that the schema applies to files named .clangd
  • The extension automatically applies the schema to the .clangd file in your workspace

Does it need to be part of the vscode-clangd repo instead?

I think being in the vscode-clangd repo would have the small advantage that the vscode-clangd extension could use the YAML extension's contribution point to install the schema locally even if "yaml.schemaStore.enable" is set to false. Not sure whether that's worth putting in place.

Improve schema correctness

Apply clang-format

Thank you for the suggestions - I applied all of the fixes!

As someone who helps maintain a lot of the schemas on schemastore, it's nice when schemas are in-tree with their respective project, but I totally understand that this can increase complexity, especially for the reasons stated.

given lack of any automation/tests.

And perhaps, if someone contributes automated tests for the schema in the future, we could consider upgrading its status to "maintained" at that time?

If you would like me to add tests to verify the schema (for now or later?), is there a utility within LLVM to help do so? I saw TableGen was mentioned, but it sounds like the schema validation use case is a different issue.

One practical question: AIUI the main point of having this is so it can be provided to the VSCode YAML extension so it understands .clangd files. How does this work mechanically? Does it need to be part of the vscode-clangd repo instead?

I think it's worth mentioning that the JSON schema can be hosted anywhere. As already mentioned, the YAML extension automatically downloads and uses the schema if it finds a .clangd file, and the schema can point to an external URL. One example of that in practice is the xunit-2.2.json schema; it simply contains a $ref to the actual schema on xunit.net. Whether this schema goes in clangd/schema/config.json, clangd/contrib/schema/config.json, or in vscode-clangd - on the SchemaStore end, all I have to do is update $ref.

If you would like me to add tests to verify the schema (for now or later?), is there a utility within LLVM to help do so? I saw TableGen was mentioned, but it sounds like the schema validation use case is a different issue.

Is there a command-line tool that can perform JSON schema validation which is included in a commonly used package that's available in the default package repositories of major Linux distros? If so, we may be able to get away with just requiring that the buildbots have that package installed.

If not, then we'd need to check in such a tool into the LLVM repo (and if it needs building, integrate its build into the build system).

In either case, the tests themselves can take the form of simple lit tests that invoke that tool on some test config files.

Is there a command-line tool that can perform JSON schema validation

(well, more specifically that can validate YAML files against a JSON schema)

The difficulty in testing is that:

  • JSON-schema is totally unknown in llvm-project, so there are no tools but also no expectation that contributors understand it
  • llvm-project doesn't like dependencies, particularly those that aren't in C++ or python
  • having done some digging, interop seems really poor...

I tried 3 consumers (ajv-cli, yaml-language-server, and yajsv) and hit different blocking bugs in all of them when using obvious, spec-compliant approaches to writing schemas.

My feeling is that, at the end of the day, the schema format is relatively straightforward compared to other things that a clangd reviewer needs to know

But this is additional, it's obscure (unrelated to all the other knowledge), and as it's untested, the reviewer needs to reliably catch problems, including subtle interop problems.
It has the "python problem", where it's readable and you can see at a glance why the code is correct - even if it isn't.
(Having spent a couple of days with json-schema now, I don't think it's straightforward. That file seems simple so far, but there are simple C++ programs too! And it carefully dodges the interop issues, which is *subtle*).

Meanwhile, changing config already is a sea of boilerplate that's hard to review and bugs/divergences have crept in. (Config settings that are compiled but not parsed, things missing from the web docs, web docs being reworded but internal docs staying the same, etc).


Fixing this is an unreasonable burden to place on a new contributor, I've taken a shot at this and have something almost working: https://reviews.llvm.org/D140745.

It's pretty draft-ish:

  • That patch just checks in the generated files {Fragment.inc, YAML.inc, doc.md, schema.json}, (maybe) that belongs in the build system
  • the schema is correct but yaml-language-server chokes on it due to https://github.com/redhat-developer/yaml-language-server/issues/823, so need to restructure
  • files would need to be moved around etc
  • tests need to be added/updated
  • i'm not sure whether having a json-schema for schema.yaml is actually a good idea, it was mostly a learning exercise
clang-tools-extra/clangd/schema/config.json
21

disabling additionalProperties probably yields useful diagnostics

26

Unfortunately it's not valid to specify description together with $ref (implementations are required to ignore it)

https://datatracker.ietf.org/doc/html/draft-handrews-json-schema-01#section-8.3

82

external can also be the case-insensitive string "None"

116

this is rather a list of strings

173

hmm, there are two more categories here - were they missing somewhere?

I tried 3 consumers (ajv-cli, yaml-language-server, and yajsv) and hit different blocking bugs in all of them when using obvious, spec-compliant approaches to writing schemas.

Thanks for experimenting. Happy to defer to your judgment on the best course of action given this state of affairs (and also the mentioned difficulty in accessing these tools from an LLVM test suite).

I've taken a shot at this and have something almost working: https://reviews.llvm.org/D140745.

Neat!

clang-tools-extra/clangd/schema/config.json
21

One question here is, do we want diagnostics from the schema validation to duplicate or replace clangd's own diagnostics for the config file?

26

Hmmm... the spec draft at https://json-schema.org/draft/2020-12/json-schema-core.html#section-8.2.3.1, which is newer (2022 rather than 2018) does not seem to have this wording, and the example at https://json-schema.org/learn/getting-started-step-by-step.html#references uses $ref together with description.

173

Designators is missing from https://clangd.llvm.org/config.html#inlayhints

What's the other one?

sammccall added inline comments.Dec 28 2022, 6:03 PM
clang-tools-extra/clangd/schema/config.json
21

I'm not totally sure.

It's tempting to call these a failed experiment. The UX for clangd-emitted diags is pretty sad (since they only generate/update when we happen to process the config file again).

However there are definitely going to be things you can get wrong that the schema won't catch, and we should make them visible *somehow*.

26

Yeah, one of the complexities here is the version skew :-(
Different versions are (partially) supported by different tools. Some of them pay attention to what the file is self-declared as, others don't.

draft-07 looks like the lowest-common-denominator (and is what this file claims to be), and in particular yaml-language-server always validates against draft-07 regardless of what the file claims.

173

There isn't one, I was confused.
(I think I was thinking of type-param hints, but we never added it :-()

hyperupcall marked 2 inline comments as done.Jan 9 2023, 11:54 PM

The difficulty in testing is that:

  • JSON-schema is totally unknown in llvm-project, so there are no tools but also no expectation that contributors understand it
  • llvm-project doesn't like dependencies, particularly those that aren't in C++ or python
  • having done some digging, interop seems really poor...

I tried 3 consumers (ajv-cli, yaml-language-server, and yajsv) and hit different blocking bugs in all of them when using obvious, spec-compliant approaches to writing schemas. I only have experience with ajv, so I suppose it means

It has the "python problem", where it's readable and you can see at a glance why the code is correct - even if it isn't.
(Having spent a couple of days with json-schema now, I don't think it's straightforward. That file seems simple so far, but there are simple C++ programs too! And it carefully dodges the interop issues, which is *subtle*).

I appreciate that you took the time to experiment with those three implementations of JSON-schema - perhaps I was naive in thinking that the different implementations were more mature. On my side, this makes me want to validate the various schemas in SchemaStore with more implementations than just ajv to improve the experience for everyone.


Fixing this is an unreasonable burden to place on a new contributor, I've taken a shot at this and have something almost working: https://reviews.llvm.org/D140745.

It's pretty draft-ish:

  • That patch just checks in the generated files {Fragment.inc, YAML.inc, doc.md, schema.json}, (maybe) that belongs in the build system
  • the schema is correct but yaml-language-server chokes on it due to https://github.com/redhat-developer/yaml-language-server/issues/823, so need to restructure
  • files would need to be moved around etc
  • tests need to be added/updated
  • i'm not sure whether having a json-schema for schema.yaml is actually a good idea, it was mostly a learning exercise

This looks pretty neat! I think something like that revision would be a better way forward. I will go ahead and upload a new Diff, I suppose just for completeness, but I don't expect anything to change much since it doesn't address your points about syncing with documentation or the obscurity of the specification.

Improve schema correctness and disable additionalProperties

hyperupcall abandoned this revision.EditedJan 10 2023, 12:22 AM

I think I shall now abandon this revision for the reasons stated above - there is an improved path forward at https://reviews.llvm.org/D140745 with autogenerated files.