This is an archive of the discontinued LLVM Phabricator instance.

tree-sitter-mlir: contribute a more complete grammar.js
ClosedPublic

Authored by artagnon on Feb 20 2023, 8:48 AM.

Details

Summary

Contribute a grammar.js from the upstream project maintained at
https://github.com/artagnon/tree-sitter-mlir. The new grammar.js
includes several fixes, and successfully parses 60-80% of MLIR tests in
the Arith, Tensor, and Linalg dialects.

Diff Detail

Event Timeline

artagnon created this revision.Feb 20 2023, 8:48 AM
artagnon requested review of this revision.Feb 20 2023, 8:48 AM
artagnon updated this revision to Diff 498932.Feb 20 2023, 12:29 PM

Update; format with git-clang-format

Something that doesn't seem great about this approach is the need for every operation of every dialect to have an explicit entry. For MLIR that severely limits the situations that the grammar can be useful (effectively no downstream user can benefit), and is also hard to maintain, is there not a more generic fuzzy approach that gets us the correct highlighting most of the time (similarly to the textmate grammar)? @jpienaar

mlir/utils/tree-sitter-mlir/README.md
9–10

we shouldn't be referencing an external project for this, I'd rather have everything necessary in-tree.

Something that doesn't seem great about this approach is the need for every operation of every dialect to have an explicit entry. For MLIR that severely limits the situations that the grammar can be useful (effectively no downstream user can benefit), and is also hard to maintain, is there not a more generic fuzzy approach that gets us the correct highlighting most of the time (similarly to the textmate grammar)? @jpienaar

Every operation of every dialect has a custom syntax, and as far as I can tell, there is very little that is common between the syntaxes. A fuzzy approach would essentially entail a very permissive parser, but even if we match custom_operation with /[a-z.]/, I find myself at a loss when having to define its fields. Downstream users won't benefit from the custom_operation section, as those will be ERROR nodes; however, they would still benefit from tree-sitter error-recovery and highlighting the types, literals, attributes etc. correctly, which would be on-par with a hypothetical fuzzy grammar. Textmate grammars are very slow, since they rely on regexp matching, and only VSCode currently uses them: both NVim and Emacs have recently merged their respective tree-sitter features.

While I agree that a complete grammar.js using this approach is a significant time investment, I don't think this grammar is particularly hard to maintain. When a new custom_operation is added, users will automatically see that it is not being highlighted; then, it's just a matter of adding an entry to the custom_operation table. Moreover, I have a small script to bench grammar.js against the .mlir files in test/Dialect/, and we could even add this to the CI to avoid regressions.

artagnon added inline comments.Feb 22 2023, 10:04 AM
mlir/utils/tree-sitter-mlir/README.md
9–10

I agree with this. I'll prepare a patch checking in everything from the external project in-tree.

artagnon updated this revision to Diff 499578.Feb 22 2023, 10:30 AM

Don't reference external project; check in complete project.

Another approach would be to generate this from ODS declarative assembly syntax, any obstacle to this?

artagnon marked an inline comment as done.Feb 22 2023, 11:03 AM

Another approach would be to generate this from ODS declarative assembly syntax, any obstacle to this?

Indeed, this is a very good idea; it will work for 70-80% of custom operations: we'd have to include the mlir-tblgen headers, and write something that processes all the .td files in one go, and spit out the custom_operation section of grammar.js, by looking at the assemblyFormat of defs. However, we'd have to rely on hand-written parsers for those defs that have customAssemblyFormat. I can try to attack this problem over the coming days, as a follow-up patch.

When I did the first version I was using this as I wanted to run some analysis tool that uses tree-sitter (that didn't work well) and also to check the language guide a bit using the error checks in tree-sitter generation process. Generation would be really nice, and from maintenance point of view (and we know these may go out of sync, so it would be best effort). Could you perhaps show the before and after of how it affects highlighting? Esp with an unsupported dialect too, to speak to the fuzzing matching part.

mlir/utils/tree-sitter-mlir/grammar.js
463

Could we utilize modules here to have these be in separate files?

artagnon marked an inline comment as done.Feb 23 2023, 4:43 AM

When I did the first version I was using this as I wanted to run some analysis tool that uses tree-sitter (that didn't work well) and also to check the language guide a bit using the error checks in tree-sitter generation process. Generation would be really nice, and from maintenance point of view (and we know these may go out of sync, so it would be best effort). Could you perhaps show the before and after of how it affects highlighting? Esp with an unsupported dialect too, to speak to the fuzzing matching part.

Actually, I found that the tree-sitter error recovery algorithm is quite rudimentary at the moment, and there is nothing we can do as a grammar-author to get better error-recovery. Here's a comparison of how a supported dialect highlights versus how a downstream dialect highlights.

After some digging, I'm not feeling terribly optimistic about auto-generation, since customAssemblyFormat seems to be the norm for any operation with a non-trivial assembly format. I'm going to attempt auto-generation of grammar for dialects like Arith and Math as a follow-up, but for Linalg, ControlFlow, SCF, and the like, we have to rely on hand-written parsers.

mlir/utils/tree-sitter-mlir/grammar.js
463

It turns out that tree-sitter doesn't support ESM modules, but I found a way to split the grammar up into multiple files nevertheless. Will post an updated diff shortly.

artagnon updated this revision to Diff 499822.Feb 23 2023, 6:00 AM
artagnon marked an inline comment as done.
  • Split up dialect grammars into separate files
  • Include highlighting test for downstream dialect

I see on the screenshot for unknown dialects it treats everything as unknown (e.g., memref not identified). That's more conservative (only show what you know) but might be worse for highlighting. Do you know if this is intrinsic to tree-sitter or the way the grammar is written? (It was written directly matching our markdown/uses to fix markdown, with explicit structure in mind).

The poor error recovery is intrinsic to tree-sitter, and there is nothing we can do as a grammar author to improve it. I found a couple of issues related to this here and here.

This is very thorough, which is both good and bad. Maintaining these won't be on many folks minds or knowledge set. The other highlighting support we have here is much simpler (they are approximate, they only attempt to work well enough). So this has me a bit in two minds. As long as we don't make it required or increase maintenance demands, that seems fine (and we can revise if not).

mlir/utils/tree-sitter-mlir/bench.mjs
9 ↗(On Diff #499822)

Why is this needed?

mlir/utils/tree-sitter-mlir/dialect/linalg.js
7

Could we just do something like linalg.[a-z_]* ? (this list on named ops keep on growing but I think the assembly format is all similar)

artagnon added inline comments.May 16 2023, 8:41 AM
mlir/utils/tree-sitter-mlir/bench.mjs
9 ↗(On Diff #499822)

So we can benchmark against test-suite in supported dialects, and notify contributors when coverage drops. We can't possibly write tests for everything: the tests are extracted from the test-suite anyway.

mlir/utils/tree-sitter-mlir/dialect/linalg.js
7

The issue is overlap with linalg.index, linalg.map, and linalg.yield, which have a different parse-rule. I agree that it's not ideal to maintain this list, but I don't see a way out, unless we make the parser overly permissive to all linalg operations.

jpienaar added inline comments.May 18 2023, 11:01 PM
mlir/utils/tree-sitter-mlir/bench.mjs
9 ↗(On Diff #499822)

As long as optional and expectation is clear.

I don't particularly like serializing the results of a benchmark run here vs in some dashboard. E.g., some repo with GitHub action that checks out llvm and runs/tracks these over time would cause less churn here.

mlir/utils/tree-sitter-mlir/test/corpus/type.txt
1243

Why is this needed vs just inf part?

artagnon updated this revision to Diff 524058.May 20 2023, 2:21 PM
artagnon marked an inline comment as done.

Address review comments:

  • Remove bench.js.
  • Update arith inf test to not include irrelevant lines.
jpienaar accepted this revision.Jun 5 2023, 7:38 AM

Looks good - I would like something where this could be generated more from the asm specs (it will never be complete, but we could get to something that reduces effort here). Let me know if you need help in landing.

This also makes me a bit sad for if github switches to tree-sitter, I think it'll make it more difficult to give "good enough (even if not great)" across known and unknown dialects.

This revision is now accepted and ready to land.Jun 5 2023, 7:38 AM
This revision was automatically updated to reflect the committed changes.
mlir/utils/tree-sitter-mlir/test/corpus/func.txt