This is an archive of the discontinued LLVM Phabricator instance.

[RFC][clangd] Use interpolation for CDB pushed via LSP protocol
AbandonedPublic

Authored by DmitryPolukhin on Apr 18 2023, 2:35 PM.

Details

Summary

Now clangd only interpolates CDBs loaded from disk and doesn't make any
interpolation for CDBs pushed via LSP protocol. This diff add the same
extrapolation logic as for loaded from disk.

Test Plan: check-clangd

Diff Detail

Event Timeline

DmitryPolukhin created this revision.Apr 18 2023, 2:35 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 18 2023, 2:35 PM
Herald added a subscriber: arphaman. · View Herald Transcript
DmitryPolukhin requested review of this revision.Apr 18 2023, 2:35 PM
tdupes added a subscriber: tdupes.Apr 24 2023, 10:51 AM

@kadircet @nridge friendly ping, could you please take a look?

I wanted to chime in and provide a bit of context.
This was a long time ago, so I might misremember, so take this with a grain of salt.

Idea behind pushing the CDB over LSP was to allow the capable client to fully control the commands produced for the files.
Decisions like interpolation were pushed towards the clients intentionally, not accidentally.
IIRC, the motivation back in the day was either sourcekit-lsp or Theia.

I will let other do the actual review, just wanted to bring up the history for a complete picutre.

I wanted to chime in and provide a bit of context.
This was a long time ago, so I might misremember, so take this with a grain of salt.

Idea behind pushing the CDB over LSP was to allow the capable client to fully control the commands produced for the files.
Decisions like interpolation were pushed towards the clients intentionally, not accidentally.
IIRC, the motivation back in the day was either sourcekit-lsp or Theia.

I will let other do the actual review, just wanted to bring up the history for a complete picutre.

Thank you for sharing the context. I completely agree with the idea that nothing should prevent clangd clients from fully controlling CDB if they want it.
And, if they do so, this diff will just does nothing because there is an exact match. It starts playing only if client provided partial CDB and inference is required.
So I think it shouldn't break any existing scenarios. But IMHO having inference is a good feature for clangd because it allows pushing exactly the same CDB as in
compile_commands.json file and re-using clangd logic how to transfer compile commands from source to header. Pushing command via LSP might be preferable
in comparison with file based approach to avoid race condition with updating the file and clangd reading it + it works better with build systems that can generate
compiles commands on the fly for files and generating all of them in advance may not be possible.

And, if they do so, this diff will just does nothing because there is an exact match. It starts playing only if client provided partial CDB and inference is required.

(hypothesising below)
I think this depends on a client at question. If I were writing one and had an idea what I want to do for headers, e.g. I might have a project system that knows which build targets headers belong to,
I would rather see no compile errors for a header (if I have bugs in my integration) than for Clangd to kick in with interpolation and silently hide the bug.

From what I can recollect, your case is different and you just want to "pipe" compile_commands.json to Clangd provided by some build system, but without clangd actually reading it.
I don't actually write an LSP client, though, would let @kadircet decide which path is preferable.

Pushing command via LSP might be preferable in comparison with file based approach to avoid race condition with updating the file and clangd reading it

Ideally this should be handled with atomic writes (write to temp file and move to destination), at least on Linux and Mac. I'm not sure if build systems do that, though.
Does Clangd ever re-read the compilation database now? It used to load it only once, so rereading would require a restart of clangd anyway. Race on just one read is very unlikely (although not impossible).
However, the compile_commands.json file is parsed lazily, when the command for a particular file is actually requested, if you see crashes or inconsistencies in your scenarios, it highly likely due to this.

+ it works better with build systems that can generate compiles commands on the fly for files and generating all of them in advance may not be possible.

FYI, there is separate support for pushing command per-file in Clangd via an LSP extension.
I think this should cover build systems that generate commands on the fly better, but this also does not use interpolation.

A meta-comment: it would be really useful to understand how you use Clangd a bit better. Would it be hard to write down which kind of client you have and what are requirements for Clangd? Which build systems are at play, how does the client use them, etc?
To validate whether patches like this one are a good fit to solve the problem or there are better ways, it's really useful to have a higher level overview of how Clangd is being used.

And, if they do so, this diff will just does nothing because there is an exact match. It starts playing only if client provided partial CDB and inference is required.

(hypothesising below)
I think this depends on a client at question. If I were writing one and had an idea what I want to do for headers, e.g. I might have a project system that knows which build targets headers belong to,
I would rather see no compile errors for a header (if I have bugs in my integration) than for Clangd to kick in with interpolation and silently hide the bug.

From what I can recollect, your case is different and you just want to "pipe" compile_commands.json to Clangd provided by some build system, but without clangd actually reading it.
I don't actually write an LSP client, though, would let @kadircet decide which path is preferable.

We have LSP client that works like a proxy and exposes LSP protocol to higher level. Now we do very deep processing of CDB and partially replicates logic clangd about command inference.
Clangd usually does good job in command inference and we would like to use this feature instead of keep our logic up-to-date. I can put this inference logic for LSP behind some command line flag
if you think that it might really break some good use case. Please let me know.

Pushing command via LSP might be preferable in comparison with file based approach to avoid race condition with updating the file and clangd reading it

Ideally this should be handled with atomic writes (write to temp file and move to destination), at least on Linux and Mac. I'm not sure if build systems do that, though.
Does Clangd ever re-read the compilation database now? It used to load it only once, so rereading would require a restart of clangd anyway. Race on just one read is very unlikely (although not impossible).
However, the compile_commands.json file is parsed lazily, when the command for a particular file is actually requested, if you see crashes or inconsistencies in your scenarios, it highly likely due to this.

Yes, clangd re-read CDB in some cases but it won't try to read CDB again from a directory if there was none before for performance reasons I think. I think synchronisation here is hard we cannot control when
clangd will actually read CDB so it might use wrong flags. Also we run multiple clangd behind multiplexing proxy and they might need different CDBs. Also CDB tends to become large and hard to manage so
per-file LSP protocol has lots of advantages for us.

+ it works better with build systems that can generate compiles commands on the fly for files and generating all of them in advance may not be possible.

FYI, there is separate support for pushing command per-file in Clangd via an LSP extension.
I think this should cover build systems that generate commands on the fly better, but this also does not use interpolation.

It is exactly the LSP protocol we are using and I added inference.

A meta-comment: it would be really useful to understand how you use Clangd a bit better. Would it be hard to write down which kind of client you have and what are requirements for Clangd? Which build systems are at play, how does the client use them, etc?
To validate whether patches like this one are a good fit to solve the problem or there are better ways, it's really useful to have a higher level overview of how Clangd is being used.

We use Buck and Buck2 as our main build system + some projects uses other build systems like CMake, and in general we don't limit build systems that subproject might use.
Therefore we cannot limit ourself to any particular clangd version and had to support multiple of them simultaneously because individual projects might might use incompatible features. So we have a LSP multiplexing proxy that
encapsulates build system specifics and combines results from several clangds. Changes in clangd that we would like to put in upstream in our understanding should be usable not only in our setup but should also improve clangd for all users.

I agree with Ilya's concerns here.

We deliberately don't mess with compile flags pushed over LSP. These are "overrides" to whatever information we have from other sources, turning on interpolation at this override layer implies we'll never fallback to other sources of information (as inference will always pick a target, it doesn't have a "that's too bad" threshold).
The contract on the LSP based compile flag setting requires clients to be "capable" of managing files somehow, having some mixed support is unlikely to benefit other users and more likely to break things as we're changing behaviour now (instead of fallback to underlying compilation database, we'll have interpolation).

I agree with Ilya's concerns here.

We deliberately don't mess with compile flags pushed over LSP. These are "overrides" to whatever information we have from other sources, turning on interpolation at this override layer implies we'll never fallback to other sources of information (as inference will always pick a target, it doesn't have a "that's too bad" threshold).
The contract on the LSP based compile flag setting requires clients to be "capable" of managing files somehow, having some mixed support is unlikely to benefit other users and more likely to break things as we're changing behaviour now (instead of fallback to underlying compilation database, we'll have interpolation).

Thank you for the feedback, I see your point. Would you mind if I make this behaviour conditionally behind a command line flag?
I think most of the users never mix flags from different sources and, if they have CDB pushed from LSP, all flags except for fallback comes from this source and underlying compilation database is empty.

DmitryPolukhin abandoned this revision.May 27 2023, 1:34 PM