Added a function for diffing highlightings and removing duplicate lines. It also returns empty lines if the IDE should remove previous highlighting tokens on the line. Integrated with state into the highlighting generation flow. Only works correctly as long as there are no multiline tokens.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
- Build Status
Buildable 35145 Build 35144: arc lint + arc unit
Event Timeline
Thanks for bringing this up and implementing it.
I haven't looked into the details of the patch, just some high-level comments:
- the scope of the patch seems a bit unclear to me, what's the problem we are trying to solve?
- the patch looks like calculating the diff/delta highlightings (pre highlightings vs. new highlightings) from a server-only perspective, which seems incorrect. At least we should make sure that LSP client and server share the same understanding of source diff change, otherwise client may render the delta information in a wrong way. To do that, I guess we may start from the DidChangeTextDocument notification (where the client sends source changes to server);
- I'm not sure that we should start implement this at the moment -- I don't see that we will get much performance gain, we save some traffic cost between LSP client and server, but at the same time we are spending more CPU resource to compute highlighting diff in server side. Maybe there are some other clever ways (like traversing the decls in source diff change);
We probably need more investigations, like trying the current implementation on some real-project files in a real client (e.g. theia) to figure out and understand whether the latency is a real issue.
clang-tools-extra/clangd/SemanticHighlighting.cpp | ||
---|---|---|
230 | I think this is just Lhs.R.start < Lhs.R.end (if it's not, we should add the operator<) is it enforced somewhere that you don't have two highlights at the same spot? | |
252 | holding the lock while doing all the diffing seems dubious You could reasonably put the cache as a map<file, highlights> in ClangdServer, then you don't have to deal with not having the cache for the current file. You'd need to lock the map itself, but note that no races on updating individual entries are possible because onMainAST is called synchronously from the (per-file) worker thread. | |
269 | Whatever you do about storage, please pull out the diff(old, new) function so you can unit test it. | |
269 | (llvm uses unsigned for indices. It's a terrible idea, but it's used fairly consistently...) | |
271 | I can believe this is correct, but it's hard to verify as it's a bit monolithic and... index-arithmetic-y. could you factor out something like: using TokRange = ArrayRef<HighlightingToken>; // The highlights for the current line. TokRange OldLine = {PrevHighlightings.data(), 0}, NewLine = {Highlightings.data(), 0}; // iterate over lines until we run out of data for (unsigned Line = 0; OldLine.end() < PrevHighlightings.end() || NewLine.end() < PrevHighlightings.end(); ++Line) { // Get the current line highlights OldLine = getLineRange(PrevHighlightings, Line, OldLine); NewLine = getLineRange(Highlightings, Line, NewLine); if (OldLine != NewLine) emitLine(NewLine, Line); } // get the highlights for Line, which follows PrevLine TokRange getLineRange(TokRange AllTokens, unsigned Line, TokRange PrevLine) { return makeArrayRef(PrevLine.end(), AllTokens.end()).take_while( Tok => Tok.R.start.line == Line); } |
clang-tools-extra/clangd/SemanticHighlighting.cpp | ||
---|---|---|
230 | It's not enforced anywhere, it depends on how the HighlightingTokenCollector is implemented, it should not currently take any multiples of tokens at the same spot. Even if there are tokens at the same location it should still satisfy Compare though? | |
252 | I did not know onMainAST was called synchronously per file. But maybe that isn't a big concern/a concern at all? |
based on the offline discussion, now I understand the patch better (thanks).
some comments on the interface.
clang-tools-extra/clangd/ClangdLSPServer.h | ||
---|---|---|
62 | how about creating a new structure LineHighlightingTokens to represent the line-based tokens? | |
clang-tools-extra/clangd/ClangdServer.cpp | ||
82 ↗ | (On Diff #209228) | If we make it as a pointer, the bool SemanticHighlighting is not needed (just follow the what the other field FIndex does). |
clang-tools-extra/clangd/SemanticHighlighting.cpp | ||
293 | we could get the canonical path from onMainAST callback (the first parameter PathRef Path). | |
clang-tools-extra/clangd/SemanticHighlighting.h | ||
72 | nit: llvm::StringMap. I think this map is storing all file highlightings, maybe call it FileToHighlightings? | |
86 | this method does two things,
but the name just reflects 1). I think we just narrow it to 2) only: // calculate the new highlightings, and return the old one. std::vector<HighlightingToken> updateFile(PathRef Path, std::vector<HighlightingToken> NewHighlightings); // to get diff, in `onMainAST` diffHighlightings(differ.updateFile(Path, NewHighlightings), NewHighlightings); | |
94 | this utility method doesn't use any internal member of the class, we could pull it out or make it static. |
As discussed offline, it's a little scary that the highlight state for diffing lives in a separate map from TUScheduler's workers, but it's hard to fix without introducing lots of abstractions or layering violations.
It does seem like a good idea to move the diffing up to ClangdLSPServer, as it's fairly bound to protocol details. It needs to be reset in didClose (and maybe didOpen too, for safety - I think there's a race when a document is closed, then tokens are delivered...).
I'd tend to put the map/mutex directly in ClangdLSPServer rather than encapsulating it in the Differ object (again, because it's a protocol/lifetime detail rather than diffing logic itself) but up to you.
Moved highlighting state to LSP layer. Removed class holding state. Addressed comments.
the code looks clearer now!
clang-tools-extra/clangd/ClangdLSPServer.cpp | ||
---|---|---|
609 | nit: I would create a new {} block merely for the hightlightings. | |
1112 | this seems unsafe, we get a reference of the map value, we might access it without the mutex being guarded. std::vector<HighlightingToken> Old; { std::lock_guard<std::mutex> Lock(HighlightingsMutex); Old = std::move(FileToHighlightings[File]); } | |
1125 | FileToPrevHighlightings[File] = std::move(Highlightings); | |
clang-tools-extra/clangd/ClangdLSPServer.h | ||
140 | nit: I'd drop Prev. | |
clang-tools-extra/clangd/SemanticHighlighting.cpp | ||
229 | nit: I'm not sure I understand the comment. | |
233 | nit: this implies that OldLine and AllTokens must ref to the same container, could you refine the OldLine as a start Iterator? | |
255 | IIUC, we are initializing an empty ArrayRef, if so, how about using NewLine(Highlightings.data(), /*length*/0)? NewLine(Highlightings.data(), Highlightings.data()) is a bit weird, it took me a while to understand the purpose. | |
257 | nit: split it into a new statement. | |
259 | we don't change Current and Prev below, how about re-using Highlightings and PrevHighlightings? | |
clang-tools-extra/clangd/SemanticHighlighting.h | ||
71 | I believe Assumes the highlightings are sorted by the tokens' ranges. is a requirement for this function? If so, mark it more explicitly in the comment, like /// /// REQUIRED: OldHighlightings and NewHighlightings are sorted. | |
74 | nit: I'd use name the parameters symmetrically, NewHighlightings and OldHighlightings. | |
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp | ||
32–33 | we don't need ParsedAST now I think. | |
263 | this is a complicated structure, could you add some comments describing the test strategy here? |
Address comments.
clang-tools-extra/clangd/ClangdLSPServer.cpp | ||
---|---|---|
1112 | Aren't the parsing callbacks thread safe in the same file? I think @sammccall said so above. Although actually, the map might reallocate and move around the memory if things are inserted/deleted invalidating the reference maybe (I have no idea of how StringMap memory allocation works). So I guess it doesn't hurt to be safe and copy it. | |
clang-tools-extra/clangd/SemanticHighlighting.cpp | ||
255 | I couldn't do that because the ArrayRef(const T *data, size_t length) and ArrayRef(const T *begin, const T *end) were ambiguous. Added a cast to cast 0 to a size_t which solved it. |
clang-tools-extra/clangd/SemanticHighlighting.cpp | ||
---|---|---|
255 | you could also do it like NewLine(Highlightings.data(), /*length*/0UL);, which saves you a static_cast<>. | |
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp | ||
281 | so the empty lines are stored separately, which is not easy to figure out from the testing code snippet. I think we should make the empty lines more obvious in the code snippet (maybe use an Empty annotations?) , and explicitly verify the empty lines. What do you think refining the test like below, just annotate the diff tokens? I find it is easy to spot the diff tokens. struct Testcase { Code Before; Code After; }; std::vector<TestCase> cases = { { R"cpp( int a; int b; int c; )cpp", R"cpp( int a; $Empty[[]] // int b int $Variable[[C]]; )cpp" } } oldHighlightings = getSemanticHighlightings(OldAST); newHighlightings = getSemanticHighlightings(NewAST); diffHighlightings = diff...; // verify the diffHighlightings has the expected empty lines ("Empty" annotations). // verify the diffHighlightings has the expected highlightings (the regular annotations); |
Rewrote test to be more readable.
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp | ||
---|---|---|
281 | Moved everything into the checkDiffedHighlights as well. |
clang-tools-extra/clangd/SemanticHighlighting.cpp | ||
---|---|---|
275 | nit: remove the redundant ;. | |
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp | ||
35 | could we split it into three functions?
| |
63 | nit: OrgCode => OldCode, use llvm::StringRef | |
91 | I think the tokens are also sorted, as we pass two sorted lists of highlightings to diffHighlightings. | |
265 | nit: the code indentation is strange | |
266 | The annotations in the OldCode are not used -- you only use the actual highlightings in the checkDiffedHighlights. |
clang-tools-extra/clangd/SemanticHighlighting.cpp | ||
---|---|---|
119 | This means if we won't emit most of symbols if the AST is ill-formed? I'm not sure whether we should include it in this patch, it seems out of scope. could we leave it out in this patch? | |
clang-tools-extra/clangd/SemanticHighlighting.h | ||
68 | The comment seems a bit out of date (as the code is updated during review), and it should mention the diff strategy of this function. // Return a line-by-line diff between two highlightings. // - if the tokens on a line are the same in both hightlightings, we omit this line // - if a line exists in NewHighligtings but not in OldHighligtings, we emit the tokens on this line // - if a line not exists in NewHighligtings but in OldHighligtings, we emit an empty line (to tell client not highlighting this line) | |
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp | ||
51 | nit: let's sort the tokens here to make the result deterministic. | |
55 | nit: you could use for-range loop to simplify the code. for (const auto& EmptyRange : EmptyRanges) EmptyLines.push_back(EmptyRange.start.line); | |
67 | nit: since OldActualTokens and NewActualTokens are used only once in diffHighlightings, I'd inline them there (instead of creating two new variables). | |
80 | if we do sort in getExpectedTokens, we don't need this here. | |
264 | nit: Use an explicit struct, then you wouldn't need a comment. struct { llvm::StringRef OldCode; llvm::StringRef NewCode; } TestCases[] = { ... } | |
265 | could we add one more case? // Before int a; int b; int c; // After int a; int $Variable[[new_b]]; int c; | |
317 | nit: const auto& |
Thanks! Looks good from my side.
@ilya-biryukov will be nice if you could take a second look on the patch. We plan to land it before the release cut today.
clang-tools-extra/clangd/ClangdLSPServer.cpp | ||
---|---|---|
1115 | use std::move() to save a copy here. we'd drop the old highlighting anyway (replace it with new highlightings). |
A few suggestions from me, I haven't looked into the diffing algorithm and the tests yet.
clang-tools-extra/clangd/ClangdLSPServer.cpp | ||
---|---|---|
458 | Should can't we handle this on didClose instead? | |
1115 | NIT: maybe std::move() into Old? If we had exceptions, that would be complicated, but we don't have them | |
1124 | To follow the same pattern as diagnostics, could we store before calling publish...? | |
clang-tools-extra/clangd/SemanticHighlighting.cpp | ||
243 | Do you have any ideas on how we can fix this? I would simply split those tokens into multiple tokens of the same kind at newlines boundaries, but maybe there are other ways to handle this. In any case, could you put a suggested way to fix this (in case someone will want to pick it up, they'll have a starting point) | |
clang-tools-extra/clangd/SemanticHighlighting.h | ||
78 | Are we locked into the line-based diff implementation? Does the protocol have a way to communicate this cleanly? |
clang-tools-extra/clangd/ClangdLSPServer.cpp | ||
---|---|---|
458 | We are removing in didClose but the problem is that there is a race condition (I think). If someone does some edits and closes the document right after, the highlightings for the final edit might finish being generated after the FileToHighlightings have earsed the highlightings for the file. So next time when opening the file we will have those final highlightings that were generated for the last edit in the map. However I've head that ASTWorked is synchronous for a single file, is that the case for the didClose call as well? Because in that case there is no race condition. | |
clang-tools-extra/clangd/SemanticHighlighting.cpp | ||
243 | Yeah, I would split the tokens into multiple lines as well. Or enforce that a token is single line and handle it in addToken in the collector. (i.e. change the HighlightingToken struct) It's a bit annoying to solve because we'd have to get the line lengths of every line that the multiline length token covers when doing that. | |
clang-tools-extra/clangd/SemanticHighlighting.h | ||
78 | We are looked into some kind of line based diff implementation as the protocol sends token line by line. |
clang-tools-extra/clangd/SemanticHighlighting.cpp | ||
---|---|---|
241 | NIT: maybe rename to New and Old, the suffix of the name could be easily inferred from the variable type (clangd has hover/go-to-definition to find the type quickly, the code is rather small so its should always be visible on the screen too). | |
243 | NIT: add assertions that highlightings are sorted. | |
253 | NIT: maybe mention diff in the name somehow. I was confused at first, as I thought it's all highlightings grouped by line, not the diff. | |
259 | NIT: maybe just inline NewEnd and OldEnd to save a few lines? | |
261 | Could we skip directly to the next interesting line? auto NextLine = [&]() { int NextNew = NewLine.end() != NewHighlightings.end() ? NewLine.end()->start.line : numeric_limits<int>::max(); int NextOld = ...; return std::min(NextNew, NextOld); } for (int Line = 0; ...; Line = NextLine()) { } |
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp | ||
---|---|---|
262 | Can we test this in a more direct manner by specifying:
The resulting tests don't have to be real C++ then, allowing to express what we're testing in a more direct manner. {/*Old*/ "$Variable[[a]]", /*New*/ "$Class[[a]]", /*Diff*/ "{{/*line */0, "$Class[[a]]"}} It also seems that the contents of the lines could even be checked automatically (they should be equal to the corresponding line from /*New*/, right?), that leaves us with even simpler inputs: {/*Old*/ "$Variable[[a]]", /*New*/ "$Class[[a]]", /*DiffLines*/ "{0}} | |
346 | NIT: remove braces |
clang-tools-extra/clangd/ClangdLSPServer.cpp | ||
---|---|---|
458 | You are correct, there is actually a race condition. We worked hard to eliminate it for diagnostics, but highlightings are going through a different code path in ASTWorker, not guarded by DiagsMu and ReportDiagnostics. And, unfortunately, I don't think this guard here prevents it in all cases. In particular, there is still a possibility (albeit very low, I guess) that the old highlightings you are trying to remove here are still being computed. If they are reported after this erase() runs, we will end up with inconsistent highlightings. Ideally, we would guard the diagnostics callback with the same mutex, but given our current layering it seems like a hard problem, will need to think what's the simplest way to fix this. | |
clang-tools-extra/clangd/SemanticHighlighting.h | ||
78 | It seems there no compact way to send interesting diffs then, although the clients can do a reasonably good job of moving the older version of highlightings on changes until the server sends them a new version. The diff-based approach only seems to be helping with changes on a single line. Which is ok, thanks for the explanation. |
Fixed test case that was actually (kinda) broken even though it didn't fail.
(The last two edits in semantic-highlighting test. They removed the = 2 part and did not insert an extra ; which produced invalid code. It was still fine because the highlightings that were supposed to be generated were still generated. This just makes the code not be invalid)
clang-tools-extra/clangd/SemanticHighlighting.cpp | ||
---|---|---|
259 | Wouldn't that violate code style: https://llvm.org/docs/CodingStandards.html#don-t-evaluate-end-every-time-through-a-loop ? | |
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp | ||
262 | That's a great idea on how to write these tests. |
clang-tools-extra/clangd/ClangdLSPServer.cpp | ||
---|---|---|
458 | The race condition of highlighting sounds bad (especially when a user opens a large file and closes it immediately, then the highlighting is finished and we emit it to the client). No need to fix it in this patch, just add a FIXME. Can we use the same mechanism for Diagnostic to guard the highlighting here? or use the DiagsMu and ReportDiagnostics to guard the callback.onMainAST() as well (which is probably not a good idea)... | |
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp | ||
262 | hmm, I have a different opinion here (I'd prefer the previous one), let's chat. |
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp | ||
---|---|---|
266 | @hokein rightfully pointed out that mentioning all changed lines makes the tests unreadable. {/*Before*/ R"( $Var[[a]] $Func[[b]] "), /*After*/ R"( $Var[[a]] ^$Func[[b]] )"} // and no list of lines is needed! Could we do that here? One interesting case that we can't test this way to removing lines from the end of the file. But for that particular case, could we just write a separate test case? |
The fix for a race condition on remove has landed in rL366577, this revision would need a small update after it.
Fixed to work with that patch.
Should the diffing be moved somewhere else that is not inside the publish function though? That would require moving the state outside the LSP server though and handle the onClose/onOpen events somewhere else though.
Maybe it's ok to keep the diffing in the publish lock?
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp | ||
---|---|---|
266 | We don't want to send diffs that are beyond the end of the file. There is a testcase for that as well (the count of newlines in the new code is sent to the differ as the number of lines in the file). |
Yeah, let's keep it in publish for now. Diffing in there does not change complexity of the operations.
If it ever turns out that the constant factor is too high, we can figure out ways to fight this.
Mostly final NITs from me, but there is an important one about removing the erase call on didOpen.
clang-tools-extra/clangd/ClangdLSPServer.cpp | ||
---|---|---|
459 | Now that the patch landed, this is obsolete. Could you remove? | |
clang-tools-extra/clangd/ClangdServer.cpp | ||
80 ↗ | (On Diff #211890) | NIT: use unsigned instead of unsigned int |
clang-tools-extra/clangd/ClangdServer.h | ||
56 ↗ | (On Diff #211890) | NIT: could you document NLines |
clang-tools-extra/clangd/SemanticHighlighting.cpp | ||
260 | Could we try to find a better name for this? Having Line and NextLine() which represent line numbers and NewLine which represents highlightings, creates some confusion. | |
276 | Line != intmax is redundant (NLines is <= intmax by definition). | |
clang-tools-extra/clangd/SemanticHighlighting.h | ||
79 | Could you document what NLines is? Specifically, whether it's the number of lines for New or Old. | |
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp | ||
67 | NIT: add documentation on how this should be used | |
73 | NIT: use llvm::DenseMap |
a few more comments from my side.
clang-tools-extra/clangd/SemanticHighlighting.cpp | ||
---|---|---|
275 | nit: I believe theia is crashing because of this? could we file a bug to theia? I think it would be nice for client to be robust on this case. | |
clang-tools-extra/clangd/SemanticHighlighting.h | ||
69 | nit: add , before this | |
72 | could you refine this sentence, I can't parse it. | |
73 | and here as well. | |
79 | could we use a more describe name for NLines? maybe MaxLines |
clang-tools-extra/clangd/SemanticHighlighting.cpp | ||
---|---|---|
260 | I renamed the Line and NextLine() instead. Could also rename NewLine to something like NewLineHighlightings but I feel like it almost becomes to verbose at that point. | |
275 | I seem to be misremembering, when I first started testing in theia I think I was crashing the client (due to a broken base64 encoding function, will take a look and see what was actually happening, can't quite remember) So this could be removed, it will just sometimes return highlightings that are outside of the file and hopefully any future clients would handle that as well. |
clang-tools-extra/clangd/ClangdServer.cpp | ||
---|---|---|
81 ↗ | (On Diff #212301) | from the comment of the getLineNumber, this is not cheap to call this method. We may use SM.getBufferData(MainFileID).count('\n') to count the line numbers. // This requires building and caching a table of line offsets for the /// MemoryBuffer, so this is not cheap: use only when about to emit a /// diagnostic. |
clang-tools-extra/clangd/ClangdServer.h | ||
55 ↗ | (On Diff #212301) | nit: drop the needed ..., this is an interface, don't mention the details of subclass (diffs are subclass implementation details). Maybe name it "NumLines"? |
clang-tools-extra/clangd/ClangdLSPServer.cpp | ||
---|---|---|
1116 | NIT: avoid copying (from Highlightings into the map) under a lock, make the copy outside the lock instead | |
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp | ||
363 | Could you also add a separate test that checks diffs when the number of lines is getting smaller (i.e. taking NLines into account) |
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp | ||
---|---|---|
363 | I would expect this to be better handled outside checkDiffedHighlights to avoid over-generalizing this function. But up to you. |
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp | ||
---|---|---|
363 | That's what this test does: { R"( $Class[[A]] $Variable[[A]] $Class[[A]] $Variable[[A]] )", R"( $Class[[A]] $Variable[[A]] )"}, But do you mean I should do a testcase like: { R"( $Class[[A]] $Variable[[A]] $Class[[A]] $Variable[[A]] )", R"( $Class[[A]] $Variable[[A]] $Class[[A]] $Variable[[A]] )"}, And just set NLines = 2 instead when doing the diffing? |
clang-tools-extra/clangd/ClangdLSPServer.cpp | ||
---|---|---|
1116 | I think we can even avoid copy, since we only use the Highlightings for calculating the diff. Maybe just { std::lock_guard<std::mutex> Lock(HighlightingsMutex); Old = std::move(FileToHighlightings[File]); } // calculate the diff. { std::lock_guard<std::mutex> Lock(HighlightingsMutex); FileToHighlightings[File] = std::move(Highlightings); } |
clang-tools-extra/clangd/ClangdLSPServer.cpp | ||
---|---|---|
1116 | I think I changed this to only lock once (and copy instead) at the same time me and @ilya-biryukov were talking about the race condition (?) |
LGTM from my side
clang-tools-extra/clangd/ClangdLSPServer.cpp | ||
---|---|---|
1116 | It means there's a window where nobody can access the highlightings for this file. Which is probably fine, we really do use this only for computing the diff and nothing else. But doing the copy shouldn't matter here either, so leaving as is also looks ok from my side. If you decide to avoid an extra copy, please add comments to FileToHighlightings (e.g. only used to compute diffs, must not be used outside onHighlightingsReady and didRemove). Don't remember exact details about race conditions, but we should be good now that it's called inside Publish() | |
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp | ||
363 | Ah, the test actually needs the set of changed lines to be exactly the same as marked lines. |
clang-tools-extra/clangd/ClangdLSPServer.cpp | ||
---|---|---|
1116 | I'm fine with the current state, avoiding the copy cost would make the code here a bit mess... |
how about creating a new structure LineHighlightingTokens to represent the line-based tokens?