This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Add Random Forest runtime for code completion.
ClosedPublic

Authored by usaxena95 on Jul 14 2020, 2:32 PM.

Details

Summary

We intend to replace the heuristics-based code completion ranking with a Decision Forest model.

This patch introduces a format for representing the model and an inference runtime that is code-generated at build time.

  • Forest.json represents the decision forest as an array of trees.
  • Features.json describes the features to be used.
  • A codegen script takes the above two files and generates a CompletionModel containing the Feature struct and a corresponding Evaluate function. Evaluate maps a candidate's feature values to a real number describing the relevance of this candidate (a rough sketch of these inputs follows this list).
  • The codegen is part of the build system, and these files are generated at build time.
  • Proposes a way to test the generated runtime using a test model.
    • Replicates the model structure in unittests.
    • The unittest covers both the test model (for correct tree traversal) and the real model (for sanity).
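
To make the data formats concrete, here is a rough sketch of what the codegen script might see after loading the two JSON files. This is illustrative only: the feature names, the "CompletionContext.h" header, and node fields like "threshold", "then", "else" and "score" are assumptions, not the actual schema.

import json

# Illustrative only: feature names, the header path, and the node fields
# ("threshold", "then", "else", "score") are guesses, not the real schema.
features = json.loads("""
[
  {"name": "NumReferences", "kind": "NUMBER"},
  {"name": "ContextKind", "kind": "ENUM", "header": "CompletionContext.h"}
]
""")

forest = json.loads("""
[
  {"operation": "if_greater", "feature": "NumReferences", "threshold": 3,
   "then": {"operation": "boost", "score": 1.5},
   "else": {"operation": "boost", "score": 0.2}}
]
""")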

Diff Detail

Event Timeline

usaxena95 created this revision.Jul 14 2020, 2:32 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 14 2020, 2:32 PM
usaxena95 edited the summary of this revision. (Show Details)Jul 15 2020, 1:58 AM
usaxena95 edited the summary of this revision. (Show Details)Jul 15 2020, 9:09 AM
usaxena95 edited the summary of this revision. (Show Details)
nridge added a subscriber: nridge.Jul 21 2020, 4:24 PM

Is there some additional context for what this patch is aiming to do? The description sounds interesting but I don't really understand it.

What is a "feature" in this context?

The features refer to the code completion signals in https://github.com/llvm/llvm-project/blob/master/clang-tools-extra/clangd/Quality.h
These signals are currently mapped to a relevance score for each code completion candidate using hand-coded heuristics.
We intend to replace the heuristics with a Decision Forest model. This patch introduces a dummy model and the corresponding runtime that will be used to run inference on it.

This is still WIP and I will provide more details in the description once this is finalized.

usaxena95 updated this revision to Diff 280406.Jul 24 2020, 4:56 AM

Addressed offline comments.

usaxena95 edited the summary of this revision. (Show Details)Jul 24 2020, 7:11 AM
usaxena95 added a reviewer: sammccall.
usaxena95 updated this revision to Diff 280455.Jul 24 2020, 7:12 AM
usaxena95 edited the summary of this revision. (Show Details)

Better formatting in generated files.

usaxena95 edited the summary of this revision. (Show Details)Jul 24 2020, 7:23 AM

This is still WIP and I will provide more details in the description once this is finalized.

This really should have high-level documentation - I don't think Nathan will be the only one to ask these questions.
We need a description of what problem this solves, what the concepts are (trees, features, scores), inference, training, data sources, generated API.

Given that the implementation is necessarily split across CMake, generated code, generator, and data (but not a lot of hand-written C++) I think it's probably best documented in a README.md or so in the model/ dir.

Seems fine to do that in this patch, or ahead of time (before the implementation)... I wouldn't wait to do it after, it'll help with the review.


One question we haven't discussed is how workspace/symbol should work. (This uses the hand-tuned score function with some different logic). This is relevant to naming: it's not the "completion model" if it also has other features.

clang-tools-extra/clangd/CMakeLists.txt
32

if you want the compiler script to be a parameter, make it an argument to the function rather than a magic variable.

But really, I think CompletionModel.cmake is tightly coupled to the python script, so it can be hardcoded there.

clang-tools-extra/clangd/CompletionModel.cmake
1 ↗(On Diff #280455)

I think there's some confusion in the naming.

You've got the code split into two locations: the generic generator and the specific code completion model.

However, the directory named just "model" contains the specific stuff, and the generic parts are named "CompletionModel"!

I'd suggest either:

  • don't generalize, and put everything in clangd/quality or so
  • split into clangd/quality/ and clangd/forest/ for the specific/generic parts
5 ↗(On Diff #280455)

what does the class do?

7 ↗(On Diff #280455)

df is cryptic.
decision_forest or gen_decision_forest?

17 ↗(On Diff #280455)

I'd suggest passing the component filenames explicitly here since you're computing them anyway

sammccall added inline comments.Jul 24 2020, 2:15 PM
clang-tools-extra/clangd/CompletionModel.cmake
19 ↗(On Diff #280455)

fname -> filename

29 ↗(On Diff #280455)

this needs to be guarded based on the compiler - other compilers use different flags

I'd suggest just -Wno-unused

31 ↗(On Diff #280455)

It'd be nice to avoid passing data out by setting magic variables with generic names.

The caller is already passing in the filename they want, so they know what it is.

clang-tools-extra/clangd/CompletionModelCodegen.py
1 ↗(On Diff #280455)

this needs documentation throughout

9 ↗(On Diff #280455)

Hmm, do these classes really pay for themselves compared to just using the JSON-decoded data structures directly and writing functions?

e.g.

def setter(feature):
  if feature['kind'] == KIND_CATEGORICAL:
    return "void Set{name}(float V) {{ {name} = OrderEncode(V); }}".format(name=feature['name'])
  ...
10 ↗(On Diff #280455)

These labels seem a bit abstract, what do you think about "number"/"enum"?

21 ↗(On Diff #280455)

Assert failures print the expression, so no need to say "header not found" etc.
And in fact the next line will throw a reasonable exception in this case... could drop these

28 ↗(On Diff #280455)

nit: set{Feature}, following LLVM style

33 ↗(On Diff #280455)

raise instead?

I'm not sure we want this error handling to be turned off... assert is for programming errors
(applies to the rest of the usage of assert too)

45 ↗(On Diff #280455)

I think this is set(f.header for f in features.values() if f.type == Feature.Type.CATEGORICAL), with no need for a lambda
but my python is rusty

52 ↗(On Diff #280455)

why not assert this on self.ns so that "::Foo" will work fine?

60 ↗(On Diff #280455)

join here rather than returning an array?

67 ↗(On Diff #280455)

gaurd -> guard

70 ↗(On Diff #280455)

''.join('_' if x in '-' else x.upper() for x in self.fname) ?

74 ↗(On Diff #280455)

(again, not clear the classes pay for themselves in terms of complexity. We do a lot of parsing and validation, but it's not clear it's important)

103 ↗(On Diff #280455)

would be nice to structure this to reduce the indentation inside here.
Also 'codegen' is a pretty generic name for this particular piece.

I'd consider

def node(n):
  return {
    'boost': boost_node,
    'if_greater': if_greater_node,
    'if_member': if_member_node,
  }[n['operation']](n)

and then define each case separately. (That snippet does actually check errors!)
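
For example, one of the cases might look something like this (purely illustrative: I'm guessing at the JSON field names, and "E" stands for however the generated code names the feature-struct instance):

def if_greater_node(n):
  # Illustrative: assumes the node carries 'feature', 'threshold', 'then'
  # and 'else' fields; adjust to whatever the real format uses.
  return "if (E.{feature} >= {threshold}) {{\n{then}\n}} else {{\n{else_}\n}}".format(
      feature=n['feature'],
      threshold=n['threshold'],
      then=node(n['then']),
      else_=node(n['else']))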

129 ↗(On Diff #280455)

rather than passing features around to get access to the enum type, what about adding using {feature}_type = ... to the top of the generated file, and using that here?
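
i.e. something like (illustrative; the 'type' key is a guess at where Features.json records the C++ enum type):

# Illustrative: emit one alias per enum feature at the top of the generated file.
aliases = "\n".join("using {0}_type = {1};".format(f['name'], f['type'])
                    for f in features if f['kind'] == 'ENUM')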

282 ↗(On Diff #280455)

putting all the trees in a single huge json file seems like it makes them hard to inspect, did you consider one-per-file?

clang-tools-extra/clangd/for-review-only/CompletionModel.cpp
28 ↗(On Diff #280455)

nit: consistently either:

  • tree0, tree0_node1
  • tree_0, tree_0_node_1
  • t0, t0_n1

etc

clang-tools-extra/clangd/for-review-only/CompletionModel.h
10 ↗(On Diff #280455)

this shouldn't be part of the public interface. Can we make it private static in the model class?

Hi @usaxena95 and @sammccall,

I am wondering about a couple of high-level things.

Do you guys intend to also open-source the training part of the model pipeline, or publish a model trained on a generic-enough training set so it could reasonably be used on "any" codebase?

Do you still intend to support the heuristic that is currently powering clangd in the future?

Thanks!

usaxena95 marked 23 inline comments as done.

Addressed comments.

usaxena95 added inline comments.Sep 10 2020, 10:37 AM
clang-tools-extra/clangd/CMakeLists.txt
32

Hardcoded it in the .cmake file.

clang-tools-extra/clangd/CompletionModel.cmake
1 ↗(On Diff #280455)

Moved the .cmake, codegen, and model into the quality dir.

5 ↗(On Diff #280455)

The class specifies the name and scope of the generated Feature class: clang::clangd::Example in this case.

17 ↗(On Diff #280455)

This allows the generated .cc file to include the header as "filename.h".
If we took the file path as input, we would have to strip the filename out of it.

That said, I like the current approach of being explicit that output_dir contains the two files: we need to add output_dir to the include path to use this library.

29 ↗(On Diff #280455)

only MSVC needed a different flag. -Wno-unused works with Clang and GCC.
https://godbolt.org/z/Gvdne7

31 ↗(On Diff #280455)

We can avoid GENERATED_CC, but I still wanted to keep the output directory as a detail of this function itself and not as an input parameter.
Changed it to the more specific name DECISION_FOREST_OUTPUT_DIR.

clang-tools-extra/clangd/CompletionModelCodegen.py
9 ↗(On Diff #280455)

Removed the Feature and Tree classes.
CppClass computes and holds the namespaces, which I found convenient.

52 ↗(On Diff #280455)

Allowed fully qualified names of classes.

70 ↗(On Diff #280455)

Sorry for making this complicated. The filename was assumed to be in PascalCase (and not contain '-' at all).
I wanted to convert it to UPPER_SNAKE_CASE. To avoid unnecessary complexity, let's simply convert it to upper case.

Hi @jkorous

Do you guys intend to also open-source the training part of the model pipeline?

Open-sourcing the training part (both dataset generation and using an open-source decision forest framework for training) has been on our radar, although gathering capacity for this task has been difficult lately.

Publish a model trained on a generic-enough training set so it could reasonably be used on "any" codebase?

The current model has not been trained on a generic codebase, but since the features involved don't capture code style/conventions/variable names, it is likely to perform well on generic codebases as well. This remains to be tested.

Do you still intend to support the heuristic that is currently powering clangd in the future?

Currently we are planning to use this model behind a flag. Initially we will focus on comparing the two. Since maintaining and developing signals is easier for an ML model, we might end up deprecating the heuristics.

Thanks,
Utkarsh.

usaxena95 updated this revision to Diff 291205.Sep 11 2020, 7:01 AM

Added README.md for the code completion model.

adamcz added inline comments.Sep 14 2020, 8:05 AM
clang-tools-extra/clangd/unittests/DecisionForestTests.cpp
6

This is supposed to be "namespace clang", right?

Fixed namespace.

usaxena95 marked an inline comment as done.Sep 15 2020, 12:16 AM
usaxena95 updated this revision to Diff 291909.Sep 15 2020, 7:21 AM

Fixed namespace ending.

Looks good to me overall, some minor style comments included ;-)

Do we expect this to be a generic model code generator that gets reused for other things? If not, maybe we can hardcode more (like the namespace, class name, etc), but if you think there are other use cases for this then this LGTM.

clang-tools-extra/clangd/quality/CompletionModelCodegen.py
40

Why GENERATED_DECISON_FOREST_MODEL instead of output_dir, to be consistent with header guards for other files? Doesn't matter much for generated code, but if someone opens this in vim they'll see warnings.

58

nit: add space after if for readability (also below)

94

Please extend the comment to mention the second return value (size of the tree)

106

This is a good place to use a Python f-string.

Also in a few places below.

127

style nit: be consistent with spaces around +

178

Is there a reason to make this a friend free function instead of a static method on the Example class? The problem is that you now end up with clang::clangd::Evaluate, so if we ever re-use this codegen for another model we'll have a name collision.

272

nit: be consistent about putting a "." at the end of the help text or not.

clang-tools-extra/clangd/unittests/model/CategoricalFeature.h
1 ↗(On Diff #291813)

Can we rename this directory? quality/model makes some sense (although it would be better to include something about code completion there), but unittests/model is not very descriptive - what model?

How about unittests/decision_forest_model/ or something like that? Or go with the Input/TEST_NAME pattern.

usaxena95 updated this revision to Diff 292188.Sep 16 2020, 4:57 AM
usaxena95 marked 8 inline comments as done.

Addressed comments.

clang-tools-extra/clangd/quality/CompletionModelCodegen.py
40

The output_dir is an absolute path, not a project-relative path.
I tried to stick with a special prefix for the header guard, as done in other generated headers (e.g. protos).

If someone opens this in vim, there would be many other warnings they would see, like "unused-label" ;)
I don't think that's a concern, since the file would be opened for inspection and not for editing.

178

The class name ("Example" currently) would be different for a different model, and therefore there would be another overload of Evaluate(const AnotherClass&) even if the namespaces are the same (clang::clangd).

clang-tools-extra/clangd/unittests/model/CategoricalFeature.h
1 ↗(On Diff #291813)

You are right that "quality" isn't indicative of code completion here, but we decided to be consistent with the current naming: the current heuristics for ranking live in Quality.h and Quality.cpp ;-)

Changed the dir name in unittests.

Just some naming and doc nits. This looks really solid now, nice job!

Hi @usaxena95 and @sammccall,

I am wondering about a couple of high-level things.

Do you guys intend to also open-source the training part of the model pipeline, or publish a model trained on a generic-enough training set so it could reasonably be used on "any" codebase?

@adamcz and I were talking about this too... I think it's important we do as much of this as possible. I was the one not finding time to do it though, and I think Adam may do better :-)

  • the existing training stuff is using internal tech, but AFAIK it's nothing LightGBM can't do (it trains on a single machine). So we should be able to open-source the training setup and actually use that.
  • training data generation is harder to open, because it involves sampling a large diverse body of code and parsing lots of it. The core code that embeds clangd and extracts completion candidates should be very doable, so one could run over LLVM on one machine. The framework to run at a larger scale is coupled to internal infra though, and we're currently training on both public and non-public code.
clang-tools-extra/clangd/quality/CompletionModel.cmake
11

these vars are used only once, I'd suggest inlining them for readability

14

/generated/decision_forest seems redundant considering ${CMAKE_BINARY_DIR} is already the generated-files tree for the directory of the calling CMakeLists.

Can't we just use ${CMAKE_BINARY_DIR} directly and avoid the DECISION_FOREST_OUTPUT_DIR variable?

clang-tools-extra/clangd/quality/CompletionModelCodegen.py
151

you can use triple-quoted f-strings, which I think would be more readable than blocks of "code +="

# join outside the f-string: backslashes aren't allowed inside f-string
# expressions (before Python 3.12)
setters_str = "\n  ".join(setters)
members_str = "\n  ".join(class_members)
code += f"""class {cpp_class.name} {{
public:
  {setters_str}

private:
  {members_str}
}};
"""

etc

may even do the whole thing in one go.

218

again, making this one big triple-quoted f-string may be nicer, up to you

clang-tools-extra/clangd/quality/README.md
5

The second half of this sentence simply restates the first.

Maybe we can combine this with the second paragraph: "A decision tree is a full binary tree that provides a quality prediction for an input (code completion item). Internal nodes represent a binary decision based on the input data, and leaf nodes represent a prediction."

9

Nit: I think it's worth separating out defining features vs conditions.

e.g.
"An input (code completion candidate) is characterized as a set of features, such as the type of symbol or the number of existing references

At every non-leaf node, we evaluate the condition to decide whether to go left or right. The condition compares one feature of the input against a constant. It is either:
...".

17

nit: rather than alternating between describing traversing all trees and one tree, I'd just say "To compute an overall quality score, we traverse each tree in this way and add up the scores".
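
In code form, the evaluation being described is roughly the following (illustrative Python; the real runtime is generated C++, and the field names are assumed rather than taken from the actual schema):

def score_tree(node, candidate):
  # Walk one tree: leaves carry a score, inner nodes pick a branch by
  # comparing a single feature of the candidate against a constant.
  op = node['operation']
  if op == 'boost':
    return node['score']
  if op == 'if_greater':
    branch = 'then' if candidate[node['feature']] >= node['threshold'] else 'else'
  elif op == 'if_member':
    branch = 'then' if candidate[node['feature']] in node['set'] else 'else'
  else:
    raise ValueError('unknown operation: ' + op)
  return score_tree(node[branch], candidate)

def score_forest(forest, candidate):
  # Overall score: traverse each tree this way and add up the results.
  return sum(score_tree(tree, candidate) for tree in forest)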

27

This is a little prone to confusion with C++ type.

Consider "kind" instead?

35

this might be "type"?

39

The max numeric value may not exceed 32 (is that right?)

77

nit: order should match order of the actual paragraphs.

This is short enough though that you might not want the mini-TOC here.

99

This seems like it might be a pain to maintain.
Maybe just include the json files and the public interface from DecisionForestRuntime.h?

99

may want to add the cmake invocation to generate the files.

usaxena95 marked 12 inline comments as done.

Addressed comments.

usaxena95 marked an inline comment as done.Sep 18 2020, 12:52 AM
usaxena95 added inline comments.
clang-tools-extra/clangd/quality/CompletionModel.cmake
14

Changed CMAKE_BINARY_DIR to CMAKE_CURRENT_BINARY_DIR and removed /generated/decision_forest to avoid the DECISION_FOREST_OUTPUT_DIR variable.

sammccall accepted this revision.Sep 18 2020, 1:06 AM

LG from my side!

This revision is now accepted and ready to land.Sep 18 2020, 1:06 AM
adamcz accepted this revision.Sep 18 2020, 6:27 AM
usaxena95 updated this revision to Diff 292818.Sep 18 2020, 9:04 AM

Removed generated (for review) files.

usaxena95 updated this revision to Diff 292828.Sep 18 2020, 9:36 AM

Fixed output_dir cmake variable. Clean build succeeds now.
Ready to land.

This revision was landed with ongoing or failed builds.Sep 18 2020, 10:27 AM
This revision was automatically updated to reflect the committed changes.
usaxena95 edited the summary of this revision. (Show Details)Sep 19 2020, 12:37 AM