Page MenuHomePhabricator

[clangd] Add Random Forest runtime for code completion.

Authored by usaxena95 on Jul 14 2020, 2:32 PM.



We intend to replace heuristics based code completion ranking with a Decision Forest Model.

This patch introduces a format for representing the model and an inference runtime that is code-generated at build time.

  • Forest.json contains all the trees as an array of trees.
  • Features.json describes the features to be used.
  • Codegen file takes the above two files and generates CompletionModel containing Feature struct and corresponding Evaluate function. The Evaluate function maps a feature to a real number describing the relevance of this candidate.
  • The codegen is part of build system and these files are generated at build time.
  • Proposes a way to test the generated runtime using a test model.
    • Replicates the model structure in unittests.
    • unittest tests both the test model (for correct tree traversal) and the real model (for sanity).

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sammccall added inline comments.Jul 24 2020, 2:15 PM
19 ↗(On Diff #280455)

fname -> filename

29 ↗(On Diff #280455)

this needs to be guarded based on the compiler - other compilers use different flags

I'd suggest just -Wno-usuned

31 ↗(On Diff #280455)

It'd be nice to avoid passing data out by setting magic variables with generic names.

The caller is already passing in the filename they want, so they know what it is.

1 ↗(On Diff #280455)

this needs documentation throughout

9 ↗(On Diff #280455)

Hmm, do these classes really pay for themselves compared to just using the JSON-decoded data structures directly and writing functions?


def setter(feature):
  if (feature['kind'] == KIND_CATEGORICAL)
    return "void Set{feature}(float V) {{ {feature} = OrderEncode(V); }}".format(feature['name'])
10 ↗(On Diff #280455)

These labels seem a bit abstract, what do you think about "number"/"enum"?

21 ↗(On Diff #280455)

Assert failures print the expression, so no need to say "header not found" etc.
And in fact the next line will throw a reasonable exception in this case... could drop these

28 ↗(On Diff #280455)

nit: set{Feature}, following LLVM style

33 ↗(On Diff #280455)

raise instead?

I'm not sure we want this error handling to be turned off... assert is for programming errors
(applies to the rest of the usage of assert too)

45 ↗(On Diff #280455)

i think this is set(f.header for f in features.values() if f.type == Feature.Type.CATEGORICAL) with no need for lambda
but my python is rusty

52 ↗(On Diff #280455)

why not assert this on self.ns so that "::Foo" will work fine?

60 ↗(On Diff #280455)

join here rather than returning an array?

67 ↗(On Diff #280455)

gaurd -> guard

70 ↗(On Diff #280455)

''.join('_' if x in '-' else x.upper() for x in self.fname) ?

74 ↗(On Diff #280455)

(again, not clear the classes pay for themselves in terms of complexity. We do a lot of parsing and validation, but it's not clear it's important)

103 ↗(On Diff #280455)

would be nice to structure to reduce the indentation inside here.
Also 'codegen' is a pretty generic name for this particular piece.

I'd consider

def node(n):
  return {
    'boost': boost_node,
    'if_greater': if_greater_node,
    'if_member': if_member_node,

and then define each case separately. (That snippet does actually check errors!)

129 ↗(On Diff #280455)

rather than passing features around to get access to the enum type, what about adding using {feature}_type = ... to the top of the generated file, and using that here?

282 ↗(On Diff #280455)

putting all the trees in a single huge json file seems like it makes them hard to inspect, did you consider one-per-file?

28 ↗(On Diff #280455)

nit: consistently either:

  • tree0, tree0_node1
  • tree_0, tree_0_node_1
  • t0, t0_n1


10 ↗(On Diff #280455)

this shouldn't be part of the public interface. Can we make it private static in the model class?

Hi @usaxena95 and @sammccall,

I am wondering about couple high-level things.

Do you guys intend to open-source also the training part of the model pipeline or publish a model trained on generic-enough training set so it could be reasonably used on "any" codebase?

Do you still intend to support the heuristic that is currently powering clangd in the future?


usaxena95 marked 23 inline comments as done.

Addressed comments.

usaxena95 added inline comments.Sep 10 2020, 10:37 AM

Hardcoded it in .cmake file.

1 ↗(On Diff #280455)

Moved .cmake, codegen and model in quality dir.

5 ↗(On Diff #280455)

The class specifies the name and scope of the Feature class. clang::clangd::Example in this case.

17 ↗(On Diff #280455)

This allows the generated cc file to include the header using "filename.h".
If we give the filepath as input, we would have to strip out the filename from it.

Although I like the current notion of being explicit that the output_dir contains the two files. We need to add output_dir to include path to use this library.

29 ↗(On Diff #280455)

only MSVC needed a different flag. -Wno-unused works with Clang and GCC.

31 ↗(On Diff #280455)

We can avoid GENERATED_CC. But I still wanted to keep the output directory as a detail in this function itself and not as an input parameter.
Changed the name to more specific name DECISION_FOREST_OUTPUT_DIR.

9 ↗(On Diff #280455)

Removed the Feature class and Tree.
CppClass calculates and holds the namespaces which I felt convenient.

52 ↗(On Diff #280455)

Allowed fully qualified names of classes.

70 ↗(On Diff #280455)

Sorry for making this complicated. filename was assumed to be in PascalCase (and not contain '-' at all).
I wanted to convert it to UPPER_SNAKE_CASE. To avoid unnecessary complexity, lets simply convert it to upper case.

Hi @jkorous

Do you guys intend to open-source also the training part of the model pipeline ?

Open sourcing the training part (both dataset generation and using an open sourced DecisionForest based framework for training) has been on our radar. Although gathering capacity for this task has been difficult lately.

Publish a model trained on generic-enough training set so it could be reasonably used on "any" codebase?

Although the current model has not been trained on a generic codebase, but since the features involved doesn't capture code style/conventions/variable names, it is likely that it performs well on generic code bases as well. This remains to be tested.

Do you still intend to support the heuristic that is currently powering clangd in the future?

Currently we are planning to use this model behind a flag. Initially we would be focusing on comparing the two. Since maintaining and developing signals is easier for an ML model, we might end up deprecating the heuristics.


usaxena95 updated this revision to Diff 291205.Sep 11 2020, 7:01 AM

Added for the code completion model.

adamcz added inline comments.Sep 14 2020, 8:05 AM

This is supposed to be "namespace clang", right?

Fixed namespace.

usaxena95 marked an inline comment as done.Sep 15 2020, 12:16 AM
usaxena95 updated this revision to Diff 291909.Sep 15 2020, 7:21 AM

Fixed namespace ending.

Looks good to me overall, some minor style comments included ;-)

Do we expect this to be a generic model code generator that gets reused for other things? If not, maybe we can hardcode more (like the namespace, class name, etc), but if you think there's other use cases for this then this LGTM.


Why GENERATED_DECISON_FOREST_MODEL instead of output_dir, to be consistent with header guards for other files? Doesn't matter much for generated code, but if someone opens this in vim they'll see warnings.


nit: add space after if for readability (also below)


Please extend the comment to mention the second return value (size of the tree)


This is a good place to use an Python's f-string.

Also in few places below.


style nit: be consistent with spaces around +


Is there a reason to make this a friend free-function instead of static method on the Example class? The problem is that you now end up with clang::clangd::Evaluate, so if we every re-use this code gen for another model we'll have a name collision.


nit: be consistent about putting a "." at the end of the help text or not.

1 ↗(On Diff #291813)

Can we rename this directory? quality/model makes some sense (although it would be better to include something about code completion there), but unittests/model is not very descriptive - what model?

How about unittests/decision_forest_model/ or something like that? Or go with the Input/TEST_NAME pattern.

usaxena95 updated this revision to Diff 292188.Sep 16 2020, 4:57 AM
usaxena95 marked 8 inline comments as done.

Addressed comments.


The output_dir is the absolute path and not a project relative path.
I tried to stick with a special prefix for header guard as done in other Generated headers (e.g. protos)

If someone opens this in vim, there would many other warnings that they would see like "unused_label" ;)
I don't think that would be a concern since it would be opened for inspection and not for editing.


The class name ("Example" currently) would be different for a different model and therefore there would be another overload for Evaluate(const AnotherClass&) even if the namespaces are same (clang::clangd).

1 ↗(On Diff #291813)

You are right "quality" wasn't indicative of code completion here but we decided to be consistent with the current naming. The current heuristics for the ranking are in Quality.h and Quality.cpp ;-)

changed the dir name in unittests.

Just some naming and doc nits. This looks really solid now, nice job!

Hi @usaxena95 and @sammccall,

I am wondering about couple high-level things.

Do you guys intend to open-source also the training part of the model pipeline or publish a model trained on generic-enough training set so it could be reasonably used on "any" codebase?

@adamcz and I were talking about this too... I think it's important we do as much of this as possible. I was the one not finding time to do it though, and I think Adam may do better :-)

  • the existing training stuff is using internal tech, but AFAIK it's nothing LightGBM can't do (it trains on a single machine). So we should be able to open-source the training setup and actually use that.
  • training data generation is harder to open, because it involves sampling a large diverse body of code and parsing lots of it. The core code that embeds clangd and extracts completion candidates should be very doable, so one could run over LLVM on one machine. The framework to run at a larger scale is coupled to internal infra though, and we're currently training on both public and non-public code.

these vars are used only once, I'd suggest inlining them for readability


/generated/decision_forest seems redundant considering ${CMAKE_BINARY_DIR} is already the generated-files tree for the directory of the calling CMakeLists.

Can't we just use ${CMAKE_BINARY_DIR} directly and avoid the DECISION_FOREST_OUTPUT_DIR variable?


you can use triple-quoted f-strings, which I think would be more readable than blocks of "code +="

code += f"""class {} {{
  {"\n   ".join(setters)}

  {"\n  ".join(class_members)}


may even do the whole thing in one go.


again, making this one big triple-quoted f-string may be nicer, up to you


The second half of this sentence simply restates the first.

Maybe we can combine this with the second paragraph: "A decision tree is a full binary tree that provides a quality prediction for an input (code completion item). Internal nodes represent a binary decision based on the input data, and leaf nodes represent a prediction."


Nit: I think it's worth separating out defining features vs conditions.

"An input (code completion candidate) is characterized as a set of features, such as the type of symbol or the number of existing references

At every non-leaf node, we evaluate the condition to decide whether to go left or right. The condition compares one feature of the input against a constant. It is either:


nit: rather than alternating between describing traversing all trees and one tree, I'd just say "To compute an overall quality score, we traverse each tree in this way and add up the scores".


This is a little prone to confusion with C++ type.

Consider "kind" instead?


this might be "type"?


The max numeric value may not exceed 32 (is that right?)


nit: order should match order of the actual paragraphs.

This is short enough though that you might not want the mini-TOC here.


This seems like it might be a pain to maintain.
Maybe just include the json files and the public interface from DecicionForestRuntime.h?


may want to add the cmake invocation to generate the files.

usaxena95 marked 12 inline comments as done.

Addressed comments.

usaxena95 marked an inline comment as done.Sep 18 2020, 12:52 AM
usaxena95 added inline comments.

Changed CMAKE_BINARY_DIR to CMAKE_CURRENT_BINARY_DIR and removed /generated/decision_forest to avoid the DECISION_FOREST_OUTPUT_DIR variable.

sammccall accepted this revision.Sep 18 2020, 1:06 AM

LG from my side!

This revision is now accepted and ready to land.Sep 18 2020, 1:06 AM
adamcz accepted this revision.Sep 18 2020, 6:27 AM
usaxena95 updated this revision to Diff 292818.Sep 18 2020, 9:04 AM

Removed generated (for review) files.

usaxena95 updated this revision to Diff 292828.Sep 18 2020, 9:36 AM

Fixed output_dir cmake variable. Clean build succeeds now.
Ready to land.

This revision was landed with ongoing or failed builds.Sep 18 2020, 10:27 AM
This revision was automatically updated to reflect the committed changes.
usaxena95 edited the summary of this revision. (Show Details)Sep 19 2020, 12:37 AM