This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Use pre-populated mappings for standard symbols
ClosedPublic

Authored by ilya-biryukov on Sep 4 2019, 7:20 AM.

Details

Summary

This takes ~5% of time when running clangd unit tests.

To achieve this, move mapping of system includes out of CanonicalIncludes
and into a separate class

Diff Detail

Repository
rL LLVM

Event Timeline

ilya-biryukov created this revision.Sep 4 2019, 7:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 4 2019, 7:20 AM

Also considered moving the std symbol mapping into a separate (private-to-implementation) class, e.g. we don't use suffix mappings and symbol mappings outside system includes.
But decided to wait until suffix mapping are going away completely.

hokein added a comment.Sep 6 2019, 2:01 AM

Also considered moving the std symbol mapping into a separate (private-to-implementation) class, e.g. we don't use suffix mappings and symbol mappings outside system includes.
But decided to wait until suffix mapping are going away completely.

It took me a while to understand this patch. I'd suggest doing it in this patch as well (it should not require large effort to do it, I think?)
The CanonicalInclude is becoming complex now, and leading to some wire design (we introduce a new field pointing to another instance).

And I'm not sure whether it is possible to remove the suffix mapping completely -- for STL symbols it is possible, but for other symbols like GNU C headers, I didn't find out an automatic way to build a lookup table.

Ok, agreed. Adding this indirection layer is probably making things a bit too complex. I'll update the patch to clearly separate the std symbol mapping and canonical includes.

And I'm not sure whether it is possible to remove the suffix mapping completely -- for STL symbols it is possible, but for other symbols like GNU C headers, I didn't find out an automatic way to build a lookup table.

It's because we don't have anything similar to cppreference.com for them? Maybe we could try scrapping man pages ?

  • Move system symbol mapping to a different class.
ilya-biryukov edited the summary of this revision. (Show Details)Sep 6 2019, 4:33 AM

Not sure whether the new code is a win overall, it's still a bit messy
Let me know what you think.

hokein added inline comments.Sep 6 2019, 4:55 AM
clang-tools-extra/clangd/index/CanonicalIncludes.cpp
780 ↗(On Diff #219061)

what's the interesting case where SystemSymbols is null?
I think it should be fine to always use the standard symbol mapping here? just use getStandardSymbolMapping()->mapHeader(), this would help to simplify the CanonicalInclude interface (no addSystemHeadersMapping and SystemSymbols members are needed).

hokein added a comment.Sep 6 2019, 4:58 AM

Ok, agreed. Adding this indirection layer is probably making things a bit too complex. I'll update the patch to clearly separate the std symbol mapping and canonical includes.

And I'm not sure whether it is possible to remove the suffix mapping completely -- for STL symbols it is possible, but for other symbols like GNU C headers, I didn't find out an automatic way to build a lookup table.

It's because we don't have anything similar to cppreference.com for them? Maybe we could try scrapping man pages ?

yes, cppreferences only provides symbols from C++/C standard libraries. Parsing man page is possible, but it requires some work (we have to write another tool to parse it).

ilya-biryukov added inline comments.Sep 9 2019, 4:22 AM
clang-tools-extra/clangd/index/CanonicalIncludes.cpp
780 ↗(On Diff #219061)

Fair question. I guess the only interesting case is unit tests - if we always have system include mapping, testing other things independently is impossible, therefore it's a bit more fragile.
But I don't think that's a real problem in practice.

Should we try to go even further, pass LangOptions to CanonicalIncludes on construction and always have a single defined mapping? I'll try to do that and update the patch accordingly. But let me know if you think that's a bad idea for some reason.

I had the same reaction as Haojian to the original patch, thanks for clearing that up.

I do wonder whether we're microoptimizing for the tests too much, I don't think 5% on the tests is worth very much in itself, unless it's speeding up real workloads or improving the code (it may well be).

I'm nervous about requiring langopts, it's convenient in tests/prototypes/etc to be able to default-construct CanonicalIncludes (e.g. a test that doesn't actually care about standard library mappings).

I do wonder whether we're microoptimizing for the tests too much, I don't think 5% on the tests is worth very much in itself, unless it's speeding up real workloads or improving the code (it may well be).

Even though tests don't parse real C++ programs, I wouldn't call it a micro-optimization. Spending 5% on such a low-level operation is not ok, small things like that can add up and have a subtle effect on performance.
If there's an issue with making the code more complex, I'm happy to explore other ways to make it simpler. What would convince you that's a cleanup that improves things? What parts of the current version do you think are problematic?

I'm nervous about requiring langopts, it's convenient in tests/prototypes/etc to be able to default-construct CanonicalIncludes (e.g. a test that doesn't actually care about standard library mappings).

We can have a factory method that constructs with default lang opts for tests, I don't think this would be an issue in the actual application code. In fact, I think it'll make the application code better.

I do wonder whether we're microoptimizing for the tests too much, I don't think 5% on the tests is worth very much in itself, unless it's speeding up real workloads or improving the code (it may well be).

Even though tests don't parse real C++ programs, I wouldn't call it a micro-optimization. Spending 5% on such a low-level operation is not ok, small things like that can add up and have a subtle effect on performance.

Only if it's 5% of something meaningful. If the denominator isn't something we care about, then it's really "spending XXX usec is not ok" - which depends on XXX I think.

If there's an issue with making the code more complex, I'm happy to explore other ways to make it simpler. What would convince you that's a cleanup that improves things? What parts of the current version do you think are problematic?

You'll note that I didn't say that *anything* about the current version was problematic :-)

One way this could be simpler is if suffix mappings were gone, then the StdIncludes class would go away and we'd just be left with a stringmap<string>. Then there'd be a performance win with almost no extra complexity.
The description says you're waiting for it to go away, but AFAIK nobody's working on this nor really knows how to eliminate it.

Some of the new complexity seems unneccesary even with this approach.
There's a lot of indirection around the initialization, an enum that gets passed around a bunch, constructors that switch over it.
I think this could be:

StdIncludes *getStdIncludes(LangOpts) {
    if (is c++) {
       static pair<...> entries[] = ...;
       static StdIncludes *CPP = new StdIncludes(entries);
       // the things that apply everywhere are in the StdIncludes constructor
       return CPP;
    } else if (is c) {
       ...
    } else {
      static StdIncludes *Others = new StdIncludes({});
      return Others;
    }
}

I'm nervous about requiring langopts, it's convenient in tests/prototypes/etc to be able to default-construct CanonicalIncludes (e.g. a test that doesn't actually care about standard library mappings).

We can have a factory method that constructs with default lang opts for tests, I don't think this would be an issue in the actual application code. In fact, I think it'll make the application code better.

I'm not saying that there's no way to work around it, or that it's fatal, but my opinion is that this would be worse than the current state.

clang-tools-extra/clangd/index/CanonicalIncludes.cpp
22 ↗(On Diff #219061)

Why C11 and not C?

Only if it's 5% of something meaningful. If the denominator isn't something we care about, then it's really "spending XXX usec is not ok" - which depends on XXX I think.

Agree, but unit-test time is meaningful clangd developers. Not saying this is more important than real use-cases, but keeping the unit tests enables us to iterate faster.

One way this could be simpler is if suffix mappings were gone, then the StdIncludes class would go away and we'd just be left with a stringmap<string>. Then there'd be a performance win with almost no extra complexity.
The description says you're waiting for it to go away, but AFAIK nobody's working on this nor really knows how to eliminate it.

We seem to know how to eliminate this in principle. See @hokein's comment, we could try parsing man pages or even the headers themselves and building a corresponding symbol map.
But this patch does not add suffix mapping, so I don't think that's relevant - this patch tries to avoid redundant initialization; not remove suffix maps.

Some of the new complexity seems unneccesary even with this approach.
There's a lot of indirection around the initialization, an enum that gets passed around a bunch, constructors that switch over it.
I think this could be:

Will update the patch. Although I find the current approach more direct, I will happily change the code according to your suggestion.

I'm nervous about requiring langopts, it's convenient in tests/prototypes/etc to be able to default-construct CanonicalIncludes (e.g. a test that doesn't actually care about standard library mappings).

We can have a factory method that constructs with default lang opts for tests, I don't think this would be an issue in the actual application code. In fact, I think it'll make the application code better.

I'm not saying that there's no way to work around it, or that it's fatal, but my opinion is that this would be worse than the current state.

If we care about having a default constructor, we can default to using StdIncludes with a default mapping. I'll avoid requiring LangOptions in the constructor in that case. Thanks for the early feedback!

  • Alternative approach - do not add extra classes, aim to minimize changes

The new version looks very much like the older one, with a few additions to not re-initialize data twice.
PTAL and let me know what you think.

ilya-biryukov marked an inline comment as done.Sep 9 2019, 6:04 AM
ilya-biryukov added inline comments.
clang-tools-extra/clangd/index/CanonicalIncludes.cpp
748 ↗(On Diff #219335)

Note that addSuffixMapping/findSuffixMapping are not paired anymore, which I personally find to be odd.
But let me know whether that's a concern someone else shares.

sammccall accepted this revision.Sep 9 2019, 7:25 AM

LG, though if we can drop the struct and make MaxSuffixComponents a constant it'd be simpler still.

Only if it's 5% of something meaningful. If the denominator isn't something we care about, then it's really "spending XXX usec is not ok" - which depends on XXX I think.

Agree, but unit-test time is meaningful clangd developers. Not saying this is more important than real use-cases, but keeping the unit tests enables us to iterate faster.

Sure. This is going to win a couple of percent I guess: for these cases we care about walltime including rebuild and lit startup overhead. It's worth having (and thanks for doing this!) but I don't think we should pay much complexity.

One way this could be simpler is if suffix mappings were gone, then the StdIncludes class would go away and we'd just be left with a stringmap<string>. Then there'd be a performance win with almost no extra complexity.
The description says you're waiting for it to go away, but AFAIK nobody's working on this nor really knows how to eliminate it.

We seem to know how to eliminate this in principle. See @hokein's comment, we could try parsing man pages or even the headers themselves and building a corresponding symbol map.

"we could try" sounds like we *don't* know how to eliminate it. Parsing manpages aside, I thought the main problem was these symbols are nonstandard and an infinitely portable qname->header mapping simply didn't exist.

But this patch does not add suffix mapping, so I don't think that's relevant - this patch tries to avoid redundant initialization; not remove suffix maps.

This patch adds code shoveling around suffix maps, and so increases the cost of this feature. (Less so after latest changes). It also increases the scope of a future patch that would remove suffix maps. Sequencing this the other way would be better, if removing suffix mapping was straightforward. (I don't think it is)

Some of the new complexity seems unneccesary even with this approach.
There's a lot of indirection around the initialization, an enum that gets passed around a bunch, constructors that switch over it.
I think this could be:

Will update the patch. Although I find the current approach more direct, I will happily change the code according to your suggestion.

Thanks, this looks a lot better!
I think it can be simplified a little further, as the suffix maps are totally hardcoded now.

clang-tools-extra/clangd/index/CanonicalIncludes.cpp
41 ↗(On Diff #219335)

nit: can we write if (SuffixHeaderMapping) { ... } for consistency with above?

84 ↗(On Diff #219335)

so this is a constant, and it's 3.

Can we avoid the calculation/propagation and defining a struct, and just add an assert(llvm::all_of(...)) after the initialization?

91 ↗(On Diff #219335)

I think this could just be

static const auto *SystemHeaderMap = new llvm::StringMap<string>{
...
}

skipping the intermediate array, and doing the initialize-once here
(if we can drop the struct)

758 ↗(On Diff #219335)

if you want to save CPU, move to the scope where it's used? Lit tests and many interesting subsets will never use C.

clang-tools-extra/clangd/index/CanonicalIncludes.h
60 ↗(On Diff #219335)

nit: maybe StdSuffixHeaderMapping to clarify the purpose, now these are std-only

This revision is now accepted and ready to land.Sep 9 2019, 7:25 AM

LG, though if we can drop the struct and make MaxSuffixComponents a constant it'd be simpler still.

Done.

Sure. This is going to win a couple of percent I guess: for these cases we care about walltime including rebuild and lit startup overhead. It's worth having (and thanks for doing this!) but I don't think we should pay much complexity.

Agreed.

"we could try" sounds like we *don't* know how to eliminate it. Parsing manpages aside, I thought the main problem was these symbols are nonstandard and an infinitely portable qname->header mapping simply didn't exist.

I would expect qualified names to be more portable than paths, but might be mistaken. I recall that we might run into problems if folks define some names as macros instead of functions, but I would be interested to see how common is this.
We should probably write down the examples that are broken somewhere... It's hard to see why it doesn't work without having the concrete cases.
@hokein, do you have any examples that cannot be covered by qual-name-to-path mapping?

Thanks, this looks a lot better!
I think it can be simplified a little further, as the suffix maps are totally hardcoded now.

I think the simplest option is to always make the path suffix mapping available in the class.
That means we will always map the system paths to their shorter versions. It looks ok to me, but might make unit-testing a little more awkward.
WDYT?
I'll land this and send this out as a separate patch if everyone is happy with the approach.

"we could try" sounds like we *don't* know how to eliminate it. Parsing manpages aside, I thought the main problem was these symbols are nonstandard and an infinitely portable qname->header mapping simply didn't exist.

I would expect qualified names to be more portable than paths, but might be mistaken. I recall that we might run into problems if folks define some names as macros instead of functions, but I would be interested to see how common is this.

In practice, I expect almost all of these names to be C and thus unqualified.

Names are indeed more portable but unlike path mappings we can't just include both when different platforms want different things, or nothing.
e.g. are we willing to claim that any kind of symbol named ::open should be from <sys/stat.h>, even on windows?
Maybe it's OK, and I can't remember if we had specific bad examples in mind, but seems like there are dragons here.

Thanks, this looks a lot better!
I think it can be simplified a little further, as the suffix maps are totally hardcoded now.

I think the simplest option is to always make the path suffix mapping available in the class.
That means we will always map the system paths to their shorter versions. It looks ok to me, but might make unit-testing a little more awkward.
WDYT?
I'll land this and send this out as a separate patch if everyone is happy with the approach.

From a public API standpoint, it seems cleaner to only have these available once addSystemHeadersMapping has been called (with any argument), and delay assigning the pointer until then.
But if adds more than a few lines, it seems fine without.

ilya-biryukov marked 8 inline comments as done.Sep 9 2019, 8:27 AM
ilya-biryukov added inline comments.
clang-tools-extra/clangd/index/CanonicalIncludes.cpp
41 ↗(On Diff #219335)

I'd prefer less nesting to consistency, but happy let me know if you feel that's very important.

84 ↗(On Diff #219335)

The assertion is inside the loop that's adding the mappings instead of llvm::all_of.

758 ↗(On Diff #219335)

Done. Also initialized everything directly as StringMap

ilya-biryukov marked 3 inline comments as done.
  • Make MaxSuffixComponents a constant
  • Put the suffix mapping into a single constant
  • Initialize all StringMaps directly
  • Rename to Std...Mapping

Thanks for all suggestions. The final result is rather small and minimal.

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptSep 9 2019, 8:31 AM