Page MenuHomePhabricator

[llvm] Make YAML serialization up to 2.5 times faster

Authored by kbobyrev on Aug 16 2018, 6:00 AM.



This patch significantly improves performance of the YAML serializer by optimizing YAML::isNumeric function. This function is called on the most strings and is highly inefficient for two reasons:

  • It uses Regex, which is parsed and compiled each time this function is called
  • It uses multiple passes which are not necessary

This patch introduces stateful ad hoc YAML number parser which does not rely on Regex. It also fixes YAML number format inconsistency: current implementation supports C-stile octal number format (01234567) which was present in YAML 1.0 specialization (, [Section 2.4. Tags, Example 2.19] but was deprecated and is no longer present in latest YAML 1.2 specification (, see [Section 10.3.2. Tag Resolution]. Since the rest of the rest of the implementation does not support other deprecated YAML 1.0 numeric features such as sexagecimal numbers, commas as delimiters it is treated as inconsistency and not longer supported. This patch also adds unit tests to ensure the validity of proposed implementation.

This performance bottleneck was identified while profiling Clangd's global-symbol-builder tool with my colleague @ilya-biryukov. The substantial part of the runtime was spent during a single-thread Reduce phase, which concludes with YAML serialization of collected symbol collection. Regex matching was accountable for approximately 45% of the whole runtime (which involves sharded Map phase), now it is reduced to 18% (which is spent in clang::clangd::CanonicalIncludes and can be also optimized because all used regexes are in fact either suffix matches or exact matches).

llvm-yaml-numeric-parser-fuzzer was used to ensure the validity of the proposed regex replacement. Fuzzing for ~60 hours using 10 threads did not expose any bugs.

Benchmarking global-symbol-builder (using hyperfine --warmup 2 --min-runs 5 'command 1' 'command 2') tool by processing a reasonable amount of code (26 source files matched by clang-tools-extra/clangd/*.cpp with all transitive includes) confirmed our understanding of the performance bottleneck nature as it speeds up the command by the factor of 1.6x:

CommandMean [s]Min…Max [s]
this patch (D50839)84.7 ± 0.683.3…84.7
master (rL339849)133.1 ± 0.8132.4…134.6

Using smaller samples (e.g. by collecting symbols from clang-tools-extra/clangd/AST.cpp only) yields even better performance improvement, which is expected because Map phase takes less time compared to Reduce and is 2.05x faster and therefore would significantly improve the performance of standalone YAML serializations.

CommandMean [ms]Min…Max [ms]
this patch (D50839)3702.2 ± 48.73635.1…3752.3
master (rL339849)7607.6 ± 109.57533.3…7796.4

Diff Detail


Event Timeline

kbobyrev created this revision.Aug 16 2018, 6:00 AM
kbobyrev edited the summary of this revision. (Show Details)Aug 16 2018, 6:01 AM
kbobyrev edited the summary of this revision. (Show Details)
kbobyrev edited the summary of this revision. (Show Details)
kbobyrev edited the summary of this revision. (Show Details)
kbobyrev edited the summary of this revision. (Show Details)Aug 16 2018, 6:03 AM
lebedev.ri added inline comments.
453 ↗(On Diff #161008)

Passing-by thought, feel free to ignore.

Changes like these are a great targets for fuzzers.
Don't just rewrite the implementation, but instead write a new [optimized] function,
and add a fuzzer that would feed both of these functions the same input,
and assert the equality of their outputs. (and that neither of them crashes).

Would preserve the infinitely more readable code version, too.

Just drive-by comments, the maintainers of the code are in a much better position to give feedback, of course.
Nevertheless, my few cents:

  • Getting rid of a regex in favor of the explicit loop is definitely a good thing. It's incredibly how much time is spent there when serializing big chunks of YAML (our case in clangd).
  • Other changes are definitely less of a win performance-wise and I personally find the new code a bit harder to read than before, so don't feel as confident about those.
  • Actual correctness fixes are a good thing, but could go in as a separate patch.

I would suggest splitting the patch into two: (1) rewriting the regex, (2) other changes. (1) is such a clear win, that it would be a pitty to have it delayed by reviewing other parts of the patch :-)

466 ↗(On Diff #161008)

I would argue the previous version of parsing hex and octal chars was more readable.
And I'm not sure the new heavily optimized version is more performant too.

494 ↗(On Diff #161008)

A more structured way to parse a floating numbers would be to:

  1. skip digits until we find . or exponent (e)
  2. if we find ., skip digits until we find an exponent.
  3. if we find an exponent, skip the next symbol if it's '+' or '-', then skip digits until the end of the string.

having a code structure that mirrors that would make the code more readable.

joerg added a subscriber: joerg.Aug 16 2018, 6:52 AM
joerg added inline comments.
477 ↗(On Diff #161008)

Can you use strchr here? I would expect the compiler to fold the string constants into bit tests, creating both more compact and faster code.

kbobyrev updated this revision to Diff 161027.Aug 16 2018, 7:46 AM
kbobyrev marked 2 inline comments as done.
kbobyrev added a subscriber: lebedev.ri.

Very good point by @lebedev.ri! I have added a very simple fuzzer for the parser. So far, there were no issues with the current implementation. I have not exposed the regexp matcher to the header, though, because it won't be used anywhere.

kbobyrev updated this revision to Diff 161033.Aug 16 2018, 8:03 AM

Use consistent Regex matchers naming: don't append "Matcher" at the end.

kbobyrev edited reviewers, added: majnemer; removed: ioeric, javed.absar.Aug 16 2018, 8:09 AM
kbobyrev added a subscriber: ioeric.
joerg added inline comments.Aug 16 2018, 9:37 AM
16 ↗(On Diff #161033)


lebedev.ri added inline comments.Aug 16 2018, 9:43 AM
2626 ↗(On Diff #161033)


2627 ↗(On Diff #161033)


zturner added inline comments.Aug 16 2018, 10:24 AM
454 ↗(On Diff #161033)

What would happen if we re-wrote this entire function as:

inline bool isNumeric(StringRef S) {
  uint64_t N;
  int64_t I;
  APFloat F;
  return S.getAsInteger(N) || S.getAsInteger(I) || (F.convertFromString(S) == opOK);

Would this a) Be correct, and b) have similar performance characteristics to what you've got here?

kbobyrev updated this revision to Diff 161191.Aug 17 2018, 2:44 AM
kbobyrev marked 4 inline comments as done.

Upload version which is IMO readable.

454 ↗(On Diff #161033)

Thank you for the suggestion!

I have tried the proposed approach, but there are several caveats:

First, APInt (which I believe should be used in this case since YAML numbers are of arbitrary length) parsing does not look simpler than the current approach (and it's also unnecessary overhead and potentially some cases which are invalid in YAML but are perfectly fine in APInt parser). An example would be the prefix of octal numbers: APInt accepts 0 while it should be 0o in YAML, so the Radix should be manually inferred anyway.

The main problem, however, is with the APFloat parser, which accepts a huge number of items which are not valid in YAML numeric format. Examples are:

  • .
  • .e+1
  • .e+
  • .e

Even worse, the parser appears to have bugs. I was able to find several classes of inputs which cause global-buffer-overflow caught by AddressSanitizer (e.g. .+). This should be investigated independently. However, the above cases lead me to believe that:

  1. The LLVM parser is likely to have a huge number of cases which are invalid in YAML numeric format but are valid APFloats. Finding all of these cases is non-trivial and is probably not rewarding.
  2. The parser is unreliable.

What do you think?

kbobyrev updated this revision to Diff 161192.Aug 17 2018, 2:47 AM

I tried to rewrite the loop, but IMO it looks even worse now.

kbobyrev updated this revision to Diff 161204.Aug 17 2018, 3:55 AM

Add couple tests, fix formatting issues, use __builtin_trap() instead of assert in fuzzer so that it's more transparent.

Also, fuzzing this unreadable version for a couple of hours suggests that it is valid.

kbobyrev edited subscribers, added: llvm-commits; removed: cfe-commits.Aug 17 2018, 6:02 AM

I suspected something would be wrong with that approach, it would be too
simple otherwise :) lgtm

Mostly LG, just a few more NITs

458 ↗(On Diff #161204)
  • Maybe simplify to return S.dropWhile(...)?
  • Maybe make it a lambda and put inside isNumeric?
464 ↗(On Diff #161204)

Maybe use S == "+" instead of S.equals("+")?
Just a suggestion, feel free to ignore

497 ↗(On Diff #161204)

maybe use std::isdigit(S[1]) instead?

530 ↗(On Diff #161204)

NIT: remove braces, remove else. LLVM Style guide has a section on it :-)

zturner added inline comments.Aug 17 2018, 7:47 AM
458 ↗(On Diff #161204)

dropWhile will probably be slower, but S.drop_front(S.find_first_not_of("0123456789")) would be good

482–483 ↗(On Diff #161204)

Doesn't find_first_not_of have a starting pos argument? If so we could use that instead of the drop_front

487–488 ↗(On Diff #161204)

Same here.

497 ↗(On Diff #161204)

We should use llvm::isDigit instead.

kbobyrev updated this revision to Diff 161266.Aug 17 2018, 8:43 AM
kbobyrev marked 11 inline comments as done.

Thank you for the feedback! I will fuzz over the weekend just in case and update the benchmark before submitting.

kbobyrev updated this revision to Diff 161420.Aug 19 2018, 11:48 PM
kbobyrev edited the summary of this revision. (Show Details)

Run clang-format.

kbobyrev retitled this revision from [llvm] Optimize YAML::isNumeric to [llvm] Make YAML serialization up to 2.5 times faster.Aug 19 2018, 11:49 PM
This revision was not accepted when it landed; it landed in state Needs Review.Aug 20 2018, 12:01 AM
This revision was automatically updated to reflect the committed changes.

Hi Kirill,


You can probably use StringRef::compare_lower() rather than enumerating all the possible strings in input.




This assert is wrong.

It should be:

assert(State == FoundExponent && "Should have found exponent at this point.");

This is causing some spurious warnings on gcc.

YAMLTraits.h:536: warning: enum constant in boolean context [-Wint-in-bool-context].

Hi Kirill,

Hi Andrea! Thank you very much for spotting this, I will fix those as soon as I get to my workstation.

kbobyrev marked an inline comment as done.Aug 21 2018, 12:25 AM

Fixed the assertion in rL340252. My comments about compare_lower() are inline.


.nAN, .Nan will be allowed then, same with infinity.

I was mainly concerned about the assert. Thanks for fixing it! :-)

kbobyrev marked 3 inline comments as done.Sep 17 2018, 1:56 AM