This is an archive of the discontinued LLVM Phabricator instance.

Differential D50839

[llvm] Make YAML serialization up to 2.5 times faster
Closed · Public

Authored by kbobyrev on Aug 16 2018, 6:00 AM.


Details

Reviewers

ilya-biryukov
zturner
majnemer
javed.absar

Commits

rG5f26a642e61e: [llvm] Make YAML serialization up to 2.5 times faster
rL340154: [llvm] Make YAML serialization up to 2.5 times faster

Summary

This patch significantly improves the performance of the YAML serializer by optimizing the YAML::isNumeric function. This function is called on most strings and is highly inefficient for two reasons:

- It uses Regex, which is parsed and compiled each time the function is called.
- It makes multiple passes over the input, which are not necessary.
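
As a rough illustration of the first point, the sketch below uses `std::regex` as a stand-in for `llvm::Regex` (an assumption made to keep the example self-contained; the function names are hypothetical). It contrasts building the matcher on every call with building it once. The patch itself goes further and removes the regex from `isNumeric` altogether.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Pattern modelled on the float regex in the old isNumber(); illustration only.
static const char *const FloatPattern =
    R"(^(\.[0-9]+|[0-9]+(\.[0-9]*)?)([eE][-+]?[0-9]+)?$)";

// Builds the matcher on every call: the pattern is parsed and compiled each
// time, which is the cost described in the summary.
bool isFloatRecompiling(const std::string &S) {
  std::regex FloatMatcher(FloatPattern);
  return std::regex_match(S, FloatMatcher);
}

// Builds the matcher once, on first use, and reuses it afterwards.
bool isFloatCached(const std::string &S) {
  static const std::regex FloatMatcher(FloatPattern);
  return std::regex_match(S, FloatMatcher);
}
```

Both functions return the same answers; only the amount of work per call differs.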

This patch introduces a stateful ad hoc YAML number parser which does not rely on Regex. It also fixes a YAML number format inconsistency: the current implementation supports the C-style octal number format (01234567), which was present in the YAML 1.0 specification (http://yaml.org/spec/1.0/) [Section 2.4. Tags, Example 2.19] but was deprecated and is no longer present in the latest YAML 1.2 specification (http://yaml.org/spec/1.2/spec.html), see [Section 10.3.2. Tag Resolution]. Since the rest of the implementation does not support other deprecated YAML 1.0 numeric features, such as sexagesimal numbers and commas as delimiters, this is treated as an inconsistency and is no longer supported. This patch also adds unit tests to ensure the validity of the proposed implementation.

This performance bottleneck was identified while profiling Clangd's global-symbol-builder tool with my colleague @ilya-biryukov. A substantial part of the runtime was spent in the single-threaded Reduce phase, which concludes with YAML serialization of the collected symbols. Regex matching accounted for approximately 45% of the whole runtime (which includes the sharded Map phase); now it is reduced to 18% (which is spent in clang::clangd::CanonicalIncludes and can also be optimized, because all the regexes used there are in fact either suffix matches or exact matches).

llvm-yaml-numeric-parser-fuzzer was used to ensure the validity of the proposed regex replacement. Fuzzing for ~60 hours using 10 threads did not expose any bugs.

Benchmarking the global-symbol-builder tool (using hyperfine --warmup 2 --min-runs 5 'command 1' 'command 2') on a reasonable amount of code (26 source files matched by clang-tools-extra/clangd/*.cpp with all transitive includes) confirmed our understanding of the nature of the performance bottleneck: the patch speeds up the command by a factor of 1.6:

Command	Mean [s]	Min…Max [s]
this patch (D50839)	84.7 ± 0.6	83.3…84.7
master (rL339849)	133.1 ± 0.8	132.4…134.6

Using smaller samples (e.g. by collecting symbols from clang-tools-extra/clangd/AST.cpp only) yields an even better performance improvement, which is expected because the Map phase then takes less time compared to Reduce. In this case the command is 2.05x faster, so the patch should significantly improve the performance of standalone YAML serialization.

Command	Mean [ms]	Min…Max [ms]
this patch (D50839)	3702.2 ± 48.7	3635.1…3752.3
master (rL339849)	7607.6 ± 109.5	7533.3…7796.4

Diff Detail

Event Timeline

kbobyrev created this revision. Aug 16 2018, 6:00 AM

Herald added a reviewer: javed.absar. Aug 16 2018, 6:00 AM

Herald added a subscriber: kristof.beyls.

kbobyrev edited the summary of this revision. Aug 16 2018, 6:01 AM

kbobyrev edited the summary of this revision.

kbobyrev edited the summary of this revision. Aug 16 2018, 6:03 AM

lebedev.ri added a subscriber: lebedev.ri. Aug 16 2018, 6:10 AM

lebedev.ri added inline comments.

llvm/include/llvm/Support/YAMLTraits.h
460–463	Passing-by thought, feel free to ignore. Changes like these are a great targets for fuzzers. Don't just rewrite the implementation, but instead write a new [optimized] function, and add a fuzzer that would feed both of these functions the same input, and assert the equality of their outputs. (and that neither of them crashes). Would preserve the infinitely more readable code version, too.
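
A minimal sketch of the differential setup suggested in this comment, using only the standard library. The two hex checkers and their names are hypothetical stand-ins, not the functions from this patch; `LLVMFuzzerTestOneInput` is the usual libFuzzer entry point, and `std::abort()` is used here as a portable way to signal a mismatch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <regex>
#include <string>

// Reference implementation: readable, regex-based.
bool isHexReference(const std::string &S) {
  static const std::regex Base16("^0x[0-9a-fA-F]+$");
  return std::regex_match(S, Base16);
}

// Candidate implementation: hand-written, no regex.
bool isHexFast(const std::string &S) {
  return S.size() > 2 && S.compare(0, 2, "0x") == 0 &&
         S.find_first_not_of("0123456789abcdefABCDEF", 2) == std::string::npos;
}

// Feed the same input to both implementations and stop on any disagreement.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  std::string Input(reinterpret_cast<const char *>(Data), Size);
  if (isHexReference(Input) != isHexFast(Input))
    std::abort();
  return 0;
}
```

The readable version stays in the fuzzer as the oracle, so it does not have to be kept in the header.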

Just drive-by comments, the maintainers of the code are in a much better position to give feedback, of course.
Nevertheless, my few cents:

Getting rid of a regex in favor of the explicit loop is definitely a good thing. It's incredible how much time is spent there when serializing big chunks of YAML (our case in clangd).
Other changes are definitely less of a win performance-wise and I personally find the new code a bit harder to read than before, so don't feel as confident about those.
Actual correctness fixes are a good thing, but could go in as a separate patch.

I would suggest splitting the patch into two: (1) rewriting the regex, (2) other changes. (1) is such a clear win, that it would be a pity to have it delayed by reviewing other parts of the patch :-)

llvm/include/llvm/Support/YAMLTraits.h
476	I would argue the previous version of parsing hex and octal chars was more readable. And I'm not sure the new heavily optimized version is more performant too.
524	A more structured way to parse floating-point numbers would be to:
1. skip digits until we find `.` or exponent (`e`)
2. if we find `.`, skip digits until we find an exponent.
3. if we find an exponent, skip the next symbol if it's `'+'` or `'-'`, then skip digits until the end of the string.
Having a code structure that mirrors that would make the code more readable.
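
The structure described in this comment can be sketched roughly as follows. This is a standalone illustration built on `std::string_view`, not the code from the patch; `isYamlFloat` and `skipDigits` are hypothetical names, and the sketch covers only the float grammar from the diff's comment, not hex, octal, `.inf` or `.nan`.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Drops the leading run of ASCII digits and returns how many were dropped.
static std::size_t skipDigits(std::string_view &S) {
  std::size_t N = 0;
  while (N < S.size() && S[N] >= '0' && S[N] <= '9')
    ++N;
  S.remove_prefix(N);
  return N;
}

// Accepts [-+]? (\. [0-9]+ | [0-9]+ (\. [0-9]*)?) ([eE] [-+]? [0-9]+)?
bool isYamlFloat(std::string_view S) {
  if (!S.empty() && (S.front() == '+' || S.front() == '-'))
    S.remove_prefix(1);

  // Step 1: digits before the dot or exponent.
  std::size_t MantissaDigits = skipDigits(S);

  // Step 2: an optional dot, followed by more digits.
  if (!S.empty() && S.front() == '.') {
    S.remove_prefix(1);
    MantissaDigits += skipDigits(S);
  }

  // The mantissa needs at least one digit on either side of the dot.
  if (MantissaDigits == 0)
    return false;

  // Step 3: an optional exponent with an optional sign and mandatory digits.
  if (!S.empty() && (S.front() == 'e' || S.front() == 'E')) {
    S.remove_prefix(1);
    if (!S.empty() && (S.front() == '+' || S.front() == '-'))
      S.remove_prefix(1);
    if (skipDigits(S) == 0)
      return false;
  }

  // Anything left over means the input is not a number.
  return S.empty();
}
```

Each step consumes its part of the input once, so the whole check is a single pass over the string.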

joerg added a subscriber: joerg. Aug 16 2018, 6:52 AM

joerg added inline comments.

llvm/include/llvm/Support/YAMLTraits.h
487	Can you use strchr here? I would expect the compiler to fold the string constants into bit tests, creating both more compact and faster code.

lebedev.ri removed a subscriber: lebedev.ri. Aug 16 2018, 6:55 AM

Very good point by @lebedev.ri! I have added a very simple fuzzer for the parser. So far, there were no issues with the current implementation. I have not exposed the regexp matcher to the header, though, because it won't be used anywhere.

Herald added a subscriber: mgorny. Aug 16 2018, 7:46 AM

kbobyrev added a reviewer: zturner. Aug 16 2018, 7:47 AM

Use consistent Regex matchers naming: don't append "Matcher" at the end.

kbobyrev edited reviewers, added: majnemer; removed: ioeric, javed.absar. Aug 16 2018, 8:09 AM

kbobyrev added a subscriber: ioeric.

Herald added a reviewer: javed.absar. Aug 16 2018, 8:09 AM

joerg added inline comments. Aug 16 2018, 9:37 AM

llvm/tools/llvm-yaml-numeric-parser-fuzzer/yaml-numeric-parser-fuzzer.cpp
17	Spelling?

lebedev.ri added inline comments. Aug 16 2018, 9:43 AM

llvm/unittests/Support/YAMLIOTest.cpp
2626	Spelling
2627	spelling

zturner added inline comments. Aug 16 2018, 10:24 AM

llvm/include/llvm/Support/YAMLTraits.h
461	What would happen if we re-wrote this entire function as: `inline bool isNumeric(StringRef S) { uint64_t N; int64_t I; APFloat F; return S.getAsInteger(N) || S.getAsInteger(I) || (F.convertFromString(S) == opOK); }` Would this a) Be correct, and b) have similar performance characteristics to what you've got here?

Upload version which is IMO readable.

llvm/include/llvm/Support/YAMLTraits.h
461	Thank you for the suggestion! I have tried the proposed approach, but there are several caveats:
First, `APInt` (which I believe should be used in this case since YAML numbers are of arbitrary length) parsing does not look simpler than the current approach (and it's also unnecessary overhead and potentially some cases which are invalid in YAML but are perfectly fine in `APInt` parser). An example would be the prefix of octal numbers: `APInt` accepts `0` while it should be `0o` in YAML, so the `Radix` should be manually inferred anyway.
The main problem, however, is with the `APFloat` parser, which accepts a huge number of items which are not valid in YAML numeric format. Examples are:
- `.`
- `.e+1`
- `.e+`
- `.e`
Even worse, the parser appears to have bugs. I was able to find several classes of inputs which cause global-buffer-overflow caught by AddressSanitizer (e.g. `.+`). This should be investigated independently.
However, the above cases lead me to believe that:
- The LLVM parser is likely to have a huge number of cases which are invalid in YAML numeric format but are valid `APFloat`s. Finding all of these cases is non-trivial and is probably not rewarding.
- The parser is unreliable.
What do you think?

lebedev.ri removed a subscriber: lebedev.ri. Aug 17 2018, 2:45 AM

I tried to rewrite the loop, but IMO it looks even worse now.

Add a couple of tests, fix formatting issues, use __builtin_trap() instead of assert in the fuzzer so that it's more transparent.

Also, fuzzing this unreadable version for a couple of hours suggests that it is valid.

kbobyrev edited subscribers, added: llvm-commits; removed: cfe-commits. Aug 17 2018, 6:02 AM

I suspected something would be wrong with that approach, it would be too simple otherwise :) lgtm

Mostly LG, just a few more NITs

llvm/include/llvm/Support/YAMLTraits.h
458	Maybe simplify to `return S.dropWhile(...)`? Maybe make it a lambda and put inside `isNumeric`?
463	Maybe use `S == "+"` instead of `S.equals("+")`? Just a suggestion, feel free to ignore
495	maybe use `std::isdigit(S[1])` instead?
519	NIT: remove braces, remove `else`. LLVM Style guide has a section on it :-)

zturner added inline comments. Aug 17 2018, 7:47 AM

llvm/include/llvm/Support/YAMLTraits.h
458	`dropWhile` will probably be slower, but `S.drop_front(S.find_first_not_of("0123456789"))` would be good
480–481	Doesn't `find_first_not_of` have a starting pos argument? If so we could use that instead of the `drop_front`
485–486	Same here.
495	We should use `llvm::isDigit` instead.
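
One detail to keep in mind with the `find_first_not_of` suggestion: the search returns `npos` when the input consists only of digits, which is presumably why the `skipDigits` lambda in the diff clamps the index with `std::min(..., Input.size())`. Below is a standard-library analogue, with `std::string_view` standing in for `StringRef` and a hypothetical function name; for `std::string_view`, calling `remove_prefix` with a count larger than the size is undefined behavior, so the clamp is required there.

```cpp
#include <algorithm>
#include <cassert>
#include <string_view>

// Returns Input without its leading run of ASCII digits.
std::string_view dropLeadingDigits(std::string_view Input) {
  // find_first_not_of yields npos for an all-digit (or empty) input, so the
  // index is clamped to the input size before the prefix is removed.
  Input.remove_prefix(
      std::min(Input.find_first_not_of("0123456789"), Input.size()));
  return Input;
}
```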

Thank you for the feedback! I will fuzz over the weekend just in case and update the benchmark before submitting.

Run clang-format.

Herald added a subscriber: kadircet. Aug 19 2018, 11:48 PM

kbobyrev retitled this revision from [llvm] Optimize YAML::isNumeric to [llvm] Make YAML serialization up to 2.5 times faster. Aug 19 2018, 11:49 PM

This revision was not accepted when it landed; it landed in state Needs Review. Aug 20 2018, 12:01 AM

Closed by commit rL340154: [llvm] Make YAML serialization up to 2.5 times faster (authored by omtcyfz).

This revision was automatically updated to reflect the committed changes.

omtcyfz mentioned this in rL340154: [llvm] Make YAML serialization up to 2.5 times faster.

Hi Kirill,

llvm/trunk/include/llvm/Support/YAMLTraits.h
464 ↗	(On Diff #161421)	You can probably use `StringRef::compare_lower()` rather than enumerating all the possible strings in input.
472 ↗	(On Diff #161421)	Same.
536 ↗	(On Diff #161421)	This assert is wrong. It should be: assert(State == FoundExponent && "Should have found exponent at this point."); This is causing some spurious warnings on gcc. YAMLTraits.h:536: warning: enum constant in boolean context [-Wint-in-bool-context].
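
The problem with the original assert can be reproduced in isolation. In the sketch below the enum matches the one in the diff, while the two helper functions are hypothetical and exist only to make the two conditions comparable.

```cpp
#include <cassert>

enum ParseState { Default, FoundDot, FoundExponent };

// Mirrors the condition in `assert(FoundExponent && "...")`: it looks only at
// the enumerator's value (2), so it is true for every State.
constexpr bool buggyCondition(ParseState /*State*/) {
  return static_cast<bool>(FoundExponent);
}

// Mirrors the condition in `assert(State == FoundExponent && "...")`: it
// compares the variable against the enumerator.
constexpr bool fixedCondition(ParseState State) {
  return State == FoundExponent;
}

static_assert(buggyCondition(Default), "the buggy condition can never fail");
static_assert(!fixedCondition(Default), "the fixed condition rejects Default");
static_assert(fixedCondition(FoundExponent), "and accepts FoundExponent");
```

Since the buggy condition is a constant, the assert never fires, which is what the gcc warning points at.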

In D50839#1205689, @andreadb wrote:

Hi Kirill,

Hi Andrea! Thank you very much for spotting this, I will fix those as soon as I get to my workstation.

Fixed the assertion in rL340252. My comments about compare_lower() are inline.

llvm/trunk/include/llvm/Support/YAMLTraits.h
464 ↗	(On Diff #161421)	`.nAN`, `.Nan` will be allowed then, same with infinity.
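
To make the objection concrete, here is a small standard-library sketch comparing the enumerated check with a case-insensitive one. The helper names are hypothetical, and `equalsIgnoreCase` is only an approximation of what a case-insensitive `StringRef` comparison would do.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// The three spellings enumerated in the patch.
bool isNaNExact(std::string_view S) {
  return S == ".nan" || S == ".NaN" || S == ".NAN";
}

// ASCII case-insensitive equality.
bool equalsIgnoreCase(std::string_view A, std::string_view B) {
  if (A.size() != B.size())
    return false;
  for (std::size_t I = 0; I < A.size(); ++I)
    if (std::tolower(static_cast<unsigned char>(A[I])) !=
        std::tolower(static_cast<unsigned char>(B[I])))
      return false;
  return true;
}

// Accepts every capitalization, including ones the patch rejects.
bool isNaNCaseInsensitive(std::string_view S) {
  return equalsIgnoreCase(S, ".nan");
}
```

With these definitions, `.nAN` and `.Nan` pass the case-insensitive check but fail the enumerated one.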

Right.
I was mainly concerned about the assert. Thanks for fixing it! :-)

kbobyrev marked 3 inline comments as done. Sep 17 2018, 1:56 AM

Herald added a subscriber: kristina. Sep 17 2018, 1:56 AM

scott.linder mentioned this in D91573: [YAMLIO] Add a generic YAML fuzzer harness. Nov 16 2020, 3:24 PM

scott.linder mentioned this in rG544cb649d778: [YAMLIO] Add a generic YAML fuzzer harness. Nov 18 2020, 3:06 PM

Revision Contents

Path	Size
llvm/include/llvm/Support/YAMLTraits.h	111 lines
llvm/tools/llvm-yaml-numeric-parser-fuzzer/CMakeLists.txt	9 lines
llvm/tools/llvm-yaml-numeric-parser-fuzzer/DummyYAMLNumericParserFuzzer.cpp	19 lines
llvm/tools/llvm-yaml-numeric-parser-fuzzer/yaml-numeric-parser-fuzzer.cpp	47 lines
llvm/unittests/Support/YAMLIOTest.cpp	83 lines

Diff 161266

llvm/include/llvm/Support/YAMLTraits.h

(21 lines not shown)
#include "llvm/Support/Regex.h"		#include "llvm/Support/Regex.h"
#include "llvm/Support/SourceMgr.h"		#include "llvm/Support/SourceMgr.h"
#include "llvm/Support/YAMLParser.h"		#include "llvm/Support/YAMLParser.h"
#include "llvm/Support/raw_ostream.h"		#include "llvm/Support/raw_ostream.h"
#include <cassert>		#include <cassert>
#include <cctype>		#include <cctype>
#include <cstddef>		#include <cstddef>
#include <cstdint>		#include <cstdint>
		#include <iterator>
#include <map>		#include <map>
#include <memory>		#include <memory>
#include <new>		#include <new>
#include <string>		#include <string>
#include <system_error>		#include <system_error>
#include <type_traits>		#include <type_traits>
#include <vector>		#include <vector>
		#include <cassert>

namespace llvm {		namespace llvm {
namespace yaml {		namespace yaml {

struct EmptyContext {};		struct EmptyContext {};

/// This class should be specialized by any type that needs to be converted		/// This class should be specialized by any type that needs to be converted
/// to/from a YAML mapping. For example:		/// to/from a YAML mapping. For example:
(399 lines not shown)	struct has_DocumentListTraits

template <typename U>		template <typename U>
static double test(...);		static double test(...);

public:		public:
static bool const value = (sizeof(test<DocumentListTraits<T>>(nullptr))==1);		static bool const value = (sizeof(test<DocumentListTraits<T>>(nullptr))==1);
};		};

inline bool isNumber(StringRef S) {		inline bool isNumeric(StringRef S) {
static const char OctalChars[] = "01234567";		const static auto skipDigits = [](StringRef Input) {
if (S.startswith("0") &&		return Input.drop_front(std::min(Input.find_first_not_of("0123456789"),
S.drop_front().find_first_not_of(OctalChars) == StringRef::npos)		Input.size()));
return true;		};
		ilya-biryukov: Maybe simplify to `return S.dropWhile(...)`? Maybe make it a lambda and put inside `isNumeric`?
		zturner: `dropWhile` will probably be slower, but `S.drop_front(S.find_first_not_of("0123456789"))` would be good

if (S.startswith("0o") &&		// Make S.front() and S.drop_front().front() (if S.front() is [+-]) calls
S.drop_front(2).find_first_not_of(OctalChars) == StringRef::npos)		// safe.
		zturner: What would happen if we re-wrote this entire function as: `inline bool isNumeric(StringRef S) { uint64_t N; int64_t I; APFloat F; return S.getAsInteger(N) || S.getAsInteger(I) || (F.convertFromString(S) == opOK); }` Would this a) Be correct, and b) have similar performance characteristics to what you've got here?
		kbobyrev (author): Thank you for the suggestion! I have tried the proposed approach, but there are several caveats: First, `APInt` (which I believe should be used in this case since YAML numbers are of arbitrary length) parsing does not look simpler than the current approach (and it's also unnecessary overhead and potentially some cases which are invalid in YAML but are perfectly fine in `APInt` parser). An example would be the prefix of octal numbers: `APInt` accepts `0` while it should be `0o` in YAML, so the `Radix` should be manually inferred anyway. The main problem, however, is with the `APFloat` parser, which accepts a huge number of items which are not valid in YAML numeric format. Examples are: `.` `.e+1` `.e+` `.e` Even worse, the parser appears to have bugs. I was able to find several classes of inputs which cause global-buffer-overflow caught by AddressSanitizer (e.g. `.+`). This should be investigated independently. However, the above cases lead me to believe that: The LLVM parser is likely to have a huge number of cases which are invalid in YAML numeric format but are valid `APFloat`s. Finding all of these cases is non-trivial and is probably not rewarding. The parser is unreliable. What do you think?
return true;		if (S.empty() || S.equals("+") || S.equals("-"))
		return false;
		lebedev.ri: Passing-by thought, feel free to ignore. Changes like these are a great targets for fuzzers. Don't just rewrite the implementation, but instead write a new [optimized] function, and add a fuzzer that would feed both of these functions the same input, and assert the equality of their outputs. (and that neither of them crashes). Would preserve the infinitely more readable code version, too.
		ilya-biryukov: Maybe use `S == "+"` instead of `S.equals("+")`? Just a suggestion, feel free to ignore

static const char HexChars[] = "0123456789abcdefABCDEF";		if (S.equals(".nan") || S.equals(".NaN") || S.equals(".NAN"))
if (S.startswith("0x") &&
S.drop_front(2).find_first_not_of(HexChars) == StringRef::npos)
return true;		return true;

static const char DecChars[] = "0123456789";		// Infinity and decimal numbers can be prefixed with sign.
if (S.find_first_not_of(DecChars) == StringRef::npos)		StringRef Tail = (S.front() == '-' || S.front() == '+') ? S.drop_front() : S;
return true;

if (S.equals(".inf") || S.equals(".Inf") || S.equals(".INF"))		// Check for infinity first, because checking for hex and oct numbers is more
		// expensive.
		if (Tail.equals(".inf") || Tail.equals(".Inf") || Tail.equals(".INF"))
return true;		return true;

Regex FloatMatcher("^(\\.[0-9]+|[0-9]+(\\.[0-9]*)?)([eE][-+]?[0-9]+)?$");		// Section 10.3.2 Tag Resolution
		ilya-biryukov: I would argue the previous version of parsing hex and octal chars was more readable. And I'm not sure the new heavily optimized version is more performant too.
if (FloatMatcher.match(S))		// YAML 1.2 Specification prohibits Base 8 and Base 16 numbers prefixed with
		// [-+], so S should be used instead of Tail.
		if (S.startswith("0o"))
		return S.size() > 2 &&
		S.drop_front(2).find_first_not_of("01234567") == StringRef::npos;
		zturner: Doesn't `find_first_not_of` have a starting pos argument? If so we could use that instead of the `drop_front`

		if (S.startswith("0x"))
		return S.size() > 2 &&
		S.drop_front(2).find_first_not_of("0123456789abcdefABCDEF") ==
		StringRef::npos;
		zturner: Same here.

		joerg: Can you use strchr here? I would expect the compiler to fold the string constants into bit tests, creating both more compact and faster code.
		// Parse float: [-+]? (\. [0-9]+ | [0-9]+ (\. [0-9]* )?) ([eE] [-+]? [0-9]+)?
		S = Tail;

		// Handle cases when the number starts with '.' and hence needs at least one
		// digit after dot (as opposed by number which has digits before the dot), but
		// doesn't have one.
		if (S.startswith(".") &&
		(S.equals(".") || (S.size() > 1 && std::strchr("0123456789",
		ilya-biryukov: maybe use `std::isdigit(S[1])` instead?
		zturner: We should use `llvm::isDigit` instead.
		S[1]) == nullptr)))
		return false;

		if (S.startswith("E") || S.startswith("e"))
		return false;

		enum ParseState {
		Default,
		FoundDot,
		FoundExponent,
		};
		ParseState State = Default;

		S = skipDigits(S);

		// Accept decimal integer.
		if (S.empty())
return true;		return true;

		if (S.front() == '.') {
		State = FoundDot;
		S = S.drop_front();
		} else if (S.front() == 'e' || S.front() == 'E') {
		State = FoundExponent;
		ilya-biryukov: NIT: remove braces, remove `else`. LLVM Style guide has a section on it :-)
		S = S.drop_front();
		} else {
return false;		return false;
}		}

		ilya-biryukov: A more structured way to parse a floating numbers would be to: skip digits until we find `.` or exponent (`e`) if we find `.`, skip digits until we find an exponent. if we find an exponent, skip the next symbol if it's `'+'` or `'-'`, then skip digits until the end of the string. having a code structure that mirrors that would make the code more readable.
inline bool isNumeric(StringRef S) {		if (State == FoundDot) {
if ((S.front() == '-' || S.front() == '+') && isNumber(S.drop_front()))		S = skipDigits(S);
		if (S.empty())
return true;		return true;

if (isNumber(S))		if (S.front() == 'e' || S.front() == 'E') {
return true;		State = FoundExponent;
		S = S.drop_front();
		} else {
		return false;
		}
		}

if (S.equals(".nan") || S.equals(".NaN") || S.equals(".NAN"))		assert(FoundExponent && "Should have found exponent at this point.");
return true;		if (S.empty())
		return false;

		if (S.front() == '+' || S.front() == '-') {
		S = S.drop_front();
		if (S.empty())
return false;		return false;
}		}

		return skipDigits(S).empty();
		}

inline bool isNull(StringRef S) {		inline bool isNull(StringRef S) {
return S.equals("null") || S.equals("Null") || S.equals("NULL") ||		return S.equals("null") || S.equals("Null") || S.equals("NULL") ||
S.equals("~");		S.equals("~");
}		}

inline bool isBool(StringRef S) {		inline bool isBool(StringRef S) {
return S.equals("true") || S.equals("True") || S.equals("TRUE") ||		return S.equals("true") || S.equals("True") || S.equals("TRUE") ||
S.equals("false") || S.equals("False") || S.equals("FALSE");		S.equals("false") || S.equals("False") || S.equals("FALSE");
(1,265 lines not shown)

llvm/tools/llvm-yaml-numeric-parser-fuzzer/CMakeLists.txt

This file was added.

				set(LLVM_LINK_COMPONENTS
				Support
				FuzzMutate
				)

				add_llvm_fuzzer(llvm-yaml-numeric-parser-fuzzer
				yaml-numeric-parser-fuzzer.cpp
				DUMMY_MAIN DummyYAMLNumericParserFuzzer.cpp
				)

llvm/tools/llvm-yaml-numeric-parser-fuzzer/DummyYAMLNumericParserFuzzer.cpp

This file was added.

				//===--- DummyYAMLNumericParserFuzzer.cpp ---------------------------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// Implementation of main so we can build and test without linking libFuzzer.
				//
				//===----------------------------------------------------------------------===//

				#include "llvm/FuzzMutate/FuzzerCLI.h"

				extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size);
				int main(int argc, char *argv[]) {
				return llvm::runFuzzerOnInputs(argc, argv, LLVMFuzzerTestOneInput);
				}

llvm/tools/llvm-yaml-numeric-parser-fuzzer/yaml-numeric-parser-fuzzer.cpp

This file was added.

				//===--- special-case-list-fuzzer.cpp - Fuzzer for special case lists -----===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//

				#include "llvm/ADT/StringRef.h"
				#include "llvm/Support/Regex.h"
				#include "llvm/Support/YAMLTraits.h"
				#include <cassert>
				#include <string>

				llvm::Regex Infinity("^[-+]?(\\.inf|\\.Inf|\\.INF)$");
				llvm::Regex Base8("^0o[0-7]+$");
				joerg: Spelling?
				llvm::Regex Base16("^0x[0-9a-fA-F]+$");
				llvm::Regex Float("^[-+]?(\\.[0-9]+|[0-9]+(\\.[0-9]*)?)([eE][-+]?[0-9]+)?$");

				inline bool isNumericRegex(llvm::StringRef S) {

				if (S.equals(".nan") || S.equals(".NaN") || S.equals(".NAN"))
				return true;

				if (Infinity.match(S))
				return true;

				if (Base8.match(S))
				return true;

				if (Base16.match(S))
				return true;

				if (Float.match(S))
				return true;

				return false;
				}

				extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
				std::string Input(reinterpret_cast<const char *>(Data), Size);
				Input.erase(std::remove(Input.begin(), Input.end(), 0), Input.end());
				if (!Input.empty() && llvm::yaml::isNumeric(Input) != isNumericRegex(Input))
				__builtin_trap();
				return 0;
				}

llvm/unittests/Support/YAMLIOTest.cpp

(10 lines not shown)
#include "llvm/ADT/Twine.h"		#include "llvm/ADT/Twine.h"
#include "llvm/Support/Casting.h"		#include "llvm/Support/Casting.h"
#include "llvm/Support/Endian.h"		#include "llvm/Support/Endian.h"
#include "llvm/Support/Format.h"		#include "llvm/Support/Format.h"
#include "llvm/Support/YAMLTraits.h"		#include "llvm/Support/YAMLTraits.h"
#include "gmock/gmock.h"		#include "gmock/gmock.h"
#include "gtest/gtest.h"		#include "gtest/gtest.h"

		using llvm::yaml::Hex16;
		using llvm::yaml::Hex32;
		using llvm::yaml::Hex64;
		using llvm::yaml::Hex8;
using llvm::yaml::Input;		using llvm::yaml::Input;
using llvm::yaml::Output;
using llvm::yaml::IO;		using llvm::yaml::IO;
using llvm::yaml::MappingTraits;		using llvm::yaml::isNumeric;
using llvm::yaml::MappingNormalization;		using llvm::yaml::MappingNormalization;
		using llvm::yaml::MappingTraits;
		using llvm::yaml::Output;
using llvm::yaml::ScalarTraits;		using llvm::yaml::ScalarTraits;
using llvm::yaml::Hex8;
using llvm::yaml::Hex16;
using llvm::yaml::Hex32;
using llvm::yaml::Hex64;
using ::testing::StartsWith;		using ::testing::StartsWith;




static void suppressErrorMessages(const llvm::SMDiagnostic &, void *) {		static void suppressErrorMessages(const llvm::SMDiagnostic &, void *) {
}		}

(2,527 lines not shown)	TEST(YAMLIO, TestEscaped) {
{		{
const unsigned char foobar[10] = {'f', 'o', 'o',		const unsigned char foobar[10] = {'f', 'o', 'o',
0xE2, 0x80, 0x8B, // UTF-8 of U+200B		0xE2, 0x80, 0x8B, // UTF-8 of U+200B
'b', 'a', 'r',		'b', 'a', 'r',
0x0};		0x0};
TestEscaped((char const *)foobar, "\"foo\\u200Bbar\"");		TestEscaped((char const *)foobar, "\"foo\\u200Bbar\"");
}		}
}		}

		TEST(YAMLIO, Numeric) {
		EXPECT_TRUE(isNumeric(".inf"));
		EXPECT_TRUE(isNumeric(".INF"));
		EXPECT_TRUE(isNumeric(".Inf"));
		EXPECT_TRUE(isNumeric("-.inf"));
		EXPECT_TRUE(isNumeric("+.inf"));

		EXPECT_TRUE(isNumeric(".nan"));
		EXPECT_TRUE(isNumeric(".NaN"));
		EXPECT_TRUE(isNumeric(".NAN"));

		EXPECT_TRUE(isNumeric("0"));
		EXPECT_TRUE(isNumeric("0."));
		EXPECT_TRUE(isNumeric("0.0"));
		EXPECT_TRUE(isNumeric("-0.0"));
		EXPECT_TRUE(isNumeric("+0.0"));

		EXPECT_TRUE(isNumeric("12345"));
		EXPECT_TRUE(isNumeric("012345"));
		EXPECT_TRUE(isNumeric("+12.0"));
		EXPECT_TRUE(isNumeric(".5"));
		EXPECT_TRUE(isNumeric("+.5"));
		EXPECT_TRUE(isNumeric("-1.0"));

		EXPECT_TRUE(isNumeric("2.3e4"));
		EXPECT_TRUE(isNumeric("-2E+05"));
		EXPECT_TRUE(isNumeric("+12e03"));
		EXPECT_TRUE(isNumeric("6.8523015e+5"));

		EXPECT_TRUE(isNumeric("1.e+1"));
		EXPECT_TRUE(isNumeric(".0e+1"));

		EXPECT_TRUE(isNumeric("0x2aF3"));
		EXPECT_TRUE(isNumeric("0o01234567"));

		EXPECT_FALSE(isNumeric("not a number"));
		EXPECT_FALSE(isNumeric("."));
		EXPECT_FALSE(isNumeric(".e+1"));
		EXPECT_FALSE(isNumeric(".1e"));
		EXPECT_FALSE(isNumeric(".1e+"));
		EXPECT_FALSE(isNumeric(".1e++1"));

		EXPECT_FALSE(isNumeric("ABCD"));
		EXPECT_FALSE(isNumeric("+0x2AF3"));
		EXPECT_FALSE(isNumeric("-0x2AF3"));
		EXPECT_FALSE(isNumeric("0x2AF3Z"));
		EXPECT_FALSE(isNumeric("0o012345678"));
		EXPECT_FALSE(isNumeric("0xZ"));
		EXPECT_FALSE(isNumeric("-0o012345678"));
		EXPECT_FALSE(isNumeric("000003A8229434B839616A25C16B0291F77A438B"));

		EXPECT_FALSE(isNumeric(""));
		EXPECT_FALSE(isNumeric("."));
		lebedev.ri: Spelling
		EXPECT_FALSE(isNumeric(".e+1"));
		lebedev.ri: spelling
		EXPECT_FALSE(isNumeric(".e+"));
		EXPECT_FALSE(isNumeric(".e"));
		EXPECT_FALSE(isNumeric("e1"));

		// Deprecated formats: as for YAML 1.2 specification, the following are not
		// valid numbers anymore:
		//
		// * Sexagecimal numbers
		// * Decimal numbers with comma s the delimiter
		// * "inf", "nan" without '.' prefix
		EXPECT_FALSE(isNumeric("3:25:45"));
		EXPECT_FALSE(isNumeric("+12,345"));
		EXPECT_FALSE(isNumeric("-inf"));
		EXPECT_FALSE(isNumeric("1,230.15"));
		}