Page MenuHomePhabricator

[YAML] speed up isNumber by doing regex matching less often
Needs ReviewPublic

Authored by pelikan on Feb 28 2018, 12:39 PM.

Details

Summary

Processing 2 GB XRay traces with "llvm-xray convert -output-format=yaml"
currently takes 1.5+ hours on my machine. When YAML is finally printed,
profiling shows huge amounts of time in YAML's needQuotes and isNumber,
doing regexp matching. That shouldn't be necessary for arbitrary input
and leaving it only when there's a chance of a float appearing makes it
consume only 40 minutes instead. More CPU savings to come.

Diff Detail

Event Timeline

pelikan created this revision.Feb 28 2018, 12:39 PM

Adding Reid and Zach who might be able to give more insight here.

include/llvm/Support/YAMLTraits.h
479–480

Does it make sense to make the Regex object static and const so that we only compile/initialise it once?

pelikan updated this revision to Diff 136453.Feb 28 2018, 6:08 PM

sync

include/llvm/Support/YAMLTraits.h
479–480

I was too lazy to look into Regex implementation to check whether it actually makes sense. I bet the current version is faster though.

dberris added inline comments.Feb 28 2018, 6:23 PM
include/llvm/Support/YAMLTraits.h
479–480

Why would this be faster than having it compiled once, and only the first time it's needed?

I suspect if you do that, instead of having to do a linear search first, would be much faster if it's possible to re-use a Regex object, compile it once, and re-use for all the times it's needed.

zturner added inline comments.Mar 1 2018, 10:20 AM
include/llvm/Support/YAMLTraits.h
479–480

Agree, this will make the case where it's not a number faster, but that's the uncommon case. The case where it *is* a number will become slower. I agree with Dean's suggestion of compiling the regex only once, but we can still do an early out if it's obviously not a number. However, instead of checking the entire string, how about just checking the first character?

labath added a subscriber: labath.Mar 1 2018, 10:27 AM

That regex is relatively simple. If we're worried about performance, maybe we should just roll it out into a series of find_first_(not_)of/consume_front/... statements