This is an archive of the discontinued LLVM Phabricator instance.

[Support] Optimize (.*) regex matches
ClosedPublic

Authored by nikic on Apr 14 2022, 3:14 AM.

Details

Summary

If capturing groups are used, the regex matcher handles something like (.*)suffix by first doing a maximal match of .*, trying to match suffix afterward, and then reducing the maximal stop position one by one until this finally succeeds. This makes the match quadratic in the length of the line (with large constant factors).

This is particularly problematic because regexes of this form are ubiquitous in FileCheck (something like [[VAR:%.*]] = ... falls in this category), making FileCheck executions much slower than they have any right to be.

This implements a very crude optimization that checks if suffix starts with a fixed character, and steps back to the last occurrence of that character, instead of stepping back by one character at the time. This drops FileCheck time on clang/test/CodeGen/RISCV/rvv-intrinsics/vloxseg_mask.c from 7.3 seconds to 2.7 seconds.

An obvious further improvement would be to check more than one character (once again, this is particularly relevant for FileCheck, because the next character is usually a space, which happens to have many occurrences).

This should help with https://github.com/llvm/llvm-project/issues/54821.

Diff Detail

Event Timeline

nikic created this revision.Apr 14 2022, 3:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 14 2022, 3:14 AM
nikic requested review of this revision.Apr 14 2022, 3:14 AM

I don't think this code has any maintainers, as it is originally taken from OpenBSD, so choice of reviewers is somewhat arbitrary...

Is there some reason we can't use the c++11 regex library for this?

nikic added a comment.Apr 14 2022, 1:09 PM

Is there some reason we can't use the c++11 regex library for this?

std::regex implementations are known to perform very badly for ABI compatibility reasons -- but it's possible that they're still better than this one, so I guess it would be worth trying at least.

Is there some reason we can't use the c++11 regex library for this?

I heard about some rumors that std::regex is really bad, both slow to build and run, is it true? Can we use some well-maintained 3rd party library like CTRE or re2?

xbolva00 added a subscriber: xbolva00.EditedApr 15 2022, 6:09 AM

Yeah, std::regex is very poor.. anything is better than std::regex nowdays.

Can we use some well-maintained 3rd party library like CTRE or re2?

I am not sure about CTRE as there is mainly a one sole developer so.. risks are higher.

re2, pcre or hyperscan sound good.

Is there some reason we can't use the c++11 regex library for this?

std::regex implementations are known to perform very badly for ABI compatibility reasons -- but it's possible that they're still better than this one, so I guess it would be worth trying at least.

Is it worth to spend time to port this matching code to std::regex when you could use much better libs?

aeubanks accepted this revision.Apr 15 2022, 10:07 AM

this looks fine and is better than the status quo so let's land this first then consider alternatives

This revision is now accepted and ready to land.Apr 15 2022, 10:07 AM
nikic closed this revision.Apr 19 2022, 7:08 AM

Landed as https://github.com/llvm/llvm-project/commit/653de14f17215c540a7dc058acb3c29a28ef0f1c, forgot to add differential revision tag.

I also gave std::regex a try, with the following patch: https://gist.github.com/nikic/9608b350a821740402115790fd2a906f This actually had good performance on clang/test/CodeGen/RISCV/rvv-intrinsics/vloxseg_mask.c -- but the multiline support that is needed by FileCheck is apparently only part of C++17 (if compilers implement it at all?) so this doesn't work for our purpose. I tried emulating it, but that would require support for lookbehind assertions, which is also missing.