This is an archive of the discontinued LLVM Phabricator instance.

[Support] Rewrite GlobPattern
ClosedPublic

Authored by MaskRay on Jul 23 2023, 12:21 AM.

Details

Summary

The current implementation has two primary issues:

  • Matching a*a*a*b against aaaaaa has exponential complexity.
  • BitVector harms data cache and is inefficient for literal matching.

and a minor issue that \ at the end may cause an out of bounds access
in StringRef::operator[].

Switch to an O(|S|*|P|) greedy algorithm instead: factor the pattern
into segments split by '*'. The segment is matched sequentianlly by
finding the first occurrence past the end of the previous match. This
algorithm is used by lots of fnmatch implementations, including musl and
NetBSD's.

In addition, optional<StringRef> Exact, Suffix, Prefix wastes space.
Instead, match the non-metacharacter prefix against the haystack, then
match the pattern with the rest.

In practice *suffix style patterns are less common and our new
algorithm is fast enough, so don't bother storing the non-metacharacter
suffix.

Note: brace expansions (D153587) can leverage the matchOne function.

Diff Detail

Event Timeline

MaskRay created this revision.Jul 23 2023, 12:21 AM
MaskRay requested review of this revision.Jul 23 2023, 12:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 23 2023, 12:21 AM
ellis added inline comments.Jul 24 2023, 12:38 PM
llvm/lib/Support/GlobPattern.cpp
64–65

Will this fail for the pattern abc\\ (ending with an escaped backslash)?

70

From line 77:

// ']' is allowed as the first character of a character class. '[]' is
// invalid. So, just skip the first character.
size_t End = S.find(']', 2);

I would expect this to break the BracketFrontOfCharacterClass test

72

Can we add something like expected ']' to this error message?

110–111

Since you are storing P, PEnd and S, End anyway, why not store these as StringRefs so there are fewer variables to keep track of?

145

To me it's confusing that Star would hold the index after the *. Could we make this P = Star + 1 and make line 114 Star = P++ (or separate it into two lines)?

MaskRay updated this revision to Diff 543690.Jul 24 2023, 2:01 PM
MaskRay marked 5 inline comments as done.

fix []] bug
address comments
add more tests (when tests are in a good shape, they should be committed separately)

MaskRay added inline comments.Jul 24 2023, 2:06 PM
llvm/lib/Support/GlobPattern.cpp
64–65

Thanks for catching this. We need to handle this when scanning the pattern.

72

I'd like to do that, but 6 lld/test/ELF tests rely on the particular error message. The change should probably be made in the future...

I think we should drop + S from GlobPattern.cpp diagnostics. The clients can add + S as needed.

110–111

Below we are going to increment various variables, mainly P, S, and B.
If we save P, PEnd as a StringRef, the increment and backtracking will be more cumbersome...

145

Thank's for the question. I renamed Star to SegmentBegin, hopefully it is clearer.

Using P = SegmentBegin; will be slightly more efficient than P = LastStar + 1; as backtracking has one less variable to increment...

ellis accepted this revision.Jul 25 2023, 12:02 PM

LGTM thanks for cleaning this up!

This revision is now accepted and ready to land.Jul 25 2023, 12:02 PM
MaskRay updated this revision to Diff 544170.Jul 25 2023, 6:36 PM
MaskRay edited the summary of this revision. (Show Details)

rebase
fix [*] matching ]

MaskRay updated this revision to Diff 544171.Jul 25 2023, 6:40 PM

fix IWYU issue

This revision was landed with ongoing or failed builds.Jul 25 2023, 6:47 PM
This revision was automatically updated to reflect the committed changes.