This is an archive of the discontinued LLVM Phabricator instance.

[GlobPattern] Support brace expansions
ClosedPublic

Authored by ellis on Jun 22 2023, 1:22 PM.

Details

Summary

Extend GlobPattern to support brace expansions, e.g., foo.{c,cpp} as discussed in https://reviews.llvm.org/D152762#4425203.

The high level change was to turn Tokens into a list that gets larger when we see a new brace expansion term. Then in GlobPattern::match() we must check against each token group.

This is a breaking change since { will no longer match a literal without escaping. However, \{ will match the literal { before and after this change. Also, from a brief survey of LLVM, it seems that GlobPattern is mostly used for symbol and path matching, which likely won't need { in their patterns.

See https://github.com/devongovett/glob-match#syntax for a nice glob reference.

Diff Detail

Event Timeline

ellis created this revision.Jun 22 2023, 1:22 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 22 2023, 1:22 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
ellis published this revision for review.Jun 28 2023, 4:37 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 28 2023, 4:37 PM
MaskRay added a comment.EditedJun 28 2023, 11:30 PM

There are some interesting test cases like {a,b}{1,2}. If we don't intend to support

% echo {ab\,d}
, a b d

we can probably report an error, but I suspect supporting it is easier...

LGTM, although this logic is already pretty obscure, so it's hard to follow. I wonder if we have fuzzers hooked up to this so we don't have to worry too much if this misses an edge case.

llvm/unittests/Support/GlobPatternTest.cpp
173–220

nit: const auto& should work here?

ellis added a comment.Jul 20 2023, 3:45 PM

There are some interesting test cases like {a,b}{1,2}. If we don't intend to support

% echo {ab\,d}
, a b d

we can probably report an error, but I suspect supporting it is easier...

I've added the test case {a,b}{1,2} and it works fine. I don't understand the test case {ab\,d}. That looks like a brace expansion with one term (ab,d). If you meant to use a character class, then I also added that test ([*?^{},]).

ellis added a comment.Jul 20 2023, 3:47 PM

LGTM, although this logic is already pretty obscure, so it's hard to follow. I wonder if we have fuzzers hooked up to this so we don't have to worry too much if this misses an edge case.

I'm not sure how we would add fuzzers to this unit test. I could generate random strings and use that to create patterns and matches, but that won't catch correctness issues, just segfaults.

ellis updated this revision to Diff 542712.Jul 20 2023, 3:53 PM

Add test cases

ellis updated this revision to Diff 542727.Jul 20 2023, 5:23 PM

Add more test cases

{a,b}{a,b}{a,b}{a,b}... expands to an exponential number of patterns. This can be a vulnerability. Many glob(3) implementations use a similar approach under an opt-in flag GLOB_BRACE. I think we need a similar feature flag and perhaps special case SanitizerSpecialCaseList (in a separate patch) to set the feature flag.

ellis added a comment.Aug 8 2023, 5:03 PM

{a,b}{a,b}{a,b}{a,b}... expands to an exponential number of patterns. This can be a vulnerability. Many glob(3) implementations use a similar approach under an opt-in flag GLOB_BRACE. I think we need a similar feature flag and perhaps special case SanitizerSpecialCaseList (in a separate patch) to set the feature flag.

Would it be reasonable to restrict the number of expanded patterns to say 2^10? This would allow for 10 braces with 2 terms each or 2 braces with 10 terms each, which should be plenty in practice.

{a,b}{a,b}{a,b}{a,b}... expands to an exponential number of patterns. This can be a vulnerability. Many glob(3) implementations use a similar approach under an opt-in flag GLOB_BRACE. I think we need a similar feature flag and perhaps special case SanitizerSpecialCaseList (in a separate patch) to set the feature flag.

Would it be reasonable to restrict the number of expanded patterns to say 2^10? This would allow for 10 braces with 2 terms each or 2 braces with 10 terms each, which should be plenty in practice.

I think existing GlobPattern users are pretty traditional and brace expansion might be a surprise for some.

Some features like SanitizerSpecialCaseList adopting brace expansion is totally fine. They will benefit from a limit (which many glob(3) implementations have as well).
10 seems reasonable. A lower number likely works, too.

ellis updated this revision to Diff 548692.Aug 9 2023, 11:20 AM

Add a limit for the number of subpatterns created by brace expansions

ellis added a comment.Aug 28 2023, 2:09 PM

Friendly ping for this and the stacked diff. I recently encountered another bug where I forgot to escape the $ character in a special case list which wouldn't be a problem if we were using globs.

I think we likely need to make brace expansion an opt-in feature (similar to GLOB_BRACE for glob). *BSD glob.h defines some arbitrary limits to prevent exponential time causing problems.
Using brace expansion in sanitizer ignore lists is totally fine and I fully support it, but other places may not want the feature with problematic time complexity (assuming that we don't implement NFA or use regex).

ellis added a comment.Aug 28 2023, 2:44 PM

I think we likely need to make brace expansion an opt-in feature (similar to GLOB_BRACE for glob). *BSD glob.h defines some arbitrary limits to prevent exponential time causing problems.
Using brace expansion in sanitizer ignore lists is totally fine and I fully support it, but other places may not want the feature with problematic time complexity (assuming that we don't implement NFA or use regex).

That makes sense. I think I can implement this by making MaxSubPatterns=1 to mean that globs are disabled and have that be the default. Or did you mean we should add a compile-time variable to enable/disable globs similar to LLVM_ENABLE_THREADS?

I think we likely need to make brace expansion an opt-in feature (similar to GLOB_BRACE for glob). *BSD glob.h defines some arbitrary limits to prevent exponential time causing problems.
Using brace expansion in sanitizer ignore lists is totally fine and I fully support it, but other places may not want the feature with problematic time complexity (assuming that we don't implement NFA or use regex).

That makes sense. I think I can implement this by making MaxSubPatterns=1 to mean that globs are disabled and have that be the default. Or did you mean we should add a compile-time variable to enable/disable globs similar to LLVM_ENABLE_THREADS?

MaxSubPatterns=1 makes { a metacharacter so that it cannot be used in regular pattern. The parameter can probably be an optional<size_t>.

I think we don't want to introduced configure-time variables for this brace expansion...

ellis updated this revision to Diff 554085.Aug 28 2023, 3:55 PM

Use MaxSubPatterns={} to mean brace expansion is disabled

MaskRay accepted this revision.Aug 29 2023, 3:38 PM

LG! with some issues fixed

llvm/include/llvm/Support/GlobPattern.h
85

Perhaps SmallVector<, 0> to save memory. For the cases we care about performance, the pattern is typically longer than the max inlined element size in libc++/libstdc++ (15 or 23).

llvm/lib/Support/GlobPattern.cpp
75

unmatched [ with a present MaxSubPatterns will cause a crash.

111

a > std::numeric_limits<size_t>::max() / b; to catch multiplication overflow? Compilers can optimize out this division.

https://godbolt.org/z/4PxGT7ad6

llvm/unittests/Support/GlobPatternTest.cpp
56

For new tests, use ASSERT_TRUE((bool)Pat3);
If Pat3 is false, then Pat3->... below would crash.

This revision is now accepted and ready to land.Aug 29 2023, 3:38 PM
ellis updated this revision to Diff 554524.Aug 29 2023, 4:53 PM

Resolve comments

llvm/include/llvm/Support/GlobPattern.h
85

I'm curious, why not leave out the 0 and just have SmallVector<char> Pat?

MaskRay accepted this revision.Aug 29 2023, 6:33 PM
MaskRay added inline comments.
llvm/include/llvm/Support/GlobPattern.h
85

I consider it unfortunate... Omitting the argument has been reserved to select an inline size to have about 64 bytes IIUC.

This revision was landed with ongoing or failed builds.Aug 30 2023, 8:30 AM
This revision was automatically updated to reflect the committed changes.
srj added a subscriber: srj.EditedSep 12 2023, 3:48 PM

This has injected a build break in an older-but-still-supported version of gcc (at least, I think it's supported): arm-linux-gnueabihf-gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0, in which we get failures on lines 62 and 132 of the form

`GlobPattern.cpp: In function ‘llvm::Expected<llvm::SmallVector<std::__cxx11::basic_string<char>, 1> > parseBraceExpansions(llvm::StringRef, std::optional<unsigned int>)’:
GlobPattern.cpp:62:12: error: could not convert ‘SubPatterns’ from ‘llvm::SmallVector<std::__cxx11::basic_string<char> >’ to ‘llvm::Expected<llvm::SmallVector<std::__cxx11::basic_string<char>, 1> >’`

In both cases, replacing the return SubPatterns; with return std::move(SubPatterns); heals the misbehavior. Could we please get a patch to update this to unbreak our 32-bit builds?

(Yes, gcc7.5 is probably getting a bit long in the tooth here, and we're looking to upgrade; unfortunately, we are temporarily hamstrung by some ancient build infrastructure that is essentially incapable of being easily upgraded to newer builds, and it will take us a little time to figure out good replacements, unfortunately.)

UPDATE: I just realized that pull requests from GitHub are now being accepted, so I went and opened one myself to fix this: https://github.com/llvm/llvm-project/pull/66155

ellis added a comment.Sep 12 2023, 4:57 PM

This has injected a build break in an older-but-still-supported version of gcc (at least, I think it's supported): arm-linux-gnueabihf-gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0, in which we get failures on lines 62 and 132 of the form

`GlobPattern.cpp: In function ‘llvm::Expected<llvm::SmallVector<std::__cxx11::basic_string<char>, 1> > parseBraceExpansions(llvm::StringRef, std::optional<unsigned int>)’:
GlobPattern.cpp:62:12: error: could not convert ‘SubPatterns’ from ‘llvm::SmallVector<std::__cxx11::basic_string<char> >’ to ‘llvm::Expected<llvm::SmallVector<std::__cxx11::basic_string<char>, 1> >’`

In both cases, replacing the return SubPatterns; with return std::move(SubPatterns); heals the misbehavior. Could we please get a patch to update this to unbreak our 32-bit builds?

(Yes, gcc7.5 is probably getting a bit long in the tooth here, and we're looking to upgrade; unfortunately, we are temporarily hamstrung by some ancient build infrastructure that is essentially incapable of being easily upgraded to newer builds, and it will take us a little time to figure out good replacements, unfortunately.)

UPDATE: I just realized that pull requests from GitHub are now being accepted, so I went and opened one myself to fix this: https://github.com/llvm/llvm-project/pull/66155

Thanks for letting me know. I've just pushed a fix at https://github.com/llvm/llvm-project/commit/dfd0cd1cc182789d351953d2c40b8ccd9fdec34f so let me know if it starts building or not

srj added a comment.Sep 12 2023, 5:02 PM

Thanks for letting me know. I've just pushed a fix at https://github.com/llvm/llvm-project/commit/dfd0cd1cc182789d351953d2c40b8ccd9fdec34f so let me know if it starts building or not

Thanks! Testing now

srj added a comment.Sep 12 2023, 5:16 PM

Thanks for letting me know. I've just pushed a fix at https://github.com/llvm/llvm-project/commit/dfd0cd1cc182789d351953d2c40b8ccd9fdec34f so let me know if it starts building or not

Thanks! Testing now

Seems to have fixed it -- thanks!