This is an archive of the discontinued LLVM Phabricator instance.

[SpecialCaseList] Add option to use Globs instead of Regex to match patterns
ClosedPublic

Authored by ellis on Jun 28 2023, 3:43 PM.

Details

Summary

Add an option in SpecialCaseList to use Globs instead of Regex to match patterns. GlobPattern was extended in https://reviews.llvm.org/D153587 to support brace expansions which allows us to use patterns like */src/foo.{c,cpp}. It turns out that most patterns only take advantage of * so using Regex was overkill and required lots of escaping in practice. This often led to bugs due to forgetting to escape special characters.

Since this would be a breaking change, we temporarily support Regex by default and use Globs when #!special-case-list-v2 is the first line in the file. Users should switch to the glob format described in https://llvm.org/doxygen/classllvm_1_1GlobPattern.html. For example, (abc|def) should become {abc,def}.

See discussion in https://reviews.llvm.org/D152762 and https://discourse.llvm.org/t/use-glob-instead-of-regex-for-specialcaselists/71666.

Diff Detail

Event Timeline

ellis created this revision.Jun 28 2023, 3:43 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 28 2023, 3:43 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
ellis edited the summary of this revision. (Show Details)Jun 28 2023, 4:29 PM
ellis updated this revision to Diff 535560.Jun 28 2023, 4:33 PM

Fix ProfileList

ellis published this revision for review.Jun 28 2023, 4:37 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJun 28 2023, 4:37 PM
ellis edited the summary of this revision. (Show Details)Jun 28 2023, 5:02 PM

This is a breaking change since some SCLs might use .* or (abc|def) which are supported regexes but not valid globs. Since we have just cut clang 16.x this is a good time to make this change.

My user has some ignore lists, but there is no ^[a-z]+:.*\( or ^[a-z]+:\. occurrence, so this change is likely safe for us.

This is a breaking change since some SCLs might use .* or (abc|def) which are supported regexes but not valid globs. Since we have just cut clang 16.x this is a good time to make this change.

My user has some ignore lists, but there is no ^[a-z]+:.*\( or ^[a-z]+:\. occurrence, so this change is likely safe for us.

I think I'm looking at the same lists, and I see plenty of .* in the sanitizer exclusion lists. Also a few cases of (abc|def|...). IIUC, those would both be broken by this -- instead of .* meaning "any character any number of times" (regex) it would mean "dot followed by any number of characters" (glob), right? And the (abc|...) would just be that literal text, not matching the individual parts?

If my understanding of that is correct, I don't think this is a good change -- there's possibly plenty of configs out there that assume this is regex, and there doesn't seem to be sufficient motivation to just break those. But I can see that globs are useful for the examples posted in the patch description. Is it possible to have some middle ground, e.g. default to regex but allow a config at the top of sanitizer lists to interpret future patterns as globs instead?

This is a breaking change since some SCLs might use .* or (abc|def) which are supported regexes but not valid globs. Since we have just cut clang 16.x this is a good time to make this change.

My user has some ignore lists, but there is no ^[a-z]+:.*\( or ^[a-z]+:\. occurrence, so this change is likely safe for us.

I think I'm looking at the same lists, and I see plenty of .* in the sanitizer exclusion lists. Also a few cases of (abc|def|...). IIUC, those would both be broken by this -- instead of .* meaning "any character any number of times" (regex) it would mean "dot followed by any number of characters" (glob), right? And the (abc|...) would just be that literal text, not matching the individual parts?

If my understanding of that is correct, I don't think this is a good change -- there's possibly plenty of configs out there that assume this is regex, and there doesn't seem to be sufficient motivation to just break those. But I can see that globs are useful for the examples posted in the patch description. Is it possible to have some middle ground, e.g. default to regex but allow a config at the top of sanitizer lists to interpret future patterns as globs instead?

If a sanitizer list currently has .* it will be replaced with ..* here meaning that it matches one or more characters, so those lists may already be broken. This is exactly the confusion we would like to avoid. And you are correct that (abc|def) will be treated as a literal with this change, the but fix to {abc,def} is trivial.

That's a nice idea to add a config at the top to specify glob or regex to give users more time to migrate their lists.

ellis updated this revision to Diff 536003.Jun 29 2023, 2:28 PM

Fix memory bug and add TODO about a regex/glob config. Lets wait for others to comment before I put work into this.

MaskRay added a comment.EditedJun 30 2023, 12:13 AM

This is a breaking change since some SCLs might use .* or (abc|def) which are supported regexes but not valid globs. Since we have just cut clang 16.x this is a good time to make this change.

My user has some ignore lists, but there is no ^[a-z]+:.*\( or ^[a-z]+:\. occurrence, so this change is likely safe for us.

I think I'm looking at the same lists, and I see plenty of .* in the sanitizer exclusion lists. Also a few cases of (abc|def|...). IIUC, those would both be broken by this -- instead of .* meaning "any character any number of times" (regex) it would mean "dot followed by any number of characters" (glob), right? And the (abc|...) would just be that literal text, not matching the individual parts?

Sorry that my regex use was incorrect. We do have some src:x/a.pb.* style rules, which are technically misused. src:x/a.pb.* can match possibly unintended files like x/axpbx.
To match just glob x/a.pb.* files, the rule should be written as src:x/a\.pb\.* (* instead of .*), but I doubt anyone writes rules using \..

If my understanding of that is correct, I don't think this is a good change -- there's possibly plenty of configs out there that assume this is regex, and there doesn't seem to be sufficient motivation to just break those. But I can see that globs are useful for the examples posted in the patch description. Is it possible to have some middle ground, e.g. default to regex but allow a config at the top of sanitizer lists to interpret future patterns as globs instead?

I think the main breakages are:

  • patterns using ( and ). Our users don't have such patterns.
  • patterns using \. to mean a literal ., e.g. src:a\.c. Our users don't use \. Such a pattern needs to changed to src:a.c (which used to match other files like axc)
  • patterns like src:a.pb.* that match other files like src:axpbx.c. This pattern can be changed to glob src:a?pb?*.

I think all of these are likely uncommon. While \. is technically the correct way to match a literal . before this patch, most people likely just use ..

In summary I think this patch is moving toward the right direction and making the patterns less error-prone.

If we want to support both regular expressions and glob patterns permanently, then a solution like #!regex or #!glob is likely the way to go.

If we want to allow soft-transition, then we could do something like #!special-case-list-v2, where v1 would support regular expressions and v2 glob patterns. We can then give a deprecation warning that v1 is going to be removed in the next Clang release and later remove it.

ellis added a comment.Jun 30 2023, 8:16 AM

This is a breaking change since some SCLs might use .* or (abc|def) which are supported regexes but not valid globs. Since we have just cut clang 16.x this is a good time to make this change.

My user has some ignore lists, but there is no ^[a-z]+:.*\( or ^[a-z]+:\. occurrence, so this change is likely safe for us.

I think I'm looking at the same lists, and I see plenty of .* in the sanitizer exclusion lists. Also a few cases of (abc|def|...). IIUC, those would both be broken by this -- instead of .* meaning "any character any number of times" (regex) it would mean "dot followed by any number of characters" (glob), right? And the (abc|...) would just be that literal text, not matching the individual parts?

Sorry that my regex use was incorrect. We do have some src:x/a.pb.* style rules, which are technically misused. src:x/a.pb.* can match possibly unintended files like x/axpbx.
To match just glob x/a.pb.* files, the rule should be written as src:x/a\.pb\.* (* instead of .*), but I doubt anyone writes rules using \..

If my understanding of that is correct, I don't think this is a good change -- there's possibly plenty of configs out there that assume this is regex, and there doesn't seem to be sufficient motivation to just break those. But I can see that globs are useful for the examples posted in the patch description. Is it possible to have some middle ground, e.g. default to regex but allow a config at the top of sanitizer lists to interpret future patterns as globs instead?

I think the main breakages are:

  • patterns using ( and ). Our users don't have such patterns.
  • patterns using \. to mean a literal ., e.g. src:a\.c. Our users don't use \. Such a pattern needs to changed to src:a.c (which used to match other files like axc)
  • patterns like src:a.pb.* that match other files like src:axpbx.c. This pattern can be changed to glob src:a?pb?*.

I think all of these are likely uncommon. While \. is technically the correct way to match a literal . before this patch, most people likely just use ..

In summary I think this patch is moving toward the right direction and making the patterns less error-prone.

Actually, the code will escape any character after \, not just special characters, so that a\\ will match a\, a\. will match a., and a\a will match aa (I'll have to make sure there are test cases for this). You are right that . is no longer a special character, which I believe is a major improvement.

ellis updated this revision to Diff 542732.Jul 20 2023, 6:08 PM

If #!special-case-list-v2 is the first line in the special case list, then we will use globs to match patterns. Otherwise, we fall back to the original behavior of using regexes to match patterns. Once this feature is stable, and after a version cut, we can remove the regex case.

MaskRay accepted this revision.Aug 30 2023, 11:04 PM

[SpecialCaseList] Use Globs instead of Regex

Since Glob is enabled under a non-default header #!special-case-list-v2, this subject should be changed.

This revision is now accepted and ready to land.Aug 30 2023, 11:04 PM

This is a breaking change since some SCLs might use .* or (abc|def) which are supported regexes but not valid globs. Since we have just cut clang 16.x this is a good time to make this change.

This should be updated, too. 16.x => 17.x

ellis retitled this revision from [SpecialCaseList] Use Globs instead of Regex to [SpecialCaseList] Add option to use Globs instead of Regex to match patterns.Sep 1 2023, 8:47 AM
ellis edited the summary of this revision. (Show Details)
This revision was landed with ongoing or failed builds.Sep 1 2023, 9:06 AM
This revision was automatically updated to reflect the committed changes.