This patch adds new directives CHECK-WORD, CHECK-WORD-NEXT, CHECK-WORD-NOT, CHECK-WORD-SAME, CHECK-WORD-DAG, which are like standart directives, but they match only when string is a separate word.
Details
Diff Detail
Event Timeline
I don't really like adding a whole new set of CHECK-things for word-matching.
I think either users should use "\b" to check that the edge of the match is at the beginning/end of a word, or else make word matching the default and add \w* to the CHECKs that actually *WANT* to match subword. I'd expect there to be few which really ought to be doing that, but probably more that do accidentally.
"I think either users should use "\b" to check that the edge of the match is at the beginning/end of a word" - Problem is that current regexp library doesn't support \b assertion. And question about changing library appears again.
I think, if change the meaning of CHECK there will be a lot of work of changing LLVM tests and private tests of FileCheck users.
+1, I'd rather not have to go fix up tests which rely on the sub-word matching behavior of CHECK.
@eklepilkina: Thanks for working on this! There are a few spots in this patch that need clang-formatting. Also, please upload diffs with more context (e.g git diff -U1000 ...). I just have a few minor comments --
| test/FileCheck/check-word.txt | ||
|---|---|---|
| 16 | Could you add: // RUN: not FileCheck -input-file %s %s -check-prefix=FAILTEST first and // FAILTEST-WORD: first // FAILTEST-WORD-SAME: and The failtest should probably be extended to cover more cases. | |
| utils/FileCheck/FileCheck.cpp | ||
| 1149 | nit, capitalize 'if' and add a space before it. | |
| 1157 | Why do Match{Pos,Len} need to be passed in by reference? | |
| 1165 | Does this guarantee that there cannot be a newline between CHECK-SAME-WORD matches? | |
Well, the regex implementation does live in the Support library, even if it did come from somewhere else originally; we can modify it. A word-break feature seems useful in more contexts than just CHECK-WORD.
I think, if change the meaning of CHECK there will be a lot of work of changing LLVM tests and private tests of FileCheck users.
We won't know until somebody tries. It should not be too hard to do the experiment, I've done that kind of thing twice already.
| utils/FileCheck/FileCheck.cpp | ||
|---|---|---|
| 1165 | Newline between CHECK-SAME-WORD matches  checks separately as usual CHECK-SAME.  | |
Ok, I can do such experiment, but this experiment will be only with LLVM and clang tests. May be there are a few tests. But how can I guarantee that users have no a lot of such tests in private repositories?
There is a suggestion to add new option, which will change standart behaviour to word-matchinf instead of addding new set of CHECKs directives. What about this idea?
I am in support of such a switch. As already mentioned that once I got bitten that
; CHECK: br
to check for a branch instruction also matched unrelated lines eg.
%2 = add i32 %subregion, i32 5
Such an option would lower the probability of such mismatches. I'd prefer this mode for all my tests but introducing an opt-in switch for compatibility is a good idea.
I made new mode and made experiment turned it on for all LLVM tests.
Result:
Expected Passes : 15810 Expected Failures : 125 Unsupported Tests : 195 Unexpected Passes : 4 Unexpected Failures: 1128
In my opinion, there is quite big number of failed tests with new behavior. So doing this word-match behavior standart won't be simple task.
I don't think making this the default behavior is a good idea. This may introduce hard-to-find bugs to every testcase (because they were written written with the old behavior in mind).
For example, how would we even find problems with CHECK-NOT; the new behavior will make the expressions stricter, but we wouldn't even realize it because we're expecting no match. The only way this could be made the default behavior would be to change all existing CHECKs to explicitly use the old behavior. This would still break every out-of-tree testcase, but that is admittedly a lesser concern.
Do I understand correctly, that this patch makes writing single-word check lines easier? Why not add more words to check lines to make them more strict? BTW, does FileCheck support {{\b}}?
Previous comments answer the last question. So is it feasible to add support for \b to LLVM regexps?
This patch adds
// CHECK-WORD: // CHECK-WORD-NEXT: // CHECK-WORD-SAME: // CHECK-WORD-DAG: // CHECK-WORD-NOT:
What words do you suggest to add?
So is it feasible to add support for \b to LLVM regexps?
The regexp library is very old. I asked several times about changing it. But I haven't got any answer. There are a lot of other unsupported features of regexs in this library and which can be useful.
I think it would be worthwhile to bring this up again. What are our options with the regex library? Is there a newer version of the one we are currently using that we could upgrade to (and that supports \b)? How much work would it be to implement support for \b ourselves?
What are our options with the regex library? Is there a newer version of the one we are currently using that we could upgrade to (and that supports \b)?
As I understood used regexp library is version of OpenBSD library. I couldn't find some new versions.
I don't know a lot about modern C++ regexp library, but there are:
- PCRE(pcre.h)
- There is regex library in boost. But boost is too big
- Also there is regex library in poco.
May be there are people who has some experience with C++ regexs libraries.
How much work would it be to implement support for \b ourselves?
I can't give exact estimate. I inly had a look on its source code. But I think that it's wrong to add features support. There are a lot of other useful feautures, for example complex assertions, conditional subexpressions and etc., that can be used. Then if somebody want them he should implement them by himself.
FWIW, there's also re2 (https://github.com/google/re2).
May be there are people who has some experience with C++ regexs libraries.
How much work would it be to implement support for \b ourselves?
I can't give exact estimate. I inly had a look on its source code. But I think that it's wrong to add features support. There are a lot of other useful feautures, for example complex assertions, conditional subexpressions and etc., that can be used. Then if somebody want them he should implement them by himself.
I'm suggesting to make patterns used in CHECK-lines stricter by adding more text to them instead of allowing to match on word boundaries using the new CHECK-directives. E.g. the line you want to check is
op1;
So instead of using CHECK-WORD: op1, you could make the pattern stricter by adding more context: CHECK: {{^}}op1;{{$}}. Do you have specific examples of output and patterns, where this solution is undesired for some reason?
Overall, I think the FileCheck utility is already confusing enough and many developers attribute some magic properties to it (that it can somehow implicitly connect position of a CHECK line to the position of the relevant line in the input, etc.). Adding more features like CHECK-WORD only makes this worse.
So is it feasible to add support for \b to LLVM regexps?
The regexp library is very old. I asked several times about changing it. But I haven't got any answer. There are a lot of other unsupported features of regexs in this library and which can be useful.
Most of the time it is not desirable to add the remainder of the line to be checked as well. It could be very long, or just contain irrelevant information (such as the branch sources following basic block names)
As I understood most of users prefer to change library to have support of \b assert. But, I think, this decision is quite important and can be made only by he contributor of FileCheck. I don't know how this suggestion can be done. Or I can try to change library and create a patch with this changes?
I think the two most compelling options are:
- Implement support for \b in llvm's regex library.
- Go with Duncan's suggestion and investigate whether the C++11 regex library can replace LLVM's posix regex library entirely so we can drop it from LLVM.
There is problem with STL, their realization of regular expressions doesn't match start and end of line in basic mode and there is no multiline mode. There is the issue 2343 in http://cplusplus.github.io/LWG/lwg-toc.html.
There is opportunity to split text by lines and match each line separately, but I think this is a hack, and I think it'll slow FileCheck.
FWIW, there's also re2 (https://github.com/google/re2).
May be there are people who has some experience with C++ regexs libraries.
How much work would it be to implement support for \b ourselves?
I can't give exact estimate. I inly had a look on its source code. But I think that it's wrong to add features support. There are a lot of other useful feautures, for example complex assertions, conditional subexpressions and etc., that can be used. Then if somebody want them he should implement them by himself.
I'm suggesting to make patterns used in CHECK-lines stricter by adding more text to them instead of allowing to match on word boundaries using the new CHECK-directives. E.g. the line you want to check is
op1;
So instead of using CHECK-WORD: op1, you could make the pattern stricter by adding more context: CHECK: {{^}}op1;{{$}}. Do you have specific examples of output and patterns, where this solution is undesired for some reason?
Overall, I think the FileCheck utility is already confusing enough and many developers attribute some magic properties to it (that it can somehow implicitly connect position of a CHECK line to the position of the relevant line in the input, etc.). Adding more features like CHECK-WORD only makes this worse.
So is it feasible to add support for \b to LLVM regexps?
The regexp library is very old. I asked several times about changing it. But I haven't got any answer. There are a lot of other unsupported features of regexs in this library and which can be useful.
Not sure, whether adding \b support to the regex library used in LLVM is more effort, but there's another option here: take the lib++s regex header with the multiline option added (as per http://cplusplus.github.io/LWG/lwg-active.html#2503) and use a private copy of it (with proper namespace modifications and whatever else is needed to make it actually private. WDYT?
Could you add:
The failtest should probably be extended to cover more cases.