This is an archive of the discontinued LLVM Phabricator instance.

FileCheck Enhancement - CHECK-WORD
Needs RevisionPublic

Authored by eklepilkina on Jul 14 2016, 6:52 AM.

Details

Summary

This patch adds new directives CHECK-WORD, CHECK-WORD-NEXT, CHECK-WORD-NOT, CHECK-WORD-SAME, CHECK-WORD-DAG, which are like standart directives, but they match only when string is a separate word.

Diff Detail

Event Timeline

eklepilkina retitled this revision from to FileCheck Enhancement - CHECK-WORD.
eklepilkina updated this object.
eklepilkina changed the edit policy from "All Users" to "Administrators".
eklepilkina added a subscriber: vsk.
jyknight edited edge metadata.Jul 14 2016, 3:25 PM

I don't really like adding a whole new set of CHECK-things for word-matching.

I think either users should use "\b" to check that the edge of the match is at the beginning/end of a word, or else make word matching the default and add \w* to the CHECKs that actually *WANT* to match subword. I'd expect there to be few which really ought to be doing that, but probably more that do accidentally.

"I think either users should use "\b" to check that the edge of the match is at the beginning/end of a word" - Problem is that current regexp library doesn't support \b assertion. And question about changing library appears again.
I think, if change the meaning of CHECK there will be a lot of work of changing LLVM tests and private tests of FileCheck users.

vsk added a comment.Jul 15 2016, 8:47 AM

"I think either users should use "\b" to check that the edge of the match is at the beginning/end of a word" - Problem is that current regexp library doesn't support \b assertion. And question about changing library appears again.
I think, if change the meaning of CHECK there will be a lot of work of changing LLVM tests and private tests of FileCheck users.

+1, I'd rather not have to go fix up tests which rely on the sub-word matching behavior of CHECK.

@eklepilkina: Thanks for working on this! There are a few spots in this patch that need clang-formatting. Also, please upload diffs with more context (e.g git diff -U1000 ...). I just have a few minor comments --

test/FileCheck/check-word.txt
16

Could you add:

// RUN: not FileCheck -input-file %s %s -check-prefix=FAILTEST
first
and
// FAILTEST-WORD: first
// FAILTEST-WORD-SAME: and

The failtest should probably be extended to cover more cases.

utils/FileCheck/FileCheck.cpp
1149

nit, capitalize 'if' and add a space before it.

1157

Why do Match{Pos,Len} need to be passed in by reference?

1165

Does this guarantee that there cannot be a newline between CHECK-SAME-WORD matches?

probinson edited edge metadata.Jul 15 2016, 9:06 AM

"I think either users should use "\b" to check that the edge of the match is at the beginning/end of a word" - Problem is that current regexp library doesn't support \b assertion. And question about changing library appears again.

Well, the regex implementation does live in the Support library, even if it did come from somewhere else originally; we can modify it. A word-break feature seems useful in more contexts than just CHECK-WORD.

I think, if change the meaning of CHECK there will be a lot of work of changing LLVM tests and private tests of FileCheck users.

We won't know until somebody tries. It should not be too hard to do the experiment, I've done that kind of thing twice already.

eklepilkina marked 4 inline comments as done.Jul 18 2016, 11:42 PM
eklepilkina added inline comments.
utils/FileCheck/FileCheck.cpp
1165

Newline between CHECK-SAME-WORD matches checks separately as usual CHECK-SAME.
I added suggested test and it works. As I understood it checks such situation.

eklepilkina marked an inline comment as done.Jul 18 2016, 11:47 PM

We won't know until somebody tries. It should not be too hard to do the experiment, I've done that kind of thing twice already.

Ok, I can do such experiment, but this experiment will be only with LLVM and clang tests. May be there are a few tests. But how can I guarantee that users have no a lot of such tests in private repositories?

eklepilkina edited edge metadata.Jul 18 2016, 11:48 PM
eklepilkina added a subscriber: llvm-commits.

There is a suggestion to add new option, which will change standart behaviour to word-matchinf instead of addding new set of CHECKs directives. What about this idea?

I am in support of such a switch. As already mentioned that once I got bitten that

; CHECK: br

to check for a branch instruction also matched unrelated lines eg.

%2 = add i32 %subregion, i32 5

Such an option would lower the probability of such mismatches. I'd prefer this mode for all my tests but introducing an opt-in switch for compatibility is a good idea.

Replace set of new CHECK directives to new -check-word option

eklepilkina added a comment.EditedJul 21 2016, 1:42 AM

I made new mode and made experiment turned it on for all LLVM tests.
Result:

Expected Passes    : 15810
Expected Failures  : 125
Unsupported Tests  : 195
Unexpected Passes  : 4
Unexpected Failures: 1128

In my opinion, there is quite big number of failed tests with new behavior. So doing this word-match behavior standart won't be simple task.

I don't think making this the default behavior is a good idea. This may introduce hard-to-find bugs to every testcase (because they were written written with the old behavior in mind).
For example, how would we even find problems with CHECK-NOT; the new behavior will make the expressions stricter, but we wouldn't even realize it because we're expecting no match. The only way this could be made the default behavior would be to change all existing CHECKs to explicitly use the old behavior. This would still break every out-of-tree testcase, but that is admittedly a lesser concern.

eklepilkina edited reviewers, added: vsk, alexfh; removed: dblaikie.
eklepilkina removed a subscriber: vsk.
eklepilkina updated this revision to Diff 69957.Sep 1 2016, 1:28 AM
alexfh edited edge metadata.Sep 7 2016, 4:33 AM

Do I understand correctly, that this patch makes writing single-word check lines easier? Why not add more words to check lines to make them more strict? BTW, does FileCheck support {{\b}}?

alexfh added a comment.Sep 7 2016, 4:35 AM

Do I understand correctly, that this patch makes writing single-word check lines easier? Why not add more words to check lines to make them more strict? BTW, does FileCheck support {{\b}}?

Previous comments answer the last question. So is it feasible to add support for \b to LLVM regexps?

Why not add more words to check lines to make them more strict? BTW, does FileCheck support {{\b}}?

This patch adds

// CHECK-WORD:
// CHECK-WORD-NEXT:
// CHECK-WORD-SAME:
// CHECK-WORD-DAG:
// CHECK-WORD-NOT:

What words do you suggest to add?

So is it feasible to add support for \b to LLVM regexps?

The regexp library is very old. I asked several times about changing it. But I haven't got any answer. There are a lot of other unsupported features of regexs in this library and which can be useful.

Why not add more words to check lines to make them more strict? BTW, does FileCheck support {{\b}}?
So is it feasible to add support for \b to LLVM regexps?

The regexp library is very old. I asked several times about changing it. But I haven't got any answer. There are a lot of other unsupported features of regexs in this library and which can be useful.

I think it would be worthwhile to bring this up again. What are our options with the regex library? Is there a newer version of the one we are currently using that we could upgrade to (and that supports \b)? How much work would it be to implement support for \b ourselves?

What are our options with the regex library? Is there a newer version of the one we are currently using that we could upgrade to (and that supports \b)?

As I understood used regexp library is version of OpenBSD library. I couldn't find some new versions.
I don't know a lot about modern C++ regexp library, but there are:

  1. PCRE(pcre.h)
  2. There is regex library in boost. But boost is too big
  3. Also there is regex library in poco.

May be there are people who has some experience with C++ regexs libraries.

How much work would it be to implement support for \b ourselves?

I can't give exact estimate. I inly had a look on its source code. But I think that it's wrong to add features support. There are a lot of other useful feautures, for example complex assertions, conditional subexpressions and etc., that can be used. Then if somebody want them he should implement them by himself.

What are our options with the regex library? Is there a newer version of the one we are currently using that we could upgrade to (and that supports \b)?

As I understood used regexp library is version of OpenBSD library. I couldn't find some new versions.
I don't know a lot about modern C++ regexp library, but there are:

  1. PCRE(pcre.h)
  2. There is regex library in boost. But boost is too big
  3. Also there is regex library in poco.

FWIW, there's also re2 (https://github.com/google/re2).

May be there are people who has some experience with C++ regexs libraries.

How much work would it be to implement support for \b ourselves?

I can't give exact estimate. I inly had a look on its source code. But I think that it's wrong to add features support. There are a lot of other useful feautures, for example complex assertions, conditional subexpressions and etc., that can be used. Then if somebody want them he should implement them by himself.

Why not add more words to check lines to make them more strict?

This patch adds

// CHECK-WORD:
// CHECK-WORD-NEXT:
// CHECK-WORD-SAME:
// CHECK-WORD-DAG:
// CHECK-WORD-NOT:

What words do you suggest to add?

I'm suggesting to make patterns used in CHECK-lines stricter by adding more text to them instead of allowing to match on word boundaries using the new CHECK-directives. E.g. the line you want to check is

op1;

So instead of using CHECK-WORD: op1, you could make the pattern stricter by adding more context: CHECK: {{^}}op1;{{$}}. Do you have specific examples of output and patterns, where this solution is undesired for some reason?

Overall, I think the FileCheck utility is already confusing enough and many developers attribute some magic properties to it (that it can somehow implicitly connect position of a CHECK line to the position of the relevant line in the input, etc.). Adding more features like CHECK-WORD only makes this worse.

So is it feasible to add support for \b to LLVM regexps?

The regexp library is very old. I asked several times about changing it. But I haven't got any answer. There are a lot of other unsupported features of regexs in this library and which can be useful.

So instead of using CHECK-WORD: op1, you could make the pattern stricter by adding more context: CHECK: {{^}}op1;{{$}}. Do you have specific examples of output and patterns, where this solution is undesired for some reason?

Most of the time it is not desirable to add the remainder of the line to be checked as well. It could be very long, or just contain irrelevant information (such as the branch sources following basic block names)

As I understood most of users prefer to change library to have support of \b assert. But, I think, this decision is quite important and can be made only by he contributor of FileCheck. I don't know how this suggestion can be done. Or I can try to change library and create a patch with this changes?

I think the two most compelling options are:

  1. Implement support for \b in llvm's regex library.
  2. Go with Duncan's suggestion and investigate whether the C++11 regex library can replace LLVM's posix regex library entirely so we can drop it from LLVM.

Ok, I'll try to use STL regex.

eklepilkina added a comment.EditedSep 19 2016, 12:28 AM

There is problem with STL, their realization of regular expressions doesn't match start and end of line in basic mode and there is no multiline mode. There is the issue 2343 in http://cplusplus.github.io/LWG/lwg-toc.html.
There is opportunity to split text by lines and match each line separately, but I think this is a hack, and I think it'll slow FileCheck.

alexfh requested changes to this revision.Sep 27 2016, 7:46 PM
alexfh edited edge metadata.

What are our options with the regex library? Is there a newer version of the one we are currently using that we could upgrade to (and that supports \b)?

As I understood used regexp library is version of OpenBSD library. I couldn't find some new versions.
I don't know a lot about modern C++ regexp library, but there are:

  1. PCRE(pcre.h)
  2. There is regex library in boost. But boost is too big
  3. Also there is regex library in poco.

FWIW, there's also re2 (https://github.com/google/re2).

May be there are people who has some experience with C++ regexs libraries.

How much work would it be to implement support for \b ourselves?

I can't give exact estimate. I inly had a look on its source code. But I think that it's wrong to add features support. There are a lot of other useful feautures, for example complex assertions, conditional subexpressions and etc., that can be used. Then if somebody want them he should implement them by himself.

Why not add more words to check lines to make them more strict?

This patch adds

// CHECK-WORD:
// CHECK-WORD-NEXT:
// CHECK-WORD-SAME:
// CHECK-WORD-DAG:
// CHECK-WORD-NOT:

What words do you suggest to add?

I'm suggesting to make patterns used in CHECK-lines stricter by adding more text to them instead of allowing to match on word boundaries using the new CHECK-directives. E.g. the line you want to check is

op1;

So instead of using CHECK-WORD: op1, you could make the pattern stricter by adding more context: CHECK: {{^}}op1;{{$}}. Do you have specific examples of output and patterns, where this solution is undesired for some reason?

Overall, I think the FileCheck utility is already confusing enough and many developers attribute some magic properties to it (that it can somehow implicitly connect position of a CHECK line to the position of the relevant line in the input, etc.). Adding more features like CHECK-WORD only makes this worse.

So is it feasible to add support for \b to LLVM regexps?

The regexp library is very old. I asked several times about changing it. But I haven't got any answer. There are a lot of other unsupported features of regexs in this library and which can be useful.

There is problem with STL, their realization of regular expressions doesn't match start and end of line in basic mode and there is no multiline mode. There is the issue 2343 in http://cplusplus.github.io/LWG/lwg-toc.html.
There is opportunity to split text by lines and match each line separately, but I think this is a hack, and I think it'll slow FileCheck.

Not sure, whether adding \b support to the regex library used in LLVM is more effort, but there's another option here: take the lib++s regex header with the multiline option added (as per http://cplusplus.github.io/LWG/lwg-active.html#2503) and use a private copy of it (with proper namespace modifications and whatever else is needed to make it actually private. WDYT?

This revision now requires changes to proceed.Sep 27 2016, 7:46 PM

Looks like a few older comments were posted again. Please ignore duplicates.