This is an archive of the discontinued LLVM Phabricator instance.

[FileCheck] New option -warn
AbandonedPublic

Authored by SjoerdMeijer on Nov 20 2018, 12:36 PM.

Details

Reviewers
jdenny
probinson
Summary

This adds a new option, -warn, to FileCheck. We would like FileCheck to
produce diagnostics indicating possible test case issues when this
option is enabled. The motivating example, taken from D53710, is this:

    
RUN: llc -mtriple=arm-linux-gnueabi %s -o - | FileCheck %s -check-prefix=FOO
define void @FOO() {
entry:
; FOO-LABEL: FOO:
; ZOO:       mov  pc, lr
  ret void
}

The problem here is that we're not really checking anything, because the
non-LABEL prefix "ZOO" isn't specified with -check-prefix: FOO is, but
ZOO isn't, which can easily happen when a typo is made.

This is the initial bit of plumbing to add a new option, under which
we can later emit diagnostics.
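
As a usage sketch (the -warn flag is what this patch adds; the diagnostic text is left as a placeholder, since later patches define the actual warnings):

RUN: llc -mtriple=arm-linux-gnueabi %s -o - | FileCheck %s -check-prefix=FOO -warn
FileCheck: warning: <possible test case issue>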

Diff Detail

Event Timeline

SjoerdMeijer created this revision.Nov 20 2018, 12:36 PM

Thanks for working on this. I do not believe DAG overlaps should be a warning. I know from experience that they occur very frequently.

SjoerdMeijer edited the summary of this revision. (Show Details)

Thanks for taking a look!

Agreed about the overlapping DAG warnings, so I have removed that. Instead I now issue one diagnostic, "Running FileCheck with warnings enabled", when -warn is enabled, and check that in tests.

jdenny added inline comments.Nov 21 2018, 7:31 AM
test/FileCheck/check-warning.txt
4

Why -vv?

Why -ERROR? Wouldn't -QUIET make more sense?

utils/FileCheck/FileCheck.cpp
211

Is the purpose of this warning purely for testing your -warn implementation? That is, do you plan to remove it upon committing D53710 (or some other warning)?

I'm imagining grepping through test suite output looking for FileCheck warnings to investigate, and this warning would be a pervasive distraction.

Is the purpose of this warning purely for testing your -warn implementation?

Yes, exactly. But I've restricted it even further, so that it will be printed only when -v is also enabled. But I will most likely/definitely refactor this away as soon as we have some real warnings in.

This is small enough that I'd prefer to see it incorporated into the patch with the "real" warning, rather than have this temporary placeholder diagnostic.

FTR I am having some sort of email issues where I do not receive all the Phabricator updates. I'll try to do better at polling the FileCheck patches via the website next week, after the US holiday break.

This is small enough that I'd prefer to see it incorporated into the patch with the "real" warning, rather than have this temporary placeholder diagnostic.

Agreed, especially given that much of this patch will disappear at that point. That might also help us to decide whether it's useful for PrintWarning to be exposed via the FileCheck class as it is now.

This is small enough that I'd prefer to see it incorporated into the patch with the "real" warning, rather than have this temporary placeholder diagnostic.

Okidoki, thanks. I've removed the test and placeholder warning. Will leave this diff as it is now, and return to D53710 and make that dependent on this change (I could also abandon this change, but keeping it for now won't hurt).

I decided to continue here, and implemented the diagnostic that Paul suggested in the other ticket, because it's small and a nice demonstrator for this new option:

Another useful diagnostic (probably best done separately) would be when a given prefix triggers no directives at all. That's obviously an incorrect test; either it has typos or has a dangling -check-prefix whose directives have all been removed. Right now, this diagnostic occurs only when there are no matching directives at all. But if you specify two prefixes, and only one of them matches a directive, FileCheck is silent. I think that's bad.

I ran this on the regression tests and found more than enough things! They all looked like genuine dangling prefixes to me, and I haven't spotted any false positives yet.

Once this change is in (if we're happy with it), I will start using it, also to get experience, and start with cleaning up the Arm tests....

fyi: as a first exercise, I found some (real) issues and fixed/committed them in https://reviews.llvm.org/rL347487

I decided to continue here, and implemented the diagnostic that Paul suggested in the other ticket, because it's small and a nice demonstrator for this new option:

Another useful diagnostic (probably best done separately) would be when a given prefix triggers no directives at all. That's obviously an incorrect test; either it has typos or has a dangling -check-prefix whose directives have all been removed. Right now, this diagnostic occurs only when there are no matching directives at all. But if you specify two prefixes, and only one of them matches a directive, FileCheck is silent. I think that's bad.

Sounds like that could be an error, rather than a warning, perhaps?

(in general I'm a bit hesitant to add warnings - either these checks are things we can all agree on are erroneous & should be a failure, or warning cruft will accumulate and make the noise too high for the warning signals to be seen?)

I decided to continue here, and implemented the diagnostic that Paul suggested in the other ticket, because it's small and a nice demonstrator for this new option:

Another useful diagnostic (probably best done separately) would be when a given prefix triggers no directives at all. That's obviously an incorrect test; either it has typos or has a dangling -check-prefix whose directives have all been removed. Right now, this diagnostic occurs only when there are no matching directives at all. But if you specify two prefixes, and only one of them matches a directive, FileCheck is silent. I think that's bad.

Sounds like that could be an error, rather than a warning, perhaps?

We can always do that later if we decide the warning is reasonable enough of the time.

(in general I'm a bit hesitant to add warnings - either these checks are things we can all agree on are erroneous & should be a failure, or warning cruft will accumulate and make the noise too high for the warning signals to be seen?)

The usage we discussed is running with FILECHECK_OPTS="-warn" while developing/debugging tests, but normally you'd leave it off. Does that address your concern?

Sounds like that could be an error, rather than a warning, perhaps?

While I agree with the observation, practically it would be a lot easier to start small with a warning, clean things up, and then either promote warnings into errors, or make this particular warning an error. I've spent quite some time just looking at only 2 subdirectories, and fixed only one. Making this an error now would break tests, so we would need one big-bang commit to fix all tests first. I am okay spending time on this, but this is going to take some time, and also for some broken codegen tests I would probably need advice.

We can always do that later if we decide the warning is reasonable enough of the time.

So for these reasons, this would definitely be my preferred approach.

I decided to continue here, and implemented the diagnostic that Paul suggested in the other ticket, because it's small and a nice demonstrator for this new option:

Another useful diagnostic (probably best done separately) would be when a given prefix triggers no directives at all. That's obviously an incorrect test; either it has typos or has a dangling -check-prefix whose directives have all been removed. Right now, this diagnostic occurs only when there are no matching directives at all. But if you specify two prefixes, and only one of them matches a directive, FileCheck is silent. I think that's bad.

Sounds like that could be an error, rather than a warning, perhaps?

We can always do that later if we decide the warning is reasonable enough of the time.

Mostly I'm pushing back on the general idea of adding a warning feature because of the level of noise that may develop if it's not held as a hard requirement.

(in general I'm a bit hesitant to add warnings - either these checks are things we can all agree on are erroneous & should be a failure, or warning cruft will accumulate and make the noise too high for the warning signals to be seen?)

The usage we discussed is running with FILECHECK_OPTS="-warn" while developing/debugging tests, but normally you'd leave it off. Does that address your concern?

Not really - either everyone's using it and the tests are warning clean, or not everyone's using it (especially likely early on if it's opt-in) & the codebase has lots of warnings in it that make the feature hard to use practically by those who want to (because every time they sync and build, they'll get spurious warnings they have to clean up to be able to see their own warnings)

Sounds like that could be an error, rather than a warning, perhaps?

While I agree with the observation, practically it would be a lot easier to start small with a warning, clean things up, and then either promote warnings into errors, or make this particular warning an error. I've spent quite some time just looking at only 2 subdirectories, and fixed only one. Making this an error now would break tests, so we would need one big-bang commit to fix all tests first. I am okay spending time on this, but this is going to take some time, and also for some broken codegen tests I would probably need advice.

Are the warnings you have in mind regressing at a sufficiently high rate to need this, though? (& if other people aren't opting in to warnings-as-errors, how will they be cleaned up?) You/others could checkin fixes as they find them rather than accumulating one large patch - then when it's close to done, you send out the patch to add the error to avoid regressions?

The usage we discussed is running with FILECHECK_OPTS="-warn" while developing/debugging tests, but normally you'd leave it off. Does that address your concern?

Not really - either everyone's using it and the tests are warning clean, or not everyone's using it (especially likely early on if it's opt-in) & the codebase has lots of warnings in it that make the feature hard to use practically by those who want to (because every time they sync and build, they'll get spurious warnings they have to clean up to be able to see their own warnings)

I imagined people would specify FILECHECK_OPTS="-warn" while running only the tests they're developing/debugging. For example, they might combine it with LIT_FILTER. That is, they would only see warnings where they care about them.
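
A minimal sketch of that workflow (the build path and the filter value are illustrative):

FILECHECK_OPTS="-warn" LIT_FILTER="CodeGen/ARM/my-new-test" ./build/bin/llvm-lit -sv llvm/test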

Also, specific test suites that are expected to be warning-clean might want -warn in their config. I was also thinking we'd eventually have a way to enable specific warnings or make them errors (like -Wfoo -Werror). Others feel that's overkill for now because we don't have very many kinds of warnings.

You/others could checkin fixes as they find them rather than accumulating one large patch - then when it's close to done, you send out the patch to add the error to avoid regressions?

Adding -warn now makes it easier for anyone to contribute to this cleanup and doesn't hurt anyone who doesn't care about it.

You/others could checkin fixes as they find them rather than accumulating one large patch - then when it's close to done, you send out the patch to add the error to avoid regressions?

Adding -warn now makes it easier for anyone to contribute to this cleanup and doesn't hurt anyone who doesn't care about it.

I wonder if -allow-deprecated-dag-overlap is a better model: an error that can be explicitly disabled. The idea there was that over time we could rework the buggy tests, and when that's done, remove the option. In the meantime all new tests will get complaints, which is what we want.

I think @dblaikie has a point; if we want to move to a new world with less-fallible tests, it's better for new FileCheck diags to be errors that are on-by-default, and explicitly disabled in the tests that we haven't managed to fix yet.

You/others could checkin fixes as they find them rather than accumulating one large patch - then when it's close to done, you send out the patch to add the error to avoid regressions?

Adding -warn now makes it easier for anyone to contribute to this cleanup and doesn't hurt anyone who doesn't care about it.

I wonder if -allow-deprecated-dag-overlap is a better model: an error that can be explicitly disabled. The idea there was that over time we could rework the buggy tests, and when that's done, remove the option. In the meantime all new tests will get complaints, which is what we want.

I think @dblaikie has a point; if we want to move to a new world with less-fallible tests, it's better for new FileCheck diags to be errors that are on-by-default, and explicitly disabled in the tests that we haven't managed to fix yet.

Sounds plausible to me - if the cleanup work is big enough to merit doing it that way. (so I'd suggest evaluating on a case-by-case basis, if the cleanup work needs to be distributed amongst people (are there enough people signing up to do this work, etc))

I'd like to avoid (though I probably don't feel so strongly that it's either "this must have a timeline to cleanup/removing the deprecating flag - or we shouldn't do it at all" just "if these deprecated flags stick around in a handful of tests for years I'll be a bit sad") these things sticking around indefinitely, of course.

Looks like we are unhappy with FileCheck for many and different reasons:

  1. we want to error for overlapping DAG checks,
  2. we want to error for the example in this diff: one, but not all, prefixes are used,
  3. we want a diagnostic for my motivating example in D53710: only labels are checked (check-prefixes are renamed/refactored/misspelled, but the tags in the tests are not updated accordingly),
  4. CHECK-LABEL is confusing and probably mostly misunderstood

For me, 2) and 3) are really important, because this very easily results in tests not checking anything; as a result, we have serious test quality issues at the moment, imo.

As I said, I agree we don't need a diagnostic mechanism for 2) as they can be errors. Tomorrow I will get the full list to get an impression of how much work it will be to fix this.

I am unsure about 3), i.e. I can't remember/am unsure if we can avoid false positives. I will look into this again, or if someone can remind me that's fine too :-). But I thought this was the motivating case for emitting a diagnostic: if a false positive cannot easily be avoided, an opt-in warning is the best we can do at the moment. The use case would indeed be to enable this while developing new tests.

You/others could checkin fixes as they find them rather than accumulating one large patch - then when it's close to done, you send out the patch to add the error to avoid regressions?

Adding -warn now makes it easier for anyone to contribute to this cleanup and doesn't hurt anyone who doesn't care about it.

I wonder if -allow-deprecated-dag-overlap is a better model: an error that can be explicitly disabled. The idea there was that over time we could rework the buggy tests, and when that's done, remove the option. In the meantime all new tests will get complaints, which is what we want.

In that case, before we proceed, we need a much more comprehensive survey of the effect of these diagnostics on all existing test suites, as I did for -allow-deprecated-dag-overlap.

Making this diagnostic a warning for now gives us a gentler way to ease into this: we add the warning, we explore the effect, and then we decide whether to (1) remove the warning, (2) leave it as a warning for some test suites and for manual usage, or (3) make it an error.

Moreover, -allow-deprecated-dag-overlap was a case where we were fixing subtle behavior and not simply enabling/disabling a diagnostic. We had strong evidence that the old behavior was confusing users, and supporting two different behaviors forever would surely be considerably more confusing. In contrast, it's not clear at all to me that the diagnostics we're discussing now are universally correct and thus should become errors. We already discussed that point at length for the diagnostic introduced by D53710.

For the diagnostic introduced by this patch (individually unused prefixes), I've written tests (locally, not sure about upstream) where I purposely didn't always use all prefixes specified to FileCheck. Specifically, I have tests in which I'm trying to exercise many different combinations of some executable's features: foo, bar, etc. Such a test has many FileCheck calls, each checking output when some particular combination of features is enabled, and I have a systematic FileCheck prefix scheme that corresponds to those combinations: FOO-MAYBE-BAR, BAR-MAYBE-FOO, FOO-NOT-BAR, FOO-AND-BAR, etc. As the test code evolves, the prefixes specified in those FileCheck calls rarely need to change, but which prefixes are used in actual directives do change. If I had to go back and update the FileCheck calls every time the usage changes, I'd probably forget to add back prefixes, and my test would pass when it shouldn't. It's better to write the FileCheck calls with their full sequences of prefixes once and not have to update them again.

Now, I can't argue we need to cater to that use case until it's upstream, and someone might offer a better way to accomplish what I want to accomplish in those tests. But my point is that there logically exists a use case where people could be depending on this diagnostic not being an error, and maybe it even makes sense that, for those cases, this diagnostic should never become an error. We'll only know if we explore more test suites. A warning (especially if we also provide something like -Werror) is the easiest way to do that, in my opinion.

I think @dblaikie has a point; if we want to move to a new world with less-fallible tests, it's better for new FileCheck diags to be errors that are on-by-default, and explicitly disabled in the tests that we haven't managed to fix yet.

I'd agree if we already knew that all tests producing the diagnostics likely deserve to be fixed. But we haven't looked hard enough yet, in my opinion.

I am unsure about 3), i.e. I can't remember/am unsure if we can avoid false positives. I will look into this again, or if someone can remind me that's fine too :-). But I thought this was the motivating case for emitting a diagnostic: if a false positive cannot easily be avoided, an opt-in warning is the best we can do at the moment. The use case would indeed be to enable this while developing new tests.

Someone had a use-case for label-only check prefixes, which was when the labels were the same across several runs but the checks between the labels would vary. I think it had to do with instructions emitted for different subtargets. Thus, if a given prefix had only labels, and we diagnosed that, it would be a false positive.
Now, if the only directives ever seen were labels, that's different, and suggests a misunderstanding of LABEL. But it also will not lead to tests that do the wrong thing. So, I'm not convinced that we can have a diagnostic for your case 3 that would really be meaningful and useful.
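
A sketch of that use case (triples, prefix names, and instructions are illustrative): the shared prefix carries only labels, while the per-run prefixes check the instructions that differ, so flagging the label-only prefix would be a false positive.

; RUN: llc -mtriple=armv7a-linux-gnueabi %s -o - | FileCheck %s -check-prefixes=COMMON,V7
; RUN: llc -mtriple=thumbv6m-none-eabi %s -o - | FileCheck %s -check-prefixes=COMMON,V6M
define i32 @add(i32 %a, i32 %b) {
entry:
; COMMON-LABEL: add:
; V7:           add  r0, r0, r1
; V6M:          adds r0, r0, r1
  %sum = add i32 %a, %b
  ret i32 %sum
}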

So, I'm not convinced that we can have a diagnostic for your case 3 that would really be meaningful and useful.

That is really unfortunate and problematic. Apologies, but just for completeness, this was that example:

RUN: llc -mtriple=arm-linux-gnueabi %s -o - | FileCheck %s -check-prefix=FOO
define void @FOO() {
entry:
  ; FOO-LABEL: FOO:
  ; DOO:             mov  pc, lr
  ret void
}

It's really not difficult to make this mistake, and I really would like to solve this one too. My previous approach might be wrong, so I am going to look again if I can get it to reliably error in this case.

So, I'm not convinced that we can have a diagnostic for your case 3 that would really be meaningful and useful.

That is really unfortunate and problematic. Apologies, but just for completeness, this was that example:

RUN: llc -mtriple=arm-linux-gnueabi %s -o - | FileCheck %s -check-prefix=FOO
define void @FOO() {
entry:
  ; FOO-LABEL: FOO:
  ; DOO:             mov  pc, lr
  ret void
}

It's really not difficult to make this mistake, and I really would like to solve this one too. My previous approach might be wrong, so I am going to look again if I can get it to reliably error in this case.

And if the input was this:

RUN: llc -mtriple=arm-linux-gnueabi %s -o - | FileCheck %s -check-prefix=FOO
define void @FOO() {
entry:
  ; FOO-LABEL: FOO:
  ; DOO:             mov  pc, lr
  ; FOO:             ret void
}

there is no reasonable way to flag the DOO line as a mistake. In fact, typos in the prefix are a tactic I've often used to "comment out" directives, for one reason or another, and I would not want that tactic to start triggering warnings.

And I do remember a possibly valid use case for emitting a diagnostic. I've seen quite a few tests with names like test-pr12345.ll, that were reproducers of compiler crashes, but only contained CHECK-LABELs. I called them dubious tests before (but we don't want to use that term), because these tests are happy when a function compiles, but don't test anything, whereas they probably should. There are probably a few exceptions where only CHECK-LABELs are okay, so I think a warning along the lines of "you're only checking labels, is that what you really want?" would be fair?

And I do remember a possibly valid use case for emitting a diagnostic. I've seen quite a few tests with names like test-pr12345.ll, that were reproducers of compiler crashes, but only contained CHECK-LABELs. I called them dubious tests before (but we don't want to use that term), because these tests are happy when a function compiles, but don't test anything, whereas they probably should. There are probably a few exceptions where only CHECK-LABELs are okay, so I think a warning along the lines of "you're only checking labels, is that what you really want?" would be fair?

I'd probably err towards saying "this (whatever it is - but the conservative behavior in any given situation) is the default behavior, you can pass a flag if you want the other" - if we're sure having a flag & that schism in readability, etc, is worthwhile rather than constraining all tests to the conservative behavior. Having a warning seems like it'd be hard to maintain (for the same reason compiler warnings that aren't -Werror are hard to maintain - the difference here is we control both the use and the tool, so there's no need to make warnings non-errors like we have to for the public interface to the compiler) - when I go and edit a test and it flags this warning, how do I know if the original author intended it or not? If they have to put a specific flag there to say "hey, this is the filecheck behavior I want" that seems better to me.

And I do remember a possibly valid use case for emitting a diagnostic. I've seen quite a few tests with names like test-pr12345.ll, that were reproducers of compiler crashes, but only contained CHECK-LABELs. I called them dubious tests before (but we don't want to use that term), because these tests are happy when a function compiles, but don't test anything, whereas they probably should. There are probably a few exceptions where only CHECK-LABELs are okay, so I think a warning along the lines of "you're only checking labels, is that what you really want?" would be fair?

In practice, a LABEL-only test works exactly like the same test with all the -LABEL suffixes removed. That is, it still does check that the specified strings are present. We could decree that a LABEL-only test is invalid, and make that an error instead of a warning, but only after fixing the offending tests that are in the code base now.

Given that FileCheck is primarily a non-interactive tool, warnings aren't particularly beneficial. Even when I'm writing a test, typically I don't run FileCheck directly, I'm running the test under lit. I won't even see a diagnostic unless I run lit -v, which I probably won't if the test passes first try.
So, errors (that can be silenced, if we can't fix all the problems up front) are the way to go.

I agree the diagnostic should be an error not a warning when it's enabled. I meant to be arguing that it should initially be disabled by default so that it's easier for multiple people to help investigate various test suites, but I conflated the two concepts.

We could have something like -diag=[no-]DIAG where DIAG is something like unused-prefix (this patch) or label-only (the other patch) or whatever. At first, for each DIAG value, the default would be off, -diag=DIAG could be specified in FILECHECK_OPTS, and -diag=no-DIAG could be added to specific tests that will never be adjusted (keep in mind that the command line has higher precedence than FILECHECK_OPTS). Later, when we're more confident, the default would be on. If we eventually decide the diagnostic is always right and that -diag=no-DIAG thus should never be used, then we disable that option.
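
A sketch of how that might be used (the -diag option and the DIAG names are the proposal above, not an existing FileCheck feature):

# Opt in globally while investigating a test suite:
FILECHECK_OPTS="-diag=unused-prefix" ./build/bin/llvm-lit -sv llvm/test/CodeGen/ARM

# Opt a single test out; the command line takes precedence over FILECHECK_OPTS:
RUN: llc %s -o - | FileCheck %s -check-prefixes=FOO,BAR -diag=no-unused-prefix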

Ok, nice, looks like we actually agree: we want these to be errors, not warnings. Perhaps I conflated things too, likely for practical reasons: how do we get all tests error-free so that we can enable the error checking?

Just to reiterate that 1) there are serious test problems, and 2) this is quite a bit of work, below is the list of things I've cleaned up; quite importantly, I have only looked in test/CodeGen/ARM/:

I just had a quick look, and in the entire test suite (with backends X86, AArch64 and ARM enabled) there are ~200 tests where this unused-prefix diagnostic triggers.

Therefore, I hope Joel's proposed approach is acceptable; I agree with that and I see that as a convenient way to gradually clean up and fix tests.

But I still disagree with this one:

And if the input was this:

RUN: llc -mtriple=arm-linux-gnueabi %s -o - | FileCheck %s -check-prefix=FOO
define void @FOO() {
entry:
; FOO-LABEL: FOO:
; DOO:             mov  pc, lr
; FOO:             ret void
}

there is no reasonable way to flag the DOO line as a mistake. In fact, typos in the prefix are a tactic I've often used to "comment out" directives, for one reason or another, and I would not want that tactic to start triggering warnings.

I think this is the prime example where I would like to see and get a diagnostic. In this example, I think there are 2 bugs: one in the test, and one in FileCheck. There are probably many ways to temporarily comment out things, and this is a "dangerous" one. Because of this relaxed FileCheck behaviour it's too easy to shoot ourselves in the foot, and that's why we have so many test issues. The fact that we have so many test issues shows that code review isn't always catching this. But I don't think code review should really have to deal with this. We have a tool for this: FileCheck should do its checking, and do it properly. Besides 'commenting out' checks in this way, the other cause for this is when you first had --check-prefixes=FOO,DOO, then some test refactoring happens, DOO is dropped from the --check-prefixes line, and the test becomes half useless, half correct.

jdenny added a comment. Edited Nov 27 2018, 5:44 AM

Both diagnostics we have been discussing can frustrate attempts to comment out directives. There are ways to work around both: temporarily change the preceding CHECK-LABEL to CHECK, or remove the associated prefix from the FileCheck command.

Alternatively, if diagnostics tell you their names (just as compiler warnings do), you could temporarily use the -diag=no-DIAG option I proposed. But that means leaving that option available indefinitely... but maybe that'll be necessary anyway once we have investigated all the test suites.

Both diagnostics we have been discussing can frustrate attempts to comment out directives. There are ways to work around both: temporarily change the preceding CHECK-LABEL to CHECK, or remove the associated prefix from the FileCheck command.

Alternatively, if diagnostics tell you their names (just as compiler warnings do), you could temporarily use the -diag=no-DIAG option I proposed. But that means leaving that option available indefinitely... but maybe that'll be necessary anyway once we have investigated all the test suites.

I accidentally submitted that comment before I was done editing. I fixed it, but it looks like no email notification was sent for the edit.

Anyway, another thought: maybe it's time for a true comment directive, which would be used something like this:

// CHECK-COMMENT: CHECK: pattern

First, the intention is far clearer than

// XHECK: pattern

and so should probably always be used for commenting out directives going forward. Moreover, it could also suppress diagnostics. That is, for the sake of these diagnostics, it could either (1) be considered to be, or (2) be checked to contain, a non-LABEL directive of the relevant prefix.

This directive could be a separate patch, and it could be pushed later provided it's pushed before these diagnostics are enabled by default.
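
For instance, in the earlier example the temporarily disabled check might read (again, CHECK-COMMENT is only a proposal at this point):

; RUN: llc -mtriple=arm-linux-gnueabi %s -o - | FileCheck %s -check-prefix=FOO
define void @FOO() {
entry:
; FOO-LABEL: FOO:
; CHECK-COMMENT: FOO: mov  pc, lr
  ret void
}

The disabled check is still recognizably a directive, so tooling (or these diagnostics) could count the FOO prefix as used instead of flagging it.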

But I still disagree with this one:

And if the input was this:

RUN: llc -mtriple=arm-linux-gnueabi %s -o - | FileCheck %s -check-prefix=FOO
define void @FOO() {
entry:
; FOO-LABEL: FOO:
; DOO:             mov  pc, lr
; FOO:             ret void
}

there is no reasonable way to flag the DOO line as a mistake. In fact, typos in the prefix are a tactic I've often used to "comment out" directives, for one reason or another, and I would not want that tactic to start triggering warnings.

I think this is the prime example where I would like to see and get a diagnostic. In this example, I think there are 2 bugs: one in the test, and one in FileCheck. There are probably many ways to temporarily comment out things, and this is a "dangerous" one. Because of this relaxed FileCheck behaviour it's too easy to shoot ourselves in the foot, and that's why we have so many test issues. The fact that we have so many test issues shows that code review isn't always catching this. But I don't think code review should really have to deal with this. We have a tool for this: FileCheck should do its checking, and do it properly. Besides 'commenting out' checks in this way, the other cause for this is when you first had --check-prefixes=FOO,DOO, then some test refactoring happens, DOO is dropped from the --check-prefixes line, and the test becomes half useless, half correct.

And if the input was this:

; RUN: llc -O2 %s -o - | FileCheck %s -check-prefix=FOO
; RUN: llc -O0 %s -o - | FileCheck %s -check-prefixes=FOO,DOO
define void @FOO() {
entry:
; FOO-LABEL: FOO:
; DOO:             noopt-only-case
; FOO:             opt-or-noopt-case
}

Then we absolutely don't want a warning for DOO on the first run. This test is written correctly.

I've been thinking about these diagnostics over the past week. I'm afraid I'm becoming convinced they are not the best approach. As evidenced by the false positives we've already discussed, I believe unused prefixes and the absence of non-label directives are not inherently bugs. Instead, we are using them as heuristics to help us find the real bugs: typos in FileCheck directives that result in unused directives.

Problem: I believe we are resorting to such heuristics because of a flaw in FileCheck's general directive syntax. Looking at the function ValidateCheckPrefixes in lib/Support/FileCheck.cpp, that syntax is just a word ([a-zA-Z0-9_-]+) followed by a colon. Without knowing the list of user-defined prefixes in order to restrict the syntax further, there is no reliable way to distinguish FileCheck directives from other text in a FileCheck check file. Thus, there is no way to automatically detect directives that are never used because their prefixes are never defined by the user.

Solution: I think we should design a general directive syntax that is sufficiently distinct. As I have been reminded many times before, tools like FileCheck are intended for LLVM's internal use, not for the general public. In other words, FileCheck should do what LLVM developers need, not the other way around. With this in mind, we should carefully design a general syntax that is very unlikely to collide with the kinds of non-directive text in LLVM's own test suites and not worry too much about all conceivable test suites in the wild. If we are clever in our choice, my hunch is that such collisions will rarely occur, and it will be possible to work around cases where they do. After all, even now, how often do we have to choose a prefix other than CHECK purely because of such collisions? We normally change prefixes just to activate a different set of directives, right? We just need to add something that distinctive into the fixed portion of the general syntax. (To mock something up at first, we could just pick a small set of tests and change all their user-defined prefixes to start with the same distinctive string. For example, we might have FC-CHECK, FC-FOO, FC-BAR, etc., but maybe there's something better.)
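
A mock-up of what that might look like (the prefix spellings are purely illustrative, as noted):

; RUN: llc %s -o - | FileCheck %s -check-prefixes=FC-CHECK,FC-FOO
; FC-CHECK-LABEL: foo:
; FC-FOO:         mov  pc, lr
; FC-ZOO:         something else

Because every directive now carries the distinctive FC- marker, the FC-ZOO line is unambiguously a directive, so a prefix that no FileCheck call defines could be reported rather than silently ignored (subject to checking all RUN lines in the file, as discussed below).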

Conclusion: A distinct general syntax for FileCheck directives would permit us to implement precise detection of all unused directives and thus more cleanly combat directive typos. There's plenty to be discussed, and the migration to a new syntax would be a long-term investment (but it can be done incrementally). If the problem with typos is as bad as evidence seems to suggest so far, perhaps all that effort will prove worthwhile: it tackles the root of our problem rather than placing awkward band-aids over it as we're trying to do now. What do people think?

Conclusion: A distinct general syntax for FileCheck directives would permit us to implement precise detection of all unused directives and thus more cleanly combat directive typos. There's plenty to be discussed, and the migration to a new syntax would be a long-term investment (but it can be done incrementally). If the problem with typos is as bad as evidence seems to suggest so far,

What's the evidence that suggests typos are a cause of a significant cost to the project? I see them come up from time to time, but something on the sub 1% of test cases/false negatives, I'd have thought, if I had to guess.

perhaps all that effort will prove worthwhile: it tackles the root of our problem rather than placing awkward band-aids over it as we're trying to do now. What do people think?

How would having a fixed prefix (like the FC example given) help diagnose misspelled directives in one FileCheck run, when another FileCheck run on the same file may use different check prefixes? (usually named prefixes are used for this situation, where a single file contains checks for multiple different things (different tests/using different flags, or multiple tests using different tools (compile this file, then run dwarfdump & FileCheck that output, then run objdump and FileCheck that output with a different prefix))? The file contains, say, a collection of checks with two different prefixes (DWARF: and OBJ:, for instance) - and each FileCheck run is run with only one of those prefixes, so all the others would look like typos, right? (even if they were prefixed with "FC-")

Conclusion: A distinct general syntax for FileCheck directives would permit us to implement precise detection of all unused directives and thus more cleanly combat directive typos. There's plenty to be discussed, and the migration to a new syntax would be a long-term investment (but it can be done incrementally). If the problem with typos is as bad as evidence seems to suggest so far,

What's the evidence that suggests typos are a cause of a significant cost to the project? I see them come up from time to time, but something on the sub 1% of test cases/false negatives, I'd have thought, if I had to guess.

I probably worded that too strongly. I'm basing that assessment on the discussions for these patches, including the results Sjoerd has presented so far. If people's opinions aren't where I thought they were, and if the diagnostic counts Sjoerd reported are mostly false positives, then perhaps you're right that any diagnostics for this problem are not worthwhile and we should move on.

How would having a fixed prefix (like the FC example given) help diagnose misspelled directives in one FileCheck run, when another FileCheck run on the same file may use different check prefixes? (usually named prefixes are used for this situation, where a single file contains checks for multiple different things (different tests/using different flags, or multiple tests using different tools (compile this file, then run dwarfdump & FileCheck that output, then run objdump and FileCheck that output with a different prefix))? The file contains, say, a collection of checks with two different prefixes (DWARF: and OBJ:, for instance) - and each FileCheck run is run with only one of those prefixes, so all the others would look like typos, right? (even if they were prefixed with "FC-")

You'd certainly have to check all FileCheck calls. For many simple test files, I think a script could parse the test file and report prefixes that are used in directives but never defined in FileCheck commands in RUN lines (interestingly, the diagnostic in the current patch reverses this, and I've already explained how that could produce false positives). Dynamic approaches could also be explored for more complex tests. Again, any approach necessarily involves checking all FileCheck calls.
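
A very rough sketch of the kind of script meant here, for a single static test file (GNU grep/sed assumed; the file name is illustrative, and the prefix patterns are approximate, so they will both miss and over-report some cases):

f=llvm/test/CodeGen/ARM/some-test.ll
# Prefixes defined on FileCheck RUN lines (plus CHECK, to be safe).
{ grep -oE -- '-{1,2}check-prefix(es)?[= ][A-Za-z0-9_,-]+' "$f" |
    sed 's/.*[= ]//' | tr ',' '\n'; echo CHECK; } | sort -u > defined.txt
# Words in the file that look like directive prefixes, with known suffixes stripped.
grep -oE '[A-Z][A-Z0-9_-]*:' "$f" |
  sed -E 's/(-(LABEL|NEXT|SAME|NOT|DAG|EMPTY))?:$//' | sort -u > used.txt
# Prefixes used in directives but never defined by any FileCheck call: candidate typos.
comm -13 defined.txt used.txt | grep -vE '^(RUN|REQUIRES|UNSUPPORTED|XFAIL)$'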

Conclusion: A distinct general syntax for FileCheck directives would permit us to implement precise detection of all unused directives and thus more cleanly combat directive typos. There's plenty to be discussed, and the migration to a new syntax would be a long-term investment (but it can be done incrementally). If the problem with typos is as bad as evidence seems to suggest so far,

What's the evidence that suggests typos are a cause of a significant cost to the project? I see them come up from time to time, but something on the sub 1% of test cases/false negatives, I'd have thought, if I had to guess.

I probably worded that too strongly. I'm basing that assessment on the discussions for these patches, including the results Sjoerd has presented so far. If people's opinions aren't where I thought they were, and if the diagnostic counts Sjoerd reported are mostly false positives, then perhaps you're right that any diagnostics for this problem are not worthwhile and we should move on.

Ah, OK - Went back to review those. Yeah, unused prefixes are perhaps mostly harmless due to refactoring. Unused directives in the file (things that look like FileCheck directives but aren't used by any FileCheck invocation for that file) would be interesting to know about, but hard to get good numbers I'd imagine, without a bunch more work - having a warning in FileCheck itself would be insufficient (because each FileCheck only sees potentially a subset of the prefixes used in a given file), so probably a bit of a hacky scripting solution could be done to catch most of the cases & get some numbers to see if a more robust/long-term solution could be used (& what the actual bug rate is - how often are they really mistyped vs. being used only in a subset of checks)

How would having a fixed prefix (like the FC example given) help diagnose misspelled directives in one FileCheck run, when another FileCheck run on the same file may use different check prefixes? (usually named prefixes are used for this situation, where a single file contains checks for multiple different things (different tests/using different flags, or multiple tests using different tools (compile this file, then run dwarfdump & FileCheck that output, then run objdump and FileCheck that output with a different prefix))? The file contains, say, a collection of checks with two different prefixes (DWARF: and OBJ:, for instance) - and each FileCheck run is run with only one of those prefixes, so all the others would look like typos, right? (even if they were prefixed with "FC-")

You'd certainly have to check all FileCheck calls. For many simple test files, I think a script could parse the test file and report prefixes that are used in directives but never defined in FileCheck commands in RUN lines (interestingly, the diagnostic in the current patch reverses this, and I've already explained how that could produce false positives). Dynamic approaches could also be explored for more complex tests. Again, any approach necessarily involves checking all FileCheck calls.

How're you thinking you'd check all the FileCheck calls? Each FileCheck run would itself look for other FileCheck commands in RUN lines to compute the total set of prefixes used & validate them all? (that'd mean redundant work (O(N^2) in the number of FileCheck commands/prefixes - but probably scale OK) & redundant error messages (since each FileCheck command would report all the unused check names itself))

Ah, OK - Went back to review those. Yeah, unused prefixes are perhaps mostly harmless due to refactoring. Unused directives in the file (things that look like FileCheck directives but aren't used by any FileCheck invocation for that file) would be interesting to know about, but hard to get good numbers I'd imagine, without a bunch more work - having a warning in FileCheck itself would be insufficient (because each FileCheck only sees potentially a subset of the prefixes used in a given file), so probably a bit of a hacky scripting solution could be done to catch most of the cases & get some numbers to see if a more robust/long-term solution could be used (& what the actual bug rate is - how often are they really mistyped V being used only in a subset of checks)

Agreed.

How would having a fixed prefix (like the FC example given) help diagnose misspelled directives in one FileCheck run, when another FileCheck run on the same file may use different check prefixes? (usually named prefixes are used for this situation, where a single file contains checks for multiple different things (different tests/using different flags, or multiple tests using different tools (compile this file, then run dwarfdump & FileCheck that output, then run objdump and FileCheck that output with a different prefix))? The file contains, say, a collection of checks with two different prefixes (DWARF: and OBJ:, for instance) - and each FileCheck run is run with only one of those prefixes, so all the others would look like typos, right? (even if they were prefixed with "FC-")

You'd certainly have to check all FileCheck calls. For many simple test files, I think a script could parse the test file and report prefixes that are used in directives but never defined in FileCheck commands in RUN lines (interestingly, the diagnostic in the current patch reverses this, and I've already explained how that could produce false positives). Dynamic approaches could also be explored for more complex tests. Again, any approach necessarily involves checking all FileCheck calls.

How're you thinking you'd check all the FileCheck calls?

For simple tests, an external script could parse (static) test files as I described. That script could run as a separate test case.

I'm not sure which tests would be "simple", but any test that generates the FileCheck check file dynamically would certainly not work with such a script.

For a (hopefully) complete solution, here's a possible approach. Each FileCheck call builds (1) a list of all prefixes defined (by --check-prefix[es]) for that one call and (2) a list of all prefixes used in directives but not defined in the first list, and then it merges those lists into a database. At the end of a test suite run, lit calls a special FileCheck incantation that compiles that database into a list of unused directives.

Each FileCheck run would itself look for other FileCheck commands in RUN lines to compute the total set of prefixes used & validate them all? (that'd mean redundant work (O(N^2) in the number of FileCheck commands/prefixes - but probably scale OK) & redundant error messages (since each FileCheck command would report all the unused check names itself))

That seems possible too. We could experiment with different approaches.

For simple tests, an external script could parse (static) test files as I described. That script could run as a separate test case.

Not exactly possible, unfortunately - tests can/should only read files in the "Inputs" directory relative to the test file location - they can't walk the entire test tree/read everything (this keeps tests a bit isolated, so you can move them around without worrying about some other random test depending on your test - and also it's used internally at Google to avoid shipping all the tests/inputs to every machine that runs any of the tests)

I'm not sure which tests would be "simple", but any test that generates the FileCheck check file dynamically would certainly not work with such a script.

I don't think there are any of those - at least I rather hope not.

For a (hopefully) complete solution, here's a possible approach. Each FileCheck call builds (1) a list of all prefixes defined (by --check-prefix[es]) for that one call and (2) a list of all prefixes used in directives but not defined in the first list, and then it merges those lists into a database. At the end of a test suite run, lit calls a special FileCheck incantation that compiles that database into a list of unused directives.

"merges those lists into a database" is a bit vague/concerning in terms of how that'd run in the build system, etc.

Each FileCheck run would itself look for other FileCheck commands in RUN lines to compute the total set of prefixes used & validate them all? (that'd mean redundant work (O(N^2) in the number of FileCheck commands/prefixes - but probably scale OK) & redundant error messages (since each FileCheck command would report all the unused check names itself))

That seems possible too. We could experiment with different approaches.

Yeah, I'd suggest scratching something simple up with grep/sed/python/whatever to get a sense of the magnitude of the problem before sinking time into trying to design/build a long-term solution. My feeling is still that this isn't a problem worth solving given the complexity of the solutions thus far (the self-checking FileCheck version (despite its duplicate error messages, etc) feels like the cheapest I can think of & still isn't terribly nice, I think) - but data will help.

I'm not sure which tests would be "simple", but any test that generates the FileCheck check file dynamically would certainly not work with such a script.

I don't think there are any of those - at least I rather hope not.

I think some of FileCheck's own tests do this, but FileCheck tests are quite a unique case already. Outside of that, I agree, dynamically generated check-files ought to be rare to non-existent.
There are plenty of tests that were generated by one script or another, but the tests themselves don't dynamically generate check files.

For simple tests, an external script could parse (static) test files as I described. That script could run as a separate test case.

Not exactly possible, unfortunately - tests can/should only read files in the "Inputs" directory relative to the test file location - they can't walk the entire test tree/read everything (this keeps tests a bit isolated, so you can move them around without worrying about some other random test depending on your test - and also it's used internally at Google to avoid shipping all the tests/inputs to every machine that runs any of the tests)

For something that checks correctness of test suites (and assuming we decide it's worth it), couldn't we ultimately change the rules? It might even be a lit feature rather than a distinct test. It could also be optional in case it's problematic somewhere.

I'm not sure which tests would be "simple", but any test that generates the FileCheck check file dynamically would certainly not work with such a script.

I don't think there are any of those - at least I rather hope not.

I didn't find any with a quick grep, but I feel like I didn't look hard enough. I know I have some in phabricator for FileCheck's own test suite, but reviewers might tell me to remove that usage.

Another problem would be if lit substitutions make FileCheck commands or their prefixes unrecognizable. I haven't looked for that usage yet.

For a (hopefully) complete solution, here's a possible approach. Each FileCheck call builds (1) a list of all prefixes defined (by --check-prefix[es]) for that one call and (2) a list of all prefixes used in directives but not defined in the first list, and then it merges those lists into a database. At the end of a test suite run, lit calls a special FileCheck incantation that compiles that database into a list of unused directives.

"merges those lists into a database" is a bit vague/concerning in terms of how that'd run in the build system, etc.

Sure, I was just trying to communicate the high-level idea. I'll fill in some details, but these are just possibilities.

For 2, we'd probably want a list of all directives whose prefixes are not defined in 1.

Each entry in 1 and 2 would specify what FileCheck check file it was associated with (and that might present a challenge if that file is ever a stream).

The "merging" that each FileCheck calls performs might be just appending its lists to a results file that is unique to the lit test file that contains that FileCheck call. Hopefully this would permit merging to be performed without synchronization across parallel tests.

For the final compilation of the database, the procedure would be to iterate all these result files. Per FileCheck check file, it would first build a hash of all defined prefixes, and then it would build a list of all unused directives but removing any whose prefix appears in the hash.

The final compilation of the database could be a feature of lit that is always run at the end of any test suite run.

Each FileCheck run would itself look for other FileCheck commands in RUN lines to compute the total set of prefixes used & validate them all? (that'd mean redundant work (O(N^2) in the number of FileCheck commands/prefixes - but probably scale OK) & redundant error messages (since each FileCheck command would report all the unused check names itself))

That seems possible too. We could experiment with different approaches.

Yeah, I'd suggest scratching something simple up with grep/sed/python/whatever to get a sense of the magnitude of the problem before sinking time into trying to design/build a long-term solution. My feeling is still that this isn't a problem worth solving given the complexity of the solutions thus far (the self-checking FileCheck version (despite its duplicate error messages, etc) feels like the cheapest I can think of & still isn't terribly nice, I think) - but data will help.

Agreed. No promises I'll do this anytime soon. I just want to see if people feel that, if this mistyped/unused directive problem is actually worth addressing, this approach is the right one.

Another problem would be if lit substitutions make FileCheck commands or their prefixes unrecognizable. I haven't looked for that usage yet.

I can't imagine that being a real thing. FileCheck itself is a token that lit substitutes just to paste the tools directory onto it while building the actual command lines. Lit substitutions in prefix names... makes no sense.

Yeah, I'd suggest scratching something simple up with grep/sed/python/whatever to get a sense of the magnitude of the problem before sinking time into trying to design/build a long-term solution. My feeling is still that this isn't a problem worth solving given the complexity of the solutions thus far (the self-checking FileCheck version (despite its duplicate error messages, etc) feels like the cheapest I can think of & still isn't terribly nice, I think) - but data will help.

Agreed. No promises I'll do this anytime soon. I just want to see if people feel that, if this mistyped/unused directive problem is actually worth addressing, this approach is the right one.

I don't doubt there are some, and developing a script/utility to find them could be helpful. But automating that process... my guess is, unlikely to be worthwhile unless new cases are being added frequently, which I would not expect.

thopre added a subscriber: thopre.Nov 21 2019, 7:38 AM
SjoerdMeijer abandoned this revision.Mar 17 2023, 1:42 AM
Herald added a project: Restricted Project. Mar 17 2023, 1:42 AM
Herald added a subscriber: kosarev.