This is an archive of the discontinued LLVM Phabricator instance.

[FileCheck] Add precision to format specifier
ClosedPublic

Authored by thopre on Jun 11 2020, 8:42 AM.

Details

Summary

Add printf-style precision specifier to pad numbers to a given number of
digits when matching them if the value is smaller than the given
precision. This works on both empty numeric expression (e.g. variable
definition from input) and when matching a numeric expression. The
syntax is as follows:

[[#%.<precision><format specifier>, ...]]

where <format specifier> is optional and ... may or may not contain a
variable definition, and may or may not have an empty expression. In the
absence of a precision specifier, a variable definition will accept leading zeros.
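
As a rough illustration of the printf behaviour the summary borrows, here is a minimal C++ sketch. The helper name formatHexWithPrecision is hypothetical and not part of FileCheck; it only shows that a printf precision pads with zeros up to a minimum digit count and never truncates.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not FileCheck code): formats Value as lower-case hex
// with at least Precision digits, as printf's "%.<precision>llx" would.
// Assumes Precision is small enough for the result to fit in the buffer.
std::string formatHexWithPrecision(unsigned long long Value, int Precision) {
  char Buffer[64];
  std::snprintf(Buffer, sizeof(Buffer), "%.*llx", Precision, Value);
  return Buffer;
}
```

A value with fewer digits than the precision is zero-padded, while a wider value is printed in full.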

Diff Detail

Event Timeline

thopre created this revision.Jun 11 2020, 8:42 AM

I think I agree with your conclusions. More generally, I think we should be permissive where permissiveness is not going to be surprising (i.e. no explicit format specifier seems reasonable in the general context), and should follow scanf-style format specifiers where reasonable. If I follow it right, it would therefore be possible to specify a 16-digit hex field with %.16x, right? Could you clarify what the motivation for the "with empty expression" bit is? Is that just because when there is an empty expression, your regex is incorrect, or something else?

The code change in general seems simple enough to support the proposal too, though I haven't reviewed it in detail. I'll wait until you've added documentation/tests etc, so that I can review it all at once.

When only using a numeric expression, numeric substitution blocks behave like printf: they print a value as text to be matched against the input. When defining a variable with an empty expression (the majority of definition cases), they behave more like scanf. Only printf supports a precision in its syntax; scanf doesn't. This is the main reason why I ask the question. It is also the case that we currently allow leading zeros when matching an unknown numeric value for a numeric variable definition with an empty expression.

I think the case of a variable defined from an expression is a special case, since you are matching something specific, so in itself it doesn't mandate extending the precision to variable definitions with an empty expression. However, I think allowing a precision when matching an unknown variable is both useful and makes for syntax consistency.

To answer your earlier question, yes it'll be possible to match a 16-digit hex with #%.16x,VAR1: or an 8-digit hex with #%.8x, VAR2:. However #%.8, VAR1 will print all 16 digits of VAR1, same as printf. Does that seem reasonable or should we deviate from printf and give an error in such a case?

I think that seems reasonable to me overall. Thanks for explaining.

MaskRay added inline comments.
llvm/lib/Support/FileCheckImpl.h
56

Prefer default member initializer (unsigned Precision = 0;)

Should the regex wildcard for a numeric variable definition with empty expression also respect the precision, i.e. #%.5u, VAR2: would be matched by (([1-9][0-9]+)? [0-9]{1,5})

I believe I followed the comments about matching behavior for an empty expression (scanf-like) vs. an expression (printf-like). So the above question is about whether, in the empty-expression case, it's worthwhile to support a precision specified by . even though scanf does not support that. Right?

I don't understand the above regex due to the space character after the ?. Was that intended?

Can you give some example inputs and explain the intended matching behavior for #%.5u, VAR2:? Why is this behavior needed in FileCheck but not in scanf?

thopre edited the summary of this revision. (Show Details)Jun 17 2020, 9:53 AM

Should the regex wildcard for a numeric variable definition with empty expression also respect the precision, i.e. #%.5u, VAR2: would be matched by (([1-9][0-9]+)? [0-9]{1,5})

I believe I followed the comments about matching behavior for an empty expression (scanf-like) vs. an expression (printf-like). So the above question is about whether, in the empty-expression case, it's worthwhile to support a precision specified by . even though scanf does not support that. Right?

Correct.

I don't understand the above regex due to the space character after the ?. Was that intended?

No, fixed now.

Can you give some example inputs and explain the intended matching behavior for #%.5u, VAR2:? Why is this behavior needed in FileCheck but not in scanf?

Say the directive is:

CHECK: Address #%.8x,ADDR: is aligned

and the input text is:

Address 12345678 is aligned

I'd expect the directive to match and the value in ADDR to be 0x12345678. Now if the input text was:

Address FFFFFFFF12345678

I'd expect the directive to fail. If the directive was #%x, ADDR: the first input would have led to the same outcome, but the second input would have led to the directive matching and the value in ADDR being 0xFFFFFFFF12345678.

Besides whether this is a useful feature, it makes for easier parsing and consistency in the syntax (no difference between variables defined from an expression where the precision would be allowed and variables defined from an empty expression where precision would not be allowed).

Can you give some example inputs and explain the intended matching behavior for #%.5u, VAR2:? Why is this behavior needed in FileCheck but not in scanf?

Besides whether this is a useful feature, it makes for easier parsing and consistency in the syntax (no difference between variables defined from an expression where the precision would be allowed and variables defined from an empty expression where precision would not be allowed).

I forgot to mention that scanf doesn't need this because it's separate from printf (weaker need for consistency) and I guess aims at parsing some value more than checking format.

I don't understand the above regex due to the space character after the ?. Was that intended?

No, fixed now.

It now says #%.5u, VAR2: matches (([1-9][0-9]+)?[0-9]{1,5}), but that matches 123456789. I think that's unintended.
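
To make the objection concrete, here is a small sketch using std::regex (ECMAScript syntax) rather than FileCheck's own regex engine, so it is only an approximation. The helper name matchesFully is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of Text matches Pattern.
// Uses std::regex with its default ECMAScript grammar, which is an
// assumption; FileCheck uses a different regex implementation.
bool matchesFully(const std::string &Text, const std::string &Pattern) {
  return std::regex_match(Text, std::regex(Pattern));
}
```

With the pattern quoted above, a nine-digit input is accepted in full: the optional group can absorb the first eight digits and {1,5} the last one, so the pattern places no useful bound on the number of digits.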

Can you give some example inputs and explain the intended matching behavior for #%.5u, VAR2:? Why is this behavior needed in FileCheck but not in scanf?

Say the directive is:

CHECK: Address #%.8x,ADDR: is aligned

and the input text is:

Address 12345678 is aligned

I'd expect the directive to match and the value in ADDR to be 0x12345678. Now if the input text was:

Address FFFFFFFF12345678

I'd expect the directive to fail.

You mean fail to match and continue searching? Or fail immediately?

So, %.8x is a maximum? For printf, it's a minimum. scanf's %8x (no .) feels more like what you're going for except that it discards additional digits instead of failing to match.
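
For comparison, a minimal sketch of the scanf field-width behaviour mentioned here. The helper name scanAtMost8HexDigits is hypothetical; it only wraps the standard sscanf call with the "%8x" conversion.

```cpp
#include <cstdio>

// Hypothetical helper: reads at most 8 hex digits from Text, the way
// scanf's "%8x" does. Returns the parsed value, or -1 if no hex digits
// could be read.
long long scanAtMost8HexDigits(const char *Text) {
  unsigned Value = 0;
  if (std::sscanf(Text, "%8x", &Value) != 1)
    return -1;
  return static_cast<long long>(Value);
}
```

The width acts as a maximum: on FFFFFFFF12345678 the conversion stops after the first eight digits and leaves the rest unread, and on a shorter input such as 1234 it still succeeds.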

My bad, my example was completely wrong. My personal motivation is consistency in the syntax. New example:

I'd expect 0x[[#%.8x, ADDR:]] to match 00001234 or FFFFFFFF12345678 but not 1234 due to there not being enough digits. I guess it could be useful to check alignment in a tool but as I said my main motivation is keeping a common format specifier syntax for all numeric substitution blocks. Note that my regex was indeed wrong anyway, it should be (([1-9][0-9]+)?[0-9]{5}).

thopre edited the summary of this revision. (Show Details)Jun 17 2020, 10:57 AM

I'd expect 0x[[#%.8x, ADDR:]] to match 00001234 or FFFFFFFF12345678 but not 1234 due to there not being enough digits.

OK, it would expect a value that could have been printed by printf with %.8x.

I guess it could be useful to check alignment in a tool but as I said my main motivation is keeping a common format specifier syntax for all numeric substitution blocks. Note that my regex was indeed wrong anyway, it should be (([1-9][0-9]+)?[0-9]{5}).

I think you want + to be * to permit 123456.

What would happen on 012345? Would it match 01234 and leave 5 for a later directive, or would FileCheck fail immediately?
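
The difference between the + and * variants can be sketched with std::regex. This is an assumption-laden illustration: it uses ECMAScript syntax and an unanchored search, and it says nothing about what FileCheck itself should do with the leftover digit. The helper name firstMatch is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring of Text matched by
// Pattern (std::regex, ECMAScript grammar), or an empty string if there
// is no match.
std::string firstMatch(const std::string &Text, const std::string &Pattern) {
  std::smatch Match;
  if (!std::regex_search(Text, Match, std::regex(Pattern)))
    return "";
  return Match.str();
}
```

Under these assumptions, the + variant matches only 12345 out of 123456, the * variant matches all six digits, and for 012345 the * variant matches 01234 and leaves the trailing 5 unmatched.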

jhenderson added a comment.EditedJun 18 2020, 12:27 AM

I'd expect 0x[[#%.8x, ADDR:]] to match 00001234 or FFFFFFFF12345678 but not 1234 due to there not being enough digits.

OK, it would expect a value that could have been printed by printf with %.8x.

FWIW, this is what I'm imagining the overall behaviour to be. If printf could have produced the output for a given format specifier, we should accept it, and conversely if it can't produce the output for a given format specifier, we shouldn't accept it.

I'm not sure whether we should consume all digits before applying the precision check or not though. I can see benefits for either side.
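
The "could printf have produced it" rule can be sketched as a round trip: parse the digits, print them back with the precision, and compare. This is an illustration only, restricted to unsigned decimal values; the helper name couldBePrintfOutput is hypothetical and FileCheck does not implement the check this way.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper: true if Text is exactly what printf would print with
// "%.<Precision>llu" for the value Text denotes. Assumes Precision >= 1 and
// that the value fits in unsigned long long.
bool couldBePrintfOutput(const std::string &Text, int Precision) {
  if (Text.empty() ||
      Text.find_first_not_of("0123456789") != std::string::npos)
    return false;
  unsigned long long Value = std::strtoull(Text.c_str(), nullptr, 10);
  char Buffer[64];
  std::snprintf(Buffer, sizeof(Buffer), "%.*llu", Precision, Value);
  return Text == Buffer;
}
```

With a precision of 5, this accepts 00123 and 123456 but rejects 123 (too few digits) and 000123 (more leading zeros than the precision requires).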

thopre edited the summary of this revision. (Show Details)Jun 18 2020, 12:53 AM

We currently accept numbers with leading zeroes but printf would not produce those without a precision. Should we start by fixing this then?

I think we need leading zeros to be accepted until we have an alternative in place. Otherwise, there may be existing tests that rely on the current behaviour which we can't migrate. I think that means a rough order of: 1) Add precision support; 2) Migrate existing tests to use it where needed; 3) Stop accepting leading zeros except via precision. 2) and 3) can probably be done at the same time. We should only do them as part of 1) if it's harder to keep them separate, in my opinion.

I want to raise one point. Some people may expect the format specifier to be similar to scanf, instead of printf. scanf uses similar but less powerful format specifiers than printf. For instance, . is not valid in scanf. %.4u should fail (though glibc appears to do weird things; musl is good). In scanf, %4u reads at most 4 digits, not exactly 4 digits. The only way is %4c plus a conversion -> this is certainly not suitable in FileCheck. Anyway %.4u still looks good to me.

If no variable is captured, is the syntax [[#%.4u:]]?

I want to raise one point. Some people may expect the format specifier to be similar to scanf, instead of printf. scanf uses similar but less powerful format specifiers than printf. For instance, . is not valid in scanf. %.4u should fail (though glibc appears to do weird things; musl is good). In scanf, %4u reads at most 4 digits, not exactly 4 digits. The only way is %4c plus a conversion -> this is certainly not suitable in FileCheck. Anyway %.4u still looks good to me.

That's exactly the point of the second question in the description. Capturing a variable feels more like scanf but I think a unified syntax makes more sense. This is where we need to diverge from the printf/scanf analogy. Since the accepted format is defined explicitly in the documentation, I don't think it's a big problem.

If no variable is captured, is the syntax [[#%.4u:]]?

It would be #%.4u or simply #%.4 since u is the default format specifier.

Nice. [[#%.4u]] (non-capturing) and [[#%.4u,ADDR:]] (capturing) looks good to me. Might be worth noting that it is not a scanf-supported specifier.

thopre updated this revision to Diff 284150.Aug 8 2020, 3:39 PM
thopre edited the summary of this revision. (Show Details)
thopre removed subscribers: MaskRay, ikudrin.

Finish implementation based on consensus reached on questions raised by the proof of concept version.

thopre updated this revision to Diff 284233.Aug 9 2020, 1:44 PM

Add example of precision in documentation

Functionality looks reasonable, although I haven't checked the testing yet.

llvm/docs/CommandGuide/FileCheck.rst
738

If we expand this out, the full syntax is apparently [[#%.<precision><precision><conversion specifier>,<NUMVAR>:]], which I don't think is what you mean :-)

742–743

Should we say something about leading zeros beyond those required by the precision value?

746

Nit: There's a double space after "to".

765
llvm/lib/Support/FileCheck.cpp
49

StringRef?

796

Can you fix the case of fmtloc whilst you're modifying this line, please?

llvm/lib/Support/FileCheckImpl.h
86
thopre updated this revision to Diff 284632.Aug 11 2020, 3:23 AM
thopre marked 6 inline comments as done.

Address most comments

llvm/docs/CommandGuide/FileCheck.rst
742–743

Is that what you expected?

llvm/lib/Support/FileCheck.cpp
49

ostringstream below does not understand StringRef so I would need to do .str() which can be expensive. Any reason not to keep const char*?

grimar added inline comments.Aug 11 2020, 3:23 AM
llvm/lib/Support/FileCheck.cpp
70

Seems you should be able to do the following instead?

return (RegexPrefix + Twine(Precision) + "}").str();

86

Perhaps, it might be simpler just to merge switches and write the logic here as:

Expected<std::string> ExpressionFormat::getWildcardRegex() const {
  // Reject the invalid format up front so the switch only sees real kinds.
  if (Value == Kind::NoFormat)
    return createStringError(std::errc::invalid_argument,
                             "trying to match value with invalid format");
  switch (Value) {
  case Kind::Unsigned:
    // With a precision, require at least that many digits (no sign here).
    if (Precision)
      return ("([1-9][0-9]*)?[0-9]{" + Twine(Precision) + "}").str();
    return std::string("[0-9]+");
  case Kind::Signed:
    ...
  default:
    llvm_unreachable("....");
  }
}

780

Use trim?

FormatExpr.trim(SpaceChars)
llvm/unittests/Support/FileCheckTest.cpp
165

This will fail if NumStr is empty. That may be OK (I guess so), but it would be a bit cleaner to use StringRef::startswith.

171
PaddedStr = "-";
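
A minimal sketch of the prefix-check approach suggested above, written against std::string rather than StringRef so that it is self-contained. The helper name padToPrecision and its exact behaviour are assumptions for illustration, not the unit test's actual code.

```cpp
#include <string>

// Hypothetical helper: pads the digits of NumStr with leading zeros up to
// Precision digits, keeping a leading '-' in front of the padding.
std::string padToPrecision(const std::string &NumStr, unsigned Precision) {
  std::string PaddedStr;
  std::string Digits = NumStr;
  // Checking the prefix (instead of indexing NumStr[0]) is safe even when
  // NumStr is empty.
  if (NumStr.rfind("-", 0) == 0) {
    PaddedStr = "-";
    Digits = NumStr.substr(1);
  }
  if (Digits.size() < Precision)
    PaddedStr.append(Precision - Digits.size(), '0');
  return PaddedStr + Digits;
}
```

The sign is emitted first and the zeros are inserted between it and the digits, which mirrors how printf pads a negative number printed with a precision.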
thopre updated this revision to Diff 284635.Aug 11 2020, 3:39 AM
thopre marked 6 inline comments as done.

Address more review comments

thopre added inline comments.Aug 11 2020, 3:40 AM
llvm/lib/Support/FileCheck.cpp
86

I'm not a big fan of repeating the formatting logic for the Precision case so I've kept that bit as is. What do you think of the result?

grimar added inline comments.Aug 11 2020, 3:56 AM
llvm/lib/Support/FileCheck.cpp
86

I see 2 possible improvements:

  1. When you have a dedicated RegexPrefix variable, you postpone the return and have to add breaks everywhere. If you just do not want to repeat the formatting logic, I'd suggest adding a small helper, e.g.:

// Capture `this` so the lambda can read the Precision member.
auto CreatePrecisionRegex = [this](StringRef S) -> std::string {
  return (S + Twine(Precision) + "}").str();
};

switch (Value) {
case Kind::Unsigned:
  if (Precision)
    return CreatePrecisionRegex("([1-9][0-9]*)?[0-9]{");
  return std::string("[0-9]+");
default:
  llvm_unreachable("ddd");
}

The main benefit is that you can return early and avoid having one more variable.

  2. Perhaps it doesn't make much sense to use createStringError for the default case? It is unreachable now and can't be tested either (I believe).

So I'd either remove the if (Value == Kind::NoFormat) block and handle the error in the default, like you initially did,
or keep it and switch to using llvm_unreachable in default.

grimar added inline comments.Aug 11 2020, 3:58 AM
llvm/lib/Support/FileCheck.cpp
86

Oh, and for 1) there is no need to use -> std::string:

auto CreatePrecisionRegex = [this](StringRef S) {
  return (S + Twine(Precision) + "}").str();
};
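
Putting the suggestions together, a self-contained sketch of the early-return shape might look as follows. It uses std::string instead of the LLVM types, the struct name ExpressionFormatSketch is hypothetical, and the signed pattern is an assumption (the thread elides that case).

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for the class under review; not the actual LLVM code.
struct ExpressionFormatSketch {
  enum class Kind { Unsigned, Signed };
  Kind Value;
  unsigned Precision; // 0 means "no precision given".

  ExpressionFormatSketch(Kind K, unsigned P) : Value(K), Precision(P) {}

  std::string getWildcardRegex() const {
    // Capturing `this` gives the lambda access to Precision.
    auto CreatePrecisionRegex = [this](const std::string &S) {
      return S + "{" + std::to_string(Precision) + "}";
    };
    switch (Value) {
    case Kind::Unsigned:
      if (Precision)
        return CreatePrecisionRegex("([1-9][0-9]*)?[0-9]");
      return std::string("[0-9]+");
    case Kind::Signed: // Assumed pattern: optional minus sign, then digits.
      if (Precision)
        return CreatePrecisionRegex("-?([1-9][0-9]*)?[0-9]");
      return std::string("-?[0-9]+");
    }
    assert(false && "unhandled format kind");
    return std::string();
  }
};
```

Each case returns as soon as it knows its answer, so no prefix variable or break statements are needed, and the formatting logic lives in one place.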
thopre updated this revision to Diff 284647.Aug 11 2020, 4:22 AM
thopre marked 2 inline comments as done.

Add review comments

thopre added inline comments.Aug 11 2020, 4:23 AM
llvm/lib/Support/FileCheck.cpp
86

Ah yes, I started doing it your way and changed in the middle. I'll remove the top if block

jdenny added inline comments.Aug 11 2020, 9:26 AM
llvm/docs/CommandGuide/FileCheck.rst
732–733

"%<fmtspec> is an optional" -> "%<fmtspec>, is an optional"? That is, you must either have %<fmtspec> and , or neither, right?

"the what" -> "what"

733

"how many leading zeros" -> "how many digits" given that you can directly specify the latter (as a minimum) but not the former?

758

IMM->ADDR

The documentation above says 8 is the minimum, but F0F0 has 4 digits.

761

Isn't : supposed to be ,? That's how the tests seem to work, and FileCheck complains when I try this syntax with :.

769

"variable" -> "variables,"

776

When <expr> is empty (here or in the variable definition syntax), then the precision specifier specifies the minimum number of digits to be matched, right?

When <expr> is non-empty, then the precision specifier combined with the actual value of the expression specifies an exact number of digits to be matched, right? I understand that the precision is a minimum here too, but I think it's a printing/substitution minimum not a matching/capturing minimum.

My point is that this case is a bit hard to follow. It seems to me that the numeric substitution syntax with no <expr> is actually more like a variable definition syntax with no variable (and thus no :): there's no existing value to match against, so there's nothing to "substitute". Instead you're capturing a new value and either saving it as a variable or discarding it. Can we document it that way?

If so, instead of calling the first syntax "The syntax to define a numeric variable", you might call it "The syntax to capture a numeric value". It can optionally define a numeric variable.

jhenderson added inline comments.Aug 13 2020, 2:10 AM
llvm/docs/CommandGuide/FileCheck.rst
742–743

I think that is much simpler.

llvm/lib/Support/FileCheck.cpp
48

This doesn't compile. I don't think you can use -> in a capture list. You just need to specify this and then use appropriately below.

llvm/test/FileCheck/numeric-expression.txt
147

Same goes elsewhere.

llvm/unittests/Support/FileCheckTest.cpp
142

I think you could simplify this code by starting with std::string ExtendedInput = Input; and then just using ExtendedInput in the checks below.

153–177

It sounds to me like this is really just two completely different functions. I'd recommend splitting.

thopre updated this revision to Diff 285356.Aug 13 2020, 6:41 AM
thopre marked 12 inline comments as done.

Address all remaining review comments

llvm/docs/CommandGuide/FileCheck.rst
776

I like the idea of distinguishing between capturing a value and substituting a value. Good call.

llvm/lib/Support/FileCheck.cpp
48

It's what I found out before I submitted this diff; I must have forgotten to undo the change. Sorry about that.

thopre retitled this revision from [RFC, FileCheck] Add precision to format specifier to [FileCheck] Add precision to format specifier.Aug 19 2020, 8:49 AM

I think this is basically ready now, barring my example comment.

llvm/docs/CommandGuide/FileCheck.rst
755–759

If this example is meant to demonstrate the precision as well as conversion, it probably makes sense to say something like "but would not match mov r5, 0x00F0F0FEFE" and/or change the example to mov r5, 0x0000F0F0, so that it shows the precision behaviour.

thopre updated this revision to Diff 286738.Aug 20 2020, 1:35 AM
thopre marked an inline comment as done.

Better demonstrate precision in documentation

llvm/docs/CommandGuide/FileCheck.rst
755–759

Good point.

jhenderson accepted this revision.Aug 20 2020, 1:43 AM

LGTM, but best wait for someone else to confirm too.

This revision is now accepted and ready to land.Aug 20 2020, 1:43 AM

Ping anyone else?

grimar accepted this revision.EditedAug 30 2020, 2:39 AM

I've debugged this and it LGTM.
Have a few minor suggestions about the code (up to you).

llvm/lib/Support/FileCheck.cpp
49

Perhaps, a bit cleaner would be to add the "{" right here.

100

You can just use the value you have already.

119
You can combine these cases I think:
case Kind::HexUpper:
case Kind::HexLower:
  AbsoluteValueStr = utohexstr(AbsoluteValue, Value == Kind::HexLower);
  break;
This revision was automatically updated to reflect the committed changes.
thopre marked 3 inline comments as done.