Page MenuHomePhabricator

[RFC, FileCheck] Add precision to format specifier
Needs ReviewPublic

Authored by thopre on Jun 11 2020, 8:42 AM.

Details

Summary

This is a proof of concept patch to add support for precision in numeric
expression format specifier. I'm looking for feedback on 3 aspects:

  • Should I allow to specify a precision without a conversion specifier, i.e. #%.8, VAR1? The conversion would thus be implicit.
  • Should the regex wildcard for a numeric variable definition with empty expression also respect the precision, i.e. #%.5u, VAR2: would be matched by (([1-9][0-9]*)?[0-9]{5})
  • Should the regex wildcard for a numeric variable definition with empty expression and no precision forbid leading zeros, i.e. #VAR3: would be matched by ([1-9]?[0-9]+) and existing testcase changed to not fail

I personally think the answer to the first two should be yes and the
last one should be no. Thoughts?

Diff Detail

Event Timeline

thopre created this revision.Jun 11 2020, 8:42 AM

I think I agree with your conclusions. More generally, I think we should be permissive, where permissiveness is not going to be surprising (i.e. no explicit format specifier seems reasonable in the general context), and should follow scanf style format specifiers where reasonable. If I follow it right, it would therefore be possible to specify a 16-digit hex field with %.16x, right? Could you clarify what the motivation of the "with empty expression" bit is for? Is that just because when there is an empty expression, your regex is incorrect, or something else?

The code change in general seems simple enough to support the proposal too, though I haven't reviewed it in detail. I'll wait until you've added documentation/tests etc, so that I can review it all at once.

I think I agree with your conclusions. More generally, I think we should be permissive, where permissiveness is not going to be surprising (i.e. no explicit format specifier seems reasonable in the general context), and should follow scanf style format specifiers where reasonable. If I follow it right, it would therefore be possible to specify a 16-digit hex field with %.16x, right? Could you clarify what the motivation of the "with empty expression" bit is for? Is that just because when there is an empty expression, your regex is incorrect, or something else?

When only using a numeric expression, numeric substitution blocks are behaving as printf: print a value as text to be matched against the input. When defining a variable with an empty expression (the majority of definition cases), it behaves more like a scanf. Only printf support a precision in its syntax. Scanf doesn't support it. This is the main reason why I ask the question. It is also the case that we currently allow leading zeros when matching an unknown numeric value for a numeric variable definition with empty expression.

I think the case of a variable defined from an expression is a special case since you are matching something specific so in itself doesn't mandate extending the precision to variable definition with empty expression. However I think allowing a precision when matching an unknown variable is both useful and makes for syntax consistency.

To answer your earlier question, yes it'll be possible to match a 16-bit hex with #%.16x,VAR1: or an 8-bit hex with #%.8x, VAR2:. However #%.8, VAR1 will print all 16-bit of VAR1, same as printf. Does that seem reasonable or should we deviate from printf and give an error in such a case?

The code change in general seems simple enough to support the proposal too, though I haven't reviewed it in detail. I'll wait until you've added documentation/tests etc, so that I can review it all at once.

I think I agree with your conclusions. More generally, I think we should be permissive, where permissiveness is not going to be surprising (i.e. no explicit format specifier seems reasonable in the general context), and should follow scanf style format specifiers where reasonable. If I follow it right, it would therefore be possible to specify a 16-digit hex field with %.16x, right? Could you clarify what the motivation of the "with empty expression" bit is for? Is that just because when there is an empty expression, your regex is incorrect, or something else?

When only using a numeric expression, numeric substitution blocks are behaving as printf: print a value as text to be matched against the input. When defining a variable with an empty expression (the majority of definition cases), it behaves more like a scanf. Only printf support a precision in its syntax. Scanf doesn't support it. This is the main reason why I ask the question. It is also the case that we currently allow leading zeros when matching an unknown numeric value for a numeric variable definition with empty expression.

I think the case of a variable defined from an expression is a special case since you are matching something specific so in itself doesn't mandate extending the precision to variable definition with empty expression. However I think allowing a precision when matching an unknown variable is both useful and makes for syntax consistency.

To answer your earlier question, yes it'll be possible to match a 16-bit hex with #%.16x,VAR1: or an 8-bit hex with #%.8x, VAR2:. However #%.8, VAR1 will print all 16-bit of VAR1, same as printf. Does that seem reasonable or should we deviate from printf and give an error in such a case?

I think that seems reasonable to me overall. Thanks for explaining.

MaskRay added inline comments.
llvm/lib/Support/FileCheckImpl.h
56

Prefer default member initializer (unsigned Precision = 0;)

Should the regex wildcard for a numeric variable definition with empty expression also respect the precision, i.e. #%.5u, VAR2: would be matched by (([1-9][0-9]+)? [0-9]{1,5})

I believe I followed the comments about matching behavior for an empty expression (scanf-like) vs. an expression (printf-like). So the above question is about whether, in the empty-expression case, it's worthwhile to support a precision specified by . even though scanf does not support that. Right?

I don't understand the above regex due to the space character after the ?. Was that intended?

Can you give some example inputs and explain the intended matching behavior for #%.5u, VAR2:? Why is this behavior needed in FileCheck but not in scanf?

thopre edited the summary of this revision. (Show Details)Jun 17 2020, 9:53 AM

Should the regex wildcard for a numeric variable definition with empty expression also respect the precision, i.e. #%.5u, VAR2: would be matched by (([1-9][0-9]+)? [0-9]{1,5})

I believe I followed the comments about matching behavior for an empty expression (scanf-like) vs. an expression (printf-like). So the above question is about whether, in the empty-expression case, it's worthwhile to support a precision specified by . even though scanf does not support that. Right?

Correct.

I don't understand the above regex due to the space character after the ?. Was that intended?

No, fixed now.

Can you give some example inputs and explain the intended matching behavior for #%.5u, VAR2:? Why is this behavior needed in FileCheck but not in scanf?

Say the directive is:

CHECK: Address #%.8x,ADDR: is aligned

and the input text is:

Address 12345678 is aligned

I'd expect the directive to match and the value in ADDR to be 0x12345678. Now if the input text was:

Address FFFFFFFF12345678

I'd expect the directive to fail. If the directive was #%x, ADDR: the first input would have led to the same outcome but the second input would have led the directive matching and the value in ADDR to be 0xFFFFFFFF12345678.

Besides whether this is a useful feature, it makes for easier parsing and consistency in the syntax (no difference between variables defined from an expression where the precision would be allowed and variables defined from an empty expression where precision would not be allowed).

Can you give some example inputs and explain the intended matching behavior for #%.5u, VAR2:? Why is this behavior needed in FileCheck but not in scanf?

Besides whether this is a useful feature, it makes for easier parsing and consistency in the syntax (no difference between variables defined from an expression where the precision would be allowed and variables defined from an empty expression where precision would not be allowed).

I forgot to mention that scanf doesn't need this because it's separate from printf (weaker need for consistency) and I guess aims at parsing some value more than checking format.

I don't understand the above regex due to the space character after the ?. Was that intended?

No, fixed now.

It now says #%.5u, VAR2: matches (([1-9][0-9]+)?[0-9]{1,5}), but that matches 123456789. I think that's unintended.

Can you give some example inputs and explain the intended matching behavior for #%.5u, VAR2:? Why is this behavior needed in FileCheck but not in scanf?

Say the directive is:

CHECK: Address #%.8x,ADDR: is aligned

and the input text is:

Address 12345678 is aligned

I'd expect the directive to match and the value in ADDR to be 0x12345678. Now if the input text was:

Address FFFFFFFF12345678

I'd expect the directive to fail.

You mean fail to match and continue searching? Or fail immediately?

So, %.8x is a maximum? For printf, it's a minimum. scanf's %8x (no .) feels more like what you're going for except that it discards additional digits instead of failing to match.

I don't understand the above regex due to the space character after the ?. Was that intended?

No, fixed now.

It now says #%.5u, VAR2: matches (([1-9][0-9]+)?[0-9]{1,5}), but that matches 123456789. I think that's unintended.

Can you give some example inputs and explain the intended matching behavior for #%.5u, VAR2:? Why is this behavior needed in FileCheck but not in scanf?

Say the directive is:

CHECK: Address #%.8x,ADDR: is aligned

and the input text is:

Address 12345678 is aligned

I'd expect the directive to match and the value in ADDR to be 0x12345678. Now if the input text was:

Address FFFFFFFF12345678

I'd expect the directive to fail.

You mean fail to match and continue searching? Or fail immediately?

So, %.8x is a maximum? For printf, it's a minimum. scanf's %8x (no .) feels more like what you're going for except that it discards additional digits instead of failing to match.

My bad, my example was completely wrong. My personal motivation is consistency in the syntax. New example:

I'd expect 0x[[#%.8x, ADDR:]] to match 00001234 or FFFFFFFF12345678 but not 1234 due to there not being enough digits. I guess it could be useful to check alignment in a tool but as I said my main motivation is keeping a common format specifier syntax for all numeric substitution blocks. Note that my regex was indeed wrong anyway, it should be (([1-9][0-9]+)?[0-9]{5}).

thopre edited the summary of this revision. (Show Details)Jun 17 2020, 10:57 AM

I'd expect 0x[[#%.8x, ADDR:]] to match 00001234 or FFFFFFFF12345678 but not 1234 due to there not being enough digits.

OK, it would expect a value that could have been printed by printf with %.8x.

I guess it could be useful to check alignment in a tool but as I said my main motivation is keeping a common format specifier syntax for all numeric substitution blocks. Note that my regex was indeed wrong anyway, it should be (([1-9][0-9]+)?[0-9]{5}).

I think you want + to be * to permit 123456.

What would happen on 012345? Would it match 01234 and leave 5 for a later directive, or would FileCheck fail immediately?

jhenderson added a comment.EditedJun 18 2020, 12:27 AM

I'd expect 0x[[#%.8x, ADDR:]] to match 00001234 or FFFFFFFF12345678 but not 1234 due to there not being enough digits.

OK, it would expect a value that could have been printed by printf with %.8x.

FWIW, this is what I'm imagining the overall behaviour to be. If printf could have produced the output for a given format specifier, we should accept it, and conversely if it can't produce the output for a given format specifier, we shouldn't accept it.

I'm not sure whether we should consume all digits before applying the precision check or not though. I can see benefits for either side.

thopre edited the summary of this revision. (Show Details)Jun 18 2020, 12:53 AM

I'd expect 0x[[#%.8x, ADDR:]] to match 00001234 or FFFFFFFF12345678 but not 1234 due to there not being enough digits.

OK, it would expect a value that could have been printed by printf with %.8x.

FWIW, this is what I'm imagining the overall behaviour to be. If printf could have produced the output for a given format specifier, we should accept it, and conversely if it can't produce the output for a given format specifier, we shouldn't accept it.

I'm not sure whether we should consume all digits before applying the precision check or not though. I can see benefits for either side.

We currently accept numbers with leading zeroes but printf would not produce those without a precision. Should we start by fixing this then?

I'd expect 0x[[#%.8x, ADDR:]] to match 00001234 or FFFFFFFF12345678 but not 1234 due to there not being enough digits.

OK, it would expect a value that could have been printed by printf with %.8x.

FWIW, this is what I'm imagining the overall behaviour to be. If printf could have produced the output for a given format specifier, we should accept it, and conversely if it can't produce the output for a given format specifier, we shouldn't accept it.

I'm not sure whether we should consume all digits before applying the precision check or not though. I can see benefits for either side.

We currently accept numbers with leading zeroes but printf would not produce those without a precision. Should we start by fixing this then?

I think we need leading zeros to be accepted until we have an alternative in place. Otherwise, there may be existing tests that rely on the current behaviour which we can't migrate. I think that means a rough order of: 1) Add precision support; 2) Migrate existing tests to use it where needed; 3) Stop accepting leading zeros except via precision. 2) and 3) can probably be done at the same time. We should only do them as part of 1) if it's harder to keep them separate, in my opinion.

I want to raise one point. Some people may expect format specifier to be similar to scanf, instead of printf. scanf uses similar but less powerful format specifiers than printf. For instance, . is not valid in scanf. %.4u should fail (though glibc appears to be weird things; musl is good). In scanf, %4u reads at most 4 digits, not exactly 4 digits. The only way is %4c plus a conversion -> this is certainly not suitable in FileCheck. Anyway %.4u stills looks good to me.

If no variable is captured, is the syntax [[#%.4u:]]?

I want to raise one point. Some people may expect format specifier to be similar to scanf, instead of printf. scanf uses similar but less powerful format specifiers than printf. For instance, . is not valid in scanf. %.4u should fail (though glibc appears to be weird things; musl is good). In scanf, %4u reads at most 4 digits, not exactly 4 digits. The only way is %4c plus a conversion -> this is certainly not suitable in FileCheck. Anyway %.4u stills looks good to me.

That's exactly the point of the second question in the description. Capturing a variable feels more like scanf but I think a unified syntax makes more sense. This is where we need to diverge from the printf/scanf analogy. Since the accepted format is defined explicitely in the documentation I don't think it's a big problem.

If no variable is captured, is the syntax [[#%.4u:]]?

It would be #%.4u or simply #%.4 since u is the default format specifier.

I want to raise one point. Some people may expect format specifier to be similar to scanf, instead of printf. scanf uses similar but less powerful format specifiers than printf. For instance, . is not valid in scanf. %.4u should fail (though glibc appears to be weird things; musl is good). In scanf, %4u reads at most 4 digits, not exactly 4 digits. The only way is %4c plus a conversion -> this is certainly not suitable in FileCheck. Anyway %.4u stills looks good to me.

That's exactly the point of the second question in the description. Capturing a variable feels more like scanf but I think a unified syntax makes more sense. This is where we need to diverge from the printf/scanf analogy. Since the accepted format is defined explicitely in the documentation I don't think it's a big problem.

If no variable is captured, is the syntax [[#%.4u:]]?

It would be #%.4u or simply #%.4 since u is the default format specifier.

Nice. [[#%.4u]] (non-capturing) and [[#%.4u,ADDR:]] (capturing) looks good to me. Might be worth noting that it is not a scanf-supported specifier.