This patch is part of a patch series to add support for FileCheck
numeric expressions. This specific patch adds support for selecting a
matching format to match a numeric value against (i.e. decimal, lower case
hex or upper case hex).
This commit allows selecting the format a numeric value should be
matched against. The following formats are supported: decimal value,
lower case hex value and upper case hex value. The matching format
affects both the format of the numeric value to be matched and the
format of the numbers accepted in a definition with an empty numeric
expression constraint.
In the absence of a format, the default is decimal, unless the numeric
expression constraint is non-null and uses a variable, in which case the
format is the one used to define that variable. A conflict of formats
when several variables are used is diagnosed and forces the user to
select a matching format explicitly.
This commit also enables immediates in numeric expressions to be in any
radix known to StringRef's getAsInteger method, except for legacy
numeric expressions (i.e. [[@LINE+<offset>]]) which only support decimal
immediates.
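For illustration (an assumed example, not taken from the patch's tests, with %u/%x/%X as the assumed spellings of the unsigned decimal, lower case hex and upper case hex specifiers and made-up variable names), a check file could contain:

CHECK: Address: 0x[[#%X,ADDR:]]
CHECK: Count: [[#%u,COUNT:]]
CHECK: Next: 0x[[#ADDR+0x10]]

The first directive matches an upper case hex number and defines ADDR; the second matches an unsigned decimal number and defines COUNT; the third has no explicit format and therefore inherits the upper case hex format from ADDR, and its 0x10 immediate relies on the radix support described above. An expression mixing ADDR and COUNT would have conflicting implicit formats and would require an explicit format specifier.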
Copyright:
Linaro (changes up to diff 183612 of revision D55940)
GraphCore (changes in later versions of revision D55940 and in new revision created off D55940)
I'm getting confused by the syntax here not matching the syntax above for defining a numeric variable. Aren't the two essentially the same syntax, just with different parameters missing? If I'm not mistaken they can be unified to [[#%<fmtspc>,<NUMVAR>:<expr>]] where <expr> says what this matches against (if specified), <NUMVAR> says what numeric variable to store the result in (if specified) and <fmtspc> defines the format of the expression (if any) and that stored in the variable (if any).
Aside: does #%X match a hex number, but not store it?
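For concreteness, a few instances of that unified form with different parts omitted (purely illustrative, with made-up variable names):

0x[[#%X,VAR2:VAR1+4]]  (format, definition and expression)
0x[[#%X,VAR2:]]        (no expression: match an upper case hex number and define VAR2)
0x[[#%X,VAR1+4]]       (no definition: match the value of VAR1+4 in upper case hex)
[[#VAR1+4]]            (no format: use the default/implicit format)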
I would call this UpperCase or Capitalize. If it is only ever used for hex values (I'm not sure if there are plans for e.g. uppercasing string values later on), storing an enumeration with HexUpper, HexLower and Unsigned values could also make sense.
I agree that an enum seems like the right way forward here, possibly with Conflict and Unspecified as other values in that enum.
Readability is in the eye of the beholder. I personally find it more readable to compare against nullptr, because it is then clear that Expression is a pointer, not a boolean.
I feel like you're losing coverage here: where are command-line defined numeric variables tested with an implicit format specifier? I'd have a separate test for format specifiers, that explicitly focus on testing those and nothing else.
Do you really need every one of these test cases? It feels like the last one would be enough (and one with no spaces, which you have earlier on). Same goes for other instances like this elsewhere.
The grammar of this statement needs significant improvement. How about "Explicitly specified format can override conflicting implicit formats."
This test case should be after the one that shows that explicit format specifiers override non-conflicting implicit ones, and probably also after a test case showing what happens when they conflict.
In one of the other changes, you went to some effort to improve the readability of these using --strict-whitespace, so that the '^' line up with the correct thing. Could you replicate these improvements in this patch too, please?
Why make_shared and not make_unique? In fact, why is this a smart pointer at all? Why not just create this on the stack, i.e. FileCheckNumericVariable FooVar (1, "FOO", &NumVarExpr);
Same comment goes in all sorts of places elsewhere.
The patch initially only changed ERRCLIFMT into ERRCLINAME. Later, when rebasing it on the updated previous patch with the -- syntax, I missed the - to -- change when resolving conflicts in this file. The same goes for all other issues you mention in an earlier patch that end up in a later patch: either incorrect merging (which happens a lot when the conflict spans many lines because it's easy to miss something) or it's new code and I forgot to apply the fix there as well (for example the eg -> e.g. fixes).
In other words, my apologies for making you repeat your point; it's due to my being absent-minded at times.
I don't know about anybody else, but LeftFormat != NoFormat ? LeftFormat : RightFormat is more readable to me (and removes the need to hard-code the enum value).
Some minor comments, but generally looking good to me.
We have a few downstream tests that will really benefit from matching Hex upper/lower so looking forward to this landing.
I agree, checking != NoFormat seems clearer to me rather than relying on NoFormat being zero.
Alternatively, we could use Optional<ExpressionFormat> but that seems like unnecessary overhead.
Not sure about the FormatValue name for the enum. Maybe something like Kind or Type?
Then this would be ExpressionFormat::Kind::Unsigned which I think reads slightly better.
Instead of checking for a comma (which might be allowed to appear after the : in the future), I would check if the next non-whitespace character is a %. Or to simplify this we could require the % to immediately follow the # character?
The hex comments don't match the order of the enum.
Since you are documenting all other members, maybe add "Value should be printed as an unsigned decimal number." to Unsigned?
I find this slightly surprising. If I define a numeric variable on the command line with hex format, I would expect the value to be parsed as a hex number, i.e. 0x12 and not 0xc.
I'm not sure I follow the intent of your message: are you against the syntax (i.e. there should be no comma at all) or the way it is parsed? Anyway, since the format comes first, I'm not sure how allowing a comma in the later part of the syntax would be a problem. If I remember correctly, its use in the syntax was suggested to match the printf/scanf API. Otherwise we'd have #%d FOOBAR:N+1, which I find less easy to read.
Sorry about the ambiguity. The current syntax is perfectly fine.
I was suggesting a change to the parsing code to check that the first non-whitespace char is a % first instead of searching for a comma. But this would only change which error messages are reported for certain invalid input, so it shouldn't really matter.
I'm not entirely sure what the best solution there is.
I think NUMVAL2 should always evaluate to 19.
However, I don't have a strong preference whether it should capture decimal 19 or hex 0x13 (i.e. parse the number in the expression as decimal, but inherit the format from the other variable).
Alternatively we could require an explicit format for those (probably rare?) cases.
This comment isn't clear to me. I assume what it means by "if it can be matched" is "if it is a format that can be used in a match" or something similar?
Yes, in one case one could diagnose a missing comma while in the other a missing percent sign. This approach allows reordering the code that parses each element of a numeric substitution (format specifier, variable definition, expression) when the internal API changes, as was done in an earlier version of this patch. If you don't mind, I'll keep it this way.
Is the way you expect 12 to be interpreted in '-D#%X,NUMVAL=12' and in '-D#%X,NUMVAL1=7 -D#NUMVAL2=12+NUMVAL1' different because of the explicit vs. implicit format, or because you see the former as similar to a value in an input text (and thus 12 as 0x12) and the latter as a literal in a numeric expression (and thus 12 as decimal 12)?
I find it even more confusing to distinguish the behavior of literals between implicit and explicit format (e.g. '-D#%X,NUMVAL1=7 -D#%X,NUMVAL2=12+NUMVAL1' would interpret 12 as 0x12 then).
I also see several reasons to avoid the distinction between the 2 examples based on whether a variable is present after the equal sign:
complexity: (i) more text needs to be added to the documentation to describe this distinction, making the feature harder to comprehend IMHO, and (ii) the parsing code needs to distinguish between these 2 cases
consistency: the same syntax is used in both cases but the behaviour is different due to the use of a numeric variable after the equal sign
unnecessary: the format specifier is needed because (i) the input text can contain a mixture of text and numeric values and (ii) hex numeric values can come with or without a prefix, in lowercase or uppercase. It is thus necessary to indicate when something like "dead" is to be interpreted as a numeric value and when it shouldn't. There is no such problem in a numeric substitution (whether defined on the command line or in a check file) since there is no text in it and it is under the control of the test writer, who can add a prefix.
Therefore I strongly think anything after the equal sign is to be interpreted as a numeric expression and not an input. The distinction might appear clearer in the provision made by one of the later patches to support numeric substitutions such as #%X,NUMVAL:<12 where the numeric expression value would be 12 and the input value could be 11 (the input being B) and thus match.
I've added an example in the documentation to document that a literal is always interpreted as decimal in the absence of a prefix (0x12 is allowed and interpreted as hex). Hopefully that'll alleviate some of your concerns.
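To make that concrete (a hypothetical example with a made-up variable name, following the behaviour described above):

-D#%X,NUMVAL=12 defines NUMVAL as decimal 12, so [[#NUMVAL]] would match C in the input
-D#%X,NUMVAL=0x12 defines NUMVAL as 18, so [[#NUMVAL]] would match 12 in the input

In both cases [[#NUMVAL]] inherits the upper case hex format from NUMVAL's definition.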
valueFromStringRepr does not give an error in that case as it expects the value to match the format. This is because getAsInteger does not allow checking the casing and I don't want to make the function more complex by checking it.
FWIW, I would find treating any number without 0x as hex to be ambiguous. I don't think the format specifier really should make a difference to that. After all, if you think of it in printf terms: printf("0x%x", 12) prints "0xc", not "0x12".
@jhenderson I presume you were replying to Paul? Because FileCheck will indeed check the input file for 0xC when encountering CHECK: 0x[[#%X, 12]] so it's all good. As to the prefix, I suppose you meant for the literals inside the # syntax, since a prefix is not needed for printf/scanf: scanf("%X", &x) will happily scan CAFE without expecting a prefix.
@probinson Since we are talking about parallels between variable definition and scanf on one side and numeric substitution and printf on the other side, we can look at the following examples:
0x[[#%X, VAR2:]] is equivalent to: scanf("%X", &VAR2).
0x[[#%X, VAR1+12]] when VAR1 was defined with -D#VAR1=3 will be equivalent to checking the output of the following against the input file:
VAR1=3;
printf("0x%X", VAR1+12);
Note that C interprets 12 as decimal despite the %X format specifier of the printf, which only influences the conversion of the resulting numeric value (3+12=15 is converted to F, and thus the input file is checked for "0xF").
#SOMEVAR:<EXPR> is a shorthand for both, and thus a combination of printf+scanf with some input file check in between, e.g. 0x[[#%X, VAR2:VAR1+12]] when VAR1 is defined with -D#VAR1=3 is equivalent to checking the output of the following against the input file:
VAR1=3;
printf("0x%X", VAR1+12);
and scanning the matched input with:
scanf("0x%X", VAR2);
The case of #%X,VAR:12 is then just a special case, i.e. matching the input file against the output of printf("%X", 12) and capturing the matched text into VAR as with scanf("%X", &VAR).
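Putting the analogy into a small standalone C sketch (illustrative only, the program and variable names are not part of FileCheck):

#include <stdio.h>

int main(void) {
  unsigned VAR1 = 3, VAR2 = 0;
  char matched[32];
  /* For 0x[[#%X, VAR2:VAR1+12]] with -D#VAR1=3, the text FileCheck looks
     for in the input is the expression value printed with %X: "0xF". */
  snprintf(matched, sizeof matched, "0x%X", VAR1 + 12);
  /* The definition part then captures the matched text as if by scanf. */
  sscanf(matched, "0x%X", &VAR2);
  printf("expected text: %s, captured VAR2: %u\n", matched, VAR2);
  return 0;
}

Compiling and running this prints "expected text: 0xF, captured VAR2: 15", mirroring how the substitution both constrains the match and defines VAR2.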
LGTM from me on this. I can't think of anything else I'd like addressing, but please wait for @arichardson and anybody else who wants to comment to be happy.
I was responding to the general conversation, and giving my thoughts, so not addressing anybody specifically. Yes, the as-it-stands behaviour is what I would prefer, precisely for the reasons you outlined with the printf semantics.
@thopre Thanks for the detailed explanation. I think the current behaviour in the tests makes sense. Having the match format not affect the parsing of what's to the right of the = seems simpler and avoids problems.
My apologies but I still think there are changes needed on this patch. I had changed the status earlier but I presume updating the patch reset the status to "needs review". I haven't covered all the API changes in the unittests and I'd like to address the comment from Paul about the lack of clarity of which variables have an implicit format conflict. I'm happy to deal with the latter in a separate patch if people are eager to have this change (I'm personally looking forward to it being committed as well) but I want to at least finish the unit testing.
Since this is already a sizable diff, I'd like to address this in a separate patch if you don't mind. It'll require recording more parsing state and I'm thinking about using a parser object for that and thus simplifying the interface of parsing functions (lots of info would be kept as internal state of the object).
Is everybody happy with the constness changes to the ExpressionFormat operators and the new unit tests?
Looks fine to me. I would say this can be committed if @jhenderson is happy with it and all tests pass. If there are any remaining issues those can always be fixed in a follow-up commit.
It won't do any harm, I think, but it's inconsistent with style elsewhere, so I think it should be deleted (the const on the method is fine though of course).
Nit: this comment isn't quite right. You probably wanted "methods' output", since it's the output of the methods. However, I think a clearer phrasing for this and the similar comments would be "Check unsigned decimal format methods", or possibly even "Check unsigned decimal format properties".
In fact, now that I think about it, perhaps it would make more sense for this test to be split into a separate TEST for each different format. The test name would then document what the test is for, and you wouldn't need the comment. You might even want to consider a parameterised test (see the TEST_P function), to avoid code duplication, for the Unsigned, HexLower and HexUpper cases.
I've given this a bit more thought, and I think it would be better here and in similar situations with failing Expecteds to actually check the error contents. You can see examples of this in the DWARFDebugLineTest.cpp unit tests (look for the checkError function for a rough guide).
takeError on an Expected seems weird without first having checked that the Expected actually failed. You probably want EXPECT_THAT_EXPECTED(someFunctionThatReturnsAnExpected(), Failed()); if you aren't going to actually check the properties of the Error within the Expected.
Test "NoFormat" explicitly by passing it into the constructor and then consider a separate test to show that the default constructor generates a NoFormat kind.
I've lost track. What's the difference between parsePatternExpect and parseSubstExpect? Why is it pattern, not subst here? Finally, what do these test cases have to do with formats?
I know this isn't something directly related to your change, so it should be a later one, but you should also remove all uses of errorToBool in favour of checking the actual Error.
Sorry I didn't pick up on those in earlier reviews.
parsePatternExpect will call parsePattern which parses the rhs of a CHECK directive. parseSubstExpect calls parseNumericSubstitutionBlock which parses what's inside a # block. The use of parsePatternExpect is because I'm testing that legacy @LINE expressions only accept what the old [[@LINE]] (without #) accepted before this patch set. The restriction for legacy @LINE expressions is in parseNumericSubstitutionBlock and the private functions it calls, but the detection of a legacy @LINE expression is in parsePattern, hence the use of parsePatternExpect.
The last of the 3 tests is format related because this patch also enables hex literals (I didn't want to make a separate patch for a few-line diff). The other 2 should have been done in an earlier patch. I've split these 2 into a separate patch: https://reviews.llvm.org/D72912
It's an ASSERT because if the match fails it does not make sense to check Matches[0]. I've added negative testing; it's only for valueFromStringRepr that it is not possible (I've added a comment for that there).
accepted format specifiers