This is an archive of the discontinued LLVM Phabricator instance.

[libc] fix strtol returning the wrong length
ClosedPublic

Authored by michaelrj on Oct 20 2021, 1:14 PM.

Details

Summary

Previously, strtol/ll/ul/ull would return a pointer to the end of its
parsing, regardless of if it detected a number. Now it will return a
length of 0 when it doesn't find a number.

Diff Detail

Event Timeline

michaelrj created this revision.Oct 20 2021, 1:14 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 20 2021, 1:14 PM
michaelrj requested review of this revision.Oct 20 2021, 1:14 PM
sivachandra added inline comments.Oct 20 2021, 1:53 PM
libc/src/__support/str_conv_utils.h
73

Nit: s/seen_digit/is_number?

91

Do we need this if block at all?

michaelrj updated this revision to Diff 381095.Oct 20 2021, 2:15 PM
michaelrj marked 2 inline comments as done.

add tests for lone prefixes

libc/src/__support/str_conv_utils.h
91

it's to handle the case where the number is just "0" and the base is set to automatic. Without this, then that would be parsed as an octal number with no digits, as opposed to a decimal number with one digit. I've added a test to check this.
The other part of the condition was unnecessary though.

sivachandra added inline comments.Oct 20 2021, 11:13 PM
libc/src/__support/str_conv_utils.h
91

May be infer_base is incorrect then? Consider this:

const char *n = "08";
char *next;
long i = ::strtol(n, &next, 0);

Since "08" is an invalid number, should we want the above strtolcall to succeed or a fail? If it is a failure, should next point to 8 or 0? Should it be the same if the base was explicitly specified to be 8?

I also think we can ask the same questions when base is 16.

My reading of the standard says that "08" is an invalid number so no conversion should be performed. Which means, str_end should point to original_src. Likewise, "0xZ" with base 16 is an invalid number so no conversion should be performed and str_end should point to original_src.

michaelrj updated this revision to Diff 381426.Oct 21 2021, 4:23 PM

fix infer_base to not advance the pointer unnecessarily

michaelrj marked an inline comment as done.Oct 21 2021, 4:23 PM
michaelrj added inline comments.
libc/src/__support/str_conv_utils.h
91

infer_base was incorrect, and I have now fixed it so that it doesn't skip over the leading octal 0, and removed this condition.

From what I can tell, in the case of both "08" and "0xZ", when parsed with an automatic base, what we have is a solo octal 0 (and this matches what I see from other libc implementations).

In the C standard §6.4.4.1 it specifies that "A decimal constant begins with a nonzero digit", meaning that those numbers cannot be assumed to be base 10, but that "An octal constant consists of the prefix 0 optionally followed by a sequence of the digits 0 through 7 only". This means that the longest valid interpretation of "08" is that it is an octal 0 followed by some value that is not a valid digit (equivalent to "0Z").

For "0xZ" it is similar, a hexadecimal number has to have a valid hexadecimal digit after the "0x" so this is just 0.

sivachandra accepted this revision.Oct 21 2021, 9:13 PM
sivachandra added inline comments.
libc/src/__support/str_conv_utils.h
45–46

Replace

This will advance the string pointer.

with

This function will advance |src| to the first valid digit in the inferred base.
91

Ah, yes!

I am reading two different parts of the standard in isolation. 6.4.4.1 Has this text:

An octal constant consists of the prefix 0 optionally followed by a sequence of the
digits 0 through 7 only. A hexadecimal constant consists of the prefix 0x or 0X followed
by a sequence of the decimal digits and the letters a (or A) through f (or F) with values
10 through 15 respectively.

Which makes 08 an invalid literal. But, the spec for strto<int> functions also says:

decompose the input string into three parts: an initial, possibly empty, sequence of
white-space characters (as specified by the isspace function), a subject sequence
resembling an integer represented in some radix determined by the value of base, and a
final string of one or more unrecognized characters, including the terminating null
character of the input string.

So, in the string "08", '8' is the first unrecognized character, so the string consists of only one valid base 8 character "0". Combining the above two parts of the standard, "08" should be interpreted as just the octal zero. Which is what you are saying I think.

This same argument does not hold for "0x" or "0xZ" because just the prefix 0x or the number 0xZ are not hex numbers at all. But, the octal argument kicks in here as well: so "0x" is octal zero, and "0xZ" is also octal zero. I just checked what other libcs do and it looks like glibc atleast follows this argument. In both the cases, the end_ptr is updated by only 1 implying that only one valid character was read.

If you are convinced with this, can you update infer_base to reflect all of this. Also add comments quoting the appropriate sections in the standard.

This revision is now accepted and ready to land.Oct 21 2021, 9:13 PM
michaelrj marked 3 inline comments as done.

update comments to reflect the reasoning behing infer_base

This revision was automatically updated to reflect the committed changes.