This is an archive of the discontinued LLVM Phabricator instance.

[locale][num_get] Improve Stage 2 of string to float conversion
AbandonedPublic

Authored by tmatheson on Mar 22 2021, 10:18 AM.

Details

Reviewers
Quuxplusone
zoecarver
miyuki
ldionne
Group Reviewers
Restricted Project
Summary

https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.2

"Stage 2" of num_get::do_get() depends on "a check ... to determine if c is
allowed as the next character of an input field of the conversion specifier
returned by Stage 1". Previously this was a very simple check whether the next
character was in a set of allowed characters. This could lead to Stage 2
accumulating character sequences such as "1.2f" and passing them to strtold
(Stage 3).
https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.3.3

Stage 3 can fail, however, if the entire character sequence from Stage 2 is not
used in the conversion. For example, the "f" in "1.2f" is not used.
https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.3.4

As a result, parsing a sequence like "1.2f" would return value 0.0 with failbit
set.

This change improves the checks made in Stage 2, determining what is passed to
Stage 3.

  • Hex digits are only considered valid if "0x" has been seen
  • INFINITY value is recognised
  • Characters in INFINITY and NAN are only valid in sequence. This is done by checking one character backwards, which has obvious limitations.
  • New tests are added. The old ones are preserved but refactored.

Diff Detail

Event Timeline

tmatheson requested review of this revision.Mar 22 2021, 10:18 AM
tmatheson created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptMar 22 2021, 10:18 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript

Minor change to tests

miyuki added inline comments.Mar 22 2021, 10:33 AM
libcxx/test/std/localization/locale.categories/category.numeric/locale.num.get/facet.num.get.members/get_float.pass.cpp
55

I think there is a reason behind having many similar-looking blocks: each assert() call is located on a separate line, so an assertion failure message will indicate which assertion failed. Factoring out all checks into a single function will make error messages much less informative.

tmatheson updated this revision to Diff 332603.Mar 23 2021, 3:56 AM
  • Add tests for double and long double
  • Replace lambda with macro (there are C++03 builds, and filename/line number are not output when using lambda)
  • Remove constexpr (C++03 builds)
  • Undo refactoring of old tests
tmatheson updated this revision to Diff 332612.Mar 23 2021, 4:25 AM
tmatheson marked an inline comment as not done.

Tidy up tests

tmatheson updated this revision to Diff 334122.Mar 30 2021, 5:49 AM

Replace lambda in header with macro

tmatheson updated this revision to Diff 334185.Mar 30 2021, 9:14 AM

clang-format

tmatheson updated this revision to Diff 335762.Apr 7 2021, 2:28 AM
  • Update ABI symbols to account for added parameter in stage2_float_loop
  • Explicit cast in tests to avoid error when compiling with GCC
Quuxplusone requested changes to this revision.Apr 7 2021, 8:02 AM
Quuxplusone added inline comments.
libcxx/include/locale
1097

C++ doesn't support VLAs; you'd have to put something here that's a constant-expression.
Do I understand correctly that the majority of this patch is just changing 32 and 33 to 36 and 37 respectively? Could you just do that in the simplest possible way? E.g. here

-    char_type __atoms[32];
+    char_type __atoms[36];

That'll help focus attention on whatever details are actually important.

Meanwhile, the important change seems to be that you're adding those extra 4 characters for "tTyY", so that you can parse not only "INF" and "NAN" but also "INFINITY" (producing INF). Is this required by the Standard? Why didn't we have any tests for it before now?

I tentatively suggest that you split out the "INFINITY" change+test into its own PR (with a summary that cites chapter and verse for why this is needed); and then let's see what remains in this PR after that.

libcxx/test/std/localization/locale.categories/category.numeric/locale.num.get/facet.num.get.members/get_double.pass.cpp
57

INFxyz, INFinity, INFinite, INFiNiTy

66

Shouldn't

This revision now requires changes to proceed.Apr 7 2021, 8:02 AM
miyuki added inline comments.Apr 7 2021, 8:09 AM
libcxx/include/locale
536

I think it would be better to keep 4-space indentation for consistency with the rest of the file. See https://github.com/llvm/llvm-project/commit/7004d6664efde9d1148ed677649593f989cc6056

543

Would it be easier to handle these special cases outside of the loop?

miyuki added inline comments.Apr 7 2021, 8:22 AM
libcxx/include/locale
1097

I guess

char_type __atoms[__num_get_base::__n_atoms_float];

will work. We already have something similar on line 1089:

unsigned __g[__num_get_base::__num_get_buf_sz];
Quuxplusone added inline comments.Apr 7 2021, 9:00 AM
libcxx/include/locale
1097

(For the record, I don't want to see char_type __atoms[__num_get_base::__n_atoms_float]; I want to see char_type __atoms[36];, in a separate PR, with explanation of why we need to parse INFINITY as a float, and appropriate tests.)

tmatheson marked an inline comment as done.Apr 7 2021, 9:29 AM
tmatheson added inline comments.
libcxx/include/locale
536

I agree, but the CI was failing if I didn't clang-format the patch.

543

Yes, and I tried but there are two problems with doing so. First it requires an even bigger refactor to get it to work. But more importantly, the standard is pretty over-specified here, to the point of essentially dictating the algorithm to use: https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.2

Therefore while moving these special cases outside of the loop would probably make a better algorithm, it would be a deviation from the standard.

1097

It seemed slightly more readable with one less magic number, but I can change it. This whole section of code is pretty obscure.

String should be converted to float by the rules of strtold: https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.3.3
INFINITY is parsed by strtold as described here: https://en.cppreference.com/w/c/string/byte/strtof
(C11 standard, "7.22.1.3 The strtod, strtof, and strtold functions")

I don't know why there are no tests for it, but the tests in these files are pretty minimal and haven't been significantly changed since "libcxx initial import".

I will remove the INFINITY stuff from this PR.

tmatheson updated this revision to Diff 335857.Apr 7 2021, 10:02 AM
tmatheson marked an inline comment as done.
  • Remove handling of INFINITY and associated tests, add a few tests for variations of INF
tmatheson marked 4 inline comments as done.Apr 7 2021, 10:16 AM
tmatheson added inline comments.
libcxx/test/std/localization/locale.categories/category.numeric/locale.num.get/facet.num.get.members/get_double.pass.cpp
57

Added INFxyz and some crazy casing for INF.

It is worth noting that (by my reading of the standard, at least) INFinite is required to fail because:

  • Stage 1 will select %g as the format specifier
  • Stage 2 will keep consuming characters until it reaches the 'e', which is the first character that is not valid at that point in the sequence
  • Stage 3 will then try to process the strong "INFinit" and should only process the first 3 characters (INF). Therefore the whole string will not be read, and num_get should return 0.0 with failbit

I suggest we save that discussion for the INFINITE PR though.

libcxx/include/locale
536

The clang-format CI step is non-fatal; it's just there to tell you what parts fail clang-format. It is not intended as a black-and-white gatekeeper, and requests-for-consistency from real humans should always take precedence over the tool's suggestions — Asimov's Second Law applies. :)

libcxx/test/std/localization/locale.categories/category.numeric/locale.num.get/facet.num.get.members/get_double.pass.cpp
57

The point of testing "INFINITE" is for two reasons:

  • There is a right answer. Test that we produce the right answer.
  • It's a "garden-path" input: we want to test that the parser doesn't get confused by seeing "INFINI...", and will correctly backtrack to consume only the "INF" part, instead of failing when it fails to find a "Y" character at the end of the string.

So, please test its behavior (in the appropriate PR which it sounds like you're splitting out; nice!). I expect that behavior to resemble the behavior for "INFxyz"; but if it doesn't, then we should investigate why.

tmatheson updated this revision to Diff 335906.Apr 7 2021, 12:36 PM
tmatheson marked an inline comment as done.

Update formatting to match the rest of the file

tmatheson marked 2 inline comments as done.Apr 7 2021, 12:42 PM
tmatheson added inline comments.
libcxx/include/locale
536

Sorry my mistake, I thought it had caused one of the failures I had earlier

libcxx/test/std/localization/locale.categories/category.numeric/locale.num.get/facet.num.get.members/get_float.pass.cpp
55

This was addressed by using a macro for new tests and not refactoring the old tests.

tmatheson updated this revision to Diff 337394.Apr 14 2021, 3:50 AM

Update abilist for MacOS C++20

Ping. @Quuxplusone were there any outstanding issues you wanted me to address?

Also, can anyone suggest what I should do about the ABI issues? The function I've changed (__num_get<_CharT>::__stage2_float_loop) doesn't seem like it should be part of the public ABI. After updating the expected names there are still CI failures on MacOS. Should the changes be wrapped with the macros described here? https://libcxx.llvm.org/docs/DesignDocs/ABIVersioning.html

Friendly ping

Also, can anyone suggest what I should do about the ABI issues? The function I've changed (__num_get<_CharT>::__stage2_float_loop) doesn't seem like it should be part of the public ABI. After updating the expected names there are still CI failures on MacOS. Should the changes be wrapped with the macros described here? https://libcxx.llvm.org/docs/DesignDocs/ABIVersioning.html

Yes the additional argument changes the ABI and should be an opt-in for the user. This can be done by adding a new define in include/__config in the block #if defined(_LIBCPP_ABI_UNSTABLE) || _LIBCPP_ABI_VERSION >= 2.

tmatheson updated this revision to Diff 344069.May 10 2021, 8:33 AM

Place breaking ABI change behind a macro

Yes the additional argument changes the ABI and should be an opt-in for the user. This can be done by adding a new define in include/__config in the block #if defined(_LIBCPP_ABI_UNSTABLE) || _LIBCPP_ABI_VERSION >= 2.

Thank you @Mordante, the change is now hidden behind the ABI flag.

I left some more minor comments... but FYI, this patch's actual functional change is above my pay grade. You'll have to interest @ldionne or someone like that in reviewing its actual functional change.
If the intent is speed, it would help for you to add to this PR a benchmark following the pattern in libcxx/benchmarks/.

libcxx/include/locale
656

#define _Toupper(x) ((x) & 0x5F) — "__UPPERCASE" reads to me like it's testing uppercaseness.

Separately: check throughout for /preceed/preced/.

libcxx/test/std/localization/locale.categories/category.numeric/locale.num.get/facet.num.get.members/get_float.pass.cpp
50

I'd still like to see the test strings "INFINITE" and "INFINITY" somewhere in these tests. Also INAN, now that I think of it (because it looks like it's gonna be "INF" and then switches to "NAN" in the middle) — and any other "white-box" test cases you can think of.
Also every-possible-prefix-of-a-correct string: 0x, 123e, 123e+, 123e-. (Maybe these are already covered somewhere?)

libcxx/test/std/localization/locale.categories/category.numeric/locale.num.get/facet.num.get.members/get_long_double.pass.cpp
43

Is this ever defined during testing, though?

Thanks @Quuxplusone. If @ldionne can't be tempted into taking a look is there anyone else appropriate?

This patch is about correctness (still a long way to go, see below) rather than speed/performance so I won't add a benchmark.

libcxx/test/std/localization/locale.categories/category.numeric/locale.num.get/facet.num.get.members/get_float.pass.cpp
50

I'll try to explain what the issue is with adding "intinite" and "infinity" tests -- besides the fact that "infinity" worked before I removed it on request :)

On one hand, we have the ideal behaviour, which is that do_get handles anything that strtod can handle.

Then we have the over-specified wording in the standard, which actually defines do_get in three stages, and requires us to (paraphrasing) "loop until we reach an invalid character, then stop and pass the substring to strtod for parsing."

(The actual wording: "a check is made to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1. If so, it is accumulated.")

This wording in the standard is probably why we have this slightly bizarre coding style in the first place, with the body of the loop factored out (and the signature added to the ABI).

With a garden path input like "infinite" you need to be able to backtrack in stage 2, and reject all the characters that you previously considered valid (e.g. "init") until you get to a valid substring. The parser described in the standard is simply not up to the job.

You might think we can err on the side of caution, and accumulate more characters than we need, and let strtod use what it wants (e.g. halt stage 2 after accumulating "infinit" and let strtod just parse "inf"). But no, it must return zero and give you an error: https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.3.4

As such I can see *no way* that we can actually meet the following criteria simultaneously:

  • Be compliant with the standard, e.g. stick to the algorithm it describes
  • Handle all valid floating point inputs
  • Always return the same result as strtod would

The initial problem I wanted to solve is that "1.2f" is badly parsed. This is not a valid floating point number. "1.2" is valid. The "f" should be ignored, i.e. not passed to stage 2. It was not ignored because "f" is a valid hex digit. But the existing code had no way of keeping track of whether hex digits are valid at the current point in the string. I have added this here because it is relatively simple to do without a complete refactor, and it still looks like the standard if you squint. Adding the ability to backtrack to handle garden path inputs is a significant deviation from the standard imo.

This change should increase the number of correctly handled cases and hopefully not break any that were handled correctly before, but it is not perfect or complete.

libcxx/test/std/localization/locale.categories/category.numeric/locale.num.get/facet.num.get.members/get_float.pass.cpp
50

I'd still like to see those tests, because otherwise we have codepaths that we're not exercising — which means they could segfault, or worse, for all we know. I see two not-mutually-exclusive alternatives:

  • Figure out the "least common correct denominator" of all implementations, and add a minimal coverage test in test/std/ for it. If the least common denominator is just "we should be able to call it without crashing," then test try { (void)thething("INFINITY"); } catch (...) {}. But we should still hit the codepath.
  • And/or, figure out the "current behavior of libc++," and add a regression test in test/libcxx/, so that we test our behavior, and then we'll know if it ever regresses by accident. Because we do care about regressions in this area, right? (That's why you're introducing the ABI flag in the first place.)

The responsible thing to do would probably be to add both of these approaches, honestly, now that I've said them out loud.

Ping. @ldionne? It doesn't seem like there is a lot of interest in this fix, since it's been up for over 2 months now in more or less the same form. I'm happy to keep making the suggested changes if it has a chance of landing, otherwise please let me know and I will just close it.

libcxx/test/std/localization/locale.categories/category.numeric/locale.num.get/facet.num.get.members/get_float.pass.cpp
50

Sure, I can add something like this to test/std:

// FIXME these are all currently known to give the wrong answer
TEST("INFINITY", 8, 0.0, ios.failbit);

// FIXME it is unclear what the correct answer is in these cases
TEST("INAN", 1, 0.0, ios.failbit);
TEST("INFINITY", 3, INFINITY, ios.goodbit);
tmatheson abandoned this revision.Dec 16 2021, 6:17 AM

Somewhat embarrassing, but this simply went under my radar. I am aware of a couple of correctness bugs against our localization library/iostream and I am definitely interested in patches that would fix those. If you are still willing to, we can reopen and review this.

This overall LGTM, but I'd like to enable this change slightly differently. It should allow us to use the new function for all software built for a sufficiently recent deployment target. What we could do is keep the old implementation of __stage2_float_loop without the bool& hex parameter inside the dylib only and use a bit of cleverness to keep things working on older versions, like so:

///////////////////////////////////////////////////////////////////////////////////////////////////
// in <__config>, add (grep for _LIBCPP_ABI_ENABLE_ADDITIONAL_IOSTREAM_EXPLICIT_INSTANTIATIONS_1):
///////////////////////////////////////////////////////////////////////////////////////////////////
#if defined(_LIBCPP_BUILDING_LIBRARY) || defined(_LIBCPP_ABI_UNSTABLE) || _LIBCPP_ABI_VERSION >= 2
// Explain what's going on
# define _LIBCPP_ABI_LOCAL_NUM_GET_NEW_STAGE2_FLOAT_LOOP
#endif


///////////////////////////////////////////////////////////////////////////////////////////////////
// in <locale>:
///////////////////////////////////////////////////////////////////////////////////////////////////
template <class _CharT>
struct __num_get : protected __num_get_base {
#if defined(_LIBCPP_ABI_LOCAL_NUM_GET_NEW_STAGE2_FLOAT_LOOP)
    // your new implementation
    static int __stage2_float_loop(_CharT, bool&, char&, char*, char*&, _CharT, _CharT, const string&, unsigned*, unsigned*&, unsigned&, _CharT*, bool& __hex);
#else
    // new signature but forwarding to the old function. now you can change all the callers of this
    // function to use the new signature unconditionally, effectively making this hack more local.
    _LIBCPP_HIDE_FROM_ABI static inline 
    int __stage2_float_loop(_CharT, bool&, char&, char*, char*&, _CharT, _CharT, const string&, unsigned*, unsigned*&, unsigned&, _CharT*, bool& __hex) {
        return __stage2_float_loop(same args but drop __hex);
    }
#endif

    // If we are building the dylib, we keep the old function around for backwards compatibility.
    // If we are building for a target that doesn't support the new implementation, use the old function.
#if defined(_LIBCPP_BUILDING_LIBRARY) || !defined(_LIBCPP_ABI_LOCAL_NUM_GET_NEW_STAGE2_FLOAT_LOOP)
    static int __stage2_float_loop(_CharT, bool&, char&, char*, char*&, _CharT, _CharT, const string&, unsigned*, unsigned*&, unsigned&, _CharT*);
#endif
};

Benefits of this approach:

  1. All callers can pretend they are using the new, fixed version.
  2. The dylib retains the old function implementation, so we don't break the ABI.
  3. By default, all code will keep using the old implementation, but the new implementation will also start being exported by the shared library.
  4. Once the new implementation has shipped, vendors can go back and enable _LIBCPP_ABI_LOCAL_NUM_GET_NEW_STAGE2_FLOAT_LOOP based on whether they know the user is compiling for a target where the new implementation existed.

Concretely, this means that for example I can go back in a bit of time and enable the fix for anyone that's deploying to a recent macOS/iOS/whateverOS. The same can be done for other platforms (although I don't know that other vendors use these sorts of tricks that are available to them).

P.S.: The localization code is crazy and you're a real warrior for jumping into it -- I'm really sorry you didn't catch my attention the first time around.

[Github PR transition cleanup]

Revived as https://github.com/llvm/llvm-project/pull/65168

Herald added a project: Restricted Project. · View Herald TranscriptSep 1 2023, 6:14 AM