This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Remove default definition of std::char_traits
ClosedPublic

Authored by ldionne on Nov 18 2022, 8:45 AM.

Details

Summary

This patch removes the base template implementation for std::char_traits.
If my reading of http://eel.is/c++draft/char.traits is correct, the
Standard mandates that the library provides specializations for several
types like char and wchar_t, but not any implementation in the base
template. Indeed, such an implementation is bound to be incorrect for
most types anyways, since things like eof() and int_type will definitely
have to be customized.

Since the base template implementation should not have worked for anyone,
this shouldn't be a breaking change (I expect that anyone defining a
custom character type today will already have to provide their own
specialization of char_traits). However, since we're aware of some users
of char_traits for unsigned char and signed char, we're keeping those two
specializations around for two releases to give people some time to migrate.

Diff Detail

Event Timeline

ldionne created this revision.Nov 18 2022, 8:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 18 2022, 8:45 AM
ldionne requested review of this revision.Nov 18 2022, 8:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 18 2022, 8:45 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
ldionne added a subscriber: Restricted Project.Nov 18 2022, 8:45 AM
philnik added inline comments.
libcxx/include/__string/char_traits.h
42–47

We probably shouldn't mark it with a visibility macro if we never define it. If people care about it they can add their own visibility macros to any specializations.

55

These don't have to be marked constexpr or noexcept for user-provided ones. It's probably also just noise for most people.

ldionne updated this revision to Diff 476699.Nov 19 2022, 12:45 PM
ldionne marked 2 inline comments as done.

Fix CI

ldionne updated this revision to Diff 476704.Nov 19 2022, 2:05 PM

Fix issues with wchar_t

philnik accepted this revision.Nov 19 2022, 2:09 PM

LGTM with green CI.

libcxx/test/std/strings/basic.string/string.modifiers/string_append/push_back.pass.cpp
27

I'd rather avoid opening the std namespace and instead use template<> struct std::char_traits<VeryLarge>.

This revision is now accepted and ready to land.Nov 19 2022, 2:09 PM
ldionne added inline comments.Nov 19 2022, 2:16 PM
libcxx/include/__string/char_traits.h
55

I think you're right.

libcxx/test/std/strings/basic.string/string.modifiers/string_append/push_back.pass.cpp
27

I don't think that's valid in C++03. GCC gives an error and requires using -fpermissive.

philnik added inline comments.Nov 19 2022, 2:24 PM
libcxx/test/std/strings/basic.string/string.modifiers/string_append/push_back.pass.cpp
27

Huh, interesting. Weird that clang doesn't even give you a warning for that.

I'm concerned that this change will break "working" code. While the standard does not specify specializations for signed char and unsigned char, it isn't hard to find cases of people using basic_string<unsigned char>. I didn't look hard, but it looks like Emscripten would be broken by this change based on https://github.com/emscripten-core/emscripten/issues/7362.

All major compilers appear to "support" specializations for signed char and unsigned char: https://godbolt.org/z/h3xG7T48T. There is an old paper that argues for standardizing such specializations: https://wg21.link/n1985.

I'm concerned that this change will break "working" code. While the standard does not specify specializations for signed char and unsigned char, it isn't hard to find cases of people using basic_string<unsigned char>. I didn't look hard, but it looks like Emscripten would be broken by this change based on https://github.com/emscripten-core/emscripten/issues/7362.

All major compilers appear to "support" specializations for signed char and unsigned char: https://godbolt.org/z/h3xG7T48T. There is an old paper that argues for standardizing such specializations: https://wg21.link/n1985.

I think a common ground here could be to support signed char and unsigned char explicitly as extensions. But do we agree that std::string<int> makes no sense and if it breaks some users that's fine?

Also, would you have an appetite for reviving https://wg21.link/n1985?

I think a common ground here could be to support signed char and unsigned char explicitly as extensions. But do we agree that std::string<int> makes no sense and if it breaks some users that's fine?

Yes, I'm not worried about other fundamental types; they are much more problematic than signed char and unsigned char for the reasons you indicated.

Also, would you have an appetite for reviving https://wg21.link/n1985?

Not really. I could bring it to SG16, but I don't expect it to garner much favor. It might get some points for standardizing existing practice, but I expect the SG16 response to be that we don't want to encourage use of such extensions.

Also, would you have an appetite for reviving https://wg21.link/n1985?

Not really. I could bring it to SG16, but I don't expect it to garner much favor. It might get some points for standardizing existing practice, but I expect the SG16 response to be that we don't want to encourage use of such extensions.

In that case, how about this:

  1. We rip out char_traits support for int and other nonsensical stuff
  2. We keep supporting it for unsigned char and signed char, but we make it deprecated
  3. We remove it in 1-2 releases

If there's no desire to standardize it because the Committee thinks it's not a good idea, we should get clients to move away from it. We don't want to make it painful just for the sake of it, but we should initiate the movement.

ldionne updated this revision to Diff 477015.Nov 21 2022, 3:04 PM

Rebase onto main with some changes extracted into separate patches. Will update with signed char and unsigned char once we have agreed on a direction.

If there's no desire to standardize it because the Committee thinks it's not a good idea, we should get clients to move away from it. We don't want to make it painful just for the sake of it, but we should initiate the movement.

Agreed; the plan you proposed sounds good to me.

ldionne updated this revision to Diff 477198.Nov 22 2022, 7:43 AM
ldionne edited the summary of this revision. (Show Details)

Implement the agreed-upon direction

tahonermann added inline comments.Nov 22 2022, 8:40 AM
libcxx/test/std/ranges/range.adaptors/range.lazy.split/general.pass.cpp
37–47

Is the intent really to support conversions from ranges of different character types? That seems odd and doesn't seem to be required by the test.

ldionne marked an inline comment as done.Nov 22 2022, 9:12 AM
ldionne added inline comments.
libcxx/test/std/ranges/range.adaptors/range.lazy.split/general.pass.cpp
37–47

No, you're right. Thanks for the catch, I'll fix that.

The intent is to be constructible from a string_view (specifically we need to be constructible from a null-terminated C-style string) whenever Char is one of these types, and my change here was a really sloppy way of doing that.

ldionne updated this revision to Diff 477223.Nov 22 2022, 9:13 AM
ldionne marked an inline comment as done.

Address comment.

tahonermann accepted this revision.Nov 22 2022, 9:45 AM

Looks good!

ldionne updated this revision to Diff 477282.Nov 22 2022, 1:27 PM

Fix CI and rebase

This revision was landed with ongoing or failed builds.Nov 23 2022, 6:51 AM
This revision was automatically updated to reflect the committed changes.

This breaks Apache Arrow, which makes a basic_string_view<unsigned long>, here:
https://github.com/apache/arrow/blob/91ee6dad722ee154d63eea86ce5644e1e658b53b/cpp/src/arrow/util/bitmap.h#L53

/usr/include/c++/v1/string_view:279:45: error: implicit instantiation of undefined template 'std::char_traits<unsigned long>'
    static_assert((is_same<_CharT, typename traits_type::char_type>::value),
                                            ^
arrow/util/bitmap.h:153:16: note: in instantiation of template class 'std::basic_string_view<unsigned long>' requested here
    View<Word> words[N];
               ^
arrow/compute/kernels/scalar_if_else.cc:1595:15: note: in instantiation of function template specialization 'arrow::internal::Bitmap::VisitWords<3UL, (lambda at arrow/compute/kernels/scalar_if_else.cc:1595:35), unsigned long>' requested here
      Bitmap::VisitWords(bitmaps, [&](std::array<uint64_t, 3> words) {
              ^
arrow/compute/kernels/scalar_if_else.cc:1652:14: note: in instantiation of function template specialization 'arrow::compute::internal::(anonymous namespace)::ExecArrayCaseWhen<arrow::BooleanType>' requested here
      return ExecArrayCaseWhen<Type>(ctx, batch, out);
             ^
arrow/compute/kernels/codegen_internal.h:1173:47: note: in instantiation of member function 'arrow::compute::internal::(anonymous namespace)::CaseWhenFunctor<arrow::BooleanType>::Exec' requested here
      return Generator<BooleanType, Args...>::Exec;
                                              ^
arrow/compute/kernels/scalar_if_else.cc:2738:17: note: in instantiation of function template specialization 'arrow::compute::internal::GenerateTypeAgnosticPrimitive<arrow::compute::internal::(anonymous namespace)::CaseWhenFunctor>' requested here
    auto exec = GenerateTypeAgnosticPrimitive<CaseWhenFunctor>(*type);
                ^
/usr/include/c++/v1/__string/char_traits.h:42:8: note: template is declared here
struct char_traits;

Also breaks v8, which uses a std::basic_string<uint16_t> member, https://github.com/v8/v8/blob/0afe3f519821449f00c746df3747543d3458bb68/src/inspector/string-16.h#L113

This breaks Apache Arrow, which makes a basic_string_view<unsigned long>, here:
https://github.com/apache/arrow/blob/91ee6dad722ee154d63eea86ce5644e1e658b53b/cpp/src/arrow/util/bitmap.h#L53

[...]

Also breaks v8, which uses a std::basic_string<uint16_t> member, https://github.com/v8/v8/blob/0afe3f519821449f00c746df3747543d3458bb68/src/inspector/string-16.h#L113

That is intended, they should fix their usage of std::string. If they really want to use std::string for these "character types", then they can define struct MyChar { unsigned long c_; }; and then specialize std::char_traits<MyChar>.

This breaks Apache Arrow, which makes a basic_string_view<unsigned long>, here:
https://github.com/apache/arrow/blob/91ee6dad722ee154d63eea86ce5644e1e658b53b/cpp/src/arrow/util/bitmap.h#L53

[...]

Also breaks v8, which uses a std::basic_string<uint16_t> member, https://github.com/v8/v8/blob/0afe3f519821449f00c746df3747543d3458bb68/src/inspector/string-16.h#L113

That is intended, they should fix their usage of std::string. If they really want to use std::string for these "character types", then they can define struct MyChar { unsigned long c_; }; and then specialize std::char_traits<MyChar>.

One thing we could do is instead deprecate the whole base template (but still provide an implementation) so as to give folks two releases to get off of it. I'm pretty skeptical about std::string working as intended for unsigned long, but whatever. @tahonermann What would you think about that?

This breaks Apache Arrow, which makes a basic_string_view<unsigned long>

Hmm, this usage is exposed in the interface (see the words() member function template of the Bitmap class template). Ick. Are there tests that demonstrate that the code actually works as intended when Word is not one of the standard character types?

Also breaks v8, which uses a std::basic_string<uint16_t> member

At least in this case, the use is an implementation detail. It looks to me like this class could be relatively easily changed to use std::vector instead. Additional code would be required to ensure null termination, but it at least looks like there are no obvious dependencies on char_traits.

One thing we could do is instead deprecate the whole base template (but still provide an implementation) so as to give folks two releases to get off of it. I'm pretty skeptical about std::string working as intended for unsigned long, but whatever.

That seems prudent. I think it is likely that we'll see additional reports of breakage.

This is something we should communicate to the community via the Announcements (https://discourse.llvm.org/c/announce/46) channel following the conclusion of discussion at https://discourse.llvm.org/t/rfc-add-new-discourse-channel-for-potentially-breaking-disruptive-changes-for-clang/65251.

gulfem added a subscriber: gulfem.Nov 23 2022, 10:49 AM

We also started seeing errors in our Fuchsia source code.

[91666/283659](64) CXX obj/src/lib/elfldltl/test/note-fuzzer.note-fuzzer.cc.o
FAILED: obj/src/lib/elfldltl/test/note-fuzzer.note-fuzzer.cc.o
../../../recipe_cleanup/clangyvv4ambu/bin/clang++ -MD -MF obj/src/lib/elfldltl/test/note-fuzzer.note-fuzzer.cc.o.d -D_LIBCPP_DISABLE_VISIBILITY_ANNOTATIONS -D_LIBCPP_REMOVE_TRANSITIVE_INCLUDES -DND...
In file included from ../../src/lib/elfldltl/test/note-fuzzer.cc:5:
In file included from ../../src/lib/elfldltl/include/lib/elfldltl/fuzzer.h:12:
In file included from ../../src/lib/elfldltl/include/lib/elfldltl/layout.h:11:
../../../recipe_cleanup/clangyvv4ambu/bin/../include/c++/v1/string_view:279:45: error: implicit instantiation of undefined template 'std::char_traits<std::byte>'
    static_assert((is_same<_CharT, typename traits_type::char_type>::value),
                                            ^
../../src/lib/elfldltl/include/lib/elfldltl/note.h:97:9: note: in instantiation of template class 'std::basic_string_view<std::byte>' requested here
  Bytes desc;
        ^
../../../recipe_cleanup/clangyvv4ambu/bin/../include/c++/v1/__fwd/string.h:22:29: note: template is declared here
struct _LIBCPP_TEMPLATE_VIS char_traits;

https://luci-milo.appspot.com/ui/p/fuchsia/builders/ci/clang_toolchain.ci.core.x64-release/b8796686452214900177/overview

rnk added a subscriber: rnk.Nov 23 2022, 11:15 AM

Would it be reasonable to retain the default implementation of char_traits under a __config_site macro? This will provide downstream projects with a migration path that is not tightly coupled to the libc++ version. They can update libc++ and complete the migration as separate steps. The migration path can be removed over a longer time horizon, like after the next release.

In our experience at Google, libc++ has recently made frequent breaking changes which prevent us from uprevving libc++, and oftentimes by extension, the entire LLVM toolchain. We are definitely committed and willing to rewrite code to conform to the standard, but the current pace of breakage and lack of migration paths is causing us problems, and I worry that we are just the canary hitting these issues first. I think other users will experience them later when it comes time to release.

I aspired to start a large discussion about appropriate project policies on breaking changes, but I never got around to it, and it looks like we've discovered the next breaking change before I was able to kick off an RFC. I will also be going on leave soon until May, so it's not likely that I'll be able to follow up, but I wanted to share my thoughts.

Would it be reasonable to retain the default implementation of char_traits under a __config_site macro? This will provide downstream projects with a migration path that is not tightly coupled to the libc++ version. They can update libc++ and complete the migration as separate steps. The migration path can be removed over a longer time horizon, like after the next release.

I think the currently proposed approach in D138596 handles that -- you basically get a deprecation warning that can be turned off via the usual deprecation-removal mechanism, and then roughly one year from now, this becomes a hard error. I think this is very close to what you mention, unless I am misunderstanding something. Please let me know if D138596 doesn't solve your problem.

In our experience at Google, libc++ has recently made frequent breaking changes which prevent us from uprevving libc++, and oftentimes by extension, the entire LLVM toolchain. We are definitely committed and willing to rewrite code to conform to the standard, but the current pace of breakage and lack of migration paths is causing us problems, and I worry that we are just the canary hitting these issues first. I think other users will experience them later when it comes time to release.

FWIW, I completely acknowledge that we've been breaking more stuff recently. However, to provide some context, I would argue this is because we've never been making as much progress on our standards conformance, and also we've adopted a more aggressive stance with respect to catching invalid user code. Breakage induced by standards conformance is, I would argue, very difficult to avoid. Other breakage (such as this patch) is much easier to do something about.

I think another important thing to note here is that the project was not as active for some years, so folks perhaps got used to ultimate stability. Anyway, I'm all for providing reasonable migration paths when it's reasonable to do so, but the meaning of reasonable is what I expect we may have challenges agreeing on (not just you and me, but the overall LLVM/libc++ community and its downstream users).

rnk added a comment.Nov 23 2022, 12:10 PM

I think the currently proposed approach in D138596 handles that -- you basically get a deprecation warning that can be turned off via the usual deprecation-removal mechanism, and then roughly one year from now, this becomes a hard error. I think this is very close to what you mention, unless I am misunderstanding something. Please let me know if D138596 doesn't solve your problem.

Thanks, that sounds good to me.

I think another important thing to note here is that the project was not as active for some years, so folks perhaps got used to ultimate stability. Anyway, I'm all for providing reasonable migration paths when it's reasonable to do so, but the meaning of reasonable is what I expect we may have challenges agreeing on (not just you and me, but the overall LLVM/libc++ community and its downstream users).

Sounds like we're on the same page. I agree, the definition of "reasonable" is subjective and hard to know ahead of time, and I don't expect everyone to always agree. I'm satisfied as long as folks are willing to compromise from time to time and work together.

This broke some existing cases using std::basic_string_view<std::byte> and the like.

Sounds like we're on the same page. I agree, the definition of "reasonable" is subjective and hard to know ahead of time, and I don't expect everyone to always agree. I'm satisfied as long as folks are willing to compromise from time to time and work together.

Yes, absolutely. We really appreciate this kind of feedback from downstreams since it allows us to catch things that would be painful for people when migrating, and you can be certain that we are willing to compromise within reason to give folks a good user experience. Of course, we also have to take into account multiple factors, so sometimes it may not be possible to compromise as much as downstreams would like, but I think success lies in the existence of a dialogue like we're having here. For example, I think that the current migration path (where we keep all specializations of std::char_traits for some time) is a better choice than what we started with. It's simpler for the library and it provides a better user experience, so everyone wins from that.

This broke some existing cases using std::basic_string_view<std::byte> and the like.

Yes, please see the above discussion. This will be restored for the time being with a deprecation warning and then removed in an upcoming release (currently planning for LLVM 18).

Leaving an xref trail here, I filed issues against the two projects I reported above: