This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Implement missing filesystem::path constructors with locale
Needs RevisionPublic

Authored by ldionne on Jul 16 2019, 1:42 PM.

Details

Reviewers
mclow.lists
EricWF
jguegant
tahonermann
Group Reviewers
Restricted Project
Summary

This patch provides the two missing constructors for std::filesystem::path
using an instance of std::locale to convert the source.

Given that the source needs to be converted using the
codecvt<wchar_t, char, mbstate_t> facet of the std::locale at first,
these two constructors:

  1. Can only take a char source.
  2. Need to have different path than the current two other constructors taking a source.

After converting to a wchar_t source using the codecvt, we need to convert
a second time to store the result. The standard is unclear on how the
second conversion should happen:

> Otherwise a conversion is performed using the codecvt<wchar_t, char, mbstate_t>
> facet of loc, and then a second conversion to the current ordinary encoding.

I interpreted this as the second conversion can be executed however we want,
meaning that I can reuse the facilities from the two constructors that accept
any kind of source.

Diff Detail

Event Timeline

jguegant created this revision.Jul 16 2019, 1:42 PM
jguegant edited the summary of this revision. (Show Details)Jul 16 2019, 1:43 PM

Note: this is my first contribution to libc++. I tried to follow the style used in the rest of files, but I may have missed some hidden coding style rules: some files use four spaces, some two? Which SFINAE pattern (return type, class paramter...) is to be preferred? Free template functions vs template classes with static member functions?

Quuxplusone added inline comments.
libcxx/include/filesystem
809 ↗(On Diff #210158)

char{} is a strange way to spell '\0', and since sentinel is used only on line 811, I think you don't need sentinel at all.

jguegant marked an inline comment as done.Jul 17 2019, 11:19 AM
jguegant added inline comments.
libcxx/include/filesystem
741 ↗(On Diff #210158)

Not sure what is the expected behavior if the locale does not this facet.

809 ↗(On Diff #210158)

Sure, that make sense now! It used to be a generic CharT.

jguegant updated this revision to Diff 210375.Jul 17 2019, 11:25 AM

Removed the unnecessary variables __sentinel.

Please re-upload the diff with full context (e.g. git diff -U999 blah blah).

BTW, I'm giving a mostly-superficial review, and only because we talked on Slack the other day. I have no opinion as to whether this patch is wanted by libc++, no opinion as to whether it's correct from a charset-wonk POV, and no ability to land it. You'll still have to get interest from one of the three listed reviewers in order to make concrete progress here. :)

libcxx/include/__locale
1367

Peanut gallery says: I'm not 100% sure of the rules about const. I observe that this line seems to be okay even in -pedantic mode, but intuitively I would expect __buf to be a VLA. IMHO it would be clearer to make __sz a constexpr int instead of just a const int — or even my generally preferred formulation,

_InternT __buf[32];
constexpr ptrdiff_t __sz = sizeof(__buf) / sizeof(__buf[0]);
1374

__s is of user-provided type. Depending on level of library-maintainer paranoia, you might want to make this ++__p, void(), ++__s, or pull ++__s down into the body of the for-loop (and add curly braces). The current formulation will call a user-defined operator,(const char*, _OutputIterator) if one exists. (And will perform ADL to find out if one exists, regardless.)

libcxx/include/filesystem
749 ↗(On Diff #210375)

class = _EnableIf<!is_constructible<const char*, _InputIt>::value>
But also, how is that constructibility relevant? Does the Standard say something like "If the input iterator is explicitly convertible to const char* then do X, else do Y"?

Furthermore, this codepath is going to do heap-allocation with string __s(__b, __e). Can we avoid the heap-allocation somehow?

753 ↗(On Diff #210375)

I'm worried about the way you use overloading on the name __widen_from_char_iter_pair. If some well-intentioned maintainer removed the const on line 752, then line 753 would turn into an infinite recursion. I don't see any reason to use overloading here. Maybe change __widen_from_char_iter_pair on lines 739, 753, 766, 776, 793, and 811 to __widen_from_char_pointer_pair? That would leave only line 889 as a potential source of extra heap-allocations...

792 ↗(On Diff #210375)

This could probably be more simply expressed as

string_view __sv(__b);
return __widen_from_char_iter_pair(__sv.data(), __sv.data() + __sv.size(), __loc);
838 ↗(On Diff #210375)
template <class _SourceOrIter>
using _EnableIfWidenableFromCharSource =
    _EnableIf<__is_widenable_from_char_source<_SourceOrIter>::value>;

You don't seem to use the _Tp parameter for anything at the moment.

libcxx/test/std/input.output/filesystems/class.path/path.member/path.construct/source_and_locale.pass.cpp
33

Should std::locale be passed as const std::locale& instead? Does it matter? (I don't know.)

98

Why is this commented out?

jguegant marked 8 inline comments as done.Jul 22 2019, 12:54 PM

Thanks a lot @Quuxplusone! I am aware that you cannot take any final decision on what gets in but these are really helpful comments here.
I will fix some of you remarks and upload a better diff (with proper context to get a better picture) afterwards.

Any way to gently attract the three reviewers here?

libcxx/include/__locale
1367

This function is a dumb generalisation of the functions 1389, 1426 which had this pattern.

Correct me if I am wrong but this will work as C-arrays have such syntax:

D1 [ constant-expressionopt ] attribute-specifier-seqopt

Where:

The constant-expression shall be a converted constant expression of type std::size_­t ([expr.const]).

Here const int should work thanks to a special rule for const-qualified integrals:

A variable is usable in constant expressions after its initializing declaration is encountered if it is a constexpr variable, or it is of reference type or of const-qualified integral or enumeration type, and its initializer is a constant initializer.
1374

Good idea! Now that this function accept more than one type, it seems a reasonable to be paranoid.

libcxx/include/filesystem
749 ↗(On Diff #210375)

The idea here is to give the priority to the overload with const char* which effectively make both char* and const char*being routed by the first overload line 739. Obviously, my intent is not really clear here, I wonder how I could express that in a better way if I continue in that direction. Maybe having 3 overloads? const char*, char* and generic? A more descriptive trait name?

753 ↗(On Diff #210375)

I believe that this would not create an infinite recursion, the enable_if is sending both char* and const char* to the first overload line 739.

I am not sure if there is a smart way to avoid allocation here. Maybe if I call codecvt::in on every character from the iterator. But wouldn't that be slow? What happen if your code-point is made of 3-4 char instead of one? Should I then copy X amounts of characters into a buffer allocated on the stack and send process the iterator chunks by chunks?
I have a similar issue line 808 and if anyone has a clever solution, I am all for it!

I had a look at how other constructors for path already in place (like line 681 or 692) do it and they already heap-allocate on even more code-paths:

template <class _Iter>
static void __append_range(string& __dest, _Iter __b, _Iter __e) {
  static_assert(!is_same<_Iter, _ECharT*>::value, "Call const overload");
  if (__b == __e)
    return;
  basic_string<_ECharT> __tmp(__b, __e);
  _Narrower()(back_inserter(__dest), __tmp.data(),
              __tmp.data() + __tmp.length());
}

If we go in your direction, 889 would now heap-allocation in all scenarios, whereas now the 889 overload supplied with const char* and char* would avoid allocating a string.

792 ↗(On Diff #210375)

Neat! Will do that.

838 ↗(On Diff #210375)

Indeed, I will remove it!

libcxx/test/std/input.output/filesystems/class.path/path.member/path.construct/source_and_locale.pass.cpp
33

I was thinking that locale is a pretty cheap type (a sort of reference counted pointer), but I might be wrong. Taking it by reference cannot make it worst ;)

98

This should not be here. Thanks for spotting it!

jguegant updated this revision to Diff 211176.Jul 22 2019, 1:20 PM

Fixed few remarks mentioned by @Quuxplusone.

ldionne added a reviewer: Restricted Project.Nov 2 2020, 2:38 PM
ldionne commandeered this revision.Sep 19 2023, 11:31 AM
ldionne edited reviewers, added: jguegant; removed: ldionne.

[Github PR transition cleanup]

Commandeering to finish.

Herald added a project: Restricted Project. · View Herald TranscriptSep 19 2023, 11:31 AM
ldionne updated this revision to Diff 557061.Sep 19 2023, 11:32 AM
ldionne retitled this revision from [libcxx] Construct path using a instance of std::locale to [libc++] Implement missing filesystem::path constructors with locale.
ldionne edited the summary of this revision. (Show Details)

Rebase. Simplify the implementation a bit. Remove some tests for SFINAE-friendliness that are not necessary according to my reading of the spec.

Herald added a project: Restricted Project. · View Herald TranscriptSep 19 2023, 11:32 AM
ldionne updated this revision to Diff 557333.Sep 25 2023, 3:26 PM

Add missing _LIBCPP_HIDE_FROM_ABI

ldionne updated this revision to Diff 557355.Sep 26 2023, 6:22 AM

Fix C++03 build.

ldionne updated this revision to Diff 557613.Oct 5 2023, 7:47 AM

Make sure we initialize mbstate_t to avoid uninitialized read.

ldionne updated this revision to Diff 557881.Oct 25 2023, 10:27 AM

Try to fix the CI

ldionne updated this revision to Diff 557895.Oct 26 2023, 6:48 AM

Remove invalid test -- we're not allowed to construct a path from a wchar_t* using the locale constructor, only char* it allowed.

I think this is looking good. Before I bless it though, I want to make sure the intent is clear.

The new locale sensitive constructors will use the codecvt<wchar_t, char, ...> facet from the specified locale to convert the provided char-based source (locale dependent encoding) to a wchar_t-based representation (UTF-16 or UTF-32 depending on platform). If on Windows, done. If on POSIX, the wchar_t-based representation is then converted to UTF-8 in char-based storage.

That seems like the right semantics; or at least the best that we can do given that filenames don't have strongly associated encodings. I haven't managed to validate for myself yet; is this consistent with the non-locale based constructors that take a char-based source? I presume those are assumed to be UTF-8 on POSIX regardless of whatever the current locale settings are.

If it isn't too difficult to do, it would be good to use a real locale (e.g., one that uses ISO-8859-1 encoding) and validate the conversion to UTF-8 (POSIX) or UTF-16 (Windows) in the test with some characters that require re-encoding. The existing test validates that all the characters are changed to 'o', but doesn't exercise a change to encoded length or that the characters didn't undergo a further conversion (exercising non-ASCII characters would help with that).

No real issue from my side. I'll leave the approval to Tom.

libcxx/include/__filesystem/path.h
552
557
tahonermann requested changes to this revision.Nov 1 2023, 1:17 PM
tahonermann added inline comments.
libcxx/include/__locale
1566–1579

This can end up in an infinite loop when the narrow input string ends in a partial code unit sequence. In that case, partial will be returned, but from_next (aka __nn) will be left pointing to the beginning of the partial sequence such that the next iteration will encounter the same partial result.

See https://godbolt.org/z/3q5xaG5dW for an example. The test there reliably exhibits an infinite loop for std::codecvt<char32_t, char, std::mbstate_t>. For the std::codecvt<wchar_t, char, std::mbstate_t> facet, the infinite loop is avoided because the converter packs partial state into std::mbstate_t and then (erroneously) returns ok instead of partial (erroneously because no character is ever translated, but no error is ever issued either).

These appear to be existing bug given that this code was just factored out from other locations.

libcxx/test/std/input.output/filesystems/class.path/path.member/path.construct/source_and_locale.pass.cpp
130

I think additional tests should be added here to exercise various edge cases like the truncated partial code unit sequence in the godbolt link I added in another comment. We should also exercise some MxN sequence conversions (e.g., UTF-8 to UTF-16). That can be done by calling .u16string() on the resulting path object and comparing what it returns to expectations.

This revision now requires changes to proceed.Nov 1 2023, 1:17 PM