Page MenuHomePhabricator

[libcxx] Construct path using a instance of std::locale
Needs ReviewPublic

Authored by jguegant on Jul 16 2019, 1:42 PM.



This patch tries to provide the two missing constructors for std::filesystem::path using an instance of std::locale to convert the source.

Given that the source needs to be converted using the codecvt<wchar_­t, char, mbstate_­t> facet of the std::locale at first, these two constructors:

  1. Can only take a char source.
  2. Need to have different path than the current two other constructors taking a source.

After converting to a wchar_t source using the codecvt, we need to convert a second time to store the result. The standard is unclear on how the second conversion should happen:

Otherwise a conversion is performed using the codecvt<wchar_­t, char, mbstate_­t> facet of loc, and then a second conversion to the current ordinary encoding.

I interpreted this as the second conversion can be executed however we want, meaning that I can reuse the facilities from the two constructors that accept any kind of source.

For the tests, this patch contains both exhaustive testing using an already existing list of paths but also a check that the first conversion is indeed using codecvt<wchar_­t, char, mbstate_­t> to avoid any regression in the internal logic.

Diff Detail

rCXX libc++

Event Timeline

jguegant created this revision.Jul 16 2019, 1:42 PM
jguegant edited the summary of this revision. (Show Details)Jul 16 2019, 1:43 PM

Note: this is my first contribution to libc++. I tried to follow the style used in the rest of files, but I may have missed some hidden coding style rules: some files use four spaces, some two? Which SFINAE pattern (return type, class paramter...) is to be preferred? Free template functions vs template classes with static member functions?

Quuxplusone added inline comments.

char{} is a strange way to spell '\0', and since sentinel is used only on line 811, I think you don't need sentinel at all.

jguegant marked an inline comment as done.Jul 17 2019, 11:19 AM
jguegant added inline comments.

Not sure what is the expected behavior if the locale does not this facet.


Sure, that make sense now! It used to be a generic CharT.

jguegant updated this revision to Diff 210375.Jul 17 2019, 11:25 AM

Removed the unnecessary variables __sentinel.

Please re-upload the diff with full context (e.g. git diff -U999 blah blah).

BTW, I'm giving a mostly-superficial review, and only because we talked on Slack the other day. I have no opinion as to whether this patch is wanted by libc++, no opinion as to whether it's correct from a charset-wonk POV, and no ability to land it. You'll still have to get interest from one of the three listed reviewers in order to make concrete progress here. :)


Peanut gallery says: I'm not 100% sure of the rules about const. I observe that this line seems to be okay even in -pedantic mode, but intuitively I would expect __buf to be a VLA. IMHO it would be clearer to make __sz a constexpr int instead of just a const int — or even my generally preferred formulation,

_InternT __buf[32];
constexpr ptrdiff_t __sz = sizeof(__buf) / sizeof(__buf[0]);

__s is of user-provided type. Depending on level of library-maintainer paranoia, you might want to make this ++__p, void(), ++__s, or pull ++__s down into the body of the for-loop (and add curly braces). The current formulation will call a user-defined operator,(const char*, _OutputIterator) if one exists. (And will perform ADL to find out if one exists, regardless.)


class = _EnableIf<!is_constructible<const char*, _InputIt>::value>
But also, how is that constructibility relevant? Does the Standard say something like "If the input iterator is explicitly convertible to const char* then do X, else do Y"?

Furthermore, this codepath is going to do heap-allocation with string __s(__b, __e). Can we avoid the heap-allocation somehow?


I'm worried about the way you use overloading on the name __widen_from_char_iter_pair. If some well-intentioned maintainer removed the const on line 752, then line 753 would turn into an infinite recursion. I don't see any reason to use overloading here. Maybe change __widen_from_char_iter_pair on lines 739, 753, 766, 776, 793, and 811 to __widen_from_char_pointer_pair? That would leave only line 889 as a potential source of extra heap-allocations...


This could probably be more simply expressed as

string_view __sv(__b);
return __widen_from_char_iter_pair(, + __sv.size(), __loc);
template <class _SourceOrIter>
using _EnableIfWidenableFromCharSource =

You don't seem to use the _Tp parameter for anything at the moment.


Should std::locale be passed as const std::locale& instead? Does it matter? (I don't know.)


Why is this commented out?

jguegant marked 8 inline comments as done.Jul 22 2019, 12:54 PM

Thanks a lot @Quuxplusone! I am aware that you cannot take any final decision on what gets in but these are really helpful comments here.
I will fix some of you remarks and upload a better diff (with proper context to get a better picture) afterwards.

Any way to gently attract the three reviewers here?


This function is a dumb generalisation of the functions 1389, 1426 which had this pattern.

Correct me if I am wrong but this will work as C-arrays have such syntax:

D1 [ constant-expressionopt ] attribute-specifier-seqopt


The constant-expression shall be a converted constant expression of type std::size_­t ([expr.const]).

Here const int should work thanks to a special rule for const-qualified integrals:

A variable is usable in constant expressions after its initializing declaration is encountered if it is a constexpr variable, or it is of reference type or of const-qualified integral or enumeration type, and its initializer is a constant initializer.

Good idea! Now that this function accept more than one type, it seems a reasonable to be paranoid.


The idea here is to give the priority to the overload with const char* which effectively make both char* and const char*being routed by the first overload line 739. Obviously, my intent is not really clear here, I wonder how I could express that in a better way if I continue in that direction. Maybe having 3 overloads? const char*, char* and generic? A more descriptive trait name?


I believe that this would not create an infinite recursion, the enable_if is sending both char* and const char* to the first overload line 739.

I am not sure if there is a smart way to avoid allocation here. Maybe if I call codecvt::in on every character from the iterator. But wouldn't that be slow? What happen if your code-point is made of 3-4 char instead of one? Should I then copy X amounts of characters into a buffer allocated on the stack and send process the iterator chunks by chunks?
I have a similar issue line 808 and if anyone has a clever solution, I am all for it!

I had a look at how other constructors for path already in place (like line 681 or 692) do it and they already heap-allocate on even more code-paths:

template <class _Iter>
static void __append_range(string& __dest, _Iter __b, _Iter __e) {
  static_assert(!is_same<_Iter, _ECharT*>::value, "Call const overload");
  if (__b == __e)
  basic_string<_ECharT> __tmp(__b, __e);
     + __tmp.length());

If we go in your direction, 889 would now heap-allocation in all scenarios, whereas now the 889 overload supplied with const char* and char* would avoid allocating a string.


Neat! Will do that.


Indeed, I will remove it!


I was thinking that locale is a pretty cheap type (a sort of reference counted pointer), but I might be wrong. Taking it by reference cannot make it worst ;)


This should not be here. Thanks for spotting it!

jguegant updated this revision to Diff 211176.Jul 22 2019, 1:20 PM

Fixed few remarks mentioned by @Quuxplusone.