This is an archive of the discontinued LLVM Phabricator instance.

[libc++][format] Improve string formatters
ClosedPublic

Authored by Mordante on May 14 2022, 6:04 AM.

Details

Reviewers
ldionne
vitaut
Group Reviewers
Restricted Project
Commits
rG77ad77c0710f: [libc++][format] Improve string formatters
Summary

This changes the implementation of the formatter. Instead of inheriting
from a specialized parser all formatters will use the same generic
parser. This reduces the binary size.

The new parser contains some additional fields only used in the chrono
formatting. Since this doesn't change the size of the parser the fields
are in the generic parser. The parser is designed to fit in 128-bit,
making it cheap to pass by value.

The new format function is a const member function. This isn't required
by the Standard yet, but it will be after LWG-3636 is accepted.
Additionally P2286 adds a formattable concept which requires the member
function to be const qualified in C++23. This paper is likely to be
accepted in the 2022 July plenary.

Depends on D121530

NOTE parts of the code now contains duplicates for the current and new parser.
The intention is to remove the duplication in followup patches. A general
overview of the final code is available in D124620. That review however lacks a
bit of polish.

Most of the new code is based on the same algorithms used in the current code.

The final version of this code reduces the binary size by 17 KB for this example
code

int main() {
  {
    std::string_view sv{"hello world"};
    std::format("{}{}|{}{}{}{}{}{}|{}{}{}{}{}{}|{}{}{}|{}{}|{}", true, '*',
                (signed char)(42), (short)(42), (int)(42), (long)(42), (long long)(42), (__int128_t)(42),
                (unsigned char)(42), (unsigned short)(42), (unsigned int)(42), (unsigned long)(42),
                (unsigned long long)(42), (__uint128_t)(42),
                (float)(42), (double)(42), (long double)(42),
                "hello world", sv,
                nullptr);
  }
  {
    std::wstring_view sv{L"hello world"};
    std::format(L"{}{}|{}{}{}{}{}{}|{}{}{}{}{}{}|{}{}{}|{}{}|{}", true, L'*',
                (signed char)(42), (short)(42), (int)(42), (long)(42), (long long)(42), (__int128_t)(42),
                (unsigned char)(42), (unsigned short)(42), (unsigned int)(42), (unsigned long)(42),
                (unsigned long long)(42), (__uint128_t)(42),
                (float)(42), (double)(42), (long double)(42),
                L"hello world", sv,
                nullptr);
  }
}

Diff Detail

Event Timeline

Mordante created this revision.May 14 2022, 6:04 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 14 2022, 6:04 AM
Herald added a subscriber: mgorny. · View Herald Transcript
Mordante requested review of this revision.May 14 2022, 6:04 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 14 2022, 6:04 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
philnik added inline comments.
libcxx/include/__format/parser_std_format_spec.h
1538

Why not use bool?

Mordante added inline comments.May 14 2022, 6:59 AM
libcxx/include/__format/parser_std_format_spec.h
1538

That will most likely change the size of the object on Windows. (Note that the underlying type of the enum classes is an uint8_t too.) I can test this after the Windows build is green.

Mordante updated this revision to Diff 429448.May 14 2022, 7:15 AM

Attempt to fix the CI.

Mordante updated this revision to Diff 429465.May 14 2022, 9:46 AM

Attempt to fix Windows build.

Mordante updated this revision to Diff 429472.May 14 2022, 11:41 AM

Another attempt to fix the Windows build.

Mordante updated this revision to Diff 429524.May 15 2022, 2:56 AM

Test whether bools work on all supported platforms as @philnik suggested.

Mordante updated this revision to Diff 429531.May 15 2022, 6:02 AM

Minor polishing.

Mordante marked an inline comment as done.
Mordante added inline comments.
libcxx/include/__format/formatter_output.h
33

This will be done after all formatters are converted to this style.

libcxx/include/__format/parser_std_format_spec.h
1538

I tested in Godbolt and it seems bool works. Maybe because it has the same size and alignment as the uint8_t. Since the CI is also happy I'll keep the bools.

Mordante updated this revision to Diff 432720.May 28 2022, 7:08 AM

Rebased to fix CI.

Mordante updated this revision to Diff 432772.May 29 2022, 2:15 AM

Rebased on main.

vitaut added a comment.Jun 5 2022, 8:37 AM

Looks great! My main suggestion is to rename __parser into something more appropriate (see an inline comment) and possibly separate the actual parsing logic from the parser output.

libcxx/include/__format/formatter_output.h
80

Normally parser is something that does parsing. "The parsed formatting output settings" sounds more like format specs so I'd suggest renaming accordingly and making parser return these specs.

119–132

Not specific to this diff but I think it would be cleaner to fold Unicode/non-Unicode handling into __estimate_column_width and have a single check and copy here.

libcxx/include/__format/formatter_string.h
35

What is this function for? Why not use __write_unicode directly?

81–83

Premature optimization: you could use negative values to indicate "no width/precision" and avoid extra checks here.

libcxx/include/__format/parser_std_format_spec.h
1528

Another option is to fold this into fill/alignment (this would require an additional alignment enumerator making it larger). Not suggesting to change, just something to consider.

1689

nit: maybe replace float_hexadecimal with hexfloat for consistency with the naming in the standard (e.g. https://en.cppreference.com/w/cpp/io/manip/fixed)? The other names seem to be already consistent.

1750–1751

This is a weird check. I think we should open an LWG issue to allow zero dynamic width and remove this unnecessary check.

1757

nit: why assign false here but 0 to __width_as_arg_ in __substitute_width_arg_id above? Make this consistent?

Mordante marked 4 inline comments as done.Jun 5 2022, 11:35 AM

Thanks for the review!

Looks great! My main suggestion is to rename __parser into something more appropriate (see an inline comment) and possibly separate the actual parsing logic from the parser output.

I'll investigate that direction.

libcxx/include/__format/formatter_output.h
80

Interesting observation. I give this some thought.

119–132

I like this suggestion, but I leave it as is in this patch.
I noticed this code doesn't do grapheme clustering at all.
I've been working on that in D126971. So it makes more sense it implement those changes in that patch.

libcxx/include/__format/formatter_string.h
35

I did that to be consistent with the other new formatters. But this one indeed can be removed.

81–83

That won't work. Here I need to check for both the value set or an argument used. Here I haven't resolved the formatting arguments.

However when a formatting argument used for the width can contain the value 0 it would be easier to resolve this width before validating its value.

libcxx/include/__format/parser_std_format_spec.h
1528

I had considered that in the original parser, but it didn't work out in that design. But I don't see why it can't work in this design. I like the idea since it makes the code of setting the zero-padding alignment cleaner. During the parsing I can test whether another alignment is set, instead of having "uglier" post processing code.

Indeed the number of bits required to store the data remains the same.

Thanks for the suggestion!

1689

Good point.

1750–1751

Currently width is specified as

width:
    positive-integer
    { arg-idopt }

do you want to change that also to nonnegative-integer? When not it's the question what the width of 0 should do.
Note that __format_spec::__substitute_arg_id(__arg); already throws on negative values.

Another option would be to give the minimum as an argument to __substitute_arg_id so there it can validate the proper boundary.

1757

I originally used 0 and 1 but switched to false and true later. It seems I missed one place.

Mordante updated this revision to Diff 434407.Jun 6 2022, 2:34 AM
Mordante marked 4 inline comments as done.

Addresses review comments.

Mordante marked 3 inline comments as done.Jun 6 2022, 4:53 AM
vitaut added inline comments.Jun 6 2022, 11:00 AM
libcxx/include/__format/parser_std_format_spec.h
1750–1751

The grammar was done that way to avoid confusion between zero width and zero flag. I am not suggesting to change the grammar, just remove the runtime check for zero dynamic width since it's mostly harmless and only adds overhead.

ldionne requested changes to this revision.Jun 14 2022, 12:07 PM

I really like the new approach, especially since it cuts down on code size. I don't have any strong reservations about this patch, the only thing I'd really like is to investigate what the patch would look like if we did not jam the chrono and non-chrono formatter specs into the same structs. Other than that, I'm generally really happy with this -- thanks for making these improvements!

libcxx/include/__format/parser_std_format_spec.h
1390–1391

I am not sure I understand this comment.

1403–1407

It seems to me like it might be cleaner to define it inside formatter_string.h (closer to its point of use), but this is just a suggestion, if you've been through these changes and feel like this is the right way to group things, go with it.

1459–1463

I think it's nice that this is really tailor-made to fit into as little space as possible, however I do have a concern that it might not be robust against future changes. For example, this code would break if we added a (admittedly large) number of elements to the __type enumeration. I'm not sure what would make me feel more easy about this -- perhaps simply adding some sort of static_assert to make sure the code stays in sync would be enough?

1465–1466

I find it a bit confusing that we're jamming these chrono specific fields here. I think we should either have a separate __chrono_parsed_specifications with just the fields it needs, or make this some sort of union { <things-needed-by-std-format-spec>; <things-needed-by-chrono-format-spec> }.

1468

Do you think that makes sense? Otherwise, perhaps a comment is enough. Basically, it would be nice not to leave a raw hard-coded number unexplained, for newbies like me :-)

1553–1556

Perhaps we could centralize this 4 value with this comment, and then reuse it here and above where I left a similar review comment?

This revision now requires changes to proceed.Jun 14 2022, 12:07 PM
Mordante marked 3 inline comments as done.Jun 16 2022, 11:58 AM

@ldionne I've the suggested changes, but before putting it up for review again I want to see whether these changes impact other work-in-progress code. (Not merge conflicts, but incompatibilities.)

libcxx/include/__format/parser_std_format_spec.h
1390–1391

As discussed, this is a left-over of an experiment. Removed the unneeded part.

1403–1407

I prefer to keep it here. That way it's easy to see which options are used for a formatter. The other formatters will also get their specific flags here. I expect to look more often at the flags when working on the format specs then when working on the formatters.

1459–1463

We currently have 18 members in the enum. I'm only aware of two papers intending to add a field

This would leave use with 12 unused values. I can add one bit, which would give us 44 values to use.

I've added a static_assert for the entire struct.

1465–1466

I added a union. These two unions share __alignment __alignment_ : 3; since that field is shared between the two formatters. This means we can access a member of a not active member. However this is valid per http://eel.is/c++draft/class.union#def:active,union_member

In a union, a non-static data member is active if its name refers to an object
whose lifetime has begun and has not ended ([basic.life]).  At most one of the
non-static data members of an object of union type can be active at any time,
that is, the value of at most one of the non-static data members can be stored
in a union at any time.

[Note 1: One special guarantee is made in order to simplify the use of unions:
If a standard-layout union contains several standard-layout structs that share
a common initial sequence ([class.mem]), and if a non-static data member of an
object of this standard-layout union type is active and is one of the
standard-layout structs, it is permitted to inspect the common initial sequence
of any of the standard-layout struct members; see [class.mem].  — end note]

Clearly this is intended to put the tag of a tagged union at the start, but we can use that for our purposes.

1468

I moved this field after the precision_. The next field width_ is an int32_t and has the wanted alignment and avoids using magic values.

1553–1556

I moved this field after the __precision_. The next field __width_ is an int32_t and has the wanted alignment and avoids using magic values.

ldionne added inline comments.Jun 16 2022, 12:14 PM
libcxx/include/__format/parser_std_format_spec.h
1459–1463

Do you know how other implementations handle this problem? What about @vitaut 's fmt library?

I'm sure there will be more stuff added to <format> in the future, but I have no idea of how much stuff is going to be added. Is it possible that we'll ever run into this limitation? I don't know. However, the situation I'd like to avoid is one where WG21 wants to move forward with a paper, but libc++ has to say "we won't be able to implement it due to ABI", while other implementations are able to implement it <for some reason>. If all other implementations make a tradeoff similar to this, I think it's fine.

Basically, I'd just like to make sure we're not painting our implementation into a corner.

Mordante marked 5 inline comments as done.Jun 17 2022, 9:16 AM
Mordante added inline comments.
libcxx/include/__format/parser_std_format_spec.h
1459–1463

fmt uses a struct with bitfields. It stores the type as an enum class with unsigned char as underlying value. This type isn't a bitfield.

MSVC STL uses a struct without bitfields and stores the type as the char used in the format string.

So if we want to avoid to be ever in a corner this field should not be a bit field. I feel it's a bit overkill, but it's not like the available space is tight, so I'll make this field 8 bits.

Mordante updated this revision to Diff 438108.Jun 18 2022, 3:02 AM
Mordante marked 3 inline comments as done.

Address review comments.

ldionne accepted this revision.Jun 20 2022, 11:39 AM
ldionne added inline comments.
libcxx/include/__format/parser_std_format_spec.h
1468
1470

We don't need visibility(default). First of all, it doesn't change anything because all the member functions are marked with HIDE_FROM_ABI (which gives them visibility(hidden)) and there's no vtable to give visibility to, and secondly it's kind of confusing to use a visibility annotation other than _LIBCPP_HIDDEN on a struct that we call out as being not part of the ABI.

1477
1502

Can you add a comment explaining *why* you perform those tests (something like // make sure it fits in a register and it's trivially copyable for performance)? Otherwise, it looks as-if you are trying to enforce some ABI stability, which would be counter to the comment above (and your actual intent IIUC).

1515
This revision is now accepted and ready to land.Jun 20 2022, 11:39 AM
Mordante marked 7 inline comments as done.Jun 21 2022, 8:59 AM

Thanks for the reviews!

libcxx/include/__format/parser_std_format_spec.h
1515

Fair point. Looking at the amount of papers aiming to use format I still guess it's a matter of when ;-)

1750–1751

I've filed an LWG-issue Sunday, based on previous experience it will be processed next weekend.
(I don't feel the need to add the issue in the comments of the code.)

Mordante updated this revision to Diff 438730.Jun 21 2022, 9:09 AM
Mordante marked 2 inline comments as done.

Addresses all review comments and rebased.
Make sure the CI passes since during the last iteration the CI was down.

This revision was automatically updated to reflect the committed changes.