Page MenuHomePhabricator

[libc++][format] Improve format buffer.
ClosedPublic

Authored by Mordante on Jul 17 2022, 7:31 AM.

Details

Reviewers
ldionne
vitaut
Group Reviewers
Restricted Project
Commits
rGf7c0df002a08: [libc++][format] Improve format buffer.
Summary

Allow bulk output operations on the buffer instead of adding one
code unit at a time. This has a huge performance benefit at the cost of
larger binary. This doesn't implement @vitaut's earlier suggestion to
avoid buffering for std::string when writing a strings. That can be done
in a follow-up patch.

There are some minor complications for the non-buffered format_to_n.
When writing one character at a time it's easy to detect when reaching
the limit n. This is solved by adding a small overhead for format_to_n.
When the next write would overflow it stores the data in the internal
buffer and copies that up-to n code units. The overhead isn't measured,
but it's expected to only be an issue for small values of n; for larger
values the general improvements will outweight the new overhead.

   text	   data	    bss	    dec	    hex	filename
 349081	   6096	    440	 355617	  56d21	format.libcxx.out-baseline
 344442	   6088	    440	 350970	  55afa	formatted_size.libcxx.out-baseline
4567980	  57272	    424	4625676	 46950c	formatter_float.libcxx.out-baseline
 718800	  12472	    488	 731760	  b2a70	formatter_int.libcxx.out-baseline
 376341	   6096	    552	 382989	  5d80d	format_to.libcxx.out-beaseline

 370169	   6096	    440	 376705	  5bf81	format.libcxx.out
 365530	   6088	    440	 372058	  5ad5a	formatted_size.libcxx.out
4575116	  57272	    424	4632812	 46b0ec	formatter_float.libcxx.out
 725936	  12472	    488	 738896	  b4650	formatter_int.libcxx.out
 397429	   6096	    552	 404077	  62a6d	format_to.libcxx.out

For very small strings the new method is slower, from 4 characters
there's already a small gain.

Comparing ./format.libcxx.out-baseline to ./format.libcxx.out
Benchmark                                           Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------
BM_format_string<char>/1                         +0.0268         +0.0268            43            44            43            44
BM_format_string<char>/2                         +0.0133         +0.0133            22            22            22            22
BM_format_string<char>/4                         -0.0248         -0.0248            12            11            12            11
BM_format_string<char>/8                         -0.0831         -0.0831             6             6             6             6
BM_format_string<char>/16                        -0.2976         -0.2976             4             3             4             3
BM_format_string<char>/32                        -0.4369         -0.4369             3             2             3             2
BM_format_string<char>/64                        -0.6375         -0.6375             3             1             3             1
BM_format_string<char>/128                       -0.7685         -0.7685             2             1             2             1

The int benchmark has benefits for the simple formatting, but shines for
the complex formatting:

Comparing ./formatter_int.libcxx.out-baseline to ./formatter_int.libcxx.out
Benchmark                                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------
BM_Basic<uint32_t>                                                   -0.2307         -0.2307            60            46            60            46
BM_Basic<int32_t>                                                    -0.1985         -0.1985            61            49            61            49
BM_Basic<uint64_t>                                                   -0.3478         -0.3479            81            53            81            53
BM_Basic<int64_t>                                                    -0.3475         -0.3475            81            53            81            53
BM_BasicLow<__uint128_t>                                             -0.3388         -0.3388            86            57            86            57
BM_BasicLow<__int128_t>                                              -0.3431         -0.3431            86            57            86            57
BM_Basic<__uint128_t>                                                -0.2822         -0.2822           236           170           236           170
BM_Basic<__int128_t>                                                 -0.3107         -0.3107           219           151           219           151
Integral_LocFalse_BaseBin_AlignNone_Int64                            -0.5781         -0.5781           178            75           178            75
Integral_LocFalse_BaseBin_AlignmentLeft_Int64                        -0.9231         -0.9231          1156            89          1156            89
Integral_LocFalse_BaseBin_AlignmentCenter_Int64                      -0.9179         -0.9179          1107            91          1107            91
Integral_LocFalse_BaseBin_AlignmentRight_Int64                       -0.9238         -0.9238          1147            87          1147            87
Integral_LocFalse_BaseBin_ZeroPadding_Int64                          -0.9170         -0.9170          1137            94          1137            94
Integral_LocFalse_BaseBin_AlignNone_Uint64                           -0.5923         -0.5923           175            71           175            71
Integral_LocFalse_BaseBin_AlignmentLeft_Uint64                       -0.9251         -0.9251          1154            86          1154            86
Integral_LocFalse_BaseBin_AlignmentCenter_Uint64                     -0.9204         -0.9204          1105            88          1105            88
Integral_LocFalse_BaseBin_AlignmentRight_Uint64                      -0.9242         -0.9242          1125            85          1125            85
Integral_LocFalse_BaseBin_ZeroPadding_Uint64                         -0.9232         -0.9232          1139            88          1139            88
Integral_LocFalse_BaseOct_AlignNone_Int64                            -0.3241         -0.3241           100            67           100            67
Integral_LocFalse_BaseOct_AlignmentLeft_Int64                        -0.9322         -0.9322          1166            79          1166            79
Integral_LocFalse_BaseOct_AlignmentCenter_Int64                      -0.9251         -0.9251          1108            83          1108            83
Integral_LocFalse_BaseOct_AlignmentRight_Int64                       -0.9303         -0.9303          1136            79          1136            79
Integral_LocFalse_BaseOct_ZeroPadding_Int64                          -0.9264         -0.9264          1156            85          1156            85
Integral_LocFalse_BaseOct_AlignNone_Uint64                           -0.3116         -0.3116            96            66            96            66
Integral_LocFalse_BaseOct_AlignmentLeft_Uint64                       -0.9310         -0.9310          1168            81          1168            81
Integral_LocFalse_BaseOct_AlignmentCenter_Uint64                     -0.9281         -0.9281          1128            81          1128            81
Integral_LocFalse_BaseOct_AlignmentRight_Uint64                      -0.9299         -0.9299          1148            80          1148            80
Integral_LocFalse_BaseOct_ZeroPadding_Uint64                         -0.9288         -0.9288          1153            82          1153            82
Integral_LocFalse_BaseDec_AlignNone_Int64                            -0.3342         -0.3342            95            63            95            63
Integral_LocFalse_BaseDec_AlignmentLeft_Int64                        -0.9360         -0.9360          1157            74          1157            74
Integral_LocFalse_BaseDec_AlignmentCenter_Int64                      -0.9303         -0.9303          1128            79          1128            79
Integral_LocFalse_BaseDec_AlignmentRight_Int64                       -0.9369         -0.9369          1164            73          1164            73
Integral_LocFalse_BaseDec_ZeroPadding_Int64                          -0.9323         -0.9323          1157            78          1157            78
Integral_LocFalse_BaseDec_AlignNone_Uint64                           -0.3198         -0.3198            93            63            93            63
Integral_LocFalse_BaseDec_AlignmentLeft_Uint64                       -0.9351         -0.9351          1158            75          1158            75
Integral_LocFalse_BaseDec_AlignmentCenter_Uint64                     -0.9298         -0.9298          1128            79          1128            79
Integral_LocFalse_BaseDec_AlignmentRight_Uint64                      -0.9361         -0.9361          1157            74          1157            74
Integral_LocFalse_BaseDec_ZeroPadding_Uint64                         -0.9333         -0.9333          1151            77          1151            77
Integral_LocFalse_BaseHex_AlignNone_Int64                            -0.3020         -0.3020            89            62            89            62
Integral_LocFalse_BaseHex_AlignmentLeft_Int64                        -0.9357         -0.9357          1174            75          1174            75
Integral_LocFalse_BaseHex_AlignmentCenter_Int64                      -0.9319         -0.9319          1129            77          1129            77
Integral_LocFalse_BaseHex_AlignmentRight_Int64                       -0.9350         -0.9350          1161            75          1161            75
Integral_LocFalse_BaseHex_ZeroPadding_Int64                          -0.9293         -0.9293          1150            81          1150            81
Integral_LocFalse_BaseHex_AlignNone_Uint64                           -0.3056         -0.3057            86            59            86            59
Integral_LocFalse_BaseHex_AlignmentLeft_Uint64                       -0.9378         -0.9378          1174            73          1174            73
Integral_LocFalse_BaseHex_AlignmentCenter_Uint64                     -0.9341         -0.9341          1129            74          1130            74
Integral_LocFalse_BaseHex_AlignmentRight_Uint64                      -0.9361         -0.9361          1157            74          1157            74
Integral_LocFalse_BaseHex_ZeroPadding_Uint64                         -0.9315         -0.9315          1147            79          1147            79
Integral_LocFalse_BaseHexUpper_AlignNone_Int64                       -0.0019         -0.0019            91            90            91            90
Integral_LocFalse_BaseHexUpper_AlignmentLeft_Int64                   -0.9099         -0.9099          1162           105          1162           105
Integral_LocFalse_BaseHexUpper_AlignmentCenter_Int64                 -0.9041         -0.9041          1121           108          1121           108
Integral_LocFalse_BaseHexUpper_AlignmentRight_Int64                  -0.9086         -0.9086          1162           106          1162           106
Integral_LocFalse_BaseHexUpper_ZeroPadding_Int64                     -0.9057         -0.9057          1164           110          1164           110
Integral_LocFalse_BaseHexUpper_AlignNone_Uint64                      +0.0110         +0.0110            86            87            86            87
Integral_LocFalse_BaseHexUpper_AlignmentLeft_Uint64                  -0.9136         -0.9136          1161           100          1161           100
Integral_LocFalse_BaseHexUpper_AlignmentCenter_Uint64                -0.9078         -0.9078          1133           104          1133           104
Integral_LocFalse_BaseHexUpper_AlignmentRight_Uint64                 -0.9132         -0.9132          1177           102          1177           102
Integral_LocFalse_BaseHexUpper_ZeroPadding_Uint64                    -0.9091         -0.9091          1160           105          1160           105

Other benchmarks give similar results.

Diff Detail

Event Timeline

Mordante created this revision.Jul 17 2022, 7:31 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 17 2022, 7:31 AM
Mordante requested review of this revision.Jul 17 2022, 7:31 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 17 2022, 7:31 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
Mordante added inline comments.Jul 17 2022, 7:40 AM
libcxx/include/__format/buffer.h
71

These existing functions should be __uglified I'll do that in a followup patch.

103
libcxx/include/__format/formatter_output.h
96

Forgot them here and the next functions too.

Mordante updated this revision to Diff 445318.Jul 17 2022, 7:56 AM

Fixes CI error.

Mordante updated this revision to Diff 445321.Jul 17 2022, 9:39 AM

Address my own review comments and fix some CIs.
There's still an issue with the benchmarks.
But want to make sure the other jobs are green now.

Mordante updated this revision to Diff 445323.Jul 17 2022, 10:00 AM

Fix benchmarks, just a simple copy paste bug.

Nice improvement but I'd still recommend exploring direct writes to contiguous containers as discussed on the other diff.

libcxx/include/__format/buffer.h
89

Why are some member functions like __copy mangled while others like make_output_iterator are not?

217

complexer -> more complex

libcxx/include/__format/formatter_output.h
97

Why const?

Mordante marked an inline comment as done and 3 inline comments as not done.Jul 29 2022, 10:30 AM

Thanks for the review!

Nice improvement but I'd still recommend exploring direct writes to contiguous containers as discussed on the other diff.

I could explore that, but I want to see what the overhead is. I already wrap a contiguous storage as a direct buffer, which has a capacity of size_t(-1). So these write directly end up in the underlaying storage. The only overhead should be one size check for every write operation. This change shows a large improvement there too since instead of checking the capacity every character it is checked once per write operation.

Next I want to look at the std::back_inserter<std::string>. That operation still stores all data in a temporary buffer before copying the entire buffer to the string. There I expect it will be possible to improve the performance of std::format itself.

libcxx/include/__format/buffer.h
89

The all should be, but I will do the unrelated changes in a separate NFC patch. I noticed it while working on this, the the comment at line 71.

libcxx/include/__format/formatter_output.h
97

You mean in output_iterator<const _OutCharT&>?
I used that due to http://eel.is/c++draft/format#functions-13
Constraints: Out satisfies output_­iterator<const charT&>. Should that wording be different?

Next I want to look at the std::back_inserter<std::string>.

That and std::back_inserter<std::vector<char>> are exactly the cases I had in mind in "exploring direct writes to contiguous containers", sorry for not being clear. I'm glad that you have it planned.

libcxx/include/__format/formatter_output.h
97

Ah, right. I always forget what T means in output_iterator =).

Mordante marked 3 inline comments as done.Aug 3 2022, 10:31 AM

Next I want to look at the std::back_inserter<std::string>.

That and std::back_inserter<std::vector<char>> are exactly the cases I had in mind in "exploring direct writes to contiguous containers", sorry for not being clear. I'm glad that you have it planned.

No problem, I was already wondering whether you meant that use case instead.

ldionne requested changes to this revision.Aug 9 2022, 9:57 AM
ldionne added inline comments.
libcxx/include/__format/buffer.h
97
130–136

I'm not super happy with this comment and this assertion. Either it is possible to hit that code path, in which case there should be a test for it and this comment can go. Or it's impossible to hit the code path, in which case the assertion stays (with a different diagnostic) but the comment goes.

200

Side comment: I find it a bit hard to read with the Doxygen annotations. Since we don't generate Doxygen docs, I wonder what's the benefit of doing this.

219

Where __n is the number of characters we want to write. Or maybe __characters.

This revision now requires changes to proceed.Aug 9 2022, 9:57 AM
Mordante marked 4 inline comments as done.Aug 13 2022, 2:44 AM
Mordante added inline comments.
libcxx/include/__format/buffer.h
130–136

Actually it I had already done the right thing(tm) but removed the comment and assertion.

200

I removed some. I still see some benefit since some syntax highlighting uses this information too.

Mordante updated this revision to Diff 452397.Aug 13 2022, 2:56 AM
Mordante marked an inline comment as done.

Rebased and addresses review comments.

ldionne accepted this revision.Aug 16 2022, 9:18 AM
This revision is now accepted and ready to land.Aug 16 2022, 9:18 AM
This revision was landed with ongoing or failed builds.Aug 16 2022, 9:54 AM
This revision was automatically updated to reflect the committed changes.