Allow bulk output operations on the buffer instead of adding one
code unit at a time. This has a huge performance benefit at the cost of
larger binary. This doesn't implement @vitaut's earlier suggestion to
avoid buffering for std::string when writing a strings. That can be done
in a follow-up patch.
There are some minor complications for the non-buffered format_to_n.
When writing one character at a time it's easy to detect when reaching
the limit n. This is solved by adding a small overhead for format_to_n.
When the next write would overflow it stores the data in the internal
buffer and copies that up-to n code units. The overhead isn't measured,
but it's expected to only be an issue for small values of n; for larger
values the general improvements will outweight the new overhead.
text data bss dec hex filename 349081 6096 440 355617 56d21 format.libcxx.out-baseline 344442 6088 440 350970 55afa formatted_size.libcxx.out-baseline 4567980 57272 424 4625676 46950c formatter_float.libcxx.out-baseline 718800 12472 488 731760 b2a70 formatter_int.libcxx.out-baseline 376341 6096 552 382989 5d80d format_to.libcxx.out-beaseline 370169 6096 440 376705 5bf81 format.libcxx.out 365530 6088 440 372058 5ad5a formatted_size.libcxx.out 4575116 57272 424 4632812 46b0ec formatter_float.libcxx.out 725936 12472 488 738896 b4650 formatter_int.libcxx.out 397429 6096 552 404077 62a6d format_to.libcxx.out
For very small strings the new method is slower, from 4 characters
there's already a small gain.
Comparing ./format.libcxx.out-baseline to ./format.libcxx.out Benchmark Time CPU Time Old Time New CPU Old CPU New -------------------------------------------------------------------------------------------------------------------------------- BM_format_string<char>/1 +0.0268 +0.0268 43 44 43 44 BM_format_string<char>/2 +0.0133 +0.0133 22 22 22 22 BM_format_string<char>/4 -0.0248 -0.0248 12 11 12 11 BM_format_string<char>/8 -0.0831 -0.0831 6 6 6 6 BM_format_string<char>/16 -0.2976 -0.2976 4 3 4 3 BM_format_string<char>/32 -0.4369 -0.4369 3 2 3 2 BM_format_string<char>/64 -0.6375 -0.6375 3 1 3 1 BM_format_string<char>/128 -0.7685 -0.7685 2 1 2 1
The int benchmark has benefits for the simple formatting, but shines for
the complex formatting:
Comparing ./formatter_int.libcxx.out-baseline to ./formatter_int.libcxx.out Benchmark Time CPU Time Old Time New CPU Old CPU New ---------------------------------------------------------------------------------------------------------------------------------------------------- BM_Basic<uint32_t> -0.2307 -0.2307 60 46 60 46 BM_Basic<int32_t> -0.1985 -0.1985 61 49 61 49 BM_Basic<uint64_t> -0.3478 -0.3479 81 53 81 53 BM_Basic<int64_t> -0.3475 -0.3475 81 53 81 53 BM_BasicLow<__uint128_t> -0.3388 -0.3388 86 57 86 57 BM_BasicLow<__int128_t> -0.3431 -0.3431 86 57 86 57 BM_Basic<__uint128_t> -0.2822 -0.2822 236 170 236 170 BM_Basic<__int128_t> -0.3107 -0.3107 219 151 219 151 Integral_LocFalse_BaseBin_AlignNone_Int64 -0.5781 -0.5781 178 75 178 75 Integral_LocFalse_BaseBin_AlignmentLeft_Int64 -0.9231 -0.9231 1156 89 1156 89 Integral_LocFalse_BaseBin_AlignmentCenter_Int64 -0.9179 -0.9179 1107 91 1107 91 Integral_LocFalse_BaseBin_AlignmentRight_Int64 -0.9238 -0.9238 1147 87 1147 87 Integral_LocFalse_BaseBin_ZeroPadding_Int64 -0.9170 -0.9170 1137 94 1137 94 Integral_LocFalse_BaseBin_AlignNone_Uint64 -0.5923 -0.5923 175 71 175 71 Integral_LocFalse_BaseBin_AlignmentLeft_Uint64 -0.9251 -0.9251 1154 86 1154 86 Integral_LocFalse_BaseBin_AlignmentCenter_Uint64 -0.9204 -0.9204 1105 88 1105 88 Integral_LocFalse_BaseBin_AlignmentRight_Uint64 -0.9242 -0.9242 1125 85 1125 85 Integral_LocFalse_BaseBin_ZeroPadding_Uint64 -0.9232 -0.9232 1139 88 1139 88 Integral_LocFalse_BaseOct_AlignNone_Int64 -0.3241 -0.3241 100 67 100 67 Integral_LocFalse_BaseOct_AlignmentLeft_Int64 -0.9322 -0.9322 1166 79 1166 79 Integral_LocFalse_BaseOct_AlignmentCenter_Int64 -0.9251 -0.9251 1108 83 1108 83 Integral_LocFalse_BaseOct_AlignmentRight_Int64 -0.9303 -0.9303 1136 79 1136 79 Integral_LocFalse_BaseOct_ZeroPadding_Int64 -0.9264 -0.9264 1156 85 1156 85 Integral_LocFalse_BaseOct_AlignNone_Uint64 -0.3116 -0.3116 96 66 96 66 Integral_LocFalse_BaseOct_AlignmentLeft_Uint64 -0.9310 -0.9310 1168 81 1168 81 Integral_LocFalse_BaseOct_AlignmentCenter_Uint64 -0.9281 -0.9281 1128 81 1128 81 Integral_LocFalse_BaseOct_AlignmentRight_Uint64 -0.9299 -0.9299 1148 80 1148 80 Integral_LocFalse_BaseOct_ZeroPadding_Uint64 -0.9288 -0.9288 1153 82 1153 82 Integral_LocFalse_BaseDec_AlignNone_Int64 -0.3342 -0.3342 95 63 95 63 Integral_LocFalse_BaseDec_AlignmentLeft_Int64 -0.9360 -0.9360 1157 74 1157 74 Integral_LocFalse_BaseDec_AlignmentCenter_Int64 -0.9303 -0.9303 1128 79 1128 79 Integral_LocFalse_BaseDec_AlignmentRight_Int64 -0.9369 -0.9369 1164 73 1164 73 Integral_LocFalse_BaseDec_ZeroPadding_Int64 -0.9323 -0.9323 1157 78 1157 78 Integral_LocFalse_BaseDec_AlignNone_Uint64 -0.3198 -0.3198 93 63 93 63 Integral_LocFalse_BaseDec_AlignmentLeft_Uint64 -0.9351 -0.9351 1158 75 1158 75 Integral_LocFalse_BaseDec_AlignmentCenter_Uint64 -0.9298 -0.9298 1128 79 1128 79 Integral_LocFalse_BaseDec_AlignmentRight_Uint64 -0.9361 -0.9361 1157 74 1157 74 Integral_LocFalse_BaseDec_ZeroPadding_Uint64 -0.9333 -0.9333 1151 77 1151 77 Integral_LocFalse_BaseHex_AlignNone_Int64 -0.3020 -0.3020 89 62 89 62 Integral_LocFalse_BaseHex_AlignmentLeft_Int64 -0.9357 -0.9357 1174 75 1174 75 Integral_LocFalse_BaseHex_AlignmentCenter_Int64 -0.9319 -0.9319 1129 77 1129 77 Integral_LocFalse_BaseHex_AlignmentRight_Int64 -0.9350 -0.9350 1161 75 1161 75 Integral_LocFalse_BaseHex_ZeroPadding_Int64 -0.9293 -0.9293 1150 81 1150 81 Integral_LocFalse_BaseHex_AlignNone_Uint64 -0.3056 -0.3057 86 59 86 59 Integral_LocFalse_BaseHex_AlignmentLeft_Uint64 -0.9378 -0.9378 1174 73 1174 73 Integral_LocFalse_BaseHex_AlignmentCenter_Uint64 -0.9341 -0.9341 1129 74 1130 74 Integral_LocFalse_BaseHex_AlignmentRight_Uint64 -0.9361 -0.9361 1157 74 1157 74 Integral_LocFalse_BaseHex_ZeroPadding_Uint64 -0.9315 -0.9315 1147 79 1147 79 Integral_LocFalse_BaseHexUpper_AlignNone_Int64 -0.0019 -0.0019 91 90 91 90 Integral_LocFalse_BaseHexUpper_AlignmentLeft_Int64 -0.9099 -0.9099 1162 105 1162 105 Integral_LocFalse_BaseHexUpper_AlignmentCenter_Int64 -0.9041 -0.9041 1121 108 1121 108 Integral_LocFalse_BaseHexUpper_AlignmentRight_Int64 -0.9086 -0.9086 1162 106 1162 106 Integral_LocFalse_BaseHexUpper_ZeroPadding_Int64 -0.9057 -0.9057 1164 110 1164 110 Integral_LocFalse_BaseHexUpper_AlignNone_Uint64 +0.0110 +0.0110 86 87 86 87 Integral_LocFalse_BaseHexUpper_AlignmentLeft_Uint64 -0.9136 -0.9136 1161 100 1161 100 Integral_LocFalse_BaseHexUpper_AlignmentCenter_Uint64 -0.9078 -0.9078 1133 104 1133 104 Integral_LocFalse_BaseHexUpper_AlignmentRight_Uint64 -0.9132 -0.9132 1177 102 1177 102 Integral_LocFalse_BaseHexUpper_ZeroPadding_Uint64 -0.9091 -0.9091 1160 105 1160 105
Other benchmarks give similar results.
These existing functions should be __uglified I'll do that in a followup patch.