This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Use __libcpp_clz for a tighter __log2i function

Authored by hans on May 19 2022, 2:49 AM.



While looking at D113413 I noticed that __log2i could perhaps be improved to be both slightly smaller and faster.

I'm not familiar with how to run benchmarks for this if there are any, but looks like an improvement to me.

Diff Detail

Event Timeline

hans created this revision.May 19 2022, 2:49 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 19 2022, 2:49 AM
hans requested review of this revision.May 19 2022, 2:49 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 19 2022, 2:49 AM
Herald added 1 blocking reviewer(s): Restricted Project. · View Herald Transcript
ayzhao added a subscriber: ayzhao.May 19 2022, 9:46 AM
nilayvaish requested changes to this revision.May 21 2022, 12:31 PM

Can you take a look at the failed test?


I am wondering whether CHAR_BIT is preferred over using CHAR_BIT (available from climit header file). I don't know what the convention is for writing library code.


Instead of retaining this, should we just have a static_assert() on the size of the type being less or equal to that of unsigned long long?

This revision now requires changes to proceed.May 21 2022, 12:31 PM
philnik requested changes to this revision.May 21 2022, 12:38 PM
philnik added a subscriber: philnik.

The CI failue look unrelated, but I can't recall any problems with the AArch64 CI. Just rebase it and try again. I don't think there are any benchmarks, but it would be nice if you could provide some in libcxx/benchmarks. I'd also like to see benchmark results before approving.


I think CHAR_BIT is the preferred way. At least that's what I'm using.


We support __int128, so this needs to stay.

philnik retitled this revision from Use __libcpp_clz for a tighter __log2i function to [libc++] Use __libcpp_clz for a tighter __log2i function.May 21 2022, 12:40 PM
hans marked 4 inline comments as done.May 24 2022, 7:27 AM

Here are some benchmark numbers (requires D126297). I ran build.release/projects/libcxx/benchmarks/sort.libcxx.out --benchmark_filter=BM_Sort_uint32_QuickSortAdversary*.

Before my patch:

Benchmark                                         Time             CPU   Iterations
BM_Sort_uint32_QuickSortAdversary_1            4.51 ns         4.51 ns    154664960
BM_Sort_uint32_QuickSortAdversary_4            1.97 ns         1.97 ns    357040128
BM_Sort_uint32_QuickSortAdversary_16          0.941 ns        0.941 ns    742129664
BM_Sort_uint32_QuickSortAdversary_64           11.0 ns         11.0 ns     63176704
BM_Sort_uint32_QuickSortAdversary_256          18.1 ns         18.1 ns     38535168
BM_Sort_uint32_QuickSortAdversary_1024         33.1 ns         33.1 ns     21233664
BM_Sort_uint32_QuickSortAdversary_16384        49.4 ns         49.4 ns     14155776
BM_Sort_uint32_QuickSortAdversary_262144       59.9 ns         59.9 ns     11796480

After my patch:

Benchmark                                         Time             CPU   Iterations
BM_Sort_uint32_QuickSortAdversary_1            4.81 ns         4.81 ns    145227776
BM_Sort_uint32_QuickSortAdversary_4            1.82 ns         1.82 ns    379322368
BM_Sort_uint32_QuickSortAdversary_16          0.849 ns        0.849 ns    820772864
BM_Sort_uint32_QuickSortAdversary_64           11.6 ns         11.6 ns     60293120
BM_Sort_uint32_QuickSortAdversary_256          19.5 ns         19.5 ns     36175872
BM_Sort_uint32_QuickSortAdversary_1024         36.0 ns         36.0 ns     19660800
BM_Sort_uint32_QuickSortAdversary_16384        53.2 ns         53.2 ns     13369344
BM_Sort_uint32_QuickSortAdversary_262144       63.7 ns         63.7 ns     11272192

Looks like some numbers are up and some are down. I'd say this is perf neutral.

But it does have a small effect on size. The text of sort.libcxx.out goes from 509310 to 508542 bytes, so 768 bytes saved.

I also tried this on a "release" build of Chrome (without ThinLTO or PGO, so doesn't exactly match what we ship), and the text went from 259288968 to 259282312 bytes, so 6.5 KB saved.


It seems both are used roughly as often.

$ git grep -In [^_]CHAR_BIT -- libcxx/include | wc -l
$ git grep -In __CHAR_BIT__ -- libcxx/include | wc -l

Given that __CHAR_BIT__ avoids potentially pushing another header include on every user of <sort>, I'd prefer to stay with it.

I'd also like to see a clean CI run.


I think that's manageable. Preprocessing <climits> produces 104 LOC while preprocessing <algorithm> produces 32888 LOC. That's not even a drop in the ocean. And climits is already included anyways. Using CHAR_BIT just makes the code a bit nicer to read. IMO we should actually replace the uses of __CHAR_BIT__. If you are worried about compile times you should use -fmodules.

hans updated this revision to Diff 431705.May 24 2022, 9:58 AM
hans marked an inline comment as done.

Rebase to kick the CI.

hans added inline comments.May 25 2022, 7:08 AM

I suppose if <climits> is already included, it doesn't matter in this instance. I'll leave it to someone else to decide whether CHAR_BIT or __CHAR_BIT__ should be preferred in libc++ in general.

If you are worried about compile times you should use -fmodules.

If only things were that simple :-)

hans updated this revision to Diff 431973.May 25 2022, 7:08 AM

Switch to CHAR_BIT from <climits>.

Please take another look.

nilayvaish accepted this revision.May 25 2022, 9:03 PM
hans added a comment.May 27 2022, 8:57 AM

@philnik: Does it look good to you to?

philnik accepted this revision.May 27 2022, 9:03 AM


This revision is now accepted and ready to land.May 27 2022, 9:03 AM
This revision was automatically updated to reflect the committed changes.