This is an archive of the discontinued LLVM Phabricator instance.

[libcxx] Call __count_bool_true for bitset count
ClosedPublic

Authored by zatrazz on Dec 7 2018, 5:08 AM.

Details

Summary

This patch aims to help clang with better information so it can inline
__bit_reference count function usage, for both std::vector<bool> and
bitset. Current clang inliner can not infer that the passed typed
will be used only to select the optimized variant, it evaluates the
type argument and type check as a load plus compare (although later
optimization phases correctly optimized this out).

Diff Detail

Event Timeline

zatrazz created this revision.Dec 7 2018, 5:08 AM

This looks like a behavior change to me.
The old code calls __count_bool_true if the _Tp can be static_cast to bool, and __count_bool_false otherwise.
The new code calls __count_bool_true if the _Tp is exactly bool, and __count_bool_false otherwise.

This looks like a behavior change to me.
The old code calls __count_bool_true if the _Tp can be static_cast to bool, and __count_bool_false otherwise.
The new code calls __count_bool_true if the _Tp is exactly bool, and __count_bool_false otherwise.

Sorry; the old code calls __count_bool_true if the static_cast<bool>(value) is true, and __count_bool_false otherwise.

This looks like a behavior change to me.
The old code calls __count_bool_true if the _Tp can be static_cast to bool, and __count_bool_false otherwise.
The new code calls __count_bool_true if the _Tp is exactly bool, and __count_bool_false otherwise.

Sorry; the old code calls __count_bool_true if the static_cast<bool>(value) is true, and __count_bool_false otherwise.

You are correct, this change breaks std::vector<bool> with std::count. What about calling __count_bool_true direct on std::bitset
to avoid having to pass with bool type? For std::count I think we will need to fix it on clang inliner (my initial plan).

zatrazz updated this revision to Diff 177489.Dec 10 2018, 5:29 AM
zatrazz retitled this revision from [libcxx] Use specialized helper for __bit_reference count to [libcxx] Call __count_bool_true for bitset count.

This patch aims to help clang with better information so it can inline
__bit_reference count function usage for both std::biset. Current clang
inliner can not infer that the passed typed will be used only to select
the optimized variant, it evaluates the type argument and type check as
a load plus compare (although later optimization phases correctly
optimized this out).

This patch aims to help clang with better information so it can inline
__bit_reference count function usage for both std::biset. Current clang
inliner can not infer that the passed typed will be used only to select
the optimized variant, it evaluates the type argument and type check as
a load plus compare (although later optimization phases correctly
optimized this out).

I'm unclear on the magnitude of the improvement here.
Are we talking a single load + compare instruction in the call to std::count ?
Or something inside the loop?

[ I'm pretty sure that the patch is correct now - but I don't understand how important it is ]

This patch aims to help clang with better information so it can inline
__bit_reference count function usage for both std::biset. Current clang
inliner can not infer that the passed typed will be used only to select
the optimized variant, it evaluates the type argument and type check as
a load plus compare (although later optimization phases correctly
optimized this out).

I'm unclear on the magnitude of the improvement here.
Are we talking a single load + compare instruction in the call to std::count ?
Or something inside the loop?

[ I'm pretty sure that the patch is correct now - but I don't understand how important it is ]

It is mainly to help llvm inliner to generate better code for std::bitset count for aarch64. It helps
on both runtime and code size, since if inline decides that _VSTD::count should not be inlined
the vectorization will create both aligned and unaligned variants (which add both code size and
runtime costs)

For instance, on aarch64 the snippet:

#include <bitset>

int foo (std::bitset<256> &bt)
{

return bt.count();

}

Generates a text of 844 bytes, while with the patch is just 112 bytes (due vectorization code
being able to assume aligned input and just generate one code path).

As a side note, x86_64 it is not affected because of the cost analysis being done see less
instruction being required and the template instantiation being less costly.

Getting a bit late in this discussion, as we had an internal one just recently.

The change to remove always_inline in a number of libc++ template functions is a good one, especially when the inliner can guess and does a good job already.

In this case, however, because the type is a reference, the inliner would require a lot more effort to inspect the uses (and side-effects).

Improving the inliner here would be a huge hammer, probably increasing compile time for all codes for the minimal benefit of this very special case.

Then perhaps, it would be beneficial and pragmatic, to revert that removal in this special case.

Makes sense?

cheers,
--renato

Getting a bit late in this discussion, as we had an internal one just recently.

The change to remove always_inline in a number of libc++ template functions is a good one, especially when the inliner can guess and does a good job already.

In this case, however, because the type is a reference, the inliner would require a lot more effort to inspect the uses (and side-effects).

Improving the inliner here would be a huge hammer, probably increasing compile time for all codes for the minimal benefit of this very special case.

Then perhaps, it would be beneficial and pragmatic, to revert that removal in this special case.

The issue I have to define it per symbol is the hackery it would require to handle _LIBCPP_INTERNAL_LINKAGE and its implications,
or at least add *another* macro to inline some symbols depending of the configuration/etc.

Getting a bit late in this discussion, as we had an internal one just recently.

The change to remove always_inline in a number of libc++ template functions is a good one, especially when the inliner can guess and does a good job already.

In this case, however, because the type is a reference, the inliner would require a lot more effort to inspect the uses (and side-effects).

Improving the inliner here would be a huge hammer, probably increasing compile time for all codes for the minimal benefit of this very special case.

Then perhaps, it would be beneficial and pragmatic, to revert that removal in this special case.

The issue I have to define it per symbol is the hackery it would require to handle _LIBCPP_INTERNAL_LINKAGE and its implications,
or at least add *another* macro to inline some symbols depending of the configuration/etc.

I still think this patch is simpler than adding another flag to instruct always inline and works better with current clang inline strategy.

mclow.lists accepted this revision.Jan 10 2019, 10:27 AM

After removing the __VSTD::, I'm good with this.

include/bitset
994

Don't think we need the __VSTD:: here any more; no one should be defining their own __count_bool_true.

1002

Have you investigated the codegen here? It seems to me that the same arguments (vectorization, alignment, etc) would apply to operator== as well.

This revision is now accepted and ready to land.Jan 10 2019, 10:27 AM
zatrazz marked 2 inline comments as done.Jan 11 2019, 2:39 AM
zatrazz added inline comments.
include/bitset
994

I will remove it, thank.

1002

Good call, I will investigate it.

zatrazz closed this revision.Jan 11 2019, 9:35 AM