This patch aims to give clang better information so it can inline the
__bit_reference count function, for both std::vector<bool> and
std::bitset. The current clang inliner cannot infer that the passed type
will be used only to select the optimized variant; it evaluates the
type argument and type check as a load plus compare (although later
optimization phases correctly optimize this out).
Reviewers: EricWF, mclow.lists
This looks like a behavior change to me.
The old code calls __count_bool_true if the _Tp can be static_cast to bool, and __count_bool_false otherwise.
The new code calls __count_bool_true if the _Tp is exactly bool, and __count_bool_false otherwise.
Sorry; the old code calls __count_bool_true if the static_cast<bool>(value) is true, and __count_bool_false otherwise.
You are correct; this change breaks std::vector<bool> with std::count. What about calling __count_bool_true directly from std::bitset,
to avoid having to pass the bool type? For std::count, I think we will need to fix it in the clang inliner (my initial plan).
This patch aims to give clang better information so it can inline the
__bit_reference count function for std::bitset. The current clang
inliner cannot infer that the passed type will be used only to select
the optimized variant; it evaluates the type argument and type check as
a load plus compare (although later optimization phases correctly
optimize this out).
I'm unclear on the magnitude of the improvement here.
Are we talking a single load + compare instruction in the call to std::count ?
Or something inside the loop?
[ I'm pretty sure that the patch is correct now - but I don't understand how important it is ]
It is mainly to help the LLVM inliner generate better code for std::bitset count on aarch64. It helps
both runtime and code size: if the inliner decides that _VSTD::count should not be inlined,
vectorization will create both aligned and unaligned variants (which adds both code size and
runtime costs).
For instance, on aarch64 the snippet:
#include <bitset>
int foo (std::bitset<256> &bt)
{
return bt.count();
}
Generates a text section of 844 bytes, while with the patch it is just 112 bytes (due to the vectorization
code being able to assume aligned input and generate just one code path).
As a side note, x86_64 is not affected because its cost analysis sees fewer instructions being
required, making the template instantiation less costly to inline.
Getting into this discussion a bit late, as we had an internal one just recently.
The change to remove always_inline from a number of libc++ template functions is a good one, especially where the inliner can already do a good job on its own.
In this case, however, because the type is a reference, the inliner would require a lot more effort to inspect the uses (and side effects).
Improving the inliner here would be a huge hammer, probably increasing compile time for all code for the minimal benefit of this very special case.
Then perhaps it would be beneficial and pragmatic to revert that removal in this special case.
Makes sense?
cheers,
--renato
The issue I have with defining it per symbol is the hackery it would require to handle _LIBCPP_INTERNAL_LINKAGE and its implications,
or at least adding *another* macro to inline some symbols depending on the configuration, etc.
I still think this patch is simpler than adding another flag to force always_inline, and it works better with the current clang inlining strategy.
After removing the __VSTD::, I'm good with this.
include/bitset
Line 994: Don't think we need the __VSTD:: here any more; no one should be defining their own __count_bool_true.
Line 1002: Have you investigated the codegen here? It seems to me that the same arguments (vectorization, alignment, etc.) would apply to operator== as well.