This is an archive of the discontinued LLVM Phabricator instance.

Differential D55427

[libcxx] Call __count_bool_true for bitset count
ClosedPublic

Authored by zatrazz on Dec 7 2018, 5:08 AM.

Details

Reviewers

EricWF
mclow.lists

Summary

This patch aims to give clang better information so that it can inline
the __bit_reference count function for both std::vector<bool> and
bitset. The current clang inliner cannot infer that the passed type
is used only to select the optimized variant; it evaluates the
type argument and the type check as a load plus compare (although later
optimization phases correctly optimize this out).

Event Timeline

zatrazz created this revision. Dec 7 2018, 5:08 AM

Herald added subscribers: libcxx-commits, ldionne. Dec 7 2018, 5:08 AM

This looks like a behavior change to me.
The old code calls __count_bool_true if the _Tp can be static_cast to bool, and __count_bool_false otherwise.
The new code calls __count_bool_true if the _Tp is exactly bool, and __count_bool_false otherwise.

In D55427#1323656, @mclow.lists wrote:

This looks like a behavior change to me.
The old code calls __count_bool_true if the _Tp can be static_cast to bool, and __count_bool_false otherwise.
The new code calls __count_bool_true if the _Tp is exactly bool, and __count_bool_false otherwise.

Sorry; the old code calls __count_bool_true if the static_cast<bool>(value) is true, and __count_bool_false otherwise.
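
The difference the reviewer describes can be sketched with a toy model. This is not the libc++ code: `count_true`, `count_false`, `count_by_value` and `count_by_type` are hypothetical stand-ins that only mirror the two dispatch rules as worded in the comments above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for the two specialized helpers named in the review.
inline std::size_t count_true(const std::vector<bool>& v) {
    return static_cast<std::size_t>(std::count(v.begin(), v.end(), true));
}

inline std::size_t count_false(const std::vector<bool>& v) {
    return v.size() - count_true(v);
}

// Old rule as described: dispatch on the run-time value converted to bool.
template <class T>
std::size_t count_by_value(const std::vector<bool>& v, const T& value) {
    return static_cast<bool>(value) ? count_true(v) : count_false(v);
}

// First revision as described: dispatch on whether T is exactly bool.
template <class T>
std::size_t count_by_type(const std::vector<bool>& v, const T&) {
    return std::is_same<T, bool>::value ? count_true(v) : count_false(v);
}

// Small self-check showing where the two rules disagree.
inline void check_dispatch_examples() {
    const std::vector<bool> v = {true, true, false};
    assert(count_by_value(v, 1) == 2);     // int 1 converts to true
    assert(count_by_type(v, 1) == 1);      // int is not bool, so the "false" path runs
    assert(count_by_value(v, true) == 2);  // both rules agree for a bool true
    assert(count_by_type(v, true) == 2);
}
```

Under these assumptions, counting with an `int` value of 1 gives different answers under the two rules, which is the kind of behavior change the comment points out.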

In D55427#1323715, @mclow.lists wrote:

Sorry; the old code calls __count_bool_true if the static_cast<bool>(value) is true, and __count_bool_false otherwise.

You are correct, this change breaks std::count with std::vector<bool>. What about calling __count_bool_true directly in std::bitset,
to avoid having to pass the bool type? For std::count, I think we will need to fix it in the clang inliner (my initial plan).

This patch aims to give clang better information so that it can inline
the __bit_reference count function for std::bitset. The current clang
inliner cannot infer that the passed type is used only to select
the optimized variant; it evaluates the type argument and the type check as
a load plus compare (although later optimization phases correctly
optimize this out).

Ping.

In D55427#1325285, @zatrazz wrote:

This patch aims to give clang better information so that it can inline
the __bit_reference count function for std::bitset. The current clang
inliner cannot infer that the passed type is used only to select
the optimized variant; it evaluates the type argument and the type check as
a load plus compare (although later optimization phases correctly
optimize this out).

I'm unclear on the magnitude of the improvement here.
Are we talking a single load + compare instruction in the call to std::count ?
Or something inside the loop?

[ I'm pretty sure that the patch is correct now - but I don't understand how important it is ]

In D55427#1329797, @mclow.lists wrote:

I'm unclear on the magnitude of the improvement here.
Are we talking a single load + compare instruction in the call to std::count ?
Or something inside the loop?

[ I'm pretty sure that the patch is correct now - but I don't understand how important it is ]

It is mainly to help the LLVM inliner generate better code for std::bitset count on aarch64. It helps
with both runtime and code size: if the inliner decides that _VSTD::count should not be inlined,
vectorization creates both aligned and unaligned variants (which add both code-size and
runtime costs).

For instance, on aarch64 the snippet:

    #include <bitset>

    int foo (std::bitset<256> &bt)
    {
      return bt.count();
    }

This generates 844 bytes of text, while with the patch it is just 112 bytes (because the vectorization
code can assume aligned input and generate just one code path).

As a side note, x86_64 is not affected, because the cost analysis sees fewer
instructions being required and the template instantiation being less costly.
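
To make the single-code-path idea concrete, here is a minimal sketch of counting set bits word by word over a fixed-size array. It is not the libc++ implementation of `__count_bool_true`; `popcount64` and `count_words` are hypothetical names, and the four-word layout is only an assumption matching a 256-bit set.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

// The function from the comment above, with an explicit cast added.
inline int foo(std::bitset<256>& bt) {
    return static_cast<int>(bt.count());
}

// Portable population count of one 64-bit word (Kernighan's method):
// each iteration clears the lowest set bit.
inline std::size_t popcount64(std::uint64_t w) {
    std::size_t n = 0;
    while (w != 0) {
        w &= w - 1;
        ++n;
    }
    return n;
}

// Hypothetical word-wise count: the trip count is a compile-time constant,
// so there is one straight loop and no partial-word handling.
inline std::size_t count_words(const std::uint64_t (&words)[4]) {
    std::size_t total = 0;
    for (std::uint64_t w : words) {
        total += popcount64(w);
    }
    return total;
}

// Self-check: the same four bits set in both representations.
inline void check_count_examples() {
    std::bitset<256> bt;
    bt.set(0);
    bt.set(63);
    bt.set(64);
    bt.set(255);
    const std::uint64_t words[4] = {
        (1ULL << 0) | (1ULL << 63), 1ULL, 0ULL, 1ULL << 63};
    assert(foo(bt) == 4);
    assert(count_words(words) == 4);
}
```

The byte counts quoted in the comment come from the author's aarch64 build; this sketch makes no claim about the code size it would produce.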

Getting a bit late in this discussion, as we had an internal one just recently.

The change to remove always_inline in a number of libc++ template functions is a good one, especially when the inliner can guess and does a good job already.

In this case, however, because the type is a reference, the inliner would require a lot more effort to inspect the uses (and side-effects).

Improving the inliner here would be a huge hammer, probably increasing compile time for all code for the minimal benefit of this very special case.

So perhaps it would be beneficial and pragmatic to revert that removal in this special case.

Makes sense?

cheers,
--renato

In D55427#1331048, @rengolin wrote:

Getting a bit late in this discussion, as we had an internal one just recently.

The change to remove always_inline in a number of libc++ template functions is a good one, especially when the inliner can guess and does a good job already.

In this case, however, because the type is a reference, the inliner would require a lot more effort to inspect the uses (and side-effects).

Improving the inliner here would be a huge hammer, probably increasing compile time for all code for the minimal benefit of this very special case.

So perhaps it would be beneficial and pragmatic to revert that removal in this special case.

The issue I have with defining it per symbol is the hackery it would require to handle _LIBCPP_INTERNAL_LINKAGE and its implications,
or at least adding *another* macro to inline some symbols depending on the configuration, etc.

In D55427#1331277, @zatrazz wrote:

The issue I have with defining it per symbol is the hackery it would require to handle _LIBCPP_INTERNAL_LINKAGE and its implications,
or at least adding *another* macro to inline some symbols depending on the configuration, etc.

I still think this patch is simpler than adding another flag to force always-inline, and it works better with clang's current inlining strategy.

Ping.

After removing the `_VSTD::`, I'm good with this.

include/bitset
Line 994: Don't think we need the `_VSTD::` here any more; no one should be defining their own `__count_bool_true`.
Line 1002: Have you investigated the codegen here? It seems to me that the same arguments (vectorization, alignment, etc.) would apply to `operator==` as well.

This revision is now accepted and ready to land. Jan 10 2019, 10:27 AM

zatrazz marked 2 inline comments as done. Jan 11 2019, 2:39 AM

zatrazz added inline comments.

include/bitset
Line 994: I will remove it, thanks.
Line 1002: Good call, I will investigate it.
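
One way to follow up on the `operator==` question would be a small test function analogous to `foo` above, compiled for aarch64 and inspected by hand. The sketch below is only such a harness; `bar` is a hypothetical name, and the thread does not establish whether `operator==` shows the same code-size behavior.

```cpp
#include <bitset>
#include <cassert>

// Hypothetical counterpart to foo() for inspecting operator== codegen.
inline bool bar(const std::bitset<256>& a, const std::bitset<256>& b) {
    return a == b;
}

// Sanity check that the harness compares as expected.
inline void check_equal_examples() {
    std::bitset<256> a;
    std::bitset<256> b;
    assert(bar(a, b));   // both empty
    a.set(200);
    assert(!bar(a, b));  // differ in one bit
    b.set(200);
    assert(bar(a, b));   // equal again
}
```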

zatrazz closed this revision. Jan 11 2019, 9:35 AM

Revision Contents

Path              Size
include/bitset    2 lines

Diff 177489

include/bitset

(985 lines not shown; context: bitset<_Size>::to_string(char __zero, char __one) const)

       return to_string<char, char_traits<char>, allocator<char> >(__zero, __one);
   }

   template <size_t _Size>
   inline
   size_t
   bitset<_Size>::count() const _NOEXCEPT
   {
  -    return static_cast<size_t>(_VSTD::count(base::__make_iter(0), base::__make_iter(_Size), true));
  +    return static_cast<size_t>(_VSTD::__count_bool_true(base::__make_iter(0), _Size));
   }

   template <size_t _Size>
   inline
   bool
   bitset<_Size>::operator==(const bitset& __rhs) const _NOEXCEPT
   {
       return _VSTD::equal(base::__make_iter(0), base::__make_iter(_Size), __rhs.__make_iter(0));
   }

   template <size_t _Size>
   inline
   bool
   bitset<_Size>::operator!=(const bitset& __rhs) const _NOEXCEPT
   {
       return !(*this == __rhs);

(100 lines not shown)