This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Optimize transform_reduce for floating point types
Needs ReviewPublic

Authored by philnik on May 25 2023, 4:58 PM.

Details

Reviewers
None
Group Reviewers
Restricted Project
Summary

The standard doesn't define the order of execution for transform_reduce on purpose, so we might as well make use of it.

Diff Detail

Event Timeline

philnik created this revision.May 25 2023, 4:58 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 25 2023, 4:59 PM
philnik requested review of this revision.May 25 2023, 4:59 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 25 2023, 4:59 PM
Herald added a reviewer: Restricted Project. · View Herald Transcript
philnik updated this revision to Diff 526095.May 26 2023, 9:17 AM

Fix stuff

Benchmarks:

------------------------------------------------------
Benchmark                           old            new
------------------------------------------------------
bm_reduce<float>/1             0.555 ns       0.710 ns
bm_reduce<float>/2             0.703 ns        1.58 ns
bm_reduce<float>/3              1.56 ns        1.88 ns
bm_reduce<float>/4              1.88 ns        2.19 ns
bm_reduce<float>/5              2.21 ns        2.51 ns
bm_reduce<float>/6              2.50 ns        2.84 ns
bm_reduce<float>/7              2.82 ns        3.16 ns
bm_reduce<float>/8              3.13 ns        3.45 ns
bm_reduce<float>/16             5.63 ns        1.69 ns
bm_reduce<float>/64             26.9 ns        2.88 ns
bm_reduce<float>/512             412 ns        17.9 ns
bm_reduce<float>/4096           3778 ns         219 ns
bm_reduce<float>/32768         30722 ns        1913 ns
bm_reduce<float>/262144       246213 ns       15445 ns
bm_reduce<float>/1048576      994918 ns       62948 ns

----------------------------------------------------------------
Benchmark                                    old             new
----------------------------------------------------------------
bm_transform_reduce<float>/1            0.863 ns        0.872 ns
bm_transform_reduce<float>/2             1.57 ns         1.58 ns
bm_transform_reduce<float>/3             1.88 ns         1.92 ns
bm_transform_reduce<float>/4             2.19 ns         2.23 ns
bm_transform_reduce<float>/5             2.50 ns         2.51 ns
bm_transform_reduce<float>/6             2.83 ns         2.82 ns
bm_transform_reduce<float>/7             3.13 ns         3.14 ns
bm_transform_reduce<float>/8             3.44 ns         3.46 ns
bm_transform_reduce<float>/16            3.23 ns         2.24 ns
bm_transform_reduce<float>/64            19.6 ns         4.56 ns
bm_transform_reduce<float>/512            356 ns         30.3 ns
bm_transform_reduce<float>/4096          3720 ns          241 ns
bm_transform_reduce<float>/32768        31159 ns         3177 ns
bm_transform_reduce<float>/262144      246021 ns        25263 ns
bm_transform_reduce<float>/1048576    1069957 ns       111642 ns
ldionne added inline comments.
libcxx/include/__numeric/reduce.h
37

Same here.

libcxx/include/__numeric/transform_reduce.h
51

We shouldn't be optimizing for user-defined types even if the operation is "trivial" (also applies to the other algo).

This makes me think that we need to revisit our definition of a trivial operation. std::plus<> is trivial but only for a specific set of types. And in fact I am not certain what we mean by "is trivial" anymore.

59–63

I would call the unseq version in the PSTL instead, that's more general and we'll end up using the appropriate backend.

Not looked at the patch closely but I wonder about the usage of the pragma and their scope.

libcxx/include/__numeric/reduce.h
40

Will this restore the old value of clang loop vectorize?