This is an archive of the discontinued LLVM Phabricator instance.

[libc][math] Improve tanhf performance.
ClosedPublic

Authored by lntue on Sep 15 2022, 5:53 PM.

Details

Summary

Optimize the core part of the tanhf implementation, which is the computation
of e^x, in a way similar to https://reviews.llvm.org/D133870. Factor out the
constants and the polynomial approximation so that they can be reused for exp10f.
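
For readers unfamiliar with why tanhf reduces to an exponential: the standard identity tanh(x) = (e^(2x) - 1) / (e^(2x) + 1) means that one accurate exponential evaluation dominates the cost. The sketch below is only an illustration of that identity, not the code from this patch. The name `tanhf_sketch`, the use of `std::expm1` in double precision, and the cutoff of 20 are all assumptions chosen for clarity.

```cpp
#include <cmath>

// Illustrative only: `tanhf_sketch` is a hypothetical name, and this is not
// the implementation from the patch under review.
float tanhf_sketch(float x) {
  // Assumed conservative cutoff: for |x| > 20 the result rounds to +/-1 in
  // single precision, and returning early also handles +/-infinity.
  if (std::fabs(x) > 20.0f)
    return std::copysign(1.0f, x);
  // tanh(x) = (e^(2x) - 1) / (e^(2x) + 1) = E / (E + 2), with E = e^(2x) - 1.
  // expm1 avoids cancellation when x is close to 0. NaN propagates unchanged.
  double e = std::expm1(2.0 * static_cast<double>(x));
  return static_cast<float>(e / (e + 2.0));
}
```

Calling `tanhf_sketch(0.5f)` should agree closely with `std::tanh(0.5f)`. A production implementation would replace the library call with its own range reduction and polynomial, which is the part this revision speeds up.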

Performance benchmark using perf tool from the CORE-MATH project on Ryzen 1700:

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh tanhf
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH reciprocal throughput   : 13.377
System LIBC reciprocal throughput : 55.046

BEFORE:
LIBC reciprocal throughput        : 75.674
LIBC reciprocal throughput        : 33.242    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 25.927    (with `-mfma` flag)

AFTER:
LIBC reciprocal throughput        : 26.359
LIBC reciprocal throughput        : 18.888    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 14.243    (with `-mfma` flag)

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh tanhf --latency
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH latency   : 43.365
System LIBC latency : 123.499

BEFORE:
LIBC latency        : 112.968
LIBC latency        : 104.908   (with `-msse4.2` flag)
LIBC latency        : 92.310    (with `-mfma` flag)

AFTER:
LIBC latency        : 69.828
LIBC latency        : 63.874    (with `-msse4.2` flag)
LIBC latency        : 57.427    (with `-mfma` flag)

Event Timeline

lntue created this revision. Sep 15 2022, 5:53 PM
Herald added projects: Restricted Project, Restricted Project. Sep 15 2022, 5:53 PM
lntue requested review of this revision. Sep 15 2022, 5:53 PM
lntue edited the summary of this revision. Sep 15 2022, 5:59 PM
lntue updated this revision to Diff 460583. Sep 15 2022, 6:02 PM

Update docs.

does this patch need to be applied on another one? Or rebased? It does not apply cleanly to main (71e52a1), unless I did something wrong.

orex accepted this revision. Sep 16 2022, 4:30 AM
This revision is now accepted and ready to land. Sep 16 2022, 4:30 AM
lntue updated this revision to Diff 460700. Sep 16 2022, 4:43 AM

Sync to head.

lntue added a comment. Sep 16 2022, 4:45 AM

does this patch need to be applied on another one? Or rebased? It does not apply cleanly to main (71e52a1), unless I did something wrong.

I've just synced the patch to HEAD. Can you try it again? Thanks.

lntue updated this revision to Diff 460775. Sep 16 2022, 8:29 AM

Switch to 32 sub-intervals for the middle part, which reduces the degree of
the polynomial approximation for the lower part from 7 to 5. Also update sinhf,
coshf, and exp2f to use the same range reduction. This slightly reduces their
latencies and reciprocal throughputs.
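
As a rough illustration of why more sub-intervals allow a lower polynomial degree, here is a minimal sketch of a three-part range reduction for e^x. It assumes the decomposition x = (hi + mid/32) * ln(2) + lo, with hi an integer, mid in [0, 31], and |lo| <= ln(2)/64, so that e^x = 2^hi * 2^(mid/32) * e^lo. The name `exp_sketch`, the table built at startup with `std::exp2`, and the Taylor coefficients are assumptions for illustration. The actual patch may use a different decomposition, precomputed constants, and a minimax polynomial.

```cpp
#include <array>
#include <cmath>

// Illustrative only: `exp_sketch` and its internals are hypothetical and do
// not reproduce the code from the patch under review.
double exp_sketch(double x) {
  constexpr int N = 32;                      // number of sub-intervals
  constexpr double LN2 = 0.6931471805599453; // ln(2)

  // Table of 2^(j/32) for j = 0..31. A real implementation would store
  // precomputed constants instead of calling exp2 at startup.
  static const std::array<double, N> EXP2_MID = [] {
    std::array<double, N> t{};
    for (int j = 0; j < N; ++j)
      t[j] = std::exp2(j / static_cast<double>(N));
    return t;
  }();

  // k = round(x * 32 / ln 2), so x = k * ln(2)/32 + lo with |lo| <= ln(2)/64.
  double kd = std::nearbyint(x * (N / LN2));
  double lo = x - kd * (LN2 / N);

  // Split k into hi = floor(k / 32) and mid = k - 32 * hi, with mid in [0, 31].
  double hid = std::floor(kd / N);
  int mid = static_cast<int>(kd - hid * N);
  int hi = static_cast<int>(hid);

  // Because |lo| is tiny, a degree-5 polynomial is enough for e^lo.
  // Taylor coefficients are used here as a stand-in for a minimax fit.
  double p = 1.0 + lo * (1.0 + lo * (1.0 / 2 + lo * (1.0 / 6 +
             lo * (1.0 / 24 + lo * (1.0 / 120)))));

  // e^x = 2^hi * 2^(mid/32) * e^lo.
  return std::ldexp(EXP2_MID[mid] * p, hi);
}
```

Shrinking the interval for lo shrinks the truncation error of a fixed-degree polynomial, so a finer table trades a slightly larger lookup for fewer multiply-add steps. The sketch is only meant for moderate inputs and does not handle overflow, underflow, or NaN.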

lntue updated this revision to Diff 460840. Sep 16 2022, 12:01 PM

Only return the lo part from the range reduction, separating it from the
powb_lo evaluation.

zimmermann6 accepted this revision. Sep 19 2022, 1:34 AM

I confirm the function is still correctly rounded, and the timings improved.

This revision was automatically updated to reflect the committed changes.