This is an archive of the discontinued LLVM Phabricator instance.

[libc][math] Make log10 correctly rounded for non-FMA targets and improve itsperformance.
ClosedPublic

Authored by lntue on May 5 2023, 7:13 PM.

Details

Summary

Make log10 correctly rounded for non-FMA targets and improve its
performance.

Implemented fast pass and accurate pass:

Fast Pass:

  • Range reduction step 0: Extract exponent and mantissa
x = 2^(e_x) * m_x
  • Range reduction step 1: Use lookup tables of size 2^7 = 128 to reduce the argument to:
 -2^-8 <= v = r * m_x - 1 < 2^-7
where r = 2^-8 * ceil( 2^8 * (1 - 2^-8) / (1 + k * 2^-7) )
and k = trunc( (m_x - 1) * 2^7 )
  • Polynomial approximation: approximate log(1 + v) by a degree-7 polynomial generated by Sollya with:
> P = fpminimax((log(1 + x) - x)/x^2, 5, [|D...|], [-2^-8, 2^-7]);
  • Combine the results:
log10(x) ~ ( e_x * log(2) - log(r) + v + v^2 * P(v) ) * log10(e)
  • Perform additive Ziv's test with errors bounded by P_ERR * v^2. Return the result if Ziv's test passed.

Accurate Pass:

  • Take e_x, v, and the lookup table index from the range reduction step of fast pass.
  • Perform 3 more range reduction steps:
    • Range reduction step 2: Use look-up tables of size 193 to reduce the argument to [-0x1.3ffcp-15, 0x1.3e3dp-15]
 v2 = r2 * (1 + v) - 1 = (1 + s2) * (1 + v) - 1 = s2 + v + s2 * v
where r2 = 2^-16 * round ( 2^16 / (1 + k * 2^-14) )
and k = trunc( v * 2^14 + 0.5 ).
  • Range reduction step 3: Use look-up tables of size 161 to reduce the argument to [-0x1.01928p-22 , 0x1p-22]
 v3 = r3 * (1 + v2) - 1 = (1 + s3) * (1 + v2) - 1 = s3 + v2 + s3 * v2
where r3 = 2^-21 * round ( 2^21 / (1 + k * 2^-21) )
and k = trunc( v * 2^21 + 0.5 ).
  • Range reduction step 4: Use look-up tables of size 130 to reduce the argument to [-0x1.0002143p-29 , 0x1p-29]
 v4 = r4 * (1 + v3) - 1 = (1 + s4) * (1 + v3) - 1 = s4 + v3 + s4 * v3
where r4 = 2^-28 * round ( 2^28 / (1 + k * 2^-28) )
and k = trunc( v * 2^28 + 0.5 ).
  • Polynomial approximation: approximate log10(1 + v4) by a degree-4 minimax polynomial generated by Sollya with:
> P = fpminimax(log10(1 + x)/x, 3, [|128...|], [-0x1.0002143p-29 , 0x1p-29]);
  • Combine the results:
log10(x) ~ e_x * log10(2) - log10(r) - log10(r2) - log10(r3) - log10(r4) + v * P(v)
  • The combined results are computed using floating points of 128-bit precision.

Performance

  • For 0.5 <= x <= 2, the fast pass hitting rate is about 99.92%.
  • Reciprocal throughput from CORE-MATH's perf tool on Ryzen 5900X:
$ ./perf.sh log10
GNU libc version: 2.35
GNU libc release: stable

-- CORE-MATH reciprocal throughput -- with FMA
[####################] 100 %
Ntrial = 20 ; Min = 20.402 + 0.589 clc/call; Median-Min = 0.277 clc/call; Max = 22.752 clc/call;

-- CORE-MATH reciprocal throughput -- without FMA (-march=x86-64-v2)
[####################] 100 %
Ntrial = 20 ; Min = 75.797 + 3.317 clc/call; Median-Min = 3.407 clc/call; Max = 79.371 clc/call;

-- System LIBC reciprocal throughput --
[####################] 100 %
Ntrial = 20 ; Min = 22.668 + 0.184 clc/call; Median-Min = 0.181 clc/call; Max = 23.205 clc/call;

-- LIBC reciprocal throughput -- with FMA
[####################] 100 %
Ntrial = 20 ; Min = 25.977 + 0.183 clc/call; Median-Min = 0.138 clc/call; Max = 26.283 clc/call;

-- LIBC reciprocal throughput -- without FMA
[####################] 100 %
Ntrial = 20 ; Min = 22.140 + 0.980 clc/call; Median-Min = 0.853 clc/call; Max = 23.790 clc/call;
  • Latency from CORE-MATH's perf tool on Ryzen 5900X:
$ ./perf.sh log10 --latency
GNU libc version: 2.35
GNU libc release: stable

-- CORE-MATH latency -- with FMA
[####################] 100 %
Ntrial = 20 ; Min = 54.613 + 0.357 clc/call; Median-Min = 0.287 clc/call; Max = 55.701 clc/call;

-- CORE-MATH latency -- without FMA (-march=x86-64-v2)
[####################] 100 %
Ntrial = 20 ; Min = 79.681 + 0.482 clc/call; Median-Min = 0.294 clc/call; Max = 81.604 clc/call;

-- System LIBC latency --
[####################] 100 %
Ntrial = 20 ; Min = 61.532 + 0.208 clc/call; Median-Min = 0.199 clc/call; Max = 62.256 clc/call;

-- LIBC latency -- with FMA
[####################] 100 %
Ntrial = 20 ; Min = 41.510 + 0.205 clc/call; Median-Min = 0.244 clc/call; Max = 41.867 clc/call;

-- LIBC latency -- without FMA
[####################] 100 %
Ntrial = 20 ; Min = 55.669 + 0.240 clc/call; Median-Min = 0.280 clc/call; Max = 56.056 clc/call;
  • Accurate pass latency:
$ ./perf.sh log10 --latency --simple_stat
GNU libc version: 2.35
GNU libc release: stable

-- CORE-MATH latency -- with FMA
640.688

-- CORE-MATH latency -- without FMA (-march=x86-64-v2)
667.354

-- LIBC latency -- with FMA
495.593

-- LIBC latency -- without FMA
504.143

Diff Detail

Event Timeline

lntue created this revision.May 5 2023, 7:13 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptMay 5 2023, 7:13 PM
lntue requested review of this revision.May 5 2023, 7:13 PM
lntue retitled this revision from [libc][math] Make log10 correctly rounded for non-FMA targets and improve its performance. to [libc][math] Make log10 correctly rounded for non-FMA targets and improve itsperformance..May 5 2023, 8:06 PM
lntue edited the summary of this revision. (Show Details)
lntue updated this revision to Diff 520029.May 5 2023, 8:53 PM

Fix log10(1.0) = +0.0 for all rounding mode.

lntue updated this revision to Diff 520721.May 9 2023, 9:00 AM

Update error bounds of the fast pass and sync to HEAD.

lntue updated this revision to Diff 524103.May 21 2023, 7:51 AM

Sync to HEAD.

zimmermann6 accepted this revision.May 23 2023, 4:51 AM

all worst cases from core-math do pass. However I get a larger reciprocal throughput on a AMD EPYC 7282:

GNU libc version: 2.36
GNU libc release: stable
[####################] 100 %
Ntrial = 20 ; Min = 21.270 + 0.322 clc/call; Median-Min = 0.302 clc/call; Max = 23.839 clc/call;
[####################] 100 %
Ntrial = 20 ; Min = 26.172 + 0.346 clc/call; Median-Min = 0.310 clc/call; Max = 26.798 clc/call;
[####################] 100 %
Ntrial = 20 ; Min = 49.688 + 0.306 clc/call; Median-Min = 0.308 clc/call; Max = 50.288 clc/call;

For the latency I get similar values:

GNU libc version: 2.36
GNU libc release: stable
[####################] 100 %
Ntrial = 20 ; Min = 65.915 + 0.316 clc/call; Median-Min = 0.282 clc/call; Max = 66.518 clc/call;
[####################] 100 %
Ntrial = 20 ; Min = 73.612 + 0.389 clc/call; Median-Min = 0.326 clc/call; Max = 74.422 clc/call;
[####################] 100 %
Ntrial = 20 ; Min = 66.387 + 0.276 clc/call; Median-Min = 0.303 clc/call; Max = 67.140 clc/call;
This revision is now accepted and ready to land.May 23 2023, 4:51 AM
This revision was landed with ongoing or failed builds.May 23 2023, 7:18 AM
This revision was automatically updated to reflect the committed changes.