# [libc] Calculate ulp error after rounding MPFR result to the result type.ClosedPublicActions

Authored by sivachandra on Jun 20 2021, 10:13 PM.

# Diff Detail

### Event Timeline

sivachandra created this revision.Jun 20 2021, 10:13 PM
Herald added a project: Restricted Project. Jun 20 2021, 10:13 PM
Herald added subscribers: libc-commits, ecnelises, tschuett.
sivachandra requested review of this revision.Jun 20 2021, 10:13 PM

If this change makes sense, can we just switch comparisons with ulp error of 0.5 to comparisons with ulp error of 0.0?

libc/utils/MPFRWrapper/MPFRUtils.cpp
301

If we change to | input - float(mpfrValue) | / eps(input), we more or less calculating | input_bitfield - float(mpfrValue)_bitfield |, so in my opinion it's better to use
min (eps(input), eps(float(mpfrValue))), since that would avoid the case where float(mpfrValue) is 2^n and input is 2^n - (eps(x)/2) which is representable:

• if we use the eps(input), the calculated ulp will be 2,
• if we use min ( eps(input), eps(float(mpfrValue)) ), the calculated ulp will be 1.

A concrete example is that ulp( input = float(1 - 2^(-24)), float(mpfrValue) = float(1) )
Moreover, another advantage of using min (eps, eps) is that the ulp function will then be symmetric: ulp(a, b) = ulp(b, a).

libc/utils/MPFRWrapper/MPFRUtils.cpp
301

It seems like the problem you describe will occur irrespective of this change, no?
Also, why should we want ulp(b, a) = ulp(a, b)? Thinking of it now, it seems like we should define ulp error to be:

`|float(mpfrValue) - input|/eps(float(mpfrValue))`

My reasoning is, the error should be relative to what we think is the correct/more accurate answer. Since we treat float(mpfrValue) to be the more accurate one, we should be calculating the error wrt its eps. Anything wrong with this reasoning? The point you raise about ulp error of 2 vs 1 is very valid. We already it special case it one way. Should we special case the other way around as well? And may be that is why you are saying we should generalize these special cases with a symmetric algorithm to calculate the ulp error?

libc/utils/MPFRWrapper/MPFRUtils.cpp
301

Yes, it is the same problem with / eps(float(mpfrValue)), with input is 2^n and float(mpfrValue) is 2^n - (eps(x)/2). So the only reasonable way that will return 1 ulp for both cases is to / min (eps(input), eps(float(mpfrValue)). And actually these edge cases are the only time / min(eps, eps) gives different answers than / eps(input) or eps(float(mpfrValue).

libc/utils/MPFRWrapper/MPFRUtils.cpp
301

If we change to | input - float(mpfrValue) | / eps(input), we more or less calculating | input_bitfield - float(mpfrValue)_bitfield |, so in my opinion it's better to use
min (eps(input), eps(float(mpfrValue))), since that would avoid the case where float(mpfrValue) is 2^n and input is 2^n - (eps(x)/2) which is representable:

• if we use the eps(input), the calculated ulp will be 2,
• if we use min ( eps(input), eps(float(mpfrValue)) ), the calculated ulp will be 1.

A concrete example is that ulp( input = float(1 - 2^(-24)), float(mpfrValue) = float(1) )

I want to discuss this example more. Assuming we are dealing with single precision floating point numbers and that mpfrResult = float(actualMpfrResult):

```input = float(1 - 2 ^ (-24))
mpfrResult = float(1);
eps(mpfrResult) = 2 ^ (-23)
eps(input) = 2 ^ (-24)

|mpfrResult - input| = 2 ^ (-24)```

So, if we calculate ulp error wrt the input eps, then:

`ulp = (2 ^ (-24)) / (2 ^ (-24)) = 1`

If we calculate ulp error wrt eps of mpfrResult, then:

`ulp = (2 ^ (-24)) / (2 ^ (-23)) = 1/2`

Lets consider the other way around example:

```input = float(1 + 2 ^ (-23))
mpfrResult = float(1);
eps(mpfrResult) = 2 ^ (-23)
eps(input) = 2 ^ (-23)

|mpfrResult - input| = 2 ^ (-23)```

And so the ulp error will be 1 in whichever way we calculate.

Now, consider another example:

```input = float(1 - 2 ^ (-24))
mpfrResult = float(1 + 2 ^ (-23));
eps(mpfrResult) = 2 ^ (-23)
eps(input) = 2 ^ (-24)

|mpfrResult - input| = 2 ^ (-23) + 2 ^ (-24)```

So, if we calculate ulp error wrt the input eps, then:

`ulp = (2 ^ (-24) + 2 ^ (-23)) / (2 ^ (-24)) = 3`

If we calculate ulp error wrt eps of mpfrResult, then:

`ulp = (2 ^ (-24) + 2 ^ (-23)) / (2 ^ (-23)) = 1.5`

I think our goal should be to treat bit distances on either sides of 2^N uniformly where N = max(exp(input), exp(mpfrResult)). Then:

```N = max(exp(input), exp(mpfrResult))
eps_input = 2^(exp(input) - 23)
eps_mpfr = 2^(exp(mpfrResult - 23)
ulp = |2^N - input|/eps_input+ |2^N - mpfrResult|/eps_mpfr```

I think this formulation not only has the symmetry property, but also corresponds to the bit distances for close enough results (which do not differ in the exponent by more than 1). For results farther apart, I don't think it matters. WDYT?

libc/utils/MPFRWrapper/MPFRUtils.cpp
301

When exp(input) == exp(mpfrResult), the formula can be simply:

`ulp = |input - mpfrResult|/eps(input)`

This should also take care of numbers on either side of 0.

libc/utils/MPFRWrapper/MPFRUtils.cpp
301

In either case, we won't have to worry about 0, because eps(0) == eps( smallest non-zero denormal number).

301

One main problem with using max( eps(input), float(eps(mpfrResult)) ) is that it will give us a false-positive when:
input = float(1) and float(mpfrResult) = float(1 - 2^(-23))
Their representation differ by last 2 bits, but ulp calculation will return 1 if we use max eps.

libc/utils/MPFRWrapper/MPFRUtils.cpp
301

One main problem with using max( eps(input), float(eps(mpfrResult)) ) is that it will give us a false-positive when:
input = float(1) and float(mpfrResult) = float(1 - 2^(-23))
Their representation differ by last 2 bits, but ulp calculation will return 1 if we use max eps.

So, lets do ulp error calculation for this example using the scheme I proposed:

```eps_input = 2 ^ (-23)
eps_mpfr = 2 ^ (-24)
N = max(exp(input), exp(mpfrResult)) = 0
ulp = |2^N - input|/eps_input + |2^N - mpfrResult|/eps_mpfr = 0 + (2 ^ (-23))/(2 ^ (-24) = 2```

So, the ulp error calculated is as expected. The solution I am proposing is NOT to use max(eps(input), eps(mpfrResult)). May be you are misreading the step which calculates N?

301

In either case, we won't have to worry about 0, because eps(0) == eps( smallest non-zero denormal number).

What I meant to say is that, for all cases in which input and mpfrResult differ in sign, we can use the simple formula:

`ulp = |input - mpfrResult|/eps(input)`

This will be correct for close enough numbers (numbers which have the same exponent).

libc/utils/MPFRWrapper/MPFRUtils.cpp
301

Sorry for the confusion. So let summary as following: consider 4 options:

```ulp_1 = | input - float(mpfrResult) | / eps_input
ulp_2 = | input - float(mpfrResult) | / eps_mpfrResult
ulp_3 = | input - float(mpfrResult) | / min( eps_input, eps_mpfrResult )
ulp_4 = | input - float(mpfrResult) | / max( eps_input, eps_mpfrResult )```

When eps_input == eps(mpfrResult), all 4 ulp functions will return the same answer, so it doesn't matter which one to use in this case.

On the other hand, on the edge cases:

 input float(mpfrResult) eps_input eps_mpfrResult ulp_1 ulp_2 ulp_3 ulp_4 1 1 - 2^(-23) 2^(-23) 2^(-24) 1 2 2 1 1 - 2^(-23) 1 2^(-24) 2^(-23) 2 1 2 1

So if we using eps_input (ulp_1), we will risk accepting (1 approximating 1 - 2^(-23) with 1 bit of accuracy) and with eps_mpfrResult, we will risk accepting (1 - 2^(-23) approximating 1 with 1 bit of accuracy).

So I think using ulp_3 overall is the correct one to use if the goal is to have at most 1 bit difference compared to mpfr results.

libc/utils/MPFRWrapper/MPFRUtils.cpp
301

Let me define a ulp_5 as follows:

`ulp_5 = |2^N - input| / eps(input) + |2^N - mpfrResult| / eps(mpfrResult)`

where:

`N = max(exponent(input), exponent(mpfrResult))`

And now lets add a couple of more rows and a column to the table you have:

 input float(mpfrResult) eps_input eps_mpfrResult ulp_1 ulp_2 ulp_3 ulp_4 ulp_5 1 1 - 2^(-23) 2^(-23) 2^(-24) 1 2 2 1 2 1 - 2^(-23) 1 2^(-24) 2^(-23) 2 1 2 1 2 1 - 2^(-24) 1 + 2 ^ (-23) 2^(-24) 2^(-23) 3 1.5 3 1.5 2 1 + 2^(-23) 1 - 2 ^ (-24) 2^(-23) 2^(-24) 1.5 3 3 1.5 2

So, the point I am trying to make is that ulp_5 captures the bit distance better than the other definitions of ulp error. For example, ulp_3 is overestimating the bit distance.

libc/utils/MPFRWrapper/MPFRUtils.cpp
301

ulp_5 will work well when 2^N is between input and mpfrResult, but it will return wrong answer when 2^N < min( input, mpfrResult ).
For example, if input == 1 + 2^(-23) == mpfrResult, ulp_5 will return 2. ulp_3 does have overestimating bit distance, but as I mentioned, it does correctly capture whether the input and float(mpfrResult) are within 1-bit of each other.

libc/utils/MPFRWrapper/MPFRUtils.cpp
301

Yes. That is why, in a separate comment above, I said that if exponent(input) == exponent(mpfrResult), then ulp error should be calculated as:

`ulp = |input - mpfrResult|/eps(input)`

I have also said that, to keep it simple, we can apply the same formula when input and mpfrResult differ in sign.

That said, I think we both are now talking about the same thing. What I want to understand next is, how is this discussion related to change being attempted in this patch? As in, can the change to the ULP error formula be done in a separate patch? IIUC, what you are trying to point out is, with mpfrResult rounded to the target floating point format as done in this change, we can now make ULP error match the bit distance better?

lntue accepted this revision.Jun 23 2021, 11:18 AM