Page MenuHomePhabricator

[builtins] Unify the softfloat division implementation

Authored by atrosinenko on Jul 31 2020, 8:00 AM.



This patch replaces three different pre-existing implementations of div[sdt]f3 LibCalls with a generic one - like it is already done for many other LibCalls.

The patch was written with intent of the proof being as self-contained as possible, so future contributors do not have to re-prove it again. On the other hand, this may make it look somewhat cluttered, so feedback on both correctness and readability is highly appreciated.

When fuzzing with AFL++ (25M iterations each type width), just one error was found: for single precision, 0x1.fffffep-126F divided by 2.F was not correctly rounded to exactly 1.0. On the other hand, this patch is just an intentionally simplified version of the full patch introducing a proper support for subnormal results - the full version fixes this issue as well.

This particular diff pretends to be an NFC refactoring and technically the above issue is not a regression because the original implementation yields the same result. :)

Diff Detail

Event Timeline

atrosinenko created this revision.Jul 31 2020, 8:00 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 31 2020, 8:00 AM
Herald added a subscriber: Restricted Project. · View Herald Transcript
atrosinenko requested review of this revision.Jul 31 2020, 8:01 AM

Revert auto-linting

Here are the benchmark and fuzzing harness used to test this patch.


Interesting fact: swapping these two seemingly commuting lines makes code slower by 15-25%. This applies to current Clang as well as to clang-8 from Ubuntu 20.04 repository.


Re-upload after D85731: [NFC][builtins] Make softfloat-related errors less noisy to get rid of "error: unknown type name 'fp_t'" and similar clang-tidy diagnostics for

atrosinenko edited the summary of this revision. (Show Details)Aug 12 2020, 2:01 AM
atrosinenko added a reviewer: sepavloff.

On linter diagnostics: error messages are due to linter trying to lint *.inc file that is not self-contained by design. D85731: [NFC][builtins] Make softfloat-related errors less noisy tries to make those errors more meaningful, at least.

Some of readability-identifier-naming warnings are just due to code style of runtime library written in C being different from the most of LLVM C++ code base, while some may signify actual issues with my variables' names (but here the preference was to carry as much hints as possible in names).

Rebase the entire patch stack against the up-to-date master and re-upload.

sepavloff added inline comments.Aug 20 2020, 9:17 AM

This estimation is absent from the original comment. Do you have reference where it came from? Also the original comment states This is accurate to about 3.5 binary digits. Is still true? If yes, it could be worth copying here.


The original comment states:

// This doubles the number of correct binary digits in the approximation
// with each iteration.

It is true in this implementation? If yes, it could be worth copying here.


It is good optimization. Could you please put a comment shortly describing the idea of using half-sized temporaries?


In what cases 16-bit temporaries are used? NUMBER_OF_HALF_ITERATIONS is set to zero in divsf3.c.


It would be better to put short comment to explain using 0 instead of 2.


x_UQ0_hw and b_UQ1_hw are declared inside the conditional block #if NUMBER_OF_HALF_ITERATIONS > 0. Does NUMBER_OF_FULL_ITERATIONS != 1 always imply NUMBER_OF_HALF_ITERATIONS > 0 ?

atrosinenko added a comment.EditedAug 20 2020, 10:23 AM

Thank you @sepavloff !

Some general context: The final goal was to have an explanation why this particular number of iteration (3, 4 or 5, depending on type) are enough for any a and b passed as input arguments taking into account errors due to particular finite precision computations. Initially, I was trying to just "mechanically" unify the three implementations and their comments like this. After trying for a while, turned out I cannot simply pick the original proof up and add the subnormal case. Some of statements were too vague, some seemed not explained enough, etc. Then, an attempt was made to re-prove it from scratch with an intention for the resulting explanation to be as much self-contained as possible. The implementation, on the other hand, gathers various features of three original functions and some hacks to make it easier to prove.


This approximation was deduced by writing down the derivative of f "in infinite precision" and finding its root. Then the values of f applied to its root, 1.0 and 2.0 were calculated -- as far as I remember, all of them were 3/4 - 1/sqrt(2) or negated - this is what "minimax polynomial" probably means, that term was just copied from the original implementation :).


For me this looks too vague. This is probably approximately true but I don't know how exactly this should be interpreted.


The idea is just "I guess this takes less CPU time and I have managed to prove error bounds for it". :) Specifically, for float128, the rep_t * rep_t multiplication will be emulated with lots of CPU instructions while the lower half contain some noise at that point. This particular optimization did exist in the original implementation for float64 and float128. For float32 it had not much sense, I guess. Still, estimations were calculated for the case of float32 with half-size iterations as it may be useful for MSP430 and other 16-bit targets.


Agree, this needs to be re-evaluated and some comment should be added at least. This could be dead code for now, that was expected to speed things up on 16-bit targets that even lack hardware multiplication sometimes (such as MSP430).


Agree, it was expected to be something like /* = 2.0 in UQ1.(HW-1) */. Naming things is especially painful here...



Hmm... It should imply == 0, at first glance... Generally, total number of iterations should be 3 for f32, 4 for f64 and 5 for f128. Then, error bounds are calculated. There are generally only two modes: n-1 half-size iteration + 1 full-size iteration OR n full-size iteration (as one generally has no performance gains from using 16x16-bit multiplications on one hand, and that particular case turned out to require extra rounding, on the other).

sepavloff added inline comments.Aug 20 2020, 10:09 PM

IIUC, you don't want to put this statement here because you are us not sure it is true? Sounds reasonable.


The idea is clear but it require some study of the sources. I would propose to add a comment saying:

At the first iterations number of significant digits is small, so we may use shorter type values. Operations on them are usually faster.

or something like that.


2.0 cannot be represented in UQ1.X. I would add comment line like:

Due to wrapping 2.0 in UQ1.X is equivalent to 0.

or something similar.


I have concern that x_UQ0_hw and x_UQ1_hw are declared in the block with condition NUMBER_OF_HALF_ITERATIONS > 0 but uses in the other block with condition !USE_NATIVE_FULL_ITERATIONS && NUMBER_OF_HALF_ITERATIONS > 0, so probably there may be a combination of the macros when the variables are used but not declared. Maybe it is impossible due to some reason, in this case a proper check may be put into the block #ifdef USE_NATIVE_FULL_ITERATIONS which asserts that NUMBER_OF_HALF_ITERATIONS > 0. Otherwise x_UQ0_hw and x_UQ1_hw need to be moved out of the conditional block.

atrosinenko updated this revision to Diff 287351.EditedAug 24 2020, 5:36 AM

Addressed the review comments mostly by clarifying the explanations.

I expect this code to have no unresolved review comments now. Please feel free to request further explanations in case of any unclarity.

sepavloff accepted this revision.Aug 26 2020, 6:41 AM


I don't fully understand the magic of fixing possible overflow, I hope you made enough investigation and testing to be sure it works as expected.
Please wait a couple of days before commit, so that other reviewers could make their notes.


Should the right part contain 1/b?

This revision is now accepted and ready to land.Aug 26 2020, 6:41 AM

No-change re-upload: rebase onto current master branch.

atrosinenko added inline comments.Aug 27 2020, 9:14 AM

What line are you referring to? For line 142, e_0 is defined as x_n - 1/b_hw in infinite precision (please note it intentionally refers b_hw that is a truncated version of b, see lines 113-114).

This update is expected to be completely NFC w.r.t. code behavior and significantly clarify the proof up to the end of half-width iterations.

Particularly, the reasoning about possible overflow of intermediate results turned out to be actually unclear/incorrect.

@sepavloff could you take a look on the new version in case it clarifies some of your questions? Another update for the second half of function may follow slightly later.

Add some other explanations.

Add more clarifications, fix explanation for "why it is enough to adjust only once in case of overflow".

The new comments are much better, thank you!
I think this version may be committed.

Clarify rounding-related part of function.

scanon added inline comments.Aug 31 2020, 6:54 AM

To be precise, what _minimax polynomial_ means is that p(x) = 3/4 + 1/sqrt(2) - x/2 is the first-order polynomial that minimizes the error term max(|1/x - p(x)|) on the interval [1,2]. I.e. every other linear polynomial would achieve a larger maximum error.

The bound of a minimax approximation to a well-behaved function is always achieved at the endpoints, so we can just evaluate at 1 to get the max error: |1/1 - 3/4 - 1/sqrt(2) + 1/2| = 3/4 - 1/sqrt(2) = 0.04289... (which is actually about _4.5_ bits).


N-R is quadratically convergent under a bunch of assumptions on how good the initial guess is and bounds on the second derivative, which are all satisfied here, but probably not worth going into in the comments. IIRC the usual reference here is Montuschi and Mezzalama's "Survey of square rooting algorithms" (1990).


This is absolutely standard in HW construction of pipelined iterative dividers and square root units, so I'm not sure how much explanation is really needed =)

sepavloff added inline comments.Aug 31 2020, 7:58 AM

This is absolutely standard in HW construction of pipelined iterative dividers and square root units, so I'm not sure how much explanation is really needed =)

I think now the code has enough explanations to be easily understood by mere mortals also :)

Update after the latest comments.

Thank you very much for review!

I have amended this diff based on the latest comment by @scanon.

So, I will land D85031 and then D85032: [builtins] Make divXf3 handle denormal results if there are no other objections.

This revision was landed with ongoing or failed builds.Sep 1 2020, 9:25 AM
This revision was automatically updated to reflect the committed changes.