Certain subnormals would be incorrectly rounded away from zero.
Fixes #55838
Differential D127140
[APFloat] Fix truncation of certain subnormal numbers danilaml on Jun 6 2022, 11:33 AM. Authored by
Details Certain subnormals would be incorrectly rounded away from zero. Fixes #55838
Diff Detail
Event Timeline
Comment Actions I'm not sure it's safe to assume that the lost fraction is lfLessThanHalf, in general. At least, it isn't obvious to me, particularly for cases involving bfloat. We already have a bunch of code to adjust the shift amount for truncations: the "If this is a truncation of a denormal number [...]" block. Can we adjust that code to reduce the shift amount in this case? Comment Actions If I understood correctly (which is not a given, ofc), using lfLessThanHalf would just prevent rounding up whichever number results after right shift, which is exactly the expected behavior for stuff like bfloat (that just truncates). /* Underflow to zero and round. */ category = fcNormal; zeroSignificand(); fs = normalize(rounding_mode, lfLessThanHalf); in APFloat.cpp:2770, IEEEFloat::convertFromDecimalString (from the EXPECT_TRUE(APFloat(APFloat::IEEEdouble(), "1e-99999").isPosZero()); test). Comment Actions In some cases, we might need to round up a denormal number. Suppose, for example, the result of a conversion is exactly half of the smallest denormal number, or slightly greater than that. Then in "nearest" rounding mode, we need to round up. And in the case of float->bfloat, exactly half of the smallest denormal bfloat is a denormal float. So I'm really not comfortable just overwriting the lost_fraction. The "Underflow to zero and round" dance in convertFromDecimalString is meant to handle different rounding modes, for example, round away from zero. normalize() rounds the result appropriately. So either lfExactlyZero or lfLessThanHalf is meaningful with a zero signficand, but lfMoreThanHalf isn't really. normalize() should handle "arbitrary" inputs, in the sense that it can correctly handle significands and exponents which aren't directly representable in the destination floating-point format. Comment Actions New patch looks better. Can you add unittests for the float->bfloat case I mentioned before?
Comment Actions @efriedma I've added float->bloat16 conversion tests but I'm not sure these are exact same you had in mind. |
"caseisn't" -> "case isn't"