This is an archive of the discontinued LLVM Phabricator instance.

I'm not sure it's safe to assume that the lost fraction is lfLessThanHalf, in general. At least, it isn't obvious to me, particularly for cases involving bfloat.

We already have a bunch of code to adjust the shift amount for truncations: the "If this is a truncation of a denormal number [...]" block. Can we adjust that code to reduce the shift amount in this case?

In D127140#3561467, @efriedma wrote:

I'm not sure it's safe to assume that the lost fraction is lfLessThanHalf, in general. At least, it isn't obvious to me, particularly for cases involving bfloat.

We already have a bunch of code to adjust the shift amount for truncations: the "If this is a truncation of a denormal number [...]" block. Can we adjust that code to reduce the shift amount in this case?

If I understood correctly (which is not a given, of course), using lfLessThanHalf would just prevent rounding up whatever number results from the right shift, which is exactly the expected behavior for formats like bfloat (which just truncates).
Conversely, I'm not sure it's safe to adjust the shift amount for truncations either. Is normalize expected to handle arbitrary input and "normalize" it to the current semantics, with the only exception being the "zero significand, non-zero lost fraction" case?
I've tried adding an assert and quickly found the place where the code does exactly this:

/* Underflow to zero and round.  */
category = fcNormal;
zeroSignificand();
fs = normalize(rounding_mode, lfLessThanHalf);

in APFloat.cpp:2770, IEEEFloat::convertFromDecimalString (from the EXPECT_TRUE(APFloat(APFloat::IEEEdouble(), "1e-99999").isPosZero()); test).

Submitted an alternative solution (adjusting the shift/exponent).

In some cases, we might need to round up a denormal number. Suppose, for example, the result of a conversion is exactly half of the smallest denormal number, or slightly greater than that. Then in "nearest" rounding mode, we need to round up. And in the case of float->bfloat, exactly half of the smallest denormal bfloat is a denormal float.

So I'm really not comfortable just overwriting the lost_fraction.
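The float->bfloat concern above can be checked on raw bit patterns. The following is a minimal standalone sketch, not the APFloat code: the helper names are hypothetical, and it assumes finite inputs only (NaN handling is omitted). It treats bfloat16 as the top 16 bits of an IEEE single, so the smallest denormal bfloat (2^-133) is the single 0x00010000, and exactly half of it (2^-134) is the denormal single 0x00008000.

```cpp
#include <cstdint>

// Hypothetical helpers: round an IEEE single bit pattern to bfloat16 by
// dropping the low 16 bits. Finite inputs only; NaNs are not handled.
constexpr std::uint16_t toBFloat16TiesToEven(std::uint32_t bits) {
  // Adding 0x7FFF plus the lowest surviving bit rounds ties to even.
  return static_cast<std::uint16_t>(
      (bits + 0x7FFFu + ((bits >> 16) & 1u)) >> 16);
}

constexpr std::uint16_t toBFloat16TiesToAway(std::uint32_t bits) {
  // Adding half a unit in the last place rounds ties away from zero.
  return static_cast<std::uint16_t>((bits + 0x8000u) >> 16);
}

// Exactly half of the smallest denormal bfloat: the tie goes to zero (even).
static_assert(toBFloat16TiesToEven(0x00008000u) == 0x0000, "tie rounds to even");
// Slightly more than half: must round up to the smallest denormal bfloat.
static_assert(toBFloat16TiesToEven(0x00008001u) == 0x0001, "rounds up");
// The smallest denormal bfloat itself is exactly representable.
static_assert(toBFloat16TiesToEven(0x00010000u) == 0x0001, "exact");
// Under ties-away, exactly half already rounds up.
static_assert(toBFloat16TiesToAway(0x00008000u) == 0x0001, "tie rounds away");
```

This is why the lost fraction cannot simply be overwritten with lfLessThanHalf: the second and fourth cases need to round up.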

The "Underflow to zero and round" dance in convertFromDecimalString is meant to handle different rounding modes, for example, round away from zero. normalize() rounds the result appropriately. So either lfExactlyZero or lfLessThanHalf is meaningful with a zero significand, but lfMoreThanHalf isn't really.

normalize() should handle "arbitrary" inputs, in the sense that it can correctly handle significands and exponents which aren't directly representable in the destination floating-point format.
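To illustrate the rounding-mode point above, here is a rough standalone sketch of a "should this round away from zero" decision driven by the lost fraction. The enum and function names are hypothetical and only modeled on the categories named in this thread; this is not the APFloat implementation.

```cpp
// Hypothetical names; a standalone sketch, not the APFloat implementation.
enum class LostFraction { ExactlyZero, LessThanHalf, ExactlyHalf, MoreThanHalf };
enum class Rounding { NearestTiesToEven, NearestTiesToAway, TowardZero,
                      TowardPositive, TowardNegative };

// Returns true when the truncated magnitude should be bumped by one unit in
// the last place. `lsbSet` is the lowest kept significand bit.
constexpr bool roundsAwayFromZero(Rounding mode, LostFraction lost,
                                  bool negative, bool lsbSet) {
  if (lost == LostFraction::ExactlyZero)
    return false; // nothing was lost, so the value is already exact
  switch (mode) {
  case Rounding::NearestTiesToEven:
    return lost == LostFraction::MoreThanHalf ||
           (lost == LostFraction::ExactlyHalf && lsbSet);
  case Rounding::NearestTiesToAway:
    return lost != LostFraction::LessThanHalf;
  case Rounding::TowardZero:
    return false;
  case Rounding::TowardPositive:
    return !negative;
  case Rounding::TowardNegative:
    return negative;
  }
  return false;
}

// Zero significand (lsbSet == false) with "less than half" lost:
// nearest rounding stays at zero, directed rounding may step to the
// smallest denormal.
static_assert(!roundsAwayFromZero(Rounding::NearestTiesToEven,
                                  LostFraction::LessThanHalf, false, false),
              "nearest keeps zero");
static_assert(roundsAwayFromZero(Rounding::TowardPositive,
                                 LostFraction::LessThanHalf, false, false),
              "toward +inf steps up");
```

In this sketch, a zero significand paired with "more than half" lost would round up under either nearest mode, which is the combination described above as not really meaningful.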

New patch looks better.

Can you add unittests for the float->bfloat case I mentioned before?

llvm/lib/Support/APFloat.cpp
2217	"caseisn't" -> "case isn't"

Harbormaster completed remote builds in B168350: Diff 434877. Jun 7 2022, 11:32 AM

Added float->bfloat conversion tests

@efriedma I've added float->bfloat16 conversion tests, but I'm not sure these are exactly the ones you had in mind.

Harbormaster completed remote builds in B168566: Diff 435153. Jun 8 2022, 7:55 AM

LGTM

This revision is now accepted and ready to land. Jun 8 2022, 9:49 AM

Closed by commit rGed6c309d4bf6: [APFloat] Fix truncation of certain subnormal numbers (authored by danilaml). Jun 8 2022, 11:55 AM

This revision was automatically updated to reflect the committed changes.

danilaml added a commit: rGed6c309d4bf6: [APFloat] Fix truncation of certain subnormal numbers.

Revision Contents

Path                                                  Size
llvm/lib/Support/APFloat.cpp                          9 lines
llvm/test/Transforms/InstSimplify/ConstProp/cast.ll   12 lines
llvm/unittests/ADT/APFloatTest.cpp                    42 lines

Diff 435280

llvm/lib/Support/APFloat.cpp

[...] if (&fromSemantics == &semX87DoubleExtended &&
[...]
     X86SpecialNan = true;
   }

   // If this is a truncation of a denormal number, and the target semantics
   // has larger exponent range than the source semantics (this can happen
   // when truncating from PowerPC double-double to double format), the
   // right shift could lose result mantissa bits. Adjust exponent instead
   // of performing excessive shift.
+  // Also do a similar trick in case shifting denormal would produce zero
+  // significand as this case isn't handled correctly by normalize.
   [Inline comment, efriedma (not done): "caseisn't" -> "case isn't"]
   if (shift < 0 && isFiniteNonZero()) {
-    int exponentChange = significandMSB() + 1 - fromSemantics.precision;
+    int omsb = significandMSB() + 1;
+    int exponentChange = omsb - fromSemantics.precision;
     if (exponent + exponentChange < toSemantics.minExponent)
       exponentChange = toSemantics.minExponent - exponent;
     if (exponentChange < shift)
       exponentChange = shift;
     if (exponentChange < 0) {
       shift -= exponentChange;
       exponent += exponentChange;
+    } else if (omsb <= -shift) {
+      exponentChange = omsb + shift - 1; // leave at least one bit set
+      shift -= exponentChange;
+      exponent += exponentChange;
     }
   }

   // If this is a truncation, perform the shift before we narrow the storage.
   if (shift < 0 && (isFiniteNonZero() || category==fcNaN))
     lostFraction = shiftRight(significandParts(), oldPartCount, -shift);

   // Fix the storage so it can hold to new value.
   if (newPartCount > oldPartCount) {
     // The new type requires more storage; make it available.
     integerPart *newParts;
   [Inline comment, danilaml (author, done): Not 100% sure it is safe to use this function here (after the shift), rather than just `exponent == semantics->minExponent` or something.]
     newParts = new integerPart[newPartCount];
     APInt::tcSet(newParts, 0, newPartCount);
     if (isFiniteNonZero() || category==fcNaN)
       APInt::tcAssign(newParts, significandParts(), oldPartCount);
     freeSignificand();
     significand.parts = newParts;
   } else if (newPartCount == 1 && oldPartCount != 1) {
     // Switch to built-in storage for a single part.
[...]
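As a worked example of the adjustment in the APFloat.cpp change above, the sketch below replays the same arithmetic on plain integers. It is a standalone illustration under stated assumptions: the struct and function names are hypothetical, the inputs are passed explicitly instead of being read from an APFloat, and the values assume a double->float truncation (source precision 53, shift 24 - 53 = -29, denormal exponent -1022, target minimum exponent -126).

```cpp
#include <cstdint>

// Hypothetical container for the two values the patch adjusts.
struct Adjusted {
  int shift;
  int exponent;
};

// Mirrors the arithmetic of the patched block. `omsb` is the one-based index
// of the most significant set bit of the significand; `shift` must be < 0.
constexpr Adjusted adjustForTruncation(int omsb, int exponent, int shift,
                                       int fromPrecision, int toMinExponent) {
  int exponentChange = omsb - fromPrecision;
  if (exponent + exponentChange < toMinExponent)
    exponentChange = toMinExponent - exponent;
  if (exponentChange < shift)
    exponentChange = shift;
  if (exponentChange < 0) {
    shift -= exponentChange;
    exponent += exponentChange;
  } else if (omsb <= -shift) {
    exponentChange = omsb + shift - 1; // leave at least one bit set
    shift -= exponentChange;
    exponent += exponentChange;
  }
  return {shift, exponent};
}

// double 0x0000000010000001: significand 0x10000001, so omsb == 29.
// A shift of -29 would clear every bit; the new branch shifts by 28 instead
// and moves the exponent down by one.
static_assert(adjustForTruncation(29, -1022, -29, 53, -126).shift == -28, "");
static_assert(adjustForTruncation(29, -1022, -29, 53, -126).exponent == -1023, "");
static_assert((std::uint64_t{0x10000001} >> 28) == 1, "one bit survives");

// double 0x0000000020000000: omsb == 30, one bit already survives a shift
// of 29, so nothing is adjusted.
static_assert(adjustForTruncation(30, -1022, -29, 53, -126).shift == -29, "");
```

With the first input, the significand stays non-zero after the shift, so the later rounding step sees a non-zero significand with a small lost fraction rather than a zero significand with more than half lost.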

llvm/test/Transforms/InstSimplify/ConstProp/cast.ll

[...]
 define float @trunc_denorm_lost_fraction0() {
 ; CHECK-LABEL: @trunc_denorm_lost_fraction0(
 ; CHECK-NEXT: ret float 0.000000e+00
 ;
   %b = fptrunc double 0x0000000010000000 to float
   ret float %b
 }

-; FIXME: This should be 0.0.
-
 define float @trunc_denorm_lost_fraction1() {
 ; CHECK-LABEL: @trunc_denorm_lost_fraction1(
-; CHECK-NEXT: ret float 0x36A0000000000000
+; CHECK-NEXT: ret float 0.000000e+00
 ;
   %b = fptrunc double 0x0000000010000001 to float
   ret float %b
 }

-; FIXME: This should be 0.0.
-
 define float @trunc_denorm_lost_fraction2() {
 ; CHECK-LABEL: @trunc_denorm_lost_fraction2(
-; CHECK-NEXT: ret float 0x36A0000000000000
+; CHECK-NEXT: ret float 0.000000e+00
 ;
   %b = fptrunc double 0x000000001fffffff to float
   ret float %b
 }

 define float @trunc_denorm_lost_fraction3() {
 ; CHECK-LABEL: @trunc_denorm_lost_fraction3(
 ; CHECK-NEXT: ret float 0.000000e+00
 ;
   %b = fptrunc double 0x0000000020000000 to float
   ret float %b
 }

-; FIXME: This should be -0.0.
-
 define float @trunc_denorm_lost_fraction4() {
 ; CHECK-LABEL: @trunc_denorm_lost_fraction4(
-; CHECK-NEXT: ret float 0xB6A0000000000000
+; CHECK-NEXT: ret float -0.000000e+00
 ;
   %b = fptrunc double 0x8000000010000001 to float
   ret float %b
 }

llvm/unittests/ADT/APFloatTest.cpp

[...] TEST(APFloatTest, convert) {
[...]
   EXPECT_EQ(status, APFloat::opInvalidOp);

   // The payload is lost in truncation. QNaN remains QNaN.
   test = APFloat::getQNaN(APFloat::IEEEdouble(), false, &payload);
   status = test.convert(APFloat::IEEEsingle(), APFloat::rmNearestTiesToEven, &losesInfo);
   EXPECT_EQ(0x7fc00000, test.bitcastToAPInt());
   EXPECT_TRUE(losesInfo);
   EXPECT_EQ(status, APFloat::opOK);
+
+  // Test that subnormals are handled correctly in double to float conversion
+  test = APFloat(APFloat::IEEEdouble(), "0x0.0000010000000p-1022");
+  test.convert(APFloat::IEEEsingle(), APFloat::rmNearestTiesToEven, &losesInfo);
+  EXPECT_EQ(0.0f, test.convertToFloat());
+  EXPECT_TRUE(losesInfo);
+
+  test = APFloat(APFloat::IEEEdouble(), "0x0.0000010000001p-1022");
+  test.convert(APFloat::IEEEsingle(), APFloat::rmNearestTiesToEven, &losesInfo);
+  EXPECT_EQ(0.0f, test.convertToFloat());
+  EXPECT_TRUE(losesInfo);
+
+  test = APFloat(APFloat::IEEEdouble(), "-0x0.0000010000001p-1022");
+  test.convert(APFloat::IEEEsingle(), APFloat::rmNearestTiesToEven, &losesInfo);
+  EXPECT_EQ(0.0f, test.convertToFloat());
+  EXPECT_TRUE(losesInfo);
+
+  test = APFloat(APFloat::IEEEdouble(), "0x0.0000020000000p-1022");
+  test.convert(APFloat::IEEEsingle(), APFloat::rmNearestTiesToEven, &losesInfo);
+  EXPECT_EQ(0.0f, test.convertToFloat());
+  EXPECT_TRUE(losesInfo);
+
+  test = APFloat(APFloat::IEEEdouble(), "0x0.0000020000001p-1022");
+  test.convert(APFloat::IEEEsingle(), APFloat::rmNearestTiesToEven, &losesInfo);
+  EXPECT_EQ(0.0f, test.convertToFloat());
+  EXPECT_TRUE(losesInfo);
+
+  // Test subnormal conversion to bfloat
+  test = APFloat(APFloat::IEEEsingle(), "0x0.01p-126");
+  test.convert(APFloat::BFloat(), APFloat::rmNearestTiesToEven, &losesInfo);
+  EXPECT_EQ(0.0f, test.convertToFloat());
+  EXPECT_TRUE(losesInfo);
+
+  test = APFloat(APFloat::IEEEsingle(), "0x0.02p-126");
+  test.convert(APFloat::BFloat(), APFloat::rmNearestTiesToEven, &losesInfo);
+  EXPECT_EQ(0x01, test.bitcastToAPInt());
+  EXPECT_FALSE(losesInfo);
+
+  test = APFloat(APFloat::IEEEsingle(), "0x0.01p-126");
+  test.convert(APFloat::BFloat(), APFloat::rmNearestTiesToAway, &losesInfo);
+  EXPECT_EQ(0x01, test.bitcastToAPInt());
+  EXPECT_TRUE(losesInfo);
 }

 TEST(APFloatTest, PPCDoubleDouble) {
   APFloat test(APFloat::PPCDoubleDouble(), "1.0");
   EXPECT_EQ(0x3ff0000000000000ull, test.bitcastToAPInt().getRawData()[0]);
   EXPECT_EQ(0x0000000000000000ull, test.bitcastToAPInt().getRawData()[1]);

   // LDBL_MAX
[...]
