This is an archive of the discontinued LLVM Phabricator instance.

I was thinking about the implementation, and I am a little worried about the performance: there are too many `if` branches here. I would like to propose another solution; of course, it is up to you whether to accept it. Its main advantages are a slightly smaller number of arithmetic operations and only one "main" `if`. The second one is triggered very rarely.

  constexpr uint64_t add(const UInt<Bits> &x) {
    uint64_t carry = 0;
    for (size_t i = 0; i < WordCount; ++i) {
      // Wraps around if the sum is more than 2^64 - 1.
      val[i] += x.val[i];
      // If the addition wrapped, the result is less than both of the
      // original operands.
      if (val[i] < x.val[i]) {
        // Add the previous carry. A second wrap is not possible here.
        val[i] += carry;
        // Carry 1 into the next word.
        carry = 1;
      } else {
        val[i] += carry;
        // Most likely there is no wrap.
        if (likely(val[i] != 0))
          carry = 0;
        // Otherwise carry keeps its value: if it was 0, there is simply no
        // overflow; if it was 1, adding it wrapped val[i] to 0, so the
        // carry propagates to the next word.
      }
    }
    return carry;
  }
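To make the proposed loop easier to try outside the class, here is a minimal standalone sketch. It assumes a hypothetical `Words` alias (a `std::array` of 64-bit words, least significant word first) in place of `UInt<Bits>::val`, and it drops the `likely` hint, which is not defined in this review. Note that this version relies on wrapping word additions, which is exactly the behavior the patch under review tries to avoid.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for UInt<Bits>::val: 64-bit words, least
// significant word first.
template <std::size_t WordCount>
using Words = std::array<uint64_t, WordCount>;

// Adds x into val word by word and returns the final carry (0 or 1).
template <std::size_t WordCount>
uint64_t add_wrapping(Words<WordCount> &val, const Words<WordCount> &x) {
  uint64_t carry = 0;
  for (std::size_t i = 0; i < WordCount; ++i) {
    val[i] += x[i]; // Wraps modulo 2^64 if the sum does not fit.
    if (val[i] < x[i]) {
      // The addition wrapped, so the sum is at most 2^64 - 2 and adding
      // the incoming carry cannot wrap again.
      val[i] += carry;
      carry = 1;
    } else {
      val[i] += carry;
      // The word can only be 0 here if both inputs and the carry were 0
      // (carry stays 0) or if adding carry 1 wrapped it (carry stays 1).
      if (val[i] != 0)
        carry = 0;
    }
  }
  return carry;
}
```

The test vectors from `uint128_test.cpp` in this revision can be fed through `add_wrapping` with `WordCount = 2` to compare it against the committed implementation.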

lntue added inline comments. Aug 4 2022, 6:27 AM

libc/src/__support/CPP/UInt.h
74	Hi Kirill, thanks for thinking about improving the performance! I should have added some more background to this patch. The main reason I had to make it a bit complicated is that when using `UInt<128>` to replace `__uint128_t` (which we will need to do for targets without builtin `__uint128_t` support), it failed some tests that check overflow flags. Those flags are set by intermediate computations such as `val[i] += x.val[i]` in your improvement or in the previous implementation, even though the overall `__uint128_t` addition does not overflow. Of course, this change makes another issue pop up: we now don't set the overflow flag or trap when the real `__uint128_t` sum does overflow. That's why I left some TODOs for later patches.

orex added inline comments. Aug 4 2022, 7:37 AM

libc/src/__support/CPP/UInt.h
74	Thank you for the reply. It was my mistake; I should have read the title more carefully. Your solution is a good and elegant way to solve the problem. I don't know the full context, but for x86 I can offer an alternative: just clear the CF flag with the TEST instruction (if you are not interested in the other flags): https://en.wikipedia.org/wiki/TEST_(x86_instruction)

Another possible solution would be something like this: calculate all the sums except the last one with the loop I proposed, or with your previous solution, which is also very elegant. Clear the flag on the architectures that have one; as far as I know, RISC-V, for example, has no flags, so nothing is needed there. Then simply add the last vals at the end. That way you either get a "true overflow" flag at the end, or you can clear the flag at the end, so the behavior will be the same. But anyhow, your solution is very elegant.

P.S. To be honest, I don't see why your previous solution overflows. Sorry for bothering you, but I find it very interesting.

Did you consider using builtins with overflow checking:
https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html
It is platform independent and leaves some optimisation opportunities to the compiler.
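As a rough illustration of this suggestion, a multi-word addition built on `__builtin_add_overflow` (documented at the GCC link above and also provided by Clang) might look like the sketch below. The `Words` alias and the function name `add_with_builtin` are hypothetical and are not part of the patch. This only shows the shape of the code; whether the generated instructions still touch hardware overflow flags is a separate question.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for UInt<Bits>::val: 64-bit words, least
// significant word first.
template <std::size_t WordCount>
using Words = std::array<uint64_t, WordCount>;

// Requires GCC or Clang. Adds x into val and returns the final carry.
template <std::size_t WordCount>
uint64_t add_with_builtin(Words<WordCount> &val, const Words<WordCount> &x) {
  bool carry = false;
  for (std::size_t i = 0; i < WordCount; ++i) {
    uint64_t partial = 0;
    uint64_t total = 0;
    // Add the two words first, then the incoming carry. At most one of
    // the two steps can overflow for a given word.
    const bool c1 = __builtin_add_overflow(val[i], x[i], &partial);
    const bool c2 = __builtin_add_overflow(
        partial, static_cast<uint64_t>(carry), &total);
    val[i] = total;
    carry = c1 || c2;
  }
  return carry ? 1 : 0;
}
```

The builtin reports the overflow as a `bool`, so the carry chain needs no manual comparisons, and the compiler is free to pick an add-with-carry sequence where the target has one.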

In D131095#3699654, @tschuett wrote:

Did you consider using builtins with overflow checking:
https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html
It is platform independent and leaves some optimisation opportunities to the compiler.

In general this is a good idea, but there may be two problems:

It can still set flags, which should be avoided.
I'm not sure the builtins are available at this stage. I ran into a problem where some builtins did not exist. See https://github.com/llvm/llvm-project/commit/27aca975b6b6e9d5c7516c091f954884b28650ae for example.

In D131095#3699654, @tschuett wrote:

Did you consider using builtins with overflow checking:
https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html
It is platform independent and leaves some optimisation opportunities to the compiler.

Thanks for your suggestion! We definitely want to use compiler builtins to improve the performance whenever they are available, and fall back to the generic implementation otherwise.

As of now, this code is only used in a very limited way, for arm32 platforms that do not have builtin __uint128_t support, and we haven't finished setting up other requirements such as flags, environments, or performance testing yet.

Once those are set up, we will follow up with performance enhancements using builtins, together with more comprehensive testing, including exception flags and performance, and making sure that the fallback works properly, just as @orex mentioned.

lntue added inline comments. Aug 4 2022, 12:00 PM

libc/src/__support/CPP/UInt.h
74	Thanks! We can definitely follow up with per-architecture performance enhancements once this class is used in more places, together with more comprehensive testing. Re P.S.: Sorry, I didn't dig in too much to see why it overflowed. It's possible that some compiler optimizations were too aggressive, recognizing that the operations could be done in 32 bits and doing so; or perhaps they recognized the pattern and reduced it to `uint64_t` adds with carry-bit checks. Anyhow, I plan to use this safe implementation as our baseline and extend it to subtraction next. We definitely should have specialized versions with builtins / cleared flags for x86_64 and aarch64 to improve the performance, as we are going to use integers wider than 128 bits very soon.
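The subtraction follow-up mentioned here is not part of this revision. Purely as an illustration of how the same complement idea could carry over, the sketch below subtracts word by word without ever letting an intermediate word operation wrap. The names `Words` and `sub_no_wrap` are hypothetical, and the actual follow-up patch may look quite different.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for UInt<Bits>::val: 64-bit words, least
// significant word first.
template <std::size_t WordCount>
using Words = std::array<uint64_t, WordCount>;

// Subtracts x from val and returns the final borrow (0 or 1). No single
// word operation below wraps around.
template <std::size_t WordCount>
uint64_t sub_no_wrap(Words<WordCount> &val, const Words<WordCount> &x) {
  bool borrow = false;
  for (std::size_t i = 0; i < WordCount; ++i) {
    const uint64_t complement = ~x[i]; // 2^64 - 1 - x[i]
    if (!borrow) {
      if (val[i] >= x[i]) {
        val[i] -= x[i];
      } else {
        // val < x implies x >= 1, so complement + 1 = 2^64 - x fits, and
        // val + (2^64 - x) stays below 2^64.
        val[i] += complement + 1;
        borrow = true;
      }
    } else {
      if (val[i] > x[i]) {
        // x < val implies x + 1 fits in 64 bits.
        val[i] -= x[i] + 1;
        borrow = false;
      } else {
        // val <= x, so val + (2^64 - 1 - x) is at most 2^64 - 1.
        val[i] += complement;
      }
    }
  }
  return borrow ? 1 : 0;
}
```

Each branch mirrors one branch of the committed `add`, with the roles of the comparison and the complement swapped.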

Closed by commit rGd1a9ba7b6703: [libc] Prevent overflow from intermediate results when adding UInt<N> values. (authored by lntue). Aug 4 2022, 12:01 PM

This revision was automatically updated to reflect the committed changes.

lntue added a commit: rGd1a9ba7b6703: [libc] Prevent overflow from intermediate results when adding UInt<N> values.

Revision Contents

Path	Size
libc/src/__support/CPP/UInt.h	35 lines
libc/test/src/__support/uint128_test.cpp	11 lines
libc/utils/UnitTest/LibcTest.cpp	2 lines

Diff 450091

libc/src/__support/CPP/UInt.h

@@ ... @@ public:
   UInt<Bits> &operator=(const UInt<Bits> &other) {
     for (size_t i = 0; i < WordCount; ++i)
       val[i] = other.val[i];
     return *this;
   }
 
   // Add x to this number and store the result in this number.
   // Returns the carry value produced by the addition operation.
+  // To prevent overflow from intermediate results, we use the following
+  // property of unsigned integers:
+  //   x + (~x) = 2^(sizeof(x)) - 1.
   constexpr uint64_t add(const UInt<Bits> &x) {
-    uint64_t carry = 0;
+    bool carry = false;
     for (size_t i = 0; i < WordCount; ++i) {
-      uint64_t res_lo = low(val[i]) + low(x.val[i]) + carry;
-      carry = high(res_lo);
-      res_lo = low(res_lo);
-
-      uint64_t res_hi = high(val[i]) + high(x.val[i]) + carry;
-      carry = high(res_hi);
-      res_hi = low(res_hi);
-
-      val[i] = res_lo + (res_hi << 32);
+      uint64_t complement = ~x.val[i];
+      if (!carry) {
+        if (val[i] <= complement)
+          val[i] += x.val[i];
+        else {
+          val[i] -= complement + 1;
+          carry = true;
+        }
+      } else {
+        if (val[i] < complement) {
+          val[i] += x.val[i] + 1;
+          carry = false;
+        } else
+          val[i] -= complement;
+      }
     }
-    return carry;
+    return carry ? 1 : 0;
   }
 
   constexpr UInt<Bits> operator+(const UInt<Bits> &other) const {
     UInt<Bits> result(*this);
     result.add(other);
+    // TODO(lntue): Set overflow flag / errno when carry is true.
     return result;
   }
 
   constexpr UInt<Bits> operator+=(const UInt<Bits> &other) {
-    *this = *this + other;
+    // TODO(lntue): Set overflow flag / errno when carry is true.
+    add(other);
     return *this;
   }
 
   // Multiply this number with x and store the result in this number. It is
   // implemented using the long multiplication algorithm by splitting the
   // 64-bit words of this number and |x| in to 32-bit halves but peforming
   // the operations using 64-bit numbers. This ensures that we don't lose the
   // carry bits.
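For readers who want to run the committed algorithm in isolation, here is a minimal sketch of the same branch structure as a free function. The `Words` alias and the name `add_no_wrap` are hypothetical stand-ins for `UInt<Bits>::val` and `UInt<Bits>::add`, and `constexpr` is dropped to keep the sketch independent of the language standard in use. The comments spell out why no branch can wrap, based on the identity x + ~x = 2^64 - 1 for a 64-bit word.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for UInt<Bits>::val: 64-bit words, least
// significant word first.
template <std::size_t WordCount>
using Words = std::array<uint64_t, WordCount>;

// Adds x into val and returns the final carry (0 or 1). No single word
// operation below wraps around.
template <std::size_t WordCount>
uint64_t add_no_wrap(Words<WordCount> &val, const Words<WordCount> &x) {
  bool carry = false;
  for (std::size_t i = 0; i < WordCount; ++i) {
    const uint64_t complement = ~x[i]; // 2^64 - 1 - x[i]
    if (!carry) {
      if (val[i] <= complement) {
        val[i] += x[i]; // val + x <= 2^64 - 1, so the sum fits.
      } else {
        // val > complement, so complement + 1 fits and is <= val.
        // The result equals val + x - 2^64.
        val[i] -= complement + 1;
        carry = true;
      }
    } else {
      if (val[i] < complement) {
        // complement >= 1 here, so x + 1 fits, and val + x + 1 is at
        // most 2^64 - 1.
        val[i] += x[i] + 1;
        carry = false;
      } else {
        // The result equals val + x + 1 - 2^64; the carry stays set.
        val[i] -= complement;
      }
    }
  }
  return carry ? 1 : 0;
}
```

With `WordCount = 2`, the addition test vectors from `uint128_test.cpp` in this revision give the same words as the expected `result2` and `result3` values, with a final carry of 0 and 1 respectively.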

libc/test/src/__support/uint128_test.cpp

@@ ... @@ TEST(LlvmLibcUInt128ClassTest, BasicInit) {
   ASSERT_TRUE(half_val != full_val);
 }
 
 TEST(LlvmLibcUInt128ClassTest, AdditionTests) {
   LL_UInt128 val1(12345);
   LL_UInt128 val2(54321);
   LL_UInt128 result1(66666);
   EXPECT_EQ(val1 + val2, result1);
-  EXPECT_EQ((val1 + val2), (val2 + val1)); // addition is reciprocal
+  EXPECT_EQ((val1 + val2), (val2 + val1)); // addition is commutative
 
   // Test overflow
   LL_UInt128 val3({0xf000000000000001, 0});
   LL_UInt128 val4({0x100000000000000f, 0});
   LL_UInt128 result2({0x10, 0x1});
   EXPECT_EQ(val3 + val4, result2);
   EXPECT_EQ(val3 + val4, val4 + val3);
+
+  // Test overflow
+  LL_UInt128 val5({0x0123456789abcdef, 0xfedcba9876543210});
+  LL_UInt128 val6({0x1111222233334444, 0xaaaabbbbccccdddd});
+  LL_UInt128 result3({0x12346789bcdf1233, 0xa987765443210fed});
+  EXPECT_EQ(val5 + val6, result3);
+  EXPECT_EQ(val5 + val6, val6 + val5);
 }
 
 TEST(LlvmLibcUInt128ClassTest, MultiplicationTests) {
   LL_UInt128 val1({5, 0});
   LL_UInt128 val2({10, 0});
   LL_UInt128 result1({50, 0});
   EXPECT_EQ((val1 * val2), result1);
-  EXPECT_EQ((val1 * val2), (val2 * val1)); // multiplication is reciprocal
+  EXPECT_EQ((val1 * val2), (val2 * val1)); // multiplication is commutative
 
   // Check that the multiplication works accross the whole number
   LL_UInt128 val3({0xf, 0});
   LL_UInt128 val4({0x1111111111111111, 0x1111111111111111});
   LL_UInt128 result2({0xffffffffffffffff, 0xffffffffffffffff});
   EXPECT_EQ((val3 * val4), result2);
   EXPECT_EQ((val3 * val4), (val4 * val3));

libc/utils/UnitTest/LibcTest.cpp

 std::string describeValue(std::string Value) { return std::string(Value); }
 
 // When the value is UInt128 or __uint128_t, show its hexadecimal digits.
 // We cannot just use a UInt128 specialization as that resolves to only
 // one type, UInt<128> or __uint128_t. We want both overloads as we want to
 // be able to unittest UInt<128> on platforms where UInt128 resolves to
 // UInt128.
+// TODO(lntue): Investigate why UInt<128> was printed backward, with the lower
+// 64-bits first.
 template <typename UInt128Type>
 std::string describeValue128(UInt128Type Value) {
   std::string S(sizeof(UInt128) * 2, '0');
 
   for (auto I = S.rbegin(), End = S.rend(); I != End; ++I, Value >>= 4) {
     unsigned char Mod = static_cast<unsigned char>(Value) & 15;
     *I = Mod < 10 ? '0' + Mod : 'a' + Mod - 10;
   }
