Optimize the 128-bit division according to https://gmplib.org/~tege/division-paper.pdf
Old benchmark:
------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------------------------------------- BM_DivideIntrinsic128UniformDivisor<__uint128_t> 12 ns 12 ns 57176570 BM_DivideIntrinsic128UniformDivisor<__int128_t> 14 ns 14 ns 49819848 BM_RemainderIntrinsic128UniformDivisor<__uint128_t> 13 ns 13 ns 54875157 BM_RemainderIntrinsic128UniformDivisor<__int128_t> 14 ns 14 ns 51453585 BM_DivideIntrinsic128SmallDivisor<__uint128_t> 27 ns 27 ns 25857150 BM_DivideIntrinsic128SmallDivisor<__int128_t> 29 ns 29 ns 23769665 BM_RemainderIntrinsic128SmallDivisor<__uint128_t> 28 ns 28 ns 25264879 BM_RemainderIntrinsic128SmallDivisor<__int128_t> 30 ns 30 ns 23714126
New benchmark
------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------------------------------------- BM_DivideIntrinsic128UniformDivisor<__uint128_t> 12 ns 12 ns 56122341 BM_DivideIntrinsic128UniformDivisor<__int128_t> 14 ns 14 ns 50306881 BM_RemainderIntrinsic128UniformDivisor<__uint128_t> 13 ns 13 ns 56006931 BM_RemainderIntrinsic128UniformDivisor<__int128_t> 13 ns 13 ns 52443568 BM_DivideIntrinsic128SmallDivisor<__uint128_t> 13 ns 13 ns 53636689 BM_DivideIntrinsic128SmallDivisor<__int128_t> 15 ns 15 ns 45276581 BM_RemainderIntrinsic128SmallDivisor<__uint128_t> 15 ns 15 ns 46067118 BM_RemainderIntrinsic128SmallDivisor<__int128_t> 17 ns 17 ns 40513535
PowerPC and ARM benchmarks are the same or slightly better.
I haven't proceeded it with the previous patch because this one contains 512 byte table. If you think this is not acceptable in builtins library, I'll just revert it then.
clang-tidy: warning: invalid case style for variable 'kReciprocalTable' [readability-identifier-naming]
not useful