While benchmarking uint128 division we found out that it has huge latency for small divisors
https://github.com/abseil/abseil-cpp/blob/master/absl/numeric/int128_benchmark.cc#L52
Benchmark Time(ns) CPU(ns)
BM_DivideClass128UniformDivisor 13.8 13.8
BM_DivideClass128SmallDivisor 168 168
BM_DivideIntrinsic128UniformDivisor 13.3 13.3
BM_DivideIntrinsic128SmallDivisor 153 153
This patch suggests a more optimized version of the division:
If the divisor is 64 bit, we can proceed with the divq instruction on x86 or constant multiplication mechanisms for other platforms. Once both divisor and dividend are not less than 2**64, we use branch free subtract algorithm, it has at most 64 cycles. After that our benchmarks improved significantly
BM_DivideIntrinsic128UniformDivisor 11.9 11.9 59000000
BM_DivideIntrinsic128SmallDivisor 26.5 26.5 27000000
If not using divq instrinsics, it is still much better
BM_DivideIntrinsic128UniformDivisor 11.9 11.9 59000000
BM_DivideIntrinsic128SmallDivisor 36.5 36.5 21000000
My email: danilak@google.com, I don't have commit rights