Old benchmark:
```
-------------------------------------------------------------------------------------------
Benchmark                                                      Time         CPU  Iterations
-------------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<__uint128_t>              12 ns       12 ns    57176570
BM_DivideIntrinsic128UniformDivisor<__int128_t>               14 ns       14 ns    49819848
BM_RemainderIntrinsic128UniformDivisor<__uint128_t>           13 ns       13 ns    54875157
BM_RemainderIntrinsic128UniformDivisor<__int128_t>            14 ns       14 ns    51453585
BM_DivideIntrinsic128SmallDivisor<__uint128_t>                27 ns       27 ns    25857150
BM_DivideIntrinsic128SmallDivisor<__int128_t>                 29 ns       29 ns    23769665
BM_RemainderIntrinsic128SmallDivisor<__uint128_t>             28 ns       28 ns    25264879
BM_RemainderIntrinsic128SmallDivisor<__int128_t>              30 ns       30 ns    23714126
```
New benchmark:
```
-------------------------------------------------------------------------------------------
Benchmark                                                      Time         CPU  Iterations
-------------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<__uint128_t>              12 ns       12 ns    56122341
BM_DivideIntrinsic128UniformDivisor<__int128_t>               14 ns       14 ns    50306881
BM_RemainderIntrinsic128UniformDivisor<__uint128_t>           13 ns       13 ns    56006931
BM_RemainderIntrinsic128UniformDivisor<__int128_t>            13 ns       13 ns    52443568
BM_DivideIntrinsic128SmallDivisor<__uint128_t>                13 ns       13 ns    53636689
BM_DivideIntrinsic128SmallDivisor<__int128_t>                 15 ns       15 ns    45276581
BM_RemainderIntrinsic128SmallDivisor<__uint128_t>             15 ns       15 ns    46067118
BM_RemainderIntrinsic128SmallDivisor<__int128_t>              17 ns       17 ns    40513535
```
PowerPC and ARM benchmarks are the same or slightly better.
I didn't include this change in the previous patch because it adds a 512-byte table. If you think that is not acceptable in the builtins library, I'll just revert it.