This is an archive of the discontinued LLVM Phabricator instance.

[builtins] Optimize the 128-bit division according to "Improved division by invariant integers" paper
AbandonedPublic

Authored by danlark on Jul 10 2020, 5:34 AM.

Details

Summary

Optimize the 128-bit division according to https://gmplib.org/~tege/division-paper.pdf

Old benchmark:

-------------------------------------------------------------------------------------------
Benchmark                                                    Time           CPU Iterations
-------------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<__uint128_t>            12 ns         12 ns   57176570
BM_DivideIntrinsic128UniformDivisor<__int128_t>             14 ns         14 ns   49819848
BM_RemainderIntrinsic128UniformDivisor<__uint128_t>         13 ns         13 ns   54875157
BM_RemainderIntrinsic128UniformDivisor<__int128_t>          14 ns         14 ns   51453585
BM_DivideIntrinsic128SmallDivisor<__uint128_t>              27 ns         27 ns   25857150
BM_DivideIntrinsic128SmallDivisor<__int128_t>               29 ns         29 ns   23769665
BM_RemainderIntrinsic128SmallDivisor<__uint128_t>           28 ns         28 ns   25264879
BM_RemainderIntrinsic128SmallDivisor<__int128_t>            30 ns         30 ns   23714126

New benchmark

-------------------------------------------------------------------------------------------
Benchmark                                                    Time           CPU Iterations
-------------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<__uint128_t>            12 ns         12 ns   56122341
BM_DivideIntrinsic128UniformDivisor<__int128_t>             14 ns         14 ns   50306881
BM_RemainderIntrinsic128UniformDivisor<__uint128_t>         13 ns         13 ns   56006931
BM_RemainderIntrinsic128UniformDivisor<__int128_t>          13 ns         13 ns   52443568
BM_DivideIntrinsic128SmallDivisor<__uint128_t>              13 ns         13 ns   53636689
BM_DivideIntrinsic128SmallDivisor<__int128_t>               15 ns         15 ns   45276581
BM_RemainderIntrinsic128SmallDivisor<__uint128_t>           15 ns         15 ns   46067118
BM_RemainderIntrinsic128SmallDivisor<__int128_t>            17 ns         17 ns   40513535

PowerPC and ARM benchmarks are the same or slightly better.

I haven't proceeded it with the previous patch because this one contains 512 byte table. If you think this is not acceptable in builtins library, I'll just revert it then.

Diff Detail

Event Timeline

danlark created this revision.Jul 10 2020, 5:34 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 10 2020, 5:34 AM
Herald added subscribers: Restricted Project, steven.zhang, kristof.beyls. · View Herald Transcript
danlark retitled this revision from Optimize the 128-bit division according to https://gmplib.org/~tege/division-paper.pdf to [builtins] Optimize the 128-bit division according to "Improved division by invariant integers" paper.Jul 10 2020, 5:35 AM
danlark edited the summary of this revision. (Show Details)

I have no idea what the policy is WRT size for compiler-rt. How big is the diff if you take code size into consideration ?

I have no idea what the policy is WRT size for compiler-rt. How big is the diff if you take code size into consideration ?

du -shb ./build/lib/clang/11.0.0/lib/linux/libclang_rt.builtins-x86_64_old.a
223538	./build/lib/clang/11.0.0/lib/linux/libclang_rt.builtins-x86_64_old.a

du -shb ./build/lib/clang/11.0.0/lib/linux/libclang_rt.builtins-x86_64.a
224658	./build/lib/clang/11.0.0/lib/linux/libclang_rt.builtins-x86_64.a

+1.1kb

danlark updated this revision to Diff 277008.Jul 10 2020, 5:55 AM

Provide better names than in paper

danlark planned changes to this revision.Jul 10 2020, 6:38 AM

I need to double check the cold cache performance, it may happen that it is slower in that case

danlark abandoned this revision.May 15 2021, 2:48 AM