This is an archive of the discontinued LLVM Phabricator instance.

[builtins] Optimize udivmodti4 for many platforms.
ClosedPublic

Authored by danlark on Jun 14 2020, 10:01 AM.

Details

Summary

While benchmarking uint128 division we found out that it has huge latency for small divisors

https://reviews.llvm.org/D83027

Benchmark                                                   Time(ns)        CPU(ns)     Iterations
--------------------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<unsigned __int128>            13.0           13.0     55000000  
BM_DivideIntrinsic128UniformDivisor<__int128>                     14.3           14.3     50000000  
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128>         13.5           13.5     52000000  
BM_RemainderIntrinsic128UniformDivisor<__int128>                  14.1           14.1     50000000  
BM_DivideIntrinsic128SmallDivisor<unsigned __int128>             153            153        5000000  
BM_DivideIntrinsic128SmallDivisor<__int128>                      170            170        3000000  
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128>          153            153        5000000  
BM_RemainderIntrinsic128SmallDivisor<__int128>                   155            155        5000000

This patch suggests a more optimized version of the division:

If the divisor is 64 bit, we can proceed with the divq instruction on x86 or constant multiplication mechanisms for other platforms. Once both divisor and dividend are not less than 2**64, we use branch free subtract algorithm, it has at most 64 cycles. After that our benchmarks improved significantly

Benchmark                                                   Time(ns)        CPU(ns)     Iterations
--------------------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<unsigned __int128>            11.0           11.0     64000000  
BM_DivideIntrinsic128UniformDivisor<__int128>                     13.8           13.8     51000000  
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128>         11.6           11.6     61000000  
BM_RemainderIntrinsic128UniformDivisor<__int128>                  13.7           13.7     52000000  
BM_DivideIntrinsic128SmallDivisor<unsigned __int128>              27.1           27.1     26000000  
BM_DivideIntrinsic128SmallDivisor<__int128>                       29.4           29.4     24000000  
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128>           27.9           27.8     26000000  
BM_RemainderIntrinsic128SmallDivisor<__int128>                    29.1           29.1     25000000

If not using divq instrinsics, it is still much better

Benchmark                                                   Time(ns)        CPU(ns)     Iterations
--------------------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<unsigned __int128>            12.2           12.2     58000000  
BM_DivideIntrinsic128UniformDivisor<__int128>                     13.5           13.5     52000000  
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128>         12.7           12.7     56000000  
BM_RemainderIntrinsic128UniformDivisor<__int128>                  13.7           13.7     51000000  
BM_DivideIntrinsic128SmallDivisor<unsigned __int128>              30.2           30.2     24000000  
BM_DivideIntrinsic128SmallDivisor<__int128>                       33.2           33.2     22000000  
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128>           31.4           31.4     23000000  
BM_RemainderIntrinsic128SmallDivisor<__int128>                    33.8           33.8     21000000

PowerPC benchmarks:

Was

BM_DivideIntrinsic128UniformDivisor<unsigned __int128>            22.3           22.3     32000000  
BM_DivideIntrinsic128UniformDivisor<__int128>                     23.8           23.8     30000000  
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128>         22.5           22.5     32000000  
BM_RemainderIntrinsic128UniformDivisor<__int128>                  24.9           24.9     29000000  
BM_DivideIntrinsic128SmallDivisor<unsigned __int128>             394            394        2000000  
BM_DivideIntrinsic128SmallDivisor<__int128>                      397            397        2000000  
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128>          399            399        2000000  
BM_RemainderIntrinsic128SmallDivisor<__int128>                   397            397        2000000

With this patch

BM_DivideIntrinsic128UniformDivisor<unsigned __int128>            21.7           21.7     33000000  
BM_DivideIntrinsic128UniformDivisor<__int128>                     23.0           23.0     31000000  
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128>         21.9           21.9     33000000  
BM_RemainderIntrinsic128UniformDivisor<__int128>                  23.9           23.9     30000000  
BM_DivideIntrinsic128SmallDivisor<unsigned __int128>              32.7           32.6     23000000  
BM_DivideIntrinsic128SmallDivisor<__int128>                       33.4           33.4     21000000  
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128>           31.1           31.1     22000000  
BM_RemainderIntrinsic128SmallDivisor<__int128>                    33.2           33.2     22000000

My email: danilak@google.com, I don't have commit rights

Diff Detail

Event Timeline

danlark created this revision.Jun 14 2020, 10:01 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 14 2020, 10:01 AM
Herald added a subscriber: Restricted Project. · View Herald Transcript
danlark retitled this revision from Optimize udivmodti4 for many platforms. to [builtins] Optimize udivmodti4 for many platforms..Jun 14 2020, 10:06 AM
danlark edited the summary of this revision. (Show Details)
danlark updated this revision to Diff 270640.Jun 14 2020, 3:13 PM

Use native division when high is zero.

Nice. I have not looked at the code yet, but I think it would make sense to add your benchmark to the compiler-rt unit tests.

danlark updated this revision to Diff 270666.Jun 14 2020, 11:40 PM

Clang format

danlark edited the summary of this revision. (Show Details)Jun 14 2020, 11:46 PM
danlark added a comment.EditedJul 1 2020, 2:17 PM

Ping and maybe I need some help in introducing benchmarks (if we really need to) for this because I don't quite understand how and where it should be best located.

Also, there are some in test/builtins/timing but they seem to work only for MacOS which is pretty non-obvious why :\

danlark edited the summary of this revision. (Show Details)Jul 1 2020, 3:07 PM
danlark edited the summary of this revision. (Show Details)
danlark updated this revision to Diff 274938.Jul 1 2020, 3:19 PM

CHAR_BIT instead of 8

danlark edited the summary of this revision. (Show Details)Jul 1 2020, 3:52 PM

Nice. I have not looked at the code yet, but I think it would make sense to add your benchmark to the compiler-rt unit tests.

Added https://reviews.llvm.org/D83027

danlark edited the summary of this revision. (Show Details)Jul 2 2020, 2:33 AM
danlark edited the summary of this revision. (Show Details)
danlark updated this revision to Diff 275969.Jul 7 2020, 3:08 AM

Fix clang-tidy warnings

danlark updated this revision to Diff 275971.Jul 7 2020, 3:10 AM

Fix non x86-mode

danlark updated this revision to Diff 275973.Jul 7 2020, 3:11 AM

clang-format

danlark updated this revision to Diff 275974.Jul 7 2020, 3:15 AM

Fix clang-tidy readability warnings more

MaskRay added inline comments.Jul 7 2020, 9:55 PM
compiler-rt/lib/builtins/udivmodti4.c
22

Please use lower case, as the file originally uses. A large portion of LLVM does use uppercase and people find that annoying, but they don't want to fix which will cause lots of churns. However, for code which correctly uses lower case, there is no point converting to uppercase.

33

No need to comment __builtin_clzll

36

For comments which are one their own lines, the LLVM style usually appends a period.

42

The first sentence is sufficient. No need to repeat: "The behavior is undefined if the right operand is negative, or greater than or equal to the length in bits of the promoted left operand."

MaskRay added inline comments.Jul 7 2020, 9:57 PM
compiler-rt/lib/builtins/udivmodti4.c
151

Please keep lower-case and don't do unneeded variable renaming.

danlark updated this revision to Diff 276317.Jul 7 2020, 11:33 PM
danlark marked 5 inline comments as done.

Fix comments

danlark updated this revision to Diff 276504.Jul 8 2020, 11:45 AM

Update the real reference

Please, take a look

Sorry for taking that long, this is a dense review. Overall this looks good, and the renaming and splitting actually also improve the readability of the non-x86 part.
I only have style comments.

compiler-rt/lib/builtins/udivmodti4.c
23

How is this simple ? :) maybe default ?

32

what's norm ? Normalization ? In general, please avoid the abbreviations here, this is already quite dense.

115–122

// When the divisor fits in 64 bits, we can use an optimized path.

124–129

typo: remainder

danlark updated this revision to Diff 276938.Jul 10 2020, 12:43 AM
danlark marked 3 inline comments as done.

Fix comments

danlark marked an inline comment as done.Jul 10 2020, 12:43 AM
courbet accepted this revision.Jul 10 2020, 12:45 AM
This revision is now accepted and ready to land.Jul 10 2020, 12:45 AM
danlark updated this revision to Diff 276939.Jul 10 2020, 12:55 AM

Fix clang format

This revision was automatically updated to reflect the committed changes.