__builtin_clz requires just a single instruction on x86 and arm, so this is a performance improvement.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
libc/src/__support/FPUtil/generic/sqrt.h | ||
---|---|---|
65–66 | Can you do a simple perf test to see if using clz for 64 bits is faster than binary search? Something like: `uint64_t hi_bits = static_cast<uint64_t>(mantissa >> 64);` `int shift = hi_bits ? (clz(hi_bits) - 15) : (clz(static_cast<uint64_t>(mantissa)) + 49);` `exponent -= shift;` `mantissa <<= shift;` |
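For reference, the suggestion above can be written out as a small standalone sketch. The function names here are hypothetical and this is not the code in sqrt.h; it assumes a nonzero subnormal binary128 mantissa (below 2^112), and relies on the GCC/Clang extensions `unsigned __int128` and `__builtin_clzll`. The constants follow from placing the leading 1 on bit 112, the implicit-bit position: a leading bit in the high half sits at position 127 - clz(hi), giving a shift of clz(hi) - 15, and a leading bit in the low half sits at 63 - clz(lo), giving clz(lo) + 49.

```cpp
#include <cstdint>

// Hypothetical helper, not the code in sqrt.h. Normalizes a nonzero
// subnormal binary128 mantissa so its leading 1 lands on bit 112.
// Precondition: 0 < mantissa < 2^112 (otherwise the shift is invalid).
inline void normalize_clz(int &exponent, unsigned __int128 &mantissa) {
  const std::uint64_t hi_bits = static_cast<std::uint64_t>(mantissa >> 64);
  const int shift =
      hi_bits ? (__builtin_clzll(hi_bits) - 15)
              : (__builtin_clzll(static_cast<std::uint64_t>(mantissa)) + 49);
  exponent -= shift;
  mantissa <<= shift;
}

// Reference version for cross-checking: shift one bit at a time until
// bit 112 is set. Same precondition as above.
inline void normalize_loop(int &exponent, unsigned __int128 &mantissa) {
  const unsigned __int128 implicit_bit = static_cast<unsigned __int128>(1)
                                         << 112;
  while (!(mantissa & implicit_bit)) {
    mantissa <<= 1;
    --exponent;
  }
}
```

Comparing the two helpers over a handful of subnormal mantissas is a cheap way to confirm the constants 15 and 49 before measuring anything.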
libc/src/__support/FPUtil/generic/sqrt.h | ||
---|---|---|
65–66 | Following up here: re: #1: It seems that we only have perf tests for the float32 variants of the math functions, because the input domain grows exponentially with bit width. An exhaustive performance test using float64 never completed even after an hour of waiting, and I can't imagine an exhaustive test for 128-bit inputs would complete even after days. So I'm going to write a one-off performance test that terminates after 2^24 iterations to test this, but I won't be checking it in. re: #2: The x87 80-bit variant uses a 64-bit mantissa, which means that __builtin_clzll will still work after truncation, so this is trivial to implement (and I have confirmed a slight performance increase here using the method mentioned above). I was incorrect in thinking that aarch64 uses 64-bit floats for long double, and I have access to some hardware with an aarch64 Cortex-A53 that I can run these performance tests on with the changes you mentioned. If performance is improved, then I will update the patch accordingly. |
libc/src/__support/FPUtil/generic/sqrt.h | ||
---|---|---|
65–66 | Results for 128-bit floats (long double) from the aarch64 Cortex-A53 core in the denormal range: clz: 28894915309 ns. So, just over a 1% performance improvement, which is in line with what I'm seeing on the 32-bit float sqrtf function. Therefore, I'm patching in that change (as well as the x87 80-bit specialization). |
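For the x87 80-bit case mentioned above, the normalization reduces to a single count-leading-zeros call, because the 64-bit significand stores its integer bit explicitly in bit 63. A minimal sketch, assuming a nonzero subnormal mantissa already truncated to 64 bits (the helper name is hypothetical and this is not the committed specialization):

```cpp
#include <cstdint>

// Hypothetical helper. Normalizes a nonzero subnormal x87 80-bit mantissa
// so that its leading 1 lands on bit 63, the explicit integer bit.
// Precondition: mantissa != 0 (__builtin_clzll(0) is undefined).
inline void normalize_x87(int &exponent, std::uint64_t &mantissa) {
  const int shift = __builtin_clzll(mantissa);
  exponent -= shift;
  mantissa <<= shift;
}
```

No offset constant is needed here, unlike the 128-bit case, since the target bit is the top bit of the 64-bit word.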
Thank you! As I don't have repo access (nor probably should I yet), please submit per https://llvm.org/docs/Phabricator.html#committing-someone-s-change-from-phabricator if you would (sivachandra did this for me on D117684).