Instead of scanning byte after byte, scan word after word and rely on a bithack to perform the scan efficiently.
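For illustration only, here is a minimal sketch of the word-at-a-time idea (this is not the code from the patch; the helper names below are made up): load eight bytes at a time and use the classic SWAR zero-byte trick to check whether the word contains a '\n' or '\r' before falling back to a per-byte loop.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Classic SWAR helper: non-zero iff some byte of `Word` equals `Byte`
// (the well-known "haszero"/"hasvalue" bit hacks).
static inline uint64_t bytesEqual(uint64_t Word, uint8_t Byte) {
  uint64_t Pattern = 0x0101010101010101ull * Byte;
  uint64_t X = Word ^ Pattern; // matching bytes become zero
  return (X - 0x0101010101010101ull) & ~X & 0x8080808080808080ull;
}

// Hypothetical sketch: collect the offsets of '\n' and '\r' in Buf.
static void scanForNewlines(const char *Buf, size_t Len,
                            std::vector<unsigned> &Offsets) {
  size_t I = 0;
  for (; I + sizeof(uint64_t) <= Len; I += sizeof(uint64_t)) {
    uint64_t Word;
    std::memcpy(&Word, Buf + I, sizeof(Word)); // safe unaligned load
    if (!bytesEqual(Word, '\n') && !bytesEqual(Word, '\r'))
      continue; // whole word is boring, skip 8 bytes at once
    for (size_t J = I; J < I + sizeof(uint64_t); ++J)
      if (Buf[J] == '\n' || Buf[J] == '\r')
        Offsets.push_back(J);
  }
  for (; I < Len; ++I) // tail
    if (Buf[I] == '\n' || Buf[I] == '\r')
      Offsets.push_back(I);
}
```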
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Would it help readability and portability to change unsigned long to either uintptr_t, uint32_t, or uint64_t?
Very cool. Please follow clang-format's requests. Does this measurably speed things up?
clang used to have hand-vectorized code for SSE and Altivec, but it looks like someone removed it. The

#ifdef __SSE2__
#include <emmintrin.h>
#endif

is a remnant of that; it looks like you could drop it now, or investigate why it was removed and see whether it is worth doing something related but updated for newer processors. You might also be interested in checking out llvm/lib/Support/SourceMgr.cpp, which has similar algorithms and can also be sped up. It is widely used by MLIR and other LLVM ecosystem projects.
-Chris
Thanks for pointing me at the former SSE implementation. That got me thinking, and I've been trying various approaches, among them:
- sequential optimization, basically the one currently in main
- various optimizations involving casting to uint64_t and assorted bit twiddling (damn, bit hacks are fun)
- SSE / AVX optimization, based on the 2012 version.
So far, vectorization is the fastest. I deviated from the 2012 version a bit, which leads to relatively simple and explicit bitcode.
My test bed is the following: compute the average time (over 1000 runs) spent running the following command: ./bin/clang -w -o/dev/null sqlite3.c -E
former sse version: 0.123s
trunk version: 0.122s
bithacked version: 0.118s
new sse version: 0.115s
The bithacked version is arch-agnostic but harder to follow; the SSE version is very easy to read and faster.
bithacked version: https://github.com/serge-sans-paille/llvm-project/blob/feature/line-offset-bithack/clang/lib/Basic/SourceManager.cpp#L1285
sse version: https://github.com/serge-sans-paille/llvm-project/blob/feature/faster-line-offset-init/clang/lib/Basic/SourceManager.cpp#L1296
(it also contains an AVX implementation, which is slightly... slower)
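For readers who do not want to chase the links, the SSE2 flavour boils down to something like the following sketch (not the exact code from the branch above; the surrounding loop and the handling of '\r\n' pairs are omitted):

```cpp
#include <emmintrin.h> // SSE2
#include <cstdint>

// Hypothetical sketch: return a 16-bit mask where bit i is set iff
// Buf[i] is '\n' or '\r', for a 16-byte chunk.
static inline unsigned newlineMask16(const char *Buf) {
  __m128i Chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(Buf));
  __m128i CR = _mm_cmpeq_epi8(Chunk, _mm_set1_epi8('\r'));
  __m128i LF = _mm_cmpeq_epi8(Chunk, _mm_set1_epi8('\n'));
  return _mm_movemask_epi8(_mm_or_si128(CR, LF));
}
```

The surrounding loop (not shown) advances 16 bytes at a time when the mask is zero and otherwise walks the set bits to record offsets.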
So, what do you think? I'd advocate for the new SSE version; maybe @MaskRay has an opinion on this?
Since the target-independent version is almost as fast, I'd go with it. Does it help to use LLVM_UNLIKELY with likelyhasbetween?
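In case it helps other readers: likelyhasbetween refers to the "determine if a word has a byte between m and n" trick from the Bit Twiddling Hacks collection. A hedged sketch of how the suggestion might look (the helper name, constants, and the exact range are assumptions, not the code in the patch):

```cpp
#include "llvm/Support/Compiler.h" // LLVM_UNLIKELY
#include <cstdint>

// "Likely has a byte strictly between M and N" from Bit Twiddling Hacks:
// cheap, but may report false positives, so a per-byte recheck is still
// needed when it fires.
static inline uint64_t likelyHasBetween(uint64_t X, uint8_t M, uint8_t N) {
  const uint64_t Ones = 0x0101010101010101ull; // ~0ULL / 255
  return ((X - Ones * N) & ~X & ((X & Ones * 127) + Ones * (127 - M))) &
         (Ones * 128);
}

// Inside the word loop, newline bytes are rare, so hint the branch:
//   if (LLVM_UNLIKELY(likelyHasBetween(Word, '\n' - 1, '\r' + 1)))
//     /* fall back to the byte-by-byte scan for this word */;
```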
Drive-by question:
The units are different for the new SSE version, indicating that it's ~1000x faster than the others. Is that a typo?
On a personal note, I consider the SSE2 version easier to read and maintain, and SSE2 is a *very* widespread instruction set. The way I've split the implementation from the architectural details should make it easy to port to other architectures; I can do it for NEON if need be.
@MaskRay / @lattner can you confirm you want to stay with the "bit hack" version? Note that I don't have a strong opinion on this choice, but I still had one I wanted to share ;-)
[EDIT] I updated the bithack link on GitHub; it wasn't the latest version I had, nor the one I benchmarked.
Yeah, I'd stay with the bithack version, but I'd recommend refactoring it to use some helper functions with evocative names.
Thanks @lattner for the feedback. I wasn't super happy with the stability of the performance numbers, so I isolated the function in a standalone file and benchmarked several implementations. All the sources are available here: https://github.com/serge-sans-paille/mapping-line-to-offset
I get different results depending on how I configure my CPU (basically, enabling Turbo Boost or not).
With Turbo Boost (in ms):
ref: 11.01, seq: 10.86, bithack: 5.01, sse_align: 5, sse: 3
Without Turbo Boost (still in ms):
ref: 23.01, seq: 22.04, bithack: 11.02, sse_align: 10, sse: 6
Does it make you change your mind? ;-)
I'll happily gather more data; it should be as simple as running make perf -C src once the repo above is cloned.
For completeness, I must mention another optimization: run a first memchr(Buf, '\r', BufLen) and, if it finds nothing, use a fast path. That brings an interesting speedup for the ref and seq versions, but not for the SSE one, and it makes the code more difficult to read.
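For anyone curious, the fast path described above would look roughly like this (a sketch only; recording line starts one past each terminator is an assumption, and the real bookkeeping in SourceManager is more involved):

```cpp
#include <cstring>
#include <vector>

// Hypothetical sketch of the '\r' pretest: if the buffer contains no '\r'
// at all, only '\n' needs to be located, which is a simpler scan.
static void computeLineOffsets(const char *Buf, size_t Len,
                               std::vector<unsigned> &Offsets) {
  if (!std::memchr(Buf, '\r', Len)) {
    // Fast path: pure-'\n' buffer, no '\r' / '\r\n' handling needed.
    for (size_t I = 0; I < Len; ++I)
      if (Buf[I] == '\n')
        Offsets.push_back(I + 1); // a line starts after each '\n'
    return;
  }
  // Slow path: handle '\n', '\r' and '\r\n' pairs.
  for (size_t I = 0; I < Len; ++I) {
    if (Buf[I] == '\n') {
      Offsets.push_back(I + 1);
    } else if (Buf[I] == '\r') {
      if (I + 1 < Len && Buf[I + 1] == '\n')
        ++I; // treat "\r\n" as a single line terminator
      Offsets.push_back(I + 1);
    }
  }
}
```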
Ok, I don't care very strongly one way or the other. The appeal of the bithack version was that it works on non-x86 systems.
After a great deal of benchmarking on different architectures, I finally settled on the bithack version, as advised by @lattner. Using llvm::countTrailingZeros makes the implementation faster than the previous version, and it's easier to read.
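To show where llvm::countTrailingZeros enters the picture, a small sketch (firstMatchingByte is a hypothetical helper, not necessarily how the committed code is factored):

```cpp
#include "llvm/Support/MathExtras.h" // llvm::countTrailingZeros
#include <cstdint>

// Given the non-zero SWAR mask produced by the bit hack (a 0x80 bit for
// each matching byte), recover the index of the first matching byte.
// Assumes the word was loaded in little-endian byte order.
static inline unsigned firstMatchingByte(uint64_t Mask) {
  // A match in byte i sets bit 8*i + 7, so dividing the trailing-zero
  // count by 8 yields i. Precondition: Mask != 0.
  return llvm::countTrailingZeros(Mask) / 8;
}
```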
Nice! There is something weird going on with the diff above; please make sure 'git diff' shows something reasonable before pushing. Thanks!
@serge-sans-paille Please can you look at the clang-ppc buildbot breakages: http://lab.llvm.org:8011/#/builders/52/builds/6108
Looks like this breaks tests on Windows: http://45.33.8.238/win/36495/step_7.txt (takes a bit to load, since many tests are broken)
Please take a look, and please revert for now if it takes a while to fix.
Oh, looks like someone already reported a breakage over an hour ago. I'll revert for now...
Thanks @thakis, investigating the issue is likely to take some time as it seems to be arch- or system-dependent.
clang/lib/Basic/SourceManager.cpp:1262
UL is 32-bit on Windows but 64-bit on Linux/Mac in 64-bit builds (LP64 vs LLP64). Maybe you want 0ULL here? If so, then you should be able to repro this on Linux by building a 32-bit clang binary there.
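For readers hitting the same thing, a sketch of why 0UL differs across platforms and one way to sidestep it (the constant names are illustrative):

```cpp
#include <cstdint>

// On LP64 targets (64-bit Linux/macOS) unsigned long is 64 bits, but on
// LLP64 (64-bit Windows) it is only 32 bits, so masks built from 0UL /
// ...UL literals silently lose their upper half there. Fixed-width types
// and literals avoid the problem:
constexpr uint64_t Ones = UINT64_C(0x0101010101010101);
constexpr uint64_t Highs = UINT64_C(0x8080808080808080);
```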
clang/lib/Basic/SourceManager.cpp:1262
That + some endianness issue on ppc64be, resubmit coming soon once I validate locally :-)
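One common way to deal with the byte-order half of this, sketched here as an assumption rather than a description of the actual fix, is to load each word explicitly as little-endian so the first byte in memory always lands in the low bits, keeping the countTrailingZeros mapping correct on big-endian hosts:

```cpp
#include "llvm/Support/Endian.h" // llvm::support::endian::read64le
#include <cstdint>

// Hypothetical helper: read 8 bytes as a little-endian word regardless of
// the host byte order, so Buf[i] always occupies bits [8*i, 8*i + 7].
static inline uint64_t loadWordLE(const char *Buf) {
  return llvm::support::endian::read64le(Buf);
}
```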
clang/lib/Basic/SourceManager.cpp:1255
Typo: multi-byte. Also, I think "inclusive" rather than "included", assuming you mean the range [m, n] rather than (m, n).