Instead of scanning byte by byte, scan word by word, relying on a bithack to perform the scan efficiently.
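For illustration, the word-at-a-time idea boils down to the classic "find a byte in a word" SWAR trick. This is a hedged sketch, not the actual patch (the function names are made up, and it assumes a little-endian target):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Classic SWAR trick: the result has the high bit set in the byte positions
// of Word that equal Byte (exact for the lowest matching byte).
static inline std::uint64_t bytesEqual(std::uint64_t Word, std::uint8_t Byte) {
  const std::uint64_t Ones = 0x0101010101010101ULL;
  std::uint64_t X = Word ^ (Ones * Byte); // matching bytes become zero
  return (X - Ones) & ~X & (Ones * 0x80); // flag the zero bytes
}

// Return the offset of the first '\n' in Buf, or Len if there is none,
// scanning 8 bytes at a time. Assumes little endian, where the lowest
// flagged byte is the first one in memory.
std::size_t findNewline(const char *Buf, std::size_t Len) {
  std::size_t I = 0;
  for (; I + 8 <= Len; I += 8) {
    std::uint64_t Word;
    std::memcpy(&Word, Buf + I, 8); // avoids alignment/aliasing issues
    if (std::uint64_t Hit = bytesEqual(Word, '\n'))
      return I + __builtin_ctzll(Hit) / 8;
  }
  for (; I < Len; ++I) // scalar tail
    if (Buf[I] == '\n')
      return I;
  return Len;
}
```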
Very cool. Please follow clang-format's requests. Does this measurably speed things up?
clang used to have hand-vectorized code for SSE and Altivec, but it looks like someone removed it. The
#ifdef __SSE2__ #include <emmintrin.h> #endif
is a remnant of that. It looks like you could drop it now, or investigate why it was removed and see whether it is worth doing something similar, updated for newer processors. You might also be interested in checking out llvm/lib/Support/SourceMgr.cpp, which has similar algorithms and can also be sped up. It is widely used by MLIR and other LLVM ecosystem projects.
Thanks for pointing me at the former SSE implementation. This got me thinking, and I've been trying various approaches, among which:
- sequential optimization, basically the one currently in main
- various optimizations involving casting to uint64_t and doing assorted bit twiddling (damn, bit hacks are fun)
- SSE / AVX optimization, based on the 2012 version.
So far, vectorization is the fastest. I diverged from the 2012 version a bit, which leads to relatively simple and explicit bitcode.
My test bed is the following: compute the average (over 1000 runs) of the time spent running the following command: ./bin/clang -w -o/dev/null sqlite3.c -E
former sse version: 0.123s
trunk version: 0.122s
bithacked version: 0.118s
new sse version: 0.115s
The bithacked version is arch-agnostic but harder to follow; the SSE version is very easy to read and faster.
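To make the comparison concrete, here is a rough sketch of what the SSE2 approach looks like (the function name and exact structure are illustrative, not the actual patch): compare 16 bytes at a time against '\n' and '\r', then walk the resulting bitmask to record line-start offsets.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <vector>

// Record the offset of each line start in Buf. "\r\n" counts as a single
// terminator; bare '\r' and '\n' each end a line.
std::vector<unsigned> computeLineOffsets(const char *Buf, unsigned BufLen) {
  std::vector<unsigned> Offsets{0}; // line 1 starts at offset 0
  const __m128i LF = _mm_set1_epi8('\n');
  const __m128i CR = _mm_set1_epi8('\r');
  unsigned I = 0;
  for (; I + 16 <= BufLen; I += 16) {
    __m128i Chunk =
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(Buf + I));
    __m128i Cmp =
        _mm_or_si128(_mm_cmpeq_epi8(Chunk, LF), _mm_cmpeq_epi8(Chunk, CR));
    unsigned Mask = _mm_movemask_epi8(Cmp); // one bit per matching byte
    while (Mask) {
      unsigned Pos = I + __builtin_ctz(Mask);
      Mask &= Mask - 1; // clear the lowest set bit
      if (Buf[Pos] == '\r' && Pos + 1 < BufLen && Buf[Pos + 1] == '\n')
        continue; // "\r\n": let the '\n' record the line start
      Offsets.push_back(Pos + 1);
    }
  }
  for (; I < BufLen; ++I) // scalar tail for the last < 16 bytes
    if (Buf[I] == '\n' ||
        (Buf[I] == '\r' && (I + 1 == BufLen || Buf[I + 1] != '\n')))
      Offsets.push_back(I + 1);
  return Offsets;
}
```

The nice property is that the architecture-specific part is confined to producing `Mask`; the mask-walking loop stays portable.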
bithacked version: https://github.com/serge-sans-paille/llvm-project/blob/feature/line-offset-bithack/clang/lib/Basic/SourceManager.cpp#L1285
sse version: https://github.com/serge-sans-paille/llvm-project/blob/feature/faster-line-offset-init/clang/lib/Basic/SourceManager.cpp#L1296
(it also contains an AVX implementation, which is slightly... slower)
So what do you think? I'd advocate for the new SSE version; maybe @MaskRay has an opinion on this?
On a personal note, I consider the SSE2 version easier to read and maintain, and SSE2 is a *very* widespread instruction set. The way I've split the implementation from the architectural details should make it easy to port to other architectures - I can do it for NEON if need be.
Thanks @lattner for the feedback. I wasn't super happy with the stability of the performance numbers, so I isolated the function in a standalone file and benchmarked several implementations. All the sources are available here: https://github.com/serge-sans-paille/mapping-line-to-offset
I get different results depending on how I configure my CPU (basically whether Turbo Boost is enabled or not).
With Turbo Boost (in ms):
ref: 11.01, seq: 10.86, bithack: 5.01, sse_align: 5, sse: 3
Without Turbo Boost (still in ms):
ref: 23.01, seq: 22.04, bithack: 11.02, sse_align: 10, sse: 6
Does that change your mind? ;-)
I'll happily gather more data; it should be as simple as running make perf -C src once the repo above is cloned.
For completeness, I should mention another optimization: run memchr(Buf, '\r', BufLen) first, and if it finds nothing, use a fast path. That brings an interesting speedup for the ref and seq versions, but not for the SSE one, and it makes the code harder to read.
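The fast-path idea can be sketched roughly as follows (a hypothetical standalone version, not the benchmarked code): if the buffer contains no '\r' at all, which is the common case for Unix sources, the scan only has to look for '\n'.

```cpp
#include <cstring>
#include <vector>

std::vector<unsigned> computeLineOffsetsWithFastPath(const char *Buf,
                                                     unsigned BufLen) {
  std::vector<unsigned> Offsets{0}; // line 1 starts at offset 0
  if (!std::memchr(Buf, '\r', BufLen)) {
    // Fast path: '\n' is the only line terminator in this buffer.
    const char *P = Buf, *End = Buf + BufLen;
    while ((P = static_cast<const char *>(std::memchr(P, '\n', End - P)))) {
      ++P;
      Offsets.push_back(static_cast<unsigned>(P - Buf));
    }
    return Offsets;
  }
  // Slow path: handle '\n', bare '\r', and "\r\n" as a single terminator.
  for (unsigned I = 0; I < BufLen; ++I)
    if (Buf[I] == '\n' ||
        (Buf[I] == '\r' && (I + 1 == BufLen || Buf[I + 1] != '\n')))
      Offsets.push_back(I + 1);
  return Offsets;
}
```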
UL is 32-bit on Windows but 64-bit on Linux/Mac in 64-bit builds (LLP64 vs LP64). Maybe you want 0ULL here?
If so, then you should be able to repro this on linux by building a 32-bit clang binary there.
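To illustrate the pitfall (with hypothetical helper names): on an LLP64 platform such as 64-bit Windows, `unsigned long` is 32 bits, so `~0UL` is 0xFFFFFFFF and masking a 64-bit value with it silently clears the upper 32 bits, while `0ULL` is at least 64 bits everywhere.

```cpp
#include <cstdint>

// Correct on every platform: ~0ULL is at least 64 bits wide.
inline std::uint64_t maskPortable(std::uint64_t V) {
  return V & ~0ULL;
}

// On LLP64 this computes V & 0x00000000FFFFFFFF, dropping the high half;
// on LP64 (Linux/Mac 64-bit) it happens to work, which hides the bug.
inline std::uint64_t maskNonPortable(std::uint64_t V) {
  return V & ~0UL;
}
```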