This is an archive of the discontinued LLVM Phabricator instance.

[clang] Speedup line offset mapping computation

Authored by serge-sans-paille on Mar 26 2021, 3:58 AM.



Instead of scanning byte by byte, scan word by word and rely on a bit hack to perform the scan efficiently.
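
The word-at-a-time idea can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the exact patch; it reads 8 bytes at a time and uses an exact zero-byte bit hack plus a count-trailing-zeros builtin to locate each '\n':

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Exact per-byte test: the result has 0x80 in every byte of v that is zero,
// and 0x00 in every other byte (no false positives, no cross-byte carries).
static inline uint64_t zeroByteMask(uint64_t v) {
  const uint64_t K = 0x7F7F7F7F7F7F7F7FULL;
  return ~(((v & K) + K) | v | K);
}

// Collect the offsets at which each line starts, scanning 8 bytes per step.
// Hypothetical sketch of the technique, not the clang implementation.
std::vector<unsigned> lineOffsets(const char *Buf, size_t Len) {
  std::vector<unsigned> Offsets{0}; // line 1 starts at offset 0
  const uint64_t LFs = 0x0A0A0A0A0A0A0A0AULL; // '\n' in every byte
  size_t I = 0;
  for (; I + 8 <= Len; I += 8) {
    uint64_t Word;
    std::memcpy(&Word, Buf + I, 8);          // avoids alignment issues
    uint64_t Hits = zeroByteMask(Word ^ LFs); // bytes equal to '\n'
    while (Hits) {
      // Lowest set bit -> lowest matching byte (little-endian layout;
      // LLVM's countTrailingZeros plays the same role as the builtin).
      unsigned Byte = __builtin_ctzll(Hits) / 8;
      Offsets.push_back(I + Byte + 1);        // next line starts after '\n'
      Hits &= Hits - 1;                       // clear the lowest set bit
    }
  }
  for (; I < Len; ++I)                        // scalar tail
    if (Buf[I] == '\n')
      Offsets.push_back(I + 1);
  return Offsets;
}
```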

Diff Detail

Event Timeline

serge-sans-paille requested review of this revision. Mar 26 2021, 3:58 AM

Would it help readability and portability to change unsigned long to either uintptr_t, uint32_t, or uint64_t?

Very cool. Please follow clang-format's requests. Does this measurably speed things up?

clang used to have hand-vectorized code for SSE and Altivec, but it looks like someone removed it. The

#ifdef __SSE2__
#include <emmintrin.h>

is a remnant of that; it looks like you could drop it now, or investigate why it was removed and see whether it is worth doing something similar but updated for newer processors. You might also be interested in checking out llvm/lib/Support/SourceMgr.cpp, which has similar algorithms and can also be sped up. It is widely used by mlir and other LLVM ecosystem projects.


serge-sans-paille added a subscriber: MaskRay. Edited Mar 30 2021, 8:44 AM

Thanks for pointing me to the former SSE implementation. It got me thinking, and I've been trying various approaches, among them:

  • sequential optimization, basically the one currently in main
  • various optimizations involving casting to uint64_t and doing bit twiddling (damn, bit hacks are fun)
  • SSE / AVX optimization, based on the 2012 version.

So far, vectorization is the fastest. I diverged from the 2012 version a bit, leading to relatively simple and explicit bitcode.

My test bed is the following: computing the average (over 1000 runs) time spent running the following command: ./bin/clang -w -o/dev/null sqlite3.c -E

former sse version: 0.123s
trunk version: 0.122s
bithacked version: 0.118s
new sse version: 0.115s

The bithacked version is arch-agnostic but harder to follow; the SSE version is very easy to read and faster.

bithacked version:
sse version:
(it also contains an AVX implementation, which is slightly... slower)

So, what do you think? I'd advocate for the new SSE version; maybe @MaskRay has an opinion on this?
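
For reference, the SSE2 flavor under discussion can be sketched like this. This is a hypothetical reconstruction, not the actual revision: it compares 16 bytes at once against '\n' and '\r' and extracts a bitmask of matches, and it deliberately ignores "\r\n" coalescing for brevity:

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics
#include <vector>

// Hypothetical SSE2 sketch: record an offset after every '\n' or '\r'.
std::vector<unsigned> lineOffsetsSSE2(const char *Buf, size_t Len) {
  std::vector<unsigned> Offsets{0};
  const __m128i LF = _mm_set1_epi8('\n');
  const __m128i CR = _mm_set1_epi8('\r');
  size_t I = 0;
  for (; I + 16 <= Len; I += 16) {
    __m128i Chunk =
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(Buf + I));
    // One comparison per terminator, OR'd together, then compressed to a
    // 16-bit mask: bit k is set when byte k is '\n' or '\r'.
    __m128i Hit = _mm_or_si128(_mm_cmpeq_epi8(Chunk, LF),
                               _mm_cmpeq_epi8(Chunk, CR));
    unsigned Mask = _mm_movemask_epi8(Hit);
    while (Mask) {
      unsigned Byte = __builtin_ctz(Mask); // position of the next match
      Offsets.push_back(I + Byte + 1);
      Mask &= Mask - 1;                    // clear the lowest set bit
    }
  }
  for (; I < Len; ++I)                     // scalar tail
    if (Buf[I] == '\n' || Buf[I] == '\r')
      Offsets.push_back(I + 1);
  return Offsets;
}
```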

Since the target-independent version is almost as fast, I'd go with it. Does it help to use LLVM_UNLIKELY with likelyhasbetween?
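
For context, the generic "likelyhasbetween" bit hack (from Sean Anderson's Bit Twiddling Hacks; the exact form in the patch may differ) is a cheap filter that can report false positives, which is precisely why the precise byte-wise fallback behind it is a candidate for an LLVM_UNLIKELY-style branch hint:

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t Ones = ~0ULL / 255; // 0x0101010101010101

// Nonzero when the 8-byte word *probably* contains a byte b with m < b < n.
// False positives are possible (no false negatives for bytes below 0x80),
// so a hit must be confirmed by a precise, rarely-taken byte-wise check.
constexpr uint64_t likelyHasBetween(uint64_t x, unsigned m, unsigned n) {
  return ((x - Ones * n) & ~x & ((x & (Ones * 127)) + Ones * (127 - m))) &
         (Ones * 128);
}
```

With m = '\n' - 1 and n = '\r' + 1, a nonzero result flags words that may contain a line-ending byte (values 10 through 13).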

Drive-by question:

former sse version: 0.123s
trunk version: 0.122s
bithacked version: 0.118s
new sse version: 0.115ms

The units are different for the new SSE version, indicating that it's ~1000x faster than the others. Is that a typo?




Capitalize comments.


if( -> if (


Replace tabs with spaces.

On a personal note, I consider the SSE2 version easier to read and maintain, and SSE2 is a *very* widespread instruction set. The way I've split the implementation from the architectural details should make it easy to port to other architectures; I can do it for NEON if need be.

@MaskRay / @lattner can you confirm you want to stay with the "bit hack" version? Note that I don't have a strong opinion on this choice, but I still had one I wanted to share ;-)

[EDIT] I updated the bithack link on github; it wasn't the latest version I had, nor the one I benchmarked.

Yeah, I'd stay with the bithack version, but I'd recommend refactoring it to use some helper functions with evocative names.

serge-sans-paille added a comment. Edited Apr 3 2021, 12:56 PM

Thanks @lattner for the feedback. I wasn't super happy with the stability of the performance numbers, so I isolated the function in a standalone file and benchmarked several implementations. All the sources are available there:

I get different results depending on how I configure my CPU (basically whether Turbo Boost is enabled or not).
With Turbo Boost (in ms):

ref: 11.01
seq: 10.86
bithack: 5.01
sse_align: 5
sse: 3

Without Turbo Boost (still in ms):

ref: 23.01
seq: 22.04
bithack: 11.02
sse_align: 10
sse: 6

Does it make you change your mind? ;-)
I'll happily gather more data; it should be as simple as running make perf -C src once the repo above is cloned.

For completeness, I must mention another optimization: run a first memchr(Buf, '\r', BufLen) and, if it finds nothing, use a fast path. That brings an interesting speedup for the ref and seq versions, but not for the SSE one. And it makes the code harder to read.
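
The fast-path idea mentioned above can be sketched like this (a hypothetical illustration, not the benchmarked code): one memchr decides up front whether the buffer contains any '\r', so the common Unix-only case matches a single character in the hot loop:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical sketch: memchr-based fast path for '\n'-only buffers.
std::vector<unsigned> lineOffsetsFastPath(const char *Buf, size_t Len) {
  std::vector<unsigned> Offsets{0};
  if (!std::memchr(Buf, '\r', Len)) {
    // Fast path: Unix-style line endings only, one comparison per byte.
    for (size_t I = 0; I < Len; ++I)
      if (Buf[I] == '\n')
        Offsets.push_back(I + 1);
  } else {
    // Slow path: handle '\r', '\n', and "\r\n" (counted as one terminator).
    for (size_t I = 0; I < Len; ++I)
      if (Buf[I] == '\n' || Buf[I] == '\r') {
        if (Buf[I] == '\r' && I + 1 < Len && Buf[I + 1] == '\n')
          ++I; // skip the '\n' of a "\r\n" pair
        Offsets.push_back(I + 1);
      }
  }
  return Offsets;
}
```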

Ok, I don't care very strongly one way or the other. The appeal of the bithack version was that it works on non-x86 systems.

After a great deal of benchmarking on different architectures, I finally settled on the bithack version, as advised by @lattner. The use of llvm::countTrailingZeros makes the implementation faster than the previous version, and it's easier to read.

lattner accepted this revision. Apr 6 2021, 8:42 AM

Nice! There is something weird going on with the diff above; please make sure 'git diff' shows something reasonable before pushing. Thanks!

This revision is now accepted and ready to land. Apr 6 2021, 8:42 AM

Fix handling of final character

MaskRay added inline comments. Apr 6 2021, 6:00 PM

Drop unneeded parentheses. This is not a macro.


It might be good to convert the function into a closed internal form to avoid the -1/+1 at call sites.

Take @MaskRay's comments into account

Herald added a reviewer: MaskRay. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
This revision was landed with ongoing or failed builds. Apr 7 2021, 5:10 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald Transcript. Apr 7 2021, 5:10 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
RKSimon added a subscriber: RKSimon. Apr 7 2021, 5:48 AM

@serge-sans-paille Can you please look at the clang-ppc buildbot breakages:

thakis added a subscriber: thakis. Apr 7 2021, 6:39 AM

Looks like this breaks tests on Windows: (takes a bit to load, since many tests are broken)

Please take a look, and please revert for now if it takes a while to fix.

thakis added a comment. Apr 7 2021, 6:41 AM

Oh, looks like someone already reported a breakage over an hour ago. I'll revert for now...

Thanks @thakis; investigating the issue is likely to take some time, as it seems to be arch- or system-dependent.

thakis added inline comments. Apr 7 2021, 7:26 AM

UL is 32-bit on Windows but 64-bit on Linux/Mac in 64-bit builds (LP64 vs LLP64). Maybe you want 0ULL here?

If so, then you should be able to repro this on linux by building a 32-bit clang binary there.
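
The pitfall can be made concrete with a small sketch (not the patch itself): on LP64 targets unsigned long is 8 bytes, on LLP64 only 4, so a mask built from 0UL or ~0UL silently loses its upper 32 bits on 64-bit Windows, while fixed-width types (or a ULL literal) behave the same everywhere:

```cpp
#include <cstdint>

// Portable all-ones 64-bit mask: immune to the LP64 vs LLP64 difference,
// unlike ~0UL, which yields only 0xFFFFFFFF on 64-bit Windows.
constexpr std::uint64_t portableAllOnes() {
  return ~std::uint64_t{0}; // always 0xFFFFFFFFFFFFFFFF
}
```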


That plus some endianness issue on ppc64be; resubmit coming soon once I validate locally :-)

probinson added inline comments.

Typo: multi-byte. Also, I think "inclusive" rather than "included", assuming you mean the range [m, n] rather than (m, n).