Download Raw Diff

Details

Reviewers

lattner
jhenderson
MaskRay

Commits

rG20105b6b4874: [clang] Speedup line offset mapping computation
rG6951b72334bb: [clang] Speedup line offset mapping computation

Summary

Instead of scanning byte after byte, scan words after words and rely on a bithack to perform the scan efficiently.

Diff Detail

Unit TestsFailed

	Time	Test
	480 ms	x64 debian > Clang Tools.clang-tidy/infrastructure::export-diagnostics.cpp
	690 ms	x64 debian > Clang.Misc::caret-diags-macros.c
	630 ms	x64 debian > Clang.Misc::diag-trailing-null-bytes.cpp
	760 ms	x64 debian > Clang.Sema::attr-arm-sve-vector-bits.c
	700 ms	x64 debian > Clang.Sema::complex-int.c
		View Full Test Results (18 Failed)

Event Timeline

serge-sans-paille created this revision.Mar 26 2021, 3:58 AM

Herald added a subscriber: dexonsmith. · View Herald TranscriptMar 26 2021, 3:58 AM

serge-sans-paille requested review of this revision.Mar 26 2021, 3:58 AM

Would it help readability and portability to change unsigned long to either uintptr_t, uint32_t, or uint64_t?

Harbormaster completed remote builds in B95858: Diff 333525.Mar 26 2021, 4:31 AM

Very cool. Please follow clang-format's requests. Does this measurably speed things up?

clang used to have hand-vectorized code for SSE and Altivec, but it looks like someone removed it. The

#ifdef __SSE2__
#include <emmintrin.h>
#endif

is a remnant of that, it looks like you could drop that now, or investigate why it was removed and see if it is worth doing something related but updated to newer processors. You might also be interested in checking out llvm/lib/Support/SourceMgr.cpp which has similar algorithms and can also be sped up. It is widely used by mlir and other LLVM ecosystem projects.

-Chris

Looks like it was deleted here:
https://github.com/llvm/llvm-project/commit/d906e731ece9942260df4097ed8e8b4cbfa70d32#diff-81963619e10845105426b4071463347e3af6addea3fc4a9a31ceb207c7986fd3

Thanks for pointing me at the former SSE implementation. This got me into deep thoughts, I've been trying various approaches, among which:

sequential optimization, basically the one currently in main
various optimization involving casting to uint64_t and doing various bit twiddling (damn, bit hacks are fun)
SSE / AVX optimization, based on the 2012 version.

So far, vectorization is the fastest one, I derived from the 2012 version a bit, leading to relatively simple and explicit bitcode.

My test bed is the following: computing the average (over 1000 runs) time run spent the following command: ./bin/clang -w -o/dev/null sqlite3.c -E

former sse version: 0.123s
trunk version: 0.122s
bithacked version: 0.118s
new sse version: 0.115s

the bithacked version is arch-agnostic but more complex to follow, the sse version is very easy to read and faster.

bithacked version: https://github.com/serge-sans-paille/llvm-project/blob/feature/line-offset-bithack/clang/lib/Basic/SourceManager.cpp#L1285
sse version: https://github.com/serge-sans-paille/llvm-project/blob/feature/faster-line-offset-init/clang/lib/Basic/SourceManager.cpp#L1296
(it also contains an AVX implementation, which is slightly... slower)

so what do you think? I'd advocate for the new sse version, maybe @MaskRay has an opinion on this?

Since the target independent version is almost as fast, I'd go with it. Does it help to use LLVM_UNLIKELY with likelyhasbetween ?

Drive-by question:

In D99409#2658848, @serge-sans-paille wrote:

former sse version: 0.123s
trunk version: 0.122s
bithacked version: 0.118s
new sse version: 0.115ms

The units are different for the new SSE version, indicating that it's ~1000x faster than the others. Is that a typo?

In D99409#2659627, @lattner wrote:

Since the target independent version is almost as fast, I'd go with it. Does it help to use LLVM_UNLIKELY with likelyhasbetween ?

clang/lib/Basic/SourceManager.cpp
1281	Capitalize comments.
1283	`if(` -> `if (`
1296	replace tabs with spaces

On a personnal note, I consider the SSE2 version easier to read and maintain, and SSE2 is a *very* widespread instruction set. The way I've split the implementation and the architectural detail should make it easy to port it to other architecture - I can do it for NEON if needs be.

@MaskRay / @lattner can you confirm you want to stay with the « bit hack » version? Note that I don't have strong opinion on this choice, but I still had one I wanted to share ;-)

[EDIT] I updated the bithack link on github, it wasn't the latest version I had, and not the one I benchmarked.

Yeah, I'd stay with the bithack version, but i'd recommend refactoring it to use some helper functions with evocative names.

Thanks @lattner for the feedback. I wasn't super happy with the stability of the performance, so I isolated the function in a standalone file, and benchmarked several implementations. All the sources are available there: https://github.com/serge-sans-paille/mapping-line-to-offset

I get different result depending on how I configure my CPU (basically enabling Turbo-Boost or not).
With Turbo-boost (in ms):

ref: 11.01
seq: 10.86
bithack: 5.01
sse_align: 5
sse: 3

Without Turbo-boost (still in ms)

ref: 23.01
seq: 22.04
bithack: 11.02
sse_align: 10
sse: 6

Does it make you change your mind ? ;-)
I'll happily gather more data, it should be as simple as running make perf -C src once the repo above is cloned.

For complentess, I must mention another optimization: run a first memchr(Buf, BufLen, '\r') and if it finds nothing, use a fast path. That brings interesting speedup for the ref and seq version, but not for the SSE one. And it makes the code more difficult to read.

Ok, I don't care very strongly one way or the other. The appeal of the bithack version was that it works on non-x86 systems.

After a great deal of benchmarking on different architectures, I finally settled for the bithack version, as advised by @lattner . The usage of llvm::countTrailingZeros makes the implementation faster than the previous version and it's easier to read.

Harbormaster completed remote builds in B97238: Diff 335413.Apr 6 2021, 12:04 AM

nice! There is something weird going on with the diff above, please make sure 'git diff' shows something reasonable before pushing. Thanks!

This revision is now accepted and ready to land.Apr 6 2021, 8:42 AM

Fix handling of final character

Harbormaster completed remote builds in B97314: Diff 335529.Apr 6 2021, 9:59 AM

MaskRay added inline comments.Apr 6 2021, 6:00 PM

clang/lib/Basic/SourceManager.cpp
1259	Drop unneeded parentheses. This is not a macro
1287	It might be good converting the function into a closed internal form to avoid -1/+1 at call sites.

Take @MaskRay comments into account

Herald added a reviewer: jhenderson. · View Herald TranscriptApr 6 2021, 11:39 PM

Herald added a reviewer: MaskRay. · View Herald Transcript

Herald added a project: Restricted Project. · View Herald Transcript

Herald added subscribers: llvm-commits, rupprecht. · View Herald Transcript

Harbormaster completed remote builds in B97459: Diff 335731.Apr 7 2021, 12:23 AM

This revision was landed with ongoing or failed builds.Apr 7 2021, 5:10 AM

Closed by commit rG6951b72334bb: [clang] Speedup line offset mapping computation (authored by serge-sans-paille). · Explain Why

This revision was automatically updated to reflect the committed changes.

serge-sans-paille added a commit: rG6951b72334bb: [clang] Speedup line offset mapping computation.

Herald added a project: Restricted Project. · View Herald TranscriptApr 7 2021, 5:10 AM

Herald added a subscriber: cfe-commits. · View Herald Transcript

@serge-sans-paille Please can you look at the clang-ppc buildbot breakages: http://lab.llvm.org:8011/#/builders/52/builds/6108

Looks like this breaks tests on Windows: http://45.33.8.238/win/36495/step_7.txt (takes a bit to load, since many tests are broken)

Please take a look, and please revert for now if it takes a while to fix.

Here too: http://lab.llvm.org:8011/#/builders/127

Oh, looks like someone already reported a breakage over an hour ago. I'll revert for now...

thakis added a reverting change: rGc22b09debddb: Revert "[clang] Speedup line offset mapping computation".Apr 7 2021, 6:42 AM

Thanks @thakis, inestigating the issue is likely to take some time as it seems to be arch or system dependent

thakis added inline comments.Apr 7 2021, 7:26 AM

clang/lib/Basic/SourceManager.cpp
1262	UL is 32-bit on windows but 64-bit on linux/mac in 64-bit builds (LP64 vs LLP64). Maybe you want 0ULL here? If so, then you should be able to repro this on linux by building a 32-bit clang binary there.

serge-sans-paille added inline comments.Apr 7 2021, 9:48 AM

clang/lib/Basic/SourceManager.cpp
1262	That + some endianess issue on ppc64be, resubmit coming soon once I validate locally :-)

serge-sans-paille added a commit: rG20105b6b4874: [clang] Speedup line offset mapping computation.Apr 8 2021, 1:12 AM

probinson added a subscriber: probinson.May 10 2021, 1:57 PM

probinson added inline comments.

clang/lib/Basic/SourceManager.cpp
1255	Typo: multi-byte. Also, I think "inclusive" rather than "included" assuming you mean the range `[m, n]` rather than `(m, n)`.

This is an archive of the discontinued LLVM Phabricator instance.

[clang] Speedup line offset mapping computation
ClosedPublic

Details

Diff Detail

Unit TestsFailed

Event Timeline

Revision Contents

Diff 333525

clang/lib/Basic/SourceManager.cpp

This is an archive of the discontinued LLVM Phabricator instance.

[clang] Speedup line offset mapping computationClosedPublic

Details

Diff Detail

Unit TestsFailed

Event Timeline

Revision Contents

Diff 333525

clang/lib/Basic/SourceManager.cpp

[clang] Speedup line offset mapping computation
ClosedPublic