Instead of scanning byte by byte, scan word by word, relying on a bithack to perform the scan efficiently.
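For illustration, the word-at-a-time idea boils down to the classic "find a byte in a word" SWAR trick. This is a hedged sketch, not the actual patch (the function names are made up, and it assumes a little-endian target):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Classic SWAR trick: the result has the high bit set in the byte positions
// of Word that equal Byte (exact for the lowest matching byte).
static inline std::uint64_t bytesEqual(std::uint64_t Word, std::uint8_t Byte) {
  const std::uint64_t Ones = 0x0101010101010101ULL;
  std::uint64_t X = Word ^ (Ones * Byte); // matching bytes become zero
  return (X - Ones) & ~X & (Ones * 0x80); // flag the zero bytes
}

// Return the offset of the first '\n' in Buf, or Len if there is none,
// scanning 8 bytes at a time. Assumes little endian, where the lowest
// flagged byte is the first one in memory.
std::size_t findNewline(const char *Buf, std::size_t Len) {
  std::size_t I = 0;
  for (; I + 8 <= Len; I += 8) {
    std::uint64_t Word;
    std::memcpy(&Word, Buf + I, 8); // avoids alignment/aliasing issues
    if (std::uint64_t Hit = bytesEqual(Word, '\n'))
      return I + __builtin_ctzll(Hit) / 8;
  }
  for (; I < Len; ++I) // scalar tail
    if (Buf[I] == '\n')
      return I;
  return Len;
}
```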
Very cool. Please follow clang-format's requests. Does this measurably speed things up?
clang used to have hand-vectorized code for SSE and Altivec, but it looks like someone removed it. The
#ifdef __SSE2__ #include <emmintrin.h> #endif
is a remnant of that. It looks like you could drop it now, or investigate why it was removed and see whether it is worth doing something similar, updated for newer processors. You might also be interested in checking out llvm/lib/Support/SourceMgr.cpp, which has similar algorithms and can also be sped up. It is widely used by MLIR and other LLVM ecosystem projects.
Thanks for pointing me at the former SSE implementation. This got me thinking, and I've been trying various approaches, among which:
- sequential optimization, basically the one currently in main
- various optimizations involving casting to uint64_t and doing assorted bit twiddling (damn, bit hacks are fun)
- SSE / AVX optimization, based on the 2012 version.
So far, vectorization is the fastest. I diverged from the 2012 version a bit, which leads to relatively simple and explicit bitcode.
My test bed is the following: compute the average (over 1000 runs) of the time spent running the following command: ./bin/clang -w -o/dev/null sqlite3.c -E
former sse version: 0.123s
trunk version: 0.122s
bithacked version: 0.118s
new sse version: 0.115s
The bithacked version is arch-agnostic but harder to follow; the SSE version is very easy to read and faster.
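To make the comparison concrete, here is a rough sketch of what the SSE2 approach looks like (the function name and exact structure are illustrative, not the actual patch): compare 16 bytes at a time against '\n' and '\r', then walk the resulting bitmask to record line-start offsets.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <vector>

// Record the offset of each line start in Buf. "\r\n" counts as a single
// terminator; bare '\r' and '\n' each end a line.
std::vector<unsigned> computeLineOffsets(const char *Buf, unsigned BufLen) {
  std::vector<unsigned> Offsets{0}; // line 1 starts at offset 0
  const __m128i LF = _mm_set1_epi8('\n');
  const __m128i CR = _mm_set1_epi8('\r');
  unsigned I = 0;
  for (; I + 16 <= BufLen; I += 16) {
    __m128i Chunk =
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(Buf + I));
    __m128i Cmp =
        _mm_or_si128(_mm_cmpeq_epi8(Chunk, LF), _mm_cmpeq_epi8(Chunk, CR));
    unsigned Mask = _mm_movemask_epi8(Cmp); // one bit per matching byte
    while (Mask) {
      unsigned Pos = I + __builtin_ctz(Mask);
      Mask &= Mask - 1; // clear the lowest set bit
      if (Buf[Pos] == '\r' && Pos + 1 < BufLen && Buf[Pos + 1] == '\n')
        continue; // "\r\n": let the '\n' record the line start
      Offsets.push_back(Pos + 1);
    }
  }
  for (; I < BufLen; ++I) // scalar tail for the last < 16 bytes
    if (Buf[I] == '\n' ||
        (Buf[I] == '\r' && (I + 1 == BufLen || Buf[I + 1] != '\n')))
      Offsets.push_back(I + 1);
  return Offsets;
}
```

The nice property is that the architecture-specific part is confined to producing `Mask`; the mask-walking loop stays portable.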
bithacked version: https://github.com/serge-sans-paille/llvm-project/blob/feature/line-offset-bithack/clang/lib/Basic/SourceManager.cpp#L1285
sse version: https://github.com/serge-sans-paille/llvm-project/blob/feature/faster-line-offset-init/clang/lib/Basic/SourceManager.cpp#L1296
(it also contains an AVX implementation, which is slightly... slower)
So what do you think? I'd advocate for the new SSE version; maybe @MaskRay has an opinion on this?
On a personal note, I consider the SSE2 version easier to read and maintain, and SSE2 is a *very* widespread instruction set. The way I've split the implementation from the architectural details should make it easy to port to other architectures - I can do it for NEON if need be.
Thanks @lattner for the feedback. I wasn't super happy with the stability of the performance numbers, so I isolated the function in a standalone file and benchmarked several implementations. All the sources are available here: https://github.com/serge-sans-paille/mapping-line-to-offset
I get different results depending on how I configure my CPU (basically whether Turbo Boost is enabled or not).
With Turbo Boost (in ms):
ref: 11.01, seq: 10.86, bithack: 5.01, sse_align: 5, sse: 3
Without Turbo Boost (still in ms):
ref: 23.01, seq: 22.04, bithack: 11.02, sse_align: 10, sse: 6
Does that change your mind? ;-)
I'll happily gather more data; it should be as simple as running make perf -C src once the repo above is cloned.
For completeness, I should mention another optimization: run memchr(Buf, '\r', BufLen) first, and if it finds nothing, use a fast path. That brings an interesting speedup for the ref and seq versions, but not for the SSE one, and it makes the code harder to read.
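The fast-path idea can be sketched roughly as follows (a hypothetical standalone version, not the benchmarked code): if the buffer contains no '\r' at all, which is the common case for Unix sources, the scan only has to look for '\n'.

```cpp
#include <cstring>
#include <vector>

std::vector<unsigned> computeLineOffsetsWithFastPath(const char *Buf,
                                                     unsigned BufLen) {
  std::vector<unsigned> Offsets{0}; // line 1 starts at offset 0
  if (!std::memchr(Buf, '\r', BufLen)) {
    // Fast path: '\n' is the only line terminator in this buffer.
    const char *P = Buf, *End = Buf + BufLen;
    while ((P = static_cast<const char *>(std::memchr(P, '\n', End - P)))) {
      ++P;
      Offsets.push_back(static_cast<unsigned>(P - Buf));
    }
    return Offsets;
  }
  // Slow path: handle '\n', bare '\r', and "\r\n" as a single terminator.
  for (unsigned I = 0; I < BufLen; ++I)
    if (Buf[I] == '\n' ||
        (Buf[I] == '\r' && (I + 1 == BufLen || Buf[I + 1] != '\n')))
      Offsets.push_back(I + 1);
  return Offsets;
}
```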
UL is 32-bit on Windows but 64-bit on Linux/Mac in 64-bit builds (LLP64 vs LP64). Maybe you want 0ULL here?
If so, then you should be able to repro this on linux by building a 32-bit clang binary there.
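To illustrate the pitfall (with hypothetical helper names): on an LLP64 platform such as 64-bit Windows, `unsigned long` is 32 bits, so `~0UL` is 0xFFFFFFFF and masking a 64-bit value with it silently clears the upper 32 bits, while `0ULL` is at least 64 bits everywhere.

```cpp
#include <cstdint>

// Correct on every platform: ~0ULL is at least 64 bits wide.
inline std::uint64_t maskPortable(std::uint64_t V) {
  return V & ~0ULL;
}

// On LLP64 this computes V & 0x00000000FFFFFFFF, dropping the high half;
// on LP64 (Linux/Mac 64-bit) it happens to work, which hides the bug.
inline std::uint64_t maskNonPortable(std::uint64_t V) {
  return V & ~0UL;
}
```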