This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
include/lld/Core/
-
lld/
-
Core/
-
Parallel.h
-
Resolver.h
-
lib/Core/
-
Core/
4
Resolver.cpp

Differential D8372

[Core] Update references in parallel
AbandonedPublic

Authored by davide on Mar 16 2015, 10:14 PM.

Download Raw Diff

Details

Reviewers

ruiu
silvas
• rafael
• shankar.easwaran

Summary

Profiling shows a fair amount of time spent in updateReferences.
Flamegraphs; https://people.freebsd.org/~davide/llvm/lld.svg
The following patch make updateReference() working in parallel.

Some numbers, linking clang with lld on FreeBSD 11.
This patch shaves 15 seconds making lld 18% faster for this workload.

Unpatched

real 1m23.951s user 2m43.226s sys 0m17.333s
real 1m25.684s user 2m45.534s sys 0m16.882s
real 1m22.960s user 2m40.704s sys 0m16.323s
real 1m24.512s user 2m44.209s sys 0m17.052s
real 1m24.591s user 2m43.689s sys 0m17.472s

Patched

real 1m12.069s user 2m58.951s sys 0m17.113s
real 1m8.631s user 3m4.934s sys 0m17.344s
real 1m11.175s user 3m1.927s sys 0m16.072s
real 1m11.530s user 3m3.798s sys 0m15.587s
real 1m8.922s user 3m3.654s sys 0m16.541s

Diff Detail

Event Timeline

davide updated this revision to Diff 22070.Mar 16 2015, 10:14 PM

davide retitled this revision from to [Core] Update references in parallel .

davide updated this object.

davide edited the test plan for this revision. (Show Details)

davide added reviewers: ruiu, • shankar.easwaran, • rafael.

davide set the repository for this revision to rL LLVM.

davide added a project: lld.

davide added subscribers: emaste, Unknown Object (MLST).

ruiu added inline comments.Mar 17 2015, 12:35 PM

lib/Core/Resolver.cpp
355	I think this change is not thread-safe because of this line. _deadAtoms.insert is not thread-safe.

davide added inline comments.Mar 17 2015, 1:25 PM

lib/Core/Resolver.cpp
355	Oh this is really a bummer, considering it's such an hot codepath :( Do you think it makes sense to surround insert with a lock and measure again and/or do you have a proposal for an alternative?

I'm sorry, this code doesn't even compile. But I hope you got what I tried
to mean. :)

silvas added a subscriber: silvas.Mar 17 2015, 1:58 PM

silvas added inline comments.

lib/Core/Resolver.cpp
355	Another solution would be to accumulate a separate vector in each thread, then merge them at the end. The `const_cast<Reference *>(ref)->setTarget(newTarget);` also needs some scrutiny. I would suggest testing these parallelism changes with threadsanitizer.

I updated the patch to use a concurrent accumulator. It's just a prototype, and a little bit hack'ish but it worked for me (and it showed up a substantial improvement). I'm working in parallel on a concurrent hash map to see if it helps with other bottlenecks in the resolver.

In the meanwhile, Rafael, do you mind to re-run your tests with this patch applied in order to see if it shows up some improvement for you (or if it doesn't) ?

I really don't like this approach. It has large memory contention in the inner loop even though this computation doesn't inherently have any. Also it is using an extremely cache-unfriendly design, along with putting a new into the inner loop, causing contention inside the memory allocator besides the usual allocation slowness.

We could just store a bit inside the Atom that marks it as dead. That avoids a bunch of hash table lookups in the _deadAtoms map anyway. We have a ton of space in the Atom base class in the _definition field, which is only using 2 bits of a pointer-aligned field (aligned to the vptr). We already have the _definition field loaded (and in cache) due to the dyn_cast above (which checks _definition). So it's just an OR and a store to mark it: no extra loads, no extra cache misses, no synchronization.

This revision now requires changes to proceed.Mar 19 2015, 6:33 PM

shankarke added a subscriber: shankarke.Mar 19 2015, 7:27 PM

shankarke added inline comments.

lib/Core/Resolver.cpp
351–357	Aside from the current review, I think adding to the deadAtoms set has to happen when the atoms are being coalesced. Why is this being done here so late and not when the atoms are being really coalesced ? Also if the atom gets coalesced, all atoms that are being referred from this atom will automatically get deadstripped right ? Why do you want to add to the deadAtoms vector ?

In the meanwhile, Rafael, do you mind to re-run your tests with this patch applied in order to see if it shows up some improvement for you (or if it doesn't) ?

Linking today's build of clang:

master

2513.543751      task-clock (msec)         #    1.836 CPUs

utilized ( +- 0.19% )

       68,894      context-switches          #    0.027 M/sec
          ( +-  0.10% )
          537      cpu-migrations            #    0.214 K/sec
          ( +-  1.71% )
      186,760      page-faults               #    0.074 M/sec
          ( +-  0.00% )
7,441,047,000      cycles                    #    2.960 GHz
          ( +-  0.15% )
4,815,567,274      stalled-cycles-frontend   #   64.72% frontend

cycles idle ( +- 0.28% )

<not supported>      stalled-cycles-backend
  6,277,994,513      instructions              #    0.84  insns per

cycle

0.77 stalled

cycles per insn ( +- 0.06% )

1,308,048,483      branches                  #  520.400 M/sec
          ( +-  0.06% )
   29,809,859      branch-misses             #    2.28% of all

branches ( +- 0.09% )

1.368777139 seconds time elapsed
   ( +-  0.23% )

master+patch:

2653.171771      task-clock (msec)         #    1.982 CPUs

utilized ( +- 0.24% )

       68,537      context-switches          #    0.026 M/sec
          ( +-  0.11% )
          588      cpu-migrations            #    0.222 K/sec
          ( +-  2.27% )
      187,068      page-faults               #    0.071 M/sec
          ( +-  0.00% )
7,939,751,958      cycles                    #    2.993 GHz
          ( +-  0.29% )
5,162,486,783      stalled-cycles-frontend   #   65.02% frontend

cycles idle ( +- 0.53% )

<not supported>      stalled-cycles-backend
  7,016,657,070      instructions              #    0.88  insns per

cycle

0.74 stalled

cycles per insn ( +- 0.08% )

1,581,215,011      branches                  #  595.972 M/sec
          ( +-  0.08% )
   29,727,439      branch-misses             #    1.88% of all

branches ( +- 0.14% )

1.338744727 seconds time elapsed
   ( +-  0.30% )

Cheers,
Rafael

Abandon. We're going to use sections instead of atoms very likely so this doesn't really make sense.

Revision Contents

Path

Size

include/

lld/

Core/

Parallel.h

32 lines

Resolver.h

1 line

lib/

Core/

Resolver.cpp

13 lines

Diff 22318

include/lld/Core/Parallel.h

Show First 20 Lines • Show All 298 Lines • ▼ Show 20 Lines	void parallel_for_each(Iterator begin, Iterator end, Func func) {
ptrdiff_t taskSize = 1024;		ptrdiff_t taskSize = 1024;
while (taskSize <= std::distance(begin, end)) {		while (taskSize <= std::distance(begin, end)) {
tg.spawn([=, &func] { std::for_each(begin, begin + taskSize, func); });		tg.spawn([=, &func] { std::for_each(begin, begin + taskSize, func); });
begin += taskSize;		begin += taskSize;
}		}
std::for_each(begin, end, func);		std::for_each(begin, end, func);
}		}
#endif		#endif

		template<typename T>
		struct node
		{
		T data;
		node *next;

		node(const T data) : data(data), next(nullptr) {}
		T getPayload() { return data; }
		node<T>* getNext() { return next; };
		};

		template<typename T>
		class ConcurrentLIFOQueue
		{
		public:
		ConcurrentLIFOQueue() : head(nullptr) {}

		void insert(const T& data)
		{
		node<T>* new_node = new node<T>(data);
		node<T>* old_head = head.load(std::memory_order_relaxed);
		do {
		new_node->next = old_head;
		} while (!head.compare_exchange_weak(old_head, new_node,
		std::memory_order_release,
		std::memory_order_relaxed));
		}
		node<T>* getHead() { return head; }
		private:
		std::atomic<node<T> *> head;
		};
} // end namespace lld		} // end namespace lld

#endif		#endif

include/lld/Core/Resolver.h

Show First 20 Lines • Show All 85 Lines • ▼ Show 20 Lines	private:
};		};

LinkingContext &_ctx;		LinkingContext &_ctx;
SymbolTable _symbolTable;		SymbolTable _symbolTable;
std::vector<const Atom *> _atoms;		std::vector<const Atom *> _atoms;
std::set<const Atom *> _deadStripRoots;		std::set<const Atom *> _deadStripRoots;
llvm::DenseSet<const Atom *> _liveAtoms;		llvm::DenseSet<const Atom *> _liveAtoms;
llvm::DenseSet<const Atom *> _deadAtoms;		llvm::DenseSet<const Atom *> _deadAtoms;
		ConcurrentLIFOQueue<const Atom *> _deadAtomsAccumulator;
std::unique_ptr<MergedFile> _result;		std::unique_ptr<MergedFile> _result;
std::unordered_multimap<const Atom , const Atom > _reverseRef;		std::unordered_multimap<const Atom , const Atom > _reverseRef;

// --start-group and --end-group		// --start-group and --end-group
std::vector<File *> _files;		std::vector<File *> _files;
std::map<File *, bool> _newUndefinesAdded;		std::map<File *, bool> _newUndefinesAdded;
size_t _fileIndex;		size_t _fileIndex;

Show All 18 Lines

lib/Core/Resolver.cpp

Show First 20 Lines • Show All 335 Lines • ▼ Show 20 Lines	for (;;) {
_newUndefinesAdded[file] = undefAdded;		_newUndefinesAdded[file] = undefAdded;
}		}
}		}

// switch all references to undefined or coalesced away atoms		// switch all references to undefined or coalesced away atoms
// to the new defined atom		// to the new defined atom
void Resolver::updateReferences() {		void Resolver::updateReferences() {
ScopedTask task(getDefaultDomain(), "updateReferences");		ScopedTask task(getDefaultDomain(), "updateReferences");
for (const Atom *atom : _atoms) {		parallel_for_each(_atoms.begin(), _atoms.end(),
		[&](const Atom *atom) {
if (const DefinedAtom *defAtom = dyn_cast<DefinedAtom>(atom)) {		if (const DefinedAtom *defAtom = dyn_cast<DefinedAtom>(atom)) {
for (const Reference ref : defAtom) {		for (const Reference ref : defAtom) {
// A reference of type kindAssociate should't be updated.		// A reference of type kindAssociate should't be updated.
// Instead, an atom having such reference will be removed		// Instead, an atom having such reference will be removed
// if the target atom is coalesced away, so that they will		// if the target atom is coalesced away, so that they will
// go away as a group.		// go away as a group.
if (ref->kindNamespace() == lld::Reference::KindNamespace::all &&		if (ref->kindNamespace() == lld::Reference::KindNamespace::all &&
ref->kindValue() == lld::Reference::kindAssociate) {		ref->kindValue() == lld::Reference::kindAssociate) {
if (_symbolTable.isCoalescedAway(atom))		if (_symbolTable.isCoalescedAway(atom))
_deadAtoms.insert(ref->target());		_deadAtomsAccumulator.insert(ref->target());
		ruiuUnsubmitted Not Done Reply Inline Actions I think this change is not thread-safe because of this line. _deadAtoms.insert is not thread-safe. ruiu: I think this change is not thread-safe because of this line. _deadAtoms.insert is not thread…
		davideAuthorUnsubmitted Not Done Reply Inline Actions Oh this is really a bummer, considering it's such an hot codepath :( Do you think it makes sense to surround insert with a lock and measure again and/or do you have a proposal for an alternative? davide: Oh this is really a bummer, considering it's such an hot codepath :( Do you think it makes…
		silvasUnsubmitted Not Done Reply Inline Actions Another solution would be to accumulate a separate vector in each thread, then merge them at the end. The `const_cast<Reference >(ref)->setTarget(newTarget);` also needs some scrutiny. I would suggest testing these parallelism changes with threadsanitizer. silvas:* Another solution would be to accumulate a separate vector in each thread, then merge them at…
continue;		continue;
}		}
		shankarkeUnsubmitted Not Done Reply Inline Actions Aside from the current review, I think adding to the deadAtoms set has to happen when the atoms are being coalesced. Why is this being done here so late and not when the atoms are being really coalesced ? Also if the atom gets coalesced, all atoms that are being referred from this atom will automatically get deadstripped right ? Why do you want to add to the deadAtoms vector ? shankarke: Aside from the current review, I think adding to the deadAtoms set has to happen when the atoms…
const Atom *newTarget = _symbolTable.replacement(ref->target());		const Atom *newTarget = _symbolTable.replacement(ref->target());
const_cast<Reference *>(ref)->setTarget(newTarget);		const_cast<Reference *>(ref)->setTarget(newTarget);
}		}
}		}
}		});
}		}

// For dead code stripping, recursively mark atoms "live"		// For dead code stripping, recursively mark atoms "live"
void Resolver::markLive(const Atom *atom) {		void Resolver::markLive(const Atom *atom) {
// Mark the atom is live. If it's already marked live, then stop recursion.		// Mark the atom is live. If it's already marked live, then stop recursion.
auto exists = _liveAtoms.insert(atom);		auto exists = _liveAtoms.insert(atom);
if (!exists.second)		if (!exists.second)
return;		return;
▲ Show 20 Lines • Show All 116 Lines • ▼ Show 20 Lines	_atoms.erase(std::remove_if(_atoms.begin(), _atoms.end(), [&](const Atom *a) {
_atoms.end());		_atoms.end());
}		}

bool Resolver::resolve() {		bool Resolver::resolve() {
updatePreloadArchiveMap();		updatePreloadArchiveMap();
if (!resolveUndefines())		if (!resolveUndefines())
return false;		return false;
updateReferences();		updateReferences();
		// XXX: rewrite me please to use an iterator.
		node<const Atom > tmp = _deadAtomsAccumulator.getHead();
		while (tmp != nullptr) {
		_deadAtoms.insert(tmp->getPayload());
		tmp = tmp->getNext();
		}
deadStripOptimize();		deadStripOptimize();
if (checkUndefines())		if (checkUndefines())
if (!_ctx.allowRemainingUndefines())		if (!_ctx.allowRemainingUndefines())
return false;		return false;
removeCoalescedAwayAtoms();		removeCoalescedAwayAtoms();
_result->addAtoms(_atoms);		_result->addAtoms(_atoms);
return true;		return true;
}		}
Show All 15 Lines