This is an archive of the discontinued LLVM Phabricator instance.

[Core] Update references in parallel
AbandonedPublic

Authored by davide on Mar 16 2015, 10:14 PM.

Details

Summary

Profiling shows a fair amount of time spent in updateReferences.
Flamegraphs: https://people.freebsd.org/~davide/llvm/lld.svg
The following patch makes updateReferences() run in parallel.

Some numbers, linking clang with lld on FreeBSD 11.
This patch shaves off 15 seconds, making lld 18% faster for this workload.

Unpatched

real 1m23.951s user 2m43.226s sys 0m17.333s
real 1m25.684s user 2m45.534s sys 0m16.882s
real 1m22.960s user 2m40.704s sys 0m16.323s
real 1m24.512s user 2m44.209s sys 0m17.052s
real 1m24.591s user 2m43.689s sys 0m17.472s

Patched

real 1m12.069s user 2m58.951s sys 0m17.113s
real 1m8.631s user 3m4.934s sys 0m17.344s
real 1m11.175s user 3m1.927s sys 0m16.072s
real 1m11.530s user 3m3.798s sys 0m15.587s
real 1m8.922s user 3m3.654s sys 0m16.541s

Diff Detail

Event Timeline

davide updated this revision to Diff 22070.Mar 16 2015, 10:14 PM
davide retitled this revision from to [Core] Update references in parallel .
davide updated this object.
davide edited the test plan for this revision. (Show Details)
davide set the repository for this revision to rL LLVM.
davide added a project: lld.
davide added subscribers: emaste, Unknown Object (MLST).
ruiu added inline comments.Mar 17 2015, 12:35 PM
lib/Core/Resolver.cpp
355

I think this change is not thread-safe because of this line. _deadAtoms.insert is not thread-safe.

davide added inline comments.Mar 17 2015, 1:25 PM
lib/Core/Resolver.cpp
355

Oh, this is really a bummer, considering it's such a hot code path :(
Do you think it makes sense to surround the insert with a lock and measure again, and/or do you have a proposal for an alternative?
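
(For illustration, a minimal sketch of the lock-based option under discussion; the member and helper names below are assumptions, not the actual lld Resolver code.)

  // Illustrative sketch only: _deadAtoms/_deadAtomsMutex/markDead are
  // hypothetical names, not the real lld Resolver members.
  #include <mutex>
  #include <set>

  class Atom; // stand-in for lld's Atom

  static std::set<const Atom *> _deadAtoms;
  static std::mutex _deadAtomsMutex;

  static void markDead(const Atom *atom) {
    // Serializes the insert: correct, but every worker thread now contends
    // on this one lock, which is the concern for such a hot code path.
    std::lock_guard<std::mutex> lock(_deadAtomsMutex);
    _deadAtoms.insert(atom);
  }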

ruiu edited edge metadata.Mar 17 2015, 1:55 PM

I'm sorry, this code doesn't even compile. But I hope you got what I was
trying to say. :)

silvas added a subscriber: silvas.Mar 17 2015, 1:58 PM
silvas added inline comments.
lib/Core/Resolver.cpp
355

Another solution would be to accumulate a separate vector in each thread and then merge them at the end. The const_cast<Reference *>(ref)->setTarget(newTarget); also needs some scrutiny. I would suggest testing these parallelism changes with ThreadSanitizer.
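
(A rough sketch of the per-thread accumulation idea, using plain std::thread and hypothetical names such as isDead(); this is not the code from this patch.)

  // Sketch: each worker fills its own local vector (no sharing, no locks),
  // and the results are merged single-threaded at the end. All names here
  // are illustrative, not the actual lld code.
  #include <algorithm>
  #include <cstddef>
  #include <set>
  #include <thread>
  #include <vector>

  class Atom;                    // stand-in for lld's Atom
  bool isDead(const Atom *atom); // hypothetical predicate for the real check

  void collectDeadAtoms(const std::vector<const Atom *> &atoms,
                        std::set<const Atom *> &deadAtoms,
                        unsigned numThreads) {
    std::vector<std::vector<const Atom *>> perThread(numThreads);
    std::vector<std::thread> workers;
    std::size_t chunk = (atoms.size() + numThreads - 1) / numThreads;

    for (unsigned t = 0; t < numThreads; ++t) {
      workers.emplace_back([&, t] {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(atoms.size(), begin + chunk);
        for (std::size_t i = begin; i < end; ++i)
          if (isDead(atoms[i]))
            perThread[t].push_back(atoms[i]); // thread-local, no contention
      });
    }
    for (std::thread &w : workers)
      w.join();

    // The single-threaded merge is the only synchronization point.
    for (const std::vector<const Atom *> &local : perThread)
      deadAtoms.insert(local.begin(), local.end());
  }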

davide updated this revision to Diff 22318.Mar 19 2015, 4:13 PM
davide edited edge metadata.
davide removed rL LLVM as the repository for this revision.

I updated the patch to use a concurrent accumulator. It's just a prototype and a little hackish, but it worked for me (and it showed a substantial improvement). I'm working in parallel on a concurrent hash map to see if it helps with other bottlenecks in the resolver.

In the meantime, Rafael, do you mind re-running your tests with this patch applied, to see whether it shows some improvement for you (or whether it doesn't)?

silvas requested changes to this revision.Mar 19 2015, 6:33 PM
silvas added a reviewer: silvas.

I really don't like this approach. It has heavy memory contention in the inner loop even though the computation doesn't inherently have any. It also uses an extremely cache-unfriendly design and puts a new in the inner loop, which causes contention inside the memory allocator on top of the usual allocation slowness.

We could just store a bit inside the Atom that marks it as dead. That avoids a bunch of hash table lookups in the _deadAtoms map anyway. We have a ton of space in the Atom base class in the _definition field, which is only using 2 bits of a pointer-aligned field (aligned to the vptr). We already have the _definition field loaded (and in cache) due to the dyn_cast above (which checks _definition). So it's just an OR and a store to mark it: no extra loads, no extra cache misses, no synchronization.
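
(Sketch of what the suggested dead bit could look like; the layout and names below are assumptions for illustration, not the actual Atom definition in lld.)

  // Illustrative only: the real Atom keeps more state in _definition; this
  // just shows the idea of reserving one spare bit to mark an atom dead.
  #include <cstdint>

  class Atom {
  public:
    bool isDead() const { return _definition & DeadBit; }

    // Marking is an OR and a store on a field that is already in cache from
    // the earlier dyn_cast, so no hash lookup and no extra load. If two
    // threads could mark the same atom concurrently, the field would need
    // to be atomic (e.g. fetch_or) to avoid a formal data race.
    void markDead() { _definition |= DeadBit; }

  private:
    static const uint8_t DeadBit = 0x80; // spare bit next to the 2 bits in use
    uint8_t _definition = 0;
  };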

This revision now requires changes to proceed.Mar 19 2015, 6:33 PM
shankarke added inline comments.
lib/Core/Resolver.cpp
351–357

Aside from the current review, I think adding to the deadAtoms set has to happen when the atoms are being coalesced. Why is this being done here, so late, rather than when the atoms are actually coalesced?

Also, if the atom gets coalesced, all atoms referred to from this atom will automatically get dead-stripped, right? Why do you want to add them to the deadAtoms vector?

rafael edited edge metadata.Mar 20 2015, 6:15 AM

In the meantime, Rafael, do you mind re-running your tests with this patch applied, to see whether it shows some improvement for you (or whether it doesn't)?

Linking today's build of clang:

master

   2513.543751      task-clock (msec)         #    1.836 CPUs utilized            ( +-  0.19% )
         68,894      context-switches          #    0.027 M/sec                    ( +-  0.10% )
            537      cpu-migrations            #    0.214 K/sec                    ( +-  1.71% )
        186,760      page-faults               #    0.074 M/sec                    ( +-  0.00% )
  7,441,047,000      cycles                    #    2.960 GHz                      ( +-  0.15% )
  4,815,567,274      stalled-cycles-frontend   #   64.72% frontend cycles idle     ( +-  0.28% )
<not supported>      stalled-cycles-backend
  6,277,994,513      instructions              #    0.84  insns per cycle
                                                #    0.77  stalled cycles per insn ( +-  0.06% )
  1,308,048,483      branches                  #  520.400 M/sec                    ( +-  0.06% )
     29,809,859      branch-misses             #    2.28% of all branches          ( +-  0.09% )

    1.368777139 seconds time elapsed                                               ( +-  0.23% )

master+patch:

   2653.171771      task-clock (msec)         #    1.982 CPUs utilized            ( +-  0.24% )
         68,537      context-switches          #    0.026 M/sec                    ( +-  0.11% )
            588      cpu-migrations            #    0.222 K/sec                    ( +-  2.27% )
        187,068      page-faults               #    0.071 M/sec                    ( +-  0.00% )
  7,939,751,958      cycles                    #    2.993 GHz                      ( +-  0.29% )
  5,162,486,783      stalled-cycles-frontend   #   65.02% frontend cycles idle     ( +-  0.53% )
<not supported>      stalled-cycles-backend
  7,016,657,070      instructions              #    0.88  insns per cycle
                                                #    0.74  stalled cycles per insn ( +-  0.08% )
  1,581,215,011      branches                  #  595.972 M/sec                    ( +-  0.08% )
     29,727,439      branch-misses             #    1.88% of all branches          ( +-  0.14% )

    1.338744727 seconds time elapsed                                               ( +-  0.30% )

Cheers,
Rafael

davide abandoned this revision.May 14 2015, 5:30 PM

Abandoning. We're very likely going to use sections instead of atoms, so this doesn't really make sense anymore.