This is an archive of the discontinued LLVM Phabricator instance.

[LiveDebugValues] Encode register location within VarLoc IDs [3/3]
ClosedPublic

Authored by vsk on Feb 21 2020, 12:17 PM.

Details

Summary

This is part 3 of a 3-part series to address a compile-time explosion
issue in LiveDebugValues.


Start encoding register locations within VarLoc IDs, and take advantage
of this encoding to speed up transferRegisterDef.
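
To make the encoding concrete, here is a minimal sketch of what a location-aware ID could look like. It is illustrative only (the struct name, field names, bit widths, and helpers are assumptions, not necessarily what the patch uses): the register number occupies the upper bits of a 64-bit ID, so every ID describing a location in one register falls into a single contiguous range.

// Hypothetical sketch of a location-aware VarLoc ID. Field names and bit
// widths are illustrative only.
#include <cstdint>
#include <utility>

struct LocIndexSketch {
  uint32_t Location; // E.g. a register number; 0 could mean "no register".
  uint32_t Index;    // Which VarLoc within that location.

  // Pack into a single 64-bit ID: all IDs for one Location are contiguous.
  uint64_t getAsRawValue() const {
    return (uint64_t(Location) << 32) | Index;
  }

  static LocIndexSketch fromRawValue(uint64_t ID) {
    return {uint32_t(ID >> 32), uint32_t(ID)};
  }

  // Half-open [Start, End) range of raw IDs reserved for one location.
  static std::pair<uint64_t, uint64_t> rawRangeForLocation(uint32_t Location) {
    return {uint64_t(Location) << 32, (uint64_t(Location) + 1) << 32};
  }
};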

There is no fundamental algorithmic change: this patch simply swaps out
SparseBitVector in favor of CoalescingBitVector. That changes iteration
order (hence the test updates), but otherwise this patch is NFCI.

The only interesting change is in transferRegisterDef. Instead of doing:

KillSet = {}
for (ID : OpenRanges.getVarLocs())
  if (DeadRegs.count(ID))
    KillSet.add(ID)

We now do:

KillSet = {}
for (Reg : DeadRegs)
  for (ID : intervalsReservedForReg(Reg, OpenRanges.getVarLocs()))
    KillSet.add(ID)

By no longer visiting every open location at each instruction, we eliminate
some potentially quadratic behavior. The new implementation does roughly a
constant amount of work per instruction, because each dead register needs
only a fast interval map lookup into the coalesced bit vector.
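
As a minimal sketch of that per-register lookup (assuming the illustrative 64-bit encoding above, with the register in the upper 32 bits; the helper name and signature are made up, not the ones in the patch), llvm::CoalescingBitVector::find() jumps straight to the start of each register's ID interval:

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/CoalescingBitVector.h"
#include "llvm/ADT/SmallVector.h"
#include <cstdint>

using VarLocSet = llvm::CoalescingBitVector<uint64_t>;

// Collect the IDs of open variable locations that live in a dead register.
// Because all IDs for one register are contiguous, each register costs one
// find() plus a walk over its own interval, rather than a scan over every
// open location.
static void collectDeadVarLocIDs(const VarLocSet &OpenVarLocs,
                                 llvm::ArrayRef<uint32_t> DeadRegs,
                                 llvm::SmallVectorImpl<uint64_t> &KillSet) {
  for (uint32_t Reg : DeadRegs) {
    uint64_t Start = uint64_t(Reg) << 32;      // First possible ID for Reg.
    uint64_t End = (uint64_t(Reg) + 1) << 32;  // First ID past Reg's range.
    for (auto It = OpenVarLocs.find(Start);
         It != OpenVarLocs.end() && *It < End; ++It)
      KillSet.push_back(*It);
  }
}

// Example setup: the container is backed by an IntervalMap allocator.
//   VarLocSet::Allocator Alloc;
//   VarLocSet OpenVarLocs(Alloc);
//   OpenVarLocs.set((uint64_t(/*Reg=*/5) << 32) | /*Index=*/0);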

For a file in WebKit, this brings the time spent in LiveDebugValues down
from ~2.5 minutes to 4 seconds, reducing compile time spent in that pass
from 28% of the total to just over 1%.

Before:

2.49 min    27.8%   0 s        LiveDebugValues::process
2.41 min    27.0%   5.40 s     LiveDebugValues::transferRegisterDef
1.51 min    16.9%   1.51 min   LiveDebugValues::VarLoc::isDescribedByReg() const
32.73 s      6.1%   8.70 s     llvm::SparseBitVector<128u>::SparseBitVectorIterator::operator++()

After:

4.53 s       1.1%   0 s         LiveDebugValues::process
3.00 s       0.7%   107.00 ms   LiveDebugValues::transferRegisterCopy
892.00 ms    0.2%   406.00 ms   LiveDebugValues::transferSpillOrRestoreInst
404.00 ms    0.1%   32.00 ms    LiveDebugValues::transferRegisterDef
110.00 ms    0.0%   2.00 ms       LiveDebugValues::getUsedRegs
57.00 ms     0.0%   1.00 ms       std::__1::vector<>::push_back
40.00 ms     0.0%   1.00 ms       llvm::CoalescingBitVector<>::find(unsigned long long)

FWIW, I tried the same approach using SparseBitVector, but got bad
results. To do that, I had to extend SparseBitVector to support 64-bit
indices and expose its lower bound operation. The problem is that the
performance becomes very hard to predict: SparseBitVector's lower bound
operation falls back to O(n) linear scans over a std::list unless you are
/very/ careful about managing iteration order. When I profiled this, the
performance looked worse than the baseline.

You can see the full CoalescingBitVector-based implementation here:

https://github.com/vedantk/llvm-project/commits/try-coalescing

You can see the full SparseBitVector-based implementation here:

https://github.com/vedantk/llvm-project/commits/try-sparsebitvec-find

Depends on D74984 and D74985.

Diff Detail

Event Timeline

vsk created this revision. Feb 21 2020, 12:17 PM
Herald added a project: Restricted Project. Feb 21 2020, 12:17 PM
Herald added a subscriber: hiraditya.
vsk updated this revision to Diff 245963. Feb 21 2020, 12:39 PM

Back out changes accidentally introduced in a rebase.

vsk updated this revision to Diff 246092. Feb 22 2020, 5:21 PM
vsk edited the summary of this revision.

Avoid a std::vector allocation in getUsedRegs and shorten the patch summary.

aprantl accepted this revision. Feb 24 2020, 9:00 AM

Those numbers are really impressive.

llvm/lib/CodeGen/LiveDebugValues.cpp:443–444

Index into what? Or perhaps even typedef a LocIndex type?

llvm/lib/CodeGen/LiveDebugValues.cpp:1505

This also *looks* nicer :-)

This revision is now accepted and ready to land. Feb 24 2020, 9:00 AM
vsk updated this revision to Diff 246302. Feb 24 2020, 2:12 PM

Add doxygen comments to VarLocMap.

This revision was automatically updated to reflect the committed changes.
nikic added a subscriber: nikic. May 27 2020, 8:52 AM

While bisecting compile-time regressions, I hit on this change as a larger -O0 -g regression. It's 2% geomean on CTMark, with the largest regression being 4% on sqlite3. Might be worth a look.

vsk added a comment. May 27 2020, 10:44 AM

> While bisecting compile-time regressions, I hit on this change as a larger -O0 -g regression. It's 2% geomean on CTMark, with the largest regression being 4% on sqlite3. Might be worth a look.

@nikic We actually saw a similar result on our CTMark bot, but the regression was fixed by D76465 + D76467. Were you able to measure any improvement after those changes (or, put another way, does reverting D74986 + D76465 + D76467 fix the issue)?

nikic added a comment. May 27 2020, 3:08 PM

In D74986#2057820, @vsk wrote:

> > While bisecting compile-time regressions, I hit on this change as a larger -O0 -g regression. It's 2% geomean on CTMark, with the largest regression being 4% on sqlite3. Might be worth a look.
>
> @nikic We actually saw a similar result on our CTMark bot, but the regression was fixed by D76465 + D76467. Were you able to measure any improvement after those changes (or, put another way, does reverting D74986 + D76465 + D76467 fix the issue)?

Thanks for the references! When I revert these patches in reverse order, I see (for the instructions retired metric) the original regression for sqlite at +4.4%, the CoalescingBitVector optimization as a -1.1% improvement, and the collectIDsForRegs() optimization as a -0.5% improvement. So these two changes seem to recover a sizable part of the regression (about 35%), but not the entirety. As such, there might still be some optimization potential here. (Results for reference -- these are reverts, so inverted.)

vsk added a comment. May 27 2020, 4:25 PM

> In D74986#2057820, @vsk wrote:
>
> > > While bisecting compile-time regressions, I hit on this change as a larger -O0 -g regression. It's 2% geomean on CTMark, with the largest regression being 4% on sqlite3. Might be worth a look.
> >
> > @nikic We actually saw a similar result on our CTMark bot, but the regression was fixed by D76465 + D76467. Were you able to measure any improvement after those changes (or, put another way, does reverting D74986 + D76465 + D76467 fix the issue)?
>
> Thanks for the references! When I revert these patches in reverse order, I see (for the instructions retired metric) the original regression for sqlite at +4.4%, the CoalescingBitVector optimization as a -1.1% improvement, and the collectIDsForRegs() optimization as a -0.5% improvement. So these two changes seem to recover a sizable part of the regression (about 35%), but not the entirety. As such, there might still be some optimization potential here. (Results for reference -- these are reverts, so inverted.)

Interesting; I missed this because I only looked at averaged wall times while investigating. Thanks for sharing this data (the tracker is awesome). Would you mind adding reverts of D74985 + D74633 to nikic/perf/live-debug-values so we can get a complete picture? Those are two NFC preparatory changes for the main data structure change (this patch). I suspect D74633 in particular gave a large speedup: it's possible that this patch causes a slowdown relative to D74633 on small inputs. That could tell us which direction to go. There are a number of compile-time optimizations that are possible with CoalescingBitVector, but if we're slower on small inputs after reverting D74633, maybe we should rethink the data structure instead.

vsk added a comment. May 27 2020, 6:24 PM

@nikic D80684 might help with this as well.