This is an archive of the discontinued LLVM Phabricator instance.

[WIP][DebugInfo][LiveDebugValues] Index variable location IDs by machine location
AbandonedPublic

Authored by jmorse on Feb 17 2020, 5:28 AM.

Details

Reviewers
None
Summary

This patch isn't intended for review right now, instead for discussion in comparison with D74633.

This patch morphs the "VarMap" map of VarLoc -> ID numbers into a class that also indexes all variable ID numbers by their register / machine location. This speeds up various parts of LiveDebugValues that step over all variable locations in search of a particular machine location, at the expense of additional memory consumption.
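As a rough illustration of the summary above (not the patch itself -- the real VarLoc and VarLocMap are considerably richer), a minimal sketch of handing out IDs while also indexing them by machine location might look like this; the member names and the reduction of a VarLoc to a (variable, register) pair are assumptions for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <tuple>
#include <vector>

// Hypothetical, heavily simplified VarLoc: just (variable, register).
struct VarLoc {
  unsigned Var; // which source variable
  unsigned Reg; // machine location (0 = non-register location)
  bool operator<(const VarLoc &O) const {
    return std::tie(Var, Reg) < std::tie(O.Var, O.Reg);
  }
};

// The VarLocMap idea: hand out stable ID numbers for VarLocs, and in
// addition index those IDs by register, so that "all variable locations
// based in register R" becomes a single lookup instead of a scan over
// every known VarLoc.
class VarLocMap {
  std::map<VarLoc, uint32_t> IDs;               // VarLoc -> ID
  std::vector<VarLoc> Locs;                     // ID -> VarLoc
  std::map<unsigned, std::set<uint32_t>> ByReg; // Reg -> IDs (the new index)

public:
  uint32_t insert(const VarLoc &VL) {
    auto It = IDs.find(VL);
    if (It != IDs.end())
      return It->second; // Already known: return the existing ID.
    uint32_t ID = Locs.size();
    IDs[VL] = ID;
    Locs.push_back(VL);
    ByReg[VL.Reg].insert(ID); // Maintain the per-register index.
    return ID;
  }

  const VarLoc &operator[](uint32_t ID) const { return Locs[ID]; }

  // All variable-location IDs currently based in Reg.
  const std::set<uint32_t> &idsForReg(unsigned Reg) const {
    static const std::set<uint32_t> Empty;
    auto It = ByReg.find(Reg);
    return It == ByReg.end() ? Empty : It->second;
  }
};
```

The extra memory cost mentioned in the summary is the `ByReg` index, which duplicates every ID once.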

Diff Detail

Event Timeline

jmorse created this revision.Feb 17 2020, 5:28 AM
jmorse edited the summary of this revision. (Show Details)Feb 17 2020, 5:30 AM
vsk added a comment.Feb 17 2020, 1:22 PM

Nice. I hacked up a similar version of this patch that defined VarLocMap as DenseMap<unsigned, UniqueVector<VarLoc>>. Similar to here, the idea was to maintain a UniqueVector<VarLoc> for each register. The trick here was to make "ID"s 64-bit, so that they could efficiently encode a (register, index into the UniqueVector for that register) pair. To do a lookup given an ID, we'd break apart the 64-bit ID to grab the UniqueVector for the register (non-register locations were all slapped into the '0'th UniqueVector), then index into the vector. This seems less general than your patch.
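The packed-ID scheme described above isn't in the archive, but the encoding can be sketched as follows; the function names and the 32/32 bit split are assumptions, not the actual prototype:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 64-bit ID packing: the high 32 bits hold the register
// number (0 for all non-register locations), the low 32 bits hold the
// index into that register's UniqueVector<VarLoc>.
inline uint64_t makeID(uint32_t Reg, uint32_t Index) {
  return (uint64_t(Reg) << 32) | Index;
}
inline uint32_t regOfID(uint64_t ID) { return uint32_t(ID >> 32); }
inline uint32_t indexOfID(uint64_t ID) { return uint32_t(ID); }
```

A lookup then splits the ID, grabs the `UniqueVector` for `regOfID(ID)`, and indexes it with `indexOfID(ID)`.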

When I was putting the finishing touches on this, I took a closer look at my profiling run with D74633 applied and saw this:

2.49 min   27.8%   0 s        LiveDebugValues::process
2.41 min   27.0%   5.40 s     LiveDebugValues::transferRegisterDef
1.51 min   16.9%   1.51 min   LiveDebugValues::VarLoc::isDescribedByReg() const
32.73 s     6.1%   8.70 s     llvm::SparseBitVector<128u>::SparseBitVectorIterator::operator++()

It looks like it costs us 33 seconds to iterate over OpenLocs /once/. If we were to do this once per location, I fear performance would regress. Well, at least for this particular WebKit test case I'm looking at, which has truly massive basic blocks.

So, while I think there is potential here, I'm not sure it's architecturally a good fit for transferRegisterDef (although, I readily admit it could be a good fit elsewhere).

What I was thinking of as an alternative (haven't had time to hack this up, might get to it today if my bug queue permits) is to replace our use of SparseBitVector with an IntervalSet. For prototyping purposes I was thinking of implementing this as an llvm::IntervalMap, equipped with set ops like intersectWithComplement, and a fancy iterator that lets us write for (unsigned ID : theIntervalSet). My intuition/expectation is that the pointer chasing in SparseBitVector is causing all kinds of issues: if that's correct, and we fix that first, that makes space for investigating cool alternative layouts for VarLocMap.
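(For the record, what eventually landed is `llvm::CoalescingBitVector`.) A toy sketch of the coalesced-interval idea described above, assuming half-open intervals over unsigned IDs -- all names are hypothetical, and the real implementation builds on `llvm::IntervalMap` rather than `std::map`:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <map>
#include <vector>

// Toy coalesced interval set: stores disjoint, merged half-open
// intervals [start, end). A dense run of IDs collapses into a single
// map entry, so iteration touches one node per run rather than chasing
// one heap-allocated element per 128 bits as SparseBitVector does.
class IntervalSet {
  std::map<unsigned, unsigned> Ivals; // start -> end, disjoint, sorted

public:
  void insert(unsigned ID) { insertRange(ID, ID + 1); }

  void insertRange(unsigned Start, unsigned End) {
    // Back up to the first interval that could overlap or abut [Start, End).
    auto It = Ivals.upper_bound(Start);
    if (It != Ivals.begin() && std::prev(It)->second >= Start)
      --It;
    // Absorb every overlapping or adjacent interval into one.
    while (It != Ivals.end() && It->first <= End) {
      Start = std::min(Start, It->first);
      End = std::max(End, It->second);
      It = Ivals.erase(It);
    }
    Ivals[Start] = End;
  }

  bool contains(unsigned ID) const {
    auto It = Ivals.upper_bound(ID);
    return It != Ivals.begin() && std::prev(It)->second > ID;
  }

  // Enumerate members; a real version would expose an iterator type so
  // callers could write `for (unsigned ID : Set)`.
  std::vector<unsigned> members() const {
    std::vector<unsigned> Out;
    for (const auto &[S, E] : Ivals)
      for (unsigned I = S; I < E; ++I)
        Out.push_back(I);
    return Out;
  }
};
```

Set operations like `intersectWithComplement` would walk the two sorted interval lists in lockstep, again proportional to the number of runs rather than the number of set bits.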

wdyt?

jmorse abandoned this revision.Apr 6 2020, 9:00 AM

Hi Vedant,

Sorry for the complete blank -- the usual distractions and some illness (but not the dreaded lurgi!) are to blame.

For the record, on the massive-blocks-full-of-debug-instrs inputs we've been seeing, this particular patch got LiveDebugValues to about eight seconds from >100, but your coalescing bitvector implementation got it down to 0.3 seconds, many thanks for that!

In the meantime I've been kicking around ideas for different ways of doing LiveDebugValues' job -- I get the feeling that we artificially inflate the N-size of the problem by pairing up variables and registers too early, when we might instead be able to keep the analysis proportionate to the number of DBG_VALUEs in a function. I'm looking at various inputs, but would you be able to point out which parts of WebKit were producing the pathological behaviour? (Assuming it's part of an open-source release.) It'd greatly help in settling on a solution that's acceptable to all.

(I think the number of VarLocs for a function would be even larger than today if these [0, 1] transfer-bailout returns didn't exist; I don't know why they do.)

[0] https://github.com/llvm/llvm-project/blob/ddd2f4b96f9f3967a66e744a98b6ecec25c55de8/llvm/lib/CodeGen/LiveDebugValues.cpp#L1371
[1] https://github.com/llvm/llvm-project/blob/ddd2f4b96f9f3967a66e744a98b6ecec25c55de8/llvm/lib/CodeGen/LiveDebugValues.cpp#L1301

vsk added a comment.Apr 6 2020, 12:11 PM

I'm glad you've recovered!

I was looking at a benchmark for glyph rendering in WebKit. Unfortunately, the file appears to reference headers from an Apple-internal SDK, and I can't find an analogous benchmark in open source. I've stashed the bitcode for now: happy to benchmark future patches against it if requested.

Re: the early-exits [0, 1] you linked to, I don't understand why these exist. It could be a compile-time hack, but I'm not sure about that, because the loop over OpenLocs remains O(n). Do you expect getting rid of the early exits to result in a large increase in VarLocs? I suppose I'd expect a small/modest increase.

jmorse added a comment.Apr 7 2020, 3:03 AM

> I was looking at a benchmark for glyph rendering in WebKit. Unfortunately, the file appears to reference headers from an Apple-internal SDK, and I can't find an analogous benchmark in open source. I've stashed the bitcode for now: happy to benchmark future patches against it if requested.

Righty-ho, I think I've got a reasonable body of benchmarks otherwise.

> Re: the early-exits [0, 1] you linked to, I don't understand why these exist. It could be a compile-time hack, but I'm not sure about that, because the loop over OpenLocs remains O(n). Do you expect getting rid of the early exits to result in a large increase in VarLocs? I suppose I'd expect a small/modest increase.

It's odd: the early return turned up in transferSpillInst in [2] and was duplicated in [3]; possibly it's just an oversight. My fear is that large numbers of inlined variables could be transferred but currently aren't. This is a hunch, though; at some point I'll have numbers to analyse.

[2] https://reviews.llvm.org/D29500
[3] https://reviews.llvm.org/D44016