Page MenuHomePhabricator

[DebugInfo] Address performance regression with r364515
Needs ReviewPublic

Authored by jmorse on Jul 12 2019, 6:16 AM.

Details

Summary

Hi; Hot on the heels of D56151 being committed in r364515 to fix PR40010, I wound up reverting it as there were reports of performance regressions on llvm-commits. Building AMDGPUDisassembler.cpp with ASAN enabled showed that register coalescing jumped from three seconds to fourty seconds. This patch cuts down on the performance cost (there's still a little) while preserving the original behaviour. It is, alas, ugly.

I believe the root cause of the performance problem is that DBG_VALUE insts don't get an entry in the SlotIndex map of Insts to Slots. The current code iterates through each DBG_VALUE for a register, getting its slot, then runs a liveness query. However, getting the slot requires a forwards-walk through the block until a non-debug instruction is found. With ASAN, packs of up to 800 DBG_VALUEs in a row appear (for AMDGPUDisassembler.cpp), each of which gets examined, which ends up having quadratic complexity.

The solution is to not lookup slots from DBG_VALUEs, but to instead iterate over slots and check nearby DBG_VALUEs. To this end, I've replaced the per-dbg-value query method with some helper functions that will scan a basic block for unsound DBG_VALUEs, and maintains a copy of the current SlotIndex instead of lookup up for each DBG_VALUE.

This isn't pleasant, and just writing more C++ to fix problems isn't great, but:

  • We now have to query liveness to ensure DBG_VALUE soundness,
  • I imagine that indexing DBG_VALUE -> slots would be either expensive or risk causing codegen changes with -g,
  • We're going to wind up iterating over instructions at some point to fix this.

(Note that r364515 was reverted in r365448, this diff is based on r364515, as opposed to what's on trunk right now, to show the "old" and "new" implementation).

Diff Detail

Event Timeline

jmorse created this revision.Jul 12 2019, 6:16 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 12 2019, 6:16 AM

With ASAN, packs of up to 800 DBG_VALUEs in a row appear (for that file)

That seems excessive; in the past when I've seen this explosion of DBG instructions, the vast majority were redundant. Have you looked at them to see if this is the case? We might rather eliminate duplicates than write code to make it cheaper to have lots of unnecessary instructions.

rnk added a subscriber: rnk.Jul 12 2019, 11:26 AM

With ASAN, packs of up to 800 DBG_VALUEs in a row appear (for that file)

That seems excessive; in the past when I've seen this explosion of DBG instructions, the vast majority were redundant. Have you looked at them to see if this is the case? We might rather eliminate duplicates than write code to make it cheaper to have lots of unnecessary instructions.

dbg.value packs are a major issue, ASan aside. I wrote this proposal a while back and never followed through on it:
http://lists.llvm.org/pipermail/llvm-dev/2018-October/127228.html

With the lack of follow-through on my part, I think we should take Chris's suggestion of changing our representation to multiplex dbg.value, so that one instruction can describe an arbitrary number of variable locations without increasing the number of instructions that other passes have to iterate over.

Happily D58453 killing off a large amount of placeDbgValues activity significantly reduces DBG_VALUE grouping -- I don't have the numbers to hand, but I would say the density was almost an order of magnitude lower. The largest back in the benchmark I referred to was about ~120, and other large packs occurred much less frequently.

Reid wrote:

dbg.value packs are a major issue, ASan aside. I wrote this proposal a while back and never followed through on it:
http://lists.llvm.org/pipermail/llvm-dev/2018-October/127228.html

With the lack of follow-through on my part, I think we should take Chris's suggestion of changing our representation to multiplex dbg.value, so that one instruction can describe an arbitrary number of variable locations without increasing the number of instructions that other passes have to iterate over.

While this review isn't the place, this is definitely an area I'd want to invest time into, debug-info in the instruction stream is a frequent pain.

jmorse edited the summary of this revision. (Show Details)Fri, Oct 25, 12:22 PM

Ping -- this is the last thing (TM) blocking placeDbgValues being removed, more or less. I've edited the summary to make more sense; as a brief recap:

  • D56151 was about dropping variable locations where we couldn't guarantee that RegisterCoalescing would correctly preserve locations,
  • The patch made asan builds extremely slow, and was reverted,
  • This new patch re-implements the idea, avoiding those lookups.

More info in the summary.

aprantl added inline comments.Fri, Oct 25, 3:02 PM
lib/CodeGen/RegisterCoalescer.cpp
1978

///

2003

if (MI.isDebugValue()) {

if (!RegIsLive &&MI.getOperand(0).isReg() && MI.getOperand(0).getReg() == Reg) {
}
return;

}
// If not, update current liveness record.
SlotIndex Slot = Slots.getInstructionIndex(MI);

2015

return is redundant

3362

///

3391

This has so much identical boilerplate with CheckDbgValuesInBlockForPhysReg that I wonder if it should be a generalized version that takes a std::function for code that is different?

vsk added a comment.Fri, Oct 25, 6:19 PM

In your summary you mentioned that:

However, getting the slot requires a forwards-walk through the block until a non-debug instruction is found.

Is it possible to just speed that part up? If so, we could just keep the simpler mergingChangesDbgValue implementation.

Possible options include: 1) computing a slot indexes structure for dbg_values on the side or 2) changing the MIR representation (akin to D51664, or introducing DBG_VALUE_PACKs -- @rnk suggested this a few comments up already, but I assume this is a stretch?). It might be nice to do something along these lines, since the approach in this patch looks like it can walk each MBB in the program (twice) per coalesce pair? Here's a hare-brained sketch of (1):

DVNumbers = map of DBG_VALUE to number
DVSlots = map of number interval to Slot
CurSlot = nullptr
DVPackLen = 0
for (index, MI) in enumerate(rpot(MF)):
  if MI is DBG_VALUE:
    DVNumbers[MI] = index
    DVPackLen++
  else:
    CurSlot = slot for the current MI
    if DVPackLen > 0:
      DVSlots[{index - DVPackLen, index}] = CurSlot
      DVPackLen = 0
lib/CodeGen/RegisterCoalescer.cpp
3363

CheckDbgValuesInBlock looks like it's more about making dead dbg_values undef than about checking them. Let's call it makeDeadDbgValsUndef?

bjope added a subscriber: bjope.Wed, Oct 30, 10:29 AM
bjope added inline comments.Wed, Oct 30, 10:48 AM
lib/CodeGen/RegisterCoalescer.cpp
1978

Not clear to me, just reading the description and the argument list, to understand if Reg is the physical register or the other register being coalesced.

1982

I think we want Reg to be of type Register here (at least in the future, but maybe we can get it right from the start).

2004

Not sure if the old code cared much about it either, but I always wonder if we forget to consider sub registers when I see code that only looks at getReg() but not getSubReg() and not even mentions sub registers in any code comment. But since we are dealing with physical registers, then maybe getSubReg() is out-of-play here (although it is only one side of the coalescing pair that is physical).

jmorse updated this revision to Diff 228222.Thu, Nov 7, 5:12 AM
jmorse marked 8 inline comments as done.

Fold two helpers into one; review comments and formatting

jmorse added a comment.Thu, Nov 7, 5:33 AM

Bjorn wrote inline:

Not sure if the old code cared much about it either, but I always wonder if we forget to consider sub registers when I see code that only looks at getReg() but not getSubReg()

In this circumstance I think it's legitimate to ignore the subregisters -- we're not considering the precise location of a value with a reg/vreg, just whether the merging vregs are live or not. The riskiest circumstance would be a non-live DBG_VALUE-of-subreg being undef'd when a live value was merged into a different subregister within the same virtual register. This is conservative at the least; and I suspect that kind of merging would require an undef anyway, although I'm not overly familiar with subregisters.

Vedant wrote:

[A proposal for a different way of doing this]

I think that'd work, trading some memory for some performance. The approach in this patch seems to be Good Enough (TM) though, I don't observe any performance differences on a clang-3.4 RelWithDebInfo build, and only some fractional increases on the (pathalogical) ASan build. My preference is shipping this, and making DBG_VALUE_PACKs happen sometime soon, as that'll eliminate similar problems elsewhere (PR43855 recently bit me, for example).

vsk added a comment.Fri, Nov 8, 10:45 AM

I guess there are a few alternatives to consider that call for changing data structures. Perhaps it makes more sense to start with the approach taken here to address the performance issue and to keep an eye out for any more problems. BTW @jmorse is this patch still rebased on top of r364515? I don't see mergingChangesDbgValue anymore.

llvm/lib/CodeGen/RegisterCoalescer.cpp
3338 ↗(On Diff #228222)

IIUC there's no need to check for the case where Reg is live & OtherLiveness is not, because makeDeadDbgValsUndef is called once for each vreg in a coalesce pair. (Assuming that's correct) maybe that's worth a comment here, or in the function doc.

jmorse updated this revision to Diff 228500.Fri, Nov 8, 11:14 AM

Correctly base patch on prior implementation

Vedant wrote:

I guess there are a few alternatives to consider that call for changing data structures. Perhaps it makes more sense to start with the approach taken here to address the performance issue and to keep an eye out for any more problems.

Indeed, I'd much prefer to design it out; the current scenario isn't ideal for a number of reasons.

BTW @jmorse is this patch still rebased on top of r364515? I don't see mergingChangesDbgValue anymore.

*blinks* ah yeah, the latest update should fix that.

+ bjope & Quentin

I think this looks reasonable, but am not yet familiar enough with RegisterCoalescer to confidently lgtm.

bjope added inline comments.Sat, Nov 9, 6:08 AM
llvm/lib/CodeGen/RegisterCoalescer.cpp
3365 ↗(On Diff #228500)

I think it is enough to do this if the next instruction isDebugValue. So it could perhaps speed up things if we only do these liveness calculations for the first MI in a sequence of DebugValue (or DebugInstr) instructions?

One idea would be to keep track of if the previous instruction was a DebugValue, move these calculation into the if (MI.isDebugValue) above, and do it conditionally when the previous instruction wasn't a DebugValue. E.g. by saving the MachineInstr* pointing to the previous instruction (indicating that we should update RegIsLive/OtherIsLive) whenever a MI that isn't isDebugValue is found.

3367 ↗(On Diff #228500)

Is it correct to use the RegSlot here. Maybe it should be getDeadSlot() to make sure the liveAt call below will get "live out" from the MI rather than "live in" (although I'm still learning about these slot indices myself)?

jmorse planned changes to this revision.Fri, Nov 15, 9:21 AM

Sitrep on this -- Vedants question about early-exits led to me digging further into the pass, and discovering even more bad assumptions I'd made. I've prototyped something based on Vedants proposal of the other way of doing this; it'll have to wait until next week though.

jmorse marked an inline comment as done.Mon, Nov 18, 12:50 PM
jmorse added inline comments.
llvm/lib/CodeGen/RegisterCoalescer.cpp
3338 ↗(On Diff #228222)

For completeness: the original intention here / site of this comment, was to detect an early exit. The assumption was that the register coalescer doesn't merge overlapping live ranges; and so if a block was completely covered by a live range, we could assume no invalid coalescing could occur.

Digging into that however, it turns out the coalescer really does merge overlapping live ranges, which is great! But not for this implementation. New one up in a few moments; alas it's another redo :/

jmorse updated this revision to Diff 229915.Mon, Nov 18, 1:35 PM
jmorse added a subscriber: andreadb.

Hokay, here's another take on this problem. To recap, all my previous implementations were broken because of

a) performance problems looking up slot indexes for DBG_VALUEs, and
b) I wasn't aware that the coalescer will merge overlapping live ranges.

which I'll address in order below.

This implementation is inspired by Vedants sketch -- we build a data structure for looking up DBG_VALUEs by slot index quickly. For this (bear with me) I've mapped VRegs -> a set of {SlotIndex,MachineInstr*} pairs. In the body of checkMergingChangesDbgValuesImpl, we can then use the set-order to simultaneously advance through:

  • The live ranges of the VReg being merged, and
  • The set of all DBG_VALUEs for the other VReg.

to identify those DBG_VALUEs at risk of unsound merging. This avoids having to perform a slot index lookup at all, at the cost of stepping through valid DBG_VALUEs in the process. For an asan build of clang-3.4 I don't observe any change in build time with this change. For the previous worst-case file, AMDGPUDisassembler.cpp on trunk/master, the register coalescing pass increases from 2.9 seconds to roughly 3.5 seconds (out of an overall compile time of 27 seconds). This is, IMO, quite good given that it's dealing with pathological conditions (the packs of 800 DBG_VALUEs in a row).

For the correctness issue, observe the ShouldUndef lambda in the patch. This identifies when overlapping ranges have been merged, and examines RegisterCoalescers record of how it resolved the conflict (details in the comment). This probably wants attention from people who know the register allocator/coalescer -- the aim is to ensure DBG_VALUEs, which don't contribute to liveness, do not refer to a different live value-number after coalescing. We don't care if they refer to a non-live vreg. (Paging @andreadb , this is what I was going to mention, a uh, while ago).

Running through llvm-dwarfdump --statistics, the same as before, a tiny fraction of variables go missing (<0.02%). This is fine IMO, as we're dropping locations that are broken.

Rough edges:

  • It's not great to make JoinVals and RegisterCoalescer friends; I'll introduce an accessor for examining conflict resolutions later, but I've burnt out of time today.
  • I haven't made an attempt to address vregs being merged with physregs in this patch -- if this seems to be going in the right direction, I'll extend it to that.