Page MenuHomePhabricator

[DebugInfo] Address performance regression with r364515
Changes PlannedPublic

Authored by jmorse on Jul 12 2019, 6:16 AM.

Details

Summary

Hi; Hot on the heels of D56151 being committed in r364515 to fix PR40010, I wound up reverting it as there were reports of performance regressions on llvm-commits. Building AMDGPUDisassembler.cpp with ASAN enabled showed that register coalescing jumped from three seconds to fourty seconds. This patch cuts down on the performance cost (there's still a little) while preserving the original behaviour. It is, alas, ugly.

I believe the root cause of the performance problem is that DBG_VALUE insts don't get an entry in the SlotIndex map of Insts to Slots. The current code iterates through each DBG_VALUE for a register, getting its slot, then runs a liveness query. However, getting the slot requires a forwards-walk through the block until a non-debug instruction is found. With ASAN, packs of up to 800 DBG_VALUEs in a row appear (for AMDGPUDisassembler.cpp), each of which gets examined, which ends up having quadratic complexity.

The solution is to not lookup slots from DBG_VALUEs, but to instead iterate over slots and check nearby DBG_VALUEs. To this end, I've replaced the per-dbg-value query method with some helper functions that will scan a basic block for unsound DBG_VALUEs, and maintains a copy of the current SlotIndex instead of lookup up for each DBG_VALUE.

This isn't pleasant, and just writing more C++ to fix problems isn't great, but:

  • We now have to query liveness to ensure DBG_VALUE soundness,
  • I imagine that indexing DBG_VALUE -> slots would be either expensive or risk causing codegen changes with -g,
  • We're going to wind up iterating over instructions at some point to fix this.

(Note that r364515 was reverted in r365448, this diff is based on r364515, as opposed to what's on trunk right now, to show the "old" and "new" implementation).

Diff Detail

Event Timeline

jmorse created this revision.Jul 12 2019, 6:16 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 12 2019, 6:16 AM

With ASAN, packs of up to 800 DBG_VALUEs in a row appear (for that file)

That seems excessive; in the past when I've seen this explosion of DBG instructions, the vast majority were redundant. Have you looked at them to see if this is the case? We might rather eliminate duplicates than write code to make it cheaper to have lots of unnecessary instructions.

rnk added a subscriber: rnk.Jul 12 2019, 11:26 AM

With ASAN, packs of up to 800 DBG_VALUEs in a row appear (for that file)

That seems excessive; in the past when I've seen this explosion of DBG instructions, the vast majority were redundant. Have you looked at them to see if this is the case? We might rather eliminate duplicates than write code to make it cheaper to have lots of unnecessary instructions.

dbg.value packs are a major issue, ASan aside. I wrote this proposal a while back and never followed through on it:
http://lists.llvm.org/pipermail/llvm-dev/2018-October/127228.html

With the lack of follow-through on my part, I think we should take Chris's suggestion of changing our representation to multiplex dbg.value, so that one instruction can describe an arbitrary number of variable locations without increasing the number of instructions that other passes have to iterate over.

Happily D58453 killing off a large amount of placeDbgValues activity significantly reduces DBG_VALUE grouping -- I don't have the numbers to hand, but I would say the density was almost an order of magnitude lower. The largest back in the benchmark I referred to was about ~120, and other large packs occurred much less frequently.

Reid wrote:

dbg.value packs are a major issue, ASan aside. I wrote this proposal a while back and never followed through on it:
http://lists.llvm.org/pipermail/llvm-dev/2018-October/127228.html

With the lack of follow-through on my part, I think we should take Chris's suggestion of changing our representation to multiplex dbg.value, so that one instruction can describe an arbitrary number of variable locations without increasing the number of instructions that other passes have to iterate over.

While this review isn't the place, this is definitely an area I'd want to invest time into, debug-info in the instruction stream is a frequent pain.

jmorse edited the summary of this revision. (Show Details)Fri, Oct 25, 12:22 PM

Ping -- this is the last thing (TM) blocking placeDbgValues being removed, more or less. I've edited the summary to make more sense; as a brief recap:

  • D56151 was about dropping variable locations where we couldn't guarantee that RegisterCoalescing would correctly preserve locations,
  • The patch made asan builds extremely slow, and was reverted,
  • This new patch re-implements the idea, avoiding those lookups.

More info in the summary.

aprantl added inline comments.Fri, Oct 25, 3:02 PM
lib/CodeGen/RegisterCoalescer.cpp
1978 ↗(On Diff #209470)

///

2003 ↗(On Diff #209470)

if (MI.isDebugValue()) {

if (!RegIsLive &&MI.getOperand(0).isReg() && MI.getOperand(0).getReg() == Reg) {
}
return;

}
// If not, update current liveness record.
SlotIndex Slot = Slots.getInstructionIndex(MI);

2015 ↗(On Diff #209470)

return is redundant

3362 ↗(On Diff #209470)

///

3391 ↗(On Diff #209470)

This has so much identical boilerplate with CheckDbgValuesInBlockForPhysReg that I wonder if it should be a generalized version that takes a std::function for code that is different?

vsk added a comment.Fri, Oct 25, 6:19 PM

In your summary you mentioned that:

However, getting the slot requires a forwards-walk through the block until a non-debug instruction is found.

Is it possible to just speed that part up? If so, we could just keep the simpler mergingChangesDbgValue implementation.

Possible options include: 1) computing a slot indexes structure for dbg_values on the side or 2) changing the MIR representation (akin to D51664, or introducing DBG_VALUE_PACKs -- @rnk suggested this a few comments up already, but I assume this is a stretch?). It might be nice to do something along these lines, since the approach in this patch looks like it can walk each MBB in the program (twice) per coalesce pair? Here's a hare-brained sketch of (1):

DVNumbers = map of DBG_VALUE to number
DVSlots = map of number interval to Slot
CurSlot = nullptr
DVPackLen = 0
for (index, MI) in enumerate(rpot(MF)):
  if MI is DBG_VALUE:
    DVNumbers[MI] = index
    DVPackLen++
  else:
    CurSlot = slot for the current MI
    if DVPackLen > 0:
      DVSlots[{index - DVPackLen, index}] = CurSlot
      DVPackLen = 0
lib/CodeGen/RegisterCoalescer.cpp
3363 ↗(On Diff #209470)

CheckDbgValuesInBlock looks like it's more about making dead dbg_values undef than about checking them. Let's call it makeDeadDbgValsUndef?

bjope added a subscriber: bjope.Wed, Oct 30, 10:29 AM
bjope added inline comments.Wed, Oct 30, 10:48 AM
lib/CodeGen/RegisterCoalescer.cpp
1978 ↗(On Diff #209470)

Not clear to me, just reading the description and the argument list, to understand if Reg is the physical register or the other register being coalesced.

1982 ↗(On Diff #209470)

I think we want Reg to be of type Register here (at least in the future, but maybe we can get it right from the start).

2004 ↗(On Diff #209470)

Not sure if the old code cared much about it either, but I always wonder if we forget to consider sub registers when I see code that only looks at getReg() but not getSubReg() and not even mentions sub registers in any code comment. But since we are dealing with physical registers, then maybe getSubReg() is out-of-play here (although it is only one side of the coalescing pair that is physical).

jmorse updated this revision to Diff 228222.Thu, Nov 7, 5:12 AM
jmorse marked 8 inline comments as done.

Fold two helpers into one; review comments and formatting

jmorse added a comment.Thu, Nov 7, 5:33 AM

Bjorn wrote inline:

Not sure if the old code cared much about it either, but I always wonder if we forget to consider sub registers when I see code that only looks at getReg() but not getSubReg()

In this circumstance I think it's legitimate to ignore the subregisters -- we're not considering the precise location of a value with a reg/vreg, just whether the merging vregs are live or not. The riskiest circumstance would be a non-live DBG_VALUE-of-subreg being undef'd when a live value was merged into a different subregister within the same virtual register. This is conservative at the least; and I suspect that kind of merging would require an undef anyway, although I'm not overly familiar with subregisters.

Vedant wrote:

[A proposal for a different way of doing this]

I think that'd work, trading some memory for some performance. The approach in this patch seems to be Good Enough (TM) though, I don't observe any performance differences on a clang-3.4 RelWithDebInfo build, and only some fractional increases on the (pathalogical) ASan build. My preference is shipping this, and making DBG_VALUE_PACKs happen sometime soon, as that'll eliminate similar problems elsewhere (PR43855 recently bit me, for example).

vsk added a comment.Fri, Nov 8, 10:45 AM

I guess there are a few alternatives to consider that call for changing data structures. Perhaps it makes more sense to start with the approach taken here to address the performance issue and to keep an eye out for any more problems. BTW @jmorse is this patch still rebased on top of r364515? I don't see mergingChangesDbgValue anymore.

llvm/lib/CodeGen/RegisterCoalescer.cpp
3319–3411

IIUC there's no need to check for the case where Reg is live & OtherLiveness is not, because makeDeadDbgValsUndef is called once for each vreg in a coalesce pair. (Assuming that's correct) maybe that's worth a comment here, or in the function doc.

jmorse updated this revision to Diff 228500.Fri, Nov 8, 11:14 AM

Correctly base patch on prior implementation

Vedant wrote:

I guess there are a few alternatives to consider that call for changing data structures. Perhaps it makes more sense to start with the approach taken here to address the performance issue and to keep an eye out for any more problems.

Indeed, I'd much prefer to design it out; the current scenario isn't ideal for a number of reasons.

BTW @jmorse is this patch still rebased on top of r364515? I don't see mergingChangesDbgValue anymore.

*blinks* ah yeah, the latest update should fix that.

+ bjope & Quentin

I think this looks reasonable, but am not yet familiar enough with RegisterCoalescer to confidently lgtm.

bjope added inline comments.Sat, Nov 9, 6:08 AM
llvm/lib/CodeGen/RegisterCoalescer.cpp
3365

I think it is enough to do this if the next instruction isDebugValue. So it could perhaps speed up things if we only do these liveness calculations for the first MI in a sequence of DebugValue (or DebugInstr) instructions?

One idea would be to keep track of if the previous instruction was a DebugValue, move these calculation into the if (MI.isDebugValue) above, and do it conditionally when the previous instruction wasn't a DebugValue. E.g. by saving the MachineInstr* pointing to the previous instruction (indicating that we should update RegIsLive/OtherIsLive) whenever a MI that isn't isDebugValue is found.

3367

Is it correct to use the RegSlot here. Maybe it should be getDeadSlot() to make sure the liveAt call below will get "live out" from the MI rather than "live in" (although I'm still learning about these slot indices myself)?

jmorse planned changes to this revision.Fri, Nov 15, 9:21 AM

Sitrep on this -- Vedants question about early-exits led to me digging further into the pass, and discovering even more bad assumptions I'd made. I've prototyped something based on Vedants proposal of the other way of doing this; it'll have to wait until next week though.