Hi; Hot on the heels of D56151 being committed in r364515 to fix PR40010, I wound up reverting it as there were reports of performance regressions on llvm-commits. Building AMDGPUDisassembler.cpp with ASAN enabled showed that register coalescing jumped from three seconds to fourty seconds. This patch cuts down on the performance cost (there's still a little) while preserving the original behaviour. It is, alas, ugly.
I believe the root cause of the performance problem is that DBG_VALUE insts don't get an entry in the SlotIndex map of Insts to Slots. The current code iterates through each DBG_VALUE for a register, getting its slot, then runs a liveness query. However, getting the slot requires a forwards-walk through the block until a non-debug instruction is found. With ASAN, packs of up to 800 DBG_VALUEs in a row appear (for AMDGPUDisassembler.cpp), each of which gets examined, which ends up having quadratic complexity.
The solution is to not lookup slots from DBG_VALUEs, but to instead iterate over slots and check nearby DBG_VALUEs. To this end, I've replaced the per-dbg-value query method with some helper functions that will scan a basic block for unsound DBG_VALUEs, and maintains a copy of the current SlotIndex instead of lookup up for each DBG_VALUE.
This isn't pleasant, and just writing more C++ to fix problems isn't great, but:
- We now have to query liveness to ensure DBG_VALUE soundness,
- I imagine that indexing DBG_VALUE -> slots would be either expensive or risk causing codegen changes with -g,
- We're going to wind up iterating over instructions at some point to fix this.
(Note that r364515 was reverted in r365448, this diff is based on r364515, as opposed to what's on trunk right now, to show the "old" and "new" implementation).
This sets of my "this looks expensive" alarm.
Could this be more efficiently be replaced by a DenseMap or a sorted std::vector + std::lower_bound(), or is this the best choice here? Similarly, should the std::set be something more compact?