Page MenuHomePhabricator

[GlobalISel] Add MachineInstNumbering to CSEInfo and propagate CSE throughout AArch64 pipeline.
Needs ReviewPublic

Authored by aemerson on Jan 7 2021, 1:52 PM.

Details

Summary

MachineInstNumbering is a wrapper around SlotIndexes, which exposes an API to allow fast dominance checks in the CSEMIRBuilder, and to speed up intra-block instruction sinking in the Localizer.

The numbering is stored within the CSEInfo analysis. This change also sets the AArch64 and generic GlobalISel passes to preserve the CSEInfo, and therefore use the CSEMIRBuilder within the RegBankSelect and other combiner passes.

Overall this seems to impact compile time negatively by about ~0.5% on -O0 and -Os but prevents us from having unreasonably long compile times in the worst cases.

Diff Detail

Event Timeline

aemerson created this revision.Jan 7 2021, 1:52 PM
aemerson requested review of this revision.Jan 7 2021, 1:52 PM
aemerson updated this revision to Diff 316325.Jan 12 2021, 10:46 PM

Improve the performance of the insertion queue flushing.

For huge basic blocks, this was having hanging in the renumbering. This new version now tries to not flush on removal of an instruction.

The second issue was that in those huge blocks, any time there was multiple insertions, we'd end up renumbering the entire block. For such blocks, this was way too slow. I've added a more intelligent flushing algorithm to instead search up from each instruction in the block's queue, so that we find an instruction that allows us to use the SlotIndexes to do a fast comparison of ordering. This lets us find the begin iterator for a numbering repair, and likewise walk down each queued instruction until we find the post dominator instruction that gives us an "end" iterator.

These two improvements don't really help on CTMark, they only help on very large pathological blocks.

I think it may be worth exploring using some heuristics early in the pipeline like IRTranslation to detect huge blocks, and switch to using this numbering analysis. Otherwise, we can continue to use the simple linked list block walk, which is slightly faster in the common cases.

aemerson updated this revision to Diff 316428.Jan 13 2021, 9:26 AM

Fix clang-format warnings.

How big is a huge block?

How big is a huge block?

On the order of millions of instructions.

paquette added inline comments.Mon, Feb 22, 10:01 AM
llvm/include/llvm/CodeGen/GlobalISel/CSEInfo.h
109

Maybe good to note that this also removes MI from the queue if it is present?

llvm/include/llvm/CodeGen/GlobalISel/MachineIRBuilder.h
372

Why is this necessary?

llvm/lib/CodeGen/GlobalISel/CSEInfo.cpp
215

Does this bug have a FIXME or something somewhere?

356

I think you can remove the check for CurrInst->isDebugInstr() if you use skipDebugInstructionsBackward.

380

This function and findEarlierNonQueuedInst are the same aside from the direction they iterate in.

Is it possible to refactor them somehow so they can share more code?

findEarlierNonQueuedInst looks like it's the same loop if you use a reverse_iterator.

380

I think you can remove the check for CurrInst->isDebugInstr() if you use skipDebugInstructionsForward.

Could you split large basic blocks into smaller ones?

I think legalizing/combining will do way, way more insertions than would happen in regalloc. Is this losing time to renumberings? Would it make sense to increase the increment size?

aemerson added a comment.EditedMon, Feb 22, 2:02 PM

I think legalizing/combining will do way, way more insertions than would happen in regalloc. Is this losing time to renumberings?

On average, this does cost more in renumbering/flushing than it saves. It only provides a net benefit on the pathological cases in legalization & localization.

Would it make sense to increase the increment size?

What do you mean by increment size?

Could you split large basic blocks into smaller ones?

The issue does get resolved if we keep the constants entry block separate after IR translation, instead of merging them. However @qcolombet believes that's not the right approach.

I think legalizing/combining will do way, way more insertions than would happen in regalloc. Is this losing time to renumberings?

On average, this does cost more in renumbering/flushing than it saves. It only provides a net benefit on the pathological cases in legalization & localization.

Would it make sense to increase the increment size?

What do you mean by increment size?

The instruction index values initially increment by 16 so you only renumber for every 15 or so insertions between instructions:

enum {
  /// The default distance between instructions as returned by distance().
  /// This may vary as instructions are inserted and removed.
  InstrDist = 4 * Slot_Count
};