This is an archive of the discontinued LLVM Phabricator instance.

[MachineCombiner] Update instruction depths incrementally for large BBs.
ClosedPublic

Authored by fhahn on Aug 11 2017, 8:47 AM.

Details

Summary

For large basic blocks with lots of combinable instructions, the
MachineTraceMetrics computations in MachineCombiner can dominate the compile
time, as computing the trace information is quadratic in the number of
instructions in a BB and it's relevant successors/predecessors.

In most cases, knowing the instruction depth should be enough to make
combination decisions. As we already iterate over all instructions in a basic
block, the instruction depth can be computed incrementally. This reduces the
cost of machine-combine drastically in cases where lots of instructions
are combined. The major drawback is that AFAIK, computing the critical path
length cannot be done incrementally. Therefore we only compute
instruction depths incrementally, for basic blocks with more
instructions than inc_threshold. The -machine-combiner-inc-threshold
option can be used to set the threshold and allows for easier
experimenting and checking if using incremental updates for all basic
blocks has any impact on the performance.

Diff Detail

Event Timeline

fhahn created this revision.Aug 11 2017, 8:47 AM

For reference, I also ran into this problem at one point, and came up with a hack to just turn off trace-based optimizations for long basic blocks. This implementation looks nicer. :)

fhahn updated this revision to Diff 111016.Aug 14 2017, 9:54 AM
fhahn retitled this revision from [MachineCombine] Update instruction depths incrementally. to [MachineCombiner] Update instruction depths incrementally for large BBs..
fhahn edited the summary of this revision. (Show Details)
fhahn added a reviewer: efriedma.

Thanks for your feedback Eli. I've update the patch to only use incremental updates if the size of a BB is over a certain threshold. I've chosen 500 instructions for now, but a slightly higher one would probably be OK too.

I've also updated the patch to take the LiveRegUnits into account when doing incremental updates. Also, the code to compute the incremental info is only executed if we found viable patterns.

Gerolf edited edge metadata.Aug 21 2017, 2:17 AM

Hi Florian

This looks promising! Thanks for working on this nagging problem spot. More out of curiosity: Could you also share the data for your compile-time improvements?

Cheers
Gerolf

lib/CodeGen/MachineCombiner.cpp
273–274

What happened to the comment?

289

SlackIsAccurate is checked twice for OldCycleCount?! I think the old scheme is fine which just dumps the components of new and old root latencies. Perhaps only add whether or not SlackIsAccurate is set?

454–458

Typos (eg. Wwe -> We) and dups ('the first time'). What does 'relatively accurate' mean?

fhahn updated this revision to Diff 111958.Aug 21 2017, 5:29 AM
fhahn marked an inline comment as done.

Thanks for having a look!
Address review comments.

As for data for your compile-time improvements: I was looking into compile-time outliers for a proprietary benchmark suite, so unfortunately I cannot share the source. There was a module with a few large basic block (one with ~35000 instructions and a couple with > 3000 instructions), with lots of combinable instructions on AArch64 . Without this patch, time-passes on llc shows

Total Execution Time: 18.2227 seconds (18.2513 wall clock)

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
14.0113 ( 77.4%)   0.0338 ( 28.5%)  14.0451 ( 77.1%)  14.0672 ( 77.1%)  Machine InstCombiner
 2.0149 ( 11.1%)   0.0552 ( 46.5%)   2.0701 ( 11.4%)   2.0726 ( 11.4%)  Greedy Register Allocator
 0.9370 (  5.2%)   0.0184 ( 15.5%)   0.9554 (  5.2%)   0.9577 (  5.2%)  AArch64 Instruction Selection
 0.3403 (  1.9%)   0.0010 (  0.8%)   0.3413 (  1.9%)   0.3417 (  1.9%)  Machine Instruction Scheduler

With this patch

Total Execution Time: 4.9057 seconds (4.9158 wall clock)

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
 2.0272 ( 42.1%)   0.0612 ( 64.8%)   2.0884 ( 42.6%)   2.0931 ( 42.6%)  Greedy Register Allocator
 0.9621 ( 20.0%)   0.0190 ( 20.1%)   0.9810 ( 20.0%)   0.9829 ( 20.0%)  AArch64 Instruction Selection
 0.6782 ( 14.1%)   0.0021 (  2.2%)   0.6803 ( 13.9%)   0.6812 ( 13.9%)  Machine InstCombiner
lib/CodeGen/MachineCombiner.cpp
273–274

I've added the comment again. Removing the comment was a left-over from an earlier iteration of the patch.

454–458

I am not sure how to put that. updateDepth() uses the live register info in RegUnits to track dependencies of physical registers. In the following case, the depths could be wrong (R1 being a physical register):

R1 <- I0 ....
R2 <- I1 R1
R1 <- I2 ...
.. <- I3 R2

If we replace I1 and I2 with ... <- I4 R1, when we are updating the depth of I4, RegUnits would contain something like (R1 defined by I2), and it would use that dependency for the depth calculation. But when the MachineCombiner is run, combined instructions should still use virtual registers. At least for the AArch64, it seems like physical registers are used only for argument passing at this stage.

Thank you for sharing your numbers. I was hoping you had that show-off case.
I just have a few minor comments about readability and asserts. I still need bit more time to convince myself that all the pieces fit together. I was wondering if you could add a test that dumps the depths w/ and w/o the incremental updates (compile w/ full comination + dumps depths, compile with a threashold of 1 + dump depths, diff the dump)?

Thanks
Gerolf

lib/CodeGen/MachineCombiner.cpp
351

Please add comments what the parameters mean. More parameters make it harder to read now.

362

The assumption is that LastUpdate and End are never Null. Would it make sense to document this with an assertion.

362

The loops under IncrementalUpdate in this function could be wrapped in a function (with start and end iterators as parameters).

363–368

Would it make sense for the RegUnit update be part of eraseFromParent...(),? E.g. have a bool parameter for that function indicating it?

fhahn updated this revision to Diff 113156.Aug 29 2017, 2:35 PM
fhahn marked an inline comment as done.

I really appreciate you taking time to review this change!

As a sanity check for the incremental update I did the following:

  1. check-llvm-codegen passes with only using incremental depth computation (the full trace is computed before the main loop and IncremenalUpdate is initialized with true.
  1. As you suggested, I compiled the MultiSource/Benchmarks programs from the LLVM test suite with 2 version of LLVM, one only using incremental depth computations and the other only using the existing mechanics. I printed the depths of all instructions of changed basic blocks and compared them. The depths are the same for both versions, with a few exceptions. The minor differences in heights are caused by nodes that depend on the depths of instructions in previous basic blocks (PHI nodes). In a few cases, the previous basic block has been invalidated, but is not in the set of basic blocks for which the depths are re-computed in getTrace(). If we do not invalidate the trace, all depths match. Maybe it would be worth to only invalidate the traces for all basic block we changed, at the end of MachineCombiner::runOnMachineFunction
lib/CodeGen/MachineCombiner.cpp
362

I think I am missing something about the assertion, but neither LatestUpdate nor End are pointers.

I've added an updateDepths function

363–368

I think updating RegUnits is only relevant for the incremental updates here and I am not sure if it is worth making Instruction::eraseFromParent... more compilcated.

fhahn added a comment.Sep 5 2017, 2:24 PM

ping. @Gerolf did my last comments answer your questions adequately?

Gerolf added a comment.Sep 5 2017, 7:04 PM

LGTM.

I added a few minor remarks that you might want to consider before commit. Again, thanks for working on this and your patience!

Cheers
Gerolf

lib/CodeGen/MachineCombiner.cpp
373

You could call updateDepths here as well. Also if you had code like if (IncrementalUpdate) MinInstr->updateDepths(...) else MinInstr->invalidate(MBB) you could spare two parameters (at the cost of adding such a line after each call). It feels more natural to me than packing the code inside the insertDelete...() procedure.

383

LastUpdate = BlockIter like in the code below?

fhahn added a comment.Sep 6 2017, 10:20 AM

Thank you very much for having a look! I'll address your comments tomorrow and go ahead committing the patch. Also, this patch depends on D36696, which is a NFC that moves the code to update the instruction depths to a separate function (in MachineTraceMetrics.cpp). Pending further objections, I'll commit this patch as well tomorrow.

fhahn updated this revision to Diff 114163.Sep 7 2017, 5:34 AM
fhahn marked an inline comment as done.
fhahn added inline comments.
lib/CodeGen/MachineCombiner.cpp
373

This is slightly more difficult, I think, as MachineBasicBlock::iterator iterates over instruction references while InsInstrs contains instruction pointers. I'll leave it as it is for now and will think a bit more if we can make this nicer.

I think it is slightly safer to update the state of the trace here, as this should always be done if we modify the basic block.

fhahn accepted this revision.Sep 7 2017, 5:50 AM

Accepted in Phabricator to make arc happy.

This revision is now accepted and ready to land.Sep 7 2017, 5:50 AM
fhahn closed this revision.Sep 7 2017, 5:50 AM