This is an archive of the discontinued LLVM Phabricator instance.

[mlir] Move the Operation OperandStorage to the first trailing object
ClosedPublic

Authored by rriddle on Oct 13 2021, 12:20 AM.

Details

Summary

The main benefits of this change are faster access to operands
(no need to compute the offset, as it is now right after the
operation), simpler code(no need to manage a lot of the "is the
operand storage trailing" logic we had to before). The major
downside to this though, is that operand holding operations now
grow in size by 1 word (as no matter how we do this change, there
will need to be some additional book keeping).

Diff Detail

Event Timeline

rriddle created this revision.Oct 13 2021, 12:20 AM
rriddle requested review of this revision.Oct 13 2021, 12:20 AM

nice, I'll take a look at this this weekend, thank you for sharing it!

Cool, thank you for this patch River. My ability to look at this was a little overoptimistic, but I applied this now and gave it a try, it shows pretty much exactly what I was expecting.

Before (total time on all threads + wall time):

194.4689 (100.0%)   72.7334 (100.0%)  Total

After:

188.3475 (100.0%)   70.4277 (100.0%)  Total

Which is about a 3% speedup.

lattner added a comment.EditedOct 30 2021, 9:50 PM

That said, I'd recommend a minor change. Rearranging the fields in OperandStorage and making capacity be the compressed one instead of numOperands provides a further speedup:

176.2160 (100.0%)   67.5828 (100.0%)  Total

The field order looks like this:

/// The total capacity number of operands that the storage can hold.
unsigned capacity : 31;
/// A flag indicating if the operand storage was dynamically allocated, as
/// opposed to inlined into the owning operation.
unsigned isStorageDynamic : 1;
/// The number of operands within the storage.
unsigned numOperands;
/// A pointer to the operand storage.
OpOperand *operandStorage;

This benchmark is a macro benchmark running firtool on a large .fir design (parsing 589MB of .fir file, and emitting 307MB of verilog) on an 8 core intel MBP. The full timing output is:

===-------------------------------------------------------------------------===
                         ... Execution time report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 67.5828 seconds

  ----User Time----  ----Wall Time----  ----Name----
    7.2946 (  4.1%)    7.2946 ( 10.8%)  FIR Parser
  120.3814 ( 68.3%)   44.2111 ( 65.4%)  'firrtl.circuit' Pipeline
   22.7650 ( 12.9%)    2.0119 (  3.0%)    'firrtl.module' Pipeline
   19.0680 ( 10.8%)    1.8196 (  2.7%)      CSE
    0.0207 (  0.0%)    0.0032 (  0.0%)        (A) DominanceInfo
    3.6656 (  2.1%)    0.3209 (  0.5%)      LowerCHIRRTL
    5.8990 (  3.3%)    5.8990 (  8.7%)    InferWidths
    3.1869 (  1.8%)    3.1869 (  4.7%)    InferResets
    0.3583 (  0.2%)    0.3583 (  0.5%)      (A) circt::firrtl::InstanceGraph
    0.3518 (  0.2%)    0.3518 (  0.5%)    PrefixModules
   11.5896 (  6.6%)   11.5896 ( 17.1%)    LowerFIRRTLTypes
   50.5189 ( 28.7%)    8.7349 ( 12.9%)    'firrtl.module' Pipeline
   19.0035 ( 10.8%)    3.6512 (  5.4%)      ExpandWhens
   31.4828 ( 17.9%)    5.0830 (  7.5%)      Canonicalizer
    1.2106 (  0.7%)    1.2106 (  1.8%)    Inliner
    7.1370 (  4.1%)    7.1370 ( 10.6%)    IMConstProp
    0.4026 (  0.2%)    0.4026 (  0.6%)      (A) circt::firrtl::InstanceGraph
    0.0018 (  0.0%)    0.0018 (  0.0%)    BlackBoxReader
   13.5857 (  7.7%)    2.1557 (  3.2%)    'firrtl.module' Pipeline
   13.5538 (  7.7%)    2.1522 (  3.2%)      Canonicalizer
    0.7672 (  0.4%)    0.7672 (  1.1%)    CreateSiFiveMetadata
    1.1527 (  0.7%)    1.1527 (  1.7%)    EmitOMIR
    0.3725 (  0.2%)    0.3725 (  0.6%)      (A) circt::firrtl::InstanceGraph
    3.7628 (  2.1%)    3.7628 (  5.6%)  LowerFIRRTLToHW
    0.6715 (  0.4%)    0.6715 (  1.0%)  HWMemSimImpl
    2.8777 (  1.6%)    2.8777 (  4.3%)  SVExtractTestCode
   34.3632 ( 19.5%)    3.5022 (  5.2%)  'hw.module' Pipeline
    0.6692 (  0.4%)    0.0587 (  0.1%)    HWCleanup
   11.2831 (  6.4%)    1.1087 (  1.6%)    CSE
    0.0390 (  0.0%)    0.0059 (  0.0%)      (A) DominanceInfo
   18.6762 ( 10.6%)    2.1384 (  3.2%)    Canonicalizer
    0.1391 (  0.1%)    0.0152 (  0.0%)    HWLegalizeModules
    3.3793 (  1.9%)    0.2665 (  0.4%)    PrettifyVerilog
    5.1930 (  2.9%)    5.1930 (  7.7%)  ExportVerilog
    0.0595 (  0.0%)    0.0595 (  0.1%)  Rest
  176.2160 (100.0%)   67.5828 (100.0%)  Total

I think we should take this patch, with the revisions to OperandStorage.

FWIW, I also just experimented with removing OperandStorage from the TrailingResults and make it be an unconditional member of Operation. This simplifies the generated machine code by eliminating "hasOperandStorage" and all the branches on it.

Doing this shows an extremely tiny speedup (67.4075s) but since it increases memory usage for commonly used operations, I don't think we should eliminate it.

lattner accepted this revision.Oct 30 2021, 10:17 PM
lattner added inline comments.
mlir/include/mlir/IR/OperationSupport.h
567

Please change these fields to:

/// The total capacity number of operands that the storage can hold.
unsigned capacity : 31;
/// A flag indicating if the operand storage was dynamically allocated, as
/// opposed to inlined into the owning operation.
unsigned isStorageDynamic : 1;
/// The number of operands within the storage.
unsigned numOperands;
/// A pointer to the operand storage.
OpOperand *operandStorage;

This shows a further speedup on top of this patch because "numOperands" is frequently accessed and capacity is not.

This revision is now accepted and ready to land.Oct 30 2021, 10:17 PM

Actually, wait, this isn't done yet: the OperandRange::OperandRange(Operation*) convenience constructor is still 2% of the profile because it is out-of-line. Please remove it (from the .h file and the OperationSupport.cpp file) as it is trivial and only used in one place. Please change Operation::getOperands() to be defined as:

   /// Returns an iterator on the underlying Value's.
-  operand_range getOperands() { return operand_range(this); }
+  operand_range getOperands() {
+    return operand_range(getOpOperands().data(), getNumOperands());
+  }

or equivalent. This change further reduces time from 67.5s to 66.39s!

rriddle updated this revision to Diff 384180.Nov 2 2021, 12:10 PM
rriddle marked an inline comment as done.

resolve comments

lattner added inline comments.Nov 2 2021, 1:39 PM
mlir/include/mlir/IR/OperationSupport.h
864–865

Is this hunk related?

mlir/lib/IR/Operation.cpp
22

I think something unrelated snuck into this patch set.

651–652

FYI, This looks unrelated to the patch.

rriddle added inline comments.Nov 2 2021, 1:40 PM
mlir/lib/IR/Operation.cpp
651–652

I'm not seeing any of these hunks from my view, weird phab thing?

Yes, looks like it was showing me stuff that came in in your merge. The patch lgtm, thanks!

lattner accepted this revision.Nov 2 2021, 1:42 PM
rriddle added inline comments.Nov 2 2021, 1:42 PM
mlir/lib/IR/Operation.cpp
651–652

Oh I see what it is. You are viewing the diff between patches, not base (I only ever look at the diff from base). I've rebased since the last patch.

woo hoo, great work River!