The main benefits of this change are faster access to operands (there is no longer a need to compute an offset, because the operand storage now sits right after the operation) and simpler code (much of the "is the operand storage trailing" logic we needed before goes away). The major downside is that operand-holding operations now grow in size by one word, since no matter how the change is made, some additional bookkeeping is required.
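To make the layout concrete, here is a rough sketch of the idea (the type names below are illustrative stand-ins, not the actual mlir::Operation / OperandStorage declarations, and the real layout differs in its details):

```cpp
// Simplified sketch only; the real classes live in mlir/IR/Operation.h and
// mlir/IR/OperationSupport.h.
struct OpOperandSketch {
  void *value; // stand-in for the IR value this operand points at
};

struct OperandStorageSketch {
  unsigned capacity : 31;          // how many operands fit in the storage
  unsigned isStorageDynamic : 1;   // heap-allocated vs. inlined storage
  unsigned numOperands;            // operands currently in use
  OpOperandSketch *operandStorage; // the operand array; roughly the extra
                                   // word of bookkeeping mentioned above
};

struct OperationSketch {
  // ... fixed header fields (location, name, attributes, ...) ...
  // The operand storage sits right here, at a compile-time-known offset,
  // instead of behind a chain of variably-sized trailing objects.
  OperandStorageSketch operands;
  // Results, regions, successors, etc. would follow.
};
```

With the storage at a fixed offset, finding an operand becomes a constant-offset load plus an index, rather than a walk over the trailing-object layout.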
Event Timeline
Cool, thank you for this patch, River. My estimate of when I could look at this was a little overoptimistic, but I applied it now and gave it a try; it shows pretty much exactly what I was expecting.
Before (total time on all threads + wall time):
194.4689 (100.0%) 72.7334 (100.0%) Total
After:
188.3475 (100.0%) 70.4277 (100.0%) Total
Which is about a 3% speedup.
That said, I'd recommend a minor change: rearranging the fields in OperandStorage and making capacity the bitfield-compressed member instead of numOperands provides a further speedup:
176.2160 (100.0%) 67.5828 (100.0%) Total
The field order looks like this:
    /// The total capacity number of operands that the storage can hold.
    unsigned capacity : 31;
    /// A flag indicating if the operand storage was dynamically allocated, as
    /// opposed to inlined into the owning operation.
    unsigned isStorageDynamic : 1;
    /// The number of operands within the storage.
    unsigned numOperands;
    /// A pointer to the operand storage.
    OpOperand *operandStorage;
This is a macro benchmark running firtool on a large .fir design (parsing a 589MB .fir file and emitting 307MB of Verilog) on an 8-core Intel MBP. The full timing output is:
    ===-------------------------------------------------------------------------===
                             ... Execution time report ...
    ===-------------------------------------------------------------------------===
      Total Execution Time: 67.5828 seconds

      ----User Time----   ----Wall Time----   ----Name----
       7.2946 (  4.1%)     7.2946 ( 10.8%)  FIR Parser
     120.3814 ( 68.3%)    44.2111 ( 65.4%)  'firrtl.circuit' Pipeline
      22.7650 ( 12.9%)     2.0119 (  3.0%)  'firrtl.module' Pipeline
      19.0680 ( 10.8%)     1.8196 (  2.7%)  CSE
       0.0207 (  0.0%)     0.0032 (  0.0%)  (A) DominanceInfo
       3.6656 (  2.1%)     0.3209 (  0.5%)  LowerCHIRRTL
       5.8990 (  3.3%)     5.8990 (  8.7%)  InferWidths
       3.1869 (  1.8%)     3.1869 (  4.7%)  InferResets
       0.3583 (  0.2%)     0.3583 (  0.5%)  (A) circt::firrtl::InstanceGraph
       0.3518 (  0.2%)     0.3518 (  0.5%)  PrefixModules
      11.5896 (  6.6%)    11.5896 ( 17.1%)  LowerFIRRTLTypes
      50.5189 ( 28.7%)     8.7349 ( 12.9%)  'firrtl.module' Pipeline
      19.0035 ( 10.8%)     3.6512 (  5.4%)  ExpandWhens
      31.4828 ( 17.9%)     5.0830 (  7.5%)  Canonicalizer
       1.2106 (  0.7%)     1.2106 (  1.8%)  Inliner
       7.1370 (  4.1%)     7.1370 ( 10.6%)  IMConstProp
       0.4026 (  0.2%)     0.4026 (  0.6%)  (A) circt::firrtl::InstanceGraph
       0.0018 (  0.0%)     0.0018 (  0.0%)  BlackBoxReader
      13.5857 (  7.7%)     2.1557 (  3.2%)  'firrtl.module' Pipeline
      13.5538 (  7.7%)     2.1522 (  3.2%)  Canonicalizer
       0.7672 (  0.4%)     0.7672 (  1.1%)  CreateSiFiveMetadata
       1.1527 (  0.7%)     1.1527 (  1.7%)  EmitOMIR
       0.3725 (  0.2%)     0.3725 (  0.6%)  (A) circt::firrtl::InstanceGraph
       3.7628 (  2.1%)     3.7628 (  5.6%)  LowerFIRRTLToHW
       0.6715 (  0.4%)     0.6715 (  1.0%)  HWMemSimImpl
       2.8777 (  1.6%)     2.8777 (  4.3%)  SVExtractTestCode
      34.3632 ( 19.5%)     3.5022 (  5.2%)  'hw.module' Pipeline
       0.6692 (  0.4%)     0.0587 (  0.1%)  HWCleanup
      11.2831 (  6.4%)     1.1087 (  1.6%)  CSE
       0.0390 (  0.0%)     0.0059 (  0.0%)  (A) DominanceInfo
      18.6762 ( 10.6%)     2.1384 (  3.2%)  Canonicalizer
       0.1391 (  0.1%)     0.0152 (  0.0%)  HWLegalizeModules
       3.3793 (  1.9%)     0.2665 (  0.4%)  PrettifyVerilog
       5.1930 (  2.9%)     5.1930 (  7.7%)  ExportVerilog
       0.0595 (  0.0%)     0.0595 (  0.1%)  Rest
     176.2160 (100.0%)    67.5828 (100.0%)  Total
I think we should take this patch, with the revisions to OperandStorage.
FWIW, I also just experimented with removing OperandStorage from the TrailingResults and making it an unconditional member of Operation. This simplifies the generated machine code by eliminating "hasOperandStorage" and all the branches on it.
Doing this shows an extremely tiny speedup (67.4075s), but since it increases memory usage for commonly used operations, I don't think we should make that change.
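To illustrate the trade-off being weighed here, the following is a hypothetical sketch (the types and accessors are made up for illustration, not the real mlir::Operation code, which uses llvm::TrailingObjects):

```cpp
// Illustrative only.
struct OperandStorage {
  unsigned numOperands = 0;
  // ... capacity, flags, pointer to the operand array ...
};

// Variant A: the storage exists only for ops created with operands, so
// every accessor branches on a flag before touching it.
struct OpTrailingStorage {
  bool hasOperandStorage = false;
  OperandStorage *storage = nullptr; // stands in for the trailing object

  unsigned getNumOperands() const {
    return hasOperandStorage ? storage->numOperands : 0;
  }
};

// Variant B: the storage is an unconditional member. The flag and the
// branch disappear, but every operation pays the space cost, including
// the very common zero-operand ops.
struct OpInlineStorage {
  OperandStorage storage;

  unsigned getNumOperands() const { return storage.numOperands; }
};
```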
mlir/include/mlir/IR/OperationSupport.h:567

Please change these fields to:

    /// The total capacity number of operands that the storage can hold.
    unsigned capacity : 31;
    /// A flag indicating if the operand storage was dynamically allocated, as
    /// opposed to inlined into the owning operation.
    unsigned isStorageDynamic : 1;
    /// The number of operands within the storage.
    unsigned numOperands;
    /// A pointer to the operand storage.
    OpOperand *operandStorage;

This shows a further speedup on top of this patch because "numOperands" is frequently accessed and capacity is not.
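A minimal standalone example of why this field order matters (assumed names, not the MLIR structs): reading a bitfield compiles to a load plus a mask, while reading a plain unsigned is a single load, so the frequently read numOperands is better off as the uncompressed member:

```cpp
// Assumed example, not MLIR code.
struct FieldOrder {
  unsigned capacity : 31;        // cold: consulted only when growing
  unsigned isStorageDynamic : 1; // cold
  unsigned numOperands;          // hot: read on nearly every operand access
};

// Compiles to a plain 32-bit load.
unsigned size(const FieldOrder &s) { return s.numOperands; }

// Compiles to a load plus a mask (and possibly a shift) to strip the flag
// bit; fine for the rarely read capacity, costly if the hot field were
// stored this way instead.
unsigned cap(const FieldOrder &s) { return s.capacity; }
```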
Actually, wait, this isn't done yet: the OperandRange::OperandRange(Operation*) convenience constructor is still 2% of the profile because it is out-of-line. Please remove it (from the .h file and the OperationSupport.cpp file) as it is trivial and only used in one place. Please change Operation::getOperands() to be defined as:
    /// Returns an iterator on the underlying Value's.
    - operand_range getOperands() { return operand_range(this); }
    + operand_range getOperands() {
    +   return operand_range(getOpOperands().data(), getNumOperands());
    + }
or equivalent. This change further reduces time from 67.5s to 66.39s!
mlir/lib/IR/Operation.cpp:651–652

I'm not seeing any of these hunks from my view, weird phab thing?
Yes, looks like it was showing me stuff that came in with your merge. The patch LGTM, thanks!
mlir/lib/IR/Operation.cpp:651–652

Oh, I see what it is. You are viewing the diff between patches, not against the base (I only ever look at the diff from base). I've rebased since the last patch.