I see a lot of array sorting in the stack traces of our compiler: the canonicalizer traverses the list of registered operations every time it builds a pattern set, and that gets expensive very quickly.
I was looking at the crash, and of the 30 threads running the MLIR compiler, ~10 were in getRegisteredOperations. For the TF/TFRT compiler it takes 300us to get all operations, and it is called ~30 times for a typical module (~10ms in total).
Inside Google it's in the top 7 functions under mlir::MLIRContext:
- mlir::MLIRContext::getOrLoadDialect
- mlir::MLIRContext::MLIRContext
- mlir::MLIRContext::~MLIRContext
- mlir::MLIRContextImpl::~MLIRContextImpl
- mlir::MLIRContext::loadAllAvailableDialects
- mlir::MLIRContext::getOrLoadDialect()::{lambda()#1}::operator()
- mlir::MLIRContext::getRegisteredOperations
- mlir::MLIRContext::getRegisteredOperations()::$_1::__invoke
Assuming that this will be called at least once during the MLIRContext's lifetime, it's cheaper to precompute the sorted list.
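For context, a minimal sketch of what the precomputed variant looks like on the read side (the member name sortedRegisteredOperations is illustrative, not necessarily the actual MLIRContextImpl field):

```cpp
#include "mlir/IR/MLIRContext.h"

// Sketch: the context keeps a vector that is maintained in sorted order as
// operations are registered, so the getter becomes a cheap ArrayRef return
// instead of rebuilding and sorting a fresh list on every call.
ArrayRef<RegisteredOperationName> MLIRContext::getRegisteredOperations() {
  return getImpl().sortedRegisteredOperations;
}
```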
mlir/lib/IR/MLIRContext.cpp:189
Why SmallVector? It's unlikely for the context to have that few operations, but it will bloat the context object size.
mlir/lib/IR/MLIRContext.cpp:189
Just for consistency with every other container, I don't think that this really matters.
Oh wow!
I'm really surprised we're using this API in the canonicalizer and in FrozenRewritePatternSet.
This was really supposed to be more of a debugging API only.
It's a bit annoying that we're keeping both a map and a vector in the Context, but that seems justified if this API is actually useful.
Do you have an idea on the cost of this vs. say computing on demand?
Isn't "computing on demand" what to the existing API does? Do you have something else in mind?
mlir/lib/IR/MLIRContext.cpp:189
You may use llvm::SmallVector<RegisteredOperationName, 0> to avoid the extra bloat (this is even smaller than std::vector, if I remember correctly).
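If it helps, a quick way to sanity-check that size claim (the exact numbers are an assumption; they depend on the target and standard library):

```cpp
#include "llvm/ADT/SmallVector.h"
#include <vector>

// SmallVector<T, 0> carries no inline storage and uses 32-bit size and
// capacity fields, so on a typical 64-bit target it is 16 bytes, while
// std::vector is usually three pointers (24 bytes).
static_assert(sizeof(llvm::SmallVector<int, 0>) <= sizeof(std::vector<int>),
              "expected SmallVector<T, 0> to be no larger than std::vector");
```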
No, by "on demand" I mean lazy sorting. Instead of doing a sorted insert for every operation, only sort if the list is requested and not already sorted.
Gotcha!
So you'd keep a bool sorted = false; next to the vector, and in the getRegisteredOperations() call you would check:
if (!sorted) {
  llvm::sort(ctxImpl.sortedRegisteredOperations,
             [](auto &lhs, auto &rhs) {
               return lhs.getIdentifier().compare(rhs.getIdentifier()) < 0;
             });
  sorted = true;
}
Makes sense!
You'd also need to lock while actually sorting, but effectively yeah.
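A sketch of what that locked, deferred variant could look like, assuming hypothetical contextMutex and operationsSorted members on MLIRContextImpl:

```cpp
#include "llvm/ADT/STLExtras.h" // llvm::sort
#include <mutex>

// Sketch: sort lazily under a lock the first time the list is requested.
// Registration would have to reset operationsSorted to false after each
// insert, and the returned ArrayRef is only safe to use as long as no new
// operations are registered concurrently.
ArrayRef<RegisteredOperationName> MLIRContext::getRegisteredOperations() {
  MLIRContextImpl &impl = getImpl();
  std::lock_guard<std::mutex> lock(impl.contextMutex);
  if (!impl.operationsSorted) {
    llvm::sort(impl.sortedRegisteredOperations,
               [](RegisteredOperationName lhs, RegisteredOperationName rhs) {
                 return lhs.getIdentifier().compare(rhs.getIdentifier()) < 0;
               });
    impl.operationsSorted = true;
  }
  return impl.sortedRegisteredOperations;
}
```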
It will lead to slightly slower startup times, but we can always switch to deferred sorting if that is a problem. (It isn't clear how much it matters in practice anyways).
mlir/lib/IR/MLIRContext.cpp:189
> It will lead to slightly slower startup times, but we can always switch to deferred sorting if that is a problem. (It isn't clear how much it matters in practice anyways.)
I think that inserting into the various maps, plus the synchronization, is much more expensive, and for simplicity I'd prefer to keep all collection mutations close to each other.
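Concretely, keeping the mutations together could look something like this at registration time (a sketch; insertRegisteredOperation and the member names are hypothetical, and the real registration path also updates the name-to-info map under the same writer lock):

```cpp
#include "llvm/ADT/STLExtras.h" // llvm::upper_bound

// Sketch: insert the newly registered operation at its sorted position so
// the vector never needs a separate sort pass. This runs right next to the
// map insert, under the write lock that already guards registration.
static void insertRegisteredOperation(MLIRContextImpl &impl,
                                      RegisteredOperationName name) {
  auto &ops = impl.sortedRegisteredOperations;
  auto insertPt = llvm::upper_bound(
      ops, name,
      [](RegisteredOperationName lhs, RegisteredOperationName rhs) {
        return lhs.getIdentifier().compare(rhs.getIdentifier()) < 0;
      });
  ops.insert(insertPt, name);
}
```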