This is the start of mlir benchmarks. It configures cmake files necessary to get started writing actual benchmarks.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Changing commit message by removing references to sparse kernel since we are introducing a general benchmark framework here
mlir/CMakeLists.txt | ||
---|---|---|
216 | I think this should all be inside the benchmark folder. | |
221 | This should be behind an option -DMLIR_ENABLE_CXX_BENCHMARKS or something like that. |
Address comments: move benchmarking configuration to inner directory and enable/disable benchmarking with a flag
mlir/CMakeLists.txt | ||
---|---|---|
195 | looks like newline is missing | |
mlir/benchmarks/MLIRStarter.cpp | ||
3–8 ↗ | (On Diff #392175) | jumping ahead a little, you are setting up a framework that runs C++ code how are you going to make the link from C++ to MLIR IR (i.e. how do you envision running a sparse kernel benchmark this way)? |
mlir/benchmarks/MLIRStarter.cpp | ||
---|---|---|
3–8 ↗ | (On Diff #392175) | I don't know the answer to it yet but I am going through some more documentation about how MLIR IR, specifically around what's the equivalent of IRBuilder in MLIR world. What metrics of MLIR IR do we care about? For example, is code size important? |
mlir/benchmarks/MLIRStarter.cpp | ||
---|---|---|
3–8 ↗ | (On Diff #392175) | code size could be something important on Mobile indeed. It isn't the main focus in general though (compared to execution time). We can write C++ execution tests (example: https://github.com/llvm/llvm-project/blob/main/mlir/unittests/ExecutionEngine/Invoke.cpp ) but we'll need to significantly improve the runtime support to make this convenient I think. |
mlir/benchmarks/MLIRStarter.cpp | ||
---|---|---|
3–8 ↗ | (On Diff #392175) | Thanks for the pointers @mehdi_amini. That's what I was thinking, somehow generating IR inline and then timing it, but looks like it's not viable yet. I will think about it more and post my findings here. |
mlir/benchmarks/MLIRStarter.cpp | ||
---|---|---|
3–8 ↗ | (On Diff #392175) | We can have an external python program, similar to llvm-lit, that can do these transformations externally and run some benchmarks. What do you think about this high level idea? I can start on this direction if that sounds viable. |
mlir/benchmarks/MLIRStarter.cpp | ||
---|---|---|
3–8 ↗ | (On Diff #392175) | I'm not sure what you mean actually, can you elaborate what you have in mind? |
mlir/benchmarks/MLIRStarter.cpp | ||
---|---|---|
3–8 ↗ | (On Diff #392175) | Note that we are actually much more advanced in building, compiling, and running MLIR IR from Python programs, take a look at https://github.com/llvm/llvm-project/blob/main/mlir/test/Integration/Dialect/SparseTensor/python/test_SpMM.py for example (which would give many opportunities to time part of the generated IR in a very fine-grained manner [after a warm JIT execution startup]). So in that sense, perhaps a Python based benchmarking framework for starters would make some sense too. But please explore a bit and report back to us. |
mlir/benchmarks/MLIRStarter.cpp | ||
---|---|---|
3–8 ↗ | (On Diff #392175) | Yeah, my idea is not very concrete yet but https://github.com/llvm/llvm-project/blob/main/mlir/test/Integration/Dialect/SparseTensor/python/test_SpMM.py can be a good starting point. I will flesh this out more and come back. |
@mehdi_amini @aartbik I have introduced a python example benchmark. Let me know what you think about this new approach. It needs more fleshing out and a possible integration with llvm-lit. I have suggested a way we can use FileCheck for these benchmarks.
I am also posting a comment on discourse thread with more context so that more people can weigh in.
mlir/benchmark/Dialect/SparseTensor/python/test_SpMM.py | ||
---|---|---|
135 ↗ | (On Diff #393755) | if you are just showing this as a first very rough example, that's fine, but we cannot check this in as is, since this is a very bad test for a benchmark (it stress tests all sparsity annotations over a very small test case) In the long run, we want a proper benchmark to (1) pick one sparsity annotation (or report numbers for each different annotation at least) |
mlir/benchmark/Dialect/SparseTensor/python/test_SpMM.py | ||
---|---|---|
135 ↗ | (On Diff #393755) | Yes, I couldn't think of a good example so just copied the original sparse multiplication test just to demonstrate what it would look like. In fact, we should probably not even use timeit utility of python. |
Even if the current benchmarking is the way to go, I couldn't find a way to consistently run them. In local, I have been running them from command line like this
bash PYTHONPATH=build/tools/mlir/python_packages/mlir_core MLIR_C_RUNNER_UTILS=build/lib/libmlir_c_runner_utils.dylib MLIR_RUNNER_UTILS=build/lib/libmlir_runner_utils.dylib python mlir/benchmark/python/*.bench.py
mlir/benchmark/python/README.md | ||
---|---|---|
25 ↗ | (On Diff #395836) | I'm not convinced by the parameters here: I would expect the "driver" / "framework" to be agnostic to what the benchmark function does mostly.
|
mlir/benchmark/python/run.py | ||
33 ↗ | (On Diff #395836) | Can you add a mode where I get a local report file instead? (offline mode) |
mlir/benchmark/python/README.md | ||
---|---|---|
25 ↗ | (On Diff #395836) | Makes total sense. I have never implemented discovery or organising suites so give me some time to design and implement it. 🙂 I do hope that the approach is right and the final outcome would mostly look like a pytest invocation. |
mlir/benchmark/python/README.md | ||
---|---|---|
25 ↗ | (On Diff #395836) | Hey, I was implementing this and a question came to me. In the current implementation, we compile the module once and then run the compiled module n times. If we make the driver benchmark agnostic, we would be both compiling and running n number of times. Would that be acceptable? One way we could reproduce the current arrangement is by having a convention that the benchmarks return the compiled module and then we run the compiled module at the driver level. But because we are recording running times using nano_time, the driver would have to make assumptions about kernel's function signature to access the running times like we are doing now. I don't think we should be imposing this much from the driver. Do you have any thoughts on this? |
mlir/benchmark/python/README.md | ||
---|---|---|
25 ↗ | (On Diff #395836) | Thinking about this more, I think what I am looking for is "setup" for unit testing. This could be a good way to divide responsibilities between the driver and the sparse matrix specific setup. I also think that each benchmark can return a float for the time taken in a run which can then be collected by the driver. |
mlir/benchmark/python/README.md | ||
---|---|---|
25 ↗ | (On Diff #395836) | Sorry for too many messages here. I will spend some more time. It would be much easier if I have an implementation of these scattered ideas. |
mlir/benchmark/python/README.md | ||
---|---|---|
25 ↗ | (On Diff #395836) | There are many ways to handle this, for example a function decorated with @benchmark could actually return a tuple of functors to split various phases, instead of the function itself being invoked and timed. ```python @benchmark(pipeline_string, ntimes) def benchmark_something_module(): def setup(): module = ir.Module.create() # Define arguments for the kernel function with ir.InsertionPoint(module.body): @builtin.FuncOp.from_py_func(<arguments_defined_above>) def kernel_name(<parameters>): # Kernel implementation compiled_module = compile(module) execution_engine = mlir.jit.init() execution_engine.register(compiled_module) return compiled_module, execution_engine def run(args): compiled_module = args[0] execution_engine = args[1] execution_engine.run("main", ....) return (setup, run) The framework can then do: run_args = setup() time(run, run_args) I just made that up right now, not saying it is necessarily the best option :) |
Still a work in progress but I have addressed some comments.
- I have abstracted everything into a library.
- Implemented benchmark discovery similar to pytests.
- Better separation of running passes, compiling, and running.
Some issues still need resolution:
- I am still running benchmarks a fixed number of times instead of running till they are statistically significant.
- The module discovery doesn't include filtering like pytests.
I am working on these two things right now.
Glancing at it:
- it's nice that you're splitting the general library in mlir/utils/
- it seems that there is still a strong coupling between the benchmark framework and the "thing to benchmark". It isn't clear to me why the code mlir/utils/mbr needs to know anything about the IR or the execution engine. Ideally it should be able to benchmark any random python code. Think that we may use this to benchmark a Numpy implementation vs MLIR for example. This is why the example I showed you before was directly returning callbacks: they can run any arbitrary code and the framework does not know about it.
This diff has these changes:
- Address the comment of returning a compile function and a run function from a benchmark which the framework can use.
- Add filtering of benchmark paths.
- Dynamically determine the number of runs required for a benchmark function. I have used a strategy similar to python's timeit (https://github.com/python/cpython/blob/main/Lib/timeit.py#L31-L33) and google benchmark (https://github.com/google/benchmark/blob/main/src/benchmark_runner.cc#L231-L253). Let me know if need something different here.
I haven't included a README since I am not yet sure if the framework interface is something we agree on. Once we have an agreement, I will add a user guide of this library in the README.
The latest revision contains the following changes
- Adding a README.
- Having a configuration file for the library.
- Having a numpy benchmark as an example where there is no compile function.
- Improve benchmark filtering to filter by benchmark name.
Hey @mehdi_amini @aartbik , I have addressed all outstanding issues. Can you please take a look and help me decide how to proceed further on this?
this is converging to something very nice, thanks for all your work and patience!
mlir/benchmark/python/benchmark_sparse.py | ||
---|---|---|
23 | Please add top level comment """xxxx""" describing the benchmark, just to set the right example for future benchmarks also, nitpicky, do we want a "flat" python/bencmark_*.py structure, or do we want to specialize like benchmark/sparse/*.py Just curious if we should set that example now already, or whether we refactor later if the number of benchmarks grows | |
24 | this probably belongs in common now, ie. setting up various pipelines that will be shared by benchmarks | |
mlir/benchmark/python/common.py | ||
10 | top level doc on what is in this file """Common utilities .... """ | |
mlir/utils/mbr/mbr/config.ini | ||
10 | what are these newline missing comments? | |
mlir/utils/mbr/mbr/discovery.py | ||
10 | please document every method with a python doc string | |
mlir/utils/mbr/mbr/main.py | ||
33 | try to keep the 80-col limit |
Addressed latest round of comments
- Added python docstrings for functions and modules.
- Made all lines wrap around 80 character limit.
- Added newlines.
Regarding flat python/bencmark_*.py structure, my vote would be for the current structure as its simpler and we could always move
things around later. I am open to other directory structures as well so let me know if you have any alternate thoughts :)
mlir/benchmark/python/common.py | ||
---|---|---|
27 | I wonder whether this setup_pass belongs to common as it seems specific to sparse tensors. |
mlir/.gitignore | ||
---|---|---|
10 ↗ | (On Diff #400972) | Why do all these folder show up here? In general the execution is in the build (which is at the top level and not under MLIR). |
mlir/benchmark/python/common.py | ||
27 | Yes it isn't "common" from this point of view, but this is also the case of a helper like create_sparse_np_tensor below. | |
91 | This is not describing the loop that is emitted and the wrapping | |
mlir/utils/mbr/mbr/stats.py | ||
15 | These seems fairly arbitrary and overly large to me. Where is this coming from? Did you look into how other solution are using some statistical approach to evaluate confidence? |
Address latest round of comments.
Why do all these folder show up here? In general the execution is in the build (which is at the top level and not under MLIR).
We should leave our source directory "pristine".
I have now made an llvm-lit kind of arrangement where we move the executable to the build directory and mlir-mbr is invoked from there. That way, we won't be bothered by "*.pyc" files. Unfortunately, the egg-info and build/ directories are created by pip install -e which we have to do to install mlir-mbr. I included manual rm -rs in CMakeLists.txt.
Yes it isn't "common" from this point of view, but this is also the case of a helper like create_sparse_np_tensor below.
Absolutely. I tried splitting common into benchmark wide common and sparse specific common by creating a directory called mlir/benchmark/python/sparse and moving sparse specific common and sparse benchmark there. Unfortunately, I couldn't work around parent package import problem where benchmark is trying to import from common in the parent directory. I kept it unchanged for now.
We need to solve this problem as I expect benchmarks would be divided into different directories.
This is not describing the loop that is emitted and the wrapping
Updated the comment.
These seems fairly arbitrary and overly large to me. Where is this coming from?
I took the general idea from google benchmark https://github.com/google/benchmark/blob/main/src/benchmark_runner.cc#L260-L265 and python timeit https://github.com/python/cpython/blob/main/Lib/timeit.py#L31-L33. The values probably need tweaking but my understanding is the general idea is to either run it till some max time has elapsed or we have a sufficient number of measurements.
llvm-lit is also a python program, how is it setup?
The rm looks a bit hacky to me, I rather not touch the source directory at all.
These seems fairly arbitrary and overly large to me. Where is this coming from?
I took the general idea from google benchmark https://github.com/google/benchmark/blob/main/src/benchmark_runner.cc#L260-L265 and python timeit https://github.com/python/cpython/blob/main/Lib/timeit.py#L31-L33. The values probably need tweaking but my understanding is the general idea is to either run it till some max time has elapsed or we have a sufficient number of measurements.
Wow, that really low-tech...
Probably good enough to get started, but please make a default to a couple of seconds, not 100.
llvm-lit is also a python program, how is it setup?
The rm looks a bit hacky to me, I rather not touch the source directory at all.
Yep, got a solution for it! The problem was mbr seeked to be both a library and a CLI runner. As a library, it used to provide BenchmarkRunConfig so that benchmarks
could import them and return them like this.
python from mbr import BenchmarkRunConfig def some_benchmark(): ... return BenchmarkRunConfig(compiler=compiler, runner=runner)
To make mbr available, we had to do pip install which created those build directories. Now I just return a two element tuple from each benchmark and check them in the runner. So we don't need a library and thus we can remove the pip install.
Probably good enough to get started, but please make a default to a couple of seconds, not 100.
Made it 1 second (1e9 ns).
I have also made it default to continue running benchmarks if any benchmark raises an exception. This is similar to how unittests handle testcases that raise an exception.
Thanks a lot for all your efforts and patience!
LGTM as well, but with a few last nits on naming
please address those before submitting (but I am giving you a green light now already to avoid further delay once you are done with those ;-) ;-)
mlir/benchmark/python/benchmark_sparse.py | ||
---|---|---|
2 | matrices -> tensors | |
31 | throughout the file please replace sparse multiplication with sparse matrix multiplication (note that we use "sparse tensor" when we talk about the support in general, but for specific 2-d cases, using matirx is okay) | |
mlir/benchmark/python/common.py | ||
27 | Also note that thanks to Wren's work, we soon can simplify this a lot! |
mlir/benchmark/python/common.py | ||
---|---|---|
27 | Nice, I guess you are talking about this one. I am keeping track of that and will change this once that's in the main tree. I hope I am guessing it correctly that you are okay with this going in as of now. I have addressed the name changes that you raised in the last comment. |
The build is taking quite a while. Is it okay to restart if it doesn't finish in about 9-10 hours?
I think this should all be inside the benchmark folder.