Hi,
I have used the Google Benchmark library on the Harris corner detection kernel (from the PolyMage benchmarks). I want to know whether the way I have used the library here is correct, or whether there is a better way to do this.
Did you consider putting the driver code (initialization, malloc, free, checking the result) into a different file than the kernel code? This would allow measuring the kernel (code size, LLVM statistics, compile time, etc.) independently from the boilerplate code.
MicroBenchmarks/harris/harris.cpp | ||
---|---|---|
177–197 ↗ | (On Diff #149608) | Maybe the malloc/free calls should be taken out of the measured kernel. |
201–208 ↗ | (On Diff #149608) | Could you rewrite this to use multidimensional access subscripts? E.g. img[_i0-1][_i1-1]. (See the sketch after this table.) |
MicroBenchmarks/harris/sha1.hpp | ||
1–45 ↗ | (On Diff #149608) | There is already hashing used by test-suite (see HashProgramOutput.sh), why add another one? |
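For reference, a minimal sketch of what the multidimensional-subscript suggestion above looks like next to flat index arithmetic. This is not the actual harris.cpp code; the loop body, the row-major layout, and the compile-time HEIGHT/WIDTH constants are stand-ins for whatever harris.h defines.

```cpp
// Assumed stand-ins for the constants in harris.h.
constexpr int HEIGHT = 1024;
constexpr int WIDTH = 1024;

// Flat index arithmetic over a row-major buffer of (HEIGHT+2) x (WIDTH+2) floats.
void copyFlat(const float *img, float *out) {
  for (int _i0 = 1; _i0 <= HEIGHT; ++_i0)
    for (int _i1 = 1; _i1 <= WIDTH; ++_i1)
      out[_i0 * (WIDTH + 2) + _i1] = img[(_i0 - 1) * (WIDTH + 2) + (_i1 - 1)];
}

// The same access written with multidimensional subscripts, as suggested above.
void copySubscript(const float (*img)[WIDTH + 2], float (*out)[WIDTH + 2]) {
  for (int _i0 = 1; _i0 <= HEIGHT; ++_i0)
    for (int _i1 = 1; _i1 <= WIDTH; ++_i1)
      out[_i0][_i1] = img[_i0 - 1][_i1 - 1];
}
```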
Looks great. Did you do a performance comparison with/without Polly?
[suggestion] Even though Google Benchmark by default does not run kernels in multiple threads, it might be a good idea to prepare for it. That is, no global shared img array.
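To make the suggestion concrete, here is a minimal sketch of a benchmark body that keeps all buffers local instead of using a global shared img array, so that running with ->Threads(N) would stay race-free. The initImage/harrisKernel declarations below are assumptions for illustration, not the signatures in this patch.

```cpp
#include <benchmark/benchmark.h>
#include <cstddef>
#include <vector>

// Assumed declarations for illustration only.
void initImage(float *img, size_t height, size_t width);
void harrisKernel(size_t height, size_t width, const float *in, float *out);

static void BM_HARRIS(benchmark::State &state) {
  const size_t height = 1024, width = 1024;
  // Per-invocation buffers: each benchmark thread gets its own copies,
  // so nothing is shared and no locking is needed.
  std::vector<float> img((height + 2) * (width + 2));
  std::vector<float> out((height + 2) * (width + 2));
  initImage(img.data(), height, width);

  for (auto _ : state)
    harrisKernel(height, width, img.data(), out.data());
}
BENCHMARK(BM_HARRIS)->Unit(benchmark::kMicrosecond); // ->Threads(4) would also be safe
```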
[comment] Is there a reason why init.cpp and main.cpp are separate files?
MicroBenchmarks/harris/harris.h | ||
---|---|---|
34 | [style] DUMP_IMAGE is named like a macro, but is a function. | |
MicroBenchmarks/harris/harris_kernel.cpp | ||
118 ↗ | (On Diff #150815) | [style] The return before the end of the function seems unnecessary. |
Polly + O3 and plain O3 take the same time. It seems that by the time the code reaches Polly, it is already heavily optimized at O3, and Polly cannot find any further optimization even though the code is inside its SCoP.
[suggestion] Even though Google Benchmark by default does not run kernels in multiple threads, it might be a good idea to prepare for it. That is, no global shared img array.
Updated.
I still have to keep an extra global array (besides the image) to copy the final output to. I couldn't modify the original image array, as that would affect the input of the other threads.
[comment] Is there a reason why init.cpp and main.cpp are separate files?
init.cpp has the image initialization and print functions, which can be used for other image-processing kernels as well, so I kept it in a separate file.
How long is a single run with O3?
[suggestion] Even though Google Benchmark by default does not run kernels in multiple threads, it might be a good idea to prepare for it. That is, no global shared img array.
Updated.
I still have to keep an extra global array (besides the image) to copy the final output to. I couldn't modify the original image array, as that would affect the input of the other threads.
It still writes to the target array in parallel without locking.
I suggest calling harrisKernel once more, only for the correctness check.
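A sketch of what that could look like, reusing the per-thread-buffer setup from the earlier sketch; the helper names (initImage, harrisKernel, dumpImage) are assumptions for illustration:

```cpp
#include <benchmark/benchmark.h>
#include <cstddef>
#include <vector>

// Assumed declarations for illustration only.
void initImage(float *img, size_t height, size_t width);
void harrisKernel(size_t height, size_t width, const float *in, float *out);
void dumpImage(const char *path, const float *img, size_t height, size_t width);

static void BM_HARRIS(benchmark::State &state) {
  const size_t height = 1024, width = 1024;
  std::vector<float> img((height + 2) * (width + 2));
  std::vector<float> out((height + 2) * (width + 2));
  initImage(img.data(), height, width);

  for (auto _ : state) // timed runs write only thread-local memory
    harrisKernel(height, width, img.data(), out.data());

  // One extra, untimed run on a single thread, purely for the correctness check.
  // (Older Google Benchmark releases expose this as state.thread_index instead.)
  if (state.thread_index() == 0) {
    std::vector<float> check((height + 2) * (width + 2));
    harrisKernel(height, width, img.data(), check.data());
    dumpImage("output.txt", check.data(), height, width);
  }
}
BENCHMARK(BM_HARRIS);
```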
[comment] Is there a reason why init.cpp and main.cpp are separate files?
init.cpp has the image initialization and print functions, which can be used for other image-processing kernels as well, so I kept it in a separate file.
If it is supposed to be a shared resource, it shouldn't be in the harris directory.
Could you check whether llvm-lit correctly collects execution time, compile/link time, LLVM -stats, and code size?
I don't know how to check LLVM -stats using lit.
The sizes match the output of llvm-size; compile time and link time are also fine.
lit output (I ran "lit ." in build/Microbenchmark/harris):
compile_time: 1.3595
link_time: 0.0832
exec size: 364288
exec_time: 28254000.0000
Testing Time: 1.22s
...
For now, I have merged init.cpp and main.cpp. The image initialization here is specific to visualizing the output nicely; we may or may not use this initialization for other image-processing kernels.
If common image-initialization source code is introduced in the future (perhaps with multiple kinds of image initialization and some helper functions for image processing), only minor changes to the main.cpp file will be needed.
Here are some of the stats for this code:
Runtime
With the benchmark library
Flag | CPU Time | Iterations | System-measured time (using time ./a.out) |
---|---|---|---|
O0 | 117259447ns | 6 | 1.039s |
O3 | 32498956ns | 21 | 1.214s |
O3+Polly | 32562202ns | 21 | 1.205s |
Without the benchmark library (input size is different; see harris.h)
Flag | System-measured time |
---|---|
O0 | 1.241s |
O3 | 0.697s |
O3+Polly | 0.611s |
Note: Clang was built in debug mode, and I compiled the benchmark using clang++ -O3 -mllvm -polly --std=c++11 harrisKernel.cpp main.cpp -lbenchmark -lpthread -o withPolly.out
lit --vg --vg-leak
It fails... but so do the other benchmarks in the MicroBenchmarks folder, so maybe there is a memory-leak problem in the benchmark library.
I am thinking of manually verifying the output, since there is a pattern to the output with this checkbox initialization; let me know whether that is a good idea.
I have removed the reference output for now, as it is about 10 MB in size. I will update the diff with the reference output once the checking method is finalized.
MicroBenchmarks/harris/main.cpp | ||
---|---|---|
108–121 | There are a few comments I have about this code, but let me start with the simpler ones:
|
Using Polly's -debug-only=polly-scops output showed that the kernel is in fact not optimized:
Invalidate SCoP because of reason 0
NOTE: Run time checks for %for.cond11.preheader---%for.cond.cleanup532 could not be created as the number of parameters involved is too high. The SCoP will be dismissed. Use: --polly-rtc-max-parameters=X to adjust the maximal number of parameters but be advised that the compile time might increase exponentially.
Bailing-out because could not build alias checks
Using -polly-rtc-max-parameters=999 does not help. I remember that Polly prints that message whenever it cannot build the alias checks, even for reasons other than the one stated.
Runtime on this differential
Benchmark | Time | CPU | Iterations |
---|---|---|---|
Polly | 28825 | 28757 | 25 |
Polly | 28964 | 28910 | 24 |
-O3 | 13216 | 13191 | 54 |
-O3 | 13072 | 13049 | 54 |
-O0 | 98790 | 98716 | 7 |
-O0 | 99720 | 99652 | 7 |
Modified so that Polly is now able to make some changes in the kernel (no runtime-check problem).
MicroBenchmarks/harris/main.cpp | ||
---|---|---|
180 | It seems that HEIGHT and WIDTH are input values anyway; consider using multiple input sizes to see how the kernel performs as the image size goes up. You might also not need the __restrict__ attributes for the malloc-provided heap memory either. This means you could do: float **image = reinterpret_cast<float**>(malloc(sizeof(float) * (2 + state.range(0)) * (2 + state.range(1)))); When you register the benchmark, you can then provide the image sizes to test with: BENCHMARK(HarrisBenchmark) ->Unit(benchmark::kMicrosecond) ->Args({256, 256}) ->Args({512, 512}) ->Args({1024, 1024}) ->Args({2048, 2048}); You can see more options at https://github.com/google/benchmark#passing-arguments. Another thing you may consider measuring, as I suggested in the past, is throughput. To do that, you can call state.SetBytesProcessed(...) in the benchmark body, typically at the end just before exiting -- you want to essentially report something like: state.SetBytesProcessed(sizeof(float) * (state.range(0) + 2) * (state.range(1) + 2) * state.iterations()); This will add a "MB/sec" output alongside the time it took for each iteration of the benchmark. |
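Pulling the pieces from the comment above together, a sketch of how the sized registration and the throughput reporting could fit into one benchmark. The flat-pointer harrisKernel signature is an assumption, and the image initialization is omitted.

```cpp
#include <benchmark/benchmark.h>
#include <cstdint>
#include <cstdlib>

// Assumed declaration for illustration; the real kernel lives in this patch.
void harrisKernel(size_t height, size_t width, const float *in, float *out);

static void HarrisBenchmark(benchmark::State &state) {
  const size_t height = state.range(0);
  const size_t width = state.range(1);
  float *image = static_cast<float *>(
      malloc(sizeof(float) * (2 + height) * (2 + width)));
  float *output = static_cast<float *>(
      malloc(sizeof(float) * (2 + height) * (2 + width)));
  // (Image initialization omitted for brevity.)

  for (auto _ : state)
    harrisKernel(height, width, image, output);

  // Throughput relative to the input image, reported as MB/s by the library.
  state.SetBytesProcessed(static_cast<int64_t>(
      sizeof(float) * (height + 2) * (width + 2) * state.iterations()));

  free(image);
  free(output);
}

BENCHMARK(HarrisBenchmark)
    ->Unit(benchmark::kMicrosecond)
    ->Args({256, 256})
    ->Args({512, 512})
    ->Args({1024, 1024})
    ->Args({2048, 2048});
```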
Thanks for making some of the changes. I'm still not clear on a couple of things.
Do you mind sharing some of the results with the new benchmark runs, with the different image sizes? Do we actually get the throughput numbers in there as well?
MicroBenchmarks/harris/main.cpp | ||
---|---|---|
111–119 | Why are these still using HEIGHT and WIDTH? Why aren't these just: const size_t height = state.range(0); const size_t width = state.range(1); float **image = reinterpret_cast<float**>(malloc(sizeof(float) * (2 + height) * (2 + width))); float **imageOutput = reinterpret_cast<float**>(malloc(sizeof(float) * (2 + height) * (2 + width))); | |
185 | Can you re-format this? Preferably with clang-format if possible, so that it's easier to read. |
Formatted using clang-format.
Results
With Polly:
BENCHMARK | Time | CPU | Iterations | Throughput |
---|---|---|---|---|
BENCHMARK_HARRIS/256/256 | 769 us | 768 us | 911 | 330.508MB/s |
BENCHMARK_HARRIS/512/512 | 4001 us | 3996 us | 177 | 252.207MB/s |
BENCHMARK_HARRIS/1024/1024 | 25690 us | 25650 us | 28 | 156.553MB/s |
BENCHMARK_HARRIS/2048/2048 | 118023 us | 117830 us | 6 | 136.054MB/s |
Without Polly:
BENCHMARK | Time | CPU | Iterations | Throughput |
---|---|---|---|---|
BENCHMARK_HARRIS/256/256 | 626 us | 625 us | 1135 | 406.15MB/s |
BENCHMARK_HARRIS/512/512 | 3074 us | 3068 us | 229 | 328.51MB/s |
BENCHMARK_HARRIS/1024/1024 | 17121 us | 17086 us | 42 | 235.022MB/s |
BENCHMARK_HARRIS/2048/2048 | 64207 us | 64077 us | 11 | 250.189MB/s |
I replied to your previous comment, but for some reason it is showing only after your inline comment, not here.
I cannot use a pointer to a pointer (float **) here, as the compiler may think that some of the pointers overlap, which prevents Polly from detecting SCoPs.
I have to allocate fixed-size arrays here, as "float (&outputImg)[2+height][2+width] = *reinterpret_cast<float (*)[2+height][2+width]>((float *) malloc(...)); " is not allowed by clang++.
Also, is the number of bytes processed calculated with respect to the size of the output, or the total number of bytes accessed in the kernel?
MicroBenchmarks/harris/main.cpp | ||
---|---|---|
180 | Cannot use float **image, as the pointers may overlap and this prevents Polly from detecting SCoPs. I have to allocate fixed-size arrays here, as "float (&outputImg)[2+height][2+width] = *reinterpret_cast<float (*)[2+height][2+width]>((float *) malloc(...)); " is not allowed by clang++. I did consider adding SetBytesProcessed, but I was not sure how many bytes should be passed as the argument (the output image size or the total bytes accessed in the kernel), so I commented out the line "SetBytesProcessed(static_cast<int64_t>(state.iterations())*WIDTH*HEIGHT*50);" but forgot to ask about it. |
I'm not sure that's entirely true -- the pointers are coming from malloc, so they're not meant to overlap. At least LLVM should be able to detect that.
I have to allocate fixed-size arrays here, as "float (&outputImg)[2+height][2+width] = *reinterpret_cast<float (*)[2+height][2+width]>((float *) malloc(...)); " is not allowed by clang++.
It's not allowed because height and width are not constant expressions. Note that even in function arguments, the "array" form will be treated as pointers anyway, so it shouldn't make any difference if you change the API to:
void harrisKernel( int height, int width, float **inputImg, float **outputImg, float **Ix, float **Iy, float **Ixx, float **Ixy, float **Iyy, float **Sxx, float **Sxy, float **Syy, float **det, float **trace)
If anything, you want to mark those pointers that the compiler is supposed to treat as non-aliasing as restrict or __restrict__.
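For illustration, a sketch of the signature quoted above with __restrict__ added. Which of these pointers the kernel may actually treat as non-aliasing is an assumption here, not something this patch states.

```cpp
// __restrict__ on the outer pointers promises the compiler that these pointer
// arguments themselves do not alias each other. The individual row pointers
// they point to would need their own qualification (or a flat layout) to give
// the same guarantee for the pixel data.
void harrisKernel(int height, int width,
                  float **__restrict__ inputImg, float **__restrict__ outputImg,
                  float **__restrict__ Ix, float **__restrict__ Iy,
                  float **__restrict__ Ixx, float **__restrict__ Ixy,
                  float **__restrict__ Iyy, float **__restrict__ Sxx,
                  float **__restrict__ Sxy, float **__restrict__ Syy,
                  float **__restrict__ det, float **__restrict__ trace);
```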
Also, is the number of bytes processed calculated with respect to the size of the output, or the total number of bytes accessed in the kernel?
The bytes processed is what you say it is -- as I suggested, it is based on the input size (the size of the input image). If you want to measure something else, you're going to have to define what that throughput is.
The beauty of having a benchmark suite is that you can test out these various approaches in the same benchmark. You can explore alternative strategies like:
- Turn the harris kernel implementation into a template, so you can use C++ features and take const std::array<std::array<float, W>, H>&, and let the compiler deduce the sizes (see the sketch after this list).
- Have a version of the kernel with pointer arguments with __restrict__ and one without.
- Instead of taking output pointers, consider returning a struct/tuple with std::unique_ptr<float[][]> members.
- Use C++ array new and array delete (say new float*[(height *2) * (width * 2)]).
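As an illustration of the first bullet, a minimal sketch of a templated interface taking std::array. Only the shape of the interface is shown; the loop body is a placeholder, and the padded sizes are assumptions, not the patch's actual values.

```cpp
#include <array>
#include <cstddef>
#include <memory>

// H and W are the padded sizes (image plus the 2-pixel border) and are deduced
// from the argument types. The body is a placeholder, not the real stencil.
template <std::size_t H, std::size_t W>
void harrisKernel(const std::array<std::array<float, W>, H> &inputImg,
                  std::array<std::array<float, W>, H> &outputImg) {
  for (std::size_t i = 1; i + 1 < H; ++i)
    for (std::size_t j = 1; j + 1 < W; ++j)
      outputImg[i][j] = inputImg[i - 1][j - 1];
}

int main() {
  // Large std::arrays would overflow the stack, so keep them on the heap.
  using Image = std::array<std::array<float, 1026>, 1026>;
  auto in = std::make_unique<Image>();
  auto out = std::make_unique<Image>();
  harrisKernel(*in, *out); // H and W deduced as 1026
}
```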
Anyway, I'm fine with the current state of it, but would like to see more work done, if not now then in the future. This will be really helpful for people working on the compiler who are trying to see whether performance/throughput improves (or regresses) with changes to the compiler. I'm sure the folks working on Polly would like to be able to diagnose why the Polly version is slower than the normal build.
Please wait for someone else to LGTM/Accept before landing. I'm sure we can spend a lot more time trying to make these benchmarks better, but in the meantime having *something* in there is better than not having one in there.
MicroBenchmarks/harris/main.cpp | ||
---|---|---|
180 | I don't know whether you want to optimise for Polly or make Polly just recognise that these pointers shouldn't overlap. If Polly can't detect that these pointers are coming from different 'malloc' calls, then I suspect that's a bug in Polly rather than something you need to work around in the benchmark. Note that maybe the better thing to do is to change the kernel's API to put restrict or __restrict__ on the pointers, so that the optimiser in those cases might be able to assume that the pointers don't alias, and you don't need to do anything special in this benchmark. See my top-level comment for alternatives to explore, if you're open to it. |
I agree with @dberris. If this lands, we can look into adding additional benchmarks like this and basing the patches on common structures.
MicroBenchmarks/harris/main.cpp | ||
---|---|---|
88 | This will write output.txt into whichever directory lit is run in. You can set the working directory that the test will run in by passing an argument to llvm_test_run(): llvm_test_run(WORKDIR ${CMAKE_CURRENT_BINARY_DIR}) |