[test-suite][RFC] Using Google Benchmark Library on Harris Kernel
Needs Review · Public

Authored by proton on Sat, Jun 2, 5:06 AM.

Details

Summary

Hi,
I have used the Google Benchmark library on the Harris corner detection kernel (from the PolyMage benchmarks). I want to know whether the way I have used the library here is correct, or whether there is a better way to do this.

Diff Detail

proton created this revision. Sat, Jun 2, 5:06 AM

Did you consider putting the driver code (initialization, malloc, free, checking the result) into a different file from the kernel code? This would allow measuring the kernel (code size, LLVM statistics, compile time, etc.) independently of the boilerplate code.

MicroBenchmarks/harris/harris.cpp
177–197 ↗(On Diff #149608)

Maybe the malloc/free calls should be taken out of the measured kernel.
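Not part of the diff, just a minimal sketch of what that could look like; the harrisKernel signature and the sizes below are assumptions, not the actual interface in the patch:

```
#include <benchmark/benchmark.h>

#include <cstdlib>

// Assumed kernel signature for the sketch; the real one is in the kernel file.
void harrisKernel(int height, int width, float *img, float *out);

static void BM_Harris(benchmark::State &state) {
  const int height = 2048, width = 2048;  // example sizes
  // Allocate (and initialize) outside the timed loop so that malloc/free are
  // not part of the measurement.
  float *img = (float *)malloc(sizeof(float) * height * width);
  float *out = (float *)malloc(sizeof(float) * height * width);
  // ... initialize img ...

  for (auto _ : state)
    harrisKernel(height, width, img, out);  // only the kernel itself is timed

  free(img);
  free(out);
}
BENCHMARK(BM_Harris);
```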

201–208 ↗(On Diff #149608)

Could you rewrite this to use multidimensional access subscripts? E.g. img[_i0-1][_i1-1].
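For illustration, a small self-contained sketch of that access style (the names, sizes, and the stencil itself are placeholders, not the Harris kernel):

```
// Illustrative only: HEIGHT/WIDTH and the variable names below are assumptions,
// not the ones used in this patch.
constexpr int HEIGHT = 256;
constexpr int WIDTH = 256;

void stencilExample(float *imgData, float *outData) {
  // View the flat buffers as 2-D arrays so the loops can use multidimensional
  // subscripts instead of manual index arithmetic such as
  // imgData[(_i0 - 1) * WIDTH + (_i1 - 1)].
  float (*img)[WIDTH] = reinterpret_cast<float (*)[WIDTH]>(imgData);
  float (*out)[WIDTH] = reinterpret_cast<float (*)[WIDTH]>(outData);
  for (int _i0 = 1; _i0 < HEIGHT - 1; ++_i0)
    for (int _i1 = 1; _i1 < WIDTH - 1; ++_i1)
      out[_i0][_i1] = (img[_i0 - 1][_i1 - 1] + img[_i0][_i1] +
                       img[_i0 + 1][_i1 + 1]) / 3.0f;
}
```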

MicroBenchmarks/harris/sha1.hpp
1–45 ↗(On Diff #149608)

The test-suite already uses hashing (see HashProgramOutput.sh); why add another one?

proton updated this revision to Diff 150032. Tue, Jun 5, 1:16 PM
proton updated this revision to Diff 150815. Mon, Jun 11, 12:40 PM
proton edited the summary of this revision. (Show Details)

Looks great. Did you do a performance comparison with/without Polly?

[suggestion] Even though Google Benchmark by default does not run kernels in multiple threads, it might be a good idea to prepare for it. That is, no global shared img array.

[comment] Is there a reason why init.cpp and main.cpp are separate files?

MicroBenchmarks/harris/harris.h
34

[style] DUMP_IMAGE is named like a macro, but is a function.

MicroBenchmarks/harris/harris_kernel.cpp
118 ↗(On Diff #150815)

[style] The return before the end of the function seems unnecessary.

proton updated this revision to Diff 151105. Wed, Jun 13, 12:11 AM
proton added a comment (edited). Wed, Jun 13, 12:22 AM

Looks great. Did you do a performance comparison with/without Polly?

Polly + O3 and O3 alone take the same time. It seems that by the time the code reaches Polly, it has already been heavily optimized at O3, and Polly cannot find any further optimization even though the kernel is inside its SCoP.

[suggestion] Even though Google Benchmark by default does not run kernels in multiple threads, it might be a good idea to prepare for it. That is, no global shared img array.

Updated.
I still have to keep an extra global array (besides the image) to copy the final output to. I couldn't modify the original image array, as that would affect the input of the other threads.

[comment] Is there a reason why init.cpp and main.cpp are separate files?

init.cpp has the image initialization and print functions, which could be used for other image processing kernels as well, so I kept it in a separate file.

Looks great. Did you do a performance comparison with/without Polly?

Polly + O3 and O3 alone take the same time. It seems that by the time the code reaches Polly, it has already been heavily optimized at O3, and Polly cannot find any further optimization even though the kernel is inside its SCoP.

How long is a single run with O3?

[suggestion] Even though Google Benchmark by default does not run kernels in multiple threads, it might be a good idea to prepare for it. That is, no global shared img array.

Updated.
I still have to keep an extra global array (besides the image) to copy the final output to. I couldn't modify the original image array, as that would affect the input of the other threads.

It still writes to the target array in parallel without locking.

I suggest calling harrisKernel once more, only for the correctness check.
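A minimal sketch of that structure, assuming a hypothetical harrisKernel(height, width, img, out) signature rather than the one in the diff:

```
#include <benchmark/benchmark.h>

#include <cstdlib>

// Assumed kernel signature for the sketch.
void harrisKernel(int height, int width, float *img, float *out);

static void BM_Harris(benchmark::State &state) {
  const int height = 2048, width = 2048;  // example sizes
  // Buffers are local to the benchmark invocation, so nothing global is
  // written concurrently if Google Benchmark runs with multiple threads.
  float *img = (float *)malloc(sizeof(float) * height * width);
  float *out = (float *)malloc(sizeof(float) * height * width);
  // ... initialize img ...

  for (auto _ : state)
    harrisKernel(height, width, img, out);  // timed iterations

  // One extra, untimed call whose result is used only for the correctness
  // check; the timed iterations never write to a shared output array.
  harrisKernel(height, width, img, out);
  // ... hash or compare 'out' against the reference output here ...

  free(img);
  free(out);
}
BENCHMARK(BM_Harris);
```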

[comment] Is there a reason why init.cpp and main.cpp are separate files?

`init.cpp` has the image initialization and print functions, which could be used for other image processing kernels as well, so I kept it in a separate file.

If it is supposed to be a shared resource, it shouldn't be in the harris directory.

Could you check whether llvm-lit correctly collects execution time, compile/link time, LLVM -stats, code size?

proton updated this revision to Diff 151280. Wed, Jun 13, 5:51 PM

Updated input size, used malloc to allocate memory for the array.

Could you check whether llvm-lit correctly collects execution time, compile/link time, LLVM -stats, code size?

I don't know how to check LLVM -stats using lit.
Sizes match the output of llvm-size; compile time and link time are also fine.

lit output (ran `lit .` in build/MicroBenchmarks/harris):
compile_time: 1.3595
link_time: 0.0832
exec size: 364288
exec_time: 28254000.0000
Testing Time: 1.22s
.
.
.

For now, I have merged init.cpp and main.cpp. The image initialization here is specialized to visualize the output nicely; we may or may not use this initialization for other image processing kernels.
If a common image initialization source file is introduced in the future (perhaps with multiple kinds of image initialization and some helper functions for image processing), only minor changes to main.cpp will be needed.

Here are some of the stats for this code:

Runtime

With Benchmark library:

Flag      | CPU Time     | Iterations | System measured (using time ./a.out)
O0        | 117259447 ns | 6          | 1.039 s
O3        | 32498956 ns  | 21         | 1.214 s
O3+Polly  | 32562202 ns  | 21         | 1.205 s

Without Benchmark library (input size is changed; see harris.h):

Flag      | System measured time
O0        | 1.241 s
O3        | 0.697 s
O3+Polly  | 0.611 s

Note: Clang is built in debug mode, and I compiled the benchmark using clang++ -O3 -mllvm -polly --std=c++11 harrisKernel.cpp main.cpp -lbenchmark -lpthread -o withPolly.out

lit --vg --vg-leak

It fails... but so do the other benchmarks in the MicroBenchmarks folder, so maybe there is a memory leak problem with the benchmark library.

I am thinking of verifying the output manually, since the output follows a pattern with this checkbox initialization; let me know whether that is a good idea.
I have removed the reference output for now, as it is about 10 MB in size; I will update the diff with the reference output once the checking method is finalized.

dberris added inline comments. Wed, Jun 13, 7:07 PM
MicroBenchmarks/harris/main.cpp
107–120

There are a few comments I have about this code, but let me start with the simple(r) ones:

  • You may want to make the image sizes configurable and use the benchmark API to try it on differently sized images (so that you can see how the algorithm scales with input size).
  • You probably want to ensure that you do something with the image data after the loop(s) to ensure that the allocations aren't optimised away. You can use the benchmark::DoNotOptimize(...) function to do some of that.
  • You might also want to consider measuring throughput (computing the amount of data processed for the time it took per iteration).
  • Before you get into the loop, you should probably run the kernel once on the just-allocated memory, to ensure that you're not just measuring the cost of pulling data through the cache(s). This is probably better to do with smaller images. (A rough sketch combining these points follows below.)
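One possible way to combine these suggestions, sketched against a hypothetical harrisKernel(height, width, img, out) signature; the sizes, ranges, and throughput definition are illustrative, not prescriptive:

```
#include <benchmark/benchmark.h>

#include <cstdint>
#include <cstdlib>

// Assumed kernel signature for the sketch; the real one is in harris_kernel.cpp.
void harrisKernel(int height, int width, float *img, float *out);

static void BM_Harris(benchmark::State &state) {
  const int n = static_cast<int>(state.range(0));   // configurable image size
  float *img = (float *)malloc(sizeof(float) * n * n);
  float *out = (float *)malloc(sizeof(float) * n * n);
  // ... initialize img ...

  harrisKernel(n, n, img, out);                     // warm-up run outside the timed loop

  for (auto _ : state) {
    harrisKernel(n, n, img, out);
    benchmark::DoNotOptimize(out);                  // keep the result (and allocation) alive
  }

  // Report throughput as bytes of input pixels processed.
  state.SetBytesProcessed(int64_t(state.iterations()) * n * n * sizeof(float));

  free(img);
  free(out);
}
// Run the kernel on several image sizes to see how it scales.
BENCHMARK(BM_Harris)->RangeMultiplier(2)->Range(256, 2048);

BENCHMARK_MAIN();
```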

Using Polly's -debug-only=polly-scops output showed that the kernel is in fact not optimized:

Invalidate SCoP because of reason 0

NOTE: Run time checks for %for.cond11.preheader---%for.cond.cleanup532 could not be created as the number of parameters involved is too high. The SCoP will be dismissed.
Use:
        --polly-rtc-max-parameters=X
to adjust the maximal number of parameters but be advised that the compile time might increase exponentially.

Bailing-out because could not build alias checks

Using -polly-rtc-max-parameters=999 does not help. I remember that Polly prints that message whenever it cannot build the alias checks, even for reasons other than the one stated.