[test-suite][RFC] Using Google Benchmark Library on Harris Kernel
ClosedPublic

Authored by proton on Jun 2 2018, 5:06 AM.

Details

Summary

Hi,
I have used google benchmark library on Harris corner detection kernel (from polymage benchmarks). I want to know if the way I have used the google benchmark library here is correct or there can be a better way to do this.

Diff Detail

Repository
rL LLVM
proton created this revision.Jun 2 2018, 5:06 AM

Did you consider putting the driver code (initialization, malloc, free, checking result), Into a different file than the kernel code? This would allow measuring it (code size, LLVM statistics, compile time, etc.) independently from the boilerplate code.

MicroBenchmarks/harris/harris.cpp
177–197 ↗(On Diff #149608)

Maybe the malloc/free calls should be taken out of the measured kernel.

201–208 ↗(On Diff #149608)

Could you rewrite this to use multidimensional access subscripts? E.g. img[_i0-1][_i1-1].

MicroBenchmarks/harris/sha1.hpp
1–45 ↗(On Diff #149608)

There is already hashing used by test-suite (see HashProgramOutput.sh), why add another one?

proton updated this revision to Diff 150032.Jun 5 2018, 1:16 PM
proton updated this revision to Diff 150815.Jun 11 2018, 12:40 PM
proton edited the summary of this revision. (Show Details)

Looks great. Did you do a performance comparison with/without Polly?

[suggestion] Even though Google Benchmark by default does not run kernels in multiple threads, it might be a good idea to prepare for it. That is, no global shared img array.

[comment] Is there are a reason why the init.cpp and main.cpp are are separate files?

MicroBenchmarks/harris/harris.h
33 ↗(On Diff #150815)

[style] DUMP_IMAGE is named like a macro, but is a function.

MicroBenchmarks/harris/harris_kernel.cpp
118 ↗(On Diff #150815)

[style] return before the end of the function seem unnecessary.

proton updated this revision to Diff 151105.Jun 13 2018, 12:11 AM
proton added a comment.EditedJun 13 2018, 12:22 AM

Looks great. Did you do a performance comparison with/without Polly?

Polly + O3 and only O3 are taking the same time. It seems like before code reaches Polly, It is already heavily optimized at O3 and Polly cannot find any further optimization possible on it even though it is in its SCoP.

[suggestion] Even though Google Benchmark by default does not run kernels in multiple threads, it might be a good idea to prepare for it. That is, no global shared img array.

Updated.
I still have to keep an extra global array (other than the image) to copy the final output to. I couldn't modify the original image array as it will affect the input of other threads.

[comment] Is there are a reason why the init.cpp and main.cpp are separate files?

init.cpp have image initialization and print function which can be used on other image processing kernels as well, So I kept it in a separate file.

Looks great. Did you do a performance comparison with/without Polly?

Polly + O3 and only O3 are taking the same time. It seems like before code reaches Polly, It is already heavily optimized at O3 and Polly cannot find any further optimization possible on it even though it is in its SCoP.

How long is a single run with O3?

[suggestion] Even though Google Benchmark by default does not run kernels in multiple threads, it might be a good idea to prepare for it. That is, no global shared img array.

Updated.
I still have to keep an extra global array (other than the image) to copy the final output to. I couldn't modify the original image array as it will affect the input of other threads.

It still writes to the target array in parallel without locking.

I suggest to call harrisKernel once more only for the correctness check,.

[comment] Is there are a reason why the init.cpp and main.cpp are separate files?

`init.cpp` have image initialization and print function which can be used on other image processing kernels as well, So I kept it in a separate file.

If it is supposed to be a shared resource, it shouldn't be in the harris directory.

Could you check whether llvm-lit correctly collects execution time,compile/link time, LLVM -stats, code size?

proton updated this revision to Diff 151280.Jun 13 2018, 5:51 PM

Updated input size, used malloc to allocate memory for the array.

Could you check whether llvm-lit correctly collects execution time, compile/link time, LLVM -stats, code size?

I don't know how to check LLVM -stats using lit.
Sizes matches the output of llvm-size, compile time and link time are also fine.

lit Output: used "lit ." in build/Microbenchmark/harris
compile_time: 1.3595
link_time: 0.0832
exec size: 364288
exec_time: 28254000.0000
Testing Time: 1.22s
.
.
.

For now, I have merged the init.cpp and main.cpp. The image initialization here is special to visualize the output nicely. We may/may not use this initialization for other image processing kernels.
If at all there is a common image initialization source code introduced in future (maybe with multiple type of image init and some helper function that helps in image processing), there will be only minute changes to main.cpp file.

Here are some of the stats for this code:

RunTime

With Benchmark library

FlagCPU TimeIterationSystem measured (using time ./a.out)
O0117259447ns61.039s
O332498956ns211.214s
O3+Polly32562202ns211.205s

Without Benchmark library (input size is changed - refer harris.h)

FlagSystem measured time
O01.241s
O30.697s
O3+Polly0.611s

Note: Clang is built in debug mode and I have compiled benchmark using clang++ -O3 -mllvm -polly --std=c++11 harrisKernel.cpp main.cpp -lbenchmark -lpthread -o withPolly.out

lit --vg --vg-leak

It Fails... but so does other benchmarks in microbenchmark folder so maybe there is memory leak problem with benchmark library

I am thinking of manually verifying output as there is a pattern to output with this checkbox initialization, let me know if it is a good idea or not.
I have removed the reference output for now as it is of size 10MB, I will update the diff with reference output once the checking method is finalized.

dberris added inline comments.Jun 13 2018, 7:07 PM
MicroBenchmarks/harris/main.cpp
107–120 ↗(On Diff #151280)

There's a few comments I have about this code, but let me start with the simple(r) ones:

  • You may want to make the image sizes configurable, and using the benchmark API to try it on differently sized images (so that you can see how the algorithm scales based on input sizes).
  • You probably want to ensure that you do something with the image data after the loop(s) to ensure that the allocations aren't optimised away. You can use the benchmark::DoNotOptimize(...) function to do some of that.
  • You might also want to consider measuring throughput (computing the amount of data processed for the time it took per iteration).
  • Before you get into the loop, you should probably run the kernel once on the just-allocated memory, to ensure that you're not just measuring the cost of pulling data through the cache(s). This is probably better to do with smaller images.

Using Polly's -debug-only=polly-scops output showed that the kernel is in fact not optimized:

Invalidate SCoP because of reason 0

NOTE: Run time checks for %for.cond11.preheader---%for.cond.cleanup532 could not be created as the number of parameters involved is too high. The SCoP will be dismissed.
Use:
        --polly-rtc-max-parameters=X
to adjust the maximal number of parameters but be advised that the compile time might increase exponentially.

Bailing-out because could not build alias checks

Using -polly-rtc-max-parameters=999 does not help. I remember that Polly prints that message whenever it cannot build the alias checks, even for other reasons than stated.

proton updated this revision to Diff 151811.Jun 18 2018, 4:11 PM
proton marked an inline comment as done.

Runtime on this differential

BenchmarkTimeCPUIterations
Polly288252875725
Polly289642891024
-O3132161319154
-O3130721304954
-O098790987167
-O099720996527

Modified such that polly is now able to make some changes in the kernel (no runtime checks problem).

Did you check whether Polly recognizes the call to exp as part of the SCoP?

dberris added inline comments.Jun 19 2018, 10:29 PM
MicroBenchmarks/harris/main.cpp
179 ↗(On Diff #151811)

It seems that HEIGHT and WIDTH are input values anyway, consider making multiple input sizes to see how the kernel performs as you scale the image size goes up.

You might also not need the __restrict__ attributes for the malloc-provided heap memory either. This means you could do:

float **image = reinterpret_cast<float**>(malloc(sizeof(float) * (2 + state.range(0)) * (2 + state.range(1))));

When you register the benchmark, you can then provide the image sizes to test with:

BENCHMARK(HarrisBenchmark)
    ->Unit(benchmark::kMicrosecond)
    ->Args({256, 256})
    ->Args({512, 512})
    ->Args({1024, 1024})
    ->Args({2048, 2048});

You can see more options at https://github.com/google/benchmark#passing-arguments.

Another thing you may consider measuring as I suggested in the past is throughput. To do that, you can call state.SetBytesProcessed(...) in the benchmark body, typically at the end just before exiting -- you want to essentially report something like:

state.SetBytesProcessed(sizeof(float) * (state.range(0) + 2) * (state.range(1) + 2) * state.iterations());

This will add a "MB/sec" output alongside the time it took for each iteration of the benchmark.

proton updated this revision to Diff 152211.Jun 20 2018, 6:35 PM
proton marked an inline comment as done.

Thanks for making some of the changes. I'm still not clear on a couple of things.

Do you mind sharing some of the results with the new benchmark runs, with the different image sizes? Do we actually get the throughput numbers in there as well?

MicroBenchmarks/harris/main.cpp
111–119 ↗(On Diff #152211)

Why are these still using HEIGHT and WIDTH? Why aren't these just:

const size_t height = state.range(0);
const size_t weight = state.range(1);

float **image = reinterpret_cast<float**>(malloc(sizeof(float) * (2 + height) * (2 + width)));
float **imageOutput = reinterpret_cast<float**>(malloc(sizeof(float) * (2 + height) * (2 + width)));
185 ↗(On Diff #152211)

Can you re-format this? Preferably with clang-format if possible, so that it's easier to read.

proton updated this revision to Diff 152363.Jun 21 2018, 1:14 PM

Formatted using clang format

Results

With Polly:

BENCHMARKTimeCPUIterationThroughput
BENCHMARK_HARRIS/256/256769 us768 us911330.508MB/s
BENCHMARK_HARRIS/512/5124001 us3996 us177252.207MB/s
BENCHMARK_HARRIS/1024/102425690 us25650 us28156.553MB/s
BENCHMARK_HARRIS/2048/2048118023 us117830 us6136.054MB/s

Without Polly:

BENCHMARKTimeCPUIterationThroughput
BENCHMARK_HARRIS/256/256626 us625 us1135406.15MB/s
BENCHMARK_HARRIS/512/5123074 us3068 us229328.51MB/s
BENCHMARK_HARRIS/1024/102417121 us17086 us42235.022MB/s
BENCHMARK_HARRIS/2048/204864207 us64077 us11250.189MB/s

Thanks for making some of the changes. I'm still not clear on a couple of things.

Do you mind sharing some of the results with the new benchmark runs, with the different image sizes? Do we actually get the throughput numbers in there as well?

I replied to your previous comment but for some reason, it is showing only after your inlined comment not here.

I cannot use a pointer to pointer of an array (float **) here as the compiler may think that some pointers may overlap and prevents Polly from detecting SCoPs here.

I have to allocate the fixed size arrays here as "float (&outputImg)[2+height][2+width] = *reinterpret_cast<float (*)[2+height][2+width]>((float *) malloc(...)); " is not allowed by clang++

Also, Are the Number of bytes processed is calculated w.r.t to the size of output or the total number of bytes accessed in the kernel?

MicroBenchmarks/harris/main.cpp
179 ↗(On Diff #151811)

Cannot use float **image as pointers may overlap and this prevents Polly from detecting scops.

I have to allocate the fixed size arrays here as "float (&outputImg)[2+height][2+width] = *reinterpret_cast<float (*)[2+height][2+width]>((float *) malloc(...)); " is not allowed by clang++

I did considered adding SetBytesProcessed but I was not sure how many bytes should be written as argument (output image size or the total bytes accessed in kernel) so I commented the line "SetBytesProcessed(static_cast<int64_t>(state.iterations())*WIDTH*HEIGHT*50);" but forgot to ask about it.

dberris accepted this revision.Jun 21 2018, 4:37 PM

Thanks for making some of the changes. I'm still not clear on a couple of things.

Do you mind sharing some of the results with the new benchmark runs, with the different image sizes? Do we actually get the throughput numbers in there as well?

I replied to your previous comment but for some reason, it is showing only after your inlined comment not here.

I cannot use a pointer to pointer of an array (float **) here as the compiler may think that some pointers may overlap and prevents Polly from detecting SCoPs here.

I'm not sure that's entirely true -- the pointer is coming from malloc, so they're meant to not overlap. At least LLVM should be able to detect that.

I have to allocate the fixed size arrays here as "float (&outputImg)[2+height][2+width] = *reinterpret_cast<float (*)[2+height][2+width]>((float *) malloc(...)); " is not allowed by clang++

It's not allowed because height and width are not constant expressions. Note that even in function arguments, the "array" form will be treated as pointers anyway so it shouldn't make any difference if you change the API to:

void harrisKernel(
    int height, int width, float **inputImg,
    float **outputImg, float **Ix,
    float **Iy, float **Ixx,
    float **Ixy, float **Iyy,
    float **Sxx, float **Sxy,
    float **Syy, float **det,
    float **trace)

If anything, you want to mark those pointers that the compiler is supposed to treat as non-aliasing as restrict or __restrict__.

Also, Are the Number of bytes processed is calculated w.r.t to the size of output or the total number of bytes accessed in the kernel?

The bytes processed is what you say it is -- as I suggested, it is based on the input size (size of the input image). If you want to measure something else, you're going to have to provide what that throughput is.

The beauty of having a benchmark suite is that you can test out these various approaches in the same benchmark. You can explore alternative strategies like:

  • Turn the harris kernel implementation into a template, so you can use C++ features and take const std::array<std::array<float, W>, H>&, and let the compiler deduce the sizes.
  • Have a version of the kernel with pointer arguments with __restrict__ and one without.
  • Instead of taking output pointers, consider returning a struct/tuple with std::unique_ptr<float[][]> members.
  • Use C++ array new and array delete (say new float*[(height *2) * (width * 2)]).

Anyway, I'm fine with the state of it currently, but would like to see more work done if not now but in the future. This will be really helpful for people working on the compiler trying to see whether performance/throughput can be improved (or regress) with changes to the compiler. I'm sure the folks working on Polly would like to be able to diagnose why the Polly version is slower than the normal build.

Please wait for someone else to LGTM/Accept before landing. I'm sure we can spend a lot more time trying to make these benchmarks better, but in the meantime having *something* in there is better than not having one in there.

MicroBenchmarks/harris/main.cpp
179 ↗(On Diff #151811)

I don't know whether you want to optimise for Polly or make Polly just recognise these pointers shouldn't overlap. If Polly can't detect that these pointers are coming from different 'malloc' calls, then I suspect that's a bug in Polly rather than something you need to work around in the benchmark.

Note that maybe the better thing to do is to change the kernel's API to put restrict or __restrict__ on the pointers, so that the optimiser in those cases might be able to assume that the pointers don't alias and don't do anything special in this benchmark.

See my top-level comment for alternatives to explore, if you're open to it.

This revision is now accepted and ready to land.Jun 21 2018, 4:37 PM
Meinersbur accepted this revision.Jun 23 2018, 10:43 PM

Please wait for someone else to LGTM/Accept before landing. I'm sure we can spend a lot more time trying to make these benchmarks better, but in the meantime having *something* in there is better than not having one in there.

I agree with @dberris. If landed we can look into addition additional such benchmarks and base patches to common structures.

homerdin added inline comments.Jun 25 2018, 11:43 AM
MicroBenchmarks/harris/main.cpp
87 ↗(On Diff #152363)

This will write output.txt into whichever directory lit is run in. You can set the working directory that the test will run in by passing an argument to llvm_test_run() > llvm_test_run(WORKDIR ${CMAKE_CURRENT_BINARY_DIR})

proton updated this revision to Diff 152839.Jun 25 2018, 10:50 PM
proton marked an inline comment as done.

added LICENSE and fixed work directory issue.

This revision was automatically updated to reflect the committed changes.